One way to solve this would be to use SignalR.
You can follow this SO answer to get the microphone input; it also contains a very nice explanation of how to handle microphone input with websockets.
The following is only pseudo code to explain the concept!
It is also greatly simplified; for example, I don't know whether Google's API can handle the fact that you always send it only fragments of speech input. As I said, the code only gives a rough overview of the basic process and has no logic for cases such as the server being offline.
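For context, here is a minimal sketch of how the browser could capture the microphone and hand each audio chunk to a process_microphone_buffer callback, following the Web Audio approach from the linked answer (the buffer size of 4096 is just an example value):

navigator.mediaDevices.getUserMedia({ audio: true })
    .then(function (stream) {
        const audioContext = new AudioContext();
        const source = audioContext.createMediaStreamSource(stream);
        // createScriptProcessor is deprecated but still widely supported;
        // AudioWorklet is the modern alternative
        const processor = audioContext.createScriptProcessor(4096, 1, 1);
        processor.onaudioprocess = process_microphone_buffer;
        source.connect(processor);
        processor.connect(audioContext.destination);
    })
    .catch(function (err) {
        return console.error(err.toString());
    });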
Inside the process_microphone_buffer(event) callback you can then call SignalR, so the function would look something like this:
// build the connection once and reuse it (singleton); don't create a new one per audio chunk
const connection = new signalR.HubConnectionBuilder()
    .withUrl("/speechToTextHub")
    .build();
connection.start().catch(function (err) {
    return console.error(err.toString());
});

function process_microphone_buffer(event) {
    // note: getChannelData returns a Float32Array; convert it if your hub expects byte[]
    const microphone_output_buffer = event.inputBuffer.getChannelData(0);
    connection.invoke("SendMicrophoneBuffer", microphone_output_buffer).catch(function (err) {
        return console.error(err.toString());
    });
}
And on your server you implement a corresponding hub (remember to also map it to the "/speechToTextHub" route when configuring your app, so it matches the URL used by the client):
using Microsoft.AspNetCore.SignalR;
using System.Threading.Tasks;

namespace SignalRChat.Hubs
{
    public class SpeechToTextHub : Hub
    {
        public async Task SendMicrophoneBuffer(byte[] buffer)
        {
            // GoogleApi is a placeholder for whatever speech-to-text client you use
            var googleApi = new GoogleApi();
            var speechToTextResult = await googleApi.GetTextFromSpeechAsync(buffer);

            // send the result back to the calling client
            await Clients.Caller.SendAsync("SpeechToTextResult", speechToTextResult);
        }
    }
}
And on your client you register a handler for the result, something like this:
connection.on("SpeechToTextResult", function (textResult) {
console.log(textResult);
});
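As mentioned above, the sketch has no logic for the server being offline. If you want at least basic resilience, newer versions of the SignalR JavaScript client can be told to reconnect automatically when you build the connection:

const connection = new signalR.HubConnectionBuilder()
    .withUrl("/speechToTextHub")
    .withAutomaticReconnect()
    .build();

// optional: react once the connection has been re-established
connection.onreconnected(function (connectionId) {
    console.log("reconnected with id " + connectionId);
});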
If the answer is too general for Stack Overflow, I can also remove it.
If there are still open questions, I can extend my answer accordingly.