
I have a React-based app with an input for which I want to allow voice input as well. I'm okay with making this compatible with Chrome and Firefox only, so I was thinking of using getUserMedia. I know I'll be using Google Cloud's Speech-to-Text API. However, I have a few caveats:

  1. I want this to stream my audio data live, not just when I'm done recording. This means that a lot of solutions I've found won't work very well, because it's not sufficient to save the file and then send it out to Google Cloud Speech.
  2. I don't trust my front end with my Google Cloud API credentials. Instead, I already have a service running on the back end which has my credentials, and I want to stream the audio (live) to that back end, have the back end stream it to Google Cloud, and then emit transcript updates back to the front end as they come in.
  3. I already connect to that back end service using socket.io, and I want to manage this entirely via sockets, without having to use Binary.js or anything similar.

Nowhere seems to have a good tutorial on how to do this. What do I do?

Amber B.

1 Answer


First, credit where credit is due: a huge amount of my solution here was created by referencing vin-ni's Google-Cloud-Speech-Node-Socket-Playground project. I had to adapt this some for my React app, however, so I'm sharing a few of the changes I made.

My solution here was composed of four parts, two on the front end and two on the back end.

My front end solution was of two parts:

  1. A utility file to access my microphone, stream audio to the back end, retrieve data from the back end, run a callback function each time that data was received from the back end, and then clean up after itself either when done streaming or when the back end threw an error.
  2. A microphone component which wrapped my React functionality.

My back end solution was of two parts:

  1. A utility file to handle the actual speech recognition stream
  2. My main.js file

(These don't need to be separated by any means; our main.js file is just already a behemoth without it.)

Most of my code will just be excerpted, but my utilities will be shown in full because I had a lot of problems with all of the stages involved. My front end utility file looked like this:

// Stream Audio
let bufferSize = 2048,
    AudioContext,
    context,
    processor,
    input,
    globalStream;

//audioStream constraints
const constraints = {
    audio: true,
    video: false
};

let AudioStreamer = {
    /**
     * @param {function} onData Callback to run on data each time it's received
     * @param {function} onError Callback to run on an error if one is emitted.
     */
    initRecording: function(onData, onError) {
        socket.emit('startGoogleCloudStream', {
            config: {
                encoding: 'LINEAR16',
                sampleRateHertz: 16000,
                languageCode: 'en-US',
                profanityFilter: false,
                enableWordTimeOffsets: true
            },
            interimResults: true // If you want interim results, set this to true
        }); //init socket Google Speech Connection
        AudioContext = window.AudioContext || window.webkitAudioContext;
        context = new AudioContext();
        processor = context.createScriptProcessor(bufferSize, 1, 1);
        processor.connect(context.destination);
        context.resume();

        var handleSuccess = function (stream) {
            globalStream = stream;
            input = context.createMediaStreamSource(stream);
            input.connect(processor);

            processor.onaudioprocess = function (e) {
                microphoneProcess(e);
            };
        };

        navigator.mediaDevices.getUserMedia(constraints)
            .then(handleSuccess);

        // Bind the data handler callback
        if(onData) {
            socket.on('speechData', (data) => {
                onData(data);
            });
        }

        socket.on('googleCloudStreamError', (error) => {
            if(onError) {
                onError(error); // Pass the actual error through to the caller's handler
            }
            // We don't want to emit another end stream event
            closeAll();
        });
    },

    stopRecording: function() {
        socket.emit('endGoogleCloudStream', '');
        closeAll();
    }
}

export default AudioStreamer;

// Helper functions
/**
 * Processes microphone data into a data stream
 * 
 * @param {object} e Input from the microphone
 */
function microphoneProcess(e) {
    var left = e.inputBuffer.getChannelData(0);
    var left16 = convertFloat32ToInt16(left);
    socket.emit('binaryAudioData', left16);
}

/**
 * Converts a buffer from Float32 to Int16 and downsamples it by a factor of 3
 * (so a 48000 Hz AudioContext ends up at the 16000 Hz sampleRateHertz declared
 * in the config). Necessary for streaming.
 * 
 * @param {object} buffer Buffer being converted
 */
function convertFloat32ToInt16(buffer) {
    let l = buffer.length;
    let buf = new Int16Array(l / 3);

    while (l--) {
        if (l % 3 === 0) {
            buf[l / 3] = buffer[l] * 0x7FFF; // Scale [-1, 1] floats into the 16-bit range
        }
    }
    return buf.buffer;
}

/**
 * Stops recording and closes everything down. Runs on error or on stop.
 */
function closeAll() {
    // Clear the listeners (prevents issue if opening and closing repeatedly)
    socket.off('speechData');
    socket.off('googleCloudStreamError');

    let tracks = globalStream ? globalStream.getTracks() : null;
    let track = tracks ? tracks[0] : null;
    if(track) {
        track.stop();
    }

    if(processor) {
        if(input) {
            try {
                input.disconnect(processor);
            } catch(error) {
                console.warn('Attempt to disconnect input failed.');
            }
        }
        processor.disconnect(context.destination);
    }
    if(context) {
        context.close().then(function () {
            input = null;
            processor = null;
            context = null;
            AudioContext = null;
        });
    }
}

The main salient point of this code (aside from the getUserMedia configuration, which was in and of itself a bit dicey) is that the processor's onaudioprocess callback emitted binaryAudioData events to the socket with the audio after converting it to Int16. My main changes from my linked reference above were to replace all of the functionality that directly updated the DOM with callback functions (used by my React component) and to add some error handling that wasn't included in the source.

I was then able to access this in my React Component by just using:

onStart() {
    this.setState({
        recording: true
    });
    if(this.props.onStart) {
        this.props.onStart();
    }
    speechToTextUtils.initRecording((data) => {
        if(this.props.onUpdate) {
            this.props.onUpdate(data);
        }   
    }, (error) => {
        console.error('Error when recording', error);
        this.setState({recording: false});
        // No further action needed, as this already closes itself on error
    });
}

onStop() {
    this.setState({recording: false});
    speechToTextUtils.stopRecording();
    if(this.props.onStop) {
        this.props.onStop();
    }
}

(I passed in my actual data handler as a prop to this component).
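For clarity on what that data handler actually receives: the onUpdate prop gets the raw streamingRecognize response that the back end forwards via the speechData event, so the transcript text still has to be pulled out of it. Here is a minimal sketch of what a parent-level handler could look like; the handleTranscript name and the interim/final state fields are my own illustration, not part of the component above:

// Hypothetical handler passed down as the onUpdate prop.
// The response shape (results / alternatives / transcript / isFinal) is the
// same one the back end code below checks before restarting the stream.
handleTranscript(data) {
    const result = data.results && data.results[0];
    if (!result || !result.alternatives || !result.alternatives[0]) {
        return;
    }
    const transcript = result.alternatives[0].transcript;
    if (result.isFinal) {
        // Append completed utterances to the stored transcript
        this.setState(prev => ({
            finalTranscript: prev.finalTranscript + ' ' + transcript,
            interimTranscript: ''
        }));
    } else {
        // Show interim results live while the user is still speaking
        this.setState({ interimTranscript: transcript });
    }
}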

Then on the back end, my service handled three main events in main.js:

// Start the stream
socket.on('startGoogleCloudStream', function(request) {
    speechToTextUtils.startRecognitionStream(socket, GCSServiceAccount, request);
});

// Receive audio data
socket.on('binaryAudioData', function(data) {
    speechToTextUtils.receiveData(data);
});

// End the audio stream
socket.on('endGoogleCloudStream', function() {
    speechToTextUtils.stopRecognitionStream();
});

My speechToTextUtils then looked like:

// Google Cloud
const speech = require('@google-cloud/speech');
let speechClient = null;

let recognizeStream = null;

module.exports = {
    /**
     * @param {object} client A socket client on which to emit events
     * @param {object} GCSServiceAccount The credentials for our google cloud API access
     * @param {object} request A request object of the form expected by streamingRecognize. Variable keys and setup.
     */
    startRecognitionStream: function (client, GCSServiceAccount, request) {
        if(!speechClient) {
            speechClient = new speech.SpeechClient({
                projectId: 'Insert your project ID here',
                credentials: GCSServiceAccount
            }); // Creates a client
        }
        recognizeStream = speechClient.streamingRecognize(request)
            .on('error', (err) => {
                console.error('Error when processing audio: ' + (err && err.code ? 'Code: ' + err.code + ' ' : '') + (err && err.details ? err.details : ''));
                client.emit('googleCloudStreamError', err);
                this.stopRecognitionStream();
            })
            .on('data', (data) => {
                client.emit('speechData', data);

                // if end of utterance, let's restart stream
                // this is a small hack. After 65 seconds of silence, the stream will still throw an error for speech length limit
                if (data.results[0] && data.results[0].isFinal) {
                    this.stopRecognitionStream();
                    this.startRecognitionStream(client, GCSServiceAccount, request);
                    // console.log('restarted stream serverside');
                }
            });
    },
    /**
     * Closes the recognize stream and wipes it
     */
    stopRecognitionStream: function () {
        if (recognizeStream) {
            recognizeStream.end();
        }
        recognizeStream = null;
    },
    /**
     * Receives streaming data and writes it to the recognizeStream for transcription
     * 
     * @param {Buffer} data A section of audio data
     */
    receiveData: function (data) {
        if (recognizeStream) {
            recognizeStream.write(data);
        }
    }
};

(Again, you don't strictly need this util file, and you could certainly put the speechClient as a const on top of the file depending on how you get your credentials; this is just how I implemented it.)
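For example, a sketch of that const-at-the-top variant might look like this; the keyFilename path is a placeholder, and the client will also pick up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable if you'd rather not pass anything explicitly:

// Alternative: create the client once at module load instead of lazily.
const speech = require('@google-cloud/speech');

const speechClient = new speech.SpeechClient({
    keyFilename: './config/gcs-service-account.json' // hypothetical path to a service account key
});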

And that, finally, should be enough to get you started on this. I encourage you to do your best to understand this code before you reuse or modify it, as it may not work 'out of the box' for you, but unlike all other sources I have found, this should get you at least started on all involved stages of the project. It is my hope that this answer will prevent others from suffering like I have suffered.

Amber B.
  • @vinni Ha, I hadn't realized you had your own question/answer about this situation on SO already. I might have still posted my answer just for the clarifications about React-based usage, since that did have some subtleties (removing the listeners is important if your component can unmount/remount repeatedly, for example), but I'm surprised it didn't turn up in my googling. Thanks again for your work - I really could not have gotten this kickstarted without it! – Amber B. Jul 05 '18 at 19:42
  • Great! I am puzzled by the fact that Speech Recognition providers like Google Cloud and IBM Watson have demos that work on SAFARI. I've tried to reverse engineer IBM and Google's demo websites without success, as the code is vast. Did your solution work for Safari too? – Josh Jul 12 '18 at 13:11
  • @Josh This is a good question, and I don't have the foggiest idea, actually! It may be some time before I get a chance to test it and see it, but maybe this will help? https://stackoverflow.com/questions/21015847/how-to-make-getusermedia-work-on-all-browsers – Amber B. Jul 12 '18 at 13:38
  • I'm always getting a "Malordered data received. Send exactly one config, followed by audio data." error. Can you please help me? – ravi Oct 15 '18 at 15:26
  • I am also facing one strange issue: when I console.log(speech.SpeechClient), it gives me null. I installed @google-cloud/speech, so I don't know why it's happening. – ravi Oct 15 '18 at 15:56
  • @ravi That's a strange error... I'm afraid I'm not 100% sure what could be causing it. You may have better results posting a new question where you can post your full setup and asking what the problem could be. – Amber B. Oct 15 '18 at 16:55
  • @AmberB. I am trying to implement your code but on the server side I am getting `Maximum call stack size exceeded`. Does Node.js need any special setup? – Sisir May 26 '20 at 06:53
  • @Sisir Usually that's a sign of something running out of memory, often but not always because of an infinite loop. I'm using Node.js for this, and I'm not aware of any special configuration needed beyond what I have here. I'd post your implementation details in a new question; you may have a memory issue or infinite loop somewhere. – Amber B. May 27 '20 at 13:24
  • I recently implemented something like this but the process time for "real-time" is so slow that actually recording a full stream and sending that is much faster. Is that something you've experienced? – FrenchMajesty Mar 27 '21 at 20:29
  • We haven't used this functionality extensively recently, but in all our tests the response time was reasonable. While it's possible something changed on Google's end, my first inclination would be to see if your setup is causing this delay somehow. – Amber B. Mar 30 '21 at 19:53