
I'd like to be able to end a Google speech-to-text stream (created with streamingRecognize), and get back the pending SR (speech recognition) results.

In a nutshell, the relevant Node.js code:

// create SR stream
const stream = speechClient.streamingRecognize(request);

// observe data event
const dataPromise = new Promise(resolve => stream.on('data', resolve));

// observe error event
const errorPromise = new Promise((resolve, reject) => stream.on('error', reject));

// observe finish event
const finishPromise = new Promise(resolve => stream.on('finish', resolve));

// send the audio
stream.write(audioChunk);

// for testing purposes only, give the SR stream 2 seconds to absorb the audio
await new Promise(resolve => setTimeout(resolve, 2000));

// end the SR stream gracefully, by observing the completion callback
const endPromise = util.promisify(callback => stream.end(callback))();

// a 5-second test timeout
const timeoutPromise = new Promise(resolve => setTimeout(resolve, 5000)); 

// finishPromise wins the race here
await Promise.race([
  dataPromise, errorPromise, finishPromise, endPromise, timeoutPromise]);

// endPromise wins the race here
await Promise.race([
  dataPromise, errorPromise, endPromise, timeoutPromise]);

// timeoutPromise wins the race here
await Promise.race([dataPromise, errorPromise, timeoutPromise]);

// I don't see any data or error events, dataPromise and errorPromise don't get settled

What I experience is that the SR stream ends successfully, but I don't get any data events or error events. Neither dataPromise nor errorPromise gets resolved or rejected.

How can I signal the end of my audio, close the SR stream and still get the pending SR results?

I need to stick with streamingRecognize API because the audio I'm streaming is real-time, even though it may stop suddenly.

To clarify: it works as long as I keep streaming the audio; I do receive the real-time SR results. However, when I send the final audio chunk and end the stream as above, I don't get the final results I'd otherwise expect.

To get the final results, I actually have to keep streaming silence for several more seconds, which may increase the speech-to-text bill. I feel like there must be a better way to get them.

Updated: so it appears the only proper time to end a streamingRecognize stream is upon a data event where StreamingRecognitionResult.is_final is true. It also appears we're expected to keep streaming audio until the data event fires, to get any results at all, final or interim.
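That approach can be sketched as a helper which only calls stream.end() once a final result has been observed. endOnFinalResult is a hypothetical name; the data shape mirrors what the Node.js client library emits:

```javascript
// Resolve with the final recognition result, ending the stream only after
// a data event with `isFinal: true` has been seen.
function endOnFinalResult(stream) {
  return new Promise((resolve, reject) => {
    stream.on('data', data => {
      const result = data.results[0];
      if (result && result.isFinal) {
        stream.end();   // only now is it safe to end the stream
        resolve(result);
      }
    });
    stream.on('error', reject);
  });
}
```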

This looks like a bug to me; filing an issue.

Updated: it now seems to have been confirmed as a bug. Until it's fixed, I'm looking for a potential workaround.

Updated: for future reference, here is the list of the current and previously tracked issues involving streamingRecognize.

I'd expect this to be a common problem for those who use streamingRecognize; I'm surprised it hasn't been reported before. I'm submitting it as a bug to issuetracker.google.com as well.


3 Answers


My bad — unsurprisingly, this turned out to be an obscure race condition in my code.

I've put together a self-contained sample that works as expected (gist). It helped me track down the issue. Hopefully it may help others and my future self:

// A simple streamingRecognize workflow,
// tested with Node v15.0.1, by @noseratio

import fs from 'fs';
import path from "path";
import url from 'url'; 
import util from "util";
import timers from 'timers/promises';
import speech from '@google-cloud/speech';

export {}

// need a 16-bit, 16KHz raw PCM audio 
const filename = path.join(path.dirname(url.fileURLToPath(import.meta.url)), "sample.raw");
const encoding = 'LINEAR16';
const sampleRateHertz = 16000;
const languageCode = 'en-US';

const request = {
  config: {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
  },
  interimResults: false // If you want interim results, set this to true
};

// init SpeechClient
const client = new speech.v1p1beta1.SpeechClient();
await client.initialize();

// Stream the audio to the Google Cloud Speech API
const stream = client.streamingRecognize(request);

// log all data
stream.on('data', data => {
  const result = data.results[0];
  console.log(`SR results, final: ${result.isFinal}, text: ${result.alternatives[0].transcript}`);
});

// log all errors
stream.on('error', error => {
  console.warn(`SR error: ${error.message}`);
});

// observe data event
const dataPromise = new Promise(resolve => stream.once('data', resolve));

// observe error event
const errorPromise = new Promise((resolve, reject) => stream.once('error', reject));

// observe finish event
const finishPromise = new Promise(resolve => stream.once('finish', resolve));

// observe close event
const closePromise = new Promise(resolve => stream.once('close', resolve));

// we could just pipe it: 
// fs.createReadStream(filename).pipe(stream);
// but we want to simulate the web socket data

// read RAW audio as Buffer
const data = await fs.promises.readFile(filename, null);

// simulate multiple audio chunks
console.log("Writing...");
const chunkSize = 4096;
for (let i = 0; i < data.length; i += chunkSize) {
  stream.write(data.slice(i, i + chunkSize));
  await timers.setTimeout(50);
}
console.log("Done writing.");

console.log("Before ending...");
await util.promisify(c => stream.end(c))();
console.log("After ending.");

// race for events
await Promise.race([
  errorPromise.catch(() => console.log("error")), 
  dataPromise.then(() => console.log("data")),
  closePromise.then(() => console.log("close")),
  finishPromise.then(() => console.log("finish"))
]);

console.log("Destroying...");
stream.destroy();
console.log("Final timeout...");
await timers.setTimeout(1000);
console.log("Exiting.");

The output:

Writing...
Done writing.
Before ending...
SR results, final: true, text:  this is a test I'm testing voice recognition This Is the End
After ending.
data
finish
Destroying...
Final timeout...
close
Exiting.

To test it, a 16-bit/16KHz raw PCM audio file is required. An arbitrary WAV file wouldn't work as is because it contains a header with metadata.
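To illustrate the header issue: a canonical PCM WAV file starts with a 44-byte RIFF header before the raw samples. A minimal sketch of stripping it (wavToRawPcm is a made-up name; real-world WAV files may carry extra chunks, so a robust converter would walk the chunk list instead of hard-coding the offset):

```javascript
// Convert a canonical PCM WAV buffer to raw PCM by skipping its header.
function wavToRawPcm(wavBuffer) {
  if (wavBuffer.toString('ascii', 0, 4) !== 'RIFF' ||
      wavBuffer.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('Not a RIFF/WAVE file');
  }
  return wavBuffer.subarray(44); // skip the canonical 44-byte header
}
```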


This: "I'm looking for a potential workaround." - have you considered extending SpeechClient as a base class? I don't have credentials to test, but you can extend SpeechClient with your own class and then call the internal close() method as needed. The close() method shuts down the SpeechClient and resolves the outstanding Promise.

Alternatively you could also Proxy the SpeechClient() and intercept/respond as needed. But since your intent is to shut it down, the below option might be your workaround.

const speech = require('@google-cloud/speech');

class ClientProxy extends speech.SpeechClient {
  constructor() {
    super();
  }
  myCustomFunction() {
    this.close();
  }
}

const clientProxy = new ClientProxy();
try {
  clientProxy.myCustomFunction();
} catch (err) {
  console.log("myCustomFunction generated error: ", err);
}
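For completeness, the Proxy route could look roughly like this. makeClientProxy and onClose are made-up names, and the target would be a real SpeechClient in practice:

```javascript
// Wrap a SpeechClient-like object and intercept `close` to run extra logic
// before shutting the client down; all other properties pass through.
function makeClientProxy(client, onClose) {
  return new Proxy(client, {
    get(target, prop, receiver) {
      if (prop === 'close') {
        return (...args) => {
          if (onClose) onClose();     // custom hook before shutdown
          return target.close(...args);
        };
      }
      return Reflect.get(target, prop, receiver);
    }
  });
}
```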
  • Currently I only call `close` on `SpeechClient` when I'm done with the whole session. It might be a bit expensive (resource-wise) to create/close an instance of `SpeechClient` every time I sense voice. But let me see if it solves the original problem, i.e., whether or not I'll get final `data` event on the stream if I call `SpeechClient.close()`... – noseratio Nov 02 '20 at 02:05
  • Right, then you may need to Proxy `streamRecognize` – Randy Casburn Nov 02 '20 at 02:07
  • *Right, then you may need to Proxy streamRecognize* - I've already tried that to no avail, including working directly with `SpeechClient._streamRecognize` (a thin gRPC layer API which `streamRecognize` itself wraps). I'm surprised no one has run into this before. – noseratio Nov 02 '20 at 02:13
  • If that doesn't do it for you, you also have access to the `destroy()` method of `_streamRecognize` through the same hierarchy. The custom function should be able to call `this.streamRecognize.destroy()`. Have you tried that? – Randy Casburn Nov 02 '20 at 02:16
  • Yep, I used `destroy` originally and then replaced it with [`stream.end`](https://nodejs.org/api/stream.html#stream_writable_end_chunk_encoding_callback) while investigating this problem, which is the right way to end a writable stream gracefully. Still no data, but at least I see the `finish` event. – noseratio Nov 02 '20 at 02:32
  • So I take subclassing and calling `close()`didn't work? – Randy Casburn Nov 02 '20 at 02:34
  • I've just finished verifying that: no, it didn't. As soon as I call `SpeechClient.close()`, the stream is terminated with the error "UNAVAILABLE", code: 14. Which was kinda expected, but worth the shot anyway - thanks. – noseratio Nov 02 '20 at 02:56

Since it's a bug, I don't know if this is suitable for you, but I have used `this.recognizeStream.end();` several times in different situations and it worked. However, my code was a bit different...

This thread may be something for you: https://groups.google.com/g/cloud-speech-discuss/c/lPaTGmEcZQk/m/Kl4fbHK2BQAJ

  • Following your link, [this is](https://groups.google.com/g/cloud-speech-discuss/c/lPaTGmEcZQk/m/aYXOJbWlBAAJ) pretty much what I do in my question's code, using `v1beta1` as well. Sadly, I don't get any `data` events, only `finish`. I'm going to create a minimal reproducible standalone app. – noseratio Nov 02 '20 at 11:02
  • Yeah, you are right. This is indeed pretty annoying. Sadly, I am not really familiar with Node.js, but I found a site where there's a Java equivalent, so I understand it, too... https://cloud.google.com/translate/media/docs/streaming#media_translation_translate_from_file-nodejs Look at the part where dealing with END_OF_SINGLE_UTTERANCE is explained. This may help, but again, the Node.js is not really clear to me. – Sven Eschlbeck Nov 02 '20 at 12:49
  • Eventually, your answer has helped me to find a bug in my code. I'll let it hang for a little longer but the bounty is yours, thanks! – noseratio Nov 03 '20 at 02:34
  • I'm glad that I could help...tough problem though :) – Sven Eschlbeck Nov 03 '20 at 14:03