
I want to use SSML markers through the Google Cloud Text-to-Speech API and request the timing of these markers in the audio stream. These timestamps are necessary to provide cues for effects, word/section highlighting, and feedback to the user.

I found this question, which is relevant, although it asks about timestamps for each word rather than for the SSML <mark> tag.

The following API request returns OK, but the response lacks the requested marker data. This uses the Cloud Text-to-Speech API v1.

{
 "voice": {
  "languageCode": "en-US"
 },
 "input": {
  "ssml": "<speak>First, <mark name=\"a\"/> second, <mark name=\"b\"/> third.</speak>"
 },
 "audioConfig": {
  "audioEncoding": "mp3"
 }
} 

Response:

{
 "audioContent":"//NExAAAAANIAAAAABcFAThYGJqMWA..."
}

This only provides the synthesized audio without any contextual information.

Is there an API request that I am overlooking which can expose information about these markers, as is the case with IBM Watson and Amazon Polly?

James
  • Did you find a solution for this? Looks like Google's api doesn't support speech marks. Correct? – Bret Jul 09 '20 at 18:50

1 Answer


Looks like this is supported in Cloud Text-to-Speech API v1beta1: https://cloud.google.com/text-to-speech/docs/reference/rest/v1beta1/text/synthesize#TimepointType

You can use https://texttospeech.googleapis.com/v1beta1/text:synthesize. Set the TimepointType to SSML_MARK in the request; if this is not set, timepoints are not returned.
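As a minimal sketch, assuming the enableTimePointing request field documented on that v1beta1 reference page (an array of TimepointType values), the question's request would become something like:

{
 "voice": {
  "languageCode": "en-US"
 },
 "input": {
  "ssml": "<speak>First, <mark name=\"a\"/> second, <mark name=\"b\"/> third.</speak>"
 },
 "audioConfig": {
  "audioEncoding": "MP3"
 },
 "enableTimePointing": ["SSML_MARK"]
}

The response should then include a timepoints array alongside the audio, with one entry per <mark> (the times below are illustrative, not real output):

{
 "audioContent": "...",
 "timepoints": [
  { "markName": "a", "timeSeconds": 0.325 },
  { "markName": "b", "timeSeconds": 0.854 }
 ]
}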

i_am_momo