I want to stream audio from the web and convert it to text using the Python google-cloud-speech API. I have integrated it into my Django Channels code.

For the frontend, I copied this code directly, and the backend has the code shown below. Now, the problem: I am not getting any exceptions or errors, but I am also not getting any results from the Google API.

What I tried:

  • I put breakpoints inside the for loop of the process function; control never reaches the inside of the loop.

  • I went through the Java code here and tried to understand it. I set up that Java code locally and debugged it. One thing I understood is that in the Java code, the method onWebSocketBinary receives an integer array; from the frontend we send it like this:

      socket.send(Int16Array.from(floatSamples.map(function (n) {return n * MAX_INT;})));
    
  • In Java, they convert it into a ByteString and then send it to Google. In Django, I set breakpoints and noticed that the data arrives as a binary string, so I felt I didn't need to do anything with it. Still, I tried several ways of converting it to an integer array, but that didn't work because Google expects bytes (you can see the commented-out code below).

  • I went through this example code and this from Google, and I am doing the same thing. I don't understand what I am doing wrong here.
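
The point about the frontend's Int16Array can be checked on the Python side. The sketch below is not part of the original code: it assumes the browser runs on a little-endian machine (typed arrays use the platform byte order), that MAX_INT is 32767, and the helper names are hypothetical. It illustrates that a binary WebSocket frame built from an Int16Array is already 16-bit signed PCM, which is what LINEAR16 denotes, so the bytes should not need converting before they are used as audio_content.

```python
import struct

MAX_INT = 2 ** 15 - 1  # assumption: mirrors the frontend's MAX_INT scale factor (32767)


def floats_to_linear16(float_samples):
    """Pack floats in [-1.0, 1.0] as little-endian signed 16-bit PCM bytes."""
    # Clamp to the valid range, scale, and truncate toward zero like Int16Array.from does.
    ints = [int(max(-1.0, min(1.0, s)) * MAX_INT) for s in float_samples]
    return struct.pack("<%dh" % len(ints), *ints)


def linear16_to_ints(bytes_data):
    """Decode little-endian 16-bit PCM bytes back into ints (useful for debugging only)."""
    count = len(bytes_data) // 2
    return list(struct.unpack("<%dh" % count, bytes_data[: count * 2]))


frame = floats_to_linear16([0.0, 0.5, -0.5, 1.0])
print(len(frame))               # 8, i.e. two bytes per sample
print(linear16_to_ints(frame))  # [0, 16383, -16383, 32767]
```

Decoding a received frame this way is only a sanity check on the sample values; the recognizer itself takes the undecoded bytes.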

Django Code:

import json

from channels.generic.websocket import WebsocketConsumer

# Imports the Google Cloud client library
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

# Instantiates a client
client = speech.SpeechClient()
language_code = "en-US"
streaming_config = None


class SpeechToTextConsumer(WebsocketConsumer):
    def connect(self):
        self.accept()

    def disconnect(self, close_code):
        pass

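    def process(self, streaming_recognize_response):
        # The argument is an iterable of StreamingRecognizeResponse messages.
        for response in streaming_recognize_response:
            if not response.results:
                continue
            result = response.results[0]
            if not result.alternatives:
                continue
            # Protobuf messages are not JSON serializable, so send plain fields.
            payload = {"transcript": result.alternatives[0].transcript, "is_final": result.is_final}
            self.send(text_data=json.dumps(payload))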

    def receive(self, text_data=None, bytes_data=None):
        global streaming_config
        if text_data:
            data = json.loads(text_data)
            rate = data["sampleRate"]
            config = types.RecognitionConfig(
                encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
                sample_rate_hertz=rate,
                language_code=language_code,
            )
            streaming_config = types.StreamingRecognitionConfig(
                config=config, interim_results=True, single_utterance=False
            )
            types.StreamingRecognizeRequest(streaming_config=streaming_config)
            self.send(text_data=json.dumps({"message": "processing..."}))
        if bytes_data:
            # bytes_data = bytes_data[math.floor(len(bytes_data) / 2) :]
            # bytes_data = bytes_data.lstrip(b"\x00")
            # bytes_data = int.from_bytes(bytes_data, "little")
            stream = [bytes_data]
            requests = (
                types.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream
            )
            responses = client.streaming_recognize(streaming_config, requests)
            self.process(responses)
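
A plausible reason for the missing results is that receive opens a new streaming_recognize call for every WebSocket frame, each with a request iterator holding a single chunk, whereas Google's streaming samples feed one long-lived request generator from a buffer. The sketch below shows only that buffering pattern, using the standard library. AudioBuffer and consume are hypothetical names, and consume is a stand-in for the thread that would talk to the Speech API.

```python
import queue
import threading


class AudioBuffer:
    """Hypothetical helper: thread-safe buffer between receive() and the recognizer."""

    def __init__(self):
        self._chunks = queue.Queue()

    def put(self, chunk):
        """Store one binary WebSocket frame."""
        self._chunks.put(chunk)

    def close(self):
        """Signal the end of the audio with a None sentinel."""
        self._chunks.put(None)

    def generator(self):
        """Yield chunks until the sentinel arrives; blocks while waiting for audio."""
        while True:
            chunk = self._chunks.get()
            if chunk is None:
                return
            yield chunk


def consume(buffer, sink):
    """Stand-in for the recognizer thread: drains one long-lived stream."""
    for chunk in buffer.generator():
        sink.append(chunk)


buffer = AudioBuffer()
received = []
worker = threading.Thread(target=consume, args=(buffer, received))
worker.start()
for frame in (b"\x01\x00", b"\x02\x00", b"\x03\x00"):
    buffer.put(frame)  # what receive() would do for each bytes_data
buffer.close()         # what disconnect() would do
worker.join()
print(received)        # [b'\x01\x00', b'\x02\x00', b'\x03\x00']
```

In the consumer, consume would be replaced by a function that wraps each chunk in types.StreamingRecognizeRequest(audio_content=chunk), passes that generator to client.streaming_recognize(streaming_config, requests), and hands the responses to process. This is an untested suggestion under those assumptions, not a confirmed fix.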
Shayan Shafiq
Lokesh Sanapalli

1 Answer

I ran into a similar issue while building a virtual assistant, and I think I can offer at least a bit of help. I am in no way an expert, but I did find a way to use Google's speech-to-text engine. I used Python's speech_recognition library (install it with pip install SpeechRecognition) and imported it as sr. From there, you create a recognizer with r = sr.Recognizer() and call r.recognize_google(audio) on the recorded audio. You do not need an account, because the library already ships with a default API key (intended for testing), and it is easy to set up wherever you need it, such as in Django.

Here is a really helpful link to a tutorial on this that I recommend. Here is a link to the documentation. Here is a helpful program that takes an audio file and transcribes it using all of the available speech recognition services. The code is below; you can use whichever service you like. Sphinx runs offline, and Google's API doesn't require signup because the library already includes a default key.

#!/usr/bin/env python3

import speech_recognition as sr

# obtain path to "english.wav" in the same folder as this script
from os import path
AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")
# AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "french.aiff")
# AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "chinese.flac")

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

# recognize speech using Google Speech Recognition
try:
    # for testing purposes, we're just using the default API key
    # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
    # instead of `r.recognize_google(audio)`
    print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))

# recognize speech using Google Cloud Speech
GOOGLE_CLOUD_SPEECH_CREDENTIALS = r"""INSERT THE CONTENTS OF THE GOOGLE CLOUD SPEECH JSON CREDENTIALS FILE HERE"""
try:
    print("Google Cloud Speech thinks you said " + r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS))
except sr.UnknownValueError:
    print("Google Cloud Speech could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Cloud Speech service; {0}".format(e))

# recognize speech using Wit.ai
WIT_AI_KEY = "INSERT WIT.AI API KEY HERE"  # Wit.ai keys are 32-character uppercase alphanumeric strings
try:
    print("Wit.ai thinks you said " + r.recognize_wit(audio, key=WIT_AI_KEY))
except sr.UnknownValueError:
    print("Wit.ai could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Wit.ai service; {0}".format(e))

# recognize speech using Microsoft Azure Speech
AZURE_SPEECH_KEY = "INSERT AZURE SPEECH API KEY HERE"  # Microsoft Speech API keys are 32-character lowercase hexadecimal strings
try:
    print("Microsoft Azure Speech thinks you said " + r.recognize_azure(audio, key=AZURE_SPEECH_KEY))
except sr.UnknownValueError:
    print("Microsoft Azure Speech could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Microsoft Azure Speech service; {0}".format(e))

# recognize speech using Microsoft Bing Voice Recognition
BING_KEY = "INSERT BING API KEY HERE"  # Microsoft Bing Voice Recognition API keys are 32-character lowercase hexadecimal strings
try:
    print("Microsoft Bing Voice Recognition thinks you said " + r.recognize_bing(audio, key=BING_KEY))
except sr.UnknownValueError:
    print("Microsoft Bing Voice Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Microsoft Bing Voice Recognition service; {0}".format(e))

# recognize speech using Houndify
HOUNDIFY_CLIENT_ID = "INSERT HOUNDIFY CLIENT ID HERE"  # Houndify client IDs are Base64-encoded strings
HOUNDIFY_CLIENT_KEY = "INSERT HOUNDIFY CLIENT KEY HERE"  # Houndify client keys are Base64-encoded strings
try:
    print("Houndify thinks you said " + r.recognize_houndify(audio, client_id=HOUNDIFY_CLIENT_ID, client_key=HOUNDIFY_CLIENT_KEY))
except sr.UnknownValueError:
    print("Houndify could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Houndify service; {0}".format(e))

# recognize speech using IBM Speech to Text
IBM_USERNAME = "INSERT IBM SPEECH TO TEXT USERNAME HERE"  # IBM Speech to Text usernames are strings of the form XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
IBM_PASSWORD = "INSERT IBM SPEECH TO TEXT PASSWORD HERE"  # IBM Speech to Text passwords are mixed-case alphanumeric strings
try:
    print("IBM Speech to Text thinks you said " + r.recognize_ibm(audio, username=IBM_USERNAME, password=IBM_PASSWORD))
except sr.UnknownValueError:
    print("IBM Speech to Text could not understand audio")
except sr.RequestError as e:
    print("Could not request results from IBM Speech to Text service; {0}".format(e))
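
The script above reads from an audio file, while the question receives raw 16-bit PCM bytes over a WebSocket. Here is a minimal sketch, assuming mono LINEAR16 audio and using a hypothetical helper name pcm_to_wav_bytes, of how such bytes could be wrapped in a WAV container with only the standard library, so that a file-based recognizer could read them.

```python
import io
import wave


def pcm_to_wav_bytes(pcm_bytes, sample_rate, channels=1, sample_width=2):
    """Wrap raw PCM samples in an in-memory WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return out.getvalue()


# Eight dummy 16-bit samples standing in for audio received from the browser.
pcm = b"\x00\x00\xff\x7f" * 4
wav_bytes = pcm_to_wav_bytes(pcm, sample_rate=16000)

# Read the container back to confirm that the header matches the input.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getframerate(), check.getnchannels(), check.getnframes())  # 16000 1 8
```

With the library installed, sr.AudioFile accepts a file-like object as well as a path, so sr.AudioFile(io.BytesIO(wav_bytes)) could stand in for sr.AudioFile(AUDIO_FILE). Note that this transcribes a finished recording rather than a live stream.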

Hope this helped in some way!

Mason Choi