C# - Capture RTP Stream and send to speech recognition

Question

What I am trying to accomplish:

Capture RTP Stream in C#
Forward that stream to the System.Speech.SpeechRecognitionEngine

I am creating a Linux-based robot which will take microphone input and send it to a Windows machine, which will process the audio using Microsoft Speech Recognition and send the response back to the robot. The robot might be hundreds of miles from the server, so I would like to do this over the Internet.

What I have done so far:

Have the robot generate an RTP stream encoded in MP3 format (other formats available) using FFmpeg (the robot is running on a Raspberry Pi running Arch Linux)
Captured stream on the client computer using VLC ActiveX control
Found that the SpeechRecognitionEngine has the available methods:
1. recognizer.SetInputToWaveStream()
2. recognizer.SetInputToAudioStream()
3. recognizer.SetInputToDefaultAudioDevice()
Looked at using JACK to send the output of the app to line-in, but was completely confused by it.

What I need help with:

I'm stuck on how to actually send the stream from VLC to the SpeechRecognitionEngine. VLC doesn't expose the stream at all. Is there a way I can just capture a stream and pass that stream object to the SpeechRecognitionEngine? Or is RTP not the solution here?

Thanks in advance for your help.

There is no need to send audio over the Internet. You can achieve an immediate response with an offline speech recognition engine such as [CMUSphinx](http://cmusphinx.sourceforge.net). The CMUSphinx engine's accuracy is not very different from the Microsoft engine's, and it will work perfectly well on the Raspberry Pi itself. — Nikolay Shmyrev, Apr 08 '13 at 19:54
I am using a high quality speech synthesis engine which only runs on Windows, so unfortunately I can't make everything inclusive on the Pi. The AI will also be doing computationally intensive tasks which are more than the Pi can handle. — dgreenheck, Apr 08 '13 at 20:15

score 6 · Accepted Answer · edited May 23 '17 at 11:55

After much work, I finally got the SpeechRecognitionEngine to accept a WAVE audio stream. Here's the process:

On the Pi, I have ffmpeg running. I stream the audio using this command:

ffmpeg -ac 1 -f alsa -i hw:1,0 -ar 16000 -acodec pcm_s16le -f rtp rtp://XXX.XXX.XXX.XXX:1234

On the server side, I create a UdpClient and listen on port 1234. I receive the packets on a separate thread. First, I strip off the RTP header (header format explained here) and write the payload to a special stream. I had to use the SpeechStreamer class described in Sean's response in order for the SpeechRecognitionEngine to work. It wasn't working with a standard MemoryStream.
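
A plain MemoryStream returns 0 from Read as soon as the reader catches up with the writer, and a reader normally treats 0 as end of stream, which is probably why it didn't work here. The sketch below is not the SpeechStreamer class mentioned above. It is a hypothetical minimal stand-in (the name BlockingAudioStream and the unbounded queue are my assumptions) that only shows the key behaviour: Read blocks until data has been written.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Hypothetical stand-in for SpeechStreamer: Read() waits for data
// instead of returning 0, which a reader would take as end of stream.
public class BlockingAudioStream : Stream
{
    private readonly Queue<byte> buffer = new Queue<byte>();
    private readonly object gate = new object();
    private bool closed;

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }

    // Assumption: return harmless values rather than throw, in case a caller probes these
    public override long Length { get { return -1; } }
    public override long Position { get { return 0; } set { } }
    public override long Seek(long offset, SeekOrigin origin) { return 0; }
    public override void SetLength(long value) { }
    public override void Flush() { }

    public override void Write(byte[] data, int offset, int count)
    {
        lock (gate)
        {
            for (int i = 0; i < count; i++)
            {
                buffer.Enqueue(data[offset + i]);
            }
            Monitor.PulseAll(gate); // wake any reader waiting for data
        }
    }

    public override int Read(byte[] data, int offset, int count)
    {
        lock (gate)
        {
            // Wait for the writer instead of reporting end of stream
            while (buffer.Count == 0 && !closed)
            {
                Monitor.Wait(gate);
            }
            int n = Math.Min(count, buffer.Count);
            for (int i = 0; i < n; i++)
            {
                data[offset + i] = buffer.Dequeue();
            }
            return n; // 0 only once the stream is closed and drained
        }
    }

    protected override void Dispose(bool disposing)
    {
        lock (gate)
        {
            closed = true;
            Monitor.PulseAll(gate); // release blocked readers
        }
        base.Dispose(disposing);
    }
}
```

A real implementation would more likely use a fixed-size circular buffer (the SpeechStreamer constructor takes a buffer size) so a stalled reader can't grow memory without bound.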

The only thing I had to do on the speech recognition side was set the input to the audio stream instead of the default audio device.

recognizer.SetInputToAudioStream( rtpClient.AudioStream,
    new SpeechAudioFormatInfo(WAVFile.SAMPLE_RATE, AudioBitsPerSample.Sixteen, AudioChannel.Mono));

I haven't done extensive testing on it (i.e. letting it stream for days and seeing if it still works), but I'm able to save off the audio sample in the SpeechRecognized event handler and it sounds great. I'm using a sample rate of 16 kHz. I might bump it down to 8 kHz to reduce the amount of data transfer, but I will worry about that once it becomes a problem.
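
To put numbers on the data transfer: the raw payload rate of uncompressed PCM is sample rate × bytes per sample × channels. A small sketch of that arithmetic (the class and method names are just for illustration, and RTP/UDP/IP header overhead is not included):

```csharp
using System;

public static class PcmBandwidth
{
    // Raw payload rate of uncompressed PCM, excluding RTP/UDP/IP headers
    public static int BytesPerSecond(int sampleRate, int bitsPerSample, int channels)
    {
        return sampleRate * (bitsPerSample / 8) * channels;
    }
}
```

For 16-bit mono, `PcmBandwidth.BytesPerSecond(16000, 16, 1)` is 32,000 bytes/s (256 kbit/s), and dropping to 8 kHz halves that to 16,000 bytes/s (128 kbit/s).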

I should also mention, the response is extremely fast. I can speak an entire sentence and get a response in less than a second. The RTP connection seems to add very little overhead to the process. I'll have to try a benchmark and compare it with just using MIC input.

EDIT: Here is my RTPClient class.

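    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Threading;

    // Note: SpeechStreamer is not a framework type; it is the class from Sean's response mentioned above.

    /// <summary>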
    /// Connects to an RTP stream and listens for data
    /// </summary>
    public class RTPClient
    {
        private const int AUDIO_BUFFER_SIZE = 65536;

        private UdpClient client;
        private IPEndPoint endPoint;
        private SpeechStreamer audioStream;
        private bool writeHeaderToConsole = false;
        private bool listening = false;
        private int port;
        private Thread listenerThread; 

        /// <summary>
        /// Returns a reference to the audio stream
        /// </summary>
        public SpeechStreamer AudioStream
        {
            get { return audioStream; }
        }
        /// <summary>
        /// Gets whether the client is listening for packets
        /// </summary>
        public bool Listening
        {
            get { return listening; }
        }
        /// <summary>
        /// Gets the port the RTP client is listening on
        /// </summary>
        public int Port
        {
            get { return port; }
        }

        /// <summary>
        /// RTP Client for receiving an RTP stream containing a WAVE audio stream
        /// </summary>
        /// <param name="port">The port to listen on</param>
        public RTPClient(int port)
        {
            Console.WriteLine(" [RTPClient] Loading...");

            this.port = port;

            // Initialize the audio stream that will hold the data
            audioStream = new SpeechStreamer(AUDIO_BUFFER_SIZE);

            Console.WriteLine(" Done");
        }

        /// <summary>
        /// Creates a connection to the RTP stream
        /// </summary>
        public void StartClient()
        {
            // Create new UDP client. The IP end point tells us which IP is sending the data
            client = new UdpClient(port);
            endPoint = new IPEndPoint(IPAddress.Any, port);

            listening = true;
            listenerThread = new Thread(ReceiveCallback);
            listenerThread.Start();

            Console.WriteLine(" [RTPClient] Listening for packets on port " + port + "...");
        }

        /// <summary>
        /// Tells the UDP client to stop listening for packets.
        /// </summary>
        public void StopClient()
        {
            // Clear the flag, then close the socket so the blocking Receive() call returns
            listening = false;
            if (client != null)
            {
                client.Close();
            }
            Console.WriteLine(" [RTPClient] Stopped listening on port " + port);
        }

        /// <summary>
        /// Receives UDP packets from the RTP stream on the listener thread
        /// </summary>
        private void ReceiveCallback()
        {
            // Keep receiving until StopClient() is called
            while (listening)
            {
                byte[] packet;
                try
                {
                    // Receive packet (blocks until one arrives or the socket is closed)
                    packet = client.Receive(ref endPoint);
                }
                catch (SocketException)
                {
                    // Closing the socket in StopClient() interrupts Receive()
                    break;
                }
                catch (ObjectDisposedException)
                {
                    break;
                }

                // Ignore anything too short to hold the fixed 12-byte RTP header
                if (packet.Length < 12)
                {
                    continue;
                }

                // Decode the header of the packet
                int version = GetRTPHeaderValue(packet, 0, 1);
                int padding = GetRTPHeaderValue(packet, 2, 2);
                int extension = GetRTPHeaderValue(packet, 3, 3);
                int csrcCount = GetRTPHeaderValue(packet, 4, 7);
                int marker = GetRTPHeaderValue(packet, 8, 8);
                int payloadType = GetRTPHeaderValue(packet, 9, 15);
                int sequenceNum = GetRTPHeaderValue(packet, 16, 31);
                int timestamp = GetRTPHeaderValue(packet, 32, 63);
                int ssrcId = GetRTPHeaderValue(packet, 64, 95);

                if (writeHeaderToConsole)
                {
                    Console.WriteLine("{0} {1} {2} {3} {4} {5} {6} {7} {8}",
                        version,
                        padding,
                        extension,
                        csrcCount,
                        marker,
                        payloadType,
                        sequenceNum,
                        timestamp,
                        ssrcId);
                }

                // Skip the fixed 12-byte header plus 4 bytes per CSRC identifier,
                // then write the payload to the audio stream
                int headerLength = 12 + 4 * csrcCount;
                if (packet.Length > headerLength)
                {
                    audioStream.Write(packet, headerLength, packet.Length - headerLength);
                }
            }
        }

        /// <summary>
        /// Grabs a value from the RTP header in Big-Endian format
        /// </summary>
        /// <param name="packet">The RTP packet</param>
        /// <param name="startBit">Start bit of the data value</param>
        /// <param name="endBit">End bit of the data value</param>
        /// <returns>The value</returns>
        private int GetRTPHeaderValue(byte[] packet, int startBit, int endBit)
        {
            int result = 0;

            // RTP header fields are big endian: read the most significant bit first
            // and shift it in, so the last bit read ends up least significant
            for (int i = startBit; i <= endBit; i++)
            {
                int byteIndex = i / 8;
                int bitShift = 7 - (i % 8);
                result = (result << 1) | ((packet[byteIndex] >> bitShift) & 1);
            }
            return result;
        }
    }
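
The fixed part of an RTP header is 12 bytes, but a packet may also carry CSRC identifiers, a header extension, and trailing padding. The following is a sketch of locating the payload with those taken into account, based on my reading of the RTP header layout. The RtpPayload class and TryLocate method are hypothetical names, not part of the class above.

```csharp
using System;

public static class RtpPayload
{
    private const int FixedHeaderLength = 12;

    // Finds the payload inside an RTP packet. Returns false if the packet is malformed.
    public static bool TryLocate(byte[] packet, out int offset, out int count)
    {
        offset = 0;
        count = 0;
        if (packet == null || packet.Length < FixedHeaderLength) return false;

        // First byte: version (2 bits), padding (1), extension (1), CSRC count (4)
        int version = packet[0] >> 6;
        if (version != 2) return false;
        bool hasPadding = (packet[0] & 0x20) != 0;
        bool hasExtension = (packet[0] & 0x10) != 0;
        int csrcCount = packet[0] & 0x0F;

        // Each CSRC identifier is 4 bytes
        int headerLength = FixedHeaderLength + 4 * csrcCount;

        if (hasExtension)
        {
            // Extension header: 16-bit profile id, then 16-bit length in 32-bit words
            if (packet.Length < headerLength + 4) return false;
            int words = (packet[headerLength + 2] << 8) | packet[headerLength + 3];
            headerLength += 4 + 4 * words;
        }

        int end = packet.Length;
        if (hasPadding)
        {
            // The last byte holds the number of padding bytes, itself included
            end -= packet[packet.Length - 1];
        }

        if (headerLength > end) return false;
        offset = headerLength;
        count = end - headerLength;
        return true;
    }
}
```

In the receive loop this would be used as `if (RtpPayload.TryLocate(packet, out offset, out count)) audioStream.Write(packet, offset, count);`.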

Any idea how to test RTPClient using VLC or ffmpeg on Windows? — Jean-Philippe Encausse, Jul 03 '13 at 19:07
Don't use RTP. Use `ffmpeg ... -acodec pcm_s16le -f s16le tcp://...` to send raw audio samples over TCP. Now you don't need to deal with RTP protocol **and** TCP ensures you receive packets reliably in-order. — Aleksandr Dubinsky, May 16 '18 at 15:49

score 2 · Answer 2 · edited May 23 '17 at 10:33

I think you should keep it simpler. Why use RTP and a special library to capture the RTP stream? Why not just take the audio data from the Raspberry Pi and use an HTTP POST to send it to your server?

Keep in mind that System.Speech does not support MP3 format. This might be helpful - Help with SAPI v5.1 SpeechRecognitionEngine always gives same wrong result with C#. For System.Speech audio must be in PCM, ULaw, or ALaw format. The most reliable way to determine which formats your recognizer supports is to interrogate it with RecognizerInfo.SupportedAudioFormats.

Then you can post the data to your server (and use ContentType = "audio/x-wav"). We've used a URL format like

http://server/app/recognize/{sampleRate}/{bits}/{isStereo}

to include the audio parameters in the request. Send the captured wav file in the POST body.
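
As a rough sketch of what building such a request could look like with HttpClient's types: the RecognizeRequest class, its Build method, and the choice to encode isStereo as "true"/"false" are my assumptions for illustration, not something the server above is known to expect.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

public static class RecognizeRequest
{
    // Builds a POST request whose URL carries the audio parameters
    // and whose body is the captured WAV data.
    public static HttpRequestMessage Build(Uri server, byte[] wav, int sampleRate, int bits, bool isStereo)
    {
        // Assumption: isStereo is sent as the literal "true" or "false"
        string path = string.Format("app/recognize/{0}/{1}/{2}",
            sampleRate, bits, isStereo ? "true" : "false");
        var uri = new Uri(server, path);

        var content = new ByteArrayContent(wav);
        content.Headers.ContentType = new MediaTypeHeaderValue("audio/x-wav");

        return new HttpRequestMessage(HttpMethod.Post, uri) { Content = content };
    }
}
```

The request can then be sent with `new HttpClient().SendAsync(request)`.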

One catch we ran into is we had to add a WAV file header to the data before sending it to System.Speech. Our data was PCM, but not in WAV format. See https://ccrma.stanford.edu/courses/422/projects/WaveFormat/ in case you need to do this.
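
For anyone who needs to do the same, here is a minimal sketch of prepending the canonical 44-byte PCM WAV header to raw samples. The WavWriter class and AddHeader method are illustrative names, and it assumes plain uncompressed PCM with no extra chunks.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavWriter
{
    // Returns the raw PCM data wrapped in a minimal RIFF/WAVE container.
    public static byte[] AddHeader(byte[] pcm, int sampleRate, short bitsPerSample, short channels)
    {
        short blockAlign = (short)(channels * bitsPerSample / 8); // bytes per sample frame
        int byteRate = sampleRate * blockAlign;                   // bytes per second

        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms)) // BinaryWriter writes integers little-endian
        {
            w.Write(Encoding.ASCII.GetBytes("RIFF"));
            w.Write(36 + pcm.Length);        // chunk size = total file size minus 8
            w.Write(Encoding.ASCII.GetBytes("WAVE"));

            w.Write(Encoding.ASCII.GetBytes("fmt "));
            w.Write(16);                     // size of the fmt chunk for PCM
            w.Write((short)1);               // audio format 1 = PCM
            w.Write(channels);
            w.Write(sampleRate);
            w.Write(byteRate);
            w.Write(blockAlign);
            w.Write(bitsPerSample);

            w.Write(Encoding.ASCII.GetBytes("data"));
            w.Write(pcm.Length);             // size of the sample data
            w.Write(pcm);
            w.Flush();
            return ms.ToArray();
        }
    }
}
```

For 16 kHz, 16-bit mono data the call would be `WavWriter.AddHeader(pcm, 16000, 16, 1)`.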

answered Apr 09 '13 at 12:46 by Michael Levy

Except that sending the whole WAV file will not let you achieve an immediate response: both the upload and the reply take time. A properly implemented RTP-based solution (or, better, an MRCP-based one) can give you a response time 10 times smaller than an implementation that sends whole files. – Nikolay Shmyrev Apr 09 '13 at 20:35

True, but for a robot that may be receiving voice commands across the Internet, with processing on a desktop Windows machine (assuming so, since System.Speech is being used), I've got a feeling the added latency won't be a problem. – Michael Levy Apr 09 '13 at 23:42

Thanks for the response. I actually did end up getting it to work over RTP. I'll post what I did in case anyone else is looking at how to do this. – dgreenheck Apr 10 '13 at 18:41

score 0 · Answer 3 · edited Sep 02 '17 at 00:47

It's an old thread, but was useful for a project I was working on. But, I had the same issues as some other people trying to use dgreenheck's code with a Windows PC as the source.

I got FFmpeg working with zero changes to that code, using the following parameters:

ffmpeg -ac 1 -f dshow -i audio="{recording device}" -ar 16000 -acodec pcm_s16le -f rtp rtp://{hostname}:{port}

In my case, the recording device name was "Microphone (Realtek High Definition Audio)", but I used the following to get the recording device name:

ffmpeg -list_devices true -f dshow -i dummy
