Recognizing notes within recorded sound - Part 2 - Python

Question

This is a continuation of this question here.

This is the code I used in order to get the samples:

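import wave
import numpy as np

spf = wave.open(speech, 'r')
sound_info = spf.readframes(-1)  # raw bytes of every frame in the file
sound_info = np.frombuffer(sound_info, dtype=np.int16)  # fromstring(..., 'Int16') no longer works in current NumPy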

The length of sound_info is 194560, which is 4.4 times the sample rate of 44100. The length of the sound file is 2.2 seconds, so isn't sound_info twice the length it should be?

Also I can only seem to find enough info on why FFTs are used in order to produce the frequency spectrum.

I would like to split a sound up and analyse the frequency spectrum of multiple fractions of a second, rather than the whole sound file.

Help would be very much appreciated. :)

This is the basic sound_info graph

plot(sound_info)

This is the FFT graph

freq = [abs(x) for x in fft(sound_info)]  # magnitude of each complex coefficient, not only its real part
plot(freq)

‘isn't sound_info twice the length it should be?’: stereo? — bobince, Sep 15 '10 at 10:54
thanks for that bobince, but then how do I interpret sound_info? Because the data is sequential — RadiantHex, Sep 15 '10 at 11:14

unutbu · Answer 1 · 2010-09-15T13:34:06.880

1

If your wav file has two channels, then the length of sound_info would be 2 * sample rate * duration (in seconds). That matches your numbers: 194560 / (2 * 44100) ≈ 2.2 seconds. The channel data alternate, so if you have slurped all the values into a 1-dimensional array, data, then the values associated with one channel would be data[::2], and the other would be data[1::2].
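A minimal sketch of that slicing, using a made-up six-value buffer rather than a real wav file (the array contents and the names left, right and frames are only for illustration):

```python
import numpy as np

# Hypothetical interleaved stereo buffer: L0, R0, L1, R1, L2, R2
data = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)

left = data[::2]    # values at even indices: first channel
right = data[1::2]  # values at odd indices: second channel

# Equivalent view: one row per frame, one column per channel
frames = data.reshape(-1, 2)

print(left, right, frames.shape)
```

The reshape form generalises to more channels if you replace 2 with the value returned by the wave object's getnchannels().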

Roughly speaking, smooth functions can be represented as sums of sine and cosine waves (with various amplitudes and frequencies).

The FFT (Fast Fourier Transform) relates the function to the coefficients (amplitudes) of those sine and cosine waves. That is, there is a one-to-one mapping between the function on the one hand and the sequence of coefficients on the other.

If a sound sample consists mainly of one note, its FFT will have one coefficient which is very big (in absolute value), and the others will be very small. That coefficient corresponds to a particular sine wave, with a particular frequency. That's the frequency of the note.
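As a rough illustration of that idea, here is a sketch that finds the biggest coefficient of a synthetic tone. The 440 Hz test signal and the helper name dominant_frequency are my own assumptions, not something from the question, and a real recording with overtones and noise will not be this clean.

```python
import numpy as np

def dominant_frequency(samples, sample_rate):
    """Return the frequency in Hz of the largest-magnitude FFT coefficient."""
    spectrum = np.abs(np.fft.rfft(samples))  # magnitudes for non-negative frequencies
    spectrum[0] = 0.0                        # ignore the constant (DC) offset
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate  # one second of sample times
tone = np.sin(2 * np.pi * 440.0 * t)      # synthetic single note

peak = dominant_frequency(tone, sample_rate)
print(peak)
```

The frequency resolution is sample_rate / len(samples), so shorter snippets give coarser frequency estimates.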

edited Sep 15 '10 at 13:34

answered Sep 15 '10 at 10:47

@unutbu Thanks for your awesome reply! :) Would you know why sound_info is sequential though? – RadiantHex Sep 15 '10 at 11:26
+1: Musical sounds have overtones. A lot of them: they're integer multiples of the fundamental frequency. Further, real instruments include a great deal of noise as well as time-shifted signals (i.e., doppler shifts) that make recognizing the fundamental challenging. – S.Lott Sep 15 '10 at 12:36
@S.Lott thanks for that. Is there not a way of getting a list of frequencies for each sample? Or is each sample limited to only one frequency value? :| – RadiantHex Sep 15 '10 at 12:42
@RadiantHex: What do you think the FFT gives you? It transforms time-domain samples into frequency domain. Please read up on FFT more carefully. – S.Lott Sep 15 '10 at 13:10
@S.Lott: So I would have to split the samples into time groups in order to obtain the energy value of each frequency changing over time? – RadiantHex Sep 15 '10 at 14:16
@RadiantHex: you might want to check out my answer to http://stackoverflow.com/questions/2648151/python-frequency-detection/2649540#2649540. It might help with the frequency detection anyway. Also, if you are really looking to get the frequencies at specific times, then you should look into the short-time Fourier transform. – Justin Peel Sep 15 '10 at 15:20
@RadiantHex: Yes. You transform a time-domain sample into frequency domain data. Too big a time domain and you have multiple pitches. Too small a time domain and you may not have a complete fundamental. Also, random time slices are useless; you have to find a "beat" if you want to find "music" (i.e., melody). Please read up on FFT more carefully. – S.Lott Sep 15 '10 at 17:49
@S.Lott: thanks for sharing that, any idea what a good place to read up on music theory is? =) – RadiantHex Sep 16 '10 at 09:12
@RadiantHex: Is google broken? Did you read http://stackoverflow.com/questions/2648151/python-frequency-detection/2649540#2649540? – S.Lott Sep 16 '10 at 10:04

score 0 · Answer 2 · answered May 28 '16 at 17:13

Don't reinvent the wheel :)

Check out http://librosa.github.io, especially the part about the short-time Fourier transform (STFT) or, in your case, rather something like a constant-Q transform (CQT).

But first things first: Let's assume we have a stereo signal (2 channels) from an audio file. For now, we throw away spatial information which is encoded in the two channels of the audio file by creating an average channel (sum up both channels and divide by 2). We now have a signal which is mono (1 channel). Since we have a digital signal, each point in time is called a sample.
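The averaging step could look roughly like this. It is only a sketch: to_mono is a hypothetical helper, and it assumes the samples arrive interleaved in a 1-dimensional array as they do when read with the wave module.

```python
import numpy as np

def to_mono(interleaved, n_channels=2):
    """Average the channels of an interleaved sample array into one channel."""
    # Convert to float first so that summing int16 values cannot overflow
    frames = np.asarray(interleaved, dtype=np.float64).reshape(-1, n_channels)
    return frames.mean(axis=1)

# Hypothetical stereo buffer: L0, R0, L1, R1
mono = to_mono(np.array([100, 200, -50, 50], dtype=np.int16))
print(mono)
```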

Now the fun part begins: we chop the signal into small chunks (called frames) of consecutive samples (512 or another power of 2 is a standard frame length). Taking the discrete Fourier transform (DFT) of each of these frames gives a time-frequency representation called the spectrogram. Further concepts (overlap etc.) are covered in any DSP book or in resources like this lab course: https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/02-teaching/2016s_apl/LabCourse_STFT.pdf
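To make the framing step concrete, here is a bare-bones NumPy sketch rather than librosa's implementation. The function name, the frame length of 2048, the non-overlapping hop and the Hann window are all my own choices for the example, and the test signal (440 Hz followed by 880 Hz) is synthetic.

```python
import numpy as np

def spectrogram(signal, frame_length=2048, hop_length=2048):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    window = np.hanning(frame_length)  # assumption: taper each frame to reduce leakage
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))  # DFT of every frame

sample_rate = 44100
t = np.arange(sample_rate // 2) / sample_rate  # half a second per note
signal = np.concatenate([np.sin(2 * np.pi * 440 * t),
                         np.sin(2 * np.pi * 880 * t)])

spec = spectrogram(signal)
freqs = np.fft.rfftfreq(2048, d=1.0 / sample_rate)
peak_per_frame = freqs[spec.argmax(axis=1)]  # strongest frequency in each frame
print(spec.shape, peak_per_frame[0], peak_per_frame[-1])
```

Setting hop_length smaller than frame_length gives overlapping frames, which trades more computation for finer time resolution.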

Note that the frequency axis of the DFT is linearly spaced. In the western music system, an octave is split into 12 semitones whose center frequencies are spaced logarithmically. See the lab course script above for a binning strategy that derives a logarithmically spaced frequency axis from the linear STFT. However, this approach is very basic, and there are lots of other and probably better approaches.
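One simple way to get such a logarithmic axis is to add up all DFT bins that fall nearest to each semitone. The sketch below does that with MIDI pitch numbers and A4 = 440 Hz as the tuning reference. Both are assumptions on my part, and this is not necessarily the exact strategy from the lab course script.

```python
import numpy as np

A4_HZ = 440.0  # assumed tuning reference (MIDI pitch 69)

def pool_to_semitones(magnitudes, sample_rate, n_fft):
    """Sum linear DFT bin magnitudes into 128 semitone (MIDI pitch) bins."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    magnitudes = np.asarray(magnitudes, dtype=np.float64)
    audible = freqs > 0  # log2(0) is undefined, so skip the DC bin
    pitch = np.round(69 + 12 * np.log2(freqs[audible] / A4_HZ)).astype(int)
    keep = (pitch >= 0) & (pitch <= 127)
    pooled = np.zeros(128)
    np.add.at(pooled, pitch[keep], magnitudes[audible][keep])  # accumulate per pitch
    return pooled

sample_rate, n_fft = 44100, 4096
t = np.arange(n_fft) / sample_rate
tone = np.sin(2 * np.pi * A4_HZ * t) * np.hanning(n_fft)  # synthetic windowed A4

pooled = pool_to_semitones(np.abs(np.fft.rfft(tone)), sample_rate, n_fft)
strongest_pitch = int(pooled.argmax())
print(strongest_pitch)
```

At low pitches the DFT bins are wider than a semitone, so some pitch bins stay empty unless you use a longer frame.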

Now back to your problem of note recognition. First: it's a very hard one. :) As mentioned above, a real sound played by an instrument contains overtones. Also, if you are interested in transcribing notes played by complete bands, you get interference from the other musicians etc.

Talking about methods you could try out: lots of people nowadays use non-negative matrix factorization (NMF or similar LDPCA) or neural networks to approach this task. For instance, NMF is included in scikit-learn. To get started, I would recommend NMF. Use only mono-timbral sounds, i.e., a single instrument playing at a time. Initialize the templates with simple decaying overtone structures and see what happens.
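To give a feel for the template idea, here is a toy sketch. It uses plain NumPy multiplicative updates instead of scikit-learn's NMF class, to keep the example self-contained. The 1/h overtone decay, the bin numbers and the synthetic spectrogram are all assumptions chosen for illustration.

```python
import numpy as np

def harmonic_templates(n_bins, fundamental_bins, n_harmonics=12):
    """One column per note: energy at multiples of the fundamental bin, decaying as 1/h."""
    W = np.zeros((n_bins, len(fundamental_bins)))
    for k, f0 in enumerate(fundamental_bins):
        for h in range(1, n_harmonics + 1):
            if h * f0 < n_bins:
                W[h * f0, k] = 1.0 / h
    return W

def nmf(V, W_init, n_iter=200, eps=1e-9):
    """Factor V (bins x frames) into templates W and activations H, both non-negative."""
    W = W_init.copy()
    H = np.ones((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean cost; zeros in W stay zero
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic spectrogram: note 0 sounds in frames 0-9, note 1 in frames 10-19
n_bins = 64
true_W = harmonic_templates(n_bins, [5, 8])
true_H = np.zeros((2, 20))
true_H[0, :10] = 1.0
true_H[1, 10:] = 1.0
V = true_W @ true_H

W, H = nmf(V, harmonic_templates(n_bins, [5, 8]))
active_note = H.argmax(axis=0)  # index of the strongest template in each frame
print(active_note)
```

Because the updates are multiplicative, entries of W that start at zero stay at zero, so each template keeps its harmonic shape while the overtone amplitudes adapt to the data.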

Here are some examples from librosa: http://nbviewer.jupyter.org/github/librosa/librosa/blob/master/examples/LibROSA%20demo.ipynb — Stefan Balke, May 28 '16 at 17:13
