
I've been experimenting with the FFT algorithm. I use NAudio along with working FFT code from the internet. Based on my observations of its performance, the resulting pitch is inaccurate.

What happens is that I have a MIDI file (generated from GuitarPro) converted to a WAV file (44.1 kHz, 16-bit, mono) that contains a pitch progression starting from E2 (the lowest guitar note) up to about E6. For the lower notes (around E2-B3) the result is generally very wrong. But from C4 upward it is somewhat correct, in that you can already see the proper progression (the next note is C#4, then D4, etc.). The problem there, however, is that the detected pitch is a half note off from the actual pitch (e.g. C4 should be the note but D#4 is displayed).

What do you think may be wrong? I can post the code if necessary. Thanks very much! I'm still beginning to grasp the field of DSP.

Edit: Here is a rough sketch of what I'm doing:

byte[] buffer = new byte[8192];
int bytesRead;
do
{
  bytesRead = stream16.Read(buffer, 0, buffer.Length);
} while (bytesRead != 0);

And then (waveBuffer is simply a class that converts the byte[] into a float[], since the function only accepts float[]):

public int Read(byte[] buffer, int offset, int bytesRead)
{
  int frames = bytesRead / sizeof(float);
  float pitch = DetectPitch(waveBuffer.FloatBuffer, frames);
  // ... use the detected pitch, then report how many bytes were consumed
  return bytesRead;
}

And lastly (SmbPitchShift is the class that has the FFT algorithm ... I believe there's nothing wrong with it, so I'm not posting it here):

private float DetectPitch(float[] buffer, int inFrames)
{
  Func<int, int, float> window = HammingWindow;
  if (prevBuffer == null)
  {
    prevBuffer = new float[inFrames]; //only contains zeroes
  }  

  // double frames since we are combining present and previous buffers
  int frames = inFrames * 2;
  if (fftBuffer == null)
  {
    fftBuffer = new float[frames * 2]; // times 2 because it is complex input
  }

  for (int n = 0; n < frames; n++)
  {
     if (n < inFrames)
     {
       fftBuffer[n * 2] = prevBuffer[n] * window(n, frames);
       fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
     }
     else
     {
       fftBuffer[n * 2] = buffer[n - inFrames] * window(n, frames);
       fftBuffer[n * 2 + 1] = 0; // need to clear out as fft modifies buffer
     }
  }
  SmbPitchShift.smbFft(fftBuffer, frames, -1);
}

And for interpreting the result:

float binSize = sampleRate / frames;
int minBin = (int)(82.407 / binSize); //lowest E string on the guitar
int maxBin = (int)(1244.508 / binSize); //highest E string on the guitar

float maxIntensity = 0f;
int maxBinIndex = 0;

for (int bin = minBin; bin <= maxBin; bin++)
{
    float real = fftBuffer[bin * 2];
    float imaginary = fftBuffer[bin * 2 + 1];
    float intensity = real * real + imaginary * imaginary;
    if (intensity > maxIntensity)
    {
        maxIntensity = intensity;
        maxBinIndex = bin;
    }
}

return binSize * maxBinIndex;

UPDATE (if anyone is still interested):

So, one of the answers below stated that the frequency peak from the FFT is not always equivalent to the pitch. I understand that. But I wanted to try something for myself, on the assumption that there are times when the frequency peak IS the resulting pitch. So basically, I got two programs (SpectraPLUS and FFTProperties by DewResearch; credit to them) that are able to display the frequency domain of audio signals.

So here are the frequency peaks they report:

SpectraPLUS: [spectrum screenshot]

FFT Properties: [spectrum screenshot]

This was done using a test note of A2 (around 110 Hz). Looking at the images, they show frequency peaks in the range of 102-112 Hz for SpectraPLUS and 108 Hz for FFT Properties. In my code, I get 104 Hz (I use 8192-sample blocks and a sample rate of 44.1 kHz; the 8192 samples are then doubled to form the complex input, so in the end I get a bin size of around 5 Hz, compared to the 10 Hz bin size of SpectraPLUS).
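Just to double-check the bin arithmetic (this assumes the 44.1 kHz rate and 8192-sample blocks described above; packing the samples as complex input does not change the bin spacing):

```python
fs = 44100.0          # sample rate
n = 8192              # samples per FFT block
bin_size = fs / n     # ~5.38 Hz, the "around 5 Hz" figure above

peak_bin = round(104.0 / bin_size)   # bin nearest the 104 Hz reading
true_bin = 110.0 / bin_size          # where A2 actually falls (~20.4, between bins)
print(bin_size, peak_bin, true_bin)
```

So the reported 104 Hz peak sits at bin 19, while the true A2 lands between bins 20 and 21 — roughly a one-bin discrepancy at this resolution.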

So now I'm a bit confused, since the programs seem to return the correct result, but in my code I always get 104 Hz (note that I have compared the FFT function I used with others such as Math.NET and it seems to be correct).

Do you think the problem may be with my interpretation of the data? Or do the programs do some extra processing before displaying the frequency spectrum? Thanks!

  • Hi! The value that I get for maxBinIndex is bin 20 (around 100-104 Hz), which comes out to around G#, a half note down from the expected A. This is consistent with other .wav files, sometimes being a whole step down. – user488792 Feb 23 '11 at 03:54
  • @eryksun Thanks! Your last point there is interesting. I will try and look into it. – user488792 Feb 23 '11 at 07:14
  • @eryksun Hi! Thank you very much! That seems to be the problem. My code now works and returns the correct frequency. It seems I missed this solution in Paul R's answer, since at that time I didn't yet know much about the FFT. However, I have learned a lot thanks to all your help. So thanks again! – user488792 Feb 23 '11 at 12:30
  • However, `prevBuffer` elements are never set, so the values are always 0. Is it correct behaviour? – linquize Jan 31 '16 at 14:00

4 Answers


It sounds like you may have an interpretation problem with your FFT output. A few random points:

  • the FFT has a finite resolution - each output bin has a resolution of Fs / N, where Fs is the sample rate and N is the size of the FFT

  • for notes which are low on the musical scale, the difference in frequency between successive notes is relatively small, so you will need a sufficiently large N to discriminate between notes which are a semitone apart (see note 1 below)

  • the first bin (index 0) contains energy centered at 0 Hz but includes energy from +/- Fs / 2N

  • bin i contains energy centered at i * Fs / N but includes energy from +/- Fs / 2N either side of this center frequency

  • you will get spectral leakage from adjacent bins - how bad this is depends on what window function you use - with no window (i.e. a rectangular window) the spectral leakage will be very bad (very broad peaks) - for frequency estimation you want to pick a window function that gives you sharp peaks

  • pitch is not the same thing as frequency - pitch is a percept, frequency is a physical quantity - the perceived pitch of a musical instrument may be slightly different from the fundamental frequency, depending on the type of instrument (some instruments do not even produce significant energy at their fundamental frequency, yet we still perceive their pitch as if the fundamental were present)

My best guess from the limited information available though is that perhaps you are "off by one" somewhere in your conversion of bin index to frequency, or perhaps your FFT is too small to give you sufficient resolution for the low notes, and you may need to increase N.

You can also improve your pitch estimation via several techniques, such as cepstral analysis, or by looking at the phase component of your FFT output and comparing it for successive FFTs (this allows for a more accurate frequency estimate within a bin for a given FFT size).
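To illustrate the cepstral idea mentioned above, here is a minimal NumPy sketch (not from the answer itself; the 110 Hz test tone and the harmonic mix, with a deliberately weak fundamental, are invented for the example):

```python
import numpy as np

fs, n = 44100, 16384
t = np.arange(n) / fs
f0 = 110.0                                  # hypothetical A2-like test tone
# harmonic-rich signal whose 2nd harmonic is stronger than its fundamental
x = 0.5 * np.sin(2 * np.pi * f0 * t) + 1.0 * np.sin(2 * np.pi * 2 * f0 * t)

# real cepstrum: inverse FFT of the log magnitude spectrum
mag = np.abs(np.fft.rfft(x * np.hanning(n)))
log_mag = np.log(mag + 1e-3 * mag.max())    # small floor keeps log() well-behaved
cepstrum = np.fft.irfft(log_mag)

# look for a peak at quefrencies corresponding to 60..500 Hz fundamentals
q_min, q_max = int(fs / 500), int(fs / 60)
peak_q = q_min + np.argmax(cepstrum[q_min:q_max])
f0_est = fs / peak_q
print(f0_est)
```

The cepstral peak sits at the period shared by all the harmonics, so the estimate lands near 110 Hz even though the largest spectral peak is at 220 Hz.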


Notes

(1) Just to put some numbers on this, E2 is 82.4 Hz and F2 is 87.3 Hz, so you need a resolution somewhat better than 5 Hz to discriminate between the lowest two notes on a guitar (and much finer than this if you actually want to do, say, accurate tuning). At a 44.1 kHz sample rate you probably need an FFT of at least N = 8192 to give you sufficient resolution (44100 / 8192 = 5.4 Hz); N = 16384 would probably be better.
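These numbers are easy to verify (note frequencies from standard A440 tuning):

```python
fs = 44100.0
e2, f2 = 82.407, 87.307            # the two lowest adjacent semitones on a guitar
required = f2 - e2                 # ~4.9 Hz of resolution needed
for n in (4096, 8192, 16384):
    # bin resolution Fs / N, and whether it resolves E2 from F2
    print(n, fs / n, fs / n < required)
```

Only N = 16384 gets the bin spacing comfortably below the ~4.9 Hz semitone gap; N = 8192 is marginal and N = 4096 is clearly too coarse.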

Paul R
  • Hi Paul! Thanks very much for the answer! I am currently using a Hamming window for the window function and N = 4096. The reason for this is that I make use of interleaving to make the input buffer for the FFT much larger. Generally, I interleave zeroes with the input buffer. I'm going to try some things to check if it improves accuracy. Thanks! – user488792 Feb 11 '11 at 11:04
  • @user488792: OK - it sounds like you've made a good start anyway - Hamming is a reasonable choice of window, but note that stuffing zeroes into your data to get more *apparent* resolution doesn't really buy you anything - it just interpolates the resulting FFT output, which makes it look smoother, but there's no additional *information* (no such thing as a free lunch!). – Paul R Feb 11 '11 at 11:12
  • @eryksun: good point - I was reading "interleaving" as "padding". @user488792: the zeroes need to be appended to the buffer to get the interpolated spectrum, as @eryksun rightly says - is this what you are doing, or are you really interleaving zeroes *between* samples ? – Paul R Feb 11 '11 at 11:51
  • I'll give an outline of what I'm doing. I'm still a beginner, so I just got it from a site and am not really sure what it does, hence the confusion, but it seems to work so I left it at that. I'm going to edit my post to add the additional information, since I think it may be too long for a comment. – user488792 Feb 11 '11 at 12:02
  • @user488792: looks OK as far as I can tell - the 0s you are inserting are just for the imaginary part of the input data and then there seems to be some zero padding coming from another buffer (maybe just on the first iteration - probably some kind of overlap scheme ?). – Paul R Feb 11 '11 at 12:29
  • I checked other iterations and the padding always pads zeroes. I'm not really sure what its purpose is, but I tried removing the padding and starting directly with the data from the input buffer. However, it didn't work properly. Maybe the error is in the part where I interpret the data from the fftBuffer (after the FFT). – user488792 Feb 11 '11 at 12:38
  • @eryksun Yes, I checked the FFT algorithm and it accepts complex input and produces complex output. I'm going to experiment some more based on the answers you gave me. Thanks so much! If you have other suggestions, I'll gladly accept them. Thanks! – user488792 Feb 11 '11 at 12:45
  • @eryksun Actually, I looked again and I forgot to add the re-assignment of prevBuffer to buffer. But may I ask whether that could affect the results, since things still seem somewhat OK at the higher frequencies? – user488792 Feb 11 '11 at 12:59
  • Oh, OK. I'll try assigning the prevBuffer. Thanks! – user488792 Feb 11 '11 at 13:10
  • Hi again! I've made changes to my code, such as re-assigning the prevBuffer and making the sample size larger (8192, since I tried 16384 but got poorer results). Because of this, I'm wondering if I may be interpreting the results from the FFT wrongly. I've posted what I currently have. A better clarification of the FFT result would also be good, since currently I am confused about how changing the values of minBin and maxBin changes the result. Thanks! – user488792 Feb 12 '11 at 06:54
  • When you calculate minBin and maxBin you should be rounding rather than truncating as at present, otherwise you may be one bin off. – Paul R Feb 12 '11 at 07:57
  • Hi! Thanks for all the help. After much tracing and debugging and looking into the values at run time, I have come to the conclusion that maybe the problem is with my audio signal (as some have mentioned, frequency-to-pitch estimation is not really well defined). I'll continue to experiment with it, but in the meantime I think I have learned a lot and have a better understanding of the FFT algorithm. Thanks a lot! – user488792 Feb 12 '11 at 08:32

I thought this might help you. I made some plots of the six open strings of a guitar. The code is in Python using pylab, which I recommend for experimenting:

# analyze distorted guitar notes from
# http://www.freesound.org/packsViewSingle.php?id=643
#
# 329.6 E - open 1st string
# 246.9 B - open 2nd string
# 196.0 G - open 3rd string
# 146.8 D - open 4th string
# 110.0 A - open 5th string
#  82.4 E - open 6th string

from pylab import *
import wave

fs = 44100.0 
N = 8192 * 10
t = r_[:N] / fs
f = r_[:N/2+1] * fs / N 
gtr_fun = [329.6, 246.9, 196.0, 146.8, 110.0, 82.4]

gtr_wav = [wave.open('dist_gtr_{0}.wav'.format(n),'r') for n in r_[1:7]]
gtr = [fromstring(g.readframes(N), dtype='int16') for g in gtr_wav]
gtr_t = [g / float64(max(abs(g))) for g in gtr]
gtr_f = [2 * abs(rfft(g)) / N for g in gtr_t]

def make_plots():
    for n in r_[:len(gtr_t)]:
        fig = figure()
        fig.subplots_adjust(wspace=0.5, hspace=0.5)
        subplot2grid((2,2), (0,0))
        plot(t, gtr_t[n]); axis('tight')
        title('String ' + str(n+1) + ' Waveform')
        subplot2grid((2,2), (0,1))
        plot(f, gtr_f[n]); axis('tight')
        title('String ' + str(n+1) + ' DFT')
        subplot2grid((2,2), (1,0), colspan=2)
        M = int(gtr_fun[n] * 16.5 / fs * N)
        plot(f[:M], gtr_f[n][:M]); axis('tight')
        title('String ' + str(n+1) + ' DFT (16 Harmonics)')

if __name__ == '__main__':
    make_plots()
    show()

String 1, fundamental = 329.6 Hz: [waveform and DFT plots]

String 2, fundamental = 246.9 Hz: [waveform and DFT plots]

String 3, fundamental = 196.0 Hz: [waveform and DFT plots]

String 4, fundamental = 146.8 Hz: [waveform and DFT plots]

String 5, fundamental = 110.0 Hz: [waveform and DFT plots]

String 6, fundamental = 82.4 Hz: [waveform and DFT plots]

The fundamental frequency isn't always the dominant harmonic, but it does determine the spacing between the harmonics of a periodic signal.
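One common technique that exploits this spacing, rather than the single largest peak, is the harmonic product spectrum (not mentioned in the answer above). A minimal NumPy sketch, using an invented test signal whose fundamental is deliberately weak:

```python
import numpy as np

fs, n = 44100, 16384
t = np.arange(n) / fs
f0 = 110.0
# second and third harmonics dominate; the fundamental is weak
x = (0.1 * np.sin(2 * np.pi * f0 * t)
     + 1.0 * np.sin(2 * np.pi * 2 * f0 * t)
     + 0.8 * np.sin(2 * np.pi * 3 * f0 * t))

mag = np.abs(np.fft.rfft(x * np.hanning(n)))

# naive peak picking finds a harmonic, not the fundamental
naive = np.argmax(mag) * fs / n

# harmonic product spectrum: multiply the spectrum by its downsampled copies,
# so only the common fundamental survives at full strength
hps = mag[: len(mag) // 3].copy()
for h in (2, 3):
    hps *= mag[::h][: len(hps)]
min_bin = int(round(60 * n / fs))     # skip DC and the very lowest bins
f0_est = (min_bin + np.argmax(hps[min_bin:])) * fs / n
print(naive, f0_est)
```

Here the naive peak lands near 220 Hz (the strong second harmonic), while the HPS estimate recovers the 110 Hz fundamental.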

Eryk Sun
  • Hi! Thanks very much for this and I appreciate your effort. This will really come in handy for studying and for further analysis. Thanks! – user488792 Feb 13 '11 at 06:49
  • Hello! I made some updates on how I am faring. Could you take a look at it? Thanks a lot! – user488792 Feb 22 '11 at 11:40

I had a similar question and the answer for me was to use the Goertzel algorithm instead of the FFT. If you know what tones you are looking for (MIDI), Goertzel is capable of detecting the tones to within one cycle of the sine wave. It does this by generating the sine wave of the sound and "placing it on top of the raw data" to see if it exists. The FFT processes large amounts of data to provide an approximate frequency spectrum.
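For reference, a single-frequency Goertzel evaluation looks roughly like this (a textbook sketch, not code from the answer; the `goertzel_power` helper, the 110 Hz tone, and the candidate note frequencies are invented for the example):

```python
import math

def goertzel_power(samples, fs, freq):
    """Power of `samples` at the DFT bin nearest `freq`, via the Goertzel recurrence."""
    n = len(samples)
    k = round(freq * n / fs)            # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# probe a synthetic 110 Hz tone at a few candidate note frequencies
fs, n = 44100, 8192
tone = [math.sin(2 * math.pi * 110.0 * i / fs) for i in range(n)]
for f in (82.4, 110.0, 146.8):
    print(f, goertzel_power(tone, fs, f))
```

The power at 110 Hz dominates the other candidates by orders of magnitude, which is why Goertzel works well when the candidate tones are known in advance.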

Tedd Hansen
  • Hi! Thanks for the suggestion! However, I'm working with WAV files, so I think the FFT would be better in this case. Additionally, I'm trying to get it working and learn it better because in the future I'm going to be using it for chord detection (along with other algorithms, of course). Thanks! – user488792 Feb 11 '11 at 13:03

Musical pitch is different from the frequency peak. Pitch is a psycho-perceptual phenomenon that may depend more on the overtones and such. The frequency of what a human would call the pitch could be missing entirely, or quite small, in the actual signal spectrum.

And a frequency peak in a spectrum can be different from any FFT bin center. The FFT bin center frequencies change in frequency and spacing depending only on the FFT length and sample rate, not on the spectrum of the data.

So you have at least two problems to contend with. There are a ton of academic papers on frequency estimation, as well as on the separate subject of pitch estimation. Start there.
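One standard refinement for the "peak between bin centers" problem (not from this answer; the 107 Hz test tone is invented, and NumPy is assumed) is parabolic interpolation on the log magnitudes of the peak bin and its two neighbours:

```python
import numpy as np

fs, n = 44100, 8192
t = np.arange(n) / fs
f_true = 107.0                            # deliberately between bin centers (~5.38 Hz apart)
x = np.sin(2 * np.pi * f_true * t) * np.hanning(n)

mag = np.abs(np.fft.rfft(x))
k = int(np.argmax(mag))                   # coarse peak bin

# fit a parabola through log magnitudes at k-1, k, k+1
a, b, c = np.log(mag[k - 1 : k + 2])
delta = 0.5 * (a - c) / (a - 2 * b + c)   # fractional-bin offset, in (-0.5, 0.5)
f_est = (k + delta) * fs / n

print(k * fs / n, f_est)                  # raw bin center vs refined estimate
```

The raw bin center is off by a good fraction of a bin, while the interpolated estimate lands very close to the true 107 Hz, all without increasing the FFT size.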

hotpaw2