Don't reinvent the wheel :)
Check out http://librosa.github.io, especially the part about the short-time Fourier transform (STFT) or, in your case, rather something like a constant-Q transform (CQT).
But first things first:
Let's assume we have a stereo signal (2 channels) from an audio file. For now, we throw away the spatial information encoded in the two channels by creating an average channel (sum up both channels and divide by 2). We now have a mono signal (1 channel). Since the signal is digital, each point in time is called a sample.
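A minimal sketch of that mixdown, assuming the audio is already loaded as a NumPy array of shape (num_samples, num_channels); the helper name `to_mono` is made up for illustration:

```python
import numpy as np

def to_mono(audio):
    """Average all channels of a (num_samples, num_channels) array."""
    audio = np.asarray(audio, dtype=float)
    if audio.ndim == 1:
        # Already mono, nothing to do.
        return audio
    return audio.mean(axis=1)

# Toy stereo signal: 4 samples, 2 channels (left, right).
stereo = np.array([[1.0, 3.0],
                   [0.0, 2.0],
                   [-1.0, 1.0],
                   [0.5, 0.5]])
mono = to_mono(stereo)
```

Some audio loaders return (num_channels, num_samples) instead, in which case you would average over axis 0.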
Now the fun part begins: we chop the signal into small chunks (called frames) of consecutive samples (512 or other powers of 2 are standard frame lengths).
By taking the discrete Fourier transform (DFT) of each of these frames, we get a time-frequency representation, the STFT; its magnitude (or squared magnitude) is called the spectrogram.
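Here is a rough sketch of the framing and per-frame DFT in plain NumPy. The function name `stft_magnitude`, the Hann window and the `hop_length` parameter are my own assumptions for the example; with `hop_length` equal to `frame_length` the frames do not overlap.

```python
import numpy as np

def stft_magnitude(signal, frame_length=512, hop_length=512):
    """Magnitude spectrogram of shape (frequency bins, frames)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    window = np.hanning(frame_length)  # assumption: Hann window
    frames = np.stack([
        signal[i * hop_length: i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a 1000 Hz sine sampled at 8000 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

spec = stft_magnitude(tone)
freqs = np.fft.rfftfreq(512, d=1.0 / sample_rate)  # bin center frequencies in Hz
peak_hz = freqs[spec.mean(axis=1).argmax()]        # strongest bin on average
```

In practice you would let librosa do this for you; the point here is only to show that a spectrogram is nothing more than one DFT per frame.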
Any further concepts (overlap etc.) are covered in any DSP book or in resources like this lab course:
https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/02-teaching/2016s_apl/LabCourse_STFT.pdf
Note that the frequency axis of the DFT is linearly spaced. In the Western music system, an octave is split into 12 semitones whose center frequencies are spaced logarithmically. Check out the script above for a binning strategy that obtains a logarithmically spaced frequency axis from the linear STFT.
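To make the idea concrete, here is one possible binning strategy (not necessarily the one from the lab course): map every linear frequency bin to its nearest MIDI pitch via p = 69 + 12 * log2(f / 440) and sum the bins that fall onto the same pitch. The function name and the default pitch range 21 to 108 (the 88 piano keys) are assumptions made for the example.

```python
import numpy as np

def pitch_pooled_spectrum(spec, freqs, midi_low=21, midi_high=108):
    """Pool linear frequency bins into semitone bands (one row per MIDI pitch)."""
    spec = np.asarray(spec, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    valid = freqs > 0  # log2 is undefined for the DC bin
    midi = np.zeros(freqs.shape, dtype=int)
    midi[valid] = np.round(69 + 12 * np.log2(freqs[valid] / 440.0)).astype(int)

    pooled = np.zeros((midi_high - midi_low + 1,) + spec.shape[1:])
    for p in range(midi_low, midi_high + 1):
        mask = valid & (midi == p)
        if mask.any():
            pooled[p - midi_low] = spec[mask].sum(axis=0)
    return pooled

# Toy spectrum: a single frame with all energy in the bin closest to 440 Hz.
freqs = np.fft.rfftfreq(4096, d=1.0 / 22050)
spec = np.zeros((len(freqs), 1))
spec[np.argmin(np.abs(freqs - 440.0)), 0] = 1.0

pooled = pitch_pooled_spectrum(spec, freqs)
strongest_midi = 21 + int(pooled[:, 0].argmax())  # MIDI 69 is A4
```

Keep in mind that at low pitches the semitones are narrower than one DFT bin, so some pitch bands stay empty unless you use a longer frame. That is one of the reasons why a CQT is attractive.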
However, this approach is very basic and there are lots of other and probably better approaches.
Now back to your problem of note recognition.
First: It's a very hard one. :)
As mentioned above, a real sound played by an instrument contains overtones.
Also, if you are interested in transcribing notes played by complete bands, you get interference from the other musicians etc.
Talking about methods you could try out:
Lots of people nowadays use non-negative matrix factorization (NMF or similar LDPCA) or neural networks to approach this task.
For instance, NMF is included in scikit-learn.
To get started, I would recommend NMF. Use only mono-timbral sounds, i.e., a single instrument playing at a time. Initialize the templates with simple decaying overtone structures and see what happens.
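A small sketch of what that could look like, written in plain NumPy with the standard multiplicative updates for the Euclidean cost so you can see what happens (scikit-learn's NMF with a custom initialization would be the off-the-shelf alternative). The function names, the 1/h decay, the number of harmonics and the toy data are all assumptions for illustration.

```python
import numpy as np

def harmonic_templates(freqs, f0s, num_harmonics=8):
    """One template per pitch: harmonic h of f0 gets amplitude 1/h."""
    freqs = np.asarray(freqs, dtype=float)
    W = np.zeros((len(freqs), len(f0s)))
    for k, f0 in enumerate(f0s):
        for h in range(1, num_harmonics + 1):
            if h * f0 > freqs[-1]:
                break
            W[np.argmin(np.abs(freqs - h * f0)), k] += 1.0 / h
    return W

def nmf(V, W_init, num_iter=200, eps=1e-9, seed=0):
    """Factorize V ~ W @ H with multiplicative updates (Euclidean cost)."""
    rng = np.random.default_rng(seed)
    W = W_init.copy()
    H = rng.random((W.shape[1], V.shape[1])) + eps  # random activations
    for _ in range(num_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": 220 Hz in frames 0-1, then 330 Hz in frames 2-3.
freqs = np.fft.rfftfreq(2048, d=1.0 / 22050)
W0 = harmonic_templates(freqs, [220.0, 330.0])
H_true = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
V = W0 @ H_true

W, H = nmf(V, W0)
active_note = H.argmax(axis=0)  # index of the strongest template per frame
```

A nice property of the multiplicative updates is that entries initialized to zero stay zero, so each template keeps its harmonic structure and only the overtone amplitudes get adapted. On real recordings V would be the magnitude spectrogram and the rows of H tell you when each pitch is active.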