
I'm trying to implement a neural network architecture in Haskell, and use it on MNIST.

I'm using the hmatrix package for linear algebra. My training framework is built using the pipes package.

My code compiles and doesn't crash. The problem is that certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I found that extremely small values (on the order of 1e-100) eventually appear in the activations. But even when that doesn't happen, the training still doesn't work: there is no improvement in loss or accuracy.

I checked and rechecked my code, and I'm at a loss as to what the root of the problem could be.

Here's the backpropagation code, which computes the deltas for each layer:

backward lf n (out, tar) das = do
    let δout   = tr (derivate lf (tar, out)) -- dE/dy at the output layer
        -- scanr runs from the output layer back towards the input, so at
        -- each step δ is the already-computed delta of the following layer.
        deltas = scanr (\(l, a') δ ->
                         let w = weights l
                         in (tr a') * (w <> δ))
                       δout
                       (zip (tail $ toList n) das)
    return deltas

lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.

In batch mode, out and tar are matrices (rows are output vectors), and das is a list of matrices, one per layer.
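
For reference, the textbook delta recursion this is meant to implement (in LaTeX, column-vector convention, with \odot the elementwise product) is:

\delta_L = \nabla_a E \odot \sigma'(z_L)
\delta_l = (W_{l+1}^\top \, \delta_{l+1}) \odot \sigma'(z_l)

In the batch form used here (rows are samples) the transposes move around accordingly; the tr calls and the elementwise * with a' above appear to play those two roles.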

Here's the actual gradient computation:

grad lf (n, (i, t)) = do
    -- Forward propagation: compute the layers' outputs and activation derivatives
    let (as, as') = unzip $ runLayers n i
        out       = last as
    ds <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
    let r  = fromIntegral $ rows i -- Size of the minibatch
        gs = zipWith (\δ a -> tr (δ <> a)) ds (i : init as) -- Gradients for the weights
    -- Average both gradients over the minibatch
    return $ GradBatch ((recip r .*) <$> gs, (recip r .*) . squeeze <$> ds)

Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).

squeeze transforms a matrix into a vector by summing over each row. ds is a list of matrices of deltas, where each column corresponds to the deltas for one row of the minibatch. So the gradients for the biases are the average of the deltas over the whole minibatch; likewise for gs, which holds the gradients for the weights.
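
For concreteness, a squeeze with that behavior could be written in hmatrix roughly as follows (this is an assumption about the helper, not code from the repo):

import Numeric.LinearAlgebra

-- Sum each row of a matrix into one entry of a vector by multiplying
-- with a ones vector ((#>) is hmatrix's matrix-vector product).
squeeze :: Matrix Double -> Vector Double
squeeze m = m #> konst 1 (cols m)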

Here's the actual update code:

move lr (n, (i, t)) (GradBatch (gs, ds)) = do
    -- Shift each layer's weights and biases along its gradient
    let update = \(FC w b af) g δ -> FC (w + lr .* g) (b + lr .* δ) af
        n'     = Network.fromList $ zipWith3 update (Network.toList n) gs ds
    return (n', (i, t))

lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.

The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
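
Schematically, the loop has roughly this shape (a simplified sketch rather than the exact code; stop and the single fixed batch stand in for the pipes-based plumbing, and the learning rate is negated as described):

descend lr stop lf batch n0 = go n0
  where
    go n
      | stop n    = return n
      | otherwise = do
          g       <- grad lf (n, batch)
          (n', _) <- move (negate lr) (n, batch) g
          go n'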

Finally, here's the code for a mean square error loss function:

mse :: (Floating a) => LossFunction a a
mse = let f  (y, y') = let gamma = y' - y in gamma ** 2 / 2
          f' (y, y') = y' - y
      in  Evaluator f f'

Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
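
One sanity check I could run is to compare the analytic gradients against central finite differences on a toy-sized network. Here is a minimal sketch of the idea, where lossAt is a hypothetical closure that re-runs the forward pass and evaluates the loss with one particular weight set to the given value:

-- Central finite difference (E(w+h) - E(w-h)) / (2h) for one scalar parameter.
numericGrad :: (Double -> Double) -> Double -> Double
numericGrad lossAt w = (lossAt (w + h) - lossAt (w - h)) / (2 * h)
  where h = 1e-5

If this disagrees with the corresponding entry produced by grad beyond a small relative error, the bug is in backward or grad rather than in the training loop.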

The rest of the code is up on GitHub: NeuralNetwork.

So, if anyone has an insight into the problem, or even just a sanity check that I'm correctly implementing the algorithm, I'd be grateful.

  • Thanks, I'll look into that. But I don't think this is normal behavior. As far as I know, other implementations of what I'm trying to do (a simple feedforward, fully connected neural network), either in Haskell or in other languages, don't seem to do that. – Charles Langlois Jun 22 '17 at 00:14
  • @Charles: Did you actually try your own networks and data sets with said other implementations? In my own experience, BP will easily go haywire when the NN is ill-suited to the problem. If you have doubts about your implementation of BP, you can compare its output with that of a naive gradient calculation (over a toy-sized NN, of course) -- which is way harder to get wrong than BP. – shinobi Jun 30 '17 at 13:37
  • Okay, I might try that. My tests were with a (784-X-10) network (i.e. one hidden layer). I tried a few values of X, also varying the batch size and learning rate. My dataset is MNIST. It's my understanding that this kind of architecture should work for MNIST. – Charles Langlois Jul 01 '17 at 10:40
  • No, sorry, I went on to do other things and left this in the freezer. Shame on me. – Charles Langlois Nov 03 '18 at 16:28
  • Just ran into your question and had the same problem when writing a neural network for MNIST in F#. I started learning with [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/index.html) to get a working example. Then I converted the Python to F# and thought I had done it correctly, but alas ran into the same problems as you. – Guy Coder Nov 07 '18 at 13:51
  • So I did one of the options of last recourse and generated the initial random data with the Python example, then transported it over to the F# example. This took many hours, as writing the export and import routines is not a copy-and-paste job. Both versions were thus working on the same (randomly generated) starting data. Then I single-stepped the Python version and the F# version at the same time and compared the numbers for the variables. This eventually led me to uncover the differences. Needless to say it was a long and tedious process that took many days. – Guy Coder Nov 07 '18 at 13:55
  • Now the F# version works the same as the Python version. I posted some of the details with code examples when giving answers here at SO, but I couldn't find them in a few minutes of searching; otherwise I would add them here. :) – Guy Coder Nov 07 '18 at 14:00
  • Thanks, I'll go and read those links, but I'm confused. You encountered the same issue, and eventually succeeded in making it work? Did you find out what the problem was? – Charles Langlois Nov 07 '18 at 21:27
  • If it helps, someone made a [neural network library in Haskell](https://github.com/HuwCampbell/grenade). – gngdb May 27 '19 at 09:07
  • `NaN` typically appears when you compute `0 / 0`, but I don't see any divisions in the code? Is there some normalization happening somewhere? – kutschkem Oct 22 '19 at 09:06
  • Here is a guide about [Neural Networks in Haskell](http://penkovsky.com/neural-networks/). I don't think it will help solve the numeric issue in the code, but it could be helpful nevertheless. – lehins Nov 27 '19 at 14:12
  • Also, just to save you some time, Xavier initialisation can be implemented through the following: `σ = (6/(input_size + output_size))^0.5`, then multiply that by a random integer from the set `{-1, 0, 1}` to get the initial value for every weight in the network (see the sketch after these comments). – Recessive Dec 02 '19 at 05:29
  • Isn't MNIST typically a classification problem? Why are you using MSE? You should be using softmax cross-entropy (calculated from the logits), no? – mdaoust Jan 11 '20 at 17:37
  • @mdaoust I don't know, I don't remember why I chose that function (maybe I was under the impression that it didn't matter). What are "logits"? – Charles Langlois Jan 14 '20 at 20:09
  • @CharlesLanglois, it may not be your problem (I can't read the code) but "mean square error" is not convex for a classification problem, which could explain getting stuck. "Logits" is just a fancy way to say log-odds: use the `ce = x_j - log(sum_i(exp(x)))` calculation [from here](https://blog.feedly.com/tricks-of-the-trade-logsumexp/) so you don't take the log of the exponential (which often generates NaNs); see the sketch after these comments. – mdaoust Jan 15 '20 at 12:57
  • Congratulations on being the [highest voted](https://stackoverflow.com/unanswered/tagged/?tab=votes) question (as of Jan '20) with no upvoted or accepted answers! – hongsy Jan 18 '20 at 13:46
  • lol, thanks. I've upvoted many comments that are helpful and relevant. If I ever decide to pick this up again, I'll be sure to post an answer. – Charles Langlois Jan 18 '20 at 17:39
  • I did exactly what you are doing [in 2015](https://github.com/jeremycochoy/haskell-ffnn). My advice is: 1) check your backprop equations, 2) check your learning rate, 3) check your initialization, 4) check your loss function. Even with the right equations, you can have very weird behavior with the wrong meta-parameters. – Jeremy Cochoy Feb 10 '20 at 14:02
  • I've dropped the Haskell and hmatrix tags. It seems like most comments pointed at a purely NN-related issue. – przemo_li Feb 27 '20 at 14:12
  • I found a similar solved question: https://codereview.stackexchange.com/questions/122698/conversion-of-a-simple-python-neural-network-to-a-haskell-implementation/122770 – EthanDevelops May 24 '20 at 21:31
  • Just checked your code on GitHub, specifically `Data/Activation.hs`. Your `logistic` and other activation functions are verbatim math definitions; within the bounds of floating-point accuracy, it is very easy to get `activation(x) = 1` or `activation(x) = 0` even for non-infinite `x`. Either use log-spaces, or hack in offsets so that `p = 1` or `p = 0` probabilities never occur. – hyiltiz Jul 29 '20 at 23:04
  • @CharlesLanglois, have you found the cause? Kindly share if so, as I am interested in numeric error issues and am running into the same situation. – mon Mar 14 '21 at 06:07
  • @mon Sorry, I haven't gotten back to this (yet); it was just a small token attempt at the time, unrelated to my job or my other programming interests (I haven't touched Haskell in a while, unfortunately). Many of the comments here and the one answer below seem to provide promising solutions/explanations, such as using a different activation function or loss function, or using an initialization algorithm to provide the initial weights... Good luck! – Charles Langlois Mar 23 '21 at 20:13
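
Following up on the Xavier initialisation comment above, here is a minimal Haskell sketch of the more common uniform Glorot/Xavier scheme, drawing each weight uniformly from ±sqrt(6 / (fanIn + fanOut)); this differs slightly from the `{-1, 0, 1}` recipe in the comment, and the names are illustrative rather than taken from the repo:

import Control.Monad (replicateM)
import Numeric.LinearAlgebra (Matrix, (><))
import System.Random (randomRIO)

-- Uniform Glorot/Xavier initialisation for a fanIn × fanOut weight matrix.
xavierInit :: Int -> Int -> IO (Matrix Double)
xavierInit fanIn fanOut = do
  let limit = sqrt (6 / fromIntegral (fanIn + fanOut))
  ws <- replicateM (fanIn * fanOut) (randomRIO (-limit, limit))
  return ((fanIn >< fanOut) ws)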
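
Likewise, for the softmax cross-entropy suggestion, a numerically stable log-sum-exp can be sketched as follows (plain lists for clarity, not adapted to the matrix types used in the question):

-- Stable log(sum(exp xs)): shift by the maximum before exponentiating,
-- so the exponentials cannot overflow.
logSumExp :: [Double] -> Double
logSumExp xs = m + log (sum [exp (x - m) | x <- xs])
  where m = maximum xs

-- Cross-entropy from logits for the true class j: negative log-softmax.
crossEntropyFromLogits :: [Double] -> Int -> Double
crossEntropyFromLogits logits j = logSumExp logits - (logits !! j)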

1 Answer


Do you know about "vanishing" and "exploding" gradients in backpropagation? I'm not too familiar with Haskell so I can't easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.

If you look at a plot of this function, you'll see that its gradient is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying by it during backpropagation produces a very small number, and dividing by it a very large one. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.
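
For illustration, here is the textbook logistic function and its derivative (a sketch; I'm assuming your Data/Activation.hs uses the standard definition):

logistic :: Double -> Double
logistic x = 1 / (1 + exp (negate x))

logistic' :: Double -> Double
logistic' x = s * (1 - s)
  where s = logistic x

-- logistic' 0  ≈ 0.25     (the steepest point)
-- logistic' 10 ≈ 4.5e-5   (already nearly flat)
-- logistic' 37 ≈ 8.5e-17  (effectively zero in Double precision)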

Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).
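
A minimal hmatrix sketch of ReLU and its (sub)derivative in batch form (the exact types in your layer representation may differ):

import Numeric.LinearAlgebra (Matrix, cmap)

relu :: Matrix Double -> Matrix Double
relu = cmap (max 0)

-- Subgradient: 1 where the input was positive, 0 elsewhere.
relu' :: Matrix Double -> Matrix Double
relu' = cmap (\x -> if x > 0 then 1 else 0)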

– jcft2