
I'm having trouble figuring out how to tune the parameters of my perceptron algorithm so that it performs reasonably well on unseen data.

I've implemented a perceptron algorithm and verified that it works, and I'd like a method for tuning the number of iterations and the learning rate. These are the two parameters I'm interested in.

I know that the learning rate of the perceptron doesn't affect whether or not the algorithm converges and completes. I'm trying to grasp how to choose n (the learning rate). Too high and it'll swing around a lot; too low and it'll take longer to converge.

As for the number of iterations, I'm not entirely sure how to determine an ideal number.

In any case, any help would be appreciated. Thanks.

doug
Albert Diego

3 Answers


Start with a small number of iterations. (It's actually more conventional to count 'epochs' rather than iterations; an epoch is one complete pass through the data set used to train the network.) By 'small', let's say something like 50 epochs. The reason is that you want to see how the total error changes with each additional training cycle (epoch); hopefully it's going down (more on 'total error' below).

Obviously you are interested in the point (the number of epochs) where the next additional epoch does not cause a further decrease in total error. So begin with a small number of epochs so you can approach that point by increasing the epochs.

The learning rate you begin with should be neither too fine nor too coarse (obviously subjective, but hopefully you have a rough sense of what counts as a large versus a small learning rate).

Next, insert a few lines of testing code in your perceptron, really just a few well-placed 'print' statements. For each epoch, calculate and show the delta for each data point in the training data (actual value minus predicted value), then sum the individual delta values over all points (data rows) in the training data. (I usually take the absolute value of each delta, or you can take the square root of the sum of the squared differences; it doesn't matter too much.) Call that summed value "total error". Just to be clear, this is the total error (the sum of the error over all training points) per epoch.
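As a rough illustration of that bookkeeping, here is a minimal sketch of a perceptron training loop that prints and records the total error per epoch. Everything in it is an assumption made for the example rather than the asker's actual code: the function name `train_perceptron`, the step activation with a threshold at zero, the 0/1 targets, and the toy AND data set.

```python
def train_perceptron(X, y, learning_rate=0.1, epochs=50):
    """Train a single perceptron; return (weights, bias, total error per epoch)."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    history = []
    for epoch in range(epochs):
        total_error = 0
        for features, target in zip(X, y):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation >= 0 else 0
            delta = target - predicted          # actual minus predicted
            total_error += abs(delta)           # absolute value, summed over all rows
            # Standard perceptron update: only changes anything when delta != 0.
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * delta
        history.append(total_error)
        print(f"epoch {epoch + 1}: total error = {total_error}")
    return weights, bias, history


# Hypothetical toy data: logical AND, which is linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
weights, bias, history = train_perceptron(X, y, learning_rate=0.1, epochs=50)
```

The returned `history` list is what you would plot against the epoch number.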

Then plot the total error as a function of epoch number (i.e., epoch number on the x axis, total error on the y axis). Initially, of course, you'll see the data points start in the upper left-hand corner and trend down and to the right with a decreasing slope.

Let the algorithm train the network against the training data. Increase the epochs (e.g., by 10 per run) until you see the curve (total error versus epoch number) flatten, i.e., additional epochs don't cause a decrease in total error.

So the slope of that curve is important, and so is its vertical position, i.e., how much total error you have and whether it continues to trend downward with more training cycles (epochs). If, after increasing epochs, you eventually notice an increase in error, start again with a lower learning rate.
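If you would rather not eyeball the curve, one possible way to detect the flattening is sketched below. The helper name `plateau_epoch` and the `patience` parameter are assumptions introduced for this example; the idea is simply to report the last epoch that improved the total error once no improvement has been seen for a while.

```python
def plateau_epoch(history, patience=5):
    """Return the 1-based epoch of the last improvement in total error,
    once `patience` epochs have passed without a new best value.
    Return None if the curve has not flattened within `history`."""
    best_error = float("inf")
    best_epoch = 0
    for epoch, error in enumerate(history, start=1):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None


# Hypothetical total-error values, one per epoch.
flat = plateau_epoch([10, 7, 5, 4, 4, 4, 4, 4, 4], patience=5)
still_falling = plateau_epoch([10, 9, 8], patience=5)
print(flat, still_falling)
```

A result of `None` suggests the run was too short and more epochs are worth trying.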

The learning rate (usually a fraction between about 0.01 and 0.2) will certainly affect how quickly the network is trained, i.e., it can move you to the local minimum more quickly. It can also cause you to jump over it. So code a loop that trains a network several separate times, using a fixed number of epochs (and the same starting point) each time but varying the learning rate from, e.g., 0.05 to 0.2, increasing it by 0.05 each time (four runs in this example).
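A sketch of such a loop might look like the following. The `train` function, the OR toy data, and the seeded random starting point are all assumptions made for illustration. The non-zero starting point is deliberate: with a step activation and all-zero initial weights and bias, the learning rate only rescales the weights, so every run would produce the same error curve.

```python
import random


def train(X, y, learning_rate, epochs, init_weights, init_bias):
    """Train a perceptron from a given starting point; return total error per epoch."""
    weights, bias = list(init_weights), init_bias
    history = []
    for _ in range(epochs):
        total_error = 0
        for features, target in zip(X, y):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            delta = target - (1 if activation >= 0 else 0)
            total_error += abs(delta)
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * delta
        history.append(total_error)
    return history


# Hypothetical toy data: logical OR.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]

# Same starting point for every run, so only the learning rate differs.
rng = random.Random(0)
init_weights = [rng.uniform(-1, 1) for _ in X[0]]
init_bias = rng.uniform(-1, 1)

results = {}
for rate in (0.05, 0.10, 0.15, 0.20):
    results[rate] = train(X, y, rate, 30, init_weights, init_bias)
    print(f"learning rate {rate:.2f}: total error per epoch = {results[rate]}")
```

Comparing the curves in `results` side by side shows which rate reaches a flat, low total error in the fewest epochs.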

One more parameter is important here (though not strictly necessary): 'momentum'. As the name suggests, using a momentum term will help you get an adequately trained network more quickly. In essence, the momentum term adds a fraction of the previous weight update to the current one, so it acts like a boost to the learning rate: as long as successive updates keep pointing in the same direction, the momentum term accelerates the progress. The intuition behind the momentum term is 'as long as you are traveling toward the destination, increase your velocity'. Typical values for the momentum term are 0.1 or 0.2. In the training scheme above, you should probably hold momentum constant while varying the learning rate.
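To make the momentum term concrete, here is a small sketch of one common formulation, in which a fraction of the previous update (the "velocity") is added to the current one. The function name `momentum_update` and the variable names are assumptions for this example, and other formulations exist.

```python
def momentum_update(weights, velocity, step, momentum=0.1):
    """Apply one weight update with a momentum term.

    `step` is the plain update for the current example,
    e.g. learning_rate * delta * x for each weight.
    """
    new_velocity = [momentum * v + s for v, s in zip(velocity, step)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Repeating the same step shows the velocity growing while the direction is unchanged.
weights, velocity = [0.0, 0.0], [0.0, 0.0]
step = [0.1, -0.1]
for i in range(3):
    weights, velocity = momentum_update(weights, velocity, step, momentum=0.2)
    print(f"update {i + 1}: velocity = {velocity}, weights = {weights}")
```

When consecutive steps point in opposite directions, the same term damps the update instead, which is why it tends to smooth out oscillation.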

doug
  • This is the right answer, but don't restrict your Learning Rate and Momentum to that limited of a domain (scaling). Generate a random network, train 0.01-0.99 taking 0.05 steps. If you can't shake out the global minima this way, make sure you normalized your inputs correctly. I am a big fan of Log! – deepelement May 30 '14 at 01:57

About the learning rate not affecting whether or not the perceptron converges: that's not true. If you choose a learning rate that is too high, you will probably get a divergent network. If you change the learning rate during learning and it drops too fast (i.e., faster than 1/t, where t is the number of examples presented so far), you can also get a network that never converges. (That's because the sum of n(t) over t from 1 to infinity is then finite, which means the weight vector can only change by a finite amount.)
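The finite-sum argument can be written out as a short bound. This is only a sketch under assumptions not stated in the answer: $w_t$ is the weight vector after $t$ examples, $n(t)$ is the learning rate at step $t$, and $C$ is an assumed upper bound on the size of the unscaled update for any single example.

```latex
\begin{align*}
\lVert w_T - w_0 \rVert
  &\le \sum_{t=1}^{T} \lVert w_t - w_{t-1} \rVert
   \le C \sum_{t=1}^{T} n(t), \\
n(t) = \frac{1}{t^{2}}: \quad
  &\sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6}
   \;\Longrightarrow\; \lVert w_T - w_0 \rVert \le \frac{C\pi^{2}}{6}
   \text{ for all } T, \\
n(t) = \frac{1}{t}: \quad
  &\sum_{t=1}^{\infty} \frac{1}{t} = \infty
   \;\Longrightarrow\; \text{no such bound on the total change.}
\end{align*}
```

So with a schedule like $1/t^{2}$, a solution farther than $C\pi^{2}/6$ from the starting weights can never be reached, whereas $1/t$ sits right at the boundary where the total change is still unbounded.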

Theoretically, it can be shown for simple cases that changing n (the learning rate) according to 1/t (where t is the number of presented examples) should work well. In practice, though, I found that the best way to do this is to find a good high value of n (the highest value that doesn't make your learning diverge) and a good low value of n (this one is trickier to figure out; it really depends on the data and the problem), and then let n change linearly over time from the high value to the low value.
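Both schedules are easy to sketch. The function names and the default `high` and `low` values below are placeholders chosen for this example; the real values have to be found by experiment, as described above.

```python
def inverse_time_learning_rate(t, initial=0.2):
    """1/t schedule: t is the 1-based count of examples presented so far."""
    return initial / t


def linear_learning_rate(step, total_steps, high=0.2, low=0.01):
    """Linear schedule: `high` at step 0, `low` at the last step (total_steps - 1)."""
    if total_steps <= 1:
        return high
    fraction = min(step, total_steps - 1) / (total_steps - 1)
    return high + (low - high) * fraction


# Compare the two schedules at a few points of a hypothetical 100-step run.
for step in (0, 25, 50, 99):
    print(step,
          inverse_time_learning_rate(step + 1),
          linear_learning_rate(step, 100))
```

The 1/t schedule drops very quickly at the start, while the linear one keeps the rate high for longer, which may be part of why it worked better in practice.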

Ofri Raviv

The learning rate depends on the typical values of the data. There is no rule of thumb in general. Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Normalizing the data to zero mean and unit variance, to the range 0 to 1, or to any other standard form can help in selecting a value for the learning rate. As doug mentioned, a learning rate between 0.05 and 0.2 generally works well.

This also helps the algorithm converge faster.
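For reference, the two normalizations mentioned above can be sketched in a few lines of plain Python. The helper names `standardize` and `min_max_scale` are assumptions for this example, and each one operates on a single feature column.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:                      # constant feature: nothing to scale
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


def min_max_scale(column):
    """Rescale one feature column to the range 0 to 1."""
    lowest, highest = min(column), max(column)
    if highest == lowest:               # constant feature: nothing to scale
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


# Hypothetical feature column.
heights_cm = [150, 160, 170, 180, 190]
print(standardize(heights_cm))
print(min_max_scale(heights_cm))
```

Whichever form you pick, compute the scaling parameters (mean, standard deviation, minimum, maximum) on the training data only and reuse them for unseen data.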

Source: Juszczak, P.; D. M. J. Tax; and R. P. W. Duin (2002). "Feature scaling in support vector data descriptions". Proc. 8th Annu. Conf. Adv. School Comput. Imaging: 95–10.

Rishi Dua