I am trying to implement the logistic regression learning algorithm in Python. The hypothesis function I implemented is:

   import math

   def hypothesis(params, features):
       # z is the dot product of the parameter vector and the feature vector
       z = sum(p * f for p, f in zip(params, features))
       return 1 / (1 + math.e ** -z)

The dataset I use for testing is from the UCI Machine Learning Repository. It contains rows like these (the first column is the target; the other columns are selected features):

1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
3,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
2,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480

As you can see, some features are much larger than the others, so the z in my hypothesis function is too large as a power of e. In fact, for larger values, the zs are all 0.0. And if I change the numerator to 3, an OverflowError is raised.

Maybe I should have normalized the data before feeding it to my program. Any idea how I can do this?

Atik
satoru
  • @Blender I've edited the question and explain on this. The first column is the target, the other columns are selected features. – satoru Dec 10 '12 at 06:31
  • Is `params` supposed to look like `[n, n, n, ...]` where `n` is always the same number? – Blender Dec 10 '12 at 06:31
  • @Blender `params` is the vector `[θ1, θ2, θ3 ...]`, and `z` is the dot product of this vector and an instance of the selected features. – satoru Dec 10 '12 at 06:37
  • @satoru, the following link might be useful for future readers of your post ;-) http://en.wikipedia.org/wiki/Feature_scaling – Tin Mar 25 '14 at 16:22

1 Answer

Not really a StackOverflow question =/

This question seems to me like it should be asked somewhere else - it seems like you're looking for an algorithm rather than an implementation of an algorithm.

That aside - you would normalize this data set by column. Calculate the mean and standard deviation (SD) of each column, then rescale the column so that it has a mean of 10 and an SD of 2. Concretely, for each entry in a column, first work out how many SDs it lies from the column's mean, then start at 10 and add or subtract that many 2's (adding if the entry is above the column's mean, subtracting if it is below).
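
Written as a formula, this is just a restatement of the paragraph above. The symbols are assumed names: $x$ is an entry, $\mu$ and $\sigma$ are its column's mean and SD, and $x'$ is the rescaled value.

```latex
x' = 10 + 2 \cdot \frac{x - \mu}{\sigma}
```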

For example, say we have a column of numbers whose SD we've calculated to be 3 and whose mean is 50. We now come across a member of this column - the number 56. 56 is two SDs above 50 (the mean), so it would be normalized to 14: 10 (new mean) + 2 (new SD) * 2 (number of column SDs above the column's mean).

The numbers 10 and 2 can be replaced with other numbers, but I think 10 and 2 seem about right.
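
Here is a minimal sketch of that column-wise rescaling in plain Python. The function names `rescale_value` and `rescale_columns` are made up for illustration. Using the population SD (`statistics.pstdev`) rather than the sample SD is an assumption, and so is mapping a constant column to the new mean.

```python
import statistics


def rescale_value(x, mean, sd, new_mean=10.0, new_sd=2.0):
    """Map x to new_mean plus new_sd times its distance from mean, in SDs."""
    if sd == 0:
        # Assumption: a constant column carries no spread, so map it to new_mean.
        return new_mean
    return new_mean + new_sd * (x - mean) / sd


def rescale_columns(rows, new_mean=10.0, new_sd=2.0):
    """Rescale every column of rows (a list of equal-length lists)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Assumption: population SD; swap in statistics.stdev for the sample SD.
    sds = [statistics.pstdev(col) for col in columns]
    return [
        [rescale_value(x, m, s, new_mean, new_sd)
         for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Three of the feature columns from the sample rows in the question
# (the target column is left out, since only features get rescaled).
features = [
    [14.23, 127, 1065],
    [13.2, 100, 1050],
    [13.16, 101, 1185],
    [14.37, 113, 1480],
]

for row in rescale_columns(features):
    print([round(value, 2) for value in row])
```

After rescaling, every column has the same mean and spread, so no single feature dominates the dot product `z` just because of its units.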

For help on calculating a running SD (standard deviation) and mean, see this other StackOverflow question - How to efficiently calculate a running standard deviation?
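
One common way to keep a running mean and SD is Welford's online algorithm, sketched below. The class name `RunningStats` and its methods are hypothetical, and returning the population SD is an assumption. This is not necessarily the exact approach recommended in the linked question.

```python
import math


class RunningStats:
    """Running mean and SD in one pass, using Welford's online algorithm."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old mean (in delta) and the updated mean (in x - self.mean).
        self._m2 += delta * (x - self.mean)

    def sd(self):
        # Assumption: population SD; divide by (n - 1) for the sample SD.
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


# Feed one feature column from the sample rows in the question, one value at a time.
stats = RunningStats()
for value in (127, 100, 101, 113):
    stats.push(value)

print(stats.mean, stats.sd())
```

This avoids storing the whole column, which helps when the data is read line by line from a file.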

Community
Shariq
  • But it did raise an OverflowError ;p – satoru Dec 10 '12 at 07:08
  • Are questions for algorithms off-topic? Could you link to the location in the FAQ where it says that, because I couldn't find it? – Niki Dec 10 '12 at 08:09
  • Be aware that Z normalisation may not work in cases where the feature has a very skewed distribution. For example, z normalising something with a log-normal distribution could still produce very large values. There are all manner of feasible normalisations (z score, log transform, probit, logit, etc etc), and ideally you want all your features on roughly the same scale, which may mean using different normalisation schemes for different features. – Ben Allison Dec 10 '12 at 14:32
  • (at)BenAllison - the dataset being used does not seem to have a very skewed distribution. (at)nikie - Granted; it's fine to ask what algorithm to use for a certain problem; but that was me being nice - the question was asking for a way to normalize data. Normalization of data is a statistics question, and not a computer science question, as far as I am aware. @Satoru.Logic - :PP I hope my answer helped :) – Shariq Dec 12 '12 at 02:54