
I'm using scikit-learn and numpy and I want to set the global seed so that my work is reproducible.

Should I use numpy.random.seed or random.seed?

From the link in the comments, I understand that they are different, and that the numpy version is not thread-safe. I want to know specifically which one to use to create IPython notebooks for data analysis. Some of the algorithms from scikit-learn involve generating random numbers, and I want to be sure that the notebook shows the same results on every run.

shadowtalker
  • for using `np.random.seed()` you won't need to import anything, but for using `random.seed()` you will need to import the `random` module – ZdaR Jun 25 '15 at 17:44
  • 5
  • Please DO NOT set the global seed; this is unsafe. You can create your own `Random` object and set its seed instead. Read the last comment by Muhammad Alkarouri in this question for a safer workaround: http://stackoverflow.com/a/3717456/1524913 – jeromej Jun 25 '15 at 19:13
  • @Leb thanks for the link, but it's not clear which one I should use in my case. I edited the question. – shadowtalker Jun 25 '15 at 20:12
  • @JeromeJ it's not clear how to use `color_rnd` as per that example. If I run `color_rnd.seed(1234)`, will functions like `sklearn.cross_validation.KFold` "know" to use it instead of whatever RNG it normally uses? – shadowtalker Jun 25 '15 at 20:14
  • They may not if they rely on `random` directly, sadly. My point was, at least at the time, that whenever you write code yourself you should avoid using `random` directly. I'm not sure what to do in your scenario; that's a bit of a bummer. Maybe a decorator, but I think you'd have to tinker with the function context; I'd have to take a deeper look to be sure. – jeromej Jun 26 '15 at 00:42

1 Answer


Should I use np.random.seed or random.seed?

That depends on whether your code uses numpy's random number generator or the one in Python's built-in `random` module.

The random number generators in numpy.random and random have totally separate internal states, so numpy.random.seed() will not affect the random sequences produced by random.random(), and likewise random.seed() will not affect numpy.random.randn() etc. If you are using both random and numpy.random in your code then you will need to separately set the seeds for both.
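
A quick way to see the two separate states (a minimal sketch; the actual values drawn will differ between NumPy and Python versions):

```python
import random
import numpy as np

np.random.seed(42)
random.seed(42)

a = np.random.rand()   # numpy's global generator
b = random.random()    # the stdlib generator, unaffected by np.random.seed

# Re-seeding numpy reproduces numpy's stream, but says nothing about random's.
np.random.seed(42)
assert np.random.rand() == a
```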

Update

Your question seems to be specifically about scikit-learn's random number generators. As far as I can tell, scikit-learn uses numpy.random throughout, so you should use np.random.seed() rather than random.seed().
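
For the notebook case in the question, that means a single call in the first cell is enough (sketch; the seed value is arbitrary):

```python
import numpy as np

# Run this in the first cell so every "Run all" starts from the same RNG state.
np.random.seed(0)
```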

One important caveat is that np.random is not thread-safe: if you set a global seed, then launch several subprocesses and generate random numbers within them using np.random, each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess. The usual way around this problem is to pass a different seed (or numpy.random.RandomState instance) to each subprocess, such that each one has a separate local RNG state.
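
Here is a rough sketch of that workaround using the standard `multiprocessing` module (the `worker` function and the way the child seeds are derived are just illustrative assumptions):

```python
import numpy as np
from multiprocessing import Pool

def worker(seed):
    # Build a local RandomState from the seed passed in, rather than
    # relying on the global state inherited from the parent process.
    rng = np.random.RandomState(seed)
    return rng.rand(3)

if __name__ == '__main__':
    global_seed = 0
    # One distinct seed per worker, derived from the global seed.
    child_seeds = [global_seed + i for i in range(4)]
    with Pool(4) as pool:
        results = pool.map(worker, child_seeds)
```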

Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions have an option to pass either a seed or an np.random.RandomState instance (e.g. the random_state= parameter to sklearn.decomposition.MiniBatchSparsePCA). I tend to use a single global seed for a script, then generate new random seeds based on the global seed for any parallel functions.
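
In practice that pattern might look something like this (a sketch; `MiniBatchSparsePCA` is just the example mentioned above, and drawing child seeds from the seeded global RNG is one simple scheme):

```python
import numpy as np
from sklearn.decomposition import MiniBatchSparsePCA

GLOBAL_SEED = 0
np.random.seed(GLOBAL_SEED)

# Draw a fresh seed from the (already seeded) global RNG for each estimator
# that accepts random_state=, so parallel/joblib runs stay reproducible.
pca_seed = np.random.randint(0, 2**31 - 1)
pca = MiniBatchSparsePCA(n_components=5, random_state=pca_seed)
```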

ali_m
  • I'm using `numpy.random` for any random number generation I do in the console. I don't know what `sklearn` uses internally. Hence my question. – shadowtalker Jun 25 '15 at 20:09
  • Thanks. One reason I'm asking is that the only way to pass a `numpy.random.RandomState` instance to `sklearn.grid_search.GridSearchCV` is by explicitly passing an object to its `cv` argument, like `sklearn.cross_validation.StratifiedKFold`. However, that constructor requires you to know the number of rows in your data set when the model is instantiated. That means you have to re-instantiate the model whenever you want to fit it on new data, which is not how you're supposed to use these objects. I'll ask a targeted follow-up – shadowtalker Jun 25 '15 at 20:48
  • I'm not sure I really understand your motivation. Is there some particular reason why you *want* the cross-validation folds to be different for different search parameters in `GridSearchCV`? As far as I can see it should not matter. – ali_m Jun 25 '15 at 21:05
  • That's not what I mean. I want the folds to be the same every time I open the notebook and press "Run all", because I need the results to be reproducible. – shadowtalker Jun 25 '15 at 21:08
  • OK, in which case couldn't you just create a new `cv` instance (either using the same global seed, or some new random seed derived from it) every time you want to fit on some new data? – ali_m Jun 25 '15 at 21:18
  • I was hoping to avoid that. But I just demo'ed it, and having to create a new instance to fit new data might actually be a feature and not a bug. "Explicit is better than implicit" etc. – shadowtalker Jun 25 '15 at 21:20
  • As a general principle I think it's best to keep any kind of meta-optimization code separate from model classes. Not only is it more explicit, but it also tends to lead to more reusable code. – ali_m Jun 25 '15 at 21:50