
I'm trying to reproduce the following R results in Python. In this particular case the R model's predictive skill is lower than the Python model's, but in my experience it is usually the other way around (hence my wanting to reproduce the R results in Python), so please ignore that detail here.

The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each consisting of 4 flower characteristics: sepal length, sepal width, petal length, petal width. I've split the data into a training set (60% of the data) and a test set (40% of the data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the analogous parameter in scikit-learn is "C", the inverse of the regularization strength).

I'm using glmnet in R with alpha set to 1 (for the LASSO penalty), and for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation). The scoring metrics used in the cross-validation are the same in both languages. Somehow, however, the model results differ: the intercepts and coefficients found for each feature vary quite a bit.

R Code

library(glmnet)
library(datasets)
data(iris)

y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2

n_sample = NROW(X)

w = .6
X_train = X[1:(w * n_sample),]            # 60 x 4
y_train = y[1:(w * n_sample)]             # length 60
X_test = X[((w * n_sample)+1):n_sample,]  # 40 x 4
y_test = y[((w * n_sample)+1):n_sample]   # length 40

# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
                        nfolds = 10, alpha=1, family="binomial", type.measure="class")

best_s  <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))

# best lambda
print(best_s)
# 0.04136537

# fraction correct
print(sum(y_test==pred)/NROW(pred))   
# 0.75

# model coefficients
print(coef(model_lambda, s=best_s))
# (Intercept)  -14.680479
# Sepal.Length   0
# Sepal.Width    0
# Petal.Length   1.181747
# Petal.Width    4.592025

Python Code

from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]    # drop the 'setosa' samples; keep the other two species (four features each)
y = y[y != 0]-1  # relabel the two remaining species: 'versicolor' (0), 'virginica' (1)

n_sample = len(X)

w = .6
X_train = X[:int(w * n_sample)]  # (60, 4)
y_train = y[:int(w * n_sample)]  # (60,)
X_test = X[int(w * n_sample):]  # (40, 4)
y_test = y[int(w * n_sample):]  # (40,)

X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)

clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)

print(clf.score(X_train_fit.transform(X_test), y_test))  # score is 0.775
print(clf.intercept_)  # -1.83569557
print(clf.coef_)  # [0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print(clf.C_)  # best C (inverse of the regularization strength): 0.35938137
Oliver Angelil

3 Answers


There are a few things that are different in the examples above:

  1. Scale of the coefficients

    • glmnet (https://cran.r-project.org/web/packages/glmnet/glmnet.pdf) standardizes the data internally, and "The coefficients are always returned on the original scale". So although you did not scale your data before calling glmnet, its coefficients come back on the original scale.
    • The Python code standardizes the data and then fits to the standardized data, so its coefficients are on the standardized scale, not the original one. This makes the coefficients in the two examples non-comparable.
  2. LogisticRegressionCV by default uses stratified folds; glmnet's cv.glmnet uses plain k-fold.

  3. They are fitting different equations. scikit-learn's logistic regression (http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) minimizes ||w||_1 + C * (sum of the log-loss over samples), i.e. the regularization weight C multiplies the loss term, while glmnet minimizes the average log-loss + lambda * ||w||_1, putting lambda on the penalty term. So C corresponds roughly to 1/(N*lambda); see the sketch after this list.

  4. Choosing the regularization strengths to try: glmnet defaults to a path of 100 lambdas; scikit-learn's LogisticRegressionCV defaults to 10 values of C. Because of the equation scikit-learn solves, its default range runs from 1e-4 to 1e4 (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV).

  5. The convergence tolerance is different. In some problems I have had, tightening the tolerance significantly changed the coefficients.

    • glmnet defaults thresh to 1e-7
    • LogisticRegressionCV defaults tol to 1e-4
    • Even after making them the same, they may not measure the same thing. I do not know what liblinear measures. For glmnet: "Each inner coordinate-descent loop continues until the maximum change in the objective after any coefficient update is less than thresh times the null deviance."
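
To line the two searches up, here is a rough sketch (untested, and reusing the variable names from the question's Python code): hand LogisticRegressionCV a grid of C values derived from a glmnet-style lambda grid via C = 1/(N*lambda), and tighten the tolerance.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# glmnet minimizes (1/N)*log-loss + lambda*||w||_1; liblinear minimizes
# ||w||_1 + C*log-loss, so C corresponds roughly to 1/(N*lambda).
n = X_train_transformed.shape[0]   # 60 training samples in the question
lambdas = np.logspace(-4, 1, 100)  # stand-in for glmnet's 100-value lambda path
Cs = 1.0 / (n * lambdas)           # the equivalent C grid

clf = LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear', cv=10,
                           scoring='accuracy', tol=1e-7, random_state=0)
clf.fit(X_train_transformed, y_train)
print(1.0 / (n * clf.C_))          # the lambda analogue of the chosen C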

You may want to try printing the regularization paths to see if they are very similar and just stop at a different strength; then you can research why.
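
From the Python side, one way to do that is via the coefs_paths_ attribute documented for LogisticRegressionCV (a sketch; for a binary problem the dict is keyed by the positive class):

import numpy as np

# coefs_paths_[1] has shape (n_folds, n_Cs, n_features + 1); the trailing
# column is the intercept when fit_intercept=True (the default).
paths = clf.coefs_paths_[1]
mean_path = paths.mean(axis=0)      # average the path over the 10 CV folds
for C, coefs in zip(clf.Cs_, mean_path):
    print(C, coefs)

On the R side, plot(model_lambda$glmnet.fit, xvar="lambda") draws the corresponding glmnet path.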

Even after changing everything you can (which is not all of the above), you may not get the same coefficients or results. Though you are solving the same problem in different software, the software may solve it differently: we have seen different scales, different equations, different defaults, different solvers, and so on.

Craig

The problem you've got here is the ordering of the dataset (note: I haven't checked the R code, but I'm certain this is the problem). If I run your code and then run this

print(np.bincount(y_train))  # [50 10]
print(np.bincount(y_test))   # [ 0 40]

you can see the training set is not representative of the test set. However, if I make a couple of changes to your Python code, I get a test accuracy of 0.9.

from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]    # drop the 'setosa' samples; keep the other two species (four features each)
y = y[y != 0]-1  # relabel the two remaining species: 'versicolor' (0), 'virginica' (1)

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, 
                                                                    test_size=0.4,
                                                                    random_state=42,
                                                                    stratify=y)


X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)

clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)

print(clf.score(X_train_fit.transform(X_test), y_test))  # score is 0.9
print(clf.intercept_)  # 0.
print(clf.coef_)  # [0., 0., 0., 0.30066888] (sepal length, sepal width, petal length, petal width)
print(clf.C_)  # [0.04641589]
piman314
  • thanks a lot. The train_test_split function seems handy, however (see my response to Grr) I'm not sure if this is the reason for the differences between the two languages. I will try to implement a balanced split between the two (in both R and python) and then update my initial post. – Oliver Angelil Apr 24 '17 at 15:11
  • I suggest creating two files, one for your training set and one for the test set and reading these in to Python and R. That is the safest way to ensure your data is split correctly. – piman314 Apr 24 '17 at 15:51
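
A minimal sketch of that file-based approach (the file and column names here are made up for illustration; read the same files back with read.csv in R):

import numpy as np

# Persist the exact split so R and Python fit on identical rows.
header = 'sepal_length,sepal_width,petal_length,petal_width,species'
np.savetxt('train.csv', np.column_stack([X_train, y_train]),
           delimiter=',', header=header, comments='')
np.savetxt('test.csv', np.column_stack([X_test, y_test]),
           delimiter=',', header=header, comments='')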

I have to take issue with a couple of things here.

Firstly, "for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation)". That is just patently false, unless you meant to qualify it in some more definitive way. Just take a look at the descriptions of the sklearn.linear_model classes and you will see a handful that specifically mention L1; a couple of examples follow. I am sure that others allow you to implement it as well, but I don't really feel like counting them.
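
For instance (a quick illustration, not a drop-in replacement for LogisticRegressionCV):

from sklearn.linear_model import Lasso, SGDClassifier

# Logistic loss with an L1 penalty (the loss is named 'log' rather than
# 'log_loss' in older scikit-learn releases):
sgd = SGDClassifier(loss='log_loss', penalty='l1')
# L1-penalised linear regression:
lasso = Lasso(alpha=0.1)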

Secondly, your method for splitting the data is less than ideal. Take a look at your input and output after the split and you will find that all of your test samples have a target value of 1, while targets of 1 account for only 1/6 of your training sample. This imbalance, which is not representative of the distribution of the targets, will cause your model to be poorly fit. For example, just using sklearn.model_selection.train_test_split out of the box and then refitting the LogisticRegressionCV classifier exactly as you had it yields an accuracy of 0.92.

Now, all that being said, there is a glmnet package for Python, and you can replicate your results using it; a sketch of its usage follows the quote below. There is a blog post by the authors of this project that discusses some of the limitations of trying to recreate glmnet results with sklearn. Specifically:

"Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The first one only works for linear regression and the latter does not handle the elastic net penalty." - Bill Lattner GLMNET FOR PYTHON

Grr
  • thanks for your time. I should have said "the only solver that can be used with L1 penalisation when using the LogisticRegressionCV function". The documentation lists four solvers that can be used ('newton-cg', 'lbfgs', 'liblinear', 'sag'); only liblinear can be used with L1. Yes the splitting is not ideal. I would not do this in operation; however since I am splitting the same way between R and Python I am not sure this is the reason for the different results (I was not sure how to make a balanced split in R). The glmnet package for python might be the solution. Thanks. – Oliver Angelil Apr 24 '17 at 14:50