
I don't know if the problem is in the way I am splitting the dataset or if I am doing something else wrong, but every time I run the program I get a different accuracy. Can anyone please help me find the problem? Thank you. Here is my code:

import pandas as pd
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# load the data
from sklearn.tree import DecisionTreeClassifier

# url = "data/lung-cancer.data"
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/lung-        cancer/lung-cancer.data"
data_set = pd.read_csv(url)

def clean_data(data_set):
    # replace the ? with NaN
    data_set = data_set.convert_objects(convert_numeric=True)
    # replace the NaN with the mean of the column
    data_set = data_set.fillna(data_set.mean(axis=0), axis=0)

    return data_set

data_set = clean_data(data_set)

def split_data(data_set):
    # split the data in two parts train(80%), test(20%)
    train, test = train_test_split(data_set.values, test_size=0.2)

    # first column of the data are labels
    labels_test = test[:, :1]
    labels_train = train[:, :1]

    # the rest of the columns are features
    features_test = test[:, 1:]
    features_train = train[:, 1:]

    return features_train, labels_train, features_test, labels_test

features_train, labels_train, features_test, labels_test = split_data(data_set)
"""
print(labels_train)
print(features_train)
print(features_test)
print(labels_test)
"""

# Modeling step Test different algorithms
random_state = 2
classifiers = [
    GaussianNB(),
    KNeighborsClassifier(n_neighbors=3),
    KNeighborsClassifier(n_neighbors=5),
    SVC(kernel="poly", C=0.4, probability=True),
    DecisionTreeClassifier(random_state=3),
    RandomForestClassifier(random_state=3),
    AdaBoostClassifier(random_state=3),
    ExtraTreesClassifier(random_state=3),
    GradientBoostingClassifier(random_state=3),
    MLPClassifier(random_state=random_state)
]

accuracy_res = []
algorithm_res = []
for clf in classifiers:
    clf.fit(features_train, labels_train)
    name = clf.__class__.__name__

    train_predictions = clf.predict(features_test)

    accuracy = accuracy_score(labels_test, train_predictions)
    print(name, "{:.4%}".format(accuracy))
    accuracy_res.append(accuracy)
    algorithm_res.append(name)
    print()

y_pos = np.arange(len(algorithm_res))
plt.barh(y_pos, accuracy_res, align='center', alpha=0.5)
plt.yticks(y_pos, algorithm_res)
plt.xlabel('Accuracy')
plt.title('Algorithms')
plt.show()

Here are the results I'm getting. First result:

GaussianNB 28.5714%
KNeighborsClassifier 57.1429%
KNeighborsClassifier 71.4286%
SVC 57.1429%
DecisionTreeClassifier 57.1429%
RandomForestClassifier 42.8571%
AdaBoostClassifier 42.8571%
ExtraTreesClassifier 42.8571%
GradientBoostingClassifier 57.1429%
MLPClassifier 57.1429%

Second result

GaussianNB 28.5714%
KNeighborsClassifier 42.8571%
KNeighborsClassifier 28.5714%
SVC 57.1429%
DecisionTreeClassifier 28.5714%
RandomForestClassifier 57.1429%
AdaBoostClassifier 57.1429%
ExtraTreesClassifier 42.8571%
GradientBoostingClassifier 28.5714%
MLPClassifier 57.1429%

Third result

GaussianNB 71.4286%
KNeighborsClassifier 71.4286%
KNeighborsClassifier 71.4286%
SVC 28.5714%
DecisionTreeClassifier 28.5714%
RandomForestClassifier 57.1429%
AdaBoostClassifier 71.4286%
ExtraTreesClassifier 57.1429%
GradientBoostingClassifier 28.5714%
MLPClassifier 28.5714%
Wagner'P

3 Answers

1

Since you are using train_test_split, it splits your data randomly, which is what causes the different accuracy each time you run the above code.

I would suggest running the evaluation multiple times and taking the mean accuracy over a number of runs. You can collect the outputs and let Python compute the average for you. Then pick the model with the highest mean accuracy.
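Here is a minimal sketch of that idea, reusing data_set, split_data and the classifiers list already defined in the code above (the number of runs is arbitrary):

# Sketch: average each classifier's accuracy over several random splits.
# Reuses data_set, split_data and classifiers from the code above.
import numpy as np
from sklearn.metrics import accuracy_score

n_runs = 20  # arbitrary number of repetitions
results = []
for clf in classifiers:
    scores = []
    for _ in range(n_runs):
        # every call to split_data draws a new random train/test split
        features_train, labels_train, features_test, labels_test = split_data(data_set)
        clf.fit(features_train, labels_train.ravel())
        predictions = clf.predict(features_test)
        scores.append(accuracy_score(labels_test, predictions))
    results.append((clf.__class__.__name__, np.mean(scores)))

# print the models sorted by mean accuracy, best first
for name, mean_acc in sorted(results, key=lambda r: r[1], reverse=True):
    print(name, "{:.4%}".format(mean_acc))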

When I ran your code I got the best accuracy with KNeighborsClassifier using n_neighbors=5. I also made a few modifications so that there are no warnings. Please find the updated code below; I have updated the comments wherever there is a modification, for reference.

import pandas as pd
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# load the data
from sklearn.tree import DecisionTreeClassifier

# url = "data/lung-cancer.data"
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/lung-cancer/lung-cancer.data"
data_set = pd.read_csv(url)

def clean_data(data_set):
    # replace the ? with NaN
    # data_set = data_set.convert_objects(convert_numeric=True)
    # convert_objects is deprecated, so use pd.to_numeric instead
    data_set = data_set.apply(pd.to_numeric, errors='coerce')
    # replace the NaN with the mean of the column
    data_set = data_set.fillna(data_set.mean(axis=0), axis=0)

    return data_set

data_set = clean_data(data_set)

def split_data(data_set):
    # split the data in two parts train(80%), test(20%)
    train, test = train_test_split(data_set.values, test_size=0.2)

    # first column of the data are labels
    labels_test = test[:, :1]
    labels_train = train[:, :1]

    # the rest of the columns are features
    features_test = test[:, 1:]
    features_train = train[:, 1:]

    return features_train, labels_train, features_test, labels_test

features_train, labels_train, features_test, labels_test = split_data(data_set)
"""
print(labels_train)
print(features_train)
print(features_test)
print(labels_test)
"""

# Modeling step Test different algorithms
random_state = 2
classifiers = [
    GaussianNB(),
    KNeighborsClassifier(n_neighbors=3),
    KNeighborsClassifier(n_neighbors=5),
    SVC(kernel="poly", C=0.4, probability=True),
    DecisionTreeClassifier(random_state=3),
    RandomForestClassifier(random_state=3),
    AdaBoostClassifier(random_state=3),
    ExtraTreesClassifier(random_state=3),
    GradientBoostingClassifier(random_state=3),
    # MLPClassifier(random_state=random_state)
    # Set hidden_layer_sizes and max_iter parameters 
    # so that multilayer perceptron will converge
    MLPClassifier(solver='lbfgs', hidden_layer_sizes=[100], max_iter=2000, activation='logistic', random_state=random_state)
]

accuracy_res = []
algorithm_res = []
for clf in classifiers:
    # clf.fit(features_train, labels_train)
    # Added ravel to convert column vector to 1d array
    clf.fit(features_train, labels_train.ravel())
    name = clf.__class__.__name__

    train_predictions = clf.predict(features_test)

    accuracy = accuracy_score(labels_test, train_predictions)
    print(name, "{:.4%}".format(accuracy))
    accuracy_res.append(accuracy)
    algorithm_res.append(name)
    print()

y_pos = np.arange(len(algorithm_res))
plt.barh(y_pos, accuracy_res, align='center', alpha=0.5)
plt.yticks(y_pos, algorithm_res)
plt.xlabel('Accuracy')
plt.title('Algorithms')
plt.show()
codeslord
  • As suggested, mean accuracy is a good metric, or, more formally, k-fold cross-validation (see the short sketch after these comments). More on it [here](http://scikit-learn.org/stable/modules/cross_validation.html). – skrubber Nov 23 '17 at 04:34
  • Thanks for the link to the documentation. Also, I think he can use any of the ensemble methods, such as bagging/boosting, which use bootstrapping (random sampling with replacement) to get more accurate results. They provide voting for classification and averaging for regression. – codeslord Nov 23 '17 at 05:08
  • Somebody downvoted the post. Please leave a comment with the reason so that I can correct it. Thanks! – codeslord Nov 23 '17 at 05:58
  • really appreciate the changes – Wagner'P Nov 23 '17 at 11:54
  • @Wagner'P Thanks. Try ensemble methods also if time permits. – codeslord Nov 23 '17 at 13:34
  • I would like to ask which methods should I choose and how to select them? – Wagner'P Nov 23 '17 at 14:55
  • @Wagner'P There is no golden rule. You may use a bagging technique similar to the one followed by Random Forest: it trains many 'small' classifiers, each of which sees only a portion of the whole data, and a simple voting scheme (as in Random Forest) then gives a very interesting and robust classification. The following link has a few simple examples of how to use ensemble ML algorithms: https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/ – codeslord Nov 23 '17 at 15:46
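A rough sketch of the cross-validation idea from the comments, again reusing data_set and the classifiers list from the code above (cv=5 is an arbitrary choice):

# Sketch: score each model with 5-fold cross-validation instead of a single split.
from sklearn.model_selection import cross_val_score

values = data_set.values
labels = values[:, 0]      # first column holds the class labels
features = values[:, 1:]   # remaining columns are the features

for clf in classifiers:
    scores = cross_val_score(clf, features, labels, cv=5)
    print(clf.__class__.__name__,
          "mean {:.4%} (+/- {:.4%})".format(scores.mean(), scores.std()))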
0
from sklearn.model_selection import train_test_split

You used sklearn's train_test_split, which splits your data into a train set and a test set randomly. So each time you retrain your model, the data is not the same as in the previous run.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

0

Change this line

train, test = train_test_split(data_set.values, test_size=0.2)

to

train, test = train_test_split(data_set.values, test_size=0.2,random_state=0)

The value of random_state doesn't necessarily need to be 0; it can be 1, 2, or 42, as long as it is the same value each time the split happens. Then different runs will give you consistent results.
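As a quick sanity check (the small X array below is made-up data, only for illustration), the same seed reproduces the same split on every call:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # made-up data, just to illustrate the point
train1, test1 = train_test_split(X, test_size=0.2, random_state=42)
train2, test2 = train_test_split(X, test_size=0.2, random_state=42)
print(np.array_equal(train1, train2), np.array_equal(test1, test2))  # prints: True True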

raghus