Right now I have the following code, which reads features and labels from a CSV file and fits a DecisionTreeClassifier model on them.

import csv
from sklearn import tree
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

mycsv = csv.reader(open('postsBase2.csv'))

features = []
labels = []

for row in mycsv:
    features.append([row[2], row[3], row[6]])
    labels.append(row[8])


clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

I also have a few other fields in the CSV that I would like to load, and they hold categorical data. They are at indexes 7 and 8 of each row. The field at index 7 can be one of 4 categories and the field at index 8 can be one of 5 categories.

I want to add these to my features and then pass them through the OneHotEncoder class somehow, so they are turned into a numeric form the model can be fitted with. The updated code, with some pseudocode for what I want to do, is below:

import csv
from sklearn import tree
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

mycsv = csv.reader(open('postsBase2.csv'))

features = []
labels = []
for row in mycsv:
    features.append([row[2], row[3], row[6], row[7], row[8]])
    labels.append(row[8])


# Here I now want to process the features from row indexes 7 and 8 via
# OneHotEncoder somehow, to make them acceptable for the DecisionTreeClassifier

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

How can I do this?

Mark Keane
  • You should use [pandas.get_dummies()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) for this. It is easier than using scikit. In scikit you have to first convert the categorical features into integers, and then you can use one-hot encoder on it, or else you have to use DictVectorizer (but it needs a dict form input) – Vivek Kumar Apr 13 '17 at 06:18
  • Possible duplicate of [scikit-learn: One hot encoding of string categorical features](http://stackoverflow.com/questions/35107559/scikit-learn-one-hot-encoding-of-string-categorical-features) – Vivek Kumar Apr 13 '17 at 06:35
  • Also, if you insist on using scikit, see the workaround here: https://github.com/scikit-learn/scikit-learn/issues/7493 – Vivek Kumar Apr 13 '17 at 06:36
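
A minimal sketch of the pandas.get_dummies() route suggested in the first comment might look like the following. The inline sample data, the column names (col0 to col8) and the category values are all made up for illustration; the real postsBase2.csv is not shown in the question. The sketch also assumes that index 8 holds the label, as in the question's code, so only index 7 is encoded here. Any other categorical column would be added to categorical_cols.

```python
import io

import pandas as pd
from sklearn import tree

# Hypothetical stand-in for postsBase2.csv (no header row, 9 columns).
SAMPLE_CSV = """\
x,x,1,10,x,x,0.5,red,yes
x,x,2,20,x,x,0.1,blue,no
x,x,3,30,x,x,0.9,green,yes
x,x,4,40,x,x,0.3,red,no
"""

# Give every column a string name (assumed names, not from the real file).
names = [f"col{i}" for i in range(9)]
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=names)

numeric_cols = ["col2", "col3", "col6"]
categorical_cols = ["col7"]  # assumption: list further categorical columns here

# get_dummies expands each listed column into one 0/1 column per category
# and passes the remaining columns through unchanged.
features = pd.get_dummies(
    df[numeric_cols + categorical_cols],
    columns=categorical_cols,
    dtype=int,
)
labels = df["col8"]  # assumption: index 8 is the label, as in the question

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
```

Rows passed to clf.predict() later need the same dummy columns in the same order, so new data would have to be encoded the same way and aligned to features.columns before predicting.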
