Questions tagged [one-hot-encoding]

One-hot encoding is a method for encoding categorical variables as numerical data that machine learning algorithms can work with.

Also known as dummy encoding, one-hot encoding is a method for converting categorical variables that have no ordinal relationship into numerical data that machine learning algorithms can work with. It is the most widespread approach and works well unless the categorical variable takes on a large number of unique values. One-hot encoding creates a new binary column for each possible value in the original data. Each row stores a one in the column that matches its category and zeros in all the others.
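As a rough illustration of the description above, here is a minimal sketch in plain Python. The helper name `one_hot` and the sample `colors` list are made up for this example, and sorting the categories is just one arbitrary way to fix the column order; libraries such as pandas or scikit-learn provide their own implementations.

```python
def one_hot(values):
    """Return (categories, rows): one binary column per distinct value."""
    # Hypothetical helper: column order is the sorted list of distinct values.
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[column_of[value]] = 1     # mark the column for this row's category
        rows.append(row)
    return categories, rows


# Sample data, invented for illustration.
colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot(colors)

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from. The number of columns equals the number of distinct values, which is why the approach becomes unwieldy for high-cardinality variables.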

951 questions
269 votes, 21 answers

Convert array of indices to 1-hot encoded numpy array

Let's say I have a 1d numpy array a = array([1,0,3]) I would like to encode this as a 2D one-hot array b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]]) Is there a quick way to do this? Quicker than just looping over a to set elements of b, that is.
James Atwood
169 votes, 20 answers

How can I one hot encode in Python?

I have a machine learning classification problem with 80% categorical variables. Must I use one hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without the encoding? I am trying to do the…
avicohen
62 votes, 4 answers

Can sklearn random forest directly handle categorical features?

Say I have a categorical feature, color, which takes the values ['red', 'blue', 'green', 'orange'], and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell…
hahdawg
59 votes, 5 answers

Running get_dummies on several DataFrame columns?

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
Emre
48 votes, 1 answer

adding dummy columns to the original dataframe

I have a dataframe that looks like this: JOINED_CO GENDER EXEC_FULLNAME GVKEY YEAR CONAME BECAMECEO REJOIN LEFTOFC LEFTCO RELEFT REASON PAGE CO_PER_ROL…
Brad
45 votes, 9 answers

One Hot Encoding using numpy

If the input is zero I want to make an array which looks like this: [1,0,0,0,0,0,0,0,0,0] and if the input is 5: [0,0,0,0,0,1,0,0,0,0] For the above I wrote: np.put(np.zeros(10),5,1) but it did not work. Is there any way in which, this can be…
Abhijay Ghildyal
34 votes, 3 answers

One hot encoding of string categorical features

I'm trying to perform a one hot encoding of a trivial dataset. data = [['a', 'dog', 'red'], ['b', 'cat', 'green']] What's the best way to preprocess this data using Scikit-Learn? On first instinct, you'd look towards Scikit-Learn's…
hlin117
30 votes, 3 answers

Feature names from OneHotEncoder

I am using OneHotEncoder to encode a few categorical variables (e.g. Sex and AgeGroup). The resulting feature names from the encoder look like 'x0_female', 'x0_male', 'x1_0.0', 'x1_15.0' etc. >>> train_X = pd.DataFrame({'Sex':['male', 'female']*3,…
Supratim Haldar
29 votes, 5 answers

How to one hot encode several categorical variables in R

I'm working on a prediction problem and I'm building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my training and testing sets. I managed to do it on my training data with: temps <-…
xeco
25 votes, 3 answers

In TensorFlow, what is the argument 'axis' in the function 'tf.one_hot'

Could anyone help with an explanation of what axis is in TensorFlow's one_hot function? According to the documentation: axis: The axis to fill (default: -1, a new inner-most axis) Closest I came to an answer on SO was an explanation relevant to…
20 votes, 3 answers

Convert a 2d matrix to a 3d one hot matrix numpy

I have a NumPy matrix and I want to convert it to a 3D array with the one-hot encoding of the elements as the third dimension. Is there a way to do this without looping over each row? E.g. a=[[1,3], [2,4]] should be made into b=[[1,0,0,0], [0,0,1,0], …
Rahul
17 votes, 1 answer

Why does Spark's OneHotEncoder drop the last category by default?

I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default. For example: >>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"]) >>> ss =…
Corey
16 votes, 9 answers

OneHotEncoder categorical_features deprecated, how to transform specific column

I need to transform the independent field from string to arithmetical notation. I am using OneHotEncoder for the transformation. My dataset has many independent columns, some of which are: Country | Age …
Hassaan
16 votes, 1 answer

Handling unknown values for label encoding

How can I handle unknown values for label encoding in sk-learn? The label encoder will only blow up with an exception that new labels were detected. What I want is the encoding of categorical variables via one-hot-encoder. However, sk-learn does not…
Georg Heiler
14 votes, 1 answer

Train multi-class image classifier in Keras

I was following a tutorial to learn to train a classifier using Keras https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html Specifically, from the second script given by the author, I wanted to transform the…