
I have created a prediction model for this dataset:

>>df.head()

    Service    Tasks Difficulty     Hours
0   ABC         24     1           0.833333
1   CDE         77     1           1.750000
2   SDE         90     3           3.166667
3   QWE         47     1           1.083333
4   ASD         26     3           1.000000

>>df.shape
(998,4)

>>X = df.iloc[:,:-1]
>>y = df.iloc[:,-1].values
>>from sklearn.compose import ColumnTransformer
>>from sklearn.preprocessing import OneHotEncoder
>>ct = ColumnTransformer([("cat", OneHotEncoder(),[0])], remainder="passthrough")
>>X = ct.fit_transform(X)
>>x = X.toarray()
>>x = x[:,1:]

>>x.shape
(998,339)

>>from sklearn.ensemble import RandomForestRegressor
>>rf_model = RandomForestRegressor(random_state = 1)
>>rf_model.fit(x,y)

How can I use this model to predict Hours for user input in this format: [["SDE", 90, 3]]?

I tried

>>test_input = [["SDE", 90, 3]]
>>test_input = ct.fit_transform(test_input)  
>>test_input = test_input[:,1:]

>>test_input[0]
array([24, 1], dtype=object)


>>predict_hours = rf_model.predict(test_input)
ValueError

Since my dataset has many categorical values, it is not practical to enter the encoded value of "SDE" by hand. I need to convert "SDE" to its one-hot encoded form after receiving the input [["SDE", 90, 3]].

I don't know how to do this. Can anyone help?

sebin
  • Please repeat [on topic](https://stackoverflow.com/help/on-topic) and [how to ask](https://stackoverflow.com/help/how-to-ask) from the [intro tour](https://stackoverflow.com/tour). Stack Overflow is not intended to replace existing documentation and tutorials. Since there are many sites that illustrate using one-hot encoding, we expect you to use those well before posting here. – Prune Jan 06 '21 at 23:02
  • Don't use `fit_transform()` on both your training and prediction samples. `fit()` your transformer(s) to your training data, then `transform()` both your training and test data with the fitted transformer. – G. Anderson Jan 06 '21 at 23:07
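
The advice in the last comment can be sketched roughly as follows: fit the `ColumnTransformer` once on the training data, then call only `transform()` on the user input. The five-row DataFrame is stand-in data copied from `df.head()` in the question, and `handle_unknown="ignore"` and `sparse_threshold=0` are assumptions added here, not part of the original code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Stand-in data with the same columns as the question's df (rows from df.head()).
df = pd.DataFrame({
    "Service": ["ABC", "CDE", "SDE", "QWE", "ASD"],
    "Tasks": [24, 77, 90, 47, 26],
    "Difficulty": [1, 1, 3, 1, 3],
    "Hours": [0.833333, 1.75, 3.166667, 1.083333, 1.0],
})
X = df.iloc[:, :-1]
y = df.iloc[:, -1].values

# sparse_threshold=0 (an assumption here) returns a dense array, so toarray() is not needed.
ct = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
x = ct.fit_transform(X)  # fit on the training data only

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(x, y)

# New input: reuse the training column names and call transform(), not fit_transform().
test_input = pd.DataFrame([["SDE", 90, 3]], columns=X.columns)
test_encoded = ct.transform(test_input)
predict_hours = rf_model.predict(test_encoded)
print(test_encoded.shape, predict_hours)
```

Because the transformer is only fitted once, the encoded input has the same number of columns as the training matrix. If the `x[:,1:]` slice from the question is kept for training, the same slice has to be applied to `test_encoded` before calling `predict()`.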

1 Answer


You can use a Pipeline to handle the preprocessing and regression stages together, so the encoder fitted on the training data is reused automatically at prediction time:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# I have created a dummy dataset
df = pd.read_csv('test.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1].values

# preprocessor
preprocessor = ColumnTransformer([("cat", OneHotEncoder(handle_unknown='ignore'),[0])], remainder="passthrough")

# create a pipeline with preprocessor and regressor
pipeline = Pipeline([('preprocessor', preprocessor),
                      ('regressor', RandomForestRegressor(random_state = 1))
                      ])
# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

# train the pipeline
pipeline.fit(X_train, y_train)

# predict
print(pipeline.predict(X_test))
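
To predict for the user input from the question, a fitted pipeline can take the raw values as a one-row DataFrame, and the encoder inside it converts "SDE" for you. Below is a minimal sketch under a few assumptions: the five rows from `df.head()` stand in for `test.csv`, the step name `regressor` is arbitrary, and the "XYZ" service is a hypothetical unseen category used only to show what `handle_unknown='ignore'` does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in for test.csv: the five rows shown in the question.
df = pd.DataFrame({
    "Service": ["ABC", "CDE", "SDE", "QWE", "ASD"],
    "Tasks": [24, 77, 90, 47, 26],
    "Difficulty": [1, 1, 3, 1, 3],
    "Hours": [0.833333, 1.75, 3.166667, 1.083333, 1.0],
})
X = df.iloc[:, :-1]
y = df.iloc[:, -1].values

preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=1)),
])
pipeline.fit(X, y)

# Raw user input, wrapped in a DataFrame with the training column names.
user_input = pd.DataFrame([["SDE", 90, 3]], columns=X.columns)
predicted_hours = pipeline.predict(user_input)
print(predicted_hours)

# A service never seen in training is encoded as all zeros instead of raising an error.
unseen_input = pd.DataFrame([["XYZ", 50, 2]], columns=X.columns)
unseen_prediction = pipeline.predict(unseen_input)
print(unseen_prediction)
```

The pipeline calls `transform()` on the preprocessor during `predict()`, so nothing is refitted on the user input and the one-hot columns always match those seen in training.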
ismail durmaz