It seems the issue is that a missing value in Orange is represented as ? or ~. Oddly enough, the Orange.data.Table(numpy.ndarray) constructor does not infer that numpy.nan should be converted to ? or ~; instead it converts it to 1.#QO. The custom function below, pandas_to_orange(), addresses this problem.
import Orange
import numpy as np
import pandas as pd
from collections import OrderedDict
# Adapted from https://github.com/biolab/orange3/issues/68
def construct_domain(df):
    columns = OrderedDict(df.dtypes)

    def create_variable(col):
        if col[1].__str__().startswith('float'):
            return Orange.feature.Continuous(col[0])
        if col[1].__str__().startswith('int') and len(df[col[0]].unique()) > 50:
            return Orange.feature.Continuous(col[0])
        if col[1].__str__().startswith('date'):
            df[col[0]] = df[col[0]].values.astype(np.str)
        if col[1].__str__() == 'object':
            df[col[0]] = df[col[0]].astype(type(""))
        return Orange.feature.Discrete(col[0], values=df[col[0]].unique().tolist())

    return Orange.data.Domain(list(map(create_variable, columns.items())))

def pandas_to_orange(df):
    domain = construct_domain(df)
    df[pd.isnull(df)] = '?'
    return Orange.data.Table(Orange.data.Domain(domain), df.values.tolist())
df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11],
                   'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]})
tmp = pandas_to_orange(df)
for i in range(0, len(tmp)):
    print tmp[i]
The output is:
[1.000, 10.000]
[2.000, 20.000]
[?, 30.000]
[4.000, 40.000]
[5.000, 50.000]
[6.000, 60.000]
[7.000, 70.000]
[8.000, 80.000]
[9.000, 90.000]
[?, 100.000]
[11.000, 110.000]
The reason I wanted to properly encode the missing values is so I can use the Orange imputation library. However, it appears that the predictive tree model in the library does not do much more than simple mean imputation. Specifically, it imputes the same value for all missing values.
imputer = Orange.feature.imputation.ModelConstructor()
imputer.learner_continuous = Orange.classification.tree.TreeLearner(min_subset=20)
imputer = imputer(tmp)
impdata = imputer(tmp)
for i in range(0, len(tmp)):
    print impdata[i]
Here's the output:
[1.000, 10.000]
[2.000, 20.000]
[5.889, 30.000]
[4.000, 40.000]
[5.000, 50.000]
[6.000, 60.000]
[7.000, 70.000]
[8.000, 80.000]
[9.000, 90.000]
[5.889, 100.000]
[11.000, 110.000]
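As a quick sanity check (plain pandas here, no Orange needed), the repeated value 5.889 is exactly the mean of the nine observed col1 entries. Presumably the tree never splits, since min_subset=20 exceeds the nine complete rows, so every missing value gets the root-node mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11],
                   'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]})

# Mean of the nine observed col1 values: (1+2+4+5+6+7+8+9+11) / 9 = 53 / 9
print(round(df['col1'].mean(), 3))  # 5.889 -- the value imputed for both rows
```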
I was looking for something that will fit a model, say kNN, on the complete cases and use the fitted model to predict the missing cases. fancyimpute (a Python 3 package) does this but throws MemoryError on my 300K+ input.
from fancyimpute import KNN
df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11],
                   'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]})
X_filled_knn = KNN(k=3).complete(df)
X_filled_knn
Output is:
array([[  1.        ,   10.        ],
       [  2.        ,   20.        ],
       [  2.77777784,   30.        ],
       [  4.        ,   40.        ],
       [  5.        ,   50.        ],
       [  6.        ,   60.        ],
       [  7.        ,   70.        ],
       [  8.        ,   80.        ],
       [  9.        ,   90.        ],
       [  9.77777798,  100.        ],
       [ 11.        ,  110.        ]])
I can probably find a workaround or split the dataset into chunks (not ideal).
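One lighter-weight possibility is to do the complete-case fitting by hand. The sketch below (Python 3, plain NumPy/pandas; knn_impute, target, predictors, and k are names I am introducing here, not part of any library) fits on the complete rows and only computes distances for the rows that need filling, so memory stays proportional to the number of complete cases rather than a full pairwise matrix:

```python
import numpy as np
import pandas as pd

def knn_impute(df, target, predictors, k=3):
    """Fill NaNs in `target` with the mean of the k nearest complete-case
    neighbours, measuring Euclidean distance on the `predictors` columns."""
    out = df.copy()
    missing = out[target].isnull().values
    X_train = out.loc[~missing, predictors].values.astype(float)
    y_train = out.loc[~missing, target].values.astype(float)
    X_miss = out.loc[missing, predictors].values.astype(float)
    preds = []
    for x in X_miss:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # stable sort makes tie-breaking among equidistant neighbours deterministic
        nearest = np.argsort(dist, kind='stable')[:k]
        preds.append(y_train[nearest].mean())
    out.loc[missing, target] = preds
    return out

df = pd.DataFrame({'col1': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, np.nan, 11],
                   'col2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110.]})
filled = knn_impute(df, target='col1', predictors=['col2'])
print(filled['col1'].tolist())  # the two NaNs become 7/3 ~ 2.333 and 28/3 ~ 9.333
```

For a 300K-row table the loop runs only over the missing rows, each at O(n_complete) cost; if that is still too slow, a spatial index such as scipy.spatial.cKDTree over the complete cases would cut the per-query cost further.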