12

I notice that this is an issue on GitHub already. Does anyone have any code that converts a Pandas DataFrame to an Orange Table?

Specifically, I have the following DataFrame:

       user  hotel  star_rating  user  home_continent  gender
0         1     39          4.0     1               2  female
1         1     44          3.0     1               2  female
2         2     63          4.5     2               3  female
3         2      2          2.0     2               3  female
4         3     26          4.0     3               1    male
5         3     37          5.0     3               1    male
6         3     63          4.5     3               1    male
hlin117
  • The Orange format does not look that difficult to output: http://docs.orange.biolab.si/reference/rst/Orange.data.formats.html It also supports importing CSV files and guessing the data types. Have you tried anything? – EdChum Oct 12 '14 at 08:54
  • I understand how data is saved into a *.tab file, but specifically, is there a function or series of calls that lets you convert a pandas DataFrame to an Orange Table? (Side comment: It's funny how the page talks about how data is stored in an external file, but doesn't talk about how to save to or load from files. I personally think Orange is not well documented.) – hlin117 Oct 12 '14 at 13:19
  • Would a workflow that saves the table in Pandas as a file and then imports the file in Orange work? Or too much of a kludge? I guess the field data types might not be passed nicely. – BKay Oct 16 '14 at 19:01
  • @BKay That's a start, but I'm looking for something more elegant or straightforward. Essentially, that sounds like EdChum's idea. – hlin117 Oct 16 '14 at 22:57
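The file-based route discussed in these comments can be sketched as follows. This is only an illustration, assuming Orange's tab-delimited layout with three header rows (names, types, flags); the helper `write_orange_tab` and its `discrete` and `class_column` parameters are hypothetical names, not part of pandas or Orange.

```python
import io

import pandas as pd


def write_orange_tab(df, handle, discrete=(), class_column=None):
    """Write df in a three-header-row, tab-delimited layout.

    Row 1: column names; row 2: types; row 3: optional flags such as "class".
    `discrete` and `class_column` are hypothetical parameters of this sketch.
    """
    names = [str(c) for c in df.columns]
    types = []
    for name in df.columns:
        if name in discrete:
            types.append("discrete")
        elif pd.api.types.is_numeric_dtype(df[name]):
            types.append("continuous")
        else:
            types.append("string")
    flags = ["class" if c == class_column else "" for c in df.columns]
    for row in (names, types, flags):
        handle.write("\t".join(row) + "\n")
    # Data rows follow the header rows; missing values are written as "?".
    df.to_csv(handle, sep="\t", header=False, index=False, na_rep="?")


df = pd.DataFrame({
    "hotel": [39, 44, 63],
    "star_rating": [4.0, 3.0, 4.5],
    "gender": ["female", "female", "male"],
})
buffer = io.StringIO()
write_orange_tab(df, buffer, discrete=("hotel", "gender"), class_column="gender")
tab_text = buffer.getvalue()
```

Writing to a real file instead of a StringIO buffer and then opening that file with Orange.data.Table should complete the round trip, though the column types still have to be chosen by hand as above.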

7 Answers

19

The documentation of the Orange package doesn't cover all the details. According to lib_kernel.cpp, Table.__init__(Domain, numpy.ndarray) works only for int and float.

They really should provide a C-level interface for pandas.DataFrame, or at least numpy.dtype("str") support.

Update: Added table2df. The performance of df2table improved greatly by using NumPy for int and float columns.

Keep this script in your collection of Orange Python scripts, and you will have pandas available in your Orange environment.

Usage: a_pandas_dataframe = table2df(a_orange_table), a_orange_table = df2table(a_pandas_dataframe)

Note: This script works only in Python 2.x; refer to @DustinTang's answer for a Python 3.x compatible script.

import pandas as pd
import numpy as np
import Orange

#### For those who are familiar with pandas
#### Correspondence:
####    value <-> Orange.data.Value
####        NaN <-> ["?", "~", "."] # Don't know, Don't care, Other
####    dtype <-> Orange.feature.Descriptor
####        category, int <-> Orange.feature.Discrete # category: > pandas 0.15
####        int, float <-> Orange.feature.Continuous # Continuous = core.FloatVariable
####                                                 # refer to feature/__init__.py
####        str <-> Orange.feature.String
####        object <-> Orange.feature.Python
####    DataFrame.dtypes <-> Orange.data.Domain
####    pandas.DataFrame <-> Orange.data.Table = Orange.orange.ExampleTable
####                              # You will need this if you are reading sources

def series2descriptor(d, discrete=False):
    # Compare dtypes with == rather than identity, which is the idiomatic check
    if d.dtype == np.dtype("float"):
        return Orange.feature.Continuous(str(d.name))
    elif d.dtype == np.dtype("int"):
        return Orange.feature.Continuous(str(d.name), number_of_decimals=0)
    else:
        t = d.unique()
        if discrete or len(t) < len(d) / 2:
            t.sort()
            return Orange.feature.Discrete(str(d.name), values=list(t.astype("str")))
        else:
            return Orange.feature.String(str(d.name))


def df2domain(df):
    featurelist = [series2descriptor(df.icol(col)) for col in xrange(len(df.columns))]
    return Orange.data.Domain(featurelist)


def df2table(df):
    # It seems they are using native python object/lists internally for Orange.data types (?)
    # And I didn't find a constructor suitable for pandas.DataFrame since it may carry
    # multiple dtypes
    #  --> the best approximate is Orange.data.Table.__init__(domain, numpy.ndarray),
    #  --> but the dtype of numpy array can only be "int" and "float"
    #  -->  * refer to src/orange/lib_kernel.cpp 3059:
    #  -->  *    if (((*vi)->varType != TValue::INTVAR) && ((*vi)->varType != TValue::FLOATVAR))
    #  --> Documents never mentioned >_<
    # So we use numpy constructor for those int/float columns, python list constructor for other

    tdomain = df2domain(df)
    ttables = [series2table(df.icol(i), tdomain[i]) for i in xrange(len(df.columns))]
    return Orange.data.Table(ttables)

    # For performance concerns, here are my results
    # dtndarray = np.random.rand(100000, 100)
    # dtlist = list(dtndarray)
    # tdomain = Orange.data.Domain([Orange.feature.Continuous("var" + str(i)) for i in xrange(100)])
    # tinsts = [Orange.data.Instance(tdomain, list(dtlist[i]) )for i in xrange(len(dtlist))] 
    # t = Orange.data.Table(tdomain, tinsts)
    #
    # timeit list(dtndarray)  # 45.6ms
    # timeit [Orange.data.Instance(tdomain, list(dtlist[i])) for i in xrange(len(dtlist))] # 3.28s
    # timeit Orange.data.Table(tdomain, tinsts) # 280ms

    # timeit Orange.data.Table(tdomain, dtndarray) # 380ms
    #
    # As illustrated above, utilizing constructor with ndarray can greatly improve performance
    # So one may conceive better converter based on these results


def series2table(series, variable):
    if series.dtype == np.dtype("int") or series.dtype == np.dtype("float"):
        # Use numpy
        # Table.__init__(Domain, numpy.ndarray)
        return Orange.data.Table(Orange.data.Domain(variable), series.values[:, np.newaxis])
    else:
        # Build instance list
        # Table.__init__(Domain, list_of_instances)
        tdomain = Orange.data.Domain(variable)
        tinsts = [Orange.data.Instance(tdomain, [i]) for i in series]
        return Orange.data.Table(tdomain, tinsts)
        # 5x performance


def column2df(col):
    if type(col.domain[0]) is Orange.feature.Continuous:
        return (col.domain[0].name, pd.Series(col.to_numpy()[0].flatten()))
    else:
        tmp = pd.Series(np.array(list(col)).flatten())  # type(tmp) -> np.array( dtype=list (Orange.data.Value) )
        tmp = tmp.apply(lambda x: str(x[0]))
        return (col.domain[0].name, tmp)

def table2df(tab):
    # Orange.data.Table().to_numpy() cannot handle strings
    # So we must build the array column by column,
    # When it comes to strings, python list is used
    series = [column2df(tab.select(i)) for i in xrange(len(tab.domain))]
    series_name = [i[0] for i in series]  # To keep the order of variables unchanged
    series_data = dict(series)
    return pd.DataFrame(series_data, columns=series_name)
TurtleIzzy
10

The answer below comes from a closed issue on GitHub:

from Orange.data.pandas_compat import table_from_frame
out_data = table_from_frame(df)

Here df is your DataFrame. So far, the only time I have needed to define a domain manually was to handle dates when the data source wasn't 100% clean and in the required ISO standard.
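One way to reduce the need for a manual domain may be to clean the date column in pandas before converting. The following is a minimal sketch and not part of the original answer; the column name `checkin` and the day/month/year input format are assumptions made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel": [39, 44, 63],
    "checkin": ["12/10/2014", "16/10/2014", "not a date"],  # hypothetical column
})

# Parse with an explicit format; unparseable entries become NaT instead of raising.
parsed = pd.to_datetime(df["checkin"], format="%d/%m/%Y", errors="coerce")

# Keep the entries that failed to parse so they can be cleaned by hand.
bad_rows = df.loc[parsed.isna(), "checkin"].tolist()

# ISO-formatted strings, for inspection or for writing to a file.
iso_strings = parsed.dt.strftime("%Y-%m-%d").tolist()

# Store a real datetime64 column before handing df to the converter.
df["checkin"] = parsed
```

With the column stored as datetime64, the conversion no longer has to guess a date format from strings; how missing dates are represented in the resulting table is left to Orange.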

I realize this is an old question and a lot has changed since it was first asked, but it is the top Google search result on the topic.

Creo
7
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# work with df as a regular pandas DataFrame here
out_data = table_from_frame(df)

Based on Creo's answer.

Shimon Doodkin
4

To convert a pandas DataFrame to an Orange Table, you need to construct a domain, which specifies the column types.

For continuous variables, you only need to provide the name of the variable, but for discrete variables, you also need to provide a list of all possible values.

The following code will construct a domain for your DataFrame and convert it to an Orange Table:

import numpy as np
from Orange.feature import Discrete, Continuous
from Orange.data import Domain, Table
domain = Domain([
    Discrete('user', values=[str(v) for v in np.unique(df.user)]),
    Discrete('hotel', values=[str(v) for v in np.unique(df.hotel)]),
    Continuous('star_rating'),
    Discrete('user', values=[str(v) for v in np.unique(df.user)]),
    Discrete('home_continent', values=[str(v) for v in np.unique(df.home_continent)]),
    Discrete('gender', values=['male', 'female'])], False)
table = Table(domain, [map(str, row) for row in df.as_matrix()])

The map(str, row) step is needed so that Orange knows the data contains the values of discrete features (and not the indices of those values in the values list).
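The two pandas-side ingredients of this approach, the value list for each discrete column and the rows converted to strings, can also be computed in a loop instead of being written out per column. This is a minimal sketch using only pandas; the `discrete_columns` list is a hand-picked assumption, and the Orange calls themselves are left out.

```python
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2],
    "star_rating": [4.0, 3.0, 4.5],
    "gender": ["female", "female", "male"],
})
discrete_columns = ["user", "gender"]  # chosen by hand for this sketch

# A discrete variable needs every possible value, given as strings.
values = {
    col: sorted(str(v) for v in df[col].unique())
    for col in discrete_columns
}

# Rows as lists of strings, so entries are read as value names, not as indices.
rows = [[str(v) for v in row] for row in df.itertuples(index=False)]
```

The `values` dictionary would supply the values argument of each discrete variable, and `rows` plays the role of the map(str, row) list above.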

astaric
  • This works great! I tested it out, and it seems I could sort the table by gender, so I'll assume most of the other table functions would work. – hlin117 Oct 18 '14 at 18:02
  • Is there no other datatype if you want to describe a feature being an ID? (Example, a user ID) – hlin117 Oct 19 '14 at 16:17
4

This code is a revision of @TurtleIzzy's script for Python 3.

import numpy as np
from Orange.data import Table, Domain, ContinuousVariable, DiscreteVariable


def series2descriptor(d):
    if d.dtype == np.dtype("float") or d.dtype == np.dtype("int"):
        return ContinuousVariable(str(d.name))
    else:
        t = d.unique()
        t.sort()
        return DiscreteVariable(str(d.name), list(t.astype("str")))

def df2domain(df):
    featurelist = [series2descriptor(df.iloc[:,col]) for col in range(len(df.columns))]
    return Domain(featurelist)

def df2table(df):
    tdomain = df2domain(df)
    ttables = [series2table(df.iloc[:,i], tdomain[i]) for i in range(len(df.columns))]
    ttables = np.array(ttables).reshape((len(df.columns),-1)).transpose()
    return Table(tdomain , ttables)

def series2table(series, variable):
    if series.dtype == np.dtype("int") or series.dtype == np.dtype("float"):
        series = series.values[:, np.newaxis]
        return Table(series)
    else:
        # cat.codes is a pandas Series, which has no reshape(); use its NumPy array
        series = series.astype('category').cat.codes.values.reshape((-1, 1))
        return Table(series)
Thierry Lathuille
DustinTang
1

Something like this?

table = Orange.data.Table(df.as_matrix())

The columns in Orange will get generic names (a1, a2...). If you want to copy the names and the types from the data frame, construct an Orange.data.Domain object (http://docs.orange.biolab.si/reference/rst/Orange.data.domain.html#Orange.data.Domain.init) from the data frame and pass it as the first argument above.

See the constructors in http://docs.orange.biolab.si/reference/rst/Orange.data.table.html.

JanezD
  • I get a domain error when I try this. "TypeError: invalid arguments for constructor (domain or examples or both expected)". Can you provide some code to also add in a domain? – hlin117 Oct 17 '14 at 18:48
  • Say you have `df = DataFrame({"A": [1, 2, 3, 4], "B": [8, 7, 6, 5]})`. Construct a domain with `domain = Orange.data.Domain([Orange.feature.Continuous(name) for name in df.columns])` and then `table = Orange.data.Table(domain, df.as_matrix())` – JanezD Oct 18 '14 at 14:56
  • Oh, if it doesn't work: what does your data frame look like? If `df.as_matrix().dtype` is `object`, Orange won't accept it. You must convert the categorical data into indices. – JanezD Oct 18 '14 at 15:04
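The conversion of categorical data into indices mentioned in the last comment can be done in pandas before the matrix is handed to Orange. This is a minimal sketch under the assumption that every non-numeric column should be treated as categorical; the `categories` dictionary is a hypothetical helper for remembering which index stands for which value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "star_rating": [4.0, 3.0, 4.5],
    "gender": ["female", "female", "male"],
})

encoded = df.copy()
categories = {}
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        cat = df[col].astype("category")
        # Remember the value names so the indices can be mapped back later.
        categories[col] = [str(c) for c in cat.cat.categories]
        encoded[col] = cat.cat.codes  # -1 marks a missing value

# A purely numeric matrix instead of one with dtype object.
matrix = encoded.to_numpy(dtype=float)
```

The lists in `categories` are in the same order as the integer codes, so they could serve as the value lists of the matching discrete variables in the domain.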
1

table_from_frame, which is available in Python 3, doesn't allow you to define a class column, so the generated table cannot be used directly to train a classification model. I tweaked the table_from_frame function so that it accepts a class column. Note that the name of the class column must be passed as an additional parameter.

"""Pandas DataFrame↔Table conversion helpers"""
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_categorical_dtype, is_object_dtype,
    is_datetime64_any_dtype, is_numeric_dtype,
)

from Orange.data import (
    Table, Domain, DiscreteVariable, StringVariable, TimeVariable,
    ContinuousVariable,
)

__all__ = ['table_from_frame', 'table_to_frame']


def table_from_frame(df, class_name, *, force_nominal=False):
    """
    Convert pandas.DataFrame to Orange.data.Table

    Parameters
    ----------
    df : pandas.DataFrame
    class_name : str
        Name of the column to use as the class variable.
    force_nominal : boolean
        If True, interpret ALL string columns as nominal (DiscreteVariable).

    Returns
    -------
    Table
    """

    def _is_discrete(s):
        return (is_categorical_dtype(s) or
                is_object_dtype(s) and (force_nominal or
                                        s.nunique() < s.size**.666))

    def _is_datetime(s):
        if is_datetime64_any_dtype(s):
            return True
        try:
            if is_object_dtype(s):
                pd.to_datetime(s, infer_datetime_format=True)
                return True
        except Exception:  # pylint: disable=broad-except
            pass
        return False

    # If df index is not a simple RangeIndex (or similar), put it into data
    if not (df.index.is_integer() and (df.index.is_monotonic_increasing or
                                       df.index.is_monotonic_decreasing)):
        df = df.reset_index()

    attrs, metas, class_vars = [], [], []
    X, Y, M = [], [], []

    # Iterate over columns
    for name, s in df.items():
        name = str(name)
        if name == class_name:
            # Keep the class column in Y so it stays aligned with class_vars
            # regardless of its position in the DataFrame
            discrete = s.astype('category').cat
            class_vars.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
            Y.append(discrete.codes.replace(-1, np.nan).values)
        elif _is_discrete(s):
            discrete = s.astype('category').cat
            attrs.append(DiscreteVariable(name, discrete.categories.astype(str).tolist()))
            X.append(discrete.codes.replace(-1, np.nan).values)
        elif _is_datetime(s):
            tvar = TimeVariable(name)
            attrs.append(tvar)
            s = pd.to_datetime(s, infer_datetime_format=True)
            X.append(s.astype('str').replace('NaT', np.nan).map(tvar.parse).values)
        elif is_numeric_dtype(s):
            attrs.append(ContinuousVariable(name))
            X.append(s.values)
        else:
            metas.append(StringVariable(name))
            M.append(s.values.astype(object))

    return Table.from_numpy(Domain(attrs, class_vars, metas),
                            np.column_stack(X) if X else np.empty((df.shape[0], 0)),
                            np.column_stack(Y) if Y else None,
                            np.column_stack(M) if M else None)
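The nominal-versus-text heuristic inside _is_discrete can be tried on its own. The sketch below reproduces only the cardinality test (number of unique values below size ** 0.666) and leaves out the dtype checks; `looks_discrete` is a hypothetical stand-in name, and the two sample series are made up for illustration.

```python
import pandas as pd


def looks_discrete(s, force_nominal=False):
    # Few distinct values relative to the column length suggests a nominal column.
    return force_nominal or s.nunique() < s.size ** 0.666


gender = pd.Series(["female"] * 4 + ["male"] * 3)      # 2 unique values out of 7
review = pd.Series(["text %d" % i for i in range(7)])  # 7 unique values out of 7

gender_is_discrete = looks_discrete(gender)
review_is_discrete = looks_discrete(review)
review_forced = looks_discrete(review, force_nominal=True)
```

For seven rows the threshold is roughly 3.65, so a two-valued column is treated as discrete while a column of all-distinct strings falls through to the string (meta) branch unless force_nominal is set.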
omer sagi