
Say I have a DataFrame, df, with columns: id | site | time | clicks | impressions

I want to use the machine learning technique of k-fold cross-validation: split the data randomly into k=10 equal-sized partitions, based on e.g. the id column. I think of this as a mapping from id to {0, 1, ..., 9} (so a new column 'fold' ranging over 0-9). Then I iteratively take 9/10 of the partitions as training data and the remaining 1/10 as validation data (first fold == 0 is validation and the rest is training, then fold == 1 is validation and the rest is training, and so on), so I am thinking of this as a generator based on grouping by the fold column.
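The fold assignment described above can be sketched like this (the frame below is a toy stand-in with made-up values; only the column names come from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame standing in for df (column names from the question, values invented).
df = pd.DataFrame({
    'id': np.arange(100),
    'site': rng.choice(['a', 'b'], size=100),
    'time': rng.integers(0, 4, size=100),
    'clicks': rng.integers(0, 10, size=100),
    'impressions': rng.integers(10, 100, size=100),
})

n_folds = 10
# id % n_folds gives equal-sized groups; permuting makes the assignment random.
df['fold'] = rng.permutation(df['id'].to_numpy() % n_folds)
```

With 100 distinct ids this yields exactly 10 rows per fold; for ids that repeat across rows you would instead map each unique id to a fold and merge that mapping back in.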

Finally, I want to group all the training data by site and time (and similarly for the validation data); in other words, sum over the fold index while keeping the site and time indices.

What is the right way of doing this in pandas?

The way I thought of doing it at the moment is:

df_sum = df.groupby(['fold', 'site', 'time']).sum()
# df_sum now has a MultiIndex with levels fold, site, time.
# Create a new Series, dat, with name='cross', by mapping fold values
# to 'training'/'validation'.
df_train_val = df_sum.groupby([dat, 'site', 'time']).sum()
df_train_val.xs('validation', level='cross')

Now the direct problem I run into is that groupby on columns will accept a Series object mixed in, but groupby on a MultiIndex doesn't (the df_train_val assignment above doesn't work). Obviously I could use reset_index, but given that I want to group over site and time (to aggregate over folds 1 to 9, say) this seems wrong. (I assume grouping is much faster on indices than on 'raw' columns.)
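For concreteness, the reset_index route I am reluctant to take looks like this (toy numbers; 'cross' and the validation fold 0 are just for illustration):

```python
import pandas as pd

# Small made-up frame with a 'fold' column already assigned.
df = pd.DataFrame({
    'fold':        [0, 0, 1, 1, 2, 2],
    'site':        ['a', 'b', 'a', 'b', 'a', 'b'],
    'time':        [1, 1, 1, 1, 1, 1],
    'clicks':      [1, 2, 3, 4, 5, 6],
    'impressions': [10, 20, 30, 40, 50, 60],
})

df_sum = df.groupby(['fold', 'site', 'time']).sum()

# Workaround: drop back to columns, label each fold, regroup.
tmp = df_sum.reset_index()
tmp['cross'] = tmp['fold'].map(lambda f: 'validation' if f == 0 else 'training')
df_train_val = tmp.groupby(['cross', 'site', 'time'])[['clicks', 'impressions']].sum()

val = df_train_val.xs('validation', level='cross')
```

This produces the desired (cross, site, time) index, but at the cost of flattening and rebuilding the index on every fold.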

So, Question 1: is this the right way to do cross-validation followed by aggregation in pandas, and more generally to group and then regroup based on MultiIndex values?

Question 2: is there a way of mixing arbitrary mappings with multilevel indices?
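On Question 2, recent pandas versions do let you mix an array-like key with index-level names in a single groupby call, so the mapping can be built from a level without reset_index. A sketch (toy data; the 'cross' name and fold-0-as-validation choice are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'fold':        [0, 0, 1, 1],
    'site':        ['a', 'b', 'a', 'b'],
    'time':        [1, 1, 1, 1],
    'clicks':      [1, 2, 3, 4],
    'impressions': [10, 20, 30, 40],
}).set_index(['fold', 'site', 'time'])

# Derive the mapping from the 'fold' level as an index-aligned array.
cross = df.index.get_level_values('fold').map(
    lambda f: 'validation' if f == 0 else 'training').rename('cross')

# Mix the mapped array with index-level names as groupby keys.
df_train_val = df.groupby([cross, 'site', 'time']).sum()
val = df_train_val.xs('validation', level='cross')
```

Here 'site' and 'time' resolve to index levels because they are not column names, so no index flattening is needed.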

seanv507
  • scikit-learn has a [k-fold](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html) method that is compatible with pandas; no need to implement your own unless you have a requirement that is not satisfied by scikit-learn's implementation – EdChum Jul 28 '14 at 15:56
  • Well, I am not using scikit-learn, and the problem is not with implementing k-fold but with the aggregation step after performing the k-fold: the df_train_val assignment. I would believe these data-munging ops are all easier and faster done in pandas than scikit-learn. E.g. I could do k-fold in scikit-learn first and then group by the other vars, but then I am grouping the data (w.r.t. site and time) for every fold. It seems much more efficient to group by fold, site, time once and then reassign folds 0...9 to 'training' or 'validation' – seanv507 Jul 28 '14 at 21:23
  • It would make more sense to group first and then k-fold, and yes, grouping on an index should be faster. However, depending on the size and distribution of your groups you may find it very slow; this has been my experience when working with large datasets. It may be fine for your dataset. Anyway, I think your approach is sound – EdChum Jul 28 '14 at 21:33

1 Answer


This generator seems to do what I want. You pass in the grouped data, with one index level corresponding to the fold (taking values 0 to n_folds-1).

def split_fold2(fold_data, n_folds, new_fold_col='fold'):
    # Index levels that remain after aggregating over the fold level.
    indices = list(fold_data.index.names)
    slicers = [slice(None)] * len(fold_data.index.names)
    fold_index = fold_data.index.names.index(new_fold_col)
    indices.remove(new_fold_col)

    for i_fold in range(n_folds):
        # Select every fold except the held-out one, then sum over the fold level.
        slicers[fold_index] = [i for i in range(n_folds) if i != i_fold]
        slicers_tuple = tuple(slicers)
        train_data = fold_data.loc[slicers_tuple, :].groupby(level=indices).sum()
        # The held-out fold is the validation set.
        val_data = fold_data.xs(i_fold, level=new_fold_col)
        yield train_data, val_data

On my data set this takes (to retrieve one fold):

CPU times: user 812 ms, sys: 180 ms, total: 992 ms  Wall time: 991 ms 

Replacing the train_data assignment with train_data = fold_data.select(lambda x: x[fold_index] != i_fold).groupby(level=indices).sum() takes

CPU times: user 2.59 s, sys: 263 ms, total: 2.85 s Wall time: 2.83 s
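A self-contained usage sketch of the generator on a toy frame (the definition is repeated here, lightly tidied, so the snippet runs standalone; the data is made up):

```python
import pandas as pd

# Copy of split_fold2 from the answer above, so this example is runnable on its own.
def split_fold2(fold_data, n_folds, new_fold_col='fold'):
    indices = list(fold_data.index.names)
    slicers = [slice(None)] * len(fold_data.index.names)
    fold_index = fold_data.index.names.index(new_fold_col)
    indices.remove(new_fold_col)
    for i_fold in range(n_folds):
        slicers[fold_index] = [i for i in range(n_folds) if i != i_fold]
        train_data = fold_data.loc[tuple(slicers), :].groupby(level=indices).sum()
        val_data = fold_data.xs(i_fold, level=new_fold_col)
        yield train_data, val_data

df = pd.DataFrame({
    'fold':        [0, 0, 1, 1],
    'site':        ['a', 'b', 'a', 'b'],
    'time':        [1, 1, 1, 1],
    'clicks':      [1, 2, 3, 4],
    'impressions': [10, 20, 30, 40],
})
df_sum = df.groupby(['fold', 'site', 'time']).sum()

for train, val in split_fold2(df_sum, n_folds=2):
    # train sums the remaining folds, keyed by (site, time);
    # val is the held-out fold, also keyed by (site, time).
    pass
```

Each yielded pair is already aggregated over (site, time), so no per-fold regrouping of the validation side is needed.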
seanv507