Say I have a data frame, df, with columns: id | site | time | clicks | impressions
I want to use the machine learning technique of k-fold cross-validation: split the data randomly into k=10 equal-sized partitions, based on e.g. the id column. I think of this as a mapping from id to {0, 1, ..., 9} (so a new column 'fold' ranging over 0-9). Then I iteratively take 9/10 of the partitions as training data and the remaining 1/10 as validation data (first fold==0 is validation and the rest is training, then fold==1 is validation and the rest is training, and so on). [So I'm thinking of this as a generator based on grouping by the fold column.]
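For concreteness, here is a minimal sketch of the fold assignment I have in mind, on made-up toy data (the column names match those above; everything else is invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# toy frame with the columns described above (values are made up)
df = pd.DataFrame({
    "id": np.repeat(np.arange(20), 3),            # 20 ids, 3 rows each
    "site": rng.integers(0, 4, 60),
    "time": rng.integers(0, 3, 60),
    "clicks": rng.integers(0, 100, 60),
    "impressions": rng.integers(100, 1000, 60),
})

# map each distinct id to a fold 0..9, so all rows of one id share a fold
ids = df["id"].unique()
fold_of_id = pd.Series(rng.permutation(len(ids)) % 10, index=ids)
df["fold"] = df["id"].map(fold_of_id)
```

This keeps every id entirely inside one fold, which is the point of splitting on id rather than on raw rows.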
Finally, I want to group all the training data by site and time, and similarly for the validation data (in other words, sum over the fold index while keeping the site and time indices).
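As a generator, the whole loop I'm describing might be sketched like this (toy data and the function name cv_splits are my own; only clicks and impressions are summed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "fold": rng.integers(0, 10, 200),
    "site": rng.integers(0, 4, 200),
    "time": rng.integers(0, 3, 200),
    "clicks": rng.integers(0, 100, 200),
    "impressions": rng.integers(100, 1000, 200),
})

def cv_splits(df, k=10):
    """Yield (train, validation) pairs, each summed over site and time."""
    for fold in range(k):
        mask = df["fold"] == fold
        cols = ["clicks", "impressions"]
        train = df[~mask].groupby(["site", "time"])[cols].sum()
        val = df[mask].groupby(["site", "time"])[cols].sum()
        yield train, val
```

Each yielded pair is already aggregated over the fold index, so the two frames are indexed by (site, time) only.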
What is the right way of doing this in pandas?
The way I'm thinking of doing it at the moment is:
df_sum = df.groupby(['fold', 'site', 'time']).sum()
# so df_sum has indices fold, site, time

# create a new Series object, dat, with name='cross', by mapping fold
# indices to 'training'/'validation'
df_train_val = df_sum.groupby([dat, 'site', 'time']).sum()
df_train_val.xs('validation', level='cross')
Now the direct problem I run into is that groupby will happily take a Series object alongside column names, but not alongside MultiIndex level names [the df_train_val assignment above doesn't work]. Obviously I could use reset_index, but given that I want to group over site and time [to aggregate over folds 1 to 9, say] this seems wrong. (I assume grouping is much faster on indices than on 'raw' columns.)
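The closest I've got to making this work is to pass an index-aligned Series for the mapping together with pd.Grouper(level=...) for the index levels. A sketch on toy data (val_fold and the variable names are my own; the 'cross' Series plays the role of dat above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "fold": rng.integers(0, 10, 200),
    "site": rng.integers(0, 4, 200),
    "time": rng.integers(0, 3, 200),
    "clicks": rng.integers(0, 100, 200),
    "impressions": rng.integers(100, 1000, 200),
})
df_sum = df.groupby(["fold", "site", "time"]).sum()

val_fold = 0
# label each row of df_sum by whether its fold is the held-out one;
# the Series is aligned on df_sum's MultiIndex and named 'cross'
cross = pd.Series(
    np.where(df_sum.index.get_level_values("fold") == val_fold,
             "validation", "training"),
    index=df_sum.index, name="cross",
)
df_train_val = df_sum.groupby(
    [cross, pd.Grouper(level="site"), pd.Grouper(level="time")]
).sum()
val = df_train_val.xs("validation", level="cross")
```

This seems to do the regrouping without reset_index, but I'm not sure it's the intended idiom, which is why I'm asking.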
So, Question 1: is this the right way to do cross-validation followed by aggregation in pandas, and more generally to group and then regroup based on MultiIndex values?
Question 2: is there a way of mixing arbitrary mappings with multilevel indices?