3

I am trying to split a large data frame into smaller data frames based on a number of conditions in R. I would like each of these data frames to be named according to the variables on which they were split, but as there are over 1660 new "sub" data frames, I cannot do this manually.

An example of the whole data frame:

ID LENGTH    GRADE    CODE      DURATION    STATUS

1   1          A1      ABC      Less than 10     Y            
2   2          A1      ABC      More than 10     Y            
3   1          A1      DEF      Less than 10     Y                     
4   2          A2      ABC      Less than 10     Y                      
5   1          B1      ABC      More than 10     Y                      
6   3          B2      DEF      Less than 10     Y 

There are over 900,000 entries to be split by 7 variables into about 1660 non-empty groups - I have found this by creating a new grouped dataframe

> Grouped_DF<- DF %>% group_by(LENGTH,GRADE,CODE,DURATION,STATUS,...)

> nrow(Grouped_Data)
[1] 1660

which consists of the groups I desire, but now I want to create a new dataframe for each of these groups, with all of the entries that fall into each group. I have tried using the split function:

SplitGroups<-split(DF, with(DF, interaction(LENGTH,GRADE,CODE,DURATION,STATUS,..)))

Which generates the following list:

> class(SplitGroups)
[1] "list"
> length(SplitGroups)
[1] 24480

An example of the output:

> SplitGroups
$1.A1.ABC.Less Than 10.N`
# A tibble: 10 x 65
# Groups:   ID [10]
# ... with 65 variables: 

Now I want to take the non empty dataframes, rename them as, for example, '1.A1.ABC.Less Than 10.N' (or something similar) and store this into the global environment.

I am aware this could be done using subset, for example:

1.A1.ABC.LessThan10.N <- subset(DF, LENGTH==1 & GRADE=="A1" & CODE=="ABC" & .....) 

and so on, but this is not practical for the number of subsets needed.

Any help would be appreciated, thanks.

erat93
  • 43
  • 3
  • 1
    `splitGroups 0]` will drop all empty data.frames. Storing this in a list is probably the preferable method, as it is easier to systematically interact with the data.frames. See [gregor's post here](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames) on some tips for doing this. – lmo Jul 26 '17 at 12:21
  • We can use `Filter(function(x) nrow(x) > 0, splitGroups)` – akrun Jul 26 '17 at 12:30

1 Answers1

1

I would also recommend keeping your data frames in a list, but here's a solution to what you wanted

Your data:

df <- data.frame(ID=1:6, LENGTH=c(1,2,1,2,1,3),
             GRADE=c(rep("A1",3),"A2","B1","B2"),
             CODE=c("ABC","ABC","DEF","ABC","ABC","DEF"),
             DURATION=c("Less than 10", "More than 10", "Less than 10",
                      "Less than 10", "More than 10", "Less than 10"),
             STATUS=rep("Y",6), stringsAsFactors=F)

Grouping your data by all columns and adding a column (newnames)to create a unique variable name. *Note that I removed spaces from DURATION using gsub and prepended newnames with a Z because R does not like variable names that start with a number:

grp.df <- df %>% 
         group_by_all() %>%
         mutate(newnames=paste0("Z",ID,LENGTH,GRADE,CODE,gsub(" ", "", DURATION),STATUS))

Splitting your data frame into a list (as you've already done...)

split.df <- split(grp.df, grp.df$newnames)

Saving split data frames by new variable names using assign

for (I in 1:length(split.df)) { assign(unique(split.df[[I]]$newnames), split.df[[I]]) }
CPak
  • 12,079
  • 2
  • 20
  • 38