I have a large data frame (almost 100m rows) and want to subset it by factor, i.e. put the complete data for the first 100 factors into one data frame, the next 100 into another, and so on. Alternatively (I'm not sure about this one), factors (categories) starting with the letters A:J could go into one batch, L:R into another data frame, and so on.

(I'm actually facing memory issues when dealing with large data frames; a simple split by rows doesn't help with the problem I'm working on.)

Any suggestions appreciated. Thanks.

  Sample data set

ID  FACTORS VALUE
1   ABCD    100
2   ABCD    101
3   ABCD    102
4   ABCD    103
5   ABCD    104
6   DEFG    105
7   DEFG    106
8   DEFG    107
9   DEFG    108
10  DEFG    109
11  DEFG    110
12  HIJK    111
13  HIJK    112
14  HIJK    113
15  HIJK    114
16  HIJK    115
17  HIJK    116
18  MNOP    117
19  MNOP    118
20  MNOP    119
21  MNOP    120
22  MNOP    121
23  99-1    122
24  99-1    123
25  99-1    124
26  99-2    125
27  99-2    126
r2evans
stats03
  • If we use this sample data as an example and said "first 3 factors" instead, does that mean you want one frame to have `ABCD` through `HIJK`, and the next frame to have `MNOP` and `99-1`? – r2evans Aug 04 '18 at 22:31
  • Thanks. Yes, that's the idea: to split the data frame into multiple data frames – stats03 Aug 04 '18 at 22:33
  • Just `subset` the data and then do `split` i.e. `df2 – akrun Aug 04 '18 at 22:39

1 Answer

This is loosely related to the question "Split a vector into chunks in R".

First, let's get the unique factors and split them up into bins of size n:

fctrs <- unique(dat$FACTORS)
fctrs
# [1] "ABCD" "DEFG" "HIJK" "MNOP" "99-1" "99-2"
n <- 3 # set to 100 for your data
fctrgroups <- split(fctrs, ceiling(seq_along(fctrs)/n))
str(fctrgroups)
# List of 2
#  $ 1: chr [1:3] "ABCD" "DEFG" "HIJK"
#  $ 2: chr [1:3] "MNOP" "99-1" "99-2"

(The last group may have fewer than n factors.)
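
If you would rather bin by first letter, as the question also mentions, a rough sketch could look like the following. The bin boundaries and the names `bin` and `fctrgroups_by_letter` are my own assumptions for illustration, not something from the question; anything that does not start with a letter (such as `99-1`) lands in an "other" bin.

```r
# Unique factor values from the sample data
fctrs <- c("ABCD", "DEFG", "HIJK", "MNOP", "99-1", "99-2")

# First character of each factor, upper-cased so the lookup is case-insensitive
first <- toupper(substr(fctrs, 1, 1))

# Hypothetical bins: A-J, K-Z, and everything else
bin <- ifelse(first %in% LETTERS[1:10], "A-J",
       ifelse(first %in% LETTERS[11:26], "K-Z", "other"))

# Named list of factor groups, one element per bin
fctrgroups_by_letter <- split(fctrs, bin)
str(fctrgroups_by_letter)
```

The resulting list has the same shape as `fctrgroups`, so it should be usable in its place in the steps that follow.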

There are two ways you can work through this. If you're going to keep it all in memory but just work on a subset at a time, then I suggest you keep the separated frames in a list and subsequently do your work on each frame with lapply:

lst_o_frames <- lapply(fctrgroups, function(f) subset(dat, FACTORS %in% f))
str(lst_o_frames)
# List of 2
#  $ 1:'data.frame':    17 obs. of  3 variables:
#   ..$ ID     : int [1:17] 1 2 3 4 5 6 7 8 9 10 ...
#   ..$ FACTORS: chr [1:17] "ABCD" "ABCD" "ABCD" "ABCD" ...
#   ..$ VALUE  : int [1:17] 100 101 102 103 104 105 106 107 108 109 ...
#  $ 2:'data.frame':    10 obs. of  3 variables:
#   ..$ ID     : int [1:10] 18 19 20 21 22 23 24 25 26 27
#   ..$ FACTORS: chr [1:10] "MNOP" "MNOP" "MNOP" "MNOP" ...
#   ..$ VALUE  : int [1:10] 117 118 119 120 121 122 123 124 125 126
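
As an alternative sketch (not what I used above), the same list can be built in a single pass by computing a group index per row and handing it to `split`. The names `grp` and `lst_o_frames2` are made up for this example, and `dat` is rebuilt here from the sample data so the snippet runs on its own.

```r
# Sample data from the question
dat <- data.frame(
  ID = 1:27,
  FACTORS = rep(c("ABCD", "DEFG", "HIJK", "MNOP", "99-1", "99-2"),
                times = c(5, 6, 6, 5, 3, 2)),
  VALUE = 100:126,
  stringsAsFactors = FALSE
)

n <- 3 # factors per group
fctrs <- unique(dat$FACTORS)

# Position of each row's factor in fctrs, turned into a bin number 1, 2, ...
grp <- ceiling(match(dat$FACTORS, fctrs) / n)

# One data frame per bin
lst_o_frames2 <- split(dat, grp)
sapply(lst_o_frames2, nrow)
```

Either way the list holds a full copy of the rows, so with 100m rows this only helps if the copy fits in memory alongside the original.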

If you take your work and put it into a user function named myfunc, then you can do

processed_lst_o_frames <- lapply(lst_o_frames, myfunc)
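
For illustration only, here is one guess at what such a `myfunc` might look like: it adds the per-factor mean and median of `VALUE` as new columns, using base R's `ave`. The column names `me` and `med` and the body of the function are assumptions on my part; your real work would go there instead.

```r
# Sample data and grouping, rebuilt so the snippet runs on its own
dat <- data.frame(
  ID = 1:27,
  FACTORS = rep(c("ABCD", "DEFG", "HIJK", "MNOP", "99-1", "99-2"),
                times = c(5, 6, 6, 5, 3, 2)),
  VALUE = 100:126,
  stringsAsFactors = FALSE
)
fctrs <- unique(dat$FACTORS)
fctrgroups <- split(fctrs, ceiling(seq_along(fctrs) / 3))
lst_o_frames <- lapply(fctrgroups, function(f) subset(dat, FACTORS %in% f))

# Hypothetical worker: takes one frame, returns it with two extra columns
myfunc <- function(df) {
  v <- as.numeric(df$VALUE)
  df$me  <- ave(v, df$FACTORS, FUN = mean)   # per-factor mean
  df$med <- ave(v, df$FACTORS, FUN = median) # per-factor median
  df
}

processed_lst_o_frames <- lapply(lst_o_frames, myfunc)
head(processed_lst_o_frames[[1]])
```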

If, however, you just want to save the data to CSVs (or similar) so you can work with them elsewhere, then something like this will work:

for (f in fctrgroups) {
  # f is a character vector of factors; name each file after its first factor, e.g. "ABCD.csv"
  write.csv(subset(dat, FACTORS %in% f), paste0(f[1], ".csv"))
}

Note that this for-loop method is often used to do the actual work on the subset frames, too. Doing it this way is certainly feasible, but it misses a strength of R and the simplifying programming step of "apply some function to each element of a list".
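
To make the comparison concrete, here is a small sketch that does the same (made-up) piece of work both ways: taking the mean of `VALUE` within each group of factors. The names `res_loop` and `res_apply` are just for this example.

```r
# Sample data and grouping, rebuilt so the snippet runs on its own
dat <- data.frame(
  ID = 1:27,
  FACTORS = rep(c("ABCD", "DEFG", "HIJK", "MNOP", "99-1", "99-2"),
                times = c(5, 6, 6, 5, 3, 2)),
  VALUE = 100:126,
  stringsAsFactors = FALSE
)
fctrs <- unique(dat$FACTORS)
fctrgroups <- split(fctrs, ceiling(seq_along(fctrs) / 3))

# Loop version: pre-allocate a result list and fill it one group at a time
res_loop <- vector("list", length(fctrgroups))
for (i in seq_along(fctrgroups)) {
  sub <- subset(dat, FACTORS %in% fctrgroups[[i]])
  res_loop[[i]] <- mean(sub$VALUE)
}

# lapply version: same work, no index bookkeeping, names carried over
res_apply <- lapply(fctrgroups, function(f) mean(subset(dat, FACTORS %in% f)$VALUE))

unlist(res_apply)
```

In both versions only one subset exists at a time and only the small per-group result is kept, which may matter more than the style difference when memory is tight.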

r2evans
  • Thanks a lot, akrun and r2evans. @r2evans, this is just an extended (or different) question: how do I pass a simple function over lst_o_frames? myfunc(){ k% group_by(FACTORS) %>% mutate(me=mean(VALUE),med=median(VALUE)) return(k) } Any help, please? – stats03 Aug 04 '18 at 23:34
  • Something closer to `myfunc % group_by(FACTORS) ...; }`. Perhaps you should look at https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207 for hints at processing a list of frames. – r2evans Aug 04 '18 at 23:51
  • stats03, does this answer your question? If so, please ["accept" it](https://stackoverflow.com/help/someone-answers) (and your other more-recent question, too). Thanks! – r2evans Aug 07 '18 at 15:08