Subsetting a dataset into multiple subsets in R

Question

I have data that looks something like this:

structure(list(ID = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 6L, 
6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"), 
    Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)), .Names = c("ID", 
"Value"), class = "data.frame", row.names = c(NA, -9L))

I would like to divide this dataset into multiple datasets on the basis of the ID values, i.e. one dataset that contains only ID = a, another that contains only ID = b, and so on.

How do I do this subsetting automatically in R? I understand that if the number of distinct values in ID is small, we could just do it manually, but when there are a lot of values under ID, there has to be a smarter way of doing this.

Basically, I would like to have data1, data2, data3, data4, data5, data6, which contain IDs a, b, c, d, e, f respectively — Kenneth Singh, Dec 05 '17 at 17:36
[Keeping data frames in a list is a much better idea.](https://stackoverflow.com/a/24376207/4497050) Even that is usually unnecessary due to grouping options, though which is appropriate depends on the context. — alistaire, Dec 05 '17 at 17:38

Matt W. · Answer 1 · 2017-12-05T17:42:37.583

3

You can use the split function.

df <- structure(list(ID = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 6L, 
6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"), 
    Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)), .Names = c("ID", 
"Value"), class = "data.frame", row.names = c(NA, -9L))

> df
  ID Value
1  a    10
2  b    13
3  b    12
4  c    43
5  d    23
6  e    66
7  f    78
8  f    42
9  f    19

listed_df <- split(df, df$ID)

> listed_df
$a
  ID Value
1  a    10

$b
  ID Value
2  b    13
3  b    12

$c
  ID Value
4  c    43

$d
  ID Value
5  d    23

$e
  ID Value
6  e    66

$f
  ID Value
7  f    78
8  f    42
9  f    19

To access one of these, just index the list with $.

sum(listed_df$f$Value)

You can also lapply a function across each of the dataframes within the list. If you wanted to sum up each Value, for example, you could do:

lapply(listed_df, function(x) sum(x$Value))

You can also do this just by grouping the original dataframe by ID and then performing summarise operations on it from there.
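
As a rough sketch of that grouping idea in base R (the data frame is rebuilt here so the snippet runs on its own, and the names `sums_df` and `sums_vec` are just illustrative), `aggregate` and `tapply` both sum Value per ID without splitting first. dplyr's `group_by`/`summarise` would be another way to do the same thing.

```r
# Rebuild the example data so this snippet is self-contained
df <- data.frame(
  ID    = factor(c("a", "b", "b", "c", "d", "e", "f", "f", "f")),
  Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)
)

# Sum Value within each ID; the result is a data frame with one row per ID
sums_df <- aggregate(Value ~ ID, data = df, FUN = sum)

# The same sums as a named vector (names are the ID levels)
sums_vec <- tapply(df$Value, df$ID, sum)

print(sums_df)
print(sums_vec)
```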

edited Dec 05 '17 at 17:42

answered Dec 05 '17 at 17:40


Yes, split does divide the dataset into the subsets that I want. But I also need to have these subsets as separate dataframes, say data1, data2, data3, data4, data5, data6 that contain IDs a, b, c, d, e, f respectively. How do I do this? – Kenneth Singh Dec 05 '17 at 17:42
1

I answered how you call on it. `listed_df$a` _is_ the dataframe that you're talking about. – Matt W. Dec 05 '17 at 17:43
`> class(listed_df$a) [1] "data.frame"` – Matt W. Dec 05 '17 at 17:44
Yup, I would like to do this automatically. Something like subset[i] = df[df$ID == i,] – Kenneth Singh Dec 05 '17 at 17:45
@KennethSingh If you prefer using copy/paste and find/replace to create lots of code when a simple loop or `lapply` would do, you can put all the data frames in the list to your global environment with `list2env`. But you'll be making things harder on yourself. First set the names of the list to whatever you want the names of the data frames to be. See the [list of data frames FAQ](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207) for some discussion and examples. – Gregor Thomas Dec 05 '17 at 17:45
Basically I should be able to write a line of code that will generate all the subsets, instead of calling them manually. – Kenneth Singh Dec 05 '17 at 17:46
1

Right - what you're asking is how to do the first step automatically and all subsequent steps manually. Matt is trying to show you how to do all steps automatically. – Gregor Thomas Dec 05 '17 at 17:48
that line of code that generates all the subsets is `split(df, df$ID)`. If you want to reference all of them and do an operation to each of them, use `lapply` against the list of dfs. – Matt W. Dec 05 '17 at 17:51
Can you help me do that? So yes, we got a list that contains all my required subsets. How do I write a for loop (or something else) that will give me something like this: for i in 1:length(listed_df) {subset[i] = lapply(listed_df, function(x) x[i])} – Kenneth Singh Dec 05 '17 at 17:56
I am really unsure of my syntax though – Kenneth Singh Dec 05 '17 at 17:56
you don't have to do the for loop in front of it. `lapply` is a vectorized function that will iterate through your list and apply the function you pass it. So just do `lapply(listed_df, function(x) blahblah)` – Matt W. Dec 05 '17 at 17:58
So will this work: listed_df_subset = lapply(listed_df, function(x) x[1]) – Kenneth Singh Dec 05 '17 at 17:59
But wait this will give me the first data frame in listed_df right? So 1 is hard coded here. This is my doubt - how do I make this argument under x[] to be dynamic? – Kenneth Singh Dec 05 '17 at 18:00
1

You can try it yourself. x is _each dataframe_ in the list. So for the "a" df you're doing `listed_df$a[1]`, which pulls out the first column. So you're iterating through each df with that `lapply` call to show the first column of each df. – Matt W. Dec 05 '17 at 18:02
I think we are close but still not there yet. I am not seeing 6 different datasets using the lapply method you suggested. Maybe I am missing something? Can you please help me write a code that will automatically give me 6 different subsets without me having to call each of them manually from listed_df? – Kenneth Singh Dec 05 '17 at 18:07
update your question to include more details with what you're trying to do to each dataframe. What you're asking for is exactly as we explained and is the best way to do it. I want to understand the bigger picture so if you can update details in the question with that information – Matt W. Dec 05 '17 at 18:18
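
For reference, here is a minimal sketch of what the comments above describe, assuming the names data1 to data6 requested in the question. The environment `subsets_env` is a hypothetical name used here to keep the workspace clean; passing `envir = globalenv()` to `list2env` instead would create the six data frames as top-level variables.

```r
# Rebuild the example data so this snippet is self-contained
df <- data.frame(
  ID    = factor(c("a", "b", "b", "c", "d", "e", "f", "f", "f")),
  Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)
)

# One line generates every subset, as a named list of data frames
listed_df <- split(df, df$ID)

# Rename the list elements to data1, ..., data6
names(listed_df) <- paste0("data", seq_along(listed_df))

# Copy each list element into an environment as its own variable
# (subsets_env is an illustrative name, not part of the original answer)
subsets_env <- new.env()
list2env(listed_df, envir = subsets_env)

# data1 now exists as a separate data frame
print(subsets_env$data1)

# Staying in the list is usually simpler: one call covers every subset
row_counts <- sapply(listed_df, nrow)
print(row_counts)
```

Keeping the list has the advantage that any later step is a single `lapply` or `sapply` call, whereas separate variables have to be referenced one name at a time.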

score 0 · Answer 2 · answered Dec 07 '17 at 19:45

This should be pretty easy.

exampleb <- subset(df, ID == 'b')

exampleb
  ID Value
2  b    13
3  b    12
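
That call handles one ID at a time. To cover every ID without repeating it by hand, one possible sketch is to loop over the factor levels. The names `ids` and `subsets` are illustrative, and plain logical indexing is used in place of `subset` here. This gives the same groups that `split(df, df$ID)` produces.

```r
# Rebuild the example data so this snippet is self-contained
df <- data.frame(
  ID    = factor(c("a", "b", "b", "c", "d", "e", "f", "f", "f")),
  Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)
)

# Every distinct ID, taken from the factor levels
ids <- levels(df$ID)

# Build one data frame per ID by filtering rows
subsets <- lapply(ids, function(id) df[df$ID == id, ])

# Name each list element after its ID so it can be accessed as subsets$b
names(subsets) <- ids

print(subsets$b)
```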

Also, take a look at these links.

https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/

https://www.statmethods.net/management/subset.html
