Split dataframe based on value in column - loop over list of id's

Question

I'm trying to split a dataframe based on values in the id column.

what I have:

ids<-as.data.frame(c("a","a","a","b","b","b","c","c","c"))
unique_id<-unique(ids)
values<-as.data.frame(rep(1:3,3))
df<-as.data.frame(cbind(ids,values))
colnames(df)<-c("id","values")

and it looks like:

the code and error I'm getting is:

> for(id in unique_id){
+     paste0("value_for_",id)<-split(df, id = df$id)
+ }
Error in deparse(...) : 
  unused argument (id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))

what I want:

 value_for_a
  id value
  a     1
  a     2
  a     3

 value_for_b
  id value
  b     1
  b     2
  b     3

 value_for_c
  id value
  c     1
  c     2
  c     3

I feel this should be fairly straightforward, but I'm fresh out of ideas. I am not opposed to using more sophisticated methods than a for loop.

Use it without trying to assign to `paste`, and not inside a loop (it's already vectorized). `group_list = split(df, df$id)` is all you need. The names of the list will already be based on the `id` column. — Gregor Thomas, Jan 12 '18 at 22:56
You *shouldn't* want these as separate data frames, a `list` of data frames is much much easier to work with. You can use for loops or `lapply` to process them further in parallel, or still do it one at a time. See [How do I make a list of data frames?](https://stackoverflow.com/a/24376207/903061) for more discussion and tips. — Gregor Thomas, Jan 12 '18 at 22:57
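
A minimal sketch of the approach described in the comments above, using only base R. The data frame is rebuilt here so the block runs on its own; `group_list` is the name used in the comment, while `sums` is an illustrative name that is not from the original post. Note that the grouping argument of `split()` is called `f`, not `id`.

```r
# Rebuild the example data from the question
df <- data.frame(
  id = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  values = rep(1:3, 3)
)

# split() is already vectorized over the grouping column, so no loop is needed
group_list <- split(df, df$id)

# The list elements are named after the unique values of df$id
names(group_list)

# Pull out a single group by name
group_list[["a"]]

# Apply the same processing to every group (here: the sum of each values column)
sums <- lapply(group_list, function(d) sum(d$values))
```

Keeping the pieces in one named list means any later step can be written once and applied to every group with `lapply`.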

InfiniteFlash · Accepted Answer · 2018-01-12T23:01:29.567

1

You can use nest() for this.

library(dplyr)  # provides %>% and group_by()
library(tidyr)  # provides nest()
df %>%
  group_by(id) %>%
  nest()

# A tibble: 3 x 2
  id     data            
  <fctr> <list>          
1 a      <tibble [3 x 1]>
2 b      <tibble [3 x 1]>
3 c      <tibble [3 x 1]>

Each tibble contains the values you're interested in.

df %>%
  group_by(id) %>%
  nest() %>%
  .$data


[[1]]
# A tibble: 3 x 1
  values
   <int>
1      1
2      2
3      3

[[2]]
# A tibble: 3 x 1
  values
   <int>
1      1
2      2
3      3

[[3]]
# A tibble: 3 x 1
  values
   <int>
1      1
2      2
3      3
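
To pull a particular group's tibble back out of the nested result, one option is to look it up by its `id`. This is a sketch that assumes dplyr and tidyr 1.0.0 or later are installed (for the `cols` argument of `unnest()`); `nested`, `b_data` and `flat` are illustrative names, not from the original answer.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  id = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  values = rep(1:3, 3)
)

nested <- df %>%
  group_by(id) %>%
  nest()

# Look up the nested tibble whose id is "b"
b_data <- nested$data[[match("b", nested$id)]]

# Flatten the nested structure back into a single data frame
flat <- nested %>%
  unnest(cols = data) %>%
  ungroup()
```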

edited Jan 12 '18 at 23:01

answered Jan 12 '18 at 22:28

this is beautiful. how would I go about calling those data frames later? I need to do other processing on them and combine them with other data frames. – longlivebrew Jan 12 '18 at 22:37
Well, you can use the `id` column as a reference, so you would be able to refer to the data.frames you're interested in by using `id` corresponding to the `data` you're interested in. – InfiniteFlash Jan 12 '18 at 22:40
"*Why not use nest() for this?*" Because `split` does the same thing, and is built in to base R so it doesn't need any external dependencies? `split(df, df$id)`. – Gregor Thomas Jan 12 '18 at 22:58
Ah, let me rephrase then. I didn't mean for it to be a question. – InfiniteFlash Jan 12 '18 at 23:01
I did understand that your question was rhetorical, but I think it's a good question to ask: *Why use* `library(tidyr); library(dplyr); df%>% group_by(id)%>% nest()` *?*, when `split(df, df$id)` does the same thing? I think the only reasonable answer is "if you're already using `dplyr`, and maybe `purrr`, and for future workflow reasons you want data frames embedded in each other rather than just a simple list, here's how to do it." – Gregor Thomas Jan 12 '18 at 23:46
That is a fair criticism of what I've put above. – InfiniteFlash Jan 12 '18 at 23:52
@Gregor - looks like there's issues, and not entirely sure what they mean... Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : group length is 0 but data length > 0 – longlivebrew Jan 15 '18 at 21:12
@Meg It works fine on the data in your question. Make sure you spelled everything right, and you're using a column in your data as the second argument. (That's the error you would get if you did `split(df, df$some_column_that_doesnt_exist)` ) – Gregor Thomas Jan 15 '18 at 21:16
@Gregor thank you, this is exactly what I needed. Eventually I'm going to need to process other data frames, and match up timestamps (which I neglected from this example for simplicity), and ids - I'm not sure how to go about doing that if IDs are within dataframes of a list... hmm – longlivebrew Jan 15 '18 at 21:21
Have a look at [How to make a list of data frames](https://stackoverflow.com/a/24376207/903061). Using a list of data frames should be much easier than using multiple data frames not in a list. (Though the simplest, of course, is only a single data frame.) – Gregor Thomas Jan 15 '18 at 21:26
alrighty - I'll take a look. thanks again for the help on this - split(df, df$id) did the trick! feel free to post it in an answer so I can upvote you :P – longlivebrew Jan 15 '18 at 21:45

score 0 · Answer 2 · answered Jan 12 '18 at 22:04

I would recommend splitting the data frame with the split() function, a base R function that does exactly what you want.

For example:

# Using OPs data
split(df, df$id)

This splits df by the id column. The output is a list of data frames, one per id:

$a
  id values
1  a      1
2  a      2
3  a      3

$b
  id values
4  b      1
5  b      2
6  b      3

$c
  id values
7  c      1
8  c      2
9  c      3

You can get the names you want with:

myList <- split(df, df$id)
names(myList) <- paste0("value_for_", names(myList))
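
If the renamed list needs further processing, the pieces can be reached by name and stitched back together afterwards. The sketch below rebuilds the `df` from the question so that it is self-contained; `doubled` and `recombined` are illustrative names, and the doubling step is only a stand-in for whatever processing is needed. The final `list2env()` call is one way to get the separate variables the question asked for, though keeping the list is usually easier to work with.

```r
# Rebuild the example data from the question
df <- data.frame(
  id = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  values = rep(1:3, 3)
)

myList <- split(df, df$id)
names(myList) <- paste0("value_for_", names(myList))

# Access one piece by its new name
myList$value_for_b

# Process every piece the same way (placeholder step: double the values)
doubled <- lapply(myList, function(d) {
  d$values <- d$values * 2
  d
})

# Stack the processed pieces back into one data frame
recombined <- do.call(rbind, doubled)
rownames(recombined) <- NULL

# Optional: create value_for_a, value_for_b, value_for_c as separate
# variables in the current environment
list2env(myList, envir = environment())
```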
