I'm trying to split a dataframe based on values in the id column.

what I have:

ids<-as.data.frame(c("a","a","a","b","b","b","c","c","c"))
unique_id<-unique(ids)
values<-as.data.frame(rep(1:3,3))
df<-as.data.frame(cbind(ids,values))
colnames(df)<-c("id","values")

and it looks like:

> df
  id values
  a      1
  a      2
  a      3
  b      1
  b      2
  b      3
  c      1
  c      2
  c      3

The code I'm running and the error I get:

> for(id in unique_id){
+     paste0("value_for_",id)<-split(df, id = df$id)
+ }
Error in deparse(...) : 
  unused argument (id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))

what I want:

 value_for_a
  id value
  a     1
  a     2
  a     3

 value_for_b
  id value
  b     1
  b     2
  b     3

 value_for_c
  id value
  c     1
  c     2
  c     3

I feel this should be fairly straightforward, but I'm fresh out of ideas. I am not opposed to using more sophisticated methods than a for loop.

pogibas
longlivebrew
  • Use `split`; `split(df, df$id)` – CPak Jan 12 '18 at 22:03
  • is that any different from what is in the code? – longlivebrew Jan 12 '18 at 22:36
  • Use it without trying to assign to `paste`, and not inside a loop (it's already vectorized). `group_list = split(df, df$id)` is all you need. The names of the list will already be based on the `id` column. – Gregor Thomas Jan 12 '18 at 22:56
  • You *shouldn't* want these as separate data frames, a `list` of data frames is much much easier to work with. You can use for loops or `lapply` to process them further in parallel, or still do it one at a time. See [How do I make a list of data frames?](https://stackoverflow.com/a/24376207/903061) for more discussion and tips. – Gregor Thomas Jan 12 '18 at 22:57
  • If you really want to do this, look at `list2env`. – A5C1D2H2I1M1N2O1R2T1 Jan 13 '18 at 07:30

2 Answers

You can use nest() for this.

library(dplyr)  # provides group_by() and %>%
library(tidyr)  # provides nest()

df %>%
  group_by(id) %>%
  nest()

# A tibble: 3 x 2
  id     data            
  <fctr> <list>          
1 a      <tibble [3 x 1]>
2 b      <tibble [3 x 1]>
3 c      <tibble [3 x 1]>

Each tibble contains the values you're interested in.

df %>%
  group_by(id) %>%
  nest() %>%
  .$data


[[1]]
# A tibble: 3 x 1
  values
   <int>
1      1
2      2
3      3

[[2]]
# A tibble: 3 x 1
  values
   <int>
1      1
2      2
3      3

[[3]]
# A tibble: 3 x 1
  values
   <int>
1      1
2      2
3      3
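
To pick out a single nested data frame later, one option is to index the `data` list-column by the matching `id`. This is a minimal sketch, assuming the `df` from the question; the names `nested` and `values_b` are only illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(id = rep(c("a", "b", "c"), each = 3),
                 values = rep(1:3, 3))

nested <- df %>%
  group_by(id) %>%
  nest()

# Extract the nested tibble whose id is "b" (hypothetical variable name)
values_b <- nested$data[[which(nested$id == "b")]]
values_b
```
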
InfiniteFlash
  • this is beautiful. how would I go about calling those data frames later? I need to do other processing on them and combine them with other data frames. – longlivebrew Jan 12 '18 at 22:37
  • Well, you can use the `id` column as a reference, so you would be able to refer to the data.frames you're interested in by using `id` corresponding to the `data` you're interested in. – InfiniteFlash Jan 12 '18 at 22:40
  • "*Why not use nest() for this?*" Because `split` does the same thing, and is built in to base R so it doesn't need any external dependencies? `split(df, df$id)`. – Gregor Thomas Jan 12 '18 at 22:58
  • Ah, let me rephrase my wording then. I didn't mean it for it to be a question. – InfiniteFlash Jan 12 '18 at 23:01
  • I did understand that your question was rhetorical, but I think it's a good question to ask: *Why use* `library(tidyr); library(dplyr); df%>% group_by(id)%>% nest()` *?*, when `split(df, df$id)` does the same thing? I think the only reasonable answer is "if you're already using `dplyr`, and maybe `purrr`, and for future workflow reasons you want data frames embedded in each other rather than just a simple list, here's how to do it." – Gregor Thomas Jan 12 '18 at 23:46
  • That is a fair criticism of what I've put above. – InfiniteFlash Jan 12 '18 at 23:52
  • @Gregor - looks like there's issues, and not entirely sure what they mean... Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : group length is 0 but data length > 0 – longlivebrew Jan 15 '18 at 21:12
  • @Meg It works fine on the data in your question. Make sure you spelled everything right, and you're using a column in your data as the second argument. (That's the error you would get if you did `split(df, df$some_column_that_doesnt_exist)` ) – Gregor Thomas Jan 15 '18 at 21:16
  • @Gregor thank you, this is exactly what I needed. Eventually I'm going to need to process other data frames, and match up timestamps (which I neglected from this example for simplicity), and ids - I'm not sure how to go about doing that if IDs are within dataframes of a list... hmm – longlivebrew Jan 15 '18 at 21:21
  • Have a look at [How to make a list of data frames](https://stackoverflow.com/a/24376207/903061). Using a list of data frames should be much easier than using multiple data frames not in a list. (Though the simplest, of course, is only a single data frame.) – Gregor Thomas Jan 15 '18 at 21:26
  • alrighty - I'll take a look. thanks again for the help on this - split(df, df$id) did the trick! feel free to post it in an answer so I can upvote you :P – longlivebrew Jan 15 '18 at 21:45

I would recommend splitting the data frame with the split() function, which is built into base R and does exactly what you want.

For example:

# Using OPs data
split(df, df$id)

This splits df by the values in its id column. The result is a list of data frames, one per id, named after the id values.

$a
  id values
1  a      1
2  a      2
3  a      3

$b
  id values
4  b      1
5  b      2
6  b      3

$c
  id values
7  c      1
8  c      2
9  c      3
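
Once the data is in a list, each group can be reached by name and all groups can be processed in one go. The following is a small sketch, assuming the `df` from the question; `groups`, `sums`, and `combined` are illustrative names, and summing `values` stands in for whatever processing is actually needed.

```r
# Rebuild the example data from the question
df <- data.frame(id = rep(c("a", "b", "c"), each = 3),
                 values = rep(1:3, 3))

groups <- split(df, df$id)

# Access a single group by its name
groups[["b"]]

# Apply the same step to every group (here: the sum of `values`)
sums <- sapply(groups, function(g) sum(g$values))

# Stack the groups back into one data frame when finished
combined <- do.call(rbind, groups)
```
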

To get the names you want, rename the list elements:

myList <- split(df, df$id)
names(myList) <- paste0("value_for_", names(myList))
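
If separate variables really are needed, as opposed to a list, `list2env()` (mentioned in the comments on the question) can copy each named element into an environment. This is a hedged sketch, assuming the `df` from the question; keeping the list is usually the easier option.

```r
# Rebuild the example data from the question
df <- data.frame(id = rep(c("a", "b", "c"), each = 3),
                 values = rep(1:3, 3))

myList <- split(df, df$id)
names(myList) <- paste0("value_for_", names(myList))

# Create one variable per list element in the global environment;
# invisible() suppresses printing of the returned environment
invisible(list2env(myList, envir = .GlobalEnv))

value_for_a
```
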
pogibas