1

I have a dataset in r with two columns of numerical data and one with an identifier. Some of the rows share the same identifier (i.e. they are the same individual), but contain different data. I want to use the identifier to move those that share an identifier from a row into a columns. There are currently 600 rows, but there should be 400.

Can anyone share r code that might do this? I am new to R, and have tried the reshape (cast) programme, but I can't really follow it, and am not sure it's exactly what i'm trying to do.

Any help gratefully appreciated.

UPDATE:

Current

ID Age Sex  
1   3   1  
1   5   1  
1   6   1  
1   7   1  
2   1   2  
2   12  2  
2   5   2  
3   3   1  

Expected output

ID Age Sex Age2 Sex2 Age3 Sex3 Age4 Sex4   
1   3   1   5    1     6    1    7    1
2   1   2   12   2     5    2
3   3   1  

UPDATE 2:

So far I have tried using the melt and dcast commands from reshape2. I am getting there, but it still doesn't look quite right. Here is my code:

x <- melt(example, id.vars = "ID")

x$time <- ave(x$ID, x$ID, FUN = seq_along)

example2 <- dcast (x, ID ~ time, value.var = "value")

and here is the output using that code:

ID  A   B   C    D     E    F    G    H (for clarity i have labelled these) 
1   3   5   6    7     1    1    1    1
2   1   12  5    2     2    2
3   3   1

So, as you can probably see, it is mixing up the 'sex' and 'age' variables and combining them in the same column. For example column D has the value '7' for person 1 (age4), but '2' for person 2 (Sex). I can see that my code is not instructing where the numerical values should be cast to, but I do not know how to code that part. Any ideas?

abbg662
  • 11
  • 3

2 Answers2

1

Here's an approach using gather, spread and unite from the tidyr package:

suppressPackageStartupMessages(library(tidyverse))  
x <- tribble(
  ~ID, ~Age, ~Sex,  
  1,   3,   1,  
  1,   5,   1,  
  1,   6,   1,  
  1,   7,   1,  
  2,   1,   2,  
  2,   12,  2,  
  2,   5,   2,  
  3,   3,   1  
)

x %>% group_by(ID) %>%
  mutate(grp = 1:n()) %>%
  gather(var, val, -ID, -grp) %>%
  unite("var_grp", var, grp, sep ='') %>%
  spread(var_grp, val, fill = '')
#> # A tibble: 3 x 9
#> # Groups:   ID [3]
#>      ID  Age1  Age2  Age3  Age4  Sex1  Sex2  Sex3  Sex4
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     1     3     5     6     7     1     1     1     1
#> 2     2     1    12     5           2     2     2      
#> 3     3     3                       1

If you prefer to keep the columns numeric then just remove the fill='' argument from spread(var_grp, val, fill = '').

Other questions which might help with this include:

R spreading multiple columns with tidyr

How can I spread repeated measures of multiple variables into wide format?

markdly
  • 3,899
  • 2
  • 14
  • 24
  • Many thanks for your help - this worked for this particular problem. I'll now try and apply it to my real dataset. – abbg662 Sep 19 '17 at 14:17
0

I have recently come across a similar issue in my data, and wanted to provide an update using the tidyr 1.0 functions as gather and spread have been retired. The new pivot_longer and pivot_wider are currently much slower than gather and spread, especially on very large datasets, but this is supposedly fixed in the next update of tidyr, so hope this updated solution is useful to people.

library(tidyr)
library(dplyr)

x %>%
  group_by(ID) %>%
  mutate(grp = 1:n()) %>%
  pivot_longer(-c(ID, grp), names_to = "var", values_to = "val") %>%
  unite("var_grp", var, grp, sep = "") %>%
  pivot_wider(names_from = var_grp, values_from = val)

#> # A tibble: 3 x 9
#> # Groups:   ID [3]
#>      ID  Age1  Sex1  Age2  Sex2  Age3  Sex3  Age4  Sex4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     3     1     5     1     6     1     7     1
#> 2     2     1     2    12     2     5     2    NA    NA
#> 3     3     3     1    NA    NA    NA    NA    NA    NA
caldwellst
  • 2,477
  • 3
  • 14