1

I am trying to melt/stack/gather multiple specific columns of a dataframe into 2 columns, retaining all the others. I have tried many, many answers on stackoverflow without success (some below). I basically have a situation similar to this post here: Reshaping multiple sets of measurement columns (wide format) into single columns (long format) only many more columns to retain and combine. It is important to mention my year columns are factors and I have many, many more columns than the sample listed below so I want to call column names not positions.

>df
ID Code Country     year.x   value.x  year.y value.y year.x.x value.x.x              
1  A    USA         2000     34.33422 2001 35.35241  2002   42.30042 
1  A    Spain       2000     34.71842 2001 39.82727  2002   43.22209 
3  B    USA         2000     35.98180 2001 37.70768  2002   44.40232 
3  B    Peru        2000     33.00000 2001 37.66468  2002   41.30232 
4  C    Argentina   2000     37.78005 2001 39.25627  2002   45.72927 
4  C    Peru        2000     40.52575 2001 40.55918  2002   46.62914

I tried using the pivot_longer in tidyr based on the post above which seemed very similar, which resulted in various errors depending on what I did:

pivot_longer(df, 
             cols = -c(ID, Code, Country), 
             names_to = c(".value", "group"),
             names_sep = ".")

I also played with melt in reshape2 in various ways which either melted only the values columns or only the years columns. Such as:

new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")

I also tried dplyr gather based on other posts but I find it extremely difficult to understand the help page and posts. To be clear what I am looking to achieve:

ID Code Country  year   value                
1  A    USA      2000   34.33422  
1  A    Spain    2000   34.71842  
3  B    USA      2000   35.98180  
3  B    Peru     2000   33.00000  
4  C    Argentina2000   37.78005  
4  C    Peru     2000   40.52575 
1  A    USA      2001   35.35241  
1  A    Spain    2001   39.82727  
3  B    USA      2001   37.70768  
3  B    Peru     2001   37.66468  
4  C    Argentina2001   39.25627  
4  C    Peru     2001   40.55918 
1  A    USA      2002   42.30042  
etc.

I really appreciate the help here.

KNN
  • 420
  • 2
  • 12
  • Possibly related: https://stackoverflow.com/q/57549204/5325862, https://stackoverflow.com/q/54816348/5325862 – camille Oct 08 '19 at 17:44
  • Possible duplicate of [Gather multiple sets of columns](https://stackoverflow.com/questions/25925556/gather-multiple-sets-of-columns) – camille Oct 08 '19 at 17:45
  • 1
    That post has the same answer as I provided above. – KNN Oct 08 '19 at 17:50
  • @KNN If you are updating the example, please do keep the original one because updating and removing the original input example is kind of defeating the original answer's effort – akrun Oct 08 '19 at 18:02
  • That post has 5 answers, and was then marked as a duplicate of one with 7 answers. None of those can be adapted to help? Have you tried escaping the `.` in your separator, since it's a regex special character? – camille Oct 08 '19 at 18:27
  • As far as I can see with the updates on the OP's post, it is not a dupe – akrun Oct 08 '19 at 18:30
  • This has become a pretty different question since the updates. In its current state, I agree with @akrun and am retracting my close vote. But that's why it's helpful for the question you post initially to be the one you actually want answered – camille Oct 08 '19 at 18:42

1 Answers1

3

We can specify the names_pattern

library(tidyr)
library(dplyr)
df %>%  
   pivot_longer(cols = -c(ID, Code, Country),
       names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")

Or use the names_sep with escaped . as according to ?pivot_longer

names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).

which implies that by default the regex is on and the . in regex matches any character and not the literal dot. To get the literal value, either escape or place it inside square bracket

pivot_longer(df, 
         cols = -c(ID, Code, Country), 
          names_to = c(".value", "group"),
          names_sep = "\\.")
# A tibble: 18 x 6
#      ID Code  Country   group  year value
#   <int> <chr> <chr>     <chr> <int> <dbl>
# 1     1 A     USA       x      2000  34.3
# 2     1 A     USA       y      2001  35.4
# 3     1 A     USA       z      2002  42.3
# 4     1 A     Spain     x      2000  34.7
# 5     1 A     Spain     y      2001  39.8
# 6     1 A     Spain     z      2002  43.2
# 7     3 B     USA       x      2000  36.0
# 8     3 B     USA       y      2001  37.7
# 9     3 B     USA       z      2002  44.4
#10     3 B     Peru      x      2000  33  
#11     3 B     Peru      y      2001  37.7
#12     3 B     Peru      z      2002  41.3
#13     4 C     Argentina x      2000  37.8
#14     4 C     Argentina y      2001  39.3
#15     4 C     Argentina z      2002  45.7
#16     4 C     Peru      x      2000  40.5
#17     4 C     Peru      y      2001  40.6
#18     4 C     Peru      z      2002  46.6

Update

For the updated dataset

library(stringr)
df2 %>% 
   rename_at(vars(matches("year|value")), ~ 
     str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>% 
     pivot_longer(cols = -c(ID, Code, Country),
        names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")

Or without the rename, use regex lookaround

df2 %>%
   pivot_longer(cols = -c(ID, Code, Country), 
       names_to = c(".value", "group"),
           names_sep = "(?<=year|value)\\.")

data

df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A", 
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA", 
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L, 
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818, 
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L, 
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468, 
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L, 
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927, 
46.62914)), class = "data.frame", row.names = c(NA, -6L))



df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A", 
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA", 
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L, 
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818, 
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L, 
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468, 
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L, 
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232, 
45.72927, 46.62914)), class = "data.frame", row.names = c(NA, 
-6L))
akrun
  • 674,427
  • 24
  • 381
  • 486
  • So I should have added that my columns continue as such: value.x.x.x and year.x.x.x, value.y.y.y, year.y.y.y, etc. That is from a previous left_join. – KNN Oct 08 '19 at 17:54
  • I know. I was trying to simplify the many columns I had for the example so it was my mistake. – KNN Oct 08 '19 at 17:59
  • @KNN As you can see, the `names_pattern` I provided is based on the pattern you showed – akrun Oct 08 '19 at 18:00
  • I updated the question to include more column names. – KNN Oct 08 '19 at 18:06
  • Okay I see what you did. Is there a way to make the str_replace() regex just match everything before the first period so not having to add that pattern for every .? So it is compatible with value.x, value.x.x, value.x.x.x, value.x.x.x.x, etc.? – KNN Oct 08 '19 at 18:25
  • @KNN Can you please check my update `Or without the rename, use regex lookaround` – akrun Oct 08 '19 at 18:27
  • @KNN No problem. I think the dupes linked are not at all dupes for this question – akrun Oct 08 '19 at 18:32