regex (in gathering multiple sets of columns with tidyr)

Question

inspired by hadley's nifty gather approach in this answer I tried to use tidyr's gather() and spread() in combination with a regular expression, regex, but I seem to get it wrong on the regex.

I did study several regex questions; this one, this one, and also at regex101.com. I tried to circumvent the regex by using starts_with(), ends_with() and matches() inspired by this question, but with no luck.

I am asking here in the hope that someone can show where I get it wrong and I can solve it, preferably using, the select helpers from tidyselect.

I need to select 2 regex-groups one up to the last . and one consisting of what comes after the last ., I made this two example below, one where my code s working and one where I am stuck.

First the example that is working,

# install.packages(c("tidyverse"), dependencies = TRUE)
require(tidyverse)

The first data set, that work, looks like this,

myData1 <- tibble(
  id = 1:10,
  Wage.1997.1 = c(NA, 32:38, NA, NA),
  Wage.1997.2 = c(NA, 12:18, NA, NA),
  Wage.1998.1 = c(NA, 42:48, NA, NA),
  Wage.1998.2 = c(NA, 2:8, NA, NA),  
  Wage.1998.3 =  c(NA, 42:48, NA, NA),    
  Job.Type.1997.1 = NA,
  Job.Type.1997.2 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.1 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.2 = c(NA, rep(c('A', 'B'), 4), NA)  
)

and this is how I gather() it,

myData1 %>% gather(key, value, -id) %>%  
   extract(col = key, into = c("variable", "id.job"), regex = "(.*?\\..*?)\\.(.)$") %>% 
   spread(variable, value)
#> # A tibble: 30 x 6
#>       id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
#>    <int> <chr>  <chr>         <chr>         <chr>     <chr>    
#>  1     1 1      <NA>          <NA>          <NA>      <NA>     
#>  2     1 2      <NA>          <NA>          <NA>      <NA>     
#>  3     1 3      <NA>          <NA>          <NA>      <NA>     
#>  4     2 1      <NA>          A             32        42       
#>  5     2 2      A             A             12        2        
#>  6     2 3      <NA>          <NA>          <NA>      42       
#>  7     3 1      <NA>          B             33        43       
#>  8     3 2      B             B             13        3        
#>  9     3 3      <NA>          <NA>          <NA>      43       
#> 10     4 1      <NA>          A             34        44       
#> # ... with 20 more rows

It works, I suspect I overdoing it with the regex, but it works. However, my real data can have either one or two digest at the end, i.e.

The second data, where I get stuck,

myData2 <- tibble(
  id = 1:10,
  Wage.1997.1 = c(NA, 32:38, NA, NA),
  Wage.1997.12 = c(NA, 12:18, NA, NA),
  Wage.1998.1 = c(NA, 42:48, NA, NA),
  Wage.1998.12 = c(NA, 2:8, NA, NA),  
  Wage.1998.13 =  c(NA, 42:48, NA, NA),    
  Job.Type.1997.1 = NA,
  Job.Type.1997.12 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.1 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.12 = c(NA, rep(c('A', 'B'), 4), NA)  
)

Now, this is where I use (0[0-1]|1[0-9])$ for the second group, I also tried thing like \d{1}|\d{2}, but did that not work either.

myData2 %>% gather(key, value, -id) %>% 
     extract(col = key, into = c("variable", "id.job"), 
             regex = "(.*?\\..*?)\\.(0[0-1]|1[0-9])$") %>%  
     spread(variable, value)

The expected output would be something like this,

#> # A tibble: 30 x 6
#>       id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
#>    <int> <chr>  <chr>         <chr>         <chr>     <chr>    
#>  1     1 1      <NA>          <NA>          <NA>      <NA>     
#>  2     1 12     <NA>          <NA>          <NA>      <NA>     
#>  3     1 13     <NA>          <NA>          <NA>      <NA>     
#>  4     2 1      <NA>          A             32        42       
#>  5     2 12     A             A             12        2        
#>  6     2 13     <NA>          <NA>          <NA>      42       
#>  7     3 1      <NA>          B             33        43       
#>  8     3 12     B             B             13        3        
#>  9     3 13     <NA>          <NA>          <NA>      43       
#> 10     4 1      <NA>          A             34        44       
#> # ... with 20 more rows

A simply solution à la t this question using select helpers like starts_with(), ends_with(), matches(), etc. would be appreciated.

Can you show the expected output? Are you looking for `myData2 %>% gather(key, value, -id) %>% extract(col = key, into = c("variable", "id.job"), "^(.*)\\.([^.]+)$") %>% spread(variable, value)` — akrun, Feb 18 '18 at 15:19
@akrun, I added _expected output_. Thank you for your feedback! — Eric Fail, Feb 18 '18 at 15:24

score 3 · Accepted Answer · answered Feb 18 '18 at 15:24

We can change the regex in extract to match characters and capture as group ((.*)) from the start (^) of the string followed by a dot (\\.) and one or more characters that are not a dot captured as a group (([^.]+)) till the end ($) of the string

myData2 %>%
    gather(key, value, -id)  %>% 
    extract(col = key, into = c("variable", "id.job"), "^(.*)\\.([^.]+)$") %>%
    spread(variable, value)
# A tibble: 30 x 6
#      id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
# * <int> <chr>  <chr>         <chr>         <chr>     <chr>    
# 1     1 1      <NA>          <NA>          <NA>      <NA>     
# 2     1 12     <NA>          <NA>          <NA>      <NA>     
# 3     1 13     <NA>          <NA>          <NA>      <NA>     
# 4     2 1      <NA>          A             32        42       
# 5     2 12     A             A             12        2        
# 6     2 13     <NA>          <NA>          <NA>      42       
# 7     3 1      <NA>          B             33        43       
# 8     3 12     B             B             13        3        
# 9     3 13     <NA>          <NA>          <NA>      43       
#10     4 1      <NA>          A             34        44       
# ... with 20 more rows

Thanks a lot. That was super helpful. I realize it might seem mundane, but I spent way to long on this. I guess it's not possible to use _select helpers_ like `starts_with()`, `ends_with()`, `matches()` in this case. Can you possibly recommend any good `regex` tutorials for `R`-people? — Eric Fail, Feb 18 '18 at 15:35
@EricFail Glad to know that it helped. For regex learning, [this](https://www.regular-expressions.info/tutorial.html) seems to be good. In `R`, you may have to add escape for some expressions — akrun, Feb 18 '18 at 15:37

regex (in gathering multiple sets of columns with tidyr)

1 Answers1