How to separate a column in dplyr based on regex

Question

I have the following data frame:

df <- structure(list(X2 = c("BB_137.HVMSC", "BB_138.combined.HVMSC", 
"BB_139.combined.HVMSC", "BB_140.combined.HVMSC", "BB_141.HVMSC", 
"BB_142.combined.HMSC-bm")), .Names = "X2", row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Which looks like this

> df
# A tibble: 6 x 1
                       X2
                    <chr>
1            BB_137.HVMSC
2   BB_138.combined.HVMSC
3   BB_139.combined.HVMSC
4   BB_140.combined.HVMSC
5            BB_141.HVMSC
6 BB_142.combined.HMSC-bm

What I want to do is to separate into two columns (with . as separator), by keeping the last field as second column

              col1 col2
            BB_137 HVMSC
   BB_138.combined HVMSC
   BB_139.combined HVMSC
   BB_140.combined HVMSC
            BB_141 HVMSC
   BB_142.combined HMSC-bm

What's the right way to do it?

My attempt is this:

> df %>% separate(X2, into = c("sid","status", "tiss"), sep = "[.]") 
# A tibble: 6 x 3
     sid   status    tiss
*  <chr>    <chr>   <chr>
1 BB_137    HVMSC    <NA>
2 BB_138 combined   HVMSC
3 BB_139 combined   HVMSC
4 BB_140 combined   HVMSC
5 BB_141    HVMSC    <NA>
6 BB_142 combined HMSC-bm

Warning message: Too few values at 2 locations: 1, 5

score 10 · Accepted Answer · answered Sep 11 '17 at 03:20

10

We can use negative lookahead as separator in the separate function.

library(tidyr)
separate(data = df, col = X2, into = c("col1", "col2"), sep = "(\\.)(?!.*\\.)")

#            col1    col2
#           <chr>   <chr>
#1          BB_137   HVMSC
#2 BB_138.combined   HVMSC
#3 BB_139.combined   HVMSC
#4 BB_140.combined   HVMSC
#5          BB_141   HVMSC
#6 BB_142.combined HMSC-bm

Regex taken from this answer.

answered Sep 11 '17 at 03:20

Ronak Shah

286,338
16
97
143

2

Oh geez... negative lookahead. Every time I think I've got regex covered I learn something new. – thelatemail Sep 11 '17 at 03:25

score 1 · Answer 2 · answered Dec 13 '20 at 17:45

1

We can also use tidyr::extract()

extract(df, X2, c("col1","col2"), "(.*)\\.(H.*)")

answered Dec 13 '20 at 17:45

woodspock

83
1
5

How to separate a column in dplyr based on regex

2 Answers2

Linked