In string column, remove text preceding first comma (delimiter)

Question

This has to be a simple sub or gsub, but I can't seem to find it on Stack Overflow. It is likely a duplicate of something I just can't find.

data

df <- data.frame(c1=c(1:4),c2=c("431, Dallas, TX", "c63728 , Denver, CO", ",New Orleans, LA", "somewhere,NY, NY"))

data desired

df.desired <- data.frame(c1=c(1:4),c2=c("Dallas, TX", "Denver, CO", "New Orleans, LA", "NY, NY"))

Edit: the answer below by Pasqui solves what I originally asked, but I'm modifying the question slightly.

I'd just like to remove the first string and the comma after it, so I'd like it to work on the data below as well:

data

df <- data.frame(c1=c(1:4),c2=c("431, Dallas, TX, 75225", "c63728 , Denver, CO, 80121", ",New Orleans, LA", "somewhere,NY, NY"))

data desired

df.desired <- data.frame(c1=c(1:4),c2=c("Dallas, TX, 75225", "Denver, CO, 80121", "New Orleans, LA", "NY, NY"))

The [`stringi`](https://cran.r-project.org/web/packages/stringi/stringi.pdf) package probably does what you want too; it's very full-featured and vectorized. I wasn't able to figure it out at first glance, though. — smci, Apr 15 '18 at 02:21
Are you guaranteed there are exactly two commas, or at least two commas? R has no right-split that I can find, so you need regex as people showed. — smci, Apr 15 '18 at 02:25

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

library(dplyr)

df %>% 
    mutate(c2 = gsub("(^.*,\\s{0,1})(.*,.*$)", "\\2", c2))

#Output
  c1              c2
1  1      Dallas, TX
2  2      Denver, CO
3  3 New Orleans, LA
4  4          NY, NY

NB: This is a solution based on "capturing groups": they are good in terms of cognitive economy (for the human). There are more efficient options for the machine.
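
One such option, sketched here as an assumption rather than something taken from the answer, is to skip capturing groups and use a negated character class. The vector `x` simply reuses the values from the question's second data frame.

```r
# Sample values copied from the question's second data frame.
x <- c("431, Dallas, TX, 75225", "c63728 , Denver, CO, 80121",
       ",New Orleans, LA", "somewhere,NY, NY")

# [^,]* can only match non-comma characters, so the match ends at the
# first comma; \\s* also removes any whitespace right after that comma.
res <- sub("^[^,]*,\\s*", "", x)
print(res)
```

Because `sub` replaces only the first match and the pattern is anchored with `^`, the later commas are never touched, whatever the number of fields.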

Editing:

Tweaking the regex to cope with both cases

I keep playing with Regex Capturing groups

Given the second data.frame:

df <- data.frame(c1=c(1:4),c2=c("431, Dallas, TX, 75225", "c63728 , Denver, CO, 80121", ",New Orleans, LA", "somewhere,NY, NY"))

We apply:

df %>% 
    mutate(c2 = gsub("(^.*,{1}?)(.*,.*$)", "\\2", c2))

And the output is:

  c1                 c2
1  1  Dallas, TX, 75225
2  2  Denver, CO, 80121
3  3    New Orleans, LA
4  4             NY, NY

It works for your first example as well

This works great for my example, but any idea if there are more than two groups i.e. c19183, Denver, CO, 80121 and I just want to keep all three after the first comma? I.e. there I'd want (Denver, CO, 80121), see the above question edits. — Neal Barsch, Apr 14 '18 at 23:29
@NealBarsch I just saw your comment and edit; thus, I tweaked my regex accordingly. `Capturing Groups` are even more flexible than that :) — Pasqui, Apr 15 '18 at 17:56

score 1 · Accepted Answer · answered Apr 14 '18 at 23:34

With base R you can use:

df$desired  <- trimws(gsub(pattern='^.*?,', replacement = '', df$c2), which='left')

Or with the tidyverse:

library(dplyr)
library(stringr)

df %>% 
  mutate(desired = 
           str_replace(c2, pattern = '^.*?,', replacement = ""),
         desired = str_trim(desired, side='left')) -> df

The '^.*?,' expression matches any characters from the start of the string up to and including the first comma. The ? makes the expression non-greedy, so it stops at the first comma rather than the last, as per this answer on Stack Overflow:

Regular expression to stop at first match
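
The difference between the greedy and non-greedy forms can be seen on a single value from the question. This is only a small illustration; the variable names `greedy` and `lazy` are made up for the example.

```r
x <- "431, Dallas, TX, 75225"

# Greedy: .* runs to the LAST comma, so almost everything is removed.
greedy <- sub("^.*,", "", x)

# Non-greedy: .*? stops at the FIRST comma, keeping the remaining fields.
lazy <- sub("^.*?,", "", x)

print(greedy)
print(lazy)

# The leading space left behind is what trimws(..., which = "left") removes.
print(trimws(lazy, which = "left"))
```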

score 0 · Answer 3 · answered Apr 15 '18 at 06:38

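You could use str_split, remove the first entry of each vector, and then paste the remaining pieces back together:

library(dplyr)
library(stringr)

df %>% 
  mutate(c2 = c2 %>% str_split(",") %>%
           # vapply returns a character vector; lapply would make c2 a list column
           vapply(function(x){
             x[-1] %>% 
               str_trim() %>% 
               str_c(collapse = ", ")
           }, character(1)))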
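
The same split-and-rejoin idea can be sketched in base R. `drop_first_field` is a hypothetical helper name used only for this illustration, and the third test value is invented to show an edge case.

```r
# drop_first_field is a hypothetical helper, not part of any package.
drop_first_field <- function(x) {
  # Split each string on literal commas (no regex needed).
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  # Drop the first piece, trim whitespace, and rejoin the rest.
  vapply(parts, function(p) paste(trimws(p[-1]), collapse = ", "),
         character(1))
}

x <- c("431, Dallas, TX, 75225", ",New Orleans, LA", "no comma here")
res <- drop_first_field(x)
print(res)
```

One design difference worth noting: a value with no comma at all comes back as an empty string here, because its only piece is dropped, whereas the regex approaches leave such a value unchanged.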
