
This has to be a simple sub or gsub, but I can't seem to find it on Stack Overflow. It's likely a duplicate of an existing question, but I can't find that either.

data

df <- data.frame(c1=c(1:4),c2=c("431, Dallas, TX", "c63728 , Denver, CO", ",New Orleans, LA", "somewhere,NY, NY"))

data desired

df.desired <- data.frame(c1=c(1:4),c2=c("Dallas, TX", "Denver, CO", "New Orleans, LA", "NY, NY"))

Edit: Pasqui's answer below works well for what I originally asked, but I'm modifying the question slightly.

I'd just like to remove the first string and its comma, so I'd like the solution to work on the data below as well:

data

df <- data.frame(c1=c(1:4),c2=c("431, Dallas, TX, 75225", "c63728 , Denver, CO, 80121", ",New Orleans, LA", "somewhere,NY, NY"))

data desired

df.desired <- data.frame(c1=c(1:4),c2=c("Dallas, TX, 75225", "Denver, CO, 80121", "New Orleans, LA", "NY, NY"))
Neal Barsch
  • The [`stringi`](https://cran.r-project.org/web/packages/stringi/stringi.pdf) package probably does what you want too; it's very full-featured and vectorized. I wasn't able to figure it out at first glance – smci Apr 15 '18 at 02:21
  • Are you guaranteed there are exactly two commas, or at least two commas? R has no right-split that I can find, so you need regex as people showed. – smci Apr 15 '18 at 02:25

3 Answers

library(dplyr)

df %>% 
    mutate(c2 = gsub("(^.*,\\s{0,1})(.*,.*$)", "\\2", c2))

#Output
  c1              c2
1  1      Dallas, TX
2  2      Denver, CO
3  3 New Orleans, LA
4  4          NY, NY

NB: This is a solution based on "capturing groups": they are good in terms of cognitive economy (for the human). There are more efficient options for the machine.

Edit: tweaking the regex to cope with both cases, still using regex capturing groups.

Given the second data.frame:

df <- data.frame(c1=c(1:4),c2=c("431, Dallas, TX, 75225", "c63728 , Denver, CO, 80121", ",New Orleans, LA", "somewhere,NY, NY"))

We apply:

df %>% 
    mutate(c2 = gsub("(^.*,{1}?)(.*,.*$)", "\\2", c2))

And the output is:

  c1                 c2
1  1  Dallas, TX, 75225
2  2  Denver, CO, 80121
3  3    New Orleans, LA
4  4             NY, NY

It works for your first example as well.

Pasqui
  • This works great for my example, but any idea what to do if there are more than two groups, e.g. `c19183, Denver, CO, 80121`, and I just want to keep all three after the first comma? I.e. there I'd want `Denver, CO, 80121`; see the edits to the question above. – Neal Barsch Apr 14 '18 at 23:29
  • @NealBarsch I just saw your comment and edit; thus, I tweaked my regex accordingly. `Capturing Groups` are even more flexible than that :) – Pasqui Apr 15 '18 at 17:56

With base R you can use:

df$desired  <- trimws(gsub(pattern='^.*?,', replacement = '', df$c2), which='left')

Or with the tidyverse:

library(dplyr)
library(stringr)

df %>% 
  mutate(desired = 
           str_replace(c2, pattern = '^.*?,', replacement = ""),
         desired = str_trim(desired, side='left')) -> df

The '^.*?,' expression matches any characters from the start of the string up to and including the first comma. The ? makes the quantifier non-greedy, so the match stops at the first comma rather than the last, as per this answer on Stack Overflow:

Regular expression to stop at first match
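
To make the greedy versus non-greedy difference concrete, here is a minimal sketch using the values from the question's second data frame. The variable names (`x`, `greedy`, `lazy`, `negated`) are only illustrative, and `perl = TRUE` is an assumption added here so that all three patterns run under the same (PCRE) engine; the negated-class pattern is an alternative not used in the answer above.

```r
# Sample values from the question's second data frame
x <- c("431, Dallas, TX, 75225", "c63728 , Denver, CO, 80121",
       ",New Orleans, LA", "somewhere,NY, NY")

# Greedy: .* runs to the LAST comma, so only the final field survives
greedy <- sub("^.*,", "", x, perl = TRUE)

# Non-greedy: .*? stops at the FIRST comma
lazy <- sub("^.*?,", "", x, perl = TRUE)

# Negated character class: "anything but a comma, then a comma";
# \\s* also consumes the whitespace that follows the comma
negated <- sub("^[^,]*,\\s*", "", x, perl = TRUE)

print(trimws(greedy))
print(trimws(lazy))
print(negated)
```

On this data the non-greedy and negated-class results should agree once the leading space is trimmed, while the greedy version keeps only the text after the last comma.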

RPyStats

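You could use str_split, remove the first entry of each vector, and then paste the rest back together:

library(dplyr)
library(stringr)

df %>% 
  mutate(c2 = c2 %>% str_split(",") %>%
           # vapply returns a character vector; lapply would leave c2 as a list column
           vapply(function(x){
             x[-1] %>% 
               str_trim() %>% 
               str_c(collapse = ", ")
           }, character(1)))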
GordonShumway