Remove duplicated substrings in R

Question

I have a data frame in R as below:

   bacteria    sample
1    A         HM_001
2    B         HM_001_HM_001
3    C         A2_HM_001
4    D         A2_HM_001_HM_001
5    E         HM_002
6    F         HM_002_HM_002
7    G         A2_HM_002
8    H         A2_HM_002_HM_002

and wish to remove the duplicated substrings within each value of the sample column, so that the final output is as below:

   bacteria    sample
1    A         HM_001
2    B         HM_001
3    C         A2_HM_001
4    D         A2_HM_001
5    E         HM_002
6    F         HM_002
7    G         A2_HM_002
8    H         A2_HM_002

score 1 · Accepted Answer · answered Feb 15 '21 at 19:44

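Use a regex with gsub: capture a token of the form letters_digits (for example HM_001), match an optional underscore followed by one or more repeats of that same token, and replace the whole match with the captured token. The result is stored in a new column, sample_new.

df1$sample_new <- with(df1, gsub("([A-Z]+_\\d+)_?\\1+", "\\1", sample))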

Output:

df1
#   bacteria           sample sample_new
#1        A           HM_001     HM_001
#2        B    HM_001_HM_001     HM_001
#3        C        A2_HM_001  A2_HM_001
#4        D A2_HM_001_HM_001  A2_HM_001
#5        E           HM_002     HM_002
#6        F    HM_002_HM_002     HM_002
#7        G        A2_HM_002  A2_HM_002
#8        H A2_HM_002_HM_002  A2_HM_002
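
To see why the pattern leaves the A2_ prefix alone, here is a small sketch that applies the same regex to a plain character vector, with each piece annotated. The vector name x and the result name cleaned are illustrative only; the values are taken from the question.

```r
# Illustrative vector (values copied from the question's sample column)
x <- c("HM_001", "HM_001_HM_001", "A2_HM_001", "A2_HM_001_HM_001")

# ([A-Z]+_\\d+)  group 1: uppercase letters, an underscore, digits (e.g. "HM_001")
# _?             an optional underscore separating the repeats
# \\1+           one or more further copies of whatever group 1 matched
# "A2" cannot start a match because a digit follows the letter, not "_",
# so the engine skips ahead and matches from "HM_001" onward.
cleaned <- gsub("([A-Z]+_\\d+)_?\\1+", "\\1", x)

print(cleaned)
# [1] "HM_001"    "HM_001"    "A2_HM_001" "A2_HM_001"
```

The pattern is tied to the letters_digits token shape seen in this data; sample names with a different layout would likely need a different capture group.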

Data:

df1 <- structure(list(bacteria = c("A", "B", "C", "D", "E", "F", "G", 
"H"), sample = c("HM_001", "HM_001_HM_001", "A2_HM_001", "A2_HM_001_HM_001", 
"HM_002", "HM_002_HM_002", "A2_HM_002", "A2_HM_002_HM_002")), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))
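
The question asks for the sample column itself to be cleaned, while the answer writes to a new column. A minimal sketch of the in-place variant, checked against the desired output from the question, might look like this. The vector expected is introduced here purely for the check, and the data frame is rebuilt with data.frame() only so the snippet runs on its own.

```r
# Rebuild the example data so the snippet is self-contained
df1 <- data.frame(
  bacteria = c("A", "B", "C", "D", "E", "F", "G", "H"),
  sample = c("HM_001", "HM_001_HM_001", "A2_HM_001", "A2_HM_001_HM_001",
             "HM_002", "HM_002_HM_002", "A2_HM_002", "A2_HM_002_HM_002"),
  stringsAsFactors = FALSE
)

# Desired values, copied from the question (name `expected` is illustrative)
expected <- c("HM_001", "HM_001", "A2_HM_001", "A2_HM_001",
              "HM_002", "HM_002", "A2_HM_002", "A2_HM_002")

# Overwrite the existing column instead of creating sample_new
df1$sample <- gsub("([A-Z]+_\\d+)_?\\1+", "\\1", df1$sample)

# Stops with an error if the result differs from the desired output
stopifnot(identical(df1$sample, expected))
print(df1)
```

Overwriting discards the original values, so keeping a separate column as in the answer is the safer choice while the pattern is still being verified.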
