How can one use regular expressions in R to remove the nested parentheses in this example:

chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"

The desired output is

"(Mn,Ca,Zn)5(AsO4)2(AsO3OH)24(H2O)(OHAsO3)(OHAsO3OH)"

I'm trying this (with str_replace_all from stringr), but I'm not able to exclude what's inside the nested brackets:

> str_replace_all(chf,"\\((\\w+)\\)","(gone)")

[1] "(Mn,Ca,Zn)5(gone)2((gone)OH)24(gone)(OH(gone))(OH(gone)OH)"
val

1 Answer

You may use

library(gsubfn)
chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"
gsubfn("\\((?:[^()]++|(?R))*\\)", ~ gsub("(^\\(|\\)$)|[()]", "\\1", x, perl=TRUE), chf, perl=TRUE, backref=0)
# => [1] "(Mn,Ca,Zn)5(AsO4)2(AsO3OH)24(H2O)(OHAsO3)(OHAsO3OH)"

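The \((?:[^()]++|(?R))*\) regex is a well-known PCRE pattern for matching balanced, possibly nested parentheses: \( matches the opening parenthesis, (?:[^()]++|(?R))* consumes either a run of non-parenthesis characters or, via (?R), a recursive match of the whole pattern, and \) matches the closing parenthesis. For each match, gsubfn passes the matched text to the formula as x (backref=0 means the whole match is passed without any capture groups), and gsub("(^\\(|\\)$)|[()]", "\\1", x, perl=TRUE) removes every parenthesis except the first and the last. Here, (^\\(|\\)$) matches the leading ( or the trailing ) and captures it into Group 1, while [()] matches any other ( or ) without capturing. The replacement \\1 is the contents of Group 1, which is empty for the inner parentheses, so only the outer pair survives.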
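To see the two steps separately, here is a small base R sketch (no gsubfn required). The variable names top_level and flattened are only illustrative. It first extracts what the recursive pattern matches in chf, then applies the inner gsub call to those matches.

```r
chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"

# Step 1: extract every top-level balanced group with the recursive pattern
top_level <- regmatches(
  chf,
  gregexpr("\\((?:[^()]++|(?R))*\\)", chf, perl = TRUE)
)[[1]]
# top_level holds:
#   "(Mn,Ca,Zn)" "(AsO4)" "((AsO3)OH)" "(H2O)" "(OH(AsO3))" "(OH(AsO3)OH)"

# Step 2: inside each group, keep only the first "(" and the last ")".
# Group 1 is empty for inner parentheses, so "\\1" deletes them.
flattened <- gsub("(^\\(|\\)$)|[()]", "\\1", top_level, perl = TRUE)
# flattened holds:
#   "(Mn,Ca,Zn)" "(AsO4)" "(AsO3OH)" "(H2O)" "(OHAsO3)" "(OHAsO3OH)"
```

Note that in "(OH(AsO3))" the inner ) is removed even though it sits next to the final one: \)$ can only match at the very end of the string, so the second-to-last ) falls through to [()].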

A base R equivalent solution:

chf <- "(Mn,Ca,Zn)5(AsO4)2((AsO3)OH)24(H2O)(OH(AsO3))(OH(AsO3)OH)"
gre <- gregexpr("\\((?:[^()]++|(?R))*\\)", chf, perl=TRUE)
matches <- regmatches(chf, gre)
regmatches(chf, gre) <- lapply(matches, gsub, pattern="(^\\(|\\)$)|[()]", replacement="\\1")
chf
# => "(Mn,Ca,Zn)5(AsO4)2(AsO3OH)24(H2O)(OHAsO3)(OHAsO3OH)"
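If this is needed for more than one formula, the base R steps above could be wrapped in a small function. This is only a sketch: flatten_parens is a made-up name, and it assumes the parentheses in each input string are balanced.

```r
# Hypothetical helper: strip parentheses nested inside each top-level group.
# Works on a character vector because gregexpr and regmatches are vectorised.
flatten_parens <- function(x) {
  m <- gregexpr("\\((?:[^()]++|(?R))*\\)", x, perl = TRUE)
  regmatches(x, m) <- lapply(
    regmatches(x, m), gsub,
    pattern = "(^\\(|\\)$)|[()]", replacement = "\\1", perl = TRUE
  )
  x
}

res <- flatten_parens(c("Ca(OH(AsO3))2", "(a(b(c)))d(e)"))
# res holds: "Ca(OHAsO3)2" "(abc)d(e)"
```

The second input shows that deeper nesting is handled as well, since every parenthesis other than the outermost pair of each match is dropped.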
Wiktor Stribiżew