I need a regex that replaces all backslashes \\ with \" unless the \\ falls between two dollar signs, as in $\\bar{x}$. I don't know how to say in regex "replace all of these unless they fall between these two characters".

Here's a string and a gsub that gets rid of all \\, even those inside dollar signs:

x <- c("I like \\the big\\ red \\dog\\ $\\hat + \\bar$, here it is $\\bar{x}$",
    "I have $50 to \\spend\\", "$\\frac{4}{5}$ is nice", "$\\30\\ is nice too") 

gsub("\\\\", "\"", x)

## > gsub("\\\\", "\"", x)
## [1] "I like \"the big\" red \"dog\" $\"hat + \"bar$, here it is $\"bar{x}$" 
## [2] "I have $50 to \"spend\""    
## [3] "$\"frac{4}{5}$ is nice"   
## [4] "$\"30\" is nice too"  

What I am after is:

## [1] "I like \"the big\" red \"dog\" $\\hat + \\bar$, here it is $\\bar{x}$" 
## [2] "I have $50 to \"spend\""
## [3] "$\\frac{4}{5}$ is nice"   
## [4] "$\"30\" is nice too" 
Tyler Rinker
  • I don't think that regex is the right tool for this .. you should probably split and join on `$` – Explosion Pills Apr 11 '13 at 00:07
  • What @ExplosionPills said. This language is at best context-free, and I'm pretty sure it's context-sensitive. The biggest problem is things like `"I have $50 to \\spend\\. My sister has $40."` What's appropriate there? – FrankieTheKneeMan Apr 11 '13 at 00:10
  • @FrankieTheKneeMan That's a chance happening I'm willing to deal with. The intended purpose is to grab academic quotes. This may include math (and $$ is the markup method of math). Rarely would I come across two uses of dollar signs like that. – Tyler Rinker Apr 11 '13 at 00:13
  • @Explosion Pills I'm open to other methods; I was looking for the most efficient one. I thought you could maybe `gsub` as I have and then look for occurrences where you have `$$` with `"\\$.+?\\$"`, but I couldn't connect that last dot. – Tyler Rinker Apr 11 '13 at 00:15
  • I'm no R genius, but you should use something like http://stat.ethz.ch/R-manual/R-patched/library/base/html/strsplit.html to split the string on `"$"`, then run your existing gsub on every other resulting piece. (0, 2, 4, etc...) You may also want to run it on the last piece no matter what. Then you should use http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html to put them back together. – FrankieTheKneeMan Apr 11 '13 at 00:19
  • @FrankieTheKneeMan That may be the way to go. Seems inelegant but if necessary I'd go that route. – Tyler Rinker Apr 11 '13 at 00:21
  • I highly recommend using `strsplit(x, "$", fixed=TRUE)`. Then when you paste it back together it's just `paste0(x, collapse="$")` – FrankieTheKneeMan Apr 11 '13 at 00:33
  • @FrankieTheKneeMan unless the `$` is at the end of the string – Tyler Rinker Apr 11 '13 at 00:43
  • Again, I don't know R that well, but most string split libraries would split `"$"` into `["",""]`, which when pasted with `"$"`, would generate `"$"` again. – FrankieTheKneeMan Apr 11 '13 at 00:50
  • Another solution not mentioned here is the simple method from [Match (or replace) a pattern except in situations s1, s2, s3 etc](http://stackoverflow.com/q/23589174/) You would use this simple regex: `\$[^$]*\$|(\\\\)` The left side matches expressions within two dollar signs. We ignore these matches. The right side matches and captures to Group 1 your double backslashes. These are the ones to replace. – zx81 May 27 '14 at 02:12
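
A minimal sketch of a close relative of the idea in the last comment, assuming R with `perl=TRUE`: instead of capturing the backslashes in a group, it uses PCRE's `(*SKIP)(*F)` verbs to throw away anything matched between two dollar signs, so only backslashes outside those regions are left to replace. This is a swapped-in variant, not the commenter's exact regex, and the variable names `pattern` and `out` are only illustrative. Note that each `\\` in the R source below is a single backslash in the actual string.

```r
# Sample strings from the question (each "\\" is one literal backslash).
x <- c("I like \\the big\\ red \\dog\\ $\\hat + \\bar$, here it is $\\bar{x}$",
       "I have $50 to \\spend\\",
       "$\\frac{4}{5}$ is nice",
       "$\\30\\ is nice too")

# Left branch: consume a whole $...$ pair, then discard it with (*SKIP)(*F)
# so that nothing inside it can be matched.
# Right branch: a single literal backslash, which is what gets replaced.
pattern <- "\\$[^$]*\\$(*SKIP)(*F)|\\\\"

out <- gsub(pattern, "\"", x, perl = TRUE)
cat(out, sep = "\n")
```

A lone `$` with no partner (as in the second and fourth strings) never completes the left branch, so the backslashes after it are still replaced.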

2 Answers

If you ignore the context-dependence problem (whether a given $ opens a region to preserve or is just a literal dollar sign), the replacement can be done with a PCRE regex. (It can be patched case by case if the $ that does not mark a region to preserve has an unambiguous form.)

This first pattern assumes that $ always starts and ends a non-replacement region, except for an odd last $ in the string.

Pattern (the first line is RAW regex, the second line is quoted string literal):

\G((?:[^$\\]|\$[^$]*+\$|\$(?![^$]*+\$))*+)\\
"\\G((?:[^$\\\\]|\\$[^$]*+\\$|\\$(?![^$]*+\\$))*+)\\\\"

Replace string:

\1"
"\\1\""


Explanation

The idea is to find the next \ in the string that is not contained between two $. This is achieved by making sure each match starts exactly where the previous one left off (\G), so that we never skip over a literal $ and match a \ inside a $...$ region.

There are 3 forms of sequences that we don't replace:

  • Any character that is neither a literal $ nor a literal \: [^$\\]
  • Any text between two $ (this does not take any escaping mechanism into account): \$[^$]*+\$
  • An odd last $ with no later $ to pair with, so that a \ after it can still be replaced: \$(?![^$]*+\$)

So we just march through any combination of the 3 forms of sequences above, and match the nearest \ for replacement.
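
As a rough sketch of how the first pattern might be applied in R, assuming `gsub` with `perl=TRUE` (needed for `\G`, the possessive quantifiers, and the lookahead) and R's `"\\1"` backreference syntax in the replacement. The sample vector is the one from the question; `pattern` and `out` are illustrative names.

```r
# Sample strings from the question (each "\\" is one literal backslash).
x <- c("I like \\the big\\ red \\dog\\ $\\hat + \\bar$, here it is $\\bar{x}$",
       "I have $50 to \\spend\\",
       "$\\frac{4}{5}$ is nice",
       "$\\30\\ is nice too")

# The quoted form of the first pattern. Group 1 holds everything that was
# marched over before the backslash, so it is written back unchanged.
pattern <- "\\G((?:[^$\\\\]|\\$[^$]*+\\$|\\$(?![^$]*+\\$))*+)\\\\"

# Replacement: the skipped-over text (\1) followed by a double quote.
out <- gsub(pattern, "\\1\"", x, perl = TRUE)
cat(out, sep = "\n")
```

Because every match is anchored with `\G`, the engine cannot restart in the middle of a `$...$` region; once the marching group reaches the end of the string without finding a `\`, no further matches are possible.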

The second pattern makes the same assumption as above, except that $<digit> does not start a non-replacement region.

This will work even with this kind of string:

I have $50 to \spend\. I just $\bar$ remembered that I have another $30 dollars $\left$ from my last \paycheck\. Lone $ \at the end\

Pattern:

\G((?:[^$\\]|\$\d|\$(?![^$]*\$)|\$[^$]*+\$)*+)\\
"\\G((?:[^$\\\\]|\\$\\d|\\$(?![^$]*\\$)|\\$[^$]*+\\$)*+)\\\\"


\$\d is added in front of the \$[^$]*+\$ in alternation to make the engine check for that case first.
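
A short sketch of the second pattern applied to the longer example string, again assuming R's `gsub` with `perl=TRUE`. The names `s`, `pattern`, and `out` are illustrative only.

```r
# The longer example string, with each backslash doubled for the R literal.
s <- paste0("I have $50 to \\spend\\. I just $\\bar$ remembered that I have ",
            "another $30 dollars $\\left$ from my last \\paycheck\\. ",
            "Lone $ \\at the end\\")

# The quoted form of the second pattern: \$\d is tried before the $...$ branch,
# so "$50" and "$30" are stepped over as plain text.
pattern <- "\\G((?:[^$\\\\]|\\$\\d|\\$(?![^$]*\\$)|\\$[^$]*+\\$)*+)\\\\"

out <- gsub(pattern, "\\1\"", s, perl = TRUE)
cat(out, "\n")
```

Here `$\bar$` and `$\left$` should be left alone, while the backslashes around `spend`, `paycheck`, and `at the end` are replaced.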

nhahtdh
  • Looks promising but can't get it to work with R: `invalid regular expression '\G((?:[^$\\]|\$[^$]*+\$)*)\\', reason 'Invalid use of repetition operators'` – Tyler Rinker Apr 11 '13 at 00:33
  • @TylerRinker: You need to enable `perl=TRUE` http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html – nhahtdh Apr 11 '13 at 00:36
  • @TylerRinker: Made a slight change. Please use the 2nd line. The first line is to show the actual form of the regex. – nhahtdh Apr 11 '13 at 00:40
  • It may be an R thing but the results have a lot of dollar signs as you can see in my **EDIT** 2 above – Tyler Rinker Apr 11 '13 at 00:41
  • With `R` to do replacement you do `"\\1"` not `"$1"`. Try `gsub("\\G((?:[^$\\\\]|\\$[^$]*+\\$)*)\\\\", "\\1\"", x, perl=TRUE)`. Seems to struggle with the second and fourth strings though. – mathematical.coffee Apr 11 '13 at 00:47
  • @mathematical.coffee: Thanks. I don't use R myself. Regex flavors differ on whether `$` or ``\`` is used in the replacement string to reference captured text. – nhahtdh Apr 11 '13 at 00:51
  • @TylerRinker: Added a bit to make it work with case 2 and case 4. – nhahtdh Apr 11 '13 at 01:17
  • @nhahtdh mathematical.coffee's solution worked better. I thank you for the work you put in. This was not an easy regex as I thought. Thank you and +1 – Tyler Rinker Apr 11 '13 at 01:38
  • @TylerRinker: It's fine. I am not confident I can maintain this mess either. – nhahtdh Apr 11 '13 at 01:39

Using the strsplit method of @FrankieTheKneeMan:

x <- c("I like \\the big\\ red \\dog\\ $\\hat + \\bar$, here it is $\\bar{x}$",
       "I have $50 to \\spend\\",
       "$\\frac{4}{5}$ is nice",
       "$\\30\\ is nice too") 

# > cat(x, sep='\n')
# I like \the big\ red \dog\ $\hat + \bar$, here it is $\bar{x}$
# I have $50 to \spend\
# $\frac{4}{5}$ is nice
# $\30\ is nice too

# split into parts separated by '$'.
# Add a space at the end of every string to deal with '$'
#  at the end of the string (as
#      strsplit('a$', '$', fixed=T)
#  is just 'a' in R)
bits <- strsplit(paste(x, ''), '$', fixed=TRUE)

# apply the replacement to every second part (starting with the first)
# and always to the last bit (because of the ' ' we added)
out <- sapply(bits, function (parts) {
                   idx <- unique(c(seq(1, length(parts), by=2), length(parts)))
                   parts[idx] <- gsub('\\', '\"', parts[idx], fixed=TRUE)
                   # join back together
                   joined <- paste(parts, collapse='$')
                   # remove that last ' ' we added
                   substring(joined, 1, nchar(joined) - 1)
               }, USE.NAMES=FALSE)

# > cat(out, sep='\n')
# I like "the big" red "dog" $\hat + \bar$, here it is $\bar{x}$
# I have $50 to "spend"
# $\frac{4}{5}$ is nice
# $"30" is nice too

This will always have cases in which it fails, such as "I have $20. \\hi\\ Now I have $30", where the text between the two dollar signs is treated as math and left untouched. Keep that in mind and test it against other strings of the format you are expecting.
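
To make the failure case concrete, here is a small sketch that wraps the strsplit approach from this answer in a helper and runs it on that string. The function name `swap_outside_dollars` is hypothetical, introduced only for this illustration.

```r
# Hypothetical helper wrapping the strsplit approach from this answer.
swap_outside_dollars <- function(x) {
  # Pad with a trailing space so a '$' at the end of a string is not lost.
  bits <- strsplit(paste(x, ""), "$", fixed = TRUE)
  sapply(bits, function(parts) {
    # Odd-numbered parts are outside $...$; the last part is always processed.
    idx <- unique(c(seq(1, length(parts), by = 2), length(parts)))
    parts[idx] <- gsub("\\", "\"", parts[idx], fixed = TRUE)
    joined <- paste(parts, collapse = "$")
    # Drop the padding space again.
    substring(joined, 1, nchar(joined) - 1)
  }, USE.NAMES = FALSE)
}

failing <- "I have $20. \\hi\\ Now I have $30"
result <- swap_outside_dollars(failing)
cat(result, "\n")
# The backslashes around "hi" sit between the two dollar signs, so they are
# treated as part of a math region and come back unchanged.
```

Splitting on `$` gives three parts here, and only the first and third are processed, which is why the middle part with `\hi\` is skipped.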

mathematical.coffee
  • Wow, I tried every which way to make Frankie's method work. Thanks Mathematical Coffee. I'm going to poke around with it a bit and accept. I think regex may have been more difficult than I anticipated. – Tyler Rinker Apr 11 '13 at 01:07
  • It is possible to patch my approach to make it work with `"I have $20. \\hi\\ Now I have $30"`, but it will become very unmaintainable. – nhahtdh Apr 11 '13 at 01:22
  • @mathematical.coffee Thank you I thought this was an easy regex. Not so. Works well. +1 – Tyler Rinker Apr 11 '13 at 01:37