2

I'm trying to remove all characters (including newlines) between two given substrings, using R's gsub("regexp", "", string, perl=T) (i.e. replace all matches with the empty string).

What I have so far is the regular expression (?<=A)(?s:.)+(?=B), where I use the s modifier to make the . match newlines as well. The problem is that when there are multiple occurrences of the lookahead B, I only want to remove whatever lies between A and the first B:

I have A remove \r\n this B but leave this B

I want AB but leave this B

but so far what I get is AB

How can I modify the regex to make the match stop at the first occurrence of B?

ikegami
user28400

2 Answers

3

Make the quantifier non-greedy. Try this:

(?<=A)(?s:.)+?(?=B)
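
To see the difference between the greedy and the non-greedy quantifier, here is a minimal sketch. It uses Python's `re` module as a stand-in for R's `gsub(..., perl=T)`, which is an assumption on my part: both accept the `(?s:.)` scoped modifier and fixed-width lookbehinds, but they are different engines. The sample string is the one from the question.

```python
import re

# Sample input from the question, with a literal CR/LF between A and the first B.
text = "I have A remove \r\n this B but leave this B"

# Greedy: (?s:.)+ runs as far as it can, so the lookahead lands on the LAST B.
greedy = re.sub(r"(?<=A)(?s:.)+(?=B)", "", text)

# Non-greedy: (?s:.)+? stops as soon as the lookahead succeeds, i.e. at the FIRST B.
lazy = re.sub(r"(?<=A)(?s:.)+?(?=B)", "", text)

print(repr(greedy))  # 'I have AB'
print(repr(lazy))    # 'I have AB but leave this B'
```

In R, the equivalent call would presumably be `gsub("(?<=A)(?s:.)+?(?=B)", "", string, perl=T)`.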
ikegami
Arunesh Singh
  • Perfect, thanks! Sorry for the dumb question (in hindsight). I had done a lot of googling but didn't know the term "greedy", so I never found the answer. – user28400 Aug 05 '15 at 16:31
  • study here and be wise http://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers – Arunesh Singh Aug 05 '15 at 16:32
  • The optional quantifier `*` is useless, should be non-optional. –  Aug 05 '15 at 16:40
  • Using the non-greedy modifier as anything but an optimization is fragile. I *think* you're safe if you only use one in the pattern, but that's it. Since it's easier and far clearer to avoid this dangerous construct in this case, I disagree with this solution. – ikegami Aug 05 '15 at 16:54
  • In that context, it is better written as `(?s:.*?)` or `(?s:.+?)`. No need to rescope a single character. –  Aug 05 '15 at 16:57
  • @sln, I disagree. There's no difference between `(?s:.+?)` and `(?s:.)+?` in Perl (as seen using `perl -Mre=debug -e'qr/(?s:.)+?/'` and `perl -Mre=debug -e'qr/(?s:.+?)/'`), and there probably isn't any in PCRE. – ikegami Aug 05 '15 at 17:03
  • @ikegami - Perl optimizes everything, never said there was a difference. Like you say `(?s:.)+?` is the same as `(?s:(?:(?:(?:.))))+?` same program steps. In other engines ? eh, not so sure this doesn't result in unneeded steps. –  Aug 05 '15 at 17:23
  • @ikegami - I don't think that `(?:(?:(?:` to `)))` is technically considered nesting and is a single group. Whereas a group that is quantified is not technically the same as a group of quantified items, even though it could be optimized out as in such a simple case as this. I would imagine some engines not doing that optimization. –  Aug 05 '15 at 17:46
  • @sln, What are you talking about? How is that remotely relevant? – ikegami Aug 05 '15 at 17:46
  • @ikegami - I'm talking about optimizations. I guess I'm saying `(?s:.+?)` is in no way even closely or remotely the same as `(?s:.)+?`. –  Aug 05 '15 at 17:49
  • @sln, I've already proved they are exactly the same, so I really don't care that you think they're completely different. – ikegami Aug 05 '15 at 17:54
  • @sln, I just noticed you said I said "`(?s:.)+?` is the same as `(?s:(?:(?:(?:.))))+?`". I never said anything of the kind. I have no idea if it's true or not (nor do I care). – ikegami Aug 05 '15 at 17:54
  • `(?s:(?:(?:(?:.))))+?` optimizes to the same code as `(?s:.)+?`; check with `use re 'debug';` –  Aug 05 '15 at 17:56
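
One possible reading of the "fragile" concern raised in the comments above: a non-greedy quantifier only prefers the shortest match, it does not forbid running past the first B when the rest of the pattern requires it. The sketch below is an illustration under that assumption, again in Python's `re` rather than R. The input string and the trailing `!` are made up for the example.

```python
import re

# Hypothetical input: the first B is NOT followed by "!", the second one is.
text = "A one B two B!"

# Non-greedy, with more pattern after the B: the engine keeps expanding
# past the first B until "B!" can match, so the capture contains a B.
lazy = re.search(r"A((?s:.)+?)B!", text)

# Excluding B from the filler makes "stop at the first B" explicit:
# the first B is not followed by "!", so there is no match at all.
strict = re.search(r"A((?:(?!B).)+)B!", text)

print(repr(lazy.group(1)))  # ' one B two '
print(strict)               # None
```

In the pattern from the question, nothing follows the lookahead, so the non-greedy version does stop at the first B there.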
2

This is a specific case where Dot-All mode, the dot, and lazy quantifiers should
not be used. The resulting pattern is confusing to read and doesn't convey its real intent.

(?<=A)[^B]+(?=B)
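
A quick check of the negated character class on the sample from the question, sketched in Python's `re` as a stand-in for R's `perl=T` mode (an assumption; the engines differ in other respects). A negated class such as `[^B]` matches newlines without any modifier, so no `s` flag is needed.

```python
import re

# Sample input from the question, including the CR/LF pair.
text = "I have A remove \r\n this B but leave this B"

# [^B]+ can never consume a B, so the match necessarily ends at the first B.
result = re.sub(r"(?<=A)[^B]+(?=B)", "", text)

print(repr(result))  # 'I have AB but leave this B'
```

This only works because the end delimiter is a single character.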

  • This would indeed work perfectly in the sample case I gave. What if I wanted to look ahead to `\\\\centering` instead? (I'm using this regex to clean up some legacy LaTeX files.) – user28400 Aug 05 '15 at 16:51
  • @user28400 - Using `.*` or `.*?` is fraught with danger. There are ways to use it effectively, but using it just as a filler is dangerous. As an example, in compound sub-expressions this `.*?` can behave as if it were greedy and _could_ not do what you think. –  Aug 05 '15 at 17:01
  • `(?:(?!STRING).)` is to `(?:STRING)` as `[^CHAR]` is to `CHAR`. (`s` modifier presumed) – ikegami Aug 05 '15 at 17:06
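
The `(?:(?!STRING).)` construct from the last comment can be sketched for a multi-character end delimiter such as `\centering`. This is a rough illustration in Python's `re`, not R. The helper name `remove_between`, the start marker `\begin{figure}`, and the sample LaTeX are all hypothetical, since the question does not say what the start delimiter in the real files is.

```python
import re

def remove_between(text, start, end):
    """Remove whatever lies between `start` and the next `end` (hypothetical helper)."""
    pattern = (
        "(?<=" + re.escape(start) + ")"          # just after the start marker
        "(?s:(?:(?!" + re.escape(end) + ").)+)"  # any char, as long as `end` does not begin here
        "(?=" + re.escape(end) + ")"             # stop right before the end marker
    )
    return re.sub(pattern, "", text)

# Made-up LaTeX fragment with two occurrences of \centering.
sample = "\\begin{figure}\n[h]\n\\centering\n\\includegraphics{a}\n\\centering"

cleaned = remove_between(sample, "\\begin{figure}", "\\centering")
print(repr(cleaned))
# '\\begin{figure}\\centering\n\\includegraphics{a}\n\\centering'
```

`re.escape` keeps the backslash and braces literal. Python requires a fixed-width lookbehind, which a literal start marker satisfies.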