R regex help: Perl s modifier + lookahead too aggressive

Question

I'm trying to remove all characters (including newlines) between two given substrings, using R's gsub("regexp", "", string, perl=T) (i.e. replace all matches with the empty string).

What I have so far is the regular expression (?<=A)(?s:.)+(?=B), where I use the s modifier to make the . match newlines as well. The problem is that when there are multiple occurrences of the lookahead B, I only want to remove whatever lies between A and the first B:

I have `A remove \r\n this B but leave this B`

I want `AB but leave this B`

but so far what I get is `AB`

How can I modify the regex to make the lookahead stop at the first occurrence?

In Perl, `A\KB(?=C)` is faster than `(?<=A)B(?=C)`. It's probably the same for PCRE. — ikegami, Aug 05 '15 at 17:01
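
For reference, a sketch of what the `\K` form suggested in this comment could look like in R, combined with the non-greedy quantifier from the accepted answer. It assumes R's PCRE build handles `\K` inside `gsub(..., perl = TRUE)` as PCRE documents it. The variable names are illustrative, and the snippet only shows the syntax; it makes no timing claim.

```r
# Sample input from the question; "\r\n" is a real CR+LF inside the R string.
x <- "A remove \r\n this B but leave this B"

# \K drops everything matched so far from the reported match, so the leading
# "A" is required but not replaced. Backslashes are doubled in R strings.
with_k <- gsub("A\\K(?s:.)+?(?=B)", "", x, perl = TRUE)

# Lookbehind form of the same pattern, for comparison.
with_lookbehind <- gsub("(?<=A)(?s:.)+?(?=B)", "", x, perl = TRUE)

print(with_k)
print(identical(with_k, with_lookbehind))
```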

score 3 · Accepted Answer · edited Aug 05 '15 at 16:57

Make the quantifier non-greedy (lazy). Try this:

(?<=A)(?s:.)+?(?=B)
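
A minimal sketch in R of the difference between the two quantifiers, using the sample string from the question (the variable names are illustrative):

```r
# Sample input from the question; "\r\n" is a real CR+LF inside the R string.
x <- "A remove \r\n this B but leave this B"

# Greedy: (?s:.)+ runs to the end of the string, then backtracks to the LAST B.
greedy <- gsub("(?<=A)(?s:.)+(?=B)", "", x, perl = TRUE)

# Lazy: (?s:.)+? grows one character at a time and stops before the FIRST B.
lazy <- gsub("(?<=A)(?s:.)+?(?=B)", "", x, perl = TRUE)

print(greedy)  # "AB"
print(lazy)    # "AB but leave this B"
```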

edited Aug 05 '15 at 16:57

ikegami

answered Aug 05 '15 at 16:16

Arunesh Singh

Perfect, thanks! Sorry for the dumb question (in hindsight), I had done a lot of googling but didn't know the term "greedy", so I never found the answer. – user28400 Aug 05 '15 at 16:31
study here and be wise http://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers – Arunesh Singh Aug 05 '15 at 16:32
The optional quantifier `*` is useless, should be non-optional. – Aug 05 '15 at 16:40
Using the non-greedy modifier as anything but an optimization is fragile. I *think* you're safe if you only use one in the pattern, but that's it. Since it's easier and far clearer to avoid this dangerous construct in this case, I disagree with this solution. – ikegami Aug 05 '15 at 16:54
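
One common way a lazy quantifier surprises people, offered as an illustration and not necessarily the exact case the commenter had in mind: once more pattern follows the quantifier, "lazy" no longer means "stop at the first B". The input string and variable names below are hypothetical.

```r
# Hypothetical input: the intent is to find the shortest A...BC stretch, "A3BC".
y <- "A1B2 A3BC"

# Lazy quantifier: the match still starts at the FIRST "A" and keeps expanding
# until "BC" can match, so it runs straight past the earlier "B".
lazy_hit <- regmatches(y, regexpr("A.+?BC", y, perl = TRUE))

# Excluding "B" explicitly keeps the match from crossing any "B".
strict_hit <- regmatches(y, regexpr("A[^B]+BC", y, perl = TRUE))

print(lazy_hit)    # "A1B2 A3BC"
print(strict_hit)  # "A3BC"
```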
In that context, it is better written as `(?s:.*?)` or `(?s:.+?)`. No need to rescope a single character. – Aug 05 '15 at 16:57
@sln, I disagree. There's no difference between `(?s:.+?)` and `(?s:.)+?` in Perl (as seen using `perl -Mre=debug -e'qr/(?s:.)+?/'` and `perl -Mre=debug -e'qr/(?s:.+?)/'`), and there probably isn't any in PCRE. – ikegami Aug 05 '15 at 17:03
@ikegami - Perl optimizes everything, never said there was a difference. Like you say `(?s:.)+?` is the same as `(?s:(?:(?:(?:.))))+?` same program steps. In other engines ? eh, not so sure this doesn't result in unneeded steps. – Aug 05 '15 at 17:23
@ikegami - I don't think that `(?:(?:(?:` to `)))` is technically considered nesting and is a single group. Whereas a group that is quantified is not technically the same as a group of quantified items, even though it could be optimized out as in such a simple case as this. I would imagine some engines not doing that optimization. – Aug 05 '15 at 17:46
@sln, What are you talking about? How is that remotely relevant? – ikegami Aug 05 '15 at 17:46
@ikegami - I'm talking about optimizations. I guess I'm saying `(?s:.+?)` is in no way even remotely the same as `(?s:.)+?`. – Aug 05 '15 at 17:49
@sln, I've already proved they are exactly the same, so I really don't care that you think they're completely different. – ikegami Aug 05 '15 at 17:54
@sln, I just noticed you said I said "`(?s:.)+?` is the same as `(?s:(?:(?:(?:.))))+?`". I never said anything of the kind. I have no idea if it's true or not (nor do I care). – ikegami Aug 05 '15 at 17:54
`(?s:(?:(?:(?:.))))+?` optimizes to same code as `(?s:.)+?` , `use re 'debug';` – Aug 05 '15 at 17:56

score 2 · Answer 2 · answered Aug 05 '15 at 16:37

This is a specific case where dot-all, the dot, and a quantifier should not be
used: the pattern is confusing to read and doesn't convey its real intent.
A negated character class states the intent directly:

(?<=A)[^B]+(?=B)
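
A short sketch of this pattern applied to the question's sample string in R (the variable names are illustrative):

```r
# Sample input from the question; "\r\n" is a real CR+LF inside the R string.
x <- "A remove \r\n this B but leave this B"

# [^B] matches any character except "B", including "\r" and "\n",
# so no dot-all modifier is needed and the match cannot cross a "B".
result <- gsub("(?<=A)[^B]+(?=B)", "", x, perl = TRUE)

print(result)  # "AB but leave this B"
```

Note that this relies on the end marker being a single character; a negated class cannot exclude a multi-character string.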

This would indeed work perfectly in the sample case I gave. What if I wanted instead to lookahead to `\\\\centering` ? (I'm using this regex to clean up some legacy LaTeX files) – user28400 Aug 05 '15 at 16:51
@user28400 - Using `.*` or `.*?` is fraught with danger. There are ways to use it effectively, but using it just as a filler is dangerous. As an example, in compound sub-expressions `.*?` can behave as if it were greedy and _might_ not do what you think. – Aug 05 '15 at 17:01
`(?:(?!STRING).)` is to `(?:STRING)` as `[^CHAR]` is to `CHAR`. (`s` modifier presumed) – ikegami Aug 05 '15 at 17:06
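
A sketch of how the `(?:(?!STRING).)` construct from the comment above might be applied to the `\centering` case in R. The input string is a hypothetical stand-in for the LaTeX files mentioned, and `A` is kept as the start marker from the original question.

```r
# Hypothetical LaTeX-like input; "\\" in an R string is one literal backslash.
tex <- "A remove\nthis \\centering but leave this \\centering"

# (?s) lets "." match newlines. (?:(?!\\centering).)+ consumes one character
# at a time, but only where "\centering" does not start at that position.
pattern <- "(?s)(?<=A)(?:(?!\\\\centering).)+(?=\\\\centering)"
cleaned <- gsub(pattern, "", tex, perl = TRUE)

cat(cleaned, "\n")  # A\centering but leave this \centering
```

The four backslashes in the R string become two in the regex, which match one literal backslash in the text.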
