A multi-line, variedly greedy, regular expression

Question

Given the following text, what PCRE regular expression would you use to extract the parts marked in bold?

00:20314 lorem ipsum
  want this
  kryptonite

00:02314 quux
  padding
  dont want this

00:03124 foo
     neither this

00:01324 foo
     but we want this
     stalagmite

00:02134 tralala
     not this

00:03124 bar foo
     and we want this
     kryptonite but not this(!)

00:02134 foo bar
     and not this either

00:01234 dolor sit amet
     EOF

IOW, we want to extract sections that start, in regex terms, with "^0" and end with "(kryptonite|stalagmite)".

Been chomping on this for a bit, finding it a hard nut to crack. TIA!

A couple of ways to do it. Can the delimiters be in the body? — , Sep 26 '14 at 20:08
The only thing that delimits this requires no other `^0` in the body. — , Sep 26 '14 at 20:28

hwnd · Accepted Answer · 2014-09-26T22:53:26.807

4

One way to do this is a negative lookahead combined with the inline (?sm) modifiers (dotall and multi-line).

(?sm)^0(?:(?!^0).)*?(?:kryptonite|stalagmite)
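
As a quick check, the pattern can be run against the sample text from the question. The sketch below uses Python's `re` module rather than PCRE itself (an assumption made for illustration; the lookahead and the inline `(?sm)` flags behave the same way for this particular pattern), and the names `TEXT` and `PATTERN` are made up for the example.

```python
import re

# Sample text copied from the question.
TEXT = """\
00:20314 lorem ipsum
  want this
  kryptonite

00:02314 quux
  padding
  dont want this

00:03124 foo
     neither this

00:01324 foo
     but we want this
     stalagmite

00:02134 tralala
     not this

00:03124 bar foo
     and we want this
     kryptonite but not this(!)

00:02134 foo bar
     and not this either

00:01234 dolor sit amet
     EOF
"""

# (?s) lets . cross newlines, (?m) lets ^ match at every line start.
# (?!^0). consumes one character at a time, but never the 0 that opens
# the next section, so a match cannot spill into a later section.
PATTERN = re.compile(r"(?sm)^0(?:(?!^0).)*?(?:kryptonite|stalagmite)")

matches = PATTERN.findall(TEXT)
for m in matches:
    print(m)
    print("---")
```

Under these assumptions the loop should print only the three wanted sections. The third one stops at kryptonite, before "but not this(!)", because the quantifier is lazy.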

edited Sep 26 '14 at 22:53

answered Sep 26 '14 at 20:16


score 3 · Answer 2 · answered Sep 26 '14 at 20:23

This looks like it works.

 # (?ms)^0(?:(?!(?:^0|kryptonite|stalagmite)).)*(kryptonite|stalagmite)

 (?ms)
 ^ 0
 (?:
      (?!
           (?: ^ 0 | kryptonite | stalagmite )
      )
      . 
 )*
 ( kryptonite | stalagmite )
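
The expanded layout above only works as a pattern when free-spacing mode is on. Here is a minimal sketch of the same idea in Python, assuming `re.VERBOSE` as the stand-in for PCRE's `x` modifier and passing the `m` and `s` flags as arguments instead of inline; `SAMPLE`, `PATTERN`, `sections` and `keywords` are names invented for the example, and the sample is a shortened version of the question's text.

```python
import re

# Shortened sample: wanted, unwanted, wanted.
SAMPLE = """\
00:20314 lorem ipsum
  want this
  kryptonite

00:02314 quux
  dont want this

00:01324 foo
  but we want this
  stalagmite
"""

PATTERN = re.compile(
    r"""
    ^ 0                                          # a 0 at the start of a line
    (?:
        (?!                                      # stop before ...
            (?: ^ 0 | kryptonite | stalagmite )  # ... the next section or a keyword
        )
        .                                        # otherwise consume one character
    )*
    ( kryptonite | stalagmite )                  # group 1: the closing keyword
    """,
    re.VERBOSE | re.MULTILINE | re.DOTALL,
)

found = list(PATTERN.finditer(SAMPLE))
sections = [m.group(0) for m in found]   # whole matched sections
keywords = [m.group(1) for m in found]   # which keyword closed each one

print(sections)
print(keywords)
```

Because the keywords are part of the lookahead, the greedy `*` cannot run past the first keyword, so no lazy quantifier is needed here.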

Same concept, but you include the keywords as well. Nice =) – hwnd Sep 26 '14 at 20:25
The keyword's probably not needed. Yours is the better one. – Sep 26 '14 at 20:27
Still, think alike =) (+1) – hwnd Sep 26 '14 at 20:28
It's the only way really. – Sep 26 '14 at 20:29

score 2 · Answer 3 · edited May 23 '17 at 11:57

I believe this will be the most efficient:

^0(?:\R(?!\R)|.)*?\b(?:kryptonite|stalagmite)\b

Obviously we start with ^0 (in multi-line mode, so that ^ matches at the start of each line) and then end with either kryptonite or stalagmite (in a non-capturing group, for the heck of it) surrounded by \b word boundaries.

(?:\R(?!\R)|.)*? is the interesting part though, so let's break it down. One key concept first is PCRE's \R newline sequence.

(?:      (?# start non-capturing group for repetition)
  \R     (?# match a newline sequence)
  (?!\R) (?# not followed by another newline)
 |       (?# OR)
  .      (?# match any character, except newline)
)*?      (?# lazily repeat this group)
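
To see the breakdown in action, here is a rough Python sketch. Python's `re` module has no `\R`, so the sketch substitutes a plain `\n` and therefore assumes LF-only line endings; it also passes `re.MULTILINE` so that `^` matches at each line start. `SAMPLE`, `PATTERN` and `matches` are hypothetical names, and the sample is a shortened version of the question's text.

```python
import re

# Shortened sample: unwanted, wanted (with trailing text), unwanted.
SAMPLE = """\
00:02314 quux
  dont want this

00:03124 bar foo
  and we want this
  kryptonite but not this(!)

00:02134 foo bar
  and not this either
"""

# \n(?!\n) accepts a single line break but refuses a blank line,
# so a match can never cross the empty line between two sections.
# . is left in its default mode and does not match newlines.
PATTERN = re.compile(
    r"^0(?:\n(?!\n)|.)*?\b(?:kryptonite|stalagmite)\b",
    re.MULTILINE,
)

matches = PATTERN.findall(SAMPLE)
print(matches)
```

Note that this approach relies on the blank line between sections as the boundary, not on the next `^0`, so it would behave differently on input where sections are not separated by empty lines.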

answered Sep 26 '14 at 20:36

Sam


You need to add `$` to your expression – HamZa Sep 26 '14 at 20:42
@HamZa, I don't believe so: `00:03124 bar foo and we want this kryptonite but not this(!)` – Sam Sep 26 '14 at 20:43

score -1 · Answer 4 · answered Sep 26 '14 at 20:13

^(00:.*?(kryptonite|stalagmite)) with the s modifier

atxdba


Simply doesn't match the expected output – HamZa Sep 26 '14 at 20:45
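
A small sketch of why a lazy quantifier alone falls short, again using Python's `re` as a stand-in for PCRE (an assumption for illustration). `SAMPLE`, `dotall_only` and `with_multiline` are made-up names, and the sample is a shortened version of the question's text.

```python
import re

# Shortened sample: wanted, unwanted, wanted.
SAMPLE = """\
00:20314 lorem ipsum
  want this
  kryptonite

00:02314 quux
  dont want this

00:01324 foo
  but we want this
  stalagmite
"""

PATTERN = r"^(00:.*?(kryptonite|stalagmite))"

# With only the s (dotall) flag, ^ matches at the very start of the
# text and nowhere else, so at most one section can be found.
dotall_only = [m.group(0) for m in re.finditer(PATTERN, SAMPLE, re.DOTALL)]

# Adding the m flag lets ^ match at every line start, but .*? is free
# to run across section boundaries until it reaches a keyword.
with_multiline = [
    m.group(0) for m in re.finditer(PATTERN, SAMPLE, re.DOTALL | re.MULTILINE)
]

print(len(dotall_only))
print(with_multiline[1])
```

In the second run the unwanted quux section is swallowed together with the following section, which is the case the lookahead in the other answers guards against.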
