3

I'm trying to get a collection of substrings from a string, in this example <tag></tag> pairs. Given the string:

<tag>abc</tag><tag>123</tag>

I want 2 groups: <tag>abc</tag> and <tag>123</tag>

That's easy with the pattern <tag>.*?</tag>.

Example

But I would like it to be more precise.

Given the string:

<tag>abc</tag><tag><tag>123</tag>

I would like it to omit the second <tag> in the middle (because I'm searching for matching opening and closing tags).

I want this result:

<tag>abc</tag>
<tag>123</tag>

I've tried using a lookahead or lookbehind, but with no luck (I'm sure I'm using it wrong):

<tag>.*?(?<!<tag>)</tag>
Alan Moore
Sagiv b.g

2 Answers

4

I assume the <tag> and </tag> are used as an example as leading/trailing delimiters.

Note that lazy dot matching still matches from the first leading delimiter up to the first occurrence of the trailing delimiter, including any further occurrences of the leading delimiter in between.
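To see this concretely, here is a minimal sketch. It assumes Python's `re` module (the question does not name a regex flavor), and `text` is the second sample string from the question.

```python
import re

# Second sample string from the question: a stray <tag> sits in the middle.
text = "<tag>abc</tag><tag><tag>123</tag>"

# Lazy dot: stops at the first </tag>, but still consumes the extra <tag>.
lazy = re.findall(r"<tag>.*?</tag>", text)

# The lookbehind attempt from the question only inspects the five characters
# right before </tag>, so it produces the same result on this input.
lookbehind = re.findall(r"<tag>.*?(?<!<tag>)</tag>", text)

print(lazy)        # ['<tag>abc</tag>', '<tag><tag>123</tag>']
print(lookbehind)  # ['<tag>abc</tag>', '<tag><tag>123</tag>']
```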

To work around it, use a tempered greedy token:

<tag>(?:(?!</?tag>).)*</tag>

See the regex demo
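As a quick check outside the demo, the tempered greedy token can be tried like this. This is a sketch that assumes Python's `re` module; the variable names are only illustrative.

```python
import re

text = "<tag>abc</tag><tag><tag>123</tag>"

# (?:(?!</?tag>).)* consumes one character at a time, but only while the
# current position does not start <tag> or </tag>.
tempered = re.compile(r"<tag>(?:(?!</?tag>).)*</tag>")

matches = tempered.findall(text)
print(matches)  # ['<tag>abc</tag>', '<tag>123</tag>']
```

The stray `<tag>` in the middle is skipped because no closing delimiter follows it before the next opening one.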

Since the lookahead is executed at each position, this construct is rather resource-intensive. You can unroll it as:

<tag>[^<]*(?:<(?!/?tag>)[^<]*)*</tag>

See another regex demo.
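The following sketch compares the tempered and unrolled patterns on a few inputs. It assumes Python's `re` module, and the sample strings other than the first are made up for illustration.

```python
import re

tempered = re.compile(r"<tag>(?:(?!</?tag>).)*</tag>")
# Unrolled form: runs of non-'<' characters, plus any '<' that does not
# start <tag> or </tag>, so the lookahead only fires at '<' characters.
unrolled = re.compile(r"<tag>[^<]*(?:<(?!/?tag>)[^<]*)*</tag>")

samples = [
    "<tag>abc</tag><tag><tag>123</tag>",  # from the question
    "<tag>a<b>c</tag>",                   # other markup inside the delimiters
    "<tag>x<tag>y</tag>z</tag>",          # nested delimiters
]

results = [(tempered.findall(s), unrolled.findall(s)) for s in samples]
for s, (t, u) in zip(samples, results):
    print(s, "->", t, u)
```

On these single-line inputs both patterns return the same matches. One difference to keep in mind: `.` does not match a newline unless `re.DOTALL` is set, while `[^<]` does, so the two can diverge on multi-line text.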

Wiktor Stribiżew
  • it looks good and it works. though i can't fully understand the syntax. guess i have some reading to do :) – Sagiv b.g Jul 27 '16 at 14:54
  • @Sag1v: The link I provided contains all the necessary information. Here it is: http://stackoverflow.com/a/37343088/3832970 – Wiktor Stribiżew Jul 27 '16 at 14:55
  • @WiktorStribiżew yes sir, i was referring to the link you provided as the "reading to do" would accept your answer as soon as i can. – Sagiv b.g Jul 27 '16 at 14:56
  • as for your bombastic second example, i've made a change in the pattern by omitting the tag name: changed `/tag>` to `/?.*>`. example: `[^)[^` is this resource consuming? – Sagiv b.g Jul 27 '16 at 15:02
  • This is just wrong. Inside the negative lookahead, you need to use the sequences that you want to exclude in between delimiters. The `.*` can't be used if you have marked up text. – Wiktor Stribiżew Jul 27 '16 at 15:07
  • i was trying to go for a pattern that will get any tag not just `...` but also `...` etc... – Sagiv b.g Jul 27 '16 at 21:19
  • 1
    So, use ``, or `` and then use `(?!/?tag\w+>)` in the lookahead. – Wiktor Stribiżew Jul 27 '16 at 21:26
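For the "any tag name" case raised in these comments, one possible approach is to capture the tag name and reuse it through a backreference. This is only a sketch under assumptions: it uses Python's `re` module, treats tag names as `\w+`, and is not necessarily the pattern suggested in the comment above.

```python
import re

# Hypothetical generalization: group 1 captures the tag name, and \1 reuses
# it both in the tempered lookahead and in the closing delimiter.
any_tag = re.compile(r"<(\w+)>(?:(?!</?\1>).)*</\1>")

text = "<a>x</a><b><b>y</b>"

# findall would return only the captured names, so use finditer for full matches.
full_matches = [m.group(0) for m in any_tag.finditer(text)]
print(full_matches)  # ['<a>x</a>', '<b>y</b>']
```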
0

This one allows only letters and digits between the tags (apart from the first character, which the leading `.` lets be anything):

<tag>(.[a-zA-Z\d]*)</tag>
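A short sketch of how this pattern behaves, assuming Python's `re` module. The last two sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"<tag>(.[a-zA-Z\d]*)</tag>")

text = "<tag>abc</tag><tag><tag>123</tag>"

# Group 1 holds the content between the delimiters.
contents = pattern.findall(text)
full = [m.group(0) for m in pattern.finditer(text)]
print(contents)  # ['abc', '123']
print(full)      # ['<tag>abc</tag>', '<tag>123</tag>']

# The leading '.' accepts any first character; later ones must be alphanumeric.
dash_first = pattern.findall("<tag>-ab</tag>")
with_space = pattern.findall("<tag>a b</tag>")
print(dash_first)  # ['-ab']
print(with_space)  # []
```

Unlike the patterns in the other answer, this one rejects content containing spaces, punctuation, or other markup after the first character.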
mdelpeix