
I have a string containing a simple xml structure:

<folder>
 <id=1>
 <id=6>
 <folder>
  <id=2>
  <id=6>
 </folder>
 <folder>
  <id=3>
  <id=5>
 </folder>
</folder>

How would I target just the folder containing id=x using regex?

For example, if id=2 I want to return just <folder><id=2><id=6></folder>

Andrea Corbellini
darkace

2 Answers


The following should work:

<folder>\s*(<id=\d+>)*\s*<id=xxx>.*?</folder>

Note that your string contains newline characters, so you should enable the "DOTALL" option. How to enable it depends on the language you are using.

In the case of C#, it seems that you need to enable Singleline mode:

Regex.Matches(input, pattern, RegexOptions.Singleline)

Example using grep and id=2:

$ grep -Pzo '(?s)<folder>\s*(<id=\d+>)*\s*<id=2>.*?</folder>' a
<folder>
  <id=2>
  <id=6>
 </folder>

(Here (?s) enables DOTALL.)
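
For comparison, here is a minimal Python sketch of the same pattern, where DOTALL is spelled re.DOTALL. The helper name find_folder and the XML variable are made up for this illustration; the sample data is the string from the question.

```python
import re

# Sample input from the question, joined with explicit newlines.
XML = "\n".join([
    "<folder>",
    " <id=1>",
    " <id=6>",
    " <folder>",
    "  <id=2>",
    "  <id=6>",
    " </folder>",
    " <folder>",
    "  <id=3>",
    "  <id=5>",
    " </folder>",
    "</folder>",
])


def find_folder(text, wanted_id):
    """Return the first match of the pattern above for wanted_id, or None."""
    pattern = r"<folder>\s*(<id=\d+>)*\s*<id=%d>.*?</folder>" % wanted_id
    # re.DOTALL lets "." match newlines, like (?s) in the grep example.
    match = re.search(pattern, text, re.DOTALL)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(find_folder(XML, 2))
```

Because .*? stops at the first </folder> it meets, this only behaves as intended when the targeted folder has no nested folders inside it.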

Andrea Corbellini
  • I'm not sure that this works for `id=1` because it should return the whole file/content in the specified case. Maybe the regex has to find the corresponding closing token. – Sebastian Schumann Sep 17 '15 at 07:50
  • @Verarind: I do not believe this is a requirement. Especially because it's impossible to deal with an unspecified number of level of nesting with regular expressions. – Andrea Corbellini Sep 17 '15 at 08:19
  • Maybe it's impossible in Python, but .NET knows about [balancing group definitions](https://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.110).aspx#balancing_group_definition), which enable you to match the corresponding closing token to an opening token. – Sebastian Schumann Sep 17 '15 at 08:22
  • Those are "overpowered" regular expressions. Traditional regular expressions have no memory (or, if you prefer, traditional regular expressions can match only regular languages). However, whatever regex dialect you are using, I still believe that using a proper parser would be much easier and much more efficient if you want to deal with nested tags. – Andrea Corbellini Sep 17 '15 at 09:05

Solution

<folder>(?:(?!</?folder>).)*<id=2>(?:(?!</?folder>).|(?<open><folder>)|(?<-open></folder>))*?(?(open)(?!))</folder>

DEMO

Explanation

We start with the requested tag: <folder>

Then anything that is neither <folder> nor </folder>: (?:(?!</?folder>).)*

Next what we're looking for: <id=2>

And then comes something (.*) up to the end token: </folder>

The problem is that this "something" can itself contain opening and closing tokens, and those have to be tracked. The best way to do this is with balancing group definitions (BGD), which let us keep matching until we reach the corresponding closing token. So the .* has to be replaced by the BGD for your tokens: (?:(?!</?folder>).|(?<open><folder>)|(?<-open></folder>))*?(?(open)(?!))

A good introduction into BGD is here and here
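
For regex dialects without balancing groups (Python's re module, for example), the same open/close bookkeeping can be done by hand with a stack. The following is only a rough sketch under that assumption, not a translation of the .NET pattern: the function name folder_containing is hypothetical, and it returns the first folder to close that directly contains the requested id.

```python
import re

# Sample input from the question, joined with explicit newlines.
XML = "\n".join([
    "<folder>",
    " <id=1>",
    " <id=6>",
    " <folder>",
    "  <id=2>",
    "  <id=6>",
    " </folder>",
    " <folder>",
    "  <id=3>",
    "  <id=5>",
    " </folder>",
    "</folder>",
])

TOKEN = re.compile(r"<folder>|</folder>|<id=(\d+)>")


def folder_containing(text, wanted_id):
    """Return the text of a folder that directly contains <id=wanted_id>."""
    stack = []  # one [start_offset, contains_wanted_id] entry per open folder
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token == "<folder>":
            stack.append([match.start(), False])   # like (?<open><folder>)
        elif token == "</folder>":
            if not stack:
                continue                           # stray close tag, ignore it
            start, found = stack.pop()             # like (?<-open></folder>)
            if found:
                return text[start:match.end()]
        elif stack and int(match.group(1)) == wanted_id:
            stack[-1][1] = True                    # id belongs to innermost open folder
    return None


if __name__ == "__main__":
    print(folder_containing(XML, 2))
```

With the sample data, id=2 yields only the first inner folder, while id=1 yields the whole outer folder including its nested children, which is the case raised in the comments on the other answer.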

Community
Sebastian Schumann