7

Here is a sample entry from my sitemap.xml:

<url>
  <loc>http://sitename.com/programming/php/?C=D;O=A</loc>
  <changefreq>weekly</changefreq>
  <priority>0.64</priority>
</url>

There are many entries like this. Notice that the value of the <loc> tag ends with C=D;O=A. I want to remove every entry, from <url> through </url>, that contains C=D;O=A or a similar pattern.

The following expression matches the whole of the entry above:

<url>(.|\r\n)*?<\/url>

but I want it to match only the entries described above, i.e. those containing C=D;O=A or a similar pattern.

How do I write a regex that matches under such a condition?

Jayapal Chandran
  • you don't, see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Fredrik Pihl Jun 16 '11 at 08:13
  • @Fredrik, the answer is NOT correct. Regex can be used to parse xml but it's not the best way to do it. – Karolis Jun 16 '11 at 08:19
  • @Fredrik: There's no problem with using regex here. OP isn't trying to parse XML, but a very specific subset of it that looks like the example in his post. – Tim Jun 16 '11 at 08:35
  • Have a look at [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Nick Weaver Jun 16 '11 at 08:39
  • I am not parsing XML; I need to remove those entries from the sitemap using a text editor like Dreamweaver so that I have a cleaner sitemap. The entries are there because I did not add an index.html to one of the folders on my site, which has many subfolders. This is the folder I am making a sitemap for: http://vikku.info/programming/ . – Jayapal Chandran Jun 17 '11 at 12:07

3 Answers

11

Try this:

/<url>(?:(?!<\/url>).)*C=D;O=A.*?<\/url>/m

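The negative lookahead guarantees that a match does not span multiple <url> nodes. In Ruby (which Rubular uses), the /m flag makes . match newlines; in most other regex flavors the equivalent is the s (dotall) flag.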

See here: rubular
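
For anyone who wants to try the same idea outside Rubular, here is a minimal Python sketch. It is not part of the original answer: the function name and the sample text are made up for illustration, `re.DOTALL` stands in for Ruby's /m, and widening `C=D;O=A` to any `?C=<letter>;O=<letter>` query is only an assumption about what "similar patterns" means.

```python
import re

# Tempered dot: (?:(?!</url>).) consumes one character at a time but refuses
# to step over a closing </url>, so a match can never span two entries.
# The [A-Z] classes are an assumed generalisation of "C=D;O=A".
URL_ENTRY = re.compile(
    r"<url>(?:(?!</url>).)*?\?C=[A-Z];O=[A-Z](?:(?!</url>).)*</url>\s*",
    re.DOTALL,  # let "." match newlines, like /m in Ruby
)

def strip_sorted_listing_entries(sitemap_text: str) -> str:
    """Return the text without <url> blocks that contain a C=..;O=.. query."""
    return URL_ENTRY.sub("", sitemap_text)

# Hypothetical two-entry sitemap fragment: one clean URL, one unwanted URL.
sample = """<urlset>
<url>
  <loc>http://sitename.com/programming/php/</loc>
  <changefreq>weekly</changefreq>
</url>
<url>
  <loc>http://sitename.com/programming/php/?C=D;O=A</loc>
  <changefreq>weekly</changefreq>
  <priority>0.64</priority>
</url>
</urlset>"""

cleaned = strip_sorted_listing_entries(sample)
print(cleaned)
```

In an editor's find-and-replace dialog the same pattern would be used with an empty replacement, provided the editor's regex engine supports lookaheads and a dotall mode.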

morja
6

It is not a good idea to use regex for XML. Depending on the language, you should use an XML reader, extract the <url> nodes, and then use regex to match the content of each node. XPath, a query language for XML data that many XML libraries support, is useful for this.
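
As a rough illustration of that approach, here is a sketch using Python's standard `xml.etree.ElementTree`. The function name and the sample document are hypothetical, and the sample is assumed to have no XML namespace; a real sitemap usually declares one on `<urlset>`, in which case the element lookups would need namespace-qualified names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of the unwanted query string: "?C=<letter>;O=<letter>" at the
# end of the URL, as in the question's "C=D;O=A" example.
SORT_QUERY = re.compile(r"\?C=[A-Z];O=[A-Z]$")

def drop_sorted_listing_urls(sitemap_xml: str) -> str:
    """Parse the sitemap, remove <url> nodes whose <loc> matches, re-serialise."""
    root = ET.fromstring(sitemap_xml)
    # Iterate over a copy of the child list so removal is safe.
    for url in list(root.findall("url")):
        loc = url.findtext("loc", default="").strip()
        if SORT_QUERY.search(loc):
            root.remove(url)
    return ET.tostring(root, encoding="unicode")

# Hypothetical namespace-free sample with one clean and one unwanted entry.
sample_xml = """<urlset>
<url><loc>http://sitename.com/programming/php/</loc></url>
<url><loc>http://sitename.com/programming/php/?C=D;O=A</loc></url>
</urlset>"""

result = drop_sorted_listing_urls(sample_xml)
print(result)
```

Here the regex only ever sees the text of a single `<loc>` node, so it cannot accidentally match across entries.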

Petar Ivanov
  • I am not using regex to parse XML. This is just my sitemap, which contains all those entries because I did not have a default index.htm in an important folder that has many subfolders. I want to update my sitemap without those extra C=D items, so I need a regex to remove those entries and keep the sitemap clean. I cannot write a program to remove the unwanted entries; I just need a regex to remove them on the fly so that I can update my sitemap. – Jayapal Chandran Jun 17 '11 at 12:01
  • Sometimes these libraries are overkill. For example, processing wiki-text that contains limited html-like tags. @morja's answer actually answers the question... – Jonathan Feb 15 '14 at 02:20
0

If you absolutely have to use regex, this one:

<([a-z][a-z0-9]*)\b[^>]*>(.*?)(C=D;O=A){1}(.*?)</\1>

will get you the line:

http://sitename.com/programming/php/?C=D;O=A

I would then traverse up to the parent tag and do whatever I wanted with it.
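
To see what that pattern captures, here is a small Python check. It is only a sketch: the variable names are made up, and the entry is the sample from the question. Because `.` does not match newlines by default, the match stays on the single `<loc>` line rather than the whole `<url>` block.

```python
import re

# The pattern from this answer, unchanged. Group 1 is the tag name, which the
# backreference \1 requires to reappear in the closing tag.
LOC_WITH_QUERY = re.compile(
    r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)(C=D;O=A){1}(.*?)</\1>"
)

entry = """<url>
  <loc>http://sitename.com/programming/php/?C=D;O=A</loc>
  <changefreq>weekly</changefreq>
  <priority>0.64</priority>
</url>"""

match = LOC_WITH_QUERY.search(entry)
tag_name = match.group(1)                       # name of the matched element
matched_url = match.group(2) + match.group(3)   # text before the query + query
print(tag_name, matched_url)
```

The match covers the `<loc>` element only, which is why a further step up to the enclosing `<url>` is still needed before anything can be removed.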

Mattis