Regex matching an open and close tag and a certain text pattern inside that tag

Question

Here is a sample custom tag I have from a sitemap.xml:

<url>
  <loc>http://sitename.com/programming/php/?C=D;O=A</loc>
  <changefreq>weekly</changefreq>
  <priority>0.64</priority>
</url>

There are many entries like this, and as you can see, the loc tag has C=D;O=A at the end. I want to remove every entry, from <url> to </url>, that contains C=D;O=A or a similar pattern.

The following expression matches the whole of the tag shown above:

<url>(.|\r\n)*?<\/url>

but I want to match only the entries that meet the condition I described above.

How do we form a regex to match such conditions (patterns)?

you don't, see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Fredrik Pihl, Jun 16 '11 at 08:13
@Fredrik, the answer is NOT correct. Regex can be used to parse xml but it's not the best way to do it. — Karolis, Jun 16 '11 at 08:19
@Fredrik: There's no problem with using regex here. OP isn't trying to parse XML, but a very specific subset of it that looks like the example in his post. — Tim, Jun 16 '11 at 08:35
Have a look at [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Nick Weaver, Jun 16 '11 at 08:39
I am not parsing XML; I need to remove those entries from the sitemap using a text editor like Dreamweaver, so that I can have a cleaner sitemap. The entries are there because I did not add an index.html to one of the folders on my site, which has many subfolders. This is the folder I am making a sitemap for: http://vikku.info/programming/ . — Jayapal Chandran, Jun 17 '11 at 12:07

score 11 · Accepted Answer · answered Jun 16 '11 at 10:52

Try this:

/<url>(?:(?!<\/url>).)*C=D;O=A.*?<\/url>/m

The negative lookahead guarantees that the match cannot run past a </url>, so you do not match multiple nodes.
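For readers who want to try the same idea outside an editor, here is a minimal Python sketch of the pattern above. It assumes the whole sitemap is already in a string. The function name `remove_marked_entries` and the trailing `\s*` are illustrative additions, not part of the original answer. `re.DOTALL` plays the role that the `s` flag plays in PHP.

```python
import re

# Python needs no regex delimiters, so "/" is left unescaped here.
# re.DOTALL lets "." cross line breaks, which the pattern relies on.
# "(?:(?!</url>).)*" consumes one character at a time but refuses to step
# over a closing </url>, so a match cannot spill into the next entry.
# The trailing \s* is an addition of this sketch: it swallows the line
# break left behind after a removed entry.
ENTRY_WITH_MARKER = re.compile(
    r"<url>(?:(?!</url>).)*C=D;O=A.*?</url>\s*",
    re.DOTALL,
)


def remove_marked_entries(sitemap_text: str) -> str:
    """Return sitemap_text without the <url> entries that contain C=D;O=A."""
    return ENTRY_WITH_MARKER.sub("", sitemap_text)
```

Calling `remove_marked_entries(open("sitemap.xml").read())` should return the sitemap text with only the unmarked entries left. To cover similar query strings, the literal `C=D;O=A` could be swapped for a character-class pattern, which is left out here to stay close to the answer.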

See here: rubular

answered Jun 16 '11 at 10:52

morja


I forgot to mention that I am using Dreamweaver to remove them. Anyway, let me try that and I will update here. – Jayapal Chandran Jun 17 '11 at 12:04
ok, dreamweaver might not support lookaround... but give it a try. – morja Jun 17 '11 at 12:09
It did not work in Dreamweaver. I hope it would work in PHP? – Jayapal Chandran Jul 07 '11 at 13:34
Yes, it works in php. But you need to use the `s` flag instead of `m`. – morja Jul 07 '11 at 14:01
what if we don't have the node name. I just want a regular expression to validate an xml node like this test – Niloofar May 18 '17 at 19:17
try this: /(?:(?!).)*test.*?/m – morja May 18 '17 at 19:44

score 6 · Answer 2 · answered Jun 16 '11 at 08:12

It is not a good idea to use regex for XML. Depending on the language, you should use an XML reader, extract the <url> node and then use regex to match the content of the node. One useful language for querying XML data, supported by many XML libraries, is XPath.
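As a rough illustration of the XML-reader approach, the sketch below uses Python's standard `xml.etree.ElementTree`. It assumes the `<url>` elements are direct children of the root and carry no XML namespace, as in the sample from the question. A real sitemap usually declares the sitemaps.org namespace, in which case the tag names would need to be namespace-qualified. The function name `drop_marked_entries` and the `marker` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_marked_entries(xml_text: str, marker: str = "C=D;O=A") -> str:
    """Remove every <url> child whose <loc> text contains marker."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy of the list: removing children while walking
    # the live element would skip entries.
    for url in list(root.findall("url")):
        loc = url.findtext("loc", default="")
        if marker in loc:
            root.remove(url)
    return ET.tostring(root, encoding="unicode")
```

A plain substring test is enough for a fixed marker. A regex such as `re.search` could replace it when several similar query strings have to be caught.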

answered Jun 16 '11 at 08:12

Petar Ivanov


I am not using regex to parse XML. It is just my sitemap, which has all those entries because I did not have a default index.htm in an important folder that has many subfolders. I wanted to update my sitemap without those extra C=D items, so I need a regex to remove all those entries and keep the sitemap clean. I cannot write a program to remove the unwanted entries; I just need a regex to remove them on the fly so that I can update my sitemap. – Jayapal Chandran Jun 17 '11 at 12:01
Sometimes these libraries are overkill. For example, processing wiki-text that contains limited html-like tags. @morja's answer actually answers the question... – Jonathan Feb 15 '14 at 02:20

score 0 · Answer 3 · edited Mar 20 '15 at 11:49

If you absolutely have to use regex, this one:

<([a-z][a-z0-9]*)\b[^>]*>(.*?)(C=D;O=A){1}(.*?)</\1>

will get you the line:

http://sitename.com/programming/php/?C=D;O=A

I would then traverse up to the parent tag and do whatever I wanted with it.
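To see what this expression actually captures, here is a small Python sketch that runs it unchanged against text shaped like the sample entry. The function name `find_marked_elements` is hypothetical. No dot-matches-newline flag is set, so `.` stops at line breaks, and the backreference `\1` forces the closing tag to have the same name as the opening one.

```python
import re

# The expression from the answer, unchanged. Without re.DOTALL the "."
# cannot cross a line break, so an element only matches when its opening
# tag, the marker and its closing tag all sit on one line.
ELEMENT_WITH_MARKER = re.compile(
    r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)(C=D;O=A){1}(.*?)</\1>"
)


def find_marked_elements(text: str) -> list[str]:
    """Return the full text of every element the expression matches."""
    return [m.group(0) for m in ELEMENT_WITH_MARKER.finditer(text)]
```

On the sample entry this returns only the `<loc>` element, not the surrounding `<url>` block, so removing the whole entry still needs a separate step that finds the enclosing `<url>`.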

edited Mar 20 '15 at 11:49

Jayapal Chandran


answered Jun 16 '11 at 08:32

Mattis


It matched only one line, not the complete <url> open and close tag. – Jayapal Chandran Jul 07 '11 at 13:35
