Using Regex to find everything between the center tags

Question

So I am using a vendor application that uses regex to cut out code from my site. I have an entire div that I want to find and remove with a regex, but when I try, the line breaks and nested divs get in the way. Does anyone have any ideas? I am trying .+? but, as I said, I think the line breaks stop it from matching everything.

<center>
<div id="divSiteFooter">
<div class="container darkblue">
<div class="row">
<div class="twocol">
<h3>Experience</h3>
</div>
<div class="twocol">
<h3>Access</h3>
</div>
<div class="twocol">
<h3>Assistance</h3>
</div>
<div class="twocol">
<h3>Inquire</h3>
</div>
<div class="fourcol last">
<h3>Connect</h3>
</div>
</div>
</div>
<div class="container darkerblue">
<div class="twocol">
<ul>
<li><a href="/Experience-Avalon/">Avalon Choice Cruising</a></li>
<li><a href="/Cruise-Vacations/">Our Cruises</a></li>
<li><a href="/River-Cruise-Ships/">Our Fleet</a></li>
<li><a href="/interactive-suite/">Photos & Videos</a></li>
<li><a href="/Affiliations/">Awards and Affiliations</a></li>
</ul>
</div></div></div>
</center>

If this is a language that supports PCRE you need to set the "dot all" modifier — Explosion Pills, Apr 16 '13 at 16:09
I don't know what language you're using for this, but regardless, Regex really is not the right tool for parsing HTML. You might have a look at CsQuery or the Html Agility Pack. http://stackoverflow.com/a/16006543/618649 — Craig, Apr 16 '13 at 16:09
I advise reading [this](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) before going much further. — Bernhard Barker, Apr 16 '13 at 16:11
Basically this is the footer from my page, and when it comes into this vendor application it messes a lot of things up, so I just want to remove my entire footer. The 3rd party vendor is a mobile application site, so I just want to remove that entire footer or center so it doesn't even get loaded into the application. — user1566783, Apr 16 '13 at 16:19
Are you trying to do this on the server, or in the browser? If you're doing this in the browser, I'd just use jQuery, grab the right div with a CSS selector and get rid of it: `$("div[id='blah']").html("")` or some such thing... If you're on the server, find an appropriate HTML parser for your platform (ASP.NET? PHP?) — Craig, Apr 16 '13 at 16:29

MikeM · Accepted Answer · 2013-04-17T08:23:45.467

1

Either use

<center>[\s\S]+?</center>

or turn on singleline or DOTALL mode if available so . matches any character including newlines,

(?s)<center>.+?</center>

The way that you turn on singleline mode varies with the language/tool, but adding (?s) to the start of the regex will work with many of them (but not JavaScript).
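As a rough illustration of the two options above, here is a minimal sketch in Python. The vendor tool's regex engine is not known, so Python's `re` module is only a stand-in, and the `html` string is a shortened, made-up version of the footer.

```python
import re

# Shortened stand-in for the page; the real markup is much longer.
html = (
    "<p>before</p>\n"
    "<center>\n"
    '<div id="divSiteFooter">\n'
    "<h3>Connect</h3>\n"
    "</div>\n"
    "</center>\n"
    "<p>after</p>"
)

# Without singleline/DOTALL mode, "." stops at newlines, so this finds nothing.
plain = re.search(r"<center>.+?</center>", html)

# [\s\S] matches any character, newlines included, with no flag needed.
char_class = re.search(r"<center>[\s\S]+?</center>", html)

# The inline (?s) flag and the re.DOTALL argument do the same job in Python.
inline = re.search(r"(?s)<center>.+?</center>", html)
flagged = re.search(r"<center>.+?</center>", html, re.DOTALL)

print(plain)                # None
print(char_class.group(0))  # the whole <center>...</center> block
```

All three working variants return the same span here. Which one the vendor application accepts depends on its engine.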

Further to comments

It would be better to include the opening div tag to make sure you're removing the right center element. I.e.

<center>\s*<div id="divSiteFooter">[\s\S]+?</center>
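To show why anchoring on the div id helps, here is a small sketch, again in Python as a stand-in for whatever engine the vendor tool uses. The `page` string and the `strip_footer` helper are hypothetical; the page deliberately contains a second, unrelated center element that should survive.

```python
import re

# Hypothetical page with one unrelated <center> block before the footer.
page = (
    "<center>keep me</center>\n"
    "<p>body</p>\n"
    "<center>\n"
    '<div id="divSiteFooter">\n'
    "<h3>Connect</h3>\n"
    "</div>\n"
    "</center>\n"
)

# Only a <center> immediately followed by the footer div is allowed to match.
FOOTER = re.compile(r'<center>\s*<div id="divSiteFooter">[\s\S]+?</center>')


def strip_footer(html: str) -> str:
    """Remove the footer block and leave every other <center> untouched."""
    return FOOTER.sub("", html)


cleaned = strip_footer(page)
print(cleaned)
```

The lazy `+?` stops at the first `</center>` after the footer div, so this still assumes the footer itself contains no nested center element.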

edited Apr 17 '13 at 08:23

answered Apr 16 '13 at 16:23

MikeM


Awesome, this worked perfectly. The site works now; as I suspected, it was breaking because of the footer! – user1566783 Apr 16 '13 at 16:29

Regex is still honestly the *wrong* tool to use for parsing HTML. Fair warning. :-) – Craig Apr 16 '13 at 16:35
That's assuming there isn't a `</center>` somewhere in the footer (thus the tags are basically exactly as provided, with little to no chance of variation). Since it's a footer, there may not be any / much other data following it, so `+` instead of `+?` may be better. Or extend the check to include the 3 `</div>`'s as well, or similar. `[\s\S]` is a little strange, `(.|\n)` is more clear. – Bernhard Barker Apr 17 '13 at 07:10
@Dukeling. Yes, it assumed there was only one center element in the _page_. I don't agree that just `+` would be better than `+?` or that the check should be extended to the three divs. `(.|\n)` uses a capture group unnecessarily and doesn't always match `\r`. – MikeM Apr 17 '13 at 08:38
@MikeM It's easy enough to make it a non-capturing group (`?:` in Java for example). Then `(.|\r|\n)`, or, depending on the language, there may be an option to allow wildcards to include new-lines which may be a better option. – Bernhard Barker Apr 17 '13 at 08:44

score 0 · Answer 2 · answered Apr 16 '13 at 16:12


Have you tried the multiline regex modifier (or matching mode) to search through newlines?

http://www.regular-expressions.info/anchors.html#multi
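For anyone trying this, a quick check in Python (used here only as an example engine; the vendor tool may behave differently) suggests that multiline mode changes how `^` and `$` behave rather than what `.` matches, so on its own it may not be enough. The `text` string is a made-up minimal sample.

```python
import re

text = "<center>\nfooter\n</center>"

# MULTILINE makes ^ (and $) match at every line boundary.
line_starts = re.findall(r"^.", text, re.MULTILINE)

# It does not let "." cross a newline, so this search still fails.
multiline_only = re.search(r"<center>.+?</center>", text, re.MULTILINE)

# DOTALL (singleline mode) is the flag that lets "." match newlines.
dotall = re.search(r"<center>.+?</center>", text, re.DOTALL)

print(line_starts)
print(multiline_only)
print(dotall.group(0) == text)
```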

answered Apr 16 '13 at 16:12

Jakuje
