Remove all empty tags except specified

Question

The following PHP regex removes all empty tags:

#<[^\/>]*>([\s]?)*<\/[^>]*>#u

I want to remove empty tags that do not match:

<div style="clear:both"></div>

I tried:

#^(<div style="clear:both"></div>)<[^\/>]*>([\s]?)*<\/[^>]*>#u

...but it didn't work.

How do I add a negation?

You are confusing `^`. It means start of subject or line. You need an assertion `(?!...)` instead. — mario, Jan 13 '13 at 02:48
HTML is not a regular language. And closing tags are *allowed* to be *missing* in some circumstances. In other words, an HTML document can be considered well-formed even if it is missing particular ending tags. This makes it more difficult to search for empty tags. — Tyler Crompton, Jan 13 '13 at 02:52
@TylerCrompton My HTML is generated by a PHP script, so it's perfectly fine in my situation to use regex. Please check my comments below on the DOM HTML parser. — Maximus, Jan 13 '13 at 03:02
@PeeHaa Simply? Well, for specific HTML that you are eyeballing, sure you do — just as you would in your editor. But for general HTML of utterly unknown provenance and composition and quirks, you’re right: one does not parse it ***simply*** with regexes. One instead [parses HTML ***rigourously, carefully, judiciously, meticulously, complexly, intricately, tortuously*** — perhaps even ***brain-twistingly***](http://stackoverflow.com/a/4234491/471272). But simply? No. Fortunately, such things are rarely required, since most canned HTML is diddleable in `vi` easily enough with `:%s/a/b/g`. — tchrist, Jan 13 '13 at 03:03
What does the fact it comes from PHP have to do with parsing HTML with something that isn't suited? — PeeHaa, Jan 13 '13 at 03:03
@tchrist Sure Tom. One could simply parse a fixed formatted HTML string with regex, until OP wants to match / exclude other things. Or make it easier to maintain. Or... Or... What's the point in it when you have a dedicated HTML parser built in? (besides having fun) — PeeHaa, Jan 13 '13 at 03:09
@TylerCrompton It is utterly immaterial that HTML does not conform to the formal and abstruse to the point of being recondite definition of a “regular language” per theoretical computer science, because the patterns used by modern programming languages broke through that nonsensical barrier before most people reading this were born, pretty much as soon as they added `(a+).*\1`. That is not the point. The point is that while perfectly possible, it is seldom advisable. That is something else altogether. — tchrist, Jan 13 '13 at 03:11
@PeeHaa You’re right: it is obviously for fun. I do not know that I would ever actually use regexes given a builtin dom processor. I just know that I edit HTML files in `vi` all the time, and when I do, I never shy from using `s/foo/bar/` type substitutions. I think people here too often over-engineer some works-everywhere-everytime solution instead of just doing what it takes to take care of the current task and go home. — tchrist, Jan 13 '13 at 03:13
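
As the first comment above notes, `^` anchors the match to the start of the subject and does not negate anything; a negative lookahead `(?!...)` is the construct for "not followed by". A minimal sketch of that idea applied to the pattern from the question follows. The sample input is made up for illustration, and `([\s]?)*` is written as the equivalent `\s*`. It assumes the clearing div appears exactly as quoted in the question.

```php
<?php
// Hypothetical sample input: three empty elements and one non-empty element.
$html = '<p></p><div style="clear:both"></div><span> </span><b>keep</b>';

// (?!...) is a negative lookahead: right after "<", the match is refused
// if the text that follows is exactly the clearing div from the question.
// The remainder is the question's pattern, with ([\s]?)* simplified to \s*.
$pattern = '#<(?!div style="clear:both"></div>)[^\/>]*>\s*<\/[^>]*>#u';

$cleaned = preg_replace($pattern, '', $html);

echo $cleaned, PHP_EOL; // <div style="clear:both"></div><b>keep</b>
```

A single pass only removes elements that are already empty, so an element that becomes empty after its children are removed would need a second pass.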

Tyler Crompton · Accepted Answer · 2013-01-13T03:41:00.577

Assuming that the HTML is well-formed and there are no missing end tags, this should do the trick:

<(?!div\s+style=(?:"[^"]*?\bclear:\s*both\b[^"]*"|'[^']*?\bclear:\s*both\b[^']*')\s*>\s*</div>).*?>\s*</.*?>

Make sure to use the case-insensitivity flag (`i`) too. I would still advise against using a regex for this, though.

EDIT: I haven't tested my edits, but I'm fairly confident that it's a bit more thorough.
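
One way the lookahead above might be combined with the pattern from the question is sketched below. This is an untested-in-production illustration, not the answer's exact expression: the lookahead is copied from the answer, but the `.*?` parts are swapped for the question's `[^/>]*` and `[^>]*` character classes, because `.*?` can run past the first `>` and match input such as `<p>text</p></div>`. The sample input and the variable names are assumptions made for the example.

```php
<?php
// Hypothetical sample input: an empty <p>, a clearing div written with
// different case, quoting and spacing, an unrelated empty div, and a
// non-empty <em>.
$html = '<p class="x">  </p>'
      . "<DIV STYLE='float:none; clear: both;'></DIV>"
      . '<div style="color:red"></div>'
      . '<em>text</em>';

// Nowdoc syntax avoids escaping the quotes inside the pattern.
// Lookahead: taken from the answer (any style attribute containing clear:both).
// Tail: the question's character classes instead of the answer's .*? (assumption).
// Flags: i = case-insensitive, u = UTF-8.
$pattern = <<<'RE'
~<(?!div\s+style=(?:"[^"]*?\bclear:\s*both\b[^"]*"|'[^']*?\bclear:\s*both\b[^']*')\s*>\s*</div>)[^/>]*>\s*</[^>]*>~iu
RE;

$cleaned = preg_replace($pattern, '', $html);

echo $cleaned, PHP_EOL; // <DIV STYLE='float:none; clear: both;'></DIV><em>text</em>
```

Like the question's pattern, this variant does not match an opening tag that contains a `/` (for example in an `href` value), so such empty elements are left in place.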

edited Jan 13 '13 at 03:41

answered Jan 13 '13 at 03:04


How do I combine it with my regex? – Maximus Jan 13 '13 at 03:12
@jason, do you mind providing a list of a couple of examples of what it should and shouldn't match? I'm not sure as to what exactly you're trying to do other than match most empty tags. – Tyler Crompton Jan 13 '13 at 03:31
So complicated; that's why you don't parse HTML with regex. – slier Jan 13 '13 at 09:53
