The following PHP regex removes all empty tags:

#<[^\/>]*>([\s]?)*<\/[^>]*>#u

I want to remove all empty tags except this one:

<div style="clear:both"></div>

I tried:

#^(<div style="clear:both"></div>)<[^\/>]*>([\s]?)*<\/[^>]*>#u

...but it didn't work.

How do I add a negation?

Andy Lester
Maximus
  • One does not simply parse / process HTML with regex. – PeeHaa Jan 13 '13 at 02:44
  • You are misusing `^`. It means start of subject or line. You need a negative lookahead assertion `(?!...)` instead. – mario Jan 13 '13 at 02:48
  • HTML is not a regular language. And closing tags are *allowed* to be *missing* in some circumstances. In other words, an HTML document can be considered well-formed even if it is missing particular ending tags. This makes it more difficult to search for empty tags. – Tyler Crompton Jan 13 '13 at 02:52
  • @TylerCrompton My HTML is generated by a PHP script, so it's perfectly fine in my situation to use regex. Please see my comments below on the DOM HTML parser. – Maximus Jan 13 '13 at 03:02
  • @PeeHaa Simply? Well, for specific HTML that you are eyeballing, sure you do — just as you would in your editor. But for general HTML of utterly unknown provenance and composition and quirks, you’re right: one does not parse it ***simply*** with regexes. One instead [parses HTML ***rigourously, carefully, judiciously, meticulously, complexly, intricately, tortuously*** — perhaps even ***brain-twistingly***](http://stackoverflow.com/a/4234491/471272). But simply? No. Fortunately, such things are rarely required, since most canned HTML is diddleable in `vi` easily enough with `:%s/a/b/g`. – tchrist Jan 13 '13 at 03:03
  • What does the fact it comes from PHP have to do with parsing HTML with something that isn't suited? – PeeHaa Jan 13 '13 at 03:03
  • @tchrist Sure Tom. One could simply parse a fixed formatted HTML string with regex, until OP wants to match / exclude other things. Or make it easier to maintain. Or... Or... What's the point in it when you have a dedicated HTML parser built in? (besides having fun) – PeeHaa Jan 13 '13 at 03:09
  • @TylerCrompton It is utterly immaterial that HTML does not conform to the formal and abstruse to the point of being recondite definition of a “regular language” per theoretical computer science, because the patterns used by modern programming languages broke through that nonsensical barrier before most people reading this were born, pretty much as soon as they added `(a+).*\1`. That is not the point. The point is that while perfectly possible, it is seldom advisable. That is something else altogether. – tchrist Jan 13 '13 at 03:11
  • @PeeHaa You’re right: it is obviously for fun. I do not know that I would ever actually use regexes given a builtin dom processor. I just know that I edit HTML files in `vi` all the time, and when I do, I never shy from using `s/foo/bar/` type substitutions. I think people here too often over-engineer some works-everywhere-everytime solution instead of just doing what it takes to take care of the current task and go home. – tchrist Jan 13 '13 at 03:13
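
To illustrate mario's comment about `(?!...)`, here is a minimal sketch that places a negative lookahead right after the opening `<` of the question's pattern. The sample input and the variable names are hypothetical, and `([\s]?)*` is written as the equivalent `\s*`; treat it as a starting point rather than a tested solution.

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = '<p></p><div style="clear:both"></div><span> </span><b>kept</b>';

// The question's pattern with a negative lookahead inserted after "<".
// The lookahead rejects a match that would start at the exact string
// <div style="clear:both"></div>; every other empty tag still matches.
$pattern = '#<(?!div style="clear:both"></div>)[^/>]*>\s*</[^>]*>#u';

$cleaned = preg_replace($pattern, '', $html);
echo $cleaned, "\n";
```

Because the lookahead compares against a literal string, any variation in spacing, quoting, or letter case in the `div` would make it count as an ordinary empty tag again.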

1 Answer

Assuming that the HTML is well-formed and there are no missing end tags, this should do the trick:

<(?!div\s+style=(?:"[^"]*?\bclear:\s*both\b[^"]*"|'[^']*?\bclear:\s*both\b[^']*')\s*>\s*</div>).*?>\s*</.*?>

Make sure to use the case-insensitive flag (`i`) too. I would still advise against using regex for this, though.

EDIT: I haven't tested my edits, but I'm fairly confident that the pattern is now a bit more thorough.
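
As a rough sketch of how the pattern above might be applied in PHP with `preg_replace`: the sample input and variable names are hypothetical. One deliberate change from the answer is that the two lazy `.*?` runs are swapped for negated character classes, since `.*?` can stretch across a tag boundary and swallow non-empty content such as `<p>text</p><b></b>`. The lookahead itself is unchanged.

```php
<?php
// Hypothetical sample input, not taken from the question or the answer.
$html = '<p></p><DIV STYLE="margin:0; clear: both"></DIV><span> </span><b>kept</b>';

// Same negative lookahead as the answer's pattern. Only the two lazy ".*?"
// runs are replaced with negated character classes (an assumption of this
// sketch). A nowdoc avoids escaping the quotes; "~" is the delimiter and
// "i" is the case-insensitive flag.
$pattern = <<<'REGEX'
~<(?!div\s+style=(?:"[^"]*?\bclear:\s*both\b[^"]*"|'[^']*?\bclear:\s*both\b[^']*')\s*>\s*</div>)[^<>/]*>\s*</[^<>]*>~i
REGEX;

$cleaned = preg_replace($pattern, '', $html);
echo $cleaned, "\n";
```

A single pass only removes tags that are empty at the time of matching, so a wrapper such as `<div><p></p></div>` would need a second pass to disappear. Opening tags whose attributes contain `/` are also skipped by `[^<>/]*`, the same limitation as the pattern in the question.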

Tyler Crompton