1

What's the appropriate Perl or Java regex to extract only the second line below? It should find the div tag containing the class="matchthis" attribute.

<div>Do not match this</div>
<div class="matchthis">MATCH THIS</div>
<div class="unimportant">Do not match this</div>

Please do not tell me to use DOM/Soup/etc. I want to know whether a raw regex can solve the simple problem above (you'll be awarded for the answer!). Yes, I'm aware of this post, so don't even mention it.

slashline
  • 11
    Why are you asking us not to give you the correct answer? – SLaks Jun 10 '11 at 22:48
  • It's not clear if your div element can contain anything inside (other divs?) and if the tag can contain other attributes... – leonbloy Jun 10 '11 at 23:01
  • 2
    @SLaks: Why does your religion blind you to other reasonable approaches? Your comment is overvalued and misplaced. It is also wrong. – tchrist Jun 11 '11 at 00:42
  • @SLaks: Unfortunately for you, someone else here provided an answer that actually works. – slashline Jun 13 '11 at 15:28

3 Answers

3

As you already seem to know, using regular expressions to parse HTML is a bad idea.

In this specific case, I'm pretty sure all you really want is this:

<div class="lulz">(.*)<\/div>

Now, the more flexible you want to get, the more unreadable your regular expression will become. And this is the danger of trying to use regular expressions instead of a proper parser. For instance, say you want to allow for additional attributes besides class. A kind of functional regular expression for this might look like:

<div[^>]*class="[^\"]*lulz[^\"]*".*>(.*)<\/div>

Totally readable, right? (Also, almost certainly very wrong.)
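
For Java specifically, a minimal sketch of the simple pattern might look like the following. It assumes the `matchthis` class from the question rather than `lulz`, swaps the greedy `(.*)` for a reluctant `(.*?)` so the match stops at the first `</div>`, and drops the `\/` escape, which a Java string literal does not accept. The class and method names (`MatchThisDemo`, `extract`) are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchThisDemo {
    // Reluctant (.*?) stops at the first closing </div> on the same line.
    // '.' does not match line breaks by default, so the div must sit on one line.
    private static final Pattern DIV =
        Pattern.compile("<div class=\"matchthis\">(.*?)</div>");

    // Returns the text captured by group 1, or empty if nothing matched.
    static Optional<String> extract(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<div>Do not match this</div>",
            "<div class=\"matchthis\">MATCH THIS</div>",
            "<div class=\"unimportant\">Do not match this</div>");
        System.out.println(extract(html).orElse("(no match)")); // prints: MATCH THIS
    }
}
```

This only holds up while the attribute is written exactly as `class="matchthis"` and the div has no nested divs; anything looser runs into the readability problem described above.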

Dan Tao
  • 1
    It's only a bad idea when attacking certain problems; sometimes, within a specific domain, it is quite possible. Whether that will work depends on whether there are nested DIVs inside that one, and whether there are any other attributes on the matching DIV. – Orbling Jun 10 '11 at 22:57
  • 1
    @Orbling: Honestly, I tend to think it's *always* a bad idea, for the simple reason that RegEx is already a fairly heavyweight solution, and there are alternatives out there (e.g., the now-well-known `HtmlAgilityPack` for .NET) that are really not a hassle to use *at all* and are far more correct and robust. – Dan Tao Jun 10 '11 at 22:58
  • 1
    or `<div[^>]+?class="lulz"[^>]*?>(.*)(?!</div>)`, escaping where necessary
    – Javier C Jun 10 '11 at 22:59
  • 1
    @Dan Tao: Depends on what language you are using. Regex is *far* more efficient than parsing an entire DOM tree, if a regex is usable in the case required. This has a Perl tag (and a Java tag, which is confusing); in Perl, regexes are the immediate go-to tool. The regex engines in the scripting languages are pretty quick and efficient - I consider it a lightweight solution. – Orbling Jun 10 '11 at 23:03
  • @Orbling: Fair enough; to be honest, I have a tendency to not notice tags because this site is so .NET-heavy in general. Given that I have no Perl experience, I am definitely willing to soften my claim that grabbing values from HTML with RegEx is "always" a bad idea; I still suspect that it is *often* a bad idea, however. – Dan Tao Jun 10 '11 at 23:07
  • 4
    but this regex will also match `<div class="lulz">MATCH</div><div>WRONG</div>`. Better use `<div class="lulz">(.*?)</div>`. (`\/` is an invalid escape sequence in Java)
    – user85421 Jun 10 '11 at 23:13
  • @Dan Tao: Well, as I almost never have to touch .Net (occasional F# usage), it is a side of the site I do not see. ;-) It is frequently a bad idea to use regex with HTML, because of the level of unreliability and complexity of HTML and potentially nesting difficulties. But if you have consistent HTML, with sufficient uniqueness to target in on, then it is simple enough with regex. – Orbling Jun 10 '11 at 23:19
  • 1
    **This is a false assumption!** Regular expressions **should never ever** become hard to read. If they do, you’re doing them wrong. When you understand the approaches in [these answers](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326), [this one](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579), and [this one](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491), then come back and consider recanting your heresies. – tchrist Jun 11 '11 at 00:47
  • *chuckle* Please explain how your rather broad "hard to read" applies across the entire population of programmers. Once you become proficient at designing correct, effective regular expressions, "hard to read" becomes a far more distant target, especially when you consider the 'x' modifier and its effect on comprehension. – Rob Raisch Jun 11 '11 at 22:36
  • @Carlos Heuberger and @Dan Tao have the correct answer (I had to remove the \/ escape in Java, as pointed out by Carlos). This:
    `<div class="matchthis">(.*?)</div>`
    did the job, and it's much quicker than parsing an entire DOM tree. It's working GREAT for my application. Thank you guys!
    – slashline Jun 13 '11 at 15:27
1

If there are no nested tags inside your <div> you can use this

/<div[^>]+class="matchthis"[^>]*>[^>]*<\/div>/

Otherwise you need to know what is inside or a different solution (as you know).
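
Since the question also mentions Java, here is a rough sketch of the same pattern ported to `java.util.regex`, with the Perl `/.../` delimiters and the `\/` escape dropped. It assumes the whole element should be returned and that there may be several such divs in the input; `NoNestedTagsDemo` and `findAll` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NoNestedTagsDemo {
    // [^>]+ and [^>]* allow other attributes around class="matchthis".
    // The body [^>]* cannot cross a '>', so any nested tag prevents a match.
    private static final Pattern DIV = Pattern.compile(
        "<div[^>]+class=\"matchthis\"[^>]*>[^>]*</div>");

    // Collects every complete <div ...>...</div> element that matches.
    static List<String> findAll(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = DIV.matcher(html);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<div>Do not match this</div>",
            "<div class=\"matchthis\">MATCH THIS</div>",
            "<div class=\"unimportant\">Do not match this</div>");
        // prints: <div class="matchthis">MATCH THIS</div>
        for (String hit : findAll(html)) {
            System.out.println(hit);
        }
    }
}
```

A div such as `<div class="matchthis"><b>x</b></div>` is skipped entirely rather than partially matched, which is the "no nested tags" restriction in practice.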

cordsen
0

If you are interested only in the text between the tags, rather than the whole line, you could use lookarounds.

With this regex,

m{(?<=<div class="matchthis">)([^<]+)(?=</div>)}

you can get text between tags inside the $1 variable; note that the second group of round parentheses is the capturing one.

The first and last groups of round parentheses are positive lookarounds (a lookbehind and a lookahead); they match without capturing or consuming any text.
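
For the Java side of the question, the same lookaround idea could be sketched roughly as below. Java accepts this lookbehind because its length is fixed. The names `LookaroundDemo` and `textBetween` are invented for the example, and the code assumes the opening tag is written exactly as `<div class="matchthis">`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // (?<=...) and (?=...) are zero-width: they assert the surrounding tags
    // without including them in the match. ([^<]+) is the only capturing group.
    private static final Pattern TEXT = Pattern.compile(
        "(?<=<div class=\"matchthis\">)([^<]+)(?=</div>)");

    // Returns the text between the tags, or empty if nothing matched.
    static Optional<String> textBetween(String html) {
        Matcher m = TEXT.matcher(html);
        // Because the lookarounds consume nothing, m.group() would return
        // the same string as m.group(1) here.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<div>Do not match this</div>",
            "<div class=\"matchthis\">MATCH THIS</div>",
            "<div class=\"unimportant\">Do not match this</div>");
        System.out.println(textBetween(html).orElse("(no match)")); // prints: MATCH THIS
    }
}
```

Note that `[^<]+` requires at least one character, so an empty `<div class="matchthis"></div>` yields no match.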

Anyway, others have already given advice: don't (ab)use regexes on HTML.

Marco De Lellis
  • Look-behind is not widely supported, mostly due to the computational burden it imposes. JavaScript does not support it, for example. – Rob Raisch Jun 11 '11 at 22:29