I need to match the following tag, together with its content, in the HTML source of a page:

<li class="someClass someClass2">
    ... some html code ...
</li>

I'm not very good at regular expressions, so I'd also appreciate comments with links to a good tutorial. I've been checking out http://www.regular-expressions.info/, but I'm not very happy with the explanations there.

What I found on the above site was something like this:

<li\b[^>]*>(.*?)</li>

This matches all the <li> tags, which is not what I want. I tried tweaking it and tested this one:

<li class="someClass someClass[1-9]{1,1}[0-9]*">(.*?)</li>

Unfortunately, this one doesn't do the job either. The second class name has the format someClassX, where X is a positive integer (1, 2, ...).

All I get from this regexp is "no matches". I'm testing it with the Kodos tool on Ubuntu.

What's even more depressing is the fact that this regexp:

<li class="someClass someClass[1-9]{1,1}[0-9]*">

actually matches the opening <li> tags, but nothing more, as if it gets "distracted" by the newline character.

I'm still looking for a solution on Google and will post it here if I find one, but I would also really appreciate some helpful input :)

Thx
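
A minimal sketch of the newline behaviour described above, written in Python because Kodos tests patterns against Python's `re` module. The sample markup is hypothetical (the real page is not shown in the question), and the redundant `{1,1}` has been dropped from the pattern. By default `.` does not match a newline, so the lazy group cannot reach `</li>` when the item spans several lines; the `re.DOTALL` flag lifts that restriction.

```python
import re

# Hypothetical sample page; the real markup is not shown in the question.
html = """<ul>
<li class="someClass someClass2">
    <b>wanted</b>
</li>
<li class="other">
    skipped
</li>
<li class="someClass someClass10">second</li>
</ul>"""

# Same pattern as in the question, with the redundant {1,1} removed.
pattern = r'<li class="someClass someClass[1-9][0-9]*">(.*?)</li>'

# Default behaviour: "." stops at "\n", so only the single-line item matches.
without_flag = re.findall(pattern, html)

# With DOTALL, "." also matches "\n", so multi-line items match as well.
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)
print(with_flag)
```

PHP's `preg_*` functions have a corresponding `s` pattern modifier. Note that the lazy group still stops at the first `</li>`, so this sketch would cut off an item that contains a nested list.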

  • Does it need to be a regular expression? Because HTML is not a regular language and your attempt to parse it with regular expressions could possibly fail. – Gumbo Nov 20 '10 at 15:17
  • @thejh: I'll add an answer to your question to the above Q =) – hummingBird Nov 20 '10 at 15:17
  • One thing - I suppose you are working with javascript, but in any case it would be useful to add a tag for the target language, as regex support/implementation is different from language to language. – icyrock.com Nov 20 '10 at 15:21
  • @Gumbo: Well, I want to compose a PHP script that will automatically grab some content from a site. Regexp is the first I thought of. – hummingBird Nov 20 '10 at 15:21
  • @icyrock.com: actually, it's php. However, I'm testing on Kodos program for linux. (http://stackoverflow.com/questions/4022089/testing-regular-expressions-tool-linux-ubuntu) this link is a result of an answer to some of my questions. – hummingBird Nov 20 '10 at 15:22
  • @playcat: Then please have a look at http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php-closed et al. – Gumbo Nov 20 '10 at 15:23
  • @Gumbo: thx, that's really helpful. +1 – hummingBird Nov 20 '10 at 15:25
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Nov 20 '10 at 15:35
  • @Gordon: That’s what I was looking for. – Gumbo Nov 20 '10 at 15:57
  • @Gumbo: He isn’t using REGULAR regular expressions. He’s using PHP. Once you have patterns like `/(.)\1/`, you are no longer REGULAR in the pathetically useless textbook sense that has no bearing whatsoever on any pattern language that anyone in this world uses. **PLEASE** stop mindlessly parroting the party-line falderol! – tchrist Nov 20 '10 at 20:53
  • @tchrist: I know that. But for the “daily work” it’s easier to use an already existing HTML parser than to build a tremendously complex regular expression that might get it right. Even the solution you proposed is not perfect. – Gumbo Nov 20 '10 at 21:25
  • @Gumbo: If you mean [my giant `` extractor](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491), it *may* be perfect for convincing people not to attempt such a thing, which is half of why I wrote it. I could break it if I wanted to easily enough, because it’s not doing a full parse. It would also be easy to add something that dealt with bits like ` – tchrist Nov 20 '10 at 22:18
  • @tchrist: The problem is that many people think regular expressions *are* the universal tool to solve every string-based problem. In that case I think it’s better to prevent people from trying to use regular expressions even if it is possible because it would be coupled with an utmost effort (not to mention that many only know the expressions `.*` and `.*?` that can lead to horrible behavior). And besides that, most languages already provide generally accepted parsers that can even deal with the quirks. – Gumbo Nov 20 '10 at 22:38
  • @Gumbo: Even if all things are possible, that doesn’t make them expedient—or advisable. Have you ever seen the regexes for finding solutions to Diophantine equations of order one, or for factoring composite numbers into primes? :) Yeah, nested `.*` and `.*?` are just *murder*; “grabby” quantifiers like `*+`, `++`, etc can help that lots *if* you’re careful. If only regexes had better debugger/profiler support to see how bad things can be! Java6 (erroneously) allows variable-width lookbehinds, but you go exponential, meaning that any real-world data takes next to forever and you never know why. – tchrist Nov 20 '10 at 22:50
  • @Gumbo, just for the record (which record I dunno, but it must be SOME record :), I’ve updated my [Yes You Can Parse HTML with Regexes](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) posting to include a robust parser (lexer ((chunker)). Tried really hard to break it but couldn’t. I hope it shows people that while you **CAN** parse any HTML with regexes, you really should use an existing parser class to do it for you. – tchrist Nov 22 '10 at 02:50
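
The parser route recommended in the comments above can be sketched roughly as follows, using Python's standard `html.parser` module. The class name `LiCollector` and the sample markup are hypothetical, and the sketch collects only the text inside each wanted `<li>`, not its inner HTML. In PHP, a DOM parser such as `DOMDocument` would play the same role.

```python
import re
from html.parser import HTMLParser


class LiCollector(HTMLParser):
    """Collect the text of <li> elements classed someClass plus someClass<N>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a wanted <li>
        self.items = []   # collected text, one entry per wanted <li>

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        if self.depth:
            self.depth += 1  # nested <li> inside a wanted one
            return
        classes = (dict(attrs).get("class") or "").split()
        numbered = any(re.fullmatch(r"someClass[1-9][0-9]*", c) for c in classes)
        if "someClass" in classes and numbered:
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


# Hypothetical sample page; the real markup is not shown in the question.
html = """<ul>
<li class="someClass someClass2">
    <b>wanted</b>
</li>
<li class="other">
    skipped
</li>
<li class="someClass someClass10">second</li>
</ul>"""

collector = LiCollector()
collector.feed(html)
items = [text.strip() for text in collector.items]
print(items)
```

Unlike the regex, this does not depend on attribute order, extra whitespace, or line breaks inside the tag, and the depth counter keeps nested lists from ending an item early.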