HOWTO: compose REGEXP to match <li> with certain class attr

Question

I need to catch the following tags + content in html source of the page:

<li class="someClass someClass2">
    ... some html code ...
</li>

I'm not very good at regular expressions, so I'd also appreciate comments with links to a good tutorial. I've been reading http://www.regular-expressions.info/, but I'm not very happy with the explanations there.

What I found on the above site was something like this:

<li\b[^>]*>(.*?)</li>

This matches all the <li> tags, which is not what I want. I tried messing around with it, and tested this one

<li class="someClass someClass[1-9]{1,1}[0-9]*">(.*?)</li>

Unfortunately, this one doesn't do the job either. The second class name has the format someClassX, where X is from {1, 2, ... } (well, obviously, it's not a set of natural numbers :) )

All I get from this regexp is "no matches". I'm using Ubuntu, Kodos tool.

What's even more depressing is the fact that this regexp:

<li class="someClass someClass[1-9]{1,1}[0-9]*">

actually catches the opening <li> tags, but nothing more, just as if it gets "distracted" by the newline character.
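
A note on the "distracted by the newline" symptom: in Python's `re` module (which Kodos tests against) and in PHP's PCRE, `.` does not match a newline unless a flag is set (`re.DOTALL` in Python, the `s` modifier in PCRE). A minimal sketch in Python, using a made-up sample string modelled on the snippet above; `[1-9]{1,1}` is written as the equivalent `[1-9]`:

```python
import re

# Hypothetical sample input, modelled on the snippet in the question.
html = '<li class="someClass someClass2">\n    <b>some html code</b>\n</li>\n'

# Same idea as the second attempt in the question; {1,1} is redundant and dropped.
pattern = r'<li class="someClass someClass[1-9][0-9]*">(.*?)</li>'

# Without flags, "." cannot cross the newline, so "</li>" is never reached.
no_flag = re.search(pattern, html)

# With re.DOTALL, "." also matches "\n" and the lazy group spans all lines.
with_flag = re.search(pattern, html, re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # the content between the tags, newlines included
```

This only explains the "no matches" result; it does not make the pattern robust against nested `<li>` elements or a different attribute order.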

I'm still looking for a solution on Google, and I'll post it here if I find one, but I would also really appreciate some helpful input :)

Thx

Does it need to be a regular expression? Because HTML is not a regular language and your attempt to parse it with regular expressions could possibly fail. — Gumbo, Nov 20 '10 at 15:17
@thejh: I'll add an answer to your question to the above Q =) — hummingBird, Nov 20 '10 at 15:17
One thing - I suppose you are working with javascript, but in any case it would be useful to add a tag for the target language, as regex support/implementation is different from language to language. — icyrock.com, Nov 20 '10 at 15:21
@Gumbo: Well, I want to compose a PHP script that will automatically grab some content from a site. Regexp is the first I thought of. — hummingBird, Nov 20 '10 at 15:21
@icyrock.com: actually, it's php. However, I'm testing on Kodos program for linux. (http://stackoverflow.com/questions/4022089/testing-regular-expressions-tool-linux-ubuntu) this link is a result of an answer to some of my questions. — hummingBird, Nov 20 '10 at 15:22
@playcat: Then please have a look at http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php-closed et al. — Gumbo, Nov 20 '10 at 15:23
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Nov 20 '10 at 15:35
@Gumbo: He isn’t using REGULAR regular expressions. He’s using PHP. Once you have patterns like `/(.)\1/`, you are no longer REGULAR in the pathetically useless textbook sense that has no bearing whatsoever on any pattern language that anyone in this world uses. **PLEASE** stop mindlessly parroting the party-line falderol! — tchrist, Nov 20 '10 at 20:53
@tchrist: I know that. But for the “daily work” it’s easier to use an already existing HTML parser than to build a tremendous complex regular expression that might get it right. Even the solution you proposed is not perfect. — Gumbo, Nov 20 '10 at 21:25
@Gumbo: If you mean [my giant `` extractor](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491), it *may* be perfect for convincing people not to attempt such a thing, which is half of why I wrote it. I could break it if I wanted to easily enough, because it’s not doing a full parse. It would also be easy to add something that dealt with bits like ` — tchrist, Nov 20 '10 at 22:18
@tchrist: The problem is that many people think regular expressions *are* the universal tool to solve every string-based problem. In that case I think it’s better to prevent people from trying to use regular expressions even if it is possible because it would be coupled with an utmost effort (not to mention that many only know the expressions `.*` and `.*?` that can lead to horrible behavior). And besides that, most languages already provide generally accepted parsers that can even deal with the quirks. — Gumbo, Nov 20 '10 at 22:38
@Gumbo: Even if all things are possible, that doesnt make them expedient—or advisable. Have you ever seen the regexes for finding solutions to Diophantine equations of order one, or for factoring composite numbers into primes? :) Yeah, nested `.*` and `.*?` are just *murder*; ”grabby” quantifiers like `*+`, `++`, etc can help that lots *if* you’re careful. If only regexes had better debugger/profiler support to see how bad things can be! Java6 (erroneously) allows variable-width lookbehinds, but you go exponential, meaning that any real-world data takes next to forever and you never know why. — tchrist, Nov 20 '10 at 22:50
@Gumbo, just for the record (which record I dunno, but it must be SOME record :), I’ve updated my [Yes You Can Parse HTML with Regexes](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) posting to include a robust parser (lexer ((chunker)). Tried really hard to break it but couldn’t. I hope it shows people that while you **CAN** parse any HTML with regexes, you really should use an existing parser class to do it for you. — tchrist, Nov 22 '10 at 02:50
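
To illustrate the "use an existing parser" advice from the comments above, here is a minimal sketch using Python's standard-library `html.parser`. It is written in Python only to stay in the language Kodos uses; in PHP the analogous route would be a DOM-based parser such as `DOMDocument`. The class name `LiCollector`, the helper `li_texts`, and the sample markup are all hypothetical, and the sketch collects only the text content of each matching `<li>`, not its inner HTML.

```python
import re
from html.parser import HTMLParser


class LiCollector(HTMLParser):
    """Collects the text inside <li> elements that carry a class matching a pattern."""

    def __init__(self, class_pattern):
        super().__init__()
        self.class_pattern = class_pattern  # compiled regex tested against each class token
        self.depth = 0                      # > 0 while inside a matching <li>
        self.items = []                     # accumulated text of each matching <li>

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "li":
                self.depth += 1             # nested <li> inside a matching one
        elif tag == "li":
            classes = (dict(attrs).get("class") or "").split()
            if any(self.class_pattern.fullmatch(c) for c in classes):
                self.depth = 1
                self.items.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "li":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def li_texts(html, class_pattern):
    parser = LiCollector(class_pattern)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.items]


# Hypothetical sample page fragment.
sample = (
    '<ul>\n'
    '<li class="other">skip me</li>\n'
    '<li class="someClass someClass2">\n    <b>keep</b> me\n</li>\n'
    '</ul>'
)
result = li_texts(sample, re.compile(r"someClass[0-9]+"))
print(result)  # ['keep me']
```

Because the class attribute is split into tokens before matching, the order of the class names and any extra classes no longer matter, which is one of the things the regex-only attempts cannot easily guarantee.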

Kamal · Accepted Answer · 2010-11-20T17:11:10.690

2

This regex does what you're looking for (in Kodos at least... your mileage may vary!)

<li class="someClass someClass[0-9]+">(.*\n)*?</li>
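
A quick way to see what this pattern does and does not match, sketched in Python's `re` (the engine Kodos tests against). The sample strings are made up. Since `(.*\n)*?` consumes whole lines, the closing `</li>` has to sit at the very start of a line, and the capture group keeps only the last line it consumed:

```python
import re

pattern = re.compile(r'<li class="someClass someClass[0-9]+">(.*\n)*?</li>')

# Closing tag at the start of a line: the pattern matches.
flush = '<li class="someClass someClass2">\n    text\n</li>\n'

# Closing tag indented: "(.*\n)*?" can only stop right after a newline, so no match.
indented = '<li class="someClass someClass2">\n    text\n  </li>\n'

flush_match = pattern.search(flush)
indented_match = pattern.search(indented)

print(flush_match is not None)      # True
print(indented_match is not None)   # False
print(repr(flush_match.group(1)))   # only the last consumed line: '    text\n'
print(repr(flush_match.group(0)))   # the whole element, opening tag to closing tag
```

If the whole element is needed, `group(0)` is the safer thing to read; the `re.DOTALL` flag with `(.*?)` is a common alternative when the closing tag may be indented or share a line with content.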

edited Nov 20 '10 at 17:11

answered Nov 20 '10 at 15:54


Unfortunately, it doesn't do the job... It selects everything from the starting li tag until the ending li tag... I entered ` test – hummingBird Nov 20 '10 at 17:03

@playcat, I have edited my answer slightly (added a question mark near the end of the regex, to consume the minimal instead of maximal matching pattern). Does that do the trick? – Kamal Nov 20 '10 at 17:12
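
The effect of that added question mark can be sketched as follows, in Python and with a made-up two-item input. The greedy form `(.*\n)*` runs to the last `</li>` in the input, while the lazy form `(.*\n)*?` stops at the first one it can reach:

```python
import re

# Hypothetical input with two matching list items.
html = (
    '<li class="someClass someClass1">\nfirst\n</li>\n'
    '<li class="someClass someClass2">\nsecond\n</li>\n'
)

opening = r'<li class="someClass someClass[0-9]+">'
greedy = re.compile(opening + r'(.*\n)*</li>')    # before the edit
lazy = re.compile(opening + r'(.*\n)*?</li>')     # after the edit

greedy_matches = [m.group(0) for m in greedy.finditer(html)]
lazy_matches = [m.group(0) for m in lazy.finditer(html)]

print(len(greedy_matches))  # 1: one match swallowing both items
print(len(lazy_matches))    # 2: one match per item
```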

Yes, that one did the job :). Thank you! I was strongly discouraged from using regular expressions for grabbing content from HTML files, but I am still interested in learning them more thoroughly. – hummingBird Nov 20 '10 at 19:05

@playcat take a look at [this question](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags) and the various answers. They show how while *of course* you **CAN** use modern patterns to parse HTML — as everyone knows — you probably should only do so on specific HTML, not generic HTML. Otherwise it becomes too much of a bother to get right; most people never manage. – tchrist Nov 20 '10 at 20:58
