Regex issues matching HTML Tag

Question

So I'm trying to use sed (it has to be sed on these systems, so please don't just recommend Perl) to match an HTML tag and get the contents out of it. The HTML tags look roughly like this:

<div class="SectionText"> Received poor service or think your current mechanic is ripping you off? Get some help from <a href="http://www.union.umd.edu/gradlegalaid/index.htm" target="_blank">Graduate Legal Aid</a> or consult the <a href="http://www.oag.state.md.us/Consumer/index.htm" target="_blank">Maryland Attorney General Office of Consumer Protection</a> at <a href="mailto:consumer@oag.state.md.us">consumer@oag.state.md.us</a> or through their hotline at 410-528-8662 or 888-743-0023.<br /></div>

All on one line. So, I wrote this one... But it doesn't work.

sed 's/<div class=\"SectionText\">\([^<\/div>]*\)<\/div>/\1/g'

This does not alter any text.

I tried to use this website as a guideline - http://www.ibm.com/developerworks/linux/library/l-sed2/index.html (under RegExp Snafus)

The most important thing is for this one-line script NOT to be greedy and match up until the last </div>.

I think you should remove the `\ ` before the first `(` and the `)` — Bjørne Malmanger, Mar 16 '12 at 20:43
@Bjørne Malmanger: He needs those to escape the parens for the command line, because he is using `sed`. — Jeff B, Mar 16 '12 at 20:51
@Bjørne Malmanger, @Jeff B: No, those are part of sed's funky regex syntax. It uses `\(` and `\)` for grouping, and `\|` for alternatives. http://www.delorie.com/gnu/docs/sed/sed_5.html — Thomas, Mar 16 '12 at 21:03
@Truth Those who can, do. See [here](http://stackoverflow.com/a/4234491/471272) and [here](http://stackoverflow.com/a/7198796/471272). All things are possible, but not all are expedient. Anybody who has to ask how, surely should not be attempting it. — tchrist, Mar 16 '12 at 22:55
@tchrist I never said it wasn't possible. I just asked him not to try. — Madara's Ghost, Mar 17 '12 at 09:23

score 3 · Answer 1 · edited May 23 '17 at 12:11


Aside from trying to use regular expressions on HTML (see RegEx match open tags except XHTML self-contained tags), the first problem I see is this:

[^<\/div>]*

This is saying match any characters that aren't <, /, d, i, v, or >. And clearly, you have a d and an i in there. ("Rece i ve d poor serv....")
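A quick way to see this for yourself is to wrap whatever the bracket expression matches in square brackets. This is only a sketch: the sample string is the first few words of the question's text, and `#` is used as the delimiter so that `/` needs no escaping.

```shell
# Hypothetical sample: the first words of the question's div text.
sample='Received poor service'

# '#' is the s-command delimiter here, so '/' needs no escaping.
# The bracket expression excludes the single characters < / d i v >.
result=$(printf '%s\n' "$sample" | sed 's#^[^</div>]*#[&]#')

printf '%s\n' "$result"
```

This prints `[Rece]ived poor service`: the match stops at the first `i`, because `i` is one of the excluded characters.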

If you are set on using regex for this, and you have very controlled/predictable input, you could simply use [^<>], assuming your text won't contain these characters. But I see that it does, because you have tags inside of your div...

But, if you do this:

sed 's/<div.class="SectionText">\(.*\)<\/div>/\1/g'

It should work as long as the line doesn't contain multiple </div>s. Because .* is greedy, it matches up to the last </div> on the line, not the first.
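To see why multiple closing tags matter, here is a minimal sketch with a made-up line containing two divs (the second class name, `Other`, is purely illustrative):

```shell
# Hypothetical input: two divs on the same line.
line='<div class="SectionText">one</div><div class="Other">two</div>'

# .* is greedy, so the capture group runs to the LAST </div> on the line.
greedy=$(printf '%s\n' "$line" | sed 's#<div class="SectionText">\(.*\)</div>#\1#')

printf '%s\n' "$greedy"
```

The output is `one</div><div class="Other">two`, so the first closing tag and the whole second opening tag end up inside the captured text.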

answered Mar 16 '12 at 20:48

Jeff B


Argh, you are right. I suppose he could just match on non-`<>` characters, and hope they don't occur in the text. But, then he shouldn't be using RegEx for this anyway. Edited. – Jeff B Mar 16 '12 at 20:53
`.*` is greedy. [it won't work if there are multiple `</div>`](http://ideone.com/VaNgt) – jfs Mar 16 '12 at 21:14
Right, that's what I said. "as long as you don't have nested `div`s" – Jeff B Mar 16 '12 at 21:42
Actually, I guess that's slightly different. I guess I should have said, as long as you don't have multiple `</div>` tags. – Jeff B Mar 16 '12 at 22:01
See [here](http://stackoverflow.com/a/4234491/471272) and [here](http://stackoverflow.com/a/7198796/471272). All things are possible, but not all are expedient. Anybody who has to ask how, surely should not be attempting it. – tchrist Mar 16 '12 at 22:54
@tchrist: I agree wholeheartedly. However, sometimes I do things with regex that I shouldn't. If I know my input, and I want to do something fast, one-time use, I will go with it. For production code, or something that needs to be robust, I would try to use the right tool for the right job. I sometimes use a screwdriver for a chisel, too, in a pinch. – Jeff B Mar 16 '12 at 22:58
@JeffB I use regexes on HTML all the time. For example, when editing a page in `vi`, or just plain grepping. For most edit jobs, I use `perl`, whose recursible patterns actually ***can*** handle HTML — unlike those of `sed`. However, I also know [when to use a full parser](http://stackoverflow.com/a/9702624/471272). People asking for regex help with HTML just aren’t good enough to be able to use any general-purpose, completist, and correct regex-based answer given them. Those who can, don’t need to ask. – tchrist Mar 16 '12 at 23:10

score 2 · Accepted Answer · answered Mar 16 '12 at 21:01

[^<\/div>]*

This does not do what you think it does. This matches any sequence of characters that are not <, /, d, i, v or >.

In Perl you could simply use .*?, but as sed does not support non-greedy matches, you'll have to write something like this beauty:

sed 's#<div class="SectionText">\(\([^<]\|<[^/]\|</[^d]\|</d[^i]\|</di[^v]\|</div[^>]\)*\)</div>#\1#g'

This says "any sequence of characters that are not <, or are < not followed by /, or are </ not followed by d," and so on.

Needless to say, this is an unreadable, unmaintainable and nearly unwritable piece of crap and you should almost certainly not be using it, but if you absolutely, positively must use regexes to parse HTML and absolutely, positively must use sed, then here you go.
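For anyone who wants to try it, here is a minimal sketch of the same pattern against a made-up line with two divs. Note that `\|` is a GNU extension in basic regular expressions, so this version is written with `-E` (extended syntax) instead, on the assumption that your sed supports that flag; GNU and BSD sed both do. The input line and the `x` URL are hypothetical.

```shell
# Hypothetical input: two SectionText divs on one line, the first with a nested tag.
line='<div class="SectionText">one <a href="x">link</a></div> <div class="SectionText">two</div>'

# Same alternation as above, in extended syntax: ( ) and | need no backslashes.
# The inner group consumes text one piece at a time but can never step over "</div>".
clean=$(printf '%s\n' "$line" |
  sed -E 's#<div class="SectionText">(([^<]|<[^/]|</[^d]|</d[^i]|</di[^v]|</div[^>])*)</div>#\1#g')

printf '%s\n' "$clean"
```

This prints `one <a href="x">link</a> two`: each div is unwrapped separately and the nested `<a>` tag survives, which is the non-greedy behaviour the question asked for.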
