Regex: match all html tags on line before line break

Question

I'm pulling my hair out over this one. I want to get the contents of all <a> tags. My HTML structure looks like:

<ul>
  <li><a href="#">One</a></li>
  <li><a href="#">Two</a></li>
  <li><a href="#">Three</a></li>
</ul>

And my regex:

/<a ?.*>(.*?)<\/a>/

The problem occurs when the CMS doesn't output the <li>s with a line break between them:

<ul>
  <li><a href="#">One</a></li><li><a href="#">Two</a></li>
  <li><a href="#">Three</a></li>
</ul>

This is some example output of the match array:

Array
(
    [0] => Array
        (
            [0] => <a href="/schools/early-years-groups" class="active">Early Years Groups</a></li><li class="leaf first menu-mlid-20328 order_early_years_stuff"><a href="#" title="Order Schools Stuff">Order Early Years Stuff</a>
            [1] => <a href="/schools/early-years-groups/fundraise" title="Fundraise">Fundraise</a>
            [2] => <a href="/schools/early-years-groups/ey-showcase" title="Early Years Showcase">Early Years Showcase</a>
            [3] => <a href="/schools/how-to-pay-your-money-in" title="">How To Pay Your Money In</a>
            [4] => <a href="/schools/early-years-groups/learning-activities" title="Learning Activities">Learning Activities</a>
        )

    [1] => Array
        (
            [0] => Order Early Years Stuff
            [1] => Fundraise
            [2] => Early Years Showcase
            [3] => How To Pay Your Money In
            [4] => Learning Activities
        )

)

Thanks very much for any help, this is driving me nuts!

Ah, the never-ending [stream of confusion](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... — Kerrek SB, Nov 11 '11 at 12:39
Do you need to use regex for this task? PHP has a few HTML parsers at its disposal that are better suited for this. — Jens, Nov 11 '11 at 12:41
@KerrekSB [oh, the irony…](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) — Gordon, Nov 11 '11 at 13:52
possible duplicate of http://stackoverflow.com/questions/3946506/crawling-a-html-page-using-php/3955436#3955436 — Gordon, Nov 11 '11 at 13:55

score 2 · Answer 1 · answered Nov 11 '11 at 12:41

You should not use a regular expression to parse HTML... you will find plenty of examples lying around here explaining why.

Maybe something like PHP Simple HTML DOM Parser will do the trick for you.
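
To give a rough idea of what the parser route looks like, here is a minimal sketch. It is not the PHP library suggested above. It swaps in Python's standard html.parser module to show the same idea, and the class name AnchorTextCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the text content of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []      # finished anchor texts, in document order
        self._parts = None   # text chunks of the anchor being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []  # start collecting text for this anchor

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.texts.append("".join(self._parts))
            self._parts = None


# Sample markup from the question, with two <li> items on one line.
html = (
    '<ul>\n'
    '  <li><a href="#">One</a></li><li><a href="#">Two</a></li>\n'
    '  <li><a href="#">Three</a></li>\n'
    '</ul>'
)

collector = AnchorTextCollector()
collector.feed(html)
collector.close()
anchor_texts = collector.texts
print(anchor_texts)  # ['One', 'Two', 'Three']
```

Because the parser tracks tags rather than lines, it does not matter whether the <li> items are separated by line breaks.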

npinti

score 2 · Answer 2 · answered Nov 11 '11 at 12:43

The problem is that you use a greedy quantifier (.*) when looking for the >, when it should be lazy (.*?). The greedy version runs on to the last > on the line that still allows a match. Here's an example:

<a .*?>(.*?)<\/a>
     ^

See it in action here: http://regexr.com?2v60h
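
For readers who cannot open the link, the difference can be reproduced with a small sketch. It assumes Python's re module rather than PHP's preg functions (the two treat these particular patterns the same way), and the sample line is adapted from the question. The \/ escape is dropped because Python patterns have no / delimiters.

```python
import re

# One line holding two list items, as the CMS sometimes emits them.
html = '  <li><a href="#">One</a></li><li><a href="#">Two</a></li>'

# Original pattern: the greedy .* swallows the first anchor entirely and
# only gives back enough to find the last usable ">" on the line.
greedy = re.findall(r'<a ?.*>(.*?)</a>', html)

# Lazy version: .*? stops at the first ">" that lets the match succeed.
lazy = re.findall(r'<a .*?>(.*?)</a>', html)

print(greedy)  # ['Two']
print(lazy)    # ['One', 'Two']
```

The greedy result mirrors the match array in the question, where the first match spans two anchors and captures only the second one's text.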

edited Nov 11 '11 at 12:48

answered Nov 11 '11 at 12:43

Marcus

score 1 · Accepted Answer · answered Nov 11 '11 at 12:47

Your regex is too 'greedy' on the opening tag. Something like this should work better:

<a\s?[^>]*>([^<]*)</a>

It matches the start of the anchor tag, an optional whitespace character, and then anything BUT the closing > of the tag, so it will definitely stop when it hits that >. The same trick applies to the anchor's content: look for anything BUT the < of the closing anchor tag.
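
A minimal sketch of this pattern in action, assuming Python's re module instead of PHP and using made-up sample markup. It also shows one consequence of the "anything BUT <" rule: an anchor whose content contains another tag is not matched at all.

```python
import re

pattern = re.compile(r'<a\s?[^>]*>([^<]*)</a>')

html = (
    '<ul>\n'
    '  <li><a href="#">One</a></li><li><a href="#">Two</a></li>\n'
    '  <li><a href="#">Three</a></li>\n'
    '</ul>'
)

# [^>]* cannot run past the end of the opening tag, so each anchor
# is matched separately even when several share a line.
texts = pattern.findall(html)
print(texts)  # ['One', 'Two', 'Three']

# Caveat: [^<]* refuses any "<" in the content, so nested markup
# inside the anchor prevents a match.
nested = pattern.findall('<a href="#"><b>Bold</b> link</a>')
print(nested)  # []
```

If the anchors may contain nested tags, the lazy (.*?) capture or an HTML parser is probably the safer choice.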

Oldskool

Thank you! I am a total regex noob. This works, but it selects the whole element as opposed to the contents, or the element and the contents in a multidimensional array like above. Any ideas? But thanks, I'm on the right track :) – Dominic Nov 11 '11 at 12:53
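
On the follow-up about getting the whole element instead of its contents: the full match and the parenthesised group are reported separately. In PHP's preg_match_all with its default ordering, index 0 of the result holds the full matches and index 1 holds the first capture group, as in the array shown in the question. The sketch below illustrates the same split, assuming Python's re module and made-up sample markup.

```python
import re

html = '<li><a href="#">One</a></li><li><a href="#">Two</a></li>'
pattern = re.compile(r'<a\s?[^>]*>([^<]*)</a>')

# group(0) is the entire matched element, group(1) is only what the
# parentheses captured (the anchor's text).
whole = [m.group(0) for m in pattern.finditer(html)]
inner = [m.group(1) for m in pattern.finditer(html)]

print(whole)  # ['<a href="#">One</a>', '<a href="#">Two</a>']
print(inner)  # ['One', 'Two']
```

So if only the contents are wanted, reading the second sub-array (the capture group) rather than the first should be enough.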
