I'm pulling my hair out over this one. I want to get the contents of all <a> tags. My HTML structure looks like:

<ul>
  <li><a href="#">One</a></li>
  <li><a href="#">Two</a></li>
  <li><a href="#">Three</a></li>
</ul>

And my regex:

/<a ?.*>(.*?)<\/a>/

The problem occurs when the CMS doesn't output the <li> elements with a line break between them:

<ul>
  <li><a href="#">One</a></li><li><a href="#">Two</a></li>
  <li><a href="#">Three</a></li>
</ul>

This is some example output of the match array:

Array
(
    [0] => Array
        (
            [0] => <a href="/schools/early-years-groups" class="active">Early Years Groups</a></li><li class="leaf first menu-mlid-20328 order_early_years_stuff"><a href="#" title="Order Schools Stuff">Order Early Years Stuff</a>
            [1] => <a href="/schools/early-years-groups/fundraise" title="Fundraise">Fundraise</a>
            [2] => <a href="/schools/early-years-groups/ey-showcase" title="Early Years Showcase">Early Years Showcase</a>
            [3] => <a href="/schools/how-to-pay-your-money-in" title="">How To Pay Your Money In</a>
            [4] => <a href="/schools/early-years-groups/learning-activities" title="Learning Activities">Learning Activities</a>
        )

    [1] => Array
        (
            [0] => Order Early Years Stuff
            [1] => Fundraise
            [2] => Early Years Showcase
            [3] => How To Pay Your Money In
            [4] => Learning Activities
        )

)

Thanks very much for any help, this is driving me nuts!

Dominic
  • Ah, the never-ending [stream of confusion](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... – Kerrek SB Nov 11 '11 at 12:39
  • Do you need to use regex for this task? PHP has a few HTML parsers at its disposal that are better suited for this. – Jens Nov 11 '11 at 12:41
  • @KerrekSB [oh, the irony…](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) – Gordon Nov 11 '11 at 13:52
  • possible duplicate of http://stackoverflow.com/questions/3946506/crawling-a-html-page-using-php/3955436#3955436 – Gordon Nov 11 '11 at 13:55

3 Answers

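You should not use a regular expression to parse HTML: nested tags, attribute values that contain >, and inconsistent whitespace all break simple patterns. You will find plenty of questions on this site explaining why.

A DOM parser such as PHP Simple HTML DOM Parser should do the trick for you.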
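
As a rough illustration of the parser approach, here is a minimal sketch. Note that it swaps in PHP's built-in DOMDocument rather than the library named above, and it assumes the markup from the question; the variable names are only for illustration.

```php
<?php
// Sample markup copied from the question (list items without line breaks).
$html = '<ul>'
      . '<li><a href="#">One</a></li><li><a href="#">Two</a></li>'
      . '<li><a href="#">Three</a></li>'
      . '</ul>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world CMS output is rarely perfectly valid HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$texts = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // textContent is the text inside the <a>, including text of nested tags.
    $texts[] = $anchor->textContent;
}

print_r($texts);
```

With this input, $texts should end up holding One, Two and Three, regardless of where the line breaks fall.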

npinti

The problem is that the .* before the > is greedy when it should be lazy, so it runs past the first > to the last one that still allows a match. Here's an example:

<a .*?>(.*?)<\/a>
     ^

See it in action here: http://regexr.com?2v60h
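
To see the difference in PHP, the sketch below runs both patterns over a line with two anchors on it. It assumes preg_match_all with its default flags; the sample string and variable names are illustrative.

```php
<?php
// Two anchors on one line, as in the question's problem case.
$html = '<li><a href="#">One</a></li><li><a href="#">Two</a></li>';

// Original pattern: the greedy .* grabs as much as it can, so the match
// starts at the first <a and only ends at the second </a>.
preg_match_all('/<a ?.*>(.*?)<\/a>/', $html, $greedy);

// Lazy variant from this answer: .*? stops at the first > after "<a ".
preg_match_all('/<a .*?>(.*?)<\/a>/', $html, $lazy);

print_r($greedy[1]); // a single capture: "Two"
print_r($lazy[1]);   // two captures: "One" and "Two"
```

The greedy result mirrors the first entry of the match array in the question, where one match swallowed two links and only the second link's text was captured.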

Marcus

Your regex is too 'greedy' on the opening tag. Something like this should work better:

<a\s?[^>]*>([^<]*)</a>

It matches the start of the anchor tag, optional whitespace, and then anything BUT the closing > of the tag, so it is guaranteed to stop when it hits that >. The same trick applies to the anchor's content: match anything BUT the < that starts the closing anchor tag. One consequence is that an anchor with nested tags, such as <a href="#"><b>One</b></a>, will not match.
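
A minimal sketch of how this pattern might be used with preg_match_all, assuming the markup from the question. The # delimiter is a choice made here so that the / in </a> needs no escaping; the variable names are illustrative.

```php
<?php
// Markup from the question, with two list items sharing a line.
$html = '<ul>'
      . '<li><a href="#">One</a></li><li><a href="#">Two</a></li>'
      . '<li><a href="#">Three</a></li>'
      . '</ul>';

// # is used as the delimiter so the / in </a> needs no escaping.
preg_match_all('#<a\s?[^>]*>([^<]*)</a>#', $html, $matches);

// $matches[0] holds the full <a>...</a> elements;
// $matches[1] holds only the first capture group, i.e. the link text.
$linkTexts = $matches[1];

print_r($linkTexts);
```

If only the contents are wanted, reading $matches[1] and ignoring $matches[0] should be enough.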

Oldskool
  • Thank you! I am a total regex noob. This works, but it selects the whole element rather than just the contents, or the element and the contents in a multidimensional array like above. Any ideas? But thanks, I'm on the right track :) – Dominic Nov 11 '11 at 12:53