
The following regular expression extracts all hrefs from a page with 'preg_match_all':

/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims

If there is a 'rel' attribute in the 'a' tag, I would like to return that with the result. How do I modify the regular expression above to include the 'rel' attribute (if present)?

UPDATE: The following input:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi 
ut aliquip ex ea commodo consequat. <a href="http://example.com" rel="nofollow">Duis</a>
nirure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui
officia deserunt mollit anim id est laborum.

returns:

Array
(
    [0] => Array
        (
            [0] =>  href="http://example.com" 
        )

    [1] => Array
        (
            [0] => http://example.com
        )

)

I would like it to return:

Array
(
    [0] => Array
        (
            [0] =>  href="http://example.com" rel="nofollow"
        )

    [1] => Array
        (
            [0] => http://example.com
        )

)
Oht

2 Answers

\s+href\s*=\s*[\"\']?(([^\s\"\']+)[\"\'\s]+rel="[^"]*")|\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+

You can use this. It also returns the rel attribute if it is present.

See demo.

http://regex101.com/r/jT3pG3/4
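As a rough sketch of how the result of this pattern could be consumed: the sample `$html`, the `~` delimiters, and the `$urls` variable are assumptions for illustration, not part of the answer. The URL ends up in group 2 when the first alternative matches and in group 3 otherwise, so both have to be checked. Note that the first alternative only applies when rel is double-quoted and follows href directly.

```php
<?php
// Hypothetical sample input: one link with rel right after href, one without rel.
$html = '<p><a href="http://example.com" rel="nofollow">Duis</a> '
      . '<a href="http://example.org">aute</a></p>';

// Pattern from the answer, wrapped in ~ delimiters with the question's modifiers.
$pattern = '~\s+href\s*=\s*[\"\']?(([^\s\"\']+)[\"\'\s]+rel="[^"]*")'
         . '|\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+~ims';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$urls = [];
foreach ($matches as $m) {
    // Group 3 is filled when the second alternative matched (no rel);
    // otherwise the URL is in group 2.
    $urls[] = ($m[3] ?? '') !== '' ? $m[3] : $m[2];
}

print_r($matches);
print_r($urls);
```

With PREG_SET_ORDER, `$m[0]` holds the full match including the rel attribute when it was found, which is the shape asked for in the question.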

vks

You can optionally capture it using a lookahead:

$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~';

Add the i (PCRE_CASELESS) modifier after the closing delimiter ~ to match case-insensitively.

See further explanation and an example on regex101 and in the SO Regex FAQ.

When using preg_match_all, you may want to add the PREG_SET_ORDER flag:

preg_match_all($regex, $str, $out, PREG_SET_ORDER);
print_r($out);

Which gives a result like this:

Array
(
    [0] => Array
        (
            [0] => http://example.com
            [1] => nofollow
        )

    [1] => Array
        (
            [0] => http://example2.com
            [1] => nofollow
        )

)

See the test at eval.in.
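For completeness, here is a self-contained sketch of the lookahead pattern applied to links with and without rel. The sample `$str`, the third URL, and the `$links` array are made up for illustration. When rel is missing, the optional group does not participate and PHP drops the trailing unmatched group from the match array, so index 1 should be read defensively.

```php
<?php
// Lookahead pattern from above, with the i modifier added.
$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~i';

// Hypothetical input: rel after href, rel before href, and no rel at all.
$str = '<a href="http://example.com" rel="nofollow">Duis</a> '
     . '<a rel="noopener" href="http://example2.com">aute</a> '
     . '<a href="http://example3.com">irure</a>';

preg_match_all($regex, $str, $out, PREG_SET_ORDER);

$links = [];
foreach ($out as $m) {
    $links[] = [
        'href' => $m[0],          // text matched after \K
        'rel'  => $m[1] ?? null,  // index 1 is absent when there is no rel
    ];
}

print_r($links);
```

Because the lookahead scans the whole tag before href is matched, the order of the two attributes does not matter here.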

As others mentioned, regex is not the ideal tool for parsing HTML. Whether it is good enough depends on what you want to achieve and what the input looks like, in particular whether it is your own input and you know what to expect.
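If the input is not under your control, a parser-based approach may be more robust. The following is only a sketch of that alternative using PHP's DOMDocument, which is not what the regex above does; it assumes the dom extension is available, and the sample `$html` and the `$links` array are hypothetical.

```php
<?php
// Hypothetical sample input.
$html = '<p><a href="http://example.com" rel="nofollow">Duis</a> '
      . '<a href="http://example.org">aute</a></p>';

// Collect libxml warnings internally instead of emitting them for sloppy markup.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if (!$a->hasAttribute('href')) {
        continue; // skip anchors without a target
    }
    $links[] = [
        'href' => $a->getAttribute('href'),
        'rel'  => $a->hasAttribute('rel') ? $a->getAttribute('rel') : null,
    ];
}

print_r($links);
```

This handles attribute order and quoting style without any changes to the code, at the cost of parsing the whole document.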

Community
Jonny 5