Extracting rel with href

Question

The following regular expression extracts all hrefs from a page with 'preg_match_all':

/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims

If the 'a' tag has a 'rel' attribute, I would like to return it with the result. How do I modify the expression above to include the 'rel' attribute (if present)?

UPDATE: The following input:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut 
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi 
ut aliquip ex ea commodo consequat. <a href="http://example.com" rel="nofollow">Duis</a>
nirure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui
officia deserunt mollit anim id est laborum.

returns:

Array
(
    [0] => Array
        (
            [0] =>  href="http://example.com" 
        )

    [1] => Array
        (
            [0] => http://example.com
        )

)

I would like it to return:

Array
(
    [0] => Array
        (
            [0] =>  href="http://example.com" rel="nofollow"
        )

    [1] => Array
        (
            [0] => http://example.com
        )

)

You can also use an HTML parser such as `DOMDocument` as an alternative. — Kevin, Sep 15 '14 at 09:09

score 1 · Answer 1 · answered Sep 15 '14 at 09:21

\s+href\s*=\s*[\"\']?(([^\s\"\']+)[\"\'\s]+rel="[^"]*")|\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+

You can use this. It also matches the rel attribute when it directly follows href.

See demo.

http://regex101.com/r/jT3pG3/4
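As a rough sketch of how this pattern might be used from PHP: the `~` delimiters, the `ims` modifiers (taken from the question's original expression) and the second, rel-less link in the sample input are assumptions added for illustration. The `$links` array is only a hypothetical way to normalize the result, since the URL lands in group 2 or group 3 depending on which alternative matched.

```php
<?php
// Pattern from the answer, wrapped in ~ delimiters (an assumption) with the
// i, m and s modifiers used in the question.
$pattern = '~\s+href\s*=\s*[\"\']?(([^\s\"\']+)[\"\'\s]+rel="[^"]*")|\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+~ims';

// Hypothetical input: the second link has no rel attribute.
$html = '<a href="http://example.com" rel="nofollow">Duis</a> '
      . '<a href="http://example.org">aute</a>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // Group 2 holds the URL when rel follows href; group 3 holds it otherwise.
    $url = ($m[2] ?? '') !== '' ? $m[2] : ($m[3] ?? '');
    $links[] = ['match' => trim($m[0]), 'href' => $url];
}

print_r($links);
```

Note that the first alternative only applies when rel comes directly after href; with any other attribute order, the second alternative matches and only the href is returned.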

vks

@ThomasdeRoo what is to be escaped? – vks Sep 15 '14 at 09:32

score 0 · Accepted Answer · edited May 23 '17 at 11:57

You can optionally capture it using a lookahead:

$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~';

Add the i (PCRE_CASELESS) modifier after the closing delimiter ~ to match case-insensitively.

See further explanation and an example on regex101 and in the SO Regex FAQ.

When using preg_match_all, you may want to add the PREG_SET_ORDER flag, which groups the results per match:

preg_match_all($regex, $str, $out, PREG_SET_ORDER);
print_r($out);

Which gives a result like this:

Array
(
    [0] => Array
        (
            [0] => http://example.com
            [1] => nofollow
        )

    [1] => Array
        (
            [0] => http://example2.com
            [1] => nofollow
        )

)

See test at eval.in
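For reference, a self-contained sketch of the same approach, including a link without a rel attribute to show that the capture really is optional. The input string and the `$links` normalization are hypothetical and not part of the original answer; the `i` modifier is the optional one mentioned above.

```php
<?php
// Pattern copied from the answer, with the optional i modifier appended.
$regex = '~<a\b(?=(?>[^>]*rel\s*=\s*["\']([^"\']+))?)[^>]*href=\s*["\']\s*\K[^"\']+~i';

// Hypothetical input: the second link has no rel attribute.
$str = '<a href="http://example.com" rel="nofollow">Duis</a> '
     . '<a href="http://example2.com">aute</a>';

preg_match_all($regex, $str, $out, PREG_SET_ORDER);

$links = [];
foreach ($out as $m) {
    $links[] = [
        // Because of \K, the whole match is just the href value.
        'href' => $m[0],
        // Group 1 is absent or empty when the tag has no rel attribute.
        'rel'  => ($m[1] ?? '') !== '' ? $m[1] : null,
    ];
}

print_r($links);
```

The `?? ''` guard is there because the rel group does not participate in the match for links without that attribute, so the code does not rely on index 1 being present.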

As others mentioned, regex is not the ideal tool for parsing HTML. Whether it is good enough depends on what you want to achieve and on the input: if it is your own markup and you know what to expect, a regex can be sufficient.
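If a parser is preferred, the `DOMDocument` route suggested in the comments could look roughly like the following. This is a minimal sketch, assuming the DOM extension is available; the sample markup and the shape of the `$links` array are made up for illustration.

```php
<?php
// Hypothetical input with one link that has rel and one that does not.
$html = '<p>Ut enim <a href="http://example.com" rel="nofollow">Duis</a> and '
      . '<a href="http://example2.com">aute</a>.</p>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting them for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        // getAttribute() returns an empty string for a missing attribute,
        // so hasAttribute() is used to tell "absent" apart from "empty".
        'rel'  => $a->hasAttribute('rel') ? $a->getAttribute('rel') : null,
    ];
}

print_r($links);
```

Unlike the regex variants, this does not depend on attribute order or quoting style, at the cost of parsing the whole document.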

@ThomasdeRoo of course, it was the regex only, see update and example at eval.in — Jonny 5, Sep 15 '14 at 09:33
