
I'm having trouble putting together the proper regex pattern to add target="_blank" to my links. Adding it to all links is no problem, but I need to exclude certain links based on a pattern.

This is the preg_replace() call I'm using to add the target to ALL links whose href starts with http://

preg_replace('/(<a href="http:[^"]+")>/is','\\1 target="_blank">',$content);

Here are scenarios I'm trying to get

link1 /somepage.htm (no target="_blank") Above works
link2 http://www.somesiteexternal.com/ (add target="_blank") Above works
link3 http://www.example.com/somepage.htm (no target="_blank") this is where I'm having a problem.

I want to exclude http://www.example.com and http://example.com (the domain where the code lives) from the target handling, but if the link is an absolute link to any other, external site that is NOT on example.com, then I want the target added.

Adding the exclude/exception pattern to (<a href="http:[^"]+") is what's giving me trouble.

Thanks! hanji

Andy Lester
hanji

2 Answers


Here is a way to get what you want using DOM manipulation.

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

$linkNodeList = $xpath->query('//a[starts-with(@href, "http://")]');

foreach ($linkNodeList as $linkNode) {
    // setAttribute() creates the attribute if it is missing and overwrites it otherwise.
    // Appending one shared attribute node to every link would move that node each time,
    // leaving the target on the last link only.
    $linkNode->setAttribute('target', '_blank');
}

$html = $dom->saveHTML();

Note: the LIBXML_... constants are sometimes not defined; in that case you can work around the problem by adding this beforehand:

if (!defined('LIBXML_HTML_NOIMPLIED'))
  define('LIBXML_HTML_NOIMPLIED', 8192);
if (!defined('LIBXML_HTML_NODEFDTD'))
  define('LIBXML_HTML_NODEFDTD', 4);

If you want to exclude a specific domain, you can use parse_url and add a condition in the foreach loop (that is the easiest way):

$forbidden_host = 'example.com';

foreach ($linkNodeList as $linkNode) {
    // parse_url() can return null or false, so cast to string before normalizing.
    $host = (string) parse_url($linkNode->getAttribute('href'), PHP_URL_HOST);
    // Lowercase the host and strip a leading "www." before comparing.
    $host = preg_replace('~\Awww\.~', '', strtolower($host));
    if ($host === $forbidden_host)
        continue;

    $linkNode->setAttribute('target', '_blank');
}

or you can insert a condition in the xpath query:

$query = '//a[starts-with(@href, "http://") and not(starts-with(@href, "http://www.example.com") or starts-with(@href, "http://example.com"))]';
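Putting the pieces together, the XPath variant could be wrapped up roughly like this. It is only a sketch: the function name addBlankTargetToExternalLinks and the $ownHost parameter are made up for illustration, and it assumes the HTML fragment has a single root element (with LIBXML_HTML_NOIMPLIED, libxml can rearrange fragments that have several top-level elements).

```php
<?php
// Hypothetical helper: the name and the $ownHost parameter are illustrative only.
function addBlankTargetToExternalLinks(string $html, string $ownHost): string
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select http:// links whose href does not start with the site's own host
    // (with or without "www."). $ownHost is assumed to be a plain host name
    // without quotes, since it is pasted into the XPath expression as-is.
    $query = sprintf(
        '//a[starts-with(@href, "http://")'
        . ' and not(starts-with(@href, "http://www.%1$s")'
        . ' or starts-with(@href, "http://%1$s"))]',
        $ownHost
    );

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query($query) as $link) {
        // Creates the attribute when missing, overwrites it otherwise.
        $link->setAttribute('target', '_blank');
    }

    return $dom->saveHTML();
}
```

You would call it on the content pulled from the database, e.g. $content = addBlankTargetToExternalLinks($content, 'example.com');. Note that a prefix test like starts-with also treats a host such as example.com.other.org as internal; the parse_url comparison above does not have that problem.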
Casimir et Hippolyte
  • This doesn't have the exclude handling issue. Also, this is content pulled from the DB.. I don't think I can use DOM at this point? – hanji Jan 28 '15 at 21:09
  • @hanji: you can, it's not a problem, you only need to have the html in a php variable. `$html` here. – Casimir et Hippolyte Jan 28 '15 at 21:10
  • Gotcha. I just read up on it a little. How do we handle the 'exclude' condition in the query? – hanji Jan 28 '15 at 21:15
  • First off.. Thanks for the help! The target="_blank" is only added to the last anchor tag for some reason. Also, using this method, my CMS (coming from a database) has completed HTML tags wrapping the content. This was what I was worried about. So it's a complete HTML document nested in my HTML document. Looking to see what my options are in the manual right now. – hanji Jan 28 '15 at 21:35
  • @hanji You can prevent this behaviour using the php constants LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD – Casimir et Hippolyte Jan 28 '15 at 21:40

Since this seems to be your own html, modify your regex:

/(<a href="http:[^"]+")>/is
                | add here a negative lookahead: (?!\/\/(?:www\.)?example\.com)

So it becomes:

/(<a href="http:(?!\/\/(?:www\.)?example\.com)[^"]+")>/is
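As a quick check, the modified pattern can be dropped into the preg_replace() call from the question. The sample markup below is made up to cover the three scenarios listed there; the variable names are illustrative.

```php
<?php
// Hypothetical sample markup covering the three scenarios from the question.
$content = '<a href="/somepage.htm">link1</a>'
    . '<a href="http://www.somesiteexternal.com/">link2</a>'
    . '<a href="http://www.example.com/somepage.htm">link3</a>';

// Same pattern as above: the negative lookahead rejects hrefs that continue
// with //example.com or //www.example.com right after "http:".
$pattern = '/(<a href="http:(?!\/\/(?:www\.)?example\.com)[^"]+")>/is';

// preg_replace() returns the new string, so the result has to be assigned.
$result = preg_replace($pattern, '\\1 target="_blank">', $content);
```

Only link2 should pick up the target here. Keep in mind that the pattern only matches anchors where href is the first and only attribute, and that the lookahead is a prefix test, so a host like example.com.other.org would be skipped as well.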

Test at regex101.com; Regex FAQ; Regex to parse HTML

Community
Jonny 5