Some internal links in my site content are missing a trailing "/", and this is causing crawling issues. I want to search for these links and replace them, so that https://www.example.com/slug becomes https://www.example.com/slug/. I am using the following function, which takes the entire content of a page and is supposed to fix every affected link in it:

function str_replace_links($subject, &$count) {
    //match the first part of the link https://www.example.com{/slug}
    $regex = '/(https:\/\/www.example.com)(\/[a-zA-Z_0-9\-]*)*';
    //check for the trailing '/' or if it is a file
    $regex .= '([^(\/|\.js|\.css|\.xml|\.less|\.png|\.jpg|\.jpeg|\.gif|\.pdf|\.doc|\.txt|\.ico|\.rss|\.zip|\.mp3|\.rar|\.exe|\.wmv|\.doc|\.avi|\.ppt|\.mpg|\.mpeg|\.tif|\.wav|\.mov|\.psd|\.ai|\.xls|\.mp4|\.m4a|\.swf|\.dat|\.dmg|\.iso|\.flv|\.torrent|\.ttf|\.woff|\.svg|\.eot|\.woff2)])';
    //finish off regex
    $regex .= '/i';
    $i = 0; // counter for # changed
    $content = preg_replace($regex, '$1$2/', $subject, 1, $i);
    $count += $i;
    return $content;
}

I tested it with a string containing a few links:

$string ='
<a href="https://www.example.com/slug1/page">1</a><br/>
<a href="https://www.example.com/slug2/page">2</a><br/>
<a href="https://www.example.com/slug1/page/">3</a><br/>
<a href="https://www.example.com/slug2/page/">4</a><br/>
<a href="https://www.example.com/">5</a><br/>
<a href="https://www.example.com">5b</a><br/>
<a href="https://www.example.com/style.css">6</a><br/>
<a href="https://www.example.com/style.jpg">7</a><br/>
<a href="https://www.example.com/style.png">8</a><br/>
<a href="https://www.example.com/style.pdf">9</a><br/>
';

echo str_replace_links($string, $switch);

However, the output is wrong:

<a href="https://www.example.com/page/>1</a><br/>
<a href="https://www.example.com/page/>2</a><br/>
<a href="https://www.example.com//>3</a><br/>
<a href="https://www.example.com//>4</a><br/>
<a href="https://www.example.com//>5</a><br/>
<a href="https://www.example.com/>5b</a><br/>
<a href="https://www.example.com/st/le.css">6</a><br/>
<a href="https://www.example.com/st/le.jpg">7</a><br/>
<a href="https://www.example.com/st/le.png">8</a><br/>
<a href="https://www.example.com/st/le.pdf">9</a><br/>

Any help with the regex would be appreciated.

jppower175

1 Answer

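You can do it with a tweaked URL validator. The pattern only matches a double-quoted URL whose last path segment contains no dot (so file names are skipped) and which does not already end in a slash. The whole URL is captured in group 1.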

~(?i)(?<=")((?!mailto:)(?:[a-z]*:\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{a1}-\x{ffff}]{2,})))|localhost)(:\d{2,5})?(?:\/(?:[^\s/]*/)*[^\s/.]+)?)(?=")~

https://regex101.com/r/GcT8ZU/1
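
A minimal sketch of how the pattern above could be applied from PHP. Two things here are assumptions rather than part of the answer: the replacement string `$1/` (the answer does not spell it out), and the added `u` modifier, which PCRE needs in PHP because the `\x{a1}-\x{ffff}` ranges go beyond single-byte values. The sample HTML is a shortened version of the question's test string.

```php
<?php
// The answer's pattern, copied verbatim, with the "u" modifier appended (assumption).
$pattern = '~(?i)(?<=")((?!mailto:)(?:[a-z]*:\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{a1}-\x{ffff}]{2,})))|localhost)(:\d{2,5})?(?:\/(?:[^\s/]*/)*[^\s/.]+)?)(?=")~u';

$html = '
<a href="https://www.example.com/slug1/page">1</a><br/>
<a href="https://www.example.com/slug1/page/">3</a><br/>
<a href="https://www.example.com/">5</a><br/>
<a href="https://www.example.com">5b</a><br/>
<a href="https://www.example.com/style.css">6</a><br/>
';

// Group 1 is the whole URL, so '$1/' re-emits it with a slash appended.
// A limit of -1 means "replace every match"; $count receives the number of replacements.
$fixed = preg_replace($pattern, '$1/', $html, -1, $count);

if ($fixed === null) {
    // preg_replace() returns null when the pattern fails to compile or run.
    $fixed = $html;
    $count = 0;
}

echo $fixed;
echo "Links changed: $count\n";
```

Because the lookbehind and lookahead both require a double quote, URLs in single-quoted attributes or in plain text are left alone.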

Formatted

 (?i)

 (?<= " )
 (                             # (1 start)
      (?! mailto: )
      (?: [a-z]* :\/\/ )?
      (?:
           \S+ 
           (?: : \S* )?
           @
      )?
      (?:
           (?:
                (?:
                     [1-9] \d? 
                  |  1 \d\d 
                  |  2 [01] \d 
                  |  22 [0-3] 
                )
                (?:
                     \.
                     (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
                ){2}
                (?:
                     \.
                     (?:
                          [1-9] \d? 
                       |  1 \d\d 
                       |  2 [0-4] \d 
                       |  25 [0-4] 
                     )
                )
             |  (?:
                     (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
                     [a-z\x{a1}-\x{ffff}0-9]+ 
                )
                (?:
                     \.
                     (?: [a-z\x{a1}-\x{ffff}0-9]+ -? )*
                     [a-z\x{a1}-\x{ffff}0-9]+ 
                )*
                (?:
                     \.
                     (?: [a-z\x{a1}-\x{ffff}]{2,} )
                )
           )
        |  localhost
      )
      ( : \d{2,5} )?                # (2)
      (?:
           \/
           (?: [^\s/]* / )*
           [^\s/.]+ 
      )?
 )                             # (1 end)
 (?= " )
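
If only the question's own host needs fixing, the same idea (quote lookbehind, a final path segment without a dot, quote lookahead) can be reduced to a much shorter pattern. The following is a sketch under that assumption, not the answer's regex: the function name `add_trailing_slashes` is hypothetical, the host is hard-coded, and the final segment also excludes `?` and `#` so that URLs with a query string or fragment are left untouched.

```php
<?php
// Hypothetical helper (name and signature are illustrative, not from the post).
// Appends "/" to double-quoted https://www.example.com URLs that neither end
// in a slash nor point at a file (last segment containing a dot).
function add_trailing_slashes(string $html, ?int &$count = null): string
{
    $pattern = '~(?<=")'                    // URL must start right after a double quote
             . '(https://www\.example\.com' // fixed host (assumption: a single known site)
             . '(?:/(?:[^\s"/]*/)*'         // optional path: intermediate segments
             . '[^\s"/.?#]+)?)'             // last segment: no dot, slash, query, or fragment
             . '(?=")~i';                   // URL must end right before a double quote

    $result = preg_replace($pattern, '$1/', $html, -1, $replaced);
    if ($result === null) {
        return $html; // regex error: leave the content unchanged
    }

    $count = ($count ?? 0) + $replaced; // accumulate across calls, like the original function
    return $result;
}

$sample = '<a href="https://www.example.com/slug1/page">1</a>'
        . '<a href="https://www.example.com/slug1/page/">3</a>'
        . '<a href="https://www.example.com">5b</a>'
        . '<a href="https://www.example.com/style.css">6</a>';

$changed = 0;
echo add_trailing_slashes($sample, $changed), "\n";
echo "Links changed: $changed\n";
```

Excluding `"` from every character class keeps a match from running past the closing quote of the attribute, which is what swallowed the quote in the question's output.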