
I have a lot of raw content in a database.

Some of it contains strings such as http://www.example.com/subfolder/name.pdf or /subfolder/name.pdf

I need a pattern replace that turns these into /wp-content/uploads/old/subfolder/name.pdf. There can be many levels of subfolders, e.g. /subfolder1/subfolder2/subfolder3/file.pdf

The patterns I use for finding them are

/http[^\s]+pdf/
/href="\/[^\s]+pdf/

But how do I replace the matched pattern with another pattern (as in the example above)?

So far I have

search for /http:\/\/www.example.com(.*).pdf"/
replace with /wp-content/uploads/old$1.pdf"

search for /href="\/pdf(.*)\.pdf">/

This works fine until there is more than one PDF link in one table cell.

Example:

<a href="/pdf/subdir/name.pdf">clickhere</a><a href="/pdf/subdir/name.pdf">2nd PDF</a>

vico
  • What have you tried? What are the conditions for the match? Are there any exceptions? Please include examples for both – Mariano Sep 16 '15 at 17:28
  • Are you aware your regex matches "`xxxhttpxxxpdfxxxx.html`"? – Mariano Sep 16 '15 at 17:35
  • Which database do you use? Regex replacement functions are available in oracle and by [some user defined functions](https://github.com/hholzgra/mysql-udf-regexp) in mysql. A `preg_replace` code for this would be `$out = preg_replace('&^(http://www.example.com/)(.*[.]pdf)$&', '$1wp-content/uploads/$2', $in);` and the other likewise (if the URL is fixed; replace it by a pattern like `[^/]+` if not) – syck Sep 16 '15 at 17:36
  • updated with what I have – vico Sep 16 '15 at 17:38

2 Answers


this works fine until there are more than 1 pdf links in one table cell

Quantifiers are greedy by default: the regex engine consumes as much as it can while attempting a match. To reverse this behaviour, you can use a lazy quantifier, as explained in this post: Greedy vs. Reluctant vs. Possessive Quantifiers. Adding an extra ? after a quantifier makes it match as little as possible. To make your greedy construct lazy, use [^\s]+?.
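To illustrate the difference, here is a minimal sketch that runs the question's href pattern in its greedy and lazy forms on a cell with two links. The replacement string and the second file name (other.pdf) are assumptions made for this example; the question does not show a replacement for that pattern.

```php
<?php
// Two links in one table cell, modelled on the question's example.
// The second file name is hypothetical, chosen so the two links differ.
$cell = '<a href="/pdf/subdir/name.pdf">clickhere</a><a href="/pdf/subdir/other.pdf">2nd PDF</a>';

// Assumed replacement: keep the /pdf folder under the new base path.
$replacement = 'href="/wp-content/uploads/old/pdf$1.pdf">';

// Greedy: (.*) runs to the last .pdf"> in the string, so the whole cell
// is one match and only the first href ends up rewritten.
$greedy = preg_replace('@href="/pdf(.*)\.pdf">@', $replacement, $cell);

// Lazy: (.*?) stops at the first .pdf"> after each href, so each link
// is matched and rewritten on its own.
$lazy = preg_replace('@href="/pdf(.*?)\.pdf">@', $replacement, $cell);

echo $greedy, PHP_EOL;
echo $lazy, PHP_EOL;
```

With the greedy version the second link still points at /pdf/subdir/other.pdf, because it was swallowed into the capture group of the first match.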

some containing string http://www.example.com/subfolder/name.pdf or /subfolder/name.pdf

But how to replace the pattern with another pattern?

As you can see, "http://www.example.com" is optional. You can make a part of your pattern optional with a (?:group) and a ? quantifier.

Pattern with an optional group:

(?:http://www\.example\.com)?/(\S+?)\.pdf
  • Don't forget to escape the dots, as they have a special meaning in regex.
  • Notice I used \S (capital "S") instead of [^\s]; the two are equivalent.
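As a quick check of the optional group, the following sketch applies the pattern above to the two URL shapes from the question and collects what ends up in capture group 1. The variable names are only illustrative.

```php
<?php
// The host sits in a non-capturing group that the trailing "?" makes optional.
$re = '@(?:http://www\.example\.com)?/(\S+?)\.pdf@';

$inputs = [
    'http://www.example.com/subfolder/name.pdf',
    '/subfolder1/subfolder2/subfolder3/file.pdf',
];

$paths = [];
foreach ($inputs as $input) {
    // $m[1] holds the path without the leading slash and without ".pdf".
    if (preg_match($re, $input, $m)) {
        $paths[] = $m[1];
    }
}

print_r($paths);
```

Both inputs yield a relative path in group 1, which is why a single replacement string can serve the absolute and the root-relative form.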


One more thing: consider adding boundaries to your pattern. I suggest using (?<!\w) (not preceded by a word character) and \b (a word boundary) to avoid matching part of another word, as I commented on your question.
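The effect of the two boundaries can be seen by comparing the pattern with and without them. This is only a sketch; the sample strings are made up for the comparison.

```php
<?php
$loose  = '@(?:http://www\.example\.com)?/(\S+?)\.pdf@';
$strict = '@(?<!\w)(?:http://www\.example\.com)?/(\S+?)\.pdf\b@';

// Hypothetical sample strings.
$samples = [
    '/subfolder/name.pdf',  // a plain link
    'docs/manual.pdf',      // the "/" is preceded by a word character
    '/files/report.pdfs',   // ".pdf" is followed by a word character
];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        'loose'  => preg_match($loose, $s) === 1,
        'strict' => preg_match($strict, $s) === 1,
    ];
}

var_export($results);
```

The pattern without boundaries matches all three samples, while the bounded one accepts only the first.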

Regex:

(?<!\w)(?:http://www\.example\.com)?/(\S+?)\.pdf\b

Code:

// Single-quoted strings keep backslashes literal, so no double escaping is needed.
$re = '@(?<!\w)(?:http://www\.example\.com)?/(\S+?)\.pdf\b@i';
$str = 'some containing string http://www.example.com/subfolder/name.pdf
        or /subfolder/name.pdf
        <a href="/pdf/subdir/name.pdf">clickhere</a>
        <a href="/pdf/subdir/name.pdf">2nd PDF</a>';
// $1 is the captured path, without the leading slash or the extension.
$subst = '/wp-content/uploads/old/$1.pdf';

$result = preg_replace($re, $subst, $str);
echo $result;

Test in regex101

Mariano

A sandbox example here: http://sandbox.onlinephpfunctions.com/code/cc47b98d16981b786cf2d573751b6a09a9725b90

$array = [
     "https://test.com/url/subfolder/subfolder/file.pdf",
     "https://test.com/url/subfolder1/subfolder/file.pdf",
     "/url/subfolder3/subfolder3/files.xml",
     "/url/subfolder/subfolder/file.pdf"
];

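// Strips an optional scheme and host from each URL and prepends $prepend.
// Note: every entry is rewritten, including non-PDF URLs.
function setwpUrl($urls, $prepend) {
    for ($i = 0; $i < count($urls); $i++) {
        // Group 1 optionally matches scheme and host; group 2 captures the rest.
        preg_match("/(https?:\/\/[a-zA-Z0-9\.\-]+)?(.*)/", $urls[$i], $out);
        $urls[$i] = $prepend . $out[2];
    }
    return $urls;
}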

$newUrls = setwpUrl($array, "/wp-content/uploads/old");

var_dump($newUrls);
Mark