A preg_replace puzzle: replacing zero or more of a char at the end of the subject

Question

Say $d is a directory path and I want to ensure that it starts and ends with exactly one slash (/). It may initially have zero, one or more leading and/or trailing slashes.

I tried:

preg_replace('%^/*|/*$%', '/', $d);

which works for the leading slash but, to my surprise, yields two trailing slashes if $d has at least one trailing slash. If the subject is, e.g., 'foo///', then preg_replace() first matches the three trailing slashes and replaces them with one slash, and then it matches zero slashes at the end and replaces that with a slash too. (You can verify this by replacing the second argument with '[$0]'.) I find this rather counterintuitive.
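The `[$0]` check mentioned above can be sketched as follows. The subject 'foo///' is the one from the question; the variable names are only illustrative.

```php
<?php
// Wrap every match in brackets so that zero-length matches become visible.
$subject = 'foo///';
$marked = preg_replace('%^/*|/*$%', '[$0]', $subject);
echo $marked, "\n"; // []foo[///][]

// With the real replacement, each of the two matches at the end becomes a slash.
$result = preg_replace('%^/*|/*$%', '/', $subject);
echo $result, "\n"; // /foo//
```

The empty brackets at the very end are the extra zero-length match that produces the second trailing slash.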

While there are many other ways to solve the underlying problem (and I implemented one) this became a PCRE puzzle for me: what (scalar) pattern in a single preg_replace does this job?

ADDITIONAL QUESTION (edit)

Can anyone explain why this pattern matches the way it does at the end of the string but does not behave similarly at the start?

NikiC · Answer 1 · 2010-08-05T21:50:00.613

3

$path = '/' . trim($path, '/') . '/';

This first removes all slashes at beginning or end and then adds single ones again.

edited Aug 05 '10 at 21:50

answered Aug 05 '10 at 21:41

NikiC

this may be faster than a regex, but he explicitly asked for a regex and not another way to solve the problem. – Andreas Linden Aug 05 '10 at 21:53
While I agree that the question asked for a regex, PHP offers a better solution than regex in this case. Either way, the OP gets both regex answers and a good PHP-specific solution as well. The OP can indicate which is valid by accepting one answer over another. – TCCV Aug 05 '10 at 23:14
This is probably the best solution to the underlying problem. It's just that my surprise at the result of the `%^/*|/*$%` pattern spun off this PCRE puzzle independently of the original problem. This is about bettering regex skills. Think of it like sudoku. – Aug 06 '10 at 01:08

Andreas Linden · Answer 2 · 2010-08-05T22:30:22.560

1

it can be done in a single preg_replace

preg_replace('/^\/{2,}|\/{2,}$|^([^\/])|([^\/])$/', '\2/\1', $d);

edited Aug 05 '10 at 22:30

answered Aug 05 '10 at 21:45

Andreas Linden

Nice. How about enhancing readability by taking out all backslashes: `return preg_replace('!^/{2,}|/{2,}$|^([^/])|([^/])$!', '$2/$1', $d);`? – Peter Ajtai Aug 05 '10 at 22:13
yes ofc, but I'm simply used to perl which only allows slashes as delimiter – Andreas Linden Aug 05 '10 at 22:29
It took me some effort to understand this. It's similar to salathe's answer but uses `^([^/])` and `([^/])$` instead of assertions, adding the captured characters back. I admire the complexity. – Aug 06 '10 at 00:40

score 1 · Answer 3 · answered Aug 05 '10 at 22:00

preg_replace('%^/*(.*?)/*$%', '/\1/', $d)

John Kugelman

This one's nice too. `$1` instead of `\1` makes it easier to read imo: `preg_replace('%^/*(.*?)/*$%', '/$1/', $d);` – Peter Ajtai Aug 05 '10 at 22:15
Invert the thinking, i.e. capture what you want to keep rather than what you want to replace, and suddenly it's obvious. Excellent! – Aug 06 '10 at 00:44

score 1 · Accepted Answer · answered Aug 06 '10 at 03:09

Given a regex like /* that can legitimately match zero characters, the regex engine has to make sure that it never matches more than once in the same spot, or it would get stuck in an infinite loop. Thus, if it does consume zero characters, the engine jumps forward one position before attempting another match. As far as I know, that's the only situation in which the regex engine does anything on its own initiative.
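A minimal sketch of the zero-length case, using a made-up pattern and subject (`x*` against 'abc', neither of which is from the question): the pattern can only match the empty string, yet the replacement terminates because the engine moves on after each empty match.

```php
<?php
// x* matches the empty string at every position of a subject without an "x".
// The engine advances after each empty match, so we get one match per position.
$noX = preg_replace('%x*%', '-', 'abc');
echo $noX, "\n"; // -a-b-c-
```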

What you're seeing is the opposite situation: the regex consumes one or more characters, then on the next go-round it tries to start matching at the spot where it left off. Never mind that this particular regex can't match anything but the one character, and it already matched as many of those as it could; it still has the option of matching nothing, so that's what it does.

So, why doesn't your regex match twice at the beginning, like it does at the end? Because of the start anchor (^). If the subject starts with one or more slashes, it consumes them and then tries to match zero slashes, but it fails because it's not at the beginning of the string any more. And if there are no slashes at the beginning, the engine's forced bump-along has the same effect.

At the end of the subject it's a different story. If there are no slashes there, it matches nothing, tries to bump along and fails; end of story. But if it does match one or more slashes, it consumes them and tries to match again--and succeeds because the $ anchor still matches.
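The difference between the two ends can be made visible by listing each match with its offset. This is only a sketch: `list_matches` is a hypothetical helper introduced here, and the subjects are illustrative.

```php
<?php
// Hypothetical helper: report each match of the question's pattern as "offset:text".
function list_matches(string $subject): array
{
    preg_match_all('%^/*|/*$%', $subject, $m, PREG_OFFSET_CAPTURE);
    $out = [];
    foreach ($m[0] as $hit) {
        // With PREG_OFFSET_CAPTURE each hit is [matched text, byte offset].
        $out[] = $hit[1] . ':' . $hit[0];
    }
    return $out;
}

// Leading slashes: one match at the start, then only the empty match at the end.
print_r(list_matches('///foo')); // "0:///", "6:"

// Trailing slashes: the empty match at the start, then TWO matches at the end.
print_r(list_matches('foo///')); // "0:", "3:///", "6:"
```

After consuming '///' at offset 0, the `^` alternative cannot match again, whereas after consuming '///' at offset 3 the `$` alternative still matches at offset 6.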

So in general, if you want to prevent this kind of double match, you can either add a condition to the beginning of the match to prevent it, like the ^ anchor does for the first alternative:

preg_replace('%^/*|(?<!/)/*$%', '/', $d);

...or make sure that part of the regex has to consume at least one character:

preg_replace('%^/*|([^/])/*$%', '$1/', $d);

But in this case you have a much simpler option, as demonstrated by John Kugelman: just capture the part you want to keep and chuck the rest.
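A rough way to compare the three approaches side by side, assuming a handful of sample inputs that are not from the original post (apart from 'foo///'). The array keys are just labels chosen for this sketch.

```php
<?php
// The three patterns discussed above, each paired with its replacement string.
$fixes = [
    'lookbehind' => ['%^/*|(?<!/)/*$%', '/'],   // guard the second alternative
    'consume'    => ['%^/*|([^/])/*$%', '$1/'], // force it to consume a character
    'capture'    => ['%^/*(.*?)/*$%', '/$1/'],  // keep the middle, drop the rest
];

// Illustrative inputs covering zero, one, and several slashes at either end.
$inputs = ['foo', '/foo', 'foo/', 'foo///', '///foo///', 'a/b'];

$results = [];
foreach ($fixes as $name => [$pattern, $replacement]) {
    foreach ($inputs as $input) {
        $results[$name][$input] = preg_replace($pattern, $replacement, $input);
    }
}
print_r($results);
```

For these inputs all three variants should agree with `'/' . trim($input, '/') . '/'`; edge cases such as an empty string or a lone slash are deliberately left out of the sketch.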

+1 nice explanation. Though to me the behavior of the regex engine still is unintuitive in this case. — NikiC, Aug 06 '10 at 08:40
Fine exposition. Thank you, Alan. While it makes sense as you describe it, I doubt I'll be able to remember this next time something like this comes up -- the counterintuitive thing. But it's here for future reference. — , Aug 06 '10 at 20:27

score 0 · Answer 5 · edited May 23 '17 at 11:57

A small change to your pattern would be to separate out the two key concerns at the end of the string:

Replace multiple slashes with one slash
Replace no slashes with one slash

A pattern for that (and the existing part for matching at the start of the string) would look like:

#^/*|/+$|$(?<!/)#

A slightly less concise, but more precise, option would be to be very explicit about only matching zero or two-or-more slashes; the notion being, why replace one slash with one slash?

#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#
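A small sketch of how the two patterns differ in practice, assuming every match is replaced with a single slash. The variable names and sample inputs are illustrative only.

```php
<?php
// Both patterns from this answer; every match is replaced with '/'.
$concise = '#^/*|/+$|$(?<!/)#';
$precise = '#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#';

// Both should normalize these inputs to /foo/.
foreach (['foo', '/foo/', '//foo//'] as $input) {
    printf("%-8s concise=%s precise=%s\n", $input,
        preg_replace($concise, '/', $input),
        preg_replace($precise, '/', $input));
}

// On an already-normalized path, count the replacements each pattern performs.
preg_replace($concise, '/', '/foo/', -1, $conciseCount);
preg_replace($precise, '/', '/foo/', -1, $preciseCount);
echo "concise: $conciseCount, precise: $preciseCount\n"; // concise: 2, precise: 0
```

The counts show the point of the second pattern: it leaves a correct path untouched, while the first one replaces each single slash with itself.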

Aside: nikic's suggestion to use trim (to remove leading/trailing slashes, then add your own) is a good one.

answered Aug 05 '10 at 22:27

salathe

Very good. This answer most directly addresses the surprise I saw at the end of the string in my first attempt. At the moment I'm torn between accepting this and John Kugelman's answer. Your second version is, I agree, precise, where John's sometimes does unnecessary work. But John's is very simple, approaching that of nikic. – Aug 06 '10 at 00:54
