I am new to Regex. I want to match a certain URL pagePath pattern for Analytics.

The Problem:

The pattern looks like this:

/(de|en|fr|it)/../any-word-including-dashes/word-or-words-including-dashes-and-numbers

I want to match only this pattern and exclude every page path that contains an additional forward slash or does not match the initial pattern:

Include:

/de/ab/word-word/word1-and-something-else
/de/ab/word-word/word1-and-something-else?any_ting1=any.-thing2

Exclude:

/de/ab/word-word/word1-and-something-else/
/de/ab/word-word/word1-and-something-else/anything
/de/ab/word-word
/fr/moreThanTwoCHAR/anything

My Regex:

After searching on SO (Exclude forward slash before end, "Match anything but" and Finding exactly n occurences of "/", disallow 0 or more occurences of a CHAR), I came up with the following regex:

^(\/de|\/fr|\/en|\/it)\/..\/.+\/\w+[^\/]*

What it does correctly

It excludes correctly the following path:

/fr/moreThanTwoCHAR/anything

What it fails on

The problem with the above regex is that it also matches (tested on regex101):

/de/ab/word-word/word1-and-something-else/anything

I can't understand why it matches the string with an additional forward slash, even though I specified that 0 or more additional occurrences should be excluded (at least as I understood it). Can anyone explain where I'm mistaken?

  • Thanks @WiktorStribiżew , I see what my fault was (having used the `[^\/]` wrongly). Would you want to answer my question so that I can accept it? – p6l-richard Jan 10 '18 at 16:11

1 Answer

Note that . matches any char, including / (except line break chars if no DOTALL option (/s) is used), so .+ can run across several path segments. Your pattern also has no $ anchor at the end, thus your regex matches more types of input than you need.
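To see where the extra slash goes, here is a small sketch in Python (not part of the original regex101 test). It uses the regex from the question without the unnecessary escapes; the two extra capturing groups are an addition made here only to show which part of the pattern consumes what.

```python
import re

# Regex from the question, with two capturing groups added for illustration
# (group 2 wraps `.+`, group 3 wraps `\w+[^/]*`).
original = re.compile(r"^(/de|/fr|/en|/it)/../(.+)/(\w+[^/]*)")

path = "/de/ab/word-word/word1-and-something-else/anything"
m = original.search(path)

# The greedy `.+` is free to match "/" as well, so it takes two segments.
print(m.group(2))  # word-word/word1-and-something-else
print(m.group(3))  # anything
```

So `[^\/]*` never gets a chance to reject anything: the slash has already been consumed by `.+`.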

You may use

'~^/(de|fr|en|it)/[^/]{2}(?:/[^/]+){2}$~'

See the regex demo.

Pattern details:

  • ^ - start of input
  • / - a / char
  • (de|fr|en|it) - one of the four alternative substrings: de, fr, en or it
  • /[^/]{2} - / and then any 2 chars other than /
  • (?:/[^/]+){2} - 2 consecutive sequences of a / and then 1+ chars other than /
  • $ - end of input.
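
A minimal check of this pattern against the paths from the question, sketched in Python. The `~` delimiters are dropped because Python's `re` does not use them, and the helper name `is_target_path` is made up for this example.

```python
import re

# Same pattern as above, without the ~ delimiters.
pattern = re.compile(r"^/(de|fr|en|it)/[^/]{2}(?:/[^/]+){2}$")

def is_target_path(path: str) -> bool:
    """Hypothetical helper: True if the path has exactly the wanted shape."""
    return pattern.search(path) is not None

include = [
    "/de/ab/word-word/word1-and-something-else",
    "/de/ab/word-word/word1-and-something-else?any_ting1=any.-thing2",
]
exclude = [
    "/de/ab/word-word/word1-and-something-else/",
    "/de/ab/word-word/word1-and-something-else/anything",
    "/de/ab/word-word",
    "/fr/moreThanTwoCHAR/anything",
]

for p in include:
    print(p, is_target_path(p))  # True for both
for p in exclude:
    print(p, is_target_path(p))  # False for all four
```

The query string is accepted because `[^/]+` allows `?`, `=` and `.`; it would be rejected as soon as the query itself contained a `/`. In Python, `$` also matches just before a trailing newline, so `\Z` may be the safer anchor if the input can end with one.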
Wiktor Stribiżew