
Why does `\p{P}` in `(^|\p{P})(?!,alpha,),*alpha,*` behave differently from `\p{Ps}` in `(^|\p{Ps})(?!,alpha,),*alpha,*` when processing the input `(,alpha,`? The `\p{P}` version matches, whereas the `\p{Ps}` version does not.

asr
  • The problem is with \p{Po}. For example, `(^|[\p{P}-[\p{Po}]])(?!,alpha,),*alpha,*` works as wanted. But I would still be interested in knowing why. – asr Oct 18 '20 at 11:01
  • See the [list of chars matched with `\p{Ps}`](https://www.fileformat.info/info/unicode/category/Ps/list.htm). List of all categories: https://www.fileformat.info/info/unicode/category/index.htm – Wiktor Stribiżew Oct 18 '20 at 11:11

1 Answer


Because `\p{P}` matches the `,` in your string, not the `(`.

This is because you have a negative lookahead `(?!,alpha,)` after `\p{P}`. It means "after 'any punctuation', there must not be the string `,alpha,`". There is a `,alpha,` after `(`, so the attempt that uses `\p{P}` to match `(` fails. The regex engine moves forward one character and tries again. This time, `\p{P}` matches `,`, and there is no `,alpha,` after that `,` (there is only `alpha,`). The rest of the pattern succeeds too, so the whole match succeeds. The matched string is `,alpha,`, without the `(`.

If you change `\p{P}` to `\p{Ps}`, it will fail at `(` just like before, but it will also fail to match `,`, causing the whole match to fail. Note that the `^` alternative doesn't get chosen either: even though the lookahead passes there, your regex then requires `alpha` (optionally preceded by commas) to follow immediately, but at the start of the string there is a `(` instead.
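The two cases can be checked with a small script. This is only a sketch: the question does not name a regex engine, so it assumes a JavaScript/TypeScript engine with the `u` flag, and `firstMatch` is a hypothetical helper introduced here for illustration. The patterns and the input are the ones from the question.

```typescript
// Hypothetical helper (not from the question): returns the matched text and
// its start index, or null when the pattern does not match anywhere.
function firstMatch(
  pattern: string,
  input: string
): { text: string; index: number } | null {
  // The "u" flag enables \p{...} Unicode property escapes.
  const m = new RegExp(pattern, "u").exec(input);
  return m === null ? null : { text: m[0], index: m.index };
}

const input = "(,alpha,";

// \p{P} is any punctuation, so it can match "," (category Po) at index 1.
const withP = firstMatch("(^|\\p{P})(?!,alpha,),*alpha,*", input);

// \p{Ps} is opening punctuation only. It can match "(" but not ",".
const withPs = firstMatch("(^|\\p{Ps})(?!,alpha,),*alpha,*", input);

console.log(withP);  // text ",alpha," starting at index 1, so "(" is not part of the match
console.log(withPs); // null
```

The start index of 1 shows that the successful match begins at the comma, which is the point of the explanation above: `(` is never consumed by either pattern.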

Sweeper