Having trouble deciphering a sentence tokenizer regex

Question

The following regex is supposed to act as a sentence tokenizer pattern, but I'm having some trouble deciphering what exactly it's doing:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s

I understand that it's using positive and negative lookbehinds, as the accepted answer of this post explains (they give the example of a negative lookbehind like this: (?<!B)A). But what is considered A in the above regex?

The `A` is `\s` at the end. Note that the positive lookbehind can be written as `(?<=[.?!])` using a character class. If a quantifier in the lookbehind is supported, you could shorten it making the `[a-z]` optional `(? — The fourth bird, Jul 23 '20 at 07:33

jdaz · Accepted Answer · 2020-07-23T01:20:15.183

The regex is checking for breaks between sentences. The negative lookbehinds prevent false matches that represent abbreviations instead of the ends of sentences. They mean:

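(?<!\w\.\w.) Don't match anything that looks like A.b., 2.c., or 1.3. (They probably meant the final . to be \. so that it matches only a period. As written it matches any character there, so on its own this lookbehind also matches A.b! or g.Z4. In the full pattern the positive lookbehind limits that last character to ., ? or !, so the practical difference is that breaks after things like A.b! or A.b? are also suppressed.)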
(?<![A-Z][a-z]\.) Don't match anything that looks like Cf., Dr., Mr., etc. Note this only checks one capital letter and one lowercase letter before the period, so the whitespace after "Mrs." will still be matched, incorrectly, as a sentence break.
(?<![A-Z]\.) Don't match anything that looks like A. or C.

Then, if these all pass, it has a positive lookbehind (?<=\.|\?|\!) to check that the character immediately before is ., ? or !.

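And finally it matches a single whitespace character, \s. That whitespace is the only thing the pattern actually consumes; the lookbehinds just inspect what comes before it.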
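To see the pieces working together, here is a minimal sketch that splits on the pattern with Python's `re` module. The choice of Python, the sample strings, and the names `SENTENCE_BREAK` and `STRICT_BREAK` are my own assumptions for illustration; `STRICT_BREAK` is a hypothetical variant with the last `.` of the first lookbehind escaped, which is only a guess at what the pattern's author intended.

```python
import re

# The pattern from the question, unchanged.
SENTENCE_BREAK = re.compile(
    r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s"
)

# Hypothetical variant: the final "." in the first lookbehind is escaped.
STRICT_BREAK = re.compile(
    r"(?<!\w\.\w\.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s"
)

# Made-up sample text that exercises each lookbehind.
text = "Dr. Smith arrived at 5 p.m. today. He met Mrs. Jones. Was J. Doe there? Yes!"

# Splitting on the matched whitespace yields the "sentences".
sentences = SENTENCE_BREAK.split(text)
for sentence in sentences:
    print(sentence)
# Dr. Smith arrived at 5 p.m. today.
# He met Mrs.        <- "Mrs." is not caught by the two-letter check
# Jones.
# Was J. Doe there?
# Yes!

# Effect of the unescaped "." in the first lookbehind.
version_text = "Is it version 2.0? Yes."
loose = SENTENCE_BREAK.split(version_text)   # "2.0?" blocks the split
strict = STRICT_BREAK.split(version_text)    # only "x.y." would block it
print(loose)   # ['Is it version 2.0? Yes.']
print(strict)  # ['Is it version 2.0?', 'Yes.']
```

The space after "Dr." is skipped by the second lookbehind, the one after "p.m." by the first, and the one after "J." by the third. The space after "Mrs." passes all three, so the text is split there.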

