Java - Regex to Split Tokens With Minimum Size and Delimiters

Question

I know, I know: there are many similar questions, and I have read all of them. But I am not good at regex, and I couldn't figure out the regular expression that I need.

I want to split a String in Java, and I have 4 constraints:

The delimiters are [.?!] (end of the sentence)
Decimal numbers shouldn't be tokenized
The delimiters shouldn't be removed.
The minimum size of each token should be 5

For example, for input:

"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."

The output will be:

[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]

So far, I have covered the first three constraints with this regex:

text.split("(?<=[.!?])(?<!\\d)(?!\\d)");

And I know I should use {5,} somewhere in my regex, but none of the combinations I tried work.

For cases like "I love U.S. How about you?", it doesn't matter whether it gives me one or two sentences, as long as it doesn't tokenize S. as a separate sentence.
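To make the problem concrete, here is a small sketch that runs the regex from the question on both example inputs. The class name `AskerRegexDemo` and the helper `split` are made up for illustration. Note that the `(?<!\\d)` part cannot fail at a position that directly follows `.`, `!` or `?`, so in practice only the "not followed by a digit" check protects `$1.45`, and nothing protects `U.S.`.

```java
import java.util.Arrays;

class AskerRegexDemo {
    // The regex from the question: split after ., ! or ? unless a digit follows.
    static String[] split(String text) {
        return text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            split("Hello World! This answer worth $1.45 in U.S. dollar. Thank you.")));
        // prints "[Hello World!,  This answer worth $1.45 in U., S.,  dollar.,  Thank you.]"

        System.out.println(Arrays.toString(split("I love U.S. How about you?")));
        // prints "[I love U., S.,  How about you?]"
    }
}
```

The decimal number survives, but `S.` ends up as its own token, which is exactly the case the minimum-size constraint is meant to rule out.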

Finally, a pointer to a good regex tutorial would be appreciated.

UPDATE: As Chris mentioned in the comments, it is almost impossible to solve problems like this (covering all the cases that occur in natural language) with regex. However, I found HamZa's answer the closest and the most useful one.

So, be careful! The accepted answer will not cover all possible use cases!

Are we sure that at the end of each sentence, there is a space? — Juto, Aug 16 '13 at 20:03
And, what happens if the sentence is shorter than 5 chars, i.e., `Hey!`? — Juto, Aug 16 '13 at 20:06
@Juto It can be. In this example there are spaces, but not in all cases — Afshin Moazami, Aug 16 '13 at 20:06
@Juto It should be concatenated with the neighboring sentence (if one exists) — Afshin Moazami, Aug 16 '13 at 20:07
This is looking dangerously close to natural language parsing, which is not an application for regular expressions. Remember, regular expressions can parse regular languages. Written English is not a regular language. Any solution you get with regular expressions is going to be rough. — Chris Bode, Aug 16 '13 at 20:07
It can be? What do you mean, as it could be a very different approach if there is always a space after `.?!` — Juto, Aug 16 '13 at 20:07
@Juto, I mean we cannot rely on that. But if there is a space, it should concatenate to the next string like my example " Thank you." — Afshin Moazami, Aug 16 '13 at 20:10
[My early answer](http://stackoverflow.com/questions/16377437/split-a-text-into-sentences/16377765#16377765) seems to work for most of your cases except for the `U.S.` case it fails. You just need to wrap it in a lookahead `(?=(?<=[.?!])\\s+(?=[a-z]))` — HamZa, Aug 16 '13 at 20:12
@AfshinMoazami [Take a look](http://regex101.com/r/vJ1nK1). It should work. Of course `\s+` should be `\\s+`. — HamZa, Aug 16 '13 at 20:20
@HamZa, but it doesn't tokenize the "Hello World!", does it? (it's a nice editor btw) — Afshin Moazami, Aug 16 '13 at 20:28
@AfshinMoazami Yes it did, btw I made a typo in there. [Take a look](http://regex101.com/r/xU3aD7). I've added the "substitution" option to see where it gets split. — HamZa, Aug 16 '13 at 20:32
@HamZa, in that online editor, it's fine. But in java, it shows these tokens: [Hello World! This answer worth $1.45 in U.S.] ,[ dollar. Thank you. He lives in the U.K.] and [ but still talks in dollars] It's strange! — Afshin Moazami, Aug 16 '13 at 20:36
@AfshinMoazami Are you using case insensitive flag ? Add `(?i)` to the beginning of your expression `(?i)(?=(?<=[.?!])(? — HamZa, Aug 16 '13 at 20:41
@ChrisBode, I agree that is "looking dangerously close to NL parsing", but there are some close answers that you can see. Thanks for the warning btw :) — Afshin Moazami, Aug 16 '13 at 20:56

score 2 · Answer 1 · answered Aug 16 '13 at 20:41

What about the following regular expression?

(?<=[.!?])(?!\w{1,5})(?<!\d)(?!\d)

e.g.

import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitSentences {
    // Split after ., ! or ? unless a word character (letter, digit, underscore) follows.
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";

        System.out.println(Arrays.toString(
            REGEX_PATTERN.split(input)
        )); // prints "[Hello World!,  This answer worth $1.45 in U.S.,  dollar.,  Thank you.]"
    }
}

Technically, this is the correct answer, but I prefer HamZa's answer, which doesn't split "U.S." and "dollar". Thanks, buddy — Afshin Moazami, Aug 16 '13 at 20:55
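The minimum-size constraint can also be handled after splitting instead of inside the regex, since a lookahead such as `(?!\\w{1,5})` only inspects what follows the split point and says nothing about the length of the token being produced. Below is a minimal sketch of that idea, not a definitive implementation. It assumes, as the comments on the question suggest, that a token shorter than the minimum should be concatenated with the token that follows it, and that a short leftover at the very end is attached to the previous token. The names `MinLengthMerger` and `mergeShort` are hypothetical, and measuring the length after trimming whitespace is my own assumption.

```java
import java.util.ArrayList;
import java.util.List;

class MinLengthMerger {
    // Assumption: a token counts as "too short" when its trimmed length is below minLength.
    static List<String> mergeShort(String[] tokens, int minLength) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String token : tokens) {
            // Keep accumulating until the pending text is long enough to stand alone.
            pending.append(token);
            if (pending.toString().trim().length() >= minLength) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        // A short remainder has no following token, so attach it to the previous one.
        if (pending.length() > 0) {
            if (merged.isEmpty()) {
                merged.add(pending.toString());
            } else {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + pending);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Tokens produced by the regex from the question for "Hey! How are you?".
        String[] tokens = "Hey! How are you?".split("(?<=[.!?])(?<!\\d)(?!\\d)");
        System.out.println(mergeShort(tokens, 5)); // prints "[Hey! How are you?]"
    }
}
```

With this step in place, a lone `S.` is glued to whatever follows it (for example `S. dollar.`), which satisfies the "never a separate sentence" requirement without making the regex itself any more complicated.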

score 2 · Accepted Answer · edited Jan 18 '21 at 12:34

I'm basing my answer on a previously made regex.
The regex was basically (?<=[.?!])\s+(?=[a-z]), which means: match one or more whitespace characters preceded by ., ? or ! and followed by [a-z] (not forgetting the i modifier).

Now let's modify it to the needs of this question:

We'll first convert it to a Java regex, i.e. a string literal with doubled backslashes: (?<=[.?!])\\s+(?=[a-z])
We'll add the i modifier to match case-insensitively: (?i)(?<=[.?!])\\s+(?=[a-z])
We'll put the expression in a positive lookahead to prevent the "eating" of the characters (delimiters in this case): (?=(?i)(?<=[.?!])\\s+(?=[a-z]))
We'll add a negative lookbehind to check that there is no abbreviation in the format LETTER DOT LETTER DOT: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])

So our final regex looks like: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z]). Note that this final form is not wrapped in the lookahead from the third step, so the whitespace between sentences is consumed by the split while the punctuation is kept.
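As a quick check, the final regex can be dropped into `Pattern.split` like this. This is only a sketch: the class name `AcceptedRegexDemo`, the constant `SENTENCE_GAP` and the helper `split` are made up for illustration, and every dot inside the lookbehind is written as `\\.` so that the string literal compiles in Java.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AcceptedRegexDemo {
    // Whitespace that follows ., ? or !, is not preceded by LETTER DOT LETTER DOT,
    // and is followed by a letter (case-insensitive because of (?i)).
    private static final Pattern SENTENCE_GAP = Pattern.compile(
        "(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return SENTENCE_GAP.split(text);
    }

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        System.out.println(Arrays.toString(split(input)));
        // prints "[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]"
    }
}
```

Because the whitespace is part of the match, the tokens come back without leading spaces, while the sentence-ending punctuation stays attached to each token.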

Some links:

Online tester, jump to JAVA
Explain tool (Not JAVA based)
THE regex tutorial
Java specific regex tutorial
SO regex chatroom
Some advanced nice regex-fu on SO
How does this regex find triangular numbers?
How can we match a^n b^n?
How does this Java regex detect palindromes?
How to determine if a number is a prime with regex?
"vertical" regex matching in an ASCII "image"
Can the for loop be eliminated from this piece of PHP code?
^-- See regex solution, although not sure if applicable in JAVA

This fails if the abbreviation is actually at the end of the sentence, e.g. `I live in the U.S.A. We speak English.` Additionally, it still splits on abbreviations of only one part, e.g. `Employees at Grammar Inc. make pedantic comments on the internet.` Both of these are essentially unresolvable with RegEx. — Chris Bode, Aug 16 '13 at 22:21
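Both limitations are easy to reproduce. The sketch below feeds the two sentences from this comment to the accepted answer's regex, written with `\\.` so it compiles as a Java string literal. The class name `AbbreviationLimitsDemo` and the helper `split` are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AbbreviationLimitsDemo {
    // Same pattern as the accepted answer's final regex.
    private static final Pattern SENTENCE_GAP = Pattern.compile(
        "(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return SENTENCE_GAP.split(text);
    }

    public static void main(String[] args) {
        // The abbreviation ends the sentence, so the real boundary is missed.
        System.out.println(Arrays.toString(split("I live in the U.S.A. We speak English.")));
        // prints "[I live in the U.S.A. We speak English.]"

        // A one-part abbreviation is not recognized, so a false boundary is found.
        System.out.println(Arrays.toString(
            split("Employees at Grammar Inc. make pedantic comments on the internet.")));
        // prints "[Employees at Grammar Inc., make pedantic comments on the internet.]"
    }
}
```

The first call returns a single token and the second returns two, which matches the behavior described in the comment.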
