I know, I know: there are many similar questions, and I have read all of them. But I am not good at regex, and I couldn't figure out the regular expression I need.
I want to split a String in Java, and I have 4 constraints:
- The delimiters are [.?!] (end of the sentence)
- Decimal numbers shouldn't be tokenized
- The delimiters shouldn't be removed.
- Each token should be at least 5 characters long.
For example, for input:
"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."
The output will be:
[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]
So far, I have satisfied the first three constraints with this regex:
text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
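To show where this leaves things, here is a minimal, self-contained sketch that runs that regex on the example input. The class and method names (`SplitDemo`, `splitSentences`) are just placeholders for illustration. Decimals survive and the delimiters are kept, but "U.S." is still broken apart and the tokens keep their leading spaces:

```java
class SplitDemo {
    // The regex from above: split right after '.', '!' or '?',
    // unless the next character is a digit (so "$1.45" stays whole).
    static String[] splitSentences(String text) {
        return text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
    }

    public static void main(String[] args) {
        String text = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        // Brackets make the leading spaces visible.
        for (String token : splitSentences(text)) {
            System.out.println("[" + token + "]");
        }
        // Prints:
        // [Hello World!]
        // [ This answer worth $1.45 in U.]
        // [S.]
        // [ dollar.]
        // [ Thank you.]
    }
}
```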
I know I should use {5,} somewhere in my regex, but none of the combinations I have tried work.
For cases like "I love U.S. How about you?", it doesn't matter whether I get one or two sentences, as long as "S." is not tokenized as a separate sentence.
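To make the fourth constraint concrete, here is a rough sketch of the behaviour I mean, done outside the regex rather than with {5,}: split with the three-constraint regex, then glue any token shorter than the minimum onto the previous one. The names (`MinLengthSplit`, `split`, `minLength`) are hypothetical, and this is only a fallback under those assumptions, not the pure-regex solution I am looking for. It also does not reproduce my desired output exactly, since "dollar." still ends up as its own token.

```java
import java.util.ArrayList;
import java.util.List;

class MinLengthSplit {
    // Splits after '.', '!' or '?' (not before a digit), trims each token,
    // and merges tokens shorter than minLength into the preceding token.
    static List<String> split(String text, int minLength) {
        List<String> result = new ArrayList<>();
        for (String raw : text.split("(?<=[.!?])(?<!\\d)(?!\\d)")) {
            String token = raw.trim();
            if (token.isEmpty()) {
                continue; // skip whitespace-only pieces
            }
            int last = result.size() - 1;
            if (token.length() < minLength && last >= 0) {
                // Too short: append the untrimmed piece so original spacing is kept.
                result.set(last, (result.get(last) + raw).trim());
            } else {
                // Long enough, or there is no previous token to merge into.
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("I love U.S. How about you?", 5));
        // Prints: [I love U.S., How about you?]
    }
}
```

On the longer example above, this sketch gives [Hello World!, This answer worth $1.45 in U.S., dollar., Thank you.].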
Finally, a pointer to a good regex tutorial would be appreciated.
UPDATE: As Chris mentioned in the comments, it is almost impossible to solve problems like this with regex in a way that covers all the cases that occur in natural languages. However, I found HamZa's answer the closest and the most useful one.
So, be careful! The accepted answer will not cover all possible use cases!