
I know, I know: there are many similar questions, and I have read all of them. But I am not good at regex, and I couldn't figure out the regular expression that I need.

I want to split a String in Java, and I have 4 constraints:

  1. The delimiters are [.?!] (end of the sentence)
  2. Decimal numbers shouldn't be tokenized
  3. The delimiters shouldn't be removed.
  4. The minimum length of each token should be 5 characters

For example, for input:

"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."

The output will be:

[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]

So far I have a solution for the first three constraints with this regex:

text.split("(?<=[.!?])(?<!\\d)(?!\\d)");

I know I should use {5,} somewhere in my regex, but none of the combinations I tried works.
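To make the problem concrete, here is a minimal sketch of what the regex above produces, together with one possible way to handle the fourth constraint outside the regex: split first, then glue any fragment shorter than five characters onto a neighbouring token. The class name `MinLengthMerge`, the helper `mergeShort`, and the decision to measure length after trimming whitespace are assumptions made for illustration, not something the question prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MinLengthMerge {

    // The regex from the question: split after '.', '!' or '?' unless a digit follows.
    static String[] splitRaw(String text) {
        return text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
    }

    // Hypothetical helper: merges every token whose trimmed length is below
    // minLength into the previous token, or into the next one if it comes first.
    static List<String> mergeShort(String[] tokens, int minLength) {
        List<String> merged = new ArrayList<>();
        String carry = ""; // short leading fragment waiting for a following token
        for (String token : tokens) {
            String current = carry + token;
            carry = "";
            if (current.trim().length() >= minLength) {
                merged.add(current);
            } else if (!merged.isEmpty()) {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + current);
            } else {
                carry = current;
            }
        }
        if (!carry.isEmpty()) {
            merged.add(carry); // the whole input was shorter than minLength
        }
        return merged;
    }

    public static void main(String[] args) {
        String text = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        String[] raw = splitRaw(text);
        System.out.println(Arrays.toString(raw));
        System.out.println(mergeShort(raw, 5));
    }
}
```

With the sample input, the raw split leaves `S.` as a token of its own, and the merge step attaches it back to the preceding token, so the second token ends in `U.S.` again. The split between `U.S.` and ` dollar.` remains, because ` dollar.` is long enough to stand alone.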

For cases like "I love U.S. How about you?" it doesn't matter whether I get one or two sentences, as long as S. is not tokenized as a separate sentence.

Finally, a pointer to a good regex tutorial would be appreciated.

UPDATE: As Chris mentioned in the comments, it is almost impossible to solve problems like this (covering all the cases that occur in natural languages) with regex. However, I found HamZa's answer the closest and the most useful one.

So be careful: the accepted answer does not cover all possible use cases!

Afshin Moazami
  • Are we sure that at the end of each sentence, there is a space? – Juto Aug 16 '13 at 20:03
  • And, what happens if the sentence is shorter than 5 chars, i.e., `Hey!`? – Juto Aug 16 '13 at 20:06
  • @Juto It can be. In this example there are spaces, but not in all cases – Afshin Moazami Aug 16 '13 at 20:06
  • @Juto It should be concatenated with one of the other sentences (if there is one) – Afshin Moazami Aug 16 '13 at 20:07
  • This is looking dangerously close to natural language parsing, which is not an application for regular expressions. Remember, regular expressions can parse regular languages. Written English is not a regular language. Any solution you get with regular expressions is going to be rough. – Chris Bode Aug 16 '13 at 20:07
  • It can be? What do you mean, as it could be a very different approach if there is always a space after `.?!` – Juto Aug 16 '13 at 20:07
  • @Juto, I mean we cannot rely on that. But if there is a space, it should concatenate to the next string like my example " Thank you." – Afshin Moazami Aug 16 '13 at 20:10
  • [My early answer](http://stackoverflow.com/questions/16377437/split-a-text-into-sentences/16377765#16377765) seems to work for most of your cases except for the `U.S.` case it fails. You just need to wrap it in a lookahead `(?=(?<=[.?!])\\s+(?=[a-z]))` – HamZa Aug 16 '13 at 20:12
  • @AfshinMoazami [Take a look](http://regex101.com/r/vJ1nK1). It should work. Of course `\s+` should be `\\s+`. – HamZa Aug 16 '13 at 20:20
  • @HamZa, but it doesn't tokenize the "Hello World!", does it? (it's a nice editor btw) – Afshin Moazami Aug 16 '13 at 20:28
  • @AfshinMoazami Yes it did, btw I made a typo in there. [Take a look](http://regex101.com/r/xU3aD7). I've added the "substitution" option so you can see where it gets split. – HamZa Aug 16 '13 at 20:32
  • @My early answer, If I do it right: "(?=(?<=[.!?])(? – Afshin Moazami Aug 16 '13 at 20:32
  • @HamZa, in that online editor, it's fine. But in java, it shows these tokens: [Hello World! This answer worth $1.45 in U.S.] ,[ dollar. Thank you. He lives in the U.K.] and [ but still talks in dollars] It's strange! – Afshin Moazami Aug 16 '13 at 20:36
  • @AfshinMoazami Are you using the case-insensitive flag? Add `(?i)` to the beginning of your expression `(?i)(?=(?<=[.?!])(? – HamZa Aug 16 '13 at 20:41
  • Now, it works. Add it as an answer please :) – Afshin Moazami Aug 16 '13 at 20:53
  • @ChrisBode, I agree that is "looking dangerously close to NL parsing", but there are some close answers that you can see. Thanks for the warning btw :) – Afshin Moazami Aug 16 '13 at 20:56

2 Answers


What about the following regular expression?

(?<=[.!?])(?!\w{1,5})(?<!\d)(?!\d)

e.g.

import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitSentences {

    // Split after '.', '!' or '?' unless a word character (letter, digit, '_') follows.
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";

        System.out.println(Arrays.toString(
            REGEX_PATTERN.split(input)
        )); // prints "[Hello World!,  This answer worth $1.45 in U.S.,  dollar.,  Thank you.]"
    }
}
Paul Vargas
  • Technically, this is the correct answer, but I prefer HamZa's answer, which doesn't split between "U.S." and "dollar". Thanks buddy – Afshin Moazami Aug 16 '13 at 20:55

I'm basing my answer on a regex from an earlier answer of mine.
The regex was basically (?<=[.?!])\s+(?=[a-z]), which matches one or more whitespace characters that are preceded by ., ? or ! and followed by a letter in [a-z] (used together with the i modifier, so uppercase letters match as well).

Now let's modify it to the needs of this question:

  1. We'll first convert it to a Java regex by doubling the backslashes for the string literal: (?<=[.?!])\\s+(?=[a-z])
  2. We'll add the i modifier to match case-insensitively: (?i)(?<=[.?!])\\s+(?=[a-z])
  3. We'll put the expression in a positive lookahead to prevent the "eating" of the characters (delimiters in this case): (?=(?i)(?<=[.?!])\\s+(?=[a-z]))
  4. We'll add a negative lookbehind to make sure the whitespace is not preceded by an abbreviation of the form LETTER DOT LETTER DOT: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])

So our final regex, written as a Java string, looks like: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z]).
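The final regex can be tried out with a short sketch like the one below. The class name `FinalRegexDemo` and the helper `split` are made up for illustration; the pattern is the one from the steps above, written as a Java string literal (so every backslash, including the ones in front of the dots, is doubled).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class FinalRegexDemo {

    // Whitespace that follows '.', '?' or '!', is not preceded by a
    // LETTER DOT LETTER DOT abbreviation, and is followed by a letter.
    static final Pattern SENTENCE_BREAK =
            Pattern.compile("(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    // Hypothetical convenience wrapper around Pattern.split.
    static String[] split(String text) {
        return SENTENCE_BREAK.split(text);
    }

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        System.out.println(Arrays.toString(split(input)));
        // prints "[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]"
    }
}
```

Because the delimiters sit in a lookbehind while `\\s+` is matched directly, the punctuation stays in the tokens and only the whitespace between sentences is consumed. Note that this pattern does not enforce a minimum token length by itself.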

HamZa
  • This fails if the abbreviation is actually at the end of the sentence, e.g. `I live in the U.S.A. We speak English.` Additionally, it still splits on abbreviations of only one part, e.g. `Employees at Grammar Inc. make pedantic comments on the internet.` Both of these are essentially unresolvable with RegEx. – Chris Bode Aug 16 '13 at 22:21
  • @ChrisBode Yup, I know. – HamZa Aug 16 '13 at 22:24
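
The two failure cases from the comments above can be reproduced with a small sketch. The class name `AbbreviationLimits` and the helper `split` are hypothetical; the pattern is the final regex from the answer, written as a Java string literal.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AbbreviationLimits {

    static final Pattern SENTENCE_BREAK =
            Pattern.compile("(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return SENTENCE_BREAK.split(text);
    }

    public static void main(String[] args) {
        // Abbreviation at the end of a sentence: the lookbehind sees "S.A."
        // and blocks the split, so both sentences stay in one token.
        System.out.println(Arrays.toString(
                split("I live in the U.S.A. We speak English.")));

        // Single-part abbreviation: "Inc." does not look like LETTER DOT LETTER DOT,
        // so the text is split in the middle of the sentence.
        System.out.println(Arrays.toString(
                split("Employees at Grammar Inc. make pedantic comments on the internet.")));
    }
}
```

The first call returns a single token and the second returns two, which is the behaviour described in the comment. Handling such cases reliably probably needs something beyond a single regex, for example a list of known abbreviations.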