
I would like to split a string on word boundaries. For now I am treating whitespace, ',', '.' and '!' as the boundaries between words.
In the following example:

String text = "This is, just a text to be used, for testing purpose. Nothing more!";
String[] words = text.split("[\\s+,.!]");
for(String w: words) {
    System.out.println(w);
}  

This prints:

This
is

just
a
text
to
be
used

for
testing
purpose

Nothing
more

As you can see, there are empty words after the words that ended with ',' or '.'.
But if I add a + after the brackets in my regex:

String[] words = text.split("[\\s+,.!]+");
for(String w: words) {
    System.out.println(w);
}

This prints:

This
is
just
a
text
to
be
used
for
testing
purpose
Nothing
more

The empty words are gone. Why is that + required to avoid the empty words?

Jim

1 Answer


"[\\s+,.!]" doesn't do what you think it does. Inside a character class ([]), + is treated as a literal character, not as the regex quantifier meaning "one or more".

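The empty strings with the first pattern appear because it matches exactly one delimiter character at a time. In a substring like ", ", the "," and the " " are two separate delimiters, and split returns the empty string between them.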
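To see this in isolation, here is a minimal sketch that applies the question's first pattern to a short input. The class name `EmptyTokenDemo` and the helper `splitOneCharAtATime` are made up for illustration.

```java
import java.util.Arrays;

public class EmptyTokenDemo {
    // Hypothetical helper: same pattern as in the question,
    // which matches exactly ONE delimiter character per match.
    static String[] splitOneCharAtATime(String text) {
        return text.split("[\\s+,.!]");
    }

    public static void main(String[] args) {
        // "," and " " are matched as two separate delimiters,
        // so split returns the empty string that sits between them.
        System.out.println(Arrays.toString(splitOneCharAtATime("is, just")));
        // prints [is, , just]
    }
}
```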

"[\\s+,.!]+" works because the final + is outside the brackets, where it is the regex quantifier "one or more": a run of any characters from the preceding character class (the stuff inside []) is matched as a single delimiter.

But the + inside the brackets is probably not what you want. It would split "foo+bar" into {"foo", "bar"}, which appears to be a false positive. Use "[\\s,.!]+" to avoid this.
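A small sketch comparing the two patterns on an input that contains a literal plus sign. The class and method names are hypothetical and only there to make the example runnable.

```java
import java.util.Arrays;

public class PlusInClassDemo {
    // Pattern from the question: '+' inside the brackets is a literal delimiter.
    static String[] splitWithLiteralPlus(String text) {
        return text.split("[\\s+,.!]+");
    }

    // Suggested pattern: '+' removed from the brackets, quantifier kept outside.
    static String[] splitWithoutLiteralPlus(String text) {
        return text.split("[\\s,.!]+");
    }

    public static void main(String[] args) {
        String text = "foo+bar, baz!";
        // The literal '+' splits "foo+bar" in two.
        System.out.println(Arrays.toString(splitWithLiteralPlus(text)));
        // prints [foo, bar, baz]

        // Without it, "foo+bar" stays one word.
        System.out.println(Arrays.toString(splitWithoutLiteralPlus(text)));
        // prints [foo+bar, baz]
    }
}
```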

ggorlen
  • So how can I split on more than one whitespace character if I can't use `\\s+` inside the brackets? – Jim May 18 '21 at 21:13
  • `"[\\s,.!]+"` does that. It treats, for example, `" "`, `" .. . . . "`, `"!!,."`, etc. as single delimiters. `"foo ... !!, ,, . ,, bar"` => `{"foo", "bar"}`. – ggorlen May 18 '21 at 21:15
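
The example from the last comment can be checked with a short sketch like the one below. The class name `DelimiterRunDemo` and the helper `splitWords` are hypothetical. The second call shows one caveat not discussed in the thread: if the text starts with a delimiter, `split` still returns a leading empty string.

```java
import java.util.Arrays;

public class DelimiterRunDemo {
    // Hypothetical helper: any run of whitespace, ',', '.' or '!'
    // counts as a single delimiter.
    static String[] splitWords(String text) {
        return text.split("[\\s,.!]+");
    }

    public static void main(String[] args) {
        // The whole mixed run between "foo" and "bar" is one delimiter.
        System.out.println(Arrays.toString(splitWords("foo ... !!, ,, . ,, bar")));
        // prints [foo, bar]

        // A delimiter at the very start still yields a leading empty string.
        System.out.println(Arrays.toString(splitWords("  foo bar")));
        // prints [, foo, bar]
    }
}
```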