0

I'm trying to fill an ArrayList with words, but sometimes it adds an empty string. Why does this happen, and how can I avoid it?

    ArrayList<String> textAL = new ArrayList<String>();
    String text = "This.IS(a) text example blah? bl:ah";
    String regex = "[\\s\\?\\.,:;\\)\\(]";

    String[] splittedText = text.split(regex);

    for(int i = 0; i < splittedText.length; i++){
        if(splittedText[i] != " "){  //ignore whitespace
            textAL.add(splittedText[i]);
        }           
    }

    for(int i = 0; i < textAL.size(); i++){
        System.out.println("textAL(" + i + ") " + textAL.get(i));
    }

Result:

textAL(0) This
textAL(1) IS
textAL(2) a
textAL(3) 
textAL(4) text
textAL(5) example
textAL(6) blah
textAL(7) 
textAL(8) bl
textAL(9) 
textAL(10) ah
Fran

3 Answers

2

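You need to add a quantifier to your pattern. Without it, every delimiter character is matched on its own, so two adjacent delimiters (for example ")" followed by a space) leave an empty string between them: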

// Requires: import java.util.Arrays;
String text = "This.IS(a) text example blah? bl:ah";
// Unnecessary escapes removed (thanks hwnd): inside a character class, ? . ) ( are literals
//              ┌ original character class
//              |           ┌ greedy quantifier: "one or more times"
//              |           |
String regex = "[\\s?.,:;)(]+";
String[] splittedText = text.split(regex);
System.out.println(Arrays.toString(splittedText));

Output

[This, IS, a, text, example, blah, bl, ah]
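To make the effect of the quantifier visible, here is a minimal sketch that runs the same text through both versions of the character class. The class name QuantifierDemo and the constants SINGLE and GROUPED are only illustrative names, not part of the question's code.

```java
import java.util.Arrays;

class QuantifierDemo {
    // Delimiter class from the question, without and with the "+" quantifier.
    static final String SINGLE = "[\\s?.,:;)(]";
    static final String GROUPED = "[\\s?.,:;)(]+";

    public static void main(String[] args) {
        String text = "This.IS(a) text example blah? bl:ah";

        // Each delimiter is matched separately, so ") " and "? " each
        // leave an empty string between the two delimiter characters.
        System.out.println(Arrays.toString(text.split(SINGLE)));
        // [This, IS, a, , text, example, blah, , bl, ah]

        // A whole run of delimiters is consumed as a single match.
        System.out.println(Arrays.toString(text.split(GROUPED)));
        // [This, IS, a, text, example, blah, bl, ah]
    }
}
```

The first call returns 10 elements, two of them empty; the second returns the 8 words only.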
Mena
1

I think that the issue is that you're forgetting the + at the end of your regex, e.g.,

String regex = "[\\s\\?\\.,:;\\)\\(]+";

but how about something as simple as

String regex = "\\W+";

Note that \\W is the same as [^\\w]: it matches any character that is not a letter, a digit, or an underscore.

Test:

// Requires: import java.util.ArrayList;
public static void main(String[] args) {
  ArrayList<String> textAL = new ArrayList<String>();
  String text = "This.IS(a) text example blah? bl:ah";
  // String regex = "[\\s\\?\\.,:;\\)\\(]+";
  String regex = "\\W+";  // one or more non-word characters

  String[] splittedText = text.split(regex);

  // Every piece is a word now, so no filtering is needed.
  for(int i = 0; i < splittedText.length; i++){
      textAL.add(splittedText[i]);
  }

  for(int i = 0; i < textAL.size(); i++){
      System.out.println("t2(" + i + ") "+ textAL.get(i));
  }
}

Result:

t2(0) This
t2(1) IS
t2(2) a
t2(3) text
t2(4) example
t2(5) blah
t2(6) bl
t2(7) ah

Edit

Your other issue is here:

splittedText[i] != " "

You're comparing Strings with the != operator, and you almost never want to compare Strings with == or !=. Instead, use the equals(...) or equalsIgnoreCase(...) method. The == and != operators check whether two references point to the same object, which is not what you're interested in. The methods, on the other hand, check whether the two Strings contain the same characters in the same order, and that's what matters here.

Fortunately, if you use the right regex, the above becomes a non-issue for your current code, but risks becoming an issue in future code, so please do take this to heart.
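As a rough illustration of the difference, the sketch below compares two Strings with the same characters both ways. It also shows a content-based filter, since the unwanted entries in the question are empty strings ("") rather than " ", so even equals(" ") would not have removed them. The class CompareDemo and the helper nonEmpty are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

class CompareDemo {
    // Hypothetical helper: keeps only the non-empty pieces of a split result.
    static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {   // checks content, not object identity
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String literal = "text";
        String copy = new String("text");   // distinct object, same characters

        System.out.println(literal == copy);        // false: different objects
        System.out.println(literal.equals(copy));   // true: same characters

        // Text that starts with a delimiter still yields one empty
        // first element, even when the regex ends with "+".
        String[] parts = "(a) b".split("[\\s?.,:;)(]+");
        System.out.println(nonEmpty(parts));        // [a, b]
    }
}
```

A filter like this is one way to stay safe if future input begins with punctuation or whitespace.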

Hovercraft Full Of Eels
0

What about String regex = "[^\\w]+"; instead? Writing it as a negated character class lets you add further characters that should not be treated as delimiters, for example the apostrophe: "[^\\w']+"
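Here is a small sketch of what the extra apostrophe changes. The sample sentence and the names ApostropheDemo, PLAIN and KEEP_APOSTROPHE are made up for illustration.

```java
import java.util.Arrays;

class ApostropheDemo {
    // Split on runs of non-word characters.
    static final String PLAIN = "\\W+";
    // Same idea, but the apostrophe is no longer treated as a delimiter.
    static final String KEEP_APOSTROPHE = "[^\\w']+";

    public static void main(String[] args) {
        String text = "I don't know, it's fine";

        System.out.println(Arrays.toString(text.split(PLAIN)));
        // [I, don, t, know, it, s, fine]

        System.out.println(Arrays.toString(text.split(KEEP_APOSTROPHE)));
        // [I, don't, know, it's, fine]
    }
}
```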

Regular Jo