I'm preparing for the OCPJP exam and I ran into the following example:

class Test {
   public static void main(String args[]) {
      String test = "I am preparing for OCPJP";
      String[] tokens = test.split("\\S");
      System.out.println(tokens.length);
   }
}

This code prints 16. I was expecting something like no_of_characters + 1. Can someone explain to me what the split() method actually does in this case? I just don't get it...

peterremec

1 Answer

It splits on every match of "\\S", which the regex engine reads as \S: a single non-whitespace character.

So let's try to split "x x" on non-whitespace (\S). Since this regex matches exactly one character at a time, let's go through the characters one by one and mark the places where we split (we will use a pipe | for that).

  • is 'x' non-whitespace? YES, so we mark it: | x
  • is ' ' non-whitespace? NO, so we leave it as is
  • is the last 'x' non-whitespace? YES, so we mark it: | |

So as a result we need to split our string at the start and at the end, which initially gives us this array

["", " ", ""]
   ^    ^ - here we split

But since trailing empty strings are removed, the result is

[""," "]     <- result
        ,""] <- removed trailing empty string

so split returns the array ["", " "], which contains only two elements.
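The walk-through above can be checked with a minimal sketch. The class name and the show helper are hypothetical and exist only to make empty strings visible in the output:

```java
class SplitOnNonWhitespace {
    // Hypothetical helper: prints each token in quotes so that "" and " " are distinguishable.
    static String show(String[] tokens) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('"').append(tokens[i]).append('"');
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // Split on every single non-whitespace character.
        String[] tokens = "x x".split("\\S");
        System.out.println(show(tokens));   // prints ["", " "]
        System.out.println(tokens.length);  // prints 2
    }
}
```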

BTW, to keep the trailing empty strings you need to use split(regex, limit) with a negative limit, like split("\\S", -1).
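A short sketch of the difference the limit makes, using "x x" and the string from the question. The count helper is a hypothetical name introduced only for this illustration; a limit of 0 behaves like the one-argument split:

```java
class SplitKeepTrailing {
    // Hypothetical helper: number of tokens when splitting on \S with the given limit.
    static int count(String input, int limit) {
        return input.split("\\S", limit).length;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";
        System.out.println(count("x x", 0));   // prints 2  (trailing "" removed)
        System.out.println(count("x x", -1));  // prints 3  (trailing "" kept)
        System.out.println(count(test, 0));    // prints 16
        System.out.println(count(test, -1));   // prints 21
    }
}
```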


Now let's get back to your example. With your data you are splitting on each of the marked characters

I am preparing for OCPJP
| || ||||||||| ||| |||||

which means

 ""|" "|""|" "|""|""|""|""|""|""|""|""|" "|""|""|" "|""|""|""|""|""

So this represents this array

[""," ",""," ","","","","","","","",""," ","",""," ","","","","",""]  

but since trailing empty strings "" are removed (only if they were produced by the split itself; more info at: Confusing output from String.split)

[""," ",""," ","","","","","","","",""," ","",""," ","","","","",""]  
                                                     ^^ ^^ ^^ ^^ ^^

you get a result array that contains only this part:

[""," ",""," ","","","","","","","",""," ","",""," "]  

which is exactly 16 elements.
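If you want to reproduce the counting yourself, here is a rough sketch. It assumes that replacing every \S match with a pipe is a fair way to draw the diagram above; markSplits and trailingEmpty are hypothetical helper names, not part of the String API:

```java
class SplitPipeDiagram {
    // Hypothetical helper: replaces every character that split("\\S") would split on with '|'.
    static String markSplits(String input) {
        return input.replaceAll("\\S", "|");
    }

    // Hypothetical helper: counts the empty strings at the end of the array.
    static int trailingEmpty(String[] tokens) {
        int count = 0;
        for (int i = tokens.length - 1; i >= 0 && tokens[i].isEmpty(); i--) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";
        System.out.println(test);
        System.out.println(markSplits(test));       // prints | || ||||||||| ||| |||||

        String[] all = test.split("\\S", -1);       // negative limit keeps trailing ""
        int trailing = trailingEmpty(all);
        System.out.println(all.length);             // prints 21
        System.out.println(trailing);               // prints 5
        System.out.println(all.length - trailing);  // prints 16
        System.out.println(test.split("\\S").length); // prints 16
    }
}
```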

Pshemo
  • Do you know why it's 16, and why it's only `4` for `I am preparing`? – Sabuj Hassan Mar 07 '14 at 20:20
  • Thanks! Now I got it. It's removing the last non-whitespace section if that is adjacent to the end of the line! – Sabuj Hassan Mar 07 '14 at 20:25
  • That's it... I didn't realize, that the trailing empty strings are removed. That explains the result. Thank you very much! – peterremec Mar 07 '14 at 20:28
  • @SabujHassan Exactly. If you want to turn off this default mechanism so that trailing empty elements are not removed, just add a negative limit as the second split argument, like `split(regex,-1);`. – Pshemo Mar 07 '14 at 20:30
  • @peterremec No problem. But in the future, start by reading the Javadoc of the methods you are using. The [`split`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split%28java.lang.String,%20int%29) method documentation mentions that `... trailing empty strings will be discarded.` – Pshemo Mar 07 '14 at 20:31
  • @Pshemo My mistake, I totally missed that part :S – peterremec Mar 07 '14 at 20:45
  • If \\S represents a non-whitespace character, then why is it considering the last whitespace? – Lokesh Sanapalli Feb 11 '16 at 07:36
  • As split removes trailing spaces, if we replace split("\\S") with split("\\s") then we will get the string split on spaces. It will not remove trailing spaces and strings. – Lokesh Sanapalli Feb 11 '16 at 08:59
  • @LokeshS I am not sure what you mean, but let me rephrase my answer a little. When you split on `\S` you are splitting on each non-whitespace character. This means that for a string like `"a_b"` (where `_` represents a space) your result array will at first look like `["", "_", ""]`, but because `split` removes trailing empty strings the returned array will be `["", "_"]`. Now which step is confusing you? – Pshemo Feb 11 '16 at 19:17
  • The confusing part to me is, in the example "a_b", as it splits on each non-whitespace character, why "_" is included in the result one. Similarly, you gave the example above "x x" which results [""," "] . But, you made the statement, "Important part is that by default split removes trailing empty strings", if it removes trailing empty strings and the part after it, why that empty string is included in the result? – Lokesh Sanapalli Feb 12 '16 at 07:42
  • If by default split removes trailing whitespaces, let's say I have string like "I am a good boy", I am splitting with "\\s", which splits on every whitespace. I should get the result as ["I","am","a","good"," "] as split removes trailing whitespaces like you said. But, I will get the result like this ["I","am","a","good","boy"]. What makes the difference exactly? – Lokesh Sanapalli Feb 12 '16 at 07:44
  • @LokeshS "why "_" is included in the result one" `_` in my example represents a space, and `\S` represents non-whitespace. So let's step through the splitting process for `"a_b"`. Let's iterate over each character (since `\S` can be matched only by a single character). Is `a` non-whitespace? Yes, so we split on it: `|_b` (`|` represents the place we split on). Is `_` non-whitespace? No, it is whitespace, so we don't split here. Is `b` non-whitespace? Yes, so we split on it: `|_|`, which gives us the array `["", "_", ""]`. – Pshemo Feb 12 '16 at 12:52
  • @LokeshS You may also misunderstand what is considered an empty string. In Java (and many other languages) an empty string is a string which *doesn't contain **any** character*. This means that `" "` is *not* empty (because it *contains* one character representing a space). Only `""` is considered an empty string (its length is 0). – Pshemo Feb 12 '16 at 12:55
  • @LokeshS Now trailing empty strings are series of empty string placed at the end of array. So array like `[a, "", ""]` has two trailing empty strings, but array like `[a, "", "", b]` doesn't have any trailing empty strings (because it ends with non-empty string). – Pshemo Feb 12 '16 at 13:00
  • @Pshemo How come `"ab".split(" ").length` returns `1`? – AnV Jan 03 '17 at 10:20
  • @Pshemo Also `"".split("[^A-Za-z]+");` returns `1`. Using the pipe strategy you mentioned, I couldn't figure out how these two happens. Please explain/clarify how to deal with such cases. – AnV Jan 03 '17 at 10:44
  • @AbhinavVutukuri In the case of `"ab".split(" ")` no split happened, so the result array contains the original string `["ab"]` and its length is 1. The second case is more interesting. Here the split also couldn't happen, because `""` doesn't contain any *single* character outside of the ranges `A-Za-z`, so we get a result array with the original string `[""]`. The confusing part which I didn't mention in my answer is that removing trailing empty strings makes sense only if they were *created* by the splitting process. But here there was no split, so there is no need to remove anything. – Pshemo Jan 03 '17 at 11:49
  • @AbhinavVutukuri I tried to explain it in my answer in different question: http://stackoverflow.com/questions/25056607/confusing-output-from-string-split/25058091#25058091 – Pshemo Jan 03 '17 at 11:51
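
The two cases raised in the last comments can be checked with a small sketch. The class name is hypothetical, and the `" ".split(" ")` line is an extra contrast case that is not from the thread: there the regex does match, so the empty pieces come from the split and are removed.

```java
class SplitNoMatch {
    public static void main(String[] args) {
        // No space in "ab": the regex never matches, so the original string is returned as-is.
        String[] noMatch = "ab".split(" ");
        System.out.println(noMatch.length);      // prints 1

        // The empty input has nothing to match either, so the result is [""].
        String[] emptyInput = "".split("[^A-Za-z]+");
        System.out.println(emptyInput.length);   // prints 1

        // Contrast: here the split happens and produces only empty strings, which are all removed.
        String[] onlySeparator = " ".split(" ");
        System.out.println(onlySeparator.length); // prints 0
    }
}
```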