
I work on a large-scale data set, so I am interested in the most efficient way to split a String.

I found the questions "Scanner vs. StringTokenizer vs. String.Split" and "string tokenizer in Java", which pretty much state that I should not use StringTokenizer.

I was convinced not to use it until I checked @Neil Coffey's experiment chart in the second post, "Performance of string tokenisation: String.split() and StringTokenizer compared", where StringTokenizer is notably faster.

So my question is: should I avoid a class because it is legacy (as is officially stated), or should I go for it anyway? I must admit that efficiency is crucial in my project. Shouldn't String.split be at least comparably fast?

Is there any other fast string split alternative?
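For reference, here is a minimal sketch of the two standard-library approaches being compared. The class and method names (`SplitComparison`, `viaSplit`, `viaTokenizer`) and the sample line are hypothetical and only for illustration. It also shows one behavioural difference worth knowing before comparing speed: String.split treats the delimiter as a regex and keeps interior empty tokens, while StringTokenizer treats it as a set of characters and skips empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitComparison {

    // String.split: the delimiter is a regex; interior empty tokens are kept,
    // trailing empty tokens are dropped.
    static List<String> viaSplit(String line) {
        return Arrays.asList(line.split(","));
    }

    // StringTokenizer: the delimiter is a set of characters; empty tokens are skipped.
    static List<String> viaTokenizer(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "one,two,,three"; // hypothetical sample input
        System.out.println(viaSplit(line));      // [one, two, , three]
        System.out.println(viaTokenizer(line));  // [one, two, three]
    }
}
```

So the two are only interchangeable when empty fields do not matter or cannot occur in the data.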

Eypros
  • StringTokenizer is a legacy class, meaning it's only there for backward compatibility, nothing else. It is never a good idea to use legacy code. If you want a decent car, do you buy the latest updated model of Ferrari, with all the extras in it, or do you buy an old one, because "it might run a bit faster"? The new one will still be fixable in the shop when you bring it in, while with the old one, if something breaks, you might very well be on your own. – Stultuske May 29 '14 at 09:43
  • When you say efficiency is crucial, are you handling a very large amount of data or smaller amounts of data that you need to parse very quickly? – glenatron May 29 '14 at 09:51
  • I need to handle a relatively small amount of data (usually 64 tokens) many times (tens of millions of times) – Eypros May 29 '14 at 09:57

1 Answer

Efficient and more feature-rich string splitting methods are available in the Google Guava library.

Guava's split method

Example:

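import com.google.common.base.Splitter;

// Split on commas, trim whitespace around each token, and drop empty tokens.
Iterable<String> splitted = Splitter.on(',')
    .omitEmptyStrings()
    .trimResults()
    .split("one,two,,   ,three");

for (String text : splitted) {
  System.out.println(text);
}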

Output:

one
two
three

Ashok_Pradhan
  • Thanks, I will check it – Eypros May 29 '14 at 09:59
  • I confirm that it's actually faster. Over 10 runs on a text file of ~250,000 lines, the two implementations differ substantially: the Guava code took only 0.5249 of the time Java's `String.split` needed. – Eypros May 30 '14 at 07:27
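
A rough way to reproduce this kind of comparison on the workload described in the comments (lines of about 64 tokens, split many times) could look like the sketch below. It uses only the standard library, so it times String.split against StringTokenizer; a Guava Splitter could be plugged into the same harness. The class name `SplitTiming`, its helper methods, the synthetic line, and the repetition count are all assumptions for illustration, and a plain `System.nanoTime` loop is only indicative; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.StringTokenizer;
import java.util.function.ToIntFunction;

public class SplitTiming {

    // Builds a hypothetical comma-separated line with the given number of tokens.
    static String makeLine(int tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens; i++) {
            if (i > 0) sb.append(',');
            sb.append("tok").append(i);
        }
        return sb.toString();
    }

    // Splits with String.split and returns the number of tokens.
    static int countWithSplit(String line) {
        return line.split(",").length;
    }

    // Splits with StringTokenizer, extracting every token, and returns the count.
    static int countWithTokenizer(String line) {
        StringTokenizer st = new StringTokenizer(line, ",");
        int count = 0;
        while (st.hasMoreTokens()) {
            st.nextToken();
            count++;
        }
        return count;
    }

    // Runs the splitter `repetitions` times and returns the elapsed nanoseconds.
    // The token total is accumulated and checked so the work cannot be discarded.
    static long time(ToIntFunction<String> splitter, String line, int repetitions) {
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) {
            total += splitter.applyAsInt(line);
        }
        long elapsed = System.nanoTime() - start;
        if (total != (long) repetitions * splitter.applyAsInt(line)) {
            throw new IllegalStateException("inconsistent token count");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String line = makeLine(64);
        int reps = 1_000_000; // assumed repetition count; adjust to the real workload

        // Warm-up pass so the JIT has compiled both paths before measuring.
        time(SplitTiming::countWithSplit, line, reps);
        time(SplitTiming::countWithTokenizer, line, reps);

        long split = time(SplitTiming::countWithSplit, line, reps);
        long tokenizer = time(SplitTiming::countWithTokenizer, line, reps);
        System.out.printf("String.split:    %d ms%n", split / 1_000_000);
        System.out.printf("StringTokenizer: %d ms%n", tokenizer / 1_000_000);
    }
}
```

The absolute numbers depend on the JVM version, the delimiter, and the data, so only the ratio on your own input is meaningful.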