
In my software I need to split a string into words. I currently have more than 19,000,000 documents, with more than 30 words each.

Which of the following two ways is the best way to do this (in terms of performance)?

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken();
    // ...
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i];
    // ...
}
cristianoms
JohnJohnGa

10 Answers


If your data is already in a database and you need to parse the string of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.

However, getting the data from a database is still likely to be much more expensive.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Build a sample of 60 space-separated six-digit words.
StringBuilder sb = new StringBuilder();
for (int i = 100000; i < 100000 + 60; i++)
    sb.append(i).append(' ');
String sample = sb.toString();

int runs = 100000;
for (int i = 0; i < 5; i++) {
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(sample);
            List<String> list = new ArrayList<String>();
            while (st.hasMoreTokens())
                list.add(st.nextToken());
        }
        long time = System.nanoTime() - start;
        System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        Pattern spacePattern = Pattern.compile(" ");
        for (int r = 0; r < runs; r++) {
            List<String> list = Arrays.asList(spacePattern.split(sample, 0));
        }
        long time = System.nanoTime() - start;
        System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
    }
    {
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            List<String> list = new ArrayList<String>();
            int pos = 0, end;
            while ((end = sample.indexOf(' ', pos)) >= 0) {
                list.add(sample.substring(pos, end));
                pos = end + 1;
            }
            // Keep any text after the last separator (a no-op here, as the sample ends with a space).
            if (pos < sample.length())
                list.add(sample.substring(pos));
        }
        long time = System.nanoTime() - start;
        System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
    }
}

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
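
The arithmetic behind that estimate can be checked directly. A quick sketch (the 1 GB per 2 seconds throughput figure is the answer's own assumption, not a measured value):

```java
public class ParseEstimate {
    public static void main(String[] args) {
        long docs = 19_000_000L;         // documents
        long wordsPerDoc = 30;           // words per document
        long bytesPerWord = 8;           // letters (bytes) per word
        double bytesPerSecond = 1e9 / 2; // assumed throughput: 1 GB per 2 seconds

        long totalBytes = docs * wordsPerDoc * bytesPerWord; // ~4.6 GB of text
        double seconds = totalBytes / bytesPerSecond;        // ~9 seconds
        System.out.printf("%d bytes, ~%.0f seconds%n", totalBytes, seconds);
    }
}
```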

If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/

Peter Lawrey
  • I already have hbase, I am doing some operation on my data and store it inside hbase: document = row in hbase – JohnJohnGa May 11 '11 at 14:34
  • Interesting, I ran your code and `split` is consistently taking about twice as long on my machine as `StringTokenizer`. `indexOf` takes half as long. – Bill the Lizard May 11 '11 at 15:06
  • The version of the JDK might make a difference (as well as the type of CPU). I have Java 6 update 25. – Peter Lawrey May 11 '11 at 15:09
  • You should try the timings yourself, as it has been noted, your mileage may vary. ;) – Peter Lawrey May 11 '11 at 15:45
  • @Peter BTW: do you have an idea why we've got these results? What's behind it? – JohnJohnGa May 11 '11 at 16:07
  • Scanner and StringTokenizer use Pattern/regex, which is more flexible but not as efficient as just looking for a specific character. – Peter Lawrey May 11 '11 at 18:13
  • @Peter Lawrey: StringTokenizer does not use regex. – user207421 May 12 '11 at 10:58
  • @EJP, a bad assumption on my part; I assumed they worked similarly because they performed similarly for me. Reading the StringTokenizer code, it's not obvious to me why it would be slower than Scanner for me on two different machines. I note @Bill found StringTokenizer faster than Scanner. – Peter Lawrey May 12 '11 at 12:52
  • String.split offers similar performance to indexOf in java 7. – tjjjohnson Feb 17 '14 at 04:14
  • @tjjjohnson Java 7 split does an operation similar to a series of indexOf calls, but only for limited, though very common, operations. – Peter Lawrey Feb 17 '14 at 12:24
  • Just for the record, your implementation of the indexOf loop is incorrect: it misses the part after the last separator. Not sure this impacts performance very much, but anyway. – Julien Jan 05 '16 at 21:44
  • @PeterLawrey Do you think this is a reliable way to measure the performance? Processors and compilers usually tend to reorder execution, which may affect the assignment of long start or long time, right? – TriCore Jan 06 '16 at 04:41
  • @TriCore In theory yes; in practice I haven't seen the JIT reorganise around a loop, i.e. move something from before a loop to after it or vice versa. This was written 5 years ago and I would use JMH today. – Peter Lawrey Jan 06 '16 at 05:55
  • Pretty old post, and not sure if the same experiment was tried with indexOf moved before split. – Optional May 26 '16 at 07:21
  • @Optional If I did it again today I would use JMH; however, the test is repeated 5x, so there is a test for indexOf both before and after using split. – Peter Lawrey May 26 '16 at 14:54
  • Oh yes, thanks. Good to see you still active even 5 years after your post, @PeterLawrey. – Optional May 27 '16 at 04:59
  • @tjjjohnson: from Java 8 JRE runtime library String.split(): /* fastpath if the regex is a (1)one-char String and this character is not one of the RegEx's meta characters ".$|()[{^?*+\\", or (2)two-char String and the first char is the backslash and the second is not the ascii digit or ascii letter. */ And what is JMH? – beemaster Jul 20 '16 at 13:26
  • @beemaster Java Microbenchmark Harness http://openjdk.java.net/projects/code-tools/jmh/ – Peter Lawrey Jul 20 '16 at 14:17
  • The Pattern.split(sample, 0) test is incorrectly using a limit=0 argument which could make a major difference! See JavaDoc: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#split-java.lang.CharSequence-int- – ballzak Nov 28 '17 at 04:25

Split in Java 7 just calls indexOf for this input, see the source. Split should be very fast, close to repeated calls of indexOf.
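
For reference, the fast-path condition described in that source can be sketched like this (a simplified paraphrase of the OpenJDK check, not the actual JDK code):

```java
public class FastPathCheck {
    // True if String.split(regex) in OpenJDK 7+ can skip the regex engine:
    // either a single character that is not a regex metacharacter, or a
    // backslash escape of a character that is not an ASCII letter or digit.
    static boolean takesFastPath(String regex) {
        char ch;
        return (regex.length() == 1
                    && ".$|()[{^?*+\\".indexOf(regex.charAt(0)) == -1)
               || (regex.length() == 2
                    && regex.charAt(0) == '\\'
                    && (((ch = regex.charAt(1)) - '0') | ('9' - ch)) < 0 // not an ASCII digit
                    && ((ch - 'a') | ('z' - ch)) < 0                     // not a lowercase letter
                    && ((ch - 'A') | ('Z' - ch)) < 0);                   // not an uppercase letter
    }

    public static void main(String[] args) {
        System.out.println(takesFastPath(" "));   // true: plain space
        System.out.println(takesFastPath("."));   // false: regex metacharacter
        System.out.println(takesFastPath("\\.")); // true: escaped metacharacter
        System.out.println(takesFastPath("\\d")); // false: character class
    }
}
```

Everything else falls back to `Pattern.compile(regex).split(this, limit)`.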

nes1983
  • Are you sure? I can see in line 2361 under the link you provided: `return Pattern.compile(regex).split(this, limit);` – Krzysztof Wolny Oct 21 '14 at 18:58
  • The implementation is at line 1770. – nes1983 Nov 15 '14 at 23:56
  • The implementation will use (i.e. `indexOf`) if the regex satisfies certain criteria, and will use `Pattern.compile(regex).split(this, limit);` otherwise. From the source: `fastpath if the regex is a (1)one-char String and this character is not one of the RegEx's meta characters ".$|()[{^?*+\\", or (2)two-char String and the first char is the backslash and the second is not the ascii digit or ascii letter.` But as pointed out elsewhere, this is an implementation detail and as such should not be relied upon. – hendalst Mar 30 '16 at 04:03

The Java API specification recommends using split. See the documentation of StringTokenizer.

Gilles 'SO- stop being evil'
developer
  • @downvoters: Please be clear about the question above: do you want the better of tokenize vs split, or are you looking for the best approach regardless of tokenize vs split? – developer May 11 '11 at 14:33
  • The question is pretty clear that he's looking for the best way to do this in terms of performance. The API recommends split, but doesn't mention that (according to everything else I'm finding through Google) Tokenize performs better. – Bill the Lizard May 11 '11 at 14:37
  • @Bill, sorry, my mistake; then there might be a change in the title of the question. – developer May 11 '11 at 14:43

Another important thing, undocumented as far as I've noticed, is that asking the StringTokenizer to return the delimiters along with the tokenized string (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

private static final String DELIM = "#";

public void splitIt(String input) {
    StringTokenizer st = new StringTokenizer(input, DELIM, true);
    while (st.hasMoreTokens()) {
        String next = getNext(st);
        System.out.println(next);
    }
}

// Returns the next field, or null for an empty field between two delimiters,
// consuming the delimiter that follows a non-empty field.
private String getNext(StringTokenizer st) {
    String value = st.nextToken();
    if (DELIM.equals(value))
        value = null;
    else if (st.hasMoreTokens())
        st.nextToken();
    return value;
}

Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
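
For illustration, here is a self-contained sketch of the same scheme (my own wrapper class, not part of the answer above) that collects the fields into a list instead of printing them; note that an empty field between two consecutive delimiters comes back as null:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimSplitDemo {
    private static final String DELIM = "#";

    // Same getNext() logic as above, inlined into a loop that collects fields.
    static List<String> split(String input) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(input, DELIM, true);
        while (st.hasMoreTokens()) {
            String value = st.nextToken();
            if (DELIM.equals(value)) {
                out.add(null);            // empty field between two delimiters
            } else {
                out.add(value);
                if (st.hasMoreTokens())
                    st.nextToken();       // swallow the delimiter after this field
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a#b#c")); // [a, b, c]
        System.out.println(split("a##b"));  // [a, null, b]
    }
}
```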

cristianoms

Use split.

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.

Basanth Roy

While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimizations and garbage collection, to name just a few.

In order to get meaningful results out of micro benchmarks, check out the JMH library. It comes bundled with excellent samples of how to run good benchmarks.

Ivo

What do the 19,000,000 documents have to do with it? Do you have to split words in all the documents on a regular basis, or is it a one-shot problem?

If you display/request one document at a time, with only 30 words, this is such a tiny problem that any method would work.

If you have to process all the documents at once, with only 30 words each, this is such a tiny problem that you are more likely to be IO bound anyway.

Nicolas Bousquet

Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In fact String.split() has to compile the regex every time you call it, so it isn't even as efficient as using a regular expression directly yourself.
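
If you do stay with regex-based splitting, compiling the pattern once and reusing it avoids that repeated compilation cost. A minimal sketch (though, as noted in other answers, since Java 7 single-character splits like " " bypass the regex engine anyway):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compile the delimiter pattern once instead of on every call to String.split().
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        String s = "split string into words";
        String[] words = SPACE.split(s);
        System.out.println(Arrays.toString(words)); // [split, string, into, words]
    }
}
```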

user207421

This could be a reasonable benchmark, using JDK 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8
chiperortiz

Performance-wise, StringTokenizer is way better than split. Check the code below.

(The code was posted as a screenshot and is not reproduced here.)

But according to the Java docs, its use is discouraged. Check here.

Code_Mode
  • You are only instantiating it, but after that you have to get all the tokens, which takes time (of course it will be less than split's). – ram914 May 26 '18 at 22:22