160

I just learned about Java's Scanner class, and now I'm wondering how it compares with StringTokenizer and String.split(). I know that StringTokenizer and String.split() only work on Strings, so why would I want to use Scanner for a String? Is Scanner just intended to be one-stop shopping for splitting?

skaffman
Dave

10 Answers

244

They're essentially horses for courses.

  • Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.
  • String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.
  • StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by fixed substrings. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.
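As a rough illustration of the first two bullets, here is a minimal sketch contrasting the two styles. The class and method names (`ScannerVsSplitSketch`, `splitOnCommas`, `parseRecord`) and the sample inputs are made up for this example; only `String.split()` and the `Scanner` methods are real API.

```java
import java.util.Arrays;
import java.util.Scanner;

class ScannerVsSplitSketch {

    // String.split(): one call, one String[] back. Every token is still a String.
    static String[] splitOnCommas(String line) {
        return line.split(",");
    }

    // Scanner: pull out tokens of different types, one at a time.
    // Returns a formatted summary so the parsed values are easy to inspect.
    static String parseRecord(String record) {
        try (Scanner sc = new Scanner(record)) { // default delimiter is whitespace
            String name = sc.next();
            int quantity = sc.nextInt();
            boolean inStock = sc.nextBoolean();
            return name + " x" + quantity + " inStock=" + inStock;
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("a,b,c")));
        System.out.println(parseRecord("widget 12 true"));
    }
}
```

With split() you would get three strings back and convert the second and third yourself; Scanner does the conversion as it reads.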

You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().

Dave Jarvis
Neil Coffey
  • 8
    Would also be interesting to see Scanner's results on the same tests you ran on String.Split and StringTokenizer. – Dave Mar 27 '09 at 20:09
  • 2
    Gave me an answer to another question: "why is use of StringTokenizer discouraged, as stated in the Java API notes?". From this text it seems that the answer would be "because String.split() is fast enough". – Legs May 01 '11 at 06:49
  • 2
    So is StringTokenizer pretty much deprecated now? – Steve the Maker Mar 05 '12 at 22:11
  • what to use instead of it? Scanner? – Adrian Jul 29 '14 at 18:06
  • As I mention above, for most purposes you can use String.split() in place of StringTokenizer. – Neil Coffey Aug 02 '14 at 23:00
  • StringTokenizer is considered deprecated, but I still use it from time to time for parsing simply because it's the easiest for the types of parsing I do. If they do eventually drop it completely, I'll have to go back and rewrite the code, but it's been deprecated for eons, doesn't seem to be going away. :-) – Brian Knoblauch Dec 01 '15 at 20:27
  • 4
    I realize it's an answer to an old question, but if I need to split a huge text stream into tokens on the fly, isn't `StringTokenizer` still my best bet because `String.split()` will simply run out of memory? – Sergei Tachenov Jan 26 '16 at 08:30
  • I'm not sure I quite understand: both StringTokenizer and String.split() will require the entire sequence to be in memory. For splitting on the fly, if you're just splitting on a particular character, it's probably as easy as anything to just "hand-crank" things. For more complex splitting criteria, Pattern.split() can take an arbitrary CharSequence. – Neil Coffey Jan 27 '16 at 19:58
  • Sorry for the late reply. I thought `StringTokenizer` accepts an input stream. Must have been thinking of `Scanner`. Still, I can think of one good use of `StringTokenizer` vs `split`: if you pass `returnDelims = true`, you get delimiters, which you can't do with `split`. – Sergei Tachenov Mar 13 '16 at 06:39
  • Split is not slower in all cases, this can be seen when the delimiter passed to split is not parsed as a regex. – Luke Jan 20 '20 at 21:02
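The `returnDelims` flag mentioned in the comments above can be sketched as follows. The three-argument `StringTokenizer` constructor is real; the class name, method name and sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ReturnDelimsSketch {

    // With returnDelims = true, each delimiter character is returned
    // as its own one-character token, in between the ordinary tokens.
    static List<String> tokensWithDelims(String input, String delims) {
        StringTokenizer st = new StringTokenizer(input, delims, true);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokensWithDelims("a+b-c", "+-")); // [a, +, b, -, c]
    }
}
```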
58

Let's start by eliminating StringTokenizer. It is getting old and doesn't even support regular expressions. Its documentation states:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

So let's throw it out right away. That leaves split() and Scanner. What's the difference between them?

For one thing, split() simply returns an array, which makes it easy to use a foreach loop:

for (String token : input.split("\\s+")) { ... }

Scanner is built more like a stream:

while (myScanner.hasNext()) {
    String token = myScanner.next();
    ...
}

or

while (myScanner.hasNextDouble()) {
    double token = myScanner.nextDouble();
    ...
}

(It has a rather large API, so don't think that it's always restricted to such simple things.)

This stream-style interface can be useful for parsing simple text files or console input, when you don't have (or can't get) all the input before starting to parse.
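A minimal sketch of that stream-style use, assuming the input arrives through some `Readable` such as a file or console reader. A `StringReader` stands in for the real source here, and the class and method names are invented for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

class StreamingScannerSketch {

    // Sums integer tokens as they are read. Stops at the first token
    // that is not an integer, or at the end of the input.
    static long sumInts(Readable source) {
        long sum = 0;
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNextInt()) {
                sum += sc.nextInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Stand-in for a FileReader or an InputStreamReader over System.in.
        Reader in = new StringReader("1 2 3\n4 5\n");
        System.out.println(sumInts(in)); // 15
    }
}
```

The loop never needs the whole input as one String, which is the practical difference from split().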

Personally, the only time I can remember using Scanner is for school projects, when I had to get user input from the command line. It makes that sort of operation easy. But if I have a String that I want to split up, it's almost a no-brainer to go with split().

Michael Myers
  • 23
    StringTokenizer is 2x as fast as String.split(). If you don't NEED to use regular expressions, DON'T! – Alex Worden Jan 08 '13 at 19:17
  • I just used `Scanner` to detect new line characters in a given `String`. Since new line characters can vary from platform to platform (look at `Pattern`'s javadoc!) **and** input string is NOT guaranteed to conform to `System.lineSeparator()`, I find `Scanner` more suitable as it already knows what new line characters to look for when calling `nextLine()`. For `String.split` I will have to feed in the correct regex pattern to detect line separators, which I don't find stored in any standard location (the best I can do is copy it from the `Scanner` class' source). – ADTC Aug 16 '13 at 03:07
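The line-splitting case from the last comment might look like the sketch below. It assumes Java 8 or later, where the regex construct `\R` (any line break sequence) is available as an alternative to copying a pattern out of the `Scanner` source. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class LineSplitSketch {

    // Scanner.nextLine() recognizes \r\n, \n and \r (among others) by itself.
    static List<String> linesViaScanner(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNextLine()) {
                lines.add(sc.nextLine());
            }
        }
        return lines;
    }

    // Assumes Java 8+: \R matches any line break, treating \r\n as one unit.
    static List<String> linesViaSplit(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\nc\rd";
        System.out.println(linesViaScanner(mixed)); // [a, b, c, d]
        System.out.println(linesViaSplit(mixed));   // [a, b, c, d]
    }
}
```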
9

StringTokenizer was always there. It is the fastest of all, but the enumeration-like idiom might not look as elegant as the others.

split came into existence in JDK 1.4. It is slower than the tokenizer but easier to use, since it is callable directly on the String class.

Scanner arrived in JDK 1.5. It is the most flexible and fills a long-standing gap in the Java API by providing an equivalent of C's famous scanf family of functions.

H Marcelo Morales
6

Split is slow, but not as slow as Scanner. StringTokenizer is faster than split. However, I found that I could get roughly double the speed by trading away some flexibility, which I did in JFastParser: https://github.com/hughperkins/jfastparser

Testing on a string containing one million doubles:

Scanner: 10642 ms
Split: 715 ms
StringTokenizer: 544 ms
JFastParser: 290 ms
Hugh Perkins
  • Some Javadoc would have been nice, and what if you want to parse something other than numeric data? – NickJ Apr 09 '13 at 10:24
  • Well, it's designed for speed, not beauty. It's quite simple, just a few lines, so you could add a few more options for text parsing if you want. – Hugh Perkins Apr 15 '13 at 03:11
6

If you have a String object you want to tokenize, favor using String's split method over a StringTokenizer. If you're parsing text data from a source outside your program, like from a file, or from the user, that's where a Scanner comes in handy.

Bill the Lizard
4

String.split seems to be much slower than StringTokenizer. The only advantages of split are that you get an array of the tokens and that you can use any regular expression as the delimiter. org.apache.commons.lang.StringUtils has a split method that is much faster than either StringTokenizer or String.split. But the CPU utilization for all three is nearly the same, so we also need a method that is less CPU intensive, which I have not yet been able to find.

Manish
  • 3
    This answer is slightly nonsensical. You say you are looking for something which is faster but "less CPU intensive". Any program is executed by the CPU. If a program does not utilize your CPU 100%, then it must be waiting for something else, like I/O. That should not ever be an issue when discussing string tokenization, unless you're doing direct disc access (which we notably are not doing here). – Jolta Jan 04 '13 at 15:36
4

I recently ran some experiments on the poor performance of String.split() in highly performance-sensitive situations. You may find this useful.

http://eblog.chrononsystems.com/hidden-evils-of-javas-stringsplit-and-stringr

The gist is that String.split() compiles a Regular Expression pattern each time and can thus slow down your program, compared to if you use a precompiled Pattern object and use it directly to operate on a String.
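The two approaches can be sketched side by side as follows. This is only an illustration of the idea, not the code from the linked post; the class, method and constant names are hypothetical, and no timing claims are implied.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PrecompiledSplitSketch {

    // Compiled once and reused for every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Reuses the precompiled Pattern.
    static String[] splitPrecompiled(String s) {
        return WHITESPACE.split(s);
    }

    // Hands the regex to String.split() as a string on every call.
    static String[] splitEachTime(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPrecompiled("a  b\tc"))); // [a, b, c]
        System.out.println(Arrays.toString(splitEachTime("a  b\tc")));    // [a, b, c]
    }
}
```

Both methods return the same tokens; the difference is only in where the pattern compilation happens.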

pdeva
  • 4
    Actually, String.split() doesn't always compile the pattern. If you look at the Java 1.7 source, you will see a check for whether the pattern is a single character that is not an escaped one; in that case it splits the string without a regexp, so it should be quite fast. – Krzysztof Krasoń Nov 07 '12 at 21:02
1

For the default scenarios I would suggest Pattern.split() as well, but if you need maximum performance (especially on Android, where all the solutions I tested are quite slow) and you only need to split by a single char, I now use my own method:

public static ArrayList<String> splitBySingleChar(final char[] s,
        final char splitChar) {
    final ArrayList<String> result = new ArrayList<String>();
    final int length = s.length;
    int offset = 0;
    int count = 0;
    for (int i = 0; i < length; i++) {
        if (s[i] == splitChar) {
            if (count > 0) {
                result.add(new String(s, offset, count));
            }
            offset = i + 1;
            count = 0;
        } else {
            count++;
        }
    }
    if (count > 0) {
        result.add(new String(s, offset, count));
    }
    return result;
}

Use "abc".toCharArray() to get the char array for a String. For example:

String s = "     a bb   ccc  dddd eeeee  ffffff    ggggggg ";
ArrayList<String> result = splitBySingleChar(s.toCharArray(), ' ');
Simon
1

One important difference is that both String.split() and Scanner can produce empty strings, but StringTokenizer never does.

For example:

String str = "ab cd  ef";

StringTokenizer st = new StringTokenizer(str, " ");
for (int i = 0; st.hasMoreTokens(); i++) System.out.println("#" + i + ": " + st.nextToken());

String[] split = str.split(" ");
for (int i = 0; i < split.length; i++) System.out.println("#" + i + ": " + split[i]);

Scanner sc = new Scanner(str).useDelimiter(" ");
for (int i = 0; sc.hasNext(); i++) System.out.println("#" + i + ": " + sc.next());

Output:

//StringTokenizer
#0: ab
#1: cd
#2: ef
//String.split()
#0: ab
#1: cd
#2: 
#3: ef
//Scanner
#0: ab
#1: cd
#2: 
#3: ef

This is because the delimiter for String.split() and Scanner.useDelimiter() is not just a string, but a regular expression. We can replace the delimiter " " with " +" in the example above to make them behave like StringTokenizer.
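A minimal sketch of that replacement, reusing the same sample string. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

class CollapseDelimitersSketch {

    static List<String> viaTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // " +" treats a run of one or more spaces as a single delimiter.
    static List<String> viaSplit(String s) {
        return Arrays.asList(s.split(" +"));
    }

    static List<String> viaScanner(String s) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(s).useDelimiter(" +")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "ab cd  ef";
        System.out.println(viaTokenizer(str)); // [ab, cd, ef]
        System.out.println(viaSplit(str));     // [ab, cd, ef]
        System.out.println(viaScanner(str));   // [ab, cd, ef]
    }
}
```

One caveat: the match is not exact for every input. If the string starts with a delimiter, split() still returns a leading empty string (" ab".split(" +") gives ["", "ab"]), whereas StringTokenizer does not.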

John29
-5

String.split() works very well but has its limits: for example, if you want to split a string like the one shown below on a single or double pipe (|) symbol, it doesn't work. In this situation you can use StringTokenizer.

ABC|IJK

  • 13
    Actually, you can split your example with just "ABC|IJK".split("\\|"); – Tomo Feb 22 '13 at 22:34
  • "ABC||DEF||".split("\\|") does not really work though because it will ignore the trailing two empty values, which makes parsing more complicated than it should be. – Armand Aug 04 '14 at 18:28
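The trailing-empty-string behaviour described in the last comment comes from the default limit of 0. A minimal sketch, assuming you want to keep the empty values: the two-argument `String.split(regex, limit)` overload with a negative limit preserves them. The class and method names are hypothetical.

```java
import java.util.Arrays;

class TrailingEmptySketch {

    // Default limit (0): trailing empty strings are removed from the result.
    static String[] splitDropTrailing(String s) {
        return s.split("\\|");
    }

    // Negative limit: the pattern is applied as often as possible
    // and trailing empty strings are kept.
    static String[] splitKeepTrailing(String s) {
        return s.split("\\|", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitDropTrailing("ABC||DEF||"))); // [ABC, , DEF]
        System.out.println(Arrays.toString(splitKeepTrailing("ABC||DEF||"))); // [ABC, , DEF, , ]
    }
}
```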