Grab word and character counts in string

Question

I'm trying to write a super-efficient method that operates in two "modes" (WORD and CHARACTER) that accepts a String and tells me the number of words (separated by 1+ whitespaces) or characters (non-whitespace characters) in it:

public int getCount(String toExamine, boolean wordMode) {
    int count = 0;

    if(wordMode) {
        // Return the number of words.
    }
    else {
        // Return the number of characters.
    }

    return count;
}

I know I could accomplish the WORD mode version using a StringTokenizer:

StringTokenizer tokenizer = new StringTokenizer(toExamine, " ");

But I have absolutely no clue as to what to use for the CHARACTER mode (the number of non-whitespace characters). I'm sure I could use something crude like:

for(int i = 0; i < toExamine.length(); i++)
    if(!Character.isWhitespace(toExamine.charAt(i)))
        count++;

But that is sort of ugly and might not be the most efficient way of doing this (same for the StringTokenizer piece). Could a regex be used here, or some other Java String/Character madness that would get me what I need in super-efficient fashion? I'm working on tens of millions of Strings here. Thanks in advance.

Why do you think that StringTokenizer is so [efficient](http://stackoverflow.com/questions/5965767/performance-of-stringtokenizer-class-vs-split-method-in-java)? — supersam654, Feb 22 '13 at 16:26
I think your method for characters is good enough, if you are not running it on too much data at once. (I am not sure whether regex will be faster in the case of a lot of data, though). Note that the loop and the `StringTokenizer` method do **not** do exactly the same thing. — nhahtdh, Feb 22 '13 at 16:27
@supersam654: `indexOf` is not the same as `StringTokenizer` and is not as extensible, if we use the default setting (space, tab, etc.). — nhahtdh, Feb 22 '13 at 18:18

score 0 · Accepted Answer · answered Feb 22 '13 at 16:28

Iterate over the string with a for loop and count the non-whitespace characters:

int charCount =0;
for(int i=0; i<sentence.length(); i++) {
    if(!Character.isWhitespace(sentence.charAt(i))) {
        charCount++;
    }
}

Alternatively, remove all whitespace and take the length of the result with the code below:

int charcount = 0;
String newSentence =sentence.replaceAll("\\s+", "");
charcount = newSentence.length();

Answered by Rais Alam

The second option removes all whitespace to form a single unbroken string and then takes the length of that string. – Rais Alam Feb 22 '13 at 16:34
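One thing worth checking before swapping one option for the other: the two do not necessarily agree on non-ASCII input. By default, Java's `\s` matches only `[ \t\n\x0B\f\r]`, while `Character.isWhitespace` also accepts Unicode space separators such as U+2003 (EM SPACE). The sketch below is only an illustration; the class and method names are made up, and the `UNICODE_CHARACTER_CLASS` variant is my own addition rather than part of the answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names are illustrative and not from the answer above.
final class RegexWhitespaceDemo {

    // Pre-compiled with UNICODE_CHARACTER_CLASS so that \s also matches Unicode whitespace.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // First option: count the characters for which isWhitespace is false.
    static int countWithLoop(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Second option: strip whitespace using the default (ASCII-only) meaning of \s.
    static int countWithReplaceAll(String s) {
        return s.replaceAll("\\s+", "").length();
    }

    // Variant of the second option in which \s follows the Unicode White_Space property.
    static int countWithUnicodeRegex(String s) {
        return UNICODE_WS.matcher(s).replaceAll("").length();
    }

    public static void main(String[] args) {
        String ascii = "a b\tc";
        String emSpace = "a\u2003b"; // U+2003 EM SPACE between the two letters
        System.out.println(countWithLoop(ascii) + " " + countWithReplaceAll(ascii));
        System.out.println(countWithLoop(emSpace) + " " + countWithReplaceAll(emSpace)
                + " " + countWithUnicodeRegex(emSpace));
    }
}
```

For plain ASCII text the options give the same count, so this only matters if the input may contain other kinds of whitespace.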

score 0 · Answer 2 · answered Feb 22 '13 at 16:32

This is not faster than the for loop, but if you need to use regular expressions you can try something like:

int noSpaces=toExamine.split("\\s+").length-1;

The number of characters will be:

int noChar=toExamine.length()-noSpaces;

Answered by dan

Your `noSpaces` will fail with `"noSpaceString"`, if you want to count number of words. I cannot make sense of what you are trying to do with your code. – nhahtdh Feb 22 '13 at 19:08
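To make the concern in the comment above concrete, here is a small sketch comparing the split-based formula with a direct loop. As far as I can tell, `split("\\s+").length - 1` counts runs of whitespace between tokens rather than individual whitespace characters, so the character count drifts as soon as two whitespace characters are adjacent. The class and method names are hypothetical, and `wordsViaSplit` is just one way a split-based word count could be written, not something taken from the answer.

```java
// Hypothetical demo class; the names are illustrative only.
final class SplitCountDemo {

    // The formula from the answer above, wrapped in a method.
    static int charsViaSplit(String s) {
        int noSpaces = s.split("\\s+").length - 1;
        return s.length() - noSpaces;
    }

    // Reference implementation: count every non-whitespace character directly.
    static int charsViaLoop(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // One possible split-based word count: trim first so that leading whitespace
    // does not produce an empty first token, and treat blank input as zero words.
    static int wordsViaSplit(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String doubled = "a  b"; // two adjacent spaces
        System.out.println(charsViaSplit(doubled) + " vs " + charsViaLoop(doubled));
        System.out.println(wordsViaSplit("noSpaceString"));
    }
}
```

With single spaces between words the two character counts agree, which may be why the formula looks right at first glance.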

score 0 · Answer 3 · answered Feb 22 '13 at 20:07

The test program below produces the following result. The program outputs 5 sets of such results, but I only show one here. The lines starting with // are my annotations, not output of the program.

// Fraction of non-space characters in the string is approximately 0.857
// Length of the full string generated is 1 075 662
0.857 1075662
// Name_of_method (Result): 15_Runs_In_Microseconds | Average_In_Microseconds
countWords_1 (131489): 20465 20240 21045 20193 20000 19972 20551 39489 19859 19971 19889 19877 20049 19900 19949 | 21429
countWords_2 (131489): 255500 258723 254543 255956 253606 263549 254096 254402 254191 254296 253752 261501 260788 261574 254178 | 256710
countWords_3 (131489): 26225 25022 24830 24829 24545 24819 25459 24625 25628 24700 24936 24794 24794 24849 25026 | 25005
countWords_4 (131489): 24537 24169 25283 24862 23863 23902 24068 23906 51472 23731 23889 23844 23832 24275 23896 | 25968
countWords_5 (131489): 81087 112095 80008 81290 81472 80581 80717 80460 79870 80557 80694 80923 145686 80564 80849 | 87123
countWords_6 (131489): 114391 114146 111946 111873 112331 167207 134117 118217 112843 112804 113533 111834 112830 112392 118181 | 118576
countChars_1 (922546): 150507 109102 150453 111352 149753 108099 153842 109034 150817 117258 149219 108194 152839 110340 149524 | 132022
countChars_2 (922546): 28779 29473 52499 27182 26519 27743 26717 27161 26451 27060 26307 27309 26350 62824 33134 | 31700
countChars_3 (922546): 25408 25127 24980 24832 24624 24671 24848 24712 24634 24622 24607 24613 24661 24765 24883 | 24799
countChars_4 (922546): 81489 82246 80906 80718 80803 81147 81113 81798 81030 81024 108508 80768 80780 80671 80753 | 82916
countChars_5 (922546): 26086 25546 24846 43734 25016 25083 24894 25530 25031 25041 25114 24935 25358 24895 43498 | 27640
countChars_6 (922546): 102559 102257 101381 101589 103432 101739 102794 129472 101305 101834 103124 101486 101254 102874 101481 | 103905

countWords_2 and countWords_6 are one-liners that rely on regex tricks with replaceAll, and they are very slow compared to the other methods. countWords_5 uses a pre-compiled Pattern for matching; it is faster than the replaceAll one-liners but still slower than the rest.

countWords_3 and countWords_4 are simple loops with minor differences between them. The timings don't show a conclusive difference. (I look for consistency in which timing is bigger or smaller, and the difference in timing should be at least around 5 ms.)

countWords_1 uses StringTokenizer with the default delimiters, which don't include Unicode whitespace characters. It therefore doesn't make a good comparison here, since its semantics are different from those of the other methods.
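A minimal sketch of that semantic difference, for anyone who wants to see it in isolation. The default delimiter set of StringTokenizer is `" \t\n\r\f"`, whereas `Character.isWhitespace` also returns true for characters such as U+000B (vertical tab) and U+2003 (EM SPACE). The class name and the test strings are my own choices for illustration; the loop mirrors countWords_3.

```java
import java.util.StringTokenizer;

// Hypothetical demo class; the names are illustrative only.
final class TokenizerSemanticsDemo {

    // Same idea as countWords_1: default delimiters are " \t\n\r\f".
    static int wordsViaTokenizer(String s) {
        return new StringTokenizer(s).countTokens();
    }

    // Same idea as countWords_3: a word starts at each transition
    // from whitespace (or the start of the string) to non-whitespace.
    static int wordsViaLoop(String s) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                if (!inWord) {
                    inWord = true;
                    count++;
                }
            } else {
                inWord = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String ascii = "one two\tthree";
        String emSpace = "a\u2003b";     // U+2003 EM SPACE
        String verticalTab = "a\u000Bb"; // U+000B VERTICAL TABULATION
        System.out.println(wordsViaTokenizer(ascii) + " " + wordsViaLoop(ascii));
        System.out.println(wordsViaTokenizer(emSpace) + " " + wordsViaLoop(emSpace));
        System.out.println(wordsViaTokenizer(verticalTab) + " " + wordsViaLoop(verticalTab));
    }
}
```

If the input is known to contain only space, tab, newline, carriage return and form feed as separators, the two should agree and the comparison becomes fair again.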

For counting the number of words (each defined as a maximal sequence of non-whitespace characters), simple looping is faster than any regex method I can think of.

countChars_1 and countChars_6 are one-liner solutions that rely on regex tricks with replaceAll. Again, they are slower than countChars_4, which uses a pre-compiled Pattern. And again, all the regex solutions are slower than simple looping.

countChars_2, countChars_3 and countChars_5 are variations of simple looping. Over the many runs I have observed, the difference between countChars_3 and countChars_5 is not consistent and is therefore inconclusive. countChars_2, however, is usually slightly slower, possibly because new memory has to be allocated for the char[] returned by toCharArray.

I don't guarantee that the methods here are the fastest possible, but they give some idea of how simple looping compares to the regex solutions.
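Putting the looping approach back into the shape the question asked for, a two-mode method could look roughly like the sketch below. It keeps the `getCount(String, boolean)` signature from the question; the enclosing class name and the choice to make the method static are assumptions of mine, and I have not benchmarked this combined version.

```java
// Hypothetical wrapper class; only the getCount signature comes from the question.
final class TextCounter {

    // wordMode == true  : number of maximal runs of non-whitespace characters
    // wordMode == false : number of non-whitespace characters
    static int getCount(String toExamine, boolean wordMode) {
        int count = 0;
        boolean inWord = false;

        for (int i = 0; i < toExamine.length(); i++) {
            if (Character.isWhitespace(toExamine.charAt(i))) {
                inWord = false;
            } else if (wordMode) {
                // Count a word only at its first character.
                if (!inWord) {
                    inWord = true;
                    count++;
                }
            } else {
                count++;
            }
        }

        return count;
    }

    public static void main(String[] args) {
        String sample = "  hello   world ";
        System.out.println(getCount(sample, true));
        System.out.println(getCount(sample, false));
    }
}
```

Both modes make a single pass over the string and allocate nothing, which is the property the timings above seem to reward.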

You can run this test program and decide for yourself. I have written the test so that you can freely:

- Change the length of the generated test string and how frequently space characters appear.

  Currently, the length of the test string is random between 700 000 and 1 300 000 characters, and the non-space to space character ratio varies between 4:1 and 9:1 (my guess for general text). You can set the corresponding FLUCTUATION constant to 0 so that the length or the ratio is fixed, which is very useful when you want to test edge cases.
- Replace how the test string is generated (real data instead of a randomly generated string).

  Currently, a subset of ASCII characters is used: some 64 non-space characters, with space, new line, tab and carriage return as the whitespace characters. Unicode whitespace characters exist, but they are not included in the current test.
- Add a new method to test, marked with the @Test annotation.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

import java.util.Arrays;
import java.util.ArrayList;
import java.util.Random;
import java.util.StringTokenizer;

import java.lang.reflect.Method;

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.annotation.ElementType;

class TestStringProcessing_15028652 {

  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  private @interface Test {};

  // From 0.80 - 0.90 (4:1 to 9:1 non-space:space characters ratio)
  private static final double NON_SPACE_RATIO = 0.85;
  private static final double NON_SPACE_RATIO_FLUCTUATION = 0.05;

  // With the way the test is written, it is not going to work well with small input (1000 is NOT enough)
  // Currently set to 700 000 - 1 300 000 characters
  private static final int NUM_CHARS = 1000000;
  private static final int NUM_CHARS_FLUCTUATION = 300000;

  // Some whitespace characters
  private static final char WHITESPACES[] = {' ', '\t', '\r', '\n'};

  // Number of times to run all methods
  private static final int NUM_OUTER = 5;
  // Number of times to run each method
  private static final int NUM_REPEAT = 15;

  static {
    for (int i = 0; i < WHITESPACES.length; i++) {
      assert(Character.isWhitespace(WHITESPACES[i]));
    }
  }

  private static Random random = new Random();

  private static String generateInput() {

    double nonSpaceRatio = NON_SPACE_RATIO + random.nextDouble() * 2 * NON_SPACE_RATIO_FLUCTUATION - NON_SPACE_RATIO_FLUCTUATION;
    int numChars = NUM_CHARS + random.nextInt(2 * NUM_CHARS_FLUCTUATION) - NUM_CHARS_FLUCTUATION;

    System.out.printf("%.3f %d\n", nonSpaceRatio, numChars);

    StringBuilder output = new StringBuilder();

    for (int i = 0; i < numChars; i++) {
      if (random.nextDouble() < nonSpaceRatio) {
        output.append((char) (random.nextInt(64) + '0'));
      } else {
        output.append(WHITESPACES[random.nextInt(WHITESPACES.length)]);
      }
    }

    return output.toString();
  }

  private static ArrayList<Method> getTestMethods() {
    Class<?> klass = null;
    try {
      klass = Class.forName(Thread.currentThread().getStackTrace()[1].getClassName());
    } catch (Exception e) {
      e.printStackTrace();
      System.err.println("Something really bad happened. Bailing out...");
      System.exit(1);
    }
    Method[] methods = klass.getMethods();
    // System.out.println(klass);
    // System.out.println(Arrays.toString(methods));

    ArrayList<Method> testMethods = new ArrayList<Method>();

    for (Method method: methods) {
        if (method.isAnnotationPresent(Test.class)) {
          testMethods.add(method);
        }
    }

    return testMethods;
  }


  public static void runTestReflection() {
    ArrayList<Method> methods = getTestMethods();

    for (int t = 0; t < NUM_OUTER; t++) {
      String input = generateInput();

      for (Method method: methods) {

        try {
          System.out.print(method.getName() + " (" + method.invoke(null, input) + "): ");
        } catch (Exception e) {
          e.printStackTrace();
        }

        long sum = 0;
        for (int i = 0; i < NUM_REPEAT; i++) {
          long start, end;
          Object result;

          try {
            start = System.nanoTime();
            result = method.invoke(null, input);
            end = System.nanoTime();

            System.out.print((end - start) / 1000 + " ");
            sum += (end - start) / 1000;
          } catch (Exception e) {
            e.printStackTrace();
          }
        }

        System.out.println("| " + sum / NUM_REPEAT);
      }

      System.out.println();
    }
  }

  public static void main(String args[]) {
    runTestReflection();
  }

  @Test
  public static int countWords_1(String input) {
    // WARNING: This is NOT the same as isWhitespace, since isWhitespace
    // also considers Unicode whitespace characters.
    return new StringTokenizer(input).countTokens();
  }

  @Test
  public static int countWords_2(String input) {
    return input.replaceAll("\\S+", "$0 ").length() - input.length();
  }

  @Test
  public static int countWords_3(String input) {
    int count = 0;
    boolean in = false;

    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) {
        if (!in) {
          in = true;
          count++;
        }
      } else {
        in = false;
      }
    }

    return count;
  }

  @Test
  public static int countWords_4(String input) {
    int count = 0;

    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) {
        do {
          i++;
        } while (i < input.length() && !Character.isWhitespace(input.charAt(i)));
        count++;
      }
    }

    return count;
  }

  @Test
  public static int countWords_5(String input) {
    int count = 0;
    Matcher m = p.matcher(input);

    while (m.find()) {
      count++;
    }

    return count;
  }

  @Test
  public static int countWords_6(String input) {
    return input.replaceAll("\\s*+\\S++\\s*+", " ").length();
  }

  @Test
  public static int countChars_1(String input) {
    return input.replaceAll("\\s+", "").length();
  }

  @Test
  public static int countChars_2(String input) {
    int count = 0;
    for (char c: input.toCharArray()) {
      if (!Character.isWhitespace(c)) {
        count++;
      }
    }

    return count;
  }

  @Test
  public static int countChars_3(String input) {
    int count = 0;
    for (int i = 0; i < input.length(); i++) {
      if (!Character.isWhitespace(input.charAt(i))) {
        count++;
      }
    }

    return count;
  }

  private static Pattern p = Pattern.compile("\\S+");

  @Test
  public static int countChars_4(String input) {
    Matcher m = p.matcher(input);
    int count = 0;

    while (m.find()) {
      count += m.end() - m.start();
    }

    return count;
  }

  @Test
  public static int countChars_5(String input) {
    int count = input.length();
    for (int i = 0; i < input.length(); i++) {
      if (Character.isWhitespace(input.charAt(i))) {
        count--;
      }
    }

    return count;
  }

  @Test
  public static int countChars_6(String input) {
    return input.length() - input.replaceAll("\\S+", "").length();
  }
}
