27

String handling in Java is something I'm trying to learn to do well. Currently I want to take in a string and replace any characters I find.

Here is my current inefficient (and kinda silly IMO) function. It was written to just work.

public String convertWord(String word)
{
    return word.toLowerCase().replace('á', 'a')
                             .replace('é', 'e')
                             .replace('í', 'i')
                             .replace('ú', 'u')
                             .replace('ý', 'y')
                             .replace('ð', 'd')
                             .replace('ó', 'o')
                             .replace('ö', 'o')
                             .replaceAll("[-]", "")
                             .replaceAll("[.]", "")
                             .replaceAll("[/]", "")
                             .replaceAll("[æ]", "ae")
                             .replaceAll("[þ]", "th");
}

I ran 1.000.000 runs of it and it took 8182ms. So how should I proceed in changing this function to make it more efficient?

Solution found:

Converting the function to this

public String convertWord(String word)
{
    StringBuilder sb = new StringBuilder();

    char[] charArr = word.toLowerCase().toCharArray();

    for(int i = 0; i < charArr.length; i++)
    {
        // Single character case
        if(charArr[i] == 'á')
        {
            sb.append('a');
        }
        // Char to two characters
        else if(charArr[i] == 'þ')
        {
            sb.append("th");
        }
        // Remove
        else if(charArr[i] == '-')
        {
        }
        // Base case
        else
        {   
            // Append from the lower-cased array, not from the original word
            sb.append(charArr[i]);
        }
    }

    return sb.toString();
}

Running this function 1.000.000 times takes 518ms. So I think that is efficient enough. Thanks for the help guys :)
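For reference, here is a minimal sketch of the same single-pass idea with every mapping from the original replace() chain filled in. The class name WordConverterSketch is made up for illustration, and using Locale.ROOT for the lower-casing is my own assumption (the original calls toLowerCase() with the default locale). The accented letters are written as Unicode escapes so the sketch compiles whatever the source file encoding is.

```java
import java.util.Locale;

final class WordConverterSketch {
    // Hypothetical helper: applies all mappings from the original replace() chain in one pass.
    static String convertWord(String word) {
        String lower = word.toLowerCase(Locale.ROOT); // assumption: locale-independent lower-casing
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            switch (c) {
                case '\u00e1': sb.append('a'); break;   // a with acute
                case '\u00e9': sb.append('e'); break;   // e with acute
                case '\u00ed': sb.append('i'); break;   // i with acute
                case '\u00fa': sb.append('u'); break;   // u with acute
                case '\u00fd': sb.append('y'); break;   // y with acute
                case '\u00f0': sb.append('d'); break;   // eth
                case '\u00f3':                          // o with acute
                case '\u00f6': sb.append('o'); break;   // o with diaeresis
                case '\u00e6': sb.append("ae"); break;  // ae ligature, expands to two chars
                case '\u00fe': sb.append("th"); break;  // thorn, expands to two chars
                case '-':
                case '.':
                case '/': break;                        // removed, nothing appended
                default: sb.append(c);                  // everything else is kept
            }
        }
        return sb.toString();
    }
}
```

The switch keeps one case per mapping, so adding a character means adding one line rather than another pass over the string.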

Ólafur Waage
  • 3
    *Some* of the job is here: http://stackoverflow.com/questions/1008802/converting-symbols-accent-letters-to-english-alphabet . I don't know about `æ` and `þ`. – Kobi Mar 29 '11 at 09:52

9 Answers

20

You could create a String[] lookup table with Character.MAX_VALUE + 1 entries, one for every possible char value, and fold the mapping to lower case into it.

As you add more replacements, the time per character stays the same, because each character costs a single array lookup.

private static final String[] REPLACEMENT = new String[Character.MAX_VALUE+1];
static {
    for(int i=Character.MIN_VALUE;i<=Character.MAX_VALUE;i++)
        REPLACEMENT[i] = Character.toString(Character.toLowerCase((char) i));
    // substitute
    REPLACEMENT['á'] =  "a";
    // remove
    REPLACEMENT['-'] =  "";
    // expand
    REPLACEMENT['æ'] = "ae";
}

public String convertWord(String word) {
    StringBuilder sb = new StringBuilder(word.length());
    for(int i=0;i<word.length();i++)
        sb.append(REPLACEMENT[word.charAt(i)]);
    return sb.toString();
} 
Peter Lawrey
  • This is the solution that helped me create my current code. So I'm accepting this. But mikera helped a lot as well. – Ólafur Waage Mar 29 '11 at 10:37
  • 1
    It seemed a little crazy at first, but it turns out the array is only 64KB, which isn't so bad at all. – Kobi Mar 29 '11 at 10:45
  • @Kobi, This look up is very fast and doesn't require any objects. The convertWord() only creates one temporary object (the StringBuilder) – Peter Lawrey Mar 29 '11 at 10:54
  • That comment was in fact my odd way of showing support! (I just wanted to check the size was reasonable before I voted) – Kobi Mar 29 '11 at 10:55
  • @Kobi, the size will be much smaller than using the maximum size a `Map` can be. ;) – Peter Lawrey Mar 29 '11 at 11:09
  • @Peter - That depends. You don't need all characters in your map if you default to removing the character (that is - you remove characters that aren't mapped, which is probably the case for most characters). I'll take your word about boxing though. – Kobi Mar 29 '11 at 11:15
8

My suggestion would be:

  • Convert the String to a char[] array
  • Run through the array, testing each character one by one (e.g. with a switch statement) and replacing it if needed
  • Convert the char[] array back to a String

I think this is probably the fastest performance you will get in pure Java.

EDIT: I notice that some of your replacements change the length of the string. The same principle still applies, but you need to keep two arrays and advance a source index and a destination index separately. You may also need to resize the destination array if you run out of space (i.e. allocate a larger array and arraycopy the existing destination array into it).
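The two-array approach described above could look roughly like this. It is only a sketch: the class name CharArrayConverterSketch is hypothetical, it handles just a few of the characters from the question, and it lower-cases char by char with Character.toLowerCase instead of calling String.toLowerCase first.

```java
import java.util.Arrays;

final class CharArrayConverterSketch {
    static String convertWord(String word) {
        char[] src = word.toCharArray();
        char[] dst = new char[src.length]; // grown only when expansions need more room
        int d = 0;                         // destination index, advances independently of s
        for (int s = 0; s < src.length; s++) {
            char c = Character.toLowerCase(src[s]);
            boolean expands = (c == '\u00e6' || c == '\u00fe'); // ae ligature or thorn
            int needed = expands ? 2 : 1;
            if (d + needed > dst.length) {
                // Reallocate a larger array and copy the existing contents over.
                dst = Arrays.copyOf(dst, Math.max(dst.length * 2, d + needed));
            }
            switch (c) {
                case '\u00e1': dst[d++] = 'a'; break;                 // a with acute
                case '\u00e6': dst[d++] = 'a'; dst[d++] = 'e'; break; // one char becomes two
                case '\u00fe': dst[d++] = 't'; dst[d++] = 'h'; break; // one char becomes two
                case '-': break;                                      // removed, d does not move
                default: dst[d++] = c;
            }
        }
        // Only the first d chars of dst are valid output.
        return new String(dst, 0, d);
    }
}
```

For input without expansions the destination array is never reallocated, which is the case this approach is meant to make cheap.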

mikera
  • 1
    Basically, you should iterate over the characters and use a `StringBuilder`. – Kobi Mar 29 '11 at 09:54
  • 1
    I would do the same, with one small variation if you don't mind using a little more space: I would use a `Map`, where keys are the set of characters that you want to replace, and the values are the corresponding replacements. This avoids the `switch` statement. – MarcoS Mar 29 '11 at 09:54
  • Kobi / MarcoS - agree that both of your approaches are more clean / elegant :-) though they will also both be somewhat slower. Depends on how much you really care about "efficient" I guess...... – mikera Mar 29 '11 at 10:01
  • `Map` would work since I replace æ with ae for example. – Ólafur Waage Mar 29 '11 at 10:01
  • Using a Character as a key could result in an Object being created for every character in your string. ;) – Peter Lawrey Mar 29 '11 at 10:03
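For comparison, the Map idea from these comments might be sketched like this. I am assuming a HashMap from Character to String, and the class name MapConverterSketch is made up. Only three of the mappings are shown.

```java
import java.util.HashMap;
import java.util.Map;

final class MapConverterSketch {
    // Keys are the characters to replace, values are their replacements.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00e1', "a");  // substitute: a with acute
        REPLACEMENTS.put('-', "");        // remove
        REPLACEMENTS.put('\u00e6', "ae"); // expand: ae ligature
    }

    static String convertWord(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toLowerCase(word.charAt(i));
            // The lookup autoboxes c to a Character; this is the per-character
            // object cost mentioned in the comment above.
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c); // unmapped characters are kept as they are
            }
        }
        return sb.toString();
    }
}
```

Character.valueOf is only required to cache values up to \u007f, so the accented letters here are the ones likely to be boxed into fresh objects on each lookup.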
5

My implementation is based on a lookup table.

public static String convertWord(String str) {
    char[] words = str.toCharArray();
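    // find[w] is replaced by replace[w]. The trailing entries of find
    // ('-', '.', '/') have no counterpart in replace and are simply removed.
    char[] find = {'á','é','ú','ý','ð','ó','ö','æ','þ','-','.',
            '/'};
    String[] replace = {"a","e","u","y","d","o","o","ae","th"};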
    StringBuilder out = new StringBuilder(str.length());
    for (int i = 0; i < words.length; i++) {
        boolean matchFailed = true;
        for(int w = 0; w < find.length; w++) {
            if(words[i] == find[w]) {
                if(w < replace.length) {
                    out.append(replace[w]);
                }
                matchFailed = false;
                break;
            }
        }
        if(matchFailed) out.append(words[i]);
    }
    return out.toString();
}
Prince John Wesley
2

My first choice would be to use a StringBuilder because you need to remove some chars from the string.

My second choice would be to iterate through the array of chars and write each treated char into another array of the initial size of the string. Then you would need to copy the array to trim any unused positions.

After that, I would run some performance tests to see which one is better.

bruno conde
0

I doubt that you can speed up the single-character replacements much at all. As for the regular-expression replacements, you can compile the regexes beforehand.
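Compiling the regexes beforehand could look something like the sketch below. The class name PrecompiledRegexSketch is hypothetical, and it covers only the replaceAll calls from the question, with the three removals folded into one character class.

```java
import java.util.regex.Pattern;

final class PrecompiledRegexSketch {
    // Compiled once at class load. String.replaceAll(regex, repl) is documented to behave
    // like Pattern.compile(regex).matcher(str).replaceAll(repl), so it compiles on each call.
    private static final Pattern REMOVE = Pattern.compile("[-./]");
    private static final Pattern AE = Pattern.compile("\u00e6"); // ae ligature
    private static final Pattern TH = Pattern.compile("\u00fe"); // thorn

    static String convertWord(String word) {
        String s = REMOVE.matcher(word).replaceAll("");
        s = AE.matcher(s).replaceAll("ae");
        return TH.matcher(s).replaceAll("th");
    }
}
```

This still makes one pass over the string per pattern, so it only removes the compilation cost, not the repeated scanning.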

dcn
0

Use the function String.replaceAll. Here is a nice article on something similar to what you want: link

Ovidiu Pacurar
0

Any time we have problems like this we use regular expressions, as they are by far the fastest way to deal with what you are trying to do.

Have you already tried regular expressions?

bstack
0

What I see as inefficient is that every replace call scans the whole string again, including characters that have already been replaced, which is wasted work.

I would get the char array of the String instance, iterate over it, and run each character through a chain of if-else checks like this:

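char[] array = word.toCharArray();
for(int i=0; i<array.length; ++i){
    char currentChar = array[i];
    // char is a primitive, so compare with == rather than equals()
    if(currentChar == 'é')
        array[i] = 'e';
    else if(currentChar == 'ö')
        array[i] = 'o';
    // else if(... and so on for the other single-character replacements
}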
Dunaril
0

I just implemented this utility class that replaces a char or a group of chars of a String. It is equivalent to bash tr and perl tr///, aka, transliterate. I hope it helps someone!

package your.package.name;

/**
 * Utility class that replaces chars of a String, aka, transliterate.
 * 
 * It's equivalent to bash 'tr' and perl 'tr///'.
 *
 */
public class ReplaceChars {

    public static String replace(String string, String from, String to) {
        return new String(replace(string.toCharArray(), from.toCharArray(), to.toCharArray()));
    }

    public static char[] replace(char[] chars, char[] from, char[] to) {

        char[] output = chars.clone();
        for (int i = 0; i < output.length; i++) {
            for (int j = 0; j < from.length; j++) {
                if (output[i] == from[j]) {
                    output[i] = to[j];
                    break;
                }
            }
        }
        return output;
    }

    /**
     * For tests!
     */
    public static void main(String[] args) {

        // Example from: https://en.wikipedia.org/wiki/Caesar_cipher
        String string = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG";
        String from = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        String to = "XYZABCDEFGHIJKLMNOPQRSTUVW";

        System.out.println();
        System.out.println("Caesar cipher: " + string);
        System.out.println("Result:        " + ReplaceChars.replace(string, from, to));
    }
}

This is the output:

Caesar cipher: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
Result:        QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD
fabiolimace