5

Say there is a set of words and I would like to cluster them based on their char bag (multiset). For example,

{tea, eat, abba, aabb, hello}

will be clustered into

{{tea, eat}, {abba, aabb}, {hello}}.

abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.

To make it efficient, a naive way I can think of is to convert each word into a char-count string; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. Then I can build a dictionary and group words with the same key.
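
For illustration, a rough Java sketch of the key-building I have in mind (the method name is just for illustration):

// Sort the characters, then run-length encode: "abba" -> "a2b2", "tea" -> "a1e1t1".
static String charBagKey(String word) {
    char[] chars = word.toCharArray();
    java.util.Arrays.sort(chars);
    StringBuilder key = new StringBuilder();
    int i = 0;
    while (i < chars.length) {
        int j = i;
        while (j < chars.length && chars[j] == chars[i]) {
            j++;
        }
        key.append(chars[i]).append(j - i);
        i = j;
    }
    return key.toString();
}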

Two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward, and performance is not as good as with char/int keys.

Is there a more efficient way to solve the problem?

sh1
  • 3,914
  • 13
  • 28
  • 3
    I would consider the middle ground: sort the characters of the key string, but don't do the RLE compression on it, so `abba` and `aabb` would both come out as `aabb`. Easy to do, not much awkwardness or chance of the "compression" blowing up and making the string longer. – Jerry Coffin Aug 11 '13 at 01:43
  • 2
    Do you need to be able to retrieve the original strings as well? – keyboardP Aug 11 '13 at 01:55
  • 2
    I wonder whether anyone here ever heard the word "anagram". – n. 'pronouns' m. Aug 11 '13 at 02:57
  • A proper set does not include a count; then "meet" and "met" are composed of the same set of characters. – tripleee Aug 11 '13 at 07:41
  • Technically a duplicate of [Group together all the anagrams](http://stackoverflow.com/q/17934627/2417578), _but_ that question was marked as a duplicate of something it didn't duplicate. – sh1 Aug 11 '13 at 11:48

7 Answers

2

For detecting anagrams you can use a hashing scheme based on the product of prime numbers. Mapping A->2, B->3, C->5, etc. will give "abba" == "aabb" == 36 (but a different letter-to-prime-number mapping will be better). See my answer here.
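
A minimal sketch of the idea, assuming lowercase a-z and ignoring overflow (the letter-to-prime mapping here is simply alphabetical, not the tuned one suggested above):

// Map each letter to a prime and multiply: anagrams get the same product.
// Assumes lowercase a-z; the product can overflow a long for long words.
static final long[] PRIMES = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

static long primeProductHash(String word) {
    long product = 1;
    for (char c : word.toCharArray()) {
        product *= PRIMES[c - 'a'];
    }
    return product;   // "abba" and "aabb" both give 2 * 2 * 3 * 3 = 36
}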

wildplasser
  • 38,231
  • 6
  • 56
  • 94
  • 2
    Don't use 2 in your list of primes. When you reach the overflow case, multiplying a number by 2 will map two values to one result (one value where 0 shifts off the top, another value where 1 shifts off the top). If you multiply by any odd number then you always get a 1:1 mapping mod `2**n` so you don't lose information. See my answer [here](http://stackoverflow.com/a/18144931/2417578). – sh1 Aug 11 '13 at 12:24
  • Oh, I see you already made that observation in your other post. I used the remaining bit to distinguish between hashes that had overflowed and those that had not, because the latter gives definite positive as well as negative results (although many overflowed hashes may still be unique I cannot easily prove that for any of them). – sh1 Aug 11 '13 at 12:43
  • As I stated in my original answer, I have observed no overflow using a list of >> 100K Dutch words (including misspellings, etc); the largest words being about 15 chars long and the product being a 64 bit `unsigned long long int`. Using a mapping such that the most common letters (e,t) map to the lowest primes seems enough to keep the product within bounds. – wildplasser Aug 11 '13 at 13:49
  • It is the Fundamental Theorem of Arithmetic: every integer can be uniquely represented as a product of prime numbers (integer factorization). Since `2` is the first prime number, it works with it too. – rook Aug 11 '13 at 14:46
  • @rook, but this is not true in [modular arithmetic](http://en.wikipedia.org/wiki/Modular_arithmetic). – sh1 Aug 11 '13 at 14:58
  • @wildplasser, English-language counter-examples [given to me](http://stackoverflow.com/q/18162204/2417578#comment26606935_18162204) yesterday: pterygoplichtys, glyptoperichthys, supercalifragilisticexpialidocious. OK, not exactly English words, but words which can appear in English text. – sh1 Aug 11 '13 at 15:02
  • WRT your _omit the 2_: on second thought, I think it is not needed. For a word with `X` 'e's (presuming 'e' to be the most frequent letter, mapping to `2`), the hash has `X` '0' bits at the LSB, and the _rest_ of the product is in the upper (64-X) bits, modulo 2**(64-X), assuming a 64-bit hash. – wildplasser Aug 11 '13 at 15:21
  • The only real worry here is that hashes could overflow when the product gets too large. – darksky Aug 11 '13 at 22:27
  • @wildplasser, for sensible English words and large integers it's not a huge loss, but you would start to feel it if you filtered your data through a hash table which was based on the least-significant bits. At that point, every 'e' doubles your collision rate. – sh1 Aug 16 '13 at 20:23
1

Since you are going to sort words, I assume all characters' ASCII values are in the range 0-255. Then you can do a Counting Sort over the words.

The counting sort takes time proportional to the length of the input word, and reconstructing the string from the counts is also O(wordLen). You cannot make this step cheaper than O(wordLen), because you have to iterate over every character of the word at least once: there is no predefined order, and you cannot make any assumptions about a word without looking at all of its characters. Traditional sorting implementations (i.e. comparison-based ones) give you O(n * lg n), but non-comparison sorts give you O(n).

Iterate over all the words of the list and sort each one using this counting sort. Keep a map from each sorted word to the list of known words that map to it. Adding an element to a list takes constant time, so the overall complexity of the algorithm is O(n * avgWordLength).

Here is a sample implementation

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


public class ClusterGen {

    static String sortWord(String w) {
        int freq[] = new int[256];

        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        //It is at most O(n)
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char)i);
            }
        }
        return sortedWord.toString();
    }

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();

        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }

        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));

    }
}

Returns

{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
bsd
  • 2,577
  • 1
  • 16
  • 23
1

Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:

Let's say we have a 7-letter alphabet (abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100

The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.

Here is the hash for "dada": 11000110000

It's worth noting that because there is a one-to-one correspondence between possible hashes and possible char bags, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.

I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time (given you know the values of x and y), then you'll be able to solve the entire grouping problem in O(n).
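
For the usual 26-letter lowercase alphabet, a rough sketch of computing this code packed into a long (which assumes 26 + word length <= 64 bits):

// Emit one '1' bit per occurrence of each letter and a '0' bit to move on to the
// next letter, exactly as described above. Assumes lowercase a-z, word length <= 38.
static long anagramCode(String word) {
    int[] freq = new int[26];
    for (char c : word.toCharArray()) {
        freq[c - 'a']++;
    }
    long code = 0;
    for (int count : freq) {
        for (int j = 0; j < count; j++) {
            code = (code << 1) | 1;   // another occurrence of the current letter
        }
        code <<= 1;                   // move on to the next letter
    }
    return code;
}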

flancor
  • 158
  • 1
  • 6
0

I would do this in two steps: first, sort all your words according to their length and work on each subset separately (this is to avoid lots of overlaps later).

The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number (a = 1, b = 2, etc., for example) and add up all the values for each word, thereby assigning each word to an integer. Then you can sort the words according to this integer value, which drastically cuts the number of comparisons you have to make (a rough sketch of this step is at the end of this answer).

Depending on your data set you may still have a lot of overlaps ("bad" and "cac" would generate the same integer hash), so you may want to set a threshold where, if you have too many words in one bucket, you repeat the previous step with another hash (just assigning different numbers to the letters). Unless someone has looked at your code and designed a wordlist to mess you up, this should cut the overlaps to almost none.

Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple of char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
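
For what it's worth, a rough sketch of the bucketing step in Java (assuming lowercase a-z and java.util collections; the exact within-bucket comparison is left out):

// Group words (already split by length) by the sum of their letter values
// (a = 1, b = 2, ...). Words in the same bucket still need an exact char-bag check.
static Map<Integer, List<String>> roughBuckets(List<String> wordsOfSameLength) {
    Map<Integer, List<String>> buckets = new HashMap<Integer, List<String>>();
    for (String word : wordsOfSameLength) {
        int sum = 0;
        for (char c : word.toCharArray()) {
            sum += c - 'a' + 1;
        }
        List<String> bucket = buckets.get(sum);
        if (bucket == null) {
            bucket = new ArrayList<String>();
            buckets.put(sum, bucket);
        }
        bucket.add(word);
    }
    return buckets;
}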

flancor
  • 158
  • 1
  • 6
  • Thanks. As I said, it's an interview question and the interviewer suggests an O(n) algorithm. Counting the chars for each word seems to be on the right track, but I don't know the final solution he has in mind. – user2671488 Aug 11 '13 at 02:51
  • Is there a known maximum word length? If not O(n) sounds like a pipe dream. – flancor Aug 11 '13 at 03:25
  • @flancor, I've seen a lot of solutions to problems on SO claiming to be O(n) on the assumption that you have an arbitrary-precision arithmetic library. – sh1 Aug 11 '13 at 11:55
  • If you choose your letter-value map more carefully then you can probably go a long way to answering [my related question](http://stackoverflow.com/q/18162204/2417578) about how to optimise a solution closely related to yours. – sh1 Aug 11 '13 at 12:05
  • @sh1 After sleeping on the problem I've posted a new solution that may interest you. I believe it is closer to what the OP is looking for but it's not optimized for real-world data. – flancor Aug 11 '13 at 16:58
0

One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".

Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.

So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words map to a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new word goes into, or find the bucket an existing word is in.
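
A minimal sketch of that bucket identifier, assuming lowercase ASCII letters only:

// 26-bit presence mask: bit 0 = 'a', bit 1 = 'b', ... Duplicate letters are
// ignored, so "rod" and "door" get the same mask, as described above.
static int letterSetMask(String word) {
    int mask = 0;
    for (char c : word.toCharArray()) {
        mask |= 1 << (c - 'a');
    }
    return mask;
}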

user1118321
  • 23,821
  • 4
  • 52
  • 78
  • Thanks. This is an interview question and I actually proposed the solution you mentioned. However, the interviewer says that it's a general char set (could be Unicode), so you cannot convert it into a bit vector and fit it into an int/long. – user2671488 Aug 11 '13 at 02:47
  • I think he was just trying to lead you on a different path. It seems like he wanted runtime complexity to be optimized instead of minor speed optimizations like using a bit vector. There are at least two answers posted already that both propose the same `O(n)` solution (which can be summarized as "use bucket sort and a hashtable"). You should accept one or ask questions if you don't understand it. – rliu Aug 11 '13 at 09:01
0

Count the frequency of characters in each of the strings, then build a hash table keyed on the frequency table. For example, for the strings aczda and aacdz we get 20110000000000000000000001. Using the hash table we can partition all these strings into buckets in O(N).
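
A minimal sketch of building that key, assuming lowercase a-z and per-letter counts below 10 (larger counts would need a separator between the digits):

// 26-digit frequency key: "aczda" -> "2011", then 21 zeros, then a final "1".
static String frequencyKey(String word) {
    int[] freq = new int[26];
    for (char c : word.toCharArray()) {
        freq[c - 'a']++;
    }
    StringBuilder key = new StringBuilder();
    for (int count : freq) {
        key.append(count);
    }
    return key.toString();
}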

Fallen
  • 4,256
  • 1
  • 24
  • 43
0

26-bit integer as a hash function

If your alphabet isn't too large (for instance, just lowercase English letters), you can define this particular hash function for each word: a 26-bit integer where each bit represents whether the corresponding letter exists in the word. Note that two words with the same char set will have the same hash.

Then just add them to a hash table. The words will automatically be clustered by hash collisions (words with the same letter set but different counts will also collide, so check within each bucket).

It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words).

darksky
  • 1,811
  • 16
  • 25