Algorithm for grouping anagram words

Question

Given a set of words, we need to find the anagram words and display each category alone using the best algorithm.

input:

man car kile arc none like

output:

man
car arc
kile like
none

The best solution I am developing now is based on an hashtable, but I am thinking about equation to convert anagram word into integer value.

Example: man => 'm'+'a'+'n' but this will not give unique values.

Any suggestion?

See following code in C#:

string line = Console.ReadLine();
string []words=line.Split(' ');
int[] numbers = GetUniqueInts(words);
for (int i = 0; i < words.Length; i++)
{
    if (table.ContainsKey(numbers[i]))
    {
        table[numbers[i]] = table[numbers[i]].Append(words[i]);
    }
    else
    {
        table.Add(numbers[i],new StringBuilder(words[i]));
    }

}

The problem is how to develop GetUniqueInts(string []) method.

So you want a hash function that returns the same hash for combinations of the same letters in different orders, with a unique hash for each combination of letters (no false matches)? — Sparr, Dec 28 '08 at 09:26

score 39 · Accepted Answer · answered Dec 28 '08 at 09:38

39

Don't bother with a custom hash function at all. Use the normal string hash function on whatever your platform is. The important thing is to make the key for your hash table the idea of a "sorted word" - where the word is sorted by letter, so "car" => "acr". All anagrams will have the same "sorted word".

Just have a hash from "sorted word" to "list of words for that sorted word". In LINQ this is incredibly easy:

using System;
using System.Collections.Generic;
using System.Linq;

class FindAnagrams
{
    static void Main(string[] args)
    {
        var lookup = args.ToLookup(word => SortLetters(word));

        foreach (var entry in lookup)
        {
            foreach (var word in entry)
            {
                Console.Write(word);
                Console.Write(" ");
            }
            Console.WriteLine();
        }
    }

    static string SortLetters(string original)
    {
        char[] letters = original.ToCharArray();
        Array.Sort(letters);
        return new string(letters);
    }
}

Sample use:

c:\Users\Jon\Test>FindAnagrams.exe man car kile arc none like
man
car arc
kile like
none

answered Dec 28 '08 at 09:38

Jon Skeet

1,261,211
792
8,724
8,929

1

wow, that looks sexy. much shorter than the c++ version '( :) – Johannes Schaub - litb Dec 28 '08 at 09:48
I am not thinking about custom hashing but to make the key integer instead of sorting all the words – Ahmed Said Dec 28 '08 at 09:50
1

I would be interested in seeing perf numbers for this vs my scheme I think mine should calculate a hash value faster, because it can be done with 1 pass through the string in O(N). The sort if O(n log n) However, lookup may be better I'm not sure how my hash function would distribute values. – Scott Wisniewski Dec 28 '08 at 09:51
Don't forget that this is only sorting each word - and the words in a real dictionary ary pretty short. Complexity becomes relevant with large n. – Jon Skeet Dec 28 '08 at 10:09
Sure.. but, running through the string once, doing addition and some bit shifting and anding is going to be faster than doing a sort, and then going through the string and doing some bit shifting and anding. – Scott Wisniewski Dec 28 '08 at 10:16
I was just using complexity to be terse, because their are limits on comment length. – Scott Wisniewski Dec 28 '08 at 10:17
The more I think about it Jon Skeets method is the best one since it will work with any characters. Methods that tries to be clever assumes there is only 26 characters in the alphabet and breaks if non-ascii characters are used. – some Dec 28 '08 at 12:49
3

In fact, if you restrict it to 26 characters, the sort can be done in O(n) very easily too. The main advantage my route has is simplicity, to be honest - it's *obviously* correct without worrying too much about the maths. I like simple solutions :) – Jon Skeet Dec 28 '08 at 13:15
@litb: it would be even shorter (by 7 lines) if the inner loop is replaced by Console.WriteLine(String.Join(" ", entry)) :) – Hosam Aly Jan 04 '09 at 12:08

score 18 · Answer 2 · answered Dec 28 '08 at 11:05

18

I used a Godel-inspired scheme:

Assign the primes P_1 to P_26 to the letters (in any order, but to obtain smallish hash values best to give common letters small primes).

Built a histogram of the letters in the word.

Then the hash value is the product of each letter's associated prime raised to the power of its frequency. This gives a unique value to every anagram.

Python code:

primes = [2, 41, 37, 47, 3, 67, 71, 23, 5, 101, 61, 17, 19, 13, 31, 43, 97, 29, 11, 7, 73, 83, 79, 89, 59, 53]


def get_frequency_map(word):
    map = {}

    for letter in word:
        map[letter] = map.get(letter, 0) + 1

    return map


def hash(word):
    map = get_frequency_map(word)
    product = 1
    for letter in map.iterkeys():
        product = product * primes[ord(letter)-97] ** map.get(letter, 0)
    return product

This cleverly transforms the tricky problem of finding subanagrams into the (also known to be tricky) problem of factoring large numbers...

answered Dec 28 '08 at 11:05

Nice! Unique prime factorisation FTW. How about unicode input? Sorting and comparing the strings would win in that case :) – James Brady Dec 28 '08 at 12:31
I love this answer. It is very cool. I answered this question and reviewed answers for a recruiting questionaire at a company where I worked. Most people would just produce one word anagrams. And I don't think anybody but me seriously optimized it. There is lots of room to show off in this question. – markets Dec 28 '08 at 17:09
3

But arbitrarily large words will require arbitrarily large integers. You may as well use the sorted word (or the frequency map) as the hash key. – Roddy Dec 29 '08 at 12:17
Yup this is the standard and the best way to do it, interviewers will be expecting/or impressed with this solution +1. But should be aware of the downside of it, as mentioned by @Roddy – nmd Dec 07 '12 at 15:29

score 7 · Answer 3 · answered Dec 28 '08 at 10:04

7

A Python version for giggles:

from collections import defaultdict
res = defaultdict(list)
L = "car, acr, bat, tab, get, cat".split(", ")

for w in L:
    res["".join(sorted(w))].append(w)

print(res.values())

answered Dec 28 '08 at 10:04

James Brady

21,724
7
48
56

Also, see namin's permutation algorithm here: http://stackoverflow.com/questions/396421/checking-if-two-strings-are-permutations-of-each-other-in-python#396438 – James Brady Dec 28 '08 at 17:47

score 3 · Answer 4 · answered Dec 28 '08 at 09:16

3

I don't think you'll find anything better than a hash table with a custom hash function (that would sort the letters of he word before hashing it).

Sum of the letters will never work, because you can't really make 'ac' and 'bb' different.

answered Dec 28 '08 at 09:16

Paulius

5,609
7
39
47

yes the sum will not work but let us see a new way to convert anagram word into a unique number – Ahmed Said Dec 28 '08 at 09:19
You're not thinking straight about hashing and uniqueness. You can't guarantee uniqueness with a hashing function, so you need a way of handling duplicate 'hits' in your table anyway. Sum of letter might be non-optimal hash, but it should still work. – Roddy Dec 28 '08 at 09:26
1

Assigning prime numbers to alphabets and by product of primenumbers of anagrams will help u to build hash table. – naren Mar 31 '15 at 12:10

Scott Wisniewski · Answer 5 · 2008-12-28T09:41:29.690

3

You will need large integers (or a bit vector actually) but the following might work

the first occurrence of each letter get's assigned the bit number for that letter, the second occurence gets the bit number for that letter + 26.

For example

a #1 = 1 b #1 = 2 c #1 = 4 a #2 = 2^26 b #2 = 2 ^ 27

You can then sum these together, to get a unique value for the word based on it's letters.

Your storage requirements for the word values will be:

n * 26 bits

where n is the maximum number of occurrences of any repeated letter.

edited Dec 28 '08 at 09:41

answered Dec 28 '08 at 09:35

Scott Wisniewski

23,372
7
55
89

Would it be sufficient to have 26 unique values (2^0 up to 2^25), then compare words by computing the sum and some other commutative function, like XOR? It seems like it should be enough, but I can't up with a convincing argument why... :) – James Brady Dec 28 '08 at 09:43
Wether or not XOR would be good depends on the distribution of words in the dictionary. It's a good idea for improvement though. The only real way to know would be to test and measure both. – Scott Wisniewski Dec 28 '08 at 10:05

score 2 · Answer 6 · answered Apr 13 '11 at 02:23

I wouldn't use hashing since it adds additional complexity for look-up and adds. Hashing, sorting and multiplications are all going to be slower than a simple array-based histogram solution with tracking uniques. Worst case is O(2n):

// structured for clarity
static bool isAnagram(String s1, String s2)
{
    int[] histogram = new int[256];

    int uniques = 0;

    // scan first string
    foreach (int c in s1)
    {
        // count occurrence
        int count = ++histogram[c];

        // count uniques
        if (count == 1)
        {
            ++uniques;
        }
    }

    // scan second string
    foreach (int c in s2)
    {
        // reverse count occurrence
        int count = --histogram[c];

        // reverse count uniques
        if (count == 0)
        {
            --uniques;
        }
        else if (count < 0) // trivial reject of longer strings or more occurrences
        {
            return false;
        }
    }

    // final histogram unique count should be 0
    return (uniques == 0);
}

`O(2n)` is the same as `O(n)`. – phant0m Nov 24 '12 at 21:55 — phant0m, Nov 24 '12 at 21:55

score 1 · Answer 7 · answered Dec 28 '08 at 10:06

I have implemented this before with a simple array of letter counts, e.g.:

unsigned char letter_frequency[26];

Then store that in a database table together with each word. Words that have the same letter frequency 'signature' are anagrams, and a simple SQL query then returns all anagrams of a word directly.

With some experimentation with a very large dictionary, I found no word that exceeded a frequency count of 9 for any letter, so the 'signature' can be represented as a string of numbers 0..9 (The size could be easily halved by packing into bytes as hex, and further reduced by binary encoding the number, but I didn't bother with any of this so far).

Here is a ruby function to compute the signature of a given word and store it into a Hash, while discarding duplicates. From the Hash I later build a SQL table:

def processword(word, downcase)
  word.chomp!
  word.squeeze!(" ") 
  word.chomp!(" ")
  if (downcase)
    word.downcase!
  end
  if ($dict[word]==nil) 
    stdword=word.downcase
    signature=$letters.collect {|letter| stdword.count(letter)}
    signature.each do |cnt|
      if (cnt>9)
        puts "Signature overflow:#{word}|#{signature}|#{cnt}"
      end
    end
    $dict[word]=[$wordid,signature]
    $wordid=$wordid+1
  end
end

score 1 · Answer 8 · answered Dec 28 '08 at 21:45

Assign a unique prime number to the letters a-z

Iterate your word array, creating a product of primes based on the letters in each word.
Store that product in your word list, with the corresponding word.

Sort the array, ascending by the product.

Iterate the array, doing a control break at every product change.

score 0 · Answer 9 · answered Aug 12 '09 at 14:29

In C, I just implemented the following hash which basically does a 26-bit bitmask on whether the word in the dictionary has a particular letter in it. So, all anagrams have the same hash. The hash doesn't take into account repeated letters, so there will be some additional overloading, but it still manages to be faster than my perl implementation.

#define BUCKETS 49999

struct bucket {
    char *word;
    struct bucket *next;
};

static struct bucket hash_table[BUCKETS];

static unsigned int hash_word(char *word)
{
    char *p = word;
    unsigned int hash = 0;

    while (*p) {
        if (*p < 97 || *p > 122) {
            return 0;
        }
        hash |= 2 << (*p - 97);
        *p++;
    }

    return hash % BUCKETS;
}

Overloaded buckets created and added as linked list, etc. Then just write a function that makes sure that the words that match the hash value are the same length and that the letters in each are 1 to 1 and return that as a match.

score 0 · Answer 10 · answered Mar 19 '10 at 23:23

I will generate the hasmap based on the sample word and the rest of the alphabets I won't care.

For example if the word is "car" my hash table will be like this: a,0 b,MAX c,1 d,MAX e,MAX ... .. r,2 . As a result any has greater than 3 will consider as not matching

(more tuning...) And my comparison method will compare the hash total within the hash calculation itself. It won't continue once it can identify the word is not equal.

public static HashMap<String, Integer> getHashMap(String word) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        String[] chars = word.split("");
        int index = 0;
        for (String c : chars) {
            map.put(c, index);
            index++;
        }
        return map;
    }

    public static int alphaHash(String word, int base,
            HashMap<String, Integer> map) {
        String[] chars = word.split("");
        int result = 0;
        for (String c : chars) {
            if (c.length() <= 0 || c.equals(null)) {
                continue;
            }
            int index = 0;
            if (map.containsKey(c)) {
                index = map.get(c);
            } else {
                index = Integer.MAX_VALUE;
            }
            result += index;
            if (result > base) {
                return result;
            }
        }
        return result;
    }

Main method

  HashMap<String, Integer> map = getHashMap(sample);
        int sampleHash = alphaHash(sample, Integer.MAX_VALUE, map);
        for (String s : args) {
                if (sampleHash == alphaHash(s, sampleHash, map)) {
                    System.out.print(s + " ");
                }
            }

score 0 · Answer 11 · answered Jan 03 '18 at 11:47

Just want to add simple python solution in addition to the other useful answers:

def check_permutation_group(word_list):
    result = {}

    for word in word_list:
        hash_arr_for_word = [0] * 128  # assuming standard ascii

        for char in word:
            char_int = ord(char)
            hash_arr_for_word[char_int] += 1

        hash_for_word = ''.join(str(item) for item in hash_arr_for_word)

        if not result.get(hash_for_word, None):
            result[str(hash_for_word)] = [word]
        else:
            result[str(hash_for_word)] += [word]

return list(result.values())

score 0 · Answer 12 · answered Feb 21 '18 at 04:32

python code:

line = "man car kile arc none like"
hmap = {}
for w in line.split():
  ws = ''.join(sorted(w))
  try:
    hmap[ws].append(w)
  except KeyError:
    hmap[ws] = [w]

for i in hmap:
   print hmap[i]

output:

['car', 'arc']
['kile', 'like']
['none']
['man']

score 0 · Answer 13 · edited Feb 28 '12 at 10:41

0

Anagrams can be found in following way:

Length of word should match.
Perform addition of each character in terms of integer value. This sum will match if you perform same on anagram.
Perform multiplication of each character in terms of integer value. Evaluated value will match if you perform same on anagram.

So I thought through above three validations, we can find anagrams. Correct me if I'm wrong.

Example: abc cba

Length of both words is 3.

Sum of individual characters for both words is 294.

Prod of individual characters for both words is 941094.

edited Feb 28 '12 at 10:41

bluish

23,093
23
110
171

answered Feb 27 '12 at 17:23

Praveen541

9
1

What if my word is 'zzzzzzzzzz'? Then the product will be `7.3046314e+20`. Storing and calculating this value could be a strain. What if we have even longer words? Considering this, is this solution efficient? – Ganz7 Sep 01 '15 at 16:56

sbr · Answer 14 · 2014-10-20T04:27:39.193

-1

JavaScript version. using hashing.

Time Complexity: 0(nm) , where n is number of words, m is length of word

var words = 'cat act mac tac ten cam net'.split(' '),
    hashMap = {};

words.forEach(function(w){
    w = w.split('').sort().join('');
    hashMap[w] = (hashMap[w]|0) + 1;
});

function print(obj,key){ 
    console.log(key, obj[key]);
}

Object.keys(hashMap).forEach(print.bind(null,hashMap))

edited Oct 20 '14 at 04:27

answered Sep 02 '13 at 22:09

sbr

4,212
5
39
45

It's not O(n), because sorting takes not constant time – HitOdessit Oct 19 '14 at 17:46

Algorithm for grouping anagram words

14 Answers14

Linked