get list of anagrams from a dictionary

Question

Basically, anagrams are permutations of a string. E.g. stack, sackt and stakc are all anagrams of stack (though the words above aren't meaningful). Anyway, you probably understand what I mean.

Now, I want a list of anagrams given a million words, or simply from a dictionary.

My basic question is: how do I find the total number of unique anagrams in a dictionary?

Sorting and comparing won't work, as its time complexity is pretty bad.

I thought of using a hash table with the string as key.

But the problem is: what should the hash function be? It would be helpful if some pseudocode were provided. Other approaches better than the ones mentioned would also be helpful.

Thanks.

The question is not horribly clear. Can you please rephrase the objective? – Nicholas DiPiazza, Jun 19 '12 at 20:07
Do you mean: I have a dictionary of one million words, and I wish to identify all sets of words within the dictionary that are anagrams of each other? E.g. if the dictionary contained [tap, pat, pot, top], you would wish to see [[tap, pat], [pot, top]]? – Alex Wilson, Jun 19 '12 at 20:09
Yeah @Alex. I just want to know how many different anagrams there are. – vijay, Jun 19 '12 at 20:10
Sorting is the solution here, and its complexity is linear if you assume some constant upper bound on word length. You just have to sort the right thing: the characters, not the words. – Fred Foo, Jun 19 '12 at 20:16
I'm obviously happy to have my answer unaccepted for a more pleasing solution, but would you mind up-voting it if you thought it proved at all useful? Thanks! – Alex Wilson, Jun 20 '12 at 11:19

score 24 · Accepted Answer · edited May 23 '17 at 12:17

The obvious solution is to map each character to a prime number and multiply the prime numbers. So if 'a' -> 2 and 'b' -> 3, then

'ab' -> 6
'ba' -> 6
'bab' -> 18
'abba' -> 36
'baba' -> 36

To minimise the chance of overflow, the smallest primes could be assigned to the more frequent letters (e,t,i,a,n). Note: The 26th prime is 101.

UPDATE: an implementation can be found here
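
A minimal Python sketch of the prime-product idea, for illustration only. The names `PRIMES`, `PRIME_FOR` and `prime_key` are made up here, and the plain alphabetical assignment ('a' -> 2 up to 'z' -> 101) is an assumption, not the frequency-ordered assignment suggested above. It assumes words contain only the letters a-z. Python integers do not overflow, so the overflow concern only applies when porting this to a fixed-width integer type.

```python
import string

# First 26 primes, one per lowercase letter ('a' -> 2, 'b' -> 3, ..., 'z' -> 101).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_FOR = dict(zip(string.ascii_lowercase, PRIMES))


def prime_key(word):
    """Product of the primes assigned to each letter (assumes a-z only)."""
    product = 1
    for ch in word.lower():
        product *= PRIME_FOR[ch]  # multiplication is order-independent
    return product
```

With this key, `prime_key("abba")` and `prime_key("baba")` both give 36, matching the table above.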

answered Jun 20 '12 at 10:07

wildplasser


You still have to deal with overflow, which might lead to "collisions", probably by storing a letter-frequency histogram with each entry. – wildplasser Jun 20 '12 at 10:23
Well: thanks! Please note that (once you have to deal with collisions) it will work with non-prime (*random*) numbers, too. This is similar to Zobrist-hashing. But with primes it *looks* cleaner. – wildplasser Jun 20 '12 at 10:31

Alex Wilson · Answer 2 · 2012-06-20T09:58:33.097

One possible hash function could be (assuming English words only) a sorted count of the number of occurrences of each letter. So for "anagram" you would generate [('a', 3), ('g', 1), ('m', 1), ('n', 1), ('r', 1)].
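
As a rough Python sketch of that sorted letter-count key (the function name `count_key` is hypothetical, and lowercasing the word first is an assumption about how case should be handled):

```python
from collections import Counter


def count_key(word):
    """Sorted (letter, count) pairs; equal for any two anagrams."""
    # Counter tallies each letter; sorting the items makes the key
    # independent of the order in which letters appear in the word.
    return tuple(sorted(Counter(word.lower()).items()))
```

A tuple is used so the key is hashable and can go straight into a dictionary. `count_key("anagram")` gives `(('a', 3), ('g', 1), ('m', 1), ('n', 1), ('r', 1))`.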

Alternatively, you could get an inexact grouping by generating a bitmask from your word, where each of bits 0-25 represents the presence or absence of one letter (bit 0 representing 'a' through to bit 25 representing 'z'). But then you'd have to do a bit more processing to split each hashed group further, to distinguish e.g. "to" from "too".
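
A hedged Python sketch of the bitmask variant, including the extra splitting step. The names `letter_mask` and `group_by_mask_then_letters` are invented for this example, the input is assumed to contain only a-z, and splitting each bucket by sorted letters is just one possible way to do the "bit more processing":

```python
from collections import defaultdict


def letter_mask(word):
    """Bit i is set if the i-th letter of the alphabet occurs in word."""
    mask = 0
    for ch in word.lower():
        mask |= 1 << (ord(ch) - ord('a'))
    return mask


def group_by_mask_then_letters(words):
    """Coarse grouping by mask, then an exact split inside each bucket."""
    coarse = defaultdict(list)
    for w in words:
        coarse[letter_mask(w)].append(w)

    groups = []
    for bucket in coarse.values():
        exact = defaultdict(list)
        for w in bucket:
            # Sorted letters keep duplicates, so "to" and "too" separate here.
            exact["".join(sorted(w.lower()))].append(w)
        groups.extend(exact.values())
    return groups
```

Here "to" and "too" share a mask but end up in different final groups.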

Does either of these ideas help? Any particular implementation language in mind (I could do C++, Python or Scala)?

Edit: added some example Scala code and output:

OK: I'm in Scala mode at the moment, so I've knocked something up to do what you ask, but (ahem) it may not be very clear if you're not that familiar with Scala or functional programming.

Using a big list of English words from here: http://scrapmaker.com/data/wordlists/twelve-dicts/2of12.txt

I ran this Scala code on them. It takes about 5 seconds using Scala 2.9 in script mode, including time to compile, with a dictionary of about 40,000 words. It is not the most efficient code, but it was the first thing that came to mind.

// Hashing function to go from a word to a sorted list of letter counts
def toHash(b:String) = b.groupBy(x=>x).map(v => (v._1, v._2.size) ).toList.sortWith(_._1 < _._1)


// Read all words from file, one word per line
val lines = scala.io.Source.fromFile("2of12.txt").getLines()

// Go from list of words to list of (hashed word, word)
val hashed = lines.map( l => (toHash(l), l) ).toList

// Group all the words by hash (hence group all anagrams together)
val grouped = hashed.groupBy( x => x._1 ).map( els => (els._1, els._2.map(_._2)) )

// Sort the resultant anagram sets so the largest come first
val sorted = grouped.toList.sortWith( _._2.size > _._2.size )

for ( set <- sorted.slice(0, 10) )
{
    println( set._2 )
}

This dumps out the first 10 sets of anagrams (sets with the most members first) being:

List(caret, cater, crate, react, trace)
List(reins, resin, rinse, risen, siren)
List(luster, result, rustle, sutler, ulster)
List(astir, sitar, stair, stria, tarsi)
List(latrine, ratline, reliant, retinal)
List(caper, crape, pacer, recap)
List(merit, miter, remit, timer)
List(notes, onset, steno, stone)
List(lair, liar, lira, rail)
List(drawer, redraw, reward, warder)

Note that this uses the first suggestion (list of counts of letters) not the more complicated bitmask method.

Edit 2: You can replace the hash function with a simple sort on the chars of each word (as suggested by JAB) and get the same result with clearer/faster code:

def toHash(b:String) = b.toList.sortWith(_<_)

Could you help me with an explanation of the algorithm? That would be very helpful. – vijay, Jun 19 '12 at 20:27

Steve Konves · Answer 3 · 2012-06-19T20:46:09.593

1

If you XOR the hash-code values of each character, and then XOR the result with the input length, you will get the same value regardless of the order of the letters in the word, meaning that all anagrams will produce the same hash. (XORing with the length prevents 'boss' and 'bo' from returning the same value, which they otherwise would, because the hash of 's' XORed with itself is always 0.)

Example:

int AnagramHash(string input)
{
    int output = 0;

    foreach(char c in input)
        output ^= c.GetHashCode();

    return output ^ input.Length;
}

You will still have to search for all words with the same AnagramHash. I would update the dictionary table with a field for the hash (regardless of your algorithm) to reduce overall computation.
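
A small Python sketch of the same idea, offered as an illustration and not as a port of the C# above. It uses `ord(ch)` as a stand-in for `c.GetHashCode()` (an assumption), and the names `anagram_hash` and `anagram_groups` are hypothetical. It also shows why a second check is needed: words that are not anagrams can share a hash, for example "aa" and "bb".

```python
from collections import defaultdict


def anagram_hash(word):
    """XOR of all character codes, then XOR with the word length."""
    out = 0
    for ch in word:
        out ^= ord(ch)  # assumption: code point stands in for GetHashCode()
    return out ^ len(word)


def anagram_groups(words):
    """Bucket by XOR hash, then confirm real anagrams inside each bucket."""
    buckets = defaultdict(list)
    for w in words:
        buckets[anagram_hash(w)].append(w)

    groups = []
    for bucket in buckets.values():
        exact = defaultdict(list)
        for w in bucket:
            exact["".join(sorted(w))].append(w)  # exact check via sorted letters
        groups.extend(exact.values())
    return groups
```

The hash narrows the search to a small bucket; the sorted-letter comparison inside each bucket is what actually decides whether two words are anagrams.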

EDIT: Also, as a side note, XOR is one of the cheapest operations an ALU performs, so if you do end up using it, you should be able to generate your hashes fairly quickly.

edited Jun 19 '12 at 20:46

answered Jun 19 '12 at 20:33


In C#, `GetHashCode()` is a method available on every object. It returns an integer for any object, and objects with the same value produce the same integer (hash codes are not guaranteed to be unique in general). For a different language, you could just use the byte value of each character as the hash code, because those are still distinct for each character. – Steve Konves Jun 19 '12 at 20:38
"You will still have to search for all words with the same AnagramHash." Not if you put the words in lists/etc. that are stored at the locations in the dictionary specified by `AnagramHash`. – JAB Jun 20 '12 at 15:39
Any problem if I use prime numbers to code each of the characters? – ultimate cause May 12 '18 at 19:05

JAB · Answer 4 · 2012-06-19T20:27:41.087

0

Sorting and comparing won't work as its time complexity is pretty bad.

Exchanging time complexity for extra memory, just store the counts of the letters in a word in a 26-element array of chars (or the equivalent small integer type in whatever language you're using, assuming the Roman alphabet and only alphabetic characters) and hash the array. You're stuck with O(n) time relative to word length, but most English words aren't really that long.

e.g. stack, sackt, and stakc would all have an array with the locations for s, t, a, c, k == 1 and the rest all set to 0.
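
A minimal Python sketch of that counting array, assuming lowercase a-z input only. The name `count_array_key` is made up for illustration, and converting to a tuple is a Python-specific detail, since lists cannot be used as dictionary keys.

```python
def count_array_key(word):
    """26 letter counts, indexed 0 for 'a' through 25 for 'z'."""
    counts = [0] * 26
    for ch in word.lower():
        counts[ord(ch) - ord('a')] += 1
    return tuple(counts)  # tuples are hashable, lists are not
```

The key is built in a single pass over the word with no sorting, which is the point of this variant.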

Based on your comment, which implies that you are indeed okay with sorting the characters of a word as long as you aren't sorting words themselves, you could do something even simpler than Alex's answer and just sort the characters in the word strings and hash the results. (larsmans said it first, but didn't post it as an answer, so...)
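
For illustration, a short Python sketch of the sort-the-characters approach, applied to the [tap, pat, pot, top] example from the question comments. The names `sorted_key` and `anagram_sets` are hypothetical, and dropping groups with a single word is an assumption about what should count as an anagram set.

```python
from collections import defaultdict


def sorted_key(word):
    """Letters of the word in sorted order, duplicates kept."""
    return "".join(sorted(word))


def anagram_sets(words):
    """Group words by sorted letters; keep only groups with 2+ words."""
    groups = defaultdict(list)
    for w in words:
        groups[sorted_key(w)].append(w)
    return [g for g in groups.values() if len(g) > 1]
```

`anagram_sets(["tap", "pat", "pot", "top"])` returns `[['tap', 'pat'], ['pot', 'top']]`, and `len()` of that result is the number of anagram sets.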

edited Jun 19 '12 at 20:27

answered Jun 19 '12 at 20:18


Basically, I am concerned about time complexity. Have a look at the other answer; I think it would take care of both complexities. Thanks – vijay Jun 19 '12 at 20:21
It does, but you said you didn't want sorting, so I gave you something that doesn't involve sorting. – JAB Jun 19 '12 at 20:22
Thanks. Sorry, I got lost somewhere :P – vijay Jun 19 '12 at 20:25
Alex isn't sorting the characters. He is making a sorted count of the characters in the word, which is quite cool. Anyway, thanks for your help. – vijay Jun 19 '12 at 20:33
JAB is correct though - sorting the characters (as long as you still keep duplicates) and using that as the hash will work well - and in fact is probably more elegant and efficient than the list from chars to counts that I suggested. – Alex Wilson Jun 19 '12 at 21:27

score 0 · Answer 5 · answered Jun 22 '12 at 15:52

Use a hashmap with a string as key and a list of strings as value, where each list contains all the anagrams of its key string.

The question is similar to "find all anagrams of a word in a file".

See the algorithm and code here: http://justprogrammng.blogspot.com/2012/06/determine-anagrams-of-word-in-file.html
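
A rough Python sketch of such a map, used to look up the anagrams of one word. This is not the code from the linked post. The names `build_index` and `anagrams_of` are invented here, and using the sorted letters as the key string is an assumption.

```python
from collections import defaultdict


def build_index(words):
    """Map each sorted-letter key to the list of words that share it."""
    index = defaultdict(list)
    for w in words:
        index["".join(sorted(w))].append(w)
    return index


def anagrams_of(word, index):
    """All indexed words that are anagrams of word, excluding word itself."""
    key = "".join(sorted(word))
    # .get avoids creating an empty entry for keys that are not present.
    return [w for w in index.get(key, []) if w != word]
```

The index is built once over the whole file; each lookup afterwards costs one sort of the query word plus one hash lookup.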
