Count the number of similar words having same letters

Question

Problem:

How can I identify or count similar words, i.e. words made of the same letters (the count of each letter must be the same; the order can be anything)?

For example, below are the words:

aabb, aaab, abbb, abaa, abab, aabc, caba, babb, baaa

The similar words are:

aabb, abab
aaab, abaa, baaa
abbb, babb
aabc, caba

These are the basic approaches I came up with:

Method 1: First sort the letters in each individual word. Then compare the sorted words.

Method 2: Compare each letter of each word against the remaining words. (This needs the highest number of iterations.)

Please suggest a better approach.

A sort is not really necessary, I would just use a set of multisets. If no multiset implementation is available, you just need a `Set>>` — Dici, Jul 02 '15 at 10:18
There are many ways to do this. Have you tried at least implementing one of the two methods you mention? — fge, Jul 02 '15 at 10:20

score 0 · Answer 1 · answered Jul 02 '15 at 10:29

Well, your method1 pretty much does the trick efficiently, if you use the sorted words efficiently.

First, sort the words (that is, sort the letters within each word).

Now you have reduced the problem to the element distinctness problem: "similar" words have become identical.

It can be done in one of two ways:

Sort the list of (sorted) words and iterate over it; all "similar" words will be adjacent to each other.
Iterate while maintaining a hash table that counts how many times you have seen each word (no second sort is needed in this approach).
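
A minimal sketch of the hash-table variant, using the words from the question. The class and method names (`SortedKeyGrouping`, `sortedKey`, `group`) are made up for illustration, and the table stores the matching words rather than only a count so that the groups can be printed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortedKeyGrouping {
    // Canonical key: the word's letters in sorted order, e.g. "abab" -> "aabb".
    static String sortedKey(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Groups words by their sorted key. LinkedHashMap keeps keys in first-seen order.
    static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String word : words) {
            groups.computeIfAbsent(sortedKey(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "aabb", "aaab", "abbb", "abaa", "abab", "aabc", "caba", "babb", "baaa");
        group(words).forEach((key, members) -> System.out.println(key + " -> " + members));
    }
}
```

The size of each list is the count of similar words for that key. If only the counts matter, a `Map<String, Integer>` would do.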

Alternatively, you could use histograms: build a histogram for each word that counts the number of occurrences of each character, then check for identical histograms.
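
A rough sketch of the histogram idea, assuming the words contain only lowercase ASCII letters `a` to `z` (that restriction is an assumption of this example, not part of the question). The names `HistogramGrouping`, `histogram` and `countByHistogram` are illustrative:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HistogramGrouping {
    // Letter histogram; assumes every character is in 'a'..'z'.
    static int[] histogram(String word) {
        int[] counts = new int[26];
        for (char c : word.toCharArray()) {
            counts[c - 'a']++;
        }
        return counts;
    }

    // Counts how many words share each histogram. Java arrays use identity-based
    // equals/hashCode, so the histogram is converted to a string to serve as a map key.
    static Map<String, Integer> countByHistogram(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(Arrays.toString(histogram(word)), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "aabb", "aaab", "abbb", "abaa", "abab", "aabc", "caba", "babb", "baaa");
        System.out.println(countByHistogram(words).values());
    }
}
```

No sorting is involved here: each word is scanned once, and identical histograms collapse onto the same key.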

answered Jul 02 '15 at 10:29

amit

I can't see why you are trying "complicated" solutions based on sorting when it is not required at all. Just using some hash sets does the trick. Am I missing something? – Dici Jul 02 '15 at 11:14
@Dici Hashing is neither more nor less complicated than sorting, it's just different. I also mentioned how to do it without sorting, and you would have seen it if you had read the whole answer (or asked for a clarification if you don't know what a histogram is) – amit Jul 02 '15 at 11:30
Hashing is faster than sorting, there's just a slight memory overhead in a `HashMap`. It just feels like it is not the natural approach here: if you ask any non-programmer, they are just going to count the letters in each word and compare the results. I did not find it clear that you propose a non-sort based solution because you start with `first, sort the words`, but I can see now that you do it in your last sentence – Dici Jul 02 '15 at 11:53
@Dici If the strings are short, then the constants involved in hashing might outweigh the log(n) factor (hashing's constants are poor compared to sorting). If they are not short, the extra memory could be an issue. This is a classic problem where one has to choose between existing trade-offs. – amit Jul 02 '15 at 11:56
In practice the important variable is the number of words, not their length. In this regard, hashing scales much better. But anyway, you gave both answers so it's perfectly fine :) – Dici Jul 02 '15 at 12:01

score 0 · Answer 2 · answered Jul 02 '15 at 10:32

First of all, you need a class, say LetterCluster. In that class, store a Map from each character to its count, a number for the word length, and the words found so far that belong to the group. Then the idea is to iterate over all the words and, for each word, over all the existing groups. In each iteration:

Compare the length of the String with the length stored in the group. If they do not match, there is no need for further analysis, since strings of different lengths cannot have the same letter counts.
If the lengths are the same, compare the Map of the String with the Map of the group. If they are equal, add the String to the group's word collection and stop iterating over the groups.
If no compatible group was found for the word, create a new group and store the length of the word, its Map, and the word itself in it.

Note that this grouping effectively partitions your words: words with the same letters end up in the same group.
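
The steps above can be sketched roughly as follows. Only the name LetterCluster comes from the description; the fields and the helpers `countLetters`, `accepts` and `partition` are assumptions made for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LetterCluster {
    final int length;                           // length shared by all words in the group
    final Map<Character, Integer> letterCounts; // character -> number of occurrences
    final List<String> words = new ArrayList<>();

    LetterCluster(String firstWord) {
        this.length = firstWord.length();
        this.letterCounts = countLetters(firstWord);
        this.words.add(firstWord);
    }

    static Map<Character, Integer> countLetters(String word) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : word.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // Cheap length check first, then the full comparison of letter counts.
    boolean accepts(String word) {
        return word.length() == length && countLetters(word).equals(letterCounts);
    }

    // Puts each word into the first compatible group, or opens a new group.
    static List<LetterCluster> partition(List<String> input) {
        List<LetterCluster> clusters = new ArrayList<>();
        for (String word : input) {
            boolean placed = false;
            for (LetterCluster cluster : clusters) {
                if (cluster.accepts(word)) {
                    cluster.words.add(word);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                clusters.add(new LetterCluster(word));
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "aabb", "aaab", "abbb", "abaa", "abab", "aabc", "caba", "babb", "baaa");
        for (LetterCluster cluster : partition(input)) {
            System.out.println(cluster.words);
        }
    }
}
```

Each word is compared against every existing group, so the work grows with the number of words times the number of groups.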

score 0 · Answer 3 · answered Jul 02 '15 at 10:36

I'd say: count the occurrences of each distinct letter, then compare the resulting counts.

For example, 'aabb' => a = 2, b = 2; 'abab' => a = 2, b = 2; 'aaab' => a = 3, b = 1.

The time complexity should be good, but the memory complexity is linear in the number of letters in the alphabet you are working with.
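
As an illustration, a small sketch that compares two words by their letter counts. The names `LetterCounts`, `count` and `similar` are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

class LetterCounts {
    // Maps each distinct letter of the word to its number of occurrences.
    static Map<Character, Integer> count(String word) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : word.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    // Two words are similar when their letter counts are equal.
    static boolean similar(String first, String second) {
        return count(first).equals(count(second));
    }

    public static void main(String[] args) {
        System.out.println(similar("aabb", "abab")); // both have a = 2, b = 2
        System.out.println(similar("aabb", "aaab")); // a = 2, b = 2 versus a = 3, b = 1
    }
}
```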

answered Jul 02 '15 at 10:36

Raphallal

The number of letters in the alphabet is constant, so the complexity cannot be expressed in terms of it! The memory complexity of your solution (and mine) is `O(number of anagrams)`, which is at worst `O(number of words)`. Thus, it is indeed linear, but not with respect to the number of letters in the alphabet – Dici Jul 02 '15 at 11:11

score 0 · Answer 4 · answered Jul 02 '15 at 11:05

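I should not give you a full answer, but I have a concise functional solution:

// Needs java.util.Map, java.util.function.Function, java.util.stream.Collectors and java.util.stream.Stream.
// Maps each word to its letter counts, then counts how many words share the same counts.
Map<Map<Character, Long>, Long> counts = Stream.of("aabb", "aaab", "abbb", "abaa", "abab", "aabc", "caba", "babb", "baaa")
      .map(s -> s.chars().boxed().collect(Collectors.groupingBy(i -> (char) i.intValue(), Collectors.counting())))
      .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));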

score 0 · Answer 5 · answered Jul 02 '15 at 11:14

First, define similarity in a precise way.

You can say two words are similar if:

A1. they are built from the same letters

A2. the letter at position p in word w1 can be found in word w2 at one of the positions (p-N ... p+N)

Both definitions above can match your request, but they are different.
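
To make the difference concrete, here is one possible reading of the two definitions as code. Both interpretations are assumptions: A1 is read as "same set of distinct letters, counts ignored", and A2 as "every letter of w1 at position p occurs in w2 somewhere in the window p-N to p+N". The class and method names are made up:

```java
import java.util.HashSet;
import java.util.Set;

class SimilarityDefinitions {
    static Set<Character> letterSet(String word) {
        Set<Character> letters = new HashSet<>();
        for (char c : word.toCharArray()) {
            letters.add(c);
        }
        return letters;
    }

    // A1 (as read here): both words use the same set of distinct letters.
    static boolean sameLetterSet(String w1, String w2) {
        return letterSet(w1).equals(letterSet(w2));
    }

    // A2 (as read here): each letter of w1 at position p appears in w2
    // at some position between p - n and p + n, clipped to w2's bounds.
    static boolean withinWindow(String w1, String w2, int n) {
        for (int p = 0; p < w1.length(); p++) {
            int from = Math.max(0, p - n);
            int to = Math.min(w2.length() - 1, p + n);
            boolean found = false;
            for (int q = from; q <= to; q++) {
                if (w2.charAt(q) == w1.charAt(p)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameLetterSet("aabb", "abbb"));
        System.out.println(withinWindow("aabb", "abab", 1));
        System.out.println(withinWindow("aabb", "bbaa", 1));
    }
}
```

Under this reading, A1 treats "aabb" and "abbb" as similar although the question does not, and A2 as written is not symmetric in the two words, which is why the definition needs to be pinned down first.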
