Problems With the Current Approach / Other Answers
There are a lot of answers, but none of them is very efficient.
For every letter in a sub-anagram candidate, we search through a list and remove letters. One search takes linear time. Since we have to do a search for each letter, we end up with a quadratic time complexity.
Some people suggested using a set instead of a list. Searching in a set takes constant time, so we would end up with linear time. However, the set approach fails when the same letter occurs multiple times.
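To make the problem with sets concrete, here is a minimal sketch. The helper isSubAnagramUsingSets is hypothetical and assumes a membership-only check; the set-based answers may differ in detail, but any set loses the letter counts in the same way.

```java
import java.util.HashSet;
import java.util.Set;

public class SetPitfall {
    // Hypothetical helper, shown only to illustrate the flaw of a set-based check.
    static boolean isSubAnagramUsingSets(String base, String candidate) {
        Set<Character> baseLetters = new HashSet<>();
        for (char c : base.toLowerCase().toCharArray()) {
            baseLetters.add(c); // duplicates collapse: "food" becomes {f, o, d}
        }
        for (char c : candidate.toLowerCase().toCharArray()) {
            if (!baseLetters.contains(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "dod" needs two 'd's, but "food" has only one; the set cannot tell.
        System.out.println(isSubAnagramUsingSets("food", "dod"));
    }
}
```

This reports "dod" as a sub-anagram of "food" even though it is not, because the set no longer knows how often each letter occurs.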
The presented solutions are also slow because of constant factors. When we use List<Character> or Set<Character>, the chars of the String have to be boxed inside Character objects. Creating and handling these objects is much slower than using the primitive char type.
Solution
Multisets
We can use a multiset (also known as a bag) to represent the letters in a word. For each word, we create a multiset of its letters and check whether that multiset is a subset of the base word's letter multiset.
Example
Base word "Food" has the multiset {f, o, o, d}.
Word "do" has the multiset {d, o}.
Word "dod" has the multiset {d, d, o}.
{d, o} is a subset of {f, o, o, d} ==> do is a sub-anagram of food.
{d, o, d} is not a subset of {f, o, o, d} ==> dod is not a sub-anagram of food.
Storing Multisets
Since we know that only the characters 'a' to 'z' occur, we use an int array to represent a multiset. The value of array[0] is the number of 'a's; the value of array[1] is the number of 'b's, and so on.
array[1] can also be written as array['b' - 'a'].
Example
The word "Food" with the multiset {f, o, o, d} is represented by the array
// Entry for: a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
int[] multiSet = {0,0,0,1,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0};
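Building such an array for an arbitrary word could look like the following sketch. The helper name letterCounts is hypothetical, and the code assumes the word contains only the letters a-z or A-Z; any other character would index outside the array.

```java
import java.util.Arrays;

public class LetterCounts {
    // Hypothetical helper: counts each letter of a word, assuming only a-z / A-Z occur.
    static int[] letterCounts(String word) {
        int[] counts = new int[26]; // one entry per letter 'a'..'z'
        String lower = word.toLowerCase();
        for (int i = 0; i < lower.length(); i++) {
            counts[lower.charAt(i) - 'a']++; // 'a' -> 0, 'b' -> 1, ...
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the 26 counters for "Food": 1 for d and f, 2 for o, 0 elsewhere.
        System.out.println(Arrays.toString(letterCounts("Food")));
    }
}
```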
Subset Check
a is a subset of b if and only if a[i] <= b[i] for all i.
When we do the subset test while computing the multiset a, we don't have to check all 26 array entries, but only the entries which were set to a value greater than zero.
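As a point of comparison, the plain subset test over all 26 entries, without the shortcut just described, might be sketched like this. The names isSubset and letterCounts are hypothetical, and the code again assumes that only the letters a-z or A-Z occur.

```java
public class SubsetCheck {
    // Hypothetical helper: counts each letter of a word, assuming only a-z / A-Z occur.
    static int[] letterCounts(String word) {
        int[] counts = new int[26];
        String lower = word.toLowerCase();
        for (int i = 0; i < lower.length(); i++) {
            counts[lower.charAt(i) - 'a']++;
        }
        return counts;
    }

    // a is a subset of b iff no letter occurs more often in a than in b.
    static boolean isSubset(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] food = letterCounts("Food");
        System.out.println(isSubset(letterCounts("do"), food));  // true
        System.out.println(isSubset(letterCounts("dod"), food)); // false
    }
}
```

Checking while counting saves the separate pass over the array and allows stopping at the first letter that is used too often.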
Re-use Work
We want to check a lot of words for one base word. We can re-use the multiset for the base word and don't have to compute it over and over again.
Instead of writing a method that returns true or false, we write a method that returns the list of all sub-anagrams for a given base word and a given dictionary (list of words to be checked).
Minor optimizations
If a word is longer than the base word, it cannot be a sub-anagram. In such cases, we don't have to compute the multiset for that word.
Implementation
import java.util.ArrayList;
import java.util.List;

public static List<String> subAnagrams(String base, List<String> dictionary) {
    // One counter per letter 'a'..'z', i.e. 26 entries ('z' - 'a' alone would be 25).
    int[] usableChars = new int['z' - 'a' + 1];
    base = base.toLowerCase();
    for (int i = 0; i < base.length(); ++i) {
        ++usableChars[base.charAt(i) - 'a'];
    }
    List<String> subAnagrams = new ArrayList<>();
    for (String candidate : dictionary) {
        // A candidate longer than the base word can never be a sub-anagram.
        boolean isSubAnagram = candidate.length() <= base.length();
        candidate = candidate.toLowerCase();
        int[] usedChars = new int['z' - 'a' + 1];
        // Count and compare in one pass; stop at the first letter used too often.
        for (int i = 0; isSubAnagram && i < candidate.length(); ++i) {
            int charIndex = candidate.charAt(i) - 'a';
            isSubAnagram = ++usedChars[charIndex] <= usableChars[charIndex];
        }
        if (isSubAnagram) {
            subAnagrams.add(candidate);
        }
    }
    return subAnagrams;
}
Example usage
public static void main(String[] args) {
    List<String> dict = new ArrayList<>();
    dict.add("Do");
    dict.add("Odd");
    dict.add("Good");
    dict.add("World");
    dict.add("Foo");
    System.out.println(subAnagrams("Food", dict));
}
prints [do, foo]