13

The brute force way can solve the problem in O(n!): basically calculate all the permutations of the letters and check each result against the dictionary. I am looking for ways to improve the complexity. I can think of building a tree out of the dictionary, but checking all letter permutations is still O(n!). Are there better ways to solve this problem?

Letters can have duplicates.

The API for the function looks like this:

List<String> findValidWords(Dict dict, char letters[])
apadana

12 Answers

15

Assume that letters only contains letters from a to z.

Use an integer array to count the number of occurrences of each character in letters.

For each word in the dictionary, check whether any character in the word appears more times than allowed; if not, add the word to the result.

    List<String> findValidWords(List<String> dict, char letters[]){
        // Count how many times each letter is available.
        int[] avail = new int[26];
        for(char c : letters){
            int index = c - 'a';
            avail[index]++;
        }
        List<String> result = new ArrayList<>();
        for(String word : dict){
            // Count the word's letters, stopping as soon as one exceeds what is available.
            int[] count = new int[26];
            boolean ok = true;
            for(char c : word.toCharArray()){
                int index = c - 'a';
                count[index]++;
                if(count[index] > avail[index]){
                    ok = false;
                    break;
                }
            }
            if(ok){
                result.add(word);
            }
        }
        return result;
    }

So we can see that the time complexity is O(m*k), where m is the number of words in the dictionary and k is the maximum number of characters in a word.

Pham Trung
Let's take the example of the English dictionary, where the letters given are only 5 of the whole alphabet. Your solution would require way too much additional time to solve the problem. – Vishal Dec 26 '15 at 16:23
  • @Vishal I don't think 6 iterations is `way too much additional time`. Can you give me an example in your case where a word needs more than 6 iterations to determine whether it is valid or invalid? Feel free to add your solution also. – Pham Trung Dec 26 '15 at 17:27
Let's assume that we have an English dictionary with 200,000 words and we need to determine all the words which include only A, B, C, D, and E. So it does a lot of unnecessary processing in this case. – Vishal Dec 27 '15 at 03:01
I do not have a ready solution, but I think it is possible to achieve a better solution with the help of a trie data structure. It solely depends on the intention of the program. If the requirement is to search for the valid words only once, yes, your answer is perfect. But if searching is a continuous process, the solution must first create a searchable data structure, and then perform the search operation on the dictionary data. – Vishal Dec 27 '15 at 03:02
  • A dictionary is usually very large. It might not be a good idea to check all the words in a dictionary. – Aditya Joshee Oct 20 '19 at 04:06
  • @AdityaJoshee Please elaborate on your idea. You can sort it, hash it, or divide the dictionary across multiple machines, but you will still need to go through the dictionary at least once. – Pham Trung Oct 23 '19 at 10:03
10

You can sort each word in your dictionary so that the letters appear in the same order as they do in the alphabet, and then build a trie out of your sorted words, where each node contains a list of all words whose sorted letters spell out the path to that node. This takes linear time in the total letter length of the dictionary.

Then, given a set of query letters, sort the letters the same way and proceed through the trie using depth-first search in all possible directions that use a subset of your letters from left to right. Any time you reach a node in the trie that contains words, output those words.

Each path you explore can be charged to at least one word in the dictionary, so the worst-case complexity to find all nodes that contain words you can make is O(kn), where n is the number of words in the dictionary and k is the maximum number of letters in a word. However, for somewhat restricted sets of query letters, the running time should be much faster per query.
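
Here is a minimal Java sketch of this approach, under the assumptions above (the class and method names are illustrative, not from the answer): each word is stored under its alphabetically sorted letters, and the query walks the trie depth-first, optionally skipping any of the sorted query letters.

    import java.util.*;

    class SortedLetterTrie {
        private static class Node {
            Map<Character, Node> children = new HashMap<>();
            List<String> words = new ArrayList<>(); // words whose sorted letters end at this node
        }

        private final Node root = new Node();

        // Insert a dictionary word under its alphabetically sorted letters.
        void add(String word) {
            char[] key = word.toCharArray();
            Arrays.sort(key);
            Node cur = root;
            for (char c : key) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.words.add(word);
        }

        // Collect all words buildable from the given letters.
        List<String> findValidWords(char[] letters) {
            char[] sorted = letters.clone();
            Arrays.sort(sorted);
            List<String> result = new ArrayList<>();
            dfs(root, sorted, 0, result);
            return result;
        }

        // At each step either skip letters[i] or descend along it if such a child exists.
        private void dfs(Node node, char[] letters, int i, List<String> result) {
            result.addAll(node.words);
            for (int j = i; j < letters.length; j++) {
                if (j > i && letters[j] == letters[j - 1]) continue; // avoid exploring duplicate branches twice
                Node child = node.children.get(letters[j]);
                if (child != null) {
                    dfs(child, letters, j + 1, result);
                }
            }
        }
    }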

user2566092
4

Here is the algorithm that will find all words that can be formed from a set of letters in O(1). We will represent words with their spectra and store them in a prefix tree (aka trie).

General Description

The spectrum of a word W is an array S of size N, such that S(i) is the number of occurrences (aka frequency) of an A(i) letter in the word W, where A(i) is the i-th letter of a chosen alphabet and N is its size.

For example, in the English alphabet, A(0) is A, A(1) is B, ... , A(25) is Z. A spectrum of the word aha is <2,0,0,0,0,0,0,1,0,...,0>.

We will store the dictionary in a prefix trie, using spectrum as a key. The first token of a key is the frequency of letter A, the second is the frequency of letter B and so on. (From here and below we will use the English alphabet as an example).

Once formed, our dictionary will be a tree with height 26 and a width that varies at each level, depending on the popularity of the letter. Basically, each layer will have a number of subtrees equal to the maximum frequency of this letter over all words in the provided dictionary.

Since our task is not only to decide whether we can build a word from the provided set of characters but also to find these words (a search problem), we need to attach the words to their spectra (as the spectral transformation is not invertible; consider the spectra of the words read and dear). We will attach a word to the end of the path that represents its spectrum.

To find whether we can build a word from a provided set, we will build the spectrum of the set, and find all paths in the prefix trie with frequencies bounded by the corresponding frequencies of the set's spectrum. (Note, we are not forced to use all letters from the set, so if a word uses fewer letters, we can still build it. Basically, our requirement is that for every letter in the word its frequency should be less than or equal to the frequency of the same letter in the provided set.)

The complexity of the search procedure doesn't depend on the length of the dictionary or the length of the provided set. On average, it is equal to 26 times the average frequency of a letter. Given the English alphabet, this is quite a small constant factor. For other alphabets, this might not be the case.

Reference implementation

I will provide a reference implementation of an algorithm in OCaml.

The dictionary data type is recursive:

type t = {
  dict : t Int.Map.t;
  data : string list;
}

(Note: this is not the best representation; it would probably be better to represent it as a sum type, e.g., type t = Dict of t Int.Map.t | Data of string list, but I found it easier to implement with the above representation.)

We can generalize the algorithm over the spectrum function, either using a functor, or by just storing the spectrum function in the dictionary, but for simplicity we will just hardcode the English alphabet in its ASCII representation:

let spectrum word =
  let index c = Char.(to_int (uppercase c) - to_int 'A') in
  let letters = Char.(to_int 'Z' - to_int 'A' + 1) in
  Array.init letters ~f:(fun i ->
      String.count word ~f:(fun c -> index c = i))

Next, we will define the add_word function of type dict -> string -> dict, which will add a new path to our dictionary by decomposing a word into its spectrum and adding each constituent. Each addition requires exactly 26 iterations, not including the spectrum computation. Note that the implementation is purely functional and doesn't use any imperative features; each call to add_word returns a new data structure.

let add_word dict word =
  let count = spectrum word in
  let rec add {dict; data} i =
    if i < Array.length count then {
      data;
      dict = Map.update dict count.(i) ~f:(function
          | None -> add empty (i+1)
          | Some sub -> add sub (i+1))
    } else {empty with data = word :: data} in
  add dict 0

We are using the following definition of the empty value in the add function:

let empty = {dict = Int.Map.empty; data=[]}

Now let's define the is_buildable function of type dict -> string -> bool that will decide whether the given set of characters can be used to build any word in the dictionary. Although we could express it via the search, by checking the size of the found set, we would still prefer a specialized implementation, as it is more efficient and easier to understand. The definition of the function closely follows the general description provided above. Basically, for every character in the alphabet, we check whether there is an entry in the dictionary with a frequency less than or equal to the frequency in the building set. If we have checked all letters, then we have proved that we can build at least one word with the given set.

let is_buildable dict set =
  let count = spectrum set in
  let rec find {dict} i =
    i >= Array.length count ||
    Sequence.range 0 count.(i) ~stop:`inclusive |>
    Sequence.exists ~f:(fun cnt -> match Map.find dict cnt with
      | None -> false
      | Some dict -> find dict (i+1)) in
  find dict 0

Now, let's actually find the set of all words, that are buildable from the provided set:

let build dict set =
  let count = spectrum set in
  let rec find {dict; data} i =
    if i < Array.length count then
      Sequence.range 0 count.(i) ~stop:`inclusive |>
      Sequence.concat_map ~f:(fun cnt -> match Map.find dict cnt with
          | None -> Sequence.empty
          | Some dict -> find dict (i+1))
    else Sequence.of_list data in
  find dict 0

We will basically follow the structure of the is_buildable function, except that instead of proving that such a frequency exists for each letter, we will collect all the proofs by reaching the end of the path and grabbing the set of words attached to it.

Testing and example

For the sake of completeness, we will test it by creating a small program that reads a dictionary with each word on a separate line, and interacts with the user by asking for a set and printing the resulting set of words that can be built from it.

module Test = struct
  let run () =
    let dict =
      In_channel.(with_file Sys.argv.(1)
                    ~f:(fold_lines ~init:empty ~f:add_word)) in
    let prompt () =
      printf "Enter characters and hit enter (or Ctrl-D to stop): %!" in
    prompt ();
    In_channel.iter_lines stdin ~f:(fun set ->
        build dict set |> Sequence.iter ~f:print_endline;
        prompt ())
end

Here is an example interaction that uses the /usr/share/dict/american-english dictionary available on my machine (Ubuntu Trusty).

./scrabble.native /usr/share/dict/american-english
Enter characters and hit enter (or Ctrl-D to stop): read
r
R
e
E
re
Re
Er
d
D
Rd
Dr
Ed
red
Red
a
A
Ra
Ar
era
ear
are
Rae
ad
read
dear
dare
Dare
Enter characters and hit enter (or Ctrl-D to stop):

(Yep, the dictionary contains words like r and d that are probably not true English words. In fact, the dictionary has a word for each letter, so we can basically build a word from every non-empty set of alphabet letters.)

The full implementation, along with the build instructions, can be found on Gist.

ivg
2

A better way to do this is to loop through all the words in the dictionary and check whether each word can be built with the letters in the array.

nmore
1
  1. "Sign" the letters available by sorting them in order; that's O(m log m), where m is the number of letters.

  2. "Sign" each word in the dictionary by sorting the letters of the word in order; that's O(k log k), where k is the length of the word.

  3. Compare the letter signature to each word signature; that's O(min(m, k) * n), where n is the number of words in the dictionary. Output any word that matches (i.e., whose signature can be matched, in order, within the letter signature).

Assuming an English word list of approximately a quarter-million words, and no more than about half a dozen letters, that should be nearly instantaneous.
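
As an illustration, here is a small Java sketch of the signature comparison, assuming lowercase a-z and illustrative names: a word matches when its sorted signature appears, in order, inside the sorted letter signature.

    import java.util.*;

    class SignatureMatcher {
        // "Sign" a string by sorting its characters.
        static String sign(String s) {
            char[] a = s.toCharArray();
            Arrays.sort(a);
            return new String(a);
        }

        // True if wordSig can be matched, in order, against letterSig (a subsequence check).
        static boolean isSubsequence(String wordSig, String letterSig) {
            int i = 0;
            for (int j = 0; j < letterSig.length() && i < wordSig.length(); j++) {
                if (wordSig.charAt(i) == letterSig.charAt(j)) {
                    i++;
                }
            }
            return i == wordSig.length();
        }

        static List<String> findValidWords(List<String> dict, char[] letters) {
            String letterSig = sign(new String(letters));
            List<String> result = new ArrayList<>();
            for (String word : dict) {
                if (isSubsequence(sign(word), letterSig)) {
                    result.add(word);
                }
            }
            return result;
        }
    }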

user448810
1

I was recently asked the same question in a BankBazaar interview. I was given the option (he said that in a very subtle manner) to pre-process the dictionary in any way I wanted.

My first thought was to arrange the dictionary in a trie or ternary search tree, and generate all the words from the letters given. However you optimize it, that would take n! + (n-1)! + (n-2)! + (n-3)! + ... + n word checks (n being the number of letters) in the worst case, which was not acceptable.

The other way could be to check every dictionary word to see whether it can be made from the given letters. However you optimize it, this would take noOfDictionaryWords (m) * average size of dictionary words (k) in the worst case, which was again not acceptable.

Now I have n! + (n-1)! + (n-2)! + ... + n words which I would have to check in the dictionary, and I don't want to check them all, so what are the situations in which I only have to check a subset of them, and how can I group them?

If I only have to check combinations and not permutations, the count drops to 2^n.

So I have to pre-process the dictionary words in such a way that if I pass a combination, all of its anagrams are printed.

A data structure something like this: http://1.bp.blogspot.com/-9Usl9unQJpY/Vg6IIO3gpsI/AAAAAAAAAbM/oTuhRDWelhQ/s1600/hashmapArrayForthElement.png - a hash value made from the letters (irrespective of their positions and permutation), pointing to a list containing all the words made from those letters. Then we only need to check that hash value.

The answer I gave was to make the hash value by assigning a prime number to each letter of the alphabet and, while calculating the hash value of a word, multiplying the assigned values. This creates the problem of really big hash values, given that the 26th prime is 101, and of many null values in the map taking up space. We could optimize it a bit: rather than assigning primes lexicographically with a = 2, b = 3, c = 5, d = 7, ..., z = 101, we find the most-used letters (vowels, 's', 't', etc.) and assign them the small values. The interviewer accepted it but was not expecting that answer, so there is definitely another answer, for better or worse, but there is.
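
For illustration only (this is not the interviewer's expected answer), a rough Java sketch of such a prime-product index could look like the following; the names are hypothetical and BigInteger is used to sidestep the overflow problem mentioned above.

    import java.math.BigInteger;
    import java.util.*;

    class PrimeSignatureIndex {
        // One prime per letter, a = 2 ... z = 101, as in the lexicographic assignment above.
        private static final int[] PRIMES = {
            2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
            59, 61, 67, 71, 73, 79, 83, 89, 97, 101
        };

        private final Map<BigInteger, List<String>> index = new HashMap<>();

        // Anagrams multiply the same primes, so they share a signature.
        private static BigInteger signature(String letters) {
            BigInteger sig = BigInteger.ONE;
            for (char c : letters.toCharArray()) {
                sig = sig.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
            }
            return sig;
        }

        void add(String word) {
            index.computeIfAbsent(signature(word), k -> new ArrayList<>()).add(word);
        }

        // All dictionary words that are anagrams of the given combination of letters.
        List<String> lookup(String combination) {
            return index.getOrDefault(signature(combination), Collections.emptyList());
        }
    }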

  • Thanks for the answer. I am not sure what you mean by anagram in the 6th paragraph. Why do you want to print anagrams? – apadana May 30 '16 at 19:03
  • Also, did you look at the accepted answer? If yes, do you think it is a good one, or is there still another, better solution? I'd appreciate your input here, as I also got the feeling there might be another answer. – apadana May 30 '16 at 19:06
  • @apadana Suppose the dictionary contains all the anagrams "like", "kile", "ilke", "ekil", and our list of letters is {a, i, e, l, k, t}. Now, while making four-letter words from our letter list, we make "ielk" and check our data structure with the calculated hash value, which is the same for all the anagrams of "ielk", so it prints all its valid permutations. So we don't have to check 4! permutations, but rather just one. – Rahul Tibrewal May 31 '16 at 06:06
1

Swift 4

func findValidWords(in dictionary: [String], with letters: [Character]) -> [String] {
    var validWords = [String]()

    for word in dictionary {
        // Consume one available letter per character so duplicate letters are respected.
        var remaining = letters
        var canBeBuilt = true

        for char in word {
            if let index = remaining.index(of: char) {
                remaining.remove(at: index)
            } else {
                canBeBuilt = false
                break
            }
        }

        if canBeBuilt {
            validWords.append(word)
        }
    }

    return validWords
}

print(findValidWords(in: ["ape", "apples", "orange", "elapse", "lap", "soap", "bar", "sole"], with: ["a","p","l","e","s","o"]))

Output => ["ape", "lap", "soap", "sole"]

Andi Setiyadi
0

The following is a more efficient way:

1. Use counting sort to count all letters appearing in a word in the dictionary.
2. Do a counting sort on the collection of letters that you are given.
3. Compare: if the counts are the same, then the word can be made.
4. Do this for all words in the dictionary.

This will be inefficient for multiple such queries, so you can do the following:

1. Make a tuple for each word using counting sort.
2. Put the tuple in a tree or hashmap, keyed by the count entries.
3. When a query is given, do a counting sort and look up the tuple in the hashmap.

Time complexity:

The above method gives O(1) time complexity for a query and O(N) time complexity for hash table construction, where N is the number of words in the dictionary.
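
Here is a minimal Java sketch of the count-tuple index described above; the class name and the string serialization of the tuple are illustrative choices, not part of the answer.

    import java.util.*;

    class CountTupleIndex {
        private final Map<String, List<String>> index = new HashMap<>();

        // The count "tuple": 26 letter counts, serialized so it can be used as a map key.
        private static String countKey(String s) {
            int[] count = new int[26];
            for (char c : s.toCharArray()) {
                count[c - 'a']++;
            }
            return Arrays.toString(count);
        }

        void add(String word) {
            index.computeIfAbsent(countKey(word), k -> new ArrayList<>()).add(word);
        }

        // O(1) lookup (plus the counting sort of the query) of all words whose counts match exactly.
        List<String> query(String letters) {
            return index.getOrDefault(countKey(letters), Collections.emptyList());
        }
    }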

Vikram Bhat
0

(Cf. anagram search; e.g., using primes looks cleaner for a signature-based approach - collect for all non-equivalent "substrings of letters".)
Given the incentive, I'd (pre)order Dict by (set of characters that make up each word, increasing length) and loop over the subsets from letters, checking the validity of each word until too long (see the sketch below).
Alternatively, finding the set of words from dict out of chars from letters can be considered a multi-dimensional range query: with "eeaspl" specifying letters, valid words have zero to two "e", one or none of a, s, p, l, and no other characters at all - bounds on word length (no longer than letters, lower bound to taste) blend in nicely.
Then again, data structures like k-d-trees do well with few, selective dimensions. (Would-be comment: you do not mention alphabet cardinality, whether "valid" depends on capitalisation or diacritics, or whether "complexity" includes programmer effort or preprocessing of dict - the latter may be difficult to amortise if dict is immutable.)
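
A hypothetical Java sketch of the first idea - grouping dict by the set of characters in each word and looping over the subsets of letters - might look like this; the ordering by length and the duplicate-count check are omitted for brevity, and all names are illustrative.

    import java.util.*;

    class CharSetIndex {
        // Words grouped by the 26-bit mask of the distinct characters they use.
        private final Map<Integer, List<String>> byCharSet = new HashMap<>();

        private static int mask(CharSequence s) {
            int m = 0;
            for (int i = 0; i < s.length(); i++) {
                m |= 1 << (s.charAt(i) - 'a');
            }
            return m;
        }

        void add(String word) {
            byCharSet.computeIfAbsent(mask(word), k -> new ArrayList<>()).add(word);
        }

        List<String> query(char[] letters) {
            int full = mask(new String(letters));
            List<String> result = new ArrayList<>();
            // Standard trick for enumerating every subset of a bitmask.
            for (int sub = full; ; sub = (sub - 1) & full) {
                List<String> group = byCharSet.get(sub);
                if (group != null) {
                    result.addAll(group);
                }
                if (sub == 0) {
                    break;
                }
            }
            return result;
        }
    }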

greybeard
0

Swift 3

func findValidWords(wordsList: [String], string: String) -> [String] {

    let charCountsDictInTextPassed = getCharactersCountIn(string: string)
    var wordsArrayResult: [String] = []

    for word in wordsList {

        var canBeProduced = true
        let currentWordCharsCount = getCharactersCountIn(string: word)

        for (char, count) in currentWordCharsCount {
            if let charCountInTextPassed = charCountsDictInTextPassed[char], charCountInTextPassed >= count {
                continue
            } else {
                canBeProduced = false
                break
            }
        } // end for

        if canBeProduced {
            wordsArrayResult.append(word)
        } // end if
    } // end for

    return wordsArrayResult
}


// Get the count of each character in the string
func getCharactersCountIn(string: String) -> [String: Int]{
    var charDictCount:[String: Int] = [:]
    for char in string.characters {
        if let count = charDictCount[String(char)] {
            charDictCount[String(char)] = count + 1
        }else{
            charDictCount[String(char)] = 1
        }
    }//end for
    return charDictCount
}
Ehab Saifan
0

My English is not good so try to understand.

My approach uses bitwise operations to increase speed. It is still brute force, though.

FIRST STEP

We only consider the distinct characters in each word and mark their existence. English has 26 letters, so we need 26 bits. An integer has 32 bits. That's enough.

Now encode each word in the dictionary as an integer.

abcdddffg -> 123444667 -> 123467 (only distinct characters) -> 1111011 (bits) -> 123 (decimal number)

So 2,000,000 words will be converted into 2,000,000 integer numbers.

Now let's say you have this set of letters: a, b, c, d, e

abcde -> 12345 -> 1111100 (bits)

Do an AND operation and we have:

1111100 (abcde)
&
1111011 (abcdddffg, no e)
=
1111000 (result) => result != abcdddffg => word cannot be created

Another example, with a, b, c, d, e, f, g, h:

11111111 (abcdefgh)
&
11110110 (abcdddffg, no e and h)
= 
11110110 (result) => result == abcdddffg => word can be created

SECOND STEP

While converting a word to a number, also store its letter counts. If we find a match in the first step, we continue by checking whether the count of each letter is sufficient too.

Depending on the requirements, you might not need this second step.

COMPLEXITY

  • O(n) to convert the words to numbers and store the letter counts, where n is the number of words in the dictionary. This only needs to be done once.
  • O(n) for each search query.
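
A minimal Java sketch of the first step (the bitmask pre-filter) might look like this; the names are illustrative, and the second step would add a per-letter count check along the lines of the accepted answer.

    import java.util.*;

    class BitmaskFilter {
        private final List<String> words = new ArrayList<>();
        private final List<Integer> masks = new ArrayList<>(); // precomputed once per word

        // Bit i is set if letter ('a' + i) occurs in the string at least once.
        private static int mask(String s) {
            int m = 0;
            for (char c : s.toCharArray()) {
                m |= 1 << (c - 'a');
            }
            return m;
        }

        void add(String word) {
            words.add(word);
            masks.add(mask(word));
        }

        // First step only: keep words whose distinct letters are a subset of the given letters.
        List<String> candidates(char[] letters) {
            int available = mask(new String(letters));
            List<String> result = new ArrayList<>();
            for (int i = 0; i < words.size(); i++) {
                int wordMask = masks.get(i);
                if ((wordMask & available) == wordMask) { // AND keeps all of the word's bits only if every letter is available
                    result.add(words.get(i));
                }
            }
            return result;
        }
    }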
-1

If letters can be repeated, that means a word can be arbitrarily long. You would obviously cap this at the length of the longest word in the dictionary, but there are still too many candidate words to check. As nmore suggested, you'd rather iterate over the dictionary to do this.

List<String> findAllValidWords(Set<String> dict, char[] letters) {
  List<String> result = new LinkedList<>();
  Set<Character> charSet = new HashSet<>();
  for (char letter : letters) {
    charSet.add(letter);
  }
  for (String word : dict) {
    if (isPossible(word, charSet)) {
      result.add(word);
    }
  }
  return result;
}

boolean isPossible(String word, Set<Character> charSet) {
  // A word is possible if all its letters are contained in the given letter set
  for (int i = 0; i < word.length(); i++) {
    if (!charSet.contains(word.charAt(i))) {
      return false;
    }
  }
  return true;
}
Emre Colak