Checking if a word is a sub-Anagram of another (Java)

Question

The words "unsold" & "silo" are sub-anagrams of the word "insidiously". That is, they can be spelt using only letters from "insidiously". There are obviously many more, and this concept is the basis of a word game found in 'The Australian' newspaper.

I'm trying to write a program that takes two arguments - a word, and another that might be a sub-anagram of this word and returns true if it is. So far this is what I've got:

public boolean isAnswer(String word, String base)
    ArrayList<Character> characters = new ArrayList<>();
    for(char x : base.toCharArray)
    {
        characters.add(x)
    }
    for(char y : word.toCharArray)
    {
        if(characters.contains(x))
        {
            characters.remove(x)
        }
        else
        {
            return false;
        }
    return true;
    }

It does work, but if I'm looping through every word in the English dictionary this will be extremely taxing on memory. How can I do this without creating an ArrayList local variable?

"It does work" nope it's impossible, missing { } ; ( ) it can't compile so it can't work — azro, Jun 29 '17 at 12:57

Sujal Mandal · Answer 1 · 2017-06-29T08:47:36.443

If you want to improve your existing program, consider using a Set instead of a List, as it will:

Eliminate duplicate additions to your characters collection, saving space.
Save you some iterations in the next loop, saving time.

EDIT

However, this optimization does not work in the case pointed out in one of the comments.

Example: when the base contains only "ab" and the word is "aab".
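
The failing case can be reproduced in a few lines. This is only a minimal sketch: the class and method names (`SetCounterExample`, `isAnswerWithSet`) are made up for illustration and do not come from the question.

```java
import java.util.HashSet;
import java.util.Set;

class SetCounterExample {
    // Hypothetical Set-based check: it only asks whether each letter occurs in base at all.
    static boolean isAnswerWithSet(String word, String base) {
        Set<Character> letters = new HashSet<>();
        for (char c : base.toCharArray()) {
            letters.add(c); // duplicate letters collapse into one entry here
        }
        for (char c : word.toCharArray()) {
            if (!letters.contains(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "aab" needs two 'a's, but "ab" supplies only one; the Set cannot tell the difference.
        System.out.println(isAnswerWithSet("aab", "ab")); // prints true, which is the wrong answer
    }
}
```

Because the Set forgets how often a letter occurs, any approach built on it needs a separate count per letter to be correct.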

edited Jun 29 '17 at 08:47

answered Jun 29 '17 at 08:29

Sujal Mandal

Using a Set may not be helpful because duplicate characters are relevant for this task. A `word` 'aab' for example would only fit to `base` when it contains 'aab' and not when it contains only 'ab'. – Florian S. Jun 29 '17 at 08:37

Florian S. · Answer 2 · 2017-06-29T13:33:25.970

You could remove the letters directly from base. This is not very efficient and creates a lot of String objects, but it is very easy to read:

public boolean isAnswer(String word, String base)
{
  for (char ch : word.toCharArray())
  {
    base = base.replaceFirst("" + ch, "");
  }
  return base.trim().length() == 0;
}

edited Jun 29 '17 at 13:33

answered Jun 29 '17 at 08:33

Florian S.

But you will produce a new String on each iteration; this obviously can't be a very efficient solution. – Krzysztof Cichocki Jun 29 '17 at 08:55
Nice idea to save some code, but does not work for *sub*-anagrams. `do` is a sub-anagram of `food`, but the method will return `false`. Also watch out for cases where letters occur multiple times. `replace` replaces all occurrences. – Socowi Jun 29 '17 at 13:16
You are right. I was not aware that the replace() method for characters replaces all occurrences of a character. Modifying the above example to use Strings. – Florian S. Jun 29 '17 at 13:34
Thanks Florian, this is an interesting method. @Socow – cornelius Jun 30 '17 at 08:56
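
As the comments point out, the method above only returns true when every letter of base gets used up, so it accepts full anagrams only. A sketch of how the same replace idea might be adapted to sub-anagrams is shown here. The class name `ReplaceFirstSketch` is made up, and the early return and `Pattern.quote` call are additions that are not part of the original answer.

```java
import java.util.regex.Pattern;

class ReplaceFirstSketch {
    static boolean isAnswer(String word, String base) {
        for (char ch : word.toCharArray()) {
            String letter = String.valueOf(ch);
            if (!base.contains(letter)) {
                return false; // base has run out of this letter
            }
            // replaceFirst takes a regex, so quote the letter to match it literally.
            base = base.replaceFirst(Pattern.quote(letter), "");
        }
        // Leftover letters in base are fine for a sub-anagram.
        return true;
    }
}
```

This still creates one new String per letter, so the efficiency concern raised in the comments applies to it as well.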

azro · Answer 3 · 2017-06-29T12:56:50.907

Your code is missing many {}, ; and (), so it clearly can't compile, let alone work ^^. I also changed the order of the "if" and how the letters of base are added:

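import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public boolean isAnswer(String word, String base) {
    // Arrays.asList(base.toCharArray()) would give a List<char[]>, so box each char via a stream instead.
    List<Character> characters = base.chars()
            .mapToObj(c -> (char) c)
            .collect(Collectors.toCollection(ArrayList::new));
    for (char y : word.toCharArray()) {
        if (!characters.contains(y)) {
            return false;
        }
        // Box explicitly: remove(y) with a char would resolve to remove(int index).
        characters.remove(Character.valueOf(y));
    }
    return true;
}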

edited Jun 29 '17 at 12:56

answered Jun 29 '17 at 08:33

azro

X is also out of scope when it's used – Michael Jun 29 '17 at 08:37
@Michael yes, it's 'y' in fact ^^. The '()' after toCharArray was missing too, plus other corrections – azro Jun 29 '17 at 12:55

Krzysztof Cichocki · Answer 4 · 2017-06-29T10:18:40.877

I believe this would be the solution that should run fast and consume the smallest amount of memory:

public class Snippet {

public static void main(String[] args) {

    System.out.println(isAnswer("unsold", "insidiously"));
    System.out.println(isAnswer("silo", "insidiously"));
    System.out.println(isAnswer("silk", "insidiously"));
}

public static boolean isAnswer(String word, String base) {
    char[] baseCharArr = base.toCharArray();
    for (int wi = 0; wi < word.length(); wi++) {
        boolean contains = false;
        char wchar = word.charAt(wi);
        for (int bi = 0; bi < baseCharArr.length; bi++) {
            if (baseCharArr[bi]==wchar) {
                baseCharArr[bi]='_'; // mark this letter as used by overwriting it with a character that cannot occur in a word
                contains=true;
                break;
            }
        }
        if (!contains) {
            return false;
        }
    }
    return true;
}

}

Okay this is interesting, you've just used baseCharArr[bi] = '_' where I've used characters.remove(x). Except that the loop will go through every character even if it finds one that doesn't match the base. So it can be made a fair bit faster if we correct that. — cornelius, Jun 30 '17 at 09:11
Take a look at: `if (!contains) { return false; }` - clearly it will return false if any of the required letters is missing — Krzysztof Cichocki, Jun 30 '17 at 11:03

Yati Sawhney · Answer 5 · 2017-06-29T08:58:19.760

I would suggest using a java.util.Set to avoid unnecessary iterations. Please find the code below:

private static boolean isSubAnagram() {
        String str  = "insidiously";
        String anagram = "siloy";

        Set<Character> set = new HashSet<Character>();
        for(int i = 0 ; i < str.length() ; ++i){
            set.add(Character.valueOf(str.charAt(i)));
        }

        int count = 0;
        for(int i = 0 ; i < anagram.length() ; ++i){
            if(set.contains(anagram.charAt(i))){
                ++count;
            }
        }

        return count == anagram.length();

    }

If the letter counts matter, i.e. a letter may be used in the so-called sub-anagram at most as often as it occurs in the base string, then go for:

private static boolean isSubAnagram() {
    String str  = "insidiously";
    String anagram = "siloyl";

    List<Character> list = new ArrayList<Character>();
    for(int i = 0 ; i < str.length() ; ++i){
        list.add(Character.valueOf(str.charAt(i)));
    }               

    for(int i = 0 ; i < anagram.length() ; ++i){
        char curChar = anagram.charAt(i);
        if(list.contains(curChar)){
            list.remove(Character.valueOf(curChar)); // boxed, so this calls remove(Object), not remove(int index)
            continue;
        }else{
            return false;
        }
    }

    return true;
}

Set is unsuitable. If your base word uses a character twice or more, you should be allowed to use that character in a sub-anagram the same number of times. — Michael, Jun 29 '17 at 08:44
I am not aware of the word game. But if that's the case set shouldn't be used. I will edit this. Thanks! — Yati Sawhney, Jun 29 '17 at 08:46
@Michael where did you get the information that a letter can appear at most as many times as in the base word? The OP didn't give such a constraint. — Krzysztof Cichocki, Jun 29 '17 at 08:54

Michael · Answer 6 · 2017-06-29T09:12:56.073

One optimisation might be to first ensure that the word is not longer than the base.

public boolean isAnswer(String word, String base)
{
    if (word.length() > base.length()) return false;
    //...
}

I suspect if the words are exactly the same length, there may be a faster way than comparing all of the characters:

public boolean isAnswer(String word, String base)
{
    if (word.length() > base.length()) {
        return false;
    }
    else if (word.length() == base.length()) {
        return isFullAnagram(); // I'll leave the implementation of this up to you
    }
    //...
}
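
The answer leaves `isFullAnagram` open. One common way to write such a check is to sort the letters of both words and compare the results. This is just a sketch with an assumed two-argument signature and a made-up class name, and whether it is actually faster than the general check would need measuring.

```java
import java.util.Arrays;

class FullAnagramSketch {
    // Assumed signature: the original call takes no arguments.
    static boolean isFullAnagram(String word, String base) {
        if (word.length() != base.length()) {
            return false;
        }
        char[] w = word.toCharArray();
        char[] b = base.toCharArray();
        // Two words are full anagrams exactly when their sorted letters are identical.
        Arrays.sort(w);
        Arrays.sort(b);
        return Arrays.equals(w, b);
    }
}
```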

The next step in optimising this would be to ensure you're not naively trying every word in the dictionary:

// Don't do this
public static void main(String... args)
{
    String base = "something";
    for (final String word : dictionary)
    {
        if (isAnswer(word, base)) // do something
    }
}
// Don't do this

You have a big advantage in that any dictionary text file worth its salt will be pre-sorted. A basic optimisation would be to chunk your dictionary into 26 files - one for words starting with each letter - and skip any files which can't possibly match.

public static void main(String... args)
{
    String base = "something";
    Set<Character> characters = // populate with chars from base

    for (final Section section : dictionary)
    {
        if (characters.contains(section.getChar()))
        {
            for (final String word : section)
            {
                if (isAnswer(word, base)) // do something
            }
        }
    }
}
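
A runnable approximation of the sectioning idea is sketched here. It is not the answer's actual code: an in-memory map stands in for the 26 files, and all names (`SectionedDictionarySketch`, `bySection`, `candidates`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SectionedDictionarySketch {
    // Group words by their first letter; each map entry plays the role of one file.
    static Map<Character, List<String>> bySection(List<String> dictionary) {
        Map<Character, List<String>> sections = new TreeMap<>();
        for (String word : dictionary) {
            if (word.isEmpty()) {
                continue;
            }
            sections.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
        }
        return sections;
    }

    // Keep only words from sections whose letter occurs in base; the rest can never match.
    static List<String> candidates(String base, Map<Character, List<String>> sections) {
        Set<Character> letters = new HashSet<>();
        for (char c : base.toCharArray()) {
            letters.add(c);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<Character, List<String>> section : sections.entrySet()) {
            if (letters.contains(section.getKey())) {
                result.addAll(section.getValue());
            }
        }
        return result;
    }
}
```

The surviving candidates still have to go through isAnswer. The Set is safe here because it is only used to skip whole sections, not to decide the final answer.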

The next thing I would do is to look at parallelising this process. A basic approach would be to run each section on its own thread (so you're looking at up to about 12 threads for most common English words).

public static void main(String... args)
{
    String base = "something";
    Set<Character> characters = // populate with chars from base

    for (final Section section : dictionary)
    {
        if (characters.contains(section.getChar()))
        {
            startMyThread(section, base);
        }
    }
}

You could get the threads to return a Future that you can check at the end. I'll leave that detail up to you.
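
One possible shape for the Future-based variant uses an ExecutorService from java.util.concurrent. This is a sketch under assumptions: sections are modelled as plain lists of words, the names `ParallelSectionsSketch` and `search` are made up, and the small `isAnswer` here is only a stand-in for whichever correct check you end up using.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSectionsSketch {
    // Stand-in check: remove each letter of word from a copy of base.
    static boolean isAnswer(String word, String base) {
        StringBuilder remaining = new StringBuilder(base);
        for (char c : word.toCharArray()) {
            int at = remaining.indexOf(String.valueOf(c));
            if (at < 0) {
                return false;
            }
            remaining.deleteCharAt(at);
        }
        return true;
    }

    static List<String> search(String base, List<List<String>> sections) {
        int threads = Math.max(1, Math.min(sections.size(), Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per section; each task returns the matches it found.
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> section : sections) {
                futures.add(pool.submit(() -> {
                    List<String> hits = new ArrayList<>();
                    for (String word : section) {
                        if (isAnswer(word, base)) {
                            hits.add(word);
                        }
                    }
                    return hits;
                }));
            }
            // Collect the results in submission order.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A fixed pool sized to the available processors is used here rather than one thread per section, on the assumption that this is CPU-bound work and extra threads would not help.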

A platform like CUDA allows you to use very high concurrency by pushing computation to the GPU. You could have hundreds of threads running simultaneously. I'm not sure what a good strategy would look like in this case.

I'm working on the assumption that you'll only have to deal with the 26 letters of the Roman alphabet. Every such game I've seen in newspapers avoids words with diacritics: café, fiancée, naïve etc.

Socowi · Answer 7 · 2017-06-29T16:18:23.717

Problems With the Current Approach / Other Answers

There are a lot of answers, but none of them is very efficient.

For every letter in a sub-anagram candidate, we search through a list and remove letters. One search takes linear time. Since we have to do a search for each letter, we end up with a quadratic time complexity.

Some people suggested using a set instead of a list. Searching in a set takes constant time, so we would end up with linear time. However, the set approach fails when the same letter occurs multiple times.

The presented solutions are also slow because of constant factors. When we use List<Character> or Set<Character>, the chars of the String have to be boxed inside Character objects. Creating and handling these objects is much slower than using the primitive char type.

Solution

Multisets

We can use a multiset (also known as a bag) to represent the letters in a word. For each word, we create a multiset of its letters and check whether that multiset is a subset of the base word's letter multiset.

Example

Base word "Food" has the multiset {f, o, o, d}.
Word "do" has the multiset {d, o}.
Word "dod" has the multiset {d, d, o}.

{d, o} is a subset of {f, o, o, d} ==> do is a sub-anagram of food.
{d, o, d} is not a subset of {f, o, o, d} ==> dod is not a sub-anagram of food.

Storing Multisets

Since we know that only the characters 'a' to 'z' occur, we use an int array to represent a multiset. The value of array[0] is the number of 'a's; the value of array[1] is the number of 'b's, and so on. array[1] can also be written as array['b' - 'a'].

Example

The word "Food" with the multiset {f, o, o, d} is represented by the array

// Entry for:     a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
int[] multiSet = {0,0,0,1,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0};
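
For illustration, such an array could be built from a word as follows. This is a minimal sketch: the class and method names are made up, and it assumes the word contains only the letters a to z in either case.

```java
import java.util.Locale;

class LetterCountsSketch {
    // Returns a 26-entry array where index (c - 'a') holds how often letter c occurs.
    static int[] toMultiset(String word) {
        int[] counts = new int[26];
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            counts[c - 'a']++; // assumes c is between 'a' and 'z'
        }
        return counts;
    }
}
```

Calling `toMultiset("Food")` yields 1 at the indices for 'd' and 'f', 2 at the index for 'o', and 0 everywhere else.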

Subset Check

a is a subset of b if and only if a[i] <= b[i] for all i.

When we do the subset test while computing the multiset a, we don't have to check all 26 array entries, but only the entries which were set to a value greater than zero.
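
Written out as a stand-alone method, the plain subset test (without the early-exit optimization just described) might look like the following sketch. The names are hypothetical, and both arrays are assumed to have the same length.

```java
class MultisetSubsetSketch {
    // a is a sub-multiset of b when no letter is needed more often than b provides it.
    static boolean isSubset(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This version always inspects up to 26 entries. Checking while counting, as described above, touches only the letters that actually occur in the candidate.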

Re-use Work

We want to check a lot of words for one base word. We can re-use the multiset for the base word and don't have to compute it over and over again. Instead of writing a method that returns true or false, we write a method that returns the list of all sub-anagrams for a given base word and a given dictionary (list of words to be checked).

Minor optimizations

If a word is longer than the base word, it cannot be a sub-anagram. In such cases, we don't have to compute the multiset for that word.

Implementation

import java.util.ArrayList;
import java.util.List;

public static List<String> subAnagrams(String base, List<String> dictionary) {
    // One counter per letter 'a'..'z'; 'z' - 'a' is 25, so add 1 to get 26 slots.
    int[] usableChars = new int['z' - 'a' + 1];
    base = base.toLowerCase();
    for (int i = 0; i < base.length(); ++i) {
        ++usableChars[base.charAt(i) - 'a'];
    }

    List<String> subAnagrams = new ArrayList<>();
    for (String candidate : dictionary) {
        // A candidate longer than base can never be a sub-anagram.
        boolean isSubAnagram = candidate.length() <= base.length();
        candidate = candidate.toLowerCase();
        int[] usedChars = new int['z' - 'a' + 1];
        // Stop as soon as a letter is used more often than base provides it.
        for (int i = 0; isSubAnagram && i < candidate.length(); ++i) {
            int charIndex = candidate.charAt(i) - 'a';
            isSubAnagram = ++usedChars[charIndex] <= usableChars[charIndex];
        }
        if (isSubAnagram) {
            subAnagrams.add(candidate);
        }
    }
    return subAnagrams;
}

Example usage

public static void main(String[] args) {
    List<String> dict = new ArrayList<>();
    dict.add("Do");
    dict.add("Odd");
    dict.add("Good");
    dict.add("World");
    dict.add("Foo");
    System.out.println(subAnagrams("Food", dict));  
}

prints [do, foo]