
I have the following working Java code for searching for a word against a list of words, and it works as expected:

import java.util.HashSet;
import java.util.Set;

public class Levenshtein {
    private int[][] wordMatrix;

    public Set<String> similarExists(String searchWord) {

        int maxDistance = searchWord.length();
        int curDistance;
        int sumCurMax;
        String checkWord;

        // a Set prevents duplicate words in the returned list
        Set<String> fuzzyWordList = new HashSet<>();

        for (Object word : Searcher.wordList) {
            checkWord = String.valueOf(word);
            curDistance = calculateDistance(searchWord, checkWord);
            // keep checkWord when the distance equals the length difference,
            // i.e. the search word's letters appear in order inside checkWord
            sumCurMax = maxDistance + curDistance;
            if (sumCurMax == checkWord.length()) {
                fuzzyWordList.add(checkWord);
            }
        }
        return fuzzyWordList;
    }

    public int calculateDistance(String inputWord, String checkWord) {
        wordMatrix = new int[inputWord.length() + 1][checkWord.length() + 1];

        // first row/column: distance from the empty prefix
        for (int i = 0; i <= inputWord.length(); i++) {
            wordMatrix[i][0] = i;
        }

        for (int j = 0; j <= checkWord.length(); j++) {
            wordMatrix[0][j] = j;
        }

        for (int i = 1; i < wordMatrix.length; i++) {
            for (int j = 1; j < wordMatrix[i].length; j++) {
                if (inputWord.charAt(i - 1) == checkWord.charAt(j - 1)) {
                    // matching characters cost nothing
                    wordMatrix[i][j] = wordMatrix[i - 1][j - 1];
                } else {
                    // cheapest of deletion, insertion and substitution
                    wordMatrix[i][j] = Math.min(
                            wordMatrix[i - 1][j] + 1,
                            Math.min(wordMatrix[i][j - 1] + 1,
                                     wordMatrix[i - 1][j - 1] + 1));
                }
            }
        }

        return wordMatrix[inputWord.length()][checkWord.length()];
    }

}
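
For illustration, the class can be called like this (a minimal sketch; it assumes Searcher.wordList has already been filled with the dictionary words):

import java.util.Set;

public class SearchDemo {
    public static void main(String[] args) {
        Levenshtein levenshtein = new Levenshtein();
        // collect all words whose distance equals the length difference
        Set<String> matches = levenshtein.similarExists("job");
        matches.forEach(System.out::println);
    }
}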

Right now, when I search for a word like job, it returns a list:

Output

joborienterede
jobannoncer
jobfunktioner
perjacobsen
jakobsen
jobprofiler
jacob
jobtitler
jobbet
jobdatabaserne
jobfunktion
jakob
jobs
studenterjobber
johannesburg
jobmuligheder
jobannoncerne
jobbaser
job
joberfaringer

As you can see, the output contains a lot of related words, but also unrelated ones like jakob, jacob, etc. That is correct according to the Levenshtein formula, but I would like to build further and write a method that can fine-tune my search so I get more relevant and related words.

I have worked on it for a few hours and have run out of creative ideas.

My question: Is it possible to fine-tune the existing method to return relevant/related words, or should I take another approach? In either case, I would appreciate input and inspiration on improving the search results.


UPDATE

After asking this question a long time ago, I still have not really found a solution, and I am coming back to it because it is time I got a useful answer. It is fine to supply the answer with Java code samples, but what is most important is a detailed answer describing the available methods and approaches for indexing the best and most relevant search results and ignoring irrelevant words. I know this is an open and endless area, but I need some inspiration to start somewhere.

Note: The oldest answer right now is based on one of the comment inputs and is not helpful; it just sorts by distance, which does not mean getting better search results/quality.

So I sorted by distance, and the results looked like this:

job
jobs
jacob
jakob
jobbet
jakobsen
jobbaser
jobtitler
jobannoncer
jobfunktion
jobprofiler
perjacobsen
johannesburg
jobannoncerne
joberfaringer
jobfunktioner
jobmuligheder
jobdatabaserne
joborienterede
studenterjobber

So the word jobbaser is relevant and jacob/jakob are not, yet the distance for jobbaser is bigger than for jacob/jakob. So that did not really help.
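
For reference, the sorting I did was essentially this (a minimal sketch on top of the similarExists and calculateDistance methods from my code above):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DistanceSort {
    public static void main(String[] args) {
        // Order the fuzzy matches by raw Levenshtein distance to the search
        // word; note this only sorts the candidates, it filters nothing out.
        Levenshtein lev = new Levenshtein();
        List<String> sorted = new ArrayList<>(lev.similarExists("job"));
        sorted.sort(Comparator.comparingInt(w -> lev.calculateDistance("job", w)));
        sorted.forEach(System.out::println);
    }
}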


General feedback regarding answers

  • @SergioMontoro's answer almost solves the problem.
  • @uSeemSurprised's answer solves the problem but needs continual manipulation.
  • @Gene's concept is excellent, but it relies on an external URL.

Thanks: I would like to personally thank all of you who contributed to this question; I have received nice answers and useful comments.

Special thanks to the answers from @SergioMontoro, @uSeemSurprised and @Gene; they are different but all valid and useful answers.

@D.Kovács points to an interesting solution.

I wish I could give the bounty to all of those answers. Choosing one answer and giving it the bounty does not mean the other answers are not valid; it only means that the particular answer I chose was the most useful for me.

– maytham-ɯɐɥʇʎɐɯ

Comments

  • You have said `jakob` is not related, but that requires understanding the meaning of the word. You will not be able to do much better with simple techniques such as Levenshtein distance and will need to start looking into natural language processing techniques. – DrYap Nov 15 '15 at 17:24
  • Why don't you sort by the distance that the Levenshtein algorithm returns? – bhspencer Nov 15 '15 at 17:29
  • Unless you define "relevant" you may never come to a satisfactory solution. – laune Nov 15 '15 at 17:32
  • What you compute is words from the list that contain the letters of the search word in the original order. This can be computed without this Levenshtein magic, and I don't know what you would like to infer from those words. – laune Nov 15 '15 at 17:46
  • Regarding bhspencer's proposal: since all the returned words have the same value, sorting them by that value may not enlighten you. – laune Nov 15 '15 at 17:53
  • The technology you're probably looking for is a semantic network. Implementing one is a big task because it must be created and trained. There are thousands of papers. Here's an example service: http://swoogle.umbc.edu/SimService/index.html – Gene Jan 02 '17 at 22:02
  • jacob might have been what the user is looking for if he is typing very fast and missed the a and c. You could ignore this fact and just look for words where the string is present; it's also faster than calculating the distance. But if (sorry, I don't have a better example now) you are looking for "asses", then you probably don't want words related to "assess", "masses", "classes", etc., but you would still get those unless you are using a method which understands semantics (very complex). – maraca Jan 03 '17 at 01:08
  • Do you have a limited dictionary? Like 1000 words related to the job market. Or is it basically all English words? – maraca Jan 04 '17 at 21:18
  • I was asking because the easiest semantic approach is probably tag-based (like Stack Overflow). So you would need to tag each word and also create tag synonyms. Then you can search via the tags and use the Levenshtein distance to get the closest tags for unknown words; a sketch follows this list. But that doesn't seem feasible for you. – maraca Jan 04 '17 at 21:33
  • No. This is one of the most complex problems in NLP. – DGoiko May 29 '19 at 15:57
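
A minimal sketch of the tag-based approach maraca describes, in Java (the TagSearch class, the tag map and its entries are made-up illustrations; Levenshtein is the class from the question):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TagSearch {
    // Hypothetical hand-built tag map: every dictionary word gets one or more tags.
    private static final Map<String, Set<String>> TAGS = new HashMap<>();
    static {
        TAGS.put("jobannoncer", new HashSet<>(Arrays.asList("job", "advert")));
        TAGS.put("jobbaser", new HashSet<>(Arrays.asList("job", "database")));
        TAGS.put("jakob", new HashSet<>(Arrays.asList("name")));
    }

    // Map the query onto the closest known tag via Levenshtein distance,
    // then return every word carrying that tag.
    public static Set<String> search(String query, Levenshtein lev) {
        String bestTag = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Set<String> tags : TAGS.values()) {
            for (String tag : tags) {
                int d = lev.calculateDistance(query, tag);
                if (d < bestDistance) {
                    bestDistance = d;
                    bestTag = tag;
                }
            }
        }
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : TAGS.entrySet()) {
            if (entry.getValue().contains(bestTag)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}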

5 Answers


Without understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is very easy to modify Levenshtein to compare syllables instead of characters. The hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you'll want to avoid syllable comparison of two single-syllable words, probably handling that case with standard Levenshtein (see the sketch after the code).

import net.davidashen.text.Hyphenator;

public class WordDistance {

    public static void main(String args[]) throws Exception {
        Hyphenator h = new Hyphenator();
        h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
        System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
    }

    /**
     * <p>
     * Calculate Syllable Levenshtein distance between two words </p>
     * The Syllable Levenshtein distance is defined as the minimal number of
     * case-insensitive syllables you have to replace, insert or delete to transform word1 into word2.
     * @return int
     * @throws NullPointerException if either word is <b>null</b>
     */
    public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
        if (s == null || t == null)
            throw new NullPointerException("Strings must not be null");

        final String hyphen = Character.toString((char) 173); // soft hyphen inserted by the hyphenator
        final String[] ss = h.hyphenate(s).split(hyphen);
        final String[] st = h.hyphenate(t).split(hyphen);

        final int n = ss.length;
        final int m = st.length;

        if (n == 0)
            return m;
        else if (m == 0)
            return n;

        int[] p = new int[n + 1]; // 'previous' cost array, horizontally
        int[] d = new int[n + 1]; // cost array, horizontally

        for (int i = 0; i <= n; i++)
            p[i] = i;

        for (int j = 1; j <= m; j++) {
            d[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }
            // copy current distance counts to 'previous row' distance counts
            int[] _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
        return p[n];
    }

}
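
As mentioned, two single-syllable words are better compared with standard character-level Levenshtein. A possible guard, as a sketch (SafeWordDistance and charLevenshtein are illustrative names; the fallback is the plain algorithm from the question):

import net.davidashen.text.Hyphenator;

public class SafeWordDistance {
    // Fall back to character-level Levenshtein when either word is a
    // single syllable, since syllable comparison is meaningless there.
    public static int distance(Hyphenator h, String s, String t) {
        final String hyphen = Character.toString((char) 173); // soft hyphen
        int sSyl = h.hyphenate(s).split(hyphen).length;
        int tSyl = h.hyphenate(t).split(hyphen).length;
        if (sSyl <= 1 || tSyl <= 1) {
            return charLevenshtein(s, t);
        }
        return WordDistance.getSyllableLevenshteinDistance(h, s, t);
    }

    // Standard character-level Levenshtein, equivalent to calculateDistance
    // from the question.
    private static int charLevenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
        return d[s.length()][t.length()];
    }
}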
– Serg M Ten

You can modify the Levenshtein distance by adjusting the scoring when consecutive characters match.

Whenever consecutive characters match, the score can be reduced, making the search more relevant.

E.g., let's say the factor by which we want to reduce the score is 10. Then if in a word we find the substring "job", we can reduce the score by 10 when we encounter "j", reduce it further by (10 + 20) when we find the string "jo", and finally reduce it by (10 + 20 + 30) when we find "job".

I have written C++ code below:

#include <bits/stdc++.h>

#define INF -10000000
#define FACTOR 10

using namespace std;

// memo[i][j][count]: best score for position i in inputWord and j in
// checkWord, with `count` consecutive matches immediately before them
double memo[100][100][100];

double Levenshtein(string inputWord, string checkWord, int i, int j, int count){
    if(i == inputWord.length() && j == checkWord.length()) return 0;    
    if(i == inputWord.length()) return checkWord.length() - j;
    if(j == checkWord.length()) return inputWord.length() - i;
    if(memo[i][j][count] != INF) return memo[i][j][count];

    double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
    if(inputWord[i] == checkWord[j]){
        ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
        ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, min(ans2, ans3));
    }else{
        ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
        ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
        ans = min(ans1, ans2);
    }
    return memo[i][j][count] = ans;
}

int main(void) {
    string word = "job";
    string wordList[40];
    vector< pair <double, string> > ans;
    for(int i = 0;i < 40;i++){
        cin >> wordList[i];
        for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
            for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
        }
        ans.push_back( make_pair(Levenshtein(word, wordList[i], 
            0, 0, 0), wordList[i]) );
    }
    sort(ans.begin(), ans.end());
    for(int i = 0;i < ans.size();i++){
        cout << ans[i].second << " " << ans[i].first << endl;
    }
    return 0;
}

Link to demo: http://ideone.com/4UtCX3

Here the FACTOR is taken as 10; you can experiment with other words and choose the appropriate value.

Also note that the complexity of the above Levenshtein distance has increased: it is now O(n^3) instead of O(n^2), as we now also keep track of the counter that counts how many consecutive characters we have encountered.

You can further play with the score by increasing it gradually after you find some consecutive substring and then a mismatch, instead of the current fixed score of 1 that is added to the overall score.

Also, in the above solution you can remove the strings that have a score >= 0, as they are not relevant at all; you can also choose some other threshold for that to get a more accurate search.
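
As an illustration, the thresholding could look like this (a sketch in Java for consistency with the question; score stands for a port of the weighted C++ Levenshtein above and is left as a stub):

import java.util.ArrayList;
import java.util.List;

public class ThresholdFilter {
    // Keep only candidates whose weighted score is below the threshold;
    // with the FACTOR-based bonuses, relevant words end up with negative scores,
    // so a threshold of 0 drops the irrelevant ones.
    public static List<String> filter(List<String> words, String searchWord,
                                      double threshold) {
        List<String> relevant = new ArrayList<>();
        for (String candidate : words) {
            if (score(searchWord, candidate) < threshold) {
                relevant.add(candidate);
            }
        }
        return relevant;
    }

    // Hypothetical port of the weighted C++ Levenshtein above.
    private static double score(String a, String b) {
        throw new UnsupportedOperationException("port the C++ Levenshtein here");
    }
}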

– uSeemSurprised


Since you asked, I'll show how the UMBC semantic network does at this kind of thing. Not sure it's what you really want:

import static java.lang.String.format;
import static java.util.function.Function.identity;
import static java.util.stream.Collectors.toMap;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.regex.Pattern;

public class SemanticSimilarity {
  private static final String GET_URL_FORMAT
      = "http://swoogle.umbc.edu/SimService/GetSimilarity?"
          + "operation=api&phrase1=%s&phrase2=%s";
  private static final Pattern VALID_WORD_PATTERN = Pattern.compile("\\w+");
  private static final String[] DICT = {
    "cat",
    "building",
    "girl",
    "ranch",
    "drawing",
    "wool",
    "gear",
    "question",
    "information",
    "tank" 
  };

  public static String httpGetLine(String urlToRead) throws IOException {
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream()))) {
      return reader.readLine();
    }
  }

  public static double getSimilarity(String a, String b) {
    if (!VALID_WORD_PATTERN.matcher(a).matches()
        || !VALID_WORD_PATTERN.matcher(b).matches()) {
      throw new RuntimeException("Bad word");
    }
    try {
      return Double.parseDouble(httpGetLine(format(GET_URL_FORMAT, a, b)));
    } catch (IOException | NumberFormatException ex) {
      return -1.0;
    }
  }

  public static void test(String target) throws IOException {
    System.out.println("Target: " + target);
    Arrays.stream(DICT)
        .collect(toMap(identity(), word -> getSimilarity(target, word)))
        .entrySet().stream()
        .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
        .forEach(System.out::println);
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    test("sheep");
    test("vehicle");
    test("house");
    test("data");
    test("girlfriend");
  }
}

The results are kind of fascinating:

Target: sheep
ranch=0.38563728
cat=0.37816614
wool=0.36558008
question=0.047607
girl=0.0388761
information=0.027191084
drawing=0.0039623436
tank=0.0
building=0.0
gear=0.0

Target: vehicle
tank=0.65860236
gear=0.2673374
building=0.20197356
cat=0.06057514
information=0.041832563
ranch=0.017701812
question=0.017145569
girl=0.010708235
wool=0.0
drawing=0.0

Target: house
building=1.0
ranch=0.104496084
tank=0.103863
wool=0.059761923
girl=0.056549154
drawing=0.04310725
cat=0.0418914
gear=0.026439993
information=0.020329408
question=0.0012588014

Target: data
information=0.9924584
question=0.03476312
gear=0.029112043
wool=0.019744944
tank=0.014537057
drawing=0.013742204
ranch=0.0
cat=0.0
girl=0.0
building=0.0

Target: girlfriend
girl=0.70060706
ranch=0.11062875
cat=0.09766617
gear=0.04835723
information=0.02449007
wool=0.0
question=0.0
drawing=0.0
tank=0.0
building=0.0
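
To combine this with the question's fuzzy list, the candidates could be re-ranked by similarity to the search word; a sketch (SemanticRanker is hypothetical glue code around getSimilarity above, and each comparison hits the remote service, so results should be cached in real use):

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SemanticRanker {
    // Order candidate words from most to least semantically similar to the
    // search word, using SemanticSimilarity.getSimilarity() defined above.
    public static List<String> rank(String searchWord, Set<String> candidates) {
        return candidates.stream()
                .sorted((a, b) -> Double.compare(
                        SemanticSimilarity.getSimilarity(searchWord, b),
                        SemanticSimilarity.getSimilarity(searchWord, a)))
                .collect(Collectors.toList());
    }
}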
– Gene


I tried the suggestion from the comments about sorting the matches by the distance returned by the Levenshtein algorithm, and it seems it does produce better results.

(As I could not find the Searcher class from your code, I took the liberty of using a different source for the word list, a different Levenshtein implementation, and a different language.)

Using the word list provided in Ubuntu and the Levenshtein implementation from https://github.com/ztane/python-Levenshtein, I created a small script that asks for a word and prints all the closest words and distances as tuples.

Code - https://gist.github.com/atdaemon/9f59ad886c35024bdd28

from Levenshtein import distance
import os

def read_dict():
    # yield one dictionary word per line
    with open('/usr/share/dict/words', 'r') as f:
        for line in f:
            yield line.strip()

inp = input('Enter a word : ')

wordlist = read_dict()
matches = []
for word in wordlist:
    dist = distance(inp, word)
    if dist < 3:
        matches.append((dist, word))
print(os.linesep.join(map(str, sorted(matches))))

Sample output -

Enter a word : job
(0, 'job')
(1, 'Bob')
(1, 'Job')
(1, 'Rob')
(1, 'bob')
(1, 'cob')
(1, 'fob')
(1, 'gob')
(1, 'hob')
(1, 'jab')
(1, 'jib')
(1, 'jobs')
(1, 'jog')
(1, 'jot')
(1, 'joy')
(1, 'lob')
(1, 'mob')
(1, 'rob')
(1, 'sob')
...

Enter a word : checker
(0, 'checker')
(1, 'checked')
(1, 'checkers')
(2, 'Becker')
(2, 'Decker')
(2, 'cheaper')
(2, 'cheater')
(2, 'check')
(2, "check's")
(2, "checker's")
(2, 'checkered')
(2, 'checks')
(2, 'checkup')
(2, 'cheeked')
(2, 'cheekier')
(2, 'cheer')
(2, 'chewer')
(2, 'chewier')
(2, 'chicer')
(2, 'chicken')
(2, 'chocked')
(2, 'choker')
(2, 'chucked')
(2, 'cracker')
(2, 'hacker')
(2, 'heckler')
(2, 'shocker')
(2, 'thicker')
(2, 'wrecker')
– Anish Tambe


This really is an open-ended question, but I would suggest an alternative approach, which uses for example the Smith–Waterman algorithm as described in this SO answer.

Another (more lightweight) solution would be to use other distance/similarity metrics from NLP (e.g., cosine similarity or Damerau–Levenshtein distance); a sketch of the latter follows.
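
For example, Damerau–Levenshtein (shown here in its restricted, optimal-string-alignment form) counts a transposition of adjacent characters as a single edit, where plain Levenshtein charges two (a sketch in Java for consistency with the question):

public class DamerauLevenshtein {
    // Restricted Damerau-Levenshtein (optimal string alignment): like
    // Levenshtein, but an adjacent transposition ("jbo" -> "job") costs 1.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("jbo", "job")); // 1 (plain Levenshtein: 2)
    }
}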

– D. Kovács