
I am trying to implement a search engine based on keyword search. Can anyone tell me which is the best (fastest) algorithm for searching by keywords?

What I need is:

My keywords:

search, faster, profitable

Their synonyms:

search: grope, google, identify, search   
faster: smart, quick, faster  
profitable: gain, profit  

Now I need to search a database for all possible permutations of the above synonyms to identify the best-matching entries.

Andrey Shchekin
user723644
  • Don't use MySQL for this. Use something like Lucene or Elasticsearch. – blockhead May 13 '11 at 05:14
  • Sounds to me like you already got your solution... you go through every permutation of the words in your list, and get a `SELECT ... WHERE ... LIKE $permutation`. It should take just a few seconds with your given list. –  May 13 '11 at 08:06

1 Answer


The best solution would be to use an existing search engine, like Lucene or one of its alternatives (see "Which are the best alternatives to Lucene?").

Now, if you want to implement that yourself (it's really a great problem to work on), you should have a look at the concept of an inverted index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basis.

The idea of an inverted index is that for each keyword (and its synonyms), you store the ids of the documents that contain that keyword. It is then very easy to look up the matching documents for a set of keywords, because you just compute the intersection (or the union, depending on what you want to do) of their lists in the inverted index. Example:

Let's assume this is your inverted index:

smart: [42,35]
gain: [42]
profit: [55]

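Now if you have a query "smart, gain", your matching documents are either the intersection of [42, 35] and [42], which is [42] (documents containing both words), or their union, [42, 35] (documents containing at least one of them).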
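As a rough illustration of the idea, here is a minimal Python sketch. The function names, the whitespace tokenization, and the three tiny documents are all assumptions made up for this example; the documents were chosen so that the resulting index matches the one shown above.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization on whitespace; a real engine would do much more.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, keywords):
    """Ids of documents containing every keyword (intersection of the lists)."""
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()


def search_any(index, keywords):
    """Ids of documents containing at least one keyword (union of the lists)."""
    postings = [index.get(word, set()) for word in keywords]
    return set().union(*postings)


# Hypothetical documents, picked to reproduce the example index.
documents = {
    35: "smart",
    42: "smart gain",
    55: "profit",
}
index = build_inverted_index(documents)

print(sorted(search_all(index, ["smart", "gain"])))  # [42]
print(sorted(search_any(index, ["smart", "gain"])))  # [35, 42]
```

Storing the document ids as sets keeps both operations short; sorted lists that are merged pairwise would be another reasonable choice for large indexes.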

To handle synonyms, you just need to extend your query to include all synonyms for the words in the initial query. Based on your example, a query "faster, profitable" would become "faster, smart, quick, profitable, gain, profit".
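A small sketch of that query expansion, using the synonym lists from the question. The names `expand_query` and `search_groups` are hypothetical, and combining a union inside each synonym group with an intersection across groups is only one possible design, not something the approach requires.

```python
# Synonym table taken from the question; each keyword maps to its synonyms.
SYNONYMS = {
    "search": ["grope", "google", "identify"],
    "faster": ["smart", "quick"],
    "profitable": ["gain", "profit"],
}


def expand_query(keywords, synonyms):
    """Return the keywords followed by their synonyms, without duplicates."""
    expanded = []
    for word in keywords:
        for term in [word, *synonyms.get(word, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


def search_groups(index, keywords, synonyms):
    """Documents containing at least one term from every keyword's synonym group."""
    result = None
    for word in keywords:
        group = [word, *synonyms.get(word, [])]
        # Union inside the group: any synonym counts as a match for the keyword.
        docs = set().union(*(index.get(term, set()) for term in group))
        # Intersection across groups: every keyword must be matched somehow.
        result = docs if result is None else result & docs
    return result or set()


# The example inverted index from the answer, written as a plain dict.
index = {"smart": {42, 35}, "gain": {42}, "profit": {55}}

print(expand_query(["faster", "profitable"], SYNONYMS))
# ['faster', 'smart', 'quick', 'profitable', 'gain', 'profit']
print(sorted(search_groups(index, ["faster", "profitable"], SYNONYMS)))  # [42]
```

Taking a plain union over the whole expanded query would instead return every document that mentions any of the terms, which is looser but may be what you want for recall.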

Once you've implemented that, a nice improvement is to add TF-IDF weighting to your keywords. That's basically a way to weight rare words (e.g. "programming") more heavily than common ones (e.g. "the").
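To make the weighting concrete, here is a sketch of one common TF-IDF variant (term frequency as a fraction of the document length, idf as log(N / df)). Many other variants exist, and the function name and the sample documents are assumptions made up for this example.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query_terms):
    """Rank documents by the sum of tf * idf over the query terms."""
    n_docs = len(documents)
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(words)      # frequency within this document
                idf = math.log(n_docs / df[term])   # rarity across all documents
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents: "the" appears everywhere, "programming" only once.
documents = {
    1: "the quick search",
    2: "the programming search engine",
    3: "the the the profit",
}
ranked = tf_idf_scores(documents, ["the", "programming"])
print([doc_id for doc_id, _ in ranked])  # [2]
```

Because "the" occurs in every document, its idf is log(3/3) = 0 and it contributes nothing, so only the document containing "programming" gets a positive score.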

The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents on every query. The time-consuming operation is building the index, which only has to be done once (after that you only update it when documents change).
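For comparison, the scan-everything approach could look roughly like the sketch below. The function name and the sample documents are hypothetical; the point is only that every query reads every document, so the cost grows with the size of the whole collection rather than with the size of the matching lists.

```python
def scan_documents(documents, terms):
    """Return ids of documents containing at least one term, reading every document."""
    wanted = set(terms)
    return [
        doc_id
        for doc_id, text in documents.items()
        # Each document is tokenized again on every single query.
        if wanted & set(text.lower().split())
    ]


# Hypothetical documents, the same shape as an index would be built from.
documents = {35: "smart", 42: "smart gain", 55: "profit"}

print(scan_documents(documents, ["gain", "profit"]))  # [42, 55]
```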

Community
Julien