6

I want to make a dictionary and I need to have a query that can find words with misspelling (I mean Typos / spelling error or typo), if the query could not find exact word then try to get with other spellings...

So what is Arabic: In Arabic there are some letter with (almost) same pronunciation but different letter, people sometimes don't know which one should use. For example there are 4 kind of Z in Arabic "ز / ظ / ذ / ض" pronunciation are different (a little) but people will forgot which one is the correct spelling. an example of one word with different using of "z" letter:

مریز / مریض / مریظ / مریذ

The correct is مریض

Here are other worlds that has more than one latter:

z: ض / ز / ذ / ظ 

T: ت / ط / 

S: ث / س / ص / 

Gh: ق / غ

So what is your idea? How should be the query?

I want if a user searched for "مریز" instead of showing him an error of 404 (not found) search the database with other letter (all Z) then return result if I found anything.

kiokoshin
  • 63
  • 3
  • I know arabic, but I don't know if you tried anything. You see, someone will give you negative, because you didn't even post any queries where one can see if you even tried. – Amel Jan 09 '16 at 07:21
  • @Amel I don't want the exact query, I want idea (I don't know Arabic well) thanks for your comment – kiokoshin Jan 09 '16 at 07:35

3 Answers3

5

In German, we have the same issue regarding t and tt or dt - especially in names.

One way to approach this would be to store additional normalized column containing the name / word with fixed transformation.

 tt -> t
 dt -> t
 ß  -> s
 ss -> s

So table would contain

 WORD    | NORMALIZED
 schmitt | schmit
 schmidt | schmit

At query time, apply these same transformations to the query and then compare against normalized column.

Jan
  • 13,319
  • 3
  • 27
  • 51
  • Thanks, I did not know there is same problem in German (my native language is not Arabic!), you know it is a great dictionary, I am looking for a good algorithm (and query) because there are about half million words for search! – kiokoshin Jan 09 '16 at 07:33
  • The police ought to use a similar algorithm on automobile license plates -- is it O or 0; I or 1; Z or 2. Even E vs F can be a problem if there is a "plate surround". – Rick James Jan 10 '16 at 04:15
4

There is an algorithm called Levenshtein distance (there are others as well), which tells the edit distance between two strings.

You could derive from this, try to find the most resembling words in your dictionary compared to your input.

Later you can assign weight for substitutions based on the letter tuples you mentioned to refine your search.

In fact there is an implementation for MySQL you definitely should check out: https://www.artfulsoftware.com/infotree/qrytip.php?id=552
Most of the levenshtein+mysql questions here in SO point to this page.

Koshinae
  • 2,083
  • 2
  • 29
  • 35
2

A simpler solution would be using regular expressions within a like statement. For letters that are likely to be misspelled you can keep varieties in a regular expression wildcard. For the letters corresponding to z wildcard is "[زذظض]" You can replace all ز، ذ، ظ، ض letters with the wildcard and then query with a like statement:

select * from searched_table where word like "%[مرى[زذظض%"

After you find all versions of the searched word, you may either show the user all of them, or you may calculate the levenshtein distance (koshinae's answer) and show the closest words.

Edit: only for the letter Z, query would be like below

set @word = 'مرىض'; -- take this text from user
set @word = replace(@word, 'ذ', 'Z');
set @word = replace(@word, 'ظ', 'Z');
set @word = replace(@word, 'ض', 'Z');
set @word = replace(@word, 'ز', 'Z');
set @word = replace(@word, 'Z', '[زظضذ]');
set @word = Concat('%',  @word,  '%');
select @word;

select * from mydb.searchTable where word like @word;  
Abdullah Nehir
  • 950
  • 11
  • 19