
Let's assume we have a table People (name, surname, address, SSN, etc.).

We want to find all rows that are "very similar" to a specified person A. I would like to implement some kind of fuzzy-logic comparison of A against all rows in the People table. There will be several fuzzy inference rules working separately on several columns (e.g. 3 fuzzy rules for name, 2 rules on surname, 5 rules on address).

The question is: which of the following two approaches would be better, and why?

  1. Implement all fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are "very similar" to A (a sketch follows below). This approach may include using SOUNDEX, string similarity metrics, etc.

  2. Implement one or more simpler SELECT statements that return less accurate results, "rather similar" to A, and then fuzzy-compare A with all returned rows (outside the database) to get the "very similar" rows. The fuzzy comparison would thus be implemented in my favorite programming language.

The People table will have up to 500k rows, and I would like to run about 500-1000 queries like this per day. I use MySQL (but this is yet to be decided).
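For illustration, approach 1 might look roughly like this (a sketch only: the fuzzy_*_score stored functions, the weights, and the threshold are hypothetical, things I would have to implement myself):

    -- hypothetical stored functions, each returning a similarity in [0, 1]
    SELECT p.*,
           ( 0.4 * fuzzy_name_score(p.name, 'John')
           + 0.2 * fuzzy_surname_score(p.surname, 'Doe')
           + 0.4 * fuzzy_address_score(p.address, '1 Main St') ) AS similarity
    FROM People p
    HAVING similarity > 0.8   -- MySQL allows HAVING on a SELECT alias
    ORDER BY similarity DESC;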

  • Please tell us which RDBMS you're using. – Jakub Kania Apr 03 '13 at 23:19
  • Why not use a CASE in your SELECT that returns 1 if a column is similar and 0 if not, then sum across the columns? Rows whose total exceeds a specific threshold would be returned. Seems like a simple solution to your problem. – Jesus Zamora Apr 03 '13 at 23:30
  • @JesusZamora: The problem is that, first of all, the similarity value might be a float (e.g. 0.43), which is why I was talking about "fuzzy comparison". The second problem is that computing this similarity value is rather heavy, and I don't know which of these approaches would be more efficient. – running.t Apr 03 '13 at 23:38
  • @JesusZamora: fuzzy logic assumes that it is impossible to simply say whether something is true or false, because there are many factors that can be almost true or not entirely false. – noisy Apr 03 '13 at 23:38
  • I have just found [this question](http://stackoverflow.com/questions/1914671/php-mysql-small-scale-fuzzy-search), and if I understand correctly, the accepted answer there shows how the 2nd approach might be implemented. – running.t Apr 04 '13 at 17:12
  • Possible 3rd solution: use a temporary table (possibly just storing the matching id fields) and populate it from a few simple queries, each tailored to a basic check on one column (e.g. name could be checked via SOUNDEX or the first letter, address by town). Then perform your full checks on the subset of records stored in the temp table. – Kickstart Apr 12 '13 at 13:26

4 Answers


I don't really think there is a definitive answer, because it depends on information not available in the question. Anyway, this is too long for a comment.

DBMSes are good at retrieving information according to indexes. It does not make sense to have a DB server waste time on heavy computations unless it is dedicated to this specific purpose (as @Adrian answered).

Therefore, your client application should delegate to the DBMS the retrieval of information required by the rules.

If the computations are minor, everything could be done on the server. Otherwise, pull the data off into the client system.

The disadvantage of the second approach lies in the amount of data traveling from the server to the client and in the number of connections to establish. So, typically, it is a compromise between computation and data transfer on the server, a balance to be achieved depending on the specifics of the fuzzy rules.
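Schematically (the column list and the filtering predicates are placeholders, not a concrete recommendation), that compromise could look like:

    -- server side: a cheap, index-friendly prefilter that ships only
    -- the columns the fuzzy rules actually need
    SELECT id, name, surname, address
    FROM People
    WHERE name_soundex = SOUNDEX('John')   -- assumes a precomputed, indexed column
       OR surname LIKE 'D%';               -- cheap first-letter check
    -- client side: apply the expensive fuzzy rules to this candidate set only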

Edit: I've seen in a comment that you are almost sure you will have to implement the code on the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes: try to keep related code together rather than spread it between systems (and languages).

koriander

I would say you're best off using simple SELECTs to get the closest matches you can without hammering the database, then doing the heavy lifting in your application layer. The reason I suggest this is scalability: if you do the heavy lifting in the application layer, your problem is a perfect use case for a map-reduce-style solution, in which you distribute the similarity computations across nodes and get your results back much faster than if you pushed everything through the database. Plus, this way you're not locking up your database and slowing down any other operations that may be going on at the same time.
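As a rough sketch (the SOUNDEX prefilter and the modulo sharding are purely illustrative), each worker node could pull and score its own slice of the loose candidate set:

    -- worker k of N pulls one slice of the "rather similar" candidates
    -- and computes the full fuzzy similarity locally
    SELECT id, name, surname, address
    FROM People
    WHERE SOUNDEX(name) = SOUNDEX('John')   -- loose, cheap match
      AND id % 4 = 0;                       -- this worker's shard (k = 0, N = 4)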

Adrian
  • I think that is the way I am going to implement this, but I am still wondering whether there are any disadvantages or potential risks to this approach. – running.t Apr 09 '13 at 09:54
  • Disadvantages are going to depend on your situation: if your database server generally has more unused capacity than your application server, you'll want to offload more work to the database server. If they're on the same box, you'd have to benchmark it both ways to get the best answer for your particular situation. – Adrian Apr 10 '13 at 22:17

Since you're still considering which DB to use: PostgreSQL has the fuzzystrmatch module, which provides Levenshtein and Soundex functions. You might also want to look at the pg_trgm module, as described here. You could additionally put an index on the column using soundex() so you won't have to calculate it every time. But you seem to be optimizing prematurely, so my advice would be to test with pg first and then decide whether you need to optimize at all; the numbers you provided really don't seem like a lot, considering you have almost two minutes to run each query.
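A minimal sketch of what that could look like (extension setup shown for PostgreSQL 9.1+; the column names come from the asker's People schema, and the index name is just illustrative):

    -- one-time setup
    CREATE EXTENSION fuzzystrmatch;
    CREATE EXTENSION pg_trgm;

    -- expression index so soundex() is not recomputed on every query
    CREATE INDEX people_surname_soundex_idx ON People (soundex(surname));

    -- candidates with a same-sounding surname, ranked by edit distance
    SELECT name, surname, address
    FROM People
    WHERE soundex(surname) = soundex('Doe')
    ORDER BY levenshtein(surname, 'Doe');

    -- or trigram similarity from pg_trgm
    SELECT name, surname FROM People WHERE surname % 'Doe';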

Jakub Kania
  • Using the levenshtein and soundex functions from [fuzzystrmatch](http://www.postgresql.org/docs/current/static/fuzzystrmatch.html) sounds pretty good if I decide to use the first approach. However, I am almost sure I will have to use my own procedures for at least some of the metrics (fuzzy rules), so using these modules will not solve everything. Anyway, I am still not sure whether doing everything inside the DB is better, and why. – running.t Apr 04 '13 at 00:16
  • And to be more precise regarding requirements: there will be 500-1000 queries a day, but they should finish as fast as possible (say, in less than an hour). – running.t Apr 04 '13 at 00:20
  • @running.t: running the levenshtein() function a million times in a row takes less than half a second on my laptop, and difference() is similar, so you shouldn't be worried, at least about doing that in the DB. – Jakub Kania Apr 04 '13 at 01:27

An option I'd consider is to add a column to the People table that stores the SOUNDEX value of the person's name.

I've done joins using:

    SELECT [Column]
    FROM People p
        INNER JOIN TableA a ON SOUNDEX(a.ComparisonColumn) = p.SoundexColumn

That'll return anything in TableA that has the same SOUNDEX value as the People table's SOUNDEX column.

I haven't used that kind of query on tables of that size, but I see no issues with trying it. You can also index that SOUNDEX column to help with performance.
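Something like this, for example (column and index names are just illustrative; syntax shown for MySQL):

    -- precompute the SOUNDEX value once and index it
    -- note: MySQL's SOUNDEX() may return more than 4 characters for long names
    ALTER TABLE People ADD COLUMN SoundexColumn VARCHAR(20);
    UPDATE People SET SoundexColumn = SOUNDEX(name);
    CREATE INDEX idx_people_soundex ON People (SoundexColumn);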

Rob S