0

I have to implement fulltext search in multiple columns with result weighting based on relevance of certain columns / fields.

All the solutions I've come across seem to use single-column indexes for calculating relevance and one multiple-column index for the WHERE clause. See: https://stackoverflow.com/a/600915/168719 or https://stackoverflow.com/a/6305108/168719

Here's my query then:

SELECT MATCH(name) AGAINST (text) as relevance_name, 
MATCH(description) AGAINST(text) as relevance_description, 
MATCH(description_long) AGAINST (text) as relevance_description_long 
FROM products WHERE

And I'm facing the choice between:

a)

MATCH(name, description, description_long) AGAINST (text) > 0

b)

MATCH(name) AGAINST (text) > 0 
OR MATCH(description) AGAINST (text) > 0 
OR MATCH(description_long) AGAINST (text) > 0

After which the sorting clause comes.

ORDER BY (relevance_name * 2 + 
relevance_description * 3 + 
relevance_description_long * 4) / 9

The question is - what is the superiority of a (apparently the preferred method) over b?

a requires creating another fulltext index (across all searchable columns), which obviously takes more disk space.

What are the advantages? Is it a matter of performance? Or search quality?

Community
  • 1
  • 1
Konrad Morawski
  • 7,689
  • 6
  • 47
  • 80

1 Answers1

2

Manual on page 12.9.1. Natural Language Full-Text Searches tells us:

For each row in the table, MATCH() returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

Therefore, MATCH () will return different values for MATCH (c1,c2,c3) and MATCH(c1) + MATCH(c2) + MATCH(c3). Similar difference will be when using match with OR operator.

Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

You should use approach B, because it is in the same form as your query.

Stoleg
  • 8,059
  • 1
  • 17
  • 27
  • In the `WHERE` clause, I am not concerned with what the exact relevance is. It serves no other purpose than to screen out the results that don't match. My impression is that **b** is sufficient, too. But then why is it that **a** gets recommended most often? Surely there must be some reasoning behind that – Konrad Morawski May 23 '13 at 14:16
  • @Konrad. For Match(1,2,3) there is only one read of the index, whereas for **B** there are 3 reads, making it slower. Point in my answer was that you should use same construct in select and where clauses. – Stoleg May 23 '13 at 15:07
  • so you mean that those 3 reads (in `SELECT`) are cached anyway, thus in the approach **b** I end up with 3 reads altogether (but not 6)? While using **a** in this query adds a 4th read? – Konrad Morawski May 23 '13 at 15:09
  • I was talking individual statements. When you use `Select M(1),M(2),M(3) [...] Where M(1) or M(2) or M(3)` then you has 3 reads of full-text index. When you use `Select M(1),M(2),M(3) [...] Where M(1,2,3)` then you have 3 reads full-text index for Select plus 1 read for Where clause. However in `Select M(1,2,3) [...] Where M(1,2,3)` there will be only 1 read of full-text index. Chaching is a different matter. – Stoleg May 23 '13 at 15:18