2

I'm trying to build my own search engine to experiment with.

I know about inverted indexes. For example, when indexing words, the key is the word and the value is a list of ids of the documents that contain it. So when you search for that word, you get the matching documents right away.
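
Roughly, I picture something like this for a single word (a toy Python sketch with made-up data, just to show what I mean):

# Toy inverted index: word -> sorted list of ids of the documents containing that word
inverted_index = {
    "apple":  [1, 4, 7],
    "banana": [2, 4, 9],
}

def search_one_word(word):
    # Single-word search: the matching document ids come straight out of the index.
    return inverted_index.get(word, [])

print(search_one_word("apple"))  # [1, 4, 7]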

How does it work for multiple words?

Do you get all the documents for every word and then traverse those documents to see which ones contain both words?

I feel that's not how it's done.

Does anyone know the real answer to this, without speculating?

Federico
    If you can get all the documents (or document ids) for word A and you can do the same for word B, you can also produce an intersection of the two result sets without opening the document itself. – biziclop Feb 22 '12 at 00:47

5 Answers

1

You need to store the position of each word occurrence within a document in your index file. The index file structure should look like this: word id - doc id - number of hits - positions of hits.


Now suppose the query contains four words, "w1 w2 w3 w4". Choose the documents that contain most of those words, then calculate the relative distances of the words within each document. Documents in which most of the query words occur, and whose relative distances are smallest, get the highest priority in the search results.
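
A rough sketch of that scoring idea (Python; the in-memory layout and names here are just my illustration of the word id - doc id - hits - positions structure, not code from an actual engine):

from itertools import product

# Hypothetical positional index: word -> {doc id: [positions of that word in the doc]}
index = {
    "w1": {1: [3, 40], 2: [7]},
    "w2": {1: [5],     3: [2]},
    "w3": {1: [6, 90]},
    "w4": {2: [8]},
}

def rank(query_words, index):
    # Rank docs: more query words matched is better; tighter grouping of positions is better.
    candidates = set()
    for w in query_words:
        candidates |= set(index.get(w, {}))
    scored = []
    for d in candidates:
        pos_lists = [index[w][d] for w in query_words if d in index.get(w, {})]
        matched = len(pos_lists)
        # Smallest spread over one chosen position per matched word
        # (brute force here; real engines use a sliding-window merge instead).
        spread = min(max(c) - min(c) for c in product(*pos_lists))
        scored.append((d, matched, spread))
    return sorted(scored, key=lambda t: (-t[1], t[2]))

print(rank(["w1", "w2", "w3", "w4"], index))  # doc 1 first: it has 3 of the 4 words close together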

I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine

For more info, read this paper by the Google founders: click here

alienCoder
0

You find the intersection of the document sets, as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
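
As a baseline, intersecting two sorted posting lists can be done with a plain linear merge (a Python sketch; the post and papers above describe faster skip- and top-k-based variants):

def intersect(a, b):
    # Linear merge of two sorted doc-id lists: O(len(a) + len(b)).
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 3, 5, 9, 12], [2, 3, 9, 40]))  # [3, 9]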

Xodarap
  • The post does not really address the issue of match list _intersection_ (i.e. AND queries), as it talks about OR queries. – jogojapan Feb 24 '12 at 06:04
  • @jogojapan: The papers linked are the core implementation details. I think the important part is that the bounds can be improved by finding only the top k. – Xodarap Feb 24 '12 at 17:02
0

As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.

In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.

Given a query A AND B, assume there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs in more documents than B. What you do then is iterate through all matches for B and look each of them up in the list for A. If the lists can indeed be searched in logarithmic time, this means you need

occ(B) * log(occ(A))

computational steps to identify all matches that contain both terms.
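
A minimal sketch of that scheme (Python, assuming plain sorted, uncompressed match lists):

from bisect import bisect_left

def intersect_and(list_a, list_b):
    # AND query: scan the shorter match list and binary-search the longer one,
    # i.e. occ(B) iterations, each with a log(occ(A)) lookup.
    if len(list_a) < len(list_b):
        list_a, list_b = list_b, list_a   # make list_b the shorter list
    result = []
    for doc_id in list_b:
        i = bisect_left(list_a, doc_id)
        if i < len(list_a) and list_a[i] == doc_id:
            result.append(doc_id)
    return result

print(intersect_and([1, 4, 7, 9, 15, 22], [4, 9, 30]))  # [4, 9]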

A great book describing various aspects of the implementation is Managing Gigabytes.

jogojapan
0

An inverted index is very efficient for computing the intersection, using a zig-zag algorithm:

Assume your query terms are in a list T:

lastDoc <- 0   //the first doc in the collection
currTerm <- 0  //the first term in T
while (lastDoc != infinity):
  if (currTerm > T.last): //we have passed the last term: lastDoc matched all of them
     insert lastDoc into result
     currTerm <- 0
     lastDoc <- lastDoc + 1
     continue
  docId <- T[currTerm].getFirstAfter(lastDoc - 1)
  if (docId != lastDoc):
     lastDoc <- docId
     currTerm <- 0
  else:
     currTerm <- currTerm + 1

This algorithm assumes an efficient getFirstAfter(), which gives you the first document that matches the term and whose docId is greater than the specified parameter. It should return infinity if there is none.
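
For illustration, a small Python version of the same idea, assuming each term's posting list is a plain sorted list so that getFirstAfter() is just a binary search:

from bisect import bisect_right

INF = float("inf")

class Postings:
    # Sorted doc-id list for one term (a stand-in for a real posting list).
    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)

    def get_first_after(self, doc_id):
        # First doc id strictly greater than doc_id, or INF if there is none.
        i = bisect_right(self.doc_ids, doc_id)
        return self.doc_ids[i] if i < len(self.doc_ids) else INF

def zigzag_intersect(terms):
    # terms: list of Postings, ideally with the rarest term first.
    result = []
    last_doc, curr_term = 0, 0
    while last_doc != INF:
        if curr_term == len(terms):      # last_doc matched every term
            result.append(last_doc)
            curr_term = 0
            last_doc += 1
            continue
        doc_id = terms[curr_term].get_first_after(last_doc - 1)
        if doc_id != last_doc:
            last_doc, curr_term = doc_id, 0
        else:
            curr_term += 1
    return result

print(zigzag_intersect([Postings([5, 9, 30]), Postings([1, 5, 7, 9]), Postings([2, 5, 9, 12])]))  # [5, 9]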

The algorithm will be most efficient if the terms are sorted such that the rarest term is first.

The algorithm performs at most #docs_matching_first_term * #terms iterations, but in practice it will usually be far fewer.

Note: Though this algorithm is efficient, AFAIK Lucene does not use it.

More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].

amit
-1

I don't really understand why people are talking about intersection for this.

Lucene supports combination of queries using BooleanQuery, which you can nest indefinitely if you must.

The QueryParser also supports the AND keyword, which would require both words to be in the document.

Example (Lucene.NET, C#):

var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );

If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too, although a QueryParser might be easier to use.

You can view this answer for example on how to split the string using the same analyzer that you used for indexing:

No hits when searching for "mvc2" with lucene.net

jishi
  • Your "a" AND "b" query precisely computes an intersection between the set of docs matching "a" and the set of docs matching "b" – fulmicoton Jan 08 '16 at 01:52