
Can someone please explain the difference between the different analyzers within Lucene? I am getting a maxClauseCount exception and I understand that I can avoid this by using a KeywordAnalyzer, but I don't want to change from the StandardAnalyzer without understanding the issues surrounding analyzers. Thanks very much.

Journeyman

2 Answers


In general, any analyzer in Lucene is tokenizer + stemmer + stop-words filter.

A tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, the KeywordAnalyzer you mentioned doesn't split the text at all and takes the whole field as a single token. At the same time, StandardAnalyzer (and most other analyzers) uses spaces and punctuation as split points. For example, for the phrase "I am very happy" it will produce the list ["i", "am", "very", "happy"] (or something like that). For more information on specific analyzers/tokenizers, see their Javadocs.
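
As a rough sketch of the difference (package names and constructors vary a bit between Lucene versions; the class name TokenizerDemo and the field name "field" are just placeholders here), you can print the tokens an analyzer produces:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {

    // Prints every token the given analyzer emits for the given text.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer splits on whitespace/punctuation and lowercases:
        // i, am, very, happy
        printTokens(new StandardAnalyzer(), "I am very happy");

        // KeywordAnalyzer keeps the whole field as one token:
        // I am very happy
        printTokens(new KeywordAnalyzer(), "I am very happy");
    }
}
```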

Stemmers are used to get the base form of a word. This heavily depends on the language used. For example, for the previous phrase in English something like ["i", "be", "veri", "happi"] will be produced, and for the French "Je suis très heureux" some kind of French analyzer (like SnowballAnalyzer initialized with "French") will produce ["je", "être", "tre", "heur"]. Of course, if you use the analyzer of one language to stem text in another, the rules of that language will be applied and the stemmer may produce incorrect results. This doesn't break the whole system, but search results may then be less accurate.
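
Note that SnowballAnalyzer was deprecated and later removed from Lucene; the per-language analyzers such as EnglishAnalyzer and FrenchAnalyzer fill the same role nowadays. A small sketch, reusing the printTokens helper from the snippet above:

```java
import java.io.IOException;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;

public class StemmingDemo {
    public static void main(String[] args) throws IOException {
        // EnglishAnalyzer tokenizes, lowercases, drops English stop words and
        // applies Porter stemming, giving roughly: i, am, veri, happi
        TokenizerDemo.printTokens(new EnglishAnalyzer(), "I am very happy");

        // FrenchAnalyzer uses French stop words and a French stemmer; the exact
        // output depends on the Lucene version and its French stop-word list.
        TokenizerDemo.printTokens(new FrenchAnalyzer(), "Je suis très heureux");
    }
}
```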

KeywordAnalyzer doesn't use any stemmers; it passes the whole field through unmodified. So, if you are going to search for words in English text, it isn't a good idea to use this analyzer.

Stop words are the most frequent and almost useless words. Again, this heavily depends on the language. For English these words are "a", "the", "I", "be", "have", etc. Stop-word filters remove them from the token stream to lower the noise in search results, so finally our phrase "I'm very happy" with StandardAnalyzer will be transformed into the list ["veri", "happi"].
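
A sketch of the stop-word part in isolation, passing the stop-word set explicitly because the default used by the no-arg StandardAnalyzer constructor has changed between Lucene versions (StandardAnalyzer itself does not stem, so the stemmed forms above come from the stemming filters discussed earlier):

```java
import java.io.IOException;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordsDemo {
    public static void main(String[] args) throws IOException {
        // Pass the English stop-word set explicitly, since which set the
        // no-arg StandardAnalyzer constructor uses has varied between versions.
        CharArraySet stopWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
        StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);

        // "the" is dropped, the rest is lowercased but not stemmed:
        // quick, brown, fox
        TokenizerDemo.printTokens(analyzer, "The quick brown fox");
    }
}
```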

And KeywordAnalyzer again does nothing. So, KeywordAnalyzer is used for things like IDs or phone numbers, but not for ordinary text.
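
If one index needs both behaviours, PerFieldAnalyzerWrapper can route different fields to different analyzers. A minimal sketch, where the field names "id" and "phone" are made-up examples:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldDemo {
    // Keep ID-like fields as single unmodified tokens while ordinary text
    // fields get the full StandardAnalyzer treatment.
    public static Analyzer buildIndexAnalyzer() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("id", new KeywordAnalyzer());      // hypothetical field name
        perField.put("phone", new KeywordAnalyzer());   // hypothetical field name
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}
```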

As for your maxClauseCount exception, I believe you get it on searching. In that case it is most probably caused by a search query that is too complex. Try splitting it into several queries or using lower-level functions.
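
If you only need to get past the exception for now, the clause limit itself is configurable; roughly like this in Lucene versions up to 8.x (newer versions moved the setting to IndexSearcher, as noted in the comment):

```java
import org.apache.lucene.search.BooleanQuery;

public class MaxClauseWorkaround {
    public static void main(String[] args) {
        // Raising the limit is a workaround rather than a fix: queries that
        // expand into thousands of clauses (wildcard, prefix, range rewrites)
        // will still be slow. The default limit is 1024.
        BooleanQuery.setMaxClauseCount(4096);

        // In newer Lucene versions (9.x) the setting lives on IndexSearcher:
        // org.apache.lucene.search.IndexSearcher.setMaxClauseCount(4096);
    }
}
```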

ffriend
  • @ffriend: I don't think a stemmer (Snowball or other algorithms) can convert am -> be, because that is the job of a lemmatizer. You can check it here: http://snowball.tartarus.org/demo.php – Tho Jan 07 '15 at 09:51
  • So where does Tika fit into this? Isn't it technically an analyzer? – anon Jan 31 '15 at 12:24
  • @anon: Tika is a separate project with several key features. Assuming you mean the Tika parsers, I'd say that Tika takes a byte stream and outputs text + metadata, while Lucene analyzers take text and output a processed token stream. For example, you may first parse a PDF or XML file with Tika, producing documents with fields like "title", "author" and "text", and then analyze some or all of these fields with Lucene analyzers. – ffriend Jan 31 '15 at 12:44
  • Just wondering, "very" and "happy" are not declined words, so why are they transformed into "veri" and "happi"? Is it to match i/y differences since they sound similar? – oguzalb Feb 06 '20 at 14:17

From my own experience, I have used StandardAnalyzer and SmartChineseAnalyzer, since I have to search text in Chinese. Obviously, SmartChineseAnalyzer is better at handling Chinese. For different purposes, you have to choose the most suitable analyzer.
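
For reference, a minimal sketch of using it, assuming the separate smartcn analysis module is on the classpath and reusing the printTokens helper from the snippet in the answer above (the sample sentence is just made-up text):

```java
import java.io.IOException;

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class ChineseDemo {
    public static void main(String[] args) throws IOException {
        // SmartChineseAnalyzer segments Chinese text into words rather than
        // treating it as one chunk or as individual characters.
        TokenizerDemo.printTokens(new SmartChineseAnalyzer(), "我很高兴");
    }
}
```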

neal