I am using Solr spellcheck for the Russian language. When you type in Cyrillic characters everything works, but it does not work when you type in Latin characters.

I want spellcheck to correct input typed in either Cyrillic or Latin characters, and to always return the correction in Cyrillic.

For example, when you type:

телевидениеее or televidenieee

It should correct to:

телевидение

schema.xml:

<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    </analyzer>
</fieldType>

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="buildOnCommit">true</str>
        <str name="buildOnOptimize">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="accuracy">0.75</str>
    </lst>
    <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="combineWords">false</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">1</int>
    </lst>
</searchComponent>

Thanks for help

Slava Vedenin
KiraLT
  • Just to clarify - you want to have `televidenieee` transliterated to `телевидениеее` and then fixed by spellchecker to `телевидение`, right? – rchukh Nov 04 '13 at 01:02
  • Could you also share the `requestHandler` you employ? – cheffe Nov 05 '13 at 13:23

1 Answer

It can be achieved with ICUTransformFilterFactory, which transliterates the input query back to Cyrillic on every request.

Here is an example of how to enable this functionality:

  1. Enable the ICU analyzers (lucene-analyzers-icu-*.jar, icu4j-*.jar):

    These libraries can be found in the contrib/analysis-extras folder of the Solr distribution from the official site (they are also available via Maven).

    In solrconfig.xml, add something like the following to load them. A single lib directory containing all the jars you need would also work; this example just uses the default locations relative to the example/solr/collection1/conf folder of the official distribution:

    <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
    <lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
    
  2. Split the spell_text analyzer into two separate analyzers, one for index and one for query.

  3. Add solr.ICUTransformFilterFactory to the query analyzer with the id Any-Cyrillic; NFD; [^\p{Alnum}] Remove:

    <fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    
        <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
      </analyzer>
    </fieldType>
    

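Regarding the ICUTransformFilterFactory id - Any-Cyrillic; NFD; [^\p{Alnum}] Remove: it is a compound ICU transform applied left to right. Any-Cyrillic transliterates text from any script into Cyrillic, NFD applies canonical Unicode decomposition, and [^\p{Alnum}] Remove deletes every character that is not alphanumeric.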

The configuration described above works on my local machine the same way for Russian transliterations and for Russian words.
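
To try this out, the spellcheck component also has to be attached to a request handler, which the question does not show. A minimal sketch, assuming a handler named /spell and the default parameters below (the name and all of the defaults are assumptions, not taken from the original configuration):

```xml
<!-- Hypothetical handler; the name and every default here are assumptions -->
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <!-- "spellcheck" is the field used by the spellcheckers in the question -->
    <str name="df">spellcheck</str>
    <str name="spellcheck">on</str>
    <!-- Use both dictionaries defined in the searchComponent -->
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <!-- Must match the name attribute of the searchComponent -->
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With a handler like this, sending q=televidenieee and q=телевидениеее to it is a quick way to compare the suggestions returned for Latin and Cyrillic input.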

rchukh
  • That, of course, means that you won't be able to search by Latin characters in that field (because they will be converted into Cyrillic characters). If you need to search by *both* Latin AND Cyrillic characters, then you may use copyField(s) for separate Cyrillic and Latin spellchecking. – rchukh Nov 05 '13 at 23:10
  • I need to search by both Latin and Cyrillic characters. For example, the query "tilevizor smasung" should be fixed to "телевизор samsung". I can create two fields (one for Latin, another for Cyrillic letters). But how can I use them both for spellchecking? – KiraLT Nov 28 '13 at 18:08
  • Well... when I mentioned both Latin and Cyrillic characters in the previous comment, I meant that they would be separated: either Latin or Cyrillic. What you are asking here is much trickier, e.g. how can you tell that "smasung" should be corrected to "samsung" and not "самсунг" if both "samsung" and "самсунг" are in the field used for spellchecking? – rchukh Nov 28 '13 at 19:33
  • There is little chance of both "samsung" and "самсунг" being in the spellcheck field, so we can first try to correct in Latin characters and then, if no correction is found, in Cyrillic. Or we can choose by term frequency (but that would probably be even more difficult). Could you suggest something? – KiraLT Nov 29 '13 at 11:34
  • Using term frequency sounds like a good approach, especially if the Solr index is based on some trusted input (e.g. if one is sure that the content does not have any misspellings). Can you open a new question about this (or link to an existing one if there already is one)? This answer is already quite big, and trying to solve this issue here as well will only add confusion. – rchukh Dec 02 '13 at 17:49
  • New question: http://stackoverflow.com/questions/20350714/how-to-make-solr-spellchecker-to-correct-both-latin-and-cyrillic-words – KiraLT Dec 03 '13 at 12:07
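
The copyField idea from the comments above could be sketched roughly as follows. The field names spellcheck_cyr and spellcheck_lat, the field type spell_text_latin, the source field title, and the dictionary names are all hypothetical; spell_text_latin stands for a copy of the original spell_text definition without the ICU transform filter.

```xml
<!-- schema.xml: hypothetical fields; "title" stands for whatever field feeds spellcheck -->
<field name="spellcheck_cyr" type="spell_text" indexed="true" stored="false" multiValued="true"/>
<field name="spellcheck_lat" type="spell_text_latin" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spellcheck_cyr"/>
<copyField source="title" dest="spellcheck_lat"/>

<!-- solrconfig.xml: one dictionary per field, placed inside the spellcheck searchComponent -->
<lst name="spellchecker">
  <str name="name">cyrillic</str>
  <str name="field">spellcheck_cyr</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  <str name="spellcheckIndexDir">./spellchecker_cyr</str>
</lst>
<lst name="spellchecker">
  <str name="name">latin</str>
  <str name="field">spellcheck_lat</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  <str name="spellcheckIndexDir">./spellchecker_lat</str>
</lst>
```

This only keeps the two kinds of spellchecking separate. Choosing between a Latin and a Cyrillic suggestion for the same term is the open problem raised in the follow-up question.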