Search Manual / Version 2307

3.8.2 Configuring Multi-Language Search

Configuring multi-language search consists of the following steps, which are described in the sections below:

Defining text tokenization and filtering in different field types
Defining index fields for different languages
Defining the fields from which the language is determined
Defining where the detected language is stored
Configuring language-dependent field handling
Configuring the search request handler

Note

It's not necessary to adapt the feeder configuration for multi-language support. Feeders just feed text into some fields (for example name and textbody) and the search engine puts the text into the correct language-dependent fields.

Configuring different field types

Text tokenization and filtering in Apache Solr are configured in the file conf/schema.xml of a Solr config set, for example in <solr-home>/configsets/content/conf/schema.xml for the content config set.

Each field has a field type, which defines the kind of data that is written to the field. In the default content config set, for example, the field textbody is of type text_general. Each field type is connected with an analyzer, which is used to tokenize and filter the text. The default configuration contains some field types with different analyzers, for example:

text_general, configured with Solr StandardTokenizer with reasonable cross-language defaults
text_zh, configured for tokenization of Simplified and Traditional Chinese (commented out by default)

Apache Solr provides special field types for many languages in its example configuration, for example text_ja for Japanese and text_ko for Korean. Most of these field types are not defined in the default configuration of the CoreMedia Search Engine, to keep the configuration files simple and to avoid unnecessary overhead. If required, add field types from the Solr example configuration to your configuration. You can find these additional field types in the configuration file server/solr/configsets/_default/conf/managed-schema after downloading and unpacking the Apache Solr distribution. You can download Solr from http://solr.apache.org.

Example

If you index text of one language only and want to use a special field type, you can simply change field definitions from type text_general to the chosen field type in schema.xml, for example to text_de for German text.

<fields>
  ...
  <field name="textbody" type="text_de" ... />
</fields>
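
Changing the field type only works if text_de is defined in the same schema.xml. The following is a minimal sketch of such a definition, modeled on the German field type of the Apache Solr example configuration. The filter chain and the stop word file lang/stopwords_de.txt are assumptions based on that example and may differ between Solr versions, so copy the definition and the referenced file from the Solr distribution that you actually use.

```xml
<!-- Sketch only: German field type modeled on the Apache Solr
     example configuration. Requires conf/lang/stopwords_de.txt
     in the config set (assumption: copied from the Solr
     distribution). -->
<fieldType name="text_de" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```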

Configuring multi-language index fields

You need to define language-dependent fields for all languages that need a special analyzer. To do so, add a new field element whose name consists of the original field name, an underscore and the language code, for example name_ja for Japanese text of the field name. Section 6.6, “Supported Languages in Solr Language Detection” in the reference shows the list of supported languages.

Note

Note that language-dependent fields must be indexed. A field declared with attribute indexed="false" cannot be used as a language-dependent field.

Fields in the content config set must also be declared with attribute stored="true" or docValues="true" to make it possible to use partial updates in the Content Feeder.

The following example shows fields and additional types in <solr-home>/configsets/content/conf/schema.xml for using dedicated field types for Simplified Chinese, Japanese and Korean while using the field type text_general for other languages. The example shows the fields name and textbody of the content config set. To enable sorting on the field name, it uses Solr field types based on SortableTextField.

<field name="name"                 type="text_gen_sort"
                                   indexed="true" stored="true"/>
<field name="name_ja"              type="text_ja_sort"
                                   indexed="true" stored="true"/>
<field name="name_zh-cn"           type="text_zh_sort"
                                   indexed="true" stored="true"/>
<field name="name_ko"              type="text_ko_sort"
                                   indexed="true" stored="true"/>
...
<field name="textbody"             type="text_general"
                                   indexed="true" stored="false"
                                   multiValued="true"/>
<field name="textbody_ja"          type="text_ja"
                                   indexed="true" stored="false"
                                   multiValued="true"/>
<field name="textbody_zh-cn"       type="text_zh"
                                   indexed="true" stored="false"
                                   multiValued="true"/>
<field name="textbody_ko"          type="text_ko"
                                   indexed="true" stored="false"
                                   multiValued="true"/>

<!-- field types "text_general", "text_gen_sort" and "text_zh" are
     already defined in the default configuration, the latter
     needs to be enabled, because it's commented out by default -->

<!-- field types "text_ja" and "text_ko" can be
     copied from the Apache Solr example configuration -->

<!-- field types "text_ja_sort", "text_zh_sort" and
     "text_ko_sort" can be copied from the field types without
     "_sort" suffix, adapting the name and replacing
     "solr.TextField" with "solr.SortableTextField" -->
...

In the above example, Japanese text goes into name_ja and textbody_ja, Simplified Chinese text goes into name_zh-cn and textbody_zh-cn, Korean text goes into name_ko and textbody_ko and text from all other languages is indexed in the fields name and textbody.
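
The comments in the example describe how to derive the sortable field types. As an illustration, a sortable Korean field type could be sketched as follows. The analyzer chain is an assumption modeled on the text_ko field type of the Apache Solr example configuration and may differ in your Solr version. The only changes compared to a copied text_ko definition are the name and the class.

```xml
<!-- Sketch only: "text_ko_sort" derived from the Solr example
     field type "text_ko" (assumption: analyzer chain as in the
     Solr example configuration). Only the name and the class
     differ from "text_ko". -->
<fieldType name="text_ko_sort" class="solr.SortableTextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"
               decompoundMode="discard"
               outputUnknownUnigrams="false"/>
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
    <filter class="solr.KoreanReadingFormFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```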

Besides Simplified Chinese you can also configure Traditional Chinese text with the fields name_zh-tw and textbody_zh-tw. The language code zh from previous CoreMedia releases is not generated anymore, but existing fields name_zh and textbody_zh are still used as fallback when indexing and searching.
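
A sketch of the corresponding field declarations for Traditional Chinese is shown below. It assumes that the field types text_zh and text_zh_sort are enabled as in the example above. Reusing them for Traditional Chinese is an assumption based on text_zh being configured for both Simplified and Traditional Chinese.

```xml
<!-- Sketch only: fields for Traditional Chinese, assuming the
     field types "text_zh" and "text_zh_sort" are available -->
<field name="name_zh-tw"           type="text_zh_sort"
                                   indexed="true" stored="true"/>
<field name="textbody_zh-tw"       type="text_zh"
                                   indexed="true" stored="false"
                                   multiValued="true"/>
```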

Configuring language detection

By default, the Search Engine detects the language of the index fields name and textbody for Content Feeder indices (config set content) and of the index field textbody for CAE Feeder indices (config set cae). Both use the field language to store the detected language. Language detection is skipped if the field language has already been set by the feeder. You can change these settings in the config set's file conf/solrconfig.xml, in the <processor> element with class LangDetectLanguageIdentifierUpdateProcessorFactory below the element <updateRequestProcessorChain>:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">textbody,name</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
</processor>

The parameter langid.langField defines the index field that will be filled with the language code of the document. Section 6.6, “Supported Languages in Solr Language Detection” in the reference shows the list of supported languages. The value of the parameter langid.fl is a comma-separated list of index fields that are used for language detection. The parameter langid.fallback configures English as fallback if the language cannot be detected from the text.

For more details about the Solr LangDetectLanguageIdentifierUpdateProcessorFactory, see Solr Reference Guide: Language Detection.
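
As an illustration of adapting these parameters, the following sketch changes the fallback language to German, which might suit an index that mostly contains German text. The choice of de is purely an example. The other parameters are unchanged from the default shown above.

```xml
<!-- Sketch only: same processor as in the default configuration,
     but with German instead of English as fallback language
     (example value, adapt to your content) -->
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">textbody,name</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">de</str>
</processor>
```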

Configuring language-dependent field handling

In order to be flexible, the Search Engine separates language detection and the handling of language-dependent fields. Therefore, field handling is configured in a separate class.

You can change the settings for language-dependent field handling in the config set's file conf/solrconfig.xml, in the <processor> element with class LanguageDependentFieldsProcessorFactory below the element <updateRequestProcessorChain>:

<processor class="com.coremedia.solr.update.processor.LanguageDependentFieldsProcessorFactory">
  <str name="languageField">language</str>
  <str name="textFields">textbody,name</str>
</processor>

The parameter languageField defines the index field that contains the language code of the document. This must be the same value as configured for language detection above.

The value of the parameter textFields is a comma-separated list of fields whose content should be put into language-dependent fields if such fields exist for the language. Normally, this is the same value as configured for language detection, unless you want to exclude some text fields from language detection.
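
The following sketch illustrates the case where a text field is excluded from language detection. It assumes that you want to detect the language from textbody only, for example because names are too short for reliable detection, while still putting the content of both fields into language-dependent fields. The sketch also assumes that the language detection processor comes first in the chain, so that the field language is already filled when the language-dependent fields are handled.

```xml
<!-- Sketch only: language is detected from "textbody" alone ... -->
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">textbody</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
</processor>

<!-- ... but both "textbody" and "name" are put into
     language-dependent fields. The value of "languageField"
     must match "langid.langField" above. -->
<processor class="com.coremedia.solr.update.processor.LanguageDependentFieldsProcessorFactory">
  <str name="languageField">language</str>
  <str name="textFields">textbody,name</str>
</processor>
```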

Configuring the search request handler

By default, the search request handlers for Content Feeder and CAE Feeder indices are configured in solrconfig.xml to search across multiple index fields. For example, the config set content configures the /editor search request handler with the qf parameter to search in fields textbody, name and numericid. Matches in the field name are scored higher than matches in textbody because of the configured ^2 boost. Note that the language-dependent fields name_* and textbody_* are not configured here but will be picked up automatically.

<requestHandler name="/editor" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">cmdismax</str>
    <str name="echoParams">none</str>
    <float name="tie">0.1</float>
    <str name="qf">textbody name^2 numericid^10</str>
    <str name="pf">textbody name^2</str>
    <str name="mm">100%</str>
    <str name="q.alt">*:*</str>

    <str name="suggest.spellcheck.dictionary">textbody</str>
  </lst>
  <arr name="last-components">
    <str>suggest</str>
    <str>spellcheck</str>
  </arr>
</requestHandler>

Adapt the configuration of the request handler's qf and pf parameters if you want to use other default search fields.
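
For example, to additionally search in another index field, you could extend both parameters in the defaults of the request handler. In the following sketch, the field teaser and its boost are hypothetical and only serve as illustration. The field would have to be defined in schema.xml and filled by the feeder.

```xml
<!-- Sketch only: "teaser" is a hypothetical field that must
     exist in schema.xml; the boost 1.5 is an example value -->
<str name="qf">textbody name^2 teaser^1.5 numericid^10</str>
<str name="pf">textbody name^2 teaser^1.5</str>
```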

The predefined request handlers can also be used in custom search applications. They can be selected in SolrJ by calling SolrQuery.setParam(CommonParams.QT, "/cmdismax");. If you prefer Solr's standard search handler, you have to search across language-dependent fields explicitly, for instance by constructing "OR" queries in Lucene query syntax or by configuring all fields for the standard Solr dismax or edismax query parsers.
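
A minimal sketch of the last option, using the fields from the schema example in this section, could look as follows. The handler name /multilang is hypothetical. Because the standard edismax query parser does not add language-dependent fields automatically, every such field has to be listed in qf.

```xml
<!-- Sketch only: the handler name "/multilang" is hypothetical.
     All language-dependent fields are listed explicitly in "qf". -->
<requestHandler name="/multilang" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">textbody textbody_ja textbody_zh-cn textbody_ko name^2 name_ja^2 name_zh-cn^2 name_ko^2</str>
    <str name="mm">100%</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```

Whenever you add language-dependent fields for a further language to schema.xml, such a handler has to be extended accordingly.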
