Multi-language search configuration consists of the following steps, which are described in the following paragraphs:
Defining text tokenization and filtering in different field types
Defining index fields for different languages
Defining the fields from which the language is determined
Defining where the detected language is stored
Configuring language-dependent field handling
Configuring the search request handler
Note | |
---|---|
It's not necessary to adapt the feeder configuration for multi-language support. Feeders
just feed text into some fields (for example |
Configuring different field types
Text tokenization and filtering in Apache Solr are configured in the file conf/schema.xml of a Solr config set, for example in <solr-home>/configsets/content/conf/schema.xml for the content config set.
Each field has a field type that defines which kind of data is written to the field. In the default content config set, for example, the field textbody is of type text_general. Each field type is associated with an analyzer that tokenizes and filters the text. The default configuration contains some field types with different analyzers, for example:
text_general, configured for tokenization of non-CJK languages with reasonable cross-language defaults
text_zh, configured for tokenization of Chinese (Simplified and Traditional)
Apache Solr provides special field types for many languages in its example configuration, for example text_ja for Japanese and text_cjk, which can be used for Korean. Most of these
field types are not defined in the default configuration of the
CoreMedia Search Engine to keep the configuration files simple and avoid
unnecessary overhead. If required, add field types from the Solr example configuration to your configuration.
You can find these additional field types in the file
example/solr/collection1/conf/schema.xml
after downloading and unpacking
the Apache Solr distribution. You can download Solr from
http://lucene.apache.org/solr/.
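As a rough illustration of what such a copied field type looks like, the following sketch shows a text_cjk definition along the lines of the one in the Solr example schema. The exact analyzer chain differs between Solr versions, so treat the filter list as an assumption and copy the definition from the distribution that matches your installation.

```xml
<!-- Sketch only: verify against the schema.xml of your Solr distribution -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize full-width and half-width character forms before bigramming -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- lowercase any non-CJK text -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index CJK text as overlapping character bigrams -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```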
Example
If you index Chinese text only, you can simply change field definitions from type text_general to type text_zh in schema.xml:
<fields>
  ...
  <field name="textbody" type="text_zh" ... />
</fields>
Configuring multi-language index fields
You need to define language-dependent fields for all languages that need a special analyzer. To do so, add a new field element whose name is the name of the original field followed by an underscore and the language code, for example textbody_ja. Section 6.6, “Supported Languages in Solr Language Detection” in the appendix shows the list of supported languages.
Note | |
---|---|
Note that language-dependent fields must be indexed. A field declaration with attribute
Fields in the |
The following example shows necessary fields and additional types in
<solr-home>/configsets/content/conf/schema.xml
for supporting Simplified
Chinese, Japanese, Korean and non-CJK languages in the predefined fields
name_tokenized
and textbody
of the content
config set.
<field name="name_tokenized" type="text_general" indexed="true" stored="true"/>
<field name="name_tokenized_ja" type="text_ja" indexed="true" stored="true"/>
<field name="name_tokenized_zh-cn" type="text_zh" indexed="true" stored="true"/>
<field name="name_tokenized_ko" type="text_cjk" indexed="true" stored="true"/>
...
<field name="textbody" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="textbody_ja" type="text_ja" indexed="true" stored="false" multiValued="true"/>
<field name="textbody_zh-cn" type="text_zh" indexed="true" stored="false" multiValued="true"/>
<field name="textbody_ko" type="text_cjk" indexed="true" stored="false" multiValued="true"/>
<!-- field types "text_general" and "text_zh" are already defined in default configuration -->
<!-- field types "text_cjk" and "text_ja" are copied from the Apache Solr example configuration -->
...
In the above example, Japanese text goes into name_tokenized_ja and textbody_ja, Simplified Chinese text goes into name_tokenized_zh-cn and textbody_zh-cn, Korean text goes into name_tokenized_ko and textbody_ko, and text from all other languages is indexed in the fields name_tokenized and textbody.
Besides Simplified Chinese, you can also configure Traditional Chinese text with the fields name_tokenized_zh-tw and textbody_zh-tw. The language code zh from previous CoreMedia releases is no longer generated, but existing fields name_tokenized_zh and textbody_zh are still used as fallback when indexing and searching.
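A minimal sketch of the corresponding field declarations for Traditional Chinese, assuming the same attributes as the Simplified Chinese fields and the predefined text_zh field type, which is configured for both Simplified and Traditional Chinese:

```xml
<!-- Sketch: attributes mirror the name_tokenized_zh-cn and textbody_zh-cn declarations -->
<field name="name_tokenized_zh-tw" type="text_zh" indexed="true" stored="true"/>
<field name="textbody_zh-tw" type="text_zh" indexed="true" stored="false" multiValued="true"/>
```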
Configuring language detection
By default, the Search Engine detects the language of the index fields name_tokenized and textbody for Content Feeder indices (config set content) and of index field textbody for CAE Feeder indices (config set cae). Both use the field language to store the detected language. Language detection is skipped if the field language has been set by the feeder.
You can change these settings in the config set's file conf/solrconfig.xml, in the processor with class LangDetectLanguageIdentifierUpdateProcessorFactory below the element <updateRequestProcessorChain>:
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">textbody,name_tokenized</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
</processor>
The parameter langid.langField
defines the index field that will be filled with the
language code of the document. Section 6.6, “Supported Languages in Solr Language Detection” in the
appendix shows the list of supported languages. The value in parameter langid.fl
is a
comma-separated list of index fields that are used for language detection. The parameter
langid.fallback
configures English as fallback if the language cannot be detected from the text.
For more details about the Solr LangDetectLanguageIdentifierUpdateProcessorFactory
, see the
Solr reference guide at
https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing.
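The Solr language identification processor accepts further parameters beyond the three shown above. The following sketch is an assumption about how a project might tune detection and is not part of the default configuration: langid.whitelist restricts the accepted language codes, so that any other detected language is replaced by the fallback, and langid.threshold raises the required detection certainty. The language list and the threshold value are illustrative only.

```xml
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">textbody,name_tokenized</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
  <!-- Assumption: accept only English and the languages that have dedicated index fields;
       documents in any other detected language get the fallback value -->
  <str name="langid.whitelist">en,ja,ko,zh-cn,zh-tw</str>
  <!-- Assumption: require a higher detection certainty than Solr's default of 0.5 -->
  <float name="langid.threshold">0.8</float>
</processor>
```

Check the effect of a whitelist carefully before using it, because it also changes the value stored in the language field for documents in languages outside the list.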
Configuring language-dependent field handling
In order to be flexible, the Search Engine separates language detection and the handling of language-dependent fields. Therefore, field handling is configured in a separate class.
You can change these language-dependent field handling settings in the config set's file conf/solrconfig.xml, in the processor with class LanguageDependentFieldsProcessorFactory below the element <updateRequestProcessorChain>.
<processor class="com.coremedia.solr.update.processor.LanguageDependentFieldsProcessorFactory">
  <str name="languageField">language</str>
  <str name="textFields">textbody,name_tokenized</str>
</processor>
The parameter languageField
defines the index field that contains the language code of the
document. This must be the same value as configured for language detection above.
The value in the parameter textFields
is a comma-separated list of fields whose content should be
put into language-dependent fields if such fields exist for the language. Normally, this is the same value
as configured for language detection, unless you want to exclude some text fields from language detection.
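As a sketch of the exception mentioned above, the following configuration detects the language from textbody only, while the content of both text fields is still put into language-dependent fields. Excluding name_tokenized from detection is an illustrative assumption, for example for projects whose document names are too short to detect a language reliably.

```xml
<!-- Language is detected from the body text only (illustrative choice) -->
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <str name="langid.fl">textbody</str>
  <str name="langid.langField">language</str>
  <str name="langid.fallback">en</str>
</processor>
<!-- Both text fields are handled as language-dependent fields;
     languageField matches langid.langField above -->
<processor class="com.coremedia.solr.update.processor.LanguageDependentFieldsProcessorFactory">
  <str name="languageField">language</str>
  <str name="textFields">textbody,name_tokenized</str>
</processor>
```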
Configuring the search request handler
By default, the search request handlers for Content Feeder and CAE Feeder indices are configured in solrconfig.xml to search across multiple index fields. For example, the config set content configures the /editor search request handler with the qf parameter to search in the fields textbody, name_tokenized and numericid. Matches in the field name_tokenized are scored higher than matches in textbody because of the configured ^2 boost. Note that the language-dependent fields name_tokenized_* and textbody_* are not configured here but will be picked up automatically.
<requestHandler name="/editor" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">cmdismax</str>
    <str name="echoParams">none</str>
    <float name="tie">0.1</float>
    <str name="qf">textbody name_tokenized^2 numericid^10</str>
    <str name="pf">textbody name_tokenized^2</str>
    <str name="mm">100%</str>
    <str name="q.alt">*:*</str>
    <str name="suggest.spellcheck.dictionary">textbody</str>
  </lst>
  <arr name="last-components">
    <str>suggest</str>
    <str>spellcheck</str>
  </arr>
</requestHandler>
Adapt the configuration of the request handler's qf
and pf
parameters if you want to use other default search fields.
The predefined request handlers can also be used in custom search applications. They can be selected in SolrJ by calling SolrQuery.setParam(CommonParams.QT, "/cmdismax"); or by appending /cmdismax to the URL used to connect to Solr. If you prefer Solr's standard search handler, you have to search across language-dependent fields explicitly, for instance by constructing "OR" queries in Lucene query syntax or by configuring all fields for the standard Solr dismax or edismax query parsers.
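The last option can be sketched as a request handler that uses Solr's standard edismax query parser and lists every language-dependent field explicitly in qf. The handler name /multilang is hypothetical, and the field list assumes the Japanese, Simplified Chinese and Korean fields from the schema example earlier in this section; adapt both to your own schema.

```xml
<!-- Sketch: hypothetical handler for custom applications that do not use cmdismax -->
<requestHandler name="/multilang" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- every language-dependent field must be listed, it is not picked up automatically -->
    <str name="qf">
      textbody textbody_ja textbody_zh-cn textbody_ko
      name_tokenized^2 name_tokenized_ja^2 name_tokenized_zh-cn^2 name_tokenized_ko^2
    </str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```

With this approach, each newly added language field also has to be added to qf by hand, which is the maintenance cost that the predefined request handlers avoid.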