The following paragraphs describe some details of the language processing steps.
Language detection
The Solr config sets content and cae for the Content Feeder and CAE Feeder indices define the field language in their index schema in schema.xml. This field holds the language of the index document, if available.
It's recommended to let feeder applications set the language of index documents, if a language is available at that point. The Content Feeder and CAE Feeder applications of the CoreMedia Blueprint automatically set the language field for CMLocalized documents and content beans. See Section 4.2.2, “Content Configuration” and Section 5.4.3, “Customizing Feedables” to learn how to set index fields such as the language field in the Content Feeder and CAE Feeder.
If the language field is not already set by the feeder, the search engine tries to detect the language of the index document from its content and sets the field accordingly. To this end, the file solrconfig.xml configures a Solr LangDetectLanguageIdentifierUpdateProcessorFactory to detect the language of incoming index documents. It is described in detail in the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing. See Section 6.6, “Supported Languages in Solr Language Detection” in the appendix of this manual for a list of supported languages. The language code from that list is stored as the value of the language field.
Note: Language detection may not always return the correct language, especially for very short texts. The language should be set by the feeder if it is known in advance.
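A minimal sketch of such an update processor chain in solrconfig.xml could look as follows. The chain name, the source field name, and the fallback language are illustrative assumptions, not the Blueprint defaults:

```xml
<!-- Sketch: detect the language of incoming documents from an assumed
     "name" field and store the detected code in the "language" field. -->
<updateRequestProcessorChain name="languageDetection">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">name</str>
    <str name="langid.langField">language</str>
    <!-- do not overwrite a language already set by the feeder -->
    <bool name="langid.overwrite">false</bool>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```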
Knowing the language of an index document is a prerequisite to index text in a language-specific way. The search engine can put the text in a field that is specially configured for that language, for example with correct rules to break the text into single words.
Tokenization
To provide search functionality, the search engine needs to split text into searchable words. This process is commonly referred to as tokenization or word segmentation. Most languages use whitespace to separate words, which means that text can be tokenized by splitting it at whitespace. Chinese, Japanese, and Korean texts cannot be tokenized this way: Chinese and Japanese do not use whitespace at all, and Korean does not use it consistently.
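The difference can be illustrated with two field types in a schema.xml sketch. The fieldType names are assumptions for illustration, but both tokenizer factories ship with Solr:

```xml
<!-- Whitespace tokenization: works for languages that separate words
     with spaces -->
<fieldType name="text_ws_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Japanese has no whitespace between words; the Kuromoji tokenizer
     performs morphological word segmentation instead -->
<fieldType name="text_ja_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
  </analyzer>
</fieldType>
```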
Indexing into language-dependent fields
Text must be indexed into a separate language-dependent field to tokenize or preprocess it according to its language. This is the basis for efficient language-dependent search. Depending on your requirements, you can configure correct tokenization for CJK languages or add language-specific analysis steps such as stemming for Western languages. In both cases you need to configure language-dependent fields.
Example
A customized schema.xml defines the index fields name_tokenized and name_tokenized_jp. If the feeder feeds a document with Japanese text in its name, then the text will be indexed in the field name_tokenized_jp. The index field name_tokenized will be empty for that document. Another document contains German text in its name that will be indexed in the field name_tokenized, because schema.xml does not define a field name_tokenized_de.
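Field definitions backing such an example could look like the following schema.xml sketch. The fieldType names are assumptions; which field actually receives the text is decided by the document's detected or fed language:

```xml
<!-- Sketch: a general tokenized field and a Japanese-specific variant.
     Japanese documents are indexed into name_tokenized_jp; languages
     without a dedicated field fall back to name_tokenized. -->
<field name="name_tokenized" type="text_general" indexed="true" stored="false"/>
<field name="name_tokenized_jp" type="text_ja" indexed="true" stored="false"/>
```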
Search in language-dependent fields
When searching in Studio, Site Manager or with the Unified API's SearchService methods, searches are automatically performed across multiple fields including language-dependent fields. To this end, the Search Engine contains a CoreMedia-specific Solr query parser named cmdismax. This parser is a variant of Solr's standard dismax query parser (see https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser for more details). The improvements of the cmdismax parser are support for wildcard searches (for example, core*) and searching across all language-dependent fields.
The default Solr config sets for Content Feeder and CAE Feeder indices configure search request handlers to use the cmdismax parser in solrconfig.xml: the handler /editor for editorial search in the content config set and the handler /cmdismax for website search in the cae config set.
If you want to use a different query parser such as the default Lucene query parser or the Solr Extended DisMax (edismax) query parser, you must explicitly search in all required language-dependent fields. For the edismax query parser this would mean enumerating all required language-dependent fields in the qf (query fields) parameter.
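A request handler using edismax might then look like this solrconfig.xml sketch. The handler name and the field list are illustrative assumptions matching the example fields above:

```xml
<!-- Sketch: with edismax, all language-dependent fields must be
     enumerated explicitly in the qf parameter -->
<requestHandler name="/select_edismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name_tokenized name_tokenized_jp</str>
  </lst>
</requestHandler>
```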