3.5.1. Details of Language Processing Steps

The following paragraphs describe some details of the language processing steps.

Language detection

The Solr config sets content and cae, which are used for Content Feeder and CAE Feeder indices, define the field language in their index schema, schema.xml. This field holds the language of the index document, if available.

It is recommended to let feeder applications set the language of index documents if the language is available at that point. The Content Feeder and CAE Feeder applications of the CoreMedia Blueprint automatically set the language field for CMLocalized documents and content beans. See Section 4.2.2, “Content Configuration” and Section 5.4.3, “Customizing Feedables” to learn how to set index fields such as the language field in the Content Feeder and CAE Feeder.

If the language field is not already set by the feeder, then the search engine will try to detect the language of the index document from its content and set the field accordingly. To this end, the file solrconfig.xml configures a Solr LangDetectLanguageIdentifierUpdateProcessorFactory to detect the language of incoming index documents. It is described in detail in the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing. See Section 6.6, “Supported Languages in Solr Language Detection” in the appendix of this manual for a list of supported languages. The language code from that list is stored as the value of the language field.
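Such a processor is wired into an update request processor chain in solrconfig.xml. The following fragment is a minimal sketch of what this configuration typically looks like; the chain name and the source fields listed in langid.fl are assumptions, not the Blueprint's actual values:

```xml
<!-- Hypothetical sketch of a language detection chain in solrconfig.xml.
     The chain name and the fields in langid.fl are assumptions. -->
<updateRequestProcessorChain name="languageDetection">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- Fields whose text is analyzed to detect the language -->
    <str name="langid.fl">name,textbody</str>
    <!-- Index field that receives the detected language code -->
    <str name="langid.langField">language</str>
    <!-- Keep a language value that was already set by the feeder -->
    <bool name="langid.overwrite">false</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Setting langid.overwrite to false preserves a language explicitly provided by the feeder, so detection only kicks in as a fallback.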

Note

Language detection may not always return the correct language, especially for very short texts. The language should be set by the feeder, if it is known in advance.

Knowing the language of an index document is a prerequisite to index text in a language-specific way. The search engine can put the text in a field that is specially configured for that language, for example with correct rules to break the text into single words.

Tokenization

To provide search functionality, the search engine needs to split text into searchable words. This process is commonly referred to as tokenization or word segmentation. Most languages use whitespace to separate words, which means that text can be tokenized by splitting it at whitespace. Chinese, Japanese, and Korean texts cannot be tokenized this way: Chinese and Japanese do not use whitespace at all, and Korean does not use whitespace consistently.
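The difference shows up in the analyzer configuration of a field type in schema.xml. The following fragment is an illustrative sketch, assuming the Solr standard tokenizer for whitespace-separated languages and the Kuromoji tokenizer for Japanese; the field type names are hypothetical:

```xml
<!-- Whitespace-separated languages: standard tokenization suffices -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Japanese: the Kuromoji tokenizer performs dictionary-based word segmentation -->
<fieldType name="text_jp" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```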

Indexing into language dependent fields

Text must be indexed into a separate language dependent field to tokenize or preprocess it according to its language. This is the basis for efficient language dependent search. Depending on your requirements you can configure correct tokenization for CJK languages or add some language-specific analysis steps such as stemming for western languages. In both cases you need to configure language dependent fields.

Example

A customized schema.xml defines the index fields name_tokenized and name_tokenized_jp. If the feeder feeds a document with Japanese text in its name, then the text will be indexed in the field name_tokenized_jp, and the field name_tokenized will be empty for that document. Another document contains German text in its name; that text will be indexed in the field name_tokenized, because schema.xml does not define a field name_tokenized_de.
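In schema.xml, the field definitions behind this example might look as follows. This is a sketch under the assumption that field types such as text_general and text_jp exist in the schema; the attribute values are illustrative:

```xml
<!-- Fallback field used when no language-specific variant exists -->
<field name="name_tokenized" type="text_general" indexed="true" stored="false"/>
<!-- Japanese-specific variant; selected for documents whose language field is "ja" -->
<field name="name_tokenized_jp" type="text_jp" indexed="true" stored="false"/>
```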

Search in language-dependent fields

When searching in Studio, in Site Manager, or with the Unified API's SearchService methods, searches are automatically performed across multiple fields, including language-dependent fields. To this end, the Search Engine contains a CoreMedia-specific Solr query parser named cmdismax. This parser is a variant of Solr's standard dismax query parser (see https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser for more details). The improvements of the cmdismax parser are support for wildcard searches (for example, core*) and searching across all language-dependent fields.

The default Solr config sets for Content Feeder and CAE Feeder indices configure search request handlers to use the cmdismax parser in solrconfig.xml: the handler /editor for editorial search in the content config set and the handler /cmdismax for website search in the cae config set.
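A request handler selects a query parser via the defType parameter in its defaults. The following fragment is a hypothetical sketch of such a handler definition in solrconfig.xml; it only illustrates where defType is set and does not reproduce the Blueprint's actual handler configuration:

```xml
<!-- Hypothetical sketch: a search handler using the cmdismax parser -->
<requestHandler name="/editor" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the CoreMedia-specific cmdismax query parser -->
    <str name="defType">cmdismax</str>
  </lst>
</requestHandler>
```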

If you want to use a different query parser such as the default Lucene query parser or the Solr Extended DisMax (edismax) query parser, you must explicitly search in all required language-dependent fields. For the edismax query parser this would mean enumerating all required language-dependent fields in the qf (query fields) parameter.
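With edismax, the language-dependent fields are not found automatically, so each variant must appear in qf. The fragment below is an illustrative sketch; the handler name and the field list are assumptions based on the earlier name_tokenized example:

```xml
<!-- Hypothetical edismax handler; the field list in qf is an assumption -->
<requestHandler name="/select-edismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Every required language-dependent variant must be listed explicitly -->
    <str name="qf">name_tokenized name_tokenized_jp name_tokenized_zh</str>
  </lst>
</requestHandler>
```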