4.2.3. Advanced Configuration

Configuring Batch Handling

The Content Feeder sends document changes to the CoreMedia Search Engine in batches. You can configure the number of documents in a batch and when to send a batch. Batch sizes and sending rate influence the indexing speed.

Note

Configuration not mandatory: Normally you do not need to change the default settings.

The Content Feeder sends a batch when one of the following conditions is fulfilled:

  • The maximum number of documents in a batch has been reached.

  • The batch size in bytes would exceed the configured maximum if more documents were added.

  • A maximum delay (feeder.sendIdleDelay or feeder.sendMaxDelay) has elapsed.

The file feeder.properties contains properties to configure batch sending.

  • feeder.maxBatchSize: The maximum number of index documents in a batch. A smaller batch may be sent if the maximum byte size is reached first.

  • feeder.maxBatchByteSize: The maximum number of bytes allowed in a batch. A batch with fewer bytes may be sent if the maximum number of documents is reached first.

  • feeder.sendIdleDelay: The maximum time in seconds to wait before sending a new batch while the Content Feeder is idle. This value is normally small so that a single change, such as an editor changing a document, is fed quickly with low latency.

  • feeder.sendMaxDelay: The maximum time in seconds to wait before sending a new batch that is not yet full. This value is normally higher to avoid sending many small batches, for example when an importer imports a large number of documents.
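As a rough illustration, the batch properties above might be set in feeder.properties as follows. All four values are made up for this sketch and are not the product defaults; choose values that fit your content and memory budget.

```properties
# Hypothetical values for illustration only, not the shipped defaults.
# Send a batch once it holds 500 index documents ...
feeder.maxBatchSize=500
# ... or before it would grow beyond 5 MiB (5 * 1024 * 1024 bytes).
feeder.maxBatchByteSize=5242880
# When idle, wait at most 2 seconds so single editorial changes are fed quickly.
feeder.sendIdleDelay=2
# Wait at most 120 seconds for a batch that is not yet full, for example during an import.
feeder.sendMaxDelay=120
```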

Caution

Note that open batches are kept in main memory. You have to reserve 2 * feeder.maxBatchByteSize bytes for the batches.
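The memory rule can be worked through with a concrete number. The value below is an assumption chosen for easy arithmetic, not a recommended or default setting.

```properties
# Assumed value for illustration: 5 MiB = 5 * 1024 * 1024 bytes.
feeder.maxBatchByteSize=5242880
# Main memory to reserve for batches: 2 * 5242880 = 10485760 bytes (10 MiB).
```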

Configuring Error Handling

The Content Feeder automatically retries operations after certain communication problems with the CoreMedia Search Engine. The following properties configure the retry behavior:

  • feeder.retrySendIdleDelay: The maximum time in seconds to wait before sending a failed batch again while the Content Feeder is idle.

  • feeder.retrySendMaxDelay: The maximum time in seconds to wait before sending a failed batch again if the batch is not yet full.

  • feeder.solr.sendRetryDelay: The delay in seconds between a failed attempt to send a batch and the next attempt. The default value is 30 seconds.

  • feeder.retryConnectToIndexDelay.seconds: The delay in seconds between retries to connect to the Search Engine on startup. The default value is 10 seconds.

  • feeder.solr.connection.timeout: The connection timeout in milliseconds set on the SolrJ SolrServer. It determines how long the client waits to establish a connection when the server does not respond. The default value is 0, which means that the client waits indefinitely.

  • feeder.solr.socket.timeout: The socket timeout in milliseconds set on the SolrJ SolrServer. It determines how long the client waits for a response from the server after the connection has been established and the request has been sent. The default value is 600000 milliseconds (10 minutes).
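The retry and timeout properties above could be combined as in the following sketch. The values for feeder.solr.sendRetryDelay, feeder.retryConnectToIndexDelay.seconds and feeder.solr.socket.timeout repeat the defaults stated above. The other three values are assumptions for illustration only.

```properties
# Assumed values, not defaults: delays in seconds for resending failed batches.
feeder.retrySendIdleDelay=10
feeder.retrySendMaxDelay=120

# Documented defaults, repeated here for clarity (both in seconds).
feeder.solr.sendRetryDelay=30
feeder.retryConnectToIndexDelay.seconds=10

# Assumed value: give up connecting after 10 seconds instead of waiting
# indefinitely (the default 0). The unit is milliseconds.
feeder.solr.connection.timeout=10000
# Documented default: wait up to 10 minutes for a response (milliseconds).
feeder.solr.socket.timeout=600000
```

Note that the delays are given in seconds, while the two Solr timeouts are given in milliseconds.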

Configuring Tika

Apache Tika is used to extract text from blob properties for indexing. It provides parsers for various formats, which can be customized in an Apache Tika XML configuration file. The default configuration covers typical formats, so a custom configuration is rarely needed. If you need to fine-tune the configuration, see the Apache Tika documentation for the format of the Tika Config XML file. The location of this file is configured with the Spring configuration property feeder.tika.config, whose value is a Spring Resource location. The following example configures an Apache Tika Config file from the local file system:

Example

feeder.tika.config=file:/opt/path/tika-config.xml

Configuring Tika metadata extraction

In addition to extracting body text, Tika can extract metadata from some binary formats, such as the creator of a Microsoft Word file. You can use the configuration properties feeder.tika.appendMetadata and feeder.tika.copyMetadata to extract and index this metadata.

The property feeder.tika.appendMetadata takes a comma-separated list of metadata identifiers. The Content Feeder simply appends the matching metadata values to the indexed body text when Apache Tika extracts such a value.
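A minimal sketch of such a list follows. The identifier creator is the one used elsewhere in this section. The identifier title is an assumption for illustration; check which identifiers your Apache Tika version actually reports before relying on it.

```properties
# Append the values of these metadata identifiers to the indexed body text.
# "title" is a hypothetical second entry that shows the comma-separated form.
feeder.tika.appendMetadata=creator,title
```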

The property feeder.tika.copyMetadata takes a comma-separated list where each entry consists of a metadata identifier followed by an equal sign (=) and the name of the index field the metadata should be copied to. When a matching metadata value is found, it is stored in the configured index field. Note that with Apache Solr, target index fields must be defined with multiValued="true" to avoid indexing errors when there are multiple metadata values with the same identifier. See also Section 4.5, “Modify the Search Index”.

Example

feeder.tika.copyMetadata=creator=author

The above example configures the Content Feeder to store the creator value extracted from the metadata in the index field author. Note that the index field must be declared in the Solr schema for this to work.

Metadata identifiers are specific to Apache Tika. You can find some of them in the API documentation of the Apache Tika class org.apache.tika.metadata.TikaCoreProperties.

Configuring updates of rights rule changes

The Content Feeder indexes the groups with potential read rights to a document in the index field groups. The set of groups is then used to narrow a user's search down to the documents on which that user could have read rights. This optimization reduces the number of search results on which the client must check read rights and makes search suggestion counts more accurate. The downside is an increased feeding load, because documents must be reindexed after rights rules change on any parent folder up to the root folder.

You can disable this optimization by setting the property feeder.indexGroups to false in the file feeder.properties. If you set that property to false, you should also set the Studio application's property studio.rest.searchService.useGroupsFilterQuery to false, so that Studio does not add a superfluous query condition for the indexed groups.
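Disabling the optimization therefore involves two settings in two different applications, which can be sketched as follows. Where exactly the Studio property is set depends on your deployment and is not specified here.

```properties
# In feeder.properties of the Content Feeder: do not index groups.
feeder.indexGroups=false

# In the configuration of the Studio application: do not add the
# now superfluous query condition for indexed groups.
studio.rest.searchService.useGroupsFilterQuery=false
```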

Because rights changes may lead to a lot of reindexing, the Content Feeder treats them differently from normal editorial changes. It updates index documents after rights changes in the background when it is idle. Rights changes are processed with lower priority than editorial changes, and feeding them does not block the feeding of editorial changes.