Configuring Batch Handling
The Content Feeder sends document changes to the CoreMedia Search Engine in batches. You can configure the number of documents in a batch and when to send a batch. Batch sizes and sending rate influence the indexing speed.
Note: Configuration not mandatory. Normally you do not need to change the default settings.
The Content Feeder sends a batch when one of the following conditions is fulfilled:
The maximum number of documents in a batch has been reached.
The batch size in bytes would exceed the configured maximum if more documents were added.
A maximum time delay (feeder.sendIdleDelay or feeder.sendMaxDelay) has been reached.
The file feeder.properties contains properties to configure batch sending.
feeder.maxBatchSize: The maximum number of index documents in a batch. A smaller batch may be sent if the maximum byte size is reached first.
feeder.maxBatchByteSize: The maximum number of bytes allowed in a batch. A smaller batch may be sent if the maximum number of documents is reached first.
feeder.sendIdleDelay: The maximum time in seconds to wait before sending a new batch if the Content Feeder is idle. This value is normally small so that a document is fed quickly, with low latency, for example when an editor has changed a document.
feeder.sendMaxDelay: The maximum time in seconds to wait before sending a new batch if the batch is not yet full. This value is normally higher to avoid sending small batches, for example when large numbers of documents are imported with an importer.
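As an illustration, the four batch properties above might be set together in feeder.properties as sketched below. The values are placeholders chosen for this example, not the product defaults.

```properties
# Illustrative values only (assumptions, not the shipped defaults).

# Send a batch once it holds this many index documents ...
feeder.maxBatchSize=500
# ... or before adding more documents would exceed this many bytes.
feeder.maxBatchByteSize=5242880

# Seconds to wait before sending when the Content Feeder is idle
# (kept small for low latency after single editorial changes).
feeder.sendIdleDelay=2
# Seconds to wait before sending a batch that is not yet full
# (kept higher to avoid many small batches during bulk imports).
feeder.sendMaxDelay=120
```

With settings of this shape, a single editorial change would be sent shortly after the Content Feeder becomes idle, while a bulk import would fill batches up to one of the two size limits or until the longer delay expires.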
Caution: Note that open batches are kept in main memory. You have to reserve
Configuring Error Handling
The Content Feeder automatically retries operations after some communication problems with the CoreMedia Search Engine. The following properties configure the retry behavior:
feeder.retrySendIdleDelay: The maximum time in seconds to wait before sending a failed batch again, if the Content Feeder is idle.
feeder.retrySendMaxDelay: The maximum time in seconds to wait before sending a failed batch again, if the batch is not yet full.
feeder.solr.sendRetryDelay: The delay in seconds between a failed attempt to send a batch and the next try. The default value is 30 seconds.
feeder.retryConnectToIndexDelay.seconds: The delay in seconds between retries to connect to the Search Engine on startup. The default value is 10 seconds.
feeder.solr.connection.timeout: The connection timeout set on the SolrJ SolrServer. It determines how long the client waits to establish a connection without any response from the server. The timeout is configured in milliseconds. The default value is 0, which means that the client waits forever.
feeder.solr.socket.timeout: The socket timeout set on the SolrJ SolrServer. It determines how long the client waits for a response from the server after the connection was established and the request was sent. The default value is 600000 milliseconds, which means that the client waits for 10 minutes.
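The retry and timeout properties above could be combined as in the following sketch. Entries marked as defaults repeat the default values stated above. All other values are assumptions made for this example.

```properties
# Assumed values: the source states no defaults for these two properties.
feeder.retrySendIdleDelay=10
feeder.retrySendMaxDelay=60

# Stated default: 30 seconds between a failed send and the next try.
feeder.solr.sendRetryDelay=30
# Stated default: 10 seconds between connection attempts on startup.
feeder.retryConnectToIndexDelay.seconds=10

# Milliseconds. The stated default 0 waits forever, so this example
# sets a finite value instead (assumed value: 30 seconds).
feeder.solr.connection.timeout=30000
# Milliseconds. Stated default: 600000, that is, 10 minutes.
feeder.solr.socket.timeout=600000
```

Setting a finite connection timeout is a design choice for this example: it would let the Content Feeder fail fast and retry instead of blocking indefinitely on an unreachable server.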
Configuring Tika
Apache Tika is used to extract text from blob properties for indexing. It provides parsers for various formats, which can be customized in a special Apache Tika XML configuration file. The default configuration covers typical formats so that a custom configuration is rarely needed. If you need to fine-tune the configuration of Apache Tika, please have a look at the documentation of Apache Tika for the format of the Tika Config XML file. The location of this file can be configured with the Spring configuration property feeder.tika.config. The value of this property is a Spring Resource location.
The following example configures an Apache Tika Config file from the local file system:
Example
feeder.tika.config=file:/opt/path/tika-config.xml
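Because the value is a Spring Resource location, other standard Spring prefixes such as classpath: should work as well. The following sketch assumes that the configuration file is packaged with the application. The path is hypothetical.

```properties
# Hypothetical path; "classpath:" is a standard Spring Resource prefix.
feeder.tika.config=classpath:/com/example/tika-config.xml
```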
Configuring Tika metadata extraction
In addition to extracting body text, Tika can extract metadata for some binary formats such as the creator of a Microsoft Word file. You can use the configuration properties feeder.tika.appendMetadata and feeder.tika.copyMetadata to extract and index metadata from binary formats.
The property feeder.tika.appendMetadata takes a comma-separated list of metadata identifiers. The Content Feeder simply appends the matching metadata values to the indexed body text when Apache Tika extracts such a value.
The property feeder.tika.copyMetadata takes a comma-separated list where each entry consists of a metadata identifier followed by an equals sign (=) and the name of the index field the metadata should be copied to. When a matching metadata value is found, it will be stored in the configured index field. Note that with Apache Solr, target index fields must be defined as multiValued="true" to avoid indexing errors if there are multiple metadata values with the same identifier. See also Section 4.5, “Modify the Search Index”.
Example
feeder.tika.copyMetadata=creator=author
The above example configures the Content Feeder to store the creator as extracted from the metadata in the index field author. Note that the index field must be declared in the Solr schema for this to work.
Metadata identifiers are specific to Apache Tika. You can find some of them in the API documentation of the Apache Tika class org.apache.tika.metadata.TikaCoreProperties.
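To illustrate both metadata properties together, the following sketch appends the creator to the indexed body text and copies two metadata values to index fields. The identifier creator and the field author come from the example above. The identifier title and the field doctitle are assumptions made for this sketch, so check which identifiers your Tika version actually reports.

```properties
# Append the extracted creator value to the indexed body text.
feeder.tika.appendMetadata=creator

# Copy metadata values to index fields, as identifier=field pairs.
# "title" and "doctitle" are hypothetical; the target fields must be
# declared in the Solr schema with multiValued="true".
feeder.tika.copyMetadata=creator=author,title=doctitle
```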
Configuring updates of rights rule changes
The Content Feeder indexes the groups with potential read rights to a document in the index field groups. The set of groups is then used to narrow a user's search down to the documents to which the user could have read rights. This optimization reduces the number of search results on which the client must check read rights and makes search suggestion numbers more accurate. The downside of this optimization is an increased feeding load, because documents must be reindexed after rights rules change on any parent folder up to the root folder. You can disable this optimization by setting the property feeder.indexGroups to false in the file feeder.properties. If you have set that property to false, you should also configure the Studio application not to add a superfluous query condition for the indexed groups by setting its property studio.rest.searchService.useGroupsFilterQuery to false.
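Disabling the optimization therefore involves two settings, sketched below. The first belongs in feeder.properties, as described above. Where the Studio property is set depends on how your Studio application is configured, which is not covered here.

```properties
# Content Feeder (feeder.properties): do not index groups.
feeder.indexGroups=false

# Studio application: do not add the query condition for indexed groups.
studio.rest.searchService.useGroupsFilterQuery=false
```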
Because rights changes may lead to lots of reindexing, the Content Feeder treats these changes differently than normal editorial changes. It updates index documents after rights changes in the background when it is idle. Rights changes are processed with lower priority than editorial changes. Feeding of rights changes does not block feeding of editorial changes.