6.1. Content Feeder Configuration

The Content Feeder is configured in the files WEB-INF/application.properties and WEB-INF/application.xml.

Solr specific configuration properties

These properties are configured in file application.properties of the Content Feeder.

Attribute Value Default Description
feeder.solr.url URL http://localhost:8082/solr/coremedia The URL where the Content Feeder can reach the Search Engine. The URL points to the Apache Solr core for the Content Feeder.
feeder.solr.username user name or empty (empty) User name for HTTP Basic authentication when connecting to the Apache Solr web application. Leave empty for no authentication.
feeder.solr.password password or empty (empty) Password for HTTP Basic authentication when connecting to the Apache Solr web application.
feeder.solr.collection String coremedia The collection that should be used by the Content Feeder.
feeder.solr.sendRetryDelay time in seconds 10 Delay in seconds between attempts to send a batch.
solr.partialUpdates true or false true Specifies whether partial updates are supported for updating document metadata in Solr. This requires that all fields in the Solr index are configured as stored="true" except fields that are <copyField/> destinations, which must be configured as stored="false". This is because partial updates are applied to the index document reconstructed from the existing stored field values. Note that configuration property feeder.partialUpdate.aspects may still restrict usage of partial updates to certain document aspects.
solr.partialUpdatesSkipIndexCheck true or false false If solr.partialUpdates is true, the Solr index schema is analyzed to check whether fields are stored as required for partial updates. The Feeder will log a warning and not use partial updates if the index does not seem to support them. You can set this property to true to skip the check.

Table 6.1. Solr specific properties
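
As an illustration, the Solr settings from Table 6.1 might look as follows in application.properties. This is only a sketch: the host name solr.example.com, the user name and the password are placeholders and not values from this manual, while the port, core path and remaining values follow the documented defaults.

```properties
# Hypothetical Solr host; port and core path follow the documented default
feeder.solr.url=http://solr.example.com:8082/solr/coremedia
# Placeholder credentials; leave both values empty for no authentication
feeder.solr.username=feeder-solr
feeder.solr.password=changeit
feeder.solr.collection=coremedia
# Wait 10 seconds between attempts to send a batch (default)
feeder.solr.sendRetryDelay=10
# Keep partial updates enabled and keep the schema check active (defaults)
solr.partialUpdates=true
solr.partialUpdatesSkipIndexCheck=false
```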


General Feeder configuration properties

These properties are configured in file application.properties of the Content Feeder.

Login data

The following properties are used to define the login data for the Content Server and the administration page of the Content Feeder.

Attribute Value Default Description
feeder.management.user user name feeder The user name to be used in the HTTP authentication of the administration page of the Content Feeder. This is not an account from the user management of the Content Server.
feeder.management.password password feeder The password to be used in the HTTP authentication of the administration page of the Content Feeder.
repository.user user name feeder The user account the Content Feeder uses to read content.
repository.password password feeder The password for the user account of the Content Feeder.

Table 6.2. Properties for login


Partial update configuration

With this property you can configure the usage of partial updates, provided that the connected Indexer supports them. For Solr, this support is configured with property solr.partialUpdates.

Attribute Value Default Description
feeder.partialUpdate.aspects comma-separated list of document aspects or * multiSite The aspects of index documents that can be updated with a partial update, provided that the connected Indexer supports partial updates (for example, solr.partialUpdates=true for Solr). Multiple values are separated by comma. Use the special value * to use partial updates for all aspects, if possible. An empty value means that partial updates are not used. See the API documentation of Feedable.isPartialUpdate, FeedableAspect and ContentFeedableAspect in package com.coremedia.cap.feeder for more details.

Table 6.3. Partial update configuration
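
The three kinds of values described in Table 6.3 can be sketched as follows. Only the default aspect multiSite, the wildcard and the empty value are taken from this section. Other aspect names would have to be looked up in the API documentation of ContentFeedableAspect.

```properties
# Default: partial updates only for the multiSite aspect
feeder.partialUpdate.aspects=multiSite

# Alternative: use partial updates for all aspects, if possible
# feeder.partialUpdate.aspects=*

# Alternative: an empty value disables partial updates
# feeder.partialUpdate.aspects=
```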


Batch configuration

With these properties you can configure the processing of batches.

Attribute Value Default Description
feeder.maxBatchSize number of documents 500 The maximum number of documents in a batch.
feeder.maxBatchByteSize number of bytes 5242880 (5 MB) The maximum batch size in bytes.
feeder.sendIdleDelay time in seconds 3 The time to wait between adding a document to a batch and sending that batch to the search engine if the Content Feeder is idle. If a document was changed and no further changes are made within sendIdleDelay seconds, the document will be sent after that time to the search engine. This setting leads to a low latency for changes to become visible in search as long as the system is not very busy.
feeder.sendMaxDelay time in seconds 20 The maximum time to wait between adding a document to a batch and sending that batch. This setting is typically larger than sendIdleDelay to allow batches to grow for better throughput.
feeder.maxOpenBatches int 5 The maximum number of batches indexed in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The Content Feeder does not call the index method of the AsyncIndexer interface to index another batch if the maximum number of parallel batches has been reached. The method will not be called until a callback about the persistence of one of these batches has been received.
feeder.maxProcessedBatches int 1 The maximum number of batches processed by the Indexer in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The Content Feeder does not call the index method of the AsyncIndexer interface to index another batch if the configured number of currently processed batches has been reached. The method will not be called until a callback about completed processing or persistence of one of these batches has been received.

Table 6.4. Properties for batch configuration
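
The following sketch shows how the batch properties from Table 6.4 might be combined to favor throughput over latency on a busy system. All values are illustrative assumptions, not recommendations from this manual. Suitable values depend on document sizes and on the search engine.

```properties
# Allow larger batches than the default of 500 documents
feeder.maxBatchSize=1000
# 10 MB = 10 * 1024 * 1024 bytes (default is 5242880)
feeder.maxBatchByteSize=10485760
# On an idle system, send a changed document after 3 seconds (default)
feeder.sendIdleDelay=3
# Under load, let batches grow for up to 60 seconds (default is 20)
feeder.sendMaxDelay=60
```

With these values, a single change on an otherwise idle Content Feeder would still be sent after about 3 seconds, while a document added during continuous changes would wait at most 60 seconds before its batch is sent.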


What to feed

You can use the following properties to define which elements the Content Feeder should feed to the Search Engine.

Attribute Value Default Description
feeder.indexDeleted true or false true true if documents in the trash should be indexed. If you do not need to find documents in the trash and want to keep your index smaller, you can change this to false.
feeder.indexPath true or false false Indicates whether a document's folder path is indexed in field 'folder'. If set to true (not recommended), renaming a folder leads to refeeding of all documents below that folder. The alternative field 'folderpath', which contains the folder path as folder ids, is the recommended way to refer to a folder path.
feeder.indexReferrers true or false false true to reindex a document after its referrers have changed.
feeder.indexNameInTextBody true or false true

Configures whether the document name should be indexed in index field textbody. It can make sense to disable this if lots of document names contain unique identifiers (from third-party systems, for example) to avoid problems with too many unique terms in field textbody.

feeder.indexGroups true or false true

true to index the groups with potential read rights with the document in the index field groups. This set of groups is then used to narrow a user's search to the documents for which the user might have read rights. This is an optimization to get smaller search results for some queries and content structures and to get more accurate search suggestion counts. The client has to check for read rights anyway.

If set to false, then you should also configure the Studio application to not add a superfluous query condition for the indexed groups by setting its property studio.rest.searchService.useGroupsFilterQuery to false.

feeder.updateGroups.immediately true or false false If feeder.indexGroups is true, configures whether the field groups is updated immediately after a change of a folder's rights rule. It is recommended to keep this set to false and let the Content Feeder update the index field groups in the background with lower priority than updates for editorial changes. Setting this to true is quite expensive because all documents below the folder will be reindexed.

Table 6.5. Properties to feed additional items
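
A minimal sketch of a configuration that keeps the index small, using only properties from Table 6.5. Whether these choices fit depends on the project. The values are examples, not recommendations from this manual.

```properties
# Do not index documents in the trash
feeder.indexDeleted=false
# Keep the default: folder paths are not indexed as text in field 'folder'
feeder.indexPath=false
# Keep the default: do not reindex a document when its referrers change
feeder.indexReferrers=false
# Do not index groups with potential read rights
feeder.indexGroups=false
# Note: with feeder.indexGroups=false, the Studio application (not the
# Content Feeder) should also set
# studio.rest.searchService.useGroupsFilterQuery=false
```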


Document types to feed

You can restrict the indexed documents by their type using the includes and excludes properties.

Attribute Value Default Description
feeder.content.type.includes document type name Document_ The name of the abstract or concrete document type whose documents should be indexed. Regular expressions are not allowed.
feeder.content.type.excludes document type name Preferences, EditorPreferences, Dictionary, Query The name of the abstract or concrete document type whose documents should not be indexed. Regular expressions are not allowed.

Table 6.6. Properties to specify document types.
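
For illustration, the includes and excludes properties could be set as shown below. The sketch assumes that several type names are separated by commas, as the default value of feeder.content.type.excludes suggests. The type name MyInternalType is hypothetical.

```properties
# Index documents of all types derived from Document_ (default)
feeder.content.type.includes=Document_
# Default exclusions plus one hypothetical project-specific type
feeder.content.type.excludes=Preferences,EditorPreferences,Dictionary,Query,MyInternalType
```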


Properties to feed

The default configuration feeds all properties for all specified document types. For configuration of indexed properties by their name, see the section for XML configuration below.

Property types to feed

A document property can only be selected for indexing if its property type is enabled with the following properties.

Property Value Default Description
feeder.content.propertyType.string true or false true Set this property to false in order to exclude String properties from indexing.
feeder.content.propertyType.integer true or false false Set this property to true in order to include Integer properties when indexing.
feeder.content.propertyType.date true or false false Set this property to true in order to include Date properties when indexing.
feeder.content.propertyType.linkList true or false false Set this property to true in order to include LinkList properties when indexing.
feeder.content.propertyType.struct true or false false Set this property to true in order to include Struct properties when indexing.
feeder.content.propertyType.xmlGrammars List of included grammar names separated by comma coremedia-richtext-1.0

You can define which XML properties should be indexed by specifying their grammar.

Example

feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0

feeder.content.propertyType.blobMimeType.includes List of included MIME types separated by comma See file

You can define which blob properties are indexed, depending on the MIME type.

Example

feeder.content.propertyType.blobMimeType.includes=text/*

All blobs of MIME type text/* are indexed.

feeder.content.propertyType.blobMimeType.excludes List of excluded MIME types separated by comma (empty)

Exclude some blobs from indexing depending on the MIME type. If you have included a primary MIME type such as text/* or even the catch-all type */*, you can exclude some concrete types with this property.

Example

feeder.content.propertyType.blobMimeType.excludes=text/plain

Blobs of MIME type text/plain will not be indexed.

feeder.content.propertyType.blobMaxSize size in bytes 5242880 (5 MB)

Configure the maximum size of indexed blob properties. Larger blobs will be skipped.

This configuration can be overridden in a Spring XML configuration file where you can configure the maximum size per MIME type by customizing the bean feederContentBlobMaxSizePerMimeType. See XML configuration for an example.

Table 6.7. Include property types
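
The property type settings from Table 6.7 might be combined as in the following sketch. The chosen values are assumptions for illustration. In particular, the MIME type list and the 2 MB limit are not defaults of the Content Feeder.

```properties
# Index String properties (default) and additionally Integer and Date
feeder.content.propertyType.string=true
feeder.content.propertyType.integer=true
feeder.content.propertyType.date=true
# Keep LinkList and Struct properties excluded (defaults)
feeder.content.propertyType.linkList=false
feeder.content.propertyType.struct=false
# Index XML properties with the CoreMedia Richtext grammar (default)
feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0
# Index text blobs and PDF files, but no plain text
feeder.content.propertyType.blobMimeType.includes=text/*,application/pdf
feeder.content.propertyType.blobMimeType.excludes=text/plain
# Skip blobs larger than 2 MB (2 * 1024 * 1024 bytes)
feeder.content.propertyType.blobMaxSize=2097152
```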


Tika configuration

You can customize text extraction with Apache Tika using the following properties:

Property Value Default Description
feeder.tika.config location of Apache Tika Config XML (empty)

The location of an optional custom Apache Tika Config XML file with custom Tika parsers. The value is a Spring Resource location, for example a value such as file:/path/tika-config.xml can be used to reference a local file. Use an empty value for the default configuration.

feeder.tika.appendMetadata comma-separated list of metadata identifiers (empty)

Comma-separated list of metadata identifiers extracted from blob properties by Apache Tika that are appended to the extracted body text. See Section 4.2.3, “Advanced Configuration”.

feeder.tika.copyMetadata comma-separated list of entries for the format <metadata identifier>=<index field name> (empty)

Comma-separated list of metadata identifiers extracted from blob properties by Apache Tika and index field names to copy the metadata to. See Section 4.2.3, “Advanced Configuration”.

feeder.tika.timeout.milliseconds milliseconds 120000 (2 minutes) The maximum time after which text extraction from binary data with Apache Tika fails. If extraction fails, the binary data will be skipped for the index document. Lower values prevent the Feeder from being blocked for a long time during text extraction.
feeder.tika.warn.milliseconds milliseconds 15000 (15 seconds) The time after which a warning is logged when text extraction from binary data with Apache Tika is taking a long time.

Table 6.8. Tika configuration
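
A sketch of a Tika configuration using the properties from Table 6.8. The metadata identifier dc:creator and the index field name author are assumptions chosen for illustration. Which identifiers are available depends on the Tika parsers and the file formats involved, and the target field would presumably have to exist in the Solr schema. The file location is the example path used in the table above.

```properties
# Custom Tika configuration, referenced as a Spring Resource location
feeder.tika.config=file:/path/tika-config.xml
# Append the (assumed) metadata identifier dc:creator to the body text
feeder.tika.appendMetadata=dc:creator
# Copy dc:creator to the (hypothetical) index field 'author'
feeder.tika.copyMetadata=dc:creator=author
# Fail text extraction after 60 seconds instead of the default 2 minutes
feeder.tika.timeout.milliseconds=60000
# Log a warning after 10 seconds instead of the default 15 seconds
feeder.tika.warn.milliseconds=10000
```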


Configuration of ImageDimensionFeedablePopulator

The following properties configure the ImageDimensionFeedablePopulator bean.

Attribute Value Default Description
feeder.populator.imageDimension.docType document type name none (required) The document type of the content to be indexed, including subtypes.
feeder.populator.imageDimension.widthPropertyName document property name none The property name of the content which holds the width value. If not set, feeder.populator.imageDimension.dataPropertyName must be set.
feeder.populator.imageDimension.heightPropertyName document property name none The property name of the content which holds the height value. If not set, feeder.populator.imageDimension.dataPropertyName must be set.
feeder.populator.imageDimension.dataPropertyName document property name none The name of the blob property which holds the image data. The value of this object must be of type com.coremedia.cap.common.Blob. If not set, feeder.populator.imageDimension.widthPropertyName and feeder.populator.imageDimension.heightPropertyName must be set.
feeder.populator.imageDimension.largeWidth positive number none (required) Lower bound width of large images.
feeder.populator.imageDimension.largeHeight positive number none (required) Lower bound height of large images.
feeder.populator.imageDimension.mediumWidth positive number none (required) Lower bound width of medium images.
feeder.populator.imageDimension.mediumHeight positive number none (required) Lower bound height of medium images.

Table 6.9. Properties to configure ImageDimensionFeedablePopulator.
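
The following sketch configures the ImageDimensionFeedablePopulator to read dimensions from a blob property, so that the width and height property names can be omitted. The document type CMPicture, the property name data and all pixel thresholds are assumptions for illustration. Replace them with the names and sizes used in your content type model.

```properties
# Assumed document type for images, including its subtypes
feeder.populator.imageDimension.docType=CMPicture
# Assumed blob property holding the image data; because it is set,
# widthPropertyName and heightPropertyName are not required
feeder.populator.imageDimension.dataPropertyName=data
# Illustrative lower bounds for large and medium images
feeder.populator.imageDimension.largeWidth=1024
feeder.populator.imageDimension.largeHeight=768
feeder.populator.imageDimension.mediumWidth=400
feeder.populator.imageDimension.mediumHeight=300
```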


Error behavior

You can use the following properties to customize the Content Feeder behavior in case of errors.

Attribute Value Default Description
feeder.retrySendIdleDelay time in seconds 60 The time to wait before retrying to send documents to the search engine after failures to do so. This delay is used if the Content Feeder is idle.
feeder.retrySendMaxDelay time in seconds 600 The maximum time to wait before retrying to send documents to the search engine after failures.
feeder.retryConnectToIndexDelay.seconds time in seconds 10 The time to wait between retries to connect to the search engine on startup.
feeder.executorRetryDelay time in milliseconds 60000 The delay to wait before the Content Feeder retries to access the source data after failures.
feeder.solr.connection.timeout time in milliseconds 0 The connection timeout set on the SolrJ SolrServer. It determines how long the client waits to establish a connection without any response from the server. The default value of 0 means it will wait forever.
feeder.solr.socket.timeout time in milliseconds 600000 (10 minutes) The socket timeout set on the SolrJ SolrServer. It determines how long the client waits for a response from the server after the connection was established and the request was already sent. The value of 0 means it will wait forever.

Table 6.10. Properties for Content Feeder configuration
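
The error handling properties from Table 6.10 use different time units, which is easy to overlook. The sketch below keeps the documented defaults, except for the connection timeout, where 10000 is an assumed example value that replaces the default of waiting forever.

```properties
# Values in seconds
feeder.retrySendIdleDelay=60
feeder.retrySendMaxDelay=600
feeder.retryConnectToIndexDelay.seconds=10

# Values in milliseconds
feeder.executorRetryDelay=60000
# Assumed example: give up connecting after 10 seconds (default 0 = forever)
feeder.solr.connection.timeout=10000
# Default: wait up to 10 minutes for a response
feeder.solr.socket.timeout=600000
```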


Configure Statistics

You can configure time intervals for showing statistics on the administration page of the Content Feeder and for logging them in the Content Server log.

Attribute Value Default Description
statisticInterval time in milliseconds 3600000 Maximum time interval to show statistics on the administration page. With the default you can show overall statistics (since starting the Content Feeder) and statistics for the last n seconds, where n <= statisticInterval.
statisticLogInterval time in milliseconds 600000 Interval to log statistic information of the Content Feeder in the log file of the CoreMedia Content Server (coremedia.log).

Table 6.11. Attributes for statistics time intervals


XML configuration

The Spring XML configuration file application.xml allows more advanced configuration and customization. This section just describes the possibility to configure indexed document properties by name.

Properties to feed

If you want to restrict the indexed document properties, you can specify a map entry with included or excluded properties for some or all document types. A map entry for a super type is valid for all subtypes, unless it is overridden with an entry for a subtype. If no entry is specified for a document type or its ancestors, all document properties are included. The wildcard * stands for all properties and can be used to include or exclude all properties of a type. Note, however, that for a given type you can configure either a list of included properties or a list of excluded properties, but not both, and property lists from different entries will not be merged.

Configure included properties

The following example configures a map from document type names (abstract or concrete) to indexed properties. The values of the map are comma-separated property names of the respective document type. Only the listed properties will be indexed. Document types not listed here will by default be indexed with all properties if not configured otherwise via excluded properties.

<customize:append id="feederContentPropertyIncludesCustomizer" bean="feederContentPropertyIncludes">
  <map>
    <entry key="doctype1" value="prop1,prop2"/>
    <entry key="doctype2" value="prop3"/>
  </map>
</customize:append>

Configure excluded properties

The following example configures a map from document type names (abstract or concrete) to properties excluded from indexing. The values of the map are comma-separated property names of the respective document type. Only the properties not listed here will be indexed. Document types not listed here will by default be indexed with all properties if not configured otherwise via included properties.

<customize:append id="feederContentPropertyExcludesCustomizer" bean="feederContentPropertyExcludes">
  <map>
    <entry key="doctype4" value="prop4,prop5"/>

    <!--
      exclude all properties of doctype5
      only meta-data gets indexed
     -->
    <entry key="doctype5" value="*"/>
  </map>
</customize:append>