6.1. Content Feeder Configuration

The Content Feeder is configured in the files WEB-INF/application.properties and WEB-INF/application.xml.

Solr specific configuration properties

These properties are configured in file application.properties of the Content Feeder.

Attribute	Value	Default	Description
`feeder.solr.url`	URL	http://localhost:8082/solr/coremedia	The URL where the Content Feeder can reach the Search Engine. The URL points to the Apache Solr core for the Content Feeder.
`feeder.solr.username`	user name or empty	(empty)	User name for HTTP Basic authentication when connecting to the Apache Solr web application. Leave empty for no authentication.
`feeder.solr.password`	password or empty	(empty)	Password for HTTP Basic authentication when connecting to the Apache Solr web application.
`feeder.solr.collection`	String	coremedia	The collection that should be used by the Content Feeder.
`feeder.solr.sendRetryDelay`	time in seconds	10	Delay in seconds between attempts to send a batch.
`solr.partialUpdates`	true or false	true	Specifies whether partial updates are supported for updating document metadata in Solr. This requires that all fields in the Solr index are configured as `stored="true"` except fields that are `<copyField/>` destinations, which must be configured as `stored="false"`. This is because partial updates are applied to the index document reconstructed from the existing stored field values. Note that configuration property `feeder.partialUpdate.aspects` may still restrict usage of partial updates to certain document aspects.
`solr.partialUpdatesSkipIndexCheck`	true or false	false	If `solr.partialUpdates` is true, the Feeder checks the Solr index schema to verify that fields are stored as required for partial updates. If the index does not seem to support partial updates, the Feeder logs a warning and does not use them. Set this property to true to skip the check.

Table 6.1. Solr specific properties
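
As an illustration, the Solr settings from the table above could be combined as follows in application.properties. This is only a sketch: the host name, user name and password are hypothetical placeholders, and the remaining values are the documented defaults.

```properties
# Hypothetical Solr host; the path points to the Solr core for the Content Feeder
feeder.solr.url=http://solr.example.com:8082/solr/coremedia
# Hypothetical credentials for HTTP Basic authentication;
# leave the user name empty for no authentication
feeder.solr.username=feeder-solr
feeder.solr.password=secret
feeder.solr.collection=coremedia
# Documented defaults, listed only for completeness
feeder.solr.sendRetryDelay=10
solr.partialUpdates=true
solr.partialUpdatesSkipIndexCheck=false
```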

General Feeder configuration properties

These properties are configured in file application.properties of the Content Feeder.

Login data

The following properties are used to define the login data for the Content Server and the administration page of the Content Feeder.

Attribute	Value	Default	Description
`feeder.management.user`	user name	feeder	The user name to be used in the HTTP authentication of the administration page of the Content Feeder. This is not an account from the user management of the Content Server.
`feeder.management.password`	password	feeder	The password to be used in the HTTP authentication of the administration page of the Content Feeder.
`repository.user`	user name	feeder	The user account the Content Feeder uses to read content.
`repository.password`	password	feeder	The password for the user account of the Content Feeder.

Table 6.2. Properties for login
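
Because all four login properties default to feeder, a production setup would typically override them. The following fragment is a sketch with hypothetical account names and passwords, not recommended values.

```properties
# Hypothetical credentials for the administration page of the Content Feeder
# (not an account from the user management of the Content Server)
feeder.management.user=feeder-admin
feeder.management.password=change-me
# Hypothetical Content Server account that the Content Feeder uses to read content
repository.user=feeder
repository.password=change-me-too
```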

Partial update configuration

With this property you can configure the usage of partial updates, provided that the connected Indexer supports them (for Solr, this is controlled by the property solr.partialUpdates).

Attribute	Value	Default	Description
`feeder.partialUpdate.aspects`	comma-separated list of document aspects or *	multiSite	The aspects of index documents that can be updated with a partial update, provided that the connected Indexer supports partial updates (for example, `solr.partialUpdates=true` for Solr). Multiple values are separated by comma. Use the special value * to use partial updates for all aspects, if possible. An empty value means that partial updates are not used. See the API documentation of `Feedable.isPartialUpdate`, `FeedableAspect` and `ContentFeedableAspect` in package `com.coremedia.cap.feeder` for more details.

Table 6.3. Partial update configuration
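
The interaction between the two partial update properties can be sketched as follows. The fragment assumes that the Solr index schema meets the stored-field requirements described for `solr.partialUpdates`; the choice of * is illustrative only.

```properties
# The Indexer must support partial updates at all
solr.partialUpdates=true
# Use partial updates for all aspects, if possible (the default is multiSite)
feeder.partialUpdate.aspects=*
# An empty value would disable partial updates instead:
# feeder.partialUpdate.aspects=
```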

Batch configuration

With these properties you can configure the processing of batches.

Attribute	Value	Default	Description
`feeder.maxBatchSize`	number of documents	500	The maximum number of documents in a batch.
`feeder.maxBatchByteSize`	number of bytes	5242880 (5 MB)	The maximum batch size in bytes.
`feeder.sendIdleDelay`	time in seconds	3	The time to wait between adding a document to a batch and sending that batch to the search engine if the Content Feeder is idle. If a document was changed and no further changes are made within `sendIdleDelay` seconds, the document will be sent after that time to the search engine. This setting leads to a low latency for changes to become visible in search as long as the system is not very busy.
`feeder.sendMaxDelay`	time in seconds	20	The maximum time to wait between adding a document to a batch and sending that batch. This setting is typically larger than `sendIdleDelay` to allow batches to grow for better throughput.
`feeder.maxOpenBatches`	int	5	The maximum number of batches indexed in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The Content Feeder does not call the index method of the AsyncIndexer interface to index another batch if the maximum number of parallel batches has been reached. The method will not be called until a callback about the persistence of one of these batches has been received.
`feeder.maxProcessedBatches`	int	1	The maximum number of batches processed by the Indexer in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The Content Feeder does not call the index method of the AsyncIndexer interface to index another batch if the configured number of currently processed batches has been reached. The method will not be called until a callback about completed processing or persistence of one of these batches has been received.

Table 6.4. Properties for batch configuration
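
The following fragment sketches how the batch properties could be adjusted to favor throughput over latency, for example during a bulk import. All values are illustrative assumptions, not tested recommendations; `feeder.maxOpenBatches` and `feeder.maxProcessedBatches` are omitted because they are not used with the default Apache Solr integration.

```properties
# Allow larger batches than the default of 500 documents
feeder.maxBatchSize=1000
# Keep the default byte limit of 5 MB
feeder.maxBatchByteSize=5242880
# Send after 3 seconds without further changes when the Content Feeder is idle (default)
feeder.sendIdleDelay=3
# Let batches grow for up to 60 seconds (default is 20) before they are sent
feeder.sendMaxDelay=60
```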

What to feed

You can use the following properties to define which elements the Content Feeder should feed to the Search Engine.

Attribute	Value	Default	Description
`feeder.indexDeleted`	`true` or `false`	`true`	`true` if documents in the trash should be indexed. If you do not need to find documents in the trash and want to keep your index smaller, you can change this to `false`.
`feeder.indexPath`	`true` or `false`	`false`	Indicates whether a document's folder path is indexed in field 'folder'. If set to `true` (not recommended), folder renames lead to refeeding of all documents below that folder. The alternative field 'folderpath', which contains the folder path as folder ids, is the recommended way to refer to a folder path.
`feeder.indexReferrers`	`true` or `false`	`false`	`true` to reindex a document after its referrers have changed.
`feeder.indexNameInTextBody`	`true` or `false`	`true`	Configures whether the document name should be indexed in index field textbody. It can make sense to disable this if lots of document names contain unique identifiers (from third-party systems, for example) to avoid problems with too many unique terms in field textbody.
`feeder.indexGroups`	`true` or `false`	`true`	`true` to index the groups with potential read rights with the document in the index field `groups`. This set of groups is then used to narrow a user's search to the documents for which the user might have read rights. This is an optimization to get smaller search results for some queries and content structures and to get more accurate search suggestion counts. The client has to check for read rights anyway. If set to `false`, then you should also configure the Studio application to not add a superfluous query condition for the indexed groups by setting its property `studio.rest.searchService.useGroupsFilterQuery` to `false`.
`feeder.updateGroups.immediately`	`true` or `false`	`false`	If `feeder.indexGroups` is `true`, configures whether the field `groups` is updated immediately after a change of a folder's rights rule. It is recommended to keep this set to `false` and let the Content Feeder update the index field `groups` in the background with lower priority than updates for editorial changes. Setting this to `true` is quite expensive because all documents below the folder will be reindexed.

Table 6.5. Properties to feed additional items
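
As a sketch, a setup that wants a smaller index and does not need the groups optimization might use the following settings. Whether these choices fit depends on the project; the fragment only illustrates how the properties relate to each other.

```properties
# Do not index documents in the trash to keep the index smaller
feeder.indexDeleted=false
# Keep the recommended default: use field 'folderpath' instead of 'folder'
feeder.indexPath=false
# Do not index groups with potential read rights
feeder.indexGroups=false
# When feeder.indexGroups is false, the Studio application (not the Content Feeder)
# should additionally be configured with:
#   studio.rest.searchService.useGroupsFilterQuery=false
```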

Document types to feed

You can restrict the indexed documents by their type using the includes and excludes properties.

Attribute	Value	Default	Description
`feeder.content.type.includes`	document type name	Document_	The name of the abstract or concrete document type whose documents should be indexed. Regular expressions are not allowed.
`feeder.content.type.excludes`	document type name	Preferences, EditorPreferences, Dictionary, Query	The name of the abstract or concrete document type whose documents should not be indexed. Regular expressions are not allowed.

Table 6.6. Properties to specify document types.
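
The following fragment sketches how an additional document type could be excluded from indexing. `MyInternalType` is a hypothetical type name, and the fragment assumes that a custom excludes value replaces the default list, so the default entries are repeated in the same format as shown in the table.

```properties
# Index all documents of the root type and its subtypes (default)
feeder.content.type.includes=Document_
# Default excludes plus the hypothetical type MyInternalType
feeder.content.type.excludes=Preferences, EditorPreferences, Dictionary, Query, MyInternalType
```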

Properties to feed

The default configuration feeds all properties for all specified document types. For configuration of indexed properties by their name, see the section for XML configuration below.

Property types to feed

A document property can only be selected for indexing if its property type is enabled by the following properties.

Property	Value	Default	Description
`feeder.content.propertyType.string`	`true` or `false`	`true`	Set this property to `false` in order to exclude `String` properties from indexing.
`feeder.content.propertyType.integer`	`true` or `false`	`false`	Set this property to `true` in order to include `Integer` properties when indexing.
`feeder.content.propertyType.date`	`true` or `false`	`false`	Set this property to `true` in order to include `Date` properties when indexing.
`feeder.content.propertyType.linkList`	`true` or `false`	`false`	Set this property to `true` in order to include `LinkList` properties when indexing.
`feeder.content.propertyType.struct`	`true` or `false`	`false`	Set this property to `true` in order to include `Struct` properties when indexing.
`feeder.content.propertyType.xmlGrammars`	List of included grammar names separated by comma	`coremedia-richtext-1.0`	You can define which XML properties should be indexed by specifying their grammar. Example: `feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0`
`feeder.content.propertyType.blobMimeType.includes`	List of included MIME types separated by comma	See file	You can define which blob properties are indexed, depending on the MIME type. Example: `feeder.content.propertyType.blobMimeType.includes=text/*` All blobs of MIME type `text/*` are indexed.
`feeder.content.propertyType.blobMimeType.excludes`	List of excluded MIME types separated by comma	(empty)	Exclude some blobs from indexing depending on the MIME type. If you've included a primary MIME type such as `text/*` or even the catch-all type `*/*`, you can exclude some concrete types with this property. Example: `feeder.content.propertyType.blobMimeType.excludes=text/plain` Blobs of MIME type `text/plain` will not be indexed.
`feeder.content.propertyType.blobMaxSize`	size in bytes	5242880 (5 MB)	Configure the maximum size of indexed blob properties. Larger values will be skipped. This configuration can be overridden in a Spring XML configuration file where you can configure the maximum size per MIME type by customizing the bean `feederContentBlobMaxSizePerMimeType`. See XML configuration for an example.

Table 6.7. Include property types
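
A possible combination of the property type settings is sketched below. It additionally enables Integer and Date properties and lowers the blob size limit; the chosen values are illustrative assumptions.

```properties
# String properties are indexed by default
feeder.content.propertyType.string=true
# Additionally index Integer and Date properties (both default to false)
feeder.content.propertyType.integer=true
feeder.content.propertyType.date=true
# Index XML properties with the CoreMedia rich text grammar (default)
feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0
# Do not index blobs of MIME type text/plain
feeder.content.propertyType.blobMimeType.excludes=text/plain
# Skip blobs larger than 1 MB (default is 5242880, that is 5 MB)
feeder.content.propertyType.blobMaxSize=1048576
```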

Tika configuration

You can customize text extraction with Apache Tika using the following properties:

Property	Value	Default	Description
`feeder.tika.config`	location of Apache Tika Config XML	(empty)	The location of an optional custom Apache Tika Config XML file with custom Tika parsers. The value is a Spring Resource location, for example a value such as `file:/path/tika-config.xml` can be used to reference a local file. Use an empty value for the default configuration.
`feeder.tika.appendMetadata`	comma-separated list of metadata identifiers	(empty)	Comma-separated list of metadata identifiers extracted from blob properties by Apache Tika that are appended to the extracted body text. See Section 4.2.3, “Advanced Configuration”.
`feeder.tika.copyMetadata`	comma-separated list of entries in the format <metadata identifier>=<index field name>	(empty)	Comma-separated list of metadata identifiers extracted from blob properties by Apache Tika and index field names to copy the metadata to. See Section 4.2.3, “Advanced Configuration”.
`feeder.tika.timeout.milliseconds`	milliseconds	`120000` (2 minutes)	Set the maximum time after which text extraction from binary data with Apache Tika fails. If extraction fails, the binary data will be skipped for the index document. Lower values prevent the Feeder from being blocked for a long time in text extraction.
`feeder.tika.warn.milliseconds`	milliseconds	`15000` (15 seconds)	Set the time after which a warning is logged when text extraction from binary data with Apache Tika is still running.

Table 6.8. Tika configuration
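
The fragment below sketches a Tika setup with a custom configuration file and tighter time limits. The file location is the placeholder path used in the table, and the time values are illustrative assumptions.

```properties
# Spring Resource location of a custom Apache Tika Config XML file (placeholder path)
feeder.tika.config=file:/path/tika-config.xml
# Fail text extraction after 60 seconds instead of the default 2 minutes
feeder.tika.timeout.milliseconds=60000
# Log a warning after 10 seconds instead of the default 15 seconds
feeder.tika.warn.milliseconds=10000
```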

Configuration of ImageDimensionFeedablePopulator

The following properties configure the ImageDimensionFeedablePopulator bean.

Attribute	Value	Default	Description
`feeder.populator.imageDimension.docType`	document type name	none (required)	The document type of the content to be indexed, including subtypes.
`feeder.populator.imageDimension.widthPropertyName`	document property name	none	The property name of the content which holds the width value. If not set, `feeder.populator.imageDimension.dataPropertyName` must be set.
`feeder.populator.imageDimension.heightPropertyName`	document property name	none	The property name of the content which holds the height value. If not set, `feeder.populator.imageDimension.dataPropertyName` must be set.
`feeder.populator.imageDimension.dataPropertyName`	document property name	none	The name of the blob property which holds the image data. The value of this property must be of type `com.coremedia.cap.common.Blob`. If not set, `feeder.populator.imageDimension.widthPropertyName` and `feeder.populator.imageDimension.heightPropertyName` must be set.
`feeder.populator.imageDimension.largeWidth`	positive number	none (required)	Lower bound width of large images.
`feeder.populator.imageDimension.largeHeight`	positive number	none (required)	Lower bound height of large images.
`feeder.populator.imageDimension.mediumWidth`	positive number	none (required)	Lower bound width of medium images.
`feeder.populator.imageDimension.mediumHeight`	positive number	none (required)	Lower bound height of medium images.

Table 6.9. Properties to configure ImageDimensionFeedablePopulator.
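
A configuration that derives the dimensions from the image blob might look like the following sketch. The document type name `CMPicture`, the property name `data` and all threshold values are hypothetical and must be replaced with names and sizes from the actual content type model.

```properties
# Hypothetical document type (including subtypes) whose images are classified
feeder.populator.imageDimension.docType=CMPicture
# Hypothetical blob property that holds the image data; because it is set,
# widthPropertyName and heightPropertyName can be omitted
feeder.populator.imageDimension.dataPropertyName=data
# Illustrative lower bounds for large images
feeder.populator.imageDimension.largeWidth=1200
feeder.populator.imageDimension.largeHeight=800
# Illustrative lower bounds for medium images
feeder.populator.imageDimension.mediumWidth=600
feeder.populator.imageDimension.mediumHeight=400
```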

Error behavior

You can use the following properties to customize the Content Feeder behavior in case of errors.

Attribute	Value	Default	Description
`feeder.retrySendIdleDelay`	time in seconds	60	The time to wait before retrying to send documents to the search engine after failures to do so. This delay is used if the Content Feeder is idle.
`feeder.retrySendMaxDelay`	time in seconds	600	The maximum time to wait before retrying to send documents to the search engine after failures.
`feeder.retryConnectToIndexDelay.seconds`	time in seconds	10	The time to wait between retries to connect to the search engine on startup.
`feeder.executorRetryDelay`	time in milliseconds	60000	The delay to wait before the Content Feeder retries to access the source data after failures.
`feeder.solr.connection.timeout`	time in milliseconds	0	The connection timeout set on the SolrJ `SolrServer`. It determines how long the client waits to establish a connection without any response from the server. The default value of 0 means it will wait forever.
`feeder.solr.socket.timeout`	time in milliseconds	600000 (10 minutes)	The socket timeout set on the SolrJ `SolrServer`. It determines how long the client waits for a response from the server after the connection was established and the request was already sent. The value of 0 means it will wait forever.

Table 6.10. Properties for Content Feeder configuration
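
Because the default connection timeout of 0 waits forever, a setup might prefer a finite value. The following fragment is a sketch; the 10 second connection timeout is an illustrative assumption, and the other values are the documented defaults.

```properties
# Give up establishing a connection to Solr after 10 seconds (default 0 waits forever)
feeder.solr.connection.timeout=10000
# Wait up to 10 minutes for a response after the request was sent (default)
feeder.solr.socket.timeout=600000
# Retry delays after failures to send documents, in seconds (defaults)
feeder.retrySendIdleDelay=60
feeder.retrySendMaxDelay=600
# Delay between retries to connect to the search engine on startup, in seconds (default)
feeder.retryConnectToIndexDelay.seconds=10
```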

Configure Statistics

You can configure time intervals to show statistics on the Content Feeder admin page and in the content server log.

Attribute	Value	Default	Description
`statisticInterval`	time in milliseconds	3600000	Maximum time interval to show statistics on the administration page. With the default you can show overall statistics (since starting the Content Feeder) and statistics for the last n seconds, where n <= `statisticInterval`.
`statisticLogInterval`	time in milliseconds	600000	Interval to log statistic information of the Content Feeder in the log file of the CoreMedia Content Server (`coremedia.log`).

Table 6.11. Attributes for statistics time intervals

XML configuration

The Spring XML configuration file application.xml allows more advanced configuration and customization. This section only describes how to configure indexed document properties by name.

Properties to feed

If you want to restrict the document fields, you can specify a map entry with included or excluded fields for some or all document types. A map entry for a super type is valid for all subtypes, if not overridden with an entry for a subtype. If no entry is specified for a document type or its ancestors, all document properties are included. The wildcard * stands for all properties and can be used to include or exclude all properties of a type. Note, however, that for a certain type you can configure either a list of included properties or a list of excluded properties, but not both, and property lists from different entries will not be merged.

Configure included properties

The following example configures a map from document type names (abstract or concrete) to indexed properties. The values of the map are comma-separated property names of the respective document type. Only the listed properties will be indexed. Document types not listed here will by default be indexed with all properties if not configured otherwise via excluded properties.

<customize:append id="feederContentPropertyIncludesCustomizer" bean="feederContentPropertyIncludes">
  <map>
    <entry key="doctype1" value="prop1,prop2"/>
    <entry key="doctype2" value="prop3"/>
  </map>
</customize:append>

Configure excluded properties

The following example configures a map from document type names (abstract or concrete) to properties excluded from indexing. The values of the map are comma-separated property names of the respective document type. Only the properties not listed here will be indexed. Document types not listed here will by default be indexed with all properties if not configured otherwise via included properties.

<customize:append id="feederContentPropertyExcludesCustomizer" bean="feederContentPropertyExcludes">
  <map>
    <entry key="doctype4" value="prop4,prop5"/>

    <!--
      exclude all properties of doctype5
      only meta-data gets indexed
     -->
    <entry key="doctype5" value="*"/>
  </map>
</customize:append>

CoreMedia Search Manual, Version 7.5.45-10 Chapter 6. Appendix | 6.1. Content Feeder Configuration