2207 Deployment Manual / 3.9.2 CAE Feeder Properties

3.9.2 CAE Feeder Properties

Properties for general configuration

`repository.user`
Value	user name
Default	none
Description	The name of the user to connect to the CoreMedia Content Server.
`repository.password`
Value	password
Default	none
Description	The password of the user to connect to the CoreMedia Content Server.
`repository.domain`
Value	domain
Default	none
Description	The domain of the user to connect to the CoreMedia Content Server. Empty String for a built-in user.
`repository.url`
Value	URL
Default	none
Description	The URL to the IOR of the CoreMedia Content Server.
`jdbc.driver`
Value	driver class
Default	none
Description	The class of the database driver. For example: `oracle.jdbc.driver.OracleDriver`
`jdbc.url`
Value	URL
Default	none
Description	The URL to connect to the database.
`jdbc.user`
Value	user name
Default	none
Description	The name of the user to connect to the database.
`jdbc.login-user-name`
Value	the user name for the database login
Default	value of jdbc.user
Description	The user name for the database login. If not set, the value of `jdbc.user` is used to log in to the database. In some cases the login user name differs from the actual user. For example, PostgreSQL on Azure requires a suffix on the user name to log in. In such cases, set this property in addition to `jdbc.user`, for example `jdbc.login-user-name=username@domain` together with `jdbc.user=username`.
`jdbc.password`
Value	password
Default	none
Description	The password of the user to connect to the database.
`feeder.contentSelector.basePath`
Value	String
Default	`/Sites`
Description	A comma-separated list of base folders for which content beans are indexed. Changing this property will not trigger any re-indexing of already indexed content. See Section 5.3.2, “Resetting” in the Search Manual for details on re-indexing.
`feeder.contentSelector.contentTypes`
Value	String
Default	`Document_`
Description	A comma-separated list of content types for which content beans are indexed. Changing this property will not trigger any re-indexing of already indexed content. See Section 5.3.2, “Resetting” in the Search Manual for details on re-indexing.
`feeder.contentSelector.includeSubTypes`
Value	Boolean
Default	`true`
Description	Specifies whether the subtypes of the content types configured with property `feeder.contentSelector.contentTypes` are selected as well. Changing this property will not trigger any re-indexing of already indexed content. See Section 5.3.2, “Resetting” in the Search Manual for details on re-indexing.
`feeder.core.executor-queue-capacity`
Value	int
Default	2000
Description	Capacity of the CAE Feeder's executor queue, which is used internally to transfer evaluated values.
`feeder.core.executor-retry-delay`
Value	milliseconds
Default	60000
Description	The delay in milliseconds that the CAE Feeder waits before it retries accessing the source data after a failed attempt.
`feeder.batch.max-bytes`
Value	bytes
Default	20971520 (20 MB)
Description	The maximum size of a batch in bytes. The CAE Feeder sends a batch to the Search Engine if its maximum size would be exceeded when adding more entries. Note that the byte computation is only a rough estimate.
`feeder.batch.max-size`
Value	int
Default	500
Description	The maximum number of entries in a batch. If the maximum number is reached, the CAE Feeder sends the batch to the Search Engine.
`feeder.batch.max-open`
Value	int
Default	5
Description	The maximum number of batches indexed in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The CAE Feeder does not call the index method of the AsyncIndexer interface to index another batch if the maximum number of parallel batches has been reached. The method will not be called until a callback about the persistence of one of these batches has been received.
`feeder.batch.max-processed`
Value	int
Default	1
Description	The maximum number of batches processed by the Indexer in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The CAE Feeder does not call the index method of the AsyncIndexer interface to index another batch if the configured number of currently processed batches has been reached. The method will not be called until a callback about completed processing or persistence of one of these batches has been received.
`feeder.batch.retry-send-idle-delay`
Value	milliseconds
Default	60000
Description	The CAE Feeder sends a batch which only contains retried entries and is not full with regard to the `feeder.batch.max-size` and `feeder.batch.max-bytes` properties after the CAE Feeder was idle for the time configured in this property. A retried entry is an entry which was sent to the Search Engine before but could not be indexed successfully. If the batch contains entries which are not retried, the value of property `feeder.batch.send-idle-delay` is used instead.
`feeder.batch.retry-send-max-delay`
Value	milliseconds
Default	600000
Description	The maximum time in milliseconds between the moment the CAE Feeder receives an error from the Search Engine and the moment it tries to send the failed entry to the Search Engine again as part of a batch. The time may be exceeded if an error occurs while contacting the Search Engine. If the batch contains entries which are not retried, the value of property `feeder.batch.send-max-delay` is used instead.
`feeder.beanPropertyMaxBytes`
Value	number of bytes
Default	5242880 (5 MB)
Description	The maximum size in bytes for the value of a bean property or -1 for no limitation. Larger values are ignored and will not be sent to the Search Engine.
`feeder.beanMapping.mimeType.includes`
Value	comma-separated list of included MIME types
Default	`*/*`
Description	List of included MIME types for blob properties configured for indexing at the BeanMappingFeedablePopulator. For details, see the API documentation of method `setMimeTypeIncludes` of com.coremedia.cap.feeder.bean.BeanMappingFeedablePopulator. Example: `feeder.beanMapping.mimeType.includes=text/*` indexes only blobs of MIME type `text/*`.
`feeder.beanMapping.mimeType.excludes`
Value	comma-separated list of excluded MIME types
Default
Description	List of excluded MIME types for blob properties configured for indexing at the BeanMappingFeedablePopulator. For details, see the API documentation of method `setMimeTypeExcludes` of com.coremedia.cap.feeder.bean.BeanMappingFeedablePopulator. Example: `feeder.beanMapping.mimeType.excludes=text/xml` indexes all blobs except blobs of MIME type `text/xml`.
`feeder.batch.send-idle-delay`
Value	milliseconds
Default	10000
Description	The CAE Feeder sends a batch which is not full with regard to the `feeder.batch.max-size` and `feeder.batch.max-bytes` properties after the CAE Feeder was idle for the configured time in milliseconds.
`feeder.batch.send-max-delay`
Value	milliseconds
Default	120000
Description	The maximum time in milliseconds after which the CAE Feeder sends a batch which is not full with regard to the `feeder.batch.max-size` and `feeder.batch.max-bytes` properties. The time may be exceeded if an error occurs while contacting the Search Engine or if the CAE Feeder is under high load.
`proactiveengine.log.progress.interval.seconds`
Value	seconds
Default	600
Description	The interval in seconds at which the CAE Feeder logs statistics about its progress, including the number of keys that are currently invalid and still need to be computed.
`proactiveengine.senders.evaluators`
Value	number of threads
Default	50
Description	Number of evaluator threads in the CAE Feeder. The number of threads influences performance not only because evaluations can execute concurrently but also because higher values increase the probability that the CAE Feeder writes the state of multiple evaluations to the database in one database transaction.
`proactiveengine.senders.delay`
Value	milliseconds
Default	0
Description	Minimum delay in milliseconds between notifications of the Feeder by the internal Proactive Engine sub component. Higher values lead to reduced throughput.
`proactiveengine.senders.idledelay`
Value	milliseconds
Default	10000
Description	Delay in milliseconds between notifications of the Feeder by the internal Proactive Engine sub component if the application is idle. Smaller values can be configured to reduce the latency of the CAE Feeder but may lead to increased load on the database.
`dependencyStore.maxTransactionWeight`
Value	maximum number of changed keys per database transaction
Default	2500
Description	The maximum weight of a database transaction to change stored dependencies. The weight is interpreted as the number of changed keys, that is, a transaction with one deleted key has weight 1. Multiple transactions will be used to process an event that causes the invalidation of more keys.

Table 3.48. Configuration of general properties independent from the type of the search engine
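
The general properties above could be combined in a properties file along the following lines. This is only a sketch: host names, ports, database name, user names, passwords, and the content type names `CMArticle` and `CMPicture` are placeholders that do not come from this manual, and the PostgreSQL driver is just one possible choice. The batch values simply repeat the documented defaults.

```properties
# Connection to the CoreMedia Content Server (placeholder URL and credentials)
repository.url=http://content-server.example.com:40180/ior
repository.user=feeder
repository.password=changeit
# Empty domain for a built-in user
repository.domain=

# Database connection (assumes PostgreSQL; adjust driver and URL for other databases)
jdbc.driver=org.postgresql.Driver
jdbc.url=jdbc:postgresql://db.example.com:5432/coremedia
jdbc.user=cm_caefeeder
# Only needed if the login name differs from jdbc.user, e.g. PostgreSQL on Azure
jdbc.login-user-name=cm_caefeeder@example-server
jdbc.password=changeit

# Content selection (content type names are hypothetical)
feeder.contentSelector.basePath=/Sites
feeder.contentSelector.contentTypes=CMArticle,CMPicture
feeder.contentSelector.includeSubTypes=true

# Batching (documented defaults, shown for reference)
feeder.batch.max-size=500
feeder.batch.max-bytes=20971520
feeder.batch.send-idle-delay=10000
feeder.batch.send-max-delay=120000
```

Keep in mind that changing any of the `feeder.contentSelector.*` properties does not re-index content that has already been indexed.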

Properties to configure Apache Tika

You can customize text extraction with Apache Tika using the following properties:

`feeder.tika.append-metadata`
Type	java.lang.String
Default
Description	Comma-separated list of metadata identifiers returned by Apache Tika to append to the extracted body text.
`feeder.tika.config`
Type	org.springframework.core.io.Resource
Default
Description	The location of a custom Tika Config XML, for example to customize the default Tika parsers. See Apache Tika documentation for details on configuring Tika. The value of this property must be a Spring Resource location (e.g. file:/path/to/local/file) or empty for defaults.
`feeder.tika.copy-metadata`
Type	java.lang.String
Default
Description	Comma-separated list of metadata identifiers returned by Apache Tika and names of Feedable elements to copy the metadata to. Entries in the comma separated list have the following format: "metadata identifier"="element name". With Apache Solr, target index fields must be defined as multiValued="true" to avoid indexing errors if there are multiple metadata values with the same identifier.
`feeder.tika.timeout`
Type	java.time.Duration
Default	2m
Description	The maximum time after which text extraction from binary data with Apache Tika fails. If extraction fails, the binary data will be skipped for the index document. Lower values prevent the Feeder from being blocked for a long time during text extraction.
`feeder.tika.warn-time-threshold`
Type	java.time.Duration
Default	15s
Description	The time after which a warning is logged if text extraction from binary data with Apache Tika has not finished yet.
`feeder.tika.zip-bomb-prevention.enabled`
Type	java.lang.Boolean
Default	true
Description	Sets whether Apache Tika's "Zip bomb" prevention is enabled. When a "Zip bomb" is detected, no text will be extracted from the Blob, but a warning will be logged. Note that "Zip bombs" are not restricted to ZIP files but can also occur in PDFs or other formats. Disabling "Zip bomb" prevention bears the risk of OutOfMemoryErrors. Note that false positives are possible.
`feeder.tika.zip-bomb-prevention.maximum-compression-ratio`
Type	java.lang.Long
Default	-1
Description	Sets the ratio between output characters and input bytes for the Apache Tika "Zip bomb" prevention. If this ratio is exceeded (after the output threshold has been reached) then no text will be extracted and a warning will be logged. Set to -1 to use the default of Apache Tika.
`feeder.tika.zip-bomb-prevention.maximum-depth`
Type	java.lang.Integer
Default	-1
Description	Sets the maximum XML element nesting level for the Apache Tika "Zip bomb" prevention. If this depth level is exceeded then no text will be extracted, and a warning will be logged. Set to -1 to use the default of Apache Tika.
`feeder.tika.zip-bomb-prevention.maximum-package-entry-depth`
Type	java.lang.Integer
Default	-1
Description	Sets the maximum package entry nesting level for the Apache Tika "Zip bomb" prevention. If this depth level is exceeded then no text will be extracted, and a warning will be logged. Set to -1 to use the default of Apache Tika.

Table 3.49. Feeder Tika Configuration Properties
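
As an illustration, a Tika customization might look like the following fragment. It is a sketch under assumptions: the file path, the metadata identifiers `dc:title` and `dc:creator`, and the Feedable element name `author` are hypothetical and need to match your own Tika setup and index schema. The duration values use the same notation as the documented defaults.

```properties
# Custom Tika configuration as a Spring Resource location (hypothetical path)
feeder.tika.config=file:/etc/coremedia/tika-config.xml

# Fail text extraction after one minute, warn after ten seconds
feeder.tika.timeout=1m
feeder.tika.warn-time-threshold=10s

# Append the value of a metadata identifier to the extracted body text
feeder.tika.append-metadata=dc:title

# Copy metadata to a Feedable element, format: <metadata identifier>=<element name>
feeder.tika.copy-metadata=dc:creator=author

# Keep "Zip bomb" prevention enabled and use the Apache Tika default limits
feeder.tika.zip-bomb-prevention.enabled=true
feeder.tika.zip-bomb-prevention.maximum-depth=-1
```

With Apache Solr, the index field that receives copied metadata (here the hypothetical `author`) should be defined as multiValued="true", as described for `feeder.tika.copy-metadata`.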

Properties for Solr configuration

The following properties are only used for a CoreMedia Search Engine based on Apache Solr:

`feeder.solr.nested-documents.enabled`
Type	java.lang.Boolean
Default	true
Description	Whether storing nested feedables as nested documents is supported in Solr. This requires that the Solr schema contains a _root_ field. Note that if you add that field to the schema, you have to recreate the index from scratch.
`feeder.solr.nested-documents.skip-index-check`
Type	java.lang.Boolean
Default	false
Description	If feeder.solr.nested-documents.enabled is true, the Feeder checks whether the Solr index schema contains the _root_ field. If feeding of nested documents is attempted but the index does not support it, the Feeder logs a warning and does not use nested documents. You can set this property to true to skip checking the index schema.
`feeder.solr.send-retry-delay`
Type	java.time.Duration
Default	30s
Description	The delay that the Feeder waits before it retries sending data after Solr reported a failure.
`solr.cae.collection`
Type	java.lang.String
Default
Description	The name of the Solr collection for web site search. This property does not have a default. It's typically set to 'preview' or 'live'.
`solr.cae.config-set`
Type	java.lang.String
Default	cae
Description	The name of the Solr config set to use when creating the CAE collection. This property is used by the CAE Feeder.
`solr.cloud`
Type	java.lang.Boolean
Default	false
Description	Whether to connect to SolrCloud. If true, connect to a SolrCloud cluster. SolrCloud connection details must be set either as ZooKeeper addresses (solr.zookeeper.addresses) or, if the former is unset or empty, as HTTP URLs (solr.url). If false, connect to stand-alone Solr nodes via HTTP URLs (solr.url).
`solr.connection-timeout`
Type	java.lang.Integer
Default	0
Description	Connection timeout in milliseconds, or 0 for no timeout, or a negative value to use SolrClient default.
`solr.index-data-directory`
Type	java.lang.String
Default	data
Description	Value for the "dataDir" parameter of the Solr CoreAdmin API / Collection API request to create a Solr index.
`solr.password`
Type	java.lang.String
Default
Description	Password for HTTP basic authentication, used if a non-empty solr.username has been specified. The value may have been encrypted with the tool "cm encryptpasswordproperty".
`solr.socket-timeout`
Type	java.lang.Integer
Default	600000
Description	Socket timeout in milliseconds, or 0 for no timeout, or a negative value to use SolrClient default.
`solr.url`
Type	java.util.List<java.lang.String>
Default	http://localhost:40080/solr
Description	The list of Solr URLs to connect to. These URLs are ignored if connecting to SolrCloud (solr.cloud=true) and non-empty ZooKeeper addresses (solr.zookeeper.addresses) have been set. For a Feeder application that is not connected to a SolrCloud cluster, a single URL to the Solr leader must be configured.
`solr.use-xml-response-writer`
Type	java.lang.Boolean
Default	false
Description	Whether SolrJ should use XML response format instead of Javabin format.
`solr.username`
Type	java.lang.String
Default
Description	Username for HTTP basic authentication, or empty string for no authentication.
`solr.zookeeper.addresses`
Type	java.util.List<java.lang.String>
Default
Description	ZooKeeper addresses for connecting to SolrCloud. Only used if solr.cloud=true.
`solr.zookeeper.chroot`
Type	java.lang.String
Default
Description	Optional ZooKeeper chroot path for Solr. ZooKeeper chroot support makes it possible to isolate the SolrCloud tree in a ZooKeeper instance that is also used by other applications. Only used if solr.cloud=true and solr.zookeeper.addresses is set to a non-empty value.
`solr.zookeeper.client-timeout`
Type	java.lang.Integer
Default	10000
Description	Client timeout for ZooKeeper in milliseconds, or a negative value to use SolrClient default. Only used if solr.cloud=true and solr.zookeeper.addresses is set to a non-empty value.
`solr.zookeeper.connect-timeout`
Type	java.lang.Integer
Default	10000
Description	Connect timeout for ZooKeeper in milliseconds, or a negative value to use SolrClient default. Only used if solr.cloud=true and solr.zookeeper.addresses is set to a non-empty value.

Table 3.50. CAE Feeder Solr Configuration Properties
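
The two connection modes described for `solr.cloud` could be configured roughly as follows. This is a sketch only: all host names, the ZooKeeper port, the chroot path, and the credentials are placeholders, and writing the list-valued properties as comma-separated values is an assumption about how the list type is bound.

```properties
# Variant 1: stand-alone Solr, a single URL pointing to the Solr leader
solr.cloud=false
solr.url=http://solr.example.com:40080/solr
solr.cae.collection=preview
solr.cae.config-set=cae

# Optional HTTP basic authentication (placeholder credentials)
solr.username=feeder
solr.password=changeit

# Variant 2: SolrCloud via ZooKeeper (uncomment and remove variant 1 to use)
#solr.cloud=true
#solr.zookeeper.addresses=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
#solr.zookeeper.chroot=/solr
#solr.zookeeper.connect-timeout=10000
#solr.zookeeper.client-timeout=10000
#solr.cae.collection=live
```

In the SolrCloud variant, `solr.url` is ignored as long as `solr.zookeeper.addresses` is non-empty.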
