The Content Feeder is configured in the files WEB-INF/application.properties and WEB-INF/application.xml.
Solr specific configuration properties
These properties are configured in the file application.properties of the Content Feeder.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.solr.url | URL | http://localhost:8082/solr/coremedia | The URL where the Content Feeder can reach the Search Engine. The URL points to the Apache Solr core for the Content Feeder. |
feeder.solr.username | user name or empty | (empty) | User name for HTTP Basic authentication when connecting to the Apache Solr web application. Leave empty for no authentication. |
feeder.solr.password | password or empty | (empty) | Password for HTTP Basic authentication when connecting to the Apache Solr web application. |
feeder.solr.collection | String | coremedia | The collection that should be used by the Content Feeder. |
feeder.solr.sendRetryDelay | time in seconds | 10 | Delay in seconds between attempts to send a batch. |
solr.partialUpdates | true or false | true | Specifies whether partial updates are supported for updating document metadata in Solr. This requires that all fields in the Solr index are configured as stored="true", except fields that are <copyField/> destinations, which must be configured as stored="false". This is because partial updates are applied to the index document reconstructed from the existing stored field values. Note that the configuration property feeder.partialUpdate.aspects may still restrict the usage of partial updates to certain document aspects. |
solr.partialUpdatesSkipIndexCheck | true or false | false | If solr.partialUpdates is true, the Solr index schema is analyzed to check whether fields are stored as required for partial updates. The Feeder logs a warning and does not use partial updates if the index does not seem to support them. Set this property to true to skip the check. |
Table 6.1. Solr specific properties
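As a minimal sketch, the Solr connection settings from Table 6.1 could look as follows in application.properties. The host name and the credentials are placeholders chosen for illustration and do not come from this manual.

```properties
# Hypothetical Solr host; the default is http://localhost:8082/solr/coremedia
feeder.solr.url=http://solr.example.com:8983/solr/coremedia

# HTTP Basic authentication with placeholder credentials.
# Leave the user name empty to connect without authentication.
feeder.solr.username=feeder-solr
feeder.solr.password=changeme

# Documented defaults, repeated here for clarity:
# partial updates enabled, schema check not skipped.
solr.partialUpdates=true
solr.partialUpdatesSkipIndexCheck=false
```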
General Feeder configuration properties
These properties are configured in the file application.properties of the Content Feeder.
Login data
The following properties are used to define the login data for the Content Server and the administration page of the Content Feeder.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.management.user | user name | feeder | The user name to be used in the HTTP authentication of the administration page of the Content Feeder. This is not an account from the user management of the Content Server. |
feeder.management.password | password | feeder | The password to be used in the HTTP authentication of the administration page of the Content Feeder. |
repository.user | user name | feeder | The user account the Content Feeder uses to read content. |
repository.password | password | feeder | The password for the user account of the Content Feeder. |
Table 6.2. Properties for login
Partial update configuration
With this property you can configure the usage of partial updates, if supported by the connected Indexer, for example for Solr as configured with the property solr.partialUpdates.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.partialUpdate.aspects | comma-separated list of document aspects or * | multiSite | The aspects of index documents that can be updated with a partial update, provided that the connected Indexer supports partial updates (for example, solr.partialUpdates=true for Solr). Multiple values are separated by commas. Use the special value * to use partial updates for all aspects, if possible. An empty value means that partial updates are not used. See the API documentation of Feedable.isPartialUpdate, FeedableAspect and ContentFeedableAspect in package com.coremedia.cap.feeder for more details. |
Table 6.3. Partial update configuration
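The special values described in Table 6.3 might be used as in the following sketch. Only the values * and empty are taken from the table; which one fits depends on the index setup.

```properties
# Use partial updates for all aspects, where the Indexer supports them
feeder.partialUpdate.aspects=*

# Alternative: an empty value disables partial updates entirely
# feeder.partialUpdate.aspects=
```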
Batch configuration
With these properties you can configure the processing of batches.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.maxBatchSize | number of documents | 500 | The maximum number of documents in a batch. |
feeder.maxBatchByteSize | number of bytes | 5242880 (5 MB) | The maximum batch size in bytes. |
feeder.sendIdleDelay | time in seconds | 3 | The time to wait between adding a document to a batch and sending that batch to the search engine if the Content Feeder is idle. If a document was changed and no further changes are made within sendIdleDelay seconds, the document will be sent to the search engine after that time. This setting leads to a low latency for changes to become visible in search as long as the system is not very busy. |
feeder.sendMaxDelay | time in seconds | 20 | The maximum time to wait between adding a document to a batch and sending that batch. This setting is typically larger than sendIdleDelay to allow batches to grow for better throughput. |
feeder.maxOpenBatches | int | 5 | The maximum number of batches indexed in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The Content Feeder does not call the index method of the AsyncIndexer interface to index another batch if the maximum number of parallel batches has been reached. The method will not be called until a callback about the persistence of one of these batches has been received. |
feeder.maxProcessedBatches | int | 1 | The maximum number of batches processed by the Indexer in parallel. This setting is not used with the default integration of Apache Solr but only with custom implementations of the com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The Content Feeder does not call the index method of the AsyncIndexer interface to index another batch if the configured number of currently processed batches has been reached. The method will not be called until a callback about completed processing or persistence of one of these batches has been received. |
Table 6.4. Properties for batch configuration
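The following sketch shows how the batch properties from Table 6.4 could be combined to favor throughput over latency. The numbers are illustrative assumptions, not recommendations from this manual.

```properties
# Allow larger batches than the defaults (500 documents, 5 MB)
feeder.maxBatchSize=1000
# 10 * 1024 * 1024 bytes
feeder.maxBatchByteSize=10485760

# Keep latency low while the Feeder is idle (default)
feeder.sendIdleDelay=3
# Let batches grow for up to 60 seconds under load (default is 20)
feeder.sendMaxDelay=60
```

With such settings, a single change on an otherwise idle system would still be sent after about three seconds, while batches may accumulate for up to a minute during bulk changes.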
What to feed
You can use the following properties to define which elements the Content Feeder should feed to the Search Engine.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.indexDeleted | true or false | true | true if documents in the trash should be indexed. If you do not need to find documents in the trash and want to keep your index smaller, you can change this to false. |
feeder.indexPath | true or false | false | Indicates whether a document's folder path is indexed in field 'folder'. If set to true (not recommended), folder renames lead to refeeding of all documents below that folder. The alternative field 'folderpath', which contains the folder path as folder ids, is the recommended way to refer to a folder path. |
feeder.indexReferrers | true or false | false | true to reindex a document after its referrers have changed. |
feeder.indexNameInTextBody | true or false | true | Configures whether the document name should be indexed in index field textbody. It can make sense to disable this if lots of document names contain unique identifiers (from third-party systems, for example) to avoid problems with too many unique terms in field textbody. |
feeder.indexGroups | true or false | true | If set to |
feeder.updateGroups.immediately | true or false | false | If feeder.indexGroups is true, configures whether the field groups is updated immediately after a change of a folder's right rule. It is recommended to keep this set to false and let the Content Feeder update the index field groups in the background with lower priority than updates for editorial changes. It is quite expensive to set this to true because all documents below the folder will be reindexed. |
Table 6.5. Properties to feed additional items
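A sketch of a configuration that aims at a smaller index, using only properties from Table 6.5. Whether these choices are appropriate depends on the search features an installation needs.

```properties
# Do not index documents in the trash
feeder.indexDeleted=false

# Keep the recommended default: do not index the folder path in field 'folder'
feeder.indexPath=false

# Skip document names in textbody, for example when names
# mostly carry unique identifiers from third-party systems
feeder.indexNameInTextBody=false
```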
Document types to feed
You can restrict the indexed documents by their type using the includes and excludes properties.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.content.type.includes | document type name | Document_ | The name of the abstract or concrete document type whose documents should be indexed. Regular expressions are not allowed. |
feeder.content.type.excludes | document type name | Preferences, EditorPreferences, Dictionary, Query | The name of the abstract or concrete document type whose documents should not be indexed. Regular expressions are not allowed. |
Table 6.6. Properties to specify document types.
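The following sketch extends the default excludes from Table 6.6 by one more type. It assumes, as the default value suggests, that several type names are separated by commas. MyInternalType is a hypothetical document type name used only for illustration.

```properties
# Index all documents below the root type (default)
feeder.content.type.includes=Document_

# Default excludes plus one hypothetical project-specific type
feeder.content.type.excludes=Preferences,EditorPreferences,Dictionary,Query,MyInternalType
```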
Properties to feed
The default configuration feeds all properties for all specified document types. For configuration of indexed properties by their name, see the section for XML configuration below.
Property types to feed
A document property can only be indexed if its property type is enabled by the following properties.
Property | Value | Default | Description |
---|---|---|---|
feeder.content.propertyType.string | true or false | true | Set this property to false in order to exclude String properties from indexing. |
feeder.content.propertyType.integer | true or false | false | Set this property to true in order to include Integer properties when indexing. |
feeder.content.propertyType.date | true or false | false | Set this property to true in order to include Date properties when indexing. |
feeder.content.propertyType.linkList | true or false | false | Set this property to true in order to include LinkList properties when indexing. |
feeder.content.propertyType.struct | true or false | false | Set this property to true in order to include Struct properties when indexing. |
feeder.content.propertyType.xmlGrammars | List of included grammar names separated by comma | coremedia-richtext-1.0 | You can define which XML properties should be indexed by specifying their grammar. Example |
feeder.content.propertyType.blobMimeType.includes | List of included MIME types separated by comma | See file | You can define which blob properties are indexed, depending on the MIME type. Example All blobs of MIME type |
feeder.content.propertyType.blobMimeType.excludes | List of excluded MIME types separated by comma | (empty) | Exclude some blobs from indexing depending on the MIME type. If you've included a primary MIME type such as Example Blobs of MIME type |
feeder.content.propertyType.blobMaxSize | size in bytes | 5242880 (5 MB) | Configure the maximum size of indexed blob properties. Larger values will be skipped. This configuration can be overridden in a Spring XML configuration file where you can configure the maximum size per MIME type by customizing the bean |
Table 6.7. Include property types
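As an illustration of Table 6.7, the following sketch enables two additional property types and narrows blob indexing. The MIME types and the size limit are assumptions chosen for the example, not defaults from this manual.

```properties
# String properties are indexed by default; Date and LinkList are opt-in
feeder.content.propertyType.string=true
feeder.content.propertyType.date=true
feeder.content.propertyType.linkList=true

# Index only rich text XML properties (default grammar)
feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0

# Restrict blob indexing to two example MIME types
feeder.content.propertyType.blobMimeType.includes=application/pdf,text/plain

# Skip blobs larger than 2 * 1024 * 1024 bytes (default is 5242880)
feeder.content.propertyType.blobMaxSize=2097152
```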
Tika configuration
You can customize text extraction with Apache Tika using the following properties:
Property | Value | Default | Description |
---|---|---|---|
feeder.tika.config | location of Apache Tika Config XML | (empty) | The location of an optional custom Apache Tika Config XML file with custom Tika parsers. The value is a Spring Resource location, for example a value such as |
feeder.tika.appendMetadata | comma-separated list of metadata identifiers | (empty) | Comma-separated list of metadata identifiers extracted from blob properties by Apache Tika that are appended to the extracted body text. See Section 4.2.3, “Advanced Configuration” |
feeder.tika.copyMetadata | comma-separated list of entries for the format <metadata identifier>=<index field name> | (empty) | Comma-separated list of metadata identifiers extracted from blob properties by Apache Tika and index field names to copy the metadata to. See Section 4.2.3, “Advanced Configuration” |
feeder.tika.timeout.milliseconds | milliseconds | 120000 (2 minutes) | Set the maximum time after which text extraction from binary data with Apache Tika fails. If extraction fails, the binary data will be skipped for the index document. Lower values prevent the Feeder from being blocked for a long time by text extraction. |
feeder.tika.warn.milliseconds | milliseconds | 15000 (15 seconds) | Set the time after which a warning is logged when text extraction from binary data with Apache Tika takes a long time. |
Table 6.8. Tika configuration
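The Tika properties from Table 6.8 could be set as in the following sketch. The timeout values are illustrative. The metadata identifier dc:title and the index field name tikatitle are assumptions that only demonstrate the <metadata identifier>=<index field name> format. Such a field would also have to exist in the Solr schema.

```properties
# Fail text extraction after 1 minute instead of the default 2 minutes
feeder.tika.timeout.milliseconds=60000
# Log a warning when extraction takes longer than 10 seconds
feeder.tika.warn.milliseconds=10000

# Hypothetical mapping: copy the Tika metadata value "dc:title"
# to an index field named "tikatitle"
feeder.tika.copyMetadata=dc:title=tikatitle
```

In a Java properties file, the key ends at the first unescaped = or : character, so the second = and the colon in the value above are read as part of the value.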
Configuration of ImageDimensionFeedablePopulator
The following properties configure the ImageDimensionFeedablePopulator bean.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.populator.imageDimension.docType | document type name | none (required) | The document type of the content to be indexed, including subtypes. |
feeder.populator.imageDimension.widthPropertyName | document property name | none | The property name of the content which holds the width value. If not set, must be set. |
feeder.populator.imageDimension.heightPropertyName | document property name | none | The property name of the content which holds the height value. If not set, must be set. |
feeder.populator.imageDimension.dataPropertyName | document property name | none | The name of the blob property which holds the image data. The value of this object must be of type and |
feeder.populator.imageDimension.largeWidth | positive number | none (required) | Lower bound width of large images. |
feeder.populator.imageDimension.largeHeight | positive number | none (required) | Lower bound height of large images. |
feeder.populator.imageDimension.mediumWidth | positive number | none (required) | Lower bound width of medium images. |
feeder.populator.imageDimension.mediumHeight | positive number | none (required) | Lower bound height of medium images. |
Table 6.9. Properties to configure ImageDimensionFeedablePopulator.
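A sketch of an ImageDimensionFeedablePopulator configuration based on Table 6.9. The document type CMPicture, the property names width and height, and all pixel thresholds are assumptions for illustration. Adapt them to the actual document type model.

```properties
# Hypothetical picture document type and its dimension properties
feeder.populator.imageDimension.docType=CMPicture
feeder.populator.imageDimension.widthPropertyName=width
feeder.populator.imageDimension.heightPropertyName=height

# Illustrative lower bounds in pixels for the "large" and "medium" classes
feeder.populator.imageDimension.largeWidth=1024
feeder.populator.imageDimension.largeHeight=768
feeder.populator.imageDimension.mediumWidth=400
feeder.populator.imageDimension.mediumHeight=300
```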
Error behavior
You can use the following properties to customize the Content Feeder behavior in case of errors.
Attribute | Value | Default | Description |
---|---|---|---|
feeder.retrySendIdleDelay | time in seconds | 60 | The time to wait before retrying to send documents to the search engine after failures to do so. This delay is used if the Content Feeder is idle. |
feeder.retrySendMaxDelay | time in seconds | 600 | The maximum time to wait before retrying to send documents to the search engine after failures. |
feeder.retryConnectToIndexDelay.seconds | time in seconds | 10 | The time to wait between retries to connect to the search engine on startup. |
feeder.executorRetryDelay | time in milliseconds | 60000 | The delay to wait before the Content Feeder retries to access the source data after failures. |
feeder.solr.connection.timeout | time in milliseconds | 0 | The connection timeout set on the SolrJ SolrServer. It determines how long the client waits to establish a connection without any response from the server. The default value of 0 means it will wait forever. |
feeder.solr.socket.timeout | time in milliseconds | 600000 (10 minutes) | The socket timeout set on the SolrJ SolrServer. It determines how long the client waits for a response from the server after the connection was established and the request was already sent. The value of 0 means it will wait forever. |
Table 6.10. Properties for Content Feeder configuration
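The error handling properties from Table 6.10 might be tuned as in this sketch, which retries sooner than the defaults and bounds the connection timeout. All values are illustrative assumptions. Note the mixed units: the retry delays are given in seconds, the timeouts in milliseconds.

```properties
# Retry failed sends after 30 seconds when idle, after at most 5 minutes otherwise
feeder.retrySendIdleDelay=30
feeder.retrySendMaxDelay=300

# Give up connecting to Solr after 10 seconds instead of waiting forever (0)
feeder.solr.connection.timeout=10000
# Keep the default socket timeout of 10 minutes
feeder.solr.socket.timeout=600000
```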
Configure Statistics
You can configure time intervals to show statistics on the Content Feeder admin page and in the content server log.
Attribute | Value | Default | Description |
---|---|---|---|
statisticInterval | time in milliseconds | 3600000 | Maximum time interval to show statistics on the administration page. With the default you can show overall statistics (since starting the Content Feeder) and statistics for the last n seconds, where n <= statisticInterval. |
statisticLogInterval | time in milliseconds | 600000 | Interval to log statistic information of the Content Feeder in the log file of the CoreMedia Content Server (coremedia.log). |
Table 6.11. Attributes for statistics time intervals
XML configuration
The Spring XML configuration file application.xml allows more advanced configuration and customization. This section only describes how to configure indexed document properties by name.
Properties to feed
If you want to restrict the document fields, you can specify a map entry with included or excluded fields for some
or all document types. A map entry for a super type is valid for all subtypes, if not overridden with an entry for
a subtype. If no entry is specified for a document type or its ancestors, all document properties are included.
The wildcard * stands for all properties and can be used to include or exclude all properties of a type. Note, however, that for a given type you can configure either a list of included properties or a list of excluded properties, but not both, and property lists from different entries are not merged.
Configure included properties
The following example configures a map from document type names (abstract or concrete) to indexed properties. The values of the map are comma-separated property names of the respective document type. Only the listed properties will be indexed. Document types not listed here will by default be indexed with all properties if not configured otherwise via excluded properties.
    <customize:append id="feederContentPropertyIncludesCustomizer" bean="feederContentPropertyIncludes">
      <map>
        <entry key="doctype1" value="prop1,prop2"/>
        <entry key="doctype2" value="prop3"/>
      </map>
    </customize:append>
Configure excluded properties
The following example configures a map from document type names (abstract or concrete) to properties excluded from indexing. The values of the map are comma-separated property names of the respective document type. Only the properties not listed here will be indexed. Document types not listed here will by default be indexed with all properties if not configured otherwise via included properties.
    <customize:append id="feederContentPropertyExcludesCustomizer" bean="feederContentPropertyExcludes">
      <map>
        <entry key="doctype4" value="prop4,prop5"/>
        <!-- exclude all properties of doctype5; only metadata gets indexed -->
        <entry key="doctype5" value="*"/>
      </map>
    </customize:append>