The minimum time that must pass after editorial changes have been sent to
the Search Engine before background feeding takes place. This is used to
prioritize feeding of editorial changes over background feeding, for
example to process rights-rule changes or for periodic issue
reindexing. It should not be necessary to change the default setting.
feeder.content.index-deleted
Type
Boolean
Default
true
Description
Whether contents in the trash should be indexed. If you do not need to
find contents in the trash and want to keep your index smaller, you
can change this to false.
feeder.content.index-groups
Type
Boolean
Default
true
Description
Whether the IDs of groups with potential rights to read the content
are indexed in the field "groups". This set of groups is
then used to narrow a user's search to the contents for which the user
might have read rights. This is an optimization to get smaller search
results for some queries and content structures and to get more
accurate search suggestion counts. The client has to check for read
rights anyway. For details, see also the description of the field
"groups" in Solr schema.xml. If set to false, then you must
also configure Studio Server and Content Server to not add a query
condition for the indexed groups. To this end, set the Studio property
"studio.rest.search-service.use-groups-filter-query" and the
Content Server property "solr.use-groups-filter-query" to
"false".
feeder.content.index-name-in-textbody
Type
Boolean
Default
true
Description
Whether the content name should be indexed in field
"textbody". It can make sense to disable this if lots of
content names contain unique identifiers (from third-party systems,
for example) to avoid problems with too many unique terms in field
"textbody".
feeder.content.index-referrers
Type
Boolean
Default
false
Description
Whether a content is reindexed after its referrers have changed.
feeder.content.issues.index
Type
Boolean
Default
true
Description
Whether to index content issues.
feeder.content.issues.initial-feeding
Type
Boolean
Default
false
Description
Whether content issues are already part of the initial feeding of an
empty index. This property does not have any effect if
feeder.content.issues.index is set to false. If true, initial feeding
may take longer. If false, feeding of content issues starts after
initial feeding has been completed.
feeder.content.issues.reindex-after
Type
Duration
Default
1d
Description
The duration after which indexed issues are considered outdated and
become subject to periodic reindexing. This property does not have any
effect if feeder.content.issues.index or
feeder.content.issues.reindex-periodically is set to false.
feeder.content.issues.reindex-periodically
Type
Boolean
Default
true
Description
Whether content issues are reindexed periodically. Note that issue
reindexing is performed with low priority, and will not block feeding
of editorial changes. Issue reindexing will be paused as long as
editorial changes need to be processed. This property does not have
any effect if feeder.content.issues.index is set to false.
feeder.content.issues.reindex-time-max-percentage
Type
Integer
Default
100
Description
The maximum percentage of time used to trigger issue reindexing. If
set to a value below 100, periodic issue reindexing will try to pause
and stay inactive for some time, so that it does not use more than the
configured percentage of a time window, even if issues are older than
configured in feeder.content.issues.reindex-after. This only applies
to issue reindexing and the Content Feeder may still perform other
tasks. The configured value must be in the range of 1 to 100. Note
that issue reindexing is always performed with low priority, and will
be paused as long as editorial changes need to be processed, even if
this property is set to 100. This property does not have any effect if
feeder.content.issues.index or
feeder.content.issues.reindex-periodically is set to false.
feeder.content.issues.reindex-time-window
Type
Duration
Default
10m
Description
The time window used with
feeder.content.issues.reindex-time-max-percentage. Larger values for
the time window lead to fewer but longer pauses. This property does not
have any effect if feeder.content.issues.index or
feeder.content.issues.reindex-periodically is set to false, or if
feeder.content.issues.reindex-time-max-percentage is 100.
feeder.content.management.password
Type
String
Default
feeder
Description
The password to be used in the HTTP authentication of the
administration page of the Content Feeder.
feeder.content.management.user
Type
String
Default
feeder
Description
The user name to be used in the HTTP authentication of the
administration page of the Content Feeder. This is not an account from
the user management of the Content Server.
feeder.content.partial-update-aspects
Type
List<String>
Default
*
Description
Configures the aspects of index documents that can be updated with a
partial update, provided that the connected Indexer supports partial
updates (for example, feeder.solr.partial-updates.enabled=true for
Solr). Multiple values are separated by comma. Use the special value
"*" to use partial updates for all aspects, if possible. An
empty value means that partial updates are not used. See the API
documentation of Feedable.isPartialUpdate, FeedableAspect and
ContentFeedableAspect in package com.coremedia.cap.feeder for more
details.
feeder.content.property-type.blob-max-size
Type
org.springframework.util.unit.DataSize
Default
5MB
Description
Configure the maximum size of indexed blob properties. Larger blob
values will be skipped. This configuration can be overridden for
specific MIME-types by customizing Spring bean
"feederContentBlobMaxSizePerMimeType".
List of MIME-types of "Blob" properties excluded from
indexing. You can exclude a more specific type (e.g. text/xml) while
including the corresponding primary type (e.g. text/*).
feeder.content.type.excludes
Description
List of abstract or concrete content types excluded from feeding. With
the configuration of some type, all of its subtypes are excluded
implicitly, if not configured otherwise. Note that it is an error to
configure the same content type in this property and in
feeder.content.type.includes. Rules for more specific types override
rules for less specific types. Regular expressions are not supported.
feeder.content.type.includes
Type
List<String>
Default
Content_
Description
List of abstract or concrete content types included for feeding. With
the configuration of some type, all of its subtypes are included
implicitly, if not configured otherwise. Note that it is an error to
configure the same content type in this property and in
feeder.content.type.excludes. Rules for more specific types override
rules for less specific types. Regular expressions are not supported.
feeder.content.update-groups-immediately
Type
Boolean
Default
false
Description
If feeder.content.index-groups is true, configures whether the field
"groups" is updated immediately after a change of a folder's
right rule. It is recommended to keep this set to false, and let the
Content Feeder update the index field in the background with lower
priority than updates for editorial changes. It is quite expensive to
set this to true because all contents below the folder would be
reindexed.
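The group-related settings above span three applications. The following fragment is a minimal sketch of disabling group indexing, using only the property names given in the descriptions. The grouping into one listing is for illustration only; each property would go into the configuration of the application it belongs to.

```properties
# Content Feeder: do not index group IDs in the field "groups"
feeder.content.index-groups=false

# Studio Server: do not add a query condition for the indexed groups
studio.rest.search-service.use-groups-filter-query=false

# Content Server: do not add a query condition for the indexed groups
solr.use-groups-filter-query=false
```

With feeder.content.index-groups set to false, feeder.content.update-groups-immediately should have no effect, because its description applies only when groups are indexed.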
The maximum number of results to fetch with a single paginated Solr
query when retrieving content items with outdated issues. If more
results are available, multiple queries with Solr cursor pagination
will be used, and each one will be restricted to this configured
maximum number of results.
feeder.content.issues.solr.filter
Type
String
Default
types:Document_
Description
Solr filter query to restrict the content items for which outdated
issues are reindexed.
feeder.content.issues.solr.query-min-delay
Type
Duration
Default
10s
Description
The minimum time to wait before Solr is queried again for content
items with outdated issues after the last query. This delay is not
used for paginated queries that just retrieve the next page for a
previous query.
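As an illustration of how the issue reindexing properties fit together, the following sketch throttles periodic issue reindexing. The value 25 is an assumption chosen for the example. Read against the descriptions above, it would mean that reindexing tries to stay within roughly 2.5 minutes of activity per 10-minute window. All other values are the documented defaults.

```properties
# Issue indexing and periodic reindexing must both be enabled
feeder.content.issues.index=true
feeder.content.issues.reindex-periodically=true

# Issues older than one day become subject to reindexing (documented default)
feeder.content.issues.reindex-after=1d

# Illustrative value: use at most 25 percent of each time window
feeder.content.issues.reindex-time-max-percentage=25
feeder.content.issues.reindex-time-window=10m

# Documented defaults for the Solr query that finds outdated issues
feeder.content.issues.solr.filter=types:Document_
feeder.content.issues.solr.query-min-delay=10s
```

A larger time window with the same percentage should lead to fewer but longer pauses, as stated in the description of feeder.content.issues.reindex-time-window.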
feeder.solr.nested-documents.enabled
Type
Boolean
Default
true
Description
Whether storing nested feedables as nested documents is supported in
Solr. This requires that the Solr schema contains a _root_ field. Note
that if you add that field to the schema, you have to recreate the
index from scratch.
feeder.solr.nested-documents.skip-index-check
Type
Boolean
Default
false
Description
If feeder.solr.nested-documents.enabled is true, the Feeder checks
whether the Solr index schema contains the _root_ field. The Feeder
will log a warning and not use nested documents if feeding of nested
documents is attempted but the index does not support it. You can set
this property to true to skip checking the index schema.
feeder.solr.partial-updates.enabled
Type
Boolean
Default
true
Description
Whether partial updates are supported for updating content metadata in
Solr. This requires that all fields in the Solr index are configured
as stored="true" or docValues="true" except fields
that are copyField destinations, which must be configured as
stored="false". This is because partial updates are applied
to the index document reconstructed from the existing stored field
values.
feeder.solr.partial-updates.skip-index-check
Type
Boolean
Default
false
Description
If feeder.solr.partial-updates.enabled is true, the Feeder analyzes the
Solr index schema to check whether fields are stored as required for
partial updates. The Feeder will log a warning and not use partial
update functionality if the index does not seem to support it. You can
set this property to true to skip the check.
feeder.solr.send-retry-delay
Type
Duration
Default
30s
Description
The delay to wait before the Feeder retries to send data after
failures from Solr.
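Partial updates are controlled by two properties: feeder.solr.partial-updates.enabled on the Solr side and feeder.content.partial-update-aspects on the content side. The sketch below assumes an index whose schema does not meet the stored/docValues requirements and therefore turns partial updates off. Setting both properties is an illustrative choice, since the descriptions suggest that either one alone would be enough.

```properties
# Illustrative scenario: the Solr schema does not store all fields as required
feeder.solr.partial-updates.enabled=false

# An empty value means that partial updates are not used for any aspect
feeder.content.partial-update-aspects=

# Documented defaults, shown for completeness
feeder.solr.nested-documents.enabled=true
feeder.solr.send-retry-delay=30s
```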
solr.cloud
Type
Boolean
Default
false
Description
Whether to connect to SolrCloud. If true, connect to a SolrCloud
cluster. SolrCloud connection details must be set either as ZooKeeper
addresses (solr.zookeeper.addresses) or, if the former is unset or
empty, as HTTP URLs (solr.url). If false, connect to stand-alone Solr
nodes via HTTP URLs (solr.url).
solr.connection-timeout
Type
Integer
Default
0
Description
Connection timeout in milliseconds, or 0 for no timeout, or a negative
value to use SolrClient default.
solr.content.collection
Type
String
Default
studio
Description
The name of the Solr collection for editorial search.
solr.content.config-set
Type
String
Default
content
Description
The name of the Solr config set to use when creating the collection
for editorial search. This property is used by the Content Feeder.
solr.index-data-directory
Type
String
Default
data
Description
Value for the "dataDir" parameter of the Solr CoreAdmin API
/ Collection API request to create a Solr index.
solr.password
Type
String
Description
Password for HTTP basic authentication, used if a non-empty
solr.username has been specified. The value may have been encrypted
with the tool "cm encryptpasswordproperty".
solr.proxy-host
Type
String
Description
Proxy host for Solr communication that needs to be set if a proxy
should be used.
solr.proxy-is-secure
Type
Boolean
Default
false
Description
Secure flag for Solr proxy.
solr.proxy-is-socks4
Type
Boolean
Default
false
Description
SOCKS 4 flag for Solr proxy.
solr.proxy-port
Type
Integer
Default
0
Description
Proxy port for Solr communication that needs to be set if a proxy
should be used.
solr.socket-timeout
Type
Integer
Default
600000
Description
Socket timeout in milliseconds, or 0 for no timeout, or a negative
value to use SolrClient default.
solr.url
Type
List<String>
Default
http://localhost:40080/solr
Description
The list of Solr URLs to connect to. These URLs are ignored if
connecting to SolrCloud (solr.cloud=true) and non-empty ZooKeeper
addresses (solr.zookeeper.addresses) have been set. For a Feeder
application that is not connected to a SolrCloud cluster, a single URL
to the Solr leader must be configured.
solr.use-http1
Type
Boolean
Default
false
Description
Whether HTTP/1 (true) or HTTP/2 (false) shall be used by Solr clients.
Deprecation
This property has been deprecated and will be removed in a future version.
solr.use-xml-response-writer
Type
Boolean
Default
false
Description
Whether SolrJ should use XML response format instead of Javabin
format.
solr.username
Type
String
Description
Username for HTTP basic authentication, or empty string for no
authentication.
solr.zookeeper.addresses
Type
List<String>
Description
ZooKeeper addresses for connecting to SolrCloud. Only used if
solr.cloud=true.
solr.zookeeper.chroot
Type
String
Description
Optional ZooKeeper chroot path for Solr. ZooKeeper chroot support
makes it possible to isolate the SolrCloud tree in a ZooKeeper
instance. Only used if solr.cloud=true and
solr.zookeeper.addresses is set to a non-empty value.
solr.zookeeper.client-timeout
Type
Integer
Default
10000
Description
Client-timeout for ZooKeeper in milliseconds, or a negative value to
use SolrClient default. Only used if solr.cloud=true and
solr.zookeeper.addresses is set to a non-empty value.
solr.zookeeper.connect-timeout
Type
Integer
Default
10000
Description
Connect-timeout for ZooKeeper in milliseconds, or a negative value to
use SolrClient default. Only used if solr.cloud=true and
solr.zookeeper.addresses is set to a non-empty value.
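The two connection modes described above can be sketched as follows. The first variant uses the documented defaults. In the second variant, the ZooKeeper host names, the host:port notation and the chroot path are hypothetical placeholders, not values from this manual.

```properties
# Variant 1: stand-alone Solr via HTTP (documented defaults)
solr.cloud=false
solr.url=http://localhost:40080/solr
solr.content.collection=studio

# Variant 2: SolrCloud via ZooKeeper (hypothetical hosts and chroot path)
# solr.cloud=true
# solr.zookeeper.addresses=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# solr.zookeeper.chroot=/solr
# solr.zookeeper.connect-timeout=10000
# solr.zookeeper.client-timeout=10000
```

In the second variant, solr.url would be ignored as long as solr.zookeeper.addresses is non-empty.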
The following properties are used to define the login data for the Content Server:
repository.user
Value
user name
Default
feeder
Description
The user account the Content Feeder
uses to read content.
repository.password
Value
password
Default
feeder
Description
The password for the user account of the Content Feeder.
Table 3.43. Properties for login
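A minimal sketch of overriding the login defaults might look like this. The account name and password are hypothetical placeholders, and the account is assumed to exist in the user management of the Content Server.

```properties
# Hypothetical account that the Content Feeder uses to read content
repository.user=feeder-prod
repository.password=change-me
```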
Batch configuration properties for Content Feeder
With these properties you can configure the processing of batches.
feeder.batch.max-bytes
Type
org.springframework.util.unit.DataSize
Default
5MB
Description
The maximum batch size in bytes. The Feeder sends a batch to the
search engine if its maximum size would be exceeded when adding more
entries. Note that the byte computation is only a rough estimate. A
smaller batch may be sent if the maximum number of index documents is
reached before, or if configured delays are reached.
feeder.batch.max-open
Type
Integer
Default
5
Description
The maximum number of batches indexed in parallel. This setting is not
used with the default integration of Apache Solr but only with custom
implementations of the
com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The
Feeder does not call the index method of the AsyncIndexer interface to
index another batch if the maximum number of parallel batches has been
reached. The method will not be called until a callback about the
persistence of one of these batches has been received.
feeder.batch.max-processed
Type
Integer
Default
1
Description
The maximum number of batches processed by the Indexer in parallel.
This setting is not used with the default integration of Apache Solr
but only with custom implementations of the
com.coremedia.cap.feeder.index.async.AsyncIndexer interface. The
Feeder does not call the index method of the AsyncIndexer interface to
index another batch if the configured number of currently processed
batches has been reached. The method will not be called until a
callback about completed processing or persistence of one of these
batches has been received.
feeder.batch.max-size
Type
Integer
Default
500
Description
The maximum number of index documents in a batch. If the maximum
number is reached, the Feeder sends the batch to the search engine. A
smaller batch may be sent if the maximum byte size is reached before,
or if configured delays are reached.
feeder.batch.retry-send-idle-delay
Type
Duration
Default
1m
Description
The time to wait before retrying to send index documents to the search
engine after failures. This delay is used if the feeder is idle.
feeder.batch.retry-send-max-delay
Type
Duration
Default
10m
Description
The maximum time to wait before retrying to send index documents to
the search engine after failures. This delay is used if the feeder is
not idle. The setting is typically larger than retry-send-idle-delay.
feeder.batch.send-idle-delay
Type
Duration
Default
3s
Description
The time between adding an index document to a batch and sending that
batch to the search engine, if the batch is not yet full according to
the max-size and max-bytes configuration properties, and if the feeder
is idle. If a change needs to be sent to the search engine, and no
further changes were made within the specified time, then an index
document for the change will be sent after that time to the search
engine. A small delay ensures low latency for changes to become
visible in the search engine, as long as the system is not too busy.
feeder.batch.send-max-delay
Type
Duration
Default
20s
Description
The maximum time between adding an index document to a batch and
sending that batch to the search engine. This setting is typically
larger than send-idle-delay to allow batches to grow and increase
throughput, for example when large amounts of content are created by
an import process. The configured value may still be exceeded under
high load, or if there are problems connecting to the search engine.
Table 3.44. Feeder Batch Configuration Properties
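To illustrate how the batch limits and delays interact, the following sketch keeps the documented defaults except for feeder.batch.send-max-delay, which is raised to an assumed value of 1m so that batches can grow during a large import. The value is an example, not a recommendation.

```properties
# A batch is sent when either limit would be exceeded (documented defaults)
feeder.batch.max-size=500
feeder.batch.max-bytes=5MB

# Low latency while the feeder is idle (documented default)
feeder.batch.send-idle-delay=3s

# Illustrative: raised from the default of 20s to let batches grow under load
feeder.batch.send-max-delay=1m

# Retry delays after failures (documented defaults)
feeder.batch.retry-send-idle-delay=1m
feeder.batch.retry-send-max-delay=10m
```

A longer maximum delay trades latency for throughput: changes may take longer to become visible in the search engine while the system is busy.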
Properties to configure Apache Tika
You can customize text extraction with Apache Tika using the following properties:
feeder.tika.append-metadata
Type
String
Description
Comma-separated list of metadata identifiers returned by Apache Tika
to append to the extracted body text.
feeder.tika.config
Type
org.springframework.core.io.Resource
Description
The location of a custom Tika Config XML, for example to customize the
default Tika parsers. See Apache Tika documentation for details on
configuring Tika. The value of this property must be a Spring Resource
location (e.g. file:/path/to/local/file) or empty for defaults.
feeder.tika.copy-metadata
Type
String
Description
Comma-separated list of metadata identifiers returned by Apache Tika
and names of Feedable elements to copy the metadata to. Entries in the
comma separated list have the following format: "metadata
identifier"="element name". With Apache Solr, target
index fields must be defined as multiValued="true" to avoid
indexing errors if there are multiple metadata values with the same
identifier.
feeder.tika.timeout
Type
Duration
Default
2m
Description
The maximum time after which text extraction from binary data with
Apache Tika fails. If extraction fails, the binary data will be
skipped for the index document. Lower values prevent the Feeder from
being blocked for a long time in text extraction.
feeder.tika.warn-time-threshold
Type
Duration
Default
15s
Description
The duration of text extraction from binary data with Apache Tika
after which a warning is logged.
feeder.tika.zip-bomb-prevention.enabled
Type
Boolean
Default
true
Description
Sets whether Apache Tika's "Zip bomb" prevention is enabled.
When a "Zip bomb" is detected, no text will be extracted
from the Blob, but a warning will be logged. Note that "Zip
bombs" are not restricted to ZIP files but also apply to PDFs or
other formats. Disabling "Zip bomb" prevention bears the risk
of OutOfMemoryErrors. Note that false positives are possible.
Sets the ratio between output characters and input bytes for the
Apache Tika "Zip bomb" prevention. If this ratio is exceeded
(after the output threshold has been reached) then no text will be
extracted and a warning will be logged. Set to -1 to use the default
of Apache Tika.
feeder.tika.zip-bomb-prevention.maximum-depth
Type
Integer
Default
-1
Description
Sets the maximum XML element nesting level for the Apache Tika
"Zip bomb" prevention. If this depth level is exceeded then
no text will be extracted, and a warning will be logged. Set to -1 to
use the default of Apache Tika.
Sets the maximum package entry nesting level for the Apache Tika
"Zip bomb" prevention. If this depth level is exceeded then
no text will be extracted, and a warning will be logged. Set to -1 to
use the default of Apache Tika.
Table 3.45. Feeder Tika Configuration Properties
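The Tika settings above could be combined as in the following sketch. The shortened timeout is an assumed value for illustration, and the configuration file location is the placeholder path used in the description of feeder.tika.config, not a real file.

```properties
# Illustrative: fail text extraction earlier than the default of 2m
feeder.tika.timeout=1m

# Documented defaults
feeder.tika.warn-time-threshold=15s
feeder.tika.zip-bomb-prevention.enabled=true
feeder.tika.zip-bomb-prevention.maximum-depth=-1

# Placeholder Spring Resource location; leave empty for defaults
feeder.tika.config=file:/path/to/local/file
```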
Feeder Core Properties
You can use the following properties to customize some internal settings of the Content Feeder.
feeder.core.executor-queue-capacity
Type
Integer
Default
100
Description
Maximum capacity of the Feeder's executor queue, which is internally
used to transfer evaluated values.
feeder.core.executor-retry-delay
Type
Duration
Default
1m
Description
The delay to wait before the Feeder retries to access the source data
after failures.