Configuring Document Types
You can restrict the indexed documents by their type in the file feeder.properties. The document types are configured with the following two properties:
    feeder.content.type.includes=Document_
    feeder.content.type.excludes=\
      EditorPreferences,Preferences,Dictionary,Query
Note | |
---|---|
Configuration not mandatory: The default configuration includes all document types except EditorPreferences, Preferences, Dictionary and Query. |
The property feeder.content.type.includes contains a comma-separated list of document types to be included. Conversely, the property feeder.content.type.excludes contains a comma-separated list of document types to be excluded. Specifying a type also includes or excludes all of its subtypes, respectively. It is an error to specify the same document type in both properties. Rules for more specific types override rules for less specific types.
Caution | |
---|---|
Note that the Content Feeder does not update already processed documents after changing the document types to index. A configuration change only affects newly processed documents. If you want to update all documents, restart the Content Feeder with an empty index. |
Configuring Properties for Indexing
You can restrict the indexed properties of a document by their name and type. You can further restrict the indexed XML properties by their grammar and the indexed blob properties by their MIME type and size.
Note | |
---|---|
Configuration not mandatory: The default configuration
includes all String and CoreMedia RichText XML properties. It
also includes blob properties of the MIME types |
You can configure indexed document properties by their name by customizing the Spring beans feederContentPropertyIncludes and feederContentPropertyExcludes in the file applicationContext.xml. The following example configures the Content Feeder to index only the properties 'Author' and 'Text' of document type Article and all properties of document type Picture except the property 'Copyright'.
    <customize:append id="feederContentPropertyIncludesCustomizer"
                      bean="feederContentPropertyIncludes">
      <map>
        <entry key="Article" value="Author,Text"/>
      </map>
    </customize:append>
    <customize:append id="feederContentPropertyExcludesCustomizer"
                      bean="feederContentPropertyExcludes">
      <map>
        <entry key="Picture" value="Copyright"/>
      </map>
    </customize:append>
Note that it is an error to specify both included and excluded properties for the same type. See the description of the beans in file applicationContext.xml for more details.
Note | |
---|---|
The CoreMedia Feeder applications use Apache Tika for text extraction from binary formats. You can find the list of formats supported by Tika at https://tika.apache.org/1.13/formats.html. Note, however, that the Blueprint Feeder applications do not include all transitive Tika libraries, in order to reduce the total number of dependencies and avoid potential version conflicts. Libraries for less common formats such as NetCDF scientific files, Java class files and many more have been excluded. Have a look at the classpath of the Feeder applications and extend it if needed. Common formats such as Microsoft Office or PDF are supported by default. |
You can also change the indexed document properties by their type in the file feeder.properties. The following example shows the default configuration for property types:
    # indexed property types
    feeder.content.propertyType.string=true
    feeder.content.propertyType.integer=false
    feeder.content.propertyType.date=false
    feeder.content.propertyType.linkList=false
    feeder.content.propertyType.struct=false

    # Indexed xml properties, configured by xml grammar
    # comma separated grammar names (as used in the document
    # type definition, attribute Name of element XmlGrammar)
    feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0

    # Indexed blob properties, configured by comma-separated MIME-types
    # If you don't configure any MIME-types in the includes property,
    # no blob properties will be indexed.
    # You can exclude a more specific type (for example, text/xml) while
    # including the corresponding primary type (for example, text/*)
    feeder.content.propertyType.blobMimeType.includes=text/*,\
      application/pdf,application/msword,\
      application/vnd.openxmlformats-officedocument.wordprocessingml.document
    feeder.content.propertyType.blobMimeType.excludes=
    # The maximum size in bytes for included blob properties;
    # larger blobs will be skipped.
    # This configuration can be overridden in a Spring XML configuration
    # file where you can configure the maximum size per MIME-type by
    # customizing the bean 'feederContentBlobMaxSizePerMimeType'.
    # See applicationContext.xml for an example.
    feeder.content.propertyType.blobMaxSize=5242880
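The configuration comments above mention that the maximum blob size can be overridden per MIME type by customizing the bean feederContentBlobMaxSizePerMimeType. A minimal sketch of such a customization might look as follows. It assumes that the bean is a map from MIME type to maximum size in bytes and that it can be extended with customize:append like the other beans in this section. The customizer id and the 10 MB limit for PDF are illustrative values only, so compare with the example in applicationContext.xml before relying on it.

```xml
<!-- Sketch only: assumes feederContentBlobMaxSizePerMimeType is a map
     of MIME type to maximum size in bytes. The customizer id and the
     size value are hypothetical examples. -->
<customize:append id="feederContentBlobMaxSizePerMimeTypeCustomizer"
                  bean="feederContentBlobMaxSizePerMimeType">
  <map>
    <!-- allow PDF blobs up to 10 MB (10 * 1024 * 1024 bytes) -->
    <entry key="application/pdf" value="10485760"/>
  </map>
</customize:append>
```

Under these assumptions, blobs of other MIME types would still be limited by feeder.content.propertyType.blobMaxSize.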
Caution | |
---|---|
Note that the Content Feeder does not update already processed documents after changing the properties. A configuration change only affects newly processed documents. If you want to update all documents, restart the Content Feeder with an empty index. |
Configuring Fields to Index
The Content Feeder can be configured to index document properties into special index fields. You can search for content in these fields if your Search Engine indexes these fields. To this end, the fields must be added to the file schema.xml in the Apache Solr config set for the Content Feeder in directory <solr-home>/configsets/content/conf. Please refer to the Apache Solr documentation for more information.
Note | |
---|---|
Configuration not mandatory: By default, all document properties are
indexed in the index field |
The Content Feeder supports two types of field configuration, the PropertyField and the FeedablePopulator. A PropertyField maps a document property to an index field and specifies whether the property value should also be indexed in the field textbody. The more flexible FeedablePopulator interface allows you to populate a Feedable object from a given document.
If you configure a new field in the Solr schema.xml, you can search for text in that specific field. Note that searching in specific fields is not possible in the Site Manager and CoreMedia Studio but only in custom search applications using CoreMedia APIs or native Search Engine APIs.
The following example adds a field with the name myfield to the Apache Solr schema.xml. Fields must be configured with the attributes stored="true" and indexed="true". For more information, see the Apache Solr documentation.
    <fields>
      ...
      <field name="myfield" type="text_general" stored="true" indexed="true"/>
    </fields>
Configuring PropertyField Beans
Beans of type PropertyField are configured in a customize:append element in file applicationContext.xml. A PropertyField bean requires the attributes name, doctype and property. Attribute name specifies the index field name as configured in the Solr schema.xml. Attribute doctype specifies the name of the document type and attribute property specifies the name of the document property, which is mapped to the index field. Furthermore, it's possible to configure whether the property's value should also be indexed in the field textbody. By default, it will be indexed in textbody but you can disable this by setting the attribute textBody="false". Another optional attribute, ignoreIfEmpty, controls whether a missing or empty property value is ignored rather than indexed. The default value is false, meaning that an empty value is indexed.
Note that excluded document types will not be indexed even if a matching PropertyField is configured. The following example configures indexing of the property headline of document type Article into the index field myfield. It is not indexed in field textbody and empty values are ignored:
    <customize:append id="addFeedableProperties" bean="contentConfiguration"
                      property="propertyFields">
      <list>
        <bean class="com.coremedia.cms.feeder.content.PropertyField">
          <property name="name" value="myfield"/>
          <property name="doctype" value="Article"/>
          <property name="property" value="headline"/>
          <property name="textBody" value="false"/>
          <property name="ignoreIfEmpty" value="true"/>
        </bean>
      </list>
    </customize:append>
Configuring FeedablePopulator Beans
FeedablePopulator Spring beans are configured in the list property feedablePopulators and/or in the list property partialUpdateFeedablePopulators of Spring bean index using a customize:append element, for example in file applicationContext.xml. The following FeedablePopulator classes already exist:
- PropertyPathFeedablePopulator: Index specific values from a struct document property.
- XPathFeedablePopulator: Extracts a text fragment from an XML document property.
- ImageDimensionFeedablePopulator: Set image attributes like image orientation, dimension, and size category.
- ContentStatusFeedablePopulator: Set the document status (approved, deleted, etc).
Your own populator classes just need to implement the FeedablePopulator interface and can then be configured the same way. The method FeedablePopulator#populate will be called with a com.coremedia.cap.content.Content object. This means that the type parameter T of FeedablePopulator implementations must be Content or a supertype of Content.
Populators registered at property feedablePopulators of Spring bean index are called when a document gets added or updated and the whole document data is sent to the search engine. Populators registered at property partialUpdateFeedablePopulators are called for partial updates, when only document metadata is sent to the search engine. You can also register a custom FeedablePopulator at both list properties and use method isPartialUpdate of the passed-in Feedable to detect whether a partial update is being processed. Method getUpdatedAspects of the extended interface Feedable2 returns which aspects of the index document are changed with a partial update.
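Registering one populator at both list properties could be sketched as follows. The class com.example.feeder.MyFeedablePopulator and all bean and customizer ids are hypothetical placeholders for your own implementation; only the bean name index and the two list property names are taken from the description above.

```xml
<!-- Hypothetical custom populator: replace the class with your own
     FeedablePopulator implementation. -->
<bean id="myFeedablePopulator"
      class="com.example.feeder.MyFeedablePopulator"/>

<!-- Called when the whole document data is sent to the search engine. -->
<customize:append id="addMyFeedablePopulator"
                  bean="index" property="feedablePopulators">
  <list>
    <ref bean="myFeedablePopulator"/>
  </list>
</customize:append>

<!-- Called for partial updates, when only metadata is sent. -->
<customize:append id="addMyPartialUpdateFeedablePopulator"
                  bean="index" property="partialUpdateFeedablePopulators">
  <list>
    <ref bean="myFeedablePopulator"/>
  </list>
</customize:append>
```

Because the same bean instance is referenced from both lists, the implementation would need to check isPartialUpdate itself if it should behave differently for partial updates.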
PropertyPathFeedablePopulator
The PropertyPathFeedablePopulator is configured with a dot-separated property path to index a specific property value from a struct document property. The first name in the property path denotes the struct document property itself while the following names specify nested properties of the struct. The constructor argument type selects the type of the documents. The argument element maps to the field name in the index. Furthermore, it's possible to configure whether the value should also be indexed in the field textbody using the property textBody. By default, it will not be indexed in the textbody field but you can enable this by setting the property textBody to true.
The following example configures a populator to feed the index field author from a localSettings.metadata.author struct property path of Article documents.
    <customize:append id="addAuthorFeedablePopulator" bean="index"
                      property="feedablePopulators">
      <list>
        <ref bean="authorFeedablePopulator"/>
      </list>
    </customize:append>
    <!-- the id must match the bean name referenced in the list above -->
    <bean id="authorFeedablePopulator"
          class="com.coremedia.cap.feeder.populate.PropertyPathFeedablePopulator">
      <constructor-arg index="0" name="type" value="Article"/>
      <constructor-arg index="1" name="propertyPath"
                       value="localSettings.metadata.author"/>
      <constructor-arg index="2" name="element" value="author"/>
    </bean>
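To also index the value in the field textbody, the textBody property described above can be set on the populator bean. The following variant of the bean definition is a sketch under the assumption that textBody is exposed as a regular Spring bean property, as the description suggests. It is meant as a replacement for the bean in the previous example, not as an additional bean.

```xml
<!-- Variant of the populator bean from the previous example.
     Assumes textBody can be set as a bean property. -->
<bean id="authorFeedablePopulator"
      class="com.coremedia.cap.feeder.populate.PropertyPathFeedablePopulator">
  <constructor-arg index="0" name="type" value="Article"/>
  <constructor-arg index="1" name="propertyPath"
                   value="localSettings.metadata.author"/>
  <constructor-arg index="2" name="element" value="author"/>
  <!-- additionally index the author value in the field textbody -->
  <property name="textBody" value="true"/>
</bean>
```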
XPathFeedablePopulator
An XPathFeedablePopulator extracts the text of a fragment from an XML property. The fragment is specified with an XPath expression in the property XPath. The required property element maps to the field name in the index. The property contentType selects the type of the document and the property named property selects the document property. Furthermore, it's possible to configure whether the property's value should also be indexed in the field textbody. By default, it will be indexed in textbody but you can disable this by setting the property textBody to false. The namespaces property defines namespaces which can be used in the XPath expression.
The following example configures a populator to feed the index field tabletext from Text properties in Article documents.
    <customize:append id="addFeedablePopulators" bean="index"
                      property="feedablePopulators">
      <list>
        <bean class="com.coremedia.cap.feeder.populate.XPathFeedablePopulator">
          <property name="element" value="tabletext"/>
          <property name="contentType" value="Article"/>
          <property name="property" value="Text"/>
          <property name="textBody" value="false"/>
          <property name="XPath" value="/r:div/r:table"/>
          <property name="namespaces">
            <map>
              <entry key="r" value="http://www.coremedia.com/2003/richtext-1.0"/>
            </map>
          </property>
        </bean>
      </list>
    </customize:append>
ImageDimensionFeedablePopulator
The ImageDimensionFeedablePopulator is used to detect the orientation (portrait, square, landscape), dimension (width, height) and size category (small, medium, large) of an image. After detection the following index fields are set:
- imageOrientation: portrait (value=0), square (value=1) and landscape (value=2) mode.
- imageSizeCategory: small (value=0), medium (value=1) and large (value=2) mode.
- imageWidth: image width in pixels.
- imageHeight: image height in pixels.
- imageMaxLength: maximum of imageWidth and imageHeight
An image has portrait mode if its height is larger than its width, and landscape mode if its width is larger than its height. If width and height are equal, it has square mode. An image is categorized as large if its width is larger than or equal to the configured largeWidth property and its height is also larger than or equal to the configured largeHeight property. The same rule applies to the medium category with the mediumWidth and mediumHeight properties. The image is small if its width is smaller than mediumWidth or its height is smaller than mediumHeight.
To categorize image orientation (portrait, square, landscape) and image size (small, medium, large), some filter properties must be configured:
- docType: the document type of the content to be indexed, including subtypes
- widthPropertyName: the property name of the content which holds the width value
- heightPropertyName: the property name of the content which holds the height value
- dataPropertyName: the property name of the content which holds the image data. The value of this object must be of type com.coremedia.cap.common.Blob.
You must set either widthPropertyName and heightPropertyName or dataPropertyName or both. If the two dimension properties do not exist, the blob data is read to determine the dimension.
- largeWidth: lower bound width of large images
- largeHeight: lower bound height of large images
- mediumWidth: lower bound width of medium images
- mediumHeight: lower bound height of medium images
The following example shows an ImageDimensionFeedablePopulator configuration.
    <customize:append id="addFeedablePopulators" bean="index"
                      property="feedablePopulators">
      <list>
        <bean class="com.coremedia.cap.feeder.populate.ImageDimensionFeedablePopulator">
          <property name="largeWidth"
                    value="${feeder.populator.imageDimension.largeWidth}"/>
          <property name="largeHeight"
                    value="${feeder.populator.imageDimension.largeHeight}"/>
          <property name="mediumWidth"
                    value="${feeder.populator.imageDimension.mediumWidth}"/>
          <property name="mediumHeight"
                    value="${feeder.populator.imageDimension.mediumHeight}"/>
          <property name="docType"
                    value="${feeder.populator.imageDimension.docType}"/>
          <property name="widthPropertyName"
                    value="${feeder.populator.imageDimension.widthPropertyName}"/>
          <property name="heightPropertyName"
                    value="${feeder.populator.imageDimension.heightPropertyName}"/>
          <property name="dataPropertyName"
                    value="${feeder.populator.imageDimension.dataPropertyName}"/>
        </bean>
      </list>
    </customize:append>
The property values of the populator bean are filtered from a property file.
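To make the effect of the thresholds more tangible, the following sketch configures the same bean with literal values instead of placeholders. All values are made up for illustration: the thresholds, the document type Picture and the property names width, height and data are assumptions and not defaults of the product.

```xml
<!-- Sketch with hypothetical literal values instead of placeholders. -->
<customize:append id="addImageDimensionFeedablePopulator" bean="index"
                  property="feedablePopulators">
  <list>
    <bean class="com.coremedia.cap.feeder.populate.ImageDimensionFeedablePopulator">
      <!-- large: width >= 1024 and height >= 768 -->
      <property name="largeWidth" value="1024"/>
      <property name="largeHeight" value="768"/>
      <!-- medium: width >= 400 and height >= 300 -->
      <property name="mediumWidth" value="400"/>
      <property name="mediumHeight" value="300"/>
      <!-- assumed document type and property names -->
      <property name="docType" value="Picture"/>
      <property name="widthPropertyName" value="width"/>
      <property name="heightPropertyName" value="height"/>
      <property name="dataPropertyName" value="data"/>
    </bean>
  </list>
</customize:append>
```

With these example thresholds and the rules described above, a 1200x500 pixel image would be landscape and medium (its height is below 768 but at least 300), while a 300x900 pixel image would be portrait and small because its width is below 400.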
ContentStatusFeedablePopulator
The ContentStatusFeedablePopulator classifies a document in one of four status categories:
- 0: in production (not approved and not deleted)
- 1: approved (place and content)
- 2: published (place and content)
- 3: deleted
After classification, the status value of the document is stored in the index field status. The following example shows a ContentStatusFeedablePopulator configuration:
    <customize:append id="addFeedablePopulators" bean="index"
                      property="feedablePopulators">
      <list>
        <bean class="com.coremedia.cap.feeder.populate.ContentStatusFeedablePopulator"/>
      </list>
    </customize:append>
Caution | |
---|---|
Note that the Content Feeder does not update already processed documents after changing the fields to index. A configuration change only affects newly processed documents. If you want to update all documents, restart the Content Feeder with an empty index. |