
4.2.2. Content Configuration

Configuring Document Types

You can restrict the indexed documents by their type in the file feeder.properties. The document types are configured with the following two properties:

feeder.content.type.includes=Document_
feeder.content.type.excludes=\
  EditorPreferences,Preferences,Dictionary,Query

	Note
	Configuration not mandatory: The default configuration includes all document types except EditorPreferences, Preferences, Dictionary and Query.

The property feeder.content.type.includes contains a comma-separated list of document types to be included. Conversely, the property feeder.content.type.excludes contains a comma-separated list of document types to be excluded. Specifying a type also includes or excludes all of its subtypes. It is an error to specify the same document type in both properties. Rules for more specific types override rules for less specific types. For example, if Article is excluded and a hypothetical subtype NewsArticle is included, NewsArticle documents are indexed while all other Article documents are not.

	Caution
	Note that the Content Feeder does not update already processed documents after you change the document types to index. A configuration change only affects newly processed documents. If you want to update all documents, restart the Content Feeder with an empty index.

Configuring Properties for Indexing

You can restrict the indexed properties of a document by their name and type. You can further restrict the indexed XML properties by their grammar and the indexed blob properties by their MIME type and size.

	Note
	Configuration not mandatory: The default configuration includes all String and CoreMedia RichText XML properties. It also includes blob properties of the MIME types `text/*`, `application/pdf`, `application/msword` and `application/vnd.openxmlformats-officedocument.wordprocessingml.document` (`docx` files) that are not larger than 5 MB.

You can configure indexed document properties by their name by customizing the Spring beans feederContentPropertyIncludes and feederContentPropertyExcludes in the file applicationContext.xml. The following example configures the Content Feeder to index only the properties 'Author' and 'Text' of document type Article and all properties of document type Picture except the property 'Copyright'.

<customize:append id="feederContentPropertyIncludesCustomizer" bean="feederContentPropertyIncludes">
  <map>
    <entry key="Article" value="Author,Text"/>
  </map>
</customize:append>

<customize:append id="feederContentPropertyExcludesCustomizer" bean="feederContentPropertyExcludes">
  <map>
    <entry key="Picture" value="Copyright"/>
  </map>
</customize:append>

Note that it is an error to specify both included and excluded properties for the same type.

See the description of the beans in file applicationContext.xml for more details.

	Note
	The CoreMedia Feeder applications use Apache Tika for text extraction from binary formats. You can find the list of formats supported by Tika at https://tika.apache.org/1.13/formats.html. Note, however, that the Blueprint Feeder applications do not include all transitive Tika libraries, in order to reduce the total number of dependencies and avoid potential version conflicts. Libraries for less common formats such as NetCDF scientific files, Java class files and many more have been excluded. Have a look at the classpath of the Feeder applications and extend it if needed. Libraries for common formats such as Microsoft Office or PDF are included by default.

You can also change the indexed document properties by their type in the file feeder.properties. The following example shows the default configuration for property types:

# indexed property types
feeder.content.propertyType.string=true
feeder.content.propertyType.integer=false
feeder.content.propertyType.date=false
feeder.content.propertyType.linkList=false
feeder.content.propertyType.struct=false

# Indexed xml properties, configured by xml grammar
# comma separated grammar names (as used in the document 
# type definition, attribute Name of element XmlGrammar)
feeder.content.propertyType.xmlGrammars=coremedia-richtext-1.0

# Indexed blob properties, configured by comma-separated MIME-types
# If you don't configure any MIME-types in the includes property,
# no blob properties will be indexed.
# You can exclude a more specific type (for example, text/xml) while
# including the corresponding primary type (for example, text/*)
feeder.content.propertyType.blobMimeType.includes=text/*,\
  application/pdf,application/msword,\
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
feeder.content.propertyType.blobMimeType.excludes=

# The maximum size in bytes for included blob properties;
# larger blobs will be skipped.
# This configuration can be overridden in a Spring XML configuration
# file where you can configure the maximum size per MIME-type by
# customizing the bean 'feederContentBlobMaxSizePerMimeType'.
# See applicationContext.xml for an example.
feeder.content.propertyType.blobMaxSize=5242880

	Caution
	Note that the Content Feeder does not update already processed documents after you change the properties to index. A configuration change only affects newly processed documents. If you want to update all documents, restart the Content Feeder with an empty index.

Configuring Index Fields

The Content Feeder can be configured to index document properties into special index fields. You can search for content in these fields if your Search Engine indexes these fields. To this end, the fields must be added to the file schema.xml in the Apache Solr config set for the Content Feeder in directory <solr-home>/configsets/content/conf. Please refer to the Apache Solr documentation for more information.

	Note
	Configuration not mandatory: By default, all document properties are indexed in the index field `textbody`. They are also indexed in fields whose name starts with `cm` and ends with the lowercase name of the property - if such fields exist in the index. For example, a property `Headline` is indexed in the field `cmheadline`. This configuration allows you to use different index field names.

The Content Feeder supports two types of field configuration, the PropertyField and the FeedablePopulator. A PropertyField maps a document property to an index field and specifies whether the property value should also be indexed in the field textbody. The more flexible FeedablePopulator interface allows you to populate a Feedable object from a given document.

If you configure a new field in the Solr schema.xml, you can search for text in that specific field. Note that searching in specific fields is not possible in the Site Manager and CoreMedia Studio, but only in custom search applications that use CoreMedia APIs or native Search Engine APIs.

The following example adds a field with the name myfield to the Apache Solr schema.xml. Fields must be configured with the attributes stored="true" and indexed="true". For more information, see the Apache Solr documentation.

<fields>
  ...
  <field name="myfield" type="text_general"
                        stored="true" indexed="true"/>
</fields>
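
As a hedged illustration of the default naming rule described in the note above, the following sketch adds a field named cmheadline to schema.xml, so that a document property Headline would be indexed there without any further field configuration. The field type text_general is simply copied from the previous example and is an assumption; use a type that is actually defined in your schema.

```xml
<fields>
  ...
  <!-- Assumed example: by default, this field receives the value of
       the document property "Headline" ("cm" + lowercase name) -->
  <field name="cmheadline" type="text_general"
         stored="true" indexed="true"/>
</fields>
```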

Configuring PropertyField Beans

Beans of type PropertyField are configured in a customize:append element in file applicationContext.xml. A PropertyField bean requires the attributes name, doctype and property. Attribute name specifies the index field name as configured in the Solr schema.xml. Attribute doctype specifies the name of the document type and attribute property specifies the name of the document property, which is mapped to the index field. Furthermore, it's possible to configure whether the property's value should also be indexed in the field textbody. By default, it will be indexed in textbody, but you can disable this by setting the attribute textBody="false". Another optional attribute, ignoreIfEmpty, configures whether a missing or empty property value is skipped instead of being indexed. The default value is false, meaning that an empty value is indexed.

Note that excluded document types will not be indexed even if a matching PropertyField is configured. The following example configures indexing of the property headline of document type Article into the index field myfield. It is not indexed in field textbody and empty values are ignored:

<customize:append id="addFeedableProperties"
                  bean="contentConfiguration" property="propertyFields">
  <list>
    <bean class="com.coremedia.cms.feeder.content.PropertyField">
      <property name="name" value="myfield"/>
      <property name="doctype" value="Article"/>
      <property name="property" value="headline"/>
      <property name="textBody" value="false"/>
      <property name="ignoreIfEmpty" value="true"/>
    </bean>
  </list>
</customize:append>
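
For comparison, the following minimal sketch sets only the three required attributes. The index field name mytitle and the document property title are hypothetical and only serve as an illustration. Because textBody and ignoreIfEmpty are omitted, the defaults described above should apply: the value is also indexed in textbody, and empty values are indexed.

```xml
<customize:append id="addTitlePropertyField"
                  bean="contentConfiguration" property="propertyFields">
  <list>
    <bean class="com.coremedia.cms.feeder.content.PropertyField">
      <!-- hypothetical index field; it must exist in the Solr schema.xml -->
      <property name="name" value="mytitle"/>
      <property name="doctype" value="Article"/>
      <!-- hypothetical document property of type Article -->
      <property name="property" value="title"/>
      <!-- textBody and ignoreIfEmpty are omitted, so the defaults apply -->
    </bean>
  </list>
</customize:append>
```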

Configuring FeedablePopulator Beans

FeedablePopulator Spring beans are configured in the list property feedablePopulators and/or in the list property partialUpdateFeedablePopulators of Spring bean index using a customize:append element, for example in file applicationContext.xml. The following FeedablePopulator classes already exist:

PropertyPathFeedablePopulator: Index specific values from a struct document property.
XPathFeedablePopulator: Extracts a text fragment from an XML document property.
ImageDimensionFeedablePopulator: Set image attributes like image orientation, dimension, and size category.
ContentStatusFeedablePopulator: Set the document status (approved, deleted, etc).

Your own populator classes just need to implement the FeedablePopulator interface and can then be configured in the same way. The method FeedablePopulator#populate will be called with a com.coremedia.cap.content.Content object. That is, the type parameter T of FeedablePopulator implementations must be Content or a supertype of Content.

Populators registered at property feedablePopulators of Spring bean index are called when a document gets added or updated and the whole document data is sent to the search engine. Populators registered at property partialUpdateFeedablePopulators are called for partial updates, when only document metadata is sent to the search engine. You can also register a custom FeedablePopulator at both list properties and use method isPartialUpdate of the passed in Feedable to detect whether a partial update is being processed. Method getUpdatedAspects of the extended interface Feedable2 returns which aspects of the index document are changed with a partial update.
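
The registration at both list properties could look like the following sketch. The bean id myFeedablePopulator and the class com.example.feeder.MyFeedablePopulator are hypothetical placeholders for your own FeedablePopulator implementation; only the bean name index and the two list property names are taken from the description above.

```xml
<!-- Hypothetical custom populator; class name and id are assumptions -->
<bean id="myFeedablePopulator"
      class="com.example.feeder.MyFeedablePopulator"/>

<!-- Called when a document is added or updated as a whole -->
<customize:append id="addMyFeedablePopulator"
                  bean="index" property="feedablePopulators">
  <list>
    <ref bean="myFeedablePopulator"/>
  </list>
</customize:append>

<!-- Called for partial updates, when only metadata is sent -->
<customize:append id="addMyPartialUpdateFeedablePopulator"
                  bean="index" property="partialUpdateFeedablePopulators">
  <list>
    <ref bean="myFeedablePopulator"/>
  </list>
</customize:append>
```

Referencing a single bean from both lists keeps the populator configuration in one place. The implementation can then distinguish the two cases with isPartialUpdate.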

PropertyPathFeedablePopulator

The PropertyPathFeedablePopulator is configured with a dot-separated property path to index a specific property value from a struct document property. The first name in the property path denotes the struct document property itself while the following names specify nested properties of the struct. The constructor argument type selects the type of the documents. The argument element maps to the field name in the index. Furthermore, it's possible to configure whether the value should also be indexed in the field textbody using the property textBody. By default, it will not be indexed in the textbody field but you can enable this by setting the property textBody to true.

The following example configures a populator to feed the index field author from a localSettings.metadata.author struct property path of Article documents.

<customize:append id="addAuthorFeedablePopulator"
 bean="index" property="feedablePopulators">
  <list>
    <ref bean="authorFeedablePopulator"/>
  </list>
</customize:append>

<bean id="authorFeedablePopulator"
      class="com.coremedia.cap.feeder.populate.PropertyPathFeedablePopulator">
  <constructor-arg index="0" name="type" value="Article"/>
  <constructor-arg index="1" name="propertyPath"
                   value="localSettings.metadata.author"/>
  <constructor-arg index="2" name="element" value="author"/>
</bean>
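
If the value should additionally be searchable through the field textbody, the textBody property can be set on the populator bean. The following sketch assumes a hypothetical struct path localSettings.metadata.keywords and a hypothetical index field keywords. The bean still has to be registered at the index bean in the same way as in the previous example.

```xml
<bean id="keywordsFeedablePopulator"
      class="com.coremedia.cap.feeder.populate.PropertyPathFeedablePopulator">
  <constructor-arg index="0" name="type" value="Article"/>
  <!-- hypothetical struct property path -->
  <constructor-arg index="1" name="propertyPath"
                   value="localSettings.metadata.keywords"/>
  <!-- hypothetical index field; it must exist in the Solr schema.xml -->
  <constructor-arg index="2" name="element" value="keywords"/>
  <!-- also index the value in textbody (disabled by default) -->
  <property name="textBody" value="true"/>
</bean>
```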

XPathFeedablePopulator

XPathFeedablePopulators extract text of a fragment from an XML property. The fragment is specified with an XPath expression in the property XPath. The required property element maps to the field name in the index. The property contentType selects the type of the document and the property property selects the document property. Furthermore, it's possible to configure whether the property's value should also be indexed in the field textbody. By default, it will be indexed in textbody but you can disable this by setting the property textBody to false. The namespaces property defines namespaces which can be used in the XPath expression.

The following example configures a populator to feed the index field tabletext from Text properties in Article documents.

<customize:append id="addFeedablePopulators" 
 bean="index" property="feedablePopulators">
  <list>
    <bean class="com.coremedia.cap.feeder.populate.XPathFeedablePopulator">
      <property name="element" value="tabletext"/>
      <property name="contentType" value="Article"/>
      <property name="property" value="Text"/>
      <property name="textBody" value="false"/>
      <property name="XPath" value="/r:div/r:table"/>
      <property name="namespaces">
        <map>
          <entry key="r"
                 value="http://www.coremedia.com/2003/richtext-1.0"/>
        </map>
      </property>
    </bean>
  </list>
</customize:append>

ImageDimensionFeedablePopulator

The ImageDimensionFeedablePopulator is used to detect the orientation (portrait, square, landscape), dimension (width, height) and size category (small, medium, large) of an image. After detection the following index fields are set:

imageOrientation: portrait (value=0), square (value=1) and landscape (value=2) mode.
imageSizeCategory: small (value=0), medium (value=1) and large (value=2) mode.
imageWidth: image width in pixels.
imageHeight: image height in pixels.
imageMaxLength: maximum of imageWidth and imageHeight.

An image is in portrait mode if its height is larger than its width, and in landscape mode if its width is larger than its height. If width and height are equal, it is in square mode. An image is categorized as large if its width is larger than or equal to the configured largeWidth property and its height is larger than or equal to the configured largeHeight property. Otherwise, it is categorized as medium if its width is larger than or equal to mediumWidth and its height is larger than or equal to mediumHeight. The image is small if its width is smaller than mediumWidth or its height is smaller than mediumHeight.

To categorize image orientation (portrait, square, landscape) and image size (small, medium, large), some filter properties must be configured:

docType: the document type of the content to be indexed, including subtypes
widthPropertyName: the property name of the content which holds the width value
heightPropertyName: the property name of the content which holds the height value
dataPropertyName: the property name of the content which holds the image data. The value of this object must be of type com.coremedia.cap.common.Blob.

You must set either widthPropertyName and heightPropertyName or dataPropertyName or both. If the two dimension properties do not exist, the blob data is read to determine the dimension.

largeWidth: lower bound width of large images
largeHeight: lower bound height of large images
mediumWidth: lower bound width of medium images
mediumHeight: lower bound height of medium images

The following example shows an ImageDimensionFeedablePopulator configuration.

<customize:append id="addFeedablePopulators" 
 bean="index" property="feedablePopulators">
  <list>
    <bean class="com.coremedia.cap.feeder.populate.ImageDimensionFeedablePopulator">
      <property name="largeWidth" 
       value="${feeder.populator.imageDimension.largeWidth}"/>
      <property name="largeHeight" 
       value="${feeder.populator.imageDimension.largeHeight}"/>
      <property name="mediumWidth" 
       value="${feeder.populator.imageDimension.mediumWidth}"/>
      <property name="mediumHeight" 
       value="${feeder.populator.imageDimension.mediumHeight}"/>
      <property name="docType" 
       value="${feeder.populator.imageDimension.docType}"/>
      <property name="widthPropertyName" 
       value="${feeder.populator.imageDimension.widthPropertyName}"/>
      <property name="heightPropertyName" 
       value="${feeder.populator.imageDimension.heightPropertyName}"/>
      <property name="dataPropertyName" 
       value="${feeder.populator.imageDimension.dataPropertyName}"/>
    </bean>  
  </list>
</customize:append>

The ${...} placeholders in the populator bean are replaced with values from a properties file.
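
To make the categorization rules more concrete, the following sketch uses literal values instead of placeholders. All values, including the document type Picture and the blob property data, are assumptions chosen for illustration and are not defaults of the product. Only dataPropertyName is set, so the dimensions would be read from the blob data.

```xml
<customize:append id="addImageDimensionFeedablePopulator"
                  bean="index" property="feedablePopulators">
  <list>
    <bean class="com.coremedia.cap.feeder.populate.ImageDimensionFeedablePopulator">
      <!-- all values below are assumptions for illustration -->
      <property name="docType" value="Picture"/>
      <property name="dataPropertyName" value="data"/>
      <property name="largeWidth" value="1024"/>
      <property name="largeHeight" value="768"/>
      <property name="mediumWidth" value="320"/>
      <property name="mediumHeight" value="240"/>
    </bean>
  </list>
</customize:append>
```

With these bounds and the rules described above, a 1280x1024 image would be large and landscape, a 400x400 image would be medium and square, and a 200x300 image would be small and portrait. A 2000x100 image would also be small, because its height is below mediumHeight.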

ContentStatusFeedablePopulator

The ContentStatusFeedablePopulator classifies a document in one of four status categories:

0: in production (not approved and not deleted)
1: approved (place and content)
2: published (place and content)
3: deleted

After classification, the status value of the document is stored in the index field status. The following example shows a ContentStatusFeedablePopulator configuration:

<customize:append id="addFeedablePopulators" 
bean="index" property="feedablePopulators">
  <list>
    <bean class="com.coremedia.cap.feeder.populate.ContentStatusFeedablePopulator"/>
  </list>
</customize:append>

	Caution
	Note that the Content Feeder does not update already processed documents after you change the fields to index. A configuration change only affects newly processed documents. If you want to update all documents, restart the Content Feeder with an empty index.

CoreMedia Search Manual, Version 7.5.45-10