Search Manual / Version 2204

4.2.2.2 Configuring Properties for Indexing

You can restrict which content properties are indexed by their name and type. For XML properties, you can restrict indexing further by grammar; for blob properties, by MIME type and size.

To restrict the indexed properties by name, specify a map entry with included or excluded properties for some or all content types. A map entry for a supertype applies to all of its subtypes unless it is overridden by an entry for a subtype. If no entry is specified for a content type or any of its ancestors, all content properties are included. The wildcard * stands for all properties and can be used to include or exclude all properties of a type. Note, however, that for a given type you can configure either a list of included properties or a list of excluded properties, but not both, and that property lists from different entries are not merged.

Note

Configuration not mandatory: The default configuration includes all String and CoreMedia RichText XML properties. It also includes blob properties of the MIME types text/*, application/pdf, application/msword and application/vnd.openxmlformats-officedocument.wordprocessingml.document (docx files) that are not larger than 5 MB.

You can configure indexed content properties by their name by customizing the Spring beans feederContentPropertyIncludes and feederContentPropertyExcludes in the file applicationContext.xml.

The following example configures the Content Feeder to index only the properties 'Author' and 'Text' of content type Article, and all properties of content type Picture except 'Copyright'. Content types that are not listed here, and that do not inherit an entry from a supertype, are indexed with all their properties.

<customize:append id="feederContentPropertyIncludesCustomizer" bean="feederContentPropertyIncludes">
  <map>
    <entry key="Article" value="Author,Text"/>
  </map>
</customize:append>

<customize:append id="feederContentPropertyExcludesCustomizer" bean="feederContentPropertyExcludes">
  <map>
    <entry key="Picture" value="Copyright"/>
  </map>
</customize:append>

Note that it is an error to specify both included and excluded properties for the same type.

See the description of the beans in file applicationContext.xml for more details.
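The name-based rules described above (nearest entry in the type hierarchy wins, lists are not merged, no entry means all properties) can be modeled in a few lines. The following Python sketch is purely illustrative and is not CoreMedia code: the parent map, the type names other than Article and Picture, and the function names are hypothetical. It also assumes that an entry for a subtype replaces an inherited entry regardless of whether the two are include or exclude lists. Filtering by property type is applied separately and is not modeled here.

```python
# Illustrative model of the include/exclude rules; not CoreMedia code.
# Hypothetical type hierarchy: child type -> parent type.
PARENTS = {"NewsArticle": "Article", "Article": "Teasable", "Picture": "Media"}

# Mirrors the XML example: include list for Article, exclude list for Picture.
INCLUDES = {"Article": {"Author", "Text"}}
EXCLUDES = {"Picture": {"Copyright"}}


def validate(includes, excludes):
    """Reject a configuration that lists the same type in both maps."""
    both = includes.keys() & excludes.keys()
    if both:
        raise ValueError(f"includes and excludes both set for: {sorted(both)}")


def is_indexed(content_type, prop, includes=INCLUDES, excludes=EXCLUDES):
    """Return True if the property passes the name-based filter."""
    current = content_type
    while current is not None:
        # The nearest entry (the type itself, then its ancestors) decides;
        # lists from different entries are never merged.
        if current in includes:
            names = includes[current]
            return "*" in names or prop in names
        if current in excludes:
            names = excludes[current]
            return "*" not in names and prop not in names
        current = PARENTS.get(current)
    # No entry for the type or any ancestor: all properties are included.
    return True
```

With this model, the hypothetical subtype NewsArticle inherits the Article entry, so only 'Author' and 'Text' pass the filter for it, while a type without any entry passes every property.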

Note

The CoreMedia Feeder applications use Apache Tika for text extraction from binary formats. You can find the list of formats supported by Tika at https://tika.apache.org/2.4.1/formats.html. Note, however, that the Blueprint Feeder applications do not include all transitive Tika libraries, in order to reduce the total number of dependencies and avoid potential version conflicts. Libraries for less common formats, such as NetCDF scientific files and many more, have been excluded. Check the classpath of the Feeder applications and extend it if needed. Libraries for common formats such as Microsoft Office or PDF are included by default.

You can also change the indexed content properties by their type. The following example shows the default configuration for property types:

# indexed property types
feeder.content.property-type.string=true
feeder.content.property-type.integer=false
feeder.content.property-type.date=false
feeder.content.property-type.link-list=false
feeder.content.property-type.struct=false

# Indexed xml properties, configured by xml grammar
# comma separated grammar names (as used in the content
# type definition, attribute Name of element XmlGrammar)
feeder.content.property-type.xml-grammars=coremedia-richtext-1.0

# Indexed blob properties, configured by comma-separated MIME-types
# If you don't configure any MIME-types in the includes property,
# no blob properties will be indexed.
# You can exclude a more specific type (for example, text/xml) while
# including the corresponding primary type (for example, text/*)
feeder.content.property-type.blob-mime-type.includes=text/*,\
  application/pdf,application/msword,\
  application/vnd.openxmlformats-officedocument.wordprocessingml.document
feeder.content.property-type.blob-mime-type.excludes=

# The maximum size in bytes for included blob properties;
# larger blobs will be skipped.
# This configuration can be overridden using Spring configuration
# where you can configure the maximum size per MIME-type by
# customizing the bean 'feederContentBlobMaxSizePerMimeType'.
feeder.content.property-type.blob-max-size=5242880
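
The interplay of the blob settings (included MIME types, excluded MIME types, maximum size) can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the Content Feeder's implementation: the function names are hypothetical, wildcard matching is approximated with fnmatch, and the per-MIME-type size override via 'feederContentBlobMaxSizePerMimeType' is left out.

```python
from fnmatch import fnmatchcase

# Default values taken from the properties shown above.
MIME_INCLUDES = [
    "text/*",
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
MIME_EXCLUDES = []
BLOB_MAX_SIZE = 5242880  # bytes (5 * 1024 * 1024)


def matches(mime_type, patterns):
    """True if the MIME type matches any pattern such as text/* or text/xml."""
    mime_type = mime_type.strip().lower()
    return any(fnmatchcase(mime_type, p.strip().lower()) for p in patterns)


def blob_is_indexed(mime_type, size, includes=MIME_INCLUDES,
                    excludes=MIME_EXCLUDES, max_size=BLOB_MAX_SIZE):
    """Hypothetical check: included, not excluded, and not too large."""
    if not matches(mime_type, includes):
        return False  # an empty include list means no blob is indexed
    if matches(mime_type, excludes):
        return False  # a specific exclude beats a broader include
    return size <= max_size
```

For example, in this model `blob_is_indexed("text/xml", 10, excludes=["text/xml"])` is rejected while `text/html` still passes through the `text/*` include.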

Caution

The Content Feeder does not update already processed contents after you change these properties. A configuration change only affects newly processed contents. To update all contents, or contents of a certain type, you must reindex as described in Section 3.5, “Reindexing”.
