Search Manual / Version 2307

5.2.6 Configuring Tika metadata extraction

In addition to extracting body text, Tika can extract metadata from some binary formats, such as the creator of a Microsoft Word file. You can use the following properties to extract and index this metadata:

feeder.tika.append-metadata
feeder.tika.copy-metadata

The property feeder.tika.append-metadata takes a comma-separated list of metadata identifiers. Whenever Apache Tika extracts a value for one of these identifiers, the CAE Feeder appends that value to the indexed body text.
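
As an illustration, the following sketch lists two identifiers. The choice of dc:creator and dc:title is an assumption made for this example; use whichever identifiers Apache Tika reports for your documents.

```properties
# Illustrative only: append the creator and title metadata values
# to the indexed body text whenever Tika extracts them.
feeder.tika.append-metadata=dc:creator,dc:title
```

With a setting like this, a search for the creator's name should also match documents whose body text does not mention that name.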

The property feeder.tika.copy-metadata takes a comma-separated list in which each entry consists of a metadata identifier, followed by an equals sign (=) and the name of the index field to which the metadata should be copied. When a matching metadata value is found, it is stored in the configured index field. Note that with Apache Solr, target index fields must be defined as multiValued="true" to avoid indexing errors if there are multiple metadata values with the same identifier. See also Section 5.4.4, “Modifying the Search Index”.

Example

feeder.tika.copy-metadata=dc:creator=author

The above example configures the CAE Feeder to store the dc:creator metadata value in the index field author. Note that the index field must be declared in the Solr schema for this to work.
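
Because the property takes a comma-separated list, several mappings can be combined in one value. The following sketch assumes a second index field named doctitle, which is a hypothetical name chosen for illustration and not a field that is known to exist in the default schema.

```properties
# Illustrative only: copy two metadata values to separate index fields.
# "doctitle" is a hypothetical field name; both target fields must be
# declared in the Solr schema and defined as multiValued="true".
feeder.tika.copy-metadata=dc:creator=author,dc:title=doctitle
```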

Metadata identifiers are specific to Apache Tika. You can find some of them in the API documentation of the Apache Tika class org.apache.tika.metadata.TikaCoreProperties.
