Search Manual / Version 2406.0
In addition to extracting body text, Apache Tika can extract metadata from some binary formats,
such as the creator of a Microsoft Word file. Use the configuration properties
feeder.tika.append-metadata and feeder.tika.copy-metadata to extract and index this metadata.
The property feeder.tika.append-metadata takes a comma-separated list of metadata identifiers.
When Apache Tika extracts a value for one of these identifiers, the Content Feeder appends that
value to the indexed body text.
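As a minimal sketch, the following setting would append the creator and title of a binary document to its indexed body text. The identifiers dc:creator and dc:title are both defined in TikaCoreProperties. Choosing these two is an assumption made for illustration, and whether a value is actually available depends on the file format and the Tika parser.

```properties
# Illustrative only: append creator and title metadata to the indexed body text
feeder.tika.append-metadata=dc:creator,dc:title
```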
The property feeder.tika.copy-metadata takes a comma-separated list in which each entry consists
of a metadata identifier, an equals sign (=), and the name of the index field to copy the metadata to.
When a matching metadata value is found, it is stored in the configured index field. Note that with
Apache Solr, target index fields must be defined as multiValued="true" to avoid indexing errors when
there are multiple metadata values with the same identifier. See also Section 4.5, “Modify the Search Index”.
Example
feeder.tika.copy-metadata=dc:creator=author
The above example configures the Content Feeder to store the dc:creator metadata value in the
index field author. Note that the index field must be declared in the Solr schema for this to work.
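Because the property takes a comma-separated list, several mappings can be combined in one setting. The sketch below assumes a second, hypothetical index field named doctitle. It is not part of the original example, and both target fields would have to be declared in the Solr schema as multiValued="true".

```properties
# Illustrative only: "doctitle" is a hypothetical index field name
feeder.tika.copy-metadata=dc:creator=author,dc:title=doctitle
```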
Metadata identifiers are specific to Apache Tika. You can find some of them in the API documentation
of the Apache Tika class org.apache.tika.metadata.TikaCoreProperties.