Search Manual / Version 2010
In addition to extracting body text, Tika can extract metadata for some binary formats, such as the creator of a Microsoft Word file. You can use the following properties to extract and index metadata from binary formats:
feeder.tika.append-metadata
feeder.tika.copy-metadata
The property feeder.tika.append-metadata takes a comma-separated list of metadata identifiers.
When Apache Tika extracts a value for one of these identifiers, the CAE Feeder simply appends
that value to the indexed body text.
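As a minimal sketch, a setting with two identifiers could look like the following. The identifier creator is taken from the example in this section. The identifier title is an assumption for illustration only, and whether Tika emits it depends on the Tika version and the document format.

```properties
# Append the values of these Tika metadata identifiers to the indexed body text.
# "title" is an assumed identifier used only for illustration.
feeder.tika.append-metadata=creator,title
```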
The property feeder.tika.copy-metadata takes a comma-separated list where each entry consists
of a metadata identifier followed by an equals sign (=) and the name of the index field
the metadata should be copied to. When a matching metadata value is found, it is stored in the configured
index field. Note that with Apache Solr, target index fields must be defined as multiValued="true"
to avoid indexing errors if there are multiple metadata values with the same
identifier. See also Section 5.4.4, “Modifying the Search Index”.
Example
feeder.tika.copy-metadata=creator=author
The above example configures the CAE Feeder to store the creator, as extracted
from the metadata, in the index field author. You have to declare the index field in the
Solr schema for this to work.
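Because the property takes a comma-separated list, several mappings can be combined in one setting. The following is a sketch under assumptions: the identifier title and the index field doctitle are hypothetical names chosen for illustration, not values taken from the product.

```properties
# Copy the Tika "creator" value to the index field "author", and the
# assumed "title" value to the hypothetical index field "doctitle".
feeder.tika.copy-metadata=creator=author,title=doctitle
```

With such a setting, each target field (here author and doctitle) would need to be declared in the Solr schema, and defined as multiValued="true" if a document can carry more than one value for the same identifier.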
Metadata identifiers are specific to Apache Tika. You can find some of them in the API documentation of
the Apache Tika class org.apache.tika.metadata.TikaCoreProperties.