4.1. Concepts

The Content Feeder sends content and metadata of documents to the CoreMedia Search Engine. The Search Engine extracts the textual data of the documents, indexes it, and makes the documents searchable. The Content Feeder is a web application that connects to the Content Server and to the Search Engine.

The CoreMedia Content Server provides a search service which hides the functionality of the CoreMedia Search Engine from clients. The server contacts the CoreMedia Search Engine to serve client search requests. The Site Manager and custom clients that use the Unified API SearchService get the search results directly from the CoreMedia Content Server.

It is also possible to send search requests from custom clients directly to the CoreMedia Search Engine using the native API of the underlying search engine. This is recommended in most cases because the search service of the Content Server does not support all search features of Apache Solr and adds some performance overhead compared to a direct connection. The Studio back-end is an example of a search client that sends search requests directly to the Search Engine.
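As a rough illustration of such a direct request, the following Python sketch builds a URL for Solr's standard /select request handler. The base URL, the collection name mycollection and the helper build_select_url are assumptions made for this example; the actual values depend on the deployment. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=10):
    """Return a URL for Solr's /select handler (no request is sent).

    base_url and collection are placeholders for a concrete installation.
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical local installation and collection name.
url = build_select_url("http://localhost:8983/solr", "mycollection", "coremedia")
```

A client would then issue an HTTP GET for this URL and parse the JSON response.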

Figure 4.1. Search Engine Integration


The CoreMedia Content Feeder feeds an index which is needed for the full-text search feature in the Site Manager and in CoreMedia Studio. Multiple Content Feeders can use the same CoreMedia Search Engine but require separate indices.

To provide full-text search for documents in the Content Delivery Environment, a separate Content Feeder can be set up that connects to the CoreMedia Master Live Server and feeds another index.

Feeding the Search Engine

When the Content Feeder starts for the first time, it iterates over the documents in the repository and sends them to the Search Engine for indexing. After this initialization phase, the Content Feeder sends documents to the Search Engine after they have changed or when they are newly created.

When the Content Feeder restarts, it automatically continues its work with the next document that needs to be indexed. This document is determined from a timestamp stored by the Content Feeder in the same index of the Search Engine. During restart the Content Feeder retrieves the timestamp from the Search Engine to continue feeding.
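The restart behavior can be pictured with a small conceptual model. The following Python sketch is not the Content Feeder's implementation; the class InMemoryIndex, the function feed and the use of integer timestamps are assumptions chosen to show how a progress marker stored in the index lets feeding resume where it stopped.

```python
class InMemoryIndex:
    """Hypothetical stand-in for an index that also stores the feeder's progress."""

    def __init__(self):
        self.documents = {}
        self.last_timestamp = None  # progress marker kept alongside the documents


def feed(index, events):
    """Index all events newer than the timestamp stored in the index.

    events is a list of (timestamp, doc_id, data) tuples sorted by timestamp.
    Returns the number of events processed in this run.
    """
    start = index.last_timestamp  # read once, as on a restart
    processed = 0
    for timestamp, doc_id, data in events:
        if start is not None and timestamp <= start:
            continue  # already indexed before the restart
        index.documents[doc_id] = data
        index.last_timestamp = timestamp
        processed += 1
    return processed
```

Calling feed a second time with the same events processes nothing, while a newly appended event is picked up on the next run.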

The CoreMedia Search Engine indexes textual data from document properties and a number of metadata attributes, such as the path of the document, the name of its creator and the last time the document was published. In the configuration of the Content Feeder, you can restrict which documents are indexed by their type, and which of their properties are indexed by name and type. Note that the CoreMedia Search Engine only indexes the latest document version.

Partial Updates

The Content Feeder can use partial updates if only document metadata has changed. This means that it does not need to send the whole document data to the Search Engine but just a small set of changed metadata, for example a changed path after documents have been moved to another place in the repository. This can greatly improve performance, especially if many documents are affected and expensive operations such as parsing text from PDF files can be avoided.

The Content Feeder can use partial updates only if the connected search engine supports them. Apache Solr supports partial updates if index fields are configured as stored, as they are in the default configuration. See the description of the configuration properties solr.partialUpdates, solr.partialUpdatesSkipIndexCheck and feeder.partialUpdate.aspects in Section 6.1, “Content Feeder Configuration” for more details.
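To give an idea of what a partial update looks like at the Solr level, the following Python sketch builds a document in Solr's atomic update syntax, where each changed field carries a "set" modifier. The field names id and path, the sample values and the helper atomic_update are assumptions for illustration; the exact requests sent by the Content Feeder are not shown here.

```python
import json


def atomic_update(doc_id, changed_fields):
    """Build a Solr atomic update document that sets only the given fields."""
    doc = {"id": doc_id}  # "id" is assumed to be the unique key field
    for name, value in changed_fields.items():
        doc[name] = {"set": value}  # "set" replaces the stored field value
    return doc


# Hypothetical example: only the path of one document has changed.
payload = json.dumps([atomic_update("doc-42", {"path": "/Sites/New/Location"})])
```

Such a payload would be posted to the update handler of the index. All other fields of the document keep their stored values, which is why the fields must be configured as stored.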

Batches

For better performance, the Content Feeder sends batches to the Search Engine. A batch contains changes of multiple documents. A batch that was sent to the Search Engine is called an open batch until all changes it contains have been persistently written to the Search Engine's index.
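The life cycle of a batch can be sketched as follows. This Python model is a simplification under stated assumptions: the class BatchTracker, its method names and the size-based trigger are hypothetical and only illustrate when a batch counts as open.

```python
class BatchTracker:
    """Conceptual model: collects changes and tracks batches that are still open."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.current = []        # changes collected for the next batch
        self.open_batches = {}   # sent, but not yet persisted in the index
        self._next_id = 0

    def add(self, change):
        """Collect a change; send the batch once it is full."""
        self.current.append(change)
        if len(self.current) >= self.max_size:
            return self.send()
        return None

    def send(self):
        """Hand the current batch over; it stays open until persisted."""
        batch_id = self._next_id
        self._next_id += 1
        self.open_batches[batch_id] = self.current
        self.current = []
        return batch_id

    def persisted(self, batch_id):
        """Called when the index has written all changes of the batch."""
        self.open_batches.pop(batch_id)
```

In this model, a batch leaves open_batches only after persisted has been called for it, which mirrors the definition of an open batch above.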

Error Conditions

If the Content Feeder or the Search Engine is unable to process a certain document, an error document is indexed instead. It serves as a placeholder for the original document in the index of the Search Engine.

When a document contains binary data of an unsupported format, no error document is written. Instead, such documents are indexed without the binary data and the document can still be found based on the content of other fields.

Error documents contain the value ERROR in the index field feederstate and are not returned as search results by the Content Server. You can search for error documents using the administration page of the Content Feeder. An error document is replaced with the correct document when the document changes in the CoreMedia Content Server and the cause of the error has been removed.
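Clients that query the Search Engine directly have to handle error documents themselves. The following Python sketch builds Solr query parameters that either select error documents or exclude them with a negative filter query. The field name feederstate and the value ERROR come from the description above; the helper names and the default row count are assumptions for this example.

```python
from urllib.parse import urlencode

ERROR_QUERY = "feederstate:ERROR"


def error_document_params(rows=10):
    """Query parameters that select only error documents."""
    return urlencode({"q": ERROR_QUERY, "rows": rows})


def exclude_errors_params(user_query):
    """Query parameters that hide error documents from a user search."""
    return urlencode({"q": user_query, "fq": "-" + ERROR_QUERY})
```

The resulting strings can be appended to the URL of a Solr request handler such as /select.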

Communication problems with the CoreMedia Search Engine lead to search errors in clients. The Content Feeder retries feeding until the Search Engine responds successfully. Search requests from clients succeed as soon as the communication problems have been resolved.
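The retry behavior can be pictured as a simple loop. The following Python sketch is a conceptual model, not the Content Feeder's code: the function feed_with_retry, the fixed delay and the use of ConnectionError as the failure signal are assumptions.

```python
import time


def feed_with_retry(send, batch, delay_seconds=1.0, sleep=time.sleep):
    """Call send(batch) until it succeeds; return the number of attempts.

    send is any callable that raises ConnectionError while the search
    engine is unreachable. sleep is injectable so the wait can be skipped
    in tests.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            send(batch)
            return attempts
        except ConnectionError:
            sleep(delay_seconds)  # wait before the next attempt
```

A production implementation would typically add logging and a growing delay between attempts; both are omitted here to keep the loop readable.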

Restrictions

The CoreMedia Search Engine provides a fast and efficient full-text search for the indexed documents. However, because of the asynchronous nature of the indexing process, search results do not always reflect the current state of the repository. After a document has been sent to the Search Engine, it may take a couple of seconds before it appears in the search results. Sometimes you can find recent changes sooner with the built-in query feature of the CoreMedia Content Server, which is more powerful but generally slower.

The CoreMedia Search Engine supports search in the content of the latest document version only. If you want to search for older versions or for folders, you have to use the query feature of the CoreMedia Content Server or use the CoreMedia CAE Feeder to index the required data as part of content beans.