
6.3.19. Robots File

Requirements

Technical editors should be able to adjust site behavior regarding robots (also known as crawlers or spiders) from search engines like Google. For example:

  • Enable/disable crawling of certain pages including their sub pages.

  • Enable/disable crawling of certain single documents.

  • Specify certain bots to crawl different sections of the site.

To support this functionality, most robots follow the rules of robots.txt files, as explained at http://www.robotstxt.org/. For example, the site "Corporate" is accessible as http://corporate.blueprint.coremedia.com. For all content of this site, robots look for a file called robots.txt by performing an HTTP GET request to http://corporate.blueprint.coremedia.com/robots.txt. A sample robots.txt file may look like this:

        User-agent: Googlebot
        User-agent: Bingbot
        Disallow: /folder1/
        Allow: /folder1/myfile.html
      

Example 6.2. A robots.txt file
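How a crawler might evaluate such a file can be sketched in a few lines of Python. This is a simplified illustration, not part of Blueprint: it assumes each crawler is named on its own User-agent line, applies the common "longest matching rule wins" convention, and ignores wildcards. The names parse_robots and is_allowed are hypothetical.

```python
SAMPLE = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /folder1/
Allow: /folder1/myfile.html
"""


def parse_robots(text):
    """Parse robots.txt text into a list of (agents, rules) groups."""
    groups = []
    agents, rules, in_rules = [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules, in_rules = [], [], False
            agents.append(value.lower())
        elif name in ("allow", "disallow") and agents:
            in_rules = True
            if value:  # an empty value matches nothing
                rules.append((name, value))
    if agents:
        groups.append((agents, rules))
    return groups


def is_allowed(groups, user_agent, path):
    """Return True if path may be crawled by user_agent (prefix matching only)."""
    ua = user_agent.lower()
    matching = [rules for agents, rules in groups if ua in agents]
    if not matching:  # fall back to the catch-all group, if any
        matching = [rules for agents, rules in groups if "*" in agents]
    best = ("allow", "")  # no matching rule means the path is allowed
    for rules in matching:
        for kind, prefix in rules:
            if not path.startswith(prefix):
                continue
            longer = len(prefix) > len(best[1])
            tie_allow = len(prefix) == len(best[1]) and kind == "allow"
            if longer or tie_allow:
                best = (kind, prefix)
    return best[0] == "allow"


groups = parse_robots(SAMPLE)
print(is_allowed(groups, "Googlebot", "/folder1/myfile.html"))  # True
print(is_allowed(groups, "Googlebot", "/folder1/other.html"))   # False
```

The single document stays crawlable because its Allow rule is longer, and therefore more specific, than the Disallow rule for the enclosing folder.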


Solution

Blueprint's cae-base-lib module provides a RobotsHandler which is responsible for generating a robots.txt file. A RobotsHandler instance is configured in blueprint-handler.xml. It handles URLs like http://corporate.blueprint7.coremedia.com:49080/blueprint/servlet/robots/corporate

This is a typical preview URL. To expose the correct external URL to robots, use Apache rewrite rules that forward incoming GET requests for http://corporate.blueprint7.coremedia.com/robots.txt to http://corporate.blueprint7.coremedia.com:49080/blueprint/servlet/robots/corporate
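In a real deployment this mapping is done by the Apache rewrite rules themselves. The following Python sketch only illustrates what the rule has to compute. The port and servlet path are taken from the example URLs above, and the function name rewrite_robots_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed values, copied from the example URLs in this section.
CAE_PORT = 49080
ROBOTS_SERVLET_PATH = "/blueprint/servlet/robots"


def rewrite_robots_url(external_url, site_segment):
    """Map an external /robots.txt URL to the internal RobotsHandler URL.

    Returns None for any request that is not for /robots.txt.
    """
    parts = urlsplit(external_url)
    if parts.path != "/robots.txt":
        return None
    netloc = f"{parts.hostname}:{CAE_PORT}"
    path = f"{ROBOTS_SERVLET_PATH}/{site_segment}"
    return urlunsplit((parts.scheme, netloc, path, "", ""))


print(rewrite_robots_url(
    "http://corporate.blueprint7.coremedia.com/robots.txt", "corporate"))
```

The site segment ("corporate") is passed in explicitly here, because the rewrite rule has to know which root page belongs to which host.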

RobotsHandler is responsible for requests like this because of the path element /robots. The last path element of this URL (in this example, /corporate) is evaluated by RobotsHandler to determine the root page that has been requested. In this example, "corporate" is the URL segment of the Corporate root page. Thus, RobotsHandler uses the Corporate root page's settings to look up the Robots.txt configuration.
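The path evaluation described above can be sketched as follows. This is an illustration of the idea only, not the actual RobotsHandler code. The function name root_segment_from_robots_url is hypothetical.

```python
from urllib.parse import urlsplit


def root_segment_from_robots_url(url):
    """Return the root page segment of a robots URL, or None if it is not one.

    A URL qualifies if its second-to-last path element is "robots";
    the last path element is then taken as the root page's URL segment.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 2 or segments[-2] != "robots":
        return None
    return segments[-1]


url = "http://corporate.blueprint7.coremedia.com:49080/blueprint/servlet/robots/corporate"
print(root_segment_from_robots_url(url))  # corporate
```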

To add configuration for a Robots.txt file, the corresponding root page (here: "Corporate") needs a setting called Robots.txt.


Figure 6.19. Robots.txt settings


Example configuration for a Robots.txt file

The settings document itself is organized as a StructList property, as in this example:


Figure 6.20. Channel settings with configuration for Robots.txt as a linked setting on a root page


For any specified user agent the following properties are supported:

  • User-agent: Specifies the user agent(s) to which this node applies.

  • Disallow: A link list of items to be disallowed for robots. This list is a blacklist of navigation elements or content that should not be crawled. A navigation element is interpreted as "do not crawl elements below this navigation path". This leads to two entries in the resulting robots.txt file: one for the link to the navigation element and one for the same link with a trailing '/'. The latter tells the crawler to treat the link as a path, so the crawler will not process any elements below it. A single content element is interpreted as "do not crawl this document".

  • Allow: A link list of items to be explicitly allowed for robots. This list specifies navigation elements or content that should be crawled and is interpreted as a whitelist. Usually, a blacklist alone is sufficient. However, if you want to hide a navigation path from robots but still have one document below it crawled, add the navigation path to the disallow list and the single document to the allow list.

  • custom-entries: A String list of custom entries for the Robots.txt. Each element is added as a new line in the Robots.txt for this node.

The example settings document will result in the following robots.txt file:

        User-agent: *
        Disallow: /corporate/corporate-information/
        Allow: /corporate/corporate-information/contact-us

        User-agent: Googlebot
        Disallow: /corporate/embedding-test
       

Example 6.3. robots.txt file generated by the example settings
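The rendering rules from the property list above can be sketched in Python. This is a minimal illustration under stated assumptions, not the RobotsHandler implementation: the classes Link and RobotsNode and the function render_robots_txt are hypothetical, the paths are taken from the example, and the Crawl-delay line is an invented custom entry. The sketch follows the two-entry rule described for disallowed navigation elements and assumes that allowed items produce a single entry each.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """A linked item; is_navigation marks navigation elements (hypothetical model)."""
    path: str
    is_navigation: bool = False


@dataclass
class RobotsNode:
    """One node of the settings StructList (hypothetical model)."""
    user_agent: str
    disallow: list = field(default_factory=list)
    allow: list = field(default_factory=list)
    custom_entries: list = field(default_factory=list)


def render_robots_txt(nodes):
    """Render the nodes into robots.txt text, one block per node."""
    blocks = []
    for node in nodes:
        lines = [f"User-agent: {node.user_agent}"]
        for link in node.disallow:
            lines.append(f"Disallow: {link.path}")
            if link.is_navigation:
                # Second entry with trailing '/' covers everything below the path.
                lines.append(f"Disallow: {link.path}/")
        # Assumption: allowed items produce one entry each.
        lines.extend(f"Allow: {link.path}" for link in node.allow)
        # Custom entries are copied verbatim, one per line.
        lines.extend(node.custom_entries)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


nodes = [
    RobotsNode(
        "*",
        disallow=[Link("/corporate/corporate-information", is_navigation=True)],
        allow=[Link("/corporate/corporate-information/contact-us")],
    ),
    RobotsNode(
        "Googlebot",
        disallow=[Link("/corporate/embedding-test")],
        custom_entries=["Crawl-delay: 10"],  # hypothetical custom entry
    ),
]
print(render_robots_txt(nodes), end="")
```

Keeping one block per node mirrors the structure of the settings document: each node in the StructList becomes one User-agent group in the generated file.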