
6.3.20. Sitemap

Requirements

If you run a public website, you want search engines to list it, so you give web crawlers hints about the pages they should crawl. http://www.sitemaps.org/ defines an XML format for such sitemaps, which is supported by many search engines, notably Google, Yahoo! and Microsoft.

A "sitemap" in the sense of http://www.sitemaps.org/ must not be confused with a human-readable sitemap that visualizes the structure of a website (see Section 6.3.18, “Content Type Sitemap”). Rather, it is a complete index of all pages of a site. A simple sitemap file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://helios.coremedia.com/perfectchef/spicy-duck-694</loc>
  </url>
  <url>
    <loc>http://helios.coremedia.com/perfectchef/share-your-recipes-696</loc>
  </url>
  ...
</urlset>

Example 6.4. A sitemap file


A single sitemap file is limited to 50,000 URLs. Larger sites must be split into several sitemap files plus a sitemap index file that aggregates them. A sitemap index file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://helios.coremedia.com/sitemap1.xml.gz</loc>
    <lastmod>2014-03-31T15:33:26+02:00</lastmod>
  </sitemap>
  ...
</sitemapindex>

Example 6.5. A sitemap index file
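
For illustration only, the following Python sketch assembles an index file like the one in Example 6.5 with the standard library. It is not how Blueprint produces its files; the function name build_sitemap_index and the (loc, lastmod) pair input are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the examples above.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(entries):
    """Return a sitemap index document as a string.

    entries: iterable of (loc, lastmod) string pairs, one per sitemap file.
    (Hypothetical helper, not part of Blueprint.)
    """
    root = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        ET.SubElement(sitemap, "lastmod").text = lastmod
    # encoding="unicode" yields a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


index_xml = build_sitemap_index([
    ("http://helios.coremedia.com/sitemap1.xml.gz", "2014-03-31T15:33:26+02:00"),
])
```

A real generator would also write the XML declaration and compress the referenced sitemap files; both are omitted here to keep the sketch short.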


Solution

A sitemap consists of multiple entities (the index file and the sitemap files) and depends on almost the whole repository. If a new content item is created that happens to appear in the first sitemap file, the entries of all subsequent sitemap files shift. In edge cases even the number of sitemap files changes, which affects the sitemap index file. You therefore cannot generate single sitemap entities on crawler demand, asynchronously and independently of each other; you must generate a complete sitemap that represents a snapshot of the repository. Moreover, the extensive dependencies make sitemaps practically uncacheable, and generation is expensive. For these reasons Blueprint does not render sitemaps on demand but pregenerates them periodically. You must therefore distinguish between sitemap generation and sitemap service, although both are handled by the live web application.
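
The shifting effect can be made concrete with a small Python sketch. It assumes that URLs are emitted in a fixed order and cut into consecutive files, and it uses a limit of 2 instead of 50,000 so the result stays readable. The helper split_into_sitemaps and the sample paths are hypothetical and do not reflect Blueprint's actual implementation.

```python
from typing import List

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit stated above


def split_into_sitemaps(urls: List[str],
                        limit: int = MAX_URLS_PER_SITEMAP) -> List[List[str]]:
    """Cut an ordered URL list into consecutive sitemap files."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# Snapshot before and after a new page "/b" is published.
before = split_into_sitemaps(["/a", "/c", "/d", "/e"], limit=2)
after = split_into_sitemaps(["/a", "/b", "/c", "/d", "/e"], limit=2)

# before == [["/a", "/c"], ["/d", "/e"]]
# after  == [["/a", "/b"], ["/c", "/d"], ["/e"]]
```

One new entry in the first file changes every file and adds a third one, so the index must change as well. This is why the files have to be generated together from a single snapshot.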

Sitemap Generation

CoreMedia Blueprint generates a separate sitemap for each site. Sitemap generation depends on site-specific configuration, such as the document types to include or the paths to exclude. This configuration is specified by SitemapSetup Spring beans. The lc extension and the corporate extension each provide a SitemapSetup bean suitable for their particular sites, and projects can declare their own sitemap setups. The setups are collected in the sitemapConfigurations Spring map.

<bean id="livecontextSitemapConfiguration" class="c.c.b.c.s.SitemapSetup">
  <property name="protocol" value="http"/>
  ...
</bean>

<customize:append id="appendLSC" bean="sitemapConfigurations">
  <map>
    <entry key="livecontext" value-ref="livecontextSitemapConfiguration"/>
  </map>
</customize:append>

If you want to generate a sitemap for a site, you have to specify the setting sitemapOrgConfiguration at the root channel. It is a String setting, and the value must be a key of the sitemapConfigurations map.

Figure 6.21. Selection of a sitemap setup


By default, the PerfectChef sites and the Corporate sites are sitemap-enabled, while the Aurora sites are not. Since the Aurora sites serve only as a backend for WCS applications, they do not need sitemaps.

Sitemaps are generated periodically in the Delivery CAE by a SitemapGenerationJob. You can specify the initial start time and the period with the application properties blueprint.sitemap.starttime and blueprint.sitemap.period, respectively. For details about the values, see the JavaDoc of the setters in SitemapGenerationJob. The Blueprint is preconfigured to run the sitemap generation nightly at 01:30. You can also trigger sitemap generation for a particular site manually via the URL

http://live-cae:49080/blueprint/servlet/internal/corporate-de-de/sitemap-org

where corporate-de-de stands for the segment of the site's root channel. Note that this is an internal URL, which can only be invoked directly on the CAE's servlet container. Sitemap generation is an expensive administrative task that must not be exposed to end users. CoreMedia's default Apache rewrite rules block internal URLs; see the rewrite.inc files.

The sitemaps are written to the file system, into a directory specified by the blueprint.sitemap.target.root application property. The CAE therefore needs write permission for this directory.

Sitemap Service

The generated sitemaps are available under the URL pattern

/service/sitemap/the-site-ID/sitemap_index.xml

To inform search crawlers, the sitemap URLs are included in the robots.txt files. Since there is only one robots.txt file per web presence, you will see multiple sitemap entries for the localized sites:

User-agent: *
Disallow: /

Sitemap: http://corporate.acme.com/service/sitemap/ab..ee/sitemap_index.xml
Sitemap: http://corporate.acme.com/service/sitemap/1c..7a/sitemap_index.xml
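
To check which sitemaps a web presence advertises, you can extract the Sitemap entries from its robots.txt. The following Python sketch shows one way to do so. The function name sitemap_urls is an assumption for this example, and the sample text repeats the listing above, including its abbreviated site IDs.

```python
from typing import List


def sitemap_urls(robots_txt: str) -> List[str]:
    """Collect the values of all 'Sitemap:' lines in a robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """User-agent: *
Disallow: /

Sitemap: http://corporate.acme.com/service/sitemap/ab..ee/sitemap_index.xml
Sitemap: http://corporate.acme.com/service/sitemap/1c..7a/sitemap_index.xml
"""

found = sitemap_urls(sample)
```

Each returned URL points to the sitemap index of one localized site.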