Requirements
If you run a public website, you want it to be listed by search engines, so you give web crawlers hints about the pages they should crawl. http://www.sitemaps.org/ defines an XML format for such sitemaps, which is supported by many search engines, including Google, Yahoo! and Microsoft.
"Sitemap" in terms of http://www.sitemaps.org/ is not to be mistaken with a human readable sitemap which visualizes the structure of a website (see Section 6.3.18, “Content Type Sitemap”). It is rather a complete index of all pages of a site. A simple sitemap file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://helios.coremedia.com/perfectchef/spicy-duck-694</loc>
  </url>
  <url>
    <loc>http://helios.coremedia.com/perfectchef/share-your-recipes-696</loc>
  </url>
  ...
</urlset>
Example 6.4. A sitemap file
The size of a sitemap is limited to 50,000 URLs. Larger sites must be split into several sitemap files and a sitemap index file which aggregates the sitemap files. A sitemap index file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://helios.coremedia.com/sitemap1.xml.gz</loc>
    <lastmod>2014-03-31T15:33:26+02:00</lastmod>
  </sitemap>
  ...
</sitemapindex>
Example 6.5. A sitemap index file
Solution
A sitemap consists of multiple entities (the index and the sitemap files) and depends on almost the whole repository. If a new content item is created that happens to appear in the first sitemap file, the entries of all subsequent sitemap files are shifted. In edge cases even the number of sitemap files changes, which affects the sitemap index file. So you cannot generate single sitemap entities on crawler demand, asynchronously and independently of each other. Instead, you must generate a complete sitemap that represents a snapshot of the repository. Moreover, the exhaustive dependencies make sitemaps practically uncacheable, and the generation is expensive. For these reasons Blueprint does not render sitemaps on demand but pregenerates them periodically. So you must distinguish between sitemap generation and sitemap service. Both are handled by the live web application, though.
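The shifting effect described above can be illustrated with a small sketch. It is not Blueprint's generator: the class name SitemapPartitionSketch, the partition method and the tiny chunk size used in main are assumptions chosen for readability. Only the limit of 50,000 URLs per file comes from the text.

```java
import java.util.ArrayList;
import java.util.List;

final class SitemapPartitionSketch {

    /** Upper bound of URLs per sitemap file, as stated in the Requirements section. */
    static final int MAX_URLS_PER_FILE = 50_000;

    /** Splits an ordered URL list into consecutive chunks of at most maxPerFile entries. */
    static List<List<String>> partition(List<String> urls, int maxPerFile) {
        if (maxPerFile <= 0) {
            throw new IllegalArgumentException("maxPerFile must be positive");
        }
        List<List<String>> files = new ArrayList<>();
        for (int from = 0; from < urls.size(); from += maxPerFile) {
            int to = Math.min(from + maxPerFile, urls.size());
            files.add(new ArrayList<>(urls.subList(from, to)));
        }
        return files;
    }

    public static void main(String[] args) {
        // A chunk size of 2 instead of MAX_URLS_PER_FILE keeps the output readable.
        List<String> before = List.of("/a", "/b", "/c", "/d");
        List<String> after = new ArrayList<>(before);
        after.add(0, "/new"); // hypothetical new content that lands in the first file

        System.out.println(partition(before, 2)); // [[/a, /b], [/c, /d]]
        System.out.println(partition(after, 2));  // [[/new, /a], [/b, /c], [/d]]
    }
}
```

One added URL changes the content of every file and, in this case, also the number of files. This is why the files and the index have to be produced together from one snapshot.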
Sitemap Generation
CoreMedia Blueprint provides separate sitemaps for each site. Sitemap generation depends on site-specific configuration, such as the document types to include or the paths to exclude. This configuration is specified by SitemapSetup Spring beans. The lc and the corporate extensions each provide a SitemapSetup bean suitable for their particular sites. Projects can declare their own sitemap setups. The setups are collected in the sitemapConfigurations Spring map.
<bean id="livecontextSitemapConfiguration" class="c.c.b.c.s.SitemapSetup">
  <property name="protocol" value="http"/>
  ...
</bean>

<customize:append id="appendLSC" bean="sitemapConfigurations">
  <map>
    <entry key="livecontext" value-ref="livecontextSitemapConfiguration"/>
  </map>
</customize:append>
To generate a sitemap for a site, specify the setting sitemapOrgConfiguration at the site's root channel. It is a String setting, and its value must be a key of the sitemapConfigurations map.
By default, the PerfectChef sites and the Corporate sites are sitemap-enabled, while the Aurora sites are not. Since the Aurora sites serve only as a backend for WCS applications, they do not need sitemaps.
Sitemaps are generated periodically in the Delivery CAE by a SitemapGenerationJob. You can specify the initial start time and the period with the application properties blueprint.sitemap.starttime and blueprint.sitemap.period, respectively. For details about the values, see the JavaDoc of the setters in SitemapGenerationJob. The Blueprint is preconfigured to run the sitemap generation nightly at 01:30. You can also trigger sitemap generation for a particular site manually with the URL
http://live-cae:49080/blueprint/servlet/internal/corporate-de-de/sitemap-org
where corporate-de-de stands for the segment of the site's root channel. Note that this is an internal URL which can only be invoked directly on the CAE's servlet container. Sitemap generation is an expensive administrative task that must not be exposed to end users. CoreMedia's default Apache rewrite rules block internal URLs; see the rewrite.inc files.
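The idea of a start time plus a period can be sketched as follows. This is not the code of SitemapGenerationJob, and the value formats of blueprint.sitemap.starttime and blueprint.sitemap.period are not reproduced here. The class name NightlyScheduleSketch, the method initialDelay and the use of java.time types are assumptions made for illustration only.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

final class NightlyScheduleSketch {

    /** Returns the delay from 'now' until the next occurrence of 'startTime'. */
    static Duration initialDelay(LocalDateTime now, LocalTime startTime) {
        LocalDateTime next = now.toLocalDate().atTime(startTime);
        if (!next.isAfter(now)) {
            // The start time has already passed today (or is exactly now): use tomorrow.
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    public static void main(String[] args) {
        // 01:30 mirrors the preconfigured nightly run; the 24 hour period is an assumption.
        Duration delay = initialDelay(LocalDateTime.now(), LocalTime.of(1, 30));
        Duration period = Duration.ofHours(24);
        System.out.println("first run in " + delay + ", then every " + period);
    }
}
```

The sketch works on local date-times and therefore ignores daylight saving shifts. A scheduler such as java.util.concurrent.ScheduledExecutorService could take the computed delay and period as arguments to scheduleAtFixedRate.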
The sitemaps are written to the file system, into a directory specified by the blueprint.sitemap.target.root application property. The CAE therefore needs write permission for this directory.
Sitemap Service
The generated sitemaps are available under the URL pattern
/service/sitemap/the-site-ID/sitemap_index.xml
To inform search crawlers, the sitemap URLs are included in the robots.txt files. Since there is only one robots file per web presence, it contains multiple sitemap entries if the web presence has several localized sites:
User-agent: *
Disallow: /
Sitemap: http://corporate.acme.com/service/sitemap/ab..ee/sitemap_index.xml
Sitemap: http://corporate.acme.com/service/sitemap/1c..7a/sitemap_index.xml
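As a minimal sketch, the Sitemap entries of such a robots file can be derived from the URL pattern above. The class RobotsSitemapLinesSketch, its method names and the site IDs used in main are hypothetical placeholders. How Blueprint actually renders robots.txt is not shown here.

```java
import java.util.List;
import java.util.stream.Collectors;

final class RobotsSitemapLinesSketch {

    /** Builds the sitemap index URL of one site from the documented URL pattern. */
    static String sitemapIndexUrl(String baseUrl, String siteId) {
        return baseUrl + "/service/sitemap/" + siteId + "/sitemap_index.xml";
    }

    /** Builds one "Sitemap:" line per site ID, joined by newlines. */
    static String sitemapLines(String baseUrl, List<String> siteIds) {
        return siteIds.stream()
                .map(id -> "Sitemap: " + sitemapIndexUrl(baseUrl, id))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // "site-id-1" and "site-id-2" are placeholders, not real site IDs.
        System.out.println(sitemapLines("http://corporate.acme.com",
                List.of("site-id-1", "site-id-2")));
    }
}
```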