Blueprint Developer Manual / Version 2104
Requirements
If you run a public website, you want it to be listed by search engines, so you give web crawlers hints about the pages they should crawl. http://www.sitemaps.org/ defines an XML format for such sitemaps, which is supported by many search engines, including those from Google and Microsoft.
"Sitemap" in terms of http://www.sitemaps.org/ is not to be confused with a human-readable sitemap which visualizes the structure of a website (see Section 5.4.18, “Content Type Sitemap”). It is rather a complete index of all pages of a site. A simple sitemap file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://helios.coremedia.com/corporate/spicy-duck-694</loc>
  </url>
  <url>
    <loc>http://helios.coremedia.com/corporate/share-your-recipes-696</loc>
  </url>
  ...
</urlset>
Example 5.4. A sitemap file
The size of a sitemap is limited to 50,000 URLs. Larger sites must be split into several sitemap files and a sitemap index file which aggregates the sitemap files. A sitemap index file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://helios.coremedia.com/sitemap1.xml.gz</loc>
    <lastmod>2014-03-31T15:33:26+02:00</lastmod>
  </sitemap>
  ...
</sitemapindex>
Example 5.5. A sitemap index file
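The split into sitemap files and an aggregating index can be sketched as follows. This is only an illustration of the sitemaps.org format in Python, not Blueprint's actual implementation; the function names are hypothetical, and the example omits optional elements such as lastmod.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # limit on the number of URLs in one sitemap file


def split_into_sitemaps(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split the page URLs into consecutive chunks of at most `limit` entries."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def build_urlset(urls):
    """Render one sitemap file (a <urlset> element) for the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_file_urls):
    """Render the index file (a <sitemapindex> element) that lists all sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_file_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = url
    return ET.tostring(index, encoding="unicode")
```

A caller would render one urlset per chunk returned by split_into_sitemaps and then pass the public URLs of the resulting files to build_sitemap_index.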
Solution
A sitemap consists of multiple entities (the index and the sitemap files) and depends on almost the whole repository. If a new content item is created that happens to fall into the first sitemap file, the entries of all subsequent sitemap files are shifted.
In edge cases even the number of sitemap files may change, which affects the sitemap index file. So you cannot generate single sitemap entities on crawler demand, asynchronously and independently of each other; instead, you must generate a complete sitemap which represents a snapshot of the repository. Moreover, the exhaustive dependencies make sitemaps practically uncacheable, and the generation is expensive. For these reasons Blueprint does not render sitemaps on demand but pregenerates them periodically. So you must distinguish between sitemap generation and sitemap service. Both are handled by the live web application, though.
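The shifting effect can be made concrete with a toy model. The following sketch uses an artificially small limit of three URLs per file and made-up paths; it only illustrates why a single new entry can invalidate every sitemap file and the index, and does not reflect how Blueprint orders its entries.

```python
def chunk(urls, limit):
    """Split URLs into consecutive sitemap files of at most `limit` entries."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def changed_files(before, after):
    """Return the indices of sitemap files that differ between two snapshots."""
    count = max(len(before), len(after))
    return [i for i in range(count)
            if i >= len(before) or i >= len(after) or before[i] != after[i]]


old_urls = [f"/page-{n}" for n in range(6)]
new_urls = ["/new-page"] + old_urls  # the new content lands in the first file

before = chunk(old_urls, limit=3)    # two files with three entries each
after = chunk(new_urls, limit=3)     # three files, all entries shifted by one

print(changed_files(before, after))  # every file is affected: [0, 1, 2]
print(len(before) != len(after))     # the index changes as well: True
```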
Sitemap Generation
CoreMedia Blueprint features separate sitemaps for each site.
Sitemap generation depends on some site-specific configuration, such as the document types to include or the paths to exclude. This configuration is specified by SitemapSetup Spring beans. The lc and the corporate extensions each provide a SitemapSetup bean suitable for their particular sites. Projects can declare their own sitemap setups. The setups are collected in the sitemapConfigurations Spring map.
<bean id="livecontextSitemapConfiguration" class="c.c.b.c.s.SitemapSetup">
  <property name="protocol" value="http"/>
  ...
</bean>

<customize:append id="appendLSC" bean="sitemapConfigurations">
  <map>
    <entry key="livecontext" value-ref="livecontextSitemapConfiguration"/>
  </map>
</customize:append>
If you want to generate a sitemap for a site, you have to specify the setting sitemapOrgConfiguration at the root channel. It is a String setting, and the value must be a key of the sitemapConfigurations map.
By default, the Corporate sites are sitemap-enabled. The Aurora sites are not sitemap-enabled: since they serve only as a backend for HCL Commerce applications, there is no need for sitemaps.
Sitemaps are generated periodically in the Delivery CAE by a SitemapGenerationJob. You can specify the initial start time and the period with the application properties cae.sitemap.starttime and cae.sitemap.period-minutes, respectively. For details about the values, see the Javadoc of the setters in SitemapGenerationJob. The Blueprint is preconfigured to run the sitemap generation nightly at 01:30. You can also trigger sitemap generation for a particular site manually with the URL
http://live-cae:49080/blueprint/servlet/internal/corporate-de-de/sitemap-org
where corporate-de-de stands for the segment of the site's root channel. Note that this is an internal URL which can only be invoked directly on the CAE's servlet container. Sitemap generation is an expensive administrative task that must not be exposed to end users. CoreMedia's default Apache rewrite rules block internal URLs; see the files in the deployment folder global/deployment/chef/blueprint/cookbooks/blueprint-proxy/templates/default/rewrite.
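For scripted maintenance, the manual trigger URL can be assembled from the root channel segment. The following is a minimal sketch in Python; the function name is hypothetical, and the default base address is simply the host, port and servlet path of the example URL above, which will differ in a real deployment.

```python
from urllib.parse import quote


def sitemap_trigger_url(root_segment,
                        base="http://live-cae:49080/blueprint/servlet"):
    """Build the internal URL that triggers sitemap generation for one site.

    `root_segment` is the segment of the site's root channel,
    for example "corporate-de-de". `base` is an assumption taken from
    the example URL in the text.
    """
    return f"{base}/internal/{quote(root_segment, safe='')}/sitemap-org"


print(sitemap_trigger_url("corporate-de-de"))
```

Because the URL is internal, a script like this has to run where it can reach the CAE's servlet container directly, not through the public web server.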
The sitemaps are written to the file system into a directory which is specified by the cae.sitemap.target-root application property. The CAE therefore needs write permission for this directory.
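A deployment script could verify the write permission up front. This is a minimal sketch with a hypothetical helper name; it must run as the same operating system user as the CAE to be meaningful, and the directory passed in is whatever cae.sitemap.target-root is set to.

```python
import os


def is_writable_directory(path):
    """Return True if `path` is an existing directory the current user can write to."""
    # W_OK allows creating files, X_OK allows entering the directory.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```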
Sitemap Service
The generated sitemaps are available at URLs that match the pattern
/service-sitemap-siteID-sitemap_index.xml
This pattern consists only of a single segment without a path, so there are no path restrictions for the URLs included in the sitemap.
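The URL pattern can be expressed as a pair of small helpers, for example to build or recognize sitemap index paths in a test. This is a sketch under assumptions: the helper names are hypothetical, and since the text does not specify the format of the site ID, the regular expression accepts any non-empty string.

```python
import re

# Single path segment: /service-sitemap-<siteID>-sitemap_index.xml
_INDEX_PATH = re.compile(r"^/service-sitemap-(?P<site_id>.+)-sitemap_index\.xml$")


def sitemap_index_path(site_id):
    """Build the sitemap index path for a site ID."""
    return f"/service-sitemap-{site_id}-sitemap_index.xml"


def site_id_from_path(path):
    """Extract the site ID from a sitemap index path, or return None if it does not match."""
    match = _INDEX_PATH.match(path)
    return match.group("site_id") if match else None
```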
In order to inform search crawlers, the sitemap URLs are included in the robots.txt files. Since there is only one robots file per web presence, you will see multiple sitemap entries for the localized sites:
User-agent: *
Disallow: /
Sitemap: http://corporate.acme.com/service-sitemap-ab...ee-sitemap_index.xml
Sitemap: http://corporate.acme.com/service-sitemap-1c...7a-sitemap_index.xml
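To check that all localized sites are announced, the Sitemap entries can be read back from a robots file. The following is a minimal sketch of such a check; the function name is hypothetical, and the site IDs in the sample text are placeholders, not real Blueprint IDs.

```python
def sitemap_urls(robots_txt):
    """Return the URLs of all Sitemap entries in the text of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """\
User-agent: *
Disallow: /
Sitemap: http://corporate.acme.com/service-sitemap-site-a-sitemap_index.xml
Sitemap: http://corporate.acme.com/service-sitemap-site-b-sitemap_index.xml
"""

for url in sitemap_urls(sample):
    print(url)
```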