4.8.3.4. Example Configuration of a Watchdog

This section walks through a simple watchdog.xml configuration. The file begins with the XML declaration and the document type declaration:

<?xml version="1.0" encoding="ISO-8859-1" ?>

<!DOCTYPE Watchdog SYSTEM "./lib/xml/coremedia-watchdog.dtd">

  • The DTD for the watchdog.xml file, coremedia-watchdog.dtd, is located on the local host in the lib/xml directory below the installation directory.

<Component name="WatchServer" startAction="WS-CorbaQuery">

  • A component called WatchServer is configured which immediately starts the action "WS-CorbaQuery".

<ServerQuery name="WS-CorbaQuery" url="http://localhost:44441/coremedia/ior" user="watchdog" password="watchdog"/>

  • The start action WS-CorbaQuery is defined as a ServerQuery action that requests the CoreMedia server IOR URL at http://localhost:44441/coremedia/ior and logs in with user name "watchdog" and password "watchdog".

<ServerQuery name="WS-CorbaReQuery" url="http://localhost:44441/coremedia/ior" user="watchdog" password="watchdog"/>

  • The action WS-CorbaReQuery is of the same type and has the same attributes. The reason for this duplicate definition becomes clear in the description of the edges below.

<DB name="WS-CheckDB" propertyFile="corem/sql.properties"/>

  • Another action, WS-CheckDB, checks the database. The file corem/sql.properties contains the connection parameters for the database.

<Script name="WS-RestartServer" command="service cm7-cms-tomcat restart" timeout="15" interval="600" events="3"/>

  • This script action restarts the CoreMedia Content Server Tomcat on a Linux system. The actual command depends on your installation. If the server has not restarted successfully after 15 seconds (timeout attribute), the result code is 11 (Timeout). If the server is restarted four times within 600 seconds (interval attribute), that is, more often than the three restarts set in the events attribute, result code 13 (RespawningTooFast) is returned.

<Script name="WS-Abort" command="echo watchdog; watch server: abort" timeout="10"/>

  • This action prints an abort message (command attribute). The timeout for this command is 10 seconds (timeout attribute). Alternatively, you can send an email to the watchdog administrator to report that the watchdog has terminated.
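
The email alternative could be sketched as follows. This is only an illustration: the script path /usr/local/bin/notify-watchdog-admin.sh is hypothetical, and the script itself (which would send the mail) is not part of the product and would have to be provided by you. Only the Script element and its name, command and timeout attributes are taken from the example above; the timeout value of 30 seconds is an arbitrary choice.

```xml
<!-- Hypothetical variant of the abort action: instead of printing a message,
     call a site-specific script that mails the watchdog administrator.
     The script path is an assumption and must exist on your system. -->
<Script name="WS-Abort" command="/usr/local/bin/notify-watchdog-admin.sh" timeout="30"/>
```

Because the action keeps the name WS-Abort, every edge that leads to WS-Abort continues to work without changes.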

  • Now connect the previously defined actions with <Edge> elements.

<Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="ok" delay="60"/>

  • In the error-free case, when the result code is ok, the server is queried every 60 seconds.

<Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>

  • The same happens if the result code is "no_licenses", that is, if no free license was available to log on to the server.

<Edge from="WS-CorbaQuery" to="WS-Abort" code="invalid_login" delay="0"/>

  • If the result code is "invalid_login" because authentication has failed, the abort action is executed. The administrator must correct the login configuration and then restart the watchdog application.

<Edge from="WS-CorbaQuery" to="WS-CheckDB" code="error" delay="0"/>

  • If WS-CorbaQuery returns an error, the action WS-CheckDB is invoked immediately without delay. The latter action checks whether there is a database error. As the result code "error" is the default code, this action is also invoked for all the result codes for which no <Edge> element is configured.

<Edge from="WS-CheckDB" to="WS-CorbaReQuery" code="ok" delay="0"/>

  • If the database check results in no error, the action WS-CorbaReQuery is called to check the server again. In this way, a possibly unnecessary restart of the server can be avoided. Remember that WS-CheckDB was called as a reaction to an error from WS-CorbaQuery. If the reason for this error was a database problem, the server will continue to operate without restart as soon as the database is online again. The server is restarted only if the database is OK and a following check on the server fails again.

<Edge from="WS-CheckDB" to="WS-CheckDB" code="error" delay="60"/>

  • The database is checked every 60 seconds as long as the database returns an error result. As the result code "error" is the default code, this action is also invoked for all the result codes for which no <Edge> element is configured.

<Edge from="WS-CheckDB" to="WS-Abort" code="no_jdbc_driver" delay="0"/>

  • If the database check fails due to a missing JDBC driver, the abort action is invoked without delay. The administrator must correct the driver configuration and restart the watchdog application later.

<Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="ok" delay="60"/>

  • If the server check results in no error, the error-free state is reached again with the action WS-CorbaQuery being called with 60 seconds delay.

<Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>

  • The error-free state is also reached with 60 seconds delay when there are no free licenses to log on to the server.

<Edge from="WS-CorbaReQuery" to="WS-Abort" code="invalid_login" delay="0"/>

  • If the result code is "invalid_login" because authentication has failed, the abort action is invoked without delay. The administrator must correct the login configuration and then restart the watchdog application.

<Edge from="WS-CorbaReQuery" to="WS-RestartServer" code="error" delay="0"/>

  • If the second server check yields an error again, the action "WS-RestartServer" is invoked to restart the server. At this point the database works correctly, so the cause appears to be an internal server error that a server restart may resolve. As the result code "error" is the default code, this action is also invoked for all the result codes for which no <Edge> element is configured.

<Edge from="WS-RestartServer" to="WS-CorbaQuery" code="ok" delay="60"/>

  • If the server was restarted successfully, the error-free state is reached again with the action WS-CorbaQuery being called with 60 seconds delay.

<Edge from="WS-RestartServer" to="WS-Abort" code="error" delay="0"/>

  • If the server restart has failed, the abort action is invoked without delay. The administrator must find out why the server fails to start and then restart the watchdog application. As the result code "error" is the default code, this action is also invoked for all the result codes for which no <Edge> element is configured.

<Edge from="WS-RestartServer" to="WS-Abort" code="respawning_too_fast" delay="0"/>

  • If the server is restarted more than three times in 600 seconds, the abort action is invoked without delay. The administrator must find out why the server fails to start and then restart the watchdog application.
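
Because unconfigured result codes fall back to the "error" edge, a timeout of WS-RestartServer (result code 11) leads to WS-Abort in this configuration. If a slow server start should not stop the watchdog, an explicit edge could route that case back to the server check. The sketch below assumes that the timeout result is addressed as code="timeout", by analogy with "respawning_too_fast" for RespawningTooFast. This code name is not confirmed by the example, so verify it before use. The delay of 120 seconds is an arbitrary choice.

```xml
<!-- Assumption: "timeout" is the code name for result code 11 (Timeout). -->
<!-- Check the server again 120 seconds after a restart that timed out,
     instead of falling through to the default "error" edge. -->
<Edge from="WS-RestartServer" to="WS-CorbaQuery" code="timeout" delay="120"/>
```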

</Component>
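
For reference, the fragments discussed above can be assembled into one file roughly as follows. This is a sketch rather than a verified configuration: the enclosing <Watchdog> root element is inferred from the DOCTYPE declaration, and the order of actions and edges simply follows the walk-through. Check the result against coremedia-watchdog.dtd.

```xml
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE Watchdog SYSTEM "./lib/xml/coremedia-watchdog.dtd">
<!-- Assumption: <Watchdog> is the root element, as named in the DOCTYPE. -->
<Watchdog>
  <Component name="WatchServer" startAction="WS-CorbaQuery">

    <!-- Actions -->
    <ServerQuery name="WS-CorbaQuery" url="http://localhost:44441/coremedia/ior" user="watchdog" password="watchdog"/>
    <ServerQuery name="WS-CorbaReQuery" url="http://localhost:44441/coremedia/ior" user="watchdog" password="watchdog"/>
    <DB name="WS-CheckDB" propertyFile="corem/sql.properties"/>
    <Script name="WS-RestartServer" command="service cm7-cms-tomcat restart" timeout="15" interval="600" events="3"/>
    <Script name="WS-Abort" command="echo watchdog; watch server: abort" timeout="10"/>

    <!-- Regular server check, repeated every 60 seconds -->
    <Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="ok" delay="60"/>
    <Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>
    <Edge from="WS-CorbaQuery" to="WS-Abort" code="invalid_login" delay="0"/>
    <Edge from="WS-CorbaQuery" to="WS-CheckDB" code="error" delay="0"/>

    <!-- Database check after a failed server check -->
    <Edge from="WS-CheckDB" to="WS-CorbaReQuery" code="ok" delay="0"/>
    <Edge from="WS-CheckDB" to="WS-CheckDB" code="error" delay="60"/>
    <Edge from="WS-CheckDB" to="WS-Abort" code="no_jdbc_driver" delay="0"/>

    <!-- Second server check once the database is OK -->
    <Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="ok" delay="60"/>
    <Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>
    <Edge from="WS-CorbaReQuery" to="WS-Abort" code="invalid_login" delay="0"/>
    <Edge from="WS-CorbaReQuery" to="WS-RestartServer" code="error" delay="0"/>

    <!-- Outcome of the server restart -->
    <Edge from="WS-RestartServer" to="WS-CorbaQuery" code="ok" delay="60"/>
    <Edge from="WS-RestartServer" to="WS-Abort" code="error" delay="0"/>
    <Edge from="WS-RestartServer" to="WS-Abort" code="respawning_too_fast" delay="0"/>

  </Component>
</Watchdog>
```

Read as a state machine, the configuration has one normal loop (WS-CorbaQuery every 60 seconds), one recovery path (WS-CheckDB, then WS-CorbaReQuery, then WS-RestartServer) and one terminal action (WS-Abort), which has no outgoing edges.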