Operations Basics / Version 2010
This section explains a simple watchdog.xml configuration file. The file begins with the XML and document type declarations:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE Watchdog SYSTEM "./lib/xml/coremedia-watchdog.dtd">
That is, the DTD for the watchdog.xml file is named coremedia-watchdog.dtd and is located on the local host in the lib/xml directory under the installation directory.
<Component name="WatchServer" startAction="WS-CorbaQuery">
This configures a component called WatchServer, which immediately starts the action "WS-CorbaQuery".
<ServerQuery name="WS-CorbaQuery" url="http://localhost:8080/ior" user="watchdog"
password="watchdog"/>
The start action WS-CorbaQuery is defined as a ServerQuery action that requests the CoreMedia server IOR URL at http://localhost:8080/ior and logs in with user name "watchdog" and password "watchdog".
<ServerQuery name="WS-CorbaReQuery" url="http://localhost:8080/ior" user="watchdog"
password="watchdog"/>
The action WS-CorbaReQuery is of the same type and has the same attributes. The reason for this double definition will become clear below in the description of the edges.
<DB name="WS-CheckDB" propertyFile="corem/sql.properties"/>
WS-CheckDB is another action; it checks the database. The file corem/sql.properties contains the connection parameters for the database.
<Script name="WS-RestartServer" command="service <servicename> restart " timeout="15" interval="600"
events="3"/>
This script action restarts the CoreMedia Content Server Tomcat on a Linux system. The actual command depends on your installation. If the server has not restarted successfully after 15 seconds (timeout attribute), the result code is 11 (Timeout). If the server is restarted four times within 600 seconds (interval attribute), that is, more than three times (events attribute), result code 13 (RespawningTooFast) is returned.
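To make the interplay of the three attributes more concrete, here is a hypothetical variant of the same action with more tolerant values. It is only a sketch: it assumes that events is the number of restarts tolerated within interval, as the example above suggests, and SERVICENAME stands in for the service name placeholder of your installation.

```xml
<!-- Hypothetical variant, not part of the example configuration.
     Assumed semantics: a restart must complete within 60 seconds (timeout),
     otherwise result code 11 (Timeout) is returned; more than 5 restarts (events)
     within 1800 seconds (interval) yield result code 13 (RespawningTooFast). -->
<Script name="WS-RestartServer" command="service SERVICENAME restart"
        timeout="60" interval="1800" events="5"/>
```

Larger values give a slowly starting server more room before the watchdog gives up. Smaller values pass a restart loop on to the administrator sooner.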
<Script name="WS-Abort" command="echo watchdog; watch server: abort" timeout="10"/>
This action prints an abort message (command attribute). The timeout for this message is 10 seconds (timeout attribute). Alternatively, you can send an email to the watchdog administrator to report the watchdog termination.
Now connect the previously defined actions with <Edge> elements.
<Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="ok" delay="60"/>
In the error-free case, when the result code is "ok", the server is queried every 60 seconds.
<Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>
The same happens if the result code is "no_licenses" because there was no free license to log on to the server.
<Edge from="WS-CorbaQuery" to="WS-Abort" code="invalid_login" delay="0"/>
If the result code is "invalid_login" because the authentication has failed, the abort action is executed. The administrator must correct the login configuration and restart the watchdog application later.
<Edge from="WS-CorbaQuery" to="WS-CheckDB" code="error" delay="0"/>
If WS-CorbaQuery returns an error, the action WS-CheckDB is invoked immediately, without delay. The latter action checks whether there is a database error. As the result code "error" is the default code, this action is also invoked for all result codes for which no <Edge> element is configured.
<Edge from="WS-CheckDB" to="WS-CorbaReQuery" code="ok" delay="0"/>
If the database check results in no error, the action WS-CorbaReQuery is called to check the server again. In this way, a possibly unnecessary restart of the server can be avoided. Remember that WS-CheckDB was called as a reaction to an error from WS-CorbaQuery. If the reason for this error was a database problem, the server will continue to operate without restart as soon as the database is online again. The server is restarted only if the database is OK and a following check on the server fails again.
<Edge from="WS-CheckDB" to="WS-CheckDB" code="error" delay="60"/>
The database is checked every 60 seconds as long as the database returns an error result. As the result code "error" is the default code, this action is also invoked for all the result codes for which no <Edge> element is configured.
<Edge from="WS-CheckDB" to="WS-Abort" code="no_jdbc_driver" delay="0"/>
If the database check fails due to a missing JDBC driver, the abort action is invoked without delay. The administrator must correct the driver configuration and restart the watchdog application later.
<Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="ok" delay="60"/>
If the server check results in no error, the error-free state is reached again, with the action WS-CorbaQuery being called after a delay of 60 seconds.
<Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>
The error-free state is also reached with 60 seconds delay when there are no free licenses to log on to the server.
<Edge from="WS-CorbaReQuery" to="WS-Abort" code="invalid_login" delay="0"/>
If the result code is "invalid_login" because the authentication has failed, the abort action is invoked without delay. The administrator must correct the login configuration and restart the watchdog application later.
<Edge from="WS-CorbaReQuery" to="WS-RestartServer" code="error" delay="0"/>
If the second server check yields an error again, the action "WS-RestartServer" is invoked to restart the server. At this point the database works correctly, so there seems to be an internal server error, which a server restart can hopefully resolve. As the result code "error" is the default code, this action is also invoked for all result codes for which no <Edge> element is configured.
<Edge from="WS-RestartServer" to="WS-CorbaQuery" code="ok" delay="60"/>
If the server was restarted successfully, the error-free state is reached again with the action WS-CorbaQuery being called with 60 seconds delay.
<Edge from="WS-RestartServer" to="WS-Abort" code="error" delay="0"/>
If the server restart has failed, the abort action is invoked without delay. The administrator must analyze why the server fails to start and then restart the watchdog application. As the result code "error" is the default code, this action is also invoked for all result codes for which no <Edge> element is configured.
<Edge from="WS-RestartServer" to="WS-Abort" code="respawning_too_fast" delay="0"/>
If the server is restarted more than three times in 600 seconds, the abort action is invoked without delay. The administrator must analyze why the server fails to start and restart the watchdog application later.
</Component>
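For reference, the fragments discussed in this section can be assembled into one file, roughly as sketched below. Two points are assumptions rather than facts taken from this section: the root element <Watchdog> (inferred from the DOCTYPE declaration) as the direct parent of <Component>, and the text SERVICENAME, which stands in for the service name placeholder of your installation. Check the DTD for the exact content model.

```xml
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE Watchdog SYSTEM "./lib/xml/coremedia-watchdog.dtd">
<!-- Assumption: Watchdog is the root element and directly contains Component. -->
<Watchdog>
  <Component name="WatchServer" startAction="WS-CorbaQuery">

    <!-- Actions -->
    <ServerQuery name="WS-CorbaQuery" url="http://localhost:8080/ior" user="watchdog" password="watchdog"/>
    <ServerQuery name="WS-CorbaReQuery" url="http://localhost:8080/ior" user="watchdog" password="watchdog"/>
    <DB name="WS-CheckDB" propertyFile="corem/sql.properties"/>
    <!-- SERVICENAME is a stand-in; use the service name of your installation. -->
    <Script name="WS-RestartServer" command="service SERVICENAME restart" timeout="15" interval="600" events="3"/>
    <Script name="WS-Abort" command="echo watchdog; watch server: abort" timeout="10"/>

    <!-- Regular server check, repeated every 60 seconds -->
    <Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="ok" delay="60"/>
    <Edge from="WS-CorbaQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>
    <Edge from="WS-CorbaQuery" to="WS-Abort" code="invalid_login" delay="0"/>
    <Edge from="WS-CorbaQuery" to="WS-CheckDB" code="error" delay="0"/>

    <!-- Database check after a failed server check -->
    <Edge from="WS-CheckDB" to="WS-CorbaReQuery" code="ok" delay="0"/>
    <Edge from="WS-CheckDB" to="WS-CheckDB" code="error" delay="60"/>
    <Edge from="WS-CheckDB" to="WS-Abort" code="no_jdbc_driver" delay="0"/>

    <!-- Second server check once the database is OK -->
    <Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="ok" delay="60"/>
    <Edge from="WS-CorbaReQuery" to="WS-CorbaQuery" code="no_licenses" delay="60"/>
    <Edge from="WS-CorbaReQuery" to="WS-Abort" code="invalid_login" delay="0"/>
    <Edge from="WS-CorbaReQuery" to="WS-RestartServer" code="error" delay="0"/>

    <!-- Outcome of the server restart -->
    <Edge from="WS-RestartServer" to="WS-CorbaQuery" code="ok" delay="60"/>
    <Edge from="WS-RestartServer" to="WS-Abort" code="error" delay="0"/>
    <Edge from="WS-RestartServer" to="WS-Abort" code="respawning_too_fast" delay="0"/>

  </Component>
</Watchdog>
```

Grouping the edges by their from action makes the flow easier to follow: each group lists every result code that the action handles explicitly, and the "error" edge catches all remaining codes.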