Why should I use this guide
A successful QoS deployment depends on proper planning and on following the deployment steps in order. Doing so produces the most efficient configuration, which can mean the difference between good and great results with QoS.
This document goes step-by-step through the process of configuring and deploying QoS, from the pre-deployment planning stages through the daily production operation.
Overview
Quality of Service, or QoS, is a feature in IBM Domino Social Edition 9 designed to react to the general operation of a Domino server in order to keep that server up and functioning reliably at all times. If QoS detects that a server is not responding or hung, QoS can be configured to email an administrator about the problem and/or automatically terminate the server and restart it. QoS log information can also be useful to IBM Support when analyzing problems.
QoS operation
QoS consists of two parts: an add-on to the Java console and a probe task on the server. The Java console add-on contacts the probe task periodically to check on the health of the server. Since this operates as an independent task, it is unaffected by any issues the Domino server may be experiencing. If the probe task does not respond promptly, or if it reports a problem, the QoS monitor triggers the actions that the administrator has configured: running NSD, sending an email to the administrator, killing and restarting the Domino server, or a combination of these.
Prerequisites
Before anything can be done with QoS, there are some prerequisites that must be addressed. These may not all apply in your situation, but it is important to verify them to ensure that any changes needed can be accommodated.
Enabling QoS when starting Domino
-
Install Domino 9.
-
Start the Domino server.
-
Run 'nserver -jc' and then 'nserver -jc -q -y' to create the initial dcontroller.ini file in the server's data directory.
-
Add QOS_ENABLE=1 to the dcontroller.ini file.
-
Add QOS_ENABLE=1 to the Domino server notes.ini file
-
Run the server using "(n)server -jc". Note: Run the server using "(n)server -jc -c" if X is not installed on the machine.
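The steps above require QOS_ENABLE=1 in both ini files. A minimal Python sketch of how that could be confirmed is shown here; the function names are hypothetical, the file paths are supplied by the caller, and the exact-case key match is an assumption rather than documented Domino behavior.

```python
from pathlib import Path


def qos_enabled(ini_text: str) -> bool:
    """Return True if the first QOS_ENABLE line in the text is set to 1."""
    for raw_line in ini_text.splitlines():
        key, sep, value = raw_line.strip().partition("=")
        if sep and key.strip() == "QOS_ENABLE":
            return value.strip() == "1"
    return False


def check_files(paths: list[str]) -> dict[str, bool]:
    """Check each ini file path given by the caller; a missing file counts as not enabled."""
    results = {}
    for name in paths:
        path = Path(name)
        text = path.read_text(errors="replace") if path.exists() else ""
        results[name] = qos_enabled(text)
    return results
```

Calling check_files with the paths of dcontroller.ini and notes.ini should report True for both before the server is started with the controller.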
Preparing another Mail Server (optional)
Administrators can configure QoS to send mail that notifies the administrator when QoS has detected a problem. Since the mail cannot be sent through a server that has problems, an alternate mail server is required for this feature.
Planning
Configure QoS basic options
IBM Domino QoS provides several configuration options so that customers can tailor it to their needs. Setting the following parameters in the dcontroller.ini file is recommended (refer to "Other configuration options" for details):
• QOS_PROBE_TIMEOUT=30
• QOS_SHUTDOWN_TIMEOUT=15
• QOS_RESTART_TIMEOUT=15
• QOS_APPS_TIMEOUT=60
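As an illustration, the recommended values above could be merged into the text of an existing dcontroller.ini with a small helper. This is only a sketch: set_ini_values and RECOMMENDED are hypothetical names, and the helper assumes simple KEY=value lines with no sections or comments.

```python
def set_ini_values(ini_text: str, settings: dict[str, str]) -> str:
    """Return ini_text with each KEY=value in settings updated in place or appended."""
    remaining = dict(settings)
    out_lines = []
    for line in ini_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Replace the first occurrence of a known key with the new value.
            out_lines.append(f"{key}={remaining.pop(key)}")
        else:
            out_lines.append(line)
    # Keys that were not present are appended at the end.
    out_lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out_lines) + "\n"


# Values taken from the recommendation above (all in minutes).
RECOMMENDED = {
    "QOS_PROBE_TIMEOUT": "30",
    "QOS_SHUTDOWN_TIMEOUT": "15",
    "QOS_RESTART_TIMEOUT": "15",
    "QOS_APPS_TIMEOUT": "60",
}
```

The helper works on text rather than on the file itself, so the result can be reviewed before anything is written back.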
Identify critical server operations
QoS provides a mechanism to pause or resume the QoS service at a specific time. Pausing QoS prevents the server from being killed during an operation that is expected to take a long time or that is critical to server operation; examples are backups or other maintenance operations. Temporarily disabling QoS allows these operations to complete without being misinterpreted by QoS as a server problem.
To pause QoS probing, use the following command at the Domino server console: tell qos pause
While QoS probing is paused, QoS does not kill or restart the Domino server.
To resume QoS probing, use the following command at the Domino server console: tell qos resume
Using “Periodic Commands” is strongly recommended for periodic maintenance or backup operations. For example, if the administrator at company X backs up databases between 2 AM and 4 AM, the periodic commands can be created like this:
Figure 1 Periodic Commands
This means the command “tell qos pause” is executed after 60 minutes and the command “tell qos resume” is executed after 120 minutes; during the period in between (one hour), the QoS service is paused. These two commands are executed each day.
Configure SMTP recipients
When QoS detects server exceptions, it can send an SMTP email to one or more specified administrators with notification of the exception. Configuring mail notification is recommended so that administrators are notified when server exceptions occur.
The following settings are recommended for production deployment (refer to the "Notify SMTP recipient" section for details):
QOS_MAIL_TO= serverAdmin1@renovations.com,serverAdmin2@renovations.com
QOS_MAIL_SMTP_SERVER= 192.168.1.19
QOS_MAIL_ATTACH_LOGS=1
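Because the notification mail is sent without authentication, it can be worth checking in advance that the alternate SMTP server accepts such mail. The Python sketch below is one way to do that with the standard smtplib module; the function names, the sender address, and the subject and body text are illustrative assumptions, and port 25 is taken from the log excerpts in this guide.

```python
import smtplib
from email.message import EmailMessage


def parse_recipients(qos_mail_to: str) -> list[str]:
    """Split a comma-separated QOS_MAIL_TO value into individual addresses."""
    return [addr.strip() for addr in qos_mail_to.split(",") if addr.strip()]


def build_test_message(sender: str, recipients: list[str]) -> EmailMessage:
    """Build a plain-text test message; the subject and body are illustrative only."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "QoS SMTP relay test"
    msg.set_content("Test message sent without SMTP authentication.")
    return msg


def send_without_auth(host: str, msg: EmailMessage, port: int = 25) -> None:
    """Send msg with no login step; a server that requires authentication should refuse it."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)
```

If send_without_auth raises an exception against the server named in QOS_MAIL_SMTP_SERVER, QoS notification mail is unlikely to get through either.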
Figure 2 displays the content of the QoS mail notification sent when QoS detects a server crash. The mail subject summarizes the server exception; the body lists the server information, and the QoS log and controller log are attached.
Figure 2 Exception notification mail
Configure NO KILL option
When QoS detects server exceptions, it can trigger other actions (e.g. sending email to administrator) instead of killing and restarting the server directly. (You can also set QoS to send mail to an administrator whether or not you enable the NO KILL option.)
For initial deployment, administrators should use the NO KILL option until they are comfortable with how often QoS triggers. Running QoS with the NO KILL option for at least a week is strongly recommended: this gives time for regular maintenance operations to run, so the administrator can see whether QoS triggers accidentally on them. Once administrators are satisfied with the timeout settings and with the situations in which QoS triggers, they can turn NO KILL off. Turning NO KILL off from the start, and hoping that QoS does not kill the server too often, is risky.
The following configuration is recommended for initial deployment.
QOS_NOKILL=1
Configure QoS restarts limitation
As mentioned in the last section, after the trial period you may want to turn off the NO KILL option. In that case, we recommend configuring a restart limit (refer to the "Limit QoS restart times" section for details). With the following settings, QoS is disabled if the Domino server restarts three times within 720 minutes (12 hours).
QOS_RESTART_LIMIT_ENABLE=1
QOS_RESTART_LIMIT_MAXIMUM=3
QOS_RESTART_LIMIT_PERIOD=720
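To reason about how these three settings interact, the limit can be modeled in a few lines of Python. This is a sketch under an assumption: the guide does not say how QoS counts restarts internally, so the sliding window used here (and the function name limit_reached) is illustrative only.

```python
from datetime import datetime, timedelta


def limit_reached(restarts: list[datetime], maximum: int, period_minutes: int) -> bool:
    """Return True if `maximum` restarts fall within any window of `period_minutes`."""
    window = timedelta(minutes=period_minutes)
    ordered = sorted(restarts)
    # Compare each restart with the one (maximum - 1) positions later.
    for i in range(len(ordered) - maximum + 1):
        if ordered[i + maximum - 1] - ordered[i] <= window:
            return True
    return False
```

For example, three restarts at 01:00, 03:00, and 06:00 fall inside one 720-minute window and reach a limit of 3, while three restarts spread over 14 hours do not.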
Verify the QoS is running
Once the entire QoS configuration is complete, test it to confirm that it works.
How to verify SMTP notification
-
Type "tell qosprobe quit" in the IBM Domino console to quit the probe task process, or take another action that simulates a "server exception" scenario.
-
Wait for the probe timeout to elapse (the value of QOS_PROBE_TIMEOUT, or the 5-minute default if it is not set).
-
Check the log file in the Domino server's data directory for entries similar to these:
2013/02/19 16:40:44 QoS Controller: Deactivating probe...
2013/02/19 16:40:44 QoS Controller: QoS Probe deactivated.
2013/02/19 16:45:44 QoS Controller: Server has crashed.
2013/02/19 16:45:44 QoS Kill: Sending notification mail from QoSMailNotifier@SBM to serverAdmin1@renovations.com
2013/02/19 16:45:45 QoS Kill: Send notify mail to [serverAdmin1@renovations.com] ok
2013/02/19 16:45:45 QoS Kill: Killing Domino Server...
2013/02/19 16:45:49 QoS Kill: Setting kill complete.
2013/02/19 16:45:54 QoS Kill: Restarting DominoStarter thread
-
Check whether the administrator received the notification mail (see Figure 2).
How to verify NO KILL option
-
Type "tell qosprobe quit" in the IBM Domino console to quit the probe task process.
-
Wait for the probe timeout to elapse (the value of QOS_PROBE_TIMEOUT, or the 5-minute default if it is not set).
-
Check the log file in the Domino server's data directory for entries similar to these:
2013/02/19 17:25:35 QoS Controller: OpMsg=CRASH Type=QOS ObjectType=ServerName ObjectValue=CN=SBM/O=ABC TimeDate=20130219T172533,83+08
2013/02/19 17:25:35 QoS Controller: Server CN=SBM/O=ABC has crashed.
2013/02/19 17:25:35 QoS Controller: Deactivating probe...
2013/02/19 17:25:35 QoS Controller: QoS Probe deactivated.
2013/02/19 17:30:35 QoS Controller: Server has crashed.
2013/02/19 17:30:35 QoS Kill: Can't send notify mail with null value
2013/02/19 17:30:35 QoS Kill: The QOS_NOKILL is enabled.. QoS will not kill server.
-
Check whether Domino server is still available.
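The log checks in the two verification procedures above can also be done with a short script. The Python sketch below looks only for message texts that appear in the sample logs in this section; the function name and the result keys are hypothetical.

```python
def summarize_verification(log_text: str) -> dict[str, bool]:
    """Report which outcomes of the verification procedures appear in the log text."""
    lines = log_text.splitlines()
    return {
        # "Send notify mail to [...] ok" marks a successfully sent notification.
        "mail_sent": any(
            "Send notify mail to [" in line and line.rstrip().endswith("ok")
            for line in lines
        ),
        # Logged when QoS skips the kill because QOS_NOKILL is set.
        "nokill_active": any("The QOS_NOKILL is enabled" in line for line in lines),
        # Logged when QoS actually kills the server.
        "server_killed": any("Killing Domino Server..." in line for line in lines),
    }
```

For the SMTP test, mail_sent and server_killed are expected to be True; for the NO KILL test, nokill_active should be True and server_killed False.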
How to Disable QoS
To disable QoS, follow these steps:
-
Set QOS_ENABLE to 0 in, or remove QOS_ENABLE from, the dcontroller.ini file.
-
Set QOS_ENABLE to 0 in, or remove QOS_ENABLE from, the Domino server notes.ini file.
Attention: Make sure you complete both steps; otherwise, unexpected server restarts may occur.
All configurations
Limit QoS restart times
QoS provides an option to limit the number of QoS restarts within an interval. When the number of restarts reaches the limit, the QoS service is deactivated. The following parameters are set in the dcontroller.ini file:
• QOS_RESTART_LIMIT_ENABLE=
Determines whether the restart limit is enabled. This setting is used to avoid endless restarts: if the server keeps crashing for some reason and this option is not enabled, the server is restarted endlessly. The default is 0.
• QOS_RESTART_LIMIT_MAXIMUM=
Sets the maximum number of restarts allowed during the interval set by QOS_RESTART_LIMIT_PERIOD. The default is 3.
• QOS_RESTART_LIMIT_PERIOD=
The interval over which restarts are counted; QoS allows at most the maximum number of restarts during this period. QoS uses the larger of the default value and the configured value. The default is 30 minutes.
Notify SMTP recipient
QoS provides an option to send an email to the administrator when server exceptions occur. The following parameters are set in the dcontroller.ini file:
• QOS_MAIL_TO=
Administrator mail address. Multiple email addresses can be separated by commas.
• QOS_MAIL_SMTP_SERVER=
SMTP mail server IP and SMTP port with the format :
• QOS_NOKILL=
Whether to enable the NO KILL option. Set to 1 to enable the option and 0 to disable it.
• QOS_MAIL_ATTACH_LOGS=
Whether to attach logs to the mail sent to the administrator.
Important: The QOS_MAIL options do not support a user name/password combination. The specified SMTP server must accept mail without password authentication.
Other configuration options
• QOS_PROBE_INTERVAL=n
The probe interval in minutes. This can be set in the notes.ini. The default value is 1 minute.
Set any of the following parameters in the dcontroller.ini file.
• QOS_PROBE_TIMEOUT=
The probe timeout in minutes. In a production environment, set the probe timeout to a larger value than the default. The default is 5 minutes.
• QOS_DISABLE_PROBING=
Disable all QoS probing.
• QOS_SHUTDOWN_TIMEOUT=
The length of time a shutdown is allowed to take before QoS smart kills the server. If, for some reason, the Domino server cannot shut down correctly, QoS kills the server when the shutdown operation does not complete within this period. The default value is 5 minutes.
• QOS_RESTART_TIMEOUT=
The length of time a server restart (including RM restart) is allowed to take before QoS smart kills the server. This time starts after the server is completely down (clean). The default is 5 minutes.
• QOS_APPS_TIMEOUT=
The length of time a long-running application is allowed to continue without showing progress before QoS smart kills the server. In a production environment, set this to a larger value than the default: a long-running application can leave IBM Domino unresponsive for more than 10 minutes (the default value), in which case the server is restarted by mistake. A value of 30 or 60 is recommended. If you run very large, long-running applications, we recommend pausing the QoS service, as described in "Identify critical server operations". The default value is 10 minutes.
QoS controller log file
You will find a new log file in the Domino server's data directory. The QoS controller log file contains details corresponding to various events as captured or processed by the QoS controller, events relating to QoS probing, hygienic server restart, server crashes, QoS smart kills, and other miscellaneous events.
Note: You may also want to provide IBM Support with the log file if you are troubleshooting a server problem with them.
Log file naming convention
The QoS controller log file name contains a 24 hour timestamp in the format YYYYMMDDHHmm:
qoscntrlr201301282028.out
This timestamp indicates the time that the QoS controller was started. The above file name would be the QoS controller log for a service start of January 28th, 2013 at 8:28 PM. If the service is stopped and started again, the current qoscntrlrYYYYMMDDHHmm.out file is given the .log extension and a new qoscntrlrYYYYMMDDHHmm.out file is created with the current time. These qoscntrlrYYYYMMDDHHmm.log files are automatically deleted when the service is started if they are older than 14 days.
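The naming convention can be decoded mechanically. The following Python sketch parses the timestamp from a log file name and flags rotated .log files older than 14 days; the function names are hypothetical, and judging age by the timestamp in the name (rather than by the file's modification time) is an assumption, since the guide does not say which one QoS uses.

```python
import re
from datetime import datetime, timedelta

# qoscntrlr + YYYYMMDDHHmm + .out (current) or .log (rotated)
NAME_RE = re.compile(r"^qoscntrlr(\d{12})\.(out|log)$")


def start_time(file_name: str) -> datetime:
    """Return the controller start time encoded in a QoS controller log file name."""
    match = NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"not a QoS controller log name: {file_name}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M")


def is_expired(file_name: str, now: datetime, max_age_days: int = 14) -> bool:
    """True for a rotated .log file whose name timestamp is older than max_age_days."""
    age = now - start_time(file_name)
    return file_name.endswith(".log") and age > timedelta(days=max_age_days)
```

For the example name above, start_time returns 28 January 2013, 20:28.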
How to read the log file
At the very beginning of the log file, you will see general configuration information for this logged run of the QoS controller:
2013/01/28 20:28:04 QoS Controller: Starting QOSPipeWatcher
2013/01/28 20:28:04 QoS Controller: QOS_PROBE_TIMEOUT=5 minutes
2013/01/28 20:28:04 QoS Controller: QOS_SHUTDOWN_TIMEOUT=5 minutes
2013/01/28 20:28:04 QoS Controller: QOS_RESTART_TIMEOUT=5 minutes
2013/01/28 20:28:04 QoS Controller: QOS_APPS_TIMEOUT=10 minutes
2013/01/28 20:28:04 QoS Controller: nsd Program Path=E:\Domino\nsd
2013/01/28 20:28:04 QoS Controller: QOS_RESTART_LIMIT_MAXIMUM=3
2013/01/28 20:28:04 QoS Controller: QOS_RESTART_LIMIT_PERIOD=30 minutes
2013/01/28 20:28:04 QoS Controller: QOS_NOKILL=false
2013/01/28 20:28:04 QoS Controller: QOS_MAIL_TO=null
2013/01/28 20:28:04 QoS Controller: QOS_MAIL_SMTP_SERVER=null
These items, along with some other basic items, can be configured in the Domino controller ini file (dcontroller.ini), found in the server's data directory. The rest of the file from this point on contains a log entry for each message sent to the QoS controller by the server or one of its tasks. These messages have the format:
2013/01/28 20:28:16 QoS Controller: OpMsg=START Type=QOS ObjectType=ServerName ObjectValue=CN=Application Server/O=ABC ObjectType2=ProcessName ObjectValue2=nserver TimeDate=20130128T202811,77+08
2013/01/28 20:28:16 QoS Applications: Clearing long running apps list
2013/01/28 20:28:20 QoS Controller: OpMsg=START Type=SERVER TimeDate=20130128T202811,76+08
2013/01/28 20:28:20 QoS Controller: OpMsg=START Type=SERVER TimeDate=20130128T202817,20+08
2013/01/28 20:28:30 QoS Controller: OpMsg=READY Type=SERVER TimeDate=20130128T202828,24+08
All messages logged to the QoS controller log file have a timestamp. If the QoS controller logs the message, it has the format:
TimeDate=20120508T001506,95-04
If one of the QoS controller's other threads logs a message to the log file, it has the format:
2013/01/28 00:15:21 QoS Probe:
2013/01/28 00:15:21 QoS Applications:
2013/01/28 00:15:21 QoS Kill:
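For scripted analysis, the line formats shown above can be split into their parts. The Python sketch below is based only on the sample lines in this guide: the list of field names (OpMsg, Type, ObjectType, ObjectValue, ObjectType2, ObjectValue2, TimeDate) is taken from those samples and may not be complete, and parse_line is a hypothetical helper.

```python
import re
from datetime import datetime

# "<date> <time> QoS <thread>: <message>"
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (QoS [A-Za-z]+): ?(.*)$")
# Field names seen in the sample log lines; values may contain spaces and "=".
KEYS = r"(?:OpMsg|Type|ObjectType2?|ObjectValue2?|TimeDate)"
FIELD_RE = re.compile(rf"({KEYS})=(.*?)(?=,?\s+{KEYS}=|$)")


def parse_line(line: str):
    """Return (timestamp, thread, message, fields), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
    thread, message = match.group(2), match.group(3)
    # Only OpMsg messages carry KEY=value fields; other messages are free text.
    fields = dict(FIELD_RE.findall(message)) if message.startswith("OpMsg=") else {}
    return stamp, thread, message, fields
```

Matching on known field names, rather than splitting on spaces, keeps values such as "CN=Application Server/O=ABC" intact.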
What to look for in the log file
The following examples show the basic logging events you should see when looking at the QoS controller log file.
Normal server startup:
2013/01/28 00:15:09 QoS Controller: OpMsg=START Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev ObjectType2=ProcessName ObjectValue2=nserver TimeDate=20120508T001506,95-04
2013/01/28 00:15:09 QoS Controller: OpMsg=START Type=SERVER TimeDate=20120508T001507,40-04
2013/01/28 00:15:10 QoS Applications: Clearing long running apps list
2013/01/28 00:15:21 QoS Controller: OpMsg=READY Type=SERVER TimeDate=20120508T001517,92-04
2013/01/28 00:15:21 QoS Controller: Server is ready to process requests
Normal server shutdown:
2013/01/28 00:45:22 QoS Controller: OpMsg=END Type=SERVER ObjectType=Detail ObjectValue=Quit TimeDate=20120508T004516,01-04
2013/01/28 00:45:22 QoS Controller: Deactivating probe...
2013/01/28 00:45:22 QoS Controller: QoS Probe deactivated.
2013/01/28 00:45:26 QoS Controller: OpMsg=END Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev TimeDate=20120508T004523,51-04
2013/01/28 00:45:27 QoS Applications: Clearing long running apps list
QoS probing:
2013/01/28 00:15:21 QoS Controller: Activating probe...
2013/01/28 00:15:21 QoS Controller: QoS Probe activated.
2013/01/28 00:15:21 QoS Probe: Starting qosprobe...
2013/01/28 00:15:25 QoS Probe: OpMsg=START, Type=PROBE
2013/01/28 00:16:25 QoS Probe: The QoS Probe is probing.
2013/01/28 00:16:25 QoS Probe: SUCCESS (156ms)
2013/01/28 00:17:25 QoS Probe: SUCCESS (16ms)
2013/01/28 00:18:25 QoS Probe: SUCCESS (31ms)
2013/01/28 00:19:25 QoS Probe: SUCCESS (16ms)
2013/01/28 00:20:26 QoS Probe: SUCCESS (15ms)
Long-running applications:
2013/01/28 00:38:32 QoS Controller: OpMsg=START Type=FIXUP ObjectType=DB ObjectValue=C:\Program Files\IBM\Domino\Data\ddm.nsf TimeDate=20120508T003826,18-04
2013/01/28 00:38:32 QoS Controller: OpMsg=END Type=FIXUP ObjectType=DB ObjectValue=C:\Program Files\IBM\Domino\Data\ddm.nsf TimeDate=20120508T003829,79-04
2013/01/28 00:38:32 QoS Applications: Adding FIXUP[C:\Program Files\IBM\Domino\Data\ddm.nsf] to long running apps list
2013/01/28 00:38:32 QoS Applications: Removing FIXUP[C:\Program Files\IBM\Domino\Data\ddm.nsf] from long running apps list
...
2013/01/28 00:47:42 QoS Controller: OpMsg=START Type=COMPACT ObjectType=DB ObjectValue=events4.nsf TimeDate=20120508T004740,23-04
2013/01/28 00:47:42 QoS Controller: OpMsg=END Type=COMPACT ObjectType=DB ObjectValue=events4.nsf TimeDate=20120508T004740,23-04
2013/01/28 00:47:43 QoS Applications: Adding COMPACT[events4.nsf] to long running apps list
2013/01/28 00:47:43 QoS Applications: Removing COMPACT[events4.nsf] from long running apps list
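The probe entries in the examples above include a response time in milliseconds, which can be collected to spot a server that is slowing down before it times out. A minimal Python sketch follows; the function names and the idea of a latency threshold are illustrative assumptions, not QoS features.

```python
import re

# Matches entries such as "QoS Probe: SUCCESS (156ms)".
SUCCESS_RE = re.compile(r"QoS Probe: SUCCESS \((\d+)ms\)")


def probe_latencies(log_text: str) -> list[int]:
    """Return the response time in milliseconds of every successful probe, in log order."""
    return [int(ms) for ms in SUCCESS_RE.findall(log_text)]


def slow_probes(log_text: str, threshold_ms: int) -> list[int]:
    """Return the response times that exceed a caller-chosen threshold."""
    return [ms for ms in probe_latencies(log_text) if ms > threshold_ms]
```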
Evidence of a server crash in the log file
The QoS controller monitors and logs crash events to ensure the kill and restart are performed in a reasonable amount of time. To see evidence of this in the QoS controller log, search for the text "=CRASH" in the log file. Here is an example:
2013/01/28 01:00:44 QoS Controller: OpMsg=CRASH Type=QOS ObjectType=ServerName ObjectValue=CN=rc45/O=dev TimeDate=20120508T010039,48-04
2013/01/28 01:00:44 QoS Controller: Server CN=rc45/O=dev has crashed.
2013/01/28 01:00:44 QoS Controller: Deactivating probe...
2013/01/28 01:00:44 QoS Controller: QoS Probe deactivated.
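The search for "=CRASH" can be scripted as well. This Python sketch extracts the log timestamp and server name from each crash entry; the pattern is derived from the single example above, and find_crashes is a hypothetical name.

```python
import re

# Captures the log timestamp and the ObjectValue (server name) of each OpMsg=CRASH entry.
CRASH_RE = re.compile(
    r"^(\S+ \S+) QoS Controller: OpMsg=CRASH .*?ObjectValue=(.*?) TimeDate=",
    re.MULTILINE,
)


def find_crashes(log_text: str) -> list[tuple[str, str]]:
    """Return (timestamp, server name) for every crash entry in the log text."""
    return CRASH_RE.findall(log_text)
```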
Evidence of a smart kill in the log file
The QoS controller is coded to kill the server intelligently based on information it receives from the server or from QoS probing. Here is what a smart kill from a QoS Probe timeout might look like in the QoS controller log file:
2013/01/28 20:30:40 QoS Probe: SUCCESS (140ms)
2013/01/28 20:31:40 QoS Probe: SUCCESS (141ms)
2013/01/28 20:32:40 QoS Probe: SUCCESS (141ms)
2013/01/28 20:33:40 QoS Probe: SUCCESS (140ms)
2013/01/28 20:34:40 QoS Probe: SUCCESS (141ms)
2013/01/28 20:39:40 QoS Probe: The probe thread has not received a message from qosprobe within the timeout period.
2013/01/28 20:39:40 QoS Probe: The qosprobe addin has timed out, is not responding, or is not running.
2013/01/28 20:39:40 QoS Controller: Deactivating probe...
2013/01/28 20:39:40 QoS Controller: QoS Probe deactivated.
2013/01/28 20:39:40 QoS Controller: OpMsg=TIMEOUT Type=PROBE TimeDate=null
2013/01/28 20:39:40 QoS Controller: The controller has received a probe timeout.
2013/01/28 20:39:40 QoS Kill: Can't send notify mail with null value
2013/01/28 20:39:40 QoS Kill: Triggering failover...
2013/01/28 20:39:44 QoS Kill: Running nsd...
2013/01/28 20:40:39 QoS Kill: Killing Domino Server...
2013/01/28 20:40:43 QoS Kill: Setting kill complete.
2013/01/28 20:40:48 QoS Kill: Restarting DominoStarter thread
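In the excerpt above, the last successful probe is logged at 20:34:40 and the timeout at 20:39:40, a gap of five minutes. The Python sketch below measures that gap from the log lines, which can help confirm the probe timeout that is actually in effect; the function name is hypothetical, and the matched message texts are taken from this excerpt.

```python
from datetime import datetime, timedelta
from typing import Optional

STAMP_FORMAT = "%Y/%m/%d %H:%M:%S"


def silence_before_timeout(log_lines: list[str]) -> Optional[timedelta]:
    """Return the time between the last probe SUCCESS and the first probe timeout, or None."""
    last_success = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if "QoS Probe: SUCCESS" in line:
            last_success = stamp
        elif "has not received a message from qosprobe" in line and last_success is not None:
            return stamp - last_success
    return None
```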
Troubleshooting
What happens if QoS doesn't work?
QoS can fail to work for several reasons:
-
QoS is not enabled. Check whether QOS_ENABLE=1 is set in both notes.ini and dcontroller.ini.
-
The QoS probe task does not work correctly.
-
Type "show task" in the Domino console to see whether the qosprobe task is in the task list.
Figure 3 Task list
If it is, check the log file for a QoS probe error, for example:
2013/02/19 18:00:03 QoS Probe: Starting qosprobe...
2013/02/19 18:00:07 QoS Probe: OpMsg=START, Type=PROBE
2013/02/19 18:01:07 QoS Probe: The QoS Probe is probing.
2013/02/19 18:01:07 QoS Probe: ERROR: ProbeError=4803
-
If it is not, check the Domino directory to make sure (n)qosprobe is there.
-
The Domino server was not started under the Java server controller. Check the process list to see whether the controller process is running (for example, "scontroller.exe" on Windows).
Figure 4 Windows Task Manager
What happens if QoS keeps killing my server when I don't want it to?
The log file qoscntrlrYYYYMMDDHHmm.out contains summary information about why QoS keeps killing the server. For example, the following entries show that the server was killed because the server process terminated abnormally. You can then find more detailed information in the NSD file.
2013/02/18 20:55:31 QoS Probe: SUCCESS (16ms)
2013/02/18 20:56:18 QoS Controller: The server process has terminated abnormally.
2013/02/18 20:56:18 QoS Kill: Can't send notify mail with null value
2013/02/18 20:56:18 QoS Controller: Deactivating probe...
2013/02/18 20:56:18 QoS Controller: QoS Probe deactivated.
2013/02/18 20:56:18 QoS Kill: Triggering failover...
2013/02/18 20:56:22 QoS Kill: Running nsd...
2013/02/18 20:56:28 QoS Kill: Killing Domino Server...
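When a log covers many kills, it can help to tally the triggers. The Python sketch below labels each trigger line using message texts that appear in the log excerpts in this guide; the labels and the function name are hypothetical, and other trigger messages may exist that are not shown here.

```python
# Marker text (from the log excerpts in this guide) mapped to a short label.
REASONS = {
    "The server process has terminated abnormally": "abnormal termination",
    "The controller has received a probe timeout": "probe timeout",
    "has crashed": "server crash",
}


def kill_reasons(log_text: str) -> list[str]:
    """Return the label of each trigger line found, in log order."""
    found = []
    for line in log_text.splitlines():
        for marker, label in REASONS.items():
            if marker in line:
                found.append(label)
                break
    return found
```

Note that a single crash can produce more than one "has crashed" line, so the counts are per log line, not per incident.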
Why doesn't QOS_RESTART_TIMEOUT take effect?
The most likely reason is that Fault Recovery is also enabled on the Domino server. In this release, QoS and Fault Recovery should not be enabled at the same time. To disable Fault Recovery, make sure “Automatically Restart Server After Fault/Crash” is not checked, as shown in Figure 5.
Figure 5 Disable fault recovery
Why is the server restarted when running long applications?
If the Domino server does not respond for a specific time, QoS checks whether any long-running applications, such as DB backup or fixup, are in progress. If so, QoS waits for the long-running application to complete or time out. The timeout is controlled by QOS_APPS_TIMEOUT, and the default value is 10 minutes. Therefore, if the long-running application does not complete within 10 minutes and the server still does not respond, the server is restarted automatically. The solution is to set a larger value for QOS_APPS_TIMEOUT or to pause the QoS service while running these critical server operations.
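To see which applications are at risk, the "long running apps list" entries in the controller log can be replayed. The Python sketch below is a rough approximation: it treats the time since an application was added to the list as its running time, whereas QOS_APPS_TIMEOUT is described in terms of time without progress, and the function name and parameters are hypothetical.

```python
import re
from datetime import datetime, timedelta

APP_RE = re.compile(
    r"^(\S+ \S+) QoS Applications: (Adding|Removing) (.+?) (?:to|from) long running apps list$"
)


def overdue_apps(log_lines: list[str], timeout_minutes: int, now: datetime) -> list[str]:
    """Return apps that have been on the long-running list longer than timeout_minutes at `now`."""
    started = {}
    for line in log_lines:
        line = line.strip()
        if line.endswith("QoS Applications: Clearing long running apps list"):
            started.clear()
            continue
        match = APP_RE.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if match.group(2) == "Adding":
            started[match.group(3)] = stamp
        else:
            started.pop(match.group(3), None)
    limit = timedelta(minutes=timeout_minutes)
    return [app for app, since in started.items() if now - since > limit]
```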
Why doesn't the administrator receive the mail when the server crashes?
The log file qoscntrlrYYYYMMDDHHmm.out shows why the mail was not sent successfully. There are several possible causes:
-
The configuration is not set correctly (refer to the "Notify SMTP recipient" section for details).
-
The SMTP server is not running. The log file qoscntrlrYYYYMMDDHHmm.out provides more information, for example:
2013/02/18 21:10:07 QoS Kill: Sending notification mail from QoSMailNotifier@SBM to serverAdmin1@renovations.com
2013/02/18 21:10:09 QoS Kill: Send notify mail with errors. Send to smtp server[192.168.1.19] failed. Unknown SMTP host: 192.168.1.19
2013/02/18 21:10:09 QoS Kill: Killing Domino Server...
-
The specified SMTP server does not accept mail without password authentication. In IBM Domino Social Edition 9, the QOS_MAIL options do not support a user name/password combination. More information can be found in the log file, for example:
2013/02/18 21:36:37 QoS Kill: Sending notification mail from QoSMailNotifier@SBM to serverAdmin1@renovations.com
2013/02/18 21:36:50 QoS Kill: Send notify mail with errors. Send to smtp server[192.168.1.19] failed. Could not connect to SMTP host: 192.168.1.19, port: 25