NTP Debugging Techniques

Once the NTP software distribution has been compiled and installed and the configuration file constructed, the next step is to verify correct operation and fix any bugs that may result. Usually, the command line that starts the daemon is included in the system startup file, so it is executed only at system boot time; however, the daemon can be stopped and restarted from root at any time. Usually, no command-line arguments are required, unless special actions described in the xntpd.8 man page are required. Once started, the daemon will begin sending messages, as specified in the configuration file, and interpreting received messages.

The best way to verify correct operation is to use the ntpq utility program, which implements the management functions specified in Appendix A of the NTP specification RFC-1305. See the ntpq.8 man page for directions on its use and example displays. Another utility program useful in some cases is xntpdc. See the xntpdc.8 man page for directions on its use and example displays. Both programs can operate remotely; that is, either program can run on one machine and be used to inspect the state variables of a daemon running on another machine. In addition, the xntpdc program can be used to selectively enable and disable some functions of the daemon while the daemon is running.

In extreme cases with elusive bugs, the daemon can operate in two modes, depending on the presence of the -d command-line switch. If not present, the daemon detaches from the controlling terminal and proceeds autonomously. If one or more -d switches are present, the daemon does not detach and generates special output useful for debugging. In general, interpretation of this output requires reference to the sources.

When first started, the daemon normally polls the servers listed in the configuration file at 64-second intervals. In order to allow a sufficient number of samples for the NTP algorithms to reliably discriminate between correctly operating servers and possible intruders, at least four valid messages from at least one server are required before the daemon can set the local clock. However, if the current local time is more than 1000 seconds in error from the server time, the daemon will not set the local clock; instead, it will plant a message in the system log and shut down. It is necessary to set the local clock to within 1000 seconds first, either by a time-of-year hardware clock, by first using the ntpdate program, or manually by eyeball and wristwatch.

After starting the daemon, run the ntpq program using the -n switch, which will avoid possible distractions due to name resolution problems. Use the peer command to display a billboard showing the status of configured peers and possibly other clients poking the daemon. After operating for a few minutes, the display should be something like:

ntpq>pe
  remote           refid      st when poll reach   delay  offset    disp
========================================================================
+128.4.2.6    132.249.16.1     2  131  256  373     9.89   16.28   23.25
*128.4.1.20   .WWVB.           1  137  256  377   280.62   21.74   20.23
-128.8.2.88   128.8.10.1       2   49  128  376   294.14    5.94   17.47
+128.4.2.17   .WWVB.           1  173  256  377   279.95   20.56   16.40

The hosts shown in the remote column should agree with the entries in the configuration file, plus any peers not mentioned in the file, at the same stratum as yours or lower, that happen to be configured to peer with you. The refid entry shows the current source of synchronization for that peer, while the st entry reveals its stratum and the poll entry the polling interval, in seconds. The when entry shows the time since the peer was last heard, normally in seconds, while the reach entry shows the status of the reachability register (see RFC-1305), which is in octal format. The remaining entries show the latest delay, offset and dispersion computed for the peer, in milliseconds. The tattletale characters at the left margin display the synchronization status of each peer. The currently selected peer is marked "*", while additional peers designated acceptable for synchronization, but not currently selected, are marked "+". Peers marked "*" and "+" are included in a weighted average computation to set the local clock; the data produced by peers marked with other symbols are discarded. See the ntpq documentation for the meaning of these symbols.
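The octal reach values can be hard to read at a glance. The following is a minimal sketch, not part of the NTP distribution, that expands a reach value into the outcome of the last eight polls. It assumes the behavior described in RFC-1305, where the register shifts left on each poll and the rightmost bit records the most recent one; the function name decode_reach is hypothetical.

```python
def decode_reach(reach_octal: str) -> list[bool]:
    """Return the last eight poll results, most recent first.

    Assumes the RFC-1305 reachability register: an 8-bit shift register
    whose least significant bit is set when the latest poll was answered.
    """
    value = int(reach_octal, 8) & 0xFF
    return [bool((value >> bit) & 1) for bit in range(8)]


# Reach values taken from the peer billboard above.
for reach in ("373", "377", "376"):
    history = decode_reach(reach)
    bits = "".join("1" if ok else "0" for ok in history)
    print(reach, bits, f"{sum(history)} of 8 polls answered")
```

Under these assumptions, a value of 377 means all of the last eight polls were answered, while 376 means only the most recent poll went unanswered.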

Additional details for each peer separately can be determined by the following procedure. First, use the as command to display an index of association identifiers, such as

ntpq>as
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 11670  7414    no   yes   ok    synchr.   reachable  1
  2 11673  7614    no   yes   ok   sys.peer   reachable  1
  3 11833  7314    no   yes   ok    outlyer   reachable  1
  4 11868  7414    no   yes   ok    synchr.   reachable  1

Each line in this billboard is associated with the corresponding line in the pe billboard above. Next, use the rv command and the respective identifier to display a detailed synopsis of the selected peer, such as

ntpq>rv 11670
status=7414 reach, auth, sel_sync, 1 event, event_reach
srcadr=128.4.2.6, srcport=123, dstadr=128.4.2.7, dstport=123, keyid=1,
stratum=2, precision=-10, rootdelay=362.00, rootdispersion=21.99,
refid=132.249.16.1,
reftime=af00bb44.849b0000  Fri, Jan 15 1993  4:25:40.517,
delay=    9.89, offset=   16.28, dispersion=23.25, reach=373, valid=8,
hmode=2, pmode=1, hpoll=8, ppoll=10, leap=00, flash=0x0,
org=af00bb48.31a90000  Fri, Jan 15 1993  4:25:44.193,
rec=af00bb48.305e3000  Fri, Jan 15 1993  4:25:44.188,
xmt=af00bb1e.16689000  Fri, Jan 15 1993  4:25:02.087,
filtdelay=  16.40   9.89 140.08   9.63   9.72   9.22  10.79 122.99,
filtoffset= 13.24  16.28 -49.19  16.04  16.83  16.49  16.95 -39.43,
filterror=  16.27  20.17  27.98  31.89  35.80  39.70  43.61  47.52

A detailed explanation of the fields in this billboard is beyond the scope of this discussion; however, most variables defined in the specification RFC-1305 can be found. The most useful portion for debugging is the last three lines, which give the roundtrip delay, clock offset and dispersion for each of the last eight measurement rounds, all in milliseconds. Note that the dispersion, which is an estimate of the error, increases as the age of the sample increases. From these data, it is usually possible to determine the incidence of severe packet loss, network congestion, and unstable local clock oscillators. There are no hard and fast rules here, since every case is unique; however, if one or more of the rounds show zeros, or if the clock offset changes dramatically in the same direction for each round, cause for alarm exists.
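As an illustration of reading the filter registers, the sketch below flags rounds whose roundtrip delay stands far above the typical value, and rounds that show zeros. It uses the filtdelay and filtoffset values from the rv display above. The threshold of three times the median is an arbitrary assumption chosen for this example, not a rule from the specification, and the function names are hypothetical.

```python
from statistics import median

# Values copied from the rv display above, in milliseconds.
filtdelay = [16.40, 9.89, 140.08, 9.63, 9.72, 9.22, 10.79, 122.99]
filtoffset = [13.24, 16.28, -49.19, 16.04, 16.83, 16.49, 16.95, -39.43]


def delay_spikes(delays, factor=3.0):
    """Return indexes of rounds whose delay exceeds factor times the median.

    The factor is an assumption for illustration only.
    """
    typical = median(delays)
    return [i for i, d in enumerate(delays) if d > factor * typical]


def zero_rounds(delays):
    """Return indexes of rounds that show a zero delay."""
    return [i for i, d in enumerate(delays) if d == 0.0]


for i in delay_spikes(filtdelay):
    print(f"round {i}: delay {filtdelay[i]} ms, offset {filtoffset[i]} ms")
print("rounds showing zeros:", zero_rounds(filtdelay))
```

In this sample the two rounds with delays above 100 ms are also the two with large negative offsets, which suggests occasional network delay spikes rather than an unstable local oscillator.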

Finally, the state of the local clock can be determined using the rv command (without the argument), such as

ntpq>rv
status=0664 leap_none, sync_ntp, 6 events, event_peer/strat_chg
system="UNIX", leap=00, stratum=2, rootdelay=280.62,
rootdispersion=45.26, peer=11673, refid=128.4.1.20,
reftime=af00bb42.56111000  Fri, Jan 15 1993  4:25:38.336, poll=8,
clock=af00bbcd.8a5de000  Fri, Jan 15 1993  4:27:57.540, phase=21.147,
freq=13319.46, compliance=2

The most useful data in this billboard show when the clock was last adjusted (reftime), together with its status and most recent exception event. An explanation of these data is in the specification RFC-1305.
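The hexadecimal fields in this billboard can be decoded by hand. The sketch below converts the reftime value to a calendar date and splits the status word into its fields. It assumes the NTP timestamp format (32 bits of seconds since 1 January 1900 UTC, followed by 32 bits of fraction) and the system status word layout given in RFC-1305 Appendix B (2-bit leap indicator, 6-bit clock source, 4-bit event counter, 4-bit event code). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def ntp_to_datetime(stamp: str) -> datetime:
    """Convert a hex 'seconds.fraction' NTP timestamp to a UTC datetime."""
    sec_hex, frac_hex = stamp.split(".")
    seconds = int(sec_hex, 16)
    micros = round(int(frac_hex, 16) / 2**32 * 1_000_000)
    return NTP_EPOCH + timedelta(seconds=seconds, microseconds=micros)


def split_system_status(status_hex: str) -> dict:
    """Split the 16-bit system status word (layout assumed from RFC-1305)."""
    word = int(status_hex, 16)
    return {
        "leap": (word >> 14) & 0x3,
        "source": (word >> 8) & 0x3F,
        "event_count": (word >> 4) & 0xF,
        "event_code": word & 0xF,
    }


# Values copied from the rv display above.
print(ntp_to_datetime("af00bb42.56111000").isoformat(timespec="milliseconds"))
print(split_system_status("0664"))
```

For the values shown, the decoded time agrees with the date printed beside reftime, and the event count of 6 agrees with the "6 events" in the status line.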

Once the daemon has set the local clock it will continuously track the discrepancy between local time and NTP time and adjust the local clock accordingly. There are two components of this adjustment, time and frequency. These adjustments are automatically determined by the clock discipline algorithm, which functions as a hybrid phase/frequency feedback loop. The behavior of this algorithm is carefully controlled to minimize residual errors due to network jitter and frequency variations of the local clock hardware oscillator that normally occur in practice. However, when started for the first time, the algorithm may take some time to converge on the intrinsic frequency error of the particular oscillator unique to the daemon environment.

It has sometimes been the experience that the local clock oscillator frequency error is too large for the NTP discipline algorithm, which can correct frequency errors as large as 30 seconds per day. There are two possibilities that may result in this problem. First, the hardware time-of-year clock chip must be disabled when using NTP, since it can destabilize the discipline process. This is usually done using the tickadj program, but other means may be necessary. For instance, in the Sun Solaris kernel, this must be done using a command in the system startup file.

Normally, the daemon will adjust the local clock in small steps in such a way that system and user programs are unaware of its operation. The adjustment process operates continuously unless the apparent clock error exceeds 128 milliseconds, which for most Internet paths is quite a rare event. If the event is simply an outlier due to an occasional network delay spike, the correction is simply discarded; however, if the apparent time error persists for an interval of about 20 minutes, the local clock is stepped to the new value (as an option, the daemon can be compiled to slew at an accelerated rate to the new value, rather than be stepped). This behavior is designed to resist errors due to severely congested network paths, as well as errors due to confused radio clocks upon the epoch of a leap second.
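The decision described above can be summarized in a few lines. The following is a simplified sketch of the policy as stated in the text, not the daemon's actual code: the 128-millisecond threshold comes from the text, the 20-minute interval is taken as exactly 1200 seconds for illustration, and the function and constant names are hypothetical.

```python
STEP_THRESHOLD_MS = 128.0   # threshold stated in the text
PERSIST_SECONDS = 20 * 60   # "about 20 minutes", taken as exact here


def clock_action(offset_ms: float, seconds_over_threshold: float) -> str:
    """Classify how a measured offset would be handled.

    offset_ms: apparent clock error in milliseconds.
    seconds_over_threshold: how long the error has stayed above the
        threshold (hypothetical bookkeeping, for illustration only).
    """
    if abs(offset_ms) <= STEP_THRESHOLD_MS:
        return "adjust gradually"
    if seconds_over_threshold < PERSIST_SECONDS:
        return "discard as outlier"
    return "step clock"


print(clock_action(21.7, 0))        # small error
print(clock_action(-450.0, 64))     # large error, first seen recently
print(clock_action(-450.0, 1500))   # large error that has persisted
```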

Debugging checklist

If the ntpq or xntpdc programs do not show that messages are being received by the daemon, or show that received messages do not result in correct synchronization, verify the following:

Verify that the /etc/services file on the host machine is configured to accept UDP packets on the NTP port 123. NTP is specifically designed to use UDP and does not respond to TCP.
Check the system log for xntpd messages about configuration errors, name-lookup failures or initialization problems.
Using the xntpdc program and the iostats command, verify that the received packets and packets sent counters are incrementing. If the packets sent counter does not increment and the configuration file includes designated servers, something may be wrong in the network configuration of the xntpd host. If this counter does increment and packets are actually being sent to the network, but the received packets counter does not increment, something may be wrong in the network or the server may not be responding.
If both the packets sent counter and received packets counter do increment, but the xntpd host clock does not appear to be set, use either the ntpq or xntpdc program and the peer command to display the status of all peers (both client and server). Verify that valid clock offset, roundtrip delay and dispersion are displayed for at least one peer. The clock offset should be less than about 1000 seconds, the roundtrip delay should be less than about one second and the dispersion less than one second.
While the algorithm can tolerate a relatively large frequency error (about 350 parts per million or 30 seconds per day), various configuration errors (and in some cases kernel bugs) can exceed this tolerance, leading to erratic behavior. This can result in frequent loss of synchronization, together with wildly swinging offsets. Use the xntpdc program (or a temporary configuration file) and the disable pll command to prevent the xntpd daemon from setting the clock. Using the ntpq or xntpdc programs, watch the apparent offset as it varies over time to determine the intrinsic frequency error. If the error increases by more than 22 milliseconds per 64-second poll interval, the intrinsic frequency error must be reduced by some means. The easiest way to do this is with the tickadj program and the tick argument.
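The arithmetic behind the last item can be checked with a short calculation. The sketch below converts an observed change in offset over an elapsed time into a frequency error in parts per million, taking the poll interval as 64 seconds. The function name and the two sample readings at the end are hypothetical, chosen only to illustrate the calculation.

```python
def drift_ppm(offset_start_ms: float, offset_end_ms: float, elapsed_s: float) -> float:
    """Frequency error in parts per million implied by an offset change."""
    return (offset_end_ms - offset_start_ms) / 1000.0 / elapsed_s * 1e6


# The two tolerance figures quoted in the text, expressed in ppm.
LIMIT_PER_POLL_PPM = drift_ppm(0.0, 22.0, 64.0)   # 22 ms per 64 s
LIMIT_PER_DAY_PPM = 30.0 / 86400.0 * 1e6          # 30 s per day

print(f"22 ms per 64 s = {LIMIT_PER_POLL_PPM:.2f} ppm")
print(f"30 s per day   = {LIMIT_PER_DAY_PPM:.2f} ppm")

# Hypothetical readings: offset grows from 5 ms to 35 ms over one poll interval.
observed = drift_ppm(5.0, 35.0, 64.0)
print(f"observed drift = {observed:.2f} ppm, "
      f"exceeds tolerance: {abs(observed) > LIMIT_PER_POLL_PPM}")
```

Both figures from the text work out to a little under 350 parts per million, which shows that the per-poll and per-day limits describe the same tolerance.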

David L. Mills (mills@udel.edu)