Tricky problem with NTP

I am what is known as the Lead Product Engineer (LPE) for NTP in Solaris. As such, I am often involved in the development of the NTP project.

Recently, Dr. Mills (inventor of NTP) was having a problem with the latest development version of NTP running on one particular Solaris system. After an indeterminate period of time, the NTP daemon would just exit, without writing anything to the log or debug output.

Dr. Mills was perplexed, so I offered to take a look using dtrace. Actually, I first used truss. The truss output showed that the daemon was receiving a SIGINT signal and exiting, as it is supposed to do when it gets that signal. I then used the dtrace proc provider (this was my first use of the proc provider, by the way) to determine the sender of the signal, like so:

#!/usr/sbin/dtrace -s

/* Fires in the context of the sender, so uid and pid identify who sent it. */
proc:::signal-send
/args[2] == SIGINT && args[1]->pr_fname == "ntpd"/
{
        printf("SIGINT signal sent to %s by uid=%d pid=%d\n",
            args[1]->pr_fname, uid, pid);
}

The results showed that the uid and pid were both 0. So, the signal was coming from root, but what does pid=0 mean? It means that no user process is sending the signal; it is coming from the kernel. So, I re-ran the test, this time adding a “stack();” line, so I could see the kernel stack at the time the signal was generated.

What I found was that the signal was being requested by the ldterm kernel module, which was in turn being called during the processing of data from the su serial driver.

So, looking at the ntp.conf file, I see that this system has two refclocks configured, one for a GPS and one for a PPS. Now, the GPS reads data on the serial line, but the PPS only looks for state transitions. So, maybe once in a while, the GPS sends something that the ldterm module interprets as the INT character. Looking at the code for ldterm, I see that SIGINT is sent when the ldterm module receives an M_BREAK message from the serial driver, but only if the IGNBRK option is not set.
So, we need to set IGNBRK for refclocks. But now I know that it isn’t the INT character, it is a break. You get a break on a serial line when the other device loses power, or the cable is disconnected. Maybe the GPS power cycles on occasion?

So, I look at the NTP code, and find that we already set IGNBRK when the refclocks are opened. So, how can the SIGINT be sent? I fire up mdb and look at the flags set for each of the serial lines in the currently running ntpd. Well, the GPS has IGNBRK set, but the PPS does not. I look at where in the serial driver the ldterm module is being called (actually, it isn’t called from there, the putnext routine is called, but it amounts to the same thing in streams) and it turns out that the M_BREAK is also sent when the data read by the driver has a parity error.
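For illustration, the same kind of flag check can also be done from inside a program rather than with mdb. The following is a minimal sketch, not taken from the NTP source: it assumes a POSIX termios interface, and the helper names termios_ignores_break and fd_ignores_break are made up for this example.

```c
#include <termios.h>

/* Hypothetical helper: nonzero if these terminal settings ignore breaks. */
int termios_ignores_break(const struct termios *t)
{
    return (t->c_iflag & IGNBRK) != 0;
}

/*
 * Hypothetical helper: returns 1 if the line open on fd has IGNBRK set,
 * 0 if it does not, and -1 if the settings cannot be read (for example,
 * fd is not a terminal device).
 */
int fd_ignores_break(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) != 0)
        return -1;
    return termios_ignores_break(&t);
}
```

Calling something like fd_ignores_break() on each refclock descriptor right after it is opened would presumably have shown the same difference between the GPS and PPS lines that mdb revealed.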

So there you have it. Even though the PPS does not read data off of the serial line, random electrical noise sometimes presents data on the line. Since it is random, there is a good chance that it will have a parity error, causing a break indication, resulting in a SIGINT signal. The daemon did not bother setting IGNBRK on the PPS serial line, because it was not going to read data anyway. The solution is to set IGNBRK on all serial lines, not just the ones opened to be read.
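As a rough sketch of what that fix looks like at the termios level (this is not the actual ntpd change, and the helper names termios_set_ignbrk and fd_set_ignbrk are hypothetical), a program could apply something like the following to every serial descriptor it opens, whether or not it intends to read data from it:

```c
#include <termios.h>

/* Hypothetical helper: mark a set of terminal settings so breaks are ignored. */
void termios_set_ignbrk(struct termios *t)
{
    t->c_iflag |= IGNBRK;
}

/*
 * Hypothetical helper: set IGNBRK on the serial line open on fd.
 * Returns 0 on success and -1 if the settings cannot be read or written.
 */
int fd_set_ignbrk(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) != 0)
        return -1;
    termios_set_ignbrk(&t);
    /* TCSANOW applies the change immediately. */
    return tcsetattr(fd, TCSANOW, &t);
}
```

Reading the current settings first and changing only the one flag leaves whatever else the refclock driver configured on the line untouched.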

Afterwards, Dr. Mills told me that he had received reports of the same problem on Linux systems, but it had never happened to him, so he was unable to debug it. So, Linux indirectly reaps the benefits of dtrace.

2 responses to this post.

  1. Wow — a nasty problem, and great analysis!

  2. Errant signals on serial lines can cause all sorts of odd, intermittent issues, and it’s true that they can look like a “break” sequence to the port.

    I was thinking that the serial “alt break” sequence could be useful here for prevention, but alas, it’s only for the serial console driver.

    Also, Dr. Mills should make sure that he is not running a login port monitor on whatever serial port his GPS is attached.