I now know more about NTP and Dtrace than I wanted to.

I have spent the much of the last two weeks tracking down a bug in NTP (version 4 from Udel, not the one in Solaris). During which time I learned much more than I ever wanted to about the inner workings of both NTP and Dtrace.

The problem started when I was running the latest NTP version 4 build on a Solaris 10 system here. Every once in a while, ntpd would get into a state where instead of my typical 200ms round trip time for NTP requests, they would show up to 500 seconds. And all of the servers would have these large RTT’s.
Restarting ntpd always cleared the problem up, and the problem would last from 2 to 6 hours or so but would then clear up on its own. It often started within an hour of starting ntpd, but sometimes up to 20 hours would go by before the problem started.

So, this is the ideal problem for dtrace. I could wait until the problem started, and then trace like crazy until I figured it out, without having to restart ntpd to turn on debugging or anything. Or at least that was the theory.

The reality was a little different. I had a learning curve problem. I didn’t understand the inner workings of ntpd or the intricacies of dtrace well enough to craft the correct tests. The problem would start, I would try to trace something and find out I didn’t know exactly how to write the script. Or I didn’t understand ntpd well enough to trace exactly the right thing. Or I got bit by a bug in dtrace. Or something else. Each time I would figure the problem out, that would be very nearly exactly the time when the problem stopped.

I won’t give you the blow by blow details but I will tell you the ultimate cause and what I learned.
The problem turned out that NTP would read one packet every time it received a SIGPOLL signal. However, SIGPOLL is not guaranteed to be delivered one to one with the packets. So, if a SIGPOLL got missed, it would read each packet when the next one came in, thus making the receive time appear arbitrarily long after the real receive time. This is a regression in the ntpd code and will be fixed soon.

Somethings I learned about dtrace:

1. Be mindful of your data model. If you are running on a 64bit kernel, dtrace will also run in a 64 bit mode. Mostly this is good, but if you are using the pid provider, you need to cognizant of which of the data are 32 bit and which are 64. Remember that dtrace predefines all of the kernel symbols, so they may be the wrong size for data garnered from the pid provider. In particular, pointers and longs are different, so many structs will likely be the wrong size as well. Here’s a trick from Bryan Cantrill: “You can look at curpsinfo->pr_dmodel and compare that to PR_MODEL_ILP32 and PR_MODEL_LP64 to determine if it’s a 32-bit or 64-bit process, respectively.”

2. You can use the “-p” flag and the $target macro to specify the pids in the pid provider. This is well documented, but I still discovered that I didn’t know it. So, you would invoke your script like this: “./ntpd.d -p `pgrep ntpd`” and this would cause the a line that looks like “pid$target:::entry”
to replace $target with the pid number found by pgrep. Note to Bryan, Adam and Michael: How about letting $target take multiple pid numbers and implicitly expand the line so that “pid$target:::entry” invoked by “-p 123 456” would implicitly expand to “pid123:::entry,pid456:::entry”.

3. You can get the same effect as if-then statements by factoring out d-clauses with temp variables.
Set a variable to the results of the boolean condition, and then have separate clauses with the test of the variable in the predicate test. It is a new way of thinking about flow control that takes a little getting used to. Since dtrace always executes clauses in program order, for a specific predicate with conditional, you first have a predicate with a clause that sets a temporary variable to the value of your boolean conditional, then have two clauses with the same predicate and a test for the two values of thetemporary variable, one for each clause and then finally you have the same predicate again for the unconditional post-processing.

For instance, if you wanted to have this:(warning, invalid D ahead)

fbt::dummyfunction:entry
/pred == 1/
{
instrc1;
if (var1 == 2) {
instrc2;
} else {
instrc3;
}
instrc4;
}

Do it like this:

fbt::dummyfunction:entry
/pred == 1/
{
instrc1;
this->temp1 = (var1 == 2) + 1;
}
fbt::dummyfunction:entry
/pred == 1 && this->temp1 == 2/
{
instrc2;
}
fbt::dummyfunction:entry
/pred == 1 && this->temp1 == 1/
{
instrc3;
}
fbt::dummyfunction:entry
/pred == 1 && this->temp1/
{
instrc4;
this->temp1 = 0;
}

4. I still don’t get the power of the inline facility. I made a little function that prints out the
current wallclock time to 5 decimal places like so: “#define printnow printf(“%Y.%06d”,(this->_wts=walltimestamp),(this->_wts/1000)%1000000);” But I had to invoke the C preprocessor to use this. I wanted to switch to inline, but it complained that void isn’t equal to void. Huh?

5. The resolution of the walltimestamp is the same resolution as the tick. It might be either 10ms or1 ms depending on how your system is configured. The walltimestamp therefore may not change for a long time in your tracing and may be up to 10ms ahead or behind other methods of getting the system time, such as that returned by clock_gettime.

5. You probably know that the pid provider can instrument your program right down to the single assembly instruction level. But you have to specify the address of those instructions and how do you get what instruction is at what address? It turns out that this isn’t as easy as you might hope. You can find out what your program looks like in assembly, with line number annotations using the “-S” option to the compiler. But this listing does not have address offsets, and it uses pseudo-instructions, which are single line assembly source that get replaced by one or more real ssembly code that doesn’t look quite the same. You can use the “dis” command to disassemble your object files, but the results are not easily related back to your source lines, and the addresses are relative to the start of the object you are disassembling, not relative to the function as you need. You could look at the “.s” output from the compiler, but those pseudo-instructions make it difficult to match up the lines. The dis command has a “-L” option that is supposed to print out the source line numbers, but it doesn’t work.

The dbx command also has a disassemble command. The addvantage of this method is that dbx does print out the addresses relative to the function. This disadvantage is it is not easy to just run a CLI to dump out entire function (I am sure it can be done, but I am not enough of a dbx wise to do it) and it is also missing the source line numbers. What is infuriating is that dbx lets you put break points at source line numbers, so all the info to do what I want must be there, since dbx does it.

I also found a program that disassembles and prints line numbers, namely er_src, which is part of the Sun Compiler suite. Although the docs on this suggest that it should be able to print function relative address, I couldn’t get it to do that. So, it almost works, since it has the addresses and the line numbers. I just had to subtract the address of the beginning of the function. A little hex arrithmetic and I had my addresses. I filed a bug against er_src, so maybe it will get fixed one day.

That’s all I have time for today, since I am going to leave on vacation in 10 minutes. I have a few more items, but they will have to wait until I get back.

Technorati Tags:
,
,

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: