I was recently pulled into a problem with system clocks. Because I am the “go to guy” for NTP at Sun, I often get involved with system clock problems, since they are often only noticed when the customer starts running NTP.
Anyway, in this particular case, a customer was evaluating some systems and noticed that the computer clock seemed to be jumping around quite a bit. The account team set up a couple of test systems and was able to see the same problem. That’s when I was called in.
The first thing to do in this kind of situation is to get NTP out of the picture. NTP can only correct clock drifts up to a certain point, and beyond that you start getting some nasty interactions that can obscure the real issues. So, we turned xntpd off and started running “ntpdate -s -q” in a cron job every minute.
This revealed something interesting. The system was drifting to a two second offset in about half an hour and then jumping back to zero offset. This is a typical symptom when you are not running NTP. There is a battery-backed hardware TOD (time of day) clock built into almost all modern systems, which is used to set the time at boot. It is also used to double check the system clock. When the system clock and the TOD clock differ by 2 seconds or more, one is set to match the other. Which one gets reset depends on other variables, but in the setup I described, the system clock is almost always reset to match the TOD clock.
So, in this case what was happening was clear. The system clock drifted by 2 seconds each half hour and then was reset by the TOD clock. If you checked less often than once each half hour, it would appear to jump around randomly.
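Here is a rough sketch of that sawtooth in Python. It is only a toy model, not how the kernel does the check: the function name and parameters are made up for illustration, the drift of 1111 microseconds per second is just 2 seconds over 1800 rounded to a whole number, and the clock is assumed to snap straight back to zero once it is 2 seconds or more behind.

```python
def simulate_offsets(drift_us_per_s, threshold_us, duration_s, step_s=60):
    """Return (elapsed_s, offset_us) samples for a clock that loses
    drift_us_per_s microseconds every second and is snapped back to zero
    once it has fallen threshold_us or more behind (assumed TOD reset)."""
    samples = []
    offset_us = 0
    for elapsed_s in range(0, duration_s + 1, step_s):
        samples.append((elapsed_s, offset_us))
        # Accumulate the drift for one sampling interval (integer math).
        offset_us -= drift_us_per_s * step_s
        if -offset_us >= threshold_us:
            offset_us = 0  # assumption: system clock is reset to match the TOD clock
    return samples


# One hour of one-minute samples, like the cron job, printed every 5 minutes.
samples = simulate_offsets(1111, 2_000_000, 3600)
for elapsed_s, offset_us in samples[::5]:
    print(f"{elapsed_s // 60:3d} min  {offset_us / 1e6:+.3f} s")
```

Sampled every minute, the offset walks steadily down to just under 2 seconds and then drops back to zero. Sample it once an hour instead and you only ever see whatever point of the ramp you happen to land on.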
So why was the system losing time? We calculated this out, and 2 seconds every half hour works out to an error of a little over 0.1%, which is close to 0.125%. This rang a bell with one of the other engineers I was working with. It turns out that the system clock is modulated by 0.25%. I had never heard of this before. It is called “spread-spectrum clocking”.
Here’s the deal. Systems throw off a lot of EMI (electromagnetic interference) and there are regulations by the FCC as to exactly how much EMI you are allowed to have. One reason that computers throw off this EMI is that they are run using a digital clock signal that synchronizes all of the system components. All of the parts are in sync at exactly the same frequency, so they throw off a lot of EMI at exactly that frequency, with lesser amounts at the harmonics.
The FCC regulations limit the peak EMI allowed at any particular frequency. So, to lower the peak EMI, the idea is to slightly modulate the system frequency, so that the EMI thrown off is spread over a range of frequencies. This lowers the peak while keeping the total energy released constant. At first I thought it sounded like cheating, but I looked it up and the FCC is totally okay with this.
So, in this case, the system frequency is modulated downward by 0.25% three thousand times a second. I guess that because it was modulated so quickly (3 kHz) by such a small amount (0.25%) the designers figured that nothing in the software would be affected. The only thing is, when you are counting by nanoseconds, those errors can add up. The average frequency is of course 0.125% slower than the rated frequency, which works out to an error of 1250 parts per million. NTP can correct errors of up to 500 PPM, so NTP was way out of its league.
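The percent-to-PPM conversion is easy to slip on, so here is a small Python sketch of the arithmetic. The function names are made up for illustration, and the "average is half the depth" step assumes the modulation spends equal time above and below the midpoint between the rated frequency and the lowest frequency (a triangular profile, for example). The 500 PPM figure is the NTP limit quoted above.

```python
NTP_MAX_CORRECTION_PPM = 500  # limit quoted in the text


def percent_to_ppm(percent):
    """1% is 10,000 parts per million."""
    return percent * 10_000


def average_downspread_error_percent(depth_percent):
    """Average slowdown of a clock modulated between its rated frequency and
    (rated - depth), assuming a symmetric modulation profile."""
    return depth_percent / 2


avg_percent = average_downspread_error_percent(0.25)
avg_ppm = percent_to_ppm(avg_percent)
print(f"average error: {avg_percent}% = {avg_ppm:.0f} PPM")
print(f"within NTP's range: {avg_ppm <= NTP_MAX_CORRECTION_PPM}")
```

The full 0.25% modulation depth is 2500 PPM, and the average of half that is 1250 PPM, which is still two and a half times what NTP can pull back.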
We normally require a maximum error of 2 seconds per day on Sun systems, which is an accuracy of 0.0023%. Makes you appreciate how finely tuned these things are.
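To put the two numbers side by side, here is a quick Python sketch that converts between seconds per day and PPM. The helper names are made up for illustration, and the 1250 PPM input is the 0.125% average slowdown from the spread-spectrum clock expressed in parts per million.

```python
SECONDS_PER_DAY = 86_400


def seconds_per_day_to_ppm(seconds_per_day):
    """Express a daily clock error as parts per million."""
    return seconds_per_day / SECONDS_PER_DAY * 1_000_000


def ppm_to_seconds_per_day(ppm):
    """Express a frequency error in PPM as seconds gained or lost per day."""
    return ppm * SECONDS_PER_DAY / 1_000_000


spec_ppm = seconds_per_day_to_ppm(2)
print(f"2 s/day is {spec_ppm:.1f} PPM ({spec_ppm / 10_000:.4f}%)")
print(f"1250 PPM is {ppm_to_seconds_per_day(1250):.0f} s/day")
```

So the 2 seconds per day requirement is about 23 PPM, while a clock running 1250 PPM slow would lose 108 seconds a day if nothing reset it.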
So, as a result, the firmware is being changed to report the average frequency to the system, not the peak frequency. See? Even a small error can add up.