Time Travel 10 milliseconds

I was recently involved with a maddening bug (CR 6540867). The problem was that the gettimeofday system call would on occasion show a time 10 milliseconds in the past. This only affected the amd64 architecture and while running multiple CPUs.
This turned out to be an exercise in frustration for me, but also a major learning experience.

I had a test case that would run gettimeofday calls in a tight loop, and check if one ever returned a value prior to one already returned. I used truss to see what system call was being used. Surprise! The gettimeofday call doesn’t use any system call, the truss just sat there.

Okay, remembering the rule of “use the source, Luke”, I used the OpenGrok source browser at OpenSolaris.org to start to get a handle on just how gettimeofday worked. Drilling down in the the code in libc, I found that on amd64, it is an assembly code line that looks like this:

SYSFASTTRAP(GETHRESTIME)

So where is the corresponding implementation in the kernel for the trap GETHRESTIME? I plugged that into OpenGrok and got another surprise; putting it into the “definition” search got no results, the “symbol” search got the libc implementations, and putting it into “full” search got 110 hits, but none of them was the one I wanted. For full searchs opengrok is case insensitive, so I was getting all the hits for the routine “gethrestime”. So where is it?

Turns out that the kernel uses the symbol “T_GETHRESTIME” instead. I found that symbol in the intel architecture subdirectory of the kernel. To make a long story short, what actually gets called goes through a convoluted maze of function pointers and such, but the gist of it is that the trap calls through the trap table to get_hrestime which calls gethrestime which calls the function pointed to by gethrestimef which happens to be pc_gethrestime.

Once in pc_gethrestime, it reads the timestamp set at the last tick, adds the time since the last tick to that and returns it. There is a possible race here, since the timestamp of the last tick is set in the hres_clock routine which could run at the same time on another CPU. If you read the timestamp of the last tick, and then the number of cycles since the last tick, but hres_tick ran in between, you would see the time go backward one tick, i.e. 10 milliseconds. Hmm…sound familiar?

The thing is that there is a lock around the timestamp and the routine that gets the cycles since the last tick so that they are guaranteed to be consistent. The code looks like this:

void pc_gethrestime(timestruc_t *tp)
{
int lock_prev;
timestruc_t now;
int nslt;               /* nsec since last tick */
loop:
lock_prev = hres_lock;
now = hrestime;
nslt = (int)(gethrtime() - hres_last_tick);
lots of stuff that normalizes the timestamp (elided)
if ((hres_lock & ~1) != lock_prev)
goto loop;

</p

Now it took a lot of thinking to see how this lock works. The idea is that the lowest byte of
hres_lock is 0 when it is unlocked and -1 (8 one bits) when it is locked. When hres_lock grabs the lock,
it uses the atomic amd64 instruction xchgb, which exchanges the value in the byte with -1. If
the value returned is 0, then it has the lock. If the value was already -1, then it was already locked and hres_lock needs to go back and try again. When it wants to unlock the lock, it does an atomic increment to it. This changes the 8 one bits to all zero and adds one to the value in the upper three bytes. So when pc_gethrestime stores the value of the lock before it starts and then compares it to the value afterward except the last bit, if the lock was locked before it started, or is locked now or was locked at anytime in between, then the test fails and it
goes back and does it all again.

So, try as I might, I just could not come up with a scenario where we could get the inconsistent values. I was banging my head against the wall on this. Finally, I had to go away for a week, and I passed this to my colleague Sudheer Abdul-Salam to work on, and he finally came up with the answer that was eluding me.

Before I tell you the answer, I want to explain something. I have a background writing software, and really take the source code to heart. To me, the source code is the ultimate authority. On the other hand Sudheer is a kernel guy. To a kernel guy, the core dump is the ultimate authority. So Sudheer didn’t just look at the source, he looked at the core. And what he found was that for the lines:

lock_prev = hres_lock;
now = hrestime;

the compiler was actually producing

pc_gethrestime+0x10: movq  0x3eaee1(%rip),%rax	
pc_gethrestime+0x17: movq  0x3eaee2(%rip),%rdx	
pc_gethrestime+0x1e: movl  0x3eaefc(%rip),%ebx	

That is, the code was reading the hretime timestamp outside
of the block locked by hres_lock. When the tick was updated at just that
point, the time jumped backward. Well, if you can’t trust the compiler,
who can you trust?

I am told that the problem is that hres_lock
and hrestime should be declared “volatile” and then the compiler would
not have reordered the loads. Doh!

Hat’s off to Sudheer for figuring this one
out and preserving my sanity. Remember that you can go right to the source, but it is the
core that is core. I just hope I don’t get an issue that involves taking the cash (cache) to the (memory) bank. The only thing more core than core is what’s in the cache.

Advertisements

One response to this post.

  1. Posted by UX-admin on April 25, 2007 at 7:02 am

    This is great stuff. For what it’s worth, I’ve really enjoyed reading it (and well, assembler has always been a passion of mine…)

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: