2012-07-01

Linux's leap-second deadlocks

Intro

The leap second is an extra second inserted irregularly at midnight UTC at the end of certain months, as determined by astronomers to keep UTC in step with the Earth's rotation. UTC clocks render it as 23:59:60.

Yesterday at that time, linux servers around the world became wedged or experienced huge CPU spikes due to bugs in the kernel's leap second code. This post was linked on hackernews today, and has a good summary of some of the bugs in the comments. Here I'll discuss the leap second deadlocks from a concurrency researcher's perspective.

Five Bugs

I did a bit of digging to see where in Linux's code things were actually going wrong. It turns out there have been five different bugs related to leap second management. It also turns out that we've seen linux deadlock at the leap second before, in 2007. (Race conditions can exist unnoticed for a very long time in infrequently-tested code paths. Who knew?!)

Each of these bugs results from some interaction with a lock called "xtime_lock" (a seqlock, whose write side behaves like an ordinary spinlock). Take a look at this code (trimmed and approximated a bit), which has moved and changed between different functions over the years but currently lives in "ntp_leap_second" in kernel/time/ntp.c.
        write_seqlock(&xtime_lock);
        switch (time_state) {
        case TIME_INS:
                timekeeping_leap_insert(-1);    /* step the clock back one second */
                time_state = TIME_OOP;
                clock_was_set();
                printk(KERN_NOTICE
                        "Clock: inserting leap second 23:59:60 UTC\n");
                break;
        case TIME_DEL:
                timekeeping_leap_insert(1);     /* step the clock forward one second */
                time_state = TIME_WAIT;
                clock_was_set();
                printk(KERN_NOTICE
                        "Clock: deleting leap second 23:59:59 UTC\n");
                break;
        // (more cases omitted ...)
        }
        write_sequnlock(&xtime_lock);
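
As an aside (this snippet is mine, not from the kernel): the time_state values above are the same clock states that userspace sees as the return value of adjtimex(2), and ntpd schedules a leap second in the first place by setting the STA_INS status bit via that same call. A minimal read-only query looks like this:
    #include <stdio.h>
    #include <sys/timex.h>

    int main(void)
    {
        struct timex tx = { .modes = 0 };   /* modes = 0: just read, don't set anything */
        int state = adjtimex(&tx);          /* return value is the clock state */

        switch (state) {
        case TIME_OK:   puts("TIME_OK: no leap second pending");            break;
        case TIME_INS:  puts("TIME_INS: insertion scheduled for midnight"); break;
        case TIME_DEL:  puts("TIME_DEL: deletion scheduled for midnight");  break;
        case TIME_OOP:  puts("TIME_OOP: leap second in progress");          break;
        case TIME_WAIT: puts("TIME_WAIT: leap second has just occurred");   break;
        default:        printf("clock state %d\n", state);                  break;
        }
        return 0;
    }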
I will list the bugs in chronological order.
  1. One deadlock is described here (fixed by same, in 2007). The function clock_was_set() calls smp_call_function() to "retrigger CPU local events" in the high-resolution timer subsystem. Unfortunately, it's forbidden to call smp_call_function() in "atomic context", which this code is in because it holds a spinlock (and moreover, is running in the timer interrupt handler). This was "fixed" in this commit - they simply removed the call.
  2. Another deadlock bug is shown in this post (it was fixed in 2008, according to this - but not all machines' linux versions had that fix for the 2008-2009 new year's leap second). The above code's call to printk() actually needs to wake the logging daemon kthread in order to print. Linux has a complicated "completely fair" scheduling algorithm which, under enough system load, needs to check the timer to determine what scheduling pattern to use. So, only under heavy load, calling printk() while holding xtime_lock would try to acquire xtime_lock again, causing deadlock.
  3. A third deadlock bug, linked from the serverfault post I linked to in the intro, is shown and fixed in this commit, dated a month and a half ago. Nine days prior, the "ntp_lock" spinlock had been split out from "xtime_lock" for finer locking granularity, but the split introduced a circular lock ordering here. Only kernels built from that nine-day window would have this bug.
  4. Despite all these bugs' fixes, yesterday saw linux servers having problems around the world. Many people reported huge CPU spikes, which turned out to result from futex misbehaviour. (Literally while writing this post, the systems hacker friend who told me about the bug yesterday hit the same futex problem on his own server.)
    It turned out that in bug #1 above, removing clock_was_set() was wrong after all. Technical details about this are here (also, major props to John Stultz, the same guy who fixed bug #3, for being on the case promptly last night, and for offering clarification when I emailed him). In short, without that call the high-resolution timer bases were never adjusted for the leap second, so sub-second high-resolution timers expired immediately; userspace applications that wait on them in loops therefore ran in tight loops, eating up CPU (a sketch of the kind of loop that went haywire is below, after this list). The popular-seeming fix was to run 'date -s "`date`"', which calls settimeofday(), which calls clock_was_set(), replacing the missing call (reference).
  5. And yet, there is still another bug lurking whose cause nobody seems to have discovered yet. In the serverfault post I linked in the intro, there was also an error message "[3161000.864001] BUG: spinlock lockup on CPU#1, ntpd/3358". This message is printed when a CPU has spent about a second spinning on a spinlock without being able to acquire it, indicating (you guessed it) deadlock. The linux folks don't appear to have figured this one out yet (not to disparage them - it's only been a day).
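
To make bug #4 concrete, here is a minimal userspace sketch of the kind of loop that went haywire (my own illustration, not code from any affected application). Timed waits such as pthread_cond_timedwait() typically boil down to a futex wait with a short absolute CLOCK_REALTIME timeout, which is an hrtimer in the kernel; as I understand the bug, with the timer bases left unadjusted after the leap second, those sub-second timeouts expired immediately, so loops like this spun flat out instead of ticking twice a second:
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        for (int i = 0; ; i++) {
            struct timespec deadline;

            /* Absolute CLOCK_REALTIME deadline 500ms from now -- the same
             * sort of sub-second hrtimer a short futex timeout uses. */
            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_nsec += 500 * 1000 * 1000;
            if (deadline.tv_nsec >= 1000000000L) {
                deadline.tv_sec  += 1;
                deadline.tv_nsec -= 1000000000L;
            }

            /* On a healthy kernel this ticks twice a second, forever.
             * With the hrtimer bases a second out of step after the leap
             * second, the deadline looks like it's already in the past,
             * the sleep returns immediately, and the loop pegs a CPU --
             * which is what real applications doing this indefinitely
             * were seen to do. */
            clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &deadline, NULL);
            printf("tick %d\n", i);
        }
        return 0;
    }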

Research Applicability

An obvious zeroth-order conclusion: There were/are disproportionately many disastrous race conditions in the leap second code simply because it's such an infrequently-executed codepath. Hence, there's a call for some extra form of verification apart from "run the code as it is" - whether it's code-coverage-based stress tests, symbolic execution, or static analysis. I think static analysis would do best here.

One interesting thing to note is that all of the locks involved here are global - xtime_lock and ntp_lock - and so, in a sense, is the "interrupt handling context", which is a global property of the code. Projects like RacerX (paper pdf) would be well capable of finding deadlocks #2 and #3 by simple callgraph analysis.
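
To spell out the shape such a tool looks for in bug #3: two code paths that take the same two global locks in opposite orders. Here's a minimal userspace sketch of that ABBA pattern, with pthread mutexes standing in for the kernel's seqlock and spinlock, and with the function names made up:
    #include <pthread.h>

    static pthread_mutex_t xtime_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t ntp_lock   = PTHREAD_MUTEX_INITIALIZER;

    static void timekeeping_path(void)      /* takes xtime_lock, then ntp_lock */
    {
        pthread_mutex_lock(&xtime_lock);
        pthread_mutex_lock(&ntp_lock);
        /* ... update the time state ... */
        pthread_mutex_unlock(&ntp_lock);
        pthread_mutex_unlock(&xtime_lock);
    }

    static void ntp_path(void)              /* takes ntp_lock, then xtime_lock */
    {
        pthread_mutex_lock(&ntp_lock);
        pthread_mutex_lock(&xtime_lock);
        /* ... apply an NTP adjustment ... */
        pthread_mutex_unlock(&xtime_lock);
        pthread_mutex_unlock(&ntp_lock);
    }

    int main(void)
    {
        /* Single-threaded this runs fine; with two threads racing, the
         * opposite acquisition orders can deadlock.  A static tool only
         * needs the callgraph and the lock/unlock sites to see the cycle
         * xtime_lock -> ntp_lock -> xtime_lock -- no 23:59:60 required. */
        timekeeping_path();
        ntp_path();
        return 0;
    }
Deadlock #2 is even easier for this kind of analysis: on the path from printk() back into the scheduler, the lockset already contains xtime_lock when the second acquisition is attempted.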

Bug #1 is especially interesting, though. This "can't call scheduler code while in interrupt context" rule is a very simple property, for which there's currently a runtime check (in kernel/smp.c):
    /* Can deadlock when called with interrupts disabled. */
    WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() && !oops_in_progress);
But I think this property can be ensured at compile-time - so that code with this bug will never build, let alone ship to production systems. I'd thought about this before, and last fall I wrote a primitive checker for this as a class project, which I called Atomic All-Nighters, and which would have been directly capable of finding bug #1 without ever even compiling or running the code. Here's the writeup pdf.
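
To give a flavour of the idea (the annotation names below are made up for illustration - they're not the actual Atomic All-Nighters syntax, though the kernel's Sparse annotations like __acquires and __releases are close cousins): each function declares whether it may sleep and whether it runs in atomic context, the annotations cost nothing at run time, and a checker walks the callgraph and rejects any may-sleep call reachable from an atomic region.
    /* Hypothetical annotations: they expand to nothing as far as the C
     * compiler is concerned, but a static checker reads them and tracks
     * atomic context along the callgraph. */
    #define MAY_SLEEP       /* callee may block, reschedule, or wait on other CPUs */
    #define IN_ATOMIC       /* body runs with a spinlock held or in an interrupt */

    MAY_SLEEP static void retrigger_cpu_events(void)
    {
        /* stand-in for smp_call_function(), which waits on other CPUs
         * and so must never be called from atomic context */
    }

    IN_ATOMIC static void handle_leap_second(void)
    {
        /* stand-in for the xtime_lock critical section shown earlier */
        retrigger_cpu_events();     /* the checker flags this call -- bug #1, at compile time */
    }

    int main(void)
    {
        handle_leap_second();
        return 0;
    }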

My high-minded idealist goal for the Atomic All-Nighters project is to get these static properties of the code representable by constructs in the kernel programming language itself. That way, the compiler would refuse to build any code that had this sort of bug. I'm going to work more on that project when I get back to CMU to start my Ph.D. in the fall. I'm excited for it.