2011-07-15

The Importance of Being Assertive, A Trivial Style Guideline for Serious Programmers

one of the key ways that landslide "talks" to the guest kernel (i.e., manipulating thread interleaving patterns) is by triggering timer interrupts at key points in the kernel's execution. in terms of very general race detection, this is about the equivalent of looking at the code and thinking "what happens if we get preempted here?" (which is the traditional way of writing "race-free" concurrent code); of course, in this project, it will have some extra automated cleverness attached so it can be effective.

because landslide's code gets invoked at every instruction of the guest's execution, we must have a notion of a scheduling operation being "still in progress" - that is, after triggering a timer interrupt, there will be several instructions before a target thread (one we decided to switch to) actually starts running. if we take note of when the target thread starts up, we can provide a few guarantees about the instructions that will be run until then - namely, that the kernel must be executing in an interrupt handler and/or the context switcher (and NOT running the "meaty bits" of any other thread's code-path). this may seem obvious, but it is still an invariant that i would like to rely on, so i expressed it as an assert in the code:

    assert(ACTION(scheduler, context_switch) || HANDLING_INTERRUPT(scheduler));

imagine my surprise, testing the implementation, when this assert tripped!

after a bit of debugging, i discovered that the invariant violation was happening at the instruction immediately following the instruction at which i tried to trigger a timer interrupt. it turns out that, in some cases, simics may decide to delay interrupt processing by a few instructions (seemingly by a non-deterministic amount, too) after i set the CPU's pending interrupt flags.

the fix (ensuring that when i decide to trigger a timer interrupt it is actually received immediately) is nothing special; what is important is to realise that my programming environment had some peculiarity that broke an assumption that i didn't even realise i was making when i established the invariants of my own code. so upon finding this new assumption (and writing code to make sure it worked), i added another assert:

    if (scheduler->just_triggered_timer_interrupt) {
        assert(get_current_eip() == get_timer_wrap_begin() &&
               "simics attempted to delay our interrupt! :<");
        scheduler->just_triggered_timer_interrupt = false;
    }

now if something else of this nature goes wrong, i will know immediately, with a useful error message to boot. but imagine if i'd never written that first assert to begin with? simics could have merrily delayed all my interrupts for however long it wanted, i would never have known, and wherever i would decide to trigger interrupts (i.e., notionally good places for exposing race conditions) would have no bearing on when they actually happened! i could spend months on this project and it would never work right and i might never know why.

use asserts, fellow hackers - not just comments or thoughts in your heads. you'll be happy for it later.

No comments:

Post a Comment