Winning Race Conditions

PLDI evaluation

2015-10-15T17:46:00.003-04:00

Running experiments for PLDI has begun in earnest. My evaluation plan calls for 800 CPU-days of testing:

80 P2 thread libraries * 6 test cases
79(?) Pintos kernels * 2 test cases

638 codebase+testcase pairs total

For each one, a 10-hour control experiment, a 10-cpu * 1-hour "live" experiment, and a 10-cpu * 1-hour "data race false negative" experiment (don't worry, the paper will explain it... should it get published!).

(80*6+79*2)*3*10 = 19,140 cpu-hours = 797.5 cpu-days.

And 200 CPUs to do it with.
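The arithmetic above is simple enough to check by machine. Here is a minimal sketch in present-day Rust; the function names are made up for illustration, and the wall-clock figure assumes the 200 CPUs stay perfectly busy, which real scheduling won't quite manage.

```rust
/// Number of codebase+testcase pairs in the plan.
fn total_pairs() -> u32 {
    let p2 = 80 * 6; // P2 thread libraries * test cases
    let pintos = 79 * 2; // Pintos kernels * test cases
    p2 + pintos
}

/// CPU-hours for the whole evaluation: every pair gets `experiments`
/// experiments costing `hours_each` CPU-hours apiece.
fn total_cpu_hours(pairs: u32, experiments: u32, hours_each: u32) -> u32 {
    pairs * experiments * hours_each
}

/// Wall-clock days if `cpus` CPUs run flat out with perfect packing
/// (an idealising assumption, not a measurement).
fn wall_clock_days(cpu_hours: u32, cpus: u32) -> f64 {
    f64::from(cpu_hours) / f64::from(cpus) / 24.0
}
```

With 638 pairs, 3 experiments each, and 10 CPU-hours per experiment, this gives 19,140 CPU-hours, or roughly four days of wall-clock time on 200 CPUs under that idealised packing.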

big green button

2015-09-09T05:28:00.002-04:00

Hello internet, it's been a while.

Tonight I'm having a "1% moment" of research. That is, 99% of the time, I either have my nose to the grindstone, or am endlessly worrying and guilting myself about not getting enough done, being an impostor, etc.; but that other 1% is why I'm still a grad student. Because sometimes at 5 in the morning I finish writing mindless automation glue code, finish patching horribly broken anonymous student code, and finish debugging the bugs in my bug-finding software (ha), and finally reach a state where I can hit a big green button marked "GO RUN THE EXPERIMENT" and watch the computer do something absolutely frickin' amazing.

My current project is an extension of Landslide that automatically searches for new preemption points (PPs) during the course of a systematic test, adds new state spaces to explore using those PPs, and figures out with state space estimation which state spaces are most likely to finish testing in a given CPU budget. I'm calling it "iterative deepening" by analogy with the chess AI technique, and you can find my latest talk slides here for more details.

But mostly the purpose of this post is for me to share some eye-candy. Here's what Landslide looks like when it's feeling victorious.

The key thing to note here is that bugs are only found in state spaces with data-race preemption points, which only Landslide's iterative deepening framework is capable of identifying and using. IOW, these bugs would be missed by any other systematic testing tool that interposed only on thread library calls.

I've finally got a conference deadline in my sights where getting accepted seems realistic. It's been a looong build-up to this point. Keep your eyes peeled.

What is a "data race" and when is a race not a data race?

2012-10-20T12:30:00.001-04:00

I've been meaning to write about this since my post series about Rust (in particular here, where I wrote "while data races are no longer possible, race conditions in general still are" about the RWARC library). In general, Rust statically guarantees freedom from data races, though not freedom from all races. But what does that mean?

A data race is when multiple threads concurrently access the same memory location where at least one access is a write. "Concurrently" here could mean either literally at the same time (threads run on different CPUs) or abstractly at the same time (threads interleave with each other on the same CPU); i.e., no synchronisation primitive enforces that one thread's access completes before the other begins.

Thread 1	Thread 2
if (p != NULL)
	p = NULL;
output(p->data);

Data race detectors, such as Eraser and Helgrind, analyse threads' mutual-exclusion and happens-before relationships to identify unsafe concurrent accesses like these. But it's possible to stop accesses from being concurrent without enforcing correct behaviour:

Thread 1	Thread 2
mutex_lock(m);
bool ok = p != NULL;
mutex_unlock(m);
	mutex_lock(m);
	p = NULL;
	mutex_unlock(m);
mutex_lock(m);
if (ok) output(p->data);
mutex_unlock(m);

Now the data race is gone, but the bug has simply become a higher-level race condition. Most literature calls this an "atomicity violation" (and some literature even uses "race" to mean exclusively data races).
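To make the atomicity violation concrete, here is a minimal sketch in present-day Rust (not the 2012 syntax used elsewhere on this blog). It replays the three critical sections from the table on a single thread, so the bad interleaving happens every time. The function names are hypothetical, and an Option stands in for the nullable pointer.

```rust
use std::sync::Mutex;

/// Replays the interleaving from the table, in order, on one thread.
/// Returns what "Thread 1" sees in its final critical section.
fn check_then_act(p: &Mutex<Option<String>>) -> Option<String> {
    // Thread 1, first critical section: check. The lock is released
    // at the end of this statement.
    let ok = p.lock().unwrap().is_some();

    // Thread 2 runs in between: p = NULL.
    *p.lock().unwrap() = None;

    // Thread 1, second critical section: act on the stale answer.
    if ok {
        // In C this is the NULL dereference. Option forces a second
        // look, so we observe None instead of crashing.
        let seen = p.lock().unwrap().clone();
        seen
    } else {
        None
    }
}

/// One way to avoid the problem: check and act in one critical section.
fn check_and_act(p: &Mutex<Option<String>>) -> Option<String> {
    let guard = p.lock().unwrap();
    guard.clone()
}
```

Every access in check_then_act() is protected by the mutex, so a data race detector has nothing to report, yet the value of ok is out of date by the time it is used.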

You might think this code looks silly, but if you're working in a project with many layers of abstraction and function/module boundaries, this kind of mistake can be all too easy to make, and data race detectors are powerless to find it.

Consider this real-world example. When I started at Mozilla this summer, Rust 0.2 had recently shipped, and its release notes mentioned that it was "Helgrind-clean" (meaning no data races existed). Yet the Rust runtime contained this code:

bool rust_task::blocked() {
    mutex_lock(this->lifecycle_lock);
    bool is_blocked = this->state == task_state_blocked;
    mutex_unlock(this->lifecycle_lock);
    return is_blocked;
}

Sure, accessing the field was safely protected by the mutex, but once it dropped the lock and returned, all bets were off as to whether the returned value was still accurate or not. (I fixed several bugs related to this, and removed this function entirely.)
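As a rough illustration of the pattern (not the actual fix that went into the Rust runtime), here is a sketch in present-day Rust. Task, TaskState, and wake_if_blocked() are hypothetical names. The point is that the test and the state change happen under one lock acquisition, so there is no stale boolean for a caller to act on.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Running,
    Blocked,
}

struct Task {
    state: Mutex<TaskState>,
}

impl Task {
    /// Same shape as rust_task::blocked(): correct at the instant the
    /// lock is held, possibly stale by the time the caller looks at it.
    fn blocked(&self) -> bool {
        *self.state.lock().unwrap() == TaskState::Blocked
    }

    /// Tests and transitions while holding the lock the whole time.
    /// Returns true if the task was blocked and has now been woken.
    fn wake_if_blocked(&self) -> bool {
        let mut state = self.state.lock().unwrap();
        if *state == TaskState::Blocked {
            *state = TaskState::Running;
            true
        } else {
            false
        }
    }
}
```

A caller who writes "if task.blocked() { wake(task) }" has the atomicity violation from the table above; a caller of wake_if_blocked() does not.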

In a similar vein, Rust's type system guarantees that concurrent tasks cannot share state but instead must use message-passing to communicate, which precludes the possibility of data races completely by enforcing happens-before relationships on all data accesses (or in the case of this post, by enforcing mutual-exclusion relationships). Yet it's still possible to write nondeterministic programs in Rust (using select2, failure propagation, etc), and so race conditions are still possible.

The moral of this story is that data races are only one of many types of races, and though many tools exist for finding them, just because one guarantees absence of data races does not mean your code is completely concurrency-safe. Not to say these tools aren't useful, but they often fail where more sophisticated race-finding techniques could succeed, and even still, no automated race-finding tool can substitute for a careful human brain when reasoning about concurrency.

What exactly is a "race condition", anyway?

2012-10-20T11:25:00.000-04:00

A friend of mine is taking the operating systems class at UMD, in which the second project is to implement inter-process signals. He noted a peculiarity in the specification: that processes are not woken up immediately if they receive a signal while blocked (e.g. on child processes, on keyboard/disk input). As a result, it could be completely random whether a process receiving a signal gets killed immediately or sleeps forever.

He discussed this with the professor, and they disagreed over whether this nondeterminism constituted a "race condition" or not. After all, the specification allows for signals to fail to wake up processes under certain circumstances, so there's nothing wrong about implementing it that way. On the other hand, a kernel whose signalling mechanism always wakes up processes in bounded time (i.e., finitely long -- whereas waiting for keyboard input could take forever) could provide stronger guarantees about inter-process communication.

In my interpretation, neither argument tells the entire story. For starters, race conditions don't necessarily entail wrong behaviour; I've seen plenty of "benign" race conditions with comments along the lines of "if X and Y race, Z will happen, and this is OK". Benign races aside, though, "race condition" to me means "unexpected behaviour occurs nondeterministically". So, if you want to be precise, it's important to talk about race conditions with respect to certain expectations.

Someone writing a userspace program for this kernel who didn't realise that signals might never get taken (and hence produced code that sometimes accidentally sleeps forever) could say they were bitten by a race in the kernel. But if they'd read the spec carefully, they might've written code that handled the nondeterminism more robustly. They could say the spec's nondeterminism made it less useful than other possible specs, but it wouldn't be fair to blame the particular implementation of this spec for being buggy.

In short, I would say the specification itself has a race condition in it, but implementations thereof don't. What's important is who holds the expectations and who nondeterministically breaks them.

Rust (0): Index and Conclusion

2012-09-26T14:18:00.004-04:00

This four-post series on Rust is intended to introduce you to the language, to teach you about Rust's cool language features, and to give a debriefing of what I contributed to it this summer.

These posts are targeted at an audience with some knowledge of programming language design principles. You should be lightly familiar with both systems programming languages such as C++ and functional languages such as Haskell or ML, and preferably strongly skilled in at least one of the two domains.

Do feel free to skip ahead, if you're already familiar with parts of the language, or to bail out early, if you're not interested in an involved tour of concurrency primitives. All the same, I hope you get something out of some or all of these posts.

Primer - an introduction to the language's syntax, memory model, and concurrency model
Linked Task Failure - advanced parallel programming and error handling with tasks (my first project)
Typesafe Shared State - an overview of the region system and a parallelism library that makes heavy use of it
Typesafe Shared Mutable State - using trickery with Rust's type system to achieve a completely safe interface for common concurrency idioms (my second project)

I'd like to close with an argument for why I think Rust is the "language of the future" for systems programming.

Rust's strong static type system relieves programmers from worrying about many types of errors they should never have to. NULL pointer crashes, memory management errors, surprising implicit type coercions, and dynamic cast exceptions don't exist anymore. Meanwhile, features like closures and higher-order functions (missing in C++ until very recent versions), algebraic datatypes and parametric polymorphism (both missing in Go), and traits (existential types; a combination of Haskell-style typeclasses and OO-style interfaces) allow you to concisely express ideas that would otherwise involve a lot of legwork in certain "conventional" languages.
Unlike other functional languages, however, Rust has heavy focus on performance as well. Stack-allocated data lets you often avoid dynamic allocation overhead and garbage collection (even closures can sometimes be entirely on the stack). The region system and borrow checker allow for type-and-memory-safe aliasing of arbitrary data with no runtime overhead. Explicit copyability as part of the type system lets you be aware of when expensive copies might occur.
Finally (and this is the big one, for me), Rust's type system includes a concurrency-aware memory model. Forbidding unprotected shared state and using message-passing over pipes as the main communication mechanism means programmers no longer have to worry about data races, and is also friendly to massively-parallel applications where cache-line contention is a serious worry. The use of noncopyable types means the message-passing library can safely assume all communication will be one-to-one, which allows for a blazing fast implementation under the hood. Noncopyable types also give other strong guarantees, such as the safety of ARCs and the fact that two tasks cannot deadlock when communicating over a single pipe.

Hopefully I've gotten you excited about using Rust for safe + performant parallel programming (or maybe several months from now, when its features and syntax are more stable). And to the Rust community: Thanks, it's been a blast.

Rust (4): Typesafe Shared Mutable State

2012-09-25T16:05:00.000-04:00

This post is a continuation of shared immutable state. Before I introduce how we do safe shared mutable state, I'll take a moment to show why unprotected shared mutable state is dangerous.

Dangers of Shared State

If you're a functional programmer, you're probably used to a language in which nested data structures are allocated in several heap cells, each of which is garbage-collected, so multiple users can freely alias into the same data, implicitly copy to make changes, and so on.

Rust's approach is somewhat different: it focuses on stack-allocation, avoiding expensive implicit copies, and predictable performance. In fact, heap-allocation only occurs when you write the @ or ~ sigil; and, absent @-pointers, Rust's representation semantics don't involve garbage collection at all. Instead:

1. Data types are represented with interior types, meaning data types are embedded directly within one another rather than using pointer indirection. You can, of course, create borrowed pointers to such types and pass them between functions.
2. Stack-allocated and ~-allocated values are owned data, which get eagerly freed/deinitialised immediately upon going out of scope or being overwritten.
3. Rustic data structures can have in-place mutability, indicated with the mut keyword. While also supported by many other functional languages, in Rust it presents new difficulties with aliasing pointers because of point #2 above.

With such a C/C++-like representation model, the prospect of sharing mutable state among multiple actors is a lot more dangerous. To show why, let's say we added a data-race-enabling function to ARC's interface:

    fn get_mut<T: Const Send>(arc: &a/ARC<T>) -> &a/mut T

Then we can commit badness like:

    let arc: ARC<Option<~int>> = ARC(Some(~31337));
    let arc2 = clone(&arc);
    do task::spawn |move arc2| {
        // Might print "Some(~31337)". Might print "None". Might segfault.
        io::println(fmt!("%?", *get(&arc2)));
    }
    // Frees and deinitialises the owned pointer inside the ARC.
    *get_mut(&arc) = None;
    // (But what if this runs after the other task determines the data
    //  is Some, but before it dereferences the contained pointer??)

With sufficient cleverness, this can even be harnessed to implement arbitrary type coercion. (See my solution here.)

Reader-Writer ARCs

The ARC already existed when I arrived at Mozilla, but there was no similar (and safe) solution for the state being mutable. I created the RWARC, with a reader-writer lock inside, to fill this gap.

You create them just like you create ARCs:

    fn RWARC<T: Const Send>(data: T) -> RWARC<T>
    fn clone<T: Const Send>(arc: &RWARC<T>) -> RWARC<T>

But when using them, instead of getting an unlimited-use reference to the data inside, you give the interface a closure to run on the data, and it runs the closure for you with the rwlock held in the correct mode.

    fn read <T: Const Send>(arc: &RWARC<T>, blk: fn(&T))
    fn write<T: Const Send>(arc: &RWARC<T>, blk: fn(&mut T))

The key difference is that the region associated with the data pointer is the region of the closure, rather than some arbitrary region defined by the caller. This allows read() and write() to enforce that the contained reader-writer lock is always held in the correct mode when references to the data exist.

Now we can fix the example from before.

    let arc = RWARC(Some(~31337));
    for 5.times {
        let arc2 = clone(&arc);
        do task::spawn |move arc2| {
            do read(&arc2) |state: &Option<~int>| {
                // Long-running reads on state still happen in parallel.
                io::println(fmt!("%?", *state));
            }
        }
    }
    do write(&arc) |state: &mut Option<~int>| {
        // Exclusive write access. No other aliases to state can exist concurrently.
        *state = None;
    }

Note that while data races are no longer possible, race conditions in general still are. (I mentioned earlier that shared mutable state introduces nondeterminism.) Here, anywhere between zero and five "None"s will be printed.
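For readers who know today's Rust better than the 2012 dialect, here is a rough analogue built on the standard library's Arc and RwLock. RwArc and its methods are hypothetical names chosen to mirror the interface above; this is a sketch of the idea, not a port of the original library.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Closure-based wrapper in the spirit of RWARC (hypothetical type).
struct RwArc<T>(Arc<RwLock<T>>);

impl<T> RwArc<T> {
    fn new(data: T) -> Self {
        RwArc(Arc::new(RwLock::new(data)))
    }

    fn clone_handle(&self) -> Self {
        RwArc(Arc::clone(&self.0))
    }

    /// Runs `blk` with the lock held in read mode. The reference passed
    /// to `blk` cannot be returned through R, so it cannot escape.
    fn read<R>(&self, blk: impl FnOnce(&T) -> R) -> R {
        let guard = self.0.read().unwrap();
        blk(&*guard)
    }

    /// Runs `blk` with the lock held in write mode.
    fn write<R>(&self, blk: impl FnOnce(&mut T) -> R) -> R {
        let mut guard = self.0.write().unwrap();
        blk(&mut *guard)
    }
}

/// Five readers race one writer. Returns what each reader observed.
fn race_readers_against_writer() -> Vec<Option<i32>> {
    let arc = RwArc::new(Some(Box::new(31337)));
    let readers: Vec<_> = (0..5)
        .map(|_| {
            let arc2 = arc.clone_handle();
            thread::spawn(move || arc2.read(|state| state.as_deref().copied()))
        })
        .collect();
    arc.write(|state| *state = None);
    readers.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Each of the five results is either Some(31337) or None depending on scheduling, which is the same race condition as in the original example: no data race, but still nondeterministic.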

The compiler will, of course, reject code that tries to cheat the interface:

    let escaped_state;
    do write(&arc) |state| {
        escaped_state = state; // ERROR: reference not valid outside of its lifetime
    }

A brief informal justification of safety:

The Const restriction still enforces that readers only see deeply immutable state. Also, even with mutable state, it still prevents cycles from being created, because the RWARC itself does not have the Const kind.
References to the shared state cannot escape the closure called by read() or write(). In effect, the region system statically enforces that the lock must be held in order to access the state.

The Concurrency Primitives You Know and Love

Condition Variables

The RWARC also comes with some other features to remind you of home (if "home" to you means old C-style concurrency primitives you fought off race conditions with back in the day). We have condition variables:

    fn write_cond<T: Const Send>(arc: &RWARC<T>, blk: fn(&mut T, &Condvar))
    fn wait(cond: &Condvar)
    fn signal(cond: &Condvar) -> bool
    fn broadcast(cond: &Condvar) -> uint

These work as you might expect. Like the &mut T reference, the Condvar reference can only be used inside the closure (i.e., while the lock is held).

    let arc = RWARC(~[]);
    let arc2 = clone(&arc);
    do task::spawn |move arc2| {
        do write_cond(&arc2) |state,cond| {
            // Poor man's message-passing. Of course, pipes are much
            // faster; rwarcs and condvars are built on top of pipes.
            vec::push(state, ~"hello there!");
            signal(cond);
        }
    }
    do write_cond(&arc) |state,cond| {
        while state.len() == 0 {
            wait(cond);
        }
        io::println(vec::pop(state));
    }

(The more seasoned concurrency hackers among you might now be wondering: what if you wanted to associate multiple conditions with the same state? That can be done too -- gritty details are in the docs.)
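The same "poor man's message-passing" can be sketched in present-day Rust with the standard library's Mutex and Condvar. Note one difference that is an artefact of the standard library rather than of the design above: std's Condvar pairs with a Mutex, not with a reader-writer lock. The function name is hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// One thread pushes a message and signals; the caller waits for it.
fn poor_mans_message_passing() -> String {
    let shared = Arc::new((Mutex::new(Vec::<String>::new()), Condvar::new()));
    let shared2 = Arc::clone(&shared);

    let producer = thread::spawn(move || {
        let (lock, cond) = &*shared2;
        lock.lock().unwrap().push("hello there!".to_string());
        cond.notify_one();
    });

    let (lock, cond) = &*shared;
    let mut state = lock.lock().unwrap();
    // Re-check the predicate in a loop: wait() may wake spuriously, and
    // the message may already be there before we ever wait.
    while state.is_empty() {
        state = cond.wait(state).unwrap();
    }
    let message = state.pop().unwrap();
    drop(state);
    producer.join().unwrap();
    message
}
```

As in the original, the wait happens with the lock held, and the while loop is what makes the code correct regardless of which thread gets the lock first.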

Downgrade (or, Now You're Just Showing Off with the Region System)

(Do feel free to zone out for this section.)

If you're used to being able to atomically "downgrade" write access into read access without letting other writers through in the meantime, you can do that here too. (I'm presenting this feature mostly just to show off more stuff you can do by combining the region system with noncopyable types.)

    // Calls a closure which will write, then downgrade, then read.
    fn write_downgrade<T: Const Send>(arc: &RWARC<T>, blk: fn(RWWriteMode/&a<T>))

    // Converts a "write permission" token to a "read permission" token.
    fn downgrade<T: Const Send>(token: RWWriteMode/&a<T>) -> RWReadMode/&a<T>

    fn write<T: Const Send>(token: &RWWriteMode<T>, blk: fn(&mut T))
    fn read <T: Const Send>(token: &RWReadMode <T>, blk: fn(&T))

Here, the RWWriteMode and RWReadMode are noncopyable "permission tokens" that allow the user to write or read, and downgrade() is a function that consumes the write token and wakes up any readers waiting on the rwlock. Since the tokens are noncopyable, the caller cannot still have write permissions after calling downgrade() (which would, of course, result in data races).

The "RWWriteMode/&a" syntax indicates an opaque data structure with region pointers inside. While the write mode token is passed by ownership (so that it can in turn be surrendered to downgrade()), its scope is still constrained by the associated region, which means it can't escape from the closure passed to write_downgrade(). And downgrade() converts a write mode token to a read mode token with the same region, so the latter can't escape either.

Complex as the above functions may seem, using the interface simply looks like this:

    do write_downgrade(&arc) |token| {
        do write(&token) |mutable_state| {
            ...
        }
        let token = downgrade(move token);
        do read(&token) |immutable_state| {
            ...
        }
    }
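The token trick carries over to present-day Rust almost unchanged, with lifetimes in place of regions and move semantics in place of noncopyable types. The sketch below uses hypothetical names (WriteToken, ReadToken, demo_downgrade) and makes one simplifying assumption: the underlying std RwLock stays in write mode for the whole closure, so only the permission is downgraded and no waiting readers are actually woken.

```rust
use std::sync::RwLock;

/// Write permission; not Copy or Clone. 'a plays the role of the region.
struct WriteToken<'a, T>(&'a mut T);

/// Read permission, obtained only by giving up a WriteToken.
struct ReadToken<'a, T>(&'a T);

impl<'a, T> WriteToken<'a, T> {
    fn write<R>(&mut self, blk: impl FnOnce(&mut T) -> R) -> R {
        blk(&mut *self.0)
    }

    /// Takes the token by value, so no write permission survives.
    fn downgrade(self) -> ReadToken<'a, T> {
        ReadToken(self.0)
    }
}

impl<'a, T> ReadToken<'a, T> {
    fn read<R>(&self, blk: impl FnOnce(&T) -> R) -> R {
        blk(self.0)
    }
}

/// Hands `blk` a token whose lifetime is confined to this call.
/// (Assumption: the lock is held in write mode throughout.)
fn write_downgrade<T, R>(
    lock: &RwLock<T>,
    blk: impl for<'a> FnOnce(WriteToken<'a, T>) -> R,
) -> R {
    let mut guard = lock.write().unwrap();
    blk(WriteToken(&mut *guard))
}

/// Writes, downgrades, reads; returns the value read through the
/// token and the value seen after the lock is released.
fn demo_downgrade() -> (i32, i32) {
    let lock = RwLock::new(1);
    let seen = write_downgrade(&lock, |mut token| {
        token.write(|n| *n += 41);
        let token = token.downgrade();
        token.read(|n| *n)
    });
    let after = *lock.read().unwrap();
    (seen, after)
}
```

Calling token.write() after token.downgrade() is a compile error (use of a moved value), and neither token can be smuggled out of the closure because the return type R cannot mention the token's lifetime.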

Unwrap

Finally, RWARCs (ARCs too) also now have a mechanism to get your data back out again.

    fn unwrap<T: Const Send>(arc: RWARC<T>) -> T

Of course, it wouldn't be valid to reclaim ownership of the data while other tasks might still have aliases to it. Instead, unwrap() blocks the calling task until its reference is the only reference alive, and then takes ownership of the data instead of freeing it. (To avoid deadlock, subsequent callers to unwrap() on the same ARC immediately fail.)

This adds expressivity in two ways: it relieves you from having to deeply-copy the shared data if you need to own it (which would be extra problematic if it had noncopyables inside), and it automatically synchronises with the ARC's other users. You could use this to implement a fork-join pattern, like so:

    let arc = RWARC(some_data);
    for num_cpus().times {
        let arc2 = clone(&arc);
        do task::spawn |move arc2| {
            process_data(arc2); // might read, write, whatever
        }
    }
    let modified_data = unwrap(move arc); // blocks on all child tasks at once
    // do more of the algorithm, etc.

All this without ever once copying the data.
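A comparable fork-join can be sketched in present-day Rust, with one notable difference: the standard library's Arc::try_unwrap does not block, so the sketch joins the worker threads explicitly before reclaiming the data. The function name and the per-worker "work" (pushing an id) are made up for illustration.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Spawns `workers` threads that each record their id in shared
/// state, then takes the state back out without copying it.
fn fork_join(workers: usize) -> Vec<usize> {
    let arc = Arc::new(RwLock::new(Vec::new()));
    let handles: Vec<_> = (0..workers)
        .map(|id| {
            let arc2 = Arc::clone(&arc);
            thread::spawn(move || {
                arc2.write().unwrap().push(id);
            })
        })
        .collect();

    // Unlike unwrap() above, try_unwrap() fails instead of blocking
    // if other handles are alive, so wait for the workers first.
    for handle in handles {
        handle.join().unwrap();
    }

    let lock = Arc::try_unwrap(arc).expect("all other handles were dropped");
    let mut data = lock.into_inner().unwrap();
    data.sort(); // arrival order depends on scheduling
    data
}
```

The Vec inside the lock is moved out at the end, never cloned, which is the property the unwrap() interface was designed to give.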

This about wraps up the contributions I made this summer at Mozilla. In my next post I'll conclude the series with a summary of why I like Rust so much.

Rust (3): Typesafe Shared State

2012-09-22T00:38:00.000-04:00

Previously I introduced Rust, talking about syntax, pointer types, and light-weight parallelism and message-passing. I also wrote about my own summer project, flexible failure propagation between tasks, talking about some more advanced programming techniques with Rustic tasks.

Through it all you might have been wondering, "No shared state?! I see the value in eliminating data races, but isn't it sometimes what you want?" Yes! That's what this post is for.

Consider: When spawning a bunch of tasks to process a large data structure in parallel, it would be a shame to have to deeply copy the whole thing and send one copy over a pipe to each task (expensive in both space and time). You'd want each task to be able to alias the same data instead.

Shared Immutable State

Rust's standard library includes the ARC, which stands for Atomically Reference-Counted object. The ARC serves as a wrapper-handle to some data you wish to share; rather than copying the data itself, you instead copy just the handle, which just involves atomically incrementing a reference count for the contained data.

To create an ARC:

    // Given ownership of some data, wraps it in an ARC.
    fn ARC<T: Const Send>(data: T) -> ARC<T>

The polymorphic type T is constrained by the Send kind (which I mentioned in my primer post), so it can only be used with data of types that you could also otherwise send over pipes, and also by the Const kind, which means the data can have no mutable interior fields (the type has to be deeply immutable to guarantee no data races).

Like pipe endpoints, the ARC is a noncopyable type. New handles to the same ARC cannot be freely created (for that would bypass the reference counting mechanism); they must be made using the rest of the interface. (ARC also uses destructors internally, so the moment an ARC handle leaves scope, the reference count gets dropped. When the count hits zero, the data will be freed as well.)

And to use an ARC:

    // Creates a new handle to the ARC.
    fn clone<T: Const Send>(arc: &ARC<T>) -> ARC<T>

    // Get an immutable pointer to the underlying data.
    fn get<T: Const Send>(arc: &a/ARC<T>) -> &a/T

You'll notice the use of &-pointers (borrowed pointers) in this interface. In clone(), this means the argument ARC is passed by-reference rather than by-ownership to create the new handle. The interface of get() introduces some new syntax, &a/T; to explain it, I'll need to introduce regions.

As I hinted at in my primer post, borrowed pointers are statically analysed to ensure they don't outlive the data they were borrowed from. This is done by associating a region with the borrowed pointer to denote its lifetime (which is tied to some lexical scope or inherited from other data's lifetime).

Mostly, regions exist behind-the-scenes, since the compiler can infer them when needed. Sometimes it is useful, though, to explicitly write that two regions will be the same -- the &a/T syntax denotes a borrowed pointer to a T with some lifetime a. Because the same region variable is used to borrow the ARC itself ("&a/ARC<T>"), the compiler knows to enforce in get()'s caller that the returned pointer cannot outlive the associated ARC handle. get() is said to be region-parametric; that is, the region variable a can be instantiated with whatever region is appropriate at each call-site.
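In present-day Rust the region variable is written as a named lifetime, so the signature of get() would look roughly like the sketch below. The free function get() and sum_in_two_tasks() are hypothetical and written against the standard library's Arc (which normally gives you this via auto-deref), purely to show the region-parametric signature in a form that compiles today.

```rust
use std::sync::Arc;
use std::thread;

/// Modern spelling of `fn get<T>(arc: &a/ARC<T>) -> &a/T`: the
/// returned reference shares the lifetime 'a of the borrowed handle,
/// so it cannot outlive that handle.
fn get<'a, T>(arc: &'a Arc<T>) -> &'a T {
    &**arc
}

/// Two tasks read the same data through separate handles.
fn sum_in_two_tasks(data: Vec<u64>) -> (u64, u64) {
    let arc1 = Arc::new(data);
    let arc2 = Arc::clone(&arc1);
    let child = thread::spawn(move || get(&arc2).iter().sum::<u64>());
    let here: u64 = get(&arc1).iter().sum();
    (here, child.join().unwrap())
}
```

At each call site the compiler picks a concrete lifetime for 'a, which is what "region-parametric" means above.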

Examples

Here's a code snippet that demonstrates basic ARC usage. I create an ARC with a BigDataStructure inside, clone a second handle, and then in two parallel tasks get references into them.

    fn init() -> BigDataStructure   { ... }
    fn access(x: &BigDataStructure) { ... }

    fn main() {
        let arc1 = ARC(init());   // refcount == 1
        let arc2 = clone(&arc1);  // refcount == 2
        do task::spawn |move arc2| {  // gives child ownership of 2nd handle
            let x2: &BigDataStructure = get(&arc2);
            access(x2);  // in parallel with the below
            // arc2 gets dropped. BigDataStructure might get freed here.....
            // (note: x2 can no longer be accessed)
        }
        let x1: &BigDataStructure = get(&arc1);
        access(x1);  // in parallel with the above
        // arc1 gets dropped. .....or it might get freed here.
        // (note: x1 can no longer be accessed)
    }

Here are some examples of ways the type system prevents unsafe usage.

First, the compiler won't let me bypass the reference-counting mechanism:

    let arc1 = ARC(init());  // refcount == 1
    let arc2 = arc1; // ERROR: copying a noncopyable value
    // double free :(

If ARC handles were copyable, two destructors would run here and the reference count would get decremented too many times.

The compiler will also stop me from using the reference from get() after the associated ARC handle went out of scope (which is legal in a language like C++, and would result in a use-after-free):

    fn broken_get(arc: ARC<BigDataStructure>) -> &a/BigDataStructure {
        // note the unconstrained region variable ^
        let x = get(&arc);
        return x;  // ERROR: reference not valid outside of its lifetime
        // note: the arc handle would get dropped here(??)
    }
    access(broken_get(ARC(init())));  // use after free :(

Finally, I will try to surrender ownership of my ARC handle by sending it over a pipe (perhaps to another task), while still holding on to a pointer I borrowed from it with get().

    let (sender,receiver) = pipes::stream();
    let arc = ARC(init());
    let x = get(&arc);      // NOTE: loan of local variable granted here
    sender.send(move arc);  // ERROR: moving out of local variable
                            //        prohibited due to outstanding loan
    access(x);  // unknown whether arc is still alive(??)

But the compiler's borrow checker stopped me, because the "loan" I had created earlier was still in scope.

Safety

Because Rust intentionally has no language features to support shared state, the ARC library provides it by using unsafe code internally. Given that unsafe code "shifts the burden of proof from the compiler to the programmer", how can we know the interface is right?

While we are working on a proof of the region system's correctness in general, we don't have a proof for this interface in particular (though I'd be curious how one would look!). Nevertheless, we can be quite confident in the ARC's safety because of the guarantees that Rust's language features provide:

1. The Const kind restriction and the immutable pointer returned by get() ensure that once inside an ARC, data can never be modified. This makes data races impossible, and also precludes the possibility of constructing a cyclic reference among ARCs. (Reference counting is a safe memory management strategy only in absence of cycles.)
2. The use of noncopyable ("linear") types for the ARC handles ensures that the reference count exactly matches the number of handles, and therefore the associated data will only be freed when all handles have left scope.
3. The regioned type signature of get() ensures that a reference to the contained data must be outlived by its associated handle (and hence, by #2, outlived also by the contained data itself).
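The second point can be observed directly with the standard library's Arc in present-day Rust, which exposes the count through Arc::strong_count. A small sketch (the function name is hypothetical):

```rust
use std::sync::Arc;

/// Returns the reference count seen with one handle, with two
/// handles, and after the second handle has been dropped.
fn observe_refcounts() -> (usize, usize, usize) {
    let arc1 = Arc::new(String::from("big data structure"));
    let one = Arc::strong_count(&arc1);

    // Handles are not Copy; an explicit clone is the only way to get
    // a second one, and it increments the count.
    let arc2 = Arc::clone(&arc1);
    let two = Arc::strong_count(&arc1);

    // Dropping a handle runs its destructor, which decrements it.
    drop(arc2);
    let back_to_one = Arc::strong_count(&arc1);

    (one, two, back_to_one)
}
```

Because a plain "let arc2 = arc1;" moves the handle rather than copying it, there is no way to end up with more live handles than the count says.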

Stay tuned for a follow-up post explaining a still more advanced interface I created for safely sharing mutable state between tasks.

Rust (2): Linked Task Failure

2012-09-18T12:17:00.003-04:00

In my last post, I gave an introduction to Rust's syntax and memory/concurrency model. None of that stuff was anything I contributed -- my own contributions are what I'll talk about in this post.

Rust has a built-in mechanism for failure, sort of light-weight exceptions that can be thrown but not caught. It is written "fail" (or "fail "reason"", or sometimes "assert expr"), and it causes the task to unwind its stack, running destructors and freeing owned memory along the way, and then exit itself.

There are library convenience wrappers for handling failure on the other side of the task boundary, so:

    let result = do task::try {  // spawns and waits for a task
        fail "oops!";
    };
    assert result.is_err();

(There is talk of extending failure to support throwing values of an "any" type and catching them, but that will take development effort.)
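For comparison, the closest thing in present-day Rust's standard library is spawning a thread and joining it: a panic in the child shows up as an Err from join(). The helper try_task() below is a hypothetical wrapper that mimics the shape of task::try, and it assumes the default unwinding panic strategy.

```rust
use std::thread;

/// Runs `body` in a new thread and waits for it. Returns Err if the
/// thread panicked, Ok with its result otherwise.
fn try_task<F, T>(body: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(body).join()
}
```

For example, try_task(|| -> i32 { panic!("oops!") }) comes back as an Err (the panic message is still printed to stderr by the child), while try_task(|| 7) comes back as Ok(7).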

But not all failure is created equal. In some cases you might need to abort the entire program (perhaps you're writing an assert which, if it trips, indicates an unrecoverable logic error); in other cases you might want to contain the failure at a certain boundary (perhaps a small piece of input from the outside world, which you happen to be processing in parallel, is malformed and its processing task can't proceed).

Hence the need for different linked failure spawn modes, which was my main project at Mozilla this summer. One of the main motivations for configurable failure propagation is Servo, a parallel web browser being written in Rust (again from Mozilla Research), so along with the code examples below I'll also include a web-browser-style use case for each failure mode.

Linked Task Failure

By default, task failure is bidirectionally linked, which means if either task dies, it kills the other one.

    do task::spawn {
        do task::spawn {
            fail;  // All three tasks will die.
        }
        sleep_forever();  // will get woken up by force
    }
    sleep_forever();  // will get woken up by force

There are plans for Servo to have parallel HTML/CSS parsing and lexing, so the parse phase can start before lexing finishes. If an error happens during either phase, though, the other one should stop immediately -- an application for bidirectionally linked failure.

Supervised Task Failure

If you want parent tasks to kill their children, but not for a child task's failure to kill the parent, you can call task::spawn_supervised for unidirectionally linked failure.

The function task::try uses spawn_supervised internally, with additional logic to wait for the child task to finish before returning. Hence:

    let (sender,receiver) = pipes::stream();
    do task::spawn {  // bidirectionally linked
        // Wait for the supervised child task to exist.
        let message = receiver.recv();
        // Kill both it and the parent task.
        assert message != 42;
    }
    do task::try {  // unidirectionally linked
        sender.send(42);
        sleep_forever();  // will get woken up by force
    }
    // Flow never reaches here -- parent task was killed too.

Supervised failure is useful in any situation where one task manages multiple child tasks, such as a parent tab task with several image-rendering child tasks, each of which could fail due to corrupted image data. This failure mode was inspired by Erlang.
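Half of this idea can be sketched with present-day Rust threads: a parent that survives its children's failures. The other half, the parent's failure killing its children, has no direct equivalent for std threads, so the sketch does not attempt it. render() is a stand-in for an image decoder and, like render_all(), is a hypothetical name.

```rust
use std::thread;

/// Stand-in for an image decoder: panics on "corrupted" (empty) input,
/// otherwise returns the number of bytes it processed.
fn render(image: &[u8]) -> usize {
    assert!(!image.is_empty(), "corrupted image data");
    image.len()
}

/// The parent renders each image in its own child thread and carries
/// on regardless of which children fail. One entry per image; None
/// marks a child that panicked.
fn render_all(images: Vec<Vec<u8>>) -> Vec<Option<usize>> {
    let children: Vec<_> = images
        .into_iter()
        .map(|image| thread::spawn(move || render(&image)))
        .collect();
    children
        .into_iter()
        .map(|child| child.join().ok())
        .collect()
}
```

Given three images where the middle one is empty, the parent gets back a result for the first and third and a None for the second, and keeps running.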

This mode of failure propagation was also the hardest to fully support, because parent task failure must propagate across multiple generations even if an intermediate generation has already exited:

    do task::spawn_supervised {
        do task::spawn_supervised {
            sleep_forever();  // should get woken up by force
        }
        // Intermediate task immediately exits.
    }
    wait_for_a_while();
    fail;  // must kill grandchild even if child is gone

Unlinked Task Failure

Finally, tasks can be configured to not propagate failure to each other at all, using task::spawn_unlinked for isolated failure.

    let (time1, time2) = (random(), random());
    do task::spawn_unlinked {
        sleep_for(time2);  // won't get forced awake
        fail;
    }
    sleep_for(time1);  // won't get forced awake
    fail;
    // It will take MAX(time1,time2) for the program to finish.

If you're a Firefox user, you're probably familiar with this screen. Using tasks with isolated failure would prevent the entire browser from crashing if one particular tab crashed.

Wrap-Up

I'd also like to note that asynchronous failure is one of the few sources of nondeterminism in Rust. This code, for example, is dependent on task scheduling patterns:

    fn random_bit() -> bool {
        let result = do task::try {  // supervised
            do task::spawn { fail; }  // linked
            // Might get through here ok; might get killed.
        };
        return result.is_success();
    }

The fact that Rust has no shared state between tasks makes it difficult to trip over inherent randomness in scheduling patterns.

Other sources of nondeterminism include (1) a certain library for shared state, which I'll talk about in my next post; (2) the ability to select on multiple pipes at once; (3) the ability to detect when a pipe endpoint was closed before the message was received (called "try_send()"); and of course (4) system I/O (which includes random number generation). Eric Holk and I believe that in the absence of these five things, Rust code (including one-to-one pipe communication) is deterministic.

If you're interested, the slide deck I used for my end-of-internship presentation on linked failure (with more of the same pictures) is here.

Rust (1): Primer

2012-09-04T15:14:00.003-04:00

I spent my summer at Mozilla Research working on Rust. There were several interesting things I did that I'll write about in subsequent posts; this one is an introduction/primer.

Rust is an experimental, still-in-development language that is geared towards parallelism and performance while at the same time providing a strong static type system. (You can read the other buzzwords on the website.)

Syntax Primer

On the front page of Rust's website, there is a code snippet:

    fn main() {
        for 5.times {
            println("Here's some Rust!");
        }
    }

This looks sort of cutesy and imperative, but actually there is some syntax sugar going on which facilitates a more functional-programming idiom. The above code is equivalent to:

    fn main() {
        times(5, || { println("Here's some Rust!"); true });
    }

where "|args*| { stmt* }" is the lambda/closure syntax (like in Ruby), and "times" is a core library function implemented as:

    fn times(count: uint, blk: fn() -> bool) {  // 'blk' is a stack-allocated closure
        if count > 0 {
            if blk() {  // Only continue looping if blk succeeds
                times(count-1, blk);  // Iterate until count hits 0
            }
        }
    }

The long and short of this is that idiomatic Rust typically has a lot of curly-brace "control flow blocks" that are actually closures, and higher-order functions are commonplace.

Concurrency

So, when I was giving my end-of-internship talk (which I'll link in my next post), I showed how easy it is to add parallelism to your Rust program.

    fn main() {
        for 5.times {
            do task::spawn { // create 5 tasks to print a message in parallel
                println("Here's some Rust!");
            }
        }
    }

'task::spawn' has the signature "fn spawn(child: ~fn())" and is implemented with magic (unsafe code and runtime calls) internally. The 'do' syntax is similar to the 'for' syntax, but doesn't use the "iteration protocol" in which the closure returns bool.

(That code is equivalent to "times(5, || { task::spawn(|| { println("..."); }); true });".)

The Memory Model

If you've a sharp eye, you're wondering what that "~" is that I snuck in on the type of the closure for the child task. That's actually a pointer type, of which Rust has three (none of which can be null, by the way):

~T is a unique pointer to a T. It points to memory allocated in the send heap, which means data inside of unique pointers can be sent between tasks. You can copy unique pointers, but only by deeply copying (otherwise they wouldn't be unique!). By default they are "non-implicitly-copyable", so the compiler will issue warnings if you copy them without writing the "copy" keyword.
@T is a managed pointer to a T. Currently, these are reference-counted and cycle-collected (they may be full-on GCed in the future). Copying one increments the reference count, so multiple managed pointers can point to the same data. These are allocated on a per-task private heap, and cannot be sent between tasks.
&T is a borrowed pointer to a T. It can point to the inside of arbitrary data structures - on the stack, inside ~ or @ pointers, etc. Rust has a static analysis, called the "borrow checker", that ensures that borrowed pointers do not outlive the scope of the pointed-to data (i.e., it is impossible for Rust programs to have a use-after-free).

Behind this analysis is a sophisticated region system, developed by Niko Matsakis, which you can read about in this tutorial on his blog. I'll also talk a bit more about these in a later post.

The end result here is that in Rust there can be no shared state between tasks; tasks may only communicate by message-passing or by moving unique values into unique closures. More technically said, there is an inherent "send" kind that denotes whether a type may be sent to another task. ~T is sendable if T is sendable; @T and &T are never sendable; structs (conjunctive types) and enums (disjunctive types) are sendable if their contents are sendable; primitive types are always sendable.
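Those sendability rules are a structural recursion over types. As a rough illustration only, here is a toy model of them in C. This is not how the Rust compiler represents types or kinds, and every name in it (type_desc, is_sendable, the tags) is made up for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy description of a type, just detailed enough to state the rules. */
enum type_tag { PRIMITIVE, UNIQUE_PTR, MANAGED_PTR, BORROWED_PTR, AGGREGATE };

struct type_desc {
    enum type_tag tag;
    const struct type_desc *pointee;         /* used by the three pointer tags */
    const struct type_desc *const *members;  /* used by AGGREGATE (struct or enum) */
    size_t num_members;
};

/* Returns true if a value of this type could be sent to another task. */
bool is_sendable(const struct type_desc *t)
{
    switch (t->tag) {
    case PRIMITIVE:
        return true;                     /* primitives: always sendable */
    case UNIQUE_PTR:
        return is_sendable(t->pointee);  /* ~T: sendable iff T is */
    case MANAGED_PTR:
    case BORROWED_PTR:
        return false;                    /* @T and &T: never sendable */
    case AGGREGATE:
        for (size_t i = 0; i < t->num_members; i++) {
            if (!is_sendable(t->members[i]))
                return false;
        }
        return true;                     /* every member is sendable */
    }
    return false;  /* not reached for valid tags */
}
```

In this model a struct holding a primitive and a ~T of a primitive is sendable, while the same struct with an @T field anywhere inside it (even behind a ~ pointer) is not.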

Communication

Tasks can pass messages between each other using pipes, which are Rust's communication primitive. Pipes consist of a send endpoint and a receive endpoint, each of which is a noncopyable type (or "linear type", by correspondence with linear logic).

Pipes' noncopyability ensures that communication is one-to-one (i.e., multiple tasks cannot send or receive on the same pipe), which allows their internal synchronisation implementation to be much simpler than N-to-N might require, and hence also be blazing fast. The other benefit of noncopyability is it allows for pipe protocols, statically-enforced send/receive state machines that ensure you can't send/receive values of the "wrong" type, or (for example) try to receive when the other endpoint is also receiving.

I was working closely this summer with Eric Holk, the one responsible for pipes. You can read more about them (some examples, some performance, some type theory) on his blog.

Conclusion

I've got several more posts coming up to talk about the two cool things I personally worked on this summer. Hopefully this post has gotten you enough up to speed on what's going on in Rust to follow along with what I did.

Hopefully also I've gotten you excited about using Rust to write parallel programs that are both safe and performant. I know I am.

Linux's leap-second deadlocks

2012-07-01T06:23:00.001-04:00

Intro

The leap second is an extra second we insert irregularly at midnight at the end of certain months as determined by astronomers. UTC clocks render this as 23:59:60.

Yesterday at that time, linux servers around the world became wedged or experienced huge CPU spikes due to deadlock bugs in the leap second code. This post was linked on hackernews today, and has a good summary of some of the bugs in the comments. Here I'll discuss the leap second deadlocks from a concurrency researcher's perspective.

Five Bugs

I did a bit of digging to see where in Linux's code things were actually going wrong. It turns out there have actually been five different bugs related to the leap second management. It also turns out that we've seen linux deadlock at the leap second before, in 2007. (Race conditions can exist unnoticed for a very long time in infrequently-tested code paths. Who knew?!)

Each of these bugs results from some interaction with a lock called "xtime_lock" (a seqlock, whose write side behaves like a spinlock). Take a look at this code (trimmed and approximated a bit), which has moved and changed between different functions over the years but currently lives in "ntp_leap_second" in kernel/time/ntp.c.

        write_seqlock(&xtime_lock);
        switch (time_state) {
        case TIME_INS:
                timekeeping_leap_insert(-1);
                time_state = TIME_OOP;
                clock_was_set();
                printk(KERN_NOTICE
                        "Clock: inserting leap second 23:59:60 UTC\n");
                break;
        case TIME_DEL:
                timekeeping_leap_insert(1);
                time_state = TIME_WAIT;
                clock_was_set();
                printk(KERN_NOTICE
                        "Clock: deleting leap second 23:59:59 UTC\n");
                break;
        // (more cases omitted ...)
        }
        write_sequnlock(&xtime_lock);

I will list the bugs in chronological order.

1. One deadlock is described here (fixed by same, in 2007). The function clock_was_set() calls smp_call_function() to "retrigger CPU local events" in the high-resolution timer subsystem. Unfortunately, it's forbidden to call smp_call_function() in "atomic context", which this code is in because it holds a spinlock (and moreover, is running in the timer interrupt handler). This was "fixed" in this commit - they simply removed the call.
2. Another deadlock bug is shown in this post (it was fixed in 2008, according to this - but not all machines' linux versions had that fix for the 2008-2009 new year's leap second). The above code's call to printk actually needs to schedule the logging daemon kthread in order to print. Linux has a complicated "completely fair" scheduling algorithm that, under enough system load, needs to check the timer to determine what scheduling pattern to use. Thus, only under heavy load, the call to printk() while holding xtime_lock would attempt to acquire xtime_lock again, causing deadlock.
3. A third deadlock bug, linked from the serverfault post I linked to in the intro, is shown and fixed in this commit, dated a month and a half ago. Nine days prior, the "ntp_lock" spinlock was split out from "xtime_lock" for finer locking granularity, but the split got the lock ordering wrong here, making it circular. Only kernels built from this nine-day window would have this bug.
4. Despite all these bugs' fixes, yesterday saw linux servers having problems around the world. Many people reported huge CPU spikes, which turned out to result from futex misbehaviours. (Literally while writing this post, the systems hacker friend who told me yesterday about the bug had the same futex problem on his own server.) It turned out that in bug #1 above, removing clock_was_set() was wrong after all. Technical details about this are here (also, major props to John Stultz, the same guy who fixed bug #3, for being on the case promptly last night, and for offering clarification when I emailed him). In short, the bug happened because the missing call caused sub-second high-resolution timers to always return immediately, which causes userspace applications that use them in loops to instead run in tight loops eating up CPU. The popular-seeming fix to this was to run 'date -s "`date`"', which calls settimeofday(), which calls clock_was_set(), replacing the missing call (reference).
5. And yet, there is still another bug lurking whose cause nobody seems to have discovered yet. In the serverfault post I linked in the intro, there was also an error message "[3161000.864001] BUG: spinlock lockup on CPU#1, ntpd/3358". This message is printed when the kernel detects a spinlock has been held for a second or more, indicating (you guessed it) deadlock. The linux folks don't appear to have figured this one out yet (not to disparage them - it's only been a day).

Research Applicability

An obvious zeroth-order conclusion: There were/are disproportionately many disastrous race conditions in the leap second code simply because it's such an infrequently-executed codepath. Hence, there's a call for some extra form of verification apart from "run the code as it is" - whether it's code-coverage-based stress tests, symbolic execution, or static analysis. I think static analysis would do best here.

One interesting thing to note is all of the locks involved here are global - the xtime_lock and the ntp_lock, and also the "interrupt handling context" which is a global property of the code. Projects like RacerX (paper pdf) would be well capable of finding deadlocks #2 and #3 by simple callgraph analysis.
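To make that a bit more concrete, here is a minimal sketch of the kind of check involved. It is not RacerX's actual algorithm, and all the names are made up for illustration. It assumes some earlier pass over the callgraph has already produced "lock B is acquired while lock A is held" edges, and it only looks for a cycle among them. Deadlock #2 would show up as a self-edge (xtime_lock taken while xtime_lock is already held), and deadlock #3 as a cycle between two locks (presumably xtime_lock and ntp_lock).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCKS 8

/* acquired_while_holding[a][b] is true if some code path takes lock b
 * while already holding lock a. */
struct lock_graph {
    int num_locks;
    bool acquired_while_holding[MAX_LOCKS][MAX_LOCKS];
};

void lock_graph_init(struct lock_graph *g, int num_locks)
{
    assert(num_locks > 0 && num_locks <= MAX_LOCKS);
    g->num_locks = num_locks;
    for (int a = 0; a < MAX_LOCKS; a++)
        for (int b = 0; b < MAX_LOCKS; b++)
            g->acquired_while_holding[a][b] = false;
}

void lock_graph_add_edge(struct lock_graph *g, int held, int acquired)
{
    assert(held >= 0 && held < g->num_locks);
    assert(acquired >= 0 && acquired < g->num_locks);
    g->acquired_while_holding[held][acquired] = true;
}

/* Depth-first search. state: 0 = unvisited, 1 = on the current path, 2 = done. */
static bool visit(const struct lock_graph *g, int lock, int *state)
{
    state[lock] = 1;
    for (int next = 0; next < g->num_locks; next++) {
        if (!g->acquired_while_holding[lock][next])
            continue;
        if (state[next] == 1)
            return true;  /* back edge: a cycle (self-edges included) */
        if (state[next] == 0 && visit(g, next, state))
            return true;
    }
    state[lock] = 2;
    return false;
}

/* Returns true if the lock-order graph has a cycle, i.e. a possible deadlock. */
bool lock_graph_has_cycle(const struct lock_graph *g)
{
    int state[MAX_LOCKS] = { 0 };
    for (int lock = 0; lock < g->num_locks; lock++) {
        if (state[lock] == 0 && visit(g, lock, state))
            return true;
    }
    return false;
}
```

The hard part in practice is extracting those edges from real kernel code (function pointers, interrupt context, and so on), which this sketch doesn't attempt.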

Bug #1 is especially interesting, though. This "can't call scheduler code while in interrupt context" is a very simple property, for which there's currently a runtime check (in kernel/smp.c):

    /* Can deadlock when called with interrupts disabled. */
    WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled() && !oops_in_progress);

But I think this property can be ensured at compile-time - so that code with this bug will never build, let alone ship to production systems. I'd thought about this before, and last fall I wrote a primitive checker for this as a class project, which I called Atomic All-Nighters, and which would have been directly capable of finding bug #1 without ever even compiling or running the code. Here's the writeup pdf.

My high-minded idealist goal for the Atomic All-Nighters project is to get these static properties of the code representable by constructs in the kernel programming language itself. That way, the compiler would refuse to build any code that had this sort of bug. I'm going to work more on that project when I get back to CMU to start on my ph.d. in the fall. I'm excited for it.

thesis

2012-05-23T00:52:00.002-04:00

Two weeks ago I gave my master's thesis defence presentation, and last Friday I submitted the thesis document itself. Here is my defence slide deck, and here is the dissertation.

Among other things, in the dissertation I talk about:

How Landslide deals with the challenges of kernel-space pertaining to systematic exploration
How I evaluated Landslide (I met with students of 15-410 and gave them Landslide, and they found bugs with it)
What testing approaches I think should be involved in the future of systematic exploration
A handful of pictures (some educational and some for levity), and some humorous prose here and there as well

Hope you enjoy!

410 lecture

2012-04-06T12:09:00.001-04:00

Today I presented Landslide to 15-410, in an attempt to get students to try landslide out, so I can evaluate how effective it is when students are using it on their own kernels.

Here are my lecture slides!

converging

2012-02-07T14:09:00.000-05:00

I have been playing around with various settings in the arbiter - the body of code that says when an instruction is interesting enough to be a choice point, and that also chooses whom to run next at a choice point we haven't visited before - and am interested to note the differences in the direction of tree traversal.

At any choice point we haven't visited before, choosing any thread to run will make progress with the search (the real progress logic is in the explorer, which marks off old branches as all explored, uninteresting, or still interesting) - so the arbiter can choose to continue running the current thread (unless it descheduled itself), or to preempt and run somebody else. I call staying with the current thread the "leftmost branch", so a left-to-right tree traversal is one which starts off with fewer preemptions.

The interesting trend is that left-to-right traversals seem to take more time to find a bug (which I did not expect), but produce shorter, more concise choice traces at the end (which I did). Traversing left-to-right finds an 8-long choice trace (2-3-4-2-3-2-3-4) in 6.5 minutes, while a right-to-left traversal of the same tree finds a 15-long choice trace (2-3-2-3-4-3-4-3-4-3-2-4-2-3-4) in 2.5 minutes.
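For what it's worth, the difference between the two traversal orders can be sketched as nothing more than the order in which the branches at a choice point get tried. This is only an illustration, not the arbiter's real code; order_branches and its parameters are names I'm making up here, and it assumes the set of runnable threads is already known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fills out[] with the runnable thread IDs in the order their branches would
 * be explored. The current thread, if it is still runnable, is the "leftmost"
 * branch; the other threads keep their relative order. Returns the number of
 * IDs written (always n). */
size_t order_branches(int current_tid, const int *runnable, size_t n,
                      bool left_to_right, int *out)
{
    size_t count = 0;
    assert(runnable != NULL && out != NULL);

    /* Leftmost branch: keep running the current thread (no preemption). */
    for (size_t i = 0; i < n; i++) {
        if (runnable[i] == current_tid)
            out[count++] = runnable[i];
    }
    /* Every other branch preempts the current thread. */
    for (size_t i = 0; i < n; i++) {
        if (runnable[i] != current_tid)
            out[count++] = runnable[i];
    }
    /* A right-to-left traversal tries the same branches in reverse. */
    if (!left_to_right) {
        for (size_t i = 0; i < count / 2; i++) {
            int tmp = out[i];
            out[i] = out[count - 1 - i];
            out[count - 1 - i] = tmp;
        }
    }
    return count;
}
```

With threads 2, 3 and 4 runnable and thread 3 currently running, left-to-right tries 3, 2, 4 and right-to-left tries 4, 2, 3, so the right-to-left search starts out by preempting at every opportunity.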

Also, with one particular bug that I edited the guest kernel to have, a different failure came up in each traversal order (one tripped an assert, and another got a thread wedged on an unowned-but-locked mutex). I wasn't surprised, but still think it is very cool.

Next up: making sure landslide can easily find already-known bugs in student-submitted kernels, rather than just in my own.

the lay of the (kernel-)land

2011-12-08T16:57:00.002-05:00

the main question that's been flying around in my research-head over the last couple months is, "what's actually so different about kernel-space that makes this research, instead of just re-implementing user-space stuff that people have already done?"

if I could write a comprehensive answer to this question, I'd have my thesis, so of course my answer isn't very put-together, but I think the scattered parts I've figured out (mostly thanks to talking to people at the PDL retreat) are concrete enough to talk about here.

environment

it should be no surprise that wrapping a program in a dynamic testing framework is a lot harder when there's no "magic kernel" to sit on the other side of said framework from the thing being tested (in short, the kernel is the thing being tested itself). there are a few possibilities I've thought about for approaching this:

1. running the guest kernel natively on the hardware itself. the testing framework would need to live inside the kernel itself, which presents some nasty challenges: for one, making sure that changing the thing being tested doesn't change the way its bugs show up; for two, it would take a bunch of complicated details to make run at all (such as, how is interrupt delivery controlled? certainly not automatically by the hardware anymore). (another possibility would be to use something like system management mode, which I haven't thought enough about.)

2. simulating the guest kernel. this is what I'm actually doing for the project, using Simics, because it's easiest to actually write code for. controlling the kernel is as simple as asking Simics to call into Landslide once per instruction executed, and also asking it to trigger a timer interrupt when we decide it's time to preempt the currently running kernel thread. the downside, of course, is that simulation is slow, and running code once per simulated instruction makes it even slower.

3. virtualisation. this would be sort of a happy middle-ground, where the guest kernel is running (for the most part) directly on the hardware, so there would be much less performance hit than with full simulation, but also the VMM holds some degree of control over the kernel that the testing framework could make use of. nevertheless, without per-instruction control, it would still require extra trickery both to track exactly what the guest kernel is doing (e.g. when does somebody get added to the runqueue?) and also to decide precisely when to trigger a nondeterminism event. requiring annotations inside the guest kernel would be one approach, but that would have its own limitations in terms of tracking "shared memory accesses" as potential decision points.

exploration strategies

for the most part, the "decision tree" looks about the same as in multithreaded user-space programs, but some challenges arise that are inherent in having the concurrency implementation itself be part of what's being tested. for one, detecting when one thread switches to another is no longer a trivial thing that simply falls out of knowing which process is currently running (the kernel is "the process"), and also, once we decide which thread we want to run, causing that thread to run can't just be done by calling "yield to this thread"; you have to understand what the scheduler needs to have happen to cause that thread to start running. (both of these named problems are already solved in Landslide.)

another interesting related problem arises when considering how to implement partial order reduction (the main algorithm that dBug uses). it works by determining that certain "steps" in each interleaving pattern are "independent" of each other, by virtue of their memory accesses being not in conflict (read/write or write/write). however, since each "step" both starts and ends in the context switcher / runqueue code (which is no longer a magic black box), there will necessarily be memory accesses to the same locations in each. some concessions will need to be made in accuracy to be able to say anything is independent of anything else - this is one of the things I intend to implement over the coming months.
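as a minimal sketch of the independence test itself (this is the textbook conflict check, not dBug's or landslide's actual code; the struct and function names are made up, and it assumes each step's memory accesses have been recorded as address/is-write pairs):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_access {
    uintptr_t addr;
    bool is_write;
};

/* Two accesses conflict if they touch the same address and at least one
 * of them is a write (read/write or write/write). */
static bool accesses_conflict(const struct mem_access *a,
                              const struct mem_access *b)
{
    return a->addr == b->addr && (a->is_write || b->is_write);
}

/* Two steps are independent if no access in one conflicts with any access
 * in the other. Quadratic in the number of accesses, fine for a sketch. */
bool steps_independent(const struct mem_access *step1, size_t n1,
                       const struct mem_access *step2, size_t n2)
{
    for (size_t i = 0; i < n1; i++) {
        for (size_t j = 0; j < n2; j++) {
            if (accesses_conflict(&step1[i], &step2[j]))
                return false;
        }
    }
    return true;
}
```

the kernel-space problem is visible right away: if every step's access list includes a write to the same runqueue address, steps_independent returns false for every pair, so some of those accesses would have to be filtered out (at a cost in accuracy) before the check says anything useful.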

possible applications

one major perspective I gained at the PDL retreat was that the things people want to debug kernels for are different than the things people want to debug userspace programs for. for the most part, the core bits of professional kernels (Linux, ...) are correct, since that's where all the talented programmers look most often; it's where the less talented programmers work - i.e., device drivers - that problems tend to show up.

(the reason I'm getting away with not thinking about this in my project itself is because the case study is Pebbles, the kernel that students implement for 15-410.)

so thinking about device drivers is a big avenue for future development of my subject. instead of just saying, "the timer interrupt is the only source of nondeterminism we need; we'll test for races between threads", the races we look for would be between interrupt handlers, bottom-halves, and threads executing related system calls. in this application, it's less meaningful to try to use a bunch of threads, and more meaningful to focus on the code paths themselves, so the timer interrupt would play second-fiddle to device interrupts (and the data that the device would be communicating with those interrupts, such as network packets or disk read buffers).

another possible avenue to think about is kernels that run on "non-CPU" processors - for example, controller firmware for storage devices (the most striking example I remember discussing) is structured in a concurrent way to service parallel i/o requests. such "kernels" are already tested in simulated environments (so said the guy I was talking with about this), and so instrumenting the simulators with a dynamic race detection framework would not be too big of a step up.

(in summary)

for the most part, these are all "pipe dream"-type thoughts, since I have all of a year to do the project itself and show that it works at all in kernel environments. but of course part of research is justifying its potential broader applications, so this stuff I consider to be the building blocks of my future work section.

vanish_vanish

2011-11-10T19:37:00.000-05:00

a few weeks ago i hit the big milestone of being able to actually explore the decision tree. in addition to the meta-scheduler, this required me to build the following pieces of infrastructure:

- tree explorer, to find the next unexplored branch. (71 lines of code)
- save-and-restore, to travel back in time, as previously discussed (310 lines of code)
- pretty-printing / diagnostics-reporting "found a bug" routine (42 lines of code)
- deadlock detection (trivial, given the agent-tracking code already in place)

i've been using vanish_vanish, a small userspace program that simply forks a child, which vanishes, then vanishes itself, to test the infrastructure as it comes into existence. the reason this is interesting is because i modified my student kernel so its vanish implementation does approximately as follows:

void NORETURN vanish()
{
    task_lock(self);
    ...
    for_each_child(self) {
        task_lock(child); // A
        child->parent = &init_process;
        task_unlock(child);
    }
    ...
    task_lock(self->parent); // B
    notify_waiter(self->parent);
    task_unlock(self->parent);
    ...
}

the astute programmer will note that a concurrently vanishing parent and child could deadlock if the parent tries to execute line A while the child tries to execute line B, but this is not guaranteed. the extra nice thing about this test case is that to find the bug, one only needs to consider calls to task_lock as potential choice points (actually, only even such calls from within vanish, which is what i'm really doing here). (i'm going to add shared memory accesses soon, but that will make the tree explode, which would be bad for finding whether my infrastructure is correct.)

using vanish_vanish as the test case and the naively-implemented guest kernel, landslide can find the deadlock in a little bit over 4 minutes:

...

[SCHEDULE]      about to switch threads 3 -> 4
[SCHEDULE]      about to switch threads 4 -> 3
[SCHEDULE]      DEADLOCK! (3 -> 4 -> 3)
[BUG!]          ****    A bug was found!   ****
[BUG!]          **** Choice trace follows. ****
[BUG!]          Choice 1:       at eip 0x00105ae5, trigger_count 1144575, TID 1
[BUG!]          Choice 2:       at eip 0x00105ae5, trigger_count 1158049, TID 2
[BUG!]          Choice 3:       at eip 0x00106930, trigger_count 1677311, TID 3
[BUG!]          Choice 4:       at eip 0x00106930, trigger_count 1677854, TID 4
[BUG!]          Choice 5:       at eip 0x00106930, trigger_count 1712596, TID 3
[BUG!]          Choice 6:       at eip 0x00106930, trigger_count 1747558, TID 4
[BUG!]          Choice 7:       at eip 0x00106930, trigger_count 1747805, TID 3
[BUG!]          Choice 8:       at eip 0x00106930, trigger_count 1748273, TID 2
[BUG!]          Choice 9:       at eip 0x00106930, trigger_count 1749356, TID 2
[BUG!]          Choice 10:      at eip 0x00106930, trigger_count 1750372, TID 4
[BUG!]          Current eip 0x00106957, trigger_count 1750826, total triggers 61208725
[BUG!]          Total choice points 2705, total backtracks 1378, depths 18934

apart from looking impressive, this output is a good prompt to talk about where the project is headed next.

to conclude that landslide can find deadlocks such as this in under 5 minutes is not an honest statistic, because the decision tree will get a lot bigger when i enrich the set of decision points, which is a necessary next step. (consider: if you removed one of the task_lock calls, there would still be a race, but it would not be a deadlock anymore, and we would need to look at shared memory accesses to find it.) what we can conclude from this test output, however, is the exploration rate: on average, it takes landslide 0.09 seconds per decision point in the tree. it's not clear how fast this is in terms of my eventual goal; this will become clear when i build a better set of decision points and try testing for more subtle race conditions.
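as a sanity check on that rate, assuming it is simply wall-clock time divided by the 2705 total choice points reported at the end of the trace (the exact run time isn't recorded above, only "a little bit over 4 minutes"):

```latex
\[
  2705 \ \text{choice points} \times 0.09 \ \frac{\text{s}}{\text{choice point}}
  \approx 243 \ \text{s} \approx 4 \ \text{min} \ 3 \ \text{s}
\]
```

which does come out to a little bit over 4 minutes.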

i'm certainly proud to have enough infrastructure to be able to say "look! it works!", but it's not clear how much weight that should carry until i have some evidence from a more realistic test case.

PDL retreat

2011-11-09T13:19:00.001-05:00

i presented landslide at the PDL retreat, as a 10-minute work-in-progress talk and also during the poster sessions. (here are my slides.)

there's not much to say that wouldn't sound trite - in short, how great it was to synchronise with people from industry about what my project and i ought to be expecting out of each other, and also how great it was to get so much different advice from people about how to chart a path for my future.

one thing that's really worth mentioning is that during the industry feedback session at the end of the retreat, christos from vmware called out my project as "the Holy Grail for anybody who does kernel development; I'd like to see that work moving forward."

time travel

2011-09-17T20:31:00.005-04:00

an important tenet of dynamic race condition detection is being able to selectively go back in time to an earlier point in the guest's execution and make a different decision about which thread interleaving to choose next. the set of possible execution interleavings can be thought of as a tree, where each edge/node represents a scheduling decision (which thread to run next?) at a choice point (a point at which it might be interesting to preempt the guest).

the really interesting question that I'll be tackling in the months to come is how to reduce the size of this tree to something manageable - for which I'll be borrowing ideas from dBug, possibly among other previous work - but for now, i'm setting up infrastructure that lets me explore the tree.
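a minimal sketch of what "exploring the tree" boils down to, assuming each choice point on the current path remembers which of its branches have been fully explored (choice_node, MAX_CHOICES and find_backtrack_point are made-up names for illustration, not landslide's real structures):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHOICES 4

/* One choice point on the path from the root to the current leaf. */
struct choice_node {
    struct choice_node *parent;
    int num_choices;             /* how many threads could run here */
    bool explored[MAX_CHOICES];  /* explored[i]: subtree under choice i is done */
};

/* Walks from the leaf towards the root, looking for the nearest choice point
 * that still has an unexplored branch. On success, stores the branch index
 * in *choice and returns the node to jump back to; returns NULL once the
 * whole tree has been explored. */
struct choice_node *find_backtrack_point(struct choice_node *leaf, int *choice)
{
    for (struct choice_node *n = leaf; n != NULL; n = n->parent) {
        for (int i = 0; i < n->num_choices; i++) {
            if (!n->explored[i]) {
                *choice = i;
                return n;
            }
        }
    }
    return NULL;
}
```

the node it returns is the point in the guest's execution to travel back to, and the explored flags are exactly the sort of bookkeeping that has to survive the trip.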

in order to do so productively, though, landslide needs to preserve at least some state when jumping back to an earlier point to make a different decision, or else it won't be able to do any meaningful analysis (e.g., where to prune the tree, or even where it's visited before). but, landslide also tracks a lot of information about the guest kernel's execution (so it can know, for example, which threads are runnable), and that information should not be preserved, since it's specific to each particular execution path.

what data is important to preserve when travelling "back in time", as it were, and what data should we discard? the implementation mechanism for restoring old state is already somewhat obtuse, and combined with all the different types of state that landslide manages, makes this somewhat difficult to think about.

I realised, though, that there's already been a situation where I had to solve exactly this problem...

now I need to figure out how to work the game's other mechanics into my code as well, right?

The Importance of Being Assertive, A Trivial Style Guideline for Serious Programmers

2011-07-15T17:58:00.005-04:00

one of the key ways that landslide "talks" to the guest kernel (i.e., manipulating thread interleaving patterns) is by triggering timer interrupts at key points in the kernel's execution. in terms of very general race detection, this is about the equivalent of looking at the code and thinking "what happens if we get preempted here?" (which is the traditional way of writing "race-free" concurrent code); of course, in this project, it will have some extra automated cleverness attached so it can be effective.

because landslide's code gets invoked at every instruction of the guest's execution, we must have a notion of a scheduling operation being "still in progress" - that is, after triggering a timer interrupt, there will be several instructions before a target thread (one we decided to switch to) actually starts running. if we take note of when the target thread starts up, we can provide a few guarantees about the instructions that will be run until then - namely, that the kernel must be executing in an interrupt handler and/or the context switcher (and NOT running the "meaty bits" of any other thread's code-path). this may seem obvious, but it is still an invariant that i would like to rely on, so i expressed it as an assert in the code:

    assert(ACTION(scheduler, context_switch) || HANDLING_INTERRUPT(scheduler));

imagine my surprise, testing the implementation, when this assert tripped!

after a bit of debugging, i discovered that the invariant violation was happening at the instruction immediately following the instruction at which i tried to trigger a timer interrupt. it turns out that, in some cases, simics may decide to delay interrupt processing by a few instructions (seemingly by a non-deterministic amount, too) after i set the CPU's pending interrupt flags.

the fix (ensuring that when i decide to trigger a timer interrupt it is actually received immediately) is nothing special; what is important is to realise that my programming environment had some peculiarity that broke an assumption that i didn't even realise i was making when i established the invariants of my own code. so upon finding this new assumption (and writing code to make sure it worked), i added another assert:

    if (scheduler->just_triggered_timer_interrupt) {
        assert(get_current_eip() == get_timer_wrap_begin() &&
               "simics attempted to delay our interrupt! :<");
        scheduler->just_triggered_timer_interrupt = false;
    }

now if something else of this nature goes wrong, i will know immediately, with a useful error message to boot. but imagine if i'd never written that first assert to begin with? simics could have merrily delayed all my interrupts for however long it wanted, i would never have known, and wherever i would decide to trigger interrupts (i.e., notionally good places for exposing race conditions) would have no bearing on when they actually happened! i could spend months on this project and it would never work right and i might never know why.
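for readers who want to play with the idea outside of simics, here is a minimal sketch of the two checks described above. it is an illustration under assumptions, not landslide's actual code: the struct, its fields, and every function name below are hypothetical stand-ins, and the per-instruction hook is simply assumed to be called with the current eip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* hypothetical stand-in for the scheduler-tracking state (not landslide's real struct) */
struct sched_state {
    bool context_switching;              /* guest is inside its context switcher */
    bool handling_interrupt;             /* guest is inside an interrupt handler */
    bool schedule_in_progress;           /* timer was triggered, target thread not yet running */
    bool just_triggered_timer_interrupt; /* set on the instruction where the timer was triggered */
};

/* check 1: while a scheduling operation is in flight, the guest may only be
 * in an interrupt handler and/or the context switcher */
static bool in_flight_invariant_holds(const struct sched_state *s)
{
    return !s->schedule_in_progress || s->context_switching || s->handling_interrupt;
}

/* check 2: the very next instruction after triggering the timer must be the
 * first instruction of the timer handler (timer_entry is an assumed address) */
static bool interrupt_was_immediate(const struct sched_state *s, uint32_t eip,
                                    uint32_t timer_entry)
{
    return !s->just_triggered_timer_interrupt || eip == timer_entry;
}

/* assumed to be called once per guest instruction */
static void check_instruction(struct sched_state *s, uint32_t eip, uint32_t timer_entry)
{
    /* the string literal is always non-NULL, so it only serves as the error message */
    assert(interrupt_was_immediate(s, eip, timer_entry) &&
           "simulator delayed our interrupt");
    s->just_triggered_timer_interrupt = false;
    assert(in_flight_invariant_holds(s) &&
           "thread code ran while a scheduling operation was in progress");
}
```

the predicates are split out from the asserts only so they can be exercised on their own; the `cond && "message"` form is the same trick used in the snippet above to get a readable message when the assert trips.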

use asserts, fellow hackers - not just comments or thoughts in your heads. you'll be happy for it later.

agence

2011-07-04T00:55:00.003-04:00

in order to schedule threads in any interesting way, landslide will need to have pretty tight control over the internal scheduler of the kernel under test. this is not easy, especially since every kernel will have its own slightly different way of tracking scheduler state, so landslide will need a flexible and accurate way of monitoring what's going on in the guest kernel.

i just finished a bit of framework that lets landslide see a "reflection" of the guest kernel's scheduler. soon, we'll be able to bend the guest's scheduling patterns to our whims, but for now, here's what we see when passively viewing the runqueue during the boot-up sequence (comments are my own, post-hoc):

simics> c
switched threads 1 -> -268370093 at 0x1009cf  // garbage from init i haven't cleaned up yet
switched threads -268370093 -> 1 at 0x105385  // it's research-quality; give me a break
switched threads 1 -> 2 at 0x102f7b
new agent 2 at eip 0x105613 -- [2, 1]         // shell fork()ed from init
agent 2 gone at eip 0x1056d5 -- [1]           // shell blocks on readline()
agent 1 gone at eip 0x1056d4 -- []            // take init off RQ to schedule
switched threads 2 -> 1 at 0x102f7b
new agent 1 at eip 0x105613 -- [1]            // init back on RQ after context-switch
agent 1 gone at eip 0x1056d5 -- []            // init blocks on wait()

one particularity of the exhibited guest kernel (my student kernel) is that every context-switch involves pulling the target thread off of the runqueue and putting it back on later - which we see clearly here. also, keep in mind that this is all from within simics; the kernel itself is totally unmodified. (obviously, the same output could be achieved by putting print statements into the kernel at key points, which is not the point here.)

i also use the term "agent" to refer to a thread that is currently on the runqueue (i.e. can be context-switched to at any moment); it is recycled terminology from dBug.

anyway, so if i type a command at the prompt, it continues:

new agent 2 at eip 0x105613 -- [2]            // kbd handler sees "\n" and puts shell on RQ
agent 2 gone at eip 0x1056d4 -- []            // take shell off RQ to schedule
switched threads 1 -> 2 at 0x102f7b
new agent 2 at eip 0x105613 -- [2]            // shell back on RQ after context-switch
switched threads 2 -> 3 at 0x102f7b
new agent 3 at eip 0x105613 -- [3, 2]         // shell forks "juggle" process
switched threads 3 -> 4 at 0x102f7b
new agent 4 at eip 0x105613 -- [4, 3, 2]      // "juggle" forks its own child threads...

...and so on.
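the bookkeeping behind output like the above can be sketched in a few lines of C. this is a guess at the shape of the thing, not the real landslide code: the names `agent_list`, `agent_add`, `agent_remove`, and `agent_format` are hypothetical, the fixed capacity is arbitrary, and the newest-first ordering is only inferred from the log (e.g. "[2, 1]" after thread 2 is added).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_AGENTS 16 /* arbitrary capacity for this sketch */

/* hypothetical mirror of the guest's runqueue; index 0 is the newest agent */
struct agent_list {
    int tids[MAX_AGENTS];
    int count;
};

/* record that a thread was put on the runqueue; returns -1 if the mirror is full */
static int agent_add(struct agent_list *l, int tid)
{
    if (l->count >= MAX_AGENTS)
        return -1;
    /* shift existing entries right by one so the new tid goes in front */
    memmove(&l->tids[1], &l->tids[0], (size_t)l->count * sizeof(l->tids[0]));
    l->tids[0] = tid;
    l->count++;
    return 0;
}

/* record that a thread left the runqueue; returns -1 if it was never on it */
static int agent_remove(struct agent_list *l, int tid)
{
    for (int i = 0; i < l->count; i++) {
        if (l->tids[i] == tid) {
            memmove(&l->tids[i], &l->tids[i + 1],
                    (size_t)(l->count - i - 1) * sizeof(l->tids[0]));
            l->count--;
            return 0;
        }
    }
    return -1;
}

/* render the mirror in the "[4, 3, 2]" style; output is truncated if buf is too small */
static void agent_format(const struct agent_list *l, char *buf, size_t len)
{
    assert(l != NULL && buf != NULL && "agent_format needs a list and a buffer");
    if (len == 0)
        return;
    size_t used = (size_t)snprintf(buf, len, "[");
    for (int i = 0; i < l->count && used < len; i++)
        used += (size_t)snprintf(buf + used, len - used, "%s%d",
                                 i ? ", " : "", l->tids[i]);
    if (used < len)
        snprintf(buf + used, len - used, "]");
}
```

in a real module, `agent_add` and `agent_remove` would be driven by noticing the guest reach the addresses where its scheduler enqueues and dequeues threads (the two eips that keep recurring in the log); how those addresses get configured per kernel is left out here.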

secret sauce

2011-06-14T14:38:00.001-04:00

Jiří tells me

mandelbratwurst

2011-06-14T14:27:00.004-04:00

Last friday, my research accidentally met its goal of finding bugs - or at least, in one very specific case.

I was looking into ways for my code to generate keyboard interrupts+input - it turns out simics has a nice interface in the kbd0.key_event attribute - and testing it by triggering some hard-coded keyboard input when the kernel reaches a particular (also hard-coded) point in the fork() path. The input was "mandelbrot\n", which (being received by the shell) should cause the so-named test to start running - except the input would be repeated when fork() runs again, so the kernel's readline() logic would have to deal with multiple lines of input while another program is writing to the console at the same time.

Here's what pathos, the 410 reference kernel, does:

And what BrOS, the student kernel of Eric Faust (who is periodically keeping me company and scoffing at my code), does:

Slightly different, and not in agreement with the Pebbles specification, though Eric claims it's misdesigned rather than buggy. Finally, here's what POBBLES, my own student kernel, does:

Note the top-left character - that is where every single character of the mandelbrot fractal is being drawn. Also, the input string (rather, parts thereof) is drawn five times, despite only being issued twice. No "misdesign" would cause that...

welcome

2011-06-14T12:35:00.004-04:00

Hello! This is the project blog for Ben Blum's 5th year master's project at CMU, advised by Garth Gibson and working with Jiří Šimša.

Jiří has a project called dBug, which does runtime concurrency verification on userland programs by interposing itself between the application under test and the dynamic libraries, thereby controlling the nondeterministic aspects of the program's execution. I hope to extend these same ideas to work in kernel-space, where concurrency issues can be much more intricate and subtle.

15-410 - Operating System Design and Implementation - is a class at CMU where students, in a six-week-long project, implement a small UNIX-like kernel, called Pebbles, from the ground up. Students do most of their development in Simics, an x86 simulator, and typically produce code which has many race conditions, which have to be spotted either by the student during the project or by the grader after the project.

I aim to develop a system for race condition detection on Pebbles kernels in the form of a Simics module, to be used by the 410 course staff to help grade student submissions and by the students to help debug their code, and hopefully to serve as a starting point for similar tools in more complicated environments (i.e., industrial kernels such as Linux).

The project is called landslide because it shows that pebbles kernels are not as stable as one might hope.