the lay of the (kernel-)land

the main question that's been flying around in my research-head over the last couple months is, "what's actually so different about kernel-space that makes this research, instead of just re-implementing user-space stuff that people have already done?"

if I could write a comprehensive answer to this question, I'd have my thesis, so of course my answer isn't very put-together, but I think the scattered parts I've figured out (mostly thanks to talking to people at the PDL retreat) are concrete enough to talk about here.


it should be no surprise that wrapping a program in a dynamic testing framework is a lot harder when there's no "magic kernel" to sit on the other side of said framework from the thing being tested (in short, the kernel is the thing being tested itself). there are a few possibilities I've thought about for approaching this:

1. running the guest kernel natively on the hardware itself. the testing framework would need to live inside the kernel itself, which presents some nasty challenges: for one, making sure that changing the thing being tested doesn't change the way its bugs show up; for two, it would take a bunch of complicated details to make run at all (such as, how is interrupt delivery controlled? certainly not automatically by the hardware anymore). (another possibility would be to use something like system management mode, which I haven't thought enough about.)

2. simulating the guest kernel. this is what I'm actually doing for the project, using Simics, because it's easiest to actually write code for. controlling the kernel is as simple as asking Simics to call into Landslide once per instruction executed, and also asking it to trigger a timer interrupt when we decide it's time to preempt the currently running kernel thread. the downside, of course, is that simulation is slow, and running code once per simulated instruction makes it even slower.

3. virtualisation. this would be sort of a happy middle-ground, where the guest kernel is running (for the most part) directly on the hardware, so there would be much less performance hit than with full simulation, but also the VMM holds some degree of control over the kernel that the testing framework could make use of. nevertheless, without per-instruction control, it would still require extra trickery both to track exactly what the guest kernel is doing (e.g. when does somebody get added to the runqueue?) and also to decide precisely when to trigger a nondeterminism event. requiring annotations inside the guest kernel would be one approach, but that would have its own limitations in terms of tracking "shared memory accesses" as potential decision points.

exploration strategies

for the most part, the "decision tree" looks about the same as in multithreaded user-space programs, but some challenges arise that are inherent from having the concurrency implementation itself being part of what's being tested. for one, detecting when one thread switches to another is no longer a trivial thing that simply falls out of knowing which process is currently running (the kernel is "the process"), and also, once we decide which thread we want to run, causing that thread to run can't just be done by calling "yield to this thread"; you have to understand what the scheduler needs to have happen to cause that thread to start running. (both of these named problems are already solved in Landslide.)

another interesting related problem arises when considering how to implement partial order reduction (the main algorithm that dBug uses). it works by determining that certain "steps" in each interleaving pattern are "independent" of each other, by virtue of their memory accesses being not in conflict (read/write or write/write). however, since each "step" both starts and ends in the context switcher / runqueue code (which is no longer a magic black box), there will necessarily be memory accesses to the same locations in each. some concessions will need to be made in accuracy to be able to say anything is independent of anything else - this is one of the things I intend to implement over the coming months.

possible applications

one major perspective I gained at the PDL retreat was that the things people want to debug kernels for are different than the things people want to debug userspace programs for. for the most part, the core bits of professional kernels (Linux, ...) are correct, since that's where all the talented programmers look most often; it's where the less talented programmers work - i.e., device drivers - that problems tend to show up.

(the reason I'm getting away with not thinking about this in my project itself is because the case study is Pebbles, the kernel that students implement for 15-410.)

so thinking about device drivers is a big avenue for future development of my subject. instead of just saying, "the timer interrupt is the only source of nondeterminism we need; we'll test for races between threads", the races we look for would be between interrupt handlers, bottom-halves, and threads executing related system calls. in this application, it's less meaningful to try to use a bunch of threads, and more meaningful to focus on the code paths themselves, so the timer interrupt would play second-fiddle to device interrupts (and the data that the device be communicating with those interrupts, such as network packets or disk read buffers).

another possible avenue is to think about is kernels that run on "non-CPU" processors - for example, controller firmware for storage devices (the most striking example I remember discussing) are structured in a concurrent way to service parallel i/o requests. such "kernels" are already tested in simulated environments (so said the guy I was talking with about this), and so instrumenting the simulators with a dynamic race detection framework would not be too big of a step up.

(in summary)

for the most part, these are all "pipe dream"-type thoughts, since I have all of a year to do the project itself and show that it works at all in kernel environments. but of course part of research is justifying its potential broader applications, so this stuff I consider to be the building blocks of my future work section.



a few weeks ago i hit the big milestone of being able to actually explore the decision tree. in addition to the meta-scheduler, this required me to build the following pieces of infrastructure:

- tree explorer, to find the next unexplored branch. (71 lines of code)
- save-and-restore, to travel back in time, as previously discussed (310 lines of code)
- pretty-printing / diagnostics-reporting "found a bug" routine (42 lines of code)
- deadlock detection (trivial, given the agent-tracking code already in place)

i've been using vanish_vanish, a small userspace program that simply forks a child, which vanishes, then vanishes itself, to test the infrastructure as it comes into existence. the reason this is interesting is because i modified my student kernel so its vanish implementation does approximately as follows:

void NORETURN vanish()
    for_each_child(self) {
        task_lock(child);  // A
        child->parent = &init_process;
    task_lock(self->parent);  // B

the astute programmer will note that a concurrently vanishing parent and child could deadlock if the parent tries to execute line A while the child tries to execute line B, but this is not guaranteed. the extra nice thing about this test case is that to find the bug, one only needs to consider calls to task_lock as potential choice points (actually, only even such calls from within vanish, which is what i'm really doing here). (i'm going to add shared memory accesses soon, but that will make the tree explode, which would be bad for finding whether my infrastructure is correct.)

using vanish_vanish as the test case and the naively-implemented guest kernel, landslide can find the deadlock in a little bit over 4 minutes:

[SCHEDULE]      about to switch threads 3 -> 4
[SCHEDULE]      about to switch threads 4 -> 3
[SCHEDULE]      DEADLOCK! (3 -> 4 -> 3)
[BUG!]          ****    A bug was found!   ****
[BUG!]          **** Choice trace follows. ****
[BUG!]          Choice 1:       at eip 0x00105ae5, trigger_count 1144575, TID 1
[BUG!]          Choice 2:       at eip 0x00105ae5, trigger_count 1158049, TID 2
[BUG!]          Choice 3:       at eip 0x00106930, trigger_count 1677311, TID 3
[BUG!]          Choice 4:       at eip 0x00106930, trigger_count 1677854, TID 4
[BUG!]          Choice 5:       at eip 0x00106930, trigger_count 1712596, TID 3
[BUG!]          Choice 6:       at eip 0x00106930, trigger_count 1747558, TID 4
[BUG!]          Choice 7:       at eip 0x00106930, trigger_count 1747805, TID 3
[BUG!]          Choice 8:       at eip 0x00106930, trigger_count 1748273, TID 2
[BUG!]          Choice 9:       at eip 0x00106930, trigger_count 1749356, TID 2
[BUG!]          Choice 10:      at eip 0x00106930, trigger_count 1750372, TID 4
[BUG!]          Current eip 0x00106957, trigger_count 1750826, total triggers 61208725
[BUG!]          Total choice points 2705, total backtracks 1378, depths 18934

apart from looking impressive, this output is a good prompt to talk about where the project is headed to next.

to conclude that landslide can find deadlocks such as this in under 5 minutes is not an honest statistic, because the decision tree will get a lot bigger when i enrich the set of decision points, which is a necessary next step. (consider: if you removed one of the task_lock calls, there would still be a race, but it would not be a deadlock anymore, and we would need to look at shared memory accesses to find it.) what we can conclude from this test output, however, is the exploration rate: on average, it takes landslide 0.09 seconds per decision point in the tree. it's not clear how fast this is in terms of my eventual goal; this will become clear when i build a better set of decision points and try testing for more subtle race conditions.

i'm certainly proud to have enough infrastructure to be able to say "look! it works!", but it's not clear how much weight that should carry until i have some evidence from a more realistic test case.


PDL retreat

i presented landslide at the PDL retreat, as a 10-minute work-in-progress talk and also during the poster sessions. (here are my slides.)

there's not much to say that wouldn't sound trite - in short, how great it was to synchronise with people from industry about what my project and i ought to be expecting out of each other, and also how great it was to get so many different advices from people about how to chart a path for my future.

one thing that's really worth mentioning is that during the industry feedback session at the end of the retreat, christos from vmware called out my project as "the Holy Grail for anybody who does kernel development; I'd like to see that work moving forward."


time travel

an important tenet of dynamic race condition detection is being able to selectively go back in time to an earlier point in the guest's execution and make a different decision about which thread interleaving to choose next. the set of possible execution interleavings can be thought of as a tree, where each edge/node represents a scheduling decision (which thread to run next?) at a choice point (a point at which it might be interesting to preempt the guest).

the really interesting question that I'll be tackling in the months to come is how to reduce the size of this tree to something manageable - for which I'll be borrowing ideas from dBug, possibly among other previous work - but for now, i'm setting up infrastructure that lets me explore the tree.

in order to do so productively, though, landslide needs to preserve at least some state when jumping back to an earlier point to make a different decision, or else it won't be able to do any meaningful analysis (e.g., where to prune the tree, or even where it's visited before). but, landslide also tracks a lot of information about the guest kernel's execution (so it can know, for example, which threads are runnable), and that information should not be preserved, since it's specific to each particular execution path.

what data is important to preserve when travelling "back in time", as it were, and what data should we discard? the implementation mechanism for restoring old state is already somewhat obtuse, and combined with all the different types of state that landslide manages, makes this somewhat difficult to think about.

I realised, though, that there's already been a situation where I had to solve exactly this problem...

now I need to figure out how to work the game's other mechanics into my code as well, right?


The Importance of Being Assertive, A Trivial Style Guideline for Serious Programmers

one of the key ways that landslide "talks" to the guest kernel (i.e., manipulating thread interleaving patterns) is by triggering timer interrupts at key points in the kernel's execution. in terms of very general race detection, this is about the equivalent of looking at the code and thinking "what happens if we get preempted here?" (which is the traditional way of writing "race-free" concurrent code); of course, in this project, it will have some extra automated cleverness attached so it can be effective.

because landslide's code gets invoked at every instruction of the guest's execution, we must have a notion of a scheduling operation being "still in progress" - that is, after triggering a timer interrupt, there will be several instructions before a target thread (one we decided to switch to) actually starts running. if we take note of when the target thread starts up, we can provide a few guarantees about the instructions that will be run until then - namely, that the kernel must be executing in an interrupt handler and/or the context switcher (and NOT running the "meaty bits" of any other thread's code-path). this may seem obvious, but it is still an invariant that i would like to rely on, so i expressed it as an assert in the code:

    assert(ACTION(scheduler, context_switch) || HANDLING_INTERRUPT(scheduler));

imagine my surprise, testing the implementation, when this assert tripped!

after a bit of debugging, i discovered that the invariant violation was happening at the instruction immediately following the instruction at which i tried to trigger a timer interrupt. it turns out that, in some cases, simics may decide to delay interrupt processing by a few instructions (seemingly by a non-deterministic amount, too) after i set the CPU's pending interrupt flags.

the fix (ensuring that when i decide to trigger a timer interrupt it is actually received immediately) is nothing special; what is important is to realise that my programming environment had some peculiarity that broke an assumption that i didn't even realise i was making when i established the invariants of my own code. so upon finding this new assumption (and writing code to make sure it worked), i added another assert:

    if (scheduler->just_triggered_timer_interrupt) {
        assert(get_current_eip() == get_timer_wrap_begin() &&
               "simics attempted to delay our interrupt! :<");
        scheduler->just_triggered_timer_interrupt = false;

now if something else of this nature goes wrong, i will know immediately, with a useful error message to boot. but imagine if i'd never written that first assert to begin with? simics could have merrily delayed all my interrupts for however long it wanted, i would never have known, and wherever i would decide to trigger interrupts (i.e., notionally good places for exposing race conditions) would have no bearing on when they actually happened! i could spend months on this project and it would never work right and i might never know why.

use asserts, fellow hackers - not just comments or thoughts in your heads. you'll be happy for it later.



in order to schedule threads in any interesting way, landslide will need to have pretty tight control over the internal scheduler of the kernel under test. this is not easy, especially since every kernel will have its own slightly different way of tracking scheduler state, so landslide will need a flexible and accurate way of monitoring what's going on in the guest kernel.

i just finished a bit of framework that lets landslide see a "reflection" of the guest kernel's scheduler. soon, we'll be able to bend the guest's scheduling patterns to our whims, but for now, here's what we see when passively viewing the runqueue during the boot-up sequence (comments are my own, post-hoc):

simics> c
switched threads 1 -> -268370093 at 0x1009cf  // garbage from init i haven't cleaned up yet
switched threads -268370093 -> 1 at 0x105385  // it's research-quality; give me a break
switched threads 1 -> 2 at 0x102f7b
new agent 2 at eip 0x105613 -- [2, 1]         // shell fork()ed from init
agent 2 gone at eip 0x1056d5 -- [1]           // shell blocks on readline()
agent 1 gone at eip 0x1056d4 -- []            // take init off RQ to schedule
switched threads 2 -> 1 at 0x102f7b
new agent 1 at eip 0x105613 -- [1]            // init back on RQ after context-switch
agent 1 gone at eip 0x1056d5 -- []            // init blocks on wait()

one particularity of the exhibited guest kernel (my student kernel) is that every context-switch involves pulling the target thread off of the runqueue and putting it back on later - which we see clearly here. also, keep in mind that this is all from within simics; the kernel itself is totally unmodified. (obviously, the same output could be achieved by putting print statements into the kernel at key points, which is not the point here.)

i also use the term "agent" to refer to a thread that is currently on the runqueue (i.e. can be context-switched to at any moment); it is recycled terminology from dBug.

anyway, so if i type a command at the prompt, it continues:

new agent 2 at eip 0x105613 -- [2]            // kbd handler sees "\n" and puts shell on RQ
agent 2 gone at eip 0x1056d4 -- []            // take shell off RQ to schedule
switched threads 1 -> 2 at 0x102f7b
new agent 2 at eip 0x105613 -- [2]            // shell back on RQ after context-switch
switched threads 2 -> 3 at 0x102f7b
new agent 3 at eip 0x105613 -- [3, 2]         // shell forks "juggle" process
switched threads 3 -> 4 at 0x102f7b
new agent 4 at eip 0x105613 -- [4, 3, 2]      // "juggle" forks its own child threads...

...and so on.


secret sauce

Jiří tells me


Last friday, my research accidentally met its goal of finding bugs - or at least, in one very specific case.

I was looking into ways for my code to generate keyboard interrupts+input - it turns out simics has a nice interface in the kbd0.key_event attribute - and testing it by triggering some hard-coded keyboard input when the kernel reaches a particular (also hard-coded) point in the fork() path. the input was "mandelbrot\n", which (being received by the shell) should cause the so-named test to start running - except the input would be repeated when fork() runs again, so the kernel's readline() logic would have to deal with multiple inputted lines while another program is writing to the console at the same time.

Here's what pathos, the 410 reference kernel, does:

And what BrOS, the student kernel of Eric Faust (who is periodically keeping me company and scoffing at my code), does:

Slightly different, and not in agreement with the Pebbles specification, though Eric claims it's misdesigned rather than buggy. Finally, here's what POBBLES, my own student kernel, does:

Note the top-left character - that is where every single character of the mandelbrot fractal is being drawn. Also, the input string (rather, parts thereof) is drawn five times, despite only being issued twice. No "misdesign" would cause that...


Hello! This is the project blog for Ben Blum's 5th year master's project at CMU, under the advisory of Garth Gibson and working with Jiří Šimša.

Jiří has a project called dBug, which does runtime concurrency verification on userland programs by interposing itself between the application under test and the dynamic libraries, thereby to control the nondeterministic aspects of the program's execution. I hope to extend these same ideas to work in kernel-space, where concurrency issues can be much more intricate and subtle.

15-410 - Operating System Design and Implementation - is a class at CMU where students, in a six-week-long project, implement a small UNIX-like kernel, called Pebbles, from the ground up. Students do most of their development in Simics, an x86 simulator, and typically produce code which has many race conditions, which have to be spotted either by the student during the project or by the grader after the project.

I aim to develop a system for race condition detection on Pebbles kernels in the form of a Simics module, to be used by the 410 course staff to help grade student submissions and by the students to help debug their code, and hopefully to serve as a starting point for similar tools in more complicated environments (i.e., industrial kernels such as Linux).

The project is called landslide because it shows that pebbles kernels are not as stable as one might hope.