2012-10-20

What is a "data race" and when is a race not a data race?

I've been meaning to write about this since my post series about Rust (in particular here, where I wrote "while data races are no longer possible, race conditions in general still are" about the RWARC library). In general, Rust statically guarantees freedom from data races, though not freedom from all races. But what does that mean?

A data race is when multiple threads concurrently access the same memory location where at least one access is a write. "Concurrently" here could mean either literally at the same time (threads run on different CPUs) or abstractly at the same time (threads interleave with each other on the same CPU); i.e., no synchronisation primitive enforces that one thread's access completes before the other begins.

Thread 1Thread 2
if (p != NULL)

p = NULL;
output(p->data);


Data race detectors, such as Eraser and Helgrind, analyse threads' mutual-exclusion and happens-before relationships to identify unsafe concurrent accesses like these. But it's possible to stop accesses from being concurrent without enforcing correct behaviour:

Thread 1Thread 2
mutex_lock(m);
bool ok = p != NULL;
mutex_unlock(m);

mutex_lock(m);

p = NULL;

mutex_unlock(m);
mutex_lock(m);
if (ok) output(p->data);
mutex_unlock(m);


Now the data race is gone, but the bug has simply become a higher-level race condition. Most literature calls this an "atomicity violation" (and some literature even uses "race" to mean exclusively data races).

You might think this code looks silly, but if you're working in a project with many layers of abstraction and function/module boundaries, this kind of mistake can be all too easy to make, and data race detectors are powerless to find them.

Consider this real-world example. When I started at Mozilla this summer, Rust 0.2 had recently shipped, and its release notes mentioned that it was "Helgrind-clean" (meaning no data races existed). Yet the Rust runtime contained this code:

bool rust_task::blocked() {
    mutex_lock(this->lifecycle_lock);
    bool is_blocked = this->state == task_state_blocked;
    mutex_unlock(this->lifecycle_lock);
    return is_blocked;
}

Sure, accessing the field was safely protected by the mutex, but once it dropped the lock and returned, all bets were off as to whether the returned value was still accurate or not. (I fixed several bugs related to this, and removed this function entirely.)

In a similar vein, Rust's type system guarantees that concurrent tasks cannot share state but instead must use message-passing to communicate, which precludes the possibility of data races completely by enforcing happens-before relationships on all data accesses (or in the case of this post, by enforcing mutual-exclusion relationships). Yet it's still possible to write nondeterministic programs in Rust (using select2, failure propagation, etc), and so race conditions are still possible.

The moral of this story is that data races are only one of many types of races, and though many tools exist for finding them, just because one guarantees absence of data races does not mean your code is completely concurrency-safe. Not to say these tools aren't useful, but they often fail where more sophisticated race-finding techniques could succeed, and even still, no automated race-finding tool can substitute for a careful human brain when reasoning about concurrency.

What exactly is a "race condition", anyway?

A friend of mine is taking the operating systems class at UMD, in which the second project is to implement inter-process signals. He noted a peculiarity in the specification: that processes are not woken up immediately if they receive a signal while blocked (e.g. on child processes, on keyboard/disk input). As a result, it could be completely random whether or not a process receiving a signal gets killed immediately or sleeps forever.

He discussed this with the professor, and they disagreed over whether this nondeterminism constituted a "race condition" or not. After all, the specification allows for signals to fail to wake up processes under certain circumstances, so there's nothing wrong about implementing it that way. On the other hand, a kernel whose signalling mechanism always wakes up processes in bounded time (i.e., finitely long -- whereas waiting for keyboard input could take forever) could provide stronger guarantees about inter-process communication.

In my interpretation, both arguments don't tell the entire story. For starters, race conditions don't necessarily entail wrong behaviour; I've seen plenty of "benign" race conditions with comments along the lines of "if X and Y race, Z will happen, and this is OK". Benign races aside, though, "race condition" to me means "unexpected behaviour occurs nondeterministically". So, if you want to be precise, it's important to talk about race conditions with respect to certain expectations.

Someone writing a userspace program for this kernel who didn't realise that signals might never get taken (and hence produced code that sometimes accidentally sleeps forever) could say they were bitten by a race in the kernel. But if they'd read the spec carefully, they might've written code that handled the nondeterminism more robustly. They could say the spec's nondeterminism made it less useful than other possible specs, but it wouldn't be fair to blame the particular implementation of this spec for being buggy.

In short, I would say the specification itself has a race condition in it, but implementations thereof don't. What's important is who holds the expectations and who nondeterministically breaks them.