
Concurrency is both fascinating and challenging. As computer scientists, we try to create systems that do many things at once. But when multiple threads or processes share resources, we enter a realm where the usual rules about program order and predictability do not always apply. In this reading report, I will discuss my understanding of the paper’s main ideas, reflect on my personal opinions, and connect these insights to the questions posed. This discussion will be continuous, so I will not present it as bullet points or an outline. Instead, I will write a single, cohesive narrative that explores each of the topics requested. Along the way, I will draw from my own understanding of ICS (Introduction to Computer Systems), databases, OS fundamentals, and the paper itself. Where relevant, I will cite the provided source: [concurrency-primer.pdf].

I would like to begin by talking about the root cause of concurrency problems. At a high level, concurrency issues arise because the machine, the compiler, and even the language itself can reorder or optimize memory accesses in ways that are not always obvious. To say it differently, concurrency problems exist when multiple threads assume a certain sequence of events, but the actual hardware or compiler produces a different sequence. This can happen because modern CPUs use pipelines, out-of-order execution, and caches, while compilers reorder instructions to optimize performance. This is also because memory operations across cores and caches can become visible at different times. When you have code that depends on a certain ordering (for example, writing a value before setting a ready flag), you are counting on the system to respect that order. Without special care—like using specific atomic operations or memory barriers—this assumption may be violated, and thus concurrency bugs can arise.

Concurrency problems often happen in shared-memory systems. If a single thread writes some data, and a second thread reads that data, you would think you can rely on the line of code in thread A to happen before the line of code in thread B. Yet that “happens-before” relationship might not hold in the compiled program or in the hardware if there is no explicit synchronization. So the root cause is that in multi-threaded systems, we can’t guarantee that code lines appear in a single global order. Data writes and reads may be rearranged. Caches may store old data. In short, concurrency bugs arise from the interplay of reordering by compilers, caching in hardware, and the different timings that appear in multi-core or multi-processor systems.

This fundamental problem is illustrated in the snippet of code that deals with a global integer v and a Boolean flag v_ready. Let’s restate the snippet for clarity:

int v;
bool v_ready = false;
void threadA() {
    // Write the value
    // and set its ready flag.
    v = 42;
    v_ready = true;
}
void threadB() {
    // Await a value change and read it.
    while (!v_ready) {
        /* wait */
    }
    const int my_v = v;
    // Do something with my_v...
}

If this code ran on a single CPU with no compiler optimizations and a straightforward memory model, we would assume that when threadB sees v_ready == true, it would also see the most recent write to v, namely 42. But in a real system, the compiler or CPU might reorder v_ready = true; so that it occurs before v = 42;. It might also store the v value in a register or a cache line that hasn’t been made visible to the other core. In that case, threadB might see the flag set, but still read a stale value of v. Even if the code is small, it’s not guaranteed that the statements happen in the order we wrote them. This is because modern systems use memory reordering for speed. That might not be a problem in single-threaded code, but in multi-threaded code, it leads to subtle errors. From an ICS or OS perspective, we know that compilers generate machine instructions based on what is best for performance, not always for preserving a strict source-code ordering. Databases also face concurrency challenges (though at a higher abstraction level) and usually solve them with locking protocols or transaction isolation. Still, the underlying phenomenon is similar: data might be changed by one entity without guaranteeing that the other has read the same changes.

This ties into the next question: what technologies in modern computers may interfere with the correctness of concurrent programming? Several come to mind. First is compiler optimization. Compilers rearrange instructions to improve speed and reduce pipeline stalls; they assume that if two operations do not depend on each other in a single-threaded sense, they can be reordered for performance. This is especially visible when two accesses touch different variables: the compiler may freely swap them, because to a single thread the result is indistinguishable. Another technology is the multi-level cache. Each core might have its own cache lines, so writing to a variable in one core’s cache doesn’t always appear instantly in another core’s cache. A third technology is the multi-core/NUMA (Non-Uniform Memory Access) architecture. On some systems, different parts of memory might have different speeds depending on which processor is accessing them. All of these can break the naive assumption that a write in one thread is immediately visible to another. That is why concurrency is so hard to get right and why we need explicit synchronization operations or memory fences.
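
To make the compiler’s role concrete, here is a minimal sketch of my own (the names ready, waiter, and producer are invented, not from the paper). With a plain bool, the compiler could legally read the flag once and hoist the check out of the loop, so the waiter might spin forever; declaring the flag std::atomic<bool> forbids that transformation:

#include <atomic>
#include <thread>

std::atomic<bool> ready{false}; // with a plain `bool`, the compiler may hoist the check out of the loop

void waiter() {
    // Every load of an atomic must actually observe memory, so this loop
    // is guaranteed to notice the store performed by the other thread.
    while (!ready.load(std::memory_order_acquire)) {
        // spin
    }
    // The producer's earlier writes are visible from here on.
}

void producer() {
    // ... prepare data ...
    ready.store(true, std::memory_order_release);
}

int main() {
    std::thread b(waiter), a(producer);
    a.join();
    b.join();
}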

In the paper, the author makes the point that all these complications mean there is no consistent concept of “now” in a multithreaded program [concurrency-primer.pdf]. I find that line quite important. When we think about a single-threaded environment, we can talk about the current moment in time, the next line of code, and the code that ran before. But in a multithreaded program, you might have instructions from each thread interleaving in complex ways. One core might reorder instructions differently from another, and each core might observe memory changes at different times. This means that two different threads may have different ideas of “the present.” Thread A might see a certain variable updated, but as far as Thread B is concerned, that update might not have happened yet. On top of hardware reordering, compilers can reorder instructions, and the language specification might allow it as well. Thus we need a combined effort: hardware provides atomic instructions and memory fences, compilers provide intrinsics and sequence points, and the language provides an abstract memory model. The application code must also use the correct synchronization constructs to ensure ordering. Only then do we get a well-defined sense of order between threads.

Next is how to understand “enforcing law and order” and the atomicity of operations necessary for correct concurrent programming. The phrase “enforcing law and order” suggests that concurrency needs rules about which operations can happen at once, and how threads can coordinate. If these rules aren’t set, we end up with race conditions, out-of-order reads and writes, and mysterious bugs. Atomicity is a cornerstone of concurrency because it ensures that certain operations are indivisible. That is, if we do an atomic operation (like an atomic increment or compare-and-swap), we guarantee that no other thread can see a partial update. This might be enforced by CPU instructions, or in C/C++ by the std::atomic<T> and the associated memory ordering constraints [concurrency-primer.pdf]. When an operation is atomic, either another thread sees the entire write or it sees nothing. This prevents problems known as torn reads or torn writes, where half of a write is visible to another thread but not the other half.

I want to say more about the atomic computing capabilities in modern languages like C and C++. Starting in C11, we have <stdatomic.h>, which introduces the _Atomic qualifier (as in _Atomic int) along with convenience typedefs such as atomic_int; C++11 provides the std::atomic<T> class template. These types and their operations give us a reliable way to do atomic loads, stores, or read-modify-write (RMW) operations. Before these features became standardized, programmers often had to resort to compiler-specific intrinsics (like GCC’s __sync_* or __atomic_* built-ins), or inline assembly with special CPU instructions. Now, it’s simpler: we can declare std::atomic<int> x; and use x.store(...), x.load(...), or x.compare_exchange_strong(...). The language then guarantees that these operations are atomic, meaning no other thread sees partial updates. They also allow specifying memory order constraints, like memory_order_relaxed, memory_order_acquire, memory_order_release, or memory_order_seq_cst, which tell the compiler and hardware how much reordering is allowed around that particular operation. This helps us control concurrency to a fine degree.
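
As a small illustration of my own (not code from the paper), here is how these operations look in practice:

#include <atomic>
#include <cstdio>

std::atomic<int> x{0};

int main() {
    x.store(5, std::memory_order_release);          // atomic store
    int seen = x.load(std::memory_order_acquire);   // atomic load
    int prev = x.fetch_add(1);                      // atomic RMW; returns the old value

    int expected = 6;
    // Replace 6 with 60 only if x still holds 6; otherwise `expected`
    // is overwritten with the value actually found.
    bool swapped = x.compare_exchange_strong(expected, 60);

    std::printf("%d %d %d\n", seen, prev, swapped);
}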

The term “torn reads and writes” (also mentioned in the source) describes a situation where you read or write part of a variable but not all of it [concurrency-primer.pdf]. This can happen if the variable is bigger than the CPU’s native word size, or if we’re not using atomic instructions. Suppose we have a 64-bit integer on a 32-bit machine, stored as two 32-bit halves. The writing thread updates the halves with two separate instructions, and a reader that runs in between sees one new half and one stale half, resulting in a “torn” read. Such partial reads can produce nonsense values, and torn reads and writes undermine concurrency because they introduce random, unexpected behavior. With atomic operations, we either see the entire 64-bit value updated, or we see none of the update. That is how we avoid tearing.
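
Here is a small sketch of my own (assuming a 32-bit target for the plain variable) contrasting a variable that can tear with one that cannot:

#include <atomic>
#include <cstdint>

// With a plain 64-bit variable, a 32-bit target may store the two halves
// with separate instructions, so a concurrent reader can see one new half
// and one old half. (It is also a data race as far as the standard goes.)
std::uint64_t plain_value = 0;

// The atomic version can never tear: a reader sees the whole old value or
// the whole new one. If the hardware lacks 64-bit atomics, the library
// falls back to a lock internally.
std::atomic<std::uint64_t> atomic_value{0};

void writer() {
    plain_value = 0xFFFFFFFF00000000ull;                                  // may tear
    atomic_value.store(0xFFFFFFFF00000000ull, std::memory_order_relaxed); // never tears
}

std::uint64_t safe_reader() {
    return atomic_value.load(std::memory_order_relaxed);
}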

When we re-examine read-modify-write (RMW) operations, we see that they are crucial to building synchronization structures, like mutexes, or even implementing lock-free data structures. An RMW instruction reads a value, modifies it in some way (like incrementing), and writes it back all in one atomic step. The CPU ensures that no other thread can intervene in the middle. If we need to implement a spinlock or a compare-and-swap loop, we rely on RMW instructions. If they weren’t atomic, then between the read and the write, another thread could change the same variable. Then we would lose updates or see inconsistent values. So RMW instructions are building blocks for concurrency.
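
To make that concrete, here is a minimal spinlock sketch of my own (the name SpinLock and the use of std::atomic<bool>::exchange are my choices, not the paper’s), showing how a single RMW primitive is enough to build a mutex-like lock:

#include <atomic>

// exchange() writes `true` and returns the previous value in a single
// indivisible step, so exactly one thread can be the one that flips the
// flag from false to true and thereby acquires the lock.
struct SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // someone else holds the lock; busy-wait
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};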

The question of whether lock-free concurrency is better than concurrency with locks is nuanced. Sometimes, lock-free concurrency can yield higher throughput and less blocking. If you use lock-free queues, you might handle many concurrent pushes and pops faster than you could with a big lock. But lock-free algorithms can also get very complicated, especially with memory reclamation and hazard pointers. They can lead to subtle bugs. They also aren’t always faster if contention is high. If many threads are fighting to update the same atomic variable, you might end up with a lot of wasted CAS (compare-and-swap) retries. Meanwhile, a mutex might quickly put threads to sleep instead of letting them busy-wait. So lock-free concurrency is not always better in terms of raw speed. Still, it is often advocated because it can avoid deadlocks and reduce unpredictable blocking times. For real-time or low-latency systems, lock-free concurrency can be beneficial, but you have to weigh the design complexity.
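
As a small illustration of the trade-off (my own sketch; the names are invented), compare a lock-free counter with a lock-based one:

#include <atomic>
#include <mutex>

// Lock-free style: one atomic RMW per increment. No thread can block
// another, but contended fetch_add calls still serialize on the counter's
// cache line.
std::atomic<long> lf_counter{0};
void lf_increment() { lf_counter.fetch_add(1, std::memory_order_relaxed); }

// Lock-based style: easier to extend (say, to update two fields together),
// and under heavy contention the waiters can sleep instead of spinning.
std::mutex counter_mutex;
long locked_counter = 0;
void locked_increment() {
    std::lock_guard<std::mutex> guard(counter_mutex);
    ++locked_counter;
}

Both versions are correct; which one is faster depends on how contended the counter is and on how much work would otherwise sit inside the critical section.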

One interesting topic is how to ensure sequential consistency on weakly-ordered hardware, like ARM. On strongly-ordered systems (x86, for instance), the CPU keeps the order of memory operations close to what you see in the code. On weakly-ordered systems, memory operations can be reordered more aggressively. This can produce higher performance, but it complicates concurrency. To achieve sequential consistency, we use memory barriers or fences (like dmb instructions on ARM). The language might also specify that certain atomic operations include these barriers. So if you want code that is guaranteed to behave in a consistent order on all platforms, you must use the correct memory order constraints or explicit fences. This ensures that even an ARM processor cannot reorder memory accesses around that fence. Without them, different parts of the system might see different orders of reads and writes.
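
A brief sketch of my own showing the two ways this usually appears in C++ code; the exact barrier instructions emitted (dmb or otherwise) are up to the compiler and the target:

#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> flag{false};

void publish_with_ordered_store() {
    // The ordering attached to the atomic makes the compiler emit whatever
    // barrier the target needs (on ARM, typically a dmb or a store-release).
    data.store(1, std::memory_order_relaxed);
    flag.store(true, std::memory_order_seq_cst);
}

void publish_with_explicit_fence() {
    // The same intent expressed with an explicit fence between two
    // otherwise relaxed operations.
    data.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}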

Load-link (LL) and store-conditional (SC) instructions are another interesting mechanism for implementing atomic read-modify-write. The idea behind LL/SC is that you do a “load-link” to read a variable and mark that you intend to write to it. Then you do some local computation. Finally, you do a “store-conditional,” which checks whether that variable was changed in the meantime. If it wasn’t changed, the store succeeds, which completes the atomic RMW. If it was changed, the store fails, and you have to try again. This is used in RISC architectures like ARM, PowerPC, or MIPS. It’s a neat approach, because it doesn’t require a big global lock; it only fails if someone changed the memory location in the interim. However, it can produce false positives: if the cache line is invalidated for some reason, or if some other CPU wrote to the same cache line, the store-conditional will fail even though the variable’s value is unchanged. Any spurious cache invalidation or bus event can mark the link as broken, so if your algorithm has high contention, or if your code does a lot of work between the LL and the SC, you will see many failed attempts and retries, and performance suffers. Even so, LL/SC is a powerful building block for atomic operations in a lock-free context.
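
In C++ this hardware detail leaks through as compare_exchange_weak, which is permitted to fail spuriously for exactly the reasons above. A small sketch of my own:

#include <atomic>

std::atomic<int> counter{0};

// On LL/SC architectures (ARM, POWER, RISC-V), a loop like this typically
// compiles to a load-link / store-conditional pair. compare_exchange_weak
// is allowed to fail spuriously -- the store-conditional can fail even
// though the value is unchanged -- which is why it lives inside a retry loop.
void atomic_increment() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // `expected` now holds the freshly observed value; try again.
    }
}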

Another question is whether a programmer can have precise control over the memory model. In most high-level languages, we can specify memory ordering for atomic operations, as with std::memory_order_seq_cst or std::memory_order_release. For example, we can say:

while (!foo.compare_exchange_weak(
    expected, expected * by,
    memory_order_seq_cst, // On success
    memory_order_relaxed)) { // On failure
    // empty loop
}

This snippet means that if the compare-and-swap (CAS) succeeds, we want sequential consistency (no reordering around that operation). But if the CAS fails, we only need relaxed ordering. Why might we do that? If the operation fails, we know that our assumption about expected was wrong, so the ordering might not matter as much. On success, however, we want to ensure that all threads see the new value in a consistent manner. We do have some control, but we need to understand the hardware and the compiler’s memory model thoroughly. If we choose the wrong memory order, we might break correctness. If we choose the most restrictive memory order all the time (memory_order_seq_cst), we might sacrifice performance. So yes, we can control the memory model, but only as precisely as the language allows, and it requires deep knowledge to do it correctly.

Now let’s talk about the role of caches in concurrency and look at the example of a read-write lock:

struct RWLock {
    int readers;
    bool hasWriter; // Zero or one writers
};

Does the read-write lock improve program efficiency? It might, if the workload has many readers and few writers. In theory, you can allow multiple readers to share the lock without blocking each other, while a single writer obtains exclusive access. But if you have frequent writes, the read-write lock might not help. Each time a thread acquires or releases the lock, it modifies shared state in RWLock. That might cause cache coherence traffic. Each core needs to see the updated state, which can involve invalidating caches. On a system with many threads contending, the overhead of constantly updating readers or hasWriter can undermine the benefits. Also, every change to the lock might bounce cache lines between cores. So read-write locks can help if many threads mostly read shared data and writes are rare. But if the ratio of reads to writes is not high, or if we have many simultaneous operations, we might end up with performance worse than a simple lock. The reason is that cache lines containing readers or hasWriter keep changing owners among cores.
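
To see why even readers generate coherence traffic, here is a stripped-down sketch of my own (an atomic variant of the struct above; the writer protocol and all waiting logic are omitted, so this is not a usable lock):

#include <atomic>

struct RWLockShared {
    std::atomic<int> readers{0};
    std::atomic<bool> hasWriter{false};
};

// Reader bookkeeping only. The point is that even a "read-only" client
// performs atomic read-modify-writes on shared state, which is what makes
// the lock's cache line bounce between all the reading cores.
void reader_enter(RWLockShared &rw) { rw.readers.fetch_add(1, std::memory_order_acquire); }
void reader_exit(RWLockShared &rw)  { rw.readers.fetch_sub(1, std::memory_order_release); }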

About the volatile modifier: it is often misunderstood as a tool for concurrency. In C/C++, volatile does not provide atomicity or memory ordering guarantees. It tells the compiler not to optimize out certain accesses, often used for hardware registers. But it does not create a happens-before relationship. So if you try to “fix” concurrency by just marking a variable as volatile, you still have no guarantee that your program sees updates in the correct order or that tearing does not happen. You need atomics or some synchronization primitive to get that guarantee. One must be careful not to assume that volatile alone handles concurrency. It does not. Instead, we should rely on std::atomic types and memory order constraints.
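
A small sketch of my own contrasting the two (the names are invented): volatile keeps the accesses in the generated code, but it neither makes them atomic nor orders them with respect to other memory operations:

#include <atomic>

int payload = 0;

volatile bool ready_vol = false;        // accesses won't be optimized away,
                                        // but no atomicity and no
                                        // happens-before: still a data race

std::atomic<bool> ready_atomic{false};  // atomic, and release/acquire also
                                        // orders the surrounding accesses

void publish_wrong() {
    payload = 1;
    ready_vol = true;    // another thread may observe the flag before the payload
}

void publish_right() {
    payload = 1;
    ready_atomic.store(true, std::memory_order_release); // payload is visible to
                                                          // any acquiring reader
}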

Finally, the paper mentions “Atomic Fusion” [concurrency-primer.pdf]. My interpretation is that “Atomic Fusion” refers to combining multiple atomic operations into one conceptual step, or grouping them in such a way that they happen together. We might want to do multiple updates in a single atomic transaction. But normal CPU instructions don’t always support that. We can do an atomic increment, or an atomic compare-and-swap, but if we need to update two different variables at once, it’s much harder. If we try to do them in separate instructions, we risk having another thread observe only one update and not the other. So the notion of “Atomic Fusion” is that we want to fuse multiple operations into a single atomic step. That might require hardware transactional memory, or a lock that covers both updates, or a clever algorithm. We should treat it with care because it’s easy to think we are doing an atomic combination when we are not. We might do a normal store to one variable, then a normal store to the other, but that’s not atomic as a whole. Real atomic fusion might only be possible with specialized instructions, or with locking that ensures the two changes happen together. If we rely on partial updates, we risk concurrency anomalies.
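
As one hedged illustration of what fusing two updates might look like in practice (this is my own example, not the paper’s): if the combined state fits in a single atomic word, we can pack the fields together so that one atomic store updates both at once:

#include <atomic>
#include <cstdint>

// Two 32-bit fields that must always change together. Updating them with two
// separate atomic stores is not atomic as a pair: a reader could see one new
// half and one old half. Packing both into a single 64-bit atomic makes the
// pair update indivisible; if the combined state were larger than one atomic
// word, a mutex (or transactional memory) covering both updates would be the
// fallback.
std::atomic<std::uint64_t> packed_pair{0};

void set_pair(std::uint32_t lo, std::uint32_t hi) {
    std::uint64_t packed = (static_cast<std::uint64_t>(hi) << 32) | lo;
    packed_pair.store(packed, std::memory_order_release);
}

void get_pair(std::uint32_t &lo, std::uint32_t &hi) {
    std::uint64_t packed = packed_pair.load(std::memory_order_acquire);
    lo = static_cast<std::uint32_t>(packed);
    hi = static_cast<std::uint32_t>(packed >> 32);
}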

Before concluding, I want to add some personal thoughts. I find concurrency one of the most thrilling areas in computer science because it challenges us to think about the hardware, the compiler, and the language all at once. It can be humbling: even small programs with a few lines of code can break if we don’t use the right synchronization. At the same time, concurrency powers so many modern applications, from high-throughput servers to parallel computations. Reading the paper made me aware of just how many factors can break our assumptions: out-of-order execution, caches, compiler optimizations, misused volatile, or partial atomic operations. Each factor alone isn’t too complicated, but in combination, they form a puzzle that requires careful design and testing to solve.

In my university courses, especially ICS and OS, I have learned about how caching works, how compilers reorder instructions, and how operating systems handle scheduling. But it was only when I tried to write some multi-threaded code that I realized how all these elements interact. The arrangement of instructions matters, the memory fences matter, and the cache coherence matters. Doing concurrency right is like choreographing dancers, each with their own tempo, in a dance that must look orderly to the audience. If we do not manage that, we get chaos.

From a database standpoint, concurrency is handled at a higher level with transactions, locking, or optimistic concurrency control. But the principle is the same: we want to ensure that data changes are atomic, consistent, isolated, and durable (ACID). In low-level systems programming, we have to manage concurrency ourselves, often using built-in atomic operations, memory fences, or specialized instructions like LL/SC. This has taught me that concurrency is not a single, simple concept, but a layered set of guarantees that must be carefully enforced from top to bottom.

Returning to the question about the snippet with v = 42; and v_ready = true;, I believe the answer is clearer now. The main problem is that without proper memory ordering, threadB might see v_ready == true before it sees v == 42. One fix is to use an atomic store with a release semantic and an atomic load with an acquire semantic. For example, we might do:

#include <stdatomic.h>
#include <stdbool.h> /* for bool (needed before C23) */
 
_Atomic int v;
_Atomic bool v_ready;
 
void threadA() {
    atomic_store_explicit(&v, 42, memory_order_relaxed);
    atomic_store_explicit(&v_ready, true, memory_order_release);
}
 
void threadB() {
    while (!atomic_load_explicit(&v_ready, memory_order_acquire)) {
        // Wait
    }
    int my_v = atomic_load_explicit(&v, memory_order_relaxed);
    // Use my_v
}

Here, memory_order_release ensures that all writes in threadA prior to that store are visible to any thread that performs a matching memory_order_acquire on the same variable. That way, threadB is guaranteed to see the updated value of v after it observes v_ready == true. This pattern is a standard concurrency idiom for a producer-consumer scenario. Without it, we cannot assume any particular ordering of v and v_ready by default.

This also relates to the comment about “no consistent concept of now.” The system might reorder, buffer, or delay those writes. Only a memory fence or an atomic operation with the right ordering can force some consistent ordering between threads. Without that, each thread has its own sense of time, so to speak.

Putting all these points together, we see that concurrency issues are rooted in the mismatch between our mental model of ordered execution and the reality of how modern systems reorder instructions and cache data. The code snippet with v and v_ready reveals how these reorderings can cause subtle errors if we do not use memory ordering or synchronization. The necessary technologies of modern systems (compilation optimization, multi-level caches, multi-core hardware) all can break naive assumptions of “do this, then that.” The author’s statement about “no consistent concept of now” underlines that concurrency is a cross-layer challenge, requiring the hardware, the compiler, the language, and our code to cooperate in establishing ordering.

To enforce law and order in concurrency, we need atomic operations, fences, and memory models. Atomic operations ensure indivisible updates and avoid torn reads or writes. They give us building blocks, like read-modify-write (RMW), that let us construct lock-free data structures. Lock-free concurrency isn’t always more efficient than locks, but it avoids certain pitfalls like deadlocks and can improve performance if used with care. In weakly-ordered systems like ARM, we need to add fences or use the right memory ordering on our atomic operations to ensure sequential consistency when needed. We also have instructions like load-link/store-conditional that let us build atomic RMW by linking reads to subsequent stores, although false positives can degrade performance.

Programmers can indeed have a high degree of control over the memory model, but it requires specifying the memory order in each atomic operation. Doing so can get complicated quickly, because we need to weigh the trade-offs between correctness guarantees and performance. For instance, we might want memory_order_seq_cst on success of a compare-exchange but only memory_order_relaxed on failure.

Cache interference is pervasive in concurrency. Every update to shared data typically invalidates cache lines, which leads to traffic on the interconnect between cores. The performance benefits of a read-write lock depend on how many readers share the lock at once and how often we have writers. If we have a lot of contention or frequent writes, the overhead can overshadow the potential gains. That’s because each lock acquisition modifies the shared lock structure, triggering cache invalidations. Meanwhile, some might think a volatile variable solves concurrency, but it does not. It prevents certain compiler optimizations, but it does not guarantee atomicity or ordering. Using volatile alone is a recipe for subtle, hard-to-debug race conditions. Instead, we need real atomic or lock-based synchronization.

Last, “Atomic Fusion” is about grouping multiple operations into one atomic transaction or operation. We do not get that for free in normal CPU instructions unless we rely on hardware transactional memory or we implement locks that cover all operations. If we’re not careful, we can assume we did an atomic action when in reality it was two separate atomic steps, with the possibility of interleaving. That can lead to concurrency anomalies where another thread sees a partial update. So we should treat “Atomic Fusion” as an advanced concept and be aware that implementing it requires specialized support.

In conclusion, reading the paper and reflecting on concurrency fundamentals reminds me that concurrency demands careful attention to ordering and visibility. The root cause of concurrency problems is the mismatch between the high-level assumptions in our code and the actual reorderings done by compilers and hardware. The snippet with v = 42; and v_ready = true; is a classic demonstration of how this mismatch can lead to incorrect results. Modern systems introduce multiple layers that can reorder or hide memory operations, so we must use atomic operations and memory fences to keep them in check. We can see that “now” is not well-defined in a multi-threaded program unless we create order with synchronization. Atomicity is essential so that we do not have partial updates, but even atomic operations can be tricky if we do not specify the right memory ordering. Lock-free concurrency is not always better, but it can be useful when carefully designed. We rely on hardware instructions like LL/SC to implement atomic RMW on certain architectures, but they can fail due to false positives, which reduces performance. Programmers do have some control over the memory model, but using it correctly requires knowledge and discipline. Compare-and-swap with different memory models on success and failure is one example of fine-grained control. The cache also plays a major interference role, since it can cause frequent invalidations and degrade performance when locks or shared variables are updated often. Lastly, we must avoid misusing volatile in concurrent code, and we should understand “Atomic Fusion” as an advanced concept that requires specialized support if we need multiple updates to happen atomically as a group.

Overall, this paper reaffirms the complexity and depth of concurrency in modern computer systems. I found that it shed light on many subtle points: how memory models work, how atomic operations are the building blocks of concurrency, and how hardware, compilers, and languages work together to define what we see in multithreaded code. My personal takeaway is that concurrency is a fascinating puzzle that demands cross-layer thinking, from hardware instructions all the way to high-level application design. It also emphasizes the importance of using the right abstractions—like locks, atomic operations, and consistent memory models—to prevent the strange reorderings and partial updates that can lead to concurrency bugs. Reading the paper was a beneficial exercise in refining my mental model of how real hardware and compilers handle concurrent code, and I hope to apply these insights in my future projects.