
Stupid RCU Tricks: So You Want to Torture RCU With a Non-Trivial Userspace?

In order to save mass-storage space and to reduce boot times, rcutorture runs out of a tiny initrd filesystem that consists only of a root directory and a statically linked init program based on nolibc.  This init program binds itself to a randomly chosen CPU, spins for a short time, sleeps for a short time, and repeats, the point being to inject at least a little userspace execution for the benefit of nohz_full CPUs.
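The actual init program is generated by the mkinitrd.sh script discussed below, but the behavior described above amounts to something like the following sketch.  This sketch uses ordinary libc calls and arbitrary spin/sleep durations, so please treat it as an illustration rather than as the real nolibc-based program:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };  /* Sleep ~1 ms. */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t mask;

        srand(getpid());
        for (;;) {
                CPU_ZERO(&mask);
                CPU_SET(rand() % ncpus, &mask);
                sched_setaffinity(0, sizeof(mask), &mask);  /* Bind to a random CPU. */
                for (volatile long i = 0; i < 1000 * 1000; i++)
                        continue;                           /* Spin briefly in userspace. */
                nanosleep(&ts, NULL);                       /* Then sleep briefly, and repeat. */
        }
}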

This works very well most of the time.  But what if you want to use a full userspace when torturing RCU, perhaps because you want to test runtime changes to RCU's many sysfs parameters, run heavier userspace loads on nohz_full CPUs, or even to run BPF programs?

What you do is go back to the old way of building rcutorture's initrd.

Which raises the question as to what the new way might be.

What rcutorture does is to look at the tools/testing/selftests/rcutorture/initrd directory.  If this directory does not already contain a file named init, the tools/testing/selftests/rcutorture/bin/mkinitrd.sh script builds the aforementioned statically linked init program.

Which means that you can put whatever initrd file tree you wish into that initrd directory, and as long as it contains a /init program, rcutorture will happily bundle that file tree into an initrd in each of the resulting rcutorture kernel images.


Parallel Programming: June 2023 Update

The v2023.06.11a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.

This release contains a new section on thermal throttling (along with a new cartoon), improvements to the memory-ordering chapter (including intuitive subsets of the Linux-kernel memory model), fixes to the deferred-processing chapter, additional clocksource-deviation material to the "What Time Is It?" section, and numerous fixes inspired by questions and comments from readers.  Discussions with Yariv Aridor were especially fruitful.  Akira Yokosawa contributed some quick quizzes and other upgrades of the technical discussions, along with a great many improvements to grammar, glossaries, epigraphs, and the build system.  Leonardo Bras also provided some much-appreciated build-system improvements, and also started up continuous integration for some of the code samples.

Elad Lahav, Alan Huang, Zhouyi Zhou, and especially SeongJae Park contributed numerous excellent fixes for grammatical and typographical errors.  SeongJae's fixes were from his Korean translation of this book.

Elad Lahav, Alan Huang, and Patrick Pan carried out some much-needed review of the code samples and contributed greatly appreciated fixes and improvements.  In some cases, they dragged the code kicking and screaming into the 2020s.  :-)


What Does It Mean To Be An RCU Implementation?

Under Construction

A correspondent closed out 2022 by sending me an off-list email asking whether or not a pair of Rust crates (rcu_clean and left_right) were really implementations of read-copy update (RCU), with an LWN commenter throwing in crossbeam's epoch crate for good measure.  At first glance, these are simple yes/no questions that one should be able to answer off the cuff.

What Is An RCU?

Except that there is quite a variety of RCU implementations in the wild.  Even if we remain within the cozy confines of the Linux kernel, we have: (1) The original "vanilla" RCU, (2) Sleepable RCU (SRCU), (3) Tasks RCU, (4) Tasks Rude RCU, and (5) Tasks Trace RCU.  These differ in more than just their performance characteristics; in fact, it is not in general possible to mechanically convert (say) SRCU to RCU.  The key attributes of RCU implementations are the marking of read-side code regions and data accesses on the one hand and some means of waiting on all pre-existing readers on the other.  For more detail, see the 2019 LWN article, and for more background, see the Linux Foundation RCU presentations here and here.

The next sections provide an overview of the Linux-kernel RCU implementations' functional properties, with performance and scalability characteristics left as an exercise for the interested reader.

Vanilla RCU

Vanilla RCU has quite a variety of bells and whistles:

  • Explicitly nestable read-side markers, rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and friends.
  • Pointer-update function, rcu_assign_pointer().
  • Synchronous grace-period-wait primitives, synchronize_rcu() and synchronize_rcu_expedited().
  • An asynchronous grace-period-wait primitive, call_rcu(), along with a synchronous callback-wait primitive, rcu_barrier().
  • Polled grace-period wait primitives.
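For concreteness, here is a minimal sketch showing how the first four items on this list typically fit together.  The struct myconfig and update_lock names are hypothetical stand-ins for whatever data is actually being protected:

#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct myconfig {                       /* Hypothetical RCU-protected structure. */
        int a;
        int b;
};

static struct myconfig __rcu *curconfig;
static DEFINE_MUTEX(update_lock);       /* Serializes updaters. */

static int reader(void)
{
        struct myconfig *p;
        int ret = 0;

        rcu_read_lock();                        /* Mark the read-side critical section... */
        p = rcu_dereference(curconfig);         /* ...and the read-side pointer access. */
        if (p)
                ret = p->a + p->b;
        rcu_read_unlock();
        return ret;
}

static void update(struct myconfig *newp)
{
        struct myconfig *oldp;

        mutex_lock(&update_lock);
        oldp = rcu_dereference_protected(curconfig, lockdep_is_held(&update_lock));
        rcu_assign_pointer(curconfig, newp);    /* Publish the new version. */
        mutex_unlock(&update_lock);
        synchronize_rcu();                      /* Wait for all pre-existing readers. */
        kfree(oldp);                            /* Now safe to free the old version. */
}

Replacing the synchronize_rcu() and kfree() pair with call_rcu() or kfree_rcu() makes the grace-period wait asynchronous, and rcu_barrier() then waits for any such pending callbacks to be invoked.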

Sleepable RCU (SRCU)


Stupid RCU Tricks: So You Want To Add Kernel-Boot Parameters Behind rcutorture's Back?

A previous post in this series showed how you can use the --bootargs parameter and .boot files to supply kernel boot parameters to the kernels under test.  This works, but it turns out that there is another way, which is often the case with the Linux kernel.  This other way is Masami Hiramatsu's bootconfig facility, which is nicely documented in detail here.  This blog post is a how-to guide on making use of bootconfig when running rcutorture.

The bootconfig facility allows kernel boot parameters to be built into the initrd or directly into the kernel itself, the latter being the method used here.  This requires that the kernel build system be informed of the parameters.  Suppose that these parameters are placed in a file named /tmp/dump_tree.bootparam as follows:

kernel.rcutree.dump_tree=1
kernel.rcutree.blimit=15

Note well the "kernel." prefix, which is required here.  The other option is an "init." prefix, which would cause the parameter to instead be passed to the init process.

Then the following three Kconfig options inform the build system of this file:

CONFIG_BOOT_CONFIG=y
CONFIG_BOOT_CONFIG_EMBED=y
CONFIG_BOOT_CONFIG_EMBED_FILE="/tmp/dump_tree.bootparam"


Hiking Hills

A few years ago, I posted on the challenges of maintaining low weight as one ages.  I have managed to  stay near my target weight, with the occasional excursion in either direction, though admittedly more often up than down.  My suspicion that maintaining weight would prove 90% as difficult as losing it has proven to be all too well founded.  As has the observation that exercise is inherently damaging to muscles (see for example here), especially as one's body's ability to repair itself decreases inexorably with age.

It can be helpful to refer back to those old college physics courses.  One helpful formula is the well-worn Newtonian formula for kinetic energy, which is equal to half your mass times the square of your velocity.  Now, the human body does not maintain precisely the same speed while moving (that is after all what bicycles are for), and the faster you are going, the more energy your body must absorb when decreasing your velocity by a set amount on each footfall.  In fact, this amount of energy increases linearly with your average velocity.  So you can reduce the energy absorption (and thus the muscle and joint damage) by decreasing your speed.  And here you were wondering why old people often move much more slowly than do young people!
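For the record, here is the back-of-the-envelope version of that linear-in-velocity claim, using mass $m$, speed $v$, and the fixed per-footfall speed reduction $\Delta v$ mentioned above:

$$\Delta E = \tfrac{1}{2} m v^2 - \tfrac{1}{2} m (v - \Delta v)^2 = m\,v\,\Delta v - \tfrac{1}{2} m\,\Delta v^2 \approx m\,v\,\Delta v .$$

So for fixed $m$ and $\Delta v$, the energy your body must absorb on each footfall grows linearly with $v$, just as claimed.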


Stupid RCU Tricks: CPP Summit Presentation

I had the privilege of presenting Unraveling Fence & RCU Mysteries (C++ Concurrency Fundamentals) to the CPP Summit.  As the title suggests, this covered RCU from a C++ viewpoint.  At the end, there were several excellent questions:

  1. How does the rcu_synchronize() wait-for-readers operation work?
  2. But the rcu_domain class contains lock() and unlock() member functions!!!
  3. Lockless things make me nervous!

There was limited time for questions, and each question's answer could easily have consumed the 50 minutes allotted to the full talk.  Therefore, I address these questions in the following sections.

How Does rcu_synchronize() Work?

There are a great many ways to make this work.  Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero.  There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems paper (non-paywalled draft).  The most detailed discussion is in the supplementary materials.

More recent information may be found in Section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?
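For illustration only, here is a deliberately oversimplified sketch of the per-thread-counter idea.  The toy_rcu_* names are invented, and unlike a real implementation (for example, liburcu), this cartoon waits on all readers rather than only on pre-existing readers, hard-codes the number of threads, and sidesteps the memory-ordering and starvation pitfalls discussed in the paper:

#include <sched.h>
#include <stdatomic.h>

#define MAX_THREADS 64
static atomic_int reader_nesting[MAX_THREADS];  /* One nesting counter per reader thread. */

static void toy_rcu_read_lock(int tid)
{
        atomic_fetch_add(&reader_nesting[tid], 1);
}

static void toy_rcu_read_unlock(int tid)
{
        atomic_fetch_sub(&reader_nesting[tid], 1);
}

static void toy_rcu_synchronize(void)
{
        /* Wait for each reader's nesting count to drop to zero. */
        for (int i = 0; i < MAX_THREADS; i++)
                while (atomic_load(&reader_nesting[i]))
                        sched_yield();
}

Each reader brackets its read-side critical section with toy_rcu_read_lock(tid) and toy_rcu_read_unlock(tid), and toy_rcu_synchronize() returns only after every critical section it observed in progress has completed.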

The rcu_domain Class Contains lock() and unlock() Member Functions?

Indeed it does!


Parallel Programming: September 2022 Update

The v2022.09.25a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.

This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.

One of the code samples now uses C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.

Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.

This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.

Kangrejos 2022: The Rust for Linux Workshop

I had the honor of attending this workshop. However, this is not a full report. Instead, I am reporting on what I learned and how it relates to my So You Want to Rust the Linux Kernel? blog series that I started in 2021, and of which this post is a member. I will leave more complete coverage of a great many interesting sessions to others (for example, here, here, and here), who are probably better versed in Rust than am I in any case.

Device Communication and Memory Ordering

Andreas Hindborg talked about his work on a Rust-language PCI NVMe driver for the Linux kernel x86 architecture. I focus on the driver's interaction with device firmware in main memory. As is often the case, there is a shared structure in normal memory in which the device firmware reports the status of an I/O request. This structure has a special 16-bit field that is used to report that all of the other fields are now filled in, so that the Linux device driver can now safely access them.

This of course requires some sort of memory ordering, both on the part of the device firmware and on the part of the device driver. One straightforward approach is for the driver to use smp_load_acquire() to access that 16-bit field, and only if the returned value indicates that the other fields have been filled in, access those other fields.
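Here is a hedged sketch of that approach, with a made-up struct fw_status and invented field names standing in for the real completion structure:

#include <asm/barrier.h>
#include <linux/types.h>

struct fw_status {              /* Hypothetical layout, not the actual NVMe structure. */
        u32 result;
        u16 sq_head;
        u16 done;               /* Firmware stores to this field last. */
};

static bool fw_status_ready(struct fw_status *s, u32 *result)
{
        /* The acquire load orders the later field reads after this check. */
        if (!smp_load_acquire(&s->done))
                return false;
        /* Safe: this read cannot be reordered before the check above,
         * assuming the firmware publishes with the equivalent of a
         * release store to the done field. */
        *result = s->result;
        return true;
}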

To the surprise of several Linux-kernel developers in attendance, Andreas's driver instead does a volatile load of the entire structure, 16-bit field and all. If the 16-bit field indicates that all the fields have been updated by the firmware, it then does a second volatile load of the entire structure and then proceeds to use the values from this second load. Apparently, the performance degradation from these duplicate accesses, though measurable, is quite small. However, given the possibility of larger structures for which the performance degradation might be more profound, Andreas was urged to come up with a more elaborate Rust construction that would permit separate access to the 16-bit field, thus avoiding the redundant memory traffic.

In a subsequent conversation, Andreas noted that the overall performance degradation of this Rust-language prototype compared to the C-language version is at most 2.61%, and that in one case there is an as-yet-unexplained performance increase of 0.67%. Of course, given that both C and Rust are compiled, statically typed, non-garbage-collected, and handled by the same compiler back-end, it should not be a surprise that they achieve similar performance. An interesting question is whether the larger amount of information available to the Rust compiler will ever allow it to consistently provide better performance than does C.

Rust and Linux-Kernel Filesystems

Wedson Almeida Filho described his work on a Rust-language implementation of portions of the Linux kernel's 9p filesystem. The parts of this effort that caught my attention were the use of the Rust language's asynchronous facilities to implement some of the required state machines and the use of a Rust-language type-aware packet-decoding facility.

So what do these two features have to do with concurrency in general, let alone memory-ordering in particular?

Not much. Sure, call_rcu() is (most of) the async equivalent of synchronize_rcu(), but that should be well-known and thus not particularly newsworthy.

But these two features might have considerable influence over the success or lack thereof for the Rust-for-Linux effort.

In the past, much of the justification for use of Rust in Linux has been memory safety, based on the fact that 70% of the Linux exploits have involved memory unsafety. Suppose that Rust manages to address the entire 70%, as opposed to relegating (say) half of that to Rust-language unsafe code. Then use of Rust might at best provide a factor-of-three improvement.

Of course, a factor of three is quite good in absolute terms. However, in my experience, at least an order of magnitude is usually required to motivate, with high probability, the kind of large change represented by the C-to-Rust transition. Therefore, in order for me to rate the Rust-for-Linux project's success as highly probable, another factor of three is required.

One possible source of the additional factor of three might come from the Rust-for-Linux community giving neglected drivers some much-needed attention. Perhaps the 9p work would qualify, but it seems unlikely that the NVMe drivers are lacking attention.

Perhaps Rust async-based state machines and type-aware packet decoding will go some ways towards providing more of the additional factor of three. Or perhaps a single factor of three will suffice in this particular case. Who knows?

“Pinning” for the Linux Kernel in Rust

Benno Lossin and Gary Guo reported on their respective efforts to implement pinning in Rust for the Linux kernel (a more detailed report may be found here).

But perhaps you, like me, are wondering just what pinning is and why the Linux kernel needs it.

It turns out that the Rust language allows the compiler great freedom to rearrange data structures by default, similar to what would happen in C-language code if each and every structure was marked with the Linux kernel's __randomize_layout directive. In addition, by default, Rust-language data can be moved at runtime.

This apparently allows additional optimizations, but it is not particularly friendly to Linux-kernel idioms that take addresses of fields in structures. Such idioms include pretty much all of Linux's linked-list code, as well as things like RCU callbacks. And who knows, perhaps this situation helps to explain recent hostility towards linked lists expressed on social media. ;-)

The Rust language accommodates these idioms by allowing data to be pinned, thus preventing the compiler from moving it around.

However, pinning data has consequences, including the fact that such pinning is visible in Rust's type system. This visibility allows Rust compilers to complain when (for example) list_for_each_entry() is passed an unpinned data structure. Such complaints are important, given that passing unpinned data to primitives expecting pinned data would result in memory unsafety. The code managing the pinning is expected to be optimized away, but there are concerns that it would make itself all too apparent during code review and debugging.

Benno's and Gary's approaches were compared and contrasted, but my guess is that additional work will be required to completely solve this problem. Some attendees would like to see pinning addressed at the Rust-language level rather than at the library level, but time will tell what approach is most appropriate.

RCU Use Cases

Although RCU was not an official topic, for some reason it came up frequently during the hallway tracks that I participated in. Which is probably a good thing. ;-)

One question is exactly how the various RCU use cases should be represented in Rust. Rust's atomic-reference-count facility has been put forward, and it is quite possible that, in combination with liberal use of atomics, it will serve for some of the more popular use cases. Wedson suggested that Revocable covers another group of use cases, and at first glance, it appears that Wedson might be quite right. Still others might be handled by wait-wakeup mechanisms. There are still some use cases to be addressed, but some progress is being made.

Rust and Python

Some people suggested that Rust might take over from Python, adding that Rust solves many problems that Python is said to encounter in large-scale programs, and that Rust should prove more ergonomic than Python. The first is hard to argue with, given that Python seems to be most successful in smaller-scale projects that make good use of Python's extensive libraries. The second seems quite unlikely to me; in fact, it is hard for me to imagine that anything about Rust would seem in any way ergonomic to a dyed-in-the-wool Python developer.

One particularly non-ergonomic attribute of current Rust is likely to be its libraries, which last I checked were considerably less extensive than those of Python. Although the software-engineering shortcomings of large-scale Python projects can and have motivated conversion to Rust, it appears to me that smaller Python projects making heavier use of a wider variety of libraries might need more motivation.

Automatic conversion of Python libraries, anyone? ;-)

Conclusions

I found Kangrejos 2022 to be extremely educational and thought-provoking, and am very glad that I was able to attend. I am still uncertain about Rust's prospects in the Linux kernel, but the session provided some good examples of actual work done and problems solved. Perhaps Rust's async and type-sensitive packet-decoding features will help increase its probability of success, and perhaps there are additional Rust features waiting to be applied, for example, use of Rust iterators to simplify lockdep cycle detection. And who knows? Perhaps future versions of Rust will be able to offer consistent performance advantages. Stranger things have happened!

History

September 19, 2022: Changes based on LPC and LWN reports.

Confessions of a Recovering Proprietary Programmer, Part XIX: Concurrent Computational Models

The September 2022 issue of the Communications of the ACM features an entertaining pair of point/counterpoint articles, with William Dally advocating for concurrent computational models that more accurately model both energy costs and latency here and Uzi Vishkin advocating for incremental modifications to the existing Parallel Random Access Machine (PRAM) model here. Some cynics might summarize Dally's article as “please develop software that makes better use of our hardware so that we can sell you more hardware” while other cynics might summarize Vishkin's article as “please keep funding our research project“. In contrast, I claim that both articles are worth a deeper look. The next section looks at Dally's article, the section after that at Vishkin's article, followed by discussion.

TL;DR: In complete consonance with a depressingly large fraction of 21st-century discourse, they are talking right past each other.

On the Model of Computation: Point: We Must Extend Our Model of Computation to Account for Cost and Location

Dally makes a number of good points. First, he notes that past decades have seen a shift from computation being more expensive than memory references to memory references being more expensive than computation, and by a good many orders of magnitude. Second, he points out that in production, constant factors can be more important than asymptotic behavior. Third, he describes situations where hardware caches are not a silver bullet. Fourth, he calls out the necessity of appropriate computational models for performance programming, though to his credit, he does clearly state that not all programming is performance programming. Fifth, he calls out the global need for reduced energy expended on computation, motivated by projected increases in computation-driven energy demand. Sixth and finally, he advocates for a new computational model that builds on PRAM by (1) assigning different costs to computation and to memory references and (2) making memory-reference costs a function of a distance metric based on locations assigned to processing units and memories.

On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis

Vishkin also makes some interesting points. First, he calls out the days of Moore's-Law frequency scaling, calling for the establishment of a similar “free lunch” scaling in terms of numbers of CPUs. He echoes Dally's point about not all programming being performance programming, but argues that in addition to a “memory wall” and an “energy wall”, there is also a “parallel programming wall” because programming multicore systems is “simply too difficult”. Second, Vishkin calls on hardware vendors to take more account of ease of use when creating computer systems, including systems with GPUs. Third, Vishkin argues that compilers should take up much of the burden of squeezing performance out of hardware. Fourth, he makes several arguments that concurrent systems should adapt to PRAM rather than adapting PRAM to existing hardware. Fifth and finally, he calls for funding for a “multicore software spiral” that he hopes will bring back free-lunch performance increases driven by Moore's Law, but based on increasing numbers of cores rather than increasing clock frequency.

Discussion

Figure 2.3 of Is Parallel Programming Hard, And, If So, What Can You Do About It? (AKA “perfbook”) shows the “Iron Triangle” of concurrent programming. Vishkin focuses primarily on the upper edge of this triangle, which is primarily about productivity. Dally focuses primarily on the lower left edge of this triangle, which is primarily about performance. Except that Vishkin's points about the cost of performance programming are all too valid, which means that defraying the cost of performance programming in the work-a-day world requires very large numbers of users, which is often accomplished by making performance-oriented concurrent software quite general, thus amortizing its cost over large numbers of use cases and users. This puts much performance-oriented concurrent software at the lower tip of the iron triangle, where performance meets generality.

One could argue that Vishkin's proposed relegation of performance-oriented code to compilers is one way to break the iron triangle of concurrent programming, but this argument fails because compilers are themselves low-level software (thus at the lower tip of the triangle). Worse yet, there have been many attempts to put the full concurrent-programming burden on the compiler (or onto transactional memory), but rarely to good effect. Perhaps SQL is the exception that proves this rule, but this is not an example that Vishkin calls out.

Both Dally and Vishkin correctly point to the ubiquity of multicore systems, so it is only reasonable to look at how these systems are handled in practice. After all, it is only fair to check their apparent assumption of “badly”. And Section 2.3 of perfbook lists several straightforward approaches: (1) Using multiple instances of a sequential application, (2) Using the large quantity of existing parallel software, ranging from operating-system kernels to database servers (SQL again!) to numerical software to image-processing libraries to machine-learning libraries, to name but a few of the areas that Python developers could teach us about, and (3) Applying sequential performance optimizations such that single-threaded performance suffices. Those taking door 1 or door 2 will need only a few parallel performance programmers, and it seems to have proven eminently feasible to have those few use a sufficiently accurate computational model. The remaining users simply invoke the software produced by these parallel performance programmers, and need not worry much about concurrency. The cases where none of these three doors apply are more challenging, but surmounting the occasional challenge is part of the programming game, not just the parallel-programming game.

One additional historical trend is the sharply decreasing cost of concurrent systems, both in terms of purchase price and energy consumption. Where an entry-level late-1980s parallel system cost millions of dollars and consumed kilowatts, an entry-level 2022 system can be had for less than $100. This means that there is no longer an overwhelming economic imperative to extract every ounce of performance from most year-2022 concurrent systems. For example, much smartphone code attains the required performance while running single-threaded, which means that additional performance from parallelism could well be a waste of energy. Instead, the smartphone automatically powers down the unused hardware, thus providing only the performance that is actually needed, and does so in an energy-efficient manner. The occasional heavy-weight application might turn on all the CPUs, but only such applications need the ministrations of parallel performance programmers and complex computational models.

Thus, Dally and Vishkin are talking past each other, describing different types of software that is produced by different types of developers, who can each use whatever computational model is appropriate to the task at hand.

Which raises the question of whether there might be a one-size-fits-all computational model. I have my doubts. In my 45 years of supporting myself and (later) my family by developing software, I have seen many models, frameworks, languages, and libraries come and go. In contrast, to the best of my knowledge, the speed of light and the sizes of the various types of atoms have remained constant. Don't get me wrong, I do hope that the trend towards vertical stacking of electronics within CPU chips will continue, and I also hope that this trend will result in smaller systems that are correspondingly less constrained by speed-of-light and size-of-atoms limitations. However, the laws of physics appear to be exceedingly stubborn things, and we would therefore be well-advised to work together. Dally's attempt to dictate current hardware conveniences to software developers is about as likely to succeed as is Vishkin's attempt to dictate current software conveniences to hardware developers. Both of these approaches are increasingly likely to fail should the trend towards special-purpose hardware continue full force. Software developers cannot always ignore the laws of physics, and by the same token, hardware developers cannot always ignore the large quantity of existing software and the large number of existing software developers. To be sure, in order to continue to break new ground, both software and hardware developers will need to continue to learn new tricks, but many of these tricks will require coordinated hardware/software effort.

Sadly, there is no hint in either Dally's or Vishkin's article of any attempt to learn what the many successful developers of concurrent software actually do. Thus, it is entirely possible that neither Dally's nor Vishkin's preferred concurrent computational model is applicable to actual parallel programming practice. In fact, in my experience, developers often use the actual hardware and software as the model, driving their development and optimization efforts from actual measurements and from intuitions built up from those measurements. Some might look down on this approach as being both non-portable and non-mathematical, but portability is often achieved simply by testing on a variety of hardware, courtesy of the relatively low prices of modern computer systems. As for mathematics, those of us who appreciate mathematics would do well to admit that it is often much more efficient to let the objective universe carry out the mathematical calculations in its normal implicit and ineffable manner, especially given the complexity of present day hardware and software.

Circling back to the cynics, there are no doubt some cynics that would summarize this blog post as “please download and read my book”. Except that in contrast to the hypothesized cynical summarization of Dally's and Vishkin's articles, you can download and read my book free of charge. Which is by design, given that the whole point is to promulgate information. Cue cynics calling out which of the three of us is the greater fool! ;-)

Stupid SMP Tricks: A Review of Locking Engineering Principles and Hierarchy

Daniel Vetter put together a pair of intriguing blog posts entitled Locking Engineering Principles and Locking Engineering Hierarchy. These appear to be an attempt to establish a set of GPU-wide or perhaps even driver-tree-wide concurrency coding conventions.

Which would normally be none of my business. After all, to establish such conventions, Daniel needs to negotiate with the driver subsystem's developers and maintainers, and I am neither. Except that he did call me out on Twitter on this topic. So here I am, as promised, offering color commentary and the occasional suggestion for improvement, both of Daniel's proposal and of the kernel itself. The following sections review his two posts, and then summarize and amplify suggestions for improvement.

Locking Engineering Principles


“Make it Dumb” should be at least somewhat uncontroversial, as this is quite similar to a rather more flamboyant sound bite from my 1970s Mechanical Engineering university coursework.  Kernighan's Law asserting that debugging is twice as hard as coding should also be a caution.

This leads us to “Make it Correct”, which many might argue should come first in the list. After all, if correctness is not a higher priority than dumbness (or simplicity, if you prefer), then the do-nothing null solution would always be preferred. And to his credit, Daniel does explicitly state that “simple doesn't necessarily mean correct”.

It is hard to argue with Daniel's choosing to give lockdep place of pride, especially given how many times it has saved me over the years, and not just from deadlock. I never have felt the need to teach lockdep RCU's locking rules, but then again, RCU is much more monolithic than is the GPU subsystem. Perhaps if RCU's locking becomes more ornate, I would also feel the need to acquire RCU's key locks in the intended order at boot time. Give or take the fact that much of RCU can be used very early, even before the call to rcu_init().

The validation point is also a good one, with Daniel calling out might_lock(), might_sleep(), and might_alloc(). I go further and add lockdep_assert_irqs_disabled(), lockdep_assert_irqs_enabled(), and lockdep_assert_held(), but perhaps these additional functions are needed in low-level code like RCU more than they are in GPU drivers.

Daniel's admonishments to avoid reinventing various synchronization wheels of course comprise excellent common-case advice. And if you believe that your code is the uncommon case, you are most likely mistaken. Furthermore, it is hard to argue with “pick the simplest lock that works”.

It is hard to argue with the overall message of “Make it Fast”, especially the bit about the pointlessness of crashing faster. However, a minimum level of speed is absolutely required. For a 1977 example, an old school project required keeping up with the display. If the code failed to keep up with the display, that code was useless. More recent examples include deep sub-second request-response latency requirements in many Internet datacenters. In other words, up to a point, performance is not simply a “Make it Fast” issue, but also a “Make it Correct” issue. On the other hand, once the code meets its performance requirements, any further performance improvements instead fall into the lower-priority “Make it Fast” bucket.

Except that things are not always quite so simple. Not all performance improvements can be optimized into existence late in the game. In fact, if your software's performance requirements are expected to tax your system's resources, it will be necessary to make performance a first-class consideration all the way up to and including the software-architecture level, as discussed here. And, to his credit, Daniel hints at this when he notes that “the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts”. He also helpfully calls out io_uring as one avenue for radical updates.

The “Protect Data, not Code” guideline is good general advice and goes back to Jack Inman's 1985 USENIX paper entitled “Implementing Loosely Coupled Functions on Tightly Coupled Engines”, if not further. However, there are times where what needs to be locked is not data, but perhaps a particular set of state transitions, or maybe a rarely accessed state, which might be best represented by the code for that state. That said, there can be no denying the big-kernel-lock hazards of excessive reliance on code locking. Furthermore, I have only rarely needed abstract non-data locking.

Nevertheless, a body of code as large as the full set of Linux-kernel device drivers should be expected to have a large number of exceptions to the “Protect Data” rule of thumb, reliable though that rule has proven in my own experience. The usual way of handling this is a waiver process. After all, given that every set of safety-critical coding guidelines I have seen has a waiver process, it only seems reasonable that device-driver concurrency coding conventions should also involve a waiver process. But that decision belongs to the device-driver developers and maintainers, not with me.

Locking Engineering Hierarchy


Most of the Level 0: No Locking section should be uncontroversial: Do things the easy way, at least when the easy way works. This approach goes back decades, for but one example, single-threaded user-interface code delegating concurrency to a database management system. And it should be well-understood that complicated things can be made easy (or at least easier) through use of carefully constructed and time-tested APIs, for example, the queue_work() family of APIs called out in this section. One important question for all such APIs is this: “Can someone with access to only the kernel source tree and a browser quickly find and learn how to use the given API?” After all, people are able to properly use the APIs they see elsewhere if they can quickly and easily learn about them.

And yes, given Daniel's comments regarding struct dma_fence later in this section, the answer for SLAB_TYPESAFE_BY_RCU appears to be “no” (but see Documentation/RCU/rculist_nulls.rst). The trick with SLAB_TYPESAFE_BY_RCU is that readers must validate the allocation. A common way of doing this is to have a reference count that is set to the value one immediately after obtaining the object from kmem_cache_alloc() and set to the value zero immediately before passing the object to kmem_cache_free(). And yes, I have a patch queued to fix the misleading text in Documentation/RCU/whatisRCU.rst, and I apologize to anyone who might have attempted to use locking without a reference count. But please also let the record show that there was no bug report.

A reader attempting to obtain a new lookup-based reference to a SLAB_TYPESAFE_BY_RCU object must do something like this:

  1. atomic_inc_not_zero(), which will return false and not do the increment if the initial value was zero. Presumably dma_fence_get_rcu() handles this for struct dma_fence.
  2. If the value returned above was false, pretend that the lookup failed. Otherwise, continue with the following steps.
  3. Check the identity of the object. If this turns out to not be the desired object, release the reference (cleaning up if needed) and pretend that the lookup failed. Otherwise, continue. Presumably dma_fence_get_rcu_safe() handles this for struct dma_fence, in combination with its call to dma_fence_put().
  4. Use the object.
  5. Release the reference, cleaning up if needed.
This of course underscores Daniel's point (leaving aside the snark), which is that you should not use SLAB_TYPESAFE_BY_RCU unless you really need to, and even then you should make sure that you know how to use it properly.
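Putting those steps together yields something like the following hedged sketch, which uses hypothetical names (struct my_obj, my_obj_cache, my_objs, and my_obj_put()) and a refcount_t rather than the actual dma_fence code:

#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/xarray.h>

struct my_obj {
        refcount_t ref;         /* Set to 1 right after allocation, 0 just before freeing. */
        unsigned long id;       /* Identity, rechecked after taking a reference. */
};

static struct kmem_cache *my_obj_cache; /* Created with SLAB_TYPESAFE_BY_RCU. */
static DEFINE_XARRAY(my_objs);          /* Hypothetical search structure. */

static void my_obj_put(struct my_obj *p)
{
        if (refcount_dec_and_test(&p->ref))
                kmem_cache_free(my_obj_cache, p);       /* Memory may be reused at once. */
}

static struct my_obj *my_obj_lookup(unsigned long id)
{
        struct my_obj *p;

        rcu_read_lock();
        p = xa_load(&my_objs, id);
        /* Steps 1 and 2: a zero count means the object is being freed, and its
         * memory may already be recycled, so treat that as a failed lookup. */
        if (p && !refcount_inc_not_zero(&p->ref))
                p = NULL;
        rcu_read_unlock();
        /* Step 3: recheck identity in case the slab recycled this memory. */
        if (p && READ_ONCE(p->id) != id) {
                my_obj_put(p);
                p = NULL;
        }
        return p;
}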

But just when do you need to use SLAB_TYPESAFE_BY_RCU? One example is when you need RCU protection on such a hot fastpath that you cannot tolerate RCU-freed objects becoming cold in the CPU caches due to RCU's grace-period delays. Another example is where objects are being allocated and freed at such a high rate that if each and every kmem_cache_free() was subject to grace-period delays, excessive memory would be tied up waiting for grace periods, perhaps even resulting in out-of-memory (OOM) situations.

In addition, carefully constructed semantics are required. Semantics based purely on ordering tend to result in excessive conceptual complexity and thus in confusion. More appropriate semantics tend to be conditional, for example: (1) Objects that remain in existence during the full extent of the lookup are guaranteed to be found, (2) Objects that do not exist at any time during the full extent of the lookup are guaranteed not to be found, and (3) Objects that exist for only a portion of the lookup might or might not be found.

Does struct dma_fence need SLAB_TYPESAFE_BY_RCU? Is the code handling struct dma_fence doing the right thing? For the moment, I will just say that: (1) The existence of dma_fence_get_rcu_safe() gives me at least some hope, (2) Having two different slab allocators stored into different static variables having the same name is a bit code-reader-unfriendly, and (3) The use of SLAB_TYPESAFE_BY_RCU for a structure that contains an rcu_head structure is a bit unconventional, but perhaps these structures are sometimes obtained from kmalloc() instead of from kmem_cache_alloc().

One thing this section missed (or perhaps intentionally omitted): If you do need memory barriers, you are almost always better off using smp_store_release() and smp_load_acquire() than the old-style smp_wmb() and smp_rmb().

The Level 1: Big Dumb Lock section certainly summarizes the pain of finding the right scope for your locks. Despite Daniel's suggesting that lockless tricks are almost always the wrong approach, such tricks are in common use. I would certainly agree that randomly hacking lockless tricks into your code will almost always result in disaster. In fact, this process will use your own intelligence against you: The smarter you think you are, the deeper a hole you will have dug for yourself before you realize that you are in trouble.

Therefore, if you think that you need a lockless trick, first actually measure the performance. Second, if there really is a performance issue, do the work to figure out which code and data is actually responsible, because your blind guesses will often be wrong. Third, read documentation, look at code, and talk to people to learn and fully understand how this problem has been solved in the past, then carefully apply those solutions. Fourth, if there is no solution, review your overall design, because an ounce of proper partitioning is worth some tons of clever lockless code. Fifth, if you must use a new solution, verify it beyond all reason. The code in kernel/rcu/rcutorture.c will give you some idea of the level of effort required.

The Level 2: Fine-grained Locking sections provide some excellent advice, guidance, and cautionary tales. Yes, lockdep is a great thing, but there are deadlocks that it does not detect, so some caution is still required.

The yellow-highlighted Locking Antipattern: Confusing Object Lifetime and Data Consistency describes how holding locks across non-memory-style barrier functions, that is, things like flush_work() as opposed to things like smp_mb(), can cause problems, up to and including deadlock. In contrast, whatever other problems memory-barrier functions such as smp_mb() cause, it does take some creativity to add them to a lock-based critical section so as to cause them to participate in a deadlock cycle. I would not consider flush_work() to be a memory barrier in disguise, although it is quite true that a correct high-performance implementation of flush_work() will require careful memory ordering.

I would have expected a rule such as “Don't hold a lock across flush_work() that is acquired in any corresponding workqueue handler”, but perhaps Daniel is worried about future deadlocks as well as present-day deadlocks. After all, if you do hold a lock across a call to flush_work(), perhaps there will soon be some compelling reason to acquire that lock in a workqueue handler, at which point it is game over due to deadlock.

This issue is of course by no means limited to workqueues. For but one example, it is quite possible to generate similar deadlocks by holding spinlocks across calls to del_timer_sync() that are acquired within timer handlers.

And that is why both flush_work() and del_timer_sync() tell lockdep what they are up to. For example, flush_work() invokes __flush_work() which in turn invokes lock_map_acquire() and lock_map_release() on a fictitious lock, and this same fictitious lock is acquired and released by process_one_work(). Thus, if you acquire a lock in a workqueue handler that is held across a corresponding flush_work(), lockdep will complain, as shown in the 2018 commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") by Johannes Berg. Of course, del_timer_sync() uses this same trick, as shown in the 2009 commit 6f2b9b9a9d75 ("timer: implement lockdep deadlock detection"), also by Johannes Berg.
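For anyone wanting to apply this same trick to their own flush-style primitive, the pattern looks roughly like the following sketch. The my_flush_map, my_handler_fn(), and my_flush() names are invented, and the config-dependent #ifdefs that production code would want around the lockdep_map definition are omitted, but STATIC_LOCKDEP_MAP_INIT(), lock_map_acquire(), and lock_map_release() are the real lockdep hooks:

#include <linux/lockdep.h>

static struct lock_class_key my_flush_key;
static struct lockdep_map my_flush_map =
        STATIC_LOCKDEP_MAP_INIT("my_subsys_flush", &my_flush_key);

static void my_handler_fn(void)         /* Runs asynchronously, like a workqueue handler. */
{
        /* Tell lockdep that the handler runs "inside" the fictitious lock. */
        lock_map_acquire(&my_flush_map);
        /* ... do the deferred work, possibly acquiring subsystem locks ... */
        lock_map_release(&my_flush_map);
}

static void my_flush(void)              /* Waits for the handler, like flush_work(). */
{
        /* Acquiring the same fictitious lock here means that holding across
         * my_flush() any lock that is also acquired in my_handler_fn() shows
         * up to lockdep as a potential deadlock cycle. */
        lock_map_acquire(&my_flush_map);
        lock_map_release(&my_flush_map);
        /* ... actually wait for my_handler_fn() to complete ... */
}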

Nevertheless, deadlocks involving flush_work() and workqueue handlers can be subtle. It is therefore worth investing some up-front effort to avoid them.

The orange-highlighted Level 2.5: Splitting Locks for Performance Reasons discusses splitting locks. As the section says, there are complications. For one thing, lock acquisitions are anything but free, having overheads of hundreds or thousands of instructions at best, which means that adding additional levels of finer-grained locks can actually slow things down. Therefore, Daniel's point about prioritizing architectural restructuring over low-level synchronization changes is an extremely good one, and one that is all too often ignored.

As Daniel says, when moving to finer-grained locking, it is necessary to avoid increasing the common-case number of locks being acquired. As one might guess from the tone of this section, this is not necessarily easy. Daniel suggests reader-writer locking, which can work well in some cases, but suffers from performance and scalability limitations, especially in situations involving short read-side critical sections. The fact that readers and writers exclude each other can also result in latency/response-time issues. But again, reader-writer locking can be a good choice in some cases.

The last paragraph is an excellent cautionary tale. Take it from Daniel: Never let userspace dictate the order in which the kernel acquires locks. Otherwise, you, too, might find yourself using wait/wound mutexes. Worse yet, if you cannot set up a two-phase locking discipline in which all required locks are acquired before any work is done, you might find yourself writing deadlock-recovery code, or wounded-mutex recovery code, if you prefer. This recovery code can be surprisingly complex. In fact, one of the motivations for the early 1990s introduction of RCU into DYNIX/ptx was that doing so allowed deletion of many thousands of lines of such recovery code, along with all the yet-undiscovered bugs that code contained.

The red-highlighted Level 3: Lockless Tricks section begins with the ominous sentence “Do not go here wanderer!”

And I agree. After all, if you are using the facilities discussed in this section, namely RCU, atomics, preempt_disable(), local_bh_disable(), local_irq_save(), or the various memory barriers, you had jolly well better not be wandering!!!

Instead, you need to understand what you are getting into and you need to have a map in the form of a principled design and a careful well-validated implementation. And new situations may require additional maps to be made, although there are quite a few well-used maps already in Documentation/rcu and over here. In addition, as Daniel says, algorithmic and architectural fixes can often provide much better results than can lockless tricks applied at low levels of abstraction. Past experience suggests that some of those algorithmic and architectural fixes will involve lockless tricks on fastpaths, but life is like that sometimes.

It is now time to take a look at Daniel's alleged antipatterns.

The “Locking Antipattern: Using RCU” section does have some “interesting” statements:

  1. RCU is said to mix up lifetime and consistency concerns. It is not clear exactly what motivated this statement, but this view is often a symptom of a strictly temporal view of RCU. To use RCU effectively, one must instead take a combined spatio-temporal view, as described here and here.
  2. rcu_read_lock() is said to provide both a read-side critical section and to extend the lifetime of any RCU-protected object. This is true in common use cases because the whole purpose of an RCU read-side critical section is to ensure that any RCU-protected object that was in existence at any time during that critical section remains in existence up to the end of that critical section. It is not clear what distinction Daniel is attempting to draw here. Perhaps he likes the determinism provided by full mutual exclusion. If so, never forget that the laws of physics dictate that such determinism is often surprisingly expensive.
  3. The next paragraph is a surprising assertion that RCU readers' deadlock immunity is a bad thing! Never forget that although RCU can be used to replace reader-writer locking in a great many situations, RCU is not reader-writer locking. Which is a good thing from a performance and scalability viewpoint as well as from a deadlock-immunity viewpoint.
  4. In a properly designed system, locks and RCU are not “papering over” lifetime issues, but instead properly managing object lifetimes. And if your system is not properly designed, then any and all facilities, concurrent or not, are weapons-grade dangerous. Again, perhaps more maps are needed to help those who might otherwise wander into improper designs. Or perhaps existing maps need to be more consistently used.
  5. RCU is said to practically force you to deal with “zombie objects”. Twitter discussions with Daniel determined that such “zombie objects” can no longer be looked up, but are still in use. Which means that many use cases of good old reference counting also force you to deal with zombie objects: After all, removing an object from its search structure does not invalidate the references already held on that object. But it is quite possible to avoid zombie objects for both RCU and for reference counting through use of per-object locks, as is done in the Linux-kernel code that maps from a System-V semaphore ID to the corresponding in-kernel data structure, first described in Section of this paper. In short, if you don't like zombies, there are simple RCU use cases that avoid them. You get RCU's speed and deadlock immunity within the search structure, but full ordering, consistency, and zombie-freedom within the searched-for object.
  6. The last bullet in this section seems to argue that people should upgrade from RCU read-side critical sections to locking or reference counting as quickly as possible. Of course, such locking or reference counting can add problematic atomic-operation and cache-miss overhead to those critical sections, so specific examples would be helpful. One non-GPU example is the System-V semaphore ID example mentioned above, where immediate lock acquisition is necessary to provide the necessary System-V semaphore semantics.
  7. On freely using RCU, again, proper design is required, and not just with RCU. One can only sympathize with a driver dying in synchronize_rcu(). Presumably Daniel means that one of the driver's tasks hung in synchronize_rcu(), in which case there should have been an RCU CPU stall warning message which would point out what CPU or task was stuck in an RCU read-side critical section. It is quite easy to believe that diagnostics could be improved, both within RCU and elsewhere, but if this was intended to be a bug report, it is woefully insufficient.
  8. It is good to see that Daniel found at least one RCU use case that he likes (or at least doesn't hate too intensely), namely xarray lookups combined with kref_get_unless_zero() and kfree_rcu(). Perhaps this is a start, especially if the code following that kref_get_unless_zero() invocation does additional atomic operations on the xarray object, which would hide at least some of the kref_get_unless_zero() overhead. A sketch of this pattern appears below.
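Here is a hedged sketch of that last pattern, again using hypothetical names (struct my_entry, my_entries, and friends). In contrast to the earlier SLAB_TYPESAFE_BY_RCU sketch, no identity recheck is needed because kfree_rcu() prevents the memory from being reused until all pre-existing readers have finished:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/xarray.h>

struct my_entry {
        struct kref ref;
        unsigned long id;
        struct rcu_head rh;
};

static DEFINE_XARRAY(my_entries);

static struct my_entry *my_entry_get(unsigned long id)
{
        struct my_entry *p;

        rcu_read_lock();
        p = xa_load(&my_entries, id);
        if (p && !kref_get_unless_zero(&p->ref))
                p = NULL;               /* Racing with the final put: lookup fails. */
        rcu_read_unlock();
        return p;
}

static void my_entry_release(struct kref *ref)
{
        struct my_entry *p = container_of(ref, struct my_entry, ref);

        kfree_rcu(p, rh);               /* Defer the free until readers are done. */
}

static void my_entry_put(struct my_entry *p)
{
        kref_put(&p->ref, my_entry_release);
}

static void my_entry_delete(unsigned long id)   /* Update side. */
{
        struct my_entry *p = xa_erase(&my_entries, id); /* No new lookups can find it. */

        if (p)
                my_entry_put(p);        /* Drop the xarray's reference. */
}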

The following table shows how intensively the RCU API is used by various v5.19 kernel subsystems:

Subsystem   Uses          LoC  Uses/KLoC
---------   ----   ----------  ---------
ipc           91        9,822       9.26
virt          68        9,013       7.54
net         7457    1,221,681       6.10
security     599      107,622       5.57
kernel      1796      423,581       4.24
mm           324      170,176       1.90
init           8        4,236       1.89
block        108       65,291       1.65
lib          319      214,291       1.49
fs          1416    1,470,567       0.96
include      836    1,167,274       0.72
drivers     5596   20,861,746       0.27
arch         546    2,189,975       0.25
crypto         6      102,307       0.06
sound         21    1,378,546       0.02
---------   ----   ----------  ---------
Total      19191   29,396,128       0.65

As you can see, the drivers subsystem has the second-highest total number of RCU API uses, but it also has by far the largest number of lines of code. As a result, the RCU usage intensity in the drivers subsystem is quite low, at about 27 RCU uses per 100,000 lines of code. But this view is skewed, as can be seen by looking more deeply within the drivers subsystem, but leaving out (aside from in the Total line) any drivers containing fewer than 100 instances of the RCU API:

Subsystem           Uses          LoC  Uses/KLoC
---------           ----   ----------  ---------
drivers/target       293       62,532       4.69
drivers/block        355       97,265       3.65
drivers/md           334      147,881       2.26
drivers/infiniband   548      434,430       1.26
drivers/net         2607    4,381,955       0.59
drivers/staging      114      618,763       0.18
drivers/scsi         150    1,011,146       0.15
drivers/gpu          399    5,753,571       0.07
---------           ----   ----------  ---------
Total               5596   20,861,746       0.27

The drivers/infiniband and drivers/net subtrees account for more than half of the RCU usage in the Linux kernel's drivers, and could be argued to be more about networking than about generic device drivers. And although drivers/gpu comes in third in terms of RCU usage, it also comes in first in terms of lines of code, making it one of the least intense users of RCU. So one could argue that Daniel already has his wish, at least within the confines of drivers/gpu.

This data suggests that Daniel might usefully consult with the networking folks in order to gain valuable guidelines on the use of RCU and perhaps atomics and memory barriers as well. On the other hand, it is quite possible that such consultations actually caused some of Daniel's frustration. You see, a system implementing networking must track the state of the external network, and it can take many seconds or even minutes for changes in that external state to propagate to that system. Therefore, expensive synchronization within that system is less useful than one might think: No matter how many locks and mutexes that system acquires, it cannot prevent external networking hardware from being reconfigured or even from failing completely.

Moving to drivers/gpu, in theory, if there are state changes initiated by the GPU hardware without full-system synchronization, networking RCU usage patterns should apply directly to GPU drivers. In contrast, there might be significant benefits from more tightly synchronizing state changes initiated by the system. Again, perhaps the aforementioned System-V semaphore ID example can help in such cases. But to be fair, given Daniel's preference for immediately acquiring a reference to RCU-protected data, perhaps drivers/gpu code is already taking this approach.

The “Locking Antipattern: Atomics” section is best taken point by point:

  1. For good or for ill, the ordering (or lack thereof) of Linux kernel atomics predates C++ atomics by about a decade. One big advantage of the Linux-kernel approach is that all accesses to atomic*_t variables are marked. This is a great improvement over C++, where a sequentially consistent load or store looks just like a normal access to a local variable, which can cause a surprising amount of confusion.
  2. Please please please do not “sprinkle” memory barriers over the code!!! Instead, actually design the required communication and ordering, and then use the best primitives for the job. For example, instead of “smp_mb__before_atomic(); atomic_inc(&myctr); smp_mb__after_atomic();”, maybe you should consider invoking atomic_inc_return(), discarding the return value if it is not needed.
  3. Indeed, some atomic functions operate on non-atomic*_t variables. But in many cases, you can use their atomic*_t counterparts. For example, instead of READ_ONCE(), atomic_read(). Instead of WRITE_ONCE(), atomic_set(). Instead of cmpxchg(), atomic_cmpxchg(). Instead of set_bit(), in many situations, atomic_or(). On the other hand, I will make no attempt to defend the naming of set_bit() and __set_bit().

To Daniel's discussion of “unnecessary trap doors”, I can only agree that reinventing read-write semaphores is a very bad thing.

I will also make no attempt to defend ill-thought-out hacks involving weak references or RCU. Sure, a quick fix to get your production system running is all well and good, but the real fix should be properly designed.

And Daniel makes an excellent argument when he says that if a counter can be protected by an already held lock, that counter should be implemented using normal C-language accesses to normal integral variables. For those situations where no such lock is at hand, there are a lot of atomic and per-CPU counting examples that can be followed, both in the Linux kernel and in Chapter 5 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”. Again, why unnecessarily re-invent the wheel?

However, the last paragraph, stating that atomic operations should only be used for locking and synchronization primitives in the core kernel is a bridge too far. After all, a later section allows for memory barriers to be used in libraries (at least driver-hacker-proof libraries), so it seems reasonable that atomic operations can also be used in libraries.

Some help is provided by the executable Linux-kernel memory model (LKMM) in tools/memory-model along with the kernel concurrency sanitizer (KCSAN), which is documented in Documentation/dev-tools/kcsan.rst, but there is no denying that these tools currently require significant expertise. Help notwithstanding, it almost always makes a lot of sense to hide complex operations, including complex operations involving concurrency, behind well-designed APIs.

And help is definitely needed, given that there are more than 10,000 invocations of atomic operations in the drivers tree, more than a thousand of which are in drivers/gpu.

There is not much to say about the “Locking Antipattern: preempt/local_irq/bh_disable() and Friends” section. These primitives are not heavily used in the drivers tree. However, lockdep does have enough understanding of these primitives to diagnose misuse of irq-disabled and bh-disabled spinlocks. Which is a good thing, given that there are some thousands of uses of irq-disabled spinlocks in the drivers tree, along with a good thousand uses of bh-disabled spinlocks.
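For instance, in the following hypothetical sketch (the driver names are invented), lockdep will complain if the process-context path uses plain spin_lock(), because the same lock is also acquired from the interrupt handler:

    #include <linux/interrupt.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(mydev_lock);

    static irqreturn_t mydev_irq(int irq, void *arg)
    {
            spin_lock(&mydev_lock);         /* Interrupts are already disabled here. */
            /* ... record device state ... */
            spin_unlock(&mydev_lock);
            return IRQ_HANDLED;
    }

    static void mydev_poke(void)
    {
            unsigned long flags;

            /*
             * Plain spin_lock() here could deadlock against mydev_irq() on
             * this CPU.  With CONFIG_PROVE_LOCKING=y, lockdep flags exactly
             * this sort of inconsistent usage.
             */
            spin_lock_irqsave(&mydev_lock, flags);
            /* ... update the same state ... */
            spin_unlock_irqrestore(&mydev_lock, flags);
    }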

The “Locking Antipattern: Memory Barriers” section suggests that memory barriers should be packaged in a library or core kernel service, which is in the common case excellent advice. Again, the executable LKMM and KCSAN can help, but again these tools currently require some expertise. I was amused by Daniel's “I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly”, and I am glad that my articles and talks provide at least a little entertainment value, if nothing else. ;-)
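As a rough sketch of such packaging (all names invented), the release/acquire pairing can live inside a pair of helpers, so that driver code never writes a raw barrier:

    #include <linux/types.h>
    #include <asm/barrier.h>

    struct fw_status {
            u32 version;
            u32 flags;
    };

    static struct fw_status fw_status;
    static bool fw_status_valid;

    /* Writer: fill in the data, then publish it (once). */
    static void fw_status_publish(u32 version, u32 flags)
    {
            fw_status.version = version;
            fw_status.flags = flags;
            smp_store_release(&fw_status_valid, true);
    }

    /* Reader: if the flag is seen, so is the data filled in before it. */
    static bool fw_status_read(struct fw_status *out)
    {
            if (!smp_load_acquire(&fw_status_valid))
                    return false;
            *out = fw_status;
            return true;
    }

Note that this sketch handles only one-shot publication; repeated updates would need something more, for example a seqlock. Either way, the ordering lives in two small functions rather than being scattered across the driver.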

Summing up my view of these two blog posts, Daniel recommends that most driver code avoid concurrency entirely. Failing that, he recommends sticking to certain locking and reference-counting use cases, albeit including a rather complex acquire-locks-in-any-order use case. For the most part, he recommends against atomics, RCU, and memory barriers, with a very few exceptions.

For me, reading these blog posts induced great nostalgia, taking me back to my early 1990s days at Sequent, when the guidelines were quite similar, give or take a large number of non-atomically manipulated per-CPU counters. But a few short years later, many Sequent engineers were using atomic operations, and yes, a few were even using RCU. Including one RCU use case in a device driver, though that use case could instead be served by the Linux kernel's synchronize_irq() primitive.

Still, the heavy use of RCU and (even more so) of atomics within the drivers tree, combined with Daniel's distaste for these primitives, suggests that some sort of change might be in order.

Can We Fix This?

Of course we can!!!

But will a given fix actually improve the situation? That is the question.

Reading through this reminded me that I need to take another pass through the RCU documentation. I have queued a commit to fix the misleading wording for SLAB_TYPESAFE_BY_RCU on the -rcu tree: 08f8f09b2a9e ("doc: SLAB_TYPESAFE_BY_RCU uses cannot rely on spinlocks"). I also expect to improve the documentation of reference counting and its relation to SLAB_TYPESAFE_BY_RCU, and will likely find a number of other things in need of improvement.

This is also as good a time as any to announce that I will be holding an RCU Office Hours birds-of-a-feather session at the 2022 Linux Plumbers Conference, in case that is helpful.

However, the RCU documentation must of necessity remain fairly high level. And to that end, GPU-specific advice about use of xarray, kref_get_unless_zero(), and kfree_rcu() really needs to be in Documentation/gpu as opposed to Documentation/RCU. This would allow that advice to be much more specific and thus much more helpful to the GPU developers and maintainers. Alternatively, perhaps improved GPU-related APIs are required in order to confine concurrency to functions designed for that purpose. This alternative approach has the benefit of allowing GPU device drivers to focus more on GPU-specific issues and less on concurrency. On the other hand, given that the GPU drivers comprise some millions of lines of code, this might be easier said than done.

It is all too easy to believe that it is possible to improve the documentation for a number of other facilities that Daniel called on the carpet. At the same time, it is important to remember that the intent of documentation is communication, and that the optimal mode of communication depends on the target audience. At its best, documentation builds a bridge from where the target audience currently is to where they need to go. Which means the broader the target audience, the more difficult it is to construct that bridge. Which in turn means that a given subsystem likely needs usage advice and coding standards specific to that subsystem. One size does not fit all.

The SLAB_TYPESAFE_BY_RCU facility was called on the carpet, and perhaps understandably so. Would it help if SLAB_TYPESAFE_BY_RCU were to be changed so as to allow locks to be acquired on objects that might at any time be passed to kmem_cache_free() and then reallocated via kmem_cache_alloc()? In theory, this is easy: Just have the kmem_cache in question zero pages allocated from the system before splitting them up into objects. Then a given object could have an “initialized” flag, and if that flag was found to be cleared in an object just returned from kmem_cache_alloc(), then and only then would the object's lock be initialized. This would allow a lock to be acquired (under rcu_read_lock(), of course) on a freed object, and would allow a lock to be held on an object despite its being passed to kmem_cache_free() and returned from kmem_cache_alloc() in the meantime.
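To make this hypothetical concrete, here is a sketch of what such a cache user might look like. Keep in mind that this assumes the proposed (and currently nonexistent) zero-pages-before-splitting behavior, and that all of the type, field, and function names are invented:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct myobj {
            spinlock_t lock;
            bool initialized;       /* Zero only on first-ever use of this slot. */
            int key;
            /* ... */
    };

    static struct kmem_cache *myobj_cache; /* Created with SLAB_TYPESAFE_BY_RCU. */

    static struct myobj *myobj_alloc(int key)
    {
            struct myobj *p = kmem_cache_alloc(myobj_cache, GFP_KERNEL);

            if (!p)
                    return NULL;
            if (!p->initialized) {  /* First use ever, so no reader can hold the lock. */
                    spin_lock_init(&p->lock);
                    p->initialized = true;
            }
            spin_lock(&p->lock);    /* Never reinitialized, so readers' locks stay valid. */
            p->key = key;
            spin_unlock(&p->lock);
            return p;
    }

    /*
     * Reader: the lock may be acquired even if the object is freed and
     * reallocated out from under us, so revalidate identity under the lock.
     */
    static bool myobj_update_if_match(struct myobj *p, int key)
    {
            bool ret = false;

            rcu_read_lock();
            spin_lock(&p->lock);
            if (p->key == key) {
                    /* ... update fields ... */
                    ret = true;
            }
            spin_unlock(&p->lock);
            rcu_read_unlock();
            return ret;
    }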

In practice, some existing SLAB_TYPESAFE_BY_RCU users might not be happy with the added overhead of page zeroing, so this might require an additional kmem_cache-creation flag to allow zeroing on a kmem_cache-by-kmem_cache basis. However, the first question is “Would this really help?”, and answering that question requires feedback from developers and maintainers who are actually using SLAB_TYPESAFE_BY_RCU.

Some might argue that the device drivers should all be rewritten in Rust, and cynics might argue that Daniel wrote his pair of blog posts with exactly that thought in mind. I am happy to let those cynics make that argument, especially given that I have already held forth on Linux-kernel concurrency in Rust here. However, a possible desire to rust Linux-kernel device drivers does not explain Daniel's distaste for what he calls “zombie objects” because Rust is in fact quite capable of maintaining references to objects that have been removed from their search structure.

Summary and Conclusions

As noted earlier, reading these blog posts induced great nostalgia, taking me back to my time at Sequent in the early 1990s. A lot has happened in the ensuing three decades, including habitual use of locking across the industry, and sometimes even correct use of locking.

But will generic developers ever be able to handle more esoteric techniques involving atomic operations and RCU?

I believe that the answer to this question is “yes”, as laid out in my 2012 paper Beyond Expert-Only Parallel Programming?. As in the past, tooling (including carefully designed APIs), economic forces (including continued ubiquitous multi-core systems), and acculturation (assisted by a vast quantity of open-source software) have done the trick, and I see no reason why these trends will not continue.

But what happens in drivers/gpu is up to the GPU developers and maintainers!

References


Atomic Operations and Memory Barriers, though more description than reference:

  1. Documentation/atomic_t.txt
  2. Documentation/atomic_bitops.txt
  3. Documentation/memory-barriers.txt
  4. Sometimes the docbook header is on the x86 arch_ function, for example, arch_atomic_inc() in arch/x86/include/asm/atomic.h rather than atomic_inc().
  5. The LKMM references below can also be helpful.

Kernel Concurrency Sanitizer (KCSAN):

  1. Documentation/dev-tools/kcsan.rst
  2. Finding race conditions with KCSAN
  3. Concurrency bugs should fear the big bad data-race detector (part 1)
  4. Concurrency bugs should fear the big bad data-race detector (part 2)
  5. Detecting missing memory barriers with KCSAN

Linux-Kernel Memory Model (LKMM):

  1. tools/memory-model, including its Documentation subdirectory.
  2. A formal kernel memory-ordering model (part 1)
  3. A formal kernel memory-ordering model (part 2)
  4. Who's afraid of a big bad optimizing compiler?
  5. Calibrating your fear of big bad optimizing compilers
  6. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled extended edition)

Read-Copy Update (RCU):

  1. Documentation/RCU
  2. The RCU API, 2019 edition
  3. Unraveling RCU-Usage Mysteries (Fundamentals)
  4. Unraveling RCU-Usage Mysteries (Additional Use Cases)
  5. Sections 9.5 and 9.6 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
  6. Many other references, some of which are listed here.