
Hiking Hills

A few years ago, I posted on the challenges of maintaining low weight as one ages.  I have managed to stay near my target weight, with the occasional excursion in either direction, though admittedly more often up than down.  My suspicion that maintaining weight would prove 90% as difficult as losing it has proven to be all too well founded.  As has the observation that exercise is inherently damaging to muscles (see for example here), especially as one's body's ability to repair itself decreases inexorably with age.

It can be helpful to refer back to those old college physics courses.  One helpful formula is the well-worn Newtonian formula for kinetic energy, which is equal to half your mass times the square of your velocity.  Now, the human body does not maintain precisely the same speed while moving (that is after all what bicycles are for), and the faster you are going, the more energy your body must absorb when decreasing your velocity by a set amount on each footfall.  In fact, this amount of energy increases linearly with your average velocity.  So you can reduce the energy absorption (and thus the muscle and joint damage) by decreasing your speed.  And here you were wondering why old people often move much more slowly than do young people!
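
For the record, here is the quick calculation behind that linear claim, with m the mass, v the average velocity, and Δv the fixed velocity decrease absorbed on each footfall:

    \Delta E \;=\; \tfrac{1}{2} m v^2 \;-\; \tfrac{1}{2} m (v - \Delta v)^2
             \;=\; m\,\Delta v\,\bigl(v - \tfrac{\Delta v}{2}\bigr)

For a fixed Δv, the absorbed energy ΔE therefore grows linearly with the average velocity v, just as claimed above.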


Stupid RCU Tricks: CPP Summit Presentation

I had the privilege of presenting Unraveling Fence & RCU Mysteries (C++ Concurrency Fundamentals) to the CPP Summit.  As the title suggests, this covered RCU from a C++ viewpoint.  At the end, there were several excellent questions:

  1. How does the rcu_synchronize() wait-for-readers operation work?
  2. But the rcu_domain class contains lock() and unlock() member functions!!!
  3. Lockless things make me nervous!

There was limited time for questions, and each question's answer could easily have consumed the 50 minutes allotted for the entire talk.  Therefore, I address these questions in the following sections.

How Does rcu_synchronize() Work?

There are a great many ways to make this work.  Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero.  There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems paper (non-paywalled draft).  The most detailed discussion is in the supplementary materials.
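
To make that rough description concrete, here is a minimal (and deliberately naive) C11 sketch of such a counter-based scheme.  The my_rcu_*() names, the thread-registration scheme, and the MY_MAX_THREADS limit are mine rather than those of any real implementation, and this sketch omits the phase-flipping, memory-ordering, and wakeup refinements that production userspace RCU implementations such as liburcu depend on:

    #include <stdatomic.h>
    #include <sched.h>

    #define MY_MAX_THREADS 128

    /* Per-thread read-side nesting counts, sampled by updaters. */
    static _Atomic int my_rcu_nesting[MY_MAX_THREADS];
    static _Thread_local int my_rcu_tid;   /* Assigned at thread registration. */
    static int my_rcu_nthreads;            /* Number of registered threads. */

    static void my_rcu_read_lock(void)
    {
        /* Mark this thread as being within a read-side critical section. */
        atomic_fetch_add_explicit(&my_rcu_nesting[my_rcu_tid], 1,
                                  memory_order_seq_cst);
    }

    static void my_rcu_read_unlock(void)
    {
        /* Release ordering keeps this reader's accesses within the section. */
        atomic_fetch_sub_explicit(&my_rcu_nesting[my_rcu_tid], 1,
                                  memory_order_release);
    }

    static void my_synchronize_rcu(void)
    {
        /* Wait until each thread has been observed with a zero nesting
         * count, which guarantees that all pre-existing readers have
         * completed.  Note that this naive version also waits on readers
         * that started after it did, which real implementations avoid. */
        for (int t = 0; t < my_rcu_nthreads; t++)
            while (atomic_load_explicit(&my_rcu_nesting[t],
                                        memory_order_acquire))
                sched_yield();
    }

The pitfalls mentioned above are exactly the sort of memory-ordering, starvation, and scalability issues that this sketch glosses over.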

More recent information may be found in Section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?

The rcu_domain Class Contains lock() and unlock() Member Functions?

Indeed it does!


Parallel Programming: September 2022 Update

The v2022.09.25a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.

This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.

One of the code samples now uses C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.

Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.

This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.

Kangrejos 2022: The Rust for Linux Workshop

I had the honor of attending this workshop. However, this is not a full report. Instead, I am reporting on what I learned and how it relates to my So You Want to Rust the Linux Kernel? blog series that I started in 2021, and of which this post is a member. I will leave more complete coverage of a great many interesting sessions to others (for example, here, here, and here), who are probably better versed in Rust than am I in any case.

Device Communication and Memory Ordering

Andreas Hindborg talked about his work on a Rust-language PCI NVMe driver for the Linux kernel x86 architecture. I focus on the driver's interaction with device firmware in main memory. As is often the case, there is a shared structure in normal memory in which the device firmware reports the status of an I/O request. This structure has a special 16-bit field that is used to report that all of the other fields are now filled in, so that the Linux device driver can now safely access them.

This of course requires some sort of memory ordering, both on the part of the device firmware and on the part of the device driver. One straightforward approach is for the driver to use smp_load_acquire() to read that 16-bit field, and to access the other fields only if the returned value indicates that they have been filled in.
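
In kernel-style C, that approach might look something like the following sketch, in which the structure layout, the field names, and the MY_PHASE_BIT flag are invented for illustration rather than taken from the actual NVMe driver:

    struct my_completion {
        u64 result;
        u32 reserved;
        u16 sq_head;
        u16 status;        /* Written last by the device firmware. */
    };

    #define MY_PHASE_BIT 0x1   /* Hypothetical "entry is valid" bit. */

    static bool my_poll_completion(struct my_completion *c, u64 *result)
    {
        /* Acquire load orders the status check before the other reads. */
        u16 status = smp_load_acquire(&c->status);

        if (!(status & MY_PHASE_BIT))
            return false;      /* Firmware has not filled in this entry yet. */

        /* Now safe to read the remaining fields. */
        *result = c->result;
        return true;
    }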

To the surprise of several Linux-kernel developers in attendance, Andreas's driver instead does a volatile load of the entire structure, 16-bit field and all. If the 16-bit field indicates that all the fields have been updated by the firmware, it then does a second volatile load of the entire structure and then proceeds to use the values from this second load. Apparently, the performance degradation from these duplicate accesses, though measurable, is quite small. However, given the possibility of larger structures for which the performance degradation might be more profound, Andreas was urged to come up with a more elaborate Rust construction that would permit separate access to the 16-bit field, thus avoiding the redundant memory traffic.

In a subsequent conversation, Andreas noted that the overall performance degradation of this Rust-language prototype compared to the C-language version is at most 2.61%, and that in one case there is an as-yet-unexplained performance increase of 0.67%. Of course, given that both C and Rust are compiled, statically typed, non-garbage-collected, and handled by the same compiler back-end, it should not be a surprise that they achieve similar performance. An interesting question is whether the larger amount of information available to the Rust compiler will ever allow it to consistently provide better performance than does C.

Rust and Linux-Kernel Filesystems

Wedson Almeida Filho described his work on a Rust implementation of portions of the Linux kernel's 9p filesystem. The parts of this effort that caught my attention were the use of the Rust language's asynchronous facilities to implement some of the required state machines and the use of a Rust-language type-aware packet-decoding facility.

So what do these two features have to do with concurrency in general, let alone memory-ordering in particular?

Not much. Sure, call_rcu() is (most of) the async equivalent of synchronize_rcu(), but that should be well-known and thus not particularly newsworthy.
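
For anyone to whom that equivalence is not yet well-known, the following kernel-style sketch (with made-up structure and function names, and with the caller assumed to hold the update-side lock) shows the same removal carried out synchronously and asynchronously:

    struct my_obj {
        struct list_head list;
        struct rcu_head rh;
    };

    /* Synchronous: block until all pre-existing readers have finished. */
    static void my_remove_sync(struct my_obj *p)
    {
        list_del_rcu(&p->list);
        synchronize_rcu();
        kfree(p);
    }

    /* Asynchronous: the callback runs after a grace period elapses. */
    static void my_free_cb(struct rcu_head *rh)
    {
        kfree(container_of(rh, struct my_obj, rh));
    }

    static void my_remove_async(struct my_obj *p)
    {
        list_del_rcu(&p->list);
        call_rcu(&p->rh, my_free_cb);
    }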

But these two features might have considerable influence over the success or lack thereof for the Rust-for-Linux effort.

In the past, much of the justification for use of Rust in Linux has been memory safety, based on the fact that 70% of the Linux exploits have involved memory unsafety. Suppose that Rust manages to address that entire 70%, as opposed to relegating (say) half of it to Rust-language unsafe code. Because eliminating 70% of the exploits still leaves the remaining 30%, use of Rust might at best provide a factor-of-three improvement.

Of course, a factor of three is quite good in absolute terms. However, in my experience, at least an order of magnitude is usually required to motivate, with high probability, the kind of large change represented by the C-to-Rust transition. Therefore, in order for me to rate the Rust-for-Linux project's success as highly probable, another factor of three is required.

One possible source of the additional factor of three might come from the Rust-for-Linux community giving neglected drivers some much-needed attention. Perhaps the 9p work would qualify, but it seems unlikely that the NVMe drivers are lacking attention.

Perhaps Rust async-based state machines and type-aware packet decoding will go some ways towards providing more of the additional factor of three. Or perhaps a single factor of three will suffice in this particular case. Who knows?

“Pinning” for the Linux Kernel in Rust

Benno Lossin and Gary Guo reported on their respective efforts to implement pinning in Rust for the Linux kernel (a more detailed report may be found here).

But perhaps you, like me, are wondering just what pinning is and why the Linux kernel needs it.

It turns out that the Rust language allows the compiler great freedom to rearrange data structures by default, similar to what would happen in C-language code if each and every structure was marked with the Linux kernel's __randomize_layout directive. In addition, by default, Rust-language data can be moved at runtime.

This apparently allows additional optimizations, but it is not particularly friendly to Linux-kernel idioms that take addresses of fields in structures. Such idioms include pretty much all of Linux's linked-list code, as well as things like RCU callbacks. And who knows, perhaps this situation helps to explain recent hostility towards linked lists expressed on social media. ;-)

The Rust language accommodates these idioms by allowing data to be pinned, thus preventing the compiler from moving it around.

However, pinning data has consequences, including the fact that such pinning is visible in Rust's type system. This visibility allows Rust compilers to complain when (for example) list_for_each_entry() is passed an unpinned data structure. Such complaints are important, given that passing unpinned data to primitives expecting pinned data would result in memory unsafety. The code managing the pinning is expected to be optimized away, but there are concerns that it would make itself all too apparent during code review and debugging.

Benno's and Gary's approaches were compared and contrasted, but my guess is that additional work will be required to completely solve this problem. Some attendees would like to see pinning addressed at the Rust-language level rather than at the library level, but time will tell what approach is most appropriate.

RCU Use Cases

Although RCU was not an official topic, for some reason it came up frequently during the hallway tracks that I participated in. Which is probably a good thing. ;-)

One question is exactly how the various RCU use cases should be represented in Rust. Rust's atomic-reference-count facility has been put forward, and it is quite possible that, in combination with liberal use of atomics, it will serve for some of the more popular use cases. Wedson suggested that Revocable covers another group of use cases, and at first glance, it appears that Wedson might be quite right. Still others might be handled by wait-wakeup mechanisms. There are still some use cases to be addressed, but some progress is being made.

Rust and Python

Some people suggested that Rust might take over from Python, adding that Rust solves many problems that Python is said to encounter in large-scale programs, and that Rust should prove more ergonomic than Python. The first is hard to argue with, given that Python seems to be most successful in smaller-scale projects that make good use of Python's extensive libraries. The second seems quite unlikely to me; in fact, it is hard for me to imagine that anything about Rust would seem in any way ergonomic to a dyed-in-the-wool Python developer.

One particularly non-ergonomic attribute of current Rust is likely to be its libraries, which last I checked were considerably less extensive than those of Python. Although the software-engineering shortcomings of large-scale Python projects can motivate and have motivated conversion to Rust, it appears to me that smaller Python projects making heavier use of a wider variety of libraries might need more motivation.

Automatic conversion of Python libraries, anyone? ;-)

Conclusions

I found Kangrejos 2022 to be extremely educational and thought-provoking, and am very glad that I was able to attend. I am still uncertain about Rust's prospects in the Linux kernel, but the session provided some good examples of actual work done and problems solved. Perhaps Rust's async and type-sensitive packet-decoding features will help increase its probability of success, and perhaps there are additional Rust features waiting to be applied, for example, use of Rust iterators to simplify lockdep cycle detection. And who knows? Perhaps future versions of Rust will be able to offer consistent performance advantages. Stranger things have happened!

History

September 19, 2022: Changes based on LPC and LWN reports.

Confessions of a Recovering Proprietary Programmer, Part XIX: Concurrent Computational Models

The September 2022 issue of the Communications of the ACM features an entertaining pair of point/counterpoint articles, with William Dally advocating for concurrent computational models that more accurately model both energy costs and latency here and Uzi Vishkin advocating for incremental modifications to the existing Parallel Random Access Machine (PRAM) model here. Some cynics might summarize Dally's article as “please develop software that makes better use of our hardware so that we can sell you more hardware” while other cynics might summarize Vishkin's article as “please keep funding our research project”. In contrast, I claim that both articles are worth a deeper look. The next section looks at Dally's article, the section after that at Vishkin's article, followed by discussion.

TL;DR: In complete consonance with a depressingly large fraction of 21st-century discourse, they are talking right past each other.

On the Model of Computation: Point: We Must Extend Our Model of Computation to Account for Cost and Location

Dally makes a number of good points. First, he notes that past decades have seen a shift from computation being more expensive than memory references to memory references being more expensive than computation, and by a good many orders of magnitude. Second, he points out that in production, constant factors can be more important than asymptotic behavior. Third, he describes situations where hardware caches are not a silver bullet. Fourth, he calls out the necessity of appropriate computational models for performance programming, though to his credit, he does clearly state that not all programming is performance programming. Fifth, he calls out the global need for reduced energy expended on computation, motivated by projected increases in computation-driven energy demand. Sixth and finally, he advocates for a new computational model that builds on PRAM by (1) assigning different costs to computation and to memory references and (2) making memory-reference costs a function of a distance metric based on locations assigned to processing units and memories.

On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis

Vishkin also makes some interesting points. First, he calls out the days of Moore's-Law frequency scaling, calling for the establishment of a similar “free lunch” scaling in terms of numbers of CPUs. He echoes Dally's point about not all programming being performance programming, but argues that in addition to a “memory wall” and an “energy wall”, there is also a “parallel programming wall” because programming multicore systems is “simply too difficult”. Second, Vishkin calls on hardware vendors to take more account of ease of use when creating computer systems, including systems with GPUs. Third, Vishkin argues that compilers should take up much of the burden of squeezing performance out of hardware. Fourth, he makes several arguments that concurrent systems should adapt to PRAM rather than adapting PRAM to existing hardware. Fifth and finally, he calls for funding for a “multicore software spiral” that he hopes will bring back free-lunch performance increases driven by Moore's Law, but based on increasing numbers of cores rather than increasing clock frequency.

Discussion

Figure 2.3 of Is Parallel Programming Hard, And, If So, What Can You Do About It? (AKA “perfbook”) shows the “Iron Triangle” of concurrent programming. Vishkin focuses primarily on the upper edge of this triangle, which is primarily about productivity. Dally focuses primarily on the lower left edge of this triangle, which is primarily about performance. Except that Vishkin's points about the cost of performance programming are all too valid, which means that defraying the cost of performance programming in the work-a-day world requires very large numbers of users, which is often accomplished by making performance-oriented concurrent software quite general, thus amortizing its cost over large numbers of use cases and users. This puts much performance-oriented concurrent software at the lower tip of the iron triangle, where performance meets generality.

One could argue that Vishkin's proposed relegation of performance-oriented code to compilers is one way to break the iron triangle of concurrent programming, but this argument fails because compilers are themselves low-level software (thus at the lower tip of the triangle). Worse yet, there have been many attempts to put the full concurrent-programming burden on the compiler (or onto transactional memory), but rarely to good effect. Perhaps SQL is the exception that proves this rule, but this is not an example that Vishkin calls out.

Both Dally and Vishkin correctly point to the ubiquity of multicore systems, so it is only reasonable to look at how these systems are handled in practice. After all, it is only fair to check their apparent assumption of “badly”. And Section 2.3 of perfbook lists several straightforward approaches: (1) Using multiple instances of a sequential application, (2) Using the large quantity of existing parallel software, ranging from operating-system kernels to database servers (SQL again!) to numerical software to image-processing libraries to machine-learning libraries, to name but a few of the areas that Python developers could teach us about, and (3) Applying sequential performance optimizations such that single-threaded performance suffices. Those taking door 1 or door 2 will need only a few parallel performance programmers, and it seems to have proven eminently feasible to have those few use a sufficiently accurate computational model. The remaining users simply invoke the software produced by these parallel performance programmers, and need not worry much about concurrency. The cases where none of these three doors apply are more challenging, but surmounting the occasional challenge is part of the programming game, not just the parallel-programming game.

One additional historical trend is the sharply decreasing cost of concurrent systems, both in terms of purchase price and energy consumption. Where an entry-level late-1980s parallel system cost millions of dollars and consumed kilowatts, an entry-level 2022 system can be had for less than $100. This means that there is no longer an overwhelming economic imperative to extract every ounce of performance from most year-2022 concurrent systems. For example, much smartphone code attains the required performance while running single-threaded, which means that additional performance from parallelism could well be a waste of energy. Instead, the smartphone automatically powers down the unused hardware, thus providing only the performance that is actually needed, and does so in an energy-efficient manner. The occasional heavy-weight application might turn on all the CPUs, but only such applications need the ministrations of parallel performance programmers and complex computational models.

Thus, Dally and Vishkin are talking past each other, describing different types of software produced by different types of developers, who can each use whatever computational model is appropriate to the task at hand.

Which raises the question of whether there might be a one-size-fits-all computational model. I have my doubts. In my 45 years of supporting myself and (later) my family by developing software, I have seen many models, frameworks, languages, and libraries come and go. In contrast, to the best of my knowledge, the speed of light and the sizes of the various types of atoms have remained constant. Don't get me wrong, I do hope that the trend towards vertical stacking of electronics within CPU chips will continue, and I also hope that this trend will result in smaller systems that are correspondingly less constrained by speed-of-light and size-of-atoms limitations. However, the laws of physics appear to be exceedingly stubborn things, and we would therefore be well-advised to work together. Dally's attempt to dictate current hardware conveniences to software developers is about as likely to succeed as is Vishkin's attempt to dictate current software conveniences to hardware developers. Both of these approaches are increasingly likely to fail should the trend towards special-purpose hardware continue full force. Software developers cannot always ignore the laws of physics, and by the same token, hardware developers cannot always ignore the large quantity of existing software and the large number of existing software developers. To be sure, in order to continue to break new ground, both software and hardware developers will need to continue to learn new tricks, but many of these tricks will require coordinated hardware/software effort.

Sadly, there is no hint in either Dally's or Vishkin's article of any attempt to learn what the many successful developers of concurrent software actually do. Thus, it is entirely possible that neither Dally's nor Vishkin's preferred concurrent computational model is applicable to actual parallel programming practice. In fact, in my experience, developers often use the actual hardware and software as the model, driving their development and optimization efforts from actual measurements and from intuitions built up from those measurements. Some might look down on this approach as being both non-portable and non-mathematical, but portability is often achieved simply by testing on a variety of hardware, courtesy of the relatively low prices of modern computer systems. As for mathematics, those of us who appreciate mathematics would do well to admit that it is often much more efficient to let the objective universe carry out the mathematical calculations in its normal implicit and ineffable manner, especially given the complexity of present day hardware and software.

Circling back to the cynics, there are no doubt some cynics that would summarize this blog post as “please download and read my book”. Except that in contrast to the hypothesized cynical summarization of Dally's and Vishkin's articles, you can download and read my book free of charge. Which is by design, given that the whole point is to promulgate information. Cue cynics calling out which of the three of us is the greater fool! ;-)

Stupid SMP Tricks: A Review of Locking Engineering Principles and Hierarchy

Daniel Vetter put together a pair of intriguing blog posts entitled Locking Engineering Principles and Locking Engineering Hierarchy. These appear to be an attempt to establish a set of GPU-wide or perhaps even driver-tree-wide concurrency coding conventions.

Which would normally be none of my business. After all, to establish such conventions, Daniel needs to negotiate with the driver subsystem's developers and maintainers, and I am neither. Except that he did call me out on Twitter on this topic. So here I am, as promised, offering color commentary and the occasional suggestion for improvement, both of Daniel's proposal and of the kernel itself. The following sections review his two posts, and then summarize and amplify suggestions for improvement.

Locking Engineering Principles


Make it Dumb” should be at least somewhat uncontroversial, as this is quite similar to a rather more flamboyant sound bite from my 1970s Mechanical Engineering university coursework. Kernighan's Law asserting that debugging is twice as hard as coding should also be a caution.

This leads us to “Make it Correct”, which many might argue should come first in the list. After all, if correctness is not a higher priority than dumbness (or simplicity, if you prefer), then the do-nothing null solution would always be preferred. And to his credit, Daniel does explicitly state that “simple doesn't necessarily mean correct”.

It is hard to argue with Daniel's choosing to give lockdep place of pride, especially given how many times it has saved me over the years, and not just from deadlock. I never have felt the need to teach lockdep RCU's locking rules, but then again, RCU is much more monolithic than is the GPU subsystem. Perhaps if RCU's locking becomes more ornate, I would also feel the need to acquire RCU's key locks in the intended order at boot time. Give or take the fact that much of RCU can be used very early, even before the call to rcu_init().

The validation point is also a good one, with Daniel calling out might_lock(), might_sleep(), and might_alloc(). I go further and add lockdep_assert_irqs_disabled(), lockdep_assert_irqs_enabled(), and lockdep_assert_held(), but perhaps these additional functions are needed in low-level code like RCU more than they are in GPU drivers.
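
For those who have not used them, these primitives let a function document and enforce its concurrency requirements at runtime when the corresponding debugging options are enabled.  A hypothetical example, with struct mydev and its fields invented purely for illustration:

    struct mydev {
        spinlock_t lock;
        unsigned long update_count;
    };

    /* Caller must hold mydev->lock; lockdep complains at runtime if not. */
    static void mydev_update_stats(struct mydev *mydev)
    {
        lockdep_assert_held(&mydev->lock);
        mydev->update_count++;
    }

    /* This function can sleep, so might_sleep() complains if it is ever
     * called from atomic context, even on runs where it does not sleep. */
    static void mydev_reset(struct mydev *mydev)
    {
        might_sleep();
        /* ... issue the reset and wait for the hardware ... */
    }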

Daniel's admonishments to avoid reinventing various synchronization wheels of course comprise excellent common-case advice. And if you believe that your code is the uncommon case, you are most likely mistaken. Furthermore, it is hard to argue with “pick the simplest lock that works”.

It is hard to argue with the overall message of “Make it Fast”, especially the bit about the pointlessness of crashing faster. However, a minimum level of speed is absolutely required. For a 1977 example, an old school project required keeping up with the display. If the code failed to keep up with the display, that code was useless. More recent examples include deep sub-second request-response latency requirements in many Internet datacenters. In other words, up to a point, performance is not simply a “Make it Fast” issue, but also a “Make it Correct” issue. On the other hand, once the code meets its performance requirements, any further performance improvements instead fall into the lower-priority “Make it Fast” bucket.

Except that things are not always quite so simple. Not all performance improvements can be optimized into existence late in the game. In fact, if your software's performance requirements are expected to tax your system's resources, it will be necessary to make performance a first-class consideration all the way up to and including the software-architecture level, as discussed here. And, to his credit, Daniel hints at this when he notes that “the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts”. He also helpfully calls out io_uring as one avenue for radical updates.

“Protect Data, not Code” is good general advice that goes back to Jack Inman's 1985 USENIX paper entitled “Implementing Loosely Coupled Functions on Tightly Coupled Engines”, if not further. However, there are times where what needs to be locked is not data, but perhaps a particular set of state transitions, or maybe a rarely accessed state, which might be best represented by the code for that state. That said, there can be no denying the big-kernel-lock hazards of excessive reliance on code locking. Furthermore, I have only rarely needed abstract non-data locking.

Nevertheless, a body of code as large as the full set of Linux-kernel device drivers should be expected to have a large number of exceptions to the “Protect Data” rule of thumb, reliable though that rule has proven in my own experience. The usual way of handling this is a waiver process. After all, given that every set of safety-critical coding guidelines I have seen has a waiver process, it only seems reasonable that device-driver concurrency coding conventions should also involve a waiver process. But that decision belongs to the device-driver developers and maintainers, not with me.

Locking Engineering Hierarchy


Most of the Level 0: No Locking section should be uncontroversial: Do things the easy way, at least when the easy way works. This approach goes back decades, for but one example, single-threaded user-interface code delegating concurrency to a database management system. And it should be well-understood that complicated things can be made easy (or at least easier) through use of carefully constructed and time-tested APIs, for example, the queue_work() family of APIs called out in this section. One important question for all such APIs is this: “Can someone with access to only the kernel source tree and a browser quickly find and learn how to use the given API?” After all, people are able to properly use the APIs they see elsewhere if they can quickly and easily learn about them.

And yes, given Daniel's comments regarding struct dma_fence later in this section, the answer for SLAB_TYPESAFE_BY_RCU appears to be “no” (but see Documentation/RCU/rculist_nulls.rst). The trick with SLAB_TYPESAFE_BY_RCU is that readers must validate the allocation. A common way of doing this is to have a reference count that is set to the value one immediately after obtaining the object from kmem_cache_alloc() and set to the value zero immediately before passing the object to kmem_cache_free(). And yes, I have a patch queued to fix the misleading text in Documentation/RCU/whatisRCU.rst, and I apologize to anyone who might have attempted to use locking without a reference count. But please also let the record show that there was no bug report.

A reader attempting to obtain a new lookup-based reference to a SLAB_TYPESAFE_BY_RCU object must do something like this:

  1. atomic_inc_not_zero(), which will return false and not do the increment if the initial value was zero. Presumably dma_fence_get_rcu() handles this for struct dma_fence.
  2. If the value returned above was false, pretend that the lookup failed. Otherwise, continue with the following steps.
  3. Check the identity of the object. If this turns out to not be the desired object, release the reference (cleaning up if needed) and pretend that the lookup failed. Otherwise, continue. Presumably dma_fence_get_rcu_safe() handles this for struct dma_fence, in combination with its call to dma_fence_put().
  4. Use the object.
  5. Release the reference, cleaning up if needed.
This of course underscores Daniel's point (leaving aside the snark), which is that you should not use SLAB_TYPESAFE_BY_RCU unless you really need to, and even then you should make sure that you know how to use it properly.
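
Pulling the above steps together, a reader's lookup might look something like the following sketch, in which struct my_obj, my_table_lookup(), and my_obj_put() are placeholders rather than real kernel APIs, and in which the allocation path is assumed to set the reference count to one immediately after kmem_cache_alloc() and to zero immediately before kmem_cache_free():

    struct my_obj {
        int key;
        atomic_t refcnt;
        /* ... */
    };

    static struct my_obj *my_obj_get(int key)
    {
        struct my_obj *p;

        rcu_read_lock();
        p = my_table_lookup(key);        /* RCU-protected search structure. */
        if (p && !atomic_inc_not_zero(&p->refcnt))
            p = NULL;                    /* Object on its way out: lookup fails. */
        rcu_read_unlock();

        /* The reference prevents freeing, but the object might have been
         * freed and reallocated before the reference was obtained, so
         * verify its identity before trusting it. */
        if (p && p->key != key) {
            my_obj_put(p);               /* Release reference, clean up if last. */
            return NULL;
        }
        return p;
    }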

But just when do you need to use SLAB_TYPESAFE_BY_RCU? One example is when you need RCU protection on such a hot fastpath that you cannot tolerate RCU-freed objects becoming cold in the CPU caches due to RCU's grace-period delays. Another example is where objects are being allocated and freed at such a high rate that if each and every kmem_cache_free() was subject to grace-period delays, excessive memory would be tied up waiting for grace periods, perhaps even resulting in out-of-memory (OOM) situations.

In addition, carefully constructed semantics are required. Semantics based purely on ordering tend to result in excessive conceptual complexity and thus in confusion. More appropriate semantics tend to be conditional, for example: (1) Objects that remain in existence during the full extent of the lookup are guaranteed to be found, (2) Objects that do not exist at any time during the full extent of the lookup are guaranteed not to be found, and (3) Objects that exist for only a portion of the lookup might or might not be found.

Does struct dma_fence need SLAB_TYPESAFE_BY_RCU? Is the code handling struct dma_fence doing the right thing? For the moment, I will just say that: (1) The existence of dma_fence_get_rcu_safe() gives me at least some hope, (2) Having two different slab allocators stored into different static variables having the same name is a bit code-reader-unfriendly, and (3) The use of SLAB_TYPESAFE_BY_RCU for a structure that contains an rcu_head structure is a bit unconventional, but perhaps these structures are sometimes obtained from kmalloc() instead of from kmem_cache_alloc().

One thing this section missed (or perhaps intentionally omitted): If you do need memory barriers, you are almost always better off using smp_store_release() and smp_load_acquire() than the old-style smp_wmb() and smp_rmb().
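
For example, a hypothetical producer/consumer pair that might once have been written with smp_wmb() and smp_rmb() can instead pair a release store with an acquire load (structure and function names invented for illustration):

    struct my_msg {
        int data;
        int ready;        /* Nonzero once ->data may be read. */
    };

    static void my_publish(struct my_msg *m, int value)
    {
        m->data = value;
        /* Release: the ->data store is ordered before the ->ready store. */
        smp_store_release(&m->ready, 1);
    }

    static bool my_consume(struct my_msg *m, int *value)
    {
        /* Acquire: the ->ready load is ordered before the ->data load. */
        if (!smp_load_acquire(&m->ready))
            return false;
        *value = m->data;
        return true;
    }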

The Level 1: Big Dumb Lock section certainly summarizes the pain of finding the right scope for your locks. Despite Daniel's suggesting that lockless tricks are almost always the wrong approach, such tricks are in common use. I would certainly agree that randomly hacking lockless tricks into your code will almost always result in disaster. In fact, this process will use your own intelligence against you: The smarter you think you are, the deeper a hole you will have dug for yourself before you realize that you are in trouble.

Therefore, if you think that you need a lockless trick, first actually measure the performance. Second, if there really is a performance issue, do the work to figure out which code and data is actually responsible, because your blind guesses will often be wrong. Third, read documentation, look at code, and talk to people to learn and fully understand how this problem has been solved in the past, then carefully apply those solutions. Fourth, if there is no solution, review your overall design, because an ounce of proper partitioning is worth some tons of clever lockless code. Fifth, if you must use a new solution, verify it beyond all reason. The code in kernel/rcu/rcutorture.c will give you some idea of the level of effort required.

The Level 2: Fine-grained Locking sections provide some excellent advice, guidance, and cautionary tales. Yes, lockdep is a great thing, but there are deadlocks that it does not detect, so some caution is still required.

The yellow-highlighted Locking Antipattern: Confusing Object Lifetime and Data Consistency describes how holding locks across non-memory-style barrier functions, that is, things like flush_work() as opposed to things like smp_mb(), can cause problems, up to and including deadlock. In contrast, whatever other problems memory-barrier functions such as smp_mb() cause, it does take some creativity to add them to a lock-based critical section so as to cause them to participate in a deadlock cycle. I would not consider flush_work() to be a memory barrier in disguise, although it is quite true that a correct high-performance implementation of flush_work() will require careful memory ordering.

I would have expected a rule such as “Don't hold a lock across flush_work() that is acquired in any corresponding workqueue handler”, but perhaps Daniel is worried about future deadlocks as well as present-day deadlocks. After all, if you do hold a lock across a call to flush_work(), perhaps there will soon be some compelling reason to acquire that lock in a workqueue handler, at which point it is game over due to deadlock.
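
A minimal sketch of the deadlock that such a rule prevents, using invented names and assuming the usual INIT_WORK(&my_work, my_work_handler) at initialization time:

    static DEFINE_MUTEX(my_mutex);
    static struct work_struct my_work;

    static void my_work_handler(struct work_struct *work)
    {
        mutex_lock(&my_mutex);   /* Waits forever if my_teardown() got there first. */
        /* ... */
        mutex_unlock(&my_mutex);
    }

    static void my_teardown(void)
    {
        mutex_lock(&my_mutex);
        /* If my_work_handler() is running and waiting on my_mutex, this
         * flush_work() never returns: deadlock.  Lockdep reports this
         * pattern even on runs where the timing never lines up. */
        flush_work(&my_work);
        mutex_unlock(&my_mutex);
    }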

This issue is of course by no means limited to workqueues. For but one example, it is quite possible to generate similar deadlocks by holding spinlocks across calls to del_timer_sync() that are acquired within timer handlers.

And that is why both flush_work() and del_timer_sync() tell lockdep what they are up to. For example, flush_work() invokes __flush_work() which in turn invokes lock_map_acquire() and lock_map_release() on a fictitious lock, and this same fictitious lock is acquired and released by process_one_work(). Thus, if you acquire a lock in a workqueue handler that is held across a corresponding flush_work(), lockdep will complain, as shown in the 2018 commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") by Johannes Berg. Of course, del_timer_sync() uses this same trick, as shown in the 2009 commit 6f2b9b9a9d75 ("timer: implement lockdep deadlock detection"), also by Johannes Berg.

Nevertheless, deadlocks involving flush_work() and workqueue handlers can be subtle. It is therefore worth investing some up-front effort to avoid them.

The orange-highlighted Level 2.5: Splitting Locks for Performance Reasons discusses splitting locks. As the section says, there are complications. For one thing, lock acquisitions are anything but free, having overheads of hundreds or thousands of instructions at best, which means that adding additional levels of finer-grained locks can actually slow things down. Therefore, Daniel's point about prioritizing architectural restructuring over low-level synchronization changes is an extremely good one, and one that is all too often ignored.

As Daniel says, when moving to finer-grained locking, it is necessary to avoid increasing the common-case number of locks being acquired. As one might guess from the tone of this section, this is not necessarily easy. Daniel suggests reader-writer locking, which can work well in some cases, but suffers from performance and scalability limitations, especially in situations involving short read-side critical sections. The fact that readers and writers exclude each other can also result in latency/response-time issues. But again, reader-writer locking can be a good choice in some cases.

The last paragraph is an excellent cautionary tale. Take it from Daniel: Never let userspace dictate the order in which the kernel acquires locks. Otherwise, you, too, might find yourself using wait/wound mutexes. Worse yet, if you cannot set up a two-phase locking discipline in which all required locks are acquired before any work is done, you might find yourself writing deadlock-recovery code, or wounded-mutex recovery code, if you prefer. This recovery code can be surprisingly complex. In fact, one of the motivations for the early 1990s introduction of RCU into DYNIX/ptx was that doing so allowed deletion of many thousands of lines of such recovery code, along with all the yet-undiscovered bugs that code contained.

The red-highlighted Level 3: Lockless Tricks section begins with the ominous sentence “Do not go here wanderer!”

And I agree. After all, if you are using the facilities discussed in this section, namely RCU, atomics, preempt_disable(), local_bh_disable(), local_irq_save(), or the various memory barriers, you had jolly well better not be wandering!!!

Instead, you need to understand what you are getting into and you need to have a map in the form of a principled design and a careful well-validated implementation. And new situations may require additional maps to be made, although there are quite a few well-used maps already in Documentation/RCU and over here. In addition, as Daniel says, algorithmic and architectural fixes can often provide much better results than can lockless tricks applied at low levels of abstraction. Past experience suggests that some of those algorithmic and architectural fixes will involve lockless tricks on fastpaths, but life is like that sometimes.

It is now time to take a look at Daniel's alleged antipatterns.

The “Locking Antipattern: Using RCU” section does have some “interesting” statements:

  1. RCU is said to mix up lifetime and consistency concerns. It is not clear exactly what motivated this statement, but this view is often a symptom of a strictly temporal view of RCU. To use RCU effectively, one must instead take a combined spatio-temporal view, as described here and here.
  2. rcu_read_lock() is said to provide both a read-side critical section and to extend the lifetime of any RCU-protected object. This is true in common use cases because the whole purpose of an RCU read-side critical section is to ensure that any RCU-protected object that was in existence at any time during that critical section remains in existence up to the end of that critical section. It is not clear what distinction Daniel is attempting to draw here. Perhaps he likes the determinism provided by full mutual exclusion. If so, never forget that the laws of physics dictate that such determinism is often surprisingly expensive.
  3. The next paragraph is a surprising assertion that RCU readers' deadlock immunity is a bad thing! Never forget that although RCU can be used to replace reader-writer locking in a great many situations, RCU is not reader-writer locking. Which is a good thing from a performance and scalability viewpoint as well as from a deadlock-immunity viewpoint.
  4. In a properly designed system, locks and RCU are not “papering over” lifetime issues, but instead properly managing object lifetimes. And if your system is not properly designed, then any and all facilities, concurrent or not, are weapons-grade dangerous. Again, perhaps more maps are needed to help those who might otherwise wander into improper designs. Or perhaps existing maps need to be more consistently used.
  5. RCU is said to practically force you to deal with “zombie objects”. Twitter discussions with Daniel determined that such “zombie objects” can no longer be looked up, but are still in use. Which means that many use cases of good old reference counting also force you to deal with zombie objects: After all, removing an object from its search structure does not invalidate the references already held on that object. But it is quite possible to avoid zombie objects for both RCU and for reference counting through use of per-object locks, as is done in the Linux-kernel code that maps from a System-V semaphore ID to the corresponding in-kernel data structure, first described in Section of this paper. In short, if you don't like zombies, there are simple RCU use cases that avoid them. You get RCU's speed and deadlock immunity within the search structure, but full ordering, consistency, and zombie-freedom within the searched-for object.
  6. The last bullet in this section seems to argue that people should upgrade from RCU read-side critical sections to locking or reference counting as quickly as possible. Of course, such locking or reference counting can add problematic atomic-operation and cache-miss overhead to those critical sections, so specific examples would be helpful. One non-GPU example is the System-V semaphore ID example mentioned above, where immediate lock acquisition is necessary to provide the necessary System-V semaphore semantics.
  7. On freely using RCU, again, proper design is required, and not just with RCU. One can only sympathize with a driver dying in synchronize_rcu(). Presumably Daniel means that one of the driver's tasks hung in synchronize_rcu(), in which case there should have been an RCU CPU stall warning message which would point out what CPU or task was stuck in an RCU read-side critical section. It is quite easy to believe that diagnostics could be improved, both within RCU and elsewhere, but if this was intended to be a bug report, it is woefully insufficient.
  8. It is good to see that Daniel found at least one RCU use case that he likes (or at least doesn't hate too intensely), namely xarray lookups combined with kref_get_unless_zero() and kfree_rcu(). Perhaps this is a start, especially if the code following that kref_get_unless_zero() invocation does additional atomic operations on the xarray object, which would hide at least some of the kref_get_unless_zero() overhead.

The following table shows how intensively the RCU API is used by various v5.19 kernel subsystems:

Subsystem   Uses          LoC  Uses/KLoC
---------   ----   ----------  ---------
ipc           91        9,822       9.26
virt          68        9,013       7.54
net         7457    1,221,681       6.10
security     599      107,622       5.57
kernel      1796      423,581       4.24
mm           324      170,176       1.90
init           8        4,236       1.89
block        108       65,291       1.65
lib          319      214,291       1.49
fs          1416    1,470,567       0.96
include      836    1,167,274       0.72
drivers     5596   20,861,746       0.27
arch         546    2,189,975       0.25
crypto         6      102,307       0.06
sound         21    1,378,546       0.02
---------   ----   ----------  ---------
Total      19191   29,396,128       0.65

As you can see, the drivers subsystem has the second-highest total number of RCU API uses, but it also has by far the largest number of lines of code. As a result, the RCU usage intensity in the drivers subsystem is quite low, at about 27 RCU uses per 100,000 lines of code. But this view is skewed, as can be seen by looking more deeply within the drivers subsystem, but leaving out (aside from in the Total line) those drivers containing fewer than 100 instances of the RCU API:

Subsystem           Uses          LoC  Uses/KLoC
---------           ----   ----------  ---------
drivers/target       293       62,532       4.69
drivers/block        355       97,265       3.65
drivers/md           334      147,881       2.26
drivers/infiniband   548      434,430       1.26
drivers/net         2607    4,381,955       0.59
drivers/staging      114      618,763       0.18
drivers/scsi         150    1,011,146       0.15
drivers/gpu          399    5,753,571       0.07
---------           ----   ----------  ---------
Total               5596   20,861,746       0.27

The drivers/infiniband and drivers/net subtrees account for more than half of the RCU usage in the Linux kernel's drivers, and could be argued to be more about networking than about generic device drivers. And although drivers/gpu comes in third in terms of RCU usage, it also comes in first in terms of lines of code, making it one of the least intense users of RCU. So one could argue that Daniel already has his wish, at least within the confines of drivers/gpu.

This data suggests that Daniel might usefully consult with the networking folks in order to gain valuable guidelines on the use of RCU and perhaps atomics and memory barriers as well. On the other hand, it is quite possible that such consultations actually caused some of Daniel's frustration. You see, a system implementing networking must track the state of the external network, and it can take many seconds or even minutes for changes in that external state to propagate to that system. Therefore, expensive synchronization within that system is less useful than one might think: No matter how many locks and mutexes that system acquires, it cannot prevent external networking hardware from being reconfigured or even from failing completely.

Moving to drivers/gpu, in theory, if there are state changes initiated by the GPU hardware without full-system synchronization, networking RCU usage patterns should apply directly to GPU drivers. In contrast, there might be significant benefits from more tightly synchronizing state changes initiated by the system. Again, perhaps the aforementioned System-V semaphore ID example can help in such cases. But to be fair, given Daniel's preference for immediately acquiring a reference to RCU-protected data, perhaps drivers/gpu code is already taking this approach.

The “Locking Antipattern: Atomics” section is best taken point by point:

  1. For good or for ill, the ordering (or lack thereof) of Linux kernel atomics predates C++ atomics by about a decade. One big advantage of the Linux-kernel approach is that all accesses to atomic*_t variables are marked. This is a great improvement over C++, where a sequentially consistent load or store looks just like a normal access to a local variable, which can cause a surprising amount of confusion.
  2. Please please please do not “sprinkle” memory barriers over the code!!! Instead, actually design the required communication and ordering, and then use the best primitives for the job. For example, instead of “smp_mb__before_atomic(); atomic_inc(&myctr); smp_mb__after_atomic();”, maybe you should consider invoking atomic_inc_return(), discarding the return value if it is not needed. (A sketch of this substitution appears just after this list.)
  3. Indeed, some atomic functions operate on non-atomic*_t variables. But in many cases, you can use their atomic*_t counterparts. For example, instead of READ_ONCE(), atomic_read(). Instead of WRITE_ONCE(), atomic_set(). Instead of cmpxchg(), atomic_cmpxchg(). Instead of set_bit(), in many situations, atomic_or(). On the other hand, I will make no attempt to defend the naming of set_bit() and __set_bit().
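
To make item 2 concrete, here is the same counter increment written both ways; myctr and the function names are of course made up:

    static atomic_t myctr = ATOMIC_INIT(0);

    static void my_count_event_sprinkled(void)
    {
        /* The "sprinkled barriers" version called out in item 2 above. */
        smp_mb__before_atomic();
        atomic_inc(&myctr);
        smp_mb__after_atomic();
    }

    static void my_count_event_designed(void)
    {
        /* Value-returning atomics are fully ordered, so one call suffices;
         * the return value is simply discarded if it is not needed. */
        (void)atomic_inc_return(&myctr);
    }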

To Daniel's discussion of “unnecessary trap doors”, I can only agree that reinventing read-write semaphores is a very bad thing.

I will also make no attempt to defend ill-thought-out hacks involving weak references or RCU. Sure, a quick fix to get your production system running is all well and good, but the real fix should be properly designed.

And Daniel makes an excellent argument when he says that if a counter can be protected by an already held lock, that counter should be implemented using normal C-language accesses to normal integral variables. For those situations where no such lock is at hand, there are a lot of atomic and per-CPU counting examples that can be followed, both in the Linux kernel and in Chapter 5 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”. Again, why unnecessarily re-invent the wheel?

However, the last paragraph, stating that atomic operations should only be used for locking and synchronization primitives in the core kernel, is a bridge too far. After all, a later section allows for memory barriers to be used in libraries (at least driver-hacker-proof libraries), so it seems reasonable that atomic operations can also be used in libraries.

Some help is provided by the executable Linux-kernel memory model (LKMM) in tools/memory-model along with the kernel concurrency sanitizer (KCSAN), which is documented in Documentation/dev-tools/kcsan.rst, but there is no denying that these tools currently require significant expertise. Help notwithstanding, it almost always makes a lot of sense to hide complex operations, including complex operations involving concurrency, behind well-designed APIs.

And help is definitely needed, given that there are more than 10,000 invocations of atomic operations in the drivers tree, more than a thousand of which are in drivers/gpu.

There is not much to say about the “Locking Antipattern: preempt/local_irq/bh_disable() and Friends” section. These primitives are not heavily used in the drivers tree. However, lockdep does have enough understanding of these primitives to diagnose misuse of irq-disabled and bh-disabled spinlocks. Which is a good thing, given that there are some thousands of uses of irq-disabled spinlocks in the drivers tree, along with a good thousand uses of bh-disabled spinlocks.

The “Locking Antipattern: Memory Barriers” section suggests that memory barriers should be packaged in a library or core kernel service, which is in the common case excellent advice. Again, the executable LKMM and KCSAN can help, but again these tools currently require some expertise. I was amused by Daniel's “I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly”, and I am glad that my articles and talks provide at least a little entertainment value, if nothing else. ;-)

Summing up my view of these two blog posts, Daniel recommends that most driver code avoid concurrency entirely. Failing that, he recommends sticking to certain locking and reference-counting use cases, albeit including a rather complex acquire-locks-in-any-order use case. For the most part, he recommends against atomics, RCU, and memory barriers, with a very few exceptions.

For me, reading these blog posts induced great nostalgia, taking me back to my early 1990s days at Sequent, when the guidelines were quite similar, give or take a large number of non-atomically manipulated per-CPU counters. But a few short years later, many Sequent engineers were using atomic operations, and yes, a few were even using RCU. Including one RCU use case in a device driver, though that use case could instead be served by the Linux kernel's synchronize_irq() primitive.

Still, the heavy use of RCU and (even more so) of atomics within the drivers tree, combined with Daniel's distaste for these primitives, suggests that some sort of change might be in order.

Can We Fix This?

Of course we can!!!

But will a given fix actually improve the situation? That is the question.

Reading through this reminded me that I need to take another pass through the RCU documentation. I have queued a commit to fix the misleading wording for SLAB_TYPESAFE_BY_RCU on the -rcu tree: 08f8f09b2a9e ("doc: SLAB_TYPESAFE_BY_RCU uses cannot rely on spinlocks"). I also expect to improve the documentation of reference counting and its relation to SLAB_TYPESAFE_BY_RCU, and will likely find a number of other things in need of improvement.

This is also as good a time as any to announce that I will be holding an RCU Office Hours birds-of-a-feather session at the 2022 Linux Plumbers Conference, in case that is helpful.

However, the RCU documentation must of necessity remain fairly high level. And to that end, GPU-specific advice about use of xarray, kref_get_unless_zero(), and kfree_rcu() really needs to be in Documentation/gpu as opposed to Documentation/RCU. This would allow that advice to be much more specific and thus much more helpful to the GPU developers and maintainers. Alternatively, perhaps improved GPU-related APIs are required in order to confine concurrency to functions designed for that purpose. This alternative approach has the benefit of allowing GPU device drivers to focus more on GPU-specific issues and less on concurrency. On the other hand, given that the GPU drivers comprise some millions of lines of code, this might be easier said than done.

It is all too easy to believe that it is possible to improve the documentation for a number of other facilities that Daniel called on the carpet. At the same time, it is important to remember that the intent of documentation is communication, and that the optimal mode of communication depends on the target audience. At its best, documentation builds a bridge from where the target audience currently is to where they need to go. Which means the broader the target audience, the more difficult it is to construct that bridge. Which in turn means that a given subsystem likely needs usage advice and coding standards specific to that subsystem. One size does not fit all.

The SLAB_TYPESAFE_BY_RCU facility was called on the carpet, and perhaps understandably so. Would it help if SLAB_TYPESAFE_BY_RCU were to be changed so as to allow locks to be acquired on objects that might at any time be passed to kmem_cache_free() and then reallocated via kmem_cache_alloc()? In theory, this is easy: Just have the kmem_cache in question zero pages allocated from the system before splitting them up into objects. Then a given object could have an “initialized” flag, and if that flag was cleared in an object just returned from kmem_cache_alloc(), then and only then would that lock be initialized. This would allow a lock to be acquired (under rcu_read_lock(), of course) on a freed object, and would allow a lock to be held on an object despite its being passed to kmem_cache_free() and returned from kmem_cache_alloc() in the meantime.
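
A rough sketch of what such an allocation path might look like, assuming a hypothetical slab flag (here called SLAB_ZERO_PAGES) that provides the page zeroing described above, none of which exists today:

    struct my_obj {
        spinlock_t lock;        /* Remains valid across free/realloc cycles. */
        bool lock_initialized;  /* False only on a freshly zeroed page. */
        /* ... */
    };

    /* Created with SLAB_TYPESAFE_BY_RCU | SLAB_ZERO_PAGES (hypothetical). */
    static struct kmem_cache *my_cache;

    static struct my_obj *my_obj_alloc(gfp_t gfp)
    {
        struct my_obj *p = kmem_cache_alloc(my_cache, gfp);

        if (p && !p->lock_initialized) {
            /* First use of this slab object since its page was zeroed,
             * so no reader can yet hold a reference to it, and it is
             * safe to initialize the lock exactly once. */
            spin_lock_init(&p->lock);
            p->lock_initialized = true;
        }
        return p;
    }

Because the object's type (and thus its lock) persists across kmem_cache_free()/kmem_cache_alloc() cycles, the lock is never reinitialized out from under a reader that acquired it under rcu_read_lock().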

In practice, some existing SLAB_TYPESAFE_BY_RCU users might not be happy with the added overhead of page zeroing, so this might require an additional GFP_ flag to allow zeroing on a kmem_cache-by-kmem_cache basis. However, the first question is “Would this really help?”, and answering that question requires feedback from developers and maintainers who are actually using SLAB_TYPESAFE_BY_RCU.

Some might argue that the device drivers should all be rewritten in Rust, and cynics might argue that Daniel wrote his pair of blog posts with exactly that thought in mind. I am happy to let those cynics make that argument, especially given that I have already held forth on Linux-kernel concurrency in Rust here. However, a possible desire to rust Linux-kernel device drivers does not explain Daniel's distaste for what he calls “zombie objects” because Rust is in fact quite capable of maintaining references to objects that have been removed from their search structure.

Summary and Conclusions

As noted earlier, reading these blog posts induced great nostalgia, taking me back to my time at Sequent in the early 1990s. A lot has happened in the ensuing three decades, including habitual use of locking across the industry, and sometimes even correct use of locking.

But will generic developers ever be able to handle more esoteric techniques involving atomic operations and RCU?

I believe that the answer to this question is “yes”, as laid out in my 2012 paper Beyond Expert-Only Parallel Programming?. As in the past, tooling (including carefully designed APIs), economic forces (including continued ubiquitous multi-core systems), and acculturation (assisted by a vast quantity of open-source software) have done the trick, and I see no reason why these trends will not continue.

But what happens in drivers/gpu is up to the GPU developers and maintainers!

References


Atomic Operations and Memory Barriers, though more description than reference:

  1. Documentation/atomic_t.txt
  2. Documentation/atomic_bitops.txt
  3. Documentation/memory-barriers.txt
  4. Sometimes the docbook header is on the x86 arch_ function, for example, arch_atomic_inc() in arch/x86/include/asm/atomic.h rather than atomic_inc().
  5. The LKMM references below can also be helpful.

Kernel Concurrency Sanitizer (KCSAN):

  1. Documentation/dev-tools/kcsan.rst
  2. Finding race conditions with KCSAN
  3. Concurrency bugs should fear the big bad data-race detector (part 1)
  4. Concurrency bugs should fear the big bad data-race detector (part 2)
  5. Detecting missing memory barriers with KCSAN

Linux-Kernel Memory Model (LKMM):

  1. tools/memory-model, including its Documentation subdirectory.
  2. A formal kernel memory-ordering model (part 1)
  3. A formal kernel memory-ordering model (part 2)
  4. Who's afraid of a big bad optimizing compiler?
  5. Calibrating your fear of big bad optimizing compilers
  6. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled extended edition)

Read-Copy Update (RCU):

  1. Documentation/RCU
  2. The RCU API, 2019 edition
  3. Unraveling RCU-Usage Mysteries (Fundamentals)
  4. Unraveling RCU-Usage Mysteries (Additional Use Cases)
  5. Sections 9.5 and 9.6 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
  6. Many other references, some of which are listed here.
inside

Stupid RCU Tricks: How Read-Intensive is The Kernel's Use of RCU?

RCU is a specialized synchronization mechanism, and is typically used where there are far more readers (rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), and so on) than there are updaters (synchronize_rcu(), call_rcu(), rcu_assign_pointer(), and so on). But does the Linux kernel really make heavier use of RCU's read-side primitives than of its update-side primitives?
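
For those who would like a reminder of what these primitives look like in use, here is a minimal reader/updater sketch. The gp pointer, struct foo, and gp_lock are illustrative inventions rather than code taken from the kernel:

struct foo {
        int a;
};

static struct foo __rcu *gp;            /* RCU-protected pointer. */
static DEFINE_SPINLOCK(gp_lock);        /* Serializes updaters. */

/* Reader: typically invoked very frequently. */
static int reader(void)
{
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p)
                ret = p->a;
        rcu_read_unlock();
        return ret;
}

/* Updater: typically invoked far less frequently. */
static void updater(struct foo *newp)
{
        struct foo *oldp;

        spin_lock(&gp_lock);
        oldp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
        rcu_assign_pointer(gp, newp);
        spin_unlock(&gp_lock);
        synchronize_rcu();      /* Wait for pre-existing readers to finish. */
        kfree(oldp);
}

The interesting question is just how lopsided the ratio of reader() calls to updater() calls turns out to be in a running kernel.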

One way to determine this would be to use something like ftrace to record all the calls to these functions. This works, but trace messages can be lost, especially when applied to frequently invoked functions. Also, dumping out the trace buffer can perturb the system. Another approach is to modify the kernel source code to count these function invocations in a cache-friendly manner, then come up with some way to dump this to userspace. This works, but I am lazy. Yet another approach is to ask the tracing folks for advice.

This last is what I actually did, and because the tracing person I happened to ask happened to be Andrii Nakryiko, I learned quite a bit about BPF in general and the bpftrace command in particular. If you don't happen to have Andrii on hand, you can do quite well with Appendix A and Appendix B of Brendan Gregg's “BPF Performance Tools”. You will of course need to install bpftrace itself, which is reasonably straightforward on many Linux distributions.

Linux-Kernel RCU Read Intensity

Those of you who have used sed and awk have a bit of a running start because you can invoke bpftrace with a -e argument and a series of tracepoint/program pairs, where a program is bpftrace code enclosed in curly braces. This code is compiled, verified, and loaded into the running kernel as a BPF program. When the code finishes executing, the results are printed right there for you on stdout. For example:

bpftrace -e 'kprobe:__rcu_read_lock { @rcu_reader = count(); } kprobe:rcu_gp_fqs_loop { @gp = count(); } interval:s:10 { exit(); }'

This command uses the kprobe facility to attach a program to the __rcu_read_lock() function and to attach a very similar program to the rcu_gp_fqs_loop() function, which happens to be invoked exactly once per RCU grace period. Both programs count the number of calls, with @rcu_reader and @gp being the bpftrace “variables” accumulating the counts, and the count() function doing the counting in a cache-friendly manner. The final interval:s:10 in effect attaches a program to a timer, so that this last program will execute every 10 seconds (“s:10”). Except that the program invokes the exit() function that terminates this bpftrace program at the end of the very first 10-second time interval. Upon termination, bpftrace outputs the following on an idle system:

Attaching 3 probes...

@gp: 977
@rcu_reader: 6435368

In other words, there were about a thousand grace periods and more than six million RCU readers during that 10-second time period, for a read-to-grace-period ratio of more than six thousand. This certainly qualifies as read-intensive.

But what if the system is busy? Much depends on exactly how busy the system is, as well as exactly how it is busy, but let's use that old standby, the kernel build (but using the nice command to avoid delaying bpftrace). Let's also put the bpftrace script into a creatively named file rcu1.bpf like so:

kprobe:__rcu_read_lock
{
        @rcu_reader = count();
}

kprobe:rcu_gp_fqs_loop
{
        @gp = count();
}

interval:s:10
{
        exit();
}

This allows the command bpftrace rcu1.bpf to produce the following output:

Attaching 3 probes...

@gp: 274
@rcu_reader: 78211260

Where the idle system had about one thousand grace periods over the course of ten seconds, the busy system had only 274. On the other hand, the busy system had 78 million RCU read-side critical sections, more than ten times that of the idle system. The busy system had more than one quarter million RCU read-side critical sections per grace period, which is seriously read-intensive.

RCU works hard to make the same grace-period computation cover multiple requests. Because synchronize_rcu() invokes call_rcu(), we can use the number of call_rcu() invocations as a rough proxy for the number of updates, that is, the number of requests for a grace period. (The more invocations of synchronize_rcu_expedited() and kfree_rcu(), the rougher this proxy will be.)
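
For those wondering how counting call_rcu() invocations can capture synchronous grace-period requests, here is a conceptual sketch (emphatically not the kernel's actual implementation) of a synchronize_rcu()-like function built from call_rcu() and a completion:

struct my_rcu_sync {
        struct rcu_head head;
        struct completion done;
};

static void my_rcu_sync_cb(struct rcu_head *head)
{
        struct my_rcu_sync *s = container_of(head, struct my_rcu_sync, head);

        complete(&s->done);     /* A full grace period has now elapsed. */
}

/* Block until all pre-existing RCU readers have completed. */
static void my_synchronize_rcu(void)
{
        struct my_rcu_sync s;

        init_completion(&s.done);
        call_rcu(&s.head, my_rcu_sync_cb);
        wait_for_completion(&s.done);
}

Counting call_rcu() invocations therefore picks up both the asynchronous grace-period requests and, at least approximately, the synchronous ones.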

We can make the bpftrace script more concise by assigning the same action to a group of tracepoints, as in the rcu2.bpf file shown here:

kprobe:__rcu_read_lock, kprobe:call_rcu, kprobe:rcu_gp_fqs_loop { @[func] = count(); } interval:s:10 { exit(); }

With this file in place, bpftrace rcu2.bpf produces the following output in the midst of a kernel build:

Attaching 4 probes...

@[rcu_gp_fqs_loop]: 128
@[call_rcu]: 195721
@[__rcu_read_lock]: 21985946

These results look quite different from the earlier kernel-build results, confirming any suspicions you might harbor about the suitability of kernel builds as a repeatable benchmark. Nevertheless, there are about 170K RCU read-side critical sections per grace period, which is still seriously read-intensive. Furthermore, there are also more than 1.5K call_rcu() invocations per RCU grace period, which means that RCU is able to amortize the overhead of a given grace period down to almost nothing per grace-period request.

Linux-Kernel RCU Grace-Period Latency

The following bpftrace program makes a histogram of grace-period latencies, that is, the time from the call to rcu_gp_init() to the return from rcu_gp_cleanup():

kprobe:rcu_gp_init { @start = nsecs; } kretprobe:rcu_gp_cleanup { if (@start) { @gplat = hist((nsecs - @start)/1000000); } } interval:s:10 { printf("Internal grace-period latency, milliseconds:\n"); exit(); }

The kretprobe attaches the program to the return from rcu_gp_cleanup(). The hist() function computes a log-scale histogram. The check of the @start variable avoids using a beginning-of-time (zero) value of this variable in the common case where this script starts in the middle of a grace period. (Try it without that check!)

The output is as follows:
Attaching 3 probes...
Internal grace-period latency, milliseconds:

@gplat: 
[2, 4)               259 |@@@@@@@@@@@@@@@@@@@@@@                              |
[4, 8)               591 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)              137 |@@@@@@@@@@@@                                        |
[16, 32)               3 |                                                    |
[32, 64)               5 |                                                    |

@start: 95694642573968

Most of the grace periods complete in between four and eight milliseconds, with most of the remainder completing either between two and four milliseconds or between eight and sixteen milliseconds, but with a few stragglers taking up to 64 milliseconds. The final @start line shows that bpftrace simply dumps out all the variables. You can use the delete(@start) function to prevent printing of @start, but please note that the next invocation of rcu_gp_init() will re-create it.

It is nice to know the internal latency of an RCU grace period, but most in-kernel users will be more concerned about the latency of the synchronize_rcu() function, which will need to wait for the current grace period to complete and also for callback invocation. We can measure this function's latency with the following bpftrace script:

kprobe:synchronize_rcu { @start[tid] = nsecs; } kretprobe:synchronize_rcu { if (@start[tid]) { @srlat = hist((nsecs - @start[tid])/1000000); delete(@start[tid]); } } interval:s:10 { printf("synchronize_rcu() latency, milliseconds:\n"); exit(); }

The tid variable contains the ID of the currently running task, which allows this script to associate a given return from synchronize_rcu() with the corresponding call by using tid as an index to the @start variable.

As you would expect, the resulting histogram is weighted towards somewhat longer latencies, though without the stragglers:

Attaching 3 probes...
synchronize_rcu() latency, milliseconds:

@srlat: 
[4, 8)                 9 |@@@@@@@@@@@@@@@                                     |
[8, 16)               31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)              31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@start[4075307]: 96560784497352

In addition, we see not one but two values for @start. The delete statement gets rid of old ones, but any new call to synchronize_rcu() will create more of them.

Linux-Kernel Expedited RCU Grace-Period Latency

Linux kernels will sometimes execute synchronize_rcu_expedited() to obtain a faster grace period, and the following command will further cause synchronize_rcu() to act like synchronize_rcu_expedited():

echo 1 >  /sys/kernel/rcu_expedited

Doing this on a dual-socket system with 80 hardware threads might be ill-advised, but you only live once!

Ill-advised or not, the following bpftrace script measures synchronize_rcu_expedited() latency, but in microseconds rather than milliseconds:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = hist((nsecs - @start[tid])/1000);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

The output of this script run concurrently with a kernel build is as follows:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat: 
[128, 256)            57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)            14 |@@@@@@@@@@@@                                        |
[512, 1K)              1 |                                                    |
[1K, 2K)               2 |@                                                   |
[2K, 4K)               7 |@@@@@@                                              |
[4K, 8K)               2 |@                                                   |
[8K, 16K)              3 |@@                                                  |

@start[4140285]: 97489845318700

Most synchronize_rcu_expedited() invocations complete within a few hundred microseconds, but with a few stragglers around ten milliseconds.

But what about linear histograms? This is what the lhist() function is for, with added minimum, maximum, and bucket-size arguments:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat = lhist((nsecs - @start[tid])/1000, 0, 1000, 100);
                delete(@start[tid]);
        }
}

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

Running this with the usual kernel build in the background:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat: 
[100, 200)            26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 300)            13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[300, 400)             5 |@@@@@@@@@@                                          |
[400, 500)             1 |@@                                                  |
[500, 600)             0 |                                                    |
[600, 700)             2 |@@@@                                                |
[700, 800)             0 |                                                    |
[800, 900)             1 |@@                                                  |
[900, 1000)            1 |@@                                                  |
[1000, ...)           18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |

@start[4184562]: 98032023641157

The final bucket is overflow, containing measurements that exceeded the one-millisecond limit.

The above histogram had only a few empty buckets, but that is mostly because the 18 synchronize_rcu_expedited() instances that overflowed the one-millisecond limit are consolidated into a single [1000, ...) overflow bucket. This is sometimes what is needed, but other times losing the maximum latency can be a problem. This can be dealt with given the following bpftrace program:

kprobe:synchronize_rcu_expedited {
        @start[tid] = nsecs;
}

kretprobe:synchronize_rcu_expedited {
        if (@start[tid]) {
                @srelat[(nsecs - @start[tid])/100000*100] = count();
                delete(@start[tid]);
        }       
}       

interval:s:10 {
        printf("synchronize_rcu() latency, microseconds:\n");
        exit();
}

Given the usual kernel-build background load, this produces the following output:

Attaching 3 probes...
synchronize_rcu() latency, microseconds:


@srelat[1600]: 1
@srelat[500]: 1
@srelat[1000]: 1
@srelat[700]: 1
@srelat[1100]: 1
@srelat[2300]: 1
@srelat[300]: 1
@srelat[400]: 2
@srelat[600]: 3
@srelat[200]: 4
@srelat[100]: 20
@start[763214]: 17487881311831

This is a bit hard to read, but simple scripting can be applied to this output to produce something like this:

100: 20
200: 4
300: 1
400: 2
500: 1
600: 3
700: 1
1000: 1
1100: 1
1600: 1

This produces compact output despite outliers such as the last entry, corresponding to an invocation that took somewhere between 1.6 and 1.7 milliseconds.

Summary

The bpftrace command can be used to quickly and easily create and run compiled in-kernel programs that can measure and monitor a wide variety of things. This post focused on a few aspects of RCU, but quite a bit more material may be found in Brendan Gregg's “BPF Performance Tools” book.
inside

Stupid RCU Tricks: Is RCU Watching?

It is just as easy to ask why RCU wouldn't be watching all the time. After all, you never know when you might need to synchronize!

Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessarily”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.

In this, CPUs are not all that different from people. Interrupting someone who is deep in thought can cause them to lose 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.

And this is why CPUs must tell RCU to stop watching them when they go idle. This allows RCU to ignore them completely, in particular, to refrain from asking them to execute instructions.

In some kernel configurations, RCU also ignores portions of the kernel's entry/exit code, that is, the last bits of kernel code before switching to userspace and the first bits of kernel code after switching away from userspace. This happens only in kernels built with CONFIG_NO_HZ_FULL=y, and even then only on CPUs mentioned in the CPU list passed to the nohz_full kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, the scheduling-clock interrupt can be turned off entirely, thus avoiding disturbing such performance-critical applications.

But if RCU is not watching a given CPU, rcu_read_lock() has no effect on that CPU, which can come as a nasty shock to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a trap for the unwary, which is why kernels built with CONFIG_PROVE_LOCKING=y (lockdep) complain bitterly when rcu_read_lock() is invoked on CPUs that RCU is not watching.

But suppose that you have code using RCU that is invoked both from deep within the idle loop and from normal tasks.

Back in the day, this was not much of a problem. True to its name, the idle loop was not much more than a loop, and the deep architecture-specific code on the kernel entry/exit paths had no need of RCU. This has changed, especially with the advent of idle drivers and governors, to say nothing of tracing. So what can you do?

First, you can invoke rcu_is_watching(), which, as its name suggests, will return true if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. The following example code lays out the current possibilities:

if (rcu_is_watching())
        printk("Invoked from normal or idle task with RCU watching.\n");
else if (is_idle_task(current))
        printk("Invoked from deep within in the idle task where RCU is not watching.\");
else
        printk("Invoked from nohz_full entry/exit code where RCU is not watching.\");

Except that even invoking printk() is an iffy proposition while RCU is not watching.

So suppose that you invoke rcu_is_watching() and it helpfully returns false, indicating that you cannot invoke rcu_read_lock() and friends. What now?

You could do what the v5.18 Linux kernel's kernel_text_address() function does, which can be abbreviated as follows:

no_rcu = !rcu_is_watching();
if (no_rcu)
        rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
        rcu_nmi_exit(); // Return RCU to its prior watchfulness state.

If your code is not so performance-critical, you can do what the arm64 implementation of the cpu_suspend() function does:

RCU_NONIDLE(__cpu_suspend_exit());

This macro forces RCU to watch while it executes its argument as follows:

#define RCU_NONIDLE(a) \
        do { \
                rcu_irq_enter_irqson(); \
                do { a; } while (0); \
                rcu_irq_exit_irqson(); \
        } while (0)

The rcu_irq_enter_irqson() and rcu_irq_exit_irqson() functions are essentially wrappers around the aforementioned rcu_nmi_enter() and rcu_nmi_exit() functions.

Although RCU_NONIDLE() is more compact than the kernel_text_address() approach, it is still annoying to have to pass your code to a macro. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter's good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop's trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like noinstr, for example:

static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
        return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}

We don't need to worry about exactly what this function does. For this blog entry, it is enough to know that its noinstr tag prevents tracing this function, making it less problematic for RCU to not be watching it.

What exactly are you prohibited from doing while RCU is not watching your code?

As noted before, RCU readers are a no-go. If you try invoking rcu_read_lock(), rcu_read_unlock(), rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), or rcu_read_unlock_sched() from regions of code where rcu_is_watching() would return false, lockdep will complain.

On the other hand, using SRCU (srcu_read_lock() and srcu_read_unlock()) is just fine, as is RCU Tasks Trace (rcu_read_lock_trace() and rcu_read_unlock_trace()). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what rcu_is_watching() would return at the time.
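
As a concrete illustration, something like the following SRCU reader remains legal even where rcu_is_watching() would return false. The my_srcu domain and the my_traversal() function are made-up names for this sketch:

/* Minimal SRCU reader sketch; my_srcu and my_traversal() are illustrative. */
DEFINE_STATIC_SRCU(my_srcu);

static void my_traversal(void)
{
        int idx;

        idx = srcu_read_lock(&my_srcu);
        /* Traverse data structures protected by my_srcu here.
         * Updaters wait for readers via synchronize_srcu(&my_srcu). */
        srcu_read_unlock(&my_srcu, idx);
}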

RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks are instead handled by RCU Tasks Rude). Except that in kernels built with CONFIG_PREEMPTION=n and without any of RCU's test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by the remapped RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with CONFIG_NO_HZ_FULL=y, and even then, only on CPUs mentioned in the list given to the nohz_full kernel boot parameter. However, this situation can nevertheless be a trap for the unwary.

Therefore, in post-v5.18 mainline, you can build your kernel with CONFIG_FORCE_TASKS_RCU=y, in which case RCU Tasks will always be built into your kernel, avoiding this trap.

In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with CONFIG_NO_HZ_FULL=y, also from nohz_full CPUs on the low-level kernel entry/exit code paths. Fortunately, recent changes have allowed RCU to watch more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU must avert its gaze requires the special handling described in this blog post.
SequentialCaveman

Parallel Programming: December 2021 Update

It is past time for another release of Is Parallel Programming Hard, And, If So, What Can You Do About It?. But first, what is the difference between an edition and a release?

The main difference is the level of validation. For example, during the several months leading up to the second edition, I read the entire book, fixing issues as I found them. So where an edition is a completed work, a release is primarily for the benefit of people who would like to see a recent snapshot of this book, but without having to install and run LaTeX and its dependencies.

Having cleared that up, here are the big-animal changes since the second edition:

  1. A lot of detailed work to make the new ebook-sized PDF usable, along with other formatting improvements, courtesy of Akira Yokosawa, SeongJae Park, and Balbir Singh.
  2. The nq build now places quick quizzes at the end of each chapter, courtesy of Akira Yokosawa.
  3. There are a few new sections in the “Locking” chapter, the “Putting It All Together” chapter, the “Advanced Synchronization: Memory Ordering” chapter, and the “Important Questions” appendix.
  4. There is now more validation information associated with much of the sample code throughout the book.
  5. The “RCU Usage” section has received a much-needed upgrade.
  6. There is now an index and a list of acronyms, courtesy of Akira Yokosawa.
  7. A new install_latex_package.sh script, courtesy of SeongJae Park.
  8. Greatly improved error handling, including scripts that check for common LaTeX formatting mistakes, courtesy of Akira Yokosawa.


Yang Lu and Zhouyi Zhou are translating the Second Edition to Chinese, and would greatly appreciate additional help.

Everyone mentioned above contributed a great many wordsmithing fixes, as did Chin En Lin, Elad Lahav, Zhouyi Zhou, and GitHub user “rootbeer”. A grateful “thank you” to everyone who contributed!
RCU

Stupid RCU Tricks: Removing CONFIG_RCU_FAST_NO_HZ

The CONFIG_RCU_FAST_NO_HZ Kconfig option was added many years ago to improve energy efficiency for systems having significant numbers of short bursts of idle time. Prior to the addition of CONFIG_RCU_FAST_NO_HZ, RCU would insist on keeping a given idle CPU's scheduling-clock tick enabled until all of that CPU's RCU callbacks had been invoked. On certain types of battery-powered embedded systems, these few additional scheduling-clock ticks would consume up to 40% of the battery lifetime. The people working on such systems were not amused, and were not shy about letting me know of their dissatisfaction with RCU's life choices. Please note that “letting me know” did not take the form of flaming me on LKML. Instead, they called me on the telephone and yelled at me.

Given that history, why on earth would I even be thinking about removing CONFIG_RCU_FAST_NO_HZ, let alone queuing a patch series intended for the v5.17 merge window???

The reason is that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU. With this combination, CONFIG_RCU_FAST_NO_HZ=y is doing nothing but placing a never-taken branch in the fastpath to and from idle. Such systems should therefore run slightly faster and with slightly better battery lifetime if their kernels were instead built with CONFIG_RCU_FAST_NO_HZ=n, which would get rid of that never-taken branch.

But given that battery-powered embedded folks badly wanted CONFIG_RCU_FAST_NO_HZ=y, and given that they are no longer getting any benefit from it, why on earth haven't they noticed?

They have not noticed because rcu_nocbs CPUs do not invoke their own RCU callbacks. This work is instead delegated to a set of per-CPU rcuoc kthreads, with a smaller set of rcuog kthreads managing those callbacks and requesting grace periods as needed. By default, these rcuoc and rcuog kthreads are not bound, which allows the scheduler (and for that matter, the systems administrator) to take both performance and energy efficiency into account and to run those kthreads wherever is appropriate at any given time. In contrast, non-rcu_nocbs CPUs will always run their own callbacks, even if that means powering up an inconveniently placed portion of the system at an inconvenient time. This includes CONFIG_RCU_FAST_NO_HZ=y kernels, whose only advantage is that they power up inconveniently placed portions of systems at inconvenient times only 25% as often as would a non-rcu_nocbs CPU in a CONFIG_RCU_FAST_NO_HZ=n kernel.

In short, the rcu_nocbs CPUs' practice of letting the scheduler decide where to run the callbacks is especially helpful on asymmetric systems (AKA big.LITTLE systems), as shown by data collected by Dietmar Eggeman and Robin Randhawa. This point is emphasized by the aforementioned fact that everyone I know of who builds their kernels with CONFIG_RCU_FAST_NO_HZ=y also boots those systems with each and every CPU designated as a rcu_nocbs CPU.

So if no one is getting any benefit from building their kernels with CONFIG_RCU_FAST_NO_HZ=y, why keep that added complexity in the Linux kernel? Why indeed, and hence the patch series intended for the v5.17 merge window.

So if you know of someone who is getting significant benefit from CONFIG_RCU_FAST_NO_HZ=y who could not get that benefit from booting with rcu_nocbs CPUs, this would be a most excellent time to let me know!