Daniel Vetter put together a pair of intriguing blog posts entitled Locking Engineering Principles
and Locking Engineering Hierarchy
. These appear to be an attempt to establish a set of GPU-wide or perhaps even driver-tree-wide concurrency coding conventions.
Which would normally be none of my business. After all, to establish such conventions, Daniel needs to negotiate with the driver subsystem's developers and maintainers, and I am neither. Except that he did call me out
on Twitter on this topic. So here I am, as promised
, offering color commentary and the occasional suggestion for improvement, both of Daniel's proposal and of the kernel itself. The following sections review his two posts, and then summarize and amplify suggestions for improvement.
“Make it Dumb” should be at least somewhat uncontroversial, as this is quite similar to a rather more flamboyant sound bite from my 1970s Mechanical Engineering university coursework. Kernighan's Law, asserting that debugging is twice as hard as coding, should also be a caution.
This leads us to “Make it Correct”, which many might argue should come first in the list. After all, if correctness is not a higher priority than dumbness (or simplicity, if you prefer), then the do-nothing null solution would always be preferred. And to his credit, Daniel does explicitly state that “simple doesn't necessarily mean correct”.
It is hard to argue with Daniel's choosing to give lockdep
pride of place, especially given how many times it has saved me over the years, and not just from deadlock. I never have felt the need to teach lockdep RCU's locking rules, but then again, RCU is much more monolithic than is the GPU subsystem. Perhaps if RCU's locking were to become more ornate, I would also feel the need to acquire RCU's key locks in the intended order at boot time. Give or take the fact that much of RCU can be used very early, even before the call to rcu_init().
The validation point is also a good one, with Daniel calling out might_lock()
and might_alloc()
. I go further and add lockdep_assert_irqs_disabled()
and lockdep_assert_held()
, but perhaps these additional functions are needed in low-level code like RCU more than they are in GPU drivers.
Daniel's admonishments to avoid reinventing various synchronization wheels of course comprise excellent common-case advice. And if you believe that your code is the uncommon case, you are most likely mistaken. Furthermore, it is hard to argue with “pick the simplest lock that works”.
It is hard to argue with the overall message of “Make it Fast
”, especially the bit about the pointlessness of crashing faster. However, a minimum level of speed is absolutely required. For a 1977 example, an old school project required keeping up with the display. If the code failed to keep up with the display, that code was useless. More recent examples include deep sub-second request-response latency requirements in many Internet datacenters. In other words, up to a point, performance is not simply a “Make it Fast” issue, but also a “Make it Correct” issue. On the other hand, once the code meets its performance requirements, any further performance improvements instead falls into the lower-priority “Make it Fast” bucket.
Except that things are not always quite so simple. Not all performance improvements can be optimized into existence late in the game. In fact, if your software's performance requirements are expected to tax your system's resources, it will be necessary to make performance a first-class consideration all the way up to and including the software-architecture level, as discussed here
. And, to his credit, Daniel hints at this when he notes that “the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts”. He also helpfully calls out io_uring
as one avenue for radical updates.
The “Protect Data, not Code
” is good general advice and goes back to Jack Inman's 1985 USENIX paper entitled “Implementing Loosely Coupled Functions on Tightly Coupled Engines”, if not further. However, there are times where what needs to be locked is not data, but perhaps a particular set of state transitions, or maybe a rarely accessed state, which might be best represented by the code for that state. That said, there can be no denying the big-kernel-lock hazards of excessive reliance on code locking. Furthermore, I have only rarely needed abstract non-data locking.
Nevertheless, a body of code as large as the full set of Linux-kernel device drivers should be expected to have a large number of exceptions to the “Protect Data” rule of thumb, reliable though that rule has proven in my own experience. The usual way of handling this is a waiver process. After all, given that every set of safety-critical coding guidelines I have seen has a waiver process, it only seems reasonable that device-driver concurrency coding conventions should also involve a waiver process. But that decision belongs to the device-driver developers and maintainers, not with me.
Most of the Level 0: No Locking
section should be uncontroversial: Do things the easy way, at least when the easy way works. This approach goes back decades, for but one example, single-threaded user-interface code delegating concurrency to a database management system. And it should be well-understood that complicated things can be made easy (or at least easier) through use of carefully constructed and time-tested APIs, for example, the queue_work()
family of APIs called out in this section. One important question for all such APIs is this: “Can someone with access to only the kernel source tree and a browser quickly find and learn how to use the given API?” After all, people are able to properly use the APIs they see elsewhere if
they can quickly and easily learn about them.
And yes, given Daniel's comments regarding struct dma_fence
later in this section, the answer for SLAB_TYPESAFE_BY_RCU
appears to be “no” (but see Documentation/RCU/rculist_nulls.rst
). The trick with SLAB_TYPESAFE_BY_RCU
is that readers must validate the allocation. A common way of doing this is to have a reference count that is set to the value one immediately after obtaining the object from kmem_cache_alloc()
and set to the value zero immediately before passing the object to kmem_cache_free()
. And yes, I have a patch queued to fix the misleading text in Documentation/RCU/whatisRCU.rst
, and I apologize to anyone who might have attempted to use locking without a reference count. But please also let the record show that there was no bug report.
A reader attempting to obtain a new lookup-based reference to a SLAB_TYPESAFE_BY_RCU
object must do something like this:
- Invoke atomic_inc_not_zero(), which will return false without doing the increment if the initial value was zero. Presumably dma_fence_get_rcu() handles this for struct dma_fence.
- If the value returned above was false, pretend that the lookup failed. Otherwise, continue with the following steps.
- Check the identity of the object. If this turns out to not be the desired object, release the reference (cleaning up if needed) and pretend that the lookup failed. Otherwise, continue. Presumably dma_fence_get_rcu_safe() handles this for struct dma_fence, in combination with its call to dma_fence_put().
- Use the object.
- Release the reference, cleaning up if needed.
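The steps above can be sketched in userspace C. This is a hedged model, not kernel code: the `struct obj`, `obj_get_not_zero()`, and `obj_lookup()` names are hypothetical, and C11 atomics stand in for the kernel's atomic_inc_not_zero(); the point is only to show the acquire-then-validate dance that type-stable memory requires.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a type-stable object: the memory may be recycled
 * for a new object of the same type at any time, so readers must
 * validate both the reference count and the identity. */
struct obj {
	atomic_int refcount;	/* zero means "freed, do not use" */
	int id;			/* identity to re-check after taking a ref */
};

/* Analogue of the kernel's atomic_inc_not_zero(): increment the
 * reference count unless it is zero, returning false on failure. */
static bool obj_get_not_zero(struct obj *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;	/* got a reference */
	}
	return false;			/* object was already freed */
}

static void obj_put(struct obj *p)
{
	atomic_fetch_sub(&p->refcount, 1);
}

/* Validated lookup: take a reference, then re-check identity.  If the
 * slot was recycled for a different object, drop the reference and
 * pretend the lookup failed. */
static struct obj *obj_lookup(struct obj *p, int wanted_id)
{
	if (!obj_get_not_zero(p))
		return NULL;		/* steps 1-2: ref failed, lookup fails */
	if (p->id != wanted_id) {
		obj_put(p);		/* step 3: wrong identity, back out */
		return NULL;
	}
	return p;			/* steps 4-5 are up to the caller */
}
```

In the kernel, the lookup itself would of course run under rcu_read_lock(), which is what guarantees the memory remains of the same type while the validation runs.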
This of course underscores Daniel's point (leaving aside the snark), which is that you should not use SLAB_TYPESAFE_BY_RCU
unless you really need to, and even then you should make sure that you know how to use it properly.
But just when do you need to use SLAB_TYPESAFE_BY_RCU
? One example is when you need RCU protection on such a hot fastpath that you cannot tolerate RCU-freed objects becoming cold in the CPU caches due to RCU's grace-period delays. Another example is where objects are being allocated and freed at such a high rate that if each and every kmem_cache_free()
was subject to grace-period delays, excessive memory would be tied up waiting for grace periods, perhaps even resulting in out-of-memory (OOM) situations.
In addition, carefully constructed semantics are required. Semantics based purely on ordering tend to result in excessive conceptual complexity and thus in confusion. More appropriate semantics tend to be conditional, for example: (1) Objects that remain in existence during the full extent of the lookup are guaranteed to be found, (2) Objects that do not exist at any time during the full extent of the lookup are guaranteed not to be found, and (3) Objects that exist for only a portion of the lookup might or might not be found.
Does struct dma_fence
provide such semantics? Is the code handling struct dma_fence
doing the right thing? For the moment, I will just say that: (1) The existence of dma_fence_get_rcu_safe()
gives me at least some hope, (2) Having two different slab allocators stored into different static variables having the same name is a bit code-reader-unfriendly, and (3) The use of SLAB_TYPESAFE_BY_RCU
for a structure that contains an rcu_head
structure is a bit unconventional, but perhaps these structures are sometimes obtained from kmalloc()
instead of from kmem_cache_alloc().
One thing this section missed (or perhaps intentionally omitted): If you do need memory barriers, you are almost always better off using smp_store_release()
than the old-style smp_wmb().
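To illustrate why, here is a hedged userspace analogue of the message-passing pattern, using C11 atomics in place of kernel primitives: the release store corresponds roughly to smp_store_release(), and the matching acquire load replaces a READ_ONCE()/smp_rmb() pair. All function names here are made up for the sketch.

```c
#include <stdatomic.h>
#include <pthread.h>

/* The release store orders the plain payload write before the flag
 * update, so no separate write barrier (smp_wmb() analogue) is
 * needed, and the pairing with the acquire load is self-documenting. */
static int payload;
static atomic_int ready;

static void *producer(void *arg)
{
	(void)arg;
	payload = 42;					/* plain write */
	atomic_store_explicit(&ready, 1, memory_order_release);
	return NULL;
}

/* Spin until the producer publishes, then read the payload.  The
 * acquire load guarantees we see payload == 42, never a stale value. */
static int consume(void)
{
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;				/* spin until published */
	return payload;
}

static int run_demo(void)
{
	pthread_t tid;
	int v;

	pthread_create(&tid, NULL, producer, NULL);
	v = consume();
	pthread_join(tid, NULL);
	return v;
}
```

The store-release/load-acquire pairing also tells the reader exactly which accesses are being ordered against which, something a bare smp_wmb() leaves implicit.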
The Level 1: Big Dumb Lock
section
certainly summarizes the pain of finding the right scope for your locks. Despite Daniel's suggesting that lockless tricks are almost always the wrong approach, such tricks are in common use. I would certainly agree that randomly hacking lockless tricks into your code will almost always result in disaster. In fact, this process will use your own intelligence against you: The smarter you think you are, the deeper a hole you will have dug for yourself before you realize that you are in trouble.
Therefore, if you think that you need a lockless trick, first actually measure the performance. Second, if there really is a performance issue, do the work to figure out which code and data is actually responsible, because your blind guesses will often be wrong. Third, read documentation, look at code, and talk to people to learn and fully understand how this problem has been solved in the past, then carefully apply those solutions. Fourth, if there is no solution, review your overall design, because an ounce of proper partitioning is worth some tons of clever lockless code. Fifth, if you must use a new solution, verify it beyond all reason. The code in kernel/rcu/rcutorture.c
will give you some idea of the level of effort required.
The Level 2: Fine-grained Locking
sections provide some excellent advice, guidance, and cautionary tales. Yes, lockdep is a great thing, but there are deadlocks that it does not detect, so some caution is still required.
The yellow-highlighted Locking Antipattern: Confusing Object Lifetime and Data Consistency
describes how holding locks across non-memory-style barrier functions, that is, things like flush_work()
as opposed to things like smp_mb()
, can cause problems, up to and including deadlock. In contrast, whatever other problems memory-barrier functions such as smp_mb()
cause, it does take some creativity to add them to a lock-based critical section so as to cause them to participate in a deadlock cycle. I would not consider flush_work()
to be a memory barrier in disguise, although it is quite true that a correct high-performance implementation of flush_work()
will require careful memory ordering.
I would have expected a rule such as “Don't hold a lock across flush_work()
that is acquired in any corresponding workqueue handler”, but perhaps Daniel is worried about future deadlocks as well as present-day deadlocks. After all, if you do hold a lock across a call to flush_work()
, perhaps there will soon be some compelling reason to acquire that lock in a workqueue handler, at which point it is game over due to deadlock.
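The safe shape of this pattern can be sketched in userspace with pthreads. This is a hedged analogue, with pthread_join() standing in for flush_work() and an ordinary thread standing in for the workqueue handler; the names are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int work_done;

/* Stand-in for a workqueue handler that acquires the same lock. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	work_done = 1;
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Safe pattern: release the lock BEFORE waiting for the worker.
 * Holding `lock` across the join (the analogue of calling
 * flush_work() with the lock held) would deadlock the moment the
 * worker tried to acquire that same lock. */
static int run_flush_demo(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);

	pthread_mutex_lock(&lock);
	/* ... update state shared with the worker ... */
	pthread_mutex_unlock(&lock);	/* drop the lock first */

	pthread_join(tid, NULL);	/* the flush_work() analogue */
	return work_done;
}
```

Move the unlock after the join and this sketch can deadlock, which is exactly the cycle lockdep's fictitious-lock annotations are designed to flag before it happens in production.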
This issue is of course by no means limited to workqueues. For but one example, it is quite possible to generate similar deadlocks by holding spinlocks across calls to del_timer_sync()
that are acquired within timer handlers.
And that is why both flush_work() and del_timer_sync()
tell lockdep what they are up to. For example, flush_work() invokes lock_map_acquire()
on a fictitious lock, and this same fictitious lock is acquired and released by process_one_work()
. Thus, if you acquire a lock in a workqueue handler that is held across a corresponding flush_work()
, lockdep will complain, as shown in the 2018 commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") by Johannes Berg. Of course, del_timer_sync()
uses this same trick, as shown in the 2009 commit 6f2b9b9a9d75 ("timer: implement lockdep deadlock detection"), also by Johannes Berg.
Nevertheless, deadlocks involving flush_work()
and workqueue handlers can be subtle. It is therefore worth investing some up-front effort to avoid them.
The orange-highlighted Level 2.5: Splitting Locks for Performance Reasons
discusses splitting locks. As the section says, there are complications. For one thing, lock acquisitions are anything but free, having overheads of hundreds or thousands of instructions at best, which means that adding additional levels of finer-grained locks can actually slow things down. Therefore, Daniel's point about prioritizing architectural restructuring over low-level synchronization changes is an extremely good one, and one that is all too often ignored.
As Daniel says, when moving to finer-grained locking, it is necessary to avoid increasing the common-case number of locks being acquired. As one might guess from the tone of this section, this is not necessarily easy. Daniel suggests reader-writer locking, which can work well in some cases, but suffers from performance and scalability limitations, especially in situations involving short read-side critical sections. The fact that readers and writers exclude each other can also result in latency/response-time issues. But again, reader-writer locking can be a good choice in some cases.
The last paragraph is an excellent cautionary tale. Take it from Daniel: Never let userspace dictate the order in which the kernel acquires locks. Otherwise, you, too, might find yourself using wait/wound mutexes. Worse yet, if you cannot set up a two-phase locking discipline in which all required locks are acquired before any work is done, you might find yourself writing deadlock-recovery code, or wounded-mutex recovery code, if you prefer. This recovery code can be surprisingly complex. In fact, one of the motivations for the early 1990s introduction of RCU into DYNIX/ptx was that doing so allowed deletion of many thousands of lines of such recovery code, along with all the yet-undiscovered bugs that code contained.
The red-highlighted Level 3: Lockless Tricks
section begins with the ominous sentence “Do not go here wanderer!”
And I agree. After all, if you are using the facilities discussed in this section, namely RCU, atomics, preempt_disable()
, or the various memory barriers, you had jolly well better not be wandering!!!
Instead, you need to understand what you are getting into and you need to have a map in the form of a principled design and a careful well-validated implementation. And new situations may require additional maps to be made, although there are quite a few well-used maps already in Documentation/rcu
and over here
. In addition, as Daniel says, algorithmic and architectural fixes can often provide much better results than can lockless tricks applied at low levels of abstraction. Past experience suggests that some of those algorithmic and architectural fixes will involve lockless tricks on fastpaths, but life is like that sometimes.
It is now time to take a look at Daniel's alleged antipatterns.
The “Locking Antipattern: Using RCU
” section does have some “interesting” statements:
- RCU is said to mix up lifetime and consistency concerns. It is not clear exactly what motivated this statement, but this view is often a symptom of a strictly temporal view of RCU. To use RCU effectively, one must instead take a combined spatio-temporal view, as described here and here.
- rcu_read_lock() is said both to provide a read-side critical section and to extend the lifetime of any RCU-protected object. This is true in common use cases because the whole purpose of an RCU read-side critical section is to ensure that any RCU-protected object that was in existence at any time during that critical section remains in existence up to the end of that critical section. It is not clear what distinction Daniel is attempting to draw here. Perhaps he likes the determinism provided by full mutual exclusion. If so, never forget that the laws of physics dictate that such determinism is often surprisingly expensive.
- The next paragraph is a surprising assertion that RCU readers' deadlock immunity is a bad thing! Never forget that although RCU can be used to replace reader-writer locking in a great many situations, RCU is not reader-writer locking. Which is a good thing from a performance and scalability viewpoint as well as from a deadlock-immunity viewpoint.
- In a properly designed system, locks and RCU are not “papering over” lifetime issues, but instead properly managing object lifetimes. And if your system is not properly designed, then any and all facilities, concurrent or not, are weapons-grade dangerous. Again, perhaps more maps are needed to help those who might otherwise wander into improper designs. Or perhaps existing maps need to be more consistently used.
- RCU is said to practically force you to deal with “zombie objects”. Twitter discussions with Daniel determined that such “zombie objects” can no longer be looked up, but are still in use. Which means that many use cases of good old reference counting also force you to deal with zombie objects: After all, removing an object from its search structure does not invalidate the references already held on that object. But it is quite possible to avoid zombie objects for both RCU and for reference counting through use of per-object locks, as is done in the Linux-kernel code that maps from a System-V semaphore ID to the corresponding in-kernel data structure, first described in this paper. In short, if you don't like zombies, there are simple RCU use cases that avoid them. You get RCU's speed and deadlock immunity within the search structure, but full ordering, consistency, and zombie-freedom within the searched-for object.
- The last bullet in this section seems to argue that people should upgrade from RCU read-side critical sections to locking or reference counting as quickly as possible. Of course, such locking or reference counting can add problematic atomic-operation and cache-miss overhead to those critical sections, so specific examples would be helpful. One non-GPU example is the System-V semaphore ID example mentioned above, where immediate lock acquisition is necessary to provide the necessary System-V semaphore semantics.
- On freely using RCU, again, proper design is required, and not just with RCU. One can only sympathize with a driver dying in synchronize_rcu(). Presumably Daniel means that one of the driver's tasks hung in synchronize_rcu(), in which case there should have been an RCU CPU stall warning message which would point out what CPU or task was stuck in an RCU read-side critical section. It is quite easy to believe that diagnostics could be improved, both within RCU and elsewhere, but if this was intended to be a bug report, it is woefully insufficient.
- It is good to see that Daniel found at least one RCU use case that he likes (or at least doesn't hate too intensely), namely xarray lookups combined with kref_get_unless_zero() and kfree_rcu(). Perhaps this is a start, especially if the code following that kref_get_unless_zero() invocation does additional atomic operations on the xarray object, which would hide at least some of the kref_get_unless_zero() overhead.
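The zombie-free per-object-lock pattern mentioned above can be sketched in userspace. This is a hedged model of the System-V semaphore approach, not the actual ipc code: a bare pointer stands in for the RCU-protected search structure, and the struct and function names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Each object carries its own lock and a "deleted" flag.  A lookup
 * acquires the per-object lock and re-checks the flag, so a caller
 * never operates on a zombie (an object already removed from the
 * search structure).  In the kernel, the lookup itself would run
 * under rcu_read_lock(). */
struct sem_obj {
	pthread_mutex_t lock;
	bool deleted;
	int value;
};

/* Returns the object with its lock held, or NULL if it is a zombie. */
static struct sem_obj *sem_lookup_and_lock(struct sem_obj *p)
{
	pthread_mutex_lock(&p->lock);
	if (p->deleted) {		/* removed while we were looking */
		pthread_mutex_unlock(&p->lock);
		return NULL;
	}
	return p;			/* caller unlocks when done */
}

/* Deletion marks the object under its own lock, so concurrent lookups
 * either complete before the deletion or observe the flag and fail. */
static void sem_delete(struct sem_obj *p)
{
	pthread_mutex_lock(&p->lock);
	p->deleted = true;
	pthread_mutex_unlock(&p->lock);
}
```

The search structure gets RCU's speed and deadlock immunity; the searched-for object gets full mutual exclusion and consistency, with no zombies in sight.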
The following table shows how intensively the RCU API is used by various v5.19 kernel subsystems:
Subsystem Uses LoC Uses/KLoC
--------- ---- ---------- ---------
ipc 91 9,822 9.26
virt 68 9,013 7.54
net 7457 1,221,681 6.10
security 599 107,622 5.57
kernel 1796 423,581 4.24
mm 324 170,176 1.90
init 8 4,236 1.89
block 108 65,291 1.65
lib 319 214,291 1.49
fs 1416 1,470,567 0.96
include 836 1,167,274 0.72
drivers 5596 20,861,746 0.27
arch 546 2,189,975 0.25
crypto 6 102,307 0.06
sound 21 1,378,546 0.02
--------- ---- ---------- ---------
Total 19191 29,396,128 0.65
As you can see, the drivers subsystem has the second-highest total number of RCU API uses, but it also has by far the largest number of lines of code. As a result, the RCU usage intensity in the drivers subsystem is quite low, at about 27 RCU uses per 100,000 lines of code. But this view is skewed, as can be seen by looking more deeply within the drivers subsystem, while leaving out (aside from the Total line) any subdirectories containing fewer than 100 instances of the RCU API:
Subsystem Uses LoC Uses/KLoC
--------- ---- ---------- ---------
drivers/target 293 62,532 4.69
drivers/block 355 97,265 3.65
drivers/md 334 147,881 2.26
drivers/infiniband 548 434,430 1.26
drivers/net 2607 4,381,955 0.59
drivers/staging 114 618,763 0.18
drivers/scsi 150 1,011,146 0.15
drivers/gpu 399 5,753,571 0.07
--------- ---- ---------- ---------
Total 5596 20,861,746 0.27
The drivers/net and drivers/infiniband
subtrees account for more than half of the RCU usage in the Linux kernel's drivers, and could be argued to be more about networking than about generic device drivers. And although drivers/gpu
comes in third in terms of RCU usage, it also comes in first in terms of lines of code, making it one of the least intense users of RCU. So one could argue that Daniel already has his wish, at least within the confines of drivers/gpu.
This data suggests that Daniel might usefully consult with the networking folks in order to gain valuable guidelines on the use of RCU and perhaps atomics and memory barriers as well. On the other hand, it is quite possible that such consultations actually caused some of Daniel's frustration. You see, a system implementing networking must track the state of the external network, and it can take many seconds or even minutes for changes in that external state to propagate to that system. Therefore, expensive synchronization within that system is less useful than one might think: No matter how many locks and mutexes that system acquires, it cannot prevent external networking hardware from being reconfigured or even from failing completely.
Moving to drivers/gpu
, in theory, if there are state changes initiated by the GPU hardware without full-system synchronization, networking RCU usage patterns should apply directly to GPU drivers. In contrast, there might be significant benefits from more tightly synchronizing state changes initiated by the system. Again, perhaps the aforementioned System-V semaphore ID example can help in such cases. But to be fair, given Daniel's preference for immediately acquiring a reference to RCU-protected data, perhaps drivers/gpu
code is already taking this approach.
The “Locking Antipattern: Atomics
” section is best taken point by point:
- For good or for ill, the ordering (or lack thereof) of Linux kernel atomics predates C++ atomics by about a decade. One big advantage of the Linux-kernel approach is that all accesses to atomic*_t variables are marked. This is a great improvement over C++, where a sequentially consistent load or store looks just like a normal access to a local variable, which can cause a surprising amount of confusion.
- Please please please do not “sprinkle” memory barriers over the code!!! Instead, actually design the required communication and ordering, and then use the best primitives for the job. For example, instead of “smp_mb__before_atomic(); atomic_inc(&myctr); smp_mb__after_atomic();”, maybe you should consider invoking atomic_inc_return(), discarding the return value if it is not needed.
- Indeed, some atomic functions operate on non-atomic*_t variables. But in many cases, you can use their atomic*_t counterparts. For example, instead of READ_ONCE(), atomic_read(). Instead of WRITE_ONCE(), atomic_set(). Instead of cmpxchg(), atomic_cmpxchg(). Instead of set_bit(), in many situations, atomic_or(). On the other hand, I will make no attempt to defend the naming of set_bit() and __set_bit().
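The "design, don't sprinkle" advice can be made concrete. Below is a hedged userspace sketch using C11 atomics: the kernel's fully ordered atomic_inc_return() corresponds roughly to a seq_cst fetch-add, and the function name is invented for the example.

```c
#include <stdatomic.h>

/* Instead of bracketing an unordered increment with separate barriers
 * (the "smp_mb__before_atomic(); atomic_inc(&myctr);
 * smp_mb__after_atomic();" sprinkle), use a single fully ordered
 * read-modify-write operation. */
static atomic_int myctr;

static int ctr_inc_return(void)
{
	/* One fully ordered RMW; no barrier sprinkling required.
	 * fetch_add returns the old value, so add one to mimic the
	 * kernel's atomic_inc_return() convention. */
	return atomic_fetch_add_explicit(&myctr, 1, memory_order_seq_cst) + 1;
}
```

One operation, one clearly stated ordering requirement, and nothing for a later maintainer to accidentally "optimize" out from between the barriers.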
To Daniel's discussion of “unnecessary trap doors”, I can only agree that reinventing read-write semaphores is a very bad thing.
I will also make no attempt to defend ill-thought-out hacks involving weak references or RCU. Sure, a quick fix to get your production system running is all well and good, but the real fix should be properly designed.
And Daniel makes an excellent argument when he says that if a counter can be protected by an already held lock, that counter should be implemented using normal C-language accesses to normal integral variables. For those situations where no such lock is at hand, there are a lot of atomic and per-CPU counting examples that can be followed, both in the Linux kernel and in Chapter 5 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?
”. Again, why unnecessarily re-invent the wheel?
However, the last paragraph, stating that atomic operations should only be used for locking and synchronization primitives in the core kernel is a bridge too far. After all, a later section allows for memory barriers to be used in libraries (at least driver-hacker-proof libraries), so it seems reasonable that atomic operations can also be used in libraries.
Some help is provided by the executable Linux-kernel memory model (LKMM) in tools/memory-model
along with the kernel concurrency sanitizer (KCSAN), which is documented in Documentation/dev-tools/kcsan.rst
, but there is no denying that these tools currently require significant expertise. Help notwithstanding, it almost always makes a lot of sense to hide complex operations, including complex operations involving concurrency, behind well-designed APIs.
And help is definitely needed, given that there are more than 10,000 invocations of atomic operations in the drivers tree, more than a thousand of which are in drivers/gpu.
There is not much to say about the “Locking Antipattern: preempt/local_irq/bh_disable() and Friends
” section. These primitives are not heavily used in the drivers tree. However, lockdep does have enough understanding of these primitives to diagnose misuse of irq-disabled and bh-disabled spinlocks. Which is a good thing, given that there are some thousands of uses of irq-disabled spinlocks in the drivers tree, along with a good thousand uses of bh-disabled spinlocks.
The “Locking Antipattern: Memory Barriers
” suggests that memory barriers should be packaged in a library or core kernel service, which is in the common case excellent advice. Again, the executable LKMM and KCSAN can help, but again these tools currently require some expertise. I was amused by Daniel's “I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly”, and I am glad that my articles and talks provide at least a little entertainment value, if nothing else. ;-)
Summing up my view of these two blog posts, Daniel recommends that most driver code avoid concurrency entirely. Failing that, he recommends sticking to certain locking and reference-counting use cases, albeit including a rather complex acquire-locks-in-any-order use case. For the most part, he recommends against atomics, RCU, and memory barriers, with a very few exceptions.
For me, reading these blog posts induced great nostalgia, taking me back to my early 1990s days at Sequent, when the guidelines were quite similar, give or take a large number of non-atomically manipulated per-CPU counters. But a few short years later, many Sequent engineers were using atomic operations, and yes, a few were even using RCU. Including one RCU use case in a device driver, though that use case could instead be served by the Linux kernel's synchronize_irq().
Still, the heavy use of RCU and (even more so) of atomics within the drivers tree, combined with Daniel's distaste for these primitives, suggests that some sort of change might be in order.
Can We Fix This?
Of course we can!!!
But will a given fix actually improve the situation? That
is the question.
Reading through this reminded me that I need to take another pass through the RCU documentation. I have queued a commit to fix the misleading wording for SLAB_TYPESAFE_BY_RCU
on the -rcu tree: 08f8f09b2a9e ("doc: SLAB_TYPESAFE_BY_RCU uses cannot rely on spinlocks")
. I also expect to improve the documentation of reference counting and its relation to SLAB_TYPESAFE_BY_RCU
, and will likely find a number of other things in need of improvement.
This is also as good a time as any to announce that I will be holding an RCU Office Hours birds-of-a-feather session at the 2022 Linux Plumbers Conference
, in case that is helpful.
However, the RCU documentation must of necessity remain fairly high level. And to that end, GPU-specific advice about the use of xarray, kref_get_unless_zero(), and kfree_rcu() really needs to live in Documentation/gpu
as opposed to Documentation/RCU
. This would allow that advice to be much more specific and thus much more helpful to the GPU developers and maintainers. Alternatively, perhaps improved GPU-related APIs are required in order to confine concurrency to functions designed for that purpose. This alternative approach has the benefit of allowing GPU device drivers to focus more on GPU-specific issues and less on concurrency. On the other hand, given that the GPU drivers comprise some millions of lines of code, this might be easier said than done.
It is all too easy to believe that it is possible to improve the documentation for a number of other facilities that Daniel called on the carpet. At the same time, it is important to remember that the intent of documentation is communication, and that the optimal mode of communication depends on the target audience. At its best, documentation builds a bridge from where the target audience currently is to where they need to go. Which means the broader the target audience, the more difficult it is to construct that bridge. Which in turn means that a given subsystem likely need usage advice and coding standards specific to that subsystem. One size does not fit all.
The SLAB_TYPESAFE_BY_RCU
facility was called on the carpet, and perhaps understandably so. Would it help if SLAB_TYPESAFE_BY_RCU
were to be changed so as to allow locks to be acquired on objects that might at any time be passed to kmem_cache_free()
and then reallocated via kmem_cache_alloc()
? In theory, this is easy: Just have the kmem_cache
in question zero pages allocated from the system before splitting them up into objects. Then a given object could have an “initialized” flag, and if that flag was cleared in an object just returned from kmem_cache_alloc()
, then and only then would that lock be initialized. This would allow a lock to be acquired (under rcu_read_lock()
, of course) on a freed object, and would allow a lock to be held on an object despite its being passed to kmem_cache_free()
and returned from kmem_cache_alloc()
in the meantime.
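The "initialized" flag idea can be sketched in userspace. This is purely a hedged thought experiment matching the proposal above, not an existing kernel API: the `struct ts_obj` and `ts_alloc()` names are invented, a zero-initialized struct models the freshly zeroed page, and a pthread mutex stands in for the kernel lock.

```c
#include <pthread.h>
#include <stdbool.h>

/* The backing page is zeroed once, so a cleared "initialized" flag
 * reliably identifies an object whose lock has never been set up.
 * After that, the lock survives any number of free/reallocate
 * cycles, which is what would let a reader acquire it even on a
 * recycled object. */
struct ts_obj {
	bool initialized;	/* false only in freshly zeroed memory */
	pthread_mutex_t lock;	/* persists across free/alloc cycles */
	int payload;
};

/* Model of allocating from a type-stable cache: recycle the slot
 * without touching the lock, initializing the lock only the first
 * time this slot is ever handed out. */
static struct ts_obj *ts_alloc(struct ts_obj *slot)
{
	if (!slot->initialized) {
		pthread_mutex_init(&slot->lock, NULL);
		slot->initialized = true;
	}
	slot->payload = 0;	/* per-allocation state is reset... */
	return slot;		/* ...but the lock is left alone */
}
```

The per-allocation state is reinitialized on every cycle, but the lock, once constructed, is deliberately never torn down, so a reader holding it across a free/alloc pair sees a lock that is always valid.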
In practice, some existing SLAB_TYPESAFE_BY_RCU
users might not be happy with the added overhead of page zeroing, so this might require an additional GFP_
flag to allow zeroing on a kmem_cache
basis. However, the first question is “Would this really help?”, and answering that question requires feedback from the developers and maintainers who are actually using SLAB_TYPESAFE_BY_RCU.
Some might argue that the device drivers should all be rewritten in Rust, and cynics might argue that Daniel wrote his pair of blog posts with exactly that thought in mind. I am happy to let those cynics make that argument, especially given that I have already held forth on Linux-kernel concurrency in Rust here
. However, a possible desire to rust Linux-kernel device drivers does not explain Daniel's distaste for what he calls “zombie objects” because Rust is in fact quite capable of maintaining references to objects that have been removed from their search structure.
Summary and Conclusions
As noted earlier, reading these blog posts induced great nostalgia, taking me back to my time at Sequent in the early 1990s. A lot has happened in the ensuing three decades, including habitual use of locking across the industry, and sometimes even correct use of locking.
But will generic developers ever be able to handle more esoteric techniques involving atomic operations and RCU?
I believe that the answer to this question is “yes”, as laid out in my 2012 paper Beyond Expert-Only Parallel Programming?
. As in the past, tooling (including carefully designed APIs), economic forces (including continued ubiquitous multi-core systems), and acculturation (assisted by a vast quantity of open-source software) have done the trick, and I see no reason why these trends will not continue.
But what happens in drivers/gpu
is up to the GPU developers and maintainers!
Atomic Operations and Memory Barriers, though more description than reference:
- Sometimes the docbook header is on the x86 arch_ function, for example, arch_atomic_inc() in arch/x86/include/asm/atomic.h rather than atomic_inc().
- The LKMM references below can also be helpful.
Kernel Concurrency Sanitizer (KCSAN):
- Finding race conditions with KCSAN
- Concurrency bugs should fear the big bad data-race detector (part 1)
- Concurrency bugs should fear the big bad data-race detector (part 2)
- Detecting missing memory barriers with KCSAN
Linux-Kernel Memory Model (LKMM):
- tools/memory-model, including its Documentation subdirectory.
- A formal kernel memory-ordering model (part 1)
- A formal kernel memory-ordering model (part 2)
- Who's afraid of a big bad optimizing compiler?
- Calibrating your fear of big bad optimizing compilers
- Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled extended edition)
Read-Copy Update (RCU):
- The RCU API, 2019 edition
- Unraveling RCU-Usage Mysteries (Fundamentals)
- Unraveling RCU-Usage Mysteries (Additional Use Cases)
- Sections 9.5 and 9.6 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
- Many other references, some of which are listed here.