
Exit Libris

I have only so many bookshelves, and I have not yet bought into ereaders, so from time to time books must leave. Here is the current batch:



  • 50 Simple Ways to Save Your House, Bruce Johnson. Yes, I did get this book before YouTube was invented. Why do you ask?
  • Algorithms to Live By: The Computer Science of Human Decisions, Brian Christian and Tom Griffiths. An OK introduction to probability and related ideas, but I am keeping Taleb's Incerto series in preference over this one. Oh, and Feller's classic textbooks as well.
  • Animals of East Africa, Louis S. B. Leakey. A book from my childhood, but time to let it go.
  • Edge of a Continent: The Pacific Coast from Alaska to Baja, Don Greame Kelley. A book from my childhood, but time to let it go.
  • Freedom Manifesto: Why Free Markets are Moral and Big Government Isn't, Steve Forbes and Elizabeth Ames. If you read the title and say “But of course!”, you should avoid this book. But if you choke and sputter at the title, you should most definitely read this book. ;-)
  • Indian & Eskimo Artifacts of North America, Charles Miles. A book from my childhood, but time to let it go.
  • Just for Fun: The Story of an Accidental Revolutionary, Linus Torvalds and David Diamond. This is the hardback. The paperback takes less space, plus it is autographed. And I won the paperback in a coding contest at a long-ago linux.conf.au!
  • Leadership and Crisis, Bobby Jindal. Not everyone's cup of tea, but worth it for the anecdote about career day. You see, his son was bitterly disappointed that his dad was merely the governor of Louisiana instead of something really cool, like a policeman or fireman.
  • New Complete Do-It-Yourself Manual, Reader's Digest. Yes, I did get this book before YouTube was invented. Why do you ask?
  • Our Culture, What's Left of it, Theodore Dalrymple. His “Life at the Bottom” is a classic and worthwhile old-man rant, but this sequel suffers a bit by comparison.
  • Patent Failure: How Judges, Bureaucrats, and Lawyers Put Innovators at Risk, James Bessen & Michael J. Meurer.
  • Plato, Not Prozac! Applying Eternal Wisdom to Everyday Problems, Lou Marinoff. I have no problem with Plato being considered better than prozac, but exercise works even better for me. Not bad as an introduction to various schools of philosophy, but I prefer “Plato and a Platypus Walk into a Bar... Understanding Philosophy Through Jokes”. Not that I can rattle off any of the schools of philosophy, so maybe I am just a philistine.
  • Snoop: What Your Stuff Says About You, Sam Gosling.
  • The Art of the Old West, Paul A. Rossi & David C. Hunt. A book from my childhood, but time to let it go.
  • The Halo Effect: How Managers Let Themselves be Deceived, Phil Rosenzweig. A good book, though I never did figure out the focus on managers in the title. Seems to me to apply to everyone.
  • The Hunger Games, Suzanne Collins.
  • The Inscrutable Americans, Anurag Mathur. Adventures at an American University for a kid from India.
  • The Lexus and the Olive Tree: Understanding Globalization, Thomas L. Friedman. Not a bad read, but a bit dated. Also, it seemed to me that he had traveled the world, but not so much his own country.
  • The Liberty Amendments, Mark R. Levin. If you feel the urge to run out and amend the USA constitution, you might want to read a few books like this one first. You see, other people just might have rather different ideas than you as to which direction the amendments should go.
  • The Millionaire Mind, Thomas J. Stanley. OK, but suffers from the sequel effect: “The Millionaire Next Door” is much better.
  • The Will to Meaning, Viktor E. Frankl. Actually not at all bad, but suffers by comparison to “Man's Search for Meaning”.
  • Tomorrow's Table: Organic Farming, Genetics, and the Future of Food, Pamela C. Ronald & Raoul W. Adamchak. A good thing to read if you labor under the delusion that farming is trivial. But I grew up in a farming community, so...

It is a bit sad to abandon some old friends, but such is life with physical books!


2017 Year-End Advice

One of the occupational hazards of being an old man is the urge to provide unsolicited advice on any number of topics. This time, the topic is weight lifting.

Some years ago, I decided to start lifting weights. My body no longer tolerated running, so I had long since substituted various low-impact mechanical means of aerobic exercise. But there was growing evidence that higher muscle mass is a good thing as one ages, so I figured I should give it a try. This posting lists a couple of my mistakes, which could enable you to avoid them, which in turn could enable you to make brand-spanking new mistakes of your very own design!

The first mistake resulted in sporadic pains in my left palm and wrist, which appeared after many months of upper-body weight workouts. In my experience, at my age, any mention of this sort of thing to medical professionals will result in a tentative diagnosis of arthritis, with the only prescription being continued observation. This experience motivated me to do a bit of self-debugging beforehand, which led me to notice that the pain was only in my left wrist and only in the center of my left palm. This focused my attention on my two middle fingers, especially the one on which I have been wearing a wedding ring pretty much non-stop since late 1985. (Of course, those prone to making a certain impolite hand gesture might have reason to suspect their middle finger.)

So I tried removing my wedding ring. I was unable to do so, even after soaking my hand for some minutes in a bath of water, soap, and ice. This situation seemed like a very bad thing, regardless of what might be causing the pain. I therefore consulted my wife, who suggested a particular jewelry store. Shortly thereafter, I was sitting in a chair while a gentleman used a tiny but effective hand-cranked circular saw to cut through the ring and a couple pairs of pliers to open it up. The gentleman was surprised that it took more than ten turns of the saw to cut through the ring, in contrast to the usual three turns. Apparently wearing a ring for more than 30 years can cause it to work harden.

The next step was for me to go without a ring for a few weeks to allow my finger to decide what size it wanted to be, now that it had a choice. They gave me back the cut-open ring, which I carried in my pocket. Coincidence or not, during that time, the pains in my wrists and palms vanished. Later, the jewelry store resized the ring.

I now remove my ring every night. If you take up any sort of weight lifting involving use of your hands, I recommend that you also remove any rings you might wear, just to verify that you still can.

My second mistake was to embark upon a haphazard weight-lifting regime. I felt that this was OK because I wasn't training for anything other than advanced age, so that any imbalances should be fairly easily addressed.

My body had other ideas, especially in connection with the bout of allergy/asthma/sinusitis/bronchitis/whatever that I have (knock on wood) mostly recovered from. This condition of course results in coughing, in which the muscles surrounding your chest work together to push air out of your lungs as abruptly and quickly as humanly possible. (Interestingly enough, the maximum velocity of cough-driven air seems to be subject to great dispute, perhaps because it is highly variable and because there are so many different places you could measure it.)

The maximum-effort nature of a cough is just fine if your various chest muscles are reasonably evenly matched. Unfortunately, I had not concerned myself with the effects of my weight-lifting regime on my ability to cough, so I learned the hard way that the weaker muscles might object to this treatment, and make their objections known by going into spasms. Spasms involving one's back can be surprisingly difficult to pin down, but for me, otherwise nonsensical shooting pains involving the neck and head are often due to something in my back. I started some simple and gentle back exercises, and also indulged in Warner Brothers therapy, which involves sitting in an easy chair watching Warner Brothers cartoons, assisted by a heating pad lent by my wife.

In summary, if you are starting weight training, (1) take an organized approach and (2) remove any rings you are wearing at least once a week.

Other than that, have a very happy new year!!!
I believe that Charles T. Porter's “Engineering Reminiscences” was a gift from my grandfather, who was himself a machinist. Porter's most prominent contribution was the high-speed steam engine, that is to say, a steam engine operating at more than about 100 RPM. Although steam engines and their governors proved to be somewhat of a dead end, some of his dynamic balancing techniques are still in use.

Technology changes, people and organizations not so much. Chapter XVII starting on page 189 describes a demonstration of two of his new high-speed steam engines (one operating at 150 RPM, the other at 300 RPM) along with one of his colleague's new boilers at the 1870 Fair of the American Institute in New York. The boiler ran slanted water tubes through the firebox to more efficiently separate steam from the remaining water. The engines were small by 1870s standards, one having 16-inch diameter cylinders with a 30-inch stroke and the other having 6-inch diameter cylinders with a 12-inch stroke.

Other exhibitors also had boilers and steam engines, and yet other exhibitors had equipment driven by steam engines. All the boilers and steam engines were connected, but given that steam engines were, then as now, considered to be way cooler than mere boilers, it should not be too surprising that the boilers could not produce enough steam to keep all the engines running. In fact, by the end of the day, the steam pressure had dropped by half, resulting in great consternation and annoyance all around. The finger of suspicion quickly pointed at Porter's two high-speed steam engines—after all, great speed clearly must imply equally great consumption of steam, right?

Porter had anticipated this situation, and had therefore installed a shutoff valve that isolated the boiler and his two high-speed steam engines from the rest of the Fair's equipment. Porter therefore closed his valve, with the result that the steam pressure within his little steam network immediately rose to 70 PSI and the pressure to the rest of the network dropped to 25 PSI. In fact, the boiler generated excess steam even at 70 PSI, so that the fireman had to leave the firebox door slightly open to artificially lower the boiler temperature.

The steam pressure to the rest of the fair continued to decrease until it was but 15 PSI. Over the noon hour, an additional boiler was installed, which brought the pressure up to 70 PSI. Restarting the steam engines of course reduced the pressure, but at 5PM it was still 25 PSI.

The superintendent of the machinery department had repeatedly asked Porter to reopen the valve, but each time Porter had refused. At 5PM, the superintendent made it clear that his request was now a demand, and that if Porter would not open the valve, the superintendent would open it for him. Porter finally agreed to open the valve, but only on the condition that the other managers of the institute verify that the boiler was in fact generating more than enough steam for both engines. These managers were summoned forthwith, and they agreed that the boiler had been producing most of the show's steam and that the pair of high-speed steam engines had been consuming very little. Porter opened the valve, and there was no further trouble with low-pressure steam.

It is all too easy to imagine a roughly similar story unfolding in today's world. ;-)

Porter went on to develop steam engines capable of running well in excess of 1,000 RPM, with one key challenge being convincing onlookers that the motion-blurred engine really was running that fast.

Interestingly enough, steam engines were Porter's third career. He was a lawyer for several years, but became disgusted with legal practice. At about that same time, he became quite interested in the problem of facing stone, that is, producing a machine that would take a rough-cut stone and give it a smooth planar face (smooth by the standards of the mid-1800s, anyway). After a couple of years of experimentation, he produced a steam-powered machine that efficiently faced stone. Unfortunately, at about that same time, others realized that saws could even more efficiently face stone, so his invention was what we might now call a technical success and a business failure.

Oddly enough, we have recently learned that the application of saws to stone was not an invention of the mid-1800s, but rather a re-invention of a technique used heavily in the ancient Roman Empire, and suspected of having been used as early as the 13th century BC. This is one of many interesting nuggets on life in the Roman Empire brought out by the historical novel “Tears of Stone” by Vannoy and Zeigler. This novel is informed by Zeigler's application of Cold War remote-sensing technology to interesting areas of the Italian landscape, a fact that I had the privilege of learning directly from Zeigler himself.

On the other hand, perhaps Porter's ghost can console himself with the fact that the earliest stone saws were hand-powered, and those of the Roman Empire were water powered. Porter's stone-facing machine was instead powered by modern steam engines. Yes, the ancient Egyptians also made some use of steam power, but as far as we know they never applied it industrially, and never via a reciprocating engine driving a rotary shaft. And yes, all of the qualifiers in the preceding sentence are necessary.

As we learn more about ancient civilizations, it will be interesting to see what other “modern inventions” turn out to have deep roots in ancient times!


Book review: "Make Trouble"

This book, by John Waters of “Hairspray” fame, was an impulse purchase. After all, who could fail to notice a small pink book with large white textured letters saying “Make Trouble”? It is a transcription of Waters's commencement address to the Rhode Island School of Design's Class of 2015. Those who have known me over several decades might be surprised by this purchase, but what old man could resist a book whose flyleaf states “Anyone embarking on a creative path, he tells us, would do well to realize that pragmatism and discipline are as important as talent and that rejection is nothing to fear.”

They might be even more surprised that I agree with much of his advice. For but three examples:

  1. “A career in the arts is like a hitchhiking trip: All you need is one person to say ‘get in,’ and off you go.” Not really any different from my advising people to use the “high-school boy” algorithm when submitting papers and proposals.
  2. “Keep up with what's causing chaos in your field.” Not really any different from my “Go where there is trouble!”
  3. “Listen to your political enemies, particularly the smart ones”. Me, I would omit the word “political”, but close enough.

The book is mostly pictures, so if you are short of money, you do have the option of just reading it in the bookstore. See, I am making trouble already! ;-)


Parallel Programming: November 2017 Update

This USA Thanksgiving holiday weekend features a new release of Is Parallel Programming Hard, And, If So, What Can You Do About It?.

This update includes more formatting and build-system improvements, bibliography updates, and better handling of listings, all courtesy of Akira Yokosawa; numerous fixes and updates from Junchang Wang, Pierre Kuo, SeongJae Park, and Yubin Ruan; a new futures section on quantum computing; updates to the formal-verification section based on recent collaborations; and a full rewrite of the memory-barriers section, which is now its own chapter. This rewrite was of course based on recent work with my partners in memory-ordering crime, Jade Alglave, Luc Maranget, Andrea Parri, and Alan Stern.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.
I had the privilege of attending CPPCON last month. Michael Wong, Maged Michael, and I presented a parallel-programming overview, and my portion covered the "Hardware and its Habits" chapter of Is Parallel Programming Hard, And, If So, What Can You Do About It?.

But the highlight for me was actually sitting in the audience for a pair of talks by people who had implemented RCU in C++.

Ansel Sermersheim presented a two-part talk entitled Multithreading is the answer. What is the question?. The second part of this talk covered lockless containers, and used a variant of RCU to implement a low-overhead libGuarded facility in order to more easily avoid deadlocks. The implementation is similar to the Linux-kernel real-time RCU implementation by Jim Houston and Joe Korty in that the counterpart to rcu_read_unlock() actively registers a quiescent state. Ansel's implementation goes further by also driving callback invocation from rcu_read_unlock(). Now I don't recommend this for a general-purpose RCU implementation because of the possibility of deadlock: a callback might need a resource that is already held across the very rcu_read_unlock() that invokes it. However, this approach should work just fine in the case where the callbacks just free memory and the memory allocator does not contain too many RCU read-side critical sections.
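
To make that deadlock discussion more concrete, below is a deliberately simplified, mutex-based toy in C whose read-side unlock both detects the end of a grace period and invokes pending callbacks. This is my own sketch for illustration only: the my_rcu_*() and qsbr_* names are hypothetical, and the code bears no resemblance to the actual libGuarded or Linux-kernel implementations.

    #include <pthread.h>
    #include <stdlib.h>

    struct qsbr_cb {
        struct qsbr_cb *next;
        void (*func)(void *);
        void *arg;
    };

    static pthread_mutex_t qsbr_lock = PTHREAD_MUTEX_INITIALIZER;
    static int qsbr_nreaders;           /* Readers currently in critical sections. */
    static struct qsbr_cb *qsbr_cbs;    /* Callbacks awaiting a grace period. */

    void my_rcu_read_lock(void)
    {
        pthread_mutex_lock(&qsbr_lock);
        qsbr_nreaders++;
        pthread_mutex_unlock(&qsbr_lock);
    }

    void my_rcu_read_unlock(void)
    {
        struct qsbr_cb *list = NULL;

        pthread_mutex_lock(&qsbr_lock);
        if (--qsbr_nreaders == 0) {     /* Last reader out: every callback queued */
            list = qsbr_cbs;            /* before now has had its grace period.   */
            qsbr_cbs = NULL;
        }
        pthread_mutex_unlock(&qsbr_lock);

        while (list) {                  /* Callback invocation driven from unlock. */
            struct qsbr_cb *next = list->next;

            list->func(list->arg);      /* Deadlock hazard: the callback must not  */
            free(list);                 /* need anything held across this unlock.  */
            list = next;
        }
    }

    /* Queue a callback; it is invoked the next time the reader count hits zero. */
    void my_call_rcu(void (*func)(void *), void *arg)
    {
        struct qsbr_cb *cb = malloc(sizeof(*cb));

        if (!cb)
            abort();
        cb->func = func;
        cb->arg = arg;
        pthread_mutex_lock(&qsbr_lock);
        cb->next = qsbr_cbs;
        qsbr_cbs = cb;
        pthread_mutex_unlock(&qsbr_lock);
    }

The comment on the invocation loop marks exactly the hazard called out above: if a callback needs anything held across the corresponding my_rcu_read_unlock(), this design deadlocks, which is why it is best reserved for callbacks that merely free memory. (A production-quality implementation would also handle callbacks queued while no readers are running, among many other things.)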

Fedor Pikus presented a talk entitled Read, Copy, Update, then what? RCU for non-kernel programmers, in which he gave a quite-decent introduction to use of RCU. This introduction included an improved version of my long-standing where-to-use-RCU diagram, which I fully intend to incorporate. I had a number of but-you-could moments, including the usual "put the size in with the array" advice, ways of updating things already exposed to readers, and the fact that RCU really can tolerate multiple writers, along with some concerns about counter overflow. Nevertheless, an impressive amount of great information in a one-hour talk!
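
Speaking of the "put the size in with the array" advice, here is a minimal Linux-kernel-style sketch of what I have in mind. The foo_* names, the mutex, and the resizing policy are hypothetical illustrations of the pattern, not anything from Fedor's talk:

    #include <linux/mutex.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* The length lives in the same RCU-protected block as the elements. */
    struct foo_array {
        int len;                        /* Length travels with... */
        struct rcu_head rh;
        int data[];                     /* ...the data it describes. */
    };

    static struct foo_array __rcu *global_foo;
    static DEFINE_MUTEX(foo_mutex);     /* Serializes updaters. */

    /* Reader: the length and the elements come from the same snapshot. */
    static int foo_read(int idx)
    {
        struct foo_array *fap;
        int ret = -1;

        rcu_read_lock();
        fap = rcu_dereference(global_foo);
        if (fap && idx >= 0 && idx < fap->len)
            ret = fap->data[idx];
        rcu_read_unlock();
        return ret;
    }

    /* Updater: publish a new (length, data) pair atomically via one pointer. */
    static int foo_resize(int newlen)
    {
        struct foo_array *newfap, *oldfap;
        int i, n;

        newfap = kzalloc(sizeof(*newfap) + newlen * sizeof(newfap->data[0]),
                         GFP_KERNEL);
        if (!newfap)
            return -ENOMEM;
        newfap->len = newlen;

        mutex_lock(&foo_mutex);
        oldfap = rcu_dereference_protected(global_foo,
                                           lockdep_is_held(&foo_mutex));
        n = oldfap && oldfap->len < newlen ? oldfap->len : newlen;
        for (i = 0; oldfap && i < n; i++)
            newfap->data[i] = oldfap->data[i];
        rcu_assign_pointer(global_foo, newfap); /* Readers switch over here. */
        mutex_unlock(&foo_mutex);

        if (oldfap)
            kfree_rcu(oldfap, rh);      /* Free the old array after a grace period. */
        return 0;
    }

Because the length is reachable only through the same RCU-protected pointer as the elements, a reader can never pair a new length with an old array or vice versa.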

It is very good to see more people making use of RCU!
This is the fourth and final book in Nassim Taleb's Incerto series, which makes a case for antifragility as a key component of design, taking the art of design one step beyond robustness. An antifragile system is one where variation, chaos, stress, and errors improve the results. For example, within limits, stressing muscles and bones makes them stronger. In contrast, stressing a device made of (say) aluminum will eventually cause it to fail. Taleb gives a lengthy list of examples in Table 1 starting on page 23, some of which seem more plausible than others. One implausible-seeming entry lists rule-based systems as fragile, principles-based systems as robust, and virtue-based systems as antifragile. Although I can imagine a viewpoint where this makes sense, any expectation that a significantly large swath of present-day society will agree on a set of principles (never mind virtues!) seems insanely optimistic. The table nevertheless provides much good food for thought.

Taleb states that he has constructed antifragile financial strategies using insurance to control downside risks. But he also states on page 6 “Thou shalt not have antifragility at the expense of the fragility of others.” Perhaps Taleb figures that few will shed tears for any difficulties that insurance companies might get into, perhaps he is taking out policies that are too small to have material effect on the insurance company in question, or perhaps his policies are counter to the insurance company's main business, so that payouts to Taleb are anticorrelated with payouts to the company's other customers. One presumes that he has thought this through carefully, because a bankrupt insurance company might not be all that effective at controlling his downside risks.

Appendix I beginning on page 435 gives a graphical summary of the book's main messages. Figure 28 on page 441 is good grist for the mills of those who would like humanity to become an intergalactic species: After all, confining the human race seems likely to limit its upside. (One counterargument would posit that a finite object might have unbounded value, but such counterarguments typically rely on there being a very large number of human beings interested in that finite object, which some would consider to counter this counterargument.)

The right-hand portion of Figure 30 on page 442 illustrates what the author calls local antifragility and global fragility. To see this, imagine that the x-axis represents variation from nominal conditions, and the y-axis represents payoff, with large positive payoffs being highly desired. The right-hand portion shows something not unrelated to the function x^2-x^4, which gives higher payoffs as you move in either direction from x=0, peaking when x reaches one divided by the square root of two (either positive or negative), dropping back to zero when x reaches +1 or -1, and dropping like a rock as one ventures further away from x=0. The author states that this local antifragility and global fragility is the most dangerous of all, but given that he repeatedly stresses that antifragile systems are antifragile only up to a point, this dangerous situation would seem to be the common case. Those of us who believe that life is inherently dangerous should have no problem with this apparent contradiction.
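
For those who like their fragility quantified, here is the arithmetic behind that description, writing the payoff as f(x) = x^2 - x^4 (my notation for the shape the figure suggests, not a formula from the book):

    \[
    \begin{aligned}
      f(x)  &= x^2 - x^4, \\
      f'(x) &= 2x - 4x^3 = 2x\,(1 - 2x^2) = 0
              \;\Rightarrow\; x = 0 \text{ or } x = \pm\tfrac{1}{\sqrt{2}}, \\
      f\bigl(\pm\tfrac{1}{\sqrt{2}}\bigr) &= \tfrac{1}{2} - \tfrac{1}{4} = \tfrac{1}{4}
              \quad\text{(the local-antifragility peaks)}, \\
      f(\pm 1) &= 0, \qquad
      f(x) \to -\infty \text{ as } |x| \to \infty
              \quad\text{(the global fragility)}.
    \end{aligned}
    \]

In other words, moderate variation pays, but only until it strays past the ±1 mark, after which the payoff falls without limit.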

But what does all of this have to do with parallel programming???

Well, how about “Is RCU antifragile?”

One case for RCU antifragility is its batching optimizations, which allow many (as in thousands of) concurrent requests to share the same grace-period computation. Therefore, the heavier the update-side load on RCU, the more efficiently RCU operates.
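
To see why batching helps, consider the following mutex-and-condvar toy in C. It is my own sketch, nothing like RCU's actual implementation, and the gp_* names are hypothetical, but it shows the key property: however many updaters are waiting, each grace period's single counter increment serves all of them at once.

    #include <pthread.h>

    static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t gp_cond = PTHREAD_COND_INITIALIZER;
    static unsigned long gp_completed;      /* Grace periods completed so far. */

    /* Invoked by the (single) grace-period thread each time all pre-existing
     * readers are known to have finished. */
    void gp_record_completion(void)
    {
        pthread_mutex_lock(&gp_lock);
        gp_completed++;
        pthread_cond_broadcast(&gp_cond);   /* One increment wakes all waiters. */
        pthread_mutex_unlock(&gp_lock);
    }

    /* Invoked by updaters.  Waiting for two increments guarantees that a full
     * grace period began after this call: the first increment might belong to a
     * grace period that was already in flight, but the second cannot. */
    void wait_for_grace_period(void)
    {
        unsigned long target;

        pthread_mutex_lock(&gp_lock);
        target = gp_completed + 2;
        while (gp_completed < target)
            pthread_cond_wait(&gp_cond, &gp_lock);
        pthread_mutex_unlock(&gp_lock);
    }

The cost of those two increments is the same whether one updater is waiting or ten thousand are, so the per-update overhead shrinks as the update-side load grows.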

However, load is but one of many aspects of RCU's environment that might be varied. For an extreme example, RCU is exceedingly fragile with respect to small perturbations of the program counter, as Peter Sewell so ably demonstrated, by running emacs, no less. RCU is also fragile with respect to timekeeping anomalies; for example, it can emit false-positive RCU CPU stall warnings if different CPUs have tens-of-seconds disagreements as to the current time. However, the aforementioned bones and muscles are similarly fragile with respect to any number of chemical substances (AKA “poisons”), to say nothing of well-known natural phenomena such as lightning bolts and landslides.

Even when excluding hardware misbehavior such as auto-perturbing program counters and unsynchronized clocks, RCU would still be subject to software aging, and RCU has in fact required multiple interventions from its developers and maintainer in order to keep up with changing hardware, workloads, and usage. One could therefore argue that RCU is fragile with respect to perturbations of time, although the combination of RCU and its developers, reviewers, and maintainer seems to have kept up reasonably well thus far.

On the other hand, perhaps it is unrealistic to evaluate the antifragility of software without including black-hat hackers. Achieving antifragility in that sort of environment is still very much a grand challenge problem, but a challenge that must be faced. Oh, you think RCU is too low-level for this sort of attack? There was a time when I thought so. And then came rowhammer.

So please be careful, and, where possible, antifragile! It is after all a real world out there!!!
We have been making good progress on the next release of Is Parallel Programming Hard, And, If So, What Can You Do About It?, and hope to have a new release out soonish.

In the meantime, for those of you for whom the English text in this book has simply gotten in the way, there is now an alternative:

[Cover image of the Chinese translation of perfbook]

On the off-chance that any of you are seriously interested, this is available from Amazon China, JD.com, Taobao.com, and Dangdang.com. For the rest of you, you have at least seen the picture. ;-)
The last month or two has seen a lot of work simplifying the Linux-kernel RCU implementation, with more than 2700 net lines of code removed. The remainder of this post lists the user-visible changes, along with alternative ways to get the corresponding job done.

  1. The infamous CONFIG_RCU_KTHREAD_PRIO Kconfig parameter is now defunct, but the rcutree.kthread_prio kernel boot parameter gets the job done. (A sample boot line combining several of these replacement parameters appears after this list.)
  2. The CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter has kicked the bucket. There is no replacement because no one was using it. If you need it, revert the -rcu commit tagged by sysidle.2017.05.11a.
  3. The CONFIG_PROVE_RCU_REPEATEDLY Kconfig parameter is no more. There is no replacement because as far as I know, no one has used it for many years. It was a great help in tracking down lockdep-RCU warnings back in the day, but these warnings are now sufficiently rare that finding them one boot at a time is no longer a problem. If you need it, do the obvious hacking on Kconfig and lockdep.c.
  4. The CONFIG_SPARSE_RCU_POINTER Kconfig parameter now rests in peace. There is no replacement because there doesn't seem to be any reason for RCU's sparse checking to be the only such checking that is optional. If you really need to disable RCU's sparse checking, hand-edit the definition as needed.
  5. The CONFIG_CLASSIC_SRCU Kconfig parameter bought the farm. This was only present to handle massive failures of the new Tree/Tiny SRCU implementations, but these appear to be quite reliable and should be used instead of Classic SRCU.
  6. RCU's debugfs tracing is done for. As far as I know, I was the only real user, and I haven't used it in years. If you need it, revert the -rcu commit tagged by debugfs.2017.05.15a.
  7. The CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO, and CONFIG_RCU_NOCB_CPU_ALL Kconfig parameters have departed. Use the rcu_nocbs kernel boot parameter instead, which can do quite a bit more than those Kconfig parameters ever could.
  8. Tiny RCU's event tracing and RCU CPU stall warnings are now pushing up daisies. The point of Tiny RCU is to be tiny and educational, and these added features were not helping reach either of these two goals. The replacement is to reproduce the problem with Tree RCU.
  9. These changes should matter only to people running rcutorture:

    1. The CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT and CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT_DELAY Kconfig parameters have been entombed: Use the rcutree.gp_preinit_delay kernel boot parameter instead.
    2. The CONFIG_RCU_TORTURE_TEST_SLOW_INIT and CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY Kconfig parameters have given up the ghost: Use the rcutree.gp_init_delay kernel boot parameter instead.
    3. The CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP and CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig parameters have passed on: Use the rcutree.gp_cleanup_delay kernel boot parameter instead.
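
For those wanting a concrete starting point, here is a sample kernel boot-parameter fragment pulling several of the above replacements together. The parameter names are the ones called out in this list, but the specific values are made-up examples rather than recommendations:

    rcutree.kthread_prio=1 rcu_nocbs=2-7 rcutree.gp_preinit_delay=3 rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3

The first two stand in for the departed CONFIG_RCU_KTHREAD_PRIO and CONFIG_RCU_NOCB_CPU_* Kconfig parameters, and the three delay parameters replace the rcutorture-only CONFIG_RCU_TORTURE_TEST_SLOW_* family.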
There will probably be a few more simplifications in the near future, but this should be at least enough for one merge window!
With the Linux-kernel v4.13 merge window coming up, it is time to do at least a little heavy-duty testing of the patches destined for v4.14, which had been but lightly tested on my laptop. An overnight run on a larger test machine looked very good—with the exception of scenario TREE01 (defined by tools/testing/selftests/rcutorture/configs/rcu/TREE01{.boot,} in the Linux-kernel source tree), which got no fewer than 190 failures in a half-hour run. In other words, rcutorture saw 190 too-short grace periods in 30 minutes, for about one every ten seconds.

This is not just bad. This is RCU completely and utterly failing to be RCU.

My first action was to re-run the tests on the commits slated for v4.13. You can imagine my relief to see them pass on all scenarios, including TREE01.

Then it was time for bisection. I have been burned many times by false bisections due to RCU's probabilistic failure modes, so I ran 24 30-minute tests on each commit. Fortunately, I could run six in parallel, so that each commit consumed only about two hours of wall-clock time. The bisection converged on a commit that adds a --kconfig argument to the rcutorture scripts, which allows me to do things like force lockdep to run in all scenarios. However, this commit should have absolutely no effect on the inner workings of RCU.

OK, perhaps this commit managed to fatally mess up the .config file. But no, the .config files from this commit compare equal to those from the preceding commit. Some additional poking gives me confidence that the kernels being built are also identical. Still, the one fails and the other does not.

The next step is to look very carefully at the console output from the failing runs, most of which contain many complaints about RCU grace periods being too short. Except that one of them also contains RCU CPU stall warnings. In fact, one of the stall warnings lists no fewer than 26 CPUs as stalling the current RCU grace period.

This came as a bit of a surprise, partly because I don't recall ever seeing that many CPUs stalling a single grace period, but mostly because the test was only supposed to use eight CPUs.

A look at the beginning of the console output showed that RCU was inexplicably prepared to deal with 43 CPUs instead of the expected eight. A bit more digging showed that the qemu command used to run the failing test had “-smp 43”, while the qemu command for the successful test instead had “-smp 8”. In both cases, the qemu command also included the kernel boot parameter “maxcpus=8”. And a very stupid bug in the --kconfig change to the scripts turned out to be responsible for the bogus -smp argument.

The next step is to swap the values of qemu's -smp argument. And the failure follows the “-smp 43” setting. This means that it is possible that the RCU failures are due to a latent timing bug in RCU. After all, the test system has only 64 CPUs, and I was running 43*6=258 CPUs worth of tests on it. But running six concurrent rcutorture tests with both -smp and maxcpus set to 43 passes with flying colors. So RCU must be suffering from some other problem.

The next question is exactly what is supposed to happen when qemu and the kernel have very different ideas of how many CPUs there are. The ever-helpful Documentation/admin-guide/kernel-parameters.txt file states that maxcpus= limits not the overall number of CPUs, but rather the number that are brought up at boot time. Another look at the console output confirms that in the failing case, eight CPUs are brought up at boot time. However, the other 35 come online some time after boot, sometimes taking a few minutes to come up. Which explains another anomaly I noticed while bisecting, namely that about half the tests ran 30 minutes without failure, but the ones that failed did so within the first five minutes of the run. Apparently the RCU failures are connected somehow to the late arrival of the extra 35 CPUs.

Except that RCU configured itself for the full 43 CPUs, and RCU is supposed to be able to handle CPUs coming and going. In fact, RCU has repeatedly demonstrated its ability to handle CPUs coming and going for more than a decade. So it is time to enable event tracing on a failure scenario (thank you, Steve!). One of the traces shows that there is no RCU callback connected with the first failure, which points the finger of suspicion at RCU expedited grace periods.

A quick inspection of the expedited code shows missing synchronization for the case where a CPU makes its very first appearance just as an expedited grace period starts. Oh, the leaf rcu_node structure's ->lock is held both when updating the number of CPUs that have ever been seen (which is the rcu_state structure's ->ncpus field) and when updating the bitmasks indicating exactly which CPUs have ever been seen (which is the leaf rcu_node structure's ->expmaskinitnext field), but it drops that lock between those two updates.

This means that the expedited grace period might sample the ->ncpus field, notice the change, and therefore check all the ->expmaskinitnext fields—but before those fields had been updated. Not a problem for this grace period, since the new CPUs haven't yet started and thus cannot yet be running any RCU read-side critical sections, which means that there is no reason whatsoever for this grace period to pay any attention to them. However, the next expedited grace period would again sample the ->ncpus field, see no change, and thus not bother checking the ->expmaskinitnext fields. Thus, this grace period would also ignore the new CPUs, which by this time could be very much alive and running RCU read-side critical sections. Hence the too-short grace periods, and hence them showing up within the first few minutes of the run, during the time that the extra 35 CPUs are in the process of coming online.

The fix is easy: Just move the update of ->ncpus to the same critical section as the update of ->expmaskinitnext. With this fix, rcutorture passes the TREE01 scenario even with bogus -smp arguments to qemu. There is therefore once again a bug in rcutorture: There are still bugs in RCU somewhere, and rcutorture is failing to find them!
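
For the morbidly curious, here is a simplified before-and-after sketch of the CPU-online path in question. This is illustration rather than the actual kernel code: rsp stands in for the rcu_state structure, rnp for the incoming CPU's leaf rcu_node structure, and mask for that CPU's bit in ->expmaskinitnext.

    /* Buggy ordering: both updates are made under rnp->lock, but the lock is
     * dropped in between. */
    raw_spin_lock_irqsave(&rnp->lock, flags);
    rsp->ncpus++;                           /* The new CPU is now counted... */
    raw_spin_unlock_irqrestore(&rnp->lock, flags);
    /* ...and an expedited grace period can sample ->ncpus and scan
     * ->expmaskinitnext right here, before the new bit appears below.
     * The next expedited grace period sees ->ncpus unchanged and never
     * rescans, so the new CPU's readers are ignored from then on. */
    raw_spin_lock_irqsave(&rnp->lock, flags);
    rnp->expmaskinitnext |= mask;           /* ...but its bit shows up too late. */
    raw_spin_unlock_irqrestore(&rnp->lock, flags);

    /* Fixed ordering: both updates share one critical section, so any expedited
     * grace period that sees the new ->ncpus value also sees the new bit. */
    raw_spin_lock_irqsave(&rnp->lock, flags);
    rnp->expmaskinitnext |= mask;
    rsp->ncpus++;
    raw_spin_unlock_irqrestore(&rnp->lock, flags);

With both stores under a single acquisition of rnp->lock, an expedited grace period that observes the new ->ncpus value when it takes that same lock is guaranteed to also observe the new CPU's bit.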

Strangely enough, I might never have noticed the bug in expedited grace periods had I not made a stupid mistake in the scripting. Sometimes it takes a bug to locate a bug!