
In a recent post, I called out some of the past year's activity in the area of transactional memory (TM). This post takes a closer look at hardware transactional memory (HTM), with an eye to where it might fit into the parallel programmer's toolbox.

But first, what is HTM?

Current HTM implementations make use of processor caches and speculative execution to ensure that a designated group of statements (called a “transaction”) executes atomically from the viewpoint of any other transaction. The beginning and end of a transaction are marked by begin-transaction and commit-transaction machine instructions, respectively, and a transaction might also include abort-transaction instructions. The transaction is executed within the confines of its CPU, so that its work is not visible to other CPUs until the commit-transaction instruction is executed. An abort-transaction instruction squashes the speculative execution, discarding any changes that the transaction might have carried out, and branches to a failure handler. The location of the failure handler is normally specified by the begin-transaction instruction, either as an explicit address or by means of condition codes. This failure handler might then (1) retry the transaction, (2) execute a fallback code sequence, for example using locking, or (3) otherwise handle the transaction's failure. In this way, each transaction either executes atomically with respect to all other transactions, or is aborted with no changes to shared state.

So where does HTM fit into the parallel programmer's toolbox? Ideally, we would like a classification of the applicability of HTM similar to that shown for RCU in the diagram below:

[Diagram: RCU's area of applicability. Best for read-mostly data where inconsistency is tolerated.]

On the other hand, I have been developing, using, and maintaining RCU in production for almost two decades. Because the oldest commercially available HTM implementation is still quite young by comparison, any attempt to similarly classify its use must necessarily rely on educated guesses, extrapolation, and speculation. But I am most certainly not going to let that stop me from making a first attempt! :-)

The remainder of this posting will look at a number of questions that need to be asked of current commercially available HTM implementations, but without focusing on any particular implementation. The answers will be primarily about HTM in the publicly documented here-and-now, though some additional speculation about possible future implementations will inevitably leak in. This additional speculation will draw on selected academic research projects.

Does HTM Belong In The Parallel Programmer's Toolbox?

The first question we need to ask is whether HTM has any place at all in the parallel programmer's toolbox. After all, as the new entrant, it must prove its worth compared to any number of concurrency-control mechanisms that have been in production use for decades. In addition, there has been no shortage of technologies that have roused great excitement but eventually failed of their promise, a historical tendency that is greatly illuminated by Ioannidis's Sixth Corollary.

Ioannidis notwithstanding, it seems quite likely that the answer to this first question is in the affirmative. To quote Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory, “An important TM near-term opportunity is thus update-heavy workloads using large non-partitionable data structures such as high-diameter unstructured graphs.” (Disclaimer: I co-authored this paper, which is a revision of the 2007 PLOS paper by the same name with revised presentation here.) Although there are other mechanisms that can be used to handle large update-heavy non-partitionable data structures, these other methods suffer from cache-locality problems. There is thus hope that HTM can provide performance advantages in this area. In addition, use of HTM for transactional lock elision promises to greatly reduce the number of cache misses associated with lock acquisition and release, which is likely to provide large performance advantages to HTM for short lock-based critical sections—at least in cases where none of the oft-written variables protected by the elided lock share a cache line with that lock.

Should HTM Be Used Universally?

Given that HTM very likely has a place in the parallel programmer's toolbox, the logical next question to ask is whether parallel programmers can simplify their lives by just using HTM for everything.

The answer to this question is an emphatic “no” for the following reasons:

  1. HTM is unable to handle non-idempotent operations, such as unbuffered I/O (especially things like remote procedure calls) and system calls including memory mapping and exec(). This limitation appears to be inherent to HTM.
  2. HTM limits maximum transaction size. So why do current commercially available HTM implementations limit transaction size and what can be done about it?
  3. HTM currently lacks forward-progress guarantees. So why do current commercially available HTM implementations lack forward-progress guarantees?
  4. Even with forward-progress guarantees, HTM is subject to aborts and rollbacks, which (aside from wasting energy) are failure paths. Failure code paths are in my experience difficult to work with. The possibility of failure is not handled particularly well by human brain cells, which are programmed for optimism. Failure code paths also pose difficulties for validation, particularly in cases where the probability of failure is low or in cases where multiple failures are required to reach a given code path.
  5. HTM is subject to thorny validation issues, including the inability to usefully set breakpoints in or single-step through transactions. In addition, although transactional lock elision might greatly reduce the probability of deadlock, the fact that the fallback code might be executed means that fallback-code deadlock cannot be ignored. Finally, in many cases the fallback code will be invoked quite infrequently, which might allow fallback-code bugs (including performance problems involving high levels of lock contention) to escape detection during validation, thus inflicting themselves on users. Use of HTM will therefore likely require painstaking attention at validation time, and also tools to detect deadlock and exercise fallback code.
  6. There are subtle semantic differences between locking and transactional lock elision. One example is interactions with outside accesses as shown by Blundell et al. Another example is the empty lock-based critical section, which waits for all prior critical sections for locking, but is a no-op when all concurrent lock-based critical sections have been elided in favor of HTM execution. This use of locking as a message-passing mechanism can also be used in non-empty critical sections. (More on this here.)

We can see that HTM is useful on the one hand, but has substantial limitations on the other. Therefore, the next section looks at where HTM is most likely to be helpful.

Where is HTM Most Likely to be Helpful?

HTM is likely to be at its best for large in-memory data structures that are difficult to statically partition but that are dynamically partitionable; in other words, structures for which the conflict probability is reasonably low. There must be a reasonable non-TM fallback algorithm for every transaction. The workload should ideally be update-heavy with small accesses and updates, and not subject to aggressive real-time constraints. Finally, if HTM is used for transactional lock elision, any empty critical sections must continue to use explicit locking.

Why is this situation best for HTM?

A thought experiment involving a red-black tree might be helpful.

First, consider a red-black tree that supports insertions, deletions, and lookups of single elements, but where all of these APIs return the exact number of elements in the tree. In this case, the resulting transactions will be very prone to conflicts on the variable in which the element count is maintained, resulting in poor performance and scalability, especially for update-heavy workloads.

This should not be surprising, because returning the size causes insertion and deletion to be strongly non-commutative. Maintaining and returning a consistent count is simply bad for parallelism.

Therefore, let's next consider a red-black tree with insertion, deletion, and lookup, but without the exact count of the number of elements in the tree. In this case, if the tree is large, the conflict probabilities between random insertion, deletion, and lookup operations are extremely low, and HTM is likely to perform and scale quite well.

Finally, let's add an API that enumerates all the nodes in the red-black tree. A transaction implementing this API will conflict with all concurrent insertion or deletion operations, severely limiting performance and scalability. In many cases, this situation can be alleviated by using hazard pointers or RCU to protect these large read-only accesses. Although the RCU/hazard-pointer reader can still conflict with updates, the conflict window is now limited to each single read-side load, instead of extending across the entire read-side operation, as is the case for transactions. (RCU and hazard pointers could help even more if their loads didn't generate conflicts, but instead returned the old value of any variable written by an in-flight transaction; however, current HTM implementations do not appear to do this.)

This last example illustrates the importance of using combinations of synchronization mechanisms in ways that play to each mechanism's strengths.


In summary, although it appears that HTM will be able to earn a place in the parallel programmer's toolbox, it is not a magic wand with which to wish away all concurrency problems. Like every other tool in the toolbox, it will be necessary to carefully consider whether HTM is the right tool for a particular job. Furthermore, like every other tool in the toolbox, it may in some cases be necessary to tune, restructure, or even redesign your code in order to obtain the full benefits of HTM.

HTM should therefore be an interesting and exciting learning experience for all concerned. ;-)

(Thanks to a great number of people, especially Jon Walpole, Josh Triplett, and Andi Kleen for a number of illuminating discussions on TM in general and HTM in particular.)


Mar. 12th, 2012 12:29 pm (UTC)
This looks horribly familiar
After thinking about the article for some two days or so I start to get the feeling that a "transaction" as seen here is very close to the concept of "monad" from functional programming, as used in Haskell specifically, where any sequences of operations in a sequential programming sense are encapsulated in a Monad to hide side effects from the purely "functional world" inside a functional program.

What transactional memory achieves is encapsulation of the side effect of interacting with other processes, which can be likened to I/O really, as passing a token between these processes would be equivalent to taking a lock. So we could as well call it "monadic memory" - what we want to hide is the interaction with something outside of our smooth program flow. Two communicating Haskell processes could conveniently encapsulate their IPC as a monadic operation implemented by a transaction through a TM.

That is hardly a coincidence: when some research projects in the '70s and '80s tried to deal with massive parallelism they often came up with functional programming and distributed bottom-up reduction of the parse tree as the way to go about things. Such was the case with e.g. the Japanese fifth-generation computer. Those would use Prolog to achieve a functional approach to the problem. (I'm told the Connection Machine did something similar.)

One *could* argue that by actively avoiding redefining our problems in domain-specific languages (functional, data flow etc) we have come to the point where features of these languages come creeping in from the bottom layer of the architecture. On the other hand one could also argue that this way only the features that we really need from these languages have crystallized from the hardware side of things.

Anyway, thanks for an extremely thought-provoking article that really made me think about things!

Linus Walleij
Mar. 12th, 2012 03:07 pm (UTC)
Re: This looks horribly familiar
Interesting points! I recall discussions on parallelism with functional-programming advocates in the early 1980s. They were not all that worried about scalability, just incremental increases in performance. They also were happy to ignore the cost of communications.

I hope that the current transactional-memory crowd is taking a more realistic view of the matter. ;-)
Mar. 13th, 2012 02:17 am (UTC)
If that PLoS paper is TL;DR:

Though the PLoS paper has more rigor :)

Mar. 13th, 2012 03:13 am (UTC)
Re: If that PLoS paper is TL;DR:
OMG!!! I had better get rid of all the green jelly beans!!! ;-)

That said, I do expect HTM to find some significant uses, as an optimization to locking if nothing else.
Mar. 17th, 2012 12:00 am (UTC)
Hardware support for locks
HTM is interesting, but what about hardware support for locks? As the main performance problem with locking seems to be cache misses during contention, it occurs to me to wonder what would happen if a multicore chip devoted some of its gates to a reasonable number (32K?) of shared 1 bit registers, along with instructions for read, write and TAS, so that locks could be implemented without the overhead of memory latency. The OS would need to allocate a register to each lock at initialisation - or maintain a traditional, memory based lock if it runs out of registers, but then many structures could be locked at coarser granularity, so there should actually be significantly fewer locks.

Has anyone done this? If not, why not? The circuitry would be essentially identical to that which maintains the cache lines.
Mar. 17th, 2012 12:22 am (UTC)
Re: Hardware support for locks
Something like this was tried in the 1990s, but the software plus caches out-performed the hardware "acceleration". It still takes time to propagate the information about who holds what lock across the system, and even though CPU core clock rates are limited at roughly 2003 levels, electrons just don't move fast enough.

The thing is that the key performance benefit of transactional lock elision is not a more-efficient implementation of locks, but instead omitting the locks altogether. Because the lock state need not be communicated (if things work out nicely), it is not necessary to wait for the electrons to do the communication. On the other hand, all the transactional-memory limitations apply, and even in the absence of these limitations, from what I can see, transactional lock elision results in some subtle changes in semantics. But more on that later...
Mar. 20th, 2012 11:33 am (UTC)
I think it's worth pointing out that HTM can also be helpful in other scenarios, because it should give decent performance to other programming abstractions for synchronization.

For example, if the red-black tree is read-mostly, then even though RCU and hazard-pointers might give similar or somewhat better performance, it's likely still easier to wrap existing sequential code in a transaction than to restructure the code or the data to use a less transparent abstraction (e.g., RCU). (Of course, this only holds for programming-language transactions, not for a raw use of HTM.)

Therefore, one should consider all the aspects when thinking about the concurrent programming toolbox. (Paul, I know you are aware of this, but it can't hurt to make sure everyone is.)
Mar. 22nd, 2012 05:36 pm (UTC)
I agree, HTM can be used in a number of places, as indicated by the section title "Where is HTM Most Likely to be Helpful?" The contents of that section are thus the areas where I believe HTM has the most promise, not the only places that HTM could be used.

On your point about RCU and hazard pointers vs. TM, it is important to note that this is not necessarily a strict either-or decision. For an example using both RCU and STM on a red-black tree, please refer to http://www.usenix.org/event/hotpar11/tech/final_files/Howard.pdf. This paper gives a clear example of how using multiple tools in concert can give superior results.

But what is your definition of "transparent"? Any reasonable definition I have been able to come up with has RCU being either more or less transparent, depending on the workload.