TMEverywhere
The previous post described some potential pitfalls in blindly applying hardware lock elision to legacy software. This post looks instead at how changes in CPU cache geometry have increased the cache footprint available to hardware transactional memory (HTM) transactions, at least from a theoretical standpoint. This is yet another example of how hardware giveth and software taketh away, at least according to my colleagues working on hardware.

Cache associativity, for example, has increased over the past couple of decades: Sequent systems of 20 years ago had two-way set-associative caches, while the laptop I am typing this on has an eight-way set-associative cache. The following graph, which models a 4KB cache with 64-byte cache lines, shows why this increase is important to HTM:

[Figure: HTM success probability for a 4K cache]

To use this graph, select the transaction size in cache lines on the x-axis, then read the y-axis to find the overflow (failure) probability for the desired associativity. For example, a direct-mapped (AKA one-way set-associative) cache has about a 50% failure probability for a 10-cacheline transaction, while an eight-way set-associative cache has less than a one-in-a-million failure probability at this same size. Clearly, the increases in microprocessor cache associativity over the past 20 years have profoundly reduced the failure probability of modest-sized transactions.
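To make the model behind the graph a bit more concrete, this is essentially a balls-into-bins problem, assuming (as the later discussion of addresses randomly selected from a large memory suggests) that each of the transaction's cache lines maps independently and uniformly to one of the cache's sets. Under that assumption, a cache of C bytes with L-byte lines and associativity A has S sets, and the transaction fails when some set is asked to hold more than A lines:

\[
  S = \frac{C}{L \cdot A}, \qquad
  P_{\mathrm{fail}}(n) = \Pr\Bigl\{\max_{1 \le i \le S} X_i > A\Bigr\},
  \qquad (X_1, \ldots, X_S) \sim \mathrm{Multinomial}\bigl(n; \tfrac{1}{S}, \ldots, \tfrac{1}{S}\bigr).
\]

For the 4KB, 64-byte-line example, the direct-mapped cache has S = 64 single-line sets, while the eight-way cache has S = 8 sets of eight lines each, which is why the eight-way failure probability stays negligible for much larger transactions.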

But a 4KB cache with 64-byte cache lines cannot support a transaction of more than 64 cache lines, no matter how large the associativity. Fortunately, on-chip cache sizes have also increased, with my laptop's L0 cache being 32KB rather than 4KB. Unfortunately, the standard closed-form expression for the success probability is severely stressed by this increase in size. For example, the expression for 64 references into a 32KB cache having 64-byte cache lines and eight-way set associativity weighs in at more than 12 megabytes. Because this expression is a sum of a very large number of terms having very small values, arbitrary-precision arithmetic is a must, which makes the analytic approach quite slow for modest-sized transactions (though it is quite efficient for the smallest transactions, as well as for the largest transactions that have some chance of success).
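For the record, one way to write the exact success probability under the same uniform-mapping assumption is as a sum over all ways of distributing the n referenced lines among the S sets without any set exceeding the associativity A:

\[
  P_{\mathrm{succ}}(n) = \frac{1}{S^{\,n}}
  \sum_{\substack{n_1 + n_2 + \cdots + n_S = n \\ 0 \le n_i \le A}}
  \binom{n}{n_1,\, n_2,\, \ldots,\, n_S}.
\]

The number of admissible terms grows combinatorially in both n and S, and each term is minuscule compared to S^n, which is what drives the multi-megabyte expressions and the need for arbitrary-precision arithmetic mentioned above.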

Fortunately, there is another approach that works quite well for modest failure probabilities, for example, down to 0.0001. This approach is Monte Carlo simulation, in which we randomly generate a long series of sequences of references and estimate the failure probability from the results. For example, the following figure shows the same analytic data as the earlier figure, but overlays it with billion-trial Monte Carlo simulations. (A rough sketch of such a simulation appears below the figure.)

[Figure: HTM success probability for a 4K cache, plus Monte Carlo results]
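For the curious, here is a minimal sketch of such a simulation. The parameters, the trial count, and the use of rand() are illustrative choices of mine, not the actual harness behind the graphs:

#include <stdio.h>
#include <stdlib.h>

#define CACHE_BYTES 4096	/* Modeled cache capacity. */
#define LINE_BYTES  64		/* Cache-line size. */
#define ASSOC       8		/* Set associativity. */
#define NSETS       (CACHE_BYTES / (LINE_BYTES * ASSOC))

/* Return 1 if a transaction touching n randomly placed lines overflows a set. */
static int trial_overflows(int n)
{
	int count[NSETS] = { 0 };
	int i;

	for (i = 0; i < n; i++)
		if (++count[rand() % NSETS] > ASSOC)
			return 1;
	return 0;
}

int main(void)
{
	long trials = 1000 * 1000;
	long failures = 0;
	int n = 10;		/* Transaction size in cache lines. */
	long i;

	for (i = 0; i < trials; i++)
		failures += trial_overflows(n);
	printf("n=%d sets=%d assoc=%d overflow probability ~%g\n",
	       n, NSETS, ASSOC, (double)failures / trials);
	return 0;
}

Re-running with CACHE_BYTES set to 32768 (and many more trials, given how rare overflows then become) is one way to approximate the 32KB results discussed below.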

The Monte Carlo trials agree quite well with the analytic results, especially for non-infinitesimal failure probabilities, which gives us some confidence in the Monte Carlo results for a 32KB cache:

[Figure: HTM success probability for a 32K cache]

As you can see, moving from a 4KB to a 32KB cache significantly decreases the failure probability for a given HTM transaction. But the x86 family normally increases L0 cache size in tandem with associativity because this allows the L0 cache lookup to be carried out concurrently with the TLB lookup. Thus, an x86 4KB cache will be direct mapped, an 8KB cache will be two-way set associative, and so on (the arithmetic behind this lock-step scaling is sketched below the next figure):

[Figure: HTM success probability for x86 cache geometries]
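One way to see why the L0 size and associativity scale in lock step: for the cache index to be computed from the virtual address while the TLB translates the upper bits, the set-index and line-offset bits must fit within the page offset. A quick sketch of the arithmetic, assuming 4KB pages:

\[
  S \cdot L \le P
  \quad\Longrightarrow\quad
  C = A \cdot S \cdot L \le A \cdot P,
\]

where S is the number of sets, L the line size, P = 4096 the page size, A the associativity, and C the cache capacity. Hence a virtually indexed 4KB cache must be direct mapped, an 8KB cache two-way set associative, and a 32KB cache eight-way set associative.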

The newer x86 systems are more friendly to HTM than are the older systems, and might be even more friendly if transactions are allowed to overflow from the L0 cache into the L1 or L2 caches.

But not all hardware advances have been helpful. The increase in cache-line size from 32 bytes to 64 bytes reduces the number of cache lines in a cache of a given size, which, if the addresses are randomly selected from a large memory, increases the overflow probability as shown in the following figure:

[Figure: HTM success probability for different cache-line sizes]
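A quick back-of-the-envelope illustration for the 4KB, eight-way case (the specific numbers are mine, not taken from the figure): doubling the line size halves the number of sets, so a transaction of a given number of lines is crowded into fewer sets and is more likely to push one of them past its associativity.

\[
  S = \frac{C}{L \cdot A}:
  \qquad L = 32 \Rightarrow S = \frac{4096}{32 \cdot 8} = 16,
  \qquad L = 64 \Rightarrow S = \frac{4096}{64 \cdot 8} = 8.
\]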

In short, HTM failure probabilities have decreased with increasing hardware capabilities, which has interestingly enough made analysis more difficult. But personally, I have always preferred improved performance over easy analysis.
