TMEverywhere
My earlier posting on hardware transactional memory (http://paulmck.livejournal.com/31853.html) brought an unusual amount of private email, especially on my discussion of the hazards of eliding empty critical sections (http://kernel.org/pub/linux/kernel/people/paulmck/Answers/TransactionalMemoryEverywhere/WhyEmptyLock.html). Several people made the interesting claim that any correct code involving empty lock-based critical sections that relies entirely on memory references is guaranteed to either (1) operate correctly under HTM or (2) abort, causing execution to fall back to the lock-based code, which again operates correctly. I invested more time than I care to admit unsuccessfully attempting to generate a counter-example, so I am becoming more inclined to believe them, despite the lack of a formal proof.

There is a straightforward rationale for their claim, namely that in strongly atomic HTM implementations, the results of a given transaction are not visible until after the transaction completes. So if you can see that the transaction started, it has already finished, at which point a no-op will successfully wait for it. So any sequence of code that tests variables to determine when a given critical section has started, and then uses an empty lock-based critical section to wait for it to complete will operate correctly even when the empty critical section becomes essentially a no-op under hardware transactional lock elision.

Of course, this would not be true of a weakly atomic HTM implementation, but as far as I know all the currently proposed production HTM implementations are strongly atomic.

But there are ways to communicate other than memory references: for example, you could instead rely on the passage of time. Please note that relying on the passage of time is very difficult to get right, especially on today's super-scalar hardware that relies so heavily on speculative execution. And if your code is running on top of a hypervisor, relying on timing is even more risky. And if I catch you trying to sneak a timing-based patch into the Linux kernel, I will almost certainly NAK it. However, older computer systems were much more amenable to the use of timing, so it would be no surprise if large legacy software made use of such techniques. The following is an example of what a timing-based algorithm might look like.

Each worker thread corresponds to a data feed out to some receiver. Each worker thread records the current time (e.g., from clock_gettime()) to a per-thread timestamp after executing each work unit. Because this is a per-thread timestamp, a normal write suffices. Because this is a real-time application, it is a fatal error for a given thread to fail to update its per-thread timestamp for more than (say) 100 microseconds. Results are placed into a shared-memory buffer, which is drained by a separate set of threads that transmit the results out over the network. Locks are used sparingly to access and update global state, and, again given that this is a real-time application, locks are granted in FIFO order within priority level.

Worker threads can finish, and when they do they must disentangle themselves from the rest of the application. Worker threads also indicate their status in a thread-local variable my_status that is initialized to -1. However, worker threads do not exit: They instead go to a thread pool to accommodate later processing requirements.

There is also a control thread that monitors the worker threads. In particular, it maintains a histogram of thread statuses for threads that have exited. This same thread is responsible for assigning new threads or reassigning old ones, and runs at a priority no higher than that of the worker threads.

Worker threads do something like the following:

	int my_status = -1;  /* Thread-local variable. */

	while (continue_working()) {
		enqueue_any_new_work(); /* Might include locks. */
		wp = dequeue_work();  /* Thread-local, no locking. */
		do_work(wp);  /* Might include locks. */
		my_timestamp = clock_gettime(...); /* Thread-local. */
	}

	acquire_lock(&departing_thread_lock);

	/*
	 * disentangle from application, might acquire other locks,
	 * can take much longer than MAX_LOOP_TIME, especially if
	 * many threads exit concurrently.
	 */

	my_status = get_return_status();

	release_lock(&departing_thread_lock);

	/* thread awaits repurposing. */


The control thread might do the following:

	for (;;) {
		for_each_thread(t) {
			ct = clock_gettime(...);
			d = ct - per_thread(my_timestamp, t);
			if (d >= MAX_LOOP_TIME) {
				/* thread departing. */
				acquire_lock(&departing_thread_lock);
				release_lock(&departing_thread_lock);
				i = per_thread(my_status, t);
				status_hist[i]++; /* Bad idea if TLE! */
			}
		}
		/* Repurpose threads as needed. */
	}


Here the passage of time is used to ensure that the empty lock-based critical section happens after the departing worker thread has acquired the departing_thread_lock. If the empty critical section is allowed to become a no-op, the control thread no longer waits: its per_thread(my_status, t) access can then run concurrently with (and even abort) the departing thread's transaction, so that it reads the initial value of -1. This cannot happen with locking.

Note that enclosing the access to my_status in a lock/transaction does not change things. The problem is not the extra-transactional access, but rather the structure of the application.

Again, please don't try this sort of thing on superscalar CPUs (which is almost all of them these days), and especially please don't try it in the Linux kernel. In fact, you should be very cautious about doing this sort of thing even on systems specially designed for hard real-time response. But the key point is that if you have a large application that is not well understood, you might want to be careful about applying HTM-based lock elision to it. This should not be a surprise: What you have is a real-time program, and there has not yet been much work on TM in real-time environments.

Speaking of real-time environments, how does priority boosting interact with HTM-based lock elision?

Acknowledgments: Many thanks to Josh Triplett, Mathieu Desnoyers, Jon Walpole, Phil Howard, and Stephan Diestelhorst for many illuminating discussions. Please note that my acknowledgment of these people's contributions to my understanding of transactional lock elision should in no way be interpreted to imply their agreement with anything I say here. ;-)