Sunday, May 1, 2011

A Perspective on HTM Designs

I've talked with a number of faculty candidates this past season.  I have come to realize that people who work in non-architecture systems areas haven't really kept track of the take-home messages from recent research in hardware transactional memory (HTM).  Or, perhaps, the messages have been obscured by other signals (STM results or industry trends).  Here's my unbiased view on HTM design options.  Unbiased because I have merely dabbled in TM research and don't have a horse in this race :-).  Harris, Larus, and Rajwar's TM book should be an excellent read if you want a more detailed account.

For those new to TM or HTM, I'll first provide a quick summary.  With numerous cores available today, hardware is only used effectively if software is efficiently multi-threaded.  It is imperative that architects come up with innovations that make it easier for programmers to write multi-threaded code.  HTM is one such innovation.  Programmers simply identify regions of code (transactions) that should be executed atomically.  The underlying hardware looks for parallelism, but ensures that the end result is as if each transaction executed atomically and in isolation.  HTM can therefore provide high performance with low programming effort.  With a few exceptions (usually contrived), transactional code is also deadlock-free.  Transactional semantics are an improvement over almost any known multi-threaded programming model.  HTM makes the programming process a little easier, but it is by no means a panacea.
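
For concreteness, here's what this looks like from the software side.  The atomic() context manager below is a hypothetical stand-in that simply serializes on one global lock, so it gives the semantics the programmer relies on (atomicity, isolation) without the speculative parallelism the hardware actually provides.

```python
# A minimal sketch of the programmer-visible difference between lock-based
# and transactional code.  atomic() is a hypothetical stand-in: it serializes
# on one global lock, which provides transactional *semantics* but none of
# the hardware's optimistic parallelism.
import threading

_global_lock = threading.Lock()          # stand-in for the HTM machinery

class atomic:
    def __enter__(self):
        _global_lock.acquire()
    def __exit__(self, *exc):
        _global_lock.release()

accounts = {"a": 100, "b": 50}
lock_a, lock_b = threading.Lock(), threading.Lock()

def transfer_with_locks(amount):
    # Programmer must pick a lock order; getting it wrong risks deadlock.
    with lock_a, lock_b:
        accounts["a"] -= amount
        accounts["b"] += amount

def transfer_with_transaction(amount):
    # Programmer only declares the region atomic; the hardware is free to
    # run non-conflicting transactions in parallel.
    with atomic():
        accounts["a"] -= amount
        accounts["b"] += amount
```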

The Wisconsin LogTM paper introduced a frequently used taxonomy to describe HTM implementations.  They showed that one could use either lazy or eager conflict detection, and either lazy or eager version management.  When explaining HTM to students, I have always found it easier and more intuitive to start with an implementation that uses lazy versioning and lazy conflict detection, such as Stanford's TCC design.  In TCC, every transaction works on locally cached copies of data and writes are not immediately propagated to other caches.  At the end of its execution, a transaction must first acquire permission to commit; it then makes all its writes visible through the conventional cache coherence protocol.  If another in-progress transaction has touched any of these written values (a conflict), that transaction aborts and restarts.  Each transaction executes optimistically and conflicts are detected only when one of the transactions is trying to commit (lazy conflict detection).  Upon cache overflow, the new results computed by a transaction are stored in a temporary log in main memory; when the transaction commits, the overflowed results must be copied from the log into their permanent homes (lazy versioning).  Bloom filters can be used to track the contents of the log and efficiently detect conflicts.
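
To make the lazy/lazy flow concrete, here's a toy, single-threaded software model.  The class names and structure are my own illustration (not TCC's actual mechanism): writes are buffered locally, and conflicts are checked only when a transaction commits.

```python
# Toy model of lazy versioning + lazy conflict detection (TCC-style sketch).
memory = {"x": 0, "y": 0}
live_transactions = []                       # started but not yet committed

class Transaction:
    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_buffer = {}               # lazy versioning: new values stay here
        self.aborted = False
        live_transactions.append(self)

    def read(self, addr):
        self.read_set.add(addr)
        return self.write_buffer.get(addr, memory[addr])

    def write(self, addr, value):
        self.write_buffer[addr] = value      # not visible to anyone else yet

    def commit(self):
        # Lazy conflict detection: conflicts are only discovered at commit time.
        for other in live_transactions:
            if other is not self and not other.aborted:
                touched = other.read_set | set(other.write_buffer)
                if touched & set(self.write_buffer):
                    other.aborted = True     # the loser must restart
        memory.update(self.write_buffer)     # writes become globally visible
        live_transactions.remove(self)

# t2 read "x" before t1 committed a new value for it, so t2 must abort.
t1, t2 = Transaction("t1"), Transaction("t2")
t1.write("x", 42)
t2.read("x")
t1.commit()
print(memory["x"], t2.aborted)               # 42 True
```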

While the TCC design is easier to understand, I feel it has a few drawbacks (explained shortly).  In my view, an HTM design that uses eager versioning and eager conflict detection, such as Wisconsin's LogTM design, does a better job optimizing for the common case.  In LogTM, each transaction proceeds pessimistically, expecting a conflict at every step (eager conflict detection).  Writes are immediately made visible to the rest of the system.  The underlying cache coherence protocol detects conflicting accesses by in-progress transactions.  When a conflict is detected, one of the transactions is forced to wait for the other to finish.  Deadlock scenarios are easily detected and handled by using transaction timestamps.  Upon cache overflow, the old result is stored in a log and the new result is immediately placed in its future permanent home (eager versioning).  Some provisions are required so we can continue to detect conflicts on overflowed data (called sticky pointers; now you know why this is a little harder to explain to TM newbies :-).
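
Here's the corresponding toy model of the eager/eager approach (again my own illustrative code, not LogTM's implementation): writes update memory in place and the old value goes into an undo log, conflicts are caught at access time, and timestamps decide whether the requester stalls or the younger transaction aborts.

```python
# Toy model of eager versioning + eager conflict detection (LogTM-style sketch).
import itertools

memory = {"x": 0, "y": 0}
live = []
_clock = itertools.count()

class Stall(Exception):
    """Requester waits and retries after the conflicting transaction finishes."""

class Transaction:
    def __init__(self, name):
        self.name = name
        self.timestamp = next(_clock)        # smaller timestamp = older
        self.read_set, self.write_set = set(), set()
        self.undo_log = {}                   # old values, needed only on abort
        live.append(self)

    def _conflicting(self, addr, is_write):
        for other in live:
            if other is self:
                continue
            if addr in other.write_set or (is_write and addr in other.read_set):
                return other
        return None

    def _resolve(self, other):
        # Eager conflict detection: deal with the conflict right now.
        if self.timestamp < other.timestamp:
            other.abort()                    # older wins; younger restarts later
        else:
            raise Stall(f"{self.name} waits for {other.name}")

    def read(self, addr):
        other = self._conflicting(addr, is_write=False)
        if other:
            self._resolve(other)
        self.read_set.add(addr)
        return memory[addr]

    def write(self, addr, value):
        other = self._conflicting(addr, is_write=True)
        if other:
            self._resolve(other)
        self.undo_log.setdefault(addr, memory[addr])
        memory[addr] = value                 # eager versioning: write in place
        self.write_set.add(addr)

    def commit(self):
        live.remove(self)                    # commit is cheap: just clear state

    def abort(self):
        memory.update(self.undo_log)         # slow path: restore old values
        live.remove(self)

# t1 is older, so when it conflicts with t2's write it aborts t2 and proceeds.
t1, t2 = Transaction("t1"), Transaction("t2")
t2.write("x", 7)
t1.write("x", 42)
t1.commit()
print(memory["x"])                           # 42
```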

I believe that EE-HTM (eager conflict detection and eager versioning) is more efficient than LL-HTM because the former makes a number of design choices that favor the common case.  First, commits are slow in LL-HTM because they involve making all transactional writes visible to the rest of the system and copying log values to their permanent homes.  Commit is the frequent operation (every transaction must commit).  Second, EE-HTM does make aborts slower because an abort involves a copy from the log, but aborts are supposed to be uncommon.  Aborts are even more uncommon in EE-HTM because they are only initiated in a potential deadlock situation.  While LL-HTM charges ahead with transaction execution and causes an abort on every conflict, EE-HTM is more deliberate in its progress.  If there is a conflict, a transaction is made to stall and proceed only after the conflicting situation is resolved.  This leads to less wasted work and overall lower energy dissipation.
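
To put rough numbers on that argument, here's a back-of-the-envelope expected-overhead calculation.  Every number below is an illustrative assumption on my part (not a measurement); the point is simply that when commits vastly outnumber aborts, the scheme with cheap commits wins on average.

```python
# All numbers here are illustrative assumptions, not measurements.
abort_rate = 0.05                            # assume 1 in 20 transactions aborts

# Assumed cycle costs for the bookkeeping each scheme adds per transaction.
ll_commit, ll_abort = 200, 50                # LL: slow commit (flush log), cheap abort
ee_commit, ee_abort = 20, 300                # EE: cheap commit, slow abort (undo copy)

ll_overhead = (1 - abort_rate) * ll_commit + abort_rate * ll_abort
ee_overhead = (1 - abort_rate) * ee_commit + abort_rate * ee_abort
print(f"LL-HTM: ~{ll_overhead:.0f} cycles/txn, EE-HTM: ~{ee_overhead:.0f} cycles/txn")
# With these assumed numbers: LL-HTM ~192 cycles/txn vs. EE-HTM ~34 cycles/txn.
```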

HTM is easy to implement in hardware.  All it takes is: (i) a register checkpoint so we can roll back register state on a transaction abort, (ii) two bits per cache line to keep track of whether a transaction has read or written that line, (iii) some new logic in the cache coherence controller to undertake transactional operations, such as examining the above bits and issuing aborts or NACKs, and (iv) system and hardware support (in the form of hardware signatures) to handle overflows and commit permissions.  So why has industry been so reluctant to embrace HTM (with the exception of Sun's Rock and AMD's recent ASF extensions)?  My sense is that there is great concern about various corner cases and how they can be handled efficiently.  In my view, most known corner cases have workable solutions and it is pointless to try to optimize for infrequent scenarios.  For example, a number of ugly scenarios can be averted if the system has a single "golden token" that allows its owner to successfully commit its ongoing transaction.  This is a graceful way to handle cache overflows, I/O operations, nesting overflows, starvation, signature overflows, etc.  In essence, if our best efforts at transactional parallelism fail, simply revert to single-thread execution for a while.  The impact on performance should be minimal for most workloads, and this is a penalty we should be willing to pay to enjoy the programmability benefits of TM for other workloads.  I am less enamored of the approach that falls back on STM when such bad events happen; STM opens up a new can of worms.
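
Here's a minimal sketch of the golden-token policy; the token is modeled as a single lock and the code is purely an illustration of the fallback behavior, not of any shipping hardware.

```python
# Sketch of the "golden token" fallback: at most one transaction holds the
# token, and the holder always wins conflicts, so it is guaranteed to commit.
import threading

golden_token = threading.Lock()              # at most one owner at a time

class Transaction:
    def __init__(self, name):
        self.name = name
        self.holds_token = False

    def hit_hard_case(self, reason):
        # Cache overflow, I/O, nesting/signature overflow, starvation, ...:
        # grab the golden token and fall back to (almost) single-threaded mode.
        print(f"{self.name}: {reason}, acquiring golden token")
        golden_token.acquire()
        self.holds_token = True

    def can_be_aborted(self):
        # The token holder always wins conflicts, so it eventually commits.
        return not self.holds_token

    def commit(self):
        if self.holds_token:
            golden_token.release()
            self.holds_token = False
```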

HTM implementations are efficient enough that there are few known bottlenecks.  The classic TM pathologies paper from Wisconsin in ISCA 2007 discusses possible HTM performance bottlenecks.  Contributions by that paper and others have pretty much addressed most known pathologies.  One of my Master's students, Byong Wu Chong, spent a fair bit of time characterizing STAMP and other transactional benchmarks.  He found that once a few pathological cases are optimized, very little of the overall execution time can be attributed to transactional phenomena.  Commit is typically not a big bottleneck for LL-HTM and deadlock-driven aborts are uncommon in EE-HTM.

In summary, HTM is relatively easy to implement, especially if we take a correctness-centric view to corner cases.  There are few glaring performance deficiencies caused by HTM features.  Quantitatively, I expect there is little difference between optimized versions of EE-HTM and LL-HTM for most workloads.  I have a personal preference for the former because it reduces wasted work and optimizes the common case.  For newbies, the latter is an easier introduction to HTM.

5 comments:

  1. But what is the take-home message for people who don't care about implementation details? All I got was "HTM looks good but isn't making it into mainstream systems very quickly."

    As an aside, when a technology that looks good on the research side fails to move into the product side, there generally exists an opportunity to improve our understanding of the true costs and benefits of the new technology.

  2. If you're looking for a take-home message for software developers, the following two papers might be good reads: Texas, Karlsruhe. Both are empirical studies that try to quantify ease of use of transactions and make arguments in favor of transactions.

    I wouldn't say that transactions have failed to move into the product side. AMD's recent introduction of ASF extensions is expected to be a big step. Intel has been more reluctant to embrace TM. I don't know enough about their reasoning; if there is a public statement on this, I'd love to hear it too.

  3. Is Byong Wu Chong's work available somewhere?

  4. Thanks, I like the UT paper and had not seen the other one. It'll be interesting to see when this stuff finally gets supported by default by some operating system and programming language I care about.

  5. Anon: we're working on a tech report on Byong's results and will post it soon on our web-page.
