Utah Arch: May 2011

After several months of informal meetings and ad-hoc contact, we finally visited memory manufacturer Micron in Boise, ID earlier this month. Our large group of 7 students and 2 faculty members made for a fun road trip through stretches of torrential rain and awesome homemade pie at Connor’s Cafe in Burley (I would definitely recommend you stop by if you’re ever on I-84 in Idaho!). In a 3+ hour presentation and discussion session, we discussed all of the memory research happening here at our group in Utah, including both published work and current work-in-progress. We wanted feedback on what the memory industry’s constraints were, and the feasibility of implementing some of our ideas, and here are some of the things we heard, in no particular order:

1. Density, density, density: That density and cost/bit are everything in the memory industry is well known in memory research circles, but we really had no idea how much! We were told that on commodity parts, a 0.01% area penalty “might be considered”, 0.1% would require “substantial” performance gains, and anything over 0.5% was simply not happening, no matter what (these numbers are meant to be illustrative, of course, and not precise guidelines). For more specialized parts for niche markets, on the other hand, slightly larger area penalties could be considered as part of the overall value package. Take away: Focus on the high-end specialized segment, say the next big supercomputer, where cost is a smaller concern. The commodity segment has been squeezed to death, and you’re probably not going to get anywhere trying to change anything.

2. Be afraid of the memory chip: Very afraid! This is related to the previous point. The actual array layout has been super-ultra-optimized by people who dream about this stuff, and even the slightest change you propose is likely going to mess with this beyond repair. Stay away! Also, lots of smart people care very deeply about this narrow field, and have been working on this particular aspect for a long time. Anything you think of, they’ve probably considered before (but not published!). They would be “astounded” to hear something really novel.

3. Work at the system-level: There are infinite possibilities in terms of workload-based studies, altering data placement, data migration, row-buffer management, activity throttling, etc. These are less likely to provide dramatic improvements, but are low-effort, and are more likely to have an impact in terms of actually being implemented, since they are less invasive. It is unlikely that the industry has studied all the possibilities exhaustively, and you’re much more likely to come up with something novel.

4. Build it and they will come. NOT. If you really feel there is a case to be made to radically change DRAM architecture, approach the guys that will actually buy and use the parts - the system manufacturers. If they care enough, they can ask for it, and if they’re willing to pay for it, Micron will build it. It cannot be driven from the other end, no matter what, the margins are simply too small, and the whole system is setup to incentivize maximum capacity at least cost.

5. The Universal Memory Myth: Micron has been researching PCM for over a decade now, and are only now kinda sorta ready-ish to release a product into the market. Their excitement about the maturity of the technology, and it’s ability to summarily replace all memory in the system, is far far lower than that of the academic community. It is nowhere close to reaching the capacity and cost point of DRAM, which is, therefore, not likely to die any time soon.

I've talked with a number of faculty candidates this past season. I have come to realize that people that work in non-architecture systems areas haven't really kept track of the take-home messages from recent research in hardware transactional memory (HTM). Or, perhaps, the messages have been obfuscated by other signals (STM results or industry trends). Here's my unbiased view on HTM design options. Unbiased because I have merely dabbled in TM research and don't have a horse in this race :-). Harris, Larus, and Rajwar's TM book should be an excellent read if you want a more detailed account.

For those new to TM or HTM, I'll first provide a quick summary. With numerous cores available today, hardware is effective if software is efficiently multi-threaded. It is imperative that architects come up with innovations that make it easier for programmers to write multi-threaded code. HTM is one such innovation. Programmers simply identify regions of code (transactions) that should be executed atomically. The underlying hardware looks for parallelism, but ensures that the end result is as if each transaction executed atomically and in isolation. HTM can therefore provide high performance with low programming effort. With a few exceptions (usually contrived), transactional code is also deadlock-free. Transactional semantics are an improvement over almost any known multi-threaded programming model. HTM makes the programming process a little bit easier but is by no means a panacea.

The Wisconsin LogTM paper introduced a frequently used taxonomy to describe HTM implementations. They showed that one could use either lazy or eager conflict detection and either lazy or eager version management. When explaining HTM to students, I have always found it easier and more intuitive to start with an implementation that uses lazy versioning and lazy conflict detection, such as Stanford's TCC design. In TCC, every transaction works on locally cached copies of data and writes are not immediately propagated to other caches. At the end of its execution, a transaction must first acquire permissions to commit; it then makes all its writes visible through the conventional cache coherence protocol. If another in-progress transaction has touched any of these written values (a conflict), it aborts and re-starts. Each transaction executes optimistically and conflicts are detected when one of the transactions is trying to commit (lazy conflict detection). Upon cache overflow, the new results computed by a transaction are stored in a temporary log in main memory; when the transaction commits, the overflowed results must be copied from the log into their permanent homes (lazy versioning). Bloom filters can be used to track the contents of the log and efficiently detect conflicts.

While the TCC design is easier to understand, I feel it has a few drawbacks (explained shortly). In my view, an HTM design that uses eager versioning and eager conflict detection, such as Wisconsin's LogTM design, does a better job optimizing for the common case. In LogTM, each transaction proceeds pessimistically, expecting a conflict at every step (eager conflict detection). Writes are immediately made visible to the rest of the system. The underlying cache coherence protocol detects conflicting accesses by in-progress transactions. When a conflict is detected, one of the transactions is forced to wait for the other to finish. Deadlock scenarios are easily detected and handled by using transaction timestamps. Upon cache overflow, the old result is stored in a log and the new result is immediately placed in its future permanent home (eager versioning). Some provisions are required so we can continue to detect conflicts on overflowed data (called sticky pointers; now you know why this is a little harder to explain to TM newbies :-).

The reason that I believe that EE-HTM (Eager conflict detection and Eager versioning) is more efficient than LL-HTM is because the former makes a number of design choices that favor the common case. First, commits are slow in LL-HTM because they involve making all transactional writes visible to the rest of the system and copying log values to their permanent homes. Commit is the frequent operation (every transaction must commit). Second, EE-HTM does make an abort slower because it involves a copy from the log, but aborts are supposed to be uncommon. Aborts are even more uncommon in EE-HTM because they are only initiated in a potential deadlock situation. While LL-HTM charges ahead with transaction execution and causes an abort on every conflict, EE-HTM is more deliberate in its progress. If there is a conflict, a transaction is made to stall and proceed after the conflicting situation is resolved. This leads to less wasted work and overall lower energy dissipation.

HTM is easy to implement in hardware. All it takes is: (i) a register checkpoint so we can roll back register state on a transaction abort, (ii) two bits per cache line to keep track of whether a transaction has read or written that line, (iii) some new logic in the cache coherence controller to undertake transactional operations, such as examining the above bits and issuing aborts or NACKs, (iv) system and hardware support (in the form of hardware signatures) to handle overflows and commit permissions. So why has industry been so reluctant to embrace HTM (with the exception of Sun's Rock and AMD's recent ASF extensions)? My sense is that there is great concern about various corner-cases and how they can be handled efficiently. In my view, most known corner-cases have workable solutions and it is pointless to try and optimize infrequent scenarios. For example, a number of ugly scenarios can be averted if the system has a single "golden token" that allows its owner to successfully commit its on-going transaction. This is a graceful way to handle cache overflows, I/O operations, nesting overflows, starvation, signature overflows, etc. In essence, if our best efforts at transactional parallelism fail, simply revert to single-thread execution for a while. The impact on performance should be minimal for most workloads and this is a penalty we should be willing to pay to enjoy the programmability benefits of TM for other workloads. I am less enamored by the approach that falls back on STM when such bad events happen; STM opens up a new can of worms.

HTM implementations are efficient enough that there are few known bottlenecks. The classic TM pathologies paper from Wisconsin in ISCA 2007 talks about possible HTM performance bottlenecks. Contributions by that paper and others have pretty much addressed most known pathologies. One of my Masters students, Byong Wu Chong, spent a fair bit of time characterizing STAMP and other transactional benchmarks. He found that once a few pathological cases are optimized, very little of the overall execution time can be attributed to transactional phenomena. Commit is typically not a big bottleneck for LL-TM and deadlock-driven aborts are uncommon in EE-TM.

In summary, HTM is relatively easy to implement, especially if we take a correctness-centric view to corner cases. There are few glaring performance deficiencies caused by HTM features. Quantitatively, I expect there is little difference between optimized versions of EE-HTM and LL-HTM for most workloads. I have a personal preference for the former because it reduces wasted work and optimizes the common case. For newbies, the latter is an easier introduction to HTM.

Thursday, May 26, 2011

Memory Systems Research: An Industry Perspective

Sunday, May 1, 2011

A Perspective on HTM Designs