Sunday, August 28, 2011

Rebuttals and Reviews

In the review process for top conferences, conventional wisdom says that rebuttals don't usually change the mind of the reviewer... they only help the authors feel better.  The latter is certainly true: after writing a rebuttal, I always feel like I have a plan to sculpt the paper into great shape.

Here's some data from a recent PC meeting about whether rebuttals help.  For the MICRO 2011 review process, reviewers entered an initial score for overall merit and a final score after reading rebuttals and other reviews.  While many factors (for example, new information at the PC meeting) may influence a difference in the two scores, I'll assume that the rebuttal is the dominant factor.  I was able to confirm this for the 16 papers that I reviewed.  Of the roughly 80 reviews for those papers, 14 had different initial and final scores (there were an equal number of upward and downward corrections).  In 11 of the 14 cases, the change in score was prompted by the quality of the rebuttal.  In the other 3 cases, one of the reviewers convinced the others that the paper had merit.

For 25 randomly selected papers that were accepted (roughly 125 reviews), a total of 14 reviews had different initial and final scores.  In 10 of the 14 cases, the final score was higher than the initial score.

For 25 papers that were discussed but rejected, a total of 19 reviews had different initial and final scores.  In 14 of the 19 cases, the final score was lower than the initial score.

It appears that rebuttals do play a non-trivial role in a paper's outcome, or at least a larger role than I would have expected.

Monday, August 22, 2011

The Debate over Shared and Private Caches

Several papers have been written recently on the topic of last level cache (LLC) management.  In a multi-core processor, the LLC can be shared by many cores or each core can maintain a private LLC.  Many of these recent papers evaluate the trade-offs in this space and propose models that lie somewhere between the two extremes.  It's becoming evident (at least to me) that it is more effective to start with a shared LLC and make it appear private as required.  In this post, I'll explain why.  I see many papers on private LLCs being rejected because they fail to pay attention to some of these arguments.

Early papers on LLC management have pointed out that in a multi-core processor, a shared LLC enjoys a higher effective cache capacity and ease of design, while private LLCs can offer lower average access latency and better isolation across programs.  It is worth noting that both models can have similar physical layouts -- a shared LLC can be banked and each bank can be placed adjacent to one core (as in the figure below).  So the true distinction between the two models is the logical policy used to govern their contents and access.
The distributed shared LLC shown in the figure resembles Tilera's tiled architecture and offers non-uniform latencies to banks.  In such a design, there are many options for data striping.  The many ways of a set-associative cache could be scattered across banks.  This is referred to as the D-NUCA model and leads to complex protocols for search and migration.  This model seems to have the least promise because of its complexity and non-stellar performance.  A more compelling design is S-NUCA, where every address and every cache set resides in a unique bank.  This makes it easy to locate data.  Most research assumes that in an S-NUCA design, consecutive sets are placed in adjacent banks, what I'll dub as set-interleaved mapping.  This naturally distributes load across banks, but every LLC request is effectively serviced by a random bank and must travel half-way across the chip on average.  A third option is an S-NUCA design with what I'll call page-to-bank or page-interleaved mapping.  All the blocks in an OS physical page (a minimum sized page) are mapped to a single bank and consecutive pages are mapped to adjacent banks.  This is the most compelling design because it is easy to implement (no need for complex search) and paves the way for OS-managed data placement optimizations for locality.

Recent work has taken simple S-NUCA designs and added locality optimizations so that an average LLC request need not travel far on the chip.  My favorite solution to date is R-NUCA (ISCA'09), which has a simple policy to classify pages as private or shared, and ensures that private pages are always mapped to the local bank.  The work of Cho and Jin (MICRO'06) relies on first-touch page coloring to place a page near its first requestor; Awasthi et al. (HPCA'09) augment that solution with load balancing and page migration.  Victim Replication (ISCA'05) is able to do selective data replication in even a set-interleaved S-NUCA cache without complicating the coherence protocol.  In spite of these solutions, the problem of shared LLC data management is far from solved.  Multi-threaded workloads are tougher to handle than multi-programmed workloads; existing solutions do a poor job of handling pages shared by many cores; they are not effective when the first requestor of a page is not the dominant accessor (since page migration is expensive); task migration renders most locality optimizations invalid.

The key point here is this: while there is room for improvement, a shared LLC with some of the above optimizations can offer low access latencies, high effective capacity, low implementation complexity, and better isolation across programs.  Yes, these solutions sometimes involve the OS, but in very simple ways that have commercial precedents (for example, NUMA-aware page placement in the SGI Origin).

On the other hand, one could start with an organization where each core has its own private LLC.  Prior work (such as cooperative caching) has added innovations so that the private LLCs can cooperate to provide a higher effective cache capacity.  In cooperative caching, when a block is evicted out of a private cache, it is given a chance to reside in another core's private cache.  However, all private LLC organizations require complex mechanisms to locate data that may be resident in a remote bank.  Typically, this is done with an on-chip directory that tracks the contents of every cache.  This directory introduces non-trivial complexity.  Some commercial processors do offer private caches; data search is manageable in these processors because there are few caches and bus-based broadcast is possible.  Private LLCs are less attractive as the number of cores scales up.

Based on the above arguments, I feel that shared LLCs have greater promise.  For a private LLC organization to be compelling, the following arguments are necessary: (i) How is the complexity of data search overcome?  (ii) Can it out-perform a shared LLC with page-to-bank S-NUCA mapping and a simple locality optimization (either R-NUCA or first-touch page coloring)?