Monday, August 22, 2011

The Debate over Shared and Private Caches

Several papers have been written recently on the topic of last-level cache (LLC) management.  In a multi-core processor, the LLC can be shared by many cores, or each core can maintain a private LLC.  Many of these recent papers evaluate the trade-offs in this space and propose models that lie somewhere between the two extremes.  It's becoming evident (at least to me) that it is more effective to start with a shared LLC and make it appear private as required.  In this post, I'll explain why.  I see many papers on private LLCs being rejected because they fail to address some of these arguments.

Early papers on LLC management pointed out that in a multi-core processor, a shared LLC enjoys higher effective cache capacity and ease of design, while private LLCs can offer lower average access latency and better isolation across programs.  It is worth noting that both models can have similar physical layouts -- a shared LLC can be banked, and each bank can be placed adjacent to one core (as in the figure below).  So the true distinction between the two models is the logical policy that governs their contents and access.
The distributed shared LLC shown in the figure resembles Tilera's tiled architecture and offers non-uniform latencies to its banks.  In such a design, there are many options for data striping.  The many ways of a set-associative cache could be scattered across banks.  This is referred to as the D-NUCA model and leads to complex protocols for search and migration.  This model seems to have the least promise because of its complexity and unremarkable performance.  A more compelling design is S-NUCA, where every address, and hence every cache set, resides in a unique bank.  This makes it easy to locate data.  Most research assumes that in an S-NUCA design, consecutive sets are placed in adjacent banks, which I'll dub set-interleaved mapping.  This naturally distributes load across banks, but every LLC request is effectively serviced by a random bank and must travel halfway across the chip on average.  A third option is an S-NUCA design with what I'll call page-to-bank or page-interleaved mapping: all the blocks in an OS physical page (a minimum-sized page) map to a single bank, and consecutive pages map to adjacent banks.  This is the most compelling design because it is easy to implement (no complex search is needed) and it paves the way for OS-managed data placement optimizations for locality.
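To make the two S-NUCA mappings concrete, here is a minimal sketch of the address-to-bank computation for each.  The block size, page size, and bank count are illustrative assumptions, not parameters from any specific processor:

```python
# Illustrative parameters (assumptions, not from any particular chip).
BLOCK_BYTES = 64     # cache block size
PAGE_BYTES = 4096    # minimum OS page size
NUM_BANKS = 16       # one LLC bank per tile

def set_interleaved_bank(addr):
    # Consecutive sets land in adjacent banks: the bank index comes from
    # the low-order bits just above the block offset, so blocks of the
    # same page are scattered across many banks.
    return (addr // BLOCK_BYTES) % NUM_BANKS

def page_interleaved_bank(addr):
    # Page-to-bank mapping: every block of one physical page maps to a
    # single bank, and consecutive pages go to adjacent banks.
    return (addr // PAGE_BYTES) % NUM_BANKS
```

The difference matters for locality: under page-interleaved mapping the OS can steer an entire page to a chosen bank by picking its physical page number, whereas under set-interleaved mapping a page's blocks are spread across banks and no page-granularity placement is possible.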

Recent work has taken simple S-NUCA designs and added locality optimizations so that an average LLC request need not travel far on the chip.  My favorite solution to date is R-NUCA (ISCA'09), which uses a simple policy to classify pages as private or shared and ensures that private pages are always mapped to the local bank.  The work of Cho and Jin (MICRO'06) relies on first-touch page coloring to place a page near its first requestor; Awasthi et al. (HPCA'09) augment that solution with load balancing and page migration.  Victim Replication (ISCA'05) manages selective data replication even in a set-interleaved S-NUCA cache without complicating the coherence protocol.  In spite of these solutions, the problem of shared-LLC data management is far from solved.  Multi-threaded workloads are tougher to handle than multi-programmed ones: existing solutions do a poor job with pages shared by many cores; they are ineffective when the first requestor of a page is not its dominant accessor (since page migration is expensive); and task migration invalidates most locality optimizations.
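The first-touch idea is simple enough to sketch in a few lines.  This is a hypothetical illustration in the spirit of Cho and Jin's proposal, not their actual mechanism; the names and the core-to-bank assignment are my assumptions:

```python
NUM_BANKS = 16       # assumed: one LLC bank local to each core

# OS-maintained placement table (in practice realized via page coloring:
# the OS picks a physical page whose color maps to the desired bank).
page_to_bank = {}

def place_page(page_number, requesting_core):
    # First-touch policy: on the first access to a page, pin it to the
    # bank local to the requesting core.  Later touches by other cores
    # do not move it (migration is expensive).
    if page_number not in page_to_bank:
        page_to_bank[page_number] = requesting_core % NUM_BANKS
    return page_to_bank[page_number]
```

The weaknesses listed above fall out of this sketch directly: if the first toucher is not the dominant accessor, the page sits in the wrong bank for its lifetime, and if the thread migrates to another core, every placement decision made so far is stale.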

The key point here is this: while there is room for improvement, a shared LLC with some of the above optimizations can offer low access latencies, high effective capacity, low implementation complexity, and better isolation across programs.  Yes, these solutions sometimes involve the OS, but in very simple ways that have commercial precedents (for example, NUMA-aware page placement in the SGI Origin).

On the other hand, one could start with an organization where each core has its own private LLC.  Prior work (such as cooperative caching) has added innovations so that the private LLCs cooperate to provide a higher effective cache capacity.  In cooperative caching, when a block is evicted from a private cache, it is given a chance to reside in another core's private cache.  However, all private LLC organizations require complex mechanisms to locate data that may reside in a remote bank.  Typically, this is done with an on-chip directory that tracks the contents of every cache, and this directory introduces non-trivial complexity.  Some commercial processors do offer private caches; data search is manageable in those processors because there are few caches and bus-based broadcast is possible.  Private LLCs become less attractive as the number of cores scales up.
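The data-search problem can be summarized with a small sketch.  Everything here is illustrative (the structures and names are my assumptions, not any real coherence protocol), but it shows the extra step a private-LLC organization cannot avoid:

```python
# On-chip directory: block address -> set of cores whose private LLC
# holds a copy.  This structure is the source of the added complexity.
directory = {}

def access(addr, core, private_llcs):
    # private_llcs: core id -> set of block addresses held in that
    # core's private LLC (a stand-in for the real tag arrays).
    if addr in private_llcs[core]:
        return "local hit"
    # Local miss: the requestor cannot know whether another private LLC
    # holds the block, so the directory must be consulted before going
    # to memory.  A shared S-NUCA design skips this step because the
    # home bank is determined by the address itself.
    sharers = directory.get(addr, set()) - {core}
    if sharers:
        return "remote hit"
    return "miss to memory"
```

This is also the crux of the first comment below: even for a multi-programmed workload, the hardware cannot skip the search, because it has no a priori knowledge that no other private LLC holds a copy.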

Based on the above arguments, I feel that shared LLCs have greater promise.  For a private LLC organization to be compelling, two questions must be answered: (i) How is the complexity of data search overcome?  (ii) Can it outperform a shared LLC with page-to-bank S-NUCA mapping and a simple locality optimization (either R-NUCA or first-touch page coloring)?


  1. Respected Sir,
    "Private LLCs are less attractive as the number of cores scales up."
    In the case of a multiprogrammed workload, a miss in a private LLC can be sent directly to main memory, since no copy is present in any other private LLC.
    So searching is not required. Can we then say that for a multiprogrammed workload
    a private cache is better?

    1. The hardware doesn't know if there are copies in other private LLCs, so a search would still be required. There are other arguments in favor of a shared cache, even for a multi-programmed workload -- for example, cache space is dynamically allocated to programs based on need.
