Friday, November 25, 2011

ISCA Deadline

I'm guessing everyone's digging themselves out after the ISCA deadline.  I had promised not to talk about energy drinks and deadlines, so let's just say that our deadline week looked a bit like this :-)... without the funky math of course.

I'm constantly hounding my students to get their writing done early.  Good writing doesn't magically show up 24 hours before the deadline.  While a Results section will undoubtedly see revisions in the last few days, there's no valid excuse for not finishing most other sections well before the deadline.

The graph below shows two examples of how our writing evolved this year... better than in previous years, but still not quite early enough!  We also labored more than usual to meet the 22-page budget... I guess we got a little spoilt after the 26-page HPCA'12 format.  I am amazed that producing a refined draft one week before the deadline is an impossible task, but producing a refined 22-page document 3 minutes before the deadline is a virtual certainty.  I suppose I can blame Parkinson's Law, not my angelic students :-), for accelerating my aging process...


Tuesday, November 8, 2011

Memory Has an Eye on Disk's Space

We recently read a couple of SOSP papers in our reading group: RAMCloud and FAWN.  These are terrific papers with significant implications for architects and application developers.  Both papers target the design of energy-efficient and low-latency datacenter platforms for a new breed of data-intensive workloads.  FAWN uses many wimpy nodes and a Flash storage system; RAMCloud replaces disk with DRAM.  While the two papers share many arguments, I'll focus the rest of this post on RAMCloud because its conclusion is more surprising.

In RAMCloud, each individual server is effectively disk-less (disks are only used for back-up and not to service application reads and writes).  All data is placed in DRAM main memory.  Each server is configured to have high memory capacity and every processor has access via the network to the memory space of all servers in the RAMCloud.  It is easy to see that such a system should offer high performance because high-latency disk access (a few milliseconds) is replaced by low-latency DRAM+network access (microseconds).

An immediate architectural concern that comes to mind is cost.  DRAM has a dollar/GB purchase price that is 50-100X higher than that of disk.  A server with 10 TB of disk space costs $2K, while a server with 64 GB DRAM and no disk costs $3K (2009 data from the FAWN paper).  It's a little trickier to compare the power consumption of DRAM and disk.  An individual access to DRAM consumes much less energy than an individual access to disk, but DRAM has a higher static energy overhead (the cost of refresh).  If the access rate is high enough, DRAM is more energy-efficient than disk.  For the same server example as above, the server with the 10 TB high-capacity disk has a power rating of 250 W, whereas the server with the 64 GB DRAM memory has a power rating of 280 W (2009 data from the FAWN paper).  This is not quite an apples-to-apples comparison because the DRAM-bound server services many more requests at 280 W than the disk-bound server does at 250 W.  But it is clear that in terms of operating (energy) cost per GB, DRAM is again much more expensive than disk.  Note that total cost of ownership (TCO) is the sum of capital expenditure (capex) and operational expenditure (opex).  The above data points make it appear that RAMCloud incurs a huge penalty in terms of TCO.

However, at least to my initial surprise, the opposite is true for a certain large class of workloads.  Assume that an application has a fixed high data bandwidth demand and that this is the key determinant of overall performance.  Each disk offers very low bandwidth because of the low rotational speed of the spindle, especially for random access.  In order to meet the high bandwidth demands of the application, you would need several disks and several of the 250 W, 10 TB servers.  If data is instead placed in DRAM (as in RAMCloud), that same high rate of data demand can be fulfilled with just a few 280 W, 64 GB servers.  The difference in data bandwidth rates for DRAM and disk is over 600X.  So even though each DRAM server in the example above is more expensive in terms of capex and opex, you'll need roughly 600 times fewer servers with RAMCloud.  This allows the overall TCO for RAMCloud to be lower than that of a traditional disk-based platform.
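
To make the comparison concrete, here is a minimal back-of-the-envelope sketch (in Python) of the bandwidth-driven TCO argument above.  The $2K/$3K prices and 250 W/280 W ratings are the 2009 FAWN-era numbers quoted earlier; the electricity price, amortization window, and per-server bandwidth figures are illustrative assumptions of mine, not data from either paper.

```python
# Back-of-the-envelope sketch using the 2009 numbers quoted above.
# The electricity price, amortization period, and per-server bandwidth
# figures are illustrative assumptions, not data from the papers.

def servers_needed(demand_gbps, per_server_gbps):
    """Ceiling division: servers required to meet a fixed bandwidth demand."""
    return -(-demand_gbps // per_server_gbps)

def tco(n_servers, capex_per_server, watts_per_server,
        years=3, dollars_per_kwh=0.10):
    """Capex plus energy opex over the amortization window."""
    hours = years * 365 * 24
    opex = n_servers * watts_per_server / 1000.0 * hours * dollars_per_kwh
    return n_servers * capex_per_server + opex

# Hypothetical aggregate demand; per-server bandwidths chosen so that
# DRAM offers roughly 600X the random-access bandwidth of disk.
demand = 6000                 # GB/s needed by the application
disk_bw, dram_bw = 1, 600     # GB/s per server (illustrative)

n_disk = servers_needed(demand, disk_bw)   # 6000 servers
n_dram = servers_needed(demand, dram_bw)   # 10 servers

print("Disk-based TCO: $%.0f" % tco(n_disk, 2000, 250))
print("RAMCloud TCO:   $%.0f" % tco(n_dram, 3000, 280))
```

Even with these rough assumptions, the far smaller server count on the DRAM side dominates the higher per-server capex and power, which is exactly the RAMCloud argument.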

I really like Figure 2 in the RAMCloud CACM paper (derived from the FAWN paper and reproduced below).  It shows that in terms of TCO, for a given capacity requirement, DRAM is a compelling design point at high access rates.  In short, if data bandwidth is the bottleneck, it is cheaper to use technology (DRAM) that has high bandwidth, even if it incurs a much higher energy and purchase price per byte.

Source: RAMCloud CACM paper

If architectures or arguments like RAMCloud become popular in the coming years, it opens up a slew of interesting problems for architects:

1.  Already, the DRAM main memory system is a huge energy bottleneck.  RAMCloud amplifies the contribution of the memory system to overall datacenter energy, making memory energy efficiency a top priority.

2.  Queuing delays at the memory controller are a major concern.  With RAMCloud, a single memory controller will service many requests from many servers, increasing the importance of the memory scheduling algorithm.

3.  With each new DDR generation, fewer main memory DIMMs can be attached to a single high-speed electrical channel.  To support high memory capacity per server, innovative channel designs are required.

4.  If the Achilles' heel of disks is their low bandwidth, are there ways to design disk and server architectures that prioritize disk bandwidth/dollar over other metrics?

Monday, October 31, 2011

Dissing the Dissertation

In my view, a Ph.D. dissertation is an over-rated document.  I'll formalize my position here so I don't get dragged into endless debates during every thesis defense I attend  :-).

A classic Ph.D. generally has a hypothesis that is formulated at an early stage (say, year two) and the dissertation describes 3-4 years of research that tests that hypothesis.

A modern Ph.D. (at least in computer architecture) rarely follows this classic recipe.  There are many reasons to adapt your research focus every year:  (1) After each project (paper), you realize the capabilities and limitations of your ideas;  that wisdom will often steer you in a different direction for your next project.  (2) Each project fixes some bottleneck and opens up other new bottlenecks that deserve attention.  (3) Once an initial idea has targeted the low-hanging fruit, incremental extensions to that idea are often unattractive to top program committees.  (4) If your initial thesis topic is no longer "hot", you may have to change direction to be better prepared for the upcoming job hunt.  (5) New technologies emerge every few years that change the playing field.

I'll use my own Ph.D. as an example.  My first project was on an adaptive cache structure.  After that effort, I felt that long latencies could not be avoided; perhaps latency tolerance would have more impact than latency reduction.  That led to my second effort, designing a runahead thread that could jump ahead and correctly prefetch data.  The wisdom I gathered from that project was that it was essential to have many registers so the processor could look far enough into the future.  If you had enough registers, you wouldn't need fancy runahead; hence, my third project was a design of a large two-level register file.  During that project, I realized that "clustering" was an effective way to support large processor structures at high clock speeds.  For my fourth project, I designed mechanisms to dynamically allocate cluster resources to threads.  So I had papers on caching, runahead threads, register file design, and clustering.  It was obviously difficult to weave them together into a coherent dissertation.  In fact, each project used very different simulators and workloads.  But I picked up skills in a variety of topics.  I learnt how to pick problems and encountered a diverse set of literature, challenges, and reviews.  By switching topics, I don't think I compromised on "depth"; I was using insight from one project to influence my approach in the next.  I felt I graduated with a deep understanding of how to both reduce and tolerate long latencies in a processor.

My key point is this: it is better to be adaptive and focus on high-impact topics than to flog a dead horse for the sake of a coherent dissertation.  Research is unpredictable and a four-year research agenda often cannot be captured by a single hypothesis in year two.  The dissertation is therefore a contrived concept (or at least appears contrived today given that the nature of research has evolved with time).  The dissertation is a poor proxy when evaluating a student's readiness to graduate.  The student's ability to tie papers into a neat package tells me nothing about his/her research skills.  If a Ph.D. is meant to certify that someone has depth/wisdom in a topic and is capable of performing independent research, then a student's publication record is a better metric for that evaluation.

In every thesis proposal and defense I attend, there is a prolonged discussion on what constitutes a coherent thesis.  Students are steered in a direction that leads to a coherent thesis, not necessarily in a direction of high impact.  If one works in a field where citation counts for papers far out-number citation counts for dissertations, I see little value in producing a polished dissertation that will never be read.  Use that time to instead produce another high-quality paper!

There are exceptions, of course.  One in every fifty dissertations has "bible" value... it ends up being the authoritative document on some topic and helps brand the student as the leading expert in that area.  For example, see Ron Ho's dissertation on wires.  If your Ph.D. work naturally lends itself to bible creation and you expect to be an ACM Doctoral Dissertation Award nominee, by all means, spend a few months to distill your insight into a coherent dissertation that will have high impact.  Else, staple your papers into a dissertation, and don't be ashamed about it!

My short wish-list:  (1) A thesis committee should focus on whether a student has produced sufficient high-quality peer-reviewed work and not worry about dissertation coherence.  (2) A dissertation can be as simple as a collection of the candidate's papers along with an introductory chapter that conveys the conclusions and insights that guided the choice of problems.

Wednesday, October 5, 2011

The right yellow gradient

Vic Gundotra, SVP of Engineering at Google, shared this inspiring story on Google+:

One Sunday morning, January 6th, 2008 I was attending religious services when my cell phone vibrated. As discreetly as possible, I checked the phone and noticed that my phone said "Caller ID unknown". I choose to ignore.

After services, as I was walking to my car with my family, I checked my cell phone messages. The message left was from Steve Jobs. "Vic, can you call me at home? I have something urgent to discuss" it said.

Before I even reached my car, I called Steve Jobs back. I was responsible for all mobile applications at Google, and in that role, had regular dealings with Steve. It was one of the perks of the job.

"Hey Steve - this is Vic", I said. "I'm sorry I didn't answer your call earlier. I was in religious services, and the caller ID said unknown, so I didn't pick up".

Steve laughed. He said, "Vic, unless the Caller ID said 'GOD', you should never pick up during services".

I laughed nervously. After all, while it was customary for Steve to call during the week upset about something, it was unusual for him to call me on Sunday and ask me to call his home. I wondered what was so important?

"So Vic, we have an urgent issue, one that I need addressed right away. I've already assigned someone from my team to help you, and I hope you can fix this tomorrow" said Steve.

"I've been looking at the Google logo on the iPhone and I'm not happy with the icon. The second O in Google doesn't have the right yellow gradient. It's just wrong and I'm going to have Greg fix it tomorrow. Is that okay with you?"

Of course this was okay with me. A few minutes later on that Sunday I received an email from Steve with the subject "Icon Ambulance". The email directed me to work with Greg Christie to fix the icon.

Since I was 11 years old and fell in love with an Apple II, I have dozens of stories to tell about Apple products. They have been a part of my life for decades. Even when I worked for 15 years for Bill Gates at Microsoft, I had a huge admiration for Steve and what Apple had produced.

But in the end, when I think about leadership, passion and attention to detail, I think back to the call I received from Steve Jobs on a Sunday morning in January. It was a lesson I'll never forget. CEOs should care about details. Even shades of yellow. On a Sunday.


If there's one thing to learn from Steve Jobs' life, it is the importance of being a perfectionist. All the time. Right down to the perfect shade of yellow. Respect, Mr. Jobs.

Tuesday, October 4, 2011

HotChips 2011 - A Delayed Trip Report



I know, it’s been a little over a month since this year’s edition of the conference concluded, but better late than never! It was my first ever trip to HotChips, and I thought I should share some things I found interesting. I’ll keep away from regurgitating stuff from the technical talks since the tech-press has likely covered all of those in sufficient detail.


1. This is clearly an industry-oriented conference. There were hardly any professors around, and very few grad students. As a result, perhaps, there was almost nobody hanging out in the hallways or break areas during the actual talks; people were actually inside the auditorium!


2. It seemed like the entire architecture/circuits community from every tech shop in the Bay Area was in attendance. If you’re going to be on the job market soon, it’s a great place to network!


3. I enjoyed the keynote talk from ARM. It was interesting to think about smartphones (with ARM chips inside, of course) in the context of the developing world -- at price points in the $100-200 range, they are not a supplementary device like in the US, but the main conduit for people who normally couldn’t afford a $600 laptop to get on the internet. Definitely sounds like there’s huge potential for this form factor going forward.


Another interesting tidbit from this talk was a comparison between the energy densities of a typical smartphone battery and a bar of chocolate -- 4.5 kCal in 30g vs. 255 kCal in 49g! Lots of work to do to improve battery technology, obviously.


4. While on the subject of ARM, comparisons to Intel and discussions on who was “winning” were everywhere, from the panel discussion in the evening to one of the pre-defined “lunch table discussion topics”. Needless to say, there was nothing conclusive one way or the other :)


5. I finally got to touch a real, working prototype of Micron’s Hybrid Memory Cube (see Rajeev’s post here). I know, it’s of little practical value to see the die-stack, but it was cool nonetheless :) The talk was pretty interesting too; it was the first to release some technical information, I think. It appears to be a complete rethinking of DRAM, with a novel protocol, novel interconnect topologies, etc.


6. I also really enjoyed the talk from the Kinect folks. I’ve seen it in action, and it was amazing to understand the complex design behind making everything work, especially since a lot of the challenges were outside my typical areas of interest -- mechanical design, aesthetics, reliability, processing images with low lighting, varying clothes, people of different sizes... fascinating! There was also an aspect of “approachability” in the talk that I think is becoming increasingly common in the tech world -- basically *hide* the technology from the end user, making everything work “magically”. This makes people more likely to actually try these things. As a technologist, I understand the logic, but I am not sure I like it very much -- engineers spend countless hours fixing a million corner cases, and I think there should be some way for the general public to understand at least the scale of these complexities and appreciate how hard it is to make things work! It’s like the common saying that Moore’s “Law” gives you an increasing number of transistors every year -- it doesn’t, engineers do!


7. Finally, one really awesome event was a robot on stage! It was introduced by the folks from Willow Garage -- controlled from a simple Android app. It moved to the center of the stage from the wings, gave the speaker a high-five, and showed off a few tricks. There were also videos in the talk of it playing pool, bringing beer from a fridge (brand of your choice!), and a bunch of other cool things. Very fancy :) Waiting for my own personal assistant!



Sunday, August 28, 2011

Rebuttals and Reviews

In the review process for top conferences, conventional wisdom says that rebuttals don't usually change the mind of the reviewer... they only help the authors feel better.  The latter is certainly true: after writing a rebuttal, I always feel like I have a plan to sculpt the paper into great shape.

Here's some data from a recent PC meeting about whether rebuttals help.  For the MICRO 2011 review process, reviewers entered an initial score for overall merit and a final score after reading rebuttals and other reviews.  While many factors (for example, new information at the PC meeting) may influence a difference in the two scores, I'll assume that the rebuttal is the dominant factor.  I was able to confirm this for the 16 papers that I reviewed.  Of the roughly 80 reviews for those papers, 14 had different initial and final scores (there were an equal number of upward and downward corrections).  In 11 of the 14 cases, the change in score was prompted by the quality of the rebuttal.  In the other 3 cases, one of the reviewers convinced the others that the paper had merit.

For 25 randomly selected papers that were accepted (roughly 125 reviews), a total of 14 reviews had different initial and final scores.  In 10 of the 14 cases, the final score was higher than the initial score.

For 25 papers that were discussed but rejected, a total of 19 reviews had different initial and final scores.  In 14 of the 19 cases, the final score was lower than the initial score.

It appears that rebuttals do play a non-trivial role in a paper's outcome, or at least a larger role than I would have expected.

Monday, August 22, 2011

The Debate over Shared and Private Caches

Several papers have been written recently on the topic of last level cache (LLC) management.  In a multi-core processor, the LLC can be shared by many cores or each core can maintain a private LLC.  Many of these recent papers evaluate the trade-offs in this space and propose models that lie somewhere between the two extremes.  It's becoming evident (at least to me) that it is more effective to start with a shared LLC and make it appear private as required.  In this post, I'll explain why.  I see many papers on private LLCs being rejected because they fail to pay attention to some of these arguments.

Early papers on LLC management have pointed out that in a multi-core processor, a shared LLC enjoys a higher effective cache capacity and ease of design, while private LLCs can offer lower average access latency and better isolation across programs.  It is worth noting that both models can have similar physical layouts -- a shared LLC can be banked and each bank can be placed adjacent to one core (as in the figure below).  So the true distinction between the two models is the logical policy used to govern their contents and access.

The distributed shared LLC shown in the figure resembles Tilera's tiled architecture and offers non-uniform latencies to banks.  In such a design, there are many options for data striping.  The many ways of a set-associative cache could be scattered across banks.  This is referred to as the D-NUCA model and leads to complex protocols for search and migration.  This model seems to have the least promise because of its complexity and non-stellar performance.  A more compelling design is S-NUCA, where every address and every cache set resides in a unique bank.  This makes it easy to locate data.  Most research assumes that in an S-NUCA design, consecutive sets are placed in adjacent banks, what I'll dub set-interleaved mapping.  This naturally distributes load across banks, but every LLC request is effectively serviced by a random bank and must travel half-way across the chip on average.  A third option is an S-NUCA design with what I'll call page-to-bank or page-interleaved mapping.  All the blocks in an OS physical page (a minimum-sized page) are mapped to a single bank and consecutive pages are mapped to adjacent banks.  This is the most compelling design because it is easy to implement (no need for complex search) and paves the way for OS-managed data placement optimizations for locality.
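
As a concrete illustration, here is a minimal sketch of the two S-NUCA mappings described above.  The bank count, block size, and page size are assumed values chosen for illustration.

```python
# Minimal sketch of the two S-NUCA mappings described above.
# Block size, page size, and bank count are assumed for illustration.

NUM_BANKS = 16
BLOCK_SIZE = 64       # bytes per cache block
PAGE_SIZE = 4096      # bytes per (minimum-sized) OS page

def set_interleaved_bank(addr):
    """Consecutive blocks (and hence sets) map to adjacent banks."""
    return (addr // BLOCK_SIZE) % NUM_BANKS

def page_interleaved_bank(addr):
    """All blocks in a physical page map to one bank;
    consecutive pages map to adjacent banks."""
    return (addr // PAGE_SIZE) % NUM_BANKS

addr = 0x12345678
print(set_interleaved_bank(addr), page_interleaved_bank(addr))
```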

Recent work has taken simple S-NUCA designs and added locality optimizations so that an average LLC request need not travel far on the chip.  My favorite solution to date is R-NUCA (ISCA'09), which has a simple policy to classify pages as private or shared, and ensures that private pages are always mapped to the local bank.  The work of Cho and Jin (MICRO'06) relies on first-touch page coloring to place a page near its first requestor; Awasthi et al. (HPCA'09) augment that solution with load balancing and page migration.  Victim Replication (ISCA'05) is able to do selective data replication in even a set-interleaved S-NUCA cache without complicating the coherence protocol.  In spite of these solutions, the problem of shared LLC data management is far from solved.  Multi-threaded workloads are tougher to handle than multi-programmed workloads; existing solutions do a poor job of handling pages shared by many cores; they are not effective when the first requestor of a page is not the dominant accessor (since page migration is expensive); task migration renders most locality optimizations invalid.
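
For illustration, here is a toy sketch of first-touch page placement in the spirit of Cho and Jin's proposal: a page is bound to the bank nearest the core that touches it first.  The page-to-bank table stands in for OS page coloring, and the core-to-bank "nearest" relation is an assumption of mine.

```python
# Sketch of first-touch placement: a page is bound to the bank nearest
# the core that touches it first.  The dictionary stands in for OS page
# coloring; the core-to-bank mapping is an assumption.

PAGE_SIZE = 4096
page_to_bank = {}                 # filled in on first touch

def nearest_bank(core_id):
    # Assume a tiled layout where core i sits next to bank i.
    return core_id

def bank_for_access(core_id, addr):
    page = addr // PAGE_SIZE
    if page not in page_to_bank:              # first touch
        page_to_bank[page] = nearest_bank(core_id)
    return page_to_bank[page]

# Core 3 touches the page first, so later accesses by any core go to bank 3.
print(bank_for_access(3, 0x8000))   # -> 3
print(bank_for_access(7, 0x8004))   # -> 3 (same page, already placed)
```

The sketch also makes the weakness obvious: if the first requestor is not the dominant accessor, the page stays in the wrong bank unless it is migrated.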

The key point here is this: while there is room for improvement, a shared LLC with some of the above optimizations can offer low access latencies, high effective capacity, low implementation complexity, and better isolation across programs.  Yes, these solutions sometimes involve the OS, but in very simple ways that have commercial precedents (for example, NUMA-aware page placement in the SGI Origin).

On the other hand, one could start with an organization where each core has its own private LLC.  Prior work (such as cooperative caching) has added innovations so that the private LLCs can cooperate to provide a higher effective cache capacity.  In cooperative caching, when a block is evicted out of a private cache, it is given a chance to reside in another core's private cache.  However, all private LLC organizations require complex mechanisms to locate data that may be resident in a remote bank.  Typically, this is done with an on-chip directory that tracks the contents of every cache.  This directory introduces non-trivial complexity.  Some commercial processors do offer private caches; data search is manageable in these processors because there are few caches and bus-based broadcast is possible.  Private LLCs are less attractive as the number of cores scales up.

Based on the above arguments, I feel that shared LLCs have greater promise.  For a private LLC organization to be compelling, the following questions must be answered: (i) How is the complexity of data search overcome?  (ii) Can it outperform a shared LLC with page-to-bank S-NUCA mapping and a simple locality optimization (either R-NUCA or first-touch page coloring)?

Tuesday, June 28, 2011

A Killer App for 3D Chip Stacks?

I had previously posted a brief summary of our ISCA 2011 paper.  The basic idea was to design a 3D stack of memory chips and one special interface chip, connected with TSVs (through silicon vias).  We argue that the interface chip should have photonic components and memory scheduling logic.  The use of such a stack enables game-changing optimizations (photonic access, scalable scheduling) without disrupting the manufacture of commodity memory chips.



When we presented our work at Micron recently (see related post by Ani Udipi), we were told that Micron has a full silicon prototype that incorporates some of these concepts.  We were very excited to hear that a 3D stacked memory/logic organization similar to our proposal is implementable and could become a reality in the near future.  Slide 17 of the report from the Micron Winter Analyst Conference, February 2011, describes Micron's Hybrid Memory Cube (HMC, Figure reproduced below).  The HMC has a Micron-designed logic controller at the bottom of the stack (what we dub the interface chip in our work) and it is connected to multiple DRAM chips with TSVs.  Details of what is on the logic controller have not been released yet.  Micron is partnering with Open-Silicon to take advantage of the opportunities made possible by the HMC.


This is an exciting development and I expect that one could discover many creative ways to put useful functionality on the interface chip.  Prior work has proposed several ways to add functionality to DRAM chips: row buffer caches, processing in memory, error correction features, photonic components, etc.  Many of these ideas were unimplementable because of their impact on cost, but they may be worth attempting in the context of a 3D memory stack.  This is also an opportunity to add value to memory products, an especially important consideration as density growth flattens or as we move to new memory technologies that have varying maintenance needs.

While most prior academic work on 3D architecture has been processor-centric, the potential benefit of memory-centric 3D stacking is relatively unexplored.  This is in spite of the fact that memory companies have embraced 3D stacking much more than processor companies.  3D memory chip stacks are currently manufactured in various forms by Tezzaron, Samsung, and Elpida among others.  The concept of building a single chip and then reusing it within 3D chip stacks to create multiple different products has been proposed previously for processors (papers from UCSB and Utah).  Given the economic impact of this concept, the cost-sensitive memory industry stands to gain more from it.  Memory companies operate at very small margins.  They therefore strive to optimize cost-per-bit and almost exclusively design for high volumes.  They are averse to adding any feature that will increase cost for millions of chips, but will only be used by a small market segment.  But with 3D chip stacks, the same high-volume commodity DRAM chip can be bonded to different interface chips to create different products for each market segment.  This may well emerge as the most compelling application of 3D stacking within the high performance domain.

Sunday, June 19, 2011

Father's Day

Just spent a great Father's Day watching the US Open with my 3-year-old son and week-old daughter in my arms.  On most Father's Days, I'm fumbling around airports and hotels, missing my family and trying to get a peek at US Open scores on my way to ISCA.  Here's hoping that future ISCA organizers make it a high priority to not schedule ISCA during Father's Day...

Thursday, May 26, 2011

Memory Systems Research: An Industry Perspective

After several months of informal meetings and ad-hoc contact, we finally visited memory manufacturer Micron in Boise, ID earlier this month. Our large group of 7 students and 2 faculty members made for a fun road trip through stretches of torrential rain and awesome homemade pie at Connor’s Cafe in Burley (I would definitely recommend you stop by if you’re ever on I-84 in Idaho!). In a 3+ hour presentation and discussion session, we discussed all of the memory research happening here at our group in Utah, including both published work and current work-in-progress. We wanted feedback on the memory industry’s constraints and the feasibility of implementing some of our ideas. Here are some of the things we heard, in no particular order:


1. Density, density, density: That density and cost/bit are everything in the memory industry is well known in memory research circles, but we really had no idea how much! We were told that on commodity parts, a 0.01% area penalty “might be considered”, 0.1% would require “substantial” performance gains, and anything over 0.5% was simply not happening, no matter what (these numbers are meant to be illustrative, of course, and not precise guidelines). For more specialized parts for niche markets, on the other hand, slightly larger area penalties could be considered as part of the overall value package. Takeaway: Focus on the high-end specialized segment, say the next big supercomputer, where cost is a smaller concern. The commodity segment has been squeezed to death, and you’re probably not going to get anywhere trying to change anything.


2. Be afraid of the memory chip: Very afraid! This is related to the previous point. The actual array layout has been super-ultra-optimized by people who dream about this stuff, and even the slightest change you propose is likely going to mess with this beyond repair. Stay away! Also, lots of smart people care very deeply about this narrow field, and have been working on this particular aspect for a long time. Anything you think of, they’ve probably considered before (but not published!). They would be “astounded” to hear something really novel.


3. Work at the system-level: There are infinite possibilities in terms of workload-based studies, altering data placement, data migration, row-buffer management, activity throttling, etc. These are less likely to provide dramatic improvements, but are low-effort, and are more likely to have an impact in terms of actually being implemented, since they are less invasive. It is unlikely that the industry has studied all the possibilities exhaustively, and you’re much more likely to come up with something novel.


4. Build it and they will come. NOT. If you really feel there is a case to be made to radically change DRAM architecture, approach the guys that will actually buy and use the parts - the system manufacturers. If they care enough, they can ask for it, and if they’re willing to pay for it, Micron will build it. It cannot be driven from the other end, no matter what; the margins are simply too small, and the whole system is set up to incentivize maximum capacity at least cost.


5. The Universal Memory Myth: Micron has been researching PCM for over a decade now, and is only now kinda sorta ready-ish to release a product into the market. Their excitement about the maturity of the technology, and its ability to summarily replace all memory in the system, is far, far lower than that of the academic community. PCM is nowhere close to reaching the capacity and cost point of DRAM, which is, therefore, not likely to die any time soon.

Sunday, May 1, 2011

A Perspective on HTM Designs

I've talked with a number of faculty candidates this past season.  I have come to realize that people who work in non-architecture systems areas haven't really kept track of the take-home messages from recent research in hardware transactional memory (HTM).  Or, perhaps, the messages have been obfuscated by other signals (STM results or industry trends).  Here's my unbiased view on HTM design options.  Unbiased because I have merely dabbled in TM research and don't have a horse in this race :-).  Harris, Larus, and Rajwar's TM book should be an excellent read if you want a more detailed account.

For those new to TM or HTM, I'll first provide a quick summary.  With numerous cores available today, the hardware is only effective if software is efficiently multi-threaded.  It is imperative that architects come up with innovations that make it easier for programmers to write multi-threaded code.  HTM is one such innovation.  Programmers simply identify regions of code (transactions) that should be executed atomically.  The underlying hardware looks for parallelism, but ensures that the end result is as if each transaction executed atomically and in isolation.  HTM can therefore provide high performance with low programming effort.  With a few exceptions (usually contrived), transactional code is also deadlock-free.  Transactional semantics are an improvement over almost any known multi-threaded programming model.  HTM makes the programming process a little bit easier but is by no means a panacea.

The Wisconsin LogTM paper introduced a frequently used taxonomy to describe HTM implementations.  They showed that one could use either lazy or eager conflict detection and either lazy or eager version management.  When explaining HTM to students, I have always found it easier and more intuitive to start with an implementation that uses lazy versioning and lazy conflict detection, such as Stanford's TCC design.  In TCC, every transaction works on locally cached copies of data and writes are not immediately propagated to other caches.  At the end of its execution, a transaction must first acquire permissions to commit; it then makes all its writes visible through the conventional cache coherence protocol.  If another in-progress transaction has touched any of these written values (a conflict), it aborts and re-starts.  Each transaction executes optimistically and conflicts are detected when one of the transactions is trying to commit (lazy conflict detection).  Upon cache overflow, the new results computed by a transaction are stored in a temporary log in main memory; when the transaction commits, the overflowed results must be copied from the log into their permanent homes (lazy versioning).  Bloom filters can be used to track the contents of the log and efficiently detect conflicts.
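
Here is a toy sketch (in Python) of the lazy-lazy flow just described: writes stay in a local buffer and conflicts are checked only when a transaction tries to commit.  This is purely illustrative and glosses over commit arbitration, the coherence machinery, and the Bloom filters of a real TCC-style design.

```python
# Toy model of lazy versioning + lazy conflict detection (TCC-style):
# writes are buffered locally and conflicts are checked at commit time.

class LazyTxn:
    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_buffer = {}        # addr -> value, invisible to others
        self.aborted = False

    def read(self, mem, addr):
        self.read_set.add(addr)
        return self.write_buffer.get(addr, mem.get(addr, 0))

    def write(self, addr, value):
        self.write_buffer[addr] = value          # lazy versioning

    def commit(self, mem, in_progress):
        # Lazy conflict detection: any in-progress transaction that read
        # or wrote a location we are committing must abort and restart.
        for other in in_progress:
            if other is not self and not other.aborted:
                touched = other.read_set | set(other.write_buffer)
                if set(self.write_buffer) & touched:
                    other.aborted = True
        mem.update(self.write_buffer)            # writes become visible now

memory = {"x": 0}
t1, t2 = LazyTxn("t1"), LazyTxn("t2")
t2.read(memory, "x")
t1.write("x", 42)
t1.commit(memory, [t1, t2])
print(memory["x"], t2.aborted)    # 42 True: t2 must restart
```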

While the TCC design is easier to understand, I feel it has a few drawbacks (explained shortly).  In my view, an HTM design that uses eager versioning and eager conflict detection, such as Wisconsin's LogTM design, does a better job optimizing for the common case.  In LogTM, each transaction proceeds pessimistically, expecting a conflict at every step (eager conflict detection).  Writes are immediately made visible to the rest of the system.  The underlying cache coherence protocol detects conflicting accesses by in-progress transactions.  When a conflict is detected, one of the transactions is forced to wait for the other to finish.  Deadlock scenarios are easily detected and handled by using transaction timestamps.  Upon cache overflow, the old result is stored in a log and the new result is immediately placed in its future permanent home (eager versioning).  Some provisions are required so we can continue to detect conflicts on overflowed data (called sticky pointers; now you know why this is a little harder to explain to TM newbies :-).
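
And here is the eager-eager counterpart: writes go to their home locations immediately, old values go into an undo log, and a conflicting requester would be stalled (NACKed) rather than aborting the writer.  Again, this is a toy illustration, not a hardware model.

```python
# Toy model of eager versioning + eager conflict detection (LogTM-style):
# new values go to their home locations right away, old values go to an
# undo log, and conflicts are caught on each access via coherence.

class EagerTxn:
    def __init__(self, name):
        self.name = name
        self.undo_log = []            # (addr, old_value) pairs
        self.write_set = set()

    def write(self, mem, addr, value):
        self.undo_log.append((addr, mem.get(addr, 0)))   # eager versioning
        mem[addr] = value                                # visible in place
        self.write_set.add(addr)

    def conflicts_with(self, addr):
        # In hardware, the coherence protocol performs this check and the
        # requester is NACKed (stalled) instead of anyone aborting.
        return addr in self.write_set

    def commit(self):
        self.undo_log.clear()         # the common case: nothing to copy

    def abort(self, mem):
        for addr, old in reversed(self.undo_log):        # the slow path
            mem[addr] = old
        self.undo_log.clear()

memory = {"x": 0}
t1 = EagerTxn("t1")
t1.write(memory, "x", 42)
print(t1.conflicts_with("x"))   # True: a reader of "x" would be stalled
t1.commit()                     # cheap commit
print(memory["x"])              # 42
```

Note how commit is trivial and abort does the copying, the reverse of the lazy-lazy sketch above; that asymmetry is the crux of the common-case argument that follows.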

I believe that EE-HTM (Eager conflict detection and Eager versioning) is more efficient than LL-HTM (Lazy conflict detection and Lazy versioning) because the former makes a number of design choices that favor the common case.  First, commits are slow in LL-HTM because they involve making all transactional writes visible to the rest of the system and copying log values to their permanent homes.  Commit is the frequent operation (every transaction must commit).  Second, EE-HTM does make an abort slower because it involves a copy from the log, but aborts are supposed to be uncommon.  Aborts are even more uncommon in EE-HTM because they are only initiated in a potential deadlock situation.  While LL-HTM charges ahead with transaction execution and causes an abort on every conflict, EE-HTM is more deliberate in its progress.  If there is a conflict, a transaction is made to stall and proceed after the conflicting situation is resolved.  This leads to less wasted work and overall lower energy dissipation.

HTM is easy to implement in hardware.  All it takes is: (i) a register checkpoint so we can roll back register state on a transaction abort, (ii) two bits per cache line to keep track of whether a transaction has read or written that line, (iii) some new logic in the cache coherence controller to undertake transactional operations, such as examining the above bits and issuing aborts or NACKs, (iv) system and hardware support (in the form of hardware signatures) to handle overflows and commit permissions.  So why has industry been so reluctant to embrace HTM (with the exception of Sun's Rock and AMD's recent ASF extensions)?  My sense is that there is great concern about various corner-cases and how they can be handled efficiently.  In my view, most known corner-cases have workable solutions and it is pointless to try and optimize infrequent scenarios.  For example, a number of ugly scenarios can be averted if the system has a single "golden token" that allows its owner to successfully commit its on-going transaction.  This is a graceful way to handle cache overflows, I/O operations, nesting overflows, starvation, signature overflows, etc.  In essence, if our best efforts at transactional parallelism fail, simply revert to single-thread execution for a while.  The impact on performance should be minimal for most workloads and this is a penalty we should be willing to pay to enjoy the programmability benefits of TM for other workloads.  I am less enamored by the approach that falls back on STM when such bad events happen; STM opens up a new can of worms.

HTM implementations are efficient enough that there are few known bottlenecks.  The classic TM pathologies paper from Wisconsin in ISCA 2007 talks about possible HTM performance bottlenecks.  Contributions by that paper and others have pretty much addressed most known pathologies.  One of my Master's students, Byong Wu Chong, spent a fair bit of time characterizing STAMP and other transactional benchmarks.  He found that once a few pathological cases are optimized, very little of the overall execution time can be attributed to transactional phenomena.  Commit is typically not a big bottleneck for LL-HTM and deadlock-driven aborts are uncommon in EE-HTM.

In summary, HTM is relatively easy to implement, especially if we take a correctness-centric view to corner cases.  There are few glaring performance deficiencies caused by HTM features.  Quantitatively, I expect there is little difference between optimized versions of EE-HTM and LL-HTM for most workloads.  I have a personal preference for the former because it reduces wasted work and optimizes the common case.  For newbies, the latter is an easier introduction to HTM.

Sunday, April 24, 2011

Dude, where are my PCM parts going?

All of us have heard of the not-so-new buzz surrounding the slew of newer non-volatile memories (NVMs) that have either hit the market, or are expected to do so soon. The noteworthy among these are Phase-change memory (PCM) and an advancement of Magnetoresistive Random Access Memory (MRAM) called Spin-Torque Transfer RAM (STT-RAM). PCM parts have started to trickle out as commercial products, while the viability of STT-RAM parts is still being demonstrated. Judging from the excitement about STT-RAM at the Non-Volatile Memories workshop this year, though, I am sure that we will be seeing more and more of it in the coming years.

Recently, the architecture community has been abuzz with proposals on applications of PCM at different levels in the memory and storage hierarchy. But, the buzz notwithstanding, let's take a quick look at the PCM products that are presently available. Numonyx has 128 Mb serial and parallel parts on the market. While there has been talk of Samsung shipping a 512 Mb part, the part information is unavailable, as far as I know. It does seem, though, that Samsung is "shipping a little bit" of PCM parts. Where, to whom, and what for is still not very clear. But there have been reports of PCM parts actually being used in consumer electronics. From the current state of things, it seems like the consumer electronics industry (cameras, cell phones and the like) is intent on using PCM as a plug-and-play replacement for Flash. Assuming that manufacturers come out with high density, low latency PCM parts, how are they best used in desktops and server-side machines?

Industry analysts believe that PCM is a good contender to replace Flash in the short term. Taking that thought a little further, IBM believes that PCM will have a big role to play in the server-side storage hierarchy, but again, as a replacement for Flash SSDs. I tend to agree with them, at least in the short run. Many research papers advocate that PCM will eventually serve as a DRAM replacement.  Thus, there is little consensus on how and where PCM parts may fit in the memory/storage hierarchy.

From what I have learned so far, there are four main issues that will have to be handled before PCM is ready to be a main memory replacement - (i) endurance has to reach somewhere around that of DRAM, (ii) write energy and latency have to be reduced substantially, (iii) higher levels of the hierarchy have to be more effective at filtering out writes to the PCM level, and (iv) the recently discovered problem of resistance drift with multi-level cells has to be addressed. More recently, there have been concerns about the scaling of PCM devices because of increasing current densities and electromigration (Part 1, Part 2, Part 3). This, of course, is assuming that all issues associated with wear-leveling have been worked out.

Apart from the issues with the devices themselves, has there been a general reluctance to shift to PCM? To quote Samsung's top honcho on the semiconductor side of the business: "systems guys have been very reluctant to adopt the technology in mass quantities". The quote conveys very little discernible information, but, as a first guess, it might mean that the system software stack has to be extensively reworked to make PCM a reality.

There are many views on the future of PCM.  There appears to be a slight disconnect between industrial trends and academic thrusts.  The former seems to focus on using PCM as a permanent storage device, while the latter is attempting to move PCM higher up the hierarchy.  A case can be made that more academic effort must be invested in developing PCM as a viable storage class memory.

Saturday, March 19, 2011

Grad School Ranking Methodology

We are in the midst of the graduate admissions process and faculty candidate interviews.  A large part of a candidate's decision is often based on a university's rank in computer science.  This is especially true for prospective graduate students.  Unfortunately, the university rankings commonly in use are flawed and inappropriate.

The US News rankings seem to be the most visible and hence most popular.  However, they are a blatant popularity contest.  They are based on an average of 1-5 ratings given by each CS department chair for all other CS departments.  While I'm sure that some department chairs take their rating surveys very seriously, there are likely many that simply go off of past history (a vicious cycle where a department retains its reputation because of its previous lofty rank), recent news, or reputation within the department chair's own research area.  In spite of several positive changes within our department, our reputation score hasn't budged in recent years and our ranking tends to randomly fluctuate in the 30s and 40s.  But no, this post is not about the impact of these rankings on our department; it is about what I feel is the right way for a prospective grad student to rank schools.

The NRC released a CS ranking way back in 1993.  After several years of work, they released a new ranking a few months back.  However, this seems to be based on a complex equation and there have been several concerns about how data has been collected.  We have ourselves seen several incorrect data points.  The CRA has a statement on this.

Even if the NRC gets their act together, I strongly feel that prospective grad students should not be closely looking at overall CS rankings.  What they should be focusing on is a ranking for their specific research area.  Such a ranking should be one of a number of factors that they should be considering, although, in my view, it should be a dominant factor.

What should such a research area ranking take into account?  It is important to consider faculty count and funding levels in one's research area.  But primarily, one should consider the end result: research productivity.  This is measured most reliably by counting recent publications by a department in top-tier venues.  Ideally, one should measure impact and not engage in bean-counting.  But by focusing the bean-counting on top-tier venues, impact and quality can be strongly factored in.  This is perhaps the closest we can get to an objective measure of impact.  Ultimately, when a student graduates with a Ph.D., his/her subsequent job depends most strongly on his/her publication record.  By measuring the top-tier publications produced by a research group, students can measure their own likely productivity if they were to join that group.

I can imagine a few reasonable variations.  If we want to be more selective about quality and impact, we could measure best-paper awards or IEEE Micro Top Picks papers.  However, those counts are small enough that they are likely not statistically significant in many cases.  One could also derive a stronger measure of impact by looking at citation counts for papers at top-tier venues.  Again, for recent papers, these citation counts tend to be noisy and often dominated by self-citations.  Besides, it's more work for whoever is computing the rankings :-).  It may also be reasonable to divide the pub-count by the number of faculty, but counting the number of faculty in an area can sometimes be tricky.  Besides, I feel a department should get credit for having more faculty in a given area; a grad student should want to join a department where he/she has many options and a large peer group.  One could make an argument for different measurement windows -- a short window adapts quickly to new faculty hires, faculty departures, etc.  The window also needs to be long enough to absorb noise from sabbaticals, funding cycles, infrastructure building efforts, etc.  Perhaps, five years (the average length of a Ph.D.) is a sweet spot.

So here is my proposed ranking metric for computer architecture: number of papers by an institution at ISCA, MICRO, ASPLOS, HPCA in the last five years.
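
For the curious, computing this metric takes only a few lines of scripting.  Here is a minimal sketch; the input records are made-up placeholders, not real publication data.

```python
# Minimal sketch of the proposed metric: count papers per institution at
# ISCA, MICRO, ASPLOS, and HPCA over a five-year window.  The records
# below are made-up placeholders, not real publication data.

from collections import Counter

TOP_VENUES = {"ISCA", "MICRO", "ASPLOS", "HPCA"}

def rank(papers, start_year=2006, end_year=2010):
    counts = Counter()
    for venue, year, institutions in papers:
        if venue in TOP_VENUES and start_year <= year <= end_year:
            # Credit an institution if any author on the paper is from it.
            for inst in set(institutions):
                counts[inst] += 1
    return counts.most_common()

example = [
    ("ISCA",  2009, ["Univ A", "Univ B"]),
    ("MICRO", 2007, ["Univ A"]),
    ("HPCA",  2010, ["Univ C", "Univ A"]),
]
print(rank(example))    # Univ A leads with 3 papers
```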

Indeed, I have computed the computer architecture ranking for 2010.  This is based on top-tier papers in the 2006-2010 time-frame for all academic institutions world-wide.  I have not differentiated between CS and ECE departments.  An institution gets credit even if a single author on a paper is from that institution.  If you are considering joining grad school for computer architecture research (or are simply curious), email me for a link to this ranking.  I have decided not to link the ranking here because I feel that only prospective grads need to see it.  A public ranking might only foster a sense of competition that is likely not healthy for the community.

If you are shy about emailing me (why?! :-) or are curious about where your institution may stand in such a ranking, here's some data:  All top-5 schools had 44+ papers in 5 years; 20+ papers equates to a top-10 rank; 15+ papers = top-15; 12+ papers = top-20; 9+ papers = top-25.  There were 44 institutions with 4+ papers and 69 institutions with 2+ papers.

Please chime in with comments and suggestions.  I'd be happy to use a better metric for the 2011 ranking.

Friday, March 11, 2011

A Case for Archer

Disclaimer: The opinions expressed in this post are my own, based on personal experiences.

Almost any computer architecture researcher realizes the importance of simulators to the community. Architectural level simulators allow us to model and evaluate new proposals in a timely fashion, without having to go through the pain of fabricating and testing a chip.

So, what makes for a good simulator? I started grad school in the good old days when Simplescalar used to be the norm. It had a detailed, cycle-accurate pipeline model and most importantly, it was pretty fast. Once you got over the learning curve, additions to the model were fairly easy. It did have a few drawbacks. There was little support for simulating CMP architectures, there was no coherence protocol in place, and a detailed DRAM model was missing. Moreover, interference of the operating system with an application's behavior was not considered. But, these issues were less important back then.

Soon, it was apparent that CMPs (chip multi-processors) were here to stay. The face of applications changed as well. Web 2.0-based, data-intensive applications came into existence. To support these, a number of server-side applications needed to be run on backend datacenters. This made the performance of multi-threaded applications, and that of main memory, of paramount importance.

To keep up with these requirements, the focus of the architecture community changed as well. Gone were the days of trying to optimize the pipeline and extract ILP. TLP, MLP, and memory system performance (caches, DRAM) became important. One also could no longer ignore the importance of interference from the operating system when making design decisions. The community was now on the lookout for a simulation platform that could take all of these factors into account.

It was around this time that a number of full-system simulators came into being. Off the top of my head, I can count a number of these, popular with the community and with a fairly large user base - Wind River's Simics, M5, Zesto, Simflex, SESC, to name a few. For community-wide adoption, a simulator platform needed to be fast, have a modular code base, and have a good support system (being cycle-accurate was a given). Simics was one of the first platforms that I tried personally, and I found their support to be extremely responsive, which also helped garner large participation from the academic community. Also, with the release of the GEMS framework from Wisconsin, I didn't need a reason to look any further.

In spite of all the options out there, getting the infrastructure (simulator, benchmark binaries, workloads and checkpoints) in place is a time-consuming and arduous process. As a result, groups seem to have gravitated towards simulators that best suited their needs in terms of features and ease of use. The large number of options today also implies that different proposals on the same topic inevitably use different simulation platforms. As a result, it is often difficult to compare results across papers and exactly reproduce results of prior work. This was not as significant a problem before, when nearly everyone used Simplescalar.

In some sub-areas, it is common practice to compare an innovation against other state-of-the-art innovations (cache policies, DRAM scheduling, etc.). Faithfully modeling the innovations of others can be especially troublesome for new grad students learning the ropes of the process. I believe a large part of this effort can be reduced if these models (and by model I mean code :-) were already publicly available as part of a common simulator framework.

The Archer project, as some of you might know, is a recent effort in the direction of collaborative research in computer architecture. From the project's website, they strive for a noble goal -
"To thoroughly evaluate a new computer architecture idea, researchers and students need access to high-performance computers, simulation tools, benchmarks, and datasets - which are not often readily available to many in the community. Archer is a project funded by the National Science Foundation CRI program to address this need by providing researchers, educators and students with a cyberinfrastructure where users in the community benefit from sharing hardware, software, tools, data, documentation, and educational material."

In its current format, Archer provides a large pool of batch-scheduled computing resources, a set of commonly used tools and simulators and some benchmarks and datasets. It also has support for sharing files via NFS and a wiki-based infrastructure to aggregate shared knowledge and experiences.

If widely adopted, this will provide a solution to many of the issues I listed above. It can help push the academic community towards a common infrastructure while at the same time reducing the effort needed to set up simulation infrastructure and to reproduce prior work.

Although Archer is a great initial step, I believe it can still be improved upon. If I had my wishes, I would like a SourceForge-like platform, where the model for a particular optimization is owned by a group of people (say, the authors of the research paper) and available to be checked out from a version-control system. Anyone using the Archer platform for research that results in a peer-reviewed publication would be obliged to release their model publicly under a GPL. Bug reports would be sent to the owners, who would in turn release revised versions of the model(s).

In recent years, collaborative research efforts in computer science have been very successful. Emulab is widely used by the networking community. The TCS community, too, has been involved in successful collaborative research, e.g., the Polymath project and the recent collaborative review of Deolalikar's paper. I believe that there is certainly room for larger collaborative efforts within the computer architecture community.

Sunday, February 27, 2011

Not Lacking in Buzzwords...

Warning: This post may contain some shameless advertising for our upcoming ISCA paper. :-)

Based on a reviewer suggestion, we are changing the title of our recently accepted ISCA paper from
"Designing a Terascale Memory Node for Future Exascale Systems"
to
"Combining Memory and a Controller with Photonics through 3D-Stacking to
Enable Scalable and Energy-Efficient Systems"

Believe me, it took many iterations to find something that was descriptive, accurate, and marginally pleasant-sounding :-).  While throwing in every buzzword makes for a clunky title, such titles are essential for flagging the attention of the right audience (those working on photonics, 3D, and memory systems).

I mention the following example of a bad title to my students.  Our ISCA 2001 paper was on runahead execution (a form of hardware prefetching) and appeared with three other papers on the same topic in the same conference.  Our non-descriptive title said: "Dynamically Allocating Processor Resources between Nearby and Distant ILP".  There's little in the title to indicate that it is about runahead.  As a result, our paper got left out of most subsequent runahead conversations and had minimal impact.  In terms of citations (the closest quantitative measure of impact), our paper has 70+ citations; each of the other three has 200+.  I might have felt robbed if the paper were actually earth-shattering; in retrospect, the design was quite unimplementable (and that no doubt contributed to its low impact).  In my view, Onur Mutlu's subsequent thesis work put forth a far more elegant runahead execution design and made most prior work obsolete.

For those interested, here's an executive summary of our ISCA'11 work (done jointly with HP Labs).  The killer app for photonics is its high bandwidth in and out of a chip (something we can't do for long with electrical pins).  However, introducing photonic components into a cost-sensitive DRAM chip can be highly disruptive.  We take advantage of the fact that industry is possibly moving towards 3D-stacked DRAM chip packages and introduce a special interface die on the DRAM stack.  The interface die has photonic components and some memory controller functionality.  By doing this, we use photonics to break the pin barrier, but do not disrupt the manufacture of commodity DRAM chips.  For communication within the 3D stack, we compute an energy-optimal design point (exactly how much of the intra-stack communication should happen with optics and how much with electrical wiring).  It turns out that there is no need for optics to penetrate into the DRAM dies themselves.  We also define a novel protocol to schedule memory operations in a scalable manner: the on-chip memory controller does minimal book-keeping and simply reserves a speculative slot on the photonic bus for the data return.  Most scheduling minutiae are handled by the logic on the DRAM stack's interface die.
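
To give a rough feel for the scheduling idea (and only a rough feel -- this is my loose illustration, not the protocol from the paper), here is a toy sketch in which the on-chip controller reserves a future bus slot with minimal book-keeping and the stack's interface die decides whether the data is ready to fill it.  All structures and latencies here are invented.

```python
# Loose illustration of the speculative-slot idea: the on-chip controller
# does minimal book-keeping and reserves a future photonic-bus slot for a
# request's data return; detailed scheduling happens on the interface die,
# which either fills the slot or lets it go unused (forcing a reissue).
# All latencies and structures here are invented.

PHOTONIC_SLOT_LATENCY = 50          # cycles from request to reserved slot

class OnChipController:
    def __init__(self):
        self.reserved = {}           # slot time -> request id

    def issue(self, req_id, now):
        slot = now + PHOTONIC_SLOT_LATENCY
        self.reserved[slot] = req_id
        return slot

class StackInterfaceDie:
    def __init__(self):
        self.ready = set()           # requests whose DRAM access finished

    def fill_slot(self, req_id):
        return req_id in self.ready  # a miss means the request is reissued

mc, die = OnChipController(), StackInterfaceDie()
slot = mc.issue("req0", now=0)
die.ready.add("req0")
print(slot, die.fill_slot("req0"))   # 50 True
```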

Tuesday, February 22, 2011

Research Competitions

Almost every year, the Journal of ILP has conducted a research competition co-located with one of our top conferences.  For example, researchers are asked to develop a branch predictor that yields the highest prediction accuracy on a secret input data set.  I think such competitions are a great idea.  If you teach an advanced architecture class, you might want to consider tailoring your class projects in that direction; an international competition could be a strong motivating factor for some students.  Besides, the competition organizers usually release a piece of simulation infrastructure that makes it easy for students to write and test their modules.  Recent competitions have covered cache replacement, data prefetching, and branch prediction.  The 3rd branch prediction championship will be held with ISCA this year.  I am slated to organize a memory scheduling championship with ISCA next year (2012).  The evaluation metric could be row-buffer hit rate, overall throughput, energy, fairness, or a combination of these.  I have a grand 12 months to plan this out, so please feel free to send in suggestions regarding metrics, simulators, and workloads that you think would work well.
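As a taste of what one candidate metric looks like, here is a minimal Python sketch that computes row-buffer hit rate from a trace of (bank, row) accesses under an open-page policy.  The trace format is a made-up simplification for illustration, not the championship's actual infrastructure.

    # Minimal row-buffer hit rate calculation under an open-page policy.
    # The (bank, row) trace format is a simplification for illustration.
    def row_buffer_hit_rate(trace):
        """trace: list of (bank_id, row_id) DRAM accesses in service order."""
        open_row = {}                    # bank_id -> currently open row
        hits = 0
        for bank, row in trace:
            if open_row.get(bank) == row:
                hits += 1                # the open row is reused: no precharge/activate
            else:
                open_row[bank] = row     # row miss: precharge old row, activate new one
        return hits / len(trace) if trace else 0.0

    # Toy example: 3 of the 6 accesses hit an open row.
    print(row_buffer_hit_rate([(0, 5), (0, 5), (1, 3), (0, 7), (0, 7), (1, 3)]))  # 0.5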

Saturday, February 19, 2011

PostDocs in Computer Science

[Charts reproduced from the CRA PostDoc report appeared here.]
PostDocs in CS are surging, and there is a lot of discussion in the computing community about this trend.  If you haven't already seen the CRA PostDoc white paper based on Taulbee data, I recommend giving it a read and sharing your opinions on it.  Since I am pursuing a PostDoc myself, I wanted to discuss some of the data presented in this report and its relevance to the computer architecture research community.

The charts above have all been taken from the CRA report.  Figure 1 shows the number of PhDs that went into different career paths.  The "Others" category in this chart represents graduating PhDs who either took a career path not listed here or did not have a job by the time they graduated.  Looking at this data, 2004 appears to have been the best year for academic hiring, with roughly equal numbers of CS PhDs taking tenure-track and industry positions.  In 2009, a majority of PhDs took industry positions, mainly because the number of PhD graduates almost doubled while academic hiring shrank due to the economic downturn.  The trend to note in this figure is the number of PostDocs, which almost doubled from 2004 to 2009.

Figure 3 further splits the number of graduating PhDs into research-area sub-groups.  In this chart, "Others" signifies inter-disciplinary research areas or could even imply unemployment.  Based on this data, systems/networking and AI/robotics have consistently been hot fields in terms of the number of PhD graduates.  Computer architecture is a small research community, similar in size to theory and algorithms, and its number of PhD graduates has not changed much from 1998 to 2009.

Figure 4 shows the percentage of PhDs in each sub-group who took PostDoc positions over the years.  Computer architecture has the lowest percentage of PostDocs (~2%) in 2009 (that includes me).  Theory, AI/robotics, and numerical analysis/scientific computing have consistently had the highest percentages of PostDocs.  One reason we may have fewer computer architecture PostDocs is the strong semiconductor industry presence in our field and the relatively small number of PhDs available to fill those jobs.

If we look at Figure 5, which shows the percentage of PhDs taking up tenure-track positions, we are again in the bottom range in 2009 (~6-7%), though we did well in 2003-2005 (~30-35%).  I have plotted the number of computer architecture PhDs and the number of tenure-track hires in our field in the chart below, based on the data I could visually interpret from the CRA charts.  Clearly, the number of tenure-track positions in computer architecture was at its peak in 2004, when relatively few PhDs graduated; we are now back at 1998 levels, with stiffer competition for very few slots.  It will be interesting to watch this trend in the future.

Sunday, February 13, 2011

The 2X-5% Challenge

It is well-known that much of our data storage and computation will happen within datacenters in the coming decade.  Energy efficiency in datacenters is a national priority, and the memory system is one of the significant contributors to system energy.  Most reports indicate that the memory system's contribution to overall energy is in the 20-30% range.  I expect memory energy to be a hot topic in the coming years.  I'm hoping that this post can serve as a guideline for those who want to exploit the energy-cost trade-off in memory system design.

For several years, the DRAM industry focused almost exclusively on the design of high-density DRAM chips.  Seemingly, customers only cared to optimize cost per bit... or more precisely, purchase cost per bit.  In recent times, the operating cost per bit is also becoming significant.  In fact, DRAM industry optimizations to reduce purchase cost per bit often end up increasing the operating cost per bit.  The time may finally be right to start building mainstream DRAM chips that are optimized not for density, but for energy.  At least that's an argument we make in our ISCA 2010 paper, and a similar sentiment has been echoed in an IEEE Micro 2010 paper by Cooper-Balis and Jacob.  If DRAM vendors were to take this route, customers may have to pay a higher purchase price for their memory chips, but they'd end up paying less for energy and cooling costs.  The result should be a lower total cost of ownership.

However, it may take a fair bit of effort to convince memory producers and consumers that this is a worthwhile approach.  Some memory vendors have apparently seen the light and begun to tout the energy efficiency of their products: see Samsung's Green Memory and Samsung's new DDR4 product.  While the idea has often been scoffed at by memory designers who have spent years optimizing their designs for density, the hope is that economic realities will eventually set in.

A perfect analogy is the light bulb industry.  Customers are willing to pay a higher purchase price for energy-efficient light bulbs that hopefully save them operating costs, compared to the well-known and commoditized incandescent light bulb.  If the Average Joe can do the math, so can memory vendors.  Surely Micron must have noticed the connection.  After all, they make memory chips and lighting systems!! :-)

So what is the math?  Some of my collaborators have talked to DRAM insiders and, while our ideas have received some traction, we were warned: do not let your innovations add more than a 2% area overhead!!  That seems like an awfully small margin to play with.  Here's our own math.

First, if we check out various configurations for customizable Dell servers, we can quickly compute that DRAM memory sells for roughly $50/GB today.

Second, how much energy does memory consume in a year?  This varies based on several assumptions; consider the following data points.  In a 2003 IBM paper, two servers are described, one with 16 GB and another with 128 GB of DRAM main memory.  Memory contributes roughly 20% (318 W) of total power in the former and 40% (1,223 W) in the latter, which works out to roughly 20 W/GB and 10 W/GB respectively.  If we assume that this value halves every two years (I don't have a good reference for this guesstimate, but it agrees with some other data points), we get an estimate of 1.25-2.5 W/GB today.  In a 2006 talk, Laudon describes a Sun server that dissipates approximately 64 W of power for 16 GB of memory; with scaling, that amounts to roughly 1 W/GB today.  HP's server power calculator estimates that the power per GB for various configurations of a high-end 2009 785G5 server ranges between 0.68 W/GB and 4.75 W/GB.  Based on these data points, let us use 1 W/GB as a representative estimate for DRAM operational power.
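For concreteness, the scaling guesstimate above can be written as a one-liner in Python; the target years below are chosen simply to reproduce the numbers quoted in this paragraph.

    # Power per GB, assuming it halves every two years (the guesstimate above).
    def w_per_gb(base, base_year, year, halving_period=2.0):
        return base * 0.5 ** ((year - base_year) / halving_period)

    print(w_per_gb(20.0, 2003, 2009), w_per_gb(10.0, 2003, 2009))  # IBM servers: ~2.5 and ~1.25 W/GB
    print(w_per_gb(4.0, 2006, 2010))                               # Sun server (64 W / 16 GB): ~1 W/GB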

It is common to assume that power delivery inefficiencies and cooling overheads will double the energy needs.  If we also assume that energy costs $0.80 per Watt-year, we estimate that it costs $6.40 to keep one GB operational over a DIMM's four-year lifetime.  Based on this number and the $50/GB purchase price, we can estimate that if we were to halve memory energy, we would reduce total cost of ownership even if the purchase cost went up to $53.20/GB.  Given that cost increases more than linearly with chip area, this roughly translates to a chip area overhead of 5%.
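Here is the same arithmetic spelled out as a small script, using only the assumptions stated above (1 W/GB, a 2x multiplier for power delivery and cooling, $0.80 per Watt-year, a four-year lifetime, and $50/GB).

    # Break-even purchase price for a DRAM design that halves memory energy.
    DRAM_W_PER_GB   = 1.0    # representative estimate from above
    OVERHEAD        = 2.0    # power delivery + cooling roughly double the energy
    COST_PER_W_YEAR = 0.80   # dollars per Watt-year
    LIFETIME_YEARS  = 4
    PURCHASE_PER_GB = 50.0   # dollars per GB

    opex_per_gb = DRAM_W_PER_GB * OVERHEAD * COST_PER_W_YEAR * LIFETIME_YEARS
    savings     = opex_per_gb / 2               # a 2X energy reduction saves half the opex
    break_even  = PURCHASE_PER_GB + savings

    print(opex_per_gb)                          # $6.40 per GB over the DIMM lifetime
    print(break_even)                           # $53.20/GB break-even purchase price
    print(100 * savings / PURCHASE_PER_GB)      # ~6.4% allowable price increase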

This brings us to the 2X-5% challenge: if memory designers set out to reduce memory energy by 2X, they must ensure that the incurred area overhead per DRAM chip is less than 5%.  Alternatively stated, for a 2X memory energy reduction, the increase in purchase cost must stay under roughly 6%.  Not a big margin, but certainly more generous than the one we were warned about.  The margin may be even higher for some systems, or if some of the above assumptions were altered: for example, if the average DIMM lasts longer than 4 years (my Dell desktop is still going strong after 7.5 years).  This is an approximate model and we welcome refinements.

Friday, February 11, 2011

Datacenters and the Raging Energy Efficiency Battles

Datacenter efficiency is the new buzzword these days. Cloud computing (another buzzword!) essentially dictates the need for datacenter energy efficiency, and tree-hugger engineers are more than happy to embrace this shift in computer design.

There are a lot of sub-systems in a datacenter that can be re-designed or optimized for efficiency, ranging from power distribution all the way down to the individual compute cores of a chip multi-processor. In this post, I am mostly going to talk about how general-purpose microprocessor instruction set architecture (ISA) battles are currently raging at a datacenter near you.

To sum it up, Intel and ARM are getting ready for a battle. This is the kind of battle Intel has fought in the past against IBM and Sun: Reduced Instruction Set Computing (RISC) vs. Complex Instruction Set Computing (CISC). To some extent, this battle is still going on. Intel's x64 ISA is CISC, and its RISC competitors are Oracle's SPARC ISA and IBM's POWER ISA. Intel also has its own Itanium line (technically an EPIC design rather than RISC), and IBM has its mainframe z/Architecture, which is CISC. We will ignore these last two because they serve niche market segments, and will focus on the commodity server business, which is mostly a shipment-volume-driven business.

One clarification to be made here is that Intel's x64 chips are "CISC-like" rather than true CISC CPUs. Modern x64 chips have a front end that decodes the x86 ISA into RISC-like "micro-ops"; these micro-ops are then processed by the actual CPU logic, unlike the true CISC operations executed by IBM's mainframe z CPUs, for example. To clarify this distinction, any reference to x64 CPUs as CISC processors in this article actually refers to x64 CPUs with this decoder front end. The front-end decoding logic in x64 CPUs is reported to be a large consumer of resources and energy, and it is entirely possible that it is a significant contributor to the lower energy efficiency of x64 CPUs compared to true RISC CPUs.
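As a purely conceptual illustration (not actual Intel microcode), here is roughly what that cracking of a CISC, memory-destination instruction into RISC-like micro-ops looks like:

    # Conceptual only: a read-modify-write x64 instruction and the kind of
    # RISC-like micro-ops a front end might crack it into.
    cisc_instruction = "add [rdi], rax"        # CISC: ALU op with a memory destination
    micro_ops = [
        "load  tmp <- mem[rdi]",               # read the memory operand
        "add   tmp <- tmp + rax",              # the actual ALU work
        "store mem[rdi] <- tmp",               # write the result back
    ]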

ARM is the new kid on the block in the server racket, and it touts higher efficiency as its unique selling point. Historically, the arguments put forth for RISC claim that it is a more energy-efficient platform due to the lower complexity of the cores; the reasoning boils down to the fact that some of the complexity of computation is moved from the hardware to the compiler. This energy efficiency is what ARM and other RISC CPU manufacturers claim as their raison d'être.

There is currently a lot of effort going into developing ARM-based servers, especially since recent ARM designs support ECC and Physical Address Extension (PAE), allowing the cores to be "server grade". This interest is unabated even though the cores still implement a 32-bit architecture. Reports, however, claim that ARM is working on a 64-bit architecture. What is unclear at this point is how energy efficient ARM cores will be when they finally make it into servers.
The uncertainty about the efficiency of ARM-based servers stems from the fact that ARM does not manufacture its own cores. ARM is an IP company and licenses its core designs to third parties - the likes of TI, NVIDIA, etc. This gives ARM both a strategic advantage and a reason to be cautious of Intel's strategy.

The advantage of licensing IP is that ARM is not invested heavily in the fab business, which is fast becoming an extremely difficult business to be in. AMD exited/spun off its fab business as it was getting hard to compete with Intel. Intel is undoubtedly the big kahuna in the fab industry, with the most advanced process technology. This allows it to improve x64 performance by leveraging manufacturing superiority, even if x64 is inherently more energy hungry; the Atom processor is an example of this phenomenon. Intel's manufacturing superiority is therefore a reason for concern for ARM.

The datacenter energy efficiency puzzle is a complex world. From the CPU point of view, there are five primary players - AMD, IBM, Intel, NVIDIA, and Oracle. AMD is taking its traditional head-on approach against Intel, while NVIDIA is attacking the problem from the side by advancing GPU-based computing; it is reportedly building high-performance ARM-based CPUs to work in conjunction with its GPUs, evidently to improve compute efficiency through asymmetric compute capabilities.
IBM's and Oracle's strategies are two-fold. Besides their RISC offerings, they are also getting ready to peddle their "appliances" - vertically integrated systems that try to squeeze every bit of performance out of the underlying hardware. Ultimately, these appliances might rule the roost when it comes to energy efficiency, in which case one can expect HP, Cisco, and possibly Oracle and IBM to come out with x64- and/or ARM-based appliances.

It's unlikely that just one of these players will dominate the entire market. Hopefully, the entire computing ecosystem will end up more energy efficient because of the innovation driven by these competing forces. This should make all the tree-hugging engineers and tree-hugger users of the cloud happy. So watch out for the ISA of that new server in your datacenter, unless of course you are blinded by your cloud!


Saturday, February 5, 2011

Common Fallacies in NoC Papers

I am asked to review many NoC (network-on-chip) papers.  In fact, I just finished reviewing my stack of papers for NOCS 2011.  Many NoC papers continue to make assumptions that might have been fine in 2007 but are highly questionable in 2011.  My primary concern is that most NoC papers overstate the importance of the network.  This overstatement is often used to justify complex solutions, and also to justify a highly over-provisioned baseline; many papers then introduce optimizations to reduce the area/power/complexity of that over-provisioned baseline.  Both of these fallacies have resulted in many NoC papers with possibly limited shelf life.

The first misleading overstatement is this (and my own early papers have been guilty of it): "Intel's 80-core Polaris prototype attributes 28% of its power consumption to the on-chip network", and "MIT's Raw processor attributes 36% of its power to the network".  Both processors are several years old.  Modern networks probably incorporate many recent power optimizations (clock gating, low-swing wiring, etc.) and locality optimizations (discussed next).  In fact, Intel's latest many-core prototype (the 48-core Single-chip Cloud Computer) attributes only 10% of chip power to the network.  This dramatically changes my opinion on the kinds of network optimizations that I'd be willing to accept.

The second overstatement has to do with the extent of network traffic.  Almost any high-performance many-core processor will be organized as tiles.  Each tile will have one or a few cores, private L1 and L2 caches, and a slice (bank) of a large shared L3.  Many studies assume that data placement in the L3 is essentially random and that a message on the network travels halfway across the chip on average.  This is far from the truth.  The L3 will be organized as an S-NUCA, and OS-based first-touch page coloring can influence the cache bank that houses every page.  A thread will access a large amount of private data, most of which will be housed in the local bank and can be serviced without network traversal.  Even for shared data, assuming some degree of locality, data can be found relatively close by.  Further, if the many-core processor executes several independent programs or virtual machines, most requests are serviced by a small collection of nearby tiles, and long-range traversal on the network is only required when accessing a distant memory controller.  We will shortly post a characterization of network traffic for the processor platform described above: for various benchmark suites, an injection rate and a histogram of distance traveled.  This will hopefully lead to a more meaningful synthetic network input than the commonly used "uniform random".
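To see how much "uniform random" overstates distances, here is a quick Python sketch comparing average hop counts on an 8x8 mesh for uniform-random destinations versus traffic that is mostly serviced within a tile's immediate neighborhood.  The 80% locality figure and the one-tile radius are assumptions made up for illustration, not measured numbers.

    # Average hop count on an 8x8 mesh: uniform-random vs. mostly-local traffic.
    # The 80%/radius-1 locality parameters are illustrative assumptions.
    import random

    N = 8   # 8x8 mesh of tiles

    def hops(src, dst):
        return abs(src[0] - dst[0]) + abs(src[1] - dst[1])   # minimal (Manhattan) hop count

    def random_tile():
        return (random.randrange(N), random.randrange(N))

    def nearby_tile(src, radius=1):
        clamp = lambda v: min(N - 1, max(0, v))
        return (clamp(src[0] + random.randint(-radius, radius)),
                clamp(src[1] + random.randint(-radius, radius)))

    def avg_hops(local_fraction, samples=100000):
        total = 0
        for _ in range(samples):
            src = random_tile()
            dst = nearby_tile(src) if random.random() < local_fraction else random_tile()
            total += hops(src, dst)
        return total / samples

    print(avg_hops(0.0))   # uniform random: about 5.25 hops on average
    print(avg_hops(0.8))   # 80% local traffic: roughly 2 hops on average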

With the above points considered, one would very likely design a baseline network that is very different from the plain vanilla mesh.  I would expect some kind of hierarchical network: perhaps a bus at the lowest level to connect a small cluster of cores and banks, perhaps a concentrated mesh, perhaps express channels.  For those who haven't seen it, I highly recommend this thought-provoking talk by Shekhar Borkar, where he argues that buses should be the dominant component of an on-chip network.  I highly doubt the need for large numbers of virtual channels, buffers, adaptive routing, etc.  I'd go as far as to say that bufferless routing sounds like a great idea for most parts of the network.  If most traffic is localized to the lowest level of the network hierarchy because threads find most of their data nearby, there is little inter-thread interference and there is no need for QoS mechanisms within the network.

In short, I feel the NoC community needs to start with highly localized network patterns and highly skinny networks, and identify the minimum amount of additional provisioning required to handle various common and pathological cases.