Sunday, January 15, 2012

Waking Up to Bottleneck Realities!

NSF is organizing a workshop on Cross-Layer Power Optimization and Management (CPOM) next month.  The workshop will identify important future research areas to help guide funding agencies and the research community at large.  Below is my position statement for the workshop.  While the DRAM main memory system is a primary bottleneck in most platforms, the computer architecture community has been slow to react with innovations that look beyond the processor chip.


SPOT QUIZ:     What is an SMB?
Hint:  It consumes 14 W of power and there could be 32 of these in an 8-socket system.

FACTS:
Bottlenecks:
  • The memory system accounts for 20-40% of total system power.  Significant power is dissipated in DRAM chips, on-board buffer chips, and the memory controller.
  • Single DRAM chip power (Micron power calculator): 0.5 W.  On-board buffer chip power (Intel SMB datasheet): 14 W.  Memory controller power (Intel SCC prototype): 19-69% of chip power.  (The sketch after this list puts these figures together.)
  • Future memory systems: 3D stacks, more on-board buffering, higher channel frequencies, higher refresh overheads.
  • And ... we have an off-chip memory bandwidth problem!  Pin counts have stagnated.
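
To put those numbers in perspective, here is a back-of-the-envelope sketch in Python.  Only the per-component figures above come from datasheets and power calculators; the DIMM count, chips per DIMM, and total server power are illustrative assumptions.

    # Back-of-the-envelope memory power for a large server, using the
    # per-component figures above.  DIMM count, chips per DIMM, and total
    # server power are assumptions for illustration.
    dram_chip_w    = 0.5    # per DRAM chip (Micron power calculator)
    smb_w          = 14.0   # per on-board buffer chip (Intel SMB datasheet)
    num_smbs       = 32     # 8-socket system, 4 SMBs per socket
    num_dimms      = 64     # assumption
    chips_per_dimm = 18     # assumption: x4 chips plus ECC

    buffer_power = num_smbs * smb_w                           # 448 W
    dram_power   = num_dimms * chips_per_dimm * dram_chip_w   # 576 W
    memory_power = buffer_power + dram_power                  # ~1 kW, before the controller

    server_power = 3000.0   # assumption for an 8-socket box
    print(f"Memory: {memory_power:.0f} W, "
          f"{100 * memory_power / server_power:.0f}% of server power")

Even with these rough assumptions, the buffer chips alone dissipate 448 W, and the memory system lands squarely in the 20-40% range quoted above.
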
You Cannot Be Serious!!
(Exaggerated) State of Current Research: 
  • Number of papers on volatile memory systems:  ~1 per conference.
  • Number of papers on the processor-memory interconnect:  ~1 per year.
  • Number of papers that have to define the terms "rank" and "bank":  all of them.
  • Year of first processor paper to leverage DVFS: 2000.  Year of first memory paper to leverage DVFS: 2011.
  • Percentage of readers that have to look up the term "SMB":  > 89%.  (ok, I made up that fact :-) ... but I bet I'm right)

For every 1,000 papers written on the processor, 20 papers are written on the memory system, and 1 paper is written on the processor-memory interconnect.  This is absurd given that the processor and memory are the two fundamental elements of any computer system and memory energy can exceed processor energy.  While the routers in an NoC have been heavily optimized, the community understands very little about the off-chip memory channel.  The memory system is an obvious and fertile area for future research.

QUIZ 1:  
Most ISCA attendees know what a virtual channel is, but most would be hard-pressed to answer 2 of the following 5 basic memory channel questions:
  1. What is FB-DIMM?
  2. What is an SMI?
  3. Why are buffer chips placed between the memory controller and DRAM chips?
  4. What is SERDES and why is it important?
  5. Why do the downstream and upstream SMI channels have asymmetric widths?
QUIZ 2: 
Many ISCA attendees know the difference between PAp and GAg branch predictor configurations, but most will struggle to answer the following basic memory system questions:
  1. How many DRAM sub-arrays are activated to service one cache line request?
  2. What circuit implements the DRAM row buffer?
  3. Where is a row buffer placed?
  4. Why do DRAM chips not implement row buffer caches?
  5. What is overfetch?
  6. What is tFAW?
  7. Describe a basic algorithm to implement chipkill.  (What is chipkill?)
  8. What is scrubbing?

In early 2009 (before my foray into memory systems), I would have scored zero on both quizzes.  Such a level of ignorance is perhaps ok for esoteric topics... but unforgivable for a component that accounts for 25% of server power.

ACTION ITEMS FOR THE RESEARCH COMMUNITY:
  • Identify a priority list of bottlenecks.  Step outside our comfort zone to learn about new system components.  Increase memory system coverage in computer architecture classes.
  • Find ways to address obvious sources of energy inefficiencies in the memory system:  reduce overfetch, improve row buffer hit rates, reduce refresh power.
  • Find ways to leverage 3D stacking of memory and logic.  Exploit 3D to take our first steps in the area of DRAM chip modifications (an area that has traditionally been off-limits).
  • Understand the considerations in designing memory channels and on-board buffer chips.  Propose new channel architectures and new microarchitectures for buffer chips.
  • Understand memory controller microarchitectures and design complexity-effective memory controllers.
  • Design new architectures that integrate photonics and NVMs in the memory system.

Friday, November 25, 2011

ISCA Deadline

I'm guessing everyone's digging themselves out after the ISCA deadline.  I had promised not to talk about energy drinks and deadlines, so let's just say that our deadline week looked a bit like this :-)... without the funky math, of course.

I'm constantly hounding my students to get their writing done early.  Good writing doesn't magically show up 24 hours before the deadline.  While a Results section will undoubtedly see revisions in the last few days, there's no valid excuse for not finishing most other sections well before the deadline.

The graph below shows two examples of how our writing evolved this year... better than in previous years, but still not quite early enough!  We also labored more than usual to meet the 22-page budget... I guess we got a little spoilt after the 26-page HPCA'12 format.  I am amazed that producing a refined draft one week before the deadline is an impossible task, but producing a refined 22-page document 3 minutes before the deadline is a virtual certainty.  I suppose I can blame Parkinson's Law, not my angelic students :-), for accelerating my aging process...


Tuesday, November 8, 2011

Memory Has an Eye on Disk's Space

We recently read a couple of SOSP papers in our reading group: RAMCloud and FAWN.  These are terrific papers with significant implications for architects and application developers.  Both papers target the design of energy-efficient and low-latency datacenter platforms for a new breed of data-intensive workloads.  FAWN uses many wimpy nodes and a Flash storage system; RAMCloud replaces disk with DRAM.  While the two papers share many arguments, I'll focus the rest of this post on RAMCloud because its conclusion is more surprising.

In RAMCloud, each individual server is effectively disk-less (disks are only used for back-up and not to service application reads and writes).  All data is placed in DRAM main memory.  Each server is configured with high memory capacity and every processor has access via the network to the memory space of all servers in the RAMCloud.  It is easy to see that such a system should offer high performance because high-latency disk access (a few milliseconds) is replaced by low-latency DRAM+network access (microseconds).

An immediate architectural concern that comes to mind is cost.  DRAM has a dollar/GB purchase price that is 50-100X higher than that of disk.  A server with 10 TB of disk space costs $2K, while a server with 64 GB of DRAM and no disk costs $3K (2009 data from the FAWN paper).  It's a little trickier to compare the power consumption of DRAM and disk.  An individual access to DRAM consumes much less energy than an individual access to disk, but DRAM has a higher static energy overhead (the cost of refresh).  If the access rate is high enough, DRAM is more energy-efficient than disk.  For the same server example as above, the server with the 10 TB high-capacity disk has a power rating of 250 W, whereas the server with the 64 GB high-capacity DRAM memory has a power rating of 280 W (2009 data from the FAWN paper).  This is not quite an apples-to-apples comparison because the DRAM-bound server services many more requests at 280 W than the disk-bound server does at 250 W.  But it is clear that in terms of operating (energy) cost per GB, DRAM is again much more expensive than disk.  Note that total cost of ownership (TCO) is the sum of capital expenditure (capex) and operational expenditure (opex).  The above data points make it appear that RAMCloud incurs a huge penalty in terms of TCO.

However, at least to my initial surprise, the opposite is true for a certain large class of workloads.  Assume that an application has a fixed high data bandwidth demand and this is the key determinant of overall performance.  Each disk offers very low bandwidth because of the low rotational speed of the spindle, especially for random access.  In order to meet the high bandwidth demands of the application, you would need several disks and several of the 250 W, 10 TB servers.  If data was instead placed in DRAM (as in RAMCloud), that same high rate of data demand can be fulfilled with just a few 280 W, 64 GB servers.  The difference in data bandwidth rates for DRAM and disk is over 600X.  So even though each DRAM server in the example above is more expensive in terms of capex and opex, you'll need 600 times fewer servers with RAMCloud.  This allows overall TCO for RAMCloud to be lower than that of a traditional disk-based platform.
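
Here is a minimal sketch of that TCO argument in Python.  The capex and power numbers are the 2009 FAWN figures quoted above; the per-server bandwidth figures, the aggregate demand, and the three-year energy price are illustrative assumptions.

    # TCO sketch: servers are provisioned for bandwidth, not capacity.
    # Capex and power are the 2009 numbers quoted above; bandwidth figures,
    # demand, and energy price are assumptions for illustration.
    import math

    disk_server = {"capex": 2000.0, "watts": 250.0, "gb_per_s": 0.1}   # assumption
    dram_server = {"capex": 3000.0, "watts": 280.0, "gb_per_s": 60.0}  # ~600X disk

    demand_gb_per_s = 600.0                  # aggregate application demand (assumption)
    usd_per_watt = 3 * 8760 * 0.10 / 1000    # $0.10/kWh over three years

    for name, s in (("disk", disk_server), ("DRAM", dram_server)):
        n = math.ceil(demand_gb_per_s / s["gb_per_s"])
        tco = n * (s["capex"] + s["watts"] * usd_per_watt)
        print(f"{name}: {n} servers, 3-year TCO ~ ${tco:,.0f}")

With these (admittedly cartoonish) numbers, the disk platform needs thousands of servers while the DRAM platform needs ten, so RAMCloud wins on TCO despite losing badly on purchase price and energy per GB.
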

I really like Figure 2 in the RAMCloud CACM paper (derived from the FAWN paper and reproduced below).  It shows that in terms of TCO, for a given capacity requirement, DRAM is a compelling design point at high access rates.  In short, if data bandwidth is the bottleneck, it is cheaper to use technology (DRAM) that has high bandwidth, even if it incurs a much higher energy and purchase price per byte.

Source: RAMCloud CACM paper

If architectures or arguments like RAMCloud become popular in the coming years, it opens up a slew of interesting problems for architects:

1.  Already, the DRAM main memory system is a huge energy bottleneck.  RAMCloud amplifies the contribution of the memory system to overall datacenter energy, making memory energy efficiency a top priority.

2.  Queuing delays at the memory controller are a major concern.  With RAMCloud, a single memory controller will service many requests from many servers, increasing the importance of the memory scheduling algorithm.

3.  With each new DDR generation, fewer main memory DIMMs can be attached to a single high-speed electrical channel.  To support high memory capacity per server, innovative channel designs are required.

4.  If the Achilles' heel of disks is their low bandwidth, are there ways to design disk and server architectures that prioritize disk bandwidth/dollar over other metrics?

Monday, October 31, 2011

Dissing the Dissertation

In my view, a Ph.D. dissertation is an overrated document.  I'll formalize my position here so I don't get dragged into endless debates during every thesis defense I attend  :-).

A classic Ph.D. generally has a hypothesis that is formulated at an early stage (say, year two) and the dissertation describes 3-4 years of research that tests that hypothesis.

A modern Ph.D. (at least in computer architecture) rarely follows this classic recipe.  There are many reasons to adapt your research focus every year:  (1) After each project (paper), you realize the capabilities and limitations of your ideas;  that wisdom will often steer you in a different direction for your next project.  (2) Each project fixes some bottleneck and opens up other new bottlenecks that deserve attention.  (3) Once an initial idea has targeted the low-hanging fruit, incremental extensions to that idea are often unattractive to top program committees.  (4) If your initial thesis topic is no longer "hot", you may have to change direction to be better prepared for the upcoming job hunt.  (5) New technologies emerge every few years that change the playing field.

I'll use my own Ph.D. as an example.  My first project was on an adaptive cache structure.  After that effort, I felt that long latencies could not be avoided; perhaps latency tolerance would have more impact than latency reduction.  That led to my second effort, designing a runahead thread that could jump ahead and correctly prefetch data.  The wisdom I gathered from that project was that it was essential to have many registers so the processor could look far enough into the future.  If you had enough registers, you wouldn't need fancy runahead; hence, my third project was a design of a large two-level register file.  During that project, I realized that "clustering" was an effective way to support large processor structures at high clock speeds.  For my fourth project, I designed mechanisms to dynamically allocate cluster resources to threads.  So I had papers on caching, runahead threads, register file design, and clustering.  It was obviously difficult to weave them together into a coherent dissertation.  In fact, each project used very different simulators and workloads.  But I picked up skills in a variety of topics.  I learnt how to pick problems and encountered a diverse set of literature, challenges, and reviews.  By switching topics, I don't think I compromised on "depth"; I was using insight from one project to influence my approach in the next.  I felt I graduated with a deep understanding of how to both reduce and tolerate long latencies in a processor.

My key point is this: it is better to be adaptive and focus on high-impact topics than to flog a dead horse for the sake of a coherent dissertation.  Research is unpredictable, and a four-year research agenda often cannot be captured by a single hypothesis in year two.  The dissertation is therefore a contrived concept (or at least appears contrived today, given that the nature of research has evolved with time).  The dissertation is a poor proxy when evaluating a student's readiness to graduate.  The student's ability to tie papers into a neat package tells me nothing about his/her research skills.  Since a Ph.D. is conferred on someone who has depth/wisdom in a topic and is capable of performing independent research, the student's publication record is a better metric in that evaluation.

In every thesis proposal and defense I attend, there is a prolonged discussion on what constitutes a coherent thesis.  Students are steered in a direction that leads to a coherent thesis, not necessarily in a direction of high impact.  If one works in a field where citation counts for papers far outnumber citation counts for dissertations, I see little value in producing a polished dissertation that will never be read.  Use that time to instead produce another high-quality paper!

There are exceptions, of course.  One in every fifty dissertations has "bible" value... it ends up being the authoritative document on some topic and helps brand the student as the leading expert in that area.  For example, see Ron Ho's dissertation on wires.  If your Ph.D. work naturally lends itself to bible creation and you expect to be an ACM Doctoral Dissertation Award nominee, by all means, spend a few months to distill your insight into a coherent dissertation that will have high impact.  Else, staple your papers into a dissertation, and don't be ashamed about it!

My short wish-list:  (1) A thesis committee should focus on whether a student has produced sufficient high-quality peer-reviewed work and not worry about dissertation coherence.  (2) A dissertation can be as simple as a collection of the candidate's papers along with an introductory chapter that conveys the conclusions and insights that guided the choice of problems.

Wednesday, October 5, 2011

The right yellow gradient

Vic Gundotra, SVP of Engineering at Google, shared this inspiring story on Google+:

One Sunday morning, January 6th, 2008 I was attending religious services when my cell phone vibrated. As discreetly as possible, I checked the phone and noticed that my phone said "Caller ID unknown". I choose to ignore.

After services, as I was walking to my car with my family, I checked my cell phone messages. The message left was from Steve Jobs. "Vic, can you call me at home? I have something urgent to discuss" it said.

Before I even reached my car, I called Steve Jobs back. I was responsible for all mobile applications at Google, and in that role, had regular dealings with Steve. It was one of the perks of the job.

"Hey Steve - this is Vic", I said. "I'm sorry I didn't answer your call earlier. I was in religious services, and the caller ID said unknown, so I didn't pick up".

Steve laughed. He said, "Vic, unless the Caller ID said 'GOD', you should never pick up during services".

I laughed nervously. After all, while it was customary for Steve to call during the week upset about something, it was unusual for him to call me on Sunday and ask me to call his home. I wondered what was so important?

"So Vic, we have an urgent issue, one that I need addressed right away. I've already assigned someone from my team to help you, and I hope you can fix this tomorrow" said Steve.

"I've been looking at the Google logo on the iPhone and I'm not happy with the icon. The second O in Google doesn't have the right yellow gradient. It's just wrong and I'm going to have Greg fix it tomorrow. Is that okay with you?"

Of course this was okay with me. A few minutes later on that Sunday I received an email from Steve with the subject "Icon Ambulance". The email directed me to work with Greg Christie to fix the icon.

Since I was 11 years old and fell in love with an Apple II, I have dozens of stories to tell about Apple products. They have been a part of my life for decades. Even when I worked for 15 years for Bill Gates at Microsoft, I had a huge admiration for Steve and what Apple had produced.

But in the end, when I think about leadership, passion and attention to detail, I think back to the call I received from Steve Jobs on a Sunday morning in January. It was a lesson I'll never forget. CEOs should care about details. Even shades of yellow. On a Sunday.


If there's one thing to learn from Steve Jobs' life, it is the importance of being a perfectionist. All the time. Right down to the perfect shade of yellow. Respect, Mr. Jobs.

Tuesday, October 4, 2011

HotChips 2011 - A Delayed Trip Report



I know, it’s been a little over a month since this year’s edition of the conference concluded, but better late than never! It was my first ever trip to HotChips, and I thought I should share some things I found interesting. I’ll keep away from regurgitating stuff from the technical talks since the tech press has likely covered all of those in sufficient detail.


1. This is clearly an industry-oriented conference. There were hardly any professors around, and very few grad students. As a result, perhaps, there was almost nobody hanging out in the hallways or break areas during the actual talks; people were actually inside the auditorium!


2. It seemed like the entire architecture/circuits community from every tech shop in the Bay Area was in attendance. If you’re going to be on the job market soon, it’s a great place to network!


3. I enjoyed the keynote talk from ARM. It was interesting to think about smartphones (with ARM chips inside, of course) in the context of the developing world -- at price points in the $100-200 range, they are not a supplementary device like in the US, but the main conduit for people who normally couldn’t afford a $600 laptop to get on the internet. Definitely sounds like there’s huge potential for this form factor going forward.


Another interesting tidbit from this talk was a comparison between the energy densities of a typical smartphone battery and a bar of chocolate -- 4.5 kCal in 30g vs. 255 kCal in 49g, roughly a 35X difference in energy density! Lots of work to do to improve battery technology, obviously.


4. While on the subject of ARM, comparisons to Intel and discussions on who was “winning” were everywhere, from the panel discussion in the evening to one of the pre-defined “lunch table discussion topics”. Needless to say, there was nothing conclusive one way or the other :)


5. I finally got to touch a real, working prototype of Micron’s Hybrid Memory Cube (see Rajeev’s post here). I know, it’s of little practical value to see the die-stack, but it was cool nonetheless :) The talk was pretty interesting too -- the first, I think, to release some technical information. It appears to be a complete rethinking of DRAM, with a novel protocol, novel interconnect topologies, etc.


6. I also really enjoyed the talk from the Kinect folks. I’ve seen it in action, and it was amazing to understand the complex design behind making everything work, especially since a lot of the challenges were outside my typical areas of interest -- mechanical design, aesthetics, reliability, processing images with low lighting, varying clothes, people of different sizes... fascinating! There was also an aspect of “approachability” in the talk that I think is becoming increasingly common in the tech world -- basically *hide* the technology from the end user, making everything work “magically”. This makes people more likely to actually try these things. As a technologist, I understand the logic, but I’m not sure I like it very much -- engineers spend countless hours fixing a million corner cases, and I think there should be some way for the general public to understand at least the severity of these complexities and appreciate how hard it is to make things work! It’s like the common saying that Moore’s “Law” gives you an increasing number of transistors every year -- it doesn’t, engineers do!


7. Finally, one really awesome event was a robot on stage! It was introduced by the folks from Willow Garage -- controlled from a simple Android app. It moved to the center of the stage from the wings, gave the speaker a high-five, and showed off a few tricks. There were also videos in the talk of it playing pool, bringing beer from a fridge (brand of your choice!), and a bunch of other cool things. Very fancy :) Waiting for my own personal assistant!



Sunday, August 28, 2011

Rebuttals and Reviews

In the review process for top conferences, conventional wisdom says that rebuttals don't usually change the mind of the reviewer... they only help the authors feel better.  The latter is certainly true: after writing a rebuttal, I always feel like I have a plan to sculpt the paper into great shape.

Here's some data from a recent PC meeting about whether rebuttals help.  For the MICRO 2011 review process, reviewers entered an initial score for overall merit and a final score after reading rebuttals and other reviews.  While many factors (for example, new information at the PC meeting) may influence a difference in the two scores, I'll assume that the rebuttal is the dominant factor.  I was able to confirm this for the 16 papers that I reviewed.  Of the roughly 80 reviews for those papers, 14 had different initial and final scores (with an equal number of upward and downward corrections).  In 11 of the 14 cases, the change in score was prompted by the quality of the rebuttal.  In the other 3 cases, one of the reviewers convinced the others that the paper had merit.

For 25 randomly selected papers that were accepted (roughly 125 reviews), a total of 14 reviews had different initial and final scores.  In 10 of the 14 cases, the final score was higher than the initial score.

For 25 papers that were discussed but rejected, a total of 19 reviews had different initial and final scores.  In 14 of the 19 cases, the final score was lower than the initial score.
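
A quick tabulation of those counts, as a sketch in Python.  All counts are quoted above; the review total for the rejected set (~125) is my assumption of roughly five reviews per paper.

    # Tabulate the score-change data above.  Counts are from the post; the
    # rejected-set review total (~125) assumes ~5 reviews per paper.
    datasets = {
        "my 16 papers":       {"reviews": 80,  "changed": 14, "up": 7,  "down": 7},
        "25 accepted papers": {"reviews": 125, "changed": 14, "up": 10, "down": 4},
        "25 rejected papers": {"reviews": 125, "changed": 19, "up": 5,  "down": 14},
    }
    for name, d in datasets.items():
        pct = 100 * d["changed"] / d["reviews"]
        print(f"{name}: {pct:.0f}% of scores moved ({d['up']} up, {d['down']} down)")

Roughly one review in six or seven moves after the rebuttal, and the direction of movement tracks the paper's eventual fate.
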

It appears that rebuttals do play a non-trivial role in a paper's outcome, or at least a larger role than I would have expected.

Monday, August 22, 2011

The Debate over Shared and Private Caches

Several papers have been written recently on the topic of last level cache (LLC) management.  In a multi-core processor, the LLC can be shared by many cores or each core can maintain a private LLC.  Many of these recent papers evaluate the trade-offs in this space and propose models that lie somewhere between the two extremes.  It's becoming evident (at least to me) that it is more effective to start with a shared LLC and make it appear private as required.  In this post, I'll explain why.  I see many papers on private LLCs being rejected because they fail to pay attention to some of these arguments.

Early papers on LLC management have pointed out that in a multi-core processor, a shared LLC enjoys a higher effective cache capacity and ease of design, while private LLCs can offer lower average access latency and better isolation across programs.  It is worth noting that both models can have similar physical layouts -- a shared LLC can be banked and each bank can be placed adjacent to one core (as in the figure below).  So the true distinction between the two models is the logical policy used to govern their contents and access.

The distributed shared LLC shown in the figure resembles Tilera's tiled architecture and offers non-uniform latencies to banks.  In such a design, there are many options for data striping.  The many ways of a set-associative cache could be scattered across banks.  This is referred to as the D-NUCA model and leads to complex protocols for search and migration.  This model seems to have the least promise because of its complexity and non-stellar performance.  A more compelling design is S-NUCA, where every address and every cache set resides in a unique bank.  This makes it easy to locate data.  Most research assumes that in an S-NUCA design, consecutive sets are placed in adjacent banks, what I'll dub set-interleaved mapping.  This naturally distributes load across banks, but every LLC request is effectively serviced by a random bank and must travel half-way across the chip on average.  A third option is an S-NUCA design with what I'll call page-to-bank or page-interleaved mapping.  All the blocks in an OS physical page (a minimum-sized page) are mapped to a single bank and consecutive pages are mapped to adjacent banks.  This is the most compelling design because it is easy to implement (no need for complex search) and paves the way for OS-managed data placement optimizations for locality.
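
To make the two S-NUCA mappings concrete, here is a minimal sketch in Python; the bank count, block size, and page size are illustrative assumptions.

    # Minimal sketch of S-NUCA address-to-bank mappings.  Bank count, block
    # size, and page size are assumptions for illustration.
    NUM_BANKS  = 16
    BLOCK_BITS = 6     # 64-byte cache blocks
    PAGE_BITS  = 12    # 4 KB minimum-sized OS pages

    def set_interleaved_bank(paddr: int) -> int:
        # Consecutive sets map to adjacent banks: accesses scatter across
        # the chip at cache-block granularity.
        return (paddr >> BLOCK_BITS) % NUM_BANKS

    def page_interleaved_bank(paddr: int) -> int:
        # Every block of a physical page lands in one bank; consecutive
        # pages map to adjacent banks.  The OS can now steer a page to a
        # bank by choosing its physical frame.
        return (paddr >> PAGE_BITS) % NUM_BANKS

Two blocks in the same page hit different banks under set interleaving but the same bank under page interleaving, which is exactly what gives the OS a placement lever.
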

Recent work has taken simple S-NUCA designs and added locality optimizations so that an average LLC request need not travel far on the chip.  My favorite solution to date is R-NUCA (ISCA'09), which has a simple policy to classify pages as private or shared, and ensures that private pages are always mapped to the local bank.  The work of Cho and Jin (MICRO'06) relies on first-touch page coloring to place a page near its first requestor; Awasthi et al. (HPCA'09) augment that solution with load balancing and page migration.  Victim Replication (ISCA'05) is able to do selective data replication in even a set-interleaved S-NUCA cache without complicating the coherence protocol.  In spite of these solutions, the problem of shared LLC data management is far from solved.  Multi-threaded workloads are tougher to handle than multi-programmed workloads; existing solutions do a poor job of handling pages shared by many cores; they are not effective when the first requestor of a page is not the dominant accessor (since page migration is expensive); task migration renders most locality optimizations invalid.
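
As a concrete flavor of these OS-driven policies, here is a minimal sketch of first-touch placement in the spirit of Cho and Jin; the free-list organization and the one-bank-per-core assumption are mine, purely for illustration.

    # Sketch of first-touch page coloring: at the first access (page fault),
    # pick a physical frame whose page-interleaved bank is local to the
    # requesting core.  Data structures are assumptions for illustration.
    NUM_BANKS  = 16
    NUM_FRAMES = 1024

    # With page-interleaved mapping, frame f belongs to bank (f % NUM_BANKS).
    free_frames = {b: [f for f in range(NUM_FRAMES) if f % NUM_BANKS == b]
                   for b in range(NUM_BANKS)}
    page_table = {}   # virtual page number -> physical frame number

    def first_touch_allocate(vpage: int, core_id: int) -> int:
        bank = core_id % NUM_BANKS   # assumption: one LLC bank per core/tile
        if not free_frames[bank]:
            # Local bank exhausted: fall back to any bank with free frames.
            # This is where load-balancing and migration policies would act.
            bank = next(b for b in range(NUM_BANKS) if free_frames[b])
        frame = free_frames[bank].pop()
        page_table[vpage] = frame
        return frame

The fallback path is where the load-balancing ideas of Awasthi et al. would naturally slot in.
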

The key point here is this: while there is room for improvement, a shared LLC with some of the above optimizations can offer low access latencies, high effective capacity, low implementation complexity, and better isolation across programs.  Yes, these solutions sometimes involve the OS, but in very simple ways that have commercial precedents (for example, NUMA-aware page placement in the SGI Origin).

On the other hand, one could start with an organization where each core has its own private LLC.  Prior work (such as cooperative caching) has added innovations so that the private LLCs can cooperate to provide a higher effective cache capacity.  In cooperative caching, when a block is evicted out of a private cache, it is given a chance to reside in another core's private cache.  However, all private LLC organizations require complex mechanisms to locate data that may be resident in a remote bank.  Typically, this is done with an on-chip directory that tracks the contents of every cache.  This directory introduces non-trivial complexity.  Some commercial processors do offer private caches; data search is manageable in these processors because there are few caches and bus-based broadcast is possible.  Private LLCs are less attractive as the number of cores scales up.

Based on the above arguments, I feel that shared LLCs have greater promise.  For a private LLC organization to be compelling, it must answer two questions: (i) How is the complexity of data search overcome?  (ii) Can it outperform a shared LLC with page-to-bank S-NUCA mapping and a simple locality optimization (either R-NUCA or first-touch page coloring)?

Tuesday, June 28, 2011

A Killer App for 3D Chip Stacks?

I had previously posted a brief summary of our ISCA 2011 paper.  The basic idea was to design a 3D stack of memory chips and one special interface chip, connected with TSVs (through-silicon vias).  We argue that the interface chip should have photonic components and memory scheduling logic.  The use of such a stack enables game-changing optimizations (photonic access, scalable scheduling) without disrupting the manufacture of commodity memory chips.



When we presented our work at Micron recently (see related post by Ani Udipi), we were told that Micron has a full silicon prototype that incorporates some of these concepts.  We were very excited to hear that a 3D stacked memory/logic organization similar to our proposal is implementable and could become reality in the near future.  Slide 17 of the report from the Micron Winter Analyst Conference, February 2011, describes Micron's Hybrid Memory Cube (HMC; figure reproduced below).  The HMC has a Micron-designed logic controller at the bottom of the stack (what we dub the interface chip in our work) and it is connected to multiple DRAM chips with TSVs.  Details of what is on the logic controller have not been released yet.  Micron is partnering with Open-Silicon to take advantage of the opportunities made possible by the HMC.


This is an exciting development and I expect that one could discover many creative ways to put useful functionality on the interface chip.  Prior work has proposed several ways to add functionality to DRAM chips: row buffer caches, processing in memory, error correction features, photonic components, etc.  Many of these ideas were unimplementable because of their impact on cost, but they may be worth attempting in the context of a 3D memory stack.  This is also an opportunity to add value to memory products, an especially important consideration as density growth flattens or as we move to new memory technologies that have varying maintenance needs.

While most prior academic work on 3D architecture has been processor-centric, the potential benefit of memory-centric 3D stacking is relatively unexplored.  This is in spite of the fact that memory companies have embraced 3D stacking much more than processor companies.  3D memory chip stacks are currently manufactured in various forms by Tezzaron, Samsung, and Elpida, among others.  The concept of building a single chip and then reusing it within 3D chip stacks to create multiple different products has been proposed previously for processors (papers from UCSB and Utah).  Given the economic impact of this concept, the cost-sensitive memory industry stands to gain more from it.  Memory companies operate at very small margins.  They therefore strive to optimize cost-per-bit and almost exclusively design for high volumes.  They are averse to adding any feature that will increase cost for millions of chips but will only be used by a small market segment.  But with 3D chip stacks, the same high-volume commodity DRAM chip can be bonded to different interface chips to create different products for each market segment.  This may well emerge as the most compelling application of 3D stacking within the high performance domain.

Sunday, June 19, 2011

Father's Day

Just spent a great Father's Day watching the US Open with my 3-year-old son and week-old daughter in my arms.  On most Father's Days, I'm fumbling around airports and hotels, missing my family and trying to get a peek at US Open scores on my way to ISCA.  Here's hoping that future ISCA organizers make it a high priority to not schedule ISCA on Father's Day...