Sunday, February 27, 2011

Not Lacking in Buzzwords...

Warning: This post may contain some shameless advertising for our upcoming ISCA paper. :-)

Based on a reviewer suggestion, we are changing the title of our recently accepted ISCA paper from
"Designing a Terascale Memory Node for Future Exascale Systems"
to
"Combining Memory and a Controller with Photonics through 3D-Stacking to
Enable Scalable and Energy-Efficient Systems"

Believe me, it took many iterations to find something that was descriptive, accurate, and marginally pleasant-sounding :-).  While throwing in every buzzword makes for a clunky title, such a title is essential for catching the attention of the right audience (those working on photonics, 3D, and memory systems).

I mention the following example of a bad title to my students.  Our ISCA 2001 paper was on runahead execution (a form of hardware prefetching) and appeared alongside three other papers on the same topic in the same conference.  Our non-descriptive title read: "Dynamically Allocating Processor Resources between Nearby and Distant ILP".  There is little in the title to indicate that the paper is about runahead.  As a result, our paper got left out of most subsequent runahead conversations and had minimal impact.  In terms of citations (the closest quantitative measure of impact), our paper has 70+ citations; each of the other three has 200+.  I might have felt robbed if the paper had actually been earth-shattering; in retrospect, the design was quite unimplementable (and that no doubt contributed to its low impact as well).  In my view, Onur Mutlu's subsequent thesis work put forth a far more elegant runahead execution design and made most prior work obsolete.

For those interested, here's an executive summary of our ISCA'11 work (done jointly with HP Labs).  The killer app for photonics is its high bandwidth in and out of a chip (something we can't do for long with electrical pins).  However, introducing photonic components into a cost-sensitive DRAM chip can be highly disruptive.  We take advantage of the fact that industry is possibly moving towards 3D-stacked DRAM chip packages and introduce a special interface die on the DRAM stack.  The interface die has photonic components and some memory controller functionality.  By doing this, we use photonics to break the pin barrier, but do not disrupt the manufacture of commodity DRAM chips.  For communication within the 3D stack, we compute an energy-optimal design point (exactly how much of the intra-stack communication should happen with optics and how much with electrical wiring).  It turns out that there is no need for optics to penetrate into the DRAM dies themselves.  We also define a novel protocol to schedule memory operations in a scalable manner: the on-chip memory controller does minimal book-keeping and simply reserves a speculative slot on the photonic bus for the data return.  Most scheduling minutiae are handled by the logic on the DRAM stack's interface die.
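
For the curious, here is a toy model of that split-scheduling idea.  To be clear, this is my own illustrative sketch, not the protocol from the paper: the class names, latency constants, and the speculation-failure path are all invented for exposition.  What it tries to capture is the division of labor: the on-chip controller keeps minimal state and speculatively reserves a return slot on the photonic bus, while the interface die on the DRAM stack resolves bank timing and flags any request whose data cannot make its reserved slot.

    # Toy model of the split scheduling idea (illustrative only; latencies and
    # structure are assumptions, not the paper's actual parameters).

    BUS_ROUND_TRIP = 8    # assumed photonic bus cycles: request out + data back
    STACK_ACCESS = 20     # assumed cycles for the interface die + DRAM access

    class OnChipController:
        """Minimal book-keeping: just the speculatively reserved return slots."""
        def __init__(self):
            self.reserved = {}                    # bus slot -> request id

        def issue(self, req_id, now):
            # Speculatively reserve the earliest slot the data could return on.
            slot = now + BUS_ROUND_TRIP + STACK_ACCESS
            while slot in self.reserved:          # find the next free bus slot
                slot += 1
            self.reserved[slot] = req_id
            return slot

    class StackInterface:
        """Lives on the DRAM stack's interface die; handles scheduling minutiae."""
        def __init__(self):
            self.bank_free_at = {}                # bank -> cycle it becomes free

        def schedule(self, bank, arrival, reserved_slot):
            start = max(arrival, self.bank_free_at.get(bank, 0))
            self.bank_free_at[bank] = start + STACK_ACCESS
            # If a bank conflict pushes the data past its reserved slot, the
            # speculation fails and the controller must reserve a later slot.
            return start + STACK_ACCESS <= reserved_slot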

Tuesday, February 22, 2011

Research Competitions

Almost every year, the Journal of ILP has conducted a research competition at one of our top conferences.  For example, researchers are asked to develop a branch predictor that yields the highest prediction accuracy on a secret input data set.  I think such competitions are a great idea.  If you teach an advanced architecture class, you might want to consider tailoring your class projects in that direction; an international competition can be a strong motivating factor for some students.  Besides, the competition organizers usually release a piece of simulation infrastructure that makes it easy for students to write and test their modules.  Recent competitions have covered cache replacement, data prefetching, and branch prediction.  The 3rd branch prediction championship will be held with ISCA this year.  I am slated to organize a memory scheduling championship with ISCA next year (2012).  The evaluation metric could be row-buffer hit rate, overall throughput, energy, fairness, or some combination of these.  I have a grand 12 months to plan this out, so please feel free to send in suggestions regarding metrics, simulators, and workloads that you think would work well.
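
To make one of those candidate metrics concrete, here is a minimal sketch (my own illustration, not the championship's official infrastructure) of computing row-buffer hit rate from a trace of (channel, bank, row) accesses under a simple open-page policy.

    # Row-buffer hit rate over an access trace, assuming an open-page policy.
    def row_buffer_hit_rate(trace):
        open_row = {}          # (channel, bank) -> currently open row
        hits = 0
        for channel, bank, row in trace:
            if open_row.get((channel, bank)) == row:
                hits += 1      # row already open: a row-buffer hit
            else:
                open_row[(channel, bank)] = row  # activate the new row
        return hits / len(trace) if trace else 0.0

    # Example: two hits out of four accesses to the same channel and bank.
    print(row_buffer_hit_rate([(0, 0, 5), (0, 0, 5), (0, 0, 9), (0, 0, 9)]))  # 0.5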

Saturday, February 19, 2011

PostDocs in Computer Science



The number of PostDocs in CS is surging, and there is a lot of discussion in the computing community about this trend.  If you haven't already seen the CRA PostDoc white paper based on Taulbee data, I would recommend giving it a read and also sharing your opinions on it.  Since I am pursuing a PostDoc, I wanted to discuss some of the data presented in this report and its relevance to the computer architecture research community.

The charts discussed here are all taken from the CRA report.  Figure 1 shows the number of PhDs that went into different career paths.  "Others" in this chart represents graduating PhDs who either took a career path not listed here or did not have a job by the time they graduated.  Looking at this data, 2004 appears to have been the best year for academic hiring, with roughly equal numbers of CS PhDs taking tenure-track and industry positions.  In 2009, the majority of PhDs took industry positions, largely because the number of PhD graduates had nearly doubled while academic hiring shrank due to the economic downturn.  The trend to note in this figure is that the number of PostDocs nearly doubled from 2004 to 2009.

Figure 3 further splits the number of graduating PhDs by research area.  In this chart, "Others" signifies inter-disciplinary research areas or could even imply unemployment.  Based on this data, systems/networking and AI/robotics have consistently been hot fields in terms of the number of PhD graduates.  Computer architecture is a small research community, similar in size to theory and algorithms, and its number of PhD graduates did not change much from 1998 to 2009.

Figure 4 shows the percentage of PhDs who took PostDoc positions in each sub-group over the years.  Computer architecture had the lowest percentage of PostDocs (~2%) in 2009 (that includes me).  Theory, AI/robotics, and numerical analysis/scientific computing have consistently had the highest percentages of PostDocs.  One reason we may have fewer computer architecture PostDocs is the strong semiconductor industry presence in our field, coupled with a relatively small number of PhDs available to fill those jobs.

If we look at Figure 5, which shows the percentage of PhDs taking up tenure-track positions, we are again near the bottom in 2009 (~6-7%), though we did well in 2003-2005 (~30-35%).  I have plotted the number of computer architecture PhDs and the number of tenure-track hires in our field in the chart below, based on the data that I could visually interpret from the CRA charts.  Clearly, the number of tenure-track positions in computer architecture was at its peak in 2004, when relatively few PhDs were graduating; we are now back at 1998 levels, with stiffer competition for very few slots.  It will be interesting to watch this trend in the future.

Sunday, February 13, 2011

The 2X-5% Challenge

It is well-known that much of our data storage and computation will happen within datacenters in the coming decade.  Energy efficiency in datacenters is a national priority, and the memory system is one of the significant contributors to system energy.  Most reports indicate that the memory system's contribution to overall energy is in the 20-30% range.  I expect memory energy to be a hot topic in the coming years.  I'm hoping that this post can serve as a guideline for those who want to exploit the energy-cost trade-off in memory system design.

For several years, the DRAM industry focused almost exclusively on the design of high-density DRAM chips.  Seemingly, customers only cared to optimize cost per bit... or more precisely, purchase cost per bit.  In recent times, the operating cost per bit is also becoming significant.  In fact, DRAM industry optimizations to reduce purchase cost per bit often end up increasing the operating cost per bit.  The time may finally be right to start building mainstream DRAM chips that are optimized not for density, but for energy.  At least that's an argument we make in our ISCA 2010 paper, and a similar sentiment has been echoed in an IEEE Micro 2010 paper by Cooper-Balis and Jacob.  If DRAM vendors were to take this route, customers may have to pay a higher purchase price for their memory chips, but they'd end up paying less for energy and cooling costs.  The result should be a lower total cost of ownership.

However, it may take a fair bit of effort to convince memory producers and consumers that this is a worthwhile approach.  Some memory vendors have apparently seen the light and begun to tout the energy efficiency of their products: Samsung's Green Memory and Samsung's new DDR4 product.  While such an idea has often been scoffed at by memory designers who have spent years optimizing their designs for density, the hope is that economic realities will eventually set in.

A perfect analogy is the light bulb industry.  Customers are willing to pay a higher purchase price for energy-efficient light bulbs that hopefully save them operating costs, compared to the well-known and commoditized incandescent light bulb.  If the Average Joe can do the math, so can memory vendors.  Surely Micron must have noticed the connection.  After all, they make memory chips and lighting systems!! :-)

So what is the math?  Some of my collaborators have talked to DRAM insiders, and while our ideas have received some traction, we have been warned: do not let your innovations add more than a 2% area overhead!!  That seems like an awfully small margin to play with.  Here's our own math.

First, if we check out various configurations for customizable Dell servers, we can quickly compute that DRAM memory sells for roughly $50/GB today.

Second, how much energy does memory consume in a year?  This varies based on several assumptions.  Consider the following data points.  A 2003 IBM paper describes two servers, one with 16 GB and another with 128 GB of DRAM main memory.  Memory power contributes roughly 20% (318 W) of total power in the former and 40% (1,223 W) of total power in the latter.  This leads to estimates of roughly 20 W/GB and 10 W/GB, respectively, for the two systems.  If we assume that this value halves every two years (I don't have a good reference for this guesstimate, but it seems to agree with some other data points), we get an estimate of 1.25-2.5 W/GB today.  In a 2006 talk, Laudon describes a Sun server that dissipates approximately 64 W of power for 16 GB of memory.  With scaling considered, that amounts to a power of 1 W/GB today.  HP's server power calculator estimates that the power per GB for various configurations of a high-end 2009 785G5 server system ranges between 0.68 W/GB and 4.75 W/GB.  Based on these data points, let us use 1 W/GB as a representative estimate for DRAM operational energy.
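
For transparency, here is the scaling arithmetic behind those estimates, written as a tiny script.  The number of two-year halvings applied to each data point mirrors the text above and is itself a guesstimate.

    # Scale old W/GB data points forward, halving every two years.
    def scaled_w_per_gb(watts, gigabytes, halvings):
        return (watts / gigabytes) / (2 ** halvings)

    # 2003 IBM servers, scaled forward three two-year generations:
    print(scaled_w_per_gb(318, 16, 3))     # ~2.5 W/GB
    print(scaled_w_per_gb(1223, 128, 3))   # ~1.2 W/GB
    # 2006 Sun server, scaled forward two generations:
    print(scaled_w_per_gb(64, 16, 2))      # ~1 W/GB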

It is common to assume that power delivery inefficiencies and cooling overheads will double the energy needs.  If we assume that the cost of energy is $0.80/Watt-year, we estimate that it costs $6.40 to keep one GB operational over a DIMM's four-year lifetime.  Based on this number and the $50/GB purchase price, we estimate that if memory energy were halved, total cost of ownership would still drop even if the purchase price rose to $53.20/GB.  Given that cost increases more than linearly with an increase in chip area, this roughly translates to a chip area overhead of 5%.
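
Here is the same break-even calculation written out as a script so the assumptions are easy to tweak.  Every constant below comes from the text above; none of them are measured data.

    # Break-even arithmetic for the 2X-5% challenge, using the post's assumptions.
    PURCHASE_PRICE = 50.0      # $/GB for DRAM today
    DRAM_POWER = 1.0           # representative operating power, W/GB
    OVERHEAD_FACTOR = 2.0      # power delivery + cooling double the energy
    ENERGY_COST = 0.80         # $ per Watt-year
    LIFETIME_YEARS = 4.0       # assumed DIMM lifetime

    def operating_cost(power_w_per_gb):
        return power_w_per_gb * OVERHEAD_FACTOR * ENERGY_COST * LIFETIME_YEARS

    baseline = operating_cost(DRAM_POWER)        # $6.40 per GB over four years
    efficient = operating_cost(DRAM_POWER / 2)   # $3.20 per GB if energy is halved
    savings = baseline - efficient               # $3.20
    break_even_price = PURCHASE_PRICE + savings  # $53.20/GB
    print(break_even_price, savings / PURCHASE_PRICE)  # about 6.4% price headroom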

This brings us to the 2X-5% challenge: if memory designers set out to reduce memory energy by 2X, they must ensure that the incurred area overhead per DRAM chip is less than 5%.  Alternatively stated, for a 2X memory energy reduction, the increase in purchase cost must stay under roughly 6%.  Not a big margin, but certainly more generous than we were warned about.  The margin may be even larger for some systems or under different assumptions: for example, if the average DIMM lasts longer than four years (my Dell desktop is still going strong after 7.5 years).  This is an approximate model and we welcome refinements.

Friday, February 11, 2011

Datacenters and the Raging Energy Efficiency Battles

Datacenter efficiency is the new buzzword these days. Cloud computing (another buzzword!) essentially dictates the need for datacenter energy efficiency, and tree-hugger engineers are more than happy to embrace this shift in computer design.

There are a lot of sub-systems in a datacenter that can be re-designed or optimized for efficiency, ranging from power distribution all the way down to the individual compute cores of a chip multiprocessor. In this post, I am mostly going to talk about how general-purpose microprocessor instruction set architecture (ISA) battles are currently raging at a datacenter near you.

To sum it up, Intel and ARM are getting ready for a battle. This is the kind of battle that Intel has fought in the past against IBM and Sun: Reduced Instruction Set Computing (RISC) versus Complex Instruction Set Computing (CISC). To some extent, this battle is still going on. Intel's x64 ISA is CISC, and its RISC competitors are Oracle's SPARC ISA and IBM's POWER ISA. Intel also has its Itanium line (an EPIC-style ISA rather than classic RISC), and IBM has its mainframe z/Architecture, which is CISC. We will ignore these last two because they serve niche market segments, and focus instead on the commodity server business, which is mostly a shipment-volume-driven business.

One clarification to be made here is that Intel's x64 chips are "CISC-like" rather than true CISC CPUs. Modern x64 chips have a front end that decodes x86 instructions into RISC-like "micro-ops". These micro-ops are then processed by the actual execution engine, unlike the true CISC operations executed by IBM's mainframe z CPUs, for example. To clarify this distinction, any reference to x64 CPUs as CISC processors in this article actually refers to x64 CPUs with this decoder front end. The front-end decoding logic in x64 CPUs is reported to be a large consumer of resources and energy, and it's entirely possible that it is a significant contributor to the lower energy efficiency of x64 CPUs compared to true RISC CPUs.

ARM is the new kid on the block in the server racket, and it touts higher efficiency as its unique selling point. Historically, the arguments put forth for RISC claim that it is a more energy-efficient platform due to the lower complexity of the cores; the reasoning boils down to the fact that some of the complexity of computation is moved from the hardware to the compiler. This energy efficiency is what ARM and other RISC CPU manufacturers claim as their raison d'être.

There is currently a lot of effort going into developing ARM-based servers, especially since recent ARM designs support ECC and Physical Address Extension (PAE), allowing the cores to be "server grade". This interest is unabated even though the cores still implement a 32-bit architecture; reports, however, claim that ARM is working on a 64-bit architecture. What is unclear at this point is how energy efficient ARM cores will be when they finally make it into servers.
The uncertainty about the efficiency of ARM-based servers stems from the fact that ARM does not manufacture its own cores. ARM is an IP company and licenses its core designs to third parties - the likes of TI, NVIDIA, etc. This gives ARM both a strategic advantage and a reason to be cautious of Intel's strategy.

The advantage of licensing IP is that ARM is not invested heavily in the fab business, which is fast becoming an extremely difficult business to be in. AMD exited/spun off its fab business as it was getting hard for them to compete with Intel. Intel is undoubtedly the big kahuna of the fab industry, with the most advanced technology. This allows them to improve x64 performance by leveraging manufacturing process superiority, even where x64 might be inherently energy hungry. An example of this phenomenon is the Atom processor. Intel's manufacturing technology superiority is therefore a reason for concern for ARM.

The datacenter energy efficiency puzzle is complex. From the CPU point of view, there are five primary players - AMD, IBM, Intel, NVIDIA, and Oracle. AMD is taking its traditional head-on approach against Intel, while NVIDIA is attacking the problem from the side by advancing GPU-based computing. NVIDIA is reportedly working on high-performance ARM-based CPUs to work in conjunction with its GPUs, evidently to bring higher compute efficiency through asymmetric compute capabilities.
IBM's and Oracle's strategies are twofold. Besides their RISC offerings, they are also getting ready to peddle their "appliances" - vertically integrated systems that try to squeeze every bit of performance out of the underlying hardware. Ultimately, these appliances might rule the roost when it comes to energy efficiency, in which case one can expect HP, Cisco, and possibly Oracle and IBM to come out with x64- and/or ARM-based appliances.

It's unlikely that just one of these players will be able to dominate the entire market. Hopefully, in the end, the entire computing ecosystem will be more energy efficient because of the innovation driven by these competing forces. This should make all the tree-hugging engineers and tree-hugging users of the cloud happy. So watch out for the ISA of that new server in your datacenter, unless of course you are blinded by your cloud!

Saturday, February 5, 2011

Common Fallacies in NoC Papers

I am asked to review many NoC (network-on-chip) papers.  In fact, I just got done reviewing my stack of papers for NOCS 2011.  Many NoC papers continue to make assumptions that might have been fine in 2007 but are highly questionable in 2011.  My primary concern is that most NoC papers overstate the importance of the network.  This overstatement is used to justify complex solutions, and it is also used to justify a highly over-provisioned baseline; many papers then introduce optimizations to reduce the area/power/complexity of that over-provisioned baseline.  Both of these fallacies have resulted in many NoC papers with possibly limited shelf life.

The first misleading overstatement is this (and my own early papers have been guilty of it): "Intel's 80-core Polaris prototype attributes 28% of its power consumption to the on-chip network", and "MIT's Raw processor attributes 36% of its power to the network".  Both processors are a few years old.  Modern networks probably incorporate many recent power optimizations (clock gating, low-swing wiring, etc.) and locality optimizations (discussed next).  In fact, Intel's latest many-core prototype (the 48-core Single-chip Cloud Computer) attributes only 10% of chip power to the network.  This dramatically changes my opinion on the kinds of network optimizations that I'd be willing to accept.

The second overstatement has to do with the extent of network traffic.  Almost any high-performance many-core processor will be organized as tiles.  Each tile will have one or a few cores, private L1 and L2 caches, and a slice (bank) of a large shared L3.  Many studies assume that data placement in the L3 is essentially random and that a message on the network travels half-way across the chip on average.  This is far from the truth.  The L3 will be organized as an S-NUCA, and OS-based first-touch page coloring can influence the cache bank that houses every page.  A thread will access a large amount of private data, most of which will be housed in the local bank and can be serviced without network traversal.  Even for shared data, assuming some degree of locality, data can be found relatively close by.  Further, if the many-core processor executes several independent programs or virtual machines, most requests are serviced by a small collection of nearby tiles, and long-range traversal on the network is only required when accessing a distant memory controller.  We will shortly post a characterization of network traffic for the processor platform described above: for various benchmark suites, an injection rate and a histogram of distance traveled.  This will hopefully lead to a more meaningful synthetic network input than the commonly used "uniform random".
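
To illustrate why destination locality changes the picture so dramatically, here is a toy experiment (my own sketch, not the characterization we plan to post) comparing average hop counts on an 8x8 mesh for uniform-random destinations versus an assumed traffic mix in which 90% of requests are serviced by a neighboring tile.

    # Average hop counts on an 8x8 mesh under two destination distributions.
    import random

    MESH = 8  # 8x8 tiles

    def hops(src, dst):
        return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

    def random_tile():
        return (random.randrange(MESH), random.randrange(MESH))

    def neighbor_tile(src):
        # Destination within one hop in each dimension: a nearby bank/cluster.
        return (min(MESH - 1, max(0, src[0] + random.randint(-1, 1))),
                min(MESH - 1, max(0, src[1] + random.randint(-1, 1))))

    def avg_hops(dst_fn, samples=100000):
        total = 0
        for _ in range(samples):
            src = random_tile()
            total += hops(src, dst_fn(src))
        return total / samples

    # Uniform random placement: ~5.3 hops, i.e., "half-way across the chip".
    print(avg_hops(lambda s: random_tile()))
    # 90% local requests, 10% to a distant tile (e.g., a memory controller): ~1.6 hops.
    print(avg_hops(lambda s: neighbor_tile(s) if random.random() < 0.9 else random_tile()))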

With the above points considered, one would very likely design a baseline network that is very different from the plain vanilla mesh.  I would expect some kind of hierarchical network: perhaps a bus at the lowest level to connect a small cluster of cores and banks, perhaps a concentrated mesh, perhaps express channels.  For those who haven't seen it, I highly recommend this thought-provoking talk by Shekhar Borkar, where he argues that buses should be the dominant component of an on-chip network.  I highly doubt the need for large numbers of virtual channels, buffers, adaptive routing, etc.  I'd go as far as to say that bufferless routing sounds like a great idea for most parts of the network.  If most traffic is localized to the lowest level of the network hierarchy because threads find most of their data nearby, there is little inter-thread interference and there is no need for QoS mechanisms within the network.

In short, I feel the NoC community needs to start with highly localized network patterns and highly skinny networks, and identify the minimum amount of additional provisioning required to handle various common and pathological cases.