I've talked with a number of faculty candidates this past season. I have come to realize that people who work in non-architecture systems areas haven't really kept track of the take-home messages from recent research in hardware transactional memory (HTM). Or, perhaps, the messages have been obfuscated by other signals (STM results or industry trends). Here's my unbiased view on HTM design options; unbiased because I have merely dabbled in TM research and don't have a horse in this race :-). Harris, Larus, and Rajwar's TM book is an excellent read if you want a more detailed account.
For those new to TM or HTM, I'll first provide a quick summary. With numerous cores available today, hardware is used effectively only if software is efficiently multi-threaded. It is imperative that architects come up with innovations that make it easier for programmers to write multi-threaded code. HTM is one such innovation. Programmers simply identify regions of code (transactions) that should be executed atomically. The underlying hardware looks for parallelism, but ensures that the end result is as if each transaction executed atomically and in isolation. HTM can therefore provide high performance with low programming effort. With a few exceptions (usually contrived), transactional code is also deadlock-free. Transactional semantics are an improvement over almost any known multi-threaded programming model, although HTM only makes the programming process a little bit easier and is by no means a panacea.
The Wisconsin LogTM paper introduced a frequently used taxonomy to describe HTM implementations. They showed that one could use either lazy or eager conflict detection and either lazy or eager version management. When explaining HTM to students, I have always found it easier and more intuitive to start with an implementation that uses lazy versioning and lazy conflict detection, such as Stanford's TCC design. In TCC, every transaction works on locally cached copies of data and writes are not immediately propagated to other caches. At the end of its execution, a transaction must first acquire permissions to commit; it then makes all its writes visible through the conventional cache coherence protocol. If another in-progress transaction has touched any of these written values (a conflict), that other transaction aborts and restarts. Each transaction executes optimistically and conflicts are detected when one of the transactions is trying to commit (lazy conflict detection). Upon cache overflow, the new results computed by a transaction are stored in a temporary log in main memory; when the transaction commits, the overflowed results must be copied from the log into their permanent homes (lazy versioning). Bloom filters can be used to track the contents of the log and efficiently detect conflicts.
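To make the lazy/lazy flow concrete, here is a minimal Python sketch of commit-time conflict detection in the spirit of TCC. It is a toy model, not the actual protocol: the Transaction class, the buffered write set, and the commit-time abort of conflicting peers are simplified stand-ins for what caches and the coherence protocol do in hardware.

```python
# Toy model of lazy versioning + lazy conflict detection (TCC-style).
# Writes are buffered locally; conflicts are checked only at commit time.
memory = {"A": 0, "B": 0}
live_transactions = []          # transactions currently in flight

class Transaction:
    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_buffer = {}  # lazy versioning: new values stay local
        self.aborted = False
        live_transactions.append(self)

    def read(self, addr):
        self.read_set.add(addr)
        # read our own buffered value if we wrote it, else main memory
        return self.write_buffer.get(addr, memory[addr])

    def write(self, addr, value):
        self.write_buffer[addr] = value   # not visible to others yet

    def commit(self):
        # lazy conflict detection: any other in-flight transaction that
        # touched an address we wrote must abort and restart
        for t in live_transactions:
            if t is not self and not t.aborted:
                if (t.read_set | set(t.write_buffer)) & set(self.write_buffer):
                    t.aborted = True
        memory.update(self.write_buffer)  # writes become globally visible
        live_transactions.remove(self)

t1, t2 = Transaction("t1"), Transaction("t2")
t1.write("A", t1.read("A") + 1)
t2.write("A", t2.read("A") + 1)
t1.commit()                          # t1 commits first and wins
print(memory["A"], t2.aborted)       # -> 1 True (t2 must re-execute)
```

Note that the loser's work is discovered to be wasted only when the winner commits, which is exactly the behavior the eager design below avoids.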
While the TCC design is easier to understand, I feel it has a few drawbacks (explained shortly). In my view, an HTM design that uses eager versioning and eager conflict detection, such as Wisconsin's LogTM design, does a better job optimizing for the common case. In LogTM, each transaction proceeds pessimistically, expecting a conflict at every step (eager conflict detection). Writes are immediately made visible to the rest of the system. The underlying cache coherence protocol detects conflicting accesses by in-progress transactions. When a conflict is detected, one of the transactions is forced to wait for the other to finish. Deadlock scenarios are easily detected and handled by using transaction timestamps. Upon cache overflow, the old result is stored in a log and the new result is immediately placed in its future permanent home (eager versioning). Some provisions are required so we can continue to detect conflicts on overflowed data (called sticky pointers; now you know why this is a little harder to explain to TM newbies :-).
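For contrast, here is a similarly simplified sketch of eager versioning and eager conflict detection in the spirit of LogTM. The undo log, the per-address writer table, and the timestamp-based back-off are illustrative approximations of what the coherence protocol, NACKs, and sticky pointers accomplish in hardware.

```python
# Toy model of eager versioning + eager conflict detection (LogTM-style).
# Writes go straight to memory after logging the old value; conflicts are
# detected at access time, and the younger transaction backs off.
memory = {"A": 0, "B": 0}
writers = {}                 # addr -> transaction currently writing it

class ConflictStall(Exception):
    """Raised when the requester must wait for an older transaction."""

class Transaction:
    _clock = 0
    def __init__(self, name):
        Transaction._clock += 1
        self.name, self.timestamp = name, Transaction._clock
        self.undo_log = []   # (addr, old_value) pairs for rollback

    def write(self, addr, value):
        owner = writers.get(addr)
        if owner is not None and owner is not self:
            # eager conflict detection: timestamps break potential deadlocks,
            # and the younger transaction waits or aborts
            if owner.timestamp < self.timestamp:
                raise ConflictStall(f"{self.name} waits for {owner.name}")
            owner.abort()
        self.undo_log.append((addr, memory[addr]))  # eager versioning
        memory[addr] = value                        # new value in place
        writers[addr] = self

    def commit(self):
        # commit is cheap: just discard the undo log
        self.undo_log.clear()
        for addr in [a for a, t in writers.items() if t is self]:
            del writers[addr]

    def abort(self):
        # abort is the slow path: copy old values back from the log
        for addr, old in reversed(self.undo_log):
            memory[addr] = old
            writers.pop(addr, None)
        self.undo_log.clear()

t1, t2 = Transaction("t1"), Transaction("t2")
t1.write("A", 1)
try:
    t2.write("A", 2)         # conflict: t2 is younger, so it stalls
except ConflictStall as e:
    print(e)                 # "t2 waits for t1"
t1.commit()
print(memory["A"])           # -> 1
```

Here commit amounts to discarding the undo log, while abort must copy old values back, which is why the next paragraph argues that this design optimizes the common case.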
I believe that EE-HTM (eager conflict detection, eager versioning) is more efficient than LL-HTM (lazy conflict detection, lazy versioning) because the former makes a number of design choices that favor the common case. First, commits are slow in LL-HTM because they involve making all transactional writes visible to the rest of the system and copying log values to their permanent homes; commit is the frequent operation (every transaction must commit). Second, EE-HTM does make an abort slower because it involves copying old values back from the log, but aborts are supposed to be uncommon. Aborts are even rarer in EE-HTM because they are only initiated in a potential deadlock situation. While LL-HTM charges ahead with transaction execution and triggers an abort on every conflict, EE-HTM is more deliberate in its progress: if there is a conflict, a transaction is made to stall and proceed after the conflicting situation is resolved. This leads to less wasted work and lower overall energy dissipation.
HTM is easy to implement in hardware. All it takes is: (i) a register checkpoint so we can roll back register state on a transaction abort, (ii) two bits per cache line to keep track of whether a transaction has read or written that line, (iii) some new logic in the cache coherence controller to undertake transactional operations, such as examining the above bits and issuing aborts or NACKs, and (iv) system and hardware support (in the form of hardware signatures) to handle overflows and commit permissions. So why has industry been so reluctant to embrace HTM (with the exception of Sun's Rock and AMD's recent ASF extensions)? My sense is that there is great concern about various corner cases and how they can be handled efficiently. In my view, most known corner cases have workable solutions and it is pointless to try to optimize infrequent scenarios. For example, a number of ugly scenarios can be averted if the system has a single "golden token" that allows its owner to successfully commit its ongoing transaction. This is a graceful way to handle cache overflows, I/O operations, nesting overflows, starvation, signature overflows, etc. In essence, if our best efforts at transactional parallelism fail, simply revert to single-thread execution for a while. The impact on performance should be minimal for most workloads and this is a penalty we should be willing to pay to enjoy the programmability benefits of TM for other workloads. I am less enamored of the approach that falls back on STM when such bad events happen; STM opens up a new can of worms.
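As an aside on item (iv), a hardware signature is essentially a Bloom filter over the cache-line addresses a transaction has read or written: set bits are never cleared during the transaction, so tests can report false positives (spurious conflicts) but never miss a real conflict. The sketch below is a software approximation under that assumption; real designs fix the hash functions and bit-vector width in hardware, and the class name and parameters here are illustrative.

```python
# Minimal Bloom-filter "signature" over a transaction's read or write set.
# A set bit is never cleared, so membership tests may yield false positives
# (spurious conflicts) but never false negatives (missed conflicts).
import hashlib

class Signature:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.vector = 0                       # the bit vector, as an int

    def _positions(self, addr):
        # derive `hashes` bit positions from the address
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "little") % self.bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.vector |= 1 << pos

    def might_contain(self, addr):
        # True means "possible conflict"; False is definitive
        return all(self.vector & (1 << pos) for pos in self._positions(addr))

write_sig = Signature()
write_sig.insert(0x1000)                  # transaction wrote line 0x1000
print(write_sig.might_contain(0x1000))    # True  -> must treat as conflict
print(write_sig.might_contain(0x2000))    # almost certainly False
```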
HTM implementations are efficient enough that there are few known bottlenecks. The classic TM pathologies paper from Wisconsin in ISCA 2007 describes possible HTM performance bottlenecks, and contributions by that paper and others have pretty much addressed most known pathologies. One of my Masters students, Byong Wu Chong, spent a fair bit of time characterizing STAMP and other transactional benchmarks. He found that once a few pathological cases are optimized, very little of the overall execution time can be attributed to transactional phenomena. Commit is typically not a big bottleneck for LL-HTM, and deadlock-driven aborts are uncommon in EE-HTM.
In summary, HTM is relatively easy to implement, especially if we take a correctness-centric view to corner cases. There are few glaring performance deficiencies caused by HTM features. Quantitatively, I expect there is little difference between optimized versions of EE-HTM and LL-HTM for most workloads. I have a personal preference for the former because it reduces wasted work and optimizes the common case. For newbies, the latter is an easier introduction to HTM.
Sunday, May 1, 2011
Sunday, April 24, 2011
Dude, where are my PCM parts going?
All of us have heard the not-so-new buzz surrounding the slew of newer non-volatile memories (NVMs) that have either hit the market or are expected to do so soon. The noteworthy ones among these are phase-change memory (PCM) and an advancement of magnetoresistive random access memory (MRAM) called spin-transfer torque RAM (STT-RAM). PCM parts have started to trickle out as commercial products, while the viability of STT-RAM parts is still being demonstrated. Judging from the excitement about STT-RAM at the Non-Volatile Memories Workshop this year, though, I am sure we will be seeing more and more of it in the coming years.
Recently, the architecture community has been abuzz with proposals on applications of PCM at different levels in the memory and storage hierarchy. But, the buzz notwithstanding, let's take a quick look at the PCM products that are presently available. Numonyx has 128 Mb serial and parallel parts on the market. While there has been talk of Samsung shipping a 512 Mb part, the part information is unavailable, as far as I know. It does seem, though, that Samsung is "shipping a little bit" of PCM parts. Where, to whom, and what for is still not very clear. But there have been reports of PCM parts actually being used in consumer electronics. From the current state of things, it seems like the consumer electronics industry (cameras, cell phones, and the like) is intent on using PCM as a plug-and-play replacement for Flash. Assuming that manufacturers come out with high-density, low-latency PCM parts, how are they best used in desktops and server-side machines?
Industry analysts believe that PCM is a good contender to replace Flash in the short term. Taking that thought a little further, IBM believes that PCM will have a big role to play in the server-side storage hierarchy, but again, as a replacement for Flash SSDs. I tend to agree with them, at least in the short run. Many research papers advocate that PCM will eventually serve as a DRAM replacement. Thus, there is little consensus on how and where PCM parts may fit in the memory/storage hierarchy.
From what I have learned so far, there are four main issues that will have to be handled before PCM is ready to be a main memory replacement: (i) endurance has to reach somewhere around that of DRAM, (ii) write energy and latency have to be reduced substantially, (iii) higher levels of the hierarchy have to be more effective at filtering out writes to the PCM level, and (iv) the recently discovered problem of resistance drift with multi-level cells has to be addressed. More recently, there have been concerns about the scaling of PCM devices because of increasing current densities and electromigration (Part 1, Part 2, Part 3). All of this, of course, assumes that the issues associated with wear-leveling have been worked out.
Apart from the issues with the devices themselves, has there been a general reluctance to shift to PCM? To quote Samsung's top honcho on the semiconductor side of the business: "systems guys have been very reluctant to adopt the technology in mass quantities". The quote conveys very little discernible information, but, as a first guess, it might mean that the system software stack has to be extensively reworked to make PCM a reality.
There are many views on the future of PCM. There appears to be a slight disconnect between industrial trends and academic thrusts. The former seems to focus on using PCM as a permanent storage device, while the latter is attempting to move PCM higher up the hierarchy. A case can be made that more academic effort must be invested in developing PCM as a viable storage class memory.
Saturday, March 19, 2011
Grad School Ranking Methodology
We are in the midst of the graduate admissions process and faculty candidate interviews. A large part of a candidate's decision is often based on a university's rank in computer science. This is especially true for prospective graduate students. Unfortunately, the university rankings commonly in use are flawed and inappropriate.
The US News rankings seem to be the most visible and hence most popular. However, they are a blatant popularity contest. They are based on an average of 1-5 ratings given by each CS department chair for all other CS departments. While I'm sure that some department chairs take their rating surveys very seriously, there are likely many that simply rely on past history (a vicious cycle where a department's reputation rests on its previous lofty rank), recent news, or reputation within the department chair's own research area. In spite of several positive changes within our department, our reputation score hasn't budged in recent years and our ranking tends to randomly fluctuate in the 30s and 40s. But no, this post is not about the impact of these rankings on our department; it is about what I feel is the right way for a prospective grad student to rank schools.
The NRC released a CS ranking way back in 1993. After several years of work, they released a new ranking a few months back. However, this seems to be based on a complex equation and there have been several concerns about how data has been collected. We have ourselves seen several incorrect data points. The CRA has a statement on this.
Even if the NRC gets their act together, I strongly feel that prospective grad students should not be closely looking at overall CS rankings. They should instead focus on a ranking for their specific research area. Such a ranking is only one of several factors worth considering, although, in my view, it should be a dominant factor.
What should such a research area ranking take into account? It is important to consider faculty count and funding levels in one's research area. But primarily, one should consider the end result: research productivity. This is measured most reliably by counting recent publications by a department in top-tier venues. Ideally, one should measure impact and not engage in bean-counting. But by focusing the bean-counting on top-tier venues, impact and quality can be strongly factored in. This is perhaps the closest we can get to an objective measure of impact. Ultimately, when a student graduates with a Ph.D., his/her subsequent job depends most strongly on his/her publication record. By measuring the top-tier publications produced by a research group, students can measure their own likely productivity if they were to join that group.
I can imagine a few reasonable variations. If we want to be more selective about quality and impact, we could measure best-paper awards or IEEE Micro Top Picks papers. However, those counts are small enough that they are likely not statistically significant in many cases. One could also derive a stronger measure of impact by looking at citation counts for papers at top-tier venues. Again, for recent papers, these citation counts tend to be noisy and often dominated by self-citations. Besides, it's more work for whoever is computing the rankings :-). It may also be reasonable to divide the pub-count by the number of faculty, but counting the number of faculty in an area can sometimes be tricky. Besides, I feel a department should get credit for having more faculty in a given area; a grad student should want to join a department where he/she has many options and a large peer group. One could make an argument for different measurement windows -- a short window adapts quickly to new faculty hires, faculty departures, etc. The window also needs to be long enough to absorb noise from sabbaticals, funding cycles, infrastructure building efforts, etc. Perhaps, five years (the average length of a Ph.D.) is a sweet spot.
So here is my proposed ranking metric for computer architecture: number of papers by an institution at ISCA, MICRO, ASPLOS, HPCA in the last five years.
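For what it's worth, the tally itself is trivial once the paper metadata is collected; the sketch below assumes a made-up list of (venue, year, institutions) records rather than any real dataset.

```python
# Tally top-tier papers per institution over a five-year window.
# The `papers` list is a placeholder; a real ranking would be built
# from DBLP or conference proceedings metadata.
from collections import Counter

TOP_TIER = {"ISCA", "MICRO", "ASPLOS", "HPCA"}
WINDOW = range(2006, 2011)       # 2006-2010 inclusive

papers = [
    ("ISCA", 2008, ["Univ A", "Univ B"]),
    ("HPCA", 2010, ["Univ A"]),
    ("MICRO", 2005, ["Univ C"]),   # outside the window, ignored
]

counts = Counter()
for venue, year, institutions in papers:
    if venue in TOP_TIER and year in WINDOW:
        # an institution gets credit if any author is affiliated with it,
        # so count each institution on a paper once
        counts.update(set(institutions))

for rank, (inst, n) in enumerate(sorted(counts.items(),
                                        key=lambda kv: -kv[1]), start=1):
    print(rank, inst, n)
```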
Indeed, I have computed the computer architecture ranking for 2010. This is based on top-tier papers in the 2006-2010 time-frame for all academic institutions world-wide. I have not differentiated between CS and ECE departments. An institution gets credit even if a single author on a paper is from that institution. If you are considering joining grad school for computer architecture research (or are simply curious), email me for a link to this ranking. I have decided not to link the ranking here because I feel that only prospective grads need to see it. A public ranking might only foster a sense of competition that is likely not healthy for the community.
If you are shy about emailing me (why?! :-) or are curious about where your institution may stand in such a ranking, here's some data: All top-5 schools had 44+ papers in 5 years; 20+ papers equates to a top-10 rank; 15+ papers = top-15; 12+ papers = top-20; 9+ papers = top-25. There were 44 institutions with 4+ papers and 69 institutions with 2+ papers.
Please chime in with comments and suggestions. I'd be happy to use a better metric for the 2011 ranking.
Friday, March 11, 2011
A Case for Archer
Disclaimer: The opinions expressed in this post are my own, based on personal experiences.
Almost any computer architecture researcher realizes the importance of simulators to the community. Architectural level simulators allow us to model and evaluate new proposals in a timely fashion, without having to go through the pain of fabricating and testing a chip.
So, what makes for a good simulator? I started grad school in the good old days when SimpleScalar was the norm. It had a detailed, cycle-accurate pipeline model and, most importantly, it was pretty fast. Once you got over the learning curve, additions to the model were fairly easy. It did have a few drawbacks. There was little support for simulating CMP architectures, there was no coherence protocol in place, and a detailed DRAM model was missing. Moreover, interference of the operating system with an application's behavior was not considered. But these issues were less important back then.
Soon, it was apparent that CMPs (chip multi-processors) were here to stay. The face of applications changed as well. Web 2.0-based, data-intensive applications came into existence. To support these, a number of server-side applications needed to run in backend datacenters. This made the performance of multi-threaded applications, and of main memory, paramount.
To keep up with these requirements, the focus of the architecture community changed as well. Gone were the days of trying to optimize the pipeline and extract ILP. TLP, MLP, and memory system performance (caches, DRAM) became important. One also could no longer ignore the importance of interference from the operating system when making design decisions. The community was now on the lookout for a simulation platform that could take all of these factors into account.
It was around this time that a number of full-system simulators came into being. Off the top of my head, I can count several that are popular with the community and have a fairly large user base: Wind River's Simics, M5, Zesto, Simflex, and SESC, to name a few. For community-wide adoption, a simulator platform needed to be fast, have a modular code base, and have a good support system (being cycle-accurate was a given). Simics was one of the first platforms I tried personally, and I found their support to be extremely responsive, which also garnered large participation from the academic community. Also, with the release of the GEMS framework from Wisconsin, I didn't need a reason to look any further.
In spite of all the options out there, getting the infrastructure (simulator, benchmark binaries, workloads, and checkpoints) in place is a time-consuming and arduous process. As a result, groups seem to have gravitated towards simulators that best suited their needs in terms of features and ease of use. The large number of options today also implies that different proposals on the same topic inevitably use different simulation platforms. As a result, it is often difficult to compare results across papers and exactly reproduce results of prior work. This was not as significant a problem before, when nearly everyone used SimpleScalar.
In some sub-areas, it is common practice to compare an innovation against other state-of-the-art innovations (cache policies, DRAM scheduling, etc.). Faithfully modeling the innovations of others can be especially troublesome for new grad students learning the ropes of the process. I believe a large part of this effort can be reduced if these models (and by model I mean code :-) were already publicly available as part of a common simulator framework.
The Archer project, as some of you might know, is a recent effort in the direction of collaborative research in computer architecture. From the project's website, they strive for a noble goal:
"To thoroughly evaluate a new computer architecture idea, researchers and students need access to high-performance computers, simulation tools, benchmarks, and datasets - which are not often readily available to many in the community. Archer is a project funded by the National Science Foundation CRI program to address this need by providing researchers, educators and students with a cyberinfrastructure where users in the community benefit from sharing hardware, software, tools, data, documentation, and educational material."
In its current format, Archer provides a large pool of batch-scheduled computing resources, a set of commonly used tools and simulators, and some benchmarks and datasets. It also has support for sharing files via NFS and a wiki-based infrastructure to aggregate shared knowledge and experiences.
If widely adopted, this will provide a solution to many of the issues I listed above. It can help push the academic community towards a common infrastructure, while at the same time reducing the effort to set up simulation infrastructure and to reproduce prior work.
Although Archer is a great initial step, I believe it can still be improved upon. If I had my wishes, I would like a SourceForge-like platform, where the model for a particular optimization is owned by a group of people (say, the authors of the research paper) and is available to be checked out from a version-control system. Anyone whose use of the Archer platform results in a peer-reviewed publication would be obliged to release their model publicly under a GPL-style license. Bug reports would be sent to the owners, who would in turn release revised versions of the model(s).
In recent years, collaborative research efforts in computer science have been very successful. Emulab is a widely used resource by the networking community. The TCS community too has been involved in successful collaborative research, e.g. the polymath project and the recent collaborative review of Deolalikar's paper. I believe that there is certainly room for larger collaborative efforts within the computer architecture community.
Sunday, February 27, 2011
Not Lacking in Buzzwords...
Warning: This post may contain some shameless advertising for our upcoming ISCA paper. :-)
Based on a reviewer suggestion, we are changing the title of our recently accepted ISCA paper from
"Designing a Terascale Memory Node for Future Exascale Systems"
to
"Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and Energy-Efficient Systems"
Believe me, it took many iterations to find something that was descriptive, accurate, and marginally pleasant-sounding :-). While throwing in every buzzword makes for a clunky title, such titles are essential for flagging the attention of the right audience (those working on photonics, 3D, and memory systems).
I mention the following example of a bad title to my students. Our ISCA 2001 paper was on runahead execution (a form of hardware prefetching) and appeared alongside three other papers on the same topic in the same conference. Our non-descriptive title said: "Dynamically Allocating Processor Resources between Nearby and Distant ILP". There's little in the title to indicate that it is about runahead. As a result, our paper got left out of most subsequent runahead conversations and had minimal impact. In terms of citations (the closest quantitative measure of impact), our paper has 70+ citations; each of the other three has 200+. I might have felt robbed if the paper had actually been earth-shattering; in retrospect, the design was quite unimplementable (and that no doubt contributed to its low impact). In my view, Onur Mutlu's subsequent thesis work put forth a far more elegant runahead execution design and made most prior work obsolete.
For those interested, here's an executive summary of our ISCA'11 work (done jointly with HP Labs). The killer app for photonics is its high bandwidth in and out of a chip (something we can't do for long with electrical pins). However, introducing photonic components into a cost-sensitive DRAM chip can be highly disruptive. We take advantage of the fact that industry is possibly moving towards 3D-stacked DRAM chip packages and introduce a special interface die on the DRAM stack. The interface die has photonic components and some memory controller functionality. By doing this, we use photonics to break the pin barrier, but do not disrupt the manufacture of commodity DRAM chips. For communication within the 3D stack, we compute an energy-optimal design point (exactly how much of the intra-stack communication should happen with optics and how much with electrical wiring). It turns out that there is no need for optics to penetrate into the DRAM dies themselves. We also define a novel protocol to schedule memory operations in a scalable manner: the on-chip memory controller does minimal book-keeping and simply reserves a speculative slot on the photonic bus for the data return. Most scheduling minutiae are handled by the logic on the DRAM stack's interface die.
Tuesday, February 22, 2011
Research Competitions
Almost every year, The Journal of ILP has conducted a research competition at one of our top conferences. For example, researchers are asked to develop a branch predictor that yields the highest prediction accuracy on a secret input data set. I think such competitions are a great idea. If you teach an advanced architecture class, you might want to consider tailoring your class projects in that direction. An international competition could be a strong motivating factor for some students. Besides, the competition organizers usually release a piece of simulation infrastructure that makes it easy for students to write and test their modules. Recent competitions have been on cache replacement, data prefetching, and branch prediction. The 3rd branch prediction championship will be held with ISCA this year. I am slated to organize a memory scheduling championship with ISCA next year (2012). The evaluation metric could be row buffer hit rates, overall throughput, energy, fairness, or combinations of these. I have a grand 12 months to plan this out, so please feel free to send in suggestions regarding metrics, simulators, and workloads that you think would work well.
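To give a flavor of one candidate metric, the sketch below computes the row-buffer hit rate of a request trace under a simple open-page policy. The trace format and the per-bank model are invented for illustration and are not tied to any particular championship infrastructure.

```python
# Row-buffer hit rate for a toy open-page DRAM model.
# A request hits if it targets the row currently open in its bank.
# The (bank, row) tuples below are an invented trace format.
requests = [(0, 7), (0, 7), (0, 3), (1, 5), (1, 5), (0, 3)]

open_rows = {}      # bank -> currently open row
hits = 0
for bank, row in requests:
    if open_rows.get(bank) == row:
        hits += 1               # row-buffer hit: no precharge/activate
    else:
        open_rows[bank] = row   # row-buffer miss: open the new row
print(f"row-buffer hit rate = {hits / len(requests):.2f}")
```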
Saturday, February 19, 2011
PostDocs in Computer Science
[Charts reproduced from the CRA PostDoc white paper (Figures 1, 3, 4, and 5) appeared here.]
PostDocs in CS are surging and there is a lot of discussion in the computing community about this trend. If you haven't already seen the CRA PostDoc white paper based on Taulbee data, I would recommend giving it a read and also sharing your opinions on it. Since I am pursuing a PostDoc, I wanted to discuss some of the data presented in this report and its relevance to the computer architecture research community.
The charts above have all been taken from the CRA report. Figure 1 shows the number of PhDs that went into different career choices. The "Others" category in this chart represents graduating PhDs who either took a career path not mentioned here or did not have a job by the time they graduated. Looking at this data, it appears that 2004 was the best year for academic hiring, when nearly equal numbers of CS PhDs went for tenure-track positions and industry positions. In 2009, a majority of PhDs went to industry, mainly because the number of PhD graduates almost doubled while academic hiring was reduced due to the economic downturn. The trend to note in this figure is the number of PostDocs, which almost doubled from 2004 to 2009.
Figure 3 further splits the number of graduating PhDs into research-area sub-groups. In this chart, "Others" signifies inter-disciplinary research areas or could even imply unemployment. Based on the data shown in this figure, systems and networking and AI/robotics have consistently been hot fields in terms of the number of PhD graduates. Computer architecture fares as a small research community, similar to theory and algorithms, and its number of PhD graduates has not changed much from 1998 to 2009.
Figure 4 shows the percentage of PhDs who took PostDoc positions in each sub-group over the years. Computer architecture has the lowest percentage of PostDocs (~2%) in 2009 (that includes me). Theory, AI/robotics, and numerical analysis/scientific computing consistently have the highest percentage of PostDocs. One possible reason we have fewer computer architecture PostDocs is the strong semiconductor industry presence in our field and the relatively small number of PhDs available to fill those jobs.
If we look at Figure 5, which shows the percentage of PhDs taking up tenure-track positions, we are again in the bottom range in 2009 (~6-7%), but we did well in the years 2003-2005 (~30-35%). I have plotted the number of computer architecture PhDs and the number of tenure-track hires in our field in the chart below, based on the data that I could visually interpret from the CRA charts. Clearly, the number of tenure-track positions in computer architecture was at its peak in 2004, when relatively few PhDs were graduating, and now we are back at 1998 levels with stiffer competition for very few slots. It will be interesting to watch this trend in the future.
[Chart: number of computer architecture PhDs vs. tenure-track hires, plotted from the CRA data, appeared here.]
Sunday, February 13, 2011
The 2X-5% Challenge
It is well-known that much of our data storage and computation will happen within datacenters in the coming decade. Energy efficiency in datacenters is a national priority and the memory system is one of the significant contributors to system energy. Most reports indicate that the memory system's contribution to overall energy is in the 20-30% range. I expect memory energy to be a hot topic in the coming years. I'm hoping that this post can serve as a guideline for those that want to exploit the energy-cost trade-off for memory system design.
For several years, the DRAM industry focused almost exclusively on the design of high-density DRAM chips. Seemingly, customers only cared to optimize cost per bit... or more precisely, purchase cost per bit. In recent times, the operating cost per bit is also becoming significant. In fact, DRAM industry optimizations to reduce purchase cost per bit often end up increasing the operating cost per bit. The time may finally be right to start building mainstream DRAM chips that are not optimized for density, but for energy. At least that's an argument we make in our ISCA 2010 paper, and a similar sentiment has been echoed in an IEEE Micro 2010 paper by Cooper-Balis and Jacob. If DRAM vendors were to take this route, customers may have to pay a higher purchase price for their memory chips, but they'd end up paying less for energy and cooling costs. The result should be a lower total cost of ownership.
However, it may take a fair bit of effort to convince memory producers and consumers that this is a worthwhile approach. Some memory vendors have apparently seen the light and begun to tout the energy efficiency of their products: Samsung's Green Memory and Samsung's new DDR4 product. While such an idea has often been scoffed at by memory designers that have spent years optimizing their designs for density, the hope is that economic realities will eventually set in.
A perfect analogy is the light bulb industry. Customers are willing to pay a higher purchase price for energy-efficient light bulbs that hopefully save them operating costs, compared to the well-known and commoditized incandescent light bulb. If the Average Joe can do the math, so can memory vendors. Surely Micron must have noticed the connection. After all, they make memory chips and lighting systems!! :-)
So what is the math? Some of my collaborators have talked to DRAM insiders, and while our ideas have received some traction, we are warned: do not let your innovations add more than a 2% area overhead!! That seems like an awfully small margin to play with. Here's our own math.
First, if we check out various configurations for customizable Dell servers, we can quickly compute that DRAM memory sells for roughly $50/GB today.
Second, how much energy does memory consume in a year? This varies based on several assumptions. Consider the following data points. In a 2003 IBM paper, two servers are described, one with 16 GB and another with 128 GB of DRAM main memory. Memory power contributes roughly 20% (318 W) of total power in the former and 40% (1,223 W) of total power in the latter. This leads to an estimate of 20 W/GB and 10 W/GB respectively for the two systems. If we assume that this value will halve every two years (I don't have a good reference for this guesstimate, but it seems to agree with some other data points), we get an estimate of 1.25-2.5 W/GB today. In a 2006 talk, Laudon describes a Sun server that dissipates approximately 64 W of power for 16 GB of memory. With scaling considered, that amounts to a power of 1 W/GB today. HP's server power calculator estimates that the power per GB for various configurations of a high-end 2009 785G5 server system ranges between 4.75 W/GB and 0.68 W/GB. Based on these data points, let us use 1 W/GB as a representative estimate for DRAM operational energy.
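The scaling behind these estimates is easy to reproduce. The halving-every-two-years assumption is the guesstimate stated above, and the target years below are my own reading of what "today" means for each data point.

```python
# Scale the published W/GB data points forward, assuming power per GB
# halves every two years (the post's guesstimate, not a measured trend).
def scale_w_per_gb(w_per_gb, from_year, to_year):
    halvings = (to_year - from_year) / 2.0
    return w_per_gb / (2 ** halvings)

# 2003 IBM servers: 318 W / 16 GB ~= 20 W/GB, 1223 W / 128 GB ~= 10 W/GB
for w in (318 / 16, 1223 / 128):
    print(round(scale_w_per_gb(w, 2003, 2009), 2))   # ~2.5 and ~1.2 W/GB
# 2006 Sun server: 64 W / 16 GB = 4 W/GB
print(round(scale_w_per_gb(64 / 16, 2006, 2010), 2)) # ~1.0 W/GB
```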
It is common to assume that power delivery inefficiencies and cooling overheads will double the energy needs. If we assume that the cost for energy is $0.80/Watt-year, we finally estimate that it costs $6.40 to keep one GB operational over a DIMM's four-year lifetime. Based on this number and the $50/GB purchase price number, we can estimate that if we were to halve memory energy, we can reduce total cost of ownership even if the purchase cost went up to $53.20/GB. Given that cost increases more than linearly with an increase in chip area, this roughly translates to a chip area overhead of 5%.
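Putting the whole chain of assumptions together in one place (the purchase price, representative power, overhead factor, energy price, and lifetime are all the values quoted above):

```python
# Total-cost-of-ownership margin for an energy-optimized DIMM.
purchase_per_gb   = 50.00   # $/GB, from Dell server configurations
power_per_gb      = 1.0     # W/GB, representative estimate from above
overhead_factor   = 2.0     # power delivery + cooling roughly double energy
cost_per_watt_yr  = 0.80    # $/Watt-year
lifetime_years    = 4.0     # assumed DIMM service life

energy_cost = power_per_gb * overhead_factor * cost_per_watt_yr * lifetime_years
print(energy_cost)                      # $6.40 per GB over the lifetime

savings_from_halving = energy_cost / 2  # cut memory energy by 2X
max_purchase = purchase_per_gb + savings_from_halving
print(max_purchase)                     # $53.20/GB break-even purchase price
print(f"{savings_from_halving / purchase_per_gb:.1%} allowable cost increase")
# ~6.4% higher purchase cost, or roughly 5% extra area, since cost
# grows faster than linearly with die area
```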
This brings us to the 2X-5% challenge: if memory designers set out to reduce memory energy by 2X, they must ensure that the incurred area overhead per DRAM chip is less than 5%. Alternatively stated, for a 2X memory energy reduction, the increase in purchase cost must be under 6%. Not a big margin, but certainly more generous than we were warned about. The margin may be even higher for some systems or if some of the above assumptions were altered: for example, if the average DIMM lasted longer than 4 years (my Dell desktop is still going strong after 7.5 years). This is an approximate model and we welcome refinements.
Friday, February 11, 2011
Datacenters and the Raging Energy Efficiency Battles
Datacenter efficiency is the new buzz word these days. Cloud computing (another buzz word!) essentially dictates the need for datacenter energy efficiency, and tree-hugger engineers are more than happy to embrace this shift in computer design.
There are a lot of sub-systems in a datacenter that can be re-designed/optimized for efficiency. These range from power distribution, all the way down to individual compute cores of a chip multi-processor. In this post, I am mostly going to talk about how general purpose microprocessor instruction set architecture (ISA) battles are currently raging at a datacenter near you.
To sum it up, Intel and ARM are getting ready for a battle. This is the kind of battle that Intel has fought in the past against IBM and Sun: Reduced Instruction Set Computing (RISC) vs. Complex Instruction Set Computing (CISC). To some extent, this battle is still going on. Intel's x64 ISA is CISC, and its RISC competitors are Oracle's SPARC ISA and IBM's POWER ISA. Intel also has its own Itanium line (an EPIC design, though often grouped with the RISC camp), and IBM has its mainframe z/Architecture, which is CISC. We will ignore these last two because they are niche market segments and will focus on the commodity server business, which is mostly a shipment-volume-driven business.
One clarification to be made here is that Intel's x64 chips are "CISC-like" rather than true CISC CPUs. Modern x64 chips have a front end that decodes x86 instructions into RISC-like "micro-ops". These micro-ops are then processed by the actual CPU logic, unlike the true CISC operations executed by IBM's mainframe z CPUs, for example. To clarify this distinction, any reference to x64 CPUs as CISC processors in this article actually refers to x64 CPUs with this decoder front-end. The front-end decoding logic in x64 CPUs is reported to be a large consumer of resources and energy, and it's entirely possible that it is a significant contributor to the lower energy efficiency of x64 CPUs compared to true RISC CPUs.
ARM is the new kid on the block in the server racket, and it touts higher efficiency as its unique selling point. Historically, the arguments put forth for RISC claim that it is a more energy-efficient platform due to the lower complexity of the cores. The reasoning boils down to the fact that some of the complexity of computation is moved from the hardware to the compiler. This energy efficiency is what ARM and other RISC CPU manufacturers claim to be their raison d'etre.
There is currently a lot of effort going into developing ARM-based servers, especially since recent ARM designs support ECC and Physical Address Extension (PAE), allowing the cores to be "server grade". This interest is unabated even though the cores are still 32-bit; reports, however, claim that ARM is working on a 64-bit architecture. What is unclear at this point is how energy efficient ARM cores will be when they finally make it into servers.
The uncertainty about the efficiency of ARM-based servers stems from the fact that ARM does not manufacture its own chips. ARM is an IP company and it licenses its core designs to third parties (the likes of TI, NVIDIA, etc.). This gives ARM both a strategic advantage and a reason to be wary of Intel's strategy.
The advantage of licensing IP is that ARM is not invested heavily in the fab business, which is fast becoming an extremely difficult business to be in. AMD exited/spun off its fab business as it was getting hard for them to compete with Intel. Intel is undoubtedly the big kahuna in the fab industry with the most advanced technology. This allows them to improve x64 performance by leveraging manufacturing process superiority, even if x64 is inherently more energy hungry. An example of this phenomenon is the Atom processor. Intel's manufacturing technology superiority is therefore a reason for concern for ARM.
The datacenter energy efficiency puzzle is a complex world. From the CPU point of view, there are five primary players: AMD, IBM, Intel, NVIDIA, and Oracle. AMD is taking its traditional head-on approach against Intel, and NVIDIA is attacking the problem from the side by advancing GPU-based computing. They are reportedly building high-performance ARM-based CPUs to work in conjunction with their GPUs, evidently to deliver higher compute efficiency through asymmetric compute capabilities.
IBM's and Oracle's strategies are twofold. Besides their RISC offerings, they are also getting ready to peddle their "appliances": vertically integrated systems that try to squeeze every bit of performance out of the underlying hardware. Ultimately, these appliances might rule the roost when it comes to energy efficiency, in which case one can expect HP, Cisco, and possibly Oracle and IBM themselves to come out with x64- and/or ARM-based appliances.
It's unlikely that just one of these players will be able to dominate the entire market. Hopefully, in the end, the entire computing eco-system will be more energy efficient because of the innovation driven by these competing forces. This should make all the tree-hugging engineers and tree-hugging users of the cloud happy. So watch out for the ISA of that new server in your datacenter, unless of course you are blinded by your cloud!
Saturday, February 5, 2011
Common Fallacies in NoC Papers
I am asked to review many NoC (network-on-chip) papers. In fact, I just got done reviewing my stack of papers for NOCS 2011. Many NoC papers continue to make assumptions that might have been fine in 2007, but are highly questionable in 2011. My primary concern is that most NoC papers overstate the importance of the network. This overstatement is often used to justify complex solutions, and also to justify a highly over-provisioned baseline; many papers then introduce optimizations to reduce the area/power/complexity of that over-provisioned baseline. Both of these fallacies have resulted in many NoC papers with possibly limited shelf life.
The first misleading overstatement is this (and my own early papers have been guilty of it): "Intel's 80-core Polaris prototype attributes 28% of its power consumption to the on-chip network", and "MIT's Raw processor attributes 36% of its power to the network". Both processors are a few years old. Modern networks probably incorporate many recent power optimizations (clock gating, low-swing wiring, etc.) and locality optimizations (discussed next). In fact, Intel's latest many-core prototype (the 48-core Single-chip Cloud Computer) attributes only 10% of chip power to the network. This dramatically changes my opinion on the kinds of network optimizations that I'd be willing to accept.
The second overstatement has to do with the extent of network traffic. Almost any high-performance many-core processor will be organized as tiles. Each tile will have one or a few cores, private L1 and L2 caches, and a slice (bank) of a large shared L3. Many studies assume that data placement in the L3 is essentially random and that a message on the network travels half-way across the chip on average. This is far from the truth. The L3 will be organized as an S-NUCA, and OS-based first-touch page coloring can influence the cache bank that houses each page. A thread will access a large amount of private data, most of which will be housed in the local bank and can be serviced without network traversal. Even for shared data, assuming some degree of locality, data can be found relatively close by. Further, if the many-core processor executes several independent programs or virtual machines, most requests are serviced by a small collection of nearby tiles, and long-range traversal on the network is only required when accessing a distant memory controller. We will shortly post a characterization of network traffic for the processor platform described above: an injection rate and a histogram of distance traveled for various benchmark suites. This will hopefully lead to a more meaningful synthetic network input than the most commonly used "uniform random".
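To make the point concrete, here is a toy Python sketch (entirely my own synthetic model, not the promised characterization) that mimics first-touch page coloring on a tiled S-NUCA and reports a histogram of Manhattan hop distances. The 8x8 grid, the 90% private-access ratio, and the random placement of shared pages are all illustrative assumptions.

import random
from collections import Counter

MESH_DIM = 8                     # assumed 8x8 tiled many-core
NUM_TILES = MESH_DIM * MESH_DIM

def hops(src, dst):
    """Manhattan distance between two tiles on the mesh."""
    sx, sy = src % MESH_DIM, src // MESH_DIM
    dx, dy = dst % MESH_DIM, dst // MESH_DIM
    return abs(sx - dx) + abs(sy - dy)

def distance_histogram(num_accesses=100_000, private_fraction=0.9):
    """Toy trace: threads mostly touch private pages (homed locally by
    first-touch coloring) and occasionally shared pages homed on a
    random bank. Returns a Counter of hop distances."""
    page_home = {}
    hist = Counter()
    for _ in range(num_accesses):
        tile = random.randrange(NUM_TILES)           # requesting tile
        if random.random() < private_fraction:
            page = ('private', tile, random.randrange(64))
            home = page_home.setdefault(page, tile)  # first touch = local bank
        else:
            page = ('shared', random.randrange(1024))
            home = page_home.setdefault(page, random.randrange(NUM_TILES))
        hist[hops(tile, home)] += 1
    return hist

if __name__ == '__main__':
    hist = distance_histogram()
    total = sum(hist.values())
    for d in sorted(hist):
        print(f"{d:2d} hops: {100.0 * hist[d] / total:5.1f}%")

With 90% of accesses hitting the local bank, the histogram is dominated by the zero-hop bucket, which is exactly why "uniform random" injection is a poor proxy for real traffic.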
With the above points considered, one would very likely design a baseline network that is very different from the plain vanilla mesh network. I would expect some kind of hierarchical network: perhaps a bus at the lowest level to connect a small cluster of cores and banks, perhaps a concentrated mesh, perhaps express channels. For those who haven't seen it, I highly recommend this thought-provoking talk by Shekhar Borkar, where he argues that buses should be the dominant component of an on-chip network. I highly doubt the need for large numbers of virtual channels, buffers, adaptive routing, etc. I'd go as far as to say that bufferless routing sounds like a great idea for most parts of the network. If most traffic is localized to the lowest level of the network hierarchy because threads find most of their data nearby, there is little inter-thread interference and there is no need for QoS mechanisms within the network.
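As a quick sanity check on the hierarchy argument, the little Python sketch below (my own back-of-envelope model, not anything from the talk) compares average router hop counts for a flat 8x8 mesh against a concentrated 4x4 mesh in which each router fronts a four-tile bus cluster.

from itertools import product

def avg_mesh_hops(k):
    """Average Manhattan distance between two uniformly chosen routers
    in a k x k mesh (brute-force enumeration, self-pairs included)."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for ax, ay in nodes for bx, by in nodes)
    return total / (len(nodes) ** 2)

print(f"flat 8x8 mesh:         {avg_mesh_hops(8):.2f} router hops on average")
print(f"concentrated 4x4 mesh: {avg_mesh_hops(4):.2f} router hops (plus a bus hop at each end)")

Even under uniform random traffic, the concentrated design roughly halves the average router traversal; with the localized traffic patterns described above, most requests never leave the local bus at all.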
In short, I feel the NoC community needs to start with highly localized network patterns and highly skinny networks, and identify the minimum amount of additional provisioning required to handle various common and pathological cases.