tag:blogger.com,1999:blog-65790220269216589892024-03-13T10:02:28.044-06:00Utah Arch<a href="http://arch.cs.utah.edu"> Computer Architecture research group at the University of Utah </a>Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.comBlogger37125tag:blogger.com,1999:blog-6579022026921658989.post-21437461858032623662018-05-02T23:02:00.000-06:002018-05-02T23:02:28.491-06:00Thoughts on Meltdown and SpectreThe Meltdown and Spectre attacks were a wake-up call for processor manufacturers. They exposed the fact that processors are typically shipped with latent vulnerabilities. In some of the attack scenarios, the blame rests on a hardware design oversight (Meltdown), or on the side-effects of speculation (Spectre Variant 2), or on code that leaks information through side channels (Spectre Variant 1).<br />
<br />
I expect that these specific attacks can and will be addressed in the coming years with hardware features that fix the design oversight or close certain side channels. Moving forward, manufacturers will likely be more paranoid about other potential side channels and the side-effects of performance optimizations.<br />
<br />
The field experienced a similar upheaval about three decades ago, when we discovered that speculation had a negative side effect -- its impact on memory consistency models. In fact, the fence instructions that came to our rescue in the 1990s are being used in some of the software patches for Spectre. These fence instructions essentially "disable" speculation when it's harmful.<br />
<br />
As we teach speculation in architecture classes, it is important to point out these negative side-effects. Thanks to the attention received by Meltdown and Spectre, students in my classes were already very curious. I'm posting two of my screencasts here that may be useful to others who want to discuss these attacks in their undergraduate and graduate classes: <a href="https://youtu.be/NNqfHCmA8uI" target="_blank">Meltdown</a> and <a href="https://youtu.be/Y7vhrafsXws" target="_blank">Spectre</a>.Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-14002079241198801212016-08-12T11:19:00.000-06:002016-08-12T11:19:31.202-06:00Bad AdviceFive years ago, our group had visited Micron for informal research discussions (documented in an <a href="http://utaharch.blogspot.com/2011/05/memory-systems-research-industry.html" target="_blank">earlier blog post</a>). It is fair to say that we were generally discouraged from messing with memory chips, and instead encouraged to look at innovations outside the memory chip. We were advised (warned) that even introducing a 0.5% area overhead to a memory chip would be untenable. And this was the dominant view within the memory industry. Many academic reviewers over the years have provided similar advice.<br /><br />As is now evident, that advice was misguided.<br /><br />Just as there are markets for a variety of processor chips, there are also emerging markets for a variety of memory devices. Some memory devices will incur a cost penalty in order to provide lower latency or lower energy. For example, see the slide below that was presented by Hynix at an <a href="http://isca2016.eecs.umich.edu/" target="_blank">ISCA 2016</a> keynote, arguing for a memory chip that offers 30% lower latency, while incurring a 28% area overhead.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-7XBcrPE76qw/V63_aYJHUYI/AAAAAAABGHk/lQjSTWmsr6MxM42Jzdd9rAV1_ZwrqF2eACEw/s1600/hynix-isca16.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="331" src="https://1.bp.blogspot.com/-7XBcrPE76qw/V63_aYJHUYI/AAAAAAABGHk/lQjSTWmsr6MxM42Jzdd9rAV1_ZwrqF2eACEw/s400/hynix-isca16.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">From the ISCA 2016 keynote talk by Seok-Hee Lee (SK Hynix)</td></tr>
</tbody></table>
Micron's <a href="https://www.micron.com/products/hybrid-memory-cube" target="_blank">Hybrid Memory Cube</a> is also designed with many small DRAM arrays that reduce "overfetch" but occupy more area. This was the key idea in our <a href="http://www.cs.utah.edu/~rajeev/pubs/isca10.pdf" target="_blank">ISCA 2010 paper</a> -- an idea that received a lot of criticism from reviewers (at least one "Strong Reject"), ISCA attendees (at least one nasty question during the Q&A), and designers at Micron (the ominous reality check on area overheads).<br /><br />Hopefully, program committees will no longer cite the "cost-sensitive memory industry" to reject otherwise good papers. Market forces are constantly evolving -- they should ideally not get in the way of a strong technical argument.<br />
<br />Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-75212874173002186332015-11-04T23:09:00.000-07:002015-11-04T23:09:59.379-07:00Random Thoughts on the Review ProcessFor the most part, I've been very impressed by the <a href="http://www.sigarch.org/files/2013/06/iscabestpractices_june_2013.pdf" target="_blank">review process</a> at top architecture conferences (enough to annoy some of my colleagues in other areas :-). But we can always do better. Here are some (hopefully constructive) thoughts:<br />
<br />
1. At a PC meeting, if a paper's reviewers can't reach a unanimous decision, the outcome is often determined by a PC-wide vote. I'm always uncomfortable when the people who read the paper vote 2-1 in favor, but the PC votes 15-16 against the paper. So, essentially, the paper's outcome is decided by one person who never read the paper. I recommend that non-decisive PC votes (margin less than 5?) simply be discarded.<br />
<br />
2. A few conferences have experimented with PC meetings that last 1.5 or 2 days. More time does not mean better decisions; it just means that talkative PC members find a way to say the same thing with more words. So let me cast a vote in favor of 1-day PC meetings. Also, a Friday PC meeting allows for better work/life balance -- I assume that's important. I'm curious to see how the ISCA'16 PC meeting plays out: two days, but most PC members will only attend one of the two days.<br />
<br />
3. I assume that <a href="http://users.ece.gatech.edu/~moin/" target="_blank">Moin</a> will have more to say about the <a href="http://www.microarch.org/micro48/reviewmodel.html" target="_blank">MICRO'15 review process</a> at MICRO. In a nutshell, 80+ papers were given 3 weeks to submit a revised paper, in addition to the standard rebuttal. In my opinion, the revision process worked very well. Assuming that a paper is close to being accepted, the authors would rather spend time now revising the paper than roll the dice at the next conference. I think the MICRO'15 review process hit a sweet spot -- it gets us sufficiently closer to a journal-style process, while retaining the many benefits of conference reviewing.<br />
<br />
4. Can we please do away with the rule that a PC member can't have more than 2 accepted papers? Fortunately or unfortunately, I have not been victimized by the rule. But it happens to someone at every PC meeting. It's terribly unfair to the involved grad students. Seemingly, the rule is in place to prevent collusion among PC members. But I just don't buy it. If I was the colluding type, would I stop because there's a cap on my accepted papers? Probably not. I'm ok with a rule that actually stops collusion. I'm not ok with a rule that masks collusion. (FWIW, I have never witnessed collusion at a PC meeting, but I also happen to be relatively naive.)<br />
<br />
5. The single most important task for a PC Chair is to find qualified reviewers. The second most important task is to push reviewers towards an active, inclusive, and positive on-line discussion. If these two tasks are done well, the other tasks become a lot less crucial.<br />
<br />
6. It has been said many times: as a community, we write harsh reviews. Why? (i) Because we appear smart when we find a flaw. (ii) It's a sign of weakness if we are easily impressed. (iii) Pulling a competitor down is almost equivalent to pulling myself up. Clearly, we all need to dial down the negativity. Write your reviews with the assumption that the submission site will get hacked -- it does wonders for your tone and positivity.<br />
<br />
7. It was common practice to reveal author names during the PC meeting. In the past year or so, PC chairs have correctly chosen to not reveal author names at the PC meeting. This is clearly a good practice because it not only eliminates bias at the meeting, but also preserves double-blindness for future submissions of that work. One downside is that it is now much harder to detect incorrectly identified conflicts (both intentional and inadvertent). Entering conflicts for every deadline is both tedious and error-prone. We are well overdue for an automatic conflict management system, perhaps with help from an IEEE/ACM-supported repository of known conflicts.<br />
Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-51534589658233373932015-10-24T09:39:00.000-06:002015-10-24T09:39:57.047-06:00Lab DemographicsMy family loves the new <a href="http://shaunthesheep.com/movie" target="_blank">Shaun the Sheep Movie</a>. The characters' resemblance to our lab is uncanny.<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-PT357vwGKq0/ViuPVV8XlHI/AAAAAAAA9Ng/8aTLX-3-PrA/s1600/farmer2.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-PT357vwGKq0/ViuPVV8XlHI/AAAAAAAA9Ng/8aTLX-3-PrA/s1600/farmer2.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">The Advisor: barges into the lab, spews incoherent instructions, is losing hair, has memory loss, yawns a lot.</span></td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-v53Vl_s55h8/ViuRTgWZNwI/AAAAAAAA9OU/YZuLqwDVzZ8/s1600/bitzer1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="http://3.bp.blogspot.com/-v53Vl_s55h8/ViuRTgWZNwI/AAAAAAAA9OU/YZuLqwDVzZ8/s320/bitzer1.png" width="260" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">The Senior Grad Student: keeps everyone in line</span>.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-zFBsNhrshUA/ViuSJh4Gr_I/AAAAAAAA9Ok/GihrM1iCVwk/s1600/shaun1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="http://4.bp.blogspot.com/-zFBsNhrshUA/ViuSJh4Gr_I/AAAAAAAA9Ok/GihrM1iCVwk/s320/shaun1.png" width="201" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">The Creative Grad Student: always in trouble, has a prominent tuft of hair on top.</span></td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-o9SIQzlQu1U/ViuZNS3GxYI/AAAAAAAA9PE/_fOFJFtEzz0/s1600/pizza2.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="http://4.bp.blogspot.com/-o9SIQzlQu1U/ViuZNS3GxYI/AAAAAAAA9PE/_fOFJFtEzz0/s400/pizza2.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">Ummm.... I don't know how to sugar-coat this: grad students.</span></td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-MrlrY6gvajc/ViuaPH-jNPI/AAAAAAAA9PM/HWrcvSqsmvA/s1600/timmy1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="243" src="http://3.bp.blogspot.com/-MrlrY6gvajc/ViuaPH-jNPI/AAAAAAAA9PM/HWrcvSqsmvA/s320/timmy1.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">First-year grad students.</span></td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-q6zxNaf1bfs/ViubBOO6YdI/AAAAAAAA9PU/aMnXKDrctFg/s1600/pigs1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="246" src="http://4.bp.blogspot.com/-q6zxNaf1bfs/ViubBOO6YdI/AAAAAAAA9PU/aMnXKDrctFg/s320/pigs1.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">Students from other labs: like to eat our food, use our ping-pong table.</span></td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-wa5wrz0c1QI/ViucOUrVlOI/AAAAAAAA9Pg/gnNaeGm1DMM/s1600/trumper1.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="http://4.bp.blogspot.com/-wa5wrz0c1QI/ViucOUrVlOI/AAAAAAAA9Pg/gnNaeGm1DMM/s400/trumper1.jpg" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">Director of Grad Studies.</span></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-2rOQ66wIdds/ViueN_ChOuI/AAAAAAAA9P0/xyeklGWrG3c/s1600/angry-dog3.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="http://2.bp.blogspot.com/-2rOQ66wIdds/ViueN_ChOuI/AAAAAAAA9P0/xyeklGWrG3c/s400/angry-dog3.jpg" width="382" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">ISCA Reviewers. :-)</span></td></tr>
</tbody></table>
<br />
<span style="font-size: xx-small;">Image credits: shaunthesheep.com, pinterest.com, movieweb.com, wikia.com, thecomedynetwork.ca</span>Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-63441615541442257982014-04-21T07:33:00.000-06:002014-06-13T18:13:43.928-06:00HP StintIn the next 12 months, I'll be spending half my time as an HP Labs Visiting Scholar. The plan is to contribute to the definition of memristor architectures and other system-building efforts at HP. The <a href="http://www.nature.com/nature/journal/v453/n7191/abs/nature06932.html" target="_blank">memristor</a> is one of three new non-volatile memory technologies that are being actively pursued by industry (STT-RAM and PCM are the other two). These new memory technologies have density, performance, energy, and endurance properties that lie somewhere between those of DRAM and Flash. They could be used as a new level in an existing memory hierarchy, or they could displace DRAM and/or Flash.<br />
<br />
For a memory system academic like me, this should be a fun learning experience and a unique opportunity to impact future technologies and products. Thanks to the university administration for seeing the value in such an industry engagement and working through the details to make this happen!<br />
<br />
In the coming year, I'll continue to work closely with my Ph.D. students on their on-going projects. I won't be teaching in Fall 2014. I will teach CS/ECE 6810 (<a href="http://www.eng.utah.edu/~cs6810/" target="_blank">Computer Architecture</a>) in Spring 2015. CS/ECE 7810 (<a href="http://www.eng.utah.edu/~cs7810/" target="_blank">Advanced Computer Architecture</a>) will not be taught in the 2014-2015 academic year. I will likely return to my normal teaching schedule in 2015-2016.<br />
<br />
<b>UPDATE: </b>A <a href="http://www.businessweek.com/articles/2014-06-11/with-the-machine-hp-may-have-invented-a-new-kind-of-computer" target="_blank">Business Week article</a> on the project that I'm contributing to. Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-66826436666122535742013-11-27T00:52:00.000-07:002013-12-16T10:16:46.512-07:00A DRAM Refresh TutorialIn a recent project, we dug into the details of the DRAM refresh process. Unfortunately, there is no public document that explains this well. Hopefully, the description below helps fill this gap. It is possible that not all DRAM chips follow the procedure outlined below.<br />
<br />
<i><b>The well-known basics of DRAM refresh:</b></i><br />
<br />
The charge on a DRAM cell weakens over time. The DDR standard requires every cell to be refreshed within a 64 ms interval, referred to as the <i>retention time</i>. At temperatures higher than 85°C (referred to as <i>extended temperature range</i>), the retention time is halved to 32 ms to account for the higher leakage rate. The refresh of a memory rank is partitioned into 8,192 smaller refresh operations. One such refresh operation has to be issued every 7.8 µs (64 ms/8192). This 7.8 µs interval is referred to as the <i>refresh interval, tREFI</i>. The DDR3 standard requires that eight refresh operations be issued within a time window equal to 8 x tREFI, giving the memory controller some flexibility when scheduling these refresh operations. Refresh operations are issued at rank granularity in DDR3 and DDR4. Before issuing a refresh operation, the memory controller precharges all banks in the rank. It then issues a single refresh command to the rank. DRAM chips maintain a row counter to keep track of the last row that was refreshed -- this row counter is used to determine the rows that must be refreshed next.<br />
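These numbers are easy to sanity-check. A quick sketch (using only the DDR3 figures quoted above):

```python
# DDR3 refresh basics, using the numbers quoted above.
RETENTION_MS = 64.0        # retention time at normal temperature
REFRESH_OPS = 8192         # refresh operations per retention window

tREFI_us = RETENTION_MS * 1000 / REFRESH_OPS   # refresh interval in microseconds
print(tREFI_us)            # 7.8125 -> the ~7.8 us tREFI in the text

# Above 85 C (extended temperature range), retention halves to 32 ms,
# so refresh operations must be issued twice as often.
tREFI_ext_us = 32.0 * 1000 / REFRESH_OPS
print(tREFI_ext_us)        # 3.90625
```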
<br />
<i><b>tRFC and recovery time:</b></i><br />
<br />
Upon receiving a refresh command, the DRAM chips enter a refresh mode that has been carefully designed to perform the maximum amount of cell refresh in as little time as possible. During this time, the current carrying capabilities of the power delivery network and the charge pumps are stretched to the limit. The operation lasts for a time referred to as the <i>refresh cycle time, tRFC</i>. Towards the end of this period, the refresh process starts to wind down and some recovery time is provisioned so that the banks can be precharged and charge is restored to the charge pumps. Providing this recovery time at the end allows the memory controller to resume normal operation at the end of tRFC. Without this recovery time, the memory controller would require a new set of timing constraints that allow it to gradually ramp up its operations in parallel with charge pump restoration. Since such complexity can't be expected of every memory controller, the DDR standards include the recovery time in the tRFC specification. As soon as the tRFC time elapses, the memory controller can issue four consecutive Activate commands to different banks in the rank.<br />
<br />
<i><b>Refresh penalty:</b></i><br />
<br />
On average, in every tREFI window, the rank is unavailable for a time equal to tRFC. So for a memory-bound application on a 1-rank memory system, the percentage of execution time that can be attributed to refresh (the <i>refresh penalty</i>) is tRFC/tREFI. In reality, the refresh penalty can be a little higher because directly prior to the refresh operation, the memory controller wastes some time precharging all the banks. Also, right after the refresh operation, since all rows are closed, the memory controller has to issue a few Activates to re-populate the row buffers. These added delays can grow the refresh penalty from (say) 8% in a 32 Gb chip to 9%. The refresh penalty can also be lower than the tRFC/tREFI ratio if the processors can continue to execute independent instructions in their reorder buffers while the memory system is unavailable. In a multi-rank memory system, the refresh penalty depends on whether ranks are refreshed together or in a staggered manner. If ranks are refreshed together, the refresh penalty, as above, is in the neighborhood of tRFC/tREFI. If ranks are refreshed in a staggered manner, the refresh penalty can be greater. Staggered refresh is frequently employed because it reduces the memory's peak power requirement.<br />
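The penalty arithmetic works out as follows (a sketch; the 640 ns tRFC for a 32 Gb chip is the projection cited in the DDR4 update below):

```python
# Refresh penalty for a memory-bound application on a 1-rank system:
# the fraction of time the rank is unavailable, tRFC / tREFI.
tREFI_ns = 7812.5          # ~7.8 us refresh interval
tRFC_ns_32Gb = 640.0       # projected tRFC for a 32 Gb chip

penalty = tRFC_ns_32Gb / tREFI_ns
print(round(penalty * 100, 1))   # 8.2 -> the "(say) 8%" figure above,
                                 # before the extra precharge/Activate costs
```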
<br />
<i><b>Some refresh misconceptions:</b></i><br />
<br />
We next describe how a few rows in all banks are refreshed during the tRFC period. As DRAM chip capacities increase, the number of rows on the chip also increases. Since the retention time (64 ms) and refresh interval (7.8 µs) have remained constant, the number of rows that must be refreshed in every refresh interval has increased. In modern 4Gb chips, eight rows must be refreshed in every bank in a single tRFC window. Some prior works have assumed that a row refresh is equivalent to an Activate+Precharge sequence for that row. Therefore, the refresh process was assumed to be equivalent to eight sequential Activate+Precharge commands per bank, with multiple banks performing these operations in parallel. However, DRAM chip specifications reveal that the above model over-simplifies the refresh process. First, eight sequential Activate+Precharge sequences will require time = 8 x tRC. For a 4Gb DRAM chip, this equates to 390 ns. But tRFC is only 260 ns, i.e., there is no time to issue eight sequential Activate+Precharge sequences and allow recovery time at the end. Also, parallel Activates in eight banks would draw far more current than is allowed by the tFAW constraint. Second, the DRAM specifications provide the average current drawn during an Activate/Precharge (IDD0) and Refresh (IDD5). If Refresh was performed with 64 Activate+Precharge sequences (64 = 8 banks x 8 rows per bank), we would require much more current than that afforded by IDD5. Hence, the refresh process uses a method that has higher efficiency in terms of time and current than a sequence of Activate and Precharge commands.<br />
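The time-budget argument can be verified with simple arithmetic (tRC = 48.75 ns is assumed here, back-computed from the 8 x tRC = 390 ns figure above):

```python
# Why refresh cannot be eight sequential Activate+Precharge pairs per bank,
# using the 4 Gb chip numbers from the text.
tRC_ns = 48.75             # one Activate+Precharge cycle (assumed from 8*tRC = 390 ns)
rows_per_bank = 8          # rows to refresh per bank in one tRFC window

naive_ns = rows_per_bank * tRC_ns
print(naive_ns)            # 390.0

tRFC_ns = 260.0            # actual tRFC for a 4 Gb chip
print(naive_ns > tRFC_ns)  # True: the naive schedule doesn't fit,
                           # even before adding recovery time at the end
```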
<br />
<i><b>The actual refresh process:</b></i><br />
<br />
This process is based on the high number of subarrays provisioned in every bank. For example, a bank may have 16 subarrays, of which only four are accessed during a regular Activate operation. This observation also formed the basis for the recent subarray-level parallelism (<a href="http://users.ece.cmu.edu/~omutlu/pub/salp-dram_isca12.pdf" target="_blank">SALP</a>) idea of Kim et al. During a refresh operation, the same row in all 16 subarrays undergoes an Activate and a Precharge. In this example, four rows worth of data are being refreshed in parallel within a single bank. Also, the current requirement for this is not 4x the current for a regular Activate; by sharing many of the circuits within the bank, the current does not increase linearly with the extent of subarray-level parallelism. Thus, a single bank uses the maximum allowed current draw to perform parallel refresh on a row in every subarray; each bank is handled sequentially (refreshes in two banks may overlap slightly based on current profiles), and there is a recovery time at the end.<br />
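The parallelism implied above works out as follows (an illustrative sketch; the subarray counts are the example numbers from this paragraph):

```python
# Refresh schedule implied by subarray-level parallelism.
# A normal Activate touches 4 of a bank's 16 subarrays (one row of data);
# refresh activates the same row in all 16, i.e. 4 rows of data at once.
subarrays_per_bank = 16
subarrays_per_activate = 4
rows_needed_per_bank = 8   # rows to refresh per bank per tRFC (4 Gb chip)
banks = 8

rows_per_refresh_op = subarrays_per_bank // subarrays_per_activate  # 4
ops_per_bank = rows_needed_per_bank // rows_per_refresh_op          # 2
total_ops = ops_per_bank * banks   # banks are handled (mostly) sequentially
print(rows_per_refresh_op, ops_per_bank, total_ops)   # 4 2 16
```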
<br />
<i><b>Update added on 12/16/2013:</b></i><br />
<i><b> </b></i> <br />
<i><b>Refresh in DDR4:</b></i><br /><br />One important change expected in DDR4 devices is a Fine Granularity Refresh (FGR) operation (<a href="http://www.csl.cornell.edu/%7Emartinez/doc/isca13-mukundan.pdf" target="_blank">this ISCA'13 paper</a> from Cornell/IBM has good details on FGR). FGR-1x can be viewed as a regular refresh operation, similar to that in DDR3. FGR-2x partitions the regular refresh operation into 2 smaller "half-refresh" operations. In essence, the tREFI is halved (half-refresh operations must be issued twice as often), and the tRFC also reduces (since each half-refresh operation does half the work). FGR-4x partitions each regular refresh operation into 4 smaller "quarter-refresh" operations. Since each FGR operation renders a rank unavailable for a short time, it has the potential to reduce overall queuing delays for processor reads. But it does introduce one significant overhead. A single FGR-2x operation has to refresh half the cells refreshed in an FGR-1x operation, thus potentially requiring half the time. But an FGR-2x operation and an FGR-1x operation must both incur the same recovery cost at the end to handle depleted charge pumps. <a href="http://www.csl.cornell.edu/%7Emartinez/doc/isca13-mukundan.pdf" target="_blank">DDR4 projections</a> for 32 Gb chips show that tRFC for FGR-1x is 640 ns, but tRFC for FGR-2x is 480 ns. The overheads of the recovery time are so significant that two FGR-2x operations take 50% longer than a single FGR-1x operation. Similarly, going to FGR-4x mode results in a tRFC of 350 ns. Therefore, four FGR-4x refresh operations would keep the rank unavailable for 1400 ns, while a single FGR-1x refresh operation would refresh the same number of cells, but keep the rank unavailable for only 640 ns. The high refresh recovery overheads in FGR-2x and FGR-4x limit their effectiveness in reducing queuing delays.<br /><i><b><br />Refresh in LPDDR2:</b></i><br /><br />LPDDR2 also provides a form of fine granularity refresh. 
It allows a single bank to be refreshed at a time with a REFpb command (per-bank refresh). For an 8-bank LPDDR2 chip, eight per-bank refreshes handle as many cells as a single regular all-bank refresh (REFab command). A single REFpb command takes way more than 1/8th the time taken by a REFab command -- REFab takes 210 ns in an 8Gb chip and REFpb takes 90 ns (see pages 75 and 141 in <a href="http://www.cs.utah.edu/~rajeev/micron2gb.pdf" target="_blank">this datasheet</a>). Similar to DDR4's FGR, we see that breaking a refresh operation into smaller units imposes a significant overhead. However, LPDDR2 adds one key feature. While a REFpb command is being performed in one bank, regular DRAM operations can be serviced by other banks. DDR3 and DDR4 do not allow refresh to be overlapped with other operations (although, this appears to be the topic of two upcoming papers at HPCA 2014). Page 54 of <a href="http://www.cs.utah.edu/~rajeev/micron2gb.pdf" target="_blank">this datasheet</a> indicates that a REFpb has a similar impact on tFAW as an Activate command. Page 75 of the same datasheet indicates that an Activate and REFpb can't be viewed similarly by the memory scheduler. We suspect that REFpb has a current profile that is somewhere between a single-bank Activate and a REFab.<br /><br />
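The fine-granularity overheads quoted above can be tallied directly (a sketch using the tRFC values cited in this post):

```python
# DDR4 Fine Granularity Refresh: the fixed recovery cost at the end of
# each operation makes 2x/4x modes slower in aggregate (32 Gb projections).
tRFC_1x = 640.0   # ns, one full refresh
tRFC_2x = 480.0   # ns, one half-refresh
tRFC_4x = 350.0   # ns, one quarter-refresh

print(2 * tRFC_2x)                           # 960.0 ns, vs 640 ns for FGR-1x
print(round(2 * tRFC_2x / tRFC_1x - 1, 2))   # 0.5 -> "50% longer"
print(4 * tRFC_4x)                           # 1400.0 ns, vs 640 ns for FGR-1x

# LPDDR2 shows the same effect: eight per-bank refreshes (90 ns each)
# cover the same cells as one 210 ns all-bank refresh in an 8 Gb chip.
print(8 * 90.0)                              # 720.0 ns of REFpb vs 210 ns REFab
```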
<i>Jointly authored by Rajeev Balasubramonian (University of Utah), Manju Shevgoor (University of Utah), and Jung-Sik Kim (Samsung).</i>Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com3tag:blogger.com,1999:blog-6579022026921658989.post-85129357978191357282013-08-12T14:24:00.000-06:002013-08-12T14:24:51.753-06:00Student Reports for ISCA 2013<span style="background-color: white;">I recently handled student travel grants for ISCA 2013. As is customary, I asked awardees to send me a brief trip report:<br /></span><br />
<blockquote class="tr_bq">
<span style="background-color: white;"><i>"... explain what you saw at the conference that had a high impact on you. This could be a keynote or talk that you thought was especially impressive, it could be a commentary on research areas that deserve more/less attention, on especially effective presentation methods, on ways to improve our conference/reviewing system, etc. Please feel free to be creative..."</i></span><br /><span style="background-color: white;"></span></blockquote>
<span style="background-color: white;"><br />37 of the 68 awardees responded. By most accounts, this was one of the most memorable ISCAs ever. Several students highlighted the talk on DNA Computing. Many also wished there was a session at the start with 90-second paper summaries (as was done at MICRO 2012).<br /><br />Some of the more interesting comments (lightly edited and anonymized) are summarized below.<br /><br /><br />"... we have to leave the dream of hardware generality if we still want to increase performances with reasonable energy budgets. I noticed a lot of work sitting between hardware specialized units and general purpose architectures. ... I really enjoyed the work presented by Arkaprava Basu in the big data session called Efficient Virtual Memory for Big Memory Servers. The introductory quote in the paper reads: 'Virtual memory was invented in a time of scarcity. Is it still a good idea?' Experiments on widely used server applications show that many virtual memory features are not needed."<br /><br /><br /><i>"I strongly recommend using the lightning session in future conferences."</i><br /><br /><br />"... Breadth of topics has been increasing over the years. The paper on DNA computing was really, really good. A tutorial or panel on emerging technology would also be very cool. Potential list of topics: DNA/protein computation, quantum, resistive element storage & computing, bio-inspired computing and circuit technology, optical, circuits for near/sub threshold computing.<br /><br />The split sessions were a bit off-putting. I would also like all sessions to be recorded.<br /><br />During the business meeting the idea of turning CAN into something like WDDD was brought up. I really like this idea.<br /><br />I found Uri's discussion of memory-intensive architectures particularly compelling. I rather enjoy keynotes that present interesting, largely undeveloped research directions and ideas. 
One thing that I thought was missing from this year's ISCA was a session on program characterization or application analysis. Given the amount of application- or workload- specific stuff we are seeing, this topic seems increasingly important. The release of data sets, benchmarks, and thorough analyses of application domains and workloads should not be relegated to ASPLOS/ISPASS -- I'd like to see them here at ISCA and encourage more of them. Especially in data center research (and to a lesser extent mobile) it seems like large companies have far more intuition and data on their workloads than is generally available to the community. Perhaps making a top-tier conference like ISCA a venue for that information would make the release of some (admittedly proprietary) information possible or attractive."<br /><br /><br /><i>"I really enjoyed the panel in the Monday workshop that discussed the state of computer architecture in 15 years. Similarly, I liked the second keynote on future directions and specifically the memristor."</i><br /><br />"The talk by Gadi Haber in the AMAS-BT workshop was memorable. In the talk, Gadi states that binary translation can be seen as an extension to micro-architecture. Things that are difficult in hardware are sometimes much easier to do in software. In fact, many groups are co-designing hardware with binary translators."<br /><br /><br /><i>"Many talks/keynotes encouraged the use of cost as a metric for future evaluations. I really enjoyed the session of emerging technologies, especially the DNA computing talk. I also enjoyed the talk by Andreas Moshovos that had an entirely different way to pass the insight of their idea.<br /><br />The most useful session for me was the data centers session. Specifically, the last talk by Jason Mars was excellent and I really liked the insight that was provided for a large company like Google. 
Knowing that the studies mentioned in this work are important for a key player in the data center industry was reassuring.<br /><br />One minor suggestion is to end the last day right after lunch to facilitate travel."</i><br /><br /><br />"I especially enjoyed the talk on ZSim. I think simulators should be discussed more in computer architecture research.<br /><br />One thing I would suggest is that the keynotes should be more technically informative. I thought the first keynote contained more personal opinions than technical reasons.<br /><br />Another thing I would suggest is that all speakers should be required to present their paper in 2-3 mins at the very start of the conference."<br /><br /><br /><i>"I especially enjoyed the 'Roadmaps in Computer Architecture' workshop."</i><br /><br />"The thing that had the most impact on me was the chats I had with stalwarts of computing in the hallway. ... I think the idea of lightning talks like that in MICRO 2012 would have been really helpful."<br /><br /><br /><i>"The two keynotes were very complementary. One looked back at history and the other was very inspiring for future research directions. The most impressive paper presentation was on the ZSim simulator. The author ran the PowerPoint presentation on their simulator with dynamically updated performance stats. I would also suggest recording all presentations."</i><br /><br /><br />"I followed with interest the opening keynote by Dr. Dileep Bhandarkar. For a newbie, it's really nice to listen to the history of computer evolution. Another interesting presentation was on the ZSim simulator. It was great fun to see the thousand-core system simulator up and running during the presentation itself. 
The presenter precisely and clearly explained how choices were made to get the maximum performance."<br /><br /><br /><i>"Among the talks I attended, the ideas that most intrigued me were Intel's Triggered Instructions work (Parashar et al.), and the introduction of palloc as an approach toward moving energy management control for datacenters to the software level, similar to the use of malloc for memory management (Wang et al.). I also found the thread scheduling work on asymmetric CMPs very interesting (Joao et al.).<br /><br />... some presentations also had obvious flaws in the static part, i.e., the slides: full sentences, no smooth transitions between slides, overloaded content. Maybe an improvement could be achieved by imposing some rules (the same way as rules are set for paper submissions), or by organizing a tutorial session during the conference where 'Giving a good presentation' would be taught.<br /><br />I thought that the time dedicated for Q&A after each presentation was quite limited. One thing I could think of is having (additional) Q&A time for each set of papers rather than each single paper, so that the dedicated time can be filled up according to the audience's interest for each of that set's papers."</i><br /><br /><br />"I'd like to see a poster session in future ISCA editions, e.g., including papers that didn't make it to the main conference."<br /><br /><br /><i>"I have been interested in taking security research further down the hardware/software stack, but it appears as though most security research at ISCA is focused on things such as side channel attacks. I think that one interesting area is to look at security accelerators or security mechanisms in hardware that either increase the performance of common security solutions or improve the security of those solutions. 
Security co-processors, as I observed in a few of the talks, do not solve primary security issues, and the problems need to be tackled at more fundamental levels."</i><br /><br /><br />"The most impressive talk for me was by Richard Muscat, 'DNA-based Molecular Architecture with Spatially Localized Components'. I was truly amazed when he reached a specific slide that explained how he managed to use DNA molecules as a wire to transmit the result of a computation, thereby enabling the composition of many modules of DNA computation, whereas the previous approach to DNA computing was limited to doing a single computation in a soup of DNA. This is a huge step towards enabling intelligent drugs that implement some logic by using DNA molecules. I also especially appreciated the last two talks about program debugging ('QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs' and 'Non-Race Concurrency Bug Detection Through Order-Sensitive Critical Sections'). They offer interesting insights on how to enable better debugging of parallel programs, which currently is very frustrating to do. I hope that in the near future we will have better options to efficiently debug parallel software instead of having to stick to 'errorless programming' :) "<br /><br /><br /><i>"I want to emphasize the 'Emerging Technologies' session (Session 3A) and especially the work about DNA-based circuits by Richard A. Muscat et al. I have to admit that I was not really aware of the fact that there is that much research going on in the field of DNA, which might also be of interest for the computer architecture community. Nevertheless, especially in a time where we discuss whether Moore's law may not hold any more in the near future (as it was also a topic throughout the keynotes at ISCA'40), I think that investigating all kinds of alternative ways to construct 'logic circuits' deserves close attention. 
Assembling such circuits based on a modular approach using DNA structures may sound like a science fiction movie these days (at least for myself at the moment), but who imagined a few decades ago that we would be running around in public, wearing camera- and other high-tech-equipped glasses? So although it does not fall into my research area at all, please keep up that great work!<br /><br />One of the authors presenting a workshop paper was not able to attend. Therefore, they prepared a screen-captured video presentation. Basically, I am not really a fan of such presentation methods, but then I was positively amazed because they really did a great job and presented their work very well. However, I think in general and especially for a large audience (like the main symposium of ISCA), physically present speakers should be favored in the future ('discussions' with a laptop and a projector are somehow difficult :)."</i><br /><br /><br />"The session with the highest impact on me was 'Emerging Technologies'. The proposals regarding quantum computers and DNA-based molecular architecture provided insight into how computing will look in the coming years. Thus, in my opinion, similar types of work should be supported."<br /><br /><br /><i>"The most interesting part was the keynote given by Prof. Uri Weiser. He talked about heterogeneous and memory-intensive architecture as the next frontier. I think ISCA may need more such talks about future technology."</i><br /><br /><br />"There are three highlights that come to mind: the keynote by Dr. Dileep Bhandarkar, the presentation of 'Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached', and the Workshop on Energy Efficient Design.<br /><br />Keynote: ... Dr. Bhandarkar's advice to 'wear many hats', 'you don't grow until you step out of your comfort zone', 'don't be encumbered by past history', and statement that 'strategy without execution is doomed' are particularly noteworthy. Dr. 
Bhandarkar's anecdote concerning the development of Itanium was also illustrative; I was previously aware of the external controversy over the architecture but did not know of the degree to which Intel sought to protect Itanium from internal competitors. Additionally, Dr. Bhandarkar's assertion that destabilizing technologies come from below (cheaper, lower powered, efficiency vs. performance) was certainly thought-provoking. Finally, Dr. Bhandarkar's demonstration of the complexities of Qualcomm's Snapdragon system architecture and his assertion during the question session that DSPs will require new levels of programmability were intriguing.<br /><br />Thin Servers: I enjoyed the presentation and had higher than average insight into this topic as my background is in networking. Briefly, I was intrigued by the custom platform developed to speed up lookups, and I thought the performance analysis was well done. A drawback of the work was the lack of evaluation vs. NetFPGA solutions, but the presenter claimed that their SoC solution was more compliant with the existing memcached protocol. At a high level, I think it is an interesting counterpoint to Dr. Bhandarkar's assertion that increasing programmability is necessary. It would seem that flexible, low-cost development and fabrication platforms are also extremely important to developing heterogeneous systems.<br /><br />WEED: I found the discussion session led by Karthick Rajamani at the end of the workshop to be thought-provoking, especially his comments pointing to increasing interest in the power consumption and control of memory systems. Additionally, I appreciated his efforts to show the impact that past workshop papers had via follow-up papers at conferences."<br /><br /><br /><i>"The opening keynote by Dr. Bhandarkar was very interesting. ... I really liked the fact that the security session and the reliability session were scheduled back-to-back as they often share similar audiences. 
I would recommend such scheduling in the future."</i><br /><br />"I would like to comment on the Emerging Technologies Session. I feel there was too much introduction and too little presentation of what was actually done in the research. I propose doubling the time for this session: the first part should be an introduction for those who are not aware of these technologies, and the second part a deep analysis of the work done by the authors."<br /><br /><br /><i>"I thoroughly enjoyed the Accelerators and Emerging Architectures session for its creativity in facing the dark silicon and utilization wall problems head-on. I was also particularly interested in the Big Data session as this is my research direction and I believe the architecture community is and should be focusing on this area. I regret that I was not able to attend the Power and Energy session as it was put in the session parallel with the Big Data session; I believe solving power and energy problems is imperative in all aspects of hardware architecture/design. I enjoyed Uri's keynote on heterogeneous and memory-intensive architecture. I generally agree with his take on the future of computing being heterogeneous and memory intensive; however, I am not sold on the applicability and feasibility of the proposed MultiAmdahl law just yet. I think more research on heterogeneous and memory-intensive architectures would help the community."</i><br /><br />"The most impressive talk was 'DNA-based Molecular Architecture with Spatially Localized Components'. It is amazing how computer architecture evolved in the last 40 years, but the ability of computing using DNA sequences is something beyond extraordinary. New secure hardware methods to prevent rootkit evasion and infection were also pretty interesting; I would like to see more talks on security in the future. Besides the technical part, the fact that the conference was held in Tel Aviv gave an exotic personality to the event. 
The dinner with the 'Jerusalem Old City Wall' sightseeing, followed by an exciting drum experience, really promoted a smooth integration between the participants."<br /><br /><br /><i>"I found two things to be quite inspiring at ISCA. The first was the initial keynote at the conference. Dr. Bhandarkar's talk drove home the fact that a lot of innovation in our field is indeed driven by these disruptive shifts to smaller, more pervasive form factors. I have always been a fan of the history of computers, but it was great to see how one person could touch on so many significant paths through that trajectory over a career. The second was a paper from the University of Washington on DNA-based computing. While it may not be the next disruptive change, it's always important to keep perspective on how we can apply technology from our field to other areas, as it opens up doors that we never even thought of before. I hope that future conferences continue to have such diverse publications, in order to encourage others in our field to also think outside the box."</i><br /><br /><br />"When I talked with fellow students at the conference, the ideas that amazed me most were actually from the session parallel with the one in which I presented: I really like the ideas in online malware detection with performance counters, as well as the notion of a provable totally secure memory system. Now that we have parallel sessions, ISCA could also do something like 3-min highlights for each paper, or a poster session for all papers. It's really a pity to miss something in the parallel session!"<br /><br /><br /><i>"I really enjoyed ISCA this year, particularly because of the broader range of research areas represented. I found the sessions on Big Data and Datacenters a great addition to the more traditional ISCA material. I also liked Power and Energy and Heterogeneity. 
I would love to see ISCA continue to take a broader definition of computer architecture research in the upcoming conferences. Additionally, the presentations themselves were of extremely high quality this year.<br /><br />However, I think the 25-minute talk slot was not long enough. Most talks had time for only one question, which is ridiculous. Part of the value of attending the conference rather than just reading the paper is to interact with the other researchers. However, often an industry person (such as someone from Google or Microsoft) with some useful insight to add did not get a chance to speak. Either the sessions should be lengthened or the talks should be shortened, but there should definitely be more Q&A time."</i></span><br />
<span style="background-color: white;"><i><br /></i><br />"Firstly, this was the nicest conference I've attended thus far in my graduate studies (out of previous ISCAs and others). The venue was in a very beautiful location, the excursion was educational and quite fun, the conference itself was very well organized, and I felt that the quality of papers this year was strong. Although I probably can't say that I fully understood each paper in this session, I thought that the Emerging Technologies session this year was very interesting, especially the paper 'DNA-based Molecular Architecture with Spatially Localized Components'. This area of research is quite different from my current focus on GPGPU architectures, but I found it very intriguing to see how they were using DNA strands to create very basic digital logic structures. As this research seems to be in its infancy, I'd be interested to see where this research goes in the future and how its applicability to the human body evolves. Commenting on one of the especially effective presentation methods, I thought that 'ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems' was presented very well. The presenter, Daniel Sanchez, was actually running the entire PowerPoint presentation through their simulation framework, showing various statistics at the bottom of the screen during the talk. I thought that this was a very cool way of highlighting the speed and applicability of such a simulator while presenting the detailed implementation in the talk."<br /><br /><br /><i>"I'll focus my comments on the Roadmap Workshop, which was the highlight of the conference for me. The talks I attended focused on where technology would be a decade from now. Doug Burger was of the opinion that the cell phone would become the primary device for all personal computation. Devices will collect large amounts of data about users and use that data to predict things like the user's health. 
The cloud would be split into two parts, a personal cloud running out of the user's home and the cloud that lives in large data centers. Privacy would be a major issue, hence all of the users' personal information would lie in the personal cloud, while anonymized data would be uploaded into the cloud. The large hoards of data in the data center cloud in combination with the personal cloud would be used to identify patterns and predict events (health related) in the life of the user. Devices would rely more on non-standard modes of input like gestures (Kinect) and touch. Personal assistant applications would become more intelligent and be able to do more than just maintain calendars. In conclusion, devices would become smarter than humans in a large variety of tasks, helped by their ability to parse through huge amounts of data. I thought this was one of the best talks at this year's ISCA.<br /><br />Other talks focused more on the end of Dennard scaling. The speakers were of the opinion that Moore's law would continue for a few more generations, but Dennard scaling was at an end because voltage no longer scales with smaller technology nodes. More exotic, one-off technologies would be used in the future. Though many believe that the only way to scale going forward is with the introduction of dark silicon, most speakers believed that dark silicon was economically infeasible. Instead, dim silicon was believed to be the solution."</i><br /><br /><br />"I found ISCA to be well organized, with a solid technical program. Almost all presentations I attended were interesting contributions, although none of them stood out enough for me to highlight one. So, I want to focus on the issue of where to organize ISCA abroad. Personally, I enjoyed ISCA being in Israel probably more than if it had been in any other place in the world, in good part due to the possibility of touring Jerusalem. 
However, I found it troubling that many people from US companies and universities that usually attend ISCA did not make it to Tel-Aviv. Cost may have been an issue (especially for students, even after factoring in the travel grant), distance/time zones may have also been an issue. Maybe even the perceived safety risks of being in Israel? Maybe this was my own personal perception and I may be wrong about it (I had not attended ISCA since 2009). I don't know how attendance numbers compare to previous ISCAs, but it would be interesting to poll regular ISCA attendees in our community to ask them why they did not go, and then consider that input for the decision about where to organize future ISCAs. Anyway, I understand the trade-offs of organizing ISCA abroad and I know it's hard to pick a place that is both interesting and easy to travel to from the US and from Europe, and that also has a team of local people in the community who can do a good job organizing it."<br /><br /><br /><i>"I think my experience at ISCA was different from that of most of the attendees because I'm a fresh grad student in the area of quantum computing. A talk that really impressed me was the DNA Computing talk. One of the challenges in presenting an emerging architecture to a general audience is providing enough background to make the topic accessible. The speaker was able to present the material in a way that gave me a good understanding of the challenges and innovations in DNA computing in 20 minutes without getting bogged down in details."</i><br /><br />"I particularly liked the emerging technology session. Those are wild yet reasonable and well-developed ideas. Another paper I liked is the convolution engine one from Stanford. It has a clear motivation and convincing solution, as well as rich data that not only supports their own work, but also gives me good intuition on energy consumption distribution in modern processors. 
I also benefited a lot from the DRAM session."<br /><br /><br /><i>"For future ISCAs, I would like to see the conference scope slightly extended such that application specialists can find their place in the conference. Applications are a driving factor in the development of new systems. While PPoPP and HPCA are co-located, there is surprisingly little interaction between these two communities even though their interests overlap."</i></span>Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-24961822316984965152013-01-07T00:02:00.000-07:002013-01-07T00:02:25.589-07:00Observations on the Flipped ClassroomI engaged in a <a href="http://utaharch.blogspot.com/2012/08/flipping-my-classroom.html">"flipped classroom"</a> experiment for my <a href="http://www.eng.utah.edu/~cs6810/">graduate computer architecture class</a> last semester. In an ideal flipped classroom, the lecturing happens outside class (via on-line videos), and class time is used entirely for problem-solving and discussions. For a number of reasons, I couldn't bring myself to completely eliminate in-class lecturing. So in a typical lecture, I'd spend 80% of the time doing a quick recap of the videos, clarifying doubts, covering advanced topics, etc. The remaining 20% of the time was used to solve example problems. My observations below pertain to this modified flipped classroom model: <br />
<ol>
<li>My estimate is that 80-90% of the students did watch videos beforehand, which was a pleasant surprise. The percentage dropped a bit as the semester progressed because (among other things) students realized that I did fairly detailed recaps in class.</li>
<li>I felt that students did learn more this year than in previous years. Students came to exams better prepared and the exam scores seemed to be higher. In their teaching evaluations, students commented that they liked this model. FWIW, the teaching evaluation scores were slightly higher -- an effective course rating of 5.64/6 this year (my previous offerings of this class have averaged a score of 5.54).</li>
<li>In my view, the course was more effective this year primarily because students had access to videos. Based on YouTube view counts, it was clear that students were re-visiting the pertinent videos while working on their assignments and while studying for exams. The in-class problem-solving was perhaps not as much of a game-changer.</li>
<li>Making a video is not as much fun for me as lecturing in class. From a teacher's perspective, this model is more work and less fun. But it's gratifying to have impact beyond the physical campus.</li>
<li>Miscellaneous observations about the recording process: I used Camtasia Studio to create screencasts and I thought it worked great; I didn't spend much time on editing and only chopped out about 10 seconds worth of audio blemishes per video; a headset mic is so much better than a laptop's in-built mic (too bad it took me 7 videos to realize this).</li>
</ol>
Next year, I'll try to make the course more effective by doing more problem-solving and less lecturing in class. Next year will also be a lot less stressful since the videos are already in place :-).Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com3tag:blogger.com,1999:blog-6579022026921658989.post-41226989629935713382012-11-17T13:13:00.000-07:002012-11-17T13:13:29.580-07:00Golden BoyA brief moment of humor (and pain) in the middle of deadline chaos (true story):<br />
<br />
8am, Saturday. Productive grad student (aka Golden Boy) sends email: I am derailed because of this blinding pain in my elbow. Need to go to the ER.<br />
<br />
Prof: Ouch! What's the diagnosis?<br />
<br />
Golden Boy: Gave me some painkillers. repetitive stress injury. elbow is inflamed - will x-ray after the inflammation has gone down. can't keep the elbow at an angle -typing with left hand.<br />
<br />
Prof: Ouch! First ISCA casualty in 10 years! :-)<br />
<br />
Golden Boy: not giving up because of this. will continue the expts in the evening. might not happen. but still<br />
<br />
Prof: gasping out his last words... two at a time.. hindi movie style...<br />
<br />
Golden Boy: :D ... from phone.<br />
<br />
Golden Boy (1 minute later): how do we calculate [...] energy ? kshitij's model ?<br />
<br />
<br />
Meanwhile, repercussions around the world...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-VmrB_FViNdg/UKfuVTvjlHI/AAAAAAAAr6w/WkbQLZo7vFw/s1600/goldenboy-pg1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-VmrB_FViNdg/UKfuVTvjlHI/AAAAAAAAr6w/WkbQLZo7vFw/s400/goldenboy-pg1.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-yEa_6HHiJp0/UKfuX6sR_XI/AAAAAAAAr64/duzRLE_0oXk/s1600/goldenboy-pg2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-yEa_6HHiJp0/UKfuX6sR_XI/AAAAAAAAr64/duzRLE_0oXk/s400/goldenboy-pg2.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-wDm3w2iAbyY/UKfuZT_ClWI/AAAAAAAAr7A/OgXWxn6nCfU/s1600/goldenboy-pg3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-wDm3w2iAbyY/UKfuZT_ClWI/AAAAAAAAr7A/OgXWxn6nCfU/s400/goldenboy-pg3.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-FEg_8jde3FI/UKfua-ERZgI/AAAAAAAAr7I/2xZJJZ7oPXk/s1600/goldenboy-pg4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://2.bp.blogspot.com/-FEg_8jde3FI/UKfua-ERZgI/AAAAAAAAr7I/2xZJJZ7oPXk/s400/goldenboy-pg4.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-QZVPTkUkLAE/UKfucfK-JtI/AAAAAAAAr7Q/PaMKZXxIMjE/s1600/goldenboy-pg5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://4.bp.blogspot.com/-QZVPTkUkLAE/UKfucfK-JtI/AAAAAAAAr7Q/PaMKZXxIMjE/s400/goldenboy-pg5.jpg" width="400" /></a></div>
<br />Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-9231579621736952082012-11-10T09:08:00.000-07:002012-11-10T09:08:03.147-07:00The Magical Paper Submission<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-i7HtcXFqrhA/UJ4VWLoODEI/AAAAAAAAr6I/6XoO7g5KQq8/s1600/cartoon-nov2012-pg1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-i7HtcXFqrhA/UJ4VWLoODEI/AAAAAAAAr6I/6XoO7g5KQq8/s400/cartoon-nov2012-pg1.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-K9o7AWeEoL0/UJ4VYI4XasI/AAAAAAAAr6Q/t3MZVbcTPjg/s1600/cartoon-nov2012-pg2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://1.bp.blogspot.com/-K9o7AWeEoL0/UJ4VYI4XasI/AAAAAAAAr6Q/t3MZVbcTPjg/s400/cartoon-nov2012-pg2.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-eCJ50gPdHNo/UJ4VbHhfucI/AAAAAAAAr6Y/wYtMoCI4DgU/s1600/cartoon-nov2012-pg3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://4.bp.blogspot.com/-eCJ50gPdHNo/UJ4VbHhfucI/AAAAAAAAr6Y/wYtMoCI4DgU/s400/cartoon-nov2012-pg3.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-kxeeRsdHaO0/UJ4Vd9UiReI/AAAAAAAAr6g/E-9u3T1ROSU/s1600/cartoon-nov2012-pg4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://1.bp.blogspot.com/-kxeeRsdHaO0/UJ4Vd9UiReI/AAAAAAAAr6g/E-9u3T1ROSU/s400/cartoon-nov2012-pg4.jpg" width="400" /></a></div>
<br />Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com2tag:blogger.com,1999:blog-6579022026921658989.post-29115657048437453802012-08-20T18:18:00.000-06:002012-08-20T18:18:45.255-06:00Flipping my ClassroomI'll be experimenting with a <a href="http://en.wikipedia.org/wiki/Flip_teaching">"Flipped Classroom"</a> model in my graduate computer architecture course this Fall.<br /><br />I'll be recording key lectures beforehand and uploading them to YouTube. I'll ask students to watch these lectures before class and use class time to solve problems and clarify doubts.<br /><br />I buy the argument that a video (that can be paused, rewound, re-visited) is a more effective learning tool than listening to the same person speak in class. I hope that the classroom experience will continue to be fun and interactive. And I also hope that students have the discipline to listen to the videos before class. If done right, this should lead to more effective and efficient learning -- more time spent outside class listening to lectures, but a lot less time spent outside class on assignments.<br /><br />The videos will be in a screencast format: only showing the prepared slides and my run-time handwriting on them (no hands or face). It won't be glitzy; I will fumble around at times. If this model works for <a href="http://www.khanacademy.org/">Khan Academy</a>, it can't be a terrible idea :-).<br /><br />If you're curious about the process: I'm using Powerpoint on my tablet-PC, combined with the <a href="http://www.techsmith.com/camtasia.html">Camtasia Studio</a> Add-In. So far, I've been willing to "accept" my videos after 4-6 takes. I do very minor editing (cropping out the occasional cough). Including prep time, I'm averaging about 3-4 hours of effort per video (note that I've taught this material many times). 
In addition, I'll have to create new problems that I can use for discussion in class.<br /><br />The videos can be found <a href="http://www.youtube.com/channel/UCpLRApMqaLHiLOB8RLcXLOQ">here</a> (only the first six are ready as of today). <br />
The class website is <a href="http://www.eng.utah.edu/%7Ecs6810/">here</a>.Rajeev Balasubramonianhttp://www.blogger.com/profile/00296555996183126021noreply@blogger.com0tag:blogger.com,1999:blog-6579022026921658989.post-54252278849632493422012-06-19T13:28:00.000-06:002012-06-19T13:30:05.061-06:00Memory Scheduling Championship Wrap-UpThe MSC Workshop was last week at <a href="http://isca2012.ittc.ku.edu/">ISCA</a>, featuring a strong set of presentations. The <a href="http://www.cs.utah.edu/%7Erajeev/jwac12/">MSC webpage</a> has all the final results and papers. Congrats to the winners of each track! Thanks to Zeshan Chishti and Intel for sponsoring the trophies and award certificates (pictured below with the winners of each track). Numerically speaking, the workshop had about 30 attendees, we received 11 submissions, and the USIMM code has received 1254 downloads to date (261 downloads for version 1.3).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-GTXHXchXPDI/T-DO08dsgmI/AAAAAAAAqWc/ivlIWaaZp2c/s1600/jwac1.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" height="211" src="http://4.bp.blogspot.com/-GTXHXchXPDI/T-DO08dsgmI/AAAAAAAAqWc/ivlIWaaZp2c/s320/jwac1.JPG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Ikeda et al., Winners of the Performance Track.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-uhVr8SliD4E/T-DO3GlONOI/AAAAAAAAqWk/DSA3i29ZfKY/s1600/jwac3.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" height="211" src="http://2.bp.blogspot.com/-uhVr8SliD4E/T-DO3GlONOI/AAAAAAAAqWk/DSA3i29ZfKY/s320/jwac3.JPG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Fang et al., Winners of the Energy-Delay Track.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-Ef7REJWvhN0/T-DO5J_mLVI/AAAAAAAAqWs/GSAtdlSd3Pg/s1600/jwac2.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" height="211" src="http://1.bp.blogspot.com/-Ef7REJWvhN0/T-DO5J_mLVI/AAAAAAAAqWs/GSAtdlSd3Pg/s320/jwac2.JPG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Ishii et al., Winners of the Performance-Fairness Track.</td></tr>
</tbody></table>
<br />
Some important take-home messages from the competition:<br />
<ol>
<li>It is nearly impossible to win a scheduler competition with a scheduler that implements a single new idea. A good scheduler is a combination of multiple optimizations. All of the following must be carefully handled: read/write interference, row buffer locality, early precharge, early activates, refresh, power-up/down, fairness, etc. The <a href="http://www.cs.utah.edu/%7Erajeev/jwac12/papers/paper2.pptx">talk by Ishii et al.</a> did an especially good job combining multiple techniques and breaking down the contribution of each technique. During the workshop break, some of the audience suggested building a Franken-scheduler (along the lines of Gabe Loh's <a href="http://www.jilp.org/cbp/Gabe.pdf">Frankenpredictor</a>) that combined the best of all submitted schedulers. I think that idea has a lot of merit.</li>
<li>If I had to pick a single scheduling artifact that seemed to play a strong role in many submitted schedulers, it would be the smart handling of read/write interference. Because baseline memory write handling <a href="http://www.cs.utah.edu/%7Erajeev/pubs/hpca12b.pdf">increases execution time by about 36%</a>, it offers the biggest room for improvement.</li>
<li>Our initial experiments seemed to reveal that all three metrics (performance, energy-delay, and performance-fairness) were strongly correlated, i.e., we expected that a single scheduler would win on all three metrics. It was a surprise that we had different winners in each track. Apparently, each winning scheduler had a wrinkle that gave it the edge for one metric.</li>
</ol>
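To make the read/write interference point in the list above concrete, here is a minimal sketch of the high/low watermark write-drain policy that many submissions built on: service reads until the write queue nearly fills, then drain writes in a burst so the bus turnaround penalty is paid rarely. This is a generic illustration written for this post, not code from any submitted scheduler; the queue capacity and thresholds are made-up values.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative watermark values; real schedulers tune these. */
#define WQ_CAPACITY 64
#define HI_WM 40   /* start draining writes above this occupancy */
#define LO_WM 20   /* stop draining once occupancy falls below this */

typedef struct {
    int write_q_occupancy;
    bool draining_writes;
} sched_state_t;

/* Returns true if the scheduler should issue a write this cycle.
 * The hysteresis between HI_WM and LO_WM keeps writes batched,
 * minimizing read/write bus direction switches. */
bool prefer_write(sched_state_t *s) {
    if (s->draining_writes) {
        if (s->write_q_occupancy <= LO_WM)
            s->draining_writes = false;   /* burst done; back to reads */
    } else {
        if (s->write_q_occupancy >= HI_WM)
            s->draining_writes = true;    /* queue nearly full; drain */
    }
    return s->draining_writes;
}
```

Note the state machine stays in drain mode even after occupancy falls below HI_WM; that batching is the whole point.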
The <a href="http://utaharch.blogspot.com/2012/02/usimm.html">USIMM</a> simulation infrastructure seemed to work well for the competition, but there are a few things we'd like to improve upon in the coming months:<br />
<ol>
<li>The current processor model is simple; it does not model instruction dependences and assumes that all memory operations within a reorder buffer window can be issued as soon as they are fetched. Adding support to model dependences would make the traces larger and slow down the simulator (hence, it wasn't done for the competition).</li>
<li>We have already integrated USIMM with our version of Simics. This automatically takes care of the dependency problem in the previous bullet, but the simulations are much slower. In general, reviewers at top conferences will prefer the rigor of execution-driven simulation over the speed of trace-driven simulation. It would be worthwhile to understand how conclusions differ with the two simulation styles.</li>
<li>The <a href="https://wiki.umd.edu/DRAMSim2/">DRAMSim2</a> tool has an excellent validation methodology. We'd like to possibly re-create something similar for USIMM.</li>
<li>We'd like to augment the infrastructure to support prefetched data streams.</li>
<li>Any new scheduler would have to compare itself against other state-of-the-art schedulers. The USIMM infrastructure already includes a simple FCFS and an opportunistic close page policy, among others. All the code submitted to the competition is on-line as well. It would be good to also release a version of the <a href="http://users.ece.cmu.edu/%7Eomutlu/pub/tcm_micro10.pdf">TCM algorithm (MICRO'10)</a> in the coming months.</li>
</ol>
If you have an idea for a future research competition, please email the JWAC organizers, Alaa Alameldeen (Intel) and Eric Rotenberg (NCSU).<br />
<br />
<b>Introducing simtrax</b> (June 15, 2012)<br />
<br />
The Hardware Ray Tracing group has just released our architectural simulator, simtrax. The simulator, compilation tools, and a number of example codes can all be checked out from the <a href="http://code.google.com/p/simtrax/">Google Code repository</a>. We have a couple of tutorial pages on the simtrax wiki that will hopefully get you started.<br />
<br />
You may be wondering what kind of architecture simtrax simulates. The answer is a number of different configurations of the architecture we call TRaX (for Threaded Ray eXecution), which is designed specifically for throughput processing of a ray tracing application. As such, the simulator is not well suited to a large number of application spaces, but it shows quite good results for ray tracing and would perform well on other similar applications.<br />
<br />
In particular, the limitations built into the TRaX architecture and assumed by the simulator include the following:<br />
<br />
<ul>
<li>No cache coherence – at present, memory is kept coherent, but no simulation penalty is added to maintain coherence. We depend on code being written so that it does not rely on coherence.</li>
<li>No write caching – writes are assumed to write around the cache, and the lines are invalidated if they were cached, but again, no coherence messages are sent to the other caches in the system.</li>
<li>Small stack sizes – each thread has a small local store for the program call stack. Recursion and deep call stacks can cause overflow or excessive thread-context size.</li>
<li>A single global scene data structure – threads should load relevant data only when it is needed.</li>
</ul>
<div>
The architecture we simulate is hierarchical in nature. First, there are a number of independent, in-order, scalar thread processors. These are grouped into sets (usually of 32) that compose what is called a Threaded Multiprocessor (TM). While each thread processor has a few independent execution units, some of the larger, less frequently used units are shared among the threads in the TM. In addition, the TM shares a single L1 data cache and a number of instruction caches, both of which have multiple independent banks, allowing independent fetches of distinct words.</div>
<div>
<br /></div>
<div>
A number of TMs are grouped together to share access to an L2 data cache, which then manages the connection to the off-chip global memory. In a large chip there may be as many as 4 separate L2 data caches, while a smaller chip might do away with the L2 data cache altogether. The final piece of the puzzle is a small global register file with support for an atomic increment operation, which is used to give out thread work assignments.</div>
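The atomic-increment work distribution mentioned above can be sketched as follows. This is an illustrative model written for this post, not simtrax's actual interface; the tile size, pixel count, and function names are assumptions, and the atomic is modeled with a plain function since the sketch is single-threaded (on TRaX the increment is a hardware atomic on a global register).

```c
#include <assert.h>

#define TILE_SIZE 64      /* pixels handed out per increment (assumed) */
#define NUM_PIXELS 1024   /* total image size (assumed) */

static int global_counter = 0;   /* the shared work pointer */

/* Stand-in for the hardware fetch-and-add on the global register. */
int atomic_inc(int *ctr, int amount) {
    int old = *ctr;
    *ctr += amount;
    return old;
}

/* Each thread loops, grabbing the next tile of pixels until the
 * image is consumed. Returns the number of pixels it processed. */
int worker(void) {
    int done = 0;
    for (;;) {
        int start = atomic_inc(&global_counter, TILE_SIZE);
        if (start >= NUM_PIXELS) break;        /* no work left */
        int end = start + TILE_SIZE;
        if (end > NUM_PIXELS) end = NUM_PIXELS;
        for (int px = start; px < end; px++)
            done++;                            /* trace_ray(px) would go here */
    }
    return done;
}
```

Because the counter only moves forward, threads never duplicate work and load balance emerges naturally from whichever thread finishes its tile first.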
<div>
<br /></div>
<div>
This simulator has been in development for a number of years and has had many contributors. In addition, we have taught a course using the simulation and compiler tools, and published four conference and journal articles, with more in progress.</div>
<div>
<br /></div>
<div>
We hope this simulation tool can be useful and intend to do what we can to support it. Feel free to email us and post bug reports on the Google Code page. -- Josef</div>
<br />
<b>USIMM</b> (February 22, 2012)<br />
<br />
We've just released the simulation infrastructure for the <a href="http://www.cs.utah.edu/%7Erajeev/jwac12/">Memory Scheduling Championship (MSC)</a>, to be held at <a href="http://isca2012.ittc.ku.edu/">ISCA</a> this year.<br />
<br />
The central piece in the infrastructure is USIMM, the Utah SImulated Memory Module. It reads in application traces, models the progression of application instructions through the reorder buffers of a multi-core processor, and manages the memory controller read/write queues. Every memory cycle, USIMM checks various DRAM timing parameters to figure out the set of memory commands that can be issued in that cycle. It then hands control to a scheduler function that picks a command from this candidate set. An MSC contestant only has to modify the scheduler function, i.e., restrict all changes to scheduler.c and scheduler.h. This clean interface makes it very easy to produce basic schedulers. Each of my students produced a simple scheduler in a matter of hours; these have been included in the code distribution as examples to help get one started.<br />
<br />
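To give a flavor of what such a scheduler function does, here is a minimal sketch of an FR-FCFS-style command picker: prefer the oldest request that hits in an open row buffer, else fall back to the oldest issuable request. The types and field names below are simplified stand-ins invented for this post, not the actual USIMM API; see scheduler.c in the distribution for the real interface.

```c
#include <assert.h>

/* Simplified stand-in for a queued memory request. */
typedef struct {
    long long arrival_time;   /* for age-based ordering */
    int row_hit;              /* 1 if it targets the currently open row */
    int issuable;             /* 1 if all DRAM timing checks pass this cycle */
} request_t;

/* Returns the index of the request to issue this cycle, or -1 if
 * nothing is issuable. Row hits are cheap (no precharge/activate),
 * so an issuable row hit beats an older row miss. */
int schedule(request_t *q, int n) {
    int oldest = -1, oldest_hit = -1;
    for (int i = 0; i < n; i++) {
        if (!q[i].issuable) continue;
        if (oldest < 0 || q[i].arrival_time < q[oldest].arrival_time)
            oldest = i;
        if (q[i].row_hit &&
            (oldest_hit < 0 || q[i].arrival_time < q[oldest_hit].arrival_time))
            oldest_hit = i;
    }
    return (oldest_hit >= 0) ? oldest_hit : oldest;
}
```

A competitive entry would layer write draining, early precharge/activate, and fairness heuristics on top of a core loop like this one.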
In the coming weeks, we'll release a number of traces that will be used for the competition. The initial distribution includes five short single-thread traces from PARSEC that people can use for initial testing.<br />
<br />
The competition will be judged in three different tracks: performance, energy-delay-product (EDP), and performance-fairness-product (PFP). The final results will be based on the most current version of USIMM, as of June 1st 2012.<br />
<br />
We request that contestants focus on scheduling algorithms that are easily implementable, i.e., doable within a few processor cycles and within a 68 KB storage budget. A program committee will evaluate the implementability of your algorithm, among other things.<br />
<br />
We'll post updates and bug fixes in the comments section of this blog post as well as to the usimm-users@cs.utah.edu mailing list (<a href="http://mailman.cs.utah.edu/mailman/listinfo/usimm-users">sign up here</a>). Users are welcome to use the blog or mailing list to post their own suggestions, questions, or bug reports. Email usimm@cs.utah.edu if you have a question for just the code developers.<br />
<br />
<b>Code Updated on 04/17/2012:</b><br />
<br />
Code download: <a href="http://www.cs.utah.edu/%7Erajeev/usimm-v1.3.tar.gz">http://www.cs.utah.edu/~rajeev/usimm-v1.3.tar.gz</a><br />
<br />
Changes in Version 1.1: <a href="http://www.cs.utah.edu/%7Erajeev/pubs/usimm-appA.pdf">http://www.cs.utah.edu/~rajeev/pubs/usimm-appA.pdf</a><br />
Changes in Version 1.2: <a href="http://www.cs.utah.edu/%7Erajeev/pubs/usimm-appB.pdf">http://www.cs.utah.edu/~rajeev/pubs/usimm-appB.pdf</a><br />
Changes in Version 1.3: <a href="http://www.cs.utah.edu/%7Erajeev/pubs/usimm-appC.pdf">http://www.cs.utah.edu/~rajeev/pubs/usimm-appC.pdf</a> <br />
<br />
USIMM Tech Report: <a href="http://www.cs.utah.edu/%7Erajeev/pubs/usimm.pdf">http://www.cs.utah.edu/~rajeev/pubs/usimm.pdf</a><br />
<br />
The contest website: <a href="http://www.cs.utah.edu/%7Erajeev/jwac12/">http://www.cs.utah.edu/~rajeev/jwac12/</a><br />
<br />
Users mailing list sign-up: <a href="http://mailman.cs.utah.edu/mailman/listinfo/usimm-users">http://mailman.cs.utah.edu/mailman/listinfo/usimm-users</a><br />
<br />
<br />
<b>Trip Report -- NSF Workshop -- WETI</b> (February 8, 2012)<br />
<br />
I was at an NSF-sponsored <a href="http://weti.cs.ohiou.edu/">Workshop on Emerging Technologies for Interconnects (WETI)</a> last week that aimed to frame important interconnect research directions. I encourage everyone to check out the <a href="http://weti.cs.ohiou.edu/program.html">talk slides</a>; talk videos will also soon be posted. In the coming months, a detailed report will be written to capture the discussion. This post summarizes some personal take-home messages.<br /><br />
1. <b>Applications of Photonics: </b>An important conclusion in my view was that photonics offers little latency, energy, and bandwidth advantage for on-chip communication. Its primary advantage is for off-chip communication. It is also worthwhile to look at limited long-distance on-chip communication with photonics. For example, if a photonic signal has entered a chip, you might as well take the signal to a point near the destination, thus reducing the cost of global wire traversal. Nearly half the workshop focused on photonics; many of the challenges appeared to be at the device level.<br />
<br />
<b>2. Processing in Memory: </b>Our group has some initial work on processing-in-memory (PIM) with 3D chip stacks. It was reassuring to see that many people believe in PIM. Because it reduces communication distance, it is viewed as a vital ingredient in the march towards energy-efficient exascale computing. However, to distinguish such ideas from those in the 1990s, it is best to market them as "processing near memory". :-)<br />
<br />3. <b>Micron HMC: </b>The <a href="http://weti.cs.ohiou.edu/gurtej_weti.pdf">talk by Gurtej Sandhu</a> of Micron had some great details on the Hybrid Memory Cube (HMC). An HMC-based system sees significant energy contributions from the DRAM arrays, the logic layer on the 3D stack, and the host interface (the memory controller on the processor). SerDes circuits account for 66% of the power in the logic layer.<br /><br />4. <b>Electrical Interconnect Scaling:</b> <a href="http://weti.cs.ohiou.edu/shekhar_weti.pdf">Shekhar Borkar's talk</a> was interesting as always. He reiterated that mesh NoCs are overkill and hierarchical buses are the way forward. The wire energy for a 16 mm traversal matches the energy cost per bit for a router; frequent routers therefore get in the way of energy efficiency. He pointed out that the NoC in the <a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4443212">Intel 80-core Polaris</a> contributed 28% to chip power because the computational units were so simple. The NoC in <a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5434077">Intel's SCC chip</a> consumes more power than the NoC in Polaris, but the overall contribution is lower (10%), because the cores are beefier and more realistic. In moving from 45 nm to 7 nm, compute energy will reduce by 6x; by contrast, the electrical interconnect energy to travel a fixed length on-chip reduces by only 1.6x and the energy for off-chip interconnect reduces by less than 2x. So the communication energy bottleneck will grow, unless we can reduce communication and communication distances.<br /><br />5. <b>Miscellaneous: </b>There was a buzz about near threshold computing (NTC). It appears to be one of the few big arrows left in the quiver for processor energy efficiency. It was also one of many techniques that <a href="http://weti.cs.ohiou.edu/patrick_weti.pdf">Patrick Chiang mentioned</a> for energy-efficient communication. 
He also talked about low-swing signaling, transmission lines, and wireless interconnects. <a href="http://weti.cs.ohiou.edu/pradip_weti.pdf">Pradip Bose's talk</a> had lots of interesting power breakdowns, also showing trends for the IBM Power series.<br />
<br />
<b>Waking Up to Bottleneck Realities!</b> (January 15, 2012)<br />
<br />
NSF is organizing a <a href="http://atrak.usc.edu/cpom/index.htm">workshop on Cross-Layer Power Optimization and Management (CPOM)</a> next month. The workshop will identify important future research areas to help guide funding agencies and the research community at large. Below is my position statement for the same. While the DRAM main memory system is a primary bottleneck in most platforms, the computer architecture community has been slow to react with innovations that look beyond the processor chip.<br />
<br />
<br />
<b><u style="color: #660000;">SPOT QUIZ:</u> </b><i>What is an SMB?</i><br />
Hint: It consumes 14 W of power and there could be 32 of these in an 8-socket system.<br />
<div style="color: #660000;">
<br /></div>
<div style="color: #660000;">
<u><b>FACTS:</b></u></div>
<b>Bottlenecks:</b><br />
<ul>
<li>The memory system accounts for 20-40% of total system power. Significant power is dissipated in DRAM chips, on-board buffer chips, and the memory controller.</li>
<li>Single DRAM chip power <a href="http://www.micron.com/support/dram/power-calc">(Micron power calculator)</a>: 0.5 W. On-board buffer chip power <a href="http://www.intel.com/content/www/us/en/chipsets/7500-7510-7512-scalable-memory-buffer-datasheet.html">(Intel SMB datasheet)</a>: 14 W. Memory controller power <a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5621843">(Intel SCC prototype)</a>: 19-69% of chip power.</li>
<li>Future memory systems: 3D stacks, more on-board buffering, higher channel frequencies, higher refresh overheads.</li>
<li>And ... we have an off-chip memory bandwidth problem! Pin counts have stagnated.</li>
</ul>
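The bullets above compose into a striking back-of-envelope total: the buffer chips alone can rival the processors. A quick sanity check, using the quoted 14 W per SMB and ~0.5 W per DRAM chip; the 32-SMB count comes from the spot quiz, but the 64-DIMM, 18-chips-per-DIMM configuration is my assumed example, not from the datasheets.

```c
#include <assert.h>

/* Back-of-envelope memory-system power for a large 8-socket server.
 * 14 W per SMB (Intel datasheet), ~0.5 W per DRAM chip (Micron
 * power calculator). Integer arithmetic: 0.5 W = 5 deciwatts. */
int smb_power_w(int num_smbs)    { return num_smbs * 14; }
int dram_power_w(int num_chips)  { return num_chips * 5 / 10; }

int memory_power_w(void) {
    int smb  = smb_power_w(32);          /* 448 W in buffer chips alone */
    int dram = dram_power_w(64 * 18);    /* 1152 DRAM chips -> 576 W */
    return smb + dram;                   /* ~1 kW before counting controllers */
}
```

Roughly a kilowatt in the memory system before the memory controllers are even counted, which is why the 20-40% figure above should not be surprising.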
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-Y1JrU25Cvyk/TxM5yxUAkEI/AAAAAAAAmbQ/gj1QUUh5L5A/s1600/macenroe2.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="200" src="http://4.bp.blogspot.com/-Y1JrU25Cvyk/TxM5yxUAkEI/AAAAAAAAmbQ/gj1QUUh5L5A/s200/macenroe2.jpg" width="156" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">You Cannot Be Serious!!</td></tr>
</tbody></table>
<b>(Exaggerated) State of Current Research: </b> <br />
<ul>
<li>Number of papers on volatile memory systems: ~1 per conference.</li>
<li>Number of papers on the processor-memory interconnect: ~1 per year.</li>
<li>Number of papers that have to define the terms "rank" and "bank": all of them.</li>
<li>Year of first processor paper to leverage DVFS: 2000. Year of first memory paper to leverage DVFS: 2011.</li>
<li>Percentage of readers that have to look up the term "SMB": > 89%. (ok, I made up that fact :-) ... but I bet I'm right)</li>
</ul>
<br />
<div class="MsoNormal">
For every 1,000 papers written on the processor, 20 papers are written on the memory system, and 1 paper is written on the processor-memory interconnect. This is absurd given that the processor and memory are the two fundamental elements of any computer system and memory energy can exceed processor energy. While the routers in an NoC have been heavily optimized, the community understands very little about the off-chip memory channel. The memory system is a very obvious fertile area for future research.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><u style="color: #660000;">QUIZ 1:</u> </b> </div>
<div class="MsoNormal">
Most ISCA attendees know what a virtual channel is, but most would be hard-pressed to answer 2 of the following 5 basic memory channel questions:</div>
<ol>
<li>What is FB-DIMM?</li>
<li>What is an SMI?</li>
<li>Why are buffer chips placed between the memory controller and DRAM chips?</li>
<li>What is SERDES and why is it important?</li>
<li>Why do the downstream and upstream SMI channels have asymmetric widths?</li>
</ol>
<div style="color: #660000;">
<b><u>QUIZ 2:</u> </b></div>
Many ISCA attendees know the difference between <i>PAp</i> and <i>GAg</i> branch predictor configurations, but most will struggle to answer the following basic memory system questions:<br />
<ol>
<li>How many DRAM sub-arrays are activated to service one cache line request?<a href="http://2.bp.blogspot.com/-YVniF6Kw6N4/TxM9KOx2h0I/AAAAAAAAmbY/ahy8QpYHpqg/s1600/fig-nsfcpom.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="http://2.bp.blogspot.com/-YVniF6Kw6N4/TxM9KOx2h0I/AAAAAAAAmbY/ahy8QpYHpqg/s200/fig-nsfcpom.jpg" width="200" /></a></li>
<li>What circuit implements the DRAM row buffer?</li>
<li>Where is a row buffer placed?</li>
<li>Why do DRAM chips not implement row buffer caches?</li>
<li>What is overfetch?</li>
<li>What is tFAW?</li>
<li>Describe a basic algorithm to implement chipkill. (What is chipkill?)</li>
<li>What is scrubbing?</li>
</ol>
In early 2009 (before my foray into memory systems), I would have scored zero on both quizzes. Such a level of ignorance is perhaps ok for esoteric topics... but unforgivable for a component that accounts for 25% of server power.<br />
<div style="color: #660000;">
<br /></div>
<div style="color: #660000;">
</div>
<div style="color: #660000;">
<u><b>ACTION ITEMS FOR THE RESEARCH COMMUNITY:</b></u></div>
<ul>
<li>Identify a priority list of bottlenecks. Step outside our comfort zone to learn about new system components. Increase memory system coverage in computer architecture classes.</li>
<li>Find ways to address obvious sources of energy inefficiencies in the memory system: reduce overfetch, improve row buffer hit rates, reduce refresh power.</li>
<li>Find ways to leverage 3D stacking of memory and logic. Exploit 3D to take our first steps in the area of DRAM chip modifications (an area that has traditionally been off-limits).</li>
<li>Understand the considerations in designing memory channels and on-board buffer chips. Propose new channel architectures and new microarchitectures for buffer chips.</li>
<li>Understand memory controller microarchitectures and design complexity-effective memory controllers.</li>
<li>Design new architectures that integrate photonics and NVMs in the memory system.</li>
</ul>
<b>ISCA Deadline</b> (November 25, 2011)<br />
<br />
I'm guessing everyone's digging themselves out after the <a href="http://isca2012.ittc.ku.edu/">ISCA</a> deadline. I had promised not to talk about energy drinks and deadlines, so let's just say that our deadline week looked a bit like <a href="http://www.youtube.com/watch?v=Dy8EsHZr6DI">this</a> :-)... without the funky math of course.<br /><br />I'm constantly hounding my students to get their writing done early. Good writing doesn't magically show up 24 hours before the deadline. While a Results section will undoubtedly see revisions in the last few days, there's no valid excuse for not finishing most other sections well before the deadline.<br /><br />The graph below shows two examples of how our writing evolved this year... better than in previous years, but still not quite early enough! We also labored more than usual to meet the 22-page budget... I guess we got a little spoilt after the 26-page HPCA'12 format. I am amazed that producing a refined draft one week before the deadline is an impossible task, but producing a refined 22-page document 3 minutes before the deadline is a virtual certainty. I suppose I can blame <a href="http://en.wikipedia.org/wiki/Parkinson%27s_law">Parkinson's Law</a>, not my angelic students :-), for accelerating my aging process...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-SRKINTH7fOI/TtCHf5F6d1I/AAAAAAAAlUw/HQO65nBBgUE/s1600/blog-isca12-progress.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://2.bp.blogspot.com/-SRKINTH7fOI/TtCHf5F6d1I/AAAAAAAAlUw/HQO65nBBgUE/s400/blog-isca12-progress.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-B_s61Ufkfw0/TtCOW3B3lXI/AAAAAAAAlVA/Wq3L8SQAlfQ/s1600/blog-isca12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-B_s61Ufkfw0/TtCOW3B3lXI/AAAAAAAAlVA/Wq3L8SQAlfQ/s400/blog-isca12.png" width="400" /></a></div>
<br />
<br />
<b>Memory Has an Eye on Disk's Space</b> (November 8, 2011)<br />
<br />
We recently read a couple of SOSP papers in our reading group: <a href="http://www.stanford.edu/%7Eouster/cgi-bin/papers/ramcloud-recovery.pdf">RAMCloud</a> and <a href="http://www.cs.cmu.edu/%7Efawnproj/papers/fawn-sosp2009.pdf">FAWN</a>. These are terrific papers with significant implications for architects and application developers. Both papers target the design of energy-efficient and low-latency datacenter platforms for a new breed of data-intensive workloads. FAWN uses many wimpy nodes and a Flash storage system; <a href="http://www.stanford.edu/%7Eouster/cgi-bin/papers/ramcloud.pdf">RAMCloud</a> replaces disk with DRAM. While the two papers share many arguments, I'll focus the rest of this post on RAMCloud because its conclusion is more surprising.<br />
<br />
In <a href="http://www.stanford.edu/%7Eouster/cgi-bin/papers/ramcloud.pdf">RAMCloud</a>, each individual server is effectively disk-less (disks are only used for back-up and not to service application reads and writes). All data is placed in DRAM main memory. Each server is configured to have high memory capacity and every processor has access via the network to the memory space of all servers in the RAMCloud. It is easy to see that such a system should offer high performance because high-latency disk access (a few milliseconds) is replaced by low-latency DRAM+network access (microseconds).<br />
<br />
An immediate architectural concern that comes to mind is cost. DRAM has a dollar/GB purchase price that is <a href="http://www.stanford.edu/%7Eouster/cgi-bin/papers/ramcloud.pdf">50-100 X</a> higher than that of disk. A server with 10 TB of disk space costs $2K, while a server with 64 GB DRAM and no disk costs $3K <a href="http://www.cs.cmu.edu/%7Efawnproj/papers/fawn-sosp2009.pdf">(2009 data from the FAWN paper)</a>. It's a little more tricky to compare the power consumption of DRAM and disk. An individual access to DRAM consumes much less energy than an individual access to disk, but DRAM has a higher static energy overhead (the cost of refresh). If the access rate is high enough, DRAM is more energy-efficient than disk. For the same server example as above, the server with the 10 TB high-capacity disk has a power rating of 250 W, whereas the server with the 64 GB high-capacity DRAM memory has a power rating of 280 W <a href="http://www.cs.cmu.edu/%7Efawnproj/papers/fawn-sosp2009.pdf">(2009 data from the FAWN paper)</a>. This is not quite an apples-to-apples comparison because the DRAM-bound server services many more requests at 280 W than the disk-bound server at 250 W. But it is clear that in terms of operating (energy) cost per GB, DRAM is again much more expensive than disk. Note that total cost of ownership (TCO) is the sum of capital expenditure (capex) and operational expenditure (opex). The above data points make it appear that RAMCloud incurs a huge penalty in terms of TCO.<br />
<br />
However, at least to my initial surprise, the opposite is true for a certain large class of workloads. Assume that an application has a fixed high data bandwidth demand and this is the key determinant of overall performance. Each disk offers very low bandwidth because of the low rotational speed of the spindle, especially for random access. In order to meet the high bandwidth demands of the application, you would need several disks and several of the 250 W, 10 TB servers. If data was instead placed in DRAM (as in RAMCloud), that same high rate of data demand can be fulfilled with just a few 280 W, 64 GB servers. The difference in data bandwidth rates for DRAM and disk is over <a href="http://www.cs.cmu.edu/%7Efawnproj/papers/fawn-sosp2009.pdf">600X</a>. So even though each DRAM server in the example above is more expensive in terms of capex and opex, you'll need 600 times fewer servers with RAMCloud. This allows overall TCO for RAMCloud to be lower than that of a traditional disk-based platform.<br />
<br />
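The argument above can be made concrete with a toy TCO comparison using the numbers quoted in this post: a $2K/250W disk server, a $3K/280W DRAM server, and a ~600X bandwidth gap. The absolute bandwidth unit and the roughly $1-per-watt-per-year energy cost over a three-year life are simplifying assumptions for illustration.

```c
#include <assert.h>

typedef struct {
    int capex_dollars;   /* purchase price per server */
    int power_watts;     /* per-server power */
    int bw_units;        /* relative sustainable bandwidth per server */
} server_t;

/* Servers needed to hit an aggregate bandwidth target (rounded up).
 * When bandwidth drives provisioning, this is the number that
 * dominates everything else. */
int servers_needed(server_t s, int target_bw) {
    return (target_bw + s.bw_units - 1) / s.bw_units;
}

/* Simplified TCO: capex plus three years of energy at an assumed
 * ~$1 per watt per year. */
int tco_dollars(server_t s, int target_bw) {
    int n = servers_needed(s, target_bw);
    return n * (s.capex_dollars + 3 * s.power_watts);
}
```

Even though each DRAM server is pricier per byte, the 600X bandwidth advantage means hundreds of times fewer servers, and TCO flips decisively in RAMCloud's favor for bandwidth-bound workloads.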
I really like Figure 2 in the <a href="http://dl.acm.org/citation.cfm?id=1965751">RAMCloud CACM paper</a> (derived from the <a href="http://www.cs.cmu.edu/%7Efawnproj/papers/fawn-sosp2009.pdf">FAWN paper</a> and reproduced below). It shows that in terms of TCO, for a given capacity requirement, DRAM is a compelling design point at high access rates. In short, if data bandwidth is the bottleneck, it is cheaper to use technology (DRAM) that has high bandwidth, even if it incurs a much higher energy and purchase price per byte.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://4.bp.blogspot.com/-rQB8Ymt1Iro/TrjPdfdH_TI/AAAAAAAAk_M/Fbzfkw98zzo/s1600/ramcloud-fig.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="http://4.bp.blogspot.com/-rQB8Ymt1Iro/TrjPdfdH_TI/AAAAAAAAk_M/Fbzfkw98zzo/s400/ramcloud-fig.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: RAMCloud CACM paper</td></tr>
</tbody></table>
<br />
If architectures or arguments like RAMCloud become popular in the coming years, it opens up a slew of interesting problems for architects:<br />
<br />
1. Already, the DRAM main memory system is a huge energy bottleneck. RAMCloud amplifies the contribution of the memory system to overall datacenter energy, making memory energy efficiency a top priority.<br />
<br />
2. Queuing delays at the memory controller are a major concern. With RAMCloud, a single memory controller will service many requests from many servers, increasing the importance of the memory scheduling algorithm.<br />
<br />
3. With each new DDR generation, fewer main memory DIMMs can be attached to a single high-speed electrical channel. To support high memory capacity per server, innovative channel designs are required.<br />
<br />
4. If the Achilles' heel of disks is their low bandwidth, are there ways to design disk and server architectures that prioritize disk bandwidth/dollar over other metrics?<br />
<br />
<b>Dissing the Dissertation</b> (October 31, 2011)<br />
<br />
In my view, a Ph.D. dissertation is an over-rated document. I'll formalize my position here so I don't get dragged into endless debates during every thesis defense I attend :-).<br /><br />A classic Ph.D. generally has a hypothesis that is formulated at an early stage (say, year two) and the dissertation describes 3-4 years of research that tests that hypothesis.<br /><br />A modern Ph.D. (at least in computer architecture) rarely follows this classic recipe. There are many reasons to adapt your research focus every year: (1) After each project (paper), you realize the capabilities and limitations of your ideas; that wisdom will often steer you in a different direction for your next project. (2) Each project fixes some bottleneck and opens up other new bottlenecks that deserve attention. (3) Once an initial idea has targeted the low-hanging fruit, incremental extensions to that idea are often unattractive to top program committees. (4) If your initial thesis topic is no longer "hot", you may have to change direction to be better prepared for the upcoming job hunt. (5) New technologies emerge every few years that change the playing field.<br /><br />I'll use my own Ph.D. as an example. My first project was on an adaptive cache structure. After that effort, I felt that long latencies could not be avoided; perhaps latency tolerance would have more impact than latency reduction. That led to my second effort, designing a runahead thread that could jump ahead and correctly prefetch data.
The wisdom I gathered from that project was that it was essential to have many registers so the processor could look far enough into the future. If you had enough registers, you wouldn't need fancy runahead; hence, my third project was the design of a large two-level register file. During that project, I realized that "clustering" was an effective way to support large processor structures at high clock speeds. For my fourth project, I designed mechanisms to dynamically allocate cluster resources to threads. So I had papers on caching, runahead threads, register file design, and clustering. It was obviously difficult to weave them together into a coherent dissertation. In fact, each project used very different simulators and workloads. But I picked up skills in a variety of topics. I learnt how to pick problems and encountered a diverse set of literature, challenges, and reviews. By switching topics, I don't think I compromised on "depth"; I was using insight from one project to influence my approach in the next. I felt I graduated with a deep understanding of how to both reduce and tolerate long latencies in a processor.<br /><br />My key point is this: it is better to be adaptive and focus on high-impact topics than to flog a dead horse for the sake of a coherent dissertation. Research is unpredictable, and a four-year research agenda often cannot be captured by a single hypothesis formulated in year two. The dissertation is therefore a contrived concept (or at least appears contrived today, given that the nature of research has evolved with time). The dissertation is a poor proxy when evaluating a student's readiness to graduate. The student's ability to tie papers into a neat package tells me nothing about his/her research skills. If a Ph.D. is conferred on someone who has depth/wisdom in a topic and is capable of performing independent research, a student's publication record is a better metric in the evaluation process.
<br /><br />In every thesis proposal and defense I attend, there is a prolonged discussion on what constitutes a coherent thesis. Students are steered in a direction that leads to a coherent thesis, not necessarily in a direction of high impact. If one works in a field where citation counts for papers far out-number citation counts for dissertations, I see little value in producing a polished dissertation that will never be read. Use that time to instead produce another high-quality paper!<br /><br />There are exceptions, of course. One in every fifty dissertations has "bible" value... it ends up being the authoritative document on some topic and helps brand the student as the leading expert in that area. For example, see Ron Ho's <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.5568&rep=rep1&type=pdf">dissertation on wires</a>. If your Ph.D. work naturally lends itself to bible creation and you expect to be an <a href="http://awards.acm.org/html/dda.cfm">ACM Doctoral Dissertation Award</a> nominee, by all means, spend a few months to distill your insight into a coherent dissertation that will have high impact. Else, staple your papers into a dissertation, and don't be ashamed about it!<br /><br />
My short wish-list: (1) A thesis committee should focus on whether a student has produced sufficient high-quality peer-reviewed work and not worry about dissertation coherence. (2) A dissertation can be as simple as a collection of the candidate's papers along with an introductory chapter that conveys the conclusions and insights that guided the choice of problems.<br /><br />Rajeev Balasubramonian<br /><br /><b>The right yellow gradient</b> (2011-10-05)<br /><br />Vic Gundotra, SVP of Engineering at Google, shared this inspiring story on Google+:<br /><br /><blockquote>One Sunday morning, January 6th, 2008 I was attending religious services when my cell phone vibrated. As discreetly as possible, I checked the phone and noticed that my phone said "Caller ID unknown". I choose to ignore.<br /><br />After services, as I was walking to my car with my family, I checked my cell phone messages. The message left was from Steve Jobs. "Vic, can you call me at home? I have something urgent to discuss" it said. <br /><br />Before I even reached my car, I called Steve Jobs back. I was responsible for all mobile applications at Google, and in that role, had regular dealings with Steve. It was one of the perks of the job. <br /><br />"Hey Steve - this is Vic", I said. "I'm sorry I didn't answer your call earlier. I was in religious services, and the caller ID said unknown, so I didn't pick up". <br /><br />Steve laughed. He said, "Vic, unless the Caller ID said 'GOD', you should never pick up during services". <br /><br />I laughed nervously. After all, while it was customary for Steve to call during the week upset about something, it was unusual for him to call me on Sunday and ask me to call his home. I wondered what was so important?<br /><br />"So Vic, we have an urgent issue, one that I need addressed right away.
I've already assigned someone from my team to help you, and I hope you can fix this tomorrow" said Steve. <br /><br />"I've been looking at the Google logo on the iPhone and I'm not happy with the icon. The second O in Google doesn't have the right yellow gradient. It's just wrong and I'm going to have Greg fix it tomorrow. Is that okay with you?"<br /><br />Of course this was okay with me. A few minutes later on that Sunday I received an email from Steve with the subject "Icon Ambulance". The email directed me to work with Greg Christie to fix the icon. <br /><br />Since I was 11 years old and fell in love with an Apple II, I have dozens of stories to tell about Apple products. They have been a part of my life for decades. Even when I worked for 15 years for Bill Gates at Microsoft, I had a huge admiration for Steve and what Apple had produced. <br /><br />But in the end, when I think about leadership, passion and attention to detail, I think back to the call I received from Steve Jobs on a Sunday morning in January. It was a lesson I'll never forget. CEOs should care about details. Even shades of yellow. On a Sunday.</blockquote><br /><br />If there's one thing to learn from Steve Jobs' life, it is the importance of being a perfectionist. All the time. Right down to the perfect shade of yellow. Respect, Mr. 
Jobs.<br /><br />Ani<br /><br /><b>HotChips 2011 - A Delayed Trip Report</b> (2011-10-04)<br /><br />I know, it’s been a little over a month since this year’s edition of the conference concluded, but better late than never! It was my first ever trip to HotChips, and I thought I should share some things I found interesting. I’ll keep away from regurgitating stuff from the technical talks since the tech-press has likely covered all of those in sufficient detail.<br /><br />1. This is clearly an industry-oriented conference. There were hardly any professors around, and very few grad students. As a result, perhaps, there was almost nobody hanging out in the hallways or break areas during the actual talks; people were actually inside the auditorium!<br /><br />2. It seemed like the entire architecture/circuits community from every tech shop in the Bay Area was in attendance. If you’re going to be on the job market soon, it’s a great place to network!<br /><br />3. I enjoyed the keynote talk from ARM. It was interesting to think about smartphones (with ARM chips inside, of course) in the context of the developing world -- at price points in the $100-200 range, they are not a supplementary device like in the US, but the main conduit for people who normally couldn’t afford a $600 laptop to get on the internet. Definitely sounds like there’s huge potential for this form factor going forward. Another interesting tidbit from this talk was a comparison between the energy densities of a typical smartphone battery and a bar of chocolate -- 4.5 kCal in 30g vs. 255 kCal in 49g! Lots of work to do to improve battery technology, obviously.<br /><br />4. While on the subject of ARM, comparisons to Intel and discussions on who was “winning” were everywhere, from the panel discussion in the evening to one of the pre-defined “lunch table discussion topics”. Needless to say, there was nothing conclusive one way or the other :)<br /><br />5. I finally got to touch a real, working prototype of Micron’s Hybrid Memory Cube (see Rajeev’s post <a href="http://utaharch.blogspot.com/2011/06/killer-app-for-3d-chip-stacks.html">here</a>). I know, it’s of little practical value to see the die-stack, but it was cool nonetheless :) The talk was pretty interesting too, the first to release some technical information I think. It appears to be a complete rethinking of DRAM, with a novel protocol, novel interconnect topologies, etc.<br /><br />6. I also really enjoyed the talk from the Kinect folks. I’ve seen it in action, and it was amazing to understand the complex design behind making everything work, especially since a lot of the challenges were outside my typical areas of interest -- mechanical design, aesthetics, reliability, processing images with low lighting, varying clothes, people of different sizes... fascinating! There was also an aspect of “approachability” in the talk that I think is becoming increasingly common in the tech world -- basically *hide* the technology from the end user, making everything work “magically”. This makes people more likely to actually try these things. As a technologist, I understand the logic, but am not sure I like it very much -- engineers spend countless hours fixing a million corner cases, and I think there should be some way the common public understands at least the severity of these complexities and appreciates how hard it is to make things work! It’s like the common saying that Moore’s “Law” gives you an increasing number of transistors every year -- it doesn’t, engineers do!<br /><br />7. Finally, one really awesome event was a robot on stage! It was introduced by the folks from Willow Garage -- controlled from a simple Android app. It moved to the center of the stage from the wings, gave the speaker a high-five, and showed off a few tricks. There were also videos in the talk of it playing pool, bringing beer from a fridge (brand of your choice!), and a bunch of other cool things. Very fancy :) Waiting for my own personal assistant!<br /><br />Ani<br /><br /><b>Rebuttals and Reviews</b> (2011-08-28)<br /><br />In the review process for top conferences, conventional wisdom says that rebuttals don't usually change the mind of the reviewer... they only help the authors feel better. The latter is certainly true: after writing a rebuttal, I always feel like I have a plan to sculpt the paper into great shape.<br />
<br />
Here's some data from a recent PC meeting about whether rebuttals help. For the <a href="http://www.microarch.org/micro44/">MICRO 2011</a> review process, reviewers entered an initial score for overall merit and a final score after reading rebuttals and other reviews. While many factors (for example, new information at the PC meeting) may influence a difference in the two scores, I'll assume that the rebuttal is the dominant factor. I was able to confirm this for the 16 papers that I reviewed. Of the roughly 80 reviews for those papers, 14 had different initial and final scores (there were an equal number of upward and downward corrections). In 11 of the 14 cases, the change in score was prompted by the quality of the rebuttal. In the other 3 cases, one of the reviewers convinced the others that the paper had merit.<br />
<br />
For 25 randomly selected papers that were accepted (roughly 125 reviews), a total of 14 reviews had different initial and final scores. In 10 of the 14 cases, the final score was <i>higher</i> than the initial score.<br />
<br />
For 25 papers that were discussed but rejected, a total of 19 reviews had different initial and final scores. In 14 of the 19 cases, the final score was <i>lower</i> than the initial score.<br />
<br />
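For concreteness, the tallies above can be turned into rates with a few lines of arithmetic. A minimal sketch; the counts are taken directly from this post, and the 125-review figures are the post's own rough estimates (about five reviews per paper):

```python
# Score-change tallies quoted above (MICRO 2011 review process):
# 25 accepted and 25 rejected papers, roughly 125 reviews each.
accepted = {"reviews": 125, "changed": 14, "raised": 10}
rejected = {"reviews": 125, "changed": 19, "lowered": 14}

def pct(part, whole):
    # percentage, rounded to the nearest integer
    return round(100 * part / whole)

print(f"accepted: {pct(accepted['changed'], accepted['reviews'])}% of reviews "
      f"changed score; {pct(accepted['raised'], accepted['changed'])}% of those went up")
print(f"rejected: {pct(rejected['changed'], rejected['reviews'])}% of reviews "
      f"changed score; {pct(rejected['lowered'], rejected['changed'])}% of those went down")
```

In other words, roughly one review in eight moved after the rebuttal phase, and the direction of the movement correlated with the paper's final outcome.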
It appears that rebuttals do play a non-trivial role in a paper's outcome, or at least a larger role than I would have expected.<br /><br />Rajeev Balasubramonian<br /><br /><b>The Debate over Shared and Private Caches</b> (2011-08-22)<br /><br />Several papers have been written recently on the topic of last-level cache (LLC) management. In a multi-core processor, the LLC can be shared by many cores or each core can maintain a private LLC. Many of these recent papers evaluate the trade-offs in this space and propose models that lie somewhere between the two extremes. It's becoming evident (at least to me) that it is more effective to start with a shared LLC and make it appear private as required. In this post, I'll explain why. I see many papers on private LLCs being rejected because they fail to pay attention to some of these arguments.<br />
<br />
Early papers on LLC management have pointed out that in a multi-core processor, a shared LLC enjoys a higher effective cache capacity and ease of design, while private LLCs can offer lower average access latency and better isolation across programs. It is worth noting that both models can have similar physical layouts -- a shared LLC can be banked and each bank can be placed adjacent to one core (as in the figure below). So the true distinction between the two models is the logical policy used to govern their contents and access.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-PN05TEAVFRg/TlKJLi6Y2iI/AAAAAAAAji4/eaTu2ZON5vk/s1600/shared-distr-layout-a.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://2.bp.blogspot.com/-PN05TEAVFRg/TlKJLi6Y2iI/AAAAAAAAji4/eaTu2ZON5vk/s400/shared-distr-layout-a.jpg" width="400" /></a></div>The distributed shared LLC shown in the figure resembles <a href="http://www.tilera.com/technology">Tilera's tiled architecture</a> and offers non-uniform latencies to banks. In such a design, there are many options for data striping. The many ways of a set-associative cache could be scattered across banks. This is referred to as the <a href="ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS02.pdf">D-NUCA</a> model and leads to complex protocols for search and migration. This model seems to have the least promise because of its complexity and <a href="http://www.cs.wisc.edu/multifacet/papers/micro04_cmp_wire_delay.pdf">non-stellar performance</a>. A more compelling design is <a href="ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS02.pdf">S-NUCA</a>, where every address and every cache set resides in a unique bank. This makes it easy to locate data. Most research assumes that in an S-NUCA design, consecutive sets are placed in adjacent banks, what I'll dub as set-interleaved mapping. This naturally distributes load across banks, but every LLC request is effectively serviced by a random bank and must travel half-way across the chip on average. A third option is an S-NUCA design with what I'll call page-to-bank or page-interleaved mapping. All the blocks in an OS physical page (a minimum sized page) are mapped to a single bank and consecutive pages are mapped to adjacent banks. This is the most compelling design because it is easy to implement (no need for complex search) and paves the way for OS-managed data placement optimizations for locality.<br />
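The difference between the two S-NUCA striping policies comes down to which address bits select the bank. A minimal sketch, assuming a hypothetical geometry (64-byte blocks, 4 KB pages, 16 banks) rather than any specific design from the cited papers:

```python
# Hypothetical geometry: 64-byte blocks, 4KB OS pages, 16 LLC banks.
BLOCK, PAGE, BANKS = 64, 4096, 16

def set_interleaved_bank(addr):
    # set-interleaved mapping: consecutive block addresses (hence
    # consecutive sets) land in adjacent banks
    return (addr // BLOCK) % BANKS

def page_interleaved_bank(addr):
    # page-interleaved (page-to-bank) mapping: every block of an OS page
    # maps to one bank; consecutive pages land in adjacent banks
    return (addr // PAGE) % BANKS

# Four consecutive blocks within a single 4KB page:
blocks = [0x1000 + i * BLOCK for i in range(4)]
assert len({set_interleaved_bank(a) for a in blocks}) == 4   # spread over 4 banks
assert len({page_interleaved_bank(a) for a in blocks}) == 1  # all in one bank
```

The asserts show why page-to-bank mapping enables OS placement policies: the OS picks a physical page, and that single choice pins every block of the page to one known bank.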
<br />
Recent work has taken simple S-NUCA designs and added locality optimizations so that an average LLC request need not travel far on the chip. My favorite solution to date is R-NUCA (<a href="http://users.eecs.northwestern.edu/%7Ehardav/papers/2009-ISCA-R-NUCA-Hardavellas.pdf">ISCA'09</a>), which has a simple policy to classify pages as private or shared, and ensures that private pages are always mapped to the local bank. The work of Cho and Jin (<a href="http://dl.acm.org/citation.cfm?id=1194858">MICRO'06</a>) relies on first-touch page coloring to place a page near its first requestor; Awasthi et al. (<a href="http://www.cs.utah.edu/%7Erajeev/pubs/hpca09a.pdf">HPCA'09</a>) augment that solution with load balancing and page migration. Victim Replication (<a href="http://pages.cs.wisc.edu/%7Eisca2005/papers/06A-01.PDF">ISCA'05</a>) is able to do selective data replication in even a set-interleaved S-NUCA cache without complicating the coherence protocol. In spite of these solutions, the problem of shared LLC data management is far from solved. Multi-threaded workloads are tougher to handle than multi-programmed workloads; existing solutions do a poor job of handling pages shared by many cores; they are not effective when the first requestor of a page is not the dominant accessor (since page migration is expensive); task migration renders most locality optimizations invalid.<br />
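The first-touch policy underlying several of these schemes is simple to express. A minimal sketch in the spirit of Cho and Jin's proposal; the dictionary bookkeeping and the assumption that bank i is the bank local to core i are simplifications for illustration, not the papers' exact mechanisms:

```python
# First-touch page coloring for a distributed shared LLC (sketch).
# Assumption: bank i is physically adjacent to core i.
PAGE = 4096
page_home_bank = {}

def llc_bank(core, addr):
    page = addr // PAGE
    if page not in page_home_bank:
        # first touch: home the page in the requesting core's local bank
        page_home_bank[page] = core
    return page_home_bank[page]

# Core 2 touches a page first, so all later accesses to that page --
# by any core -- are serviced by bank 2.
assert llc_bank(2, 0x8000) == 2
assert llc_bank(5, 0x8040) == 2  # same 4KB page, still bank 2
```

This also makes the failure modes discussed above concrete: if core 2 is merely the first requestor but core 5 is the dominant accessor, or if the thread later migrates off core 2, the page is homed in the wrong bank until it is (expensively) migrated.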
<br />
The key point here is this: while there is room for improvement, a shared LLC with some of the above optimizations can offer low access latencies, high effective capacity, low implementation complexity, and better isolation across programs. Yes, these solutions sometimes involve the OS, but in very simple ways that have commercial precedents (for example, NUMA-aware page placement in the <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.137.1817&rep=rep1&type=pdf">SGI Origin</a>).<br />
<br />
On the other hand, one could start with an organization where each core has its own private LLC. Prior work (such as <a href="http://dl.acm.org/citation.cfm?id=1136509">cooperative caching</a>) has added innovations so that the private LLCs can cooperate to provide a higher effective cache capacity. In cooperative caching, when a block is evicted out of a private cache, it is given a chance to reside in another core's private cache. However, all private LLC organizations require complex mechanisms to locate data that may be resident in a remote bank. Typically, this is done with an on-chip directory that tracks the contents of every cache. This directory introduces non-trivial complexity. Some commercial processors do offer private caches; data search is manageable in these processors because there are few caches and bus-based broadcast is possible. Private LLCs are less attractive as the number of cores scales up.<br />
<br />
Based on the above arguments, I feel that shared LLCs have greater promise. For a private LLC organization to be compelling, the following arguments are necessary: (i) How is the complexity of data search overcome? (ii) Can it out-perform a shared LLC with page-to-bank S-NUCA mapping and a simple locality optimization (either R-NUCA or first-touch page coloring)?<br />
Rajeev Balasubramonian<br /><br /><b>A Killer App for 3D Chip Stacks?</b> (2011-06-28)<br /><br /><div dir="ltr" style="text-align: left;" trbidi="on">I had previously <a href="http://utaharch.blogspot.com/2011/02/not-lacking-in-buzzwords.html">posted a brief summary</a> of our <a href="http://www.cs.utah.edu/%7Erajeev/pubs/isca11.pdf">ISCA 2011 paper</a>. The basic idea was to design a 3D stack of memory chips and one special interface chip, connected with TSVs (through-silicon vias). We argue that the interface chip should have photonic components and memory scheduling logic. The use of such a stack enables game-changing optimizations (photonic access, scalable scheduling) without disrupting the manufacture of commodity memory chips.<br />
<br />
<a href="http://1.bp.blogspot.com/-S_PW2z2isrQ/TgoLwu6APHI/AAAAAAAAheY/G0-NYti9y0w/s1600/isca11-fig.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://1.bp.blogspot.com/-S_PW2z2isrQ/TgoLwu6APHI/AAAAAAAAheY/G0-NYti9y0w/s400/isca11-fig.jpg" width="400" /></a><br />
<br />
When we presented our work at Micron recently (<a href="http://utaharch.blogspot.com/2011/05/memory-systems-research-industry.html">see related post by Ani Udipi</a>), we were told that Micron has a full silicon prototype that incorporates some of these concepts. We were very excited to hear that a 3D stacked memory/logic organization similar to our proposal is implementable and could be reality in the near future. <a href="http://www.cs.utah.edu/%7Erajeev/pubs/Micron_2011_Winter_Analyst_Conference_FINAL.pdf">Slide 17 of the report</a> from the Micron Winter Analyst Conference, February 2011, describes Micron's Hybrid Memory Cube (HMC, Figure reproduced below). The HMC has a Micron-designed logic controller at the bottom of the stack (what we dub as the interface chip in our work) and it is connected to multiple DRAM chips with TSVs. Details of what is on the logic controller have not been released yet. <a href="http://www.open-silicon.com/news-events/press-releases/open-silicon-and-micron-align-to-deliver-next-generation-memory-technology.html">Micron is partnering with Open-Silicon</a> to take advantage of the opportunities made possible by the HMC.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-S9cqdHmfBJE/TgoOqBaExXI/AAAAAAAAheg/RzTTNGrQZ0o/s1600/micron-hmc.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="256" src="http://2.bp.blogspot.com/-S9cqdHmfBJE/TgoOqBaExXI/AAAAAAAAheg/RzTTNGrQZ0o/s400/micron-hmc.jpg" width="400" /></a></div><br />
This is an exciting development and I expect that one could discover many creative ways to put useful functionality on the interface chip. Prior work has proposed several ways to add functionality to DRAM chips: <a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=946676">row buffer caches</a>, <a href="https://www.venraytechnology.com/">processing in memory</a>, <a href="http://www.cs.utah.edu/%7Erajeev/pubs/isca10.pdf">error correction features</a>, <a href="http://portal.acm.org/citation.cfm?id=1815961.1815978&coll=portal&dl=ACM">photonic components</a>, etc. Many of these ideas were unimplementable because of their impact on cost, but they may be worth attempting in the context of a 3D memory stack. This is also an opportunity to add value to memory products, an especially important consideration as density growth flattens or as we move to new memory technologies that have varying maintenance needs.<br />
<br />
While most prior academic work on 3D architecture has been processor-centric, the potential benefit of memory-centric 3D stacking is relatively unexplored. This is in spite of the fact that memory companies have embraced 3D stacking much more than processor companies. 3D memory chip stacks are currently manufactured in various forms by <a href="http://www.tezzaron.com/memory/Overview_3D_DRAM.htm">Tezzaron</a>, <a href="http://www.computerworld.com/s/article/9200278/Samsung_to_release_3D_memory_modules_with_50_greater_density">Samsung</a>, and <a href="http://www.digitimes.com/news/a20110627PR201.html">Elpida</a> among others. The concept of building a single chip and then reusing it within 3D chip stacks to create multiple different products has been proposed previously for processors (papers from <a href="http://www.cs.ucsb.edu/%7Esherwood/pubs/ASPLOS-introspect3D.pdf">UCSB</a> and <a href="http://www.cs.utah.edu/%7Erajeev/pubs/micro07b.pdf">Utah</a>). Given the economic impact of this concept, the cost-sensitive memory industry stands to gain more from it. Memory companies operate at very small margins. They therefore strive to optimize cost-per-bit and almost exclusively design for high volumes. They are averse to adding any feature that will increase cost for millions of chips, but will only be used by a small market segment. But with 3D chip stacks, the same high-volume commodity DRAM chip can be bonded to different interface chips to create different products for each market segment. This may well emerge as the most compelling application of 3D stacking within the high performance domain.</div><br /><br />Rajeev Balasubramonian<br /><br /><b>Father's Day</b> (2011-06-19)<br /><br />Just spent a great Father's Day watching the US Open with my 3-year-old son and week-old daughter in my arms.
On most Father's Days, I'm fumbling around airports and hotels, missing my family and trying to get a peek at US Open scores on my way to ISCA. Here's hoping that future ISCA organizers make it a high priority to not schedule ISCA during Father's Day...<br /><br />Rajeev Balasubramonian