Tuesday, November 8, 2011

Memory Has an Eye on Disk's Space

We recently read a couple of SOSP papers in our reading group: RAMCloud and FAWN.  These are terrific papers with significant implications for architects and application developers.  Both papers target the design of energy-efficient and low-latency datacenter platforms for a new breed of data-intensive workloads.  FAWN uses many wimpy nodes and a Flash storage system; RAMCloud replaces disk with DRAM.  While the two papers share many arguments, I'll focus the rest of this post on RAMCloud because its conclusion is more surprising.

In RAMCloud, each individual server is effectively diskless (disks are used only for backup, not to service application reads and writes).  All data is placed in DRAM main memory.  Each server is configured with high memory capacity, and every processor has access via the network to the memory space of all servers in the RAMCloud.  It is easy to see that such a system should offer high performance because high-latency disk accesses (a few milliseconds) are replaced by low-latency DRAM+network accesses (microseconds).

An immediate architectural concern that comes to mind is cost.  DRAM has a dollar/GB purchase price that is 50-100× higher than that of disk.  A server with 10 TB of disk space costs $2K, while a server with 64 GB of DRAM and no disk costs $3K (2009 data from the FAWN paper).  Comparing the power consumption of DRAM and disk is trickier.  An individual access to DRAM consumes much less energy than an individual access to disk, but DRAM has a higher static energy overhead (the cost of refresh).  If the access rate is high enough, DRAM is more energy-efficient than disk.  For the same example servers as above, the 10 TB high-capacity disk server has a power rating of 250 W, whereas the 64 GB high-capacity DRAM server has a power rating of 280 W (2009 data from the FAWN paper).  This is not quite an apples-to-apples comparison because the DRAM-bound server services many more requests at 280 W than the disk-bound server does at 250 W.  But it is clear that in terms of operating (energy) cost per GB, DRAM is again much more expensive than disk.  Note that total cost of ownership (TCO) is the sum of capital expenditure (capex) and operational expenditure (opex).  The above data points make it appear that RAMCloud incurs a huge TCO penalty.

However, at least to my initial surprise, the opposite is true for a certain large class of workloads.  Assume that an application has a fixed high data bandwidth demand and that this is the key determinant of overall performance.  Each disk offers very low bandwidth because of the low rotational speed of the spindle, especially for random access.  In order to meet the high bandwidth demands of the application, you would need several disks and several of the 250 W, 10 TB servers.  If data were instead placed in DRAM (as in RAMCloud), that same high rate of data demand could be fulfilled with just a few 280 W, 64 GB servers.  The difference in data bandwidth rates for DRAM and disk is over 600×.  So even though each DRAM server in the example above is more expensive in terms of capex and opex, you'll need roughly 600 times fewer servers with RAMCloud.  This allows the overall TCO for RAMCloud to be lower than that of a traditional disk-based platform.
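To make that arithmetic concrete, here is a rough back-of-the-envelope sketch.  The capex and power figures are the 2009 FAWN numbers quoted above; the per-server bandwidths, the electricity price, and the server lifetime are my own assumptions, chosen only to reflect the ~600× DRAM-vs-disk bandwidth gap.

```python
import math

# Back-of-the-envelope TCO comparison for a bandwidth-bound workload.
# Capex and power are the 2009 FAWN figures from the post; per-server
# bandwidths, electricity price, and lifetime are assumptions.
ELECTRICITY_USD_PER_KWH = 0.10  # assumed utility rate
YEARS = 3                       # assumed server lifetime

servers = {
    # name: (capex in $, power in W, sustained bandwidth in GB/s)
    "disk (10 TB)": (2000, 250, 0.1),   # bandwidth assumed
    "dram (64 GB)": (3000, 280, 60.0),  # bandwidth assumed (~600x disk)
}

demand_gbps = 100.0  # application's aggregate bandwidth demand, GB/s

results = {}
for name, (capex, watts, bw) in servers.items():
    n = math.ceil(demand_gbps / bw)  # servers needed to meet demand
    opex = n * (watts / 1000.0) * 24 * 365 * YEARS * ELECTRICITY_USD_PER_KWH
    results[name] = {"servers": n, "tco": n * capex + opex}
    print(f"{name}: {n} servers, {YEARS}-year TCO = ${results[name]['tco']:,.0f}")
```

Even though each DRAM server costs more to buy and burns more power, meeting the bandwidth demand with disks takes orders of magnitude more machines, and that is what flips the TCO comparison.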

I really like Figure 2 in the RAMCloud CACM paper (derived from the FAWN paper and reproduced below).  It shows that in terms of TCO, for a given capacity requirement, DRAM is a compelling design point at high access rates.  In short, if data bandwidth is the bottleneck, it is cheaper to use technology (DRAM) that has high bandwidth, even if it incurs a much higher energy and purchase price per byte.

Source: RAMCloud CACM paper

If architectures or arguments like RAMCloud become popular in the coming years, it opens up a slew of interesting problems for architects:

1.  Already, the DRAM main memory system is a huge energy bottleneck.  RAMCloud amplifies the contribution of the memory system to overall datacenter energy, making memory energy efficiency a top priority.

2.  Queuing delays at the memory controller are a major concern.  With RAMCloud, a single memory controller will service many requests from many servers, increasing the importance of the memory scheduling algorithm.

3.  With each new DDR generation, fewer main memory DIMMs can be attached to a single high-speed electrical channel.  To support high memory capacity per server, innovative channel designs are required.

4.  If the Achilles' heel of disks is their low bandwidth, are there ways to design disk and server architectures that prioritize disk bandwidth/dollar over other metrics?


  1. Your conclusions 1-4 read as if you've simply forgotten about the large portion of the graph that has flash in it. 10 years ago flash wouldn't have even been on this graph due to cost/performance.

    In the next several years the price of flash will likely come down by a factor of 10 as proprietary controller designs become public knowledge. This makes it price competitive with disk - eroding the use case for disk except for archival storage. This also makes it harder (on a TCO basis) to justify using DRAM for anything besides the smallest of caches in front of Flash, or where absolute latency to memory is the defining requirement of your application's performance.

    I personally think RAMCloud is barking up the wrong tree - distributing DRAM does nothing to improve its fundamental performance/power metric. In fact it simply adds significant delay, ultimately decreasing its perf/power and requiring more peripheral equipment to build the interconnects.

  2. Dave, you're right that I didn't give Flash its due attention. As these devices scale at different rates, they may occupy more or less of the design space. As long as DRAM has a bandwidth advantage over Flash, there will be a region of the design space where it offers the lowest TCO.

    I agree with your point that the figure may look very different once appropriate caches are considered; the RAMCloud papers do talk a little about this aspect.

    Your final comment characterizes the idea as "distributing DRAM to improve its perf/power". It's more a case of harnessing enough DRAM modules to meet the bandwidth requirement. It would take many more Flash or disk modules (and hence higher TCO) to match that same bandwidth.

  3. It's possible today to build systems that do 30 GB/s of bandwidth from flash within a single box, and with PCIe 3.0 coming in 2012 this will double again, putting it on par with the main memory bandwidth inside the box. It has the side effect of coming with much higher capacity and improved power consumption per bit over DRAM. So while the interface is currently hampering flash bandwidth, it's a winner in terms of power and density.

    In my mind this makes latency the last significant advantage DRAM will still have over flash. If you're within your own electrical domain, use DRAM; if you're going remote and adding latency, use flash. I'm anxiously awaiting the follow-on paper... FLASHCloud!

  4. The FAWN project is a version of "FLASHCloud"... :-)

  5. Btw, DRAM also has a significant endurance advantage over Flash.

    The same tricks that can be used to boost Flash module bandwidth can also be used to boost DRAM module bandwidth. If a device (Flash) has an inherent 100x latency disadvantage, it can provide high bandwidth only by using many chips and striping blocks across all these chips, leading to high capex and opex. This is the basic RAMCloud argument: if you are limited by bandwidth, TCO is minimized by using the device (DRAM) with highest inherent bandwidth. This is a rough guideline that appeals to me. Once you consider caching, network latency/bandwidth, etc., it may well turn out that a DRAM/Flash hybrid is optimal for a large portion of the design space.

  6. Good points. It has an endurance advantage for sure, but in practice flash devices are being designed for 5 years of continuous use, which is well beyond the replacement cycle for DRAM or Flash in industry, so does it matter that much if both can get to the replacement point?

    The latency:bandwidth ratio is changing very rapidly in flash, such that you don't need to stripe across as many pads to get good bandwidth. The striping is now done more for reliability than performance. t_read (latency) is actually getting significantly slower (~30%) going from the 3X nm to the 2X nm process. t_stream (bandwidth) has gone up, though (~400% in the last 24 months). Flash devices at large capacity have much more internal bandwidth than they can export through the slow interfaces they're coupled to. Conversely, the industry is able to make smaller devices (in terms of striping) each year that achieve better bandwidth.

    Here is an industry reference showing this decoupling of capacity and bandwidth: http://www.fusionio.com/data-sheets/iodrive2/