Wednesday, November 27, 2013

A DRAM Refresh Tutorial

In a recent project, we dug into the details of the DRAM refresh process.  Unfortunately, we found no public document that explains it well, and we hope the description below helps fill this gap.  Not all DRAM chips necessarily follow the procedure outlined here.

The well-known basics of DRAM refresh:

The charge on a DRAM cell weakens over time.  The DDR standard requires every cell to be refreshed within a 64 ms interval, referred to as the retention time.  At temperatures higher than 85 °C (referred to as the extended temperature range), the retention time is halved to 32 ms to account for the higher leakage rate.  The refresh of a memory rank is partitioned into 8,192 smaller refresh operations.  One such refresh operation has to be issued every 7.8 µs (64 ms/8192).  This 7.8 µs interval is referred to as the refresh interval, tREFI (in the extended temperature range, it correspondingly shrinks to 3.9 µs).  The DDR3 standard requires that eight refresh operations be issued within a time window equal to 8 x tREFI, giving the memory controller some flexibility when scheduling these refresh operations.  Refresh operations are issued at rank granularity in DDR3 and DDR4.  Before issuing a refresh operation, the memory controller precharges all banks in the rank.  It then issues a single refresh command to the rank.  DRAM chips maintain a row counter to keep track of the last row that was refreshed; this row counter is used to determine the rows that must be refreshed next.
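
The arithmetic above can be checked with a short Python sketch.  The constants come straight from the text; the function name and its parameters are our own and purely illustrative.

```python
# Values taken from the text: 64 ms retention time, 8,192 refresh operations per rank.
RETENTION_MS = 64.0
REFRESH_OPS = 8192


def trefi_us(retention_ms: float = RETENTION_MS, refresh_ops: int = REFRESH_OPS) -> float:
    """Return the refresh interval tREFI in microseconds."""
    return retention_ms * 1000.0 / refresh_ops


if __name__ == "__main__":
    normal = trefi_us()                      # normal temperature range
    extended = trefi_us(retention_ms=32.0)   # extended range: retention halved
    print(f"tREFI (normal range):   {normal:.4f} us")
    print(f"tREFI (extended range): {extended:.4f} us")
    # Window within which eight refresh operations must be issued (8 x tREFI).
    print(f"8 x tREFI window:       {8 * normal:.1f} us")
```

The commonly quoted 7.8 µs is simply 7.8125 µs rounded down.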

tRFC and recovery time:

Upon receiving a refresh command, the DRAM chips enter a refresh mode that has been carefully designed to perform the maximum amount of cell refresh in as little time as possible.  During this time, the current-carrying capabilities of the power delivery network and the charge pumps are stretched to the limit.  The operation lasts for a time referred to as the refresh cycle time, tRFC.  Towards the end of this period, the refresh process starts to wind down and some recovery time is provisioned so that the banks can be precharged and charge can be restored to the charge pumps.  Providing this recovery time at the end allows the memory controller to resume normal operation as soon as tRFC elapses.  Without this recovery time, the memory controller would require a new set of timing constraints that allow it to gradually ramp up its operations in parallel with charge pump restoration.  Since such complexity can't be expected of every memory controller, the DDR standards include the recovery time in the tRFC specification.  As soon as the tRFC time elapses, the memory controller can issue four consecutive Activate commands to different banks in the rank.

Refresh penalty:

On average, in every tREFI window, the rank is unavailable for a time equal to tRFC.  So for a memory-bound application on a 1-rank memory system, the percentage of execution time that can be attributed to refresh (the refresh penalty) is tRFC/tREFI.  In reality, the refresh penalty can be a little higher because directly prior to the refresh operation, the memory controller wastes some time precharging all the banks.  Also, right after the refresh operation, since all rows are closed, the memory controller has to issue a few Activates to re-populate the row buffers.  These added delays can grow the refresh penalty for a 32 Gb chip from, say, 8% to 9%.  The refresh penalty can also be lower than the tRFC/tREFI ratio if the processors can continue to execute independent instructions in their reorder buffers while the memory system is unavailable.  In a multi-rank memory system, the refresh penalty depends on whether ranks are refreshed together or in a staggered manner.  If ranks are refreshed together, the refresh penalty, as above, is in the neighborhood of tRFC/tREFI.  If ranks are refreshed in a staggered manner, the refresh penalty can be greater.  Staggered refresh is frequently employed because it reduces the memory's peak power requirement.
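
As a rough illustration of the tRFC/tREFI ratio, here is a minimal Python sketch.  The tRFC values (260 ns for a 4 Gb chip, 640 ns projected for a 32 Gb chip) are the ones quoted elsewhere in this post; the function name and the extra_ns parameter, which stands in for the precharge and re-activation delays around a refresh, are hypothetical and only there for experimentation.

```python
# tREFI from the text: 64 ms / 8,192 refresh operations, expressed in ns.
TREFI_NS = 64e6 / 8192  # 7812.5 ns


def refresh_penalty(trfc_ns: float, trefi_ns: float = TREFI_NS, extra_ns: float = 0.0) -> float:
    """Fraction of time a rank is unavailable because of refresh.

    extra_ns is a hypothetical knob for the time lost to precharging all
    banks before the refresh and re-opening rows afterwards.
    """
    return (trfc_ns + extra_ns) / trefi_ns


if __name__ == "__main__":
    # tRFC values quoted in this post; not taken from any specific datasheet here.
    for label, trfc in (("4 Gb", 260.0), ("32 Gb", 640.0)):
        print(f"{label}: tRFC/tREFI = {refresh_penalty(trfc):.1%}")
```

For the 32 Gb case the ratio comes out to roughly 8%, which matches the figure used above before the precharge and Activate overheads are added.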

Some refresh misconceptions:

We next describe how a few rows in all banks are refreshed during the tRFC period.  As DRAM chip capacities increase, the number of rows on the chip also increases.  Since the retention time (64 ms) and refresh interval (7.8 µs) have remained constant, the number of rows that must be refreshed in every refresh interval has increased.  In modern 4 Gb chips, eight rows must be refreshed in every bank in a single tRFC window.  Some prior works have assumed that a row refresh is equivalent to an Activate+Precharge sequence for that row.  Therefore, the refresh process was assumed to be equivalent to eight sequential Activate+Precharge commands per bank, with multiple banks performing these operations in parallel.  However, DRAM chip specifications reveal that the above model over-simplifies the refresh process.  First, eight sequential Activate+Precharge sequences will require time = 8 x tRC.  For a 4 Gb DRAM chip, this equates to 390 ns.  But tRFC is only 260 ns, i.e., there is not enough time to issue eight sequential Activate+Precharge sequences and still allow recovery time at the end.  Also, parallel Activates in eight banks would draw far more current than is allowed by the tFAW constraint.  Second, the DRAM specifications provide the average current drawn during an Activate/Precharge (IDD0) and Refresh (IDD5).  If Refresh were performed with 64 Activate+Precharge sequences (64 = 8 banks x 8 rows per bank), we would require much more current than that afforded by IDD5.  Hence, the refresh process uses a method that has higher efficiency in terms of time and current than a sequence of Activate and Precharge commands.
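
The timing half of this argument is easy to reproduce.  The sketch below assumes tRC = 48.75 ns, which is simply the value implied by the 8 x tRC = 390 ns figure above (it is not read from a specific datasheet); the names are illustrative.

```python
# Assumption: tRC is back-computed from the text's "8 x tRC = 390 ns".
TRC_NS = 390.0 / 8        # 48.75 ns per Activate+Precharge sequence
TRFC_NS = 260.0           # refresh cycle time quoted for a 4 Gb chip
ROWS_PER_BANK = 8         # rows refreshed per bank in one tRFC window


def naive_refresh_ns(rows_per_bank: int = ROWS_PER_BANK, trc_ns: float = TRC_NS) -> float:
    """Time needed if refresh were just sequential Activate+Precharge pairs."""
    return rows_per_bank * trc_ns


if __name__ == "__main__":
    naive = naive_refresh_ns()
    print(f"naive model: {naive:.0f} ns, actual tRFC: {TRFC_NS:.0f} ns")
    print(f"naive model overshoots tRFC by {naive - TRFC_NS:.0f} ns")
```

Even before accounting for recovery time, the naive model needs 130 ns more than tRFC allows, so it cannot describe what the chip actually does.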

The actual refresh process:

This process is based on the high number of subarrays provisioned in every bank.  For example, a bank may have 16 subarrays, of which only four are accessed during a regular Activate operation.  This observation also formed the basis for the recent subarray-level parallelism (SALP) idea of Kim et al.  During a refresh operation, the same row in all 16 subarrays undergoes an Activation and Precharge.  In this example, four rows' worth of data are being refreshed in parallel within a single bank.  Also, the current requirement for this is not 4x the current for a regular Activate; by sharing many of the circuits within the bank, the current does not increase linearly with the extent of subarray-level parallelism.  Thus, a single bank uses the maximum allowed current draw to perform a parallel refresh of one row in every subarray; the banks are handled sequentially (refreshes in two banks may overlap slightly based on current profiles), and there is a recovery time at the end.

Update added on 12/16/2013:
 
Refresh in DDR4:

One important change expected in DDR4 devices is a Fine Granularity Refresh (FGR) operation (this ISCA'13 paper from Cornell/IBM has good details on FGR).  FGR-1x can be viewed as a regular refresh operation, similar to that in DDR3.  FGR-2x partitions the regular refresh operation into 2 smaller "half-refresh" operations.  In essence, the tREFI is halved (half-refresh operations must be issued twice as often), and the tRFC also reduces (since each half-refresh operation does half the work).  FGR-4x partitions each regular refresh operation into 4 smaller "quarter-refresh" operations.  Since each FGR operation renders a rank unavailable for a shorter time, it has the potential to reduce overall queuing delays for processor reads.  But it does introduce one significant overhead.  A single FGR-2x operation has to refresh half the cells refreshed in an FGR-1x operation, thus potentially requiring half the time.  But an FGR-2x operation and an FGR-1x operation must both incur the same recovery cost at the end to handle depleted charge pumps.  DDR4 projections for 32 Gb chips show that tRFC for FGR-1x is 640 ns, but tRFC for FGR-2x is 480 ns.  The overheads of the recovery time are so significant that two FGR-2x operations (960 ns in total) take 50% longer than a single FGR-1x operation.  Similarly, going to FGR-4x mode results in a tRFC of 350 ns.  Therefore, four FGR-4x refresh operations would keep the rank unavailable for 1400 ns, while a single FGR-1x refresh operation would refresh the same number of cells, but keep the rank unavailable for only 640 ns.  The high refresh recovery overheads in FGR-2x and FGR-4x limit their effectiveness in reducing queuing delays.
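
The FGR comparison can be tabulated with a few lines of Python.  The tRFC values are the projected 32 Gb numbers quoted above; the dictionary and function names are our own illustrative choices.

```python
# Projected tRFC (ns) for a 32 Gb DDR4 chip, keyed by FGR mode (1x, 2x, 4x),
# as quoted in the text.
FGR_TRFC_NS = {1: 640.0, 2: 480.0, 4: 350.0}


def total_unavailable_ns(mode: int) -> float:
    """Rank-unavailable time to do the work of one regular (1x) refresh."""
    return mode * FGR_TRFC_NS[mode]


def overhead_vs_1x(mode: int) -> float:
    """Unavailable time in this mode relative to FGR-1x."""
    return total_unavailable_ns(mode) / total_unavailable_ns(1)


if __name__ == "__main__":
    for mode in sorted(FGR_TRFC_NS):
        print(f"FGR-{mode}x: {mode} x {FGR_TRFC_NS[mode]:.0f} ns = "
              f"{total_unavailable_ns(mode):.0f} ns "
              f"({overhead_vs_1x(mode):.2f}x of FGR-1x)")
```

Each individual operation gets shorter as the granularity gets finer, but the total time the rank spends refreshing grows from 640 ns to 960 ns to 1400 ns.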

Refresh in LPDDR2:

LPDDR2 also provides a form of fine granularity refresh.  It allows a single bank to be refreshed at a time with a REFpb command (per-bank refresh).  For an 8-bank LPDDR2 chip, eight per-bank refreshes handle as many cells as a single regular all-bank refresh (REFab command).  A single REFpb command takes far more than one-eighth of the time taken by a REFab command: REFab takes 210 ns in an 8 Gb chip and REFpb takes 90 ns (see pages 75 and 141 in this datasheet).  Similar to DDR4's FGR, we see that breaking a refresh operation into smaller units imposes a significant overhead.  However, LPDDR2 adds one key feature.  While a REFpb command is being performed in one bank, regular DRAM operations can be serviced by other banks.  DDR3 and DDR4 do not allow refresh to be overlapped with other operations (although this appears to be the topic of two upcoming papers at HPCA 2014).  Page 54 of this datasheet indicates that a REFpb has a similar impact on tFAW as an Activate command.  Page 75 of the same datasheet indicates that an Activate and REFpb can't be viewed similarly by the memory scheduler.  We suspect that REFpb has a current profile that is somewhere between a single-bank Activate and a REFab.
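
The same kind of back-of-the-envelope check works for LPDDR2.  The sketch below uses only the two latencies quoted above for an 8 Gb, 8-bank chip; the variable and function names are illustrative.

```python
# Latencies quoted in the text for an 8 Gb, 8-bank LPDDR2 chip.
REFAB_NS = 210.0   # all-bank refresh
REFPB_NS = 90.0    # per-bank refresh
BANKS = 8


def per_bank_total_ns(banks: int = BANKS, refpb_ns: float = REFPB_NS) -> float:
    """Time spent in refresh if every bank is refreshed with its own REFpb."""
    return banks * refpb_ns


if __name__ == "__main__":
    total = per_bank_total_ns()
    ideal = REFAB_NS / BANKS  # what REFpb would cost if the work split evenly
    print(f"ideal REFpb: {ideal:.2f} ns, actual REFpb: {REFPB_NS:.0f} ns")
    print(f"8 x REFpb = {total:.0f} ns vs REFab = {REFAB_NS:.0f} ns "
          f"({total / REFAB_NS:.2f}x)")
```

Note that this counts total refresh time only.  Because other banks can keep servicing requests during a REFpb, the 720 ns is not time during which the whole rank is unavailable.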

Jointly authored by Rajeev Balasubramonian (University of Utah), Manju Shevgoor (University of Utah), and Jung-Sik Kim (Samsung).