Tuesday, June 19, 2012

Memory Scheduling Championship Wrap-Up

The MSC Workshop was held last week at ISCA, featuring a strong set of presentations.  The MSC webpage has all the final results and papers.  Congrats to the winners of each track!  Thanks to Zeshan Chishti and Intel for sponsoring the trophies and award certificates (pictured below with the winners of each track).  By the numbers: the workshop had about 30 attendees, we received 11 submissions, and the USIMM code has been downloaded 1254 times to date (261 downloads for version 1.3).

Ikeda et al., Winners of the Performance Track.
Fang et al., Winners of the Energy-Delay Track.
Ishii et al., Winners of the Performance-Fairness Track.

Some important take-home messages from the competition:
  1. It is nearly impossible to win a scheduling competition with a scheduler that implements just a single new idea.  A good scheduler is a combination of multiple optimizations.  All of the following must be carefully handled: read/write interference, row buffer locality, early precharge, early activates, refresh, power-up/down, fairness, etc.  The talk by Ishii et al. did an especially good job of combining multiple techniques and breaking down the contribution of each one.  During the workshop break, some of the audience suggested building a Franken-scheduler (along the lines of Gabe Loh's Frankenpredictor) that combines the best of all submitted schedulers.  I think that idea has a lot of merit.
  2. If I had to pick a single scheduling technique that played a strong role in many submitted schedulers, it would be the smart handling of read/write interference (a rough sketch of this approach appears after this list).  Because baseline memory write handling increases execution time by about 36%, it offers the biggest room for improvement.
  3. Our initial experiments seemed to reveal that all three metrics (performance, energy-delay, and performance-fairness) were strongly correlated, i.e., we expected that a single scheduler would win on all three metrics.  It was a surprise that we had different winners in each track.  Apparently, each winning scheduler had a wrinkle that gave it the edge for one metric.
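To make point 2 concrete, here is a minimal sketch of the watermark-based write-drain approach that many submissions used, written in plain C with hypothetical helper names (the actual USIMM scheduler interface differs in its details): reads are prioritized until the write queue crosses a high watermark, at which point writes are drained in a burst down to a low watermark, so the cost of turning the data bus around is amortized over many writes.

    /* Sketch of watermark-based write draining.  The helpers below
     * (pending_writes, pending_reads, issue_oldest_write, issue_best_read)
     * are hypothetical placeholders, not the actual USIMM API. */
    int  pending_writes(int channel);
    int  pending_reads(int channel);
    void issue_oldest_write(int channel);
    void issue_best_read(int channel);

    #define HI_WM 40          /* start draining writes above this occupancy   */
    #define LO_WM 20          /* stop draining once occupancy falls below this */

    static int draining = 0;  /* are we currently in write-drain mode? */

    void schedule_channel(int channel)
    {
        int writes = pending_writes(channel);

        /* Hysteresis: enter drain mode at HI_WM, leave it at LO_WM, so the
         * data bus is not turned around after every individual write. */
        if (writes > HI_WM)
            draining = 1;
        else if (writes < LO_WM)
            draining = 0;

        if (draining)
            issue_oldest_write(channel);      /* burst writes back-to-back */
        else if (pending_reads(channel) > 0)
            issue_best_read(channel);         /* e.g., FR-FCFS among reads */
        else if (writes > 0)
            issue_oldest_write(channel);      /* opportunistically drain   */
    }
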
The USIMM simulation infrastructure seemed to work well for the competition, but there are a few things we'd like to improve upon in the coming months:
  1. The current processor model is simple; it does not model instruction dependences and assumes that all memory operations within a reorder buffer window can be issued as soon as they are fetched.  Adding support to model dependences would make the traces larger and slow down the simulator (hence, it wasn't done for the competition).
  2. We have already integrated USIMM with our version of Simics.  This automatically takes care of the dependence problem in the previous bullet, but the simulations are much slower.  In general, reviewers at top conferences will prefer the rigor of execution-driven simulation over the speed of trace-driven simulation.  It would be worthwhile to understand how conclusions differ between the two simulation styles.
  3. The DRAMSim2 tool has an excellent validation methodology.  We'd like to re-create something similar for USIMM.
  4. We'd like to augment the infrastructure to support prefetched data streams.
  5. Any new scheduler would have to compare itself against other state-of-the-art schedulers.  The USIMM infrastructure already includes a simple FCFS policy and an opportunistic close-page policy, among others (a minimal FCFS sketch follows this list).  All the code submitted to the competition is on-line as well.  It would be good to also release a version of the TCM algorithm (MICRO'10) in the coming months.
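For reference, the baseline FCFS policy mentioned in point 5 amounts to little more than walking the request queue in arrival order and issuing the first request whose next DRAM command is legal this cycle.  A minimal sketch, with illustrative structure and field names rather than the exact USIMM code:

    /* Minimal FCFS sketch.  Structure, field, and function names are
     * illustrative only, not the exact USIMM code. */
    typedef struct request {
        int command_issuable;   /* are all DRAM timing constraints met? */
        struct request *next;   /* queue is kept in arrival order       */
    } request_t;

    void issue_request(request_t *req);  /* placeholder issue hook */

    void schedule_fcfs(request_t *queue_head)
    {
        for (request_t *req = queue_head; req != NULL; req = req->next) {
            if (req->command_issuable) {
                issue_request(req);      /* oldest issuable request wins */
                return;                  /* one command per cycle        */
            }
        }
    }
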
If you have an idea for a future research competition, please email the JWAC organizers, Alaa Alameldeen (Intel) and Eric Rotenberg (NCSU).

Friday, June 15, 2012

Introducing simtrax

The Hardware Ray Tracing group has just released our architectural simulator, called simtrax. The simulator, compilation tools, and a number of example codes can all be checked out from the Google Code repository. We have a couple of tutorial pages on the simtrax wiki that will hopefully get you started.

At this point you may be wondering what kind of architecture simtrax simulates. The answer is a number of different configurations of the architecture we call TRaX (for Threaded Ray eXecution), which is designed specifically for throughput processing of ray tracing workloads. As such, the simulator is not well suited to a large number of application spaces, but it shows quite good results for ray tracing and should perform well on other, similar applications.

In particular, the limitations built into the TRaX architecture, and therefore assumed by the simulator, include the following:

  • No cache coherence – at present the simulated memory is kept coherent, but no timing penalty is charged to maintain coherence. We depend on code being written so that it does not rely on coherence (see the sketch after this list).
  • No write caching – writes are assumed to write around the cache, and the lines are invalidated if they were cached, but again, no coherence messages are sent to the other caches in the system.
  • Small stack sizes – each thread has a small local store for the program call stack. Recursion and deep call stacks can cause overflow or excessive thread-context size.
  • A single global scene data structure – each thread should load only the data it actually needs, when it needs it.
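To show what "code written not to rely on coherence" looks like in practice, here is a rough sketch of how a TRaX-style renderer is typically structured: scene data is shared read-only, and each thread writes only to framebuffer entries it owns, so no thread ever needs to observe another thread's cached writes. All names below are illustrative, not the simtrax API.

    /* Scene data is read-only and the framebuffer is partitioned so each
     * element has exactly one writer; the lack of cache coherence is
     * therefore never observable.  Names are illustrative only. */
    extern const float scene[];        /* shared, read-only scene data   */
    extern float       framebuffer[];  /* write-around, one writer/slot  */

    float trace_ray(const float *scene_data, int pixel);  /* hypothetical */

    void render_thread(int thread_id, int num_threads, int num_pixels)
    {
        /* Static stride partition: thread i owns pixels i, i+N, i+2N, ... */
        for (int p = thread_id; p < num_pixels; p += num_threads)
            framebuffer[p] = trace_ray(scene, p);   /* exclusive write */
    }
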
The architecture we simulate is hierarchical in nature. First, there are a number of independent, in-order, scalar thread processors. These are grouped into sets (usually of 32) that compose what we call a Threaded Multiprocessor (TM). While each thread processor has a few independent execution units, some of the larger, less-frequently-used units are shared among the threads in the TM. In addition, the threads in a TM share a single L1 data cache and a number of instruction caches, both of which are split into multiple independent banks, allowing distinct words to be fetched independently.

A number of TMs are grouped together to share access to an L2 data cache, which then manages the connection to the off-chip global memory. In a large chip there may be as many as 4 separate L2 data caches, while a smaller chip might do away with the L2 data cache altogether. The final piece of the puzzle is a small global register file with support for an atomic increment operation, which is used to give out thread work assignments.
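That atomic increment is what makes dynamic work distribution trivial: instead of the static partition sketched above, each thread repeatedly increments a shared counter in the global register file to claim the next pixel (or tile) until the image is exhausted. A minimal sketch, assuming a hypothetical atomic_inc() intrinsic rather than the exact simtrax API:

    /* Work assignment via atomic increment on a global register.
     * atomic_inc() is a stand-in for whatever intrinsic the compiler
     * exposes; it returns the counter's value before the increment. */
    int  atomic_inc(int global_reg_index);    /* hypothetical intrinsic */
    void shade_pixel(int pixel);              /* hypothetical kernel    */

    #define WORK_COUNTER 0                    /* global register used as
                                                 a shared work cursor   */

    void worker(int num_pixels)
    {
        for (;;) {
            int pixel = atomic_inc(WORK_COUNTER);  /* claim next pixel */
            if (pixel >= num_pixels)
                break;                              /* no work left    */
            shade_pixel(pixel);
        }
    }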

This simulator has been in development for a number of years, with many contributors along the way. In addition, we have taught a course using the simulation and compiler tools, and have published four conference and journal articles, with more in progress.

We hope this simulation tool can be useful and intend to do what we can to support it. Feel free to email us and to post bug reports on the Google Code page.