Wednesday, February 22, 2012

USIMM

We've just released the simulation infrastructure for the Memory Scheduling Championship (MSC), to be held at ISCA this year.

The central piece in the infrastructure is USIMM, the Utah SImulated Memory Module. It reads in application traces, models the progression of application instructions through the reorder buffers of a multi-core processor, and manages the memory controller read/write queues. Every memory cycle, USIMM checks various DRAM timing parameters to determine the set of memory commands that can be issued in that cycle. It then hands control to a scheduler function that picks a command from this candidate set. An MSC contestant only has to modify the scheduler function, i.e., all changes are restricted to scheduler.c and scheduler.h. This clean interface makes it very easy to produce basic schedulers. Each of my students produced a simple scheduler in a matter of hours; these have been included in the code distribution as examples to help get one started.
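
To make the interface concrete, here is the skeleton of a trivial scheduler in the style of the included examples. This is only a sketch: it assumes the v1.x interface (request_t, read_queue_head[], the command_issuable flag, issue_request_command(), and the utlist macros) and it ignores write draining; see scheduler-fcfs.c in the distribution for the real thing.

#include "utlist.h"
#include "memory_controller.h"
#include "scheduler.h"

void init_scheduler_vars()
{
    /* One-time setup of any scheduler-private state. */
}

/* Called once per memory cycle for each channel. */
void schedule(int channel)
{
    request_t *req = NULL;

    /* Issue the oldest read whose next DRAM command is legal this cycle. */
    LL_FOREACH(read_queue_head[channel], req) {
        if (req->command_issuable) {
            issue_request_command(req);
            return;
        }
    }
}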

In the coming weeks, we'll release a number of traces that will be used for the competition.  The initial distribution includes five short single-thread traces from PARSEC that people can use for initial testing.

The competition will be judged in three different tracks: performance, energy-delay-product (EDP), and performance-fairness-product (PFP).  The final results will be based on the most current version of USIMM, as of June 1st 2012.

We request that contestants focus on scheduling algorithms that are easily implementable, i.e., doable within a few processor cycles and within a 68 KB storage budget.  A program committee will evaluate the implementability of your algorithm, among other things.

We'll post updates and bug fixes in the comments section of this blog post as well as to the usimm-users@cs.utah.edu mailing list (sign up here).  Users are welcome to use the blog or mailing list to post their own suggestions, questions, or bug reports.  Email usimm@cs.utah.edu if you have a question for just the code developers.

Code Updated on 04/17/2012:

Code download: http://www.cs.utah.edu/~rajeev/usimm-v1.3.tar.gz

Changes in Version 1.1: http://www.cs.utah.edu/~rajeev/pubs/usimm-appA.pdf
Changes in Version 1.2: http://www.cs.utah.edu/~rajeev/pubs/usimm-appB.pdf
Changes in Version 1.3: http://www.cs.utah.edu/~rajeev/pubs/usimm-appC.pdf

USIMM Tech Report: http://www.cs.utah.edu/~rajeev/pubs/usimm.pdf

The contest website: http://www.cs.utah.edu/~rajeev/jwac12/

Users mailing list sign-up:  http://mailman.cs.utah.edu/mailman/listinfo/usimm-users

69 comments:

  1. Is the competition meant for individuals? Or are we allowed to participate in teams?

  2. It'll take us about 2 more weeks to release traces. We want to make sure we're picking trace snapshots that are representative. If you're just looking for longer traces for testing, email us and we'll send you a link.

    On a related note, someone suggested that we add the PC for the cache miss in the trace. So in a week or two, we'll release version 1.1 that adds this support. If anyone would like other features added to version 1.1, please let us know soon.

    Replies
    1. Thank you for organizing a very interesting workshop. I have two requests for the new version. Please support the following features if you think they are appropriate for the championship.

      (Feature 1) Read/write with auto-precharge command
      The current USIMM implementation does not support the DDR3 auto-precharge feature. Supporting it would reduce command traffic and improve the performance of a closed-page policy.

      (Feature 2) Speculative activation
      I would like to activate specified rows speculatively, since speculative activations can significantly reduce memory access latency. Could you please add the following two APIs for speculative activation?
      1. int is_activate_allowed(int channel, int rank, int bank)
      2. int issue_activate_command(int channel, int rank, int bank, long long int row)
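
      As a concrete illustration of Feature 2 (purely hypothetical: these two calls are only the proposal above and do not exist in the current USIMM release, and predict_next_row() stands in for any user-supplied row predictor), a scheduler could open a predicted row when the command bus would otherwise be idle:

      /* Hypothetical use of the proposed speculative-activation API. */
      if (is_activate_allowed(channel, rank, bank)) {
          long long int row = predict_next_row(channel, rank, bank);
          issue_activate_command(channel, rank, bank, row);
      }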

  3. Yasuo, these are excellent suggestions. We'll try to include these in version 1.1.

  4. The PFP metric, as currently defined, is vulnerable to non-intuitive behavior and perhaps even gaming. I provide one example below.

    Let us consider a two-way multiprogrammed workload W, which consists of benchmarks X and Y.

    /* Benchmark Characteristics */
    X's delay (running by itself): 1
    Y's delay (running by itself): 1

    /* Workload W + Scheduler A */
    X's delay: 2
    Y's delay: 2

    /* Workload W + Scheduler B */
    X's delay: 1
    Y's delay: 2

    Between the two, Scheduler B clearly dominates Scheduler A: Scheduler B accelerates X (2->1) while not affecting Y (2->2). However, the PFP (performance-fairness product) metric tells a different story.

    /* Workload W + Scheduler A */
    "Performance": 2+2 = 4 (lower is better)
    Fairness: 2/2 = 1 (higher is better)
    PFP: 4/1 = 4 (lower is better)

    /* Workload W + Scheduler B */
    "Performance" : 1+2 = 3 (lower is better)
    Fairness: 1/2 = 0.5 (higher is better)
    PFP: 3/0.5 = 6 (lower is better)

    Here we see that PFP_A < PFP_B, denoting that Scheduler A is somehow better than B, when that is definitely not the case.
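
    In symbols (matching the arithmetic above, with T_i the delay of benchmark i within the workload and T_i^alone its delay running by itself; the official definitions are in the tech report):

    \[
    s_i = \frac{T_i}{T_i^{\mathrm{alone}}}, \qquad
    \mathrm{Perf} = \sum_i T_i, \qquad
    \mathrm{Fairness} = \frac{\min_i s_i}{\max_i s_i}, \qquad
    \mathrm{PFP} = \frac{\mathrm{Perf}}{\mathrm{Fairness}}
    \]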

  5. After some thought, I haven't been able to come up with a good alternative metric that captures both performance and fairness using a single number.

    Instead of using a single number, one option is to define a ("performance", fairness)-tuple for each workload. Using the same example as in my previous comment, the tuples for the two schedulers would be the following.

    /* Scheduler A */
    ("performance", maximum slowdown) = (4, 2)

    /* Scheduler B */
    ("performance", maximum slowdown) = (3, 2)

    I've used the maximum slowdown as the metric for fairness, because I think it's more robust than the min-to-max slowdown ratio, which is susceptible to high fluctuation since it relies on the two most outlying benchmarks within a workload. (The CAL'11 paper by Vandierendonck and Seznec also agrees, but for different reasons.)

    To compare the two schedulers, you would do an element-wise comparison of the tuples. For "performance" (lower is better), Scheduler B is awarded one point while, for fairness, there is a tie so both schedulers receive zero points. To compare the two schedulers for multiple workloads, you would do the same workload-by-workload comparison between the two schedulers and sum up all the points to determine the winner.
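
    As a sketch of the point tally (my own illustration of the procedure above; result_t and the scoring convention are invented for this example, with lower being better for both elements):

    typedef struct { double perf; double max_slowdown; } result_t;

    /* Element-wise tuple comparison across workloads: a positive score means
       scheduler A wins, negative means B wins, zero is a tie. */
    int compare_schedulers(const result_t *a, const result_t *b, int num_workloads)
    {
        int score = 0;
        for (int i = 0; i < num_workloads; i++) {
            if (a[i].perf < b[i].perf) score++;
            else if (a[i].perf > b[i].perf) score--;
            if (a[i].max_slowdown < b[i].max_slowdown) score++;
            else if (a[i].max_slowdown > b[i].max_slowdown) score--;
        }
        return score;
    }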

    This is just my suggestion. Other people may have better ideas.

  6. Yoongu, thanks. I agree with your point that PFP (as defined) is not appropriate. So we will change the metric for the third track. I'm hoping this discussion will help us converge on an appropriate alternative metric within a week or two.

    PFP (as defined) turns out to be a bad metric because doing well on some program ends up hurting the fairness metric and overall PFP. So you're almost discouraged from doing too well on some program. Like you suggest, let's change the fairness metric from min-to-max slowdown to just the max slowdown. That change would restore sanity to the PFP metric :-). Do you see any problem with this new measure (product of overall performance and the performance of the most affected program)?
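
    Written out with the same notation as in the comments above (a candidate definition until the rules are finalized):

    \[
    \mathrm{PFP}' = \Big( \sum_i T_i \Big) \times \max_i \frac{T_i}{T_i^{\mathrm{alone}}}
    \]

    On the example above, this gives 4 x 2 = 8 for Scheduler A and 3 x 2 = 6 for Scheduler B, so B correctly wins.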

    The element-wise comparison of tuples would not work well when comparing (say) five different schedulers in your next paper. But it could work in a memory scheduling competition. We could do a March-Madness-style bracket where each scheduler has a match with one other scheduler for the right to advance. :-) I'm only half-serious here.

    Replies
    1. Hi Rajeev. I never thought of it that way, as March Madness :) Anyway, the new metric seems more robust; at least I can't think of an obvious exploit. One caveat: you would need to ensure that the single-thread execution latencies of all the applications are more or less the same. Otherwise, the metric may degenerate into shortest-job-first, where you assign the highest priority to the application with the lowest single-thread execution latency, the next highest priority to the next one, and so on. This is because, when trying to minimize the maximum slowdown, the "shortest" application has the lowest value in the denominator (single-thread latency), so you really want to make sure that its numerator (multiprogrammed latency) is low as well.

  7. According to the configuration in the published manual, the 1-channel case seems to be confined to a x4 organization. Is there any special reason for restricting the 1-channel case to x4 chips? This is not about the tool; I'm just curious. Thank you in advance!

    Replies
    1. To support 16 GB capacity on a 2-rank channel, we had to use 16 x4 4Gb chips per rank -- we were restricted to using the chips supported by the Micron power calculator. For the lower-capacity configurations, we had more options. It was a somewhat arbitrary choice to stick to x4 chips to minimize the number of variables that were changing.

      Since we do include power models for x8 and x16 chips, the user can code up alternative configurations. The competition will only consider the configurations listed in Table 2 of the tech report.

  8. Hi,
    From the code, it appears that there is no support for multiple memory controllers. Is that planned for upcoming versions of the simulator, or is the competition only for a single memory controller?

    Replies
    1. Raghavendra, we do support multiple memory controllers. Each channel is controlled independently, so the scheduler for each channel can be viewed as a separate memory controller. We'll be using two configs in the competition, one of which is 4channel.cfg. So the competition will evaluate the case with 4 memory controllers.

  9. Hi,
    I've downloaded the 1G traces successfully, but I realized that four 1G traces are too heavy to simulate.
    A single 1G trace is not a problem.
    However, while a simulation with four 1G fluid traces finishes in 37 minutes, a simulation with four 1G black traces does not finish (my workstation has been running for 3 days and is still going).

    It would be reasonable to use 100M, 10M, or 1M traces in the real competition.
    In addition, the 1G and 100K traces look similar in terms of read density, write density, read hit ratio, write hit ratio, etc. (except in the face case).

    Replies
    1. I didn't have any trouble with my experiments, although I don't think I tried the case with four 1G black traces. We'll check it out.

    2. I only tested on version 1.1. My experiment with four blackscholes traces (1 billion instrs each), 1channel.cfg, and the FCFS scheduler finished in 16 minutes. If your problem persists, let us know (best to email usimm@cs.utah.edu with details about your simulation).

    3. Hi,

      Our professor gave us 1G traces that worked with the older usimm version, and I encountered the same problem. A few other students in our class had the same issue, so I think the older usimm version has this problem for sure. I tried with the newer one and, as you said, it finished in less than 15 minutes (I haven't tested other cases though).

      Also, usimm seems to take a lot of memory; is this a leak in the code?

    4. Hi Minhaj,

      Thanks for pointing out the memory requirements issue. We will look into it.

    5. Hi, the simulation time problem occurs with the old version.
      However, in v1.1, the simulation time is under 15 minutes for the 4-core test.
      Thanks for releasing 1.1.

  10. Version 1.1 is available. The switch to 1.1 should be quite straightforward -- simply copy your scheduler.c and scheduler.h into the new src/ directory. The tool will take longer to download because the input/ directory includes five billion-instruction traces.

    A summary of changes can be found here.

    The tool itself can be downloaded here. (190 MB)

  11. Hi, Are we allowed to change only scheduler.c and scheduler.h for the competition? Or can we make changes in other files if required by our algorithm?

    Replies
    1. Hi Ram,

      For the competition, contestants are allowed to change just the two files you mentioned. The other files contain models for the processor and the DRAM timing which should not be modified. The scheduler just needs to pick a valid scheduling candidate. We have exposed most of the important variables in the simulator (from the processor and memory controller structures), so scheduling decisions can be based on a combination of those.

      Let us know if you have any more questions.

    2. Hi,
      Are we allowed to define/initialize variables in other files to act as hardware structures for our algorithm, like adding a variable to the read queue in memory_controller.c?
      Also, if our algorithm wants to change priorities while adding an entry, can the code be added to the insert_read function (memory_controller.c)?

      Thank you.

    3. Hi Ram,

      You should not change any files other than scheduler.c and scheduler.h. scheduler.h declares an init function that you can use to initialize your own variables.
      Also, note that there's a user_ptr field in the request_t structure (i.e., a read queue entry) which can be used to tag read queue entries. In the schedule() function you can check whether that field has been initialized and assign your own values. Hope this helps.

    4. Hi Niladrish,
      The request_t structure has a user_ptr field which can be used, but it can only be initialized when init_new_node is called by insert_read, both in memory_controller.c. So if we are not supposed to modify that file, is there any way to initialize a new read request entry to hold a value that our algorithm uses? (As far as I understand, init_scheduler_vars is called only once, not whenever a new entry is created.)

    5. Hi Ram,

      You are right about the init_scheduler_vars() function - it is called only once and hence can't be used for tagging the read requests. But in schedule() (which gets called every memory cycle), you can look for read requests that do not have their user_ptr field set and identify them as new requests (or alternatively check their arrival time against the current time to figure out the same). Then you can attach your own data structure to each of these read requests.

      The other possible solution is to mirror the read queue in your own data structure which you update (add or modify) in schedule() every cycle and remove entries from the queue after you issue the last command for that entry.
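
      A sketch of the first approach (my illustration only: my_tag_t is a user-defined structure, and it assumes user_ptr starts out NULL for new requests -- see the discussion below):

      #include <stdlib.h>

      typedef struct { long long int first_seen; /* plus any per-request state */ } my_tag_t;

      void schedule(int channel)
      {
          request_t *req = NULL;
          LL_FOREACH(read_queue_head[channel], req) {
              if (req->user_ptr == NULL) {   /* newly arrived request */
                  my_tag_t *tag = (my_tag_t *) malloc(sizeof(my_tag_t));
                  tag->first_seen = CYCLE_VAL;
                  req->user_ptr = tag;
              }
          }
          /* ... base scheduling decisions on each request's tag, and free the
             tag once the last command for the request has been issued ... */
      }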

    6. Hi Niladrish,
      Sorry for bugging you again.
      The user_ptr field is not initialized to NULL, so any check in scheduler.c for whether it has been initialized results in a seg fault at random intervals if we use free. The error does not occur if we initialize user_ptr = NULL in init_new_node in memory_controller.c. Is this allowed?

    7. We will include that in the next usimm release, so for the time being, you can make that change in memory_controller.c and go forward with that. Later, you can replace the existing memory_controller.c with the new version of that file and things will move smoothly. Sorry for the inconvenience.

  12. I'm very sad to report this:
    my workstation has 8 GB of memory, and it fails when running four blackscholes_1G or four canneal_1G traces with version 1.1.
    I think the init_new_node() function causes this problem: init_new_node() is called from insert_read() and insert_write(), and "request_t *new_node" is never freed.
    So, as the read/write count increases, memory fills up with huge numbers of read and write requests that are never de-allocated. (black's read+write count is 7.4 million and canneal's is 17.9 million, while body is 4.2 million, fluid is 4.8 million, and freq is 5.6 million.)
    I doubt any workstation could simulate eight canneals.
    It looks like a de-allocation step is needed in the code.

  13. Hi stone moon,

    The issue can be fixed by inserting a couple of calls to free in memory_controller.c.

    Insert the following call to free() after the macro invocation LL_DELETE(read_queue_head[channel],rd_ptr) on line 641 in memory_controller.c:

    free(rd_ptr);

    Similarly, after LL_DELETE(write_queue_head[channel],wrt_ptr) on line 657, insert a call to free as follows:

    free(wrt_ptr);

    This should take care of deallocating the request nodes. We will include this fix in the next release.
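
    Put together, the patched regions would look like this (line numbers as above, in v1.1's memory_controller.c):

    /* Around line 641, after a read request retires: */
    LL_DELETE(read_queue_head[channel], rd_ptr);
    free(rd_ptr);   /* new: release the request node */

    /* Around line 657, after a write request retires: */
    LL_DELETE(write_queue_head[channel], wrt_ptr);
    free(wrt_ptr);  /* new: release the request node */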

    Sorry for the inconvenience and thanks for highlighting the problem.

    Replies
    1. Hi Niladrish Chatterjee,
      Thank you for your advice.
      I've applied your suggestion to the source code, and it works well :D (usimm now takes only 0.1% of total memory when running).

      Thank you again.

  14. Hello, I got an error while using scheduler.c (adapted from scheduler-fair.c) in usimm-v1.1.
    The message is:
    usimm: memory_controller.c:1540: update_memory: Assertion `is_refresh_allowed(channel, rank)' failed.
    Please let me know why this error occurred.

    Replies
    1. Hi chungchung,

      Can you please email your scheduler.c and scheduler.h (assuming those are the only files you have changed), plus the command line you used for this run, to usimm@cs.utah.edu.
      This will help us reproduce the problem.

      Thanks,
      Niladrish

    2. Hi chungchung,

      We went through the code and could not spot or reproduce the problem. Can you send us more details about the scheduler and inputs you used if the problem persists?

    3. Hi Niladrish,
      I'm sorry: I removed the file and re-downloaded usimm,
      and it went well!

      I'm very sorry, and thank you for your attention.

  15. Hi,
    I am getting "process killed" for all the 1-channel configurations that take 4 trace files. The simulation for 1 channel with 1 trace file runs fine.

    Replies
    1. If this is because of a memory leak, please apply Nil's fix above.

      If not, confirm that the input/ directory has all the original files. If the problem persists, please send email to usimm@cs.utah.edu with details about the command line, output, and the scheduler.c/h files.

    2. thanks.

      For the performance track, do we need to compare the sum total of the completion times from all the cores, or only the final completion time of the entire simulation irrespective of core number?

    3. The metric in the performance track will be the sum of completion times for all cores, not just the final completion time for a workload.

  16. Hi, I found a bug.
    scheduler-refr.c does not operate normally on all the traces you have distributed: in the 1-channel configuration, only the "body" trace works, while "black", "can", "fluid", and "freq" fail due to an assertion failure.
    I tried to debug this situation and figured out the reason.

    During the simulation, "last_refresh_completion_deadline" and "next_refresh_completion_deadline" are shifted together by the amount of "8*tREFI",
    and "refresh_issue_deadline" is updated to "next_refresh_completion_deadline - tRP - tRFC*(8 - num_issued_refresh)" every DRAM cycle.
    Here is the bad situation.
    When all 8 refreshes are issued in advance, "refresh_issue_deadline" equals "next_refresh_completion_deadline - tRP".
    If the last (8th) refresh command is issued at "next_refresh_completion_deadline - tRP - tRFC/2",
    "dram_state.next_refresh" is updated to "next_refresh_completion_deadline - tRP + tRFC/2", which causes the assertion failure:
    at "next_refresh_completion_deadline - tRP", the "is_refresh_allowed" function returns 0 because "CYCLE_VAL < dram_state.next_refresh"
    (CYCLE_VAL = next_refresh_completion_deadline - tRP, dram_state.next_refresh = next_refresh_completion_deadline - tRP + tRFC/2).

    Additional code is needed to avoid this problem (in memory_controller.c).
    At line 1547 in memory_controller.c,
    "else if(CYCLE_VAL == refresh_issue_deadline[channel][rank])"
    should be changed to
    "else if( (CYCLE_VAL==refresh_issue_deadline[channel][rank]) && (num_issued_refreshes[channel][rank]<8) )"
    in order to skip the timing check when all 8 refreshes have already been issued in the 8*tREFI window.

    Of course, I can avoid this situation in scheduler-refr.c by restricting refreshes once
    "CYCLE_VAL > next_refresh_completion_deadline - (8 - num_issued_refresh)*tRFC - tRP".
    However, I don't want that conservative restriction, which limits performance improvement.

    It would be very helpful if you could consider this situation when you release usimm 1.2.

    Thank you.

    Replies
    1. Hi stone moon,

      Thanks for reporting this bug. We will look into it. Can you send us the exact command line to reproduce it?

      Sorry for the inconvenience.

      Thanks,
      Niladrish

    2. Hi Niladrish Chatterjee,

      I just ran with the basic code and command:

      1. Compile scheduler-refr.c and scheduler-refr.h in the "src" directory by changing the Makefile.

      2. Run from the home directory:
      bin/usimm input/1channel.cfg input/blackscholes_1G.input_1.1

      You can see the assertion fail at processor cycle 403353592.

  17. Version 1.2 is now available. Download the tool and copy your own scheduler.c and scheduler.h into the src/ directory. This version fixes the memory leaks and bugs that have been reported in the above comments (thanks to everyone for reporting the bugs and fixes!). The only files that have changed are memory_controller.c and our license file.

    A summary of changes can be found here.

    The tool itself can be downloaded here. (190 MB)

  18. Hi,
    Thank you for releasing such a valuable simulator.

    I got an assertion error when I used the autoprecharge command on usimm v1.2.

    usimm: memory_controller.c:1548: update_memory: Assertion `is_refresh_allowed(channel, rank)' failed.

    In is_autoprecharge_allowed(), the interval after the autoprecharge is calculated based only on CYCLE_VAL; however, the actual interval can be based on dram_state[channel][rank][bank].next_pre in issue_autoprecharge().
    Unlike the other commands, the autoprecharge command can add T_RP to dram_state[channel][rank][bank].next_*.
    Therefore, in some cases, it causes a violation of refresh_issue_deadline.

    I guess the following is_autoprecharge_allowed() is correct:

    int is_autoprecharge_allowed(int channel, int rank, int bank) {
        if (((cas_issued_current_cycle[channel] == 1) && ((max(CYCLE_VAL + T_RTP, dram_state[channel][rank][bank].next_pre) + T_RP) <= refresh_issue_deadline[channel][rank]))
            || ((cas_issued_current_cycle[channel] == 2) && ((max(CYCLE_VAL + T_CWD + T_DATA_TRANS + T_WR, dram_state[channel][rank][bank].next_pre) + T_RP) <= refresh_issue_deadline[channel][rank])))
            return 1;
        else
            return 0;
    }

    or

    int is_autoprecharge_allowed(int channel, int rank, int bank) {
        if (((cas_issued_current_cycle[channel] == 1) && ((dram_state[channel][rank][bank].next_pre + T_RP) <= refresh_issue_deadline[channel][rank]))
            || ((cas_issued_current_cycle[channel] == 2) && ((dram_state[channel][rank][bank].next_pre + T_RP) <= refresh_issue_deadline[channel][rank])))
            return 1;
        else
            return 0;
    }


    Thank you,

    Keisuke KUROYANAGI

    Replies
    1. Thanks for being so quick to point this out (also Kouhei Hosokawa, who emailed the usimm group). We've updated the v1.2 web link with the fixed code. Apologies for the oversight and resulting confusion!

  19. Hi,
    Can we implement a closed-page policy by modifying only scheduler.c? The simulator uses an open-page policy. What should we do from the scheduler function so that we can selectively close some pages just after the read command is sent? Any pointers for this?
    Thank you.

    Replies
    1. Hi Ram,

      There are two options for precharging a bank after issuing a read or write.
      The first option is to issue an explicit precharge command using the issue_precharge_command function. For this command to be successful, enough time has to pass after a read or write for a bank to be ready for a precharge (which can be checked by the is_precharge_allowed function).

      The second option, which is the one that emulates the READ_AUTO_PRE command functionality offered by memory controllers, is to issue an autoprecharge command on the same cycle when you issue a read or a write. This can be achieved by calling the issue_autoprecharge_command the very same cycle that a COL_RD or COL_WR is issued through the issue_request_command function. This command will effectively close the row after the current read or write is performed and you don't have to spend an extra command bus cycle to send an explicit precharge or wait till the precharge conditions are met before issuing the explicit precharge command. (see appendix A).
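
      A sketch of the second option (illustrative only; it assumes the v1.2 interface and that request_t exposes the bank address through a dram_addr field -- check memory_controller.h for the exact names):

      void schedule(int channel)
      {
          request_t *req = NULL;
          LL_FOREACH(read_queue_head[channel], req) {
              if (req->command_issuable) {
                  issue_request_command(req);   /* may issue ACT, PRE, COL_RD, COL_WR */
                  /* If a column read/write just went out, fold in the precharge;
                     is_autoprecharge_allowed() returns 0 otherwise. */
                  if (is_autoprecharge_allowed(channel, req->dram_addr.rank, req->dram_addr.bank))
                      issue_autoprecharge_command(channel, req->dram_addr.rank, req->dram_addr.bank);
                  return;
              }
          }
      }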

      Thanks,
      Niladrish

  20. Hi,
    Thank you for your reply. I have one more query. I added a few lines to the FCFS scheduler to print the contents of the queue and the entry that was scheduled (printed inside (( )) ). We checked Bank 4 and found an issue:

    Bank 4 Row 38104 ThreadID 1 is_issuable 0 COL_READ_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 1 ACT_CMD
    Bank 4 Row 100125 ThreadID 3 is_issuable 0 PRE_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 1 ACT_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 1 ACT_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 1 ACT_CMD

    -------------------------------------------
    ((Bank 4 Row 39795 ThreadID 1 is_issuable 1 ACT_CMD))

    -----------------------------------------------------------------
    Bank 4 Row 38104 ThreadID 1 is_issuable 0 COL_READ_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 0 COL_READ_CMD
    Bank 4 Row 100125 ThreadID 3 is_issuable 1 PRE_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 0 COL_READ_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 0 PRE_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 0 PRE_CMD

    -------------------------------------------
    ((Bank 4 Row 100125 ThreadID 3 is_issuable 1 PRE_CMD))

    -----------------------------------------------------------------

    Bank 4 Row 38104 ThreadID 1 is_issuable 1 ACT_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 1 COL_READ_CMD
    Bank 4 Row 100125 ThreadID 3 is_issuable 1 ACT_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 1 COL_READ_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 0 PRE_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 0 PRE_CMD

    -------------------------------------------
    ((Bank 4 Row 38104 ThreadID 1 is_issuable 1 ACT_CMD))

    -----------------------------------------------------------------

    Bank 4 Row 38104 ThreadID 1 is_issuable 0 COL_READ_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 1 COL_READ_CMD
    Bank 4 Row 100125 ThreadID 3 is_issuable 0 PRE_CMD
    Bank 4 Row 39795 ThreadID 1 is_issuable 1 COL_READ_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 0 PRE_CMD
    Bank 4 Row 66491 ThreadID 2 is_issuable 0 PRE_CMD

    -------------------------------------------
    ((Bank 4 Row 39795 ThreadID 1 is_issuable 1 COL_READ_CMD))

    -----------------------------------------------------------------

    Row 39795 was sent an ACT_CMD, then Row 100125 was sent a PRE_CMD and Row 38104 an ACT_CMD, followed again by Row 39795 with a COL_READ_CMD. Should that not be a PRE_CMD? Is there anything wrong with our interpretation?

    Replies
    1. Ram, note that each rank has a "Bank 4", so maybe that's leading to the confusion. If these stats pertain to a single rank, please email us the command line and other details for your simulation so we can reproduce the possibly unexpected behavior.

  21. Hi,
    Are we allowed to communicate through global variables between DRAM controllers (in scheduler.c)?
    How can we account for the communication time overhead?

    Replies
    1. Yes, you can add variables in scheduler.c to exchange values between memory controllers.

      The program committee will evaluate whether your algorithm is "implementable". It's best to make a reasonable assumption for the communication overhead between controllers, e.g., based on a 15mm x 15mm chip and global wire speeds.
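
      As a rough, purely illustrative estimate (every number here is my assumption, not a contest parameter): with global on-chip wires at roughly 100 ps/mm, crossing a 15 mm die takes about 15 mm x 100 ps/mm = 1.5 ns, i.e., one or two cycles of an 800 MHz DRAM command clock (1.25 ns per cycle). So values exchanged between controllers should be treated as at least a few cycles stale.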

  22. Hi, Rajeev.
    I have a question.
    How do I submit the source code and paper on the due date?
    Via e-mail?

    Replies
    1. We're in the process of setting up the submission site. Most likely, you'll submit the pdf on the submission site and the code via email. Stay tuned for submission details.

      We'll also release version 1.3 shortly (very minor code changes). We'll also be releasing more (but not all) traces. Consistent with previous JWACs, contestants will only see a subset of the workload before the submission deadline.

  23. The program committee for the competition is now posted on the competition page. Since there were a few questions about this, I thought I'd clarify the role of the PC.

    The competition winners will be determined entirely by the numbers produced by the simulation experiments. The PC will check the following criteria and provide feedback that may help authors with future submissions of their work. Criteria that must be fulfilled to qualify for the competition:

    1. The algorithm must be "implementable" on modern hardware. This includes meeting the storage budget of 68 KB and being computationally tractable.

    2. The paper must offer new insight beyond published work. This could either be a new scheduling algorithm, an effective combination of known scheduling heuristics, or the authors' own previously published algorithm analyzed on the new simulation infrastructure.

    Please email me if there is any confusion about the competition rules.

    Replies
    1. P.S. The PC can also recommend acceptance for a paper that may not have the best results, but that provides significant new insight.

  24. Version 1.3 is now available here (378 MB).

    This is hopefully our final release before the competition. The code changes in version 1.3 are minor. The most significant addition is the set of workloads that will be used for the competition submissions. Details on version 1.3 additions can be found here. Please read it carefully.

    In the next day or two, we'll also release files that will facilitate quantifying and reporting the final competition metrics.

    As always, please email usimm@cs.utah.edu if you have any questions or feedback.

  25. Hi,
    Can I send both the paper and the source code via e-mail?
    Because of my company's security policy, sending e-mail is easier than uploading files.

  26. Here is a perl script to make it easier to compute final metrics and graph results. (Note that you'll have to rename the file to "usimm-script.pl")

    The perl script reads your "runsim" script, finds your output files in an output/ directory, and produces the following:

    1. The total execution time, PFP, and EDP metrics on stdout.

    2. A csv file that can be read by excel and that allows you to graph the performance and EDP numbers.

    Please read the perl script documentation for more details, especially the naming convention for output files. I hope people find this useful. As always, contact usimm@cs.utah.edu if you have any questions or suggestions.

    (Thanks to Manju Shevgoor for writing this script.)

  27. The submission site for the MSC is now open. The submission deadline is Tuesday, April 24th, 9pm PDT (Pacific daylight saving time). Contestants must submit a 6-page conference-style paper pdf on the site and email their scheduler.c/h to nil@cs.utah.edu.

    Please email me if you have any questions about the process.

  28. Here are links to a sample results table: pdf and tex. Submissions are not required to include such a table. But I suspect many will -- hopefully, this template will reduce effort.

  29. Please note that blind submissions are allowed, but not required. If you're submitting a minor extension of your own published scheduler, it probably makes sense to identify yourself in the submission.

  30. First, thanks for this useful tool for understanding DRAM internals and experimenting with different scheduling policies.

    I have a question regarding the trace format. I plan to use Pin instrumentation to generate traces. The trace format specifies that each memory instruction is displaced from the preceding memory instruction by some value >= 0. For a gap of 0, the two instructions are fed to the ROB next to each other; for a gap > 0, the second memory instruction should be fed only after the corresponding CPU cycles have elapsed. I understand that this is possible for out-of-order execution, but for a large gap (say 1000+), a small instruction window can't capture it. However, the fetch engine seems to disregard this in the current implementation.

    The code snippet below (from main.c) shows this:
    while ((num_fetch < MAX_FETCH) && (ROB[numc].inflight != ROBSIZE) && (!writeqfull))
    {
    /* Keep fetching until fetch width or ROB capacity or WriteQ are fully consumed. */
    }

    Thanks in advance for your suggestion.

    Replies
    1. If there is a large gap between two memory instructions, it'll take many cycles before the second instruction can even enter the ROB. The code does show that the ROB is advanced MAX_FETCH at a time, while never exceeding the ROB size or write buffer size. I'm not sure what you're confused by and what you feel is being disregarded.
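
      Schematically, the fetch loop behaves like the sketch below (illustrative, not the actual main.c; gap_remaining and the helper functions are invented for this example, while MAX_FETCH, ROBSIZE, and writeqfull come from the snippet above):

      int num_fetch = 0;
      while ((num_fetch < MAX_FETCH) && (rob_occupancy < ROBSIZE) && (!writeqfull)) {
          if (gap_remaining > 0) {
              gap_remaining--;   /* a non-memory instruction enters the ROB */
          } else {
              enqueue_memory_request(trace_line);   /* the memory instruction enters */
              trace_line = read_next_trace_line(&gap_remaining);
          }
          rob_occupancy++;
          num_fetch++;
      }
      /* A memory instruction with a gap of 1000+ thus enters the ROB only after
         1000+ earlier slots have been fetched (and retired, given the ROB limit). */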

  31. Thank you for this simulator. I am trying to model an approximate DRAM using USIMM. I'd like to start off by modifying a few refresh parameters. I went through the params.h file but did not find numbers for these there. How do I go about doing this? Thanks!

    Replies
    1. Bhaskar, most of the DRAM timing parameters can be found in the input/*.vi files. There's a separate file for each DRAM chip type.

  32. Hi. I have a question about the workloads. I don't understand how you generated the workloads with Simpoint. I don't think Simpoint prints addresses; did you modify Simpoint's code or use another tool? If you used another tool, please let me know which one. Thank you.

    Replies
    1. We use a methodology similar to Simpoint's. The traces were ultimately produced by Simics, but you can also use other tools like Pin.
