The Hardware Ray Tracing group has just released our architectural simulator, called simtrax. The simulator, compilation tools, and a number of example programs can all be checked out from the Google Code repository. We have a couple of tutorial pages on the simtrax wiki that will hopefully get you started.
Now you may be wondering what kind of architecture simtrax simulates. The answer is a number of different configurations of the architecture we call TRaX (for Threaded Ray eXecution), which is designed specifically for throughput processing of a ray tracing application. As such, the simulator is not well suited to a large number of application spaces, but it does show quite good results for ray tracing and would perform well on other similar applications.
In particular, the limitations built into the TRaX architecture, and assumed by the simulator, include the following:
- No cache coherence – at present the memory is kept coherent, but no simulation penalty is added to maintain coherence. We depend on code being written so that it does not rely on coherence.
- No write caching – writes are assumed to write around the cache, and the lines are invalidated if they were cached, but again, no coherence messages are sent to the other caches in the system.
- Small stack sizes – each thread has a small local store for the program call stack. Recursion and deep call stacks can cause overflow or excessive thread-context size.
- A single global scene data structure – relevant data is loaded only when a thread needs it.
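To make the write-around behavior concrete, here is a minimal sketch of a cache in which reads allocate lines while writes go straight to memory and invalidate any cached copy. This is not simtrax code: the class name `WriteAroundCache`, the direct-mapped organization, and the line and cache sizes are all assumptions chosen for illustration, and only a single cache is modeled.

```python
class WriteAroundCache:
    """Toy direct-mapped cache: reads allocate lines, writes bypass the cache."""

    def __init__(self, memory, num_lines=4, words_per_line=4):
        self.memory = memory                  # backing store: a list of words
        self.num_lines = num_lines            # hypothetical cache size
        self.words_per_line = words_per_line  # hypothetical line size
        self.lines = {}                       # line index -> (tag, copy of the line)
        self.hits = 0
        self.misses = 0

    def _locate(self, addr):
        """Split a word address into (line index, tag, offset within line)."""
        block = addr // self.words_per_line
        return block % self.num_lines, block // self.num_lines, addr % self.words_per_line

    def read(self, addr):
        index, tag, offset = self._locate(addr)
        entry = self.lines.get(index)
        if entry is not None and entry[0] == tag:
            self.hits += 1
            return entry[1][offset]
        # Miss: fetch the whole line from memory and keep a copy.
        self.misses += 1
        base = (addr // self.words_per_line) * self.words_per_line
        data = list(self.memory[base:base + self.words_per_line])
        self.lines[index] = (tag, data)
        return data[offset]

    def write(self, addr, value):
        # The write goes around the cache, directly to memory.
        self.memory[addr] = value
        # If the line happens to be cached, drop it so later reads refetch it.
        index, tag, _ = self._locate(addr)
        entry = self.lines.get(index)
        if entry is not None and entry[0] == tag:
            del self.lines[index]


memory = list(range(64))
cache = WriteAroundCache(memory)
cache.read(5)        # miss: loads the line holding words 4..7
cache.read(6)        # hit: same line
cache.write(6, 99)   # bypasses the cache and invalidates that line
cache.read(6)        # miss again: refetches the line and sees the new value
```

In this sketch a write never leaves data in the cache, so code that writes a value and reads it back soon after pays for a miss on the read.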
The architecture we simulate is hierarchical in nature. First there are a number of independent, in-order, scalar thread processors. These are grouped in sets (usually of 32) that make up what is called a Threaded Multiprocessor (TM). While each thread processor has a few independent execution units, some of the larger, less-frequently-used units are shared among the threads in the TM. In addition, the TM shares a single L1 data cache and a number of instruction caches, both of which have multiple independent banks, which allows distinct words to be fetched independently.
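A rough way to see why banking matters is to count how many cycles a group of simultaneous requests would take. The sketch below assumes word-interleaved banks (word address modulo the bank count selects the bank), one word delivered per bank per cycle, and requests for the same word being served together. None of these details come from simtrax itself, and `cycles_to_serve` is a hypothetical helper.

```python
from collections import Counter


def cycles_to_serve(word_addresses, num_banks=4):
    """Cycles needed to serve the requests, one word per bank per cycle.

    Assumption: banks are word-interleaved, so (address % num_banks)
    picks the bank, and identical addresses are served by one access.
    """
    distinct = set(word_addresses)
    if not distinct:
        return 0
    per_bank = Counter(addr % num_banks for addr in distinct)
    # The busiest bank determines how long the whole group takes.
    return max(per_bank.values())


print(cycles_to_serve([0, 1, 2, 3]))    # four different banks
print(cycles_to_serve([0, 4, 8, 12]))   # all four words map to bank 0
```

Under these assumptions, threads that touch neighboring words are served in parallel, while threads that stride by a multiple of the bank count serialize on a single bank.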
A number of TMs are grouped together to share access to an L2 data cache, which then manages the connection to the off-chip global memory. In a large chip there may be as many as 4 separate L2 data caches, while a smaller chip might do away with the L2 data cache altogether. The final piece of the puzzle is a small global register file with support for an atomic increment operation, which is used to give out thread work assignments.
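The work-assignment scheme can be sketched in a few lines: every thread repeatedly performs an atomic increment on a shared counter and treats the value it gets back as the index of the next pixel to shade. The sketch below imitates that with Python threads and a lock. The names `GlobalRegister`, `atomic_inc`, and `render` are hypothetical, and returning the pre-increment value is an assumption rather than a statement about the TRaX instruction.

```python
import threading


class GlobalRegister:
    """Stand-in for one global register that supports an atomic increment."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def atomic_inc(self):
        """Return the current value and add one as a single indivisible step."""
        with self._lock:
            old = self._value
            self._value += 1
            return old


def render(num_pixels, num_threads, shade):
    """Shade every pixel, letting threads claim pixels through the counter."""
    counter = GlobalRegister()
    image = [None] * num_pixels

    def worker():
        while True:
            pixel = counter.atomic_inc()   # claim the next unassigned pixel
            if pixel >= num_pixels:
                return                     # no work left
            image[pixel] = shade(pixel)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image


image = render(num_pixels=16, num_threads=4, shade=lambda p: p * p)
```

Because each pixel index is handed out exactly once, no two threads write the same pixel, and threads that finish early simply claim more work. That gives dynamic load balancing without any coordination beyond the counter.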
This simulator has been in development for a number of years and has had a number of contributors. In addition, we have taught a course using the simulation and compiler tools, and published four conference and journal articles, with more in progress.
We hope this simulation tool can be useful and intend to do what we can to support it. Feel free to email us or post bug reports on the Google Code page.