Saturday, February 5, 2011

Common Fallacies in NoC Papers

I am asked to review many NoC (network-on-chip) papers.  In fact, I just finished reviewing my stack of papers for NOCS 2011.  Many NoC papers continue to make assumptions that might have been fine in 2007, but are highly questionable in 2011.  My primary concern is that most NoC papers overstate the importance of the network.  This overstatement is used to justify complex solutions, and to justify a highly over-provisioned baseline; many papers then introduce optimizations to reduce the area/power/complexity of that over-provisioned baseline.  These two fallacies have resulted in many NoC papers with possibly limited shelf life.

The first misleading overstatement is this (and my own early papers have been guilty of it): "Intel's 80-core Polaris prototype attributes 28% of its power consumption to the on-chip network", and "MIT's Raw processor attributes 36% of its power to the network".  Both processors are several years old.  A modern network would likely incorporate many recent power optimizations (clock gating, low-swing wiring, etc.) and locality optimizations (discussed next).  In fact, Intel's latest many-core prototype (the 48-core Single-Chip Cloud Computer) attributes only 10% of chip power to the network.  This dramatically changes my opinion on the kinds of network optimizations that I'd be willing to accept.

The second overstatement has to do with the extent of network traffic.  Almost any high-performance many-core processor will be organized as tiles.  Each tile will have one or a few cores, private L1 and L2 caches, and a slice (bank) of a large shared L3.  Many studies assume that data placement in the L3 is essentially random and that a message on the network travels halfway across the chip on average.  This is far from the truth.  The L3 will be organized as an S-NUCA, and OS-based first-touch page coloring can influence the cache bank that houses every page.  A thread will access a large amount of private data, most of which will be housed in the local bank and can be serviced without network traversal.  Even for shared data, assuming some degree of locality, data can be found relatively close by.  Further, if the many-core processor executes several independent programs or virtual machines, most requests are serviced by a small collection of nearby tiles, and long-range traversal on the network is only required when accessing a distant memory controller.  We will shortly post a characterization of network traffic for the processor platform I describe above: for various benchmark suites, an injection rate and a histogram of distance traveled.  This will hopefully lead to a more meaningful synthetic network input than the most commonly used "uniform random".
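To make this concrete, below is a minimal Python sketch of the effect (the mesh size, locality fraction, and distance-decay model are illustrative assumptions on my part, not the measured data we will post).  It biases each request toward the local bank, as first-touch page coloring would, and prints the resulting hop-distance histogram alongside the uniform-random average:

```python
import random
from collections import Counter

MESH = 8               # 8x8 tiled many-core (assumed for illustration)
LOCAL_FRACTION = 0.7   # fraction of requests hitting the local L3 bank (assumed)

def hops(src, dst):
    # Manhattan distance, i.e., hop count under dimension-order routing.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def destination(src):
    # First-touch page coloring: most private data sits in the local bank.
    if random.random() < LOCAL_FRACTION:
        return src
    # Shared data: pick a remote bank with probability decaying in distance.
    tiles = [(x, y) for x in range(MESH) for y in range(MESH) if (x, y) != src]
    weights = [0.5 ** hops(src, t) for t in tiles]
    return random.choices(tiles, weights)[0]

hist = Counter()
for _ in range(100000):
    src = (random.randrange(MESH), random.randrange(MESH))
    hist[hops(src, destination(src))] += 1

total = sum(hist.values())
avg = sum(d * n for d, n in hist.items()) / total
uniform = 2 * (MESH * MESH - 1) / (3.0 * MESH)  # uniform-random average distance
print("average hops: %.2f (uniform random: %.2f)" % (avg, uniform))
for d in sorted(hist):
    print("%2d hops: %5.1f%%" % (d, 100.0 * hist[d] / total))
```

Under these toy assumptions, the bulk of requests complete in zero or one hop and the average distance lands far below the uniform-random 5.25 hops, which is the qualitative point.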

With the above points considered, one would very likely design a baseline network that is very different from the plain vanilla mesh network.  I would expect some kind of hierarchical network: perhaps a bus at the lowest level to connect a small cluster of cores and banks, perhaps a concentrated mesh, perhaps express channels.  For those who haven't seen it, I highly recommend this thought-provoking talk by Shekhar Borkar, where he argues that buses should be the dominant component of an on-chip network.  I highly doubt the need for large numbers of virtual channels, buffers, adaptive routing, etc.  I'd go as far as to say that bufferless routing sounds like a great idea for most parts of the network.  If most traffic is localized to the lowest level of the network hierarchy because threads find most of their data nearby, there is little inter-thread interference and no need for QoS mechanisms within the network.
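As a quick back-of-the-envelope check (the topology and locality fraction below are my assumptions, not numbers from Borkar's talk), one can compare router traversals per request on a flat 8x8 mesh against a hierarchy of 4-tile bus-connected clusters joined by a 4x4 concentrated mesh:

```python
def avg_mesh_hops(n):
    # Average Manhattan distance between two uniformly chosen tiles on an n x n mesh.
    return 2 * (n * n - 1) / (3.0 * n)

LOCAL = 0.7   # fraction of requests serviced within a 4-tile cluster (assumed)

# Flat 8x8 mesh: even intra-cluster traffic crosses a router or so.
flat = LOCAL * 1.0 + (1 - LOCAL) * avg_mesh_hops(8)

# Hierarchical: a bus inside each cluster (zero router hops), then a
# 4x4 concentrated mesh for inter-cluster traffic.
hier = LOCAL * 0.0 + (1 - LOCAL) * avg_mesh_hops(4)

print("flat mesh:        %.2f router hops/request" % flat)
print("bus + conc. mesh: %.2f router hops/request" % hier)
```

With 70% of requests staying inside a cluster, the hierarchical design cuts average router traversals by roughly 3x in this toy model, which is exactly why a skinny, possibly bufferless, upper level becomes plausible.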

In short, I feel the NoC community needs to start with highly localized traffic patterns and highly skinny networks, and identify the minimum amount of additional provisioning required to handle various common and pathological cases.
