One of the constants valued by our developers is the backward compatibility provided by our architectures in the form of a consistent ISA. Historically, a corollary of this has been that legacy software has benefited from process and micro-architectural improvements. Of course, we are doing our best to make sure this “forward scalability” corollary still holds true (AKA the “free lunch”). But the incentive is increasing to re-optimize software to better take advantage of new micro-architectural features that don’t obviously benefit legacy binaries. As I’ve mentioned previously in this blog, core counts increase (this shouldn’t surprise anyone at this point), memory hierarchies change, and ISAs evolve as power efficiency becomes the first-order concern. (The graph below shows a combination of a retrospective look at changes in vector ISAs and a speculative look at how vector ISAs might change in the future, considering things like co-processor/GPU trends, application usage, and so on.) These changes may have the unintended consequence of tempering the forward scalability corollary…or even regressing performance in some cases.
We specifically architected Ct for forward scalability. Of course, the dynamic nature of its current incarnation helps this, but underlying it are two specific components designed to deal with forward scaling:
- ISA: Ct uses an abstraction layer for code generation against evolving vector ISAs, while retaining the performance of natively generated IA code.
- Threads and locality: Ct’s threading runtime provides a portable layer for fine-grained threading and synchronization that adapts to the underlying microarchitecture and topology.
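To make the first of these concrete, here’s a minimal sketch (plain C++ with made-up names; Ct’s actual API and internals differ) of the idea behind a vector abstraction layer: the programming model fixes a *logical* vector width, and the backend decides how that maps onto whatever physical SIMD registers the machine has:

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch: a fixed *logical* vector width independent of the
// physical SIMD width. The backend (here, a plain loop a compiler can
// auto-vectorize) is free to map it onto SSE, AVX, or a future wider ISA.
constexpr std::size_t kLogicalWidth = 16;  // illustrative, not Ct's real value

template <typename T>
struct LogicalVec {
    std::array<T, kLogicalWidth> lanes;
};

template <typename T>
LogicalVec<T> add(const LogicalVec<T>& a, const LogicalVec<T>& b) {
    LogicalVec<T> r;
    for (std::size_t i = 0; i < kLogicalWidth; ++i)   // the backend decides how
        r.lanes[i] = a.lanes[i] + b.lanes[i];         // many physical ops this is
    return r;
}
```

Code written against `LogicalVec` never mentions a physical register width, which is the property that lets the same source scale forward to wider hardware.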
Unfortunately, we in the corporate world are afflicted by this need to resort to three-letter acronyms (or TLAs; Dijkstra’s reference to this has to make you chuckle). The term “afflicted” is strangely appropriate because TLAs typically evoke diseases for me. This limits the number of clever names we can come up with, and these layers have had several names for this reason, most chosen because they’re the closest thing to sensible without colliding with any other projects (yet). They’ll probably change again, so I’m not naming them here. Slowly, my group has been crawling out of the muck of TLAs and using more letters in our naming, most notably for our group name: CLARA, named for a convoluted acronym, or Santa Clara, home of our headquarters. Santa Clara, by the way, is the patron saint of (among other things) embroiderers, eyes, television, and television writers, which seems appropriate in the age of visual computing. But I digress…
So, we have ISA and threading abstraction layers, both designed to address distinct aspects of forward scaling. These can be viewed as comprising what I’ve heard the folks at UC Berkeley call an “Efficiency Layer” (i.e. what a language/framework implementer might use) to Ct’s “Productivity Layer” (i.e. what the average programmer might use).
The ISA efficiency layer addresses ISA evolution and the power-driven push toward richer and/or wider and/or heterogeneous vector ISAs in future architectures (think CPUs + GPUs). We looked at some alternatives, but were generally dissatisfied with the lowest-common-denominator approach they took. For a forward-looking approach, it made more sense to pursue a more general solution, even at the expense of some performance. So, this meant supporting wider vector registers than are currently physically available, richer instruction sets (scatter/gather, swizzles, type conversions), and more control (user-defined precision on transcendentals and other interpolated functions). Again, the risk here is that we lose performance. However, in practice this hasn’t been our experience. On Core 2 Duo systems, we are competitive with hand-tuned code, and in some cases, where we can deploy improved interpolation algorithms under the ISA efficiency layer, we do better than hand-tuned performance (until that hand-tuned code is manually changed to use the improved algorithm). Our overall goal is to get to within 85% of hand-tuned performance. We’ll do better than this in many cases, but I have no doubt that we’ll be challenged by code that we’ll see in the near future.
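As an illustration of supporting instructions the physical ISA may lack, here’s a hedged sketch (names and fallback strategy are my own, not Ct’s) of how an abstraction layer can expose a gather operation everywhere: on hardware without a native gather it lowers to a scalar loop, while on hardware that has one the layer would emit the native instruction instead:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical software fallback for a "gather" primitive: read table entries
// at the given indices into a contiguous output. An ISA abstraction layer can
// always offer this operation, substituting a native vector gather where the
// target hardware provides one.
std::vector<float> gather(const std::vector<float>& table,
                          const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[i] = table[idx[i]];  // scalar loop: the lowest-common fallback
    return out;
}
```

The point is that the *interface* stays fixed while the lowering changes underneath, so binaries built against the layer pick up the hardware instruction when it arrives.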
The threading efficiency layer addresses multi- and many-core evolution and the power-driven causes of increasing core counts, heterogeneity (which I think is inevitable), and changes in the caches and interconnect (referred to by the vaguely Orwellian name “uncore” here at Intel). There are several ways to back-translate these influences on performance to a threading model:
1. Task granularity: The size of the tasks created in a collection-oriented language is subject to much choice…after all, we’re typically operating over at least modest-sized collections. This affords us the ability to control the size of the data chunk that each task works on, both to control memory footprint at various levels of the hierarchy and to minimize threading overhead. There is a tension between these two objectives that needs to be carefully balanced, and whose tradeoff is potentially unique per core/uncore design.
2. Synchronization algorithm: Collective communication primitives, like reductions and scans, induce particular synchronization patterns, which can be implemented in a variety of ways. For a given micro-architecture and number of cores, the choice of algorithm is likely to be very specific. But across micro-architectures and varying numbers of cores (and their relative loading) allocated to the process, the choice can vary. The underlying memory model can influence this as well, as data may need to be marshaled into messages, etc.
3. Task reuse: There are many instances in which the threading structure can be reused…that is, the particular granularity of tasks and synchronization structure can be treated as a data structure and reused across invocations of the code region.
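The granularity tension in point 1 can be sketched as a simple heuristic (entirely illustrative constants and names, not Ct’s actual policy): cap the chunk so each task’s working set fits a cache budget, but also split finely enough that every core gets several tasks:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical granularity heuristic: the chunk size is the smaller of
//  (a) the largest chunk whose working set fits the cache budget, and
//  (b) a chunk small enough to yield ~4 tasks per core for load balancing.
// A real runtime would tune both limits per core/uncore design.
std::size_t chooseChunk(std::size_t elems, std::size_t elemBytes,
                        std::size_t cacheBytes, std::size_t cores) {
    std::size_t byFootprint   = std::max<std::size_t>(1, cacheBytes / elemBytes);
    std::size_t byParallelism = std::max<std::size_t>(1, elems / (4 * cores));
    return std::max<std::size_t>(1, std::min(byFootprint, byParallelism));
}
```

For a large collection the cache budget dominates; for a small one, the need to keep all cores busy forces smaller chunks, which is exactly the tradeoff described above.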
Ct’s threading efficiency layer is set up to allow all of these decisions to be made dynamically. This late binding is essential to adapting to freed hardware resources, processor upgrades, new graphics cards, and so on. Also, reuse allows us to refine scheduling decisions progressively (i.e. which threads run on which cores).
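A tiny sketch of what late binding means in practice (strategy names are invented for illustration; Ct’s runtime makes richer decisions than this): the synchronization algorithm is chosen from the machine actually present at run time, so the same binary can pick a different algorithm on tomorrow’s hardware:

```cpp
#include <thread>

// Hypothetical late-bound choice of reduction strategy. Nothing is fixed at
// compile time: the core count is queried when the program runs, and the
// strategy follows from it.
enum class ReduceStrategy { Serial, TreeOfThreads };

ReduceStrategy pickStrategy(unsigned cores) {
    return cores > 1 ? ReduceStrategy::TreeOfThreads : ReduceStrategy::Serial;
}

ReduceStrategy pickStrategyForThisMachine() {
    unsigned cores = std::thread::hardware_concurrency();  // run-time query
    return pickStrategy(cores ? cores : 1);
}
```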
The threading and ISA efficiency layers are the keys to Ct’s forward scaling. We’ve got an article at intel.com/go/Ct that goes into more detail than I can go into here, but not nearly enough to do the topics justice…this is too rich and fast-evolving an area of work. In the next year or two, we’ll be looking at ways we can leverage these “efficiency layers” in other language runtimes and compilers.