C for Throughput Computing

One of the challenges of enabling parallel computing broadly is that there is (understandably) some inertia around migrating programming tools, build environments, and, generally, 100’s of thousands or millions of lines of code to new programming models or compilers (especially those of a proprietary nature). This was at the forefront of our minds when we started researching a programming model for Throughput Computing. More generally, there is little value to the software developer in designing a programming model or language from the bottom-up (from micro-architecture) or in an applications vacuum. (We’ll let the architectures evolve to efficiently support a useful programming abstraction, though we’re not too far away from that.) We came to the inescapable conclusion that, at least for the near term, any successful effort at building a new programming model had to satisfy the following requirements:

1. It has to work with the prevailing tools. This includes VC++, GCC, and ICC to start with, then all the libraries/APIs/frameworks that are already being used.

2. It has to be adoptable incrementally, without placing undue burden on the programmer to completely restructure their application. The typical tuning nightmare for parallelization or vectorization optimizations is the “transposing” of performance paths out of their natural locations, embedded within and across various object hierarchies.

3. It cannot force the programmer to fundamentally change the algorithmic and data-container idioms of their applications.

With this in mind, I’m going to introduce Ct, which is short for C (really, C++) for Throughput Computing. An overly facile view of Ct is that it is simply a parallel-ready version of the container types C++ programmers already use (e.g. valarray). However, even a cursory inspection will reveal that there is much more to Ct than this. In this blog, I’ll describe briefly what Ct is now and what it will be in the not-too-distant future. We know Ct isn’t perfect (is any programming model?), but we’re striving to make it as close as possible so your comments and criticisms are most welcome.

I’ll start with a somewhat apocryphal restating of the reasoning behind Ct’s design. Of the various flavors of parallelism, while it seemed obvious that we had to support multiple, some flavor of collection-oriented (or data-oriented) parallelism was going to comprise the vast majority of parallelized and vectorized compute cycles we’d encounter. Why? Well, in a putative architecture supporting hundreds/thousands of hardware threads and thousands of software threads, it seemed highly unlikely that programmers would be able to produce thousands of completely distinct threads manually. That is, while these threads may vary in minor control paths followed and in particular task dependences they might exhibit, their creation would almost certainly be driven by small and large data structures representing vectors, hierarchical data structures, hash tables, etc. Truly heterogeneous task parallelism of a fairly small cardinality will almost certainly exist in most applications, but I will describe how we support that later. Also, having a flexible, parallel collection-oriented building block for user-defined classes is just essential. A key aspect of the “Ct Manifesto” is determinism…which makes tasks a little trickier. So, we start from data structure or collections and leave tasks sub rosa … for the time being.

One of the justified criticisms of data-parallel programming models is that they focus on the very restrictive (but performance friendly) vector as the primitive parallel collection. [I’ll have to apologize in advance because I’m going to abuse the term vector shortly.] Defining parallel operations over vectors is useful when your data type fits nicely into a flat, contiguous segment of memory. For dense linear algebra, for example, this is a convenient abstraction. What if your collection is a sparse matrix, a tree, a graph, or a set of value-key associations? With a good deal of effort (and perhaps a dissertation) you can make these structures work. As a proof of concept, this is definitely achievable (though with some performance penalty, especially on very constrained architectures). The more important question is whether an easier path to these types can be exposed to the programmer.

Here’s a quick fly-through of what you can do in Ct (there is a slightly more detailed one pager at the Ct in One Page site, and even more detail available at the links at the bottom of that page):

The programmer can implement the type of data structures mentioned above using a TVEC, or a “throughput vector”. There’s the aforementioned abuse of the term “vector”, since TVECs can have all kinds of shapes. Currently, TVECs can be nested/segmented, flat, 2D, 3D, and indexed/sparse. It’s straightforward to add more specialized (e.g. an Image type) or generalized types (e.g. variable length indices). You can use a pretty complete set of pre-defined operators on TVECs (including element-wise, reduction, scan, and permutation operators), or define your own operators to map, reduce, or scan (i.e. map, fold, foldlist). This gives you the option to program in different styles of data parallelism in a single API. When you define your own operators, you’re programming in more of the kernel-style of data parallelism. With a few tweaks, this is a very nice model for image processing kernels of various flavors. If you use the built in operators, your basically using the vector-style of data parallelism. This model tends to work well for linear algebra and large matrix math. By the way, the shape of the TVEC has a profound effect on the underlying semantics of the operator. For example, a reduction on an indexed vector looks like a combining-send/multi-reduce/scatter-combine (there are probably more names for this).

So, what about tasks? If you looked at the implementation of Ct, it would become obvious that the Ct TVEC API is essentially a declarative way of specifying parallel tasks that are dependent on each other in a data-flow pattern. So, collections of Ct operations on TVECs become collections of task dependence graphs. But this is a bit too constraining for those who would like to get at tasking abstractions directly. The key question for us was how to introduce tasks without compromising the deterministic character of Ct? The programmer use futures (which is the underlying tasking building block in Ct and goes back to MultiLisp) to build up ad-hoc task dependence graphs, or by using this thing we’re calling hierarchical, synchronous tasks (HSTs). The basic idea is that we want tasks to be able to execute in parallel, but that their “effects” to be isolated from each other, at least for some time. Of course, some of these effects are useful, but we want them combined in some predictable (deterministic) way. It turns out there is an old idea (sadly, there always is) that has similar characteristics, called bulk synchronous processes. I’ll post more on tasks later, but the nice thing about this is that by extending the idiom a bit, we can easily support various other forms of structured parallelism, including pipeline and fork-join patterns…all with deterministic behaviors!

I’m going to be writing a lot more about Ct, related programming models, and our development process, and plans in the coming months…so stay tuned.

7 Responses to C for Throughput Computing

  1. mukesh says:

    Hello! Anwar Ghuloum
    I wold like to appreciate your memoth efforts in the right direction. I am a research scholar in India, and would like to develop some of the new ideas in Ct. May I know presently at what stage you are, and where I can be helpful?
    regards
    mukesh from India

  2. Anwar, I really like your posts, which show that you understand what you are doing, so please excuse my following criticism.
    Sadly, I have this gnawing feeling of a marketing-style presentation. Perhaps, Intel requires you to do this ‘sub rosa’ (in the meaning of ‘secret’ rather than ‘silent’) but a parallel language is not an architecture. The earlier you disclose the details, the better adoption you can achieve. In a nutshell, I’d really appreciate reading a POPL, PLDI or PPoPP format paper on Ct, detailing all the rationale and similarities to other languages.

  3. M says:

    HI.WHY NOT INSTALL WINXP SP2 ON DG33TLM?
    WHY DP35 & DQ35 NOT WORK WITH CPU 3 GHz FULL?

  4. Chang Li says:

    Ct is an implementation of vector by C++ template. It is good to have only one template. It will be good that the compiled codes can apply SSE instructions for vector processing.
    How could be Ct’s vector computing with variable length more efficient than GPU’s vector computing with fixed length?

  5. hemant says:

    Anwar, I don’t have much programming experience. But few months back I read about map-reduce and it seems you people are implementing similar kind of technology or architecture. But here your compiler will determine what to map and what to reduce.