How to Count Cores

One of the most abused terms today is “core count”. Depending on who you ask, a core might mean a full-fledged IA core (e.g. a Core 2), or it might mean something substantially less…like a small processing element with an ALU and local storage. Recently, I was guilty of this, too. In my defense, I was taking a decidedly software-oriented view of core count: what is the overall degree of parallelism I have to find in my application?

At Intel, we would probably define a core as something with the rough functionality of a full CPU core (e.g., one core of a Core 2 chip).

For my purposes, I had a slightly different definition in mind. I define it essentially as the SIMD ISA width (or effective SIMD width, if the micro-architecture has implicit constraints like branch coherence) times the per-core hardware thread count times the core count. (Note that I might further multiplex multiple lightweight software threads on each hardware thread to tolerate memory system latencies, but we’ll leave that out…for now.) So, consider the Core 2 Duo processors: each core has 1 hardware thread and a 4-wide (32-bit) SIMD ISA, and there are 2 cores. The out-of-order logic will take care of the latency hiding for me. So, to fully utilize these cores, I’m going to want to find 1x2x4 = 8-way parallelism. On a dual-socket system, double this to 16-way. Upgrade to quad core, and we’re talking about 32-way. Go crazy and move to a quad-socket system and we’re at 64-way! Now consider quadrupling both the per-core SIMD width to 16 and the number of hardware threads per core to 4: that multiplies the requirement by 16, so the same quad-socket system would call for 1024-way parallelism.

I suppose it used to be that I’d think in terms of a few threads running apps that had SIMD parallelism. For some apps this still works, using different ways of expressing the parallelism (like wrapping parallel loops around vectorizable inner loops)…consider an image-processing algorithm, like frequency coding for compression, that operates on small sub-blocks of the image or video frame. However, to get at the maximal scalability, I generally want to think about the parallelism as a whole. Even in the video-processing scenario, I’ll want to start considering the parallelism across the width, height, and timestamp of the frame. This is probably a better way to think when architecting for very scalable applications. In essence, think about all the parallelism exposed to the software layers.

Of course, there are other issues: explicit vector instructions (e.g., SSE) carry some atomicity guarantees; scalar operations can be performed in issue slots alongside the vector instructions; and the latency-hiding benefit of software multi-“thread”ing is significantly affected by how the cores access memory and how the caches work (e.g., whether they exist at all in the architecture). And so on.

I think it was relatively benign to use “core count” in this situation…however, from what I see reported about our and our competitors’ product roadmaps, it can get a little confusing.

Late note: this blog post and its accompanying presentation are spot-on: [http://gates381.blogspot.com/2008/08/how-to-count-to-800-comparing-nv-gtx.html]

2 Responses to How to Count Cores

  1. Steve Hochschild says:

    An important tactical issue that revolves around core count is pricing. A number of commercial software products have scaled their pricing to the core count of the server, and not having a standardized way to count is causing significant misunderstanding and confusion.

  2. Wade Ju says:

>>Each core has 1 thread per core, 4-wide (32-bit) SIMD ISA and there are 2 cores. The out-of-order logic will take care of the latency hiding for me. So, to fully utilize these cores, I’m going to want to find 1x2x4 = 8-way parallelism
    Say,
    pxor xmm0, xmm1
    pxor xmm2, xmm3
    pxor xmm4, xmm5
    pxor xmm6, xmm7
    My understanding is that these four instructions run sequentially. Is there any way to let them run in parallel without using more threads (or tasks)?
    Thanks,
    Wade