One of the most abused terms today is “core count”. Depending on who you ask, a core might mean a full-fledged IA Core (e.g. a Core 2), or it might mean something substantially less…like a small processing element with an ALU and local storage. Recently, I was guilty of this, too. In my defense, I was taking a decidedly software-oriented view of core count: what is the overall degree of parallelism I have to find in my application?
At Intel, we would probably define a core as having the rough functionality of a CPU core (including one core of these). For my purposes, I had a slightly different definition in mind. I define the degree of parallelism essentially as the SIMD ISA width (or effective SIMD width if the micro-architecture has implicit constraints, like branch coherence) times the per-core hardware thread count times the core count. (Note that I might further multiplex multiple lightweight software threads on each hardware thread to tolerate memory system latencies, but we’ll leave that out…for now.)
So, consider the Core2Duo processors: each core has 1 hardware thread and a 4-wide (32-bit) SIMD ISA, and there are 2 cores. The out-of-order logic will take care of the latency hiding for me. So, to fully utilize these cores, I’m going to want to find 1x2x4 = 8-way parallelism. On a dual socket system, double this to 16-way. Upgrade to quad core, and we’re talking about 32-way. Go crazy and move to a quad socket system and we’re at 64-way! Now consider quadrupling both the per-core SIMD width to 16 and the number of hardware threads per core to 4: that is another factor of 16, or 1024-way on that same quad socket, quad core system.
I suppose it used to be that I’d think in terms of a few threads running some apps that had SIMD parallelism. For some apps this still works, using different types of parallelism expression (like wrapping parallel loops around vectorizable inner loops)…consider an image processing algorithm like frequency coding for compression that operates on small sub-blocks of the image or video frame. However, to get at the maximal scalability, I generally want to think in terms of the parallelism as a whole. Even in the video processing scenario, I’ll want to start considering the parallelism across the width, height, and timestamp of the frame. This is probably a better way to think when architecting for very scalable applications. In essence, think about all the parallelism exposed to the software layers.
Of course, there are other issues: explicit vector instructions (e.g. SSE) have some benefits in terms of atomicity guarantees; scalar operations can be performed in issue slots parallel to the vector instructions; and the latency hiding benefits of software multi-“thread”ing are significantly impacted by how the cores access memory and how the caches work (e.g., whether they exist at all in the architecture). And so on.
I think it was relatively benign to use “core count” in this situation…however, from what I see reported on our and our competitors’ product roadmaps, it can get a little confusing.
Late note: the following blog and accompanying presentation are spot-on: http://gates381.blogspot.com/2008/08/how-to-count-to-800-comparing-nv-gtx.html
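To make the counting rule concrete, here is a minimal Python sketch of the arithmetic above: effective SIMD width times hardware threads per core times cores per socket times sockets. The `Machine` class, its field names, and the `wide_quad_socket` configuration are hypothetical names invented for this illustration; they do not describe any particular product, and the sketch ignores software thread multiplexing just as the text does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical system description, for illustration only."""
    simd_width: int        # effective SIMD lanes (32-bit elements)
    threads_per_core: int  # hardware threads per core
    cores_per_socket: int
    sockets: int = 1

    def parallelism(self) -> int:
        # Degree of parallelism the software has to find:
        # SIMD width x HW threads per core x cores x sockets.
        return (self.simd_width * self.threads_per_core
                * self.cores_per_socket * self.sockets)


# Configurations walked through in the text.
core2duo = Machine(simd_width=4, threads_per_core=1, cores_per_socket=2)
dual_socket = Machine(4, 1, 2, sockets=2)
quad_core_dual_socket = Machine(4, 1, 4, sockets=2)
quad_core_quad_socket = Machine(4, 1, 4, sockets=4)
# Assumed follow-on: quadruple both SIMD width and HW threads per core.
wide_quad_socket = Machine(16, 4, 4, sockets=4)

for name, m in [
    ("Core2Duo", core2duo),
    ("dual socket", dual_socket),
    ("quad core, dual socket", quad_core_dual_socket),
    ("quad core, quad socket", quad_core_quad_socket),
    ("16-wide SIMD, 4 threads/core", wide_quad_socket),
]:
    print(f"{name}: {m.parallelism()}-way")
```

The point of folding everything into one product is the same as in the text: from the software side, it matters less which hardware level supplies the parallelism than how much of it the application has to expose in total.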
2 Responses to How to Count Cores
An important tactical issue that revolves around core count is pricing. A number of commercial software products have scaled their pricing to the core count of the server, and not having a standardized way to count is causing significant misunderstanding and confusion.
>>Each core has 1 thread per core, 4-wide (32-bit) SIMD ISA and there are 2 cores. The out-of-order logic will take care of the latency hiding for me. So, to fully utilize the these cores, I’m going to want to find 1x2x4 = 8-way parallelism
Say,
pxor xmm0, xmm1
pxor xmm2, xmm3
pxor xmm4, xmm5
pxor xmm6, xmm7
My understanding is that the four instructions run sequentially. Is there any way to let them run in parallel without using more threads (or tasks)?
Thanks,
Wade