One of the most abused terms today is “core count”. Depending on who you ask, a core might mean a full-fledged IA core (e.g., a Core 2), or it might mean something substantially less, like a small processing element with an ALU and local storage. Recently, I was guilty of this, too. In my defense, I was taking a decidedly software-oriented view of core count: what is the overall degree of parallelism I have to find in my application?

At Intel, we would probably define a core as having the rough functionality of a CPU core (e.g., one core of a Core 2). For my purposes, I had a slightly different definition in mind: essentially the SIMD ISA width (or effective SIMD width, if the micro-architecture has implicit constraints like branch coherence) times the per-core hardware thread count times the core count. (Note that I might further multiplex multiple lightweight software threads on each hardware thread to tolerate memory system latencies, but we’ll leave that out…for now.)

So, consider the Core 2 Duo processors: each core has one hardware thread and a 4-wide (32-bit) SIMD ISA, and there are two cores. The out-of-order logic will take care of the latency hiding for me. So, to fully utilize these cores, I’m going to want to find 1x2x4 = 8-way parallelism. On a dual-socket system, double this to 16-way. Upgrade to quad core, and we’re talking about 32-way. Go crazy and move to a quad-socket system and we’re at 64-way! Now consider quadrupling both the per-core SIMD width to 16 and the number of hardware threads per core to 4.

I suppose it used to be that I’d think in terms of a few threads running some apps that had SIMD parallelism. For some apps this still works, using different types of parallelism expression (like wrapping parallel loops around vectorizable inner loops). Consider an image processing algorithm, like frequency coding for compression, that operates on small sub-blocks of the image or video frame.
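The arithmetic above can be sketched as a tiny helper. The function name is my own invention; the machine configurations are the ones from the text.

```python
def degree_of_parallelism(sockets, cores_per_socket, threads_per_core, simd_width):
    """Software-visible parallelism: SIMD width x HW threads x total cores."""
    return sockets * cores_per_socket * threads_per_core * simd_width

# Core 2 Duo: 1 socket x 2 cores x 1 thread x 4-wide SIMD
print(degree_of_parallelism(1, 2, 1, 4))    # 8-way
print(degree_of_parallelism(2, 2, 1, 4))    # dual socket: 16-way
print(degree_of_parallelism(2, 4, 1, 4))    # dual-socket quad-core: 32-way
print(degree_of_parallelism(4, 4, 1, 4))    # quad-socket quad-core: 64-way
# Quadruple the SIMD width to 16 and the HW threads per core to 4:
print(degree_of_parallelism(4, 4, 4, 16))   # 1024-way
```

The point of writing it out this way is that every factor multiplies: improving any one dimension raises the amount of parallelism the software has to expose.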
However, to get at the maximal scalability, I generally want to think in terms of the parallelism as a whole. Even in the video processing scenario, I’ll want to start considering the parallelism across the width, height, and timestamp of the frame. This is probably a better way to think when architecting for very scalable applications. In essence, think about all the parallelism exposed to the software layers.

Of course, there are other issues: explicit vector instructions (e.g., SSE) have some benefits in terms of atomicity guarantees; scalar operations can issue in parallel slots alongside the vector instructions; and the latency-hiding benefits of software multi-”thread”ing are significantly impacted by how the cores access memory and how the caches work (e.g., whether they exist at all in the architecture). And so on.

I think it was relatively benign to use “core count” in this situation…however, from what I see reported on our and our competitors’ product roadmaps, it can get a little confusing.

Late note: the following blog post and accompanying presentation are spot-on: http://gates381.blogspot.com/2008/08/how-to-count-to-800-comparing-nv-gtx.html
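One way to picture “parallelism as a whole” for the video case is to flatten frame, row, and column into a single pool of work items, rather than threading over rows and vectorizing over columns separately. A minimal sketch, with assumed frame dimensions and a hypothetical index-mapping helper:

```python
# Assumed workload: 8 frames of 1080x1920 video, every pixel independent.
FRAMES, HEIGHT, WIDTH = 8, 1080, 1920
TOTAL_WORK_ITEMS = FRAMES * HEIGHT * WIDTH  # all parallelism exposed at once

def index_to_coords(i, height=HEIGHT, width=WIDTH):
    """Map a flat parallel work-item index back to (frame, y, x)."""
    frame, rem = divmod(i, height * width)
    y, x = divmod(rem, width)
    return frame, y, x

# The runtime is then free to carve TOTAL_WORK_ITEMS across sockets, cores,
# HW threads, and SIMD lanes however the machine's parallelism dictates.
print(index_to_coords(0))                    # (0, 0, 0)
print(index_to_coords(TOTAL_WORK_ITEMS - 1)) # (7, 1079, 1919)
```

The design choice here is that the decomposition is decoupled from any one machine’s core/thread/SIMD mix, which is exactly what lets the same code scale from 8-way to 1024-way hardware.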