Intel® Xeon® Phi™ coprocessors accelerate the pace of discovery and innovation

Discovery and the supercomputers behind discoveries are fascinating to me. I remember visiting NASA’s central computing facility for the first time and seeing their Cray supercomputer. I was amazed at the sheer number of wires needed to connect the systems, the cool cooling innovation (no pun intended), and even the sleek cabinets. Everything about it was just so much grander than the little workstation on my desk. Moreover, when they told me what kinds of applications they were running and the discoveries they were making, I was in awe. And I was also… in love. With supercomputers. 

Innovation drives productivity and is the foundation of economic growth. For the past 25 or so years, both research and innovation have been driven by massive amounts of computing power in the form of personal workstations and clusters of servers called supercomputers. It’s allowed scientists to unravel the mysteries of the human genome, driving the economics of sequencing down to the level where you can now affordably get your own genome sequenced. On a broader scale, it’s allowed countless engineers to use digital prototyping, shaving cost and time-to-market of products in virtually every industry. Looking to the future, with the continuum of “data-information-knowledge” becoming an essential tool for digitally enabled knowledge economies around the world, the applications and need for supercomputing make the past look like the proverbial tip of the iceberg. 

Today, with the announcement of Intel® Xeon® Phi™ coprocessors, we’re going to accelerate the pace of these discoveries and innovations. Intel® Xeon Phi™ products extend the Intel® Xeon® brand – found in over 70% of the world’s supercomputers (see TOP500) – by providing the programmability of the Intel Xeon processor architecture to an emerging class of highly parallel applications that benefit from processors containing a large number of cores and threads. Lots of technical talk there, but let me put this into human terms. While the vast majority of software applications are best suited to Intel® Xeon® processors, these highly parallel applications benefit from having many mathematical calculations performed at once. For example, if you’re trying to accurately track a weather storm, the better you can predict the movement of each molecule of the storm in relation to every other molecule, the more accurate the prediction. This is what we call a “highly parallel application”. 

For the past 2 years we’ve been working with software developers from leading research institutes and private companies around the world on early silicon of Intel Xeon Phi coprocessors to accelerate their highly parallel applications. As I look at the type of research they conduct, it stirs the imagination and makes me wonder “What if?” What if these institutes and companies had 1000x the computing power they have today? What could they do with that amount of computing power? Could scientists at the DEEP Exascale project in Europe, who work on brain simulation, find a breakthrough discovery that cures Alzheimer’s Disease? Could Fraunhofer Institute researchers – who are doing full, high-definition, photorealistic rendering of objects – make physical prototyping obsolete? Could the cellular research that the Texas Advanced Computing Center (TACC) conducts lead to sustainable food production that’s free of disease? These breakthroughs are inevitable. Our mission is to deliver the computing power to make them happen faster.  

Last November, we demonstrated our first silicon of the Intel Xeon Phi coprocessor, code-named “Knights Corner”. It produced an astounding teraflop of performance in a processor the size of your thumb, putting the industry on notice about the potential of many-core architectures and providing a clear path to the Petascale and Exascale era. This is the same amount of performance as the number 1 supercomputer on the TOP500 list in 1997, dubbed ASCI Red. ASCI Red used thousands of processors and filled a room with cabinets to produce that same performance. Knights Corner quickly earned the nickname “Supercomputer on a Chip”. 

But any computer is a tool. And a tool is useless unless the user knows how to use it and unless the millions of lines of software code that institutes and companies have developed over the past 25+ years run on it. That’s the programmability aspect of the Intel Xeon Phi product family. It’s Intel Architecture. The codes that run on Intel Xeon processors today will run on Intel Xeon Phi products, carrying forward years of software development. 

At the International Supercomputing Conference kicking off today in Hamburg, Germany, the first supercomputer with the Intel Xeon Phi coprocessor made the TOP500 list. It’s aptly named “Discovery”. And we’ll be in production with Intel Xeon Phi coprocessors later this year. As we move into 2013 and beyond, we’ll see it in the hands of researchers, scientists, and engineers around the world. And the pace of discovery and innovation will be greatly accelerated.

32 Responses to Intel® Xeon® Phi™ coprocessors accelerate the pace of discovery and innovation

  1. James Davis says:

    I am part of the SBIR program, and I need to buy one of these for my LASER Academy, Inc. business. Please send me info on how I can get one.

    • Joe Curley says:

      James – we’ve announced the Intel(r) Xeon(r) Phi(tm) brand, but have not yet launched a product. The good news: Intel has said these processors will be in production before the end of the year. So stay tuned!


  2. Nick says:

    Only 1 teraflop? If I understand correctly, a Haswell 8-core CPU running at 3.9 GHz would offer 1 teraflop of single-precision floating-point performance, and be more efficient (including TSX technology). Or is it double-precision for Xeon Phi?

    • Joe Curley says:

      Sorry… this was meant for you, Nick. It’s a DP Teraflop, and it was first shown before the end of 2011. Over time, Xeon keeps getting better and faster, and so will Xeon Phi. Xeon maximizes the performance of each compute thread while scaling cores and adding SIMD for parallelism. Xeon Phi is optimized for higher levels of parallel power/performance at the expense of per-thread performance. But here’s what’s cool: the SW techniques used to extract the best performance from Xeon Phi also benefit Xeon. So, customers can choose the right Intel processor for their application.

  3. troy says:

    Intel, you are competing against Nvidia’s Kepler, which goes big with its 1536 CUDA cores. Right now I see Intel’s Phi has only 50 cores, come on, get real!!! Let me put it this way: Nvidia and AMD will make fun of Intel IF Intel does not go all out!!! Show US why Intel is better than those other chip builders. Nvidia has a 4.58 TFLOPS single-precision chip right now; Intel should surpass that speed if you don’t want Nvidia and ATI to take the chip market!!

    • Adam S says:

      Number of cores is irrelevant, because a CUDA core is much, much simpler than an x86 core. NVidia may have 4.58 TF, but on a completely different architecture. HPC is about more than just the number of FLOPS you can do.

    • Joe Curley says:

      First off, performance matters, not cores. But trying to answer the question: there’s an old joke about a guy who’s importing guinea pigs, and when he sees thousands of dollars of tariff, he asks customs, “Why?!?!?” The customs agent says it’s a five-hundred-dollar duty to bring a pig into the country. The importer says… “but these are GUINEA pigs… not pigs.” The bureaucrat just shakes his head and says, “pigs is pigs.” To avoid making that mistake: cores ain’t cores. The Fermi graphics chip has 16 32-wide SIMD units, but they call each shader a core. Normalizing out the marketing, the Knights Ferry SDP had 32 ‘cores’ with 16-wide SIMD, and we’d say the graphics chip had 16 ‘cores’ with 32-wide SIMD. The compute power is the same, modulo any differences in clock.

  4. Virendra Yadav says:

    I would like to know if this would require specific programming / OS to crunch data, like Nvidia uses CUDA, or will it simply take over anything that’s burdening the CPU?

    • Joe Curley says:

      Knights Corner, the first Intel(r) Xeon(r) Phi(tm) product, uses the same programming models as the Xeon CPU. That includes high-level programming languages like C, C++, and Fortran, and standards like MPI and OpenMP. But you correctly point out Knights Corner is not an accelerator. It’s actually an SMP machine on a chip, running a Linux OS. So, developers could SSH into the card and run an application, or, through features in the compiler, run a program on the host Xeon and offload a parallel phase of the program to the Intel Xeon Phi coprocessor. It is incredibly flexible and programmable, unlike accelerators.

    • Joe Curley says:

      I haven’t seen a highly parallel version of World of Warcraft yet, so for the moment I’d recommend a great Core i7 processor in the meantime. :-)

    • Brad Rubin says:

      You made me laugh. You can run WOW on integrated graphics; the Ivy Bridge on-chip graphics are overkill :) .

  5. Joseph Pingenot says:

    > Intel you are competing against Nvidia’s Kepler goes big with its 1536 CUDA cores. Right now I see Intel’s Phi has only 50 cores, come on get real!!!

    It should be noted that a CUDA core is not anything like a Knights Corner core. Each Knights Corner core is quad-threaded with 512-bit SIMD units, so it’s (extremely roughly) comparable to 1600+ CUDA cores.

    The SIMD lanes were remapped by the video card companies to correspond to e.g. threads in a warp. Thus the talk of “lanes” in both worlds. Interestingly, Intel has recently turned their compiler sideways to roughly correspond to this model. See e.g. http://llvm.org/pubs/2012-05-13-InPar-ispc.html and Matt Walsh’s work (I think).

    Of course, this simple back-of-the-envelope cores calculation ignores a lot, such as memory access and contention issues, the nature of the cores themselves, the blocks + threads CUDA maps, etc. But the core point is completely valid: it’s not 50 cores vs 1500 cores.

  6. Dariusz says:

    Heya

    I’m quite interested in getting this unit. I work as a CGI artist, and having a render farm with 100 GHz+ of calculating power in the size of one PCIe slot would be very impressive.

    Also, if I can share memory across cards, then 3x 8 GB = 24 GB, which is the minimum I need for work. If this is at least as fast as a GTX 680 in GPU computing, then I have a lot of applications for this unit. Assuming existing CPU code would be able to run on it with a good scaling ratio…

    Fingers crossed Intel can deliver on memory sharing, good scalability, and of course an affordable price for freelancers…

    Thanks, bye

  7. Troy says:

    As you know, Knights Corner has an internal network; those links should extend outside the chip. I would like to see the internal network linked to other Phi chips so that the outside cores look the same as the internal cores. If each Phi chip had Intel’s Thunderbolt built in, multitasking DMA (direct memory access) transfers to the outer cores could take place faster. Intel is a CPU chip builder, and a Thunderbolt network could link more CPUs into one big cluster; think of the Thunderbolt network replacing PCI cards. If a user wants more speed, just buy a chip (not a card), plug it into the Phi Thunderbolt network, and watch your home computer turn into a supercomputer.

  8. flakefrost says:

    It’s great for all the Xeon clusters out there with PCIe slots. Evidence suggests ARM clusters are the way of the future. How will Intel respond? Maybe there could be a very low power Atom SoC in the future? I would like to see a powerful cluster full of Atom chips! Go Intel!

  9. Ty says:

    Very interesting stuff. I too would like some information about programming the device. If coding is more accessible than on other technologies, then it may be just what I am looking for. Can this be programmed in C# or Java? Also, is this a product that can be purchased? I need a workstation-class device, so 2 GB of RAM is fine; even a dev board is fine. btw Raj doesn’t have his information in the meet-the-bloggers section, and no one at Intel seems to be reading/responding to the comments on this page.

  10. Mike says:

    NVidia cores and Intel cores are not the same thing. I like to think of it this way: you can get 10,000 Estes C rocket engines and say it is more engines than a single Saturn V booster, but given the choice, only one of those is going to get you to the moon. Phi is an optimized cluster on a single piece of silicon; Kepler is an array of “overglorified pipelines”. Some useful ways to compare chips are:

    1) Transistor count [assuming that an Nvidia 64nm single-gate transistor is the same as Intel's 22nm tri-gate and Nvidia is as good at using its transistors for heavy computing as Intel].
    2) What is the largest non-sparse, invertible matrix that the processor can invert in a fixed time?
    3) TCO, including how well it manages heat and how well its power can be controlled [those are the huge costs in the data center, not the initial purchase price].
    4) Development ecosystem already in place [if Blizzard isn't going to develop for it, then you won't play WoW on it, but if they can flip a switch in their compiler and use their existing code base, then you might be able to have some awesome gameplay; ARM doesn't have this].
    5) How well the architecture is going to be supported for the life of the software products and/or compute services.

    You really should account for how parallel processes can interact in a benchmark, and something like raw, single-processor clock speed doesn’t capture that. Sometimes it is important to consider products and ratios of individual metrics (pseudo-Buckingham Pi theorem) in order to get measures that adequately inform decisions.

  11. mroberts says:

    What the world really needs is something like this that will run on existing end-user hardware. As a computer artist, the economy has been extremely brutal on me because almost no one is buying art. An affordable hardware upgrade that works with my i5-2500 is what I need. However, it looks like this will be tied to Xeon-only systems and will be priced WAY out of the range of the average person.

  12. Steven Shack says:

    This might be a bit early to ask. But is there any news of Mathematica on Phi? Development/investigation time is an important factor for me, so CUDA/OpenCL aren’t really viable options.

  13. Farbod says:

    +1 for what Dariusz said. I’m a VFX artist and very interested in Xeon Phi for 3D rendering and simulation. 8 GB of memory is too low for a lot of applications. I hope Intel releases a card with more memory (16 GB or 32 GB)… I don’t think multiple cards can share memory between each other.

  14. Brad Rubin says:

    I read a very interesting article on air-cooling solutions recently. The tech was developed by Sandia National Labs (just up the street from here in Rio Rancho :) ) and would be very cool – no pun intended – to integrate into the Xeon Phi customer-side product packaging. It would certainly set Intel apart even further!

    Article link:

    https://share.sandia.gov/news/resources/news_releases/cooler/

  15. aussiebear says:

    I’m not sure if any Intel reps will see this, but I’ll ask this question any way…

    Will there be a cost-friendly, consumer-oriented version of the Xeon Phi?
    (For enthusiasts, DIY’ers, students, etc.)

    Maybe market it as “Core Phi” ? :)

  16. bank kus says:

    I’m a bit unclear about the programmability, particularly:
    [a] Do we have cache coherence between CPU memory and coprocessor memory, or are explicit memory copies required?

    [b] If explicit copies are required, how are they exposed to the programmer? Does the compiler do it for you, or does the programmer have to do it?

    [c] I see talk of a “flip of a compiler switch”. Can somebody elaborate on what that means? Will the compiler take an x86_64 dusty-deck program, silently offload some work to the coprocessor, and bring the results back with good memory pipelining?

    Or is the argument that the coprocessor is a full-blown SMP with its own OS, so why not just ssh into it and run your parallel programs on it? Serial parts will also work, albeit slower. Why on earth would you want to ping-pong between the main core and the coprocessor?

    If this is the argument, then this whole exercise makes sense. If not, it’s not clear to me why someone would want to shift their CUDA apps to this.

    Thanks
    banks