We taught a one day tutorial at Supercomputing 2013 (in Denver) on Sunday November 17, 2013 based on the principles in the book. The presentation material we used is available here. Also, now during… Read more
RECENT BLOG POSTS
I’m excited by our announcement today of Intel® Parallel Computing Centers. The first five centers will be located at CINECA, Purdue University, Texas Advanced Computing Center at the University of… Read more >
Intel® Processor Trace (Intel® PT) is an exciting new feature coming in future processors that can be enormously helpful in debugging because it will expose an accurate and detailed trace of activity, with triggering and filtering capabilities to help isolate the tracing that matters. We released specifications recently, and now a library is available to enable tool development, along with a talk this week on the work to make these capabilities available in Linux. Tool and operating system developers now have both the specifications and our library to build on.
We have released a library, along with sample tools, to enable use of Intel® Processor Trace (Intel® PT). It is called the “Processor Trace Decoder Library” and is available as a free download. I can tell you a little about this project, and I will also explain Intel PT to motivate the decoding capabilities of the Processor Trace Decoder Library.
The project itself can support any operating system that is itself enabled for Intel PT. Intel PT is presented as a performance event, so support in an operating system is easy to detect by checking whether that event is available to configure and use. Changes for Linux have been worked on; the status of some Linux work was presented this week (presentation available at linuxfoundation-PT for LinuxCon.pdf). In time, we expect other operating systems, including Windows and OS X, to include support for Intel PT too, and our Processor Trace Decoder Library is ready for that: the decoder library has already been verified to build on Linux, Windows and OS X.
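As a rough sketch of the "check whether the event is available" idea, the snippet below looks for a performance-event source named `intel_pt`. Both the sysfs directory and the event name are assumptions based on how Linux generally lists its performance-monitoring sources; neither comes from the text above, and other operating systems will expose this differently.

```python
import os

# Assumed directory where Linux lists performance-event sources (not from the article).
EVENT_SOURCE_DIR = "/sys/bus/event_source/devices"


def has_intel_pt(event_source_dir=EVENT_SOURCE_DIR, name="intel_pt"):
    """Return True if an event source called `name` is listed in the directory."""
    return os.path.isdir(os.path.join(event_source_dir, name))


if __name__ == "__main__":
    print("Intel PT event available:", has_intel_pt())
```

Taking the directory as a parameter keeps the check easy to try against a scratch directory on machines without the feature.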
The project for Processor Trace Decoder Library contains a library for decoding Intel PT together with sample implementations of simple tools built on top of the library that show how to use the library in your own tool. The following are included in the download:
- libipt: A packet encoder/decoder library plus a document describing the usage of the decoder library.
- Optional Contents and Samples:
- ptdump: Example implementation of a packet dumper.
- ptxed: Example implementation of a trace disassembler.
- pttc: A trace test generator.
- script: A collection of scripts.
Intel recently released details about Intel Processor Trace in the latest Intel® Architecture Instruction Set Extensions Programming Reference as Chapter 11. Intel Processor Trace is a low-overhead execution tracing feature that will be supported by some processors in the future. It works by capturing information about software execution on each hardware thread using dedicated hardware facilities, so that after execution completes, software can process the captured trace data and reconstruct the exact program flow. Intel PT is not free with respect to execution overhead, but the overhead is low enough that it should work well in production builds for most applications.
The captured information is collected in data packets. The first implementation of Intel PT offers control flow tracing, which includes in these packets timing and program flow information (e.g. branch targets, branch taken/not taken indications) and program-induced mode related information (e.g., Intel TSX state transitions, CR3 changes). These packets may be buffered internally before being sent to the memory subsystem.
Why is this useful?
Intel PT provides the context around all kinds of events. Performance profilers can use PT to discover the root causes of ‘response-time’ issues – performance issues which affect the quality of execution, if not the overall runtime. For example, using PT, video application developers can explore, in very fine detail, the execution of problematic individual frames, something not generally possible with more traditional sampling-based collection.
Furthermore, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific backedges and loop tripcounts, is easy to extract and report.
Debuggers can use it to reconstruct the code flow that led to the current location, whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call we just stepped over. They may even allow navigating in the recorded execution history via reverse stepping commands.
Another important use case is debugging stack corruptions. When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results. Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.
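To make the stack-reconstruction idea concrete, here is a toy sketch that replays CALL and RET events to rebuild the active call chain without touching the (possibly corrupted) stack memory. The event tuples and function names are hypothetical; a real tool would obtain the equivalent information from a trace decoder.

```python
def rebuild_call_stack(events):
    """Replay ('call', target) / ('ret', None) events and return the live call chain."""
    stack = []
    for kind, target in events:
        if kind == "call":
            stack.append(target)
        elif kind == "ret" and stack:
            stack.pop()
    return stack


# Hypothetical control-flow events, oldest first.
events = [
    ("call", "main"),
    ("call", "parse"),
    ("ret", None),
    ("call", "render"),
    ("call", "blend"),
]
print(rebuild_call_stack(events))  # ['main', 'render', 'blend']
```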
Operating systems could include Intel PT traces in core files. This would allow debuggers not only to inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally, so that when an operating system crash occurs the trace can be saved as part of an operating system crash dump mechanism and then used later to reconstruct the failure.
Intel PT can also help to narrow down data races in multi-threaded operating system and user program code. It can log the execution of all threads with a rough time indication. While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.
Trace Buffer Management
The trace data can be collected into operating system provided circular buffers. To simplify memory management and to make it easier for the operating system to find a suitably large piece of memory, the buffer need not be contiguous.
The logical buffer consists of a collection of memory pages and a control structure that describes the page layout. The operating system may configure Intel PT to generate an interrupt when any of the sections is near full.
This enables a variety of different use cases:
- a single circular buffer
- a single buffer with copy-out
- a single buffer with copy-out section by section
While Intel PT generates too much data to store a long-running execution trace on disk, shorter snippets can be saved.
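The buffer layout described above can be modeled in a few lines. This is only a software toy under stated assumptions: the class name, the callback (standing in for the interrupt), and the choice to fire it when a page is completely full rather than "near full" are all illustrative, not how the hardware or any operating system implements it.

```python
class PagedTraceBuffer:
    """Toy circular buffer built from separately allocated (non-contiguous) pages."""

    def __init__(self, num_pages, page_size, on_page_full=None):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.page_size = page_size
        self.pos = 0  # logical write offset; wraps around the whole buffer
        self.on_page_full = on_page_full  # stands in for the interrupt

    def write(self, data):
        total = self.page_size * len(self.pages)
        for byte in data:
            page, offset = divmod(self.pos, self.page_size)
            self.pages[page][offset] = byte
            self.pos = (self.pos + 1) % total
            if offset == self.page_size - 1 and self.on_page_full:
                # Copy-out, section by section.
                self.on_page_full(page, bytes(self.pages[page]))


saved = []
buf = PagedTraceBuffer(num_pages=2, page_size=4,
                       on_page_full=lambda page, data: saved.append((page, data)))
buf.write(b"ABCDEFGHIJ")
print(saved)  # [(0, b'ABCD'), (1, b'EFGH')]
```

Without a callback the same object behaves as a plain circular buffer that keeps only the most recent data.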
Execution flow reconstruction
Intel PT uses a compact format to store the execution trace. It omits everything that can be deduced directly from the code or from previous trace.
You can compare this with a brief list of instructions for navigating a maze. As long as the way is obvious, you simply follow the twists and turns of the maze. When you come to a junction you need to know whether to turn left or right. In order to navigate the maze, all you really need is a short list of left or right directions. Similar to that, Intel PT uses a single bit to indicate whether a conditional branch has been taken or not taken. Unconditional jumps and linear code are not represented in the trace, at all.
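The maze analogy can be sketched directly. In the toy program below, only conditional branches consume a bit from the trace; unconditional jumps are followed from the code alone. The block names and the dictionary layout are made up for illustration, and a real decoder works on machine instructions rather than named blocks.

```python
# Hypothetical program: each block ends in a conditional branch,
# an unconditional jump, or a stop.
PROGRAM = {
    "entry": ("jump", "loop"),
    "loop": ("cond", "body", "exit"),  # taken -> body, not taken -> exit
    "body": ("jump", "loop"),
    "exit": ("stop",),
}


def reconstruct(program, start, tnt_bits):
    """Follow the program, consuming one taken/not-taken bit per conditional branch."""
    bits = iter(tnt_bits)
    path, block = [], start
    while True:
        path.append(block)
        kind, *targets = program[block]
        if kind == "stop":
            return path
        if kind == "jump":
            block = targets[0]  # nothing needed from the trace
            continue
        try:
            taken = next(bits)
        except StopIteration:
            return path  # the trace ends here
        block = targets[0] if taken else targets[1]


print(reconstruct(PROGRAM, "entry", [1, 1, 0]))
# ['entry', 'loop', 'body', 'loop', 'body', 'loop', 'exit']
```

Three bits are enough to describe seven steps of execution, which is the point of the compact format.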
The PT trace consists of a sequence of packets (which come in different types). To represent a selection of conditional branches, for example, Intel PT uses the TNT packet, which comes in two different sizes: 8 bits and 64 bits. For reconstructing execution flow, there are a few more things to consider, such as indirect branches, function returns, and interrupts. To model these, Intel PT adds more packets, like TIP for indirect branches and function returns, and FUP for asynchronous event locations. An interrupt will then be represented as a FUP followed by a TIP, giving the source and destination of the asynchronous branch, respectively. Intel PT also gives information about transactional synchronization. Whenever a transaction is started, committed, or aborted, Intel PT will generate two packets: a MODE.TSX packet giving the new transactional state, and a FUP packet giving the code location at which the new state is effective. For a transaction abort, an additional TIP packet will be generated giving the location of the corresponding abort handler.
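The packet combinations described above can be read as a small pattern-matching exercise. The sketch below assumes packets have already been parsed into hypothetical `(kind, value)` tuples; the real packets are binary and carry more detail, so treat the tuple layout, the state strings, and the addresses as illustrative only.

```python
def describe(packets):
    """Turn a list of (kind, value) packets into human-readable events."""
    events, i = [], 0
    while i < len(packets):
        kind, value = packets[i]
        nxt = packets[i + 1] if i + 1 < len(packets) else (None, None)
        if kind == "MODE.TSX" and nxt[0] == "FUP":
            event = f"tsx state {value} at {nxt[1]:#x}"
            i += 2
            # An abort is followed by a TIP giving the abort handler.
            if value == "abort" and i < len(packets) and packets[i][0] == "TIP":
                event += f", abort handler at {packets[i][1]:#x}"
                i += 1
            events.append(event)
        elif kind == "FUP" and nxt[0] == "TIP":
            # Source (FUP) and destination (TIP) of an asynchronous branch.
            events.append(f"async branch {value:#x} -> {nxt[1]:#x}")
            i += 2
        elif kind == "TIP":
            events.append(f"indirect branch or return to {value:#x}")
            i += 1
        elif kind == "TNT":
            events.append(f"{len(value)} conditional branch outcomes")
            i += 1
        else:
            events.append(f"unhandled {kind}")
            i += 1
    return events


packets = [
    ("TNT", [1, 0, 1]),
    ("TIP", 0x401000),
    ("FUP", 0x401020), ("TIP", 0xFFFF8000),                       # interrupt
    ("MODE.TSX", "abort"), ("FUP", 0x401050), ("TIP", 0x401100),  # transaction abort
]
for line in describe(packets):
    print(line)
```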
Please refer to the specification (Chapter 11 of the Intel® Architecture Instruction Set Extensions Programming Reference) for a full list of supported packets.
In order to reconstruct the execution flow, a decoder therefore needs to decode the instructions in the traced executable or library as well as the PT trace packets. To handle dynamic libraries, the decoder also needs to consider sideband information provided by the operating system.
Intel provides an open source reference implementation for decoding PT packets and for reconstructing the execution flow. The Processor Trace Decoder Library (a collection of tools and libraries to enable use of Intel® Processor Trace) is available as a free download. Intel is currently working to help enable support in GDB, the GNU* debugger. Additional integration with other tools is being considered as well.
Intel provides a low-overhead tracing feature that allows recording the execution flow and reconstructing it at a later time. This feature has applications for functional as well as for performance debugging.
We taught a class on “Multithreading and VFX” on July 24 at SIGGRAPH 2013.
All course notes are now online at http://www.multithreadingandvfx.org/course_notes/ – useful even if you were not there!
Wonderful group of presenters to work with (in order of presentation in our class):
- James Reinders, Intel
- George ElKoura, Pixar Animation Studios
- Martin Watt, Dreamworks Animation
- Erwin Coumans, AMD
- Ron Henderson, Dreamworks Animation
- Jeff Lait, Side Effects Software
The web site http://www.multithreadingandvfx.org is worth a look!
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
The latest Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. These instructions represent a significant leap to 512-bit SIMD support. Programs can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction and four times that of SSE.
Intel AVX-512 instructions are important because they offer higher performance for the most demanding computational tasks. Intel AVX-512 instructions offer the highest degree of compiler support by including an unprecedented level of richness in the design of the instructions. Intel AVX-512 features include 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high speed math instructions, compact representation of large displacement value, and the ability to have optional capabilities beyond the foundational capabilities. It is interesting to note that the 32 ZMM registers represent 2K of register space!
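The element counts and the register-space figure quoted above follow from simple arithmetic, which this short sketch checks. The dictionary and function names are just for illustration.

```python
VECTOR_BITS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}


def lanes(vector_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    return vector_bits // element_bits


for name, bits in VECTOR_BITS.items():
    print(f"{name:9} {lanes(bits, 64):2} x 64-bit, {lanes(bits, 32):2} x 32-bit")

# 32 ZMM registers, each 512 bits (64 bytes) wide.
zmm_register_file_bytes = 32 * 512 // 8
print(zmm_register_file_bytes)  # 2048
```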
Intel AVX-512 offers a level of compatibility with AVX that is stronger than prior transitions to new widths for SIMD operations. Unlike SSE and AVX that cannot be mixed without performance penalties, the mixing of AVX and Intel AVX-512 instructions is supported without penalty. AVX registers YMM0–YMM15 map into the Intel AVX-512 registers ZMM0–ZMM15, very much like SSE registers map into AVX registers. Therefore, in processors with Intel AVX-512 support, AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers.
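One way to picture the register mapping is to model a ZMM register as a 512-bit integer and treat the YMM and XMM registers as views of its low bits. This is only a mental model for reading the registers; the function names are made up, and the sketch says nothing about what happens to the upper bits on writes.

```python
def ymm_view(zmm_value):
    """Low 256 bits of a 512-bit register value."""
    return zmm_value & ((1 << 256) - 1)


def xmm_view(zmm_value):
    """Low 128 bits of a 512-bit register value."""
    return zmm_value & ((1 << 128) - 1)


# Distinct markers in the top half, the upper part of the low half, and the low 128 bits.
zmm0 = (0xAAAA << 256) | (0xBBBB << 128) | 0xCCCC
print(hex(xmm_view(zmm0)))  # 0xcccc
print(ymm_view(zmm0) == (0xBBBB << 128) | 0xCCCC)  # True
```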
The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.
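The 8X figure is just the three 2X steps compounded, starting from the SSE4.2 generation as the baseline. A small sketch of that arithmetic, with labels paraphrased from the paragraph above:

```python
# Each entry is (step, peak FLOP/sec factor over the previous step).
steps = [
    ("AVX 1.0 (Sandy Bridge)", 2),
    ("AVX 2.0 with FMA (Haswell)", 2),
    ("Intel AVX-512", 2),
]

cumulative = 1  # SSE4.2 baseline
for name, factor in steps:
    cumulative *= factor
    print(f"{name}: {cumulative}X over SSE4.2")
```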
Intel AVX-512 in Intel products
Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing. Intel AVX-512 brings the capabilities of 512-bit vector operations, first seen in the first Xeon Phi Coprocessors (previously code named Knights Corner), into the official Intel instruction set in a way that can be utilized in processors as well. Intel AVX-512 offers some improvements and refinement over the 512-bit SIMD found on Knights Corner that I’ve seen bring smiles to compiler writers and application developers alike. This is done in a way that offers source code compatibility for almost all applications with a simple recompile or relinking to libraries with Knights Landing support.
Intel AVX-512 Instruction encodings
Intel AVX instructions use the VEX prefix, while Intel AVX-512 instructions use the EVEX prefix, which is one byte longer. The EVEX prefix enables the additional functionality of Intel AVX-512. In general, if the extra capabilities of the EVEX prefix are not needed, then the AVX2 instructions can be used, coded using the VEX prefix and saving a byte in certain cases. Such optimizations can be done automatically in compiler code generators or assemblers.
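For a feel of the size difference, the sketch below classifies a prefix by its first byte. The byte values (0xC5 and 0xC4 for the two- and three-byte VEX forms, 0x62 for the four-byte EVEX form) come from general knowledge of x86 encoding rather than from this post, and the sketch assumes 64-bit mode, where those bytes are not shared with older instructions.

```python
# first byte -> (prefix name, prefix length in bytes); assumes 64-bit mode
PREFIXES = {
    0xC5: ("VEX", 2),
    0xC4: ("VEX", 3),
    0x62: ("EVEX", 4),
}


def classify_prefix(first_byte):
    """Return (name, length) for a VEX/EVEX prefix, or a placeholder otherwise."""
    return PREFIXES.get(first_byte, ("legacy or other", 0))


for byte in (0xC5, 0xC4, 0x62):
    name, length = classify_prefix(byte)
    print(f"{byte:#04x}: {name}, {length} bytes")
```

Against the three-byte VEX form, EVEX costs one extra byte; against the two-byte form, it costs two.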
Emulation for Testing, Prior to Product
In order to help with testing of support, before Knights Landing is available, the Intel® Software Development Emulator (Intel® SDE) has been extended for Intel AVX-512 and is available at http://www.intel.com/software/sde.
Innovation Beyond Intel AVX-512
Intel AVX-512 foundation instructions will be included in all implementations of Intel AVX-512. Products may also include capabilities that extend Intel AVX-512 and have distinct CPUID bits for detection. Knights Landing will support three sets of capabilities to augment the foundation instructions. This is documented in the programmer’s guide; they are known as Intel AVX-512 Conflict Detection Instructions (CDI), Intel AVX-512 Exponential and Reciprocal Instructions (ERI) and Intel AVX-512 Prefetch Instructions (PFI). These capabilities provide efficient conflict detection to allow more loops to be vectorized, exponential and reciprocal operations and new prefetch capabilities, respectively.
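Detection through distinct CPUID bits might look like the sketch below, which decodes a raw EBX value from CPUID leaf 7 (sub-leaf 0). The bit positions are my understanding of the published enumeration and should be confirmed against the programming reference; the feature labels and the function name are illustrative, and obtaining the EBX value itself is left out because it is platform specific.

```python
# Assumed bit positions in CPUID.(EAX=7, ECX=0):EBX; verify against the reference.
AVX512_EBX_BITS = {
    "AVX512F": 16,   # foundation
    "AVX512PF": 26,  # prefetch (PFI)
    "AVX512ER": 27,  # exponential and reciprocal (ERI)
    "AVX512CD": 28,  # conflict detection (CDI)
}


def avx512_features(ebx):
    """Return the sorted names of the AVX-512 feature bits set in `ebx`."""
    return sorted(name for name, bit in AVX512_EBX_BITS.items() if (ebx >> bit) & 1)


# Hypothetical EBX value with all four bits set.
example_ebx = (1 << 16) | (1 << 26) | (1 << 27) | (1 << 28)
print(avx512_features(example_ebx))
# ['AVX512CD', 'AVX512ER', 'AVX512F', 'AVX512PF']
```

Code that depends on an extension should check its own bit rather than infer it from the foundation bit.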
Intel AVX-512 support
Release of detailed information on Intel AVX-512 helps enable support in tools and operating systems by the time products appear. We are working with both open source projects and tool vendors to help incorporate support. The Intel compilers, libraries, and analysis tools have been, or will be, updated to provide first-class Intel AVX-512 support.
Intel AVX-512 documentation
The Intel AVX-512 instructions are documented in the Intel® Architecture Instruction Set Extensions Programming Reference (see the “Getting Started” tab at http://software.intel.com/en-us/intel-isa-extensions). Intel AVX-512 is covered in Chapters 2-7; Chapters 5 and 6 detail the Intel AVX-512 foundation instructions while Chapter 7 details the capabilities that extend Intel AVX-512.
Vectorization is an industry-wide challenge, and if you are interested in seeing one of the industry-leading exploration projects (and trying it on your code), then you may want to look at ispc. ispc is an R&D compiler for a C-based language that is targeted for exploring the performance available from doing SPMD [...] Read more >
We are kicking off a series of OpenCL Webinars this week (they will be available for playback afterwards too). All talks are at 09:00 Pacific Time. Register for the talks to get a link and optional reminder. July 11: Getting Started with Intel® SDK for OpenCL Applications 2012 July 18: Writing Efficient Code for OpenCL Applications 2012 [...] Read more >
The Top500 List released on June 18, 2012, included preproduction Knights Corner coprocessors in the #150 system. It became the first system on the Top500 list to utilize the Intel Many Integrated Core (MIC) Architecture. The system came in at #150 on the Top500 and was cited as the third most power efficient design in the [...] Read more >
A busy ISC awaits us in Hamburg June 17-21, 2012. Please drop by and say hello if you will be there. Here are some of the places I’ll be: Sunday helping Arch with his tutorial on Cilk Plus programming. I’ll be fetching water, holding the door, whatever Arch needs. Monday at 6pm – here is [...] Read more >
Knights Corner: Open source software stack As mentioned in “Knights Corner micro-architecture support” the open source software stack consists of an embedded Linux, a minimally modified GCC, plus driver software. There is a package for GDB available separately as well. Links for these resources can be found at intel.com/software/mic in the article titled “RESOURCES (including downloads).” [...] Read more >
Knights Corner micro-architecture support How does a high performance SMP on-a-chip sound to you? I can now share, for the first time, key details about our vision for Knights Corner (the aforementioned high performance SMP on-a-chip), and our thinking behind the software architecture and features. There is a lot to cover here so I’ll cover it [...] Read more >
one presenter exclaimed “Time spent optimizing for MIC is time well spent because it optimizes your code for non-MIC processors at the same time.” Read more >
A couple of back-to-back opportunities to see great talks about harnessing lots of cores, and to give talks about programming options and why we do not need to give up on programmability in our quest for high performance. Wellington this week, Austin next week. Programming is not easy, and neither is parallel programming. Nevertheless, many [...] Read more >
Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful. I’ll do my best to explain them in this blog. In my blog “Transactional Synchronization in Haswell,” I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks. Understanding coarse-grained locks and [...] Read more >
We have released details of Intel® Transactional Synchronization Extensions (TSX) for the future multicore processor code-named “Haswell”. The updated specification (Intel® Architecture Instruction Set Extensions Programming Reference) can be downloaded. In this blog, I’ll introduce Intel TSX and provide a little background. Please refer to The Transactional Synchronization Extensions Chapter (Chapter 8) in the manual [...] Read more >
OPEN CASCADE S.A.S and Intel Corporation software teams decided to join efforts to introduce parallel calculations into the Salome SMESH Module. They did the development with the help of Intel® Parallel Studio XE, and wrote an article about it which can be downloaded (for free) from Parallelism_in_SMESH.pdf Read more >
HPCwire recognized Intel Parallel Studio XE, the same month we added even more to like with Intel Cluster Studio XE. Read more >
This week we demonstrated the Knights Corner co-processor at SC11 and we had many developers demonstrating real results with the prototype systems. During the “SC11 season,” a number of tool vendors announced they will be providing versions of their software tailored to supporting MIC architecture, starting with the Knights Corner co-processor. Here are the ones I know [...] Read more >