How to use the Pause/Resume API in your code and run it correctly on the Intel® Xeon Phi™ coprocessor

VTune™ Amplifier XE 2013 Update 13 now supports the ITT Pause/Resume API on the Intel® Xeon Phi™ coprocessor. This article explains the environment variables that the user has to set for the Intel Xeon Phi coprocessor.

I wrote an earlier article about using the Pause/Resume API on a traditional Intel Xeon processor, and I use the same example code here. I expected to do the same things I did for the Xeon processor, apart from setting the environment variables. In the end I found there is a trick that needs attention.


1.      Build a native Intel Xeon Phi coprocessor application (using Intel C/C++ Composer XE 2013).

# icpc -g -mmic test_api.cpp -I/opt/intel/vtune_amplifier_xe_2013/include /opt/intel/vtune_amplifier_xe_2013/bin64/k1om/libittnotify.a -lpthread -o test_api

2.      Copy the native application onto the Intel Xeon Phi coprocessor.

# scp test_api mic0:/root

3.      Run VTune data collection.

# amplxe-cl -collect knc-hotspots -start-paused -search-dir all:rp=./ -- ssh mic0 INTEL_LIBITTNOTIFY64=$MIC_INTEL_LIBITTNOTIFY64 INTEL_JIT_PROFILER64=$MIC_INTEL_JIT_PROFILER64 INTEL_ITTNOTIFY_CONFIG=$MIC_INTEL_ITTNOTIFY_CONFIG /root/test_api

amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/peter/problem_report/r014hs -command stop.

amplxe: Collection paused.

amplxe: Collection resumed.

amplxe: Collection stopped.

amplxe: Using result path `/home/peter/problem_report/r001hs'

This result is wrong. I start data collection in the paused state, and the code resumes collection, pauses it again and resumes it again. There should be two data collection phases, but the log shows only one resume, so the result is incorrect.

The trick is that I should create a separate file which uses all MIC environment variables, for example: with contents:


Then run the collector:

# amplxe-cl -collect knc-hotspots -search-dir all:rp=./ -start-paused -- /home/peter/problem_report/
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/peter/problem_report/r009hs -command stop.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.

amplxe: Using result path `/home/peter/problem_report/r002hs'


This is the result I expected.
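
For an automated check, the two logs above can be told apart by counting the resume messages. The following is a minimal Python sketch, not part of the original test; the function name count_resumes is hypothetical, and it assumes the run was started with -start-paused, so that every collection phase begins with an "amplxe: Collection resumed." line.

```python
def count_resumes(log_text):
    """Count 'Collection resumed.' messages in captured amplxe-cl output.

    Assumption: the run used -start-paused, so each collection phase
    starts with exactly one resume message.
    """
    return sum(1 for line in log_text.splitlines()
               if "Collection resumed." in line)


# Collector messages copied from the failing run above.
WRONG_LOG = """\
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.
"""

# Collector messages copied from the working run above.
EXPECTED_LOG = """\
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.
"""

if __name__ == "__main__":
    print("phases in failing run:", count_resumes(WRONG_LOG))
    print("phases in working run:", count_resumes(EXPECTED_LOG))
```

A test script could compare the count with the number of resume calls that the instrumented code is known to make (two in this example).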


Display the source view with performance data on the command line by using VTune(TM) Amplifier XE

Intel(R) VTune(TM) Amplifier XE can profile a user application and report hotspots, the places where the application consumes the most CPU time.
Usually, after collecting performance data, the user opens the result in the application GUI to display hot functions. To see the hot lines of a hot function, double-click that function to go to the source view, which displays all hot lines with performance data.
The problem is that earlier versions of the product supported the hot-line view only in the GUI; on the command line only hot functions were available. Since VTune(TM) Amplifier XE 2013 Update 9, the product supports displaying the assembly / source view on the command line. This is useful for users who run automated tests that focus on critical source lines, that is, who put VTune command lines in their test scripts.
Use the “source-object” option to achieve this. Here are usage examples:
1. Collecting performance data
# amplxe-cl -collect advanced-hotspots -- ./primes.icc
2. Displaying hot lines in a specific function
a. # amplxe-cl -report hotspots -r r001ah/
amplxe: Using result path `/home/peter/problem_report/r001ah'
amplxe: Executing actions 50 % Generating a report
Function                 Module              CPU Time:Self
-----------------------  ------------------  -------------
findPrimes               primes.icc                  2.089
pthread_mutex_unlock                                 0.002
__dentry_open            vmlinux                     0.001
b.  # amplxe-cl -report hotspots -source-object function=findPrimes -r r001ah/
Source Line  Source                                                            CPU Time:Self
-----------  ----------------------------------------------------------------  -------------
387              for (number = start; number < end; number += stride)
388              {
389                  factor = 3;
391                  while ((number % factor) != 0 ) factor += 2;                      2.088
393                  if ( factor == number )                                           0.001
394                  {
395                      pthread_mutex_lock (&cs);
3. We can also display assembly code with performance data on the command line
# amplxe-cl -report hotspots -source-object function=findPrimes -group-by basic-block,address -r r001ah/
Basic Block  Assembly                                 Source Line  CPU Time:Self
-----------  ---------------------------------------  -----------  -------------
0x4008e0     Block 13
 0x4008e0     movq  $0x3, -0x18(%rbp)                 389
0x4008e8     Block 14                                                      2.027
 0x4008e8     movq  -0x20(%rbp), %rax                 391                  0.003
 0x4008ec     movq  -0x18(%rbp), %rdx                 391
 0x4008f0     movq  %rdx, -0x10(%rbp)                 391
 0x4008f4     cqo                                     391                  0.070
 0x4008f6     movq  -0x10(%rbp), %rcx                 391
 0x4008fa     idiv %rcx                               391                  0.001
 0x4008fd     test %rdx, %rdx                         391                  1.880
 0x400900     jz 0x400911 <Block 16>                  391                  0.073
0x400902     Block 15                                                      0.061
 0x400902     mov $0x2, %eax                          391                  0.002
 0x400907     addq  -0x18(%rbp), %rax                 391                  0.001
 0x40090b     movq  %rax, -0x18(%rbp)                 391
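
Since the motivation is to use these reports in test scripts, here is a minimal Python sketch of how a script might pick the hottest source line out of the source-object report from step 2b. It assumes the report text has already been captured, and that columns are separated by runs of two or more spaces as in the output above. hottest_line is a hypothetical helper, not a VTune API, and the parsing is a heuristic that may need adjusting for other report layouts.

```python
import re


def hottest_line(report_text):
    """Return (source_line_number, cpu_time) for the row with the most CPU time.

    Rows without a CPU time column, header rows and separator rows are skipped.
    Returns None if no row carries a CPU time.
    """
    best = None
    for row in report_text.splitlines():
        # Assumption: columns are separated by at least two spaces.
        cols = re.split(r"\s{2,}", row.strip())
        if len(cols) < 3 or not cols[0].isdigit():
            continue
        try:
            cpu_time = float(cols[-1])
        except ValueError:
            continue
        if best is None or cpu_time > best[1]:
            best = (int(cols[0]), cpu_time)
    return best


# Rows copied from the report shown in step 2b.
SAMPLE_REPORT = """\
Source Line  Source                                                            CPU Time:Self
-----------  ----------------------------------------------------------------  -------------
387              for (number = start; number < end; number += stride)
389                  factor = 3;
391                  while ((number % factor) != 0 ) factor += 2;                      2.088
393                  if ( factor == number )                                           0.001
"""

if __name__ == "__main__":
    print(hottest_line(SAMPLE_REPORT))
```

A test script could then fail the run when the hottest line moves or when its CPU time exceeds a chosen budget.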


How to get VTune™ Amplifier XE working on an embedded Linux system quickly?

Why did I write this blog? Most embedded Linux systems are customized into a “tiny” OS. That means the VTune™ Amplifier installer may fail because utilities it requires are missing. This article shows how to get VTune running quickly on your embedded system.

The steps are:

1. Install VTune on the host machine, command line only. Use “Customize installation->Change components to install->Unselect Graphical user interface->Start installation Now”

2. After installing VTune on the host, copy the whole VTune directory to the target. For example, “# scp -r vtune_amplifier_xe_2013 kentsfield-01:/opt/intel”

3. Check the total size of VTune on the target, and set up the environment

# du -sh vtune_amplifier_xe_2013/

600M    vtune_amplifier_xe_2013/

# source /opt/intel/vtune_amplifier_xe_2013/

Copyright (C) 2009-2013 Intel Corporation. All rights reserved.

Intel(R) VTune(TM) Amplifier XE 2013 (build 305106)

4. Build and install the VTune drivers

# cd /opt/intel/vtune_amplifier_xe_2013/sepdk/src



5. Try a first data collection

# amplxe-cl -collect advanced-hotspots -duration 10

Note: you can use the amplxe-cl command on the target device to collect data and display the result, or copy the result directory to another machine that has the VTune Amplifier XE GUI installed and analyze it there.


What are the steps for using user-defined metrics to profile your program?

I’d like to recommend this new feature to advanced VTune™ Amplifier XE 2013 users. The feature helps users define their own metrics from the PMU hardware events they are interested in.

Let’s review what we had before this new feature. Usually the user picks a predefined analysis type in VTune Amplifier when starting a new analysis. A predefined analysis type includes many events, and VTune Amplifier collects performance data for them. The tool also generates metric data, because the metric formulas and rules are already set in the predefined analysis types. For example, the tool calculates the value of a metric from event counts, and if the value is greater than a threshold, the tool highlights the metric in the GUI. (Note that you can also see metrics in the command-line summary report if you use the general-exploration analysis; the latest update is U11.)

Unfortunately, sometimes the user is interested in events that are not defined in the existing predefined analysis types. A separate article explains how to quickly collect data for supported events from the command line. The shortcoming is that the tool then provides only event counts and no metric data. The situation is the same if the user creates a new analysis type: metrics are hard to add in the GUI.

This article shows how to create user metrics. Please follow the steps below:

1. Ensure that Python 2.6 or above is installed on your system

2. Extract the archive under vtune_amplifier_xe_2013/sdk/user_metrics with “unzip”.

3. The user_metrics directory contains many example .py files. Check your processor architecture (for example Core™ 2, Core™ i7, Sandy Bridge or Ivy Bridge) so that you can pick the right example .py file as a reference for building a new one that works on your processor. Here is a small example for reference. I work on a Sandy Bridge processor. The purpose of this small .py file is to use the event BR_MISP_RETIRED.ALL_BRANCHES_PS to evaluate whether branch mispredicts have a high performance impact in your program, and what the average mispredict rate of branch instructions is.


This is a simple example to show how to create user metrics





BranchMispredictImpact = metric("Branch Mispredict Impact")

BranchMispredictImpact.formula["snb"] = 20 * event("BR_MISP_RETIRED.ALL_BRANCHES_PS") / query("Clockticks")

BranchMispredictImpact.issue_eval["snb"] = ( formula() > .20) * (query("PMUHotspot") > .05)


BranchMispredictRate = metric("Branch Mispredict Rate")

BranchMispredictRate.formula["snb"] = event("BR_MISP_RETIRED.ALL_BRANCHES_PS") / event("BR_INST_RETIRED.ALL_BRANCHES_PS")

BranchMispredictRate.issue_eval["snb"] = ( formula() > .50) * (query("PMUHotspot") > .05)



SNBBranchMispredict = analysis("SNB Branch Mispredict")                  = "Branch Mispredict"

SNBBranchMispredict.long_name             = "Intel Microarchitectures Code Name Sandy Bridge and Ivy Bridge - Branch Mispredict"

SNBBranchMispredict.description           = ""

SNBBranchMispredict.valid_architectures   = ["snb,ivybridge"]

SNBBranchMispredict.alias_name            = "snb-branch-mispredict"

SNBBranchMispredict.always_collect["snb,ivybridge"] = [ event("CPU_CLK_UNHALTED.THREAD"), event("CPU_CLK_UNHALTED.REF_TSC"), event("INST_RETIRED.ANY") ]

SNBBranchMispredict.metric_tree = [

    ( query("Clockticks"),                                      set(["summary","grid","srcasm"]) ),

    ( query("InstructionsRetired"),                             set(["summary","grid","srcasm"]) ),

    ( query("CPI"),                                             set(["summary","grid"]) ),

    ( query("BranchMispredictImpact"),                          set(["summary","grid","srcasm"]) ),

    ( query("BranchMispredictRate"),                            set(["summary","grid","srcasm"]) )

]



4. Use “# python translate_metrics.pyc -m” to parse your .py file and generate the new analysis type. By default the output goes to the GENERATED_OUTPUTS directory, which has sub-directories named “analysis_type” and “viewpoint”.

5. In the analysis_type directory, run “cp snb-branch-mispredict_atype.cfg /opt/intel/vtune_amplifier_xe_2013/config/analysis_type/”

6. In the viewpoint directory, run “cp snb-branch-mispredict_viewpoint.cfg /opt/intel/vtune_amplifier_xe_2013/config/viewpoint/”

7. Now you can see that a new analysis type has been generated on your system.

# amplxe-cl -collect-list | grep snb-branch-mispredict

snb-branch-mispredict           SNB Branch Mispredict

8. Use this new analysis type to profile a program

# amplxe-cl -collect snb-branch-mispredict -duration 60 -- ./nbench

9. Use the VTune Amplifier GUI to review the metrics
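
To sanity-check the values shown in the GUI, the two formulas from step 3 can be evaluated by hand. The following is a minimal Python sketch that mirrors those formulas in ordinary functions. The event counts are made-up illustration values, not measurements, and the function names are hypothetical. The sketch also ignores the PMUHotspot condition in issue_eval and applies only the formula thresholds (.20 and .50).

```python
def branch_mispredict_impact(br_misp_retired, clockticks, weight=20):
    """Mirror of: 20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / Clockticks."""
    return float(weight) * br_misp_retired / clockticks


def branch_mispredict_rate(br_misp_retired, br_inst_retired):
    """Mirror of: BR_MISP_RETIRED.ALL_BRANCHES_PS / BR_INST_RETIRED.ALL_BRANCHES_PS."""
    return float(br_misp_retired) / br_inst_retired


# Hypothetical event counts, chosen only to illustrate the arithmetic.
BR_MISP = 50000000       # BR_MISP_RETIRED.ALL_BRANCHES_PS
BR_INST = 1000000000     # BR_INST_RETIRED.ALL_BRANCHES_PS
CLOCKTICKS = 4000000000  # Clockticks

impact = branch_mispredict_impact(BR_MISP, CLOCKTICKS)
rate = branch_mispredict_rate(BR_MISP, BR_INST)

if __name__ == "__main__":
    # Thresholds taken from the issue_eval lines in step 3.
    print("impact = %.3f, above threshold: %s" % (impact, impact > 0.20))
    print("rate   = %.3f, above threshold: %s" % (rate, rate > 0.50))
```

With these sample counts the impact metric crosses its threshold while the rate metric does not, which shows that the two metrics can disagree for the same workload.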


How to do memory bandwidth analysis on older processors?

Intel(R) VTune(TM) Amplifier XE supports memory bandwidth analysis on recent Sandy Bridge, Ivy Bridge, and Haswell processors. However, a user working on older processors, for example Nehalem or Westmere-DP, will receive an error message such as:

# amplxe-cl -collect wsmex-write-bandwidth -duration 10

amplxe: Fatal error: This analysis type is only defined for Intel processors code name Beckton or Eagleton.

Memory bandwidth analysis is a key feature in VTune™ Amplifier XE. It uses the uncore events UNC_IMC_WRITES.FULL.ANY and UNC_IMC_NORMAL_READS.ANY to gather performance data on memory reads and writes through the IMC (Integrated Memory Controller). These events are not tied to a specific core, so they are collected by event-based sampling in counting mode: the data collector records only the event counts and cannot record where the events happened (on which core). This is still very helpful, because it tells the user the overall memory data throughput per second while the program is running.
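
As an illustration of how such counts turn into a throughput figure, here is a minimal Python sketch. It assumes that each counted IMC read or full write moves one 64-byte cache line; that assumption is mine and is not stated in this article. The counts and the elapsed time are made-up illustration values, and memory_bandwidth is a hypothetical helper.

```python
CACHE_LINE_BYTES = 64  # assumption: one cache line per counted IMC event


def memory_bandwidth(read_events, write_events, elapsed_seconds):
    """Return the estimated memory throughput in bytes per second."""
    total_bytes = (read_events + write_events) * CACHE_LINE_BYTES
    return total_bytes / float(elapsed_seconds)


# Hypothetical counts for a 10 second collection.
READS = 150000000   # UNC_IMC_NORMAL_READS.ANY
WRITES = 50000000   # UNC_IMC_WRITES.FULL.ANY
ELAPSED = 10        # seconds, as in "-duration 10"

bandwidth = memory_bandwidth(READS, WRITES, ELAPSED)

if __name__ == "__main__":
    print("estimated bandwidth: %.2f GB/s" % (bandwidth / 1e9))
```

Because the events are counted for the whole memory controller, the result is a system-wide figure and cannot be attributed to a core or a function.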

How can we get this data on Nehalem and Westmere-DP platforms?

There are two workarounds to choose from:

1. PTU (Performance Tuning Utility) plus a patch can solve this issue. PTU is an old experimental tool from Intel that has reached EOL and can no longer be downloaded. However, if you have an old version, for example PTU 3.2 Update 1, you can download lin_measurebw.tar.gz from this article and then follow the steps below to do bandwidth analysis.

1) Extract the PTU package; there is no need to install it.

2) Go to PTU/vdk/src, then build and install the VTune driver.

3) Extract the patch file, go to the “uncore” directory, and do 


Please enter the path to PTU 3.2 [/opt/intel/ptu32_001_lin_intel64]: /home/peter/ptu32_001_lin_intel64

Measurement complete.  See bandwidth.txt for results.

Press enter to exit.

4) Review the output file named bandwidth.txt

Note that PTU is an old product and was tested on old OSs only. Usually Linux* kernel version 2.6.18 is recommended for PTU 3.2.

2. Use Intel PCM to solve this problem. PCM is a simple utility that provides an architecturally defined way for software agents to interact with the PMU of the processor. Here is an example of using PCM on Linux*

1) Extract the zip into the IntelPerformanceCounterMonitorV2.5 directory

2) # make (builds all utilities)

3) Run a program in one console, for example: # nbench-2.1/nbench

4) Run a utility in another console to monitor performance, for example: # ./pcm.x 1 -nc -ns. The utility will display:

 EXEC  : instructions per nominal CPU cycle

 IPC   : instructions per CPU cycle

 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)

 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)

 L3MISS: L3 cache misses

 L2MISS: L2 cache misses (including other core's L2 cache *hits*)

 L3HIT : L3 cache hit ratio (0.00-1.00)

 L2HIT : L2 cache hit ratio (0.00-1.00)

 L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency

 L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)

 READ  : bytes read from memory controller (in GBytes)

 WRITE : bytes written to memory controller (in GBytes)

 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature



 TOTAL  *     0.16   1.12   0.14    1.12     196 K    516 K    0.62    0.95    0.01    0.00    0.17    0.00     N/A


How to estimate the cache miss penalty more accurately on Ivy Bridge?

Users often refer to the Tuning Guides and Performance Analysis Papers for the different Intel® Core™ processor generations when optimizing their applications.
Estimating the cache miss penalty is usually considered first, because an LLC miss is expensive for the CPU. See the formula below (using Ivy Bridge as the example):
% of cycles spent on memory access (LLC misses) = (MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS * 180) / CPU_CLK_UNHALTED.THREAD
That is, we assume a latency of 180 cycles for each memory load that misses the LLC. However, this is an average value, and sometimes it does not hold.
Is there another method that captures *runtime* performance data and comes closer to reality? The answer is “Yes”. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual: VTune™ Amplifier XE 2013 supports new events that reflect the *runtime* latency of LLC misses.
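To estimate the exposure of DRAM traffic on third generation Intel Core processors, the remainder of L2_PENDING is used for MEM Bound:
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CPU_CLK_UNHALTED.THREAD
Where L3_Miss_fraction is:
L3_Miss_fraction = MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS / (MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS + MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS)
The correction factor MEM_L3_WEIGHT is approximately the external memory to L3 cache latency ratio. A factor of 7 can be used for the third generation Intel Core processor family.
Let’s run a simple test to see the advantage of the new events.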
Example code:
#include <stdio.h>
#define NUM 1024
double a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];

/* Naive matrix multiply: c = a * b */
void multiply()
{
    unsigned int i, j, k;
    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < NUM; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main()
{
    // start timing the matrix multiply code
    multiply();
    return 0;
}
amplxe: Using result path `/home/peter/r005runsa'
amplxe: Executing actions 50 % Generating a report
Collection and Platform Info
Parameter                 r005runsa
------------------------  ----------------------------------------------------------------------------
Application Command Line  ./matrix
Computer Name             ivb01
Environment Variables
MPI Process Rank
Operating System          2.6.32-279.el6.x86_64 Red Hat Enterprise Linux Server release 6.3 (Santiago)
Result Size               4144003
User Name                 root
Parameter          r005runsa
-----------------  -------------------------------------------------
Frequency          3500000000
Logical CPU Count  8
Name               3rd generation Intel(R) Core(TM) Processor family
Elapsed Time:  7.332
Event summary
Hardware Event Type                Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
---------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.THREAD            28510042765                14255                             2000003
CYCLE_ACTIVITY.STALLS_L2_PENDING   12598018897                6299                              2000003
MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS  1600112                    16                                100007
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS   15356447                   307                               50021
amplxe: Executing actions 100 % done
Now we can use two methods to estimate the cost of LLC misses.
(Old: estimated data) 1. % of cycles spent on memory access (LLC misses) = (MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS * 180) / CPU_CLK_UNHALTED.THREAD = 1600112 * 180 / 28510042765 = 1.01%
(New: calculated from runtime data) 2. L3_Miss_fraction = MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS / (MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS + MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) = 7*1600112 / (7*1600112+15356447) = 11200784 / 26557231 = 0.421
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CPU_CLK_UNHALTED.THREAD = 12598018897 * 0.421 / 28510042765 = 18.6%
The new method starts from the L2 pending stall cycles, which in this case are 44% (CYCLE_ACTIVITY.STALLS_L2_PENDING / CPU_CLK_UNHALTED.THREAD) of all CPU clocks. The L3 miss fraction is 42.1% of that 44%, so the L3 miss latency accounts for 18.6% of all CPU clocks. This is more accurate than old method 1, which only multiplies the LLC miss count by a fixed latency and ignores the measured stall cycles.
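
The arithmetic of both methods can be reproduced with a short script. This is a minimal Python sketch using the event counts from the event summary above; the function names are hypothetical, and the constants 180 and 7 are the fixed latency and the MEM_L3_WEIGHT factor quoted in this article.

```python
def fixed_latency_estimate(llc_miss, clockticks, latency_cycles=180):
    """Old method: assume a fixed latency for every LLC miss."""
    return float(llc_miss) * latency_cycles / clockticks


def l3_miss_fraction(llc_miss, llc_hit, mem_l3_weight=7):
    """Share of L2-pending stalls attributed to LLC misses."""
    weighted_miss = float(mem_l3_weight) * llc_miss
    return weighted_miss / (weighted_miss + llc_hit)


def mem_bound(stalls_l2_pending, llc_miss, llc_hit, clockticks, mem_l3_weight=7):
    """New method: scale the measured L2-pending stall cycles by the miss fraction."""
    fraction = l3_miss_fraction(llc_miss, llc_hit, mem_l3_weight)
    return stalls_l2_pending * fraction / clockticks


# Event counts copied from the event summary above.
CLOCKTICKS = 28510042765         # CPU_CLK_UNHALTED.THREAD
STALLS_L2_PENDING = 12598018897  # CYCLE_ACTIVITY.STALLS_L2_PENDING
LLC_MISS = 1600112               # MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS
LLC_HIT = 15356447               # MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS

old_estimate = fixed_latency_estimate(LLC_MISS, CLOCKTICKS)
new_estimate = mem_bound(STALLS_L2_PENDING, LLC_MISS, LLC_HIT, CLOCKTICKS)

if __name__ == "__main__":
    print("old method: {:.2%} of cycles".format(old_estimate))
    print("new method: {:.1%} of cycles".format(new_estimate))
```

The script works with unrounded intermediate values, so the miss fraction differs slightly in the last digit from the hand calculation, while both final percentages agree with the figures above.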
