VTune™ Amplifier XE 2013 Update 13 now supports the ITT Pause/Resume API on the Intel® Xeon Phi™ coprocessor. Here is the article that describes the environment variables the user has to set for the Intel Xeon Phi coprocessor.
I previously wrote an article about using the Pause/Resume API on a traditional Intel Xeon processor, and I use the same example code here. I expected that, apart from setting the environment variables, the steps would be the same as on the Xeon processor. It turned out that there is a trick that needs attention.
1. Build a native Intel Xeon Phi coprocessor application (using Intel C/C++ Composer XE 2013).
# icpc -g -mmic test_api.cpp -I/opt/intel/vtune_amplifier_xe_2013/include /opt/intel/vtune_amplifier_xe_2013/bin64/k1om/libittnotify.a -lpthread -o test_api
2. Copy native application onto Intel Xeon Phi coprocessor
# scp test_api mic0:/root
3. Run VTune data collection.
# amplxe-cl -collect knc-hotspots -start-paused --search-dir all:rp=./ -- ssh mic0 INTEL_LIBITTNOTIFY64=$MIC_INTEL_LIBITTNOTIFY64 INTEL_JIT_PROFILER64=$MIC_INTEL_JIT_PROFILER64 INTEL_ITTNOTIFY_CONFIG=$MIC_INTEL_ITTNOTIFY_CONFIG /root/test_api
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/peter/problem_report/r014hs -command stop.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.
amplxe: Using result path `/home/peter/problem_report/r001hs'
This result is wrong. I start data collection in the paused state, then the code resumes collection, pauses it again, and resumes it again. There should be two data collection phases, but the log shows only one resume, so the result is incorrect.
The trick is to create a separate script file that sets all the MIC environment variables, for example run_sample.sh with the following contents:
ssh mic0 INTEL_LIBITTNOTIFY64=$MIC_INTEL_LIBITTNOTIFY64 INTEL_JIT_PROFILER64=$MIC_INTEL_JIT_PROFILER64 INTEL_ITTNOTIFY_CONFIG=$MIC_INTEL_ITTNOTIFY_CONFIG /root/test_api
Then run the collector:
# amplxe-cl -collect knc-hotspots -search-dir all:rp=./ -start-paused -- /home/peter/problem_report/run_sample.sh
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/peter/problem_report/r009hs -command stop.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.
amplxe: Using result path `/home/peter/problem_report/r002hs'
This is the result I expected.
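To check quickly whether a run picked up every pause and resume call, the collector messages can be counted. The following is only a minimal sketch: the function name and the two log variables are hypothetical, and the log text is copied from the two runs shown in this post. It assumes that, with -start-paused, every "Collection resumed." message starts one data collection phase.

```python
# Collector messages from the direct "ssh mic0 ..." run (copied from above).
DIRECT_SSH_LOG = """\
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.
"""

# Collector messages from the run that goes through run_sample.sh.
WRAPPER_SCRIPT_LOG = """\
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection paused.
amplxe: Collection resumed.
amplxe: Collection stopped.
"""


def count_resumed_phases(log_text):
    """Count the "Collection resumed." messages in a collector log.

    Assumption: collection was started paused, so each resume message
    marks the beginning of one data collection phase.
    """
    return sum(
        1
        for line in log_text.splitlines()
        if line.strip().endswith("Collection resumed.")
    )


if __name__ == "__main__":
    expected_phases = 2  # the example code resumes collection twice
    for name, log in (("direct ssh", DIRECT_SSH_LOG),
                      ("wrapper script", WRAPPER_SCRIPT_LOG)):
        phases = count_resumed_phases(log)
        status = "ok" if phases == expected_phases else "phases missing"
        print("%s: %d phase(s), %s" % (name, phases, status))
```

The direct ssh run reports one phase and the wrapper script run reports two, which matches what the logs above show.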
Why write this blog? Most embedded Linux systems are customized into a “tiny” OS. That means you might fail to install VTune™ Amplifier because some utilities that the VTune installer requires are missing. This article shows how to get VTune running quickly on your embedded system:
1. Install VTune on the host machine, command line only. Use “Customize installation->Change components to install->Unselect Graphical user interface->Start installation Now”.
2. After installing VTune on the host, copy the whole VTune directory to the target. For example, “# scp -r vtune_amplifier_xe_2013 kentsfield-01:/opt/intel”.
3. Check the total size of VTune on the target, and set up the environment:
# du -sh vtune_amplifier_xe_2013/
# source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
Copyright (C) 2009-2013 Intel Corporation. All rights reserved.
Intel(R) VTune(TM) Amplifier XE 2013 (build 305106)
4. Build and install vtune drivers
5. Try a first data collection
amplxe-cl -collect advanced-hotspots -duration 10
Note: you can use the amplxe-cl command on the target device to collect data and display the result, or copy the result directory to another machine that has the VTune Amplifier XE GUI installed and analyze it there.
I’d like to recommend this new feature to advanced VTune™ Amplifier XE 2013 users: it helps users define their own metrics from the PMU hardware events they are interested in.
Let’s review what was available before this feature. Usually the user picks a predefined analysis type in VTune Amplifier when starting a new analysis. A predefined analysis type includes many events, and VTune Amplifier collects performance data for them. The tool also generates metric data, because the formulas and rules for the metrics are already set in the predefined analysis types. For example, the tool calculates the value of a metric from event counts; if the value is greater than a threshold, the tool highlights the metric in the GUI. (Note that you can also see metrics in the command-line summary report if you use the general-exploration analysis; the latest update is U11.)
Unfortunately, sometimes the user is interested in events that are not included in any predefined analysis type. This article (http://software.intel.com/en-us/articles/event-configuration-from-the-command-line) explains how to quickly collect data for supported events from the command line. The shortcoming is that the tool then provides only event counts, with no metric data. The situation is the same if the user creates a new analysis type: metrics are hard to add in the GUI.
This article shows how to create user metrics. Follow the steps below:
1. Ensure that Python 2.6 or above is installed on your system.
2. Extract user_metrics.zip, which is under vtune_amplifier_xe_2013/sdk/user_metrics, with “unzip user_metrics.zip”.
3. The user_metrics directory contains many example .py files. Check your processor architecture (for example Core™ 2, Core™ i7, Sandy Bridge, or Ivy Bridge) so that you can pick the right example .py file as a reference for building a new one that works on your processor. Here is a small example for reference. I work on a Sandy Bridge processor. The purpose of this small .py file is to use the event BR_MISP_RETIRED.ALL_BRANCHES_PS to evaluate whether branch misprediction has a high performance impact in your program, and what the average mispredict rate on branch instructions is.
# This is a simple example to show how to create user metrics
BranchMispredictImpact = metric("Branch Mispredict Impact")
BranchMispredictImpact.formula["snb"] = 20 * event("BR_MISP_RETIRED.ALL_BRANCHES_PS") / query("Clockticks")
BranchMispredictImpact.issue_eval["snb"] = (formula() > .20) * (query("PMUHotspot") > .05)
BranchMispredictRate = metric("Branch Mispredict Rate")
BranchMispredictRate.formula["snb"] = event("BR_MISP_RETIRED.ALL_BRANCHES_PS") / event("BR_INST_RETIRED.ALL_BRANCHES_PS")
BranchMispredictRate.issue_eval["snb"] = (formula() > .50) * (query("PMUHotspot") > .05)
SNBBranchMispredict = analysis("SNB Branch Mispredict")
SNBBranchMispredict.name = "Branch Mispredict"
SNBBranchMispredict.long_name = "Intel Microarchitectures Code Name Sandy Bridge and Ivy Bridge - Branch Mispredict"
SNBBranchMispredict.description = ""
SNBBranchMispredict.valid_architectures = ["snb,ivybridge"]
SNBBranchMispredict.alias_name = "snb-branch-mispredict"
SNBBranchMispredict.always_collect["snb,ivybridge"] = [ event("CPU_CLK_UNHALTED.THREAD"), event("CPU_CLK_UNHALTED.REF_TSC"), event("INST_RETIRED.ANY") ]
SNBBranchMispredict.metric_tree = [
    ( query("Clockticks"), set(["summary","grid","srcasm"]) ),
    ( query("InstructionsRetired"), set(["summary","grid","srcasm"]) ),
    ( query("CPI"), set(["summary","grid"]) ),
    ( query("BranchMispredictImpact"), set(["summary","grid","srcasm"]) ),
    ( query("BranchMispredictRate"), set(["summary","grid","srcasm"]) )
]
4. Run “# python translate_metrics.pyc -m snb_branch_misp.py” to parse your .py file and generate the new analysis type. By default the result goes to the GENERATED_OUTPUTS directory, which has sub-directories named “analysis_type” and “viewpoint”.
5. In the analysis_type directory, run “cp snb-branch-mispredict_atype.cfg /opt/intel/vtune_amplifier_xe_2013/config/analysis_type/”
6. In the viewpoint directory, run “cp snb-branch-mispredict_viewpoint.cfg /opt/intel/vtune_amplifier_xe_2013/config/viewpoint/”
7. Now you can see that the new analysis type is available on your system.
# amplxe-cl -collect-list | grep snb-branch-mispredict
snb-branch-mispredict SNB Branch Mispredict
8. Use the new analysis type to profile a program
# amplxe-cl -collect snb-branch-mispredict -duration 60 -- ./nbench
9. Use the VTune Amplifier GUI to review the metrics
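The two formulas from the step 3 example can also be checked by hand in plain Python, outside of VTune. This is only a sketch: the function names and the event counts are made up for illustration, the factor 20 is read here as an approximate cost in cycles per mispredicted branch, and the PMUHotspot query is assumed to be the share of clockticks spent in the item being evaluated.

```python
def branch_mispredict_impact(misp_retired, clockticks):
    """20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / Clockticks, as in the .py file."""
    return 20.0 * misp_retired / clockticks


def branch_mispredict_rate(misp_retired, branches_retired):
    """BR_MISP_RETIRED.ALL_BRANCHES_PS / BR_INST_RETIRED.ALL_BRANCHES_PS."""
    return float(misp_retired) / branches_retired


def is_issue(value, threshold, hotspot_share, min_hotspot_share=0.05):
    """Mirror issue_eval: flag only if the metric exceeds its threshold AND
    the item is a hotspot (assumed meaning of query("PMUHotspot") > .05)."""
    return value > threshold and hotspot_share > min_hotspot_share


if __name__ == "__main__":
    # Hypothetical event counts for one function, not measured data.
    misp = 2000000          # BR_MISP_RETIRED.ALL_BRANCHES_PS
    branches = 40000000     # BR_INST_RETIRED.ALL_BRANCHES_PS
    clockticks = 100000000  # Clockticks
    hotspot_share = 0.30    # assumed share of total clockticks

    impact = branch_mispredict_impact(misp, clockticks)
    rate = branch_mispredict_rate(misp, branches)
    print("impact = %.2f, flagged: %s" % (impact, is_issue(impact, 0.20, hotspot_share)))
    print("rate   = %.2f, flagged: %s" % (rate, is_issue(rate, 0.50, hotspot_share)))
```

With these made-up counts the impact metric is above its 0.20 threshold while the rate metric stays far below 0.50, so only the first one would be highlighted.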
Intel(R) VTune(TM) Amplifier XE supports memory bandwidth analysis on recent Sandy Bridge, Ivy Bridge, and Haswell processors. However, a user working on older processors, for example Nehalem or Westmere-DP, will receive an error message such as:
# amplxe-cl -collect wsmex-write-bandwidth -duration 10
amplxe: Fatal error: This analysis type is only defined for Intel processors code name Beckton or Eagleton.
Memory bandwidth analysis is a key feature in VTune™ Amplifier XE. It uses the uncore events UNC_IMC_WRITES.FULL.ANY and UNC_IMC_NORMAL_READS.ANY to gather performance data on memory reads and writes through the IMC (Integrated Memory Controller). These events are not tied to a specific core, so they are collected by event-based sampling in counting mode: the data collector records only the event counts and cannot record where the events happened (on which core). This is still very helpful, because it tells the user the overall memory data throughput per second while the program is running.
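As a rough illustration of how such counts turn into a throughput figure, here is a minimal sketch. It assumes that each counted read or write event corresponds to one 64-byte cache line transferred through the IMC; that size, the function name, and the counts in the usage part are assumptions for illustration, not values taken from VTune.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted event moves one 64-byte cache line


def memory_bandwidth_gb_per_s(read_events, write_events, seconds):
    """Estimate overall memory throughput in GB/s from uncore event counts.

    read_events  -- count of UNC_IMC_NORMAL_READS.ANY over the interval
    write_events -- count of UNC_IMC_WRITES.FULL.ANY over the interval
    seconds      -- length of the collection interval
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total_bytes = (read_events + write_events) * CACHE_LINE_BYTES
    return total_bytes / float(seconds) / 1e9


if __name__ == "__main__":
    # Hypothetical counts for a 10-second collection, not measured data.
    bw = memory_bandwidth_gb_per_s(150000000, 50000000, 10)
    print("estimated bandwidth: %.2f GB/s" % bw)
```

Because the events are counted for the whole uncore, the estimate is a system-wide number; it cannot be broken down per core or per function.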
How can you get this data on Nehalem and Westmere-DP platforms?
There are two workarounds to choose from:
1. PTU (Performance Tuning Utility) plus a patch can solve this issue. PTU is an old experimental tool from Intel; it is now EOL and can no longer be downloaded. However, if you have an old version, for example PTU 3.2 Update 1, you can download lin_measurebw.tar.gz from this article and then follow the steps below to do bandwidth analysis:
1) Extract the PTU package; there is no need to install it.
2) Go to PTU/vdk/src, then build the vtune driver and install it.
3) Extract the patch file, go to the “uncore” directory, and run the measurement:
Please enter the path to PTU 3.2 [/opt/intel/ptu32_001_lin_intel64]: /home/peter/ptu32_001_lin_intel64
Measurement complete. See bandwidth.txt for results.
Press enter to exit.
4) Review the output file named bandwidth.txt.
Note that PTU is an old product and was tested on old OSs only. Linux* kernel version 2.6.18 is usually recommended for PTU 3.2.
2. Use Intel PCM to solve this problem. PCM is a simple utility that provides an architecturally defined approach for software agents to interact with the PMU of the processor. Here is an example of using PCM on Linux*:
1) Extract the zip into the IntelPerformanceCounterMonitorV2.5 directory.
2) # make (builds all utilities)
3) Run a program in one console, for example: # nbench-2.1/nbench
4) Run a utility in another console to monitor performance, for example: # ./pcm.x 1 -nc -ns. The utility will display:
EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency=’unhalted clock ticks’/’invariant timer ticks’ (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)=’unhalted clock ticks’/’invariant timer ticks while in C0-state’ (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core’s L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP
TOTAL *    | 0.16 | 1.12 | 0.14 | 1.12  | 196 K  | 516 K  | 0.62  | 0.95  | 0.01  | 0.00  | 0.17 | 0.00  | N/A
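If you want to post-process these numbers, the TOTAL row can be turned into named values with a few lines of Python. This is a sketch only: it assumes the row looks like the one shown above (two leading tokens for core and socket, counts such as "196 K" written with a separate suffix, N/A for a missing value), and the names parse_pcm_row, HEADER, and TOTAL_ROW are made up for this example. The real pcm.x output layout may differ between versions.

```python
HEADER = ("Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | "
          "L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP")
TOTAL_ROW = ("TOTAL * 0.16 1.12 0.14 1.12 196 K 516 K "
             "0.62 0.95 0.01 0.00 0.17 0.00 N/A")

# Multipliers for counts printed as "196 K", "3 M", and so on.
SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9}


def parse_pcm_row(header, row):
    """Map the metric names of the header line to the values of one row."""
    names = [name.strip() for name in header.split("|")][1:]  # skip "Core (SKT)"
    tokens = row.replace("|", " ").split()[2:]  # skip the core and socket tokens
    values = []
    for token in tokens:
        if token in SUFFIXES and values and values[-1] is not None:
            values[-1] *= SUFFIXES[token]  # scale the number just before it
        elif token == "N/A":
            values.append(None)
        else:
            values.append(float(token))
    if len(values) != len(names):
        raise ValueError("row does not match header")
    return dict(zip(names, values))


if __name__ == "__main__":
    total = parse_pcm_row(HEADER, TOTAL_ROW)
    # READ and WRITE are in GBytes for one sampling interval.
    print("memory traffic: %.2f GBytes" % (total["READ"] + total["WRITE"]))
    print("L3 misses: %d" % total["L3MISS"])
```

In the example command the utility is started with an interval argument of 1, so, assuming that means one sample per second, READ plus WRITE approximates the memory throughput in GBytes per second.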