Intel’s Cache Monitoring Technology (CMT) feature was introduced with the Intel®Xeon® E5 2600 v3 product family in 2014. This feature enables fine-grained tracking of L3 cache… Read more
Intel’s Cache Monitoring Technology (CMT) feature was introduced with the Intel® Xeon™ E5 2600 v3 product family in 2014. The CMT feature provides visibility into shared platform… Read more >
Intel’s Cache Monitoring Technology (CMT) feature was introduced with the Intel® Xeon™ E5 2600 v3 line of server processors in 2014. Initial blog posts here… Read more >
By Rukhsana Yeasmin
A single application can drain a device’s battery and negatively impact consumers’ perception of a platform. Unlike performance, there is no natural market-driven motivation to make one’s application power-efficient. An application that is obviously flawed may get noticed, but more often than not apps consume power without anyone being aware of it. Hence, power efficiency is extremely important.
Developers want free and easy-to-use tools that point toward the reasons behind an application’s high power consumption. Windows Performance Analyzer (WPA) can be used to inspect power trace files, but the user has to look through the entire trace manually to find potential problem areas, which is like searching for a needle in a haystack. The user also needs some knowledge of what to look for. This is where the Power Auto-Analyzer can help. Here we are going after the low-hanging fruit. Our goal is to come up with a list of the functions that consume most of the CPU time and are called repeatedly during the idle times of an application, and to show statistics for the time regions of high CPU usage. Someone using this tool will therefore get some idea of possible problem areas and may attempt to fix them.
The general assumption is that during idle times a process should have minimal CPU activity, resulting in lower power consumption. High CPU usage during idle times is therefore an indication of potential problems that cause inefficient power use, and there is room to power-optimize the application by handling these issues. For example, using the Sleep API with a low timeout value, or busy waiting, may result in frequent context switches. More context switches add overhead to the system, prevent it from entering the idle state, and thus increase the overall power consumption of the system. Excessive IO activity can result in a larger number of interrupts and thus increase the power consumption rate. There can be several reasons behind high CPU usage and periodic activity during the idle times of a process. One way to find them is to search for the common call stacks during the times of high CPU usage and periodic events, and then to look at the functions being called at those times.
Power Auto-Analyzer automatically detects the top power-consuming threads in an application. Given an idle-time trace of a process, the tool automatically detects regions of high CPU usage and repetition, along with the common call stacks in those regions, to identify the threads that consume the most CPU time and the functions called at the peak regions of those threads. It also shows statistics for those regions that may interest the user, and it lists the initialization, completion, IO times, and IO paths of all disk read/write events for each thread of the application.
The algorithm detects high CPU usage areas by walking through the entire event trace of each thread of the given application and looking for clustered regions of event occurrence or regions of very high CPU usage. If a region contains multiple events, the algorithm runs an overlapping sliding window through the clustered region and generates statistics on the CPU usage and repetition of the events inside the window. Sometimes it may report a single event as one region, if its CPU usage is very high compared to the overall thread and process usage (in this case, the “Count” statistic should be 1). A region is considered a region of interest if it has higher CPU usage and repetition than the application overall. The algorithm then merges successive overlapping windows inside a clustered region (based on the CPU usage and repetition statistics) from the list of regions of interest to get the final clustered regions of interest. Finally, it detects the common call stacks of the events in each such region by building a tree from the call stacks of the context-switch events, and reports them along with other useful statistics for those regions. The algorithm also generates disk IO statistics for each thread of the process of interest. The algorithm uses “traceevent.dll” to parse the raw .etl trace files of an application.
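To make the windowing idea more concrete, here is a minimal sketch of an overlapping sliding window that flags and merges "hot" windows. It is not the tool's actual implementation: the event layout (timestamp and CPU time in milliseconds), the window width, the step, and the threshold `factor` are all assumptions chosen for illustration, and the single-event regions and call-stack tree described above are left out.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp in ms, CPU time consumed in ms).
Event = Tuple[float, float]


def window_stats(events: List[Event], start: float, width: float) -> Tuple[float, int]:
    """Return (CPU usage fraction, event count) for the window [start, start + width)."""
    inside = [cpu for ts, cpu in events if start <= ts < start + width]
    return sum(inside) / width, len(inside)


def regions_of_interest(events: List[Event], width: float = 100.0,
                        step: float = 50.0, factor: float = 2.0) -> List[Tuple[float, float]]:
    """Slide overlapping windows over time-sorted events and merge the hot ones.

    A window is treated as hot when both its CPU usage and its event rate
    exceed `factor` times the averages over the whole trace (an assumed rule).
    """
    if not events:
        return []
    first, last = events[0][0], events[-1][0]
    total_span = max(last - first, width)
    overall_usage = sum(cpu for _, cpu in events) / total_span
    overall_rate = len(events) / total_span  # events per ms

    regions: List[Tuple[float, float]] = []
    start = first
    while start <= last:
        usage, count = window_stats(events, start, width)
        if usage > factor * overall_usage and count / width > factor * overall_rate:
            end = start + width
            if regions and start <= regions[-1][1]:
                # Overlaps the previous hot window: extend that region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
        start += step
    return regions
```

With a step smaller than the width, neighbouring windows overlap, so a burst that straddles a window boundary is still caught by the next window and merged into a single region.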
Below is an example trace of an application with high CPU usage at idle times. From the figure (a snapshot taken from WPA) we see periodic peak regions of high CPU usage. Events inside those regions occurred at an average interval of 5 ms. Tracking the functions that are called repeatedly in those regions gives some idea of what may be keeping the CPU busy at idle times and consuming more power.
CPU usage of the whole process of interest (time shown in secs.):
The most CPU time consuming thread (time shown in secs.):
If we run power auto-analyzer for this trace, it will automatically detect the peak regions and find the common function call stacks and other statistics in these regions.
Output Regions from Power Auto-Analyzer:
Process Name: AfterMouse.race (Process ID: 1464) : : Thread ID: 2084
The regions having high CPU usage and the corresponding common call stacks are shown below:
Figure: Common call stacks at different peak regions of an idle analysis trace
The table above shows the common call stacks for the detected peak regions. “No common stack” indicates that no match was found in the call stacks of the events inside a region, starting from the root function. The user may now look at the functions being called repeatedly in those high CPU usage regions. The program also generates other useful statistics such as disk IO statistics, so if IO is a bottleneck, the user may detect it by looking at the frequency of disk IO events, IO times, and durations. In the above example the CPU usage peaks are clearly visible, but there are cases where it is hard to spot the problem regions visually. Our algorithm could successfully filter the desired regions in those cases too, by comparing the CPU usage in different regions of a thread against the average CPU usage of that thread and of the whole process.
The algorithm generates three output files:
1) “CommonStacks_HighCPUUsageRegions.csv”: shows potential problem regions grouped by common call stacks, sorted by the average CPU usage of the regions.
2) “Statistic_HighCPUUsageRegion.csv”: shows other statistics of the regions of interest listed in file 1: region start, region end, percentage of the thread's total events that fall in the region, average CPU usage, event count, mean time interval between events, and coefficient of variation of the time intervals between events.
3) “DiskIO_Statistics.csv”: shows the disk IO events statistics of each thread of the process of interest, i.e. IO type, IO initialization time, IO completion time, Time spent in IO, IO path.
These files will provide statistics of high power consumption areas during idle times of an application.
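As an illustration of the last two statistics in file 2, the sketch below computes the mean interval between events and the coefficient of variation of those intervals. The function name is hypothetical, and the use of the population standard deviation is an assumption; the tool may compute it differently.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def interval_stats(timestamps_ms: List[float]) -> Tuple[float, float]:
    """Return (mean interval, coefficient of variation) for time-sorted event timestamps."""
    if len(timestamps_ms) < 2:
        raise ValueError("need at least two events to form an interval")
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    avg = mean(intervals)
    # Coefficient of variation = standard deviation / mean (population std assumed).
    cv = pstdev(intervals) / avg if avg > 0 else 0.0
    return avg, cv
```

A coefficient of variation close to zero means the events are almost perfectly periodic, which is the pattern a timer with a short timeout or a polling loop tends to produce.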
Power auto-analyzer is a free and easy to use tool, which will direct the user toward potential problem areas by analyzing an idle trace file of a particular application. The user may use this information to fix issues in an application which are keeping the CPU busy at idle times and thus resulting in high power usage.
NOTE: Over the first half of 2014 we will be adding more detail on how applications can be augmented to be more “Recovery Aware”. Please subscribe to this page (button at the bottom) to… Read more >
Virtual machine monitors (VMM) emulate most guest access to interrupts and the advanced programmable interrupt controller (APIC) in a virtual environment. They also virtualize all… Read more >
Today almost everybody takes a tablet or smartphone everywhere. People take pictures and videos and send them to cloud storage so that they can be accessed anywhere and shared with friends and family. Storing pictures and video clips in the cloud is convenient. However, if someone gains access to your cloud storage accounts, they can view all the files you stored, so you must encrypt the files in order to secure them. In this blog I will discuss using Boxcryptor* to encrypt your files in Google Drive* cloud storage, show how encrypting the files affects the performance and battery life of tablets and smartphones, and explain how this might affect your decision to encrypt data in the cloud.
What is Google Drive?
Google Drive is a cloud storage system that allows users to store their music, pictures, videos and other files so that they can be accessed anywhere. More information about Google Drive can be found here.
What is BoxCryptor?
Boxcryptor is encryption software used with cloud storage services such as Google Drive, Microsoft SkyDrive, and Dropbox. Boxcryptor was chosen due to its popularity and cross-platform availability.
More information about Boxcryptor and how to get it can be found here.
Performance and Power Tests
I made a 10-minute video clip using an Android* tablet. The testing procedure follows:
1) We used the Xoom to upload the video clip to Google Drive and download it again, recorded how long each transfer took, and collected the power data.
2) We ran Boxcryptor to encrypt the video clip, then repeated the process in step one.
Motorola* tablet: Xoom* with Android version 4.1.2
Intel internal software development platform – Ultrabook i7-3667U CPU at 2.00GHz, 4GB RAM, 120GB SSD with Microsoft* Windows 8.0
Software on Tablet:
Google Drive version 220.127.116.11
Boxcryptor version 2.0.402.16
Note: All software was downloaded from the Google Play Store.
Picture Size: 256×144
Frame Rate: 23.97fps
Audio Codec: aac
Video Codec: mpeg4
Video Bit Rate: 504kbps
Audio Bit Rate: 96kbps
Cloud Storage: Google Drive
Used a stopwatch to measure how long it took to upload and download the file.
Used the following script to collect power data on the Xoom while uploading and downloading the file:
The power data was collected by reading the uevent file every 10 seconds. Be aware that reading this file more often will affect the power measurement. The data was collected over a TCP/IP connection rather than a USB cable, because the USB cable charges the device and skews the data. Use the following commands to collect power data through tcpip:
adb tcpip 5555
adb connect <tablet-ip-address>
The power data collected will look something like this:
POWER_SUPPLY_STATUS Not charging
POWER_SUPPLY_STATUS Not charging
POWER_SUPPLY_STATUS Not charging
We are interested only in the average current, POWER_SUPPLY_CURRENT_AVG, and the voltage, so that we can calculate the average energy consumption. Note that the Xoom did not provide a counter for the average voltage; it provided the instantaneous voltage, POWER_SUPPLY_VOLTAGE_NOW, instead, so we averaged the instantaneous voltage readings to obtain the average voltage. The average current is reported in microamperes (uA) and the instantaneous voltage in microvolts (uV). Other tablets might provide a counter for the average voltage.
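A minimal sketch of how the collected snapshots could be reduced to an average current and voltage is shown below. It is not the script used for these measurements. It assumes each snapshot is the text of one uevent read, with one field per line written either as KEY=VALUE or as KEY followed by a space and the value; the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def parse_uevent(text: str) -> Dict[str, str]:
    """Parse one uevent snapshot into a dict of field name to raw string value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Accept "KEY=VALUE" as well as "KEY VALUE" (assumed formats).
        sep = "=" if "=" in line else " "
        key, _, value = line.partition(sep)
        fields[key] = value.strip()
    return fields


def average_readings(snapshots: List[str]) -> Dict[str, float]:
    """Average the current (uA to mA) and instantaneous voltage (uV to V) over all snapshots."""
    parsed = [parse_uevent(s) for s in snapshots]
    # The sign convention for discharge current varies by device; the magnitude
    # is taken here as an assumption.
    current_ua = mean(abs(int(p["POWER_SUPPLY_CURRENT_AVG"])) for p in parsed)
    voltage_uv = mean(int(p["POWER_SUPPLY_VOLTAGE_NOW"]) for p in parsed)
    return {"current_ma": current_ua / 1000.0, "voltage_v": voltage_uv / 1_000_000.0}
```

Converting to milliamperes and volts up front means that multiplying the two averages gives power directly in milliwatts.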
The power was calculated as follows:
Average Power = (Average voltage) * (Average current)
The energy consumption was calculated as follows:
Average Energy = (Average power) * Time
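The two formulas can be written out as a short sketch. The function names and the choice of units (volts, milliamperes, seconds, milliwatt-hours) are assumptions made for illustration; the example figures are the upload measurements reported in this post.

```python
def average_power_mw(avg_voltage_v: float, avg_current_ma: float) -> float:
    """Average power in mW from average voltage (V) and average current (mA)."""
    return avg_voltage_v * avg_current_ma


def energy_mwh(avg_power_mw: float, seconds: float) -> float:
    """Energy in mWh from average power (mW) and elapsed time (s)."""
    return avg_power_mw * seconds / 3600.0


# Upload figures reported in this post: average power and transfer time.
encrypted = energy_mwh(3391.38, 164)     # roughly 154.5 mWh
unencrypted = energy_mwh(3215.08, 118)   # roughly 105.4 mWh
print(f"extra energy for the encrypted upload: {encrypted - unencrypted:.2f} mWh")
```

Dividing by 3600 converts milliwatt-seconds to milliwatt-hours, which makes the result directly comparable with a battery capacity rated in mWh.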
From the table we can see that uploading the encrypted file took 164 seconds, compared with 118 seconds for the un-encrypted version of the same file.
From the table we can see that uploading the encrypted file drew an average of 3391.38 mW, compared with 3215.08 mW for the un-encrypted version of the same file.
From the table we can see that uploading the encrypted file consumed 154.49 mWh, compared with 105.38 mWh for the un-encrypted version of the same file.
In this case, we can see that it takes an extra 46 seconds to upload and an extra 3 seconds to download an encrypted file. Similarly, it consumes an additional 49.11 mWh to upload and 2.49 mWh to download an encrypted file using Boxcryptor. To make this easier to understand, let’s convert the energy into battery life. The tablet battery is rated at 3950 mAh at 3.7 VDC, or 14615 mWh. Assume that this battery lasts up to 10 hours during normal operation. Each upload of the encrypted file then reduces the battery life by 0.0336 hours, or about 2 minutes, compared with the un-encrypted file. Similarly, each download of the encrypted file reduces the battery life by 0.0017 hours, or about 0.1 minutes.
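The battery-life conversion above can be checked with a few lines. The helper name is hypothetical, and the calculation simply scales the assumed 10-hour runtime by the fraction of the rated capacity that the extra energy represents.

```python
BATTERY_MWH = 3950 * 3.7   # rated capacity: 3950 mAh at 3.7 V = 14615 mWh
RUNTIME_HOURS = 10.0       # assumed battery life under normal operation


def battery_life_lost_hours(extra_energy_mwh: float) -> float:
    """Hours of runtime given up for the stated amount of extra energy."""
    return extra_energy_mwh / BATTERY_MWH * RUNTIME_HOURS


for label, extra in (("upload", 49.11), ("download", 2.49)):
    hours = battery_life_lost_hours(extra)
    print(f"{label}: {hours:.4f} h ({hours * 60:.1f} min)")
```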
Encrypting files stored in cloud storage prevents unauthorized persons from looking at your data. Fortunately, there are multiple applications on both Windows and Android that let you encrypt data on your mobile clients and access it from other clients. However, uploading encrypted files to cloud storage increases the time it takes to transfer the data, which means more power consumed and a shorter battery life for your mobile device. Although there are other factors that we have not tested, based on the data we have seen, we recommend that you consider encrypting your more sensitive documents (as opposed to everything).
In today’s world, many applications, in one way or another, involve graphics. High-resolution graphics and game applications may require a huge amount of disk space and memory to store graphics data. Half-precision floating-point format can reduce the amount of graphics data and the memory bandwidth an application requires; however, in this context it is used only to store data, not to operate on it. In order to perform operations on such data, a half-precision floating-point value needs to be converted back to a single-precision floating-point value. This blog talks about where the half-precision floating-point format is used and how Intel has introduced new half-precision floating-point (float 16) conversion instructions that optimize the half-to-single and single-to-half conversion processes.
What is Half-Precision Floating-Point Format?
Half precision floating point is a 16-bit binary floating-point format. It is half the size of traditional 32-bit single precision floats. More information about half-precision floating-point format can be found at .
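For readers who want to see what the 16 bits hold, the sketch below decodes a half-precision bit pattern by hand, assuming the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 fraction bits), and compares the storage size with a 32-bit float using Python's struct module. The function name is hypothetical.

```python
import struct


def decode_half(bits: int) -> float:
    """Decode a 16-bit IEEE 754 binary16 pattern into a Python float."""
    sign = -1.0 if bits >> 15 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1.
        return sign * fraction * 2.0 ** -24
    if exponent == 0x1F:
        # All-ones exponent encodes infinity or NaN.
        return sign * float("inf") if fraction == 0 else float("nan")
    return sign * (1 + fraction / 1024.0) * 2.0 ** (exponent - 15)


# A half occupies 2 bytes, a single-precision float 4 bytes.
print(struct.calcsize("<e"), struct.calcsize("<f"))
# Storing 0.1 as a half loses precision.
print(struct.unpack("<e", struct.pack("<e", 0.1))[0])
```

The second print shows the cost of the smaller format: a value such as 0.1 is rounded to the nearest representable half-precision number.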
Where is Half-Precision Floating-Point Format Useful?
This format is used in many graphics environments, such as OpenEXR, JPEG XR, and OpenGL.
OpenEXR is a high dynamic-range (HDR) image file format developed by Industrial Light & Magic for use in computer imaging applications. OpenEXR was used in movies such as Harry Potter and the Sorcerer's Stone and Men in Black II. More information about OpenEXR can be found at .
JPEG XR, per Wikipedia, is a still-image compression standard and file format for continuous tone photographic images, based on technology originally developed and patented by Microsoft* under the name HD Photo (formerly Windows Media Photo). More information about JPEG XR can be found at .
OpenGL is the cross-platform application program interface for defining 2-D and 3-D graphic images. Before OpenGL, any company developing a graphical application typically had to rewrite the graphics part of it for each operating system. Since OpenGL is cross-platform, an application can create the same effects in any operating system using any OpenGL-adhering graphics adapter. More information about OpenGL can be found at .
Use Cases for Half-Precision Floating-Point Format
In this section, we will talk about how half-precision floating-point format can be used in digital imaging applications like Computed Tomography (CT) scan. CT, also known as Computed Axial Tomography (CAT), is an x-ray procedure. Multiple images are taken during a CAT scan, and a computer reconstructs them into complete, cross-sectional pictures (“slices”) of soft tissue, bone and so on. More information about CT scanning can be found at .
CT has four major steps:
1) Scanning to generate images in memory
2) Saving images to disk
3) Loading images to memory
4) Reconstructing based on images.
By utilizing half-precision floating-point format in steps 2 and 3, the required disk space and memory bandwidth, respectively, are each cut in half. Step 4 has three major sub-steps: convolution, matrix transpose, and backprojection. Backprojection is the main step in reconstructing images. Here we are concerned only with the backprojection step, since it involves loading images and computing on them. As images are loaded from the disk into memory, they are still in half-precision floating-point format. In the convolution step, after the load, they need to be converted back to single-precision (32-bit) floating-point format before they can be reconstructed. The backprojection step is computationally very intensive. More information about backprojection can be found at .
In order to speed up the conversion processes, Intel® introduced new instructions in recent generations of Intel® processors.
Intel® Half-Precision Floating-Point Format Conversion Instructions
Newer Intel® processors, such as the Intel® Xeon® processor E5-2600 v2 family, have two new instructions to convert half-precision (16-bit) data to single-precision (32-bit) data and vice versa.
VCVTPS2PH: Converts data in single-precision floating-point format to half-precision floating-point format.
VCVTPH2PS: Converts data in half-precision floating-point format to single-precision floating-point format.
More information about these instructions can be found at  and 
In order to recognize which Intel® processors support these instructions, execute the instruction CPUID  with register EAX set to 1. If bit 29 of the value in register ECX is 1 then the processor supports these instructions.
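The bit test itself is simple, as the sketch below shows. Reading the ECX register requires executing CPUID from native code or a helper library, which is not shown here; the function name and the sample register values are made up for illustration.

```python
F16C_BIT = 29  # bit position in ECX after CPUID with EAX = 1


def supports_half_conversion(ecx: int) -> bool:
    """Return True if bit 29 of the given ECX value is set."""
    return bool((ecx >> F16C_BIT) & 1)


# Hypothetical ECX values, one with bit 29 set and one without.
print(supports_half_conversion(0x20000000))  # True
print(supports_half_conversion(0x1FFFFFFF))  # False
```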
The two new instructions are assembly language instructions, and not all applications use assembly language. Therefore, Intel also provides equivalent intrinsic functions that can be used from C/C++. They are:
Converting from single precision to half precision
_mm256_cvtps_ph (for 256-bit vector)
_mm_cvtps_ph (for 128-bit vector)
Converting from half precision to single precision
_mm256_cvtph_ps (for 256-bit vector)
_mm_cvtph_ps (for 128-bit vector)
In the CT case above, to use the intrinsics we first use the 128-bit load intrinsic, _mm_load_si128, to load 8 half-precision values, and then use _mm256_cvtph_ps to convert them to 8 single-precision values for the computation. After the computation, we use _mm256_cvtps_ph to convert the results back to half-precision values and _mm_store_si128 to store them to memory, from where they can be written to disk.
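The load, convert, compute, convert, store flow can be emulated in software to see the data movement. The sketch below does not use the Intel intrinsics; it uses Python's struct module with the 'e' (half-precision) format as a stand-in, computes in Python's double precision rather than single precision, and uses a hypothetical scaling step in place of the real backprojection arithmetic.

```python
import struct
from typing import Iterable, Tuple


def load_halfs(buf: bytes) -> Tuple[float, ...]:
    """Stand-in for load + half-to-single: unpack 8 half-precision values from 16 bytes."""
    return struct.unpack("<8e", buf)


def store_halfs(values: Iterable[float]) -> bytes:
    """Stand-in for single-to-half + store: pack 8 values into 16 bytes of half precision."""
    return struct.pack("<8e", *values)


def scale_block(buf: bytes, gain: float) -> bytes:
    """Widen a 16-byte block, apply a placeholder computation, and narrow it again."""
    values = load_halfs(buf)
    return store_halfs(v * gain for v in values)


block = store_halfs([1.0, 0.5, -2.0, 0.25, 8.0, 16.0, 100.0, 0.0])
print(len(block))                           # 16 bytes, one 128-bit vector
print(load_halfs(scale_block(block, 2.0)))
```

Eight half-precision values fill exactly one 128-bit vector, and they widen to eight single-precision values that fill one 256-bit vector, which is why the 128-bit load pairs with the 256-bit conversion.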
Details on how to use these instructions can be found at ,  and .
Utilizing half-precision floating-point format cuts the size of the data stored to disk in half. Note that half-precision floating-point format is useful for applications that can tolerate some loss of precision due to the conversion between half precision and single precision. Intel’s new half-precision floating-point conversion instructions help speed up the conversion from half precision to single precision and vice versa.
 Intel® 64 and IA-32 Architectures Optimization Reference Manual
In my blog about the most common pitfalls in analyzing application power consumption (http://software.intel.com/en-us/forums/showthread.php?t=106174&o=a&s=lr) I talked about potential issues that could drive power consumption to higher amounts. C-states are states when the CPU has reduced or turned off selected functions. Different processors support different numbers of C-states in which various parts of the CPU are […] Read more >
Introduction There are many blogs and articles on the internet that discuss analyzing an application to figure out ways to reduce power consumption. Finding where an application consumes power can be very challenging, especially with developers who are new to power optimization. In this blog I am going to talk about how to quickly determine, […] Read more >