Eliminate the dreaded “clocksource is unstable” message: Switch to TSC for a stable clock source option for Linux when using an Intel® Xeon Phi™ Coprocessor.

Introduction
Applications, device drivers and many kernel subsystems on Linux all need to keep track of time. For example, an application might ask for user input but wait only 5 seconds before continuing with a default value. An NFS client in the kernel might set a timer before performing a remote procedure call (RPC) so that it can detect a timeout and retransmit if it does not hear back from the NFS server in time. In addition to timeouts, applications use system calls like gettimeofday() to get wall clock time. All of this falls under the generic “time keeping architecture” of Linux. This architecture includes two central structures used by the kernel to provide such services: clocksources and clock events. In this article, I will discuss what clocksources are, how they work and why they become unstable. I will also talk about the clocksources available on the Intel® Xeon Phi™ coprocessor, some issues observed and the best ways to obtain wall clock time for performance-sensitive codes.
Okay, so what’s a clocksource?
A clocksource, as defined in the Linux kernel, is simply a counter that increments monotonically at a known fixed frequency. The counter can be provided by software or by a specialized piece of hardware like a real time clock (RTC). If you’re running Linux, one quick way to find out what clocksources your system has is to run the two commands shown below. available_clocksource lists all the clocksources your system supports and current_clocksource is the one currently chosen for your timekeeping needs. As an example, my development system lists Time Stamp Counter (TSC), High Precision Event Timer (HPET) and ACPI_PM as available clocksources, and the kernel chose TSC as its preferred clocksource (how this happens is discussed in a bit).

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

Similarly, the output from my Intel® Xeon Phi™ coprocessor card shows:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc micetc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

A couple of quick observations can be made from comparing the two. First, the Intel® Xeon Phi™ coprocessor doesn’t support an HPET, PIT, RTC or ACPI_PM (no HW!). Second, there is a new clocksource called “micetc” on the card. What is micetc? To understand the nuances of micetc and other clocksources, we must first understand what a generic clocksource looks like.
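If you would rather make this check from a program than from the shell, the following is a minimal user-space sketch in C. It assumes the sysfs paths shown above; the helper names read_sysfs_line and list_has_clocksource are made up for this illustration and are not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of a sysfs file into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';     /* strip the trailing newline */
    return 0;
}

/* Hypothetical helper: return 1 if name appears as a whole word in a
 * space-separated list such as "tsc hpet acpi_pm", otherwise 0. */
static int list_has_clocksource(const char *list, const char *name)
{
    assert(list != NULL && name != NULL && *name != '\0');

    size_t n = strlen(name);
    const char *p = list;

    while (*p) {
        size_t tok = strcspn(p, " ");   /* length of the current token */
        if (tok == n && strncmp(p, name, n) == 0)
            return 1;
        p += tok;
        p += strspn(p, " ");            /* skip the separators */
    }
    return 0;
}
```

A caller would pass /sys/devices/system/clocksource/clocksource0/available_clocksource to read_sysfs_line and then ask list_has_clocksource whether “micetc” is in the result. The whole-word match matters here: a plain substring search for “etc” would wrongly succeed on “micetc”.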
How is a clocksource used in the Linux kernel?
To understand how the kernel uses clocksources, we can start by examining what a clocksource looks like (see include/linux/clocksource.h in the kernel source tree if you have it handy). Three fields in the structure are of interest for this discussion, namely name, rating and a function pointer called read.

struct clocksource {
        /*
         * First part of structure is read mostly
         */
        char *name;
        struct list_head list;
        int rating;
        cycle_t (*read)(struct clocksource *cs);
        int (*enable)(struct clocksource *cs);
        void (*disable)(struct clocksource *cs);
        cycle_t mask;
        u32 mult;
        u32 shift;
        u64 max_idle_ns;
        unsigned long flags;
        ….
};

Name is just a human-readable string that the driver of the counter chooses to identify itself with. For example, if I were to invent a piece of HW that could count up at a fixed frequency very reliably and if I wrote a driver for it, I would call it “veryreliable” and it would show up that way in the available_clocksource list above. Read is simply a pointer to a function that allows the generic kernel to read the current value of the counter anytime it likes. The steps needed to read the counter (for example, set bit 13, followed by bit 4, followed by a bit in a control register) are hidden from the generic kernel and are implemented in the driver supporting the clocksource. If this were a counter implemented in software, read might just return the current value of a variable when called. Like name, rating is simply a number assigned by the driver writer, but it has a special meaning to the kernel. The kernel uses the rating value to select the best clocksource among the many it has seen. Given its importance to the kernel, there are some guidelines around rating:

Clocksources that are very bad and should be selected only if nothing better is available or during boot are given a rating between 1 and 99 (see clocksource_jiffies in kernel/time/jiffies.c for an example of such a clocksource).
Clocksources that are okay, but aren’t preferred are given a rating between 100 and 199.
Clocksources that are good, meaning correct and usable, are given a rating between 200 and 299.
Clocksources that are fairly fast and accurate are given a rating between 300 and 399. For example, tsc, found on most Intel architectures, has a rating of 300 given its reliability and accuracy (see arch/x86/kernel/tsc.c and look for clocksource_tsc).
Lastly, clocksources that are perfect and should be selected as the clocksource of choice are rated between 400 and 499.

Providers of clocksources register themselves (or their clocksources) with the kernel by calling clocksource_register() (see kernel/time/clocksource.c) and install their clocksource structure when the kernel is booting up. The kernel does a few simple things when a clocksource is registered:

Inserts the clocksource structure in a simple linked list but keeps the list sorted by the rating field mentioned earlier. The highest rated clocksource is always at the head of the list. 
Designates the registering clocksource as a potential candidate for promotion to “watchdog” status if it does not have the “CLOCK_SOURCE_MUST_VERIFY” bit (all flags are defined in include/linux/clocksource.h) set in its flags field. If the bit is set, the clocksource provider is essentially saying “you have to verify that this clocksource is stable and accurate” and the kernel says “okay, then I will not make you the watchdog”. If the bit is not set, this clocksource is picked as the watchdog (explained below) unless a watchdog of higher rating has already been picked.

Finally, at the end of the boot process, the kernel goes through the sorted list of clocksources built in the first step above and picks the highest rated clocksource as the current_clocksource. Note: there are ways to override this selection, but that’s a whole different topic.
One example that might be interesting here is the tsc clocksource (the structure is called clocksource_tsc and is defined in arch/x86/kernel/tsc.c). Notice that the kernel adds CLOCK_SOURCE_MUST_VERIFY to the flags field because it needs to ensure that tsc is reliable before it can be trusted. Also notice that this flag is not set for clocksource_jiffies in kernel/time/jiffies.c.
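The registration steps described in this section can be modeled in a few lines of user-space C. This is only a toy sketch of the idea, not the kernel’s code: the toy_ names, the CS_MUST_VERIFY stand-in for CLOCK_SOURCE_MUST_VERIFY and the hand-rolled singly linked list are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CS_MUST_VERIFY 0x1UL    /* stand-in for CLOCK_SOURCE_MUST_VERIFY */

struct toy_clocksource {
    const char *name;
    int rating;
    unsigned long flags;
    struct toy_clocksource *next;
};

struct toy_registry {
    struct toy_clocksource *head;       /* sorted, highest rating first */
    struct toy_clocksource *watchdog;   /* best source not needing verification */
};

/* Toy model of registering a clocksource. */
static void toy_register(struct toy_registry *r, struct toy_clocksource *cs)
{
    assert(r != NULL && cs != NULL);

    /* Step 1: insert while keeping the list sorted by descending rating. */
    struct toy_clocksource **pp = &r->head;
    while (*pp && (*pp)->rating >= cs->rating)
        pp = &(*pp)->next;
    cs->next = *pp;
    *pp = cs;

    /* Step 2: a source that does not ask to be verified becomes the
     * watchdog if no higher-rated watchdog has been picked yet. */
    if (!(cs->flags & CS_MUST_VERIFY) &&
        (!r->watchdog || cs->rating > r->watchdog->rating))
        r->watchdog = cs;
}

/* End of boot: the head of the sorted list is the current clocksource. */
static struct toy_clocksource *toy_select(const struct toy_registry *r)
{
    assert(r != NULL);
    return r->head;
}
```

Registering a verified-only tsc (rating 300) next to a jiffies source that does not carry the flag leaves tsc at the head of the list and jiffies as the watchdog, which is exactly the situation that matters later in this article.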
Very nice, but where does all of this get used?
All of us have used gettimeofday() at some point; it basically returns the “wall clock” time as well as the time zone. The implementation of gettimeofday() uses the current_clocksource to keep track of the passage of time (refer to do_gettimeofday() in kernel/time/timekeeping.c). If you follow the code you’ll notice the kernel calls the “read” routine of the current_clocksource to get its counter value. This value is then converted to a time base. If the current_clocksource becomes unstable (why? see below), the kernel will call into the timekeeping architecture (see the call to timekeeping_notify() in kernel/time/clocksource.c) to tell it about the change to the current_clocksource.
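The conversion from a counter value to a time base uses the mask, mult and shift fields of the structure shown earlier: the elapsed cycle count is masked to the counter width, multiplied by mult and shifted right by shift. Here is a small user-space sketch of that arithmetic. The function names are mine, and mult_for_freq is a simplified assumption about how a driver might derive mult from a frequency, not the kernel’s own calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two reads; the mask handles counter wraparound
 * for counters narrower than 64 bits. */
static uint64_t cyc_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/* Convert a cycle count to nanoseconds: ns = (cycles * mult) >> shift. */
static uint64_t cyc_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    assert(shift < 64);
    return (cycles * (uint64_t)mult) >> shift;
}

/* Hypothetical helper: choose mult so that one cycle of a counter
 * running at freq_hz maps to the right number of nanoseconds. */
static uint32_t mult_for_freq(uint64_t freq_hz, uint32_t shift)
{
    assert(freq_hz > 0 && shift < 32);
    uint64_t m = ((uint64_t)1000000000 << shift) / freq_hz;
    assert(m <= UINT32_MAX);            /* must fit the u32 mult field */
    return (uint32_t)m;
}
```

For a 1 MHz counter and a shift of 10, mult comes out as 1024000, so 5 cycles convert to (5 * 1024000) >> 10 = 5000 ns. Using a multiply and a shift instead of a division keeps the read path cheap, which matters because this runs on every gettimeofday() call.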
So what is a watchdog, what purpose does it serve and why do clocksources become “unstable”?
As noted earlier, a watchdog is just another clocksource, one considered reliable enough that it can be used to check on all the other registered clocksources periodically. This is how it works: the kernel schedules a function to run every 50 ticks (500 ms if the kernel is compiled with HZ=100, i.e. a 10 ms tick). That function (see clocksource_watchdog() in kernel/time/clocksource.c) reads the counter value of the watchdog and converts it to a time base, which it can do because it knows the frequency of the watchdog clocksource. Then it walks the list of all registered clocksources that need to be verified (like clocksource_tsc in our example above) and does the same thing for each, i.e. reads the clocksource’s counter value and converts it to a time base. The function then compares the time obtained from the clocksource with that of the watchdog to see if the two are close. If they are, all is well and we move on to the next clocksource in the list. If they are off by more than a certain threshold (WATCHDOG_THRESHOLD), the clocksource is marked “unstable”. The kernel then reduces its rating and selects another clocksource as the new current_clocksource. If and when this happens, a function called clocksource_unstable() is executed and you should see something like the following in your dmesg log:

[   46.370696] Clocksource tsc unstable (delta = 508518207 ns)
[   46.371904] Switching to clocksource jiffies

What this means is that the kernel has decided that the clocksource (tsc in the above example) is unfit to be a good clocksource (given the drift) and is therefore switching to the next best clocksource in the list of available clocksources (remember the linked list from above?).
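The comparison at the heart of the watchdog can be sketched as follows. This is a simplified user-space illustration, not the kernel function. In particular, the 62.5 ms value of TOY_THRESHOLD_NS is an assumption made for this sketch; check WATCHDOG_THRESHOLD in your own kernel source for the real value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed threshold for this sketch only: 62.5 ms in nanoseconds. */
#define TOY_THRESHOLD_NS 62500000LL

/* Both arguments are the time elapsed since the previous watchdog run,
 * one measured with the clocksource under test and one with the watchdog.
 * Returns 1 if the two disagree by more than the threshold. */
static int is_unstable(int64_t cs_elapsed_ns, int64_t wd_elapsed_ns)
{
    assert(cs_elapsed_ns >= 0 && wd_elapsed_ns >= 0);

    int64_t diff = cs_elapsed_ns - wd_elapsed_ns;
    if (diff < 0)
        diff = -diff;                   /* only the magnitude matters */
    return diff > TOY_THRESHOLD_NS;
}
```

A disagreement of the size reported in the log above (508518207 ns) is far beyond this threshold, so is_unstable() would return 1 for it. Note that the check is symmetric: it cannot tell which of the two counters is wrong, only that they disagree.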
Some special considerations with TSC
The timestamp counter (tsc) is fundamental to the architecture in that its counter value is simply the CPU’s “clock tick”. It is a 64-bit counter that increments monotonically. It has a very high resolution. It is synchronous across all the cores on an Intel® Xeon Phi™ coprocessor. Its counter value is relatively easy to read via the rdtsc instruction, which can be executed in ring 3 too. All in all, it is a great clocksource except for one problem on the Intel® Xeon Phi™ coprocessor: it does not support constant and non-stop TSC (called invariant TSC). Invariant just means that TSC runs at a constant rate in all ACPI P, C and T states. Modern Intel architecture CPUs support this behavior (see the Intel® 64 and IA-32 Architectures Software Developer’s Manual) but the Intel® Xeon Phi™ coprocessor does not. On processors with this behavior the OS may use TSC as a wall clock timer reliably (instead of relying on HPET or ACPI clocksources). What all this means on the Intel® Xeon Phi™ coprocessor is that, in order to get accurate timing results with tsc as your current_clocksource, you have to disable all power management and frequency changes while running your applications (but there is a better way, discussed later).
The Linux kernel checks during boot, by executing the cpuid instruction, whether the underlying architecture actually supports invariant TSC (see the check for X86_FEATURE_NONSTOP_TSC in drivers/idle/intel_idle.c). If it detects that the architecture does not support non-stop TSC, it calls mark_tsc_unstable().
Even if you disable power management as described above, a few people have reported seeing the “clocksource is unstable” message. As it turns out, tsc is not the culprit at all. When micetc is disabled (“etc_off” on the PowerManagement command line), the kernel picks jiffies to be the watchdog because tsc has CLOCK_SOURCE_MUST_VERIFY in its definition and therefore isn’t allowed to be the watchdog, as we learned earlier. Jiffies is a simple counter that is incremented with every timer tick. But in some cases, interrupts are disabled for “long” periods of time, causing the kernel to miss updates to jiffies. Later, when the kernel runs its periodic watchdog function and compares tsc to the watchdog (clocksource_jiffies in this case), it notices that the two disagree, declares tsc unstable and disables it as seen earlier. Clearly tsc is not buggy! We would be better off using micetc as the watchdog since it doesn’t depend on the periodic timer tick interrupt to keep its internal counter up-to-date.
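A little arithmetic shows how quickly lost ticks add up. The sketch below is purely illustrative (the function name and the simple model are mine): it computes how far a correctly running tsc appears to be ahead of a jiffies watchdog that missed some of its tick updates.

```c
#include <assert.h>
#include <stdint.h>

/* real_elapsed_ns: true elapsed time, which a healthy tsc reports.
 * tick_ns:         length of one timer tick (10 ms when HZ=100).
 * missed_ticks:    ticks lost while interrupts were disabled.
 * Returns the apparent disagreement between tsc and jiffies in ns. */
static int64_t apparent_skew_ns(int64_t real_elapsed_ns, int64_t tick_ns,
                                int64_t missed_ticks)
{
    assert(real_elapsed_ns >= 0 && tick_ns > 0 && missed_ticks >= 0);

    int64_t fired = real_elapsed_ns / tick_ns;  /* ticks that should count */
    assert(missed_ticks <= fired);

    int64_t counted = fired - missed_ticks;     /* what jiffies actually saw */
    return real_elapsed_ns - counted * tick_ns;
}
```

With a 10 ms tick, losing just 7 ticks in a 500 ms watchdog interval makes tsc look 70 ms fast, even though it is jiffies that fell behind. The watchdog has no way to know that, so the blame lands on tsc.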
Gettimeofday() and the Elapsed Time Counter (ETC)
I mentioned the micetc clocksource in the earlier part of this article. The etc is essentially a frequency-invariant counter of very high resolution and is another great clocksource in the Intel® Xeon Phi™ coprocessor. So why not use micetc as the clocksource of choice? Well, remember that gettimeofday(), when invoked by an application, results in a call to the read routine of the current_clocksource. If the current_clocksource is set to micetc, reading the counter value isn’t a simple instruction like rdtsc. It involves making a transition into ring 0, followed by a couple of MMIO read operations (uncached reads) to get the value of the 64-bit counter from hardware. In other words, it is a lot more expensive to read the counter of micetc than that of tsc. Additionally, if more than one thread of execution calls gettimeofday() at the same time, the counter reads mentioned earlier are serialized, resulting in even more expensive gettimeofday() calls. Given this, tsc is preferred over micetc as the current_clocksource, but we already mentioned that tsc is not invariant on the Intel® Xeon Phi™ coprocessor, so what do we do?
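If you want to see the difference on your own system, a rough way is to time a large number of back-to-back gettimeofday() calls under each clocksource. The following is a minimal sketch of such a measurement; the function name is mine, and a single-threaded loop like this does not capture the serialization cost described above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Average cost of one gettimeofday() call in microseconds, measured by
 * timing a loop of calls with gettimeofday() itself. */
static double gettimeofday_cost_us(long iterations)
{
    assert(iterations > 0);

    struct timeval start, end, tmp;

    gettimeofday(&start, NULL);
    for (long i = 0; i < iterations; i++)
        gettimeofday(&tmp, NULL);
    gettimeofday(&end, NULL);

    double elapsed_us = (double)(end.tv_sec - start.tv_sec) * 1e6
                      + (double)(end.tv_usec - start.tv_usec);
    return elapsed_us / (double)iterations;
}
```

Run it once with current_clocksource set to tsc and once with micetc, using the same iteration count, and compare the two averages. Use enough iterations (a million or so) that the microsecond resolution of struct timeval does not dominate the result.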
Recommended configuration for clocksources
Now that you know a whole lot about clocksources and how gettimeofday() uses them, you know you want tsc as your clocksource. But given that the Intel® Xeon Phi™ coprocessor lacks support for invariant tsc, Intel® Manycore Platform Software Stack (Intel® MPSS) Gold Update2 and newer versions provide a software solution to make tsc stable. However, this solution relies on clocksource_micetc to re-calibrate clocksource_tsc when certain events, like changes in frequency, occur (since etc runs off of a fixed-frequency clock). Therefore, the recommendation is to enable micetc in the kernel, but to stop the kernel from picking it as the current clocksource by overriding the kernel’s choice via the “clocksource=” parameter.

Add “clocksource=tsc” to the ExtraCommandLine (so even though etc is enabled in the next step, the kernel will override what it thinks needs to be the current_clocksource and choose tsc instead). ExtraCommandLine can be found in the configuration files in the /etc/sysconfig/mic directory if Intel® MPSS is installed.
Check that you have micetc enabled by removing “etc_off” from the PowerManagement line if it is present (it is enabled by default in Intel® MPSS).

You now have everything you need: a high resolution clock in clocksource_tsc that is very efficient to access for gettimeofday(). Additionally, clocksource_micetc provides a reliable watchdog and the support needed to make clocksource_tsc frequency invariant via software!
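To confirm that the override actually reached the kernel, you can look for the “clocksource=” parameter in the kernel command line, which a running Linux system exposes in /proc/cmdline. Below is a small parsing sketch; the function name is hypothetical, and returning the last occurrence when the parameter appears more than once is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the value of the last "clocksource=" parameter found in a kernel
 * command line string into out (truncated to fit len bytes).
 * Returns 1 if the parameter was found, 0 otherwise. */
static int cmdline_clocksource(const char *cmdline, char *out, size_t len)
{
    static const char key[] = "clocksource=";
    const size_t klen = sizeof(key) - 1;

    assert(cmdline != NULL && out != NULL && len > 0);

    const char *p = cmdline;
    int found = 0;

    while (*p) {
        p += strspn(p, " \t\n");                /* skip whitespace */
        size_t tok = strcspn(p, " \t\n");       /* length of this parameter */

        /* Match only parameters that start with the key and have a value. */
        if (tok > klen && strncmp(p, key, klen) == 0) {
            size_t vlen = tok - klen;
            if (vlen >= len)
                vlen = len - 1;
            memcpy(out, p + klen, vlen);
            out[vlen] = '\0';
            found = 1;
        }
        p += tok;
    }
    return found;
}
```

After reading /proc/cmdline into a buffer, a result of “tsc” from this function, together with “tsc” in current_clocksource and “micetc” still present in available_clocksource, indicates the recommended configuration is in effect.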
Conclusions
The Linux kernel running on the Intel® Xeon Phi™ coprocessor supports a few clocksources, most notably the timestamp counter (tsc) and the elapsed time counter (etc). This article described what clocksources exist and how they are implemented and used by gettimeofday(). We also talked about some issues with using tsc as a clocksource and how Intel® MPSS works around them by “leaning on” micetc during power management events like frequency changes to make tsc frequency invariant. Lastly, we described a configuration that provides the best combination of performance and stability and can be easily enabled via a set of kernel command line options.
