RECENT BLOG POSTS

The switch() statement isn't really evil, right?

In my current position, I work to optimize and parallelize codes that deal with genomic data (DNA, RNA, proteins, and so on). To be universally available, many of the input files holding DNA samples (called reads) are text files full of the characters ‘A’, ‘C’, ‘G’, and ‘T’. (It’s not necessary to know much about DNA to understand this post or the code I’m going to describe, but if you want a primer you can try “What is DNA?”) Depending upon the application and the species being studied, data files can contain tens of thousands to millions of reads, each between 50 and 250 characters (bases) long. Rather than devote a full 8-bit character in memory to each base, programmers of genomic applications can compress the four DNA bases into 2 bits apiece, which allows four bases to be stored within a byte.

One of the applications I’ve dealt with recently used the following code (with error detection) to compute the 2-bit conversion of a given base (a char parameter to the function containing this code).

  switch (base) {
    case 'A': case '0': return 0;
    case 'C': case '1': return 1;
    case 'G': case '2': return 2;
    case 'T': case '3': return 3;
  }
  cerr << "error: unexpected character: '"
       << base << "'\n";
  assert(false);
  abort();

(The number characters ‘0’ to ‘3’ are an alternate input notation.) The return values of this function are shifted into position within a byte to format each read (or substring needed) to use 25% of the space that the original input string needed.
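The packing step can be sketched as follows (the function and variable names here are mine, not from the original application, and the conversion is simplified to the four uppercase bases): each 2-bit code is shifted into position so that four consecutive bases share one byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert one base character to its 2-bit code.
// (The original function also accepts '0'-'3' and reports errors;
// this sketch handles only the four uppercase bases.)
static uint8_t base2bits(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return 0xFF;  // unexpected character
}

// Pack four bases into one byte: the first base lands in the two
// highest-order bits, the fourth in the two lowest.
static uint8_t pack4(const char bases[4]) {
    uint8_t byte = 0;
    for (int i = 0; i < 4; ++i)
        byte = (byte << 2) | base2bits(bases[i]);
    return byte;
}
```

For example, `pack4("ACGT")` yields the bit pattern 00 01 10 11, i.e., 0x1B, storing the four-character read in a single byte.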

Upon compiling the above code, I found that the generated assembly contains a series of indirect jumps. Because “random” strings of DNA have no dominant base, branch prediction offers no savings here. (When one of our compiler experts saw the compiled code for this routine, he was horrified.)

So, I looked to find an alternate way to express this computation that would execute in fewer cycles. The solution that I came up with was to substitute the switch() statement with a table lookup as implemented with the following code.

  uint8_t r = b2C[base];
  if (r != 0xFF) return r;
  cerr << "error: unexpected character: '"
       << base << "'\n";
  assert(false);
  abort();

The table, b2C, is declared with 256 elements, 248 of which hold the sentinel value 0xFF. The other eight elements hold the integer values zero to three, at the indices matching the ASCII codes of the characters ‘0’ to ‘3’, ‘A’, ‘C’, ‘G’, and ‘T’.

Here is the most relevant part of the table declaration:

uint8_t b2C[256] = {
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //0
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //1
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //2
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0x00, 0x01, 0x02, 0x03, 0xFF, 0xFF, 0xFF, 0xFF, //3   ‘0’ ‘1’ ‘2’ ‘3’
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0x00, 0xFF, 0x01, 0xFF, 0xFF, 0xFF, 0x02, //4   ‘A’ ‘C’ ‘G’
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0x03, 0xFF, 0xFF, 0xFF, //5   ‘T’
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
. . .
};
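Hand-maintaining 256 literals is error-prone; an alternative (a sketch of mine, not the original code) is to fill the table once at startup from the character-to-code mapping itself:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

uint8_t b2C[256];

// Fill the lookup table at startup: every entry defaults to the error
// sentinel 0xFF, then the eight accepted characters are mapped to
// their 2-bit codes.
static void init_b2C() {
    std::memset(b2C, 0xFF, sizeof b2C);
    const char *codes[4] = {"A0", "C1", "G2", "T3"};
    for (uint8_t v = 0; v < 4; ++v) {
        b2C[(unsigned char)codes[v][0]] = v;  // 'A', 'C', 'G', 'T'
        b2C[(unsigned char)codes[v][1]] = v;  // '0', '1', '2', '3'
    }
}
```

This trades a one-time initialization call for a table declaration that cannot silently drift out of sync with the accepted alphabet.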

It’s really a small change and is less intuitive than the original switch() statement. However, since the function now takes the same amount of time to process an ‘A’ as it does to process a ‘T’ (rather than an average of 4 indirect jumps per base), this modification led to a 1.20X speedup without any other code changes. The authors of the code were amazed that even little changes like this can have significant impact on the execution time of their application. In retrospect the speedup seems almost obvious. With a large portion of the execution dedicated to input and conversion of very large data sets, even small changes to reduce the number of instructions executed should positively impact the run time.
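The effect is easy to sanity-check with a microbenchmark (a sketch of mine; the 1.20X figure above came from the full application, not from this). It times both conversions over the same random base string, since random bases defeat branch prediction just as real reads do:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// The switch-based conversion, simplified to the four bases.
static uint8_t viaSwitch(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return 0xFF;
}

static uint8_t tbl[256];

// Time both conversions over one random base string; returns
// {switch_microseconds, table_microseconds}.
std::pair<long long, long long> benchmark(std::size_t n) {
    for (int i = 0; i < 256; ++i) tbl[i] = 0xFF;
    tbl['A'] = 0; tbl['C'] = 1; tbl['G'] = 2; tbl['T'] = 3;

    // Build a random "read" so the branch predictor gets no help.
    std::mt19937 rng(42);
    const char bases[4] = {'A', 'C', 'G', 'T'};
    std::vector<char> input(n);
    for (char &c : input) c = bases[rng() & 3];

    volatile uint32_t sink = 0;  // keeps the loops from being optimized away
    using clk = std::chrono::steady_clock;
    auto t0 = clk::now();
    for (char c : input) sink = sink + viaSwitch(c);
    auto t1 = clk::now();
    for (char c : input) sink = sink + tbl[(unsigned char)c];
    auto t2 = clk::now();

    auto us = [](clk::time_point a, clk::time_point b) {
        return (long long)std::chrono::duration_cast<
            std::chrono::microseconds>(b - a).count();
    };
    return {us(t0, t1), us(t1, t2)};
}
```

The exact ratio will vary with compiler, flags, and hardware; the point is only that the table version’s cost is uniform across bases.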

I’m not advocating that you go and change all your switch() constructs to table lookups. This was a very special case that fit the mold and yielded good results. Besides, maintainability should always factor into which programming constructs you use. If something like a switch() or a multi-level if-then-else structure makes sense and gets the job done, then use it. Even so, converting to something less obvious can sometimes improve execution time. Be open to different coding practices.

Intel is Number 1 with a Milky Way

No, not the candy bar (though I could really go for a Milky Way Midnight bar right now). I’m thinking of the Milky Way 2 (Tianhe-2) computer system at the National Supercomputing Center in Guangzhou, China. This machine incorporates 32,000 12-core Intel Xeon processors (E5-2600 v2) and 48,000 Intel Xeon Phi coprocessors. It has been declared the fastest computer on the planet (maybe even the galaxy?) as it sits atop the June 2013 TOP500 list of supercomputer systems released at the ISC 2013 conference.

This is the second time a Chinese system has occupied the number one slot on a TOP500 list. The system achieved a computation speed of 33.2 petaflop/s on the Linpack benchmark. The first Chinese system to reach the number one position was Tianhe-1A, which attained 2.57 petaflop/s in November 2010. That system used over 14,000 Intel Xeon processors and over 7,000 NVIDIA Tesla GPUs.

Comparing these two machines shows off the latest trend in achieving top-notch performance in HPC computations: the use of accelerators. There is a 13X increase in the petaflop/s rate over the 31 months between the appearance of Tianhe-1A and Tianhe-2 on the lists. From the “rule of thumb” guidance attributed to Gordon Moore, one might have expected an increase of around 4X. Between the two generations of Tianhe, the number of processors has doubled and the number of accelerators has grown almost 7X. I think that growth in accelerator resources has more to do with the performance difference than the raw increase in processor and core counts does. As we march toward exascale machines, I believe the use of accelerator technology will become more common.

One other point I’d like to note is the energy efficiency of the top machines. The number 2 system on the most recent list, Titan at Oak Ridge National Laboratory, has a power rating of 8.2MW. The Tianhe-2 system uses about twice that much power and achieves almost two times the petaflop/s. I can only imagine that the flops per watt rating of future machines will increase as processor manufacturers release even more power efficient chips in the coming years.

Can I still get an Energy Efficient Free Lunch?

When the semiconductor industry was turning to multicore chips and lowering clock rates, Herb Sutter wrote a seminal article entitled “The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software.” Up to that point software developers relied on the increasing clock speeds (the “free lunch”) to give their software a boost in the next generation [...] Read more >

Using Amdahl’s Law for Energy Efficient Performance Estimation?

While trying to find an answer to my previous question, I stumbled across the paper “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era” (Computer, Dec. 2008, pp. 24-31) by Dong Hyuk Woo and Hsien-Hsin S. Lee (Georgia Institute of Technology). The title had me thinking that this might be an investigation into finding [...] Read more >

MIC: Stepping-stone to Quantum Computing?

I was reading Quantum Computing for Computer Scientists by Noson S. Yanofsky and Mirco A. Mannucci while I was on the treadmill last night. I started out reading the description of Shor’s algorithm (for factoring integers) and thought that implementing this on a classical computer (in parallel, of course) would make an interesting problem for the Intel [...] Read more >