Mining Big Data on Big Clusters (Intel Labs@SC12)

Many of the most promising applications of Big Data, the vast and growing data repositories accumulating across the world, work by scouring millions or billions of interrelated things to discover interesting new relationships. The result could be a new scientific theory, a business insight that reveals a new market opportunity, or connections to people who share a common interest with you across the planet.

One can envision these relationships as a web of interconnected points. Computers see these webs in the form of a data structure called a ‘graph.’ Real world examples include the network of roads connecting cities in a country, the neural connections in your brain, or the Internet itself. A graph can be created from almost any collection of data, with the connections specifying relationships. On social networks, you see graphs in the form of your networks of friends, families, and colleagues. Mining relationships within big graphs is the subject of an Intel Labs paper presented today at SC12.

The more data you add to a graph, the more potential you have to discover insights that are personally, culturally, or financially valuable. The challenge is that today the datasets are growing faster than compute systems can handle.

Imagine you are mapping out the highway system of an unfamiliar country, starting from scratch. Until you get to the next town, you have no idea where the next road will take you. Likewise, as a computer searches a graph, going from node to node in the web, it doesn’t know where in memory it will need to look for the next set of nodes. The best way to do this search efficiently is to keep all of the data together in main memory so anything can be readily retrieved.
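To make the road-map analogy concrete, here is a minimal sketch in Python of breadth-first search, one common way to explore a graph town by town. The `roads` map and the `bfs_levels` function are made up for illustration and are not taken from the SC12 paper.

```python
from collections import deque


def bfs_levels(graph, start):
    """Return the number of hops from start to every reachable node."""
    levels = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        # The neighbors are only known once we "arrive" at the node,
        # so the next memory locations to visit cannot be predicted.
        for neighbor in graph[node]:
            if neighbor not in levels:
                levels[neighbor] = levels[node] + 1
                frontier.append(neighbor)
    return levels


# Hypothetical road map: each town lists the towns it connects to.
roads = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
levels = bfs_levels(roads, "A")
```

Each lookup of `graph[node]` lands wherever that node happens to be stored, which is why the search runs best when the whole graph sits in main memory.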

However, datasets have grown well beyond what can be stored in one computer’s memory. They may require tens, hundreds, or even thousands of computers for applications like cosmology, where the relationships between billions of galaxies are being explored to better understand the evolution and fate of the universe. As such, the next challenge for graph computing is to run across large clusters of computers. Since these graphs span the memory of many systems, searching them means constantly looking for data that resides on another machine. Today, these calculations are severely limited by the amount of information that can flow between the many processors in a cluster at one time.
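A small sketch can show why spreading a graph over several machines creates network traffic. It assumes a hypothetical placement rule (`owner`) that deals node IDs out to machines round-robin, then counts the connections whose two endpoints live on different machines. Real systems use their own partitioning schemes; this is only an illustration.

```python
def owner(node, num_machines):
    # Hypothetical placement rule: node IDs are dealt out round-robin.
    return node % num_machines


def count_remote_edges(edges, num_machines):
    """Count edges whose endpoints are stored on different machines."""
    remote = 0
    for src, dst in edges:
        if owner(src, num_machines) != owner(dst, num_machines):
            remote += 1
    return remote


# Made-up graph given as (source, destination) pairs.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 6)]
remote_on_two = count_remote_edges(edges, 2)
remote_on_one = count_remote_edges(edges, 1)
```

On a single machine no edge is remote; once the same graph is split in two, following some of its edges requires a trip across the network.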

To help address this, Intel Labs has developed new methods to accelerate the computations of large graphs by reducing the amount of data communication required across the cluster’s networks. As described in the SC12 paper, this is accomplished through a collection of techniques to efficiently compress data, eliminate unnecessary transfers, and pipeline computations to make the most of time spent waiting for data to return.
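The paper's actual techniques are not detailed in this post, so the following is only a generic sketch of two of the ideas named above. It assumes node IDs are integers, skips IDs that were already sent (eliminating an unnecessary transfer), and delta-encodes the rest so that mostly small numbers cross the network (a simple form of compression). The function names are hypothetical.

```python
def prepare_message(candidate_ids, already_sent):
    """Drop IDs that were sent before, then delta-encode the rest."""
    fresh = sorted(set(candidate_ids) - already_sent)
    already_sent.update(fresh)
    deltas = []
    previous = 0
    for node_id in fresh:
        # Store the gap to the previous ID instead of the full ID.
        deltas.append(node_id - previous)
        previous = node_id
    return deltas


def decode_message(deltas):
    """Rebuild the original sorted IDs from their gaps."""
    ids = []
    current = 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids


sent = {1000}  # IDs this machine has already transmitted
message = prepare_message([1003, 1000, 1001, 1003, 1010], sent)
decoded = decode_message(message)
```

Small gaps can be stored in fewer bits than full IDs, and a repeated request for the same nodes produces an empty message.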

The results were demonstrated using the Graph500 benchmark, an emerging benchmark used to rate high performance computing systems on their ability to process graphs. Through these improvements, Intel Labs showed that these graph searches can be done more than six times faster, and in a way that is more than eight times more energy-efficient. For both Big Data and HPC this efficiency is critical, as energy and cooling have become major concerns for future datacenters and supercomputers alike.

These results represent one of the most efficient implementations of the Graph500 benchmark, based on the most recent list. More efficient computation will make these graph insights more accessible to a wider array of people, businesses, and scientific institutions.

Click here for more information on all six of Intel’s papers at SC12.

Sean Koehl

About Sean Koehl

Sean Koehl (@smkoehl) is a Vision Strategist for Intel Labs, the global research arm of Intel Corporation. He is responsible for crafting visions of how Intel R&D efforts could impact daily life in the future. He leverages insights from Intel’s technologists, social scientists, futurists, and business strategists to articulate how technology innovations and new user experiences could improve lives and society. Sean received a bachelor’s degree in Physics from Purdue University and launched his career at Intel in 1998. He has worn many hats in his career including those of an engineer, evangelist, writer, creative director, spokesperson, and strategist. He has led a variety of projects and events, authored numerous technology publications and blogs, and holds seven patents. He is based at Intel’s headquarters in Santa Clara, California.
