Disruptive technologies to unlock the power of Big Data

This week’s announcement by Intel that it’s expanding the availability of the Intel® Distribution for Apache Hadoop* software to the US market is seriously exciting for the employees of this semiconductor giant, especially researchers like me.  Why?  Why would I say this given the amount of overexposure that Hadoop has received?  I mean, isn’t this technology nearly 10 years old already?!  Well, because the only thing I hear more than people touting Hadoop’s promise is people venting their frustration with implementing it.  Rest assured that Intel is listening.  We get that users don’t want to make a career out of configuring Hadoop… debugging it… managing it… and trying to figure out why the “insight” it’s supposed to be delivering often looks like meaningless noise.

Which brings me back to why this is a seriously exciting event for me.  With our product teams doing the heavy lifting of making the Hadoop framework less rigid and easier to use while keeping it inexpensive, Intel Labs gets a landing zone for some cool disruptive technologies. In December, I blogged about the launch of our open source, scalable graph construction library for Hadoop, called Intel® Graph Builder for Apache Hadoop software (f.k.a. GraphBuilder), and explained how it makes it easy to construct large-scale graphs for machine learning and data mining. These structures can yield insights from relationships hidden within a wide range of big data sources, from social media and business analytics to medicine and e-science. Today I’ll delve a bit more into Graph Builder technology and introduce the Intel® Active Tuner for Apache Hadoop software, an auto-tuner that uses Artificial Intelligence (AI) to configure Hadoop for optimal performance.  Both technologies will be available in the Intel Distribution.

So, Intel® Graph Builder leverages Hadoop MapReduce to turn large unstructured (or semi-structured) datasets into structured output in graph form.  This kind of graph may be mined using graph search of the sort that Facebook recently announced.  Many companies would like to construct such graphs out of unstructured datasets, and Graph Builder makes it possible.  Beyond search, analysis may be applied to an entire graph to answer questions of the type shown in the figure below.  The analysis may be performed using distributed algorithms implemented in frameworks like GraphLab, which I also discussed in my previous post.

Intel® Graph Builder performs extract, transform, and load operations, terms borrowed from databases and data warehousing.  And, it does so at Hadoop MapReduce scale.  Text is parsed and tokenized to extract interesting features.  These operations are described in a short map-reduce program written by the data scientist.  This program also defines when two vertices (i.e., features) in the graph are related by an edge.  The rule is applied repeatedly to form the graph’s topology (i.e., the pattern of edge relationships between vertices), which is stored via the library.  In addition, most applications require that additional tabulated information, or “network information,” be associated with each vertex/edge and the library provides a number of distributed algorithms for these tabulations.
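To make the parse, relate, and tabulate flow a little more concrete, here is a minimal single-machine sketch in plain Python. It is not the Graph Builder API: the function names, the sample documents, and the edge rule (two tokens are related if they appear in the same document) are all assumptions chosen for illustration, and the edge counts stand in for one simple kind of tabulated “network information.”

```python
from collections import defaultdict
from itertools import combinations
import re


def tokenize(text):
    """Parse and tokenize: unique lowercase word tokens, sorted."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))


def map_edges(doc):
    """Map step: emit ((u, v), 1) for every pair of tokens in one document.

    Assumed edge rule: two features are related if they co-occur in a document.
    """
    for u, v in combinations(tokenize(doc), 2):
        yield (u, v), 1


def reduce_edges(pairs):
    """Reduce step: sum the counts emitted for each edge key."""
    weights = defaultdict(int)
    for edge, count in pairs:
        weights[edge] += count
    return dict(weights)


def build_graph(docs):
    """Run the map and reduce steps, then derive the vertex list from the edges."""
    emitted = (pair for doc in docs for pair in map_edges(doc))
    edges = reduce_edges(emitted)
    vertices = sorted({v for edge in edges for v in edge})
    return vertices, edges


# Hypothetical input documents.
docs = ["Hadoop stores graphs", "graphs need Hadoop", "tuning Hadoop"]
vertices, edges = build_graph(docs)
```

On a real cluster the map and reduce steps would run as distributed tasks over far larger inputs; the sketch only shows how an edge rule applied repeatedly yields a topology plus per-edge tabulations.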

At this point, we have a large-scale graph ready for HDFS, HBase, or another distributed store.  But we need to do a few more things to ensure that queries and computations on the graph will scale up nicely, like:

  • Cleaning the graph’s structure and checking that it is reasonable
  • Compressing the graph and network information to conserve cluster resources
  • Partitioning the graph in a way that will minimize cluster communications while load balancing computational effort

The Intel Graph Builder library provides efficient distributed algorithms for all of the above, and more, so that data scientists can spend more of their time analyzing data and less of their time preparing it.  Enough said. The library will be included in the Intel Distribution shortly and we look forward to your feedback.  We are constantly on the hunt for new features as we look to the future of big data.
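Two of the preparation steps listed above, cleaning and partitioning, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the cleaning rule (drop self-loops and duplicate undirected edges) is one plausible definition of “reasonable,” and the greedy edge placement is a simple stand-in heuristic, not the library’s actual algorithm.

```python
from collections import defaultdict


def clean_edges(edges):
    """Drop self-loops and duplicates, treating edges as undirected."""
    seen = set()
    cleaned = []
    for u, v in edges:
        if u == v:
            continue  # self-loop
        key = (u, v) if u <= v else (v, u)  # canonical orientation
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


def greedy_partition(edges, num_parts):
    """Place each edge on a partition, balancing load and limiting vertex copies.

    Among partitions still under an even-share capacity, prefer the one that
    already holds the most endpoints of the edge, then the least loaded one.
    """
    cap = -(-len(edges) // num_parts)  # ceiling division: even share per partition
    loads = [0] * num_parts
    placement = defaultdict(set)  # vertex -> partitions holding a copy of it
    assignment = {}
    for u, v in edges:
        candidates = [p for p in range(num_parts) if loads[p] < cap]

        def score(p):
            overlap = (p in placement[u]) + (p in placement[v])
            return (-overlap, loads[p], p)

        best = min(candidates, key=score)
        assignment[(u, v)] = best
        loads[best] += 1
        placement[u].add(best)
        placement[v].add(best)
    return assignment, loads


# Hypothetical raw edge list with a duplicate and a self-loop.
raw = [("a", "b"), ("b", "a"), ("a", "a"), ("b", "c"), ("c", "d"), ("d", "e")]
cleaned = clean_edges(raw)
assignment, loads = greedy_partition(cleaned, num_parts=2)
```

Keeping an edge next to copies of its endpoints reduces how many vertices must be replicated across machines, which is one source of cluster communication, while the capacity limit keeps the work per partition even.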

Whereas Intel® Graph Builder was developed to simplify the programming of emerging applications, Intel® Active Tuner was developed to simplify the deployment of today’s applications by automating the selection of configuration settings that will result in optimal cluster performance. In fact, we initially codenamed this technology “Gunther,” after a well-known circus elephant trainer, because of its ability to train Hadoop to run faster :-).  It’s cruelty-free to boot, I promise.  Anyway, many Hadoop configuration parameters need to be tuned for the characteristics of each particular application, such as web search, medical image analysis, audio feature analysis, fraud detection, semantic analysis, etc.  This tuning significantly reduces both job execution and query time but is time consuming and requires domain expertise. If you use Hadoop you know that the common practice is to tune it up using rule-of-thumb settings published by industry leaders.  But these recommendations are too general and fail to capture the specific requirements of a given application and cluster resource constraints.  Enter the Active Tuner.

Intel® Active Tuner implements a search engine that uses a small number of representative jobs to identify the best configuration from among millions or billions of possible Hadoop configurations.  It uses a form of AI known as a genetic algorithm to search out the best settings for the number of map tasks, buffer sizes, compression settings, etc., constantly striving to derive better settings by combining those from pairs of trials that show the most promise (this is where the genetic part comes in) and deriving future trials from these new combinations.  And, the Active Tuner can do this faster and more effectively than a human can using rules of thumb.  It can be controlled from a slick GUI in the new Intel Manager for Apache Hadoop, so take it for a test run when you pick up a copy of the Intel Distribution.  You may see your cluster performance improve by up to 30% without any hassle.
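For readers who have not met genetic algorithms before, here is a toy sketch of the idea in Python. It is a generic textbook-style loop, not the Active Tuner’s implementation: the parameter names, the candidate values, and especially the `simulated_runtime` cost model are invented for illustration. In a real tuner, each evaluation would mean running a representative job on the cluster and timing it.

```python
import random

# Hypothetical search space; these are not real Hadoop property names.
SEARCH_SPACE = {
    "map_tasks": [4, 8, 16, 32, 64],
    "buffer_mb": [64, 128, 256, 512],
    "compress": [False, True],
}


def simulated_runtime(cfg):
    """Made-up cost model standing in for timing a real benchmark job."""
    cost = abs(cfg["map_tasks"] - 16) * 2.0
    cost += abs(cfg["buffer_mb"] - 256) / 16.0
    cost += 0.0 if cfg["compress"] else 10.0
    return 100.0 + cost


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Combine two promising trials: each setting comes from one parent."""
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}


def mutate(cfg, rng, rate=0.1):
    """Occasionally replace a setting with a random value from its range."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in cfg.items()
    }


def tune(generations=20, pop_size=12, seed=0):
    """Evolve a population of configurations, keeping the better half each round."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    initial_best = min(simulated_runtime(c) for c in population)
    for _ in range(generations):
        population.sort(key=simulated_runtime)
        parents = population[: pop_size // 2]  # survivors carry over unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    best = min(population, key=simulated_runtime)
    return best, initial_best


best, initial_best = tune()
```

Because the better half of each generation survives unchanged, the best configuration found never gets worse from one round to the next; crossover and mutation supply the new candidates.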

To wrap, these are one-of-a-kind technologies that I think you’ll have fun playing with.  And, despite offering quite a lot, Intel® Graph Builder and Intel® Active Tuner are just the beginning.  I am very excited by what’s coming next.  Intel is moving to unlock the power of Big Data and Intel Labs is preparing to blow it wide open.

*Other names and brands may be claimed as the property of others

Ted Willke

About Ted Willke

Ted Willke is a Principal Engineer with Intel and the General Manager of the Graph Analytics Operation in Intel Labs. Before joining Intel Labs in 2010, Ted spent 12 years working on server I/O technologies and standards within Intel’s product and pathfinding organizations. He holds a Doctorate in electrical engineering from Columbia University.
