Big Data. Big. Data. We hear the term frequently used to describe data of unusual size or generated at spectacular velocity, like the amount of social data that Facebook has amassed on us (30 PB in one cluster) or the rate at which sensors at the Large Hadron Collider collect information on subatomic particles (15 PB/year). And it’s often deemed “unstructured or semi-structured” to describe its lack of apparent, well, structure. What’s meant is that this data isn’t organized in a way that can directly answer questions, like a database can if you ask it how many widgets you sold last week.
But Big Data does have structure; it just needs to be discovered within the raw text, images, video, sensor data, etc., that comprise it. And companies, led by pioneers like Google, have been doing this for the better part of a decade, using applications that churn through the information with data-parallel processing and convenient frameworks like Hadoop MapReduce. Their systems chop the incoming data into slices and farm them out to masses of machines, which then filter it, order it, sum it, transform it, and do just about anything you’d want to do with it, within the practical limits of the readily available frameworks.
But until recently, only the wizards of Big Data were able to rapidly extract knowledge from a different type of structure within the data, a type that is best modeled by tree or graph structures. Imagine the pattern of hyperlinks connecting Wikipedia pages or the connections between Tweeters and Followers on Twitter. In these models, a line is drawn between two bits of information if they are related to each other in some way. The nature of the connection can be less obvious than in these examples and made specifically to serve a particular algorithm. For example, a popular form of machine learning called Latent Dirichlet Allocation (a mouthful, I know) can create “word clouds” of topics in a set of documents without being told the topics in advance. All it needs is a graph that connects word occurrences to the filenames. Another algorithm can accurately guess the type of noun (i.e., person, place, or thing) if given a graph that connects noun phrases to surrounding context phrases.
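To make the LDA example concrete, here is a toy sketch of the kind of bipartite graph it consumes, connecting word occurrences to filenames. The corpus and function name are illustrative inventions, not GraphLab's or GraphBuilder's API:

```python
from collections import Counter

# Toy corpus: filename -> raw text (hypothetical data for illustration)
docs = {
    "astronomy.txt": "star planet orbit star telescope",
    "cooking.txt":   "recipe oven star chef recipe",
}

def word_doc_edges(docs):
    """Build bipartite edges (word, filename, count) linking each
    word occurrence to the file it appears in."""
    edges = []
    for fname, text in docs.items():
        for word, count in Counter(text.split()).items():
            edges.append((word, fname, count))
    return edges

edges = word_doc_edges(docs)
```

A word like "star" ends up with edges to both files, and it is exactly this co-occurrence structure that lets LDA group words into topics without being told the topics in advance.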
Many of these graphs are very large, with tens of billions of vertices (i.e., things being related) and hundreds of billions of edges (i.e., the relationships). And many that model natural phenomena possess power-law degree distributions, meaning that most vertices connect to only a handful of others, while a few have edges to a substantial portion of the graph. For instance, a graph of Twitter relationships would show that most people have only a few dozen followers while a handful of celebrities have millions. This is all very problematic for parallel computation in general and MapReduce in particular. As a result, Carlos Guestrin and his crack team at the University of Washington in Seattle have developed a new framework, called GraphLab, that is specifically designed for graph-based parallel machine learning. In many cases, GraphLab can process such graphs 20-50X faster than Hadoop MapReduce. Learn more about their exciting work here.
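The load imbalance this skew causes is easy to see in miniature. In this hypothetical follower graph (vertex names are made up), a single celebrity vertex accumulates almost all the edges, so any scheme that hands out work per vertex leaves one machine doing nearly everything:

```python
from collections import Counter

# Hypothetical follower edges (follower, followed); "celeb" is the hub.
edges = [("u%d" % i, "celeb") for i in range(1000)] + [
    ("u1", "u2"), ("u2", "u3"), ("u3", "u1"),
]

def in_degrees(edges):
    """Count incoming edges per vertex (followers per account)."""
    return Counter(dst for _, dst in edges)

deg = in_degrees(edges)
# A naive per-vertex split would assign all 1000 of "celeb"'s edges
# to one worker, while the others sit nearly idle.
```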
Carlos is a member of the Intel Science and Technology Center for Cloud Computing, and we started working with him on graph-based machine learning and data mining challenges in 2011. It quickly became clear that no one had a good story about how to construct large-scale graphs that frameworks like GraphLab could digest. His team was constantly writing scripts to construct different graphs from various unstructured data sources. These scripts ran on a single machine and would take a very long time to execute. Essentially, they were using a labor-intensive, low-performance method to feed information to their elegant high-performance GraphLab framework. This simply would not do.
Scanning the environment, we identified a more general hole in the open source ecosystem: A number of systems were out there to process, store, visualize, and mine graphs but, surprisingly, not to construct them from unstructured sources. So, we set out to develop a demo of a scalable graph construction library for Hadoop. Yes, for Hadoop. Hadoop is not good for graph-based machine learning but graph construction is another story. This work became GraphBuilder, which was demonstrated in July at the First GraphLab Workshop on Large-Scale Machine Learning and open sourced this week at 01.org (under Apache 2.0 licensing).
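Why is Hadoop a good fit for construction? Extracting edges from raw pages is an embarrassingly parallel map step followed by a grouping reduce step. Hadoop's real API is Java; the sketch below is a single-machine Python stand-in for the idea, with an illustrative wiki-style link markup that is not any particular corpus's actual format:

```python
import re
from collections import defaultdict

# Toy pages: title -> markup containing [[links]] (illustrative format)
pages = {
    "Graph": "See [[Vertex]] and [[Edge]].",
    "Vertex": "Connected by an [[Edge]].",
}

def map_page(title, text):
    """Map step: emit one (source, target) pair per hyperlink."""
    for target in re.findall(r"\[\[(.+?)\]\]", text):
        yield title, target

def reduce_edges(pairs):
    """Reduce step: group targets by source into adjacency lists."""
    adj = defaultdict(list)
    for src, dst in pairs:
        adj[src].append(dst)
    return dict(adj)

pairs = [p for title, text in pages.items() for p in map_page(title, text)]
graph = reduce_edges(pairs)
```

On a real cluster, the map tasks run over slices of the corpus in parallel and the shuffle phase does the grouping, which is exactly the workload Hadoop excels at, even if graph-structured computation afterward is not.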
GraphBuilder not only constructs large-scale graphs fast but also offloads many of the complexities of graph construction, including graph formation, cleaning, compression, partitioning, and serialization. This makes it easy for just about anyone to build graphs for interesting research and commercial applications. In fact, GraphBuilder makes it possible for a Java programmer to build an internet-scale graph for PageRank in about 100 lines of code and a Wikipedia-sized graph for LDA in about 130.
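Of the complexities listed above, partitioning is a good example of something easy to state and fiddly to get right at scale. As a minimal sketch of the idea behind hash partitioning (the function here is my own illustration, not GraphBuilder's actual API):

```python
import hashlib

def partition_edge(src, dst, num_parts):
    """Assign an edge to a partition by hashing its source vertex,
    so every edge out of the same source lands on the same worker."""
    # md5 gives a deterministic hash, unlike Python's randomized str hash.
    digest = hashlib.md5(src.encode()).hexdigest()
    return int(digest, 16) % num_parts

part_ab = partition_edge("PageA", "PageB", 4)
part_ac = partition_edge("PageA", "PageC", 4)
# Both edges share the source "PageA", so they share a partition.
```

Real partitioners also have to worry about the power-law hubs discussed earlier, where a single vertex's edges can overwhelm one partition, which is part of what makes a library that handles this for you worthwhile.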
This is only the beginning for GraphBuilder but it has already made a lot of connections. We will continually update it with new capabilities, so please try it out and let us know if you’d value something in particular. And, let us know if you’ve got an interesting graph problem for us to grind through. We are always looking for new revelations.