Today, my team is announcing a major update to Intel® Graph Builder for Apache Hadoop* software, our open source library that structures big data for graph-based machine learning and data mining. This update will help data scientists accelerate their time-to-insight by making graph analytics easier to work with on big data systems. We believe that graph analytics will be a key tool for realizing value from big data once a few key hurdles are cleared, and, in this blog, my engineers and I would like to share our perspective on why we decided to tackle graph construction first and what we’re doing to make it easier.
Graph Builder 1.0 was released just over a year ago and we have taken a wild ride since then into other parts of the graph processing problem, looking at graph databases, like Titan*, and in-memory graph engines, like Apache Giraph* and GraphLab*. We’ve assembled complete graph processing workflows for our customers, tackling anomaly detection, classification, clustering, correlation, predictive analytics, and recommendation problems using complex real-world graphs with billions of objects, object properties, and object relationships. But, as we did all this, we noticed something (not all that) funny: we were still spending an inordinate amount of time wrestling raw material into usable graph models, which was exactly what Graph Builder was supposed to save us from! Sometimes eating your own dog food is no fun, but eat and eat and eat we did. And, we learned a lot. First, although the parsing and extraction methods in Graph Builder 1.0 enabled us to define graph elements, the critical step of transforming and filtering the raw material for graphs required us to extend our Java source code. For example, there was no easy way to normalize a particular field, calculate its logarithm, or perform simple operations such as dropping null values. Second, the semantics of the Java MapReduce API continued to be cumbersome for many of the programmers we hired. Third, we found the graph JSON format to be extremely limiting and not nearly as easy to explore (i.e., traverse) as a graph stored in a graph database such as Titan or Neo4j*. This format is also not supported by most graph visualization tools.
After we had eaten enough dog food to kill any dog, our objectives for Graph Builder 2.0 became clear. It needed to provide additional graph-specific operators for assigning graph elements, filtering, string manipulation, and math during the extract, transform, and load phases, and we needed to reconsider the use of Java MapReduce as the programming API. Finally, we needed to support additional import and export formats so that our graphs could be consumed by popular graph databases and visualization tools. While Graph Builder 1.0 addressed some of these issues, it left many gaps. It was time for a major update.
Traditionally, ETL for relational database management systems (RDBMS) is implemented in SQL as a series of operations such as joins, sorts, and filters. In the world of Hadoop MapReduce, Pig is a dataflow engine that may be used to do the same. Pig Latin is the language used to define the steps of a dataflow in Pig. Philosophically, Pig and Pig Latin are more about “how” the data should be processed than “what” specific MapReduce jobs are required for processing. Additionally, as Alan Gates says in his book on the subject, Pigs eat everything: Pig can ingest many data formats and data types from many data sources, in line with our objectives for Graph Builder. Also, Pig has native support for local file systems, HDFS, and HBase, and tools like Sqoop can be used upstream of Pig to transfer data into HDFS from relational databases. One of the most fascinating things about Pig is that it only takes a few lines of code to define a complex workflow composed of a long chain of data operations. These operations can map to multiple MapReduce jobs, and Pig compiles the logical plan into an optimized workflow of MapReduce jobs. With all of these advantages, Pig seemed like the right tool for graph ETL, so we re-architected Graph Builder 2.0 as a library of User-Defined Functions (UDFs) and macros in Pig Latin.
The new version of Graph Builder also adds support for multi-relational graphs, or property graphs, in which both objects and relationships may be labeled with multiple properties and property values. The property graph model is a generalized graph model that can handily describe real-world information networks, such as semantic relationships on the Web and social networks. It allows for different types of relationships, e.g. Alice-Likes->Books, Alice-WorksWith->Frank, to coexist in the same graph. Objects of different types can have different sets of properties, e.g. “People” objects can have Name, Address, Gender, and Age, whereas “Organization” objects can have Address, Phone, Number of Employees, and Revenue as properties. It’s absolutely beautiful.
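To make the property graph model concrete, here is a minimal Python sketch of the idea described above. It is only an illustration of the data model, not Graph Builder code: the class names, the `neighbors` helper, and the "Interest" label are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    vid: str                      # unique identifier of the object
    label: str                    # object type, e.g. "People"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    src: str                      # id of the source vertex
    dst: str                      # id of the destination vertex
    label: str                    # relationship type, e.g. "Likes"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vid: str, label: str, **props: Any) -> None:
        self.vertices[vid] = Vertex(vid, label, dict(props))

    def add_edge(self, src: str, dst: str, label: str, **props: Any) -> None:
        # Both endpoints must already exist in the graph.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(src, dst, label, dict(props)))

    def neighbors(self, vid: str, label: str) -> List[str]:
        # Follow only outgoing edges that carry the requested relationship type.
        return [e.dst for e in self.edges if e.src == vid and e.label == label]


# Different relationship types coexist in the same graph.
graph = PropertyGraph()
graph.add_vertex("alice", "People", Name="Alice")
graph.add_vertex("frank", "People", Name="Frank")
graph.add_vertex("books", "Interest", Name="Books")   # "Interest" is a made-up label
graph.add_edge("alice", "books", "Likes")
graph.add_edge("alice", "frank", "WorksWith")

print(graph.neighbors("alice", "Likes"))
print(graph.neighbors("alice", "WorksWith"))
```

Keeping the relationship type on the edge, rather than building one graph per relationship, is what lets Alice-Likes->Books and Alice-WorksWith->Frank live side by side.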
Property graphs can be constructed from structured, semi-structured, or unstructured data. In the case of structured data, columns or fields can be annotated as objects, relationships, or their properties. We have added the ability to extract fields from nested JSON and improved the XMLLoader function available in the Apache Piggy Bank repository to make the construction of property graphs from semi-structured data easier. In addition, we’ve contributed a new regular expression utility, RegexExtractAllMatches, that returns all text matches in a string, which is very useful for processing unstructured text data. Once you’ve constructed a graph, you may use our deduplication UDF to merge duplicate elements. Often, these capabilities will not be enough, and that’s okay because Pig makes it easy for users to add new UDFs for data manipulation.
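As a rough illustration of the last two steps, the Python sketch below mimics what an "extract all matches" utility and a deduplication pass do when building a page-link graph from unstructured text. It is an analogue written for this post, not the Pig UDFs themselves: the function names, the wiki-link pattern, and the sample text are assumptions.

```python
import re
from typing import List, Tuple

# Assumed pattern for wiki-style links such as [[Graph theory]].
WIKI_LINK = r"\[\[[^\]]+\]\]"


def extract_all_matches(pattern: str, text: str) -> List[str]:
    """Return every non-overlapping match of pattern in text, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]


def page_link_edges(page: str, text: str) -> List[Tuple[str, str]]:
    """Turn each link found in the page text into a (page, target) edge."""
    links = extract_all_matches(WIKI_LINK, text)
    # Strip the surrounding [[ and ]] to get the target page title.
    return [(page, link[2:-2]) for link in links]


def deduplicate(edges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge duplicate edges while keeping first-seen order."""
    return list(dict.fromkeys(edges))


# Made-up sample text with a repeated link.
text = "See [[Graph theory]] and [[Apache Pig]]; also [[Graph theory]]."
raw_edges = page_link_edges("Sample page", text)
unique_edges = deduplicate(raw_edges)

print(raw_edges)
print(unique_edges)
```

The same two-step shape, extract everything first and merge duplicates afterwards, keeps the extraction logic simple and leaves the graph-level cleanup to a single pass.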
Of course, there’s no point in building a graph if you can’t query it, analyze it, visualize it, etc. So, we’re introducing new bulk load and export methods. If you are using the open source Titan distributed graph database, the LOAD_TITAN macro will allow you to bulk load Titan via the Blueprints API so that you can explore your graph using the Gremlin query language. In addition, we have extended Graph Builder to support the Resource Description Framework (RDF) export format. RDF is a standard graph data model where each graph element is represented as a list of statements in the form of subject (one thing in the world), predicate (the relationship), and object (another thing). The RDF model is used extensively in text analytics and natural language processing and by a slew of graph databases, like AllegroGraph and Oracle NoSQL Database. Last, but not least, the new Graph Builder can export simple vertex (object) lists and edge (relationship) lists.
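To show what subject-predicate-object statements look like in practice, here is a small Python sketch that writes edges and properties as N-Triples-style lines. It is not Graph Builder’s RDF exporter: the `http://example.org/` namespace and the helper names are hypothetical, and the literal escaping is deliberately simplified.

```python
from urllib.parse import quote

# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"


def iri(local_name: str) -> str:
    """Wrap a local name in an IRI, percent-encoding unsafe characters."""
    return f"<{BASE}{quote(local_name)}>"


def edge_to_triple(src: str, label: str, dst: str) -> str:
    """A relationship becomes: subject = source, predicate = label, object = target."""
    return f"{iri(src)} {iri(label)} {iri(dst)} ."


def property_to_triple(vertex: str, key: str, value: object) -> str:
    """A property becomes a statement whose object is a plain string literal."""
    # Simplified escaping: only backslashes and double quotes are handled.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'{iri(vertex)} {iri(key)} "{escaped}" .'


statements = [
    edge_to_triple("Alice", "Likes", "Books"),
    edge_to_triple("Alice", "WorksWith", "Frank"),
    property_to_triple("Alice", "Name", "Alice"),
]

for line in statements:
    print(line)
```

Each line stands alone as one statement, which is why a whole graph can be exported as a flat list of statements.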
By now, you can probably tell we’re extremely excited about Intel® Graph Builder 2.0 and we think you’ll share our excitement when you realize what you can do with several lines of Pig Latin. We hope we have provided the Graph Community with a launch pad by which to extend and simplify Graph ETL. And, we’re not done yet. Inspired by projects like the Python NetworkX package for complex network analysis, we continue to strive to provide a *complete* graph processing library and to develop a native graph data type for Pig. To get started, check out our sample Pig script and build your own page-link graph from the Wikipedia dataset. Then, ask a question or share an idea for the new Graph Builder here. We look forward to connecting with you. Until then, graph on!
*Other names and brands may be claimed as the property of others
– Kushal Datta contributed to this blog.