The Center for Data Innovation spoke with Dr. Marko A. Rodriguez, a partner at graph computing startup Aurelius. Uniquely suited to storing and processing data from social networks, human brain research and other disparate fields, graph databases and graph analytics have enjoyed a swift increase in popularity over the last several years. Dr. Rodriguez spoke about his technologies’ capabilities and why he thinks they will one day lead to big breakthroughs in neuroscience and artificial intelligence.
Travis Korte: Can you first give a brief overview (for an educated non-engineer) of the Aurelius Graph Cluster?
Marko Rodriguez: The Aurelius Graph Cluster is a collection of interoperable graph computing technologies that works over a multi-machine compute cluster. Titan is a distributed graph database that has been demonstrated to handle graphs on the order of 100 billion edges and transaction rates of 10,000 per second. Faunus is a graph analytics system that leverages Hadoop to perform global graph traversals as well as bulk loading/mutating of the graph data contained within Titan. These two technologies currently form the online transaction processing (OLTP) and online analytical processing (OLAP) aspects of the Aurelius Graph Cluster.
TK: What are some use cases for which graph analysis is particularly well suited?
MR: Any time a data set can be represented as discrete “things” (vertices) that can be associated with one another by various types of relationships (edges), a graph database becomes a useful medium for storing that data. Once the data has been stored, the next requirement is querying it. Typically, queries are expressed as traversals, whereby a traverser moves from vertex to vertex over the edges that connect them. Graph databases excel at expressing and executing traversals that are recursive (e.g. walking trees) and deep (e.g. exploring long paths across the graph). In layman’s terms, graph databases are well positioned to handle network and hierarchical data.
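To make the idea of a recursive traversal concrete, here is a minimal sketch (illustrative Python, not Titan or Faunus code) of a traverser walking a tree: starting from a root vertex, it follows outgoing edges recursively to collect every reachable vertex. The `org` hierarchy and its vertex names are invented for the example.

```python
# Hypothetical illustration of a recursive graph traversal: the graph
# is an adjacency list mapping each vertex to its child vertices.

def descendants(graph, root):
    """Recursively collect every vertex reachable from `root`."""
    found = []
    for child in graph.get(root, []):
        found.append(child)
        found.extend(descendants(graph, child))  # walk the subtree
    return found

# A small hierarchy: each key maps to its child vertices.
org = {
    "ceo": ["vp_eng", "vp_sales"],
    "vp_eng": ["dev_1", "dev_2"],
    "vp_sales": ["rep_1"],
}

print(descendants(org, "ceo"))
# → ['vp_eng', 'dev_1', 'dev_2', 'vp_sales', 'rep_1']
```

The same query in a relational database would require repeated self-joins (or recursive SQL), whereas a graph traversal expresses the walk directly.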
TK: I’ve heard the narrative in the past that graph databases are hard to scale. How do you manage to keep Titan moving quickly with large data volumes?
MR: The narrative that “graph databases are hard to scale” is promoted by vendors that have designed their graph databases exclusively for single-machine, in-memory usage. When the architecture is pigeonholed to a single machine, it’s hard to move to multiple machines. If you design the architecture from the start to be distributed, then a graph database can scale effectively, given an intelligent design. Titan leverages [NoSQL database] Cassandra (or HBase) to store its serialized graph on the disks of a multi-machine compute cluster. The BigTable data model of Cassandra/HBase is actually an excellent medium for representing graphs, as it forms an adjacency list, where each row is a vertex and the incident edges of the vertex are the columns. In a BigTable system, a row can have an arbitrary number of columns (just as a vertex can have an arbitrary number of incident edges). With this disk layout, a vertex’s incident edges are co-located with the vertex (reducing disk seek times). Moreover, with column predicates, edges can be kept sorted, so only the columns needed are fetched and disk-access latency drops. Titan leverages off-heap caching to ensure consistent Java Virtual Machine (JVM) garbage collection behavior, which in turn allows for high transactional throughput. In the end, Titan is designed from the ground up to support massive-scale graphs being heavily traversed by numerous concurrent threads of execution.
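The row layout described above can be sketched in a few lines of Python. This is an illustrative stand-in for a BigTable-style row, not Titan’s actual serialization: each row key is a vertex, each column holds one incident edge, and columns are kept sorted by edge key so a prefix scan (mimicking a column predicate) fetches only the matching edges.

```python
# Illustrative sketch of a BigTable-style adjacency list (not Titan's
# real on-disk format): one row per vertex, one sorted column per
# incident edge, with a prefix scan standing in for a column predicate.
from bisect import bisect_left, bisect_right

class VertexRow:
    def __init__(self):
        self.columns = []  # sorted list of (edge_key, target_vertex)

    def add_edge(self, edge_key, target):
        # Insert while keeping columns sorted by edge key.
        entry = (edge_key, target)
        self.columns.insert(bisect_left(self.columns, entry), entry)

    def edges(self, prefix=None):
        # Column-predicate-style scan: return only edges whose key
        # starts with `prefix`, without touching the other columns.
        if prefix is None:
            return list(self.columns)
        lo = bisect_left(self.columns, (prefix,))
        hi = bisect_right(self.columns, (prefix + "\xff",))
        return self.columns[lo:hi]

table = {"alice": VertexRow()}  # one row per vertex
table["alice"].add_edge("knows:bob", "bob")
table["alice"].add_edge("knows:carol", "carol")
table["alice"].add_edge("likes:post42", "post42")

print(table["alice"].edges("knows"))
# → [('knows:bob', 'bob'), ('knows:carol', 'carol')]
```

Because the edges sit in the same row as their vertex and are sorted, retrieving “all ‘knows’ edges of alice” touches one contiguous slice of one row, which is the disk-locality property the answer describes.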
TK: The basic idea of graph databases has been around for a while. Why do you think they haven’t caught on to a greater degree? What’s different now?
MR: The concept of graphs/networks has been around for centuries in academia. Only now are people starting to realize that many data sets can be naturally stored as a graph and many problems naturally solved using graph traversals. With the rise of social media sites such as Facebook, Twitter, LinkedIn, etc., the popular zeitgeist has come to see that “the world” is best represented as a graph. This runs contrary to the prevailing view in the early days of databases, when the world was seen in terms of tables (spreadsheets, ledgers, etc.). As the graph perspective continues to grow, graph databases will be a tool of choice, though like all things tested against time, world views are fleeting.
TK: What do you think the future holds for graph database technology?
MR: Looking to the future, I believe that as neuroscience advances far enough to explain, computationally, how data is stored and processed in the human brain, neural-inspired data structures and algorithms will be applied to problems using graph databases. Massive graphs stored in compute clusters and executing neural algorithms will yield novel, artificially intelligent technologies: automated categorization/classification, associative memory, and input/output “behavioral” pathways. The result will be the realization of correlations across space and time that currently take centuries of human scientific investigation to grasp.