Deploying the Aurelius Graph Cluster

11.19.2012

The Aurelius Graph Cluster is a cluster of interoperable graph technologies that can be deployed on a multi-machine compute cluster. This post demonstrates how to set up the cluster on Amazon EC2 (a popular cloud service provider) with the following graph technologies:

Titan is an Apache2-licensed distributed graph database that leverages existing persistence technologies such as Apache HBase and Cassandra. Titan implements the Blueprints graph API and therefore supports the Gremlin graph traversal/query language. [OLTP]

Faunus is an Apache2-licensed batch analytics graph computing framework based on Apache Hadoop. Faunus leverages the Blueprints graph API and exposes Gremlin as its traversal/query language. [OLAP]

Please note the date of this publication: newer versions of the technologies discussed, as well as other deployment techniques, may exist. Finally, all commands point to an example cluster; when reusing them, substitute the addresses and identifiers of your own cluster.

Cluster Configuration

The examples in this post assume the reader has access to an Amazon EC2 account. The first step is to create a machine instance that has, at minimum, Java 1.6+ on it. This instance is used to spawn the graph cluster. The name given to this instance is agc-master and it is a modest m1.small machine. On agc-master, Apache Whirr 0.8.0 is downloaded and unpacked.

~$ ssh ubuntu@ec2-184-72-209-80.compute-1.amazonaws.com
...
ubuntu@ip-10-117-55-34:~$ wget http://www.apache.org/dist/whirr/whirr-0.8.0/whirr-0.8.0.tar.gz
ubuntu@ip-10-117-55-34:~$ tar -xzf whirr-0.8.0.tar.gz

Whirr is a cloud-service-agnostic tool that simplifies the creation and destruction of a compute cluster. A Whirr “recipe” (i.e. a properties file) describes the machines in a cluster and their respective services. The recipe used in this post is provided below and saved to a text file named agc.properties on agc-master. The recipe defines a cluster of 5 m1.large machines running HBase 0.94.1 and Hadoop 1.0.3 (see whirr.instance-templates). HBase will serve as the database persistence engine for Titan and Hadoop will serve as the batch computing engine for Faunus.

whirr.cluster-name=agc
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,4 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
whirr.hbase.tarball.url=http://archive.apache.org/dist/hbase/hbase-0.94.1/hbase-0.94.1.tar.gz
whirr.hadoop.tarball.url=http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
hbase-site.dfs.replication=2

From agc-master, the following commands will launch the previously described cluster. Note that the first two lines require specific Amazon EC2 account information. When the launch completes, the Amazon EC2 web admin console will show the 5 m1.large machines.

ubuntu@ip-10-117-55-34:~$ export AWS_ACCESS_KEY_ID= # requires account specific information
ubuntu@ip-10-117-55-34:~$ export AWS_SECRET_ACCESS_KEY= # requires account specific information
ubuntu@ip-10-117-55-34:~$ ssh-keygen -t rsa -P ''
ubuntu@ip-10-117-55-34:~$ whirr-0.8.0/bin/whirr launch-cluster --config agc.properties

In the deployed cluster, each machine maintains its respective software services. The sections to follow demonstrate how to load and then process graph data within the cluster. Titan will serve as the data source for Faunus' batch analytic jobs.

Loading Graph Data into Titan

Titan is a highly scalable, distributed graph database that leverages existing persistence engines. Titan 0.1.0 supports Apache Cassandra (AP), Apache HBase (CP), and Oracle BerkeleyDB (CA); each backend emphasizes a different pair of guarantees from the CAP theorem. For the purpose of this post, Apache HBase is utilized and therefore, Titan is consistent (C) and partition tolerant (P).
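
To make the backend choice concrete, the minimal sketch below shows how Titan's storage.backend property selects the persistence engine from the Gremlin console. Only the hbase value is exercised in this post; the cassandra and berkeleyje values are shown for comparison and are assumptions based on Titan's configuration conventions, so treat them as illustrative.

// a minimal sketch of backend selection (only 'hbase' is used in this post)
conf = new BaseConfiguration()
conf.setProperty('storage.backend','hbase')         // CP: set storage.hostname
// conf.setProperty('storage.backend','cassandra')  // AP: set storage.hostname
// conf.setProperty('storage.backend','berkeleyje') // CA: set storage.directory
g = TitanFactory.open(conf)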

For the sake of simplicity, the 1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master machine is used for cluster interactions. Its IP address can be found in the Whirr instance metadata on agc-master. This machine is convenient because numerous services (e.g. the HBase shell, Hadoop, etc.) are already installed on it, and therefore no manual software installation is required.

ubuntu@ip-10-117-55-34:~$ more .whirr/agc/instances 
us-east-1/i-3c121b41    zookeeper,hadoop-namenode,hadoop-jobtracker,hbase-master    54.242.14.83    10.12.27.208
us-east-1/i-34121b49    hadoop-datanode,hadoop-tasktracker,hbase-regionserver   184.73.57.182   10.40.23.46
us-east-1/i-38121b45    hadoop-datanode,hadoop-tasktracker,hbase-regionserver   54.242.151.125  10.12.119.135
us-east-1/i-3a121b47    hadoop-datanode,hadoop-tasktracker,hbase-regionserver   184.73.145.69   10.35.63.206
us-east-1/i-3e121b43    hadoop-datanode,hadoop-tasktracker,hbase-regionserver   50.16.174.157   10.224.3.16

Once logged into the machine via ssh, Titan 0.1.0 is downloaded and unzipped, and the Gremlin console is started.

ubuntu@ip-10-117-55-34:~$ ssh 54.242.14.83
...
ubuntu@ip-10-12-27-208:~$ wget https://github.com/downloads/thinkaurelius/titan/titan-0.1.0.zip
ubuntu@ip-10-12-27-208:~$ sudo apt-get install unzip
ubuntu@ip-10-12-27-208:~$ unzip titan-0.1.0.zip
ubuntu@ip-10-12-27-208:~$ cd titan-0.1.0/
ubuntu@ip-10-12-27-208:~/titan-0.1.0$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> 

A toy graph of 1 million vertices and edges is loaded into Titan using the Gremlin/Groovy script below (simply cut-and-paste the source into the Gremlin console and wait approximately 3 minutes). The code implements a preferential attachment algorithm. For an explanation of this algorithm, please see the second column of page 33 in Mark Newman's article The Structure and Function of Complex Networks.

// connect Titan to HBase in batch loading mode
conf = new BaseConfiguration()
conf.setProperty('storage.backend','hbase')
conf.setProperty('storage.hostname','localhost')
conf.setProperty('storage.batch-loading','true');
g = TitanFactory.open(conf)

// preferentially attach a growing vertex set
size = 1000000; ids = [g.addVertex().id]; rand = new Random();
(1..size).each{
  v = g.addVertex();
  u = g.v(ids.get(rand.nextInt(ids.size()))) // sample an existing vertex
  g.addEdge(v,u,'linked');
  ids.add(u.id); // u.id appears again, biasing future samples toward high-degree vertices
  ids.add(v.id);
  if(it % 10000 == 0) {
    g.stopTransaction(SUCCESS) // commit every 10000 edges
    println it
  }
}; g.shutdown()
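
As an optional sanity check, the graph can be reopened without batch loading and its vertices counted from the same Gremlin console. This is a sketch: counting performs a full scan and will be slow on a graph this size.

// reopen the HBase-backed graph in normal (non-batch) mode
conf.setProperty('storage.batch-loading','false')
g = TitanFactory.open(conf)
g.V.count() // expect 1000001: the seed vertex plus one million others
g.shutdown()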

Batch Analytics with Faunus

Faunus is a Hadoop-based graph computing framework. It supports performant global graph analyses by making use of sequential reads from disk (see The Pathologies of Big Data). Faunus provides connectivity to Titan/HBase, Titan/Cassandra, any Rexster-fronted graph database, and to text/binary files stored in HDFS. From the 1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master machine, Faunus 0.1-alpha is downloaded and unzipped. The provided titan-hbase.properties file should be updated with hbase.zookeeper.quorum=10.12.27.208 instead of localhost. The IP address 10.12.27.208 is provided by ~/.whirr/agc/instances on agc-master. Finally, the Gremlin console is started.
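
For reference, the edited line in bin/titan-hbase.properties reads as follows (the IP address comes from the Whirr instances file shown earlier; all other properties in the file are left at their defaults):

hbase.zookeeper.quorum=10.12.27.208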

ubuntu@ip-10-12-27-208:~$ wget https://github.com/downloads/thinkaurelius/faunus/faunus-0.1-alpha.zip
ubuntu@ip-10-12-27-208:~$ unzip faunus-0.1-alpha.zip
ubuntu@ip-10-12-27-208:~$ cd faunus-0.1-alpha/
ubuntu@ip-10-12-27-208:~/faunus-0.1-alpha$ vi bin/titan-hbase.properties
ubuntu@ip-10-12-27-208:~/faunus-0.1-alpha$ bin/gremlin.sh
	 
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> 

A few example Faunus jobs are provided below. The final job generates an in-degree distribution: the in-degree of a vertex is the number of incoming edges it has. The output states how many vertices (second column) have a particular in-degree (first column). For example, 167,050 vertices have exactly 1 incoming edge.

gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')
==>faunusgraph[titanhbaseinputformat]
gremlin> g.V.count() // how many vertices in the graph?
==>1000001
gremlin> g.E.count() // how many edges in the graph?
==>1000000
gremlin> g.V.out.out.out.count() // how many length 3 paths are in the graph?
==>988780
gremlin> g.V.sideEffect('{it.degree = it.inE.count()}').degree.groupCount // what is the graph's in-degree distribution?
==>1 167050
==>10    2305
==>100   6
==>108   3
==>119   3
==>122   3
==>133   1
==>144   2
==>155   1
==>166   2
==>18    471
==>188   1
==>21    306
==>232   1
==>254   1
==>...
gremlin> 

To conclude, the in-degree distribution result is pulled from Hadoop's HDFS (stored in output/job-0). Next, scp is used to download the file to agc-master and then again to download it to a local machine (e.g. a laptop). If the local machine has R installed, the file can be plotted and visualized. The log-log plot demonstrates the known result that the preferential attachment algorithm generates a graph with a power-law degree distribution (i.e. “natural statistics”).

ubuntu@ip-10-12-27-208:~$ hadoop fs -getmerge output/job-0 distribution.txt
ubuntu@ip-10-12-27-208:~$ head -n5 distribution.txt
1   167050
10  2305
100 6
108 3
119 3
ubuntu@ip-10-12-27-208:~$ exit
...
ubuntu@ip-10-117-55-34:~$ scp 54.242.14.83:~/distribution.txt .
ubuntu@ip-10-117-55-34:~$ exit
...
~$ scp ubuntu@ec2-184-72-209-80.compute-1.amazonaws.com:~/distribution.txt .
~$ R
> t = read.table('distribution.txt')
> plot(t,log='xy',xlab='in-degree',ylab='frequency')
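
Finally, once all results have been copied off the cluster, the machines can be released from agc-master using the same Whirr recipe and Whirr's standard teardown command. Be certain nothing of value remains on the cluster, as its storage is destroyed with it.

ubuntu@ip-10-117-55-34:~$ whirr-0.8.0/bin/whirr destroy-cluster --config agc.properties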

Conclusion

The Aurelius Graph Cluster is used for processing massive-scale graphs, where massive-scale denotes a graph so large it does not fit within the resource confines of a single machine. In other words, the Aurelius Graph Cluster is all about Big Graph Data. The two cluster technologies explored in this post were Titan and Faunus. They serve two distinct graph computing needs. Titan supports thousands of concurrent real-time, topologically local graph interactions. Faunus, on the other hand, supports long running, topologically global graph analyses. In other words, they provide OLTP and OLAP functionality, respectively.

References

London, G., “Set Up a Hadoop/HBase Cluster on EC2 in (About) an Hour,” Cloudera Developer Center, October 2012.

Newman, M., “The Structure and Function of Complex Networks,” SIAM Review, volume 45, pages 167-256, 2003.

Jacobs, A., “The Pathologies of Big Data,” ACM Queue, volume 7, number 6, July 2009.

Authors

Marko A. Rodriguez and Dan LaRocque



Published at DZone with permission of Marko Rodriguez, author and DZone MVB.
