Turning NoSQL data into Graph

Playing with Apache Giraph and Apache Gora
Team
Renato Marroquín
• PhD student:
• Interested in:
Information retrieval.
Distributed and scalable data management.
• Apache Gora:
PPMC Member
Committer.
• rmarroquin [at] apache [dot] org
Claudio Martella
• PhD student: LSDS @VU University Amsterdam.
• Interested in
Complex Networks.
Distributed and scalable infrastructures.
• Apache Giraph:
PPMC Member
Committer.
• claudio [at] apache [dot] org
Lewis McGibbney
• Scottish expat fae Glasgow
• Post Doc @Stanford University: Engineering Informatics
• Quantity Surveyor/Cost Consultant by profession
• Cycling mad
• Keen OSS enthusiast @TheASF and beyond
• lewismc [at] apache [dot] org
Apache Gora
What is Apache Gora?
● Data Persistence: Persisting objects to column stores, key-value
stores, SQL databases and flat files in the local file system or Hadoop
HDFS.
● Data Access: An easy-to-use, Java-friendly common API for accessing
the data regardless of its location.
● Indexing: Persisting objects to Lucene and Solr indexes, and
accessing/querying the data with the Gora API.
● Analysis: Accessing the data and analysing it through adapters for
Apache Pig, Apache Hive and Cascading.
● MapReduce support: Out-of-the-box and extensive MapReduce
(Apache Hadoop) support for data in the data store.
What is Apache Gora?
● Provides an in-memory data model and persistence for big data.
● Gora supports:
How does Gora work?
1. Define your schema using Apache Avro.
2. Compile your schemas using Gora's compiler.
	java -jar gora-core-XYZ.jar \
	  o.a.gora.compiler.GoraCompiler.class \
	  employee.avsc \
	  gora-app/src/main/java/
3. Create a mapping between logical and physical layout.
4. Update the gora.properties file to set back-end properties.
Rock the NoSQL world!
Apache Giraph
MapReduce and Graphs
• Plain MapReduce is not well suited for graph
algorithms because:
• Graph algorithms are iterative.
• Not intuitive in MapReduce.
• Unnecessarily slow
• Each iteration is a single MapReduce job with too much
overhead
• Separately scheduled
• The graph structure is read from disk
• The intermediate results are read from disks
• Hard to implement
Google's Pregel
• Introduced in 2010
• Based on Valiant's BSP
• “Think like a vertex”: each vertex can send messages to any vertex in the
graph using the bulk synchronous parallel programming model.
• Computation completes when all components complete.
• Batch-oriented processing
• Computation happens in-memory
• Master/slave architecture
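The "think like a vertex" idea can be illustrated with a small, single-process sketch of shortest paths computed in supersteps. This is only a toy imitation of the model, not the Pregel or Giraph API: the class name PregelSketch, the sampleGraph() helper and the map-based graph layout are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, single-process imitation of the vertex-centric model (not the Giraph API). */
public class PregelSketch {

    /** Hypothetical sample graph: vertex id -> (neighbour id -> edge weight). */
    static Map<Long, Map<Long, Double>> sampleGraph() {
        Map<Long, Map<Long, Double>> graph = new HashMap<>();
        graph.put(1L, Map.of(2L, 1.0, 3L, 4.0));
        graph.put(2L, Map.of(3L, 2.0));
        graph.put(3L, Map.of(4L, 1.0));
        graph.put(4L, Map.of());
        return graph;
    }

    /** Single-source shortest paths, one loop iteration per superstep. */
    static Map<Long, Double> shortestPaths(Map<Long, Map<Long, Double>> graph, long source) {
        Map<Long, Double> distance = new HashMap<>();
        for (Long id : graph.keySet()) {
            distance.put(id, Double.POSITIVE_INFINITY);
        }
        // Messages visible in the current superstep, keyed by target vertex.
        Map<Long, List<Double>> inbox = new HashMap<>();
        inbox.computeIfAbsent(source, k -> new ArrayList<>()).add(0.0);

        // Vertices without incoming messages stay inactive; no messages means we are done.
        while (!inbox.isEmpty()) {
            Map<Long, List<Double>> outbox = new HashMap<>();
            for (Map.Entry<Long, List<Double>> entry : inbox.entrySet()) {
                long vertex = entry.getKey();
                double best = Double.POSITIVE_INFINITY;
                for (double candidate : entry.getValue()) {
                    best = Math.min(best, candidate);
                }
                // The vertex only looks at its own value, its messages and its out-edges.
                if (best < distance.getOrDefault(vertex, Double.POSITIVE_INFINITY)) {
                    distance.put(vertex, best);
                    Map<Long, Double> edges = graph.getOrDefault(vertex, Map.of());
                    for (Map.Entry<Long, Double> edge : edges.entrySet()) {
                        outbox.computeIfAbsent(edge.getKey(), k -> new ArrayList<>())
                              .add(best + edge.getValue());
                    }
                }
            }
            // Barrier: messages sent in this superstep are delivered in the next one.
            inbox = outbox;
        }
        return distance;
    }

    public static void main(String[] args) {
        System.out.println(shortestPaths(sampleGraph(), 1L));
    }
}
```

A real system spreads the vertices over many workers and keeps the graph in memory between supersteps; the loop above only mimics the message flow.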
Bulk synchronous parallel
[Diagram labels: Time, Processors, Superstep (Computation + Communication), Barrier]
Open source implementations
• There are some such as:
• Apache Giraph
• Apache Hama
• GoldenOrb
• Signal/Collect
Apache Giraph
• Incubated since summer 2011
• Written in Java
• Implements Pregel's API
• Runs on existing MapReduce infrastructure
• Active community from Yahoo!, Facebook, LinkedIn, Twitter, and
more.
• It's a single Map-only job
• It runs on Hadoop in-memory.
• Fault tolerant
• ZooKeeper for state, no SPOF
During execution time
Setup
● Load graph
● Assign vertices to workers
● Validate workers' health
Compute
● Assign messages to workers
● Iterate on active vertices
● Call vertices' compute()
Synchronize
● Send messages to workers
● Compute aggregators
● Checkpoint
Teardown
● Write results back
● Write aggregators back
Giraph's components
• Master
• Application coordinator
• One active master at a time
• Assigns partition owners to workers prior to each superstep
• Synchronizes supersteps
• Worker – Computation & messaging
• Loads the graph from input splits
• Performs computation/messaging of its assigned partitions
• ZooKeeper
• Maintains global application state
What is needed then?
• Your algorithm in the Pregel model.
• A VertexInputFormat to read your graph,
e.g. <vertex><neighbor1><neighbor2>
• A VertexOutputFormat to write back the results,
e.g. <vertex> <pageRank>
• You could define:
• A Combiner (for reducing number of messages sent/received)
• An Aggregator (for enabling global computation)
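What a combiner and an aggregator buy you can be sketched in plain Java. The classes below are hypothetical stand-ins written for this example (they are not Giraph's combiner or aggregator types): the combiner keeps only the smallest message per target vertex, which is enough for shortest paths, and the aggregator accumulates one global value.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy versions of a message combiner and a global aggregator (not Giraph classes). */
public class CombinerSketch {

    /** Keeps a single message per target vertex: the minimum seen so far. */
    static final class MinCombiner {
        private final Map<Long, Double> pending = new HashMap<>();
        private int received = 0;

        void send(long target, double message) {
            received++;
            // Combine with any message already queued for the same target.
            pending.merge(target, message, Double::min);
        }

        /** Number of messages handed to the combiner. */
        int received() {
            return received;
        }

        /** Messages that would actually go over the wire. */
        Map<Long, Double> combined() {
            return pending;
        }
    }

    /** Sums values contributed by all vertices, e.g. to count updates in a superstep. */
    static final class SumAggregator {
        private long total = 0;

        void aggregate(long value) {
            total += value;
        }

        long value() {
            return total;
        }
    }

    public static void main(String[] args) {
        MinCombiner combiner = new MinCombiner();
        combiner.send(7L, 5.0);
        combiner.send(7L, 2.0);
        combiner.send(8L, 9.0);
        combiner.send(7L, 3.0);
        System.out.println(combiner.received() + " messages in, "
                + combiner.combined().size() + " messages out");

        SumAggregator updates = new SumAggregator();
        updates.aggregate(1);
        updates.aggregate(1);
        System.out.println("updated vertices: " + updates.value());
    }
}
```

Combining is only safe when the operation is commutative and associative, as min and sum are.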
Running a Giraph job
• It is just like running Hadoop
	$HADOOP_HOME/bin/hadoop jar \
	  giraph-examples-1.1.0-XXX-jar-with-dependencies.jar \
	  o.a.g.GiraphRunner o.a.g.examples.SimpleShortestPathsComputation \
	  -vif o.a.g.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
	  -vip /user/hduser/input/tiny_graph.txt \
	  -vof o.a.g.io.formats.IdWithValueTextOutputFormat \
	  -op /user/hduser/output/shortestpaths \
	  -w 1
Apache Giraph + Apache Gora
The project idea
• Integrating Apache Gora with other cool projects.
• Provide access to different data stores out-of-the-box for Apache Giraph.
• Give users more flexibility when deciding how to run graph
algorithms.
• Make the Hadoop environment bigger.
• Apply for the Google Summer of Code project.
The big picture
Integration hooks
• Vertices
• Edges
• Key factory
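The key factory hook converts string values (such as the configured start and end keys) into the key type the data store expects. A minimal sketch of that idea follows; the interface and class names are hypothetical and written for this example, and the real KeyFactory in the Giraph/Gora integration may have a different shape.

```java
/** Sketch of a string-to-key conversion hook; names are hypothetical, not the Giraph API. */
public class KeyFactorySketch {

    /** Hypothetical contract: build a typed key from its string form. */
    interface StringKeyFactory<K> {
        K buildKey(String raw);
    }

    /** Builds Long keys, e.g. for vertex ids stored as numbers. */
    static final class LongKeyFactory implements StringKeyFactory<Long> {
        @Override
        public Long buildKey(String raw) {
            return Long.valueOf(raw.trim());
        }
    }

    /** Passes String keys through unchanged, matching a java.lang.String key class. */
    static final class PlainStringKeyFactory implements StringKeyFactory<String> {
        @Override
        public String buildKey(String raw) {
            return raw;
        }
    }

    public static void main(String[] args) {
        StringKeyFactory<Long> factory = new LongKeyFactory();
        long start = factory.buildKey("0");
        long end = factory.buildKey("10");
        System.out.println("key range: " + start + ".." + end);
    }
}
```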
Parameters offered
Label                                 Description
giraph.gora.datastore.class           Gora DataStore class to read data from (required).
giraph.gora.key.class                 Gora key class used to query the datastore (required).
giraph.gora.persistent.class          Gora Persistent class to read objects from Gora (required).
giraph.gora.keys.factory.class        Keys factory to convert strings into the desired keys (required).
giraph.gora.output.datastore.class    Gora DataStore class to write data to (required).
giraph.gora.output.key.class          Gora key class used to write to the datastore (required).
giraph.gora.output.persistent.class   Gora Persistent class to write to Gora (required).
giraph.gora.start.key                 Gora start key to query the datastore.
giraph.gora.end.key                   Gora end key to query the datastore.
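Since most of these parameters are marked as required, it can help to check a configuration before submitting a job. The helper below is a hypothetical pre-flight check written for this example; it is not part of Giraph or Gora and only uses java.util.Properties with the parameter names listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical pre-flight check for the required giraph.gora.* parameters. */
public class GoraParamsCheck {

    /** The parameters marked as required in the table above. */
    static final List<String> REQUIRED = Arrays.asList(
            "giraph.gora.datastore.class",
            "giraph.gora.key.class",
            "giraph.gora.persistent.class",
            "giraph.gora.keys.factory.class",
            "giraph.gora.output.datastore.class",
            "giraph.gora.output.key.class",
            "giraph.gora.output.persistent.class");

    /** Returns the required parameters that are unset or blank, in table order. */
    static List<String> missing(Properties conf) {
        List<String> absent = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = conf.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                absent.add(key);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("giraph.gora.datastore.class", "org.apache.gora.hbase.store.HBaseStore");
        conf.setProperty("giraph.gora.key.class", "java.lang.String");
        System.out.println("Still missing: " + missing(conf));
    }
}
```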
Rocks in the way
• Dependency issues.
• Supported versions by each project.
• Maven war for handling cyclic dependencies.
• Hadoop issues.
• Not all data stores support MapReduce out of the box.
• Finding out what needs to be on the classpath.
• Providing an API between both projects that is:
• Flexible.
• Simple.
• Pluggable.
So now what?
1. Create your data beans with Gora.
2. Compile them.
	java -jar gora-core-XYZ.jar o.a.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
3. Get your Gora files set up for passing them to Giraph.
	gora.properties
	gora-{datastore}-mapping.xml
4. Get your hooks in place.
	GVertexInputFormat
	GVertexOutputFormat
	KeyFactory
5. Run Giraph!
	hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner \
	  -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml \
	  -Dio.serializations=o.a.h.io.serializer.WritableSerialization,o.a.h.io.serializer.JavaSerialization \
	  -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore \
	  -Dgiraph.gora.key.class=java.lang.String \
	  -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge \
	  -Dgiraph.gora.start.key=0 -Dgiraph.gora.end.key=10 \
	  -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory \
	  -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore \
	  -Dgiraph.gora.output.key.class=java.lang.String \
	  -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult \
	  -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR \
	  org.apache.giraph.examples.SimpleShortestPathsComputation \
	  -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat \
	  -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat \
	  -w 1
Future work
More complex schemas
Adding more data stores
Send us an email on the mailing lists
New serialization formats
• Different serialization formats besides Apache Avro.
• And others that could be interesting for handling different use
cases.
Thanks!
Q&A
Bulk synchronous parallel model
• Computation consists of a series of “supersteps”
• Supersteps are an atomic unit of computation where operations can
happen in parallel
• During a superstep, components are assigned to tasks and receive
unordered messages from previous supersteps.
• Point-to-point messages
• Sent during a superstep from one component to another and then
delivered in the following superstep.
• Computation completes when all components complete
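The superstep/barrier cycle described above can be sketched with plain Java threads. This is a toy model under stated assumptions, not how Giraph is implemented: the class BspSketch, the ring topology and the "add what you received" computation are all made up for illustration. Each worker computes locally, sends one message, and waits at a barrier; messages only become readable in the next superstep.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLongArray;

/** Toy BSP run: workers on a ring, each adding what it received to its own value. */
public class BspSketch {

    static long[] run(int workers, int supersteps) {
        long[] values = new long[workers];
        AtomicLongArray inbox = new AtomicLongArray(workers);   // readable in this superstep
        AtomicLongArray outbox = new AtomicLongArray(workers);  // filled for the next superstep

        // The barrier action runs once per superstep, after every worker has arrived:
        // it delivers the messages sent during the superstep that just ended.
        CyclicBarrier barrier = new CyclicBarrier(workers, () -> {
            for (int i = 0; i < workers; i++) {
                inbox.set(i, outbox.getAndSet(i, 0L));
            }
        });

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            values[id] = id + 1;
            threads[w] = new Thread(() -> {
                try {
                    for (int step = 0; step < supersteps; step++) {
                        values[id] += inbox.get(id);                       // computation
                        outbox.addAndGet((id + 1) % workers, values[id]);  // communication
                        barrier.await();                                   // barrier
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Three workers, three supersteps: prints [9, 7, 8]
        System.out.println(Arrays.toString(run(3, 3)));
    }
}
```

Because messages are swapped only inside the barrier action, no worker can observe a message sent in the same superstep, which is the property the model relies on.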

Giraph+Gora in ApacheCon14
