Fast, Scalable Graph Processing:
Apache Giraph on YARN
Hello, I'm Eli Reisman!
Eli is...
•  Apache Giraph Committer and PMC Member
•  Apache Tajo Committer
•  Wrote initial port of Giraph to YARN
•  Collaborating with fellow Giraph committers on the
Giraph in Action book for Manning Publications
•  Only able to do all this with the support of:
Eli is a software engineer at Etsy.
Etsy enables non-technical folks to sell
handmade and vintage stuff:
We have a great blog called Code As Craft:
...but enough about me, let's talk Giraph!
Key Topics
What is Apache Giraph?
Why do I need it?
Giraph + MapReduce
Giraph + YARN
Giraph Roadmap
What is Apache Giraph?
Giraph is a framework for performing offline
batch processing of semi-structured graph
data on a massive scale.
Giraph is loosely based upon Google's Pregel
graph processing framework.
Giraph performs iterative calculations on top of an
existing Hadoop cluster.
Giraph uses Apache ZooKeeper to enforce atomic
barrier waits and perform leader election.
[Figure: tasks at a barrier: "Done!" "Done!" "...Still working..."]
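The barrier wait can be illustrated with a small single-process analogy. This is only a sketch: it uses Python's `threading.Barrier` in place of ZooKeeper, and `run_supersteps` and its parameters are hypothetical names chosen for illustration, not part of Giraph.

```python
import threading


def run_supersteps(num_workers, num_supersteps):
    """Run workers that must all finish a step before any starts the next."""
    barrier = threading.Barrier(num_workers)
    log = []                 # (superstep, worker_id) in completion order
    lock = threading.Lock()  # protects the shared log

    def worker(worker_id):
        for step in range(num_supersteps):
            with lock:
                log.append((step, worker_id))  # "Done!" for this step
            barrier.wait()  # block until every worker has reported in

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

No entry for step k+1 can appear in the log before every worker has logged step k, which is the guarantee the barrier provides.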
Giraph benefits from a vibrant Apache community, and is
under active development:
Why do I need it?
Giraph makes graph algorithms easy to reason about
and implement by following the Bulk Synchronous
Parallel (BSP) programming model.
In BSP, all algorithms are implemented from the point
of view of a single vertex in the input graph
performing a single iteration of the computation.
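A minimal sketch of the vertex-centric idea, assuming a toy "propagate the maximum value" algorithm. The names `compute` and `run_bsp` are illustrative and are not Giraph's actual API; the driver loop stands in for the framework.

```python
def compute(vertex_value, messages):
    """One vertex, one superstep: return (new_value, changed)."""
    best = max([vertex_value] + messages)
    return best, best != vertex_value


def run_bsp(values, out_edges, max_supersteps=50):
    """Drive compute() over all vertices, one superstep at a time."""
    values = dict(values)
    # Superstep 0: every vertex announces its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(values[v])
    for _ in range(max_supersteps):
        outbox = {v: [] for v in values}
        any_change = False
        for v in values:
            new_value, changed = compute(values[v], inbox[v])
            values[v] = new_value
            if changed:
                any_change = True
                for t in out_edges.get(v, []):
                    outbox[t].append(new_value)
        if not any_change:
            break  # no vertex changed: the computation has halted
        inbox = outbox  # messages become visible in the next superstep
    return values
```

The algorithm author only writes `compute`; everything else is bookkeeping that a BSP framework would handle.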
•  Giraph makes iterative data processing more
practical for Hadoop users.
•  Giraph can avoid costly disk and network
operations that are mandatory in MapReduce (MR).
•  MR has no concept of message passing.
Each cycle of an iterative calculation on
Hadoop means running a full MapReduce
job.
Let's use simple PageRank as a quick
example:
http://en.wikipedia.org/wiki/PageRank
1. All vertices start with the same PageRank:
1.0
1.0
1.0
2. Each vertex distributes an equal portion of
its PageRank to all neighbors:
0.5
0.5
1
1
3. Each vertex sums its incoming values, multiplies the sum by a
weight factor (0.85 here), and adds a small adjustment:
0.15/(# vertices in graph)
(0.5 * 0.85) + (0.15/3) = 0.475
(1.5 * 0.85) + (0.15/3) = 1.325
(1 * 0.85) + (0.15/3) = 0.9
4. This value becomes each vertex's PageRank
for the next iteration
.43
.21
.64
5. Repeat until convergence:
(change in PR per-iteration < epsilon)
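The five steps above can be sketched in a few lines. The three-vertex graph used in the usage note (A -> B, A -> C, B -> C, C -> A) is an assumption inferred from the edge values on the slides, not something the deck states, and the function names are illustrative. The sketch also assumes every vertex has at least one out-edge.

```python
DAMPING = 0.85  # the weight factor used on the slides


def pagerank_step(ranks, out_edges):
    """One iteration: distribute shares, then sum, weight, and adjust."""
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    for v, targets in out_edges.items():
        share = ranks[v] / len(targets)  # equal portion per neighbor
        for t in targets:
            incoming[t] += share
    return {v: DAMPING * incoming[v] + (1 - DAMPING) / n for v in ranks}


def pagerank(out_edges, epsilon=1e-6, max_iterations=200):
    """Repeat until the largest per-vertex change drops below epsilon."""
    ranks = {v: 1.0 for v in out_edges}  # all vertices start equal
    for _ in range(max_iterations):
        new_ranks = pagerank_step(ranks, out_edges)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks
        if delta < epsilon:
            break
    return ranks
```

With the assumed graph `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}`, the first call to `pagerank_step` gives 0.9 for A, 0.475 for B, and 1.325 for C, matching the arithmetic in step 3.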
Vertices with higher in-degree tend to converge to a
higher PageRank
Put another way:
PageRank on MapReduce
1. Load complete input graph from disk as
[K= Vertex ID, V = out-edges and PR]
[Diagram: Map -> Sort/Shuffle -> Reduce]
2. Emit all input records (full graph state), then
emit [K = edgeTarget, V = share of PR] for each out-edge
3. Sort and Shuffle this entire mess!
4. Sum incoming PR shares for each vertex,
update PR values in graph state records
5. Emit full graph state to disk...
6. ...and start over!
PageRank on MapReduce
•  Awkward to reason about
•  I/O bound despite simple core business logic
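To make the cost concrete, here is a hedged, in-memory sketch of one PageRank iteration written in MapReduce style. It imitates the map, sort/shuffle, and reduce phases with plain Python functions; it is not Hadoop code, and the record layout and the "graph"/"share" tags are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """records: iterable of (vertex_id, (out_edges, rank))."""
    for vertex_id, (out_edges, rank) in records:
        # Re-emit the full graph state so the reducer can rebuild it.
        yield vertex_id, ("graph", out_edges)
        for target in out_edges:
            yield target, ("share", rank / len(out_edges))


def shuffle(pairs):
    """Group every emitted value by key, then sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped, num_vertices, damping=0.85):
    """Sum the incoming shares and write the updated graph record."""
    for vertex_id, values in grouped:
        out_edges, total = [], 0.0
        for kind, payload in values:
            if kind == "graph":
                out_edges = payload
            else:
                total += payload
        new_rank = damping * total + (1 - damping) / num_vertices
        yield vertex_id, (out_edges, new_rank)


def run_iteration(records):
    """One full job; a real run would write this output back to disk."""
    records = list(records)
    return list(reduce_phase(shuffle(map_phase(records)), len(records)))
```

Note how the graph structure itself travels through the shuffle on every iteration even though only the rank values change.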
PageRank on Giraph
1. Hadoop Mappers are "hijacked" to host
Giraph master and worker tasks.
2. Input graph is loaded once, maintaining
code-data locality when possible.
3. All iterations are performed on data in memory,
optionally spilled to disk. Disk access is
linear/scan-based.
4. Output is written from the Mappers hosting
the calculation, and the job run ends.
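For contrast, a rough sketch of the same calculation in the load-once, message-passing style described above. The names `compute` and `run_pagerank` are hypothetical and do not mirror Giraph's Java API; the loop body plays the role of the workers, and swapping the message boxes plays the role of the barrier between iterations.

```python
def compute(rank, messages, num_vertices, superstep, damping=0.85):
    """Per-vertex logic: update the rank from the incoming messages."""
    if superstep > 0:  # no messages have arrived yet in superstep 0
        rank = damping * sum(messages) + (1 - damping) / num_vertices
    return rank


def run_pagerank(out_edges, num_supersteps):
    """Keep the graph in memory; only messages move between steps."""
    ranks = {v: 1.0 for v in out_edges}  # graph is loaded once
    inbox = {v: [] for v in out_edges}
    for superstep in range(num_supersteps):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            ranks[v] = compute(ranks[v], inbox[v], len(out_edges), superstep)
            for t in targets:
                outbox[t].append(ranks[v] / len(targets))
        inbox = outbox  # barrier: messages are delivered next superstep
    return ranks  # output is written once, at the end of the run
```

Unlike the MapReduce version, the edge lists never move; each iteration exchanges only the rank shares.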
This is all well and good, but must we
manipulate Hadoop this way?
Giraph + MapReduce
•  Heap and other resources are set once, globally, for all
Mappers in the computation.
•  No control of which cluster nodes host which tasks.
•  No control over how Mappers are scheduled.
•  Mapper and Reducer slots abstraction is meaningless
for Giraph at best, an artificial limit at worst.
YARN
•  YARN (Yet Another Resource Negotiator) is Hadoop's
next-gen job management platform.
•  Powers MapReduce v2, but is a general purpose
framework that is not tied to the MapReduce paradigm.
•  Offers fine-grained control over each task's resource
allocations and host placement for clients that need it.
YARN Architecture
Giraph + YARN
It's a natural fit!
•  Giraph has maintained compatibility with Hadoop since
the 0.1 release by executing via the MapReduce interface.
•  Giraph has featured a "pure YARN" build profile since
the 1.0 release. It supports Hadoop-2.0.3 and trunk.
*Patches to add 2.0.4 and 2.0.5 support are in review :)
•  Giraph's YARN component is easy to extend or use as
a template to port other projects!
Giraph + YARN: Roadmap
•  YARN Application Master allows for more natural and
stable bootstrapping of Giraph jobs.
•  ZooKeeper management can find a natural home in the
Application Master.
•  Giraph on YARN can stop borrowing from Hadoop and
have its own web interface.
•  Variable per-task resource allocation opens up the
possibility of Supertasks to manage graph supernodes.
•  Ability to spawn or retire tasks per-iteration enables
in-flight reassignment of data partitions.
•  AppMaster-managed utility tasks, such as dedicated
sub-aggregators for tree-like aggregation or data
pre-samplers.
Giraph New Developments
•  Decoupling of logic and graph data means tasks host
computations that are pluggable per-iteration.
•  Support for Giraph job scripting, starting with Jython.
More to follow...
•  A new website, fresh docs, an upcoming Manning book, and
a large, active community mean Giraph has never been
easier to use or contribute to!
Great! Where can I learn more?
http://giraph.apache.org
Mailing List:
user@giraph.apache.org

Mais conteúdo relacionado

Mais procurados

Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
rhatr
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
DataWorks Summit
 

Mais procurados (20)

Hadoop to spark_v2
Hadoop to spark_v2Hadoop to spark_v2
Hadoop to spark_v2
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Elephant in the cloud
Elephant in the cloudElephant in the cloud
Elephant in the cloud
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 
SparkR + Zeppelin
SparkR + ZeppelinSparkR + Zeppelin
SparkR + Zeppelin
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCSpotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 

Destaque

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
rhatr
 
Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...
Mani kandan
 
Graph Sample and Hold: A Framework for Big Graph Analytics
Graph Sample and Hold: A Framework for Big Graph AnalyticsGraph Sample and Hold: A Framework for Big Graph Analytics
Graph Sample and Hold: A Framework for Big Graph Analytics
Nesreen K. Ahmed
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkMartin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Flink Forward
 
Graph theory
Graph theoryGraph theory
Graph theory
Kumar
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
Andy Petrella
 

Destaque (20)

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...Improving personalized recommendations through temporal overlapping community...
Improving personalized recommendations through temporal overlapping community...
 
Graph Sample and Hold: A Framework for Big Graph Analytics
Graph Sample and Hold: A Framework for Big Graph AnalyticsGraph Sample and Hold: A Framework for Big Graph Analytics
Graph Sample and Hold: A Framework for Big Graph Analytics
 
Apache giraph
Apache giraphApache giraph
Apache giraph
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Community detection in graphs
Community detection in graphsCommunity detection in graphs
Community detection in graphs
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databases
 
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache FlinkMartin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Graph theory
Graph theoryGraph theory
Graph theory
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...Recomendation system: Community Detection Based Recomendation System using Hy...
Recomendation system: Community Detection Based Recomendation System using Hy...
 

Semelhante a Fast, Scalable Graph Processing: Apache Giraph on YARN

2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 

Semelhante a Fast, Scalable Graph Processing: Apache Giraph on YARN (20)

Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Guagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoopGuagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoop
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Empire: JPA for RDF & SPARQL
Empire: JPA for RDF & SPARQLEmpire: JPA for RDF & SPARQL
Empire: JPA for RDF & SPARQL
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Apache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduceApache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduce
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Module01
 Module01 Module01
Module01
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 

Mais de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Fast, Scalable Graph Processing: Apache Giraph on YARN

  • 1. Fast, Scalable Graph Processing: Apache Giraph on YARN
  • 2. Hello, I'm Eli Reisman!
  • 3. Eli is...
      • Apache Giraph Committer and PMC Member
      • Apache Tajo Committer
      • Wrote initial port of Giraph to YARN
      • Collaborating with fellow Giraph committers on Giraph in Action book for Manning publishing
  • 4. Eli is...
      • Only able to do all this with the support of:
  • 5. Eli is a software engineer at
  • 6. Etsy enables non-technical folks to sell handmade and vintage stuff: We have a great blog called Code As Craft:
  • 7. ...but, enough about me, let's talk Giraph!
  • 8. Key Topics
      • What is Apache Giraph?
      • Why do I need it?
      • Giraph + MapReduce
      • Giraph + YARN
      • Giraph Roadmap
  • 9. What is Apache Giraph? Giraph is a framework for performing offline batch processing of semi-structured graph data on a massive scale. Giraph is loosely based upon Google's Pregel graph processing framework.
  • 10. What is Apache Giraph? Giraph performs iterative calculations on top of an existing Hadoop cluster.
  • 11. What is Apache Giraph? Giraph uses Apache ZooKeeper to enforce atomic barrier waits and perform leader election. (Figure: workers at a barrier report "Done!", "Done!", "...Still working...")
  • 12. What is Apache Giraph? Giraph benefits from a vibrant Apache community, and is under active development:
  • 13. Why do I need it? Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model. In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation.
  • 14. Why do I need it?
      • Giraph makes iterative data processing more practical for Hadoop users.
      • Giraph can avoid costly disk and network operations that are mandatory in MR.
      • MR has no concept of message passing.
  • 15. Why do I need it? Each cycle of an iterative calculation on Hadoop means running a full MapReduce job.
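The vertex-centric BSP model from slide 13 can be illustrated with a small toy. This is a minimal sketch in plain Python, not Giraph's API: `compute`, `run_bsp`, and the max-value propagation task are hypothetical names and a hypothetical example chosen only to show that the per-vertex logic sees one vertex and its incoming messages, and that messages sent in one superstep are delivered in the next.

```python
from collections import defaultdict


def compute(value, messages):
    """Per-vertex logic for one superstep: adopt the largest value seen.

    Returns (new_value, changed). The function only sees one vertex's value
    and its incoming messages, never the whole graph.
    """
    new_value = max([value, *messages])
    return new_value, new_value != value


def run_bsp(values, out_edges, max_supersteps=50):
    """Run supersteps until no messages are in flight (or the cap is hit)."""
    values = dict(values)
    # Superstep 0: every vertex announces its value to its out-neighbors.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for target in out_edges[vertex]:
            inbox[target].append(value)

    supersteps = 1
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            values[vertex], changed = compute(values[vertex], messages)
            if changed:  # only vertices whose value changed send messages
                for target in out_edges[vertex]:
                    outbox[target].append(values[vertex])
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = outbox
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    # Hypothetical 4-vertex ring: a -> b -> c -> d -> a
    edges = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
    print(run_bsp({"a": 3, "b": 7, "c": 1, "d": 5}, edges))
```

In a real deployment the barrier would be coordinated across machines (slide 11 mentions ZooKeeper for this); here it is simply the point where `outbox` becomes the next `inbox`.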
  • 16. Let's use simple PageRank as a quick example: http://en.wikipedia.org/wiki/PageRank (Figure: three vertices, each labeled 1.0)
  • 17. Step 1: All vertices start with the same PageRank. (Figure: vertex values 1.0, 1.0, 1.0)
  • 18. Step 2: Each vertex distributes an equal portion of its PageRank to all neighbors. (Figure: edge values 0.5, 0.5, 1, 1)
  • 19. Step 3: Each vertex sums its incoming values, multiplies the sum by a weight factor, and adds a small adjustment proportional to 1/(# vertices in graph): (0.5 * 0.85) + (0.15 / 3), (1.5 * 0.85) + (0.15 / 3), (1 * 0.85) + (0.15 / 3)
  • 20. Step 4: This value becomes the vertex's PageRank for the next iteration. (Figure: vertex values .43, .21, .64)
  • 21. Step 5: Repeat until convergence (change in PR per iteration < epsilon).
  • 22. Vertices with higher in-degree converge to a higher PageRank.
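The five steps of the walk-through can be sketched as a few lines of Python. The edge layout below (A to B and C, B to C, C to A) is an assumption, chosen only because it reproduces the incoming sums 0.5, 1.5, and 1 used in the step 3 expressions; the function names are hypothetical. The sketch assumes every vertex has at least one out-edge, and its results follow the step 3 expressions as written, so they need not match the numbers shown in the slide 20 figure.

```python
DAMPING = 0.85  # the "weight factor" from step 3


def pagerank_step(ranks, out_edges):
    """One iteration of the walk-through (steps 2 to 4)."""
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    # Step 2: each vertex splits its PageRank equally among its out-neighbors.
    for vertex, targets in out_edges.items():
        share = ranks[vertex] / len(targets)
        for target in targets:
            incoming[target] += share
    # Step 3: weighted sum of incoming shares plus the 0.15 / n adjustment.
    return {v: incoming[v] * DAMPING + (1 - DAMPING) / n for v in ranks}


def pagerank(out_edges, epsilon=1e-6, max_iterations=200):
    """Steps 1 to 5: start at 1.0 everywhere, stop when the change is below epsilon."""
    ranks = {v: 1.0 for v in out_edges}  # step 1
    for iteration in range(1, max_iterations + 1):
        new_ranks = pagerank_step(ranks, out_edges)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks  # step 4
        if delta < epsilon:  # step 5
            break
    return ranks, iteration


if __name__ == "__main__":
    # Assumed layout: A -> B, A -> C, B -> C, C -> A
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank_step({v: 1.0 for v in graph}, graph))
    print(pagerank(graph))
```

In this assumed graph, C has two in-edges while A and B have one each, which gives a small concrete case for the observation that vertices with higher in-degree tend to end up with a higher PageRank.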
  • 23. Put another way:
  • 24. PageRank on MapReduce, step 1: Load the complete input graph from disk as [K = vertex ID, V = out-edges and PR]. (Slides 24 to 30 each show a Map, Sort/Shuffle, Reduce pipeline diagram.)
  • 25. PageRank on MapReduce, step 2: Emit all input records (full graph state), and emit [K = edgeTarget, V = share of PR].
  • 26. PageRank on MapReduce, step 3: Sort and Shuffle this entire mess!
  • 27. PageRank on MapReduce, step 4: Sum incoming PR shares for each vertex, and update PR values in the graph state records.
  • 28. PageRank on MapReduce, step 5: Emit full graph state to disk...
  • 29. PageRank on MapReduce, step 6: ...and start over!
  • 30. PageRank on MapReduce
      • Awkward to reason about
      • I/O bound despite simple core business logic
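The six MapReduce steps above can be simulated in memory to show why the full graph state has to travel through every phase. This is a hedged sketch in plain Python, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase`, and the `"STATE"`/`"SHARE"` tags are hypothetical names, and the sketch assumes every edge target also appears as a key in the input records.

```python
from collections import defaultdict

DAMPING = 0.85


def map_phase(records):
    """Map: re-emit each vertex's graph state and emit a PR share per out-edge."""
    for vertex, (out_edges, rank) in records.items():
        yield vertex, ("STATE", out_edges)  # the full graph state travels too
        for target in out_edges:
            yield target, ("SHARE", rank / len(out_edges))


def shuffle(pairs):
    """Sort/Shuffle: group every emitted value by key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped, num_vertices):
    """Reduce: sum the incoming shares and rebuild each graph state record."""
    output = {}
    for vertex, values in grouped.items():
        out_edges = next(payload for tag, payload in values if tag == "STATE")
        total = sum(payload for tag, payload in values if tag == "SHARE")
        output[vertex] = (out_edges, total * DAMPING + (1 - DAMPING) / num_vertices)
    return output


def pagerank_iteration(records):
    """One full job: the output has the same shape as the input, ready to start over."""
    return reduce_phase(shuffle(map_phase(records)), len(records))


if __name__ == "__main__":
    # Assumed 3-vertex input: vertex ID -> (out-edges, PageRank)
    state = {"A": (["B", "C"], 1.0), "B": (["C"], 1.0), "C": (["A"], 1.0)}
    for _ in range(3):  # on Hadoop, each pass would be a separate job with disk I/O
        state = pagerank_iteration(state)
    print(state)
```

Note that the `"STATE"` records carry no new information from one pass to the next; they are emitted and shuffled only so the reducer can write them back out, which is the overhead slide 30 calls I/O bound.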
  • 31. PageRank on Giraph, step 1: Hadoop Mappers are "hijacked" to host Giraph master and worker tasks. (Slides 31 to 34 each show the Map, Sort/Shuffle, Reduce pipeline diagram.)
  • 32. PageRank on Giraph, step 2: Input graph is loaded once, maintaining code-data locality when possible.
  • 33. PageRank on Giraph, step 3: All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based.
  • 34. PageRank on Giraph, step 4: Output is written from the Mappers hosting the calculation, and the job run ends.
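For contrast with the MapReduce flow, the load-once, iterate-in-memory, write-once pattern described in slides 31 to 34 might look like the following single-process sketch. It is not Giraph's API: the `Vertex` class, its `compute` method, and `run_job` are hypothetical stand-ins, and the sketch assumes every vertex has at least one out-edge and every edge target is a vertex in the graph.

```python
from dataclasses import dataclass

DAMPING = 0.85


@dataclass
class Vertex:
    vertex_id: str
    out_edges: list
    rank: float = 1.0

    def compute(self, messages, num_vertices, superstep):
        """Update this vertex from its inbox, then return messages for its neighbors."""
        if superstep > 0:  # superstep 0 only sends the initial shares
            self.rank = sum(messages) * DAMPING + (1 - DAMPING) / num_vertices
        share = self.rank / len(self.out_edges)
        return [(target, share) for target in self.out_edges]


def run_job(edge_lists, iterations):
    """Run the given number of PageRank iterations entirely in memory."""
    # Load once: the graph stays in memory for the whole job.
    vertices = {vid: Vertex(vid, list(targets)) for vid, targets in edge_lists.items()}
    inbox = {vid: [] for vid in vertices}
    for superstep in range(iterations + 1):
        next_inbox = {vid: [] for vid in vertices}
        for vertex in vertices.values():
            outgoing = vertex.compute(inbox[vertex.vertex_id], len(vertices), superstep)
            for target, share in outgoing:
                next_inbox[target].append(share)
        inbox = next_inbox  # barrier between supersteps, no disk round trip
    # Write once: the output comes from the same in-memory state.
    return {vid: vertex.rank for vid, vertex in vertices.items()}


if __name__ == "__main__":
    print(run_job({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, iterations=10))
```

Only the PageRank shares move between supersteps; the edge lists are never re-emitted or re-sorted, which is the main difference from the per-iteration MapReduce job.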
  • 35. This is all well and good, but must we manipulate Hadoop this way?
  • 36. Giraph + MapReduce
      • Heap and other resources are set once, globally, for all Mappers in the computation.
      • No control of which cluster nodes host which tasks.
      • No control over how Mappers are scheduled.
      • The Mapper and Reducer slot abstraction is meaningless for Giraph at best, and an artificial limit at worst.
  • 37. YARN
      • YARN (Yet Another Resource Negotiator) is Hadoop's next-gen job management platform.
      • Powers MapReduce v2, but is a general-purpose framework that is not tied to the MapReduce paradigm.
      • Offers fine-grained control over each task's resource allocations and host placement for clients that need it.
  • 38. YARN Architecture
  • 39. Giraph + YARN: It's a natural fit!
  • 40. Giraph + YARN
      • Giraph has maintained compatibility with Hadoop since the 0.1 release by executing via the MapReduce interface.
      • Giraph has featured a "pure YARN" build profile since the 1.0 release. It supports Hadoop-2.0.3 and trunk. *Patches to add 2.0.4 and 2.0.5 support are in review :)
      • Giraph's YARN component is easy to extend or use as a template to port other projects!
  • 41. Giraph + YARN: Roadmap
      • YARN Application Master allows for more natural and stable bootstrapping of Giraph jobs.
      • ZooKeeper management can find a natural home in the Application Master.
      • Giraph on YARN can stop borrowing from Hadoop and have its own web interface.
  • 42. Giraph + YARN: Roadmap
      • Variable per-task resource allocation opens up the possibility of Supertasks to manage graph supernodes.
      • Ability to spawn or retire tasks per-iteration enables in-flight reassignment of data partitions.
      • AppMaster-managed utility tasks such as dedicated sub-aggregators for tree-like aggregation, or data pre-samplers.
  • 43. Giraph New Developments
      • Decoupling of logic and graph data means tasks host computations that are pluggable per-iteration.
      • Support for Giraph job scripting, starting with Jython. More to follow...
      • New website, fresh docs, upcoming Manning book, and large, active community means Giraph has never been easier to use or contribute to!
  • 44. Great! Where can I learn more? http://giraph.apache.org Mailing List: user@giraph.apache.org