Scaling Big Data with
Hadoop And Mesos
Bernardo Gomez Palacio
Software Engineer at Guavus Inc
Beyond Buzz Words
Mesos and Data Analysis
Yes, you don't need Hadoop to start using Mesos and Spark.
Now, If You...
- Need to store large files? By default each block is 128MB.
- Write data mainly as new files or by appending to existing ones?
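To make the block size concrete, here is a minimal sketch of the arithmetic, assuming the 128MB default quoted above. The object and method names are hypothetical, and the last block of a file only occupies the bytes it actually holds.

```scala
object HdfsBlocks {
  // 128MB default block size, as quoted on the slide (configurable per cluster or file)
  val DefaultBlockSizeBytes: Long = 128L * 1024 * 1024

  // Number of blocks a file of the given size is split into (ceiling division)
  def blocksFor(fileSizeBytes: Long, blockSizeBytes: Long = DefaultBlockSizeBytes): Long =
    if (fileSizeBytes <= 0) 0L
    else (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes
}
```

Under this assumption a 1 GiB file spans 8 blocks, and one extra byte adds a ninth.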
Convinced you want to jump on the
Hadoop bandwagon?
Read
Sammer, Eric. "Hadoop Operations." Sebastopol, CA:
O'Reilly, 2012. Print.
Welcome to the Jungle
Version Hell
Distributions
Apache Bigtop, CDH, HDP, MapR
Hadoop
HDFS
MRV1
MRV2
Assuming You Already Have Mesos
- Mesosphere Packages
  - https://mesosphere.io/downloads/
- From Source
  - https://github.com/apache/mesos
Hadoop MRV1 in Mesos
https://github.com/mesos/hadoop
Hadoop MRV1 in Mesos
- Requires Hadoop MRV1
- Officially works with CDH5 MRV1
- Apache Hadoop 0.22, 0.23 and 1+
- Apache Hadoop 2+ doesn't come with MRV1!
Hadoop MRV1 in Mesos
- Requires a JobTracker.
- By default uses the org.apache.hadoop.mapred.JobQueueTaskScheduler
- You can change it, e.g. to ...mapred.FairScheduler
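As a hedged illustration of where the scheduler choice lives, the sketch below renders a mapred-site.xml fragment from a small property list. The property names are taken from the mesos/hadoop project README as best recalled and should be verified against the version in use; the master address is a placeholder, and the object and helper names are hypothetical.

```scala
object MesosMapredSite {
  // Assumed property names (check the mesos/hadoop README); values are illustrative.
  val properties: Seq[(String, String)] = Seq(
    // The JobTracker delegates scheduling to the Mesos-aware scheduler
    "mapred.jobtracker.taskScheduler" -> "org.apache.hadoop.mapred.MesosScheduler",
    // The Hadoop scheduler the Mesos scheduler wraps; JobQueueTaskScheduler is the default
    "mapred.mesos.taskScheduler" -> "org.apache.hadoop.mapred.FairScheduler",
    // Placeholder Mesos master address
    "mapred.mesos.master" -> "zk://zk-host:2181/mesos"
  )

  // Render the properties as a Hadoop <configuration> document
  def render(props: Seq[(String, String)]): String = {
    val body = props.map { case (name, value) =>
      s"  <property>\n    <name>$name</name>\n    <value>$value</value>\n  </property>"
    }.mkString("\n")
    s"<configuration>\n$body\n</configuration>"
  }
}
```

Calling `MesosMapredSite.render(MesosMapredSite.properties)` yields text that could be pasted into mapred-site.xml once the values are adapted.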
Hadoop MRV1 in Mesos
- Requires TaskTracker.
- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.
- And not org.apache.hadoop.mapred.TaskTracker.java.
How Does Hadoop MRV1 Run in
Mesos?
How does Hadoop MRV1 in Mesos work?
1. The framework's Mesos scheduler creates the Job Tracker as part of the driver.
2. The Job Tracker will use org.apache.hadoop.mapred.MesosScheduler to launch tasks.
Mesos Hadoop Task Scheduling
- mapred.mesos.slot.cpus (1)
- mapred.mesos.slot.disk (1024MB)
- mapred.mesos.slot.mem (1024MB)
Additional Mesos parameters
- mapred.mesos.checkpoint (false)
- mapred.mesos.role (*)
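The slot sizes above determine how many map or reduce slots fit into a Mesos resource offer. The following is a simplified sketch of that arithmetic using the defaults listed; it ignores the resources the TaskTracker itself needs, and all names in it are hypothetical rather than part of the framework.

```scala
object SlotMath {
  final case class Resources(cpus: Double, memMb: Double, diskMb: Double)

  // Defaults from the slide: 1 CPU, 1024MB memory, 1024MB disk per slot
  val defaultSlot: Resources = Resources(cpus = 1.0, memMb = 1024.0, diskMb = 1024.0)

  // A slot needs all three resources, so the scarcest one bounds the count
  def slotsThatFit(offer: Resources, slot: Resources = defaultSlot): Int = {
    val byCpu  = math.floor(offer.cpus / slot.cpus)
    val byMem  = math.floor(offer.memMb / slot.memMb)
    val byDisk = math.floor(offer.diskMb / slot.diskMb)
    Seq(byCpu, byMem, byDisk).min.toInt
  }
}
```

For example, an offer of 8 CPUs, 6144MB of memory and 100000MB of disk is bounded by memory and fits 6 slots under these assumptions.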
Thoughts
What about Hadoop 2.4?
Namenode HA?
MRV2 and YARN?
Personal Preference
- Use Hadoop 2.4.0 or above.
- Name Node HA through the Quorum Journal Manager.
- Move to Spark if possible.
Example of a Mesos Data Analysis
Stack
1. HDFS stores files.
2. Use the Spark CLI to test ideas.
3. Use Spark Submit for jobs.
4. Use Chronos or Oozie to schedule workflows.
Spark On Mesos
Spark On Mesos
https://spark.apache.org/docs/latest/img/cluster-overview.png
Know that Each Spark Application
1. Has its own driving process.
2. Has its own RDDs
3. Has its own cache.
Spark Schedulers on Mesos
Fine Grained
Coarse Grained
Spark Fine Grained Scheduling
- Enabled by default.
- Each Spark task runs as a separate Mesos task.
- Has an overhead in launching each task.
Spark Coarse Grained Scheduling
- Uses only one long-running Spark task on each Mesos slave.
- Dynamically schedules its own "mini-tasks", using Akka.
- Lower startup overhead.
- Reserves the cluster resources for the complete duration of the application.
Beware of...
- Greedy scheduling (Coarse Grained)
- Overcommitting and deadlocks (Fine Grained)
Using Spark
Understand Parametrization and Usage
- spark.app.name
- spark.executor.memory
- spark.serializer
- spark.local.dir
- ...
Use Spark Submit
Avoid parametrizing the Spark Context in your code as much as possible.
Leverage the spark-submit arguments, properties files as well as environment variables to configure your application.
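One reason to keep settings out of the code is precedence: per the Spark configuration documentation, properties set directly on the SparkConf win over spark-submit flags, which in turn win over the properties file. The sketch below models that layering with plain maps; it does not use Spark, and the object and method names are hypothetical.

```scala
object SparkConfigLayers {
  // Later layers override earlier ones: properties file < spark-submit flags < values set in code
  def resolve(
      propertiesFile: Map[String, String],
      submitFlags: Map[String, String],
      setInCode: Map[String, String]): Map[String, String] =
    propertiesFile ++ submitFlags ++ setInCode

  val fileDefaults: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.memory" -> "2g"
  )
  val flags: Map[String, String] = Map("spark.executor.memory" -> "4g")

  // Nothing hard-coded: the operator's flag is honored
  val tunable: Map[String, String] = resolve(fileDefaults, flags, Map.empty)

  // A hard-coded value silently overrides whatever the operator passes
  val pinned: Map[String, String] =
    resolve(fileDefaults, flags, Map("spark.executor.memory" -> "1g"))
}
```

In this model `tunable` resolves the executor memory to the value from the flag, while `pinned` ignores it, which is why hard-coded settings make an application harder to tune per cluster.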
Using Spark
Accept That Tuning is a
Science & an Art
Understand and Tune Your Applications
- Know your Working Set.
- Understand Spark Partitioning and Block management.
- Define your Spark workflow and where to cache/persist.
- If you cache you will serialize, so use Kryo.
Example Spark API PairRDDFunctions
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
numPartitions: Int): RDD[(K, C)]
PairRDDFunctions.combineByKey
- Combines the elements for each key using a custom set of aggregations.
- RDD[(K, V)] to RDD[(K, C)]
PairRDDFunctions.combineByKey
- createCombiner: turns a V into a C.
- mergeValue: merges a V into a C.
- mergeCombiners: combines two C's into a single one.
The partitioner defaults to HashPartitioner.
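To show how the three functions cooperate, here is a local model of combineByKey over plain Scala collections, where each inner Seq stands in for one partition. It is a sketch of the semantics only, not Spark's implementation, and it needs no Spark dependency; the per-key average is an illustrative use.

```scala
object CombineByKeyModel {
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Inside a partition: the first value of a key creates a combiner, later values merge into it
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }
    }
    // Across partitions: combiners of the same key are merged pairwise
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
      }
    }
  }

  // Per-key average, using a (sum, count) pair as the combiner type C
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](
      partitions,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> (sum / n) }
}
```

The combiner type differs from the value type here, which is the case combineByKey is designed for: an average cannot be merged directly, but a (sum, count) pair can.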
Example Spark API PairRDDFunctions
self: RDD[(K, V)]
def aggregateByKey[U: ClassTag](zeroValue: U)(
seqOp: (U, V) => U,
combOp: (U, U) => U
): RDD[(K, U)]
Uses the default partitioner.
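The same kind of local model can illustrate aggregateByKey: the zero value seeds each key within a partition, seqOp folds values into it, and combOp merges the per-partition results. This is a hedged sketch over plain collections, not Spark code, and the names are hypothetical.

```scala
object AggregateByKeyModel {
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Inside a partition: start each key from zeroValue and fold its values in with seqOp
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions: merge the partial results of each key with combOp
    perPartition.foldLeft(Map.empty[K, U]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, u)) =>
        a.updated(k, a.get(k).map(prev => combOp(prev, u)).getOrElse(u))
      }
    }
  }

  // Distinct values per key, with an empty Set as the zero value
  def distinctPerKey(partitions: Seq[Seq[(String, Int)]]): Map[String, Set[Int]] =
    aggregateByKey(partitions, Set.empty[Int])((s, v) => s + v, (a, b) => a ++ b)
}
```

Because the zero value may be applied once per key in every partition, it should be neutral for combOp, as the empty set is for union.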
Understand your Data
Tune your Data
- For each data source, understand its optimal block size.
- Leverage Avro as the serialization format.
- Leverage Parquet as the storage format.
- Try to keep your Avro & Parquet schemas flat.
Suggestions
Each Application
- Instrument the code.
- Measure input size in number of records and byte size.
- Measure output size in the same way.
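A minimal sketch of such a measurement, assuming text records whose byte size is taken as their UTF-8 length. The names are hypothetical; in a real job the same counting would typically run per partition and be reported through whatever metrics system is in place.

```scala
import java.nio.charset.StandardCharsets

object SizeMetrics {
  final case class Size(records: Long, bytes: Long)

  // Count records and their UTF-8 encoded bytes in a single pass
  def measure(lines: Iterator[String]): Size =
    lines.foldLeft(Size(0L, 0L)) { (acc, line) =>
      Size(acc.records + 1, acc.bytes + line.getBytes(StandardCharsets.UTF_8).length)
    }
}
```

Applying the same function to input and output gives comparable numbers for both sides of a job.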
Standardize
- JDK & JRE version across your cluster.
- The Spark version across your cluster.
- The libraries that will be added to the JVM classpath by default.
- A packaging strategy for your application, e.g. an uber jar.
About YARN and Spark
Some Differences with YARN
- Execution: cluster vs client modes.
- Isolation: process vs cgroups.
- Docker support? LXC templates?
- Deployment complexity?
Wrapping Up
Some Ideas..
References
1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014.
2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.
References
1. Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print.
2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
References
1. Ryza, Sandy. "Managing Multiple Resources in Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014.
Thank you! ✌

 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 

Scaling Big Data with Hadoop and Mesos

  • 1. Scaling Big Data with Hadoop And Mesos
  • 2. Bernardo Gomez Palacio, Software Engineer at Guavus Inc
  • 4. Mesos and Data Analysis: Yes, you don't need Hadoop to start using Mesos and Spark.
  • 5. Now, If You... Need to store large files? By default each block is 128 MB. Is your data written mainly as new files or by appending to existing ones?
  • 6. Convinced you want to jump on the Hadoop bandwagon? Read: Sammer, Eric. "Hadoop Operations." Sebastopol, CA: O'Reilly, 2012. Print.
  • 7. Welcome to the Jungle
  • 11. Assuming You Already Have Mesos: Mesosphere packages (https://mesosphere.io/downloads/) or from source (https://github.com/apache/mesos).
  • 12. Hadoop MRV1 in Mesos: https://github.com/mesos/hadoop
  • 13. Hadoop MRV1 in Mesos: Requires Hadoop MRV1; officially works with CDH5 MRV1; Apache Hadoop 0.22, 0.23 and 1+; Apache Hadoop 2+ doesn't come with MRV1!
  • 14. Hadoop MRV1 in Mesos: Requires a JobTracker. By default it uses the org.apache.hadoop.mapred.JobQueueTaskScheduler. You can change it, e.g. to ...mapred.FairScheduler.
  • 15. Hadoop MRV1 in Mesos: Requires a TaskTracker. That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker, and not org.apache.hadoop.mapred.TaskTracker.java.
  • 16. How Does Hadoop MRV1 Run in Mesos?
  • 17. How does Hadoop MRV1 in Mesos work? 1. The framework's Mesos scheduler creates the JobTracker as part of the driver. 2. The JobTracker uses org.apache.hadoop.mapred.MesosScheduler to launch tasks.
  • 18. Mesos Hadoop Task Scheduling: mapred.mesos.slot.cpus (1); mapred.mesos.slot.disk (1024MB); mapred.mesos.slot.mem (1024MB)
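To make the slot parameters on slide 18 concrete, here is a minimal Scala sketch of how a fixed slot size bounds the number of slots that fit in one Mesos resource offer. It is a simplified model and not the framework's actual scheduling code: the `Offer` and `SlotSize` types and the `slotsIn` helper are hypothetical names, the sketch assumes the parenthesized values on the slide are defaults, and it ignores any resources the TaskTracker process itself may need.

```scala
// Simplified model: how many fixed-size slots fit in one Mesos resource offer.
// Offer, SlotSize and slotsIn are hypothetical names introduced for this sketch.
object SlotMath {
  final case class Offer(cpus: Double, memMb: Double, diskMb: Double)

  // Assumed defaults, taken from the parenthesized values on the slide.
  final case class SlotSize(cpus: Double = 1.0, memMb: Double = 1024.0, diskMb: Double = 1024.0)

  // The scarcest resource bounds the slot count.
  def slotsIn(offer: Offer, slot: SlotSize = SlotSize()): Int = {
    val byCpu  = math.floor(offer.cpus / slot.cpus)
    val byMem  = math.floor(offer.memMb / slot.memMb)
    val byDisk = math.floor(offer.diskMb / slot.diskMb)
    math.max(0, math.min(byCpu, math.min(byMem, byDisk)).toInt)
  }
}
```

Under this model an offer of 8 CPUs, 6144 MB of memory and 102400 MB of disk holds six slots, because memory runs out first. Raising `mapred.mesos.slot.mem` would shrink that number further.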
  • 19. Additional Mesos parameters: mapred.mesos.checkpoint (false); mapred.mesos.role (*)
  • 20. Thoughts: What about Hadoop 2.4? NameNode HA? MRV2 and YARN?
  • 21. Personal Preference: Use Hadoop 2.4.0 or above. NameNode HA through the Quorum Journal Manager. Move to Spark if possible.
  • 22. Example of a Mesos Data Analysis Stack: 1. HDFS stores files. 2. Use the Spark CLI to test ideas. 3. Use Spark Submit for jobs. 4. Use Chronos or Oozie to schedule workflows.
  • 25. Know that Each Spark Application: 1. Has its own driver process. 2. Has its own RDDs. 3. Has its own cache.
  • 26. Spark Schedulers on Mesos: Fine Grained and Coarse Grained
  • 27. Spark Fine Grained Scheduling: Enabled by default. Each Spark task runs as a separate Mesos task. Has an overhead in launching each task.
  • 28. Spark Coarse Grained Scheduling: Uses only one long-running Spark task on each Mesos slave. Dynamically schedules its own “mini-tasks”, using Akka. Lower startup overhead. Reserves the cluster resources for the complete duration of the application.
  • 29. Beware of... Greedy scheduling (Coarse Grained). Overcommitting and deadlocks (Fine Grained).
  • 30. Using Spark: Understand Parametrization and Usage: spark.app.name, spark.executor.memory, spark.serializer, spark.local.dir, ....
  • 31. Use Spark Submit: Avoid parametrizing the SparkContext in your code as much as possible. Leverage the spark-submit arguments, properties files, and environment variables to configure your application.
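One reason for the advice on slide 31 is Spark's documented precedence order: values set on the SparkConf in code override spark-submit flags, which in turn override the properties file. The sketch below models that order with plain Scala maps. It does not call Spark itself, and the `ConfPrecedence` object and the sample values are hypothetical.

```scala
// Plain-Scala model of Spark's configuration precedence (no Spark dependency).
// ConfPrecedence and the sample values are illustrative assumptions.
object ConfPrecedence {
  // Later maps win on key collisions:
  // properties file < spark-submit flags < values set on SparkConf in code.
  def effective(
      defaultsFile: Map[String, String],
      submitFlags: Map[String, String],
      setInCode: Map[String, String]): Map[String, String] =
    defaultsFile ++ submitFlags ++ setInCode

  // Cluster-wide defaults an operator might keep in a properties file.
  val defaultsFile: Map[String, String] = Map(
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.memory" -> "2g")

  // Per-job override passed on the spark-submit command line.
  val submitFlags: Map[String, String] = Map("spark.executor.memory" -> "4g")
}
```

With nothing hard-coded, the job runs with the operator's `4g`. A `spark.executor.memory` value set in application code would silently win over both the flag and the file, which is why keeping the code free of such settings leaves tuning in the hands of whoever submits the job.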
  • 32. Using Spark: Accept That Tuning Is a Science & an Art
  • 33. Understand and Tune Your Applications: Know your working set. Understand Spark partitioning and block management. Define your Spark workflow and where to cache/persist. If you cache you will serialize, so use Kryo.
  • 34. Example Spark API, PairRDDFunctions: def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
  • 35. PairRDDFunctions.combineByKey: Combines the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into an RDD[(K, C)].
  • 36. PairRDDFunctions.combineByKey: createCombiner turns a V into a C; mergeValue merges a V into a C; mergeCombiners combines two C's into a single one. The partitioner defaults to HashPartitioner.
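The three functions described on slides 34 to 36 can be sketched with plain Scala collections. This is a local model of the semantics, not Spark's implementation: each inner `Seq` stands in for one partition, and `CombineByKeyModel` and `averages` are hypothetical names. The example computes a per-key average, where the combiner type `C` is a `(sum, count)` pair.

```scala
// Local model of combineByKey semantics using plain collections (no Spark).
// Each inner Seq plays the role of one partition of an RDD[(K, V)].
object CombineByKeyModel {
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: within each partition, build one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }
    }
    // Step 2: merge the combiners for the same key across partitions.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
      }
    }
  }

  // Per-key average: the combiner is a (sum, count) pair.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](
      partitions,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> (sum / n) }
}
```

Because values are merged inside each partition before anything crosses partitions, only one small combiner per key and partition would need to be shuffled in a real cluster.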
  • 37. Example Spark API, PairRDDFunctions (self: RDD[(K, V)]): def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]. Uses the default partitioner.
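The aggregateByKey signature on slide 37 can be modeled the same way. The sketch below is a plain-Scala stand-in, not Spark code: inner sequences represent partitions, and `AggregateByKeyModel` and `distinctValues` are hypothetical names. It shows the main attraction of the call, namely that the result type `U` (here a `Set[Int]`) can differ from the value type `V`.

```scala
// Local model of aggregateByKey semantics using plain collections (no Spark).
object AggregateByKeyModel {
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // seqOp folds values into an accumulator that starts at zeroValue,
    // once per key within each partition.
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // combOp merges the per-partition accumulators of the same key.
    perPartition.foldLeft(Map.empty[K, U]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, u)) =>
        a.updated(k, a.get(k).map(prev => combOp(prev, u)).getOrElse(u))
      }
    }
  }

  // Distinct values per key: U is a Set[Int], V is an Int.
  def distinctValues(partitions: Seq[Seq[(String, Int)]]): Map[String, Set[Int]] =
    aggregateByKey(partitions, Set.empty[Int])(_ + _, _ ++ _)
}
```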
  • 39. Tune your Data: For each data source, understand its optimal block size. Leverage Avro as the serialization format. Leverage Parquet as the storage format. Try to keep your Avro & Parquet schemas flat.
  • 41. Each Application: Instrument the code. Measure input size in number of records and byte size. Measure output size in the same way.
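As a minimal illustration of the measurements on slide 41, the sketch below counts records and bytes for a collection of text lines. It assumes text records encoded as UTF-8, and `IoMetrics`, `Size` and `measure` are hypothetical names. In a Spark job the same counts would more likely be gathered with accumulators or job metrics.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper: measure a data set by record count and byte size.
object IoMetrics {
  final case class Size(records: Long, bytes: Long) {
    def +(other: Size): Size = Size(records + other.records, bytes + other.bytes)
  }

  // Counts each line as one record and sums the UTF-8 encoded lengths.
  def measure(lines: Iterable[String]): Size =
    lines.foldLeft(Size(0L, 0L)) { (acc, line) =>
      acc + Size(1L, line.getBytes(StandardCharsets.UTF_8).length.toLong)
    }
}
```

Calling `measure` on both the input and the output of a stage gives the two figures the slide asks for, in the same units, so they can be compared across runs.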
  • 42. Standardize: The JDK & JRE version across your cluster. The Spark version across your cluster. The libraries that will be added to the JVM classpath by default. A packaging strategy for your application, such as an uber jar.
  • 43. About YARN and Spark
  • 44. Some Differences with YARN: Execution: cluster vs client modes. Isolation: process vs cgroups. Docker support? LXC templates? Deployment complexity?
  • 47. References: 1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014. 2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.
  • 48. References: 1. Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print. 2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014. 3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
  • 49. References: 1. Ryza, Sandy. "Managing Multiple Resources in Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014.