SlideShare uma empresa Scribd logo
1 de 50
Baixar para ler offline
● Data science at Cloudera 
● Recently lead Apache Spark development at 
Cloudera 
● Before that, committing on Apache YARN 
and MapReduce 
● Hadoop project management committee
com.esotericsoftware.kryo. 
KryoException: Unable to find class: 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC 
$$iwC$$iwC$$anonfun$4$$anonfun$apply$3
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on 
host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by 
zero 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) 
[...] 
Driver stacktrace: 
at org.apache.spark.scheduler.DAGScheduler. 
org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages 
(DAGScheduler.scala:1033) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. 
apply(DAGScheduler.scala:1017) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. 
apply(DAGScheduler.scala:1015) 
[...]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on 
host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by 
zero 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) 
[...]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on 
host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by 
zero 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) 
[...]
val file = sc.textFile("hdfs://...") 
file.filter(_.startsWith(“banana”)) 
.count()
Job 
Stage 
Task Task 
Task Task 
Stage 
Task Task 
Task Task
val rdd1 = sc.textFile(“hdfs://...”) 
.map(someFunc) 
.filter(filterFunc) 
textFile map 
filter
val rdd2 = sc.hadoopFile(“hdfs: 
//...”) 
.groupByKey() 
.map(someOtherFunc) 
hadoopFile groupByKey map
val rdd3 = rdd1.join(rdd2) 
.map(someFunc) 
join map
rdd3.collect()
textFile map filter 
hadoop 
group 
File ByKey 
map 
join map
textFile map filter 
hadoop 
group 
File ByKey 
map 
join map
Stage 
Task Task 
Task Task
Stage 
Task Task 
Task Task
Stage 
Task Task 
Task Task
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on 
host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by 
zero 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) 
[...]
14/04/22 11:59:58 ERROR executor.Executor: Exception in task ID 286 6 
java.io.IOException: Filesystem closed 
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565 ) 
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:64 8) 
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706 ) 
at java.io.DataInputStream.read(DataInputStream.java:100 ) 
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:20 9) 
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173 ) 
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:20 6) 
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:4 5) 
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164 ) 
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149 ) 
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71 ) 
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:2 7) 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327 ) 
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388 ) 
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388 ) 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327 ) 
at scala.collection.Iterator$class.foreach(Iterator.scala:727 ) 
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157 ) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:16 1) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:10 2) 
at org.apache.spark.scheduler.Task.run(Task.scala:53 ) 
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:21 1) 
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:4 2) 
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:4 1) 
at java.security.AccessController.doPrivileged(Native Method ) 
at javax.security.auth.Subject.doAs(Subject.java:415 ) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:140 8) 
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:4 1) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176 ) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:114 5) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:61 5) 
at java.lang.Thread.run(Thread.java:724)
ResourceManager 
NodeManager NodeManager 
Container Container 
Application 
Master 
Container 
Client
ResourceManager 
Client 
NodeManager NodeManager 
Container 
Map Task 
Container 
Application 
Master 
Container 
Reduce Task
ResourceManager 
Client 
NodeManager NodeManager 
Container 
Map Task 
Container 
Application 
Master 
Container 
Reduce Task
ResourceManager 
Client 
NodeManager NodeManager 
Container 
Map Task 
Container 
Application 
Master 
Container 
Reduce Task
Container [pid=63375, 
containerID=container_1388158490598_0001_01_00 
0003] is running beyond physical memory 
limits. Current usage: 2.1 GB of 2 GB physical 
memory used; 2.8 GB of 4.2 GB virtual memory 
used. Killing container.
yarn.nodemanager.resource.memory-mb 
Executor container 
spark.yarn.executor.memoryOverhead 
spark.executor.memory 
spark.shuffle.memoryFraction 
spark.storage.memoryFraction
ExternalAppend 
OnlyMap 
Block 
Block 
deserialize 
deserialize
ExternalAppend 
OnlyMap 
key1 -> values 
key2 -> values 
key3 -> values
ExternalAppend 
OnlyMap 
key1 -> values 
key2 -> values 
key3 -> values
ExternalAppend 
OnlyMap 
Sort & Spill 
key1 -> values 
key2 -> values 
key3 -> values
rdd.reduceByKey(reduceFunc, 
numPartitions=1000)
java.io.FileNotFoundException: 
/dn6/spark/local/spark-local- 
20140610134115- 
2cee/30/merged_shuffle_0_368_14 (Too many 
open files)
Task 
Task 
Write stuff 
key1 -> values 
key2 -> values 
key3 -> values 
out 
key1 -> values 
key2 -> values 
key3 -> values Task
Task 
Task 
Write stuff 
key1 -> values 
key2 -> values 
key3 -> values 
out 
key1 -> values 
key2 -> values 
key3 -> values Task
Partition 1 
File 
Partition 2 
File 
Partition 3 
File 
Records
Records 
Buffer
Single file 
Buffer 
Sort & Spill 
Partition 1 
Records 
Partition 2 
Records 
Partition 3 
Records 
Index file
conf.set(“spark.shuffle.manager”, 
SORT)
● No 
● Distributed systems are complicated
Spark and Hadoop Data Scientist

Mais conteúdo relacionado

Mais procurados

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkDatabricks
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native WayMigrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native WayDatabricks
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 

Mais procurados (20)

Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native WayMigrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
Migrating Airflow-based Apache Spark Jobs to Kubernetes – the Native Way
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 

Destaque

Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageSandeep Patil
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...gethue
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDataWorks Summit
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 

Destaque (17)

Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
 
SocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean ManualSocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean Manual
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Proxy Servers
Proxy ServersProxy Servers
Proxy Servers
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Proxy Server
Proxy ServerProxy Server
Proxy Server
 

Semelhante a Spark and Hadoop Data Scientist

Why is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaWhy is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaJack Gudenkauf
 
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaWhy is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaData Con LA
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingCloudera, Inc.
 
把鐵路開進視窗裡
把鐵路開進視窗裡把鐵路開進視窗裡
把鐵路開進視窗裡Wei Jen Lu
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogVadim Semenov
 
Openshift operator insight
Openshift operator insightOpenshift operator insight
Openshift operator insightRyan ZhangCheng
 
A jar-nORM-ous Task
A jar-nORM-ous TaskA jar-nORM-ous Task
A jar-nORM-ous TaskErin Dees
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarDatabricks
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
Real World Lessons on the Pain Points of Node.JS Application
Real World Lessons on the Pain Points of Node.JS ApplicationReal World Lessons on the Pain Points of Node.JS Application
Real World Lessons on the Pain Points of Node.JS ApplicationBen Hall
 
Harmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and PuppetHarmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and PuppetAchieve Internet
 
introduction-infra-as-a-code using terraform
introduction-infra-as-a-code using terraformintroduction-infra-as-a-code using terraform
introduction-infra-as-a-code using terraformniyof97
 
Innovation and Security in Ruby on Rails
Innovation and Security in Ruby on RailsInnovation and Security in Ruby on Rails
Innovation and Security in Ruby on Railstielefeld
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xincaidezhi655
 
Scaling Docker Containers using Kubernetes and Azure Container Service
Scaling Docker Containers using Kubernetes and Azure Container ServiceScaling Docker Containers using Kubernetes and Azure Container Service
Scaling Docker Containers using Kubernetes and Azure Container ServiceBen Hall
 
Troubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support EngineerTroubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support EngineerJeff Anderson
 
Troubleshooting Tips from a Docker Support Engineer - Jeff Anderson, Docker
Troubleshooting Tips from a Docker Support Engineer - Jeff Anderson, DockerTroubleshooting Tips from a Docker Support Engineer - Jeff Anderson, Docker
Troubleshooting Tips from a Docker Support Engineer - Jeff Anderson, DockerDocker, Inc.
 
Replatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with MinecraftReplatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with MinecraftVMware Tanzu
 
Porting Rails Apps to High Availability Systems
Porting Rails Apps to High Availability SystemsPorting Rails Apps to High Availability Systems
Porting Rails Apps to High Availability SystemsMarcelo Pinheiro
 
Symfony 2 (PHP Quebec 2009)
Symfony 2 (PHP Quebec 2009)Symfony 2 (PHP Quebec 2009)
Symfony 2 (PHP Quebec 2009)Fabien Potencier
 

Semelhante a Spark and Hadoop Data Scientist (20)

Why is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaWhy is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
 
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaWhy is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
 
把鐵路開進視窗裡
把鐵路開進視窗裡把鐵路開進視窗裡
把鐵路開進視窗裡
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
Openshift operator insight
Openshift operator insightOpenshift operator insight
Openshift operator insight
 
A jar-nORM-ous Task
A jar-nORM-ous TaskA jar-nORM-ous Task
A jar-nORM-ous Task
 
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex DadgarHomologous Apache Spark Clusters Using Nomad with Alex Dadgar
Homologous Apache Spark Clusters Using Nomad with Alex Dadgar
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Real World Lessons on the Pain Points of Node.JS Application
Real World Lessons on the Pain Points of Node.JS ApplicationReal World Lessons on the Pain Points of Node.JS Application
Real World Lessons on the Pain Points of Node.JS Application
 
Harmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and PuppetHarmonious Development: Via Vagrant and Puppet
Harmonious Development: Via Vagrant and Puppet
 
introduction-infra-as-a-code using terraform
introduction-infra-as-a-code using terraformintroduction-infra-as-a-code using terraform
introduction-infra-as-a-code using terraform
 
Innovation and Security in Ruby on Rails
Innovation and Security in Ruby on RailsInnovation and Security in Ruby on Rails
Innovation and Security in Ruby on Rails
 
Advanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xinAdvanced spark training advanced spark internals and tuning reynold xin
Advanced spark training advanced spark internals and tuning reynold xin
 
Scaling Docker Containers using Kubernetes and Azure Container Service
Scaling Docker Containers using Kubernetes and Azure Container ServiceScaling Docker Containers using Kubernetes and Azure Container Service
Scaling Docker Containers using Kubernetes and Azure Container Service
 
Troubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support EngineerTroubleshooting Tips from a Docker Support Engineer
Troubleshooting Tips from a Docker Support Engineer
 
Troubleshooting Tips from a Docker Support Engineer - Jeff Anderson, Docker
Troubleshooting Tips from a Docker Support Engineer - Jeff Anderson, DockerTroubleshooting Tips from a Docker Support Engineer - Jeff Anderson, Docker
Troubleshooting Tips from a Docker Support Engineer - Jeff Anderson, Docker
 
Replatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with MinecraftReplatforming Legacy Packaged Applications: Block-by-Block with Minecraft
Replatforming Legacy Packaged Applications: Block-by-Block with Minecraft
 
Porting Rails Apps to High Availability Systems
Porting Rails Apps to High Availability SystemsPorting Rails Apps to High Availability Systems
Porting Rails Apps to High Availability Systems
 
Symfony 2 (PHP Quebec 2009)
Symfony 2 (PHP Quebec 2009)Symfony 2 (PHP Quebec 2009)
Symfony 2 (PHP Quebec 2009)
 

Último

Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 

Último (20)

Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 

Spark and Hadoop Data Scientist

  • 1.
  • 2. ● Data science at Cloudera ● Recently lead Apache Spark development at Cloudera ● Before that, committing on Apache YARN and MapReduce ● Hadoop project management committee
  • 3. com.esotericsoftware.kryo. KryoException: Unable to find class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC $$iwC$$iwC$$anonfun$4$$anonfun$apply$3
  • 4.
  • 5.
  • 6.
  • 7. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...] Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler. org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages (DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. apply(DAGScheduler.scala:1015) [...]
  • 8. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...]
  • 9. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...]
  • 10. val file = sc.textFile("hdfs://...") file.filter(_.startsWith(“banana”)) .count()
  • 11. Job Stage Task Task Task Task Stage Task Task Task Task
  • 12.
  • 13. val rdd1 = sc.textFile(“hdfs://...”) .map(someFunc) .filter(filterFunc) textFile map filter
  • 14. val rdd2 = sc.hadoopFile(“hdfs: //...”) .groupByKey() .map(someOtherFunc) hadoopFile groupByKey map
  • 15. val rdd3 = rdd1.join(rdd2) .map(someFunc) join map
  • 17. textFile map filter hadoop group File ByKey map join map
  • 18. textFile map filter hadoop group File ByKey map join map
  • 19. Stage Task Task Task Task
  • 20. Stage Task Task Task Task
  • 21. Stage Task Task Task Task
  • 22. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by zero $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) [...]
  • 23. 14/04/22 11:59:58 ERROR executor.Executor: Exception in task ID 286 6 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565 ) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:64 8) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706 ) at java.io.DataInputStream.read(DataInputStream.java:100 ) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:20 9) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173 ) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:20 6) at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:4 5) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164 ) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149 ) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71 ) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:2 7) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327 ) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388 ) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388 ) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327 ) at scala.collection.Iterator$class.foreach(Iterator.scala:727 ) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157 ) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:16 1) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:10 2) at org.apache.spark.scheduler.Task.run(Task.scala:53 ) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:21 1) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:4 2) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:4 1) at java.security.AccessController.doPrivileged(Native Method ) at javax.security.auth.Subject.doAs(Subject.java:415 ) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:140 8) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:4 1) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176 ) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:114 5) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:61 5) at java.lang.Thread.run(Thread.java:724)
  • 24.
  • 25. ResourceManager NodeManager NodeManager Container Container Application Master Container Client
  • 26. ResourceManager Client NodeManager NodeManager Container Map Task Container Application Master Container Reduce Task
  • 27. ResourceManager Client NodeManager NodeManager Container Map Task Container Application Master Container Reduce Task
  • 28. ResourceManager Client NodeManager NodeManager Container Map Task Container Application Master Container Reduce Task
  • 29. Container [pid=63375, containerID=container_1388158490598_0001_01_00 0003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container.
  • 30. yarn.nodemanager.resource.memory-mb Executor container spark.yarn.executor.memoryOverhead spark.executor.memory spark.shuffle.memoryFraction spark.storage.memoryFraction
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36. ExternalAppend OnlyMap Block Block deserialize deserialize
  • 37. ExternalAppend OnlyMap key1 -> values key2 -> values key3 -> values
  • 38. ExternalAppend OnlyMap key1 -> values key2 -> values key3 -> values
  • 39. ExternalAppend OnlyMap Sort & Spill key1 -> values key2 -> values key3 -> values
  • 41.
  • 42. java.io.FileNotFoundException: /dn6/spark/local/spark-local- 20140610134115- 2cee/30/merged_shuffle_0_368_14 (Too many open files)
  • 43. Task Task Write stuff key1 -> values key2 -> values key3 -> values out key1 -> values key2 -> values key3 -> values Task
  • 44. Task Task Write stuff key1 -> values key2 -> values key3 -> values out key1 -> values key2 -> values key3 -> values Task
  • 45. Partition 1 File Partition 2 File Partition 3 File Records
  • 47. Single file Buffer Sort & Spill Partition 1 Records Partition 2 Records Partition 3 Records Index file
  • 49. ● No ● Distributed systems are complicated