Frustration-Reduced PySpark
Data engineering with DataFrames
Ilya Ganelin
Why are we here?
 Spark for quick and easy batch ETL (no streaming)
 Actually using data frames
 Creation
 Modification
 Access
 Transformation
 Lab!
 Performance tuning and operationalization
What does it take to solve a data science problem?
 Data Prep
 Ingest
 Cleanup
 Error-handling & missing values
 Data munging
 Transformation
 Formatting
 Splitting
 Modeling
 Feature extraction
 Algorithm selection
 Data creation
 Train
 Test
 Validate
 Model building
 Model scoring
Why Spark?
 Batch/micro-batch processing of large datasets
 Easy to use, easy to iterate, wealth of common
industry-standard ML algorithms
 Super fast if properly configured
 Bridges the gap between the old (SQL, single machine
analytics) and the new (declarative/functional
distributed programming)
Why not Spark?
 Breaks easily with poor usage or improperly specified
configs
 Scaling up to larger datasets (500 GB to multi-TB scale) requires a deep understanding of internal configurations, garbage-collection tuning, and Spark mechanisms
 While there are lots of ML algorithms, many simply don’t work, don’t work at scale, or have poorly defined interfaces and documentation
Scala
 Yes, I recommend Scala
 Python API is underdeveloped, especially for ML Lib
 Java (before Java 8) is a second-class citizen compared to Scala in terms of convenience
 Spark is written in Scala – understanding Scala helps you
navigate the source
 Can leverage the spark-shell to rapidly prototype new
code and constructs
 http://www.scala-lang.org/docu/files/ScalaByExample.pdf
Why DataFrames?
 Iterate on datasets MUCH faster
 Column access is easier
 Data inspection is easier
 groupBy and join are faster due to under-the-hood optimizations
 Some chunks of ML Lib are now optimized to use data frames
Why not DataFrames?
 RDD API is still much better developed
 Getting data into DataFrames can be clunky
 Transforming data inside DataFrames can be clunky
 Many of the algorithms in ML Lib still depend on RDDs
Creation
 Read in a file with an embedded header
 http://stackoverflow.com/questions/24718697/pyspark-drop-rows
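A minimal sketch of the header-drop pattern from that StackOverflow thread, assuming a Spark 1.6-style SQLContext and a comma-delimited file named titanic.csv (the file name and delimiter are assumptions, not part of the slides):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="frustration-reduced-pyspark")
    sqlContext = SQLContext(sc)

    lines = sc.textFile("titanic.csv")                   # read the raw text file
    header = lines.first()                               # the embedded header row
    data = lines.filter(lambda line: line != header)     # drop the header from the data
    rows = data.map(lambda line: line.split(","))        # split each record into fields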
DataFrame Creation
 Create a DF
 Option A – Inferred types from Rows RDD
 Option B – Specify schema as strings
 Option C – Define the schema explicitly
 Check your work with df.show()
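A sketch of the three creation options, continuing from the rows RDD above; the two-column name/age schema is an illustrative assumption:

    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Option A – build an RDD of Row objects and let Spark infer the types
    row_rdd = rows.map(lambda r: Row(name=r[0], age=int(r[1])))
    df_a = sqlContext.createDataFrame(row_rdd)

    # Option B – supply only the column names as strings; types are inferred from the data
    df_b = sqlContext.createDataFrame(rows, ["name", "age"])

    # Option C – define the schema explicitly (field order and types must match the data)
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    typed_rows = rows.map(lambda r: (r[0], int(r[1])))
    df_c = sqlContext.createDataFrame(typed_rows, schema)

    df_c.show()    # check your work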
Column Manipulation
 Selection
 GroupBy
 Confusing! You get a GroupedData object, not an RDD or
DataFrame
 Use agg or built-ins to get back to a DataFrame.
 Can convert to RDD with dataFrame.rdd
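A short sketch of these column operations on a DataFrame df like the ones created above; the fare column is an assumption:

    # Selection returns a new DataFrame
    subset = df.select("name", "age")

    # groupBy returns a GroupedData object, not a DataFrame...
    grouped = df.groupBy("age")

    # ...use agg or a built-in aggregate to get back to a DataFrame
    counts = grouped.count()
    avg_fare = df.groupBy("age").agg({"fare": "avg"})

    # Drop down to the RDD API when you need it
    raw_rdd = df.rdd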
Custom Column Functions
 Add a column with a custom function:
 http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe
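A sketch of the pattern from that thread, using a UDF with withColumn; the cabin column is an assumption:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Map empty strings to None (null) so downstream null-handling works
    blank_to_null = udf(lambda s: None if s == "" else s, StringType())
    df = df.withColumn("cabin", blank_to_null(df["cabin"]))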
Row Manipulation
 Filter
 Range:
 Equality:
 Column functions
 https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.Column
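A few filter examples under the same assumed schema (the age and pclass columns are illustrative):

    from pyspark.sql.functions import col

    adults = df.filter(df["age"] >= 18)             # range filter
    first_class = df.filter(df["pclass"] == 1)      # equality filter
    missing_age = df.filter(col("age").isNull())    # Column function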
Joins
 Option A (inner join)
 Option B (explicit)
 Join types: inner, outer, left_outer, right_outer, leftsemi
 DataFrame joins benefit from Tungsten optimizations
 Note: PySpark will not drop columns for outer joins
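A join sketch; other_df and the passenger_id key are assumptions for illustration:

    # Option A – inner join on a shared column name
    joined_a = df.join(other_df, "passenger_id")

    # Option B – explicit join condition and join type
    joined_b = df.join(other_df,
                       df["passenger_id"] == other_df["passenger_id"],
                       "left_outer")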
Null Handling
 Built-in support for handling nulls/NA in data frames.
 Drop, fill, replace
 https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
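A sketch of the DataFrameNaFunctions API; the column names and fill values are assumptions:

    clean = df.na.drop()                                              # drop rows containing any null
    filled = df.na.fill({"age": 0, "embarked": "S"})                  # fill nulls per column
    recoded = df.na.replace(["male", "female"], ["M", "F"], "sex")    # replace values in a column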
What does it take to solve a data science problem?
 Data Prep
 Ingest
 Cleanup
 Error-handling & missing values
 Data munging
 Transformation
 Formatting
 Splitting
 Modeling
 Feature extraction
 Algorithm selection
 Data creation
 Train
 Test
 Validate
 Model building
 Model scoring
Lab Rules
 Ask Google and StackOverflow before you ask me :)
 You do not have to use my code.
 Use DataFrames until you can’t.
 Keep track of what breaks!
 There are no stupid questions.
Lab
 Ingest Data
 Remove invalid entries or fill in missing values
 Split into test, train, validate (see the sketch after this list)
 Reformat a single column, e.g. map IDs or change
format
 Add a custom metric or feature based on other columns
 Run a classification algorithm on this data to figure out
who will survive!
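A sketch of the split step with DataFrame.randomSplit; the weights and seed are assumptions:

    train, test, validate = df.randomSplit([0.7, 0.2, 0.1], seed=42)
    print(train.count(), test.count(), validate.count())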
What problems did you encounter?
What are you still confused about?
Spark Architecture
Partitions, Caching, and Serialization
 Partitions
 How data is split on disk
 Affects memory usage, shuffle size
 More partitions ~ more parallelism, less memory per partition
 Caching
 Persist RDDs in distributed memory
 Major speedup for repeated operations
 Serialization
 Efficient movement of data
 Java vs. Kryo
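A sketch of the corresponding knobs, reusing the df above; the partition count, storage level, and conf setting are illustrative assumptions:

    from pyspark import SparkConf, StorageLevel

    # Partitions: control the split explicitly
    df_repart = df.repartition(200)

    # Caching: persist a dataset you will reuse, then materialize it
    df_repart.persist(StorageLevel.MEMORY_AND_DISK)
    df_repart.count()

    # Serialization: prefer Kryo (set on the SparkConf before the context is created)
    conf = SparkConf().set("spark.serializer",
                           "org.apache.spark.serializer.KryoSerializer")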
Shuffle!
 All-to-all operations
 reduceByKey, groupByKey
 Data movement
 Serialization
 Akka
 Memory overhead
 Dumps to disk when OOM
 Garbage collection
 EXPENSIVE!
Map Reduce
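A small illustration of why reduceByKey is preferred over groupByKey when a shuffle is unavoidable; the pclass column is an assumption:

    pairs = df.rdd.map(lambda row: (row.pclass, 1))

    # reduceByKey combines values on each partition before the shuffle — less data moves
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # groupByKey ships every value across the network — expensive for large groups
    counts_slow = pairs.groupByKey().mapValues(len)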
What else?
 Save your work => Write completed datasets to file
 Work on small data first, then go to big data
 Create test data to capture edge cases
 LMGTFY
By popular demand:

    screen pyspark \
      --driver-memory 100g \
      --num-executors 60 \
      --executor-cores 5 \
      --master yarn-client \
      --conf "spark.executor.memory=20g" \
      --conf "spark.io.compression.codec=lz4" \
      --conf "spark.shuffle.consolidateFiles=true" \
      --conf "spark.dynamicAllocation.enabled=false" \
      --conf "spark.shuffle.manager=tungsten-sort" \
      --conf "spark.akka.frameSize=1028" \
      --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"
Any Spark on YARN
 E.g. deploy Spark 1.6 on CDH 5.4
 Download your Spark binary to the cluster and untar it
 In $SPARK_HOME/conf/spark-env.sh:
 export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
 This tells Spark where Hadoop is deployed and gives it the link it needs to run on YARN
 export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
 This defines the location of the Hadoop binaries used at run time
References
 http://spark.apache.org/docs/latest/programming-guide.html
 http://spark.apache.org/docs/latest/sql-programming-guide.html
 http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ (by Sandy Ryza)
 http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ (by Sandy Ryza)
 http://www.slideshare.net/ilganeli/frustrationreduced-pyspark-data-engineering-with-dataframes

Editor's Notes

  1. Get a sense for familiarity with Spark
  2. Ask the class for involvement here!
  3. Ask the class for involvement here!