In this talk I discuss my recent experience working with Spark DataFrames in Python. The focus is on usability: much of the documentation does not cover common use cases such as the intricacies of creating DataFrames, adding or manipulating individual columns, and doing quick-and-dirty analytics.
2. Why are we here?
Spark for quick and easy batch ETL (no streaming)
Actually using data frames
Creation
Modification
Access
Transformation
Lab!
Performance tuning and operationalization
3. What does it take to solve a data science problem?
Data Prep
Ingest
Cleanup
Error-handling & missing values
Data munging
Transformation
Formatting
Splitting
Modeling
Feature extraction
Algorithm selection
Data creation
Train
Test
Validate
Model building
Model scoring
4. Why Spark?
Batch/micro-batch processing of large datasets
Easy to use, easy to iterate, wealth of common industry-standard ML algorithms
Super fast if properly configured
Bridges the gap between the old (SQL, single-machine analytics) and the new (declarative/functional distributed programming)
6. Why not Spark?
Breaks easily with poor usage or improperly specified configs
Scaling up to larger datasets (500 GB -> TB scale) requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms
While there are lots of ML algorithms, many of them simply don't work, don't work at scale, or have poorly defined interfaces / documentation
7. Scala
Yes, I recommend Scala
Python API is underdeveloped, especially for MLlib
Java (until Java 8) is a second-class citizen as far as convenience vs. Scala
Spark is written in Scala – understanding Scala helps you navigate the source
Can leverage the spark-shell to rapidly prototype new code and constructs
http://www.scala-lang.org/docu/files/ScalaByExample.pdf
8. Why DataFrames?
Iterate on datasets MUCH faster
Column access is easier
Data inspection is easier
groupBy and join are faster due to under-the-hood optimizations
Some chunks of MLlib are now optimized to use DataFrames
9. Why not DataFrames?
RDD API is still much better developed
Getting data into DataFrames can be clunky
Transforming data inside DataFrames can be clunky
Many of the algorithms in MLlib still depend on RDDs
10. Creation
Read in a file with an embedded header
http://stackoverflow.com/questions/24718697/pyspark-drop-rows
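A minimal sketch of that pattern, assuming a live SparkContext sc and an example file path (the header-drop trick follows the linked StackOverflow thread):

lines = sc.textFile("data/titanic.csv")             # example path
header = lines.first()                              # the embedded header row
data = lines.filter(lambda line: line != header)    # drop the header
rows = data.map(lambda line: line.split(","))       # naive CSV split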
11. Create a DF
Option A – Inferred types from Rows RDD
Option B – Specify schema as strings
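A hedged sketch of both options, assuming a Spark 1.6-style sqlContext and made-up sample data:

from pyspark.sql import Row

# Option A -- an RDD of Row objects; Spark infers the column types
people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=29)])
df_a = sqlContext.createDataFrame(people)

# Option B -- plain tuples plus the schema given as column-name strings
pairs = sc.parallelize([("Alice", 34), ("Bob", 29)])
df_b = sqlContext.createDataFrame(pairs, ["name", "age"])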
12. Option C – Define the schema explicitly
Check your work with df.show()
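Roughly, reusing the made-up pairs RDD from above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),    # third argument = nullable
    StructField("age", IntegerType(), True),
])
df_c = sqlContext.createDataFrame(pairs, schema)
df_c.show()    # check your work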
13. Column Manipulation
Selection
GroupBy
Confusing! You get a GroupedData object, not an RDD or DataFrame
Use agg or built-ins to get back to a DataFrame.
Can convert to RDD with dataFrame.rdd
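For illustration (the fare column here is hypothetical):

from pyspark.sql import functions as F

subset = df.select("name", "age")       # selection returns a DataFrame
grouped = df.groupBy("age")             # a GroupedData object, not a DataFrame
counts = grouped.count()                # built-in -> back to a DataFrame
stats = grouped.agg(F.avg("fare"))      # agg -> back to a DataFrame
rdd = df.rdd                            # drop down to an RDD of Rows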
14. Custom Column Functions
Add a column with a custom function:
http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe
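In the spirit of that thread, a hedged sketch (the cabin column is made up) that maps empty strings to proper nulls via a UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

blank_as_null = F.udf(lambda s: None if s == "" else s, StringType())
df2 = df.withColumn("cabin", blank_as_null(df["cabin"]))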
16. Joins
Option A (inner join)
Option B (explicit)
Join types: inner, outer, left_outer, right_outer, leftsemi
DataFrame joins benefit from Tungsten optimizations
Note: PySpark will not drop columns for outer joins
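A sketch of both options with hypothetical passengers and tickets DataFrames (with Option B, both copies of the key column survive an outer join, per the note above):

# Option A -- inner join on a shared column name
joined = passengers.join(tickets, "passenger_id")

# Option B -- explicit condition and join type
joined = passengers.join(
    tickets,
    passengers["passenger_id"] == tickets["passenger_id"],
    "left_outer")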
17. Null Handling
Built-in support for handling nulls/NA in data frames.
Drop, fill, replace
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
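For example (column names are hypothetical):

df.na.drop()                                  # drop rows containing any null
df.na.drop(subset=["age"])                    # only consider the age column
df.na.fill({"age": 0, "cabin": "unknown"})    # per-column fill values
df.na.replace(["?"], ["unknown"], "cabin")    # swap sentinel values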
18. What does it take to solve a data science problem?
Data Prep
Ingest
Cleanup
Error-handling & missing values
Data munging
Transformation
Formatting
Splitting
Modeling
Feature extraction
Algorithm selection
Data creation
Train
Test
Validate
Model building
Model scoring
19. Lab Rules
Ask Google and StackOverflow before you ask me
You do not have to use my code.
Use DataFrames until you can’t.
Keep track of what breaks!
There are no stupid questions.
20. Lab
Ingest Data
Remove invalid entries or fill in missing values
Split into test, train, validate (see the sketch after this list)
Reformat a single column, e.g. map IDs or change format
Add a custom metric or feature based on other columns
Run a classification algorithm on this data to figure out who will survive!
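One way to do the split step above (the weights and seed are arbitrary examples):

train, test, validate = df.randomSplit([0.7, 0.2, 0.1], seed=42)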
24. Partitions, Caching, and Serialization
Partitions
How data is split on disk
Affects memory usage, shuffle size
Count ~ speed, Count ~ 1/memory (more partitions: more parallelism, but less memory per partition)
Caching
Persist RDDs in distributed memory
Major speedup for repeated operations
Serialization
Efficient movement of data
Java vs. Kryo
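A hedged sketch of all three knobs (the partition count and storage level are arbitrary examples; rdd and df are assumed to already exist):

from pyspark import SparkConf, StorageLevel

# Serialization: Kryo is typically faster and more compact than Java serialization
conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer")

# Partitions: trade parallelism against memory per task
repartitioned = rdd.repartition(200)

# Caching: keep hot datasets in distributed memory for repeated operations
repartitioned.persist(StorageLevel.MEMORY_AND_DISK)
df.cache()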
25. Shuffle!
All-to-all operations
reduceByKey, groupByKey
Data movement
Serialization
Akka
Memory overhead
Dumps to disk when OOM
Garbage collection
EXPENSIVE!
(Diagram: map and reduce stages around the shuffle)
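A toy example of why the choice of operation matters for shuffle size:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey ships every value across the network before combining
slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines locally on each partition first, shrinking the shuffle
fast = pairs.reduceByKey(lambda x, y: x + y)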
26. What else?
Save your work => Write completed datasets to file (example below)
Work on small data first, then go to big data
Create test data to capture edge cases
LMGTFY
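For the "save your work" point above, one hedged example (the output path is arbitrary):

df.write.mode("overwrite").parquet("output/cleaned.parquet")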
28. Any Spark on YARN
E.g. Deploy Spark 1.6 on CDH 5.4
Download your Spark binary to the cluster and untar
In $SPARK_HOME/conf/spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
This tells Spark where Hadoop is deployed; it also gives Spark the link it needs to run on YARN.
export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
This defines the location of the Hadoop binaries used at runtime.
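With those two variables set, a run might look like the following (executor count, memory, and the script name are placeholder values):

$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 4g \
  my_job.py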