Apache Spark is the next big data processing tool for data scientists. As seen in a recent Stack Overflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll use PySpark as the interface to the platform throughout, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's architecture: what's out now and what's coming in Spark 2.0
Spark APIs: the most commonly used APIs in Spark
Common misconceptions and proper techniques for using Spark
Demo:
Walk through ETL of the Reddit dataset
SparkSQL analytics + visualizations of the dataset using Matplotlib
Sentiment analysis on Reddit comments
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
1. The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
2. About Me
Miklos Christine
Solutions Architect @ Databricks
- mwc@databricks.com
- @Miklos_C on Twitter
Systems Engineer @ Cloudera
Supported a few of the largest clusters in the world
Software Engineer @ Cisco
UC Berkeley Graduate
3. We are Databricks, the company behind Spark
Founded by the creators of Apache Spark in 2013
75%: share of Spark code contributed by Databricks in 2014
Created Databricks on top of Spark to make big data simple.
4. Apache Spark Engine
Spark Core with standard libraries: Spark SQL, Spark Streaming, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
8. 2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
13. HOW RESPONDENTS ARE RUNNING SPARK
51% run Spark on a public cloud
TOP ROLES USING SPARK
41% of respondents identify themselves as Data Engineers
22% of respondents identify themselves as Data Scientists
15. NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
16. Large-Scale Usage
Largest cluster: 8000 nodes (Tencent)
Largest single job: 1 PB (Alibaba, Databricks)
Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
2014 on-disk sort record: fastest open source engine for sorting a PB
18. History of Spark APIs
RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)
DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations
Dataset (2015)
• Internally rows, externally JVM objects
• Almost the “best of both worlds”: type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
19. Benefit of the Logical Plan: Performance Parity Across Languages
[Chart: runtime of equivalent DataFrame vs. RDD code across languages]
21. ETL: Extract, Transform, Load
● Key factor for big data platforms
● Provides Speed Improvements in All Workloads
● Typically Executed by Data Engineers
22. File Formats
● Text File Formats
○ CSV
○ JSON
● Avro Row Format
● Parquet Columnar Format
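A minimal sketch of loading the text formats above into DataFrames with 1.6-era PySpark; the file paths and the existing SparkContext `sc` are assumptions, and CSV support comes from the external spark-csv package referenced elsewhere in this deck.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc: an existing SparkContext (assumed)

# JSON: one record per line; the schema is inferred from the data
json_df = sqlContext.read.json("/data/reddit/comments.json")

# CSV: via the com.databricks:spark-csv package (built into Spark as of 2.0)
csv_df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/data/reddit/comments.csv"))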
24. Spark Parquet Properties
● Industry Standard File Format: Parquet
○ Write to Parquet:
df.write.format("parquet").save("namesAndAges.parquet")
df.write.format("parquet").saveAsTable("myTestTable")
○ For compression:
spark.sql.parquet.compression.codec = (gzip, snappy)
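A hedged sketch of how the compression property is typically applied from PySpark; the snappy codec choice and the read-back step mirror the snippet above rather than anything prescribed by the slide.
# Set the codec before writing (snappy and gzip are the common choices)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

df.write.format("parquet").save("namesAndAges.parquet")

# Reading back: the schema is recovered from the Parquet file footers
names_df = sqlContext.read.parquet("namesAndAges.parquet")
names_df.printSchema()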
25. Small Files Problem
● Small files problem still exists
● Metadata loading
● APIs:
df.coalesce(N)
df.repartition(N)
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
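To make the difference between the two APIs concrete, here is a small sketch; the target of 8 output files is arbitrary.
# coalesce(N): merges existing partitions without a shuffle; it can only
# reduce the partition count, but is cheap.
df.coalesce(8).write.parquet("/path/to/parquet/")

# repartition(N): performs a full shuffle; it can raise or lower the count
# and produces more evenly sized output files.
df.repartition(8).write.parquet("/path/to/parquet/")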
29. PySpark ETL APIs
● Manage Number of Files
○ APIs manage the number of files per directory
df.repartition(80).write.parquet("/path/to/parquet/")

df.repartition(80).write.partitionBy("year").parquet("/path/to/parquet/")
30. Common ETL Problems
● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType
● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed
Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
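As a sketch of the malformed-JSON workflow above (the input path is hypothetical): reading line-delimited JSON puts unparseable lines into the _corrupt_record column, which can then be inspected with the query from the slide.
json_df = sqlContext.read.json("/data/reddit/comments.json")
json_df.registerTempTable("jsonTable")

bad_rows = sqlContext.sql(
    "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
bad_rows.show(5)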
31. Debugging Spark
Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed
4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal):
java.nio.channels.ClosedChannelException
Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "N"
at java.text.NumberFormat.parse(NumberFormat.java:385)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at scala.util.Try.getOrElse(Try.scala:77)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
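The executor error above comes from spark-csv trying to cast the literal string "N" to a number. One hypothetical workaround is to load the offending column as a string and clean it before casting; the column name `amount` and the file path are made up for illustration.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

raw = (sqlContext.read
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .load("/data/input.csv"))   # without inferSchema, columns load as strings

cleaned = raw.withColumn(
    "amount",
    F.when(raw["amount"] == "N", None).otherwise(raw["amount"]).cast(DoubleType()))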
34. SparkSQL Best Practices
● DataFrames and SparkSQL are two interfaces to the same engine and optimizer
● Use builtin functions instead of custom UDFs (see the sketch below)
○ import pyspark.sql.functions
● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
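A brief sketch using the built-in functions called out above; the column names (`ts`, `url`) are hypothetical.
from pyspark.sql import functions as F

df2 = (df
       .withColumn("event_date", F.to_date(df["ts"]))
       .withColumn("event_hour", F.hour(df["ts"]))
       .withColumn("domain", F.regexp_extract(df["url"], r"https?://([^/]+)", 1)))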
35. SparkSQL Best Practices
● Large Table Joins
○ Largest Table on LHS
○ Increase Spark Shuffle Partitions
○ Leverage “cluster by” API included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1")
.registerTempTable("sorted_large_table_1");
sqlCtx.sql(“cache table sorted_large_table_1”);
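A sketch of raising the shuffle partition count before the large join; the second table and the value 800 are illustrative (200 is the default).
sqlCtx.setConf("spark.sql.shuffle.partitions", "800")

joined = sqlCtx.sql("""
    SELECT *
    FROM sorted_large_table_1 a
    JOIN large_table_2 b ON a.num1 = b.num1
""")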
36. PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Register a UDF that adds two values, then cast the id column to an integer type.
add_n = udf(lambda x, y: x + y, IntegerType())
df = df.withColumn('id_offset', add_n(F.lit(1000), df.id.cast(IntegerType())))
38. PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
def squared(s):
    return s * s
sqlContext.udf.register("squaredWithPython", squared)  # usable from SQL queries
squared_udf = udf(squared, LongType())                 # usable from the DataFrame API
display(df.select("id", squared_udf("id").alias("id_squared")))
41. Why Spark ML
Provides general-purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
42. High-level functionality in MLlib
Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
43. Machine Learning: What and Why?
ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making
• Especially important when dealing with TB or PB of data
Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
44. Algorithm coverage in MLlib
Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction &
selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• NGram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
List based on Spark 1.5
45. Spark ML Best Practices
● Spark MLlib vs. Spark ML
○ Understand the differences: MLlib is the older RDD-based API, spark.ml is the DataFrame-based Pipelines API
● Don't Pipeline Too Many Stages
○ Check Results Between Stages (see the sketch below)
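A minimal spark.ml Pipeline sketch illustrating the "check results between stages" advice; it assumes a DataFrame `df` with a text column `corpus` and a numeric `label` column, neither of which comes from the slide.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="corpus", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(df)

# Inspect intermediate output by running only the feature stages
features = Pipeline(stages=[tokenizer, hashing_tf]).fit(df).transform(df)
features.select("words", "features").show(5)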
48. PySpark ML API Best Practices
● DataFrame to RDD Mapping
# The helpers below are assumptions added for completeness: the slide does not show
# how PUNCTUATION, STOPWORDS, and STEMMER are defined (NLTK and its data are assumed installed).
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

rdd = wordsDataFrame.map(lambda x: (x['id'], tokenize(x['corpus'])))
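A possible follow-up step (not shown on the slide): converting the tokenized RDD back into a DataFrame so the spark.ml feature transformers can consume it.
tokens_df = rdd.toDF(["id", "tokens"])
tokens_df.printSchema()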
49. Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• These links are part of the Databricks Guide, which contains many more examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)