The Nitty Gritty of Advanced
Analytics
Using Apache Spark in Python
Miklos Christine
Solutions Architect
mwc@databricks.com, @Miklos_C
About Me
Miklos Christine
Solutions Architect @ Databricks
- mwc@databricks.com
- @Miklos_C on Twitter
Systems Engineer @ Cloudera
Supported a few of the largest clusters in the world
Software Engineer @ Cisco
UC Berkeley Graduate
We are Databricks, the company behind Spark
Founded by the creators of Apache Spark in 2013
75% of the Spark code contributed in 2014 came from Databricks
Created Databricks on top of Spark to make big data simple.
Apache Spark Engine
Spark Core, with Spark Streaming, Spark SQL, MLlib, and GraphX on top
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
2010: research paper
2012: started @ Berkeley
2013: Databricks started & donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames, Tungsten, ML Pipelines
2016: Spark 2.0
Spark Community Growth
• Spark Survey 2015
Highlights
• End of Year Spark Highlights
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
HOW RESPONDENTS ARE RUNNING SPARK
51% run Spark on a public cloud
TOP ROLES USING SPARK
41% of respondents identify themselves as Data Engineers
22% of respondents identify themselves as Data Scientists
Spark User Highlights
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Large-Scale Usage
Largest cluster: 8,000 nodes (Tencent)
Largest single job: 1 PB (Alibaba, Databricks)
Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
2014 on-disk sort record: fastest open source engine for sorting a PB
Spark API Performance
History of Spark APIs
RDD (2011): distributed collection of JVM objects; functional operators (map, filter, etc.)
DataFrame (2013): distributed collection of Row objects; expression-based operations and UDFs; logical plans and optimizer; fast/efficient internal representations
DataSet (2015): internally rows, externally JVM objects; almost the “best of both worlds”: type safe + fast, but slower than DataFrames and not as good for interactive analysis, especially in Python
Benefit of the Logical Plan: Performance Parity Across Languages (benchmark chart: DataFrame vs. RDD)
ETL with Spark
ETL: Extract, Transform, Load
● Key factor for big data platforms
● Well-designed ETL speeds up all downstream workloads
● Typically executed by Data Engineers
File Formats
● Text File Formats
○ CSV
○ JSON
● Avro Row Format
● Parquet Columnar Format
File Formats + Compression
● File Formats
○ JSON
○ CSV
○ Avro
○ Parquet
● Compression Codecs
○ No compression
○ Snappy
○ Gzip
○ LZO
● Industry Standard File Format: Parquet
○ Write to Parquet:
df.write.format("parquet").save("namesAndAges.parquet")
df.write.format("parquet").saveAsTable("myTestTable")
○ For compression:
spark.sql.parquet.compression.codec = (gzip, snappy)
Spark Parquet Properties
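A minimal sketch of the write path above, assuming an existing sqlContext, a DataFrame df, and a placeholder output path:

# Choose the Parquet compression codec before writing (snappy or gzip).
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.format("parquet").save("/path/to/namesAndAges.parquet")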
Small Files Problem
● Small files problem still exists
● Metadata loading
● APIs (see the sketch below):
df.coalesce(N)
df.repartition(N)
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
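A minimal sketch of compacting a directory of many small files, with placeholder paths and an assumed target of 16 output files:

df = sqlContext.read.parquet("/path/to/many/small/files/")
print(df.rdd.getNumPartitions())   # current partition (and roughly file) count

# coalesce(N) narrows partitions without a full shuffle; repartition(N)
# shuffles but spreads the data evenly across the N output files.
df.coalesce(16).write.parquet("/path/to/compacted/")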
● RDD / DataFrame Partitions
df.rdd.getNumPartitions()
● SparkSQL Shuffle Partitions
spark.sql.shuffle.partitions
● Table Level Partitions
df.write.partitionBy("year").save("data.parquet")
All About Partitions
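A sketch tying the three partition levels together, assuming an existing sqlContext and DataFrame df and placeholder paths:

# 1. RDD / DataFrame partitions: how many partitions back this DataFrame.
print(df.rdd.getNumPartitions())

# 2. SparkSQL shuffle partitions: partition count after joins and aggregations.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")

# 3. Table-level partitions: one output directory per year, enabling partition pruning.
df.write.partitionBy("year").save("/path/to/data.parquet")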
# CSV
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferSchema='true')
      .load('/path/to/data'))
# JSON
df = sqlContext.read.json("/tmp/test.json")
df.write.json("/tmp/test_output.json")
PySpark ETL APIs - Text Formats
PySpark ETL APIs - Container Formats
# Binary Container Formats
# Avro
df = (sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/path/to/files/"))
# Parquet
df = sqlContext.read.parquet("/path/to/files/")
df.write.parquet("/path/to/files/")
● Manage Number of Files
○ APIs manage the number of files per directory
(df.repartition(80)
   .write
   .parquet("/path/to/parquet/"))

(df.repartition(80)
   .write
   .partitionBy("year")
   .parquet("/path/to/parquet/"))
PySpark ETL APIs
Common ETL Problems
● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE
_corrupt_record IS NOT NULL")
● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType
● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed
Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
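A minimal sketch of surfacing malformed JSON records, assuming a placeholder input path and table name:

# Lines the JSON reader cannot parse are kept in the _corrupt_record column.
df = sqlContext.read.json("/path/to/input.json")
df.registerTempTable("jsonTable")

bad_rows = sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
good_rows = sqlContext.sql("SELECT * FROM jsonTable WHERE _corrupt_record IS NULL")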
Debugging Spark
Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed
4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal):
java.nio.channels.ClosedChannelException
Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "N"
at java.text.NumberFormat.parse(NumberFormat.java:385)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
at scala.util.Try.getOrElse(Try.scala:77)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
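One way to avoid this particular failure is to read the offending column as a string and cast it explicitly, so the "N" sentinel becomes NULL instead of a parse error. A sketch, assuming the spark-csv package and a hypothetical amount column:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Keep every column as a string on read, then cast explicitly.
raw = (sqlContext.read
       .format('com.databricks.spark.csv')
       .options(header='true', inferSchema='false')
       .load('/path/to/data.csv'))

# Values other than the "N" sentinel are cast to double; "N" becomes NULL.
clean = raw.withColumn('amount',
                       F.when(F.col('amount') != 'N', F.col('amount')).cast(DoubleType()))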
Debugging Spark
SQL with Spark
SparkSQL Best Practices
● DataFrames and SparkSQL are two interfaces to the same engine and optimizer
● Use built-in functions instead of custom UDFs
○ import pyspark.sql.functions
● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
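A short sketch of a few of these built-ins in action, with hypothetical column names (ts holds a timestamp, payload holds a JSON string):

from pyspark.sql import functions as F

summary = df.select(
    F.to_date(F.col('ts')).alias('event_date'),
    F.hour(F.col('ts')).alias('event_hour'),
    F.get_json_object(F.col('payload'), '$.user.id').alias('user_id'),
    F.regexp_extract(F.col('payload'), '"country":"([A-Z]{2})"', 1).alias('country'))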
SparkSQL Best Practices
● Large Table Joins
○ Largest Table on LHS
○ Increase Spark Shuffle Partitions
○ Leverage the “cluster by” API included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1") \
      .registerTempTable("sorted_large_table_1")
sqlCtx.sql("cache table sorted_large_table_1")
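A sketch of using the pre-clustered, cached table in a subsequent join (hypothetical second table and join key):

sqlCtx.setConf("spark.sql.shuffle.partitions", "400")   # widen the shuffle for large joins
result = sqlCtx.sql("""
    select l.*, r.some_col
    from sorted_large_table_1 l
    join other_large_table r on l.num1 = r.num1
""")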
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

add_n = udf(lambda x, y: x + y, IntegerType())

# Register a UDF that adds a column to the DataFrame; cast the id column to an integer first.
df = df.withColumn('id_offset',
                   add_n(F.lit(1000), df.id.cast(IntegerType())))
PySpark API Best Practices
● Built-in Functions
corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'),
    F.monotonicallyIncreasingId().alias('id'))

corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
PySpark API Best Practices
● User Defined Functions (UDFs)
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
def squared(s):
    return s * s
squared_udf = udf(squared, LongType())                  # DataFrame API version
sqlContext.udf.register("squaredWithPython", squared)   # SQL version
display(df.select("id", squared_udf("id").alias("id_squared")))
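The SQL-registered name can also be called directly in a query (a sketch, assuming a temp table named test with an integer id column):

df.registerTempTable("test")
sqlContext.sql("SELECT id, squaredWithPython(id) AS id_squared FROM test").show()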
ML with Spark
Data Science Time
Why Spark ML
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
High-level functionality in MLlib
Learning tasks
Classification
Regression
Recommendation
Clustering
Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
Machine Learning: What and Why?
ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making
• Especially important when dealing with TB or PB of data
Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations
Algorithm coverage in MLlib
Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest
Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods
Recommendation
• Alternating Least Squares
Frequent itemsets
• FP-growth
• Prefix span
Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
• Block-partitioned matrix
• Row matrix
• Indexed row matrix
• Coordinate matrix
• Matrix decompositions
Model import/export
Pipelines
Feature extraction &
selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
List based on Spark 1.5
Spark ML Best Practices
● Spark MLlib (RDD-based API) vs. spark.ml (DataFrame-based Pipelines API)
○ Understand the differences
● Don’t Pipeline Too Many Stages
○ Check Results Between Stages
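A minimal sketch of keeping a pipeline short and checking results between stages, assuming a DataFrame corpus_df with a string column corpus (spark.ml, Spark 1.5+):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF

tokenizer = Tokenizer(inputCol='corpus', outputCol='words')
hashing_tf = HashingTF(inputCol='words', outputCol='features')

# Run the first stage on its own and inspect the output before chaining further.
tokenizer.transform(corpus_df).select('corpus', 'words').show(5)

pipeline = Pipeline(stages=[tokenizer, hashing_tf])
featurized = pipeline.fit(corpus_df).transform(corpus_df)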
PySpark ML API Best Practices
● DataFrame to RDD Mapping
# Assumes NLTK is available and that PUNCTUATION, STOPWORDS, and STEMMER
# (e.g. an NLTK stemmer) are defined elsewhere in the notebook.
from nltk.tokenize import word_tokenize

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if not letter in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if not w in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

# Map each DataFrame row to (id, tokens); x['corpus'] is equivalent to x.__getitem__('corpus').
rdd = wordsDataFrame.map(lambda x: (x['id'], tokenize(x['corpus'])))
PySpark ML API Best Practices
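To get back to a DataFrame after the RDD-side tokenization, the result can be wrapped again (a sketch, reusing the rdd from the slide above):

tokens_df = sqlContext.createDataFrame(rdd, ['id', 'tokens'])
tokens_df.printSchema()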
Learning more about MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• The above 2 links are part of the Databricks Guide, which contains many more
examples and references.
References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API
documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)
Spark Demo
Thanks!
Sign Up For Databricks Community Edition!
http://go.databricks.com/databricks-community-edition-beta-waitlist