Extending Machine
Learning Algorithms
with PySpark
Karen Feng, Kiavash Kianfar
Databricks
Agenda
● Discuss using PySpark
(especially Pandas UDFs) to
perform machine learning
at unprecedented scale
● Learn about an application
for a genomics use case
(GloWGR)
Design decisions
1. Problem: Genomic data are growing too quickly for
existing tools
Solution: Use big data tools (Spark)
2. Problem: Bioinformaticians are not familiar with
the native languages used by big data tools (Scala)
Solution: Provide clients for high-level languages
(Python)
3. Problem: Performant, maintainable machine
learning algorithms are difficult to write natively in
big data tools (Spark SQL expressions)
Solution: Write algorithms in high-level languages
and link them to big data tools (PySpark)
Genomic data are growing too fast for existing tools
Problem 1
Genomic data are growing at an exponential pace
● Biobank datasets are growing in scale
• Next-generation sequencing
• Genotyping arrays (1Mb)
• Whole exome sequence (39Mb)
• Whole genome sequence (3200Mb)
• 1,000s of samples → 100,000s of samples
• 10s of traits → 1,000s of traits
Use general-purpose big data tools - specifically, Spark
Solution 1
Differentiation from single-node libraries
Glow
▪ Flexible: Glow is built natively on Spark, a general-purpose big data engine
▪ Enables aggregation and mining of genetic variants on an industrial scale
▪ Low-overhead: Spark minimizes serialization cost with libraries like Kryo and Arrow
Single-node
▪ Inflexible: Each tool requires custom parallelization logic, per language and algorithm
▪ High-overhead: Moving text between arbitrary processes hurts performance
Bioinformaticians are not familiar with the native languages
used by big data tools, such as Scala
Problem 2
Spark is predominantly written in Scala
Data engineers and
scientists are
Python-oriented
● More than 60% of
notebook commands in
Databricks are written in
Python
● Fewer than 20% of
commands are written in
Scala
Bioinformaticians are even more Python-oriented
Provide clients for high-level languages, such as Python
Solution 2
Python improves the user experience
• Py4J: achieve near feature parity with Scala APIs
• PySpark Project Zen
• PySpark type hints
Performant, maintainable machine learning algorithms are
difficult to write natively in big data tools
Problem 3
Spark SQL expressions
• Built to process data row-by-row
• Difficult to maintain state
• Minimal support for machine learning
• Overhead from converting rows to ML-compatible shapes (e.g., matrices)
• Few linear algebra libraries exist in Scala
• Limited functionality
Write algorithms in high-level languages and link them to big
data tools
Solution 3
Python improves the developer experience
• Pandas: user-defined
functions (UDFs)
• Apache Arrow: transfer
data between JVM and
Python processes
Feature in Spark 3.0: mapInPandas
● Local algorithm development in Pandas: f(X) → Y
● Plug-and-play with Spark with minimal overhead: Iter(X) → f(X) → Y → Iter(Y)
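The develop-locally, then plug-into-Spark pattern can be sketched as follows. This is a minimal illustration, not code from Glow: the function name `standardize_batches`, the columns `id` and `value`, and the centering transformation are all hypothetical. The function takes and returns an iterator of Pandas DataFrames, which is the shape `DataFrame.mapInPandas` expects.

```python
from typing import Iterator

import pandas as pd


def standardize_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Apply f(X) -> Y to every batch of an iterator of Pandas DataFrames."""
    for X in batches:
        Y = X.copy()
        # Hypothetical transformation: center the 'value' column within the batch.
        Y["value"] = X["value"] - X["value"].mean()
        yield Y


# Local development: exercise the function on plain Pandas DataFrames.
local_batches = [
    pd.DataFrame({"id": [1, 2], "value": [1.0, 3.0]}),
    pd.DataFrame({"id": [3, 4], "value": [10.0, 20.0]}),
]
local_result = pd.concat(standardize_batches(iter(local_batches)), ignore_index=True)

# On a cluster, the same function plugs into Spark 3.0+ (spark_df is assumed to exist):
#   spark_df.mapInPandas(standardize_batches, schema="id long, value double")
```

Because the function only sees Pandas DataFrames, it can be unit-tested without a Spark session, and Spark streams Arrow batches through it unchanged.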
Deep Dive: Genomics Use Case
Single nucleotide polymorphisms (SNP)
Genome Wide Association Studies (GWAS)
Detect associations between
genetic variations and traits of
interest across a population
• Common genetic variations confer a small amount of risk
• Rare genetic variations confer a large amount of risk
Whole Genome Regression (WGR)
Account for polygenic
effects, population
structure, and
relatedness
• Reduce false positives
• Reduce false
negatives
Mission: Industrialize genomics by integrating bioinformatics
into data science
Core principles:
• Build on Apache Spark
• Flexibly and natively support genomics tools and file
formats
• Provide single-line functions for common genomics
workloads
• Build an open-source community
Glow v1.0.0
● Datasources: Read/write common
genomic file formats (e.g., VCF, BGEN,
Plink, GFF3) into/from Spark
DataFrames
● SQL expressions: Simple variant
handling operations can be called
from Python, SQL, Scala, or R
● Transformers: Complex genomic
transformations can be called from
Python or Scala
● GloWGR: Novel WGR/GWAS algorithm
built with PySpark
https://projectglow.io/
GloWGR: WGR and GWAS
● Detect which genotypes are associated with each
phenotype using a Generalized Linear Model
● Glow parallelizes the REGENIE method via Spark as
GloWGR
● Built from the ground up using Pandas UDFs
GWAS Regression Tests
Millions of single-variate linear or logistic regressions
GloWGR: Learning at huge dimensions
WGR Reduction: ~5000 multi-variate linear ridge
regressions (one for each block and parameter)
WGR Regression: ~5000 multi-variate linear or
logistic ridge regressions with cross validation
[Figure: matrix dimensions 500K x 100, 500K x 50, 500K x 1M]
Data preparation
Transformation and SQL functions
on Genomic Variant DataFrame
● split_multiallelics
● genotype_states
● mean_substitute
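As a rough illustration of what mean substitution does to one variant's genotype values, the NumPy sketch below replaces missing entries with the mean of the observed ones. It is not Glow's implementation; the helper name `fill_missing_with_mean` is hypothetical, and encoding missing calls as -1 is an assumption made for this example.

```python
import numpy as np


def fill_missing_with_mean(values, missing_value=-1):
    """Replace missing entries with the mean of the non-missing entries.

    Assumption for this sketch: a missing call is NaN or equals `missing_value`.
    """
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values) | (values == missing_value)
    filled = values.copy()
    if missing.any() and not missing.all():
        filled[missing] = values[~missing].mean()
    return filled


# One variant's alternate-allele counts for four samples; the third call is missing.
genotype_states = [0, 2, -1, 1]
substituted = fill_missing_with_mean(genotype_states)
```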
Stage 1: Genotype matrix blocking
Stage 2: Dimensionality reduction
RidgeReduction.fit
● Pandas UDF: Construct X and Y matrices for each block and calculate $X^T X$ and $X^T Y$
● Pandas UDF: Reduce with element-wise sum over sample blocks
● Pandas UDF: Assemble the matrices $X^T X$ and $X^T Y$ for a particular sample block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$
RidgeReduction.transform
● Pandas UDF: Calculates XB for each block
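The fit and transform steps above can be sketched in NumPy. This is a simplified illustration under stated assumptions, not GloWGR's code: the helper names `block_gram` and `ridge_from_blocks` are hypothetical, the data are random, and the per-block statistics are summed over all sample blocks rather than leaving one block out.

```python
import numpy as np


def block_gram(X_block, Y_block):
    """Per-sample-block sufficient statistics: X^T X and X^T Y."""
    return X_block.T @ X_block, X_block.T @ Y_block


def ridge_from_blocks(blocks, alpha):
    """Sum the per-block statistics element-wise, then solve (X^T X + alpha I) B = X^T Y."""
    grams = [block_gram(X, Y) for X, Y in blocks]
    XtX = sum(g[0] for g in grams)
    XtY = sum(g[1] for g in grams)
    n_features = XtX.shape[0]
    return np.linalg.solve(XtX + alpha * np.eye(n_features), XtY)


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))   # 12 samples, 3 features
Y = rng.normal(size=(12, 2))   # 12 samples, 2 phenotypes
blocks = [(X[:6], Y[:6]), (X[6:], Y[6:])]  # two sample blocks

B = ridge_from_blocks(blocks, alpha=0.5)
prediction = X @ B  # the XB computed by the transform step
```

Only the small $X^T X$ and $X^T Y$ matrices need to be combined across blocks, which is why the reduction can be expressed as an element-wise sum.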
Stage 3: Estimate
phenotypic predictors
RidgeRegression.fit
● Pandas UDF: Construct X and Y matrices for each block and calculate $X^T X$ and $X^T Y$
● Pandas UDF: Reduce with element-wise sum over sample blocks
● Pandas UDF: Assemble the matrices $X^T X$ and $X^T Y$ for a particular sample block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$
● Perform cross validation. Pick model
with best ⍺
RidgeRegression.transform_loco
● Pandas UDF: Calculates XB for each
block in a loco fashion
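Picking the ridge parameter by cross validation can be sketched as below. This is an assumption-laden illustration rather than GloWGR's procedure: it holds out one sample block at a time, scores by mean squared error, and uses hypothetical helper names (`ridge_solve`, `pick_alpha`) and synthetic data.

```python
import numpy as np


def ridge_solve(XtX, XtY, alpha):
    """Solve (X^T X + alpha I) B = X^T Y for B."""
    return np.linalg.solve(XtX + alpha * np.eye(XtX.shape[0]), XtY)


def pick_alpha(blocks, alphas):
    """Leave-one-sample-block-out CV: return the alpha with the lowest mean squared error."""
    grams = [(X.T @ X, X.T @ Y) for X, Y in blocks]
    total_XtX = sum(g[0] for g in grams)
    total_XtY = sum(g[1] for g in grams)
    scores = {}
    for alpha in alphas:
        squared_error, count = 0.0, 0
        for (X, Y), (XtX, XtY) in zip(blocks, grams):
            # Train on every block except the held-out one by subtracting its statistics.
            B = ridge_solve(total_XtX - XtX, total_XtY - XtY, alpha)
            squared_error += float(np.sum((Y - X @ B) ** 2))
            count += Y.size
        scores[alpha] = squared_error / count
    best = min(scores, key=scores.get)
    return best, scores


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
Y = X @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(40, 2))
blocks = [(X[i:i + 10], Y[i:i + 10]) for i in range(0, 40, 10)]

best_alpha, cv_scores = pick_alpha(blocks, alphas=[0.01, 1.0, 100.0])
```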
GWAS
$Y \sim G\beta_g + C\beta_c + \epsilon$
$Y - \hat{Y} \sim G\beta_g + C\beta_c + \epsilon$
Use the phenotype estimate Ŷ
output by WGR to account for
polygenic effects during
regression
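A minimal NumPy sketch of the adjusted linear model, fitted for every variant and trait at once, is shown below. It is an illustration under assumptions, not Glow's implementation: the function name `gwas_linear` is hypothetical, the data are synthetic, and the covariates are handled by projecting them out of both the adjusted phenotypes and the genotypes, which yields the same per-variant coefficient as fitting each variant jointly with the covariates.

```python
import numpy as np


def gwas_linear(Y, Y_hat, G, C):
    """Effect size of each variant on each trait, with covariates C projected out.

    Y, Y_hat: (S, T) phenotypes and WGR estimates; G: (S, V) genotypes;
    C: (S, K) covariates. Returns beta_g with shape (V, T).
    """
    Q, _ = np.linalg.qr(C)  # orthonormal basis for the covariate column space

    def residualize(M):
        # Remove the component of M explained by the covariates.
        return M - Q @ (Q.T @ M)

    Y_res = residualize(Y - Y_hat)
    G_res = residualize(G)
    numerator = np.einsum("sv,st->vt", G_res, Y_res)
    denominator = np.einsum("sv,sv->v", G_res, G_res)
    return numerator / denominator[:, None]


rng = np.random.default_rng(2)
S, V, T = 50, 4, 2
G = rng.integers(0, 3, size=(S, V)).astype(float)        # genotype states 0/1/2
C = np.column_stack([np.ones(S), rng.normal(size=S)])    # intercept plus one covariate
Y = rng.normal(size=(S, T))
Y_hat = 0.1 * rng.normal(size=(S, T))                    # stand-in for the WGR output

beta_g = gwas_linear(Y, Y_hat, G, C)
```

The two `einsum` calls replace V x T separate regressions with a pair of batched contractions, which is the kind of expression that is awkward to write as row-wise SQL expressions.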
GWAS with Spark SQL expressions
[Diagram: Data (S samples, C covariates, V variants, T traits) → T null models (S samples, C covariates, 1 trait; Cβc) → V x T fitted models (S samples, C covariates, 1 variant, 1 trait; Gβg) → Results (V variants, T traits)]
GWAS with Spark SQL expressions
Pros
• Portable to all Spark clients
Cons
• Requires writing your own Spark SQL expressions
• User-unfriendly linear algebra libraries in Scala (i.e., Breeze)
• Limited to 2 dimensions
• Unnatural expressions of mathematical operations
• Customized, expensive data transfers
• Spark DataFrames ↔ MLlib matrices ↔ Breeze matrices
• Input and output must be Spark DataFrames
GWAS with PySpark
[Diagram: Phenotype matrix (S samples, T traits), Covariate matrix (S samples, C covariates), and Genotype matrix (S samples, V variants) → T null models (S samples, C covariates, 1 trait; Cβc) → one fitted model per partition (S samples, C covariates, O(V) variants, O(T) traits; Gβg) → Results (V variants, T traits)]
GWAS with PySpark
Pros
• User-friendly Python libraries (e.g., Pandas)
• Easy to express mathematical notation
• Unlimited dimensions
• Batched, optimized transfers between Pandas and Spark DataFrames
• Input and output can be Pandas or Spark DataFrames
Cons
• Accessible only from Python
GWAS
            I/O formats                 Linalg libraries            Accessible clients
Spark SQL   Spark DataFrames            Spark ML/MLlib, Breeze      Scala, Python, R
PySpark     Spark or Pandas DataFrames  Pandas, NumPy, einsum, ...  Python
Differentiation from other parallelized libraries
Glow
▪ Lightweight: Glow is a thin layer built to be compatible with the latest major Spark releases, as well as other open-source libraries (e.g., Delta)
▪ Flexible: Glow includes a set of core algorithms, and is easily extended to ad-hoc use cases using existing tools
Other parallelized libraries
▪ Heavyweight: Many libraries build on custom logic that makes it difficult to update to new technologies
▪ Inflexible: Many libraries expose custom interfaces that make it difficult to extend beyond the built-in algorithms
Future work: gene burden tests
Big takeaways
1. Listen to your users
2. Use the latest off-the-shelf tools
3. If all else fails, pivot early

  • 5. Design decisions 1. Problem: Genomic data are growing too quickly for existing tools Solution: Use big data tools (Spark) 2. Problem: Bioinformaticians are not familiar with the native languages used by big data tools (Scala) Solution: Provide clients for high-level languages (Python) 3. Problem: Performant, maintainable machine learning algorithms are difficult to write natively in big data tools (Spark SQL expressions) Solution: Write algorithms in high-level languages and link them to big data tools (PySpark)
  • 6. Genomic data are growing too fast for existing tools Problem 1
  • 7. Genomic data are growing at an exponential pace
  • 8. Genomic data are growing at an exponential pace. Biobank datasets are growing in scale • Next-generation sequencing • Genotyping arrays (1 Mb) • Whole exome sequence (39 Mb) • Whole genome sequence (3200 Mb) • 1,000s of samples → 100,000s of samples • 10s of traits → 1,000s of traits
  • 9. Use general-purpose big data tools - specifically, Spark Solution 1
  • 10. Differentiation from single-node libraries. Glow: ▪ Flexible: Glow is built natively on Spark, a general-purpose big data engine ▪ Enables aggregation and mining of genetic variants on an industrial scale ▪ Low-overhead: Spark minimizes serialization cost with libraries like Kryo and Arrow. Single-node libraries: ▪ Inflexible: Each tool requires custom parallelization logic, per language and algorithm ▪ High-overhead: Moving text between arbitrary processes hurts performance
  • 11. Bioinformaticians are not familiar with the native languages used by big data tools, such as Scala Problem 2
  • 12. Spark is predominantly written in Scala
  • 13. Data engineers and scientists are Python-oriented ● More than 60% of notebook commands in Databricks are written in Python ● Fewer than 20% of commands are written in Scala
  • 14. Bioinformaticians are even more Python-oriented
  • 15. Provide clients for high-level languages, such as Python Solution 2
  • 16. Python improves the user experience • Py4J: achieve near feature parity with Scala APIs • PySpark Project Zen • PySpark type hints
  • 17. Performant, maintainable machine learning algorithms are difficult to write natively in big data tools Problem 3
  • 18. Spark SQL expressions • Built to process data row-by-row • Difficult to maintain state • Minimal support for machine learning • Overhead from converting rows to ML-compatible shapes (e.g. matrices) • Few linear algebra libraries exist in Scala • Limited functionality
  • 19. Write algorithms in high-level languages and link them to big data tools Solution 3
  • 20. Python improves the developer experience • Pandas: user-defined functions (UDFs) • Apache Arrow: transfer data between JVM and Python processes
  • 21. Feature in Spark 3.0: mapInPandas • Local algorithm development in Pandas • Plug-and-play with Spark with minimal overhead [Diagram: a function f(X) → Y developed locally on a Pandas DataFrame is applied by Spark as Iter(X) → Iter(Y)]
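
The develop-locally, plug-into-Spark pattern on slide 21 can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `center_batches`, the column names `id`, `value`, and `centered`, and the toy batches are all hypothetical. The function has the shape that `mapInPandas` expects (an iterator of Pandas DataFrames in, an iterator of Pandas DataFrames out), so it can be exercised with plain Pandas before it touches a cluster.

```python
from typing import Iterator

import pandas as pd


def center_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Apply f(X) -> Y to each batch: subtract the batch mean from `value`.

    Hypothetical example function. Note that centering happens per batch,
    not over the whole dataset.
    """
    for pdf in batches:
        values = pdf["value"].to_numpy(dtype=float)
        yield pd.DataFrame({"id": pdf["id"], "centered": values - values.mean()})


# Local development: drive the function with ordinary Pandas DataFrames.
local_batches = [
    pd.DataFrame({"id": [1, 2], "value": [1.0, 3.0]}),
    pd.DataFrame({"id": [3, 4, 5], "value": [2.0, 4.0, 6.0]}),
]
result = pd.concat(center_batches(iter(local_batches)), ignore_index=True)

# On Spark 3.0+, the same function should plug in unchanged, assuming a
# Spark DataFrame `spark_df` with matching columns:
#   spark_df.mapInPandas(center_batches, schema="id long, centered double")
```

Because Spark decides how rows are grouped into batches, a function used this way should only rely on per-batch logic or on keys carried inside the data.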
  • 24. Genome Wide Association Studies (GWAS) Detect associations between genetic variations and traits of interest across a population • Common genetic variations confer a small amount of risk • Rare genetic variations confer a large amount of risk
  • 25. Whole Genome Regression (WGR) Account for polygenic effects, population structure, and relatedness • Reduce false positives • Reduce false negatives
  • 26. Mission: Industrialize genomics by integrating bioinformatics into data science Core principles: • Build on Apache Spark • Flexibly and natively support genomics tools and file formats • Provide single-line functions for common genomics workloads • Build an open-source community
  • 27. Glow v1.0.0 ● Datasources: Read/write common genomic file formats (e.g. VCF, BGEN, PLINK, GFF3) into/from Spark DataFrames ● SQL expressions: Simple variant handling operations can be called from Python, SQL, Scala, or R ● Transformers: Complex genomic transformations can be called from Python or Scala ● GloWGR: Novel WGR/GWAS algorithm built with PySpark https://projectglow.io/
  • 28. GloWGR: WGR and GWAS ● Detect which genotypes are associated with each phenotype using a Generalized Linear Model ● Glow parallelizes the REGENIE method via Spark as GloWGR ● Built from the ground up using Pandas UDFs
  • 29. GloWGR: Learning at huge dimensions • WGR Reduction: ~5000 multi-variate linear ridge regressions (one for each block and parameter) • WGR Regression: ~5000 multi-variate linear or logistic ridge regressions with cross validation • GWAS Regression Tests: Millions of single-variate linear or logistic regressions [Diagram matrix sizes: 500K x 100, 500K x 50, 500K x 1M]
  • 30. Data preparation Transformation and SQL functions on Genomic Variant DataFrame ● split_multiallelics ● genotype_states ● mean_substitute
  • 31. Stage 1: Genotype matrix blocking
  • 32. Stage 2: Dimensionality reduction. RidgeReduction.fit ● Pandas UDF: Construct X and Y matrices for each block and calculate $X^T X$ and $X^T Y$ ● Pandas UDF: Reduce with element-wise sum over sample blocks ● Pandas UDF: Assemble the matrices $X^T X$ and $X^T Y$ for a particular sample block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$. RidgeReduction.transform ● Pandas UDF: Calculate $XB$ for each block
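
The three fit steps on slide 32 can be illustrated with NumPy on a single machine. This is a sketch under stated assumptions, not Glow's implementation: the data are random, the block count and `alpha` are arbitrary, and the sum runs over all sample blocks, whereas the slide assembles the matrices for a particular sample block.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_traits = 12, 4, 2
X = rng.normal(size=(n_samples, n_features))  # toy feature block
Y = rng.normal(size=(n_samples, n_traits))    # toy phenotypes
alpha = 0.5                                   # arbitrary ridge penalty

# Split the rows into three sample blocks (hypothetical blocking).
sample_blocks = np.array_split(np.arange(n_samples), 3)

# Step 1: per-block statistics X^T X and X^T Y.
partials = [(X[idx].T @ X[idx], X[idx].T @ Y[idx]) for idx in sample_blocks]

# Step 2: element-wise sum over sample blocks.
XtX = sum(p[0] for p in partials)
XtY = sum(p[1] for p in partials)

# Step 3: B = (X^T X + alpha I)^{-1} X^T Y, computed with a linear solve
# instead of an explicit inverse.
B = np.linalg.solve(XtX + alpha * np.eye(n_features), XtY)

# transform: X B
reduced = X @ B
```

The point of the decomposition is that $X^T X$ and $X^T Y$ are small (features x features and features x traits) and additive over rows, so each block can be processed independently and only the small matrices need to be combined.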
  • 33. Stage 3: Estimate phenotypic predictors. RidgeRegression.fit ● Pandas UDF: Construct X and Y matrices for each block and calculate $X^T X$ and $X^T Y$ ● Pandas UDF: Reduce with element-wise sum over sample blocks ● Pandas UDF: Assemble the matrices $X^T X$ and $X^T Y$ for a particular sample block and calculate $B = (X^T X + \alpha I)^{-1} X^T Y$ ● Perform cross validation and pick the model with the best $\alpha$. RidgeRegression.transform_loco ● Pandas UDF: Calculate $XB$ for each block in a leave-one-chromosome-out (LOCO) fashion
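
Choosing $\alpha$ by cross validation can reuse the same per-block statistics: the model for a held-out sample block is fitted from the totals minus that block's contribution. The sketch below assumes leave-one-block-out folds, mean squared error as the score, and an arbitrary grid of `alphas`; none of these choices are confirmed by the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 60, 5
X = rng.normal(size=(n_samples, n_features))
true_B = rng.normal(size=(n_features, 1))
Y = X @ true_B + 0.1 * rng.normal(size=(n_samples, 1))  # toy phenotype

alphas = [0.01, 1.0, 100.0, 10000.0]            # hypothetical grid
blocks = np.array_split(np.arange(n_samples), 5)

# Per-block statistics and their totals.
stats = [(X[idx].T @ X[idx], X[idx].T @ Y[idx]) for idx in blocks]
XtX_all = sum(s[0] for s in stats)
XtY_all = sum(s[1] for s in stats)


def cv_error(alpha: float) -> float:
    """Mean squared error over held-out sample blocks for one alpha."""
    total = 0.0
    for idx, (XtX_b, XtY_b) in zip(blocks, stats):
        # Fit on every block except the held-out one.
        lhs = XtX_all - XtX_b + alpha * np.eye(n_features)
        B = np.linalg.solve(lhs, XtY_all - XtY_b)
        total += float(np.sum((Y[idx] - X[idx] @ B) ** 2))
    return total / n_samples


errors = {a: cv_error(a) for a in alphas}
best_alpha = min(errors, key=errors.get)
```

No block needs to be revisited per fold or per $\alpha$: only the small summed matrices are re-solved.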
  • 34. GWAS: replace $Y \sim G\beta_g + C\beta_c + \epsilon$ with $Y - \hat{Y} \sim G\beta_g + C\beta_c + \epsilon$. Use the phenotype estimate $\hat{Y}$ output by WGR to account for polygenic effects during regression
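
A per-variant linear regression on the adjusted phenotype $Y - \hat{Y}$ might look like the sketch below. Everything here is simulated: `y_hat` is a random stand-in for the WGR output, the effect size 0.8 on the first variant is planted for illustration, and `fit_variant` is a hypothetical helper rather than a Glow function.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_variants = 200, 3

# Covariates C: intercept plus one simulated covariate.
C = np.column_stack([np.ones(n_samples), rng.normal(size=n_samples)])
# Genotypes G: alternate allele counts 0, 1, or 2.
G = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)
# Stand-in for the phenotype estimate produced by WGR.
y_hat = rng.normal(size=n_samples)
# Simulated phenotype: only the first variant has a true effect (0.8).
y = y_hat + 0.8 * G[:, 0] + C @ np.array([0.5, -0.3]) + rng.normal(size=n_samples)

adjusted = y - y_hat  # Y - Y_hat


def fit_variant(g: np.ndarray) -> tuple:
    """Least-squares fit of adjusted ~ g*beta_g + C*beta_c.

    Returns (beta_g, t_statistic) for the variant.
    """
    design = np.column_stack([g, C])
    coef, *_ = np.linalg.lstsq(design, adjusted, rcond=None)
    resid = adjusted - design @ coef
    dof = n_samples - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[0], coef[0] / np.sqrt(cov[0, 0])


results = [fit_variant(G[:, j]) for j in range(n_variants)]
```

Looping over variants like this mirrors the one-variant, one-trait model; it is the simplest form of the test, not the fastest.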
  • 35. GWAS with Spark SQL expressions [Diagram: Data (S samples, C covariates, V variants, T traits) → Null model $C\beta_c$ (S samples, C covariates, 1 trait), T x → Fitted model $G\beta_g$ (S samples, C covariates, 1 variant, 1 trait), V x T x → Results (V variants, T traits)]
  • 36. GWAS with Spark SQL expressions Pros • Portable to all Spark clients
  • 38. GWAS with Spark SQL expressions Pros • Portable to all Spark clients Cons • Requires writing your own Spark SQL expressions • User-unfriendly linear algebra libraries in Scala (i.e. Breeze): limited to 2 dimensions, unnatural expressions of mathematical operations • Customized, expensive data transfers (Spark DataFrames ↔ MLlib matrices ↔ Breeze matrices) • Input and output must be Spark DataFrames
  • 39. GWAS with PySpark [Diagram: Phenotype matrix (S samples, T traits), Covariate matrix (S samples, C covariates), and Genotype matrix (S samples, V variants) → Null model $C\beta_c$ (S samples, C covariates, 1 trait), T x → Fitted model $G\beta_g$ (S samples, C covariates, O(V) variants, O(T) traits), # partitions x → Results (V variants, T traits)]
  • 40. GWAS with PySpark Pros • User-friendly Python libraries (e.g. Pandas) • Easy to express mathematical notation • Unlimited dimensions • Batched, optimized transfers between Pandas and Spark DataFrames • Input and output can be Pandas or Spark DataFrames Cons • Accessible only from Python
  • 42. GWAS: comparison of the two approaches
      Approach | I/O formats | Linalg libraries | Accessible clients
      Spark SQL | Spark DataFrames | Spark ML/MLlib, Breeze | Scala, Python, R
      PySpark | Spark or Pandas DataFrames | Pandas, NumPy, einsum, ... | Python
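
To illustrate why array libraries such as NumPy make the batched PySpark formulation natural, the sketch below computes effect sizes for every variant and trait pair in one pass with `np.einsum`. It is an illustration under assumptions, not GloWGR code: the data are random, and the covariates are projected out of both genotypes and phenotypes first, which by the Frisch-Waugh-Lovell theorem gives the same $\beta_g$ as fitting each variant jointly with the covariates.

```python
import numpy as np

rng = np.random.default_rng(3)
S, V, T, n_cov = 100, 6, 4, 2  # samples, variants, traits, covariates

C = np.column_stack([np.ones(S), rng.normal(size=(S, n_cov))])  # with intercept
G = rng.integers(0, 3, size=(S, V)).astype(float)               # genotype matrix
Y = rng.normal(size=(S, T))                                     # phenotype matrix

# Project the covariates out once; Q is an orthonormal basis for col(C).
Q, _ = np.linalg.qr(C)
G_res = G - Q @ (Q.T @ G)
Y_res = Y - Q @ (Q.T @ Y)

# All V x T effect sizes at once:
# beta[v, t] = <g_v, y_t> / <g_v, g_v> on the residualized columns.
numer = np.einsum("sv,st->vt", G_res, Y_res)
denom = np.einsum("sv,sv->v", G_res, G_res)
beta = numer / denom[:, None]
```

The subscript strings read close to the mathematical notation, and the same pattern extends to higher-dimensional arrays, which is awkward with 2-dimensional matrix libraries.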
  • 43. Differentiation from other parallelized libraries. Glow: ▪ Lightweight: Glow is a thin layer built to be compatible with the latest major Spark releases, as well as other open-source libraries (e.g. Delta) ▪ Flexible: Glow includes a set of core algorithms, and is easily extended to ad-hoc use cases using existing tools. Other parallelized libraries: ▪ Heavyweight: Many libraries build on custom logic that makes it difficult to update to new technologies ▪ Inflexible: Many libraries expose custom interfaces that make it difficult to extend beyond the built-in algorithms
  • 44. Future work: gene burden tests
  • 45. Big takeaways 1. Listen to your users 2. Use the latest off-the-shelf tools 3. If all else fails, pivot early
  • 46. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.