From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
Logistics
• We can’t hear you…
• Recording will be available…
• Code samples and notebooks will be available…
• Submit your questions…
• Bookmark databricks.com/blog
Our Mission: Helping data teams solve the world’s toughest
problems
Original creators of popular data and machine learning open source projects
Global company with 5,000+ customers and 450+ partners
Unified Data Analytics Platform
• Data Science, ML, and BI on one cloud platform
• Access all business and big data in an open data lake
• Securely integrates with your cloud ecosystem
• Users: data scientists, ML engineers, data analysts, data engineers
• Data Science Workspace: collaboration across the lifecycle
• Unified Data Service: high-quality data with great performance
• Enterprise Cloud Service: a simple, scalable, and secure managed service
• BI integrations: access all your data
• Data: big data and business data
About our speakers
Yifan Cao, Sr. Product Manager, Machine Learning at Databricks
• Product Area: ML/DL algorithms and Databricks Runtime for Machine
Learning
• Built and grew two ML products to multi-million dollars in annual
revenue
• B.S. Engineering from UC Berkeley; MBA from MIT
Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One
• Wearing many hats @ Hyperloop: Embedded Devices, Back-end
Software, Data Science & Machine Learning
• Previous experience includes Samsung R&D, CERN
• M.S. in Information Technologies from EPFL (Switzerland)
Agenda
1. Virgin Hyperloop One
2. Hyperloop  Databricks Story
3. What is Databricks and Koalas
4. Koalas (Python + notebook, live coding)
5. Short intro to MLflow tracking
6. Koalas & MLflow (Python + notebook, live coding)
7. Koalas tips & tricks
Virgin Hyperloop One (VHO)
• Startup with 250+ employees in Los Angeles (hiring!)
• New transportation system
• Vacuum tube + small passenger/cargo vehicles (pods)
• Short travel times
• On-demand ("ride hailing")
• Zero direct emissions (electric levitation and propulsion)
• 500 m test track in Nevada (video)
• New exciting tests, tracks, and enhancements (in progress right now)
VHO – MIA (Machine Intelligence & Analytics)
• Operational research for Hyperloop
• Analytics products for business and tech
• Simulation- and data-based
• Sample questions:
  - What's the optimal vehicle capacity?
  - How many passengers can we realistically handle in scenario X between cities Y and Z?
  - How much better are we than other modes?
• Answers? Data
Examples of AI, Analytics, and Data in VHO
• Demand modelling
• Trip planning
• Performance metrics
• Cost metrics
• Geospatial analytics
• 3D alignment optimizer
• Test runs
• HW & SW test rigs
Hyperloop Data Story
• Growing data sizes (from MBs to GBs)
• Growing processing times (from minutes to hours)
• Python scripts crashing (pandas out of memory)
• Need for a more enterprise-grade and scalable approach to handling data (we tried different solutions)
• Spark and its ecosystem are the de facto standard solution to that problem
Hyperloop Data Story
• Who's gonna manage our new Spark infrastructure? Not enough DevOps …
Koalas - why?
pandas code:
(pandas_df
    .groupby("Destination")
    .sum()
    .nlargest(10, columns="Trip Count"))
PySpark code:
(spark_df
    .groupby("Destination")
    .sum()
    .orderBy("sum(Trip Count)", ascending=False)
    .limit(10))
Me after learning I need to redo
all our pandas scripts in pySpark
(and keep doing it for future DS work)
Introducing Koalas
Typical journey of a data scientist
• Education (MOOCs, books, universities) → pandas
• Analyze small data sets → pandas
• Analyze big data sets → DataFrame in Spark
pandas + Apache Spark
• pandas: standard for single-machine workloads; small data
• Apache Spark: standard for distributed workloads; big data
What is Koalas?
• Launched on April 24, 2019 by Databricks
• Pure open source Python library
• Aims at providing the pandas API on top of Apache Spark:
  - Unifies the two ecosystems with a familiar API
  - Seamless transition between small and large data
github.com/databricks/koalas
Koalas - why?
pandas code:
(pandas_df
    .groupby("Destination")
    .sum()
    .nlargest(10, columns="Trip Count"))
PySpark code:
(spark_df
    .groupby("Destination")
    .sum()
    .orderBy("sum(Trip Count)", ascending=False)
    .limit(10))
Koalas code:
(koalas_df
    .groupby("Destination")
    .sum()
    .nlargest(10, columns="Trip Count"))
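The point of the comparison is that the pandas and Koalas chains are identical. A minimal sketch of that idea follows: one helper, here called top_destinations (a name made up for this example), written only with the calls from the slide and run on a tiny made-up pandas DataFrame. That the same helper accepts a Koalas DataFrame is an assumption based on the slide, and it is only indicated in a comment because it needs Koalas and Spark installed.

```python
import pandas as pd


def top_destinations(df, n=10):
    """Return the n destinations with the largest total trip count.

    Uses only groupby/sum/nlargest, the calls shown on the slide.
    """
    return df.groupby("Destination").sum().nlargest(n, columns="Trip Count")


# Made-up sample data, for illustration only.
pandas_df = pd.DataFrame(
    {
        "Destination": ["A", "B", "A", "C", "B"],
        "Trip Count": [5, 3, 7, 1, 4],
    }
)

top = top_destinations(pandas_df, n=2)
print(top)

# Assumption: with Koalas installed, the same function should accept a Koalas
# DataFrame built from the same data, for example:
#   import databricks.koalas as ks
#   top_destinations(ks.from_pandas(pandas_df), n=2)
```

Keeping the logic in one function like this means the pandas version can serve as a quick local check before the same code is pointed at a larger data set.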
Koalas Architecture
(Diagram labels: Koalas, a lean API layer; pandas; DataFrame APIs; SQL; Catalyst Optimization & Tungsten Execution; Core; Data Source Connectors; Spark)
Koalas User Adoption
• Scales the breadth of pandas to big data more effectively
• Reduces friction by unifying the big data environment
• Has been quickly adopted:
  - 860+ patches merged since the announcement in April 2019
  - 20+ major contributors from outside Databricks
  - 24k+ daily downloads
What is Koalas
Koalas allows a seamless* switch from pandas to Spark
• That means scaling the compute power
* A few caveats are covered in DEMO #1
• Databricks can manage our Spark infrastructure (and is also the author of Koalas)
• Need to speed up your computation? Scale up your Spark workers:
  - this obviously comes at a $ price
  - processing time saturates: it cannot drop below a few seconds
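The saturation remark can be illustrated with a toy model. This is only a sketch under stated assumptions, not a measurement: it assumes the job has a fixed overhead that no number of workers removes, plus work that divides perfectly across workers. The function name and all numbers are made up.

```python
def estimated_runtime(work_seconds, workers, overhead_seconds):
    """Toy model: fixed overhead plus perfectly divisible work."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return overhead_seconds + work_seconds / workers


# Hypothetical job: 600 s of parallelizable work, 3 s of fixed overhead.
for workers in (1, 4, 16, 64, 256):
    runtime = estimated_runtime(600, workers, 3)
    print(f"{workers:>3} workers -> {runtime:.1f} s")
```

In this model each extra worker buys less than the previous one, and the runtime never drops below the fixed overhead, while the cost of the cluster keeps growing.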
DEMO #1 time
MLflow - purpose
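import mlflow
import mlflow.sklearn

# a, l1, and train_score_model are defined elsewhere (not shown on the slide)
with mlflow.start_run():
    # hyperparameters of this run
    mlflow.log_param("alpha", a)
    mlflow.log_param("l1_ratio", l1)
    rmse, r2, lr = train_score_model(a, l1)
    # evaluation metrics and the fitted model
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")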
github.com/mlflow
databricks.com/mlflow
The logging block above is run N times (parameter sweep).
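A parameter sweep simply repeats the logging block once per parameter combination. The sketch below shows that loop with the standard library only. It does not call MLflow: log_run is a hypothetical in-memory stand-in for the mlflow.log_param / mlflow.log_metric calls, and train_score_model here is a made-up placeholder that returns an artificial error instead of training anything.

```python
from itertools import product

runs = []  # in-memory stand-in for an MLflow experiment (hypothetical)


def train_score_model(alpha, l1_ratio):
    """Hypothetical placeholder: returns a made-up error instead of training."""
    rmse = (alpha - 0.5) ** 2 + (l1_ratio - 0.25) ** 2
    return rmse


def log_run(params, metrics):
    """Record one run. With MLflow, the slide's mlflow.log_param and
    mlflow.log_metric calls would go here, inside `with mlflow.start_run():`."""
    runs.append({"params": params, "metrics": metrics})


alphas = [0.1, 0.5, 1.0]
l1_ratios = [0.0, 0.25, 0.5]

# One run per parameter combination: N = len(alphas) * len(l1_ratios).
for a, l1 in product(alphas, l1_ratios):
    rmse = train_score_model(a, l1)
    log_run({"alpha": a, "l1_ratio": l1}, {"rmse": rmse})

best = min(runs, key=lambda run: run["metrics"]["rmse"])
print(len(runs), "runs; best parameters:", best["params"])
```

Because every run stores its parameters next to its metrics, picking the best combination afterwards is a single lookup, which is the organizing benefit the tracking API is meant to provide.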
MLflow - our model
– The demand depends on the hour and day of week.
– Let's create different ML prediction models and save them to an MLflow experiment.
(Diagram: Output → Scoring (on test set), shown three times)
MLflow - our model
– Koalas can easily be used for pre-processing the test/train data (reuse existing code).
– Use kdf.apply() to sweep and score models in parallel using Spark workers.
– Let's create different models and save them to an MLflow experiment.
(Diagram: sweep parameter #1 and sweep parameter #2 feed three ML models (black boxes) that output the predicted trip count; parameters, metrics, and the model are logged to MLflow)
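The sweep-by-apply idea can be sketched as one row per parameter combination and a row-wise apply that scores each row. The example below uses plain pandas so that it runs on a single machine; score_row, the column names, and all values are hypothetical. The assumption, taken from the slide, is that the same apply call on a Koalas DataFrame is what spreads the rows over Spark workers.

```python
from itertools import product

import pandas as pd


def score_row(row):
    """Hypothetical scoring function: one call per parameter combination.

    A real version would train a model with these parameters, score it on the
    test set, and log the run to MLflow.
    """
    return abs(row["param_1"] - 2) + abs(row["param_2"] - 0.5)


# One row per combination of the two swept parameters (made-up values).
grid = pd.DataFrame(
    list(product([1, 2, 3], [0.1, 0.5])), columns=["param_1", "param_2"]
)

# Row-wise apply: in pandas this runs sequentially on one machine.
grid["score"] = grid.apply(score_row, axis=1)
best = grid.loc[grid["score"].idxmin()]
print(grid)
```

Each row is independent of the others, which is what makes this pattern a natural fit for distributing across workers.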
DEMO #2 Koalas & MLflow
pandas to Koalas – tips and tricks
• Almost all popular pandas operations are available in Koalas, although some parameters are missing. (Missing something? Open an issue on the Koalas GitHub! Don't be shy like me!)
• Some functionality is deliberately NOT implemented. The easiest workaround, provided the data fits in the driver's memory, is:
kdf.to_pandas().do_whatever_you_want().to_koalas()
Example: DataFrame.values, because all the data would be loaded into the driver's memory, causing OOM errors.
• Be aware of the different execution principles (ordering, lazy evaluation, underlying Spark DataFrame): sort after groupby, different structure of groupby.apply, different NaN treatment and operations.
• I personally really like using kdf.apply(my_func, axis=1) for any distributed row-based job, including web scraping, dict mapping, MLflow runs, etc. All the function calls (for all the rows) are then distributed among all Spark workers.
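One way to read the "sort after groupby" tip is to never rely on the order in which groups come back. The sketch below shows the habit in plain pandas on made-up data. The assumption, based on the tip rather than verified here, is that the same chain on a Koalas DataFrame may return groups in a different order unless it is sorted explicitly.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({"Hour": [9, 8, 9, 8], "Trip Count": [4, 1, 6, 3]})

# Explicit sort: do not rely on the order in which the groups are returned.
mean_by_hour = df.groupby("Hour").mean().sort_index()
print(mean_by_hour)
```

Code written this way should behave the same whether the frame is a pandas one or, under the assumption above, a Koalas one.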
pandas to Koalas – tips and tricks #2
• kdf = kdf.cache() – the frame is not recomputed from the beginning every time it is used. This is especially useful for exploratory analysis, where you use the same kdf in different cells; for one long script, Spark optimizes the execution tree for you. This behavior is very different from pandas!
• Problems with Koalas? Take a look at ks.options:
  - compute.ops_on_diff_frames
  - compute.ordered_head
  - plotting.sample_ratio
  - display.max_rows
(Figure labels: cache, no cache)
ks_with_trend = ks_bart_df.groupby(["Date", "Hour"]).mean()
ks_trendless = ks_with_trend.copy()
ks_trendless["Trip Count"] -= trend["Trip Count"]
ks_trendless = ks_trendless.cache() # <-- caching here
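Why caching matters under lazy evaluation can be shown with a toy model. The class below is not a Koalas or Spark API; LazyFrame and its methods are made up for this sketch. It only counts how often a pipeline is executed when its result is requested twice, with and without a cache step.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated frame (not a Koalas or Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # function that produces the data
        self._cached = None
        self._use_cache = False
        self.compute_calls = 0       # how many times the plan was executed

    def cache(self):
        """Return self after marking the result to be kept once computed."""
        self._use_cache = True
        return self

    def collect(self):
        """Action: run the plan, or reuse the cached result if there is one."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive_plan():
    return [x * x for x in range(5)]


no_cache = LazyFrame(expensive_plan)
no_cache.collect()
no_cache.collect()            # recomputed from the beginning

cached = LazyFrame(expensive_plan).cache()
cached.collect()
cached.collect()              # served from the cached result

print(no_cache.compute_calls, cached.compute_calls)
```

In a notebook, every cell that displays or plots the same frame plays the role of a collect() call here, which is why caching pays off most during exploratory work.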
Koalas roadmap
• Expand pandas API coverage with Koalas DataFrames
  - Current: ~70% API coverage with pandas
• Integration with more visualization packages
  - Current: support for matplotlib
• Deeper integration with NumPy
  - Current: universal functions are implemented
• More example notebooks
  - Current: a few examples on
Conclusions
1. Virgin Hyperloop One – cool stuff – we’re hiring ( https://hyperloop-one.com/careers )
2. Hyperloop  Databricks Story – lucky coincidence, amazing partnership
3. What is Databricks and Koalas – pandas API for Spark
4. Koalas DEMO #1 – very convenient, but you can still use PySpark if the situation requires it
5. Short intro to MLflow tracking – excellent for organizing your experiments (not only ML)
6. Koalas & MLflow DEMO #2 – Koalas is also nice for parallel model execution and scoring
7. Next Steps: Sparkifying Matlab
References, links
1. Koalas documentation & GitHub:
https://github.com/databricks/koalas
2. Blog post: "How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas"
https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html
3. How to improve your pandas if you don't want to move to Spark: "From Pandas-wan to Pandas-master"
https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442
Q&A
Thank you for joining!
databricks.com/sparkaisummit

Raven: End-to-end Optimization of ML Prediction Queries
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Último

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 

Último (20)

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data

  • 2. Logistics • We can’t hear you… • Recording will be available… • Code samples and notebooks will be available… • Submit your questions… • Bookmark databricks.com/blog
  • 3. Our Mission: Helping data teams solve the world’s toughest problems Original creators of popular data and machine learning open source projects Global company with 5,000+ customers and 450+ partners
  • 4. Data Science, ML, and BI on one cloud platform Access all business and big data in open data lake Securely integrates with your cloud ecosystem BIG DATA & BUSINESS DATA DATA SCIENTISTS ML ENGINEERS DATA ANALYSTS DATA ENGINEERS ENTERPRISE CLOUD SERVICE A simple, scalable, and secure managed service UNIFIED DATA SERVICE High quality data with great performance DATA SCIENCE WORKSPACE Collaboration across the lifecycle BI INTEGRATIONS Access all your data UNIFIED DATA ANALYTICS PLATFORM
  • 5. About our speakers Yifan Cao, Sr. Product Manager, Machine Learning at Databricks • Product Area: ML/DL algorithms and Databricks Runtime for Machine Learning • Built and grew two ML products to multi-million dollars in annual revenue • B.S. Engineering from UC Berkeley; MBA from MIT Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One • Wearing many hats @ Hyperloop: Embedded Devices, Back-end Software, Data Science & Machine Learning • Previous experience includes Samsung R&D, CERN • M.S. in Information Technologies from EPFL (Switzerland)
  • 6. Agenda 1. Virgin Hyperloop One 2. Hyperloop  Databricks Story 3. What is Databricks and Koalas 4. Koalas (Python + notebook, live coding) 5. Short intro to MLflow tracking 6. Koalas & MLflow (Python + notebook, live coding) 7. Koalas tips & tricks
  • 7. Virgin Hyperloop One (VHO) • Startup with 250+ employees in Los Angeles (hiring!) • New transportation system • Vacuum tube + small passenger/cargo vehicles (Pods) • Short travel times • On-demand (“Ride Hailing”) • Zero direct emission (Electric Lev & Prop) • 500m test track in Nevada (video) • New exciting tests, tracks, enhancements (in progress right now)
  • 8. VHO – MIA: Machine Intelligence & Analytics • Operational Research for Hyperloop • Analytics products for Business and Tech • Simulation- and data-based • Sample questions: What’s the optimal vehicle capacity? How many passengers can we realistically handle in scenario X between cities Y and Z? How much better are we than other modes? • Answers? Data
  • 9. DEMAND MODELLING TRIP PLANNING PERFORMANCE METRICS COST METRICS GEOSPATIAL ANALYTICS 3D ALIGNMENT OPTIMIZER TEST RUNS HW & SW TEST RIGS Examples of AI and Analytics and Data in VHO
  • 10. Hyperloop Data Story • Growing data sizes (from MBs to GBs) • Growing processing times (from minutes to hours) • Python scripts crashing (pandas out of memory) • Need for a more enterprise-grade, scalable approach to handling data (tried different solutions) • Spark and its ecosystem are the de facto standard solution to that problem
  • 11. Hyperloop Data Story  Who’s gonna manage our new Spark infrastructure? Not enough Devops …
  • 12. Koalas - why? pandas code: pandas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count") PySpark code: spark_df.groupby("Destination").sum().orderBy("sum(Trip Count)", ascending=False).limit(10) Me after learning I need to redo all our pandas scripts in PySpark (and keep doing it for future DS work)
  • 14. Typical journey of a data scientist • Education (MOOCs, books, universities) → pandas • Analyze small data sets → pandas • Analyze big data sets → DataFrame in Spark • pandas: standard for single-machine workloads, small data • Apache Spark: standard for distributed workloads, big data
  • 15. What is Koalas? • Launched on April 24, 2019 by Databricks • Pure open source Python library • Aims at providing the pandas API on top of Apache Spark: unifies the two ecosystems with a familiar API; seamless transition between small and large data • github.com/databricks/koalas
  • 16. Koalas - why? pandas code: pandas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count") PySpark code: spark_df.groupby("Destination").sum().orderBy("sum(Trip Count)", ascending=False).limit(10) Koalas code: koalas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count") github.com/databricks/koalas
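The point of the slide above is that the Koalas chain is character-for-character the pandas chain. Here is a minimal sketch of that chain on plain pandas, with made-up toy data (the values, and `n=2` instead of 10, are assumptions for illustration). With Koalas installed, the same chain should apply to a Koalas DataFrame created through `import databricks.koalas as ks`, though that variant is not exercised here.

```python
import pandas as pd

# Hypothetical toy data: column names follow the slide, values are invented.
trips = pd.DataFrame({
    "Destination": ["A", "B", "A", "C", "B", "A"],
    "Trip Count": [5, 3, 7, 1, 4, 2],
})

# Same chain as on the slide, keeping the top 2 destinations instead of 10.
top = trips.groupby("Destination").sum().nlargest(2, columns="Trip Count")
print(top)
```

The PySpark version of the same query needs a different sort-and-limit idiom (`orderBy(...).limit(...)`), which is the rewrite cost the slide is complaining about.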
  • 17. Koalas Architecture (diagram labels): Catalyst Optimization & Tungsten Execution; DataFrame APIs; SQL; Koalas; Core; Data Source Connectors; pandas; Spark. A lean API layer
  • 18. Koalas User Adoption • Better scale the breadth of pandas to big data • Reduce friction by unifying the big data environment • Has been quickly adopted: 860+ patches merged since announcement in April 2019; 20+ major contributors from outside Databricks; 24k+ daily downloads • github.com/databricks/koalas
  • 19. What is Koalas Koalas allows seamless* switch from pandas to Spark  that means scaling the compute power * few caveats covered in the DEMO #1  Databricks can manage our Spark infrastructure (and is also the author of Koalas)
  • 20. What is Koalas Koalas allows seamless* switch from pandas to Spark • that means scaling the compute power (* a few caveats are covered in DEMO #1) • Need to speed up your computation? Scale up your Spark workers: obviously, this comes at a $ price; processing time cannot drop below a few seconds, so the speed-up saturates
  • 22. MLflow - purpose with mlflow.start_run(): mlflow.log_param("alpha", a) mlflow.log_param("l1_ratio", l1) rmse, r2, lr = train_score_model(a, l1) mlflow.log_metric("rmse", rmse) mlflow.log_metric("r2", r2) mlflow.sklearn.log_model(lr, "model") github.com/mlflow databricks.com/mlflow
  • 23. MLflow - purpose with mlflow.start_run(): mlflow.log_param("alpha", a) mlflow.log_param("l1_ratio", l1) rmse, r2, lr = train_score_model(a, l1) mlflow.log_metric("rmse", rmse) mlflow.log_metric("r2", r2) mlflow.sklearn.log_model(lr, "model") N times (parameter sweep)
  • 24. MLflow - our model – the demand depends on the hour and day of week – let’s create different ML prediction models and save them to an MLflow experiment
  • 25. MLflow - our model – Koalas can easily be used for pre-processing the test/train data (reuse existing) – use kdf.apply() to sweep & score models in parallel using Spark workers – let’s create different models and save them to an MLflow experiment (Diagram: sweep parameters #1 and #2 feed ML models (black boxes) that output a predicted trip count; each output is scored on the test set; parameters, metrics and model are logged to MLflow)
  • 28. DEMO #2 Koalas & MLflow
  • 29. pandas to Koalas – tips and tricks • Almost all popular pandas ops are available in Koalas. Some params are missing (missing something? Open an issue on the Koalas GitHub! Don’t be shy like me!). • Some functionality is deliberately NOT implemented; the easiest workaround is: kdf.to_pandas().do_whatever_you_want().to_koalas(). Example: DataFrame.values, because all the data would be loaded into the driver's memory, causing OOM errors. • Be aware of different execution principles (ordering, lazy evaluation, underlying Spark df): sort after groupby, different structure of groupby.apply, different NaN treatment & ops. • I personally really like using kdf.apply(my_func, axis=1) for any distributed row-based job, including web scraping, dict-mapping, MLflow runs, etc. All the function calls (for all the rows) are then distributed among all Spark workers.
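To make the row-based `apply(my_func, axis=1)` pattern from the slide above concrete, here is a small sketch using plain pandas, which runs on a single machine. The mapping dictionary, the `label_row` function, and the data are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical lookup table for the dict-mapping use case.
STATION_NAMES = {"A": "Alpha", "B": "Bravo"}


def label_row(row):
    """Build a label from one row; unknown destinations get a fallback name."""
    name = STATION_NAMES.get(row["Destination"], "unknown")
    return f"{name}: {row['Trip Count']}"


pdf = pd.DataFrame({"Destination": ["A", "B", "C"], "Trip Count": [14, 7, 1]})

# axis=1 passes each row to the function as a Series.
labels = pdf.apply(label_row, axis=1)
print(labels.tolist())
```

On a Koalas DataFrame the call has the same shape, `kdf.apply(label_row, axis=1)`, and according to the slide the per-row calls are then spread across the Spark workers.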
  • 30. pandas to Koalas – tips and tricks #2 • kdf = kdf.cache() – will not re-compute from the beginning every time; useful especially for exploratory analysis where you use the same kdf in different cells. For one long script, Spark is gonna optimize the tree for you. This behavior is very different from pandas! • Problems with Koalas? Take a look at ks.options: compute.ops_on_diff_frames, compute.ordered_head, plotting.sample_ratio, display.max_rows • Example (cache vs. no cache): ks_with_trend = ks_bart_df.groupby(["Date", "Hour"]).mean() ks_trendless = ks_with_trend.copy() ks_trendless["Trip Count"] -= trend["Trip Count"] ks_trendless = ks_trendless.cache() # <-- caching here
  • 31. Koalas roadmap  Expand pandas API coverage with Koalas Dataframes  Current: ~70% API coverage with pandas  Integration with more visualization packages  Current: support with matplotlib  Deeper Integration with numpy  Current: universal functions are implemented  More example notebooks  Current: a few examples on
  • 32. Conclusions 1. Virgin Hyperloop One – cool stuff – we’re hiring ( https://hyperloop-one.com/careers ) 2. Hyperloop  Databricks Story – lucky coincidence, amazing partnership 3. What is Databricks and Koalas – pandas API for Spark 4. Koalas DEMO #1 – very convenient, but still can use pySpark if situation requires 5. Short intro to MLflow tracking – excellent for organizing your experiments (not only ML) 6. Koalas & MLflow DEMO #2 – Koalas is also nice for parallel model execution and scoring 7. Next Steps: Sparkifying Matlab
  • 33. References, links 1. Koalas documentation & GitHub: https://github.com/databricks/koalas 2. Blog post: “How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas” https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html 3. How to improve your pandas, if you don’t wanna move to Spark?: “From Pandas-wan to Pandas-master” https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442
  • 34. Q&A Thank you for joining!