1. From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
2. Logistics
• We can’t hear you…
• Recording will be available…
• Code samples and notebooks will be available…
• Submit your questions…
• Bookmark databricks.com/blog
3. Our Mission: Helping data teams solve the world’s toughest problems
Original creators of popular data and machine learning open source projects
Global company with 5,000+ customers and 450+ partners
4. Unified Data Analytics Platform
• Data Science, ML, and BI on one cloud platform
• Access all business and big data in an open data lake
• Securely integrates with your cloud ecosystem
[Platform diagram: data scientists, ML engineers, data analysts, and data engineers working on big data & business data, on top of:
• DATA SCIENCE WORKSPACE – collaboration across the lifecycle
• UNIFIED DATA SERVICE – high-quality data with great performance
• ENTERPRISE CLOUD SERVICE – a simple, scalable, and secure managed service
• BI INTEGRATIONS – access all your data]
5. About our speakers
Yifan Cao, Sr. Product Manager, Machine Learning at Databricks
• Product area: ML/DL algorithms and Databricks Runtime for Machine Learning
• Built and grew two ML products to multi-million dollars in annual revenue
• B.S. Engineering from UC Berkeley; MBA from MIT
Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One
• Wearing many hats @ Hyperloop: embedded devices, back-end software, data science & machine learning
• Previous experience includes Samsung R&D, CERN
• M.S. in Information Technologies from EPFL (Switzerland)
6. Agenda
1. Virgin Hyperloop One
2. Hyperloop Databricks Story
3. What is Databricks and Koalas
4. Koalas (Python + notebook, live coding)
5. Short intro to MLflow tracking
6. Koalas & MLflow (Python + notebook, live coding)
7. Koalas tips & tricks
7. Virgin Hyperloop One (VHO)
• Startup with 250+ employees in Los Angeles ( hiring! )
• New transportation system: vacuum tube + small passenger/cargo vehicles (Pods)
• Short travel times
• On-demand ( “Ride Hailing” )
• Zero direct emissions (electric levitation & propulsion)
• 500 m test track in Nevada (video)
• New exciting tests, tracks, and enhancements (in progress right now)
8. Operational Research for Hyperloop
VHO – MIA Machine Intelligence & Analytics
• Analytics products for business and tech
• Simulation- and data-based
Sample questions:
• What’s the optimal vehicle capacity?
• How many passengers can we realistically handle in scenario X between cities Y and Z?
• How much better are we than other modes?
Answers? Data.
10. Hyperloop Data Story
• Growing data sizes (from MBs to GBs)
• Growing processing times (from minutes to hours)
• Python scripts crashing ( pandas out of memory )
• Need a more enterprise-grade, scalable approach to handling data (tried different solutions)
• Spark and its ecosystem are the de facto standard solution to this problem
11. Hyperloop Data Story
Who’s gonna manage our new Spark infrastructure?
Not enough DevOps…
12. Koalas - why?
pandas code:
pandas_df
    .groupby("Destination")
    .sum()
    .nlargest(10, columns="Trip Count")
PySpark code:
spark_df
    .groupby("Destination")
    .sum()
    .orderBy("sum(Trip Count)", ascending=False)
    .limit(10)
Me after learning I need to redo all our pandas scripts in PySpark (and keep doing it for future DS work)
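The pandas chain above can be run end-to-end on a toy dataset (the sample destinations and counts below are made up for illustration); with Koalas, the identical chain works unchanged on a distributed DataFrame:

```python
import pandas as pd

# Toy trip data; with Koalas the same chain would run on Spark:
#   import databricks.koalas as ks; pandas_df = ks.DataFrame(...)
pandas_df = pd.DataFrame({
    "Destination": ["LA", "SF", "LA", "Vegas", "SF", "LA"],
    "Trip Count": [10, 20, 5, 7, 3, 1],
})

# Total trips per destination, keeping only the top 2
top = (pandas_df
       .groupby("Destination")
       .sum()
       .nlargest(2, columns="Trip Count"))
```

Note how the pandas version needs no manual `orderBy` on a generated column name like `"sum(Trip Count)"`, which is exactly the friction the PySpark snippet shows.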
14. Typical journey of a data scientist
• Education (MOOCs, books, universities) → pandas
• Analyze small data sets → pandas
• Analyze big data sets → DataFrame in Spark
pandas: standard for single-machine workloads, small data
Apache Spark: standard for distributed workloads, big data
15. What is Koalas?
• Launched on April 24, 2019 by Databricks
• Pure open-source Python library
• Aims at providing the pandas API on top of Apache Spark:
• Unifies the two ecosystems with a familiar API
• Seamless transition between small and large data
github.com/databricks/koalas
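That "seamless transition" can be sketched as an import switch: the analysis code is written once and runs on Spark when Koalas is available, or on a single machine with plain pandas otherwise. The try/except fallback here is my own illustration, not part of the slide:

```python
try:
    # pandas API on Spark; needs a (local or cluster) Spark runtime
    import databricks.koalas as xd
except ImportError:
    # same API, single machine
    import pandas as xd

# Everything below is identical regardless of which backend loaded
df = xd.DataFrame({"city": ["LA", "SF", "LA"], "trips": [4, 2, 6]})
per_city = df.groupby("city").sum()
```

With pandas this stays in one process; with Koalas the same two lines describe a distributed Spark job.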
18. Koalas User Adoption
Better scale the breadth of Pandas to big data
Reduce friction by unifying big data environment
Has been quickly adopted
860+ patches merged since announcement in April 2019
20+ major contributors that are outside Of Databricks
24k+ daily downloads
github.com/databricks/koalas
19. What is Koalas
Koalas allows a seamless* switch from pandas to Spark, which means scaling the compute power
* a few caveats, covered in DEMO #1
Databricks can manage our Spark infrastructure (and is also the author of Koalas)
20. What is Koalas
Koalas allows a seamless* switch from pandas to Spark, which means scaling the compute power
* a few caveats, covered in DEMO #1
Need to speed up your computation? Scale up your Spark workers:
- obviously, comes at a $ price
- processing time cannot drop below a few seconds; the speed-up saturates
22. MLflow - purpose
with mlflow.start_run():
    mlflow.log_param("alpha", a)
    mlflow.log_param("l1_ratio", l1)
    rmse, r2, lr = train_score_model(a, l1)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")
github.com/mlflow
databricks.com/mlflow
23. MLflow - purpose
with mlflow.start_run():
    mlflow.log_param("alpha", a)
    mlflow.log_param("l1_ratio", l1)
    rmse, r2, lr = train_score_model(a, l1)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")
(repeated N times for the parameter sweep)
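The "N times" annotation can be sketched as a plain grid sweep. `train_score_model` is a toy stand-in here (the deck's real one trains an actual model), and the MLflow calls from the slide are shown as comments so the sketch stays self-contained:

```python
import itertools

def train_score_model(alpha, l1_ratio):
    # Toy stand-in: a smooth error surface with its minimum at
    # alpha=0.5, l1_ratio=0.3; returns (rmse, r2)
    rmse = (alpha - 0.5) ** 2 + (l1_ratio - 0.3) ** 2
    return rmse, 1.0 - rmse

results = []
for a, l1 in itertools.product([0.1, 0.5, 0.9], [0.1, 0.3, 0.5]):
    # with mlflow.start_run():                 # one MLflow run per combo
    #     mlflow.log_param("alpha", a)
    #     mlflow.log_param("l1_ratio", l1)
    rmse, r2 = train_score_model(a, l1)
    #     mlflow.log_metric("rmse", rmse)
    #     mlflow.log_metric("r2", r2)
    results.append((rmse, a, l1))

# Each run is logged independently; MLflow's UI then lets you
# compare all N runs. Picking the best combo locally:
best_rmse, best_alpha, best_l1 = min(results)
```
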
24. MLflow - our model
– the demand depends on the hour and day of week
– let’s create different ML prediction models and save them to an MLflow experiment
25. MLflow - our model
– Koalas can easily be used for pre-processing the test/train data (reuse existing code)
– use kdf.apply() to sweep & score models in parallel using Spark workers
– let’s create different models and save them to an MLflow experiment
[Diagram: for each combination of sweep parameters #1 and #2, an ML model (black box) is trained, scored on the test set, outputs a predicted trip count, and its parameters, metrics, and model are logged to MLflow]
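The `kdf.apply()` sweep pattern can be sketched with pandas (same API): one row per parameter combination, one function call per row. The grid values and the stand-in scoring function are made up for illustration; with a Koalas DataFrame, the identical `.apply(..., axis=1)` fans the calls out across the Spark workers:

```python
import pandas as pd

# Hypothetical sweep grid: one row = one parameter combination
grid = pd.DataFrame({"alpha": [0.1, 0.5, 0.9],
                     "l1_ratio": [0.3, 0.3, 0.3]})

def train_and_score(row):
    # Stand-in for: train a model with these params, score it on the
    # test set, and log params/metrics/model to MLflow inside this call
    return (row["alpha"] - 0.5) ** 2 + (row["l1_ratio"] - 0.3) ** 2

# With Koalas, grid is a kdf and each row's call runs on a Spark worker
grid["rmse"] = grid.apply(train_and_score, axis=1)
best = grid.loc[grid["rmse"].idxmin()]
```
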
29. pandas to Koalas – tips and tricks
Almost all popular pandas ops are available in Koalas; some parameters are missing
(missing something? Open an issue on the Koalas GitHub! Don’t be shy like me!).
Some functionality is deliberately not implemented; the easiest workaround is:
kdf.to_pandas().do_whatever_you_want().to_koalas()
DataFrame.values: all the data would be loaded into the driver's memory, risking OOM errors.
Be aware of different execution principles (ordering, lazy evaluation, underlying Spark DataFrame):
sort after groupby, different structure of groupby.apply, different NaN treatment & ops
I personally really like using kdf.apply(my_func, axis=1) for any distributed row-based job,
including web-scraping, dict-mapping, MLflow runs, etc.
All the function calls (for all the rows) are then distributed among the Spark workers.
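The `to_pandas()`/`to_koalas()` round-trip can be wrapped in a small helper. The `hasattr` guards are my own addition so the sketch also runs where only pandas is installed; the method names come from the slide (Koalas patches `.to_koalas()` onto pandas DataFrames when imported). Keep the memory caveat in mind:

```python
import pandas as pd

def via_pandas(df, func):
    # Escape hatch for ops Koalas deliberately doesn't implement:
    # pull the frame to the driver as pandas, apply the op, go back.
    # Only safe when the WHOLE frame fits in driver memory!
    pdf = df.to_pandas() if hasattr(df, "to_pandas") else df
    out = func(pdf)
    return out.to_koalas() if hasattr(out, "to_koalas") else out

df = pd.DataFrame({"x": [3, 1, 2]})
result = via_pandas(df, lambda pdf: pdf.sort_values("x"))
```
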
30. pandas to Koalas – tips and tricks #2
kdf = kdf.cache() – will not re-compute from the beginning every time;
useful especially for exploratory analysis where you use the same kdf in different cells.
For one long script, Spark is gonna optimize the tree for you.
This behavior is very different from pandas!
Problems with Koalas? Take a look at ks.options:
- compute.ops_on_diff_frames
- compute.ordered_head
- plotting.sample_ratio
- display.max_rows
[Chart: runtimes with cache vs. no cache]
ks_with_trend = ks_bart_df.groupby(["Date", "Hour"]).mean()
ks_trendless = ks_with_trend.copy()
ks_trendless["Trip Count"] -= trend["Trip Count"]
ks_trendless = ks_trendless.cache() # <-- caching here
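Toggling these options uses the same `set_option`/`get_option` interface that pandas itself exposes, which Koalas mirrors; the sketch below uses pandas since the interface is shared, while keys like `compute.ops_on_diff_frames` exist only in Koalas (hedged: option key names as listed on the slide):

```python
import pandas as pd

# pandas' option API; Koalas exposes the same set_option/get_option
# calls, plus Koalas-only keys, e.g.:
#   ks.set_option("compute.ops_on_diff_frames", True)
pd.set_option("display.max_rows", 20)
current = pd.get_option("display.max_rows")
pd.reset_option("display.max_rows")
```
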
31. Koalas roadmap
Expand pandas API coverage with Koalas DataFrames
Current: ~70% API coverage with pandas
Integration with more visualization packages
Current: matplotlib support
Deeper integration with NumPy
Current: universal functions are implemented
More example notebooks
Current: a few examples
32. Conclusions
1. Virgin Hyperloop One – cool stuff – we’re hiring ( https://hyperloop-one.com/careers )
2. Hyperloop Databricks Story – lucky coincidence, amazing partnership
3. What is Databricks and Koalas – pandas API for Spark
4. Koalas DEMO #1 – very convenient, but you can still use PySpark if the situation requires
5. Short intro to MLflow tracking – excellent for organizing your experiments (not only ML)
6. Koalas & MLflow DEMO #2 – Koalas is also nice for parallel model execution and scoring
7. Next Steps: Sparkifying MATLAB
33. References, links
1. Koalas documentation & GitHub:
https://github.com/databricks/koalas
2. Blog post: “How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas”
https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html
3. How to improve your pandas if you don’t wanna move to Spark: “From Pandas-wan to Pandas-master”
https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442