1. From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
2. Logistics
• We can’t hear you…
• Recording will be available…
• Code samples and notebooks will be available…
• Submit your questions…
• Bookmark databricks.com/blog
3. Our Mission: Helping data teams solve the world’s toughest problems
Original creators of popular data and machine learning open source projects
Global company with 5,000+ customers and 450+ partners
4. Unified Data Analytics Platform
• Data Science, ML, and BI on one cloud platform
• Access all business and big data in an open data lake
• Securely integrates with your cloud ecosystem
[Platform diagram: data scientists, ML engineers, data analysts, and data engineers working on big data & business data, on top of:
• DATA SCIENCE WORKSPACE – collaboration across the lifecycle
• UNIFIED DATA SERVICE – high-quality data with great performance
• ENTERPRISE CLOUD SERVICE – a simple, scalable, and secure managed service
• BI INTEGRATIONS – access all your data]
5. About our speakers
Yifan Cao, Sr. Product Manager, Machine Learning at Databricks
• Product area: ML/DL algorithms and Databricks Runtime for Machine Learning
• Built and grew two ML products to multi-million dollars in annual revenue
• B.S. Engineering from UC Berkeley; MBA from MIT
Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One
• Wearing many hats @ Hyperloop: embedded devices, back-end software, data science & machine learning
• Previous experience includes Samsung R&D, CERN
• M.S. in Information Technologies from EPFL (Switzerland)
6. Agenda
1. Virgin Hyperloop One
2. Hyperloop Databricks Story
3. What is Databricks and Koalas
4. Koalas (Python + notebook, live coding)
5. Short intro to MLflow tracking
6. Koalas & MLflow (Python + notebook, live coding)
7. Koalas tips & tricks
7. Virgin Hyperloop One (VHO)
• Startup with 250+ employees in Los Angeles ( hiring! )
• New transportation system: vacuum tube + small passenger/cargo vehicles (Pods)
• Short travel times
• On-demand ( “Ride Hailing” )
• Zero direct emissions (electric levitation & propulsion)
• 500 m test track in Nevada (video)
• New exciting tests, tracks, and enhancements (in progress right now)
8. Operational Research for Hyperloop
VHO – MIA Machine Intelligence & Analytics
• Analytics products for business and tech
• Simulation- and data-based
Sample questions:
• What’s the optimal vehicle capacity?
• How many passengers can we realistically handle in scenario X between cities Y and Z?
• How much better are we than other modes?
Answers? Data.
10. Hyperloop Data Story
• Growing data sizes (from MBs to GBs)
• Growing processing times (from minutes to hours)
• Python scripts crashing ( pandas out of memory )
• Need a more enterprise-grade, scalable approach to handling data (tried different solutions)
• Spark and its ecosystem are the de facto standard solution to this problem
11. Hyperloop Data Story
Who’s gonna manage our new Spark infrastructure?
Not enough DevOps…
12. Koalas - why?
pandas code:
pandas_df
    .groupby("Destination")
    .sum()
    .nlargest(10, columns="Trip Count")
PySpark code:
spark_df
    .groupby("Destination")
    .sum()
    .orderBy("sum(Trip Count)", ascending=False)
    .limit(10)
Me after learning I need to redo all our pandas scripts in PySpark (and keep doing it for future DS work)
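The pandas chain above can be run end-to-end on a toy dataset (the sample destinations and counts below are made up for illustration); with Koalas, the identical chain works unchanged on a distributed DataFrame:

```python
import pandas as pd

# Toy trip data; with Koalas the same chain would run on Spark:
#   import databricks.koalas as ks; pandas_df = ks.DataFrame(...)
pandas_df = pd.DataFrame({
    "Destination": ["LA", "SF", "LA", "Vegas", "SF", "LA"],
    "Trip Count": [10, 20, 5, 7, 3, 1],
})

# Total trips per destination, keeping only the top 2
top = (pandas_df
       .groupby("Destination")
       .sum()
       .nlargest(2, columns="Trip Count"))
```

Note how the pandas version needs no manual `orderBy` on a generated column name like `"sum(Trip Count)"`, which is exactly the friction the PySpark snippet shows.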
14. Typical journey of a data scientist
• Education (MOOCs, books, universities) → pandas
• Analyze small data sets → pandas
• Analyze big data sets → DataFrame in Spark
pandas: standard for single-machine workloads, small data
Apache Spark: standard for distributed workloads, big data
15. What is Koalas?
• Launched on April 24, 2019 by Databricks
• Pure open-source Python library
• Aims at providing the pandas API on top of Apache Spark:
• Unifies the two ecosystems with a familiar API
• Seamless transition between small and large data
github.com/databricks/koalas
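That "seamless transition" can be sketched as an import switch: the analysis code is written once and runs on Spark when Koalas is available, or on a single machine with plain pandas otherwise. The try/except fallback here is my own illustration, not part of the slide:

```python
try:
    # pandas API on Spark; needs a (local or cluster) Spark runtime
    import databricks.koalas as xd
except ImportError:
    # same API, single machine
    import pandas as xd

# Everything below is identical regardless of which backend loaded
df = xd.DataFrame({"city": ["LA", "SF", "LA"], "trips": [4, 2, 6]})
per_city = df.groupby("city").sum()
```

With pandas this stays in one process; with Koalas the same two lines describe a distributed Spark job.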
18. Koalas User Adoption
Better scale the breadth of Pandas to big data
Reduce friction by unifying big data environment
Has been quickly adopted
860+ patches merged since announcement in April 2019
20+ major contributors that are outside Of Databricks
24k+ daily downloads
github.com/databricks/koalas
19. What is Koalas
Koalas allows a seamless* switch from pandas to Spark, which means scaling the compute power
* a few caveats, covered in DEMO #1
Databricks can manage our Spark infrastructure (and is also the author of Koalas)
20. What is Koalas
Koalas allows a seamless* switch from pandas to Spark, which means scaling the compute power
* a few caveats, covered in DEMO #1
Need to speed up your computation? Scale up your Spark workers:
- obviously, comes at a $ price
- processing time cannot drop below a few seconds; the speed-up saturates
22. MLflow - purpose
with mlflow.start_run():
    mlflow.log_param("alpha", a)
    mlflow.log_param("l1_ratio", l1)
    rmse, r2, lr = train_score_model(a, l1)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")
github.com/mlflow
databricks.com/mlflow
23. MLflow - purpose
with mlflow.start_run():
    mlflow.log_param("alpha", a)
    mlflow.log_param("l1_ratio", l1)
    rmse, r2, lr = train_score_model(a, l1)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")
(repeated N times for the parameter sweep)
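The "N times" annotation can be sketched as a plain grid sweep. `train_score_model` is a toy stand-in here (the deck's real one trains an actual model), and the MLflow calls from the slide are shown as comments so the sketch stays self-contained:

```python
import itertools

def train_score_model(alpha, l1_ratio):
    # Toy stand-in: a smooth error surface with its minimum at
    # alpha=0.5, l1_ratio=0.3; returns (rmse, r2)
    rmse = (alpha - 0.5) ** 2 + (l1_ratio - 0.3) ** 2
    return rmse, 1.0 - rmse

results = []
for a, l1 in itertools.product([0.1, 0.5, 0.9], [0.1, 0.3, 0.5]):
    # with mlflow.start_run():                 # one MLflow run per combo
    #     mlflow.log_param("alpha", a)
    #     mlflow.log_param("l1_ratio", l1)
    rmse, r2 = train_score_model(a, l1)
    #     mlflow.log_metric("rmse", rmse)
    #     mlflow.log_metric("r2", r2)
    results.append((rmse, a, l1))

# Each run is logged independently; MLflow's UI then lets you
# compare all N runs. Picking the best combo locally:
best_rmse, best_alpha, best_l1 = min(results)
```
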
24. MLflow - our model
– the demand depends on the hour and day of week
– let’s create different ML prediction models and save them to an MLflow experiment
25. MLflow - our model
– Koalas can easily be used for pre-processing the test/train data (reuse existing code)
– use kdf.apply() to sweep & score models in parallel using Spark workers
– let’s create different models and save them to an MLflow experiment
[Diagram: for each combination of sweep parameters #1 and #2, an ML model (black box) is trained, scored on the test set, outputs a predicted trip count, and its parameters, metrics, and model are logged to MLflow]
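The `kdf.apply()` sweep pattern can be sketched with pandas (same API): one row per parameter combination, one function call per row. The grid values and the stand-in scoring function are made up for illustration; with a Koalas DataFrame, the identical `.apply(..., axis=1)` fans the calls out across the Spark workers:

```python
import pandas as pd

# Hypothetical sweep grid: one row = one parameter combination
grid = pd.DataFrame({"alpha": [0.1, 0.5, 0.9],
                     "l1_ratio": [0.3, 0.3, 0.3]})

def train_and_score(row):
    # Stand-in for: train a model with these params, score it on the
    # test set, and log params/metrics/model to MLflow inside this call
    return (row["alpha"] - 0.5) ** 2 + (row["l1_ratio"] - 0.3) ** 2

# With Koalas, grid is a kdf and each row's call runs on a Spark worker
grid["rmse"] = grid.apply(train_and_score, axis=1)
best = grid.loc[grid["rmse"].idxmin()]
```
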
29. pandas to Koalas – tips and tricks
Almost all popular pandas ops are available in Koalas; some parameters are missing
(missing something? Open an issue on the Koalas GitHub! Don’t be shy like me!).
Some functionality is deliberately not implemented; the easiest workaround is:
kdf.to_pandas().do_whatever_you_want().to_koalas()
DataFrame.values: all the data would be loaded into the driver's memory, risking OOM errors.
Be aware of different execution principles (ordering, lazy evaluation, underlying Spark DataFrame):
sort after groupby, different structure of groupby.apply, different NaN treatment & ops
I personally really like using kdf.apply(my_func, axis=1) for any distributed row-based job,
including web-scraping, dict-mapping, MLflow runs, etc.
All the function calls (for all the rows) are then distributed among the Spark workers.
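The `to_pandas()`/`to_koalas()` round-trip can be wrapped in a small helper. The `hasattr` guards are my own addition so the sketch also runs where only pandas is installed; the method names come from the slide (Koalas patches `.to_koalas()` onto pandas DataFrames when imported). Keep the memory caveat in mind:

```python
import pandas as pd

def via_pandas(df, func):
    # Escape hatch for ops Koalas deliberately doesn't implement:
    # pull the frame to the driver as pandas, apply the op, go back.
    # Only safe when the WHOLE frame fits in driver memory!
    pdf = df.to_pandas() if hasattr(df, "to_pandas") else df
    out = func(pdf)
    return out.to_koalas() if hasattr(out, "to_koalas") else out

df = pd.DataFrame({"x": [3, 1, 2]})
result = via_pandas(df, lambda pdf: pdf.sort_values("x"))
```
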
30. pandas to Koalas – tips and tricks #2
kdf = kdf.cache() – will not re-compute from the beginning every time;
useful especially for exploratory analysis where you use the same kdf in different cells.
For one long script, Spark is gonna optimize the tree for you.
This behavior is very different from pandas!
Problems with Koalas? Take a look at ks.options:
- compute.ops_on_diff_frames
- compute.ordered_head
- plotting.sample_ratio
- display.max_rows
[Chart: runtimes with cache vs. no cache]
ks_with_trend = ks_bart_df.groupby(["Date", "Hour"]).mean()
ks_trendless = ks_with_trend.copy()
ks_trendless["Trip Count"] -= trend["Trip Count"]
ks_trendless = ks_trendless.cache() # <-- caching here
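Toggling these options uses the same `set_option`/`get_option` interface that pandas itself exposes, which Koalas mirrors; the sketch below uses pandas since the interface is shared, while keys like `compute.ops_on_diff_frames` exist only in Koalas (hedged: option key names as listed on the slide):

```python
import pandas as pd

# pandas' option API; Koalas exposes the same set_option/get_option
# calls, plus Koalas-only keys, e.g.:
#   ks.set_option("compute.ops_on_diff_frames", True)
pd.set_option("display.max_rows", 20)
current = pd.get_option("display.max_rows")
pd.reset_option("display.max_rows")
```
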
31. Koalas roadmap
Expand pandas API coverage with Koalas DataFrames
Current: ~70% API coverage with pandas
Integration with more visualization packages
Current: matplotlib support
Deeper integration with NumPy
Current: universal functions are implemented
More example notebooks
Current: a few examples
32. Conclusions
1. Virgin Hyperloop One – cool stuff – we’re hiring ( https://hyperloop-one.com/careers )
2. Hyperloop Databricks Story – lucky coincidence, amazing partnership
3. What is Databricks and Koalas – pandas API for Spark
4. Koalas DEMO #1 – very convenient, but you can still use PySpark if the situation requires
5. Short intro to MLflow tracking – excellent for organizing your experiments (not only ML)
6. Koalas & MLflow DEMO #2 – Koalas is also nice for parallel model execution and scoring
7. Next Steps: Sparkifying MATLAB
33. References, links
1. Koalas documentation & GitHub:
https://github.com/databricks/koalas
2. Blog post: “How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas”
https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html
3. How to improve your pandas if you don’t wanna move to Spark: “From Pandas-wan to Pandas-master”
https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442