While struggling to choose among computing and machine learning frameworks such as Spark, Dask, scikit-learn, and TensorFlow for your ETL and machine learning projects, have you thought about unifying them into a single ecosystem?
6. Motivation of Fugue
● A pure abstraction layer
● Unify and simplify core concepts of distributed computing
● Decouple your logic from any specific solution
● Easy to learn and easy to switch
● NOT invasive, NOT obstructive, and NOT exclusive
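The idea of decoupling logic from a specific backend can be illustrated with a small sketch. It is not Fugue's actual API: the `ExecutionEngine` protocol, the two engine classes, and `run_pipeline` are hypothetical names invented here, and a thread pool stands in for a distributed backend.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Protocol


class ExecutionEngine(Protocol):
    """Hypothetical minimal engine interface (an assumption, not Fugue's API)."""

    def map(self, func: Callable[[int], int], data: Iterable[int]) -> List[int]:
        ...


class LocalEngine:
    """Runs the function sequentially in the current process."""

    def map(self, func, data):
        return [func(x) for x in data]


class ThreadedEngine:
    """Runs the function on a thread pool; stands in for a distributed backend."""

    def __init__(self, workers: int = 4):
        self.workers = workers

    def map(self, func, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(func, data))  # pool.map preserves input order


def square(x: int) -> int:
    """Business logic: it knows nothing about the engine that runs it."""
    return x * x


def run_pipeline(engine: ExecutionEngine, data: Iterable[int]) -> List[int]:
    """The pipeline depends only on the interface, so engines are swappable."""
    return engine.map(square, data)


local_result = run_pipeline(LocalEngine(), range(5))
threaded_result = run_pipeline(ThreadedEngine(), range(5))
print(local_result, threaded_result)
```

Switching backends only changes the object passed to `run_pipeline`; `square` and the pipeline itself stay untouched.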
7. Example: Node2Vec
Apply a walk strategy to a graph to generate a collection of node vectors
to be used by embedding algorithms such as Word2Vec
15. Why DAG?
1. X = Run mapper A on a dataframe
2. Map X by mapper B and save
3. Map X by mapper C and save
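The three steps above form a small DAG in which X feeds two branches. A minimal sketch, assuming a hypothetical lazy `Node` class and made-up mappers, shows why knowing the whole graph helps: the shared step runs once even though two outputs depend on it.

```python
from typing import Callable, List, Optional


class Node:
    """A lazily evaluated DAG node whose result is computed at most once."""

    def __init__(self, func: Callable[..., list], *parents: "Node"):
        self.func = func
        self.parents = parents
        self._result: Optional[list] = None

    def compute(self) -> list:
        if self._result is None:  # reuse the cached result on later calls
            inputs = [p.compute() for p in self.parents]
            self._result = self.func(*inputs)
        return self._result


calls = {"A": 0}


def mapper_a(rows: List[int]) -> List[int]:
    calls["A"] += 1  # count how often the shared step actually runs
    return [r + 1 for r in rows]


def mapper_b(rows: List[int]) -> List[int]:
    return [r * 2 for r in rows]


def mapper_c(rows: List[int]) -> List[int]:
    return [-r for r in rows]


source = Node(lambda: [1, 2, 3])
x = Node(mapper_a, source)   # step 1: X = mapper A on the dataframe
b = Node(mapper_b, x)        # step 2: map X by mapper B
c = Node(mapper_c, x)        # step 3: map X by mapper C

out_b, out_c = b.compute(), c.compute()
print(out_b, out_c, calls["A"])
```

Without the graph, a script that evaluates each output independently would typically recompute mapper A for both branches.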
16. Optimizations on DAG Execution
● Automatically parallelize independent branches
● Auto persist
● More errors can be captured at “compile” time
● Determinism enables checkpointing, so executions can “resume”
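The link between determinism and resumable execution can be sketched as follows. This is an illustration under stated assumptions, not Fugue's implementation: `task_id`, `Workflow`, and `build` are hypothetical, a dict stands in for durable checkpoint storage, and a task's name and parameters are assumed to fully determine its behavior.

```python
import hashlib
from typing import Dict, List, Tuple


def task_id(name: str, params: Tuple, upstream_ids: Tuple[str, ...]) -> str:
    """Derive a stable ID from the task definition and its upstream IDs."""
    payload = repr((name, params, upstream_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


class Workflow:
    def __init__(self, store: Dict[str, List[int]]):
        self.store = store              # stands in for durable checkpoint storage
        self.executed: List[str] = []   # names of tasks that actually ran

    def run(self, name, func, params, upstream):
        """upstream is a list of (id, data) pairs returned by earlier run calls."""
        tid = task_id(name, params, tuple(u[0] for u in upstream))
        if tid not in self.store:       # only compute what has no checkpoint yet
            self.store[tid] = func(*[u[1] for u in upstream], *params)
            self.executed.append(name)
        return tid, self.store[tid]


def build(wf: Workflow, scale: int):
    src = wf.run("load", lambda: [1, 2, 3], (), [])
    scaled = wf.run("scale", lambda rows, k: [r * k for r in rows], (scale,), [src])
    return wf.run("total", lambda rows: [sum(rows)], (), [scaled])


store: Dict[str, List[int]] = {}
first = Workflow(store)
r1 = build(first, 2)     # cold run: every task executes
second = Workflow(store)
r2 = build(second, 2)    # identical workflow: everything is resumed from the store
third = Workflow(store)
r3 = build(third, 3)     # changed parameter: only downstream tasks rerun
print(first.executed, second.executed, third.executed)
```

Because each ID includes the IDs of its upstream tasks, changing one parameter invalidates exactly the tasks that depend on it.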
18. Fugue SQL
# Enriched syntax
a:= CREATE [["k1",0],["k2",1]] SCHEMA k:str,f:int
# Transformer extension
b:= TRANSFORM a USING plus_n PARTITION BY k
# SELECT statement
c:= SELECT a.*, b.f2 FROM a JOIN b ON a.k = b.k
# Simplified syntax & multi tasks
SELECT f, f2, 3 AS f3 PERSIST
PRINT
OUTPUT TO "file.parquet"
# Checkpoint
df ?? TRANSFORM b USING expensive_op
OUTPUT c, df USING assert_eq
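To make the data flow of the first few statements concrete, here is a plain-Python sketch of what they appear to compute. It is an interpretation, not Fugue code: the slide does not define `plus_n`, so the version below (add `n` to column `f` and emit it as `f2`) is an assumption, and `transform` and `join` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def plus_n(rows: List[Row], n: int = 1) -> List[Row]:
    """Assumed transformer: adds n to column f and emits it as f2."""
    return [{"k": r["k"], "f2": r["f"] + n} for r in rows]


def transform(rows: List[Row], func: Callable[[List[Row]], List[Row]],
              partition_by: str) -> List[Row]:
    """Group rows by a key column and apply func to each partition independently."""
    partitions: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)
    out: List[Row] = []
    for key in sorted(partitions):
        out.extend(func(partitions[key]))
    return out


def join(left: List[Row], right: List[Row], on: str) -> List[Row]:
    """Inner join on one column, merging the columns of matching rows."""
    index: Dict[Any, List[Row]] = defaultdict(list)
    for r in right:
        index[r[on]].append(r)
    return [{**l, **r} for l in left for r in index[l[on]]]


# a := CREATE [["k1",0],["k2",1]] SCHEMA k:str,f:int
a = [{"k": "k1", "f": 0}, {"k": "k2", "f": 1}]
# b := TRANSFORM a USING plus_n PARTITION BY k
b = transform(a, plus_n, "k")
# c := SELECT a.*, b.f2 FROM a JOIN b ON a.k = b.k
c = join(a, b, "k")
# SELECT f, f2, 3 AS f3 (read here as selecting from the previous result)
final = [{"f": r["f"], "f2": r["f2"], "f3": 3} for r in c]
print(final)
```

Each partition is processed on its own, which is what lets an engine distribute the `TRANSFORM` step across workers.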
19. Fugue SQL vs Spark SQL
Feature | Fugue SQL | Spark SQL
Workflow level | Yes | No
Cross platform | Yes | No
SELECT statement | Yes | Yes
Other SQL statements | No, can be done in extensions | Yes
Multiple statements | Yes | Yes (WITH statement)
Spark/Hive UDF (Java/Py) | Yes | Yes
Fugue extensions | Yes | No
Caching/checkpointing | Yes | No
24. ML Library: Node2Vec
● We implemented the distributed Node2Vec algorithm on Fugue
○ Use adjacency lists to represent a graph
○ Distributed Breadth-First Search for random walk
○ Cache critical variables for picking next step during BFS
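The bullets above can be illustrated with a single-machine sketch. It is not the distributed implementation described on the slide: it keeps the graph as adjacency lists and extends every walk by one step per round (the BFS-like part), using the standard Node2Vec second-order bias with return parameter `p` and in-out parameter `q`. The function names and the toy graph are made up, and no caching is shown.

```python
import random
from typing import Dict, List, Optional

Graph = Dict[int, List[int]]  # adjacency lists: node -> neighbours


def next_step(graph: Graph, prev: Optional[int], cur: int,
              p: float, q: float, rng: random.Random) -> Optional[int]:
    """Pick the next node with a Node2Vec-style second-order bias."""
    neighbours = graph[cur]
    if not neighbours:
        return None  # dead end: the walk cannot be extended
    weights = []
    for nxt in neighbours:
        if nxt == prev:
            weights.append(1.0 / p)   # step back to the previous node
        elif prev is not None and nxt in graph[prev]:
            weights.append(1.0)       # stay close to the previous node
        else:
            weights.append(1.0 / q)   # move away (uniform on the first step)
    return rng.choices(neighbours, weights=weights, k=1)[0]


def generate_walks(graph: Graph, walk_length: int,
                   p: float = 1.0, q: float = 1.0, seed: int = 0) -> List[List[int]]:
    """Start one walk per node and extend all walks by one step per round."""
    rng = random.Random(seed)
    walks = [[node] for node in sorted(graph)]
    for _ in range(walk_length - 1):
        for walk in walks:
            prev = walk[-2] if len(walk) > 1 else None
            nxt = next_step(graph, prev, walk[-1], p, q, rng)
            if nxt is not None:
                walk.append(nxt)
    return walks


toy_graph: Graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = generate_walks(toy_graph, walk_length=5, p=0.5, q=2.0)
for w in walks:
    print(w)
```

The resulting walks are what an embedding algorithm such as Word2Vec would consume as "sentences".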
26. Large Scale Testing
● Graph (10 million vertices, 300 million edges)
○ 2-3 hours with 500 cores and 3 TB memory
● Graph (100 million vertices, 3 billion edges)
○ 6-8 hours with 2,000 cores and 12 TB memory
27. ML Library: Time Series Seasonality
● Forecast seasonality coefficients using Kalman Filter
○ Decent performance on noisy data
○ Simulate special events (holidays, etc.) and anomalies
○ Any interval: hourly, daily, weekly, yearly, etc.
● Handle a very large number of time series with seasonality
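As a rough illustration of tracking seasonality coefficients with a Kalman filter, the sketch below runs one scalar filter per day of the week. It is an assumption-laden toy, not the library's model: the random-walk state model, the variances, the coefficient values, and the alternating "noise" are all made up for the example.

```python
from typing import List


class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model."""

    def __init__(self, process_var: float, obs_var: float,
                 init_mean: float = 0.0, init_var: float = 1.0):
        self.q, self.r = process_var, obs_var
        self.mean, self.var = init_mean, init_var

    def update(self, z: float) -> float:
        self.var += self.q                      # predict: uncertainty grows
        gain = self.var / (self.var + self.r)   # Kalman gain
        self.mean += gain * (z - self.mean)     # correct with the observation
        self.var *= 1.0 - gain                  # uncertainty shrinks after the update
        return self.mean


# Made-up day-of-week coefficients that the filters should recover.
true_pattern: List[float] = [1.0, 1.2, 0.8, 0.9, 1.1, 1.5, 0.5]
filters = [ScalarKalman(process_var=1e-5, obs_var=0.01) for _ in true_pattern]

for week in range(40):
    noise = 0.1 if week % 2 == 0 else -0.1      # deterministic stand-in for noise
    for day, coeff in enumerate(true_pattern):
        filters[day].update(coeff + noise)

estimates = [f.mean for f in filters]
# Under a random-walk model, next week's forecast is the current estimate.
forecast_next_week = list(estimates)
print([round(e, 2) for e in estimates])
```

Each series needs only a few floats of state, so many series can be filtered independently, which is the kind of workload that partitions well across a cluster.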
28. Fugue Streaming
● Fugue supports Spark streaming very well
○ Treats batch processing and streaming equivalently
○ Fugue spark-streaming pipeline in production
● Fugue abstract connectors for streaming
○ Kinesis connectors
○ Confluent Kafka connectors
○ Commonly used streaming APIs
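Treating batch and streaming equivalently can be sketched in plain Python. No Spark Streaming or connector API is used here: `enrich`, `run_batch`, `micro_batches`, and `run_streaming` are hypothetical, and the equivalence shown only holds for stateless, record-level logic like this example.

```python
from typing import Any, Dict, Iterable, Iterator, List

Event = Dict[str, Any]


def enrich(events: Iterable[Event]) -> Iterator[Event]:
    """Stateless per-record logic; it does not care where the events come from."""
    for e in events:
        if e["amount"] > 0:
            yield {**e, "amount_cents": round(e["amount"] * 100)}


def run_batch(events: List[Event]) -> List[Event]:
    """Batch mode: process the whole dataset in one call."""
    return list(enrich(events))


def micro_batches(events: List[Event], size: int) -> Iterator[List[Event]]:
    """Simulate a stream by handing out the events in small chunks."""
    for i in range(0, len(events), size):
        yield events[i:i + size]


def run_streaming(batches: Iterable[List[Event]]) -> List[Event]:
    """Streaming mode: apply the same logic to each micro-batch as it arrives."""
    out: List[Event] = []
    for batch in batches:
        out.extend(enrich(batch))
    return out


events = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": -3.0},
    {"id": 3, "amount": 0.99},
]
batch_out = run_batch(events)
stream_out = run_streaming(micro_batches(events, size=2))
print(batch_out == stream_out)
```

The business logic lives in `enrich` only, so the same function can be tested on a small in-memory list and then reused on a stream.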
31. Migrated Projects
● Collaborated with multiple product teams to migrate legacy pipelines
○ Large cost and runtime savings
○ Higher testability
○ Shorter development time
● Performance improvement on all migrated projects (by Dec 2019)
○ Average total CPU hours: 74.6% reduction
○ Average total runtime: 83.9% reduction
32. Multi-region Regression
Region-based models to be trained and tuned.
Pipeline | Reliability | Avg Cost/Run | Runtime
Legacy Pipeline | ~80% | ~$630 | 7+ hours
Fugue Pipeline | 99.5% | ~$23 | 30 min
Improvement | - | 95+% reduction | 90+% reduction
33. Time-series Forecasting
Forecast business metrics for better budget planning and decision making
Horizon: weekly, monthly, quarterly
Pipeline | Reliability | Avg Cost/Run | Runtime
Legacy Pipeline | ~70% | ~$70 | 2+ hours
Fugue Pipeline | 99.5% | ~$5 | 10 min
Improvement | - | 90+% reduction | 90+% reduction
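The reduction figures in the two case studies (multi-region regression and time-series forecasting) can be checked from the approximate before and after values. The inputs below are the rounded numbers from the slides, with "7+ hours" and "2+ hours" taken as 7 and 2 hours, so the results are lower bounds rather than exact measurements.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


# Multi-region regression: ~$630 -> ~$23 per run, 7+ hours -> 30 minutes.
regression_cost = reduction_pct(630, 23)
regression_runtime = reduction_pct(7 * 60, 30)

# Time-series forecasting: ~$70 -> ~$5 per run, 2+ hours -> 10 minutes.
forecast_cost = reduction_pct(70, 5)
forecast_runtime = reduction_pct(2 * 60, 10)

for label, value in [
    ("regression cost", regression_cost),
    ("regression runtime", regression_runtime),
    ("forecast cost", forecast_cost),
    ("forecast runtime", forecast_runtime),
]:
    print(f"{label}: {value:.1f}% reduction")
```

All four values land above the "95+%" and "90+%" thresholds quoted in the tables.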
34. Summary
▪ Fugue unifies various computing frameworks with uniform interfaces.
▪ Fugue SQL is a novel language for workflows.
▪ K8S + Spark + Fugue is a great combination with high flexibility and efficiency for distributed computing.
▪ The Fugue project will build a unified ecosystem for integrating distributed systems and machine learning.