9. Apache Spark Today: Python
68% of notebook commands on Databricks are in Python
10. Apache Spark Today: SQL
Exabytes queried per day in SQL on Databricks alone
>90% of Spark API calls run via Spark SQL
TPC-DS benchmark record set using Spark SQL
11. Apache Spark Today: Streaming
>5 trillion records/day processed on Databricks with Structured Streaming
13. Adaptive Query Execution (AQE)
3.0: SQL Performance Enhancement
Changes the execution plan at runtime, automatically setting the number of reducers and the join algorithm
Accelerates TPC-DS queries by up to 8x
[Chart: TPC-DS 1TB (no stats) query duration in seconds, with vs. without Adaptive Query Execution]
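AQE is driven by session configuration. A minimal sketch, assuming a fresh PySpark session (the `spark.sql.adaptive.*` keys are Spark 3.0 settings; the session setup itself is illustrative):

```python
from pyspark.sql import SparkSession

# Illustrative session; the spark.sql.adaptive.* keys enable the
# runtime re-planning described on this slide.
spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")                     # turn AQE on
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # auto-set number of reducers
         .config("spark.sql.adaptive.skewJoin.enabled", "true")            # adjust joins for skew
         .getOrCreate())
```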
14. Dynamic Partition Pruning (DPP)
3.0: SQL Performance Enhancement
Efficiently broadcast partition information to speed up star-schema join performance
Speeds up 60 of 102 TPC-DS queries by 2-18x
[Chart: TPC-DS 1TB query duration in seconds, with vs. without Dynamic Partition Pruning]
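DPP is on by default in Spark 3.0 and is controlled by a single setting. A sketch, assuming an existing SparkSession named `spark` (the config key is real; the commented join uses hypothetical table and column names):

```python
# DPP is enabled by default in Spark 3.0; this key controls it explicitly.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Star-schema sketch where DPP applies: the filter on the dimension table
# lets Spark skip non-matching partitions of the partitioned fact table.
# (fact, dim, region, and region_key are hypothetical names.)
# fact.join(dim.filter("region = 'EU'"), "region_key")
```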
15. 3.0: SQL Compatibility
ANSI SQL: run unmodified queries from major SQL engines (language dialect and broader support)
▪ ANSI reserved keywords
▪ ANSI Gregorian calendar
▪ ANSI store assignment
▪ ANSI overflow checking
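The ANSI behaviors above map to session settings; a minimal sketch, assuming an existing SparkSession named `spark` (both config keys are Spark 3.0 settings):

```python
# ANSI dialect: reserved keywords, overflow checking, ANSI calendar semantics.
spark.conf.set("spark.sql.ansi.enabled", "true")

# ANSI store-assignment rules for writes/casts (values: ANSI, LEGACY, STRICT).
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
```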
16. Python type hints for Pandas UDFs
3.0: Python & R Performance
[Code screenshot: the old Pandas UDF API, shown for comparison with the new type-hinted style]
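A minimal sketch of the new type-hinted style: the function body is ordinary pandas code, and in Spark 3.0 `pandas_udf` infers the UDF kind from the `pd.Series` hints (the function and column types here are illustrative):

```python
import pandas as pd

# Plain pandas logic; the (pd.Series, pd.Series) -> pd.Series hints are what
# Spark 3.0 uses to recognize this as a scalar Pandas UDF.
def multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# With PySpark available, register it as a vectorized UDF:
# from pyspark.sql.functions import pandas_udf
# multiply_udf = pandas_udf(multiply, returnType="double")
```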
17. 3.0: Python & R Performance
▪ Faster Apache Arrow-based calls to Python user code
▪ Vectorized SparkR calls
▪ New Pandas function APIs
[Charts: SparkR API performance and Python Pandas UDF performance, time in seconds]
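One of the new Pandas function APIs, `DataFrame.mapInPandas`, applies a function over an iterator of pandas DataFrames. A sketch (the `value` column and the filter are hypothetical):

```python
import pandas as pd

def keep_positive(batches):
    # mapInPandas passes an iterator of pandas DataFrames and expects
    # an iterator of pandas DataFrames back.
    for pdf in batches:
        yield pdf[pdf["value"] > 0]

# With a Spark DataFrame df that has a "value" column:
# df.mapInPandas(keep_positive, schema=df.schema)
```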
20. Other Apache Spark Ecosystem Projects
▪ Pandas API over Spark
▪ Large-scale genomics
▪ GPU-accelerated data science
▪ Reliable table storage
▪ Scale-out on Spark
▪ Visualization
21. What is Koalas?
Implementation of the Pandas API over Spark
▪ Easily port existing data science code
Launched at Spark+AI Summit 2019; now up to 850,000 downloads per month (1/5th of PySpark!)

import databricks.koalas as ks

df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
22. Announcing Koalas 1.0!
▪ Close to 80% API coverage
▪ Faster performance with Spark 3.0 APIs
▪ More support for missing values, in-place updates
▪ Faster distributed index type
pip install koalas to get started!
[Charts: benchmark times in seconds, 20.17% and 26.39% faster; Koalas API coverage at 77%, 69%, and 65%]
25. Data Warehouses
were purpose-built for BI and reporting, however...
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
Therefore, most data is stored in data lakes & blob stores
[Diagram: external and operational data flow through ETL into data warehouses, which feed BI reports]
26. Data Lakes
could handle all your data for data science and ML, however...
▪ Poor BI support
▪ Complex to set up
▪ Poor performance
▪ Difficult to quality control
▪ Unreliable data swamps
[Diagram: structured, semi-structured, and unstructured data land in the data lake; data prep, validation, and ETL feed data warehouses for BI and reports, while the lake also serves data science, ML, and a real-time database]
27. Lakehouse
Data Warehouse + Data Lake
[Diagram: the data warehouse and data lake architectures from the previous two slides, shown side by side]
28. Lakehouse
Data Warehouse + Data Lake
[Diagram: a single lakehouse over structured, semi-structured, and unstructured data serves streaming analytics, BI, data science, and machine learning, replacing the separate warehouse and lake stacks shown on the previous slides]
34. Photon
New execution engine for Delta Engine to accelerate Spark SQL
Native execution engine purpose-built for performance, written from scratch in C++:
▪ Vectorization: data-level and instruction-level parallelism
▪ Optimized for modern workloads, not just benchmarks: faster string processing, regex
38. Faster String Processing
[Charts: MB processed per core per second for the UPPER() and SUBSTRING() functions (higher is better)]
39. Faster String Processing - Regex
[Chart: millions of rows processed per core per second for LIKE "%a_c%" (higher is better)]
42. Redash helps you make sense of your data
Powerful SQL editor:
▪ Browse schema and click-to-insert
▪ Create reusable snippets
▪ Schedule updates and set up alerts
43. Redash helps you make sense of your data
Visualize and share:
▪ Build a wide variety of visualizations and gather them into thematic dashboards
▪ Drag & drop and resize any visualization
▪ Share dashboards with your team or with the public
44. Redash in Action: a SQL query against the data to pull out what we need
45. Redash in Action: easily turn SQL into a visualization to make the data easier to understand
46. Redash in Action: build a dashboard the business can use to understand what's going on
47. Redash helps you make sense of your data
Database integrations: query all of your SQL, NoSQL, big data, and API data sources
49. Autologging (updated in MLflow 1.8)
One line to record params, metrics, and models in popular ML libraries:

mlflow.keras.autolog()

Including the specific data versions read when using Delta Lake
50. Model Schemas (new in MLflow 1.9)
Specify input and output data types for models, then check compatibility and validate new model versions; incompatible schemas are flagged.

Input schema: zipcode: string, sqft: double, distance: double
Output schema: price: double

log_model(…)
51. Model Serving on Databricks (new, in preview)
Turnkey serving for MLflow models
[Diagram: Tracking (experiment tracking) logs a model; the Model Registry (model management) moves versions through Staging / Production / Archived; Model Serving exposes a REST endpoint. Data scientists log and manage models, while application engineers consume the endpoint from reports, applications, and more.]
53. Deployments API (coming soon)
Pluggable way to create and manage deployment endpoints in MLflow
Deployment backends: used in 2 new endpoints; other integrations being ported

mlflow deployments create -t gcp -n spam \
  -m models:/SpamScorer/production

mlflow deployments predict -t gcp -n spam \
  -f emails.json
56. CI/CD-based Workflow from Experimentation to Production (in preview)
Development/experimentation → version → review → test → production jobs, through Git and CI/CD systems
58. Native Support for Standard Notebook Formats (coming soon)
Seamless transition to and from Jupyter Notebooks
Before: Databricks notebooks required conversion between the ipynb and Databricks formats