The Developer Data Scientist: Creating New Analytics-Driven Applications
Using Azure Databricks® and Apache Spark™
About Richard
Richard Garris
● Principal Solutions Architect
● 14+ years in data management and advanced analytics
● Advises customers on their data science and advanced analytics projects
● Degrees from The Ohio State University and Carnegie Mellon University
Agenda
- Introduction to Data Science
- Data Science Lifecycle
- Data Ingestion
- Data Understanding & Exploration
- Modeling
- Integrating Machine Learning in Your Application
- End-to-End Example Use Cases
Introduction to Data Science
AI is Changing the World
What is the secret to AI?
AlphaGo, self-driving cars, Alexa
AI is Changing the World
What do these companies have in common?
Alphabet, Tesla, Amazon
The Hardest Part of AI Isn’t AI, It’s Big Data
Components of a real-world ML system: ML Code, Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
Source: “Hidden Technical Debt in Machine Learning Systems,” Google, NIPS 2015
Business Value of Data Science
Present the Right Offer at the Right Time
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
Data Science Lifecycle
Agile Modeling Process
Set Business Goals
Understand Your Data
Create Hypothesis
Devise Experiment
Prepare Data
Train-Tune-Test Model
Deploy Model
Measure / Evaluate Results
Data Scientists or Data Janitors?
“3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data”
Data Understanding
Schema - understand the field names / data types
Metadata Management - understand descriptions and business meaning
Data Quality - data validation / profiling / checks
Exploration / Visualization - scatter plots, charts, correlations
Summary Statistics - average, min, max, range, median, standard deviation
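As a rough illustration of the summary statistics listed above, here is a minimal sketch using only the Python standard library. The column name and its values are made up for illustration; on real data these numbers would typically come from a DataFrame summary rather than a hand-written list.

```python
import statistics

# Hypothetical sample column; the values are made up for illustration.
ages = [23, 35, 31, 48, 52, 35, 29]

summary = {
    "average": statistics.mean(ages),
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```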
Data Understanding - Visualization
Data Understanding
Descriptive Statistics
Max, Min, Mean, Standard Deviation, Median, Skewness, Kurtosis
Relationship Statistics
Pearson’s correlation coefficient, Spearman’s rank correlation
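The two relationship statistics named above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library implementation: the data is made up, and the ranking helper assumes there are no tied values. Spearman's correlation is computed here as Pearson's correlation applied to the ranks.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

Because the relationship is monotonic but curved, the rank-based measure reports a perfect association while the linear measure falls slightly short of it.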
Data Understanding
Structured
• Key Value
• Tabular (Relational, Nested)
• Graph
• Geocoded / Location
• Time Series
Unstructured
• Text (logs, tweets, articles)
• Sound / Waveform
• Sensor
• Genomic / Scientific Data
Structured Data (relational, tabular, nested)
Text (logs, tweets, articles, social)
Graph Data (connections, social)
Start with the Raw Data
Geocoded / Location Data
Time Series Data
Stock Charts
Streaming / Events
Sound and Waveform Data
Sensor Data
Images and Video
Genomic & Scientific Data
What is a Data Science Platform?
Gartner defines a Data Science Platform as:
“an end-to-end platform for developing and deploying models”
Using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques
What is a Model?
A simplified and idealized representation of the real world
What does Modeling Mean?
A Class is a Model / Model of a Building / Data Model
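// Scala: a class is a model of a real-world employee
class Employee(
  val firstName: String,
  val lastName: String,
  val dob: java.util.Date,
  val grades: Seq[Grade] // Grade is assumed to be defined elsewhere
)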
Types of Models
Machine Learning Models
Statistical Models
Financial Models
Graph Models
Simulation Models
Predictive Models
Biological Models
Two Broad Categories of Models
● Supervised learning: prediction
  Classification (binary or multiclass): predict a category (label)
  Regression: predict a number (target)
● Unsupervised learning: discovery
  Clustering: find groupings based on patterns
  Density estimation: match data with a distribution pattern
  Dimensionality reduction: reduce the number of columns
  Similarity search: find similar data
  Frequent items (or association rules): find relationships among variables
Model Category Use Cases
● Anomaly detection
  Density estimation: “Is this observation uncommon?”
  Similarity search: “How far is it from other observations?”
  Clustering: “Are there groups of strange observations?”
● Lead scoring / recommendation
  Classification: “Will this user become a buyer?”
  Regression: “How much will they spend?”
  Similarity search: “What products did similar users buy?”
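To make the anomaly detection question "Is this observation uncommon?" concrete, here is a minimal sketch that scores a value against a simple density estimate. It assumes the history is roughly normally distributed and uses a threshold of three standard deviations; both the threshold and the sample data are assumptions for illustration, not part of the original deck.

```python
import statistics

def z_score(value, sample):
    """Number of sample standard deviations between value and the sample mean."""
    return (value - statistics.mean(sample)) / statistics.stdev(sample)

def is_uncommon(value, sample, threshold=3.0):
    """Flag a value whose absolute z-score exceeds the (assumed) threshold."""
    return abs(z_score(value, sample)) > threshold

# Made-up history of daily order counts.
history = [100, 102, 98, 101, 99, 103, 97]

for candidate in (101, 130):
    print(candidate, is_uncommon(candidate, history))
```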
A Model is a Mathematical Function
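One way to read this slide: once trained, a model such as logistic regression is just a function from a feature vector to a number. The sketch below shows that function for a binary logistic regression; the coefficients, intercept, and feature values are invented for illustration, and the 0.5 threshold is a common default rather than a requirement.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Logistic function applied to intercept + coefficients . features."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-margin))

def predict_label(features, coefficients, intercept, threshold=0.5):
    """Turn the probability into a 0/1 label using a threshold."""
    probability = predict_probability(features, coefficients, intercept)
    return 1 if probability > threshold else 0

# Hypothetical fitted parameters (not taken from any real model).
coefficients = [0.8, -0.4]
intercept = -0.5

print(predict_probability([2.0, 1.0], coefficients, intercept))
print(predict_label([2.0, 1.0], coefficients, intercept))
```

Training is the search for the coefficients and intercept; prediction is only the function evaluation shown here.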
Unsupervised Methods in MLlib
Clustering
● Gaussian mixture models
● K-Means
● Streaming K-Means
● Latent Dirichlet Allocation
● Power Iteration Clustering
Frequent itemsets
● FP-growth
● PrefixSpan
Recommendation
● Alternating Least Squares
Supervised Methods in MLlib
Classification
Logistic regression w/ elastic net
Naive Bayes
Streaming logistic regression
Linear SVMs
Decision trees
Random forests
Gradient-boosted trees
Multilayer perceptron
One-vs-rest
DeepImagePredictor
Regression
Least squares w/ elastic net
Isotonic regression
Decision trees
Random forests
Gradient-boosted trees
Streaming linear methods
But What is a Model Really?
A model is a complex pipeline of components
Data Sources
Joins
Featurization Logic
Algorithm(s)
Transformers
Estimators
Tuning Parameters
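The estimator/transformer idea behind these components can be sketched in plain Python. This is a toy, not the Spark ML API: the class names (`LabelIndexer`, `ToyPipeline`, and so on) are hypothetical, rows are plain dictionaries, and the indexer orders labels alphabetically for simplicity. An estimator learns something from data in `fit` and returns a transformer; a pipeline chains the stages.

```python
class LabelIndexerModel:
    """Toy transformer: applies a learned string -> index mapping."""
    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class LabelIndexer:
    """Toy estimator: learns the mapping from the distinct values of a column."""
    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: index for index, label in enumerate(labels)}
        return LabelIndexerModel(self.input_col, self.output_col, mapping)


class ToyPipelineModel:
    """Fitted pipeline: a list of transformers applied in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Fits each estimator in turn, feeding it the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return ToyPipelineModel(fitted)


# Made-up training rows.
training = [{"workclass": "Private"}, {"workclass": "State-gov"},
            {"workclass": "Private"}]
model = ToyPipeline([LabelIndexer("workclass", "workclassIdx")]).fit(training)
indexed = model.transform(training)
print([row["workclassIdx"] for row in indexed])
```

Keeping the fitted stages together is what lets the whole pipeline, not just the final algorithm, be saved and reused.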
ML Pipelines
Load data -> Extract features -> Train model -> Evaluate
A very simple pipeline
ML Pipelines
Datasource 1, Datasource 2, Datasource 3
Extract features (two stages)
Feature transform 1, Feature transform 2, Feature transform 3
Train model 1, Train model 2
Ensemble
Evaluate
A real pipeline!
Integrating Machine Learning in Your Application
Productionizing Models Today
Data Science: Develop prototype model using Python/R
Data Engineering: Re-implement model for production (Java)
Problems with Productionizing Models
Data Science: Develop prototype model using Python/R
Data Engineering: Re-implement model for production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
MLlib 2.x Model Serialization
Data Science: Develop prototype model using Python/R
Persist model or Pipeline:
model.save("s3n://...")
Data Engineering: Load Pipeline (Scala/Java)
Model.load("s3n://…")
Deploy in production
MLlib 2.x Model Serialization Snippet
Scala
val lrModel = lrPipeline.fit(dataset)
// Save the model
lrModel.write.save("/models/lr")
Python
lrModel = lrPipeline.fit(dataset)
# Save the model (in PySpark, write() is a method call)
lrModel.write().save("/models/lr")
Model Serialization Output
Code
// List the contents of the model dir
dbutils.fs.ls("/models/lr")
Output
Remember this is a pipeline model and these are the stages!
Transformer Stage (StringIndexer)
Code
// Cat the contents of the metadata dir
dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
// Display the Parquet file in the data dir
display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
Output
Metadata and params
{
  "class":"org.apache.spark.ml.feature.StringIndexerModel",
  "timestamp":1488120411719,
  "sparkVersion":"2.1.0",
  "uid":"strIdx_bb9728f85745",
  "paramMap":{
    "outputCol":"workclassIdx",
    "inputCol":"workclass",
    "handleInvalid":"error"
  }
}
Data (Hashmap)
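Because the stage metadata is plain JSON, it can be inspected with any JSON parser. The sketch below parses the metadata shown above with Python's standard `json` module and pulls out the stage class and its column parameters. The JSON text is copied from the slide; the variable names are illustrative.

```python
import json

# Metadata for the StringIndexer stage, as shown on the slide.
metadata_text = """
{
  "class": "org.apache.spark.ml.feature.StringIndexerModel",
  "timestamp": 1488120411719,
  "sparkVersion": "2.1.0",
  "uid": "strIdx_bb9728f85745",
  "paramMap": {
    "outputCol": "workclassIdx",
    "inputCol": "workclass",
    "handleInvalid": "error"
  }
}
"""

metadata = json.loads(metadata_text)
params = metadata["paramMap"]

# Short class name of the stage, without the package prefix.
stage_class = metadata["class"].rsplit(".", 1)[-1]

print(stage_class)
print(f'{params["inputCol"]} -> {params["outputCol"]}')
print("Saved with Spark", metadata["sparkVersion"])
```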
Estimator Stage (LogisticRegression)
Code
// Cat the contents of the metadata dir
dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
// Display the Parquet file in the data dir
display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
Output
Model params
{
  "class":"org.apache.spark.ml.classification.LogisticRegressionModel",
  "timestamp":1488120446324,
  "sparkVersion":"2.1.0",
  "uid":"logreg_325fa760f925",
  "paramMap":{
    "predictionCol":"prediction",
    "standardization":true,
    "probabilityCol":"probability",
    "maxIter":100,
    "elasticNetParam":0.0,
    "family":"auto",
    "regParam":0.0,
    "threshold":0.5,
    "fitIntercept":true,
    "labelCol":"label"
  }
}
Intercept + Coefficients
Estimator Stage (DecisionTree)
Code
// Display the Parquet file in the data dir
display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
Output
Decision Tree Splits
Visualize Stage (DecisionTree)
Visualization of the Tree in Databricks
Databricks + ML Pipelines: Ideal Modeling Tool
Data Science - highly iterative, agile
● Lots of data sources
● Lots of dirty data
● Lots and lots of data
ML Pipelines and notebooks are an ideal way to experiment with new methods, data, and features in order to minimize error
Databricks Runtime
Elastic, Fully Managed, Highly Tuned Engine
FULLY MANAGED CLOUD SERVICE
• Auto-configured multi-user elastic clusters
• Reliable sharing with fault isolation and workload preemption
PERFORMANCE OPTIMIZATIONS
• Increases performance by 5X (TPC Benchmark)
• Connector optimizations for cloud (Kafka, S3 and Kinesis)
COST OPTIMIZED / LINEAR SCALING
• 2x nodes - time cut in half
• 2x data, 2x nodes - time constant
• Cost of 10 nodes for 10 hours equal to 100 nodes for 1 hour
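The linear scaling claims above come down to simple node-hour arithmetic. The sketch below spells it out under two idealized assumptions that are not guaranteed in practice: runtime scales perfectly linearly with the number of nodes, and cost is billed per node-hour. The function names are illustrative.

```python
def node_hours(nodes, hours):
    """Billed capacity, assuming cost is proportional to nodes x hours."""
    return nodes * hours

def runtime_hours(work_in_node_hours, nodes):
    """Wall-clock time, assuming perfectly linear scaling across nodes."""
    return work_in_node_hours / nodes

# Cost of 10 nodes for 10 hours equals 100 nodes for 1 hour.
print(node_hours(10, 10), node_hours(100, 1))

# 2x nodes: time cut in half.
print(runtime_hours(100, 10), runtime_hours(100, 20))

# 2x data (2x work) on 2x nodes: time stays constant.
print(runtime_hours(100, 10), runtime_hours(200, 20))
```

Under these assumptions the same job costs the same whether it runs wide and short or narrow and long, so the larger cluster simply returns results sooner.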
DATABRICKS UNIFIED RUNTIME
Databricks I/O, Databricks Serverless
Databricks Collaborative Workspace
Frictionless Collaboration Enabling Faster Innovation
Secure collaboration for fast feedback loops with single-click access to clusters
DATA ENGINEER: Production Jobs
FAST, RELIABLE AND SECURE JOBS
• Executes jobs 30-50% faster
• Notebooks to production jobs with one click
• Debug faster with logs and Spark history UI
DATA SCIENTIST: Interactive Notebooks
ANALYZE DATA WITH NOTEBOOKS
• Multi-language: SQL, R, Scala, Python
• Advanced analytics (Graph, ML & DL)
• Built-in visualization, including D3 & ggplot
BUSINESS SME: Dashboards
BUILD DASHBOARDS
• Publish insights
• Real-time updates
• Interactive reports
Databricks’ Approach to Accelerate Innovation
INCREASE PERFORMANCE
By more than 5x and reduce TCO by more than 70%
INCREASE PRODUCTIVITY
Of data science teams by 4-5x
STREAMLINE ANALYTIC WORKFLOWS
Reducing deployment time to minutes
REDUCE RISK
And enable innovation with out-of-the-box enterprise security and compliance
UNIFY ANALYTICS WITH APACHE SPARK
Eliminating disparate tools
[Architecture diagram]
Personas: Data Scientist / Analyst, Business SME, Data Engineer
Cluster operations: optimize big data clusters, setup, break-fix
Data sources: data warehouses, cloud storage, Hadoop storage, IoT / streaming data
Workflow: Ingest, Explore, Model, Data Product, Dashboard
Tools: ML libraries, streaming, statistics packages, ETL, SQL
Unified Engine: SQL, Streaming, MLlib, Graph
Databricks optimizations and managed cloud service
Platform layers: Databricks Enterprise Security; Databricks Collaborative Workspace (Databricks Production, Databricks Interactive); Databricks Runtime (Databricks I/O, Databricks Serverless, Unified Engine); Open APIs
End-to-end Examples
Demonstrations
• Predicting Power Output for an Energy Company
• Scoring Inbound Leads
• Predicting Ratings Given Reviews from Amazon
Demonstration
Databricks Community Edition or Free Trial
<Link to Azure>
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks

Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Azure Databricks for Data Scientists

  • 1. The Developer Data Scientist Creating New Analytics Driven Applications Using Azure Databricks® and Apache Spark™
  • 2. About Richard: Richard Garris ● Principal Solutions Architect ● 14+ years in data management and advanced analytics ● Advises customers on their data science and advanced analytics projects ● Degrees from The Ohio State University and Carnegie Mellon University
  • 3. Agenda - Introduction to Data Science - Data Science Lifecycle - Data Ingestion - Data Understanding & Exploration - Modeling - Integrating Machine Learning in Your Application - End-to-End Example Use Cases
  • 5. AI is Changing the World. What is the secret to AI? Examples: AlphaGo, self-driving cars, Alexa
  • 6. AI is Changing the World. What do these companies have in common? Alphabet, Tesla, Amazon
  • 7. Hardest Part of AI isn’t AI, it’s Big Data. Diagram components: ML Code, Configuration, Data Collection, Data Verification, Feature Extraction, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring. Source: “Hidden Technical Debt in Machine Learning Systems,” Google, NIPS 2015. Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
  • 8. Business Value of Data Science Present the Right Offer at the Right Time •Businesses have to Adapt Faster to Change •Data driven decisions need to be made quickly and accurately •Customers expect faster responses
  • 10. Agile Modeling Process Set Business Goals Understand Your Data Create Hypothesis Devise Experiment Prepare Data Train-Tune-Test Model Deploy Model Measure / Evaluate Results
  • 11. Data Scientists or Data Janitors? “3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data”
  • 12. Data Understanding Schema - understand the field names / data types Metadata Management - understand descriptions and business meaning Data Quality - data validation / profiling / checks Exploration / Visualization - scatter plots, charts, correlations Summary Statistics - average, min, max, range, median, standard deviation
  • 13. Data Understanding - Visualization
  • 14. Data Understanding. Descriptive Statistics: Max, Min, Mean, Standard Deviation, Median, Skewness, Kurtosis. Relationship Statistics: Pearson’s Coefficient, Spearman correlation
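The descriptive and relationship statistics listed on this slide can be sketched in plain Python. This is a minimal illustration using only the standard library, not the Spark or Databricks API; the `hours` and `spend` samples and the helper names `pearson`, `ranks` and `spearman` are hypothetical.

```python
import math
from statistics import mean, median, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical sample: a monotonic but non-linear relationship
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
spend = [1.0, 8.0, 27.0, 64.0, 125.0]

summary = {
    "min": min(hours),
    "max": max(hours),
    "mean": mean(hours),
    "median": median(hours),
    "stdev": stdev(hours),  # sample standard deviation
}
print(summary)
print("pearson:", pearson(hours, spend))
print("spearman:", spearman(hours, spend))
```

Because `spend` grows monotonically with `hours`, the rank-based Spearman value is 1 while the Pearson value is lower, which is why both are worth checking during exploration.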
  • 15. Data Understanding Structured • Key Value • Tabular (Relational, Nested) • Graph • Geocoded / Location • Time Series Unstructured • Text (logs, tweets, articles) • Sound / Waveform • Sensor • Genomic / Scientific Data
  • 16. Structured Data (relational, tabular, nested)
  • 17. Text (logs, tweets, articles, social)
  • 19. Start with the Raw Data Geocoded / Location Data
  • 20. Time Series Data Stock Charts Streaming / Events
  • 25. What is a Data Science Platform? Gartner defines a Data Science Platform as “an end-to-end platform for developing and deploying models” using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques
  • 26. What is a Model A simplified and idealized representation of the real-world
  • 27. What does Modeling Mean? A Class is a Model ● Model of a Building ● Data Model: class Employee { FirstName : String; LastName : String; DOB : java.util.Date; Grades : Seq[Grade] }
  • 28. Types of Models Machine Learning Models Statistical Models Financial Models Graph Models Simulation Models Predictive Models Biological Models
  • 29. Two Broad Categories of Models ● Supervised learning: prediction. Classification (binary or multiclass): predict a category (label). Regression: predict a number (target) ● Unsupervised learning: discovery. Clustering: find groupings based on pattern. Density estimation: match data with distribution pattern. Dimensionality reduction: reduce # of columns. Similarity search: find similar data. Frequent items (or association rules): find relationships between variables
  • 30. Model Category Use Cases ●Anomaly detection Density estimation: “Is this observation uncommon?” Similarity search: “How far is it from other observations?” Clustering: “Are there groups of strange observations?” ●Lead scoring / recommendation Classification: “Will this user become a buyer?” Regression: “How much will he/she spend?” Similarity search: “What products did similar users buy?”
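The similarity-search question on this slide, “How far is it from other observations?”, can be sketched as a nearest-neighbor distance score. This is a toy standard-library illustration rather than an MLlib method; the `observations` data and the function name are hypothetical.

```python
import math


def nearest_neighbor_distance(point, others):
    """Euclidean distance from point to its closest other observation."""
    return min(math.dist(point, other) for other in others)


# Hypothetical 2-D observations: three close together, one far away
observations = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0)]

# Score each observation by how far it is from its nearest neighbor
scores = [
    nearest_neighbor_distance(p, observations[:i] + observations[i + 1:])
    for i, p in enumerate(observations)
]

# The observation with the largest score is the most isolated one
most_unusual = observations[scores.index(max(scores))]
print(most_unusual)
```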
  • 31. A Model is a Mathematical Function
  • 32. Unsupervised Methods in MLlib. Clustering ● Gaussian mixture models ● K-Means ● Streaming K-Means ● Latent Dirichlet Allocation ● Power Iteration Clustering. Frequent itemsets ● FP-growth ● PrefixSpan. Recommendation ● Alternating Least Squares
  • 33. Supervised Methods in MLlib. Classification ● Logistic regression w/ elastic net ● Naive Bayes ● Streaming logistic regression ● Linear SVMs ● Decision trees ● Random forests ● Gradient-boosted trees ● Multilayer perceptron ● One-vs-rest ● DeepImagePredictor. Regression ● Least squares w/ elastic net ● Isotonic regression ● Decision trees ● Random forests ● Gradient-boosted trees ● Streaming linear methods
  • 34. But What is a Model Really? A model is a complex pipeline of components: Data Sources ● Joins ● Featurization Logic ● Algorithm(s) ● Transformers ● Estimators ● Tuning Parameters
  • 35. ML Pipelines: Load data → Extract features → Train model → Evaluate. A very simple pipeline
  • 36. ML Pipelines: Datasource 1, Datasource 2, Datasource 3 ● Extract features (twice) ● Feature transform 1, Feature transform 2, Feature transform 3 ● Train model 1, Train model 2 ● Ensemble ● Evaluate. A real pipeline!
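The idea that a model is a pipeline of transformers and estimators can be sketched in plain Python. The classes below are toy stand-ins written for this illustration and are not the Spark ML classes of the same names; the `workclass` column echoes the StringIndexer slide later in the deck, while the `hours` column, the sample rows and `MeanThresholdClassifier` are hypothetical.

```python
class StringIndexer:
    """Toy estimator: learns a label -> index mapping, most frequent first."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        counts = {}
        for row in rows:
            label = row[self.input_col]
            counts[label] = counts.get(label, 0) + 1
        # Descending frequency, then alphabetical so ties are deterministic
        labels = sorted(counts, key=lambda label: (-counts[label], label))
        return StringIndexerModel(self.input_col, self.output_col, labels)


class StringIndexerModel:
    """Fitted transformer: adds the learned index as a new column."""

    def __init__(self, input_col, output_col, labels):
        self.input_col, self.output_col = input_col, output_col
        self.index = {label: i for i, label in enumerate(labels)}

    def transform(self, rows):
        # An unseen label raises KeyError in this toy version
        return [{**r, self.output_col: self.index[r[self.input_col]]} for r in rows]


class MeanThresholdClassifier:
    """Toy estimator: predicts 1 when the feature exceeds the training mean."""

    def __init__(self, feature_col):
        self.feature_col = feature_col

    def fit(self, rows):
        values = [r[self.feature_col] for r in rows]
        return MeanThresholdModel(self.feature_col, sum(values) / len(values))


class MeanThresholdModel:
    def __init__(self, feature_col, threshold):
        self.feature_col, self.threshold = feature_col, threshold

    def transform(self, rows):
        return [{**r, "prediction": int(r[self.feature_col] > self.threshold)} for r in rows]


class Pipeline:
    """Fits each stage in order, feeding it the output of the previous stage."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = [
    {"workclass": "Private", "hours": 40},
    {"workclass": "Private", "hours": 50},
    {"workclass": "State-gov", "hours": 20},
    {"workclass": "Self-emp", "hours": 60},
]
pipeline = Pipeline([StringIndexer("workclass", "workclassIdx"), MeanThresholdClassifier("hours")])
model = pipeline.fit(training)
scored = model.transform(training)
print(scored)
```

The fitted `PipelineModel` carries every learned stage, so the same featurization runs at training time and at scoring time.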
  • 37. Integrating Machine Learning in Your Application
  • 38. Productionizing Models Today Data Science Data Engineering Develop Prototype Model using Python/R Re-implement model for production (Java)
  • 39. Problems with Productionizing Models Develop Prototype Model using Python/R Re-implement model for production (Java) - Extra work - Different code paths - Data science does not translate to production - Slow to update models Data Science Data Engineering
  • 40. MLLib 2.X Model Serialization. Data Science: Develop Prototype Model using Python/R → Persist model or Pipeline: model.save("s3n://...") → Data Engineering: Load Pipeline (Scala/Java): Model.load("s3n://…") → Deploy in production
  • 41. MLLib 2.X Model Serialization Snippet
      Scala:
        val lrModel = lrPipeline.fit(dataset)
        // Save the Model
        lrModel.write.save("/models/lr")
      Python:
        lrModel = lrPipeline.fit(dataset)
        # Save the Model (in PySpark, write() is a method call)
        lrModel.write().save("/models/lr")
  • 42. Model Serialization Output
      Code:
        // List Contents of the Model Dir
        dbutils.fs.ls("/models/lr")
      Output: Remember this is a pipeline model and these are the stages!
  • 43. Transformer Stage (StringIndexer)
      Code:
        // Cat the contents of the Metadata dir
        dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
        // Display the Parquet File in the Data dir
        display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
      Output (metadata and params):
        { "class":"org.apache.spark.ml.feature.StringIndexerModel", "timestamp":1488120411719, "sparkVersion":"2.1.0", "uid":"strIdx_bb9728f85745", "paramMap":{ "outputCol":"workclassIdx", "inputCol":"workclass", "handleInvalid":"error" } }
      Output (data): Hashmap
  • 44. Estimator Stage (LogisticRegression)
      Code:
        // Cat the contents of the Metadata dir
        dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
        // Display the Parquet File in the Data dir
        display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
      Output (model params):
        { "class":"org.apache.spark.ml.classification.LogisticRegressionModel", "timestamp":1488120446324, "sparkVersion":"2.1.0", "uid":"logreg_325fa760f925", "paramMap":{ "predictionCol":"prediction", "standardization":true, "probabilityCol":"probability", "maxIter":100, "elasticNetParam":0.0, "family":"auto", "regParam":0.0, "threshold":0.5, "fitIntercept":true, "labelCol":"label" } }
      Output (data): Intercept + Coefficients
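A saved logistic regression stage boils down to a few parameters plus an intercept and coefficients, and scoring is a small mathematical function of those values. The sketch below shows that function in plain Python. The `threshold` of 0.5 comes from the metadata on the slide; the intercept, the coefficients, the feature vectors and the strict `>` comparison are assumptions made for illustration, not values or behavior read from the saved model.

```python
import json
import math

# Subset of the paramMap shown on the slide
metadata = json.loads('{"paramMap": {"threshold": 0.5, "fitIntercept": true}}')
threshold = metadata["paramMap"]["threshold"]

# Hypothetical learned values standing in for the contents of the data dir
intercept = -1.0
coefficients = [0.8, -0.4]


def predict_probability(features):
    """Logistic function applied to intercept + coefficients . features."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-margin))


def predict(features):
    """Label 1 when the probability exceeds the saved threshold (assumed rule)."""
    return 1 if predict_probability(features) > threshold else 0


print(predict_probability([2.0, 1.0]), predict([2.0, 1.0]))
print(predict_probability([0.0, 0.0]), predict([0.0, 0.0]))
```

Nothing here depends on the language that trained the model, which is the point of persisting the pipeline instead of re-implementing it.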
  • 45. Estimator Stage (DecisionTree)
      Code:
        // Display the Parquet File in the Data dir
        display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
        // Re-save as JSON
        sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
      Output: Decision Tree Splits
  • 46. Visualize Stage (DecisionTree) Visualization of the Tree In Databricks
  • 47. Databricks + ML Pipelines: Ideal Modeling Tool Data Science - highly iterative, agile ● Lots of data sources ● Lots of dirty data ● Lots and lots of data ML Pipelines and notebooks are ideal way to experiment with new methods, data, features in order to minimize error
  • 48. Databricks Runtime: Elastic, Fully Managed, Highly Tuned Engine. FULLY MANAGED CLOUD SERVICE • Auto-configured multi-user elastic clusters • Reliable sharing with fault isolation and workload preemption. PERFORMANCE OPTIMIZATIONS • Increases performance by 5X (TPC Benchmark) • Connector optimizations for Cloud (Kafka, S3 and Kinesis). COST OPTIMIZED / LINEAR SCALING • 2x nodes - time cut in half • 2x data, 2x nodes - time constant • Cost of 10 nodes for 10 hours equal to 100 nodes for 1 hour. DATABRICKS UNIFIED RUNTIME: Databricks I/O, Databricks Serverless
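The linear-scaling bullets on this slide are simple node-hour arithmetic. The sketch below works through them under the ideal linear-scaling assumption the slide states; the function names, the baseline job and the flat `price_per_node_hour` are hypothetical.

```python
def runtime_hours(base_hours, base_nodes, nodes, data_scale=1.0):
    """Runtime under ideal linear scaling: grows with data, shrinks with nodes."""
    return base_hours * data_scale * base_nodes / nodes


def cost(nodes, hours, price_per_node_hour=1.0):
    """Cost is proportional to node-hours at a flat (assumed) price."""
    return nodes * hours * price_per_node_hour


# Hypothetical baseline: a job that takes 10 hours on 10 nodes
print(runtime_hours(10, 10, 20))                  # 2x nodes
print(runtime_hours(10, 10, 20, data_scale=2.0))  # 2x data, 2x nodes
print(cost(10, 10), cost(100, 1))                 # same node-hours either way
```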
  • 49. Databricks Collaborative Workspace: Frictionless Collaboration Enabling Faster Innovation. Secure collaboration for fast feedback loops with single-click access to clusters. DATA ENGINEER (Production Jobs): FAST, RELIABLE AND SECURE JOBS • Executes jobs 30-50% faster • Notebooks to Production Jobs with one click • Debug faster with logs and Spark history UI. DATA SCIENTIST (Interactive Notebooks): ANALYZE DATA WITH NOTEBOOKS • Multi-language: SQL, R, Scala, Python • Advanced Analytics (Graph, ML & DL) • Built-in visualization, including D3 & ggplot. BUSINESS SME (Dashboards): BUILD DASHBOARDS • Publish Insights • Real-time updates • Interactive reports
  • 50. Databricks’ Approach to Accelerate Innovation. INCREASE PERFORMANCE by more than 5x and reduce TCO by more than 70% • INCREASE PRODUCTIVITY of data science teams by 4-5x • STREAMLINE ANALYTIC WORKFLOWS, reducing deployment time to minutes • REDUCE RISK and enable innovation with out-of-the-box enterprise security and compliance • UNIFY ANALYTICS WITH APACHE SPARK, eliminating disparate tools. [Diagram labels: data sources (data warehouses, cloud storage, Hadoop storage, IoT / streaming data); workflow stages (ingest, explore, model, data product, dashboard); roles (data engineer, data scientist / analyst, business SME); disparate tools (ETL, SQL, streaming, ML libraries, statistics packages); big data clusters (setup, optimize, break-fix); Unified Engine (SQL, Streaming, MLlib, Graph) with Open APIs; Databricks Optimizations and Managed Cloud Service; DATABRICKS RUNTIME (Databricks I/O, Databricks Serverless); DATABRICKS COLLABORATIVE WORKSPACE (Databricks Interactive, Databricks Production); DATABRICKS ENTERPRISE SECURITY]
  • 52. Demonstrations • Predicting Power Output for an Energy Company • Scoring Inbound Leads • Predicting Ratings given Reviews from Amazon
  • 53. Demonstration: Databricks Community Edition or Free Trial <Link to Azure>. Additional Questions? Contact us at http://go.databricks.com/contact-databricks