- 1. © 2019 Snowflake Inc. All Rights Reserved
DOES IT ONLY HAVE TO BE ML & AI?
MACHINE LEARNING INSIDE + OUTSIDE OF
SNOWFLAKE’S CLOUD DATA WAREHOUSE
HARALD ERB | harald.erb@snowflake.com
Nürnberg, November 19, 2019
- 2. © 2019 Snowflake Computing Inc. All Rights Reserved
AGENDA
> What is Snowflake? (briefly explained)
> Talk Intro
> Recap: What can a Database do for you?
> What about Implementation of In-Database ML Models?
> Or better use Auto-ML outside your Database?
> End-to-end ML Projects at Scale
- 4. © 2019 Snowflake Computing Inc. All Rights Reserved
SNOWFLAKE: A TEAM OF DATA EXPERTS
- 5. © 2019 Snowflake Computing Inc. All Rights Reserved
SNOWFLAKE TIMELINE
Founded in 2012 by industry veterans with over 120 database patents. ~$1BN in venture capital funding from leading investors; ~$4.5BN valuation. First customers in 2014, general availability in 2015. 1,600+ employees; over 3,000 customers today. Gartner and Forrester "Leader".

FUN FACTS
- Queries processed in Snowflake per day: 290+ million
- Largest single table: 68 trillion rows
- Largest number of tables in a single DB: 200,000
- Single customer, most data: 55 PB
- Single customer, most users: 10,000
- 6. © 2019 Snowflake Computing Inc. All Rights Reserved
A NEW ARCHITECTURE FOR DATA WAREHOUSING
Multi-Cluster, Shared Data, in the Cloud
Traditional Architectures vs. Snowflake:
- Shared-Disk (SMP): a cluster of nodes with a single shared disk. Throughput is constrained by CPU, memory, or disk access. (DWs based on traditional RDBMS)
- Shared-Nothing (MPP): a cluster of nodes, each with its own disk; data is distributed across the nodes. Not elastic, because data must be redistributed when the cluster is resized. (Most MPP DWs, Hadoop)
- Multi-Cluster, Shared Data (Snowflake): multiple clusters share the data; compute power and storage scale independently of each other
- 7. © 2019 Snowflake Computing Inc. All Rights Reserved
Snowflake’s built-for-the-cloud architecture
provides many benefits. The key is separation
of storage, compute and metadata services.
● Unlimited storage scalability without
refactoring
● Multiple compute clusters can read/write
shared data
● Resize clusters instantly - no downtime
● Centrally manage logical assets
(warehouse, database, etc) - not
technical assets (servers, buckets, etc)
● Full transactional consistency (ACID)
across entire system
THE SNOWFLAKE ELASTIC DATA WAREHOUSE
(Services layer: Management, Optimisation, Security, Availability, Transactions, Metadata)
- 8. © 2019 Snowflake Computing Inc. All Rights Reserved
THE SNOWFLAKE DIFFERENCE
(Snowflake vs. Traditional DW in the Cloud vs. Query Service)
- ADMINISTRATION: Snowflake is a data warehouse as a service; in a traditional DW the customer manages tuning, optimization, and manual administration; a query service is a complete black box
- CONCURRENCY: Snowflake offers unlimited, automatic concurrency scaling; a traditional DW has limited concurrency; a query service has poor concurrency scaling
- FLEXIBILITY: Snowflake has native, optimized support for diverse data; a traditional DW requires data transformation; a query service has native support but limited optimization for diverse data
- SCALING: Snowflake scales on the fly, in seconds to minutes; a traditional DW scales manually, disruptively, and slowly; a query service scales on the fly
- 9. © 2019 Snowflake Computing Inc. All Rights Reserved
YOU MIGHT HAVE HEARD SOMETHING SIMILAR
What you hear / What it actually means / Difference with Snowflake:
- "We support SQL" / Some SQL… some BQL… / ANSI-standard SQL, full DML
- "We scale elastically" / Stop, resize, repartition, restart / Instant, automatic resize
- "We support semi-structured" / Limited JSON / Avro, Parquet, XML, JSON
- "We're fast" / If you sort, index, and tune / Built fast with no tuning
- "We are great for concurrency" / If it's only 4-5 concurrent queries / Scale linearly for concurrency
- "We're global" / If you build replication and failover / Global HA, out of the box
- 11. © 2019 Snowflake Computing Inc. All Rights Reserved
ML & AI – FROM HYPE TO PRODUCTIVITY
> Enterprises continue to establish a
data-driven culture
• Predictive analytics matures; "What is likely to happen?" questions allow organizations to become proactive rather than rely only on human experience and intuition
• Machine Learning is producing some
quantifiable results today; integration with
operations (procedures and processes) is
still a challenge
• Automation: data preparation, insight discovery, data science, ML model development, and complex decisioning → still a future topic
> “Citizen Data Scientists”
• To fill the data scientist talent gap
• Modern analytics tools to guide business
users through the process and help to
extract advanced analytic insights from data
• “Real” data scientists to focus on more
difficult analytics work and insight.
Source: Gartner Hype Cycle for Midsize Enterprises 2019 → Link
- 12. © 2019 Snowflake Computing Inc. All Rights Reserved
FOCUSING ON THE RIGHT THINGS
> SQL vs. Machine Learning vs. Machine Learning
Applied to SQL
• SQL for BI Level Analysis: Business questions and many
prediction problems can be solved by well-crafted SQL –
and it offers explainability that deep ML generally does not
• Still true: "Garbage in, garbage out". Nothing of substance can come from BI or ML without good data → data collection + engineering is a sophisticated discipline and consumes a lot of time, but these activities are crucial for making information available reliably at scale
• Machine Learning: helpful to spot complex patterns,
maybe less important for predictions. Applied ML for better
data preparation can be very beneficial
[Pyramid diagram: "Wow! ML" (sometimes needed), over "BI-level Analysis" (always needed, often enough for predictions), over "Data Engineering" (always needed)]
Sources: A. Jhingran, Talk at VLDB 2019, Slides → Link;
C. Kozyrkov, Towards Data Science Blog, 2019 → Link
> Decision Intelligence
• A new engineering discipline that augments data science with
theory from social science, decision theory, and managerial
science
• Goal: Turning information into better actions at any scale.
• Provides a framework for best practices in organizational decision-making and processes for applying machine learning at scale.
• […] The theory is skipped for this talk, but here are some interesting questions to think about:
- “How should you set up decision criteria and design your metrics?”
- “What quality should you make this decision at and how much should
you pay for perfect information?” (Decision analysis)
- “How do emotions, heuristics, and biases play into decision-making?”
(Psychology)
> Fact-based decisions are not enough? Enter → Data Science
• Use partial facts along with statistics, analytics, ML & AI to deal with
uncertainty.
• Remember: The goal (objective) is always the starting point!
- 13. © 2019 Snowflake Computing Inc. All Rights Reserved 17
FIXING DEPLOYMENT ISSUES
Source: D. Sculley, et al.: “Hidden technical debt in Machine learning systems”, 2015
- 14. © 2019 Snowflake Computing Inc. All Rights Reserved
FIXING DEPLOYMENT ISSUES – MAYBE WITH MLOPS
- 15. © 2019 Snowflake Inc. All Rights Reserved 19
RECAP:
WHAT CAN A DATABASE DO FOR YOU?_
- 16. © 2019 Snowflake Computing Inc. All Rights Reserved
RECAP #1: FACT-BASED DECISIONING
TPC-DS Benchmark Query Q57:
Catalog Sales Call Center Outliers
"Find the item brands and categories for each call center and their monthly sales figures for a specified year, where the monthly sales figure deviated more than 10% of the average monthly sales for the year, sorted by deviation and call center. Report the sales deviation from the previous and following months."
> „BI-level Analysis“
• Mature Point & Click Tools based on a well-crafted
Semantic Layer on top of a (virtualized) Data Mart
enable lots of business users to answer even complex
questions
• Challenges: rapidly growing data volumes and resource contention issues lead to restricted DW access (instead of wider end-user adoption)
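The deviation logic the benchmark query describes can be sketched in plain Python. The figures below are invented for illustration; in Snowflake SQL the same filter would typically be expressed with an AVG() window aggregate plus LAG/LEAD for the neighbouring months:

```python
# Hypothetical monthly sales figures for one (call center, brand, category)
# group over a year: month number -> sales.
monthly_sales = {1: 100.0, 2: 104.0, 3: 55.0, 4: 98.0, 5: 101.0,
                 6: 160.0, 7: 99.0, 8: 102.0, 9: 97.0, 10: 100.0,
                 11: 103.0, 12: 101.0}

avg = sum(monthly_sales.values()) / len(monthly_sales)

# Keep months whose sales deviate more than 10% from the yearly average,
# and report the previous and following month (LAG/LEAD in SQL).
outliers = []
for month in sorted(monthly_sales):
    sales = monthly_sales[month]
    if abs(sales - avg) > 0.1 * avg:
        outliers.append({
            "month": month,
            "sales": sales,
            "deviation": sales - avg,
            "prev": monthly_sales.get(month - 1),
            "next": monthly_sales.get(month + 1),
        })

# Sort by absolute deviation, mirroring the query's ORDER BY.
outliers.sort(key=lambda r: abs(r["deviation"]), reverse=True)
```

With this made-up data, only months 6 and 3 deviate by more than 10% from the yearly average.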
- 17. © 2019 Snowflake Computing Inc. All Rights Reserved
Elastic compute: Snowflake separates
user workloads through multiple Virtual
Warehouses which scale instantly to meet
required performance levels
and can auto-resize in the case of peak
workloads to eliminate concurrency issues
Feature
- 18. © 2019 Snowflake Computing Inc. All Rights Reserved
RECAP #2: FAST DATA EXPLORATION & AGGREGATION
Sessionized clickstream data:
Finding sessions that include at least
one "addtocart" event, but do NOT include
a transaction.
Feature
ANSI-compliant SQL; a comprehensive set of aggregation, window, and pattern-matching SQL functions
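As a rough sketch of what that query computes, the session data below is invented; the filter amounts to keeping sessions whose event list contains "addtocart" but no "transaction":

```python
# Hypothetical sessionized clickstream: session_id -> ordered event types.
sessions = {
    "s1": ["view", "addtocart", "transaction"],
    "s2": ["view", "addtocart", "view"],
    "s3": ["view", "view"],
    "s4": ["addtocart"],
}

# Sessions with at least one "addtocart" event but no transaction,
# i.e. carts abandoned before checkout.
abandoned = [
    sid for sid, events in sorted(sessions.items())
    if "addtocart" in events and "transaction" not in events
]
```

In SQL the same result falls out of a conditional aggregation per session (e.g. counting event types in a GROUP BY and filtering in HAVING).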
- 19. © 2019 Snowflake Computing Inc. All Rights Reserved
WORKLOAD SEPARATION IN SNOWFLAKE
Continuous
Loading (4TB/day)
S3
<5min SLA
Virtual
Warehouse
Medium
Batch ETL &
Maintenance
Virtual Warehouse
Large
Virtual
Warehouse
2X-Large
Analytics
(Segmentation)
Interactive
Dashboards
50% < 1s
85% < 2s
95% < 5s
Virtual Warehouse
Auto Scale – X-Large x 5
3+ PB of raw data
1.5 PB data stored in the database (8x compression ratio)
25M micro partitions
Prod DB
EXCURSUS
- 20. © 2019 Snowflake Computing Inc. All Rights Reserved
RECAP #3: ANALYTICS ON (SEMI-)STRUCTURED DATA
Ingest external weather data (JSON format)
and make it instantly available for SQL queries
Use case: Blend historical city bike trip data
with semi-structured weather data to spot
new patterns in customer behavior
- 21. © 2019 Snowflake Computing Inc. All Rights Reserved 26
Feature
Because of Snowflake's VARIANT data type, semi-structured data can be handled with performance similar to structured data. "Real world" SQL, including Common Table Expressions (CTEs) and User Defined Functions (UDFs), is also supported.
Selecting directly from a JSON document stored in
a VARIANT column of a table
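A minimal Python sketch of how that path-based extraction behaves; the document contents and field names here are hypothetical (in Snowflake SQL the equivalent would look roughly like `SELECT v:city.name::string FROM weather`, with `weather` and `v` being placeholder names):

```python
import json

# Hypothetical JSON weather document, as it might sit in a VARIANT column.
doc = json.loads('{"city": {"name": "Nuremberg"}, "main": {"temp": 281.5}}')

def variant_path(obj, path):
    """Follow a dotted path, mimicking Snowflake's v:city.name notation."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None  # Snowflake yields NULL for missing paths
        obj = obj[key]
    return obj

city = variant_path(doc, "city.name")
temp = variant_path(doc, "main.temp")
```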
- 22. © 2019 Snowflake Computing Inc. All Rights Reserved
RECAP #4: FORECAST USING AGGREGATION FUNCTIONS
Calculation of Linear Regression Line + UNION ALL
of actual data and forecast data for a complete set of
sales data
Use case: Forecast the price for the next hour interval
[Chart: actual vs. predicted values, simple linear regression model]
- 23. © 2019 Snowflake Computing Inc. All Rights Reserved 28
Simple linear regression model:
Aggregation Functions for
Linear Regression
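The regression itself reduces to sums that SQL linear-regression aggregates (e.g. REGR_SLOPE and REGR_INTERCEPT) evaluate. A Python sketch with invented hourly prices, using the same closed-form formulas:

```python
# Hypothetical hourly observations: x = hour index, y = price.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 12.0, 14.0, 16.0, 18.0]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

# The same formulas the SQL aggregates compute from COUNT/SUM terms:
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

# Forecast the next hour, as in the UNION ALL of actual + forecast rows.
next_hour = 6.0
forecast = intercept + slope * next_hour
```

Because every term is a plain aggregate over the data, the whole model fits into a single SQL query over the sales table.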
- 24. © 2019 Snowflake Inc. All Rights Reserved 29
WHAT ABOUT IMPLEMENTATION OF
IN-DATABASE ML MODELS?_
- 25. © 2019 Snowflake Computing Inc. All Rights Reserved
IT CAN BE DONE IN A DATABASE…
Working (experimental) examples
in Snowflake:
> K-Means Clustering,
> Predictions using an ID3
Decision Tree algorithm, or even
> Hierarchical Temporal Memory
(HTM) approach
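To make concrete what such a SQL implementation has to express, here is one k-means iteration in Python on invented 1-D data; in SQL the two steps map to a nearest-centroid join followed by a GROUP BY with AVG():

```python
# One k-means iteration: assign each point to its nearest centroid, then
# recompute each centroid as the mean of its assigned points.
points = [1.0, 2.0, 1.5, 8.0, 9.0, 8.5]
centroids = [0.0, 10.0]

def kmeans_step(points, centroids):
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Empty clusters keep their old centroid.
    return [sum(ps) / len(ps) if ps else centroids[i]
            for i, ps in sorted(clusters.items())]

new_centroids = kmeans_step(points, centroids)
```

Iterating this step until the centroids stop moving is the whole algorithm, which is why it can be driven from a stored procedure loop.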
- 26. © 2019 Snowflake Computing Inc. All Rights Reserved
…USING SQL, UDFs, STORED PROCEDURES & MATH…
Feature
Snowflake stored procedures are
implemented through JavaScript
and, optionally, SQL:
• JavaScript provides the control
structures (branching and
looping).
• SQL is executed within the
JavaScript by calling
functions in an API (SQL is
not required in a stored
procedure, but is typically
included)
Embedded SQL
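A sketch (built here as a Python DDL string) of what that pattern looks like. The procedure name and the object it calls are hypothetical, but the LANGUAGE JAVASCRIPT scaffolding and the snowflake.createStatement API are Snowflake's documented mechanism for embedding SQL:

```python
# Illustrative Snowflake stored procedure: JavaScript supplies the control
# flow, embedded SQL runs via snowflake.createStatement(...).execute().
# Names (train_kmeans, assign_points_to_centroids) are placeholders.
ddl = """
CREATE OR REPLACE PROCEDURE train_kmeans(iterations FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  for (var i = 0; i < ITERATIONS; i++) {   // JavaScript looping
    var stmt = snowflake.createStatement({
      sqlText: "CALL assign_points_to_centroids()"  // embedded SQL
    });
    stmt.execute();
  }
  return "done after " + ITERATIONS + " iterations";
$$;
"""
```

Note that inside the JavaScript body, Snowflake exposes the procedure's arguments in uppercase (ITERATIONS).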
- 28. © 2019 Snowflake Computing Inc. All Rights Reserved
Result of Procedure Call
- 29. © 2019 Snowflake Computing Inc. All Rights Reserved
…IT MAKES SOME SENSE FROM A DEPLOYMENT PERSPECTIVE…
Line of
Governance
• Structured + semi-
structured data
• Raw data available for
discovery
• Self-Service sandbox
• Multiple toolsets / IDEs
• Readable code!
• Same technology for
commercial exploitation
• Direct access via SQL
• Elastic compute
• Versioning
• Standardisation & governance
Model
- 30. © 2019 Snowflake Computing Inc. All Rights Reserved
…BUT PURE CODING IS NOT WORKING FOR EVERYONE!
Challenge: Fixing the Data Scientist Talent Gap
> In any organization there are many business analysts and BI power users who are curious to explore data science and predictive algorithms for their business case
> Enablement through basic learning, literacy, and the right tools will let these individuals transform into Citizen Data Scientists who can do their own hypothesis testing and prototyping
> Probably the only feasible way today to democratize advanced analytics in an organisation
[Diagram: the Citizen Data Scientist sits between a potentially large user base and potential user impact]
- 31. © 2019 Snowflake Inc. All Rights Reserved 36
AUTO-ML OUTSIDE YOUR DATABASE?_
- 32. © 2019 Snowflake Computing Inc. All Rights Reserved 37
WOW! AUTOML
> Automated Machine Learning (AutoML)
• Is the process of automating the entire end-to-end process (or some
steps) of applying ML to real-world problems:
- Data pre-processing
- Feature engineering, extraction, and selection
- Algorithm selection & hyperparameter optimization
• The accuracy of ML solutions can be measured → automated systems
can fine-tune data, features, and algorithms to generate accurate models,
relying on established ML knowledge
> Benefits of AutoML
• Cost reductions: increased productivity for data scientists and/or
democratization of machine learning, which reduces the demand for data
scientists
• Intelligence can easily be added to applications → increased
revenues and customer satisfaction
• Higher productivity: roll out more models with increased accuracy
> The data scientist's advantage
• Conformance to custom specifications, e.g. if a model needs to be
embedded in edge devices, or if explainability is required
• Model performance: humans are still beating models generated by
AutoML tools.
- 33. © 2019 Snowflake Computing Inc. All Rights Reserved 38
AMAZON FORECAST – DATA IMPORT
• Input file format has to be CSV
• The data schema of a new time series dataset needs to be specified and
mapped to the required input format
• Data import from AWS S3 buckets only
• A dataset can have multiple types:
- TARGET_TIME_SERIES: historical time series data for each item
- RELATED_TIME_SERIES: additional numeric data points, e.g.
price, webpage_hits, flags (1/0); the more information available, the more
accurate the forecast
- ITEM_METADATA: additional metadata (attributes), e.g. category,
color, brand
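A minimal sketch of preparing such an import locally. The column layout follows the TARGET_TIME_SERIES pattern (timestamp, item id, target value) and the schema structure mirrors what Forecast expects, but the concrete names and values are invented:

```python
import csv
import io

# Amazon Forecast ingests CSV only; the columns must match the schema
# declared for the dataset. Values below are illustrative.
rows = [
    ("2019-11-01 00:00:00", "item_001", 12.0),
    ("2019-11-01 01:00:00", "item_001", 14.5),
    ("2019-11-01 00:00:00", "item_002", 3.0),
]

# Schema mapping column positions to Forecast attribute types.
schema = {"Attributes": [
    {"AttributeName": "timestamp", "AttributeType": "timestamp"},
    {"AttributeName": "item_id", "AttributeType": "string"},
    {"AttributeName": "target_value", "AttributeType": "float"},
]}

buf = io.StringIO()
writer = csv.writer(buf)
for timestamp, item_id, demand in rows:
    writer.writerow([timestamp, item_id, demand])  # no header row

csv_text = buf.getvalue()  # would be uploaded to the S3 bucket
```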
- 34. © 2019 Snowflake Computing Inc. All Rights Reserved 39
AMAZON FORECAST – MODEL TRAINING
• Instead of AutoML, manual algorithm selection is also possible:
- Autoregressive Integrated Moving Average (ARIMA)
- DeepAR+ (incl. hyperparameter optimization)
- Exponential Smoothing (ETS)
- Non-Parametric Time Series (NPTS)
- Prophet Algorithm
• Additional configuration is available in non-AutoML mode
• Predictor accuracy needs to be evaluated using related metadata + metrics,
e.g. RMSE. Training and featurization configurations are also available
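RMSE itself is straightforward to compute; a sketch on invented backtest values, comparing held-out actuals against a predictor's forecasts:

```python
import math

# Hypothetical backtest window: actual demand vs. predicted demand.
actual = [10.0, 12.0, 9.0, 11.0]
predicted = [11.0, 11.0, 9.0, 13.0]

# Root mean squared error: penalizes large errors quadratically.
rmse = math.sqrt(
    sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
)
```

A lower RMSE across backtest windows is one signal for picking which trained predictor to use for forecast generation.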
- 35. © 2019 Snowflake Computing Inc. All Rights Reserved 40
AMAZON FORECAST – FORECAST GENERATION + EXPORT
• Based on the evaluation metrics of previously trained models (predictors),
a well-performing predictor can be used to generate a forecast for each
unique item in a given target time-series dataset
• Retrieval of a forecast for a single item → via a query including a
filter (time window)
• Export of the complete forecast into an Amazon S3 bucket
- 36. © 2019 Snowflake Computing Inc. All Rights Reserved 41
INTEGRATING SNOWFLAKE WITH AMAZON FORECAST
Source: aws.amazon.com/forecast
Scenario with Snowflake (AWS deployment)
• Prepare & retrieve time series data in Snowflake
• Export the data set into a Snowflake stage (= S3 bucket)
• Use Amazon Forecast via Console, CLI, or APIs
• Retrieve the forecast results as CSV files or via API and write them back
to Snowflake
Feature
Snowflake Connector for Python
- 37. © 2019 Snowflake Computing Inc. All Rights Reserved 42
USING PYTHON TO ORCHESTRATE THE OVERALL PROCESS
> Using Amazon Forecast with Python
• For Python, AWS provides an SDK called "Boto 3" that enables
developers to create, configure, and manage AWS services such as
EC2 and S3. Boto 3 also provides low-level access to services
like Amazon Forecast → Documentation Link
• The Amazon Forecast API Reference covers all the actions explained on the
previous slides → Link
• Jupyter Notebooks with detailed examples for Amazon Forecast
are available on GitHub → Link
> Snowflake Connector for Python
• Provides an interface for developing Python applications that can
connect to Snowflake and perform all standard operations:
- Connecting to Snowflake with the default authenticator, or with a
SAML 2.0-compliant identity provider
- Query data; create & set up new databases and tables; grant access; …
- Assign and resize a compute cluster (Virtual Warehouse)
- Load/unload data from/to Amazon S3 (or other cloud storage)
• An end-to-end integration example is explained in Snowflake's blog →
Link; sample Python code can be downloaded from a GitHub repo
→ Link
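A sketch of the overall flow as a plan of steps, with every name (bucket, stage, table, SQL) a placeholder and no network calls executed; in a real pipeline each dict would become a call through Boto 3's Forecast client or a cursor from the Snowflake Connector for Python:

```python
# Orchestration sketch: Snowflake -> S3 -> Amazon Forecast -> S3 -> Snowflake.
# All identifiers are hypothetical; with credentials configured, boto3's
# "forecast" client and snowflake.connector would execute each step.

def plan_pipeline(stage_bucket: str, table: str) -> list:
    return [
        # 1. Export prepared time series from Snowflake into an S3 stage.
        {"tool": "snowflake-connector-python",
         "action": "COPY INTO @my_stage", "source": table},
        # 2. Import the CSV from S3 into an Amazon Forecast dataset.
        {"tool": "boto3:forecast", "action": "create_dataset_import_job",
         "s3_path": "s3://%s/export/" % stage_bucket},
        # 3. Train a predictor (AutoML or a manually chosen algorithm).
        {"tool": "boto3:forecast", "action": "create_predictor"},
        # 4. Generate the forecast and export it back to S3.
        {"tool": "boto3:forecast", "action": "create_forecast_export_job",
         "s3_path": "s3://%s/forecast/" % stage_bucket},
        # 5. Load the forecast CSV files back into Snowflake.
        {"tool": "snowflake-connector-python",
         "action": "COPY INTO %s_forecast" % table},
    ]

steps = plan_pipeline("my-bucket", "sales_hourly")
```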
- 38. © 2019 Snowflake Inc. All Rights Reserved 43
END-TO-END ML PROJECTS AT SCALE_
- 39. © 2019 Snowflake Computing Inc. All Rights Reserved 44
KEY ROLES IN A DATA-POWERED ORGANIZATION…
Source: Dataiku.com
- 40. © 2019 Snowflake Computing Inc. All Rights Reserved 45
…(IDEAL) CROSS-COLLABORATION IN ML PROJECTS…
Source: Dataiku
- 41. © 2019 Snowflake Computing Inc. All Rights Reserved
…SUPPORTED BY A SCALABLE ANALYTICS ARCHITECTURE
- 42. © 2019 Snowflake Computing Inc. All Rights Reserved 47
SNOWFLAKE INTEGRATION WITH A DATA SCIENCE SUITE
EXAMPLE
End-to-end Data Flow
- 43. © 2019 Snowflake Computing Inc. All Rights Reserved 48
SNOWFLAKE INTEGRATION WITH A DATA SCIENCE SUITE
End-to-end Data Flow (EXAMPLE)
- Automatic bulk copy of datasets from the S3 data lake to Snowflake
- Automatic table creation and data movement
- Run complex SQL directly in Snowflake, utilising its built-in functions
- Visual data transformation operations (prepare, group by, filter, split, …) automatically pushed down to Snowflake
- Use Python code recipes and execute them in Snowflake
- Train and use built-in ML models on Snowflake data
- Interactive SQL notebooks for interactive analysis
- "In-database" charting to visualize entire datasets (stored in Snowflake)
- 47. © 2019 Snowflake Computing Inc. All Rights Reserved
TRY SNOWFLAKE YOURSELF!
Snowflake Hands-on Lab Guide → download the handbook here
- 48. © 2019 Snowflake Computing Inc. All Rights Reserved
SNOWFLAKE SIGMOD PAPER
Download:
www.snowflake.com/resource/
sigmod-2016-paper-snowflake-elastic-data-warehouse
- 49. © 2019 Snowflake Computing Inc. All Rights Reserved
SELECTED SNOWFLAKE TECH ARTICLES
> SNOWFLAKE CHALLENGE: CONCURRENT LOAD AND QUERY,
Benoit Dageville
> DON’T IGNORE ACID-COMPLIANT DATA PROCESSING IN THE CLOUD
Michael Nixon
> HOW TO LOAD TERABYTES INTO SNOWFLAKE – SPEEDS, FEEDS AND TECHNIQUES
Stuart Ozer
> SNOWFLAKE AND SPARK: PUSHING SPARK QUERY PROCESSING TO SNOWFLAKE
Edward Ma
> HOW WE BUILT SNOWFLAKE ON AZURE
Polita Paulus
> DATA MODELING IN THE AGE OF JSON AND SCHEMA-ON-READ
Kent Graziano
> HOW TO MANAGE GDPR COMPLIANCE WITH SNOWFLAKE’S TIME TRAVEL AND
DISASTER RECOVERY
Kent Graziano
> DATA ENCRYPTION WITH CUSTOMER-MANAGED KEYS
Martin Hentschel