© Cloudera, Inc. All rights reserved.
A deep dive into running data analytic workloads in the cloud
Strata San Jose 2018
Jason Wang | Altus Engineering
Aishwarya Venkataraman | Altus Engineering
Stefan Salandy | Systems Engineering
Mala Ramakrishnan | Senior Director, Altus Product & Marketing
Agenda
- Introduction
- Cloudera Altus
- Introducing today’s lab
- Hands-on data pipeline
- Running analytic database as a PaaS
- Workload Analytics
- Conclusion
The Big Shift
In 2017
58% on-premises
11% private cloud
25% public cloud
By 2019
38% on-premises
15% private cloud
41% public cloud
Source: 451 Research, Voice of the Enterprise: Workloads and Key Projects, Cloud Transformation, 2017.
The cloud has redefined our world
Role of VP of Data Management
Old Job
- Buy databases in bulk and rent back to departments
- Load data into and out of individual data silos as needed
- Add storage to each platform as needed
New Job
- Departments buy their own databases
- Safe, collaborative environment for every department to access centralized, shared data
- Departments rent their storage needs
The New World: SDX, COMPUTE, STORAGE
Most deployments are a hybrid of the old and new
The market is diverging toward 4 distinct environments
¼ PaaS
¼ Public Cloud / IaaS
¼ Private Cloud
¼ Non-Cloud
Perfectly valid reasons for each environment
I want to maximize
• Non-Cloud: Cost-efficiency
• Private Cloud: Control, elasticity, and convenience
• Public Cloud / IaaS: Control, elasticity, and convenience
• PaaS: Agility
I want to minimize
• Non-Cloud: Dependence on unproven technology
• Private Cloud: Resource contention between departments
• Public Cloud / IaaS: Dependence on data center floor space
• PaaS: Dependence on IT, so it needs to be as simple as possible
I want to standardize
• Non-Cloud: On whatever provides the best ROI
• Private Cloud: On a single environment for the entire data center
• Public Cloud / IaaS: On a single cloud provider for all infrastructure needs
• PaaS: On whatever is easiest to use
I want to store my data
• Non-Cloud: On premises because it is cheaper and/or more secure
• Private Cloud: On premises due to company / government mandate
• Public Cloud / IaaS: In the cloud because it is easier
• PaaS: In the cloud because it is easier
Which environment do you want?
Non-Cloud
“I need huge scale in a single cluster”
“I have a ton of cold data”
“My existing cluster utilization is 90%”
Private Cloud
“I want to separate compute and storage”
“I have unmet demand for ad hoc workloads”
“Bare metal is not an option and I’m not allowed to move to the cloud”
Public Cloud / IaaS
“I want to configure and troubleshoot my environment”
“We’ve already done a scan of AWS and that’s where we’re moving”
“My annual chargeback per server is outrageous”
PaaS
“I’m done hiring my own admins”
“My team has limited skills”
“I get no love from central IT”
● The modern platform for machine learning and analytics
● with multiple deployment options
● and one shared data experience
The modern platform for machine learning and analytics optimized for the cloud
Cloudera Enterprise
EXTENSIBLE SERVICES / CORE SERVICES: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
DATA CATALOG, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT, INGEST & REPLICATION
STORAGE SERVICES: S3, ADLS, HDFS, KUDU
DEPLOYMENT OPTIONS: PRIVATE CLOUD, BARE METAL, INFRASTRUCTURE SERVICES
Who is this tutorial for?
The Data Management Infrastructure Model
https://www.gartner.com/doc/3817571/solve-data-challenges-data-management
Who is this tutorial for?
Data Management Infrastructure Model Roles and Skills
https://www.gartner.com/doc/3817571/solve-data-challenges-data-management
Traditional on-premises workloads generally share a cluster
HDFS
Cloud workloads: Separation of storage and compute
Dedicated compute
Shared data
Object Store (S3, ADLS)
Technology drivers for workloads in the cloud
1. Scalable and cost-effective storage in a single repository
2. Access to utility-based compute
3. Open and modular architectures
Amazon EC2, Azure Virtual Machine, Amazon S3, Azure Data Lake Storage
Types of clusters
lifecycle: transient, permanent
single tenant, multi tenant
Data Engineering Pipeline, Analytics Cluster
authorization, configuration, performance, troubleshooting, upgrade, metadata
Data Engineering in the Cloud
Transient Batch
Spin up clusters as needed.
● On-demand/spot instances
● Usage-based pricing
● Sized for workload
● Cluster per tenant/user
Long-running Batch
Persistent clusters for frequent ETL.
● Reserved instances
● Node-based pricing
● Grow/shrink
● Cluster per tenant group
Persistent Batch on HDFS
Top performance for frequent ETL.
● Reserved instances
● Node-based pricing
● Shared across tenant groups
● Lift-and-shift
[Diagram: batch clusters and persistent clusters over hyperscale cloud storage, plus a persistent cluster with HDFS; one part is labeled PaaS]
Analytics in the cloud
Transient Analytics (infrequent usage)
Spin up clusters when needed
● On-demand instances
● Usage-based pricing
● Grow/shrink
● Cluster per tenant or user
Persistent Analytics (regular usage)
Persistent clusters for BI any time
● Reserved instances
● Usage-based pricing
● Grow/shrink
● Cluster per tenant group
Persistent Analytics with Local Storage (fastest)
Max speed for more regular workloads
● Reserved instances
● Node-based pricing
● Less frequent grow/shrink
● Shared cluster for shared local data
[Diagram: transient and persistent clusters over object storage, plus a persistent cluster with HDFS and/or Kudu; one part is labeled PaaS]
Primary analytic workloads in the cloud: scale, agility, and cost-efficiencies
ETL / Data Preparation
Only pay for what you need, when you need it
• Transient workloads
• Contention-free isolation
• Cloud-native integration
BI / SQL Analytics
Self-service flexibility at any scale
• Elastic scale
• Multi-tenant isolation
• Cloud-native or local
Shared, Open Storage
Cloudera Altus
Multi-cloud Platform-as-a-Service (PaaS) offering
Built to analyze and process data at scale in public cloud infrastructure
EXTENSIBLE SERVICES / ALTUS SERVICES: DATA ENGINEERING, OPERATIONAL DATABASE, ANALYTIC DATABASE, DATA SCIENCE
Altus Data Engineering (DE)
What is it?
- Short-lived
- Single tenant
- Hive, Spark, or MapReduce cluster
Used for things like
- ETL jobs
- batch processing
- with data living in S3 or ADLS
- Provides fast and easy job submission without cluster management
Available on AWS and Azure
Altus Analytic Database (ADB)
What is it?
- Long-lived
- Multi tenant
- Impala cluster
Used for things like
- data warehousing
- analytics
- with data living in S3 or ADLS
- Provides fast and easy analytics without cluster management
Available on AWS
Cloudera Shared Data Experience (SDX)
What is it?
- Cloud-native shared metadata store with metadata living in S3 or ADLS
Used for things like
- Shared cataloging to define and preserve structure and business context of data
- Provides unified security across transient and recurring workloads
- Enables consistent governance across all data to increase compliance
[Diagram: multiple Data Engineering and Analytic Database clusters sharing S3 or ADLS]
Altus Features
Focus on the workload, not the infrastructure.
Let Altus do the heavy lifting.
Low cost
• Per-node/per-hour pricing
• Create clusters as needed
• Terminate clusters when they’re not in use
End-user focused
• Manages your cluster so you don’t have to
• Submit jobs via the UI/CLI/API
• Built-in workload troubleshooting and analytics
Easy to use
• Self-service for end-users
• Built on your familiar cloud infrastructure
• Cluster provisioning in minutes
Cloud-native
• Runs on AWS and Azure
• Read/write against ADLS and S3
• Decouple storage from compute
Integrated Platform
• Same Cloudera platform on-premises and in the cloud
• Many different services like DE and ADB
• Share metadata across clusters with SDX
Secure
• Integrated with Azure and AWS security models
• Cloudera NEVER has access to your data
• Backed by native cloud storage
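To make the per-node/per-hour pricing point concrete, here is a minimal arithmetic sketch comparing a cluster that is terminated when not in use with one left running all month. The price, node count, and job duration are made-up illustration values, not actual Altus or cloud provider rates.

```python
def cluster_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Cost of a cluster billed per node, per hour."""
    return nodes * hours * price_per_node_hour


# Hypothetical numbers, for illustration only (not real rates).
PRICE_PER_NODE_HOUR = 0.50
NODES = 10
JOB_HOURS_PER_DAY = 3   # assumed: the ETL job runs about 3 hours a day
DAYS = 30

# Transient: the cluster exists only while the job runs.
transient = cluster_cost(NODES, JOB_HOURS_PER_DAY * DAYS, PRICE_PER_NODE_HOUR)
# Always-on: the same cluster is never terminated.
always_on = cluster_cost(NODES, 24 * DAYS, PRICE_PER_NODE_HOUR)

print(f"transient: {transient:.2f}  always-on: {always_on:.2f}")
```

Under these assumptions the gap comes entirely from idle hours, which is the case the "terminate clusters when they’re not in use" bullet addresses.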
What is an Environment?
An Environment is an encapsulation of the cloud provider resources and the cross-account trust needed to deploy Cloudera clusters.
What are Clusters?
A Cluster is a Cloudera cluster (CM + Master + Worker nodes) optimized for running specific workloads.
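On AWS, cross-account trust is commonly expressed as an IAM role trust policy that lets another account assume the role. The sketch below builds such a policy document in the general AWS form; it is not taken from the slides, the account ID and external ID are placeholders, and the exact policy Altus requires may differ.

```python
import json


def cross_account_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build a generic AWS IAM trust policy for cross-account role assumption.

    Both arguments are placeholders supplied by the caller; this helper is
    hypothetical and only illustrates the general shape of such a policy.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # An external ID guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


policy = cross_account_trust_policy("123456789012", "example-external-id")
print(json.dumps(policy, indent=2))
```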
AWS vs. Azure
1. Security Model for Delegated Access
2. Networking
3. Cloud Storage Data Access
Azure Model for Delegated Access: Service Principal
Today’s Lab:
Solving a Business Need With Cloudera Altus
Setting the Scene
- We work for an outdoor clothing retail company and website sales are struggling
- We need to figure out whether sales orders correlate with website visits and what steps to take to improve sales
- We’ll use Altus DE and Altus ADB to solve this
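As a rough illustration of what "do sales orders correlate with website visits" means numerically, here is a small Pearson correlation sketch. The daily counts are invented for the example; in the lab the comparison is done with Altus DE and Altus ADB, not with this code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts for one product page (illustration only).
daily_visits = [120, 150, 90, 200, 170]
daily_orders = [10, 14, 9, 21, 15]

r = pearson(daily_visits, daily_orders)
print(f"correlation between visits and orders: {r:.3f}")
```

A value near 1 would suggest visits and orders move together; a page with many visits but few orders would stand out as a place to investigate.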
Already Set Up: Raw Data Ingestion
Sales Orders, Raw Logs
Part One: Data Engineering
Sales Orders Raw Logs Tokenized logs
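Part One turns raw web logs into tokenized logs. The sketch below shows the general idea with a regular expression, assuming the raw logs follow the Apache common log format; the slides do not state the format, and the sample line and field names are hypothetical. In the lab this step runs as a job on an Altus Data Engineering cluster rather than as a local script.

```python
import re

# Assumed format: Apache common log format (an assumption, not from the slides).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def tokenize(line):
    """Split one raw log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line for illustration.
sample = '10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "GET /department/apparel HTTP/1.1" 200 1388'
tokens = tokenize(sample)
print(tokens)
```

Once each line is split into fields such as `url` and `status`, the visits can be grouped by page and joined against sales orders.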
Part Two: Analytics
Sales Orders Raw Logs Tokenized logs
What this will look like in today’s lab
But first, go get the handout
https://tinyurl.com/y9zxxzkm
Handout overview
When you see this hand icon, look at your handout for a hands-on task.
Log in to Altus (hands-on task 1)
console.altus.cloudera.com
Create Altus clusters (hands-on task 2)
Create one cluster for Data Engineering and one cluster for Analytic Database. While the clusters are being created, take a break!
Perform ETL using Altus Data Engineering (hands-on task 3)
Altus Analytic DB Architecture
● Impala running on EC2 nodes
● Data stored in S3
● Data can be accessed by multiple clusters
Explore data using Altus Analytic Database (hands-on task 4)
Altus Workload Analytics
● Get insight into causes of job failure
● Size clusters and optimize job performance
● Identify issues even when they don’t show up as errors
Troubleshooting failed jobs (hands-on task 5)
Hive invalid query
Troubleshooting failed jobs (hands-on task 5)
Spark Out of Memory issue
Optimize Performance
Example: Skewed join
- Workload Analytics (WA) lists outlier tasks that have a long wait before they start
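One way to picture how outlier tasks could be flagged is a median-based test on per-task wait times. This is a minimal sketch of a generic outlier check; the slides do not describe the method Workload Analytics actually uses, and the task names, wait times, and threshold are hypothetical.

```python
from statistics import median


def outlier_tasks(wait_seconds: dict, threshold: float = 3.0) -> list:
    """Return tasks whose wait time is far above the median.

    Uses the median absolute deviation (MAD) as a robust measure of spread.
    The threshold of 3.0 is an arbitrary illustration value.
    """
    values = list(wait_seconds.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all waits are (nearly) identical, nothing stands out
    return [task for task, v in wait_seconds.items() if (v - med) / mad > threshold]


# Hypothetical wait times before each task starts, in seconds.
waits = {"task-0": 2.0, "task-1": 2.5, "task-2": 1.8, "task-3": 2.2, "task-4": 40.0}
flagged = outlier_tasks(waits)
print(flagged)
```

In a skewed join, one or a few tasks handle most of the rows for a hot key, so they tend to stand apart from the rest in exactly this way.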
Track history
● Track history of recurring workloads over time
● Performance trends of each individual stage
● Automatic detection of abnormal behavior of recurring workloads (too fast or too slow)
● Drilling down can show differences between data input / output size
● Group by jobs
DEMO
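The "too fast or too slow" idea can be sketched as a comparison of the latest run against the history of the same recurring job. This is an illustrative baseline check only; the detection logic in Workload Analytics is not described in the slides, and the durations and the factor `k` are hypothetical.

```python
from statistics import mean, stdev


def classify_run(history, latest, k=2.0):
    """Classify the latest run duration against past runs of the same job.

    A run is flagged when it falls more than k sample standard deviations
    from the historical mean. Requires at least two historical runs.
    """
    mu = mean(history)
    sigma = stdev(history)
    if latest > mu + k * sigma:
        return "too slow"
    if latest < mu - k * sigma:
        return "too fast"
    return "normal"


# Hypothetical durations (seconds) of previous runs of a nightly job.
past_runs = [610, 595, 620, 605, 600]

print(classify_run(past_runs, 900))  # much longer than usual
print(classify_run(past_runs, 300))  # suspiciously short, e.g. missing input
print(classify_run(past_runs, 610))
```

A run that finishes too quickly is worth flagging as well, since it can mean the job processed far less input than usual, which ties in with the input / output size drill-down.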
Execution details of a job
- Number of Map/Reduce jobs generated
- Log files for each individual task
- Metrics for each stage
- Browse and search configuration properties
DEMO
Key benefits of PaaS
Spin up working environments ad hoc
Bring your own data and tools
Adjust resources on-demand
Pay for your actual consumption of resources
The key benefits of a modern analytic database
High-performance BI and SQL analytics
Flexibility for data and use case variety
Cost-effective scale for today and tomorrow
Go beyond SQL with an open architecture
Advantages of a modern approach
decoupled for cloud and on-premises
Go Beyond SQL
• Consolidate data silos with an open architecture
• Shared data across SQL and non-SQL workloads
Data Flexibility
• Iterative modeling and self-service accessibility
• Portability: No proprietary formats or storage lock-in
Cost-Effective Scalability
• Elastic scale in any environment
• Cloud-native integration for optimized pay-per-use costs
• Proven at massive scale
Hybrid
• Runs across multi-cloud & on-prem for zero lock-in
• Multi-storage over S3, ADLS, HDFS, Kudu, Isilon, etc.
Shared Data
Key benefits translated for the cloud
- High-performance BI and SQL analytics: Same SQL engine native across any cloud and on-prem
- Flexibility for data and use case variety: Self-service access directly on object stores, without the silos
- Cost-effective scale for today and tomorrow: Elasticity on-demand through decoupled compute and object storage
- Go beyond SQL with an open architecture: Converge workloads over shared data, with zero lock-in