Big Data Fundamentals 6.6.18
1. 1© Cloudera, Inc. All rights reserved.
Big data fundamentals
Understanding the optimization choices in big data components
2. 2© Cloudera, Inc. All rights reserved.
Presentation goals
Teach you something
Help you see the potential of Big Data beyond Map Reduce
Be fair to Cloudera’s competitors
Inspire you to learn more
If something doesn’t make sense, please ask.
3. 3© Cloudera, Inc. All rights reserved.
Notification
• The information in this document is proprietary to Cloudera. No part of this document may be reproduced,
copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.
• This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
• Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.
4. 4© Cloudera, Inc. All rights reserved.
Agenda
• Open source software
• Data storage and stewardship
• Data integration
• Data engineering
• Data analytics
• Life after Lambda architectures and IoT
• Data science at scale
• Big data in the clouds
• Cybersecurity as a Big Data problem
• Cluster management and security
• Customer success stories
• Question and answers
5. 5© Cloudera, Inc. All rights reserved.
Big data fundamentals
Open source software
Optimizing to benefit from community innovation
6. 6© Cloudera, Inc. All rights reserved.
Free evaluation
Install, test, inspect, and evaluate open
source code in perpetuity, with no
financial obligation
Freedom from lock-in
Multiple vendors supporting same core
technology makes it easier to move
Scalable innovation
The collective work of a global,
passionate community keeps the code
base evolving
3 Reasons open source is good for companies
[1] [2] [3]
These benefits derive from use of the permissive
Apache License
7. 7© Cloudera, Inc. All rights reserved.
Not business focus
Company assets should be working on
core competency
Real cost hard to measure
Time developers spend solving
problems or adding features often isn’t
visible
Multiple projects
Each project is managed by a separate
committee and there is not necessarily
an overriding design
3 Reasons open source adds risk for companies
[1] [2] [3]
8. 8© Cloudera, Inc. All rights reserved.
“Open source software is free
like a puppy is free”
- Scott McNealy
CEO Sun Microsystems
9. 9© Cloudera, Inc. All rights reserved.
What if you got a dog for a
reason?
• Can take years to mature
• Months of intensive training (when your
attention should be elsewhere)
• Dog becomes very bonded to the
handler (and vice versa)
• Poor training results in a misbehaving
dog
Developers don’t want to be tied to one system
You don’t want your developers tied to one system
11. 11© Cloudera, Inc. All rights reserved.
Benefits of using a distribution
Stability
Each Apache project has its own dependencies and release cycle. Getting them to work together requires effort and thorough testing.
Regular upgrades
Code in open source changes constantly. Cloudera provides a new feature release every quarter that is tested and supported.
24x7 support and bug fixes
Distribution vendors should employ open source committers who can make sure fixes are added to the open source base.
12. 12© Cloudera, Inc. All rights reserved.
More benefits of using a distribution
Faster to market
With a distribution, you can start developing applications right away. Building an environment from scratch would take months.
Minimize risk
With a distribution, you know what it will cost and you know that it will work. Building an environment from scratch provides no such guarantees.
Focus on business problems
Building an environment from scratch would require the focus of a few of your best developers. Get them working on the real problem.
13. 13© Cloudera, Inc. All rights reserved.
The big data ecosystem vendors
(Spark) (Kafka)
Comprehensive distributions
Single+ project specialists
Proprietary + Hadoop in the gaps
(Cassandra)
Google Cloud Dataproc
14. 14© Cloudera, Inc. All rights reserved.
Apache software foundation
ASF board of directors (appoints the chair below)
For each project:
• Project management committee (PMC) chair: ensures the project complies with ASF requirements
• PMC members: decide the architecture, feature set and direction of the project; usually also committers
• Committers: have write access to the code, although contributions are approved by the PMC
• Developers (aka contributors): anyone may propose changes to the code or documentation, but those changes have to be picked up and used by a committer
• Users: provide feedback, bug reports and feature suggestions
15. 15© Cloudera, Inc. All rights reserved.
Apache project requirements
• Must be Apache licensed (may include compatibly licensed elements)
• Free to download and use for any purpose
• Branding requirements and restrictions
• Source code must be open and available on the ASF website
• Must provide sufficient documentation to use the project on website
• Releases must follow the ASF PMC voting policies
• Corporations may not directly contribute – only individuals
• Must govern themselves independently of undue commercial influence
• Must not discourage new contributions from competing vendors
• Low diversity may incur ‘extra scrutiny’ from the board
However, there are NO requirements to:
• Have more than one commercial entity involved (random community members are ok)
• Contribute to an existing project when there is overlap in functionality (competitive projects are ok)
• Contribute modifications or enhancements back to the project
• Employ Committers or PMC members if you are a commercial vendor
16. 16© Cloudera, Inc. All rights reserved.
Cloudera’s commitment to our customers
Anything that stores your data
Any APIs your applications call
Uses open source code
Our contributions and fixes go back
to open source first
When possible, use projects
supported by multiple commercial
vendors
Keeping your cluster running
Cloudera express edition
No limit to number of servers
Managing your applications
Employ* committers, if not PMC
members, on the projects we
support
* People manage their own careers. Temporary gaps may exist
High availability features
Ensure your success
Open source
License expiration won’t stop
the cluster
Free to use forever Provide enterprise value
RBAC over your data
24x7 support
Minimize your risk
Rolling upgrades
Data governance and lineage
Automated backup and recovery
Full disk encryption
Multi-tenant usage reports
17. 17© Cloudera, Inc. All rights reserved.
Big data fundamentals
Data storage and stewardship
Optimizing for inexpensive, reliable storage accessed by
multiple execution engines
18. 18© Cloudera, Inc. All rights reserved.
Anatomy of a big data cluster
[Diagram: cluster layout by node role, managed by Cloudera Manager; every node runs a CM Agent]
Masters and utility nodes: Name Node, Secondary Name Node, YARN, HMaster (x3), Kudu Master (x3), Zookeeper (x3), HiveServer, Impala Catalog Store, Impala Statestore, Sentry Server, HUE Server, Oozie Server
Workers (three shown): Data Node, HBase Region Server, Search, YARN Resource Pool(s)
Workers (three shown): Data Node, Kudu Tablet Server, Impala Daemon, YARN Resource Pool(s)
Gateway(s): CDSW with CDSW Sessions, User Apps
Also shown: Metadata Database(s), Cloud Plugin, Cloudera Director (optional)
19. 19© Cloudera, Inc. All rights reserved.
HDFS
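[Diagram: Name Node, Secondary Name Node and Standby Name Node, above Data Nodes A, B, C and D spread across Rack1, Rack2 and Rack3]
FileQ is split into blocks BX, BY and BZ
Each block is stored as three replicas on different Data Nodes: BX1 BX2 BX3, BY1 BY2 BY3, BZ1 BZ2 BZ3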
Default block size = 128 MB (configurable; larger sizes such as 256 MB are a common setting)
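The block layout on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: the round-robin placement is a simplification of the real rack-aware policy, the node-to-rack mapping is an assumption (the slide does not say which node sits in which rack), and the file and block sizes are illustrative values passed in explicitly.

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return the size in MB of each block a file is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Simplified round-robin placement (NOT the real HDFS placement policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical node-to-rack mapping, ordered so neighbours sit in different racks.
NODES = [
    ("DataNodeA", "Rack1"),
    ("DataNodeC", "Rack2"),
    ("DataNodeD", "Rack3"),
    ("DataNodeB", "Rack1"),
]

# Illustrative sizes: a 300 MB file cut into 128 MB blocks gives three blocks.
blocks = split_into_blocks(file_size_mb=300, block_size_mb=128)
placement = place_replicas(len(blocks), NODES)

for block, replicas in placement.items():
    print(block, blocks[block], "MB ->", [node for node, _rack in replicas])
```

Each block ends up on three distinct nodes spanning at least two racks, so losing one node or one rack still leaves a readable copy.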
20. 20© Cloudera, Inc. All rights reserved.
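HDFS Snapshots
[Diagram: directory tree … / user / hive / tables, with sales and subscriptions directories holding Data1.parquet and Data2.parquet. A .snapshot directory holds snap1, which references the same Data1.parquet and Data2.parquet. The Name Node tracks the snapshot; the Data Nodes hold the block replicas BX1, BX2, BY1, BY2, BZ1 and BZ2]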
21. 21© Cloudera, Inc. All rights reserved.
Public cloud blob storage
Public clouds are offering low cost, highly available storage
Designed for access inside and outside of Hadoop
• Amazon Simple Storage Service (S3)
  Uses ‘bucket’ paradigm
  Requires S3Guard (Apache open source) to achieve consistency
  Use protocol s3a://<bucket name>/<filename>
• Microsoft Azure Data Lake Store (ADLS)
  ‘Feels’ more like a normal (POSIX) file system
  Use protocol adl://<account name>.azuredatalakestore.net/<directory>/<filename>
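As a small illustration of how these storage URIs break down, the sketch below splits a URI into scheme, container and object path using only the Python standard library. The bucket, account and file names are made up for the example, and the ADLS host form shown is an assumption based on the Gen1 naming convention.

```python
from urllib.parse import urlparse


def split_storage_uri(uri):
    """Split a storage URI into (scheme, bucket or account host, object path)."""
    parsed = urlparse(uri)
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


# Hypothetical names, for illustration only.
s3_parts = split_storage_uri("s3a://example-bucket/sales/data1.parquet")
adl_parts = split_storage_uri(
    "adl://exampleaccount.azuredatalakestore.net/sales/data1.parquet"
)

print(s3_parts)
print(adl_parts)
```

The scheme (s3a or adl) is what tells a Hadoop client which filesystem connector to use; the rest of the path looks the same to the compute engine.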
22. 22© Cloudera, Inc. All rights reserved.
Compute over storage
Compute: Spark, Impala, MapReduce, Search, Hive, Pig
Storage: Kudu, HBase
Filesystem: HDFS, S3, ADLS
23. 23© Cloudera, Inc. All rights reserved.
Schema on write or ‘structured data’
1. Define schema
2. Create table(s)
3. Map known fields
4. Discard unknown fields
24. 24© Cloudera, Inc. All rights reserved.
Schema on read or ‘unstructured data’
1. Write whole record(s) to filesystem (compressed)
2. Register schema with metastore
3. Query engine applies schema to data
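The three schema-on-read steps can be sketched with the standard library. This is a minimal illustration, not how Hive or Impala implement it: the in-memory gzip blob stands in for a compressed file, the `SCHEMA` dictionary stands in for a metastore entry, and all record contents are invented.

```python
import gzip
import io
import json

# Hypothetical raw records; fields vary from record to record.
RAW_RECORDS = [
    {"id": 1, "country": "US", "amount": "19.99", "note": "gift"},
    {"id": 2, "country": "EU", "amount": "5.00"},
]


def write_raw(records):
    """Step 1: write whole records, compressed, without deciding on a schema."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for record in records:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
    return buffer.getvalue()


# Step 2: a schema registered after the fact (stand-in for a metastore entry).
SCHEMA = {"id": int, "amount": float}


def read_with_schema(blob, schema):
    """Step 3: apply the schema while reading; absent fields become None."""
    rows = []
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as gz:
        for line in gz:
            record = json.loads(line)
            rows.append(
                {col: cast(record[col]) if col in record else None
                 for col, cast in schema.items()}
            )
    return rows


blob = write_raw(RAW_RECORDS)
typed_rows = read_with_schema(blob, SCHEMA)
notes = read_with_schema(blob, {"note": str})  # a second schema over the same bytes
print(typed_rows)
print(notes)
```

Because nothing was discarded at write time, a different schema can later be applied to the same stored bytes, which is the main contrast with schema on write.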
25. 25© Cloudera, Inc. All rights reserved.
Popular file format options
XML, JSON files
• Can’t be both split and compressed
Text/delimited/CSV/JSON records
• Usable everywhere
• Schema on read
• Poor performance, poor compression
Avro
• Contains schema, but also allows schema on read
• Usable inside and outside of Hadoop
Parquet
• Columnar, splittable, query performance benefits, excellent compression
• Supports schema evolution (adding columns)
• Skips columns well during scans
ORC (not supported by Cloudera, HDP Hive only)
• Similar to Parquet, with higher compression but poorer data skipping
• Hortonworks working on ACID transactions, secondary indexes

File type                        Example size
Uncompressed CSV                 1.8 GB
Avro                             1.5 GB
Avro w/ snappy compression       750 MB
Parquet w/ snappy compression    300 MB
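To make "skips columns well during scans" concrete, here is a toy comparison of a row layout and a columnar layout. It only counts how many values each layout has to touch to sum one column; it does not read Parquet or Avro files, and the rows are invented.

```python
# Hypothetical sample rows.
ROWS = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "EU", "amount": 7.5},
    {"id": 3, "country": "US", "amount": 2.5},
]


def to_columnar(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(rows, column):
    """A row store has to read every field of every record."""
    total, values_touched = 0.0, 0
    for row in rows:
        values_touched += len(row)
        total += row[column]
    return total, values_touched


def sum_columnar_layout(columns, column):
    """A column store reads only the requested column."""
    values = columns[column]
    return sum(values), len(values)


row_result = sum_row_layout(ROWS, "amount")
col_result = sum_columnar_layout(to_columnar(ROWS), "amount")
print(row_result, col_result)
```

Both layouts return the same total, but the columnar one touches a third of the values here; on wide analytic tables the gap grows with the number of columns.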
26. 26© Cloudera, Inc. All rights reserved.
Raw and formatted data copies
• Keep the raw version if there is a chance that information
will be lost in translation
• Use Columnar storage on formatted data to improve analytic
performance immensely
• Think about a metadata tagging policy (e.g. Cloudera
Navigator) to assist with Data stewardship
27. 27© Cloudera, Inc. All rights reserved.
Big data pipelines
Data ingestion: Capture, Stream, Move
Data engineering: Cleanse, Conform, Transform, Enrich
Data stewardship: Store, Secure, Govern, Tag
Data science: Model, Score, Enrich, Predict
Data analytics: BI, Online, APIs
29. 29© Cloudera, Inc. All rights reserved.
Data lake to a data hub
• Comprehensive, planned and enforced data hierarchy
• Carefully administered versioning and retention policies
• Comprehensive, unified security, governance and
lineage
• Encourage and support metadata
• Establish standards for data, metadata and analytic
models
• Maximize reuse of data without making copies
• Balanced with security and performance concerns – don’t be an
ideologue!
• Plan staffing around new roles
30. 30© Cloudera, Inc. All rights reserved.
Big data fundamentals
Data integration
Optimizing for data ingestion with volume, velocity and variety
31. 31© Cloudera, Inc. All rights reserved.
Apache Flume
[Diagram: tiers of Flume Agents that filter, transform, compress and encrypt data on its way to HDFS]
• Pre-process data before storing
  • Such as transform, scrub or enrich
• Store in any format
  • Text, compressed, binary, or custom sink
• Collect data as it is produced
  • Files, syslogs, stdout or custom source
• Process in place
  • Such as encrypt or compress
• Write in parallel
  • Scalable throughput
32. 32© Cloudera, Inc. All rights reserved.
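Apache Kafka
[Diagram: Producers write to TopicA, whose partitions are spread across three brokers; ConsumerA and ConsumerB form a Consumer Group, alongside a standalone Consumer]
Broker1: TopicA-Partition0
Broker2: TopicA-Partition1
Broker3: TopicA-Partition2
Producers push to Kafka
Consumers pull from Kafka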
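The partition and consumer-group ideas on this slide can be modelled without a Kafka client. The sketch below is a teaching aid only: it uses `zlib.crc32` as a stand-in hash (Kafka's own default partitioner uses a different hash function), and the round-robin assignment is one simple strategy, not necessarily the one a real consumer group would pick.

```python
import zlib

NUM_PARTITIONS = 3  # TopicA has partitions 0, 1 and 2 on the slide


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a message key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Spread partitions over the members of one consumer group, round robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment


group_assignment = assign_partitions(["ConsumerA", "ConsumerB"])
print(group_assignment)
print(partition_for("customer-42"), partition_for("customer-42"))
```

Within a group each partition is read by exactly one member, which is how a group shares the load; ordering is only preserved per partition, so keyed messages stay in order for their key.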
33. 33© Cloudera, Inc. All rights reserved.
Kafka redundancy
Broker1
TopicA-Partition0
TopicA-Partition1-Replica
TopicA-Partition2-Replica
Broker2
TopicA-Partition1
TopicA-Partition0-Replica
TopicA-Partition2-Replica
Broker3
TopicA-Partition2
TopicA-Partition0-Replica
TopicA-Partition1-Replica
34. 34© Cloudera, Inc. All rights reserved.
Apache Sqoop
RDBMS
HDFS
▪ Rapidly moves large amounts of data
between relational databases and HDFS
– Import tables (or partial tables)
from an RDBMS into HDFS
– Export data from HDFS to a database table
▪ Uses JDBC to connect to the database
– Works with virtually all standard RDBMSs
▪ Custom “connectors” for some RDBMSs provide much higher throughput
– Available for certain databases, such as Teradata and Oracle
35. 35© Cloudera, Inc. All rights reserved.
Big data fundamentals
Data engineering
Optimizing for parallel processing of big data with minimum code
38. 38© Cloudera, Inc. All rights reserved.
Resilient Distributed Dataset (RDD)
An RDD is an immutable, distributed collection of data elements, partitioned
across the nodes in your cluster, that can be operated on in parallel through an API that
offers transformations and actions.
map (function)
filter (predicate)
sortBy (function)
join (RDD2)
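The difference between transformations and actions can be shown with a toy class. `ToyRDD` below is a hypothetical teaching model written for this note, not Spark's RDD class: transformations only record what should happen and return a new immutable object, and nothing runs until the `collect` action is called.

```python
class ToyRDD:
    """Immutable, lazily evaluated, partitioned collection (teaching model only)."""

    def __init__(self, partitions, ops=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: returns a new ToyRDD, does no work yet.
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new ToyRDD, does no work yet.
        return ToyRDD(self._partitions, self._ops + (("filter", predicate),))

    def collect(self):
        # Action: run the recorded operations, one partition at a time.
        # In a real cluster each partition could be processed on a different node.
        results = []
        for partition in self._partitions:
            items = iter(partition)
            for kind, fn in self._ops:
                items = map(fn, items) if kind == "map" else filter(fn, items)
            results.extend(items)
        return results


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])          # two partitions
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())
print(numbers.collect())  # the original collection is unchanged
```

Recording the operations instead of running them is what lets an engine plan the whole chain at once and recompute a lost partition from its lineage.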
39. 39© Cloudera, Inc. All rights reserved.
Apache Spark
[Diagram: lineage graph of RDDA to RDDG. One branch runs through RDDA, RDDB and RDDC using map and groupBy; a second branch runs through RDDD, RDDE and RDDF using filter and map; a join combines the two branches into RDDG]
40. 40© Cloudera, Inc. All rights reserved.
Spark stages
[Diagram: the same lineage graph, shown without RDDG and the join]
41. 41© Cloudera, Inc. All rights reserved.
Spark stages
[Diagram: the full lineage graph, including the join into RDDG]
42. 42© Cloudera, Inc. All rights reserved.
Spark caching
[Diagram: the full lineage graph, including the join into RDDG]
43. 43© Cloudera, Inc. All rights reserved.
Evolution of the RDD API
DataFrame (Spark 1.3)
• Untyped, for R and Python
• Adds concept of ‘Schema’ to describe the data
• Uses RDDs underneath
• Allows Spark engine to perform some optimizations
• Avoids use of Java serialization, uses off heap storage
• Required different API than RDDs
RDD (Spark 1.0)
• Can be strongly typed in Java, Scala
• Uses RDDs underneath
• Catch compile-time errors
Dataset (Spark 2.x)
• Unified API
• Typed and untyped
45. 45© Cloudera, Inc. All rights reserved.
Other DAG/streaming processors
(not supported by Cloudera)
46. 46© Cloudera, Inc. All rights reserved.
Spark ecosystem
Libraries: Spark SQL, Spark Streaming, Spark ML, GraphX
Engine: Spark core
Cluster managers: Standalone, Mesos (not included in CDH), YARN
47. 47© Cloudera, Inc. All rights reserved.
Spark SQL
+ Static typing (optional)
+ Storage and processing efficiencies
48. 48© Cloudera, Inc. All rights reserved.
ETL into EDW
Data
sources
ETL
EDW
Archive
Data
marts
Canned
reports
Dashboards/
analytic
applications
Non-SQL
workloads
Self-service
BI/ad hoc
EDW
49. 49© Cloudera, Inc. All rights reserved.
EL-T into EDW
Data
sources
EL
EDW
Archive
Data
marts
Canned
reports
Dashboards/
analytic
applications
Non-SQL
workloads
Self-service
BI/ad hoc
T
50. 50© Cloudera, Inc. All rights reserved.
Modern data warehouse landscape
Data
sources
Analytic
database
Operational
database
Data Science &
engineering
Shared data
layer
Modern Data Platform
Fixed
reports
Dashboards/
analytic applications
Non-SQL
workloads
Self-service
BI/ad hoc
Flexible
reporting
EDW
51. 51© Cloudera, Inc. All rights reserved.
Cloudera’s featured data engineering partners
Hadoop Native Solution
52. 52© Cloudera, Inc. All rights reserved.
Big data fundamentals
Data analytics
Optimizing the engine to match the use case
53. 53© Cloudera, Inc. All rights reserved.
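Apache Hive
[Diagram: Hive architecture]
Clients: Beeline CLI, or JDBC/ODBC applications, connecting through the Thrift service
HiveServer2: a Driver, Compiler and Executor for each session (SessionA, SessionB)
Hive Metastore: Location, Schema, SerDe and File format for each table
Storage: HDFS, BLOB, other storage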
54. 54© Cloudera, Inc. All rights reserved.
Apache Hive
✓ Spins up processes under the control of YARN
- shares resources well on the cluster
- but there is a lot of overhead to create these processes
✓ Can handle the failure of a machine during the query
- but recovery takes many seconds
✓ Will overflow join data to HDFS
- can handle very large joins
- but HDFS writes data 3 times, so this takes time
“Don’t forget who won the race, Bucko!”
Hive on Spark (Cloudera, MapR, Databricks)
✓ Improves speed due to efficiencies of Spark
Live Long and Process (Hortonworks)
✓ Improves speed by using pre-allocated JVMs w/ caching
Presto (Facebook)
✓ Improves speed by optimizing data transfers for SQL and using data streaming instead of HDFS for intermediate data
But all of these solutions are still JVM based
55. 55© Cloudera, Inc. All rights reserved.
Apache Impala
✓ Written in C++
- avoids issues of the JVM
✓ Uses the Hive metastore
- better integration for security and administration
✓ Uses pre-allocated processes on worker nodes
- no process spin up time
- but still builds an execution plan for each query
✓ Employs algorithms from MPP databases
But I left you
in the dust at
the starting
line, Grandpa!
If a machine fails during a query that only takes 1 second to run, you will just retry the query.
Adopted by:
(the fastest of the antelopes)
56. 56© Cloudera, Inc. All rights reserved.
So which engine should I choose?
"If the only tool you have is a hammer, you tend to
see every problem as a nail."
- Abraham Maslow
Psychologist
Author of ‘Maslow’s Hierarchy of Needs’
Compute: Spark, Impala, MapReduce, Search, Hive, Pig
Storage: Kudu, HBase
Filesystem: HDFS, S3, ADLS
57. 57© Cloudera, Inc. All rights reserved.
Other SQL engines
LLAPStinger.next
CubeHive ++
aka Live Long And Process
For JSON lovers
Tied to proprietary front/backLayer over HBaseSQL engines ‘from scratch’
Low Latency Analytical Processing
(not supported by Cloudera)
IBM Big SQL
OLAP
58. 58© Cloudera, Inc. All rights reserved.
How to interpret benchmark tests
Standard test? How many of the queries were
run?
What is the criterion for excluding a query?
Single-user or multi-user? Data size?
Allow modifications to the queries?
"There are three kinds of lies:
lies, damned lies, and statistics."
-Benjamin Disraeli
Prime Minister of Britain
59. 59© Cloudera, Inc. All rights reserved.
Big data fundamentals
Life after lambda architectures and IoT
Optimizing for time series and changing data
60. 60© Cloudera, Inc. All rights reserved.
Updates or analytics
using
Analytics(Scans)
Online (Random Access)slow
slowfast
fast
(but not both at the same time)
Write once, read many. No updates, but can append (sort of)
Optimized for batch inserts and scans
Read, write, update individual rows
Optimized row-based access, sparse columns
61. 61© Cloudera, Inc. All rights reserved.
Lambda architectures
(named for the simple shape)
62. 62© Cloudera, Inc. All rights reserved.
Lambda architectures
(not so simple in practice)
Source: http://horicky.blogspot.com/2014/08/lambda-architecture-principles.html
63. 63© Cloudera, Inc. All rights reserved.
Kudu design goals
using
Analytics(Scans)
Online (Random Access)slow
slowfast
fast
High throughput for big scans
Goal: Close to Parquet on HDFS
Low-latency for short accesses (primary key indexes
and quorum design)
Goal: 1ms read/write on SSD
Database-like semantics (initially single-row ACID)
Relational data model
SQL query
“NoSQL” style scan/insert/update (Java client)
64. 64© Cloudera, Inc. All rights reserved.
Why are updates important?
Right to forget
ETL mistakes/corrections
Analytic enrichment
66. 66© Cloudera, Inc. All rights reserved.
Kudu use cases
Kudu is best for use cases requiring a
simultaneous combination of sequential and
random reads and writes
● Time series data
○ Examples: Stream market data; fraud detection
and prevention; risk monitoring
○ Workload: Inserts, updates, scans, lookups
● Machine data analysis
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups
67. 67© Cloudera, Inc. All rights reserved.
Big data fundamentals
Data science
Optimizing to detect complex patterns over time
69. 69© Cloudera, Inc. All rights reserved.
Data science is a big data problem
“It’s not who has the best algorithm that wins. It’s
who has the most data.”
Banko and Brill, 2001
70. 70© Cloudera, Inc. All rights reserved.
Notebooks
What was our revenue last year?
RDBMS
$14,325,874,321.07
What will our revenue be next year?
• Assumptions
• Algorithms
• Source Data
• Methodology
Your code tells a story
• Tell it with pictures & results
• Allow someone to re-run the numbers
• Pass it to someone who may use it as
the basis for a new/different story
71. 71© Cloudera, Inc. All rights reserved.
Notebook challenges
Access
For sensitive data, secure clusters are
difficult to access. And IT typically doesn’t
want random packages installed on a secure
cluster.
Popular open source tools don’t easily
connect to these environments, or always
support Hadoop data formats.
Scale
Laptops rarely have capacity for
medium, let alone big data. This leads
to a lot of sampling.
Popular frameworks don’t easily
parallelize on a cluster. Typically code
has to get rewritten for production.
Developer Experience
Notebooks, while awesome, don’t easily
support virtual environment and
dependency management, especially for
teams. This makes sharing and
reproducibility hard.
Notebooks are also challenging to “put
into production.”
72. 72© Cloudera, Inc. All rights reserved.
‘Dependency hell’
Or ’I am my own Grandpa’
X (1.0.0)
Y (1.0.0)
MyApp
X (1.0.0)
Y (1.0.0)
MyApp
X (1.1.0)
Upgrade
Dependency Graph for Hadoop Java Client
www.visioneye.com
73. 73© Cloudera, Inc. All rights reserved.
Cloudera Data Science Workbench
Team-based
R, Python, Scala
SDLC
Secure
Containerized
Integrated into the cluster
74. 74© Cloudera, Inc. All rights reserved.
The importance of an open ecosystem
Open ecosystem Black box
75. 75© Cloudera, Inc. All rights reserved.
Containers
[Diagram: two stacks side by side]
VM: Hardware, Host OS, Hypervisor (optional), then one Guest OS per VM, each with its own Libs and app (AppA1, AppA2, AppB)
Container: Hardware, Host OS, Container Daemon, then Libs shared by each group of apps (AppA1 to AppA3; AppB1 to AppB4)
Containers
• Use less memory than VMs
• You get to use more of the machine you pay for
• Provide isolation between apps
• Can share libraries between similar apps
• Provide abstraction of the OS, not of the HW
• Get you out of ‘Dependency Hell’ against other applications
76. 76© Cloudera, Inc. All rights reserved.
Scaling data science for big data
Master(s)
Workers
Gateway(s)
Name Node
YARN
CDSW
CDSW
CDSW Session
CDSW Session
CDSW Session
CDSW Session
CDSW Session
Data Node
YARN Resource
Pool(s)
Data Node
YARN Resource
Pool(s)
Data Node
YARN Resource
Pool(s)
Data Node
YARN Resource
Pool(s)
Web browser
login
Start session
CDSW Session
CDSW Session
Kubernetes
77. 77© Cloudera, Inc. All rights reserved.
Machine learning pipeline in Spark
Training: Load learning data frame → Clean/process data → Extract and transform features → Vectorize features → Fit and access model → Save model
Testing: Load test data frame → Test model → Test results
Scoring: Load scoring data frame → Score data → Scoring results → Save results
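The stages on this slide can be walked through end to end with a deliberately tiny model. The sketch below uses plain Python rather than Spark ML so that it runs anywhere; the data, the single `amount` feature and the midpoint-threshold "model" are all invented for illustration and are not what the deck's pipeline uses.

```python
import json


def clean(rows):
    """Clean/process: drop rows whose feature value is missing."""
    return [r for r in rows if r.get("amount") not in (None, "")]


def vectorize(rows):
    """Extract and vectorize: one numeric feature per row."""
    return [[float(r["amount"])] for r in rows]


def fit(vectors, labels):
    """Fit: threshold halfway between the mean of each class."""
    pos = [v[0] for v, y in zip(vectors, labels) if y == 1]
    neg = [v[0] for v, y in zip(vectors, labels) if y == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


def predict(model, vectors):
    return [1 if v[0] > model["threshold"] else 0 for v in vectors]


# Training: load, clean, vectorize, fit, save (hypothetical data).
train = [
    {"amount": "10", "label": 0}, {"amount": "20", "label": 0},
    {"amount": "", "label": 0},
    {"amount": "80", "label": 1}, {"amount": "90", "label": 1},
]
train = clean(train)
model = fit(vectorize(train), [r["label"] for r in train])
saved_model = json.dumps(model)  # stand-in for "save model"

# Testing: load test data, reload the model, measure accuracy.
test_rows = [{"amount": "15", "label": 0}, {"amount": "85", "label": 1}]
restored = json.loads(saved_model)
test_predictions = predict(restored, vectorize(test_rows))
accuracy = sum(
    p == r["label"] for p, r in zip(test_predictions, test_rows)
) / len(test_rows)

# Scoring: apply the saved model to new, unlabeled data.
scores = predict(restored, vectorize([{"amount": "70"}, {"amount": "5"}]))
print(model, accuracy, scores)
```

The point of saving and reloading the model is that testing and scoring can run later, on other data, with exactly the same feature steps as training.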
78. 78© Cloudera, Inc. All rights reserved.
Big data fundamentals
Big Data in the Clouds
Optimizing for a variety of operational choices
79. 79© Cloudera, Inc. All rights reserved.
My organization
is moving to the cloud,
why should we
consider ?
80. 80© Cloudera, Inc. All rights reserved.
Traditional applications
[Diagram: five separate application silos (Data Exploration; SQL & BI Analytics; Operational Real-Time DB; ETL & Data Processing; Custom Functions), each with its own copy of the following stack]
STORAGE
SECURITY
GOVERNANCE
WORKLOAD MGMT
INGEST & REPLICATION
DATA CATALOG
Many data silos, each with its own proprietary tools and infrastructure
Different vendors, products, and services on-premises versus in cloud
A fragmented approach is difficult, expensive, and risky
81. 81© Cloudera, Inc. All rights reserved.
Multiple compute engines, same data
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
82. 82© Cloudera, Inc. All rights reserved.
Common metadata
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
METADATA
SERVICES
83. 83© Cloudera, Inc. All rights reserved.
Data Silos 2.0
DW Cluster
DW Service
Source Data
A B
D
C
84. 84© Cloudera, Inc. All rights reserved.
Deployed anywhere
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
PRIVATE CLOUD, BARE METAL, INFRASTRUCTURE SERVICES
CORE
SERVICES
STORAGE
SERVICES
METADATA
SERVICES
DEPLOYMENT
OPTIONS
86. 86© Cloudera, Inc. All rights reserved.
How do we deal with hybrid clouds?
• Shared catalog
• Unified security
• Consistent governance
• Easy workload management
• Flexible ingest and replication
87. 87© Cloudera, Inc. All rights reserved.
Cloudera Enterprise
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
PRIVATE CLOUD, BARE METAL, INFRASTRUCTURE SERVICES
DEPLOYMENT
OPTIONS
The modern platform for machine learning and analytics optimized for the cloud
88. 88© Cloudera, Inc. All rights reserved.
Deployment & management options
Bare Metal Private Cloud Cloud IaaS Cloud PaaS
Applications Applications Applications Applications
Clusters Clusters Clusters Clusters
Operating System Operating System Operating System Operating System
Network Network Network Network
Storage Storage Storage Storage
Servers Servers Servers Servers
Customer managed Vendor managed
Manager Director Altus
90. 90© Cloudera, Inc. All rights reserved.
Altus service architecture
● Runs in Cloudera’s secured and monitored environment
● Manages CDH clusters in customer cloud account
● Customer data does not pass* to Cloudera
* Workload Analytics requires opt-in log data transfer to Cloudera
91. 91© Cloudera, Inc. All rights reserved.
Keep your encryption keys outside of the cloud
92. 92© Cloudera, Inc. All rights reserved.
Cloudera usage based pricing option
Pay per use
• Cheaper for transient clusters
• Cheaper for small machine types
• Pay as you go or discounted credits
Node based pricing
• Cheaper for persistent or long-running clusters
• Volume & enterprise discounts
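The trade-off between the two pricing options comes down to a break-even number of hours. The arithmetic is sketched below; both prices are hypothetical placeholders chosen to make the numbers round, not Cloudera list prices.

```python
def pay_per_use_cost(hourly_rate, hours_used):
    """Cost of one node billed only for the hours it runs."""
    return hourly_rate * hours_used


def break_even_hours(hourly_rate, monthly_node_price):
    """Hours per month above which node-based pricing becomes cheaper."""
    return monthly_node_price / hourly_rate


HOURLY_RATE = 0.50           # hypothetical pay-per-use price per node-hour
MONTHLY_NODE_PRICE = 200.0   # hypothetical flat price per node per month

threshold = break_even_hours(HOURLY_RATE, MONTHLY_NODE_PRICE)
transient = pay_per_use_cost(HOURLY_RATE, hours_used=100)    # nightly batch cluster
persistent = pay_per_use_cost(HOURLY_RATE, hours_used=720)   # runs all month

print(threshold, transient, persistent)
```

With these made-up prices a node used fewer than 400 hours a month is cheaper on pay per use, while an always-on node costs more per use than the flat node price, which matches the slide's transient versus persistent split.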
93. 93© Cloudera, Inc. All rights reserved.
Hot-Warm-Cold Data
Store partitions from the same table in different storage types
m4.4xlarge m4.4xlarge i2.2xlarge
serve serve
preload
serve preloadserve
d2.4xlarge
serve
0 1 3 14
Days of ‘Hot’ Data
AWS Instance premium – 200% AWS Instance premium – 320%
preload
S3
S3
EBS
S3 S3
94. 94© Cloudera, Inc. All rights reserved.
BDR to Blob Storage
Minimum Storage Cost
No Backup Cluster Costs
(servers or subscription)
RPO unaffected
Cloud provider manages
regional locality
✗RTO longer
user
sales
contracts
North America
.snapshots
snap 4-21-17
Contract1.txt
Contract2.txt
Contract1.txt
Contract2.txt
AWS S3
ADLS
95. 95© Cloudera, Inc. All rights reserved.
Big data fundamentals
Cybersecurity
Optimizing to detect complex attacks over longer periods of time
96. 96© Cloudera, Inc. All rights reserved.
Cybersecurity is a big data problem
Explosion of data
Popular cyber platforms cannot cost-effectively scale to the volume and variety of modern data
Limited enterprise visibility
Only a partial view of the enterprise limits analytics and slows investigations
Limited analytic processing
Difficult to deploy advanced machine learning detection capabilities
[Chart labels: data access 1%, 50%, 100%; data volume 10 PB, 1 PB, 1 TB; rule of the form IF (X) AND (Y) THEN (Z); User, Network and Endpoint data over time; archived data and emerging data]
97. 97© Cloudera, Inc. All rights reserved.
Open Data Models:
Enterprise Visibility, Support For Multiple Workloads
Endpoint User
Network
DIVERSE DATA SOURCES SINGLE ACCESS
Source: Momentum Partners Cybersecurity Snapshot April 2016
98. 98© Cloudera, Inc. All rights reserved.
The value of Apache Spot
Detect advanced threats faster with a full complement of analytic frameworks for all cyber workloads
Faster time to incident investigation and response with comprehensive enterprise visibility
Change the economics of cybersecurity with an open source platform that supports multiple LOB workloads
99. 99© Cloudera, Inc. All rights reserved.
Many applications on one shared data set and architecture
Packaged
Visualization & machine learning applications can share a common data set & infrastructure
Open source
Spot community is building out machine learning (e.g. network threat detection)
Custom
Build custom applications & analytics using Cloudera without having to buy new infrastructure
100. 100© Cloudera, Inc. All rights reserved.
But I already have Splunk …
Go Beyond Splunk’s SPL
• Share enriched data across
multiple analytic processing
engines
• Simple search, SQL, Python,
R, Scala
Data flexibility
• Faster, more agile, full-
fidelity data acquisition
• Data portability: Open data
model and open storage
Cost-effective scalability
• Elastic scale on-prem or in
the cloud
• Cloud-native pay-per-use and
transience
• Proven at big data scale
Hybrid
• Runs across multi-clouds &
on-prem
• Multi-storage over S3, HDFS,
Kudu, Isilon, etc
101. 101© Cloudera, Inc. All rights reserved.
Big data fundamentals
Management
Optimizing for reliable uptime and optimal resource utilization
102. 102© Cloudera, Inc. All rights reserved.
Big data and the administrator
Get up and running
Monitor and maintain
Troubleshoot and resolve
Grow and adapt
103. 103© Cloudera, Inc. All rights reserved.
Get up and running
Cloudera manager
service
Cloudera archives
Cloudera manager
agent
Packages
Templates
RoleC
RoleB
RoleA
Cluster member
104. 104© Cloudera, Inc. All rights reserved.
Monitor and maintain
Services
Hosts
Applications Resources
105. 105© Cloudera, Inc. All rights reserved.
Troubleshoot and resolve
Add your own customized
charts
See performance and resource utilization at a glance
Select historical time period for charts
106. 106© Cloudera, Inc. All rights reserved.
Grow and adapt
• Utilization by tenant
• Project future needs
• Prioritize pre-emption
107. 107© Cloudera, Inc. All rights reserved.
Backup and disaster recovery (BDR)
Distributed (uses distcp)
Work done by target cluster
Secure (can have different
encryption keys on each side,
encrypted in motion)
Bandwidth Limited (optional)
user
sales
contracts
North America
.snapshots
EMEA
snap 4-21-17
Contract1.txt
Contract2.txt
Contract1.txt
Contract2.txt
Contract3.txt
user
sales
contracts
North America
EMEA
Contract1.txt
Contract2.txt
Contract3.txt
.snapshots
snap 4-21-17
Contract3.txt
Federated clusters
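One way to read the snapshot diagram is that only the files that changed since the last snapshot need to be copied to the other cluster. The sketch below computes such a difference between two listings. It is an illustration of the idea, not BDR or distcp code; the version strings are a made-up stand-in for checksums or modification times.

```python
def snapshot_diff(snapshot, current):
    """Compare two {path: version} listings and report what changed."""
    created = sorted(set(current) - set(snapshot))
    deleted = sorted(set(snapshot) - set(current))
    modified = sorted(
        path for path in set(snapshot) & set(current)
        if snapshot[path] != current[path]
    )
    return {"created": created, "deleted": deleted, "modified": modified}


# File names follow the slide; the "v1" versions are hypothetical.
snap_4_21_17 = {"Contract1.txt": "v1", "Contract2.txt": "v1"}
live_directory = {
    "Contract1.txt": "v1",
    "Contract2.txt": "v1",
    "Contract3.txt": "v1",
}

to_replicate = snapshot_diff(snap_4_21_17, live_directory)
print(to_replicate)
```

Here only Contract3.txt would need to travel, which is why incremental replication can stay within a limited bandwidth budget.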
108. 108© Cloudera, Inc. All rights reserved.
Big data fundamentals
Information security
Optimizing for minimum risk
109. 109© Cloudera, Inc. All rights reserved.
Big data security
Authentication, authorization, audit and compliance
Perimeter
Guarding access to the cluster itself
Technical concepts: Authentication, network isolation
Cloudera Manager
Access
Defining what users and applications can do with data
Technical concepts: Permissions, authorization
Apache Sentry & RecordService
Visibility
Reporting on where data came from and how it’s being used
Technical concepts: Auditing, lineage
Cloudera Navigator
Data
Protecting data in the cluster from unauthorized visibility
Technical concepts: Encryption, tokenization, data masking
Navigator Encrypt & Key Trustee | Partners
110. 110© Cloudera, Inc. All rights reserved.
Active Directory and Kerberos
Perimeter
Active Directory
• Manages users, groups, and services
• Provides username/password authentication
• Group membership determines service access
Kerberos
• Trusted, standard third-party authentication
• Authenticated users receive “tickets”
• Tickets grant access to services
1. User authenticates to AD
2. Authenticated user gets Kerberos ticket
3. Ticket grants access to services, e.g. Impala
User [ssmith]
Password [*****]
111. 111© Cloudera, Inc. All rights reserved.
Apache Sentry
• Apache Sentry is an authorization module for Hadoop
• Apache-licensed project
• Supported by multiple vendors
• Used in many industries
• Used by Hive, Impala, Search & Spark
• Syncs with HDFS ACLs
• Eases administration through role-based access control (RBAC)
Access
Spark Bindings
Spark
112. 112© Cloudera, Inc. All rights reserved.
Centralized role-based access control
Sentry permission: Read access to Transactions.Date… where Country = US
Sentry permission: Read access to Customers.CustomerID… where Country = US
Sentry role: U.S. Customer Transaction Analysis
Group: Tier 1 Customer Support Reps (Sam Smith)
Group: Tier 1 Broker Analysts (Martha Jones)
Cust. ID SSN Phone Country
6758493 329-44-9847 US
09:22:03 16-Feb-2015 344-22-9876 EU
5768459 585-11-2345 US
Date/Time Cust. ID Trade Country
11:33:01 16-Feb-2015 Sell US
09:22:03 16-Feb-2015 344-22-9876 EU
13:45:24 16-Feb-2015 Buy US
Access
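The chain on this slide (user, group, role, permission) can be modeled in a few lines. The following is a toy model in plain Python, not Sentry's API or policy syntax: the identifiers, the sample rows, and the way the "Country = US" condition is applied are all assumptions made for illustration. Because the slide elides the full column lists ("Transactions.Date…"), only the columns it names are granted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    table: str
    columns: frozenset  # columns the permission exposes
    country: str        # row condition from the slide: Country = US


# Hypothetical policy data mirroring the slide's role and groups.
ROLES = {
    "us_customer_transaction_analysis": [
        Permission("Transactions", frozenset({"Date"}), "US"),
        Permission("Customers", frozenset({"CustomerID"}), "US"),
    ],
}
GROUP_ROLES = {
    "tier1_customer_support_reps": ["us_customer_transaction_analysis"],
    "tier1_broker_analysts": ["us_customer_transaction_analysis"],
}
USER_GROUPS = {
    "Sam Smith": ["tier1_customer_support_reps"],
    "Martha Jones": ["tier1_broker_analysts"],
}


def permissions_for(user):
    """Resolve user -> groups -> roles -> permissions, without duplicates."""
    perms = []
    for group in USER_GROUPS.get(user, []):
        for role in GROUP_ROLES.get(group, []):
            for perm in ROLES.get(role, []):
                if perm not in perms:
                    perms.append(perm)
    return perms


def read(user, table, rows):
    """Return only the rows and columns the user's permissions allow."""
    visible = []
    for perm in permissions_for(user):
        if perm.table != table:
            continue
        for row in rows:
            if row.get("Country") == perm.country:
                visible.append(
                    {k: v for k, v in row.items() if k in perm.columns}
                )
    return visible


# Hypothetical sample rows (not the values from the slide).
customers = [
    {"CustomerID": 1001, "SSN": "000-00-0001", "Country": "US"},
    {"CustomerID": 1002, "SSN": "000-00-0002", "Country": "EU"},
]
print(read("Sam Smith", "Customers", customers))
print(read("Unknown User", "Customers", customers))
```

Granting permissions to roles and roles to groups, rather than to individual users, is what keeps administration manageable: adding an analyst means adding them to a group, with no policy change.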
113. 113© Cloudera, Inc. All rights reserved.
Auditing
Track, understand, and
protect access to
sensitive data
• Auditing needs to happen automatically
• Audit logs need to be immutable
• Need to be able to drill down on events
to the original events/data
Visibility
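One common way to make an audit log tamper-evident is to chain entries by hash, so that changing any earlier event invalidates every later entry. The sketch below shows that general technique using only the Python standard library. It is not a description of how Cloudera Navigator stores audit events, and the event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """Hash the previous entry's hash together with this event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)}
    )


def verify(log):
    """Recompute the chain; return False if any entry was altered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_event(log, {"user": "ssmith", "action": "SELECT", "table": "customers"})
append_event(log, {"user": "mjones", "action": "SELECT", "table": "transactions"})

ok_before = verify(log)
log[0]["event"]["user"] = "someone_else"  # simulate tampering
ok_after = verify(log)
print(ok_before, ok_after)
```

A chain like this detects modification but does not prevent it; in practice it would be combined with write-once storage or an external copy of the latest hash.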
114. 114© Cloudera, Inc. All rights reserved.
Governance
Faceted search
Natural language
Incremental filters
Drill down links
Visibility
Used to facilitate research and the ability to find groups of similar assets
Jump to
application log
115. 115© Cloudera, Inc. All rights reserved.
Metadata
Automatic collection
• No need to create XML files or
manage manual controls
Complete aggregation
• Full coverage across all platform
components
Simple accessibility
• Integrated user interface with full-
text search
116. 116© Cloudera, Inc. All rights reserved.
Visibility
Enterprise metadata
The foundation for data management and governance
Metadata enables you to put context and meaning to data to
answer the important questions
Technical Managed Custom
Unified metadata repository
Who are the high-value customers?
How do we define that?
How is high value calculated?
Where is customer data stored and used?
Is the data reliable and accurate?
117. 117© Cloudera, Inc. All rights reserved.
Lineage
• Where did the data come from?
• Who ran the process that created
the data?
• What code was used to generate
the values?
• Which files and columns were
used to derive the values?
Visibility
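The questions above map naturally onto a graph: each dataset records its inputs, who produced it, and the code that ran. As a minimal sketch, assuming a hypothetical in-memory lineage record (the dataset names, users, and script names below are invented, and this is not Navigator's data model), "where did the data come from?" becomes a graph traversal:

```python
def upstream(lineage, dataset):
    """Return every dataset that `dataset` was derived from,
    directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        for parent in lineage.get(node, {}).get("inputs", []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage records: inputs, who ran the job, and what code ran.
LINEAGE = {
    "high_value_customers": {
        "inputs": ["customers", "transactions"],
        "run_by": "mjones",
        "code": "score.sql",
    },
    "transactions": {
        "inputs": ["raw_trades"],
        "run_by": "etl",
        "code": "load_trades.py",
    },
}

print(sorted(upstream(LINEAGE, "high_value_customers")))
print(LINEAGE["high_value_customers"]["run_by"])
```

Datasets with no record in the lineage store (here `customers` and `raw_trades`) act as the original sources.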
118. 118© Cloudera, Inc. All rights reserved.
Is it encrypted?
Data written to HDFS ✓
Metadata in RDBMS ✗
Spill-over files ✗
Data
119. 119© Cloudera, Inc. All rights reserved.
Cloudera Navigator Encrypt
Transparent layer between application
and file system
• Compliance-ready
• Massively scalable
• High performance: Optimized for Intel
• Separation of duties
• Key management with Navigator Key
Trustee
Data
120. 120© Cloudera, Inc. All rights reserved.
Cloudera Navigator Key Trustee
“Virtual safe-deposit box” for managing encryption keys and other Hadoop security artifacts
• Separates Keys from Encrypted Data
• Centralized Management with Audit
Controls
• Integration with HSMs
• Roadmap: Management of SSL
certificates, SSH keys, tokens,
passwords, Kerberos Keytab Files,
and more
Data
121. 121© Cloudera, Inc. All rights reserved.
Redacted Log Files
SELECT * FROM customers
WHERE ssn='123-45-6789'
hive.server2.logging.operation.log.location
HUE Saved Queries
Audit Logs
• Credit card numbers
• Social security numbers
• Email addresses
• Server host names / IP
122. 122© Cloudera, Inc. All rights reserved.
Thank you
The modern platform for machine learning and
analytics, optimized for the cloud