Managing Data Analytics in a Hybrid Cloud discusses the challenges of traditional analytics approaches and proposes shared data lakes with dynamic compute clusters. Common challenges include explosive growth in analytics teams, which leads to resource contention, and the duplication of large datasets for every cluster. The proposed approach uses shared object storage to hold unified datasets accessed by multiple ephemeral analytics clusters provisioned on demand. This gives teams independent resources while avoiding duplicate storage costs and improving agility. The document outlines example architectures and the benefits of this shared data lake approach when implemented on a private or public cloud.
Managing Data Analytics in a Hybrid Cloud with Shared Data Lakes
1. Managing Data Analytics in a Hybrid Cloud
Karan Singh
Sr. Solution Architect
Storage & Hyper-Converged Business Unit
Daniel Gilfix
Technical Marketing Manager
Storage & Hyper-Converged Business Unit
2. AGENDA
● CUSTOMER PAIN
● COMMON APPROACHES
● SHARED DATA LAKES
● HOW IT WORKS AND WHERE
● SUMMARY AND NEXT STEPS
4. CUSTOMER PAIN POINTS
EXPLOSIVE GROWTH in data analytics teams and analytic tools.
MULTIPLE TEAMS COMPETING for use of the same big data resources.
CONGESTION in busy analytic clusters, causing frustration and missed SLAs.
Tools in play: Hadoop, SparkSQL, Spark, Hive, MapReduce, Presto, Impala, Kafka, NiFi, etc.
5. RESULTING CUSTOMER CHOICES
#1 Get a bigger cluster for many teams to share.
#2 Give each team its own dedicated cluster, each with its own copy of PBs of data.
#3 Give teams the ability to spin up and spin down clusters that share a common data store.
6. #3: ON-DEMAND ANALYTIC CLUSTERS WITH A SHARED DATA LAKE
HIT SERVICE-LEVEL AGREEMENTS by giving teams their own compute clusters.
ELIMINATE IDLE RESOURCES by right-sizing decoupled compute and storage.
BUY 10s OF PBs INSTEAD OF 100s by sharing data sets across clusters instead of duplicating them.
INCREASE AGILITY with spin-up/spin-down clusters.
7. Red Hat data analytics infrastructure solution
Multi-tenant workload isolation with shared data context. Workloads served: batch jobs (slow and fast), streaming analytics, interactive analytics, and other analytics.
DYNAMIC compute resources and clusters able to meet different SLAs.
UNIFIED single object storage solution feeding analytics jobs.
ELASTIC provisioning and release of the compute resources required by various analytics jobs.
8. BENEFITS - AGILITY AND $$$
● Faster answers through elastic provisioning (via Red Hat OpenStack Platform) on shared data sets
● Fewer roadblocks for empowered users in self-service data labs and clusters
● Private/public cloud versatility through the S3A interface
● Reduced cost and risk from not duplicating and maintaining data sets
● CapEx relief by scaling storage independently of compute
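The S3A interface is what lets the same analytics jobs run against either AWS S3 or a Ceph object store. As a sketch (the endpoint host, port, and credentials below are placeholders, not values from this deck), a Hadoop core-site.xml pointing S3A at a Ceph RADOS Gateway might look like:

```xml
<!-- core-site.xml: point Hadoop's S3A client at a Ceph RGW endpoint.
     Endpoint host and credentials are placeholders for illustration. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://ceph-rgw.example.com:8080</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>EXAMPLE_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>EXAMPLE_SECRET_KEY</value>
  </property>
  <property>
    <!-- Ceph RGW deployments typically use path-style bucket addressing -->
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place, jobs address shared datasets by `s3a://bucket/...` URIs instead of `hdfs://` paths.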
10. GENERATION I ANALYTICS: MONOLITHIC HADOOP STACKS
Analytics vendors provide the analytics software and the single-purpose infrastructure it runs on.
Diagram: three siloed "analytics + infrastructure" stacks.
11. GENERATION II ANALYTICS: ELASTIC COMPUTE AND SHARED STORAGE CLOUDS
Analytics vendors provide the analytics software; Red Hat provides the cloud infrastructure software.
Diagram: a provisioned compute pool via the OpenStack and OpenShift platforms, with shared datasets on Red Hat Ceph Storage.
12. MULTIPLE ANALYTIC CLUSTERS SHARING DATA
Diagram: ingest, ETL, interactive query, and batch query & joins workloads, each served at its own SLA tier (Platinum, Gold, Silver, or Bronze).
ELASTIC COMPUTE RESOURCE POOL: Kafka, Hive/MapReduce, Presto, and Spark compute instances.
SHARED DATA LAKE beneath all of the clusters.
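The spin-up/spin-down model can be sketched with a toy Python model (purely illustrative; the class and method names are invented for this sketch and belong to no real Red Hat API): datasets live exactly once in the shared lake, while compute clusters come and go around them.

```python
# Toy model of the shared-data-lake pattern: many short-lived compute
# clusters read one shared data store instead of each keeping a copy.

class SharedDataLake:
    """Stands in for a Ceph/S3 object store holding the datasets."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data          # stored once, shared by all

    def get(self, key):
        return self._objects[key]

class EphemeralCluster:
    """A spin-up/spin-down compute cluster; it owns no data of its own."""
    def __init__(self, name, sla, lake):
        self.name, self.sla, self.lake = name, sla, lake
        self.running = True

    def run_job(self, key, fn):
        if not self.running:
            raise RuntimeError(f"{self.name} is decommissioned")
        return fn(self.lake.get(key))      # reads shared data in place

    def decommission(self):
        self.running = False               # the data outlives the cluster

lake = SharedDataLake()
lake.put("events/2019", [3, 1, 4, 1, 5])   # one copy, not one per cluster

etl = EphemeralCluster("spark-etl", "Gold", lake)
query = EphemeralCluster("presto-adhoc", "Silver", lake)

total = etl.run_job("events/2019", sum)    # -> 14
rows = query.run_job("events/2019", len)   # -> 5
etl.decommission()                         # compute released, data kept
```

The point of the sketch: decommissioning the ETL cluster frees its compute without touching the dataset the query cluster is still using.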
13. ANALYTIC WORKLOADS JOINING THE INFRASTRUCTURE
Diagram: a storage silo, a bare-metal silo, and virtualization infrastructure on a shared storage SAN (hosting the rest of an enterprise's apps) converge onto Red Hat private cloud infrastructure backed by a Red Hat private cloud object store. VMs today, containers tomorrow.
14. MULTI-TENANT WORKLOAD ISOLATION
With Shared Data Context
Diagram: three analytic clusters, each with its own compute and a local HDFS tmp area:
1. Hadoop cluster on OpenStack VMs
2. Spark cluster in OpenShift containers
3. Spark/Presto cluster on bare-metal RHEL
All three read shared datasets from Red Hat Ceph Storage over the S3A (or S3) interface.
15. COMMON ARCHITECTURAL MODEL: PUBLIC OR PRIVATE CLOUD
PUBLIC CLOUD (AWS): provisioning via AWS EC2; shared datasets on AWS S3.
PRIVATE CLOUD (Red Hat): provisioning via Red Hat® OpenStack Platform; shared datasets on Red Hat® Ceph S3.
The same Hadoop, Presto, and Spark clusters run on top of either.
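One way to see why the two clouds share a common architectural model: from the S3A client's point of view, only the endpoint and addressing style change between AWS S3 and Ceph S3. A small illustrative helper (hypothetical; the Ceph RGW endpoint URL is a placeholder, and only two well-known `fs.s3a.*` properties are shown):

```python
# Hypothetical helper: build the handful of fs.s3a.* settings that
# differ between AWS S3 and a Ceph S3 (RGW) backend. Everything else
# about the analytics stack stays identical.

def s3a_settings(cloud, endpoint=None):
    base = {"fs.s3a.path.style.access": "true"}
    if cloud == "aws":
        # AWS S3: default endpoint, virtual-hosted-style addressing
        base["fs.s3a.path.style.access"] = "false"
        base["fs.s3a.endpoint"] = endpoint or "s3.amazonaws.com"
    elif cloud == "ceph":
        # Ceph RGW: a private endpoint must be supplied; path style is typical
        if endpoint is None:
            raise ValueError("a Ceph RGW endpoint must be supplied")
        base["fs.s3a.endpoint"] = endpoint
    else:
        raise ValueError(f"unknown cloud: {cloud}")
    return base

aws = s3a_settings("aws")
ceph = s3a_settings("ceph", "http://rgw.internal:8080")
```

The two dictionaries differ only in endpoint and addressing style, which is the slide's point: one model, two clouds.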
16. FEATURES AND BENEFITS
MULTIPLE ANALYTIC CLUSTERS
• Enable teams to meet their individual SLAs without
competing for resources.
SHARED DATA SETS
• Eliminate duplicate storage costs for multiple HDFS cluster silos.
• Eliminate OpEx costs and complexity for maintaining multiple copies
of datasets for multiple HDFS cluster silos.
FAST PROVISIONING OF ANALYTIC CLUSTERS
• Unlocks Agility
• Enables Speed to Capability
18. MODERN BIG DATA ANALYTICS PIPELINE
Simplified Example
Stages: data generation, ingest, data science, machine learning, stream processing, transform/merge/join, and data analytics.
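The pipeline stages can be sketched end to end in plain Python (a stand-in only; real pipelines would use NiFi/Kafka for ingest and Spark/Hadoop for the heavy lifting, and the record shapes below are invented):

```python
# Minimal stand-in for the pipeline: each stage is a function, and
# records flow generation -> ingest -> transform/merge/join -> analytics.

def generate():
    # Data generation stage: e.g. sensors or click-streams
    return [{"user": "a", "clicks": 2},
            {"user": "b", "clicks": 4},
            {"user": "a", "clicks": 1}]

def ingest(records):
    # Ingest stage: NiFi/Kafka would buffer and land these in the lake
    return list(records)

def transform(records):
    # Transform/merge/join stage: aggregate clicks per user
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + r["clicks"]
    return totals

def analyze(totals):
    # Analytics stage: report the most active user
    return max(totals, key=totals.get)

top = analyze(transform(ingest(generate())))   # -> "b"
```

In the shared-data-lake design, each of these stages would run on its own ephemeral cluster while reading and writing the same object store.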
19. MODERN BIG DATA ANALYTICS PIPELINE
KEY TERMINOLOGY
• Data generation: sensors, click-stream, transactions, call-detail records
• Ingest: NiFi, Kafka
• Data science: Presto, Impala, SparkSQL
• Machine learning: TensorFlow
• Stream processing: Kafka
• Transform, merge, join: Hadoop, Spark
• Data analytics: Spark, Hadoop
20. TESTED WITH CEPH OBJECT STORE
• Data generation: TPC-DS data sets (structured), logsynth (semi-structured)
• Ingest: bulk load, MapReduce
• Data science: Impala, Presto
• Machine learning: (not tested)
• Stream processing: (not tested)
• Transform, merge, join: SparkSQL, Hive/MapReduce
• Data analytics: SparkSQL, Hive/MapReduce
21. TYPICAL SHARED DATA LAKE PROJECT STAGES
IDENTIFY
• Potential fit?
QUALIFY
• 1-2 day workshop
• Identify questions needing evidence
• Prioritize questions by value
• Design POC architecture
POC OR PILOT
• Answer questions
• Empirical results
• RHT Solution Engineering
• RHT Consulting
DEPLOYMENT
• Phased roll-out
• Red Hat Consulting
23. KEY TAKEAWAYS
PROBLEMS:
• Missed SLAs: large Spark/Hadoop shops suffer missed SLAs due to cluster congestion.
• Excessive CapEx and OpEx due to multi-cluster solutions without shared data.
HOW YOU KNOW IT'S YOU:
• Do you do big data analytics on-premises?
• Do you have multi-PB data sets?
• Do you have multiple Spark/Hadoop clusters?
• Do these Spark/Hadoop clusters need to share data sets?
• Do you also have non-Spark/Hadoop tools that need access to these data sets?
24. ONE CUSTOMER'S UNSOLICITED TESTIMONY (RED HAT CONFIDENTIAL)
“We managed to deliver tremendous value to our organization”:
● Releasing lock on data: moving from HDFS to an open-access object store and opening
the data to more processes and analysis.
● Releasing lock on compute: now we’re able to spin up and decommission compute
power according to customer needs and utilize cloud benefits (including GPU incorporation
in zero time and effort), without worrying about the data.
● Releasing lock on innovation: we can now allow anyone to try to build something new
without the fear of messing things up (data- or cluster-wise). We’ve built an environment that
can tolerate mistakes at all levels (process and data), and by doing so, our developers can be
much more daring.”
25. CUSTOMER SATISFACTION
“I’m delighted to announce that it’s been a few weeks since we’ve launched our Cloudoop*
offering to our customers, and it’s a huge success. The responses from our customers are
very, very positive, and I’m quoting “Big big like!!!”
This shift from the traditional approach is revolutionizing the way we consume and process
our data.”
---- Head of Cloud Infrastructure, government agency
(*Cloudoop is their Spark-as-a-service offering with an S3 backend: Spark from Cloudera, S3 from Ceph.)
26. RESOURCES
Summary-level blogs:
● Breaking down data silos with Red
Hat infrastructure
● Why would companies do this?
● Will mainstream analytics jobs run
directly against a Ceph object store?
● How much slower will they run than
natively on HDFS?
Architect-level blogs:
● What about locality?
● Anatomy of the S3A filesystem client
● To the cloud!
● Storing tables in Ceph object storage
● Comparing with HDFS—TestDFSIO
● Comparing with remote HDFS—Hive
Testbench (SparkSQL)
● Comparing with local HDFS—Hive
Testbench (SparkSQL)
● Comparing with remote HDFS—Hive
Testbench (Impala)
● AI and machine learning workloads