1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Global data management
with Hortonworks DataPlane
Service (DPS)
Abdelkrim Hadjidj – Solution Engineer
FOD Paris Meetup – 24-04-2018
Presenters
Abdelkrim Hadjidj
Solution Engineer, Hortonworks
Organizer of Future of Data Meetup Paris
@ahadjidj
linkedin.com/in/ahadjidj/
medium.com/@abdelkrim.hadjidj
Agenda
→ Global Data Management
→ Data Plane Service (DPS)
→ Data Lifecycle Manager (DLM): Disaster Recovery
→ Demonstration
→ Other DPS Services: DSS, DAS
→ Q & A
Global Data Management
Next Generation Data Problems
Multi-cluster deployments are a reality. As a user, how can I access my data seamlessly?
We need an API-based layer to abstract the underlying complexity. As a user, I can then focus on my tasks and optimize my efficiency.
Cloud strategies are evolving. Hybrid multi-cloud is the future of Big Data deployments.
[Diagram: multiple clusters and sources (labels: Production IT, Production, Business, Discovery, Dev, Test) spanning multi-cloud and hybrid deployments]
Next Generation Data Problems
Business Analysts/Data Scientists
How do we consolidate and combine several
datasets from several platforms?
Data Architects
How do we optimize the performance and the model of
our data reports?
Is there a common layer that can help
build these services?
APIs are a key enabler for these scenarios
Data Ops
How can I spin up new resources quickly and
consistently? How can I share data securely?
Data Steward
How do we apply consistent security policies
and manage our assets globally?
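Next Generation Data Problems
[Diagram: personas (Data Scientist, Data Architect, Data Steward, Data Ops, other personas) use services (DSS, DAS, DLM, DSX, …) through a Common API Layer that sits on top of the clusters (labels: Production IT, Production, Business Discovery, Dev / Test) running on AWS / Azure / GC]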
DataPlane Service: At a Glance
DPS EXTENSIBLE SERVICES
– Data Management Services
– Data Processing Services
– Data Infrastructure Services
– DLM (GA), DSS (TP), DAS (future), CBD (future)
DPS PLATFORM
– Native Capabilities: Clusters & Data Sources, Shared Services
– Core Services: Extensibility, Metering, Telemetry
Data Plane Services (DPS)
Hortonworks DataPlane Service (DPS)
CORE CAPABILITIES
Provide basic functions to be used by extensible
data services for multiple types and tiers
o Data Source Integration
Ability to register and/or create data sources to allow
consolidated access
o Data Services Catalog
Full configuration and management utilities for the enablement of
new services
o Security Controls
Full definition of security access controls including persona
definitions
High Level Architecture of DPS
[Diagram: the Data Plane Core Services, with the DP UI, DLM UI, DSS UI and the Data Plane DB, use Knox and an LDAP / AD single user store. They connect to Cluster 1 … Cluster N; each cluster is shown with Knox, Ambari, HDFS, Hive, Ranger, Atlas and an Engine.]
Data Lifecycle Manager (DLM) : DR and backup
Hortonworks DPS features
→ DPS installation
– Distributed as Docker images: ease of packaging and distribution, easy upgrade / rollback mechanisms
– Both on-prem and cloud
– Dpdeploy: a script that allows deployment of services, lifecycle management and upgrades
– DPS agents are installed on HDP clusters with an Ambari MPack
→ User management
→ Centralized Authentication & identity propagation
→ Cluster on-boarding: Data and Service discovery
→ Service Lifecycle Management: install, operate, upgrade
→ Telemetry and monitoring
DP Core (Micro) Services and DP Apps are Docker Containers
Beacon has direct access to the HDP-Ranger, Hiveserver2
DataPlane Components - Access services
→ Apache Knox
– Provides authentication for DPS
– Has to be configured with the same LDAP / AD user identity store as the managed HDP clusters
→ Consul
– Service registry for DPS components
– Services register with Consul so they can be discovered by other services
– Allows services to be distributed across multiple machines in the future
→ Zuul
– Enables dynamic routing, monitoring, resiliency and security
– Used to intercept & authenticate all API calls (redirecting to Knox if required)
– Acts as a proxy that redirects calls to the physical location of components (using Consul to discover their location)
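The division of roles between the service registry, the gateway and the authentication service can be illustrated with a toy model. This is only a sketch: it is not DPS code and it does not use the Consul, Zuul or Knox APIs. `ServiceRegistry`, `Gateway`, the service names, the addresses and the token are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """Toy stand-in for a service registry (the role Consul plays)."""
    services: dict = field(default_factory=dict)

    def register(self, name: str, address: str) -> None:
        self.services[name] = address

    def lookup(self, name: str) -> str:
        return self.services[name]


@dataclass
class Gateway:
    """Toy stand-in for an API gateway (the role Zuul plays)."""
    registry: ServiceRegistry
    valid_tokens: set

    def route(self, service: str, path: str, token=None) -> str:
        # Every call is intercepted: unauthenticated requests go to the login service.
        if token not in self.valid_tokens:
            return "redirect:" + self.registry.lookup("knox") + "/login"
        # Authenticated requests are proxied to wherever the service registered itself.
        return "forward:" + self.registry.lookup(service) + path


registry = ServiceRegistry()
registry.register("knox", "https://dp-host:8443")      # hypothetical address
registry.register("dlm-app", "http://10.0.0.5:9011")   # hypothetical address
gateway = Gateway(registry, valid_tokens={"token-abc"})
```

Because callers only know service names, a component can move to another machine by registering a new address, without any change on the caller side.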
Data Lifecycle Manager (DLM)
Replication/DR vs Backup
→ Replication/Disaster Recovery
– Replication is copying data from the Production Site to the Disaster Recovery Site
– Disaster Recovery includes replication, but also incorporates failover to the Disaster Recovery Site in case of an outage and failback to the original Production Site
– The Disaster Recovery Site can be an on-premise or cloud cluster
→ Backup & Restore
– While Replication/Disaster Recovery protects against disasters, it can also propagate logical errors (e.g. accidental deletion or corruption of data) to the DR Site
– To protect against accidental deletion of important HDFS directories or HBase databases, customers need incremental/full backups (generally retained for 30 days) so that they can restore to a previous point-in-time version
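The difference between the two can be shown with a toy model in which dictionaries stand in for clusters. This is a sketch only; the paths, the `replicate` and `take_backup` helpers and the schedule are hypothetical and have nothing to do with how DLM is implemented.

```python
import copy

production = {"/data/sales": "v1", "/data/hr": "v1"}
backups = []  # point-in-time copies, oldest first


def replicate(source: dict) -> dict:
    """Replication: the DR site mirrors the source, including its mistakes."""
    return copy.deepcopy(source)


def take_backup(source: dict) -> None:
    """Backup: keep a separate point-in-time copy."""
    backups.append(copy.deepcopy(source))


take_backup(production)              # scheduled backup
dr_site = replicate(production)      # scheduled replication

del production["/data/sales"]        # accidental deletion on the production site
dr_site = replicate(production)      # the next replication run propagates the deletion

restored = copy.deepcopy(backups[-1])  # only the backup can bring the data back
```

After the second replication run the DR site no longer holds `/data/sales`, while the backup taken earlier still does.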
Data Lifecycle Manager (DLM) : Replication and DR
⬢ Replication to another cloud/on-prem
site for Disaster Recovery. Failover and
failback in case of disaster
⬢ Backup & Restore of business critical
data for protection against accidental
deletion
⬢ Auto Tiering of hot/warm/cold data
for TCO reduction. Cold tier can be an
on-prem or cloud object store
Design Goals
→ Metadata + data replication: Atlas tags, lineage, Ranger policies, etc.
→ Point-in-time consistent replication
→ Efficient replication: transfer only the exact changes
→ Use cases
– Disaster recovery
– Offload data processing to other clusters (perhaps in the cloud)
→ Plugin architecture for extensibility of the components supported
DLM Deep Dive
DP/DLM Architecture
[Diagram: in Dataplane, the DLM UI and DLM App talk to a DLM Engine on the production cluster and another on the backup cluster. Each DLM Engine is reached through Knox, exposes a REST API, and contains a Scheduler, Job Manager, Plugin Manager and Data Store; each cluster also shows HDFS, Hive and Ranger.]
HDFS Replication
→ Directory level, not file level
→ Snapshot-based replication
– Restoration to a prior snapshot state if there are errors during replication
– Automatic management of snapshots
→ Uses distcp currently, with configurable queue and bandwidth
→ A deny policy is automatically created to restrict write access to the replicated dataset
– Can be disabled in the Beacon service configuration
→ Limitations
– Target directory is read-only; any modifications on the target will be deleted
– Failback requires cleanup + bootstrap
HDFS replication
→ Uses HDFS snapshot-based replication if the source and target are HDFS endpoints and snapshots are enabled on the folder
→ For source folders where snapshots are not enabled, regular distcp-based replication is done
→ Snapshots are managed automatically (the retention policy for source and target datasets can be specified in the policy definition)
→ If the target snapshot state is compromised, a reverse diff is applied to bring the target back to a correct snapshot state before continuing with replication
→ The number of mappers and the bandwidth can be configured
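For readers who want to see what snapshot-based copying looks like with the stock Hadoop command line, the sketch below builds the corresponding commands. It is an illustration under assumptions, not a description of what DLM runs internally: the helper names, paths, snapshot names and default values are hypothetical, while `hdfs dfs -createSnapshot` and the distcp options `-update`, `-diff`, `-m` and `-bandwidth` are standard Hadoop options. The commands are only built here, not executed.

```python
def snapshot_cmd(path: str, name: str) -> list:
    """Command that creates a named snapshot of a snapshottable HDFS directory."""
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def distcp_cmd(src: str, dst: str, prev_snap=None, new_snap=None,
               mappers: int = 10, bandwidth_mb: int = 100) -> list:
    """Build a distcp command.

    With both snapshot names, only the difference between the two snapshots
    is copied; otherwise a regular update copy is used.
    """
    cmd = ["hadoop", "distcp", "-update"]
    if prev_snap and new_snap:
        cmd += ["-diff", prev_snap, new_snap]
    cmd += ["-m", str(mappers), "-bandwidth", str(bandwidth_mb), src, dst]
    return cmd


# Hypothetical incremental run between two snapshots of the same folder.
create = snapshot_cmd("/data/sales", "s2")
incremental = distcp_cmd("hdfs://prod/data/sales", "hdfs://dr/data/sales",
                         prev_snap="s1", new_snap="s2", mappers=4, bandwidth_mb=50)
```

Each list can be passed to `subprocess.run` on a host with the Hadoop client installed. The diff-based copy assumes the target has not been modified since the previous snapshot, which matches the read-only target limitation listed for DLM.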
Hive Replication
→ DB level, not table level
→ Replicates schema, data, UDFs
→ Event-based incremental replication
– New commands: repl dump and repl load
→ Uses distcp internally, with configurable queue and bandwidth
→ Limitations
– No ACID table support
– Files copied directly into a table directory in HDFS are ignored
– Failback requires cleanup + bootstrap
Hive Replication Under the hood – Event logging
[Diagram: clients connect over JDBC/ODBC to HiveServer2, which runs the query; the Hive Metastore retrieves and stores metadata in the Metastore RDBMS, which holds the events table]
→ Captures events: Create/Alter/Delete on database/table/partition/function/constraint
→ Stores information about the state after each action, to create idempotent events that can be replayed in the destination cluster
→ Events have an increasing event id associated with them. Each destination database is tagged with the id of the last event replicated to it.
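Why recording the state after each action makes events idempotent can be seen in a toy model. This is a sketch with a hypothetical event format and a dictionary standing in for the metastore; it is not Hive's actual event schema.

```python
def apply_event(metastore: dict, event: dict) -> None:
    """Apply an event that carries the resulting state rather than a delta."""
    key = event["object"]
    if event["state"] is None:       # a drop event
        metastore.pop(key, None)
    else:                            # create/alter: overwrite with the recorded state
        metastore[key] = event["state"]


events = [
    {"id": 1, "object": "sales.orders", "state": {"columns": ["id"]}},
    {"id": 2, "object": "sales.orders", "state": {"columns": ["id", "amount"]}},
    {"id": 3, "object": "sales.tmp", "state": {"columns": ["x"]}},
    {"id": 4, "object": "sales.tmp", "state": None},
]

once, twice = {}, {}
for e in events:
    apply_event(once, e)
for e in events + events[1:]:        # replay events 2 to 4 a second time
    apply_event(twice, e)
```

Both destinations end up in the same state, so a batch that is retried after a partial failure does no harm.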
Under the hood - Event based replication
→ "repl dump <db name> <event id>"
– Gets the events newer than <event id>
– Includes data file information
– "<event id>" is the last replicated event id for the db, obtained from the destination cluster
→ "repl load <db name> <hdfs URI>"
– Applies the events on the destination
→ State is currently replicated in batches; this can be optimized in the future
→ Dropping a table/partition results in its files being backed up in a 'change management directory' to enable replay of the original insert replication event
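The dump/load cycle can be modelled in a few lines. This sketch does not call Hive: `repl_dump` and `repl_load` here are hypothetical Python functions that only mimic the roles of the two commands, and the event format and the `last_repl_id` field are assumptions made for the illustration.

```python
def repl_dump(event_log: list, last_id: int) -> list:
    """Toy 'dump': return the events newer than last_id, in id order."""
    return sorted((e for e in event_log if e["id"] > last_id),
                  key=lambda e: e["id"])


def repl_load(target: dict, dump: list) -> None:
    """Toy 'load': apply each event and record the last applied id on the target."""
    for event in dump:
        target["tables"][event["table"]] = event["state"]
        target["last_repl_id"] = event["id"]


source_log = [
    {"id": 1, "table": "orders", "state": "created"},
    {"id": 2, "table": "orders", "state": "altered"},
]
target = {"tables": {}, "last_repl_id": 0}

# First run: everything is newer than id 0, so the whole log is shipped.
repl_load(target, repl_dump(source_log, target["last_repl_id"]))

# A new event arrives on the source; the next run ships only that event.
source_log.append({"id": 3, "table": "customers", "state": "created"})
second_dump = repl_dump(source_log, target["last_repl_id"])
repl_load(target, second_dump)
```

The destination is the one that remembers how far it got, which is why the event id passed to the dump comes from the destination cluster.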
Event Based Replication
[Diagram: on the master cluster, HiveServer2 runs REPL DUMP, which serializes the batch of new events from the events table in the Metastore RDBMS into a dump (metadata + data) on HDFS. On the slave cluster, HiveServer2 runs REPL LOAD, which reads the repl dump dir, copies the data files with DistCp, and uses the Metastore API to write objects into the Metastore RDBMS.]
Cloud Replication
Cloud Replication (in DPS 1.1)
→ HDFS and Hive are replicated to S3 (persistent storage) and RDS (metastore).
→ Ranger policies on the source cluster are replicated/translated to S3 policies.
→ Apps use the replicated data on the target to compute (Spark jobs, MR jobs, BI jobs).
→ Ability to send the results of the workloads/jobs back to the on-prem cluster.
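[Diagram: folder/DB-level policy replication from the on-prem cluster (HDFS, Hive, Ranger, Beacon) to an HDP cloud cluster (Hive, HMS backed by RDS, Ranger, Beacon), with data stored at s3a://mybucket/sales. Under Security & Governance, Ranger policies are translated to S3. Apps run workloads/jobs/tasks on the cloud cluster and results data flows back on-prem. The slide also carries the labels "Azure" and "OR".]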
Demo
Other DPS services
DSS : Data Steward Studio
Service: Data Steward Services (DSS)
Suite of capabilities that allows users to understand,
secure, and govern data across enterprise data lakes
Ensure consistent security and governance
for data assets across tiers
Data Steward Studio (DSS) : 360 view of data
Data Steward
CONSUMABILITY: Curate, organize, and manage data
assets with Asset Collections
Data Steward Studio (DSS)
CONSUMABILITY: Audit Profiler shows both
summarized views & patterns of access for a data asset.
Data Steward Studio (DSS)
CONSUMABILITY: Data lineage shows complete chain
of custody and downstream dependencies for an
asset!
Data Steward Studio (DSS)
CONSUMABILITY: Understand the shape of Hive column
data with the statistical profiler. For example, the profile shows a box
plot and a histogram of the distribution of column values
Data Steward Studio (DSS)
DAS : Data Analytics Studio
Why is my query slow?
Causes shown: noisy neighbors, poor schema, inefficient queries, unstable demand
Features shown: expensive query log, storage optimizations, query optimizations, demand shifting
Hortonworks Data Analytics Studio
Optimize Your Hive Workloads
Part of the Hortonworks DataPlane Service
Questions ?

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Último (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Global data management with Hortonworks DPS

  • 1. Global data management with Hortonworks DataPlane Service (DPS). Abdelkrim Hadjidj, Solution Engineer. FOD Paris Meetup, 24-04-2018. © Hortonworks Inc. 2011 – 2017 All Rights Reserved
  • 2. Presenters: Abdelkrim Hadjidj, Solution Engineer, Hortonworks; organizer of Future of Data Meetup Paris. @ahadjidj, linkedin.com/in/ahadjidj/, medium.com/@abdelkrim.hadjidj
  • 3. Agenda
      - Global Data Management
      - Data Plane Service (DPS)
      - Data Lifecycle Manager (DLM): Disaster Recovery
      - Demonstration
      - Other DPS Services: DSS, DAS
      - Q & A
  • 4. Global Data Management
  • 5. Next Generation Data Problems
      - Multi-cluster deployments are a reality. As a user, how can I access my data seamlessly?
      - We need an API-based layer to abstract the underlying complexity. As a user, I focus on my tasks and optimize my efficiency.
      - Cloud strategies are evolving. Hybrid multi-cloud is the future of Big Data deployments.
      - [Diagram: multiple clusters and sources (Production IT, Production Business, Discovery, Dev, Test) in a "MULTIHYBRID" deployment]
  • 6. Next Generation Data Problems
      - Business Analysts/Data Scientists: How do we consolidate and combine several datasets from several platforms?
      - Data Architects: How do we optimize the performance and the model of our data reports?
      - Data Ops: How can I spin up new resources quickly and consistently? How can I share data securely?
      - Data Steward: How do we apply consistent security policies and manage our assets globally?
      - Is there a common layer that can help build these services? APIs are a key enabler for these scenarios.
  • 7. Next Generation Data Problems: Common API Layer
      - [Diagram labels: personas (Data Scientist, Data Architect, Data Steward, Data Ops, … other persona); services (DSS, DAS, DLM, DSX); Common API Layer; clusters (Production IT, Production Business, Discovery, Dev, Test, Dev / Tests); AWS / Azure / GC]
  • 8. DataPlane Service: At a Glance
      - DPS Platform
          - Native Capabilities: Clusters & Data Sources, Shared Services
          - Core Services: Extensibility, Metering, Telemetry
      - DPS Extensible Services
          - Data Management Services, Data Processing Services, Data Infrastructure Services
          - DLM (GA), DSS (TP), DAS (future), CBD (future)
  • 9. Data Plane Services (DPS)
  • 10. Hortonworks DataPlane Service (DPS): Core Capabilities
      - Provide basic functions to be used by extensible data services for multiple types and tiers
      - Data Source Integration: ability to register and/or create data sources to allow consolidated access
      - Data Services Catalog: full configuration and management utilities for the enablement of new services
      - Security Controls: full definition of security access controls, including persona definitions
      - [Diagram: Hortonworks DataPlane Service core capabilities (Data Source Integration, Data Services Catalog, Security Controls) over multiple clusters and sources ("MULTIHYBRID")]
  • 11. High Level Architecture of DPS
      - [Diagram labels: DP UI, DLM UI, DSS UI; Data Plane Core Services; Data Plane DB; Knox; Single User Store (LDAP / AD); Cluster 1 … Cluster N, each with Ambari, HDFS, Hive Engine, Knox, Ranger and Atlas]
  • 12. Data Lifecycle Manager (DLM): DR and backup
  • 13. Hortonworks DPS features
      - DPS installation
          - Distributed as Docker images: ease of packaging and distribution, easy upgrade / rollback mechanisms
          - Both on-prem and cloud
          - Dpdeploy: a script that allows deployment of services, lifecycle management and upgrades
          - DPS agents are installed on HDP clusters with an Ambari MPack
      - User management
      - Centralized authentication & identity propagation
      - Cluster on-boarding: data and service discovery
      - Service lifecycle management: install, operate, upgrade
      - Telemetry and monitoring
  • 14. DP Core (Micro) Services and DP Apps are Docker containers. Beacon has direct access to HDP Ranger and HiveServer2.
  • 15. DataPlane Components: Access services
      - Apache Knox
          - Provides authentication for DPS
          - Has to be configured with the same LDAP / AD user identity store as the managed HDP clusters
      - Consul
          - Service registry for DPS components
          - Services register with Consul so they can be discovered by other services
          - Allows services to be distributed to multiple machines in the future
      - Zuul
          - Enables dynamic routing, monitoring, resiliency and security
          - Used to intercept and authenticate all API calls (redirecting to Knox if required)
          - Proxy that redirects calls to the physical location of components (using Consul to discover their location)
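The interplay between the registry and the routing proxy on slide 15 can be illustrated with a toy, in-memory sketch. It does not use the Consul or Zuul APIs: `ServiceRegistry`, `Gateway`, the `/knox/login` redirect path, the token check and the addresses are all hypothetical stand-ins chosen for illustration.

```python
class ServiceRegistry:
    """Toy in-memory stand-in for a service registry (the role Consul plays)."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" strings
        self._next = {}       # service name -> index of the next instance to hand out

    def register(self, name, address):
        # A service announces where it can be reached.
        self._instances.setdefault(name, []).append(address)
        self._next.setdefault(name, 0)

    def resolve(self, name):
        # Hand out registered instances in round-robin order.
        instances = self._instances.get(name)
        if not instances:
            raise LookupError(f"no instance registered for {name!r}")
        index = self._next[name] % len(instances)
        self._next[name] = index + 1
        return instances[index]


class Gateway:
    """Toy edge proxy (the role Zuul plays): authenticate, then route."""

    def __init__(self, registry, valid_tokens):
        self.registry = registry
        self.valid_tokens = set(valid_tokens)

    def route(self, path, token=None):
        # Unauthenticated calls are redirected to the login endpoint
        # (hypothetical path standing in for the Knox redirect).
        if token not in self.valid_tokens:
            return {"status": 302, "location": "/knox/login"}
        # The first path segment names the target service.
        service = path.strip("/").split("/")[0]
        try:
            target = self.registry.resolve(service)
        except LookupError:
            return {"status": 404, "location": None}
        return {"status": 200, "location": f"http://{target}{path}"}


registry = ServiceRegistry()
registry.register("dlm", "10.0.0.5:9011")
registry.register("dlm", "10.0.0.6:9011")
gateway = Gateway(registry, valid_tokens={"token-1"})
print(gateway.route("/dlm/api/policies"))             # redirected to login
print(gateway.route("/dlm/api/policies", "token-1"))  # routed to a dlm instance
```

Because callers only know the service name, instances can later be moved or spread across machines without changing the callers, which is the benefit the slide attributes to the registry.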
  • 16. Data Lifecycle Manager (DLM)
  • 17. Replication/DR vs Backup
      - Replication/Disaster Recovery
          - Replication copies data from the Production Site to the Disaster Recovery Site
          - Disaster Recovery includes replication, but also incorporates failover to the Disaster Recovery Site in case of an outage and failback to the original Production Site
          - The Disaster Recovery Site can be an on-premise or cloud cluster
      - Backup & Restore
          - While Replication/Disaster Recovery protects against disasters, it can also propagate logical errors (e.g. accidental deletion or corruption of data) to the DR Site
          - To protect against accidental deletion of important HDFS directories or HBase databases, customers need incremental/full backups (generally retained for 30 days) so they can restore to a previous point-in-time version
  • 18. Data Lifecycle Manager (DLM): Replication and DR
      - Replication to another cloud/on-prem site for Disaster Recovery. Failover and failback in case of disaster
      - Backup & Restore of business-critical data for protection against accidental deletion
      - Auto Tiering of hot/warm/cold data for TCO reduction. The cold tier can be an on-prem or cloud object store
  • 19. Design Goals
      - Metadata + data replication: Atlas tags, lineage, Ranger policies, etc.
      - Point-in-time consistent replication
      - Efficient replication: transfer exact changes
      - Use cases
          - Disaster recovery
          - Offload data processing to other clusters (perhaps in the cloud)
      - Plugin architecture for extensibility of the components supported
  • 20. DLM Deep Dive
  • 21. DP/DLM Architecture
      - [Diagram labels: Dataplane (DLM UI, DLM App); Production cluster and Backup cluster, each with Knox, a DLM Engine (REST API, Job Manager, Scheduler, Plugin Manager, Data Store), HDFS, Hive and Ranger]
  • 22. HDFS Replication
  • 23. HDFS Replication
      - Directory level, not file level
      - Snapshot based replication
          - Restoration to a prior snapshot state if there are errors during replication
          - Automatic management of snapshots
      - Uses distcp currently, with configurable queue and bandwidth
      - A deny policy is automatically created to restrict write access to the replicated dataset
          - Can be disabled in the Beacon service configuration
      - Limitations
          - The target directory is read-only; any modifications on the target will be deleted
          - Failback requires cleanup + bootstrap
  • 24. HDFS replication
      - Uses HDFS snapshot based replication if the source and target are HDFS endpoints and snapshots are enabled on the folder
      - For source folders where snapshots are not enabled, regular distcp based replication is done
      - Snapshots are managed automatically (a retention policy for source and target datasets can be specified in the policy definition)
      - If the target snapshot state is compromised, a reverse diff is applied to bring the target back to a correct snapshot state before continuing with replication
      - The number of mappers and the bandwidth can be configured
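As a rough illustration of slides 23 and 24, the sketch below assembles the kind of commands a snapshot-based incremental copy involves. It relies on standard Hadoop CLI options (`hdfs dfs -createSnapshot`, and `hadoop distcp` with `-update`, `-diff`, `-m`, `-bandwidth`); the function name, the snapshot naming scheme and the example URIs are assumptions, and this is not a description of what DLM/Beacon runs internally.

```python
def build_replication_commands(source_dir, target_dir, policy, prev_run, run_id,
                               snapshots_enabled=True, mappers=None, bandwidth_mb=None):
    """Return the CLI commands (as argument lists) for one replication run.

    Snapshot names follow a hypothetical "<policy>-<run id>" scheme.
    """
    distcp = ["hadoop", "distcp"]
    if mappers is not None:
        distcp += ["-m", str(mappers)]                # number of map tasks
    if bandwidth_mb is not None:
        distcp += ["-bandwidth", str(bandwidth_mb)]   # MB/s per map task

    if not snapshots_enabled:
        # Fallback from slide 24: regular distcp when snapshots are not enabled.
        return [distcp + ["-update", source_dir, target_dir]]

    new_snap = f"{policy}-{run_id}"
    commands = [["hdfs", "dfs", "-createSnapshot", source_dir, new_snap]]
    if prev_run is None:
        # Bootstrap: copy the full content of the first snapshot.
        commands.append(distcp + ["-update", f"{source_dir}/.snapshot/{new_snap}", target_dir])
    else:
        # Incremental: copy only the changes between the two source snapshots.
        # distcp -diff expects the target to be unchanged since the previous snapshot.
        old_snap = f"{policy}-{prev_run}"
        commands.append(distcp + ["-update", "-diff", old_snap, new_snap, source_dir, target_dir])
    # Snapshot the target so the next run has a matching, known-good state.
    commands.append(["hdfs", "dfs", "-createSnapshot", target_dir, new_snap])
    return commands


for cmd in build_replication_commands("hdfs://prod/data/sales", "hdfs://dr/data/sales",
                                      "sales", prev_run=1, run_id=2,
                                      mappers=10, bandwidth_mb=50):
    print(" ".join(cmd))
```

Keeping a same-named snapshot on both sides is what makes it possible to roll the target back to a known state before resuming, as the slide describes for a compromised target.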
  • 25. Hive Replication
  • 26. Hive Replication
      - DB level, not table level
      - Replicates schema, data, UDFs
      - Event based incremental replication
          - New commands: repl dump and repl load
      - Uses distcp internally, with configurable queue and bandwidth
      - Limitations
          - No ACID table support
          - HDFS copy to table directory is ignored
          - Failback requires cleanup + bootstrap
  • 27. Hive Replication under the hood: Event logging
      - [Diagram: JDBC/ODBC clients run queries on HiveServer2, which retrieves/stores metadata through the Hive Metastore; the Metastore RDBMS holds the Events table]
      - Captured events: Create/Alter/Delete on DB/table/partition/function/constraint
      - Stores information about the state after each action, to create idempotent events that can be replayed in the destination cluster
      - Events have an increasing event id associated with them. The current event id is tagged with each destination database.
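The event-logging idea on slide 27 (each event records the state after the action, carries an increasing id, and the destination remembers the last id it applied) can be sketched with a small in-memory model. The `EventLog` and `Replica` classes and the event field names are hypothetical and are not the Hive Metastore schema.

```python
class EventLog:
    """Toy source-side event log with monotonically increasing event ids."""

    def __init__(self):
        self.events = []

    def append(self, event_type, table, state=None):
        event = {"id": len(self.events) + 1, "type": event_type,
                 "table": table, "state": state}
        self.events.append(event)
        return event["id"]

    def events_since(self, event_id):
        # Events strictly newer than the given id.
        return [e for e in self.events if e["id"] > event_id]


class Replica:
    """Toy destination that tracks the last event id it has applied."""

    def __init__(self):
        self.tables = {}        # table name -> state after the last applied event
        self.last_event_id = 0

    def apply(self, events):
        for event in sorted(events, key=lambda e: e["id"]):
            if event["id"] <= self.last_event_id:
                continue  # already applied, so replaying is harmless
            if event["type"] == "DROP_TABLE":
                self.tables.pop(event["table"], None)
            else:
                # CREATE/ALTER events carry the full state after the action,
                # so applying them is an overwrite rather than a delta.
                self.tables[event["table"]] = dict(event["state"])
            self.last_event_id = event["id"]


log = EventLog()
log.append("CREATE_TABLE", "sales", {"columns": ["id", "amount"]})
log.append("ALTER_TABLE", "sales", {"columns": ["id", "amount", "region"]})

replica = Replica()
replica.apply(log.events_since(replica.last_event_id))
replica.apply(log.events)  # replaying the same batch changes nothing
print(replica.tables, replica.last_event_id)
```

Storing the post-action state instead of a delta is what makes replay safe: applying an event twice yields the same result as applying it once.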
  • 28. Under the hood: Event based replication
      - "repl dump <db name> <event id>": get events newer than <event id>
          - Includes data files information
          - "<event id>" is the last replicated event id for the db, read from the destination cluster
      - "repl load <db name> <hdfs URI>": apply the events on the destination
      - State is currently replicated in batches; this can be optimized in the future
      - Dropping a table/partition results in its files being backed up in a "change management directory", to enable replay of the original insert replication event
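One incremental cycle with the two commands on slide 28 might be driven by statements like the ones built below. The slide abbreviates the commands; the `FROM` keyword forms and `REPL STATUS` used here follow the Hive replication syntax as I understand it for the Hive 3 era, and may differ in other Hive versions. The helper names, database name and dump directory are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_db(db):
    # Keep the sketch safe against malformed database names.
    if not _IDENTIFIER.match(db):
        raise ValueError(f"invalid database name: {db!r}")
    return db


def repl_status(db):
    """Statement to run on the destination to read the last replicated event id."""
    return f"REPL STATUS {_check_db(db)}"


def repl_dump(db, last_event_id=None):
    """Statement to run on the source: full dump, or events newer than last_event_id."""
    if last_event_id is None:
        return f"REPL DUMP {_check_db(db)}"
    return f"REPL DUMP {_check_db(db)} FROM {int(last_event_id)}"


def repl_load(db, dump_dir):
    """Statement to run on the destination to apply a dump directory."""
    if "'" in dump_dir:
        raise ValueError("dump directory must not contain a single quote")
    return f"REPL LOAD {_check_db(db)} FROM '{dump_dir}'"


# One incremental cycle, with made-up values for the event id and dump location.
print(repl_status("sales_db"))
print(repl_dump("sales_db", last_event_id=1042))
print(repl_load("sales_db", "hdfs://prod/user/hive/repl/dump-0001"))
```

The statements would be executed through any HiveServer2 client; the dump location in practice comes from the result of the dump statement rather than being chosen by the caller.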
  • 29. Event Based Replication
      - [Diagram: on the master cluster, REPL DUMP makes HiveServer2 serialize the new events batch from the Metastore RDBMS Events table and write the dump (metadata + data) to HDFS. On the slave cluster, REPL LOAD makes HiveServer2 read the repl dump dir, write objects through the Metastore API into the Metastore RDBMS, and copy the data files to HDFS with distcp]
  • 30. Cloud Replication
  • 31. Cloud Replication (in DPS 1.1)
      - HDFS and Hive are replicated to S3 (persistent) and RDS (metastore).
      - Ranger policies on the source cluster are replicated/translated to S3 policies.
      - Apps use the replicated data on the target to compute (Spark jobs, MR jobs, BI jobs).
      - Ability to send the results of the workloads/jobs back to the on-prem cluster.
      - [Diagram labels: OnPrem Cluster (Hive, HMS, HDFS, Ranger, Beacon); HDP Cloud Cluster (Hive, RDS, s3a://mybucket/sales); "Azure OR"; Folder/DB level policy; Replication; Security & Governance: Ranger policies translated to S3; Apps; Workloads/Jobs/Tasks; Results data]
  • 32. Demo
  • 33. Other DPS services
  • 34. DSS: Data Steward Studio
  • 35. Service: Data Steward Services (DSS)
      - Data Steward Studio (DSS): 360 view of data for the Data Steward
      - Suite of capabilities that allows users to understand, secure, and govern data across enterprise data lakes
      - Ensure consistent security and governance for data assets across tiers
  • 36. Data Steward Studio (DSS). Consumability: curate, organize, and manage data assets with Asset Collections.
  • 37. Data Steward Studio (DSS). Consumability: the Audit Profiler shows both summarized views and patterns of access for a data asset.
  • 38. Data Steward Studio (DSS). Consumability: data lineage shows the complete chain of custody and downstream dependencies for an asset.
  • 39. Data Steward Studio (DSS). Consumability: understand the shape of Hive column data with the statistical profiler. Example: the profile shows a box plot and a histogram of the distribution of column values.
  • 40. DAS: Data Analytics Studio
  • 41. Hortonworks Data Analytics Studio: Optimize Your Hive Workloads. Part of the Hortonworks DataPlane Service.
      - Why is my query slow? Noisy neighbors, poor schema, inefficient queries, unstable demand, expensive
      - Query log, storage optimizations, query optimizations, demand shifting
  • 42. Questions?