SlideShare uma empresa Scribd logo
1 de 27
Backup and DR in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
DataWorks Summit Munich 2017
Distributed Problems
About Me
• Partner & Co-Founder at OpenCore
• Before that
• Lars: EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer & Apache Member
• HBase (also in PMC)
• Lars: O’Reilly Author: HBase – The Definitive Guide
• Contact
• lars.george@opencore.com
• @larsgeorge
Website: www.opencore.com
Agenda
• Context
• Data Backup Strategies
• Summary
Context
What do you have to look out for?
What is What?
• Backup
• Ability to restore data using previously taken, frozen in time data snapshots
• Allows to recover deleted, or erroneously modified data
• Usually backups are not current, as the most recent is not included
• Disaster Recovery
• Restore business and operations after a complete system failure
• Includes rebuilding the environment and restoring the data from the last (good)
backup
• Minimize the impact on the business (financial loss)
Many Systems
• Hadoop is a platform of many distributed
sytems
• Simple tools only cover simple topics
• Every system has data and/or meta data
• Amount of data ranges from a few terabytes
to multiple petabytes in practice
• A cluster contains few to hundreds of servers
 What do you back up, how often, and how?
2006 2008 2009 2010 2011 2012 2013
Core Hadoop
(HDFS,
MapReduce)
HBase
ZooKeeper
Solr
Pig
Core Hadoop
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
Core Hadoop
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
Core Hadoop
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
Parquet
Sentry
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
The stack evolves and grows continuously!
2007
Solr
Pig
Core Hadoop
Knox
Flink
Parquet
Sentry
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
2014 2015
Kudu
RecordService
Ibis
Falcon
Knox
Flink
Parquet
Sentry
Spark
Tez
Impala
Kafka
Drill
Flume
Bigtop
Oozie
HCatalog
Hue
Sqoop
Avro
Hive
Mahout
HBase
ZooKeeper
Solr
Pig
YARN
Core Hadoop
Evolution of the Hadoop Platform
Why is backing up data difficult?
• Data at scale is difficult to move around!
• You cannot cheat physics
• The sheer inertia of data requires new approaches
• Do not or only minimally move data as necessary
• If duplicated data, use it for other purposes as well
• Multiple clusters with different workloads
• Traditional backup tools often require standardized APIs
• Hadoop does not supply those necessarily, or they are inefficient here
• Included backup tools in Hadoop are often rudimentary
• Not all scenarios are covered, or are only partially covered
Databases in Hadoop
• Many components use databases to store their state and metadata for
persistency
• The selection of RDBMS may have a substantial impact on that functionality
 Never use the ”developer option” (e.g. Derby)!
 The RDBMS should be highly available (HA)
• Databases should be backed up and archived on a regular basis
• But the question often remains: Is this a task of the Hadoop team or the
(often central) IT department?
• This also applies to other, external Hadoop stack systems (e.g. Storm)
Goals and Objectives
Usually backup and DR is grounded into conditions:
RTO – Recovery Time Objective
• Time to recover a service
• The hotter backup data is kept, the
shorter the RTO
• At scale, the RTO is foremost a
factor of infrastructure
RPO – Recovery Point Objective
• Measures how much data is lost in
case of a disastrous failure
• The more often data is backed up,
the shorter the RPO
 The RPO and RTO are driving cost factors and are multiplied by each other
Failure Scenarios
• Node Degradation
• One or more nodes are slowing down or produce an increasing number of errors
(and with it fewer results) – coined “The John Wayne”
• Mayb cause byzantine errors, which are difficult to identify
 Reasons: Failures or bugs in disks, NICs, device drivers, software
 Hadoop can handle many such errors, but not all
• Partial Node Failure
• Single (redundant) components are failing completely
• Example: A disk stops working
• Operators can swap component at runtime
 Hadoop is built to handle failures like this
 Impact is restricted to the share of component on total capacity
Failure Scenarios (cont.)
• Node Failure
• Assumes preparation, like enabling HA everywhere or configure „Rack Awareness“
 Reasons: Power or network outage
 Hadoop can handle this just fine
• Network Partitioning
• The cluster is split into two or more parts at random points
• Causes the so-called „split brain“ problem, where each now autonomous part has to
decide if it must fail, or can continue to serve request
• Applications need to switch to one of the working parts of the cluster
 Hadoop has some support for that, but there are external dependencies
 What happens when the parts join the cluster again?
Failure Scenarios (cont.)
• Loss of an entire data center
• Complete loss of a data copy
• Either switch to a warm/hot standby cluster (blue-green deployment)
• Or, rebuild cluster and restore data
 Reasons: Power or network outage
 Has to be done outside of Hadoop
Data Sources
• Not all Hadoop components have persistent data (or metadata)
• Transient data can (should) be recomputed as needed
• The number of used Hadoop components varies a lot
• „Onboarding“ checklist can help to capture that
• Given a set of requirements the RTO and RPO can be different
• Question: How long does re-computing derived data take?
• Basic Rule: The more you have, the more costly and time consuming it is
• You can always omit parts, as long as everyone is OK with it (for realz!)
• Cost can be capped – but not without consequence (higher RTO)
Backup Strategies
• Replication
• Copy of data and modifications of one cluster to another
• Basically like the venerable rsync problem
• Some components in Hadoop support this (partially?)
• What do you do with deleted data?
• Snapshots
• Few tools have a built-in snapshot feature
• HDFS and HBase
• Special access to frozen-in-time data
• Using special paths or system tools
• Data is local and needs to be moved
• How do you do this incrementally?
Bakup Strategies (cont.)
• Backup
• Store of data to a cold media
• Not supplied with Hadoop
• A few tools have system tools
• But… Versioned? Complete? Consistent?
• HA and Rack-Awareness
• Does neither cover backup nor DR
• Unless calling the HDFS trash functionality a backup... NOPE!
• Only valid within the cluster, within the same data center
Data Types
 There are two main types of data: persisted data and metadata
 There is also transient data
• Data concerns all user data, stored in HDFS, HBase, Solr, and so on
• Can be accessed using an interface
• Metadata are auxiliary information, helping to make sense of or being to
access the user data
• Hive Schemas
• Cluster Information
• Transient data often is stored in temporary files, logs, or streams
Data Consistency
• An often missed (or ignored?) topic, describing what actually is inside a backup
• Is the contained data consistent in itself?
• Some components (NoSQL, including HDFS) cannot mark data across system
boundaries in a reliable and predictable manner
• Snapshots may also be of no help as they are taken asynchronously
• Per regions server in HBase
• Open blocks are added in HDFS
• Move the task towards the application
• Which application was design to do that?
• When restoring data, gaps or bulges can form!
• Question is: Who is responsible to handle that?
• You could be tempted to add transactions...
Validation
• After taking a backup, its integrity needs to be checked
• Should consistency also be verified?
• HDFS has typical checks like CRCs
• Database could be restored and checked
• Special test scripts?
• Applications should ideally supply their own verification tools or rule sets
• Make this part of the software engineering task
• Use Jenkins CI as a backup und restore pipeline?
So far…
• Backup is a combination of already available techniques, or a special
implementation for systems that have no native support
• Snapshots alone only offer local versioning
• Replication is either a hot mirror, or a set of raw data structures that do not
allow an instantaneous restoration
• Consistency has to be handled on the application side
• The required RTO und RPO is crucial for how cluster environments have to
be built, and should be considered from the get go
• There does not seem to be a complete solution, requiring special
implementations!
Data Backup Strategies
Practical scenarios
Approaches
Cluster A Export
Scenario 1: Cold Export
StorageAnwendungAnwendungApplication
Cluster A Replication
Scenario 2: Replication
AnwendungAnwendungApplication Cluster B
Approaches (cont.)
• Applications write into two
(or more) clusters at the
same time
• Typically using Kafka
• ACK requires for both
clusters to confirm the write
• Could be controlled per
application
• See Google Spanner and
TrueTime
Cluster A
Scenario 3: Simultaneous Writes
AnwendungAnwendungApplication
Cluster B
Impact on Business
• The basic scenario
are polar opposites
• Depending on
complexity extra
layers can be added
• Kafka as a buffer
• Cost varies greatly,
with #3 requiring
two same size
clusters
RTO
RPO
HochNiedrig
Niedrig Hoch
1
23
Summary
Where to go from here?
Summary
Backup and DR must be part of planning and procurement from the start
Many systems handle data differently, requiring special treatment
Data backup and restoration has to be handled by the applications
Commercial offerings are few and not fully featured
Thank You!
@larsgeorge

Mais conteĂşdo relacionado

Mais procurados

The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and FlinkBryan Bende
 
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...DataWorks Summit
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryChris Schalk
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStackDataWorks Summit
 
오픈소스 모니터링 알아보기(Learn about opensource monitoring)
오픈소스 모니터링 알아보기(Learn about opensource monitoring)오픈소스 모니터링 알아보기(Learn about opensource monitoring)
오픈소스 모니터링 알아보기(Learn about opensource monitoring)SeungYong Baek
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...
High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...
High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...Salesforce Engineering
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Transactional SQL in Apache Hive
Transactional SQL in Apache HiveTransactional SQL in Apache Hive
Transactional SQL in Apache HiveDataWorks Summit
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...HostedbyConfluent
 

Mais procurados (20)

The Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and PainThe Evolution of Hadoop at Spotify - Through Failures and Pain
The Evolution of Hadoop at Spotify - Through Failures and Pain
 
Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
 
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Hadoop and OpenStack
Hadoop and OpenStackHadoop and OpenStack
Hadoop and OpenStack
 
오픈소스 모니터링 알아보기(Learn about opensource monitoring)
오픈소스 모니터링 알아보기(Learn about opensource monitoring)오픈소스 모니터링 알아보기(Learn about opensource monitoring)
오픈소스 모니터링 알아보기(Learn about opensource monitoring)
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...
High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...
High Scale Relational Storage at Salesforce Built with Apache HBase and Apach...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Transactional SQL in Apache Hive
Transactional SQL in Apache HiveTransactional SQL in Apache Hive
Transactional SQL in Apache Hive
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
 

Semelhante a Backup and Disaster Recovery in Hadoop

Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big DataZohar Elkayam
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoopch adnan
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Global Business Events
 
Planning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMPlanning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMWASdev Community
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentalsLokesh Ramaswamy
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdfBrahmam8
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop clusterFurqan Haider
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 

Semelhante a Backup and Disaster Recovery in Hadoop (20)

Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
 
Planning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPMPlanning For Catastrophe with IBM WAS and IBM BPM
Planning For Catastrophe with IBM WAS and IBM BPM
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdf
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 

Mais de DataWorks Summit/Hadoop Summit

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionDataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinDataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopDataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Último

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Backup and Disaster Recovery in Hadoop

  • 1. Backup and DR in Hadoop Lars George – Partner and Co-Founder @ OpenCore DataWorks Summit Munich 2017 Distributed Problems
  • 2. About Me • Partner & Co-Founder at OpenCore • Before that • Lars: EMEA Chief Architect at Cloudera (5+ years) • Hadoop since 2007 • Apache Committer & Apache Member • HBase (also in PMC) • Lars: O’Reilly Author: HBase – The Definitive Guide • Contact • lars.george@opencore.com • @larsgeorge Website: www.opencore.com
  • 3. Agenda • Context • Data Backup Strategies • Summary
  • 4. Context What do you have to look out for?
  • 5. What is What? • Backup • Ability to restore data using previously taken, frozen in time data snapshots • Allows to recover deleted, or erroneously modified data • Usually backups are not current, as the most recent is not included • Disaster Recovery • Restore business and operations after a complete system failure • Includes rebuilding the environment and restoring the data from the last (good) backup • Minimize the impact on the business (financial loss)
  • 6. Many Systems • Hadoop is a platform of many distributed sytems • Simple tools only cover simple topics • Every system has data and/or meta data • Amount of data ranges from a few terabytes to multiple petabytes in practice • A cluster contains few to hundreds of servers  What do you back up, how often, and how?
  • 7. 2006 2008 2009 2010 2011 2012 2013 Core Hadoop (HDFS, MapReduce) HBase ZooKeeper Solr Pig Core Hadoop Hive Mahout HBase ZooKeeper Solr Pig Core Hadoop Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig Core Hadoop Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop Parquet Sentry Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop The stack evolves and grows continuously! 2007 Solr Pig Core Hadoop Knox Flink Parquet Sentry Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop 2014 2015 Kudu RecordService Ibis Falcon Knox Flink Parquet Sentry Spark Tez Impala Kafka Drill Flume Bigtop Oozie HCatalog Hue Sqoop Avro Hive Mahout HBase ZooKeeper Solr Pig YARN Core Hadoop Evolution of the Hadoop Platform
  • 8. Why is backing up data difficult? • Data at scale is difficult to move around! • You cannot cheat physics • The sheer inertia of data requires new approaches • Do not or only minimally move data as necessary • If duplicated data, use it for other purposes as well • Multiple clusters with different workloads • Traditional backup tools often require standardized APIs • Hadoop does not supply those necessarily, or they are inefficient here • Included backup tools in Hadoop are often rudimentary • Not all scenarios are covered, or are only partially covered
  • 9. Databases in Hadoop • Many components use databases to store their state and metadata for persistency • The selection of RDBMS may have a substantial impact on that functionality  Never use the ”developer option” (e.g. Derby)!  The RDBMS should be highly available (HA) • Databases should be backed up and archived on a regular basis • But the question often remains: Is this a task of the Hadoop team or the (often central) IT department? • This also applies to other, external Hadoop stack systems (e.g. Storm)
  • 10. Goals and Objectives Usually backup and DR is grounded into conditions: RTO – Recovery Time Objective • Time to recover a service • The hotter backup data is kept, the shorter the RTO • At scale, the RTO is foremost a factor of infrastructure RPO – Recovery Point Objective • Measures how much data is lost in case of a disastrous failure • The more often data is backed up, the shorter the RPO  The RPO and RTO are driving cost factors and are multiplied by each other
  • 11. Failure Scenarios • Node Degradation • One or more nodes are slowing down or produce an increasing number of errors (and with it fewer results) – coined “The John Wayne” • Mayb cause byzantine errors, which are difficult to identify  Reasons: Failures or bugs in disks, NICs, device drivers, software  Hadoop can handle many such errors, but not all • Partial Node Failure • Single (redundant) components are failing completely • Example: A disk stops working • Operators can swap component at runtime  Hadoop is built to handle failures like this  Impact is restricted to the share of component on total capacity
  • 12. Failure Scenarios (cont.) • Node Failure • Assumes preparation, like enabling HA everywhere or configure „Rack Awareness“  Reasons: Power or network outage  Hadoop can handle this just fine • Network Partitioning • The cluster is split into two or more parts at random points • Causes the so-called „split brain“ problem, where each now autonomous part has to decide if it must fail, or can continue to serve request • Applications need to switch to one of the working parts of the cluster  Hadoop has some support for that, but there are external dependencies  What happens when the parts join the cluster again?
  • 13. Failure Scenarios (cont.) • Loss of an entire data center • Complete loss of a data copy • Either switch to a warm/hot standby cluster (blue-green deployment) • Or, rebuild cluster and restore data  Reasons: Power or network outage  Has to be done outside of Hadoop
  • 14. Data Sources • Not all Hadoop components have persistent data (or metadata) • Transient data can (should) be recomputed as needed • The number of used Hadoop components varies a lot • „Onboarding“ checklist can help to capture that • Given a set of requirements the RTO and RPO can be different • Question: How long does re-computing derived data take? • Basic Rule: The more you have, the more costly and time consuming it is • You can always omit parts, as long as everyone is OK with it (for realz!) • Cost can be capped – but not without consequence (higher RTO)
  • 15. Backup Strategies • Replication • Copy of data and modifications of one cluster to another • Basically like the venerable rsync problem • Some components in Hadoop support this (partially?) • What do you do with deleted data? • Snapshots • Few tools have a built-in snapshot feature • HDFS and HBase • Special access to frozen-in-time data • Using special paths or system tools • Data is local and needs to be moved • How do you do this incrementally?
  • 16. Bakup Strategies (cont.) • Backup • Store of data to a cold media • Not supplied with Hadoop • A few tools have system tools • But… Versioned? Complete? Consistent? • HA and Rack-Awareness • Does neither cover backup nor DR • Unless calling the HDFS trash functionality a backup... NOPE! • Only valid within the cluster, within the same data center
  • 17. Data Types  There are two main types of data: persisted data and metadata  There is also transient data • Data concerns all user data, stored in HDFS, HBase, Solr, and so on • Can be accessed using an interface • Metadata are auxiliary information, helping to make sense of or being to access the user data • Hive Schemas • Cluster Information • Transient data often is stored in temporary files, logs, or streams
  • 18. Data Consistency • An often missed (or ignored?) topic, describing what actually is inside a backup • Is the contained data consistent in itself? • Some components (NoSQL, including HDFS) cannot mark data across system boundaries in a reliable and predictable manner • Snapshots may also be of no help as they are taken asynchronously • Per regions server in HBase • Open blocks are added in HDFS • Move the task towards the application • Which application was design to do that? • When restoring data, gaps or bulges can form! • Question is: Who is responsible to handle that? • You could be tempted to add transactions...
  • 19. Validation • After taking a backup, its integrity needs to be checked • Should consistency also be verified? • HDFS has typical checks like CRCs • Database could be restored and checked • Special test scripts? • Applications should ideally supply their own verification tools or rule sets • Make this part of the software engineering task • Use Jenkins CI as a backup und restore pipeline?
  • 20. So far… • Backup is a combination of already available techniques, or a special implementation for systems that have no native support • Snapshots alone only offer local versioning • Replication is either a hot mirror, or a set of raw data structures that do not allow an instantaneous restoration • Consistency has to be handled on the application side • The required RTO und RPO is crucial for how cluster environments have to be built, and should be considered from the get go • There does not seem to be a complete solution, requiring special implementations!
  • 22. Approaches Cluster A Export Scenario 1: Cold Export StorageAnwendungAnwendungApplication Cluster A Replication Scenario 2: Replication AnwendungAnwendungApplication Cluster B
  • 23. Approaches (cont.) • Applications write into two (or more) clusters at the same time • Typically using Kafka • ACK requires for both clusters to confirm the write • Could be controlled per application • See Google Spanner and TrueTime Cluster A Scenario 3: Simultaneous Writes AnwendungAnwendungApplication Cluster B
  • 24. Impact on Business • The basic scenario are polar opposites • Depending on complexity extra layers can be added • Kafka as a buffer • Cost varies greatly, with #3 requiring two same size clusters RTO RPO HochNiedrig Niedrig Hoch 1 23
  • 25. Summary Where to go from here?
  • 26. Summary Backup and DR must be part of planning and procurement from the start Many systems handle data differently, requiring special treatment Data backup and restoration has to be handled by the applications Commercial offerings are few and not fully featured

Notas do Editor

  1. The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.