This document discusses best practices for upgrading Hadoop clusters with Cloudera Manager. It describes how the Cloudera Manager upgrade wizard provides a simplified, guided process for upgrading Hadoop distributions with minimal downtime. The upgrade wizard automates many of the manual steps previously required for upgrades and allows rolling upgrades for non-major upgrades when certain conditions are met. Following best practices like testing upgrades in non-production environments and having backup policies in place can help avoid issues during upgrades.
Darren Lo was the primary driver of the upgrade wizard and the lead engineer on making the changes in CM.
===
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Thursday, February 12th, 2015 • 10am – 11am Pacific Time
During this webinar, Vala Dormiani, Product Manager at Cloudera, will walk you through some of the best practices to keep in mind when it comes to upgrading and how to leverage Cloudera Manager to upgrade your Cloudera cluster. He will also discuss some of the new Upgrade Wizard features released with Cloudera Enterprise 5.3. Finally, he will go over a few of the other production-ready capabilities available with Cloudera Manager, including backup and disaster recovery and direct integration with Cloudera Support.
This is a technical webinar with a live Q&A at the end.
Many organizations have turned to a new architecture – an enterprise data hub – to complement and extend existing investments.
An enterprise data hub can store unlimited data, cost-effectively and reliably, for as long as you need, and lets users access that data in a variety of ways. Data can be collected, stored, processed, explored, modeled, and served in one unified platform. It’s connected to the systems you already rely on.
Cloudera’s enterprise data hub, powered by Apache Hadoop, the popular open source distributed data platform, is differentiated in several crucial areas. We provide:
Leading query performance.
The enterprise management and governance that you require of all of your mission-critical infrastructure.
Comprehensive, transparent, compliance-ready security at the core.
An open source platform that is also built of open standards – projects that are supported by multiple vendors to ensure sustainability, portability, and compatibility.
Our platform runs in your choice of environment, whether on-premises or in the cloud.
===
Cheat Sheet version: Our enterprise data hub is:
One place for unlimited data
Accessible to anyone
Connected to the systems you already depend on
Secure, governed, managed & compliant
Built on open source and open standards
Deployed however you want
Coupled with the support and enablement you need to succeed.
Important Note: Our EDH emphasizes “unified analytics” over “unified data”: It’s not practical or probable that customers will actually unify all their data. Much of it lives in the cloud or on storage (e.g. Isilon), in remote datacenters, is of uncertain value vs. cost of moving it to a hub, or security mandates preclude collocation. We enable customers to gather unlimited data, while bringing diverse processing and analytics to that data.
Hadoop is more than a dozen services running across many machines
Hundreds of hardware components
Thousands of settings
Limitless permutations
Manager lets you manage the complexity of running all these tools through one, easy to use interface
Hadoop is a system, not just a collection of parts
Everything is interrelated
Raw data about individual pieces is not enough
Must extract what’s important
Manager provides context to help you know what’s important
Managing Hadoop with multiple tools and manual process takes longer
Complicated, error-prone workflows
Longer issue resolution
Lack of consistent and repeatable processes
Manager lets you maximize efficiency by simplifying your workflow (and allowing it to be repeated)
Best-in-Class
The only enterprise-grade Hadoop management application available
Zero downtime rolling upgrades and BDR
Most downtime is scheduled. Manager provides zero-downtime upgrades to minimize scheduled downtime
Deploy jars across entire cluster
Integration with Cloudera Support
A direct connection to Cloudera Support to easily and efficiently support customers
Simple
Gain end-to-end administration for an enterprise data hub in a single tool
Add/Remove nodes, diagnose issues
Intelligent
Manage Hadoop at a system level – Cloudera’s experience realized in software
Efficient
Simplify complex workflows and make administrators more productive
3rd Party
Broadest network of partners to add greater functionality and have it be a completely integrated component of Cloudera Enterprise
While CM is not "open-source", it is "open". By this I mean the following:
1) A rich set of APIs for customers to work with, so they can script their way through CM. If at any point they decide to move away from Cloudera, they can script out or parse out any recommended settings that they have used with CM.
2) Cloudera is very transparent with respect to how CM works. See the CM docs and some of the blog posts: http://blog.cloudera.com/blog/2013/07/how-does-cloudera-manager-work/
3) Customers can revert to the Free edition if they decide not to renew the subscription. The Free edition is a very capable version, and we hold back very few features in the Enterprise version for production requirements. See here - http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise/cloudera-manager/cloudera-manager-features.html
4) We have several partners now that are starting to integrate with CM, leveraging the APIs and other functionality (most notably 3rd party extensibility). All these integrations are available for free users as well. Examples of partners include Syncsort, 0xData, Dell, StackIQ, Wibidata, etc.
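As a rough illustration of point 1, the sketch below shows how settings could be scripted out of CM over its REST API. It only builds the request URL and parses a response body; it does not contact a server. The API version (`v9`), the host name, the cluster and service names, and the sample payload are all assumptions made for illustration, so check the API documentation for the version your Cloudera Manager exposes.

```python
import json
from urllib.parse import quote

# Assumption: "v9" is a placeholder; use the API version your CM reports.
API_VERSION = "v9"


def service_config_url(base_url, cluster, service, api_version=API_VERSION):
    """Build the URL for one service's configuration in the CM REST API."""
    return "{}/api/{}/clusters/{}/services/{}/config?view=full".format(
        base_url.rstrip("/"),
        api_version,
        quote(cluster, safe=""),   # cluster names may contain spaces
        quote(service, safe=""),
    )


def explicit_settings(config_json):
    """Return {name: value} for settings that carry an explicit value.

    Assumes a body shaped like {"items": [{"name": ..., "value": ...}]};
    entries without a "value" key are treated as defaults and skipped.
    """
    items = json.loads(config_json).get("items", [])
    return {item["name"]: item["value"] for item in items if "value" in item}


# Hypothetical sample payload, not captured from a real cluster.
sample = '{"items": [{"name": "dfs_replication", "value": "3"}, {"name": "dfs_block_size"}]}'
print(service_config_url("http://cm-host.example.com:7180", "Cluster 1", "hdfs"))
print(explicit_settings(sample))
```

Fetching the URL with an authenticated HTTP client and saving the parsed dictionary gives a portable record of the settings in use.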
Monitor and Diagnose Cluster Workloads
Manage Workflows
Get Host-Level Snapshots
Track events across the cluster
Create Custom Charts
Report on System Performance and Usage
Features
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View, and Search Hadoop Logs
Track Events from Across the Cluster
Report on System Performance and Usage
Create Custom Charts and Landing Page
Manage Resources
Manage Workflows
Set Time Context Globally
Key Features:
Manage BDR
Key Features:
Zero Downtime Rolling Upgrades
Hopefully, we have established that Cloudera Manager makes administration easy and is the only production-ready administration tool for Hadoop.
Cloudera Manager has a built-in Upgrade Wizard to make upgrades simple and predictable, and it also features zero-downtime rolling upgrades.
Cloudera Enterprise 5 provides additional enterprise-ready capabilities and marks the next step in the evolution of the Hadoop-based data management platform.
Latest and Greatest
Multitude of other factors
For most mission critical workloads, downtime is never an option.
Any downtime can have a direct impact on revenue and lead to frantic calls in the middle of the night.
For this reason, upgrading the software that powers these workloads can often be a daunting task. It can cause unpredictable issues without access to support.
Hadoop consists of dozens of components, running across multiple machines, all with their own configurations. That can lead to a lot of complexity and uncertainty - especially when taking the upgrade plunge.
That’s why an enterprise-grade administration tool is crucial for running Hadoop in production.
A built-in Upgrade Wizard in Cloudera Manager 5 makes it easy to upgrade CDH on your clusters.
The Upgrade Wizard (enhanced) performs service-specific upgrade steps that you would have had to run manually in the past.
Misc other things
===
No retry support on failure yet
For example, to upgrade to CDH 5.3, you must be on Cloudera Manager 5.3 or higher.
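The version rule above can be checked before starting an upgrade. This is a minimal sketch that assumes the rule generalizes to "the Cloudera Manager major.minor version must be at least the target CDH major.minor version"; the notes only state the CDH 5.3 case, so treat the generalization and the function names as illustrative.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.3.1' into (5, 3, 1)."""
    return tuple(int(part) for part in text.split("."))


def cm_supports_cdh(cm_version, cdh_version):
    """True when CM's major.minor is >= the target CDH major.minor.

    Assumption: only major and minor numbers matter for this check.
    """
    return parse_version(cm_version)[:2] >= parse_version(cdh_version)[:2]


print(cm_supports_cdh("5.3.0", "5.3.1"))  # CM 5.3 managing a CDH 5.3 target
print(cm_supports_cdh("5.2.1", "5.3.0"))  # CM is older than the target CDH
```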
===
Maintenance CDH 5 Downgrades are still called “upgrades”
Other Variations:
Environments / Cross-cutting features
HA (HDFS, MR1, Yarn, Oozie...)
Security (Kerberos, SSL, Sentry)
Both parcel and package installations are supported by the Upgrade Wizard.
Using parcels is the preferred and recommended way, as packages must be manually installed, whereas parcels are installed by Cloudera Manager.
====
By type of bits
Rolling restart vs. regular restart: the rolling restart capability enables zero-downtime upgrades under certain conditions.
If you are using parcels, have a Cloudera Enterprise license, and have enabled HDFS high availability, you can perform a rolling upgrade for non-major upgrades.
This enables you to upgrade your cluster software and restart the upgraded services without incurring any cluster downtime.
Note that it is not possible to perform a rolling upgrade from CDH 4 to CDH 5 (i.e. major upgrade) because of incompatibilities between the two major versions.
For minor and maintenance upgrades, you will have the option to select Rolling Upgrade, where supported services undergo a rolling restart while the rest undergo a normal restart, with some downtime.
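The conditions for a rolling upgrade listed above can be written down as a simple pre-flight check. The sketch below is only an illustration of those four conditions; the class and function names are hypothetical and are not part of Cloudera Manager.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    uses_parcels: bool
    has_enterprise_license: bool
    hdfs_ha_enabled: bool


def is_major_upgrade(current, target):
    """A major upgrade changes the first version number (e.g. CDH 4 to CDH 5)."""
    return int(current.split(".")[0]) != int(target.split(".")[0])


def rolling_upgrade_blockers(state, current, target):
    """Return the reasons a rolling upgrade is not possible (empty if it is)."""
    blockers = []
    if is_major_upgrade(current, target):
        blockers.append("major upgrades cannot be rolling")
    if not state.uses_parcels:
        blockers.append("cluster does not use parcels")
    if not state.has_enterprise_license:
        blockers.append("no Cloudera Enterprise license")
    if not state.hdfs_ha_enabled:
        blockers.append("HDFS high availability is not enabled")
    return blockers


ready = ClusterState(uses_parcels=True, has_enterprise_license=True, hdfs_ha_enabled=True)
print(rolling_upgrade_blockers(ready, "5.2.1", "5.3.0"))
print(rolling_upgrade_blockers(ready, "4.7.0", "5.3.0"))
```

Returning the list of blockers rather than a single boolean makes it clear which prerequisite still needs attention.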
Log in to the Cloudera Manager Admin Console.
To access the wizard, on the Home page, click the cluster’s drop down menu, and select Upgrade Cluster.
Alternately, you can trigger the wizard from the Parcels page, by first downloading and distributing a parcel to upgrade to, and then selecting the Upgrade button for this parcel.
Select the CDH version. If the option to pick between packages and parcels is provided, click the Use Parcels radio button. If there are no qualifying parcels, the location of the parcel repository will need to be added under Parcel Configuration Settings.
The Wizard provides additional steps to prepare your cluster for the upgrade and prompts you to back up existing databases. Check Yes for all required actions to be able to Continue. Please read the Upgrade Documentation for a more complete list of actions to be taken at this stage, before proceeding with the upgrade.
The Wizard now performs consistency and health checks on all hosts in the cluster. This is particularly helpful if you have mismatched versions of packages across cluster hosts. If any problems are found, you will be prompted to fix these before continuing.
The selected parcel is downloaded and distributed to all hosts.
For major upgrades, the Wizard will warn that the services are about to be shut down for the upgrade. For minor and maintenance upgrades, if you are using parcels and have HDFS high availability enabled, you will have the option to select Rolling Upgrades on this page. Supported services will undergo a rolling restart, while the rest will undergo a normal restart. Check Rolling Upgrade to proceed with this option. Until this point, you can exit and resume the Wizard without impacting any running services.
The Command Progress screen displays the results of the commands run by the Wizard as it shuts down all services, activates the new parcel, upgrades services, deploys client configuration files, and restarts services. The service commands include upgrading HDFS metadata, upgrading the Oozie database and installing ShareLib, upgrading the Sqoop server and Hive Metastore, among others.
The Host Inspector runs to validate all hosts, as well as report CDH versions running on them.
At the end of the Wizard, you are prompted to finalize the HDFS metadata upgrade. It is recommended at this stage to refer to the Upgrade Documentation for additional steps that might be relevant to your cluster configuration and upgrade path. For major (CDH 4 to CDH 5) upgrades, you have the option of importing your MapReduce configurations into your YARN service. Additional steps in the Wizard will assist with this migration. On completion, we recommend reviewing the YARN configurations for any additional tuning you might need.
Your upgrade is now complete!
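The host consistency check mentioned in the wizard steps above looks for mismatched package versions across cluster hosts. The sketch below shows one way such a check could be expressed; it is not how the Wizard or the Host Inspector is implemented, and the host names and versions are made up.

```python
from collections import Counter


def find_version_mismatches(host_versions):
    """Return {host: version} for hosts that differ from the most common version.

    host_versions maps a host name to the CDH version installed on it.
    """
    if not host_versions:
        return {}
    majority, _ = Counter(host_versions.values()).most_common(1)[0]
    return {host: ver for host, ver in host_versions.items() if ver != majority}


# Hypothetical inventory: one host was left behind on an older package.
inventory = {"node1": "5.2.1", "node2": "5.2.1", "node3": "5.2.0"}
print(find_version_mismatches(inventory))
```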
===
Whether or not the cluster can access the internet, customers may want to stage their own repos anyway.
A parallel shell or some way to execute commands across all cluster hosts
Notable points & possible issues - before you upgrade
job, app and tool compatibility
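The "parallel shell" prerequisite above can be met with many existing tools. As a minimal sketch, the Python below fans a command out to every host using threads and `ssh`. It assumes key-based SSH login is already configured; the host names in the comment are placeholders, and the `runner` parameter is an illustrative hook that lets the fan-out logic be exercised without a real cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(host, command):
    """Run a command on one host over ssh; assumes key-based login is set up."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=60,
    )
    return result.returncode, result.stdout.strip()


def run_everywhere(hosts, command, runner=ssh_runner, max_workers=16):
    """Run command on every host in parallel; returns {host: (returncode, output)}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(runner, host, command) for host in hosts}
        return {host: future.result() for host, future in futures.items()}


# Example (needs reachable hosts, so it is left commented out):
# print(run_everywhere(["node1.example.com", "node2.example.com"], "uptime"))
```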
===
Restore a fresh environment and repeat if necessary
You should still read the docs
Please refer to the Upgrade Documentation for more comprehensive details on using the Upgrade Wizard and the steps if the upgrade wizard reports a failure
===
Synopsis
Customer upgraded to CDH4.2.1.
Long-running or large MapReduce jobs were failing. The customer was attempting several other configuration changes at the same time (introducing JT HA, Kerberizing the cluster), and in the middle of these it was discovered that the setting
mapred.job.reuse.jvm.num.tasks = -1
was causing the failure (MAPREDUCE-4490).
Where would CM help?
Cloudera Manager’s default setting for this is
mapred.job.reuse.jvm.num.tasks = 1
which could have prevented hitting this known issue.
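A case like this one suggests scanning a job configuration for known-problematic values before an upgrade. The sketch below checks only the single setting named in this case; the table and function names are hypothetical, and a real check would need a maintained list of known issues.

```python
# Setting name -> (risky value, reference), taken from the case described above.
KNOWN_RISKY_SETTINGS = {
    "mapred.job.reuse.jvm.num.tasks": ("-1", "MAPREDUCE-4490"),
}


def find_risky_settings(config):
    """Return (name, value, reference) for each setting at a known-risky value."""
    findings = []
    for name, (bad_value, reference) in KNOWN_RISKY_SETTINGS.items():
        # Compare as strings so both -1 and "-1" are caught.
        if str(config.get(name)) == bad_value:
            findings.append((name, bad_value, reference))
    return findings


print(find_risky_settings({"mapred.job.reuse.jvm.num.tasks": -1}))
print(find_risky_settings({"mapred.job.reuse.jvm.num.tasks": 1}))
```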
Sharing their expertise for large & critical upgrades
(downtime, data loss)
Cloudera’s goal is to deliver customer experience
Makes the upgrade process less stressful
Why You Need Backup & Disaster Recovery
Your EDH is a Mission-Critical Part of the Data Management Infrastructure
Stores valuable data and runs important workloads
Business continuity is a MUST HAVE
Managing Business Continuity for Hadoop is Complex
Different services that store data – HDFS, HBase, Hive
Backup and disaster recovery is configured separately for each
Processes are manual
BDR in CM is important and makes it easy to manage Hive replication, metadata replication, and have data readily available across datacenters. It also is automated and fault tolerant
Central configuration: Define backup and disaster recovery policies and apply them across services
Track progress of replication jobs and get notified when data is out of sync
High-performance, CDH-optimized replication using MapReduce via DistCp: replication uses the scalability and availability of MapReduce and YARN to parallelize the copying of files, using a specialized MapReduce job or YARN application that efficiently and quickly transfers only changed files from each Mapper to the replica side.
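The idea of transferring only changed files and spreading the copy work across mappers can be sketched in a few lines. This is a conceptual illustration, not DistCp's or BDR's actual algorithm: it assumes each side can be listed as a mapping from path to a (size, checksum) pair, and the paths and values are made up.

```python
def files_to_copy(source, replica):
    """Return the sorted source paths that are missing or different on the replica.

    Both arguments map a path to a (size, checksum) tuple.
    """
    return sorted(path for path, meta in source.items() if replica.get(path) != meta)


def split_among_mappers(paths, num_mappers):
    """Deal the work list round-robin into num_mappers chunks."""
    return [paths[i::num_mappers] for i in range(num_mappers)]


# Hypothetical listings: /b changed on the source and /c is new.
source = {"/a": (10, "x"), "/b": (20, "y"), "/c": (5, "z")}
replica = {"/a": (10, "x"), "/b": (20, "old")}

changed = files_to_copy(source, replica)
print(changed)
print(split_among_mappers(changed, 2))
```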
Cloudera Manager provides an integrated, easy-to-use management solution for enabling data protection in the Hadoop platform. Manager provides rich functionality aimed at replicating data stored in HDFS and accessed through Hive. When critical data is stored on HDFS, Cloudera Manager provides the necessary capabilities to ensure that the data is available at all times.
Cloudera Manager provides key capabilities that are fully integrated into the Cloudera Manager Admin Console:
Select - Choose the key datasets that are critical for your business operations.
Schedule - Create an appropriate schedule for data replication and/or snapshots – trigger replication and snapshots as frequently as is appropriate for your business needs.
Monitor - Track progress of your snapshots and replication jobs through a central console and easily identify issues or files that failed to be transferred.
Alert - Issue alerts when a snapshot or replication job fails or is aborted so that the problem can be diagnosed expeditiously.
Replication capabilities work seamlessly across Hive and HDFS: replication can be set up on files or directories in the case of HDFS and on tables in the case of Hive. Hive metastore information is also replicated, which means that applications that depend on the table definitions stored in Hive will work correctly on the replica side.
All Cloudera BDR functionality is available directly through the Cloudera Manager Admin Console.
Diagnostic Data Bundles - From log files and other longitudinal data sets drawn from Cloudera Manager
Enable predictive and proactive support of customer clusters under license.
Based on similar use cases and comparative analysis of diagnostics across all nodes under subscription
Cloudera’s dedicated Proactive Support unit ensures customers are prepared to benefit from every element of their subscription. Proactive Support includes reviewing configurations for known issues and providing comparisons of usage patterns to help enhance your operations and plan for future changes.
Unique to Cloudera, our Predictive Support model means we're regularly monitoring the status of your EDH environment, allowing us to isolate and prevent issues before they even occur by analyzing support cases and platform usage across all deployments.
===
Full Lifecycle Support, starting on day one. The onboarding process scopes technical assistance to customer requirements, introduces key product documentation and community resources, and ensures you can take full advantage of the online support portal to meet your business goals.
Support as a Strategic Advantage. Through our Proactive Support program, we also ensure that customers are optimizing their use of Cloudera's technical resources, starting with the onboarding process.
Cloudera offers the industry’s best Hadoop support
===
Some internal results of external feedback loops, benefitting customers:
COE Pods: support resources specialized by committership and allocated to support specific parts of Hadoop/EDH to enhance expertise, responsiveness, and staffing based on customer needs
Support Portal: full Proactive Support orientation to online customer support and communication resources during onboarding
License Key Provision: helps keep customers unified and up to date between systems and provide ability to generate and manage license keys for Cloudera Manager and CDH Cluster stats
Cloudera Communities: user forums for basic self-support, insights into best practices, and virtual community-building
Account Health Check: ongoing nine-attribute diagnostic to correlate the most important characteristics determining customer satisfaction
Customer Operations Tools Team (COTT): Cloudera staff dedicated to building tools that enable predictive and proactive support of customer clusters under license using longitudinal data drawn primarily from Cloudera Manager
CSI: HBase database of cluster data, community info, knowledge base, support records, Cloudera internal
Monacle: Search-based user interface for CSI
Validations (under development): automated alerting system based on comparative analysis of diagnostics across all nodes under subscription
* We offer the most complete set of processing, analysis, and serving frameworks for Hadoop.
* Including comprehensive support for YARN. *For example, Impala runs on YARN. YARN is not a differentiator.*
What’s really significant about this architecture is how it unifies diverse access to common data.
In traditional approaches, you’d have separate systems to collect, store, process, explore, model, and serve data. Different teams would use different systems for each workload, and users whose roles span multiple systems would have to use several of them to achieve their objectives.
With Cloudera’s enterprise data hub:
You can perform end-to-end data workflows in a single system, dramatically lowering time to value.
Each workload can access unlimited data, thanks to the underlying data platform, enhancing the value of each workload.
Users can now access their data in new ways and are enabled by these diverse workloads to interact with data.
Cloudera Enterprise provides comprehensive support for batch, interactive, and real-time workloads:
Batch
Data integration with Apache Sqoop
Data processing with MapReduce, Apache Hive, Apache Pig
Memory-centric processing with Apache Spark
Interactive
Analytic SQL with Impala
Search with Apache Solr
Machine Learning with Apache Spark
Real-Time
Data integration with Apache Kafka, Apache Flume
Stream processing with Apache Spark
Data serving with Apache HBase
Shared resource management ensures that each workload is handled appropriately and abides by IT policy.
===
What’s more, 3rd party tools, such as SAS or Informatica can run as native workloads inside Cloudera’s enterprise data hub.
To enable you to achieve the benefits of an enterprise data hub without compromise, we offer the most comprehensive security capabilities of any Hadoop solution. We approach security in terms of 4 core pillars:
Perimeter security. Can we ensure only the right people have access to the cluster?
Access controls. Can we ensure people using the cluster can access only the right data?
Visibility. Can we ensure that these rules are being followed, and that malicious activity isn’t taking place? Trust but verify.
Data protection. If all else fails, can we ensure that data is comprehensively encrypted, both at rest and in transit?
*It's all too easy for other vendors to claim their platforms are "secure" because they cover one or more of these pillars.*
It’s important to ensure complete coverage in order to protect your customers and your most sensitive data.
Key capabilities include:
Active Directory and Kerberos for all identity management and user / service authentication
Sensitive data is restricted to authorized personnel and secured against privileged users
Data encrypted using dedicated key manager tied to corporate HSM as root of trust
Full logging of data access, creation of derivative data sets, and changes to access permissions
Cloudera partners more broadly and deeply across the Hadoop ecosystem than any other vendor. With over 1300 partners and counting, our partnerships offer:
Compatibility with your existing tools and skills
160+ certified on Cloudera 5, including all 12 of the 12 Gartner Business Intelligence Magic Quadrant leaders
Flexible deployment options
On-premises
Public, private, or hybrid cloud
Appliances and engineered systems
Partnerships you can trust
Deep engineering relationships
Comprehensive certification program
Workload/ Resource Management
Service Extensibility - enabling 3rd party applications
Enhanced Impala Query Monitoring
YARN/MR2 Monitoring
User defined triggers – custom alerts
Oozie and YARN RM High Availability workflows
Why C5 is great
Any enhancements are ineffective if the benefits of the enterprise data hub are not easily accessible to existing users. That’s why Cloudera has placed an increased emphasis on the upgrade experience, to make it easier to upgrade to the latest version of the software. The team will continue to work on making improvements to this experience.
More details about the Cloudera 5.3 release can be found at “Cloudera Enterprise 5.3 is Released.” CDH 5.3 is now available for download.
Preview of the next webinar
To ensure the highest level of functionality and stability, consider upgrading to the most recent version of CDH.