1© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems
Philip Zeyliger, Philip Langdale, Kathleen Ting, Miklos Christine
Strata Hadoop NYC, 29 September 2015
2© Cloudera, Inc. All rights reserved.
Your Hosts
Philip Langdale
Philip Zeyliger
Miklos Christine
Kathleen Ting
3© Cloudera, Inc. All rights reserved.
Philip Langdale
• CM Architect
• philipl@cloudera
Philip Zeyliger
• CM Architect
• philip@cloudera
Kathleen Ting
• Customer Success Manager
• kate@cloudera
Miklos Christine
• Solutions Engineer
• mwc@databricks.com
$ whoami
4© Cloudera, Inc. All rights reserved.
Overall Agenda
• Intro
• Installation
• Configuration
• Troubleshooting
• Enterprise Considerations
(lunch) 12:30-1:30pm
• Hands-on Lab Exercises
(break) 3-3:30pm
• Continue hands-on exercises
• AMA
Q&A at end of every section
(Opening Reception) 5pm @ 3E
5© Cloudera, Inc. All rights reserved.
Prerequisites (for hands-on portions)
SSH Client
○ Putty for Windows
○ Terminal for OS X
Web Browser
Wi-Fi
6© Cloudera, Inc. All rights reserved.
Hands-on Logistics
We’ll be handing out URLs to your clusters shortly!
7© Cloudera, Inc. All rights reserved.
Asking Questions
sli.do/ops
(reasonably mobile-friendly)
8© Cloudera, Inc. All rights reserved.
Why Apache Hadoop?
Solves problems that don’t fit on a single computer.
Doesn’t require you to be a distributed systems person.
Handles failures for you.
(Diagram: Venn of "Data Analysts" and "Distributed Systems People" — the overlap is "Unicorns!")
9© Cloudera, Inc. All rights reserved.
A Distributed System
Many processes, which, taken together, are trying to act
like a single system.
Ideally, users deal with the system as a whole.
For operations, you need to understand the system by
parts too.
10© Cloudera, Inc. All rights reserved.
The Hadoop Stack
(Diagram: four machines, each running Hadoop daemons in JVMs on Linux over local disk, CPU, and memory, with the stack layered on top)
File Storage: HDFS
Random Access Storage: HBase
Resource Management: YARN
Batch Processing: MapReduce
In-Memory Processing: Spark
Interactive SQL: Impala
Search: Solr
Coordination: ZooKeeper
Security: Sentry
Event Ingest: Flume, Kafka
DB Import/Export: Sqoop
User Interface: Hue
Languages/APIs: Hive, Pig, Crunch, Kite, Mahout
Other systems: httpd, SAS, custom apps, etc.
11© Cloudera, Inc. All rights reserved.
Storage
Random Access Storage: HBase
File Storage: HDFS
Search: Solr
12© Cloudera, Inc. All rights reserved.
Execution
In-Memory Processing: Spark
Interactive SQL: Impala
Batch Processing: MapReduce
Search: Solr
13© Cloudera, Inc. All rights reserved.
Compilers
Hive—SQL to MR/Spark compiler, metadata mapping
between files and tables
Pig—PigLatin to MR compiler
Languages/APIs: Hive, Pig, Crunch,
Kite, Mahout
14© Cloudera, Inc. All rights reserved.
Taken all together...
(Diagram: the complete stack — HDFS, HBase, YARN, MapReduce, Spark, Impala, Solr, ZooKeeper, Sentry, Flume/Kafka, Sqoop, Hue, and the language/API layer — running across four machines, alongside other systems such as httpd, SAS, and custom apps)
15© Cloudera, Inc. All rights reserved.
How to learn these things...
These systems are largely defined by the state they store on
disk (what will survive a restart?) and the defined protocols
(RPCs) they use to interact amongst themselves and with the
outside world.
Questions?
17© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Installation
Philip Langdale
18© Cloudera, Inc. All rights reserved.
Agenda
• Hardware Considerations
• Node types and recommended role allocations
• Host configuration
• Rack configuration
• Software Installation
• OS Prerequisites
• Installing Hadoop and other Ecosystem components
• Launch
• Initial Configuration
• Sanity testing
• Security considerations
19© Cloudera, Inc. All rights reserved.
Hardware Considerations
• As a distributed system, Hadoop is going to be deployed onto
multiple interconnected hosts
• How large will the cluster be?
• What services will be deployed on the cluster?
• Can all services effectively run together on the same hosts or is
some form of physical partitioning required?
• What role will each host play in the cluster?
• This impacts the hardware profile (CPU, Memory, Storage, etc)
• How should the hosts be networked together?
20© Cloudera, Inc. All rights reserved.
Host Roles
within a
Cluster
Master Node
HDFS NameNode
YARN ResourceManager
HBase Master
Impala StateStore
ZooKeeper
Worker Node
HDFS DataNode
YARN NodeManager
HBase RegionServer
Impalad
Utility Node
Relational Database
Management (eg: CM)
Hive Metastore
Oozie
Impala Catalog Server
Edge Node
Gateway Configuration
Client Tools
Hue
HiveServer2
Ingest (eg: Flume)
For larger clusters, roles will
be spread across multiple
nodes of a given type
21© Cloudera, Inc. All rights reserved.
Roles vs Cluster Size
                   Master   Worker   Utility         Edge
Very Small (≤10)   1        ≤10      1 shared host
Small (≤20)        2        ≤20      1 shared host
Medium (≤200)      3        ≤200     2               1+
Large (≤500)       5        ≤500     2               1+
22© Cloudera, Inc. All rights reserved.
Host Hardware Configuration
• CPU
• There’s no such thing as too much CPU
• Jobs typically do not saturate their cores, so raw clock speed is
not at a premium
• Cost and Budget are the major factors here
• Memory
• You really don’t want to overcommit and swap
• Java heaps should fit into physical RAM with additional space for
OS and non-Hadoop processes
23© Cloudera, Inc. All rights reserved.
Host Configuration (cont.)
• Disk
• More spindles == More I/O capacity
• Larger drives == lower cost per TB
• More hosts with less capacity increases parallelism and
decreases re-replication costs when replacing a host
• Fewer hosts with more capacity generally means lower unit cost
• Rule of thumb: One disk per two configured YARN vcores
• Lower latency disks are generally not a good investment, except
for specific use-cases where random I/O is important
24© Cloudera, Inc. All rights reserved.
A Hadoop Machine
(Diagram: a single machine running Hadoop daemons in a JVM on Linux over local disk, CPU, and memory)
25© Cloudera, Inc. All rights reserved.
A Hadoop Rack
(Diagram: a rack of machines — each running Hadoop daemons in a JVM on Linux over disk, CPU, and memory — connected to a top-of-rack switch)
26© Cloudera, Inc. All rights reserved.
A Hadoop Cluster
(Diagram: a cluster of racks — each with its own top-of-rack switch and machines — connected by a backbone switch)
27© Cloudera, Inc. All rights reserved.
(Diagram: two such clusters, each with a backbone switch and multiple racks of machines, connected to each other over a WAN)
28© Cloudera, Inc. All rights reserved.
Rack Configuration
• Common: 10Gb or (20Gb bonded) to the server, 40Gb to the spine
• Cost sensitive: 2Gb bonded to the server, 10Gb to the spine
• This is likely to be a false economy, with the real potential of
being network bottlenecked with disks idle
• Look for 25/100 in the next couple of years
• Network I/O is generally consumed by reading/writing data from
disks on other nodes
• Teragen is a useful benchmark for network capacity
• Typically 3-9x more intensive than normal workloads
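A minimal way to exercise the network with Teragen/Terasort (a sketch — the jar path matches the CDH parcel layout used later in this deck; row counts and output paths are illustrative):
# generate ~100 GB of synthetic data (1B rows x 100 bytes), then sort it
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen 1000000000 /benchmarks/teragen
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort /benchmarks/teragen /benchmarks/terasort
# watch network throughput on a worker while it runs
$ sar -n DEV 5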
29© Cloudera, Inc. All rights reserved.
Software Installation
• Linux Distribution
• Operating System Configuration
• Hadoop Distribution
• Distribution Lifecycle Management
30© Cloudera, Inc. All rights reserved.
Linux Distributions
• All enterprise distributions are credible choices
• Do you already buy support from a vendor?
• Which distributions does your Hadoop distro support?
• What are you already familiar with
• Cloudera supports
• RHEL 5.x, 6.x
• Ubuntu 12.04, 14.04
• Debian 6.x, 7.x
• SLES 11
• RHEL 6.x is the most common
31© Cloudera, Inc. All rights reserved.
Operating System Configuration
• Turn off IPTables (or any other firewall tool)
• Turn off SELinux
• Turn down swappiness to 10 or less
• Turn off Transparent Huge Page Compaction
• Use a network time source to keep all hosts in sync (also timezones!)
• Make sure forward and reverse DNS work on each host to resolve all
other hosts consistently
• Use the Oracle JDK - OpenJDK is subtly different and may lead you to
grief
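A rough sketch of applying the settings above on a RHEL/CentOS 6 host (illustrative only — service names and sysfs paths vary by distro; on RHEL 6 the THP path lives under redhat_transparent_hugepage):
# firewall and SELinux off (cluster-internal hosts)
$ sudo chkconfig iptables off && sudo service iptables stop
$ sudo setenforce 0            # and set SELINUX=disabled in /etc/selinux/config
# swappiness and THP compaction
$ echo 'vm.swappiness = 10' | sudo tee -a /etc/sysctl.conf && sudo sysctl -p
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# time sync and DNS sanity
$ sudo service ntpd start && sudo chkconfig ntpd on
$ host $(hostname -f) && host $(hostname -i)   # forward and reverse lookups should agree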
32© Cloudera, Inc. All rights reserved.
Cloudera
Manager
provides a Host
Inspector to
check for these
situations
33© Cloudera, Inc. All rights reserved.
Hadoop Distributions
• Well, we’re obviously biased here
• CDH is a jolly good Hadoop Distro. We recommend it
• You’re free to try others
• While it’s technically possible to have a go at running a cluster
without using any management tools, it’s not going to be fun and
we’re not going to talk about that much
34© Cloudera, Inc. All rights reserved.
Distribution Lifecycle Management
• Theoretically, you might want to just install the binaries for services
and programs you are running on a specific node
• But honestly, space is not at that much of a premium
• Install everything everywhere and don’t worry about it again
• If you decide to alter the footprint of a service, you don’t need to
worry about binaries
• You don’t need different profiles for different hosts in the cluster
• Cloudera Manager works this way
35© Cloudera, Inc. All rights reserved.
Distribution Lifecycle Management (cont.)
• For CDH, Cloudera Manager can handle lifecycle management
through parcels
• For package based installations, there are a variety of recognised
options
• Puppet, Chef, Ansible, etc
• But please use something
• Long term package management by hand is a recipe for disaster
• Exposes you to having inconsistencies between hosts
• Missing Packages
• Un-upgraded Packages
36© Cloudera, Inc. All rights reserved.
Lifecycle
Management
with Parcels in
Cloudera
Manager
37© Cloudera, Inc. All rights reserved.
Installation with
Cloudera
Manager
38© Cloudera, Inc. All rights reserved.
Installation with
Cloudera
Manager
39© Cloudera, Inc. All rights reserved.
Installation with
Cloudera
Manager
40© Cloudera, Inc. All rights reserved.
Installation with
Cloudera
Manager
41© Cloudera, Inc. All rights reserved.
Launch
• Initial Configuration
• Sanity testing
• Security Considerations
42© Cloudera, Inc. All rights reserved.
Initial Configuration
• Recall our earlier discussion of hardware
• YARN vcores (or MapReduceV1 slots) proportional to physical cores
• Typically 1:1, or 1.5:1 with hyperthreading
• Heap sizes
• Don’t overcommit on memory, but don’t make Java heaps too large (GC)
• See Memory table in Appendix
• Mounting data disks
• One partition per disk
• No RAID
• Use a well established filesystem (ext4)
• Use a uniform naming scheme
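For example, preparing one data disk per the guidance above (a sketch — device names and mount points are placeholders):
# one ext4 partition per disk, no RAID, mounted noatime under a uniform scheme
$ sudo mkfs -t ext4 /dev/sdb
$ sudo mkdir -p /data/1
$ echo '/dev/sdb  /data/1  ext4  defaults,noatime  0 0' | sudo tee -a /etc/fstab
$ sudo mount /data/1    # repeat for /data/2, /data/3, ... on each additional disk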
43© Cloudera, Inc. All rights reserved.
Initial Configuration (cont.)
• Think about space on the OS partition(s)
• /var is where your logs go by default
• /opt is where Cloudera Manager stores parcels
• By default, these are part of / on modern distros, which
works out fine
• Your IT policies may require separate partitions
• If these areas are too small, then you’ll need to change
these configurations
44© Cloudera, Inc. All rights reserved.
Cloudera Manager can help
• Assigns roles based on cluster size
• Tries to detect masters based on physical
capabilities
• Sets vcore count based on detected host CPUs
• Sets heap sizes to avoid overcommitting RAM
• Autodetects mounted drives and assigns them for
DataNodes
45© Cloudera, Inc. All rights reserved.
Initial Service
Configuration in
Cloudera
Manager
46© Cloudera, Inc. All rights reserved.
Implications of Services in Use
• Different services running concurrently means we need to
consider how resources are shared between them
• Services that use YARN can be managed through YARN’s
scheduler
• But certain services do not - most visibly HBase, but also
services like Accumulo or Flume
• These services can run on a shared cluster through the use
of static resource partitioning with cgroups
• Cloudera Manager can configure these
47© Cloudera, Inc. All rights reserved.
Dynamic
Resource
Management
48© Cloudera, Inc. All rights reserved.
Static Resource
Management
49© Cloudera, Inc. All rights reserved.
Sanity Testing
• Use the basic sanity tests provided by each service
• Submit example jobs: pi, sleep, teragen/terasort
• Work with sample tables in Hive/Impala
• Use Hue to do these things if your users will
• Repeat these tests when turning on
security/authentication/authorization mechanisms
• Make sure they succeed for the expected users and fail for
others
• Cloudera Manager provides some ‘canary’ tests for certain services
• HDFS create/read/write/delete
• HBase, ZooKeeper, etc.
50© Cloudera, Inc. All rights reserved.
HDFS Health
tests
51© Cloudera, Inc. All rights reserved.
Hue Examples
52© Cloudera, Inc. All rights reserved.
Appendix
53© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Very Small (Up to 10 workers, No High Availability)
• 1 Master Node
• NameNode, YARN Resource Manager, Job History Server,
ZooKeeper, Impala StateStore
• 1 Utility/Edge Node
• Secondary NameNode, Cloudera Manager, Hive Metastore,
HiveServer2, Impala Catalog, Hue, Oozie, Flume, Relational
Database, Gateway configurations
• 3-10 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
54© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Small (Up to 20 workers, High Availability)
• 2 Master Nodes
• NameNode (with JournalNode and FailoverController), YARN Resource
Manager, ZooKeeper
• (1 Node each) Job History Server, Impala StateStore
• 1 Utility/Edge Node
• Cloudera Manager, Hive Metastore, HiveServer2, Impala Catalog, Hue,
Oozie, Flume, Relational Database, Gateway configurations
• (requires dedicated spindle) Zookeeper, JournalNode
• 3-20 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
55© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Medium (Up to 200 workers, High Availability)
• 3 Master Nodes
• (3 Nodes) Zookeeper, JournalNode
• (2 Nodes each) NameNode (with FailoverController), YARN Resource Manager
• (1 Node each) Job History Server, Impala StateStore
• 2 Utility Nodes
• Node 1: Cloudera Manager, Relational Database
• Node 2: CM Management Service, Hive Metastore, Catalog Server, Oozie
• 1+ Edge Nodes
• Hue, HiveServer2, Flume, Gateway configuration
• 50-200 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
56© Cloudera, Inc. All rights reserved.
Detailed Role Assignments
• Large (Up to 500 workers, High Availability)
• 5 Master Nodes
• (5 Nodes) Zookeeper, JournalNode
• (2 Nodes) NameNode (with FailoverController)
• (2 different Nodes) YARN Resource Manager
• (1 Node each) Job History Server, Impala StateStore
• 2 Utility Nodes
• Node 1: Cloudera Manager, Relational Database
• Node 2: CM Management Service, Hive Metastore, Catalog Server, Oozie
• 1+ Edge Nodes
• Hue, HiveServer2, Flume, Gateway configuration
• 200-500 Worker Nodes
• DataNode, NodeManager, Impalad, Llama
57© Cloudera, Inc. All rights reserved.
Memory Allocation
Item                          RAM Allocated
Operating System Overhead     2 GB (minimum)
DataNode                      1-4 GB
YARN NodeManager              1 GB
YARN ApplicationMaster        1 GB
YARN Map/Reduce Containers    1-2 GB per container
HBase RegionServer            4-12 GB
Impala                        128 GB (can be reduced with spill-to-disk)
Questions?
59© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Configuration
Philip Zeyliger
60© Cloudera, Inc. All rights reserved.
Agenda
• Mechanics
• Key configurations
• Resource Management
• Configurations...are...living documents!
61© Cloudera, Inc. All rights reserved.
What’s there to configure, anyway?
• On a 100-node cluster, there are likely 400+ (HDFS datanode, Yarn
nodemanager, HBase RegionServer, Impalad) processes running. Each has
environment variables, config files, and command line options!
• Where your daemons are running is configuration too, but it is often implicit.
Moving a role from machine A to machine B is a configuration change!
• Most (but not all) settings require a restart to take effect.
• Different kinds of restarts (e.g., rolling)
• A management tool will help you with “scoping.” Some configurations must
be the same globally (e.g., kerberos), some make sense within a service
(HDFS Trash), some per-daemon
62© Cloudera, Inc. All rights reserved.
$ vi /etc/hadoop/conf/hdfs-site.xml
• Configs are key-value pairs, in a straightforward if verbose XML format
• When in doubt, place configuration everywhere, since you might not know
whether the client reads it or which daemons read it.
• Dear golly, use configuration management.
(This is one of many config files!)
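For reference, a single key-value pair in hdfs-site.xml looks like this (dfs.replication is just an illustrative property):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>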
63© Cloudera, Inc. All rights reserved.
Editing a configuration
• Quick demo!
64© Cloudera, Inc. All rights reserved.
Step One: edit a config
Step Two: save
65© Cloudera, Inc. All rights reserved.
Step Three: at top-level, note that restart is needed
Step Four: review changes
Step five: restart
66© Cloudera, Inc. All rights reserved.
Show me the files!
• core-site.xml, hdfs-site.xml, dfs_hosts_allow, dfs_hosts_exclude, hbase-site.xml,
hive-env.sh, hive-site.xml, hue.ini, mapred-site.xml, oozie-site.xml,
yarn-site.xml, zoo.cfg (and so on)
• e.g., /var/run/cloudera-scm-agent/process/*-NAMENODE
67© Cloudera, Inc. All rights reserved.
Let’s take a look!
• Demo: find the files via UI
68© Cloudera, Inc. All rights reserved.
How to double-check?
• http://ec2-54-209-51-178.compute-1.amazonaws.com:50070/conf
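You can also pull the effective configuration straight off a running daemon; for example (hostname is a placeholder, 50070 is the default NameNode web port):
$ curl -s http://namenode.example.com:50070/conf | grep -A1 'dfs.replication'
# the matching <value> element appears next to (or right after) the <name> element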
69© Cloudera, Inc. All rights reserved.
Key Configuration Themes (everyone)
• Security (Kerberos, SSL, ACLs?)
• Ports
• Local Storage
• JVM (use Oracle JDK 1.7)
• Databases (back them up)
• Heap Sizes (different workload? → different config!)
70© Cloudera, Inc. All rights reserved.
Rack Topology
• Hadoop cares about racks because:
• Shared failure zone
• Network bandwidth
• When operating a large cluster, tell Hadoop (by use of a rack locality script)
what machines are in which rack.
71© Cloudera, Inc. All rights reserved.
Networking
• Does your DNS work?
• Like, forwards and backwards?
• For sure?
• Have you really checked?
72© Cloudera, Inc. All rights reserved.
HDFS Configurations of Note
• Heap sizes:
• Datanodes: linear in # of blocks
• Namenode: linear in # of blocks and # of files
select blocks_total, jvm_heap_used_mb
where roletype=DATANODE and hostname RLIKE "hodor-016.*"
73© Cloudera, Inc. All rights reserved.
HDFS Configurations of Note
• Local Data Directories
• dfs.datanode.data.dir: one per spindle; avoid RAID
• dfs.namenode.name.dir: two copies of your metadata better than one
• High Availability
• Requires more daemons
• Two failover controllers co-located with namenodes
• Three journal nodes
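As a concrete, illustrative example of those two settings in hdfs-site.xml (directory names are placeholders):
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>  <!-- one directory per spindle -->
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn</value>  <!-- at least two copies of the metadata -->
</property>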
74© Cloudera, Inc. All rights reserved.
Yarn Configurations of Note
• Yarn doles out resources to applications across two axes: memory and CPU
• Define per-NodeManager resources
• yarn.nodemanager.resource.cpu-vcores
• yarn.nodemanager.resource.memory-mb
(Diagram: a NodeManager's pool of memory and vcores carved into containers — e.g. 1 GB/1 core, 2 GB/2 cores, and 1 GB/2 cores)
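A sketch of those two per-NodeManager settings in yarn-site.xml, for a worker with (say) 12 cores and 48 GB of RAM left after the OS and other daemons — the numbers are illustrative:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>  <!-- 48 GB -->
</property>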
75© Cloudera, Inc. All rights reserved.
Resource management in YARN
• Fannie and Freddie go in on a large cluster together. Fannie ponies up 75%
of the budget; Freddie ponies up 25%.
• When cluster is idle, let Freddie use 100%
• When Fannie has a lot of work, Freddie can only use 25%.
• The “Fair Scheduler” implements “DRF” to share the cluster fairly across
CPU and memory resources.
• Configured by an “allocations.xml” file.
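A minimal allocations file for the Fannie/Freddie example above might look like this (a sketch — queue names and weights are illustrative; DRF is selected as the scheduling policy):
<allocations>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <queue name="fannie">
    <weight>3.0</weight>   <!-- 75% of the cluster when both queues are busy -->
  </queue>
  <queue name="freddie">
    <weight>1.0</weight>   <!-- 25% under contention, up to 100% when idle -->
  </queue>
</allocations>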
76© Cloudera, Inc. All rights reserved.
Quick Demo
• Yarn Applications
77© Cloudera, Inc. All rights reserved.
Configs as live documents
• Configurations that have to do with workloads will evolve.
(Diagram: Configs and Metrics)
Hokey-Pokey
Break
(You put your
right CPU in…)
79© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Troubleshooting
Kathleen Ting
80© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
81© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
82© Cloudera, Inc. All rights reserved.
Understanding Normal
• Establish normal
– Boring logs are good
• So that you can detect abnormal
– Find anomalies and outliers
– Compare performance
• And then isolate root causes
– Who are the suspects
– Pull the thread by interrogating
– Correlate across different subsystems
BORING
INFRASTRUCTURE
83© Cloudera, Inc. All rights reserved.
Basic tools
• What happened?
– Logs
• What state are we in now?
– Metrics
– Thread stacks
• What happens when I do this?
– Tracing
– Dumps
• Is it alive?
– listings
– Canaries
• Is it OK?
– fsck / hbck
84© Cloudera, Inc. All rights reserved.
Diagnostics for single machines
(Diagram: a single machine — Linux over disk, CPU, and memory)
Analysis and Tools via @brendangregg
85© Cloudera, Inc. All rights reserved.
Diagnosis Tools
           HW/Kernel
Logs       /log, /var/log, dmesg
Metrics    /proc, top, iotop, sar, vmstat, netstat
Tracing    strace, tcpdump
Dumps      Core dumps
Liveness   ping
Corrupt    fsck
86© Cloudera, Inc. All rights reserved.
Diagnostics for the JVM
• Most Hadoop services run in the Java VM, which introduces new Java subsystems
– Just in time compiler
– Threads
– Garbage collection
• Dumping Current threads
– jstacks (any threads stuck or blocked?)
• JVM Settings:
– Enabling GC Logs: -XX:+PrintGCDateStamps -XX:+PrintGCDetails
– Enabling OOME Heap dumps: -XX:+HeapDumpOnOutOfMemoryError
• Metrics on GC
– Get gc counts and info with jstat
• Memory Usage dumps
– Create heap dump: jmap -dump:live,format=b,file=heap.bin <pid>
– View heap dump from crash or jmap with jhat
(Diagram: the JVM layer sits between the Hadoop daemons and Linux on each machine)
87© Cloudera, Inc. All rights reserved.
Diagnosis Tools
           HW/Kernel                                 JVM
Logs       /log, /var/log, dmesg                     Enable GC logging
Metrics    /proc, top, iotop, sar, vmstat, netstat   jstat
Tracing    strace, tcpdump                           Debugger, jdb, jstack
Dumps      Core dumps                                jmap, enable OOME dump
Liveness   ping                                      jps
Corrupt    fsck
88© Cloudera, Inc. All rights reserved.
Tools for lots of machines
(Diagram: many machines — each running Hadoop daemons in JVMs on Linux — alongside other systems such as httpd, SAS, and custom apps)
89© Cloudera, Inc. All rights reserved.
Tools for lots of machines
(Same diagram, annotated with Ganglia, Nagios / Ganglia*, Nagios*)
90© Cloudera, Inc. All rights reserved.
Ganglia
• Collects and aggregates metrics from machines
• Good for what’s going on right now and describing normal perf
• Organized by physical resource (CPU, Mem, Disk)
– Good defaults
– Good for pinpointing machines
– Good for seeing overall utilization
– Uses RRDTool under the covers
• Some data scalability limitations, lossy over time.
• Dynamic for new machines, requires config for new metrics
91© Cloudera, Inc. All rights reserved.
Graphite
• Popular alternative to Ganglia
• Can handle the scale of metrics coming
in
• Similar to Ganglia, but uses its own RRD
database.
• More aimed at dynamic metrics (as
opposed to statically defined metrics)
92© Cloudera, Inc. All rights reserved.
Nagios
• Provides alerts from canaries and basic health checks for services on
machines
• Organized by Service (httpd, dns, etc)
• Defacto standard for service monitoring
• Lacks distributed-system know-how
– Requires bespoke setup for service slaves and masters
– Lacks details with multi-tenant services or short-lived jobs
93© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
(Diagram: the machines and other systems — httpd, SAS, custom apps — annotated with Ganglia, Nagios / Ganglia*, Nagios*)
94© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
(Diagram: the full Hadoop stack — HDFS, HBase, YARN, MapReduce, Spark, Impala, Solr, ZooKeeper, Sentry, Flume/Kafka, Sqoop, Hue, languages/APIs — layered over the machines, with Ganglia and Nagios annotating the machine layer)
95© Cloudera, Inc. All rights reserved.
Diagnostics for the Hadoop Stack
• Single client call can trigger many RPCs spanning many machines
• Systems are evolving quickly
• A failure on one daemon, by design, does not cause failure of the entire service
• Logs:
– Each service’s master and slaves have their own logs: /var/log/
– There are a lot of logs and they change frequently
• Metrics:
– Each daemon offers metrics, often aggregated at masters
• Tracing:
– HTrace (integrated into Hadoop, HBase; currently in Apache Incubator)
• Liveness:
– Canaries, service/daemon web UIs
96© Cloudera, Inc. All rights reserved.
Diagnosis Tools
           HW/Kernel                                 JVM                      Hadoop Stack
Logs       /log, /var/log, dmesg                     Enable GC logging        *:/var/log/hadoop|hbase|…
Metrics    /proc, top, iotop, sar, vmstat, netstat   jstat                    *:/stacks, *:/jmx
Tracing    strace, tcpdump                           Debugger, jdb, jstack    HTrace
Dumps      Core dumps                                jmap, enable OOME dump   *:/dump
Liveness   ping                                      jps                      Web UI
Corrupt    fsck                                                               HDFS fsck, HBase hbck
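The *:/stacks, *:/jmx, and *:/dump entries above are HTTP endpoints on each daemon's web port; for example, against a DataNode (hostname is a placeholder, 50075 is the default DataNode web port):
$ curl -s http://worker01.example.com:50075/jmx | head      # daemon metrics as JSON
$ curl -s http://worker01.example.com:50075/stacks | head   # thread stacks of the daemon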
97© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
(Diagram: the full Hadoop stack over the machines, again annotated with Ganglia, Nagios / Ganglia*, Nagios*)
98© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
(Diagram: the same stack, annotated with where the diagnostic data comes from — logs and metrics from every service and daemon, HTrace for tracing, custom app logs, and /proc, dmesg, /var/log, jmx, jstack, jstat, and GC logs at the machine/JVM layer)
99© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
(Diagram: the same stack with Ganglia*, Nagios* at the machine layer — and a question mark over the Hadoop services above them)
100© Cloudera, Inc. All rights reserved.
Ganglia and Nagios are not enough
• A fault-tolerant distributed system masks many problems (by design!)
– Some failures are not critical – the failure condition is more complicated
• They lack distributed-system know-how
– Require bespoke setup for service slaves and masters
– Lack detail with multi-tenant services or short-lived jobs
• Hadoop services are logically dependent on each other
– Need to correlate metrics across different services and machines
– Need to correlate logs from different services and machines
– These are young systems whose logs change frequently
– What about all the logs?
– Which of all these metrics do we really need?
• Some data scalability limitations, lossy over time
– What about full fidelity?
101© Cloudera, Inc. All rights reserved.
OpenTSDB
• OpenTSDB (Time Series Database)
– Efficiently stores metric data into HBase
– Keeps data at full fidelity
– Keep as much data as your HBase instance
can handle.
• Free and Open Source
102© Cloudera, Inc. All rights reserved.
Cloudera Manager
• Extracts hardware, OS, and Hadoop service metrics
specifically relevant to Hadoop and its related services.
– Operation Latencies
– # of disk seeks
– HDFS data written
– Java GC time
– Network IO
• Provides
– Cluster preflight checks
– Basic host checks
– Regular health checks
• Uses LevelDB for underlying metrics storage
• Provides distributed log search
• Monitors logs for known issues
• Point and click for useful utils (lsof, jstack, jmap)
• Free
103© Cloudera, Inc. All rights reserved.
Tools for the Hadoop Stack
(Diagram: the same stack, now annotated with Ganglia and Nagios at the machine layer; full-fidelity metrics via OpenTSDB*; metrics + logs via Cloudera Manager; and log search a la Splunk)
104© Cloudera, Inc. All rights reserved.
Activity
How many file descriptors do the datanodes have open?
What is the current latency of the HDFS canary?
105© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
106© Cloudera, Inc. All rights reserved.
The Law of Cluster Inertia
A cluster in a good state stays in a good
state,
and
a cluster in a bad state stays in a bad state,
unless
acted upon by an external force.
107© Cloudera, Inc. All rights reserved.
108© Cloudera, Inc. All rights reserved.
External Forces
• Failures
• Acts of God
• Users
• Admins
109© Cloudera, Inc. All rights reserved.
Failures as an External Force
(Diagram: the full stack across the cluster, highlighting the things that can fail — a machine's hardware, a JVM, a slave daemon, a job)
110© Cloudera, Inc. All rights reserved.
Acts of God as an External Force
(Diagram: the cluster — backbone switch, racks of machines behind top-of-rack switches, WAN uplink — hit by a power failure)
111© Cloudera, Inc. All rights reserved.
Users as an external force
• Use Security to protect systems from users
– Prevent and track
• Authentication – proving who you are
– LDAP, Kerberos
• Authorization – deciding what you are allowed to do
– Apache Sentry (incubating), Hadoop security, HBase security
• Audit – who and when was something done?
– Cloudera Navigator
112© Cloudera, Inc. All rights reserved.
Admins as an external force
Upgrades
• Linux
• Hadoop
• Java
Misconfiguration
• Memory Mismanagement
– TT OOME
– JT OOME
– Native Threads
• Thread Mismanagement
– Fetch Failures
– Replicas
• Disk Mismanagement
– No File
– Too Many Files
113© Cloudera, Inc. All rights reserved.
Troubleshooting
Managing Hadoop Clusters
Troubleshooting Hadoop Systems
Debugging Hadoop Applications
114© Cloudera, Inc. All rights reserved.
Example application pipeline with strict SLAs
(Diagram: repeated ingest → process → export cycles along a timeline)
115© Cloudera, Inc. All rights reserved.
Example Application Pipeline - Ingest
(Diagram: a custom app feeds event data through the event-ingest layer (Flume, Kafka) into HBase, on top of HDFS, ZooKeeper, and Sentry, across the cluster's machines)
116© Cloudera, Inc. All rights reserved.
Example Application Pipeline - processing
(Diagram: a MapReduce job, scheduled by YARN, generates a new artifact from HBase data and writes it to HDFS)
117© Cloudera, Inc. All rights reserved.
Example Application Pipeline - export
(Diagram: Sqoop exports the result from HDFS into a relational database to serve the application)
118© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
Symptom:
After an upgrade, activity
on the cluster eventually
began to slow down and
the job queue
overflowed.
119© Cloudera, Inc. All rights reserved.
Finding the right part of the stack
E-SPORE (from Eric Sammer’s Hadoop Operations)
• Environment
– What is different about the environment now from the last time everything worked?
• Stack
– The entire cluster also has shared dependency on data center infrastructure such as the network, DNS, and other services.
• Patterns
– Are the tasks from the same job? Are they all assigned to the same tasktracker? Do they all use a shared library that was
changed recently?
• Output
– Always check log output for exceptions but don’t assume the symptom correlates to the root cause.
• Resources
– Do local disks have enough space? Is the machine swapping? Does the network utilization look normal? Does the CPU utilization look normal?
• Event correlation
– It’s important to know the order in which the events led to the failure.
120© Cloudera, Inc. All rights reserved.
Example Application Pipeline - processing
(Diagram: the processing phase again — the MapReduce job on YARN that reads HBase data and writes a new artifact to HDFS)
121© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
Evidence:
Isolated to Processing phase (MR).
In TT Logs, found an innocuous but
anomalous log entry about “fetch
failures.”
Many users had run into this MR
problem using different versions of
MR.
Workaround provided: remove the
problem node from the cluster.
122© Cloudera, Inc. All rights reserved.
(Diagram: the single-machine stack — Hadoop daemons, JVM, Linux, disk/CPU/memory)
123© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
Root cause:
All MR versions had a
common dependency on
a particular version of
Jetty (Jetty 6.1.26).
Dev was able to
reproduce and fix the
bug in Jetty.
124© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
Symptom:
After an upgrade, system
CPU usage peaked at
30% or more of the total
CPU usage.
125© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
Evidence:
Used tracing tools to show that
a majority of time was
inexplicably spent in
virtual-memory calls.
http://structureddata.org/2012/06/18/linux-6-
transparent-huge-pages-and-hadoop-workloads/
126© Cloudera, Inc. All rights reserved.
(Diagram: the single-machine stack — Hadoop daemons, JVM, Linux, disk/CPU/memory)
127© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
Root cause:
RHEL/CentOS 6.2, 6.3, and 6.4 and
SLES 11 SP2 have a feature called
"Transparent Huge Page (THP)"
compaction, which interacts
poorly with Hadoop workloads.
128© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
Symptom:
High CPU usage and a responsive
but sluggish cluster - even
non-Hadoop apps, e.g. MySQL.
30 customers all hit this at the
exact same time: 6/30/12 at
5pm PDT.
129© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
Evidence:
Checked the kernel
message buffer (ran
dmesg) and looked for
output confirming the
leap-second injection.
Other systems had the same
problem.
130© Cloudera, Inc. All rights reserved.
(Diagram: the single-machine stack — Hadoop daemons, JVM, Linux, disk/CPU/memory)
131© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
Root cause:
The Linux OS kernel
mishandled an added
leap second.
132© Cloudera, Inc. All rights reserved.
Similar symptoms, different problem
133© Cloudera, Inc. All rights reserved.
Case study 1: slow jobs after Hadoop upgrade
After an upgrade, activity on the
cluster eventually began to slow
down and the job queue
overflowed.
In TT Logs, found an innocuous but
anomalous log entry about “fetch
failures.”
Many users had run into this MR
problem using different versions of MR.
Workaround provided: remove the
problem node from the cluster.
All MR versions had a common
dependency on a particular
version of Jetty (Jetty 6.1.26).
Dev was able to reproduce and
fix the bug in Jetty.
Symptom Evidence Root Cause
134© Cloudera, Inc. All rights reserved.
Case study 2: slow jobs after Linux upgrade
After an upgrade, system CPU
usage peaked at 30% or more of
the total CPU usage.
The perf tool proved that the
majority of time was
inexplicably spent in virtual-memory
calls.
RHEL/CentOS 6.2, 6.3, and 6.4 and SLES 11
SP2 have a feature called "Transparent
Huge Page (THP)" compaction, which
interacts poorly with Hadoop workloads.
Symptom Evidence Root Cause
135© Cloudera, Inc. All rights reserved.
Case study 3: slow jobs at a precise moment
High CPU usage and a responsive
but sluggish cluster.
30 customers all hit this at the
exact same time.
Checked the kernel message
buffer (ran dmesg) and looked for
output confirming the leap-second
injection.
Linux OS kernel mishandled a leap
second added on 6/30/12 at 5pm PDT.
Symptom Evidence Root Cause
136© Cloudera, Inc. All rights reserved.
Lessons learned
More crucial than the
specific troubleshooting
methodology used is to
use one.
More crucial than the
specific tool used is the
type of data analyzed and
how it’s analyzed.
Capture for posterity in a
knowledge base article,
blog post, or conference
presentation.
Methodology Tools Learn from failure
Questions?
138© Cloudera, Inc. All rights reserved.
Apache Hadoop Operations
for Production Systems:
Enterprise Considerations
Miklos Christine
139© Cloudera, Inc. All rights reserved.
Scale Considerations
Ref: http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
140© Cloudera, Inc. All rights reserved.
Scale Considerations
• HDFS
• Namenode Heap Settings
• Namenode RPC Configurations
Property Name                        Default   Recommended
dfs.namenode.servicerpc-address      N/A       8022
dfs.namenode.handler.count           10        ln(# of DNs) * 20
dfs.namenode.service.handler.count   10        ln(# of DNs) * 20
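A quick worked example of the ln(# of DNs) * 20 rule of thumb for a 200-DataNode cluster:
$ python -c 'import math; print(int(math.log(200) * 20))'
105
# i.e. set the handler counts to roughly 100-110 threads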
141© Cloudera, Inc. All rights reserved.
Scale Considerations
• YARN
• ResourceManager High Availability
• yarn.resourcemanager.zk-address
• Application recovery
• yarn.resourcemanager.work-preserving-recovery.enabled
• yarn.nodemanager.recovery.dir
• User cache disk space
• yarn.nodemanager.local-dirs
142© Cloudera, Inc. All rights reserved.
Metrics: HDFS
• HDFS is the core of the platform.
What’s important?
• Is the Standby NN
checkpointing?
• Are the NNs garbage
collecting?
• Percentage of heap used at
steady state?
143© Cloudera, Inc. All rights reserved.
Logs are your friend
• Logs are verbose but necessary
• Namenode Logs:
• 10 * 200MB log files = 2GB
• 2GB of logs span 3 hours
• 3 days of logs = ~48GB
• Retain enough logs for debugging. Plan for the worst case
• Adjust log retention as the cluster grows
144© Cloudera, Inc. All rights reserved.
• Just reduce the log level to save space?
• NO!
• INFO logging is important!
• Application: Yarn containers write logs locally, then migrate to HDFS.
• Ensure application log space is sufficient
• Tool to Fetch System Logs for Root Cause Analysis
Logs are your friend
145© Cloudera, Inc. All rights reserved.
Logs are your friend
• GC Logging
• -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-Xloggc:/var/log/hdfs/nn-hdfs.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=20M
• Great resource:
• https://stackoverflow.com/questions/895444/java-garbage-collection-log-messages
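Those flags go into the daemon's Java options — for example, for the NameNode via hadoop-env.sh (or the equivalent Java options field in Cloudera Manager); the log path here is illustrative:
# hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-hdfs/nn-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M \
  $HADOOP_NAMENODE_OPTS"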
146© Cloudera, Inc. All rights reserved.
• jstack
• Use the same JDK
• Must be run as the user of the process
• kill -3 <PID>
• Dumps jstack to stdout
Debugging Techniques : Hung Process
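A minimal sketch of capturing a couple of thread dumps from a hung NameNode (assumes the daemon runs as the hdfs user; the pgrep pattern is an assumption about how the process appears on your host):
$ PID=$(pgrep -f 'NameNode' | head -1)
$ sudo -u hdfs jstack $PID > /tmp/nn-jstack-$(date +%s).txt
$ sleep 30 && sudo -u hdfs jstack $PID > /tmp/nn-jstack-$(date +%s).txt
# compare the two dumps: threads stuck in the same place are your suspects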
147© Cloudera, Inc. All rights reserved.
Debugging Techniques : Hung Process
148© Cloudera, Inc. All rights reserved.
Debugging Techniques : LogLevel
• Set the log level without process restarts
http://namenode.cloudera.com:50070/logLevel
• Scriptable
http://namenode.cloudera.com:50070/logLevel?log=org&level=DEBUG
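The same thing can be done from the command line with hadoop daemonlog (host and port follow the deck's example; the logger name is illustrative):
$ hadoop daemonlog -getlevel namenode.cloudera.com:50070 org.apache.hadoop.hdfs
$ hadoop daemonlog -setlevel namenode.cloudera.com:50070 org.apache.hadoop.hdfs DEBUG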
149© Cloudera, Inc. All rights reserved.
Debugging Techniques : Heap Analysis
• jstat -gcutil <PID> 1s 120
• Checks for current GC activity
• jmap -histo:live <PID>
• Get a histogram of the current objects within the heap
150© Cloudera, Inc. All rights reserved.
Sanity Tests
• Lots of tools available to end users
( Hive / Impala / MapReduce / HDFS Client / Cascading / Spark / … )
• Create sanity tests for each tool
• Document expected response / processing time
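For example, a couple of cheap, repeatable checks you might script and time (the jar path matches the parcel layout used in the exercises below; baselines are whatever you documented for your cluster):
# a tiny MR job plus an HDFS round trip; compare wall-clock time to your documented baseline
$ time hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100
$ echo "sanity $(date)" | hdfs dfs -put - /tmp/sanity.txt
$ hdfs dfs -cat /tmp/sanity.txt && hdfs dfs -rm /tmp/sanity.txt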
151© Cloudera, Inc. All rights reserved.
Controlled Usage
• How to prevent bad behavior from bringing down the cluster?
• HDFS Quotas
• Yarn FairScheduler Pools
• Hive / Impala Access Control with Sentry
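For instance, HDFS quotas can cap a user's directory by object count and raw space (the user and limits are illustrative):
$ sudo -u hdfs hdfs dfsadmin -setQuota 1000000 /user/alice        # max files + directories
$ sudo -u hdfs hdfs dfsadmin -setSpaceQuota 10t /user/alice       # max raw bytes (includes replication)
$ hdfs dfs -count -q /user/alice                                  # check current quota usage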
152© Cloudera, Inc. All rights reserved.
Failure Testing
• If the NN fails, how long does it take to recover given the average #
of edits?
• If RM HA failover were to occur, would jobs continue?
• What is the mean time to recovery for HBase when a RS dies?
• What properties can be tuned to improve this?
153© Cloudera, Inc. All rights reserved.
Security Considerations
• Securing communication channels within the cluster
• Kerberos
• Allows secure communication between hosts on an
untrusted network.
• Secures traffic between hosts in the cluster
• Provides authentication for users to services
• TLS
• Used to secure http interfaces
• Kerberos can be used to authenticate to these interfaces
with SPNEGO
154© Cloudera, Inc. All rights reserved.
Kerberos, Authentication and Authorization
• While often conflated, these are distinct concepts
• They are usually configured together, and we would recommend this,
but it’s not an absolute requirement
• Authentication: Having a user provide and prove their identity
• Authorization: Controlling what a user can access or do
155© Cloudera, Inc. All rights reserved.
Security and Authentication (cont)
• Setting up Kerberos is an exercise that’s beyond the scope of this
tutorial
• Main implementations: MIT Kerberos, Active Directory
• Typically LDAP (or AD) is used for user management
• Cloudera Manager can help you configure Kerberos for your services
156© Cloudera, Inc. All rights reserved.
Authentication
• Without Kerberos, users are typically identified as whatever Linux
system user their client application runs as.
• With Kerberos, the user will obtain a kerberos ticket (typically at
login time) that will be used to identify them to the cluster services
157© Cloudera, Inc. All rights reserved.
Authorization
• Even if you’re using an authentication mechanism to limit who can
connect to the various services, you probably want to control what
they can do. Without authorization, anyone can do anything
• Each service provides different authorization mechanisms. eg:
• YARN queues can be restricted to certain users
• HBase tables can be restricted to certain users (ACLs)
• The nature of cluster users will affect authorization requirements
• Are there different groups with different SLAs?
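For example, HBase table-level ACLs can be granted from the shell (assumes HBase authorization via the AccessController coprocessor is enabled; the user and table names are placeholders):
hbase(main):001:0> grant 'alice', 'RW', 'orders'
hbase(main):002:0> user_permission 'orders'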
158© Cloudera, Inc. All rights reserved.
Ask Me Anything on Hadoop Ops this Thursday
Date: Thursday, 10/01
Time: 2:05 – 2:45 pm
Location: 3D 05/08
159© Cloudera, Inc. All rights reserved.
Takeaway
A cluster in a good state stays in
a good state, and a cluster in a
bad state stays in a bad state,
unless acted upon by an
external force.
Cloudera has seen a lot of
diverse clusters and used that
experience to build tools to help
diagnose and understand how
Hadoop operates.
Similar symptoms can lead to
different root causes. Use tools
to assist with event correlation
and pattern determination.
Anatomy of a Hadoop System Managing Hadoop Clusters Troubleshooting Hadoop
Applications
160© Cloudera, Inc. All rights reserved.
Exercises
• Configs and Restarting - Phil Z
• Monitoring and Health Tests - Phil L
• Spark & MR - Miklos
• Ingest Using Apache Sqoop - Kate
161© Cloudera, Inc. All rights reserved.
Configurations
• Edit the “NameNode Port”
• What will need to be restarted? (Almost everything)
• Restart it!
• Let’s find the underlying files on the FS
• /var/run/cloudera-scm-agent/...
162© Cloudera, Inc. All rights reserved.
Health Tests
• Look at some basic health test results
• Number of missing blocks in HDFS (you want this to be zero!)
• Number of times the Datanode process has exited unexpectedly
• Let’s kill a Datanode
• CM automatically restarts it
• Look at our health tests again
• Let’s kill a Datanode a lot
• CM will back off if restarts seem to be going nowhere
• Now what do we see?
• Let’s get it running again
163© Cloudera, Inc. All rights reserved.
Configuring for a multi-use cluster
● Log into the Cloudera Manager instance
● Go to the Yarn service
● Configure the following property to 2g
yarn.scheduler.maximum-allocation-mb
○ This determines the max size per requested container
● Restart the Yarn service
MapReduce / Spark Example
164© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 1
We will run 2 jobs on the platform.
● Log into the system using the provided key
● Run the following test pi MapReduce job
$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
    pi -Dmapreduce.map.memory.mb=2048 5 5000
165© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 2
Attempt to run the following:
● Log into the system using the provided key
● Run the following test pi Spark job
$ spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.yarn.executor.memoryOverhead=1g \
    --master yarn-client \
    --executor-memory 2g \
    /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10
166© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 2
• What happened and why?
167© Cloudera, Inc. All rights reserved.
MapReduce / Spark Example - 2
Exception in thread "main" java.lang.
IllegalArgumentException:
Required executor memory (2048+384 MB) is above the max
threshold (2048 MB) of this cluster!
• Where does the 384 MB come from?
168© Cloudera, Inc. All rights reserved.
Try Cloudera Live today—cloudera.com/live
169© Cloudera, Inc. All rights reserved.
Cloudera Live Tutorial
• “Getting Started” tutorial for Apache Hadoop:
http://<IP Address>/#/tutorial/home
OR
http://www.cloudera.com/content/cloudera/en/developers/home/developer-admin-resources/get-started-with-hadoop-tutorial/exercise-1.html
• Load relational and clickstream data into HDFS
• Use Apache Avro to serialize/prepare that data for analysis
• Create Apache Hive tables
• Query those tables using Hive or Impala
• Index the clickstream data for business users/analysts
170© Cloudera, Inc. All rights reserved.
Exercise 1: Ingest Using Apache Sqoop
(Diagram: Sqoop import — (1) fetch table metadata, (2) submit MR job, (3) transfer data via map tasks)
171© Cloudera, Inc. All rights reserved.
Exercise 1: Ingest Using Apache Sqoop Explained
> sqoop import-all-tables \
  -m {{cluster_data.worker_node_hostname.length}} \
  --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
  --username=retail_dba \
  --password=cloudera \
  --compression-codec=snappy \
  --as-parquetfile \
  --warehouse-dir=/user/hive/warehouse \
  --hive-import
172© Cloudera, Inc. All rights reserved.
Join the Discussion
Get community
help or provide
feedback
cloudera.com/community
Questions?

Single node hadoop cluster installation Single node hadoop cluster installation
Single node hadoop cluster installation Mahantesh Angadi
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Unit testing Agile OpenSpace
Unit testing Agile OpenSpaceUnit testing Agile OpenSpace
Unit testing Agile OpenSpaceAndrei Savu
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and ClouderaJoey Echeverria
 
CDH5最新情報 #cwt2013
CDH5最新情報 #cwt2013CDH5最新情報 #cwt2013
CDH5最新情報 #cwt2013Cloudera Japan
 
Recommendation Engine using Apache Mahout
Recommendation Engine using Apache MahoutRecommendation Engine using Apache Mahout
Recommendation Engine using Apache MahoutAmbarish Hazarnis
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installationSumitra Pundlik
 
Introducing Cloudera Director at Big Data Bash
Introducing Cloudera Director at Big Data BashIntroducing Cloudera Director at Big Data Bash
Introducing Cloudera Director at Big Data BashAndrei Savu
 
YARN High Availability
YARN High AvailabilityYARN High Availability
YARN High AvailabilityCloudera, Inc.
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Extending and Automating Cloudera Manager via API
Extending and Automating Cloudera Manager via APIExtending and Automating Cloudera Manager via API
Extending and Automating Cloudera Manager via APIClouderaUserGroups
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera, Inc.
 
Samsung’s First 90-Days Building a Next-Generation Analytics Platform
Samsung’s First 90-Days Building a Next-Generation Analytics PlatformSamsung’s First 90-Days Building a Next-Generation Analytics Platform
Samsung’s First 90-Days Building a Next-Generation Analytics PlatformCloudera, Inc.
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 
Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera managerChris Westin
 

Destaque (20)

Pepperdata's Real-time Hadoop Cluster Optimization
Pepperdata's Real-time Hadoop Cluster OptimizationPepperdata's Real-time Hadoop Cluster Optimization
Pepperdata's Real-time Hadoop Cluster Optimization
 
Tech lab 2016-ep02-pepper-data-webinar-02-dez-slides-20160503-final
Tech lab 2016-ep02-pepper-data-webinar-02-dez-slides-20160503-finalTech lab 2016-ep02-pepper-data-webinar-02-dez-slides-20160503-final
Tech lab 2016-ep02-pepper-data-webinar-02-dez-slides-20160503-final
 
AnalyzingMovieData and Business Intelligence
AnalyzingMovieData and Business IntelligenceAnalyzingMovieData and Business Intelligence
AnalyzingMovieData and Business Intelligence
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data Meetup
 
Single node hadoop cluster installation
Single node hadoop cluster installation Single node hadoop cluster installation
Single node hadoop cluster installation
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Unit testing Agile OpenSpace
Unit testing Agile OpenSpaceUnit testing Agile OpenSpace
Unit testing Agile OpenSpace
 
Apache Accumulo and Cloudera
Apache Accumulo and ClouderaApache Accumulo and Cloudera
Apache Accumulo and Cloudera
 
CDH5最新情報 #cwt2013
CDH5最新情報 #cwt2013CDH5最新情報 #cwt2013
CDH5最新情報 #cwt2013
 
Recommendation Engine using Apache Mahout
Recommendation Engine using Apache MahoutRecommendation Engine using Apache Mahout
Recommendation Engine using Apache Mahout
 
YARN High Availability
YARN High AvailabilityYARN High Availability
YARN High Availability
 
Cloudera hadoop installation
Cloudera hadoop installationCloudera hadoop installation
Cloudera hadoop installation
 
Introducing Cloudera Director at Big Data Bash
Introducing Cloudera Director at Big Data BashIntroducing Cloudera Director at Big Data Bash
Introducing Cloudera Director at Big Data Bash
 
YARN High Availability
YARN High AvailabilityYARN High Availability
YARN High Availability
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Extending and Automating Cloudera Manager via API
Extending and Automating Cloudera Manager via APIExtending and Automating Cloudera Manager via API
Extending and Automating Cloudera Manager via API
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
 
Samsung’s First 90-Days Building a Next-Generation Analytics Platform
Samsung’s First 90-Days Building a Next-Generation Analytics PlatformSamsung’s First 90-Days Building a Next-Generation Analytics Platform
Samsung’s First 90-Days Building a Next-Generation Analytics Platform
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera manager
 

Semelhante a Apache Hadoop Operations for Production Systems Installation

Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuiteEDB
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming ArchitecturesCloudera, Inc.
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 
NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5UniFabric
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineDataWorks Summit
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAlluxio, Inc.
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 

Semelhante a Apache Hadoop Operations for Production Systems Installation (20)

Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Postgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster SuitePostgres & Red Hat Cluster Suite
Postgres & Red Hat Cluster Suite
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5NGENSTOR_ODA_P2V_V5
NGENSTOR_ODA_P2V_V5
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmine
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 

Último

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Último (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Apache Hadoop Operations for Production Systems Installation

  • 15. 15© Cloudera, Inc. All rights reserved. How to learn these things... These systems are largely defined by the state they store on disk (what will survive a restart?) and the defined protocols (RPCs) they use to interact amongst themselves and with the outside world.
  • 17. 17© Cloudera, Inc. All rights reserved. Apache Hadoop Operations for Production Systems: Installation Philip Langdale
  • 18. 18© Cloudera, Inc. All rights reserved. Agenda • Hardware Considerations • Node types and recommended role allocations • Host configuration • Rack configuration • Software Installation • OS Prerequisites • Installing Hadoop and other Ecosystem components • Launch • Initial Configuration • Sanity testing • Security considerations
  • 19. 19© Cloudera, Inc. All rights reserved. Hardware Considerations • As a distributed system, Hadoop is going to be deployed onto multiple interconnected hosts • How large will the cluster be? • What services will be deployed on the cluster? • Can all services effectively run together on the same hosts or is some form of physical partitioning required? • What role will each host play in the cluster? • This impacts the hardware profile (CPU, Memory, Storage, etc) • How should the hosts be networked together?
  • 20. 20© Cloudera, Inc. All rights reserved. Host Roles within a Cluster Master Node HDFS NameNode YARN ResourceManager HBase Master Impala StateStore ZooKeeper Worker Node HDFS DataNode YARN NodeManager HBase RegionServer Impalad Utility Node Relational Database Management (eg: CM) Hive Metastore Oozie Impala Catalog Server Edge Node Gateway Configuration Client Tools Hue HiveServer2 Ingest (eg: Flume) For larger clusters, roles will be spread across multiple nodes of a given type
  • 21. 21© Cloudera, Inc. All rights reserved. Roles vs Cluster Size (Master / Worker / Utility / Edge): Very Small (≤10 workers): 1 master, ≤10 workers, 1 shared utility/edge host; Small (≤20): 2 masters, ≤20 workers, 1 shared utility/edge host; Medium (≤200): 3 masters, ≤200 workers, 2 utility, 1+ edge; Large (≤500): 5 masters, ≤500 workers, 2 utility, 1+ edge
  • 22. 22© Cloudera, Inc. All rights reserved. Host Hardware Configuration • CPU • There’s no such thing as too much CPU • Jobs typically do not saturate their cores, so raw clock speed is not at a premium • Cost and Budget are the major factors here • Memory • You really don’t want to overcommit and swap • Java heaps should fit into physical RAM with additional space for OS and non-Hadoop processes
  • 23. 23© Cloudera, Inc. All rights reserved. Host Configuration (cont.) • Disk • More spindles == More I/O capacity • Larger drives == lower cost per TB • More hosts with less capacity increases parallelism and decreases re-replication costs when replacing a host • Fewer hosts with more capacity generally means lower unit cost • Rule of thumb: One disk per two configured YARN vcores • Lower latency disks are generally not a good investment, except for specific use-cases where random I/O is important
  • 24. 24© Cloudera, Inc. All rights reserved. A Hadoop Machine Machine JVM Linux Hadoop Daemons Disk, CPU, Mem
  • 25. 25© Cloudera, Inc. All rights reserved. A Hadoop Rack Rack Machine JVM Linux Hadoop Daemons Disk, CPU, Mem Machine JVM Linux Hadoop Daemons Disk, CPU, Mem Machine JVM Linux Hadoop Daemons Disk, CPU, Mem Top of Rack Switch
  • 26. 26© Cloudera, Inc. All rights reserved. A Hadoop Cluster (diagram: several racks, each a set of machines behind a top-of-rack switch, tied together by a cluster backbone switch)
  • 27. 27© Cloudera, Inc. All rights reserved. (diagram: multiple such clusters, each with its own backbone switch, connected to each other over the WAN)
  • 28. 28© Cloudera, Inc. All rights reserved. Rack Configuration • Common: 10Gb (or 20Gb bonded) to the server, 40Gb to the spine • Cost sensitive: 2Gb bonded to the server, 10Gb to the spine • This is likely to be a false economy, with a real risk of being network-bottlenecked while disks sit idle • Look for 25/100Gb in the next couple of years • Network I/O is generally consumed by reading/writing data from disks on other nodes • Teragen is a useful benchmark for network capacity • Typically 3-9x more intensive than normal workloads
  • 29. 29© Cloudera, Inc. All rights reserved. Software Installation • Linux Distribution • Operating System Configuration • Hadoop Distribution • Distribution Lifecycle Management
  • 30. 30© Cloudera, Inc. All rights reserved. Linux Distributions • All enterprise distributions are credible choices • Do you already buy support from a vendor? • Which distributions does your Hadoop distro support? • What are you already familiar with • Cloudera supports • RHEL 5.x, 6.x • Ubuntu 12.04, 14.04 • Debian 6.x, 7.x • SLES 11 • RHEL 6.x is the most common
  • 31. 31© Cloudera, Inc. All rights reserved. Operating System Configuration • Turn off IPTables (or any other firewall tool) • Turn off SELinux • Turn down swappiness to 10 or less • Turn off Transparent Huge Page Compaction • Use a network time source to keep all hosts in sync (also timezones!) • Make sure forward and reverse DNS work on each host to resolve all other hosts consistently • Use the Oracle JDK - OpenJDK is subtly different and may lead you to grief
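For illustration, a rough sketch of what those steps look like on a RHEL/CentOS 6-style host; service names, sysctl persistence, and the THP path vary by distribution and kernel:

  service iptables stop && chkconfig iptables off               # firewall off (check your security policy first)
  setenforce 0                                                  # SELinux off now...
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # ...and after reboot
  sysctl -w vm.swappiness=10                                    # swappiness down to 10 or less
  echo 'vm.swappiness=10' >> /etc/sysctl.conf
  # THP defrag off; on RHEL 6 the path is /sys/kernel/mm/redhat_transparent_hugepage/defrag
  echo never > /sys/kernel/mm/transparent_hugepage/defrag
  service ntpd status                                           # confirm time sync is running
  host "$(hostname -f)"                                         # forward DNS for this host
  host "$(hostname -i)"                                         # reverse DNS for this host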
  • 32. 32© Cloudera, Inc. All rights reserved. Cloudera Manager provides a Host Inspector to check for these situations
  • 33. 33© Cloudera, Inc. All rights reserved. Hadoop Distributions • Well, we’re obviously biased here • CDH is a jolly good Hadoop Distro. We recommend it • You’re free to try others • While it’s technically possible to have a go at running a cluster without using any management tools, it’s not going to be fun and we’re not going to talk about that much
  • 34. 34© Cloudera, Inc. All rights reserved. Distribution Lifecycle Management • Theoretically, you might want to just install the binaries for services and programs you are running on a specific node • But honestly, space is not at that much of a premium • Install everything everywhere and don’t worry about it again • If you decide to alter the footprint of a service, you don’t need to worry about binaries • You don’t need different profiles for different hosts in the cluster • Cloudera Manager works this way
  • 35. 35© Cloudera, Inc. All rights reserved. Distribution Lifecycle Management (cont.) • For CDH, Cloudera Manager can handle lifecycle management through parcels • For package based installations, there are a variety of recognised options • Puppet, Chef, Ansible, etc • But please use something • Long term package management by hand is a recipe for disaster • Exposes you to having inconsistencies between hosts • Missing Packages • Un-upgraded Packages
  • 36. 36© Cloudera, Inc. All rights reserved. Lifecycle Management with Parcels in Cloudera Manager
  • 37. 37© Cloudera, Inc. All rights reserved. Installation with Cloudera Manager
  • 38. 38© Cloudera, Inc. All rights reserved. Installation with Cloudera Manager
  • 39. 39© Cloudera, Inc. All rights reserved. Installation with Cloudera Manager
  • 40. 40© Cloudera, Inc. All rights reserved. Installation with Cloudera Manager
  • 41. 41© Cloudera, Inc. All rights reserved. Launch • Initial Configuration • Sanity testing • Security Considerations
  • 42. 42© Cloudera, Inc. All rights reserved. Initial Configuration • Recall our earlier discussion of hardware • YARN vcores (or MapReduceV1 slots) proportional to physical cores • Typically 1:1 to 1.5:1 (1.5 with hyperthreading) • Heap sizes • Don’t overcommit on memory, but don’t make Java heaps too large (GC) • See Memory table in Appendix • Mounting data disks • One partition per disk • No RAID • Use a well-established filesystem (ext4) • Use a uniform naming scheme
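As a sketch of the data-disk recommendations above (the device name and mount point are examples; repeat per spindle):

  mkfs.ext4 -m 0 /dev/sdb1                     # ext4, no root-reserved blocks on a data-only disk
  mkdir -p /data/1                             # uniform naming scheme: /data/1, /data/2, ...
  mount -o defaults,noatime /dev/sdb1 /data/1
  echo '/dev/sdb1  /data/1  ext4  defaults,noatime  0 0' >> /etc/fstab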
  • 43. 43© Cloudera, Inc. All rights reserved. Initial Configuration (cont.) • Think about space on the OS partition(s) • /var is where your logs go by default • /opt is where Cloudera Manager stores parcels • By default, these are part of / on modern distros, which works out fine • Your IT policies may require separate partitions • If these areas are too small, then you’ll need to change these configurations
  • 44. 44© Cloudera, Inc. All rights reserved. Cloudera Manager can help • Assigns roles based on cluster size • Tries to detect masters based on physical capabilities • Sets vcore count based on detected host CPUs • Sets heap sizes to avoid overcommitting RAM • Autodetects mounted drives and assigns them for DataNodes
  • 45. 45© Cloudera, Inc. All rights reserved. Initial Service Configuration in Cloudera Manager
  • 46. 46© Cloudera, Inc. All rights reserved. Implications of Services in Use • Different services running concurrently means we need to consider how resources are shared between them • Services that use YARN can be managed through YARN’s scheduler • But certain services do not - most visibly HBase, but also services like Accumulo or Flume • These services can run on a shared cluster through the use of static resource partitioning with cgroups • Cloudera Manager can configure these
  • 47. 47© Cloudera, Inc. All rights reserved. Dynamic Resource Management
  • 48. 48© Cloudera, Inc. All rights reserved. Static Resource Management
  • 49. 49© Cloudera, Inc. All rights reserved. Sanity Testing • Use the basic sanity tests provided by each service • Submit example jobs: pi, sleep, teragen/terasort • Work with sample tables in Hive/Impala • Use Hue to do these things if your users will • Repeat these tests when turning on security/authentication/authorization mechanisms • Make sure they succeed for the expected users and fail for others • Cloudera Manager provides some ‘canary’ tests for certain services • HDFS create/read/write/delete • HBase, ZooKeeper, etc.
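A few of the stock sanity jobs look roughly like this; the examples jar path below is an assumption and varies by distribution and version:

  EXAMPLES=/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples*.jar
  hadoop jar $EXAMPLES pi 10 1000                        # quick MapReduce smoke test
  hadoop jar $EXAMPLES teragen 10000000 /tmp/teragen     # also exercises the network
  hadoop jar $EXAMPLES terasort /tmp/teragen /tmp/terasort
  hdfs dfs -mkdir -p /tmp/sanity
  hdfs dfs -put /etc/hosts /tmp/sanity/                  # basic HDFS create/read/delete
  hdfs dfs -cat /tmp/sanity/hosts
  hdfs dfs -rm -r /tmp/sanity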
  • 50. 50© Cloudera, Inc. All rights reserved. HDFS Health tests
  • 51. 51© Cloudera, Inc. All rights reserved. Hue Examples
  • 52. 52© Cloudera, Inc. All rights reserved. Appendix
  • 53. 53© Cloudera, Inc. All rights reserved. Detailed Role Assignments • Very Small (Up to 10 workers, No High Availability) • 1 Master Node • NameNode, YARN Resource Manager, Job History Server, ZooKeeper, Impala StateStore • 1 Utility/Edge Node • Secondary NameNode, Cloudera Manager, Hive Metastore, HiveServer2, Impala Catalog, Hue, Oozie, Flume, Relational Database, Gateway configurations • 3-10 Worker Nodes • DataNode, NodeManager, Impalad, Llama
  • 54. 54© Cloudera, Inc. All rights reserved. Detailed Role Assignments • Small (Up to 20 workers, High Availability) • 2 Master Nodes • NameNode (with JournalNode and FailoverController), YARN Resource Manager, ZooKeeper • (1 Node each) Job History Server, Impala StateStore • 1 Utility/Edge Node • Cloudera Manager, Hive Metastore, HiveServer2, Impala Catalog, Hue, Oozie, Flume, Relational Database, Gateway configurations • (requires dedicated spindle) Zookeeper, JournalNode • 3-20 Worker Nodes • DataNode, NodeManager, Impalad, Llama
  • 55. 55© Cloudera, Inc. All rights reserved. Detailed Role Assignments • Medium (Up to 200 workers, High Availability) • 3 Master Nodes • (3 Nodes) ZooKeeper, JournalNode • (2 Nodes each) NameNode (with FailoverController), YARN Resource Manager • (1 Node each) Job History Server, Impala StateStore • 2 Utility Nodes • Node 1: Cloudera Manager, Relational Database • Node 2: CM Management Service, Hive Metastore, Catalog Server, Oozie • 1+ Edge Nodes • Hue, HiveServer2, Flume, Gateway configuration • 50-200 Worker Nodes • DataNode, NodeManager, Impalad, Llama
  • 56. 56© Cloudera, Inc. All rights reserved. Detailed Role Assignments • Large (Up to 500 workers, High Availability) • 5 Master Nodes • (5 Nodes) ZooKeeper, JournalNode • (2 Nodes) NameNode (with FailoverController) • (2 different Nodes) YARN Resource Manager • (1 Node each) Job History Server, Impala StateStore • 2 Utility Nodes • Node 1: Cloudera Manager, Relational Database • Node 2: CM Management Service, Hive Metastore, Catalog Server, Oozie • 1+ Edge Nodes • Hue, HiveServer2, Flume, Gateway configuration • 200-500 Worker Nodes • DataNode, NodeManager, Impalad, Llama
  • 57. 57© Cloudera, Inc. All rights reserved. Memory Allocation (item: RAM allocated): Operating system overhead: 2 GB (minimum); DataNode: 1-4 GB; YARN NodeManager: 1 GB; YARN ApplicationManager: 1 GB; YARN Map/Reduce containers: 1-2 GB per container; HBase RegionServer: 4-12 GB; Impala: 128 GB (can be reduced with spill-to-disk)
  • 59. 59© Cloudera, Inc. All rights reserved. Apache Hadoop Operations for Production Systems: Configuration Philip Zeyliger
  • 60. 60© Cloudera, Inc. All rights reserved. Agenda • Mechanics • Key configurations • Resource Management • Configurations...are...living documents!
  • 61. 61© Cloudera, Inc. All rights reserved. What’s there to configure, anyway? • On a 100-node cluster, there are likely 400+ (HDFS DataNode, YARN NodeManager, HBase RegionServer, Impalad) processes running. Each has environment variables, config files, and command line options! • Where your daemons are running is configuration too, but often implicit. Moving from machine A to machine B is a configuration change! • Most (but not all) settings require a restart to take effect. • Different kinds of restarts (e.g., rolling) • A management tool will help you with “scoping.” Some configurations must be the same globally (e.g., Kerberos), some make sense within a service (HDFS Trash), some per-daemon
  • 62. 62© Cloudera, Inc. All rights reserved. $ vi /etc/hadoop/conf/hdfs-site.xml • Configs are key-value pairs, in a straightforward if verbose XML format • When in doubt, place configuration everywhere, since you might not know whether the client reads it or which daemons read it. • Dear golly, use configuration management. (This is one of many config files!)
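For reference, one key-value pair inside hdfs-site.xml looks like the commented snippet below (values are illustrative), and hdfs getconf shows what the client configuration actually resolves to:

  # <property>
  #   <name>dfs.replication</name>
  #   <value>3</value>
  # </property>
  hdfs getconf -confKey dfs.replication
  hdfs getconf -confKey dfs.datanode.data.dir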
  • 63. 63© Cloudera, Inc. All rights reserved. Editing a configuration • Quick demo!
  • 64. 64© Cloudera, Inc. All rights reserved. Step One: edit a config Step Two: save
  • 65. 65© Cloudera, Inc. All rights reserved. Step Three: at top-level, note that restart is needed Step Four: review changes Step five: restart
  • 66. 66© Cloudera, Inc. All rights reserved. Show me the files! • core-site.xml, hdfs-site.xml, dfs_hosts_allow, dfs_hosts_exclude, hbase-site.xml, hive-env.sh, hive-site.xml, hue.ini, mapred-site.xml, oozie-site.xml, yarn-site.xml, zoo.cfg (and so on) • e.g., /var/run/cloudera-scm-agent/process/*-NAMENODE
  • 67. 67© Cloudera, Inc. All rights reserved. Let’s take a look! • Demo: find the files via UI
  • 68. 68© Cloudera, Inc. All rights reserved. How to double-check? • http://ec2-54-209-51-178.compute-1.amazonaws.com:50070/conf
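The hostname below is an example; most Hadoop daemons serve their effective configuration (plus /jmx and /stacks) over HTTP on their web port, so a quick check looks like:

  curl -s http://namenode.example.com:50070/conf | grep dfs.replication   # effective server-side value
  curl -s http://namenode.example.com:50070/jmx  | head                   # metrics from the same daemon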
  • 69. 69© Cloudera, Inc. All rights reserved. Key Configuration Themes (everyone) • Security (Kerberos, SSL, ACLs?) • Ports • Local Storage • JVM (use Oracle JDK 1.7) • Databases (back them up) • Heap Sizes (different workload? → different config!)
  • 70. 70© Cloudera, Inc. All rights reserved. Rack Topology • Hadoop cares about racks because: • Shared failure zone • Network bandwidth • When operating a large cluster, tell Hadoop (by use of a rack locality script) what machines are in which rack.
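A minimal sketch of such a script (pointed to by net.topology.script.file.name); Hadoop passes hostnames/IPs as arguments and expects one rack path per argument on stdout. The rackmap.txt lookup file here is hypothetical:

  #!/bin/bash
  MAP=/etc/hadoop/conf/rackmap.txt            # lines like: 10.1.1.12 /rack1
  for host in "$@"; do
    rack=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
    echo "${rack:-/default-rack}"             # unknown hosts fall back to the default rack
  done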
  • 71. 71© Cloudera, Inc. All rights reserved. Networking • Does your DNS work? • Like, forwards and backwards? • For sure? • Have you really checked?
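A quick way to actually check, on every host (by hand or via your config management tool):

  fqdn=$(hostname -f)
  ip=$(hostname -i | awk '{print $1}')
  getent hosts "$fqdn"     # forward lookup
  getent hosts "$ip"       # reverse lookup; the two results should agree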
  • 72. 72© Cloudera, Inc. All rights reserved. HDFS Configurations of Note • Heap sizes: • Datanodes: linear in # of blocks • Namenode: linear in # of blocks and # of files select blocks_total, jvm_heap_used_mb where roletype=DATANODE and hostname RLIKE "hodor-016.*"
  • 73. 73© Cloudera, Inc. All rights reserved. HDFS Configurations of Note • Local Data Directories • dfs.datanode.data.dir: one per spindle; avoid RAID • dfs.namenode.name.dir: two copies of your metadata better than one • High Availability • Requires more daemons • Two failover controllers co-located with namenodes • Three journal nodes
  • 74. 74© Cloudera, Inc. All rights reserved. YARN Configurations of Note • YARN doles out resources to applications across two axes: memory and CPU • Define per-NodeManager resources • yarn.nodemanager.resource.cpu-vcores • yarn.nodemanager.resource.memory-mb (diagram: a NodeManager’s pool of GBs and cores carved into containers such as 1 GB/1 core, 2 GB/2 cores, 1 GB/2 cores)
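A back-of-the-envelope sizing sketch for one worker, assuming the earlier rules of thumb (roughly 1-1.5 vcores per physical core and leaving about 20% of RAM for the OS and non-YARN daemons; the percentages are assumptions, not fixed rules):

  CORES=$(nproc)
  MEM_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
  echo "yarn.nodemanager.resource.cpu-vcores ~ $((CORES * 3 / 2))"
  echo "yarn.nodemanager.resource.memory-mb  ~ $((MEM_MB * 80 / 100))"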
  • 75. 75© Cloudera, Inc. All rights reserved. Resource management in YARN • Fannie and Freddie go in on a large cluster together. Fannie ponies up 75% of the budget; Freddie ponies up 25%. • When cluster is idle, let Freddie use 100% • When Fannie has a lot of work, Freddie can only use 25%. • The “Fair Scheduler” implements “DRF” to share the cluster fairly across CPU and memory resources. • Configured by an “allocations.xml” file.
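A minimal allocations file for the 75/25 example might look like the sketch below; the path is an example (it is whatever yarn.scheduler.fair.allocation.file points at), and Cloudera Manager can generate this for you:

  # /etc/hadoop/conf/fair-scheduler.xml:
  #   <allocations>
  #     <queue name="fannie"> <weight>75</weight> </queue>
  #     <queue name="freddie"> <weight>25</weight> </queue>
  #   </allocations>
  mapred queue -list           # confirm the queues and their capacities show up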
  • 76. 76© Cloudera, Inc. All rights reserved. Quick Demo • Yarn Applications
  • 77. 77© Cloudera, Inc. All rights reserved. Configs as live documents • Configurations that have to do with workloads will evolve. Configs Metrics
  • 79. 79© Cloudera, Inc. All rights reserved. Apache Hadoop Operations for Production Systems: Troubleshooting Kathleen Ting
  • 80. 80© Cloudera, Inc. All rights reserved. Troubleshooting Managing Hadoop Clusters Troubleshooting Hadoop Systems Debugging Hadoop Applications
  • 81. 81© Cloudera, Inc. All rights reserved. Troubleshooting Managing Hadoop Clusters Troubleshooting Hadoop Systems Debugging Hadoop Applications
  • 82. 82© Cloudera, Inc. All rights reserved. Understanding Normal • Establish normal – Boring logs are good • So that you can detect abnormal – Find anomalies and outliers – Compare performance • And then isolate root causes – Who are the suspects – Pull the thread by interrogating – Correlate across different subsystems BORING INFRASTRUCTURE
  • 83. 83© Cloudera, Inc. All rights reserved. Basic tools • What happened? – Logs • What state are we in now? – Metrics – Thread stacks • What happens when I do this? – Tracing – Dumps • Is it alive? – listings – Canaries • Is it OK? – fsck / hbck
  • 84. 84© Cloudera, Inc. All rights reserved. Diagnostics for single machines Machine Linux Disk, CPU, Mem Analysis and Tools via @brendangregg
  • 85. 85© Cloudera, Inc. All rights reserved. Diagnosis Tools (HW/Kernel): Logs: /log, /var/log, dmesg; Metrics: /proc, top, iotop, sar, vmstat, netstat; Tracing: strace, tcpdump; Dumps: core dumps; Liveness: ping; Corrupt: fsck
  • 86. 86© Cloudera, Inc. All rights reserved. Diagnostics for the JVM • Most Hadoop services run in the Java VM, which introduces new Java subsystems – Just-in-time compiler – Threads – Garbage collection • Dumping current threads – jstacks (any threads stuck or blocked?) • JVM settings: – Enabling GC logs: -XX:+PrintGCDateStamps -XX:+PrintGCDetails – Enabling OOME heap dumps: -XX:+HeapDumpOnOutOfMemoryError • Metrics on GC – Get GC counts and info with jstat • Memory usage dumps – Create heap dump: jmap -dump:live,format=b,file=heap.bin <pid> – View heap dump from crash or jmap with jhat
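Put together, a typical JVM triage session against (say) a DataNode looks roughly like this; the pgrep pattern is an assumption about how the process appears on your host:

  PID=$(pgrep -f DataNode | head -1)
  jstack "$PID" > /tmp/dn.jstack                           # thread dump: anything stuck or blocked?
  jstat -gcutil "$PID" 1000 10                             # GC utilization, 10 samples 1s apart
  jmap -dump:live,format=b,file=/tmp/dn.heap.bin "$PID"    # heap dump for offline analysis (jhat/MAT)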
  • 87. 87© Cloudera, Inc. All rights reserved. Diagnosis Tools (HW/Kernel | JVM): Logs: /log, /var/log, dmesg | enable GC logging; Metrics: /proc, top, iotop, sar, vmstat, netstat | jstat; Tracing: strace, tcpdump | debugger, jdb, jstack; Dumps: core dumps | jmap, enable OOME dump; Liveness: ping | jps; Corrupt: fsck | (n/a)
  • 88. 88© Cloudera, Inc. All rights reserved. Tools for lots of machines (diagram: many machines, each running Linux, a JVM and Hadoop daemons on local disk/CPU/memory, alongside other systems such as httpd, SAS and custom apps)
  • 89. 89© Cloudera, Inc. All rights reserved. Tools for lots of machines (same diagram, annotated with: Ganglia, Nagios; Ganglia*, Nagios*)
  • 90. 90© Cloudera, Inc. All rights reserved. Ganglia • Collects and aggregates metrics from machines • Good for what’s going on right now and describing normal perf • Organized by physical resource (CPU, Mem, Disk) – Good defaults – Good for pin pointing machines – Good for seeing overall utilization – Uses RRDTool under the covers • Some data scalability limitations, lossy over time. • Dynamic for new machines, requires config for new metrics
  • 91. 91© Cloudera, Inc. All rights reserved. Graphite • Popular alternative to Ganglia • Can handle the scale of metrics coming in • Similar to Ganglia, but uses its own RRD database. • More aimed at dynamic metrics (as opposed to statically defined metrics)
  • 92. 92© Cloudera, Inc. All rights reserved. Nagios • Provides alerts from canaries and basic health checks for services on machines • Organized by service (httpd, DNS, etc.) • De facto standard for service monitoring • Lacks distributed system know-how – Requires bespoke setup for service slaves and masters – Lacks details with multi-tenant services or short-lived jobs
  • 93. 93© Cloudera, Inc. All rights reserved. Tools for the Hadoop Stack (machine-level diagram again, plus other systems such as httpd, SAS and custom apps; annotated with: Ganglia, Nagios; Ganglia*, Nagios*)
  • 94. 94© Cloudera, Inc. All rights reserved. Tools for the Hadoop Stack (full stack diagram: HDFS, HBase, Solr, MapReduce, Spark, Impala, YARN, ZooKeeper, Sentry, Flume/Kafka, Sqoop, Hue, Hive/Pig/Crunch/Kite/Mahout; annotated with: Ganglia, Nagios; Ganglia*, Nagios*)
  • 95. 95© Cloudera, Inc. All rights reserved. Diagnostics for the Hadoop Stack • Single client call can trigger many RPCs spanning many machines • Systems are evolving quickly • A failure on one daemon, by design, does not cause failure of the entire service • Logs: – Each service’s master and slaves have their own logs: /var/log/ – There are lot of logs and they change frequently • Metrics: – Each daemon offers metrics, often aggregated at masters • Tracing: – HTrace (integrated into Hadoop, HBase; currently in Apache Incubator) • Liveness: – Canaries, service/daemon web UIs
  • 96. 96© Cloudera, Inc. All rights reserved. Diagnosis Tools (HW/Kernel | JVM | Hadoop Stack): Logs: /log, /var/log, dmesg | enable GC logging | *:/var/log/hadoop|hbase|…; Metrics: /proc, top, iotop, sar, vmstat, netstat | jstat | *:/stacks, *:/jmx; Tracing: strace, tcpdump | debugger, jdb, jstack | HTrace; Dumps: core dumps | jmap, enable OOME dump | *:/dump; Liveness: ping | jps | web UI; Corrupt: fsck | (n/a) | HDFS fsck, HBase hbck
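In practice, poking those endpoints and consistency checkers looks like the sketch below; hostnames are examples and 50070/50075 are the Hadoop 2 / CDH5 default NameNode/DataNode web ports:

  curl -s http://namenode.example.com:50070/jmx    | head      # daemon metrics as JSON
  curl -s http://namenode.example.com:50070/stacks | head      # live thread stacks
  hdfs fsck / -files -blocks -locations | tail -20             # HDFS consistency report
  hbase hbck                                                   # HBase consistency report (run as an HBase-privileged user)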
  • 97. 97© Cloudera, Inc. All rights reserved. Tools for the Hadoop Stack (same full-stack diagram, annotated with: Ganglia, Nagios; Ganglia*, Nagios*)
  • 98. 98© Cloudera, Inc. All rights reserved. Tools for the Hadoop Stack (full-stack diagram annotated per layer: logs and metrics for each service; HTrace for HDFS and HBase; custom app logs; /proc, dmesg, /var/logs/ at the OS level; jmx, jstack, jstat, GC logs at the JVM level)
  • 99. 99© Cloudera, Inc. All rights reserved. Tools for the Hadoop Stack (same diagram with Ganglia/Nagios coverage marked at the machine level and a question mark over the Hadoop service layer)
  • 100. 100© Cloudera, Inc. All rights reserved. Ganglia and Nagios are not enough • A fault-tolerant distributed system masks many problems (by design!) – Some failures are not critical – the failure condition is more complicated • Lacks distributed system know-how – Requires bespoke setup for service slaves and masters – Lacks details with multitenant services or short-lived jobs • Hadoop services are logically dependent on each other – Need to correlate metrics across different services and machines – Need to correlate logs from different services and machines – Young systems where logs are changing frequently – What about all the logs? – Which of these metrics do we really need? • Some data scalability limitations, lossy over time – What about full fidelity?
  • 101. 101© Cloudera, Inc. All rights reserved. OpenTSDB • OpenTSDB (Time Series Database) – Efficiently stores metric data in HBase – Keeps data at full fidelity – Keeps as much data as your HBase instance can handle • Free and open source
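For reference, a data point can be pushed to OpenTSDB over its telnet-style API. This is only a sketch; the TSD host, metric name, timestamp, and tag below are made up for illustration:
# Push one data point to a TSD listening on the default port 4242
$ echo "put hadoop.datanode.fd.open 1443542400 482 host=dn01.example.com" | nc -w 1 tsd.example.com 4242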
  • 102. 102© Cloudera, Inc. All rights reserved. Cloudera Manager • Collects hardware, OS, and Hadoop service metrics specifically relevant to Hadoop and its related services – Operation latencies – # of disk seeks – HDFS data written – Java GC time – Network IO • Provides – Cluster preflight checks – Basic host checks – Regular health checks • Uses LevelDB for underlying metrics storage • Provides distributed log search • Monitors logs for known issues • Point-and-click access to useful utilities (lsof, jstack, jmap) • Free
  • 103. 103© Cloudera, Inc. All rights reserved. Tools for the Hadoop Stack [Stack diagram summarizing the options: Ganglia, Nagios (and Ganglia*, Nagios*); full-fidelity metrics: OpenTSDB*; metrics + logs: Cloudera Manager; logs: search a la Splunk]
  • 104. 104© Cloudera, Inc. All rights reserved. Activity How many file descriptors do the datanodes have open? What is the current latency of the HDFS canary?
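The lab expects you to answer these from the monitoring UI. As a shell-level cross-check for the first question, something like the following works on a DataNode host (a sketch, assuming a single DataNode process per host):
# Count the DataNode's open file descriptors via /proc
$ DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode)
$ sudo ls /proc/${DN_PID}/fd | wc -l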
  • 105. 105© Cloudera, Inc. All rights reserved. Troubleshooting Managing Hadoop Clusters Troubleshooting Hadoop Systems Debugging Hadoop Applications
  • 106. 106© Cloudera, Inc. All rights reserved. The Law of Cluster Inertia A cluster in a good state stays in a good state, and a cluster in a bad state stays in a bad state, unless acted upon by an external force.
  • 107. 107© Cloudera, Inc. All rights reserved.
  • 108. 108© Cloudera, Inc. All rights reserved. External Forces • Failures • Acts of God • Users • Admins
  • 109. 109© Cloudera, Inc. All rights reserved. Failures as an External Force [Stack diagram highlighting the units that fail: a machine, a JVM, hardware, a slave daemon, a job]
  • 110. 110© Cloudera, Inc. All rights reserved. Acts of God as an External Force [Network diagram: WAN, cluster backbone switch, racks of machines behind top-of-rack switches, annotated with a power failure]
  • 111. 111© Cloudera, Inc. All rights reserved. Users as an external force • Use security to protect systems from users – Prevent and track • Authentication – proving who you are – LDAP, Kerberos • Authorization – deciding what you are allowed to do – Apache Sentry (incubating), Hadoop security, HBase security • Audit – who did something, and when? – Cloudera Navigator
  • 112. 112© Cloudera, Inc. All rights reserved. Admins as an external force Upgrades • Linux • Hadoop • Java Misconfiguration • Memory Mismanagement – TT OOME – JT OOME – Native Threads • Thread Mismanagement – Fetch Failures – Replicas • Disk Mismanagement – No File – Too Many Files
  • 113. 113© Cloudera, Inc. All rights reserved. Troubleshooting Managing Hadoop Clusters Troubleshooting Hadoop Systems Debugging Hadoop Applications
  • 114. 114© Cloudera, Inc. All rights reserved. Example application pipeline with strict SLAs [Timeline: ingest → process → export, repeating over time]
  • 115. 115© Cloudera, Inc. All rights reserved. Example Application Pipeline - Ingest [Stack diagram: a custom app feeds event data into HBase via event ingest (Flume, Kafka), backed by HDFS and ZooKeeper/Sentry]
  • 116. 116© Cloudera, Inc. All rights reserved. Example Application Pipeline - processing [Stack diagram: a MapReduce job, scheduled by YARN, generates a new artifact from HBase data and writes it to HDFS]
  • 117. 117© Cloudera, Inc. All rights reserved. Example Application Pipeline - export [Stack diagram: Sqoop exports the result from HDFS into a relational database to serve the application]
  • 118. 118© Cloudera, Inc. All rights reserved. Case study 1: slow jobs after Hadoop upgrade Symptom: After an upgrade, activity on the cluster eventually began to slow down and the job queue overflowed.
  • 119. 119© Cloudera, Inc. All rights reserved. Finding the right part of the stack E-SPORE (from Eric Sammer’s Hadoop Operations) • Environment – What is different about the environment now from the last time everything worked? • Stack – The entire cluster also has shared dependencies on data center infrastructure such as the network, DNS, and other services. • Patterns – Are the tasks from the same job? Are they all assigned to the same tasktracker? Do they all use a shared library that was changed recently? • Output – Always check log output for exceptions, but don’t assume the symptom correlates to the root cause. • Resources – Do local disks have enough space? Is the machine swapping? Does the network utilization look normal? Does the CPU utilization look normal? • Event correlation – It’s important to know the order of the events that led to the failure.
  • 120. 120© Cloudera, Inc. All rights reserved. Example Application Pipeline - processing [Same stack diagram as before: the MapReduce job generates a new artifact from HBase data and writes it to HDFS]
  • 121. 121© Cloudera, Inc. All rights reserved. Case study 1: slow jobs after Hadoop upgrade Evidence: Isolated to the processing phase (MR). In TT logs, found an innocuous but anomalous log entry about “fetch failures.” Many users had run into this MR problem using different versions of MR. Workaround provided: remove the problem node from the cluster.
  • 122. 122© Cloudera, Inc. All rights reserved. [Diagram: a single machine from the stack: Linux, JVM, Hadoop daemons, disk/CPU/memory]
  • 123. 123© Cloudera, Inc. All rights reserved. Case study 1: slow jobs after Hadoop upgrade Root cause: All MR versions had a common dependency on a particular version of Jetty (Jetty 6.1.26). Dev was able to reproduce and fix the bug in Jetty.
  • 124. 124© Cloudera, Inc. All rights reserved. Case study 2: slow jobs after Linux upgrade Symptom: After an upgrade, system CPU usage peaked at 30% or more of the total CPU usage.
  • 125. 125© Cloudera, Inc. All rights reserved. Case study 2: slow jobs after Linux upgrade Evidence: Used tracing tools to show that the majority of time was inexplicably spent in virtual memory calls. http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
  • 126. 126© Cloudera, Inc. All rights reserved. [Diagram: a single machine from the stack: Linux, JVM, Hadoop daemons, disk/CPU/memory]
  • 127. 127© Cloudera, Inc. All rights reserved. Case study 2: slow jobs after Linux upgrade Root cause: RHEL and CentOS 6.2, 6.3, and 6.4 and SLES 11 SP2 ship with a feature called "Transparent Huge Page (THP)" compaction that interacts poorly with Hadoop workloads.
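The commonly recommended mitigation is to turn THP compaction off; a sketch, noting that the sysfs path varies by distro (RHEL/CentOS 6 uses redhat_transparent_hugepage, other distros use transparent_hugepage):
# Disable THP defrag/compaction at runtime
$ echo never | sudo tee /sys/kernel/mm/redhat_transparent_hugepage/defrag
# This does not survive a reboot; persist it in /etc/rc.local or your distro's equivalent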
  • 128. 128© Cloudera, Inc. All rights reserved. Case study 3: slow jobs at a precise moment Symptom: High CPU usage and a responsive but sluggish cluster, even for non-Hadoop apps such as MySQL. 30 customers all hit this at the exact same time: 6/30/12 at 5pm PDT.
  • 129. 129© Cloudera, Inc. All rights reserved. Case study 3: slow jobs at a precise moment Evidence: Checked the kernel message buffer (via dmesg) and looked for output confirming the leap second injection. Other systems had the same problem.
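A quick sketch of that check; the exact wording of the kernel message varies by kernel version:
# Look for the leap second announcement in the kernel ring buffer
$ dmesg | grep -i 'leap second'
# Expect something along the lines of: Clock: inserting leap second 23:59:60 UTC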
  • 130. 130© Cloudera, Inc. All rights reserved. [Diagram: a single machine from the stack: Linux, JVM, Hadoop daemons, disk/CPU/memory]
  • 131. 131© Cloudera, Inc. All rights reserved. Case study 3: slow jobs at a precise moment Root cause: The Linux kernel mishandled the leap second that was added.
  • 132. 132© Cloudera, Inc. All rights reserved. Similar symptoms, different problem
  • 133. 133© Cloudera, Inc. All rights reserved. Case study 1: slow jobs after Hadoop upgrade After an upgrade, activity on the cluster eventually began to slow down and the job queue overflowed. In TT logs, found an innocuous but anomalous log entry about “fetch failures.” Many users had run into this MR problem using different versions of MR. Workaround provided: remove the problem node from the cluster. All MR versions had a common dependency on a particular version of Jetty (Jetty 6.1.26). Dev was able to reproduce and fix the bug in Jetty. Symptom Evidence Root Cause
  • 134. 134© Cloudera, Inc. All rights reserved. Case study 2: slow jobs after Linux upgrade After an upgrade, system CPU usage peaked at 30% or more of the total CPU usage. The perf tool showed that the majority of time was inexplicably spent in virtual memory calls. RHEL and CentOS 6.2, 6.3, and 6.4 and SLES 11 SP2 ship with a feature called "Transparent Huge Page (THP)" compaction that interacts poorly with Hadoop workloads. Symptom Evidence Root Cause
  • 135. 135© Cloudera, Inc. All rights reserved. Case study 3: slow jobs at a precise moment High CPU usage and a responsive but sluggish cluster. 30 customers all hit this at the exact same time. Checked the kernel message buffer (via dmesg) and looked for output confirming the leap second injection. The Linux kernel mishandled a leap second added on 6/30/12 at 5pm PDT. Symptom Evidence Root Cause
  • 136. 136© Cloudera, Inc. All rights reserved. Lessons learned More crucial than the specific troubleshooting methodology used is to use one. More crucial than the specific tool used is the type of data analyzed and how it’s analyzed. Capture for posterity in a knowledge base article, blog post, or conference presentation. Methodology Tools Learn from failure
  • 138. 138© Cloudera, Inc. All rights reserved. Apache Hadoop Operations for Production Systems: Enterprise Considerations Miklos Christine
  • 139. 139© Cloudera, Inc. All rights reserved. Scale Considerations Ref: http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
  • 140. 140© Cloudera, Inc. All rights reserved. Scale Considerations • HDFS • Namenode Heap Settings • Namenode RPC Configurations (property: default → recommended) – dfs.namenode.servicerpc-address: unset → port 8022 – dfs.namenode.handler.count: 10 → ln(# of DNs) * 20 – dfs.namenode.service.handler.count: 10 → ln(# of DNs) * 20
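A quick back-of-the-envelope sketch of the ln(# of DNs) * 20 rule for a hypothetical 200-DataNode cluster; set both handler counts to roughly this value:
# ln(200) * 20 ≈ 105
$ python -c "import math; print(int(math.log(200) * 20))"
105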
  • 141. 141© Cloudera, Inc. All rights reserved. Scale Considerations • YARN • ResourceManager High Availability • yarn.resourcemanager.zk-address • Application recovery • yarn.resourcemanager.work-preserving-recovery.enabled • yarn.nodemanager.recovery.dir • User cache disk space • yarn.nodemanager.local-dirs
  • 142. 142© Cloudera, Inc. All rights reserved. Metrics: HDFS • HDFS is the core of the platform. What’s important? • Is the Standby NN checkpointing? • Are the NNs garbage collecting? • Percentage of heap used at steady state?
  • 143. 143© Cloudera, Inc. All rights reserved. Logs are your friend • Logs are verbose but necessary • Namenode Logs: • 10 * 200MB log files = 2GB • 2GB of logs span 3 hours • 3 days of logs = ~48GB • Retain enough logs for debugging. Plan for the worst case • Adjust log retention as the cluster grows
  • 144. 144© Cloudera, Inc. All rights reserved. Logs are your friend • Just reduce the log level to save space? • NO! • INFO logging is important! • Applications: YARN containers write logs locally, then migrate them to HDFS • Ensure application log space is sufficient • Provide a tool to fetch system logs for root cause analysis
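When log aggregation is enabled, the aggregated container logs can be pulled back with the yarn CLI; the application ID below is made up for illustration:
# Fetch all container logs for one application from HDFS
$ yarn logs -applicationId application_1443542400000_0001 > app_0001_logs.txt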
  • 145. 145© Cloudera, Inc. All rights reserved. Logs are your friend • GC Logging • -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/hdfs/nn-hdfs.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M • Great resource: • https://stackoverflow.com/questions/895444/java-garbage-collection-log-messages
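In a hand-managed install these flags usually go into hadoop-env.sh (under Cloudera Manager, the equivalent is the NameNode Java options setting); a sketch reusing the deck's log path:
# hadoop-env.sh: enable GC logging for the NameNode
export HADOOP_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hdfs/nn-hdfs.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M \
  ${HADOOP_NAMENODE_OPTS}"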
  • 146. 146© Cloudera, Inc. All rights reserved. Debugging Techniques : Hung Process • jstack • Use the same JDK as the process • Must be run as the user of the process • kill -3 <PID> • Dumps the stack trace to the process’s stdout
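A sketch of both approaches against a NameNode; ${NN_PID} is a placeholder for the actual process ID:
# Thread dump as the process owner
$ sudo -u hdfs jstack -l ${NN_PID} > /tmp/nn-jstack.$(date +%s).txt
# If jstack hangs, SIGQUIT makes the JVM print the dump to its own stdout
$ sudo kill -3 ${NN_PID}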
  • 147. 147© Cloudera, Inc. All rights reserved. Debugging Techniques : Hung Process
  • 148. 148© Cloudera, Inc. All rights reserved. Debugging Techniques : LogLevel • Set the log level without process restarts http://namenode.cloudera.com:50070/logLevel • Scriptable http://namenode.cloudera.com:50070/logLevel?log=org&level=DEBUG
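The same change can be scripted with curl, or with the hadoop daemonlog CLI; the logger name below is just an example:
# Bump one logger to DEBUG without a restart
$ curl "http://namenode.cloudera.com:50070/logLevel?log=org.apache.hadoop.hdfs.StateChange&level=DEBUG"
# Equivalent via the CLI
$ hadoop daemonlog -setlevel namenode.cloudera.com:50070 org.apache.hadoop.hdfs.StateChange DEBUG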
  • 149. 149© Cloudera, Inc. All rights reserved. Debugging Techniques : Heap Analysis • jstat -gcutil <PID> 1s 120 • Checks for current GC activity • jmap -histo:live <PID> • Get a histogram of the current objects within the heap
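A sketch of running these against a NameNode (${NN_PID} is a placeholder); note that jmap -histo:live forces a full GC, so the daemon pauses briefly while it runs:
# GC utilization, sampled every second for two minutes
$ sudo -u hdfs jstat -gcutil ${NN_PID} 1s 120
# Top object types on the live heap
$ sudo -u hdfs jmap -histo:live ${NN_PID} | head -n 30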
  • 150. 150© Cloudera, Inc. All rights reserved. Sanity Tests • Lots of tools available to end users ( Hive / Impala / MapReduce / HDFS Client / Cascading / Spark / … ) • Create sanity tests for each tool • Document expected response / processing time
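A minimal example of what an HDFS sanity test might look like; the path is arbitrary and the expected timings are whatever baseline you document for your cluster:
# Time a small end-to-end write and read, then clean up
$ F=/tmp/sanity_$(date +%s)
$ time hdfs dfs -put /etc/hosts "$F"
$ time hdfs dfs -cat "$F" > /dev/null
$ hdfs dfs -rm -skipTrash "$F"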
  • 151. 151© Cloudera, Inc. All rights reserved. Controlled Usage • How to prevent bad behavior from bringing down the cluster? • HDFS Quotas • Yarn FairScheduler Pools • Hive / Impala Access Control with Sentry
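For example, HDFS quotas can cap both the number of names and the raw space (including replicas) under a directory; the path and limits here are made up:
# Name quota and space quota on a user's home directory (run as the HDFS superuser)
$ sudo -u hdfs hdfs dfsadmin -setQuota 1000000 /user/alice
$ sudo -u hdfs hdfs dfsadmin -setSpaceQuota 10t /user/alice
# Check current quota usage
$ hdfs dfs -count -q /user/alice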
  • 152. 152© Cloudera, Inc. All rights reserved. Failure Testing • If the NN fails, how long does it take to recover given the average # of edits? • If RM HA failover were to occur, would jobs continue? • What is the mean time to recovery for HBase when a RS dies? • What properties can be tuned to improve this?
  • 153. 153© Cloudera, Inc. All rights reserved. Security Considerations • Securing communication channels within the cluster • Kerberos • Allows secure communication between hosts on an untrusted network. • Secures traffic between hosts in the cluster • Provides authentication for users to services • TLS • Used to secure http interfaces • Kerberos can be used to authenticate to these interfaces with SPNEGO
  • 154. 154© Cloudera, Inc. All rights reserved. Kerberos, Authentication and Authorization • While often conflated, these are distinct concepts • They are usually configured together, and we would recommend this, but it’s not an absolute requirement • Authentication: Having a user provide and prove their identity • Authorization: Controlling what a user can access or do
  • 155. 155© Cloudera, Inc. All rights reserved. Security and Authentication (cont) • Setting up Kerberos is an exercise that’s beyond the scope of this tutorial • Main implementations: MIT Kerberos, Active Directory • Typically LDAP (or AD) is used for user management • Cloudera Manager can help you configure Kerberos for your services
  • 156. 156© Cloudera, Inc. All rights reserved. Authentication • Without Kerberos, users are typically identified as whatever Linux system user their client application runs as. • With Kerberos, the user obtains a Kerberos ticket (typically at login time) that is used to identify them to the cluster services
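In practice that looks something like the following; the principal and path are made up:
# Obtain a ticket, verify it, then act as that principal against the cluster
$ kinit alice@EXAMPLE.COM
$ klist
$ hdfs dfs -ls /user/alice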
  • 157. 157© Cloudera, Inc. All rights reserved. Authorization • Even if you’re using an authentication mechanism to limit who can connect to the various services, you probably want to control what they can do. Without authorization, anyone can do anything • Each service provides different authorization mechanisms. eg: • YARN queues can be restricted to certain users • HBase tables can be restricted to certain users (ACLs) • The nature of cluster users will affect authorization requirements • Are there different groups with different SLAs?
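As one concrete example of per-service authorization, HBase ACLs are managed from the hbase shell (this assumes HBase security and the AccessController coprocessor are enabled; the user and table names are made up):
# Give one user read/write access to one table
$ echo "grant 'alice', 'RW', 'events'" | hbase shell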
  • 158. 158© Cloudera, Inc. All rights reserved. Ask Me Anything on Hadoop Ops this Thursday Date: Thursday, 10/01 Time: 2:05 – 2:45 pm Location: 3D 05/08
  • 159. 159© Cloudera, Inc. All rights reserved. Takeaway A cluster in a good state stays in a good state, and a cluster in a bad state stays in a bad state, unless acted upon by an external force. Cloudera has seen a lot of diverse clusters and used that experience to build tools to help diagnose and understand how Hadoop operates. Similar symptoms can lead to different root causes. Use tools to assist with event correlation and pattern determination. Anatomy of a Hadoop System Managing Hadoop Clusters Troubleshooting Hadoop Applications
  • 160. 160© Cloudera, Inc. All rights reserved. Exercises • Configs and Restarting - Phil Z • Monitoring and Health Tests - Phil L • Spark & MR - Miklos • Ingest Using Apache Sqoop - Kate
  • 161. 161© Cloudera, Inc. All rights reserved. Configurations • Edit the “NameNode Port” • What will need to be restarted? (Almost everything) • Restart it! • Let’s find the underlying files on the FS • /var/run/cloudera-scm-agent/...
  • 162. 162© Cloudera, Inc. All rights reserved. Health Tests • Look at some basic health test results • Number of missing blocks in HDFS (you want this to be zero!) • Number of times the Datanode process has exited unexpectedly • Let’s kill a Datanode • CM automatically restarts it • Look at our health tests again • Let’s kill a Datanode a lot • CM will back off if restarts seem to be going nowhere • Now what do we see? • Let’s get it running again
  • 163. 163© Cloudera, Inc. All rights reserved. Configuring for a multi-use cluster ● Log into the Cloudera Manager instance ● Go to the Yarn service ● Configure the following property to 2g: yarn.scheduler.maximum-allocation-mb ○ This determines the max size of each requested container ● Restart the Yarn service MapReduce / Spark Example
  • 164. 164© Cloudera, Inc. All rights reserved. MapReduce / Spark Example - 1 We will run 2 jobs on the platform. ● Log into the system using the provided key ● Run the following test pi MapReduce job $ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi -Dmapreduce.map.memory.mb=2048 5 5000
  • 165. 165© Cloudera, Inc. All rights reserved. MapReduce / Spark Example - 2 Attempt to run the following: ● Log into the system using the provided key ● Run the following test pi Spark job $ spark-submit --class org.apache.spark.examples.SparkPi --conf spark.yarn.executor.memoryOverhead=1g --master yarn-client --executor-memory 2g /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10
  • 166. 166© Cloudera, Inc. All rights reserved. MapReduce / Spark Example - 2 • What happened and why?
  • 167. 167© Cloudera, Inc. All rights reserved. MapReduce / Spark Example - 2 Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (2048+384 MB) is above the max threshold (2048 MB) of this cluster! • Where does the 384 MB come from?
  • 168. 168© Cloudera, Inc. All rights reserved. Try Cloudera Live today—cloudera.com/live
  • 169. 169© Cloudera, Inc. All rights reserved. Cloudera Live Tutorial • “Getting Started” tutorial for Apache Hadoop: http://<IP Address>/#/tutorial/home OR http://www.cloudera.com/content/cloudera/en/developers/home/developer-admin-resources/get-started-with-hadoop-tutorial/exercise-1.html • Load relational and clickstream data into HDFS • Use Apache Avro to serialize/prepare that data for analysis • Create Apache Hive tables • Query those tables using Hive or Impala • Index the clickstream data for business users/analysts
  • 170. 170© Cloudera, Inc. All rights reserved. Exercise 1: Ingest Using Apache Sqoop [Diagram: (1) fetch table metadata, (2) MR job submission, (3) data transfer via map-task import]
  • 171. 171© Cloudera, Inc. All rights reserved. Exercise 1: Ingest Using Apache Sqoop Explained > sqoop import-all-tables -m {{cluster_data.worker_node_hostname.length}} --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db --username=retail_dba --password=cloudera --compression-codec=snappy --as-parquetfile --warehouse-dir=/user/hive/warehouse --hive-import
  • 172. 172© Cloudera, Inc. All rights reserved. Join the Discussion Get community help or provide feedback cloudera.com/community