Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
1. Enabling Exploratory Analytics of Data in
Shared-service Hadoop Clusters
PRESENTED BY Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk
Hadoop Summit, June 2014, San Jose, CA
2. About your speakers
2 Yahoo Proprietary
Sagi Zelnick, Principal Architect, Yahoo
Ledion Bitincka, Principal Architect, Splunk
3. Background
Hadoop @ Yahoo: 8+ years of innovation
Hunk @ Yahoo: organization-wide investment for next 3+ years
Yahoo providing Hunk as a self-service to explore, analyze & visualize
data in HDFS
Hunk allows visual browsing of very complex tables (250+ fields)
Rapid prototyping for new jobs with almost instant results for searches,
without having to wait for the entire job/query to finish
Cuts down on the development cycles by faster interaction with results
5. Over 600PB of Hadoop storage (over half an exabyte)
Very large clusters used by many groups across the enterprise
More than 40,000 individual datanodes
Hadoop is provided as a service
Multiple cluster types such as research, dev, sandbox and production
Services such as HBase, Hive, Oozie, etc…
Users are free to run jobs, but have resource constraints
Maintained by Grid Operations Group
6. Improving visibility & providing operational insights with Hunk
We pointed Hunk at many operational logs and event data we already
have on the grid
This includes system metrics, HDFS ops, JVM stats and YARN metrics
Created instrumentation to measure usage per user and job
Analyzed terabytes of NameNode audit logs
Job history leveraged for visualizing usage/growth and historical views
Custom events for HBase statistics
7. Tracking Hadoop performance and metrics in Hunk
Use Case | Customer | Benefits
Namenode metrics, block ops, memory usage | Research, Dev | Improved performance and stability
System/Hadoop metrics of ~40,000 individual datanodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights into resource consumption | All Grid Customers | Track organic growth
Generate reports on job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Track job logs in near real-time | All Grid Customers / Ops | Detect and search for errors directly from the YARN job logs for troubleshooting
8. Tracking Hadoop performance and metrics continued
Use Case | Customer | Benefits
Find dataset instances/files that have never been accessed after creation | Data Storage Efficiency Team, SE | Savings via reduction of storage costs
How is each user/team using compute and disk capacity on a cluster? | Management / Grid Customers | Metering / Chargeback
Replace ad hoc and legacy solutions for analyzing cluster usage | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Improved Grid utilization and cost reduction
Generate reports on cluster performance, utilization of available capacity, etc. | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Data-mining for product improvements and best practices
Determine KPIs of Hadoop stack components (Pig, Oozie, etc.) | SE / Grid Solutions / Hadoop Stack Development Team | Feedback for product improvements
Find efficacy of various heuristics in Hadoop (data-locality of Tasks, replication of blocks, etc.) | Hadoop Stack Development Team | Fine-tune heuristics for better efficiency
10. Measuring NameNode performance pre & post upgrades
Historical visualizations of all operations
Search data in Hunk from billions of NameNode events
Measure JVM and memory usage
Insights into operational performance
11. Sample visualization in Hunk
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
Last 7 days
✓ 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
[Chart: hourly averages over _time, Fri May 16 to Tue May 20, 2014; y-axis ticks at 200,000,000, 400,000,000 and 600,000,000; series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]
First row of the results table:
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
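The `timechart span=1h avg(number*) as num_*` clause in the search above buckets events by hour, averages every field whose name starts with `number`, and renames each result with a `num_` prefix. A minimal Python sketch of that idea follows. It is not Hunk's implementation: the event layout (dicts with an epoch-seconds `_time`) and the input field names such as `numberHeartBeats` are assumptions made for illustration, and buckets here are aligned to the epoch rather than to a local time zone.

```python
from collections import defaultdict


def timechart_avg(events, span_seconds, prefix, new_prefix):
    """Average every field starting with `prefix` per time bucket.

    `events` is an iterable of dicts carrying an epoch-seconds `_time`
    (an assumed layout, not Hunk's internal one). Returns a dict keyed by
    (bucket_start, renamed_field) with the mean value of that field.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        # Floor the timestamp to the start of its bucket.
        bucket = event["_time"] - event["_time"] % span_seconds
        for name, value in event.items():
            if name.startswith(prefix):
                # Mirror the "as num_*" rename: swap the matched prefix.
                key = (bucket, new_prefix + name[len(prefix):])
                sums[key] += value
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical events; the field names are invented for this example.
sample_events = [
    {"_time": 0, "numberHeartBeats": 10},
    {"_time": 1800, "numberHeartBeats": 20},
    {"_time": 3600, "numberHeartBeats": 30, "numberBlockReports": 4},
]
hourly = timechart_avg(sample_events, 3600, "number", "num_")
```

Keying the result by (bucket, field) keeps the sketch short; a charting layer would pivot it into one column per `num_*` series, as in the results table on the slide.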
12. Sample troubleshooting in Hunk of 750 million events
✓ 2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
[Chart: values over _time, 12:00 PM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks at 250,000,000, 500,000,000, 750,000,000 and 1,000,000,000; series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]
13. Big picture plus granular details
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
Last 2 days
✓ 8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
[Chart: 5-minute averages over _time, 12:00 AM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks at 200 and 400; series: threads_Blocked, threads_New, threads_Runnable, threads_Terminated, threads_TimedWaiting, threads_Waiting]
_time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
14. Analyzing NameNode RPC calls
Who is making what RPC call (open, listStatus, create, etc.)
How often are they making these RPC calls
Which IP/host are they coming from
Search and visualize historical data from billions of events
Prevent NameNode abuse/misuse
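The who/what/where questions above can be answered by counting audit events per (user, command, client IP). The sketch below assumes the usual key=value layout of HDFS NameNode audit log lines (`ugi=`, `ip=`, `cmd=`); the sample lines, user names and addresses are invented, and this is only an illustration of the counting step, not the search the speakers ran in Hunk.

```python
import re
from collections import Counter

# Pull out only the three keys needed for the "who / what / from where" view.
AUDIT_FIELD = re.compile(r"\b(ugi|ip|cmd)=(\S+)")


def rpc_counts(lines):
    """Count audit events per (user, command, client IP)."""
    counts = Counter()
    for line in lines:
        fields = dict(AUDIT_FIELD.findall(line))
        # Skip lines that are missing any of the three keys.
        if {"ugi", "ip", "cmd"} <= fields.keys():
            client_ip = fields["ip"].lstrip("/")  # audit logs write ip=/a.b.c.d
            counts[(fields["ugi"], fields["cmd"], client_ip)] += 1
    return counts


# Invented sample lines in the assumed audit-log layout.
sample_lines = [
    "2014-05-20 00:00:01,000 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 00:00:02,000 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 00:00:03,000 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:SIMPLE)\tip=/10.0.0.2\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    "a line without audit fields",
]
counts = rpc_counts(sample_lines)
heaviest_caller = counts.most_common(1)[0]
```

Sorting the counter by frequency surfaces the heaviest callers first, which is the starting point for spotting NameNode abuse or misuse.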
17. Queue insights
Each Hadoop job runs in a specific queue
We track every aspect of the YARN framework
Immediate queue performance and configuration profiling via job
history server
Historical views and trends that enable better capacity management
Improved queue utilization and allocation management
18. Visualizing queues
New Search
index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
Last 7 days
✓ 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours per queue over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks at 200,000, 400,000 and 600,000; legend (truncated in the source): OTH apg_dai apg_dail apg_hou apg_ho apg_hourl apg curveb curveb sling sling]
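The queue search on this slide converts slot-seconds into GB-hours: it adds map and reduce slot-seconds, multiplies by 0.5, divides by 3600, rounds per job, and then sums per queue in 6-hour buckets. The Python sketch below walks through the same arithmetic. The 0.5 factor is taken from the search as written (presumably a slot is counted as 0.5 GB, though the slides do not say so); the event layout and queue names are assumptions for illustration.

```python
from collections import defaultdict

GB_PER_SLOT = 0.5          # factor from the search; its meaning is an assumption
BUCKET_SECONDS = 6 * 3600  # mirrors span=6h


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """Per-job GB-hours, rounded to a whole number as in the search."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    # Python's round() resolves exact .5 ties to the even neighbour,
    # which may differ from the search language's tie handling.
    return round(total_slot_seconds * GB_PER_SLOT / 3600)


def gb_hours_by_queue(events):
    """Sum rounded GB-hours per (6-hour bucket start, queue).

    Each event is a dict with epoch-seconds `time`, `queue`,
    `mapSlotSeconds` and `reduceSlotSeconds` (an assumed layout).
    Buckets are aligned to the epoch, not to a local time zone.
    """
    totals = defaultdict(int)
    for event in events:
        bucket = event["time"] - event["time"] % BUCKET_SECONDS
        totals[(bucket, event["queue"])] += gb_hours(
            event["mapSlotSeconds"], event["reduceSlotSeconds"]
        )
    return dict(totals)


# Hypothetical job-summary events; the queue names are invented.
jobs = [
    {"time": 0, "queue": "a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"time": 100, "queue": "a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 36000},
    {"time": 100, "queue": "b", "mapSlotSeconds": 14400, "reduceSlotSeconds": 0},
    {"time": 21605, "queue": "a", "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600},
]
usage = gb_hours_by_queue(jobs)
```

Rounding per job before summing, as the search does, means very small jobs contribute zero; summing first and rounding once would give slightly different totals.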
19. Creating job reports per user
Each job is unique and so are the map and reduce elements
How to start analyzing jobs?
Historical job performance and profiling enables in-depth
performance tuning
Long-term historical views and trending of growth
20. More data to tap into with the metastore/hive sources
We will provide Hunk as a self-service to explore & visualize data in HDFS
Using the metastore we can set up virtual indexes to any table(s) in Hive,
without the need to define the schema up-front
Allows for visually browsing very complex tables (250+ fields)
Rapid prototyping for new jobs with almost instant results for searches,
without having to wait for the entire job/query to finish
Cuts down on the development cycles by faster interaction with results
Built-in graphs/charts make for a powerful solution for many situations
29. Fast Deployment and Configuration
Just point at Hadoop
• Certified integrations to all major Hadoop distributions
• Choose 1st-gen MapReduce or YARN
• Create Virtual Indexes across one or more clusters
• From download to searching data in < 60 minutes
Connect to one or multiple Hadoop clusters
YARN certified
30. Interactive Search and Results Preview
Rapidly interact with data
• Powerful Search Processing Language (SPL™)
• Ad hoc exploratory analytics across massive datasets
• Preview results
• No fixed schema
• No requirement to “understand” data upfront
[Screenshot callouts: Search interface; Preview results; Drill down to raw data; Pause or stop MapReduce jobs]
31. Powerful Dashboards for Self-Service Analytics
Interactive Dashboards and Charts
• Easy-to-use dashboard editor
• Chart overlay
• Pan and zoom
• In-dashboard drilldown
• Embed charts and dashboards in 3rd party apps
• Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1
33. Role-based Security for Shared Clusters
Pass-through Authentication
• Provide role-based security for Hadoop clusters
• Access Hadoop resources under security and compliance
• Integrates with Kerberos for Hadoop security
[Diagram: users mapped to Hadoop queues: Business Analyst → Queue: Biz Analytics; Marketing Analyst → Queue: Marketing; Sys Admin → Queue: Prod]
34. Build Analytics-Rich Big Data Apps
Powerful Developer Environment
• Use a standards-based web framework and REST API
• Customize dashboards and UIs with Simple XML, JavaScript or Django
• Choose among SDKs
• One integration for both Splunk Enterprise and Hunk
35. Hunk: One Integrated Platform
FULL-FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform
FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately
INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas
RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks
36. Questions/Comments?
Sagi Zelnick – Principal Architect
Email: zelnicks@yahoo-inc.com
Ledion Bitincka – Principal Architect
Email: lbitincka@splunk.com