Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Presented by Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk
Hadoop Summit, June 2014, San Jose, CA

About your speakers
2 Yahoo Proprietary
Sagi Zelnick, Principal Architect, Yahoo
Ledion Bitincka, Principal Architect, Splunk

Background
• Hadoop @ Yahoo: 8+ years of innovation
• Hunk @ Yahoo: organization-wide investment for the next 3+ years
• Yahoo is providing Hunk as a self-service to explore, analyze & visualize data in HDFS
• Hunk allows visual browsing of very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Shortens development cycles through faster interaction with results

History of Hadoop innovation @ Yahoo

Over 600PB of Hadoop storage (over half an exabyte)
• Very large clusters used by many groups across the enterprise
• More than 40,000 individual datanodes
• Hadoop is provided as a service
• Multiple cluster types such as research, dev, sandbox and production
• Services such as HBase, Hive, Oozie, etc.
• Users are free to run jobs, but have resource constraints
• Maintained by the Grid Operations Group

Improving visibility & providing operational insights with Hunk
• We pointed Hunk at many operational logs and event data we already have on the grid
• This includes system metrics, HDFS ops, JVM stats and YARN metrics
• Created instrumentation to measure usage per user and job
• Analyzed terabytes of NameNode audit logs
• Job history leveraged for visualizing usage/growth and historical views
• Custom events for HBase statistics

Tracking Hadoop performance and metrics in Hunk
Use Case | Customer | Benefits
Namenode metrics, block ops, memory usage | Research, Dev | Improved performance and stability
System/Hadoop metrics of ~40,000 individual datanodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights into resource consumption | All Grid Customers | Track organic growth
Generate reports on job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Track job logs in near real-time | All Grid Customers / Ops | Detect and search for errors directly from the YARN job logs for troubleshooting

Tracking Hadoop performance and metrics continued
Use Case | Customer | Benefits
Find dataset instances/files that have never been accessed after creation | Data Storage Efficiency Team, SE | Savings via reduction of storage costs
How is each user/team using compute and disk capacity on a cluster? | Management / Grid Customers | Metering / Chargeback
Replace ad hoc and legacy solutions for analyzing cluster usage | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Improved Grid utilization and cost reduction
Generate reports on cluster performance, utilization of available capacity, etc. | SE / Grid Solutions / Grid Performance / Hadoop Core Development Team | Data mining for product improvements and best practices
Determine KPIs of Hadoop stack components (Pig, Oozie, etc.) | SE / Grid Solutions / Hadoop Stack Development Team | Feedback for product improvements
Find efficacy of various heuristics in Hadoop (data-locality of Tasks, replication of blocks, etc.) | Hadoop Stack Development Team | Fine-tune heuristics for better efficiency
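The first use case in this table (files never accessed after creation) can be illustrated with a small sketch. This is not the search used at Yahoo; it assumes the audit events have already been reduced to hypothetical (cmd, path) pairs in chronological order, and the sample paths are made up.

```python
def never_accessed(events):
    """Return paths that were created but never opened afterwards.

    `events` is an iterable of (cmd, path) tuples in chronological order.
    The tuple layout is an assumption made for this sketch.
    """
    created = set()
    accessed = set()
    for cmd, path in events:
        if cmd == "create":
            created.add(path)
            accessed.discard(path)  # a re-created file starts fresh
        elif cmd == "open" and path in created:
            accessed.add(path)
    return sorted(created - accessed)


# Made-up sample events, for illustration only.
events = [
    ("create", "/data/a"),
    ("create", "/data/b"),
    ("open", "/data/a"),
    ("create", "/data/c"),
]
print(never_accessed(events))
```

A real report would also need to bound the time window and handle renames and deletes, which this sketch ignores.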

Sample search in Hunk

Measuring NameNode performance pre & post upgrades
• Historical visualizations of all operations
• Search data in Hunk from billions of NameNode events
• Measure JVM and memory usage
• Insights into operational performance

Sample visualization in Hunk
Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
Time range: Last 7 days
Events: 10,086 (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
[Chart: hourly averages over _time, Fri May 16 to Tue May 20, 2014; y-axis ticks 200,000,000 to 600,000,000. Series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]
First result row (2014-05-15 01:00):
num_BlockReports | 1124437.735902
num_CopyBlockOperations | 46721126.819672
num_HeartBeats | 514957.384098
num_ReadBlockOperations | 12930433.077869
num_ReadMetadataOperations | 0.000000
num_ReplaceBlockOperations | 94210832.786885
num_WriteBlockOperations | 63512425.967213
num_blockChecksumOp | 13975.306557

Sample troubleshooting in Hunk of 750 million events
Events: 2,753 (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
[Chart: the same num_* series over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks 250,000,000 to 1,000,000,000. Series: num_BlockReports, num_CopyBlockOperations, num_HeartBeats, num_ReadBlockOperations, num_ReadMetadataOperations, num_ReplaceBlockOperations, num_WriteBlockOperations, num_blockChecksumOp]

Big picture plus granular details
Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*
Time range: Last 2 days
Events: 8,463 (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)
[Chart: 5-minute averages over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks 200 and 400]
_time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
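The searches on these slides average metrics into fixed time spans (1 hour, 5 minutes). As a rough illustration of that idea only, and not of how Hunk or SPL implements it, the sketch below floors each timestamp to the start of its span and averages the values per bucket. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_average(samples, span_minutes):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    span = timedelta(minutes=span_minutes)
    epoch = datetime(1970, 1, 1)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        # Floor the timestamp to the start of its bucket.
        bucket = epoch + ((ts - epoch) // span) * span
        sums[bucket] += value
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Made-up thread counts, for illustration only.
samples = [
    (datetime(2014, 5, 20, 0, 0, 30), 70.0),
    (datetime(2014, 5, 20, 0, 3, 0), 74.0),
    (datetime(2014, 5, 20, 0, 6, 0), 71.0),
]
averages = bucket_average(samples, span_minutes=5)
for bucket, avg in averages.items():
    print(bucket, avg)
```

The first two samples fall in the 00:00 bucket and the third in the 00:05 bucket, which mirrors the one-row-per-span layout of the result table.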

Analyzing NameNode RPC calls
• Who is making which RPC call (open, listStatus, create, etc.)
• How often are they making these RPC calls
• Which IP/host are the calls coming from
• Search and visualize historical data from billions of events
• Prevent NameNode abuse/misuse
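The questions above (who, which call, how often, from where) amount to counting audit events by user, source address and command. The sketch below is a minimal illustration outside Hunk. It assumes audit lines carry key=value fields named ugi, ip and cmd, in the general shape of HDFS audit logs; exact fields vary by Hadoop version, and the sample lines are made up.

```python
import re
from collections import Counter

# Assumed field layout: ... ugi=<user> ... ip=/<address> ... cmd=<command> ...
AUDIT_RE = re.compile(
    r"ugi=(?P<user>\S+).*?\bip=/(?P<ip>\S+).*?\bcmd=(?P<cmd>\S+)"
)


def count_rpc_calls(lines):
    """Count audit events per (user, ip, cmd); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:
            counts[(match["user"], match["ip"], match["cmd"])] += 1
    return counts


# Made-up sample lines, for illustration only.
lines = [
    "2014-05-20 00:00:01,123 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 00:00:02,456 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 00:00:03,789 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=bob (auth:SIMPLE)\tip=/10.0.0.2\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
]
counts = count_rpc_calls(lines)
for key, n in counts.most_common():
    print(key, n)
```

Sorting by count puts the heaviest callers first, which is the starting point for spotting NameNode misuse.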

Visualizing 834 million discrete events …

… continued

Queue insights
• Each Hadoop job runs in a specific queue
• We track every aspect of the YARN framework
• Immediate queue performance and configuration profiling via job history server
• Historical views and trends that enable better capacity management
• Improved queue utilization and allocation management

Visualizing queues
Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
Time range: Last 7 days
Events: 1,175,726 (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours per queue in 6-hour spans over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks 200,000 to 600,000. Legend (truncated in the source): OTH apg_dai apg_dail apg_hou apg_ho apg_hourl apg curveb curveb sling sling]
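The arithmetic in the queue search can be restated outside SPL. The sketch below follows the same steps: add map and reduce slot-seconds, multiply by 0.5, divide by 3600, round, then sum per queue. Reading the 0.5 factor as gigabytes per slot is an assumption, the 6-hour time bucketing is left out, and the queue names and values are hypothetical.

```python
from collections import defaultdict

GB_PER_SLOT = 0.5  # assumed meaning of the 0.5 factor in the search


def gb_hours_by_queue(jobs):
    """Sum rounded per-job GB-hours for each queue."""
    totals = defaultdict(int)
    for job in jobs:
        slot_seconds = job["mapSlotSeconds"] + job["reduceSlotSeconds"]
        # Note: Python's round() sends exact .5 ties to the nearest even
        # integer, which may differ from the search's rounding of ties.
        totals[job["queue"]] += round(slot_seconds * GB_PER_SLOT / 3600)
    return dict(totals)


# Made-up job summary records, for illustration only.
jobs = [
    {"queue": "queue_a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 7200},
    {"queue": "queue_a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 0},
    {"queue": "queue_b", "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600},
]
totals = gb_hours_by_queue(jobs)
print(totals)
```

As in the search, rounding happens per job before summing, so many short jobs can contribute zero to a queue's total.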

Creating job reports per user
• Each job is unique and so are the map and reduce elements
• How to start analyzing jobs?
• Historical job performance and profiling enables in-depth performance tuning
• Long-term historical views and trending of growth

More data to tap into with the metastore/hive sources
• We will provide Hunk as a self-service to explore & visualize data in HDFS
• Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
• Allows for visually browsing very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Shortens development cycles through faster interaction with results
• Built-in graphs/charts make for a powerful solution for many situations

Hunk + Hadoop Demo

Integrated Analytics Platform
Full-featured, Integrated Product: Explore, Analyze, Visualize, Dashboards, Share
Insights for Everyone
Works with What You Have Today: Hadoop Clusters (Hadoop Client Libraries); NoSQL and Other Data Stores (Streaming Resource Libraries for Diverse Data Stores)

Fast Deployment and Configuration
Just point at Hadoop
• Certified integrations to all major Hadoop distributions
• Choose 1st-gen MapReduce or YARN
• Create Virtual Indexes across one or more clusters
• From download to searching data in < 60 minutes
Connect to one or multiple Hadoop clusters (YARN certified)

Interactive Search and Results Preview
Rapidly interact with data
• Powerful Search Processing Language (SPL™)
• Ad hoc exploratory analytics across massive datasets
• Preview results
• No fixed schema
• No requirement to “understand” data upfront
[Screenshot callouts: Search interface; Preview results; Drill down to raw data; Pause or stop MapReduce jobs]

Powerful Dashboards for Self-Service Analytics
Interactive Dashboards and Charts
• Easy-to-use dashboard editor
• Chart overlay
• Pan and zoom
• In-dashboard drilldown
• Embed charts and dashboards in 3rd party apps
• Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1

Hive Data Support
Supported File Formats
• Text files
• Sequence files
• RCFile
• ORC files
• Parquet

Role-based Security for Shared Clusters
Pass-through Authentication
• Provide role-based security for Hadoop clusters
• Access Hadoop resources under security and compliance
• Integrates with Kerberos for Hadoop security
[Diagram: users Business Analyst, Marketing Analyst and Sys Admin mapped to Hadoop queues]
Business Analyst -> Queue: Biz Analytics
Marketing Analyst -> Queue: Marketing
Sys Admin2 -> Queue: Prod

Build Analytics-Rich Big Data Apps
Powerful Developer Environment
• Use a standards-based web framework and REST API
• Customize dashboards and UIs with Simple XML, JavaScript or Django
• Choose among SDKs
• One integration for both Splunk Enterprise and Hunk

Hunk: One Integrated Platform
FULL-FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform
FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately
INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas
RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks

Questions/Comments?
Sagi Zelnick – Principal Architect
Email: zelnicks@yahoo-inc.com
Ledion Bitincka – Principal Architect
Email: lbitincka@splunk.com
