1. Grid Operations
Hadoop Operations at LinkedIn
Allen Wittenauer
Grid Computing Architect
©2013 LinkedIn Corporation. All Rights Reserved.
Thursday, March 28, 2013
2. “Hadoop is not a developer problem;
it’s an operations problem.”
-- Hadoop vendor ex-employee
4. § August 2009
– 20 Nodes in 1 grid
– Apache Hadoop 0.20.0
– No configuration management
– No monitoring
– No security
– Free for all, including random mafia hits on running jobs
– FIFO Scheduling
– ~20 users
– 20 tasks per node
– Solaris
– No operational support
6. How We Fixed This
(In Chronological Order)
7. Year One
8. § Dropped task count
– 10 mappers => 7 mappers
– 10 reducers => 5 reducers
§ Reworked ETL
– hourlies => dailies
– Re-ordered to take advantage of compression
§ 10x storage improvement
– Sample impact on one job (not workflow!):
§ 80,000 map tasks => 2,000 map tasks
§ Run time cut in half
§ Optimize workflows/culture shift
§ More task time, fewer tasks
§ Production review to reinforce good behavio(u)r
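The figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the variable and function names are made up for this example, and the numbers are the ones quoted on the slide.

```python
# Per-node task slots before and after the change (values from the slide).
old_slots = {"map": 10, "reduce": 10}
new_slots = {"map": 7, "reduce": 5}


def total_slots(slots):
    """Return the total number of concurrent tasks a node will run."""
    return sum(slots.values())


# Map task counts for the sample job, before and after the ETL rework.
old_map_tasks = 80_000
new_map_tasks = 2_000
reduction_factor = old_map_tasks / new_map_tasks

print(total_slots(old_slots), "->", total_slots(new_slots), "tasks per node")
print(f"{reduction_factor:.0f}x fewer map tasks for the sample job")
```

The totals are consistent with the "20 tasks per node" (August 2009) and "12 tasks per node" (March 2013) figures quoted elsewhere in the deck.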
9. § Switched to Capacity Scheduler
– FIFO is terrible
– Fair Share only viable for small tasks
– Enforced SLAs via custom patch
§ Submitted Jar Size Limit
– Encourage distributed cache usage
– Enforced limit via custom patch
§ Queue allocation
– 5% ETL Tasks
– 15% Fast Queue:
- Task Time < 15 Minutes
- Job Time < 1 Hour
- Slot stealing from "Slow" Queue
– 80% Slow Queue:
- Job Time < 24 Hours
- Up to 80% of slots
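As a rough illustration of the queue split above, the sketch below divides a pool of slots by the listed percentages and picks a queue from the stated time limits. It is not the Capacity Scheduler's API and not the custom SLA patch; `slots_per_queue`, `pick_queue`, and the 1,000-slot cluster are hypothetical names and values chosen for the example.

```python
# Queue shares (percent of slots) and time limits as listed on the slide.
QUEUE_SHARE_PERCENT = {"etl": 5, "fast": 15, "slow": 80}

FAST_MAX_TASK_MINUTES = 15
FAST_MAX_JOB_HOURS = 1
SLOW_MAX_JOB_HOURS = 24


def slots_per_queue(total_slots):
    """Split a cluster's slots across queues using integer percentages."""
    return {
        name: total_slots * percent // 100
        for name, percent in QUEUE_SHARE_PERCENT.items()
    }


def pick_queue(task_minutes, job_hours):
    """Return the queue whose limits the job fits, or None if it fits neither."""
    if task_minutes < FAST_MAX_TASK_MINUTES and job_hours < FAST_MAX_JOB_HOURS:
        return "fast"
    if job_hours < SLOW_MAX_JOB_HOURS:
        return "slow"
    return None


print(slots_per_queue(1000))
print(pick_queue(task_minutes=5, job_hours=0.5))
print(pick_queue(task_minutes=30, job_hours=6))
```

A job that exceeds the 24-hour limit fits neither queue in this sketch; how the real patch treated such jobs is not stated in the slides.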
10. § Benchmarking
– Use production code not TeraSort!
Old Node:          New Node:
- 2 Rack Units     - 1 Rack Unit
- 2 CPUs           - 2 CPUs
- 16 GB            - 24 or 32 GB
- 8 x 1 TB SATA    - 6 x 2 TB SATA
- 1 x 2 gb NIC     - 1 x 1 gb NIC
§ Cut cost per unit in half
§ 2x nodes per rack
§ Extra RAM
– buffering
– bus speed
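The density gain from the new node type can be worked out from the specs above. This is a back-of-the-envelope sketch: the `Node` class and its field names are invented for the example, and only rack units and disk counts from the slide are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Subset of the node specs listed on the slide."""
    rack_units: int
    disks: int
    disk_tb: int

    @property
    def raw_tb(self):
        """Raw disk capacity of one node, in TB."""
        return self.disks * self.disk_tb

    @property
    def raw_tb_per_rack_unit(self):
        """Raw disk capacity per rack unit, in TB."""
        return self.raw_tb / self.rack_units


old = Node(rack_units=2, disks=8, disk_tb=1)
new = Node(rack_units=1, disks=6, disk_tb=2)

print("old:", old.raw_tb, "TB per node,", old.raw_tb_per_rack_unit, "TB per U")
print("new:", new.raw_tb, "TB per node,", new.raw_tb_per_rack_unit, "TB per U")
```

Under these numbers the new node holds more raw disk in half the rack space, which lines up with the "2x nodes per rack" bullet.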
12. Year Two
14. § DataNode disk partitioning
– Separate file systems for different purposes
– Layout (diagram):
- 20 GB /, ... | 200 GB MR | HDFS
- ...
- 5 GB Swap | 200 GB MR | HDFS
– Mount options: noatime, commit=30, data=writeback
§ NN, JT, etc
– No “special hardware” == use SW RAID
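The mount options listed for the DataNode file systems could be turned into fstab entries along these lines. This is a sketch under assumptions: the slide does not name the file system type, so `ext3` is assumed here (the `commit=` and `data=writeback` options exist for ext3/ext4), and the device and mount point in the usage line are hypothetical.

```python
# Mount options as listed on the slide.
MOUNT_OPTIONS = ["noatime", "commit=30", "data=writeback"]


def fstab_line(device, mount_point, fs_type="ext3"):
    """Build one /etc/fstab entry using the slide's mount options.

    fs_type defaults to ext3, which is an assumption, not something
    the slide states.
    """
    options = ",".join(MOUNT_OPTIONS)
    # The trailing "0 0" disables dump and boot-time fsck ordering.
    return f"{device} {mount_point} {fs_type} {options} 0 0"


# Hypothetical device and mount point, for illustration only.
print(fstab_line("/dev/sdb3", "/grid/1/hdfs"))
```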
15. [Diagram: two identical stacks joined by multi-master replication]
– LDAP Master + KDC Master <-> Multi Master Replication <-> LDAP Master + KDC Master
– Each master pair feeds LDAP/KDC Slaves
– Slaves serve: username, uid; group name, gid; netgroup, sudoers
– Client Nodes look these up through nscd
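17. [Diagram: Host (bcfg2 client) and bcfg2 Server]
– Host is a member of Group1, Group2, ...
– bcfg2 Server maps groups to services:
- Group1 -> Svc1, Svc2, ...
- Group2 -> Svc1, Svc3, ...
- Group3 -> Svc4, Svc5, ...
– Content delivered to the host: Svc1 + Svc2 + Svc3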
§ Service Bundle
– RPMs, config files, etc
– Conflict resolution
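The group-to-service mapping in the diagram can be sketched as a simple lookup that merges the services of every group a host belongs to. This is not bcfg2's API or file format; `GROUP_SERVICES` and `services_for` are hypothetical names, and the service lists stop where the slide's ellipses begin.

```python
# Group -> services mapping from the diagram. The slide elides further
# services with "...", so only the named ones appear here.
GROUP_SERVICES = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}


def services_for(host_groups):
    """Return the de-duplicated services for a host, in first-seen order."""
    resolved = []
    for group in host_groups:
        for service in GROUP_SERVICES[group]:
            if service not in resolved:
                resolved.append(service)
    return resolved


# A host in Group1 and Group2 receives Svc1, Svc2 and Svc3,
# matching the "Svc1 + Svc2 + Svc3" content shown in the diagram.
print(services_for(["Group1", "Group2"]))
```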
18. § Different RPM names + different install locations = pre-deploy-ability:
Object                     RPM Name                             File Path
Hadoop 1.0.4-p3 Binaries   hadoop-1043-bin-1.0.4-3              /dir/hadoop-1.0.4-p3
Grid Config for 1.0.4-p3   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.0.4-p3
Hadoop 1.1.2-p1 Binaries   hadoop-1121-bin-1.1.2.1-1            /dir/hadoop-1.1.2-p1
Grid Config for 1.1.2-p1   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.1.2-p1
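The naming idea can be sketched as follows: each release gets its own short tag and its own install directories, so two releases never share a path and a new one can be installed before it is switched on. The helper names `short_tag` and `install_paths` are hypothetical; only the tag and path patterns are taken from the table, and the full RPM name format is deliberately not reproduced.

```python
def short_tag(version, patch):
    """Collapse a version and patch level into the tag used in RPM names,
    e.g. ("1.0.4", 3) -> "1043"."""
    return version.replace(".", "") + str(patch)


def install_paths(version, patch):
    """Return the versioned install directories for binaries and config."""
    label = f"{version}-p{patch}"
    return {
        "binaries": f"/dir/hadoop-{label}",
        "config": f"/dir/grid-conf-{label}",
    }


current = install_paths("1.0.4", 3)
upcoming = install_paths("1.1.2", 1)

# No path is shared, so both releases can sit on disk side by side.
print(set(current.values()).isdisjoint(upcoming.values()))
print(short_tag("1.0.4", 3), short_tag("1.1.2", 1))
```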
19. Year Three+
20. [Diagram: Kerberos trust between two realms]
– Corp IT: Active Directory, realm @CORP
- Password
- krbtgt/user@CORP
- krbtgt/GRID@CORP
– Grid Realm: @GRID
- krbtgt/GRID@CORP
- krbtgt/host@GRID
- krbtgt/service@GRID
– Hadoop Services
21. Many months moving to secure Apache Hadoop...
24. § March 2013
– 5000 Nodes in ~10 grids
– Apache Hadoop 1.0.4 + custom patches
– Full configuration management
– Full monitoring
– Security
– Capacity scheduler with SLA
– ~700 users
– 12 tasks per node
– Linux
– Five dedicated operations staff members
26. Future Work
27. Is ‘pure Hadoop’ the right
tool for all of our workloads?
28. [Diagram labels: YARN, PBS, HDFS, CEPH]
30. § More on LinkedIn Hadoop Performance:
– http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
§ LinkedIn Data Analytics:
– http://data.linkedin.com/