Criteo has a 100 PB production cluster of four namenodes and 2000 datanodes that runs 300,000 jobs a day. At the moment, we back up some of the cluster's data on our old 1200-node ex-production cluster. We build our clusters in our own datacenters, as running in the cloud would be many times more expensive. In 2018, we will build yet another datacenter that will back up, in storage and compute, and then replace the main cluster. This presentation will describe what we learned when building multiple Hadoop clusters and why in-house bare metal is better than the cloud for us. Building a cluster requires testing hardware from several manufacturers and choosing the most cost-effective option. We have now done these tests twice and can provide advice on how to do it right the first time. Our two clusters were meant to provide a redundant solution to Criteo's storage and compute needs. We will explain our project, what went wrong, and our progress in building yet another, even bigger, cluster to create a computing system that will survive the loss of an entire datacenter. Disasters are not always external; an operator error could wipe out both Hadoop clusters. We are thus adding a second, different backup technology for our most important 10 PB of data. Which technology will we use and how did we test it? An HDFS namenode cannot store an infinite number of data blocks. We will share what we have had to do to reach almost half a billion blocks, including sharding our data by adding namenodes. Hadoop, especially at this scale, does not run itself, so what operational skills and tools are required to keep the clusters healthy, the data safe and the jobs running 24 hours a day, every day? Stuart Pook, Senior DevOps Engineer, Criteo.
Help Hadoop survive the 300 million block barrier and then back it up
1. HELP HADOOP SURVIVE THE 300 MILLION BLOCK BARRIER AND THEN BACK IT UP
Stuart Pook (s.pook@criteo.com @StuartPook)
2. BROUGHT TO YOU BY LAKE-STORAGE
Anthony Rabier, Dhia Moakhar, Marouane Benalla, Meriam Lachkar,
Stuart Pook
3. CRITEO DYNAMIC RETARGETING
Online advertising
Target the right user
At the right time
With the right message
Tailored video and display ads
5. TODAY'S PRIMARY DATA INPUT
Kafka
Up to 7 000 000 messages/s
Bids, impressions, sales, visits, …
400 billion messages/day
Up to 3 GB/s (gzipped)
HDFS
500 TB created per day
410 TB removed per day
260 000 MapReduce jobs per day
6 000 Spark jobs per day
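The throughput figures above are mutually consistent, as a quick check shows (a sketch using only the numbers from the slides; note that the 90 TB/day net growth reappears later in the deck):

```python
# Sanity-check the ingestion and churn figures quoted above.

SECONDS_PER_DAY = 24 * 60 * 60

# Kafka: 400 billion messages/day as an average rate.
avg_msgs_per_s = 400e9 / SECONDS_PER_DAY
print(f"average Kafka rate: {avg_msgs_per_s / 1e6:.1f} M msg/s")  # well below the 7 M msg/s peak

# HDFS: 500 TB created minus 410 TB removed per day.
net_growth_tb = 500 - 410
print(f"net HDFS growth: {net_growth_tb} TB/day")
```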
6. HADOOP'S COMPUTE AND DATA ARE ESSENTIAL
Extract, Transform & Load logs (ETL)
Machine Learning to generate bidding models
Billing
Business analysis
7. HADOOP PROVIDES LOCAL REDUNDANCY
Failing datanodes (1 or 2)
Failing racks (1)
Soon one failing pod
Failing namenodes (1)
Failing resourcemanager (1)
8. NO PROTECTION AGAINST:
Data centre disaster
Multiple datanode failures in a short time
Multiple rack failures in a day
Operator error
9. DATA BACKUP IN THE CLOUD
Backup requires lots of bandwidth
≥ 1.2 GB/s backup from HDFS
≥ 8 GB/s for backup catchup
≥ 28 GB/s restore to HDFS
Restoring 2 PB at 50 Gb/s takes too long
What about compute?
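The bandwidth argument against a cloud restore can be made concrete (a sketch, assuming the 50 Gb/s link and 2 PB restore from the slide):

```python
# How long does restoring 2 PB take over a 50 Gb/s link?
link_gbit_per_s = 50
data_pb = 2

bytes_per_s = link_gbit_per_s * 1e9 / 8      # 6.25 GB/s
total_bytes = data_pb * 1e15
seconds = total_bytes / bytes_per_s
print(f"{seconds / 86400:.1f} days")          # ~3.7 days, with the cluster degraded the whole time
```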
10. COMPUTE IN THE CLOUD
60 000 cores requires reservation
Reservation expensive
No need for elasticity (batch processing)
Criteo has data centres
Criteo likes bare metal
Cloud is several times more expensive
In-house gets us exactly the network & hardware we need
11. BUILD NEW DATA CENTRE (PA4) IN 9 MONTHS
Space for 5000 machines
Non-blocking network
< 10% network utilisation
10 Gb/s endpoints
Level 3 routing
Clos topology
Power 1 megawatt + option for 1 MW
12. NEW HARDWARE FOR PA4
Had one supplier
Need competition to keep prices down
3 replies to our call for tenders
3 similar 2U machines
16 (or 12) 6 TB SATA disks
2 Xeon E5-2650L v3, 24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
2 different RAID cards
13. TEST THE HARDWARE (PA4)
Three 10 node clusters
Disk bandwidth
Disk errors?
Network bandwidth
Teragen
Zero replication (disks)
High replication (network)
Terasort
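TeraGen writes fixed 100-byte records, so sizing a benchmark run is simple arithmetic (a sketch; the one-petabyte target matches the petasort validation runs mentioned later in the deck):

```python
# TeraGen generates 100-byte records, so a benchmark of a given
# size is specified to the job as a row count.
ROW_BYTES = 100
target_bytes = 1e15  # 1 PB, as used for the petasort validation runs

rows = int(target_bytes // ROW_BYTES)
print(f"teragen rows for 1 PB: {rows:,}")  # 10,000,000,000,000
```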
14. HARDWARE IS SIMILAR (PA4)
Eliminate one manufacturer because of:
4 DOA disks
Other failed disks
20% higher power consumption
Choose the cheapest and most dense
Huawei
LSI-3008 RAID card
15. MIX THE HARDWARE
Hardware interventions are more diverse
Multiple configurations needed
Some clusters have both hardware types
We have more choice at each order
Avoid vendor lock-in
Negotiate better prices
16. HAVE THE DC, BUILD HADOOP
Configure using Chef
Infrastructure is code in git
Automate to scale (because we don't)
Test Hadoop with 10 hour petasorts
Tune Hadoop for this scale
Namenode machine crashes
Upgrade kernel
Rolling restart on all machines
Rack by rack, not node by node
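Restarting rack by rack rather than node by node works because HDFS keeps a replica of every block outside any single rack. A minimal sketch of the batching logic (the node list, restart callable and health check are hypothetical, not Criteo's actual tooling):

```python
from collections import defaultdict

def rack_batches(nodes):
    """Group (host, rack) pairs into one restart batch per rack."""
    racks = defaultdict(list)
    for host, rack in nodes:
        racks[rack].append(host)
    return list(racks.values())

def rolling_restart(nodes, restart, wait_until_healthy):
    # Restart one whole rack at a time: rack-aware replica placement
    # guarantees a copy of every block outside the rack being restarted.
    for batch in rack_batches(nodes):
        for host in batch:
            restart(host)
        wait_until_healthy()  # e.g. wait for under-replicated blocks to drain
```

For example, `rack_batches([("dn1", "r1"), ("dn2", "r1"), ("dn3", "r2")])` yields `[["dn1", "dn2"], ["dn3"]]`, so both r1 nodes go down together instead of doubling the restart count.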
17. PA4 CLUSTER ONLINE
More capacity now that we have 2 clusters
Users love it
Users find new uses
Soon using more capacity than the new cluster provides
Impossible to stop the old cluster
18. SITUATION NOW WORSE
Two critical clusters
Two Hadoop versions to support CDH4 & CDH5
Not one but two SPOFs
19. GROW THE NEW DATACENTRE
Add hundreds of new nodes
Soon the new cluster will be big enough
1370 datanodes
+ 650 in Q3 2017
+ 900 in Q4 2017
> 2800 in Q2 2018
20. NEED FAST NAMENODES
Namenode does many sequential operations
Long locks
Failovers too slow
Heartbeats lost
→ fast CPU
Big namenodes
768 GiB RAM
2 × Intel Xeon CPU E5-2643 v4 @ 3.40GHz
3 × RAID 1 of 2 SSD
21. FEDERATE FOR MORE BLOCKS
Too many blocks for the namenode
Federation with 3 namespaces
649 734 268 + 281 542 996 + 153 263 676
1 084 540 940 file system objects in PA4
Impact on users
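HDFS federation splits the namespace across independent namenodes that all share the same pool of datanodes. A hedged sketch of the relevant hdfs-site.xml fragment (the nameservice IDs and hostnames are invented for illustration; only the property names are standard):

```xml
<!-- Three independent namespaces sharing one pool of datanodes.
     Nameservice IDs and hosts below are illustrative only. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns-root,ns-user,ns-tmp</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-root</name>
  <value>nn-root.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-user</name>
  <value>nn-user.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-tmp</name>
  <value>nn-tmp.example.com:8020</value>
</property>
```

The impact on users comes from the fact that each namespace is a separate file system root, so paths must be mapped onto the right namenode (for example with client-side mount tables).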
22. ONE SPOF IS ENOUGH
Move all jobs and data to the new cluster (PA4)
Old cluster (AM5) for backups
Only one SPOF but still no compute redundancy
Data on two clusters
23. IS THE DATA SAFE?
Same operators on both clusters
One chef server for both clusters
Same code on both clusters
Single mistake → both clusters
24. BACKUP ON DIFFERENT TECHNOLOGY
≥ 1.2 GB/s backup from HDFS
≥ 8 GB/s for catchup
≥ 28 GB/s restore to HDFS
> 200 million files
10 PB PoC on 108 nodes
Candidates: Ceph, OpenIO, MapR
All met the objectives, but MapR was chosen
Cluster to be built
Tools to be written
Data owners to select data
26. HADOOP AT CRITEO TODAY
2860 Huawei datanodes
16 × 6 TB SATA disks
2 Xeon E5-2650L v3 @ 1.80GHz
24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
6 big Namenodes with 222 PB raw storage
Resource Manager shows 545 TB RAM
466 620 VCores
AM5 for backups
27. 2018 PLAN FOR A NEW CLUSTER (AM6)
Classify critical jobs
Identify critical data
Critical jobs must fit on one cluster
Critical data must be copied
Other jobs are distributed
Still being put into place
28. ANOTHER ROUND OF TENDERS
Need more CPU
Denser machines
4U, 8 nodes, 16 CPUs, 4 × 8 × 2.5" disks (8/U)
2U, 4 nodes, 8 CPUs, 6 × 4 × 2.5" disks (12/U)
Infrastructure validation
Hadoop tests and benchmarks
Added test machines to preprod
Statistics for best CPU
Disks can be added
Change disk to CPU & RAM ratio
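The two chassis options in the tender trade CPU density against disk density, and the disks-per-U figures quoted above follow directly:

```python
# Disks per rack unit for the two chassis in the tender.
# (rack units, nodes per chassis, 2.5" disks per node)
chassis = {
    "4U x 8 nodes": (4, 8, 4),   # 4 x 8 x 2.5" disks
    "2U x 4 nodes": (2, 4, 6),   # 6 x 4 x 2.5" disks
}
for name, (u, nodes, disks_per_node) in chassis.items():
    print(name, nodes * disks_per_node / u, "disks/U")  # 8.0 and 12.0
```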
30. STAYING ALIVE — TUNE HADOOP
Increase bandwidth and time limit for checkpoint
GC Tuning: Serial, Parallel, CMS or G1
Permanent (losing) battle
Tune Azul JVM for namenode
580 GiB heap
1 GC thread per 500 MB allocation/s
Easy and it works
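A commonly quoted rule of thumb is that each HDFS file-system object costs on the order of 150 bytes of namenode heap, which shows why heaps of this size are needed (a back-of-envelope sketch; the 150-byte figure is an approximation from Hadoop folklore, not a measurement on these clusters):

```python
# Rough namenode heap for ~650 M objects (the largest namespace
# quoted earlier), at ~150 bytes of heap per object.
BYTES_PER_OBJECT = 150          # rule-of-thumb approximation
objects = 649_734_268

heap_gib = objects * BYTES_PER_OBJECT / 2**30
print(f"~{heap_gib:.0f} GiB of live object data")  # before GC headroom, caches and RPC buffers
```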
31. STAYING ALIVE — FIX BUGS
The cluster crashes; find the bug; if it is already fixed upstream, backport the fix, else fix it ourselves
Fix
HDFS-10220 expired leases make namenode unresponsive and
failover
Backport
YARN-4041 Slow delegation token renewal prolongs RM
recovery
HDFS-9305 Delayed heartbeat processing causes storm of
heartbeats
YARN-4546 ResourceManager crash due to scheduling
opportunity overflow
HDFS-9906 Remove spammy log spew when a datanode is
restarted
32. STAYING ALIVE — MONITORING
HDFS
Namenode: missing blocks, GC, checkpoints, safemode, QPS,
live datanodes
Datanodes: disks, read/write throughput, space
YARN
Queue length, memory & CPU usage, job duration (scheduling
+ run time)
ResourceManager: QPS
Bad nodes
Probes to emulate client behavior with witness jobs
Zookeeper: availability, probes
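The namenode metrics above (missing blocks, safemode, live datanodes) are exposed by the JMX servlet every namenode serves over HTTP. A minimal probe sketch (the hostname is hypothetical; `MissingBlocks` and `UnderReplicatedBlocks` are standard FSNamesystem metric names):

```python
import json
from urllib.request import urlopen

FSNS_QUERY = "Hadoop:service=NameNode,name=FSNamesystem"

def parse_fsns(jmx_json):
    """Extract the alerting metrics from the JMX servlet's JSON reply."""
    bean = jmx_json["beans"][0]
    return {
        "missing_blocks": bean["MissingBlocks"],
        "under_replicated": bean["UnderReplicatedBlocks"],
    }

def namenode_health(host, port=50070):
    # 50070 is the CDH5-era default HTTP port; newer Hadoop uses 9870.
    url = f"http://{host}:{port}/jmx?qry={FSNS_QUERY}"
    with urlopen(url, timeout=10) as resp:
        return parse_fsns(json.load(resp))
```

A monitoring system can call `namenode_health` periodically and alert when `missing_blocks` is non-zero, alongside the witness jobs that exercise the client path end to end.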
33. WE HAVE
2 prod clusters
1 prod cluster under construction
2 pre-prod clusters
1 infrastructure cluster
> 4300 datanodes
273 PB raw disk
592 180 vcores
667 TiB RAM in RM
> 265 000 jobs/day
90 TB daily HDFS growth
15 PB read per day
3 PB written per day
34. UPCOMING CHALLENGES
Optimize and improve Hadoop
Install a new bare-metal data-centre
Schedule jobs over two clusters
Synchronise two HDFS clusters
Backup HDFS to MapR
We are hiring
Come and join us in Paris, Palo Alto or Ann Arbor (MI)
s.pook@criteo.com @StuartPook
Questions?