Criteo has a 100 PB production cluster of four namenodes and 2000 datanodes that runs 300,000 jobs a day. At the moment, we back up some of the cluster's data on our old 1200-node ex-production cluster. We build our clusters in our own datacenters, as running in the cloud would be many times more expensive. In 2018, we will build yet another datacenter that will back up, in storage and compute, and then replace the main cluster. This presentation will describe what we learned when building multiple Hadoop clusters and why in-house bare metal is better than the cloud for us. Building a cluster requires testing hardware from several manufacturers and choosing the most cost-effective option. We have now done these tests twice and can provide advice on how to do it right the first time. Our two clusters were meant to provide a redundant solution to Criteo's storage and compute needs. We will explain our project, what went wrong, and our progress in building yet another, even bigger, cluster to create a computing system that will survive the loss of an entire datacenter. Disasters are not always external; an operator error could wipe out both Hadoop clusters. We are thus adding a second, different backup technology for our most important 10 PB of data. Which technology will we use and how did we test it? An HDFS namenode cannot store an infinite number of data blocks. We will share what we have had to do to reach almost half a billion blocks, including sharding our data by adding namenodes. Hadoop, especially at this scale, does not run itself, so what operational skills and tools are required to keep the clusters healthy, the data safe and the jobs running 24 hours a day, every day? Stuart Pook, Senior DevOps Engineer, Criteo.
Help Hadoop survive the 300 million block barrier and then back it up
1. HELP HADOOP SURVIVE THE 300 MILLION BLOCK BARRIER AND THEN BACK IT UP
Stuart Pook (s.pook@criteo.com @StuartPook)
2. BROUGHT TO YOU BY LAKE-STORAGE
Anthony Rabier, Dhia Moakhar, Marouane Benalla, Meriam Lachkar,
Stuart Pook
3. CRITEO DYNAMIC RETARGETING
Online advertising
Target the right user
At the right time
With the right message
Tailored video and display ads
5. TODAY'S PRIMARY DATA INPUT
Kafka
Up to 7 000 000 messages/s
Bids, impressions, sales, visits, …
400 billion messages/day
Up to 3 GB/s (gzipped)
HDFS
500 TB created per day
410 TB removed per day
260 000 MapReduce jobs per day
6 000 Spark jobs per day
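The throughput figures above are mutually consistent, as a quick check shows (a sketch using only the numbers from the slides; note that the 90 TB/day net growth reappears later in the deck):

```python
# Sanity-check the ingestion and churn figures quoted above.

SECONDS_PER_DAY = 24 * 60 * 60

# Kafka: 400 billion messages/day as an average rate.
avg_msgs_per_s = 400e9 / SECONDS_PER_DAY
print(f"average Kafka rate: {avg_msgs_per_s / 1e6:.1f} M msg/s")  # well below the 7 M msg/s peak

# HDFS: 500 TB created minus 410 TB removed per day.
net_growth_tb = 500 - 410
print(f"net HDFS growth: {net_growth_tb} TB/day")
```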
6. HADOOP'S COMPUTE AND DATA ARE ESSENTIAL
Extract, Transform & Load logs (ETL)
Machine Learning to generate bidding models
Billing
Business analysis
7. HADOOP PROVIDES LOCAL REDUNDANCY
Failing datanodes (1 or 2)
Failing racks (1)
Soon one failing pod
Failing namenodes (1)
Failing resourcemanager (1)
8. NO PROTECTION AGAINST:
Data centre disaster
Multiple datanode failures in a short time
Multiple rack failures in a day
Operator error
9. DATA BACKUP IN THE CLOUD
Backup requires lots of bandwidth
≥ 1.2 GB/s backup from HDFS
≥ 8 GB/s for backup catchup
≥ 28 GB/s restore to HDFS
Restoring 2 PB at 50 Gb/s takes too long
What about compute?
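The bandwidth argument against a cloud restore can be made concrete (a sketch, assuming the 50 Gb/s link and 2 PB restore from the slide):

```python
# How long does restoring 2 PB take over a 50 Gb/s link?
link_gbit_per_s = 50
data_pb = 2

bytes_per_s = link_gbit_per_s * 1e9 / 8      # 6.25 GB/s
total_bytes = data_pb * 1e15
seconds = total_bytes / bytes_per_s
print(f"{seconds / 86400:.1f} days")          # ~3.7 days, with the cluster degraded the whole time
```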
10. COMPUTE IN THE CLOUD
60 000 cores requires reservation
Reservation expensive
No need for elasticity (batch processing)
Criteo has data centres
Criteo likes bare metal
Cloud is several times more expensive
In-house gets us exactly the network & hardware we need
11. BUILD NEW DATA CENTRE (PA4) IN 9 MONTHS
Space for 5000 machines
Non-blocking network
< 10% network utilisation
10 Gb/s endpoints
Level 3 routing
Clos topology
Power 1 megawatt + option for 1 MW
12. NEW HARDWARE FOR PA4
Had one supplier
Need competition to keep prices down
3 replies to our call for tenders
3 similar 2U machines
16 (or 12) 6 TB SATA disks
2 Xeon E5-2650L v3, 24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
2 different RAID cards
13. TEST THE HARDWARE (PA4)
Three 10 node clusters
Disk bandwidth
Disk errors?
Network bandwidth
Teragen
Zero replication (disks)
High replication (network)
Terasort
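TeraGen writes fixed 100-byte records, so sizing a benchmark run is simple arithmetic (a sketch; the one-petabyte target matches the petasort validation runs mentioned later in the deck):

```python
# TeraGen generates 100-byte records, so a benchmark of a given
# size is specified to the job as a row count.
ROW_BYTES = 100
target_bytes = 1e15  # 1 PB, as used for the petasort validation runs

rows = int(target_bytes // ROW_BYTES)
print(f"teragen rows for 1 PB: {rows:,}")  # 10,000,000,000,000
```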
14. HARDWARE IS SIMILAR (PA4)
Eliminate one manufacturer because of:
4 DOA disks
Other failed disks
20% higher power consumption
Choose the cheapest and most dense
Huawei
LSI-3008 RAID card
15. MIX THE HARDWARE
Hardware interventions are more diverse
Multiple configurations needed
Some clusters have both hardware types
We have more choice at each order
Avoid vendor lock-in
Negotiate better prices
16. HAVE THE DC, BUILD HADOOP
Configure using Chef
Infrastructure is code in git
Automate to scale (because we don't)
Test Hadoop with 10 hour petasorts
Tune Hadoop for this scale
Namenode machine crashes
Upgrade kernel
Rolling restart on all machines
Rack by rack, not node by node
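Restarting rack by rack rather than node by node works because HDFS keeps a replica of every block outside any single rack. A minimal sketch of the batching logic (the node list, restart callable and health check are hypothetical, not Criteo's actual tooling):

```python
from collections import defaultdict

def rack_batches(nodes):
    """Group (host, rack) pairs into one restart batch per rack."""
    racks = defaultdict(list)
    for host, rack in nodes:
        racks[rack].append(host)
    return list(racks.values())

def rolling_restart(nodes, restart, wait_until_healthy):
    # Restart one whole rack at a time: rack-aware replica placement
    # guarantees a copy of every block outside the rack being restarted.
    for batch in rack_batches(nodes):
        for host in batch:
            restart(host)
        wait_until_healthy()  # e.g. wait for under-replicated blocks to drain
```

For example, `rack_batches([("dn1", "r1"), ("dn2", "r1"), ("dn3", "r2")])` yields `[["dn1", "dn2"], ["dn3"]]`, so both r1 nodes go down together instead of doubling the restart count.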
17. PA4 CLUSTER ONLINE
More capacity now that we have 2 clusters
Users love it
Users find new uses
Soon using more capacity than the new cluster provides
Impossible to stop the old cluster
18. SITUATION NOW WORSE
Two critical clusters
Two Hadoop versions to support CDH4 & CDH5
Not one but two SPOFs
19. GROW THE NEW DATACENTRE
Add hundreds of new nodes
Soon the new cluster will be big enough
1370 datanodes
+ 650 in Q3 2017
+ 900 in Q4 2017
> 2800 in Q2 2018
20. NEED FAST NAMENODES
Namenode does many sequential operations
Long locks
Failovers too slow
Heartbeats lost
→ fast CPU
Big namenodes
768 GiB RAM
2 × Intel Xeon CPU E5-2643 v4 @ 3.40GHz
3 × RAID 1 of 2 SSD
21. FEDERATE FOR MORE BLOCKS
Too many blocks for the namenode
Federation with 3 namespaces
649 734 268 + 281 542 996 + 153 263 676
1 084 540 940 file system objects in PA4
Impact on users
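HDFS federation splits the namespace across independent namenodes that all share the same pool of datanodes. A hedged sketch of the relevant hdfs-site.xml fragment (the nameservice IDs and hostnames are invented for illustration; only the property names are standard):

```xml
<!-- Three independent namespaces sharing one pool of datanodes.
     Nameservice IDs and hosts below are illustrative only. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns-root,ns-user,ns-tmp</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-root</name>
  <value>nn-root.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-user</name>
  <value>nn-user.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-tmp</name>
  <value>nn-tmp.example.com:8020</value>
</property>
```

The impact on users comes from the fact that each namespace is a separate file system root, so paths must be mapped onto the right namenode (for example with client-side mount tables).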
22. ONE SPOF IS ENOUGH
Move all jobs and data to the new cluster (PA4)
Old cluster (AM5) for backups
Only one SPOF but still no compute redundancy
Data on two clusters
23. IS THE DATA SAFE?
Same operators on both clusters
One chef server for both clusters
Same code on both clusters
Single mistake → both clusters
24. BACKUP ON DIFFERENT TECHNOLOGY
≥ 1.2 GB/s backup from HDFS
≥ 8 GB/s for catchup
≥ 28 GB/s restore to HDFS
> 200 million files
10 PB PoC on 108 nodes
Candidates: Ceph, OpenIO, MapR
All met the objectives, but MapR was chosen
Cluster to be built
Tools to be written
Data owners to select data
26. HADOOP AT CRITEO TODAY
2860 Huawei datanodes
16 × 6 TB SATA disks
2 Xeon E5-2650L v3 @ 1.80GHz
24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
6 big Namenodes with 222 PB raw storage
Resource Manager shows 545 TB RAM
466 620 VCores
AM5 for backups
27. 2018 PLAN FOR A NEW CLUSTER (AM6)
Classify critical jobs
Identify critical data
Critical jobs must fit on one cluster
Critical data must be copied
Other jobs are distributed
Still being put into place
28. ANOTHER ROUND OF TENDERS
Need more CPU
Denser machines
4U, 8 nodes, 16 CPUs, 4 × 8 × 2.5" disks (8/U)
2U, 4 nodes, 8 CPUs, 6 × 4 × 2.5" disks (12/U)
Infrastructure validation
Hadoop tests and benchmarks
Added test machines to preprod
Statistics for best CPU
Disks can be added
Change disk to CPU & RAM ratio
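The two chassis options in the tender trade CPU density against disk density, and the disks-per-U figures quoted above follow directly:

```python
# Disks per rack unit for the two chassis in the tender.
# (rack units, nodes per chassis, 2.5" disks per node)
chassis = {
    "4U x 8 nodes": (4, 8, 4),   # 4 x 8 x 2.5" disks
    "2U x 4 nodes": (2, 4, 6),   # 6 x 4 x 2.5" disks
}
for name, (u, nodes, disks_per_node) in chassis.items():
    print(name, nodes * disks_per_node / u, "disks/U")  # 8.0 and 12.0
```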
30. STAYING ALIVE — TUNE HADOOP
Increase bandwidth and time limit for checkpoint
GC Tuning: Serial, Parallel, CMS or G1
Permanent (losing) battle
Tune Azul JVM for namenode
580 GiB heap
1 GC thread per 500 MB allocation/s
Easy and it works
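A commonly quoted rule of thumb is that each HDFS file-system object costs on the order of 150 bytes of namenode heap, which shows why heaps of this size are needed (a back-of-envelope sketch; the 150-byte figure is an approximation from Hadoop folklore, not a measurement on these clusters):

```python
# Rough namenode heap for ~650 M objects (the largest namespace
# quoted earlier), at ~150 bytes of heap per object.
BYTES_PER_OBJECT = 150          # rule-of-thumb approximation
objects = 649_734_268

heap_gib = objects * BYTES_PER_OBJECT / 2**30
print(f"~{heap_gib:.0f} GiB of live object data")  # before GC headroom, caches and RPC buffers
```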
31. STAYING ALIVE — FIX BUGS
The cluster crashes; find the bug; if it is already fixed upstream, backport the fix, else fix it ourselves
Fix
HDFS-10220 expired leases make namenode unresponsive and
failover
Backport
YARN-4041 Slow delegation token renewal prolongs RM
recovery
HDFS-9305 Delayed heartbeat processing causes storm of
heartbeats
YARN-4546 ResourceManager crash due to scheduling
opportunity overflow
HDFS-9906 Remove spammy log spew when a datanode is
restarted
32. STAYING ALIVE — MONITORING
HDFS
Namenode: missing blocks, GC, checkpoints, safemode, QPS,
live datanodes
Datanodes: disks, read/write throughput, space
YARN
Queue length, memory & CPU usage, job duration (scheduling
+ run time)
ResourceManager: QPS
Bad nodes
Probes to emulate client behavior with witness jobs
Zookeeper: availability, probes
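The namenode metrics above (missing blocks, safemode, live datanodes) are exposed by the JMX servlet every namenode serves over HTTP. A minimal probe sketch (the hostname is hypothetical; `MissingBlocks` and `UnderReplicatedBlocks` are standard FSNamesystem metric names):

```python
import json
from urllib.request import urlopen

FSNS_QUERY = "Hadoop:service=NameNode,name=FSNamesystem"

def parse_fsns(jmx_json):
    """Extract the alerting metrics from the JMX servlet's JSON reply."""
    bean = jmx_json["beans"][0]
    return {
        "missing_blocks": bean["MissingBlocks"],
        "under_replicated": bean["UnderReplicatedBlocks"],
    }

def namenode_health(host, port=50070):
    # 50070 is the CDH5-era default HTTP port; newer Hadoop uses 9870.
    url = f"http://{host}:{port}/jmx?qry={FSNS_QUERY}"
    with urlopen(url, timeout=10) as resp:
        return parse_fsns(json.load(resp))
```

A monitoring system can call `namenode_health` periodically and alert when `missing_blocks` is non-zero, alongside the witness jobs that exercise the client path end to end.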
33. WE HAVE
2 prod clusters
1 prod cluster under construction
2 pre-prod clusters
1 infrastructure cluster
> 4300 datanodes
273 PB raw disk
592 180 vcores
667 TiB RAM in RM
> 265 000 jobs/day
90 TB daily HDFS growth
15 PB read per day
3 PB written per day
34. UPCOMING CHALLENGES
Optimize and improve Hadoop
Install a new bare-metal data-centre
Schedule jobs over two clusters
Synchronise two HDFS clusters
Backup HDFS to MapR
We are hiring
Come and join us in Paris, Palo Alto or Ann Arbor (MI)
s.pook@criteo.com @StuartPook
Questions?