3. criteolabs.com/jobs
CRITEO ?
6 DATA CENTERS, 4 CONTINENTS.
120 BILLION REQUESTS/DAY*.
* EVERY DAY CRITEO IS CALLED MORE THAN 100 BILLION TIMES BY
ADVERTISERS AND PUBLISHERS
54 OPEN POSITIONS IN PARIS’ R&D
http://criteolabs.com/jobs
5. criteolabs.com/jobs
Usage of Hadoop is growing exponentially
• Learning curve is real
• Analysts discover interesting things with raw data
– Which causes them to ask more questions
• Increased insight leads to a better product
– Which leads to more data
• Data gains in value and more is kept (and studied!)
• YOU (the admin) are the bottleneck!
USAGE GROWTH
8. criteolabs.com/jobs
Rack and load!
• Machine is racked, cabled and provisioned for a role
• Chef is our one-stop shop for automation
• Diskless system install
AUTOMATING DEPLOYMENTS
INSTA-
CLUSTER!
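The "role" a machine is provisioned for can be pictured as a Chef role. Everything below (role name, recipe names) is a hypothetical sketch, not Criteo's actual cookbooks:

```json
{
  "name": "hadoop_datanode",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "description": "Diskless Hadoop worker: inventory, firmware, then services",
  "run_list": [
    "recipe[inventory]",
    "recipe[firmware_updates]",
    "recipe[hadoop::datanode]"
  ]
}
```

A freshly racked host PXE-boots, registers with the Chef server, and converges on whatever run list its role dictates, which is what makes the "insta-cluster" effect possible.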
9. criteolabs.com/jobs
• Learn from the past
• Previous cluster 1.5 years operation
• 78% failure rate on /dev/sda at restart
• Disk usage symmetry
• Guaranteed statelessness
OS DISKLESS: WHY
10. criteolabs.com/jobs
• PXE boot on a custom CentOS image
• Automated Chef bootstrap
• Everything done by Chef
– Inventory
– Firmware updates
– OS / Service deployment
OS DISKLESS: HOW
11. criteolabs.com/jobs
• Evolutive maintenance (version bumps)
• Not much to do on normal ops
• Most frequent issue is a flaky / slow-performing host
• Use preprod / prod for infra changes
• Progressive rollout vs. blackout
MAINTENANCE
14. criteolabs.com/jobs
• Hadoop is a DDoS to your infrastructure
– Increase ARP retention (L2-specific)
– Use nscd
• Increase read-ahead
• Disable THP compaction
• MTU jumbo frames
SYSTEM CONFIGS
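The knobs above roughly translate to host-level settings like the following. This is a hedged sketch for a CentOS 6-era data node; device names, thresholds and sysctl values are illustrative, not Criteo's exact configuration:

```shell
# Raise ARP table thresholds so entries for the whole cluster
# stay cached on a flat L2 network
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

# Cache DNS / passwd / group lookups locally instead of
# hammering infra services on every task launch
chkconfig nscd on && service nscd start

# Larger read-ahead on a data disk (value is in 512-byte sectors)
blockdev --setra 8192 /dev/sdb

# Disable THP defrag/compaction, a known latency source under Hadoop
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Jumbo frames on the data interface
ip link set dev eth0 mtu 9000
```

The THP sysfs path varies by kernel build (RHEL 6 ships it under `redhat_transparent_hugepage`), and the `blockdev` and `ip link` changes are not persistent, so in practice all of this belongs in configuration management rather than typed by hand.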
18. criteolabs.com/jobs
• One datacenter topology will not fit all
• Web traffic vs. Hadoop traffic
• Historical fat-tree hierarchy with layer 2 routing
• Switched to a meshed design (soon layer 3)
NETWORK TOPOLOGY
22. criteolabs.com/jobs
• Dedicated KDC / realm
• Dedicated service principals
• Cross-realm trusts
• Delegate user management to your IT
KERBEROS SETUP
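A dedicated cluster realm trusting the corporate realm boils down to a few `krb5.conf` entries. The realm and host names below are placeholders, not the actual deployment:

```ini
# Hypothetical setup: HADOOP.EXAMPLE.COM is the dedicated cluster
# realm; CORP.EXAMPLE.COM holds the users and is managed by IT.
[realms]
  HADOOP.EXAMPLE.COM = { kdc = kdc.hadoop.example.com }
  CORP.EXAMPLE.COM   = { kdc = kdc.corp.example.com }

[capaths]
  # Clients from CORP authenticate directly ('.') to HADOOP services
  CORP.EXAMPLE.COM = {
    HADOOP.EXAMPLE.COM = .
  }
```

The trust itself also requires matching `krbtgt/HADOOP.EXAMPLE.COM@CORP.EXAMPLE.COM` principals (same keys and kvno) registered in both KDCs, plus `hadoop.security.auth_to_local` rules so corporate principals map onto local usernames.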
23. criteolabs.com/jobs
• Use multiple proxies
• Easy way to interconnect with the outside world
• Data injection / reads with a simple curl
• High bandwidth transfers
HTTPFS PROXIES
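Talking to an HttpFS proxy is plain WebHDFS REST, which is why a simple curl is enough. A minimal stdlib sketch, with a made-up proxy host and paths:

```python
import urllib.request

HTTPFS = "http://httpfs-proxy.example.com:14000"  # hypothetical proxy host

def webhdfs_url(path, op, user):
    """Build a WebHDFS REST URL as served by an HttpFS proxy."""
    return f"{HTTPFS}/webhdfs/v1{path}?op={op}&user.name={user}"

def hdfs_read(path, user="analyst"):
    """GET a file's contents through the proxy (op=OPEN)."""
    with urllib.request.urlopen(webhdfs_url(path, "OPEN", user)) as resp:
        return resp.read()

def hdfs_write(path, data, user="analyst"):
    """PUT a new file through the proxy (op=CREATE). With HttpFS the
    proxy streams the payload itself; there is no redirect to a
    datanode as with direct WebHDFS, which is what makes it a clean
    interconnection point with the outside world."""
    req = urllib.request.Request(
        webhdfs_url(path, "CREATE", user) + "&data=true",
        data=data, method="PUT",
        headers={"Content-Type": "application/octet-stream"})
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The curl equivalent of `hdfs_read` is simply `curl "http://httpfs-proxy.example.com:14000/webhdfs/v1/some/file?op=OPEN&user.name=analyst"`, and running several proxies behind round-robin DNS or a load balancer is what gives the high aggregate bandwidth.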
24. criteolabs.com/jobs
• Multiple use cases (ML, BI analytics)
• Baseline JSON (+gzip) is OK
• Don’t optimize too early
• We still use it(*) at petabyte scale
(*) some teams also use Parquet and contributed to Hive integration
FILE FORMATS
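The baseline format is nothing more than one JSON record per line, gzip-compressed. A sketch of a writer/reader pair (the field names are invented):

```python
import gzip
import json

def write_events(path, events):
    """Write events as gzipped JSON Lines: the 'don't optimize too
    early' baseline that stays grep-able and language-agnostic."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

def read_events(path):
    """Stream events back; each line is one self-describing record,
    so a schema change is just a new optional field."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

The trade-off is real: gzip keeps the files small but a gzipped file is not splittable, so sizing files to roughly one map task each matters; columnar formats such as Parquet only pay off once queries read a few columns out of many.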
26. criteolabs.com/jobs
Did I say we’re hiring?
We’re hiring lots of engineers in 2014. Come join us!
http://criteolabs.com/jobs
MY FELLOW CRITEOS WOULD KILL ME…
Editor’s Notes
http://www.shutterstock.com/pic.mhtml?id=95662684 Who are we? * Serving the right ad… * Slide was imposed by Marketing. You will probably encounter the cloud-versus-in-house dilemma. The key factor is the elastic aspect; we use our cluster 100% of the time; we already have DCs; in-house was less expensive.
This is the story of a growing and successful startup using Hadoop. Growing means increased volume. Successful means buckloads of cash to grow the infrastructure. Startup means very small teams to manage the whole thing. A PoC is easy; when you gain traction, everything will go fast. Went from 12 nodes to 150 two years ago, to 600 today, above 1,000 by the end of the year. Why is it growing that fast? Virtuous circle: various teams are gathering skills. BI analysts: the more they get, the more they want. Hadoop shows mutualization benefits, a platform to consolidate ad-hoc data processing tools. Your business will boom thanks to Hadoop adoption.
Because you need to scale the infrastructure: automate operations (prod vs. devops), tune the Hadoop system (hardware, Linux, Hadoop itself), specifically the network. This is about scaling the infrastructure. With hundreds of clients using Hadoop as a service, you also need to scale infrastructure usage, for instance multi-tenancy: managing resource contention (MapReduce, storage), maintaining security (user sandboxing through authorization & authentication), allowing Hadoop to be used as a service.
Don’t do anything by hand; you will hurt yourself managing thousands of servers. Build things once, run forever. The choice of freedom: don’t be bound to a specific vendor; e.g. we use CDH 4.5.0 right now, but could, and probably will, switch to HDP. Full-stack automation: from bare metal to live service.
Our clusters are turnkey-deployed once hosting and network have finished their work. We assign every node a role, and the hosts will boot and set themselves up accordingly.
Why a diskless system? You want maximum storage density, therefore you want to fill those 14 slots per server with 3 TB drives, and therefore you will break a nice symmetry if you run the OS from disk. Not theoretical: hands-on experience on a 150-node cluster operated for 1.5 years. Hard constraint, but very worthwhile: remote logging compulsory, nothing hidden from the automation system. Costs 2 GB of RAM per node (2% of RAM).
How do we achieve this? Minimize the size of the diskless image. Boot Chef as soon as you can, and let it flow from there. The initial Chef role is an inventory role. Chef is used for management of updates, the OS, and service deployment.
Maintenance: * Evolutive: upgrading your distribution regularly (don’t want to lag behind). * Corrective: Hadoop works better when you just don’t touch it. * How: everything is tested on a PREPROD environment. Progressive deployment (rolling out node by node) may be disruptive for long-running jobs.
http://blog.cloudera.com/blog/2014/03/a-guide-to-checkpointing-in-hadoop/ Monitor user-facing interfaces: users frequently equate the cluster’s condition with the JT’s or NN’s GUI. Monitor your JobTracker (MRv1 will eventually get stuck). MOST IMPORTANT OF ALL: CHECKPOINTING. Monitor the checkpoints of your fsimage or you will end up with a NameNode in really bad shape. At one point we were carrying nearly 6 months of edits ;) 12 hours to start a NN; hence the urban legend of the NN being unsafe to restart. Monitor HDFS disk usage and local disk usage.
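One cheap safeguard against the "6 months of edits" story above is to alert when the newest fsimage in the NameNode’s name directory gets too old. A hedged sketch; the `current/fsimage_<txid>` layout matches Hadoop 2.x-era name directories, and the 12-hour threshold is an example:

```python
import glob
import os
import time

def fsimage_age_hours(name_dir):
    """Age in hours of the newest fsimage checkpoint in an NN name
    directory. Returns None if no fsimage exists at all (also worth
    alerting on)."""
    images = glob.glob(os.path.join(name_dir, "current", "fsimage_*"))
    images = [p for p in images if not p.endswith(".md5")]
    if not images:
        return None
    newest = max(os.path.getmtime(p) for p in images)
    return (time.time() - newest) / 3600.0

def checkpoint_ok(name_dir, max_age_hours=12):
    """Fail loudly long before months of edits pile up; with a healthy
    secondary/standby NN, checkpoints should land at least hourly."""
    age = fsimage_age_hours(name_dir)
    return age is not None and age <= max_age_hours
```

Wired into any monitoring system, a failing `checkpoint_ok` turns a silent checkpointing breakage into a page instead of a 12-hour NameNode restart.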
10:00 http://www.flickr.com/photos/76588645
In a real-world system most of your tasks will be IO-bound. Read-ahead! Very important. When you hit a performance bottleneck, the first thing to watch for is *outside* Hadoop, because Hadoop is a DoS to your whole infrastructure. Use infrastructure-local caches as much as you can.
Default parameters are usable for small clusters / small nodes. These are examples; we had to tune a significant part of them. Detailed list of significant ones + explanations.
Default parameters are usable for small clusters / small nodes. These are examples; we had to tune a significant part of them. Log settings: they will kill your JT / NN. Handler counts. Separate the thread pool for internal / external clients; also easier for firewalling. HA has some downsides (checkpointing).
One of the first things that you will get when you move your Hadoop cluster past a rack is your network engineers yelling at you. Plan ahead your network topology!
Fat tree is suited to North/South traffic. Hadoop uses the network as a bus: East/West. Layer 2: FabricPath/TRILL. Layer 3: BGP.
Sounds obvious, but implementing a correct definition of the rack topology with regard to the network is very important. LLDP information flaps depending on which interface we ask (4-interface bonding).
20:00 Hadoop is a shared resource. When your usage grows you will face resource starvation and contention. This will lead to two problems: 1) Accountability: you will need to report resource accountability numbers to plan for growth and to optimize. 2) Maintaining a good user experience. HDFS quotas help, but have bugs in fsimage checkpointing. Scheduling is a user-facing problem; it requires education to understand the time/space folding and to achieve well-designed jobs (mapper ~ 3 to 15 minutes). YARN solves the map/reduce slot ratio (+20% computing power).
Once you realize your company’s most critical data has landed on Hadoop, and your only security model is obscurity, you will want to switch to a built-in and robust security model.
Very good documentation from Cloudera and Hortonworks. Not sufficient though: ironing out the problems (SPNEGO) needs close integration with IT.
Hadoop limitation: no POSIX-level access to HDFS. HTTPFS works around the absence of a scalable one (and works with Kerberos, too!). In our systems, HDFS completely replaced Isilon. Streaming a sustained 200-400 MB/s of logs into the cluster. Don’t create bottlenecks; address the connectivity with a many-to-many pattern.