March 2011 HUG: Scaling Hadoop
Scaling Hadoop Applications Milind Bhandarkar Distributed Data Systems Engineer [mbhandarkar@linkedin.com] [twitter: @techmilind] http://www.flickr.com/photos/theplanetdotcom/4878812549/
About Me http://www.linkedin.com/in/milindb Founding member of Hadoop team at Yahoo! [2005-2010] Contributor to Apache Hadoop & Pig since version 0.1 Built and Led Grid Solutions Team at Yahoo! [2007-2010] Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, and LinkedIn http://www.flickr.com/photos/tmartin/32010732/
Outline Scalability of Applications Causes of Sublinear Scalability Best practices Q & A http://www.flickr.com/photos/86798786@N00/2202278303/
Importance of Scaling Zynga’s Cityville: 0 to 100 Million users in 43 days! (http://venturebeat.com/2011/01/14/zyngas-cityville-grows-to-100-million-users-in-43-days/) Facebook in 2009: From 150 Million users to 350 Million! (http://www.facebook.com/press/info.php?timeline) LinkedIn in 2010: Adding a member per second! (http://money.cnn.com/2010/11/17/technology/linkedin_web2/index.htm) Public WWW size: 26M in 1998, 1B in 2000, 1T in 2008 (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html) Scalability allows dealing with success gracefully! http://www.flickr.com/photos/palmdiscipline/2784992768/
Explosion of Data (Data-Rich Computing theme proposal, J. Campbell et al., 2007) http://www.flickr.com/photos/28634332@N05/4128229979/
Hadoop Simplifies building parallel applications Programmer specifies Record-level and Group-level sequential computation Framework handles partitioning, scheduling, dispatch, execution, communication, failure handling, monitoring, reporting etc. User provides hints to control parallel execution
Scalability of Parallel Programs If one node processes k MB/s, then N nodes should process (k*N) MB/s If some fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes Linear Scalability http://www-03.ibm.com/innovation/us/watson/
Reduce Latency Minimize program execution time http://www.flickr.com/photos/adhe55/2560649325/
Increase Throughput Maximize data processed per unit time http://www.flickr.com/photos/mikebaird/3898801499/
Three Equations http://www.flickr.com/photos/ucumari/321343385/
Amdahl’s Law http://www.computer.org/portal/web/awards/amdahl
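The Amdahl's Law slide survives only as an image credit in this transcript; for reference, the standard statement of the law, with α as the sequential fraction (the notation used in the speaker notes below), is:

```latex
% Amdahl's Law (reconstruction; the original slide is an image):
% speedup on N workers when a fraction \alpha of the work is inherently sequential.
S(N) = \frac{1}{\alpha + \dfrac{1-\alpha}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{\alpha}
% e.g. \alpha = 0.05 (5% sequential) caps the speedup at 1/0.05 = 20,
% the figure quoted in speaker note 8.
```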
Multi-Phase Computations If computation C is split into N different parts, C1..CN If partial computation Ci can be sped up by a factor of Si http://www.computer.org/portal/web/awards/amdahl
Amdahl’s Law, Restated http://www.computer.org/portal/web/awards/amdahl
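The restated, multi-phase form shown on this slide is likewise an image; a plausible reconstruction using the notation of the previous slide (part Ci taking time Ti and sped up by a factor Si) is:

```latex
% Multi-phase restatement (reconstruction): if part C_i takes time T_i when run
% sequentially and is sped up by a factor S_i, the overall speedup is
S = \frac{\sum_{i=1}^{N} T_i}{\sum_{i=1}^{N} T_i / S_i}
% Any part with S_i = 1 (job submission, task startup, cleanup) bounds the whole job,
% which is what the MapReduce workflow slide that follows illustrates.
```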
MapReduce Workflow http://www.computer.org/portal/web/awards/amdahl
Little’s Law http://sloancf.mit.edu/photo/facstaff/little.gif
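Little's Law itself (the slide is an image; this is the standard statement, with the example from the speaker notes):

```latex
% Little's Law: the average number of items in a system (L) equals the arrival
% rate (\lambda) times the average time each item spends in the system (W).
L = \lambda W
% Speaker-note example: 1000 students/year arriving, 4 years each => 4000 enrolled.
% For Hadoop: with a fixed number of task slots, L is fixed, so job arrival rate
% and per-task time trade off against each other.
```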
Message Cost Model http://www.flickr.com/photos/mattimattila/4485665105/
Message Granularity For Gigabit Ethernet: α = 300 μs, β = 100 MB/s; 100 messages of 10 KB each = 40 ms; 10 messages of 100 KB each = 13 ms http://www.flickr.com/photos/philmciver/982442098/
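The α–β cost model behind these numbers, written out (a reconstruction of the formula the slide applies, with the worked arithmetic for both cases):

```latex
% Alpha-beta cost of sending m messages of n bytes each (reconstruction):
T(m, n) = m \left( \alpha + \frac{n}{\beta} \right)
% With \alpha = 300\,\mu\mathrm{s} and \beta = 100\,\mathrm{MB/s}:
%   100 \times (0.3\,\mathrm{ms} + 10\,\mathrm{KB}/100\,\mathrm{MB/s}) = 100 \times 0.4\,\mathrm{ms} = 40\,\mathrm{ms}
%   10  \times (0.3\,\mathrm{ms} + 100\,\mathrm{KB}/100\,\mathrm{MB/s}) = 10 \times 1.3\,\mathrm{ms} = 13\,\mathrm{ms}
```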
Alpha-Beta Common Mistake: Assuming that α is constant Scheduling latency for responder Jobtracker/Tasktracker time slice inversely proportional to number of concurrent tasks Common Mistake: Assuming that β is constant Network congestion TCP incast
Causes of Sublinear Scalability http://www.flickr.com/photos/sabertasche2/2609405532/
Sequential Bottlenecks Mistake: Single reducer (default) mapred.reduce.tasks Parallel directive Mistake: Constant number of reducers independent of data size default_parallel http://www.flickr.com/photos/bogenfreund/556656621/
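A minimal sketch of sizing the reducer count from the input data rather than leaving the single-reducer default or a fixed constant. It assumes the old org.apache.hadoop.mapred API that a 2011-era deck would target; the 1 GB-per-reducer target is an illustrative assumption, not a Hadoop default:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ReducerSizing {
    // Scale reducers with input size instead of hard-coding a constant.
    static void configureReducers(JobConf conf, Path input) throws IOException {
        long inputBytes = FileSystem.get(conf).getContentSummary(input).getLength();
        int reducers = (int) Math.max(1, inputBytes / (1L << 30)); // ~1 GB per reducer (assumption)
        conf.setNumReduceTasks(reducers);   // same knob as -D mapred.reduce.tasks=N
    }
}
```

In Pig, the per-operator PARALLEL clause plays the same role as setting the reducer count here, which is why a single global default_parallel is criticized on this slide.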
Load Imbalance Unequal Partitioning of Work Imbalance = Max Partition Size – Min Partition Size Heterogeneous workers Example: Disk Bandwidth Empty Disk: 100 MB/s 75% Full disk: 25 MB/s  http://www.flickr.com/photos/kelpatolog/176424797/
Over-Partitioning For N workers, choose M partitions, M >> N Adaptive Scheduling, each worker chooses the next unscheduled partition Load Imbalance = Max(Sum(Wk)) – Min(Sum(Wk)) For sufficiently large M, both Max and Min tend to Average(Wk), thus reducing load imbalance http://www.flickr.com/photos/infidelic/3238637031/
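A toy simulation (not Hadoop code) of why M >> N combined with adaptive scheduling shrinks the imbalance; the partition weights and the 10x over-partitioning factor are arbitrary assumptions for illustration:

```java
import java.util.Arrays;
import java.util.Random;

public class OverPartitioningDemo {
    public static void main(String[] args) {
        int workers = 10;
        Random rnd = new Random(42);
        for (int m : new int[] {10, 100}) {              // M = N vs. M = 10 * N
            double[] load = new double[workers];
            for (int p = 0; p < m; p++) {
                double w = 0.5 + rnd.nextDouble();       // unequal partition weights
                int free = 0;                            // least-loaded worker grabs the next
                for (int k = 1; k < workers; k++)        // partition, mimicking workers that
                    if (load[k] < load[free]) free = k;  // pull new work as they finish
                load[free] += w;
            }
            double max = Arrays.stream(load).max().getAsDouble();
            double min = Arrays.stream(load).min().getAsDouble();
            double avg = Arrays.stream(load).average().getAsDouble();
            System.out.printf("M=%3d  relative imbalance (max-min)/avg = %.2f%n", m, (max - min) / avg);
        }
    }
}
```

With M equal to N every worker is stuck with exactly one partition, so the spread of partition sizes shows up directly; with M much larger than N the per-worker totals converge toward the average.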
Critical Paths Maximum distance between start and end nodes Long tail in Map task executions Enable Speculative Execution Overlap communication and computation Asynchronous notifications Example: deleting intermediate outputs after job completion http://www.flickr.com/photos/cdevers/4619306252/
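A sketch of turning on speculative execution with the 0.20-era JobConf API (equivalently, set mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution in the job configuration):

```java
import org.apache.hadoop.mapred.JobConf;

public class SpeculationConfig {
    // Schedule redundant attempts for straggler tasks so that one slow node
    // does not stretch the critical path of the whole job.
    static void enableSpeculation(JobConf conf) {
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);
    }
}
```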
Algorithmic Overheads Repeating computation in place of adding communication Parallel exploration of the solution space results in wasted computation Need for a coordinator Share partial, unmaterialized results Consider memory-based small replicated K-V stores with eventual consistency http://www.flickr.com/photos/sbprzd/140903299/
Synchronization Shuffle stage in Hadoop, M*R transfers Slowest system components involved: Network, Disk Can you eliminate the Reduce stage? Use combiners? Compress intermediate output? Launch reduce tasks before all maps finish http://www.flickr.com/photos/susan_d_p/2098849477/
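A sketch of the three shuffle-side mitigations named above, again with the old mapred API; LongSumReducer stands in for a combiner-safe reducer, and the slowstart property name is the 0.20-era one:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class ShuffleTuning {
    static void tuneShuffle(JobConf conf) {
        // 1. Partial aggregation on the map side shrinks the M*R transfer volume.
        //    (LongSumReducer assumes LongWritable count values; substitute your own combiner.)
        conf.setCombinerClass(LongSumReducer.class);
        // 2. Compress intermediate map output before it hits disk and network.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
        // 3. Start reducers once 50% of maps are done, overlapping shuffle with map execution.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.50f);
    }
}
```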
Task Granularity Amortize task scheduling and launching overheads Centralized scheduler Out-of-band Tasktracker heartbeat reduces latency, but May overwhelm the scheduler Increase task granularity Combine multiple small files Maximize split sizes Combine computations per datum http://www.flickr.com/photos/philmciver/982442098/
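One way to coarsen task granularity with the 0.20-era configuration knobs; the 1 GB minimum split size is an illustrative assumption, not a recommended constant:

```java
import org.apache.hadoop.mapred.JobConf;

public class SplitSizing {
    // Fewer, larger splits amortize the per-task scheduling and JVM launch cost.
    static void coarsenSplits(JobConf conf) {
        conf.setLong("mapred.min.split.size", 1L << 30);   // ~1 GB lower bound per split
        // For directories full of small files, CombineFileInputFormat (or coalescing the
        // files into larger ones up front) packs many files into a single split.
    }
}
```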
Hadoop Best Practices Use higher-level languages (e.g. Pig) Coalesce small files into larger ones & use bigger HDFS block size Tune buffer sizes for tasks to avoid spill to disk Consume only one slot per task Use combiner if local aggregation reduces size http://www.flickr.com/photos/yaffamedia/1388280476/
Hadoop Best Practices Use compression everywhere CPU-efficient for intermediate data Disk-efficient for output data Use Distributed Cache to distribute small side-files (much smaller than the input split size) Minimize number of NameNode/JobTracker RPCs from tasks http://www.flickr.com/photos/yaffamedia/1388280476/
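A sketch of the side-file and output-compression practices with the Hadoop 0.20 APIs; the lookup-table path is hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class BestPracticesConfig {
    static void configure(JobConf conf) throws Exception {
        // Ship a small side-file to every task node instead of re-reading it from HDFS per task.
        DistributedCache.addCacheFile(new URI("/shared/lookup-table.dat"), conf); // hypothetical path
        // Compress the final output to cut reducer write time and storage.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    }
}
```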
Questions? http://www.flickr.com/photos/horiavarlan/4273168957/

Speaker Notes

  1. So, why is scalability important? We know that we need to deal with failures. Our product needs to function even if there are hardware failures! But more importantly, it also needs to function when the product is highly successful! Designing the application for scalability allows us to deal with the success of our product gracefully.
  2. Since data is at the heart of all these products, and the products that we design now and will design in the future, and since the amount of data available to us is exploding, we need to make sure that our product handles this exploding data well. And that’s where Hadoop really shines.
  3. What is different between Hadoop and other traditional paradigms and frameworks is that Hadoop makes it deceptively simple to write these big-data applications. All that the programmer has to do is specify a couple of sequential computations, and Hadoop takes care of the rest of the messy parallel execution details. So, while developing Hadoop applications, we sometimes tend to forget that we are writing a parallel application. But, believe me, if one forgets that this application is a parallel program, it will bite us in the end. So, when thinking about scalability of Hadoop applications, we need to think about scalability of parallel programs.
  4. There are two ways of thinking about scalability of parallel programs. One, when we have more computational resources at hand, we should be able to process proportionately bigger data. The second way is, when we are dealing with fixed sized data, we should be able to process that data in shorter amount of time. When the speedup we get is proportional to the number of computational resources we throw at the application, it is called linear scalability.
  5. These two definitions of linear scalability correspond to two goals of scalability. First is to reduce latency, by minimizing program execution time. That way, data can be processed quickly, and we can do something else with the resources available.
  6. The second goal is to be able to process more and more data with additional machines, without having to rewrite our application. So, for reasoning about scalability for satisfying both these goals, all we need to know is..
  7. The three equations. And since I don’t have much to say on this slide, I hope you like my choice of three cute little lion cubs here 
  8. The first equation is from 1967, by Gene Amdahl, who was at IBM at the time. What this law tells you is that if alpha is the fraction of sequential computation in your code, then you can never achieve a speedup beyond one divided by alpha. So, if 5% of your code has to execute sequentially, then you can never speed up beyond a factor of 20.
  9. In case of real parallel applications, of course where there are multiple phases, and each phase can use multiple computational resources, then Amdahl’s law can be restated as…
  10. This. And in case you start thinking about how or why this relates to Hadoop, …
  11. Here is the graphical depiction of a typical map-reduce application. Note the sequential parts of the computation that happen in the job client, the startup task, and the cleanup task. Also note that the map phase and the reduce phase typically have different amounts of parallelism.
  12. The second equation to know is called Little’s Law. What this law tells us is that the total number of occupants in a system is the product of their arrival rate and the amount of time they spend there. So, if a university accepts 1000 students every year, and every student spends 4 years at the university on average, then at any time there are 4000 students at the university. In the case of Hadoop, we have to look at it the other way. We have a fixed number of tasks that could be running at any time within the cluster. Now we have to decide whether our job arrival rate is adequate given the amount of time each task takes.
  13. The third equation one has to know is the data transfer cost model. Every message you send from one machine to another has some fixed cost, which we call the latency of the message, alpha, and then a per-byte cost determined by the bandwidth of the channel, the beta.
  14. To put it concretely, for a Gigabit Ethernet NIC, the latency is about 300 microseconds, and the bandwidth we expect for the real payload is about 100 megabytes per second. So, now if you have to transfer 1 MB between two machines connected by a gigabit NIC, and you choose to send it in 100 messages of 10 KB each, it will take you 40 milliseconds. But if you choose to club these messages together and send them as 10 messages of 100 kilobytes each, it will cost you substantially less!
  15. One thing to note, though: these alphas and betas are very runtime-dependent. A common mistake is to assume that they are constant. When a reduce task starts fetching map output, it has to contact the tasktracker and request the map output. The tasktracker has to get scheduled with a context switch. The amount of time it takes for the tasktracker to get scheduled depends on what else is happening on that machine. With several tasks executing, the tasktracker gets a very small slice of time, and that may result in latencies of several milliseconds. Of course, the bandwidth is divided among the concurrent data transfers, so even beta is not constant.
  16. So, with these three equations as our tools, we try to analyse the causes of sublinear scalability of hadoop applications.
  17. The first, of course, is sequential bottlenecks. In multi-phase parallel applications, the amount of time spent in low-width phases reduces scalability. So, the most common mistake is to use a single reducer. Unfortunately, that’s the default setting in Hadoop. So, make sure that you have reasoned about it well, and have chosen an appropriate number of reducers by specifying this config variable, or by using a PARALLEL directive in Pig. Another common mistake is choosing the same number of reducers no matter what the amount of data being handled or computations being performed by the reducers. In my opinion, the directive default_parallel is a mistake in Pig.
  18. The second most common cause of sublinear scalability in Hadoop is load imbalance. Load imbalance results from unequal partitioning of the work to be done in parallel. The imbalance can be characterized by the maximum amount of work done in a partition minus the minimum amount of work in a partition. Even if one tries to partition the work equally, the computational resources at hand may have different performance. What’s worse, that performance is heavily dependent on the state of the resource, and often degrades over time. When the disk gets full, the performance decreases.
  19. A common way to deal with this problem is by over-partitioning. So, if you have N worker nodes, choose M partitions such that M is a lot bigger than N, about 5 to 10 times bigger. Hadoop uses adaptive scheduling: when a worker is free, it requests new work. This means that the load imbalance inherent in the application or the hardware evens out. Each worker processes some chunk of these available partitions, so the max and min tend toward the average, as you have more partitions than machines.
  20. A common problem we see in Hadoop applications is the amount of work that needs to be done on the critical path of an application. If you express the computation as a directed acyclic graph, then the critical path is the longest path in that graph. In Hadoop it rears its ugly head as one task that executes much longer than the rest. Make sure you use speculative execution to schedule multiple redundant attempts for such a task. In the framework itself, care has been taken to reduce this critical path as much as possible. If you have a graph where each node is a map-reduce job, make sure that all the things you are doing that are not really needed on the critical path get done asynchronously.
  21. Of course, when we convert a sequential algorithm to work in parallel, we incur some algorithmic overhead. (If not, we could have just taken the efficient parallel algorithm and emulated it sequentially.) Sometimes we need to explore solutions to our problem in parallel that a sequential algorithm would have skipped as wasted work. Try to minimize such wastage. Sometimes having a central coordinator helps, since it knows whether a given computation has already been performed by someone else. When you need to share partial results among other pieces of the computation, you need to keep them around for quick lookup. Consider using small, fast key-value stores for this.
  22. Fortunately, there is exactly one synchronization step in a Hadoop job: the shuffle stage. Unfortunately, it uses the slowest system components, that is, the network and the disk. So, all your focus should be on speeding up the shuffle stage. If you can avoid the shuffle stage altogether, that’s the best. But if you cannot, then at least try to reduce the time taken by this shuffle stage. Use combiners to partially aggregate the intermediate results and reduce the data sizes, and definitely try to compress the intermediate outputs. One can achieve significant computation-communication overlap in this stage by launching reduce tasks before all the map tasks finish, so that while some map tasks are still executing, the reduce tasks can fetch the outputs from the finished map tasks.
  23. The message cost model of fixed latency and per-byte bandwidth also applies to task scheduling. Task launching overheads are significant: the tasktracker has to ask for the next task, the jobtracker and the scheduler have to do some work to determine which task to hand over to the tasktracker, and the tasktracker has to launch the task and clean up after it’s done. In order to amortize this overhead, the task needs to run for some time and process some data. For this purpose, the amount of data each task processes is the best way to control the task granularity. Try combining multiple small files into a single split, and try increasing the DFS block size to maximize split sizes. If you are performing multiple computations per key-value pair in the data, try combining them into a single task as much as possible. Most high-level frameworks such as Pig already do this for you. So, moving on to the Hadoop best practices…
  24. Try using high-level languages such as Pig as much as possible. Try to maximize the data processed per task (but keep enough partitions to have adequate parallelism and avoid load imbalance). Make sure that you are not hitting the disk frequently, by tuning the in-memory buffer sizes. The capacity scheduler allows you to occupy more than one task slot per task; please avoid that, since it results in a lot of wasted resources. Try to reduce the intermediate shuffle size by using combiners and doing local partial aggregation.
  25. Use compression on the intermediate data. It speeds up the shuffle stage. Make sure you compress output data as well, since it improves the write speed of reduces. When you are using “secondary inputs”, please use the distributed cache to distribute them to individual nodes, and make sure that the secondary inputs are much smaller than the primary map inputs. If you are trying to store per-job state using counters etc., make sure that you hand them over to the tasktracker as counters, and do not make calls directly to the jobtracker. And since the namenode is a singleton, it becomes a sequential bottleneck when called frequently from tasks; make sure you don’t do that.