Hadoop – Large scale data analysis
Abhijit Sharma
9/8/2011

Big Data Trends
• Unprecedented growth in
  • Data set size: Facebook 21+ PB data warehouse, 12+ TB/day
  • Unstructured and semi-structured data: logs, documents, graphs
  • Connected data: web, tags, graphs
• Relevant to enterprises: logs, social media, machine-generated data, breaking of silos

Putting Big Data to work
• Data-driven organization: decision support, new offerings
• Analytics on large data sets (FB Insights: Page, App etc. stats)
• Data mining: clustering, e.g. Google News articles
• Search: Google

Problem characteristics and examples
• Embarrassingly data-parallel problems
• Data chunked & distributed across the cluster
• Parallel processing with data locality: the task is dispatched to where the data is
• Horizontal/linear scaling approach using commodity hardware
• Write once, read many
• Examples
  • Distributed logs: grep, # of accesses per URL
  • Search: term vector generation, reverse links

What is Hadoop?
• Open source system for large-scale batch distributed computing on big data
  • Map Reduce programming paradigm & framework
  • Map Reduce infrastructure
  • Distributed File System (HDFS)
• Endorsed/used extensively by web giants: Google, FB, Yahoo!

Map Reduce - Definition
• MapReduce is a programming model and an implementation for parallel processing of large data sets
• Map processes each logical record per input split to generate a set of intermediate key/value pairs
• Reduce merges all intermediate values associated with the same intermediate key

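The definition above can be sketched as a tiny single-process driver. This is only an illustration of the model, not Hadoop's implementation: `run_mapreduce`, `length_mapper` and `count_reducer` are hypothetical names, and the in-memory dictionary stands in for the distributed shuffle that a real framework performs.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Stand-in for the shuffle: collect values under their key.
            intermediate[key].append(value)
    # Reduce phase: merge all values that share the same key.
    return {key: reducer(key, values) for key, values in sorted(intermediate.items())}


def length_mapper(word):
    # Emit (word length, word) as the intermediate pair.
    yield len(word), word


def count_reducer(key, values):
    # Merge the values for one key into a single count.
    return len(values)


result = run_mapreduce(["map", "reduce", "key", "value", "split"],
                       length_mapper, count_reducer)
print(result)  # {3: 2, 5: 2, 6: 1}
```

Because each call to the mapper depends only on its own record, the map phase could be spread over many machines; only the grouping step needs to bring values for the same key together.
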
Map Reduce - Functional Programming Origins
• Map: apply a function to each list member (parallelizable)
    [1, 2, 3].collect { it * it }
    Output: [1, 2, 3] -> Map (Square) : [1, 4, 9]
• Reduce: apply a function and an accumulator to each list member
    [1, 2, 3].inject(0) { sum, item -> sum + item }
    Output: [1, 2, 3] -> Reduce (Sum) : 6
• Map & Reduce
    [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
    Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) : 14

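For readers who do not use Groovy, the same three steps can be written in Python with the built-in `map` and `functools.reduce`. This is a direct translation for illustration; the variable names are ours.

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to each list member.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into one value, starting from the accumulator 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map & Reduce: square each member, then sum the squares.
sum_of_squares = reduce(lambda acc, item: acc + item,
                        map(lambda x: x * x, numbers), 0)

print(squares, total, sum_of_squares)  # [1, 4, 9] 6 14
```
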
Word Count - Shell
    cat *  | grep | sort           | uniq -c
    input  | map  | shuffle & sort | reduce

Word Count - Map Reduce (image-only slide)

Word Count - Pseudo code
    mapper (filename, file-contents):
      for each word in file-contents:
        emit (word, 1)   // single count for a word, e.g. ("the", 1) for each occurrence of "the"

    reducer (word, Iterator values):   // Iterator for the list of counts for a word, e.g. ("the", [1, 1, ..])
      sum = 0
      for each value in values:
        sum = sum + value
      emit (word, sum)

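The pseudo code can be turned into a small runnable sketch. It runs in one process and groups the intermediate pairs in a dictionary, which is an assumption made for illustration; in Hadoop the framework does the grouping between the map and reduce phases. The `word_count` driver and the sample file names are hypothetical.

```python
from collections import defaultdict


def mapper(filename, file_contents):
    # Emit (word, 1) for every occurrence of a word.
    for word in file_contents.split():
        yield word, 1


def reducer(word, values):
    # Sum the list of counts collected for one word.
    return word, sum(values)


def word_count(files):
    """Hypothetical driver: files maps a file name to its text."""
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)  # stand-in for shuffle & sort
    return dict(reducer(word, values) for word, values in grouped.items())


files = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = word_count(files)
print(counts["the"])  # 3
```
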
Examples – Map Reduce Defn
• Word Count / distributed log search for # of accesses to various URLs
  • Map: emits (word/URL, 1) for each doc/log split
  • Reduce: sums up the counts for a specific word/URL
• Term Vector generation: term -> [doc-id]
  • Map: emits (term, doc-id) for each doc split
  • Reduce: Identity Reducer, accumulates the (term, [doc-id, doc-id ..])
• Reverse Links: source -> target to target -> source
  • Map: emits (target, source) for each doc split
  • Reduce: Identity Reducer, accumulates the (target, [source, source ..])

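The Reverse Links example might look like the following sketch. The input shape (a dictionary from a source page to the pages it links to), the function names and the sample pages are all assumptions made for illustration; sorting the sources is only there to make the output deterministic.

```python
from collections import defaultdict


def reverse_links_mapper(source, targets):
    # Emit (target, source) for every outgoing link of a page.
    for target in targets:
        yield target, source


def identity_reducer(key, values):
    # Pass the grouped values through unchanged.
    return key, list(values)


def reverse_links(link_graph):
    """Hypothetical driver: link_graph maps source -> list of targets."""
    grouped = defaultdict(list)
    for source, targets in link_graph.items():
        for target, src in reverse_links_mapper(source, targets):
            grouped[target].append(src)
    return dict(identity_reducer(target, sorted(sources))
                for target, sources in grouped.items())


links = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
incoming = reverse_links(links)
print(incoming)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```

Term vector generation follows the same shape: the mapper would emit (term, doc-id) instead of (target, source), and the same identity reducer collects the doc-ids per term.
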
Map Reduce – Hadoop Implementation
• Hides complexity of distributed computing
  • Automatic parallelization of job
  • Automatic data chunking & distribution (via HDFS)
  • Data locality: MR task dispatched where data is
  • Fault tolerant to server, storage, N/W failures
  • Network and disk transfer optimization
  • Load balancing

Hadoop Map Reduce Architecture (image-only slide)

HDFS Characteristics
• Very large files: block size 64 MB/128 MB
• Data access pattern: write once, read many
  • Writes are large, create & append only
  • Reads are large & streaming
• Commodity hardware
• Tolerant to failure: server, storage, network
• Highly available through transparent replication

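To get a feel for the block sizes mentioned above, here is a small arithmetic sketch. The 1 GB file is a made-up example, and the replication factor of 3 is an assumption (the slide does not state one); the helper names are hypothetical.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes):
    # Ceiling division: a partly filled last block still counts as a block.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def replica_count(file_size_bytes, block_size_bytes, replication=3):
    # Total block replicas stored across the cluster (replication is assumed).
    return block_count(file_size_bytes, block_size_bytes) * replication


one_gb = 1024 * MB
print(block_count(one_gb, 64 * MB))    # 16
print(block_count(one_gb, 128 * MB))   # 8
print(replica_count(one_gb, 64 * MB))  # 48
```

Each block is a natural unit of work for one map task, which is why larger blocks mean fewer, longer-running tasks for the same file.
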
HDFS Architecture (image-only slide)

Thanks

Backup Slides

Map & Reduce Functions (image-only slide)

Job Configuration (image-only slide)

Hadoop Map Reduce Components
• Job Tracker: tracks MR jobs; runs on the master node
• Task Tracker
  • Runs on data nodes and tracks Mapper, Reducer tasks assigned to the node
  • Heartbeats to Job Tracker
  • Maintains and picks up tasks from a queue

HDFS
• Name Node
  • Manages the file system namespace and regulates access to files by clients; stores metadata
  • Mapping of blocks to Data Nodes and replicas
  • Manages replication
  • Executes file system namespace operations like opening, closing, and renaming files and directories
• Data Node
  • One per node; manages local storage attached to the node
  • Internally, a file is split into one or more blocks and these blocks are stored in a set of Data Nodes
  • Responsible for serving read and write requests from the file system's clients
  • Also performs block creation, deletion, and replication upon instruction from the Name Node

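The block-to-Data-Node mapping kept by the Name Node can be pictured with a toy model. This is a sketch under loud assumptions: the round-robin placement below is invented for illustration and is not HDFS's actual placement policy, and `place_blocks`, `blocks_on_node` and the node names are hypothetical.

```python
def place_blocks(num_blocks, data_nodes, replication=3):
    """Toy placement: put each block on `replication` consecutive nodes."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    block_map = {}
    for block_id in range(num_blocks):
        # Round-robin choice of nodes (an assumption, not the HDFS policy).
        block_map[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return block_map


def blocks_on_node(block_map, node):
    # Blocks that would need a new replica if this node failed.
    return [block for block, nodes in block_map.items() if node in nodes]


nodes = ["dn1", "dn2", "dn3", "dn4"]
block_map = place_blocks(4, nodes, replication=3)
print(block_map[0])                      # ['dn1', 'dn2', 'dn3']
print(blocks_on_node(block_map, "dn1"))  # [0, 2, 3]
```

With a mapping like this, the metadata holder can tell which blocks fall below their target replica count when a node stops responding, and instruct other nodes to copy them.
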

An introduction to Hadoop for large scale data analysis

  • 1. Hadoop – Large scale data analysis Abhijit Sharma Page 1 | 9/8/2011
  • 2. Big Data Trends
      • Unprecedented growth in data set size: Facebook 21+ PB data warehouse, 12+ TB/day
      • Unstructured and semi-structured data: logs, documents, graphs
      • Connected data: web, tags, graphs
      • Relevant to enterprises: logs, social media, machine-generated data, breaking of silos
  • 3. Putting Big Data to work
      • Data-driven organization: decision support, new offerings
      • Analytics on large data sets (FB Insights: Page, App, etc. stats)
      • Data mining: clustering (Google News articles)
      • Search: Google
  • 4. Problem characteristics and examples
      • Embarrassingly data-parallel problems
      • Data chunked and distributed across the cluster
      • Parallel processing with data locality: the task is dispatched to where the data is
      • Horizontal/linear scaling approach using commodity hardware
      • Write once, read many
      • Examples
          • Distributed logs: grep, number of accesses per URL
          • Search: term vector generation, reverse links
  • 5. What is Hadoop?
      • Open source system for large-scale, batch, distributed computing on big data
      • Map Reduce programming paradigm and framework
      • Map Reduce infrastructure
      • Distributed file system (HDFS)
      • Endorsed/used extensively by web giants: Google, FB, Yahoo!
  • 6. Map Reduce - Definition
      • MapReduce is a programming model and an implementation for parallel processing of large data sets
      • Map processes each logical record in an input split to generate a set of intermediate key/value pairs
      • Reduce merges all intermediate values associated with the same intermediate key
  • 7. Map Reduce - Functional Programming Origins
      • Map: apply a function to each list member (parallelizable)
          [1, 2, 3].collect { it * it }
          Output: [1, 2, 3] -> Map (Square) : [1, 4, 9]
      • Reduce: apply a function and an accumulator to each list member
          [1, 2, 3].inject(0) { sum, item -> sum + item }
          Output: [1, 2, 3] -> Reduce (Sum) : 6
      • Map & Reduce
          [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
          Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) : 14
  • 8. Word Count - Shell
      Each stage of the shell pipeline lines up with the Map Reduce stage below it:
          cat * | grep | sort           | uniq -c
          input | map  | shuffle & sort | reduce
  • 9. Word Count - Map Reduce
  • 10. Word Count - Pseudo code
          mapper (filename, file-contents):
              for each word in file-contents:
                  emit (word, 1)      // single count for a word, e.g. ("the", 1) for each occurrence of "the"

          reducer (word, values):     // values iterates over the list of counts for a word, e.g. ("the", [1, 1, ..])
              sum = 0
              for each value in values:
                  sum = sum + value
              emit (word, sum)
  • 11. Examples – Map Reduce Definitions
      • Word Count / distributed log search for the number of accesses to various URLs
          • Map: emits (word/URL, 1) for each doc/log split
          • Reduce: sums up the counts for a specific word/URL
      • Term Vector generation: term -> [doc-id]
          • Map: emits (term, doc-id) for each doc split
          • Reduce: Identity Reducer, accumulates the (term, [doc-id, doc-id ..])
      • Reverse Links: source -> target to target -> source
          • Map: emits (target, source) for each doc split
          • Reduce: Identity Reducer, accumulates the (target, [source, source ..])
  • 12. Map Reduce – Hadoop Implementation
      • Hides the complexity of distributed computing
      • Automatic parallelization of the job
      • Automatic data chunking and distribution (via HDFS)
      • Data locality: MR task dispatched to where the data is
      • Fault tolerant to server, storage, and network failures
      • Network and disk transfer optimization
      • Load balancing
  • 13. Hadoop Map Reduce Architecture
  • 14.
  • 15. HDFS Architecture
  • 16. Thanks
  • 17. Backup Slides
  • 18. Map & Reduce Functions
  • 19. Job Configuration
  • 20. Hadoop Map Reduce Components
      • Job Tracker: tracks MR jobs and runs on the master node
      • Task Tracker
          • Runs on data nodes and tracks the Mapper and Reducer tasks assigned to the node
          • Sends heartbeats to the Job Tracker
          • Maintains and picks up tasks from a queue
  • 21. HDFS
      • Name Node
          • Manages the file system namespace and regulates access to files by clients; stores metadata
          • Keeps the mapping of blocks to Data Nodes and replicas
          • Manages replication
          • Executes file system namespace operations such as opening, closing, and renaming files and directories
      • Data Node
          • One per node; manages the local storage attached to that node
          • Internally, a file is split into one or more blocks, and these blocks are stored in a set of Data Nodes
          • Responsible for serving read and write requests from the file system's clients
          • Also performs block creation, deletion, and replication upon instruction from the Name Node
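
The word count pseudo code (slide 10) and the term vector and reverse links examples (slide 11) can be tried out with a small in-memory sketch. It is written in Python purely for illustration and is not the Hadoop API: `run_job` and the mapper and reducer names are hypothetical, everything runs in one process, and nothing here models HDFS, data locality, or fault tolerance. It also assumes that the "Identity Reducer" on slide 11 simply emits the grouped value list for each key (sorted here so the output is deterministic).

```python
from collections import defaultdict
from itertools import chain


def run_job(records, mapper, reducer):
    """Run one in-memory map -> shuffle & sort -> reduce pass (hypothetical helper)."""
    # Map phase: every (key, value) input record yields zero or more intermediate pairs.
    intermediate = chain.from_iterable(mapper(k, v) for k, v in records)

    # Shuffle & sort phase: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key, visited in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def word_count_mapper(filename, contents):
    # Emit (word, 1) for each occurrence of a word.
    for word in contents.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    # Sum the list of counts collected for one word.
    yield word, sum(counts)


def term_mapper(doc_id, contents):
    # Emit (term, doc-id) once per distinct term in the document.
    for term in set(contents.lower().split()):
        yield term, doc_id


def reverse_link_mapper(source, targets):
    # Turn source -> target links into (target, source) pairs.
    for target in targets:
        yield target, source


def collect_reducer(key, values):
    # Assumed reading of the slides' "Identity Reducer": emit the grouped list.
    yield key, sorted(values)


if __name__ == "__main__":
    docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
    links = [("page1", ["page2", "page3"]), ("page2", ["page3"])]

    print(run_job(docs, word_count_mapper, sum_reducer))
    print(run_job(docs, term_mapper, collect_reducer))
    print(run_job(links, reverse_link_mapper, collect_reducer))
```

Only the mapper and reducer change between the three jobs; the grouping step in the middle stays the same, which is the part a framework such as Hadoop takes over and distributes across the cluster.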