SlideShare uma empresa Scribd logo
1 de 33
CSE509: Introduction to Web Science and Technology Lecture 4: Dealing with Large-Scale Web data: Large-Scale File Systems and MapReduce Muhammad AtifQureshi Web Science Research Group Institute of Business Administration (IBA)
Last Time… Search Engine Architecture Overview of Web Crawling Web Link Structure Ranking Problem SEO and Web Spam Web Spam Research July 30, 2011
Today Web Data Explosion Part I MapReduce Basics MapReduce Example and Details MapReduce Case-Study: Web Crawler based on MapReduce Architecture Part II Large-Scale File Systems Google File System Case-Study July 30, 2011
Introduction Web data sets can be very large  Tens to hundreds of terabytes Cannot mine on a single server (why?) “Big data” is a fact on the World Wide Web Larger data implies effective algorithms Web-scale processing: Data-intensive processing Also applies to startups and niche players July 30, 2011
How Much Data? Google processes 20 PB a day (2008) Facebook has 2.5 PB of user data + 15 TB/day (4/2009)  eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERN’s LHC will generate 15 PB a year (??) July 30, 2011
Cluster Architecture July 30, 2011 CPU CPU CPU CPU Mem Mem Mem Mem Disk Disk Disk Disk 2-10 Gbps backbone between racks 1 Gbps between  any pair of nodes in a rack Switch Switch Switch … … Each rack contains 16-64 nodes
Concerns If we had to abort and restart the computation every time one component fails, then the computation might never complete successfully If one node fails, all its files would be unavailable until the node is replaced Can also lead to permanent loss of files July 30, 2011 Solutions: MapReduce and Google File system
PART I: MapReduce July 30, 2011
Major Ideas Scale “out”, not “up” (Distributed vs. SMP)  Limits of SMP and large shared-memory machines Move processing to the data Cluster have limited bandwidth Process data sequentially, avoid random access Seeks are expensive, disk throughput is reasonable Seamless scalability From the traditional mythical man-month approach to a newly known phenomenon tradable machine-hour Twenty-one chicken together cannot make an egg hatch in a day July 30, 2011
Traditional Parallelization: Divide and Conquer July 30, 2011 “Work” Partition w1 w2 w3 “worker” “worker” “worker” r1 r2 r3 Combine “Result”
Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? July 30, 2011
Common Theme Parallelization problems arise from: Communication between workers (e.g., to exchange state) Access to shared resources (e.g., data) Thus, we need a synchronization mechanism July 30, 2011
Parallelization is Hard Traditionally, concurrency is difficult to reason about (uni to small-scale architecture) Concurrency is even more difficult to reason about At the scale of datacenters (even across datacenters) In the presence of failures In terms of multiple interacting services Not to mention debugging… The reality: Write your own dedicated library, then program with it Burden on the programmer to explicitly manage everything July 30, 2011
Solution: MapReduce Programming model for expressing distributed computations at a massive scale Hides system-level details from the developers No more race conditions, lock contention, etc. Separating the what from how Developer specifies the computation that needs to be performed Execution framework (“runtime”) handles actual execution July 30, 2011
What is MapReduce Used For? At Google: Index building for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: Index building for Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection July 30, 2011
Typical MapReduce Execution Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output Map Reduce Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)
MapReduce Basics Programmers specify two functions: map (k, v) -> <k’, v’>* reduce (k’, v’) -> <k’, v’>* All values with the same key are sent to the same reducer The execution framework handles everything else… July 30, 2011
Warm Up Example: Word Count We have a large file of words, one word to a line Count the number of times each distinct word appears in the file Sample application: analyze web server logs to find popular URLs July 30, 2011
Word Count (2) Case 1: Entire file fits in memory Case 2: File too large for mem, but all <word, count> pairs fit in mem Case 3: File on disk, too many distinct words to fit in memory sort datafile | uniq –c July 30, 2011
Word Count (3) To make it slightly harder, suppose we have a large corpus of documents Count the number of times each distinct word occurs in the corpus words(docs/*) | sort | uniq -c where words takes a file and outputs the words in it, one to a line The above captures the essence of MapReduce Great thing is it is naturally parallelizable July 30, 2011
Word Count using MapReduce July 30, 2011 map(key, value): // key: document name; value: text of document 	for each word w in value: 		emit(w, 1) reduce(key, values): // key: a word; values: an iterator over counts 	result = 0 	for each count v in values: 		result += v 	emit(key,result)
Word Count Illustration July 30, 2011 map(key=url, val=contents): For each word w in contents, emit (w, “1”) reduce(key=word, values=uniq_counts): Sum all “1”s in values list Emit result “(word, sum)” see	1 bob	1  run	1 see 	1 spot 	1 throw	1 bob	1  run	1 see 	2 spot 	1 throw	1 see bob run see spot throw
Implementation Overview 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory Limited bandwidth  Storage is on local IDE disks  GFS: distributed file system manages data (SOSP'03)  Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines  July 30, 2011 Implementation at Google is a C++ library linked to user programs
Distributed Execution Overview July 30, 2011 UserProgram (1) submit Master (2) schedule map (2) schedule reduce worker split 0 (6) write output file 0 (5) remote read worker split 1 (3) read split 2 (4) local write worker split 3 output file 1 split 4 worker worker Input files Map phase Intermediate files (on local disk) Reduce phase Output files Adapted from (Dean and Ghemawat, OSDI 2004)
MapReduce Implementations Google has a proprietary implementation in C++ Bindings in Java, Python Hadoop is an open-source implementation in Java Development led by Yahoo, used in production Now an Apache project Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc. July 30, 2011
Bonus Assignment Write MapReduce version of Assignment no. 2 July 30, 2011
MapReduce in VisionerBOT July 30, 2011
VisionerBOT Distributed Design July 30, 2011
PART II: Google File System July 30, 2011
Distributed File System Don’t move data to workers… move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data local Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is reasonable A distributed file system is the answer GFS (Google File System) for Google’s MapReduce HDFS (Hadoop Distributed File System) for Hadoop
GFS: Assumptions Commodity hardware over “exotic” hardware Scale “out”, not “up” High component failure rates Inexpensive commodity components fail all the time “Modest” number of huge files Multi-gigabyte files are common, if not encouraged Files are write-once, mostly appended to Perhaps concurrently Large streaming reads over random access High sustained throughput over low latency GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
GFS: Design Decisions Files stored as chunks Fixed size (64MB) Reliability through replication Each chunk replicated across 3+ chunkservers Single master to coordinate access, keep metadata Simple centralized management No data caching Little benefit due to large datasets, streaming reads Simplify the API Push some of the issues onto the client (e.g., data layout) HDFS = GFS clone (same basic ideas)
QUESTIONS? July 30, 2011

Mais conteúdo relacionado

Destaque

Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
Link Analysis in National Web Domains (OSWIR 2005 Compiegne)Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
Link Analysis in National Web Domains (OSWIR 2005 Compiegne)Carlos Castillo (ChaTo)
 
Search engine optimization
Search engine optimizationSearch engine optimization
Search engine optimizationNaga Gopinath
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Kira
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web searchEmrullah Delibas
 
Link Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the TerroristsLink Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the TerroristsJames McGivern
 
Analysis on link networks of iran municipal websites
Analysis on link networks of iran municipal websitesAnalysis on link networks of iran municipal websites
Analysis on link networks of iran municipal websitesShahid Beheshti University
 
Link analysis .. Data Mining
Link analysis .. Data MiningLink analysis .. Data Mining
Link analysis .. Data MiningMustafa Salam
 

Destaque (12)

Motivation
MotivationMotivation
Motivation
 
Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
Link Analysis in National Web Domains (OSWIR 2005 Compiegne)Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
Link Analysis in National Web Domains (OSWIR 2005 Compiegne)
 
Search engine optimization
Search engine optimizationSearch engine optimization
Search engine optimization
 
Link Analysis (RBY)
Link Analysis (RBY)Link Analysis (RBY)
Link Analysis (RBY)
 
Tutorial 7 (link analysis)
Tutorial 7 (link analysis)Tutorial 7 (link analysis)
Tutorial 7 (link analysis)
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Link analysis for web search
Link analysis for web searchLink analysis for web search
Link analysis for web search
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Link Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the TerroristsLink Analysis in Networks - or - Finding the Terrorists
Link Analysis in Networks - or - Finding the Terrorists
 
Analysis on link networks of iran municipal websites
Analysis on link networks of iran municipal websitesAnalysis on link networks of iran municipal websites
Analysis on link networks of iran municipal websites
 
Link analysis
Link analysisLink analysis
Link analysis
 
Link analysis .. Data Mining
Link analysis .. Data MiningLink analysis .. Data Mining
Link analysis .. Data Mining
 

Semelhante a CSE509 Lecture 4

Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolThierry Badard
 
Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopJongwook Woo
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Effiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen DatenmengenEffiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen DatenmengenFlorian Stegmaier
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoopAbhi Goyan
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
A NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATAA NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATAijdms
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Rohit Srivastava
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopJosh Devins
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.Computer Science Journals
 

Semelhante a CSE509 Lecture 4 (20)

Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL tool
 
Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using Hadoop
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Big data
Big dataBig data
Big data
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Effiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen DatenmengenEffiziente Verarbeitung von grossen Datenmengen
Effiziente Verarbeitung von grossen Datenmengen
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
A NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATAA NOVEL APPROACH FOR PROCESSING BIG DATA
A NOVEL APPROACH FOR PROCESSING BIG DATA
 
Big Data
Big DataBig Data
Big Data
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.
 

Mais de Web Science Research Group at Institute of Business Administration, Karachi, Pakistan (8)

ReThinking CS Curriculum for Pakistan
ReThinking CS Curriculum for PakistanReThinking CS Curriculum for Pakistan
ReThinking CS Curriculum for Pakistan
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Social Media Mining and Analytics
Social Media Mining and AnalyticsSocial Media Mining and Analytics
Social Media Mining and Analytics
 
CSE509 Lecture 6
CSE509 Lecture 6CSE509 Lecture 6
CSE509 Lecture 6
 
CSE509 Lecture 5
CSE509 Lecture 5CSE509 Lecture 5
CSE509 Lecture 5
 
CSE509 Lecture 3
CSE509 Lecture 3CSE509 Lecture 3
CSE509 Lecture 3
 
CSE509 Lecture 2
CSE509 Lecture 2CSE509 Lecture 2
CSE509 Lecture 2
 
CSE509 Lecture 1
CSE509 Lecture 1CSE509 Lecture 1
CSE509 Lecture 1
 

Último

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

CSE509 Lecture 4

  • 1. CSE509: Introduction to Web Science and Technology Lecture 4: Dealing with Large-Scale Web data: Large-Scale File Systems and MapReduce Muhammad AtifQureshi Web Science Research Group Institute of Business Administration (IBA)
  • 2. Last Time… Search Engine Architecture Overview of Web Crawling Web Link Structure Ranking Problem SEO and Web Spam Web Spam Research July 30, 2011
  • 3. Today Web Data Explosion Part I MapReduce Basics MapReduce Example and Details MapReduce Case-Study: Web Crawler based on MapReduce Architecture Part II Large-Scale File Systems Google File System Case-Study July 30, 2011
  • 4. Introduction Web data sets can be very large Tens to hundreds of terabytes Cannot mine on a single server (why?) “Big data” is a fact on the World Wide Web Larger data implies effective algorithms Web-scale processing: Data-intensive processing Also applies to startups and niche players July 30, 2011
  • 5. How Much Data? Google processes 20 PB a day (2008) Facebook has 2.5 PB of user data + 15 TB/day (4/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERN’s LHC will generate 15 PB a year (??) July 30, 2011
  • 6. Cluster Architecture July 30, 2011 CPU CPU CPU CPU Mem Mem Mem Mem Disk Disk Disk Disk 2-10 Gbps backbone between racks 1 Gbps between any pair of nodes in a rack Switch Switch Switch … … Each rack contains 16-64 nodes
  • 7. Concerns If we had to abort and restart the computation every time one component fails, then the computation might never complete successfully If one node fails, all its files would be unavailable until the node is replaced Can also lead to permanent loss of files July 30, 2011 Solutions: MapReduce and Google File system
  • 8. PART I: MapReduce July 30, 2011
  • 9. Major Ideas Scale “out”, not “up” (Distributed vs. SMP) Limits of SMP and large shared-memory machines Move processing to the data Cluster have limited bandwidth Process data sequentially, avoid random access Seeks are expensive, disk throughput is reasonable Seamless scalability From the traditional mythical man-month approach to a newly known phenomenon tradable machine-hour Twenty-one chicken together cannot make an egg hatch in a day July 30, 2011
  • 10. Traditional Parallelization: Divide and Conquer July 30, 2011 “Work” Partition w1 w2 w3 “worker” “worker” “worker” r1 r2 r3 Combine “Result”
  • 11. Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? July 30, 2011
  • 12. Common Theme Parallelization problems arise from: Communication between workers (e.g., to exchange state) Access to shared resources (e.g., data) Thus, we need a synchronization mechanism July 30, 2011
  • 13. Parallelization is Hard Traditionally, concurrency is difficult to reason about (uni to small-scale architecture) Concurrency is even more difficult to reason about At the scale of datacenters (even across datacenters) In the presence of failures In terms of multiple interacting services Not to mention debugging… The reality: Write your own dedicated library, then program with it Burden on the programmer to explicitly manage everything July 30, 2011
  • 14. Solution: MapReduce Programming model for expressing distributed computations at a massive scale Hides system-level details from the developers No more race conditions, lock contention, etc. Separating the what from how Developer specifies the computation that needs to be performed Execution framework (“runtime”) handles actual execution July 30, 2011
  • 15. What is MapReduce Used For? At Google: Index building for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: Index building for Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection July 30, 2011
  • 16. Typical MapReduce Execution Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output Map Reduce Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004)
  • 17. MapReduce Basics Programmers specify two functions: map (k, v) -> <k’, v’>* reduce (k’, v’) -> <k’, v’>* All values with the same key are sent to the same reducer The execution framework handles everything else… July 30, 2011
  • 18. Warm Up Example: Word Count We have a large file of words, one word to a line Count the number of times each distinct word appears in the file Sample application: analyze web server logs to find popular URLs July 30, 2011
  • 19. Word Count (2) Case 1: Entire file fits in memory Case 2: File too large for mem, but all <word, count> pairs fit in mem Case 3: File on disk, too many distinct words to fit in memory sort datafile | uniq –c July 30, 2011
  • 20. Word Count (3) To make it slightly harder, suppose we have a large corpus of documents Count the number of times each distinct word occurs in the corpus words(docs/*) | sort | uniq -c where words takes a file and outputs the words in it, one to a line The above captures the essence of MapReduce Great thing is it is naturally parallelizable July 30, 2011
  • 21. Word Count using MapReduce July 30, 2011 map(key, value): // key: document name; value: text of document for each word w in value: emit(w, 1) reduce(key, values): // key: a word; values: an iterator over counts result = 0 for each count v in values: result += v emit(key,result)
  • 22. Word Count Illustration July 30, 2011 map(key=url, val=contents): For each word w in contents, emit (w, “1”) reduce(key=word, values=uniq_counts): Sum all “1”s in values list Emit result “(word, sum)” see 1 bob 1 run 1 see 1 spot 1 throw 1 bob 1 run 1 see 2 spot 1 throw 1 see bob run see spot throw
  • 23. Implementation Overview 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory Limited bandwidth Storage is on local IDE disks GFS: distributed file system manages data (SOSP'03) Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines July 30, 2011 Implementation at Google is a C++ library linked to user programs
  • 24. Distributed Execution Overview July 30, 2011 UserProgram (1) submit Master (2) schedule map (2) schedule reduce worker split 0 (6) write output file 0 (5) remote read worker split 1 (3) read split 2 (4) local write worker split 3 output file 1 split 4 worker worker Input files Map phase Intermediate files (on local disk) Reduce phase Output files Adapted from (Dean and Ghemawat, OSDI 2004)
  • 25. MapReduce Implementations Google has a proprietary implementation in C++ Bindings in Java, Python Hadoop is an open-source implementation in Java Development led by Yahoo, used in production Now an Apache project Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc. July 30, 2011
  • 26. Bonus Assignment Write MapReduce version of Assignment no. 2 July 30, 2011
  • 27. MapReduce in VisionerBOT July 30, 2011
  • 29. PART II: Google File System July 30, 2011
  • 30. Distributed File System Don’t move data to workers… move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data local Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is reasonable A distributed file system is the answer GFS (Google File System) for Google’s MapReduce HDFS (Hadoop Distributed File System) for Hadoop
  • 31. GFS: Assumptions Commodity hardware over “exotic” hardware Scale “out”, not “up” High component failure rates Inexpensive commodity components fail all the time “Modest” number of huge files Multi-gigabyte files are common, if not encouraged Files are write-once, mostly appended to Perhaps concurrently Large streaming reads over random access High sustained throughput over low latency GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  • 32. GFS: Design Decisions Files stored as chunks Fixed size (64MB) Reliability through replication Each chunk replicated across 3+ chunkservers Single master to coordinate access, keep metadata Simple centralized management No data caching Little benefit due to large datasets, streaming reads Simplify the API Push some of the issues onto the client (e.g., data layout) HDFS = GFS clone (same basic ideas)

Notas do Editor

  1. 2 In traditional high-performance computing (HPC) applications (e.g.,for climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes”and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloadsare not very processor-demanding, which means that the separation of compute and storage createsa bottleneck in the network. As an alternative to moving data around, it is more efficient to movethe processing around. That is, MapReduce assumes an architecture where processors and storage(disk) are co-located. In such a setup, we can take advantage of data locality by running code on theprocessor directly attached to the block of data we need. The distributed file system is responsiblefor managing the data over which MapReduce operates.3 Data-intensive processing by definition meansthat the relevant datasets are too large to fit in memory and must be held on disk. Seek times forrandom disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoidrandom data access, and instead organize computations so that data are processed sequentially. Asimple scenario10 poignantly illustrates the large performance gap between sequential operationsand random seeks: assume a 1 terabyte database containing 1010 100-byte records. Given reasonableassumptions about disk latency and throughput, a back-of-the-envelop calculation will show thatupdating 1% of the records (by accessing and then mutating each record) will take about a monthon a single machine. On the other hand, if one simply reads the entire database and rewrites allthe records (mutating those that need updating), the process would finish in under a work day ona single machine. Sequential data access is, literally, orders of magnitude faster than random dataaccess.11The development of solid-state drives is unlikely to change this balance for at least tworeasons. First, the cost differential between traditional magnetic disks and solid-state disks remainssubstantial: large-data will for the most part remain on mechanical drives, at least in the nearfuture. Second, although solid-state disks have substantially faster seek times, order-of-magnitudedifferences in performance between sequential and random access still remain.MapReduce is primarily designed for batch processing over large datasets. To the extentpossible, all computations are organized into long streaming operations that take advantage of theaggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly tradelatency for throughput.