Running MapReduce
Programs in Clouds
-Anshul Aggarwal
Cisco Systems
Cloud Computing … MapReduce … Hadoop
What is MapReduce?
• Simple data-parallel programming model designed for
scalability and fault-tolerance
• Pioneered by Google
• At Google, it processes about 20 petabytes of data per day
• Popularized by open-source Hadoop project
• Used at Yahoo!, Facebook, Amazon, …
Why MapReduce Optimization
Outline
• Cloud And MapReduce
• MapReduce architecture
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Cloud Computing
• The emergence of cloud computing
has had a tremendous impact on
the Information Technology (IT) industry
• Computing has moved away from personal computers and
individual enterprise application servers to services
provided by a cloud of computers
• Resources such as CPU and storage are provided to users as
general utilities, on demand, over the internet
• Cloud computing is still in its early stages, with many issues
yet to be addressed.
CLOUD COMPUTING SERVICES
MapReduce Framework
MapReduce History
• Historically, data processing was completely done using
database technologies. Most of the data had a well-defined
structure and was often stored in relational databases
• Data soon reached terabytes and then petabytes
• Google developed a new programming model called
MapReduce to handle large-scale data analysis, and later
introduced the model in its seminal paper,
“MapReduce: Simplified Data Processing on Large Clusters”.
What the paper says
Example: Facebook Lexicon
www.facebook.com/lexicon
What is MapReduce used for?
• At Google:
• Index construction for Google Search
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!:
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook:
• Data mining
• Ad optimization
• Spam detection
MapReduce Framework
• A computing paradigm for processing data that resides on hundreds of
computers
• Popularized recently by Google, Hadoop, and many others
• More of a framework
• Makes problem solving both easier and harder
• Inter-cluster network utilization
• Performance of a job that will be distributed
• Published by Google without any actual source code
MapReduce Terminology
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• Getting started with Hadoop
• Tuning MapReduce
Word Count: the "Hello World" of
the MapReduce world
• The word count job accepts an input directory, a mapper
function, and a reducer function as inputs.
• We use the mapper function to process the data in parallel,
and we use the reducer function to collect results of the
mapper and produce the final results.
• Mapper sends its results to reducer using a key-value based
model.
• $ bin/hadoop -cp hadoop-microbook.jar
microbook.wordcount.WordCount amazon-meta.txt
wordcount-output1
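The mapper, key-value hand-off, and reducer described above can be sketched in plain Java without a cluster. This is only an illustration of the data flow, not the WordCount class from hadoop-microbook.jar: the class name `WordCountSketch`, its method names, and the sample sentences are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the word count data flow.
public class WordCountSketch {
    // Map step: emit a (word, 1) pair for every word in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum all counts that were emitted for one word.
    public static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group values by key (the "shuffle"), reduce each group.
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(wordCount(List.of("the quick brown fox", "the lazy dog", "the fox")));
    }
}
```

In a real job the map calls run in parallel on different machines and the framework performs the grouping; here everything runs in one loop so the three stages are easy to follow.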
WorkFlow
Example: Word Count
• Job: Count the occurrences of each word in a data set
(Figure: Map Tasks and Reduce Tasks)
Outline
• Cloud And MapReduce
• MapReduce Basics
• Example applications
• MapReduce architecture
• Getting started with Hadoop
• Tuning MapReduce
How MapReduce Works
At the highest level, there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker
is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been
split into.
• The distributed filesystem (normally HDFS), which is used
for sharing job files between the other entities.
Anatomy of a MapReduce Job
Developing a MapReduce Application
• The Configuration API
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
• GenericOptionsParser, Tool, and ToolRunner
• Writing a Unit Test
• Testing the Driver
• Launching a Job
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver \
    -conf conf/hadoop-cluster.xml Input/ncdc/all max-temp
• Retrieving the Results
This is where the Magic Happens
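import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver: configures and submits the job. MaxTemperatureMapper and
// MaxTemperatureReducer are defined elsewhere.
public class MaxTemperatureDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());
    // args[0] is the input path, args[1] the output directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Block until the job finishes; exit code 0 on success, 1 on failure
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}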
Configuring Map Reduce params
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
$ bin/hadoop -cp hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
Q & A
Hadoop Clusters
In pioneer days they used oxen for heavy pulling,
and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be
trying for bigger computers, but for
more systems of computers.
- Grace Hopper
Why is Hadoop able to compete?
Hadoop:
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Commodity inexpensive hardware
• Efficient and simple fault-tolerant mechanism
vs. Database:
• Performance (tons of indexing, tuning, data organization tech.)
• Features:
- Provenance tracking
- Annotation management
- ….
What is Hadoop
• Hadoop is a software framework for distributed processing of large
datasets across large clusters of computers
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google’s MapReduce
• HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware
What is Hadoop (Cont’d)
• The Hadoop framework consists of two main layers
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Hadoop is designed as a master-slave, shared-nothing architecture
Design Principles of Hadoop
• Automatic parallelization & distribution
• Computation is spread across thousands of nodes and hidden from the end user
• Fault tolerance and automatic recovery
• Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
• Users only provide two functions, “map” and “reduce”
• Need to process big data
• Commodity hardware
• Large number of low-end cheap machines working in parallel to solve a
computing problem
Hardware Specs
• Memory
• RAM
• Total tasks
• No RAID required
• No Blade server
• Dedicated Switch
• Dedicated 1GB line
Who Uses MapReduce/Hadoop
• Google: Inventors of MapReduce computing paradigm
• Yahoo!: Developing Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others + universities and research labs
• Many enterprises are turning to Hadoop
• Especially applications generating big data
• Web applications, social networks, scientific applications
Hadoop: How it Works
• Hadoop implements Google’s MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them
on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop’s target is to run on clusters on the order of 10,000 nodes.
Sathya Sai University, Prashanti Nilayam
WorkFlow
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built
around the following assumptions:
• Hardware will fail.
• Processing will be run in batches.
• Applications that run on HDFS have large data sets.
• It should provide high aggregate data bandwidth
• Applications need a write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability is important.
Complete Overview
Hadoop Distributed File System (HDFS)
Centralized namenode
- Maintains metadata info about files
Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks
- Each block is replicated N times (default = 3)
(Figure: file F divided into blocks 1 to 5, 64 MB each)
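The block and replica figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below is a minimal illustration, assuming a hypothetical 300 MB file together with the 64 MB block size and replication factor of 3 quoted on the slide; the class and method names are made up for the example and are not part of any Hadoop API.

```java
// Hypothetical helper for HDFS block arithmetic; not a Hadoop class.
public class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file occupies: the ceiling of fileSize / blockSize.
    public static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster: each byte is kept `replication` times.
    public static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 300 * MB;  // assumed example file size
        long blockSize = 64 * MB;  // block size quoted on the slide
        int replication = 3;       // default replication quoted on the slide

        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks = " + blocks);                        // blocks = 5
        System.out.println("block replicas = " + blocks * replication);  // block replicas = 15
        System.out.println("raw storage (MB) = "
                + rawStorageBytes(fileSize, replication) / MB);          // raw storage (MB) = 900
    }
}
```

With these numbers the file splits into five blocks, four full and one partial, which matches the five-block file F in the figure.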
Main Properties of HDFS
• Large: An HDFS instance may consist of thousands of server
machines, each storing part of the file system’s data
• Replication: Each data block is replicated many times
(default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS
• The namenode constantly checks the datanodes
Tuning Parameters
Mapping workers to
Processors
• The input data (on HDFS) is stored on the local disks of the machines
in the cluster. HDFS divides each file into 64 MB blocks, and stores
several copies of each block (typically 3 copies) on different
machines.
• The MapReduce master takes the location information of the input
files into account and attempts to schedule a map task on a machine
that contains a replica of the corresponding input data. Failing that, it
attempts to schedule a map task near a replica of that task's input
data. When running large MapReduce operations on a significant
fraction of the workers in a cluster, most input data is read locally and
consumes no network bandwidth.
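The placement preference described above (a machine holding a replica first, then a machine near one) can be sketched as a small selection function. This is an illustrative sketch only, not the scheduler used by Google's MapReduce or by Hadoop: the class name, the `pickWorker` signature, the node names, and the use of "same rack" as the meaning of "near" are all assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical locality-aware worker selection; not a Hadoop class.
public class LocalitySchedulerSketch {
    // Choose a worker for a map task whose input block has replicas on `replicaHosts`.
    // Preference order: a free worker holding a replica, then a free worker in the
    // same rack as some replica, then any free worker. Returns null if none is free.
    public static String pickWorker(Set<String> replicaHosts,
                                    List<String> freeWorkers,
                                    Map<String, String> rackOf) {
        for (String worker : freeWorkers) {
            if (replicaHosts.contains(worker)) {
                return worker; // data-local: input is read from local disk
            }
        }
        for (String worker : freeWorkers) {
            String workerRack = rackOf.get(worker);
            for (String replica : replicaHosts) {
                if (workerRack != null && workerRack.equals(rackOf.get(replica))) {
                    return worker; // rack-local: input crosses only one switch
                }
            }
        }
        return freeWorkers.isEmpty() ? null : freeWorkers.get(0); // remote read
    }

    public static void main(String[] args) {
        Map<String, String> rackOf =
                Map.of("n1", "rackA", "n2", "rackA", "n3", "rackB", "n4", "rackB");
        Set<String> replicas = Set.of("n1", "n3");
        System.out.println(pickWorker(replicas, List.of("n2", "n3"), rackOf)); // n3
        System.out.println(pickWorker(replicas, List.of("n2"), rackOf));       // n2
    }
}
```

In the second call no free worker holds a replica, so the sketch falls back to n2, which shares a rack with the replica on n1.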
Task Granularity
• The map phase has M pieces and the reduce phase has R pieces.
• M and R should be much larger than the number of worker
machines.
• Having each worker perform many different tasks improves dynamic
load balancing, and also speeds up recovery when a worker fails.
• The larger M and R are, the more scheduling decisions the master must make
• R is often constrained by users because the output of each reduce task
ends up in a separate output file.
• Typically (at Google), M = 200,000 and R = 5,000, using 2,000
worker machines.
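To make the granularity figures concrete, the sketch below works through the arithmetic for the quoted values of M, R, and the worker count. It is only a worked example; the class and method names are hypothetical.

```java
// Hypothetical worked example of the task-granularity numbers; not a Hadoop class.
public class TaskGranularity {
    // Average number of tasks each worker handles over the life of the job.
    public static double tasksPerWorker(long tasks, long workers) {
        return (double) tasks / workers;
    }

    public static void main(String[] args) {
        long m = 200_000;     // map tasks (M), as quoted above
        long r = 5_000;       // reduce tasks (R), as quoted above
        long workers = 2_000; // worker machines, as quoted above

        System.out.println("map tasks per worker = " + tasksPerWorker(m, workers));    // 100.0
        System.out.println("reduce tasks per worker = " + tasksPerWorker(r, workers)); // 2.5
        System.out.println("tasks to schedule (M + R) = " + (m + r));                  // 205000
        System.out.println("output files (one per reduce task) = " + r);              // 5000
    }
}
```

With about 100 map tasks per worker, the work of a failed or slow machine can be spread over many other machines in small pieces, which is the load-balancing and recovery benefit the bullets describe.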
Speculative Execution – One
approach
• Tasks may be slow for various reasons, including hardware
degradation or software misconfiguration, but the causes
may be hard to detect since the tasks still complete
successfully, albeit after a longer time than expected.
• Hadoop doesn’t try to diagnose and fix slow-running tasks;
instead, it tries to detect when a task is running slower than
expected and launches another, equivalent task as a backup.
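One simple way to picture the "running slower than expected" check is to compare each task's progress with the average of its peers. The sketch below does exactly that and nothing more. It is an assumption-laden illustration, not Hadoop's actual speculation heuristic: the class name, the `stragglers` method, and the 0.2 lag threshold are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical straggler detection; Hadoop's real heuristic is more involved.
public class SpeculationSketch {
    // Return the ids of unfinished tasks whose progress (0.0 to 1.0) trails the
    // average progress by more than `lagThreshold`. These are backup candidates.
    public static List<String> stragglers(Map<String, Double> progressByTask, double lagThreshold) {
        double sum = 0.0;
        for (double p : progressByTask.values()) {
            sum += p;
        }
        double average = progressByTask.isEmpty() ? 0.0 : sum / progressByTask.size();

        List<String> slow = new ArrayList<>();
        for (Map.Entry<String, Double> e : progressByTask.entrySet()) {
            if (e.getValue() < 1.0 && average - e.getValue() > lagThreshold) {
                slow.add(e.getKey());
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        Map<String, Double> progress = Map.of("task-1", 0.9, "task-2", 0.85, "task-3", 0.2);
        System.out.println(stragglers(progress, 0.2)); // [task-3]
    }
}
```

A scheduler built on this idea would launch a duplicate of each flagged task on another machine and keep whichever copy finishes first.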
Problem Statement
The problem at hand is defining a resource provisioning
framework for MapReduce jobs running in a cloud keeping in
mind performance goals such as
Resource utilization with
-optimal number of map and reduce slots
-improvements in execution time
-Highly scalable solution
References
[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, “Predicting execution bottlenecks in map-
reduce clusters” In Proc. of the 4th USENIX conference on Hot Topics in Cloud computing,
2012.
[2] R. Buyya, S. K. Garg, and R. N. Calheiros, “SLA-Oriented Resource Provisioning for Cloud
Computing: Challenges, Architecture, and Solutions” In International Conference on Cloud and
Service Computing, 2011.
[3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, “Optimization of Resource Provisioning Cost in
Cloud Computing” in Transactions On Service Computing, Vol. 5, No. 2, IEEE, April-June 2012
[4] L Cherkasova and R.H. Campbell, “Resource Provisioning Framework for MapReduce Jobs
with Performance Goals”, in Middleware 2011, LNCS 7049, pp. 165–186, 2011
[5] J. Dean, and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”,
Communications of the ACM, Jan 2008
[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, “Resource Provisioning for Cloud Computing” In
Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research,
2009.
[7] K. Kambatla, A. Pathak, and H. Pucha, “Towards optimizing hadoop provisioning in the
cloud” in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009
[8] Kuyoro S. O., Ibikunle F. and Awodele O., “Cloud Computing Security Issues and
Challenges” in International Journal of Computer Networks (IJCN), Vol. 3, Issue 5, 2011
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Speak Out and Change the World! Voices 2015
Speak Out and Change the World!   Voices 2015Speak Out and Change the World!   Voices 2015
Speak Out and Change the World! Voices 2015
 
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
Breaking the Code of Interview Implicit Bias to Value Different Gender Compet...
 
Change IT! Voices 2015
Change IT! Voices 2015Change IT! Voices 2015
Change IT! Voices 2015
 
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
How Can We Make Interacting With Technology and Science Exciting and Fun Expe...
 
Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015Measure Impact, Not Activity - Voices 2015
Measure Impact, Not Activity - Voices 2015
 
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
Women’s INpowerment: The First-ever Global Survey to Hear Voice, Value and Vi...
 
The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015The Language of Leadership - Voices 2015
The Language of Leadership - Voices 2015
 
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
Mentors and Role Models - Best Practices in Many Cultures - Voices 2015
 
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
Panel: Cracking the Glass Ceiling: Growing Female Technology Professionals - ...
 
Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015Heart Rate Variability and the Digital Health Revolution - Voices 2015
Heart Rate Variability and the Digital Health Revolution - Voices 2015
 
Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015Women and CS, Lessons Learned From Turkey - Voices 2015
Women and CS, Lessons Learned From Turkey - Voices 2015
 
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
Communications Platform Provides "Your School at your Fingertips" for Busy Pa...
 
ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015ASEAN Women in Tech - Voices 2015
ASEAN Women in Tech - Voices 2015
 
Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015Empowering Women Technology Startup Founders to Succeed - Voices 2015
Empowering Women Technology Startup Founders to Succeed - Voices 2015
 
Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015Innovation a Destination and a Journey - Voices 2015
Innovation a Destination and a Journey - Voices 2015
 
Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015Agility and Cloud Computing - Voices 2015
Agility and Cloud Computing - Voices 2015
 
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...The Confidence Gap:  Igniting Brilliance through Feminine Leadership - Voices...
The Confidence Gap: Igniting Brilliance through Feminine Leadership - Voices...
 
Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015Business Intelligence Engineering - Voices 2015
Business Intelligence Engineering - Voices 2015
 
Agility and cloud computing
Agility and cloud computingAgility and cloud computing
Agility and cloud computing
 
J johnson global tech draft size reduce
J johnson global tech draft size reduceJ johnson global tech draft size reduce
J johnson global tech draft size reduce
 

Último

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 

Último (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

  • 1. Running MapReduce Programs in Clouds - Anshul Aggarwal, Cisco Systems
  • 3. What is MapReduce? • Simple data-parallel programming model designed for scalability and fault-tolerance • Pioneered by Google • Processes 20 petabytes of data per day • Popularized by open-source Hadoop project • Used at Yahoo!, Facebook, Amazon, …
  • 5. Outline • Cloud And MapReduce • MapReduce architecture • Example applications • Getting started with Hadoop • Tuning MapReduce
  • 6. Cloud Computing • The emergence of cloud computing has made a tremendous impact on the Information Technology (IT) industry • Cloud computing moved away from personal computers and the individual enterprise application server to services provided by a cloud of computers • Resources such as CPU and storage are provided to users on demand, as general utilities, over the Internet • Cloud computing is still in its early stages, with many issues yet to be addressed.
  • 8. Outline • Cloud And MapReduce • MapReduce architecture • Example applications • Getting started with Hadoop • Tuning MapReduce
  • 10. MapReduce History • Historically, data processing was done entirely with database technologies. Most of the data had a well-defined structure and was often stored in relational databases • Data volumes soon reached terabytes and then petabytes • Google developed a new programming model called MapReduce to handle large-scale data analysis, and later introduced the model in its seminal paper, MapReduce: Simplified Data Processing on Large Clusters.
  • 13. What is MapReduce used for? • At Google: • Index construction for Google Search • Article clustering for Google News • Statistical machine translation • At Yahoo!: • “Web map” powering Yahoo! Search • Spam detection for Yahoo! Mail • At Facebook: • Data mining • Ad optimization • Spam detection
  • 14. MapReduce Framework • computing paradigm for processing data that resides on hundreds of computers • popularized recently by Google, Hadoop, and many others • more of a framework • makes problem solving easier and harder • inter-cluster network utilization • performance of a job that will be distributed • published by Google without any actual source code
  • 16. Outline • Cloud And MapReduce • MapReduce Basics • Example applications • Getting started with Hadoop • Tuning MapReduce
  • 17. Word Count: the "Hello World" of the MapReduce world • The word count job accepts an input directory, a mapper function, and a reducer function as inputs. • The mapper function processes the data in parallel, and the reducer function collects the mappers' results and produces the final results. • The mapper sends its results to the reducer using a key-value based model. • $ bin/hadoop -cp hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
  • 19. Example: Word Count • Job: Count the occurrences of each word in a data set • [Figure: map tasks feeding reduce tasks]
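The word count flow on the two slides above can be sketched without a cluster. The following is a minimal plain-Java illustration, not the `microbook.wordcount.WordCount` class from the slide: the class name `WordCountSketch` and its methods are hypothetical, and the "shuffle" step stands in for the grouping that the framework performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the word count data flow; no Hadoop classes are used.
class WordCountSketch {

    // Map step: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group the emitted values by key, as the framework would.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs all three steps over a small in-memory data set.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

In a real job each call to `map` could run on a different machine; only the grouped key-value pairs cross the network to the reducers.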
  • 20. Outline • Cloud And MapReduce • MapReduce Basics • Example applications • Mapreduce Architecture • Getting started with Hadoop • Tuning MapReduce
  • 21. How MapReduce Works • At the highest level, there are four independent entities: • The client, which submits the MapReduce job. • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker. • The tasktrackers, which run the tasks that the job has been split into. • The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
  • 22. Anatomy of a MapReduce Job
  • 23. Developing a MapReduce Application • The Configuration API: Configuration conf = new Configuration(); conf.addResource("configuration-1.xml"); conf.addResource("configuration-2.xml"); • GenericOptionsParser, Tool, and ToolRunner • Writing a Unit Test • Testing the Driver • Launching a Job: % hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp • Retrieving the Results
  • 24. This is where the Magic Happens public class MaxTemperatureDriver extends Configured implements Tool { @Override public int run(String[] args) throws Exception { Job job = new Job(getConf(), "Max temperature"); job.setJarByClass(getClass()); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setMapperClass(MaxTemperatureMapper.class); job.setCombinerClass(MaxTemperatureReducer.class); job.setReducerClass(MaxTemperatureReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); return job.waitForCompletion(true) ? 0 : 1; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args); System.exit(exitCode); } }
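The driver above registers `MaxTemperatureReducer` as both the combiner and the reducer. That reuse is safe only because taking a maximum gives the same answer whether it is applied once to all values or first to partial groups. Here is a small plain-Java sketch of that property; the class `CombinerSketch` and the temperature readings are made up for illustration and are not part of the deck's code.

```java
import java.util.Arrays;
import java.util.List;

// Shows why a max-style reducer can double as a combiner.
class CombinerSketch {

    // Same logic as a max-temperature reducer: keep the largest value seen for a key.
    static int reduceMax(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical readings for one year, produced by two different map tasks.
        List<Integer> mapTask1 = Arrays.asList(0, 20, 10);
        List<Integer> mapTask2 = Arrays.asList(25, 15);

        // Without a combiner: the reducer sees all five values.
        int direct = reduceMax(Arrays.asList(0, 20, 10, 25, 15));

        // With a combiner: each map task pre-reduces locally,
        // so only two values are sent to the reducer.
        int combined = reduceMax(Arrays.asList(reduceMax(mapTask1), reduceMax(mapTask2)));

        System.out.println(direct + " " + combined); // both values are 25
    }
}
```

A function such as the arithmetic mean does not have this property, so a mean-computing reducer could not be reused as a combiner unchanged.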
  • 25. Configuring MapReduce params • <configuration> <property> <name>mapred.job.tracker</name> <value>MASTER_NODE:9001</value> </property> <property> <name>mapred.local.dir</name> <value>HADOOP_DATA_DIR/local</value> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> </property> </configuration> • $ bin/hadoop -cp hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
  • 26. Q & A
  • 27. Outline • Cloud And MapReduce • MapReduce architecture • Example applications • Getting started with Hadoop • Tuning MapReduce
  • 30. Why is Hadoop able to compete? Hadoop vs. database • Hadoop: scalability (petabytes of data, thousands of machines); flexibility in accepting all data formats (no schema); inexpensive commodity hardware; efficient and simple fault-tolerance mechanism • Database: performance (tons of indexing, tuning, data organization tech.); features such as provenance tracking, annotation management, ….
  • 31. What is Hadoop • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers • Large datasets → terabytes or petabytes of data • Large clusters → hundreds or thousands of nodes • Hadoop is an open-source implementation of Google's MapReduce • HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware
  • 32. What is Hadoop (Cont’d) • The Hadoop framework consists of two main layers • Distributed file system (HDFS) • Execution engine (MapReduce) • Hadoop is designed as a master-slave, shared-nothing architecture
  • 33. Design Principles of Hadoop • Automatic parallelization & distribution • Computation is spread across thousands of nodes and hidden from the end user • Fault tolerance and automatic recovery • Nodes/tasks will fail and will recover automatically • Clean and simple programming abstraction • Users only provide two functions, “map” and “reduce” • Need to process big data • Commodity hardware • Large number of low-end cheap machines working in parallel to solve a computing problem
  • 34. Hardware Specs • Memory • RAM • Total tasks • No Raid required • No Blade server • Dedicated Switch • Dedicated 1GB line
  • 35. Who Uses MapReduce/Hadoop • Google: inventors of the MapReduce computing paradigm • Yahoo: developing Hadoop, the open-source implementation of MapReduce • IBM, Microsoft, Oracle • Facebook, Amazon, AOL, Netflix • Many others + universities and research labs • Many enterprises are turning to Hadoop • Especially applications generating big data • Web applications, social networks, scientific applications
  • 36. Hadoop: How it Works • Hadoop implements Google’s MapReduce, using HDFS • MapReduce divides applications into many small blocks of work. • HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. • MapReduce can then process the data where it is located. • Hadoop’s target is to run on clusters on the order of 10,000 nodes. • Sathya Sai University, Prashanti Nilayam
  • 38. Hadoop: Assumptions It is written with large clusters of computers in mind and is built around the following assumptions: • Hardware will fail. • Processing will be run in batches. • Applications that run on HDFS have large data sets. • It should provide high aggregate data bandwidth • Applications need a write-once-read-many access model. • Moving Computation is Cheaper than Moving Data. • Portability is important.
  • 40. Hadoop Distributed File System (HDFS) • Centralized namenode: maintains metadata info about files • Many datanodes (1000s): store the actual data; files are divided into blocks; each block is replicated N times (default = 3) • [Figure: file F divided into blocks 1 to 5 of 64 MB each]
  • 41. Main Properties of HDFS • Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data • Replication: Each data block is replicated many times (default is 3) • Failure: Failure is the norm rather than the exception • Fault Tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS • The namenode is constantly checking the datanodes
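The block and replication figures on the two HDFS slides above (64 MB blocks, replication factor 3) translate into simple arithmetic. The sketch below assumes a hypothetical 1 GB file; the class `HdfsBlockSketch` is illustrative only and does not call any HDFS API.

```java
// Back-of-the-envelope block and storage arithmetic for HDFS-style storage.
class HdfsBlockSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 1024 * MB; // hypothetical 1 GB file
        System.out.println(blockCount(file, 64 * MB)); // 16 blocks
        System.out.println(rawBytes(file, 3) / MB);    // 3072 MB of raw storage
    }
}
```

With 16 blocks and 3 replicas each, the namenode tracks 48 block replicas for this one file, which is why a single centralized namenode favors a modest number of large files over many small ones.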
  • 42. Outline • Cloud And MapReduce • MapReduce architecture • Example applications • Getting started with Hadoop • Tuning MapReduce
  • 44. Mapping workers to Processors • The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines. • The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
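The scheduling preference described above (a machine holding a replica first, then a machine near a replica, then anything else) can be sketched as a three-pass search. This is an illustration under stated assumptions, not the jobtracker's actual code: "near" is interpreted here as "in the same rack", and the class `LocalitySketch`, the worker names, and the `rackOf` map are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified locality-aware placement of one map task.
class LocalitySketch {

    // Prefer a free worker holding a replica, then a free worker in the same
    // rack as a replica (an assumed meaning of "near"), then any free worker.
    static String pickWorker(Set<String> replicaHosts,
                             List<String> freeWorkers,
                             Map<String, String> rackOf) {
        for (String worker : freeWorkers) {
            if (replicaHosts.contains(worker)) {
                return worker; // data-local: no network read needed
            }
        }
        for (String worker : freeWorkers) {
            String rack = rackOf.get(worker);
            for (String replicaHost : replicaHosts) {
                if (rack != null && rack.equals(rackOf.get(replicaHost))) {
                    return worker; // rack-local: read stays within one rack
                }
            }
        }
        // No locality available: fall back to the first free worker, if any.
        return freeWorkers.isEmpty() ? null : freeWorkers.get(0);
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = Map.of("n1", "r1", "n2", "r1", "n3", "r2");
        Set<String> replicas = Set.of("n1");
        System.out.println(pickWorker(replicas, List.of("n3", "n1"), rackOf)); // n1
        System.out.println(pickWorker(replicas, List.of("n3", "n2"), rackOf)); // n2
        System.out.println(pickWorker(replicas, List.of("n3"), rackOf));       // n3
    }
}
```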
  • 45. Task Granularity • The map phase has M pieces and the reduce phase has R pieces. • M and R should be much larger than the number of worker machines. • Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails. • The larger M and R are, the more decisions the master must make. • R is often constrained by users because the output of each reduce task ends up in a separate output file. • Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines.
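The Google figures on this slide make the trade-off concrete. The sketch below works through the arithmetic; the O(M + R) scheduling decisions and O(M × R) state kept by the master are the bounds given in the Dean and Ghemawat paper listed in the references, while the class `GranularitySketch` and its method names are hypothetical.

```java
// Arithmetic behind the task granularity trade-off.
class GranularitySketch {

    // Average number of tasks each worker handles over the life of the job.
    static double tasksPerWorker(long tasks, long workers) {
        return (double) tasks / workers;
    }

    // The master makes on the order of M + R scheduling decisions.
    static long schedulingDecisions(long m, long r) {
        return m + r;
    }

    // The master keeps state on the order of M * R (one entry per map/reduce pair).
    static long mapReducePairs(long m, long r) {
        return m * r;
    }

    public static void main(String[] args) {
        long m = 200_000, r = 5_000, workers = 2_000;
        System.out.println(tasksPerWorker(m, workers));   // 100.0 map tasks per worker
        System.out.println(tasksPerWorker(r, workers));   // 2.5 reduce tasks per worker
        System.out.println(schedulingDecisions(m, r));    // 205000
        System.out.println(mapReducePairs(m, r));         // 1000000000
    }
}
```

With about 100 map tasks per worker, the work of a failed machine can be spread in small pieces across all remaining workers, which is the recovery benefit the slide refers to.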
  • 46. Speculative Execution: One Approach • Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect since the tasks still complete successfully, albeit after a longer time than expected. • Hadoop doesn’t try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent, task as a backup.
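The idea of detecting a task that is "running slower than expected" can be illustrated with a toy rule: flag any unfinished task whose progress lags the average by more than some margin. This is only a sketch of the concept; the slide does not describe Hadoop's actual heuristic, and the class `SpeculationSketch`, the `slack` parameter, and the progress values are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy straggler detection: compare each task's progress with the average.
class SpeculationSketch {

    // Returns the indexes of unfinished tasks whose progress (0.0 to 1.0)
    // is more than `slack` below the average progress of all tasks.
    static List<Integer> stragglers(double[] progress, double slack) {
        double sum = 0;
        for (double p : progress) {
            sum += p;
        }
        double average = progress.length == 0 ? 0 : sum / progress.length;

        List<Integer> slow = new ArrayList<>();
        for (int i = 0; i < progress.length; i++) {
            if (progress[i] < 1.0 && progress[i] < average - slack) {
                slow.add(i); // candidate for a backup (speculative) copy
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        double[] progress = {0.9, 0.85, 0.3, 1.0};
        // Average is 0.7625, so only the task at index 2 falls below 0.5625.
        System.out.println(stragglers(progress, 0.2)); // [2]
    }
}
```

Whichever copy of a flagged task finishes first would be kept and the other discarded, so a backup costs extra resources but never changes the job's result.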
  • 47. Problem Statement • The problem at hand is defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as resource utilization with: • an optimal number of map and reduce slots • improvements in execution time • a highly scalable solution
  • 48. References [1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, “Predicting execution bottlenecks in map-reduce clusters,” in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012. [2] R. Buyya, S. K. Garg, and R. N. Calheiros, “SLA-Oriented Resource Provisioning for Cloud Computing: Challenges, Architecture, and Solutions,” in International Conference on Cloud and Service Computing, 2011. [3] S. Chaisiri, Bu-Sung Lee, and D. Niyato, “Optimization of Resource Provisioning Cost in Cloud Computing,” in Transactions on Services Computing, Vol. 5, No. 2, IEEE, April-June 2012. [4] L. Cherkasova and R. H. Campbell, “Resource Provisioning Framework for MapReduce Jobs with Performance Goals,” in Middleware 2011, LNCS 7049, pp. 165–186, 2011. [5] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, Jan 2008. [6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, “Resource Provisioning for Cloud Computing,” in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009. [7] K. Kambatla, A. Pathak, and H. Pucha, “Towards optimizing hadoop provisioning in the cloud,” in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009. [8] Kuyoro S. O., Ibikunle F., and Awodele O., “Cloud Computing Security Issues and Challenges,” in International Journal of Computer Networks (IJCN), Vol. 3, Issue 5, 2011.

Editor's Notes

  1. When you run the MapReduce job, Hadoop first reads the input files from the input directory line by line. Hadoop then invokes the mapper once for each line, passing the line as the argument. Each mapper parses the line it received and extracts the words it contains. After processing, the mapper sends the word counts to the reducer by emitting each word and its count as a name-value pair.
  2. Writing a program in MapReduce has a certain flow to it. You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. Then you write a driver program to run a job, which can run from your IDE using a small subset of the data to check that it is working. If it fails, then you can use your IDE’s debugger to find the source of the problem. With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly. When the program runs as expected against the small dataset, you are ready to unleash it on a cluster. Running against the full dataset is likely to expose some more issues, which you can fix as before, by expanding your tests and mapper or reducer to handle the new cases. Debugging failing programs in the cluster is a challenge, so we look at some common techniques to make it easier.
  3. We solve problems involving large datasets by using many computers to process the dataset in parallel. However, writing a program that processes a dataset in a distributed setup is a heavy undertaking. The challenges of such a program are shown as follows: Although it is possible to write such a program, it is a waste to write such programs again and again. MapReduce-based frameworks like Hadoop let users write only the