Introduction to MapReduce
Christopher Curtin
About Me
  • 20+ years in Technology
  • Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop
  • CTO of Silverpop
  • Silverpop is a leading marketing automation and email marketing company
Contrived Example
What is MapReduce
  “MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
  http://labs.google.com/papers/mapreduce.html
Back to the example
I need to know:
  • # of each color M&M
  • Average weight of each color
  • Average width of each color
Traditional approach
  • Initialize data structure
  • Read CSV
  • Split each row into parts
  • Find color in data structure
  • Increment count, add width, weight
  • Write final result
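The steps above can be sketched as a single-process, single-pass aggregation. This is a minimal illustration in plain Python; the column names (color, weight, width) and the sample rows are assumptions, since the slides do not show the actual file layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the slides do not show the real CSV layout.
SAMPLE_CSV = """color,weight,width
red,0.90,1.30
blue,0.80,1.20
red,1.10,1.50
"""


def summarize(csv_file):
    """Single pass over the rows: count, average weight, average width per color."""
    # color -> [count, total weight, total width]
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for row in csv.DictReader(csv_file):
        entry = totals[row["color"]]
        entry[0] += 1
        entry[1] += float(row["weight"])
        entry[2] += float(row["width"])
    # Turn the running totals into the final per-color result.
    return {
        color: {"count": n, "avg_weight": w / n, "avg_width": d / n}
        for color, (n, w, d) in totals.items()
    }


result = summarize(io.StringIO(SAMPLE_CSV))
```

Everything runs on one core and one file at a time, which is the limitation the following slides pick up.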
ASSume with me
  • Determining weight is a CPU intensive step
  • 8 core machine
  • 5,000,000,000 pieces per shift to process
  • Files ‘rotated’ hourly
Thread It!
  • Write logic to start multiple threads, pass each one a row (or 1000 rows) to evaluate
Issues with threading
  • Have to write coordination logic
  • Locking of the color data structure
  • Disk/Network I/O becomes next bottleneck
  • As volume increases, cost of CPUs/Disks isn’t linear
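To make the coordination and locking burden concrete, here is a rough sketch of the threaded variant in plain Python. The rows, the batch size of 1000 and the pool of 8 workers are illustrative assumptions that mirror the numbers on the slides, not code from the talk.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input rows: (color, weight, width).
ROWS = [("red", 0.9, 1.3), ("blue", 0.8, 1.2), ("red", 1.1, 1.5)] * 1000

# Shared color data structure: color -> [count, total weight, total width].
totals = defaultdict(lambda: [0, 0.0, 0.0])
totals_lock = threading.Lock()


def process_batch(batch):
    # Aggregate locally first so the shared lock is taken once per batch,
    # not once per row.
    local = defaultdict(lambda: [0, 0.0, 0.0])
    for color, weight, width in batch:
        entry = local[color]
        entry[0] += 1
        entry[1] += weight
        entry[2] += width
    # Merge into the shared structure while holding the lock.
    with totals_lock:
        for color, (n, w, d) in local.items():
            shared = totals[color]
            shared[0] += n
            shared[1] += w
            shared[2] += d


BATCH_SIZE = 1000
batches = [ROWS[i:i + BATCH_SIZE] for i in range(0, len(ROWS), BATCH_SIZE)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() forces completion and surfaces any worker exception.
    list(pool.map(process_batch, batches))
```

Even in this small form, the batching, the lock and the merge step are all code the application author has to write and get right. In standard CPython builds threads do not execute Python bytecode in parallel, so the sketch illustrates the coordination logic rather than a real speedup.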
Ideas to solve these problems?
  • Put it in a database
  • Multiple machines, each processes a file
MapReduce
  • Map
      • Parse the data into name/value pairs
      • Can be fast or expensive
  • Reduce
      • Collect the name/value pairs and perform function on each ‘name’
      • Framework makes sure you get all the distinct ‘names’ and only one per invocation
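The map and reduce roles can be sketched for the M&M example as follows. This is plain Python, not the Hadoop API: `run_job` is a hypothetical stand-in for the grouping a real framework performs between the two phases, and the comma-separated line format is assumed.

```python
from collections import defaultdict


def map_row(line):
    """Map: parse one input line into a (name, value) pair."""
    color, weight, width = line.split(",")
    return color, (float(weight), float(width))


def reduce_color(color, values):
    """Reduce: invoked once per distinct name with all of that name's values."""
    values = list(values)
    n = len(values)
    avg_weight = sum(w for w, _ in values) / n
    avg_width = sum(d for _, d in values) / n
    return color, n, avg_weight, avg_width


def run_job(lines):
    # Stand-in for the framework: group every mapped value under its name,
    # then call the reducer exactly once per name.
    groups = defaultdict(list)
    for line in lines:
        name, value = map_row(line)
        groups[name].append(value)
    return [reduce_color(name, vals) for name, vals in sorted(groups.items())]


output = run_job(["red,0.9,1.3", "blue,0.8,1.2", "red,1.1,1.5"])
```

Note that neither `map_row` nor `reduce_color` touches shared state or locks; the grouping step is what guarantees each reducer call sees all values for its name.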
Distributed File System
  • System takes the files and makes copies across all the machines in the cluster
  • Often files are broken apart and spread around
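A toy model of "broken apart and spread around" might look like the sketch below. The tiny block size, the node names, the replication count of 3 and the round-robin placement are all assumptions chosen for illustration; a real distributed file system uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 10-byte file and 4-node cluster.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(
    len(blocks), ["node1", "node2", "node3", "node4"], replication=3
)
```

Because every block lives on several machines, losing one node does not lose data, and there are several candidate machines on which a task reading that block can run.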
Move processing to the data!
  • Rather than copying files to the processes, push the application to the machine where the data lives!
  • System pushes jar files and launches JVMs to process
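One way to picture the scheduling idea is the sketch below: prefer a machine that already stores the block, and fall back to any free machine only when none of those has capacity. The function name, the slot counts and the fallback rule are hypothetical and do not describe Hadoop's actual scheduler.

```python
def pick_node(block_id, placement, free_slots):
    """Return (node, is_local): prefer a node that already stores the block."""
    for node in placement[block_id]:
        if free_slots.get(node, 0) > 0:
            return node, True
    # No replica holder has capacity: run remotely and pay the copy cost.
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


# Hypothetical cluster state: block id -> nodes holding a replica.
placement = {0: ["node1", "node2"], 1: ["node2", "node3"]}
free_slots = {"node1": 0, "node2": 1, "node3": 0, "node4": 2}

first = pick_node(0, placement, free_slots)
free_slots[first[0]] -= 1  # node2's only slot is now taken
second = pick_node(1, placement, free_slots)
```

The first task runs where its data already is; the second has to run on a machine without a local copy because both replica holders are busy.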
Runtime Distribution © Concurrent 2009
Hadoop
  • Apache’s MapReduce implementation
  • Lots of third party support
      • Yahoo
      • Cloudera
      • Others announcing almost daily
Example
Issues with example
  • /ajug/output can’t exist!
  • What’s with all the ‘Writable’ classes?
  • Data Structures have a lot of coding overhead
  • What if I want to do multiple things off the source?
  • What if I want to do something after the Reduce?
Cascading
  • Layer on top of Hadoop
  • Introduces Pipes to abstract when mappers or reducers are needed
  • Can easily string together logic steps
  • No need to think about when to map, when to reduce
  • No need for intermediate data structures
Sample Example in Cascading
Multiple Output example in Cascading
Unit testing
  • Kind of hard without some upfront thought
  • Separate business logic from Hadoop/Cascading specific parts
  • Try to use domain objects or primitives in business logic, not Tuples or Hadoop structures
  • Cascading has a nice testing framework to implement
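The separation described above could look roughly like this. The sketch is plain Python with hypothetical names (`ColorStats`, `compute_stats`, `reduce_adapter`); the dict-shaped records stand in for whatever framework-specific structure the real job receives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColorStats:
    """Domain object: no framework types involved."""
    count: int
    avg_weight: float
    avg_width: float


def compute_stats(measurements):
    """Business logic: plain (weight, width) pairs in, domain object out."""
    measurements = list(measurements)
    if not measurements:
        raise ValueError("no measurements")
    n = len(measurements)
    return ColorStats(
        count=n,
        avg_weight=sum(w for w, _ in measurements) / n,
        avg_width=sum(d for _, d in measurements) / n,
    )


def reduce_adapter(key, framework_values):
    """Thin framework-facing layer: unpack records, delegate, repack."""
    stats = compute_stats((v["weight"], v["width"]) for v in framework_values)
    return key, (stats.count, stats.avg_weight, stats.avg_width)


def test_compute_stats():
    # Runs without a cluster or any framework on the classpath.
    assert compute_stats([(1.0, 2.0), (3.0, 4.0)]) == ColorStats(2, 2.0, 3.0)


test_compute_stats()
```

Only `reduce_adapter` needs to change if the framework does, and the interesting logic in `compute_stats` can be tested in milliseconds.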
Other testing
  • Known sets of data are critical at volume
Common Use Cases
  • Evaluation of large volumes of data at a regular frequency
  • Algorithms that take a single pass through the data
  • Sensor data, log files, web analytics, transactional data
  • First pass ‘what is going on’ evaluation before building/paying for ‘real’ reports
Things it is not good for
  • Ad-hoc queries (though there are some tools on top of Hadoop to help)
  • Fast/real-time evaluations
  • OLTP
  • Well known analysis may be better off in a data warehouse
Issues to watch out for
  • Lots of small files
  • Default scheduler is pretty poor
  • Users need shell-level access?!?
Getting started
  • Download latest from Cloudera or Apache
  • Set up a local-only cluster (really easy to do)
  • Download Cascading
  • Optionally download Karmasphere if using Eclipse (http://www.karmasphere.com/)
  • Build some simple tests/apps
  • Running locally is almost the same as in the cluster
Elastic Map Reduce
  • Amazon EC2-based Hadoop
  • Define as many servers as you want
  • Load the data and go
  • 60 CENTS per hour per machine for a decent size
So ask yourself
  • What could I do with 100 machines in an hour?
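As a back-of-the-envelope check on that question, using the 60 cents per machine-hour figure quoted in the talk (a 2011 number taken from the slides, not current pricing), the arithmetic is machines × hours × rate:

```python
from decimal import Decimal

# Rate quoted on the slides; treat it as a historical figure, not a current price.
RATE_PER_MACHINE_HOUR = Decimal("0.60")


def cluster_cost(machines, hours, rate=RATE_PER_MACHINE_HOUR):
    """Total cost in dollars for `machines` running for `hours` at a flat rate."""
    return machines * hours * rate


cost = cluster_cost(machines=100, hours=1)  # 100 * 1 * 0.60
```

At that rate, an hour on 100 machines comes to 60 dollars, ignoring storage and data transfer, which the slides do not price.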
Ask yourself again …
  • What design/architecture do I have because I didn’t have a good way to store the data?
  • Or
  • What have I shoved into an RDBMS because I had one?
Other Solutions
  • Apache Pig: http://hadoop.apache.org/pig/
      • More ‘SQL-like’
      • Not as easy to mix regular Java into processes
      • More ‘ad hoc’ than Cascading
  • Yahoo! Oozie: http://yahoo.github.com/oozie/
      • Work coordination via configuration not code
      • Allows integration of non-Hadoop jobs into process
Resources
  • Me: ccurtin@silverpop.com, @ChrisCurtin
  • Chris Wensel: @cwensel
  • Web site: www.cascading.org, mailing list off website
  • Atlanta Hadoop Users Group: http://www.meetup.com/Atlanta-Hadoop-Users-Group/
  • Cloud Computing Atlanta Meetup: http://www.meetup.com/acloud/
  • O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/

Ajug april 2011
