SlideShare uma empresa Scribd logo
1 de 25
Baixar para ler offline
Ruby on Hadoop
Tuesday, January 8, 13
Introduction




                                      Hi.
                                   I’m Ted O’Meara
                         ...and I just quit my job last week.

                                    @tomeara
                                 tedomeara.com

Tuesday, January 8, 13
MapReduce
Tuesday, January 8, 13
History of MapReduce



        • First implemented
          by Google
        • Used in CouchDB,
          Hadoop, etc.
        • Helps to “distill” data into
          a concentrated result set




Tuesday, January 8, 13
What is MapReduce?




Tuesday, January 8, 13
What is MapReduce?




                                                                 sum = 0
   input = ["deer", "bear",
                                                                 input.each do |x|
   "river", "car", "car", "river",   input.map! { |x| [x, 1] }
                                                                   sum += x[1]
   "deer", "car", "bear"]
                                                                 end




Tuesday, January 8, 13
Hadoop Breakdown
Tuesday, January 8, 13
History of Hadoop



        •Doug Cutting @ Yahoo!
        •It is a Toy Elephant
        •It is also a framework for
         distributed computing
        •It is a distributed filesystem




Tuesday, January 8, 13
Network Topology


Tuesday, January 8, 13
Hadoop Cluster

                         Cluster
                         •Commodity hardware
                         •Partition tolerant
                         •Network-aware (rack-aware)



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Tuesday, January 8, 13
Hadoop Cluster

                         NameNode
                         •Keeps track of the DataNodes
                         •Uses “heartbeat” to determine a node’s health
                         •The most resources should be spent here



                          555.555.1.*             555.555.2.*                 444.444.1.*
                              JobTracker              NameNode                 TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode
                                                                          ♥    TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode        TaskTracker/DataNode




Tuesday, January 8, 13
Hadoop Cluster

                         DataNode
                         •Stores filesystem blocks
                         •Can be scaled. Spun up/down.
                         •Replicate based on a set replication factor



                          555.555.1.*             555.555.2.*               444.444.1.*
                              JobTracker               NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode     TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode     TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode     TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode     TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode     TaskTracker/DataNode     TaskTracker/DataNode




Tuesday, January 8, 13
Hadoop Cluster

                         JobTracker
                         •Delegates which TaskTrackers should handle a
                          MapReduce job
                         •Communicates with the NameNode to assign a TaskTracker
                          close to the DataNode where the source exists


                          555.555.1.*                 555.555.2.*              444.444.1.*
                              JobTracker                  NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode
                                                  ♥    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode        TaskTracker/DataNode     TaskTracker/DataNode




Tuesday, January 8, 13
Hadoop Cluster

                         TaskTracker
                         •Worker for MapReduce jobs
                         •The closer to the DataNode with the data, the better



                          555.555.1.*             555.555.2.*              444.444.1.*
                              JobTracker              NameNode              TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode

                           TaskTracker/DataNode    TaskTracker/DataNode     TaskTracker/DataNode




Tuesday, January 8, 13
HDFS


Tuesday, January 8, 13
HDFS

                                           hadoop fs -put localfile /user/hadoop/hadoopfile




                         555.555.1.*                   555.555.2.*                    444.444.1.*
                             JobTracker                      NameNode                   TaskTracker/DataNode

                          TaskTracker/DataNode            TaskTracker/DataNode          TaskTracker/DataNode

                          TaskTracker/DataNode            TaskTracker/DataNode          TaskTracker/DataNode

                          TaskTracker/DataNode            TaskTracker/DataNode          TaskTracker/DataNode

                          TaskTracker/DataNode            TaskTracker/DataNode          TaskTracker/DataNode

                          TaskTracker/DataNode            TaskTracker/DataNode          TaskTracker/DataNode




Tuesday, January 8, 13
Hadoop Streaming


Tuesday, January 8, 13
Hadoop Streaming
        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar 
                          -input "/user/me/samples/cachefile/input.txt" 
                          -mapper "xargs cat" 
                          -reducer "cat" 
                          -output "/user/me/samples/cachefile/out" 
                          -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' 
                          -jobconf mapred.map.tasks=3 
                          -jobconf mapred.reduce.tasks=3 
                          -jobconf mapred.job.name="Experiment"




                         555.555.1.*                  555.555.2.*                      444.444.1.*
                             JobTracker                     NameNode                     TaskTracker/DataNode

                          TaskTracker/DataNode          TaskTracker/DataNode             TaskTracker/DataNode

                          TaskTracker/DataNode          TaskTracker/DataNode             TaskTracker/DataNode

                          TaskTracker/DataNode          TaskTracker/DataNode             TaskTracker/DataNode

                          TaskTracker/DataNode          TaskTracker/DataNode             TaskTracker/DataNode

                          TaskTracker/DataNode          TaskTracker/DataNode             TaskTracker/DataNode




Tuesday, January 8, 13
Hadoop Streaming




                          Pig        Hive          Wukong
                         Pig Latin   SQL-ish         Ruby!




           Hadoop Ecosystem
Tuesday, January 8, 13
Wukong


        •Infochimps
        •Currently going through
         heavy development
        •Use the 3.0.0.pre3 gem
            https://github.com/infochimps-labs/wukong/tree/3.0.0

        •Model your jobs with
         wukong-hadoop
            https://github.com/infochimps-labs/wukong-hadoop




Tuesday, January 8, 13
Wukong



            Wukong                             wukong-hadoop
            •Write mappers and reducers        •A CLI to use with Hadoop
             using Ruby                        •Created around building tasks
            •As of 3.0.0, Wukong uses           with Wukong
             “Processors”, which are Ruby      •Better than piping in the shell
             classes that define map, reduce,
                                                (you can see this with --dry_run)
             and other tasks




Tuesday, January 8, 13
Wukong Processors

                                     Wukong.processor(:mapper) do
                                       
                                       field :min_length, Integer, :default    =>   1
                                       field :max_length, Integer, :default    =>   256
                                       field :split_on,   Regexp,   :default   =>   /s+/
                                       field :remove,     Regexp,   :default   =>   /[^a-zA-Z0-9']+/
                                       field :fold_case, :boolean, :default    =>   false
                                       
                                       def process string

        •Fields are accessible           tokenize(string).each do |token|
                                           yield token if acceptable?(token)
                                         end
         through switches in shell     end

                                       private
        •Local hand-off is made at      def tokenize string
                                         string.split(split_on).map do |token|
         STDOUT to STDIN                   stripped = token.gsub(remove, '')
                                           fold_case ? stripped.downcase : stripped
                                         end
                                       end

                                       def acceptable? token
                                         (min_length..max_length).include?(token.length)
                                       end
                                     end




Tuesday, January 8, 13
Wukong Processors



                         Wukong.processor(:reducer, Wukong::Processor::Accumulator) do

                           attr_accessor :count
                           
                           def start record
                             self.count = 0
                           end
                           
                           def accumulate record
                             self.count += 1
                           end

                           def finalize
                             yield [key, count].join("t")
                           end
                         end




Tuesday, January 8, 13
Wukong Processors

           wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb 
                            --mode=local 
                            --input=/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub




                                      Simpsons - Ep 8
                                      do 7
                                      Doctor     1
                                      Does 2
                                      doesn't    1
                                      dog 2
                                      D'oh 1
                                      doif 1
                                      doing      2
                                      done 1
                                      doneYou    1
                                      don't 10
                                      Don't 1




Tuesday, January 8, 13
The End




                         Thank you!
                             @tomeara
                             ted@tedomeara.com




Tuesday, January 8, 13

Mais conteúdo relacionado

Semelhante a Ruby on hadoop

Semelhante a Ruby on hadoop (20)

Hadoop
HadoopHadoop
Hadoop
 
Pptx present
Pptx presentPptx present
Pptx present
 
Pptx present
Pptx presentPptx present
Pptx present
 
Pptx present
Pptx presentPptx present
Pptx present
 
Pp1tx present
Pp1tx presentPp1tx present
Pp1tx present
 
1cailbbtbilgqjw00cff 140424042232-phpapp01
1cailbbtbilgqjw00cff 140424042232-phpapp011cailbbtbilgqjw00cff 140424042232-phpapp01
1cailbbtbilgqjw00cff 140424042232-phpapp01
 
Pptx present
Pptx presentPptx present
Pptx present
 
Hadoop 130419075715-phpapp02(1)
Hadoop 130419075715-phpapp02(1)Hadoop 130419075715-phpapp02(1)
Hadoop 130419075715-phpapp02(1)
 
Test schedule
Test scheduleTest schedule
Test schedule
 
5anbcquvtfgv1pvhfif9 140508053553-phpapp01
5anbcquvtfgv1pvhfif9 140508053553-phpapp015anbcquvtfgv1pvhfif9 140508053553-phpapp01
5anbcquvtfgv1pvhfif9 140508053553-phpapp01
 
Hadoop 130419075715-phpapp02(1)
Hadoop 130419075715-phpapp02(1)Hadoop 130419075715-phpapp02(1)
Hadoop 130419075715-phpapp02(1)
 
5anbcquvtfgv1pvhfif9 140508053553-phpapp01 (1)
5anbcquvtfgv1pvhfif9 140508053553-phpapp01 (1)5anbcquvtfgv1pvhfif9 140508053553-phpapp01 (1)
5anbcquvtfgv1pvhfif9 140508053553-phpapp01 (1)
 
Ppt1x present
Ppt1x presentPpt1x present
Ppt1x present
 
1cailbbtbilgqjw00cff 140424042232-phpapp01
1cailbbtbilgqjw00cff 140424042232-phpapp011cailbbtbilgqjw00cff 140424042232-phpapp01
1cailbbtbilgqjw00cff 140424042232-phpapp01
 
My bar
My barMy bar
My bar
 
Pptx present
Pptx presentPptx present
Pptx present
 
Ppt1x present
Ppt1x presentPpt1x present
Ppt1x present
 
Pptx present
Pptx presentPptx present
Pptx present
 
#jeet
#jeet#jeet
#jeet
 
5anbcquvtfgv1pvhfif9 140508053553-phpapp01
5anbcquvtfgv1pvhfif9 140508053553-phpapp015anbcquvtfgv1pvhfif9 140508053553-phpapp01
5anbcquvtfgv1pvhfif9 140508053553-phpapp01
 

Último

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Último (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Ruby on hadoop

  • 1. Ruby on Hadoop Tuesday, January 8, 13
  • 2. Introduction Hi. I’m Ted O’Meara ...and I just quit my job last week. @tomeara tedomeara.com Tuesday, January 8, 13
  • 4. History of MapReduce • First implemented by Google • Used in CouchDB, Hadoop, etc. • Helps to “distill” data into a concentrated result set Tuesday, January 8, 13
  • 6. What is MapReduce? sum = 0 input = ["deer", "bear", input.each do |x| "river", "car", "car", "river", input.map! { |x| [x, 1] } sum += x[1] "deer", "car", "bear"] end Tuesday, January 8, 13
  • 8. History of Hadoop •Doug Cutting @ Yahoo! •It is a Toy Elephant •It is also a framework for distributed computing •It is a distributed filesystem Tuesday, January 8, 13
  • 10. Hadoop Cluster Cluster •Commodity hardware •Partition tolerant •Network-aware (rack-aware) 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 11. Hadoop Cluster NameNode •Keeps track of the DataNodes •Uses “heartbeat” to determine a node’s health •The most resources should be spent here 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode ♥ TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 12. Hadoop Cluster DataNode •Stores filesystem blocks •Can be scaled. Spun up/down. •Replicate based on a set replication factor 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 13. Hadoop Cluster JobTracker •Delegates which TaskTrackers should handle a MapReduce job •Communicates with the NameNode to assign a TaskTracker close to the DataNode where the source exists 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode ♥ TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 14. Hadoop Cluster TaskTracker •Worker for MapReduce jobs •The closer to the DataNode with the data, the better 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 16. HDFS hadoop fs -put localfile /user/hadoop/hadoopfile 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 18. Hadoop Streaming $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input "/user/me/samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat" -output "/user/me/samples/cachefile/out" -cacheArchive 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' -jobconf mapred.map.tasks=3 -jobconf mapred.reduce.tasks=3 -jobconf mapred.job.name="Experiment" 555.555.1.* 555.555.2.* 444.444.1.* JobTracker NameNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode TaskTracker/DataNode Tuesday, January 8, 13
  • 19. Hadoop Streaming Pig Hive Wukong Pig Latin SQL-ish Ruby! Hadoop Ecosystem Tuesday, January 8, 13
  • 20. Wukong •Infochimps •Currently going through heavy development •Use the 3.0.0.pre3 gem https://github.com/infochimps-labs/wukong/tree/3.0.0 •Model your jobs with wukong-hadoop https://github.com/infochimps-labs/wukong-hadoop Tuesday, January 8, 13
  • 21. Wukong Wukong wukong-hadoop •Write mappers and reducers •A CLI to use with Hadoop using Ruby •Created around building tasks •As of 3.0.0, Wukong uses with Wukong “Processors”, which are Ruby •Better than piping in the shell classes that define map, reduce, (you can see this with --dry_run) and other tasks Tuesday, January 8, 13
  • 22. Wukong Processors Wukong.processor(:mapper) do      field :min_length, Integer, :default => 1   field :max_length, Integer, :default => 256   field :split_on, Regexp, :default => /s+/   field :remove, Regexp, :default => /[^a-zA-Z0-9']+/   field :fold_case, :boolean, :default => false      def process string •Fields are accessible     tokenize(string).each do |token|       yield token if acceptable?(token)     end through switches in shell   end   private •Local hand-off is made at   def tokenize string     string.split(split_on).map do |token| STDOUT to STDIN       stripped = token.gsub(remove, '')       fold_case ? stripped.downcase : stripped     end   end   def acceptable? token     (min_length..max_length).include?(token.length)   end end Tuesday, January 8, 13
  • 23. Wukong Processors Wukong.processor(:reducer, Wukong::Processor::Accumulator) do   attr_accessor :count      def start record     self.count = 0   end      def accumulate record     self.count += 1   end   def finalize     yield [key, count].join("t")   end end Tuesday, January 8, 13
  • 24. Wukong Processors wu-hadoop /home/hduser/wukong-hadoop/examples/word_count.rb --mode=local --input=/home/hduser/simpsons/simpsonssubs/Simpsons [1.08].sub Simpsons - Ep 8 do 7 Doctor 1 Does 2 doesn't 1 dog 2 D'oh 1 doif 1 doing 2 done 1 doneYou 1 don't 10 Don't 1 Tuesday, January 8, 13
  • 25. The End Thank you! @tomeara ted@tedomeara.com Tuesday, January 8, 13