Anatomy of distributed computing with Hadoop

  1. Anatomy of distributed computing with Hadoop
  2. What is Hadoop?
      - Hadoop started out as a subproject of Nutch, created by Doug Cutting
      - Hadoop boosted Nutch’s scalability
      - Enhanced by Yahoo!, it became an Apache top-level project
      - A system for distributed big data processing
      - Big data means terabytes, petabytes and more
      - Exabyte and zettabyte datasets?
  3. Why does anyone need Hadoop?
  4. Hadoop use cases
  5. Hadoop use cases
  6. Hadoop use cases
  7. Hadoop basics
      - Implements the MapReduce model described in Google’s paper: http://research.google.com/archive/mapreduce.html
      - Hadoop is a combination of:
          - HDFS: storage
          - MapReduce: computation
  8. HDFS: Hadoop Distributed File System
      - It’s a file system
      - Usage: bin/hadoop dfs <command> <options>
      - <command> is one of: cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, moveToLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
  9. Hadoop Distributed File System
      - It’s accessible
  10. Hadoop Distributed File System
      - It’s distributed
      - It employs a master/slave architecture
  11. Hadoop Distributed File System
      - Name Node: stores file system metadata
      - Secondary Name Node(s): periodically merge the file system image with the edit log
      - Data Node(s): store the actual data (blocks) and allow data to be replicated
  12. MapReduce
      - A programming model for distributed data processing
      - Its data processing primitives are functions: Mappers and Reducers
  13. MapReduce
      - To decompose a problem for MapReduce, think of the data in terms of keys and values: <key, value>
          - <user id, user profile>
          - <timestamp, apache log entry>
          - <tag, list of tagged images>
  14. MapReduce
      - Mapper: a function that takes a key and a value and emits zero or more keys and values
      - Reducer: a function that takes a key and all “mapped” values for that key and emits zero or more new keys and values
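
The Mapper and Reducer contracts on slide 14 can be sketched in plain Java, with no Hadoop dependency. Everything here is an illustrative assumption rather than Hadoop API: the `Mapper` and `Reducer` interfaces, the `runJob` helper and the `wordCount` example are hypothetical names, and the whole "cluster" is a single in-memory `TreeMap`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Plain-Java sketch of the MapReduce model; none of these types are Hadoop API. */
final class MapReduceModel {

    /** Takes one input pair and emits zero or more intermediate pairs. */
    interface Mapper<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> emit);
    }

    /** Takes one key with all of its mapped values and emits zero or more pairs. */
    interface Reducer<K, V, KO, VO> {
        void reduce(K key, List<V> values, BiConsumer<KO, VO> emit);
    }

    /** Runs map, groups intermediate values by sorted key, then runs reduce per key. */
    static <KI, VI, K extends Comparable<K>, V, VO> Map<K, VO> runJob(
            Map<KI, VI> input, Mapper<KI, VI, K, V> mapper, Reducer<K, V, K, VO> reducer) {
        Map<K, List<V>> grouped = new TreeMap<>();
        input.forEach((k, v) -> mapper.map(k, v,
                (ok, ov) -> grouped.computeIfAbsent(ok, x -> new ArrayList<>()).add(ov)));
        Map<K, VO> output = new TreeMap<>();
        grouped.forEach((k, vs) -> reducer.reduce(k, vs, output::put));
        return output;
    }

    /** Example job: counts how often each whitespace-separated word occurs. */
    static Map<String, Integer> wordCount(Map<Integer, String> lines) {
        Mapper<Integer, String, String, Integer> mapper = (lineNo, line, emit) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emit.accept(word, 1); // one <word, 1> pair per occurrence
                }
            }
        };
        Reducer<String, Integer, String, Integer> reducer = (word, counts, emit) -> {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            emit.accept(word, sum); // one <word, total> pair per key
        };
        return runJob(lines, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<Integer, String> lines = new LinkedHashMap<>();
        lines.put(0, "tag1 tag3");
        lines.put(1, "tag3");
        System.out.println(wordCount(lines));
    }
}
```

The sketch keeps the two functions free of any knowledge about distribution, which is the point of the model: the framework, here reduced to `runJob`, owns grouping and scheduling.
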
  15. MapReduce example
      - “Hello World” for Hadoop: http://wiki.apache.org/hadoop/WordCount
      - “Tag Cloud” example for Hadoop: a cloud of tags (tag1 through tag6) in which each tag_i is sized by weight(tag_i)
  16. Tag Cloud example
      - Input is taggable content (images, posts, videos) with space-separated tags: <post_i, “tag1 tag2 … tag_n”>
      - Output is tag_i with its count, plus the total number of tags: <tag_i, tag count> <total tags, total tags count>
      - Results:
          - weight(tag_i) = tag_i count / total tags
          - font(tag_i) = fn(weight(tag_i))
  17. Tag Cloud Mapper
      - The Mapper extends the class org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Mapper input:
            <post1, “tag1 tag3”>
            <post2, “tag3”>
            <post3, “tag2 tag3 tag4”>
            <post4, “tag1 tag2 tag3”>
      - To simplify the model, write the raw tags to an input file and make the line number the key:
            <line1, “tag1 tag3”>
            <line2, “tag3”>
            <line3, “tag2 tag3 tag4”>
            <line4, “tag1 tag2 tag3”>
  18. Tag Cloud Mapper
      - Mapper input (values are the space-separated tags read from the file; the line number is the key):
            <0, "tag1 tag3">
            <1, "tag3">
            <2, "tag2 tag3 tag4">
            <3, "tag1 tag2 tag3">
      - Mapper code, run for each input value such as "tag1 tag3":
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, " ");
            // emit the number of tags on this line under the shared "total tags" key
            context.write(TOTAL_TAGS_KEY, new IntWritable(tokenizer.countTokens()));
            while (tokenizer.hasMoreTokens()) {
                Text tag = new Text(tokenizer.nextToken());
                // emit an intermediate <tag, 1> pair
                context.write(tag, new IntWritable(1));
            }
      - Mapper output, emitted via context.write():
            <"total tags", 2> <"tag1", 1> <"tag3", 1>
            <"total tags", 1> <"tag3", 1>
            <"total tags", 3> <"tag2", 1> <"tag3", 1> <"tag4", 1>
            <"total tags", 3> <"tag1", 1> <"tag2", 1> <"tag3", 1>
  19. Reducer phases
      1. Shuffle or Copy phase: copies the Mapper output to the Reducer’s local file system
      2. Sort phase: sorts the Mapper output by key; this becomes the Reducer input
            Mapper output:
            <"total tags", 2> <"tag1", 1> <"tag3", 1>
            <"total tags", 1> <"tag3", 1>
            <"total tags", 3> <"tag2", 1> <"tag3", 1> <"tag4", 1>
            <"total tags", 3> <"tag1", 1> <"tag2", 1> <"tag3", 1>
            Reducer input after shuffle and sort by key:
            <"tag1", 1> <"tag1", 1>
            <"tag2", 1> <"tag2", 1>
            <"tag3", 1> <"tag3", 1> <"tag3", 1> <"tag3", 1>
            <"tag4", 1>
            <"total tags", 2> <"total tags", 1> <"total tags", 3> <"total tags", 3>
      3. Reduce or Emit phase: calls reduce() once for each key, passing the group of values sorted under that key
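
The shuffle and sort step on slide 19 can be imitated in a few lines of plain Java. This is only a single-process sketch: `ShuffleSortSketch`, `shuffleAndSort` and `slideMapperOutput` are hypothetical names, a `TreeMap` stands in for the distributed copy-and-merge, and the order of values within a key simply follows insertion order, which a real cluster does not promise. It assumes Java 9 or later for `List.of` and `Map.entry`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for the shuffle and sort phases; not Hadoop code. */
final class ShuffleSortSketch {

    /** Groups intermediate pairs by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** The Mapper output shown on the slide, one input line per row. */
    static List<Map.Entry<String, Integer>> slideMapperOutput() {
        return List.of(
                Map.entry("total tags", 2), Map.entry("tag1", 1), Map.entry("tag3", 1),
                Map.entry("total tags", 1), Map.entry("tag3", 1),
                Map.entry("total tags", 3), Map.entry("tag2", 1), Map.entry("tag3", 1), Map.entry("tag4", 1),
                Map.entry("total tags", 3), Map.entry("tag1", 1), Map.entry("tag2", 1), Map.entry("tag3", 1));
    }

    public static void main(String[] args) {
        // Each printed line corresponds to one reduce() call in the next phase.
        shuffleAndSort(slideMapperOutput())
                .forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

Each entry of the returned map is what one `reduce()` call would receive: a key plus every value emitted for it.
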
  20. Tag Cloud Reduce phase
      - The Reducer extends the class org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
      - Reducer input, pairs grouped by tag_i, for example [<"tag1", 1>, <"tag1", 1>]:
            <"tag1", 1> <"tag1", 1>
            <"tag2", 1> <"tag2", 1>
            <"tag3", 1> <"tag3", 1> <"tag3", 1> <"tag3", 1>
            <"tag4", 1>
            <"total tags", 2> <"total tags", 1> <"total tags", 3> <"total tags", 3>
      - Reducer code, run once per key:
            // sum all values emitted for this key
            int tagsCount = 0;
            for (IntWritable value : values) {
                tagsCount += value.get();
            }
            context.write(key, new IntWritable(tagsCount));
      - Reducer output, emitted via context.write():
            <tag1, 2> <tag2, 2> <tag3, 4> <tag4, 1> <total tags, 9>
  21. Tag Cloud Output
      - Reducer output is a weighted list: <tag1, 2> <tag2, 2> <tag3, 4> <tag4, 1> <total tags, 9>
      - Tag’s weight: weight(tag_i) = tag_i count / total tags
            <weight(tag1), 2/9> <weight(tag2), 2/9> <weight(tag3), 4/9> <weight(tag4), 1/9>
      - Size of font: font(tag_i) = fn(weight(tag_i))
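
The weight formula on slide 21 is easy to check by hand with a small post-processing sketch in plain Java. The slides leave `fn` unspecified, so the `fontSize` function below is purely a hypothetical choice (a linear scale from 10pt to 30pt); `TagWeights` and `weights` are likewise illustrative names and not part of the TagCloud project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Post-processing sketch: turns reducer counts into weights and font sizes. */
final class TagWeights {
    static final String TOTAL_TAGS_KEY = "total tags";

    /** weight(tag) = tag count / total tags, for every key except the total itself. */
    static Map<String, Double> weights(Map<String, Integer> reducerOutput) {
        int total = reducerOutput.get(TOTAL_TAGS_KEY);
        Map<String, Double> weights = new LinkedHashMap<>();
        reducerOutput.forEach((tag, count) -> {
            if (!tag.equals(TOTAL_TAGS_KEY)) {
                weights.put(tag, count.doubleValue() / total);
            }
        });
        return weights;
    }

    /** Hypothetical fn: linear scale from 10pt (weight 0) to 30pt (weight 1). */
    static long fontSize(double weight) {
        return Math.round(10 + 20 * weight);
    }

    public static void main(String[] args) {
        Map<String, Integer> reducerOutput = new LinkedHashMap<>();
        reducerOutput.put("tag1", 2);
        reducerOutput.put("tag2", 2);
        reducerOutput.put("tag3", 4);
        reducerOutput.put("tag4", 1);
        reducerOutput.put(TOTAL_TAGS_KEY, 9);
        weights(reducerOutput).forEach((tag, w) ->
                System.out.println(tag + " weight=" + w + " font=" + fontSize(w) + "pt"));
    }
}
```

Because the weights of all tags sum to 1, any monotonic `fn` keeps the relative ordering of tag sizes; only the visual spread changes.
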
  22. Between Map and Reduce
      - Combiner:
          - extends the class org.apache.hadoop.mapreduce.Reducer
          - works as an in-memory Reducer on the Mapper’s output
          - serves as an additional optimization
          - Mapper output: <"total tags", 2> <"tag1", 1> <"tag1", 1> <"tag3", 1>
          - Combiner output after the in-memory combine: <"total tags", 3> <"tag1", 2> <"tag3", 1>
      - Partitioner:
          - extends the class org.apache.hadoop.mapreduce.Partitioner
          - assigns each intermediate <key, value> pair from the Mapper to its designated Reducer partition
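
Both steps on slide 22 can be sketched without Hadoop. The `combine` method below pre-sums pairs that share a key, as a Combiner would on one Mapper's output, and `partition` mirrors the hash-modulo rule used by Hadoop's default HashPartitioner. The class and method names are hypothetical, `String.hashCode()` stands in for the key type's own hash, and the sample input (two lines, "tag1 tag3" and "tag1") is an assumption chosen for illustration. It assumes Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of what happens between map and reduce; not Hadoop code. */
final class CombinePartitionSketch {

    /** Sums values per key on the mapper side, shrinking the data sent to reducers. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    /** Hash-modulo assignment: clears the sign bit, then takes the remainder. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Assumed mapper output for the lines "tag1 tag3" and "tag1".
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("total tags", 2), Map.entry("tag1", 1), Map.entry("tag3", 1),
                Map.entry("total tags", 1), Map.entry("tag1", 1));
        combine(mapperOutput).forEach((key, value) ->
                System.out.println(key + "=" + value + " -> reducer " + partition(key, 3)));
    }
}
```

Summing is safe to apply early because addition is associative and commutative; a reduce function without those properties cannot be reused as a Combiner as-is. Since the partition depends only on the key, every pair for a given tag reaches the same Reducer.
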
  23. Time for a Workshop (standalone mode)
      - Build the “Tag Cloud” project jar:
            cd $TAG_CLOUD_HOME
            mvn clean install
      - Check the input directory:
            $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/
      - Check the input file:
            $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01
      - Submit TagCloudJob to Hadoop:
            $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input $TAG_CLOUD_HOME/output
      - Check the output directory:
            $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/
      - Check the output file:
            $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000
  24. Apache Pig
      - Higher-level data processing layer on top of Hadoop
      - Data-flow oriented language (Pig scripts)
      - Data types include sets, associative arrays and tuples
      - Developed at Yahoo!
  25. Apache Hive
      - Feature set is similar to Pig’s
      - SQL-like data warehouse infrastructure
      - Its language is closer to strict SQL
      - Supports SELECT, JOIN, GROUP BY, etc.
      - Developed at Facebook
  26. Apache HBase
      - Column-store database (modeled after Google BigTable)
      - HDFS is the underlying file system
      - Holds extremely large datasets (multi-terabyte)
      - Constrained access model
  27. Apache Mahout
      - Scalable machine learning algorithms on top of Hadoop:
          - filtering
          - recommendations
          - classifiers
          - clustering
  28. Apache ZooKeeper
      - Common services for distributed applications:
          - group services
          - configuration management
          - naming services
          - synchronization
  29. Oozie
      - Workflow engine for Hadoop
      - Orchestrates dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce)
      - Another query processing API
      - Developed at Yahoo!
  30. Apache Chukwa
      - System for reliable large-scale log collection
      - Displaying, monitoring and analyzing results
      - Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce
      - Incubated at apache.org
  31. Questions
      - links: http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop https://github.com/tazija/TagCloud
      - skype: siarhei_bushyk
      - mailto: tazija@gmail.com
      - mailto: sergey.bushik@altoros.com