
Big Data Analytics with Hadoop with @techmilind

Hadoop has rapidly emerged as the preferred solution for big data analytics across unstructured data, and companies are seeking competitive advantage by finding effective ways to analyze new sources of unstructured and machine-generated data. This session reviews the practices of performing analytics on unstructured data with Hadoop.

Published in: Technology


  1. Big Data Analytics with Apache Hadoop. Milind Bhandarkar (@techmilind, @milindb), Data Computing Division
  2. Speaker Profile: Milind Bhandarkar was a founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid Solutions team, focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and were his area of specialization for his PhD (Computer Science) from the University of Illinois at Urbana-Champaign. He has worked at the Center for Development of Advanced Computing (C-DAC), the National Center for Supercomputing Applications (NCSA), the Center for Simulation of Advanced Rockets, Siebel Systems, PathScale Inc. (acquired by QLogic), Yahoo!, and LinkedIn. Currently, he is the Chief Scientist, Machine Learning Platforms at Greenplum, a division of EMC.
  3. Outline
     • Intro to Hadoop (10 mins)
     • MapReduce (15 mins)
     • Hadoop Examples (30 mins)
     • Q & A
  4. Apache Hadoop
  5. Apache Hadoop
     • January 2006: Subproject of Lucene
     • January 2008: Top-level Apache project
     • Stable version: 1.0.3
     • Latest version: 2.0 alpha
  6. Apache Hadoop
     • Reliable, performant distributed file system
     • MapReduce programming framework
     • Ecosystem: HBase, Hive, Pig, Howl, Oozie, ZooKeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
  7. Problem: Bandwidth to Data
     • Scan 100 TB datasets on a 1000-node cluster
       • Remote storage @ 10 MB/s = 165 mins
       • Local storage @ 50-200 MB/s = 33-8 mins
     • Moving computation is more efficient than moving data
       • Need visibility into data placement
  8. Problem: Scaling Reliably
     • Failure is not an option, it's a rule!
       • 1000 nodes, MTBF < 1 day
       • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16 TB RAM)
     • Need a fault-tolerant store with reasonable availability guarantees
       • Handle hardware faults transparently
  9. Hadoop Goals
     • Scalable: petabytes (10^15 bytes) of data on thousands of nodes
     • Economical: commodity components only
     • Reliable
       • Engineering reliability into every application is expensive
  10. Hadoop MapReduce
  11. Think MapReduce
      • Record = (Key, Value)
      • Key: Comparable, Serializable
      • Value: Serializable
      • Input, Map, Shuffle, Reduce, Output
  12. Seems Familiar?
      cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
  13. Map
      • Input: (Key1, Value1)
      • Output: List(Key2, Value2)
      • Projections, filtering, transformation
  14. Shuffle
      • Input: List(Key2, Value2)
      • Output: Sort(Partition(List(Key2, List(Value2))))
      • Provided by Hadoop
  15. Reduce
      • Input: (Key2, List(Value2))
      • Output: List(Key3, Value3)
      • Aggregation
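The Map, Shuffle, and Reduce contracts on the three slides above can be sketched as a minimal in-memory simulation. This is an illustrative sketch in Python (one of the Streaming languages mentioned later in the deck), not Hadoop's actual API; the function names are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, map_fn):
    # Apply the user map function to each (Key1, Value1) record,
    # collecting all emitted (Key2, Value2) pairs.
    out = []
    for k, v in records:
        out.extend(map_fn(k, v))
    return out

def shuffle_phase(pairs):
    # Hadoop sorts map output by key, partitions it, and groups the
    # values for each key before handing them to a reducer.
    pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]

def reduce_phase(groups, reduce_fn):
    # Aggregate each (Key2, List(Value2)) group into (Key3, Value3) output.
    out = []
    for k, vs in groups:
        out.extend(reduce_fn(k, vs))
    return out

# Word count, the canonical MapReduce example: map emits (word, 1),
# reduce sums the counts per word.
records = [(0, "a b a"), (1, "b c")]
mapped = map_phase(records, lambda _, line: [(w, 1) for w in line.split()])
reduced = reduce_phase(shuffle_phase(mapped), lambda w, cs: [(w, sum(cs))])
# reduced == [("a", 2), ("b", 2), ("c", 1)]
```

The shuffle is the only phase the framework provides for you; map and reduce are the user-supplied pieces.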
  16. Hadoop Streaming
      • Hadoop is written in Java
        • Java MapReduce code is "native"
      • What about non-Java programmers?
        • Perl, Python, Shell, R
        • grep, sed, awk, uniq as mappers/reducers
      • Text input and output
  17. Hadoop Streaming
      • Thin Java wrapper for Map & Reduce tasks
      • Forks the actual mapper & reducer
      • IPC via stdin, stdout, stderr
      • Key.toString() \t Value.toString() \n
      • Slower than Java programs
        • Allows for quick prototyping / debugging
  18. Hadoop Streaming
      $ bin/hadoop jar hadoop-streaming.jar -input in-files -output out-dir -mapper mapper.sh -reducer reducer.sh
      # mapper.sh
      sed -e 's/ /\n/g' | grep .
      # reducer.sh
      uniq -c | awk '{print $2 "\t" $1}'
  19. Hadoop Examples
  20. Example: Standard Deviation
      • Takeaway: changing the algorithm to suit the architecture yields the best implementation
  21. Implementation 1
      • Two MapReduce stages
      • First stage computes the mean
      • Second stage computes the standard deviation
  22. Stage 1: Compute Mean
      • Map input: (x_i for i = 1..N_m)
      • Map output: (N_m, Mean(x_1..N_m))
      • Single reducer
      • Reduce input: Group(Map output)
      • Reduce output: Mean(x_1..N)
  23. Stage 2: Compute Standard Deviation
      • Map input: (x_i for i = 1..N_m) & Mean(x_1..N)
      • Map output: Sum((x_i - Mean(x))^2 for i = 1..N_m)
      • Single reducer
      • Reduce input: Group(Map output) & N
      • Reduce output: σ
  24. Standard Deviation
      • Algebraically equivalent
      • Be careful about numerical accuracy, though
  25. Implementation 2
      • Map input: (x_i for i = 1..N_m)
      • Map output: (N_m, [Sum(x_i^2 for i = 1..N_m), Mean(x_1..N_m)])
      • Single reducer
      • Reduce input: Group(Map output)
      • Reduce output: σ
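The single-stage implementation above rests on the algebraic identity σ² = E[x²] − (E[x])². A small Python sketch (illustrative, not from the deck) of the per-split map aggregate and the single-reducer merge:

```python
import math

def map_partition(xs):
    # One map task's partial aggregates over its split:
    # (count, sum of x, sum of x squared).
    return (len(xs), sum(xs), sum(x * x for x in xs))

def reduce_stddev(partials):
    # The single reducer merges the partials and applies
    # sigma^2 = E[x^2] - (E[x])^2. As the previous slide warns, this
    # identity can lose precision when the mean is large relative to
    # the spread (catastrophic cancellation).
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    mean = s / n
    return math.sqrt(sq / n - mean * mean)

# Two "splits" of a tiny dataset [2, 4, 4, 6]; population stddev is sqrt(2).
partials = [map_partition([2.0, 4.0]), map_partition([4.0, 6.0])]
sigma = reduce_stddev(partials)
```

One pass over the data instead of two, at the cost of the numerical caveat.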
  26. NGrams
  27. Bigrams
      • Input: a large text corpus
      • Output: List(word1, Top_K(word2))
      • Two stages:
        • Generate all possible bigrams
        • Find the most frequent K bigrams for each word
  28. Bigrams: Stage 1 Map
      • Generate all possible bigrams
      • Map input: large text corpus
      • Map computation
        • For each "word1 word2" pair in each sentence
        • Output (word1, word2), (word2, word1)
      • Partition & sort by (word1, word2)
  29. pairs.pl
      while (<STDIN>) {
        chomp;
        $_ =~ s/[^a-zA-Z]+/ /g;
        $_ =~ s/^\s+//g;
        $_ =~ s/\s+$//g;
        $_ =~ tr/A-Z/a-z/;
        my @words = split(/\s+/, $_);
        for (my $i = 0; $i < $#words; ++$i) {
          print "$words[$i]:$words[$i+1]\n";
          print "$words[$i+1]:$words[$i]\n";
        }
      }
  30. Bigrams: Stage 1 Reduce
      • Input: List(word1, word2), sorted and partitioned
      • Output: List(word1, [freq, word2])
      • Counting is similar to the Unigrams example
  31. count.pl
      $_ = <STDIN>;
      chomp;
      my ($pw1, $pw2) = split(/:/, $_);
      $count = 1;
      while (<STDIN>) {
        chomp;
        my ($w1, $w2) = split(/:/, $_);
        if ($w1 eq $pw1 && $w2 eq $pw2) {
          $count++;
        } else {
          print "$pw1:$count:$pw2\n";
          $pw1 = $w1;
          $pw2 = $w2;
          $count = 1;
        }
      }
      print "$pw1:$count:$pw2\n";
  32. Bigrams: Stage 2 Map
      • Input: List(word1, [freq, word2])
      • Output: List(word1, [freq, word2])
      • Identity mapper (/bin/cat)
      • Partition by word1
      • Sort descending by (word1, freq)
  33. Bigrams: Stage 2 Reduce
      • Input: List(word1, [freq, word2])
        • partitioned by word1
        • sorted descending by (word1, freq)
      • Output: Top_K(List(word1, [freq, word2]))
      • For each word, throw away everything after K records
  34. firstN.pl
      $N = 5;
      $_ = <STDIN>;
      chomp;
      my ($pw1, $count, $pw2) = split(/:/, $_);
      $idx = 1;
      $out = "$pw1\t$pw2,$count;";
      while (<STDIN>) {
        chomp;
        my ($w1, $c, $w2) = split(/:/, $_);
        if ($w1 eq $pw1) {
          if ($idx < $N) {
            $out .= "$w2,$c;";
            $idx++;
          }
        } else {
          print "$out\n";
          $pw1 = $w1;
          $idx = 1;
          $out = "$pw1\t$w2,$c;";
        }
      }
      print "$out\n";
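The whole two-stage bigram pipeline (pairs.pl, count.pl, firstN.pl) can be condensed into one in-memory Python sketch. This is illustrative only: Counter stands in for the sorted shuffle, and the function name is invented.

```python
from collections import Counter, defaultdict
import re

def top_k_bigrams(corpus, k=5):
    # Stage 1: normalize each line (letters only, lowercase) and count
    # bigrams in both orders, as pairs.pl does.
    counts = Counter()
    for line in corpus:
        words = re.sub(r"[^a-zA-Z]+", " ", line).lower().split()
        for w1, w2 in zip(words, words[1:]):
            counts[(w1, w2)] += 1
            counts[(w2, w1)] += 1
    # Stage 2: iterate in descending-frequency order (the "sort
    # descending by (word1, freq)" step) and keep the first K
    # neighbours per word, as firstN.pl does.
    top = defaultdict(list)
    for (w1, w2), c in counts.most_common():
        if len(top[w1]) < k:
            top[w1].append((w2, c))
    return dict(top)

result = top_k_bigrams(["the cat sat", "the cat ran"], k=1)
# result["cat"] == [("the", 2)]
```

The Hadoop version exists because counts for a real corpus do not fit in one machine's memory; the structure of the computation is the same.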
  35. Partitioner
      • By default, distributes keys evenly
        • hashcode(key) % NumReducers
      • Overriding the partitioner
        • Skew in map outputs
        • Restrictions on reduce outputs
        • e.g., all URLs in a domain together
  36. Partitioner
      // JobConf.setPartitionerClass(className)
      public interface Partitioner<K, V> extends JobConfigurable {
        int getPartition(K key, V value, int maxPartitions);
      }
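The "all URLs in a domain together" use case from the previous slide amounts to hashing only part of the key. A Python sketch of that getPartition logic (illustrative; Python's built-in hash() is stable only within one process, so a real partitioner would need a hash that is stable across processes):

```python
from urllib.parse import urlparse

def domain_partition(url, num_reducers):
    # Hash the domain rather than the full URL, so every page from the
    # same domain is routed to the same reducer.
    domain = urlparse(url).netloc
    return hash(domain) % num_reducers

# Two different URLs, same domain: both land on the same reducer.
a = domain_partition("http://example.com/page1", 10)
b = domain_partition("http://example.com/page2", 10)
```

The default hash-of-whole-key partitioner balances load; a custom one like this trades some balance for the grouping the reduce logic needs.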
  37. Fully Sorted Output
      • By contract, the reducer gets input sorted on key
      • Typically, reducer output order is the same as input order
        • Each output file (part file) is sorted
      • How to make sure that keys in part i are all less than keys in part i+1?
  38. Fully Sorted Output
      • Use a single reducer for small output
      • Insight: reducer input must be fully sorted
      • The partitioner should provide fully sorted reduce input
      • Sampling + histogram equalization
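The "sampling + histogram equalization" idea above can be sketched as a range partitioner: sample the keys, pick evenly spaced cut points from the sorted sample, and assign each key to the partition whose range contains it. An illustrative Python sketch (function names invented):

```python
import bisect
import random

def make_range_partitioner(sample, num_reducers):
    # Choose num_reducers - 1 cut points at evenly spaced quantiles of
    # the sorted sample, so each reducer receives roughly the same
    # number of keys. All keys in part i are then <= all keys in part
    # i+1, so concatenating the sorted part files gives a fully sorted
    # dataset.
    srt = sorted(sample)
    cuts = [srt[len(srt) * i // num_reducers] for i in range(1, num_reducers)]
    return lambda key: bisect.bisect_right(cuts, key)

random.seed(0)
sample = [random.randint(0, 1000) for _ in range(200)]
part = make_range_partitioner(sample, 4)
# part(key) is non-decreasing in key: keys below every cut go to
# partition 0, keys above every cut to partition 3.
```

Hadoop's TotalOrderPartitioner works on the same principle, with the cut points computed from an input sample.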
  39. Number of Maps
      • Number of input splits
        • Number of HDFS blocks
      • mapred.map.tasks
      • Minimum split size (mapred.min.split.size)
      • split_size = max(min(hdfs_block_size, data_size / #maps), min_split_size)
  40. Parameter Sweeps
      • An external program processes data based on command-line parameters
      • ./prog -params="0.1,0.3" < in.dat > out.dat
      • Objective: run an instance of ./prog for each parameter combination
      • Number of mappers = number of different parameter combinations
  41. Parameter Sweeps
      • Input file: params.txt
        • Each line contains one combination of parameters
      • Input format is NLineInputFormat (N=1)
      • Number of maps = number of splits = number of lines in params.txt
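A sketch of the mapper for this pattern: each map task receives one line of params.txt (thanks to NLineInputFormat with N=1) and launches one instance of the external program. Illustrative Python; "./prog" is the hypothetical binary from the previous slide, and run_sweep is an invented name.

```python
import subprocess
import sys

def run_sweep(lines, prog="./prog"):
    # For each parameter line, run one instance of the external
    # program and capture its stdout; yield (params, output) pairs.
    for params in lines:
        params = params.strip()
        if not params:
            continue
        result = subprocess.run(
            [prog, f"--params={params}"],
            capture_output=True, text=True,
        )
        yield params, result.stdout

if __name__ == "__main__":
    # As a streaming mapper, the single input line arrives on stdin.
    for params, out in run_sweep(sys.stdin):
        print(f"{params}\t{out.strip()}")
```

MapReduce here is used purely as a job scheduler: there is no reduce phase, just one map task per parameter combination.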
  42. Auxiliary Files
      • -file auxFile.dat
      • The job submitter adds the file to job.jar
      • Unjarred on the TaskTracker
      • Available to the task as $cwd/auxFile.dat
      • Not suitable for large / frequently used files
  43. Auxiliary Files
      • Tasks need to access "side" files
        • Read-only dictionaries (such as for porn filtering)
        • Dynamically linked libraries
      • Tasks themselves can fetch files from HDFS
        • Not always! (Hint: unresolved symbols)
  44. Distributed Cache
      • Specify "side" files via -cacheFile
      • If many such files are needed:
        • Create a tar.gz archive
        • Upload it to HDFS
        • Specify it via -cacheArchive
  45. Distributed Cache
      • The TaskTracker downloads these files once
      • Untars archives
      • Accessible in the task's $cwd before the task starts
      • Cached across multiple tasks
      • Cleaned up upon exit
  46. Joining Multiple Datasets
      • Datasets are streams of key-value pairs
      • Could be split across multiple files in a single directory
      • The join could be on the key, or on any field in the value
      • The join could be inner, outer, left outer, cross product, etc.
      • Join is a natural Reduce operation
  47. Example
      • A = (id, name), B = (id, address)
      • A is in /path/to/A/part-*
      • B is in /path/to/B/part-*
      • Select A.name, B.address where A.id == B.id
  48. Map in Join
      • Input: (Key1, Value1) from A or B
        • map.input.file indicates A or B
        • MAP_INPUT_FILE in Streaming
      • Output: (Key2, [Value2, A|B])
        • Key2 is the join key
  49. Reduce in Join
      • Input: groups of [Value2, A|B] for each Key2
      • The operation depends on the kind of join
        • An inner join checks whether the key has values from both A & B
      • Output: (Key2, JoinFunction(Value2, ...))
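The reduce-side join described on the last two slides, in a minimal in-memory Python sketch (illustrative; function names invented, and an in-place sort stands in for the shuffle):

```python
from itertools import groupby
from operator import itemgetter

def map_side(records, tag):
    # Map phase: tag each record with its source relation ("A" or
    # "B"), keyed by the join key; the tag plays the role of the A|B
    # marker derived from map.input.file.
    return [(key, (tag, value)) for key, value in records]

def reduce_join(tagged):
    # Shuffle groups by join key; the reducer then emits the cross
    # product of A-values and B-values per key: an inner join. Keys
    # present in only one relation produce an empty cross product.
    tagged.sort(key=itemgetter(0))
    out = []
    for key, grp in groupby(tagged, key=itemgetter(0)):
        rows = [tv for _, tv in grp]
        a_vals = [v for t, v in rows if t == "A"]
        b_vals = [v for t, v in rows if t == "B"]
        out.extend((key, a, b) for a in a_vals for b in b_vals)
    return out

# The (id, name) / (id, address) example from slide 47:
A = [(1, "alice"), (2, "bob")]
B = [(1, "12 Elm St"), (3, "9 Oak Ave")]
joined = reduce_join(map_side(A, "A") + map_side(B, "B"))
# joined == [(1, "alice", "12 Elm St")]
```

Swapping the per-key logic (e.g. emitting A-rows with no B match) yields the outer-join variants.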
  50. MR Join Performance
      • Map input = total of A & B
      • Map output = total of A & B
      • Shuffle & sort
      • Reduce input = total of A & B
      • Reduce output = size of the joined dataset
      • Filter and project in Map
  51. Join Special Cases
      • Fragment-replicate
        • e.g., a 100 GB dataset with a 100 MB dataset
      • Equipartitioned datasets
        • Identically keyed
        • Equal number of partitions
        • Each partition locally sorted
  52. Fragment-Replicate
      • Fragment the larger dataset
        • Specify it as Map input
      • Replicate the smaller dataset
        • Use the Distributed Cache
      • Map-only computation
        • No shuffle / sort
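A sketch of the fragment-replicate (map-side) join above: the small dataset is loaded into memory in every map task (in Hadoop it would arrive via the Distributed Cache), and the large dataset streams through. Illustrative Python; names and data are invented.

```python
def fragment_replicate_join(large_stream, small_dataset):
    # Build an in-memory lookup table from the small (replicated)
    # side, then stream the large (fragmented) side through it.
    # Map-only: no shuffle or sort is needed.
    lookup = dict(small_dataset)
    for key, value in large_stream:
        if key in lookup:
            yield key, value, lookup[key]

users = [(1, "alice"), (2, "bob"), (3, "carol")]   # large side (fragment)
cities = [(1, "Pune"), (3, "Urbana")]              # small side (replicate)
result = list(fragment_replicate_join(users, cities))
# result == [(1, "alice", "Pune"), (3, "carol", "Urbana")]
```

This wins over the reduce-side join exactly when one side fits comfortably in a task's memory, as in the 100 GB / 100 MB case on the previous slide.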
  53. Equipartitioned Join
      • Available since Hadoop 0.16
      • Datasets are joined "before" input to the mappers
      • Input format: CompositeInputFormat
      • mapred.join.expr
      • Simpler to use in Java, but can be used in Streaming
  54. Example
      mapred.join.expr = inner(
        tbl(....SequenceFileInputFormat.class, "hdfs://namenode:8020/path/to/data/A"),
        tbl(....SequenceFileInputFormat.class, "hdfs://namenode:8020/path/to/data/B")
      )
  55. Get Social: @EMCAcademics
  56. Next Session: Technology Lecture Series, Classic D Center, on 16 Aug 2012
  57. Questions?
