2. Presentation plan
• choosing replication parameters for a Hadoop node
• Pig or Hive for ETL?
• building your own cluster or going to the cloud?
3.
4. The real meeting plan
• What is "Big Data"?
• Robots writing MapReduce jobs
• Invited guests: Harimata, GE Healthcare
• Gnomes and Data Science
5. Big Data means "a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications." (Wikipedia)
http://www.winshuttle.com/big-data-timeline/
12. [Diagram labeled "Data", "Computer", "Program": several computers and programs alongside a growing number of data blocks]
13.
14. [Diagram: five computers, each holding several data blocks and running a program, together with JobTracker, NameNode, …]
17. [Diagram (HDFS): five computers, each holding several data blocks and running a program, together with ResourceManager, NameNode, …]
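The block layout pictured on this slide can be approximated with a toy model. The sketch below is only an illustration: the class and method names are hypothetical, and the round-robin placement is a deliberate simplification standing in for HDFS's real rack-aware policy. The 128 MB block size and replication factor of 3 are the usual defaults in Hadoop 2 and later (`dfs.blocksize`, `dfs.replication`); a given cluster may use other values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not part of Hadoop.
public class BlockPlacementSketch {

    // Number of fixed-size blocks needed to store a file (rounded up).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Simplified round-robin placement: replica r of block b goes to
    // node (b + r) % nodeCount. Real HDFS placement is rack-aware.
    static List<List<Integer>> placeReplicas(int blocks, int replication, int nodeCount) {
        if (replication > nodeCount) {
            throw new IllegalArgumentException("replication must not exceed node count");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((b + r) % nodeCount);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example: a 1000 MB file, 128 MB blocks, 3 replicas, 5 nodes.
        int blocks = (int) blockCount(1000 * mb, 128 * mb);
        List<List<Integer>> placement = placeReplicas(blocks, 3, 5);
        for (int b = 0; b < placement.size(); b++) {
            System.out.println("block " + b + " -> nodes " + placement.get(b));
        }
    }
}
```

With these assumed numbers the file occupies 8 blocks, and each block has copies on three different nodes, so losing any single computer leaves every block readable.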
18. Map Shuffle Reduce
[Diagram (MapReduce): two computers, each with data blocks and a program, produce Map phase results; two further computers produce the final results]
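The three phases named on this slide can be imitated in a single process. What follows is a minimal in-memory sketch in plain Java, without any Hadoop classes; the class and method names are hypothetical, and it ignores everything that makes the real framework interesting (distribution, fault tolerance, sorting on disk).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of Map, Shuffle and Reduce.
public class MapShuffleReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values emitted under the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapped).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, science=1}
        System.out.println(wordCount(Arrays.asList("big data big", "data science")));
    }
}
```

In a real cluster each `map` call would run on the computer that already holds the input block, and the shuffle would move the intermediate pairs over the network to the computers running `reduce`.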
19. …
public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
…
20. input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- split every line into words, one word per record
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- keep only tokens made of word characters
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
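-- the same word count in HiveQL
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
-- the column is named line, so split(line, ...) rather than split(text, ...)
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
ORDER BY word;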