SlideShare uma empresa Scribd logo
1 de 49
1
Hadoop Puzzlers
Aaron Myers & Daniel Templeton
Cloudera, Inc.
2
Your Hosts
Aaron “ATM” Myers
• AKA “Cash Money”
• Software Engineer
• Apache Hadoop
Committer
Daniel Templeton
• Certification Developer
• Crusty, old HPC guy
• Likes Perl
©2014 Cloudera, Inc. All rights reserved.2
3
What is a Hadoop Puzzler
©2014 Cloudera, Inc. All rights reserved.3
• Shameless knockoff of Josh Bloch’s Java Puzzlers talks
• We’ll walk through a puzzle
• You vote on the answer
• We all learn a valuable lesson
4 ©2014 Cloudera, Inc. All rights reserved.4
Let’s try it, OK?
5
An Easy One
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class MaxReduce
extends Reducer<Text,IntWritable,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
IntWritable max =
new IntWritable(0);
for (IntWritable v: values)
if (v.get() > max.get())
max = v;
c.write(key, max);
} }
©2014 Cloudera, Inc. All rights reserved.5
6
An Easy One
The data:
A,1
A,5
A,3
The results:
a) A 5
b) A 1
c) A 3
d) The job fails
©2014 Cloudera, Inc. All rights reserved.6
7
An Easy One
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class MaxReduce
extends Reducer<Text,IntWritable,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
IntWritable max =
new IntWritable(0);
for (IntWritable v: values)
if (v.get() > max.get())
max = v;
c.write(key, max);
} }
©2014 Cloudera, Inc. All rights reserved.7
A 1
A 5
A 3
8
An Easy One
The data:
A,1
A,5
A,3
The results:
a) A 5
b) A 1
c) A 3
d) The job fails
©2014 Cloudera, Inc. All rights reserved.8
9
An Easy One (Answer)
The data:
A,1
A,5
A,3
The results:
a) A 5
b) A 1
c) A 3
d) The job fails
©2014 Cloudera, Inc. All rights reserved.9
10
An Easy One (Problem)
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class MaxReduce
extends Reducer<Text,IntWritable,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
IntWritable max =
new IntWritable(0);
for (IntWritable v: values)
if (v.get() > max.get())
max = v;
c.write(key, max);
} }
©2014 Cloudera, Inc. All rights reserved.10
11
An Easy One (Moral)
©2014 Cloudera, Inc. All rights reserved.11
• MapReduce reuses Writables whenever it can
• That includes while iterating through the values
• Always be careful to only store the value instead of
the Writable!
12
A Sinking Feeling
public class AsyncSubmit
extends Configured
implements Tool {
public static void main(String[] args)
throws Exception {
int ret = ToolRunner.run(
new Configuration(),
new AsyncSubmit(), args);
System.exit(ret);
}
public int run(String[] args)
throws Exception {
Job job = Job.getInstance(getConf());
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job,
new Path(args[0]));
FileOutputFormat.setOutputPath(job,
new Path(args[1]));
job.waitForCompletion(false);
return job.isComplete() ? 1 : 0;
} }
©2014 Cloudera, Inc. All rights reserved.12
13
A Sinking Feeling
The data:
The complete works of
William Shakespeare
The results:
a) Fails to compile
b) The job fails
c) Exits with 0
d) Exits with 1
©2014 Cloudera, Inc. All rights reserved.13
14
A Sinking Feeling
public class AsyncSubmit
extends Configured
implements Tool {
public static void main(String[] args)
throws Exception {
int ret = ToolRunner.run(
new Configuration(),
new AsyncSubmit(), args);
System.exit(ret);
}
public int run(String[] args)
throws Exception {
Job job = Job.getInstance(getConf());
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job,
new Path(args[0]));
FileOutputFormat.setOutputPath(job,
new Path(args[1]));
job.waitForCompletion(false);
return job.isComplete() ? 1 : 0;
} }
©2014 Cloudera, Inc. All rights reserved.14
The complete works of
William Shakespeare
15
A Sinking Feeling
The data:
The complete works of
William Shakespeare
The results:
a) Fails to compile
b) The job fails
c) Exits with 0
d) Exits with 1
©2014 Cloudera, Inc. All rights reserved.15
16
A Sinking Feeling (Answer)
The data:
The complete works of
William Shakespeare
The results:
a) Fails to compile
b) The job fails
c) Exits with 0
d) Exits with 1
©2014 Cloudera, Inc. All rights reserved.16
17
A Sinking Feeling (Problem)
public class AsyncSubmit
extends Configured
implements Tool {
public static void main(String[] args)
throws Exception {
int ret = ToolRunner.run(
new Configuration(),
new AsyncSubmit(), args);
System.exit(ret);
}
public int run(String[] args)
throws Exception {
Job job = Job.getInstance(getConf());
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job,
new Path(args[0]));
FileOutputFormat.setOutputPath(job,
new Path(args[1]));
job.waitForCompletion(false);
return job.isComplete() ? 1 : 0;
} }
©2014 Cloudera, Inc. All rights reserved.17
18
A Sinking Job (Moral)
©2014 Cloudera, Inc. All rights reserved.18
• Read the API docs!
• Sometimes the obvious meanings of methods and
parameters aren’t correct
• Parameter for waitForCompletion() controls whether
status output is printed
• Driver does wait for job to exit but does not print all the job
status information
19
Do-over
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class MaxReduceRedux
extends Reducer<Text,Text,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
int max = 0;
for (IntWritable v: values)
if (v.get() > max)
max = v.get();
c.write(key, new IntWritable(max));
} }
©2014 Cloudera, Inc. All rights reserved.19
20
Do-over
The data:
A,1
A,5
The results:
a) A 5
b) A 1
c) A 1
A 5
d) The job fails
©2014 Cloudera, Inc. All rights reserved.20
21
Do-over
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class MaxReduceRedux
extends Reducer<Text,Text,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
int max = 0;
for (IntWritable v: values)
if (v.get() > max)
max = v.get();
c.write(key, new IntWritable(max));
} }
©2014 Cloudera, Inc. All rights reserved.21
A 1
A 5
22
Do-over
The data:
A,1
A,5
The results:
a) A 5
b) A 1
c) A 1
A 5
d) The job fails
©2014 Cloudera, Inc. All rights reserved.22
23
Do-over (Answer)
The data:
A,1
A,5
The results:
a) A 5
b) A 1
c) A 1
A 5
d) The job fails
©2014 Cloudera, Inc. All rights reserved.23
24
Do-over (Problem)
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class MaxReduceRedux
extends Reducer<Text,Text,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
int max = 0;
for (IntWritable v: values)
if (v.get() > max)
max = v.get();
c.write(key, new IntWritable(max));
} }
©2014 Cloudera, Inc. All rights reserved.24
25
Do-over (Moral)
©2014 Cloudera, Inc. All rights reserved.25
• Mismatched signatures can lead to unexpected
behaviors because of exposed base class method
implementations
• ALWAYS use @Override!
26
Joining Forces
hive> DESCRIBE table1;
OK
id int
phone string
state string
Time taken: 0.236 seconds
hive> DESCRIBE table2;
OK
id int
city string
state string
Time taken: 0.116 seconds
hive> CREATE TABLE table3 AS SELECT
table2.*,table1.phone,table1.state
AS s FROM table1 JOIN table2 ON
(table1.id == table2.id);
…
hive> EXPORT TABLE table3 TO
'/user/cloudera/table3.csv';
…
hive> exit
$ hadoop fs –cat table3.csv |
head -1 | tr , 'n' | wc –l
©2014 Cloudera, Inc. All rights reserved.26
27
Joining Forces
The data:
hive> SELECT * FROM table1;
OK
1 6506506500 CA
2 2282282280 MS
Time taken: 1.006 seconds
hive> SELECT * FROM table2;
OK
1 Palo Alto CA
2 Gautier MS
Time taken: 1.202 seconds
The results:
a) 5
b) 4
c) 1
d) The join fails
©2014 Cloudera, Inc. All rights reserved.27
28
Joining Forces
hive> DESCRIBE table1;
OK
id int
phone string
state string
Time taken: 0.236 seconds
hive> DESCRIBE table2;
OK
id int
city string
state string
Time taken: 0.116 seconds
hive> CREATE TABLE table3 AS SELECT
table2.*,table1.phone,table1.state
AS s FROM table1 JOIN table2 ON
(table1.id == table2.id);
…
hive> EXPORT TABLE table3 TO
'/user/cloudera/table3.csv';
…
hive> exit
$ hadoop fs –cat table3.csv |
head -1 | tr , 'n' | wc –l
©2014 Cloudera, Inc. All rights reserved.28
1 6506506500 CA
2 2282282280 MS
1 Palo Alto CA
2 Gautier MS
29
Joining Forces
The data:
hive> SELECT * FROM table1;
OK
1 6506506500 CA
2 2282282280 MS
Time taken: 1.006 seconds
hive> SELECT * FROM table2;
OK
1 Palo Alto CA
2 Gautier MS
Time taken: 1.202 seconds
The results:
a) 5
b) 4
c) 1
d) The join fails
©2014 Cloudera, Inc. All rights reserved.29
30
Joining Forces (Answer)
The data:
hive> SELECT * FROM table1;
OK
1 6506506500 CA
2 2282282280 MS
Time taken: 1.006 seconds
hive> SELECT * FROM table2;
OK
1 Palo Alto CA
2 Gautier MS
Time taken: 1.202 seconds
The results:
a) 5
b) 4
c) 1
d) The join fails
©2014 Cloudera, Inc. All rights reserved.30
31
Joining Forces (Problem)
hive> DESCRIBE table1;
OK
id int
phone string
state string
Time taken: 0.236 seconds
hive> DESCRIBE table2;
OK
id int
city string
state string
Time taken: 0.116 seconds
hive> CREATE TABLE table3 AS SELECT
table2.*,table1.phone,table1.state
AS s FROM table1 JOIN table2 ON
(table1.id == table2.id);
…
hive> EXPORT TABLE table3 TO
'/user/cloudera/table3.csv';
…
hive> exit
$ hadoop fs –cat table3.csv |
head -1 | tr , 'n' | wc –l
©2014 Cloudera, Inc. All rights reserved.31
32
Joining Forces (Moral)
©2014 Cloudera, Inc. All rights reserved.32
• Hive’s default delimiter is 0x01 (CTRL-A)
• Easy to assume export will use a sane delimiter – it
doesn’t
• Incidentally, Hive’s join rules are pretty sane and work
as you’d expect
33
Close Enough
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class Top20Reduce
extends Reducer<Text,IntWritable,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
float max = 0.0f;
for (IntWritable v: values)
if (v.get() > max) max = v.get();
max *= 0.8f;
for (IntWritable v: values)
if (v.get() >= max)
c.write(key, v);
} }
©2014 Cloudera, Inc. All rights reserved.33
34
Close Enough
The data:
A,1
A,5
A,4
The results:
a)
b) A 5
c) A 5
A 4
d) The job fails
©2014 Cloudera, Inc. All rights reserved.34
35
Close Enough
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class Top20Reduce
extends Reducer<Text,IntWritable,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
float max = 0.0f;
for (IntWritable v: values)
if (v.get() > max) max = v.get();
max *= 0.8f;
for (IntWritable v: values)
if (v.get() >= max)
c.write(key, v);
} }
©2014 Cloudera, Inc. All rights reserved.35
A 1
A 5
A 4
36
Close Enough
The data:
A,1
A,5
A,4
The results:
a)
b) A 5
c) A 5
A 4
d) The job fails
©2014 Cloudera, Inc. All rights reserved.36
37
Close Enough (Answer)
The data:
A,1
A,5
A,4
The results:
a)
b) A 5
c) A 5
A 4
d) The job fails
©2014 Cloudera, Inc. All rights reserved.37
38
Close Enough (Problem)
public class MaxMap
extends Mapper<LongWritable,
Text,Text,IntWritable> {
Text k = new Text();
IntWritable v = new IntWritable();
protected void map(LongWritable key,
Text val, Context c) … {
String[] parts =
val.toString().split(",");
k.set(parts[0]);
v.set(Integer.parseInt(parts[1]));
c.write(k, v);
} }
public class Top20Reduce
extends Reducer<Text,IntWritable,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<IntWritable> values,
Context c) … {
float max = 0.0f;
for (IntWritable v: values)
if (v.get() > max) max = v.get();
max *= 0.8f;
for (IntWritable v: values)
if (v.get() >= max)
c.write(key, v);
} }
©2014 Cloudera, Inc. All rights reserved.38
39
Close Enough (Moral)
©2014 Cloudera, Inc. All rights reserved.39
• For scalability reasons, the values iterable is
single-shot
• Subsequent iterators iterate over an empty collection
• Store values (not Writables!) in the first pass
• Better yet, restructure the logic to avoid storing all
values in memory
40
Overbyte
public class MinLineMap
extends Mapper<LongWritable,
Text,Text,Text> {
Text k = new Text();
protected void map(LongWritable key,
Text value, Context c) … {
String val = value.toString();
k.set(val.substring(0, 1));
c.write(k, value);
} }
public class MinLineReduce
extends Reducer<Text,Text,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<Text> values,
Context c) … {
int min = Integer.MAX_VALUE;
for (Text v: values)
if (v.getBytes().length < min)
min = v.getBytes().length;
c.write(key, new IntWritable(min));
} }
©2014 Cloudera, Inc. All rights reserved.40
41
Overbyte
The data:
Hadoop
Spark
Hive
Sqoop2
The results:
a) H 4
S 5
b) H 6
S 5
c) H 6
S 6
d) The job fails
©2014 Cloudera, Inc. All rights reserved.41
42
Overbyte
public class MinLineMap
extends Mapper<LongWritable,
Text,Text,Text> {
Text k = new Text();
protected void map(LongWritable key,
Text value, Context c) … {
String val = value.toString();
k.set(val.substring(0, 1));
c.write(k, value);
} }
public class MinLineReduce
extends Reducer<Text,Text,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<Text> values,
Context c) … {
int min = Integer.MAX_VALUE;
for (Text v: values)
if (v.getBytes().length < min)
min = v.getBytes().length;
c.write(key, new IntWritable(min));
} }
©2014 Cloudera, Inc. All rights reserved.42
Hadoop
Spark
Hive
Sqoop2
43
Overbyte
The data:
Hadoop
Spark
Hive
Sqoop2
The results:
a) H 4
S 5
b) H 6
S 5
c) H 6
S 6
d) The job fails
©2014 Cloudera, Inc. All rights reserved.43
44
Overbyte (Answer)
The data:
Hadoop
Spark
Hive
Sqoop2
The results:
a) H 4
S 5
b) H 6
S 5
c) H 6
S 6
d) The job fails
©2014 Cloudera, Inc. All rights reserved.44
45
Overbyte (Problem)
public class MinLineMap
extends Mapper<LongWritable,
Text,Text,Text> {
Text k = new Text();
protected void map(LongWritable key,
Text value, Context c) … {
String val = value.toString();
k.set(val.substring(0, 1));
c.write(k, value);
} }
public class MinLineReduce
extends Reducer<Text,Text,
Text,IntWritable> {
protected void reduce(Text key,
Iterable<Text> values,
Context c) … {
int min = Integer.MAX_VALUE;
for (Text v: values)
if (v.getBytes().length < min)
min = v.getBytes().length;
c.write(key, new IntWritable(min));
} }
©2014 Cloudera, Inc. All rights reserved.45
46
Overbyte (Moral)
©2014 Cloudera, Inc. All rights reserved.46
• Writables get reused in loops
• In addition, Text.getBytes() reuses byte array
allocated by previous calls
• Net result is wrongness
• Text.getLength() is the correct way to get the length
of a Text.
47
What We Learned
©2014 Cloudera, Inc. All rights reserved.47
• Beware of reuse of Writables
• Always use @Override so your compiler can help you
• Don’t assume you know what a method does because
of the name or parameters – read the docs!
• Sometimes scalability is inconvenient
48
One Closing Note
©2014 Cloudera, Inc. All rights reserved.48
• Hadoop is still not easy
• Being good takes effort and experience
• Recognizing Hadoop talent can be hard
• Cloudera’s is working to make Hadoop talent easier to
recognize through certification
http://cloudera.com/content/cloudera/en/training/cert
ification.html
49 ©2014 Cloudera, Inc. All rights reserved.
Aaron Myers &
Daniel Templeton

Mais conteúdo relacionado

Mais procurados

Vielseitiges In-Memory Computing mit Apache Ignite und Kubernetes
Vielseitiges In-Memory Computing mit Apache Ignite und KubernetesVielseitiges In-Memory Computing mit Apache Ignite und Kubernetes
Vielseitiges In-Memory Computing mit Apache Ignite und KubernetesQAware GmbH
 
Getting Started with Datatsax .Net Driver
Getting Started with Datatsax .Net DriverGetting Started with Datatsax .Net Driver
Getting Started with Datatsax .Net DriverDataStax Academy
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Michaël Figuière
 
Kotlin coroutines and spring framework
Kotlin coroutines and spring frameworkKotlin coroutines and spring framework
Kotlin coroutines and spring frameworkSunghyouk Bae
 
Showdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp KrennShowdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp KrennJavaDayUA
 
.NET Multithreading and File I/O
.NET Multithreading and File I/O.NET Multithreading and File I/O
.NET Multithreading and File I/OJussi Pohjolainen
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersMichaël Figuière
 
Psycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptPsycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptSurvey Department
 
Programming with Python and PostgreSQL
Programming with Python and PostgreSQLProgramming with Python and PostgreSQL
Programming with Python and PostgreSQLPeter Eisentraut
 
Hazelcast and MongoDB at Cloud CMS
Hazelcast and MongoDB at Cloud CMSHazelcast and MongoDB at Cloud CMS
Hazelcast and MongoDB at Cloud CMSuzquiano
 
Concurrency Concepts in Java
Concurrency Concepts in JavaConcurrency Concepts in Java
Concurrency Concepts in JavaDoug Hawkins
 
Vavr Java User Group Rheinland
Vavr Java User Group RheinlandVavr Java User Group Rheinland
Vavr Java User Group RheinlandDavid Schmitz
 
Rx 101 Codemotion Milan 2015 - Tamir Dresher
Rx 101   Codemotion Milan 2015 - Tamir DresherRx 101   Codemotion Milan 2015 - Tamir Dresher
Rx 101 Codemotion Milan 2015 - Tamir DresherTamir Dresher
 
Building responsive application with Rx - confoo - tamir dresher
Building responsive application with Rx - confoo - tamir dresherBuilding responsive application with Rx - confoo - tamir dresher
Building responsive application with Rx - confoo - tamir dresherTamir Dresher
 
NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010Ben Scofield
 

Mais procurados (20)

Vielseitiges In-Memory Computing mit Apache Ignite und Kubernetes
Vielseitiges In-Memory Computing mit Apache Ignite und KubernetesVielseitiges In-Memory Computing mit Apache Ignite und Kubernetes
Vielseitiges In-Memory Computing mit Apache Ignite und Kubernetes
 
Getting Started with Datatsax .Net Driver
Getting Started with Datatsax .Net DriverGetting Started with Datatsax .Net Driver
Getting Started with Datatsax .Net Driver
 
ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!
 
Kotlin coroutines and spring framework
Kotlin coroutines and spring frameworkKotlin coroutines and spring framework
Kotlin coroutines and spring framework
 
Apex code benchmarking
Apex code benchmarkingApex code benchmarking
Apex code benchmarking
 
Showdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp KrennShowdown of the Asserts by Philipp Krenn
Showdown of the Asserts by Philipp Krenn
 
.NET Multithreading and File I/O
.NET Multithreading and File I/O.NET Multithreading and File I/O
.NET Multithreading and File I/O
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for Developers
 
Clojure: a LISP for the JVM
Clojure: a LISP for the JVMClojure: a LISP for the JVM
Clojure: a LISP for the JVM
 
Psycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python ScriptPsycopg2 - Connect to PostgreSQL using Python Script
Psycopg2 - Connect to PostgreSQL using Python Script
 
Programming with Python and PostgreSQL
Programming with Python and PostgreSQLProgramming with Python and PostgreSQL
Programming with Python and PostgreSQL
 
Hazelcast and MongoDB at Cloud CMS
Hazelcast and MongoDB at Cloud CMSHazelcast and MongoDB at Cloud CMS
Hazelcast and MongoDB at Cloud CMS
 
Concurrency Concepts in Java
Concurrency Concepts in JavaConcurrency Concepts in Java
Concurrency Concepts in Java
 
Vavr Java User Group Rheinland
Vavr Java User Group RheinlandVavr Java User Group Rheinland
Vavr Java User Group Rheinland
 
Rx 101 Codemotion Milan 2015 - Tamir Dresher
Rx 101   Codemotion Milan 2015 - Tamir DresherRx 101   Codemotion Milan 2015 - Tamir Dresher
Rx 101 Codemotion Milan 2015 - Tamir Dresher
 
Building responsive application with Rx - confoo - tamir dresher
Building responsive application with Rx - confoo - tamir dresherBuilding responsive application with Rx - confoo - tamir dresher
Building responsive application with Rx - confoo - tamir dresher
 
Hadoop
HadoopHadoop
Hadoop
 
JVM Mechanics
JVM MechanicsJVM Mechanics
JVM Mechanics
 
NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010NoSQL @ CodeMash 2010
NoSQL @ CodeMash 2010
 

Destaque

Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With ChaosDataWorks Summit
 
Omid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseOmid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseDataWorks Summit
 
Learning Linear Models with Hadoop
Learning Linear Models with HadoopLearning Linear Models with Hadoop
Learning Linear Models with HadoopDataWorks Summit
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopDataWorks Summit
 
Enterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesEnterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesDataWorks Summit
 
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterEnterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterDataWorks Summit
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 
Hadoop-scale Search with Solr
Hadoop-scale Search with SolrHadoop-scale Search with Solr
Hadoop-scale Search with SolrDataWorks Summit
 
Safer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science PlatformSafer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science PlatformDataWorks Summit
 
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...DataWorks Summit
 
Building and Improving Products with Hadoop
Building and Improving Products with Hadoop Building and Improving Products with Hadoop
Building and Improving Products with Hadoop DataWorks Summit
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNDataWorks Summit
 
Hive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingHive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingDataWorks Summit
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Analyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitAnalyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitDataWorks Summit
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)DataWorks Summit
 
Experiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte ScaleExperiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte ScaleDataWorks Summit
 
Big Data Analytics for Dodd-Frank
Big Data Analytics for Dodd-FrankBig Data Analytics for Dodd-Frank
Big Data Analytics for Dodd-FrankDataWorks Summit
 

Destaque (20)

Practical Computing With Chaos
Practical Computing With ChaosPractical Computing With Chaos
Practical Computing With Chaos
 
Omid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseOmid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBase
 
Learning Linear Models with Hadoop
Learning Linear Models with HadoopLearning Linear Models with Hadoop
Learning Linear Models with Hadoop
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for Hadoop
 
Enterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesEnterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive Technologies
 
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterEnterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
Hadoop-scale Search with Solr
Hadoop-scale Search with SolrHadoop-scale Search with Solr
Hadoop-scale Search with Solr
 
Safer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science PlatformSafer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science Platform
 
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
 
Building and Improving Products with Hadoop
Building and Improving Products with Hadoop Building and Improving Products with Hadoop
Building and Improving Products with Hadoop
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARN
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Hive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingHive & HBase For Transaction Processing
Hive & HBase For Transaction Processing
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Analyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitAnalyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and Profit
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
Experiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte ScaleExperiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte Scale
 
Big Data Analytics for Dodd-Frank
Big Data Analytics for Dodd-FrankBig Data Analytics for Dodd-Frank
Big Data Analytics for Dodd-Frank
 

Semelhante a Hadoop Puzzlers

Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and MonoidsHugo Gävert
 
Store and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and CassandraStore and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and CassandraDeependra Ariyadewa
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Yulia Tsisyk
 
State of the .Net Performance
State of the .Net PerformanceState of the .Net Performance
State of the .Net PerformanceCUSTIS
 
Celery - A Distributed Task Queue
Celery - A Distributed Task QueueCelery - A Distributed Task Queue
Celery - A Distributed Task QueueDuy Do
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable CodeBaidu, Inc.
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghStuart Roebuck
 
C# 7.x What's new and what's coming with C# 8
C# 7.x What's new and what's coming with C# 8C# 7.x What's new and what's coming with C# 8
C# 7.x What's new and what's coming with C# 8Christian Nagel
 
Using xUnit as a Swiss-Aarmy Testing Toolkit
Using xUnit as a Swiss-Aarmy Testing ToolkitUsing xUnit as a Swiss-Aarmy Testing Toolkit
Using xUnit as a Swiss-Aarmy Testing ToolkitChris Oldwood
 
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...Muhammad Ulhaque
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusKoichi Fujikawa
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Andrew Yongjoon Kong
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리Byeongweon Moon
 

Semelhante a Hadoop Puzzlers (20)

Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
C# - What's next
C# - What's nextC# - What's next
C# - What's next
 
Store and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and CassandraStore and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and Cassandra
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
 
State of the .Net Performance
State of the .Net PerformanceState of the .Net Performance
State of the .Net Performance
 
Celery - A Distributed Task Queue
Celery - A Distributed Task QueueCelery - A Distributed Task Queue
Celery - A Distributed Task Queue
 
Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup Edinburgh
 
Anti patterns
Anti patternsAnti patterns
Anti patterns
 
TechTalk - Dotnet
TechTalk - DotnetTechTalk - Dotnet
TechTalk - Dotnet
 
C# 7.x What's new and what's coming with C# 8
C# 7.x What's new and what's coming with C# 8C# 7.x What's new and what's coming with C# 8
C# 7.x What's new and what's coming with C# 8
 
Using xUnit as a Swiss-Aarmy Testing Toolkit
Using xUnit as a Swiss-Aarmy Testing ToolkitUsing xUnit as a Swiss-Aarmy Testing Toolkit
Using xUnit as a Swiss-Aarmy Testing Toolkit
 
RxJava on Android
RxJava on AndroidRxJava on Android
RxJava on Android
 
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
 
What is new in Java 8
What is new in Java 8What is new in Java 8
What is new in Java 8
 
JRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop PapyrusJRubyKaigi2010 Hadoop Papyrus
JRubyKaigi2010 Hadoop Papyrus
 
Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...Stream analysis with kafka native way and considerations about monitoring as ...
Stream analysis with kafka native way and considerations about monitoring as ...
 
실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Último (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Hadoop Puzzlers

  • 1. 1 Hadoop Puzzlers Aaron Myers & Daniel Templeton Cloudera, Inc.
  • 2. 2 Your Hosts Aaron “ATM” Myers • AKA “Cash Money” • Software Engineer • Apache Hadoop Committer Daniel Templeton • Certification Developer • Crusty, old HPC guy • Likes Perl ©2014 Cloudera, Inc. All rights reserved.2
  • 3. 3 What is a Hadoop Puzzler ©2014 Cloudera, Inc. All rights reserved.3 • Shameless knockoff of Josh Bloch’s Java Puzzlers talks • We’ll walk through a puzzle • You vote on the answer • We all learn a valuable lesson
  • 4. 4 ©2014 Cloudera, Inc. All rights reserved.4 Let’s try it, OK?
  • 5. 5 An Easy One public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { IntWritable max = new IntWritable(0); for (IntWritable v: values) if (v.get() > max.get()) max = v; c.write(key, max); } } ©2014 Cloudera, Inc. All rights reserved.5
  • 6. 6 An Easy One The data: A,1 A,5 A,3 The results: a) A 5 b) A 1 c) A 3 d) The job fails ©2014 Cloudera, Inc. All rights reserved.6
  • 7. 7 An Easy One public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { IntWritable max = new IntWritable(0); for (IntWritable v: values) if (v.get() > max.get()) max = v; c.write(key, max); } } ©2014 Cloudera, Inc. All rights reserved.7 A 1 A 5 A 3
  • 8. 8 An Easy One The data: A,1 A,5 A,3 The results: a) A 5 b) A 1 c) A 3 d) The job fails ©2014 Cloudera, Inc. All rights reserved.8
  • 9. 9 An Easy One (Answer) The data: A,1 A,5 A,3 The results: a) A 5 b) A 1 c) A 3 d) The job fails ©2014 Cloudera, Inc. All rights reserved.9
  • 10. 10 An Easy One (Problem) public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { IntWritable max = new IntWritable(0); for (IntWritable v: values) if (v.get() > max.get()) max = v; c.write(key, max); } } ©2014 Cloudera, Inc. All rights reserved.10
  • 11. 11 An Easy One (Moral) ©2014 Cloudera, Inc. All rights reserved.11 • MapReduce reuses Writables whenever it can • That includes while iterating through the values • Always be careful to only store the value instead of the Writable!
  • 12. 12 A Sinking Feeling public class AsyncSubmit extends Configured implements Tool { public static void main(String[] args) throws Exception { int ret = ToolRunner.run( new Configuration(), new AsyncSubmit(), args); System.exit(ret); } public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf()); job.setNumReduceTasks(0); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(false); return job.isComplete() ? 1 : 0; } } ©2014 Cloudera, Inc. All rights reserved.12
  • 13. 13 A Sinking Feeling The data: The complete works of William Shakespeare The results: a) Fails to compile b) The job fails c) Exits with 0 d) Exits with 1 ©2014 Cloudera, Inc. All rights reserved.13
  • 14. 14 A Sinking Feeling public class AsyncSubmit extends Configured implements Tool { public static void main(String[] args) throws Exception { int ret = ToolRunner.run( new Configuration(), new AsyncSubmit(), args); System.exit(ret); } public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf()); job.setNumReduceTasks(0); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(false); return job.isComplete() ? 1 : 0; } } ©2014 Cloudera, Inc. All rights reserved.14 The complete works of William Shakespeare
  • 15. 15 A Sinking Feeling The data: The complete works of William Shakespeare The results: a) Fails to compile b) The job fails c) Exits with 0 d) Exits with 1 ©2014 Cloudera, Inc. All rights reserved.15
  • 16. 16 A Sinking Feeling (Answer) The data: The complete works of William Shakespeare The results: a) Fails to compile b) The job fails c) Exits with 0 d) Exits with 1 ©2014 Cloudera, Inc. All rights reserved.16
  • 17. 17 A Sinking Feeling (Problem) public class AsyncSubmit extends Configured implements Tool { public static void main(String[] args) throws Exception { int ret = ToolRunner.run( new Configuration(), new AsyncSubmit(), args); System.exit(ret); } public int run(String[] args) throws Exception { Job job = Job.getInstance(getConf()); job.setNumReduceTasks(0); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(false); return job.isComplete() ? 1 : 0; } } ©2014 Cloudera, Inc. All rights reserved.17
  • 18. 18 A Sinking Job (Moral) ©2014 Cloudera, Inc. All rights reserved.18 • Read the API docs! • Sometimes the obvious meanings of methods and parameters aren’t correct • Parameter for waitForCompletion() controls whether status output is printed • Driver does wait for job to exit but does not print all the job status information
  • 19. 19 Do-over public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduceRedux extends Reducer<Text,Text, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { int max = 0; for (IntWritable v: values) if (v.get() > max) max = v.get(); c.write(key, new IntWritable(max)); } } ©2014 Cloudera, Inc. All rights reserved.19
  • 20. 20 Do-over The data: A,1 A,5 The results: a) A 5 b) A 1 c) A 1 A 5 d) The job fails ©2014 Cloudera, Inc. All rights reserved.20
  • 21. 21 Do-over public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduceRedux extends Reducer<Text,Text, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { int max = 0; for (IntWritable v: values) if (v.get() > max) max = v.get(); c.write(key, new IntWritable(max)); } } ©2014 Cloudera, Inc. All rights reserved.21 A 1 A 5
  • 22. 22 Do-over The data: A,1 A,5 The results: a) A 5 b) A 1 c) A 1 A 5 d) The job fails ©2014 Cloudera, Inc. All rights reserved.22
  • 23. 23 Do-over (Answer) The data: A,1 A,5 The results: a) A 5 b) A 1 c) A 1 A 5 d) The job fails ©2014 Cloudera, Inc. All rights reserved.23
  • 24. 24 Do-over (Problem) public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class MaxReduceRedux extends Reducer<Text,Text, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { int max = 0; for (IntWritable v: values) if (v.get() > max) max = v.get(); c.write(key, new IntWritable(max)); } } ©2014 Cloudera, Inc. All rights reserved.24
  • 25. 25 Do-over (Moral) ©2014 Cloudera, Inc. All rights reserved.25 • Mismatched signatures can lead to unexpected behaviors because of exposed base class method implementations • ALWAYS use @Override!
  • 26. 26 Joining Forces hive> DESCRIBE table1; OK id int phone string state string Time taken: 0.236 seconds hive> DESCRIBE table2; OK id int city string state string Time taken: 0.116 seconds hive> CREATE TABLE table3 AS SELECT table2.*,table1.phone,table1.state AS s FROM table1 JOIN table2 ON (table1.id == table2.id); … hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv'; … hive> exit $ hadoop fs –cat table3.csv | head -1 | tr , 'n' | wc –l ©2014 Cloudera, Inc. All rights reserved.26
  • 27. 27 Joining Forces The data: hive> SELECT * FROM table1; OK 1 6506506500 CA 2 2282282280 MS Time taken: 1.006 seconds hive> SELECT * FROM table2; OK 1 Palo Alto CA 2 Gautier MS Time taken: 1.202 seconds The results: a) 5 b) 4 c) 1 d) The join fails ©2014 Cloudera, Inc. All rights reserved.27
  • 28. 28 Joining Forces hive> DESCRIBE table1; OK id int phone string state string Time taken: 0.236 seconds hive> DESCRIBE table2; OK id int city string state string Time taken: 0.116 seconds hive> CREATE TABLE table3 AS SELECT table2.*,table1.phone,table1.state AS s FROM table1 JOIN table2 ON (table1.id == table2.id); … hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv'; … hive> exit $ hadoop fs –cat table3.csv | head -1 | tr , 'n' | wc –l ©2014 Cloudera, Inc. All rights reserved.28 1 6506506500 CA 2 2282282280 MS 1 Palo Alto CA 2 Gautier MS
  • 29. 29 Joining Forces The data: hive> SELECT * FROM table1; OK 1 6506506500 CA 2 2282282280 MS Time taken: 1.006 seconds hive> SELECT * FROM table2; OK 1 Palo Alto CA 2 Gautier MS Time taken: 1.202 seconds The results: a) 5 b) 4 c) 1 d) The join fails ©2014 Cloudera, Inc. All rights reserved.29
  • 30. 30 Joining Forces (Answer) The data: hive> SELECT * FROM table1; OK 1 6506506500 CA 2 2282282280 MS Time taken: 1.006 seconds hive> SELECT * FROM table2; OK 1 Palo Alto CA 2 Gautier MS Time taken: 1.202 seconds The results: a) 5 b) 4 c) 1 d) The join fails ©2014 Cloudera, Inc. All rights reserved.30
  • 31. 31 Joining Forces (Problem) hive> DESCRIBE table1; OK id int phone string state string Time taken: 0.236 seconds hive> DESCRIBE table2; OK id int city string state string Time taken: 0.116 seconds hive> CREATE TABLE table3 AS SELECT table2.*,table1.phone,table1.state AS s FROM table1 JOIN table2 ON (table1.id == table2.id); … hive> EXPORT TABLE table3 TO '/user/cloudera/table3.csv'; … hive> exit $ hadoop fs –cat table3.csv | head -1 | tr , 'n' | wc –l ©2014 Cloudera, Inc. All rights reserved.31
  • 32. 32 Joining Forces (Moral) ©2014 Cloudera, Inc. All rights reserved.32 • Hive’s default delimiter is 0x01 (CTRL-A) • Easy to assume export will use a sane delimiter – it doesn’t • Incidentally, Hive’s join rules are pretty sane and work as you’d expect
  • 33. 33 Close Enough public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class Top20Reduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { float max = 0.0f; for (IntWritable v: values) if (v.get() > max) max = v.get(); max *= 0.8f; for (IntWritable v: values) if (v.get() >= max) c.write(key, v); } } ©2014 Cloudera, Inc. All rights reserved.33
  • 34. 34 Close Enough The data: A,1 A,5 A,4 The results: a) b) A 5 c) A 5 A 4 d) The job fails ©2014 Cloudera, Inc. All rights reserved.34
  • 35. 35 Close Enough public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class Top20Reduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { float max = 0.0f; for (IntWritable v: values) if (v.get() > max) max = v.get(); max *= 0.8f; for (IntWritable v: values) if (v.get() >= max) c.write(key, v); } } ©2014 Cloudera, Inc. All rights reserved.35 A 1 A 5 A 4
  • 36. 36 Close Enough The data: A,1 A,5 A,4 The results: a) b) A 5 c) A 5 A 4 d) The job fails ©2014 Cloudera, Inc. All rights reserved.36
  • 37. 37 Close Enough (Answer) The data: A,1 A,5 A,4 The results: a) b) A 5 c) A 5 A 4 d) The job fails ©2014 Cloudera, Inc. All rights reserved.37
  • 38. 38 Close Enough (Problem) public class MaxMap extends Mapper<LongWritable, Text,Text,IntWritable> { Text k = new Text(); IntWritable v = new IntWritable(); protected void map(LongWritable key, Text val, Context c) … { String[] parts = val.toString().split(","); k.set(parts[0]); v.set(Integer.parseInt(parts[1])); c.write(k, v); } } public class Top20Reduce extends Reducer<Text,IntWritable, Text,IntWritable> { protected void reduce(Text key, Iterable<IntWritable> values, Context c) … { float max = 0.0f; for (IntWritable v: values) if (v.get() > max) max = v.get(); max *= 0.8f; for (IntWritable v: values) if (v.get() >= max) c.write(key, v); } } ©2014 Cloudera, Inc. All rights reserved.38
  • 39. 39 Close Enough (Moral) ©2014 Cloudera, Inc. All rights reserved.39 • For scalability reasons, the values iterable is single-shot • Subsequent iterators iterate over an empty collection • Store values (not Writables!) in the first pass • Better yet, restructure the logic to avoid storing all values in memory
  • 40. 40 Overbyte public class MinLineMap extends Mapper<LongWritable, Text,Text,Text> { Text k = new Text(); protected void map(LongWritable key, Text value, Context c) … { String val = value.toString(); k.set(val.substring(0, 1)); c.write(k, value); } } public class MinLineReduce extends Reducer<Text,Text, Text,IntWritable> { protected void reduce(Text key, Iterable<Text> values, Context c) … { int min = Integer.MAX_VALUE; for (Text v: values) if (v.getBytes().length < min) min = v.getBytes().length; c.write(key, new IntWritable(min)); } } ©2014 Cloudera, Inc. All rights reserved.40
  • 41. 41 Overbyte The data: Hadoop Spark Hive Sqoop2 The results: a) H 4 S 5 b) H 6 S 5 c) H 6 S 6 d) The job fails ©2014 Cloudera, Inc. All rights reserved.41
  • 42. 42 Overbyte public class MinLineMap extends Mapper<LongWritable, Text,Text,Text> { Text k = new Text(); protected void map(LongWritable key, Text value, Context c) … { String val = value.toString(); k.set(val.substring(0, 1)); c.write(k, value); } } public class MinLineReduce extends Reducer<Text,Text, Text,IntWritable> { protected void reduce(Text key, Iterable<Text> values, Context c) … { int min = Integer.MAX_VALUE; for (Text v: values) if (v.getBytes().length < min) min = v.getBytes().length; c.write(key, new IntWritable(min)); } } ©2014 Cloudera, Inc. All rights reserved.42 Hadoop Spark Hive Sqoop2
  • 43. 43 Overbyte The data: Hadoop Spark Hive Sqoop2 The results: a) H 4 S 5 b) H 6 S 5 c) H 6 S 6 d) The job fails ©2014 Cloudera, Inc. All rights reserved.43
  • 44. 44 Overbyte (Answer) The data: Hadoop Spark Hive Sqoop2 The results: a) H 4 S 5 b) H 6 S 5 c) H 6 S 6 d) The job fails ©2014 Cloudera, Inc. All rights reserved.44
  • 45. 45 Overbyte (Problem) public class MinLineMap extends Mapper<LongWritable, Text,Text,Text> { Text k = new Text(); protected void map(LongWritable key, Text value, Context c) … { String val = value.toString(); k.set(val.substring(0, 1)); c.write(k, value); } } public class MinLineReduce extends Reducer<Text,Text, Text,IntWritable> { protected void reduce(Text key, Iterable<Text> values, Context c) … { int min = Integer.MAX_VALUE; for (Text v: values) if (v.getBytes().length < min) min = v.getBytes().length; c.write(key, new IntWritable(min)); } } ©2014 Cloudera, Inc. All rights reserved.45
  • 46. 46 Overbyte (Moral) ©2014 Cloudera, Inc. All rights reserved.46 • Writables get reused in loops • In addition, Text.getBytes() reuses byte array allocated by previous calls • Net result is wrongness • Text.getLength() is the correct way to get the length of a Text.
  • 47. 47 What We Learned ©2014 Cloudera, Inc. All rights reserved.47 • Beware of reuse of Writables • Always use @Override so your compiler can help you • Don’t assume you know what a method does because of the name or parameters – read the docs! • Sometimes scalability is inconvenient
  • 48. 48 One Closing Note ©2014 Cloudera, Inc. All rights reserved.48 • Hadoop is still not easy • Being good takes effort and experience • Recognizing Hadoop talent can be hard • Cloudera’s is working to make Hadoop talent easier to recognize through certification http://cloudera.com/content/cloudera/en/training/cert ification.html
  • 49. 49 ©2014 Cloudera, Inc. All rights reserved. Aaron Myers & Daniel Templeton