Hadoop v1 : drawbacks
– One NameNode : SPOF
– One JobTracker : SPOF and not scalable (limited number of nodes)
– MapReduce only : the platform needs to be opened to non-MR applications
– MapReduce v1 : does not fit well with the iterative algorithms used in Machine Learning
Hadoop v2
Improvements :
– HDFS v2 : Standby NameNode (High Availability)
– YARN (Yet Another Resource Negotiator)
● JobTracker => ResourceManager + ApplicationMasters (one per application)
● Can be used by non-MapReduce applications
– MapReduce v2 : runs on YARN
What about monitoring ?
● Command line : hadoop job, yarn
● Web UI to monitor cluster status
● Web UI to check the status of running jobs
● Access to log files about node activity from the Web UI
What can we do with Hadoop ?
(Personally) 2 projects at Crédit Mutuel Arkéa :
– LAB : anti-money laundering
– Operational reporting for a B2B customer
LAB : Context
● Tracfin : the French anti-money-laundering unit, supervised by the
Ministry for the Economy and Finance
LAB : Context
● Difficult to provide accurate alerts : the system was complex to
maintain and to extend with new features
LAB : Context
● COBOL batch (z/OS) : ran from 19:00 until 9:00
the next day
LAB : Migration to Hadoop
● Pig : the Pig dataflow model fits this kind of
process well (lots of data manipulation)
LAB : Migration to Hadoop
● Lots of input data : +1 for Pig
LAB : Migration to Hadoop
● Many job tasks can be parallelized : +1 for
Hadoop
LAB : Migration to Hadoop
● Time spent on data manipulation reduced by more
than 50 %
LAB : Migration to Hadoop
● The previous job was a batch : MapReduce is OK
Operational Reporting
Context :
– Provide a large variety of reports to a B2B partner
Why Hadoop :
– New project
– A huge number of different data sources as input : Pig, help me !
– Batch is OK
Pig – Why a new language ?
● With Pig, writing MR jobs becomes easy
● Dataflow model : data is the key !
● Language : Pig Latin
● No limit : User Defined Functions (see the UDF sketch after the links below)
http://pig.apache.org/docs/r0.13.0/
https://github.com/linkedin/datafu
https://github.com/twitter/elephant-bird
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
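A minimal UDF sketch in Java, following the EvalFunc pattern from the Pig documentation (package and class names are illustrative) :
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// A trivial eval UDF that upper-cases its chararray input
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null)
      return null;
    return ((String) input.get(0)).toUpperCase();
  }
}
Packaged in a jar, it is used from Pig Latin with REGISTER myudfs.jar; then FOREACH lines GENERATE myudfs.Upper(line);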
Pig “Hello world”
● Pig-Wordcount
-- Load file from HDFS
lines = LOAD '/user/XXX/file.txt' AS (line:chararray);
-- Iterate on each line
-- We use TOKENIZE to split by word and FLATTEN to obtain a tuple
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group by word
grouped = GROUP words BY word;
-- Count the number of occurrences for each group (word)
wordcount = FOREACH grouped GENERATE group, COUNT(words);
-- Display results on stdout
DUMP wordcount;
Pig vs MapReduce
Import …
public class WordCount2 {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
static enum CountersEnum { INPUT_WORDS }
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private boolean caseSensitive;
private Set<String> patternsToSkip = new HashSet<String>();
private Configuration conf;
private BufferedReader fis;
...
=> 130 lines of code !
Hive
● SQL-like : HQL
● Metastore : data abstraction and data discovery
● UDFs
Hive “Hello world”
● Hive-Wordcount
-- Create the table structure (DDL)
CREATE TABLE docs (line STRING);
-- Load data from HDFS into the table
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
-- Create a table for the results :
-- select data from the previous table, split lines into words, group by word
-- and count records per group
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
Zookeeper
Purpose : coordinate the different actors of a
distributed platform and serve the global
configuration pushed to it (see the sketch below).
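A minimal sketch of the shared-configuration idea with the ZooKeeper Java client (ensemble address, znode path and payload are assumptions) :
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
public class ZkConfigSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (address is an assumption)
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
    // Push one piece of global configuration under a znode
    if (zk.exists("/app-config", false) == null)
      zk.create("/app-config", "batch.size=500".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any actor of the cluster can now read it back
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}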
Kafka
● Messaging system with a specific design
● Publish/subscribe (topics) and point-to-point at the same time
● Suitable for high volumes of data (see the producer sketch below)
https://kafka.apache.org/
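A minimal producer sketch with the Kafka Java client (broker address, topic, key and value are assumptions) :
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
public class KafkaProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker address is an assumption; adapt to your cluster
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Messages with the same key land in the same partition,
      // which gives point-to-point ordering inside a topic
      producer.send(new ProducerRecord<>("events", "user42", "login"));
    }
  }
}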
Tez
● Execution engine under Hive and Pig that enables interactive processing
HBase
● Online database (real-time querying)
● NoSQL : column-oriented database
● Based on Google BigTable
● Storage on HDFS (see the client sketch below)
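A minimal sketch with the HBase 1.x Java client (table name, column family and values are assumptions; the table must already exist) :
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("alerts"))) {
      // Write one cell : row key, column family, qualifier, value
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("status"), Bytes.toBytes("open"));
      table.put(put);
      // Real-time point read by row key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("status"))));
    }
  }
}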
Storm
● Streaming mode
● Integrates well with Apache Kafka
● Allows data to be processed as it arrives (see the topology sketch after the links)
http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos
http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign
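A minimal topology sketch with the Storm 0.9-era Java API and the storm-kafka spout (ZooKeeper address, topic and ids are assumptions) :
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
public class StormKafkaSketch {
  // Bolt invoked once per tuple, i.e. data is processed as it arrives
  public static class PrintBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      System.out.println("event : " + tuple.getString(0));
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
  }
  public static void main(String[] args) {
    // ZooKeeper address, topic and ids are assumptions
    SpoutConfig spoutConf = new SpoutConfig(
        new ZkHosts("localhost:2181"), "events", "/kafka-spout", "demo");
    spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka", new KafkaSpout(spoutConf));
    builder.setBolt("print", new PrintBolt()).shuffleGrouping("kafka");
    new LocalCluster().submitTopology("kafka-demo", new Config(), builder.createTopology());
  }
}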
Cascading
● Application development platform on Hadoop
● APIs in Java : standard API, data processing, data
integration, scheduler API (see the flow sketch below)
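A minimal flow sketch with the Cascading 2.x Java API (paths are assumptions) — a single pipe copying a source tap to a sink tap :
import java.util.Properties;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
public class CascadingSketch {
  public static void main(String[] args) {
    // HDFS paths are assumptions
    Tap in = new Hfs(new TextLine(), "input/data.txt");
    Tap out = new Hfs(new TextLine(), "output/copy", SinkMode.REPLACE);
    // The simplest possible pipe : copy source to sink
    Pipe copy = new Pipe("copy");
    FlowDef flow = FlowDef.flowDef()
        .addSource(copy, in)
        .addTailSink(copy, out);
    new HadoopFlowConnector(new Properties()).connect(flow).complete();
  }
}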
Phoenix
● Relational DB layer over HBase
● HBase access delivered as a JDBC client (see the sketch below)
● Perf : on the order of milliseconds for small
queries, or seconds for tens of millions of rows
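A minimal sketch of Phoenix as a JDBC client (ZooKeeper quorum, table and data are assumptions) :
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
public class PhoenixSketch {
  public static void main(String[] args) throws Exception {
    // The JDBC URL points at the ZooKeeper quorum of the HBase cluster
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
      conn.createStatement().execute(
          "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)");
      PreparedStatement ps = conn.prepareStatement("UPSERT INTO users VALUES (?, ?)");
      ps.setInt(1, 1);
      ps.setString(2, "alice");
      ps.executeUpdate();
      conn.commit(); // Phoenix batches mutations until commit
      ResultSet rs = conn.createStatement()
          .executeQuery("SELECT name FROM users WHERE id = 1");
      while (rs.next()) System.out.println(rs.getString(1));
    }
  }
}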
Spark
● Big data analytics, in-memory / on disk
● Complements Hadoop
● Faster and more flexible (see the word count sketch after the links)
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
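A minimal word count sketch with the Spark 2.x Java API (input path and local master are assumptions) :
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Same word count as the Pig and Hive examples, kept in memory between steps
    JavaRDD<String> lines = sc.textFile("hdfs:///user/XXX/file.txt");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);
    counts.collect().forEach(t -> System.out.println(t._1() + " : " + t._2()));
    sc.stop();
  }
}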