INTRODUCTION TO HADOOP 
BreizhJug 
Rennes – 2014-11-06 
David Morin - @davAtBzh
Me 
David Morin 
@davAtBzh 
Solutions Engineer at
3 
What is Hadoop ?
4 
An elephant – This one ?
5 
No, this one !
6 
The father
7 
Let's go !
8 
Let's go !
9 
Timeline
10 
How did the story begin ? 
=> Dealing with high volumes of data
11 
Big Data – Big Server ?
12 
Big Data – Big Server ?
13 
Big Data – Big Problems ?
14 
Big Data – Big Problems ?
15 
Split is the key
16 
How to find data ?
17 
Define a master
18 
Try again
19 
Not so bad
20 
Hadoop fundamentals 
● Distributed file system for high volumes of data
● Runs on commodity servers (to limit costs)
● Scalable and fault tolerant
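The fundamentals above (split the data, keep a master that knows where each piece lives, survive the loss of a server) can be illustrated with a small toy model. This is only a sketch and not HDFS code: the class and method names are hypothetical, the 128 MB block size and replication factor of 3 are the usual Hadoop 2 defaults taken here as assumptions, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a master that splits a file into blocks and places replicas. Not HDFS code. */
public class BlockPlacementSketch {

    /** Number of fixed-size blocks needed to hold fileSize bytes (the last block may be partial). */
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    /**
     * For each block, pick `replication` distinct nodes in round-robin order.
     * Round-robin is an assumption made for brevity; it is not the real placement policy.
     */
    static List<List<String>> placeBlocks(long fileSize, long blockSize,
                                          List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("not enough nodes for the replication factor");
        }
        List<List<String>> placement = new ArrayList<>();
        long blocks = blockCount(fileSize, blockSize);
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(nodes.get((int) ((b + r) % nodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
        // A 300 MB file with 128 MB blocks needs 3 blocks, each stored on 3 nodes.
        List<List<String>> placement = placeBlocks(300 * mb, 128 * mb, nodes, 3);
        for (int i = 0; i < placement.size(); i++) {
            System.out.println("block " + i + " -> " + placement.get(i));
        }
    }
}
```

Because every block lives on several nodes, losing one server still leaves a copy of each block reachable, which is the intuition behind the fault-tolerance bullet.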
21 
HDFS 
HDFS
22 
Hadoop Distributed FileSystem
23 
Hadoop fundamentals 
● Distributed file system for high volumes of data
● Runs on commodity servers (to limit costs)
● Scalable and fault tolerant ??
24 
Hadoop Distributed FileSystem
25 
MapReduce 
HDFS MapReduce
26 
MapReduce
27 
MapReduce : word count 
Map Reduce
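The word count flow can be imitated in memory to make the three phases visible. This is a minimal sketch that uses no Hadoop API: the class name and the `map`, `shuffle` and `reduce` helpers are hypothetical stand-ins for what the framework does across many machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map, shuffle and reduce phases of word count. */
public class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key, with keys kept sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Runs the three phases one after the other on a list of lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```

On a cluster, each `map` call would run on the node holding the input split and each `reduce` call on the node that receives a given key range; only the shuffle moves data over the network.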
28 
Data Locality Optimization
29 
MapReduce in action
30 
Hadoop v1 : drawbacks 
– One NameNode : SPOF 
– One JobTracker : SPOF and not scalable (limit on the number of nodes) 
– MapReduce only : the platform is not open to non-MapReduce applications 
– MapReduce v1 : does not fit well with the iterative algorithms 
used in machine learning
31 
Hadoop v2 
Improvements : 
– HDFS v2 : NameNode high availability (standby NameNode) 
– YARN (Yet Another Resource Negotiator) 
● JobTracker => ResourceManager + ApplicationMaster 
(one per application) 
● Can be used by non-MapReduce applications 
– MapReduce v2 : runs on YARN
32 
Hadoop v2
33 
YARN
34 
YARN
35 
YARN
36 
YARN
37 
YARN
38 
YARN
39 
What about monitoring ? 
● Command line : hadoop job, yarn 
● Web UI to monitor cluster status 
● Web UI to check the status of running jobs 
● Access to the node activity log files from the web UI
40 
What about monitoring ?
41 
What can we do with Hadoop ? 
(Me) 2 projects at Credit Mutuel Arkea : 
– LAB : Anti-money laundering 
– Operational reporting for a B2B customer
42 
LAB : Context 
● Tracfin : supervised by the French Ministry of Economy 
and Finance
43 
LAB : Context 
● Difficulty providing accurate alerts : the system is complex to 
maintain and to extend with new features
44 
LAB : Context 
● COBOL batch (z/OS) : runs from 19:00 until 09:00 
the next day
45 
LAB : Migration to Hadoop 
● Pig : the Pig dataflow model fits this kind of 
process well (a lot of data manipulation)
46 
LAB : Migration to Hadoop 
● A lot of input data : +1 for Pig
47 
LAB : Migration to Hadoop 
● Many of the job's tasks can be parallelized : +1 for 
Hadoop
48 
LAB : Migration to Hadoop 
● Time spent on data manipulation reduced by more 
than 50 %
49 
LAB : Migration to Hadoop 
● The previous job was already a batch : MapReduce is OK
50 
Operational Reporting 
Context : 
– Provide a large variety of reports to a B2B partner 
Why Hadoop : 
– New project 
– Huge number of different data sources as input : Pig, help 
me ! 
– Batch is OK
51
52 
Pig – Why a new language ? 
● With Pig, writing MR jobs becomes easy 
● Dataflow model : data is the key ! 
● Language : Pig Latin 
● No limit : User Defined Functions 
http://pig.apache.org/docs/r0.13.0/ 
https://github.com/linkedin/datafu 
https://github.com/twitter/elephant-bird 
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
53 
Pig “Hello world”
● Pig-Wordcount 
-- Load the file from HDFS 
lines = LOAD '/user/XXX/file.txt' AS (line:chararray); 
-- Iterate over each line 
-- TOKENIZE splits the line into a bag of words; FLATTEN turns that bag into one row per word 
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; 
-- Group by word 
grouped = GROUP words BY word; 
-- Count the number of occurrences in each group (word) 
wordcount = FOREACH grouped GENERATE group, COUNT(words); 
-- Print the results to standard output 
DUMP wordcount; 
54 
Pig vs MapReduce 
import … 
public class WordCount2 { 
  public static class TokenizerMapper 
      extends Mapper<Object, Text, Text, IntWritable> { 
    static enum CountersEnum { INPUT_WORDS } 
    private final static IntWritable one = new IntWritable(1); 
    private Text word = new Text(); 
    private boolean caseSensitive; 
    private Set<String> patternsToSkip = new HashSet<String>(); 
    private Configuration conf; 
    private BufferedReader fis; 
    ... 
=> 130 lines of code !
55 
Hive
● SQL-like language : HiveQL (HQL) 
● Metastore : data abstraction and data discovery 
● UDFs 
56 
Hive “Hello world” 
● Hive-Wordcount 
-- Create the input table (DDL) 
CREATE TABLE docs (line STRING); 
-- Load the data 
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs; 
-- Create the result table : 
-- select from the previous table, split each line on whitespace, group by word 
-- and count the records in each group 
CREATE TABLE word_counts AS 
SELECT word, count(1) AS count FROM 
(SELECT explode(split(line, '\\s+')) AS word FROM docs) w 
GROUP BY word 
ORDER BY word;
57 
ZooKeeper 
Purpose : coordinate the different actors of the 
cluster and give all of them the same global 
configuration once it has been pushed.
58 
ZooKeeper 
● Distributed coordination service
59 
ZooKeeper 
● Dynamic configuration 
● Distributed locking
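Distributed locking is commonly built on sequential nodes: every client that wants the lock creates a numbered entry, the lowest number holds the lock, and when that entry disappears the next one takes over. The sketch below imitates that idea in memory only. It does not talk to ZooKeeper, and the class and method names are hypothetical.

```java
import java.util.TreeMap;

/** In-memory imitation of a lock based on sequential entries. No ZooKeeper API is used. */
public class SequentialLockSketch {
    // sequence number -> client name, kept sorted so the lowest number comes first
    private final TreeMap<Integer, String> waiting = new TreeMap<>();
    private int nextSequence = 0;

    /** A client joins the queue and receives a monotonically increasing sequence number. */
    synchronized int enqueue(String client) {
        int seq = nextSequence++;
        waiting.put(seq, client);
        return seq;
    }

    /** The lock holder is the client with the lowest sequence number, or null if nobody waits. */
    synchronized String holder() {
        return waiting.isEmpty() ? null : waiting.firstEntry().getValue();
    }

    /** Releasing the lock (or losing the session) removes the entry; the next one takes over. */
    synchronized void release(int seq) {
        waiting.remove(seq);
    }

    public static void main(String[] args) {
        SequentialLockSketch lock = new SequentialLockSketch();
        int a = lock.enqueue("worker-A");
        lock.enqueue("worker-B");
        System.out.println(lock.holder()); // prints worker-A
        lock.release(a);
        System.out.println(lock.holder()); // prints worker-B
    }
}
```

In a real deployment the entries would be ephemeral, so a crashed client releases its place automatically when its session ends.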
60 
Kafka 
● Messaging system with a specific design (a distributed commit log) 
● Publish/subscribe (topic) and point-to-point (queue) at the same time 
● Suitable for high volumes of data 
https://kafka.apache.org/
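Kafka gets both behaviours from consumer groups: every group sees all messages (topic style), while inside a group each message is handed out only once (queue style). The sketch below imitates this with a single in-memory log and one read offset per group. It is a simplification under stated assumptions: it ignores partitions, uses no Kafka API, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory imitation of consumer groups reading one shared log. No Kafka API is used. */
public class ConsumerGroupSketch {
    private final List<String> log = new ArrayList<>();            // append-only message log
    private final Map<String, Integer> offsets = new HashMap<>();  // next offset to read, per group

    /** Producers only ever append to the end of the log. */
    void publish(String message) {
        log.add(message);
    }

    /** Returns the next unread message for the group, or null when the group is caught up. */
    String poll(String group) {
        int offset = offsets.getOrDefault(group, 0);
        if (offset >= log.size()) {
            return null;
        }
        offsets.put(group, offset + 1);
        return log.get(offset);
    }

    public static void main(String[] args) {
        ConsumerGroupSketch topic = new ConsumerGroupSketch();
        topic.publish("m1");
        topic.publish("m2");
        // Two consumers of the same group share the messages (queue behaviour).
        System.out.println(topic.poll("billing")); // prints m1
        System.out.println(topic.poll("billing")); // prints m2
        // Another group starts from the beginning (publish/subscribe behaviour).
        System.out.println(topic.poll("audit"));   // prints m1
    }
}
```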
61 
Hadoop : batch, but not only batch
62 
Tez 
● Interactive processing for Hive and Pig
63 
HBase 
● Online database (realtime querying) 
● NoSQL : column-oriented database 
● Modeled on Google Bigtable 
● Storage on HDFS
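The data model behind these bullets can be pictured as a sorted map: row key, then column (family and qualifier), then value, with rows kept in key order so that range scans are cheap. The sketch below shows that shape with plain Java collections. It is an illustration only: it uses no HBase API, leaves out cell versions and timestamps, and the class name, row keys and column names are hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory imitation of a sorted, column-oriented row store. No HBase API is used. */
public class SortedRowStoreSketch {
    // row key -> "family:qualifier" -> value, with rows kept sorted by key
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    /** Writes one cell; rows and columns are created on demand, so rows may differ in shape. */
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Reads one cell by row key and column, or null if it does not exist. */
    String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    /** Range scan: all rows with startKey <= key < stopKey, returned in key order. */
    NavigableMap<String, NavigableMap<String, String>> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, true, stopKey, false);
    }

    public static void main(String[] args) {
        SortedRowStoreSketch table = new SortedRowStoreSketch();
        table.put("user#001", "info:name", "Ada");
        table.put("user#002", "info:name", "Bob");
        table.put("video#001", "meta:title", "Intro");
        System.out.println(table.get("user#001", "info:name"));     // prints Ada
        System.out.println(table.scan("user#", "user$").keySet());  // prints [user#001, user#002]
    }
}
```

Choosing row keys so that related rows sort next to each other is what makes the scan above return one contiguous slice.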
64 
Storm 
● Streaming mode 
● Integrates well with Apache Kafka 
● Allows data to be transformed as it is ingested 
http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos 
http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign
65 
Cascading 
● Application development platform on Hadoop 
● APIs in Java : standard API, data processing, data 
integration, scheduler API
66 
Scalding 
● Scala API for Cascading
67 
Phoenix 
● Relational DB layer over HBase 
● HBase access delivered as a JDBC client 
● Perf : on the order of milliseconds for small 
queries, or seconds for tens of millions of rows
68 
Spark 
● Big data analytics, in memory or on disk 
● Complements Hadoop 
● Fast and more flexible 
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark 
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
69 
??
