1 / 31
BIG DATA TECHNOLOGY
● Juanjo Mostazo
● c-base Berlin
● May 2014
2 / 31
Roadmap
● Map Reduce
● Hadoop
● Concepts
● HDFS
● Architecture
● Hadoop Ecosystem
● Lambda Architecture
● New trends
3 / 31
M/R: Motivation
● Process large amounts of data to produce other data
● Scale up vs Scale out
4 / 31
M/R: What is it?
● Different programming paradigm
● Based on a Google paper (2004)
● Automatic parallelization and distribution
● I/O Scheduling
● Fault tolerance
● Status and monitoring
5 / 31
M/R: The paradigm
● Input & Output: set of key/value pairs
● Large amounts of data are grouped & sorted
● Job = Two phases = Mapper & Reducer
● Map (in_key, in_value) → list(interm_key, interm_value)
● Reduce (interm_key, list(interm_value)) → list(out_key, out_value)
6 / 31
M/R: Example
(word counter)
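
A minimal sketch of the word counter in plain Python, assuming a single process stands in for the cluster. The function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and are not part of any Hadoop API; the shuffle step imitates the group & sort that the framework would do between the two phases.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: (in_key, in_value) -> list of (interm_key, interm_value)."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key and sort by key (the framework's job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(word, counts):
    """Reduce: (interm_key, list(interm_value)) -> list of (out_key, out_value)."""
    return [(word, sum(counts))]


def word_count(documents):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_words(doc_id, text))
    output = []
    for word, counts in shuffle(intermediate):
        output.extend(reduce_counts(word, counts))
    return output
```

Calling word_count({"d1": "big data big", "d2": "data"}) would give one (word, total) pair per distinct word, sorted by word.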
7 / 31
M/R: Workflow
8 / 31
Roadmap
● Map Reduce
● Hadoop
● Concepts
● HDFS
● Architecture
● Hadoop Ecosystem
● Lambda Architecture
● New trends
9 / 31
Hadoop: What is it?
● Framework based on Google MapReduce (GMR) / Google File System (GFS)
● Apache project
● Developed in Java
● Multiple applications
● Used by many companies
● De facto standard in the community
10 / 31
Hadoop: HDFS Architecture
11 / 31
Hadoop: HDFS concepts
● Distributed file system. Layer on top of ext3, xfs...
● Works better on huge files
● Redundancy (default replication factor 3)
● Bad seeking, no append!
● Good at rack scale. Not good at data center scale
● Files divided into 128 MB – 256 MB blocks
● Computation is sent to data!
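
The block and redundancy figures above can be turned into a rough back-of-the-envelope calculation. This is only a sketch: hdfs_footprint is a hypothetical helper, the 128 MB block size is taken from the lower end of the range on the slide, and the replication factor of 3 is the default mentioned above.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate blocks, block replicas and raw storage for one file.

    Assumption: the last block only occupies the bytes it actually holds,
    so raw storage is simply file size times the replication factor.
    """
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    replicas = blocks * replication
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes
```

For a 300 MB file this gives 3 blocks, 9 block replicas spread over the cluster, and 900 MB of raw disk usage.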
12 / 31
Hadoop: Architecture v1
13 / 31
Hadoop: Architecture v2
14 / 31
Hadoop: Architecture v3
15 / 31
M/R: Example
(word counter)
16 / 31
Hadoop: Clustering
17 / 31
Hadoop: Advanced
● Distributed caches
● Partitioner
● Sort comparator
● Group comparator
● Combiner
● Input format & Record reader
● MultiInput
● MultiOutput
● Compression
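
Two of the items above, the Partitioner and the Combiner, can be sketched in a few lines of plain Python. This is an illustration only, not Hadoop code: partition, combine and route are hypothetical names, and zlib.crc32 is used here as a stand-in for a stable hash of the key.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Partitioner: the same key is always sent to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: pre-aggregate (word, count) pairs on the map side,
    so that fewer pairs have to cross the network."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def route(pairs, num_reducers):
    """Combine the map output, then place each pair in its reducer's bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in combine(pairs):
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

With the combiner in place, three map outputs such as ("a", 1), ("b", 1), ("a", 1) leave the mapper as only two pairs.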
18 / 31
Hadoop: Conclusions
● Simplify large-scale computation
● Hide parallel programming issues
● Easy to get into & develop (extensive documentation)
● Widely used & maintained by the community
● Possibility to throw away RDBMS! (Bottleneck)
19 / 31
Roadmap
● Map Reduce
● Hadoop
● Concepts
● HDFS
● Architecture
● Hadoop Ecosystem
● Lambda Architecture
● New trends
20 / 31
Hadoop: Ecosystem
21 / 31
Roadmap
● Map Reduce
● Hadoop
● Concepts
● HDFS
● Architecture
● Hadoop Ecosystem
● Lambda Architecture
● New trends
22 / 31
Lambda Architecture: Motivation
● Real time use cases
● Business analytics
● Batch processing vs Real Time
● Problem!
● Low latency read & update
● Scalable & fault tolerant
● Something else needed!
23 / 31
Lambda Architecture: Schema
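
A minimal sketch of the idea, assuming a page-view counter as the use case. The class LambdaCounter and its methods are hypothetical, and the sketch simplifies heavily: a real batch run takes time, and the real-time view would only drop data once the batch layer has absorbed it.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture: batch view + real-time view, merged on query."""

    def __init__(self):
        self.master = []                # append-only master dataset
        self.batch_view = Counter()     # recomputed from scratch by the batch layer
        self.realtime_view = Counter()  # incremental, covers data since the last batch run

    def ingest(self, page):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        """Batch layer: rebuild the view from all data, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.realtime_view = Counter()

    def query(self, page):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]
```

Queries stay low latency because they only read two precomputed views, while the batch layer keeps the result correct by recomputing from the full dataset.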
24 / 31
Lambda Architecture: Example 1
25 / 31
Lambda Architecture: Example 2
26 / 31
Lambda Architecture: Lambdoop
● Unified technology stack
● High level programming environment
● Management tools
27 / 31
Roadmap
● Map Reduce
● Hadoop
● Concepts
● HDFS
● Architecture
● Hadoop Ecosystem
● Lambda Architecture
● New trends
28 / 31
New trends: Architecture
● Hadoop vs Hadoop2
● Columnar storage
29 / 31
New trends: Storm
● Stream processing
● Tuples
● Streams
● Spouts
● Bolts
● Topologies
● Twitter
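
The concepts above can be imitated with Python generators. This is a sketch only and does not use the Storm API: sentence_spout, split_bolt, count_bolt and run_topology are hypothetical names, and everything runs in one process instead of being distributed.

```python
def sentence_spout(sentences):
    """Spout: source of the stream, emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every word."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: wire spout -> split bolt -> count bolt and drain the stream."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))
```

Unlike a MapReduce job, the counts are updated tuple by tuple as data arrives, rather than after the whole input has been read.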
30 / 31
New trends: Spark
● Next generation MapReduce
● Integrated but not dependent on Hadoop
● Fast, memory-optimized execution engine
● Avoids many Hadoop problems
● Overhead
● High latency
● Many disk writes
● In-memory cache
● Flexible execution graph
● Much faster than MapReduce (up to 100x)
● Shark (SQL)
● Supports streaming (beta)
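
The in-memory cache and the execution graph can be illustrated with a small pure-Python sketch. LazyDataset is a hypothetical class, not PySpark; its method names (map, filter, cache, collect) are borrowed from Spark's RDD operations only to make the analogy easier to follow.

```python
class LazyDataset:
    """Toy lazy dataset: transformations build a graph, collect() runs it."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the full list
        self._use_cache = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Nothing is executed here; we only record how to compute the result.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        """Mark this dataset to be kept in memory after its first computation."""
        self._use_cache = True
        return self

    def collect(self):
        """Run the graph; reuse the in-memory result if this dataset is cached."""
        if not self._use_cache:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return list(self._cached)
```

When two downstream computations share a cached dataset, the shared part runs once and stays in memory, instead of being written to disk and re-read between jobs.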
31 / 31
BIG DATA TECHNOLOGY
● Juanjo Mostazo
● juanj.mostazo@gmail.com
● http://www.slideshare.net/juanjmostazo/mr-hadoop-cbase
