Vademecum Big Data
Adam Kawa, Spotify, Compendium CE
About Me
Spotify/Compendium, WHUG/SHUG, HakunaMapData.com, +2.5Y
And The 20-Minute Story About ...




Image source: http://www.containsmoderateperil.com/wp-content/uploads/2012/09/Dev-Diary-Epic-Story.jpg
A Really Data-Driven Company …




Image source: http://wwwimg.roku.com/hero-images/home2_1.jpg
And Some Inevitable Problems ...




Image source: http://www.digitalnewsasia.com/sites/default/files/images/digital%20economy/data%20explosion.jpg
And Some Inevitable Problems ...




Image source: http://p.alejka.pl/i2/p_new/36/42/grosz-na-szczescie-ze-zlota-m-1z-doskonaly-na-kazda-okazje_0_b.jpg
And Some Inevitable Problems ...




Image source: http://25.media.tumblr.com/d1038e7831eae86f5e84d0d09a2e6fad/tumblr_mfh5srmNAR1s06a3to1_500.jpg
Start!
The First Approach Works Fine ...
Until Data Gets Bigger ...
And More Diverse ...
The Data Monster Becomes A Problem




Image source: http://cloudtimes.org/wp-content/uploads/2012/05/big-data.jpg
Apache Hadoop Becomes A Solution




Image source: http://gigaom2.files.wordpress.com/2012/06/shutterstock_60414424.jpg
Orchestra Of Nodes




Image source: http://www.dsn.jhu.edu/images/orchestra.gif
Fault-Tolerant Orchestra Of Nodes
Untypical Orchestra Of Typical* Nodes
* however having very cheap nodes is false economy
Highly Scalable Orchestra Of Nodes
Hadoop Distributed File System (HDFS)




Image source: http://www.wallcoo.net/car/Trucks/images/Big_Truck_on_Road_.jpg
HDFS Blocks And Replication
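A rough sketch of what blocks and replication mean for one file. The slide gives no numbers, so the 128 MB block size and replication factor of 3 below are assumptions (common defaults; older Hadoop releases used 64 MB blocks), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB) for one file.

    Assumed defaults: 128 MB blocks, 3 replicas per block.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partial last block only occupies its actual size on disk,
    # so raw usage is simply file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

print(hdfs_footprint(1000))  # a 1000 MB file
```

Under these assumptions, losing a single node still leaves two copies of every block it held, which is what lets the cluster re-replicate in the background.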
HDFS Self-Healing Features




Image source: http://www.mwctoys.com/images/review_hydra_3.jpg
HDFS Scales And Shines With MapReduce




Image source: http://www.kkkp.pl/graph/gr_kdz_char3.jpg
MapReduce Is A Change


DATA
Map And Reduce


Image source: http://2.bp.blogspot.com/-Kl1ADjd3_7I/T6a8ZQV7ITI/AAAAAAAAKfE/qVyTQdJl2Do/s1600/make-big-changes-in-small-steps.png
Map And Reduce Functions
MapReduce Paradigm
Artist Count Example
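The slide text does not include the example itself, so the following is only a sketch of how an artist count could be expressed as map and reduce functions. The input format (`user<TAB>artist` play records), the artist names, and the in-memory `run_mapreduce` driver are all assumptions for illustration; a real job would run on the cluster rather than in one process.

```python
from collections import defaultdict

def map_fn(record):
    """Emit (artist, 1) for one play record of the form 'user<TAB>artist'."""
    _user, artist = record.split("\t")
    yield artist, 1

def reduce_fn(artist, counts):
    """Sum all the counts that were emitted for one artist."""
    yield artist, sum(counts)

def run_mapreduce(records):
    """Tiny single-process stand-in for the map, shuffle and reduce phases."""
    # Map + shuffle: group every emitted value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce: one call per key, keys visited in sorted order.
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result

# Hypothetical sample data.
plays = ["alice\tColdplay", "bob\tOceana", "carol\tColdplay"]
print(run_mapreduce(plays))
```

The point of the split is that `map_fn` sees one record at a time and `reduce_fn` sees one key at a time, so both can be run in parallel on many nodes.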
Sending Computation To Data


Data Is Here!


Computation


Image source: http://www.conservationmagazine.org/wp-content/uploads/2011/03/ElephantAndMouse1.jpg
MapReduce Implementation




Image source: http://i3.mirror.co.uk/incoming/article1360046.ece/ALTERNATES/s615/Male+drones+tend+to+honeycomb+cells+in+a+bee+colony
First Success: 5-Node Hadoop Cluster




Image source: http://www.smallbiztechnology.com/wp-content/uploads/2012/12/success.jpg
Apache Whirr And The Cloud
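===== hadoop.properties =============
# Java properties files have no trailing comments and need each value
# on one line (or a backslash continuation), so comments sit on their own lines.
whirr.cluster-name=production_cluster
# 1 master node and 4 worker nodes
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,4 hadoop-datanode+hadoop-tasktracker
# aws-ec2, or cloudservers-us for Rackspace
whirr.provider=aws-ec2
...
=====================================

$ whirr launch-cluster --config hadoop.properties
$ whirr destroy-cluster --config hadoop.properties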
First Sad (Non-Java Speaking) Developers




Image source: http://www.shivayanaturals.com/wp-content/uploads/2012/01/Unhappy.jpg
Hadoop Streaming For Scripting Languages




Image source: http://www.mightystreamradio.com/PHOTOS/STREAM%20PHOTO%202.jpg
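As a hedged illustration of the idea: Hadoop Streaming runs any executable as a mapper or reducer, feeding it lines on standard input and reading tab-separated key/value lines from standard output, with the reducer input sorted by key. The sketch below assumes `user<TAB>artist` input lines (not given in the slides) and simulates the sort step locally; `mapper` and `reducer` are hypothetical names.

```python
from itertools import groupby

def mapper(lines):
    """Turn 'user<TAB>artist' lines into 'artist<TAB>1' lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:  # skip malformed lines
            yield f"{fields[1]}\t1"

def reducer(sorted_lines):
    """Sum the counts per artist; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for artist, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{artist}\t{total}"

# Local simulation: sorted() stands in for the framework's shuffle and sort.
sample = ["alice\tColdplay\n", "bob\tOceana\n", "carol\tColdplay\n"]
mapped = sorted(mapper(sample))
print(list(reducer(mapped)))
```

On a cluster, each function would live in its own script that iterates over `sys.stdin` and prints its output, and the two scripts would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options alongside `-input` and `-output`.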
Apache Hive Makes You Feel Younger




Image source: http://majapszczolka.blox.pl/resource/Pszczolka_Maja_Baje_Pl_6.jpg
Speak ~SQL, But Run As MapReduce
HUE - Browser-Based Environment




Image source: http://www.sentric.ch/wp-content/uploads/2013/01/Create-table-in-Hive.png
Hive Is Based On & Limited By Hadoop
Apache Pig Makes Them Happier!


                        




Image source: http://vetnolimits.files.wordpress.com/2012/02/pumba.jpg
Pig Accelerates Development


        
Need To Add More Relational Data To HDFS




Based on the image from http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/
SQL To Hadoop = Sqoop




Image source: http://3.bp.blogspot.com/_uuOo8x3WXWE/SuNV4y7qzeI/AAAAAAAAkYM/6RUExOMQPno/s400/pumpkin_eating_elephant.jpg
Sqoop Import/Export Data Using MR




Image source: http://blog.cloudera.com/blog/2011/10/apache-sqoop-overview/
Apache Oozie For Defining Workflows




Image source: Apache Oozie website
Apache Oozie For Scheduling




Image source: http://risingtechies.files.wordpress.com/2012/05/schedule.jpg
Need To Add Even More Logs To HDFS




Based on the image from http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/
Apache Flume For Data Collection
Channel types: e.g. JDBC, Memory, File




Image source: Apache Flume website
How To Manage A Larger Cluster
Apache Avro + Snappy/Deflate_6




Image source: http://www.funkydiva.pl/wp-content/uploads/2012/10/lego-tapety-na-pulpit-duze-zdjecia-16.jpg
When Latency Is Too High




Image source: http://www.pharmacyowners.com/Portals/37772/images/It-can-be-a-LONG-wait-at-the-pharmacy-resized-600.jpg
Cloudera Impala – Real-Time ~SQL Queries




Image source: http://static.cargurus.com/images/site/2010/07/02/12/24/1969_chevrolet_impala-pic-2868587530424686499.jpeg
Apache HBase - Random, Real-Time Access To Big Data




Image source: http://www.superhqwallpapers.com/wp-content/uploads/2012/01/Super-Ferrari.jpg
YARN – Making The Hadoop Cluster More Robust




Image source: http://globeattractions.com/wp-content/uploads/2012/01/green-leaf-drops-green-hd-leaf-nature-wet.jpg
Hadoop Is Successfully Deployed




Image source: http://bogdankipko.com/wp-content/uploads/2012/03/lessons-learned.jpg
Learn More About Apache Hadoop?
Use Hadoop To Solve Real-World Problems?
Oozie And YARN At WHUG, Today @18:00
Thank You! Any Questions About Them?




Image source: http://xn--gryprzegldarkowe-43b.com.pl/wp-content/uploads/2012/05/me-free-zoo1.jpg
Apache Hadoop Ecosystem (based on an exemplary data-driven…
