Expect More from Hadoop!


©MapR Technologies - Confidential   1
My Background

     University, Startups
       – Aptex, MusicMatch, ID Analytics, Veoh
       – big data since before it was big

     Open source
       – even before the internet
       – Apache Hadoop, Mahout, Zookeeper, Drill
       – bought the beer at first HUG

 MapR
 Founding member of Apache Drill


MapR Technologies

     Enterprise quality distribution for Hadoop
       – Many extensions beyond basic Hadoop
     Super strong team
       – Long history of successful startups
     Strong supporter of Apache Drill
       –      and open source in general




meta-Hadoop?



meta
                    Meta- (from Greek: μετά = "after", "beyond",
                    "with", "adjacent", "self"), is a…




Beyond ≠ Answering yesterday’s problems


Philosophy First




                              What is History?



The study of the past

(what came before now)


What is the future?

        (it comes after now)


But the future also
                     has a past!



the future of the past
             is not
     the past of the future



Do you remember the
                       future?



Those are
                                    yesterday’s
                                      answers

and also
                                     the seeds
                                    of tomorrow

Guys wearing
                                      Fedoras



Hadoop has
                                     a history


Hadoop also
                                       has a
                                      future

The Old Future of Hadoop

     Implementing yet another Google paper
       –   Map-reduce and HDFS, and Yarn and Tez
       –   more and more, but not really different


     Eco-system additions (more Google papers)
       –   simpler programming (Hive and Pig and Crunch) (Sawzall, FlumeJava, etc.)
       –   key-value store (Bigtable)
       –   ad hoc query (Dremel)
       –   also not really different


     Stands apart from other computing
       –   required by HDFS and other limitations

The New Future of Hadoop

     Real-time processing
       –   Combines real-time and long-time


     Integration with traditional IT
       –   No need to stand apart


     Integration with new technologies
       –   Solr, Node.js, Twisted all should work directly on Hadoop


     Fast and flexible computation
       –   Drill logical plan language


Example #1
                                    Search Abuse


History matrix

                                    One row per user

                                    One column per thing




Recommendation based on
                                    cooccurrence

                                    Cooccurrence gives item-item
                                    mapping

                                    One row and column per thing
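
A minimal sketch of how a cooccurrence matrix could be derived from the history matrix, assuming the history is small enough to hold in memory as one set of items per user. The slides attribute this step to Mahout; the function name and the toy data below are hypothetical and only illustrate the idea of counting item pairs that share a user.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for every pair of distinct items, how many users touched both.

    `history` maps a user id to the set of items in that user's row of the
    history matrix. The result is symmetric: counts[a][b] == counts[b][a].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        # Each unordered pair of items in one user's row cooccurs once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Hypothetical toy history: one entry per user, one item id per "thing".
history = {
    "u1": {"apple", "bread", "cheese"},
    "u2": {"apple", "bread"},
    "u3": {"bread", "cheese"},
}
cooc = cooccurrence(history)
```

The raw counts are kept deliberately simple here; a production system would normally filter or score them before using them as recommendations, which this sketch does not attempt.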




Cooccurrence matrix can also be
                                    implemented as a search index
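
One way to picture the search-index idea, as a sketch only: treat each item as a document whose indexed terms are the items it cooccurs with, then use a user's history as the query. The tiny inverted index below stands in for Solr; `build_index`, `recommend`, and the sample indicators are hypothetical names and data, not part of any Solr or Mahout API.

```python
from collections import defaultdict


def build_index(indicators):
    """Invert item -> indicator terms into term -> set of items."""
    index = defaultdict(set)
    for item, terms in indicators.items():
        for term in terms:
            index[term].add(item)
    return index


def recommend(index, user_history, limit=3):
    """Score each item by how many of the user's items hit its indicators."""
    scores = defaultdict(int)
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:  # do not recommend what the user already has
                scores[item] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:limit]]


# Hypothetical indicators: for each item, the items it cooccurs with.
indicators = {
    "apple": {"bread", "cheese"},
    "bread": {"apple", "cheese"},
    "cheese": {"apple", "bread"},
    "wine": {"cheese"},
}
index = build_index(indicators)
recs = recommend(index, {"apple", "cheese"})
```

The design point is that recommendation becomes an ordinary search query, so whatever serves search can also serve recommendations.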




[Diagram labels: Complete history; Cooccurrence (Mahout); Solr indexing; Solr Indexer (x2); Item meta-data; Index shards]




[Diagram labels: User history; Web tier; Solr search; Solr Indexer (x2); Item meta-data; Index shards]




Objective Results

     At a very large credit card company


     History is all transactions, all web interaction


     Processing time cut from 20 hours per day to 3


     Recommendation engine load time decreased from 8 hours to 3
      minutes




Scaling Estimates – Twitter Firehose

     Old School – 8+ separate clusters, 20-25 nodes
       –   >3 Kafka nodes
       –   >2 TwitterLogger
       –   5-10 Hadoop
       –   >3 Storm
       –   3 zookeepers (or not?)
       –   NAS for web storage
       –   >2 web servers

     MapR – one platform
       –   5-10 nodes total, any node does any job
       –   Full HA included, backups included, disaster recovery included




Example #2
                         Web Technology


[Diagram labels: Real-time data; Fast analysis (Storm); Analytic output; Raw logs]




[Diagram labels: Large analysis (map-reduce); Analytic output; Raw logs]




[Diagram labels: Browser query; Presentation tier (d3 + node.js); Analytic output; Raw logs]
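
The three diagrams above share the same raw logs and analytic output. As a hedged sketch of what "real-time + long-time" could look like in code, the example below assumes the large analysis covers everything up to a cutoff and the fast analysis covers only newer events, with the presentation tier adding the two. The cutoff scheme, function names, and log records are assumptions made for illustration; the slides do not specify how the two outputs are combined.

```python
from collections import Counter


def batch_counts(raw_log, cutoff):
    """Long-time path: count every event up to `cutoff` (map-reduce stand-in)."""
    return Counter(page for ts, page in raw_log if ts <= cutoff)


def realtime_counts(raw_log, cutoff):
    """Real-time path: count only events newer than `cutoff` (Storm stand-in)."""
    return Counter(page for ts, page in raw_log if ts > cutoff)


def query(batch, realtime):
    """Presentation tier: answer from both analytic outputs combined."""
    return batch + realtime


# Hypothetical raw log of (timestamp, page) records.
raw_log = [(1, "/home"), (2, "/buy"), (3, "/home"), (4, "/home"), (5, "/buy")]
cutoff = 3
totals = query(batch_counts(raw_log, cutoff), realtime_counts(raw_log, cutoff))
```

Because both paths read the same raw logs, the combined answer matches a full recount, which is the property the sketch is meant to show.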




Old School Storm: Complex architecture
[Diagram labels: Twitter; Twitter API; TwitterLogger; Kafka API; Kafka Cluster (x3); Kafka; Storm (x2); Flume; HDFS Data; Hadoop; Web Data; NAS; http; Web-server]
MapR: One Platform with Streaming Writes
[Diagram labels: Twitter; Twitter API; TwitterLogger; Catcher (x2); Storm; Web-server; http; NFS (x4); Optional; HDFS API; MapReduce; Topic Queue; Web Data; MapR]

Users can also run extended analytics/MapReduce on the stored data

Objective Results

     Real-time + long-time analysis is seamless


     Web tier can be rooted directly on Hadoop cluster


     No need to move data




The future is
                               not what we
                                thought it
                                 would be

It is better!



Get Involved!


                                       Tweet:
                                    #strataconf
                                        #mapr
                                    @ted_dunning


Get Involved!

     Join Apache Drill!
       – drill-dev-subscribe@incubator.apache.org
       – Follow @apachedrill


     Join MapR!
       –   jobs@mapr.com

     Download these slides
       –   http://www.mapr.com/company/events/strata-conference-2-2-27-13

     Contact me:
       – tdunning@maprtech.com
       – tdunning@apache.org
       – @ted_dunning




Editor's Notes

  1. Take all of Twitter: 400 x 10^6 tweets per day < 400 GB per day < 40 MB/s
  2. Kafka is a message queuing system.
  3. Catcher is a processor. All of the systems can be run out of Hadoop. Warden can be configured to run Storm as well. Simple architecture – all from one platform. The green blocks are data that is available for other analytics.
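
The arithmetic behind note 1 can be checked with a few lines. This is a sketch that assumes decimal units (1 GB = 10^9 bytes) and a uniform rate over the day; the variable names are illustrative only.

```python
TWEETS_PER_DAY = 400 * 10**6         # figure from note 1
BYTES_PER_DAY_BOUND = 400 * 10**9    # "< 400 GB per day", decimal GB assumed
SECONDS_PER_DAY = 24 * 60 * 60

# Upper bound on the average size of one tweet record implied by the note.
bytes_per_tweet_bound = BYTES_PER_DAY_BOUND / TWEETS_PER_DAY

# Average ingest rate if the daily bound were spread evenly over the day.
mb_per_second = BYTES_PER_DAY_BOUND / SECONDS_PER_DAY / 10**6

print(f"{bytes_per_tweet_bound:.0f} bytes per tweet, {mb_per_second:.2f} MB/s average")
```

Under these assumptions the bound corresponds to about 1 KB per tweet and an average rate below 5 MB/s, so the 40 MB/s figure in the note is a generous upper bound.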