NEARING THE EVENT HORIZON.
HADOOP WAS PREDICTABLE, WHAT’S NEXT?




            May 23, 2012       Mike Miller
                              mike@cloudant.com
                                @mlmilleratmit
What I Am

    Cloudant Founder, Chief Scientist
    (we’re hiring at all positions)

    Affiliate Assistant Professor, Particle Physics (UW)

    Background: machine learning, analysis, big data,
    globally distributed systems




Mike Miller, GlueCon May 2012                           2
What I Am




                                A CDN for your Application Data
What I Am Not


                                didn’t see these coming
                                Superluminal neutrinos
                                Red Sox epic collapse in September
                                Red Wings losing in the first round
                                ...

                                But here I go anyway




My First Postulate of Big-Data

                                     Google Matters

           What matters for Google...
           ... matters for the internet...
           ...and therefore matters for the enterprise...
           ... will therefore be re-architected by Apache...
           ... and therefore matters to you.




Evidence




               Business Week, 12/24/2007




The Old Canon
         • Google File System (the important one)
           http://labs.google.com/papers/gfs.html

         • MapReduce (the big one)
           http://labs.google.com/papers/mapreduce.html

         • BigTable (clone me!)
           http://labs.google.com/papers/bigtable.html

         • Dynamo (ok, AWS. but masterless quorum)
           http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf



                                copy these. use these. print $$$
MapReduce: The Awesome
         • Approachable interface
           “What do I do with a single piece of data?”

         • Data Parallel
           Developers can basically forget about scatter-gather

         • Fault Tolerant
           Failure at scale is the norm!
           Protects both user and system operator

         • IO Optimized
           Built for sequential IO
           commodity disks spinning forward at O(20 MB/sec) each
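
A minimal sketch of the "single piece of data" interface, assuming a toy in-memory runner. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and are not Hadoop's API; the point is only that the developer writes per-record and per-key logic while the framework owns the scatter-gather.

```python
from collections import defaultdict


def map_fn(line):
    """Per-record logic: emit (word, 1) for every word in one line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Per-key logic: combine all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process stand-in for the framework (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:                 # "map" phase, one record at a time
        for key, value in map_fn(record):
            groups[key].append(value)      # "shuffle": group values by key
    # "reduce" phase, one key at a time
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = run_mapreduce(["big data", "Big graphs"], map_fn, reduce_fn)
print(counts)
```

In a real cluster the map and reduce calls run on many machines and the grouping step moves data over the network; none of that appears in the two user-written functions.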




So... is that it?




   http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
   http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/
   http://mackiemathew.com/2012/02/25/the-problems-in-hadoop-when-does-it-fail-to-deliver/
MapReduce: The not so Awesome
         • Hadoop doesn’t power big data applications
           Not a transactional datastore. Slosh back and forth via ETL

         • Processing latency
           Non-incremental, must re-slurp entire dataset every pass

         • Ad-Hoc queries
           Bare metal interface, data import

         • Graphs
           Only a handful of graph problems amenable to MR
           http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120




To the Event Horizon




Enter The New Canon
         • Percolator
           incremental processing
           http://research.google.com/pubs/pub36726.html

         • Dremel
           ad-hoc analysis queries
           http://research.google.com/pubs/pub36632.html

         • Pregel
           Big graphs
           http://dl.acm.org/citation.cfm?id=1807184


                                Scalable, Fault Tolerant, Approachable

Percolator




Percolator: incremental processing
         • Replaced MapReduce as the tool to build the search index
           “However, reprocessing the entire web discards the work done in earlier runs and makes latency
           proportional to the size of the repository, rather than the size of the update.”

         • Bigtable alone can’t do it
           “BigTable scales...but doesn’t provide tools to help programmers maintain data invariants in the
           face of concurrent updates.”

         • Applicability
           Incrementally updating data
           Computational output can be broken down into small pieces
           Computation large in some dimension (data size, CPU, etc.)

         • Does it matter?
           “...Converting the indexing system to an incremental system ... reduced the average document
           processing latency by a factor of 100...”
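
To make "latency proportional to the size of the update" concrete, here is a small sketch that contrasts a batch rebuild with an incremental update of a word-count index. `full_rebuild` and `apply_update` are hypothetical helpers for illustration only; they are not how Percolator is implemented.

```python
from collections import Counter


def full_rebuild(docs):
    """Batch style: rescan every document on every pass."""
    index = Counter()
    for text in docs.values():
        index.update(text.split())
    return index


def apply_update(index, docs, doc_id, new_text):
    """Incremental style: touch only the counts affected by one document."""
    old_words = docs.get(doc_id, "").split()
    index.subtract(old_words)          # retract the old contribution
    index.update(new_text.split())     # add the new contribution
    for word in set(old_words):        # drop words whose count fell to zero
        if index[word] <= 0:
            del index[word]
    docs[doc_id] = new_text
    return index


docs = {"a": "big data", "b": "big graphs"}
index = full_rebuild(docs)
apply_update(index, docs, "b", "big tables")
print(index == full_rebuild(docs))
```

The incremental path does work proportional to the changed document, while the rebuild does work proportional to the whole repository, yet both arrive at the same index.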


Percolator: incremental processing
  • BigTable plus...
    Multi-row ACID Transactions
    snapshot isolation, lazy locks
    up to 10s write latencies

    Timestamps
    Start Timestamp (read)
    Commit Timestamp (write)

    Notifications
    Do not maintain invariants

    Observer Framework
    your code to be run upon notification of an update
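
As a rough illustration of how timestamps, transactions, and observers fit together, here is a toy single-process sketch. `ToyTable` and its methods (`begin`, `read`, `commit`, `observe`) are hypothetical names, not Percolator's API. Two simplifications are assumed: observers run synchronously inside `commit`, and locking is reduced to a write-write conflict check against the start timestamp.

```python
import itertools


class ToyTable:
    """In-memory multi-version table with observers (illustrative only)."""

    def __init__(self):
        self._clock = itertools.count(1)   # stand-in for a timestamp oracle
        self._versions = {}                # (row, col) -> [(commit_ts, value)]
        self._observers = {}               # col -> [callback]

    def observe(self, col, callback):
        self._observers.setdefault(col, []).append(callback)

    def begin(self):
        return next(self._clock)           # start timestamp fixes the read snapshot

    def read(self, start_ts, row, col):
        """Return the newest value committed before start_ts, else None."""
        visible = [(ts, v) for ts, v in self._versions.get((row, col), [])
                   if ts < start_ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None

    def commit(self, start_ts, writes):
        """Apply all writes at one commit timestamp, or reject on conflict."""
        for cell in writes:
            if any(ts > start_ts for ts, _ in self._versions.get(cell, [])):
                return False               # another writer committed after our snapshot
        commit_ts = next(self._clock)
        for cell, value in writes.items():
            self._versions.setdefault(cell, []).append((commit_ts, value))
        for row, col in writes:            # notify observers of written columns
            for callback in self._observers.get(col, []):
                callback(self, row)
        return True


def count_words(table, row):
    """Observer: derive a word count in its own transaction."""
    ts = table.begin()
    text = table.read(ts, row, "raw")
    table.commit(ts, {(row, "words"): len(text.split())})


table = ToyTable()
table.observe("raw", count_words)
t1, t2 = table.begin(), table.begin()
ok1 = table.commit(t1, {("doc1", "raw"): "big data"})
ok2 = table.commit(t2, {("doc1", "raw"): "other text"})   # conflicts with t1's commit
words = table.read(table.begin(), "doc1", "words")
print(ok1, ok2, words)
```

The second transaction is rejected because a newer version of the same cell was committed after its snapshot was taken, and the derived `words` column appears without any batch pass over the table.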


Percolator: incremental processing




                                Near Linear Scaling to 15k Cores
Percolator: incremental processing




                                Latency lower than MapReduce by 100x
Dremel




Dremel: ad-hoc Query
         • Scalable, interactive ad-hoc query system for read-only nested data
           “...capable of running aggregation queries over trillion-row tables in seconds.”

         • ... on nested data structures in situ
           Web and scientific data is often non-relational
           nested data (protobufs) underlies most structured data at Google

         • Usage
           DEFINE TABLE t AS /path/to/data/*
           SELECT TOP(signal1,100), COUNT(*) FROM t

         • Applicability
           Analysis of crawled documents
           Tracking of install data for apps on Android Market
           Crash reports
           Spam analysis...

                                                      Dream BI Tool
Dremel: ad-hoc Query
 • Ingredients
   In situ data
   SQL-like interface
   Serving trees for query execution
   Column striped data (3-10x)
   Analysis Catalogs
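
A small sketch of why column striping helps aggregation queries, assuming flat records that all share the same fields. Dremel's actual format also encodes nesting and repetition, which this toy omits; `to_columns` and the sample records are hypothetical.

```python
records = [
    {"url": "a.com", "lang": "en", "bytes": 120},
    {"url": "b.de", "lang": "de", "bytes": 300},
    {"url": "c.com", "lang": "en", "bytes": 80},
]


def to_columns(records):
    """Stripe row-oriented records into one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(records)

# Row layout: the scan walks every record, including fields it never uses.
total_by_record = sum(record["bytes"] for record in records)

# Column layout: the scan reads a single contiguous column and nothing else.
total_by_column = sum(columns["bytes"])

print(total_by_record, total_by_column)
```

Both scans return the same total; the difference is how much data has to come off disk when a query touches only a few of many fields.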




Dremel: ad-hoc Query




                                Columns ~10x faster than Records
Dremel: ad-hoc Query



                Benchmark Data   MapReduce (via Sawzall)




                                       Dremel (via SQL)

Dremel: ad-hoc Query



                                     Significant Optimization Possible


 Dremel ~100x Faster than Stock MR




Dremel: ad-hoc Query




                          Most Production Queries Executed in <10 seconds

Pregel




Pregel: Big Graphs
         • Massively parallel processing of big graphs
           billions of vertices, trillions of edges

         • Bulk synchronous parallel model
           sequence of vertex-oriented iterations
           send/receive messages from other vertex computations
           read/modify state of vertex, outgoing edges, graph topology

         • Expressive, easy to program
           distribution details hidden behind abstract API

         • Iterative
           computation continues until each vertex votes to terminate

         • In production
           PageRank in 15 lines of code
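
The bulk synchronous parallel loop can be sketched in a single process as follows. This is a toy under stated assumptions, not Pregel's API: `run_supersteps` and `propagate_max` are hypothetical names, and the example propagates the maximum value through the graph rather than computing PageRank, because it terminates with an exact answer.

```python
def run_supersteps(graph, values, compute, max_supersteps=50):
    """graph: vertex -> out-neighbors; values: vertex -> state (mutated)."""
    inbox = {v: [] for v in graph}
    active = set(graph)
    for superstep in range(max_supersteps):
        if not active:
            break                          # every vertex has voted to halt
        outbox = {v: [] for v in graph}
        halted = set()
        for v in active:
            values[v], messages, halt = compute(
                superstep, values[v], inbox[v], graph[v])
            for target, message in messages:
                outbox[target].append(message)
            if halt:
                halted.add(v)
        inbox = outbox                     # barrier: messages arrive next superstep
        # a halted vertex is reactivated when it receives a message
        active = (active - halted) | {v for v, msgs in inbox.items() if msgs}
    return values


def propagate_max(superstep, value, messages, out_edges):
    """Vertex program: adopt the largest value seen and tell the neighbors."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        return best, [(n, best) for n in out_edges], False
    return value, [], True                 # nothing new: vote to halt


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = run_supersteps(graph, {"a": 3, "b": 6, "c": 1}, propagate_max)
print(result)
```

The vertex program sees only its own state, its incoming messages, and its outgoing edges; partitioning and message delivery stay inside the runner.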


Pregel: Big Graphs
  • Master “Name” node
    connects processes for messaging

  • Message Passing
    no remote procedure calls or remote reads

  • Graph hashed across nodes
    vertex, outgoing edges stored in RAM

  • Aggregators
    global mechanism for aggregation
    all but final reduce computed on node-local data

  • Checkpointing
    configurable, enables automatic recovery
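
The aggregator idea, where everything except the final reduce happens on node-local data, can be sketched as a two-level reduction. `local_aggregate` and `global_aggregate` are hypothetical helpers, the partitions are made-up sample data, and the operator is assumed to be associative and commutative.

```python
from functools import reduce


def local_aggregate(partition, op):
    """Runs on each worker, over that worker's own vertex values only."""
    return reduce(op, partition)


def global_aggregate(partitions, op):
    """Combine one partial result per worker; only this step crosses nodes."""
    partials = [local_aggregate(p, op) for p in partitions if p]
    return reduce(op, partials)


# One list of vertex values per worker node (sample data).
partitions = [[3, 9, 1], [7, 2], [8]]

global_max = global_aggregate(partitions, max)
global_sum = global_aggregate(partitions, lambda a, b: a + b)
print(global_max, global_sum)
```

Only one value per worker travels to the final reduce, regardless of how many vertices each worker holds.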


Pregel: Big Graphs




Pregel: Big Graphs




                                Near Linear Scaling to 1B nodes
Learn More
         • Incremental Processing
           Incremental, in-database map/reduce in Cloudant’s BigCouch
           HBase 0.92 supports observers/coprocessors
           Stream processing via Storm, HStreaming, etc.

         • Ad Hoc Query
           Google BigQuery
           Column stores (Vertica, etc)
           OpenDremel (stalled?)
           ?

         • Big Graphs
           Giraph on Hadoop (Apache Incubator)
           Golden Orb (stalled?)


Lessons Learned


 • Hire Jeff Dean and Sanjay Ghemawat
 • GFS enables everything
 • There is massive opportunity on the horizon




