Scaling Out
Hadoop and NoSQL

    Age Mooij

An Introduction to Dealing with
Big Data
About me...




              @agemooij
Big Data
  ...and me
My Current Project...




           IP Address Registration for
           Europe, Middle East, Russia

           IPv4: 2^32  (4.3×10^9) addresses
           IPv6: 2^128 (3.4×10^38) addresses
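
The address counts above follow directly from the address widths: 32 bits for IPv4 and 128 bits for IPv6. A quick check, as a sketch (the constant names are only illustrative):

```python
# Address-space sizes implied by the address widths.
IPV4_BITS = 32
IPV6_BITS = 128

ipv4_addresses = 2 ** IPV4_BITS
ipv6_addresses = 2 ** IPV6_BITS

# Print both in scientific notation to compare with the slide.
print(f"IPv4: {ipv4_addresses:.1e} addresses")
print(f"IPv6: {ipv6_addresses:.1e} addresses")
```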
Challenge

10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)

                30 billion records per year (4 TB)
                80 million per day / 1,000 per second




        Make it searchable...
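
The per-day and per-second rates on this slide are rounded. Starting from the yearly figure, and assuming a 365-day year with evenly spread load, the arithmetic looks roughly like this:

```python
RECORDS_PER_YEAR = 30_000_000_000  # yearly figure quoted on the slide

# Assumption: records arrive evenly over a 365-day year.
records_per_day = RECORDS_PER_YEAR / 365
records_per_second = records_per_day / (24 * 60 * 60)

print(f"{records_per_day / 1e6:.0f} million records per day")
print(f"{records_per_second:.0f} records per second")
```

Both results land close to the rounded "80 million per day / 1,000 per second" quoted above.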
Big Data
  ...and you
Google, Yahoo, Amazon, eBay, Wikipedia, Digg, Hyves, YouTube

Facebook: 300M users
MySpace: 264M users
LinkedIn: 50M users
Twitter: 45M users
Flickr: 32M users
Marktplaats: 6.5M users, 5.5M ads
Scalability:

         Handling more load / requests
             Handling more data
          Handling more types of data



  ...without anything breaking or falling over
         ...and without going bankrupt
Scaling UP
    vs
Scaling OUT
Scaling Out, Part 1

Processing Data
  a.k.a. Data Crunching
Map/Reduce

 Parallel Batch Processing of Data
     Break the data into chunks
       Distribute the chunks
    Process the chunks in parallel
         Merge the results
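
The four steps above can be sketched on a single machine. This is only an illustration of the idea, not Hadoop's API: threads stand in for cluster nodes, and the function names and sample addresses are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the data into chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Process one chunk: count addresses per first octet."""
    return Counter(address.split(".")[0] for address in chunk)


def merge(left, right):
    """Merge two partial results into one."""
    return left + right


def map_reduce(records, chunk_size=2, workers=4):
    chunks = split_into_chunks(records, chunk_size)
    # Threads stand in for cluster nodes; a real framework would ship
    # each chunk to a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(merge, partial_counts, Counter())


# Hypothetical sample data.
addresses = ["193.0.0.1", "193.0.6.139", "10.0.0.1", "10.1.2.3", "193.0.19.25"]
totals = map_reduce(addresses)
print(dict(totals))
```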
Hadoop

Reliable, Scalable, Distributed Computing
           (written in Java)
Distributed File System (DFS)

    Foundation for all Hadoop projects
        Automatic file replication
Automatic checksumming / error correction
   Based on Google’s File System (GFS)
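
To make replication and checksumming concrete, here is a toy in-memory model. It is not the Hadoop DFS API; the class name, block size, and replication factor are assumptions chosen to keep the example small. Each block is stored on several nodes together with a CRC32 checksum, and a read skips any replica whose checksum no longer matches.

```python
import zlib

REPLICATION_FACTOR = 3  # assumed for this sketch
BLOCK_SIZE = 8          # bytes; tiny so the example stays readable


class ToyDfs:
    """Hypothetical in-memory stand-in for a distributed file system."""

    def __init__(self, node_count):
        # Each node maps block index -> (payload, checksum).
        self.nodes = [dict() for _ in range(node_count)]

    def write(self, data):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for index, block in enumerate(blocks):
            checksum = zlib.crc32(block)
            # Place each replica on a different node, round-robin.
            for replica in range(REPLICATION_FACTOR):
                node = self.nodes[(index + replica) % len(self.nodes)]
                node[index] = (block, checksum)
        return len(blocks)

    def read_block(self, index):
        # Return the first replica whose checksum still matches.
        for node in self.nodes:
            if index in node:
                block, checksum = node[index]
                if zlib.crc32(block) == checksum:
                    return block
        raise IOError(f"no intact replica of block {index}")

    def read(self, block_count):
        return b"".join(self.read_block(i) for i in range(block_count))


dfs = ToyDfs(node_count=4)
count = dfs.write(b"10 years of routing data")
# Simulate silent corruption of one replica of block 0.
good_block, good_checksum = dfs.nodes[0][0]
dfs.nodes[0][0] = (b"garbage!", good_checksum)
restored = dfs.read(count)
print(restored)
```

The corrupted replica is detected by its checksum and the read falls through to an intact copy on another node.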
Map / Reduce

Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-java languages
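
One route for non-Java languages is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines. The sketch below follows that convention but works on in-memory lists so it can be run locally; the input format (address first, then a date) is an assumption for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' for the first whitespace-separated field of each line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines):
    """Sum the values per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local dry run: the framework would normally sort between the two phases.
sample = ["193.0.0.1 2009-10-01", "10.0.0.1 2009-10-01", "193.0.0.1 2009-10-02"]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

In an actual streaming job each function would be wrapped in a small script that reads `sys.stdin` and prints to standard output.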
4TB of raw image TIFF data (stored in S3)
       100 Amazon EC2 instances
          Hadoop Map/Reduce
        11 million finished PDFs
         24 hours, about $240
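
The cost figure above can be sanity-checked by working out the implied price per instance-hour. This sketch assumes all 100 instances ran for the full 24 hours:

```python
INSTANCES = 100
HOURS = 24
TOTAL_COST_USD = 240

# Assumption: every instance is billed for the whole run.
instance_hours = INSTANCES * HOURS
implied_rate = TOTAL_COST_USD / instance_hours

print(f"{instance_hours} instance-hours at ${implied_rate:.2f} per instance-hour")
```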
Scaling Out, Part 2

Storing & Retrieving Data
       Reads and Writes
Relational Databases
are hard to scale out
Ways to Scale out an RDBMS (1)

Replication

    Master-Slave     Good for scaling reads
                     Single point of failure
                     Single bottleneck
    Master-Master    Limited scaling of writes
                     Complicated
Ways to Scale out an RDBMS (2)


                           Partitioning
Vertical   : by function / table
Horizontal : by key / id (Sharding)


     Not truly Relational anymore (application joins)
      Limited Scalability (relocating, resharding)
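
The resharding problem is easy to see with the simplest sharding scheme, hash modulo shard count. The sketch below uses made-up keys and an md5-based hash (chosen only because it is deterministic) and counts how many keys land on a different shard when a fifth shard is added to four.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard with a stable hash (md5 here, for determinism)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


keys = [f"user-{n}" for n in range(10_000)]  # hypothetical keys

before = {key: shard_for(key, 4) for key in keys}
after = {key: shard_for(key, 5) for key in keys}  # one shard added

moved = sum(1 for key in keys if before[key] != after[key])
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

With a uniform hash, about four in five keys are expected to move, which is why adding capacity to a sharded RDBMS tends to mean relocating most of the data.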
Why are RDBMSs
so hard to
scale out?
Brewer’s CAP Theorem

Consistency
Availability
Partition Tolerance   ...pick any two
Relational   Non-Relational

ACID    vs   BASE

Atomic       Basically
Consistent   Available
Isolated     Soft state
Durable      Eventual consistency
NoSQL             NO-SQL

 Non-Relational Databases

    Better Different
Types of NoSQL

(Distributed) Key-Value
        Redis
        Voldemort
        Scalaris (D)

Document Oriented
        CouchDB
        MongoDB
        Riak (D)

Column Oriented
        Cassandra (D)
        HBase (D)

Graph Oriented
        Neo4J

(D) = Distributed (automatic scaling out)
RIPE NCC
Experiences so far...
Those Big Numbers Again...


10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)

                  30 billion records per year (4 TB)
                  80 million per day / 1,000 per second




                       Make it searchable...
~ 200 000 000 000 records  ->  Map / Reduce  ->  ~ 15 000 000 000 records
Our Data is 3D

IP Address  1 --- 0..*  Record  1 --- 0..*  Timestamp

Best fit & performance: Column Oriented

Row    Column Name (!)    Values (!)
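
A column-oriented layout can be pictured as a nested map: row key, then column name, then values. The sketch below is one possible reading of the slide, not the project's actual schema. It assumes the IP address is the row key, the record itself is used as the column name, and the values are the timestamps at which that record was seen. The sample records are invented.

```python
from collections import defaultdict

# table[row_key][column_name] -> list of values
# Assumed mapping for this sketch: row key = IP address,
# column name = the record itself, values = timestamps.
table = defaultdict(lambda: defaultdict(list))


def store(ip_address, record, timestamp):
    table[ip_address][record].append(timestamp)


def lookup(ip_address):
    """Return {record: [timestamps]} for one address (a single-row read)."""
    return {record: sorted(stamps) for record, stamps in table[ip_address].items()}


# Hypothetical sample data.
store("193.0.0.1", "route via AS3333", "2009-10-02T00:00Z")
store("193.0.0.1", "route via AS3333", "2009-10-01T00:00Z")
store("193.0.0.1", "registered to RIPE NCC", "2009-10-01T00:00Z")

history = lookup("193.0.0.1")
print(history)
```

Putting data in the column name means the whole history of one address sits in one row, so a lookup by address is a single-row read.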
Cassandra

Used by: Facebook, Twitter, Digg

  Tunable: Availability vs Consistency
  Very active community
  Version 0.4.1
  No documentation
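
The availability versus consistency trade-off in Dynamo-style stores is usually explained with the quorum rule R + W > N: if the number of replicas that must acknowledge a read plus the number that must acknowledge a write exceeds the replica count, every read overlaps the latest write. The sketch below only illustrates that rule; the function and parameter names are made up and are not Cassandra's API.

```python
def reads_see_latest_write(replicas, write_acks, read_acks):
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    return read_acks + write_acks > replicas


N = 3  # assumed replication factor
for w, r in [(1, 1), (2, 2), (3, 1), (1, 3)]:
    mode = "consistent reads" if reads_see_latest_write(N, w, r) else "eventually consistent"
    print(f"N={N} W={w} R={r}: {mode}")
```

Lower values of R and W keep the store available when nodes are down, at the price of possibly stale reads.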
HBase

Used by: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy

Built on top of Hadoop DFS
Very active community
Version 0.20.1
Good documentation
Initial Results:
   Tested on an EC2 cluster of 8 XLarge instances

3.8 B (23 GB)  ->  33 M (1 GB)
    5 hours

33 M (1 GB)  ->  15 GB
    Record duplication: 6x
    75 minutes
    44,000 inserts/second

"Needle in a haystack" full on-disk table scan:
    0.5 M records/second
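
The insert rate appears to follow from the other figures on this slide, if the 33 M figure counts records and each record is written 6 times during the 75-minute load. That reading is an assumption, but the arithmetic matches:

```python
RECORDS = 33_000_000
DUPLICATION_FACTOR = 6   # "Record duplication: 6x"
LOAD_MINUTES = 75

# Assumption: every record is inserted once per duplicate.
inserts = RECORDS * DUPLICATION_FACTOR
inserts_per_second = inserts / (LOAD_MINUTES * 60)

print(f"{inserts:,} inserts in {LOAD_MINUTES} minutes = {inserts_per_second:,.0f} inserts/second")
```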
In order to choose the right
  scaling tools, you need to:
       Understand your data
Know what you want to query and how
Big Data
   ...Be Prepared !
val shameless = <SelfPromotion>




    Try some Scala in the basement !



        </SelfPromotion>

Scaling Out With Hadoop And HBase
