A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.
5. My Current Project...
IP Address Registration for Europe, the Middle East and Russia
IPv4: 2^32 (4.3 × 10^9) addresses
IPv6: 2^128 (3.4 × 10^38) addresses
6. Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
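The per-day and per-second figures above follow from the yearly volume; a quick sanity check of that arithmetic (class and method names are just for illustration, and the slide's figures are rounded):

```java
// Sanity-checking the throughput numbers: 30 billion records per year
// works out to roughly 80 million per day and about 1,000 per second.
public class Throughput {
    // ~82 million per day for 30 billion per year
    public static long perDay(long perYear) {
        return perYear / 365;
    }

    // ~950 per second for 30 billion per year
    public static long perSecond(long perYear) {
        return perYear / 365 / 86400;
    }
}
```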
9. Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt
10. Scaling UP (one bigger machine) vs scaling OUT (many smaller machines)
14. Distributed File System (DFS)
Foundation for all Hadoop projects
Automatic file replication
Automatic checksumming / error correction
Based on Google’s File System (GFS)
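The automatic checksumming works roughly like this: the DFS stores a CRC32 checksum for every small chunk of a block and re-verifies it on read, falling back to another replica on a mismatch. A plain-Java sketch of the idea (not the actual Hadoop API; the 512-byte chunk size matches HDFS's default `io.bytes.per.checksum`, but class and method names here are illustrative):

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Sketch of per-chunk block checksumming as done by the Hadoop DFS.
// Names are illustrative, not the real Hadoop API.
public class ChunkChecksummer {
    static final int CHUNK_SIZE = 512; // HDFS default checksum chunk size

    // Compute one CRC32 value per 512-byte chunk of the data.
    public static long[] checksum(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, data.length);
            CRC32 crc = new CRC32();
            crc.update(data, from, to - from);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute and compare; a mismatch means the chunk is
    // corrupt and the client should read from another replica instead.
    public static boolean verify(byte[] data, long[] expected) {
        return Arrays.equals(checksum(data), expected);
    }
}
```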
15. Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
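The programming model behind that API boils down to two functions: map turns each input record into (key, value) pairs, the framework groups the pairs by key, and reduce aggregates each group. A plain-Java sketch of the model using the classic word count (no Hadoop dependencies; the real Hadoop Mapper/Reducer classes have a similar shape but run distributed across the cluster):

```java
import java.util.*;

// Plain-Java sketch of the map/reduce programming model: map emits
// (word, 1) pairs, the "framework" shuffles them by key, reduce sums.
public class WordCount {

    // Map phase: one input record in, zero or more (key, value) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: all values for one key in, one aggregate out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework": group pairs by key, then call reduce per key.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In real Hadoop the grouping step (the "shuffle") is what the powerful supporting framework does for you across machines; the map and reduce functions are all you write.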
17. 4TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240
20. Ways to Scale out an RDBMS (1)
Replication
Good for scaling reads
Master-Slave: single point of failure, single point of bottleneck
Master-Master: limited scaling of writes, complicated
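Why replication scales reads but not writes is easiest to see from the application's routing logic: every write must go to the one master, while reads can be spread over any number of slaves. A minimal sketch, with hypothetical host names and a round-robin read policy:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of query routing under master-slave replication: all writes hit
// the single master (the bottleneck and single point of failure), reads
// are spread round-robin over the slave replicas. Names are illustrative.
public class ReplicatedDataSource {
    private final String master;
    private final List<String> slaves;
    private final AtomicLong next = new AtomicLong();

    public ReplicatedDataSource(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    // Every write must go to the master -- adding slaves does not help.
    public String hostForWrite() {
        return master;
    }

    // Reads scale out: round-robin over the replicas.
    public String hostForRead() {
        int i = (int) (next.getAndIncrement() % slaves.size());
        return slaves.get(i);
    }
}
```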
21. Ways to Scale out an RDBMS (2)
Partitioning
Vertical : by function / table
Horizontal : by key / id (Sharding)
Not truly Relational anymore (application joins)
Limited Scalability (relocating, resharding)
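Horizontal partitioning in practice is a routing function from key to shard. A sketch of the common hash-based approach (shard names and count are illustrative); note that changing the number of shards remaps almost every key, which is exactly the relocating/resharding limitation above:

```java
// Sketch of horizontal partitioning (sharding): each row is routed to a
// shard by hashing its key. Because the shard index depends on
// shards.length, adding or removing a shard moves most keys -- the
// resharding problem. Names are illustrative.
public class ShardRouter {
    private final String[] shards;

    public ShardRouter(String... shards) {
        this.shards = shards;
    }

    public String shardFor(String key) {
        // Mask off the sign bit; Math.abs(Integer.MIN_VALUE) stays negative.
        int h = key.hashCode() & 0x7fffffff;
        return shards[h % shards.length];
    }
}
```

Once rows for one logical entity live on different shards, cross-shard joins have to be done in the application, which is why a sharded RDBMS is "not truly relational anymore".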
28. Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
29. ~200 000 000 000 records
→ Map/Reduce →
~15 000 000 000 records
30. Our Data is 3D
IP Address 1 → 0..* Record
Record 1 → 0..* Timestamp
Best fit & performance: Column-Oriented
Row → Column Name (!) → Values (!)
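In a column-oriented store the three dimensions map naturally: row key = IP address, column name = record, and each cell keeps multiple timestamped versions. A sketch of that "3D" model using in-memory nested maps as a stand-in for the store (all names are illustrative, not our actual schema):

```java
import java.util.*;

// Sketch of the "3D" data model in a column-oriented (BigTable/HBase
// style) store: row (IP) -> column (record id) -> timestamp -> value.
// In-memory maps stand in for the store; names are illustrative.
public class RegistrationTable {
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows =
            new HashMap<>();

    public void put(String ip, String recordId, long timestamp, String value) {
        rows.computeIfAbsent(ip, k -> new HashMap<>())
            .computeIfAbsent(recordId, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // "What did this record look like at time t?" -- the latest version
    // at or before the given timestamp, or null if none existed yet.
    public String getAsOf(String ip, String recordId, long timestamp) {
        Map<String, NavigableMap<Long, String>> row = rows.get(ip);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(recordId);
        if (versions == null) return null;
        Map.Entry<Long, String> e = versions.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }
}
```

The timestamp dimension is exactly what makes the historical data searchable: point-in-time queries become a single sorted lookup per cell instead of a scan.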
31. Cassandra
Used by: Facebook, Twitter, Digg
Tunable: Availability vs Consistency
Very active community
Version: 0.4.1
No documentation
32. HBase
Used by: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy
Built on top of Hadoop DFS
Very active community
Version: 0.20.1
Good documentation
33. Initial Results:
Tested on an EC2 cluster of 8 XLarge instances
Map/Reduce: 3.8 B records (23 GB) → 33 M records (1 GB) in 5 hours
HBase import: 33 M records (1 GB) → 15 GB on disk (record duplication: 6×) in 75 minutes
Insert rate: 44,000 inserts/second
“Needle in a haystack” full on-disk table scan: 0.5 M records/second
34. In order to choose the right scaling tools, you need to:
Understand your data
Know what you want to query and how