A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.
5. My Current Project...
IP Address Registration for Europe, the Middle East and Russia
IPv4: 2^32 (4.3 × 10^9) addresses
IPv6: 2^128 (3.4 × 10^38) addresses
6. Challenge
10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
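The per-day and per-second figures above follow from the yearly volume; a quick sanity check of that arithmetic (class and method names are just for illustration, and the slide's figures are rounded):

```java
// Sanity-checking the throughput numbers: 30 billion records per year
// works out to roughly 80 million per day and about 1,000 per second.
public class Throughput {
    // ~82 million per day for 30 billion per year
    public static long perDay(long perYear) {
        return perYear / 365;
    }

    // ~950 per second for 30 billion per year
    public static long perSecond(long perYear) {
        return perYear / 365 / 86400;
    }
}
```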
9. Scalability:
Handling more load / requests
Handling more data
Handling more types of data
...without anything breaking or falling over
...and without going bankrupt
10. Scaling UP (one bigger machine) vs scaling OUT (many smaller machines)
14. Distributed File System (DFS)
Foundation for all Hadoop projects
Automatic file replication
Automatic checksumming / error correction
Based on Google’s File System (GFS)
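The automatic checksumming works roughly like this: the DFS stores a CRC32 checksum for every small chunk of a block and re-verifies it on read, falling back to another replica on a mismatch. A plain-Java sketch of the idea (not the actual Hadoop API; the 512-byte chunk size matches HDFS's default `io.bytes.per.checksum`, but class and method names here are illustrative):

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Sketch of per-chunk block checksumming as done by the Hadoop DFS.
// Names are illustrative, not the real Hadoop API.
public class ChunkChecksummer {
    static final int CHUNK_SIZE = 512; // HDFS default checksum chunk size

    // Compute one CRC32 value per 512-byte chunk of the data.
    public static long[] checksum(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, data.length);
            CRC32 crc = new CRC32();
            crc.update(data, from, to - from);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read, recompute and compare; a mismatch means the chunk is
    // corrupt and the client should read from another replica instead.
    public static boolean verify(byte[] data, long[] expected) {
        return Arrays.equals(checksum(data), expected);
    }
}
```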
15. Map / Reduce
Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
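The programming model behind that API boils down to two functions: map turns each input record into (key, value) pairs, the framework groups the pairs by key, and reduce aggregates each group. A plain-Java sketch of the model using the classic word count (no Hadoop dependencies; the real Hadoop Mapper/Reducer classes have a similar shape but run distributed across the cluster):

```java
import java.util.*;

// Plain-Java sketch of the map/reduce programming model: map emits
// (word, 1) pairs, the "framework" shuffles them by key, reduce sums.
public class WordCount {

    // Map phase: one input record in, zero or more (key, value) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: all values for one key in, one aggregate out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework": group pairs by key, then call reduce per key.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In real Hadoop the grouping step (the "shuffle") is what the powerful supporting framework does for you across machines; the map and reduce functions are all you write.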
17. 4TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240
20. Ways to Scale out an RDBMS (1)
Replication
Good for scaling reads
Master-Slave: single point of failure, single point of bottleneck
Master-Master: limited scaling of writes, complicated
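Why replication scales reads but not writes is easiest to see from the application's routing logic: every write must go to the one master, while reads can be spread over any number of slaves. A minimal sketch, with hypothetical host names and a round-robin read policy:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of query routing under master-slave replication: all writes hit
// the single master (the bottleneck and single point of failure), reads
// are spread round-robin over the slave replicas. Names are illustrative.
public class ReplicatedDataSource {
    private final String master;
    private final List<String> slaves;
    private final AtomicLong next = new AtomicLong();

    public ReplicatedDataSource(String master, List<String> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    // Every write must go to the master -- adding slaves does not help.
    public String hostForWrite() {
        return master;
    }

    // Reads scale out: round-robin over the replicas.
    public String hostForRead() {
        int i = (int) (next.getAndIncrement() % slaves.size());
        return slaves.get(i);
    }
}
```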
21. Ways to Scale out an RDBMS (2)
Partitioning
Vertical : by function / table
Horizontal : by key / id (Sharding)
Not truly Relational anymore (application joins)
Limited Scalability (relocating, resharding)
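Horizontal partitioning in practice is a routing function from key to shard. A sketch of the common hash-based approach (shard names and count are illustrative); note that changing the number of shards remaps almost every key, which is exactly the relocating/resharding limitation above:

```java
// Sketch of horizontal partitioning (sharding): each row is routed to a
// shard by hashing its key. Because the shard index depends on
// shards.length, adding or removing a shard moves most keys -- the
// resharding problem. Names are illustrative.
public class ShardRouter {
    private final String[] shards;

    public ShardRouter(String... shards) {
        this.shards = shards;
    }

    public String shardFor(String key) {
        // Mask off the sign bit; Math.abs(Integer.MIN_VALUE) stays negative.
        int h = key.hashCode() & 0x7fffffff;
        return shards[h % shards.length];
    }
}
```

Once rows for one logical entity live on different shards, cross-shard joins have to be done in the application, which is why a sharded RDBMS is "not truly relational anymore".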
28. Those Big Numbers Again...
10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)
30 billion records per year (4 TB)
80 million per day / 1,000 per second
Make it searchable...
29. ~200 000 000 000 records
→ Map/Reduce →
~15 000 000 000 records
30. Our Data is 3D
IP Address 1 → 0..* Record
Record 1 → 0..* Timestamp
Best fit & performance: Column-Oriented
Row → Column Name (!) → Values (!)
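In a column-oriented store the three dimensions map naturally: row key = IP address, column name = record, and each cell keeps multiple timestamped versions. A sketch of that "3D" model using in-memory nested maps as a stand-in for the store (all names are illustrative, not our actual schema):

```java
import java.util.*;

// Sketch of the "3D" data model in a column-oriented (BigTable/HBase
// style) store: row (IP) -> column (record id) -> timestamp -> value.
// In-memory maps stand in for the store; names are illustrative.
public class RegistrationTable {
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows =
            new HashMap<>();

    public void put(String ip, String recordId, long timestamp, String value) {
        rows.computeIfAbsent(ip, k -> new HashMap<>())
            .computeIfAbsent(recordId, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // "What did this record look like at time t?" -- the latest version
    // at or before the given timestamp, or null if none existed yet.
    public String getAsOf(String ip, String recordId, long timestamp) {
        Map<String, NavigableMap<Long, String>> row = rows.get(ip);
        if (row == null) return null;
        NavigableMap<Long, String> versions = row.get(recordId);
        if (versions == null) return null;
        Map.Entry<Long, String> e = versions.floorEntry(timestamp);
        return e == null ? null : e.getValue();
    }
}
```

The timestamp dimension is exactly what makes the historical data searchable: point-in-time queries become a single sorted lookup per cell instead of a scan.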
31. Cassandra
Used by: Facebook, Twitter, Digg
Tunable: Availability vs Consistency
Very active community
Version: 0.4.1
No documentation
32. HBase
Used by: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy
Built on top of Hadoop DFS
Very active community
Version: 0.20.1
Good documentation
33. Initial Results:
Tested on an EC2 cluster of 8 XLarge instances
Map/Reduce: 3.8 B records (23 GB) → 33 M records (1 GB) in 5 hours
HBase import: 33 M records (1 GB) → 15 GB on disk (record duplication: 6×) in 75 minutes
Insert rate: 44,000 inserts/second
“Needle in a haystack” full on-disk table scan: 0.5 M records/second
34. In order to choose the right scaling tools, you need to:
Understand your data
Know what you want to query and how