Five Lessons
in Distributed Databases
Jonathan Ellis
CTO, DataStax
1 © DataStax, All Rights Reserved. Confidential

Five Lessons in Distributed Databases


1. If it’s not SQL, it’s not a database.
2. It takes 5+ years to build a database.
3. Listen to your users.
4. Too much magic is a bad thing.
5. It’s the cloud, stupid.

Published in: Technology

Five Lessons in Distributed Databases

  1. Five Lessons in Distributed Databases
     Jonathan Ellis, CTO, DataStax
     © DataStax, All Rights Reserved. Confidential
  2. 1. If it’s not SQL, it’s not a database
  3. A brief history of NoSQL
     ● Early 2000s: people hit limits on vertical scaling, start sharding RDBMSes
     ● 2006, 2007: BigTable, Dynamo papers
     ● 2008-2010: Explosion of scale-out systems
       ○ Voldemort, Riak, Dynomite, FoundationDB, CouchDB
       ○ Cassandra, HBase, MongoDB
  4. One small problem
  5. Cassandra’s experience
     ● Thrift RPC “drivers” too low level
     ● Fragmented: Hector, Pelops, Astyanax
     ● Inconsistent across language ecosystems
  6. [image-only slide]
  7. Solution: CQL
     ● 2011: Cassandra 0.8 introduces CQL 1.0
     ● 2012: Cassandra 1.1 introduces CQL 3.0
     ● 2013: Cassandra 1.2 adds collections
  8. Today
     ● Cassandra: CQL
     ● CosmosDB: “SQL”
     ● Cloud Spanner: “SQL”
     ● Couchbase: N1QL
     ● HBase: Phoenix SQL (Java only)
     ● DynamoDB: REST/JSON
     ● MongoDB: BSON
  9. 2. It takes 5+ years to build a database
  10. Curt Monash
      Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars. That’s if things go extremely well.
      Rule 2: You aren’t an exception to Rule 1.
  11. Aside: Mistakes I made starting DataStax
      ● Stayed at Rackspace too long
      ● Raised a $2.5M series A
      ● Waited a year to get serious about enterprise sales
      ● Changed the company name
      ● Brisk
  12. Examples (Curt)
      ● Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
      ● Mixed workload management is harder than you’re assuming it is.
      ● Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.
  13. Examples (Cassandra)
      ● Hinted handoff
      ● Repair
      ● Counters
      ● Paxos
      ● Test suite
  14. Aside: Fallout (Jepsen at Scale)
      ● Ensemble - A set of clusters that is brought up/torn down each test
        ○ Server Cluster - Cassandra/DSE
        ○ Client Cluster - Load Generators
        ○ Observer Cluster - Records live information from clusters (OpsCenter/Graphite)
        ○ Controller - Fallout
      ● Workload - The guts of the test
        ○ Phases - Run sequentially. Each contains one or more modules that run in parallel for that phase
        ○ Checkers - Run after all phases and verify the data emitted from modules.
        ○ Artifact Checkers - Run against collected artifacts to look for correctness/problems
  15. A simple Fallout workload
      ensemble:
        server:
          node.count: 3
          provisioner:
            name: local
          configuration_manager:
            name: ccm
            properties:
              cassandra.version: 3.0.0
        client: server #use server cluster
      phases:
        - insert_workload:
            module: stress
            properties:
              iterations: 1m
              type: write
              rf: 3
          gossip_updown:
            module: nodetool
            properties:
              command: disablegossip
              secondary.command: enablegossip
              sleep.seconds: 10
              sleep.randomize: 20
        - read_workload:
            module: stress
            properties:
              iterations: 1m
              type: read
      checkers:
        verify_success:
          checker: nofail
      1. Start 3 node ccm cluster.
      2. Insert data while bringing gossip on the nodes up and down.
      3. Read/Check the data.
      4. Verify none of the steps failed.
      Note: to move from ccm to ec2 we only need to change the ensemble section.
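
The Fallout slides describe phases that run sequentially, modules that run in parallel within a phase, and checkers that run after all phases. A minimal sketch of those execution semantics follows. It is not Fallout's implementation: `run_workload`, `nofail`, and the lambda modules are hypothetical stand-ins, and only the phase, module, and checker names are borrowed from the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workload(phases, checkers):
    """Run phases one after another; modules inside a phase run in parallel."""
    emitted = []  # (phase_index, module_name, result) records handed to checkers
    for index, phase in enumerate(phases):
        # Leaving the `with` block waits for every module of this phase,
        # so the next phase never starts early.
        with ThreadPoolExecutor(max_workers=len(phase)) as pool:
            futures = {name: pool.submit(module) for name, module in phase.items()}
            for name, future in futures.items():
                emitted.append((index, name, future.result()))
    # Checkers only run once all phases have finished.
    return {name: check(emitted) for name, check in checkers.items()}


def nofail(emitted):
    """Hypothetical checker: passes when every module reported "ok"."""
    return all(result == "ok" for _, _, result in emitted)


# Stand-in modules; a real module would drive stress or nodetool.
phases = [
    {"insert_workload": lambda: "ok", "gossip_updown": lambda: "ok"},
    {"read_workload": lambda: "ok"},
]
verdict = run_workload(phases, {"verify_success": nofail})
print(verdict)
```

Keeping the cluster definition out of `run_workload` mirrors the slide's note that only the ensemble section changes when moving between environments.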
  16. 5-7 years?
      ● Cassandra became Apache TLP in Feb 2010
      ● 3.0 released Fall 2015
      ● OSS is about adoption, not saving time/money
  17. 3. The customer is always right
  18. Example: sequential scans
      SELECT * FROM user_purchases
      WHERE purchase_date > 2000
  19. What’s wrong with this query?
      For 100,000 purchases, nothing.
      For 100,000,000 purchases, you’ll crash the server (in 2012).
  20. Solution (2012): ALLOW FILTERING
      SELECT * FROM user_purchases
      WHERE purchase_date > 2000
      ALLOW FILTERING
  21. Better solution (2013): Paging
      ● Build resultset incrementally and “page” it to the client
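
The paging idea on the slide above can be illustrated with a small in-memory sketch. It assumes a hypothetical `paged_scan` helper and a generated stand-in for `user_purchases`; it does not show Cassandra's actual paging protocol or any driver API, only why building the result set incrementally keeps memory bounded by the page size.

```python
def paged_scan(rows, predicate, page_size):
    """Yield matching rows in pages of at most page_size rows."""
    page = []
    for row in rows:
        if predicate(row):
            page.append(row)
            if len(page) == page_size:
                yield page  # hand one page to the client, then start a new one
                page = []
    if page:
        yield page  # final, possibly short, page


# Stand-in table: a lazy stream of rows, never materialized as a whole.
user_purchases = (
    {"id": i, "purchase_date": 1990 + i % 30} for i in range(100_000)
)
pages = paged_scan(
    user_purchases, lambda row: row["purchase_date"] > 2000, page_size=5_000
)
first_page = next(pages)  # only the rows needed for one page have been scanned
print(len(first_page))
```

Because `paged_scan` is a generator, the scan pauses after each page until the caller asks for the next one, so at most one page of matches is held at a time.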
  22. Example: tombstones
      INSERT INTO foo VALUES (1254, …)
      DELETE FROM foo WHERE id = 1254
      …
      SELECT * FROM foo
  23. Solution (2013)
      tombstone_warn_threshold: 1000
      tombstone_failure_threshold: 100000
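
A rough sketch of how the two thresholds on the slide above could act during a read. The threshold values come from the slide; everything else is an assumption made for illustration: `read_live_cells`, `TooManyTombstones`, the use of `None` as a tombstone marker, and the strict greater-than comparison are not taken from Cassandra's source.

```python
TOMBSTONE_WARN_THRESHOLD = 1000       # value from the slide
TOMBSTONE_FAILURE_THRESHOLD = 100000  # value from the slide


class TooManyTombstones(Exception):
    """Hypothetical error for a read aborted by the failure threshold."""


def read_live_cells(cells, warn=TOMBSTONE_WARN_THRESHOLD,
                    fail=TOMBSTONE_FAILURE_THRESHOLD):
    """Return (live cells, tombstones scanned, whether a warning is due)."""
    live, tombstones = [], 0
    for key, value in cells:
        if value is None:  # None marks a deleted cell in this sketch
            tombstones += 1
            if tombstones > fail:
                # Abort instead of letting one query exhaust the server.
                raise TooManyTombstones(f"scanned more than {fail} tombstones")
        else:
            live.append((key, value))
    return live, tombstones, tombstones > warn


# 1,500 deleted cells followed by one live cell.
cells = [(i, None) for i in range(1500)] + [(1500, "still here")]
live, tombstones, warned = read_live_cells(cells)
print(live, tombstones, warned)
```

The read returns a single live cell but had to step over 1,500 tombstones to find it, which is the cost the warning threshold is meant to surface.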
  24. Better Solution (???): It’s complicated
      ● Track repair status to get rid of GCGS
      ● Bring time-to-repair from “days” to “hours”
      ● Optional: improve time-to-compaction
  25. Example: joins
      ● CQL doesn’t support joins
      ● People still use client-side joins instead of denormalizing
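
To make the contrast on the slide above concrete, here is a small in-memory sketch of a client-side join next to a denormalized read. The tables, column names, and the `client_side_join` helper are all hypothetical; each dictionary lookup stands in for a round trip to the database.

```python
users = {1: {"name": "ada"}, 2: {"name": "grace"}}  # "users" table keyed by id
purchases = [  # "purchases" table; each row points at a user id
    {"user_id": 1, "item": "book"},
    {"user_id": 2, "item": "laptop"},
    {"user_id": 1, "item": "pen"},
]


def client_side_join(purchase_rows, fetch_user):
    """Join in the application: one extra lookup per purchase row."""
    round_trips = 1  # the initial query that fetched purchase_rows
    joined = []
    for row in purchase_rows:
        user = fetch_user(row["user_id"])
        round_trips += 1
        joined.append({"name": user["name"], "item": row["item"]})
    return joined, round_trips


joined, round_trips = client_side_join(purchases, users.__getitem__)

# Denormalized alternative: the user name is copied into every purchase
# row at write time, so a single query returns the same answer.
purchases_with_names = [
    {"user_id": row["user_id"], "name": users[row["user_id"]]["name"],
     "item": row["item"]}
    for row in purchases
]
denormalized = [{"name": r["name"], "item": r["item"]} for r in purchases_with_names]
print(round_trips, joined == denormalized)
```

The join costs one round trip per row on top of the initial query, while the denormalized table answers in one, at the price of keeping the copied names in sync on writes.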
  26. Solution (2015-???): MV
      ● Make it easier to denormalize
  27. Better solution (???): actually add joins
      ● Less controversial: shared partition joins
      ● More controversial: cross-partition
      ● CosmosDB, Spanner
  28. A note on configurability
  29. 4. Too much magic is a bad thing
  30. Not (just) about vendors overpromising
      ● “Our database isn’t subject to the limits of the CAP theorem”
      ● “Our queue can guarantee exactly once delivery”
      ● “We’ll give you 99.99% uptime*”
  31. Magic can be bad even when it works
  32. Cloud Spanner analysis excerpt
      Spanner’s architecture implies that writes will be significantly slower than reads due to the need to coordinate across multiple replicas and avoid overlapping time bounds, and that is what we see in the original 2012 Spanner paper.
      …
      Besides write performance in isolation, because Spanner uses pessimistic locking to achieve ACID, reads are locked out of rows (partitions?) that are in the process of being updated. Thus, write performance challenges can spread to causing problems with reads as well.
  33. Cloud Spanner
  34. Auto-scaling in DynamoDB
      ● Request capacity tied to “partitions” [pp]
        ○ pp count = max (rc / 3000, wc / 1000, st / 10 GB)
      ● Subtle implication: capacity / pp decreases as storage volume increases
        ○ Non-uniform: pp request capacity halved when shard splits
      ● Subtle implication 2: bulk loads will wreck your planning
  35. “Best practices for tables”
      ● Bulk load 200M items = 200 GB
      ● Target 60 minutes = 55,000 write capacity = 55 pps
      ● Post bulk load steady state
      ● 1000 req/s = 2 req/pp = 2 req/(3.6M items)
      ● No way to reduce partition count
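
The partition formula from the DynamoDB slides can be worked through in a few lines. This is a sketch under stated assumptions: `partition_count` is a hypothetical helper, and rounding each term up and enforcing a minimum of one partition are assumptions added here, not something the slides state.

```python
import math


def partition_count(read_capacity, write_capacity, storage_gb):
    """pp count = max(rc / 3000, wc / 1000, st / 10 GB), per the slide.

    Rounding up and the floor of one partition are assumptions of this sketch.
    """
    return max(
        1,
        math.ceil(read_capacity / 3000),
        math.ceil(write_capacity / 1000),
        math.ceil(storage_gb / 10),
    )


# Bulk load from the slide: 200M items = 200 GB at 55,000 write capacity.
bulk_partitions = partition_count(0, 55_000, 200)
items_per_partition = 200_000_000 / bulk_partitions
print(bulk_partitions, round(items_per_partition))

# Subtle implication: with write capacity fixed at 1,000, growing storage
# adds partitions, so each partition gets a smaller share of the capacity.
for storage_gb in (10, 50, 100):
    partitions = partition_count(0, 1000, storage_gb)
    print(storage_gb, partitions, 1000 / partitions)
```

The bulk load is sized by write capacity (55 partitions) rather than by storage (20), which matches the slide's 55 pps and roughly 3.6M items per partition. Since the slide says the partition count cannot be reduced, that count persists after the load finishes.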
  36. Ravelin, 2017
      You construct a table which uses a customer ID as partition key. You know your customer ID’s are unique and should be uniformly distributed across nodes. Your business has millions of customers and no single customer can do so many actions so quickly that the individual could create a hot key. Under this key you are storing around 2KB of data.
      This sounds reasonable.
      This will not work at scale in DynamoDb.
  37. How much magic is too much?
      ● Joins: Apparently okay
      ● Auto-scaling: Apparently also okay
      ● Automatic partitioning: not okay
      ● Really slow ACID: not okay (?)
      ● Why?
      ● How do we make the system more transparent without inflicting an unnecessary level of detail on the user?
  38. 5. It’s the cloud, stupid
  39. September 2011
  40. March 2012
  41. March 2012
  42. March 2012
  43. The cloud is here. Now what?
  44. Cloud-first architecture
      “The second trend will be the increased prevalence of shared-disk distributed DBMS. By “shared-disk” I mean a DBMS that uses a distributed storage layer as its primary storage location, such as HDFS or Amazon’s EBS/S3 services. This separates the DBMS’s storage layer from its execution nodes. Contrast this with a shared-nothing DBMS architecture where each execution node maintains its own storage.”
  45. Cloud-first infrastructure
      ● What on-premises infrastructure can provide a cloud-like experience?
      ● Kubernetes?
      ● OpenStack?
  46. Cloud-first development
      ● Is a yearly (bi-yearly?) release process the right cadence for companies building cloud services?
  47. Cloud-first OSS
      ● What does OSS look like when you don’t work for the big three clouds?
      ● “Commons Clause” is an attempt to deal with this
        ○ (What about AGPL?)
  48. Summary
      1. If it’s not SQL, it’s not a database.
      2. It takes 5+ years to build a database.
      3. Listen to your users.
      4. Too much magic is a bad thing.
      5. It’s the cloud, stupid.
