Speaker: Alex Komyagin
MongoDB replica sets allow you to make the database highly available so that you can keep your applications running even when some of the database nodes are down. In a distributed system, local durability of writes with journaling is no longer enough to guarantee system-wide durability, as the node might go down just before any other node replicates new write operations from it. As such, we need a new concept of cluster-wide durability.
How do you make sure that your write operations are durable within a replica set? How do you make sure that your read operations do not see writes that are not yet durable? This talk will cover the mechanics of ensuring durability of writes via write concern and how to prevent reading of stale data in MongoDB using read concern. We will discuss the decision flow for selecting an appropriate level of write concern, as well as associated tradeoffs and several practical use cases and examples.
ReadConcern and WriteConcern
1. #MDBlocal
Alex Komyagin
Senior Consulting Engineer
MongoDB
2. OCTOBER 12, 2017 | BESPOKE | SAN FRANCISCO
Who stole my write?
Or the story of Write Concern and Read Concern
3.
WHAT ARE WE GOING TO LEARN TODAY?
• What those things are - Write Concern and Read Concern
• What you can do with them
• What you should do with them
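4.
TYPICAL WRITE WORKFLOW
[Diagram: The App sends {name:"Alex"} to the Primary and receives {ok:1}. Inside the Primary the write touches the in-memory structures and oplog, the journal, and the data files. Two Secondaries replicate from the Primary. The steps are numbered 1 to 7.]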
6.
WRITE SOME DATA
[Diagram: The App writes {x:1},...,{x:99} and receives {ok:1}. The Primary and both Secondaries all hold {x:1} … {x:99}.]
7.
WRITE SOME MORE
[Diagram: The App writes {x:100} and receives {ok:1}. The Primary holds {x:1} … {x:100}; both Secondaries still hold only {x:1} … {x:99}.]
8.
OOOPSIE!
[Diagram: The Primary holds {x:1} … {x:100}, and The App has already received {ok:1} for {x:100}. Both Secondaries hold only {x:1} … {x:99}.]
9.
KEEP WRITING
[Diagram: The App writes {x:101} to a newly elected Primary and receives {ok:1}. The new Primary and the remaining Secondary hold {x:1} … {x:99}, {x:101}. The old Primary still holds {x:1} … {x:100}.]
10.
THE OLD PRIMARY COMES BACK ONLINE
[Diagram: The returning node is marked "???" and holds {x:1} … {x:100}. The current Primary and the Secondary hold {x:1} … {x:99}, {x:101}.]
11.
HE HAS TO FIX HIS STATE TO RESUME REPLICATION
[Diagram: The returning node, marked ROLLBACK, holds {x:1} … {x:100}. The Primary and the Secondary hold {x:1} … {x:99}, {x:101}. {x:100} is written to <dbpath>/rollback/<...>.bson.]
{x:99} is the last common point
12.
…AND THINGS ARE BACK TO NORMAL
[Diagram: The Primary and both Secondaries, including the node that rolled back, hold {x:1} … {x:99}, {x:101}. {x:100} sits in <dbpath>/rollback/<...>.bson.]
The {x:100} write is not lost per se, but it is not accessible to the app
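The rollback sequence on the preceding slides can be replayed with plain Python lists. This is only a toy model, not MongoDB's actual rollback algorithm: each oplog is assumed to be a simple list of documents, and the helper names `last_common_point` and `roll_back` are made up for illustration.

```python
def last_common_point(old_primary_oplog, new_primary_oplog):
    """Return how many leading oplog entries the two nodes share."""
    common = 0
    for mine, theirs in zip(old_primary_oplog, new_primary_oplog):
        if mine != theirs:
            break
        common += 1
    return common


def roll_back(old_primary_oplog, new_primary_oplog):
    """Return (rolled_back_entries, resynced_oplog) for the returning node."""
    common = last_common_point(old_primary_oplog, new_primary_oplog)
    # Entries past the common point are set aside, like the rollback .bson file.
    rolled_back = old_primary_oplog[common:]
    # The node then catches up from the new primary (simplified to a copy).
    resynced = list(new_primary_oplog)
    return rolled_back, resynced


# State from the slides: the old primary accepted {x:100} alone,
# then the new primary accepted {x:101}.
old_primary = [{"x": n} for n in range(1, 101)]
new_primary = [{"x": n} for n in range(1, 100)] + [{"x": 101}]

rolled_back, resynced = roll_back(old_primary, new_primary)
print(rolled_back)     # [{'x': 100}]
print(resynced[-2:])   # [{'x': 99}, {'x': 101}]
```

The rolled-back list is the part the application can no longer query, which is exactly the "stolen" write.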
13.
Rollback is unavoidable, but it is not a problem in itself; it works like self-healing
14.
SO WHERE WAS THE PROBLEM?
[Diagram: The App writes {x:100} and receives {ok:1}. The Primary holds {x:1} … {x:100}; both Secondaries hold only {x:1} … {x:99}.]
The App got the “OK” before the write was replicated to any of the secondaries
16.
WRITE CONCERN
• A form of intelligent receipt/confirmation that the write operation was replicated to the desired number of nodes
• Default number is 1
• Allows us to express how concerned we are with durability of a particular write in a replica set
• Can be set for individual ops / collections / etc
• NOT a distributed transaction
db.test.insert({x:100},{writeConcern:{w:2}})
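One way to picture the "receipt" is as a simple count of nodes that already hold the write. The sketch below is a simplified model, not server code: it assumes every member is a data-bearing voting node, tracks replication progress as one integer position per node, and uses the hypothetical helper `is_acknowledged`.

```python
def majority(node_count):
    """Smallest number of nodes that forms a majority."""
    return node_count // 2 + 1


def is_acknowledged(op_position, applied_upto, w):
    """Would a write at op_position get its receipt under write concern w?

    applied_upto lists the last applied position of each node, primary first.
    """
    required = majority(len(applied_upto)) if w == "majority" else w
    holders = sum(1 for position in applied_upto if position >= op_position)
    return holders >= required


# {x:100} is on the primary only; both secondaries are still at {x:99}.
lagging = [100, 99, 99]
print(is_acknowledged(100, lagging, 1))            # True
print(is_acknowledged(100, lagging, 2))            # False
print(is_acknowledged(100, lagging, "majority"))   # False

# One secondary catches up.
caught_up = [100, 100, 99]
print(is_acknowledged(100, caught_up, 2))           # True
print(is_acknowledged(100, caught_up, "majority"))  # True
```

With the default of 1, the app gets its OK in the `lagging` state, which is the situation that led to the rollback.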
17.
HOW DOES IT WORK?
• Different levels
- {w:<N>}
- {w:<N>, j:true} - includes secondaries since 3.2
- {w:"majority"} - implies {j:true} in MongoDB 3.2+; guarantees that confirmed operations won’t be rolled back
• Supports timeout
- {w:2, wtimeout:100}
- Timeout doesn’t imply a write failure - you just get no receipt
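The point that a timeout means "no receipt" rather than "no write" can be illustrated with a small simulation. Everything here is an assumption made for the sketch: time is modeled as discrete ticks, `timeline` holds the per-node applied positions at each tick, and the returned dictionaries are a simplified stand-in for a real server reply.

```python
def wait_for_receipt(op_position, timeline, w, wtimeout_ticks):
    """Wait up to wtimeout_ticks for w nodes to hold the write.

    The write is already applied on the primary in every case ("n": 1);
    only the confirmation differs.
    """
    for applied_upto in timeline[:wtimeout_ticks + 1]:
        holders = sum(1 for position in applied_upto if position >= op_position)
        if holders >= w:
            return {"n": 1}
    return {"n": 1, "writeConcernError": {"errInfo": {"wtimeout": True}}}


# A secondary needs two ticks to replicate {x:100}.
timeline = [
    [100, 99, 99],
    [100, 99, 99],
    [100, 100, 99],
]

print(wait_for_receipt(100, timeline, w=2, wtimeout_ticks=1))
# {'n': 1, 'writeConcernError': {'errInfo': {'wtimeout': True}}}
print(wait_for_receipt(100, timeline, w=2, wtimeout_ticks=5))
# {'n': 1}
```

In the first call the write still reaches a second node one tick later; the app simply never learns about it.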
18.
WRITE CONCERN TRADEOFFS
• Choose {w:"majority"} for writes that matter
• The main tradeoff is latency
- It’s not as bad as you think (within the same DC, AZ or even region)
- Use multiple threads to get desired throughput
- Use async frameworks in user facing applications, if needed
• For cross-regional deployments choose {w:2}
- Reasonable compromise between performance and durability
20.
WHAT HAPPENS IF WRITE CONCERN FAILS?
• Hitting “wtimeout” only generates a write concern failure exception
• Similar to network exceptions
• No useful information in a failure
• App code has to handle exceptions and retry when appropriate
• Writes need to be made idempotent (e.g. rewrite updates from $inc to $set)
• When idempotency is not possible, at least log the failures
• Retriable writes: Coming soon!
db.test.insert({name:"Alex"}, {writeConcern:{w:2,wtimeout:1000}})
[Diagram: Primary and Secondary; the insert comes back to the app with a writeConcernError.]
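Why retries call for idempotent writes can be shown without a database. The sketch below uses a made-up `apply_update` helper that mimics only the `$inc` and `$set` operators on a plain dictionary; the `balance` field and its values are illustrative, and turning `$inc` into `$set` assumes the application already knows the target value.

```python
def apply_update(doc, update):
    """Apply a tiny subset of update operators ($inc, $set) to a copy of doc."""
    result = dict(doc)
    for field, amount in update.get("$inc", {}).items():
        result[field] = result.get(field, 0) + amount
    for field, value in update.get("$set", {}).items():
        result[field] = value
    return result


account = {"_id": 1, "balance": 100}
with_inc = {"$inc": {"balance": 10}}
with_set = {"$set": {"balance": 110}}

# After an ambiguous failure the app retries, so the update may land twice.
print(apply_update(apply_update(account, with_inc), with_inc))
# {'_id': 1, 'balance': 120}  (applied twice, wrong)
print(apply_update(apply_update(account, with_set), with_set))
# {'_id': 1, 'balance': 110}  (applied twice, still right)
```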
21.
BEST EFFORT WRITE CODE EXAMPLE
• Replica set with 2 data nodes and an arbiter
• One node goes down every 90 seconds
• Inserting 2 million records
• w:1 - only 1999911 records were actually there in the end!
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, DuplicateKeyError

client = MongoClient("mongodb://a,b,c/?replicaSet=rs")
coll = client.test_db.test_col  # default write concern, w:1
i = 0
while i < 2000000:
    my_object = {'number': i}
    try:
        # insert_one adds a generated _id to my_object, so a retry reuses it
        coll.insert_one(my_object)
    except Exception:
        # repeat until success or we hit a dup key error
        while True:
            try:
                coll.insert_one(my_object)
                break
            except DuplicateKeyError:
                # the document is already on the current primary
                break
            except ConnectionFailure:
                pass
    i += 1
22.
HOW TO MAKE IT BETTER?
• Use write concern to know if writes are durable
• We’ll pay with additional latency for writes that might never be rolled back (but we don’t know that!)
• It’s not practical to wait for every write
- Use bulk inserts
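23.
BETTER, SAFER CODE
from pymongo import InsertOne, MongoClient
from pymongo.errors import BulkWriteError, ConnectionFailure
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://a,b,c/?replicaSet=rs")
coll = client.test_db.test_col.with_options(write_concern=WriteConcern(w=2))
i = 0
while i < 20000:
    # 20000 batches of 100 documents = 2000000 inserts
    requests = []
    for j in range(0, 100):
        requests.append(InsertOne({"number": i * 100 + j}))
    while True:  # repeat until success or write concern is satisfied
        try:
            coll.bulk_write(requests, ordered=False)
            break
        except BulkWriteError as bwe:
            # write errors alone (e.g. duplicate keys from a retry) are fine;
            # retry only if the write concern was not satisfied
            if not bwe.details.get('writeConcernErrors'):
                break
        except ConnectionFailure:
            pass
    i += 1
• db.test.count() is 2000000 after the test
• Takes the same amount of time with w:2 as w:1
[Flowchart: insert a batch; on success, move to the next batch; on problems, move to the next batch if there are no write concern errors, otherwise insert the batch again.]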
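The flowchart's decision can also be written as a small pure function over the error details dictionary. This is a sketch under assumptions: `next_step` is a hypothetical helper, the `details` argument is assumed to have the `writeErrors` / `writeConcernErrors` shape used in the code above, and the extra "inspect" outcome for non-duplicate write errors goes beyond what the slide shows. 11000 is MongoDB's duplicate key error code.

```python
DUPLICATE_KEY = 11000  # MongoDB's duplicate key error code


def next_step(details):
    """Decide what to do with a batch after a bulk write error.

    Returns "retry", "inspect" or "next".
    """
    if details.get("writeConcernErrors"):
        # No receipt for the requested write concern: send the batch again.
        return "retry"
    unexpected = [
        error for error in details.get("writeErrors", [])
        if error.get("code") != DUPLICATE_KEY
    ]
    if unexpected:
        # Not on the slide: surface errors that a retry would not explain.
        return "inspect"
    # Only duplicates from an earlier attempt: the batch is already there.
    return "next"


print(next_step({"writeErrors": [{"code": 11000}], "writeConcernErrors": []}))
# next
print(next_step({"writeErrors": [], "writeConcernErrors": [{"errmsg": "timed out"}]}))
# retry
```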
25.
WHAT IS A DIRTY READ?
[Diagram: The App runs db.test.find({x:100}) against the Primary and gets {x:100}. The Primary holds {x:1} … {x:100}; both Secondaries hold only {x:1} … {x:99}.]
26.
WHAT IS A DIRTY READ?
[Diagram: After the rollback, all three nodes hold {x:1} … {x:99}, {x:101}, and {x:100} sits in <dbpath>/rollback/<...>.bson. The App runs db.test.find({x:100}) and gets null.]
27.
READ CONCERN
• Determines which data to return from a query
• Different modes:
- Local
- Majority (3.2)
- Linearizable (3.4)
• NOT related to read preferences
[Diagram: On the Primary, {x:100} is the "local" point and {x:99} the "majority" point. On one Secondary, {x:99} is both "majority" and "local". On the other Secondary, {x:99} is "local" and {x:98} is "majority".]
28.
READ CONCERN
• db.test.find( { x:100 } )
- WORKS
• db.test.find( { x:100 } ).readConcern("majority")
- RETURNS “null”
• db.test.find( { x:100 } ).readConcern("linearizable")
- BLOCKS until the last write is replicated
- Use the maxTimeMS() option to avoid blocking forever
[Diagram: On the Primary, {x:100} is the "local" point and {x:99} the "majority" point. On the Secondary, {x:99} is "local" and {x:98} is "majority".]
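The difference between "local" and "majority" reads can be modeled in a few lines. This is a simplified sketch, not how the server implements snapshots: documents {x:1}, {x:2}, … are assumed to be inserted in order so that x doubles as the oplog position, and `majority_commit_point` and `find_x` are made-up helper names.

```python
def majority_commit_point(applied_upto):
    """Highest position that a majority of nodes has already applied."""
    needed = len(applied_upto) // 2 + 1
    return sorted(applied_upto, reverse=True)[needed - 1]


def find_x(x, node_applied_upto, read_concern, known_commit_point):
    """Return {x: ...} if the node may show it under read_concern, else None."""
    if read_concern == "local":
        visible_upto = node_applied_upto
    elif read_concern == "majority":
        # A node only shows what it knows to be majority-committed.
        visible_upto = min(node_applied_upto, known_commit_point)
    else:
        raise ValueError("this sketch models only 'local' and 'majority'")
    return {"x": x} if 1 <= x <= visible_upto else None


# Primary at {x:100}, both secondaries at {x:99}.
commit_point = majority_commit_point([100, 99, 99])
print(commit_point)                                  # 99
print(find_x(100, 100, "local", commit_point))       # {'x': 100}
print(find_x(100, 100, "majority", commit_point))    # None

# A secondary that has applied {x:99} but only knows {x:98} as committed.
print(find_x(99, 99, "local", 98))                   # {'x': 99}
print(find_x(99, 99, "majority", 98))                # None
```

The last two calls mirror the secondary in the diagram, whose own "majority" pointer trails the primary's.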
29.
MAJORITY VS. LINEARIZABLE
• Both return data that won’t be rolled back
• “Majority” returns the most recent data replicated to a majority of nodes that this particular node knows about
- Each node maintains and advances a separate “majority-committed” pointer/snapshot
• “Linearizable” ensures that this data is the most recent
- Enables multiple threads to perform reads and writes on a single document as if a single thread performed these operations in real time
- Only on Primary
- Significantly slower than “majority”
• In most applications dirty reads are not a big problem
- If write failures are handled correctly, the “dirtiness” is temporary
- Twitter vs. changing your password
30.
DID WE FORGET ANYTHING?
• Read preference controls where we are reading from
• Read concern controls what we are reading
• Causal consistency, new in 3.6, allows us to read what we wrote from any node
- Extension for read concern (read-after-optime)
- Compatible with read concern “majority”
- Enabled on the session level
db.getMongo().setCausalConsistency(true)
[Diagram: The App sends writes and reads to the Primary and reads to the Secondary. On the Primary, {x:100} is the "local" point and {x:99} the "majority" point. On the Secondary, {x:99} is "local" and {x:98} is "majority".]
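The read-after-optime idea behind causal consistency can be sketched as a session that remembers the position of its last write and refuses to read from a node that has not reached it yet. This is a toy model under loud assumptions: nodes are plain lists, the optime is just the oplog length, and "waiting" for replication is simulated by copying entries from the primary. `CausalSession` is a hypothetical class, not a driver API.

```python
class CausalSession:
    """Remembers the optime of the session's last write."""

    def __init__(self):
        self.operation_time = 0

    def write(self, primary, doc):
        primary.append(doc)
        # Assumption: the optime is simply the oplog length after the write.
        self.operation_time = len(primary)

    def read(self, node, primary):
        # Stand-in for waiting: let the node replicate until it has
        # applied everything this session wrote.
        while len(node) < self.operation_time:
            node.append(primary[len(node)])
        return list(node)


primary = [{"x": 99}]
secondary = [{"x": 99}]

session = CausalSession()
session.write(primary, {"x": 100})

print(secondary[-1])                          # {'x': 99}, a plain read is stale
print(session.read(secondary, primary)[-1])   # {'x': 100}
```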
32.
HOW TO CHOOSE THE RIGHT CONCERN?
THINK WHAT YOUR USERS CARE ABOUT
Writing important data that has to be durable?
• Example: ETL process for reporting
• Use {w:2}* or {w:"majority"}
Reads must see the most recent durable state (can’t be stale or uncommitted)?
• Example: Credentials Management Application
• Use {w:"majority"} and “linearizable” read concern
Mission-critical data where dirty reads are not allowed?
• Example: Config servers in sharding
• Use {w:"majority"} and “majority” read concern
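The three cases above can be condensed into a small decision helper. It is only a sketch of the slide's guidance: `choose_concerns` and its parameter names are invented here, reading the asterisk on {w:2} as "for cross-regional deployments" is an assumption based on the tradeoffs slide, and the fallback of {w:1} with "local" reads reflects the defaults mentioned earlier in the talk.

```python
def choose_concerns(must_read_latest=False, no_dirty_reads=False,
                    durable_writes=False, cross_region=False):
    """Return (write_concern, read_concern) for the given requirements."""
    if must_read_latest:
        # e.g. credentials management
        return {"w": "majority"}, "linearizable"
    if no_dirty_reads:
        # e.g. config servers in sharding
        return {"w": "majority"}, "majority"
    if durable_writes:
        # e.g. ETL for reporting; {w:2} assumed for cross-regional sets
        write_concern = {"w": 2} if cross_region else {"w": "majority"}
        return write_concern, "local"
    return {"w": 1}, "local"


print(choose_concerns(must_read_latest=True))
# ({'w': 'majority'}, 'linearizable')
print(choose_concerns(durable_writes=True, cross_region=True))
# ({'w': 2}, 'local')
```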
33.
DOES MY DRIVER SUPPORT THIS??
• Java
- https://mongodb.github.io/mongo-java-driver/3.4/javadoc/com/mongodb/WriteConcern.html
- https://mongodb.github.io/mongo-java-driver/3.4/javadoc/com/mongodb/ReadConcern.html
• C#
- https://mongodb.github.io/mongo-csharp-driver/2.3/apidocs/html/T_MongoDB_Driver_WriteConcern.htm
- https://mongodb.github.io/mongo-csharp-driver/2.3/apidocs/html/T_MongoDB_Driver_ReadConcern.htm
• PHP
- http://php.net/manual/en/mongo.writeconcerns.php
- http://php.net/manual/en/class.mongodb-driver-readconcern.php
• Others do, too!
34.
THANK YOU!
TIME FOR YOUR QUESTIONS
My name is Alex
Don’t email me here:
alex@mongodb.com
35.
MORE RESOURCES
• Documentation is where we all start:
https://docs.mongodb.com/manual/reference/write-concern/
https://docs.mongodb.com/manual/reference/read-concern/
• Great presentation by Jesse Davis on resilient operations:
https://www.slideshare.net/mongodb/mongodb-world-2016-smart-strategies-for-resilient-applications
Editor's Notes
Why this talk? Distributed systems are different from standalone systems
STORY about where is my data
Typical app operations workflow with a replica set (the app sends ops to the primary, secondaries replicate) - the steps that a write goes through - mainly to establish common ground, especially in terminology
Can I disable rollback? Is rollback avoidable?
Replication is async, so the primary waits for secondaries who send a special replSetUpdatePosition command to inform upstream nodes on their replication progress
We’ll talk more about timeouts later
This is how you would probably write it if you didn’t attend this presentation
By the way in prod we discourage while True loops (Jesse)
Why did we lose documents?
Now, knowing all of the above, let’s make it better
30 min
What happens if someone reads data that is going to be rolled back?
How much of a problem is that?
Now after the rollback we get nothing
Almost not related as we will see
Changing your password requires linearizable read concern when verifying
As of now, you can only get what you wrote by reading from the primary
Not a need for everyone, but there were quite a few requests