This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here:
http://www.slideshare.net/amywtang/espresso-20952131
2. Outline
LinkedIn Data Ecosystem
Espresso: Design Points
Data Model and API
Architecture
Deep Dive: Fault Tolerance
Deep Dive: Secondary Indexing
Espresso In Production
Future work
10. Document based data model
– Richer than a plain key-value store
– Hierarchical keys
– Values are rich documents and may contain nested types
Example document:

{
  "from" : {
    "name" : "Chris",
    "email" : "chris@linkedin.com"
  },
  "subject" : "Go Giants!",
  "body" : "World Series 2012! w00t!",
  "unread" : true
}

Schema (Messages):

mailboxID : String
messageID : long
from : {
  name : String,
  email : String
}
subject : String
body : String
unread : boolean
11. REST based API
• Secondary index query

GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15

• Partial updates

POST /MailboxDB/MessageMeta/bob/1
Content-Type: application/json
Content-Length: 21
{"unread" : "false"}

• Conditional operations – get a message only if recently updated

GET /MailboxDB/MessageMeta/bob/1
If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
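To make the API concrete, here is a minimal client sketch for the partial-update call above, using Java's standard java.net.http client. The host and port are hypothetical placeholders; only the path and body come from the slide.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EspressoPartialUpdate {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Partial update: mark message 1 in bob's mailbox as read.
        // Only the changed field is sent; the server merges it into the document.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://espresso.example.com:12345/MailboxDB/MessageMeta/bob/1"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"unread\" : \"false\"}"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}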
12. Transactional writes within a hierarchy
MessageCounter (before):

mboxId   value
George   { "numUnread": 2 }

Message:

mboxId   msgId   value                     etag
George   0       {…, "unread": false, …}   7abf8091
George   1       {…, "unread": true, …}    b648bc5f
George   2       {…, "unread": true, …}    4fde8701

1. Read, record etags:
   /Message/George/0 → {…, "unread": false, …} (etag 7abf8091)
2. Prepare after-image:
   /Message/George/0 → {…, "unread": true, …}
   /MessageCounter/George → {…, "numUnread": "+1", …}
3. Update, conditional on the recorded etags.

MessageCounter (after):

mboxId   value
George   { "numUnread": 3 }
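The etag flow above is optimistic concurrency control: the writes to documents under the same hierarchy prefix are applied atomically, guarded by the etags recorded in step 1. A minimal client-side sketch, assuming a hypothetical EspressoClient interface (the real wire protocol is the REST API shown earlier):

import java.util.Map;

public class TransactionalUpdate {
    // Hypothetical row and client types, for illustration only.
    record Row(String json, String etag) {}

    interface EspressoClient {
        Row get(String key);
        // Applies all writes atomically iff every listed etag still matches.
        boolean conditionalPut(Map<String, String> afterImage,
                               Map<String, String> expectedEtags);
    }

    static void markUnread(EspressoClient client, String mbox, long msgId) {
        while (true) {
            // 1. Read, record etags
            Row msg = client.get("/Message/" + mbox + "/" + msgId);
            // 2. Prepare after-image: flip the flag and bump the counter
            Map<String, String> after = Map.of(
                    "/Message/" + mbox + "/" + msgId, "{\"unread\": true}",
                    "/MessageCounter/" + mbox, "{\"numUnread\": \"+1\"}");
            Map<String, String> etags = Map.of(
                    "/Message/" + mbox + "/" + msgId, msg.etag());
            // 3. Update; if a concurrent writer won, re-read and retry
            if (client.conditionalPut(after, etags)) {
                return;
            }
        }
    }
}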
21. Generic Cluster Manager: Apache Helix
Generic cluster management:
– State model + constraints
– Ideal state of distribution of partitions across the cluster
– Migrate cluster from current state to ideal state (see the toy sketch below)

More info:
– SoCC 2012
– http://helix.incubator.apache.org
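To make the current-state/ideal-state idea concrete, here is a toy sketch (not Helix's actual API) that computes which partitions must move to converge on the ideal state:

import java.util.HashMap;
import java.util.Map;

public class IdealStateDiff {
    // For each partition, compare the current master with the ideal master
    // and emit the transition needed; Helix then drives such transitions
    // through the state model while respecting its constraints.
    static Map<String, String> transitions(Map<String, String> current,
                                           Map<String, String> ideal) {
        Map<String, String> moves = new HashMap<>();
        for (Map.Entry<String, String> e : ideal.entrySet()) {
            String partition = e.getKey();
            String idealMaster = e.getValue();
            String currentMaster = current.get(partition);
            if (!idealMaster.equals(currentMaster)) {
                moves.put(partition, currentMaster + " -> " + idealMaster);
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        Map<String, String> current = Map.of("P1", "node1", "P2", "node1");
        Map<String, String> ideal   = Map.of("P1", "node1", "P2", "node2");
        System.out.println(transitions(current, ideal)); // {P2=node1 -> node2}
    }
}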
34. Espresso Secondary Indexing
• Local secondary index requirements
  – Read after write
  – Consistent with primary data under failure
  – Rich query support: match, prefix, range, text search (examples below)
  – Cost-to-serve proportional to working set
• Pluggable index implementations
  – MySQL B-tree
  – Inverted index using Apache Lucene with a MySQL backing store
  – Inverted index using a prefix index
  – FastBit-based bitmap index
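The query kinds in the requirements list map directly onto standard Lucene query types. A small sketch, assuming a recent Lucene version (the field names are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;

public class QueryKinds {
    static Query mailboxQuery() {
        // match: exact term, as in "+isUnread:true"
        Query match = new TermQuery(new Term("isUnread", "true"));
        // prefix: subjects starting with "go"
        Query prefix = new PrefixQuery(new Term("subject", "go"));
        // range: created dates within 2012, bounds inclusive
        Query range = TermRangeQuery.newStringRange(
                "created", "2012-01-01", "2012-12-31", true, true);
        // combine with boolean logic, as in the REST query syntax shown earlier
        return new BooleanQuery.Builder()
                .add(match, BooleanClause.Occur.MUST)
                .add(prefix, BooleanClause.Occur.MUST)
                .add(range, BooleanClause.Occur.MUST)
                .build();
    }
}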
35. Lucene based implementation
• Requires the entire index to be memory-resident to support low-latency query response times
• For the Mailbox application, we have two options
36. Optimizations for Lucene based implementation
• Concurrent transactions on the same Lucene index lead to inconsistency
  – Need to acquire a lock
• Opening an index repeatedly is expensive
  – Group commit to amortize the index-opening cost (see the sketch below)

[Diagram: Requests 1–5 arrive concurrently and are batched into a single index write]
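Here is a minimal group-commit sketch over a Lucene IndexWriter. It illustrates the technique and is not Espresso's actual implementation: one thread becomes the commit leader, and its single (expensive) commit covers every request that queued up behind it.

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class GroupCommitIndexer {
    private final IndexWriter writer;
    private final Object lock = new Object();
    private long submitted = 0;        // tickets handed out to writers
    private long committed = 0;        // highest ticket covered by a commit
    private boolean committing = false;

    public GroupCommitIndexer(IndexWriter writer) {
        this.writer = writer;
    }

    public void index(Document doc) throws IOException, InterruptedException {
        writer.addDocument(doc);       // IndexWriter permits concurrent adds
        long myTicket;
        synchronized (lock) {
            myTicket = ++submitted;
        }
        while (true) {
            boolean leader = false;
            synchronized (lock) {
                if (committed >= myTicket) {
                    return;            // an earlier group commit covered us
                }
                if (!committing) {
                    committing = true; // become the leader for this batch
                    leader = true;
                } else {
                    lock.wait();       // piggyback on the in-flight commit
                }
            }
            if (leader) {
                long upTo;
                synchronized (lock) {
                    upTo = submitted;  // everything queued so far rides along
                }
                writer.commit();       // one commit amortized across the batch
                synchronized (lock) {
                    committed = upTo;
                    committing = false;
                    lock.notifyAll();
                }
            }
        }
    }
}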
37. Optimizations for Lucene based implementation
• High-value users of the site accumulate large mailboxes
  – Query performance degrades with a large index
• Performance shouldn't get worse with more usage!
• Time-partitioned indexes: partition the index into buckets based on created time (sketched below)
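A toy sketch of time-partitioned indexing. The 30-day bucket width and the per-bucket search stub are assumptions for illustration; the talk only says the index is split into buckets by created time:

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

public class TimePartitionedIndex {
    private static final long BUCKET_DAYS = 30;   // assumed granularity

    // Map a message's created time to an index bucket id.
    static long bucketFor(Instant createdTime) {
        return ChronoUnit.DAYS.between(Instant.EPOCH, createdTime) / BUCKET_DAYS;
    }

    // Query buckets newest-first; mailbox queries favor recent messages,
    // so old buckets are often never touched and query cost stays bounded.
    static List<String> query(long newestBucket, String q, int count) {
        List<String> hits = new ArrayList<>();
        for (long b = newestBucket; b >= 0 && hits.size() < count; b--) {
            hits.addAll(searchBucket(b, q, count - hits.size()));
        }
        return hits;
    }

    // Stub standing in for a search against the Lucene index of one bucket.
    static List<String> searchBucket(long bucket, String q, int limit) {
        return List.of();
    }
}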
39. Espresso in Production
Unified Social Content Platform – social activity aggregation
High read:write ratio
40. Espresso in Production
InMail - Allows members to communicate with each other
Large storage footprint
Low-latency requirement for secondary-index queries involving text search and relational predicates
41. Performance
Average failover latency with 1024 partitions is around 300ms.

Primary data reads and writes, for a single storage node on SSD (average row size = 1KB):

Operation   Average Latency   Average Throughput
Reads       ~3ms              40,000 per second
Writes      ~6ms              20,000 per second
42. Performance
Partition-key-level secondary index using Lucene, one index per mailbox.
Base data on SAS disks, indexes on SSDs.
Average throughput per index = ~1000 per second (after the group-commit and partitioned-index optimizations).

Operation                                Average Latency
Queries (average of 5 indexed fields)    ~20ms
Writes (around 30 indexed fields)        ~20ms
44. Durability and Consistency
Within a data center:
– Write latency vs. durability trade-off

Asynchronous replication:
– May lead to data loss
– Tooling can mitigate some of this

Semi-synchronous replication:
– Wait for at least one relay to acknowledge
– During failover, slaves wait for catch-up
– Consistency over availability: Helix selects the slave with the least replication lag to take over mastership (sketched below)
– Failover time is ~300ms in practice
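The failover policy in the last bullets is easy to state in code. A toy sketch (Helix's real selection also enforces the state model's constraints):

import java.util.Map;

public class FailoverChooser {
    // Promote the slave with the least replication lag to master.
    static String chooseNewMaster(Map<String, Long> slaveLagMillis) {
        return slaveLagMillis.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        Map<String, Long> lag = Map.of("slaveA", 120L, "slaveB", 40L);
        System.out.println(chooseNewMaster(lag)); // slaveB
    }
}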
45. Durability and Consistency
Across data centers:
– Asynchronous replication
– Stale reads possible
– Active-active: conflict resolution via last-writer-wins (sketched below)
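A minimal illustration of last-writer-wins. The timestamp source (wall clock vs. logical clock) is an assumption; the talk only names the policy:

public class LastWriterWins {
    // A conflicting write to the same row, replicated from each data center.
    record Write(String value, long timestampMillis) {}

    // Keep whichever write happened later; ties go to the first argument.
    static Write resolve(Write a, Write b) {
        return a.timestampMillis() >= b.timestampMillis() ? a : b;
    }
}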
46. Lessons learned
– Dealing with transient failures
– Planned upgrades
– Slave reads
– Storage devices: SSDs vs. SAS disks
– Scaling cluster management
48. Key Takeaways
Espresso is a timeline-consistent, document-oriented distributed database.
Feature rich: secondary indexing, transactions over related documents, seamless integration with the data ecosystem.
In production since June 2012, serving several key use-cases.