NoSQL! is it for you?
1. NoSQL
What it is and is it for you?
Iraj Islam
Rubayeet Islam
Nurul Ferdous
NewsCred
Thursday, February 3, 2011
2. Agenda
• Part 1. Why NoSQL?
• Part 2. NoSQL Use Cases
• Part 3. Choosing a NoSQL Solution
• Part 4. Understanding MongoDB
• Part 5. Building a MongoDB App
• Part 6. Scaling MongoDB
• Questions
3. Who We Are
Iraj Islam
CTO/Co-founder, NewsCred
Rubayeet Islam
Senior Software Engineer, NewsCred
Nurul Ferdous
Senior Software Engineer, NewsCred
4. Our Story
Launched 2008
Founded by two Bangladeshis
Funded by Investors of Twitter
Floodgate Ventures (Twitter), Bessemer Cap. (LinkedIn)
Top-tier Clients
Yahoo!, Orange Telecom, Harvard U, The Daily Star, etc.
5. What We Do
Domain Expertise
• Big Data
• Information Retrieval
• Machine Learning
• Semantic Web
Technologies
• Apache Solr
• MySQL/MongoDB
• Python/Java
6. Part 1
Why NoSQL?
7. What’s NoSQL?
NoSQL
What’s with the weird name?
14. Why NoSQL?
Web 1.0
The read-intensive web
• Publishing Model
• Textual Content
• Small Data
• Browsing
• Search
• Personal Computer
15. Why NoSQL?
The Age of Big Data
[Chart: exabytes (10^18 bytes) of data stored per year, 2006 to 2010; vertical axis from 0 to 1000]
22. Why NoSQL?
Web 2.0+
The write-intensive web
• User-generated Content
• Big Data
• Semi-structured Data
• Semantic Web
• Real-time
• Ubiquity: any device, anywhere
24. Why NoSQL?
The MySQL Problem
1. Default
[Diagram: writes from the data source and reads from the user all go through the application to a single MySQL server]
Bottleneck, too much load!
27. Why NoSQL?
The MySQL Problem
2. Replication
[Diagram: writes from the data source go to the MySQL master; reads from the user go to the MySQL slaves]
Scalable reads!
Bottleneck, writes won’t scale!
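The read/write split in the replication setup can be sketched in a few lines of plain JavaScript. This is only an illustration: `makeRouter` and the server names are hypothetical, and a real application would hold database connections instead of strings.

```javascript
// Hypothetical helper: decides which server a query should go to.
function makeRouter(master, slaves) {
  let next = 0;
  return {
    // All writes go to the single master, which is why writes do not scale.
    forWrite: () => master,
    // Reads rotate across the slaves (round-robin), so adding slaves adds read capacity.
    forRead: () => slaves[next++ % slaves.length],
  };
}

const router = makeRouter("mysql-master", ["mysql-slave-1", "mysql-slave-2"]);
console.log(router.forRead(), router.forRead(), router.forWrite());
```

Adding a third slave changes nothing for `forWrite`: every write still lands on the one master.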
30. Why NoSQL?
The MySQL Problem
3. Sharding
[Diagram: writes from the data source and reads from the user are spread across MySQL shards (S)]
Great, scalable writes!
Development and maintenance costs just skyrocketed!
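To see where the extra cost comes from, here is a minimal sketch of application-level sharding in plain JavaScript. The shard names, `hashKey` and `shardFor` are hypothetical and made up for illustration; the hash is a toy one, not something to use in production.

```javascript
// Hypothetical shard list: in a real app each entry would be a MySQL connection.
const SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2"];

// Simple deterministic string hash (illustrative only).
function hashKey(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit integer
  }
  return h;
}

// Every read and every write in the application has to go through this step.
function shardFor(key, shards = SHARDS) {
  return shards[hashKey(key) % shards.length];
}

console.log(shardFor("joegunchy"));
```

Because the shard is chosen with `% shards.length`, adding a fourth shard changes the result for most keys, so existing rows have to be moved. Queries that span several shards also have to be merged by the application itself.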
35. Why NoSQL?
The NoSQL Solution
Design Goals
• Semi-structured data >> Schema-free
• Big Data >> Scalable reads/writes
• Real-time >> High performance
• Ubiquity >> High availability
37. NoSQL vs RDBMS
NoSQL                      RDBMS
• Schema-free              • Relational schema
• Scalable writes/reads    • Scalable reads
• Auto high-availability   • Custom high-availability
• Limited queries          • Flexible queries
• Eventual consistency *   • Consistency
• BASE                     • ACID
* Applies to most NoSQL systems
38. Is NoSQL For You?
NoSQL                      RDBMS
• Schema-free              • Relational schema
• Scalable writes/reads    • Scalable reads
• Auto high-availability   • Custom high-availability
• Limited queries          • Flexible queries
• Eventual consistency *   • Consistency
• BASE                     • ACID
* Applies to most NoSQL systems
40. Part 2
NoSQL Use Cases
42. NoSQL Use Cases
• Consumer Use Cases
  • Facebook
  • Twitter
  • Netflix
• Enterprise Use Cases
  • Rackspace
  • Trend Micro
  • NewsCred
43. NoSQL Use Cases
• Facebook
  • HBase - Facebook messages
  • Scribe - Real-time click logs
  • Hive - SQL queries -> MapReduce jobs
  • Hadoop
    • Web analytics warehouse
    • Distributed datastore
    • MySQL backups
44. NoSQL Use Cases
• Twitter
  • Hadoop - Analytics
  • HBase - People search
  • Scribe - Log collection framework
  • FlockDB - Social graph analysis
45. NoSQL Use Cases
• Rackspace
  • Cassandra - stat collection, mail and apps
• Trend Micro
  • HBase & Hadoop - reputation databases
• NewsCred
  • MongoDB
    • API usage analytics
    • Pixel tracking analytics
    • Entity metadata storage
46. Demo
NewsCred API Analytics
47. Part 3
Choosing a NoSQL Solution
48. Choosing a NoSQL Solution
[Diagram: CAP triangle, with each system placed on the edge joining the two properties it provides]
• Availability (A): each client can always read and write
• Consistency (C): all clients have the same view of the data
• Partition tolerance (P): the system works well despite physical network partitions
• CA: RDBMSs (MySQL, PostgreSQL), Aster Data, Greenplum, Vertica
• AP: Cassandra, Voldemort, CouchDB, Dynamo, SimpleDB, Tokyo Cabinet, Riak
• CP: BigTable, Scalaris, HyperTable, Berkeley DB, HBase, MemcacheDB, MongoDB, Redis
49. Consistent, Available (CA)
CA systems have trouble with partitions and handle them through replication.
• Examples
  • MySQL (relational)
  • Aster Data (relational)
  • Greenplum (relational)
  • Vertica (column)
50. Available, Partition-Tolerant (AP)
AP systems have trouble with consistency and achieve “eventual consistency” through replication.
• Examples
  • Cassandra (column/tabular)
  • Dynamo (key-value)
  • Voldemort (key-value)
  • Tokyo Cabinet (key-value)
  • CouchDB (document)
  • SimpleDB (document)
  • Riak (document)
51. Consistent, Partition-Tolerant (CP)
CP systems have trouble with availability while keeping data consistent across partitioned nodes.
• Examples
  • MongoDB (document)
  • BigTable (column/tabular)
  • HyperTable (column/tabular)
  • HBase (column/tabular)
  • Redis (key-value)
  • Scalaris (key-value)
  • MemcacheDB (key-value)
52. HBase
Selling point: Billions of rows, millions of columns
Use when you need: Random, real-time access to Big Data
Written in: Java
License: Apache
Type: Column/Tabular
Protocol: HTTP/REST/Thrift
Community Support: Good
Learning Curve: High
Users: Yahoo!, Facebook, Microsoft, Adobe, StumbleUpon, etc.
53. Cassandra
Selling point: Best of Google BigTable and Amazon Dynamo
Use when you need: To write more than you read (logging)
Written in: Java
License: Apache
Type: Column/Tabular
Protocol: Custom, binary (Thrift)
Community Support: Great
Learning Curve: Medium
Users: Facebook, Twitter, Digg, Reddit, Rackspace, Cisco, SimpleGeo, Cloudkick, etc.
54. Redis
Selling point: Blazing fast, in-memory like memcached
Use when you need: To manage rapidly changing data
Written in: C
License: BSD
Type: Key-value
Protocol: Telnet-like
Community Support: Good
Learning Curve: Low
Users: GitHub, Craigslist, Stack Overflow, Disqus, The Guardian (UK), etc.
55. MongoDB
Selling point: Best of NoSQL and RDBMS
Use when you need: Dynamic queries and indexing on a Big DB
Written in: C++
License: AGPL
Type: Document
Protocol: Custom, binary (BSON)
Community Support: Great
Learning Curve: Low
Users: NewsCred, Foursquare, GitHub, SourceForge, The New York Times, Etsy, Shutterfly, etc.
56. Part 4
Understanding MongoDB
60. Understanding MongoDB
• SELECT
SELECT * FROM users WHERE X = 3 AND Y = 'abc';
db.users.find({X: 3, Y: "abc"})
SELECT * FROM users WHERE X = 3 AND Y = 'abc' ORDER BY X ASC;
db.users.find({X: 3, Y: "abc"}).sort({X: 1})
SELECT username, email FROM users WHERE X = 3 AND Y = 'abc';
db.users.find({X: 3, Y: "abc"}, {username: true, email: true})
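The query document passed to find() reads as an implicit AND of equality tests, and the second argument picks the fields to return. The sketch below mimics that idea over a plain in-memory array. It is not MongoDB code: `matches`, `project` and `find` are hypothetical helpers, the sample data is made up, and operators, indexes and the `_id` field are all left out.

```javascript
// In-memory stand-in for the users collection (sample data is made up).
const users = [
  { username: "a", email: "a@example.com", X: 3, Y: "abc" },
  { username: "b", email: "b@example.com", X: 3, Y: "xyz" },
  { username: "c", email: "c@example.com", X: 4, Y: "abc" },
];

// A document matches when every field in the query equals the document's field.
function matches(doc, query) {
  return Object.keys(query).every((field) => doc[field] === query[field]);
}

// Keep only the requested fields, like the second argument to find().
function project(doc, fields) {
  const out = {};
  for (const field of Object.keys(fields)) {
    if (fields[field] && field in doc) out[field] = doc[field];
  }
  return out;
}

function find(collection, query, fields) {
  const hits = collection.filter((doc) => matches(doc, query));
  return fields ? hits.map((doc) => project(doc, fields)) : hits;
}

console.log(find(users, { X: 3, Y: "abc" }, { username: true, email: true }));
```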
61. Understanding MongoDB
• UPDATE
db.collection.update(criteria, modifier, upsert, multi)
criteria : Query which selects the record(s) to update
modifier : $set, $inc, $unset, $push, $pop...
upsert : Insert if not exists, update otherwise
multi : Update multiple docs matching the criteria
UPDATE users SET X = 4, Y = 'abc' WHERE username = 'joegunchy';
db.users.update({username: "joegunchy"}, {$set: {X: 4, Y: "abc"}}, false, true)
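The upsert and multi flags are easier to follow with a toy version of update() that works on a plain array. This is a simplified sketch, not MongoDB's implementation: it handles only equality criteria and the $set and $inc modifiers, and the sample documents are made up.

```javascript
// Sketch of update(criteria, modifier, upsert, multi) over an in-memory array.
function update(collection, criteria, modifier, upsert = false, multi = false) {
  const isMatch = (doc) =>
    Object.keys(criteria).every((field) => doc[field] === criteria[field]);

  const apply = (doc) => {
    Object.assign(doc, modifier.$set || {});
    for (const [field, amount] of Object.entries(modifier.$inc || {})) {
      doc[field] = (doc[field] || 0) + amount;
    }
  };

  const hits = collection.filter(isMatch);
  if (hits.length === 0) {
    if (upsert) {
      const doc = { ...criteria }; // the new document starts from the criteria
      apply(doc);
      collection.push(doc);
    }
    return;
  }
  // Without multi, only the first matching document is changed.
  (multi ? hits : hits.slice(0, 1)).forEach(apply);
}

const users = [
  { username: "joegunchy", X: 1 },
  { username: "joegunchy", X: 2 },
];
update(users, { username: "joegunchy" }, { $set: { X: 4, Y: "abc" } }, false, true);
update(users, { username: "newuser" }, { $inc: { visits: 1 } }, true, false);
console.log(users);
```

The first call behaves like the SQL UPDATE: it changes every matching document and inserts nothing. The second call finds no match and, because upsert is true, inserts a new document.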
62. Understanding MongoDB
• DELETE
db.articles.remove({}) /*remove all*/
db.articles.remove({tag:'sql'}) /*remove all articles with tag = 'sql'*/
db.articles.remove({tag:'sql', $atomic:true}) /*block other ops while removing*/
64. Understanding MongoDB
• Map/Reduce
  • Algorithm introduced by Google for processing large datasets on clusters
  • MongoDB uses it for:
    • Aggregation (Group By, Avg, Sum etc.)
    • Batch processing jobs
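The map/reduce idea itself fits in a few lines of plain JavaScript. The sketch below is a single-process illustration, not the MongoDB API: the sample articles and the `mapReduce` driver are made up, and `emit` is passed in as a parameter, whereas in the mongo shell the map function reads the document from `this` and calls a global `emit`.

```javascript
// Made-up sample documents: articles with tags.
const articles = [
  { title: "Intro to NoSQL", tags: ["nosql", "mongodb"] },
  { title: "Scaling MySQL", tags: ["sql", "mysql"] },
  { title: "MongoDB sharding", tags: ["mongodb", "nosql"] },
];

// Map: emit one (key, value) pair per tag of the document.
function map(doc, emit) {
  for (const tag of doc.tags) emit(tag, 1);
}

// Reduce: combine all values emitted for one key.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Tiny driver: group emitted values by key, then reduce each group.
function mapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  for (const doc of docs) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const result = {};
  for (const [key, values] of groups) result[key] = reduceFn(key, values);
  return result;
}

console.log(mapReduce(articles, map, reduce));
```

This is the "Group By with a count" case: the result maps each tag to the number of articles carrying it.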
66-69. Understanding MongoDB
• Map/Reduce Example
[Slides 66 to 69 walk through the example as images: Document, "We want to do something like...", Map, Reduce, Execute, Result]
70. Part 5
Building a MongoDB App
71. Part 6
Scaling with MongoDB
72. Scaling with MongoDB
• Scaling is a challenge
• No silver bullet
• Strategies
  • Replication
  • Replica Sets
  • Auto-sharding
73. Scaling with MongoDB
Replication
[Diagram: one master replicating to three slaves]
74. Scaling with MongoDB
Replica Sets
[Diagram: a user connected to a replica set made up of a primary, a secondary and a passive node]
75. Scaling with MongoDB
Replica Sets: Election
[Diagram: five nodes, A to E, labelled with how recently each synced ("Synced 3 ms ago", "Synced 1 ms ago") and with its priority (1 or 0)]
76. Scaling with MongoDB
• Replica Sets: Network Partition
  • Election process is initiated
    • When a node can’t reach the primary
    • When the primary can’t reach a majority of nodes in the set
  • New primary is elected by a majority of nodes in the set
  • Node with the most recent data gets priority
  • Arbiter node used to break ties
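The election rules above can be approximated with a short sketch. This is a deliberate simplification and not MongoDB's actual election protocol: `electPrimary`, the node fields (`reachable`, `priority`, `lastSyncMs`, `arbiter`) and the sample values are all hypothetical.

```javascript
// lastSyncMs is how long ago the node last synced (smaller means fresher data).
// A node with priority 0 can vote but can never become primary.
function electPrimary(nodes) {
  const reachable = nodes.filter((n) => n.reachable);
  // A primary can only be elected by a majority of ALL nodes in the set.
  if (reachable.length <= nodes.length / 2) return null;

  // Arbiters vote but hold no data, so they are never candidates.
  const candidates = reachable.filter((n) => n.priority > 0 && !n.arbiter);
  if (candidates.length === 0) return null;

  // Prefer the candidate with the most recent data.
  candidates.sort((a, b) => a.lastSyncMs - b.lastSyncMs);
  return candidates[0].name;
}

// Made-up replica set: the old primary A has become unreachable.
const replicaSet = [
  { name: "A", reachable: false, priority: 1, lastSyncMs: 0 },
  { name: "B", reachable: true, priority: 1, lastSyncMs: 5 },
  { name: "C", reachable: true, priority: 1, lastSyncMs: 3 },
  { name: "D", reachable: true, priority: 0, lastSyncMs: 1 },
  { name: "E", reachable: true, priority: 1, lastSyncMs: 1 },
];
console.log(electPrimary(replicaSet));
```

Here four of five nodes can still see each other, so an election is possible. D is as fresh as E but has priority 0, so E is chosen. If only two of the five nodes were reachable, the function would return null and the set would have no primary.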
77. Scaling with MongoDB
• Auto-sharding
  • Cluster handles sharding data and rebalancing automatically
  • No administrative headaches of manual sharding
  • Application is oblivious to existence of shards
78. Scaling with MongoDB
Auto-sharding
[Diagram: Big Collection]
79. Scaling with MongoDB
Auto-sharding
[Diagram: User, Router]
80. Scaling with MongoDB
Auto-sharding
• Connect to a single server
  • db = connect('localhost:27017')
• Connect to a router
  • db = connect('localhost:27017')
[Diagram: User, MongoDB]
81. Scaling with MongoDB
• When to shard?
  • Running out of disk space
  • Write-intensive workload
  • Need to keep a large chunk of data in memory
• Don’t start out with a sharded collection!
  • Shard “if and when” you need to
82. Scaling with MongoDB
• Choosing a Shard Key
  • Incremental
    • Example: timestamps, e.g. ‘created_at’
    • Queries on the shard key are highly efficient
  • Random
    • Example: ‘username’
    • Writes are distributed across multiple shards
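The trade-off between the two kinds of shard key shows up when counting where new writes land. The sketch below assumes range-based placement, where each shard owns a contiguous range of key values; `shardForRange`, `distribution`, the range boundaries and the sample keys are all made up for illustration.

```javascript
// upperBounds[i] is the exclusive upper bound of shard i; the last shard is open-ended.
function shardForRange(key, upperBounds) {
  const i = upperBounds.findIndex((bound) => key < bound);
  return i === -1 ? upperBounds.length : i;
}

// Count how many writes land on each shard.
function distribution(keys, upperBounds) {
  const counts = new Array(upperBounds.length + 1).fill(0);
  for (const key of keys) counts[shardForRange(key, upperBounds)] += 1;
  return counts;
}

// Incremental key: new timestamps are larger than every existing boundary,
// so all new writes land on the last shard.
const newTimestamps = [1001, 1002, 1003, 1004, 1005, 1006];
console.log(distribution(newTimestamps, [400, 800])); // [ 0, 0, 6 ]

// Random-looking key: usernames fall all over the key space.
const usernames = ["alice", "zed", "mallory", "bob", "trent", "heidi"];
console.log(distribution(usernames, ["h", "q"])); // [ 2, 2, 2 ]
```

With the incremental key a query for a time range touches few shards, but the newest shard takes every write. With the random key the writes spread out evenly in this example.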
83. Scaling with MongoDB
Sharding + Replica Sets
[Diagram: the user talks to a router; behind it are two shards, each a replica set with one primary (P) and two secondaries (S)]
84. Questions?
Iraj Islam
iraj@newscred.com, @irajislam
Rubayeet Islam
rubayeet@newscred.com, @rubayeet
Nurul Ferdous
nurul@newscred.com, @ferdous