This is the presentation that I gave at Silicon Valley Code Camp 2012. The deck covers various aspects of big data and the NoSQL solutions available to handle it.
Code camp2012
1. Big Data and NoSQL
Landscape
Sanjeev Mishra
Silicon Valley Code Camp 2012
Sanjeev Mishra SVCC 2012
2. Timeline
• 1970s – Genesis of modern db
• Modeling the world based on relational
calculus: best for managing uniform data
• 1980s
• RDBMS takes over the world
• 1990s – 2000+
• Invention of HTML
• Spread of Web based technologies
3. Need for Modern Data Storage
• Amazon
• Managing: Shopping carts, Seller Lists, Customer Preferences,
Sales Rank, Recommendations
• Google
• Storing and managing web scale data
• Facebook
• Managing social graphs
• LinkedIn, Twitter and others
4. Data Explosion Current
• Every two days now we create as much information as we did from the dawn of civilization up until 2003 - about 5 exabytes (1 exabyte = 1K PB) of data: Eric Schmidt *
5. Data Explosion Future
• A telescope planned to be finished in 2024 will generate more data in a single day than the entire Internet.*
6. What is Big Data?
• Terabytes (TB) are not big data; petabytes (PB, 1000 TB) may be.
• Current definition of big data: zettabytes (1M PB or 1G TB)
7. Nature of Big Data
Web 2.0 kind of data
• Different from traditional RDBMS/Warehouse data, which sees more reads and fewer updates
• User Generated Content – Tweets, Reviews,
Comments etc…
• Lots of updates and lots of reads
• Scale to millions of users
• Not necessarily Transactional
• Compromised consistency
8. Data Explosion, So What?
• Structural issues
• The dynamic nature of data
• Performance issues
• Insertion
• Search
• Scaling Horizontally
• Dozens or hundreds of machines operating as a single server
9. What is NoSQL?
Not Only SQL or Not Relational
• Carlo Strozzi used it in 1998 and then Eric Evans in 2009
• Simple call level interface (SQL not supported)
• Flexible schema
• Efficient use of distributed indexes
• Horizontal scaling of operations over many servers
• No ACID but BASE (Basically Available, Soft state*,
Eventually consistent**)
10. CAP Theorem (Brewer’s Theorem)*
A distributed system can satisfy at most two of the
following three guarantees at the same time
o Consistency (all nodes see the same data at the same
time)
o Availability (a guarantee that every request receives a
response about whether it was successful or failed)
o Partition tolerance (the system continues to operate
despite arbitrary message loss or failure of part of the
system)
11. Eventual Consistency Flavors
• Causal consistency
o changes are notified through events, the receiving
session will always see the updated value.
• Read your own writes
o a session that updates the db will immediately see the
changes.
• Monotonic consistency*
o once a session reads a value, it will never see an earlier
value.
12. Consistency Tradeoffs
Where,
o N is # of copies of each data that db maintains
o R is # of copies that is read for each read
o W is # of copies that must be written for each write
• Most NoSQL use N>W>1: More than one write must
complete but not all nodes need to update immediately.
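The N, R and W settings above can be explored with a little arithmetic. The sketch below is an illustration, not part of the original deck: the function name and the returned labels are made up for this example. It relies on the standard quorum rule that a read is guaranteed to overlap the latest write when R + W > N.

```python
def quorum_properties(n: int, r: int, w: int) -> dict:
    """Classify a replica configuration (n copies, r read, w written)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return {
        # Read set and write set share at least one replica.
        "reads_see_latest_write": r + w > n,
        # Any two write sets share at least one replica.
        "writes_overlap": 2 * w > n,
        # The setting the slide calls typical for NoSQL: N > W > 1.
        "typical_nosql": n > w > 1,
    }


print(quorum_properties(3, 2, 2))  # overlapping quorums
print(quorum_properties(3, 1, 1))  # fast, but reads may be stale
```

With N=3, choosing R=1 and W=1 favors latency and accepts stale reads, while R=2 and W=2 makes every read touch at least one replica that holds the latest write.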
14. Row vs. Column Oriented DB
Id  First name  Last name  SSN          DOB
1   John        Doe        111-22-3333  8/12/1968
2   Jane        Doe        111-32-3408  4/3/1972
Row oriented: 1, John, Doe, 111-22-3333, 8/12/1968, 2, Jane, Doe, 111-32-3408, 4/3/1972
Column oriented: 1, 2, John, Jane, Doe, Doe, 111-22-3333, 111-32-3408, 8/12/1968, 4/3/1972
15. Contrasting Operations on Row vs Col DB
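Insert a new tuple (3, Foo, Bar, 237-23-3924, 2/3/1978)
Row oriented (the new tuple is appended at the end): 1, John, Doe, 111-22-3333, 8/12/1968, 2, Jane, Doe, 111-32-3408, 4/3/1972, 3, Foo, Bar, 237-23-3924, 2/3/1978
Column oriented (one value is added to every column): 1, 2, 3, John, Jane, Foo, Doe, Doe, Bar, 111-22-3333, 111-32-3408, 237-23-3924, 8/12/1968, 4/3/1972, 2/3/1978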
16. Row vs. Column Oriented DB
Create a new attribute (phone number)
Row oriented (one value is added to every row): 1, John, Doe, 111-22-3333, 8/12/1968, 408-555-1212, 2, Jane, Doe, 111-32-3408, 4/3/1972, 650-555-2323
Column oriented (the new column is appended at the end): 1, 2, John, Jane, Doe, Doe, 111-22-3333, 111-32-3408, 8/12/1968, 4/3/1972, 408-555-1212, 650-555-2323
17. Row vs. Column Oriented DB
Get all who were born in a given year
Row oriented: Easy, just pick all rows where year of DOB matches the given year
Column oriented: Not so simple, scan the years and remember the indexes of all occurrences that match given year and extract based on these indexes
Get sum of all years
Row oriented: Little difficult, data does not live consecutively so scanning through entire dataset needed
Column oriented: Easy, the data is found consecutively
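The two queries compared above can be tried on a toy in-memory layout. This is a minimal sketch rather than how any real engine stores data: the field names are invented for the example and the DOB is reduced to a year to keep it short.

```python
# Row-oriented layout: one record per person.
rows = [
    {"id": 1, "first": "John", "last": "Doe", "dob_year": 1968},
    {"id": 2, "first": "Jane", "last": "Doe", "dob_year": 1972},
]

# Column-oriented layout: one list per attribute, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def born_in_rows(year):
    # Row store: filter whole records directly.
    return [row for row in rows if row["dob_year"] == year]


def born_in_columns(year):
    # Column store: find matching positions, then rebuild each record.
    hits = [i for i, y in enumerate(columns["dob_year"]) if y == year]
    return [{name: values[i] for name, values in columns.items()} for i in hits]


def sum_years_rows():
    # Row store: every record is visited to reach a single field.
    return sum(row["dob_year"] for row in rows)


def sum_years_columns():
    # Column store: the values already sit next to each other.
    return sum(columns["dob_year"])


print(born_in_columns(1972))
print(sum_years_columns())
```

Both layouts return the same answers; the difference is how much unrelated data each query has to walk past.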
18. Glossary
• Consistent Hashing (Cassandra, Dynamo)
o the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value)
• Vector Clock (Cassandra, Riak, Dynamo)
o an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations
• Quorum (Cassandra, Dynamo (sloppy))
• Merkle Tree (Cassandra, Riak, Dynamo)
o a hash tree where leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire data set
• Anti-Entropy Gossip Protocol (Cassandra, Dynamo)
o comparing all the replicas of each piece of data that exist and updating each replica to the newest version
• Order preserving partitioning (Cassandra, MongoDB)
19. Glossary
• MVCC
o multi version concurrency control
• Atomicity
o all or nothing
• Consistency
o each transaction leaves the db in valid state
• Isolation
o concurrent execution of txn results in a state that would be obtained if the txn were executed serially
• Durability
o committed txn remain so even in the event of power loss, crashes or errors
• WAL
o Write ahead logging – changes are written to a log before they are applied (Durability)
• Eventually consistent
o after a sufficiently long quiet period all updates can be expected to propagate through the system and all replicas will be consistent
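The WAL entry can be illustrated with a toy store that appends every change to a log before applying it, and rebuilds its state by replaying that log. This is a sketch under simplifying assumptions: `TinyStore` is a made-up class, and a Python list stands in for the log file that a real system would flush to disk.

```python
import json


class TinyStore:
    """In-memory key-value store with a write-ahead log kept as a list of lines."""

    def __init__(self, log=None):
        self.log = log if log is not None else []  # stands in for a file on disk
        self.data = {}
        for line in self.log:  # recovery: replay the log in order
            self._apply(json.loads(line))

    def _apply(self, record):
        self.data[record["key"]] = record["value"]

    def put(self, key, value):
        record = {"key": key, "value": value}
        self.log.append(json.dumps(record))  # 1. write to the log first
        self._apply(record)                  # 2. then apply the change


store = TinyStore()
store.put("a", 1)
store.put("a", 2)

# Simulate a restart: only the log survives, the state is rebuilt from it.
recovered = TinyStore(log=list(store.log))
print(recovered.data)
```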
20. Glossary
• Sharding
o horizontal partitioning of data, storing records on different servers according to some key
• Tuple
o row in RDBMS, predefined schema.
• Document
o contains nested document or lists as well as scalar values. No predefined schema.
• Extensible Record
o hybrid between Tuple and Document, families of attributes defined in a schema but attributes
can be added on a per record basis.
• Key-value Stores
o stores values indexed by a user defined key.
• Document Stores
o indexed document store
• Extensible Record Stores aka Wide Column Stores
o Stores extensible records partitioned vertically and horizontally across nodes.
21. NoSQL Categories
• Key-value Stores
o Stores values indexed by a user defined key.
• Document Stores
o Indexed document store
• Extensible Record Stores (Column Stores)
o Stores extensible records partitioned vertically and
horizontally across nodes.
• Graph Databases
23. Key-Value Stores
• A distributed cache/Hashtable
o Inspired by Amazon Dynamo
o like memcached with persistence, replication, versioning, locking, transactions,
sorting etc.
o get/put and lookups
o No secondary indices or keys
o Values are BLOBs or in some cases JSON document
o Scalability through key distribution over nodes
24. Key-Value Stores
• Riak (Erlang/Basho/Apache)
• Membase (C+Erlang/Couchbase/Apache)
• Project Voldemort (Java/LinkedIn/Apache)
• Redis (C/VMware/BSD)
• Scalaris (Erlang/Zuse+onScale/Apache)
• Tokyo Cabinet (C/Fal Labs/LGPL)
• Dynamo (Java/For Amazon internal use)
There are others
Key Value / Tuple Store at http://nosql-database.org/
25. Amazon Dynamo
• KV Store Developed by Amazon to support
o Best Seller Lists
o Shopping carts
o Customer Preferences
o Session Management
o Sales Rank
o Product Catalog etc...
• Variation of Consistent Hashing based Data
Partitioning and Replication
• Dynamic add/delete of Storage Nodes
• Each service uses distinct instance of
Dynamo
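Consistent-hashing-based partitioning with dynamic add/delete of nodes can be sketched as a sorted ring of hash positions. This is an illustration, not Dynamo's implementation: the `HashRing` class, the node names and the `cart:` keys are hypothetical, and the Dynamo refinements (virtual nodes, replication to successors) are left out. MD5 is used only because the deck mentions it as Dynamo's key hash.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    # 128-bit MD5 of the key, read as an integer position on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


class HashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions
        self._owners = {}     # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def remove(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owners[pos]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key; wrap past the largest position.
        i = bisect.bisect_right(self._positions, ring_position(key))
        return self._owners[self._positions[i % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"cart:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")  # dynamic add of a storage node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

The point of the design is visible in `moved`: adding a node relocates only the keys that now fall in the new node's arc, while every other key stays where it was.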
26. Amazon Dynamo Cont...
• Keys and values are opaque byte[]. ID = 128-bit MD5 hash of the key
• “always writeable”: no updates are rejected due to failures or
concurrent writes
• Simple read/write (get/put) operations on data uniquely identified
by a key; the value is a binary object (BLOB)
o get(key): returns a single object, or a list of conflicting versions,
along with a context
o put(key, context, object)
• Eventual consistency with no isolation
guarantees
27. RIAK
• Developed in Erlang by Basho
• Clients: Python, JavaScript, Java, PHP, Erlang
• Dynamo inspired Open-Source
o Advanced K/V and
o Document Store (not a full featured document store)
• Replication and sharding by primary key hash
o Consistent Hashing
o De-Centralized (No-Master node)
• Eventually consistent
o Tunable number of replicas for read and write
o Tunable per-read and per-write
o Different parts of an application can choose different trade-offs
28. Project Voldemort
• Java based advanced Key/Value store
• Developed at LinkedIn
• Open source, Apache license
• Supports MVCC for updates
• Replicas are updated asynchronously - an up-to-date view is
guaranteed if a majority of replicas is read
• Uses optimistic locking for consistent multi-record updates
• Versions are ordered based on Vector clocks
• More info: http://www.project-voldemort.com/voldemort/
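Ordering versions by vector clocks, as mentioned above, can be sketched with plain dictionaries that map a node name to a counter. This is an illustration and not Voldemort's API: the function names and the node names `n1` and `n2` are invented for the example.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if version a has seen everything version b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"  # neither saw the other: a conflict to reconcile


v1 = increment({}, "n1")  # written once on n1
v2 = increment(v1, "n1")  # updated again on n1
v3 = increment(v1, "n2")  # updated from v1 on n2, without seeing v2

print(compare(v2, v1))
print(compare(v2, v3))
```

Because the ordering is only partial, `v2` and `v3` come out as concurrent, which is the case where a store has to keep both versions or ask the client to merge them.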
30. Document Stores
• Data more complex than that in K/V stores
• Data encapsulated and encoded in
o JSON, XML, YAML, BSON or some other standard format
• Multiple types of documents per database
o Documents of similar type grouped together
o Optional metadata/schema for the document
o Less rigid schema than that of RDBMS
• Nested documents or collection
• Secondary indexes
• Complex query/update support
o Multiple attributes, collections etc
31. Document Example
{
"when": "2011-09-19T02:10:11.3Z",
"author": "alex",
"title": "No Free Lunch",
"text": "This is the text of the post. It could be very long.",
"tags": [ "business", "ramblings" ],
"votes": 5,
"voters": ["jane", "joe", "spencer", "phyllis", "li"],
"comments": [
{
"who": "jane",
"when": "2011-09-19T04:00:10.112Z",
"comment": "I agree."
},
{
"who": "meghan",
"when": "2011-09-20T14:36:06.958Z",
"comment": "You must be joking. etc etc ..."
}
]
}
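A document like the one above can be read and navigated without any predefined schema. The sketch below is illustrative only: it parses a shortened copy of the document with Python's standard `json` module and reaches into the nested list and sub-documents, which is roughly what a document store lets a query do on the server side.

```python
import json

# Shortened copy of the example document.
post = json.loads("""
{
  "author": "alex",
  "title": "No Free Lunch",
  "tags": ["business", "ramblings"],
  "votes": 5,
  "comments": [
    {"who": "jane", "comment": "I agree."},
    {"who": "meghan", "comment": "You must be joking. etc etc ..."}
  ]
}
""")

# Nested collections are ordinary lists and dictionaries.
commenters = [c["who"] for c in post["comments"]]
has_tag = "business" in post["tags"]
print(commenters, has_tag)
```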
32. Document Stores
• MongoDB (C++/10gen/AGPL)
• Apache CouchDB (Erlang/Apache)
• Amazon SimpleDB (Erlang/Amazon)
• Terrastore (Java/Terracotta/Apache)
• RavenDB (C#/Hibernating Rhinos/AGPL)
There are others
Document Store at http://nosql-database.org/
34. MongoDB
huMongous
• Document format: BSON (Binary JSON)
• Supports nested documents
• Documents are grouped in Collections
• Supports secondary indexes
• Scalability – auto sharding
• Consistency – Tunable based on request
(WriteConcerns)
• Replication – replica set – master – slave
• Atomicity – document level
35. MongoDB
Data types: String, Integer, Boolean, Double, Null, Array, Object, ObjectId, Binary, Regex, Code
SQL term / MongoDB term
Database / Database
Table / Collection
Index / Index
Row / Document
Column / Field
Join / Embedding or Linking
Primary Key / _id
SQL statement and MongoDB equivalent
SQL: create table users (name varchar(128), age number)
MongoDB: db.createCollection("users")
SQL: insert into users values ('bob', 32)
MongoDB: db.users.insert({name:"bob", age:32})
SQL: select * from users
MongoDB: db.users.find()
SQL: select name, age from users
MongoDB: db.users.find({}, {name:1, age:1, _id:0})
SQL: select name, age from users where age = 32
MongoDB: db.users.find({age:32}, {name:1, age:1})
SQL: select * from users order by name asc
MongoDB: db.users.find().sort({name:1})
SQL: select * from users limit 10 offset 20
MongoDB: db.users.find().skip(20).limit(10)
SQL: select distinct name from users
MongoDB: db.users.distinct("name")
SQL: select count(*) from users
MongoDB: db.users.count()
SQL: update users set age = 33 where name = 'bob'
MongoDB: db.users.update({name:"bob"}, {$set:{age:33}}, false, true)
SQL: delete from users where name = 'bob'
MongoDB: db.users.remove({name:"bob"})
37. Extensible Record Stores
Column Stores
• Motivated by Google BigTable
• Basic Data Model – Rows and Columns
• Scale by splitting rows and columns over
multiple nodes
o Rows split by sharding on primary key – split
by range rather than hash function
o Columns split by column groups
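The difference between splitting rows by key range and by hash function can be sketched as two small routing functions. This is an illustration under made-up assumptions: the node names, the split points "h" and "q", and the sample keys are hypothetical and do not come from BigTable or any specific store.

```python
import bisect
import hashlib

NODES = ["node-0", "node-1", "node-2"]
# Hypothetical split points: keys < "h" -> node-0, keys < "q" -> node-1, rest -> node-2.
SPLIT_POINTS = ["h", "q"]


def range_shard(key: str) -> str:
    # Range partitioning: neighbouring keys land on the same node.
    return NODES[bisect.bisect_right(SPLIT_POINTS, key)]


def hash_shard(key: str) -> str:
    # Hash partitioning: keys are spread evenly, but key order is lost.
    digest = hashlib.md5(key.encode()).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


keys = ["alice", "bob", "carol", "mallory", "zed"]
by_range = {k: range_shard(k) for k in keys}
by_hash = {k: hash_shard(k) for k in keys}
print(by_range)
print(by_hash)
```

Range splitting keeps scans over a key interval on one or a few nodes, at the cost of possible hot spots when many keys share a prefix.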
38. Extensible Record Stores
• Cassandra (Java/Facebook/Apache)
o Marriage of Dynamo and BigTable
• HBase (Java/Yahoo/Apache)
o Inspired by BigTable, uses HDFS for storage
• HyperTable (C++/Zvents/GPL)
o Similar to HBase/BigTable
• Accumulo (Java/NSA/Apache)
o Uses Hadoop, ZooKeeper, and Thrift; cell level access control
• Google BigTable (Internal to Google)
There are others
Wide Column Store at http://nosql-database.org/
40. Cassandra Features
• Decentralized
o Data is distributed across cluster of nodes
o No master, any node can address any request
o No single point of failure
• Fault-tolerant (Configurable replication strategies)
o Simple Strategy (first replica determined by partitioner, the rest
placed on the next nodes clockwise around the ring)
o Network Topology Strategy: multi datacenter strategy
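The Simple Strategy placement described above can be sketched as a walk around a token ring. This is a simplified illustration, not Cassandra's code: the function name, the integer tokens 0 to 75 and the node names A to D are invented, and the partitioner step that turns a row key into a token is assumed to have already happened.

```python
import bisect


def simple_strategy(ring_tokens: dict, key_token: int, replication_factor: int) -> list:
    """ring_tokens maps token -> node name; returns the replica nodes for key_token."""
    tokens = sorted(ring_tokens)
    # First replica: the node whose token is the first at or after the key's token,
    # wrapping around to the smallest token at the end of the ring.
    start = bisect.bisect_left(tokens, key_token) % len(tokens)
    replicas = []
    for step in range(len(tokens)):
        # Remaining replicas: keep walking clockwise.
        node = ring_tokens[tokens[(start + step) % len(tokens)]]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


ring = {0: "A", 25: "B", 50: "C", 75: "D"}
print(simple_strategy(ring, 30, 3))
print(simple_strategy(ring, 80, 2))
```

Network Topology Strategy follows the same ring walk but additionally takes datacenter and rack placement into account, which this sketch does not model.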
41. Cassandra Features Cont…
• Failure detection and recovery
o Based on Gossip protocol
o Node state updated based on gossip message version
o Per-node heartbeat threshold
• Tunable consistency
o Can be configured per read/write
42. Cassandra
Data types: ascii, int, float, decimal, boolean, bigint, double, varchar, counter, timestamp, uuid, text, blob, varint
SQL term / Cassandra term
Database / Keyspace
Table / Column Family
Index / Index
Row / Row
Column / Column
Join / (none)
Primary Key / Primary Key
SQL statement and Cassandra QL equivalent
SQL: create database codecamp
CQL: CREATE KEYSPACE codecamp WITH strategy_class = 'NetworkTopologyStrategy' AND strategy_options:DC1=3
SQL: create table users (key varchar(128), name varchar(128), age number)
CQL: CREATE COLUMNFAMILY users (key varchar PRIMARY KEY, name varchar, age int)
SQL: create index idx_name ON users(name)
CQL: CREATE INDEX idx_name ON users(name)
SQL: insert into users values ('bob', 'Bob', 32)
CQL: INSERT INTO users (KEY, name, age) VALUES ('jdoe', 'Jane Doe', 39)
SQL: select name, age from users where age>30
CQL: SELECT name, age FROM users WHERE age>30
SQL: update users set age = 35 where key = 'bob'
CQL: UPDATE users SET age=35 WHERE KEY='bob'
SQL: delete from users where key='bob'
CQL: DELETE FROM users WHERE KEY = 'bob'
CQL (delete a single column): DELETE age FROM users WHERE KEY='alice'
SQL: drop table users
CQL: DROP COLUMNFAMILY users
SQL: drop database codecamp
CQL: DROP KEYSPACE codecamp
44. Cassandra Keyspace
Analogous to database in RDBMS
• Contains one or more Column Families
analogous to tables in RDBMS
• Column Family contains columns
• A Row Key identifies a set of related columns
• A Row is not required to have same set of
columns
• No join between two column families:
o Each column family is self contained to serve a query
o A rule of thumb - one column family per query for
better performance
• Replication is controlled on per-keyspace basis
45. Cassandra in Enterprise
• Netflix, Twitter, Urban Airship, Constant
Contact, Reddit, Cisco, OpenX, Rackspace,
Ooyala, and many more
• The largest Cassandra cluster has over 300
TB of data in over 400 machines
46. HBase
• Design influenced by Google BigTable
• A type of NoSQL – more a data store than a database; lacks many
RDBMS features such as
o Typed columns, secondary indexes, triggers, advanced query language etc.
• Built on top of HDFS: data is stored in HDFS as indexed
“StoreFiles”
• Strongly consistent R/W not “eventually consistent” – suitable for
counter aggregation
• Auto Sharding
• Auto Region Server Failover
• Out of the box support for Hadoop/HDFS
• Can be used as Source and/or Sink for MapReduce
• Java, Thrift/REST client
• Supports Block Cache and Bloom Filters for high volume query
optimization
• Web management tool and JMX support
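The Bloom filters mentioned in the list above let a store skip files that cannot contain a requested key. The sketch below shows the general data structure only; it is not HBase's implementation. The class name, the 1024-bit size, the three hash functions derived from SHA-256 and the `row-` keys are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive several bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for row_key in ("row-1", "row-2", "row-3"):
    bf.add(row_key)

print(bf.might_contain("row-1"))
```

A negative answer is always correct, so a read can skip that file entirely; a positive answer may be a false positive and still requires looking at the data.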