The document provides an overview and introduction to NoSQL databases. It discusses what triggered the NoSQL movement, common characteristics of NoSQL systems, and business benefits. The agenda covers topics such as what NoSQL is, differences from big data and cloud computing, core concepts, example implementations, and selecting the right NoSQL system for a project.
1. The CIO's Guide to NoSQL
Dan McCreary
July 12th 2012
Version 6
2. Agenda
• What is NoSQL?
• What Triggered the NoSQL Movement?
• How is NoSQL distinct from Big Data and Cloud Computing?
• Common Characteristics of NoSQL Systems
• Business Benefits of NoSQL
• Core NoSQL Concepts
• Selected NoSQL Implementations
• Recent NoSQL Developments
• Selecting the Right NoSQL System
• Next Step: Selecting the Right NoSQL Pilot Project
Copyright Kelly-McCreary & Associates, LLC
4. Background for Dan McCreary
• Bell Labs
• NeXT Computer (Steve Jobs)
• Owner of Custom Object-Oriented
Software Consultancy
• Federal data integration (National
Information Exchange Model)
• Native XML/XQuery – 2006
• Advocate of NoSQL/XRX systems
• Working with Manning
Publications on NoSQL Topic
5. NoSQL Definition
The NoSQL movement is a set of concepts
and technologies that allow the rapid and
efficient processing of large data sets with a
focus on performance and resiliency.
6. Sample of NoSQL Jargon
• Document orientation
• Indexing
• B-Tree
• Schema free
• Configurable durability
• MapReduce
• Documents for archives
• Horizontal scaling
• Functional programming
• Sharding and auto-sharding
• Document Transformation
• Document Indexing and Search
• Brewer's CAP Theorem
• Alternate Query Languages
• Consistency
• Aggregates
• Reliability
• OLAP
• XQuery
• Partition tolerance
• MDX
• Single-point-of-failure
• RDF
• Object-Relational mapping
• SPARQL
• Key-value stores
• Architecture Tradeoff Analysis Method (ATAM)
• Column stores
• Document-stores
• Memcached
Note that within the context of NoSQL many of these terms have different meanings!
7. Selecting a Database…
"Selecting the right data storage solution is
no longer a trivial task."
[Flowchart: Start → Does it look like a document? → Yes: Use Microsoft Office; No: Use the RDBMS → Stop]
8. Pressures on SQL Only Systems
[Diagram: four pressures pushing on SQL]
• Scalability
• OLAP/BI/Data Warehouse
• Social Networks
• Agile Schema Free
9. Simplicity is a Virtue
• Many systems derive their strength by dramatically limiting the
features in their system
• Simplicity allows database designers to focus on the primary
business driver
• Examples:
– Touch screen interfaces
– Key-value data stores
10. Historical Context
Mainframe Era
• 1 CPU
• COBOL and FORTRAN
• Punchcards and flat files
• $10,000 per CPU hour
MapReduce Era
• 10,000 CPUs
• Functional programming
• MapReduce "server farms"
• Pennies per CPU hour
11. Two Approaches to Computation
1930s and 40s
• John von Neumann: Manage state with a program counter.
• Alonzo Church: Make computations act like math functions.
Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?
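A minimal Python sketch of the two approaches, assuming a simple sum-of-squares task. The task and the function names are illustrative and do not come from the original slides. The first version manages state step by step; the second composes functions that have no side effects, so each call to `square` could in principle run on a different CPU.

```python
from functools import reduce


def sum_of_squares_stateful(numbers):
    """Von Neumann style: a loop mutates an accumulator one step at a time."""
    total = 0
    for n in numbers:
        total += n * n  # each step depends on the state left by the previous step
    return total


def square(n):
    """A pure function: the result depends only on the argument."""
    return n * n


def sum_of_squares_functional(numbers):
    """Church style: compose side-effect-free functions."""
    return reduce(lambda a, b: a + b, map(square, numbers), 0)


data = [1, 2, 3, 4]
print(sum_of_squares_stateful(data))
print(sum_of_squares_functional(data))
```

Both versions return the same answer; the difference is that the functional version never relies on a shared, mutable accumulator inside the mapped step.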
12. Standard vs. MapReduce Prices
John's Way Alonzo's Way
http://aws.amazon.com/elasticmapreduce/#pricing
13. MapReduce CPUs Cost Less!
[Bar chart: Cost Per CPU Hour (Cents), Standard CPU vs. MapReduce CPU]
Cuts cost from 32 to 6 cents per CPU hour!
Perhaps Alonzo was right!
Why? (hint: how "shareable" is this process)
http://aws.amazon.com/elasticmapreduce/#pricing
14. Perspectives
• Object Stores
• OLAP/MDX
• Native XML
• Graph Stores
• NoSQL for Web 2.0 and BigData
Perspective depends on your context
15. Architectural Tradeoffs
"I want a fast car with good mileage."
"I want a scalable database with low cost that runs
well on the 1,000 CPUs in our data center."
16. NoSQL on Google Trends
17. Recent History
• The term NoSQL became re-popularized
around 2009
• Used for conferences of advocates of non-
relational databases
• Became a contagious idea "meme"
• First of many "NoSQL meetups" in San Francisco organized by Johan Oskarsson
• Conversion from "No SQL" to "Not Only SQL" in recent years
18. NoSQL and Web 2.0 Startups
• Many web 2.0 startups did not use Oracle
or MySQL
• They built their own data stores influenced
by Amazon’s Dynamo and Google’s
BigTable in order to store and process
huge amounts of data
• Most of these data stores, built for social community or cloud computing applications, became OpenSource software
19. Google MapReduce
• 2004 paper that had a huge impact on the use of functional programming across the entire community
• Copied by many organizations, including
Yahoo
20. Google Bigtable Paper
• 2006 paper that gave focus to scalable databases
• Designed to reliably scale to petabytes of data and thousands of machines
21. Amazon's Dynamo Paper
• Werner Vogels
• CTO - Amazon.com
• October 2, 2007
• Used to power Amazon's
S3 service
• One of the most
influential papers in the
NoSQL movement
• Offered as a service (DynamoDB) in 2012
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin,
Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”,
in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
22. NoSQL "Meetups"
“NoSQLers came to share how they had
overthrown the tyranny of slow, expensive
relational databases in favor of more
efficient and cheaper ways of managing
data.”
Computerworld magazine, July 1st, 2009
23. Key Motivators
• Licensing RDBMS on multiple CPUs
• The Three "V"s
– Velocity – lots of data arriving fast
– Volume – web-scale BigData
– Variability – many exceptions
• Desire to escape rigid schema design
• Avoidance of complex Object-Relational
Mapping (the "Vietnam" of computer
science)
24. Many Processes Today Are Driven By…
The constraints of yesterday…
Challenge:
Ask ourselves the question…
Does our current method of solving problems with tabular data…
Reflect the storage of the 1950s…
Or our actual business requirements?
What structures best solve the actual business problem?
25. No-Shredding!
My
Data
• Relational databases take a single hierarchical document and
shred it into many pieces so it will fit in tabular structures
• Document stores prevent this shredding
26. Is Shredding Really Necessary?
• Every time you take
hierarchical data and
put it into a traditional
database you have to
put repeating groups in
separate tables and
use SQL “joins” to
reassemble the data
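A small sketch of what shredding looks like in practice, using Python's built-in `sqlite3` and `json` modules. The order document, the table names, and the use of a text column as a stand-in for a document store are all assumptions made for illustration. The relational path needs two tables, a join, and reassembly code; the document path stores and fetches the hierarchy whole.

```python
import json
import sqlite3

# A single hierarchical document with a repeating group ("items").
order = {
    "order_id": 1,
    "customer": "Ann",
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

db = sqlite3.connect(":memory:")

# Relational approach: the repeating group is shredded into its own table.
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
db.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
db.execute("INSERT INTO orders VALUES (?, ?)", (order["order_id"], order["customer"]))
db.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order["order_id"], item["sku"], item["qty"]) for item in order["items"]],
)

# Reassembly needs a join plus code that rebuilds the hierarchy.
rows = db.execute(
    "SELECT o.order_id, o.customer, i.sku, i.qty "
    "FROM orders o JOIN order_items i ON o.order_id = i.order_id "
    "ORDER BY i.rowid"
).fetchall()
reassembled = {
    "order_id": rows[0][0],
    "customer": rows[0][1],
    "items": [{"sku": sku, "qty": qty} for _, _, sku, qty in rows],
}

# Document approach (stand-in): store and fetch the document in one piece.
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO documents VALUES (?, ?)", (1, json.dumps(order)))
(body,) = db.execute("SELECT body FROM documents WHERE id = 1").fetchone()
fetched = json.loads(body)

print(reassembled == order, fetched == order)
```

With one repeating group the join is easy; the cost grows with every additional nested structure that needs its own table.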
27. Object Relational Mapping
[Diagram: Web Browser ⇄ Object Middle Tier ⇄ Relational Database, with translation steps T1 to T4]
• T1 – HTML into Objects
• T2 – Objects into SQL Tables
• T3 – Tables into Objects
• T4 – Objects into HTML
28. "The Vietnam of Applications"
• Object-relational mapping has become one of
the most complex components of building
applications today
• A "Quagmire" where many projects get lost
• Many "heroic efforts" have been made to
solve the problem:
– Hibernate
– Ruby on Rails
• But sometimes the way to avoid complexity is
to keep your architecture very simple
29. Document Stores Need No Translation
[Diagram: Application Layer (Document) ⇄ Database (Document)]
• Documents in the database (JSON or XML)
• Documents in the application
• No object middle tier
• No "shredding"
• No reassembly
• Simple!
30. The XML "Full Stack"
[Diagram: Web Browser (XForms) ⇄ REST Interfaces ⇄ XML database]
• XML lives in the web browser (XForms)
• REST interfaces
• XML in the database (Native XML, XQuery)
• XRX Web Application Architecture
• No translation!
31. "Schema Free"
• Systems that automatically determine how to
index data as the data is loaded into the
database
• No a priori knowledge of data structure
• No need for up-front logical data modeling
– …but some modeling is still critical
• Adding new data elements or changing data
elements is not disruptive
• Searching millions of records still has sub-
second response time
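One way to picture "schema free" indexing is a toy store that indexes every field of every document at load time, with no schema declared up front. The class below is a hypothetical illustration written for this deck, not the design of any particular product; `SchemaFreeStore`, `load`, and `find` are made-up names.

```python
from collections import defaultdict


class SchemaFreeStore:
    """Toy store that indexes every field of every document as it is loaded."""

    def __init__(self):
        self.docs = {}
        # (field path, leaf value) -> set of document ids
        self.index = defaultdict(set)

    def _walk(self, value, path=""):
        """Yield (path, leaf value) pairs for nested dicts and lists."""
        if isinstance(value, dict):
            for key, child in value.items():
                yield from self._walk(child, f"{path}/{key}")
        elif isinstance(value, list):
            for child in value:
                yield from self._walk(child, path)
        else:
            yield path, value

    def load(self, doc_id, doc):
        """Store the document and index whatever structure it happens to have."""
        self.docs[doc_id] = doc
        for path, leaf in self._walk(doc):
            self.index[(path, leaf)].add(doc_id)

    def find(self, path, value):
        """Look up document ids through the index instead of scanning documents."""
        return sorted(self.index.get((path, value), set()))


store = SchemaFreeStore()
store.load(1, {"name": "Ann", "tags": ["vip"]})
# A new element ("address") arrives later; nothing has to be altered first.
store.load(2, {"name": "Bob", "address": {"city": "Minneapolis"}})
print(store.find("/address/city", "Minneapolis"))
print(store.find("/tags", "vip"))
```

Because lookups go through the index rather than a scan, adding a new data element does not disrupt existing queries.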
33. Eric Evans
“The whole point of seeking alternatives
[to RDBMS systems] is that you need to
solve a problem that relational databases
are a bad fit for.”
Eric Evans
Rackspace
34. Evolution of Ideas in OpenSource
[Diagram: New Database Ideas (Schema-free, Auto-sharding, MapReduce, Cloud Computing) feeding New Products (Product A, Product B) through Proprietary Software and OpenSource]
• How quickly can new ideas be recombined into new database products?
• OpenSource software has proved to be the most efficient way to quickly
recombine new ideas into new products
36. Finding the Right Match
Schema-Free
Standards Compliant
Mature Query Language
Use CMU's Architecture Tradeoff Analysis Method (ATAM)
37. Avoidance of Unneeded Complexity
• Relational databases provide a variety of
features to ALWAYS support strict data
consistency
• Rich feature set and the ACID properties
implemented by RDBMSs might be more
than necessary for particular applications
and use cases
38. "One Size Fits…"
"One Size Does Not Fit All"
James Hamilton Nov. 3rd, 2009
http://perspectives.mvdirona.com/CommentView,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx
39. Different Thinking
Sequential Processing
• The output of any step can be used in the next step
• State must be carefully managed
Parallel Processing
• Each loop of an XQuery FLWOR statement is an independent thread (no side effects)
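The same contrast can be sketched in Python, with a thread pool standing in for independent workers (an assumption for illustration; the slide itself refers to XQuery FLWOR loops). In the sequential loop each step reads state left by the previous one. In the parallel version each call to the hypothetical `word_count` function is independent, so the calls can be handed to separate workers in any order.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """Side-effect-free: the result depends only on the input text."""
    return len(text.split())


documents = ["a b c", "d e", "f"]

# Sequential processing: every step uses the running total from the step before.
running_total = 0
running_totals = []
for doc in documents:
    running_total += word_count(doc)
    running_totals.append(running_total)

# Parallel processing: each item is handled independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(word_count, documents))

print(running_totals)
print(counts)
```

`pool.map` returns results in input order even though the calls may finish in a different order, which is only safe because `word_count` shares no state.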
40. Cloud Computing
• High scalability
– Especially in the horizontal direction (multi
CPUs)
• Low administration overhead
– Simple web page administration
41. Databases work well in the cloud
• Data warehousing specific databases for
batch data processing and map/reduce
operations
• Simple, scalable and fast key/value-stores
• Databases with a richer feature set than key/value-stores, filling the gap between them and traditional RDBMSs while offering good performance and scalability properties (such as document databases)
42. Auto-Sharding
• When one database gets almost full it tells a "coordinator" system
and the data automatically gets migrated to other systems
• Systems have "Partition Tolerance"
Warning Disk Full!
Before: one disk 90% full:
Time to "Shard"
After: two disks 45% full:
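The before and after picture can be simulated with a toy coordinator. This is a sketch under stated assumptions: `Shard`, `Coordinator`, the 90% threshold, and the "move the upper half of the keys" policy are hypothetical choices made to mirror the slide, and real systems usually split by key range or hash.

```python
class Shard:
    """One storage node with a fixed capacity, counted in keys."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}

    def fill_percent(self):
        return 100 * len(self.data) // self.capacity


class Coordinator:
    """Hypothetical coordinator: routes keys and splits shards that fill up."""

    def __init__(self, capacity=20, threshold_percent=90):
        self.capacity = capacity
        self.threshold_percent = threshold_percent
        self.shards = [Shard(capacity)]
        self.location = {}  # key -> shard currently holding it

    def put(self, key, value):
        shard = self.location.get(key)
        if shard is None:
            # New keys go to the emptiest shard.
            shard = min(self.shards, key=Shard.fill_percent)
            self.location[key] = shard
        shard.data[key] = value
        if shard.fill_percent() >= self.threshold_percent:
            self._split(shard)

    def get(self, key):
        return self.location[key].data[key]

    def _split(self, full):
        """Migrate the upper half of a nearly full shard's keys to a new shard."""
        new = Shard(self.capacity)
        self.shards.append(new)
        keys = sorted(full.data)
        for key in keys[len(keys) // 2:]:
            new.data[key] = full.data.pop(key)
            self.location[key] = new


coordinator = Coordinator(capacity=20, threshold_percent=90)
for i in range(18):  # the 18th key brings the first shard to 90% full
    coordinator.put(f"k{i:02d}", i)

print([shard.fill_percent() for shard in coordinator.shards])
```

Callers keep using `put` and `get` throughout; the migration happens behind the coordinator, which is the point of the "auto" in auto-sharding.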
43. Brewer's CAP Theorem
[Diagram: triangle with vertices Consistency, Availability, and Partition Tolerance]
You cannot have all three, so pick two!
44. Migrating to Partition Tolerance
[Diagram: CAP triangle with vertices Consistency, Availability, and Partition Tolerance; sides labeled CA, CP, and AP; RDBMS marked on the diagram]
45. Scale Up vs. Scale Out
Scale Up
• Make a single CPU as fast as possible
• Increase clock speed
• Add RAM
• Make disk I/O go faster
Scale Out
• Make many CPUs work together
• Learn how to divide your problems into independent threads
46. Sample of NoSQL Systems
Document Stores
Key-Value Stores
Memcache
XML
Column Stores
Graph Stores
Object Stores
47. If you can't beat them…
48. Key Value Stores
Key Value
• A table with two columns
and a simple interface
– Add a key-value
– For this key, give me the
value
– Delete a key
• Blazingly fast and easy to
scale
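The whole interface described above fits in a few lines. The sketch below is an in-memory illustration only; the class and method names (`KeyValueStore`, `put`, `get`, `delete`) are assumptions, not the API of any specific product.

```python
class KeyValueStore:
    """A two-column table (key, value) with a deliberately tiny interface."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        """Add a key-value pair, replacing any earlier value for the key."""
        self._table[key] = value

    def get(self, key):
        """For this key, give me the value (None if the key is absent)."""
        return self._table.get(key)

    def delete(self, key):
        """Delete a key; deleting a missing key is a no-op."""
        self._table.pop(key, None)


store = KeyValueStore()
store.put("user:1", '{"name": "Ann"}')
print(store.get("user:1"))
store.delete("user:1")
print(store.get("user:1"))
```

There is no query language, no joins, and no secondary index to keep consistent across keys, which is much of why this kind of store is easy to spread over many machines.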
49. Types of Key-Value Stores
• Eventually-consistent Key-Value Stores
• Hierarchical Key-Value Stores
• Key-Value Stores in RAM
• Key-Value Stores on Disk
• Ordered Key-Value Stores
50. Cassandra
• Apache open source project
• Originally developed by Facebook
• Designed for highly distributed, highly reliable systems
• No single point of failure
• Column-family data model
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
51. MongoDB
• Open Source License
• Document/Collection centric
• Sharding built-in, automatic
• Stores data as JSON-style documents (BSON)
• Queries are expressed as JSON documents
• Can be 10x faster than MySQL
• Many languages (C++, JavaScript, Java,
Perl, Python etc.)
52. Hadoop/HBase
• Hadoop: open source implementation of the MapReduce algorithm, written in Java
• Created by Doug Cutting and Mike Cafarella, with much of the early development done at Yahoo
– 300 person-years development
• HBase: column-oriented data store similar to Google's BigTable
• Java interface
• HBase designed specifically to work with Hadoop and the Hadoop file system
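For readers who have not seen the MapReduce pattern, here is a single-process sketch in plain Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, and a real framework would run the map and reduce calls on many machines and do the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent, so documents could be mapped in parallel.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["NoSQL scales out", "SQL scales up"]))
```

Because neither phase depends on shared state, adding machines mostly means splitting the input into more pieces.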
53. CouchDB
• Commercial Company
• Apache Project
• Written in Erlang
• RESTful JSON API
• Distributed, featuring robust, incremental
replication with bi-directional conflict
detection and management
54. Memcached
• Free & open source in-memory caching system
• Designed to speed up dynamic web applications by
alleviating database load
• RAM resident key-value store for small chunks of arbitrary
data (strings, objects) from results of database calls, API calls,
or page rendering
• Simple interface
• Designed for quick deployment, ease of development
• APIs in many languages
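The usual way an application uses a cache like this is the "cache-aside" pattern: check the cache first and call the database only on a miss. In the sketch below a plain dictionary stands in for a memcached client, and `slow_database_query`, `get_user`, and the 60-second expiry are hypothetical names and values chosen for illustration.

```python
import time

cache = {}  # stand-in for a memcached client: key -> (value, expiry time)
stats = {"db_calls": 0}


def slow_database_query(user_id):
    """Pretend database call; counts how often it is actually hit."""
    stats["db_calls"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id, ttl_seconds=60):
    """Cache-aside read: serve from the cache, fall back to the database on a miss."""
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]  # cache hit: no database load
    value = slow_database_query(user_id)
    cache[key] = (value, time.monotonic() + ttl_seconds)
    return value


get_user(7)
get_user(7)  # second read is served from the cache
print(stats["db_calls"])
```

Only the first read reaches the database; later reads for the same key are answered from RAM until the entry expires.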
55. MarkLogic
• Native XML database designed to be used for petabyte-scale data stores
• ACID compliant
• Role-based access control
• Heavy use by federal agencies, document
publishers and "high-variability" data
• Arguably the most successful NoSQL
company
56. eXist
• OpenSource native XML database
• Strong support for XQuery and XQuery
extensions
• Heavily used by the Text Encoding Initiative
(TEI) community and XRX/XForms communities
• Ideal for metadata management
• Integrated Lucene search and structured search
57. Riak
• Community and Commercial licenses
• A "Dynamo-inspired" database
• Written in Erlang
• Query JSON or Erlang
58. Hypertable
• Open Source
• Closely modeled after Google's Bigtable
project
• High performance distributed data storage
system
• Designed to support applications requiring
maximum performance, scalability, and
reliability
• Hypertable Query Language (HQL) that is
syntactically similar to SQL
59. Selecting a NoSQL Pilot Project
• The "Goldilocks Pilot
Project Strategy"
• Not too big, not too small, just the right size
• Duration
• Sponsorship
• Importance
• Skills
• Mentorship
60. The Future of the NoSQL Movement
Growth Diversity
• Will data sets continue to grow at exponential rates?
• Will new system options become more diverse?
• Will new markets have different demands?
• Will some ideas be "absorbed" into existing RDBMS vendors' products?
• Will the NoSQL community continue to be the place where new
database ideas and products are incubated?
• Will the job of doing high-quality architectural tradeoff analysis become easier?
61. Using the Wrong Architecture
Start Finish
Credit: Isaac Homelund – MN Office of the Revisor
62. Using the Right Architecture
Finish
Start
Find ways to remove barriers to empowering
the non-programmers on your team.
63. Questions
Dan McCreary
President, Kelly-McCreary & Associates
dan@danmccreary.com