5. We Lose: Joe Hellerstein (Berkeley) 2001
“Databases are commoditised and cornered
to slow-moving, evolving, structure
intensive, applications that require schema
evolution.” …
“The internet companies are lost and we will
remain in the doldrums of the enterprise
space.” …
“As databases are black boxes which
require a lot of coaxing to get maximum
performance”
8. Backlash (2009)
Not novel (dates back to the 80’s)
Physical level not the logical level (messy?)
Incompatible with tooling
Lack of integrity (referential) & ACID
MR is brute force, ignoring indexing and skew
10. And they proved it too!
“A Comparison of Approaches to Large-Scale
Data Analysis” – SIGMOD 2009
• Vertica vs. DBMS-X vs. Hadoop
• Vertica up to 7x faster than
Hadoop over benchmarks
Databases faster
than Hadoop
15. It’s more than just scale, they
facilitate different practices
16. A Better Fit
They better match the way software is
engineered today.
– Iterative development
– Fast feedback
– Frequent releases
17. Is NoSQL a Disruptive Technology?
Christensen’s observation:
Market leaders are displaced when markets
shift in ways that the incumbent leaders are
not prepared for.
18. Aside: MongoDB
• Impressive trajectory
• Slightly crappy product (from a traditional
database standpoint)
• Most closely related to relational DB (of
the NoSQLs)
• Plays to the agile mindset
19. Yet the NoSQL market is relatively
small
• Currently around $600 but projected to
grow strongly
• Database and systems management
market is worth around $34 billion
20. Key Point
There is more to NoSQL than just
scale, it sits better with the way we
build software today
22. My Problem
• Sprawling application space, built over
many years, grouped into both vertical and
horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time-consuming
and technically challenging.
26. EDW pattern is workable, but tough
– As soon as you take a ‘view’ on what the
shape of the data is, it becomes harder to
change.
• Leave ‘taking a view’ to the last responsible
moment
– Multifaceted: Shape, diversity of source,
diversity of population, temporal change
33. Problems with solidifying a
schematic representation
• Risk of throwing information away, keeping
only what you think you need.
– OK if you create data
– Bad if you got data from elsewhere
• Data tends to be poly-structured in
programs and on the wire
• Early-binding slows down development
34. But schemas are good
• They guarantee a contract
• That contract spans the whole dataset
– Similar to static typing in programming
languages.
35. Compromise positions
• Query schema can be a subset of data
schema.
• Use schemaless databases to capture
diversity early and evolve it as you build.
36. Common solutions today use
multiple technologies
[Diagram: Map Reduce · Data Warehouse? · Key Value Store · In-Memory/OLTP Database]
37. We use a late-bound schema,
sitting over a schemaless store
[Diagram: Structured · Standardisation Layer · Raw Data · Late Bound Schema]
38. Evolutionary Approach
• Late-binding makes consolidation
incremental
– Schematic representation delivered at the ‘last
responsible moment’ (schema on demand)
– A trade in this model has 4 mandatory nodes. A
fully modeled trade has around 800.
• The system of record is raw data, not our
‘view’ of it
• No schema migration! But this comes at a
price.
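A minimal sketch of what ‘schema on demand’ could look like, assuming the raw store holds JSON documents. The names `RAW_STORE`, `TRADE_VIEW` and `bind`, and the four fields chosen, are hypothetical and only illustrate the idea of a small mandatory schema applied at read time; they are not the actual model described above.

```python
import json

# Hypothetical raw store: documents are kept exactly as they arrived.
RAW_STORE = [
    json.dumps({"id": "T1", "trader": "alice", "party": "ACME",
                "notional": 1000000, "extra": {"desk": "rates"}}),
    json.dumps({"id": "T2", "trader": "bob", "party": "Initech",
                "notional": "250000"}),
]

# Hypothetical late-bound schema: field name -> (path into the raw
# document, converter). Only these four nodes are mandatory.
TRADE_VIEW = {
    "id": (("id",), str),
    "trader": (("trader",), str),
    "party": (("party",), str),
    "notional": (("notional",), float),
}


def bind(raw_json, schema):
    """Apply a schema to one raw document at read time."""
    doc = json.loads(raw_json)
    row = {}
    for field, (path, convert) in schema.items():
        value = doc
        for key in path:
            value = value[key]  # KeyError if a mandatory node is missing
        row[field] = convert(value)
    return row


# The 'view' is derived on demand; the raw documents are never rewritten.
rows = [bind(raw, TRADE_VIEW) for raw in RAW_STORE]
```

Changing the view means editing `TRADE_VIEW`, not migrating stored data. The price is that every read pays for binding and validation.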
45. Scaling
• Key-based sharding is only sufficient for very
simple workloads
• Coarse-grained shards help (but suffer
from skew)
• Replication provides useful, if expensive,
hardware isolation
• Workload management is less useful in
my experience
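To illustrate the skew point, here is a small sketch that counts records per shard under two keying choices. The data is synthetic and the helper names (`shard_for`, `load_per_shard`) are assumptions for illustration, not part of any particular product.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is salted per process, so md5 is used
    here purely to get a repeatable assignment.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(keys, num_shards):
    """Return the number of records that land on each shard."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return [counts.get(s, 0) for s in range(num_shards)]


# Fine-grained: shard by trade id, so every record has its own key.
fine_keys = [f"trade-{i}" for i in range(10000)]

# Coarse-grained: shard by party, with one (synthetic) dominant party.
coarse_keys = ["party-big"] * 7000 + [f"party-{i % 50}" for i in range(3000)]

fine = load_per_shard(fine_keys, 8)
coarse = load_per_shard(coarse_keys, 8)
```

With the coarse key, all records for the dominant party sit on a single shard, so related data stays local but that shard carries most of the load.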
47. Scaling two phase commit is hard to
do efficiently
• Requires distributed lock/clock/counter
• Requires synchronisation of all readers &
writers
48. Alternatives to traditional 2PC
• MVCC over explicit locking
• Timestamp based strong consistency
– E.g. Granola
• Optimistic concurrency control
– Leverage short running transactions (avoid
cross-network transactions)
– Tolerate different temporal viewpoints to
reduce synchronization costs.
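A minimal sketch of optimistic concurrency control using a per-key version check, assuming a single-process, in-memory store. `VersionedStore`, `VersionConflict` and `update_with_retry` are hypothetical names; a real store would perform the compare-and-set atomically on the server.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys read as (0, None)."""
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Write only if nobody else has written since our read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def update_with_retry(store, key, fn, max_attempts=5):
    """Short read-modify-write transaction: no locks, retry on conflict."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, version, fn(value))
        except VersionConflict:
            continue  # someone else won the race; re-read and try again
    raise VersionConflict(f"{key}: gave up after {max_attempts} attempts")
```

No reader or writer is blocked. Conflicts are detected at write time, which works well when transactions are short and contention is low.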
50. Use joins to avoid ‘over aggregating’
Joins are ok, so long as they are
– Local
– via a unique key
[Diagram: Trader · Party · Trade]
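A sketch of a local join via a unique key, assuming party data is available on the same node as the trades that reference it (that co-location is an assumption here). The record shapes and the `join_trades_to_parties` helper are hypothetical.

```python
# Hypothetical contents of one node: trades reference parties by a
# unique id instead of embedding a full copy of each party.
parties = {
    "P1": {"name": "ACME"},
    "P2": {"name": "Initech"},
}
trades = [
    {"id": "T1", "party_id": "P1", "notional": 100},
    {"id": "T2", "party_id": "P2", "notional": 250},
    {"id": "T3", "party_id": "P1", "notional": 75},
]


def join_trades_to_parties(trades, parties):
    """Resolve each trade's party with one local lookup per trade."""
    return [
        {**trade, "party_name": parties[trade["party_id"]]["name"]}
        for trade in trades
    ]


joined = join_trades_to_parties(trades, parties)
```

Each party is stored once, so a change to a party does not require rewriting every trade that refers to it, and the join never crosses the network.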
51. Memory/Disk Tradeoff
• Memory only (possibly overplayed)
• Pinned indexes (generally good idea if you
can afford the RAM)
• Disk resident (best general purpose
solution and for very large datasets)
52. Balance flexibility and complexity
[Diagram: Operational (real time / MR) · Object/SQL · Standardisation · Raw Data · Relational Analytics]
53. Supple at the front, more rigid at the back
[Diagram: Raw Access (Untyped) · Operational Access (Object/SQL) · Analytic Access (Reporting); Looser to Tighter; Broad Data Coverage to Narrow Data Coverage; Narrow Query to Comprehensive Query]
54. Principles
• Record everything
• Grow a schema, don’t do it upfront
• Avoid using a ‘view’ as your system of record.
• Differentiate between sourced data (out of
your control) and generated data (in your
control).
• Use automated replication (for isolation) as
well as sharding (for scale)
• Leverage asynchronicity to reduce
transaction overheads
Think about the systems you built five or ten years ago. Who was involved in the building of a new system in the early 2000s? Who used a relational DB? Who seriously considered using anything else?