4. Big Data circa 2004: Internet exhaust
• Through the 1990s, Moore’s Law + Kryder’s Law grew
faster than transaction rates, and what was
“overwhelming” in 1994 was manageable by 2004
• But large internet concerns (Google, Facebook, Yahoo!)
encountered a new class of problem: analyzing massive
amounts of data emitted as a byproduct of activity
• Data scaled with activity, not transactions — changing
both data sizes and economics
• Data sizes were too large for extant data warehousing
solutions — and were embarrassingly parallel besides
5. Big Data circa 2004: MapReduce
• MapReduce, pioneered by Google and later emulated
by Hadoop, pointed to a new paradigm where compute
tasks are broken into map and reduce phases
• Serves to explicitly divide the work that can be
parallelized from that which must be run sequentially
• Map phases are farmed out to a storage layer that
attempts to co-locate them with the data being mapped
• Made for commodity scale-out systems; relatively cheap
storage allowed for sloppy but effective solutions (e.g.
storing data in triplicate)
6. Big Data circa 2014
• Hadoop has become the de facto big data processing
engine — and HDFS the de facto storage substrate
• But HDFS is designed around availability during/for
computation; it is not designed to be authoritative
• HDFS is used primarily for data that is redundant,
transient, replaceable or otherwise fungible
• Authoritative storage remains either enterprise storage
(on premises) or object storage (in the cloud)
• For analysis of non-fungible data, pattern is to ingest
data into a Hadoop cluster from authoritative storage
• But a new set of problems is poised to emerge...
7. Big Data circa 2014: Internet-of-things
• IDC forecasts that the “digital universe” will grow from
130 exabytes in 2005 to 40,000 exabytes in 2020 —
with as much as a third having “analytic value”
• This doesn’t even factor in the (long forecasted) rise of
the internet-of-things/industrial internet...
• Machine-generated data at the edge will effect a step
function in data sizes and processing methodologies
• No one really knows how much data will be generated
by IoT, but the numbers are insane (e.g., HD camera
generates 20 GB/hour; a Ford Energi engine generates
25 GB/hour; a GE jet engine generates 1 TB/flight)
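A rough back-of-the-envelope check on these figures, as a sketch: the first function computes the compound annual growth rate implied by the IDC forecast (130 EB in 2005 to 40,000 EB in 2020), and the second converts an hourly data rate into a yearly total. Continuous 24x7 operation and decimal units (1 TB = 1000 GB) are assumptions made here, not part of the forecast; the function names are illustrative.

```shell
# Implied compound annual growth rate of the IDC forecast, in percent:
# (40000 / 130)^(1/15) - 1, computed via exp/log for awk portability.
idc_growth_pct() {
    awk 'BEGIN { printf "%.1f\n", (exp(log(40000 / 130) / 15) - 1) * 100 }'
}

# Yearly output in TB of a device emitting $1 GB/hour.
# Assumes continuous operation and 1 TB = 1000 GB.
tb_per_year() {
    awk -v rate="$1" 'BEGIN { printf "%.1f\n", rate * 24 * 365 / 1000 }'
}

idc_growth_pct     # forecast growth, percent per year
tb_per_year 20     # the 20 GB/hour HD camera
```

Under these assumptions the forecast implies growth of roughly 46% per year, and a single always-on camera produces about 175 TB per year.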
8. How to cope with IoT-generated data?
• IoT presents so much more data that we will
increasingly need data science to make sense of it
• To assure data, we need to retain as much raw data as
possible, storing it once and authoritatively
• Storing data authoritatively has ramifications for the
storage substrate
• To allow for science, we need to place an emphasis on
hypothesis exploration: it must be quick to iterate from
hypothesis to experiment to result to new hypothesis
• Emphasizing hypothesis exploration has ramifications
for the compute abstractions and data movement
9. The coming ramifications of IoT
• It will no longer be acceptable to discard data: all data
will need to be retained to explore future hypotheses
• It will no longer be acceptable to store three copies: 3X
on storage costs is too acute when data is massive
• It will no longer be acceptable to move data for analysis:
in some cases, not even over the internet!
• It will no longer be acceptable to dictate the abstraction:
we must accommodate anything that can process data
• These shifts are as significant as the shift from
traditional data warehousing to scale-out MapReduce!
10. IoT: Authoritative storage?
• “Filesystems” that are really just user-level programs
layered on local filesystems lack device-level visibility,
sacrificing reliability and performance
• Even in-kernel, we have seen the corrosiveness of an
abstraction divide in the historic split between logical
volume management and the filesystem:
• The volume manager understands multiple disks, but
nothing of the higher level semantics of the filesystem
• The filesystem understands the higher semantics of the
data, but has no physical device understanding
• This divide became entrenched over the 1990s, and had
devastating ramifications for reliability and performance
11. The ZFS revolution
• Starting in 2001, Sun began a revolutionary new
software effort: to unify storage and eliminate the divide
• In this model, filesystems would lose their one-to-one
association with devices: many filesystems would be
multiplexed on many devices
• By starting with a clean sheet of paper, ZFS opened up
vistas of innovation — and by its architecture was able
to solve many otherwise intractable problems
• Sun shipped ZFS in 2005, and used it as the foundation
of its enterprise storage products starting in 2008
• ZFS was open sourced in 2005; it remains the only open
source enterprise-grade filesystem
12. ZFS advantages
• Copy-on-write design allows on-disk consistency to be
always assured (eliminating file system check)
• Copy-on-write design allows constant-time snapshots in
unlimited quantity — and writable clones!
• Filesystem architecture allows filesystems to be created
instantly and expanded — or shrunk! — on-the-fly
• Integrated volume management allows for intelligent
device behavior with respect to disk failure and recovery
• Adaptive replacement cache (ARC) allows for optimal
use of DRAM — especially on high DRAM systems
• Support for dedicated log and cache devices allows for
optimal use of flash-based SSDs
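As a sketch of how snapshots and writable clones can support quick experiments: the helper below snapshots a dataset and then clones that snapshot. `zfs snapshot` and `zfs clone` are standard ZFS subcommands; the dataset names (tank/data, tank/experiment) and the ZFS override variable are hypothetical, introduced here so the commands can be previewed without a pool.

```shell
# Command used to talk to ZFS. Setting ZFS=echo previews the commands
# without touching a pool (an illustrative convention, not part of ZFS).
ZFS=${ZFS:-zfs}

# snapshot_and_clone DATASET SNAPNAME CLONE
# Take a snapshot of DATASET, then create a writable clone of it.
snapshot_and_clone() {
    "$ZFS" snapshot "$1@$2" &&
    "$ZFS" clone "$1@$2" "$3"
}

# Preview with hypothetical dataset names (prints the two commands):
ZFS=echo snapshot_and_clone tank/data before-experiment tank/experiment
```

Because the clone shares blocks with the snapshot until it is modified, an experiment can scribble on the clone while the original dataset stays untouched.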
13. ZFS at Joyent
• Joyent was the earliest ZFS adopter: becoming (in
2005) the first production user of ZFS outside of Sun
• ZFS is one of the four foundational technologies of
Joyent’s SmartOS, our illumos derivative
• The other three foundational technologies in SmartOS are
DTrace, Zones and KVM
• Search “fork yeah illumos” for the (uncensored) history of
OpenSolaris, illumos, SmartOS and derivatives
• Joyent has extended ZFS to provide better support for
multi-tenant operation with I/O throttling
14. ZFS as the basis for IoT?
• ZFS offers commodity hardware economics with
enterprise-grade reliability — and obviates the need for
cross-machine mirroring for durability
• But ZFS is not itself a scale-out distributed system, and
is ill suited to become one
• Conclusion: ZFS is a good building block for the data
explosion from IoT, but not the whole puzzle
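To make the storage economics concrete, a small worked calculation: raw capacity needed per unit of usable data under three-way replication versus single-machine parity. The 9 data + 2 parity layout is an assumption chosen for illustration (it resembles a double-parity group) and ignores metadata and spare overhead.

```shell
# raw_overhead DATA REDUNDANT
# Raw bytes stored per usable byte when DATA units of data carry
# REDUNDANT extra units (replica copies or parity).
raw_overhead() {
    awk -v d="$1" -v r="$2" 'BEGIN { printf "%.2f\n", (d + r) / d }'
}

raw_overhead 1 2   # triplicate: 1 original + 2 replicas
raw_overhead 9 2   # assumed layout: 9 data disks + 2 parity disks
```

Under these assumptions, 1 PB of usable data needs about 3 PB of raw disk in triplicate but about 1.22 PB with parity.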
15. IoT: Compute abstraction?
• To facilitate hypothesis exploration, we need to carefully
consider the abstraction for computation
• How is data exploration programmatically expressed?
• How can this be made to be multi-tenant?
• The key enabling technology for multi-tenancy is
virtualization — but where in the stack to virtualize?
16. Hardware-level virtualization?
• The historical answer — since the 1960s — has been to
virtualize at the level of the hardware:
• A virtual machine is presented upon which each
tenant runs an operating system of their choosing
• There are as many operating systems as tenants
• The historical motivation for hardware virtualization
remains its advantage today: it can run entire legacy
stacks unmodified
• However, hardware virtualization exacts a heavy toll:
operating systems are not designed to share resources
like DRAM, CPU, I/O devices or the network
• Hardware virtualization limits tenancy and inhibits
performance!
17. Platform-level virtualization?
• Virtualizing at the application platform layer addresses
the tenancy challenges of hardware virtualization...
• ...but at the cost of dictating abstraction to the developer
• With IoT, this is especially problematic: we can expect
much more analog data and much deeper numerical
analysis — and dependencies on native libraries and/or
domain-specific languages
• Virtualizing at the application platform layer poses many
other challenges:
• Security, resource containment, language specificity,
environment-specific engineering costs
18. Joyent’s solution: OS containers
• Containers virtualize the OS and hit the sweet spot:
• Single OS (single kernel) allows for efficient use of hardware
resources, and therefore allows load factors to be high
• Disjoint instances are securely compartmentalized by the
operating system
• Gives customers what appears to be a virtual machine
(albeit a very fast one) on which to run higher-level software
• Gives customers PaaS when the abstractions work for them,
IaaS when they need more generality
• OS-level virtualization allows for high levels of tenancy
without dictating abstraction or sacrificing efficiency
• Zones is a bullet-proof implementation of OS-level
virtualization — and is the core abstraction in Joyent’s
SmartOS
20. Manta: ZFS + Containers!
• Building a sophisticated distributed system on top of
ZFS and zones, we have built Manta, an internet-facing
object storage system offering in situ compute
• That is, the description of compute can be brought to
where objects reside instead of having to backhaul
objects to transient compute
• The abstractions made available for computation are
anything that can run on the OS...
• ...and as a reminder, the OS — Unix — was built around
the notion of ad hoc unstructured data processing, and
allows for remarkably terse expressions of computation
21. Aside: Unix
• When Unix appeared in the early 1970s, it was not just a
new system, but a new way of thinking about systems
• Instead of a sealed monolith, the operating system was
a collection of small, easily understood programs
• First Edition Unix (1971) contained many programs that
we still use today (ls, rm, cat, mv)
• Its very name conveyed this minimalist aesthetic: Unix is
a homophone of “eunuchs” — a castrated Multics
We were a bit oppressed by the big system mentality. Ken
wanted to do something simple. — Dennis Ritchie
22. Unix: Let there be light
• In 1969, Doug McIlroy had the idea of connecting
different components:
At the same time that Thompson and Ritchie were sketching
out a file system, I was sketching out how to do data
processing on the blackboard by connecting together
cascades of processes
• This was the primordial pipe, but it took three years to
persuade Thompson to adopt it:
And one day I came up with a syntax for the shell that went
along with the piping, and Ken said, “I’m going to do it!”
23. Unix: ...and there was light
And the next morning we had this
orgy of one-liners. — Doug McIlroy
24. The Unix philosophy
• The pipe — coupled with the small-system aesthetic —
gave rise to the Unix philosophy, as articulated by Doug
McIlroy:
• Write programs that do one thing and do it well
• Write programs to work together
• Write programs that handle text streams, because
that is a universal interface
• Four decades later, this philosophy remains the single
most important revolution in software systems thinking!
25. Doug McIlroy v. Don Knuth: FIGHT!
• In 1986, Jon Bentley posed the challenge that became
the Epic Rap Battle of computer science history:
Read a file of text, determine the n most frequently used
words, and print out a sorted list of those words along with
their frequencies.
• Don Knuth’s solution: an elaborate program in WEB, a
Pascal-like literate programming system of his own
invention, using a purpose-built algorithm
• Doug McIlroy’s solution shows the power of the Unix
philosophy:
tr -cs A-Za-z '\n' | tr A-Z a-z |
sort | uniq -c | sort -rn | sed ${1}q
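As a usage sketch, McIlroy's pipeline can be wrapped in a shell function and fed a small sample. The function name topwords and the sample sentence are illustrative; the pipeline itself is the one quoted on the slide.

```shell
# topwords N: print the N most frequent words on stdin, most frequent first.
topwords() {
    tr -cs A-Za-z '\n' |   # turn every run of non-letters into one newline
    tr A-Z a-z |           # fold to lower case
    sort |                 # bring identical words together
    uniq -c |              # collapse each run into "count word"
    sort -rn |             # order by count, descending
    sed "${1}q"            # stop after N lines
}

printf 'the cat and the dog and the bird\n' | topwords 2
```

Each stage does one thing, so any stage can be swapped without touching the rest (for instance, dropping the case-folding step).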
26. Big Data: History repeats itself?
• The original Google MapReduce paper (Dean and Ghemawat,
OSDI ’04) poses a problem disturbingly similar to
Bentley’s challenge nearly two decades prior:
Count of URL Access Frequency: The map function processes
logs of web page requests and outputs ⟨URL, 1⟩. The
reduce function adds together all values for the same URL
and emits a ⟨URL, total count⟩ pair
• But the solutions do not adhere to the Unix philosophy...
• ...and nor do they make use of the substantial Unix
foundation for data processing
• e.g., Appendix A of the OSDI ’04 paper has a 71-line
word count in C++ — with nary a wc in sight
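For comparison, the same count of URL access frequency can be sketched with standard Unix tools. This assumes logs in Common Log Format, where the requested URL is the seventh whitespace-separated field; the function name and the sample log lines are illustrative.

```shell
# url_counts: read access-log lines on stdin and print "count URL" pairs,
# most requested first. Assumes the URL is field 7 (Common Log Format).
url_counts() {
    awk '{ print $7 }' | sort | uniq -c | sort -rn
}

url_counts <<'EOF'
10.0.0.1 - - [01/Jan/2014:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2014:00:00:02 +0000] "GET /about.html HTTP/1.1" 200 128
10.0.0.3 - - [01/Jan/2014:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512
EOF
```

Here awk plays the role of the map function (emit one URL per request) and sort with uniq -c plays the role of the reduce (sum per URL).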
27. Manta: Unix for Big Data — and IoT
• Manta allows for an arbitrarily scalable variant of
McIlroy’s solution to Bentley’s challenge:
mfind -t o /bcantrill/public/v7/usr/man |
mjob create -o -m "tr -cs A-Za-z '\n' |
tr A-Z a-z | sort | uniq -c" -r \
"awk '{ x[\$2] += \$1 }
END { for (w in x) { print x[w] \" \" w } }' |
sort -rn | sed ${1}q"
• This description is not only terse, it is high performing: data
is left at rest — with the “map” phase doing heavy
reduction of the data stream
• As such, Manta — like Unix — is not merely syntactic
sugar; it converges compute and data in a new way
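The division of labor in that job can be mimicked locally as a sketch: the map command runs once per input file, and the concatenated map output feeds a single reduce. This only mirrors the structure using standard tools on local files; it does not use Manta, and the function names and sample files are illustrative.

```shell
# Map: per-file word counts ("count word" lines), as in the -m phase.
map_phase() {
    tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c
}

# Reduce: sum the per-file counts for each word, keep the top N.
reduce_phase() {
    awk '{ x[$2] += $1 } END { for (w in x) print x[w] " " w }' |
    sort -rn | sed "${1}q"
}

# wordfreq N FILE...: run the map once per file, then one reduce.
wordfreq() {
    n=$1; shift
    for f in "$@"; do map_phase < "$f"; done | reduce_phase "$n"
}

# Example: two small "objects" in a temporary directory.
dir=$(mktemp -d)
printf 'to be or not to be\n' > "$dir/a.txt"
printf 'to err is human\n' > "$dir/b.txt"
wordfreq 2 "$dir"/*.txt
rm -rf "$dir"
```

The map output is already aggregated per file, so the reduce sees one line per distinct word per file and not one line per occurrence, which is the "heavy reduction of the data stream" the slide describes.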
28. Manta: CAP tradeoffs
• Eventual consistency represents the wrong CAP
tradeoffs for most; we prefer consistency over
availability for writes (but still availability for reads)
• Many more details:
http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/
• Celebrity endorsement:
29. Manta: Other design principles
• Hierarchical storage is an excellent idea (ht: Multics);
Manta implements proper directories, delimited with a
forward slash
• Manta implements a snapshot/link hybrid dubbed a
snaplink; can be used to effect versioning
• Manta has full support for CORS headers
• Manta uses SSH-based HTTP auth for client-side
tooling (IETF draft-cavage-http-signatures-00)
• Manta SDKs exist for node.js, R, Go, Java, Ruby,
Python — and of course, compute jobs may be in any of
these (plus Perl, Clojure, Lisp, Erlang, Forth, Prolog,
Fortran, Haskell, Lua, Mono, COBOL, etc.)
• “npm install manta” for command line interface
30. Manta and IoT
• We believe compute/data convergence to be a
constraint imposed by IoT: stores of record must support
computation as a first-class, in situ operation
• We believe that some (and perhaps many) IoT
workloads will require computing at the edge — internet
transit may be prohibitive for certain applications
• We believe that Unix is a natural way of expressing this
computation — and that OS containers are the right way
to support this securely
• We believe that ZFS is the only sane storage
underpinning for such a system
• Manta will surely not be the only system to represent the
confluence of these — but it is the first
31. Manta: More information
• Product page:
http://joyent.com/products/manta
• node.js module:
https://github.com/joyent/node-manta
• Manta documentation:
http://apidocs.joyent.com/manta/
• IRC, e-mail, Twitter, etc.:
#manta on freenode, manta@joyent.com, @mcavage,
@dapsays, @yunongx, @joyent