PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos
1. Paco Nathan
liber118.com/pxn/
“Enterprise Data Workflows
with Cascading and Mesos”
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Saturday, 27 July 13
2. Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Looking ahead…
3. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: it would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
a potential blocker for leveraging new open source
technology.
4. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in the 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
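To make the "data pipelines are functional in nature" point concrete, here is a minimal sketch of word count written as map, shuffle, and reduce steps in plain Python. It is only an illustration of the functional style, not Cascading or Hadoop code; the function names and sample lines are assumptions made up for this example.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for line in lines for token in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

def word_count(lines):
    # The whole job is a composition of pure functions over immutable inputs.
    return reduce_phase(shuffle(map_phase(lines)))

# Hypothetical input lines
counts = word_count(["a rain shadow is a dry area", "a dry area"])
```

Each step takes data in and returns new data, with no shared state, which is what lets a framework distribute the map and reduce steps across a cluster.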
5. Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
12. WordCount – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
[Flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count; map (M) and reduce (R) phases]
13. github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
14. WordCount – Scalding / Scala
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
[Flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count; map (M) and reduce (R) phases]
15. github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• gentler learning curve than Cascalog
WordCount – Scalding / Scala
16. Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
[Flow diagram: Document Collection, Tokenize, Scrub token, Stop Word List, Regex token, HashJoin Left (RHS), GroupBy token, Count, Word Count; map (M) and reduce (R) phases]
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
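The plumbing metaphor can be sketched outside of Java as well. The following is a rough Python analogy, not the Cascading API: generators stand in for pipes, a list of rows stands in for a source tap, a set lookup stands in for the HashJoin against the stop word list, and a plain list plays the role of a failure trap. All names and the sample documents are assumptions invented for illustration.

```python
import re
from collections import Counter

def source_tap(rows):
    """Source tap: yield (doc_id, text) tuples from an in-memory collection."""
    yield from rows

def tokenize(tuples, trap):
    """Pipe: split text into tokens; divert tuples that fail to the trap."""
    for tup in tuples:
        try:
            doc_id, text = tup
            for token in re.split(r"[ \[\]\(\),.]", text):
                yield (doc_id, token)
        except (TypeError, ValueError):
            trap.append(tup)

def scrub(tuples):
    """Filter: normalize tokens and drop the empty ones."""
    for doc_id, token in tuples:
        token = token.strip().lower()
        if token:
            yield (doc_id, token)

def drop_stop_words(tuples, stop_words):
    """Join against an in-memory stop word set, keeping only non-matches."""
    stop = set(stop_words)
    return (t for t in tuples if t[1] not in stop)

def count_by_token(tuples):
    """GroupBy token, then Count."""
    return Counter(token for _, token in tuples)

# Hypothetical documents; the third one is malformed on purpose.
docs = [("doc01", "A rain shadow is a dry area."),
        ("doc02", "Rain, rain (again)"),
        ("doc03", None)]
trap = []
flow = drop_stop_words(scrub(tokenize(source_tap(docs), trap)), ["a", "is"])
word_counts = count_by_token(flow)
```

Nothing runs until `count_by_token` pulls tuples through the chain, which loosely mirrors how a workflow is first assembled from pipes and only then planned and executed.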
17. Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Literate Programming
Don Knuth
literateprogramming.com
18. Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
20. Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Looking ahead…
21. Anatomy of an Enterprise app
Definition: a typical Enterprise workflow that crosses
multiple departments, languages, and technologies…
[Diagram: data sources, ETL, data prep, predictive model, end uses]
22. Anatomy of an Enterprise app
ANSI SQL for ETL
23. Anatomy of an Enterprise app
J2EE for business logic
24. Anatomy of an Enterprise app
SAS for predictive models
25. Anatomy of an Enterprise app
SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
26. Anatomy of an Enterprise app
J2EE for business logic: most of the project costs…
27. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
[Diagram: data sources, ETL, data prep, predictive model, end uses, annotated with:]
• source taps for Cassandra, JDBC, Splunk, etc.
• Lingual: DW → ANSI SQL
• Pattern: SAS, R, etc. → PMML
• business logic in Java, Clojure, Scala, etc.
• sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all…
cascading.org
28. a compiler sees it all…
Anatomy of an Enterprise app
Lingual: DW → ANSI SQL
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
29. a compiler sees it all…
Anatomy of an Enterprise app
Pattern: SAS, R, etc. → PMML
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
30. Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
visual collaboration for the business logic is a great
way to improve how teams work together
[Flow diagram: employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]
cascading.org
31. Lingual – CSV data in local file system
cascading.org/lingual
34. Lingual – connecting Hadoop and R
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
35. > summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
Lingual – connecting Hadoop and R
cascading.org/lingual
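The hire_age column summarized above is elapsed days between birth date and hire date, divided by 365.25. As a small cross-check of that arithmetic, here is the same calculation sketched in Python rather than R; the function name and the sample dates are assumptions for illustration and are not taken from the employee data.

```python
from datetime import date

def hire_age(birth_date: date, hire_date: date) -> float:
    """Age at hire in years: elapsed days divided by 365.25, as in the R snippet."""
    return (hire_date - birth_date).days / 365.25

# Hypothetical employee: born 1960-01-01, hired 1990-01-01.
# That span covers 30 calendar years with 8 leap days, i.e. 10958 days.
age = hire_age(date(1960, 1, 1), date(1990, 1, 1))
```

Dividing by 365.25 averages out leap years, so the result lands slightly above 30 here rather than exactly on it.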
37. • established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
PMML – standard
wikipedia.org/wiki/Predictive_Model_Markup_Language
39. • Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
PMML – model coverage
ibm.com/developerworks/industry/library/ind-PMML2/
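Since each model family maps to a named XML element, a tool can tell which kind of model a PMML file carries by inspecting its top-level children. The sketch below does that with Python's standard library. The helper name and the tiny sample document are assumptions for illustration; the sample is deliberately stripped down and is not a schema-valid PMML model.

```python
import xml.etree.ElementTree as ET

# Element names taken from the model coverage list above.
MODEL_ELEMENTS = {
    "AssociationModel", "ClusteringModel", "TreeModel", "NaiveBayesModel",
    "NeuralNetwork", "RegressionModel", "GeneralRegressionModel", "RuleSetModel",
    "SequenceModel", "SupportVectorMachineModel", "TextModel", "TimeSeriesModel",
}

def model_elements(pmml_text):
    """Return the names of top-level model elements found in a PMML document."""
    root = ET.fromstring(pmml_text)
    names = []
    for child in root:
        local = child.tag.rsplit("}", 1)[-1]  # strip the XML namespace, if any
        if local in MODEL_ELEMENTS:
            names.append(local)
    return names

# Hypothetical, heavily abbreviated document used only to exercise the helper.
SAMPLE = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="hypothetical example"/>
  <DataDictionary numberOfFields="0"/>
  <TreeModel functionName="classification"/>
</PMML>"""

found = model_elements(SAMPLE)
```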
40. Pattern – create a model in R
## load the packages used below
library(randomForest)
library(pmml)
library(XML)

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
44. Roadmap – existing algorithms for scoring
• Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial
• Support Vector Machines (prepared for release)
also, model chaining and general support for ensembles
cascading.org/pattern
45. Roadmap – next priorities for scoring
• Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact groups.google.com/forum/?fromgroups#!forum/pattern-user
cascading.org/pattern
46. Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Looking ahead…
47. Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this
48. Circa 1996: pre-inflection point
[Diagram: Customers, Web App, transactions, RDBMS, SQL Query, result sets, BI Analysts, Excel pivot tables, PowerPoint slide decks, Stakeholder, Product, strategy, Engineering, requirements, optimized code]
49. Circa 1996: pre-inflection point
“throw it over the wall”
50. Circa 2001: after the big ecommerce successes
[Diagram: Customers, Web Apps, customer transactions, RDBMS, SQL Query, result sets, Logs, event history, DW, ETL, aggregation, dashboards, Algorithmic Modeling, recommenders + classifiers, Middleware, servlets, models, Stakeholder, Product, Engineering, UX]
51. Circa 2001: after the big ecommerce successes
“data products”
52. Circa 2013: clusters everywhere
[Diagram: Customers; Web Apps, Mobile, etc.; transactions, content; social interactions; Data Products; services; Workflow; RDBMS; Log Events; History; In-Memory Data Grid; Hadoop, etc.; near time, batch; Cluster Scheduler; Planner; taps; DW; Use Cases Across Topologies; Prod; Eng; Ops; s/w dev; data science; discovery + modeling; dashboard metrics; business process; optimized capacity; Data Scientist; App Dev; Domain Expert; introduced capability; existing SDLC]
53. Circa 2013: clusters everywhere
“optimize topologies”
54. Primary Sources
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
55. Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega: data center “secret sauce”
youtu.be/0ZFMlO98Jkc
[Charts: CPU load (0% to 100%) over time t for Rails, Memcached, and Hadoop shown separately, then the combined CPU load (Rails, Memcached, Hadoop) on one chart]
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
goo.gl/jPtTP
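The reasoning behind the combined CPU load chart can be worked through with simple arithmetic: when workloads peak at different times, one shared cluster needs capacity for the peak of the summed load, which is lower than the sum of the separate peaks. The sketch below uses made-up demand figures, so every number and name in it is an assumption for illustration, not data from the chart.

```python
def capacity_needed(loads):
    """Compare provisioning each workload separately with sharing one cluster.

    Returns (sum of per-workload peaks, peak of the combined load).
    """
    separate = sum(max(series) for series in loads.values())
    combined = [sum(values) for values in zip(*loads.values())]
    return separate, max(combined)

# Hypothetical CPU demand in cores, sampled at four points in time.
loads = {
    "rails":     [20, 60, 80, 40],
    "memcached": [10, 30, 40, 20],
    "hadoop":    [90, 20, 10, 70],
}
separate, shared = capacity_needed(loads)
```

With these invented figures, separate clusters would need 210 cores while a shared cluster would need 130, which is the kind of utilization gain the chart is pointing at.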
56. Mesos
mesos.apache.org
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/
57. Mesos
a common substrate for cluster computing
heterogeneous assets in your data center or cloud made
available as a homogeneous set of resources
• leverages OS features in Linux/Unix
• obviates the need for virtual machines
• written in C++, with APIs for Python, Java, Scala, etc.
• available for Linux, Mac OS X, OpenSolaris
• developed by UC Berkeley, Twitter, Airbnb, Mesosphere, etc.
• deployments at Twitter, Airbnb, Conviva, Foursquare, Vimeo,
Shopify, UCSF, UC Berkeley, etc.
58. Mesos
a common substrate for cluster computing
• scale to 10,000s of nodes using fast, event-driven C++ impl
• maximize utilization rates, minimize latency for data updates
• combine batch, real-time, and long-lived services on the same
nodes and share resources
• reshape clusters on the fly based on app history and workload
requirements
• run multiple Hadoop versions, Spark, MPI, Heroku, HAProxy, etc.,
on the same cluster
• build new distributed frameworks without reinventing low-level
facilities
• enable new kinds of apps, which combine frameworks with lower
latency
• hire top talent out of Gxxxxx, providing a familiar data center env
60. Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Looking ahead…
61. A Crash Course in Machine Learning…
consider ML as an approach for generalization…
here’s a great introduction to ML, plus a proposed categorization
for comparing different machine learning approaches:
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
key points:
• representation: a classifier must be represented in some formal
language that the computer can handle (algorithms, data structures,
etc.)
• evaluation: an evaluation function (objective function, scoring
function) is needed to distinguish good classifiers from bad ones
• optimization: a method to search among the classifiers in the
language for the highest-scoring one
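The three components can be seen in even the smallest learner. Here is a minimal sketch, assuming a one-feature threshold classifier as the representation, accuracy as the evaluation function, and exhaustive search as the optimizer. The function names and the toy examples are invented for illustration and do not come from the Domingos paper.

```python
def make_classifier(threshold):
    """Representation: a classifier is a single threshold on one numeric feature."""
    return lambda x: 1 if x >= threshold else 0

def accuracy(classifier, examples):
    """Evaluation: fraction of (x, label) examples classified correctly."""
    return sum(classifier(x) == y for x, y in examples) / len(examples)

def best_threshold(examples):
    """Optimization: exhaustive search over candidate thresholds for the top score."""
    candidates = sorted({x for x, _ in examples})
    return max(candidates, key=lambda t: accuracy(make_classifier(t), examples))

# Hypothetical labeled examples: (feature value, label)
examples = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (2.5, 0)]
threshold = best_threshold(examples)
score = accuracy(make_classifier(threshold), examples)
```

Swapping any one of the three functions for another, for example a different scoring function, yields a different learner, which is what makes this categorization useful for comparing approaches.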
62. Algorithms
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated
algorithms work – as Breiman suggested in 2001 – which may take
a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
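As one example from the list above, partial aggregates let each partition of the data be summarized independently and the summaries merged afterwards. A minimal sketch for computing a mean, assuming (count, sum) pairs as the partial aggregate; the function names and sample partitions are made up for illustration.

```python
from functools import reduce

def partial(values):
    """Per-partition aggregate: (count, sum), computed without seeing other partitions."""
    return (len(values), sum(values))

def merge(a, b):
    """Associative, commutative merge of two partial aggregates."""
    return (a[0] + b[0], a[1] + b[1])

def mean_from_partials(partitions):
    count, total = reduce(merge, (partial(p) for p in partitions), (0, 0))
    return total / count

# Hypothetical data spread across three partitions
partitions = [[1, 2, 3], [4, 5], [6]]
mean = mean_from_partials(partitions)
```

Because the merge is associative and commutative, partials can be combined in any order and on any node, which is what makes the aggregate easy to parallelize. Averaging the per-partition means directly would give the wrong answer whenever partitions differ in size.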
63. Make It Sparse…
also, take a moment to check this out…
(IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
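The core idea behind QR for a tall-and-skinny matrix can be sketched briefly: factor each block of rows independently, stack the small R factors, and factor that stack once more. The code below is a toy, dense, pure-Python version using modified Gram-Schmidt and computing only the R factor. It is one common formulation offered as an assumption, not the implementation from the linked work, and it requires every row block to have full column rank.

```python
import math

def qr_r(rows):
    """Upper-triangular R factor of `rows` (list of equal-length rows),
    via modified Gram-Schmidt. Assumes full column rank."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i in range(j):
            # cols[i] already holds the normalized vector q_i
            r[i][j] = sum(q * x for q, x in zip(cols[i], v))
            v = [x - r[i][j] * q for x, q in zip(v, cols[i])]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        cols[j] = [x / r[j][j] for x in v]
    return r

def tsqr_r(rows, block_rows):
    """Factor each row block on its own, stack the R factors, factor again."""
    blocks = [rows[i:i + block_rows] for i in range(0, len(rows), block_rows)]
    stacked = [r_row for block in blocks for r_row in qr_r(block)]
    return qr_r(stacked)

# Hypothetical 6 x 2 matrix, split into two blocks of three rows
a = [[1, 2], [3, 4], [5, 6], [7, 8], [2, 1], [0, 3]]
r = tsqr_r(a, block_rows=3)
```

The per-block factorizations are independent, so they map naturally onto parallel tasks, and only the small R factors need to be brought together for the final step.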
64. Sparse Matrix Collection
for when you really need a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
65. A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… that works much, much
better than sampling!
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for machine learning apps at scale
66. Suggested Reading
when you have time, take a look through these selected articles…
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Probabilistic Data Structures for Web Analytics and Data Mining
Ilya Katsov, Grid Dynamics
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
MapReduce is Good Enough?
Jimmy Lin, U Maryland + Twitter
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
67. Business Drivers
algorithmic modeling + machine data + curation, metadata + Open Data
data products, as feedback into automation
evolution of feedback loops
less about “bigness”, more about complexity
internet of things + A/D conversion + complex analytics
accelerated evolution, additional feedback loops
orders of magnitude higher data rates
Internet of Things accelerates this process of disruption
“A kind of Cambrian explosion”
source: National Geographic
68. Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc.,
plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
technologyreview.com/...