SlideShare uma empresa Scribd logo
1 de 96
Baixar para ler offline
Paco Nathan
liber118.com/pxn/
@pacoid
“Hadoop and Beyond”
Licensed under a Creative Commons Attribution-
NonCommercial-NoDerivs 3.0 Unported License.
1Saturday, 13 July 13
?
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
2Saturday, 13 July 13
First Principles
we are taught to think of computing resources
in terms of Von Neumann architecture
in other words, we characterize the computing
resources by CPU, RAM, I/O
3Saturday, 13 July 13
First Principles
we are taught to think of computing resources
in terms of Von Neumann architecture
in other words, we characterize the computing
resources by CPU, RAM, I/O
CPU
4Saturday, 13 July 13
First Principles
we are taught to think of computing resources
in terms of Von Neumann architecture
in other words, we characterize the computing
resources by CPU, RAM, I/O
RAM
5Saturday, 13 July 13
First Principles
we are taught to think of computing resources
in terms of Von Neumann architecture
in other words, we characterize the computing
resources by CPU, RAM, I/O
I/O
6Saturday, 13 July 13
First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space, and one file space
7Saturday, 13 July 13
First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
8Saturday, 13 July 13
First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, or separate busses, etc.
but essentially those were incremental extensions
to aVon Neumann architecture…
9Saturday, 13 July 13
First Principles
back in the day, all the tables required for a
given database could fit onto one computer,
with one memory space, and one file space
• okay, maybe the CPU was multi-core…
• okay, maybe RAM paged out to virtual memory…
• okay, maybe the disks were in a RAID config…
or there were extra caches, or separate busses, etc.
but essentially those were incremental extensions
to aVon Neumann architecture…
a machine created in his image, if you will
NB: credit should go to Eckert and Mauchly, inventors of the ENIAC
10Saturday, 13 July 13
First Principles
a generation of computer scientists has been
taught to think “relational” – data on a DB server
RDBMS made sense, with their indexes, b-trees,
normal forms, etc.
Q: need to query bigger data?
A: simple, buy or lease a bigger DB server
11Saturday, 13 July 13
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
12Saturday, 13 July 13
Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this
13Saturday, 13 July 13
RDBMS
Stakeholder
SQL Query
result sets
Excel pivot tables
PowerPoint slide decks
Web App
Customers
transactions
Product
strategy
Engineering
requirements
BI
Analysts
optimized
code
Circa 1996: pre- inflection point
14Saturday, 13 July 13
RDBMS
Stakeholder
SQL Query
result sets
Excel pivot tables
PowerPoint slide decks
Web App
Customers
transactions
Product
strategy
Engineering
requirements
BI
Analysts
optimized
code
Circa 1996: pre- inflection point
“throw it over the wall”
15Saturday, 13 July 13
RDBMS
SQL Query
result sets
recommenders
+
classifiers
Web Apps
customer
transactions
Algorithmic
Modeling
Logs
event
history
aggregation
dashboards
Product
Engineering
UX
Stakeholder Customers
DW ETL
Middleware
servletsmodels
Circa 2001: post- big ecommerce successes
16Saturday, 13 July 13
RDBMS
SQL Query
result sets
recommenders
+
classifiers
Web Apps
customer
transactions
Algorithmic
Modeling
Logs
event
history
aggregation
dashboards
Product
Engineering
UX
Stakeholder Customers
DW ETL
Middleware
servletsmodels
Circa 2001: post- big ecommerce successes
“data products”
17Saturday, 13 July 13
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Erik Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab
“Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Primary Sources
18Saturday, 13 July 13
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits
well into tables/arrays
• Human/Nontabular data – all other data generated by
humans
• Machine-Generated data
Now let’s add IoT:
• A/D conversion for sensors
Machine Data
19Saturday, 13 July 13
Data Jujitsu
DJ Patil
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data ScienceTeams
DJ Patil
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
Data Products
20Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere
21Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere
“optimize topologies”
22Saturday, 13 July 13
?
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
23Saturday, 13 July 13
Modeling
back in the day, we worked with practices based on
data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst,
ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced data modeling
because the data won’t fit on one computer anymore
24Saturday, 13 July 13
Two Cultures
“A new research community using these tools sprang up.Their goal
was predictive accuracy.The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”
Statistical Modeling: TheTwo Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
this paper chronicled a sea change from data modeling practices
(silos, manual process) to the rising use of algorithmic modeling
(machine data for automation/optimization)
25Saturday, 13 July 13
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
26Saturday, 13 July 13
Algorithmic Modeling
“The trick to being a scientist is to be open to using
a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc.,
yield dramatic increases in predictive power over earlier
modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of
ensembles, model chaining, etc.
the problems at hand have become simply too big and too
complex for ONE distribution, ONE model, ONE team…
an overall history of data science:
forbes.com/sites/gilpress/2013/05/28/a-very-
short-history-of-data-science/
27Saturday, 13 July 13
Why Do Ensembles Matter?
The World…
per Data Modeling
The World…
28Saturday, 13 July 13
Ensemblers of Fortune
Breiman:“a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity
(making it more difficult to anticipate or explain predictions)
accuracy may increase substantially
Ensemble Learning: Better PredictionsThrough Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize:An EnsemblersTale
Lester Mackey
National Academies Seminar,Washington, DC (2011)
stanford.edu/~lmackey/papers/
29Saturday, 13 July 13
?
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
30Saturday, 13 July 13
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
31Saturday, 13 July 13
Q: Can I simply hire one rockstar
data scientist to cover all this kind
of work?
32Saturday, 13 July 13
A: No, multi-disciplinary work
requires teams.
A: Hire leads who speak the lingo
of each domain.
A: Hire people who cover 2+ roles,
when possible.
33Saturday, 13 July 13
approximately 80% of the costs for data-related projects
gets spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
d3js.org
What is needed most?
34Saturday, 13 July 13
employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must
Process Variation Data Tools
Statistical Thinking
35Saturday, 13 July 13
discovery
discovery
modeling
modeling
integration
integration
appsapps
systems
systems
business process,
stakeholder
data prep, discovery,
modeling, etc.
software engineering,
automation
systems engineering,
access
data
science
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
Team Composition: Needs × Roles
36Saturday, 13 July 13
issues confronted:
“Data becomes too complex for
ONE computer, ONE model,
ONE expert…”
trends observed:
“Historical arc: 1996 - 2013,
rise of machine data, scale-out,
and algorithmic modeling…”
“The management problem is
about multi-disciplinary teams
and learning curves…”
37Saturday, 13 July 13
Culture
Notes from the Mystery Machine Bus
SteveYegge, Google
goo.gl/SeRZa
consider these perspectives
in light of Conway’s Law…
“conservatism” “liberalism”
(mostly) Enterprise (mostly) Start-Up
risk management customer experiments
assurance flexibility
well-defined schema schema follows code
explicit configuration convention
type-checking compiler interpreted scripts
wants no surprises wants no impediments
Java, Scala, Clojure, etc. PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc. Hive, Pig, Hadoop Streaming, etc.
38Saturday, 13 July 13
Two Avenues to the App Layer…
scale ➞
complexity➞
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
39Saturday, 13 July 13
Learning Curves
difficulties in the commercial use of distributed systems
often get represented as issues of managing complexity
much of the risk in managing a data science team is about
budgeting for learning curve: some orgs practice a kind of
engineering “conservatism”, with highly structured process
and strictly codified practices – people learn a few things
well, then avoid having to struggle with learning many new
things perpetually…
that approach leads to enormous teams and low ROI scale➞
complexity➞
ultimately, the challenge is about
managing learning curves within
a social context
40Saturday, 13 July 13
Learning Curves vs. Technology Selections
ultimately, the challenge is about managing
learning curves within a social context
est. cost of individual learning, initial impl
est.costofteamre-learning,lifecycle
some technologies constrain the
need to learn, others accelerate
re-learning prior business logic…
choose the latter, FTW!
41Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
?
42Saturday, 13 July 13
Big Data?
we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Nike, Jawbone, etc.,
plus the secondary/tertiary effects of Google Glass
7+ billion people, instrumented better than … how we
have Nagios instrumenting our web servers right now
technologyreview.com/...
43Saturday, 13 July 13
Internet of Things
44Saturday, 13 July 13
Business Disruption
Geoffrey Moore
Mohr DavidowVentures, author CrossingThe Chasm / Hadoop Summit, 2012:
what Amazon did to the retail sector… has put the entire Global 1000
on notice over the next decade… data as the major force… mostly
through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL,Vertica,VoltDB, Paradigm4, etc. / XLDB, 2012:
complex analytics workloads are now displacing SQL as the basis
for Enterprise apps
Larry Page
CEO, Google / Wired, 2013:
create products and services that are 10 times better than the
competition… thousand-percent improvement requires rethinking
problems entirely, exploring the edges of what’s technically possible,
and having a lot more fun in the process
45Saturday, 13 July 13
A Thought Exercise
consider that when a company like Caterpillar moves
into data science, they won’t be building the world’s
next search engine or social network
they will most likely be optimizing supply chain,
optimizing fuel costs, automating data feedback
loops integrated into their equipment…
that’s a $50B company,
in a market segment worth $250B
upcoming: tractors as drones –
guided by complex, distributed data apps
Operations Research –
crunching amazing amounts of data
46Saturday, 13 July 13
Alternatively…
climate.com
47Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
48Saturday, 13 July 13
Languages
JVM-based languages became popular for Big Data open source
technologies:
• partly becauseYHOO adopted Hadoop, etc.
• partly because Enterprise IT shops have J2EE expertise
• partly because of functional languages: Clojure, Scala
JVM has its drawbacks, especially for low-latency use cases
ample use of languages such as Python and Erlang in Big Data
practices, plus keep in mind that Google uses lots of C++
FunctionalThinking
Neal Ford
youtu.be/plSZIkLodDM
49Saturday, 13 July 13
Architecture
Rich Hickey, Nathan Marz, Stuart Sierra, et al.:
functional programming to help reduce
costs over time
technical debt? this is how an organization
builds a culture to avoid it
Conway's Law corollary: model teams and
communication based on properties of the
desired architecture
“Out of theTar Pit”
Moseley & Marks, 2006
goo.gl/SKspn
“A relational model of data for
large shared data banks”
Edgar Codd, 1970
dl.acm.org/citation.cfm?id=362685
Rich Hickey, infoq.com/presentations/Simple-Made-Easy
50Saturday, 13 July 13
Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
51Saturday, 13 July 13
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
1 map
1 reduce
18 lines code gist.github.com/3900702
WordCount – conceptual flow diagram
cascading.org/category/impatient
52Saturday, 13 July 13
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
53Saturday, 13 July 13
mapreduce
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
GroupBy('wc')[by:['token']]
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
[head]
[tail]
[{2}:'token', 'count']
[{1}:'token']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
wc[{1}:'token']
[{1}:'token']
[{2}:'token', 'count']
[{2}:'token', 'count']
[{1}:'token']
[{1}:'token']
WordCount – generated flow diagram
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
54Saturday, 13 July 13
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[[](),.)s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
55Saturday, 13 July 13
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
56Saturday, 13 July 13
import com.twitter.scalding._
 
class WordCount(args : Args) extends Job(args) {
Tsv(args("doc"),
('doc_id, 'text),
skipHeader = true)
.read
.flatMap('text -> 'token) {
text : String => text.split("[ [](),.]")
}
.groupBy('token) { _.size('count) }
.write(Tsv(args("wc"), writeHeader = true))
}
WordCount – Scalding / Scala
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
57Saturday, 13 July 13
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog
WordCount – Scalding / Scala
Document
Collection
Word
Count
Tokenize
GroupBy
token Count
R
M
58Saturday, 13 July 13
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
59Saturday, 13 July 13
Case Studies: LinkedIn and eBay
“Scalable and Flexible Machine LearningWith Scala @ LinkedIn”
Vitaly Gordon, LinkedIn
Chris Severs, eBay
slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
…be sure to read slides 8-16 !!
60Saturday, 13 July 13
Lambda Architecture
Big Data
Nathan Marz, James Warren
Manning, 2013
manning.com/marz
• batch layer (immutable data, idempotent ops)
• serving layer (to query batch)
• speed layer (transient, cached “real-time”)
• combining results
61Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
62Saturday, 13 July 13
Where To Start?
having a solid background in statistics becomes vital,
because it provides formalisms for what we’re trying
to accomplish at scale
along with that, some areas of math help – regardless
of the “calculus threshold” invoked at many universities…
linear algebra e.g., crunching algorithms efficiently for large-scale apps
graph theory e.g., representation of problems in a calculable language
abstract algebra e.g., probabilistic data structures in streaming analytics
topology e.g., determining the underlying structure of the data
operations research e.g., techniques for optimization … in other words, ROI
63Saturday, 13 July 13
in a nutshell, most of what we do is to…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate dimension and complexity
‣ make use of learning theory
+ collaborate with DevOps, Stakeholders
+ reduce our work into cron entries
UniqueRegistration
Launchedgameslobby
NUI:TutorialMode
BirthdayMessage
ChatPublicRoomvoice
Launchedheyzapgame
ityTest:testsuitestarted
CreateNewPet
tarted:client,community
NUI:MovieMode
BuyanItem:web
PutonClothing
spaceremaining:512M
chaseCartPageStep2
FeedPet
PlayPet
ChatNow
EditPanel
PanelFlipProductOver
AddFriend
Open3DWindow
ChangeSeat
TypeaBubble
VisitOwnHomepage
TakeaSnapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
ssspaceremaining:1G
LeaveaMessage
NUI:ChatMode
NUI:FriendsMode
dv
WebsiteLogin
AddBuddy
NUI:PublicRoomMode
NUI:MyRoomMode
PanelRemoveProduct
oryPanelApplyProduct
NUI:DressUpMode
UniqueRegistration
Launchedgameslobby
NUI:TutorialMode
BirthdayMessage
ChatPublicRoomvoice
Launchedheyzapgame
ConnectivityTest:testsuitestarted
CreateNewPet
MovieViewStarted:client,community
NUI:MovieMode
BuyanItem:web
PutonClothing
Addressspaceremaining:512M
CustomerMadePurchaseCartPageStep2
FeedPet
PlayPet
ChatNow
EditPanel
ClientInventoryPanelFlipProductOver
AddFriend
Open3DWindow
ChangeSeat
TypeaBubble
VisitOwnHomepage
TakeaSnapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Addressspaceremaining:1G
LeaveaMessage
NUI:ChatMode
NUI:FriendsMode
dv
WebsiteLogin
AddBuddy
NUI:PublicRoomMode
NUI:MyRoomMode
ClientInventoryPanelRemoveProduct
ClientInventoryPanelApplyProduct
NUI:DressUpMode
Where is the Science in Data Science?
64Saturday, 13 July 13
techniques for manipulating order complexity:
dimensional reduction…
with clustering as a common case
e.g., you may have 100 million HTML docs,
but only ~10K useful keywords within them
low-dimensional structure, PCA, etc.
linear algebra techniques: eigenvalues,
matrix factorization, etc.
this is an area ripe for much advancement
in algorithms research, near-term
Dimension and Complexity
65Saturday, 13 July 13
in general, apps alternate between learning patterns/rules
and retrieving similar things…
statistical learning theory – rigorous, prevents you
from making billion dollar mistakes, probably our future
machine learning – scalable, enables you to make
billion dollar mistakes, much commercial emphasis
supervised vs. unsupervised
arguably, optimization is a parent category
once Big Data projects get beyond merely digesting
log files, optimization will likely become the next
overused buzzword :)
Learning Theory
66Saturday, 13 July 13
Algorithms
many algorithm libraries used today are based on implementations
back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead in sophisticated
algorithms work – as Breiman suggested in 2001 – which may take
a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
67Saturday, 13 July 13
Make It Sparse…
also, take a moment to check this out…
(IMHO most interesting algorithm work recently)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale,
e.g., PCA, SVD, etc.
• numerically stable with efficient implementation
on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer
interactions where there are 100MM customers,
with a limited set of outcomes…
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Tristan Jehan
68Saturday, 13 July 13
Sparse Matrix Collection
for when you really need a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
69Saturday, 13 July 13
A Winning Approach…
consider that if you know priors about a system, then
you may be able to leverage low dimensional structure
within high dimensional data… that works much, much
better than sampling!
1. real-world data
2. graph theory for representation
3. sparse matrix factorization for production work
4. cost-effective parallel processing
for machine learning app at scale
70Saturday, 13 July 13
Suggested Reading
A Few UsefulThings to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Probabilistic Data Structures forWeb Analytics and Data Mining
Ilya Katsov, Grid Dynamics
highlyscalable.wordpress.com/2012/05/01/probabilistic-
structures-web-analytics-data-mining/
71Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
72Saturday, 13 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
73Saturday, 13 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL
74Saturday, 13 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
75Saturday, 13 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models
76Saturday, 13 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…
77Saturday, 13 July 13
Anatomy of an Enterprise app
Definition a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…
78Saturday, 13 July 13
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org
79Saturday, 13 July 13
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );
 
SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );
 
flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
80Saturday, 13 July 13
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
 
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();
 
flowDef.addAssemblyPlanner( pmmlPlanner );
81Saturday, 13 July 13
cascading.org
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
visual collaboration for the business logic is a great
way to improve how teams work together
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
82Saturday, 13 July 13
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for flow
planners, compiler, optimization, troubleshooting,
exception handling, notifications, security audit,
performance monitoring, etc.
cascading.org
83Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
84Saturday, 13 July 13
Clusters
a little secret: people like me make a good living by
leveraging high ROI apps based on clusters, and so
the execs agree to build out more data centers…
clusters for Hadoop/Hive/HBase, clusters for Memcached,
for Cassandra, for MySQL, for Storm, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage; but terrible for utilization
leveragingVMs and various notions of “cloud” helps
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS” All your workloads are belong to us
regardless of how architectures change, death and taxes
will endure: servers fail, and data must move
Google Data Center, Fox News
~2002
85Saturday, 13 July 13
Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q:
what kinds of evolution in topologies could
this imply?
86Saturday, 13 July 13
Topologies
Hadoop and other topologies arose from a need for fault-
tolerant workloads, leveraging horizontal scale-out based
on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged,
which can be categorized in terms of topologies and
the CAP Theorem
87Saturday, 13 July 13
Some Topologies Beyond Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
Zookeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
Greenplum (MPP)
SciDB (array database)
88Saturday, 13 July 13
issues confronted:
“Orders of magnitude increase, more
complexity and variety, widespread
disruption…”
trends observed:
“Functional programming for Big Data”
“Just enough math, but not calculus”
“Enterprise Data Workflow design pattern”
“Cluster computing, smarter scheduling”
89Saturday, 13 July 13
Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega:“10x” secret sauce
youtu.be/0ZFMlO98Jkc
0%
25%
50%
75%
100%
RAILS CPU
LOAD
MEMCACHED
CPU LOAD
0%
25%
50%
75%
100%
HADOOP CPU
LOAD
0%
25%
50%
75%
100%
t t
0%
25%
50%
75%
100%
Rails
Memcached
Hadoop
COMBINED CPU LOAD (RAILS,
MEMCACHED, HADOOP)
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
goo.gl/jPtTP
90Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere – Four-Part Harmony
91Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere – Four-Part Harmony
1. End Use Cases, the drivers
92Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere – Four-Part Harmony
2. A new kind of team process
93Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere – Four-Part Harmony
3. Abstraction layer as optimizing
middleware, e.g., Cascading
94Saturday, 13 July 13
Workflow
RDBMS
near timebatch
services
transactions,
content
social
interactions
Web Apps,
Mobile, etc.History
Data Products Customers
RDBMS
Log
Events
In-Memory
Data Grid
Hadoop,
etc.
Cluster Scheduler
Prod
Eng
DW
Use Cases Across Topologies
s/w
dev
data
science
discovery
+
modeling
Planner
Ops
dashboard
metrics
business
process
optimized
capacitytaps
Data
Scientist
App Dev
Ops
Domain
Expert
introduced
capability
existing
SDLC
Circa 2013: clusters everywhere – Four-Part Harmony
4. Distributed OS, e.g., Mesos
95Saturday, 13 July 13
Enterprise DataWorkflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/
0636920028536.do
Further study…
workshops and newsletter
updates:
liber118.com/pxn/
96Saturday, 13 July 13

Mais conteúdo relacionado

Destaque

Village Global 2009 : Séance 1
Village Global 2009 : Séance 1Village Global 2009 : Séance 1
Village Global 2009 : Séance 1Emilie Marquois
 
Pinterest for Business 101
Pinterest for Business 101Pinterest for Business 101
Pinterest for Business 101Nick Armstrong
 
Numa, coworking space working with corporations
Numa, coworking space working with corporationsNuma, coworking space working with corporations
Numa, coworking space working with corporationsCoworking Conference
 
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...David Gleich
 
Verge (pdf with some notes)
Verge (pdf with some notes)Verge (pdf with some notes)
Verge (pdf with some notes)Tim O'Reilly
 
Demain, la voiture servicielle_TNS1
Demain, la voiture servicielle_TNS1Demain, la voiture servicielle_TNS1
Demain, la voiture servicielle_TNS1agencecle
 
TOC Bologna 2012: The Opportunity in Digital: Putting Publishers at the Helm...
TOC Bologna 2012: The Opportunity in Digital: Putting  Publishers at the Helm...TOC Bologna 2012: The Opportunity in Digital: Putting  Publishers at the Helm...
TOC Bologna 2012: The Opportunity in Digital: Putting Publishers at the Helm...OReillyTOC
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
Deploying a #CRM solution in Latin America (Or the Rest of the World). #sugarcon
Deploying a #CRM solution in Latin America (Or the Rest of the World). #sugarconDeploying a #CRM solution in Latin America (Or the Rest of the World). #sugarcon
Deploying a #CRM solution in Latin America (Or the Rest of the World). #sugarconJesus Hoyos
 
Getting Started With Web Accessibility
Getting Started With Web AccessibilityGetting Started With Web Accessibility
Getting Started With Web AccessibilitySean Yo
 
OPEN Silcon Valley - Clean-tech is Main-tech: How do you fit in the Green Ec...
OPEN Silcon Valley - Clean-tech is Main-tech:  How do you fit in the Green Ec...OPEN Silcon Valley - Clean-tech is Main-tech:  How do you fit in the Green Ec...
OPEN Silcon Valley - Clean-tech is Main-tech: How do you fit in the Green Ec...Shuja Keen
 
Kotter Eight Step Plan
Kotter Eight Step PlanKotter Eight Step Plan
Kotter Eight Step PlanCPA Australia
 
SlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons LearntSlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons LearntKapil Mohan
 
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...Wesley Schwalje
 

Destaque (16)

Village Global 2009 : Séance 1
Village Global 2009 : Séance 1Village Global 2009 : Séance 1
Village Global 2009 : Séance 1
 
Pinterest for Business 101
Pinterest for Business 101Pinterest for Business 101
Pinterest for Business 101
 
Numa, coworking space working with corporations
Numa, coworking space working with corporationsNuma, coworking space working with corporations
Numa, coworking space working with corporations
 
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...Fast matrix computations for pair-wise and column-wise Katz scores and commut...
Fast matrix computations for pair-wise and column-wise Katz scores and commut...
 
Verge (pdf with some notes)
Verge (pdf with some notes)Verge (pdf with some notes)
Verge (pdf with some notes)
 
Demain, la voiture servicielle_TNS1
Demain, la voiture servicielle_TNS1Demain, la voiture servicielle_TNS1
Demain, la voiture servicielle_TNS1
 
TOC Bologna 2012: The Opportunity in Digital: Putting Publishers at the Helm...
TOC Bologna 2012: The Opportunity in Digital: Putting  Publishers at the Helm...TOC Bologna 2012: The Opportunity in Digital: Putting  Publishers at the Helm...
TOC Bologna 2012: The Opportunity in Digital: Putting Publishers at the Helm...
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
 
Deploying a #CRM solution in Latin America (Or the Rest of the World). #sugarcon
Deploying a #CRM solution in Latin America (Or the Rest of the World). #sugarconDeploying a #CRM solution in Latin America (Or the Rest of the World). #sugarcon
Deploying a #CRM solution in Latin America (Or the Rest of the World). #sugarcon
 
Getting Started With Web Accessibility
Getting Started With Web AccessibilityGetting Started With Web Accessibility
Getting Started With Web Accessibility
 
Stanford Ee380
Stanford Ee380Stanford Ee380
Stanford Ee380
 
Warburg2011
Warburg2011Warburg2011
Warburg2011
 
OPEN Silcon Valley - Clean-tech is Main-tech: How do you fit in the Green Ec...
OPEN Silcon Valley - Clean-tech is Main-tech:  How do you fit in the Green Ec...OPEN Silcon Valley - Clean-tech is Main-tech:  How do you fit in the Green Ec...
OPEN Silcon Valley - Clean-tech is Main-tech: How do you fit in the Green Ec...
 
Kotter Eight Step Plan
Kotter Eight Step PlanKotter Eight Step Plan
Kotter Eight Step Plan
 
SlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons LearntSlideShare's Lean Startup Journey: Lessons Learnt
SlideShare's Lean Startup Journey: Lessons Learnt
 
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...
Tahseen Consulting’s Work on Knowledge-based Economies in the Arab Word is Ci...
 

Semelhante a Seattle Data Geeks: Hadoop and Beyond

Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and BeyondPaco Nathan
 
Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...xu liwei
 
Making the Most of In-Memory: More than Speed
Making the Most of In-Memory: More than SpeedMaking the Most of In-Memory: More than Speed
Making the Most of In-Memory: More than SpeedInside Analysis
 
Fascinating Tales of a Strange Tomorrow
Fascinating Tales of a Strange TomorrowFascinating Tales of a Strange Tomorrow
Fascinating Tales of a Strange TomorrowJulien SIMON
 
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosAugury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosPaco Nathan
 
Invent Episode 3: Tech Talk on Parallel Future
Invent Episode 3: Tech Talk on Parallel FutureInvent Episode 3: Tech Talk on Parallel Future
Invent Episode 3: Tech Talk on Parallel FutureM. Atif Qureshi
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2rusersla
 
The evolution of the collections management system
The evolution of the collections management systemThe evolution of the collections management system
The evolution of the collections management systemirowson
 
AI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science ConceptsAI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science ConceptsDan O'Leary
 
History of Computer Systems - Why we are doing it that way
History of Computer Systems - Why we are doing it that wayHistory of Computer Systems - Why we are doing it that way
History of Computer Systems - Why we are doing it that wayLeo Lorieri
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2OSri Ambati
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Raspberry pi on seminar documentation
Raspberry pi on seminar documentationRaspberry pi on seminar documentation
Raspberry pi on seminar documentationGeorgekutty Francis
 
Cloud Computingfor Lti Final
Cloud Computingfor Lti FinalCloud Computingfor Lti Final
Cloud Computingfor Lti FinalLynn McCormick
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 

Semelhante a Seattle Data Geeks: Hadoop and Beyond (20)

Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and Beyond
 
Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...Google jeff dean lessons learned while building infrastructure software at go...
Google jeff dean lessons learned while building infrastructure software at go...
 
Making the Most of In-Memory: More than Speed
Making the Most of In-Memory: More than SpeedMaking the Most of In-Memory: More than Speed
Making the Most of In-Memory: More than Speed
 
20090309berkeley
20090309berkeley20090309berkeley
20090309berkeley
 
Fascinating Tales of a Strange Tomorrow
Fascinating Tales of a Strange TomorrowFascinating Tales of a Strange Tomorrow
Fascinating Tales of a Strange Tomorrow
 
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache MesosAugury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
Augury and Omens Aside, Part 1:
 The Business Case for Apache Mesos
 
Invent Episode 3: Tech Talk on Parallel Future
Invent Episode 3: Tech Talk on Parallel FutureInvent Episode 3: Tech Talk on Parallel Future
Invent Episode 3: Tech Talk on Parallel Future
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2
 
The evolution of the collections management system
The evolution of the collections management systemThe evolution of the collections management system
The evolution of the collections management system
 
AI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science ConceptsAI, Machine Learning, and Data Science Concepts
AI, Machine Learning, and Data Science Concepts
 
History of Computer Systems - Why we are doing it that way
History of Computer Systems - Why we are doing it that wayHistory of Computer Systems - Why we are doing it that way
History of Computer Systems - Why we are doing it that way
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
 
Visual analytics
Visual analyticsVisual analytics
Visual analytics
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Rulespace
RulespaceRulespace
Rulespace
 
Raspberry pi on seminar documentation
Raspberry pi on seminar documentationRaspberry pi on seminar documentation
Raspberry pi on seminar documentation
 
Cloud Computingfor Lti Final
Cloud Computingfor Lti FinalCloud Computingfor Lti Final
Cloud Computingfor Lti Final
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Torocpsummit 130116180230-phpapp02
Torocpsummit 130116180230-phpapp02Torocpsummit 130116180230-phpapp02
Torocpsummit 130116180230-phpapp02
 

Mais de Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningPaco Nathan
 

Mais de Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
Microservices, Containers, and Machine Learning
Microservices, Containers, and Machine LearningMicroservices, Containers, and Machine Learning
Microservices, Containers, and Machine Learning
 

Último

Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sectoritnewsafrica
 

Último (20)

Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
 

Seattle Data Geeks: Hadoop and Beyond

  • 1. Paco Nathan liber118.com/pxn/ @pacoid “Hadoop and Beyond” Licensed under a Creative Commons Attribution- NonCommercial-NoDerivs 3.0 Unported License. 1Saturday, 13 July 13
  • 2. ? issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” 2Saturday, 13 July 13
  • 3. First Principles we are taught to think of computing resources in terms of Von Neumann architecture in other words, we characterize the computing resources by CPU, RAM, I/O 3Saturday, 13 July 13
  • 4. First Principles we are taught to think of computing resources in terms of Von Neumann architecture in other words, we characterize the computing resources by CPU, RAM, I/O CPU 4Saturday, 13 July 13
  • 5. First Principles we are taught to think of computing resources in terms of Von Neumann architecture in other words, we characterize the computing resources by CPU, RAM, I/O RAM 5Saturday, 13 July 13
  • 6. First Principles we are taught to think of computing resources in terms of Von Neumann architecture in other words, we characterize the computing resources by CPU, RAM, I/O I/O 6Saturday, 13 July 13
  • 7. First Principles back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space 7Saturday, 13 July 13
  • 8. First Principles back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space • okay, maybe the CPU was multi-core… • okay, maybe RAM paged out to virtual memory… • okay, maybe the disks were in a RAID config… 8Saturday, 13 July 13
  • 9. First Principles back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space • okay, maybe the CPU was multi-core… • okay, maybe RAM paged out to virtual memory… • okay, maybe the disks were in a RAID config… or there were extra caches, or separate busses, etc. but essentially those were incremental extensions to aVon Neumann architecture… 9Saturday, 13 July 13
  • 10. First Principles back in the day, all the tables required for a given database could fit onto one computer, with one memory space, and one file space • okay, maybe the CPU was multi-core… • okay, maybe RAM paged out to virtual memory… • okay, maybe the disks were in a RAID config… or there were extra caches, or separate busses, etc. but essentially those were incremental extensions to aVon Neumann architecture… a machine created in his image, if you will NB: credit should go to Eckert and Mauchly, inventors of the ENIAC 10Saturday, 13 July 13
  • 11. First Principles a generation of computer scientists has been taught to think “relational” – data on a DB server RDBMS made sense, with their indexes, b-trees, normal forms, etc. Q: need to query bigger data? A: simple, buy or lease a bigger DB server 11Saturday, 13 July 13
  • 12. issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” trends observed: “Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…” “The management problem is about multi-disciplinary teams and learning curves…” 12Saturday, 13 July 13
  • 13. Q3 1997: inflection point Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG MapReduce and the Apache Hadoop open source stack emerged from this 13Saturday, 13 July 13
  • 14. RDBMS Stakeholder SQL Query result sets Excel pivot tables PowerPoint slide decks Web App Customers transactions Product strategy Engineering requirements BI Analysts optimized code Circa 1996: pre- inflection point 14Saturday, 13 July 13
  • 15. RDBMS Stakeholder SQL Query result sets Excel pivot tables PowerPoint slide decks Web App Customers transactions Product strategy Engineering requirements BI Analysts optimized code Circa 1996: pre- inflection point “throw it over the wall” 15Saturday, 13 July 13
  • 16. RDBMS SQL Query result sets recommenders + classifiers Web Apps customer transactions Algorithmic Modeling Logs event history aggregation dashboards Product Engineering UX Stakeholder Customers DW ETL Middleware servletsmodels Circa 2001: post- big ecommerce successes 16Saturday, 13 July 13
  • 17. RDBMS SQL Query result sets recommenders + classifiers Web Apps customer transactions Algorithmic Modeling Logs event history aggregation dashboards Product Engineering UX Stakeholder Customers DW ETL Middleware servletsmodels Circa 2001: post- big ecommerce successes “data products” 17Saturday, 13 July 13
  • 18. Amazon “Early Amazon: Splitting the website” – Greg Linden glinden.blogspot.com/2006/02/early-amazon-splitting-website.html eBay “The eBay Architecture” – Randy Shoup, Dan Pritchett addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf Inktomi (YHOO Search) “Inktomi’s Wild Ride” – Erik Brewer (0:05:31 ff) youtu.be/E91oEn1bnXM Google “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) youtu.be/qsan-GQaeyk perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx MIT Media Lab “Social Information Filtering for Music Recommendation” – Pattie Maes pubs.media.mit.edu/pubs/papers/32paper.ps ted.com/speakers/pattie_maes.html Primary Sources 18Saturday, 13 July 13
  • 19. Three broad categories of data Curt Monash, 2010 dbms2.com/2010/01/17/three-broad-categories-of-data • Human/Tabular data – human-generated data which fits well into tables/arrays • Human/Nontabular data – all other data generated by humans • Machine-Generated data Now let’s add IoT: • A/D conversion for sensors Machine Data 19Saturday, 13 July 13
  • 20. Data Jujitsu DJ Patil O’Reilly, 2012 amazon.com/dp/B008HMN5BE Building Data ScienceTeams DJ Patil O’Reilly, 2011 amazon.com/dp/B005O4U3ZE Data Products 20Saturday, 13 July 13
  • 21. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere 21Saturday, 13 July 13
  • 22. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere “optimize topologies” 22Saturday, 13 July 13
  • 23. ? issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” 23Saturday, 13 July 13
  • 24. Modeling back in the day, we worked with practices based on data modeling 1. sample the data 2. fit the sample to a known distribution 3. ignore the rest of the data 4. infer, based on that fitted distribution that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data circa late 1990s: machine data, aggregation, clusters, etc. algorithmic modeling displaced data modeling because the data won’t fit on one computer anymore 24Saturday, 13 July 13
  • 25. Two Cultures “A new research community using these tools sprang up.Their goal was predictive accuracy.The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.” Statistical Modeling: TheTwo Cultures Leo Breiman, 2001 bit.ly/eUTh9L this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization) 25Saturday, 13 July 13
  • 26. issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” trends observed: “Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…” “The management problem is about multi-disciplinary teams and learning curves…” 26Saturday, 13 July 13
  • 27. Algorithmic Modeling “The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression major learnings from the Netflix Prize: the power of ensembles, model chaining, etc. the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team… an overall history of data science: forbes.com/sites/gilpress/2013/05/28/a-very- short-history-of-data-science/ 27Saturday, 13 July 13
  • 28. Why Do Ensembles Matter? The World… per Data Modeling The World… 28Saturday, 13 July 13
  • 29. Ensemblers of Fortune Breiman:“a multiplicity of data models” BellKor team: 100+ individual models in 2007 Progress Prize while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions) accuracy may increase substantially Ensemble Learning: Better PredictionsThrough Diversity Todd Holloway ETech (2008) abeautifulwww.com/EnsembleLearningETech.pdf The Story of the Netflix Prize:An EnsemblersTale Lester Mackey National Academies Seminar,Washington, DC (2011) stanford.edu/~lmackey/papers/ 29Saturday, 13 July 13
  • 30. ? issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” 30Saturday, 13 July 13
  • 31. issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” trends observed: “Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…” “The management problem is about multi-disciplinary teams and learning curves…” 31Saturday, 13 July 13
  • 32. Q: Can I simply hire one rockstar data scientist to cover all this kind of work? 32Saturday, 13 July 13
  • 33. A: No, multi-disciplinary work requires teams. A: Hire leads who speak the lingo of each domain. A: Hire people who cover 2+ roles, when possible. 33Saturday, 13 July 13
  • 34. approximately 80% of the costs for data-related projects gets spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem unfortunately, data-related budgets tend to go into frameworks which can only be used after clean up most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to understand the audience and their priorities ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable d3js.org What is needed most? 34Saturday, 13 July 13
  • 35. employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables this approach attempts to understand not just problems and solutions, but also the processes involved and their variances particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must Process Variation Data Tools Statistical Thinking 35Saturday, 13 July 13
  • 36. discovery discovery modeling modeling integration integration appsapps systems systems business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access data science Data Scientist App Dev Ops Domain Expert introduced Team Composition: Needs × Roles 36Saturday, 13 July 13
  • 37. issues confronted: “Data becomes too complex for ONE computer, ONE model, ONE expert…” trends observed: “Historical arc: 1996 - 2013, rise of machine data, scale-out, and algorithmic modeling…” “The management problem is about multi-disciplinary teams and learning curves…” 37Saturday, 13 July 13
  • 38. Culture Notes from the Mystery Machine Bus SteveYegge, Google goo.gl/SeRZa consider these perspectives in light of Conway’s Law… “conservatism” “liberalism” (mostly) Enterprise (mostly) Start-Up risk management customer experiments assurance flexibility well-defined schema schema follows code explicit configuration convention type-checking compiler interpreted scripts wants no surprises wants no impediments Java, Scala, Clojure, etc. PHP, Ruby, Python, etc. Cascading, Scalding, Cascalog, etc. Hive, Pig, Hadoop Streaming, etc. 38Saturday, 13 July 13
  • 39. Two Avenues to the App Layer… scale ➞ complexity➞ Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding 39Saturday, 13 July 13
  • 40. Learning Curves difficulties in the commercial use of distributed systems often get represented as issues of managing complexity much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually… that approach leads to enormous teams and low ROI scale➞ complexity➞ ultimately, the challenge is about managing learning curves within a social context 40Saturday, 13 July 13
  • 41. Learning Curves vs. Technology Selections ultimately, the challenge is about managing learning curves within a social context est. cost of individual learning, initial impl est.costofteamre-learning,lifecycle some technologies constrain the need to learn, others accelerate re-learning prior business logic… choose the latter, FTW! 41Saturday, 13 July 13
  • 42. issues confronted: “Orders of magnitude increase, more complexity and variety, widespread disruption…” ? 42Saturday, 13 July 13
  • 43. Big Data? we’re just getting started: • ~12 exabytes/day, jet turbines on commercial flights • Google self-driving cars, ~1 Gb/s per vehicle • National Instruments initiative: Big Analog Data™ • 1m resolution satellites skyboximaging.com • open resource monitoring reddmetrics.com • Sensing XChallenge nokiasensingxchallenge.org consider the implications of Nike, Jawbone, etc., plus the secondary/tertiary effects of Google Glass 7+ billion people, instrumented better than … how we have Nagios instrumenting our web servers right now technologyreview.com/... 43Saturday, 13 July 13
  • 45. Business Disruption Geoffrey Moore Mohr DavidowVentures, author CrossingThe Chasm / Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise Michael Stonebraker INGRES, PostgreSQL,Vertica,VoltDB, Paradigm4, etc. / XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps Larry Page CEO, Google / Wired, 2013: create products and services that are 10 times better than the competition… thousand-percent improvement requires rethinking problems entirely, exploring the edges of what’s technically possible, and having a lot more fun in the process 45Saturday, 13 July 13
  • 46. A Thought Exercise consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network they will most likely be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment… that’s a $50B company, in a market segment worth $250B upcoming: tractors as drones – guided by complex, distributed data apps Operations Research – crunching amazing amounts of data 46Saturday, 13 July 13
  • 48. issues confronted: “Orders of magnitude increase, more complexity and variety, widespread disruption…” trends observed: “Functional programming for Big Data” “Just enough math, but not calculus” “Enterprise Data Workflow design pattern” “Cluster computing, smarter scheduling” 48Saturday, 13 July 13
  • 49. Languages JVM-based languages became popular for Big Data open source technologies: • partly becauseYHOO adopted Hadoop, etc. • partly because Enterprise IT shops have J2EE expertise • partly because of functional languages: Clojure, Scala JVM has its drawbacks, especially for low-latency use cases ample use of languages such as Python and Erlang in Big Data practices, plus keep in mind that Google uses lots of C++ FunctionalThinking Neal Ford youtu.be/plSZIkLodDM 49Saturday, 13 July 13
  • 50. Architecture Rich Hickey, Nathan Marz, Stuart Sierra, et al.: functional programming to help reduce costs over time technical debt? this is how an organization builds a culture to avoid it Conway's Law corollary: model teams and communication based on properties of the desired architecture “Out of theTar Pit” Moseley & Marks, 2006 goo.gl/SKspn “A relational model of data for large shared data banks” Edgar Codd, 1970 dl.acm.org/citation.cfm?id=362685 Rich Hickey, infoq.com/presentations/Simple-Made-Easy 50Saturday, 13 July 13
  • 51. Pattern Language structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 51Saturday, 13 July 13
  • 52. Document Collection Word Count Tokenize GroupBy token Count R M 1 map 1 reduce 18 lines code gist.github.com/3900702 WordCount – conceptual flow diagram cascading.org/category/impatient 52Saturday, 13 July 13
  • 53. WordCount – Cascading app in Java String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Document Collection Word Count Tokenize GroupBy token Count R M 53Saturday, 13 July 13
  • 54. mapreduce Every('wc')[Count[decl:'count']] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] GroupBy('wc')[by:['token']] Each('token')[RegexSplitGenerator[decl:'token'][args:1]] Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [head] [tail] [{2}:'token', 'count'] [{1}:'token'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] wc[{1}:'token'] [{1}:'token'] [{2}:'token', 'count'] [{2}:'token', 'count'] [{1}:'token'] [{1}:'token'] WordCount – generated flow diagram Document Collection Word Count Tokenize GroupBy token Count R M 54Saturday, 13 July 13
  • 55. (ns impatient.core   (:use [cascalog.api]         [cascalog.more-taps :only (hfs-delimited)])   (:require [clojure.string :as s]             [cascalog.ops :as c])   (:gen-class)) (defmapcatop split [line]   "reads in a line of string and splits it by regex"   (s/split line #"[[](),.)s]+")) (defn -main [in out & args]   (?<- (hfs-delimited out)        [?word ?count]        ((hfs-delimited in :skip-header? true) _ ?line)        (split ?line :> ?word)        (c/count ?count))) ; Paul Lam ; github.com/Quantisan/Impatient WordCount – Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 55Saturday, 13 July 13
  • 56. github.com/nathanmarz/cascalog/wiki • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn WordCount – Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 56Saturday, 13 July 13
  • 57. import com.twitter.scalding._   class WordCount(args : Args) extends Job(args) { Tsv(args("doc"), ('doc_id, 'text), skipHeader = true) .read .flatMap('text -> 'token) { text : String => text.split("[ [](),.]") } .groupBy('token) { _.size('count) } .write(Tsv(args("wc"), writeHeader = true)) } WordCount – Scalding / Scala Document Collection Word Count Tokenize GroupBy token Count R M 57Saturday, 13 July 13
  • 58. github.com/twitter/scalding/wiki • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog WordCount – Scalding / Scala Document Collection Word Count Tokenize GroupBy token Count R M 58Saturday, 13 July 13
  • 59. Functional Programming for Big Data WordCount with token scrubbing… Apache Hive: 52 lines HQL + 8 lines Python (UDF) compared to Scalding: 18 lines Scala/Cascading functional programming languages help reduce software engineering costs at scale, over time 59Saturday, 13 July 13
  • 60. Case Studies: LinkedIn and eBay “Scalable and Flexible Machine LearningWith Scala @ LinkedIn” Vitaly Gordon, LinkedIn Chris Severs, eBay slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin …be sure to read slides 8-16 !! 60Saturday, 13 July 13
  • 61. Lambda Architecture Big Data Nathan Marz, James Warren Manning, 2013 manning.com/marz • batch layer (immutable data, idempotent ops) • serving layer (to query batch) • speed layer (transient, cached “real-time”) • combining results 61Saturday, 13 July 13
  • 62. issues confronted: “Orders of magnitude increase, more complexity and variety, widespread disruption…” trends observed: “Functional programming for Big Data” “Just enough math, but not calculus” “Enterprise Data Workflow design pattern” “Cluster computing, smarter scheduling” 62Saturday, 13 July 13
  • 63. Where To Start? having a solid background in statistics becomes vital, because it provides formalisms for what we’re trying to accomplish at scale along with that, some areas of math help – regardless of the “calculus threshold” invoked at many universities… linear algebra e.g., crunching algorithms efficiently for large-scale apps graph theory e.g., representation of problems in a calculable language abstract algebra e.g., probabilistic data structures in streaming analytics topology e.g., determining the underlying structure of the data operations research e.g., techniques for optimization … in other words, ROI 63Saturday, 13 July 13
  • 64. in a nutshell, most of what we do is to… ‣ estimate probability ‣ calculate analytic variance ‣ manipulate dimension and complexity ‣ make use of learning theory + collaborate with DevOps, Stakeholders + reduce our work into cron entries UniqueRegistration Launchedgameslobby NUI:TutorialMode BirthdayMessage ChatPublicRoomvoice Launchedheyzapgame ityTest:testsuitestarted CreateNewPet tarted:client,community NUI:MovieMode BuyanItem:web PutonClothing spaceremaining:512M chaseCartPageStep2 FeedPet PlayPet ChatNow EditPanel PanelFlipProductOver AddFriend Open3DWindow ChangeSeat TypeaBubble VisitOwnHomepage TakeaSnapshot NUI:BuyCreditsMode NUI:MyProfileClicked ssspaceremaining:1G LeaveaMessage NUI:ChatMode NUI:FriendsMode dv WebsiteLogin AddBuddy NUI:PublicRoomMode NUI:MyRoomMode PanelRemoveProduct oryPanelApplyProduct NUI:DressUpMode UniqueRegistration Launchedgameslobby NUI:TutorialMode BirthdayMessage ChatPublicRoomvoice Launchedheyzapgame ConnectivityTest:testsuitestarted CreateNewPet MovieViewStarted:client,community NUI:MovieMode BuyanItem:web PutonClothing Addressspaceremaining:512M CustomerMadePurchaseCartPageStep2 FeedPet PlayPet ChatNow EditPanel ClientInventoryPanelFlipProductOver AddFriend Open3DWindow ChangeSeat TypeaBubble VisitOwnHomepage TakeaSnapshot NUI:BuyCreditsMode NUI:MyProfileClicked Addressspaceremaining:1G LeaveaMessage NUI:ChatMode NUI:FriendsMode dv WebsiteLogin AddBuddy NUI:PublicRoomMode NUI:MyRoomMode ClientInventoryPanelRemoveProduct ClientInventoryPanelApplyProduct NUI:DressUpMode Where is the Science in Data Science? 64Saturday, 13 July 13
  • 65. techniques for manipulating order complexity: dimensional reduction… with clustering as a common case e.g., you may have 100 million HTML docs, but only ~10K useful keywords within them low-dimensional structure, PCA, etc. linear algebra techniques: eigenvalues, matrix factorization, etc. this is an area ripe for much advancement in algorithms research, near-term Dimension and Complexity 65Saturday, 13 July 13
  • 66. in general, apps alternate between learning patterns/rules and retrieving similar things… statistical learning theory – rigorous, prevents you from making billion dollar mistakes, probably our future machine learning – scalable, enables you to make billion dollar mistakes, much commercial emphasis supervised vs. unsupervised arguably, optimization is a parent category once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :) Learning Theory 66Saturday, 13 July 13
  • 67. Algorithms many algorithm libraries used today are based on implementations back when people used DO loops in FORTRAN, 30+ years ago MapReduce is Good Enough? Jimmy Lin, U Maryland umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf astrophysics and genomics are light years ahead in sophisticated algorithms work – as Breiman suggested in 2001 – which may take a few years to percolate into industry other game-changers: • streaming algorithms, sketches, probabilistic data structures • significant “Big O” complexity reduction (e.g., skytree.net) • better architectures and topologies (e.g., GPUs and CUDA) • partial aggregates – parallelizing workflows 67Saturday, 13 July 13
  • 68. Make It Sparse… also, take a moment to check this out… (IMHO most interesting algorithm work recently) QR factorization of a “tall-and-skinny” matrix • used to solve many data problems at scale, e.g., PCA, SVD, etc. • numerically stable with efficient implementation on large-scale Hadoop clusters suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes… cs.purdue.edu/homes/dgleich stanford.edu/~arbenson github.com/ccsevers/scalding-linalg David Gleich, slideshare.net/dgleich Tristan Jehan 68Saturday, 13 July 13
  • 69. Sparse Matrix Collection for when you really need a wide variety of sparse matrix examples… University of Florida Sparse Matrix Collection cise.ufl.edu/research/sparse/matrices/ Tim Davis, U Florida cise.ufl.edu/~davis/welcome.html Yifan Hu, AT&T Research www2.research.att.com/~yifanhu/ 69Saturday, 13 July 13
  • 70. A Winning Approach… consider that if you know priors about a system, then you may be able to leverage low dimensional structure within high dimensional data… that works much, much better than sampling! 1. real-world data 2. graph theory for representation 3. sparse matrix factorization for production work 4. cost-effective parallel processing for machine learning app at scale 70Saturday, 13 July 13
  • 71. Suggested Reading A Few UsefulThings to Know about Machine Learning Pedro Domingos, U Washington homes.cs.washington.edu/~pedrod/papers/cacm12.pdf Probabilistic Data Structures forWeb Analytics and Data Mining Ilya Katsov, Grid Dynamics highlyscalable.wordpress.com/2012/05/01/probabilistic- structures-web-analytics-data-mining/ 71Saturday, 13 July 13
  • 72. issues confronted: “Orders of magnitude increase, more complexity and variety, widespread disruption…” trends observed: “Functional programming for Big Data” “Just enough math, but not calculus” “Enterprise Data Workflow design pattern” “Cluster computing, smarter scheduling” 72Saturday, 13 July 13
  • 73. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses 73Saturday, 13 July 13
  • 74. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL 74Saturday, 13 July 13
  • 75. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic 75Saturday, 13 July 13
  • 76. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models 76Saturday, 13 July 13
  • 77. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive modelsANSI SQL for ETL most of the licensing costs… 77Saturday, 13 July 13
  • 78. Anatomy of an Enterprise app Definition a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic most of the project costs… 78Saturday, 13 July 13
  • 79. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source a compiler sees it all… cascading.org 79Saturday, 13 July 13
  • 80. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap );   SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement );   flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org 80Saturday, 13 July 13
  • 81. a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields();   flowDef.addAssemblyPlanner( pmmlPlanner ); 81Saturday, 13 July 13
  • 82. cascading.org ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source visual collaboration for the business logic is a great way to improve how teams work together Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads 82Saturday, 13 July 13
  • 83. ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc. cascading.org 83Saturday, 13 July 13
  • 84. issues confronted: “Orders of magnitude increase, more complexity and variety, widespread disruption…” trends observed: “Functional programming for Big Data” “Just enough math, but not calculus” “Enterprise Data Workflow design pattern” “Cluster computing, smarter scheduling” 84Saturday, 13 July 13
  • 85. Clusters a little secret: people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers… clusters for Hadoop/Hive/HBase, clusters for Memcached, for Cassandra, for MySQL, for Storm, for Nginx, etc. this becomes expensive! a single class of workloads on a given cluster is simpler to manage; but terrible for utilization leveragingVMs and various notions of “cloud” helps Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” All your workloads are belong to us regardless of how architectures change, death and taxes will endure: servers fail, and data must move Google Data Center, Fox News ~2002 85Saturday, 13 July 13
  • 86. Three Laws, or more? meanwhile, architectures evolve toward much, much larger data… pistoncloud.com/ ... Rich Freitas, IBM Research Q: what kinds of evolution in topologies could this imply? 86Saturday, 13 July 13
  • 87. Topologies Hadoop and other topologies arose from a need for fault- tolerant workloads, leveraging horizontal scale-out based on commodity hardware because the data won’t fit on one computer anymore a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem 87Saturday, 13 July 13
  • 88. Some Topologies Beyond Hadoop… Spark (iterative/interactive) Titan (graph database) Redis (data structure server) Zookeeper (distributed metadata) HBase (columnar data objects) Riak (durable key-value store) Storm (real-time streams) ElasticSearch (search index) MongoDB (document store) Greenplum (MPP) SciDB (array database) 88Saturday, 13 July 13
  • 89. issues confronted: “Orders of magnitude increase, more complexity and variety, widespread disruption…” trends observed: “Functional programming for Big Data” “Just enough math, but not calculus” “Enterprise Data Workflow design pattern” “Cluster computing, smarter scheduling” 89Saturday, 13 July 13
  • 90. Operating Systems, redux meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers John Wilkes, et al. Borg/Omega:“10x” secret sauce youtu.be/0ZFMlO98Jkc 0% 25% 50% 75% 100% RAILS CPU LOAD MEMCACHED CPU LOAD 0% 25% 50% 75% 100% HADOOP CPU LOAD 0% 25% 50% 75% 100% t t 0% 25% 50% 75% 100% Rails Memcached Hadoop COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP) Florian Leibert, Chronos/Mesos @ Airbnb Mesos, open source cloud OS – like Borg goo.gl/jPtTP 90Saturday, 13 July 13
  • 91. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere – Four-Part Harmony 91Saturday, 13 July 13
  • 92. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere – Four-Part Harmony 1. End Use Cases, the drivers 92Saturday, 13 July 13
  • 93. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere – Four-Part Harmony 2. A new kind of team process 93Saturday, 13 July 13
  • 94. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere – Four-Part Harmony 3. Abstraction layer as optimizing middleware, e.g., Cascading 94Saturday, 13 July 13
  • 95. Workflow RDBMS near timebatch services transactions, content social interactions Web Apps, Mobile, etc.History Data Products Customers RDBMS Log Events In-Memory Data Grid Hadoop, etc. Cluster Scheduler Prod Eng DW Use Cases Across Topologies s/w dev data science discovery + modeling Planner Ops dashboard metrics business process optimized capacitytaps Data Scientist App Dev Ops Domain Expert introduced capability existing SDLC Circa 2013: clusters everywhere – Four-Part Harmony 4. Distributed OS, e.g., Mesos 95Saturday, 13 July 13
  • 96. Enterprise DataWorkflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do Further study… workshops and newsletter updates: liber118.com/pxn/ 96Saturday, 13 July 13