Chicago finance-big-data

Scalability in Hadoop and Similar Systems
©MapR Technologies - Confidential

Big is the next big thing

• Big data and Hadoop are exploding

• Companies are being funded

• Books are being written

• Applications are sprouting up everywhere

Slow Motion Explosion


Hadoop Explosion


Why Now?

• But Moore’s law has applied for a long time

• Why is Hadoop exploding now?

• Why not 10 years ago?

• Why not 20?

9/18/2012

Size Matters, but …

• If it were just availability of data, then existing big companies would have adopted big data technology first

They didn’t

Or Maybe Cost

• If it were just net positive value, then finance companies would have adopted first, because they have a higher opportunity value per byte

They didn’t

Backwards adoption

• Under almost any threshold argument, startups would not have adopted big data technology first

They did

Everywhere at Once?

• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small

Why?

The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease


Analytics Scaling Laws

• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
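
The scaling argument above can be made concrete with a toy model. The sketch below is only an illustration: the saturating value curve, both cost functions, and every constant in them are assumptions chosen for their shape, not figures from the talk. It shows net value (value minus cost) peaking well before maximum scale under exponential cost, and linear cost starting out more expensive before becoming far cheaper.

```python
import math


def value(scale, half_life=150.0):
    """Diminishing returns (assumed form): most of the value arrives early."""
    return 1.0 - math.exp(-scale / half_life)


def exponential_cost(scale, base=0.01, doubling=250.0):
    """'Old school' cost (assumed form): doubles every `doubling` units of scale."""
    return base * 2.0 ** (scale / doubling)


def linear_cost(scale, fixed=0.05, per_unit=0.0002):
    """'Big data' cost (assumed form): fixed overhead plus a low constant slope."""
    return fixed + per_unit * scale


def best_scale(cost, scales=range(0, 2001, 10)):
    """Return (scale, net value) at the grid point that maximizes value - cost."""
    best = max(scales, key=lambda s: value(s) - cost(s))
    return best, value(best) - cost(best)


if __name__ == "__main__":
    for name, cost in [("exponential", exponential_cost), ("linear", linear_cost)]:
        s, net = best_scale(cost)
        print(f"{name:12s} optimum at scale {s:4d}, net value {net:.2f}")
    for s in (0, 500, 1000, 2000):
        print(f"scale {s:4d}: exponential cost {exponential_cost(s):.2f}, "
              f"linear cost {linear_cost(s):.2f}")
```

With these assumed constants the linear model costs more at small scale because of its fixed overhead, which is one way to read the later remark that linear cost scaling initially makes things worse.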


You’re kidding, people do that?

We didn’t know that!

We should have known that

We knew that


[Figure: value (0 to 1) against scale (0 to 2,000). Labels along the curve, from lowest to highest value: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]


[Figure: value (0 to 1) against scale (0 to 2,000). Annotation: Net value optimum has a sharp peak well before maximum effort]


But scaling laws are changing both slope and shape


[Figure: value (0 to 1) against scale (0 to 2,000). Annotation: More than just a little]


[Figure: value (0 to 1) against scale (0 to 2,000). Annotation: They are changing a LOT!]


[Figure: value (0 to 1) against scale (0 to 2,000)]


[Figure: value (0 to 1) against scale (0 to 2,000). Annotations: Initially, linear cost scaling actually makes things worse; A tipping point is reached and things change radically …]


Pre-requisites for Tipping

To reach the tipping point:
• Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare


Yeah… but wait


The Standard Sort of Model

• People talk about the law of large numbers as if it were …

• Well, as if it were a law

• It’s not …

• It is a context- and assumption-dependent theorem


What if …

• The assumptions are that changes have a
– stationary,
– independent,
– finite-variance distribution

• What happens if these assumptions are wrong?

• And which of them is really wrong?


For Example
[Figure: Stuff against Time]


[Figure: Stuff against Time. Annotation: End point has nice tractable distribution]


What if the Assumptions are Wrong?

• Take the finite variance assumption as a simple example

• Dropping it leads to Lévy stable distributions

• Like the Cauchy distribution
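
A small simulation can show what losing finite variance does to averaging. The sketch below is an illustration under stated assumptions: the function names, the number of trials, and the use of the interquartile range as a spread measure are choices made here, not part of the talk. It compares the spread of sample means for standard normal draws, which shrinks as the sample grows, with standard Cauchy draws, where the mean of n draws has the same distribution as a single draw.

```python
import math
import random
import statistics


def normal_draw(rng):
    """One standard normal draw (finite variance)."""
    return rng.gauss(0.0, 1.0)


def cauchy_draw(rng):
    """One standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def spread_of_means(draw, n, trials=200, seed=1):
    """Interquartile range of the mean of n draws, estimated over many trials."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"n={n:5d}  normal IQR of mean: {spread_of_means(normal_draw, n):.3f}"
              f"  Cauchy IQR of mean: {spread_of_means(cauchy_draw, n):.3f}")
```

The interquartile range is used instead of the standard deviation because the Cauchy distribution has no finite variance, so a sample standard deviation would not settle down.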


Is it Really Different?


[Figure: Stuff against Time]


What About Real Life?


But is it Really Infinite Variance?

• Or are there other kinds of phenomena that show this?

• What about the independence assumption?

• What if the supposedly independent components of the system communicate?

• Like we do. Every day. All the time.


Why the Difference?

[Figure: diagram of overlapping regions labeled: The space of all things that change; Infinite variance; The space of interacting things; Law of large numbers; Interacting agents]

Apologies and credit to Simon DeDeo, SFI


What Happens with Interactions

• Social phenomena defeat the law of large numbers
• Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian Buffet
• Limiting distributions are heavy tailed, power law
• We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of GitHub projects
– equity pricing and volumes
– sizes of cities
– popularity of web sites
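
To see how a "rich get richer" process produces a heavy tail, it may help to simulate one. The sketch below uses the usual seating-rule description of the Pitman-Yor process: a new customer joins an existing table with weight proportional to its size minus a discount, or opens a new table with weight equal to the concentration plus the discount times the number of tables. The function name, the parameter values, and the restaurant framing are assumptions for illustration, not taken from the talk.

```python
import random


def pitman_yor_tables(customers, discount=0.5, concentration=1.0, seed=0):
    """Seat customers one at a time; return table sizes, largest first."""
    rng = random.Random(seed)
    sizes = []
    for n in range(customers):
        # Total weight is n + concentration: the existing tables contribute
        # n - discount * len(sizes), and a new table contributes the rest.
        u = rng.random() * (n + concentration)
        new_weight = concentration + discount * len(sizes)
        if u < new_weight:
            sizes.append(1)  # open a new table
            continue
        u -= new_weight
        for i, size in enumerate(sizes):
            u -= size - discount
            if u < 0:
                sizes[i] += 1  # bigger tables are proportionally more likely
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sorted(sizes, reverse=True)


if __name__ == "__main__":
    sizes = pitman_yor_tables(5000)
    singletons = sum(1 for s in sizes if s == 1)
    print(f"{len(sizes)} tables; five largest: {sizes[:5]}; "
          f"tables of size 1: {singletons}")
```

A few tables typically end up holding a large share of the customers while many tables hold only one, which is the heavy-tailed pattern the list above describes.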


What are the Implications?


[Figure: value (0 to 1) against scale (0 to 2,000)]


In a Nutshell

• Scalability is much more important than we thought

• Mashups are more important than we thought

• Network effects are more important than we thought

• Exploration is more important than we thought

• Hadoop-style linear scaling must be mixed with ad hoc analysis


Thank You


whoami?

• Ted Dunning
– @ted_dunning
– tdunning@maprtech.com (MapR distribution for Hadoop)
– tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
– ted.dunning@gmail.com (me)

• More info:

http://www.mapr.com/company/events/hadoop-in-finance-2012

