2. Big is the next big thing
Big data and Hadoop are exploding
Companies are being funded
Books are being written
Applications sprouting up everywhere
©MapR Technologies - Confidential 2
5. Why Now?
But Moore’s law has applied for a long time
Why is Hadoop exploding now?
Why not 10 years ago?
Why not 20?
9/18/2012
7. Size Matters, but …
If it were just availability of data then existing big companies would
adopt big data technology first
They didn’t
9. Or Maybe Cost
If it were just a net positive value then finance companies should
adopt first because they have higher opportunity value / byte
They didn’t
11. Backwards adoption
Under almost any threshold argument startups would not adopt
big data technology first
They did
13. Everywhere at Once?
Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
14. The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease
15. Analytics Scaling Laws
Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
Cost/performance has changed radically
– IF you can use many commodity boxes
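The trade-off described on this slide can be sketched numerically. The following is a toy model only: the saturating value curve, the exponential "old school" cost, the linear "big data" cost, and every constant in them are assumptions chosen for illustration, not figures from the talk. It finds the scale that maximizes net value (value minus cost) under each cost shape.

```python
import math

def value(scale, saturation=200.0):
    """Assumed 80-20 value curve: fast early gains, then diminishing returns."""
    return 1.0 - math.exp(-scale / saturation)

def exponential_cost(scale, base=0.01, growth=300.0):
    """Assumed 'old school' cost: cheap at first, then growing exponentially."""
    return base * (math.exp(scale / growth) - 1.0)

def linear_cost(scale, slope=1e-4):
    """Assumed 'big data' cost: linear in scale with a low constant."""
    return slope * scale

def best_scale(cost, scales=range(0, 2001)):
    """Return the scale on the grid that maximizes net value = value - cost."""
    return max(scales, key=lambda s: value(s) - cost(s))

old_scale = best_scale(exponential_cost)
new_scale = best_scale(linear_cost)

for name, cost, scale in (("exponential", exponential_cost, old_scale),
                          ("linear", linear_cost, new_scale)):
    net = value(scale) - cost(scale)
    print(f"{name:12s} cost: best scale = {scale:4d}, net value = {net:.3f}")
```

With these assumed constants the linear-cost optimum sits at a larger scale than the exponential-cost optimum; different constants would move both points.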
16. You’re kidding, people do that?
We didn’t know that!
We should have known that
We knew that
17. [Figure: Value (0 to 1) versus Scale (0 to 2,000). Labels, listed from the top of the chart to the bottom: NSA, non-proliferation; Industry-wide data consortium; In-house analytics; Intern with a spreadsheet; Anybody with eyes]
18. [Figure: Value (0 to 1) versus Scale (0 to 2,000). Annotation: Net value optimum has a sharp peak well before maximum effort]
19. But scaling laws are changing both slope and shape
20. [Figure: Value (0 to 1) versus Scale (0 to 2,000). Annotation: More than just a little]
21. [Figure: Value (0 to 1) versus Scale (0 to 2,000). Annotation: They are changing a LOT!]
24. [Figure: Value (0 to 1) versus Scale (0 to 2,000)]
25. [Figure: Value (0 to 1) versus Scale (0 to 2,000)]
26. [Figure: Value (0 to 1) versus Scale (0 to 2,000). Annotations: Initially, linear cost scaling actually makes things worse; A tipping point is reached and things change radically …]
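One way to see how a falling linear cost slope can first hurt and then help is to sweep the slope and compare each optimum against an exponential-cost baseline. This is a hedged sketch: the value curve, both cost functions, the slope values, and the 0 to 2,000 grid are all illustrative assumptions, and the toy model is not claimed to reproduce the curves on the slide.

```python
import math

SCALES = range(0, 2001)  # assumed grid, matching the 0 to 2,000 axis

def value(scale):
    """Assumed saturating value curve with diminishing returns."""
    return 1.0 - math.exp(-scale / 200.0)

def old_school_cost(scale):
    """Assumed exponential cost used as the baseline."""
    return 0.01 * (math.exp(scale / 300.0) - 1.0)

def optimum(cost):
    """Return (best scale, best net value) for a given cost function."""
    best = max(SCALES, key=lambda s: value(s) - cost(s))
    return best, value(best) - cost(best)

baseline_scale, baseline_net = optimum(old_school_cost)
print(f"baseline: scale = {baseline_scale}, net = {baseline_net:.3f}")

sweep = []
for slope in (5e-3, 2e-3, 1e-3, 5e-4, 1e-4, 1e-5):
    # Bind the current slope as a default argument so each lambda keeps its own.
    scale, net = optimum(lambda s, k=slope: k * s)
    sweep.append((slope, scale, net))
    verdict = "better" if net > baseline_net else "worse"
    print(f"slope {slope:.0e}: scale = {scale:4d}, net = {net:.3f} ({verdict})")
```

In this sketch the steepest linear costs give a lower best net value than the baseline, and only the flatter slopes overtake it while pushing the best scale outward.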
27. Pre-requisites for Tipping
To reach the tipping point,
Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
29. The Standard Sort of Model
People talk about the law of large numbers as if it were …
Well, as if it were a law
It’s not …
It is a context and assumption dependent theorem
30. What if …
These assumptions are:
Changes have a
– stationary,
– independent,
– finite variance distribution
What happens if these assumptions are wrong?
And which of them is really wrong?
31. For Example
[Figure: Stuff versus Time]
32. End point has nice tractable distribution
[Figure: Stuff versus Time]
33. What if the Assumptions are Wrong?
Take the finite variance as a simple example
This leads to Levy stable distributions
Like the Cauchy distribution
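A small simulation can make the role of the finite variance assumption concrete. The sketch below is an illustration under stated assumptions (standard normal versus standard Cauchy draws, 200 trials per sample size, an arbitrary seed, and the interquartile range as the measure of spread); none of these choices come from the talk. It measures how widely the sample mean varies from trial to trial as the sample size grows.

```python
import math
import random

def draw_normal(rng):
    """Standard normal draw: finite variance."""
    return rng.gauss(0.0, 1.0)

def draw_cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: infinite variance."""
    return math.tan(math.pi * (rng.random() - 0.5))

def spread_of_sample_means(draw, sample_size, trials, rng):
    """Interquartile range of the sample mean across repeated trials."""
    means = sorted(
        sum(draw(rng) for _ in range(sample_size)) / sample_size
        for _ in range(trials)
    )
    return means[(3 * trials) // 4] - means[trials // 4]

rng = random.Random(2012)  # arbitrary seed, for repeatability only
for n in (10, 100, 1000):
    normal_iqr = spread_of_sample_means(draw_normal, n, 200, rng)
    cauchy_iqr = spread_of_sample_means(draw_cauchy, n, 200, rng)
    print(f"n = {n:4d}: normal IQR = {normal_iqr:.3f}, Cauchy IQR = {cauchy_iqr:.3f}")
```

The normal spread shrinks roughly like one over the square root of the sample size, while the Cauchy spread stays about the same, because the mean of independent standard Cauchy draws is itself standard Cauchy.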
34. Is it Really Different?
35. [Figure: Stuff versus Time]
38. But is it Really Infinite Variance?
Or are there other kinds of phenomena that show this?
What about the independence assumption?
What if the supposedly independent components of the system
communicate?
Like we do. Everyday. All the time.
39. Why the Difference?
[Figure: diagram with regions labeled "The space of all things that change", "Infinite variance", and "The space of interacting things", and items labeled "Law of large numbers" and "Interacting agents"]
Apologies and credit to Simon DeDeo, SFI
40. What Happens with Interactions
Social phenomena defeat the law of large numbers
Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian buffet process
Limiting distributions are heavy tailed, power law
We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of Github projects
– equity pricing and volumes
– sizes of cities
– popularity of web-sites
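A "rich get richer" process is easy to simulate. The sketch below uses the Chinese-restaurant seating scheme for the Pitman-Yor process: with n customers seated at K tables, the next customer joins table k with probability (n_k - d) / (n + theta) and opens a new table with probability (theta + d * K) / (n + theta). The discount d = 0.5, concentration theta = 1.0, customer count, and seed are illustrative assumptions, not values from the talk.

```python
import random

def pitman_yor_table_sizes(n_customers, discount=0.5, concentration=1.0, seed=0):
    """Seat customers one at a time and return the list of table sizes."""
    rng = random.Random(seed)
    sizes = []
    for n in range(n_customers):  # n customers are already seated
        u = rng.random() * (n + concentration)
        new_table_weight = concentration + discount * len(sizes)
        if u < new_table_weight:
            sizes.append(1)  # open a new table
            continue
        u -= new_table_weight
        for k, size in enumerate(sizes):
            u -= size - discount  # bigger tables attract more customers
            if u < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes

sizes = pitman_yor_table_sizes(5000)
ranked = sorted(sizes, reverse=True)
median_size = ranked[len(ranked) // 2]
print(f"tables: {len(sizes)}, largest: {ranked[0]}, median: {median_size}")
print("top five table sizes:", ranked[:5])
```

A few tables end up holding a large share of all customers while the typical table stays tiny, which is the heavy-tailed pattern the slide describes.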
41. What are the Implications?
42. [Figure: Value (0 to 1) versus Scale (0 to 2,000)]
43. In a Nutshell
Scalability is much more important than we thought
Mashups are more important than we thought
Network effects are more important than we thought
Exploration is more important than we thought
Hadoop style linear scaling must be mixed with ad hoc analysis
45. whoami?
Ted Dunning
– @ted_dunning
– tdunning@maprtech.com (MapR distribution for Hadoop)
– tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
– ted.dunning@gmail.com (me)
More info:
http://www.mapr.com/company/events/hadoop-in-finance-2012
Editor's Notes

Why is big data sooo fashionable with big and small companies from different industries? What has suddenly changed?

Google searches are up 10x over just four years ago.

Hadoop use is exploding. We chose this example, which shows job trends for Hadoop. Further evidence that you should pay attention during this talk.

But we have seen constant growth for a long time. And simple growth would only explain some kinds of companies starting with big data (probably big ones) and then slow adoption. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time, so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now, and why is it exploding at all?

The different kinds of scaling laws have different shapes, and I think that shape is the key. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.

In classical analytics, the cost of doing analytics increases sharply. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.

New techniques such as Hadoop result in linear scaling of cost. This is a change in shape, and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.

This next sequence shows how the net value changes with different slope linear cost models.

Notice how the best net value has jumped up significantly.

And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.