Scalability in Hadoop and Similar Systems
©MapR Technologies - Confidential              1
Big is the next big thing

     Big data and Hadoop are exploding


     Companies are being funded


     Books are being written


     Applications are sprouting up everywhere




Slow Motion Explosion




Hadoop Explosion




Why Now?

        But Moore’s law has applied for a long time


        Why is Hadoop exploding now?


        Why not 10 years ago?


        Why not 20?




9/18/2012
Size Matters, but …

     If it were just availability of data then existing big companies would
      adopt big data technology first


                       They didn’t




Or Maybe Cost

     If it were just a net positive value then finance companies should
      adopt first because they have higher opportunity value / byte


                       They didn’t




Backwards adoption

     Under almost any threshold argument startups would not adopt
      big data technology first


                       They did




Everywhere at Once?

     Something very strange is happening
       –   Big data is being applied at many different scales
       –   At many value scales
       –   By large companies and small


                                    Why?




The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease




Analytics Scaling Laws

     Analytics scaling is all about the 80-20 rule
       –   Big gains for little initial effort
       –   Rapidly diminishing returns
     The key to net value is how costs scale
       –   Old school – exponential scaling
       –   Big data – linear scaling, low constant
     Cost/performance has changed radically
       –   IF you can use many commodity boxes
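
The scaling argument above can be sketched numerically. This is only an illustration: the saturating value curve, the two cost curves, and every constant below are hypothetical choices, not figures from the talk. The point is that when value shows diminishing returns, the scale that maximizes net value (value minus cost) depends heavily on how cost grows.

```python
def value(scale, half_point=100.0):
    """Hypothetical diminishing-returns value curve: rises fast, then saturates toward 1."""
    return scale / (scale + half_point)

def exponential_cost(scale, base_cost=0.01, doubling=200.0):
    """Hypothetical 'old school' cost: doubles every `doubling` units of scale."""
    return base_cost * (2.0 ** (scale / doubling))

def linear_cost(scale, per_unit=0.0001):
    """Hypothetical 'big data' cost: linear in scale with a low constant."""
    return per_unit * scale

def best_scale(cost, scales):
    """Return (scale, net value) at the scale that maximizes value minus cost."""
    return max(((s, value(s) - cost(s)) for s in scales), key=lambda p: p[1])

scales = range(0, 2001, 10)
old_scale, old_net = best_scale(exponential_cost, scales)
new_scale, new_net = best_scale(linear_cost, scales)
print(f"exponential cost: optimum at scale {old_scale}, net value {old_net:.2f}")
print(f"linear cost:      optimum at scale {new_scale}, net value {new_net:.2f}")
```

With these assumed curves, the linear-cost optimum sits at a noticeably larger scale than the exponential-cost optimum. Other constants would move both peaks.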




[Figure labels, in order of appearance: "You’re kidding, people do that?"; "We didn’t know that!"; "We should have known that"; "We knew that"]
[Figure: Value (0 to 1) versus Scale (0 to 2,000). Labels, from highest value to lowest: NSA, non-proliferation; Industry-wide data consortium; In-house analytics; Intern with a spreadsheet; Anybody with eyes]
[Figure: Value versus Scale. Net value optimum has a sharp peak well before maximum effort]
But scaling laws are changing both slope and shape
[Figure: Value versus Scale. More than just a little]
[Figure: Value versus Scale. They are changing a LOT!]
[Figure: Value versus Scale. Initially, linear cost scaling actually makes things worse. A tipping point is reached and things change radically …]
Pre-requisites for Tipping

     To reach the tipping point,
     Algorithms must scale out horizontally
       –   On commodity hardware
       –   That can and will fail
     Data practice must change
       –   Denormalized is the new black
       –   Flexible data dictionaries are the rule
       –   Structured data becomes rare




Yeah… but wait




The Standard Sort of Model

     People talk about the law of large numbers as if it were …



     Well, as if it were a law


     It’s not …


     It is a context- and assumption-dependent theorem




What if …

     These assumptions are:


     Changes have a
       –   stationary,
       –   independent,
       –   finite variance distribution




     What happens if these assumptions are wrong?


     And which of them is really wrong?

For Example
[Figure: Stuff versus Time]
End point has a nice, tractable distribution
[Figure: Stuff versus Time]
What if the Assumptions are Wrong?

     Take the finite variance assumption as a simple example


     Dropping it leads to Lévy stable distributions


     Like the Cauchy distribution
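
A small simulation makes the contrast concrete. This is a sketch under assumptions that are not in the slides: the seed, the sample size, and the checkpoints are arbitrary, and the Cauchy draws use the standard inverse-CDF formula. It compares running sample means for Gaussian data, where the law of large numbers applies, with Cauchy data, where the variance is infinite and the mean is undefined.

```python
import math
import random

def cauchy_sample(rng):
    """Draw from the standard Cauchy distribution via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(samples, checkpoints):
    """Sample mean of the first n values, for each n in checkpoints."""
    means, total, wanted = {}, 0.0, set(checkpoints)
    for n, x in enumerate(samples, start=1):
        total += x
        if n in wanted:
            means[n] = total / n
    return means

rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable
n = 100_000
checkpoints = [100, 1_000, 10_000, 100_000]
gauss = [rng.gauss(0.0, 1.0) for _ in range(n)]
cauchy = [cauchy_sample(rng) for _ in range(n)]

gauss_means = running_means(gauss, checkpoints)
cauchy_means = running_means(cauchy, checkpoints)
for c in checkpoints:
    print(f"n={c:>7}  gaussian mean={gauss_means[c]:+.4f}  cauchy mean={cauchy_means[c]:+.4f}")
print("largest |gaussian| sample:", max(abs(x) for x in gauss))
print("largest |cauchy| sample:  ", max(abs(x) for x in cauchy))
```

The Gaussian running mean settles near zero as n grows, while the Cauchy running mean keeps getting knocked around by individual extreme draws, which is why averaging more data does not help in the same way.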




Is it Really Different?




[Figure: Stuff versus Time]
What About Real Life?




But is it Really Infinite Variance?

     Or are there other kinds of phenomena that show this?


     What about the independence assumption?



     What if the supposedly independent components of the system
      communicate?


     Like we do. Every day. All the time.




Why the Difference?
[Diagram: "The space of all things that change" and "The space of interacting things", with regions labeled "Infinite variance", "Law of large numbers", and "Interacting agents"]

Apologies and credit to Simon DeDeo, SFI
What Happens with Interactions

     Social phenomena defeat the law of large numbers
     Distributions are well modeled by “rich get richer” processes
       –   Pitman-Yor process, Indian buffet process
     Limiting distributions are heavy tailed, power law
     We see these distributions everywhere
       –   price of cotton in the 19th century
       –   word frequencies
       –   popularity of Github projects
       –   equity pricing and volumes
       –   sizes of cities
       –   popularity of web-sites
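
As an illustration of a "rich get richer" process, here is a minimal simulation of the Pitman-Yor process in its Chinese-restaurant form. The function name, the `discount` and `concentration` values, the seed, and the number of customers are all assumptions chosen for the sketch. Each new customer joins an existing table with weight proportional to its size minus the discount, or opens a new table with weight `concentration + discount * (number of tables)`.

```python
import random

def pitman_yor_tables(n_customers, discount=0.5, concentration=1.0, seed=0):
    """Simulate table sizes under the Pitman-Yor 'Chinese restaurant' scheme."""
    rng = random.Random(seed)
    sizes = []
    for n in range(n_customers):
        # All weights sum to n + concentration; the new-table share grows with table count.
        new_weight = concentration + discount * len(sizes)
        r = rng.random() * (n + concentration)
        if r < new_weight:
            sizes.append(1)
            continue
        r -= new_weight
        # Otherwise join table k with weight (size_k - discount): big tables attract more.
        for k, size in enumerate(sizes):
            r -= size - discount
            if r < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sorted(sizes, reverse=True)

sizes = pitman_yor_tables(10_000)
top = max(1, len(sizes) // 100)
print("tables:", len(sizes))
print("five largest:", sizes[:5])
print("share of customers at the largest 1% of tables:", sum(sizes[:top]) / sum(sizes))
```

A few tables end up holding a large share of all customers while most tables stay tiny, which is the heavy-tailed shape the slide describes.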


What are the Implications?
[Figure: Value versus Scale]
In a Nutshell

     Scalability is much more important than we thought


     Mashups are more important than we thought


     Network effects are more important than we thought


     Exploration is more important than we thought


     Hadoop-style linear scaling must be mixed with ad hoc analysis



Thank You




whoami?

     Ted Dunning
       –   @ted_dunning
       –   tdunning@maprtech.com (MapR distribution for Hadoop)
       –   tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
       –   ted.dunning@gmail.com (me)


     More info:

       http://www.mapr.com/company/events/hadoop-in-finance-2012




 

Chicago finance-big-data

  • 1. Scalability in Hadoop and Similar Systems ©MapR Technologies - Confidential
  • 2. Big is the next big thing
      – Big data and Hadoop are exploding
      – Companies are being funded
      – Books are being written
      – Applications sprouting up everywhere
  • 3. Slow Motion Explosion
  • 5. Why Now? (9/18/2012)
      – But Moore’s law has applied for a long time
      – Why is Hadoop exploding now?
      – Why not 10 years ago?
      – Why not 20?
  • 6. Size Matters, but …
      – If it were just availability of data then existing big companies would adopt big data technology first
  • 7. Size Matters, but …
      – If it were just availability of data then existing big companies would adopt big data technology first
      – They didn’t
  • 8. Or Maybe Cost
      – If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
  • 9. Or Maybe Cost
      – If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
      – They didn’t
  • 10. Backwards adoption
      – Under almost any threshold argument startups would not adopt big data technology first
  • 11. Backwards adoption
      – Under almost any threshold argument startups would not adopt big data technology first
      – They did
  • 12. Everywhere at Once?
      – Something very strange is happening
          – Big data is being applied at many different scales
          – At many value scales
          – By large companies and small
  • 13. Everywhere at Once?
      – Something very strange is happening
          – Big data is being applied at many different scales
          – At many value scales
          – By large companies and small
      – Why?
  • 14. The Conventional Answer
      – More data is being produced more quickly
      – Data sizes are bigger than even a very large computer can hold
      – Cost to create and store continues to decrease
  • 15. Analytics Scaling Laws
      – Analytics scaling is all about the 80-20 rule
          – Big gains for little initial effort
          – Rapidly diminishing returns
      – The key to net value is how costs scale
          – Old school: exponential scaling
          – Big data: linear scaling, low constant
      – Cost/performance has changed radically
          – IF you can use many commodity boxes
  • 16. You’re kidding, people do that? / We didn’t know that! / We should have known that / We knew that
  • 17. [Figure: Value (0 to 1) vs. Scale (0 to 2,000); curve labels: NSA, non-proliferation; Industry-wide data consortium; In-house analytics; Intern with a spreadsheet; Anybody with eyes]
  • 18. [Figure: Value vs. Scale] Net value optimum has a sharp peak well before maximum effort
  • 19. But scaling laws are changing both slope and shape
  • 20. [Figure: Value vs. Scale] More than just a little
  • 21. [Figure: Value vs. Scale] They are changing a LOT!
  • 22. (no text)
  • 23. (no text)
  • 24. [Figure: Value vs. Scale]
  • 25. [Figure: Value vs. Scale]
  • 26. [Figure: Value vs. Scale; annotations: A tipping point is reached and things change radically …; Initially, linear cost scaling actually makes things worse]
  • 27. Pre-requisites for Tipping
      – To reach the tipping point:
      – Algorithms must scale out horizontally
          – On commodity hardware
          – That can and will fail
      – Data practice must change
          – Denormalized is the new black
          – Flexible data dictionaries are the rule
          – Structured data becomes rare
  • 28. Yeah… but wait
  • 29. The Standard Sort of Model
      – People talk about the law of large numbers as if it were …
      – Well, as if it were a law
      – It’s not …
      – It is a context- and assumption-dependent theorem
  • 30. What if …
      – These assumptions are:
      – Changes have a
          – stationary,
          – independent,
          – finite variance distribution
      – What happens if these assumptions are wrong?
      – And which of them is really wrong?
  • 31. For Example [Figure: Stuff vs. Time]
  • 32. [Figure: Stuff vs. Time] End point has nice tractable distribution
  • 33. What if the Assumptions are Wrong?
      – Take the finite variance assumption as a simple example
      – Dropping it leads to Lévy stable distributions
      – Like the Cauchy distribution
  • 34. Is it Really Different?
  • 35. [Figure: Stuff vs. Time]
  • 36. What About Real Life?
  • 37. (no text)
  • 38. But is it Really Infinite Variance?
      – Or are there other kinds of phenomena that show this?
      – What about the independence assumption?
      – What if the supposedly independent components of the system communicate?
      – Like we do. Every day. All the time.
  • 39. Why the Difference? [Diagram labels: The space of all things that change; Infinite variance; The space of interacting things; Law of large numbers; Interacting agents] Apologies and credit to Simon DeDeo, SFI
  • 40. What Happens with Interactions
      – Social phenomena defeat the law of large numbers
      – Distributions are well modeled by “rich get richer” processes
          – Pitman-Yor process, Indian Buffet
      – Limiting distributions are heavy tailed, power law
      – We see these distributions everywhere
          – price of cotton in the 19th century
          – word frequencies
          – popularity of Github projects
          – equity pricing and volumes
          – sizes of cities
          – popularity of web-sites
  • 41. What are the Implications?
  • 42. [Figure: Value vs. Scale]
  • 43. In a Nutshell
      – Scalability is much more important than we thought
      – Mashups are more important than we thought
      – Network effects are more important than we thought
      – Exploration is more important than we thought
      – Hadoop-style linear scaling must be mixed with ad hoc analysis
  • 44. Thank You
  • 45. whoami?
      – Ted Dunning
          – @ted_dunning
          – tdunning@maprtech.com (MapR distribution for Hadoop)
          – tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
          – ted.dunning@gmail.com (me)
      – More info: http://www.mapr.com/company/events/hadoop-in-finance-2012
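
Slides 29 to 33 argue that the law of large numbers depends on assumptions such as finite variance, and name the Cauchy distribution as a case where that assumption fails. The sketch below is one way to see the difference numerically; it is not taken from the talk. The function names, sample size, and seed are assumptions chosen for illustration. It compares running means of normal and Cauchy samples, and also reports the Cauchy sample median, which stays well behaved.

```python
import math
import random
import statistics


def cauchy_sample(rng):
    """Draw one standard Cauchy value by inverse-CDF sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))


def running_means(samples, checkpoints):
    """Return {n: mean of the first n samples} for each n in checkpoints."""
    means = {}
    total = 0.0
    for i, x in enumerate(samples, start=1):
        total += x
        if i in checkpoints:
            means[i] = total / i
    return means


def compare(n=100_000, seed=42):
    """Running means for normal and Cauchy draws, plus the Cauchy median.

    n and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    checkpoints = {100, 1_000, 10_000, n}
    normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    cauchy = [cauchy_sample(rng) for _ in range(n)]
    return (
        running_means(normal, checkpoints),
        running_means(cauchy, checkpoints),
        statistics.median(cauchy),
    )


if __name__ == "__main__":
    normal_means, cauchy_means, cauchy_median = compare()
    for n in sorted(normal_means):
        print(n, "normal mean:", normal_means[n], "cauchy mean:", cauchy_means[n])
    print("cauchy median:", cauchy_median)
```

The normal running mean should settle near zero as n grows, while the Cauchy running mean can jump at any checkpoint because a single extreme draw can dominate the sum. Trying several seeds makes the contrast easier to see.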

Editor's Notes

  1. Why is big data so fashionable with big and small companies from different industries? What has suddenly changed?
  2. Google searches are up 10x over just four years ago.
  3. Hadoop use is exploding. We chose this example, which shows job trends for Hadoop. It is further evidence that you should pay attention during this talk.
  4. But we have seen constant growth for a long time. And simple growth would only explain some kinds of companies starting with big data (probably big ones) and then slow adoption. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now and why is it exploding at all?
  5. The different kinds of scaling laws have different shapes, and I think that shape is the key.
  6. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  7. In classical analytics, the cost of doing analytics increases sharply.
  8. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  9. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  10. This next sequence shows how the net value changes with different slope linear cost models.
  11. Notice how the best net value has jumped up significantly.
  12. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
  13. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
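
Notes 6 to 12 describe a value curve with diminishing returns, a classical cost curve that rises sharply, and a linear cost line whose slope keeps falling. A minimal sketch of that trade-off follows. The curve shapes and every constant are assumptions picked for illustration, not figures from the talk. It finds the scale that maximizes net value (value minus cost) on the same 0 to 2,000 axis the slides use.

```python
import math

# All constants below are illustrative assumptions, not values from the talk.
SCALES = range(0, 2001, 10)  # same 0 to 2,000 range as the slide axes


def value(scale):
    """Value of analytics: quick early gains, then diminishing returns."""
    return 1.0 - math.exp(-scale / 150.0)


def exponential_cost(scale):
    """'Old school' cost that grows exponentially with scale."""
    return 0.02 * (math.exp(scale / 250.0) - 1.0)


def linear_cost(slope):
    """Return a 'big data' cost function that grows linearly with scale."""
    return lambda scale: slope * scale


def best_scale(cost):
    """Scale on the grid that maximizes net value = value - cost."""
    return max(SCALES, key=lambda s: value(s) - cost(s))


if __name__ == "__main__":
    print("exponential cost:", best_scale(exponential_cost))
    for slope in (1e-3, 1e-4, 1e-5):
        print("linear cost, slope", slope, ":", best_scale(linear_cost(slope)))
```

With these particular constants, the steepest linear cost puts the optimum at a smaller scale than the exponential model does, and the optimum moves to larger scales as the slope falls. Different constants would shift the crossover, so the numbers are only a qualitative illustration of the argument.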