SlideShare uma empresa Scribd logo
1 de 31
MalStone and MalGen Robert GrossmanOpen Data GroupOpen Cloud Consortium Joint work with Collin Bennett, David Locke, Jonathan Seidman and Steve Vejcik
Part 1.  Other Communities are not Afraid of Benchmarks
Hadoop wins 2008 Terasort in 2008 in 209 seconds.
Hadoop cluster with 910 nodes Sorted 1 TB of data consisting of 10 billion 100-byte records and writing results to disk Each node has 2 quad core 2.0 GHZ Xeons 8 GB RAM per node 40 nodes per rack 8 Gbps Ethernet uplinks from rack to switch
Why Is This Important? Helpful when designing out of memory algorithms. Helpful when porting applications to MapReduce and similar environments. Helpful when benchmarking different rack architectures. Helpful to those designing large data clouds to understand trade off space.
MapReduceTerasort The job used 1800 maps and 1800 reduces Hadoop pre-0.18 with optimization patches so intermediate results not written to disk Allocated enough memory buffers to hold intermediate data in memory Code checked in as Hadoop example by Hadoop team
Proposed:  ,[object Object]
 scan
DebitCreditProposed:  ,[object Object]
 penny sort,[object Object]
A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads.  Using commodity processors, memory, and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds.  This beats the best published record on a 32-CPU 32-disk Hypercube by 8:1.  On another benchmark, AlphaSort sorted more than a gigabyte in a minute. AlphaSort is a cache-sensitive memory-intensive sort algorithm.  We argue that modern architectures require algorithm designers to re-examine their use of the memory hierarchy.  AlphaSort uses clustered data structures to get good cache locality.  It uses file striping to get high disk bandwidth.  It uses QuickSort to generate runs and uses replacement-selection to merge the runs.  It uses shared memory multiprocessors to break the sort into subsort chores.  Source: Abstract from AlphaSort: A Cache-Sensitive Parallel External Sort, Chris Nyberg, Tom Barclay, ZarkaCvetanovic, Jim Gray, Dave Lomet
Is Terasort relevant to the KDD community?
Not that much…. … So what benchmark is relevant for large scale analytics?
Part 2.  Log Files are Everywhere
Log Files Are Everywhere Advertising systems Analyzing system logs Health and status monitoring
What are the Common Elements? Time stamps Sites e.g. Web sites, computers, network devices Entities e.g. visitors, users, flows Log files fill disks, many, many disks Behavior occurs at all scales Want to identify phenomena at all scales Need to group “similar behavior” Need to do statistics (not just sorting)
Abstract the Problem Using Site-Entity Logs 15
MalStone Schema Event ID Time stamp Site ID Entity ID Mark (categorical variable) Fit into 100 bytes
Toy Example reduce map/shuffle Events collected by device or processor in time order Map events by site For each site, compute counts and ratios of events by type 17
Distributions Tens of millions of sites Hundreds of millions of entities Billions of events Most sites have a few number of events Some sites have many events Most entities visit a few sites Some visitors visit many sites
MalStone B 19 entities sites dk-2 dk-1 dk time
The Mark Model Some sites are marked (percent of mark is a parameter and type of sites marked is a draw from a distribution) Some entities become marked after visiting a marked site (this is a draw from a distribution) There is a delay between the visit and the when the entity becomes marked (this is a draw from a distribution) There is a background process that marks some entities independent of visit (this adds noise to problem)
Exposure Window Monitor Window dk-2 dk-1 dk time 21
Notation Fix a site s[j] Let A[j] be entities that transact during ExpWin and if entity is marked, then visit occurs before mark Let B[j] be all entities in A[j] that become marked sometime during the MonWin Subsequent proportion of marks is r[j] = | B[j] |  /  | A[j]  |
ExpWin MonWin 1 MonWin 2 B[j, t] are entities that become marked during MonWin[j] r[j, t] = | B[j, t] |  /  | A[j]  | dk-2 dk-1 dk time 23
Part 3.  MalStone Benchmarks code.google.com/p/malgen/ MalGen and MalStone implementations are open source
MalStone Benchmark Benchmark developed by Open Cloud Consortium for clouds supporting data intensive computing. Code to generate synthetic data required is available from code.google.com/p/malgen Stylized analytic computation that is easy to implement in MapReduce and its generalizations. 25
MalStone A & B
MalStone B running on 10 Billion 100 byte records Hadoop version 0.18.3 20 nodes in the Open Cloud Testbed MapReduce required 799 minutes Hadoop streams required 142 minutes
  68 minutes running MalStone B Benchmark 4 AMD 8435 processors with 24 cores running at 2.6 GHZ  64 Gigabytes of Memory RAID file system with of 5 SATA drives Source: cs.pervasive.com/blogs/datarush/archive/2010/03/05/cluster-on-a-chip.asp March 5, 2010
Design Trade Offs for Sector 29 Tests done on Open Cloud Testbed.

Mais conteúdo relacionado

Mais procurados

Ronalao termpresent
Ronalao termpresentRonalao termpresent
Ronalao termpresent
Elma Belitz
 

Mais procurados (20)

Big Linked Data Querying - ExtremeEarth Open Workshop
Big Linked Data Querying - ExtremeEarth Open WorkshopBig Linked Data Querying - ExtremeEarth Open Workshop
Big Linked Data Querying - ExtremeEarth Open Workshop
 
Big Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open WorkshopBig Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open Workshop
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Working with OpenStreetMap using Apache Spark and Geotrellis
Working with OpenStreetMap using Apache Spark and GeotrellisWorking with OpenStreetMap using Apache Spark and Geotrellis
Working with OpenStreetMap using Apache Spark and Geotrellis
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceBlock Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
 
Druid meetup @walkme
Druid meetup @walkmeDruid meetup @walkme
Druid meetup @walkme
 
Kyryl Sablin Crdt and their uses
Kyryl Sablin Crdt and their usesKyryl Sablin Crdt and their uses
Kyryl Sablin Crdt and their uses
 
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge ProcessingG-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge Processing
 
Working with Scientific Data in MATLAB
Working with Scientific Data in MATLABWorking with Scientific Data in MATLAB
Working with Scientific Data in MATLAB
 
Lsm trees
Lsm treesLsm trees
Lsm trees
 
Lsm
LsmLsm
Lsm
 
Delta Management excercise
Delta Management excerciseDelta Management excercise
Delta Management excercise
 
Advancing Scientific Data Support in ArcGIS
Advancing Scientific Data Support in ArcGISAdvancing Scientific Data Support in ArcGIS
Advancing Scientific Data Support in ArcGIS
 
spark stream - kafka - the right way
spark stream - kafka - the right way spark stream - kafka - the right way
spark stream - kafka - the right way
 
Ronalao termpresent
Ronalao termpresentRonalao termpresent
Ronalao termpresent
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De Boer
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Using python to analyze spatial data
Using python to analyze spatial dataUsing python to analyze spatial data
Using python to analyze spatial data
 

Destaque

Destaque (7)

Bionimbus - An Overview (2010-v6)
Bionimbus - An Overview (2010-v6)Bionimbus - An Overview (2010-v6)
Bionimbus - An Overview (2010-v6)
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
Lessons Learned from a Year's Worth of Benchmarking Large Data Clouds (Robert...
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster Relief
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 

Semelhante a Malstone KDD 2010

Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Cloudera, Inc.
 
BWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 PresentationBWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 Presentation
lilyco
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
programmermag
 
sector-sphere
sector-spheresector-sphere
sector-sphere
xlight
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
Chester Chen
 

Semelhante a Malstone KDD 2010 (20)

MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
BWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 PresentationBWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 Presentation
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
 
Parallel In-Memory Processing and Data Virtualization Redefine Analytics Arch...
Parallel In-Memory Processing and Data Virtualization Redefine Analytics Arch...Parallel In-Memory Processing and Data Virtualization Redefine Analytics Arch...
Parallel In-Memory Processing and Data Virtualization Redefine Analytics Arch...
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009
 
sector-sphere
sector-spheresector-sphere
sector-sphere
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
How to Achieve Fast Data Performance in Big Data, Logical Data Warehouse, and...
How to Achieve Fast Data Performance in Big Data, Logical Data Warehouse, and...How to Achieve Fast Data Performance in Big Data, Logical Data Warehouse, and...
How to Achieve Fast Data Performance in Big Data, Logical Data Warehouse, and...
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter
 
Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-training
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 

Mais de Robert Grossman

Mais de Robert Grossman (20)

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data Platforms
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large Datasets
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Malstone KDD 2010

  • 1. MalStone and MalGen Robert GrossmanOpen Data GroupOpen Cloud Consortium Joint work with Collin Bennett, David Locke, Jonathan Seidman and Steve Vejcik
  • 2. Part 1. Other Communities are not Afraid of Benchmarks
  • 3. Hadoop wins 2008 Terasort in 2008 in 209 seconds.
  • 4. Hadoop cluster with 910 nodes Sorted 1 TB of data consisting of 10 billion 100-byte records and writing results to disk Each node has 2 quad core 2.0 GHZ Xeons 8 GB RAM per node 40 nodes per rack 8 Gbps Ethernet uplinks from rack to switch
  • 5. Why Is This Important? Helpful when designing out of memory algorithms. Helpful when porting applications to MapReduce and similar environments. Helpful when benchmarking different rack architectures. Helpful to those designing large data clouds to understand trade off space.
  • 6. MapReduceTerasort The job used 1800 maps and 1800 reduces Hadoop pre-0.18 with optimization patches so intermediate results not written to disk Allocated enough memory buffers to hold intermediate data in memory Code checked in as Hadoop example by Hadoop team
  • 7.
  • 9.
  • 10.
  • 11. A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads. Using commodity processors, memory, and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This beats the best published record on a 32-CPU 32-disk Hypercube by 8:1. On another benchmark, AlphaSort sorted more than a gigabyte in a minute. AlphaSort is a cache-sensitive memory-intensive sort algorithm. We argue that modern architectures require algorithm designers to re-examine their use of the memory hierarchy. AlphaSort uses clustered data structures to get good cache locality. It uses file striping to get high disk bandwidth. It uses QuickSort to generate runs and uses replacement-selection to merge the runs. It uses shared memory multiprocessors to break the sort into subsort chores. Source: Abstract from AlphaSort: A Cache-Sensitive Parallel External Sort, Chris Nyberg, Tom Barclay, ZarkaCvetanovic, Jim Gray, Dave Lomet
  • 12. Is Terasort relevant to the KDD community?
  • 13. Not that much…. … So what benchmark is relevant for large scale analytics?
  • 14. Part 2. Log Files are Everywhere
  • 15. Log Files Are Everywhere Advertising systems Analyzing system logs Health and status monitoring
  • 16. What are the Common Elements? Time stamps Sites e.g. Web sites, computers, network devices Entities e.g. visitors, users, flows Log files fill disks, many, many disks Behavior occurs at all scales Want to identify phenomena at all scales Need to group “similar behavior” Need to do statistics (not just sorting)
  • 17. Abstract the Problem Using Site-Entity Logs 15
  • 18. MalStone Schema Event ID Time stamp Site ID Entity ID Mark (categorical variable) Fit into 100 bytes
  • 19. Toy Example reduce map/shuffle Events collected by device or processor in time order Map events by site For each site, compute counts and ratios of events by type 17
  • 20. Distributions Tens of millions of sites Hundreds of millions of entities Billions of events Most sites have a few number of events Some sites have many events Most entities visit a few sites Some visitors visit many sites
  • 21. MalStone B 19 entities sites dk-2 dk-1 dk time
  • 22. The Mark Model Some sites are marked (percent of mark is a parameter and type of sites marked is a draw from a distribution) Some entities become marked after visiting a marked site (this is a draw from a distribution) There is a delay between the visit and the when the entity becomes marked (this is a draw from a distribution) There is a background process that marks some entities independent of visit (this adds noise to problem)
  • 23. Exposure Window Monitor Window dk-2 dk-1 dk time 21
  • 24. Notation Fix a site s[j] Let A[j] be entities that transact during ExpWin and if entity is marked, then visit occurs before mark Let B[j] be all entities in A[j] that become marked sometime during the MonWin Subsequent proportion of marks is r[j] = | B[j] | / | A[j] |
  • 25. ExpWin MonWin 1 MonWin 2 B[j, t] are entities that become marked during MonWin[j] r[j, t] = | B[j, t] | / | A[j] | dk-2 dk-1 dk time 23
  • 26. Part 3. MalStone Benchmarks code.google.com/p/malgen/ MalGen and MalStone implementations are open source
  • 27. MalStone Benchmark Benchmark developed by Open Cloud Consortium for clouds supporting data intensive computing. Code to generate synthetic data required is available from code.google.com/p/malgen Stylized analytic computation that is easy to implement in MapReduce and its generalizations. 25
  • 29. MalStone B running on 10 Billion 100 byte records Hadoop version 0.18.3 20 nodes in the Open Cloud Testbed MapReduce required 799 minutes Hadoop streams required 142 minutes
  • 30. 68 minutes running MalStone B Benchmark 4 AMD 8435 processors with 24 cores running at 2.6 GHZ 64 Gigabytes of Memory RAID file system with of 5 SATA drives Source: cs.pervasive.com/blogs/datarush/archive/2010/03/05/cluster-on-a-chip.asp March 5, 2010
  • 31. Design Trade Offs for Sector 29 Tests done on Open Cloud Testbed.
  • 33. Thank You! For more information, rgrossman.com