SlideShare uma empresa Scribd logo
1 de 21
High-throughput data analysis 	A Streaming Reports Platform Authors J Singh, Early Stage IT David Zheng, Early Stage IT Contributor Satya Gupta, Virsec Systems October 3, 2011
High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
Streaming Data Data arrives continuously Must be processed continuously Emit analysis results or alerts as needed
Security: Scrapers, Spammers, …
Monitoring and Alerts
Financial Markets
High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
Example Use Case Resolve Virtual Machineprofiles application and gathers data about the application The data analysis blocked until data collection was complete,  Took several hours before conclusions could be drawn Project goals Stream-mode analysis Begin within a few seconds of start of profiling Continuous update Data rates up to 5 GB per hour Ability to sustain rate for 24 hrs A product of Virsec Systems Analysis and Reporting configured to run in the Amazon EC2 environment ,[object Object]
Or scaled out (more machines),[object Object]
Requirements Fast inserts into the database Thenature and amount of analysis required was hard to judge in the beginning Previous experience with Map/Reduce in the Google App Engine environment had shown promise but GAE was not appropriate for this application Slick, demo-worthy web interface for presenting results Stream-mode operation Start showing results within a few seconds of starting the Resolve Virtual Machine, and update it periodically as more data is collected and analyzed.
High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
Key decisions Chunk up data into 1-second “slices” as it arrives Use a collection for signaling the availability of each data slice Process each chunk as it becomes available Use Map/Reduce for analysis Exploit Parallelism of the data by using as many processors as needed to maintain the “flow rate” Pipeline the various Map Reduce jobs to maintain sequentiality of data
Pipeline Component: Listener Listener Goal: push the data into MongoDB as fast as possible Receives the data from the Resolve Virtual Machine and stores it into MongoDB Self-describing data 12 different types of data fed over 12 different sockets Written in C++ Socket Interface at one end MongoDB C++ driver at other end
Pipeline Component:MongoDB MongoDB ,[object Object]
Did everything it was supposed to do
Allowed us to focus on our problem, not on MongoDB
Will use replica sets for making the data available to analysis servers,[object Object]
“Function Call Structure” Analysis FnNameTotalTimeSrcFnAddress                 PID SF_CFA map (emit: FnName, TotalTime, min_addr, NumOfCalls, PID, …) Shuffle stage: FnName: CreateRaceObjects { {TotalTime: 3, min_addr: 2 , NumOfCalls: 1 , PID: 1, …}                                                   {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}                                                   {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …} {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}                                                 }  reduce Output: {FnName: CreateRaceObjects{TotalTime: 20, min_addr: 1 , NumOfCalls: 2 , PID: 1, …}
Pipeline Component: Presentation Presentation ,[object Object]
And requires a page design of its own
Tool of choice: DjangoNonRel

Mais conteúdo relacionado

Mais procurados

Psdot 1 optimization of resource provisioning cost in cloud computing
Psdot 1 optimization of resource provisioning cost in cloud computingPsdot 1 optimization of resource provisioning cost in cloud computing
Psdot 1 optimization of resource provisioning cost in cloud computingZTech Proje
 
Introduction to Reactive programming
Introduction to Reactive programmingIntroduction to Reactive programming
Introduction to Reactive programmingDwi Randy Herdinanto
 
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...Scalable Realtime Analytics with declarative SQL like Complex Event Processin...
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...Srinath Perera
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learningAmer Ather
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...Srinath Perera
 
An Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud ComputingAn Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud ComputingAisha Kalsoom
 
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...Spark Summit
 
Learning From the Past: Automated Rule Generation for CEP - DEBS 2014
Learning From the Past: Automated Rule Generation for CEP - DEBS 2014Learning From the Past: Automated Rule Generation for CEP - DEBS 2014
Learning From the Past: Automated Rule Generation for CEP - DEBS 2014Alessandro Margara
 
Efficient monitoring and alerting
Efficient monitoring and alertingEfficient monitoring and alerting
Efficient monitoring and alertingTobias Schmidt
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesNatalino Busa
 
Hadoop MapReduce Paradigm
Hadoop MapReduce ParadigmHadoop MapReduce Paradigm
Hadoop MapReduce ParadigmTarjMehta1
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Ram Sriharsha
 
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...InfluxData
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
 

Mais procurados (20)

Tuning Java Servers
Tuning Java Servers Tuning Java Servers
Tuning Java Servers
 
Psdot 1 optimization of resource provisioning cost in cloud computing
Psdot 1 optimization of resource provisioning cost in cloud computingPsdot 1 optimization of resource provisioning cost in cloud computing
Psdot 1 optimization of resource provisioning cost in cloud computing
 
Introduction to Reactive programming
Introduction to Reactive programmingIntroduction to Reactive programming
Introduction to Reactive programming
 
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...Scalable Realtime Analytics with declarative SQL like Complex Event Processin...
Scalable Realtime Analytics with declarative SQL like Complex Event Processin...
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learning
 
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...
 
An Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud ComputingAn Efficient Decentralized Load Balancing Algorithm in Cloud Computing
An Efficient Decentralized Load Balancing Algorithm in Cloud Computing
 
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
 
Learning From the Past: Automated Rule Generation for CEP - DEBS 2014
Learning From the Past: Automated Rule Generation for CEP - DEBS 2014Learning From the Past: Automated Rule Generation for CEP - DEBS 2014
Learning From the Past: Automated Rule Generation for CEP - DEBS 2014
 
1. introduction
1. introduction1. introduction
1. introduction
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Efficient monitoring and alerting
Efficient monitoring and alertingEfficient monitoring and alerting
Efficient monitoring and alerting
 
Doc5
Doc5Doc5
Doc5
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologies
 
Hadoop MapReduce Paradigm
Hadoop MapReduce ParadigmHadoop MapReduce Paradigm
Hadoop MapReduce Paradigm
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016
 
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 

Semelhante a High Throughput Data Analysis

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Monitoring as Software Validation
Monitoring as Software ValidationMonitoring as Software Validation
Monitoring as Software ValidationBioDec
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Guido Schmutz
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
 
RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011Gerardo Pardo-Castellote
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 
Would Mr. Spok choose Open Source
Would Mr. Spok choose Open SourceWould Mr. Spok choose Open Source
Would Mr. Spok choose Open Sourcevlcinsky
 
Natural Laws of Software Performance
Natural Laws of Software PerformanceNatural Laws of Software Performance
Natural Laws of Software PerformanceGibraltar Software
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018VMware Tanzu
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data AnalyticsOsman Ali
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksGuido Schmutz
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
 

Semelhante a High Throughput Data Analysis (20)

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Monitoring as Software Validation
Monitoring as Software ValidationMonitoring as Software Validation
Monitoring as Software Validation
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016Big Data Architectures @ JAX / BigDataCon 2016
Big Data Architectures @ JAX / BigDataCon 2016
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
 
DIET_BLAST
DIET_BLASTDIET_BLAST
DIET_BLAST
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Streaming Analytics
Streaming AnalyticsStreaming Analytics
Streaming Analytics
 
RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
Would Mr. Spok choose Open Source
Would Mr. Spok choose Open SourceWould Mr. Spok choose Open Source
Would Mr. Spok choose Open Source
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Natural Laws of Software Performance
Natural Laws of Software PerformanceNatural Laws of Software Performance
Natural Laws of Software Performance
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
 

Mais de J Singh

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashingJ Singh
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big dataJ Singh
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 updateJ Singh
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engineJ Singh
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsJ Singh
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceJ Singh
 
Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data LaboratoryJ Singh
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceJ Singh
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduceJ Singh
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitJ Singh
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlJ Singh
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query OptimizationJ Singh
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementJ Singh
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceJ Singh
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index StructuresJ Singh
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceJ Singh
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processingJ Singh
 

Mais de J Singh (20)

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engine
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and Tradeoffs
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/Reduce
 
Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data Laboratory
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map Reduce
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency Control
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index Structures
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and Performance
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processing
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

High Throughput Data Analysis

  • 1. High-throughput data analysis A Streaming Reports Platform Authors J Singh, Early Stage IT David Zheng, Early Stage IT Contributor Satya Gupta, Virsec Systems October 3, 2011
  • 2. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
  • 3. Streaming Data Data arrives continuously Must be processed continuously Emit analysis results or alerts as needed
  • 7. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
  • 8.
  • 9.
  • 10. Requirements Fast inserts into the database Thenature and amount of analysis required was hard to judge in the beginning Previous experience with Map/Reduce in the Google App Engine environment had shown promise but GAE was not appropriate for this application Slick, demo-worthy web interface for presenting results Stream-mode operation Start showing results within a few seconds of starting the Resolve Virtual Machine, and update it periodically as more data is collected and analyzed.
  • 11. High-throughput data analysis A few examples of streaming data problems A concrete problem we solved How we solved it Take-away lessons
  • 12. Key decisions Chunk up data into 1-second “slices” as it arrives Use a collection for signaling the availability of each data slice Process each chunk as it becomes available Use Map/Reduce for analysis Exploit Parallelism of the data by using as many processors as needed to maintain the “flow rate” Pipeline the various Map Reduce jobs to maintain sequentiality of data
  • 13. Pipeline Component: Listener Listener Goal: push the data into MongoDB as fast as possible Receives the data from the Resolve Virtual Machine and stores it into MongoDB Self-describing data 12 different types of data fed over 12 different sockets Written in C++ Socket Interface at one end MongoDB C++ driver at other end
  • 14.
  • 15. Did everything it was supposed to do
  • 16. Allowed us to focus on our problem, not on MongoDB
  • 17.
  • 18. “Function Call Structure” Analysis FnNameTotalTimeSrcFnAddress PID SF_CFA map (emit: FnName, TotalTime, min_addr, NumOfCalls, PID, …) Shuffle stage: FnName: CreateRaceObjects { {TotalTime: 3, min_addr: 2 , NumOfCalls: 1 , PID: 1, …} {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …} {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …} {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …} } reduce Output: {FnName: CreateRaceObjects{TotalTime: 20, min_addr: 1 , NumOfCalls: 2 , PID: 1, …}
  • 19.
  • 20. And requires a page design of its own
  • 21. Tool of choice: DjangoNonRel
  • 22. But Python driver for MongoDB was sufficient for most work.
  • 23.
  • 25. Endpoint Stack Data Capture (Listener) Custom, preferably written in C++ or Java NoSQL Database MongoDB Well suited for high speedinserts Calculation Platform MongoDB Map/Reduce Could use Hadoop but startup times are a concern Presentation Django Non-Rel
  • 26. About Us Involved with Map/Reduce and NoSQL technologies on several platforms Many students in J’s Database Systems class at WPI did a project on a NoSQL database. DataThinks.org is a new service of Early Stage IT Building and operating “Big Data” analytics services Thanks