SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
Analytics Drives Big Data Drives
Infrastructure
Confessions of Storage turned Analytics Geeks
Dr. Aloke Guha
29th IEEE Conference on Massive Data Storage
May 8th, 2013
aloke@cruxly.com
2
What’s Common Between
a Sensor that could Distinguish a fine Cognac,
and Predicting Movies You’d Like on Netflix?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
The Sommelier “Robot”
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 3
Predicting What Movies You’d Watch
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 4
5
(Analytics, BigData, DataStore)+
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
6
Many Analytics Techniques . . .
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
Statistics
Regression
Linear
Time-Series
Decision
Trees
R
AI (McCarthy) 1956
Expert
Systems
Machine Learning
Neural
Networks
SVM
LDA
Naïve
Bayes K-nearest
neighbor
Random
Forests
. . .
Genetic
Algorithms
Random
Forests
SNARC (Minsky) 1951
Dendral (Feigenbaum) 1965
Fraser and Burnell (1970)
. . . Vapnik (1992)
Ihaka and Gentleman (1993)
7
Common Analytics Processing pre-2000
• Sources: Local
• Data: Numeric, Homogeneous
• Processing: Local
• Consumer: Local
• Analytics: Linear/Non-Linear Regression,
Neural Networks, SVM, LDA, LSA,
Decision Trees, Monte Carlo, Lin-Ops,
Expert Systems . . .
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
Flavor Predictor – Neural Networks
USPTO #5,373,452 (1994) 1988
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 8
Pattern Recognition – Genetic Algorithms
US PTO #5,140,530, 1992
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 9
10
Small to Big
http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
11
Typical Analytics: 2000-2006
• Sources: Global , Social
Networks
• Data: Heterogeneous, Numeric,
Text
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch Mode, Social
Media Marketing, Churn
Detection, Sentiment Analysis, etc.
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
2007- : Internet Data Analytics
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 12
Financial Risk Scoring: Detect
Risk Scoring: detect incremental change in # occurrences where corporate officers
mention “risk” (or equivalent terms) during earnings call
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 13
Financial Risk Scoring: Listen
*Risk Scoring: detect incremental change in occurrences where corporate officers
mention “risk” (or semantically equivalent terms) during the corporate earnings call
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 14
Banking: Credit Worthiness – remember 2008?
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank
indexes, groups of banks, or individual banks
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 15
Share of Voice: Online Buzz
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 16
Sentiment Analysis
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 17
18
Analytics Processing: 2007-
• Sources: Global, Mobile,
New Social (Instagram, . . )
• Data: Multi-Dimensional,
Heterogeneous, Audio/Video
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch, Streaming, . . .
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
2008 - : Real-Time/Streaming Analytics
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 19
Brand Marketing
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 20
Brand Management
21
Customer Support
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 22
Customer Support
23
24
Lead Generation
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
. . . More Data, Faster
http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 25
“Internet of Things”
http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for-
m2m-technology-to-drive-connected-smarter-cities/
Message Queuing Telemetry Transport
Machine-to-Machine
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 26
27
AumniData: Batch Processing
Data Collector
(Batch Scheduled)
Twitter Blog/Web Site
Data Collector
(Batch Scheduled)
RSS/ATOM
Feed
Requestor/
URL Scanner
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP Stack+ AumniData
Classifier + Analytics*
(RackSpace VM)
Dashboard
Application
(.3rd party App)
Blog/Web Site
Blog/Web SiteYouTube
Dashboard
Configuration
(TomCat)
Custom Analytics
Display
Ad-Hoc Query
Summary
Data Collector
(Batch Scheduled)
Content
Store
Content /
Metadata
Index
(MySQL)
Dashboard
Store
(SQL Server)
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
28
Cruxly: Stream Processing
Streaming API Client
(Heroku Worker)
(24x7)
Streaming API Client
(Heroku Worker)
(24x7)
NLP+ Cruxly Intent
Detection
(AWS)
Streaming API Client
(Heroku Worker)
(24x7)
Tweets
(Keywords)
Request
(Keywords)
Tweets
(Keywords) Tweet ID + Intent
Signal
(Heroku
PostgresSQL)
Tweets
Content Store
(DynamoDB)
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP+ Cruxly Intent
Detection
(AWS)
NLP (NER, etc + Cruxly
Intent Detection
(AWS)
Reports / Dashboard
Tracker Editor
(web app - Heroku)
Twitter
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
29
Data Analytics Demands . . .
Store
Process
Analyze
View
Store
Process
Analyze
View
Storm
Data Collector
Text / Sensor Data/ Stream . . .
NLP
Classify
Index
Query/ RT Query
Ad Hoc/ Search/ SQL
Custom Analytics
Dashboards
Chart
Report
Machine
Learning
Library
Stats
Library
R
Yarn
Storage Implications: Back to the Future
MB/s – Batch
IOPs – Stream
Both?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 30
Storage Implications: Back to the Future II, III
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
Task
tracker
Task
tracker
Task
tracker
Job Tracker
Zookeeper
Hive
Pig
Oozie
HUE
HDFS clientData Node Data Node Data Node
Name Node
MapReduceHDFS
Master Slave #1 Slave #N Mgmt Node
Storage Capacity Scaling?
31
Storage Tiering?
Import/Export Data?
A More General Data Analytics Framework?
Data
Ingesters
(Basic)
Data
Ingesters
(Smart)
Content StoreMetadata / In-Mem
Store
Processing
Stream and Batch
Data Ingesters
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
AnalyticsProcessing
SensorProcessing:DataIntegration
VisualizationLibrary/InteractiveQuery
LocalStorage/Flash/DAS
MapReduce/DistributedDataStore
32
33
Conclusion
• Data Analytics ⇒ Big Data ⇒ Scale-Out
• Variety ⇒ Infrastructure
• Volume ⇒ Bandwidth Support
• Velocity ⇒ Streaming Support
• We Solved the Processing Problem
• We Need to Solve the Larger Storage Problem
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
34
Grateful Acknowledgements
• Kapil Tundwal
• Dr. Kirill Kireyev
• Dr. Andrew Lampert
• Venky Madireddy
• Dr. Shumin Wu
• Joan Wrabetz
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

Mais conteúdo relacionado

Destaque

Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupCaserta
 
4 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.14
4 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.144 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.14
4 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.14Daniel Bianchini
 
How to tackle big data from a security
How to tackle big data from a securityHow to tackle big data from a security
How to tackle big data from a securityTyrone Systems
 
The Future of Telecom (Petro Chernyshov Business Stream)
The Future of Telecom (Petro Chernyshov Business Stream)The Future of Telecom (Petro Chernyshov Business Stream)
The Future of Telecom (Petro Chernyshov Business Stream)IT Arena
 
7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data Platform7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data PlatformHarshal Deo (HD)
 
The Business Benefits of a Data-Driven, Self-Service BI Organization
The Business Benefits of a Data-Driven, Self-Service BI OrganizationThe Business Benefits of a Data-Driven, Self-Service BI Organization
The Business Benefits of a Data-Driven, Self-Service BI OrganizationLooker
 
CES 2014 - Autonomous Connected Vehicles
CES 2014 - Autonomous Connected VehiclesCES 2014 - Autonomous Connected Vehicles
CES 2014 - Autonomous Connected VehiclesAndreas Mai
 
Data Driven Decision Making Presentation
Data Driven Decision Making PresentationData Driven Decision Making Presentation
Data Driven Decision Making PresentationRussell Kunz
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessCloudera, Inc.
 
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...Lora Cecere
 

Destaque (12)

Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing Meetup
 
4 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.14
4 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.144 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.14
4 Steps to Building a Data-Driven Strategy - White Exchange - 24.11.14
 
Big Data
Big DataBig Data
Big Data
 
How to tackle big data from a security
How to tackle big data from a securityHow to tackle big data from a security
How to tackle big data from a security
 
The Future of Telecom (Petro Chernyshov Business Stream)
The Future of Telecom (Petro Chernyshov Business Stream)The Future of Telecom (Petro Chernyshov Business Stream)
The Future of Telecom (Petro Chernyshov Business Stream)
 
7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data Platform7 Characteristics of a Bad (Big) Data Platform
7 Characteristics of a Bad (Big) Data Platform
 
Data Strategy
Data StrategyData Strategy
Data Strategy
 
The Business Benefits of a Data-Driven, Self-Service BI Organization
The Business Benefits of a Data-Driven, Self-Service BI OrganizationThe Business Benefits of a Data-Driven, Self-Service BI Organization
The Business Benefits of a Data-Driven, Self-Service BI Organization
 
CES 2014 - Autonomous Connected Vehicles
CES 2014 - Autonomous Connected VehiclesCES 2014 - Autonomous Connected Vehicles
CES 2014 - Autonomous Connected Vehicles
 
Data Driven Decision Making Presentation
Data Driven Decision Making PresentationData Driven Decision Making Presentation
Data Driven Decision Making Presentation
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
Imagining Supply Chain Processes Outside-in. Building Value Networks at IBM t...
 

Último

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Analytics Drives Big Data Drives Infrastructure

  • 1. Analytics Drives Big Data Drives Infrastructure Confessions of Storage turned Analytics Geeks Dr. Aloke Guha 29th IEEE Conference on Massive Data Storage May 8th, 2013 aloke@cruxly.com
  • 2. 2 What’s Common Between a Sensor that could Distinguish a fine Cognac, and Predicting Movies You’d Like on Netflix? Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 3. The Sommelier “Robot” Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 3
  • 4. Predicting What Movies You’d Watch Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 4
  • 5. 5 (Analytics, BigData, DataStore)+ Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 6. 6 Many Analytics Techniques . . . Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 Statistics Regression Linear Time-Series Decision Trees R AI (McCarthy) 1956 Expert Systems Machine Learning Neural Networks SVM LDA Naïve Bayes K-nearest neighbor Random Forests . . . Genetic Algorithms Random Forests SNARC (Minsky) 1951 Dendral (Feigenbaum) 1965 Fraser and Burnell (1970) . . . Vapnik (1992) Ihaka and Gentleman (1993)
  • 7. 7 Common Analytics Processing pre-2000 • Sources: Local • Data: Numeric, Homogeneous • Processing: Local • Consumer: Local • Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . . Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 8. Flavor Predictor – Neural Networks USPTO #5,373,452 (1994) 1988 Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 8
  • 9. Pattern Recognition – Genetic Algorithms US PTO #5,140,530, 1992 Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 9
  • 11. 11 Typical Analytics: 2000-2006 • Sources: Global , Social Networks • Data: Heterogeneous, Numeric, Text • Processing: Hosted/Scale • Consumer: Global • Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc. Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 12. 2007- : Internet Data Analytics Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 12
  • 13. Financial Risk Scoring: Detect Risk Scoring: detect incremental change in # occurrences where corporate officers mention “risk” (or equivalent terms) during earnings call Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 13
  • 14. Financial Risk Scoring: Listen *Risk Scoring: detect incremental change in occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 14
  • 15. Banking: Credit Worthiness – remember 2008? Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 15
  • 16. Share of Voice: Online Buzz Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 16
  • 17. Sentiment Analysis Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 17
  • 18. 18 Analytics Processing: 2007- • Sources: Global, Mobile, New Social (Instagram, . . ) • Data: Multi-Dimensional, Heterogeneous, Audio/Video • Processing: Hosted/Scale • Consumer: Global • Analytics: Batch, Streaming, . . . Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 19. 2008 - : Real-Time/Streaming Analytics Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 19
  • 20. Brand Marketing Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 20
  • 22. Customer Support Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 22
  • 24. 24 Lead Generation Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 25. . . . More Data, Faster http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 25
  • 26. “Internet of Things” http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for- m2m-technology-to-drive-connected-smarter-cities/ Message Queuing Telemetry Transport Machine-to-Machine Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 26
  • 27. 27 AumniData: Batch Processing Data Collector (Batch Scheduled) Twitter Blog/Web Site Data Collector (Batch Scheduled) RSS/ATOM Feed Requestor/ URL Scanner NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP Stack+ AumniData Classifier + Analytics* (RackSpace VM) Dashboard Application (.3rd party App) Blog/Web Site Blog/Web SiteYouTube Dashboard Configuration (TomCat) Custom Analytics Display Ad-Hoc Query Summary Data Collector (Batch Scheduled) Content Store Content / Metadata Index (MySQL) Dashboard Store (SQL Server) Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 28. 28 Cruxly: Stream Processing Streaming API Client (Heroku Worker) (24x7) Streaming API Client (Heroku Worker) (24x7) NLP+ Cruxly Intent Detection (AWS) Streaming API Client (Heroku Worker) (24x7) Tweets (Keywords) Request (Keywords) Tweets (Keywords) Tweet ID + Intent Signal (Heroku PostgresSQL) Tweets Content Store (DynamoDB) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP+ Cruxly Intent Detection (AWS) NLP (NER, etc + Cruxly Intent Detection (AWS) Reports / Dashboard Tracker Editor (web app - Heroku) Twitter Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 29. 29 Data Analytics Demands . . . Store Process Analyze View Store Process Analyze View Storm Data Collector Text / Sensor Data/ Stream . . . NLP Classify Index Query/ RT Query Ad Hoc/ Search/ SQL Custom Analytics Dashboards Chart Report Machine Learning Library Stats Library R Yarn
  • 30. Storage Implications: Back to the Future MB/s – Batch IOPs – Stream Both? Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 30
  • 31. Storage Implications: Back to the Future II, III Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 Task tracker Task tracker Task tracker Job Tracker Zookeeper Hive Pig Oozie HUE HDFS clientData Node Data Node Data Node Name Node MapReduceHDFS Master Slave #1 Slave #N Mgmt Node Storage Capacity Scaling? 31 Storage Tiering? Import/Export Data?
  • 32. A More General Data Analytics Framework? Data Ingesters (Basic) Data Ingesters (Smart) Content StoreMetadata / In-Mem Store Processing Stream and Batch Data Ingesters Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 AnalyticsProcessing SensorProcessing:DataIntegration VisualizationLibrary/InteractiveQuery LocalStorage/Flash/DAS MapReduce/DistributedDataStore 32
  • 33. 33 Conclusion • Data Analytics ⇒ Big Data ⇒ Scale-Out • Variety ⇒ Infrastructure • Volume ⇒ Bandwidth Support • Velocity ⇒ Streaming Support • We Solved the Processing Problem • We Need to Solve the Larger Storage Problem Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
  • 34. 34 Grateful Acknowledgements • Kapil Tundwal • Dr. Kirill Kireyev • Dr. Andrew Lampert • Venky Madireddy • Dr. Shumin Wu • Joan Wrabetz Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013