A personal perspective of how analytics have evolved from the 80s to current and how it has driven demands on the computing and storage infrastructure. Examples are given from using machine learning ("AI") techniques using neural networks and genetic algorithms in 80s and 90s to Aumnidata's social media analytics in 2008-10 and real-time intent detection by Cruxly from 2011 onwards.
1. Analytics Drives Big Data Drives
Infrastructure
Confessions of Storage turned Analytics Geeks
Dr. Aloke Guha
29th IEEE Conference on Massive Data Storage
May 8th, 2013
aloke@cruxly.com
2. 2
What’s Common Between
a Sensor that could Distinguish a fine Cognac,
and Predicting Movies You’d Like on Netflix?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
11. 11
Typical Analytics: 2000-2006
• Sources: Global , Social
Networks
• Data: Heterogeneous, Numeric,
Text
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch Mode, Social
Media Marketing, Churn
Detection, Sentiment Analysis, etc.
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
12. 2007- : Internet Data Analytics
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 12
13. Financial Risk Scoring: Detect
Risk Scoring: detect incremental change in # occurrences where corporate officers
mention “risk” (or equivalent terms) during earnings call
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 13
14. Financial Risk Scoring: Listen
*Risk Scoring: detect incremental change in occurrences where corporate officers
mention “risk” (or semantically equivalent terms) during the corporate earnings call
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 14
15. Banking: Credit Worthiness – remember 2008?
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank
indexes, groups of banks, or individual banks
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 15
16. Share of Voice: Online Buzz
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 16
29. 29
Data Analytics Demands . . .
Store
Process
Analyze
View
Store
Process
Analyze
View
Storm
Data Collector
Text / Sensor Data/ Stream . . .
NLP
Classify
Index
Query/ RT Query
Ad Hoc/ Search/ SQL
Custom Analytics
Dashboards
Chart
Report
Machine
Learning
Library
Stats
Library
R
Yarn
30. Storage Implications: Back to the Future
MB/s – Batch
IOPs – Stream
Both?
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013 30
31. Storage Implications: Back to the Future II, III
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
Task
tracker
Task
tracker
Task
tracker
Job Tracker
Zookeeper
Hive
Pig
Oozie
HUE
HDFS clientData Node Data Node Data Node
Name Node
MapReduceHDFS
Master Slave #1 Slave #N Mgmt Node
Storage Capacity Scaling?
31
Storage Tiering?
Import/Export Data?
32. A More General Data Analytics Framework?
Data
Ingesters
(Basic)
Data
Ingesters
(Smart)
Content StoreMetadata / In-Mem
Store
Processing
Stream and Batch
Data Ingesters
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
AnalyticsProcessing
SensorProcessing:DataIntegration
VisualizationLibrary/InteractiveQuery
LocalStorage/Flash/DAS
MapReduce/DistributedDataStore
32
33. 33
Conclusion
• Data Analytics ⇒ Big Data ⇒ Scale-Out
• Variety ⇒ Infrastructure
• Volume ⇒ Bandwidth Support
• Velocity ⇒ Streaming Support
• We Solved the Processing Problem
• We Need to Solve the Larger Storage Problem
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013
34. 34
Grateful Acknowledgements
• Kapil Tundwal
• Dr. Kirill Kireyev
• Dr. Andrew Lampert
• Venky Madireddy
• Dr. Shumin Wu
• Joan Wrabetz
Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013