2. Table of Contents
1) Big Data trends.
2) How Big is your Data?
3) Big Data Potential.
4) Big technologies. New databases.
5) Big quantitative methods. New stats.
6) Big Data temperaments.
7) Is Big always better?
5. Social Media (Facebook & Twitter) has
grown exponentially
[Chart: Facebook vs Twitter, number of active users in thousands (y-axis from 0 to 1,200,000), showing exponential growth]
Facebook started in Feb 2004. Has 1 billion active users.
Twitter started in March 2006. Has 500 million users.
Social networks are creating a huge stream of live unstructured data.
7. 2) How Big is your Data?
• How Tall is it? How large is your sample (rows)?
• How Wide is it? How many variables (columns)?
• What is its Velocity? How frequently is it updated?
• Does it include unstructured data (documents,
emails, Social Media)?
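The first two questions above can be answered mechanically for any tabular file. A minimal sketch, assuming the data is a CSV with a header row; the function name `shape` and the sample text are illustrative, not from the slides:

```python
import csv
import io

def shape(csv_text):
    """Return (tall, wide): the number of data rows and the number of columns."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return len(data), len(header)

# Hypothetical three-column data set with two rows.
sample = "id,age,city\n1,34,Paris\n2,41,Rome\n"
tall, wide = shape(sample)
print(f"tall={tall} rows, wide={wide} columns")
```

Velocity and unstructured content cannot be read off a single file this way; they depend on how the source is updated and what it contains.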
11. Database: Structured vs Unstructured
Structured data: customers, transactions, numbers in rows.
• Database language: SQL (structured query language)
• Database type: Relational database
• Database structure: Data Warehouse, Data Marts
• Reporting tool: Oracle Essbase & IBM Cognos Reporting, Business Intelligence
Unstructured data: Social Media, text documents, Web services.
• Database language: NoSQL (not only SQL)
• Database type: Non-relational database
• Database structure: Hadoop
• Reporting tool: Hadoop Connectors
13. New Stats Map
[Diagram: map of quantitative fields and the methods attached to them]
Fields: Statistics & Regression; Predictive Analytics; Data Mining & Machine Learning (formerly Artificial Intelligence); Optimization; Natural Language Processing.
Methods: A/B Testing (hypothesis testing); Regression; Spatial Analysis; Time Series Analysis; Signal Processing; Association Rule Learning; Cluster Analysis; Classification; Pattern Recognition; Neural Networks; Genetic Algorithms; Sentiment Analysis.
14. Definitions. Part I
Association Rule Learning: method to uncover interesting relationships
by generating and testing possible rules. One application is “market
basket analysis”, where a retailer figures out what products are
frequently bought together. A cited example is that shoppers who buy
diapers often buy beer.
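Market basket analysis usually scores a candidate rule by its support (how often the items appear together) and its confidence (how often the second item appears given the first). A minimal sketch, assuming a handful of invented baskets; the data and function names are illustrative only:

```python
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def frequent_pairs(baskets, min_support):
    """Generate every item pair and keep those that meet the support threshold."""
    items = sorted(set().union(*baskets))
    return [
        (a, b)
        for a, b in combinations(items, 2)
        if support({a, b}, baskets) >= min_support
    ]

print(frequent_pairs(baskets, min_support=0.4))
print(confidence({"diapers"}, {"beer"}, baskets))
```

Real rule miners such as Apriori prune the candidate set instead of testing every pair, but the generate-and-test idea is the same.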
Classification: identifies the categories to which new data belong,
based on an existing data set grouped in predefined categories. It
differs from Cluster Analysis, which starts without predefined categories.
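To make the "predefined categories" point concrete, here is a minimal sketch of one simple classifier, nearest centroid, chosen for brevity rather than taken from the slides. The customer data, labels and function names are hypothetical:

```python
from statistics import mean

# Hypothetical labeled customers: (monthly_spend, visits_per_month) per category.
labeled = {
    "occasional": [(20, 1), (35, 2), (25, 1)],
    "loyal": [(200, 8), (180, 10), (220, 9)],
}

def centroid(points):
    """Average each coordinate across the points of one category."""
    return tuple(mean(coord) for coord in zip(*points))

centroids = {label: centroid(points) for label, points in labeled.items()}

def classify(point):
    """Assign a new point to the category with the closest centroid."""
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

print(classify((30, 2)))
print(classify((190, 9)))
```

A cluster analysis would receive the same points without the "occasional" and "loyal" labels and would have to discover the groups itself.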
Genetic algorithms: an optimization method inspired by the “survival of
the fittest” process. Potential solutions are encoded as “chromosomes”
that can combine and mutate. The chromosomes are selected for
survival within a modeled “environment.” Example: optimizing the
performance of an investment portfolio.
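The encode, combine, mutate and select loop can be sketched on a toy problem. This is an illustration under stated assumptions, not a portfolio optimizer: the chromosome is a list of bits, the fitness function simply counts 1-bits, and all parameter values are arbitrary.

```python
import random

def fitness(chromosome):
    """Toy objective: the number of 1-bits in the chromosome."""
    return sum(chromosome)

def crossover(a, b, rng):
    """Combine two parents at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

def evolve(pop_size=20, length=16, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        # Selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Reproduction: children come from two random survivors, then mutate.
        children = [
            mutate(crossover(*rng.sample(survivors, 2), rng), rate, rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.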
15. Definitions. Part II
Natural language processing (NLP): uses algorithms to analyze text data.
Sentiment Analysis is a common application. It measures customers’
reaction to a product or campaign by analyzing social media posts.
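One of the simplest ways to score sentiment is to count words from a positive and a negative word list. A minimal sketch, assuming a tiny invented lexicon; production systems use far larger lexicons or trained models:

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"hate", "bad", "terrible", "awful"}

def sentiment_score(text):
    """Positive word count minus negative word count; 0 means neutral."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)

posts = [
    "I love this great phone",
    "Terrible battery, bad screen",
    "It is a phone",
]
for post in posts:
    print(sentiment_score(post), post)
```

This approach ignores negation and sarcasm ("not good" scores as positive), which is one reason sentiment from social media is noisy.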
Neural networks: models inspired by the workings of neurons and
synapses within the brain. Used for finding nonlinear patterns. They can
be used for Pattern Recognition and Optimization. Examples of neural
network applications include identifying customers who may leave and
identifying fraudulent insurance claims.
Signal processing: a set of methods from electrical engineering for
analyzing signals (radio, etc…) and separating signal from noise. It is
used to extract the signal from a set of noisy, less precise data [Signal
Detection Theory].
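A moving average is one of the most basic filters for pulling a slow-moving signal out of noisy readings. A minimal sketch; the readings are invented and the window size is an arbitrary choice:

```python
def moving_average(values, window):
    """Smooth `values` by averaging each run of `window` consecutive points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical noisy readings around a slowly rising level.
noisy = [10, 13, 9, 12, 11, 15, 12, 16]
print(moving_average(noisy, window=4))
```

A wider window removes more noise but also blurs genuine rapid changes in the signal.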
16. Definitions. Part III
Spatial Analysis: analyzes geographic location encoded within
the data. The information typically comes from GPS. Applications
include spatial regression to estimate a consumer’s willingness to
purchase a product given their location.
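A common first step in spatial analysis is turning two GPS coordinates into a distance, which can then enter a regression as a variable. A minimal sketch using the haversine formula, assuming a spherical Earth with a mean radius of 6,371 km:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical-Earth approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude along the equator.
print(haversine_km(0, 0, 0, 1))
```

The distance from a customer to the nearest store, computed this way, is the kind of location variable a spatial regression would use.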
17. 6) Big Data Temperaments
Source: Harvard Business Review, April 2012 by Shvetank Shah, Andrew Horne
and Jaime Capella.
19. No! says Nate Silver
•“I came to realize that prediction in the era of Big Data was
not going very well.”
•“If the quantity of information is increasing [exponentially]…
Most of it is just noise.”
•He refers to John P. A. Ioannidis’s 2005
paper: “Why Most Published
Research Findings Are False.”
Two-thirds of scientific papers’ results
can’t be replicated!
“… numbers have no way of speaking for
themselves. We speak for them.”
20. Nate’s targets
• Political pundits. Their “intuitive” election predictions have
been disastrous. Granted, it was not because of Big Data
but instead No Data. He showed them how to do it using
Small Data (polls with samples < 1,000);
• Economic forecasters. They have used Big Data with
poor results. The majority of them can’t forecast a
recession already underway. ECRI predicted with certainty
a double dip recession in 2011 using tens of variables they
did not understand. Instead, the economy improved;
• Stock market & financial market forecasters. Similar
performance to economic forecasters;
• Earthquake forecasting. The field is not well understood.
“… Statistical inferences are much stronger when backed
up by theory… about their root causes.”
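On the Small Data point in the first bullet (polls with samples < 1,000), the standard margin-of-error formula shows why such samples can still be informative. A minimal sketch, assuming a simple random sample and a 95% confidence level; real polls add weighting and design effects that this ignores:

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion `p` estimated from `n` respondents."""
    return z * sqrt(p * (1 - p) / n)

for n in (250, 500, 1000):
    print(n, round(margin_of_error(n) * 100, 1), "percentage points")
```

Precision improves with the square root of the sample size, so quadrupling the sample only halves the margin of error.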
21. No! says Vincent Granville
• Big Data is huge, but information is very sparse;
• Storing and processing the entire data set is very inefficient;
• You can do better by smartly sampling only 5% of the
data;
You don’t need Big Data, you need Smart Data.
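The sampling claim can be illustrated with a plain random sample. This is a sketch under stated assumptions: it uses simple random sampling on synthetic data, which may differ from the "smart" sampling Granville has in mind, and the function name and numbers are invented:

```python
import random
from statistics import mean

def sample_fraction(data, fraction=0.05, seed=0):
    """Draw a simple random sample containing `fraction` of the data."""
    rng = random.Random(seed)
    k = max(1, round(len(data) * fraction))
    return rng.sample(data, k)

# Synthetic "full" data set: 10,000 values around 100.
generator = random.Random(42)
population = [generator.gauss(100, 15) for _ in range(10_000)]
subset = sample_fraction(population)

print(len(subset), round(mean(population), 2), round(mean(subset), 2))
```

For a simple statistic such as the mean, the 5% sample lands close to the full-data value at a twentieth of the processing cost; rare events and fine-grained segments are where a small sample starts to lose information.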
22. Yes! says Chris Anderson
• He quotes Peter Norvig, Google’s research director: “All models
are wrong, and increasingly you can succeed without them.”
• “… with massive data, [the scientific method] is becoming
obsolete.”
• “We can throw the numbers into the biggest computing clusters …
and let statistical algorithms find patterns where science cannot.”
He mentions examples such as J. Craig Venter’s gene sequencing,
Google Search, and Google Translate, among other successes.
“With enough data, the numbers speak for themselves.”
“Correlation supersedes causation, and science can advance without
coherent models, unified theories, or … any … explanation at all.”
23. Big Data Effectiveness Map
Field needing causal understanding, theory not well understood
• Tall data: more data, more noise. Oversampling.
• Wide data: more variables, more false positives. Multicollinearity. Model overfitting.
• Examples: Economics, Financial markets, Earthquake forecasting.
Field needing causal understanding, theory well understood
• Tall data: more data, more signal. Oversampling.
• Wide data: more variables, more explanation. Multicollinearity. Model overfitting.
• Examples: Weather forecasting, Customer behavior.
Rule Based
• Tall and Wide data: more data, better model performance.
• Examples: Games & Sports [Chess, Baseball, etc…], Politics.
Field not needing causal understanding
• Tall and Wide data: more data, better model performance.
• Examples: Google Search, Google Translate, Google Flu-trends, Customer behavior.
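The "more variables, more false positives" entry for wide data can be demonstrated with pure noise. A minimal sketch, assuming a small sample of 30 rows and candidate variables that are random by construction; the function names and sizes are illustrative:

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_spurious_correlation(n_rows, n_vars, seed=0):
    """Strongest |correlation| between a random target and `n_vars` random variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    return max(
        abs(pearson([rng.gauss(0, 1) for _ in range(n_rows)], target))
        for _ in range(n_vars)
    )

for n_vars in (1, 10, 200):
    print(n_vars, round(best_spurious_correlation(30, n_vars), 2))
```

None of these variables has any real relationship with the target, yet the best apparent correlation can only grow as more candidates are screened. That is the mechanism behind false positives and overfitting when width is added without theory.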