2. Challenges
• how do you decide which variables capture
what happened?
• /when/ did an event happen?
• what is the effect of the event on the variables?
• Can we construct a UI and an algorithm to tackle
all three problems simultaneously?
Thursday, November 14, 13
3. Goals
• The goal is to feed in raw data as the sole
input, and obtain answers to all three
questions:
• (1) when did an event likely occur
• (2) what variables can we use to measure the
event
• (3) what was the effect of the event on those
variables
6. With limited insight…
• if we know timing and the important
variables, we can measure the effect of the
shock on the variables. (standard regression
techniques)
• if we know the set of important variables and
track variables over time, we can identify
timing of shocks.
• if we know timing and have a long history of
variable evolution, we can cluster variables by
their behavior at the important point in time
(relative to other points in time)
8. Methodology
• For every time T and variable K, run an OLS regression under the hypothesis that a
shock occurred at time T to variable K
• The sample is restricted to observations of variable K in a neighborhood around T,
i.e. t in [T - bandwidth, T + bandwidth]:
Y(K,t) = A(K,T) + B(K,T) S(t) + e(K,t)
where S(t) = 1(t > T) is an indicator and T is the candidate break time to test
• Results are stored as the matrix of coefficients B(K,T)
• OLS estimates of B(K,T) are biased towards zero to the extent that S(t) is
misspecified
• In other words, B(K,T) will, in expectation, be maximally different from zero (and
unbiased) at the true break T
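The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `break_effects` and the list-of-lists layout are assumptions. It relies on a standard fact: when the only regressor besides the intercept is the dummy S(t) = 1(t > T), the OLS slope equals the mean of Y after T minus the mean of Y up to and including T, so no matrix algebra is needed.

```python
def break_effects(series, bandwidth):
    """Return B[k][T] for every variable k and candidate break time T.

    series: list of K lists, one per variable, all of the same length.
    bandwidth: half-width of the window [T - bandwidth, T + bandwidth]; must be >= 1.
    Entries are None where the window does not fit inside the sample.
    """
    effects = []
    for y in series:
        n = len(y)
        row = [None] * n
        for T in range(bandwidth, n - bandwidth):
            before = y[T - bandwidth : T + 1]     # t in [T - bandwidth, T], S(t) = 0
            after = y[T + 1 : T + bandwidth + 1]  # t in (T, T + bandwidth], S(t) = 1
            # OLS slope on a single dummy = difference in group means
            row[T] = sum(after) / len(after) - sum(before) / len(before)
        effects.append(row)
    return effects


# Toy series with a level shift after t = 4 (illustrative data only)
B = break_effects([[0, 0, 0, 0, 0, 5, 5, 5, 5, 5]], bandwidth=2)
```

On this toy series the estimate is largest at T = 4 and shrinks towards zero as the candidate T moves away from the true break, which is the behaviour the last two bullets describe.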
10. Answers
• when did an event likely occur?
– aggregate (sum) effects across all variables
• what variables can we use to measure the
event?
– which variables had the largest effect at that time point?
• what was the effect of the event on those
variables?
– we just measured that
• what variables move together often across
time?
– show similar variables
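The first two answers can be read off a coefficient matrix B(K,T) roughly as follows. This is a hedged sketch: the slide only says effects are aggregated by summing, so using absolute values (so that positive and negative shifts do not cancel) is an assumption, as are the function name `summarize` and its parameters.

```python
def summarize(effects, names, top=3):
    """Pick the most likely event time and the variables that moved most.

    effects[k][T]: estimated shift for variable k at candidate time T
                   (None where no estimate is available).
    names[k]: label of variable k.
    """
    n_times = len(effects[0])
    # Aggregate effect per time point: sum of absolute shifts (assumption)
    totals = [
        sum(abs(row[T]) for row in effects if row[T] is not None)
        for T in range(n_times)
    ]
    event_time = max(range(n_times), key=lambda T: totals[T])
    # Rank variables by the size of their shift at the chosen time
    ranked = sorted(
        (k for k in range(len(effects)) if effects[k][event_time] is not None),
        key=lambda k: abs(effects[k][event_time]),
        reverse=True,
    )
    return event_time, [(names[k], effects[k][event_time]) for k in ranked[:top]]


# Illustrative coefficients for three hypothetical variables
toy = [
    [None, 0.5, 4.0, 1.0, None],
    [None, -0.2, -6.0, 0.3, None],
    [None, 0.1, 0.5, 0.2, None],
]
when, movers = summarize(toy, ["a", "b", "c"], top=2)
```

The signed coefficients returned alongside each name are the answer to the third question: the measured effect of the event on those variables.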
11. Example 1: Super Bowl tweets
• Twitter streaming API (every tweet)
• Sample of data selected from Sunday,
February 3, 1600-2210 hours
• Binned into minute-by-minute word counts
• Out of 651k 1-grams, kept the 1035 least sparse
(> 30% sparse) words.
• Input data is a 371x1035 matrix (minutes x words)
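The preparation steps above (binning into per-minute word counts, then keeping the least sparse words) might look like the sketch below. It assumes tweets arrive as hypothetical `(minute_index, text)` pairs and that words are split on whitespace. The real tokenization and the exact sparsity cutoff are not given in the slides, so the sketch simply keeps the `n_keep` words that appear in the most minutes.

```python
from collections import Counter


def minute_word_counts(tweets, n_minutes):
    """Bin tweets into one word-count Counter per minute."""
    bins = [Counter() for _ in range(n_minutes)]
    for minute, text in tweets:
        bins[minute].update(text.lower().split())  # naive whitespace tokenizer
    return bins


def keep_least_sparse(bins, n_keep):
    """Keep the n_keep words present in the most minutes; return (vocab, matrix).

    matrix[t][j] is the count of vocab[j] in minute t (minutes x words).
    """
    minutes_present = Counter()
    for counts in bins:
        minutes_present.update(counts.keys())  # +1 per minute in which a word occurs
    # Least sparse first; ties broken alphabetically for a stable result
    vocab = sorted(minutes_present, key=lambda w: (-minutes_present[w], w))[:n_keep]
    matrix = [[counts[w] for w in vocab] for counts in bins]
    return vocab, matrix


# Made-up tweets, for illustration only
tweets = [(0, "go ravens"), (0, "go niners"), (1, "lights out"), (2, "go lights")]
vocab, matrix = keep_least_sparse(minute_word_counts(tweets, 3), n_keep=2)
```

With the deck's numbers, the same shape of output would be a 371-row, 1035-column matrix.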
13. Network graph of variables with correlations > .95
[Figure: network graph; labeled clusters: Halftime show, Power outage]
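A graph like this can be built by linking every pair of variables whose correlation exceeds a threshold. The slides do not say which series the correlations are computed on (raw counts or estimated effects), so the sketch below takes any list of equal-length series. The names `pearson` and `correlation_edges` are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant series: correlation undefined, treated as no link
    return sxy / sqrt(sxx * syy)


def correlation_edges(series, names, threshold):
    """Return (name_i, name_j, r) for every pair with correlation above threshold."""
    edges = []
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            r = pearson(series[i], series[j])
            if r > threshold:
                edges.append((names[i], names[j], r))
    return edges


# Toy data: "a" and "b" rise together, "c" falls
edges = correlation_edges(
    [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1]], ["a", "b", "c"], threshold=0.95
)
```

The resulting edge list can be handed to any graph-drawing tool. Connected groups of nodes then correspond to clusters such as the ones labeled on the slide.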
14. Deployable and Repeatable
• The model only requires data to be
transformed to a KxT matrix.
– K variables
– T time periods
We could use this model on many other data sets!
• minute-by-minute word count in twitter
• stock prices
• chatter on social media forums
15. Network graph of forums with correlations > .27
[Figure: network graph; labeled cluster: Hezbollah]
16. Future improvements
• OLS is simple and efficient, but other models
may be more accurate at estimating effects in
some cases
• explore different approaches to choosing
which variables to consider and to
aggregating variable effects
• massively parallel estimation on all 630k words
simultaneously?
• real-time analytics on streaming data