1. < sen·ti·ment >
the prevailing attitude of investors as to anticipated
price development in a market.
Tim Harbers, CTO SNTMNT
DataScienceNL Meetup November 8th 2012
2. Tim Harbers
Background
BSc Computer Science
MSc Computer Science
Researcher
Data Miner
Technical Consultant
Co-Founder and COO
Co-Founder and CTO
3. The Rockstars
Vincent van Leeuwen, Customer Development
Kees van Nunen, Product Development
Durk Kingma, Data Mining Expert
Tim Harbers, Machine Learning Expert
‣ Balanced multidisciplinary team
‣ Two machine learning experts in predictive analysis and large datasets
‣ Academic degrees in Behavioral Finance, Portfolio Finance, Strategic Management & Artificial Intelligence
‣ Strong network in (Dutch) financial industry
‣ Young, enthusiastic team with a proven entrepreneurial mindset
5. Our solution:
Predicting stock price movement
based on online buzz
Engineered based on academic research:
Van Leeuwen (2011)
Bollen et al. (2010)
Sprenger and Welpe (2010)
Sehgal and Song (2007)
6. Why would this work?
Very different from traditional indicators
News travels faster via social than traditional media
Tremendous amount of data
(Almost) nobody uses it yet
7. Why focus on Twitter?
Public data & easily accessible
Structured language
400M tweets per day
8. Historic Research
Bollen (2010)
Created a model based on Twitter mood states, which
was 86% accurate on the DJI.
Sprenger and Welpe (2011)
Analyzed correlation of the stock market and micro
blogs
9. Financial Sentiment vs Brand Sentiment
Financial Sentiment        | Brand Sentiment
Tweets relating to stocks  | Tweets relating to brands
Written by traders         | Written by consumers
Trader mumbo jumbo         | Any language
More relevant              | Larger dataset
Shorter term               | Longer term
10. Data setup
Period
June 2010 to April 2012
Stocks
Top 15 most tweeted stocks in S&P 500
Tweets
Financial dataset, Timm Sprenger (4 million tweets)
Topsy brand tweets (100+ million tweets)
Other
Klout
PeerIndex
16. Naive Approach: Dictionaries
Use a dictionary of common positive and negative
terms
Count the number of positive and negative terms
Use the difference between the two.
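A minimal sketch of the dictionary approach described on this slide. The two word lists are tiny hypothetical stand-ins, not the dictionary used in the talk; a real lexicon would hold thousands of terms.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"bullish", "buy", "up", "gain", "strong", "beat"}
NEGATIVE = {"bearish", "sell", "down", "loss", "weak", "miss"}

def dictionary_score(tweet):
    """Return (#positive terms) - (#negative terms) for one tweet."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    positives = sum(1 for t in tokens if t in POSITIVE)
    negatives = sum(1 for t in tokens if t in NEGATIVE)
    return positives - negatives

score = dictionary_score("Bullish on $AAPL, strong buy")
```

A score above zero reads as positive, below zero as negative. The approach ignores negation and context ("not bullish" still counts as positive), which is one reason it is called naive.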
17. SNTMNT’s approach: machine learning
Label a training set of tweets (target)
Use preprocessing techniques
Use several feature extractors
Create a sparse dataset.
Use supervised learning to train a machine learning
model.
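The steps above can be sketched end to end in plain Python. This is an illustration under assumptions, not SNTMNT's actual pipeline: the placeholder tokens, the unigram and bigram features, the Naive Bayes model, and the four training tweets are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(tweet):
    """Lowercase, then replace URLs, @mentions and $tickers with placeholders."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " URL ", text)
    text = re.sub(r"@\w+", " USER ", text)
    text = re.sub(r"\$[a-z]+", " TICKER ", text)
    return re.findall(r"[A-Za-z']+", text)

def extract_features(tokens):
    """Sparse feature vector: only non-zero unigram and bigram counts are stored."""
    features = Counter(tokens)
    features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (one possible model)."""

    def fit(self, feature_dicts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_dicts, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for name, value in feats.items():
                if name not in self.vocab:  # skip features never seen in training
                    continue
                score += value * math.log((self.feature_counts[label][name] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labeled training set (the target).
train = [
    ("$AAPL looking bullish, buying more", "positive"),
    ("Strong earnings beat for $GOOG", "positive"),
    ("$MSFT looks weak, selling today", "negative"),
    ("Bearish on $AAPL after the miss", "negative"),
]
X = [extract_features(preprocess(text)) for text, _ in train]
y = [label for _, label in train]
model = NaiveBayes().fit(X, y)

prediction = model.predict(extract_features(preprocess("Buying more $GOOG, looking strong")))
```

Storing each tweet as a dictionary of non-zero counts is what keeps the dataset sparse: most of the vocabulary never occurs in any single 140-character tweet.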
18. Labeling
• 25K Financial tweets hand labeled
• 30K Commercial tweets hand labeled
• 1M #happy vs. #sad
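The slide does not say how the 1M #happy vs. #sad set was built. One common reading is that the hashtag itself serves as the label, so no hand labeling is needed. A sketch under that assumption (the function name and the rule for skipping ambiguous tweets are hypothetical):

```python
import re

def hashtag_label(tweet):
    """Return (label, text without the marker hashtag), or None if ambiguous."""
    tags = {t.lower() for t in re.findall(r"#\w+", tweet)}
    has_happy, has_sad = "#happy" in tags, "#sad" in tags
    if has_happy == has_sad:  # neither or both hashtags present: skip
        return None
    label = "positive" if has_happy else "negative"
    # Remove the marker so a model cannot simply learn the hashtag itself.
    text = re.sub(r"#(happy|sad)\b", "", tweet, flags=re.IGNORECASE)
    return label, " ".join(text.split())

example = hashtag_label("Got the job! #happy")
```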
20. Results
Financial tweets
84.3% accurate on 2-point scale (Baseline: 60.4%)
76.8% accurate on 3-point scale (Baseline: 65.0%)
Beat Lexalytics (84.3% vs. 70.3%)
Commercial tweets
84.7% accurate on 2-point scale (Baseline: 61.0%)
86.9% accurate on 3-point scale (Baseline: 81.1%)
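The slides do not define the baseline. A frequent choice is the majority-class baseline, the accuracy obtained by always predicting the most common label, and the sketch below assumes that reading. The labels are made-up toy data.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that match the hand-assigned label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(actual):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

# Toy data, not from the talk.
actual = ["pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "neg"]
model_accuracy = accuracy(predicted, actual)
baseline_accuracy = majority_baseline(actual)
```

Under this reading, a high baseline means the classes are imbalanced, so the gap between model and baseline is more informative than the raw accuracy.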
22. Stock Regression
Input:
Sentiment scores
Mood states
Meta Data
Stock
Output:
Trading Indication
Confidence
23. Many dimensions
Tweet period
Trading period
Financial Tweets or Commercial Tweets
Tweet Crunchers
Models
Trading strategy
24. Tweet Aggregation Problem
Tweet volume
Volume positive tweets
Avg sentiment
Sentiment Growth
Etc.
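A sketch of how per-tweet sentiment scores might be aggregated into the features listed above, one row per day. The daily bucket, the score range of -1 to 1, and the definition of sentiment growth as the day-over-day change in average sentiment are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

def aggregate_by_day(scored_tweets):
    """scored_tweets: iterable of (day, sentiment) pairs, sentiment in [-1, 1]."""
    by_day = defaultdict(list)
    for day, sentiment in scored_tweets:
        by_day[day].append(sentiment)

    rows, previous_avg = [], None
    for day in sorted(by_day):
        scores = by_day[day]
        avg = sum(scores) / len(scores)
        rows.append({
            "day": day,
            "tweet_volume": len(scores),
            "positive_volume": sum(1 for s in scores if s > 0),
            "avg_sentiment": avg,
            # Assumed definition: change in average sentiment versus the previous day.
            "sentiment_growth": None if previous_avg is None else avg - previous_avg,
        })
        previous_avg = avg
    return rows

rows = aggregate_by_day([
    (date(2012, 4, 2), 0.5),
    (date(2012, 4, 2), -0.5),
    (date(2012, 4, 3), 1.0),
])
```

Each choice here (bucket size, which tweets count as positive, how growth is measured) is one of the many dimensions that have to be explored.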
25. Machine Learning Models
Linear Regression
Bayesian Approaches
Decision Trees
Neural Nets
Support Vector Machines
26. Results
R² < 0.01
Not usable as an independent trading model after
transaction costs.
Still usable as an extra indicator to be used by proven
trading models.
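R² is the fraction of the variance in the target that a model explains, so a value below 0.01 means less than 1% of the variation is captured. A minimal sketch of the computation for a one-variable least-squares fit follows. The function names and the toy numbers are illustrative only; the talk's models used more inputs than a single feature.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(y, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data with a perfect linear relation, so R² is 1 here.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
slope, intercept = fit_line(x, y)
r2 = r_squared(y, [slope * xi + intercept for xi in x])
```

A model that always predicts the mean of the target scores an R² of exactly 0, which puts the reported result only marginally above that reference point.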
27. Products - next steps:
Sentiment APIs (B2B)
Stock Dashboard (B2B2C)
Trading Indicator API (B2B)
‣ Market leader and thought leader financial sentiment analysis.
‣ Extend scope to further niche domains and languages.
‣ Getting more insights into added value of SNTMNT algorithm as indicator next to fundamental and technical analysis.