New Data Sources for Statistics, Social media: Twitter.
1. New Data Sources for Statistics: Experiences at Statistics Netherlands
Social media: Twitter
Piet Daas, Marko Roos, Mark van de Ven and Joyce Neroni
Statistics Netherlands
AAPOR 2012
2. Why are we interested in data sources such as Twitter?
• All National Statistical Institutes use:
• Survey data
• Sometimes also Administrative data
• But there are other sources of information out there
(in increasing numbers: Big Data)
• Can they be used for statistics?
• Burden and cost reduction
• Try it!
• Innovative research is greatly stimulated
AAPOR 2012: Twitter as a potential data source for statistics
3. Why study Twitter?
Maps by Eric Fischer (via Fast Company)
4. About Twitter
• Twitter is used intensively in the Netherlands
• Relatively easily accessible (text) data
• Potential source of personal information,
opinions, and sentiments
• But what kind of information is actually
discussed?
1) Identify the topics discussed in the Netherlands
• In public tweets only
2) Is this information useful?
5. Start with collecting data
• How?
• Tried several ways
• Best option was to:
1) Collect usernames
2) Identify ‘Dutch’ users
3) Collect tweets from Dutch users
4) Identify topics in those tweets
6. 1) Collect usernames
• Breadth-first algorithm / snowball sampling
• Started with a user with many followers
• A famous Dutch politician with 79,798 followers
• Collect the followers of her followers etc.
• Using the Twitter REST API, 12 user accounts and PHP scripts
• After 4 weeks we obtained
• 4,413,391 unique users (IDs)
• Collected user id, username, location and profile information
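The breadth-first (snowball) collection described above can be sketched as follows. This is a minimal illustration in Python, not the PHP scripts the authors used: `get_followers` is a hypothetical stand-in for a call to the follower endpoint of the API, and the toy follower graph and `max_users` cap are made up for the example.

```python
from collections import deque
from typing import Callable, Iterable, Set


def snowball_sample(
    seed: str,
    get_followers: Callable[[str], Iterable[str]],
    max_users: int,
) -> Set[str]:
    """Breadth-first crawl: start at a seed user, then visit followers level by level."""
    seen = {seed}
    queue = deque([seed])
    while queue and len(seen) < max_users:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
                if len(seen) >= max_users:
                    break
    return seen


# Hypothetical toy follower graph (user -> followers), for illustration only.
TOY_FOLLOWERS = {
    "politician": ["anna", "bram"],
    "anna": ["bram", "chris"],
    "bram": ["dana"],
    "chris": [],
    "dana": ["politician"],
}


def toy_get_followers(user: str) -> list:
    return TOY_FOLLOWERS.get(user, [])


users = snowball_sample("politician", toy_get_followers, max_users=100)
print(sorted(users))
```

The `seen` set is what keeps the user list unique: an account reached through several followers is queued only once.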
7. 2) Identify ‘Dutch’ users
• By using the location information that users provide
• A considerable number of users provide this
• Checked the location names provided
• Inclusion and exclusion list
• A total of 380,415 (~9%) users were identified as
located in the Netherlands
• 38% of the users, 1,661,467, provided no location info
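A check against an inclusion and an exclusion list could look roughly like the sketch below. The slides do not show the actual lists or matching rules, so the terms in `INCLUDE` and `EXCLUDE` and the simple substring matching are assumptions chosen only for illustration.

```python
from typing import Optional

# Hypothetical example terms; the real lists are not given on the slides.
INCLUDE = {"netherlands", "nederland", "holland", "amsterdam", "rotterdam", "utrecht"}
EXCLUDE = {"holland, michigan", "amsterdam, ny", "new amsterdam"}


def classify_location(location: Optional[str]) -> str:
    """Return 'dutch', 'other' or 'no location' for a free-text profile location."""
    if location is None or not location.strip():
        return "no location"
    text = location.strip().lower()
    # The exclusion list is checked first so that look-alike place names
    # outside the Netherlands are not counted as Dutch.
    if any(term in text for term in EXCLUDE):
        return "other"
    if any(term in text for term in INCLUDE):
        return "dutch"
    return "other"


for example in ["Amsterdam", "Holland, Michigan", "", "Paris"]:
    print(repr(example), "->", classify_location(example))
```

Checking exclusions before inclusions is a design choice of this sketch: it lets a broad inclusion term such as "holland" stay on the list without pulling in known false positives.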
8. 3) Collect tweets
• For the 380,415 users the 200 most
recent tweets were collected
• A total of 12,093,065 messages was obtained
• 39% of the users had no ‘tweets’
• Some characteristics
9. 4) Identify topics
• Used 2 approaches
1) Hashtags (1,750,074 with 1 hash, 14.5%)
• Hash sign (#) identifies a ‘keyword’
• E.g. #ned, #fail, #wk2010
• Manual and text-mining approach
2) Non-hashtags (10,330,613 in total, 85.4%)
• Manual (sample)
• Text-mining approach failed here
• Result of the large ‘Other’ group
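Splitting messages into a hashtag group and a non-hashtag group can be sketched with a regular expression. This is only an illustration of the idea: the pattern `#(\w+)` is an assumption about what counts as a hashtag, and the sample tweets are invented.

```python
import re
from typing import List, Tuple

# Assumed definition of a hashtag: '#' followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(tweet: str) -> List[str]:
    """Return the lower-cased hashtags in a tweet, without the '#' sign."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet)]


def split_by_hashtag(tweets: List[str]) -> Tuple[List[str], List[str]]:
    """Separate tweets that contain at least one hashtag from those that contain none."""
    with_tags, without_tags = [], []
    for tweet in tweets:
        if extract_hashtags(tweet):
            with_tags.append(tweet)
        else:
            without_tags.append(tweet)
    return with_tags, without_tags


# Invented example messages.
sample = [
    "What a match! #ned #wk2010",
    "Train delayed again #fail",
    "Had a nice weekend",
]

tagged, untagged = split_by_hashtag(sample)
print(extract_hashtags(sample[0]))
print(len(tagged), len(untagged))
```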
10. Topic identification: Hashtags
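[Figure: horizontal bar chart of Contribution (%) per theme, axis 0 to 50; legend: Hashtags, Non-hashtags, Total]
Themes: Economy, Education, Environment, Events, Health, Holiday, ICT, Living, Media, Politics, Relations, Security, Spare time, Sports, Transport, Weather, Work, Other
Labelled values: 20% (label next to Media/Politics; the exact bar is not recoverable from the extracted text), Spare time (9%), Sports (13%), Other (18%)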
11. Topic identification: Non-hashtags*
[Figure: horizontal bar chart of Contribution (%) per theme, axis 0 to 50; legend: Hashtags, Non-hashtags, Total]
Themes: Economy, Education, Environment, Events, Health, Holiday, ICT, Living, Media, Politics, Relations, Security, Spare time, Sports, Transport, Weather, Work, Other
Labelled values: Spare time (10%), Sports (6%), Other (51%)
* A random sample
12. Topic identification: Combined
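[Figure: horizontal bar chart of Contribution (%) per theme, axis 0 to 50; legend: Hashtags, Non-hashtags, Total]
Themes: Economy, Education, Environment, Events, Health, Holiday, ICT, Living, Media, Politics, Relations, Security, Spare time, Sports, Transport, Weather, Work, Other
Labelled values: Events (1%), Media (7%), Politics (3%), Spare time (10%), Sports (7%), Work (5%), Other (46%)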
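The slides do not state how the combined figures were computed. One plausible reading, shown here as an assumption rather than the authors' method, is a weighted average of the hashtag and non-hashtag charts, using the shares of the two groups reported earlier (14.5% and 85.4%). The sketch also assumes the chart labels read as Spare time 9% and 10%, Sports 13% and 6%, and Other 18% and 51% for hashtags and non-hashtags respectively.

```python
# Shares of messages with and without hashtags, as reported on the topic-identification slide.
HASHTAG_SHARE = 0.145
NON_HASHTAG_SHARE = 0.854

# (hashtag %, non-hashtag %) per theme, as read from the chart labels (assumed assignment).
labelled = {
    "Spare time": (9, 10),
    "Sports": (13, 6),
    "Other": (18, 51),
}


def combined_share(hashtag_pct: float, non_hashtag_pct: float) -> float:
    """Weighted average of the two group percentages."""
    return HASHTAG_SHARE * hashtag_pct + NON_HASHTAG_SHARE * non_hashtag_pct


combined = {
    theme: round(combined_share(h, n)) for theme, (h, n) in labelled.items()
}
print(combined)
```

Under these assumptions the rounded results match the labels on the combined chart for these three themes (10%, 7% and 46%).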
13. Conclusions
• Is Twitter of potential interest for statistics?
• Yes
• What are the interesting topics for us?
• Work (5%), politics (3%), spare time (10%)
and events (1%)
• Can the data be used ‘as is’?
• No - ‘Low information content’
- Representativity of users
14. Conclusions (2)
• Representativity of the data is a serious issue
• Clear that only a subset of the (Dutch) population
is observed
• Not everybody in the Netherlands is active on Twitter
• Hardly any background information available
• Although some users provide very interesting details in
their user profile
• Workaround?
• (Only) use Twitter to get quick info (a trend) on a
specific topic
15. Future work
• Continue to study Social media!
• But:
1) No longer collect data ourselves
2) In future studies focus on:
• Mining sentiment towards specific topics
• E.g. economy, consumer sentiment, but also
statistics and Statistics Netherlands surveys
• Background info of users
16. Thank you for your attention!
• #Questions?
Contact or follow me at: @pietdaas