1. Big Data as a data source for official statistics
Piet Daas, Marco Puts, Bart Buelens and Paul van den Hurk
Statistics Netherlands
Big Data Target Conference, April 4, Groningen
2. Overview
• Data sources and statistics
• More & more data becomes available
• Effect on statistics production
• How we study Big Data: 2 examples
• Traffic loop detection data
• Social media messages
3. Introduction
“Statistics Netherlands has produced
about 5000 official publications and
tables in 2012”
For this we need DATA
4. Data sources for official statistics
Primary data: our own surveys
Secondary data: data from ‘others’
- Administrative sources
- ‘New’ data sources
5. Statistics Netherlands law
• “Statistics Netherlands aims to reduce the
administrative burden for companies and the
public as much as possible”
• By (re-)using existing administrative registrations of both
government and government-funded organizations.
• And by studying potential new sources of information
6. Data, data everywhere!
7. Statistics Netherlands and Data
• Data is generated in increasing amounts and at increasing frequencies:
• From ‘data scarcity’ (sample surveys) to ‘data abundance’ (administrative
data & Big Data)
• Ever increasing amounts of data need to be checked, processed and
analyzed
• More sources of information become available
• Opportunities to produce statistics faster (‘real-time statistics’)
• Need for new methods and tools
1. Methods to quickly uncover information from the massive amounts of data
available, such as visualisation methods and data-, text- and stream-mining
techniques (‘making Big Data small’), and High Performance Computing
2. Methods capable of integrating the information in the statistical process,
e.g. linking at massive scale, macro/meso-integration, estimation methods
suited for large datasets
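‘Making Big Data small’ can be illustrated with a stream-style reduction. The sketch below is only an illustration, not the method used at Statistics Netherlands: the record layout (sensor id, minute of day, vehicle count) and the function name `hourly_totals` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (sensor_id, minute_of_day, vehicle_count).
Record = Tuple[str, int, int]


def hourly_totals(records: Iterable[Record]) -> Dict[int, int]:
    """Reduce a stream of per-minute records to hourly totals.

    Records are consumed one at a time, so memory use depends on the
    number of hours (at most 24), not on the number of records.
    """
    totals: Dict[int, int] = defaultdict(int)
    for _sensor_id, minute_of_day, count in records:
        totals[minute_of_day // 60] += count
    return dict(totals)


# Tiny made-up sample; real input would be a file or message stream.
sample = [("A", 0, 12), ("B", 59, 3), ("A", 60, 7), ("B", 1439, 1)]
print(hourly_totals(iter(sample)))
```

Because the function accepts any iterable, the same code works on a generator that reads records lazily from disk.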
8. 2 Big Data case studies
Research findings on the study of Big Data sources
from a statistics point of view
1. Traffic loop detection data
80 million records/day, studied 90 days so far,
number of vehicles detected each minute
2. Dutch social media messages
1 to 2 million public messages/day, studied up to 2 billion
records, content and sentiment
9. 1. Traffic loop detection data
• Traffic ‘loops’
• Every minute (24/7) the number of passing
vehicles is counted by >10,000 road sensors
& cameras in the Netherlands
• Total vehicles and in different length classes
• Interesting source to produce traffic and
transport statistics (and more)
• Huge amounts of data, about 100 million
records a day
Locations
10. Number of detected vehicles on a single day
By all loops: total = ~295 million
11. Traffic loop detection activity (only first 10 min.)
12. Correct for missing data
• ‘Corrected’ data (for blocks of 5 min)
Before: total = ~295 million
After: total = ~330 million (+12%)
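The slide does not say how the correction works, so the following is only a sketch of one plausible approach: within each block of 5 minutes, a missing minute is replaced by the mean of the minutes that were observed in that block. The function name `fill_gaps` and the use of `None` for a missing minute are assumptions. As a check on the slide's figures, (330 - 295) / 295 is roughly 11.9%, which matches the reported +12%.

```python
from typing import List, Optional

BLOCK = 5  # minutes per block, as stated on the slide


def fill_gaps(counts: List[Optional[int]]) -> List[float]:
    """Fill missing per-minute counts (None) with the block mean.

    Assumption: the mean of the observed minutes in the same 5-minute
    block is a reasonable stand-in for a missing minute. Blocks with
    no observations at all are set to 0.0, since nothing is known.
    """
    filled: List[float] = []
    for start in range(0, len(counts), BLOCK):
        block = counts[start:start + BLOCK]
        observed = [c for c in block if c is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.extend(float(c) if c is not None else mean for c in block)
    return filled


# Made-up counts for one sensor over ten minutes.
raw = [10, None, 14, 12, None, 8, 8, 8, 8, 8]
before = sum(c for c in raw if c is not None)
after = sum(fill_gaps(raw))
print(before, after, f"{100 * (after - before) / before:+.0f}%")
```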
13. Total vehicles during the day (snapshots)
14. For different vehicle lengths
1 category: Total
3 categories: Total; <= 5.6 m; > 5.6 & <= 12.2 m; > 12.2 m
5 categories: Total; > 1.85 & <= 2.4 m; > 2.4 & <= 5.6 m; > 5.6 & <= 11.5 m; > 11.5 & <= 12.2 m; > 12.2 m
Small vehicles <= 5.6 m
Medium sized vehicles > 5.6 m & <= 12.2 m
Large vehicles > 12.2 m
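The three-class grouping above translates directly into a small classifier. This is a minimal sketch; the function name `length_class` and the assumption that lengths arrive as metres in a float are illustrative, while the thresholds (5.6 m and 12.2 m) are taken from the slide.

```python
def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to the three classes on the slide."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


# Made-up example lengths: a car, a delivery truck, an articulated lorry.
for length in (4.3, 9.0, 16.5):
    print(length, length_class(length))
```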
15. Small vehicles
~75% of total
16. Small & medium vehicles
17. Small, medium & large vehicles
18. Volatile behaviour at the micro-level
19. 2. Social media messages
• The Dutch are very active on social media platforms
• Almost always carried along and almost always switched on
• More and more people have a smartphone!
• Potential information source for:
• Which topics are current:
• Number of messages and the sentiment about them
• Usable as a measuring instrument for:
Map by Eric Fischer (via Fast Company)
20. 2. Social media messages
• The Dutch are very active on social media platforms
• Potential information source for:
• Topics discussed and the sentiment about these topics (quickly
available!), and probably more
• Investigate the source to find out how it could be used
2a. Content:
- Collected Dutch Twitter messages for study: ‘selection’ of 12 million
2b. Sentiment
- Sentiment in Dutch social media messages: ‘all’ ~2 billion
21. Social media: Dutch Twitter topics
[Chart: share of messages per topic. Topic labels were not preserved; the shares shown are 3%, 7%, 3%, 10%, 7%, 3%, 5% and 46%.]
12 million messages
22. Sentiment in Social media
• Access to Coosto database
• > 2 billion publicly available messages
• Twitter, Facebook, Hyves, web forums, blogs etc.
• Sentiment of each message
• Positive, negative or neutral
• Interesting finding
• The so-called ‘mood of the nation’ can be determined and compared
with the consumer confidence figures of Statistics Netherlands
23. Consumer confidence, survey data
Sentiment towards the economic climate
(pos – neg) as % of total
~1000 respondents/month
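The indicator ‘(pos – neg) as % of total’ is a simple balance statistic. The sketch below shows the arithmetic; the function name `sentiment_balance` and the example numbers are made up for illustration and are not survey results.

```python
def sentiment_balance(positive: int, negative: int, neutral: int = 0) -> float:
    """Return (positive - negative) as a percentage of all answers."""
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("no answers to compute a balance from")
    return 100.0 * (positive - negative) / total


# Hypothetical month with 1000 respondents.
print(sentiment_balance(positive=250, negative=450, neutral=300))
```

The same function could be applied to counts of positive, negative and neutral social media messages, which would make the two series comparable on one scale.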
24. Final remarks: Big Data and statistics
• Preparing Big data for statistics is time consuming
• Exploration phase takes a lot of time
• Try to reduce amount of data without losing information (‘making big data
small’, noise reduction)
• Risk: ‘garbage in’, ‘garbage statistics out’
• Traditional approach does not suffice
• Big data sources are definitely not ‘large’ sample surveys or admin data
• Often a selective, but large, part of the ‘population’ is included
• Events are registered, not units!
• Be careful when using ‘traditional’ statistical analysis (everything is significant!)
• More need for:
• Visualisation methods (to rapidly gain insight)
• Methods & models specific to large datasets (fast and ‘robust’)
• Learn from ‘computational statistics’ & (try to) use dedicated hardware
• Beware of privacy issues!
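The remark that ‘everything is significant’ can be made concrete with a two-sample z-test. The sketch below assumes two groups of equal size `n` with a known common standard deviation; the function name and the numbers are illustrative only. It shows that a difference of one hundredth of a standard deviation, which is practically negligible, is far from significant at n = 100 but becomes extremely significant at n = 10 million.

```python
import math


def two_sample_z_pvalue(diff: float, sigma: float, n: int) -> float:
    """Two-sided p-value for a difference in means between two groups.

    Assumes both groups have size n and a known common standard
    deviation sigma, so the standard error is sigma * sqrt(2 / n).
    """
    z = diff / (sigma * math.sqrt(2.0 / n))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


for n in (100, 10_000, 10_000_000):
    print(n, two_sample_z_pvalue(diff=0.01, sigma=1.0, n=n))
```

With Big Data volumes, the size of an effect is therefore more informative than its p-value.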
25. The future of Stat Neth?