1. HLG Big Data project
and Sandbox
Carlo Vaccari (Istat) – IAOS October 2014 1
2. This material is distributed under the Creative Commons
"Attribution - NonCommercial - Share Alike - 3.0", available at
http://creativecommons.org/licenses/by-nc-sa/3.0/
3. International
High Level Group to coordinate groups working on Statistical
Standards: UNECE, OECD, Eurostat, National Statistical Org.
4. Big Data Project
May 2013: task team with the aim of defining a project to be
presented to the international statistical community
Three main objectives:
To identify the main possibilities and the main strategic and
methodological issues that Big Data poses for official statistics
To analyze the feasibility of efficient production of official
statistics using Big Data sources, and the possibility of replicating
these approaches across different national contexts
To facilitate the sharing across organizations of knowledge,
expertise, tools and methods for the production of statistics using
Big Data sources
5. Big Data Project
Project presented to HLG and CES
Task teams composed of people from 13 organisations
The project is composed of four task teams:
Partnership Task Team
Privacy Task Team
Quality Task Team
Sandbox Task Team
6. Partnership Task Team
Providers and sources of data - challenges: access to data,
managing privacy and confidentiality
Government (Administrative records)
Private (Commercial records)
Social Media and other Internet sites
Design - research design and development
Academia
Private and/or public research institutes
NGOs
International organizations
7. Partnership Task Team
Technology - Tools, data and infrastructure for data
processing, data mining, real-time analytics, storage,
computing, and data visualization
Private sector (technology providers, IT companies)
Data providers themselves
Analysis - NSOs can provide standards and methodology
whereas others provide analytical capacity and modeling
Academia
Private and/or public research institutes
NGOs
International organizations
8. Privacy Task Team
Overview of existing tools for risk management in view of privacy
issues
Risks to privacy - Privacy software
Data access strategies (onsite, remote access, microdata)
Overview of database privacy technologies
Evaluation of different privacy approaches
Big Data characteristics and their implications for data privacy
Data access strategies for Big Data
Computer Science and Statistical Disclosure approaches
Disclosure Risk assessment for Big Data
9. Privacy Task Team
Information Integration and Governance (DB monitoring,
security, transport security)
Statistical Disclosure Limitations
Preserving confidentiality
Balance between “Data utility” and “Disclosure Risk”
SDL methods:
Data masking
Traditional approaches: aggregation, obfuscation,
perturbations, data swapping
Modern approaches: sampling and simulation
Managing potential risk to reputation: ethical practices,
controls, communication, dialog with public
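Two of the traditional SDL approaches listed above, data swapping and perturbation, can be illustrated with a minimal Python sketch. The record layout, the helper names `swap_values` and `perturb`, and the noise scale are assumptions made for illustration only; they are not taken from the task team's work.

```python
import random


def swap_values(records, key, rng):
    """Data swapping: shuffle one attribute across records.

    The set of values of `key` is preserved, but the link between
    a value and the record it originally belonged to is broken.
    """
    values = [r[key] for r in records]
    rng.shuffle(values)
    return [{**r, key: v} for r, v in zip(records, values)]


def perturb(records, key, scale, rng):
    """Perturbation: add zero-mean Gaussian noise to a numeric attribute."""
    return [{**r, key: r[key] + rng.gauss(0.0, scale)} for r in records]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Toy microdata (hypothetical values).
records = [
    {"region": "A", "income": 21000.0},
    {"region": "B", "income": 35000.0},
    {"region": "C", "income": 48000.0},
    {"region": "D", "income": 27500.0},
]

swapped = swap_values(records, "region", rng)
noisy = perturb(records, "income", 500.0, rng)

for original, s, n in zip(records, swapped, noisy):
    print(original, "->", s["region"], round(n["income"], 1))
```

Both functions return new records and leave the input untouched, so the "Data utility" of the masked output can be compared against the original.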
10. Quality Task Team
Input quality framework with indicators:
Source: data-source, reliability, privacy, availability, costs, procedures,
...
Metadata: representativeness, usability, completeness, id, ...
Data: collection, coverage, complexity, efficiency, integrability
Output quality framework with indicators:
Metadata: clarity, accessibility, completeness, comprehensiveness
Data: relevance, accuracy, timeliness, accessibility, coherence,
predictivity, selectivity
Process quality with indicators:
Cleaning: unambiguous, objectivity, granularity, reliability
Transformations: compliance, categorization, precision
Linking: completeness, selectivity, accuracy, id, time_related
Aggregation: quantity, confidentiality, integration, validity, accuracy
11. Sandbox
Sandbox: web-accessible environment where researchers coming
from different institutions explore tools and methods needed for
statistical production and the feasibility of producing Big Data-derived
statistics
List of tools chosen: Hadoop, Hortonworks, Pentaho, RHadoop
Open list ...
12. Sandbox
Sandbox hosted at the Irish Centre for High-End
Computing (ICHEC), which will assist
the task team in the testing and evaluation
of Hadoop workflows and associated data
analysis application software
The mission of ICHEC is to provide High-
Performance Computing (HPC) resources,
support, education and training for
researchers
13. Sandbox configuration
The hardware on which the
sandbox system is based is a High
Performance Computing Linux
cluster hosted in the National
University of Ireland (Galway)
composed of 30 nodes each of
which has two quad-core
processors, 48GB of RAM and a
1TB local disk
Each node is connected to two
networks: one for accessing the
shared Lustre filesystem and one Gigabit
Ethernet network for management
A 20TB shared filesystem is available
to all nodes
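The aggregate capacity implied by the figures above can be worked out with a few lines of Python. The constants come from the slide; the variable names are illustrative, and the totals are raw sums that ignore any operating system or Hadoop overhead.

```python
# Per-node figures as stated on the slide.
NODES = 30
PROCESSORS_PER_NODE = 2
CORES_PER_PROCESSOR = 4  # "quad-core"
RAM_GB_PER_NODE = 48
LOCAL_DISK_TB_PER_NODE = 1

# Raw cluster totals (simple multiplication, no overhead accounted for).
total_cores = NODES * PROCESSORS_PER_NODE * CORES_PER_PROCESSOR
total_ram_gb = NODES * RAM_GB_PER_NODE
total_local_disk_tb = NODES * LOCAL_DISK_TB_PER_NODE

print(f"cores: {total_cores}")                    # cores: 240
print(f"RAM: {total_ram_gb} GB")                  # RAM: 1440 GB
print(f"local disk: {total_local_disk_tb} TB")    # local disk: 30 TB
```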
14. Sandbox in 2014
Virtual Sprint (March 2014) → first document
Workshop in Rome (April 2014)
Training in Rome (May 2014)
Sandbox installation and verification
Workshop in Heerlen (September 2014)
Testing scenarios for Big Data usage in Official Statistics:
use as auxiliary information to improve an existing survey
replacing all or part of an existing survey with Big Data
producing a predefined statistical output either with or
without supplementation of survey data
producing a statistical output guided by findings from the
data
15. Sandbox partners
Software:
Hortonworks – Granted a free enterprise support
subscription for the duration of the project
Pentaho – Free trial of enterprise platform
Data:
Mobile data from Orange
Smart meter data from the Irish power agency
Smart meter data from the Canadian power agency
16. Sandbox experiments
Organized in Task teams, one for each source:
Consumer Price Index
Mobile phone data
Smart meters
Traffic loops
Social Data
Web scraping
Job vacancies
17. Experiment: Consumer Price Index
Sources:
Web scraping from ONS (UK supermarkets)
Synthetic scanner data from Istat
Test performance of big data technologies applied to the
computation of a simplified consumer price index, based on
synthetic data sets modeling scanner data
A first version of the price generator was tested successfully by
generating a sample csv file with 11 billion rows, which was then
uploaded to the sandbox
Comparison between Hadoop ↔ NoSQL ↔ RDBMS
Visual analysis of data through Pentaho suite
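A simplified price index over scanner-like data can be sketched in a few lines of Python. The slide does not say which index formula the team used; the Jevons formula below (geometric mean of price relatives) is an assumption chosen for illustration, as are the `(period, product, price)` row layout and the function names.

```python
import math
from collections import defaultdict


def mean_prices(rows):
    """Average price per (period, product) from (period, product, price) rows."""
    sums = defaultdict(lambda: [0.0, 0])
    for period, product, price in rows:
        acc = sums[(period, product)]
        acc[0] += price
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


def jevons_index(rows, base, current):
    """Jevons index (base = 100) over products observed in both periods."""
    prices = mean_prices(rows)
    relatives = [
        prices[(current, product)] / prices[(base, product)]
        for (period, product) in prices
        if period == base and (current, product) in prices
    ]
    if not relatives:
        raise ValueError("no product is observed in both periods")
    log_mean = sum(math.log(r) for r in relatives) / len(relatives)
    return 100.0 * math.exp(log_mean)


# Tiny synthetic data set (hypothetical values).
rows = [
    ("2014-01", "bread", 1.00),
    ("2014-01", "milk", 1.90),
    ("2014-01", "milk", 2.10),
    ("2014-02", "bread", 1.10),
    ("2014-02", "milk", 2.00),
    ("2014-02", "eggs", 3.00),  # not in the base period, so ignored
]

index = jevons_index(rows, base="2014-01", current="2014-02")
print(round(index, 2))
```

On billions of rows the same two steps (group by period and product, then combine the price relatives) map naturally onto a MapReduce or SQL-style aggregation.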
18. Experiment: Mobile phone data
Four datasets from the provider Orange for Ivory Coast:
number and duration of calls for each pair of cells, for each hour
calls coming from 500k phones with time and cell
calls coming from 500k randomly sampled individuals
communication sub-graphs for 5k users
Experiments:
Classification of callers: workers, students, business, not LF,
...
Classification of zones (cells): industrial, residential,
school/university, farmers, high/low traffic
Temporal distribution of Calls (day/week/season)
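The temporal distribution of calls can be sketched as a simple count by hour of day and by day of week. The `(timestamp, cell)` record layout, the timestamp format and the sample values below are assumptions for illustration; they do not describe the actual Orange files.

```python
from collections import Counter
from datetime import datetime


def temporal_distribution(calls):
    """Count calls by hour of day and by weekday (0 = Monday)."""
    by_hour = Counter()
    by_weekday = Counter()
    for timestamp, _cell in calls:
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        by_hour[t.hour] += 1
        by_weekday[t.weekday()] += 1
    return by_hour, by_weekday


# Synthetic call records (hypothetical values).
calls = [
    ("2014-10-06 08:15:00", "cell_1"),
    ("2014-10-06 08:40:00", "cell_2"),
    ("2014-10-06 19:05:00", "cell_1"),
    ("2014-10-07 08:30:00", "cell_1"),
]

by_hour, by_weekday = temporal_distribution(calls)
print(dict(by_hour))
print(dict(by_weekday))
```

Grouping the same counts by cell as well would give a per-zone daily profile, one possible input to the classification of zones.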
19. Experiment: Mobile phone data
Parallel experiments on Slovenian and Orange data →
exchange of methods, tools, findings
Searching for other datasets from other providers
20. Experiment: Smart meters
Datasets:
Smart meter data from Ireland (household level, linked
with 2 surveys)
Synthetic smart meter data from Canada (household
level, covering several years, time stamped hourly
electricity consumption linked with hourly weather data
and hourly price data, matched with quarterly survey
data)
Experiment: RHadoop code for visualizing synthetic Canadian
smart meter data, providing time elapsed for the following:
Hourly Consumption (kWh) v Hourly Temperature (C) for all
data
Hourly Consumption (kWh) v Hourly Price (c) for all data
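The consumption-versus-temperature comparison, with the elapsed time reported, can be sketched as follows. This is plain Python rather than the RHadoop code used in the experiment, and the hour keys, the 5-degree bin width and the sample values are all assumptions for illustration.

```python
import time
from collections import defaultdict


def mean_consumption_by_temperature(readings, weather, bin_width=5):
    """Mean hourly consumption (kWh) per temperature bin.

    readings: iterable of (hour_key, kwh)
    weather:  dict mapping hour_key -> temperature in degrees C
    Readings with no matching weather record are skipped.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for hour_key, kwh in readings:
        temp = weather.get(hour_key)
        if temp is None:
            continue
        bin_start = int(temp // bin_width) * bin_width
        sums[bin_start][0] += kwh
        sums[bin_start][1] += 1
    return {b: total / n for b, (total, n) in sorted(sums.items())}


# Synthetic hourly data (hypothetical values).
weather = {"2014-01-01T00": -3.0, "2014-01-01T01": -1.0, "2014-07-01T12": 22.0}
readings = [
    ("2014-01-01T00", 2.0),
    ("2014-01-01T01", 3.0),
    ("2014-07-01T12", 1.0),
    ("2014-07-01T13", 9.9),  # no weather record for this hour
]

start = time.perf_counter()
result = mean_consumption_by_temperature(readings, weather)
elapsed = time.perf_counter() - start

print(result)
print(f"time elapsed: {elapsed:.6f} s")
```

Replacing the `weather` dictionary with hourly prices gives the second comparison (consumption versus price) with no change to the function.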
21. Experiment: Traffic loops
In the Netherlands, 20,000 traffic loops, counting the number
of vehicles each minute, are located on approximately 3,000
km of motorway. All this data is collected by a central agency,
the NDW (National data warehouse for traffic). One year of data
was loaded for the area of South Limburg, which has about
800 of these traffic loops
Experiment:
Find out how to deal with multiple files in Hadoop
See how the traffic develops during a year
Deliverables:
Code for aggregating the data in Hive and RHadoop
A graphical representation about the development of the
traffic on these roads and in this region
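The aggregation step can be sketched in plain Python: several files of per-minute counts are read one after another and summed into daily totals. The deliverable itself is in Hive and RHadoop; the column names (`loop_id`, `timestamp`, `vehicles`) and the in-memory sample files below are assumptions for illustration.

```python
import csv
import io
from collections import Counter


def daily_totals(files):
    """Sum per-minute vehicle counts into one total per day, across files."""
    totals = Counter()
    for handle in files:
        for row in csv.DictReader(handle):
            day = row["timestamp"][:10]  # keep the YYYY-MM-DD part
            totals[day] += int(row["vehicles"])
    return dict(sorted(totals.items()))


# Two small in-memory "files" standing in for the many input files.
file_a = io.StringIO(
    "loop_id,timestamp,vehicles\n"
    "L1,2014-01-01 08:00,12\n"
    "L1,2014-01-01 08:01,15\n"
)
file_b = io.StringIO(
    "loop_id,timestamp,vehicles\n"
    "L2,2014-01-01 08:00,7\n"
    "L2,2014-01-02 08:00,9\n"
)

totals = daily_totals([file_a, file_b])
print(totals)  # {'2014-01-01': 34, '2014-01-02': 9}
```

Plotting these daily totals over the year gives the kind of graphical representation named in the deliverables.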
23. Experiment: Social Data
Set of tweets generated in Mexico from January to July 2014:
Sentiment analysis techniques for obtaining indicators of
subjective wellbeing (to compare with statistics)
Use of geo-tagged tweets for analysing people's movements
State of origin of tourists visiting "Magic Towns" in Mexico
24. Experiment: Social Data
Next steps:
Geo-located tweets experiments on:
Working patterns / commuting from morning to night
Weekends / Holidays / Seasonal movements
South – North mobility / Commerce at the North border
Work on emoticons and media acronyms analysis:
Develop a small emoticons dictionary / review research
papers
Count the emoticons in the tweets that we have, and how
many tweets contain emoticons, to get an idea of how
representative they are
Review of algorithms: work with some MapReduce
adaptations, Spark, Scala
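The emoticon count described above can be sketched with a small dictionary and a whitespace tokenizer. The dictionary entries, the polarity labels, the `emoticon_stats` name and the sample tweets are all assumptions for illustration; a real dictionary would be far larger and the tokenization more careful.

```python
from collections import Counter

# A deliberately small, hypothetical emoticon dictionary.
EMOTICONS = {
    ":)": "positive",
    ":-)": "positive",
    ":D": "positive",
    ":(": "negative",
    ":-(": "negative",
}


def emoticon_stats(tweets, dictionary=EMOTICONS):
    """Return (count per emoticon, tweets containing any, share of tweets)."""
    counts = Counter()
    tweets_with = 0
    for text in tweets:
        found = [token for token in text.split() if token in dictionary]
        counts.update(found)
        if found:
            tweets_with += 1
    share = tweets_with / len(tweets) if tweets else 0.0
    return counts, tweets_with, share


# Synthetic tweets (hypothetical text).
tweets = [
    "Great day in Puebla :) :)",
    "Traffic again :(",
    "No emoticon here",
    "Vacaciones :D",
]

counts, tweets_with, share = emoticon_stats(tweets)
print(dict(counts))
print(tweets_with, share)
```

The share of tweets containing at least one emoticon is the figure that indicates how representative an emoticon-based indicator could be.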
25. Experiment: Job vacancies
The Job-vacancies team works on (historical) job vacancies
data, scraped from various sites on the web. Goals:
to identify possible data sources, both free and commercial,
and their APIs, and to illustrate potential use cases
to scrape job vacancies data from the biggest national
websites (and possibly international ones)
to test scraping tools (Irobotsoft and Kimonolabs)
to test the statistical process of data manipulation
26. Experiment: Web scraping
8,600 Italian websites, indicated by the 19,000 enterprises
responding to the 2013 ICT survey, have been scraped
and the acquired texts have been processed
The scraping and processing work took about 33 hours on a
virtual server in Italy. The goal of this activity is to reproduce the
software configuration used and rerun the process in a more
powerful environment in order to measure the time
taken
Experiment:
Configure a Nutch job runnable in the Sandbox environment
Execute the scraping job in order to produce the scraped
data in HDFS
Compare the performance of the sandbox with the
performance of a single server
27. State of the Project
All teams are running experiments and have defined
objectives for final deliverables (preliminary results due by the
end of November, final results by the end of the year)
Outline of final deliverables defined in September meetings
Training material developed, available to all participants and,
in future, to the public
Effective cooperation and exchange of ideas: all participants
requested more time for developing other experiments and
look forward to extending the project
28. Lessons Learned
International cooperation can multiply the ideas
Data acquisition can be a long process (e.g. five months to
get the Orange mobile data)
The group suggested other possible approaches for the future
Need for “political”/legal sponsorship
Setting up the environment took time → difficult to achieve a
"stable" configuration
Training should cover different skills: IT, statistics and
algorithms. Need for people open to learning new tools,
techniques, methods...