Session D7: Big Data Analysis from Classification to Dimensional
reduction
The curse of dimensionality in official statistics
Conference of European Statistics Stakeholders
Budapest, 20–21 October 2016
Emanuele Baldacci, emanuele.baldacci@ec.europa.eu
Eurostat Director, Directorate B Methodology, Corporate statistical and IT services
Dario Buono, dario.buono@ec.europa.eu
Eurostat, Unit B.1: Methodology and corporate architecture
Fabrice Gras, fabrice.gras@ec.europa.eu
Eurostat, Unit B.1: Methodology and corporate architecture
The curse of dimensionality
(coined by Richard E. Bellman in 1961)
 When the dimensionality
increases, the volume of
the space increases so fast
that the available data
become sparse.
 To obtain a statistically
significant result, the
amount of data needed
often grows
exponentially with the
dimensionality.
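The sparsity argument above can be made concrete with a small counting sketch. The figures used (10 bins per variable, 1000 observations, 5 points per cell) are illustrative assumptions, not values from the presentation.

```python
def n_cells(bins_per_axis: int, dims: int) -> int:
    """Cells obtained by cutting each of `dims` axes into `bins_per_axis` bins."""
    return bins_per_axis ** dims


def max_occupied_share(n_obs: int, bins_per_axis: int, dims: int) -> float:
    """Upper bound on the share of cells holding at least one observation."""
    return min(1.0, n_obs / n_cells(bins_per_axis, dims))


def obs_needed(per_cell: int, bins_per_axis: int, dims: int) -> int:
    """Observations required to average `per_cell` points in every cell."""
    return per_cell * n_cells(bins_per_axis, dims)


if __name__ == "__main__":
    # Hypothetical setting: 1000 observations, 10 bins per variable.
    for dims in (1, 2, 3, 6, 10):
        share = max_occupied_share(1000, 10, dims)
        need = obs_needed(5, 10, dims)
        print(f"d={dims:2d}  occupied share <= {share:.1e}  n for 5 per cell = {need}")
```

With a fixed sample, the share of the space that can contain any data shrinks by a factor of 10 for every added variable, while the sample needed for constant coverage grows by the same factor.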
Big Data, Huge Dimensions…
Sparse Activities
 Dimensionality
 Big Data and Macroeconomic Nowcasting & Econometrics
 Selectivity methods
 Mobile phone data
 What's next?
Dealing with dimensionality in official statistics
Multiple sources: towards Model Based statistics
Type: Huge number of time series
 Problem: Reduction of dimensionality, data snooping
 Aim: Early estimate, nowcasting, classification
 Possible methods: Shrinkage models, Factor model, Bayesian model, regression trees, panel modelling
Type: High frequency time series
 Problem: Extraction/decomposition of signal for high frequency data, mixed frequency
 Aim: Nowcasting, Data filtering and signal extraction of high frequency time series
 Possible methods: Wavelet, ensemble mode decomposition, outliers detection, and extreme events theory, state space modelling, (U)-MIDAS
Type: Huge number of dimensions
 Problem: Curse of dimensionality (sampling, distance functions)
 Aim: Data mining: machine learning, clustering, classification
 Possible methods: Bayesian inference, alternative distance, state space models
Dimensionality challenges
 Data access, storage and dissemination
 Data analytics
 Moving towards more model based statistics
while preserving robustness and quality of
existing official statistics
• NSIs will need to pay more and more
attention to the "curse of dimensionality"
in the future
Data storage: a possible solution is
data virtualisation
Data analytics: the way to go
 Use of all the informational content included in
models.
 Model based statistics: trade-off between
robustness and precision properties of model
based statistics.
 Assessment of scenarios based on estimation of
density functions.
 Presentation of indicators based on clustering of
some contextual variables.
The curse of dimensionality &
Data Modelling
 Data snooping: among an infinite number of
candidate models, there is always a winner,
possibly by chance alone
 Distance: assessment of the distance relevancy in
high dimensional space, use of Bayesian
inference, embedding dimension of a problem
(Takens' theorem).
 High frequency data: at which frequency is the
signal most relevant?
 Data mining for selecting regressors
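Two of the points above, data snooping and the relevance of distances in high dimension, can be illustrated by simulation. The sketch below is an assumption-laden toy: all data are pure Gaussian or uniform noise, and the sample sizes, number of candidates and seeds are arbitrary choices, not taken from the presentation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def snooping_demo(n_obs=30, n_candidates=200, seed=0):
    """Pick the noise regressor that best 'explains' a noise target in-sample.

    Returns (in-sample |corr| of the winner,
             |corr| of the same regressor on a fresh sample).
    """
    rng = random.Random(seed)

    def draw(n):
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    y_in, y_out = draw(n_obs), draw(n_obs)
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        x_in, x_out = draw(n_obs), draw(n_obs)
        r_in = abs(pearson(x_in, y_in))
        if r_in > best_in:
            best_in, best_out = r_in, abs(pearson(x_out, y_out))
    return best_in, best_out


def relative_contrast(dims, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from one query point
    to n_points uniform points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    print("winner in-sample vs fresh sample:", snooping_demo())
    for d in (2, 20, 500):
        print(f"d={d:3d}  relative contrast = {relative_contrast(d):.3f}")
```

The winning regressor looks informative in-sample although no relationship exists, and the relative contrast between nearest and farthest neighbours falls as the dimension grows, which is why the choice of distance function needs to be assessed.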
Eurostat (Sparse?) activities
 Big Data Macroeconomic Nowcasting, 2016
 Big Data Econometrics, 2017
 Selectivity in Big Data sources, ongoing
 "Assessing the Quality of Mobile Phone Data as a
Source of Statistics", Q2016 joint-paper by
Statistics Belgium, Eurostat and Proximus
Big Data Macroeconomic Nowcasting
 Literature review on the use of Big Data for macro-
economic nowcasting
 Use of a typology based on Doornik and Hendry (2015):
 Tall data: many observations, few variables
 Fat data: many variables, few observations
 Huge data: many variables, many observations
Models race
 Dynamic Factor Analysis
 Partial Least Squares
 Bayesian Regression
 LASSO regression
 U-Midas models
 Model averaging
 255 models tested using macro-financial and
Google Trends data
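For readers unfamiliar with LASSO regression, one of the candidates listed above, here is a minimal sketch of its standard coordinate-descent solver. It is a textbook version, not the implementation used in the study; the objective scaling (1/(2n)), the absence of an intercept and the tiny data set are assumptions made for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink `value` towards zero by `penalty`; exact zero inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100, tol=1e-10):
    """Minimise (1/(2n)) * ||y - X b||^2 + penalty * ||b||_1.

    X is a list of rows; no intercept is fitted (centre the data first).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # an all-zero column carries no information
            # covariance of column j with the partial residual
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, penalty) / col_scale[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


if __name__ == "__main__":
    # Hypothetical data: two orthogonal regressors, y = 2*x1 + 0.1*x2.
    X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    y = [2.1, 1.9, -1.9, -2.1]
    print(lasso_coordinate_descent(X, y, penalty=0.5))
```

The weak regressor is set exactly to zero while the strong one is shrunk, which is the variable-selection behaviour that makes the method usable when variables outnumber observations.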
Statistical Methods: findings
 Sparse regression (LASSO) works for fat, huge data
 Data reduction techniques (PLS) helpful for large
numbers of variables
 (U)-MIDAS or bridge modelling for mixed frequency
 Dimensionality reduction improves nowcasting
 Forecast combination: Data-driven automated
strategy with model rotation based on forecasting
performance in the past works well
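One possible reading of "model rotation based on forecasting performance in the past" is sketched below: at each period, use the model with the lowest mean squared error over a trailing window. The window length, the squared-error loss, the alphabetical tie-break and the model names are assumptions; the presentation does not specify them.

```python
def trailing_mse(forecasts, actuals, start, stop):
    """Mean squared forecast error over periods start..stop-1."""
    errors = [(forecasts[s] - actuals[s]) ** 2 for s in range(start, stop)]
    return sum(errors) / len(errors)


def rotate_models(model_forecasts, actuals, window=4):
    """Select, for each period t >= window, the model that performed best
    over the previous `window` periods.

    model_forecasts: dict mapping model name -> list of forecasts per period
    actuals:         list of realised values, same length
    Returns a list of (period, chosen model, its forecast) tuples.
    """
    picks = []
    for t in range(window, len(actuals)):
        # sorted() makes ties resolve alphabetically, so runs are reproducible
        best = min(
            sorted(model_forecasts),
            key=lambda name: trailing_mse(
                model_forecasts[name], actuals, t - window, t
            ),
        )
        picks.append((t, best, model_forecasts[best][t]))
    return picks


if __name__ == "__main__":
    # Hypothetical series with a level shift after period 2.
    actuals = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
    forecasts = {"a": [1.0] * 6, "b": [2.0] * 6}
    for pick in rotate_models(forecasts, actuals, window=2):
        print(pick)
```

Only information available before period t enters the choice for period t, so the selection rule itself does not look ahead.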
Follow-up: Big Data Econometrics
 Review of methods to move from unstructured to
structured time-series data sets for various types
of big data sources including filtering techniques
for high frequency data.
 Propose modelling strategies to be tested.
 Carry out further empirical tests on possible data
timeliness/accuracy gains.
 Big data handling tool developed as R package.
 Scientific summary for Big Data Econometric
strategy.
Big Data sources Selectivity:
Main Issues
 Self-selection and the resulting non-probability
character of the data.
 Discrepancies between big data populations and
the target population.
 Identification of statistical units (target
population indirectly observed).
How to deal with representativeness
and coverage of Big Data for sampling purposes.
Big Data sources Selectivity:
Proposed methods (so far…)
 Pseudo-design approach: reweighting (calibration,
pseudo-empirical likelihood, weighting)
 Modelling approach (M-quantile models, model-based
calibration, Bayesian approach, machine
learning approach)
 Record linkage
New study in 2017 to go further
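As an illustration of the reweighting idea, the sketch below uses post-stratification, a simple special case of calibration swapped in here for brevity. The strata, population counts and values are hypothetical, and the approach assumes that units within a stratum are comparable whether or not they appear in the big data source.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_counts):
    """Weight each sampled unit by N_h / n_h for its stratum h."""
    sample_counts = Counter(sample_strata)
    return [population_counts[s] / sample_counts[s] for s in sample_strata]


def weighted_mean(values, weights):
    """Weighted average of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical self-selected source: young users are over-represented.
    strata = ["young"] * 8 + ["old"] * 2
    values = [10.0] * 8 + [20.0] * 2
    population = {"young": 500, "old": 500}  # assumed known from a register

    weights = poststratification_weights(strata, population)
    print("unweighted mean:", sum(values) / len(values))
    print("reweighted mean:", weighted_mean(values, weights))
```

The weights restore the known population shares of the strata; they cannot correct for selectivity inside a stratum, which is where the modelling approaches listed above come in.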
Mobile Phone data: Clustering Time Series
(1) Assessing the Quality of Mobile Phone Data as a Source of Statistics
http://www.ine.es/q2016/docs/q2016Final00163.pdf
Scaling: standardisation of time series
Distance measure: Euclidean
Applied technique: K-means
Objectives: find patterns enabling
the classification of geographical
areas into work, residential and
commuting areas
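The technique named above (standardise each time series, then K-means with Euclidean distance) can be sketched as follows. This is a plain Lloyd's algorithm with a deterministic start on the first k series; the activity profiles are invented for illustration and are not the data from the cited paper.

```python
import math


def standardise(series):
    """Z-score one time series: subtract its mean, divide by its std dev."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if sd == 0.0:
        return [0.0] * n
    return [(v - mean) / sd for v in series]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with Euclidean distance.

    Centroids start at the first k points (an assumption made for
    reproducibility; random restarts are common in practice).
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical activity profiles over six time slots per area.
    profiles = [
        [1, 2, 9, 10, 3, 1],        # daytime peak, small area
        [9, 8, 2, 1, 7, 10],        # night-time peak, small area
        [10, 20, 90, 100, 30, 10],  # daytime peak, large area
        [90, 80, 20, 10, 70, 100],  # night-time peak, large area
    ]
    labels, _ = kmeans([standardise(p) for p in profiles], k=2)
    print(labels)
```

After standardisation, areas with the same daily shape coincide regardless of their activity level, so the clusters reflect the pattern over time.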
What's next
 European Big Data Hackathon, 15-17 March 2017, Brussels
 European Statistical Training Courses in 2017
ESTP courses supporting big data (2017)
Big data courses
 Introduction to big data and its tools
 Hands-on immersion on big data tools
 Big data sources - Web, Social media and text analytics
 Advanced big data sources - Mobile phone and other sensors
Methodology courses
 Can a statistician become a data scientist?
 The use of R in official statistics: model based estimates
 Time-series econometrics
 Nowcasting
[Figure: course activities scheduled by quarter (Q1 to Q4); the course-to-quarter mapping is not recoverable]
Thank you for your attention
Questions welcome
References:
• Clément Marsilli, Variable Selection in Predictive MIDAS Models, Document de travail 520, Banque de France, https://www.banque-france.fr/uploads/tx_bdfdocumentstravail/DT-520.pdf
• Eurostat, Big data and macroeconomic nowcasting, preliminary results presented at the ESS methodological working group (7 April 2016, Luxembourg), http://ec.europa.eu/eurostat/cros/content/item21bigdataandmacroeconomicnowcastingslides_en
• M. Verleysen, D. François, G. Simon, V. Wertz, On the effects of dimensionality on data analysis with neural networks, https://perso.uclouvain.be/michel.verleysen/papers/iwann03mv.pdf
• Dennis Prangle, Summary Statistics in Approximate Bayesian Computation, https://arxiv.org/pdf/1512.05633.pdf
• Big data CROS portal, http://ec.europa.eu/eurostat/cros/content/big-data_en

More Related Content

Viewers also liked

SFD2014_FOSS, Cloud and BigData in Vietnam
SFD2014_FOSS, Cloud and BigData in VietnamSFD2014_FOSS, Cloud and BigData in Vietnam
SFD2014_FOSS, Cloud and BigData in VietnamHieu LE ☁
 
CS404 Pattern Recognition - Locality Preserving Projections
CS404   Pattern Recognition - Locality Preserving ProjectionsCS404   Pattern Recognition - Locality Preserving Projections
CS404 Pattern Recognition - Locality Preserving ProjectionsJishnu P
 
Big Data and Analytics: The IBM Perspective
Big Data and Analytics: The IBM PerspectiveBig Data and Analytics: The IBM Perspective
Big Data and Analytics: The IBM PerspectiveThe_IPA
 
Automated Face Detection and Recognition
Automated Face Detection and RecognitionAutomated Face Detection and Recognition
Automated Face Detection and RecognitionWaldir Pimenta
 
HEC Digital Business. Sharing Economy and other trends
HEC Digital Business. Sharing Economy and other trendsHEC Digital Business. Sharing Economy and other trends
HEC Digital Business. Sharing Economy and other trendsAndré Blavier
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionalityNikhil Sharma
 
PCA Based Face Recognition System
PCA Based Face Recognition SystemPCA Based Face Recognition System
PCA Based Face Recognition SystemMd. Atiqur Rahman
 
Face recognition technology - BEST PPT
Face recognition technology - BEST PPTFace recognition technology - BEST PPT
Face recognition technology - BEST PPTSiddharth Modi
 

Viewers also liked (11)

SFD2014_FOSS, Cloud and BigData in Vietnam
SFD2014_FOSS, Cloud and BigData in VietnamSFD2014_FOSS, Cloud and BigData in Vietnam
SFD2014_FOSS, Cloud and BigData in Vietnam
 
CS404 Pattern Recognition - Locality Preserving Projections
CS404   Pattern Recognition - Locality Preserving ProjectionsCS404   Pattern Recognition - Locality Preserving Projections
CS404 Pattern Recognition - Locality Preserving Projections
 
Fyp
FypFyp
Fyp
 
Big Data and Analytics: The IBM Perspective
Big Data and Analytics: The IBM PerspectiveBig Data and Analytics: The IBM Perspective
Big Data and Analytics: The IBM Perspective
 
Big Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning GuruBig Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning Guru
 
Understandig PCA and LDA
Understandig PCA and LDAUnderstandig PCA and LDA
Understandig PCA and LDA
 
Automated Face Detection and Recognition
Automated Face Detection and RecognitionAutomated Face Detection and Recognition
Automated Face Detection and Recognition
 
HEC Digital Business. Sharing Economy and other trends
HEC Digital Business. Sharing Economy and other trendsHEC Digital Business. Sharing Economy and other trends
HEC Digital Business. Sharing Economy and other trends
 
Curse of dimensionality
Curse of dimensionalityCurse of dimensionality
Curse of dimensionality
 
PCA Based Face Recognition System
PCA Based Face Recognition SystemPCA Based Face Recognition System
PCA Based Face Recognition System
 
Face recognition technology - BEST PPT
Face recognition technology - BEST PPTFace recognition technology - BEST PPT
Face recognition technology - BEST PPT
 

Similar to Big Data Analysis: The curse of dimensionality in official statistics

Big Data and Nowcasting
Big Data and NowcastingBig Data and Nowcasting
Big Data and NowcastingDario Buono
 
Big data and macroeconomic nowcasting from data access to modelling
Big data and macroeconomic nowcasting from data access to modellingBig data and macroeconomic nowcasting from data access to modelling
Big data and macroeconomic nowcasting from data access to modellingDario Buono
 
So where are we now? The TDM landscape
So where are we now? The TDM landscapeSo where are we now? The TDM landscape
So where are we now? The TDM landscapeFutureTDM
 
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...University of Bologna
 
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxBIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxjasoninnes20
 
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxBIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxtangyechloe
 
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxBIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxhartrobert670
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...Piet J.H. Daas
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...tuxette
 
Australia bureau of statistics some initiatives on big data - 23 july 2014
Australia bureau of statistics   some initiatives on big data - 23 july 2014Australia bureau of statistics   some initiatives on big data - 23 july 2014
Australia bureau of statistics some initiatives on big data - 23 july 2014noviari sugianto
 
P. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsP. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsIstituto nazionale di statistica
 
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...European Data Forum
 
SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...
SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...
SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...BigData_Europe
 
New Opportunity for Urban Analysis
New Opportunity for Urban AnalysisNew Opportunity for Urban Analysis
New Opportunity for Urban AnalysisChen Zunqiu
 
STATVIEW: a web platform for visualisation and dissemination of statistical d...
STATVIEW: a web platform for visualisation and dissemination of statistical d...STATVIEW: a web platform for visualisation and dissemination of statistical d...
STATVIEW: a web platform for visualisation and dissemination of statistical d...ALESSANDRO CAPEZZUOLI
 

Similar to Big Data Analysis: The curse of dimensionality in official statistics (20)

Big Data and Nowcasting
Big Data and NowcastingBig Data and Nowcasting
Big Data and Nowcasting
 
Big data and macroeconomic nowcasting from data access to modelling
Big data and macroeconomic nowcasting from data access to modellingBig data and macroeconomic nowcasting from data access to modelling
Big data and macroeconomic nowcasting from data access to modelling
 
So where are we now? The TDM landscape
So where are we now? The TDM landscapeSo where are we now? The TDM landscape
So where are we now? The TDM landscape
 
Why Data Science is a Science
Why Data Science is a ScienceWhy Data Science is a Science
Why Data Science is a Science
 
Presentation Sofie De Broe (ochtend)
Presentation Sofie De Broe (ochtend)Presentation Sofie De Broe (ochtend)
Presentation Sofie De Broe (ochtend)
 
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
[PhDThesis2021] - Augmenting the knowledge pyramid with unconventional data a...
 
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxBIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
 
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxBIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
 
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docxBIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
BIG IOT AND SOCIAL NETWORKING DATA FOR SMART CITIES Alg.docx
 
Opportunities and methodological challenges of Big Data for official statist...
Opportunities and methodological challenges of  Big Data for official statist...Opportunities and methodological challenges of  Big Data for official statist...
Opportunities and methodological challenges of Big Data for official statist...
 
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap...
 
Australia bureau of statistics some initiatives on big data - 23 july 2014
Australia bureau of statistics   some initiatives on big data - 23 july 2014Australia bureau of statistics   some initiatives on big data - 23 july 2014
Australia bureau of statistics some initiatives on big data - 23 july 2014
 
P. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European StatisticsP. Struijs, Toward the Use of Big Data for European Statistics
P. Struijs, Toward the Use of Big Data for European Statistics
 
Big Data technology
Big Data technologyBig Data technology
Big Data technology
 
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
 
Trusted Smart Statistics
Trusted Smart StatisticsTrusted Smart Statistics
Trusted Smart Statistics
 
SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...
SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...
SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official...
 
New Opportunity for Urban Analysis
New Opportunity for Urban AnalysisNew Opportunity for Urban Analysis
New Opportunity for Urban Analysis
 
STATVIEW: a web platform for visualisation and dissemination of statistical d...
STATVIEW: a web platform for visualisation and dissemination of statistical d...STATVIEW: a web platform for visualisation and dissemination of statistical d...
STATVIEW: a web platform for visualisation and dissemination of statistical d...
 
Statbel and big data
Statbel and big dataStatbel and big data
Statbel and big data
 

More from Dario Buono

Reporting uncertainties - too much information?
Reporting uncertainties - too much information?Reporting uncertainties - too much information?
Reporting uncertainties - too much information?Dario Buono
 
Skills for the new generation of statisticians
Skills for the new generation of statisticians Skills for the new generation of statisticians
Skills for the new generation of statisticians Dario Buono
 
JDemetra+ Java Tool for Seasonal Adjustment
JDemetra+ Java Tool for Seasonal AdjustmentJDemetra+ Java Tool for Seasonal Adjustment
JDemetra+ Java Tool for Seasonal AdjustmentDario Buono
 
Physics4Stats & BMI vs. QoL
Physics4Stats & BMI vs. QoLPhysics4Stats & BMI vs. QoL
Physics4Stats & BMI vs. QoLDario Buono
 
Methodological network and strategy
Methodological network and strategy Methodological network and strategy
Methodological network and strategy Dario Buono
 
Safebook quality grading
Safebook quality gradingSafebook quality grading
Safebook quality gradingDario Buono
 
MIP: Analysis of metadata and data revisions
MIP: Analysis of metadata and data revisionsMIP: Analysis of metadata and data revisions
MIP: Analysis of metadata and data revisionsDario Buono
 
New innovative 3 way anova a-priori test for direct vs. indirect approach in ...
New innovative 3 way anova a-priori test for direct vs. indirect approach in ...New innovative 3 way anova a-priori test for direct vs. indirect approach in ...
New innovative 3 way anova a-priori test for direct vs. indirect approach in ...Dario Buono
 
Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...
Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...
Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...Dario Buono
 
Detecting outliers at the end of the series using forecast intervals
Detecting outliers at the end of the series using forecast intervals Detecting outliers at the end of the series using forecast intervals
Detecting outliers at the end of the series using forecast intervals Dario Buono
 
1 out of 20 scenarios
1 out of 20 scenarios1 out of 20 scenarios
1 out of 20 scenariosDario Buono
 
Eurostat methodological skills staff survey lesson learned final
Eurostat methodological skills staff survey lesson learned finalEurostat methodological skills staff survey lesson learned final
Eurostat methodological skills staff survey lesson learned finalDario Buono
 

More from Dario Buono (12)

Reporting uncertainties - too much information?
Reporting uncertainties - too much information?Reporting uncertainties - too much information?
Reporting uncertainties - too much information?
 
Skills for the new generation of statisticians
Skills for the new generation of statisticians Skills for the new generation of statisticians
Skills for the new generation of statisticians
 
JDemetra+ Java Tool for Seasonal Adjustment
JDemetra+ Java Tool for Seasonal AdjustmentJDemetra+ Java Tool for Seasonal Adjustment
JDemetra+ Java Tool for Seasonal Adjustment
 
Physics4Stats & BMI vs. QoL
Physics4Stats & BMI vs. QoLPhysics4Stats & BMI vs. QoL
Physics4Stats & BMI vs. QoL
 
Methodological network and strategy
Methodological network and strategy Methodological network and strategy
Methodological network and strategy
 
Safebook quality grading
Safebook quality gradingSafebook quality grading
Safebook quality grading
 
MIP: Analysis of metadata and data revisions
MIP: Analysis of metadata and data revisionsMIP: Analysis of metadata and data revisions
MIP: Analysis of metadata and data revisions
 
New innovative 3 way anova a-priori test for direct vs. indirect approach in ...
New innovative 3 way anova a-priori test for direct vs. indirect approach in ...New innovative 3 way anova a-priori test for direct vs. indirect approach in ...
New innovative 3 way anova a-priori test for direct vs. indirect approach in ...
 
Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...
Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...
Eurostat tools for benchmarking and seasonal adjustment j_demetra+ and jecotr...
 
Detecting outliers at the end of the series using forecast intervals
Detecting outliers at the end of the series using forecast intervals Detecting outliers at the end of the series using forecast intervals
Detecting outliers at the end of the series using forecast intervals
 
1 out of 20 scenarios
1 out of 20 scenarios1 out of 20 scenarios
1 out of 20 scenarios
 
Eurostat methodological skills staff survey lesson learned final
Eurostat methodological skills staff survey lesson learned finalEurostat methodological skills staff survey lesson learned final
Eurostat methodological skills staff survey lesson learned final
 

Recently uploaded

REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 

Recently uploaded (20)

Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx

Big Data Analysis: The curse of dimensionality in official statistics

  • 1. Session D7: Big Data Analysis from Classification to Dimensional reduction. The curse of dimensionality in official statistics. Conference of European Statistics Stakeholders, Budapest, 20–21 October 2016.
      • Emanuele Baldacci, emanuele.baldacci@ec.europa.eu, Eurostat Director, Directorate B Methodology, Corporate statistical and IT services
      • Dario Buono, dario.buono@ec.europa.eu, Eurostat, Unit B.1: Methodology and corporate architecture
      • Fabrice Gras, fabrice.gras@ec.europa.eu, Eurostat, Unit B.1: Methodology and corporate architecture
  • 2. The curse of dimensionality (coined by Richard E. Bellman in 1961)
      • When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
      • To obtain a statistically significant result, the amount of data needed often grows exponentially with the dimensionality.
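The two statements on this slide can be illustrated with a small numerical sketch. It is not part of the presentation; the function names `samples_for_grid` and `inscribed_ball_fraction` are hypothetical, and the unit hypercube is only an assumed stand-in for a data space. The first function counts the points needed to keep a fixed resolution per axis, the second shows how little of the cube lies near its centre as the dimension grows.

```python
import math


def samples_for_grid(points_per_axis, n_dims):
    """Points needed to cover a hypercube with a regular grid.

    Keeping the same resolution on every axis requires
    points_per_axis ** n_dims points, i.e. exponential growth.
    """
    return points_per_axis ** n_dims


def inscribed_ball_fraction(n_dims):
    """Share of the unit hypercube's volume inside its inscribed ball.

    The ball has radius 0.5; the volume of a unit-radius ball in
    n_dims dimensions is pi^(d/2) / Gamma(d/2 + 1).
    """
    unit_ball_volume = math.pi ** (n_dims / 2) / math.gamma(n_dims / 2 + 1)
    return unit_ball_volume * 0.5 ** n_dims


for d in (1, 2, 5, 10, 20):
    print(d, samples_for_grid(10, d), inscribed_ball_fraction(d))
```

With ten grid points per axis, the required sample size is multiplied by ten for every added dimension, while the share of volume close to the centre shrinks towards zero, which is one way of seeing why a fixed amount of data becomes sparse.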
  • 3. Big Data, Huge Dimensions… Sparse Activities
      • Dimensionality
      • Big Data and Macroeconomic Nowcasting & Econometrics
      • Selectivity methods
      • Mobile phone data
      • What's next?
  • 4. Dealing with dimensionality in official statistics. Multiple sources: towards model-based statistics
      • Type: huge number of time series
          Problem: reduction of dimensionality, data snooping
          Aim: early estimate, nowcasting, classification
          Possible methods: shrinkage models, factor model, Bayesian model, regression trees, panel modelling
      • Type: high-frequency time series
          Problem: extraction/decomposition of signal for high-frequency data, mixed frequency
          Aim: nowcasting, data filtering and signal extraction of high-frequency time series
          Possible methods: wavelets, ensemble mode decomposition, outlier detection, extreme events theory, state space modelling, (U)-MIDAS
      • Type: huge number of dimensions
          Problem: curse of dimensionality (sampling, distance functions)
          Aim: data mining (machine learning, clustering, classification)
          Possible methods: Bayesian inference, alternative distance, state space models
  • 5. Dimensionality challenges
      • Data access, storage and dissemination
      • Data analytics
      • Moving towards more model-based statistics while preserving the robustness and quality of existing official statistics
      • NSIs will need to pay more and more attention to the "curse of dimensionality" in the future
  • 6. Data storage: a possible solution is data virtualisation
  • 7. Data analytics: the way to go
      • Use of all the informational content included in models.
      • Model-based statistics: trade-off between the robustness and precision properties of model-based statistics.
      • Assessment of scenarios based on the estimation of density functions.
      • Presentation of indicators based on clustering of some contextual variables.
  • 8. The curse of dimensionality & Data Modelling
      • Data snooping: among an infinite number of candidate models, there will always be a winner, possibly by chance alone.
      • Distance: assessment of the relevance of distance measures in high-dimensional space, use of Bayesian inference, embedding dimension of a problem (Takens' theorem).
      • High-frequency data: at which frequency is the signal most relevant?
      • Data mining for selecting regressors.
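The data snooping point can be made concrete with a small simulation. This is an illustrative sketch rather than anything shown in the slides: the target and all candidate regressors are pure Gaussian noise (an assumption chosen so that no candidate has real predictive content), and `best_spurious_fit` is a hypothetical helper name. The best in-sample correlation found so far can only rise as more candidates are searched.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_spurious_fit(n_obs, n_candidates, seed=0):
    """Running maximum of |correlation| between a noise target and noise candidates.

    Element k of the returned list is the best absolute correlation
    found after searching the first k + 1 unrelated candidates.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best, history = 0.0, []
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
        history.append(best)
    return history


history = best_spurious_fit(n_obs=20, n_candidates=200)
print("best after 5 candidates:  ", round(history[4], 3))
print("best after 200 candidates:", round(history[-1], 3))
```

Because every candidate is unrelated to the target, any apparent "winner" is an artefact of the search, which is why out-of-sample checks matter when the pool of candidate models is large.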
  • 9. Eurostat (Sparse?) activities
      • Big Data Macroeconomic Nowcasting, 2016
      • Big Data Econometrics, 2017
      • Selectivity in Big Data sources, ongoing
      • "Assessing the Quality of Mobile Phone Data as a Source of Statistics", Q2016 joint paper by Statistics Belgium, Eurostat and Proximus
  • 10. Big Data Macroeconomic Nowcasting
      • Literature review on the use of Big Data for macroeconomic nowcasting
      • Use of a typology based on Doornik and Hendry (2015):
          • Tall data: many observations, few variables
          • Fat data: many variables, few observations
          • Huge data: many variables, many observations
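As a minimal sketch, the typology above can be written as a rule on the shape of a data set. The slide gives no numerical cut-off, so the threshold `many` and the function name `classify_dataset` are assumptions made purely for illustration, as is the extra "small" label for data that is neither tall nor fat.

```python
def classify_dataset(n_obs, n_vars, many=1000):
    """Label a data set as tall, fat or huge from its shape.

    `many` is a hypothetical cut-off for "many"; the slide does not
    define one, so it should be tuned to the application.
    """
    many_obs = n_obs >= many
    many_vars = n_vars >= many
    if many_obs and many_vars:
        return "huge"   # many variables, many observations
    if many_vars:
        return "fat"    # many variables, few observations
    if many_obs:
        return "tall"   # many observations, few variables
    return "small"      # outside the typology: neither dimension is large


print(classify_dataset(n_obs=1_000_000, n_vars=5))
print(classify_dataset(n_obs=50, n_vars=5_000))
```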
  • 11. Eurostat Models race
      • Dynamic Factor Analysis
      • Partial Least Squares
      • Bayesian Regression
      • LASSO regression
      • U-MIDAS models
      • Model averaging
      • 255 models tested using macro-financial and Google Trends data
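Of the model classes listed, LASSO regression is the simplest to sketch from scratch. The code below is a generic textbook coordinate-descent solver, not the implementation used in the Eurostat exercise; the function names, the penalty value and the toy data are all assumptions. It minimises (1/2n)·||y − Xb||² + penalty·||b||₁ and shows how the penalty sets weak coefficients exactly to zero.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Cyclic coordinate descent for the LASSO (no intercept).

    X is a list of rows, y a list of targets. Columns are assumed to be
    centred beforehand; the solver itself does not standardise them.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            rho = 0.0
            for i in range(n):
                # Prediction from all coefficients except the j-th one.
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            beta[j] = soft_threshold(rho, penalty) / col_sq[j]
    return beta


# Toy data: y depends on the first column only.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.0, -2.0, 0.0]
print(lasso_coordinate_descent(X, y, penalty=0.1))
```

In the toy example the irrelevant second coefficient is exactly zero and the first is shrunk below its least-squares value of 2, which is the variable-selection behaviour that makes sparse regression attractive when variables outnumber observations.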
  • 12. Eurostat Statistical Methods: findings
      • Sparse regression (LASSO) works for fat and huge data
      • Data reduction techniques (PLS) are helpful for large numbers of variables
      • (U)-MIDAS or bridge modelling for mixed frequency
      • Dimensionality reduction improves nowcasting
      • Forecast combination: a data-driven automated strategy with model rotation based on past forecasting performance works well
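The forecast combination finding describes rotating between models according to their past forecasting performance. One possible reading of that strategy is sketched below; the trailing window length, the mean squared error criterion and the name `rotating_forecast` are assumptions, not details taken from the Eurostat study.

```python
def rotating_forecast(forecasts, actuals, window=4):
    """Pick, for each period, the model with the lowest recent squared error.

    forecasts: dict mapping a model name to its list of forecasts.
    actuals:   list of realised values, same length as each forecast list.
    Only errors from periods strictly before t are used to choose at t.
    """
    names = sorted(forecasts)
    chosen, combined = [], []
    for t in range(len(actuals)):
        if t == 0:
            best = names[0]  # no track record yet: arbitrary default
        else:
            start = max(0, t - window)

            def past_mse(name):
                errors = [(forecasts[name][s] - actuals[s]) ** 2
                          for s in range(start, t)]
                return sum(errors) / len(errors)

            best = min(names, key=past_mse)
        chosen.append(best)
        combined.append(forecasts[best][t])
    return chosen, combined


actuals = [1, 1, 1, 1, 1, 1]
forecasts = {"A": [1, 1, 1, 5, 5, 5], "B": [3, 3, 3, 1, 1, 1]}
print(rotating_forecast(forecasts, actuals, window=2))
```

In this toy run the strategy stays with model A while it is accurate and switches to model B one period after A starts to fail, which reflects the unavoidable lag of any selection rule based on past performance.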
  • 13. Follow-up: Big Data Econometrics
      • Review of methods to move from unstructured to structured time-series data sets for various types of big data sources, including filtering techniques for high-frequency data.
      • Propose modelling strategies to be tested.
      • Carry out further empirical tests on possible gains in data timeliness and accuracy.
      • Big data handling tool developed as an R package.
      • Scientific summary for a Big Data Econometrics strategy.
  • 14. Big Data sources Selectivity: Main Issues
      • Self-selection and the resulting non-probability character of the data.
      • Discrepancies between big data populations and the target population.
      • Identification of statistical units (target population indirectly observed).
      How to deal with representativeness and coverage of Big Data for sampling purposes.
  • 15. Big Data sources Selectivity: Proposed methods (so far…)
      • Pseudo-design approach: reweighting (calibration, pseudo-empirical likelihood, weighting)
      • Modelling approach (M-quantile models, model-based calibration, Bayesian approach, machine learning approach)
      • Record linkage
      New study in 2017 to go further.
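The reweighting idea can be illustrated with post-stratification, one of the simplest forms of calibration. The sketch assumes that the population share of each stratum is known from an external source; the stratum labels, the figures and the function names are invented for the example and do not come from the presentation.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Weight each unit by population share / sample share of its stratum.

    sample_strata:     stratum label of every unit in the big data source.
    population_shares: known share of each stratum in the target population.
    """
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return [population_shares[s] / (counts[s] / n) for s in sample_strata]


def weighted_mean(values, weights):
    """Mean of values with the given unit weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical source where young users are over-represented.
strata = ["young", "young", "young", "old"]
uses_service = [1, 1, 1, 0]
weights = poststratification_weights(strata, {"young": 0.5, "old": 0.5})
print("unweighted:", sum(uses_service) / len(uses_service))
print("reweighted:", weighted_mean(uses_service, weights))
```

Reweighting only corrects for selectivity that is explained by the stratifying variables; self-selection within a stratum is untouched, which is one reason the slide also lists modelling approaches.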
  • 16. Mobile Phone data: Clustering Time Series (1)
      • Source: "Assessing the Quality of Mobile Phone Data as a Source of Statistics", http://www.ine.es/q2016/docs/q2016Final00163.pdf
      • Scaling: standardisation
      • Distance measure: Euclidean
      • Applied technique: K-means, using Euclidean distance after standardisation of the time series
      • Objective: find patterns enabling the classification of geographical areas into work, residential and commuting areas
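The pipeline named on this slide (standardise each series, then run K-means with Euclidean distance) can be sketched in a few lines. This is a generic illustration, not the code behind the cited paper: the activity profiles are invented, and the deterministic farthest-point initialisation is an assumption chosen to keep the example reproducible.

```python
import math


def standardise(series):
    """Z-score a series so that only its shape, not its level, matters."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    if std == 0.0:
        return [0.0 for _ in series]
    return [(v - mean) / std for v in series]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, n_iter=50):
    """Basic K-means with farthest-point initialisation; returns labels."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels


# Invented activity profiles over six time slots (night to late evening).
profiles = [
    [10, 50, 90, 80, 30, 10],      # daytime peak, low volume
    [20, 100, 180, 170, 60, 20],   # daytime peak, high volume
    [90, 40, 10, 15, 60, 95],      # night-time peak, high volume
    [45, 20, 5, 8, 30, 48],        # night-time peak, low volume
]
print(kmeans([standardise(p) for p in profiles], k=2))
```

Standardising first means that areas with the same daily pattern fall into the same cluster regardless of their total volume, which is what a classification into work, residential and commuting areas needs.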
  • 17. What's next
      • European Big Data Hackathon, 15-17 March 2017, Brussels
      • European Statistical Training Courses in 2017
  • 18. Eurostat ESTP courses supporting big data (2017)
      Courses, grouped on the slide into big data courses and methodology courses:
      • Introduction to big data and its tools
      • Hands-on immersion on big data tools
      • Big data sources: Web, social media and text analytics
      • Advanced big data sources: mobile phone and other sensors
      • Can a statistician become a data scientist?
      • The use of R in official statistics: model based estimates
      • Time-series econometrics
      • Nowcasting
      Quarters listed in the schedule column: Q1, Q2, Q4, Q3, Q2, Q2, Q1
  • 19. Thank you for your attention. Questions welcome.
      References:
      • Clément Marsilli, Variable Selection in Predictive MIDAS Models, Document de travail 520, Banque de France, https://www.banque-france.fr/uploads/tx_bdfdocumentstravail/DT-520.pdf
      • Eurostat, Big data and macroeconomic nowcasting, preliminary results presented at the ESS methodological working group (7 April 2016, Luxembourg), http://ec.europa.eu/eurostat/cros/content/item21bigdataandmacroeconomicnowcastingslides_en
      • M. Verleysen, D. François, G. Simon, V. Wertz, On the effects of dimensionality on data analysis with neural networks, https://perso.uclouvain.be/michel.verleysen/papers/iwann03mv.pdf
      • Dennis Prangle, Summary Statistics in Approximate Bayesian Computation, https://arxiv.org/pdf/1512.05633.pdf
      • Big data CROS portal, http://ec.europa.eu/eurostat/cros/content/big-data_en