Incidental data from social media have been used in individual-level social science research to predict attributes such as political views, personality, and health indicators. However, the author notes problems of selectivity, reliability, and comparability that limit their use in serious social research. Although some methodological work has applied machine learning techniques, significant challenges remain around generalizability, privacy, reproducibility, and the integration of different data sources and modalities. The author argues that more work is needed to solve key social science challenges, through a "grand challenge" approach and techniques such as cross-validation, penalized models, and multimodal learning.
Oberski EAM 2018 - Incidental data for serious social research
1. Incidental data for serious social research
Daniel Oberski
Utrecht Applied Data Science
Dept Methodology & Statistics
http://daob.nl
https://uu.nl/ads
2. • Incidental data are used throughout business and government
• What about social science?
1. Done - 2. To do - 3. Conclusion
8. … at least, on Twitter
Jungherr et al. (2012). Why the Pirate Party won the German Election of 2009. Soc Sci Comp Rev.
Gayo-Avello (2012). I tried to predict elections from Twitter and all I got was this lousy paper.
10. Blandfort et al. (23 Jul 2018). Multimodal Social Media Analysis for Gang Violence Prevention. arXiv:1807.08465v1.
“High af”
“Shyt Dnt always happen how u plan it”
“Goodmorning cold ass world”
“Rip lil B”
Image+Text -> Aggression/Loss/Substance use/Other
12. “The (implicit) hope is that analyses of
social media content might be substituted for costly
and burdensome survey responses.
Current evidence suggests we are far from that…”
Conrad (2015)
13. Problems with incidental data:
methodological
Selectivity Reliability
Source: Mellon & Prosser (2017)
Comparability:
16. Data science term | Social science term
Learning | Estimating a model
Supervised learning | Predicting stuff
Unsupervised learning | Latent variable modeling
Example / instance | Case
Feature | (Independent) variable
Target | Dependent variable
Loss | * log-likelihood
Gaussian Bayesian network | Structural equation model
Classifier | Model for categorical DV
Regression | Model for continuous DV
Softmax | Multinomial regression
Error | Prediction error
Variance | * Prediction sampling error
Bias | * “Average prediction error”
Social science term | Data science term
Criterion variable | ~ Ground truth
Capitalization on chance, p-hacking, HARKing, etc. | Overfitting
Reliability | ?
Internal validity | ?
External validity | ? (-> generalization error)
Measurement invariance | ~ Concept drift (-> transfer learning)
Measurement error | Noise
Measurement error model (correction) | Noise-aware machine learning
Measurement error model (estimation) | Inverse model
~ Deviance; Chi-square (exponential of) | Perplexity
? | Grand challenge
Legend: *: Usually. ~: Not really the same, but close enough. ->: Relates to. ?: Work to do!
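The "Deviance / Perplexity" pairing in the table can be checked numerically. The sketch below is an illustration, not part of the talk: it assumes individual-level categorical outcomes, so the saturated model has log-likelihood 0 and the deviance is simply -2 times the log-likelihood. The probabilities are made-up toy values.

```python
import math

# Hypothetical toy data: the probability a fitted model assigned to the
# category that was actually observed, for each of five cases.
probs_of_observed = [0.9, 0.6, 0.75, 0.2, 0.5]
n = len(probs_of_observed)

# Log-likelihood of the observed outcomes under the model.
log_likelihood = sum(math.log(p) for p in probs_of_observed)

# Deviance, assuming a saturated model with log-likelihood 0.
deviance = -2.0 * log_likelihood

# Perplexity: exponential of the mean negative log-likelihood per case.
perplexity = math.exp(-log_likelihood / n)

# The same quantity recovered from the deviance.
perplexity_from_deviance = math.exp(deviance / (2 * n))

print("deviance:", deviance)
print("perplexity:", perplexity)
print("exp(deviance / 2n):", perplexity_from_deviance)
```

Under these assumptions perplexity is a monotone transformation of the deviance, which is why the two are marked as close but not identical.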
17. Essential tools for methodologists
• Cross-validation and its relationship to generalizability
Train/validation/test paradigm
“Overfitting” theory
• Penalized estimation
L1 LASSO; L2 ridge; horseshoe; …
• Standard data science prediction workflow
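A minimal sketch of how cross-validation and penalized estimation fit together, assuming simulated data and a hand-written LASSO (coordinate descent, no intercept) rather than any particular library. The function names, the penalty grid, and the data-generating model are all hypothetical choices made for illustration; a full train/validation/test workflow would also hold out a final test set, which is omitted here.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/(2n)) * RSS + lam * sum(|beta_j|), no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, zj = 0.0, 0.0
            for i in range(n):
                # Prediction from all features except j (partial residual).
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                zj += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (zj / n) if zj > 0 else 0.0
    return beta

def mse(X, y, beta):
    """Mean squared prediction error of a fitted coefficient vector."""
    return sum((yi - sum(x * b for x, b in zip(xi, beta))) ** 2
               for xi, yi in zip(X, y)) / len(y)

def k_fold_cv_error(X, y, lam, k=5):
    """Average held-out MSE over k interleaved folds."""
    n = len(y)
    errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        X_te = [X[i] for i in sorted(held_out)]
        y_te = [y[i] for i in sorted(held_out)]
        errors.append(mse(X_te, y_te, lasso_fit(X_tr, y_tr, lam)))
    return sum(errors) / k

# Simulated data (an assumption): only the first two of five features matter.
rng = random.Random(0)
n, p = 60, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
true_beta = [2.0, -1.0, 0.0, 0.0, 0.0]
y = [sum(x * b for x, b in zip(xi, true_beta)) + rng.gauss(0, 0.5) for xi in X]

# Choose the penalty with the lowest estimated out-of-sample error.
lambdas = [0.0, 0.01, 0.1, 0.5, 10.0]
cv_errors = {lam: k_fold_cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
print("CV error per penalty:", cv_errors)
print("selected penalty:", best_lam)
```

The held-out error is an estimate of generalization error for cases drawn from the same population as the training data; it says nothing by itself about selectivity or about transfer to a different population.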
18. Solving key social science challenges?
Grand challenge approach (thanks to Adrienne Mendrik, Netherlands eScience Center)
Multimodal learning (“data fusion”; see work by Katrijn van Deun, Tilburg University)
Privacy-aware ML (differential privacy, federated learning; see Cynthia Dwork, Microsoft)
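To make "differential privacy" concrete, here is a minimal sketch of the Laplace mechanism for a counting query. It is an illustration under stated assumptions, not a method from the talk: the function names and the toy ages are hypothetical, and a counting query is assumed so that the sensitivity is 1.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    # expovariate takes a rate, so a mean of `scale` needs rate 1 / scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy data: ages of eight respondents.
ages = [23, 67, 45, 71, 34, 68, 52, 80]
true_count = sum(1 for a in ages if a >= 65)

rng = random.Random(0)
noisy = private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng)
print("true count:", true_count)
print("privately released count:", noisy)
```

A smaller epsilon gives stronger privacy and noisier answers, which is the trade-off a methodologist would need to quantify for any given analysis.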
21. Summary
• Incidental data haven’t revolutionized our field yet;
• Probably because we need to work on the methodology first;
• Although scores of authors have come to the same conclusion,
most of the work remains to be done;
You are the ideal person to do this work.
22. Thank you for your attention!
E: d.l.oberski@uu.nl
T: @DanielOberski
W: http://daob.nl
W: https://uu.nl/ads