This presentation describes the landscape of data and software use across the social sciences in terms of the abstract dimensions of data and data use. It then examines three use cases.
Presentation for the DASPOS Workshop (https://daspos.crc.nd.edu/index.php/workshops/workshop-2) at JCDL.
Characterizing Data and Software for Social Science Research
1. SURVEY OF COMMONALITY WITH OTHER DISCIPLINES
WORKSHOP 2 – JULY 25, 2013
INDIANAPOLIS, INDIANA
MICAH ALTMAN
DIRECTOR OF RESEARCH, MIT LIBRARIES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ESCIENCE@MIT.EDU
PRIMARY RESEARCH OR PRACTICE AREA(S)
• INFORMATION SCIENCE
• SOCIAL SCIENCE
PREVIOUS EXPERIENCE
• DIGITAL LIBRARIES
• DIGITAL PRESERVATION
• STATISTICAL COMPUTING
RELATED WORK
• PUBLICMAPPING.ORG
• INFORMATICS.MIT.EDU
CONTACT INFORMATION
E25-131, 77 MASSACHUSETTS AVE, MIT, CAMBRIDGE, MA, 02139
2. Prepared for
DASPOS Workshop
JCDL 2013
Characterizing Data and Software for
Social Science Research
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution
3. DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, or (with
the exception of co-authored previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.
Data and Software in Social Science Research
4. Collaborators & Co-Conspirators
• Jonathan Crabtree, Nancy McGovern
• National Digital Stewardship Coordination
Committee & Working Group Chairs
• Privacy Tools for Sharing Research Data
Team
(Salil Vadhan, P.I.)
http://privacytools.seas.harvard.edu/people
• Research Support
– Supported in part by NSF grant CNS-1237235
– Thanks to the Library of Congress, & the
Massachusetts Institute of Technology.
5. Related Work
• CoData Task Group on Data Citations, 2013 (Forthcoming). Out of Cite, Out of
Mind: The Current State of Practice, Policy, and Technology for the Citation of Data.
CoData Journal (Special Volume).
• Altman & Jackman, 2012, 19 Ways of Looking at Statistical Software, Journal of
Statistical Software
• National Digital Stewardship Alliance, 2013, 2014 National Agenda for Digital
Stewardship.
• Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D.,
Laevart, C., et al. 201.. Communicating Science and Engineering Data in the
Information Age. Computer Science and Telecommunications. National
Academies Press
• Altman, M., Rogerson, K., & U, D. (2008). Open Research Questions on
Information and Technology in Global and Domestic Politics – Beyond “E-.i,
41(4), 1-8. Retrieved from
http://www.journals.cambridge.org/abstract_S104909650824093X
• Altman, Gill & McDonald. 2003. Numerical Issues in Statistical Computing for
the Social Scientist
Most reprints available from:
8. Some Characteristics of Research Data
Attribute Type / Examples
Data: Structure
- Single relation (table)
- Fully relational
- Network
- Geospatial
- Semi-structured (e.g. text)
Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal/nominal
Data: Performance Characteristics
- Number of observations
- Frequency of updates
- Dimensionality
- Sparsity
- Collection heterogeneity
9. Some Characteristics of Research Measurements
Attribute Type / Examples
Measurement: Unit of Observation
- Individuals
- Groups
- Institutions
- Organizations
- Interactions
Measurement: Measurement type
- Experimental
- Observational
- Synthetic/computational
Measurement: Performance characteristic
- Metadata
- Ontology
- Quality
10. Some Characteristics of Research Data Use
Attribute Type / Examples
Analysis methods
- Counting
- GLM model family
- MLE model family
- (Constrained) continuous nonlinear optimization
- Blind global optimization
- Discrete optimization
- Bayesian Methods (MCMC)
- Heuristically/algorithmically defined
- Text mining
- Clustering
- Coding and qualitative analysis
- Exploratory Data Analysis
Desired Outputs
- Summary scalars
- Summary table
- Data subset
- Static data publication
- Static visualization
- Dynamic Visualization
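The attribute types on slides 8 through 10 can be read as the fields of a record describing a dataset and its use. The following is a minimal sketch of that reading, not part of the original presentation: the class name `DatasetProfile`, its field names, and the helper `shared` are hypothetical, and the two example profiles are filled in from the Policy Analysis and Social Message Analysis exemplar slides.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical profile of one dataset along the slides' dimensions."""
    name: str
    structure: FrozenSet[str]
    scales: FrozenSet[str]
    observations: Tuple[int, int]  # (low, high) rough bounds on observation count
    units: FrozenSet[str]
    measurement_types: FrozenSet[str]
    analysis_methods: FrozenSet[str]
    outputs: FrozenSet[str]


def shared(a: DatasetProfile, b: DatasetProfile, attribute: str) -> FrozenSet[str]:
    """Return the values two profiles have in common for a set-valued attribute."""
    return getattr(a, attribute) & getattr(b, attribute)


# Values taken from the Policy Analysis exemplar slide.
policy = DatasetProfile(
    name="Policy analysis",
    structure=frozenset({"single relation"}),
    scales=frozenset({"ratio", "interval", "ordinal"}),
    observations=(10_000, 100_000),
    units=frozenset({"individuals", "organizations", "institutions"}),
    measurement_types=frozenset({"observational"}),
    analysis_methods=frozenset({"counting", "GLM"}),
    outputs=frozenset({"summary scalars", "summary table", "static visualization"}),
)

# Values taken from the Social Message Analysis exemplar slide.
social = DatasetProfile(
    name="Social message analysis",
    structure=frozenset({"network"}),
    scales=frozenset({"ratio", "interval", "ordinal", "nominal"}),
    observations=(10_000_000, 1_000_000_000),
    units=frozenset({"individuals", "interactions"}),
    measurement_types=frozenset({"observational"}),
    analysis_methods=frozenset({"clustering", "nonlinear optimization", "Bayesian methods"}),
    outputs=frozenset({"summary scalars", "summary table", "static visualization",
                       "interactive visualization"}),
)

print(sorted(shared(policy, social, "measurement_types")))
print(sorted(shared(policy, social, "analysis_methods")))
```

Comparing profiles this way makes the contrast between use cases explicit: the two exemplars share a measurement type and most desired outputs, but differ in structure, scale, and analysis methods.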
11. Some Characteristics of Use Constraints
[Diagram: constraints on data use, arranged under the headings Contract, Intellectual Property, Access Rights, and Confidentiality]
• Copyright
• Fair Use
• DMCA
• Database Rights
• Moral Rights
• Intellectual Attribution
• Trade Secret
• Patent
• Trademark
• Common Rule (45 CFR 46)
• HIPAA
• FERPA
• EU Privacy Directive
• Privacy Torts (Invasion, Defamation)
• Rights of Publicity
• Sensitive but Unclassified
• Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …)
• Classified
• FOIA
• CIPSEA
• State Privacy Laws
• EAR
• State FOI Laws
• Journal Replication Requirements
• Funder Open Access
• Contract
• License
• Click-Wrap TOU
• Export Restrictions
• NDA
13. Exemplar: Policy Analysis
Attribute Type / Examples
Data: Structure
- Single relation (table)
Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal
Data: Performance Characteristics
- 10K-100K observations
- Monthly/annual updates
- Dozens of dimensions/measures
Measurement: Unit of Observation
- Individuals; Organizations; Institutions
Measurement: Measurement type
- Observational
- Repeated cross-sectional/longitudinal over decades
Measurement: Performance characteristic
- High quality measurements
- Systematic and complete metadata
- Controlled ontology
- Regular updates & long-term access
Management Constraints
- Confidentiality; Public Access
Analysis methods
- Counting (contingency tables); GLM Family
Desired Outputs
- Summary scalars
- Summary table
- Static visualization (map)
More Information
• Science and Engineering Indicators:
http://www.nsf.gov/statistics/seind12/
• Details of NCSES use case:
Novak et al. 2011
• Policy data producer perspectives:
Journal of Official Statistics
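The "Counting (contingency tables)" method named in this exemplar can be illustrated with a short sketch. This is an assumption-laden illustration, not the NCSES workflow: the records, category names, and the function `contingency_table` are invented for the example, and only the Python standard library is used.

```python
from collections import Counter

# Hypothetical survey records as (sector, degree_field) pairs.
# These values are invented purely for illustration.
records = [
    ("academic", "science"),
    ("academic", "engineering"),
    ("industry", "engineering"),
    ("industry", "engineering"),
    ("government", "science"),
    ("industry", "science"),
]


def contingency_table(pairs):
    """Cross-tabulate (row, column) category pairs into a table of counts."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[cells[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


rows, cols, table = contingency_table(records)
for row_label, counts in zip(rows, table):
    print(row_label, dict(zip(cols, counts)))
```

With tens of thousands of observations and dozens of measures, as in this use case, the same cross-tabulation would more likely be run in a statistical package, but the operation is the same.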
14. Exemplar: Media Anthropology Dissertation
Attribute Type / Examples
Data: Structure
- Audio, video
- GIS coverage/GPS trails
- Semi-structured field notes
- Coded qualitative and quantitative data
Data: Attribute Types
- Discrete
- Scale: ordinal/nominal
Data: Performance Characteristics
- Hundreds of observed units
- Longitudinal
- Dozens of dimensions/measures
- Static after publication
Measurement: Unit of Observation
- Individuals; Organizations; Physical environment
Measurement: Measurement type
- Observational; Interaction
Measurement: Performance characteristic
- High quality measurements
- Systematic and complete metadata
- Emergent coding/ontology
Management Constraints
- Confidentiality; social norms
Analysis methods
- Counting; Discourse; CAQDA (Qualitative)
- (Future) AI/Machine learning
Desired Outputs
- Book
- 1-2 hour video / interactive media synthesis
More Information
• Harvard media anthropology Ph.D. Program:
sel.fas.harvard.edu/phd.html
Image Sources: Wikimedia Commons, Pixabay.com, Flickr
15. Exemplar: Social Message Analysis
Attribute Type / Examples
Data: Structure
- Network
Data: Attribute Types
- Continuous/Discrete
- Scale: ratio/interval/ordinal/nominal
Data: Performance Characteristics
- 10M-1B observations
- Sample from stream of continuously updated corpus
- Dozens of dimensions/measures
Measurement: Unit of Observation
- Individuals; Interactions
Measurement: Measurement type
- Observational
Measurement: Performance characteristic
- High volume
- Complex network structure
- Sparsity
- Systematic and sparse metadata
Management Constraints
- License; Replication
Analysis methods
- Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods
Desired Outputs
- Summary scalars (model coefficients)
- Summary table
- Static/interactive visualization
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-
assisted clustering and conceptualization." Proceedings of the
National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in
China allows government criticism but silences collective
expression." APSA 2012 Annual Meeting Paper. 2012.
• Lazer, David, et al. "Life in the network: the coming age of
computational social science." Science (New York, NY) 323.5915
(2009): 721.
16. Trends: More
More Types of Evidence
More Collaboration
More Data
More Publications, More Filters
More Learners
More Open
More Replication
17. Some Challenges for Long-Term
Replication/Access
• “messy” human sensors
• Mix of data types, structures, sparsity
• Complex constraints: confidentiality, licensing,
NDAs
• Manual/Computer-assisted coding
• Niche commercial software (and private bespoke
software) integral to analysis
• Very long term longitudinal data/accessibility
requirements
This work, by Micah Altman (http://micahaltman.com), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Any images included in derivative works must be individually attributed to their original sources, as indicated in notes.
File icon is licensed under CC0 on pixabay.com. http://pixabay.com/en/spreadsheet-excel-table-diagram-98491/
Dissertation is licensed under CC-BY-SA by Victoria Catterson. http://www.flickr.com/photos/cowlet/354911838/
Other images available through commons.wikimedia.org
Other image source: Wikimedia Commons
The LHC produces a petabyte of data every two weeks; the Sloan Galaxy Zoo has hundreds of thousands of “authors”; 50K people attend a class from the University of Michigan; and to understand public opinion, instead of surveying hundreds of people per month, we can analyze 10,000 tweets per second.