Our presentation introduces the GDPR, then gives a brief overview of key GDPR principles and explains how they are likely to affect data scientists and behavioral researchers.
Adjusting to the GDPR: The Impact on Data Scientists and Behavioral Researchers
1. Adjusting to the GDPR: The Impact on Data Scientists and Behavioral Researchers
Travis Greene, Galit Shmueli, Soumya Ray
National Tsing Hua University, Taiwan
INFORMS 2nd Data Science Workshop, Phoenix, Nov 4, 2018
2. Roadmap
1. Personal Data: USA vs. EU
2. GDPR in a Nutshell
3. Processing GDPR through the InfoQ Framework
4. How will GDPR impact data scientists?
3. Personal Data: Any information that could be used to ‘single out’ a person
USA: Commercial commodity (opt-out)
“Collecting and processing [personal data] is allowed unless it causes harm or is expressly limited by U.S. law.”
EU: Fundamental right (Article 8, EU Charter of Fundamental Rights) (opt-in)
“Processing of personal data is prohibited unless there is an explicit legal basis that allows it.”
4. (Potentially) global reach
Evolution of 1995 Data Protection Directive into EU-wide Regulation
● Up to 20M Euro fines or 4% of global turnover
● Affects both industry and research practices
● Similar privacy laws in USA, China, India, Brazil...
Defines three key entities: Data Controller, Data Processor, Data Subjects
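The fine ceiling on this slide can be written as a one-line calculation. The sketch below assumes the "whichever is higher" reading of Article 83(5) for the upper fine tier; the function name and the example turnover figures are hypothetical, and actual fines are set case by case by the supervisory authority.

```python
def max_gdpr_fine_eur(global_annual_turnover_eur: float) -> float:
    """Upper bound of the higher fine tier: the greater of EUR 20M or 4% of turnover.

    Assumption: the 'whichever is higher' rule applies; this is a ceiling,
    not a prediction of any actual fine.
    """
    flat_cap_eur = 20_000_000
    turnover_cap_eur = global_annual_turnover_eur * 4 / 100
    return max(flat_cap_eur, turnover_cap_eur)


# Hypothetical turnovers: a EUR 100M firm and a EUR 1B firm.
small_firm_cap = max_gdpr_fine_eur(100_000_000)
large_firm_cap = max_gdpr_fine_eur(1_000_000_000)
print(small_firm_cap, large_firm_cap)
```

For smaller organizations the flat EUR 20M cap dominates; the 4% rule only takes over once global turnover exceeds EUR 500M.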
6. If you’re a data science researcher, it is difficult to synthesize a coherent understanding of the new GDPR changes
→ We need a structured framework!
7. Our three-step approach to analyzing GDPR
1. Identify: Identify key GDPR concepts, definitions, and principles relevant to data science research
2. Categorize: Categorize key GDPR concepts in a meaningful way for data scientists
3. Analyze: Use categorization to analyze the impact of GDPR on the data science workflow
8. InfoQ provides a coherent, systematic framework for assessing the impact of GDPR on data scientists
The Information Quality (InfoQ) Framework (Kenett & Shmueli, 2014)
InfoQ: Potential of a dataset to achieve a goal, given analysis method and utility
InfoQ depends on 4 components: goal, data, analysis, utility
Assess InfoQ? 8 Dimensions:
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data & goal
6. Generalizability
7. Operationalization
8. Communication
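For reference, the verbal definition of InfoQ on this slide is usually written as a single expression in Kenett & Shmueli (2014). The symbols below follow that paper rather than anything printed on the slide: g is the goal, X the data, f the analysis method, and U the utility measure.

```latex
\mathrm{InfoQ}(f, X, g) = U\bigl(f(X \mid g)\bigr)
```

Read from the inside out: the data X are considered in light of the goal g, analyzed with f, and the result is scored by U.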
9. GDPR Concepts, Definitions, Principles
1. Identify
Privacy by Design, Special Category Data, Purpose Limitation, Automated Profiling Systems, Pre-GDPR Data, Pseudonymized Data, Legitimate Interests, Structured and Unstructured Data, Statistical Research, Statistical Aggregations, Consent, Principle of Proportionality, Data Controllers, Contractual Necessity, Documentation, Serve Mankind
2. Categorize (by InfoQ component)
Goal: Scientific Research, Statistical Research, Public Interest Research, Historical Research, Archival Research
Data: Personal Data, Special Category Data, Pseudonymized Data, Statistical Data, Publicly Available Data, Pre-GDPR Personal Data
Utility: Principle of Proportionality, Purpose Limitation, Contractual Necessity, Legitimate Interests, Privacy by Design, Consent
Analysis: Statistical Aggregation, Automated Profiling, Filing Systems, Structured vs. Unstructured
10. Examine Typical Data Analysis Workflow Using InfoQ Framework
3. Analyze
Workflow (Beginning of Research → Complete Analysis): 1. Collect Data, 2. Use Data, 3. Share Data, 4. Generalize, 5. Communicate
InfoQ 8 Dimensions: 1. Resolution, 2. Structure, 3. Integration, 4. Temporal relevance, 5. Chronology, 6. Generalizability, 7. Operationalization, 8. Communication
InfoQ provides us with ‘x-ray’ vision for analyzing each step of the process
11. A Modern Data Science Workflow
1. Collect Data
Data Minimization: What kinds of data can we legally collect?
Purpose Limitation: On which legal grounds can we collect users’ data?
Pseudonymization: How should collected data be stored and secured?
2. Use Data
Pre-GDPR Data: If subjects consented prior to GDPR, can we continue to use their data?
Heterogeneity: Will these data be available at the time of prediction?
3. Share Data
Collaboration: How can academics make use of the vast stores of behavioral big data (BBD) collected and processed by major internet companies?
Liability: GDPR imposes large potential fines
4. Generalize
Consent bias: How do we know our results will generalize to the population of interest?
Replication: Can our results be replicated?
5. Communicate
Data Subjects: How do we explain our results to concerned data subjects?
Data Protection Authorities: How can we prove our compliance with GDPR principles?
8 InfoQ Dimensions: 1. Resolution, 2. Structure, 3. Integration, 4. Temporal relevance, 5. Chronology, 6. Generalizability, 7. Operationalization, 8. Communication
12. Data Minimization & Purpose Limitation
Workflow stage: 1. Gathering Data (pre-collection)
Collect only for specific purposes clearly explained
→ Must justify “Why do you need my ethnicity?”
Can’t arbitrarily repurpose personal data
→ Need legal basis
Data minimization & privacy preservation paradox
→ Power calculations may indirectly lead to re-identification
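One way to make "collect only for specific purposes" concrete is to keep a register of fields and their stated purposes, and drop anything not on it before storage. This is a minimal sketch: the register, the field names, and the `minimize` helper are all hypothetical, and a purpose entry by itself does not establish a legal basis.

```python
# Hypothetical purpose register: every collected field needs a stated purpose.
DECLARED_PURPOSES = {
    "email": "account login and service notifications",
    "purchase_history": "order fulfilment",
}


def minimize(record: dict, declared_purposes: dict) -> dict:
    """Keep only the fields that have a declared purpose; drop everything else."""
    return {field: value for field, value in record.items() if field in declared_purposes}


# Illustrative raw submission containing a field with no declared purpose.
raw = {
    "email": "a@example.com",
    "purchase_history": ["book"],
    "ethnicity": "undisclosed",
}
collected = minimize(raw, DECLARED_PURPOSES)
print(sorted(collected))
```

Here `ethnicity` is discarded because nobody could answer "Why do you need my ethnicity?" with an entry in the register.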
13. Pseudonymization
Workflow stage: 1. Gathering Data (pre- and post-collection)
Data features that might (reasonably) be used to ID a specific person are stored separately and securely from other data
Example identifying features: IP: 192.18.8.1; Name: Travis Green
Pseudonymization is just a suggestion
→ Spur research on ‘privacy protective data mining’
Different implications for different researchers
→ Personalized vs. aggregate-level models
Pseudonymized data is contextual
→ Know incentives & data environment
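The definition above (identifying features stored separately from the rest) can be sketched as a split into two tables linked by a random token. The field names, the `pid` key, and the choice of which fields count as identifying are assumptions for illustration. In practice the lookup table would live in a separately secured store, and whether the remaining fields allow re-identification depends on context.

```python
import secrets

# Assumption: these are the fields judged reasonably likely to identify a person.
IDENTIFYING_FIELDS = frozenset({"name", "ip"})


def pseudonymize(records, identifying_fields=IDENTIFYING_FIELDS):
    """Split records into (lookup table, pseudonymized rows) linked by a random token."""
    lookup = {}          # token -> identifying features; to be stored separately
    pseudonymized = []   # rows that analysts work with
    for record in records:
        token = secrets.token_hex(8)
        lookup[token] = {k: v for k, v in record.items() if k in identifying_fields}
        row = {k: v for k, v in record.items() if k not in identifying_fields}
        row["pid"] = token
        pseudonymized.append(row)
    return lookup, pseudonymized


# The name and IP come from the slide; "clicks" is a hypothetical behavioral field.
records = [{"name": "Travis Green", "ip": "192.18.8.1", "clicks": 12}]
lookup, rows = pseudonymize(records)
print(rows)
```

Aggregate-level models can often be fit on `rows` alone, while personalized models may need the link back through `lookup`, which is one reason the implications differ across researchers.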
14. Reconsent, Data Availability & Heterogeneity
Workflow stage: 2. Using Data
Pre-GDPR user data reconsent
→ Fewer rows but more accuracy
Data availability for future prediction
→ Must expect opt-outs
More user privacy options
→ Larger heterogeneity in completeness
Models built using de-consented data
→ Still not clear, but Article 7 seems to allow it
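The effect of reconsent and finer-grained privacy options on a dataset can be checked with two small helpers: one that keeps only rows with current consent, and one that reports how complete each field is among the rows that remain. The rows, field names, and `consent` flag below are made up for illustration.

```python
def consented_rows(rows):
    """Keep only rows whose subject has given (or renewed) consent."""
    return [row for row in rows if row.get("consent")]


def completeness(rows, fields):
    """Share of rows with a non-missing value, per field."""
    if not rows:
        return {}
    return {
        field: sum(row.get(field) is not None for row in rows) / len(rows)
        for field in fields
    }


# Hypothetical user table: user 3 did not reconsent; others withheld some fields.
users = [
    {"user": 1, "consent": True, "age": 34, "location": "Taipei"},
    {"user": 2, "consent": True, "age": 51, "location": None},
    {"user": 3, "consent": False, "age": 27, "location": "Phoenix"},
    {"user": 4, "consent": True, "age": None, "location": None},
]
kept = consented_rows(users)
coverage = completeness(kept, ["age", "location"])
print(len(kept), coverage)
```

A model that relies on `location` at prediction time would find it missing for most of the retained users in this toy table, which is the availability concern raised above.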
15. Increased Legal Liability
Workflow stage: 3. Sharing Data
Companies dropping third-party sharing
→ Less rich data
Data subject re-identification and intellectual property
→ “Data access divide”: trusted researchers from elite universities
New legal instruments of compliance
→ Binding Corporate Rules (BCRs), Standard Contractual Clauses, certification schemes
16. Consent Bias, Guinea Pigs, & Reproducibility
Workflow stage: 4. Generalizing
Privacy-savvy users may opt out
→ Limits inferential power
Lower standards of consent & processing
→ Non-EU users become behavioral big data guinea pigs
Reproducibility of results vs. legal liability
→ Is it worth it for firms?
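A toy calculation shows why opt-outs by privacy-savvy users limit inference. The population below is entirely made up, and it assumes the extreme case where every privacy-savvy user opts out and where privacy-savviness is related to the measured behavior.

```python
from statistics import mean

# Hypothetical population: privacy-savvy users happen to spend less time online.
population = [
    {"privacy_savvy": True, "daily_minutes": 30},
    {"privacy_savvy": True, "daily_minutes": 40},
    {"privacy_savvy": False, "daily_minutes": 90},
    {"privacy_savvy": False, "daily_minutes": 120},
]

# Assumption: all privacy-savvy users opt out, so only the rest are observed.
consenting = [person for person in population if not person["privacy_savvy"]]

population_mean = mean(person["daily_minutes"] for person in population)
consenting_mean = mean(person["daily_minutes"] for person in consenting)
print(population_mean, consenting_mean)
```

The estimate from consenting users (105 minutes) overstates the population value (70 minutes). Collecting more consenting users would not close that gap, because the missingness is tied to the outcome.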
17. Two Audiences: Data Subjects and Data Authorities
Workflow stage: 5. Communicating
Data Subjects
→ Rights to access/information in simple, clear language
→ Right to explanation (why & how) of automated profiling
Authorities
→ Compliance documentation, data protection impact assessments (DPIAs), data breach reporting
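For simple additive models, one candidate form of a "why & how" explanation is a per-feature breakdown of the score. The sketch below uses a hypothetical linear scoring model with made-up weights. It illustrates the idea only and makes no claim about what the GDPR requires an explanation to contain.

```python
def explain_score(features, weights, intercept=0.0):
    """Return the score and each feature's additive contribution, largest first."""
    contributions = {name: weights[name] * value for name, value in features.items()}
    score = intercept + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return score, ranked


# Hypothetical model weights and one data subject's feature values.
WEIGHTS = {"age": 0.5, "visits": 2.0}
score, ranked = explain_score({"age": 40, "visits": 3}, WEIGHTS, intercept=1.0)

print(f"score: {score:.1f}")
for name, contribution in ranked:
    print(f"{name}: {contribution:+.1f}")
```

The same ranked list could be rendered in plain language for a data subject or archived as part of compliance documentation for an authority.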
18. Summary & Final Thoughts
- Rethink & justify how and why we collect, store, and analyze personal data
- Tradeoffs between economic development and fundamental rights to privacy