Data, Responsibly: The Next Decade of Data Science

Data, Responsibly:
The Next Decade of Data Science
Bill Howe, PhD
Associate Professor, Information School
Associate Director, eScience Institute
Adjunct Associate Professor, Computer Science & Engineering
University of Washington

My goals this evening
• Describe emerging topics in data science research
and practice around a technical interpretation of ethics
• Describe some specific thrusts we are pursuing
• Encourage you to get involved
11/10/2016 Data, Responsibly / SciTech NW 2

How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
11/10/2016 Bill Howe, UW 3

1) Upload data “as is”
Cloud-hosted; no need to
install or design a database;
no pre-defined schema
2) Analyze data with SQL
Right in your browser,
writing queries on top of
queries on top of queries ...
SELECT hit, COUNT(*)
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Click on the science question,
see the SQL that answers it

5
SPARQL
(GEMS)Serial C++
PGAS/HP
CMyriaX RDBMS
SQLDatalogMyriaL
Compiler Compiler Compiler Compiler Compiler
Hadoop
(via layers)
Compiler
multiple
languages
multiple big
data systems
multiple
GUIs/apps

Making it easier to do data science
• SQLShare: Easier to use a database
• Myria: Easier to use a bunch of different
systems at once, at scale
• Worked great in the physical sciences
• But some collaborators weren’t that excited…
November 10, 2016 6

11/10/2016 Bill Howe, UW 7
Data Science Kickoff Session:
137 posters from 30+ departments and units

Data, data, data
8
Kevin Merrit
CEO
Socrata
Deep Dhillon
CTO
Socrata

9
• Pursue transformative interdisciplinary urban research
• Facilitate translation from UW to .gov stakeholders
• Position Seattle/UW as a leader in applied urban research
• 80+ faculty from 20+ departments around campus

10
Assessing Community Well-Being
Third-Place Technologies
Optimization of King County Metro Paratransit
Computer Science & Engineering
Predictors of Permanent Housing for Homeless Families
Bill and Melinda Gates Foundation
Open Sidewalk Graph for Accessible Trip Planning
Computer Science & Engineering
Inaugural 2015 program:
16 spots
140 applicants
…from 20+ departments

11
Mining Online Data to Detect Unsafe Food Products
Elaine Nsoesie, Institute for Health Metrics and Evaluation
ORCA data for improved transit system planning and operation
Washington State Transportation Center (TRAC)
Global Open Sidewalks: Creating a shared open data layer
Taskar Center for Accessible Technology
CrowdSensing Census: A tool for estimating poverty
Bell Labs, Nokia
2016 program:
16 spots
190 applicants
New in 2016: An explicit emphasis on data ethics

July 2016
“Data, Responsibly”
Dagstuhl Workshop
Gerhard
Weikum
Serge
Abiteboul
Julia
Stoyanovich
Gerome
Miklau

14
Cathy O’Neil
September 2016
Three properties of a WMD:
Opacity
Scale
Damage

First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
The way I think about this…..(1)

The way I think about this…. (2)
Decisions are based on two sources of information:
1. Past examples
e.g., “prior arrests tend to increase likelihood of future arrests”
2. Societal constraints
e.g., “we must avoid racial discrimination”
We’ve become very good at automating the use of past examples
We’ve only just started to think about incorporating societal constraints

The way I think about this… (3)
How do we apply societal constraints to algorithmic
decision-making?
Option 1: Keep a human in the loop
Ex: EU General Data Protection Regulation requires that a
human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review
algorithmic decisions made by recidivism models
Option 2: Build them into the algorithms themselves
I’ll talk about some approaches for this

The way I think about this…(4)
On transparency vs. accountability:
• For human decision-making, sometimes explanations are
required, improving transparency
– Supreme court decisions
– Employee reprimands/termination
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both
transparency AND accountability
• “The buck stops where?”

Some Facets of “Data, Responsibly”
• Privacy
• Fairness
• Transparency
• Reproducibility
• Ethics
I won’t be talking about this
I’ll give a taste of the work here
I won’t be talking about this
Towards automatic scientific claim-checking
Vignette on teaching data ethics

FAIRNESS

21
Ex: Staples online pricing
Reasoning: Offer deals to people that live near competitors’ stores
Effect: lower prices offered to buyers who live in more affluent
neighborhoods

22
[Latanya Sweeney; CACM 2013]
Racially identifying names trigger
ads suggestive of an arrest record
slide adapted from Stoyanovich, Miklau

24
The Special Committee on Criminal Justice Reform's
hearing of reducing the pre-trial jail population.
Technical.ly, September 2016
Philadelphia is grappling with the prospect of a racist computer algorithm
Any background signal in the
data of institutional racism is
amplified by the algorithm
operationalized by the algorithm
legitimized by the algorithm
“Should I be afraid of risk assessment tools?”
“No, you gotta tell me a lot more about yourself.
At what age were you first arrested?
What is the date of your most recent crime?”
“And what’s the culture of policing in the
neighborhood in which I grew up in?”

26
Towards a precise characterization of fairness…
Positive Outcomes Negative Outcomes
offered employment denied employment
accepted to school rejected from school
offered a loan denied a loan
offered a discount not offered a discount
Label outcomes to individuals as positive or negative
Fairness is concerned with how outcomes are
assigned to a population

27
Statistical parity
race
black
white
⊕ ⊖
⊖
⊕
⊕⊖
⊖
⊖
⊕
40% of the whole population
positive
outcomes
⊖
Statistical parity
demographics of the individuals receiving any outcome are the same
as demographics of the underlying population
20%
of black
60%
of white

28
First attempt: Ignore sensitive information
zip code
10025 10027
race
black
white
20%
of black
60%
of white
⊕
⊖
⊖⊖
⊕
⊕ ⊖
⊖
⊖
⊕
positive
outcomes
Removing race from the vendor’s assignment process
does not prevent discrimination
Assessing disparate impact
Discrimination is assessed by the effect on the protected sub-
population, not by the input or by the process that lead to the effect.

29
More directly: Impose statistical parity
credit score
good bad
black
white
⊕
⊖
⊖
⊖
⊕
⊕ ⊖
⊖
⊖
⊕
positive
outcomes
40%
of black
40%
of white
race positive outcome: offered a loan
Tradeoff between (perceived) accuracy and fairness;
may be contrary to the goals of the vendor

30
A systems approach:
FairTest: fairness test suite for data analysis apps
• Tests for unintentional discrimination according to several
representative discrimination measures.
• Automates search for context-specific associations between
protected variables and application outputs
• Report findings, ranked by association strength and
affected population size
[F. Tramèr et al., arXiv:1510.02377 (2015)]

As a corporation, should I care?
Compliance
Jacobson, Scientific American, 2013
Customer
Retention
Employee
Retention
Eichler, Hiffington Post, 2012
CNET, May 2016

REPRODUCIBILITY
11/10/2016 Bill Howe, UW 32

Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
– Only about half of psychology 100 studies had effect sizes that approximated
the original result (Science, 2015)
– Ioannidis 2005: Why most public research findings are false
– Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
11/10/2016 Bill Howe, UW 33

11/10/2016 Data, Responsibly @ Dagstuhl 35
Retractions are increasing…..

Why is this happening? (1)
11/10/2016 Bill Howe, UW 37

11/10/2016 Bill Howe, UW 38

Publication Bias!

“DEEP CURATION”
TOWARDS AUTOMATIC SCIENTIFIC CLAIM CHECKING

Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: Validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature

11/10/2016 Bill Howe, UW 43
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim
Gretchkin
Hoifung
Poon

Maxim
Gretchkin
Hoifung
Poon
No growth in number of
datasets used per paper!

Maxim
Gretchkin
Hoifung
Poon
Majority of samples are
one-time-use only!

color = labels supplied
as metadata
clusters = 1st two PCA
dimensions on the
gene expression data
itself
Can we use curate algorithmically?
Maxim
Gretchkin
Hoifung
Poon
The expression data
and the text labels
appear to disagree

Maxim
Gretchkin
Hoifung
Poon
Better Tissue
Type Labels
Domain knowledge
(Ontology)
Expression data
Free-text Metadata
2 Deep Networks
text
expr
SVM

Deep Curation Maxim
Gretchkin
Hoifung
Poon
Distant supervision and co-learning between text-
based classified and expression-based classifier: Both
models improve by training on each others’ results.
Free-text classifier
Expression classifier

Deep Curation:
Our stuff wins, with no training data
Maxim
Gretchkin
Hoifung
Poon
state of the art
our reimplementation
of the state of the art
our dueling
pianos NN
amount of training data used

VIGNETTE ON TEACHING
DATA ETHICS
11/10/2016 Bill Howe, UW 51

Alcohol Study, Barrow Alaska, 1979
Native leaders and city officials,
worried about drinking and associated
violence in their community invited a
group of sociology researchers to
assess the problem and work with
them to devise solutions.

Methods
• 10% representative sample
(N=88) of everyone over the age
of 15 using a 1972 demographic
survey
• Interviewed on attitudes and
values about use of alcohol
• Obtained psychological histories
including drinking behavior
• Given the Michigan Alcoholism
Screening Test (Seltzer, 1971)
• Asked to draw a picture of a
person
– Used to determine cultural identity

Results announced unilaterally and publicly
At the conclusion of the study researchers formulated a report entitled “The
Inupiat, Economics and Alcohol on the Alaskan North Slope” which was released
simultaneously at a press release and to the Barrow community. The press
release was picked up by the New York Times, who ran a front page story
entitled Alcohol Plagues Eskimos

The results of the Barrow Alcohol Study in Alaska were revealed in the context of a
press conference that was held far from the Native village, and without the
presence, much less the knowledge or consent, of any community member who
might have been able to present any context concerning the socioeconomic
conditions of the village. Study results suggested that nearly all adults in the
community were alcoholics. In addition to the shame felt by community members,
the town’s Standard and Poor bond rating suffered as a result, which in turn
decreased the tribe’s ability to secure funding for much needed projects.
Backlash

Methodological Problems
“The authors once again met with the Barrow Technical
Advisory Group, who stated their concern that only Natives
were studied, and that outsiders in town had not been
included.”
“The estimates of the frequency of intoxication based on
association with the probability of being detained were
termed "ludicrous, both logically and statistically.””
Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study

Ethical Problems
• Participants were not in control of their data nor
the context in which they were presented.
• Easy to demonstrate specific, significant harms:
– Social: Stigmatization
– Financial: Bond rating lowered
• Important: Nothing to do with individual privacy
– No PII revealed at any point, to anyone
– No violations of best practices in data handling
– But even those who did not participate in the study
incurred harm

Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive
Data

Ethical principles vs. ethical rules
• In the Barrow example, ethical rules
were generally followed
• But ethical principles were violated: The
researchers appear to have placed their
own interests ahead of those of the
research subjects, the client, and society

Principles: Codes of Conduct
• American Statistical Association
– http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
– https://www.certifiedanalytics.org/ethics.php
• Data Science Association
– http://www.datascienceassn.org/code-of-
conduct.html

Recap
• There’s a sea change underway in how we will teach
and practice data science
• No longer only about what can be done, but about
what should be done
• This is not just a policy/behavior/culture issue – there
are technical problems to solve
• If you’re not thinking about this stuff, you will be facing
retention issues and compliance issues very soon
– Witness privacy, which is a few years ahead

Data, Responsibly: The Next Decade of Data Science

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (6)

Semelhante a Data, Responsibly: The Next Decade of Data Science

Semelhante a Data, Responsibly: The Next Decade of Data Science (20)

Mais de University of Washington

Mais de University of Washington (19)

Último

Último (20)

Data, Responsibly: The Next Decade of Data Science

Notas do Editor