The Genomics Revolution: The Good, The Bad, and The Ugly (Confessions of a Privacy Researcher)

Emiliano De Cristofaro
University College London
https://emilianodc.com

But… not all data are
created equal!

Health Data Hacking
Anthem: one of the largest US health insurers
60 to 80 million unencrypted records stolen in the hack
(revealed in February 2015)
Social security numbers, birthdays, addresses, email and
employment information and income data for customers
and employees, including its own chief executive

US Healthcare “Wall of Shame”
Around 2 declared breaches per week, each affecting 500+ people
https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf

Some issues specific to genomes
Genome is treasure trove of sensitive
information
Genome data cannot be revoked
Genome is the ultimate identifier
Access to one’s genome ≈ access to
relatives’ genome

We all leave cells behind…
Hair, saliva, etc., can be collected and sequenced?
Compare this “attack” to re-identifying millions
of DNA donors, or hacking into 23andme…
The former: expensive, prone to mistakes, only works
against a handful of targeted victims
The latter: very “scalable”
Wait a minute… Why do we even
care about genome privacy?!?

Online Social Networks: Facebook, Google+, Twitter, LinkedIn
Health and Genome Websites: OpenSNP.org, Patientslikeme.org, CureTogether.com, 23andMe.com
Research Datasets: 1000 Genomes Project, HapMap Project, dbGaP, Personal Genome Project, UK 100K Genome Project
Ancestry Search and Family History Resources: Ancestry.com, FamilyTreeDNA.com, Ysearch.org, Geni.com
Public Records: Census records, Marriage records, Obituaries, Criminal records, Court dockets, Voter registrations
De-identification, Kinship inference, Genetic discrimination, Blackmail

Surname Inference Attack
Recover the surnames of US male sequence
donors from the 1000 Genomes Project
Triangulate the identity of a sequence donor using his
surname, age, and state
Uses recreational genetic genealogy databases as aux info
Relies on the fact that:
Surnames are paternally inherited in most human societies
Y-chromosome haplotypes in male individuals are directly
inherited from the father
M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference.
Science: 339 (6117), January 2013.

How Exactly?
1. Profile short tandem repeats (STRs) on the Y-chromosome
2. Query genetic genealogy databases
3. Obtain a list of possible surnames for that
sequence
4. Identity Triangulation
Combine surnames with age and state
Triangulate the identity of the target (using US census DB)
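The four steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the pipeline from the Gymrek et al. paper: the genealogy database, the public records, the repeat counts, and the helper names (`str_distance`, `candidate_surnames`, `triangulate`) are all hypothetical, and birth year stands in for age.

```python
import math
from collections import Counter

# Hypothetical genealogy entries: (surname, Y-STR profile as marker -> repeat count).
# The marker names are real Y-STR loci; every value below is made up.
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 21, "DYS391": 10, "DYS393": 14}),
]

# Hypothetical public records: (full name, surname, birth year, state).
PUBLIC_RECORDS = [
    ("John Smith", "Smith", 1975, "UT"),
    ("Adam Smith", "Smith", 1950, "UT"),
    ("Paul Jones", "Jones", 1975, "UT"),
]


def str_distance(a, b):
    """Count the shared markers on which two Y-STR profiles differ."""
    shared = a.keys() & b.keys()
    if not shared:
        return math.inf  # nothing to compare, so never treat it as a match
    return sum(1 for marker in shared if a[marker] != b[marker])


def candidate_surnames(profile, db, max_mismatches=1):
    """Steps 2-3: query the database and rank surnames by number of close hits."""
    hits = Counter(
        surname for surname, other in db
        if str_distance(profile, other) <= max_mismatches
    )
    return [surname for surname, _ in hits.most_common()]


def triangulate(surnames, birth_year, state, records):
    """Step 4: keep only the records that match surname, birth year, and state."""
    allowed = set(surnames)
    return [r for r in records
            if r[1] in allowed and r[2] == birth_year and r[3] == state]


# Step 1 (profiling the STRs from sequence data) is assumed to be done already.
target_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
surnames = candidate_surnames(target_profile, GENEALOGY_DB)
suspects = triangulate(surnames, 1975, "UT", PUBLIC_RECORDS)
```

Allowing a small number of mismatches is a design choice made here because STR repeat counts can mutate between generations; the threshold of 1 is arbitrary.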

How about Aggregation?
Re-identification of aggregate data possible?
Presence of an individual in a group can be determined by
using allele frequencies and his DNA profile [1]
Statistics from allele frequencies can be used to identify
genetic trial participants [2]
[1] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008
[2] R. Wang et al. “Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study.” CCS, 2009

Homer’s Attack
Attacker has access to a known participant’s genome
Determine if the target individual is in the case group
Use correlations in the genome (linkage disequilibrium)

Homer’s attack in a nutshell
The attacker knows:
The genome of the victim (her set of variants)
The size of the mixture he’s attacking
Population allele frequencies
From: Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly
Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4(8): e1000167. doi:10.1371/journal.pgen.1000167
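A minimal sketch of the distance statistic behind this attack, assuming the commonly cited form: for each SNP j, compare how far the victim's allele fraction lies from the reference population frequency versus from the mixture frequency, D_j = |y_j − pop_j| − |y_j − mix_j|, then normalise the mean of D into a t-like score. The simulation around it (independent SNPs, 5000 SNPs, a mixture of 20 people) is an assumption made only to have something to run, and it ignores linkage disequilibrium.

```python
import math
import random
from statistics import mean, stdev


def homer_statistic(victim, mixture_freqs, population_freqs):
    """T = mean(D) / (sd(D) / sqrt(s)), with D_j = |y_j - pop_j| - |y_j - mix_j|."""
    d = [abs(y - p) - abs(y - m)
         for y, m, p in zip(victim, mixture_freqs, population_freqs)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def simulate_genotype(freqs, rng):
    """Per-SNP alternate-allele fraction (0, 0.5, or 1) from two allele draws."""
    return [((rng.random() < p) + (rng.random() < p)) / 2 for p in freqs]


rng = random.Random(0)
population_freqs = [rng.uniform(0.1, 0.9) for _ in range(5000)]

# The mixture (e.g., a case group) is summarised only by its allele frequencies.
members = [simulate_genotype(population_freqs, rng) for _ in range(20)]
mixture_freqs = [mean(column) for column in zip(*members)]

outsider = simulate_genotype(population_freqs, rng)

t_member = homer_statistic(members[0], mixture_freqs, population_freqs)
t_outsider = homer_statistic(outsider, mixture_freqs, population_freqs)
```

In this toy setting a participant should score clearly higher than a non-participant, which is the signal the attacker thresholds on.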

Re-Identification Attacks
Many other subsequent studies extended the range of vulnerabilities…
[Jacobs et al. Nature Genet. ’09], [Visscher and Hill PLoS Genet. ’09], [Sankararaman et al.
Nature Genet. ’09], [Wang et al. CCS ’09], [Clayton Biostatistics ’10],
[Im et al. Am. J. Hum. Genet. ’12], …
10,000 – 50,000 SNPs are sufficient to
determine if an individual was part of a
cohort, even when he contributed < 0.1%
of the data

GA4GH Beacon Project
Main features:
Allows researchers to quickly query multiple databases to find the sample
they need; encourages cross-border collaboration among researchers
Only minimal responses back in order to mitigate privacy concerns
[Figure: a researcher queries Beacon 1, Beacon 2, and Beacon 3; response: yes]

Shringarpure-Bustamante’s Attack
The attack relies on the assumption that the adversary knows the set
of variants (VCF file) of the target individual & the size of the beacon
The attack is based on a likelihood ratio test where the adversary
repeatedly queries the beacon in order to re-identify the individual
Can be extremely dangerous if the beacon is associated with a
sensitive phenotype (e.g., cancer)
[Figure: the attacker sends Query1, Query2, Query3, … to the beacon and receives the responses yes, no, yes, …; question: is the target in the beacon?]
Shringarpure SS, Bustamante CD. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics. 2015 Nov 5;97(5):631-46.

Likelihood Ratio Test
H0: the target individual is not in the beacon
H1: the target individual is in the beacon
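$R = \{x_1, \dots, x_n\}$ is the set of beacon responses
$\delta$ is the probability of sequencing errors
$D_N^i$ denotes the probability that none of the $N$ other genomes in the beacon has an alternate allele at position $i$

$$L_{H_0}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - D_N^i\right) + (1 - x_i) \log D_N^i \right],$$

$$L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - \delta D_{N-1}^i\right) + (1 - x_i) \log\left(\delta D_{N-1}^i\right) \right].$$

$$\Lambda = L_{H_0}(R) - L_{H_1}(R)$$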
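The test can be sketched directly from the two log-likelihoods. The one piece not given on the slide is how to obtain the probability that no other genome carries the alternate allele; the helper `none_has_alt` below assumes 2N independent allele draws with a known alternate-allele frequency, which is a modelling assumption made here for illustration. Responses are encoded as 1 for yes and 0 for no, at positions where the target carries the alternate allele.

```python
import math


def none_has_alt(alt_freq, n_genomes):
    """Assumed model for D: none of 2 * n_genomes independent alleles is alternate."""
    return (1.0 - alt_freq) ** (2 * n_genomes)


def log_likelihoods(responses, alt_freqs, n, delta):
    """Return (L_H0, L_H1) for beacon responses x_i in {0, 1}."""
    l_h0 = 0.0
    l_h1 = 0.0
    for x, f in zip(responses, alt_freqs):
        d_n = none_has_alt(f, n)           # target absent: N other genomes
        d_n1 = none_has_alt(f, n - 1)      # target present: N - 1 other genomes
        l_h0 += x * math.log(1 - d_n) + (1 - x) * math.log(d_n)
        l_h1 += x * math.log(1 - delta * d_n1) + (1 - x) * math.log(delta * d_n1)
    return l_h0, l_h1


def lrt_statistic(responses, alt_freqs, n, delta):
    """Lambda = L_H0(R) - L_H1(R); strongly negative values favour H1."""
    l_h0, l_h1 = log_likelihoods(responses, alt_freqs, n, delta)
    return l_h0 - l_h1


# Ten queries at rare variants (frequency 0.001) against a beacon of 1000 genomes.
freqs = [0.001] * 10
lam_all_yes = lrt_statistic([1] * 10, freqs, n=1000, delta=1e-6)
lam_all_no = lrt_statistic([0] * 10, freqs, n=1000, delta=1e-6)
```

A run of "yes" answers at rare variants pushes Λ down (evidence for membership), while a single "no" is very hard to explain under H1 because it requires a sequencing error, so it pushes Λ up sharply.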

Kin Privacy
How much privacy do relatives lose
when one’s genome is leaked?
M. Humbert, E. Ayday, J.-P. Hubaux, and A. Telenti.
Addressing the concerns of the Lacks family:
Quantification of kin genomic privacy. ACM CCS, 2013.
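A small worked example of why a leaked genome says something about relatives, using plain Mendelian inheritance at one SNP. It is far simpler than the framework in the cited paper: it assumes the other parent is drawn from the population with a known alternate-allele frequency, Hardy-Weinberg proportions for the prior, and entropy reduction as one possible (assumed) way to express the loss.

```python
import math


def hwe_prior(f):
    """Genotype distribution (0, 1, or 2 alternate alleles) with no family information."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def child_given_parent(parent_genotype, f):
    """Child's genotype distribution once one parent's genotype is known.

    The known parent transmits the alternate allele with probability g / 2;
    the unknown parent is assumed to transmit it with probability f.
    """
    t = parent_genotype / 2
    return [(1 - t) * (1 - f), t * (1 - f) + (1 - t) * f, t * f]


def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


f = 0.1                                  # assumed population frequency
prior = hwe_prior(f)
posterior = child_given_parent(2, f)     # leaked parent is homozygous alternate
uncertainty_removed = entropy_bits(prior) - entropy_bits(posterior)
```

Here the child must carry at least one alternate allele, an outcome that had only 19% prior probability, even though the child disclosed nothing.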

The rise of a new research community
Studying privacy issues
Exploring techniques to protect privacy

Differential Privacy
Computing number/location of SNPs associated to disease
Significance/correlation between a SNP and a disease
A. Johnson and V. Shmatikov. “Privacy-Preserving Data Exploration in
Genome-Wide Association Studies.” Proceedings of KDD, 2013
Genome Wide Association Studies (GWAS)
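To make the idea concrete, here is a generic Laplace mechanism applied to a single count. This is the textbook building block of differential privacy, not the specific algorithms from the Johnson and Shmatikov paper; the query (how many case-group participants carry a given allele), the sensitivity of 1, and the function name `dp_count` are assumptions for illustration.

```python
import random


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    The difference of two independent exponential variables with the same
    rate is Laplace-distributed, which avoids hand-rolling an inverse CDF.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(1)

# Assumed query: number of case-group participants carrying a given allele.
# Adding or removing one participant changes this count by at most 1.
carriers = 100
noisy_release = dp_count(carriers, epsilon=1.0, rng=rng)

# Repeating the release many times only to show the noise is centred on the
# true value; a real deployment would spend privacy budget on every release.
samples = [dp_count(carriers, epsilon=1.0, rng=rng) for _ in range(20000)]
average = sum(samples) / len(samples)
```

Smaller values of epsilon give stronger privacy and noisier answers, which is the utility trade-off that matters for GWAS statistics.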

Computing on Encrypted Genomes
Encrypt data & outsource to the cloud
Perform private computation over encrypted data
Using partial & fully homomorphic encryption
Examples:
Pearson Goodness-of-Fit test, linkage disequilibrium
Expectation Maximization, Cochran-Armitage test for trend, etc.
K. Lauter, A. Lopez-Alt, M. Naehrig. Private Computation on Encrypted
Genomic Data
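The "encrypt, outsource, compute" pattern can be illustrated with a toy additively homomorphic scheme. The sketch below uses textbook Paillier, which is a swapped-in example of partially homomorphic encryption and not the lattice-based scheme used in the cited paper; the primes are tiny and fixed, so it is insecure and for illustration only. The scenario (summing per-participant alternate-allele counts at one SNP) is likewise assumed.

```python
import math
import secrets

# Two well-known small primes. Toy sizes: this offers no real security.
P, Q = 65537, 2147483647


def keygen(p=P, q=Q):
    """Return (public n, private (lam, mu)) for Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for the choice g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt 0 <= m < n as (1 + m*n) * r^n mod n^2 with fresh randomness r."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


n, private_key = keygen()

# Each participant encrypts their alternate-allele count (0, 1, or 2) at one SNP.
genotypes = [0, 1, 2, 1, 0, 2]
ciphertexts = [encrypt(n, g) for g in genotypes]

# The cloud sums the counts without ever seeing a genotype.
encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add_encrypted(n, encrypted_total, c)

# Only the key holder learns the aggregate, e.g. to derive an allele frequency.
total = decrypt(n, private_key, encrypted_total)
allele_frequency = total / (2 * len(genotypes))
```

Additive schemes like this cover sums and counts; the statistical tests listed above also need multiplications, which is where somewhat or fully homomorphic encryption comes in.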

Computing on Encrypted Genomes
L. Kamm, D. Bogdanov, S. Laur, J. Vilo. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29 (7): 886-893, 2013.

Personal Genomic Testing
Individuals will soon be able to get their genome
sequenced, and get a copy of it
Privacy = individuals retain control of their data
Allow third parties to run genetic tests, but:
1. Full genome never disclosed, only test output is
2. Third parties can keep test specifics confidential
… two main approaches …

1. Using Semi-Trusted Parties
[Figure: parties are the Patient (P), the Certified Institution (CI), the Storage and Processing Unit (SPU), and the Medical Unit (MU); labelled flows are (i) DNA sample, (i) clinical and environmental data, (ii) encrypted SNPs, (iii) disease risk computation]

2. Users keep sequenced genomes
[Figure: the individual supplies the genome and the doctor or lab supplies the test specifics to a Secure Function Evaluation, which returns the test result to each side]
Output reveals nothing beyond test result

Open Problems
Long-term security
Encryption might not be enough
Modern encryption algorithms can’t guarantee security
past 30-50 years
Reliability, availability, and efficiency issues introduced by
cryptography layer
The curse of interdisciplinarity
Still few inter-community collaborations
Hard to get funding

Thank you!
Acknowledgments: E. Ayday, P. Baldi, R. Baronio, C. Dessimoz,
G. Danezis, S. Faber, P. Gasti, J-P. Hubaux, B. Malin, G. Tsudik
[In particular to Erman Ayday for sharing some of the
slides about attacks]
