The Genomics Revolution: The Good, The Bad, and The Ugly (Confessions of a Privacy Researcher)
Emiliano De Cristofaro
University College London
https://emilianodc.com
9. Health Data Hacking
Anthem: one of the largest US health insurers
60 to 80 million unencrypted records stolen in the hack
(revealed in February 2015)
Social security numbers, birthdays, addresses, email and
employment information and income data for customers
and employees, including its own chief executive
10. US Healthcare “Wall of Shame”
Around 2 declared breaches per week, each affecting 500+ people
https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf
11. Some issues specific to genomes
The genome is a treasure trove of sensitive information
Genome data cannot be revoked
Genome is the ultimate identifier
Access to one’s genome ≈ access to relatives’ genomes
12. We all leave cells behind…
Hair, saliva, etc., can be collected and sequenced?
Compare this “attack” to re-identifying millions
of DNA donors, or hacking into 23andme…
The former: expensive, prone to mistakes, only works
against a handful of targeted victims
The latter: very “scalable”
Wait a minute… Why do we even
care about genome privacy?!?
13. Online Social Networks
Facebook, Google+, Twitter, LinkedIn
Health and Genome Websites
OpenSNP.org, Patientslikeme.org, CureTogether.com, 23andMe.com
Research Datasets
1000 Genomes Project, HapMap Project, dbGaP, Personal Genome Project, UK 100K Genome Project
Ancestry Search and Family History Resources
Ancestry.com, FamilyTreeDNA.com, Ysearch.org, Geni.com
Public Records
Census records, marriage records, obituaries, criminal records, court dockets, voter registrations
De-identification, kinship inference, genetic discrimination, blackmail
15. Surname Inference Attack
Recover the surnames of US male sequence donors from the 1000 Genomes Project
Triangulate the identity of a sequence donor using his
surname, age, and state
Uses recreational genetic genealogy databases as aux info
Relies on the fact that:
Surnames are paternally inherited in most human societies
Y-chromosome haplotypes in male individuals are directly
inherited from the father
M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference. Science 339 (6117), January 2013.
16. How Exactly?
1. Profile short tandem repeats (STR) on the Y-chromosome
2. Query genetic genealogy databases
3. Obtain a list of possible surnames for that sequence
4. Identity Triangulation
Combine surnames with age and state
Triangulate the identity of the target (using US census DB)
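The four steps above can be sketched in a few lines of Python. This is only a toy illustration, not the pipeline used by Gymrek et al.: the database contents, the record layout, the distance measure, and the function names `str_distance`, `candidate_surnames`, and `triangulate` are all assumptions made for the example. The marker names are real Y-STR loci, but the repeat counts, surnames, and records are invented.

```python
from collections import Counter

# Hypothetical genealogy database: (surname, Y-STR profile) pairs.
# Marker names are real Y-STR loci; the repeat counts are made up.
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 12}),
    ("Brown", {"DYS19": 16, "DYS390": 25, "DYS391": 11, "DYS393": 14}),
]

# Hypothetical public records: (person id, surname, birth year, state).
PUBLIC_RECORDS = [
    ("person-A", "Smith", 1970, "UT"),
    ("person-B", "Smith", 1985, "UT"),
    ("person-C", "Jones", 1970, "UT"),
]


def str_distance(a, b):
    """Sum of repeat-count differences over the markers both profiles share."""
    return sum(abs(a[m] - b[m]) for m in a.keys() & b.keys())


def candidate_surnames(profile, db, max_distance=1):
    """Steps 2-3: surnames whose database profiles are close to the target's."""
    hits = Counter(s for s, p in db if str_distance(profile, p) <= max_distance)
    return [surname for surname, _ in hits.most_common()]


def triangulate(surnames, birth_year, state, records):
    """Step 4: keep records matching a candidate surname, birth year, and state."""
    return [r for r in records
            if r[1] in surnames and r[2] == birth_year and r[3] == state]


# Step 1 (profiling the donor's Y-STRs) is assumed to have produced this profile.
target_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
surnames = candidate_surnames(target_profile, GENEALOGY_DB)
matches = triangulate(surnames, 1970, "UT", PUBLIC_RECORDS)
```

The point of the sketch is that each auxiliary source is harmless on its own; it is the join of surname, age, and state that narrows the candidates to very few records.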
17. How about Aggregation?
Re-identification of aggregate data possible?
Presence of an individual in a group can be determined by
using allele frequencies and his DNA profile [1]
Statistics from allele frequencies can be used to identify
genetic trial participants [2]
[1] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008.
[2] R. Wang et al. Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study. CCS, 2009.
18. Homer’s Attack
Attacker has access to a known participant’s genome
Determine if the target individual is in the case group
Use correlations in the genome (linkage disequilibrium)
19. Homer’s attack in a nutshell
The attacker knows:
The genome of the victim (her set of variants)
The size of the Mixture he’s attacking
Population allele frequencies
From: Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly
Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4(8): e1000167. doi:10.1371/journal.pgen.1000167
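A minimal sketch of the membership test, assuming the per-SNP distance from Homer et al., D_j = |Y_j − Pop_j| − |Y_j − M_j|, where Y_j is the victim's allele frequency at SNP j (0, 0.5, or 1), M_j the mixture's allele frequency, and Pop_j the reference population's. The statistic is a one-sample t-test of the mean of D against zero; a large positive value suggests the victim is in the mixture. The simulation setup (uniform allele frequencies, independent SNPs, the helper names) is an assumption made for illustration and ignores linkage disequilibrium.

```python
import math
import random


def homer_statistic(y, mix, pop):
    """t-statistic of D_j = |y_j - pop_j| - |y_j - mix_j| against a mean of 0."""
    d = [abs(yj - pj) - abs(yj - mj) for yj, mj, pj in zip(y, mix, pop)]
    s = len(d)
    mean = sum(d) / s
    var = sum((x - mean) ** 2 for x in d) / (s - 1)
    return mean / math.sqrt(var / s)


def genotype_freq(p, rng):
    """Sample a diploid genotype at allele frequency p; return 0, 0.5, or 1."""
    return (int(rng.random() < p) + int(rng.random() < p)) / 2


def simulate(num_snps=5000, mixture_size=50, seed=0):
    """Return (t for a mixture member, t for an outsider) on simulated data."""
    rng = random.Random(seed)
    pop = [rng.uniform(0.1, 0.9) for _ in range(num_snps)]
    member = [genotype_freq(p, rng) for p in pop]
    outsider = [genotype_freq(p, rng) for p in pop]
    mix = []
    for j, p in enumerate(pop):
        # The mixture is the member plus (mixture_size - 1) random individuals.
        others = sum(genotype_freq(p, rng) for _ in range(mixture_size - 1))
        mix.append((others + member[j]) / mixture_size)
    return homer_statistic(member, mix, pop), homer_statistic(outsider, mix, pop)


t_member, t_outsider = simulate()
```

In this simulation each member shifts the mixture frequency toward their own genotype by only about 1/mixture_size per SNP, yet the shift accumulates over thousands of SNPs, which is why aggregate statistics alone do not hide participation.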
20. Re-Identification Attacks
Many other subsequent studies extended the range of vulnerabilities…
[Jacobs et al. Nature Genet. ‘09], [Visscher and Hill PLoS Genet. ‘09], [Sankararaman et al.
Nature Genet. ‘09], [Wang et al. CCS’09], [Clayton Biostatistics ’10],
[Im et al. Am. J. Hum. Genet. ‘12], …
10,000 – 50,000 SNPs are sufficient to
determine if an individual was part of a
cohort, even when he contributed < 0.1%
of the data
21. GA4GH Beacon Project
Main features:
Allows researchers to quickly query multiple databases to find the samples they need; encourages cross-border collaboration among researchers
Returns only minimal responses in order to mitigate privacy concerns
[Figure: a researcher sends a query to Beacon 1, Beacon 2, and Beacon 3; response: yes]
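To make the "minimal response" idea concrete, here is a toy beacon in Python. It is a sketch under assumptions, not the GA4GH Beacon API: the class name `Beacon`, the `(chromosome, position, allele)` variant encoding, and the `query_network` helper are hypothetical. The only thing a beacon reveals is whether at least one genome it holds carries the queried allele.

```python
class Beacon:
    """Toy beacon: answers only yes/no to 'does any genome carry this allele?'."""

    def __init__(self, name, genomes):
        self.name = name
        # Each genome is a set of (chromosome, position, allele) tuples.
        # Only the union is kept, so per-individual data is not exposed.
        self._variants = set().union(*genomes)

    def query(self, chrom, pos, allele):
        return (chrom, pos, allele) in self._variants


def query_network(beacons, chrom, pos, allele):
    """Send the same query to several beacons and collect their answers."""
    return {b.name: b.query(chrom, pos, allele) for b in beacons}


# Invented example data.
beacon1 = Beacon("Beacon 1", [{("1", 1000, "A")}, {("2", 500, "T")}])
beacon2 = Beacon("Beacon 2", [{("1", 1000, "G")}])
beacon3 = Beacon("Beacon 3", [])

answers = query_network([beacon1, beacon2, beacon3], "1", 1000, "A")
```

A researcher who gets a "yes" then knows which database is worth a formal access request, without the beacon disclosing any record.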
22. Shringarpure-Bustamante’s Attack
The attack relies on the assumption that the adversary knows the set
of variants (VCF file) of the target individual & the size of the beacon
The attack is based on a likelihood ratio test where the adversary
repeatedly queries the beacon in order to re-identify the individual
Can be extremely dangerous if the beacon is associated with a
sensitive phenotype (e.g., cancer)
[Figure: the attacker sends Query1, Query2, Query3, … to the beacon and receives responses yes, no, yes, …, in order to decide: is the target in the beacon?]
Shringarpure SS, Bustamante CD. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics. 2015 Nov 5;97(5):631-46.
23. Likelihood Ratio Test
H0: the target individual is not in the beacon
H1: the target individual is in the beacon
$R = \{x_1, \dots, x_n\}$ is the set of beacon responses
$\delta$ is the probability of sequencing errors
$D_N^i$ denotes the probability that none of the $N$ other genomes in the beacon has an alternate allele at position $i$

$$L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - \delta D_{N-1}^i\right) + (1 - x_i) \log\left(\delta D_{N-1}^i\right) \right]$$

$$L_{H_0}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - D_N^i\right) + (1 - x_i) \log D_N^i \right]$$

$$\Lambda = L_{H_0}(R) - L_{H_1}(R)$$
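A minimal Python sketch of this test, computing Λ = L_H0(R) − L_H1(R) from the two log-likelihoods on this slide. One assumption goes beyond the slide: D_N^i is computed as (1 − f_i)^(2N) from a known alternate-allele frequency f_i at each queried position, treating alleles as independent. Shringarpure and Bustamante derive D_N from a model of the allele frequency distribution instead, so this is a simplification. The function names are hypothetical.

```python
import math


def d_none(freq, n):
    """P(none of n diploid genomes carries the alternate allele).

    Assumption for this sketch: independent alleles with known frequency `freq`.
    """
    return (1.0 - freq) ** (2 * n)


def lrt_statistic(responses, freqs, beacon_size, delta=1e-6):
    """Lambda = L_H0(R) - L_H1(R); strongly negative values favour H1.

    responses: 1 (yes) or 0 (no) for each queried position, where the
               queries are positions at which the target has an alternate allele.
    freqs:     assumed alternate-allele frequency at each queried position.
    """
    lam = 0.0
    for x, f in zip(responses, freqs):
        d_n = d_none(f, beacon_size)        # H0: N genomes, none is the target
        d_n1 = d_none(f, beacon_size - 1)   # H1: N - 1 genomes besides the target
        l_h0 = x * math.log(1 - d_n) + (1 - x) * math.log(d_n)
        l_h1 = (x * math.log(1 - delta * d_n1)
                + (1 - x) * math.log(delta * d_n1))
        lam += l_h0 - l_h1
    return lam


# Invented example: 50 queries at rare variants (f = 0.001), beacon of 500 genomes.
freqs = [0.001] * 50
lam_all_yes = lrt_statistic([1] * 50, freqs, beacon_size=500)
lam_all_no = lrt_statistic([0] * 50, freqs, beacon_size=500)
```

Under H1 a "no" is only possible through a sequencing error, so every "no" pushes Λ up sharply, while each "yes" at a rare variant pushes it down a little. This is why rare alleles are the most informative queries. For very large beacons or common alleles `d_none` can underflow to zero, which this sketch does not guard against.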
24. Kin Privacy
Quantifying how much privacy relatives lose when one’s genome is leaked
M. Humbert, E. Ayday, J.-P. Hubaux, and A. Telenti.
Addressing the concerns of the Lacks family:
Quantification of kin genomic privacy. ACM CCS, 2013.
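As a rough illustration of why a leaked genome affects relatives, the sketch below shows how an adversary's belief about a child's genotype at one SNP changes after observing one parent's genotype. It assumes Mendelian inheritance, Hardy-Weinberg proportions, and an unobserved second parent drawn from the population. It is not the inference algorithm of Humbert et al., which handles whole pedigrees and correlations between SNPs; the function names are hypothetical.

```python
import math


def hwe_prior(f):
    """Genotype distribution (0, 1, or 2 alternate alleles) at allele frequency f."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def child_given_parent(parent_genotype, f):
    """Child's genotype distribution given one parent's genotype (0, 1, or 2).

    The observed parent transmits the alternate allele with probability
    parent_genotype / 2; the unobserved parent with probability f.
    """
    t = parent_genotype / 2
    return [(1 - t) * (1 - f), t * (1 - f) + (1 - t) * f, t * f]


def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)


def entropy_reduction_bits(parent_genotype, f):
    """Drop in the adversary's uncertainty about the child.

    For a single observation this can be negative: an unexpected parental
    genotype can make the child's genotype less predictable than the prior.
    """
    return (entropy_bits(hwe_prior(f))
            - entropy_bits(child_given_parent(parent_genotype, f)))


posterior = child_given_parent(2, 0.1)   # parent is homozygous for a rare allele
reduction = entropy_reduction_bits(2, 0.1)
```

With a parent homozygous for an allele of frequency 0.1, the probability that the child carries at least one copy goes from 19% under the prior to 100%, even though the child disclosed nothing.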
25. The rise of a new research community
Studying privacy issues
Exploring techniques to protect privacy
26. Differential Privacy
Genome-Wide Association Studies (GWAS)
Computing the number/location of SNPs associated with a disease
Significance/correlation between a SNP and a disease
A. Johnson and V. Shmatikov. “Privacy-Preserving Data Exploration in Genome-Wide Association Studies.” Proceedings of KDD, 2013
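A minimal sketch of the basic idea behind differential privacy, using the standard Laplace mechanism on a counting query. This is the textbook mechanism, not the specific algorithms of Johnson and Shmatikov; the query (how many cases carry the minor allele at a given SNP), the parameter values, and the function names are assumptions for illustration. Adding or removing one participant changes such a count by at most 1, so the sensitivity is 1.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Invented example: 137 cases carry the minor allele at some SNP.
rng = random.Random(42)
noisy = private_count(137, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and stronger privacy; the released value stays useful for aggregate statistics because the noise has zero mean and a scale that does not depend on the cohort size.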
27. Computing on Encrypted Genomes
Encrypt data & outsource to the cloud
Perform private computation over encrypted data
Using partially & fully homomorphic encryption
Examples:
Pearson Goodness-of-Fit test, linkage disequilibrium
Expectation Maximization, Cochran-Armitage trend test, etc.
K. Lauter, A. Lopez-Alt, M. Naehrig. Private Computation on Encrypted Genomic Data.
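To show what "computing over encrypted data" means in the simplest case, here is a toy additively homomorphic scheme (textbook Paillier) in pure Python, used to sum encrypted genotype counts without decrypting them. This is a swapped-in illustration of partially homomorphic encryption, not the scheme used by Lauter et al. The primes are tiny and fixed, so the code offers no real security, and the function names are hypothetical. It needs Python 3.8+ for `pow(x, -1, n)`.

```python
import math
import random


def keygen(p=104729, q=1299709):
    """Toy Paillier keys from two small fixed primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """c = (n + 1)^m * r^n mod n^2, with random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


# Invented example: minor-allele counts (0, 1, or 2) of five participants.
n, private_key = keygen()
genotypes = [0, 1, 2, 1, 0]
ciphertexts = [encrypt(n, g) for g in genotypes]

# The "cloud" aggregates without ever seeing a genotype.
encrypted_total = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_total = add_encrypted(n, encrypted_total, c)

total = decrypt(n, private_key, encrypted_total)
```

Allele counts like this total are the inputs to tests such as the Pearson goodness-of-fit test; the statistics listed on the slide also need multiplications on ciphertexts, which is where fully homomorphic schemes come in.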
28. Computing on Encrypted Genomes
L. Kamm, D. Bogdanov, S. Laur, J. Vilo. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29 (7): 886-893, 2013.
29. Personal Genomic Testing
Individuals will soon be able to get their genome
sequenced, and get a copy of it
Privacy = individuals retain control of their data
Allow third parties to run genetic tests, but:
1. Full genome never disclosed, only test output is
2. Third parties can keep test specifics confidential
… two main approaches …
30. Approach 1: Using Semi-Trusted Parties
[Figure: the parties are the Patient (P), the Certified Institution (CI), the Storage and Processing Unit (SPU), and the Medical Unit (MU). Flow labels: (i) DNA sample; (i) clinical and environmental data; (ii) encrypted SNPs; (iii) disease risk computation.]
32. Open Problems
Long-term security
Encryption might not be enough
Modern encryption algorithms can’t guarantee security past 30-50 years
Reliability, availability, and efficiency issues introduced by
cryptography layer
The curse of interdisciplinarity
Still few inter-community collaborations
Hard to get funding
33. Thank you!
Acknowledgments: E. Ayday, P. Baldi, R. Baronio, C. Dessimoz,
G. Danezis, S. Faber, P. Gasti, J-P. Hubaux, B. Malin, G. Tsudik
[In particular to Erman Ayday for sharing some of the
slides about attacks]
Editor's Notes
Today I’d like to talk about my perspective, as a computer scientist, w.r.t. progress in genomics, specifically about the main challenges and issues from the information security and data privacy point of view
It is quite interesting to me how the cost of whole genome sequencing has dropped much faster than Moore’s law would predict. (Moore’s law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years.) Sequencing has gone from a futuristic-sounding venture, with the Human Genome Project costing $3B to sequence one genome, to an affordable technology. Companies like Illumina now have the technology to fully sequence human genomes in a matter of hours for less than $1000.
So progress in sequencing can help researchers gain a better understanding of diseases and their relationship with genetic features, but it is also already bearing fruit in the clinical setting, as we hear more and more news of how doctors have been able to diagnose and/or cure patients thanks to whole genome sequencing.
Another important point I’d like to make is that the genomics revolution is not only happening within the walls of our research labs and our hospitals, as demonstrated by the emergence of direct-to-consumer genomics, i.e., genetic tests that are marketed directly to customers, without necessarily involving a doctor or a health professional. One of the most popular companies in this market is 23andMe, a US company now operating in the UK as well, that provides individuals with reports on a number of genetic traits, inherited risk factors, etc.
From a computer science perspective, genome sequencing sort of means turning this (a double-stranded polymer of nucleotides) into this: data. This is the output of the Illumina sequencing machine, containing the nucleotides making up the genome as well as some additional information. And, as with any kind of data, we need to store it somewhere and we’d like to provide tools to search it and enable computation over it.
Ethnic heritage, predisposition to diseases; Leakage might compound genetic discrimination threats
Hard to anonymize / de-identify
p-values, r-squares
Beacon used as an oracle
Consists in adding noise to a dataset with the goal of supporting statistical queries while preserving the privacy of the users whose information is contained in the dataset.
Relying on a cloud storage provider
Data encrypted and stored at a Storage Process Unit
Allow medical unit to “privately” test them
Users keep sequenced genomes (encrypted)
But still allow doctors/clinicians to run tests
Only disclose minimum amount of information
E.g. privacy-preserving version of paternity, ancestry, genealogy, personalized medicine tests