The Genomics Revolution:
The Good, The Bad, and The Ugly
(Confessions of a Privacy Researcher)
Emiliano De Cristofaro
University College London
https://emilianodc.com
1
From: James Bannon, ARK 2
From: The Economist
But… not all data are
created equal!
8
Health Data Hacking
Anthem: one of the largest US health insurers
60 to 80 million unencrypted records stolen in the hack
(revealed in February 2015)
Social security numbers, birthdays, addresses, email and
employment information and income data for customers
and employees, including its own chief executive
9
US Healthcare “Wall of Shame”
10
Around 2 declared breaches per week, each affecting 500+ people
https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf
Some issues specific to genomes
The genome is a treasure trove of sensitive information
Genome data cannot be revoked
The genome is the ultimate identifier
Access to one’s genome ≈ access to relatives’ genomes
11
We all leave cells behind…
Hair, saliva, etc., can be collected and sequenced?
Compare this “attack” to re-identifying millions
of DNA donors, or hacking into 23andMe…
The former: expensive, prone to mistakes, only works
against a handful of targeted victims
The latter: very “scalable”
Wait a minute… Why do we even
care about genome privacy?!?
12
Online Social Networks: Facebook, Google+, Twitter, LinkedIn
Health and Genome Websites: OpenSNP.org, Patientslikeme.org, CureTogether.com, 23andMe.com
Research Datasets: 1000 Genomes Project, HapMap Project, dbGaP, Personal Genome Project, UK 100K Genome Project
Ancestry Search and Family History Resources: Ancestry.com, FamilyTreeDNA.com, Ysearch.org, Geni.com
Public Records: census records, marriage records, obituaries, criminal records, court dockets, voter registrations
De-identification, kinship inference, genetic discrimination, blackmail
13
Anonymization?
14
Surname Inference Attack
Recover the surname of US male sequence
donors from the 1000 Genomes Project
Triangulate the identity of a sequence donor using his
surname, age, and state
Uses recreational genetic genealogy databases as aux info
Relies on the fact that:
Surnames are paternally inherited in most human societies
Y-chromosome haplotypes in male individuals are directly
inherited from the father
M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference.
Science 339(6117), January 2013.
15
How Exactly?
1. Profile short tandem repeats (STRs) on the Y-chromosome
2. Query genetic genealogy databases
3. Obtain a list of possible surnames for that
sequence
4. Identity Triangulation
Combine surnames with age and state
Triangulate the identity of the target (using US census DB)
16
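The four steps above can be sketched on toy data. This is only an illustration under stated assumptions: the genealogy database, the public-records table, the marker values, the names, and the `max_mismatches` threshold are all hypothetical, and the matching rule (counting differing markers) is a simplification rather than the method used in the cited paper.

```python
from collections import Counter

# Hypothetical genealogy database: (surname, Y-STR profile) records.
# Repeat counts are made up for illustration.
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
]

# Hypothetical public records: (name, surname, birth year, state).
PUBLIC_RECORDS = [
    ("John Smith", "Smith", 1950, "UT"),
    ("Adam Smith", "Smith", 1982, "UT"),
    ("Bob Jones", "Jones", 1950, "UT"),
]


def str_distance(a, b):
    """Count the shared markers whose repeat counts differ."""
    shared = a.keys() & b.keys()
    return sum(1 for marker in shared if a[marker] != b[marker])


def candidate_surnames(profile, db, max_mismatches=1):
    """Steps 2-3: query the database, rank surnames by number of close matches."""
    hits = Counter(
        surname for surname, ref in db
        if str_distance(profile, ref) <= max_mismatches
    )
    return [surname for surname, _ in hits.most_common()]


def triangulate(surnames, birth_year, state, records):
    """Step 4: keep records that match a candidate surname, birth year, and state."""
    return [
        name for name, surname, year, st in records
        if surname in surnames and year == birth_year and st == state
    ]


# Step 1 (profiling the Y-STRs from sequence data) is assumed already done.
target_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
surnames = candidate_surnames(target_profile, GENEALOGY_DB)
matches = triangulate(surnames, 1950, "UT", PUBLIC_RECORDS)
```

In this toy setting a single surname survives the database query, and age plus state narrow the public records down to one entry.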
How about Aggregation?
Re-identification of aggregate data possible?
Presence of an individual in a group can be determined by
using allele frequencies and his DNA profile [1]
Statistics from allele frequencies can be used to identify
genetic trial participants [2]
[1] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008.
[2] R. Wang et al. Learning your identity and disease from research papers: information leaks in genome wide association study. CCS, 2009.
17
Homer’s Attack
Attacker has access to a known participant’s genome
Determine if the target individual is in the case group
Use correlations in the genome (linkage disequilibrium)
18
Homer’s attack in a nutshell
The attacker knows:
The genome of the victim (her set of variants)
The size of the Mixture he’s attacking
Population allele frequencies
20
From: Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly
Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4(8): e1000167. doi:10.1371/journal.pgen.1000167
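As a rough sketch of the idea, the test statistic from Homer et al. compares, SNP by SNP, whether the victim's allele frequency is closer to the mixture or to the reference population: D_j = |Y_j - Pop_j| - |Y_j - M_j|, summarized with a one-sample t-like statistic. The function name and all numbers below are made up for illustration, and a real attack would use thousands of SNPs rather than six.

```python
import math
from statistics import mean, stdev


def homer_statistic(victim, mixture, population):
    """t-like statistic over per-SNP distances D_j = |Y_j - Pop_j| - |Y_j - M_j|.

    victim:     the victim's allele frequency at each SNP (0.0, 0.5, or 1.0)
    mixture:    allele frequencies published for the mixture (e.g. a case group)
    population: reference population allele frequencies
    """
    d = [abs(y - p) - abs(y - m) for y, m, p in zip(victim, mixture, population)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


# Toy inputs (hypothetical values).
victim = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
population = [0.50, 0.40, 0.30, 0.60, 0.50, 0.20]
mixture_with_victim = [0.60, 0.42, 0.25, 0.68, 0.50, 0.15]
mixture_without_victim = [0.40, 0.38, 0.35, 0.52, 0.50, 0.25]

t_in = homer_statistic(victim, mixture_with_victim, population)
t_out = homer_statistic(victim, mixture_without_victim, population)
```

A clearly positive value suggests the victim is closer to the mixture than to the general population, which is the signal the attacker looks for.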
Re-Identification Attacks
Many other subsequent studies extended the range of vulnerabilities…
[Jacobs et al. Nature Genet. ‘09], [Visscher and Hill PLoS Genet. ‘09], [Sankararaman et al.
Nature Genet. ‘09], [Wang et al. CCS ’09], [Clayton Biostatistics ’10],
[Im et al. Am. J. Hum. Genet. ‘12], …
21
10,000 – 50,000 SNPs are sufficient to
determine if an individual was part of a
cohort, even when he contributed < 0.1%
of the data
GA4GH Beacon Project
Main features:
Allows researchers to quickly query multiple databases to find the samples
they need; encourages cross-border collaboration among researchers
Only minimal responses are returned, in order to mitigate privacy concerns
22
[Diagram: a researcher queries Beacon 1, Beacon 2, and Beacon 3; response: yes]
Shringarpure-Bustamante’s Attack
The attack relies on the assumption that the adversary knows the set
of variants (VCF file) of the target individual & the size of the beacon
The attack is based on a likelihood ratio test where the adversary
repeatedly queries the beacon in order to re-identify the individual
Can be extremely dangerous if the beacon is associated with a
sensitive phenotype (e.g., cancer)
23
[Diagram: the attacker sends Query1, Query2, Query3, … to the beacon and receives responses yes, no, yes, …, in order to decide: is the target in the beacon?]
Shringarpure SS, Bustamante CD. Privacy risks from genomic data-sharing beacons.
The American Journal of Human Genetics. 2015 Nov 5;97(5):631-46.
Likelihood Ratio Test
$H_0$: the target individual is not in the beacon
$H_1$: the target individual is in the beacon
$R = \{x_1, \ldots, x_n\}$ is the set of beacon responses
$\delta$ is the probability of sequencing errors
$D_N^i$ denotes the probability that none of the $N$ other genomes in the beacon have an alternate allele at position $i$
$$L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - \delta D_{N-1}^i\right) + (1 - x_i) \log\left(\delta D_{N-1}^i\right) \right]$$
$$L_{H_0}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - D_N^i\right) + (1 - x_i) \log\left(D_N^i\right) \right]$$
$$\Lambda = L_{H_0}(R) - L_{H_1}(R)$$
24
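A minimal sketch of how the two log-likelihoods and the ratio Λ could be evaluated for a sequence of beacon responses. The function names are illustrative, and the per-position probabilities are passed in as plain lists of hypothetical values; how they are derived from allele-frequency models is outside this sketch.

```python
import math


def beacon_log_likelihoods(responses, d_n, d_n_minus_1, delta):
    """Return (L_H0, L_H1) for beacon responses x_i in {0, 1}.

    d_n[i]         : probability used under H0 at position i
    d_n_minus_1[i] : probability used under H1 at position i
    delta          : sequencing error probability
    All probabilities must lie strictly between 0 and 1.
    """
    l_h0 = 0.0
    l_h1 = 0.0
    for x, dn, dn1 in zip(responses, d_n, d_n_minus_1):
        l_h0 += x * math.log(1 - dn) + (1 - x) * math.log(dn)
        l_h1 += x * math.log(1 - delta * dn1) + (1 - x) * math.log(delta * dn1)
    return l_h0, l_h1


def likelihood_ratio(responses, d_n, d_n_minus_1, delta):
    """Lambda = L_H0(R) - L_H1(R); lower values favor H1 (target in beacon)."""
    l_h0, l_h1 = beacon_log_likelihoods(responses, d_n, d_n_minus_1, delta)
    return l_h0 - l_h1


# Toy example with made-up probabilities for four queried positions.
d_n = [0.9] * 4
d_n_minus_1 = [0.905] * 4
delta = 1e-6

lambda_all_yes = likelihood_ratio([1, 1, 1, 1], d_n, d_n_minus_1, delta)
lambda_all_no = likelihood_ratio([0, 0, 0, 0], d_n, d_n_minus_1, delta)
```

With these toy numbers, a run of "yes" answers at positions where the target carries rare alleles pushes Λ negative, while "no" answers push it strongly positive, since a "no" is only explained under H1 by a sequencing error.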
Kin Privacy
Quantifying how much privacy relatives lose
when one’s genome is leaked
25
M. Humbert, E. Ayday, J.-P. Hubaux, and A. Telenti.
Addressing the concerns of the Lacks family:
Quantification of kin genomic privacy. ACM CCS, 2013.
The rise of a new research community
Studying privacy issues
Exploring techniques to protect privacy
26
Differential Privacy
Computing the number/location of SNPs associated with a disease
Significance of/correlation between a SNP and a disease
A. Johnson and V. Shmatikov. “Privacy-Preserving Data Exploration in
Genome-Wide Association Studies.” Proceedings of KDD, 2013
Genome Wide Association Studies (GWAS)
27
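To make the differential privacy idea concrete, here is a minimal sketch of the Laplace mechanism applied to a simple count, for example the number of case-group participants carrying a given allele. It is a generic textbook mechanism, not the specific algorithms of the cited paper; the function name, the count of 120, and the choice of epsilon are assumptions for illustration.

```python
import random


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    A count changes by at most 1 when one participant is added or removed,
    hence the default sensitivity of 1 (an assumption about the query).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential samples is Laplace-distributed.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(0)  # seeded only to make the example repeatable
noisy = private_count(120, epsilon=1.0, rng=rng)
```

Smaller values of epsilon add more noise and give stronger privacy, at the cost of less accurate statistics.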
Computing on Encrypted Genomes
Encrypt data & outsource to the cloud
Perform private computation over encrypted data
Using partially & fully homomorphic encryption
Examples:
Pearson goodness-of-fit test, linkage disequilibrium,
Expectation Maximization, Cochran-Armitage test for trend, etc.
K. Lauter, A. Lopez-Alt, M. Naehrig. Private Computation on Encrypted Genomic Data.
28
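As an illustration of computing on encrypted data, the sketch below uses a toy version of the Paillier cryptosystem, which is additively (partially) homomorphic: multiplying ciphertexts adds the underlying plaintexts. Paillier is used here only as a simple stand-in, not as the scheme from the cited work, and the primes, allele counts, and function names are illustrative assumptions. The key size is far too small for real use, and Python's `random` module is not a cryptographic random source.

```python
import math
import random


def keygen(p, q):
    """Build a toy key pair from two primes (insecure, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt m as (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n


def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return c1 * c2 % (n * n)


rng = random.Random(42)
n, private_key = keygen(1009, 1013)

# Hypothetical minor-allele counts (0, 1, or 2) of six participants at one SNP.
allele_counts = [0, 1, 2, 1, 0, 2]
ciphertexts = [encrypt(n, m, rng) for m in allele_counts]

# The untrusted party aggregates without ever seeing a plaintext count.
encrypted_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_sum = add_encrypted(n, encrypted_sum, c)

total = decrypt(n, private_key, encrypted_sum)
```

Only the key holder can decrypt the aggregate; the party doing the summation works on ciphertexts throughout. The modular inverse via `pow(lam, -1, n)` requires Python 3.8 or later.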
Computing on Encrypted Genomes
L. Kamm, D. Bogdanov, S. Laur, J. Vilo. A new way to protect privacy in large-scale genome-wide association studies.
Bioinformatics 29(7): 886-893, 2013.
29
Personal Genomic Testing
Individuals will soon be able to get their genome
sequenced, and get a copy of it
Privacy = individuals retain control of their data
Allow third parties to run genetic tests, but:
1. Full genome never disclosed, only test output is
2. Third parties can keep test specifics confidential
… two main approaches …
30
1. Using Semi-Trusted Parties
[Diagram. Parties: Patient (P), Certified Institution (CI), Storage and Processing Unit (SPU), Medical Unit (MU). Labeled steps: (i) DNA sample; (i) clinical and environmental data; (ii) encrypted SNPs; (iii) disease risk computation.]
31
2. Users keep sequenced genomes
[Diagram. The individual holds the genome; the doctor or lab holds the test specifics. Both are inputs to a Secure Function Evaluation, which outputs the test result.]
Output reveals nothing beyond test result
32
Open Problems
Long-term security
Encryption might not be enough
Modern encryption algorithms can’t guarantee security
past 30-50 years
Reliability, availability, and efficiency issues introduced by
cryptography layer
The curse of interdisciplinarity
Still few inter-community collaborations
Hard to get funding
33
Thank you!
Acknowledgments: E. Ayday, P. Baldi, R. Baronio, C. Dessimoz,
G. Danezis, S. Faber, P. Gasti, J-P. Hubaux, B. Malin, G. Tsudik
[In particular, thanks to Erman Ayday for sharing some of the
slides about attacks]
34

The Genomics Revolution: The Good, The Bad, and The Ugly (Confessions of a Privacy Researcher)

  • 1. The Genomics Revolution: The Good, The Bad, and The Ugly (Confessions of a Privacy Researcher) Emiliano De Cristofaro University College London https://emilianodc.com 1
  • 4. 4
  • 5. 5
  • 6. 6
  • 7. 7
  • 8. But… not all data are created equal!
  • 9. Health Data Hacking. Anthem, one of the largest US health insurers: 60 to 80 million unencrypted records stolen in the hack (revealed in February 2015), including Social Security numbers, birthdays, addresses, email and employment information, and income data for customers and employees, including its own chief executive.
  • 10. US Healthcare “Wall of Shame”: around 2 declared breaches per week, each affecting 500+ people. https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf
  • 11. Some issues specific to genomes: the genome is a treasure trove of sensitive information; genome data cannot be revoked; the genome is the ultimate identifier; access to one’s genome ≈ access to relatives’ genomes.
  • 12. Wait a minute… Why do we even care about genome privacy?!? We all leave cells behind… Hair, saliva, etc., can be collected and sequenced? Compare this “attack” to re-identifying millions of DNA donors, or hacking into 23andMe… The former is expensive, prone to mistakes, and only works against a handful of targeted victims; the latter is very “scalable”.
  • 13. Online Social Networks: Facebook, Google+, Twitter, LinkedIn. Health and Genome Websites: OpenSNP.org, Patientslikeme.org, CureTogether.com, 23andMe.com. Research Datasets: 1000 Genomes Project, HapMap Project, dbGaP, Personal Genome Project, UK 100K Genome Project. Ancestry Search and Family History Resources: Ancestry.com, FamilyTreeDNA.com, Ysearch.org, Geni.com. Public Records: census records, marriage records, obituaries, criminal records, court dockets, voter registrations. De-identification, kinship inference, genetic discrimination, blackmail.
  • 15. Surname Inference Attack. Recover the surname of US male sequence donors from the 1000 Genomes Project, then triangulate the identity of a sequence donor using his surname, age, and state. Uses recreational genetic genealogy databases as auxiliary information. Relies on the facts that surnames are paternally inherited in most human societies and that Y-chromosome haplotypes in male individuals are directly inherited from the father. M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference. Science 339(6117), January 2013.
  • 16. How Exactly? 1. Profile short tandem repeats (STRs) on the Y-chromosome. 2. Query genetic genealogy databases. 3. Obtain a list of possible surnames for that sequence. 4. Identity triangulation: combine surnames with age and state to triangulate the identity of the target (using the US census DB).
  • 17. How about Aggregation? Is re-identification of aggregate data possible? The presence of an individual in a group can be determined by using allele frequencies and his DNA profile [1]. Statistics from allele frequencies can be used to identify genetic trial participants [2]. [1] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008. [2] R. Wang et al. “Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study.” CCS, 2009.
  • 18. Homer’s Attack. The attacker has access to a known participant’s genome. Goal: determine if the target individual is in the case group. Uses correlations in the genome (linkage disequilibrium).
  • 19. Homer’s attack in a nutshell. The attacker knows: the genome of the victim (her set of variants); the size of the mixture he’s attacking; population allele frequencies. From: Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al. (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4(8): e1000167. doi:10.1371/journal.pgen.1000167
  • 20. Re-Identification Attacks. Many subsequent studies extended the range of vulnerabilities… [Jacobs et al. Nature Genet. ‘09], [Visscher and Hill PLoS Genet. ‘09], [Sankararaman et al. Nature Genet. ‘09], [Wang et al. CCS ’09], [Clayton Biostatistics ’10], [Im et al. Am. J. Hum. Genet. ‘12], … 10,000 – 50,000 SNPs are sufficient to determine if an individual was part of a cohort, even when he contributed < 0.1% of the data.
  • 21. GA4GH Beacon Project. Main features: allows researchers to quickly query multiple databases to find the sample they need; encourages cross-border collaboration among researchers; sends only minimal responses back in order to mitigate privacy concerns. [Diagram: a researcher queries Beacon 1, Beacon 2, and Beacon 3; response: yes.]
  • 22. Shringarpure-Bustamante’s Attack. The attack relies on the assumption that the adversary knows the set of variants (VCF file) of the target individual and the size of the beacon. It is based on a likelihood ratio test in which the adversary repeatedly queries the beacon in order to re-identify the individual. It can be extremely dangerous if the beacon is associated with a sensitive phenotype (e.g., cancer). [Diagram: the attacker sends Query1, Query2, Query3, … to the beacon and receives responses yes, no, yes, …; question: is the target in the beacon?] Shringarpure SS, Bustamante CD. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics. 2015 Nov 5;97(5):631-46.
  • 23. Likelihood Ratio Test. H0: the target individual is not in the beacon. H1: the target individual is in the beacon. $R = \{x_1, \dots, x_n\}$ is the set of beacon responses; $\delta$ is the probability of sequencing errors; $D_N^i$ denotes the probability that none of the $N$ other genomes in the beacon have an alternate allele at position $i$.
    $$L_{H_0}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - D_N^i\right) + (1 - x_i) \log\left(D_N^i\right) \right]$$
    $$L_{H_1}(R) = \sum_{i=1}^{n} \left[ x_i \log\left(1 - \delta D_{N-1}^i\right) + (1 - x_i) \log\left(\delta D_{N-1}^i\right) \right]$$
    $$\Lambda = L_{H_0}(R) - L_{H_1}(R)$$
  • 24. Kin Privacy. Quantifying how much privacy relatives lose when one’s genome is leaked. M. Humbert, E. Ayday, J.-P. Hubaux, and A. Telenti. Addressing the concerns of the Lacks family: Quantification of kin genomic privacy. ACM CCS, 2013.
  • 25. The rise of a new research community: studying privacy issues; exploring techniques to protect privacy.
  • 26. Differential Privacy for Genome Wide Association Studies (GWAS). Computing the number/location of SNPs associated with a disease, and the significance/correlation between a SNP and a disease. A. Johnson and V. Shmatikov. “Privacy-Preserving Data Exploration in Genome-Wide Association Studies.” Proceedings of KDD, 2013.
  • 27. Computing on Encrypted Genomes. Encrypt data and outsource it to the cloud; perform private computation over the encrypted data, using partially and fully homomorphic encryption. Examples: Pearson Goodness-of-Fit test, linkage disequilibrium, Estimation Maximization, Cochran-Armitage Test for Trend, etc. K. Lauter, A. Lopez-Alt, M. Naehrig. Private Computation on Encrypted Genomic Data.
  • 28. Computing on Encrypted Genomes. L. Kamm, D. Bogdanov, S. Laur, J. Vilo. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29(7): 886-893, 2013.
  • 29. Personal Genomic Testing. Individuals will soon be able to get their genome sequenced and get a copy of it. Privacy = individuals retain control of their data. Allow third parties to run genetic tests, but: 1. the full genome is never disclosed, only the test output is; 2. third parties can keep test specifics confidential. … two main approaches …
  • 30. 1. Using Semi-Trusted Parties. Parties: Patient (P), Certified Institution (CI), Storage and Processing Unit (SPU), Medical Unit (MU). Flows: (i) DNA sample; (i) clinical and environmental data; (ii) encrypted SNPs; (iii) disease risk computation.
  • 31. 2. Users keep sequenced genomes. [Diagram: the individual holds the genome and the doctor or lab holds the test specifics; Secure Function Evaluation between them yields the test result.] Output reveals nothing beyond the test result.
  • 32. Open Problems. Long-term security: encryption might not be enough, since modern encryption algorithms can’t guarantee security past 30-50 years. Reliability, availability, and efficiency issues introduced by the cryptography layer. The curse of interdisciplinarity: still few inter-community collaborations; hard to get funding.
  • 33. Thank you! Acknowledgments: E. Ayday, P. Baldi, R. Baronio, C. Dessimoz, G. Danezis, S. Faber, P. Gasti, J-P. Hubaux, B. Malin, G. Tsudik. [In particular to Erman Ayday for sharing some of the slides about attacks.]
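
The setting of slides 18-19 (an attacker who knows the victim's genotype, the mixture's allele frequencies, and population allele frequencies) can be illustrated with a small simulation. This is a minimal sketch, not the slides' own code: it uses the per-SNP distance D_j = |Y_j - Pop_j| - |Y_j - M_j| from Homer et al. (2008) and a one-sample t statistic over all SNPs. The simulated data, the cohort size, the number of SNPs, and the use of exact population frequencies as the reference are all assumptions made for illustration.

```python
import math
import random
import statistics

def genotype(p, rng):
    """Draw one diploid genotype; return its allele frequency (0, 0.5 or 1)."""
    return ((rng.random() < p) + (rng.random() < p)) / 2

def homer_t(victim, pop_freq, mix_freq):
    """t statistic of D_j = |Y_j - Pop_j| - |Y_j - M_j| across SNPs."""
    d = [abs(y - p) - abs(y - m) for y, p, m in zip(victim, pop_freq, mix_freq)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

rng = random.Random(0)
n_snps, n_people = 20000, 50          # hypothetical study dimensions

# Assumed reference: the true population allele frequency at each SNP.
pop_freq = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]

# Simulated mixture (e.g., a case group); only its frequencies are published.
people = [[genotype(p, rng) for p in pop_freq] for _ in range(n_people)]
mix_freq = [sum(col) / n_people for col in zip(*people)]

member = people[0]                                   # is in the mixture
outsider = [genotype(p, rng) for p in pop_freq]      # is not

t_member = homer_t(member, pop_freq, mix_freq)
t_outsider = homer_t(outsider, pop_freq, mix_freq)
print(f"member: {t_member:.1f}  outsider: {t_outsider:.1f}")
```

A clearly positive statistic means the victim's genotype is closer to the mixture than to the reference population, which is the signal the attack exploits; the outsider should not show such a shift.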
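
The likelihood ratio test of slide 23 can be evaluated directly from a list of beacon responses. The sketch below follows the slide's sign convention (Λ = L_H0 − L_H1, so a strongly negative Λ favours "the target is in the beacon"). One ingredient is not given on the slide and is an assumption here: D_N^i is modelled as (1 − f_i)^(2N), where f_i is a hypothetical alternate-allele frequency at position i. The function name `lrt`, the default δ, and the example numbers are likewise illustrative.

```python
import math

def lrt(responses, alt_freqs, n, delta=1e-6):
    """Return Lambda = L_H0(R) - L_H1(R) for beacon responses x_i in {0, 1}.

    responses: beacon answers (1 = yes, 0 = no) at positions where the
               target carries an alternate allele.
    alt_freqs: assumed alternate-allele frequency f_i at each position.
    n:         number of genomes in the beacon (N).
    delta:     probability of a sequencing error.
    """
    l_h0 = l_h1 = 0.0
    for x, f in zip(responses, alt_freqs):
        d_n = (1 - f) ** (2 * n)           # assumed model for D_N^i
        d_n1 = (1 - f) ** (2 * (n - 1))    # assumed model for D_{N-1}^i
        l_h0 += x * math.log(1 - d_n) + (1 - x) * math.log(d_n)
        l_h1 += x * math.log(1 - delta * d_n1) + (1 - x) * math.log(delta * d_n1)
    return l_h0 - l_h1

freqs = [0.001] * 20                       # 20 rare variants of the target
lam_all_yes = lrt([1] * 20, freqs, n=100)  # beacon says yes to every query
lam_all_no = lrt([0] * 20, freqs, n=100)   # beacon says no to every query
print(lam_all_yes, lam_all_no)
```

Rare variants make each "yes" more informative, because a "yes" is unlikely under H0 when few genomes in the beacon are expected to carry the allele.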
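
Slide 27's idea of computing on encrypted data can be made concrete with an additively homomorphic scheme. The sketch below is a toy Paillier cryptosystem, chosen here only because it is short; it is not the lattice-based scheme of the cited paper, and the tiny hard-coded primes make it completely insecure. The scenario (two sites contributing encrypted allele counts that an untrusted server adds) is a hypothetical example.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy key generation with tiny fixed primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)      # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng):
    """c = g^m * r^n mod n^2, with g = n + 1 so that g^m = 1 + m*n mod n^2."""
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (1 + m * n) * pow(r, n, n2) % n2

def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return c1 * c2 % (n * n)

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

rng = random.Random(0)
n, priv = keygen()
c_site_a = encrypt(n, 17, rng)         # allele count at hypothetical site A
c_site_b = encrypt(n, 25, rng)         # allele count at hypothetical site B
c_total = add(n, c_site_a, c_site_b)   # computed without the private key
total = decrypt(n, priv, c_total)
print(total)
```

Only the key holder learns the total; the party performing the addition sees ciphertexts only. Statistics that need multiplications of encrypted values require a somewhat or fully homomorphic scheme instead.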

Editor's Notes

  1. Today I’d like to talk about my perspective, as a computer scientist, w.r.t. progress in genomics, specifically about the main challenges and issues from the information security and data privacy point of view
  2. It is quite interesting to me how the cost of whole genome sequencing has dropped much faster than Moore’s law would predict. (Moore’s law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years.) We have gone from a futuristic-sounding venture, with the HGP costing $3B to sequence one genome, to an affordable technology. Companies like Illumina now have the technology to fully sequence human genomes in a matter of hours for less than $1000.
  3. So progress in sequencing can help researchers gain a better understanding of diseases and their relationship with genetic features, but it is also already bearing fruit in the clinical setting, as we hear more and more news of how doctors have been able to diagnose and/or cure patients thanks to whole genome sequencing.
  4. Another important point I’d like to make is that the genomics revolution is not only happening within the walls of our research labs and our hospitals, as demonstrated by the emergence of direct-to-consumer genomics, i.e., genetic tests that are marketed directly to customers, without necessarily involving a doctor or a health professional. One of the most popular companies in this market is 23andMe, a US company now operating in the UK as well, that provides individuals with reports on a number of genetic traits, inherited risk factors, etc.
  5. From a computer science perspective, genome sequencing sort of means turning this (a double-stranded polymer of nucleotides) into this: data. This is the output of the Illumina sequencing machine, containing the nucleotides making up the genome as well as some additional information. And, as with any kind of data, we need to store it somewhere and we’d like to provide tools to search it and enable computation over it.
  6. Ethnic heritage, predisposition to diseases. Leakage might compound genetic discrimination threats. Hard to anonymize / de-identify.
  7. p-values, r-squares
  8. Beacon used as an oracle
  9. Consists in adding noise to a dataset with the goal of supporting statistical queries while preserving the privacy of the users whose information is contained in the dataset.
  10. Consists in adding noise to a dataset with the goal of supporting statistical queries while preserving the privacy of the users whose information is contained in the dataset.
  11. Relying on a cloud storage provider: data are encrypted and stored at a Storage and Processing Unit, which allows the medical unit to “privately” test them.
  12. Users keep their sequenced genomes (encrypted) but still allow doctors/clinicians to run tests, disclosing only the minimum amount of information. E.g., privacy-preserving versions of paternity, ancestry, genealogy, and personalized medicine tests.
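
The noise-adding idea behind differential privacy in the notes above can be sketched with the Laplace mechanism on a single counting query. This is a generic textbook illustration, not the method of the GWAS paper cited on the slide; the function name `noisy_count`, the privacy budget ε = 0.5, and the example count are assumptions. A count changes by at most 1 when one participant is added or removed, so Laplace noise with scale 1/ε suffices for ε-differential privacy on that query.

```python
import random

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace(0, 1/epsilon) noise (sensitivity 1).

    The difference of two independent exponential draws with rate epsilon
    follows a Laplace distribution with scale 1/epsilon.
    """
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

rng = random.Random(1)
true_carriers = 42    # hypothetical: case-group members carrying a given SNP allele
released = noisy_count(true_carriers, epsilon=0.5, rng=rng)
print(round(released, 2))
```

A smaller ε gives stronger privacy but noisier answers, and each additional query spends more of the privacy budget, which is why the number of released statistics has to be limited.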