Facilitating Analytics
while Protecting
Individual Privacy Using
Data De-identification
Khaled El Emam
Talk Outline
Present two case studies where we conducted an analysis of
the privacy implications associated with sharing health data.
Overview of methodology and risk measurement basics
State of Louisiana Department of Health and Hospitals and
Cajun Code Fest 2013
Mount Sinai School of Medicine Department of Preventative
Medicine – World Trade Center Disaster Registry
Data Anonymization Resources
Book Signing:
September 26, 2013 at 10:35am
Khaled El Emam
Luk Arbuckle
Basic Methodology
Direct and Indirect (Quasi-) Identifiers
Examples of direct identifiers: name, address, telephone
number, fax number, MRN, health card number, health plan
beneficiary number, license plate number, email address,
photograph, biometrics, SSN, SIN, implanted device number
Examples of quasi-identifiers: sex, date of birth or age,
geographic locations (such as postal codes, census
geography, information about proximity to known or unique
landmarks), language spoken at home, ethnic origin, total
years of schooling, marital status, criminal history, total income,
visible minority status, profession, event dates
Terminology
Safe Harbor
Safe Harbor Direct Identifiers and Quasi-identifiers
1. Names
2. ZIP Codes (except first three)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
Actual Knowledge
Statistical Method
A person with appropriate knowledge of and experience with
generally accepted statistical and scientific principles and methods for
rendering information not individually identifiable:
I. Applying such principles and methods, determines that the risk is
“very small” that the information could be used, alone or in
combination with other reasonably available information, by an
anticipated recipient to identify an individual who is a subject of
the information, and
II. Documents the methods and results of the analysis that justify
such determination
Equivalence Classes - I
An equivalence class is the set of records in a table that have the
same values for all quasi-identifiers.
Equivalence Classes - II
Gender  Year of Birth (10 years)  DIN
Male    1970-1979                 2046059
Male    1980-1989                 716839
Male    1970-1979                 2241497
Female  1990-1999                 2046059
Female  1980-1989                 392537
Male    1990-1999                 363766
Male    1990-1999                 544981
Female  1980-1989                 293512
Male    1970-1979                 544981
Female  1990-1999                 596612
Male    1980-1989                 725765
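The grouping into equivalence classes can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the talk: the function name `equivalence_classes` is made up here, and it assumes that Gender and Year of Birth are the quasi-identifiers while DIN is not.

```python
from collections import defaultdict

# Rows of the example table: (gender, decade of birth, DIN).
records = [
    ("Male", "1970-1979", 2046059),
    ("Male", "1980-1989", 716839),
    ("Male", "1970-1979", 2241497),
    ("Female", "1990-1999", 2046059),
    ("Female", "1980-1989", 392537),
    ("Male", "1990-1999", 363766),
    ("Male", "1990-1999", 544981),
    ("Female", "1980-1989", 293512),
    ("Male", "1970-1979", 544981),
    ("Female", "1990-1999", 596612),
    ("Male", "1980-1989", 725765),
]


def equivalence_classes(rows, quasi_identifier_columns):
    """Group row indices by the values of their quasi-identifier columns."""
    classes = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[column] for column in quasi_identifier_columns)
        classes[key].append(index)
    return dict(classes)


# Assumption: columns 0 (gender) and 1 (decade of birth) are the quasi-identifiers.
classes = equivalence_classes(records, quasi_identifier_columns=(0, 1))
for key, members in classes.items():
    print(key, "size", len(members))
```

Each printed line is one equivalence class with its size; records that share a class are indistinguishable on the quasi-identifiers.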
Maximum Risk
In the example data set we had 5 equivalence classes
The largest equivalence class had a size of 3, and the smallest
equivalence class had a size of 2
The probability of correctly re-identifying a record is 1 divided
by the size of the equivalence class
The maximum probability in this table is 50% (0.5 probability)
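As a small worked check of the maximum risk, the sketch below starts from the class sizes stated on this slide (one class of 3, four classes of 2). The helper name `record_risk` is illustrative only.

```python
from fractions import Fraction

# Equivalence class sizes from the example data set.
class_sizes = [3, 2, 2, 2, 2]


def record_risk(class_size):
    """Probability of correctly re-identifying a record in a class of this size."""
    return Fraction(1, class_size)


# The maximum risk is driven by the smallest equivalence class.
max_risk = max(record_risk(size) for size in class_sizes)
print(float(max_risk))  # 0.5
```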
Average Risk
There were:
- Four equivalence classes of size 2
- One equivalence class of size 3
The average risk is taken over all 11 records (8 records with risk 1/2 and 3 records with risk 1/3):
[(8 x 1/2) + (3 x 1/3)] / 11 = 5/11
This gives us an average risk of 5/11, or about 45%
This is equal to the number of equivalence classes divided by the
number of records
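The same arithmetic can be checked with exact fractions. This is only a sketch built on the class sizes from the example; it also confirms the shortcut that the average risk equals the number of classes divided by the number of records.

```python
from fractions import Fraction

# Equivalence class sizes from the example data set.
class_sizes = [3, 2, 2, 2, 2]
n_records = sum(class_sizes)

# Every record in a class of a given size has risk 1/size,
# so each class contributes size * (1/size) = 1 to the total.
total_risk = sum(size * Fraction(1, size) for size in class_sizes)
average_risk = total_risk / n_records

print(average_risk, round(float(average_risk), 2))  # 5/11 0.45
```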
Case Study: State of Louisiana – Cajun Code Fest
State of Louisiana
- Demonstrate how the State of Louisiana used a novel approach
to improve the health of its citizens by working with the Center
for Business & Information Technologies (CBIT) at the
University of Louisiana at Lafayette to provide data for Cajun Code Fest
- Discuss how, given realistic looking and behaving de-identified
Medicaid claims and immunization data, competitors
were able to build applications supporting Louisiana’s “Own your
Own Health” initiative, which encourages patients to
make knowledgeable and informed decisions about their
healthcare
Cajun Code Fest 2.0
- April 24-26, 2013
- 27 hours of coding, put on by the Center for Business & Information Technologies at
the University of Louisiana at Lafayette
- Teams converged to analyze the de-identified data set
and create new healthcare solutions that allow patients to become engaged in their
own health
Why De-identified Data?
The core data that served as the basis for Cajun Code Fest
had to be de-identified before it could be released to the
entrants in the challenge.
It would not have been possible to have the coding challenge
without properly de-identified data.
Data by the Numbers
200,000 unique individuals
6,683,337 Medicaid claims
6,410,969 Medicaid prescriptions
4,085,977 Immunization records
29,951 Providers
Data Model
Claims Summary
Long Tails & Truncation
Date Shifting – Simple Noise
Date Shifting – Fixed Shift
Date Shifting – Randomized Generalization I
Date Shifting – Randomized Generalization II
Geoproxy Attacks
Patients tend to visit providers and obtain prescriptions from
pharmacies that are close to where they live
Can we use the provider and pharmacy location information to
predict where the patient lives?
This is called a geoproxy attack
We can measure the probability of a correct geoproxy attack
and incorporate that into our overall risk measurement
framework
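One way to picture the measurement is the toy sketch below. It is not the estimator used in the talk, which the slides do not spell out. It assumes a hypothetical adversary who guesses that a patient lives in the region where most of their provider and pharmacy visits occurred, and the region codes and patient rows are invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: (true home region, regions of providers/pharmacies visited).
patients = [
    ("705", ["705", "705", "704"]),
    ("704", ["705", "705"]),
    ("706", ["706"]),
    ("705", ["704", "705", "705", "705"]),
]


def guess_home_region(visited_regions):
    """Assumed adversary strategy: pick the most frequently visited region."""
    return Counter(visited_regions).most_common(1)[0][0]


def geoproxy_success_rate(rows):
    """Fraction of patients whose home region the adversary guesses correctly."""
    correct = sum(1 for home, visited in rows if guess_home_region(visited) == home)
    return correct / len(rows)


rate = geoproxy_success_rate(patients)
print(rate)  # 0.75
```

A rate like this could then be combined with the equivalence-class risk, since a correct location guess effectively hands the adversary one more quasi-identifier.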
Geoproxy Risk on Claims Data
Case Study: Mount Sinai School of Medicine
World Trade Center Disaster Registry
Over 50,000 people are estimated to have helped with the rescue and
recovery efforts after 9/11, and over 27,000 of those are captured in the WTC
disaster registry created by the Clinical Center of Excellence at Mount Sinai.
Mount Sinai did extensive publicity and outreach, working with a variety of
organizations, to recruit 9/11 workers and volunteers. Those who participated
have gone through comprehensive examinations including:
- Medical questionnaires
- Mental-health questionnaires
- Exposure-assessment questionnaires
- Standardised physical examinations
- Optional follow-up assessments every 12 to 18 months
Background
Public Information
Series of Events
The visit date was used for questions that were specific to the
date on which the visit occurred (e.g., “do you currently
smoke?” would create a smoking event dated at the time of the
visit).
Some questions included dates that could be used directly
with the quasi-identifier, and were more informative than the
visit date (e.g., the answer to “when were you diagnosed with
this disease?” was used as the date of the disease
event).
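The date rule described above can be sketched as a small function. The name `event_date` and its parameters are hypothetical; the only logic taken from the slide is that a date reported in the answer is preferred over the visit date.

```python
from datetime import date
from typing import Optional


def event_date(visit_date: date, reported_date: Optional[date] = None) -> date:
    """Pick the date attached to an event built from a questionnaire answer.

    If the answer itself carries a date (e.g. a date of diagnosis), use it,
    because it is more informative; otherwise fall back to the visit date.
    """
    return reported_date if reported_date is not None else visit_date


# "Do you currently smoke?" has no date of its own, so the visit date is used.
smoking_event = event_date(visit_date=date(2005, 4, 12))

# "When were you diagnosed?" supplies its own date for the disease event.
diagnosis_event = event_date(
    visit_date=date(2005, 4, 12), reported_date=date(2003, 11, 1)
)
print(smoking_event, diagnosis_event)
```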
Series of Events
Demographics
Examples of Events
Multiple Levels
Sometimes it is reasonable to assume that the adversary will
not have a lot of details about an event
For example, the adversary may know that an event has
occurred but not know the exact date on which it occurred
In such a case we generalize the data to match the adversary's
background knowledge when measuring risk, but we release the more detailed data
This makes sense given the assumption: the more detailed
information that is released does not give the adversary
additional useful information
Time of Events
- Ten years after the fact, however, it seems unlikely that an adversary
will know the dates of a patient’s events before 9/11. Often patients
gave different years of diagnosis on follow-up visits because they
themselves didn’t remember when their medical conditions were diagnosed. So
instead of the date of the event, we used “pre-9/11” as a value.
- We made a distinction between childhood (under 18) and adulthood
(18 and over) diagnoses, since these seemed like something an adversary could
reasonably know.
- These generalizations were done only for measuring risk, and weren’t
applied to the de-identified registry data.
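The generalization used for risk measurement might look like the sketch below. Several details are assumptions, not taken from the slides: that the childhood/adulthood split is applied to pre-9/11 diagnoses, that events on or after 9/11 are reduced to their year, and the function names themselves.

```python
from datetime import date

SEPT_11 = date(2001, 9, 11)


def age_on(birth_date: date, when: date) -> int:
    """Whole years between birth_date and when."""
    had_birthday = (when.month, when.day) >= (birth_date.month, birth_date.day)
    return when.year - birth_date.year - (0 if had_birthday else 1)


def generalize_for_risk(event_date: date, birth_date: date) -> str:
    """Value the adversary is assumed to know; used only when measuring risk."""
    if event_date >= SEPT_11:
        # Assumption: later events are known to the adversary by year.
        return str(event_date.year)
    stage = "childhood" if age_on(birth_date, event_date) < 18 else "adulthood"
    return f"pre-9/11 ({stage})"


born = date(1970, 6, 15)
print(generalize_for_risk(date(1985, 3, 1), born))  # pre-9/11 (childhood)
print(generalize_for_risk(date(1995, 3, 1), born))  # pre-9/11 (adulthood)
print(generalize_for_risk(date(2003, 5, 2), born))  # 2003
```

The released data would keep the original dates (subject to whatever other de-identification is applied); only the copy used to count equivalence classes is generalized this way.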
Covering Designs
What are the quasi-identifiers when the series of events is
long?
Will an adversary know all of the details in that sequence?
It is reasonable to assume that an adversary will only know p
events – this is the power of the adversary
But which p out of m events does the adversary know?
If we look at all combinations of p from m, we may end up with
quite a large number of combinations of quasi-identifiers for which to
measure the risk
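To get a feel for how fast the number of combinations grows, the sketch below enumerates all subsets of p events out of m and counts them. The event names are invented placeholders; only the idea of choosing p out of m comes from the slide.

```python
from itertools import combinations
from math import comb

# Hypothetical events recorded for one patient (m = 5).
events = ["asthma", "smoking", "GERD", "sinusitis", "PTSD"]
p = 3  # power of the adversary: number of events they are assumed to know

# Every subset of p events is a candidate set of quasi-identifiers.
subsets = list(combinations(events, p))
print(len(subsets))  # 10

# The count C(m, p) grows quickly with the number of events m.
counts = {m: comb(m, p) for m in (5, 10, 20, 50)}
print(counts)  # {5: 10, 10: 120, 20: 1140, 50: 19600}
```

Measuring risk on every one of these subsets is what becomes expensive for long event sequences.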
Combinations of 3
Covering Design
Reduction in Computation
Contact
Khaled El Emam:
kelemam@privacyanalytics.ca
613.369.4313 ext 111
@PrivacyAnalytic

Mais conteúdo relacionado

Semelhante a Facilitating Analytics while Protecting Privacy

Week 10 Managing the Public Health Surveillance and.docx
Week 10 Managing the Public Health Surveillance and.docxWeek 10 Managing the Public Health Surveillance and.docx
Week 10 Managing the Public Health Surveillance and.docxwrite5
 
DB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docxDB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docxwrite22
 
DB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docxDB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docxsdfghj21
 
Making sense of injury data
Making sense of injury dataMaking sense of injury data
Making sense of injury databronwen_bg
 
9Studying Vulnerable PopulationsLearning Objectives.docx
9Studying Vulnerable PopulationsLearning Objectives.docx9Studying Vulnerable PopulationsLearning Objectives.docx
9Studying Vulnerable PopulationsLearning Objectives.docxblondellchancy
 
Descriptive and Analytical Epidemiology
Descriptive and Analytical Epidemiology Descriptive and Analytical Epidemiology
Descriptive and Analytical Epidemiology coolboy101pk
 
DISEASE_ (B-C) 2014.ppt
DISEASE_ (B-C) 2014.pptDISEASE_ (B-C) 2014.ppt
DISEASE_ (B-C) 2014.pptTOMMY687704
 
Running head Database Technologies and Data Structure1Datab.docx
Running head Database Technologies and Data Structure1Datab.docxRunning head Database Technologies and Data Structure1Datab.docx
Running head Database Technologies and Data Structure1Datab.docxsusanschei
 
Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...
Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...
Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...DNA Compass
 
Discussion No.IIFirst Question You have just been hired by an.docx
Discussion No.IIFirst Question You have just been hired by an.docxDiscussion No.IIFirst Question You have just been hired by an.docx
Discussion No.IIFirst Question You have just been hired by an.docxmadlynplamondon
 
Fundamental of epidemioloy
Fundamental of epidemioloyFundamental of epidemioloy
Fundamental of epidemioloyMahmoud Shaqria
 
Health Data Innovation (Wolfram Data Summit)
Health Data Innovation (Wolfram Data Summit)Health Data Innovation (Wolfram Data Summit)
Health Data Innovation (Wolfram Data Summit)Peter Speyer
 
Health Inequalities Among Australians
Health Inequalities Among AustraliansHealth Inequalities Among Australians
Health Inequalities Among AustraliansLaura Torres
 
Measuring the Vital Events in the Communities of Africa
Measuring the Vital Events in the Communities of AfricaMeasuring the Vital Events in the Communities of Africa
Measuring the Vital Events in the Communities of AfricaMEASURE Evaluation
 
Health survillence and informatics.pptx
Health survillence and informatics.pptxHealth survillence and informatics.pptx
Health survillence and informatics.pptxRichaMishra186341
 
Chapter 3Public Health Data and Communications.docx
Chapter 3Public Health Data and Communications.docxChapter 3Public Health Data and Communications.docx
Chapter 3Public Health Data and Communications.docxwalterl4
 
DQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docx
DQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docxDQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docx
DQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docxrosaliaj1
 

Semelhante a Facilitating Analytics while Protecting Privacy (20)

Privacy vs. Public Health
Privacy vs. Public HealthPrivacy vs. Public Health
Privacy vs. Public Health
 
Week 10 Managing the Public Health Surveillance and.docx
Week 10 Managing the Public Health Surveillance and.docxWeek 10 Managing the Public Health Surveillance and.docx
Week 10 Managing the Public Health Surveillance and.docx
 
DB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docxDB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docx
 
DB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docxDB Question for Public Health in Disaster Management.docx
DB Question for Public Health in Disaster Management.docx
 
Making sense of injury data
Making sense of injury dataMaking sense of injury data
Making sense of injury data
 
9Studying Vulnerable PopulationsLearning Objectives.docx
9Studying Vulnerable PopulationsLearning Objectives.docx9Studying Vulnerable PopulationsLearning Objectives.docx
9Studying Vulnerable PopulationsLearning Objectives.docx
 
Descriptive and Analytical Epidemiology
Descriptive and Analytical Epidemiology Descriptive and Analytical Epidemiology
Descriptive and Analytical Epidemiology
 
DISEASE_ (B-C) 2014.ppt
DISEASE_ (B-C) 2014.pptDISEASE_ (B-C) 2014.ppt
DISEASE_ (B-C) 2014.ppt
 
Running head Database Technologies and Data Structure1Datab.docx
Running head Database Technologies and Data Structure1Datab.docxRunning head Database Technologies and Data Structure1Datab.docx
Running head Database Technologies and Data Structure1Datab.docx
 
Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...
Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...
Possible Solution for Managing the Worlds Personal Genetic Data - DNA Guide, ...
 
Discussion No.IIFirst Question You have just been hired by an.docx
Discussion No.IIFirst Question You have just been hired by an.docxDiscussion No.IIFirst Question You have just been hired by an.docx
Discussion No.IIFirst Question You have just been hired by an.docx
 
The challenges of zika: a health IT response
The challenges of zika: a health IT responseThe challenges of zika: a health IT response
The challenges of zika: a health IT response
 
Fundamental of epidemioloy
Fundamental of epidemioloyFundamental of epidemioloy
Fundamental of epidemioloy
 
Health Data Innovation (Wolfram Data Summit)
Health Data Innovation (Wolfram Data Summit)Health Data Innovation (Wolfram Data Summit)
Health Data Innovation (Wolfram Data Summit)
 
Health Inequalities Among Australians
Health Inequalities Among AustraliansHealth Inequalities Among Australians
Health Inequalities Among Australians
 
Measuring the Vital Events in the Communities of Africa
Measuring the Vital Events in the Communities of AfricaMeasuring the Vital Events in the Communities of Africa
Measuring the Vital Events in the Communities of Africa
 
Ameet Sarpatwari: "Data Sharing that Enables Post-Approval Drug and Device Re...
Ameet Sarpatwari: "Data Sharing that Enables Post-Approval Drug and Device Re...Ameet Sarpatwari: "Data Sharing that Enables Post-Approval Drug and Device Re...
Ameet Sarpatwari: "Data Sharing that Enables Post-Approval Drug and Device Re...
 
Health survillence and informatics.pptx
Health survillence and informatics.pptxHealth survillence and informatics.pptx
Health survillence and informatics.pptx
 
Chapter 3Public Health Data and Communications.docx
Chapter 3Public Health Data and Communications.docxChapter 3Public Health Data and Communications.docx
Chapter 3Public Health Data and Communications.docx
 
DQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docx
DQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docxDQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docx
DQ11) N-LHi Class-In 2023 we are will educate about STDs and be aware.docx
 

Mais de Khaled El Emam

Canadian AI 2014 Conference Keynote - Deploying SMC in Practice
Canadian AI 2014 Conference Keynote - Deploying SMC in PracticeCanadian AI 2014 Conference Keynote - Deploying SMC in Practice
Canadian AI 2014 Conference Keynote - Deploying SMC in PracticeKhaled El Emam
 
Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...
Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...
Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...Khaled El Emam
 
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...Khaled El Emam
 
Anonymizing Health Data
Anonymizing Health DataAnonymizing Health Data
Anonymizing Health DataKhaled El Emam
 
Sharing Health Research Data
Sharing Health Research DataSharing Health Research Data
Sharing Health Research DataKhaled El Emam
 
The De-identification of Clinical Data
The De-identification of Clinical DataThe De-identification of Clinical Data
The De-identification of Clinical DataKhaled El Emam
 
The Adoption of Personal Health Records by Consumers
The Adoption of Personal Health Records by ConsumersThe Adoption of Personal Health Records by Consumers
The Adoption of Personal Health Records by ConsumersKhaled El Emam
 
The Use of EDC in Canadian Clinical Trials
The Use of EDC in Canadian Clinical TrialsThe Use of EDC in Canadian Clinical Trials
The Use of EDC in Canadian Clinical TrialsKhaled El Emam
 

Mais de Khaled El Emam (8)

Canadian AI 2014 Conference Keynote - Deploying SMC in Practice
Canadian AI 2014 Conference Keynote - Deploying SMC in PracticeCanadian AI 2014 Conference Keynote - Deploying SMC in Practice
Canadian AI 2014 Conference Keynote - Deploying SMC in Practice
 
Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...
Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...
Take Two Curves and Call Me in the Morning: The Story of the NSAs Dual_EC_DRB...
 
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
Big Data Meets Privacy:De-identification Maturity Model for Benchmarking and ...
 
Anonymizing Health Data
Anonymizing Health DataAnonymizing Health Data
Anonymizing Health Data
 
Sharing Health Research Data
Sharing Health Research DataSharing Health Research Data
Sharing Health Research Data
 
The De-identification of Clinical Data
The De-identification of Clinical DataThe De-identification of Clinical Data
The De-identification of Clinical Data
 
The Adoption of Personal Health Records by Consumers
The Adoption of Personal Health Records by ConsumersThe Adoption of Personal Health Records by Consumers
The Adoption of Personal Health Records by Consumers
 
The Use of EDC in Canadian Clinical Trials
The Use of EDC in Canadian Clinical TrialsThe Use of EDC in Canadian Clinical Trials
The Use of EDC in Canadian Clinical Trials
 

Último

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Facilitating Analytics while Protecting Privacy

  • 1. Facilitating Analytics while Protecting Individual Privacy Using Data De-identification Khaled El Emam
  • 2. Talk Outline Present two case studies where we conducted an analysis of the privacy implications associated with sharing health data. Overview of methodology and risk measurement basics State of Louisiana Department of Health and Hospitals and Cajun Code Fest 2013 Mount Sinai School of Medicine Department of Preventative Medicine – World Trade Center Disaster Registry
  • 3. Data Anonymization Resources Book Signing: September 26, 2013 at 10:35am Khaled El Emam Luk Arbuckle
  • 5. Direct and In-Direct/Quasi-Identifiers Examples of direct identifiers: Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number Examples of quasi identifiers: sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, total years of schooling, marital status, criminal history, total income, visible minority status, profession, event dates
  • 7. Safe Harbor. Safe Harbor Direct Identifiers and Quasi-identifiers: 1. Names; 2. ZIP Codes (except first three digits); 3. All elements of dates (except year); 4. Telephone numbers; 5. Fax numbers; 6. Electronic mail addresses; 7. Social security numbers; 8. Medical record numbers; 9. Health plan beneficiary numbers; 10. Account numbers; 11. Certificate/license numbers; 12. Vehicle identifiers and serial numbers, including license plate numbers; 13. Device identifiers and serial numbers; 14. Web Universal Resource Locators (URLs); 15. Internet Protocol (IP) address numbers; 16. Biometric identifiers, including finger and voice prints; 17. Full face photographic images and any comparable images; 18. Any other unique identifying number, characteristic, or code. Actual Knowledge.
  • 8. Statistical Method: A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (I) applying such principles and methods, determines that the risk is “very small” that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and (II) documents the methods and results of the analysis that justify such determination.
  • 9. Equivalence Classes - I: An equivalence class is the set of records in a table that has the same values for all quasi-identifiers.
  • 10-15. Equivalence Classes - II to VII (the same table appears on each of these slides):

        Gender  Year of Birth (10 years)  DIN
        Male    1970-1979                 2046059
        Male    1980-1989                 716839
        Male    1970-1979                 2241497
        Female  1990-1999                 2046059
        Female  1980-1989                 392537
        Male    1990-1999                 363766
        Male    1990-1999                 544981
        Female  1980-1989                 293512
        Male    1970-1979                 544981
        Female  1990-1999                 596612
        Male    1980-1989                 725765

  • 16. Maximum Risk: In the example data set we had 5 equivalence classes. The largest equivalence class had a size of 3, and the smallest equivalence class had a size of 2. The probability of correctly re-identifying a record is 1 divided by the size of its equivalence class. The maximum probability in this table is therefore 50% (0.5 probability), which comes from the classes of size 2.
  • 17. Average Risk: There were four equivalence classes of size 2 and one equivalence class of size 3, so 8 records each have a risk of 1/2 and 3 records each have a risk of 1/3. The average risk is [(8 x 0.5) + (3 x 1/3)] / 11 = 5/11, or about 45%. This turns out to be the number of equivalence classes divided by the number of records.
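The maximum and average risk figures from the example table can be checked with a short script. This is only a minimal sketch: it assumes that gender and decade of birth are the quasi-identifiers (the DIN column is left out), and the function names are illustrative rather than taken from the talk.

```python
from collections import Counter

# Quasi-identifier values (gender, decade of birth) for the 11 example records.
records = [
    ("Male", "1970-1979"),
    ("Male", "1980-1989"),
    ("Male", "1970-1979"),
    ("Female", "1990-1999"),
    ("Female", "1980-1989"),
    ("Male", "1990-1999"),
    ("Male", "1990-1999"),
    ("Female", "1980-1989"),
    ("Male", "1970-1979"),
    ("Female", "1990-1999"),
    ("Male", "1980-1989"),
]


def equivalence_class_sizes(rows):
    """Map each distinct quasi-identifier combination to its record count."""
    return Counter(rows)


def maximum_risk(rows):
    """Risk of the most exposed record: 1 / size of the smallest class."""
    sizes = equivalence_class_sizes(rows)
    return 1 / min(sizes.values())


def average_risk(rows):
    """Mean over all records of 1 / size of the record's class."""
    sizes = equivalence_class_sizes(rows)
    return sum(1 / sizes[row] for row in rows) / len(rows)


if __name__ == "__main__":
    print(len(equivalence_class_sizes(records)))  # number of classes
    print(maximum_risk(records))
    print(average_risk(records))
```

Because each class of size k contributes k records with risk 1/k, every class adds exactly 1 to the sum, which is why the average equals the number of classes divided by the number of records.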
  • 18. Case Study: State of Louisiana – Cajun Code Fest
  • 19. State of Louisiana: Demonstrate how the State of Louisiana used a novel approach to improve the health of its citizens by working with the Center for Business & Information Technologies (CBIT) at the University of Louisiana to provide data for Cajun Code Fest. Discuss how, given realistic looking and behaving de-identified Medicaid claims and immunization data, competitors were able to generate applications to help Louisiana’s “Own your Own Health” initiative, which encourages patients to make knowledgeable and informed decisions about their healthcare.
  • 20. Cajun Code Fest 2.0: April 24-26, 2013. 27 hours of coding put on by the Center for Business & Information Technology at the University of Louisiana Lafayette. Teams converged to work their innovative magic to analyze the de-identified data set to create new healthcare solutions that will allow patients to become engaged in their own health.
  • 21. Why De-identified Data? The core data that served as the basis for Cajun Code Fest had to be de-identified before it could be released to the entrants in the challenge. It would not have been possible to have the coding challenge without properly de-identified data.
  • 22. Data by the Numbers: 200,000 unique individuals; 6,683,337 Medicaid claims; 6,410,969 Medicaid prescriptions; 4,085,977 immunization records; 29,951 providers.
  • 25. Long Tails & Truncation
  • 26. Date Shifting – Simple Noise
  • 27. Date Shifting – Fixed Shift
  • 28. Date Shifting – Randomized Generalization I
  • 29. Date Shifting - Randomized Generalization II
  • 30. Geoproxy Attacks: Patients tend to visit providers and obtain prescriptions from pharmacies that are close to where they live. Can we use the provider and pharmacy location information to predict where the patient lives? This is called a geoproxy attack. We can measure the probability of a correct geoproxy attack and incorporate that into our overall risk measurement framework.
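One simple way to picture measuring a geoproxy attack is sketched below. The transcript does not say how the probability was actually estimated, so everything here is an assumption for illustration: the toy patients, the region labels, and the rule that the adversary guesses the region where the patient has the most provider or pharmacy visits.

```python
from collections import Counter

# Hypothetical toy data (not from the talk): each patient's true home region
# and the regions of the providers and pharmacies they visited.
patients = {
    "p1": {"home": "A", "visits": ["A", "A", "B"]},
    "p2": {"home": "B", "visits": ["B", "B"]},
    "p3": {"home": "C", "visits": ["A", "A", "C"]},
    "p4": {"home": "A", "visits": ["A"]},
}


def guess_home_region(visits):
    """Assumed adversary rule: guess the most frequently visited region."""
    return Counter(visits).most_common(1)[0][0]


def geoproxy_success_rate(data):
    """Fraction of patients whose home region the assumed rule guesses correctly."""
    hits = sum(
        1 for p in data.values() if guess_home_region(p["visits"]) == p["home"]
    )
    return hits / len(data)


if __name__ == "__main__":
    print(geoproxy_success_rate(patients))
```

An empirical success rate of this kind is one candidate for the probability that the slide says is folded into the overall risk measurement.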
  • 31. Geoproxy Risk on Claims Data
  • 32. Case Study: Mount Sinai School of Medicine World Trade Center Disaster Registry
  • 33. Background: Over 50,000 people are estimated to have helped with the rescue and recovery efforts after 9/11, and over 27,000 of those are captured in the WTC disaster registry created by the Clinical Center of Excellence at Mount Sinai. Mount Sinai did a lot of publicity and outreach, working with a variety of organizations, to recruit 9/11 workers and volunteers. Those who participated have gone through comprehensive examinations including: medical questionnaires; mental-health questionnaires; exposure-assessment questionnaires; standardised physical examinations; optional follow-up assessments every 12 to 18 months.
  • 36. Series of Events: The visit date was used for questions that were specific to the date at which the visit occurred (e.g., “do you currently smoke?” would create an event for smoking at the time of the visit). Some questions included dates that could be used directly with the quasi-identifier, and were more informative than the visit date (e.g., the answer to “when were you diagnosed with this disease?” was used to provide a date for the disease event).
  • 39. Multiple Levels: Sometimes it is reasonable to assume that the adversary will not have a lot of details about an event. For example, the adversary may know that an event has occurred but not know the exact date on which it occurred. In such a case we generalize the data to match the adversary's background knowledge when measuring risk, but we release the more detailed data. This makes sense given the assumption: the more detailed information that is released does not give the adversary additional useful information.
  • 40. Time of Events: Ten years after the fact, however, it seems unlikely that an adversary will know the dates of a patient’s events before 9/11. Often patients gave different years of diagnosis on follow-up visits because they themselves didn’t remember what medical conditions they had! So instead of the date of event, we used “pre-9/11” as a value. We made a distinction between childhood (under 18) and adulthood (18 and over) diagnoses, since these seemed like something an adversary could reasonably know. These generalizations were done only for measuring risk, and weren’t applied to the de-identified registry data.
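The generalization described on this slide might look roughly like the sketch below. The transcript gives only the categories, so several details are assumptions: the function name, combining “pre-9/11” with the childhood/adulthood split into a single label, and leaving events on or after 9/11 unchanged.

```python
from datetime import date

SEPT_11 = date(2001, 9, 11)


def age_on(event_date, birth_date):
    """Whole years between birth_date and event_date."""
    had_birthday = (event_date.month, event_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return event_date.year - birth_date.year - (0 if had_birthday else 1)


def generalize_event_time(event_date, birth_date):
    """Return the value used for risk measurement only (assumed labelling)."""
    if event_date >= SEPT_11:
        # Assumption: later events are left as-is in this sketch.
        return event_date.isoformat()
    if age_on(event_date, birth_date) < 18:
        return "pre-9/11, childhood"
    return "pre-9/11, adulthood"


if __name__ == "__main__":
    born = date(1970, 1, 1)
    print(generalize_event_time(date(1985, 5, 1), born))
    print(generalize_event_time(date(1995, 5, 1), born))
```

As the slide notes, values like these would feed the risk measurement only; the released registry data would keep its original detail.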
  • 41. Covering Designs: What are the quasi-identifiers when the series of events is long? Will an adversary know all of the details in that sequence? It is reasonable to assume that an adversary will only know p events; this is the power of the adversary. But which p out of m events does the adversary know? If we look at all combinations of p from m, we may end up with quite a large number of combinations of quasi-identifiers for which to measure the risk.
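To see how quickly the number of combinations grows, the sketch below counts and enumerates the p-event subsets of an m-event sequence. It illustrates only the brute-force enumeration that the slide warns about, not a covering design itself; the function names and the example event labels are hypothetical.

```python
from itertools import combinations
from math import comb


def count_event_subsets(m, p):
    """Number of ways an adversary could know p out of m events."""
    return comb(m, p)


def event_subsets(events, p):
    """Enumerate every set of p events that could serve as quasi-identifiers."""
    return list(combinations(events, p))


if __name__ == "__main__":
    # Hypothetical event labels for one patient.
    events = ["smoking", "asthma", "visit-1", "visit-2"]
    for subset in event_subsets(events, 2):
        print(subset)
    print(count_event_subsets(10, 3))
    print(count_event_subsets(50, 5))
```

A covering design, in the combinatorial sense, is a smaller collection of larger blocks chosen so that every p-subset is contained in at least one block, which would allow risk to be measured on far fewer quasi-identifier sets than the full enumeration.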