1. SHARING HEALTH RESEARCH DATA
De-identification
METHODS & EXPERIENCES
Dr. Khaled El Emam
Electronic Health Information Laboratory
2. Motivations for De-identification
• Obtaining patient consent or authorization is not practical for large databases and introduces bias
• Compliance with regulations and legislation
• Contractual obligations
• Maintaining public, consumer, and client trust
• Costs of breach notification
Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
7. A Balance
8. Definition of De-identified Data
Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.
9. Re-identification Attacks
• To clear this issue up at the outset:
• There are some claims that health data are easy to re-identify
• Often examples are used to support that argument
• The evidence does not support these claims
– When data are de-identified properly, the probability of a successful re-identification attack is very small
• Let’s consider a few highly publicized examples
10. AOL
• AOL releases search queries, replacing usernames with pseudonyms
• New York Times reporters re-identify one user, number 4417749
• Her search terms: “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”
• Thelma Arnold, a widow living in Lilburn, Ga; she has three dogs
11. AOL?
• It is well known that a large percentage of individuals run ‘vanity’ searches that include their names; Thelma Arnold did
• It is also known that location information can be determined from an individual’s search queries
• Search queries, even if the username is replaced with a pseudonym, cannot be considered de-identified
12. Weld
• Governor Weld of Massachusetts was unwell during a public appearance; the story was covered in the media
• Semi-publicly available insurance claims data were matched with voter registration lists
• It was possible to determine which claims records belonged to the Governor
13. Weld?
• This re-identification attack was done before HIPAA came into effect; the insurance claims data would not pass any of the HIPAA de-identification standards
• A recent analysis indicated that Weld was likely re-identified because he was a famous person and there was already a lot of information about him in the media (his admission date, his diagnosis, his discharge date); the voter registration list was arguably not necessary
• The success rate for such an attack would be lower for general members of the public because the voter registration list is incomplete
14. Netflix
• Netflix publicly released movie ratings data in the context of a competition to develop a recommendation algorithm
• Researchers re-identified a couple of records by matching with a publicly available and identifiable movie ratings database (IMDB)
• This resulted in the cancellation of a second competition and in litigation against Netflix for exposing personal information
15. Netflix?
• The re-identifications were not actually verified by Netflix
• The authors of the attack admit that the Netflix data were not de-identified (usernames were only replaced with pseudonyms)
• The false positive rate of the matching was not evaluated (how many people in the IMDB database were actually in the Netflix database?)
17. Attribute vs Identity Disclosure
• Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
• Identity disclosure: determine which record in the database belongs to a particular individual (for example, determining that record number 7 belongs to Bob Smith is identity disclosure)
• HIPAA only cares about identity disclosure
18. Attribute vs Identity Disclosure
             HPV Vaccinated   NOT HPV Vaccinated
Religion A   5                40
Religion B   40               5
Statistically significant relationship (chi-square, p<0.05)
High risk of attribute disclosure
20. Attribute vs Identity Disclosure
After suppression:
             HPV Vaccinated   NOT HPV Vaccinated
Religion A   5                6
Religion B   6                5
Not statistically significant relationship (chi-square)
Low risk of attribute disclosure
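The contrast between the original counts (5, 40, 40, 5) and the counts after suppression (5, 6, 6, 5) can be checked with a short calculation. The sketch below computes the Pearson chi-square statistic for a 2x2 table by hand and compares it with 3.841, the critical value for one degree of freedom at p = 0.05. It assumes no continuity correction, since the slides do not say which variant was used, and the function name is illustrative.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of religion and vaccination status
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


CRITICAL_VALUE_1DF_P05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

before = [[5, 40], [40, 5]]  # counts from the original table
after = [[5, 6], [6, 5]]     # counts after suppression

for label, table in (("before", before), ("after", after)):
    stat = chi_square_2x2(table)
    verdict = "significant" if stat > CRITICAL_VALUE_1DF_P05 else "not significant"
    print(label, round(stat, 2), verdict)
```

With the original counts the statistic is far above the critical value; after suppression it is far below, which is why religion no longer tells an adversary much about vaccination status.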
22. Definition of De-identified Data
Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual
23. Direct Identifiers
• Fields that would uniquely identify individuals in a database
• Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number
24. Dealing with Direct Identifiers
• Defensible approaches:
– Remove those fields
– Convert them to one-time or persistent pseudonyms
– Randomize the values
• These approaches will ensure, if done properly, that the probability of recovering the original value is very small
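The difference between persistent and one-time pseudonyms can be sketched as follows. This is a minimal illustration, not the method used by the author: it assumes a keyed hash (HMAC-SHA256) for persistent pseudonyms and a random token for one-time pseudonyms. The key, the example MRN, and the function names are all hypothetical.

```python
import hashlib
import hmac
import secrets


def persistent_pseudonym(value: str, key: bytes) -> str:
    """Same value and key always give the same pseudonym, so records stay linkable."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an assumption, not a requirement


def one_time_pseudonym() -> str:
    """A fresh random token that has no relationship to the original value."""
    return secrets.token_hex(8)


key = secrets.token_bytes(32)  # hypothetical secret key, never shared with data recipients
mrn = "MRN-0012345"            # made-up medical record number

p1 = persistent_pseudonym(mrn, key)
p2 = persistent_pseudonym(mrn, key)
print(p1 == p2)                                      # same patient links across extracts
print(one_time_pseudonym() == one_time_pseudonym())  # one-time tokens do not repeat in practice
```

Without the key, a recipient cannot recompute the mapping from a guessed MRN, which is what keeps the probability of recovering the original value small in this sketch.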
25. Quasi-Identifiers
• Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality
33. Re-identification Risk Measurement
• Risk measurement will depend on:
– Granularity of the quasi-identifiers
– Region of the country under consideration
– Risk metric used (e.g., uniqueness or groups of 5)
– Threshold for what is acceptable risk
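The two risk metrics named above, and the effect of granularity, can be illustrated with a small sketch. It groups records by their quasi-identifier values and reports the share of records that are unique and the share that fall in groups smaller than 5. The records, field names, and function names are made up for illustration.

```python
from collections import Counter


def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def proportion_below_k(records, quasi_identifiers, k):
    """Share of records whose group has fewer than k members (k=2 means unique)."""
    sizes = class_sizes(records, quasi_identifiers)
    small = sum(
        1 for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] < k
    )
    return small / len(records)


qis = ["sex", "age", "region"]

# Made-up records: six women in the same region, each with a different age in years
fine = [{"sex": "F", "age": age, "region": "K1H"} for age in (31, 32, 33, 34, 35, 36)]
# Same records with age generalized to 10-year bands
coarse = [{**r, "age": r["age"] // 10 * 10} for r in fine]

print(proportion_below_k(fine, qis, k=2))    # uniqueness with exact ages
print(proportion_below_k(coarse, qis, k=2))  # uniqueness with 10-year age bands
print(proportion_below_k(fine, qis, k=5))    # records in groups smaller than 5
print(proportion_below_k(coarse, qis, k=5))
```

With exact ages every record is unique; with 10-year bands all six records fall into one group, so both metrics drop to zero.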
34. De-identification Standards
• The HIPAA Privacy Rule specifies two de-identification standards (45 CFR 164.514):
– Safe Harbor
– Statistical method (also known as the expert statistician method)
35. HIPAA Safe Harbor
Safe Harbor Direct Identifiers and Quasi-identifiers
1. Names
2. ZIP Codes (except first three)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
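Items 1 to 3 of the list translate into simple field transformations. The sketch below drops the name, truncates the ZIP code to its first three digits, and keeps only the year of a date. It is a partial illustration under stated assumptions: the record layout and function names are hypothetical, the remaining fifteen items are not handled, and the regulation attaches further conditions to some items that the slide does not list.

```python
from datetime import date


def zip_to_zip3(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code."""
    return zip_code.strip()[:3]


def safe_harbor_sketch(record: dict) -> dict:
    """Apply items 1-3 only: drop name, generalize ZIP to ZIP3, reduce dates to year."""
    out = dict(record)            # work on a copy, leave the input untouched
    out.pop("name", None)         # item 1: names are removed
    out["zip"] = zip_to_zip3(out["zip"])   # item 2: ZIP Codes (except first three)
    out["admission"] = out["admission"].year  # item 3: dates (except year)
    return out


# Made-up record for illustration
record = {
    "name": "Bob Smith",
    "zip": "11201",
    "admission": date(2007, 3, 14),
    "lab_test": "Albumin, Serum",
}

print(safe_harbor_sketch(record))
```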
37. Two Problems with Safe Harbor
• May be removing too much information on the ZIP Code and date fields; these fields are useful for many analytical purposes
• Does not provide adequate protection: it is easy to have a Safe Harbor compliant data set with a high risk of re-identification
38. High Risk Safe Harbor Data - I
• If the adversary knows that Bob, a 55-year-old male, is in the database, only one record matches him:
Gender   Age   ZIP   Lab Test
M        55    112   Albumin, Serum
F        53    114   Alkaline Phosphatase
M        24    134   Creatine Kinase
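The attack on this table amounts to filtering the released records by what the adversary already knows. A minimal sketch using the three rows from the slide (the field names and the function name are illustrative):

```python
released = [
    {"gender": "M", "age": 55, "zip3": "112", "lab_test": "Albumin, Serum"},
    {"gender": "F", "age": 53, "zip3": "114", "lab_test": "Alkaline Phosphatase"},
    {"gender": "M", "age": 24, "zip3": "134", "lab_test": "Creatine Kinase"},
]


def matching_records(records, background):
    """Return the records consistent with everything the adversary already knows."""
    return [r for r in records if all(r[k] == v for k, v in background.items())]


bob = {"gender": "M", "age": 55}  # the adversary's background knowledge about Bob
matches = matching_records(released, bob)

print(len(matches))            # 1: Bob's record is unique on gender and age
print(matches[0]["lab_test"])  # Albumin, Serum
```

Because exactly one record matches, the adversary learns Bob's lab test even though the data set contains no Safe Harbor identifiers.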
39. High Risk Safe Harbor Data - II
• 2.24m visits, 1.6m patients, NY discharge data for 2007
• Compliant with Safe Harbor
Fields                                      % of patients unique
age, gender, ZIP3                           2.54%
age, gender, ZIP3, LOS (length of stay)     21.49%
40. Statistical Method Conditions
• A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
II. Documents the methods and results of the analysis that justify such determination
49. Different Types of Data Releases
• The same data set can be disclosed with different thresholds:
– Public data set
– Release with conditions for known data recipients, including the requirement to sign a data sharing agreement, a prohibition on re-identification, and a requirement to pass these conditions to all sub-contractors
– The more conditions, the higher the quality of the data set
50. Example – CA Hospital Discharges
• Context: data release to a researcher who will sign a data use agreement and has good practices for managing sensitive health information
• There were ~2.1m patients who had ~3m visits
• Risk threshold = 0.2; use average risk across all patients
• Variables:
– Year of birth
– Gender
– Year of admission
– Days since last visit
– Length of stay
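The slide does not define how average risk is computed. The sketch below assumes one common definition: each patient's risk is 1 divided by the number of patients sharing the same quasi-identifier values, and the average is taken over all patients. The toy records, field names, and function name are made up; only the five variables and the 0.2 threshold come from the slide.

```python
from collections import Counter

QUASI_IDENTIFIERS = [
    "year_of_birth", "gender", "year_of_admission",
    "days_since_last_visit", "length_of_stay",
]
THRESHOLD = 0.2  # risk threshold from the slide


def average_risk(records, quasi_identifiers):
    """Mean over records of 1 / (size of the record's equivalence class). Assumed definition."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return sum(1 / sizes[k] for k in keys) / len(keys)


# Made-up data: 16 patients share one profile, 4 share another
group_a = {"year_of_birth": 1950, "gender": "F", "year_of_admission": 2007,
           "days_since_last_visit": 0, "length_of_stay": 2}
group_b = {"year_of_birth": 1962, "gender": "M", "year_of_admission": 2007,
           "days_since_last_visit": 30, "length_of_stay": 1}
patients = [dict(group_a) for _ in range(16)] + [dict(group_b) for _ in range(4)]

risk = average_risk(patients, QUASI_IDENTIFIERS)
print(risk, "acceptable" if risk <= THRESHOLD else "needs more generalization or suppression")
```

Under this definition the average risk equals the number of distinct quasi-identifier combinations divided by the number of patients, here 2 / 20.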
51. Risk Level
53. De-identified Data
54. Key Practical Considerations
• Data warehouses: de-identification of data extracts instead of whole data warehouses results in higher quality de-identified data
• Beware of correlated data: data in multiple medical domains are correlated, so one has to be cognizant of inference attacks on data
• Automation: automation can detect outliers and perform selective suppression, which results in higher quality de-identified data
• Transparency: important to ensure that methods have received peer and regulator scrutiny
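One simple form of selective suppression can be sketched as follows: quasi-identifier values are blanked only in records whose combination of values is shared by fewer than k records, and all other records are left untouched. This is an illustrative assumption about what such automation might do, not a description of any particular tool; the records and names are made up.

```python
from collections import Counter


def suppress_small_classes(records, quasi_identifiers, k=5):
    """Blank the quasi-identifiers of records whose equivalence class has fewer than k members."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    result = []
    for record, key in zip(records, keys):
        if sizes[key] < k:
            # Outlier: keep the clinical fields, suppress only the quasi-identifiers
            record = {**record, **{q: None for q in quasi_identifiers}}
        result.append(record)
    return result


qis = ["gender", "year_of_birth"]

# Made-up extract: five similar patients and one outlier
extract = [{"gender": "F", "year_of_birth": 1950, "lab_test": "Albumin, Serum"} for _ in range(5)]
extract.append({"gender": "M", "year_of_birth": 1901, "lab_test": "Creatine Kinase"})

cleaned = suppress_small_classes(extract, qis, k=5)
print(cleaned[0])   # unchanged
print(cleaned[-1])  # quasi-identifiers suppressed, lab test retained
```

Suppressing only the outlier leaves the other five records intact, which is the sense in which selective suppression preserves data quality. A fuller implementation would also re-check the size of the suppressed group itself.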
55. Contact
kelemam@ehealthinformation.ca
@kelemam
www.ehealthinformation.ca