SHARING HEALTH RESEARCH DATA

De-identification
METHODS & EXPERIENCES




Dr. Khaled El Emam
Electronic Health Information Laboratory
Motivations for De-identification
  • Obtaining patient consent/authorization is not
    practical for large databases and introduces
    bias
  • Compliance with regulations and legislation
  • Contractual obligations
  • Maintaining public, consumer, and client trust
  • Avoiding the costs of breach notification

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
A Balance




Definition of De-identified Data

  Health information that does not identify an individual and with
  respect to which there is no reasonable basis to believe that the
  information can be used to identify an individual is not individually
  identifiable health information.




Re-identification Attacks
  • Let's clear up this issue at the outset
  • There are some claims that health data is easy to
    re-identify
  • Examples are often used to support that argument
  • The evidence does not support these claims
     – When data are de-identified properly, the
       probability of a successful re-identification attack
       is very small
  • Let's consider a few highly publicized examples




AOL
  • AOL released search queries, replacing usernames
    with pseudonyms
  • New York Times reporters re-identified one user,
    number 4417749
  • Her search terms: “tea for good health”, “numb
    fingers”, “hand tremors”, “dry mouth”, “60 single
    men”, “dog that urinates on everything”,
    “landscapers in Lilburn, Ga”, “homes sold in
    shadow lake subdivision gwinnett county georgia”
  • The user was Thelma Arnold, a widow living in
    Lilburn, Ga.; she has three dogs


AOL?
  • It is well known that a large percentage of
    individuals run 'vanity' searches that include their
    names, and Thelma Arnold did
  • It is also known that location information can be
    determined from an individual's search queries
  • Search queries, even if the username is replaced
    with a pseudonym, cannot be considered
    de-identified




Weld
  • Governor Weld of Massachusetts was unwell during a
    public appearance, and the story was covered in the
    media
  • Semi-publicly available insurance claims data were
    matched with voter registration lists
  • It was possible to determine which claims records
    belonged to the Governor




Weld?
  • This re-identification attack was done before HIPAA came into
    effect – the insurance claims data would not pass any of the
    HIPAA de-identification standards
  • A recent analysis indicated that Weld was likely re-identified
    because he was a famous person and there was already a lot
    of information about him in the media (his admission date, his
    diagnosis, his discharge date) – the voter registration list was
    arguably not necessary
  • The success rate for such an attack would be lower for general
    members of the public because the voter registration list is
    incomplete




Netflix
  • Netflix publicly released movie ratings data in the
    context of a competition to develop a
    recommendation algorithm
  • Researchers re-identified a couple of records by
    matching with a publicly available and identifiable
    movie ratings database (IMDB)
  • This resulted in the cancellation of a second
    competition and in litigation against Netflix for
    exposing personal information




Netflix?
  • The re-identifications were not actually verified by
    Netflix
  • The authors of the attack admit that the Netflix data
    were not de-identified (usernames were only replaced
    with pseudonyms)
  • The false positive rate of the matching was not
    evaluated (how many people in the IMDB database
    were actually in the Netflix database?)




http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071




Attribute vs Identity Disclosure
  • Attribute disclosure: discover something
    new about an individual in the database
    without knowing which record belongs to that
    individual
  • Identity disclosure: determine which record
    in the database belongs to a particular
    individual (for example, determine that
    record number 7 belongs to Bob Smith – that
    is identity disclosure)
  • HIPAA only cares about identity disclosure

Attribute vs Identity Disclosure

                    HPV Vaccinated    NOT HPV Vaccinated
     Religion A            5                  40
     Religion B           40                   5

  • Statistically significant relationship (chi-square,
    p<0.05)
  • High risk of attribute disclosure

Attribute vs Identity Disclosure

                    HPV Vaccinated    NOT HPV Vaccinated
     Religion A            5                   6
     Religion B            6                   5

  • After suppression
  • Relationship is not statistically significant (chi-square)
  • Low risk of attribute disclosure
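
The chi-square claims for the two tables (before and after suppression) can be checked with a short sketch. It assumes a plain Pearson chi-square test with 1 degree of freedom and no continuity correction; the slides do not say which variant was used, and the function name is illustrative.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: Religion A, Religion B. Columns: vaccinated, not vaccinated.
before = [[5, 40], [40, 5]]
after = [[5, 6], [6, 5]]  # counts after suppression

for label, table in (("before", before), ("after", after)):
    stat, p = chi_square_2x2(table)
    print(f"{label}: chi-square = {stat:.2f}, p = {p:.4f}, significant = {p < 0.05}")
```

With the counts from the slides, the first table comes out significant at the 0.05 level and the second does not, which matches the risk labels above.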


Stigmatizing Analytics




Definition of De-identified Data


  Health information that does not identify an individual and with
  respect to which there is no reasonable basis to believe that the
  information can be used to identify an individual




Direct Identifiers
  • Fields that would uniquely identify individuals
    in a database
  • Name, address, telephone number, fax
    number, MRN, health card number, health
    plan beneficiary number, license plate
    number, email address, photograph,
    biometrics, SSN, SIN, implanted device
    number



Dealing with Direct Identifiers
  • Defensible approaches:
            – Remove those fields
            – Convert them to one-time or persistent
              pseudonyms
            – Randomize the values
  • These approaches will ensure, if done
    properly, that the probability of recovering
    the original value is very small
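
A minimal sketch of the pseudonym options, not the method used by the author. It assumes a keyed hash (HMAC-SHA256) for persistent pseudonyms and random tokens for one-time pseudonyms; the function names, the key handling, and the sample MRN are all made up for illustration.

```python
import hashlib
import hmac
import secrets

def persistent_pseudonym(value: str, key: bytes) -> str:
    """Map an identifier to the same pseudonym every time, given the same key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; the length is an arbitrary choice

def one_time_pseudonym() -> str:
    """Return a fresh random pseudonym that is unrelated to the original value."""
    return secrets.token_hex(8)

key = secrets.token_bytes(32)  # hypothetical secret, kept away from data recipients
mrn = "MRN-0012345"            # made-up medical record number

print(persistent_pseudonym(mrn, key))  # stable across calls with the same key
print(one_time_pseudonym())            # different on every call
```

Without the key, a recipient cannot recompute the mapping by hashing candidate identifiers, which is the point of "if done properly" above. Persistent pseudonyms still allow records for the same person to be linked across extracts.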



Quasi-Identifiers
  • sex, date of birth or age, geographic locations (such
    as postal codes, census geography, information
    about proximity to known or unique landmarks),
    language spoken at home, ethnic origin, aboriginal
    identity, total years of schooling, marital status,
    criminal history, total income, visible minority status,
    activity difficulties/reductions, profession, event
    dates (such as admission, discharge, procedure,
    death, specimen collection, visit/encounter), codes
    (such as diagnosis codes, procedure codes, and
    adverse event codes), country of birth, birth weight,
    and birth plurality

Re-identification Risk Measurement
  • Risk measurement will depend on:
            – Granularity of quasi-identifiers
            – Region of the country in question
            – Risk metric used (e.g., uniqueness or groups of 5)
            – Threshold for what is acceptable risk
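
To make the "uniqueness" and "groups of 5" metrics concrete, here is a small sketch. It assumes risk is measured on equivalence classes, meaning records that share the same quasi-identifier values; the records and field names are invented.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def proportion_in_small_classes(records, quasi_identifiers, k):
    """Fraction of records whose equivalence class has fewer than k members."""
    sizes = class_sizes(records, quasi_identifiers)
    at_risk = sum(size for size in sizes.values() if size < k)
    return at_risk / len(records)

# Made-up records for illustration only.
records = [
    {"sex": "M", "age": 55, "zip3": "112"},
    {"sex": "M", "age": 57, "zip3": "112"},
    {"sex": "F", "age": 53, "zip3": "114"},
    {"sex": "M", "age": 24, "zip3": "134"},
]
qids = ["sex", "age", "zip3"]

# Uniqueness: records that are alone in their class (k=2).
print("unique:", proportion_in_small_classes(records, qids, k=2))
# Groups of 5: records in classes with fewer than five members.
print("in groups smaller than 5:", proportion_in_small_classes(records, qids, k=5))
# Granularity matters: dropping age merges some classes.
print("unique without age:", proportion_in_small_classes(records, ["sex", "zip3"], k=2))
```

The same data give different risk figures depending on the metric and on which quasi-identifiers are included, as the list above says.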




De-identification Standards
  • The HIPAA Privacy Rule specifies two de-
    identification standards (45 CFR 164.514):
            – Safe Harbor
            – Statistical method (also known as the expert
              statistician method)




HIPAA Safe Harbor
Safe Harbor Direct Identifiers and Quasi-identifiers
1. Names
2. ZIP Codes (except first three digits)
3. All elements of dates (except year)
4. Telephone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
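
As a rough illustration of items 2 and 3, the sketch below truncates ZIP codes to three digits, reduces dates to the year, and drops a few direct identifier fields. The field names are hypothetical, and this is not a complete Safe Harbor implementation: it covers only a handful of the 18 items and ignores the rule's additional conditions.

```python
from datetime import date

# Hypothetical field names standing in for some of the direct identifiers.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "email", "ssn", "mrn"}

def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code."""
    return zip_code.strip()[:3]

def generalize_date(d: date) -> int:
    """Keep only the year of a date."""
    return d.year

def apply_safe_harbor_sketch(record: dict) -> dict:
    """Drop the listed direct identifiers and coarsen ZIP and admission date."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}
    if "zip" in out:
        out["zip3"] = generalize_zip(out.pop("zip"))
    if "admission_date" in out:
        out["admission_year"] = generalize_date(out.pop("admission_date"))
    return out

record = {
    "name": "Bob Smith",
    "zip": "11201",
    "admission_date": date(2007, 3, 14),
    "lab_test": "Albumin, Serum",
}
print(apply_safe_harbor_sketch(record))
```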




Two Problems with Safe Harbor
  • May be removing too much information on
    the ZIP Code and date fields – these fields
    are useful for many analytical purposes
  • Does not provide adequate protection – it is
    easy to have a Safe Harbor compliant data
    set with a high risk of re-identification




High Risk Safe Harbor Data - I
  • If the adversary knows that Bob, a 55-year-old
    male, is in the database, only one record matches
    him, which reveals his lab test

      Gender    Age    ZIP    Lab Test
      M         55     112    Albumin, Serum
      F         53     114    Alkaline Phosphatase
      M         24     134    Creatine Kinase
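
The attack on this table amounts to a simple filter. The sketch below uses the three records from the slide; the function name and the dictionary keys are illustrative.

```python
records = [
    {"gender": "M", "age": 55, "zip3": "112", "lab_test": "Albumin, Serum"},
    {"gender": "F", "age": 53, "zip3": "114", "lab_test": "Alkaline Phosphatase"},
    {"gender": "M", "age": 24, "zip3": "134", "lab_test": "Creatine Kinase"},
]

def matching_records(records, **known):
    """Return the records consistent with everything the adversary knows."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]

# The adversary knows Bob is a 55-year-old male and is in the database.
candidates = matching_records(records, gender="M", age=55)
if len(candidates) == 1:
    print("Bob's record is re-identified; lab test:", candidates[0]["lab_test"])
else:
    print(len(candidates), "candidate records match what the adversary knows")
```

The data set satisfies Safe Harbor (three-digit ZIP, no names or dates), yet a single match is enough to identify the record.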



High Risk Safe Harbor Data - II
  • 2.24m visits, 1.6m patients, NY discharge
    data for 2007
  • Compliant with Safe Harbor

        Fields                                % of patients unique
        age, gender, ZIP3                      2.54%
        age, gender, ZIP3, LOS                21.49%




Statistical Method Conditions
  • A person with appropriate knowledge of and
    experience with generally accepted statistical and
    scientific principles and methods for rendering
    information not individually identifiable:
            I.        Applying such principles and methods, determines that
                      the risk is very small that the information could be used,
                      alone or in combination with other reasonably available
                      information, by an anticipated recipient to identify an
                      individual who is a subject of the information; and
            II.       Documents the methods and results of the analysis that
                      justify such determination




Re-identification Risk Spectrum




Overall Risk




Managing Re-identification Risk




Different Types of Data Releases
  • The same data set can be disclosed
    with different thresholds:
            – Public data set
            – Release with conditions for known data
              recipients, including the requirement to sign a
              data sharing agreement, a prohibition on
              re-identification, and a requirement to pass these
              conditions to all sub-contractors
            – The more conditions imposed, the higher the
              quality of the released data set


Example – CA Hospital Discharges
  • Context: data release to a researcher who will sign a
    data use agreement and follows good practices for
    managing sensitive health information
  • There were ~2.1m patients who had ~3m visits
  • Risk threshold = 0.2, using average risk across all
    patients
  • Variables:
            – Year of birth
            – Gender
            – Year of admission
            – Days since last visit
            – Length of stay
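
One way to read "average risk across all patients" is the mean, over records, of one divided by the size of the record's equivalence class. That reading is an assumption here, as is treating "at or below 0.2" as acceptable. The data are made up and use only three of the five variables.

```python
from collections import Counter

RISK_THRESHOLD = 0.2  # threshold quoted on the slide

def average_risk(records, quasi_identifiers):
    """Mean over records of 1 / (size of the record's equivalence class)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return sum(1 / sizes[k] for k in keys) / len(records)

qids = ["year_of_birth", "gender", "year_of_admission"]

# Made-up data: ten patients share one profile, one patient has a unique profile.
records = [{"year_of_birth": 1950, "gender": "F", "year_of_admission": 2007}] * 10
records = records + [{"year_of_birth": 1931, "gender": "M", "year_of_admission": 2007}]

risk = average_risk(records, qids)
print(f"average risk = {risk:.3f}; acceptable = {risk <= RISK_THRESHOLD}")
```

Under this definition the average equals the number of equivalence classes divided by the number of records. It can fall below the threshold even when an individual record is unique, so the choice of metric should fit the release context.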
Risk Level




Hierarchy




De-identified Data




Key Practical Considerations
  • Data warehouses: de-identification of data extracts
    instead of whole data warehouses results in higher
    quality de-identified data
  • Beware of correlated data: data in multiple medical
    domains are correlated, so one has to be cognizant
    of inference attacks on data
  • Automation: automation can detect outliers and
    perform selective suppression, which results in
    higher quality de-identified data
  • Transparency: important to ensure that methods
    have received peer and regulator scrutiny
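
The "selective suppression" under Automation might look like the sketch below. It blanks the quasi-identifiers only for records in small equivalence classes and leaves the rest untouched. This is a deliberately simple illustration with invented data, not the algorithm used by any particular tool; a real tool would also re-check the risk after suppression.

```python
from collections import Counter

def suppress_small_classes(records, quasi_identifiers, k):
    """Blank the quasi-identifiers of records in equivalence classes smaller than k."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    result = []
    for record, key in zip(records, keys):
        cleaned = dict(record)  # copy so the input is not modified
        if sizes[key] < k:
            for q in quasi_identifiers:
                cleaned[q] = None  # suppressed value
        result.append(cleaned)
    return result

# Made-up records: the third is an outlier on age and ZIP3.
records = [
    {"age": 55, "zip3": "112", "los": 3},
    {"age": 55, "zip3": "112", "los": 5},
    {"age": 97, "zip3": "134", "los": 40},
]
for row in suppress_small_classes(records, ["age", "zip3"], k=2):
    print(row)
```

Only the outlier loses detail, which is why targeted suppression tends to preserve more data quality than generalizing every record.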

Contact


     kelemam@ehealthinformation.ca

     @kelemam


     www.ehealthinformation.ca




Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Último (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Sharing Health Research Data

  • 1. Sharing Health Research Data: De-identification Methods & Experiences. Dr. Khaled El Emam, Electronic Health Information Laboratory
  • 2. Motivations for De-identification
      • Obtaining patient consent/authorization is not practical for large databases and introduces bias
      • Compliance with regulations / legislation
      • Contractual obligations
      • Maintaining public / consumer / client trust
      • Costs of breach notification
      Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 7. A Balance
  • 8. Definition of De-identified Data
      Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.
  • 9. Re-identification Attacks
      • Let's clear this issue up at the beginning
      • There are some claims that health data is easy to re-identify
      • Examples are often used to support that argument
      • The evidence does not support these claims
          – When data are de-identified properly, the probability of a successful re-identification attack is very small
      • Let's consider a few highly publicized examples
  • 10. AOL
      • AOL releases search queries, replacing usernames with pseudonyms
      • New York Times reporters re-identify one user, number 4417749
      • Her search terms: “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”
      • Thelma Arnold, a widow living in Lilburn, Ga.; she has three dogs
  • 11. AOL?
      • It is well known that a large percentage of individuals run ‘vanity’ searches that include their names
          – Thelma Arnold did
      • It is also known that location information can be determined from an individual's search queries
      • Search queries, even if the username is replaced with a pseudonym, cannot be considered de-identified
  • 12. Weld
      • Governor Weld of Massachusetts was unwell during a public appearance; the story was covered in the media
      • Semi-publicly available insurance claims data were matched with voter registration lists
      • It was possible to determine which claims records belonged to the Governor
  • 13. Weld?
      • This re-identification attack was done before HIPAA came into effect; the insurance claims data would not pass any of the HIPAA de-identification standards
      • A recent analysis indicated that Weld was likely re-identified because he was a famous person and there was already a lot of information about him in the media (his admission date, his diagnosis, his discharge date); the voter registration list was arguably not necessary
      • The success rate for such an attack would be lower for general members of the public because the voter registration list is incomplete
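The matching step described in slides 12 and 13 can be sketched as a join on shared quasi-identifiers. This is only an illustration: the field names (`dob`, `sex`, `zip`), the toy records, and the `link` helper are all hypothetical and are not taken from the original attack or from the slides.

```python
from collections import defaultdict

# Hypothetical toy data; none of these records are real.
claims = [  # names removed, quasi-identifiers retained
    {"dob": "1950-03-12", "sex": "M", "zip": "02138", "diagnosis": "A"},
    {"dob": "1962-11-02", "sex": "F", "zip": "02139", "diagnosis": "B"},
    {"dob": "1962-11-02", "sex": "F", "zip": "02139", "diagnosis": "C"},
]
voters = [  # identified list available to the adversary
    {"name": "Voter One", "dob": "1950-03-12", "sex": "M", "zip": "02138"},
    {"name": "Voter Two", "dob": "1962-11-02", "sex": "F", "zip": "02139"},
    {"name": "Voter Three", "dob": "1971-06-30", "sex": "M", "zip": "02140"},
]
KEYS = ("dob", "sex", "zip")


def link(claims, voters, keys=KEYS):
    """Return (voter name, claim) pairs that match one-to-one on the keys."""
    claims_by_key = defaultdict(list)
    for claim in claims:
        claims_by_key[tuple(claim[k] for k in keys)].append(claim)
    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[tuple(voter[k] for k in keys)].append(voter)

    matches = []
    for key, matched_voters in voters_by_key.items():
        matched_claims = claims_by_key.get(key, [])
        # Only an unambiguous one-to-one match counts as a candidate link.
        if len(matched_voters) == 1 and len(matched_claims) == 1:
            matches.append((matched_voters[0]["name"], matched_claims[0]))
    return matches


matches = link(claims, voters)
```

In this toy data only one voter links to exactly one claim; the second voter matches two claims and the third matches none. A voter who is missing from the list cannot be linked at all, which is the incompleteness point made on slide 13.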
  • 14. Netflix
      • Netflix publicly released movie ratings data in the context of a competition to develop a recommendation algorithm
      • Researchers re-identified a couple of records by matching with a publicly available and identifiable movie ratings database (IMDB)
      • This resulted in the cancellation of a second competition and in litigation against Netflix for exposing personal information
  • 15. Netflix?
      • The re-identifications were not actually verified by Netflix
      • The authors of the attack admit that the Netflix data was not de-identified (usernames were only replaced with pseudonyms)
      • The false positive rate of the matching was not evaluated (how many people in the IMDB database were actually in the Netflix database?)
  • 16. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071
  • 17. Attribute vs Identity Disclosure
      • Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
      • Identity disclosure: determine which record in the database belongs to a particular individual (for example, determining that record number 7 belongs to Bob Smith is identity disclosure)
      • HIPAA only cares about identity disclosure
  • 18. Attribute vs Identity Disclosure

                     NOT HPV Vaccinated   HPV Vaccinated
      Religion A              5                 40
      Religion B             40                  5

      • Statistically significant relationship (chi-square, p<0.05)
      • High risk of attribute disclosure
  • 20. Attribute vs Identity Disclosure

                     NOT HPV Vaccinated   HPV Vaccinated
      Religion A              5                  6
      Religion B              6                  5

      • After suppression
      • Not a statistically significant relationship (chi-square)
      • Low risk of attribute disclosure
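The chi-square claims on the two tables above can be checked with a short calculation. This is a minimal sketch that assumes a plain Pearson chi-square test without continuity correction (the slides do not say which variant was used); the function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: Religion A, Religion B. Columns: not vaccinated, vaccinated.
stat_before, p_before = chi_square_2x2([[5, 40], [40, 5]])
stat_after, p_after = chi_square_2x2([[5, 6], [6, 5]])
```

Under this assumption the original table gives a statistic of about 54.4, far above the 3.84 cutoff for p<0.05, while the suppressed table gives about 0.18, which is not significant.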
  • 21. Stigmatizing Analytics
  • 22. Definition of De-identified Data
      Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual
  • 23. Direct Identifiers
      • Fields that would uniquely identify individuals in a database
      • Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number
  • 24. Dealing with Direct Identifiers
      • Defensible approaches:
          – Remove those fields
          – Convert them to one-time or persistent pseudonyms
          – Randomize the values
      • These approaches will ensure, if done properly, that the probability of recovering the original value is very small
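The first two approaches on slide 24 (removal and pseudonyms) can be sketched with the Python standard library. The slides do not say how pseudonyms are generated, so the keyed hash (HMAC-SHA256) used here is an assumption, and the field names (`name`, `phone`, `mrn`) and helper names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret held only by the data custodian; without it the
# pseudonym cannot be recomputed from a guessed identifier.
SECRET_KEY = secrets.token_bytes(32)


def persistent_pseudonym(value, key=SECRET_KEY):
    """Same input and key always give the same pseudonym, so records stay linkable."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def one_time_pseudonym():
    """Fresh random value on every call; records cannot be linked across releases."""
    return secrets.token_hex(16)


def strip_direct_identifiers(record, remove=("name", "phone"), pseudonymize="mrn"):
    """Drop the listed fields and replace the linking field with a persistent pseudonym."""
    out = {k: v for k, v in record.items() if k not in remove}
    if pseudonymize in out:
        out[pseudonymize] = persistent_pseudonym(str(out[pseudonymize]))
    return out


record = {"name": "Example Person", "phone": "555-0100", "mrn": "MRN-0001", "age": 55}
released = strip_direct_identifiers(record)
```

A keyed hash is chosen over a plain hash because identifiers such as MRNs come from a small space that an adversary could enumerate; whether that matches the author's tooling is not stated in the slides.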
  • 25. Quasi-Identifiers
      • Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality
  • 33. Re-identification Risk Measurement
      • Risk measurement will depend on:
          – Granularity of quasi-identifiers
          – Region of the country we are talking about
          – Risk metric used (e.g., uniqueness or groups of 5)
          – Threshold for what is acceptable risk
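The two risk metrics named on slide 33 (uniqueness and groups of 5) can be computed from the sizes of the groups of records that share the same quasi-identifier values. The sketch below is an illustration on hypothetical toy data; the function and field names are assumptions, not part of the slides.

```python
from collections import Counter


def class_sizes(records, quasi_identifiers):
    """Count records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def risk_metrics(records, quasi_identifiers, k=5):
    sizes = class_sizes(records, quasi_identifiers)
    n = len(records)
    unique = sum(size for size in sizes.values() if size == 1)
    below_k = sum(size for size in sizes.values() if size < k)
    return {
        "pct_unique": 100.0 * unique / n,           # records that stand alone
        "pct_in_groups_below_k": 100.0 * below_k / n,  # records in groups smaller than k
        "max_risk": 1.0 / min(sizes.values()),      # worst case over all records
    }


# Hypothetical toy data set of eight records.
records = (
    [{"sex": "M", "age_band": "50-59"}] * 5
    + [{"sex": "F", "age_band": "50-59"}] * 2
    + [{"sex": "F", "age_band": "20-29"}]
)

fine = risk_metrics(records, ["sex", "age_band"])
coarse = risk_metrics(records, ["sex"])
```

Comparing `fine` and `coarse` shows the granularity effect from the slide: with both fields one record is unique, while with sex alone no record is unique and the worst-case risk drops to one in three.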
  • 34. De-identification Standards
      • The HIPAA Privacy Rule specifies two de-identification standards (45 CFR 164.514):
          – Safe Harbor
          – Statistical method (also known as the expert statistician method)
  • 35. HIPAA Safe Harbor
      Safe Harbor Direct Identifiers and Quasi-identifiers
       1. Names
       2. ZIP Codes (except first three)
       3. All elements of dates (except year)
       4. Telephone numbers
       5. Fax numbers
       6. Electronic mail addresses
       7. Social security numbers
       8. Medical record numbers
       9. Health plan beneficiary numbers
      10. Account numbers
      11. Certificate/license numbers
      12. Vehicle identifiers and serial numbers, including license plate numbers
      13. Device identifiers and serial numbers
      14. Web Universal Resource Locators (URLs)
      15. Internet Protocol (IP) address numbers
      16. Biometric identifiers, including finger and voice prints
      17. Full face photographic images and any comparable images
      18. Any other unique identifying number, characteristic, or code
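Items 2 and 3 of the Safe Harbor list are generalizations rather than removals, which a short sketch can make concrete. The field names below are hypothetical, and the code covers only what the slide states: the full rule text in 45 CFR 164.514 carries additional conditions not listed on the slide, so this is an illustration and not a compliance tool.

```python
from datetime import date

# Hypothetical field names standing in for some of the listed direct identifiers.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "fax", "email", "ssn", "mrn"}


def safe_harbor_sketch(record):
    """Drop listed identifier fields, keep ZIP3 only, reduce dates to the year."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS:
            continue                      # removed outright
        if field == "zip":
            out[field] = str(value)[:3]   # first three digits only
        elif isinstance(value, date):
            out[field] = value.year       # all date elements except year dropped
        else:
            out[field] = value
    return out


record = {
    "name": "Example Person",
    "zip": "12345",
    "admission": date(2007, 5, 17),
    "sex": "F",
}
result = safe_harbor_sketch(record)
```

The output keeps `zip`, `admission`, and `sex` in coarsened form, which is exactly the kind of residual quasi-identifier set that the later slides on high-risk Safe Harbor data examine.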
  • 37. Two Problems with Safe Harbor
      • May be removing too much information on the ZIP Code and date fields; these fields are useful for many analytical purposes
      • Does not provide adequate protection; it is easy to have a Safe Harbor compliant data set with a high risk of re-identification
  • 38. High Risk Safe Harbor Data - I
      • If the adversary knows that Bob, a 55 year old male, is in the database, only the first record matches him

      Gender   Age   ZIP   Lab Test
      M        55    112   Albumin, Serum
      F        53    114   Alkaline Phosphatase
      M        24    134   Creatine Kinase

  • 39. High Risk Safe Harbor Data - II
      • 2.24m visits, 1.6m patients, NY discharge data for 2007
      • Compliant with Safe Harbor

      Fields                                        % of patients unique
      age, gender, ZIP3                             2.54%
      age, gender, ZIP3, LOS (length of stay)       21.49%
  • 40. Statistical Method Conditions
      • A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
          I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
          II. Documents the methods and results of the analysis that justify such determination
  • 41. Re-identification Risk Spectrum
  • 42-47. Overall Risk
  • 48. Managing Re-identification Risk
  • 49. Different Types of Data Releases
      • The same data set can be disclosed with different thresholds:
          – Public data set
          – Release with conditions for known data recipients, including the requirement to sign a data sharing agreement, a prohibition on re-identification, and a requirement to pass these conditions to all sub-contractors
          – The more conditions imposed, the higher the quality of the released data set
  • 50. Example: CA Hospital Discharges
      • Context: data release to a researcher who will sign a data use agreement and has good practices for managing sensitive health information
      • There were ~2.1m patients who had ~3m visits
      • Risk threshold = 0.2; use average risk across all patients
      • Variables:
          – Year of birth
          – Gender
          – Year of admission
          – Days since last visit
          – Length of stay
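The threshold check in this example can be sketched as a loop that coarsens a variable until the average risk falls to 0.2 or below. This assumes one common reading of average risk, namely the mean over records of 1 divided by the size of the record's group, which equals the number of groups divided by the number of records. The toy data, the band widths, and the function names are hypothetical and do not reproduce the California analysis.

```python
from collections import Counter


def average_risk(records, quasi_identifiers):
    """Mean of 1/group size over all records = number of groups / number of records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return len(groups) / len(records)


def generalize_year(year, width):
    """Map a year to a band of the given width, e.g. 1953 -> '1950-1954'."""
    if width == 1:
        return year
    start = year - year % width
    return f"{start}-{start + width - 1}"


def generalize_until_acceptable(records, quasi_identifiers, threshold=0.2,
                                widths=(1, 5, 10, 20)):
    """Try coarser year-of-birth bands until the average risk meets the threshold."""
    risk = None
    for width in widths:
        candidate = [
            {**r, "year_of_birth": generalize_year(r["year_of_birth"], width)}
            for r in records
        ]
        risk = average_risk(candidate, quasi_identifiers)
        if risk <= threshold:
            return candidate, width, risk
    return None, None, risk  # no level in the hierarchy was coarse enough


# Hypothetical toy data: 30 records, three per birth year from 1950 to 1959.
records = [
    {"year_of_birth": 1950 + i % 10, "gender": "M" if i % 2 == 0 else "F"}
    for i in range(30)
]
qis = ["year_of_birth", "gender"]

raw_risk = average_risk(records, qis)
released, width, risk = generalize_until_acceptable(records, qis)
```

With exact birth years the toy data has an average risk of one third, above the threshold; five-year bands bring it to 4/30, so the loop stops at the least coarse level that is acceptable and preserves as much detail as the threshold allows.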
  • 51. Risk Level
  • 52. Hierarchy
  • 53. De-identified Data
  • 54. Key Practical Considerations
      • Data warehouses: de-identification of data extracts instead of whole data warehouses results in higher quality de-identified data
      • Beware of correlated data: data in multiple medical domains are correlated, so one has to be cognizant of inference attacks on data
      • Automation: automation can detect outliers and perform selective suppression, which results in higher quality de-identified data
      • Transparency: important to ensure that methods have received peer and regulator scrutiny
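The automation point above mentions detecting outliers and suppressing selectively. One simple reading is sketched here: records whose quasi-identifier combination is shared by fewer than k records are treated as outliers, and only their quasi-identifier values are masked. The choice of k, the mask value, and the function name are assumptions for illustration.

```python
from collections import Counter


def suppress_small_groups(records, quasi_identifiers, k=5, mask="*"):
    """Mask quasi-identifiers only for records in groups smaller than k."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in records)
    released = []
    for r in records:
        if sizes[key(r)] < k:
            # Outlier: keep the non-identifying fields, hide the risky ones.
            r = {**r, **{q: mask for q in quasi_identifiers}}
        released.append(r)
    return released


# Hypothetical toy data: six similar records and one outlier.
records = (
    [{"sex": "M", "year_of_birth": 1950, "los": 3}] * 6
    + [{"sex": "F", "year_of_birth": 1931, "los": 12}]
)
released = suppress_small_groups(records, ["sex", "year_of_birth"])
```

Only the outlier is altered, so the six common records keep full detail. A real tool would also need to re-check the masked records, since a lone suppressed record still forms a small group of its own.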
  • 55. Contact: kelemam@ehealthinformation.ca, @kelemam, www.ehealthinformation.ca