SHARING HEALTH RESEARCH DATA

De-identification
METHODS & EXPERIENCES




                                 Dr. Khaled El Emam
                Electronic Health Information Laboratory
Motivations for De-identification
  • Obtaining patient consent/authorization is not practical
    for large databases and introduces bias
  • Compliance with regulations / legislation
  • Contractual obligations
  • Maintaining public / consumer / client trust
  • Costs of breach notification

Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
A Balance




Definition of De-identified Data

    Health information that does not identify an individual and
    with respect to which there is no reasonable basis to believe
    that the information can be used to identify an individual is
    not individually identifiable health information.




Re-identification Attacks
  • Let us clear up this issue at the outset
  • There are some claims that health data is easy to re-identify
  • Examples are often used to support that argument
  • The evidence does not support these claims
     – When data are de-identified properly, the
       probability of a successful re-identification attack
       is very small
  • Let's consider a few highly publicized examples




AOL
  • AOL released search queries, replacing usernames with
    pseudonyms
  • New York Times reporters re-identified one user, No. 4417749
  • Her search terms: “tea for good health”, “numb fingers”,
    “hand tremors”, “dry mouth”, “60 single men”, “dog that
    urinates on everything”, “landscapers in Lilburn, Ga”,
    “homes sold in shadow lake subdivision gwinnett county
    georgia”
  • The user was Thelma Arnold, a widow living in Lilburn, Ga.;
    she has three dogs


AOL?
  • It is well known that a large percentage of
    individuals run 'vanity' searches that include their
    names – Thelma Arnold did
  • It is also known that location information can be
    determined from an individual's search queries
  • Search queries, even if the username is replaced
    with a pseudonym, cannot be considered de-identified




Weld
  •      Governor Weld of Massachusetts
         was unwell during a public
         appearance – the story was
         covered in the media
  •      Semi-publicly available insurance
         claims data matched with voter
         registration lists
  •      It was possible to determine which
         claims records belonged to the
         Governor




Weld?
  • This re-identification attack was done before HIPAA came into
    effect – the insurance claims data would not pass any of the
    HIPAA de-identification standards
  • A recent analysis indicated that Weld was likely re-identified
    because he was a famous person and there was already a lot
    of information about him in the media (his admission date, his
    diagnosis, his discharge date) – the voter registration list was
    arguably not necessary
  • The success rate for such an attack would be lower for general
    members of the public because the voter registration list is
    incomplete




Netflix
  • Netflix publicly released movie ratings data in the
    context of a competition to develop a recommendation
    algorithm
  • Researchers re-identified a couple of records by matching
    with a publicly available and identifiable movie ratings
    database (IMDb)
  • This resulted in the cancellation of a second competition
    and in litigation against Netflix for exposing personal
    information




Netflix?
  • The re-identifications were not actually verified by
    Netflix
  • The authors of the attack admit that the Netflix data was
    not de-identified (usernames were only replaced with
    pseudonyms)
  • The false positive rate of the matching was not
    evaluated (how many people in the IMDb database
    were actually in the Netflix database?)




http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071




Attribute vs Identity Disclosure
  • Attribute disclosure: discover something
    new about an individual in the database
    without knowing which record belongs to that
    individual
  • Identity disclosure: determine which record
    in the database belongs to a particular
    individual (for example, determine that
    record number 7 belongs to Bob Smith – that
    is identity disclosure)
  • HIPAA only cares about identity disclosure

Attribute vs Identity Disclosure

                     HPV Vaccinated     NOT HPV Vaccinated
     Religion A             5                   40
     Religion B            40                    5

  • Statistically significant relationship (chi-square, p<0.05)
  • High risk of attribute disclosure

Attribute vs Identity Disclosure

                     HPV Vaccinated     NOT HPV Vaccinated
     Religion A             5                    6
     Religion B             6                    5

  • After suppression
  • Not statistically significant relationship (chi-square)
  • Low risk of attribute disclosure
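
The two tables above can be checked by hand. The sketch below computes the Pearson chi-square statistic for each 2x2 table and compares it with 3.841, the critical value for p = 0.05 with one degree of freedom. The function name and the choice to omit a continuity correction are assumptions for illustration; the slides do not say which variant of the test was used.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of the chi-square distribution, 1 degree of freedom, p = 0.05.
CRITICAL_P05_DF1 = 3.841

# Rows: Religion A, Religion B. Columns: HPV vaccinated, not HPV vaccinated.
before = [[5, 40], [40, 5]]
after = [[5, 6], [6, 5]]

for label, table in (("before suppression", before), ("after suppression", after)):
    stat = chi_square_2x2(table)
    print(f"{label}: chi-square = {stat:.2f}, significant = {stat > CRITICAL_P05_DF1}")
```

Before suppression the statistic is far above the critical value, so knowing someone's religion says a lot about their vaccination status; after suppression it falls well below it.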


Stigmatizing Analytics




Definition of De-identified Data


    Health information that does not identify an individual and
    with respect to which there is no reasonable basis to believe
    that the information can be used to identify an individual




Direct Identifiers
  • Fields that would uniquely identify individuals
    in a database
  • Name, address, telephone number, fax
    number, MRN, health card number, health
    plan beneficiary number, license plate
    number, email address, photograph,
    biometrics, SSN, SIN, implanted device
    number



Dealing with Direct Identifiers
  • Defensible approaches:
            – Remove those fields
            – Convert them to one-time or persistent
              pseudonyms
            – Randomize the values
  • These approaches will ensure, if done
    properly, that the probability of recovering
    the original value is very small
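
As a minimal sketch of the pseudonym options above: a persistent pseudonym can be derived with a keyed hash (HMAC), so the same identifier always maps to the same token, while a one-time pseudonym is simply a fresh random token. The function names, the 16-character truncation, and the sample MRN are assumptions made for illustration; the slides do not prescribe a particular construction.

```python
import hashlib
import hmac
import secrets


def persistent_pseudonym(value: str, key: bytes) -> str:
    """Same value and key always give the same pseudonym, so records stay linkable."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an assumption, not a requirement


def one_time_pseudonym() -> str:
    """A fresh random token that has no relationship to the original value."""
    return secrets.token_hex(8)


# Hypothetical secret key; it would have to be kept apart from the released data.
key = secrets.token_bytes(32)
mrn = "MRN-0012345"  # made-up medical record number

print(persistent_pseudonym(mrn, key))
print(one_time_pseudonym())
```

Without the key, recovering the original value from a persistent pseudonym requires guessing the key, which is what keeps the probability of recovery very small.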



Quasi-Identifiers
  • sex, date of birth or age, geographic locations (such
    as postal codes, census geography, information
    about proximity to known or unique landmarks),
    language spoken at home, ethnic origin, aboriginal
    identity, total years of schooling, marital status,
    criminal history, total income, visible minority status,
    activity difficulties/reductions, profession, event
    dates (such as admission, discharge, procedure,
    death, specimen collection, visit/encounter), codes
    (such as diagnosis codes, procedure codes, and
    adverse event codes), country of birth, birth weight,
    and birth plurality

Re-identification Risk Measurement
  • Risk measurement will depend on:
            –     Granularity of the quasi-identifiers
            –     Region of the country under consideration
            –     Risk metric used (e.g., uniqueness or groups of 5)
            –     Threshold for what is acceptable risk
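
To make the first and third factors concrete, the sketch below counts how many records share each combination of quasi-identifier values and reports two metrics: the proportion of unique records and the proportion in groups smaller than 5. The records, field names, and function names are made up for illustration.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, the number of records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def proportion_unique(sizes):
    """Share of records that are alone in their group."""
    return sum(1 for s in sizes if s == 1) / len(sizes)


def proportion_in_small_groups(sizes, k=5):
    """Share of records in groups with fewer than k members."""
    return sum(1 for s in sizes if s < k) / len(sizes)


# Made-up records for illustration only.
records = [
    {"sex": "M", "age": 55, "zip3": "112"},
    {"sex": "M", "age": 57, "zip3": "112"},
    {"sex": "F", "age": 53, "zip3": "114"},
    {"sex": "F", "age": 58, "zip3": "114"},
    {"sex": "M", "age": 24, "zip3": "134"},
    {"sex": "M", "age": 24, "zip3": "134"},
]
fields = ("sex", "age", "zip3")

fine = equivalence_class_sizes(records, fields)

# Coarser granularity: replace exact age with a 10-year band.
banded = [dict(r, age=r["age"] // 10 * 10) for r in records]
coarse = equivalence_class_sizes(banded, fields)

print("unique, exact age:   ", proportion_unique(fine))
print("unique, 10-year band:", proportion_unique(coarse))
print("in groups < 5:       ", proportion_in_small_groups(coarse, k=5))
```

Banding age removes every unique record in this toy data, yet all records still sit in groups smaller than 5, so the same data set can pass under one metric and fail under another.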




De-identification Standards
  • The HIPAA Privacy Rule specifies two de-
    identification standards (45 CFR 164.514):
            – Safe Harbor
            – Statistical method (also known as the expert
              statistician method)




HIPAA Safe Harbor
  Safe Harbor Direct Identifiers and Quasi-identifiers
   1. Names
   2. ZIP Codes (except first three)
   3. All elements of dates (except year)
   4. Telephone numbers
   5. Fax numbers
   6. Electronic mail addresses
   7. Social security numbers
   8. Medical record numbers
   9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license
      plate numbers
  13. Device identifiers and serial numbers
  14. Web Universal Resource Locators (URLs)
  15. Internet Protocol (IP) address numbers
  16. Biometric identifiers, including finger and voice prints
  17. Full face photographic images and any comparable images
  18. Any other unique identifying number, characteristic, or
      code
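
A rough sketch of how items 1 to 3 of the list might be applied to a single record: drop the direct identifier fields, keep only the first three ZIP digits, and reduce dates to the year. The field names and the function are hypothetical, and the sketch covers only what the slide lists; the rule's further conditions (for example, sparsely populated three-digit ZIP areas and ages over 89) are not handled here.

```python
from datetime import date

# Hypothetical column names for fields that would be removed outright.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "fax", "email", "ssn", "mrn"}


def apply_safe_harbor_sketch(record: dict) -> dict:
    """Drop direct identifiers, truncate ZIP to 3 digits, keep only the year of dates."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}
    if "zip" in out:
        out["zip"] = str(out["zip"])[:3]
    for field, value in list(out.items()):
        if isinstance(value, date):
            out[field] = value.year
    return out


# Made-up record for illustration only.
record = {
    "name": "Bob Smith",
    "mrn": "MRN-0012345",
    "sex": "M",
    "zip": "11201",
    "admission_date": date(2007, 3, 14),
}
print(apply_safe_harbor_sketch(record))
```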




Two Problems with Safe Harbor
  • May be removing too much information on
    the ZIP Code and date fields – these fields
    are useful for many analytical purposes
  • Does not provide adequate protection – it is
    easy to have a Safe Harbor compliant data
    set with a high risk of re-identification




High Risk Safe Harbor Data - I
  • If the adversary knows that Bob, a 55-year-old male, is in
    the database, only the first record matches, so its lab test
    must be his

      Gender     Age     ZIP     Lab Test
      M          55      112     Albumin, Serum
      F          53      114     Alkaline Phosphatase
      M          24      134     Creatine Kinase
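
The attack on this table amounts to filtering on what the adversary already knows. The sketch below uses the three rows from the slide; the function name and the rule "probability of a correct match = 1 / number of matching records" are illustrative assumptions.

```python
# The three rows from the slide; all fields are permitted under Safe Harbor.
table = [
    {"gender": "M", "age": 55, "zip3": "112", "lab_test": "Albumin, Serum"},
    {"gender": "F", "age": 53, "zip3": "114", "lab_test": "Alkaline Phosphatase"},
    {"gender": "M", "age": 24, "zip3": "134", "lab_test": "Creatine Kinase"},
]


def matching_records(rows, known):
    """Rows consistent with everything the adversary knows about the target."""
    return [r for r in rows if all(r[k] == v for k, v in known.items())]


# Background knowledge: Bob is male, 55, and known to be in the database.
matches = matching_records(table, {"gender": "M", "age": 55})
probability = 1 / len(matches) if matches else 0.0

print(f"{len(matches)} matching record(s), match probability = {probability}")
for row in matches:
    print("lab test learned:", row["lab_test"])
```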



High Risk Safe Harbor Data - II
  • 2.24m visits, 1.6m patients, NY discharge
    data for 2007
  • Compliant with Safe Harbor

        Fields                                       % of patients unique
        age, gender, ZIP3                                    2.54%
        age, gender, ZIP3, LOS (length of stay)             21.49%




Statistical Method Conditions
  • A person with appropriate knowledge of and
    experience with generally accepted statistical and
    scientific principles and methods for rendering
    information not individually identifiable:
            I.  Applying such principles and methods, determines that
                the risk is very small that the information could be used,
                alone or in combination with other reasonably available
                information, by an anticipated recipient to identify an
                individual who is a subject of the information; and
            II. Documents the methods and results of the analysis that
                justify such determination




Re-identification Risk Spectrum




Overall Risk




Managing Re-identification Risk




Different Types of Data Releases
  • The same data set can be disclosed
    with different thresholds:
            – Public data set
            – Release with conditions for known data
              recipients, including the requirement to sign a
              data sharing agreement, a prohibition on
              re-identification, and a requirement to pass these
              conditions on to all sub-contractors
  • The more conditions imposed on the recipient, the
    higher the quality of the released data set


Example – CA Hospital Discharges
  • Context: data release to a researcher who will sign a
    data use agreement and follows good practices for
    managing sensitive health information
  • There were ~2.1m patients who had ~3m visits
  • Risk threshold = 0.2, using average risk across all
    patients
  • Variables:
            –     Year of birth
            –     Gender
            –     Year of admission
            –     Days since last visit
            –     Length of stay
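
The slides do not spell out how average risk is computed. One common definition, assumed here, gives each record a risk of 1 / (size of its equivalence class) and averages over all records, which simplifies to the number of distinct classes divided by the number of records. The sketch below applies that to made-up records with the five variables listed above and compares the result with the 0.2 threshold.

```python
from collections import Counter

QUASI_IDENTIFIERS = (
    "year_of_birth",
    "gender",
    "year_of_admission",
    "days_since_last_visit",
    "length_of_stay",
)
RISK_THRESHOLD = 0.2  # threshold stated on the slide


def average_risk(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Mean over records of 1 / class size (assumed definition of average risk)."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # Each class of size f contributes f * (1 / f) = 1 to the sum,
    # so the mean equals (number of classes) / (number of records).
    return len(classes) / len(records)


def make_record(year_of_birth, gender, year_of_admission, gap, los):
    """Build one record; all values used below are invented for illustration."""
    return {
        "year_of_birth": year_of_birth,
        "gender": gender,
        "year_of_admission": year_of_admission,
        "days_since_last_visit": gap,
        "length_of_stay": los,
    }


records = (
    [make_record(1950, "F", 2007, 30, 2)] * 6
    + [make_record(1950, "M", 2007, 30, 2)] * 4
    + [make_record(1982, "F", 2007, 0, 1)] * 10
)

risk = average_risk(records)
print(f"average risk = {risk:.2f}, acceptable = {risk <= RISK_THRESHOLD}")
```

If the average came out above the threshold, the usual response would be to generalize or suppress values and measure again.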
Risk Level




Hierarchy




De-identified Data




Key Practical Considerations
  • Data warehouses: de-identification of data extracts
    instead of whole data warehouses results in higher
    quality de-identified data
  • Beware of correlated data: data in multiple medical
    domains are correlated, so one has to be cognizant
    of inference attacks on data
  • Automation: automation can detect outliers and
    perform selective suppression, which results in
    higher quality de-identified data
  • Transparency: important to ensure that methods
    have received peer and regulator scrutiny

Contact


     kelemam@ehealthinformation.ca

     @kelemam


     www.ehealthinformation.ca




 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Sharing Health Research Data

  • 1. Sharing Health Research Data: De-identification Methods & Experiences. Dr. Khaled El Emam, Electronic Health Information Laboratory
  • 2. Motivations for De-identification
      • Obtaining patient consent/authorization is not practical for large databases and introduces bias
      • Compliance with regulations / legislation
      • Contractual obligations
      • Maintain public / consumer / client trust
      • Costs of breach notification
      Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
  • 7. A Balance
  • 8. Definition of De-identified Data
      Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.
  • 9. Re-identification Attacks
      • To address this issue at the outset
      • There are some claims that health data are easy to re-identify
      • Examples are often cited to support that argument
      • The evidence does not support these claims
          – When data are de-identified properly, the probability of a successful re-identification attack is very small
      • Let's consider a few highly publicized examples
  • 10. AOL
      • AOL released search queries, replacing usernames with pseudonyms
      • New York Times reporters re-identified one user, No. 4417749
      • Her search terms: “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, Ga”, “homes sold in shadow lake subdivision gwinnett county georgia”
      • Thelma Arnold, a widow living in Lilburn, Ga.; she has three dogs
  • 11. AOL?
      • It is well known that a large percentage of individuals run 'vanity' searches that include their names; Thelma Arnold did
      • It is also known that location information can be determined from an individual's search queries
      • Search queries, even if the username is replaced with a pseudonym, cannot be considered de-identified
  • 12. Weld
      • Governor Weld of Massachusetts was unwell during a public appearance; the story was covered in the media
      • Semi-publicly available insurance claims data were matched with voter registration lists
      • It was possible to determine which claims records belonged to the Governor
  • 13. Weld?
      • This re-identification attack was done before HIPAA came into effect; the insurance claims data would not pass any of the HIPAA de-identification standards
      • A recent analysis indicated that Weld was likely re-identified because he was a famous person and there was already a lot of information about him in the media (his admission date, his diagnosis, his discharge date); the voter registration list was arguably not necessary
      • The success rate for such an attack would be lower for general members of the public because the voter registration list is incomplete
  • 14. Netflix
      • Netflix publicly released movie ratings data in the context of a competition to develop a recommendation algorithm
      • Researchers re-identified a couple of records by matching with a publicly available and identifiable movie ratings database (IMDB)
      • This resulted in the cancellation of a second competition and in litigation against Netflix for exposing personal information
  • 15. Netflix?
      • The re-identifications were not actually verified by Netflix
      • The authors of the attack admit that the Netflix data were not de-identified (usernames were only replaced with pseudonyms)
      • The false positive rate of the matching was not evaluated (how many people in the IMDB database were actually in the Netflix database?)
  • 16. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028071
  • 17. Attribute vs Identity Disclosure
      • Attribute disclosure: discover something new about an individual in the database without knowing which record belongs to that individual
      • Identity disclosure: determine which record in the database belongs to a particular individual (for example, determining that record number 7 belongs to Bob Smith is identity disclosure)
      • HIPAA only cares about identity disclosure
  • 18. Attribute vs Identity Disclosure

                     NOT HPV Vaccinated   HPV Vaccinated
      Religion A              5                 40
      Religion B             40                  5

      • Statistically significant relationship (chi-square, p<0.05)
      • High risk of attribute disclosure
  • 20. Attribute vs Identity Disclosure

                     NOT HPV Vaccinated   HPV Vaccinated
      Religion A              5                  6
      Religion B              6                  5

      • After suppression
      • Not statistically significant relationship (chi-square)
      • Low risk of attribute disclosure
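The two HPV vaccination tables can be checked with a short calculation. The sketch below is illustrative only: it assumes a plain Pearson chi-square test without continuity correction (the slides do not say which variant was used), and the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: Religion A, Religion B. Columns: not vaccinated, vaccinated.
before = [[5, 40], [40, 5]]   # original counts
after = [[5, 6], [6, 5]]      # counts after suppression

stat_before, p_before = chi_square_2x2(before)
stat_after, p_after = chi_square_2x2(after)
print(f"before suppression: chi2={stat_before:.2f}, p={p_before:.3g}")
print(f"after suppression:  chi2={stat_after:.2f}, p={p_after:.3g}")
```

Under these assumptions the original table is significant at p<0.05 and the suppressed table is not, which matches the labels on the slides.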
  • 21. Stigmatizing Analytics
  • 22. Definition of De-identified Data
      Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual
  • 23. Direct Identifiers
      • Fields that would uniquely identify individuals in a database
      • Name, address, telephone number, fax number, MRN, health card number, health plan beneficiary number, license plate number, email address, photograph, biometrics, SSN, SIN, implanted device number
  • 24. Dealing with Direct Identifiers
      • Defensible approaches:
          – Remove those fields
          – Convert them to one-time or persistent pseudonyms
          – Randomize the values
      • These approaches will ensure, if done properly, that the probability of recovering the original value is very small
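As a rough illustration of the pseudonym options for direct identifiers, the sketch below derives a persistent pseudonym with a keyed hash (HMAC-SHA256) and a one-time pseudonym from a random token. The slides do not name a specific technique, so the keyed hash, the function names, and the sample record number are all assumptions made for this example.

```python
import hashlib
import hmac
import secrets

def persistent_pseudonym(value: str, key: bytes) -> str:
    """Same value and key always yield the same pseudonym, so records can still be linked."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def one_time_pseudonym() -> str:
    """A fresh random token that carries no information about the original value."""
    return secrets.token_hex(16)

# Hypothetical secret key; it has to be kept apart from the released data.
key = secrets.token_bytes(32)
mrn = "MRN-0012345"  # made-up medical record number

p1 = persistent_pseudonym(mrn, key)
p2 = persistent_pseudonym(mrn, key)
print(p1 == p2)               # the same patient maps to the same pseudonym
print(one_time_pseudonym())   # different on every call
```

Without the key, recovering the original value from a persistent pseudonym is impractical, which is the property the slide asks for. A one-time pseudonym gives up linkage across releases in exchange for not needing a key at all.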
  • 25. Quasi-Identifiers
      • Sex, date of birth or age, geographic locations (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality
  • 33. Re-identification Risk Measurement
      • Risk measurement will depend on:
          – Granularity of quasi-identifiers
          – Region of the country we are talking about
          – Risk metric used (e.g., uniqueness or groups of 5)
          – Threshold for what is acceptable risk
  • 34. De-identification Standards
      • The HIPAA Privacy Rule specifies two de-identification standards (45 CFR 164.514):
          – Safe Harbor
          – Statistical method (also known as the expert statistician method)
  • 35. HIPAA Safe Harbor
      Safe Harbor Direct Identifiers and Quasi-identifiers
       1. Names
       2. ZIP Codes (except first three)
       3. All elements of dates (except year)
       4. Telephone numbers
       5. Fax numbers
       6. Electronic mail addresses
       7. Social security numbers
       8. Medical record numbers
       9. Health plan beneficiary numbers
      10. Account numbers
      11. Certificate/license numbers
      12. Vehicle identifiers and serial numbers, including license plate numbers
      13. Device identifiers and serial numbers
      14. Web Universal Resource Locators (URLs)
      15. Internet Protocol (IP) address numbers
      16. Biometric identifiers, including finger and voice prints
      17. Full face photographic images and any comparable images
      18. Any other unique identifying number, characteristic, or code
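Items 2 and 3 of the Safe Harbor list are generalizations rather than removals: a ZIP code is cut to its first three digits and a date to its year. A minimal sketch follows, with hypothetical function and field names. It covers only what the slide states; the full rule in 45 CFR 164.514 attaches further conditions to these two items that are not modeled here.

```python
from datetime import date

def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits of a ZIP code."""
    return zip_code[:3]

def generalize_date(d: date) -> int:
    """Keep only the year of a date."""
    return d.year

# Made-up record used only for illustration.
record = {"zip": "11201", "admission": date(2007, 3, 14)}

generalized = {
    "zip3": generalize_zip(record["zip"]),
    "admission_year": generalize_date(record["admission"]),
}
print(generalized)
```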
  • 37. Two Problems with Safe Harbor
      • May be removing too much information on the ZIP Code and date fields; these fields are useful for many analytical purposes
      • Does not provide adequate protection: it is easy to have a Safe Harbor compliant data set with a high risk of re-identification
  • 38. High Risk Safe Harbor Data - I
      • If the adversary knows that Bob, a 55-year-old male, is in the database, the first record must be his, because it is the only one for a 55-year-old male

      Gender   Age   ZIP   Lab Test
      M        55    112   Albumin, Serum
      F        53    114   Alkaline Phosphatase
      M        24    134   Creatine Kinase
  • 39. High Risk Safe Harbor Data - II
      • 2.24m visits, 1.6m patients, NY discharge data for 2007
      • Compliant with Safe Harbor

      Fields                                          % of patients unique
      age, gender, ZIP3                               2.54%
      age, gender, ZIP3, LOS (length of stay)         21.49%
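The uniqueness figures in the New York example come from counting how many patients have a combination of quasi-identifier values that nobody else shares. The sketch below shows that count on a made-up four-patient table; the function name `percent_unique` and the data are assumptions, and the toy percentages are not comparable to the real ones.

```python
from collections import Counter

def percent_unique(records, quasi_identifiers):
    """Percentage of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    unique = sum(1 for k in keys if sizes[k] == 1)
    return 100.0 * unique / len(keys)

# Made-up data, one row per patient.
records = [
    {"age": 55, "gender": "M", "zip3": "112", "los": 2},
    {"age": 55, "gender": "M", "zip3": "112", "los": 9},
    {"age": 53, "gender": "F", "zip3": "114", "los": 1},
    {"age": 24, "gender": "M", "zip3": "134", "los": 4},
]

without_los = percent_unique(records, ["age", "gender", "zip3"])
with_los = percent_unique(records, ["age", "gender", "zip3", "los"])
print(without_los, with_los)
```

Adding length of stay splits the two 55-year-old males into separate groups, so uniqueness rises. The real data show the same effect when LOS is added to age, gender, and ZIP3.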
  • 40. Statistical Method Conditions
      • A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
          I. Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
          II. Documents the methods and results of the analysis that justify such determination
  • 41. Re-identification Risk Spectrum
  • 42. Overall Risk
  • 43. Overall Risk
  • 44. Overall Risk
  • 45. Overall Risk
  • 46. Overall Risk
  • 47. Overall Risk
  • 48. Managing Re-identification Risk
  • 49. Different Types of Data Releases
      • The same data set can be disclosed with different thresholds:
          – Public data set
          – Release with conditions for known data recipients, including the requirement to sign a data sharing agreement, a prohibition on re-identification, and a requirement to pass these conditions to all sub-contractors
          – The more conditions imposed on the recipient, the higher the quality of the data set that can be released
  • 50. Example: CA Hospital Discharges
      • Context: data release to a researcher who will sign a data use agreement and has good practices for managing sensitive health information
      • There were ~2.1m patients who had ~3m visits
      • Risk threshold = 0.2; use average risk across all patients
      • Variables:
          – Year of birth
          – Gender
          – Year of admission
          – Days since last visit
          – Length of stay
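One way to read "average risk across all patients" is to give each patient a risk of 1 divided by the number of patients sharing the same quasi-identifier values, and then average. The slides do not spell out the formula, so this definition, the function name, and the toy data below are assumptions. Only the 0.2 threshold and the variable names come from the example.

```python
from collections import Counter

RISK_THRESHOLD = 0.2  # threshold quoted in the example

def average_risk(records, quasi_identifiers):
    """Mean of 1/(group size) over all records.

    Each group of size f contributes f * (1/f) = 1 to the sum, so the mean
    equals the number of distinct groups divided by the number of records.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return len(sizes) / len(keys)

# Made-up data: two groups of five patients each.
records = (
    [{"year_of_birth": 1950, "gender": "F", "year_of_admission": 2007, "length_of_stay": i}
     for i in range(5)]
    + [{"year_of_birth": 1962, "gender": "M", "year_of_admission": 2007, "length_of_stay": i + 5}
       for i in range(5)]
)

coarse = average_risk(records, ["year_of_birth", "gender", "year_of_admission"])
fine = average_risk(records, coarse_fields := ["year_of_birth", "gender", "year_of_admission"]) \
    if False else average_risk(records, ["year_of_birth", "gender", "year_of_admission", "length_of_stay"])
print(coarse, coarse <= RISK_THRESHOLD)
print(fine, fine <= RISK_THRESHOLD)
```

Under this reading, a threshold of 0.2 corresponds to an average group size of five. In the toy data the coarse variables meet the threshold, while adding the distinct length-of-stay values makes every patient unique and pushes the average risk to 1.0.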
  • 51. Risk Level
  • 52. Hierarchy
  • 53. De-identified Data
  • 54. Key Practical Considerations
      • Data warehouses: de-identification of data extracts instead of whole data warehouses results in higher quality de-identified data
      • Beware of correlated data: data in multiple medical domains are correlated, so one has to be cognizant of inference attacks on data
      • Automation: automation can detect outliers and perform selective suppression, which results in higher quality de-identified data
      • Transparency: important to ensure that methods have received peer and regulator scrutiny
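The automation point can be illustrated with a small sketch of selective suppression: only records that fall in small groups have their quasi-identifiers blanked, and everything else is left untouched. This is not the algorithm used by the author's tools, which the slides do not describe. The function name, the group-size parameter `k`, and the data are hypothetical.

```python
from collections import Counter

def suppress_small_groups(records, quasi_identifiers, k):
    """Blank the quasi-identifiers of records whose group has fewer than k members."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    result = []
    for record, key in zip(records, keys):
        out = dict(record)  # copy so the input records are not modified
        if sizes[key] < k:
            for q in quasi_identifiers:
                out[q] = None
        result.append(out)
    return result

# Made-up data: the third patient is an outlier on age and gender.
records = [
    {"age": 55, "gender": "M", "lab_test": "Albumin, Serum"},
    {"age": 55, "gender": "M", "lab_test": "Creatine Kinase"},
    {"age": 24, "gender": "F", "lab_test": "Alkaline Phosphatase"},
]

released = suppress_small_groups(records, ["age", "gender"], k=2)
for row in released:
    print(row)
```

Only the outlier loses its age and gender; the other two rows keep full detail, which is why selective suppression tends to preserve more data quality than coarsening every record. A real tool would also need to check that the suppressed records themselves do not form a small, recognizable group.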
  • 55. Contact: kelemam@ehealthinformation.ca, @kelemam, www.ehealthinformation.ca