Talk held at Medical Informatics Europe (MIE) 2016.
Abstract: Data sharing plays an important role in modern biomedical research. Due to the inherent sensitivity of health data, patient privacy must be protected. De-identification means to transform a dataset in such a way that it becomes extremely difficult for an attacker to link its records to identified individuals. This can be achieved with different types of data transformations. As transformation impacts the information content of a dataset, it is important to balance an increase in privacy with a decrease in data quality. To this end, models for measuring both aspects are needed. Non-Uniform Entropy is a model for data quality which is frequently recommended for de-identifying health data. In this work we show that it cannot be used in a meaningful way for measuring the quality of data which has been transformed with several important types of data transformation. We introduce a generic variant, which overcomes this limitation. We performed experiments with real-world datasets, which show that our method provides a unified framework in which the quality of differently transformed data can be compared to find a good or even optimal solution to a given data de-identification problem. We have implemented our method into ARX, an open source anonymization tool for biomedical data.
Website with further information: http://arx.deidentifier.org
ARX - A Generic Method for Assessing the Quality of De-Identified Health Data
1. Technische Universität München
Fabian Prasser, Raffael Bild, Klaus A. Kuhn
Chair for Medical Informatics
Institute for Medical Statistics and Epidemiology
Technical University of Munich (TUM)
A Generic Method for Assessing the
Quality of De-Identified Health Data
2. Technische Universität München
Fabian Prasser, Raffael Bild, Klaus A. Kuhn:
A Generic Method for Assessing the Quality of De-Identified Health Data
Health – Exploring Complexity HEC 2016 / Medical Informatics Europe MIE 2016, 19.08.2016
Background
Motivation: legal requirements
● Secondary use of health care data for research
● Data sharing in cooperative research
Goal: privacy protection
● Ensure that recipients cannot learn the identity of data subjects
● Re-identification can have severe legal consequences
Basis: make sure that the recipient is as trustworthy as possible
● Sign data use agreements, approval by data access committees
● Implement multiple layers of access to create controlled environments
Residual risks: data de-identification (also called: data anonymization)
● Step 1: Remove identifying data (e.g. names, insurance numbers)
● Step 2: Modify data to reduce the uniqueness of potentially identifying attribute values (e.g. date-of-birth, sex, zip code)
3. Technische Universität München
Example
Reduction of the uniqueness of potentially identifying values
[Figure labels: generalization, suppression, micro-aggregation]
4. Technische Universität München
Challenge
Trade-off: privacy risks vs. quality of data
Models are needed for measuring both aspects
● Privacy: k-anonymity, k-map, strict average risk, population uniqueness
● Quality: loss of information (e.g. granularity), changes in statistical properties (e.g. tendency, dispersion, shape of distributions), data utility (e.g. classification)
[Figure: axes "Privacy risk" and "Data quality"; labels: original data (highest risk), no data (no risk), potential solutions]
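As a small illustration of one of the privacy models listed above, the following sketch computes the k for which a table is k-anonymous, i.e. the size of the smallest group of records that share the same potentially identifying values. The function name, the record layout and the example values are assumptions made for this illustration; they are not taken from the slides or from ARX.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all potentially identifying attributes."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical, already generalized example data
records = [
    {"age": "30-39", "sex": "F", "zip": "816**"},
    {"age": "30-39", "sex": "F", "zip": "816**"},
    {"age": "40-49", "sex": "M", "zip": "817**"},
    {"age": "40-49", "sex": "M", "zip": "817**"},
    {"age": "40-49", "sex": "M", "zip": "817**"},
]
k = k_anonymity(records, ["age", "sex", "zip"])
print(k)
```

A larger k means lower re-identification risk, but reaching it typically requires stronger transformation and therefore costs data quality.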
5. Technische Universität München
Data transformation: attribute generalization
Recommended for health data: generalization hierarchies
Examples: [Figure: table with columns "Input data", "Global recoding", "Local recoding"]
6. Technische Universität München
Data transformation: global recoding
Identical input values are mapped to identical generalized values
Examples: [Figure: table with columns "Input data", "Global recoding", "Local recoding"]
7. Technische Universität München
Data transformation: local recoding
Identical values may be generalized to different levels
More flexible: can preserve more information content
Examples: [Figure: table with columns "Input data", "Global recoding", "Local recoding"]
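The difference between the two recoding schemes described above can be sketched in a few lines. The three-level age hierarchy (exact age, decade, "*") and all function names are assumptions chosen for illustration; the hierarchies used in the talk are not recoverable from the transcript.

```python
def generalize(age, level):
    """Hypothetical hierarchy: level 0 = exact age, 1 = decade, 2 = '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def global_recoding(ages, level):
    """Every value is generalized to the same level, so identical
    inputs always end up as identical outputs."""
    return [generalize(a, level) for a in ages]

def local_recoding(ages, levels):
    """Each cell gets its own level, so identical inputs may end up
    at different levels of the hierarchy."""
    return [generalize(a, lvl) for a, lvl in zip(ages, levels)]

ages = [34, 34, 45, 45]
global_out = global_recoding(ages, 1)
local_out = local_recoding(ages, [0, 1, 1, 2])
print(global_out)
print(local_out)
```

In the local variant the first record keeps its exact age, which is the extra information content that global recoding would have discarded.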
8. Technische Universität München
Non-Uniform Entropy
Well-known model for measuring information loss
● Developed by the statistical disclosure control community
● A. De Waal and L. Willenborg, Information loss through global recoding and local suppression, Netherlands Official Statistics 14 (1999), 17–20.
Often used for de-identifying health data
● Recommended in several guidelines, used in papers
Based on the concept of mutual information
● Quantifies the amount of information which can be obtained about one variable by observing the other
Application to data anonymization
● Measure loss of information by comparing input data with transformed output data
Can only be used with global recoding (details: see paper)
● We have developed a generic variant which supports local recoding (generalization, record suppression, cell suppression)
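The slides do not spell out the formula. The sketch below uses a formulation commonly found in the literature, in which each cell contributes -log2(frequency of its input value / frequency of its generalized output value) for a single globally recoded column. Treat this as an assumption for illustration; the exact definition used by the authors is given in the paper. The function name and the example data are hypothetical.

```python
import math
from collections import Counter

def non_uniform_entropy(inputs, outputs):
    """Information loss of one column under global recoding.
    Assumes outputs[i] is the generalization of inputs[i] and that
    identical inputs are always mapped to identical outputs."""
    in_counts = Counter(inputs)
    out_counts = Counter(outputs)
    return sum(
        -math.log2(in_counts[a] / out_counts[b])
        for a, b in zip(inputs, outputs)
    )

ages_in = ["34", "36", "45", "45"]
ages_out = ["30-39", "30-39", "40-49", "40-49"]
loss = non_uniform_entropy(ages_in, ages_out)
print(loss)
```

Under global recoding every output group is at least as large as each input group it contains, so each term is non-negative. That guarantee is what is lost once identical inputs may be mapped to different outputs.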
9. Technische Universität München
Generic Non-Uniform Entropy
Basic idea: model local recoding as iterative global recoding
This can be done for every local recoding scheme
[Figure: column "Age", input vs. output, decomposed into global recoding to level 0, level 1 and level 2]
Each step is a global recoding, so we can use Non-Uniform Entropy for calculating Δ0,1 and Δ1,2
10. Technische Universität München
Generic Non-Uniform Entropy
Basic idea: model local recoding as iterative global recoding
Result: Δ' = Δ0,1 + Δ1,2
[Figure: column "Age", input vs. output, decomposed into global recoding to level 0, level 1 and level 2, with Non-Uniform Entropy applied to each step]
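One possible reading of Δ' = Δ0,1 + Δ1,2 is sketched below: the whole column is globally recoded level by level, a per-cell Non-Uniform Entropy term is computed for each step, and every cell accumulates the step losses up to the level it was actually generalized to. This is an interpretation of the slide, not necessarily the authors' exact definition (see the paper). The age hierarchy, the per-cell formula and all names are assumptions for illustration.

```python
import math
from collections import Counter

MAX_LEVEL = 2  # assumed hierarchy height: 0 = exact age, 1 = decade, 2 = "*"

def generalize(age, level):
    """Hypothetical age hierarchy used only for this illustration."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def step_losses(ages, level):
    """Per-cell loss of globally recoding the whole column
    from `level` to `level + 1`."""
    lower = [generalize(a, level) for a in ages]
    upper = [generalize(a, level + 1) for a in ages]
    lower_counts, upper_counts = Counter(lower), Counter(upper)
    return [
        -math.log2(lower_counts[x] / upper_counts[y])
        for x, y in zip(lower, upper)
    ]

def generic_loss(ages, levels):
    """For every cell, sum the step losses up to the level that
    the cell was generalized to in the output."""
    steps = [step_losses(ages, j) for j in range(MAX_LEVEL)]
    return sum(
        steps[j][i]
        for i, level in enumerate(levels)
        for j in range(level)
    )

ages = [34, 36, 45, 45]
local_loss = generic_loss(ages, [0, 1, 2, 2])   # local recoding: mixed levels
global_loss = generic_loss(ages, [1, 1, 1, 1])  # global recoding to level 1
print(local_loss, global_loss)
```

When all cells share the same level, the sum reduces to the ordinary single-step measure for that global recoding, which is why both quantities can be compared on one scale.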
12. Technische Universität München
Experiments
Two datasets
● Extract of the 1994 US census database: 30,162 records
● Health interview series: US survey with 1,193,504 participants
Transformation scheme
● Initially: global recoding with generalization
● Schemes: original, low, medium, high
● Followed by: local recoding with record suppression
● Iterative removal of records (10%, 20%, …, 100%)
Measured information loss with two models
● Non-Uniform Entropy
● Our generic variant
Expected outcome
● Initially: loss of information via generalization
● Followed by: linear increase of information loss (number of removed records)
13. Technische Universität München
Results
Both models measured the same initial loss of information
Only our model captured the linear increase
→ Non-Uniform Entropy measured information gain followed by decrease
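A toy example can show how a frequency-based loss measure may report an information gain once records are suppressed. It assumes the commonly used per-cell term -log2(input frequency / output frequency) and a made-up six-record column; it does not reproduce the experiments or the exact model evaluated in the talk.

```python
import math
from collections import Counter

def naive_loss(inputs, outputs):
    """Frequency-based loss applied without checking that the
    transformation is a global recoding (assumed formulation)."""
    in_counts, out_counts = Counter(inputs), Counter(outputs)
    return sum(
        -math.log2(in_counts[a] / out_counts[b])
        for a, b in zip(inputs, outputs)
    )

# Hypothetical column; two of the four "45" records are suppressed ("*")
inputs = ["45", "45", "45", "45", "34", "36"]
outputs = ["45", "45", "*", "*", "34", "36"]
loss = naive_loss(inputs, outputs)
print(loss)
```

Because the value "45" is now rarer in the output than in the input, the ratio inside the logarithm exceeds 1 and the terms turn negative, so removing records looks like a gain of information.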
14. Technische Universität München
ARX – An anonymization tool for biomedical data
The method described here has been implemented into ARX
● Oriented towards guidelines for health data de-identification
● Supports a wide variety of approaches to data de-identification
● Requires development of generic methods
Highly scalable
● Millions of records with up to 50 potentially identifying attributes
Mentioned in several data protection guidelines
● European Medicines Agency (EMA): External Guidance on the Implementation of the European Medicines Agency Policy on the Publication of Clinical Data for Medicinal Products for Human Use (2016)
● EU Agency for Network and Information Security (ENISA): Privacy and Data Protection by Design (2014)
ARX is open source software
● Website: http://arx.deidentifier.org
● Email: fabian.prasser@tum.de
15. Technische Universität München
Thank you for your attention!