ARX - A Generic Method for Assessing the Quality of De-Identified Health Data

Technische Universität München
Fabian Prasser, Raffael Bild, Klaus A. Kuhn
Chair for Medical Informatics
Institute for Medical Statistics and Epidemiology
Technical University of Munich (TUM)

Health – Exploring Complexity (HEC 2016) / Medical Informatics Europe (MIE 2016), 19.08.2016
Background
Motivation: legal requirements
● Secondary use of health care data for research
● Data sharing in cooperative research
Goal: privacy protection
● Ensure that recipients cannot learn the identity of data subjects
● Re-identification can have severe legal consequences
Basis: make sure that the recipient is as trustworthy as possible
● Sign data use agreements, approval by data access committees
● Implement multiple layers of access to create controlled environments
Residual risks: data de-identification (also called: data anonymization)
● Step 1: Remove identifying data (e.g. names, insurance numbers)
● Step 2: Modify data to reduce the uniqueness of potentially identifying attribute values (e.g. date-of-birth, sex, zip code)

Example
Reduction of the uniqueness of potentially identifying values
● Generalization
● Suppression
● Micro-aggregation

Challenge
Trade-off: privacy risks vs. quality of data
Models are needed for measuring both aspects
● Privacy: k-anonymity, k-map, strict average risk, population uniqueness
● Quality: loss of information (e.g. granularity), changes in statistical properties (e.g. tendency, dispersion, shape of distributions), data utility (e.g. classification)
[Figure: privacy risk vs. data quality, with the labels "Original data: highest risk", "No data: no risk" and "Potential solutions"]
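
The k-anonymity privacy model named above can be illustrated in a few lines. This is a minimal sketch, not ARX code: the function name `k_anonymity`, the attribute names and the toy records are all hypothetical. It reports the size of the smallest group of records that share the same potentially identifying values.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for all potentially identifying attributes."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())

# Hypothetical, already generalized toy data
records = [
    {"age": "30-39", "sex": "F", "zip": "816**"},
    {"age": "30-39", "sex": "F", "zip": "816**"},
    {"age": "40-49", "sex": "M", "zip": "803**"},
    {"age": "40-49", "sex": "M", "zip": "803**"},
    {"age": "40-49", "sex": "M", "zip": "803**"},
]

k = k_anonymity(records, ["age", "sex", "zip"])
print(k)  # 2: every record is indistinguishable from at least one other
```

A larger k means lower re-identification risk, but reaching it usually requires more generalization or suppression, which is the trade-off described on this slide.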

Data transformation: attribute generalization
Recommended for health data: generalization hierarchies
Examples
[Figure: example table with the columns "Input data", "Global recoding" and "Local recoding"]

Data transformation: global recoding
Identical input values are mapped to identical generalized values
Examples

Data transformation: local recoding
Identical values may be generalized to different levels
Examples
More flexible: can preserve more information content
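
The difference between the two recoding schemes can be sketched as follows. The three-level age hierarchy and all function names are assumptions made for illustration only; they are not taken from the slides or from ARX.

```python
def generalize_age(age, level):
    """Hypothetical hierarchy: level 0 = exact age,
    level 1 = 10-year interval, level 2 = fully generalized."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def global_recoding(ages, level):
    """All cells are generalized to the same level, so identical
    input values always yield identical output values."""
    return [generalize_age(age, level) for age in ages]

def local_recoding(ages, levels):
    """Each cell has its own level, so identical input values
    may end up at different levels of the hierarchy."""
    return [generalize_age(age, level) for age, level in zip(ages, levels)]

ages = [34, 34, 37, 45]
globally = global_recoding(ages, 1)
locally = local_recoding(ages, [0, 1, 1, 2])
print(globally)  # ['30-39', '30-39', '30-39', '40-49']
print(locally)   # ['34', '30-39', '30-39', '*']
```

In the local result the two records with age 34 are treated differently: one keeps its exact value, which is where the additional information content comes from.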

Non-Uniform Entropy
Well-known model for measuring information loss
● Developed by the statistical disclosure control community
● A. De Waal and L. Willenborg, Information loss through global recoding and local suppression, Netherlands Official Statistics 14 (1999), 17–20.
Often used for de-identifying health data
● Recommended in several guidelines, used in papers
Based on the concept of mutual information
● Quantifies the amount of information which can be obtained about one variable by observing the other
Application to data anonymization
● Measure loss of information by comparing input data with transformed output data
Can only be used with global recoding (details: see paper)
● We have developed a generic variant which supports local recoding (generalization, record suppression, cell suppression)
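
The comparison of input and output data can be sketched for a single attribute. The formulation below is an assumption, since the slides defer the exact definition to the paper: each cell contributes the negative base-2 logarithm of the ratio between the frequency of its original value in the input column and the frequency of its generalized value in the output column. The function name and toy data are hypothetical.

```python
import math
from collections import Counter

def non_uniform_entropy(input_col, output_col):
    """Information loss for one attribute (assumed formulation):
    sum over all cells of -log2(count of the original value in the input
    column / count of the generalized value in the output column)."""
    in_freq = Counter(input_col)
    out_freq = Counter(output_col)
    return sum(
        -math.log2(in_freq[a] / out_freq[b])
        for a, b in zip(input_col, output_col)
    )

ages_in = ["34", "34", "37", "45"]
ages_global = ["30-39", "30-39", "30-39", "40-49"]  # global recoding

loss = non_uniform_entropy(ages_in, ages_global)
print(round(loss, 3))  # 2.755
```

Under global recoding every ratio is at most 1, so each term is non-negative. In this sketch a locally recoded column such as `["34", "30-39", "30-39", "45"]` produces a ratio of 2 for the first cell and hence a negative term, which is one way to see why the measure presupposes global recoding.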

Generic Non-Uniform Entropy
Basic idea: model local recoding as iterative global recoding
This can be done for every local recoding scheme
[Figure: "Age input" and "Age output" columns, labelled "Global recoding to level 0"]
Global recoding, so we can use Non-Uniform Entropy for calculating Δ0,1 and Δ1,2!

Result: Δ' = Δ0,1 + Δ1,2
[Figure: "Age input" and "Age output" columns, labelled "Non-Uniform Entropy"]
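
One possible reading of the decomposition Δ' = Δ0,1 + Δ1,2 is sketched below. It is not the authors' implementation. It assumes that the step from level l to level l+1 is a global recoding of exactly those cells whose final level lies above l, and that frequencies are counted within that subset; the paper may define these details differently. The age hierarchy, the entropy formulation and all function names are hypothetical.

```python
import math
from collections import Counter

def generalize_age(age, level):
    """Hypothetical hierarchy: 0 = exact age, 1 = 10-year interval, 2 = '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def non_uniform_entropy(input_col, output_col):
    """Assumed formulation: sum of -log2(input count / output count)."""
    in_freq = Counter(input_col)
    out_freq = Counter(output_col)
    return sum(
        -math.log2(in_freq[a] / out_freq[b])
        for a, b in zip(input_col, output_col)
    )

def generic_loss(ages, levels):
    """Sum the losses of the individual steps. Step l -> l+1 globally
    recodes the cells whose final level is greater than l (assumption)."""
    total = 0.0
    for step in range(max(levels)):
        subset = [age for age, level in zip(ages, levels) if level > step]
        before = [generalize_age(age, step) for age in subset]
        after = [generalize_age(age, step + 1) for age in subset]
        total += non_uniform_entropy(before, after)
    return total

ages = [34, 34, 37, 45]
local_loss = generic_loss(ages, [0, 1, 2, 2])   # local recoding
global_loss = generic_loss(ages, [1, 1, 1, 1])  # plain global recoding
print(local_loss)  # 4.0 (2.0 for step 0 -> 1 plus 2.0 for step 1 -> 2)
```

When every cell ends at the same level, the sketch reduces to a single Non-Uniform Entropy computation, so both measures agree for global recoding.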

Experiments
Two datasets
● Extract of the 1994 US census database: 30,162 records
● Health interview series: US survey with 1,193,504 participants
Transformation scheme
● Initially: global recoding with generalization
● Schemes: original, low, medium, high
● Followed by: local recoding with record suppression
● Iterative removal of records (10%, 20%, …, 100%)
Measured information loss with two models
● Non-Uniform Entropy
● Our generic variant
Expected outcome
● Initially: loss of information via generalization
● Followed by: linear increase of information loss (number of removed records)

Results
Both models measured the same initial loss of information
Only our model captured the linear increase
→ Non-Uniform Entropy measured information gain followed by decrease

ARX – An anonymization tool for biomedical data
The method described here has been implemented in ARX
● Oriented towards guidelines for health data de-identification
● Supports a wide variety of approaches to data de-identification
● Requires development of generic methods
Highly scalable
● Millions of records with up to 50 potentially identifying attributes
Mentioned in several data protection guidelines
● European Medicines Agency (EMA): External Guidance on the Implementation of the European Medicines Agency Policy on the Publication of Clinical Data for Medicinal Products for Human Use (2016)
● EU Agency for Network and Information Security (ENISA): Privacy and Data Protection by Design (2014)
ARX is open source software
● Website: http://arx.deidentifier.org
● Email: fabian.prasser@tum.de

Thank you for your attention!
