More Related Content Similar to Record Ordering Heuristics for Disclosure Control through Microaggregation (20) More from IDES Editor (20) Record Ordering Heuristics for Disclosure Control through Microaggregation1. ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012
Record Ordering Heuristics for Disclosure Control
through Microaggregation
Brook Heaton1 and Sumitra Mukherjee2
1
Graduate School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA
Email: willheat@nova.edu
2
Graduate School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA.
Email: sumitra@nova.edu
Abstract— Statistical disclosure control (SDC) methods indistinguishable from at least k-1 other records. Replacing
reconcile the need to release information to researchers with actual data values with group means leads to information
the need to protect privacy of individual records. loss and renders the data less valuable to users. Optimal
Microaggregation is a SDC method that protects data subjects microaggregation aims to minimize information loss by
by guarantying k-anonymity: Records are partitioned into
grouping together records that are very similar, thereby
groups of size at least k and actual data values are replaced by
the group means so that each record in the group is
maximizing homogeneity among the records of each group.
indistinguishable from at least k-1 other records. The goal is The optimal microaggregation problem can be formally
to create groups of similar records such that information loss defined as follows: Given a data set with n records and p
due to data modification is minimized, where information numerical attributes per record, partition the records into
loss is measured by the sum of squared deviations between groups containing at least k records each, such that the sum
the actual data values and their group means. Since optimal squared error (SSE) within groups is minimized. SSE is defined
multivariate microaggregation is NP-hard, heuristics have
g ni
been developed for microaggregation. It has been shown that 2
for a given ordering of records, the optimal partition consistent as
i 1 j 1
X ij X i , where g is the number of groups, X
ij
with that ordering can be efficiently computed and some of
the best existing microaggregation methods are based on this
approach. This paper improves on previous heuristics by is the jth record of the ith group, and X i is the mean for the ith
adapting tour construction and tour improvement heuristics group.
for the traveling salesman problem (TSP) for
This problem has been shown to be NP-hard when p>1
microaggregation. Specifically, the Greedy heuristic and the
Quick Boruvka heuristic are investigated for tour construction
[21]. Since microaggregation is extensively used for SDC,
and the 2-opt, 3-opt, and Lin-Kernighan heuristics are used several heuristics have been proposed that lead to low
for tour improvements. Computational experiments using information loss (see e.g. [4], [6], [7], [9], [10], [11], [13], [16],
benchmark datasets indicate that our method results in lower [17], [18], [20], [24], and [25]). There is a need to develop new
information loss than extant microaggregation heuristics. heuristics that perform better than those currently known,
either by producing groups with lower information loss, or
Index Terms—Disclosure Control, Microaggregation, Privacy by achieving the same information loss at lower
protection, Tour construction heuristics, Tour improvement computational cost. Optimal univariate microaggregation (i.e.,
heuristics, Shortest path.
involving a single attribute) has been shown to be solvable
in polynomial time by taking an ordered list of records and
I. INTRODUCTION using a shortest-path algorithm to compute the lowest
Many databases, such as health information systems and information loss k-partition for the given ordered list [14]. A
U.S. census data contain information that is valuable to k-partition is a set of groups such that every group contains
researchers for statistical purposes. At the same time, these at least k elements. Optimality is guaranteed, since univariate
databases contain private information about individuals. data can be strictly ordered. Practical applications, however,
There is a tradeoff between providing information for the require microaggregation to be performed on records
benefit of society, and restricting information for the benefit containing multiple attributes. While multivariate data cannot
of individuals. Statistical disclosure control (SDC) is a set of be strictly ordered, Domingo-Ferrer et al. [9] show that, for a
techniques for providing access to aggregate information in given ordering of records, the best partition consistent with
statistical databases, while at the same time protecting the that ordering can be identified efficiently as a shortest path
privacy of individual data subjects. See [1], [12], and [27] for in a network using Hansen and Mukherjee’s method [14].
good overviews of statistical disclosure controls methods. Exploiting this idea, they develop the Multivariate Hansen-
Microaggregation is a popular SDC method that that protects Mukherjee (MHM) algorithm and propose several heuristics
data subjects by guarantying k-anonymity (see [2], [23], and for ordering the records. Empirical results indicate that MHM,
[26]). Under microaggregation, records are partitioned into when used with good ordering heuristics, outperforms extant
groups of size at least k and actual data values are replaced microaggregation methods. This paper builds on the work of
by the group means so that each record in the group is [9] by investigating new methods for ordering records as a
© 2012 ACEEE 22
DOI: 01.IJCSI.03.01.11
2. ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012
first step in multivariate microaggregation. The techniques are assigned. Stitch the groups together by applying NPN to
for ordering records are based on the tour construction and their centroids. The MD-MHM heuristic runs in O(n3) time.
improvement heuristics for the traveling salesman problem The MDAV-MHM heuristic is very similar to MD, but
(TSP). Using previously studied benchmark datasets, our considers distances of records to group centroids, rather than
method is shown to outperform extant microaggregation distances to other records. The MDAV heuristic runs in O(n2)
heuristics. Section II discusses the MHM based time. The CBFS-MHM designates as the first record the
microaggregation method and record ordering heuristics records r that is furthest from the centroid. The k-1 closest
proposed in [9]. Section III introduces our TSP based tour records to r are added to its group. This process is repeated
construction and improvement heuristics. Section IV using the remaining records until all the records are assigned.
describes the computational experiments used to compare The complexity of CBFS is also O(n2).
our method with extant microaggregation heuristics. Section
V presents our results. III. OUR ORDERING AND IMPROVEMENT HEURISTICS
We use two TSP based tour construction heuristics –
II. MHM MICROAGGREGATION HEURISTIC
Greedy and Quick Boruvka (QB) – and three tour
A. The MHM Algorithm improvement heuristics – 2-Opt, 3-Opt, and Lin-Kernighan
The MHM algorithm introduced in [9] involves (LK). The improvement heuristics are applied not only to the
constructing a graph based on an ordered list of records, and tours generated by Greedy and QB but also to those produced
finding the shortest path in the graph. The arcs in the shortest by the record ordering heuristics discussed in the previous
path correspond to a partition of the records that is guaranteed section – NPN, MD-MHM, MDAV-MHM, and CBFS-MHM.
to be the lowest cost partition consistent with the specified The Greedy heuristic produces an ordering of records using
ordering. A partition is said to be consistent with an ordering several steps. First, the distances between every pair of
if the following is true: for any three records xi, xj, and xk records in the dataset are calculated and sorted in ascending
where i<j<k if xi and xk belong to the same group g then xj order. The pair of records with the smallest distance is chosen
also belongs to group g. The graph is constructed as follows: and added to the path. The remaining pairs are iteratively
Given an ordered set of n records, create n nodes labeled 1 to considered in order, and added to the path as long as the do
n. Additionally, create a node labeled 0. For each pair of not introduce sub-tours or vertices of degree greater than 2.
nodes (i, j) where i+k <= j < i+2k, create an arc directed The computational complexity of the Greedy heuristic is O(n2
log n). In Quick Boruvka, the records are first sorted by one
j
from i to j. The length of the arc L(i , j ) h i 1
( xh x(i , j ) )2 , of the attributes to generate an initial ordering of records.
For each record in the initial ordering, the distance between
where xh is a multivariate record, and x(i , j ) is the centroid of the current record and the other records in the dataset is
the records in (i+1, j). Once the graph is constructed, a computed. The pair of records with the minimum distance is
shortest-path algorithm such as Dijkstra’s algorithm [8] is added, subject to the same conditions outlined in the Greedy
used to determine the shortest path from node 0 to node n. heuristic above. The QB heuristic is also of complexity of
The length of the shortest path gives the information loss O(n2 log n), but empirically performs slightly faster than the
corresponding to the best partition consistent with the Greedy heuristic [3]. The 2-Opt heuristic, is a tour
specified ordering. improvement heuristic; that is, it starts with a tour and
iteratively attempts to improve it. It operates by first randomly
B. Existing Record Ordering Heuristics selecting two non-connected edges (a total of four points)
Four heuristics for ordering records have been proposed from the tour. If re-connecting the edges result in a shorter
in [9]: nearest-point-next (NPN), maximum distance (MD- overall tour the originally selected edges are replaced with
MHM), maximum distance to average vector (MDAV-MHM), the newly computed ones. This process continues until either
and centroid-based fixed-size (CBFS-MHM). The NPN no further exchanges result in a shorter tour, or some other
heuristic selects the first record by computing the record stopping criteria are met. In the 3-opt heuristic, three
furthest away from the centroid of the entire dataset. The disconnected edges are broken, and every possible way of
record closest to the initial record is designated as the second reconnecting the edges into a valid tour is examined. If a tour
record. The third record is closest to the second record. with shorter distance is found, the edges are swapped, and
This process continues until all of the records have been the process repeats until no more exchanges are possible or
added to the tour. The NPN heuristic runs in O(n2) time, some other stopping criteria are met. The Lin-Kernighan (LK)
where n is the number of records in the dataset. The MD- heuristic is a tour improvement heuristic that is similar to 2-
MHM heuristic works as follows: Find the two records r and opt and 3-opt. Like 2-opt, and 3-opt, LK takes an existing tour,
s that are furthest from one another in the dataset, using and iteratively swaps edges to create a better tour. Lin and
Euclidean distance. Find the k-1 closest records to r and Kernighan [19] note that the choice of swapping two edges in
form a group. Find the k-1 closest records to s and form a 2-opt or 3-opt is arbitrary, since it is possible that swapping
group. Repeat this process, using the remaining records that more than three edges may actually result in a better tour.
haven’t been assigned to a group until all the remaining records Therefore, LK attempts to build a series of sequential k-opt
© 2012 ACEEE 23
DOI: 01.IJCSI.03.01.11
3. ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012
moves (sometimes referred to as swaps, flips, or exchanges), against the Census, Tarragona, and EIA datasets for five
rather than a 2-opt or 3-opt move, that will result in an overall different values of k. The information loss under each
shorter tour. A good overview of these heuristics can be method is reported. For each specified value of k, the lowest
found in [22]. information loss obtained for each dataset is highlighted in
bold to emphasize the best performing method. Table 1 shows
IV. EXPERIMENTAL DESIGN that in all but one case (k=10 for Census), our tour
improvement heuristics improved upon the results obtained
A. Datasets
using MD-MHM. Results for the MDAV-MHM and CBFS-
We used three benchmark datasets that were used in MHM family of algorithms are very similar, as shown in Tables
previous studies [9] to evaluate our proposed methods. The 2 and 3. Reductions in information loss are often substantial.
first dataset, “Tarragona”, consists of 834 records with 13 For example with the EIA dataset, the MDAV-MHM+LK
variables. The second dataset, “EIA”, consists of 4092 heuristic for k=10 resulted in an information loss of 1.968,
records with 10 variables. Finally, the “Census” dataset which is 31% lower than the information loss (2.867) obtained
consists of 1080 records from the U.S. census, with 13 using MDAV-MHM alone.
variables. Consistent with previous studies, the variables
TABLE I.
were normalized to ensure that results are not dominated by INFORMATION LOSS UNDER MD-MHM AND IMPROVEMENT
any single variable. HEURISTICS
B. Record Ordering Algorithms
The record ordering algorithms used by Domingo-Ferrer
et al. [9] are NPN, MD-MHM, CBFS-MHM, and MDAV-MHM.
These algorithms were run on the three selected datasets.
Additionally, two new record ordering algorithms introduced
in this study – Greedy and Quick Boruvka – were also run.
Three different algorithms for improving existing record
orderings – 2-opt, 3-opt, and LK – were used with all tour
construction heuristics.
C. Comparisons
The six microaggregation algorithms (i.e., NPN, MD-
MHM, MDAV-MHM, CBFS-MHM, Greedy, and QB) were TABLE II.
INFORMATION LOSS UNDER MDAV-MHM AND
run on the three datasets (Census, Tarragona, and EIA) using IMPROVEMENT HEURISTICS
five typical values of k (3, 4, 5, 6, and 10). These values of k
are consistent with those used in [9]. The three tour
improvement heuristics (2opt, 3opt, and LK) were then
applied to these resulting record orderings. That is, for a
given dataset and a given value of k, four different
experiments were run using each record ordering heuristic
(the original algorithm with no improvement, plus the three
improvement heuristics). Consistent with previous studies,
we report the normalized information loss IL = 100 * SSE/SST
(with possible values between 0 and 100) where SST =
g ni
2
i 1 j 1
X ij X , and X is the centroid of the entire TABLE III.
INFORMATION LOSS UNDER CBFS-MHM AND
IMPROVEMENT HEURISTICS
dataset. All of the algorithms were implemented in the Java
programming language, using the Sun Java 6.0 SDK. All
algorithms were run on a computer with a dual 3.0 GHz
processor and 2GB of RAM, running the Linux operating
system (Ubuntu 10.04).
V. RESULTS
This section details the results from applying the MD-
MHM, MDAV-MHM, CBFS-MHM, NPN-MHM, Greedy-
MHM, and QB-MHM algorithms, along with their
corresponding 2-opt, 3-opt, and LK improvement heuristics,
© 2012 ACEEE 24
DOI: 01.IJCSI.03.01. 11
4. ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012
TABLE IV. TABLE VII.
INFORMATION LOSS UNDER NPN-MHM AND BEST COMBINATION OF TOUR CONSTRUCTION AND
IMPROVEMENT HEURISTICS IMPROVEMENT HEURISTICS
TABLE V.
INFORMATION LOSS UNDER QB-MHM AND IMPROVEMENT
HEURISTICS
Table 7 summarizes the results from tables 1–6 as follows: the
best performing combination of tour construction and tour
improvement heuristics for each dataset are identified under
a specified k. The row labeled “Best under” identifies the
specific improvement heuristic that produced the best results.
For example, with the Census dataset, when k=3, the best
results are obtained using a combination of Greedy-MHM
and the LK tour improvement heuristic with an information
Tables 4, 5, and 6 present the results for the NPN-MHM, QB- loss of 5.002%. While no tour construction heuristic
MHM, and Greedy-MHM family of algorithms respectively. dominates, out of 15 total instances, the best result was found
Our tour improvement heuristics resulted in the lower infor- using the LK improvement heuristic 7 times and using the 3-
mation loss in every instance opt improvement heuristic 6 times. Table 8 shows the
TABLE VI.
difference between the lowest information loss obtained using
INFORMATION LOSS UNDER GREEDY-MHM AND extant microaggregation methods (the minimum value found
IMPROVEMENT HEURISTICS from the MD, MDAV, CBFS, MD-MHM, CBFS-MHM, NPN-
MHM, and MDAV-MHM heuristics) and those obtained
using the new methods described in this paper. Under our
methods information losses are decreased by 4%, 5.5%, and
13.9% on the average for the Census, Tarragona, and EIA
datasets respectively. In all but one instance (k=10 with the
census dataset) our proposed method outperformed existing
methods and should hence be considered as a promising
microaggregation heuristic.
© 2012 ACEEE 25
DOI: 01.IJCSI.03.01.11
5. ACEEE Int. J. on Control System and Instrumentation, Vol. 03, No. 01, Feb 2012
TABLE VIII. [12] Duncan, G. T., Elliot, M., Salazar-González, J. (2011).
RESULTS COMPARED TO EXTANT METHODS Statistical Confidentiality: Principles and Practice, Springer.
[13] Fayyoumi, E., & Oommen, B.J. (2009). Achieving
microaggregation for secure statistical databases using fixed-
structure partitioning-based learning automata. IEEE Transactions
on Systems, Man, and Cybernetics, Part B, 39(5), 1192-1205.
[14] Hansen, S. L., & Mukherjee, S. (2003). A polynomial algorithm
for optimal univariate microaggregation. IEEE Transactions on
Knowledge and Data Engineering, 15(4), 1043-1044.
[15] Hundepool, A., Wetering, A., Ramaswamy, R., Franconi, L.,
Capobianchi, A., Wolf, P., Domingo-Ferrer, J., Torra, V., Brand,
R., & Giessing, A. (2004). µ-ARGUS Ver. 4.0 Software and User’s
Manual. Voorburg, The Netherlands: Statistics Netherlands.
[16] Kabir, M.E., Wang, H., & Zhang, Y. (2010). A pairwise-
systematic microaggregation for statistical disclosure control.
International Conference on Data Mining 2010, 266-273.
[17] Laszlo, M., & Mukherjee, S. (2005). Minimum spanning tree
partitioning algorithm for microaggregation. IEEE Transactions on
REFERENCES Knowledge and Data Engineering, 17(7), 902-911.
[18] Laszlo, M., & Mukherjee, S. (2009). Approximation bounds
[1] Adam, N. R., & Wortmann, J. C. (1989). Security-control
for minimum information loss microaggregation. IEEE Transactions
methods for statistical databases: a comparative study. ACM
on Knowledge and Data Engineering, 21(11), 1643-1647.
Computing Surveys, 21(4), 515-556.
[19] Lin, S., & Kernighan, B. (1973). An effective heuristic algorithm
[2] Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy,
for the traveling-salesman problem. Operations Research, 21(2),
R., Thomas, D., & Zhu, A. (2005). Approximation algorithms for
498-516.
k-anonymity. Journal of Privacy Technology, Paper No.
[20] Mateo-Sanz, J. M., & Domingo-Ferrer, J. (1999). A method
20051120001.
for data oriented multivariate microaggregation In J. Domingo-Ferrer
[3] Applegate, D., Cook, W., & Rohe, A. (2003). Chained Lin-
(Ed.), Statistical Data Protection (pp. 89-99). Luxemburg: Office
Kernighan for large traveling salesman problems. INFORMS Journal
for Official Publications of the European Communities.
on Computing, 15(1), 82-92.
[21] Oganian, A., & Domingo-Ferrer, J. (2001). On the complexity
[4] Balleste, A.M., Solanas, A., Domingo-Ferrer, J., & Mateo-Sanz,
of optimal microaggregation for statistical disclosure control.
J.M. (2007). A genetic approach to multivariate microaggregation
Statistical Journal of the United Nations Economic Commission for
for database privacy. IEEE 23rd International Conference on Data
Europe, 18(4), 345-354.
Engineering Workshop, 180-185.
[22] Rego, C., Gamboa, D., Glover, F., & Osterman, C. (2011).
[5] Bentley, J.L. (1992). Fast algorithms for geometric traveling
Traveling salesman problem heuristics: leading methods,
salesman problems. ORSA Journal of Computing, 4, 387-411.
implementations, and latest advances. European Journal of
[6] Chang, C.-C., Li, Y.-C., & Huang, W.-H. (2007). TFRP: An
Operations Research, 211(3), 427-441.
efficient microaggregation algorithm for statistical disclosure
[23] Samarati, P. (2001). Protecting respondents’ identities in
control. Journal of Systems and Software, 80(11), 1866-1878.
microdata release. IEEE Transactions on Knowledge and Data
[7] Defays, D., & Anwar, N. (1995). Micro-aggregation: a generic
Engineering, 13(6), 1010-1027.
method. Proceedings of the 2nd International Symposium on
[24] Sande, G. (2002). Exact and approximate methods for data
Statistical Confidentiality. Eurostat, Luxemburg, 69-78.
directed microaggregation in one or more dimensions. International
[8] Dijkstra, E. W. (1959). A note on two problems in connexion
Journal of Uncertainty and Fuzziness in Knowledge Based Systems,
with graphs. Numerische Mathematik, 1, 269-271.
10(5), 459-476.
[9] Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J. M.,
[25] Solanas, A., Martinez-Balleste, A., Domingo-Ferrer, J., &
& Sebé, F. (2006). Efficient multivariate data-oriented
Mateo-Sanz, M. (2006). A 2^d-tree-based blocking method for
microaggregation. The VLDB Journal, 15, 355–369.
microaggregating very large data sets. In First International
[10] Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical
Conference on Availability, Reliability and Security (ARES’06), 922-
data-oriented microaggregation for statistical disclosure control. IEEE
928.
Transactions on Knowledge and Data Engineering, 14(1), 189-
[26] Sweeney, L. (2002). K-anonymity: a model for protecting
201.
privacy. International Journal on Uncertainty, Fuzziness and
[11] Domingo-Ferrer, J., Sebe, F., & Solanas, A. A polynomial-
Knowledge-based Systems, 10(5), 557–570.
time approximation to optimal multivariate microaggregation.
[27] Willenborg, L., & de Waal, T. (2001). Elements of statistical
Computers & Mathematics with Applications, 55(4), 714-732.
disclosure control. New York: Springer-Verlag.
© 2012 ACEEE 26
DOI: 01.IJCSI.03.01. 11