[Unclear] words are denoted in square brackets
Managing and publishing sensitive data in the
social sciences – ANDS Webinar
29 March 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Kate LeMay: Good afternoon - or good morning if you're over in the Perth time zone - to
everyone. Thank you for calling into our webinar today. We've got
some handouts in today's webinar as well. We've got a guide to
publishing and sharing sensitive data; that is an ANDS resource, and also an ANDS resource called the sensitive data decision tree. That's a one-page summary of the information that's available in our guide.
I'd just like to introduce our two guests today. We've got Professor
George Alter. He's a research professor in the Institute for Social
Research and Professor of History at the University of Michigan. His
research integrates theory and methods from demography, economics
and family history, with historical sources to understand demographic
behaviours in the past.
From 2007 to 2016 he was the Director of the Inter-university
Consortium for Political and Social Research, ICPSR, the world's
largest archive of social science data. He's been active in international
efforts to promote research transparency, data sharing and secure
access to confidential research data. He's currently engaged in
projects to automate the capture of metadata from statistical analysis
software and to compare fertility transitions in contemporary and
historical populations. We're lucky to currently have him as a visiting
professor at ANU.
Dr Steve McEachern is the Director of the Australian Data Archive at the
Australian National University. He holds a PhD in industrial relations
and a graduate diploma in management information systems and has
research interests in data management and archiving, community and
social attitude surveys, new data collection methods and reproducible
research methods.
Steve has been involved in various professional associations in
survey, research and data archiving over the last 10 years and is
currently chair of the Executive Board of the Data Documentation Initiative.
Firstly, we're going to hand over to George, who's going to share the
benefit of over 50 years of ICPSR managing sensitive social science
data. Over to you George.
George Alter: Thank you, Kate. It's a pleasure to talk to you today. ICPSR, as Kate
mentioned, has been archiving data for more than 50 years, and an
increasing amount of our effort has gone into devising safe ways to
share data that have sensitive and confidential information.
At the heart of everything we do in terms of protecting confidential
information is a part of the research process: when we ask
people to provide information about themselves to us, we make a
promise to them. We tell them that the benefits of the research that
we're going to do are going to outweigh the risk to them, and we say
that we will protect the information that they give us.
We have a lot of data that we receive at ICPSR and here at the ADA
that include questions that are very sensitive. Often we're asking people about types of behaviour that could cause them harm - we might be specifically asking them about criminal activity. We might be asking them about medications that they take that could affect their jobs or other things, so we have to be careful about it.
We're afraid that if the information gets out it could be used by various
actors for specific purposes, could be used in a divorce proceeding.
Sometimes we interview adolescents about drug use or sexual
behaviour, and we promise them that their parents won't see it and so
on.
In the data archiving world we often talk about two kinds of identifiers.
There are direct identifiers - which are things like names, addresses,
social security numbers - many of which are unnecessary, but some
types of direct identifiers - such as geographic locations or genetic
characteristics - may actually be part of the research project.
Then the most difficult problem, often, is the indirect identifiers. That is
to say characteristics of an individual that, when taken together, can
identify them. We refer to this often as deductive disclosure, meaning
that it's not obvious directly, but if you know enough information about
a person in a data set, then you can match them to something else.
Frequently we're concerned that someone who knows that another
person is in the survey could use that information and find them, or
that there is some other external database where you could match
information from the survey and re-identify a subject.
Deductive disclosures often are dependent on contextual data. If you
know that a person is in a small geographic area, or you know that
they're in a certain kind of institution, like a hospital or a school, it
makes it easier to narrow down the field over which you have to
search to identify them.
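As a rough illustration of how indirect identifiers combine, here is a minimal Python (pandas) sketch, with hypothetical variable names, that counts how many respondents share each combination of quasi-identifiers; any combination held by a single respondent is a deductive disclosure risk:

```python
# Minimal sketch of a deductive-disclosure check; column names are hypothetical.
import pandas as pd

def unique_combinations(df: pd.DataFrame, quasi_identifiers: list) -> pd.DataFrame:
    """Return quasi-identifier combinations that belong to exactly one respondent."""
    counts = df.groupby(quasi_identifiers).size().reset_index(name="n")
    return counts[counts["n"] == 1]

survey = pd.DataFrame({
    "age":        [34, 34, 34, 67],
    "sex":        ["F", "F", "F", "M"],
    "postcode":   ["2601", "2601", "2601", "0872"],
    "occupation": ["teacher", "teacher", "teacher", "nurse"],
})

# The one 67-year-old male nurse in postcode 0872 can be re-identified by
# anyone who already knows those facts about him; the three teachers cannot.
print(unique_combinations(survey, ["age", "sex", "postcode", "occupation"]))
```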
Unfortunately, in the social sciences, contextual data has become
more and more important. People now are very interested in new
things like the effect of neighbourhood on behaviour and political
attitudes or the effect of available health services on morbidity and
mortality.
There are a number of different kinds of contextual data that can
affect deductive disclosure. We're in a world right now where our
social science researchers are increasingly using data collections that
include items of information that make the subjects more identifiable.
For example, people studying the effectiveness of teaching often have
data sets that have the characteristics of students, teachers, schools,
school districts. Once you put all those things together it becomes
very identifiable.
We at ICPSR - and, I think, the social science data community in
general - have taken up a framework for protecting confidential data
that was originally developed by Felix Ritchie in the UK that talks
about ways to make data safe. I'm going to go through these points,
but Ritchie talks about safe data, safe projects, safe settings, safe
people and safe outputs. The idea of this is not that any one approach
solves the problem, but that you can create an overall system that
draws from all of these different approaches and uses them to
reinforce each other.
Safe data means taking measures that make the data less identifiable.
Ideally, that starts when the data are collected. There are things that
data producers can do to make their data less identifiable. One of the
simplest things is to do something that masks the geography. If you're
doing interviews it's best to do the interviews in multiple locations; that adds to the anonymisation of your interviewees. Or, if you're doing
them in only one location, you should keep that information about the
location as secret as possible.
Once the data have been collected, research projects have been using a lot of different techniques for many years to mask the identity of individuals. One of the most common is what's called top coding, where if you ask your subjects about their incomes, the people with the highest incomes are going to stand out in most cases, so usually you group them into something that says people above $100,000 in income, or something like that, so that there's not just one person at the very top, but a group of people, which makes them more anonymous.
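For illustration, a minimal Python sketch of top coding, assuming a pandas Series of incomes; the $100,000 threshold and names are illustrative only, not a repository's actual rules:

```python
# Minimal sketch of top coding (illustrative threshold and names, not ICPSR's rules).
import pandas as pd

def top_code(income: pd.Series, threshold: float = 100_000) -> pd.Series:
    """Collapse all values above the threshold into a single top group."""
    return income.clip(upper=threshold)

incomes = pd.Series([42_000, 58_000, 97_000, 350_000, 1_200_000])
print(top_code(incomes))
# The two highest earners both become 100000, so neither stands out on its own.
```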
This list of things that I've given here - which goes from aggregation approaches to actually affecting the values - is listed in terms of the amount of intervention that's involved. Some of the more recently developed techniques actually involve adding noise or random numbers to the data itself, which tends to make it less identifiable, but it also has an impact on what you can do with the data.
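For illustration, a minimal sketch of that kind of perturbation, assuming a NumPy array and a zero-mean Gaussian noise scale chosen purely for the example:

```python
# Minimal sketch of perturbation by noise addition (scale chosen only for illustration).
import numpy as np

rng = np.random.default_rng(seed=1)

def add_noise(values: np.ndarray, scale: float) -> np.ndarray:
    """Add zero-mean Gaussian noise so released values no longer match external records exactly."""
    return values + rng.normal(loc=0.0, scale=scale, size=values.shape)

ages = np.array([23.0, 45.0, 67.0, 81.0])
print(add_noise(ages, scale=2.0))
# The perturbed values protect against exact matching, at some cost to analyses of the released data.
```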
Safe projects means that the projects themselves are reviewed before
access is approved. At most data repositories, when the data need to
be restricted because of sensitivity we ask the people who apply for
the data to give us a research plan. That research plan can be
reviewed in several different ways.
The first two things are things that we do regularly. At ICPSR we ask,
first of all, do you really need the confidential information to do this
research project and, if you do need it, would this research plan
identify individual subjects? We're not in the business of helping
marketers identify people for target marketing, so we would not accept
a research plan that did that. There are also projects that actually look
at the scientific merit of a research plan. To do that though you need
to have experts in the field who can help you to do that.
Safe settings means putting the data in places that reduce the risk
that it will get out. I'm going to talk here about four approaches. The first one is data protection plans. For data that need to be protected, but where the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and data use agreement - which I'll come
to in a couple of minutes.
The data protection plan specifies how they're going to protect the data. Here's a list of things that we worry about, which one of my colleagues at ICPSR drew up. One of the things we ask people is: what happens if your computer is stolen? How will the confidential data be protected?
There are a number of things that people can do, like encrypting their hard disk or locking their computers in a closet when they're not being used, that can address these things. I think that data protection plans
need to move to just a general consideration of what it is that we're
trying to protect against and allow the users to propose alternative
approaches rather than saying oh, you have to use this particular
software or this or that. We have to be clear about what we're worried
about.
A couple of notes about data security plans: data security plans are
often difficult, partly because of the approach that has been taken in
the past, and also because researchers are not computer technicians,
and we're often giving them confusing information. One of the ways
that I think, in the future - in the US at least - universities are going to
move beyond this is - I'm seeing universities developing their own
protocols where they use different levels of security for different types
of problems.
At each level they specify the kinds of measures that researchers
need to take to protect data that is at that level of sensitivity. From my
point of view, as a repository director, I think that any time that the
institutions provide guidance it's a big help to us. The other way to make the data safe - by putting it in a safe setting - is actually to control access.
There are three main ways that repositories control access. One kind
of system is what [I'd call a] remote submission and execution system
where the researcher doesn’t actually get access to the data directly.
They submit program code or a script for a statistical package to the
data repository. The repository runs the script on the data and then
sends back the results. That’s a very restrictive approach, but it's very
effective.
More recently, however, a number of repositories and statistical
agencies have been moving to virtual data enclaves. These enclaves -
which I'll illustrate briefly in a minute - use technologies that isolate the
data and provide access remotely but restrict what the user can do.
The most restrictive approach is actually a physical enclave. At ICPSR
we have a room in our basement that has computers that are isolated
from the internet.
We have certain data sets that are highly sensitive. If you want to do research with them, you can, but on the way into the enclave we're going to go through your pockets to make sure you're not trying to bring anything in, and on the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave.
The disadvantage of a physical enclave is that you actually have to
travel to Michigan to use those data, which could be expensive. That’s
the reason that a number of repositories are turning to virtual data
enclaves.
This is a sketch of what the technology looks like. What happens is that you, as a researcher, go over the internet and log on to a site that connects you to a virtual computer. That virtual computer has access to the data, but your desktop machine does not. You can only access the data through the virtual machine.
At ICPSR we actually use this system internally for our data
processing to provide an additional level of security. We talk about the
virtual data enclave, which is the service we provide to researchers,
and the secure data environment, which is where our staff works
when they're working on sensitive data.
It's a little bit of a let-down, but this is what it actually looks like. What
I've done here is - the window that’s open there with the blue
background is the - our virtual data enclave. I've opened a window for
[unclear] inside there. The black background is my desktop computer.
If you look closely, you'll see in the corner of the blue box that you see
the usual Windows icons, and that’s because when you're operating
remotely on - in the virtual enclave you're using Windows. It looks just
like Windows and acts just like Windows, except that you can't get to
anything on the internet. You can only get to things that we provide for
a level of security.
On top of that the software that's used - we use [VMware] software, but there are other brands that do the same thing - essentially turns off your access to your printer and turns off your access to your hard drive or the USB drive, so you cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there, but because you have that capability, we also restrict people with a data use agreement.
That's my next topic: how do you make people safer? The main way that we make people safer is by making them sign data use agreements or by providing them with training. The data use agreements
used at ICPSR are, frankly, rather complicated. They consist of the
research plan, as I mentioned before. We require people to get IRB
approval for what they're doing, a data protection plan, which I
mentioned, and then there are these additional things of behavioural
rules and security [pledges] and an institutional signature, which I'll
mention now.
The process - if you look at the overall process of doing research,
there are a number of legal agreements that get passed back and
forth. It actually starts with an agreement made between the data
collectors and the subject, in which they provide the subjects with
informed consent about what the research is about and what they're
going to be asked. It's only after that that the data go from the subject
to the data producers.
Then the data archive - such as ICPSR or ADA - actually reaches an
agreement with the data producers in which we become their
delegates for distributing the data. That’s another legal agreement.
Then, when the data are sensitive, we actually reach - have to get an
agreement from the researcher - and these are pieces of information
we get from the researcher - and, in the United States, our system is
that the agreement is actually not with the researcher, but with the
researcher's institution.
At ICPSR, we're located at the University of Michigan, and all of our
data use agreements are between the University of Michigan and
some other university, in most cases. There are some exceptions. It's
only after we get all of these legal agreements in place that the
researcher gets the data.
One of the things in our agreements at ICPSR is a list of the types of
things that we don’t want people to do with the data. For example, we
don’t want someone to publish a table, a cross-tabulation table, where
there's one cell that has one person in it, because that makes that
person more identifiable. There's a list of these things, I think - often
we have 10 or 12 of them - that are really standard rules of thumb that
statisticians have developed for controlling re-identification.
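As a rough illustration of one such rule of thumb, here is a minimal Python sketch, with hypothetical variable names, that flags cross-tabulation cells containing fewer respondents than a minimum threshold:

```python
# Minimal sketch of a small-cell check on a cross-tabulation; names are hypothetical.
import pandas as pd

def risky_cells(df: pd.DataFrame, row: str, col: str, min_count: int = 2) -> pd.Series:
    """Return non-empty cells whose count falls below the minimum allowed."""
    table = pd.crosstab(df[row], df[col])
    cells = table.stack()
    return cells[(cells > 0) & (cells < min_count)]

survey = pd.DataFrame({
    "region":    ["North", "North", "South", "South", "South"],
    "disclosed": ["no", "no", "no", "yes", "no"],
})

# The single "yes" respondent in the South would appear as a cell of one
# in a published table, so that cell should be suppressed or collapsed.
print(risky_cells(survey, "region", "disclosed"))
```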
The ICPSR agreements are also, as I said, agreements between institutions. One of the things that we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe that something has gone wrong, the institution agrees that they will investigate this based on their own policies about scientific integrity and protecting research subjects.
DUAs are not ideal. There's actually a lot of friction in the system. Currently - in most cases - a [PI] needs a different data use agreement for every data set, and they don't like that. We can, I think, in the future, reduce the cost of data use agreements by making institution-wide agreements where the institution designates a steward who will work with researchers at that institution.
There's already an example of this: the Databrary project - which is a project in developmental psychology that shares videos - has done very good work on legal agreements. My colleague - the current director at ICPSR, Margaret Levenstein - has been working on a model where a researcher who gets a data use agreement for one data set can use that to get a data use agreement for another data set, so that individuals can be certified and include that certification in other places.
One of the things that I think we need to do more about is training. A
number of places, like ADA, train people who get confidential data.
We've actually done some work on developing an online tutorial about
disclosure risk, which we haven’t yet released, but it's, I think,
something that should be done.
Finally, there's safe outputs. One of the - the last stage in the process
is that the repository can review what was done with the data and
remove things that are a risk to subjects. This only works if you retain
control, so it doesn’t work if you send the data to the researcher, but it
does work if you're using one of these remote systems like remote
submission or a virtual data enclave. Often, this kind of checking is
costly. There are some ways to automate part of it, but a manual
review is almost always necessary in the end.
A last thing about the costs and benefits: obviously, data protection has costs. Modifying data affects the analysis. If you restrict access you're imposing burdens on researchers. Our view is that you need to weigh the costs against the risks that are involved. There are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? And, secondly, if that person was re-identified, what harm would result?
We think about this as a matrix, where you can see in this figure that as you move up you're getting more harm, and as you move to the right you're increasing the probability of disclosure. If the data set is low on both of these things - for example, if it's a national survey where 1000 people from all over the United States were interviewed and we don't know where they're from and we ask them what their favourite brand of refrigerator is - that kind of data we're happy to send out directly over the web without a data use agreement, with simple terms of use.
But as we get more complex data with more questions, more sensitive
questions, we often will add some requirements in the form of a data
use agreement to assure the data are protected. When we get to
complex data where there is a strong possibility of re-identification and
where some harm would result to the subjects, in that case we often
add a technology component like the virtual data enclave.
Then there are the really seriously risky and sensitive things. My usual
example of this is we have a data set at ICPSR that was compiled
from interviews with convicts about sexual abuse and other kinds of
abuse in prisons. That data is very easy to identify and very sensitive.
We only provide that in our physical data enclave.
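As a rough illustration of that matrix, here is a minimal Python sketch mapping likelihood of disclosure and harm to an access arrangement; the tiers and cut-offs are illustrative only, not ICPSR's actual policy:

```python
# Minimal sketch of the harm-by-likelihood matrix; tiers are illustrative, not ICPSR policy.
from enum import IntEnum

class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def access_mode(disclosure_likelihood: Level, harm: Level) -> str:
    """Map the two risk dimensions to an access arrangement."""
    if disclosure_likelihood == Level.LOW and harm == Level.LOW:
        return "direct web download under simple terms of use"
    if disclosure_likelihood == Level.HIGH and harm == Level.HIGH:
        return "physical data enclave"
    if disclosure_likelihood + harm >= Level.MEDIUM + Level.MEDIUM:
        return "virtual data enclave plus data use agreement"
    return "data use agreement with a data protection plan"

print(access_mode(Level.LOW, Level.LOW))    # low risk: open dissemination
print(access_mode(Level.HIGH, Level.HIGH))  # highest risk: physical enclave only
```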
That’s the end of my presentation. Thank you for your attention. We'll
take questions later.
Kate LeMay: Great. Thank you, George. We'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at
the Australian Data Archive.
Steve McEachern: My aim today is to build off what George has talked about, particularly
taking the five safes model and looking at what the situation is in the
Australian case. I'll talk about the Australian Data Archive and how we
support sensitive data, but I want to put it in the context of the broader
framework of how we access sensitive data in Australian social
sciences generally.
I'm going to talk about some of the different options that are around,
picking up on some of what George has discussed in terms of some of
the alternatives that are available and demonstrate the different ways
these are in use here in Australia. I'm really focussing more on the five
safes model and its application in Australia than I am specifically on
ADA. As I say, we are one component of the broader framework for
sensitive data access here.
Just to say - I mean what I really wanted to cover off here is thinking about sensitive data and the five safes model. I'll look at the different frameworks for sensitive data access in Australia and where you might find them, and then how we apply the five safes model at ADA in particular. Then, time permitting, I might say something briefly about the data life cycle and sensitive data as we go through.
I wanted to just pick up on the ANDS definition here of sensitive data. As I say - well, I'll frame this in the context of ADA - most of what we deal with at ADA has, at some point in its life cycle, been sensitive data. More often than not it's information that's collected from humans, often with some degree of identifiability, at least at the point of data collection, not necessarily the point of distribution.
A lot of what we deal with - and this is true for a lot of social science archives - would fall into the class of sensitive data. There's a distinction we would probably draw, though, between what we get and what we distribute. In terms of our definition here - this is in the handout, I think, that's in the handout section, and it's available online - sensitive data are data that can be used to identify an individual, species, object, process or location, introducing a risk of discrimination, harm or unwanted attention.
We tend to think in terms of human risks more than anything else, the
risk to humans and individuals, but it does apply in other cases as
well. For example, the identification of sites for Indigenous art might,
in and of itself, lead more people to want to go and visit that location
and, in a sense, destroy the thing that you're actually trying to protect;
so the more visits that they actually get, the more degraded the art
itself becomes. It doesn’t just hold for human research, but that’s
probably our emphasis at ADA.
Just to reiterate the five safes again, we talk about five things: people,
projects, settings, data and outputs and the reference here, as I say.
Down the bottom you can look at the document that Felix Ritchie and
two of his colleagues developed, framing out the five safes model.
What I would say about this is that it's been adopted directly by the UK Data Service. That's where it has its origins. The basic principles are applied in a lot of the social science data archives, and it's now actually been adopted by the Australian Bureau of Statistics as well. Their framework - they're thinking about the output of different types of publications - literally [unclear] this model. [Unclear] it's quite a useful framework for talking about this.
I'm going to take a slightly different approach to George in thinking
about how we think about what we're worried about. I'm going to take -
as a depositor you worry about the risk of disclosure. As a researcher,
what's the flip side of that? Why do we need access to sensitive
data? What does it provide?
The National Science Foundation, about four or five years ago, put out a call around how we could improve access to microdata, particularly from Government sources. It highlights why we talk about the need for access - the sorts of research you can do. This comes from a submission by [David Card], Raj Chetty and several economists in the US and elsewhere.
They were highlighting what's needed. Direct access is really the critical thing here, direct access to microdata - by microdata we mean information about individuals, line by line. Aggregate statistics, synthetic data - where we create fake people, as it were - [or] submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to answer policy questions in particular. A lot of social policy research in particular is focussed in this way.
In order to do certain things, access to this data is necessary. How do we facilitate that, taking account of the sorts of concerns that have been raised? [On site] that is. How do people expect to access it? This was an interesting blog post from a researcher previously based at the University of Canterbury, comparing how you access US census data versus the New Zealand census - and, similarly, you could say the same for the Australian census as well.
In the US you can get a one per cent sample of the census and you just go and download a file directly. It's open as what's called a public use microdata file. Those are directly available. In New Zealand, there's a whole series of instructions you have to go through. You might be subject to data use agreements. You might be subject to an application process, et cetera, et cetera.
He's criticising, saying it should be much easier, it should be the US
model that’s appropriate here rather than the New Zealand model.
What we're really probably talking about here is that both are
appropriate depending upon the sorts of detail, the sorts of identifying
information that are available. Both might be valid models. They just
allow you to do different things.
The first model really focusses on, in a sense, masking the data to some degree - some of the safe data approaches that George talked about. The other uses other aspects of the safes model to address confidentiality concerns.
What you also find is researchers understand these concerns, but there has to be some trade-off. The need for confidentiality is recognised and understood, and there may well be - there ought to be - trade-offs in return for that. For example, Card and his colleagues suggest that there is a set of criteria that you could put in place for enabling some form of access to sensitive microdata. They reference access through local statistical offices, through some remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for that access should be appropriate monitoring.
So there is a recognition that these - I mean this is just one possible
approach, but a recognition that access brings with it responsibilities
and appropriate checks and balances. What I want to talk about is
how has that eventuated in Australia, what do we see? [This bubble
here].
The sorts of models that we see here in Australia - I've broken them
out here broadly. I'd say four broad areas, but the one that people are
probably most familiar with is the ABS, the Australian Bureau of
Statistics. They have a number of systems and access methods that
suit different types of safe profiles.
These include what's called confidentialised unit record files, or CURFs; the remote access data lab, which is one of their online execution systems; and an on-site data lab. You can go to the bowels of the ABS buildings - certainly in Canberra and, I believe, in other states as well - and do on-site processing.
Then they have other systems. Probably the best known of these is what's called TableBuilder, which is an online data aggregation tool that does safe data processing on the fly. Our emphasis at ADA is primarily on these confidentialised unit record files, so we provide unit record access and some aggregated data access as well.
Then we have the remote execution - or remote analysis - environments. I put under this model the Australian research infrastructure network [for geographic] data access in particular. The Secure Unified Research Environment produced by the Population Health Research Network is an example of George's remote access environment as well, and even the data linkage facilities - another part of the PHRN network - fit to some degree under this type of secure access model. That's, in a sense, a more extreme version of that.
Then we have other ad hoc arrangements as well; things like the
physical secure rooms. A number of institutions have a secure space.
There are a number here at ANU, for example. Then you might have
other departmental arrangements as well that exist. We can probably
classify those in terms of the distinction in the type of approaches that
we have.
What I've done here is just a very simple assessment, from not at all to a very strong yes, of how far each fits within - addresses - each safe element, from low to high. I have some question marks on some of the facilities, particularly [SURE and the] linkage facilities, not because I don't think they can do it; it's that I don't have enough information to make an assessment there.
If you look at the different types, things like the ABS models have tended towards safe data - those sorts of confidentialisation and [unclear] routines - output checking and secure access models. Tabulations are a secure access model as well. They've tended less towards safe people and safe projects, so less checking of people and checking of projects.
In a lot of cases there's more trust in the technology than there is in the people using the technology, which I think is a little bit problematic, given that - and I'm going to talk to this - there are some fairly good processes in Australia for actually assessing the quality of people in particular and, to some extent, the projects.
This is - again, we can profile these - the point I'm making here is that you have different alternatives for how you might make sensitive data available. There's not one solution. It's: what's the mix of things that I might do? I'll come back to that at the end.
In the Australian experience, [as I say], we have a strong emphasis on safe data. We came up with a term in Australia - confidentialisation. That's probably the term you'll see most regularly. Anywhere else in the world you would hear the term anonymisation. I'm not quite sure why this is the case but, as I say, in Australia we tend to use the term confidentialisation.
The Australian Data Archive uses this model. The ABS and the Department of Social Services - things like the Household, Income and Labour Dynamics in Australia survey - use anonymisation techniques as the starting point. You can make data safe before you release it. It has its limitations. A good example of that is some of the data sets that were released into the data.gov environment using anonymisation, where safe data was the priority. There's the potential for it to be reverse engineered - if you haven't done your anonymisation properly, it could be reversed and you get a safe data risk, so it has its flaws.
This is why we tended towards looking at a combination of
techniques. As George pointed out, [unclear] if the risk of actually
being identified is low - and particularly the harm that comes from that
is low - then it may be the case that this is sufficient. Certainly, a lot of
the content that we have at ADA, most of our emphasis is actually on
the safe data more than anything else.
Safe settings: we do have - as I say, in examples here, tabulation
systems, things where you can do cross-tabs online are fundamentally
a safe settings model. People don’t get access to the unit record data.
They just get access to the systems to produce outputs.
Remote access systems - the remote access data lab, the PHRN SURE system, and a new system that the ABS are bringing on, their remote data lab, making their data labs available in a virtual environment (that's in a pilot stage that we're working with them on at the moment) - are increasingly being used as well. There are also the secure environments, as I mentioned - the on-site data lab and the secure rooms.
Safe outputs: a number of the safe settings environments - because
they tend to use highly sensitive data - have safe output models as
well. The real problem has been with these in scaling them. It requires
manual checking more often than not. Reviewing the output of these
sorts of systems, that requires people. That requires time. It's hard to
automate as well.
The ABS have invested a lot of money into automating output checking, in point of fact. Their TableBuilder system is one of the best that's around, but their new remote lab still has manual checking of outputs. It depends on what you're trying to do and the sorts of outputs you're producing as to whether you can actually automate the checking as well.
The other side of this that I think will become increasingly relevant too
is the replication and reproducibility elements of things that come out
of systems like this. How are we going to facilitate the replication
models within those environments? I'm not sure that question's been
addressed yet.
Safe researchers and safe projects in Australia - to be frank, they are considered in most models, but they're not really closely monitored. That's because they're difficult to monitor. How do you follow the extent to which people follow the things that they've signed up to? Anyone who's involved in reporting of research outputs for ERA or anything will know that getting people to actually fill out the forms recording what they have produced is hard. Filling out forms to say I have complied with a data use agreement is [unclear].
That said, we do have some checks and balances that are there. Certainly, the ethics models and the codes of conduct for research do provide some degree of vetting [assurance] for those that go through that sort of system. We have some checks and balances in place - particularly for university researchers - to address those sorts of concerns. I think an increasing emphasis on safe researchers and projects might be one that we can leverage a bit more.
As I say, the frameworks we have in place - the Australian code of conduct, and increasingly professional association and journal requirements for data sharing as well - are going to put a degree of assessment on the sorts of practices we use. In America it's the Economic Association, the DA-RT agenda in political science, [unclear] data sharing - these are actually a mechanism also for partly assessing the sharing of data, but also assessing the extent to which you'll, essentially, [unclear] as well. That's something to be considering in the future.
I'll quickly turn to the ADA model and then wrap up. The ADA model - as I say, our emphasis is primarily on safe data. Data is anonymised, and it tends to be done in advance by the agencies or the researchers that provide data to us. We will also do some review of content as well. We'll provide recommendations back to our depositors as to the sorts of things you'll probably want to think about - in terms of, have you included things like postcodes or occupational information?
If I know someone's postcode, their occupation and their age, there's a fair chance that I can identify them in many cases - in remote locations in Australia in particular. There are some basic checks you can do. Certainly, safe people and safe settings - our data access is almost all mediated. You must be identified. You must provide contact information and [supervise …].
We do some checking on safe people, and we're provided with information on project descriptions - what do you intend to do with the data - as well, particularly where we have more sensitive content. Often that's a requirement from our depositors.
We don't really operate, frankly, in safe settings and safe outputs. That's not the space that we work in. We work with other agencies such as the ABS. Where there's access to certain sensitive content we'll point people to the relevant locations - where you've got highly sensitive content that you want to make available.
As I say, something like the remote data lab - where is its focus? They focus less on safe data. They're a virtual enclave. They don't prohibit the use of safe data practices, but where you have highly sensitive data there's a more dedicated assessment process on the project and the outcomes. Highly safe settings, sitting at the ABS. The problem is the cost they have in establishing the system itself. They vet all of the outputs. It has costs associated with it.
They have safe people. [There is] training for researchers prior to accessing the system. There is some challenge in assessing the backgrounds of people, for example. This is where the need for domain experts comes in - if you're going to fully assess people and projects, you're going to assess their domain expertise, and you need domain experts to be able to [unclear] that sort of evaluation.
The emphasis might well be on: are you using appropriate techniques, are you maintaining secure facilities, and what does the research [plan] itself look like - more the emphasis than the quality of the science. That's much harder to evaluate.
Safe projects: that has been used in some places at the ABS. Sometimes it's required for legislative reasons. The extent of data release is itself dependent upon meeting a public good statement, for example. One of the questions for the future for some organisations is: should this matter? Basic research itself might generate useful insights that you didn't expect. As I say, in some cases, again, you're probably going to be moving the levers, focusing on different aspects of the safes environment.
I guess the message we want to put through here is, certainly, there is a suite of options available to you for accessing sensitive data. Different models exist and they have different ranges of the five safes. You can certainly incorporate safe people models. [Unclear], a lot of models focus on the expectation that we have an intruder - hackers are coming in to access our system. Actually, what tends to be the case, more often than not, is the silly mistakes: I made a mistake by leaving my laptop on the train or leaving my USB in the computer lab. That's far more common.
We have - we tend to try and profile to default options in terms of our
mix of safe settings but, as I say, there are options available to you.
What you have to think about is what's appropriate for the [form
of/formal] data that you're trying to work with. Fundamentally, the
argument is that principles should enable the right mix of safes for a
given data source.
Kate LeMay: Thank you very much, Steve. It was a really great overview of the different ways that the five elements of the safes can be mixed, using different [settings]. I thought it was really interesting that both of you mentioned that a safe location was in a basement [laughs]. I've just got these images of people locked up in basements.
I also wanted to note that George mentioned data masking and using de-identification methods. Steve also mentioned confidentialisation and anonymisation. They're similar words for similar processes. ANDS has a de-identification guide available on our website now. If you're interested, it's more detailed than that information. We have our guide there that you can have a look at.
I was also wondering about - George, you were talking about with the
data protection plan and the data use agreement that the onus is on
the institution, that if someone breaks it, that they need to put them
through some sort of research integrity investigation or something like
that. If that doesn’t happen is there any potential recourse for the
university? Would ICPSR turn around and say you didn’t follow this
process, you're not going to be accessing any of our data anymore?
George Alter: Sure. Actually, on our website we actually list the levels of escalation
that [we'll have to] go to. We can certainly cut off the institution from
the - from access to ICPSR data, but what is - what really gets
people's attention is that our - the National Institutes of Health in the
US has an Office of Human Research Protections. If we thought that
someone was breaching one of our agreements and endangering the
confidentiality of research subjects, I would report them to that office.
That office has a lot of power. They regularly publish the names of
bad actors. What's more, they can cut off all NIH funding to
universities. They have done that in the past when they thought that
protections weren’t in place. I always think of that as the nuclear
option. I know for a fact that university administrations and their
trustees and agents are terrified that NIH will do something like that.
Just waving that in front of a university compliance officer gets their
attention.
Kate LeMay: Steve, I was wondering, with your - the Australian Data Archive, with
the use agreement, that people are signing that - is that with the
individual user or with the institution, as it is with…
[Over speaking]
Steve McEachern: Primarily, it's with the individual. We have a small number of
organisational agreements, but not many. There is - I would say
there's more [unclear] - yeah, pinning a focus on an agreement
between the individual and the organisation rather than at the
organisation level. Some organisations do ask for them but, frankly,
it's more actually for pragmatic reasons than it is for compliance
reasons - is that they will want to host the content and manage access
by requesting access to a particular data set for all members of their
research team, for example. It just makes that easier, as it were.
There are other models. As I say, the ABS model is - actually, the
agreement is with the institution. Then individuals sign up to the
institutional agreement. The Department of Social Services model is
the same as well. It will be interesting to see the extent to which we
move in one direction or another. I'd say I think the compliance
argument hasn’t been one that’s been all that common here in
Australia. It's actually been - except in the case of where you have
Government data. I would say it's probably [unclear] situation. For
academic produced data it hasn’t tended to be an emphasis.
Kate LeMay: With George's agreement with institutions where the - the recourse is
that the institution should then have some integrity investigation - what
level of recourse do you have with…
[Over speaking]
Steve McEachern: [Limited].
Kate LeMay: …with the individual?
Steve McEachern: Limited. I mean we would probably report back to the institution to which they belong. [As I say], we do have the question of supervisory arrangements. We would probably also follow some of the questions under the code of conduct for research. That's why I make reference back to the fact that there is an overarching set of obligations on those within Australian academic institutions. We'd pursue something in that way.
One of the challenges for us - and I'm going to guess for George as well - is just finding out where you get breaches of compliance. One of the hardest things to do is actually find out what happened in the first place. We've had one case that I'm aware of - certainly in my predecessors' time - which goes back to the late '90s. It's not a common occurrence, but we're aware of it.
Kate LeMay: George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement? Or is it more informal and gaining momentum?
George Alter: The example I gave is the Databrary project. They're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution and then that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them.
This is something that I think other groups like ICPSR should move to.
Right now it's a big problem that about one in six of our data use agreements at ICPSR involve a negotiation between lawyers at the University of Michigan and lawyers at the other institutions. It's a major
cost. I think it's one of the ways to go.
Steve McEachern: I would say - I mean in Australia we have a pretty strong example, which is the Universities Australia ABS agreement. I mean that model facilitates a whole lot of things. It's enabled access to the broad collection of ABS CURF data under a single agreement. The other side is universities sign up for the cost that comes with that as well. They're paying a fee for that but, as I say, it covers the broad spectrum of what they can do. The challenge in some cases is what [unclear] have you got for dissemination of the content?
As I say, if I went to the [next department] - I've had this discussion with various departments - could we establish a consistent data access agreement? It's because the departments themselves are set up under different models - under different legislation, sorry. The impact of that is they can't necessarily have the same set of conditions. Certainly, there is some capacity to [unclear] some of that and, I'd venture, to see the extent to which the Productivity Commission report that's coming on data access might address some of those questions as well.
Kate LeMay: Just quickly, there's a question about are there any checklists or
guidelines for new researchers to assess their research surveys for
the level of confidentiality? I think that they're talking about privacy risk
assessments.
Steve McEachern: This is - actually, we have an internal checklist. This is something
we've talked about in terms of thinking about whether you - what you
need to do in terms - but it really depends on publication. We talked
before about the fact that in order to do certain research you need to
have actually some things that [might be identified], so it depends on
which point in the data life cycle you're actually talking about here.
When we're thinking about data release, then you - as I say, we would
basically apply some basic principles for - these are the sorts of things
that we look for. Actually, we've talked about making that checklist
available, in terms of these are the sorts of things you have to be
concerned about.
There is advice around what we could probably bring together but, as
I say, the - it's this usability versus confidentiality question again. One
of the things we sometimes do is we split off those things that have a
high confidentiality risk. We actually release [several] different sets of
data, so that if you need that additional information you can actually
make that available under a separate - additional set of requirements,
possibly under a different technological setting.
I think it depends a little bit on when in the life cycle you are talking about here. It often is useful to have as much information as possible - for example, if you're running a longitudinal study you must have identifying information going forward. You've got to be able to contact someone the next time round. It depends on what you're trying to achieve but, yeah, there is some basic advice that we put out.
George Alter: There's a literature that's been used by statistical agencies about what [unclear], but that whole area is right now somewhat contentious because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When
the data are available online and you can do repetitive iterative
operations on them, you're in a new world. There's a separate
literature that’s developed in the computer science world.
Anyway, it is a problem. There is guidance out there in really complex
areas like in some health care areas. Doing a full assessment of a
data set can be very complicated and difficult, so I think my
recommendation is that people start at the basics and think about how
would you identify this person, and if this information got out what
harm would it cause? Often the researchers themselves have a good
sense of that from the research they're doing.
Kate LeMay: There's one last question: are the five safes applicable in all research
disciplines? Or are they specifically limited to suit the social sciences?
Steve McEachern: I think they're broadly applicable.
Kate LeMay: I agree.
Steve McEachern: I mean it's interesting. We were having a discussion here about the social sciences. For example, we work a lot with health sciences [unclear] environmental sciences [unclear]. I don't see any reason why they shouldn't be applied elsewhere. I mean part of the question actually is - it's more [unclear] about what you have to think about in terms of the privacy and confidentiality risks, far more so than what the topic is.
The topic helps you make some sort of judgement about the harm, in
George's terms, but yeah, it's the confidentiality questions that…
[Over speaking]
Kate LeMay: The framework is [unclear].
Steve McEachern: Yeah.
George Alter: Oh yeah.
Kate LeMay: Fabulous. Thank you very much to George and Steve for coming
along to our webinar today, and thank you everyone for calling in…
END OF TRANSCRIPT

 
Cyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse
Cyber Summit 2016: Privacy Issues in Big Data Sharing and ReuseCyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse
Cyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse
 
Ho3313111316
Ho3313111316Ho3313111316
Ho3313111316
 
big-data-and-data-sharing_ethical-issues.pdf
big-data-and-data-sharing_ethical-issues.pdfbig-data-and-data-sharing_ethical-issues.pdf
big-data-and-data-sharing_ethical-issues.pdf
 
How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...
How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...
How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...
 
Karenclassslides
KarenclassslidesKarenclassslides
Karenclassslides
 
Lecture series: Using trace data or subjective data, that is the question dur...
Lecture series: Using trace data or subjective data, that is the question dur...Lecture series: Using trace data or subjective data, that is the question dur...
Lecture series: Using trace data or subjective data, that is the question dur...
 
Netnography and Research Ethics: From ACR 2015 Doctoral Symposium
Netnography and Research Ethics: From ACR 2015 Doctoral SymposiumNetnography and Research Ethics: From ACR 2015 Doctoral Symposium
Netnography and Research Ethics: From ACR 2015 Doctoral Symposium
 
BIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docx
BIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docxBIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docx
BIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docx
 

Mais de ARDC

Mais de ARDC (20)

Introduction to ADA
Introduction to ADAIntroduction to ADA
Introduction to ADA
 
Architecture and Standards
Architecture and StandardsArchitecture and Standards
Architecture and Standards
 
Data Sharing and Release Legislation
Data Sharing and Release Legislation   Data Sharing and Release Legislation
Data Sharing and Release Legislation
 
Australian Dementia Network (ADNet)
Australian Dementia Network (ADNet)Australian Dementia Network (ADNet)
Australian Dementia Network (ADNet)
 
Investigator-initiated clinical trials: a community perspective
Investigator-initiated clinical trials: a community perspectiveInvestigator-initiated clinical trials: a community perspective
Investigator-initiated clinical trials: a community perspective
 
NCRIS and the health domain
NCRIS and the health domainNCRIS and the health domain
NCRIS and the health domain
 
International perspective for sharing publicly funded medical research data
International perspective for sharing publicly funded medical research dataInternational perspective for sharing publicly funded medical research data
International perspective for sharing publicly funded medical research data
 
Clinical trials data sharing
Clinical trials data sharingClinical trials data sharing
Clinical trials data sharing
 
Clinical trials and cohort studies
Clinical trials and cohort studiesClinical trials and cohort studies
Clinical trials and cohort studies
 
Introduction to vision and scope
Introduction to vision and scopeIntroduction to vision and scope
Introduction to vision and scope
 
FAIR for the future: embracing all things data
FAIR for the future: embracing all things dataFAIR for the future: embracing all things data
FAIR for the future: embracing all things data
 
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian DuncanARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
 
Skilling-up-in-research-data-management-20181128
Skilling-up-in-research-data-management-20181128Skilling-up-in-research-data-management-20181128
Skilling-up-in-research-data-management-20181128
 
Research data management and sharing of medical data
Research data management and sharing of medical dataResearch data management and sharing of medical data
Research data management and sharing of medical data
 
Findable, Accessible, Interoperable and Reusable (FAIR) data
Findable, Accessible, Interoperable and Reusable (FAIR) dataFindable, Accessible, Interoperable and Reusable (FAIR) data
Findable, Accessible, Interoperable and Reusable (FAIR) data
 
Applying FAIR principles to linked datasets: Opportunities and Challenges
Applying FAIR principles to linked datasets: Opportunities and ChallengesApplying FAIR principles to linked datasets: Opportunities and Challenges
Applying FAIR principles to linked datasets: Opportunities and Challenges
 
How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018
 
Ready, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
Ready, Set, Go! Join the Top 10 FAIR Data Things Global SprintReady, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
Ready, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
 
How FAIR is your data? Copyright, licensing and reuse of data
How FAIR is your data? Copyright, licensing and reuse of dataHow FAIR is your data? Copyright, licensing and reuse of data
How FAIR is your data? Copyright, licensing and reuse of data
 
Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018
 

Último

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Último (20)

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 

Managing and publishing sensitive data in the social sciences - Webinar transcript

George Alter: Sometimes we interview adolescents about drug use or sexual behaviour, and we promise them that their parents won't see it and so on.
In the data archiving world we often talk about two kinds of identifiers. There are direct identifiers - things like names, addresses and social security numbers - many of which are unnecessary, but some types of direct identifiers - such as geographic locations or genetic characteristics - may actually be part of the research project.

Then the most difficult problem, often, is the indirect identifiers; that is to say, characteristics of an individual that, when taken together, can identify them. We refer to this as deductive disclosure, meaning that it's not obvious directly, but if you know enough information about a person in a data set, then you can match them to something else. Frequently we're concerned that someone who knows that another person is in the survey could use that information and find them, or that there is some other external database where you could match information from the survey and re-identify a subject.

Deductive disclosures are often dependent on contextual data. If you know that a person is in a small geographic area, or you know that they're in a certain kind of institution, like a hospital or a school, it makes it easier to narrow down the field over which you have to search to identify them. Unfortunately, in the social sciences, contextual data has become more and more important. People now are very interested in things like the effect of neighbourhood on behaviour and political attitudes, or the effect of available health services on morbidity and mortality. There are a number of different kinds of contextual data that can affect deductive disclosure.

We're in a world right now where social science researchers are increasingly using data collections that include items of information that make the subjects more identifiable. For example, people studying the effectiveness of teaching often have data sets that have the characteristics of students, teachers, schools and school districts. Once you put all those things together it becomes very identifiable.
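To make the idea of deductive disclosure concrete, here is a minimal sketch in Python of the kind of check a data producer or archive might run before release: it simply counts how many records share each combination of indirect identifiers, and flags combinations held by only one person. This example is not from the webinar; the column names and values are invented for illustration.

```python
# Minimal deductive-disclosure check: count how many records share each
# combination of indirect (quasi-)identifiers. Columns are illustrative only.
import pandas as pd

def unique_combination_report(df: pd.DataFrame, quasi_identifiers: list[str]) -> pd.DataFrame:
    """Return each quasi-identifier combination with the number of records sharing it."""
    return (
        df.groupby(quasi_identifiers, dropna=False)
          .size()
          .reset_index(name="n_records")
          .sort_values("n_records")
    )

if __name__ == "__main__":
    survey = pd.DataFrame({
        "school":     ["A", "A", "B", "B", "B"],
        "year_level": [7, 7, 9, 9, 9],
        "age":        [12, 12, 14, 15, 14],
        "postcode":   ["2600", "2600", "2913", "2913", "2913"],
    })
    report = unique_combination_report(survey, ["school", "year_level", "age", "postcode"])
    # Any combination shared by only one record is a re-identification risk.
    print(report[report["n_records"] == 1])
```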
We at ICPSR - and, I think, the social science data community in general - have taken up a framework for protecting confidential data that was originally developed by Felix Ritchie in the UK, which talks about ways to make data safe. I'm going to go through these points, but Ritchie talks about safe data, safe projects, safe settings, safe people and safe outputs. The idea is not that any one approach solves the problem, but that you can create an overall system that draws on all of these different approaches and uses them to reinforce each other.

Safe data means taking measures that make the data less identifiable. Ideally, that starts when the data are collected. There are things that data producers can do to make their data less identifiable. One of the simplest is to mask the geography. If you're doing interviews, it's best to do the interviews in multiple locations; that adds to the anonymisation of your interviewees. Or, if you're doing them in only one location, you should keep the information about the location as secret as possible.

Once the data have been collected, research projects have for many years used a lot of different techniques to mask the identity of individuals. The most common one is what's called top coding: if you ask your subjects about their incomes, the people with the highest incomes are going to stand out in most cases, so usually you group them into a category such as people with incomes above $100,000, or something like that, so that there is not just one person at the very top but a group of people, which makes them more anonymous.

This list of techniques - which goes from aggregation approaches to actually affecting the values - is ordered by the amount of intervention involved. Some of the more recently developed techniques actually involve adding noise, or random numbers, to the data itself, which tends to make it less identifiable, but it also has an impact on what you can do with the data.
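As an illustration of the two "safe data" treatments just mentioned, here is a minimal sketch in Python (not shown in the webinar) of top coding an income variable and of adding random noise. The $100,000 threshold echoes the example above; the noise scale and all values are invented and are not recommendations.

```python
# Minimal sketch of two safe-data treatments: top coding and noise addition.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

incomes = pd.Series([32_000, 54_000, 61_000, 87_000, 250_000, 1_200_000], name="income")

# Top coding: everyone above the threshold is reported at the threshold,
# so the highest earners no longer stand out as single records.
TOP_CODE = 100_000
top_coded = incomes.clip(upper=TOP_CODE)

# Noise addition: perturb each value slightly; this reduces identifiability
# but also changes what analyses the released data can support.
noisy = incomes + rng.normal(loc=0, scale=0.05 * incomes.std(), size=len(incomes))

print(pd.DataFrame({"original": incomes, "top_coded": top_coded, "with_noise": noisy.round()}))
```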
Safe projects means that the projects themselves are reviewed before access is approved. At most data repositories, when the data need to be restricted because of sensitivity, we ask the people who apply for the data to give us a research plan. That research plan can be reviewed in several different ways. The first two are things that we do regularly: at ICPSR we ask, first of all, do you really need the confidential information to do this research project, and, if you do need it, would this research plan identify individual subjects? We're not in the business of helping marketers identify people for target marketing, so we would not accept a research plan that did that. There are also repositories that actually look at the scientific merit of a research plan, but to do that you need to have experts in the field who can help you.

Safe settings means putting the data in places that reduce the risk that it will get out. I'm going to talk here about four approaches. The first one is data protection plans. For data that need to be protected, but where the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and a data use agreement - which I'll come to in a couple of minutes. The data protection plan specifies how they're going to protect the data. Here's a list of things that we worry about that one of my colleagues at ICPSR made up. One of the things we ask people is: what happens if your computer is stolen? How will the confidential data be protected? There are a number of things that people can do, like encrypting their hard disk or locking their computers in a closet when they're not being used, that can address these things. I think that data protection plans need to move to a general consideration of what it is that we're trying to protect against, and allow the users to propose alternative approaches, rather than saying oh, you have to use this particular software or this or that. We have to be clear about what we're worried about.

A couple of notes about data security plans: they are often difficult, partly because of the approach that has been taken in the past, and also because researchers are not computer technicians and we're often giving them confusing information. One of the ways that I think universities - in the US at least - are going to move beyond this is by developing their own protocols with different levels of security for different types of problems. At each level they specify the kinds of measures that researchers need to take to protect data at that level of sensitivity. From my point of view, as a repository director, any time the institutions provide guidance it's a big help to us.

The other way to make the data safe by putting it in a safe setting is to control access. There are three main ways that repositories control access. One kind of system is what I'd call a remote submission and execution system, where the researcher doesn't actually get access to the data directly. They submit program code or a script for a statistical package to the data repository; the repository runs the script on the data and then sends back the results. That's a very restrictive approach, but it's very effective. More recently, however, a number of repositories and statistical agencies have been moving to virtual data enclaves. These enclaves - which I'll illustrate briefly in a minute - use technologies that isolate the data and provide access remotely but restrict what the user can do.

The most restrictive approach is actually a physical enclave. At ICPSR we have a room in our basement with computers that are isolated from the internet. We have certain data sets that are highly sensitive. If you want to do research with them, you can, but on the way into the enclave we're going to go through your pockets to make sure you're not trying to bring anything in, and [on] the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave. The disadvantage of a physical enclave is that you actually have to travel to Michigan to use those data, which can be expensive. That's the reason that a number of repositories are turning to virtual data enclaves.

This is a sketch of what the technology looks like. What happens is that you, as a researcher, log on over the internet to a site that connects you to a virtual computer. That virtual computer has access to the data, but your desktop machine does not. You can only access the data through the virtual machine. At ICPSR we actually use this system internally for our data processing to provide an additional level of security. We talk about the virtual data enclave, which is the service we provide to researchers, and the secure data environment, which is where our staff work when they're working on sensitive data.

It's a little bit of a let-down, but this is what it actually looks like. The window that's open there with the blue background is our virtual data enclave; I've opened a window for [unclear] inside there. The black background is my desktop computer. If you look closely, you'll see in the corner of the blue box the usual Windows icons, and that's because when you're operating remotely in the virtual enclave you're using Windows. It looks just like Windows and acts just like Windows, except that you can't get to anything on the internet. You can only get to the things that we provide, for a level of security. On top of that, the software that's used - we use [VMware] software, but there are other brands that do the same thing - essentially turns off your access to your printer and your access to your hard drive or USB drive, so you cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there, but because you have that capability we also restrict people with a data use agreement.

That's my next topic: how do you make people safer? The main way that we make people safer is by making them sign data use agreements or by providing them with training. The data use agreements used at ICPSR are, frankly, rather complicated. They consist of the research plan, as I mentioned before; we require people to get IRB approval for what they're doing; a data protection plan, which I mentioned; and then there are the additional things of behavioural rules, security [pledges] and an institutional signature, which I'll mention now.

If you look at the overall process of doing research, there are a number of legal agreements that get passed back and forth. It actually starts with an agreement made between the data collectors and the subject, in which they provide the subjects with informed consent about what the research is about and what they're going to be asked. It's only after that that the data go from the subject to the data producers. Then the data archive - such as ICPSR or ADA - reaches an agreement with the data producers in which we become their delegates for distributing the data. That's another legal agreement. Then, when the data are sensitive, we have to get an agreement from the researcher - and these are pieces of information we get from the researcher - and, in the United States, our system is that the agreement is actually not with the researcher but with the researcher's institution. At ICPSR, we're located at the University of Michigan, and all of our data use agreements are between the University of Michigan and some other university, in most cases. There are some exceptions. It's only after we get all of these legal agreements in place that the researcher gets the data.

One of the things in our agreements at ICPSR is a list of the types of things that we don't want people to do with the data. For example, we don't want someone to publish a cross-tabulation table where there's one cell that has one person in it, because that makes that person more identifiable. There's a list of these things - often we have 10 or 12 of them - that are really standard rules of thumb that statisticians have developed for controlling re-identification. The ICPSR agreements are also, as I said, agreements between institutions. One of the things that we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe that something has gone wrong, the institution agrees that they will investigate it based on their own policies about scientific integrity and protecting research subjects.

DUAs are not ideal; there's a lot of friction in the system. Currently, in most cases, a [PI] needs a different data use agreement for every data set, and they don't like that. In the future, I think we can reduce the cost of data use agreements by making institution-wide agreements where the institution designates a steward who will work with researchers at that institution. There's already an example of this: the Databrary project - a project in developmental psychology that shares videos - has done very good work on legal agreements. My colleague, the current director at ICPSR, Margaret Levenstein, has been working on a model where a researcher who gets a data use agreement for one data set can use that to get a data use agreement for another data set, so that individuals can be certified and that certification can be used in other places.

One of the things that I think we need to do more about is training. A number of places, like ADA, train people who get confidential data. We've actually done some work on developing an online tutorial about disclosure risk, which we haven't yet released, but it's, I think, something that should be done.

Finally, there's safe outputs. The last stage in the process is that the repository can review what was done with the data and remove things that are a risk to subjects. This only works if you retain control, so it doesn't work if you send the data to the researcher, but it does work if you're using one of these remote systems like remote submission or a virtual data enclave. Often this kind of checking is costly. There are some ways to automate part of it, but a manual review is almost always necessary in the end.

A last thing about the costs and benefits: obviously, data protection has costs. Modifying data affects the analysis. If you restrict access you're imposing burdens on researchers. Our view is that you need to weigh the costs against the risks that are involved. There are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? And, secondly, if that person was re-identified, what harm would result? We think about this as a matrix: as you move up you're getting more harm, and as you move to the right you're increasing the probability of disclosure. If the data set is low on both of these things - for example, if it's a national survey where 1,000 people from all over the United States were interviewed, we don't know where they're from, and we ask them what their favourite brand of refrigerator is - that kind of data we're happy to send out directly over the web without a data use agreement, with simple terms of use. But as we get more complex data with more questions, more sensitive questions, we often add some requirements in the form of a data use agreement to assure the data are protected. When we get to complex data where there is a strong possibility of re-identification and where some harm would result to the subjects, in that case we often add a technology component like the virtual data enclave. Then there are the really seriously risky and sensitive things. My usual example of this is a data set at ICPSR that was compiled from interviews with convicts about sexual abuse and other kinds of abuse in prisons. Those data are very easy to identify and very sensitive, so we only provide them in our physical enclave.

That's the end of my presentation. Thank you for your attention. We'll take questions later.

Kate LeMay: Great. Thank you, George. We'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at the Australian Data Archive.

Steve McEachern: My aim today is to build off what George has talked about, particularly taking the five safes model and looking at the situation in the Australian case. I'll talk about the Australian Data Archive and how we support sensitive data, but I want to put it in the context of the broader framework of how we access sensitive data in the Australian social sciences generally. I'm going to talk about some of the different options that are around, picking up on some of the alternatives George has discussed, and demonstrate the different ways these are in use here in Australia. I'm really focussing more on the five safes model and its application in Australia than on ADA specifically; as I say, we are one component of the broader framework for sensitive data access here.

What I really want to cover here is thinking about sensitive data and the five safes model. I'll look at the different frameworks for sensitive data access in Australia and where you might find them, and then how we apply the five safes model at ADA in particular. Then, time permitting, I might say something briefly about the data life cycle and sensitive data as we go through.
I wanted to pick up particularly on the ANDS definition of sensitive data here. I'll frame this in the context of ADA: most of what we deal with at ADA has, at some point in its life cycle, been sensitive data. More often than not it's information collected from humans, often with some degree of identifiability, at least at the point of data collection, if not necessarily at the point of distribution. A lot of what we deal with - and this is true for a lot of social science archives - would fall into the class of sensitive data, although there's a distinction we would draw between what we receive and what we distribute.

In terms of the definition here - this is the handout, I think, that's in the handout section, and it's available online - sensitive data is data that can be used to identify an individual, species, object, process or location, introducing a risk of discrimination, harm or unwanted attention. We tend to think in terms of human risks more than anything else - the risk to humans and individuals - but it does apply in other cases as well. For example, the identification of sites of Indigenous art might, in and of itself, lead more people to want to visit that location and, in a sense, destroy the thing that you're actually trying to protect; the more visits the site gets, the more degraded the art itself becomes. So it doesn't just hold for human research, but that's probably our emphasis at ADA.

Just to reiterate the five safes, we talk about five things: people, projects, settings, data and outputs. Down the bottom is the reference to the document that Felix Ritchie and two of his colleagues developed, framing out the five safes model. It's been adopted directly by the UK Data Service - that's where it has its origins - the basic principles are applied in a lot of the social science data archives, and it's now actually been adopted by the Australian Bureau of Statistics as well, in their framework for thinking about the output of different types of publications, [which] literally [uses] this model. [Unclear] it's quite a useful framework for talking about this.

I'm going to take a slightly different approach to George in thinking about what we're worried about. As a depositor you worry about the risk of disclosure. As a researcher, what's the flip side of that? Why do we need access to sensitive data? What does it provide? The National Science Foundation, about four or five years ago, put out a call around how we could improve access to microdata, particularly from government sources. It highlights why we talk about the need for access - the sorts of research you can do. This comes from a submission from [David Card], Raj Chetty and several other economists in the US and elsewhere. They were highlighting what's needed: direct access is really the critical thing here, direct access to microdata. By microdata we mean information about individuals, line by line. Aggregate statistics, synthetic data - we can create fake people, as it were - or submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to answer policy questions in particular. A lot of social policy research is focussed in this way. In order to do certain things, access to this data is necessary. How do we facilitate that, taking account of the sorts of concerns that have been raised? [On site], that is.

How do people expect to access it? This was an interesting blog post from a researcher based, previously, at the University of Canterbury, comparing how you access US census data versus the New Zealand census - and, similarly, you could say the same for the Australian census as well. In the US you can get a one per cent sample of the census and you just go and download a file directly; it's open as what's called a public use microdata file. Those are directly available. In New Zealand, there's a whole series of instructions you have to go through: you might be subject to data use agreements, you might be subject to an application process, et cetera, et cetera. He's criticising this, saying it should be much easier - it should be the US model that applies here rather than the New Zealand model. What we're really talking about is that both are appropriate, depending upon the sorts of detail, the sorts of identifying information, that are available. Both might be valid models; they just allow you to do different things. The first model really focusses on masking the data to some degree, as in some of the safe data approaches that George talked about. The other uses other aspects of the safes model to address confidentiality concerns.

What you also find is that researchers understand these concerns, but there has to be some trade-off. The need for confidentiality is recognised and understood, and there may well be - there ought to be - trade-offs in return for that. For example, Card and his colleagues suggest a set of criteria that you could put in place for enabling some form of access to sensitive microdata. They reference access through local statistical offices, remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for that access should be appropriate monitoring. So this is just one possible approach, but there is a recognition that access brings with it responsibilities and appropriate checks and balances.

What I want to talk about is how that has eventuated in Australia - what do we see? The sorts of models we see here in Australia I've broken out broadly into four areas. The one that people are probably most familiar with is the ABS, the Australian Bureau of Statistics. They have a number of systems and access methods that suit different types of safe profiles. These include what are called confidentialised unit record files, or CURFs; the remote access data lab, which is one of their online execution systems; an on-site data lab - you can go to the bowels of the ABS buildings, certainly in Canberra and, I believe, in other states as well, and do on-site processing; and then other systems, probably the best known of which is TableBuilder, an online data aggregation tool which does safe data processing on the fly. Our emphasis at ADA is primarily on confidentialised unit record files, so we provide unit record access and some aggregated data access as well.

Then we have the remote execution or remote analysis environments. Under this model I'd put the Australian research infrastructure network for [geographic] data access in particular. The Secure Unified Research Environment (SURE) produced by the Population Health Research Network is an example of George's remote access environment as well, and even the data linkage facilities - another part of the PHRN network - fit to some degree under this type of secure access model; that's, in a sense, a more extreme version of it. Then we have other ad hoc arrangements as well, things like physical secure rooms - a number of institutions have a secure space; there are a number here at ANU, for example - and other departmental arrangements that exist.

We can classify those in terms of the type of approaches they take. What I've done here is just a very simple assessment, from not at all through to a very strong yes, of how each addresses each safe element, from low to high. I have some question marks on some of the facilities, particularly SURE and the linkage facilities, not because I don't think they can do it; it's that I don't have enough information to make an assessment there.
If you look at the different types, things like the ABS models have tended towards safe data - those sorts of confidentialisation routines - together with output checking and secure access models; the tabulation systems are a secure access model as well. They have tended less towards safe people and safe projects - checking of people and checking of projects. In a lot of cases there's more trust in the technology than in the people using the technology, which I think is a little bit problematic, given that - and I'm going to come back to this - there are some fairly good processes in Australia for actually assessing the quality of people in particular and, to some extent, the projects. The point I'm making here is that you have different alternatives for how you might make sensitive data available. There's not one solution; it's a question of what mix of things you might do - and I'll come back to that at the end.

In the Australian experience, as I say, we have a strong emphasis on safe data. We came up with a term in Australia - confidentialisation - and that's probably the term you'll see most regularly here; anywhere else in the world you would hear the term anonymisation. I'm not quite sure why that is but, as I say, in Australia we tend to use confidentialisation. The Australian Data Archive uses this model, as do the ABS and the Department of Social Services - things like the Household, Income and Labour Dynamics in Australia (HILDA) survey use anonymisation techniques as the starting point. You can make data safe before you release it. It has its limitations, though. A good example is some of the data sets that were released into the data.gov environment, where anonymisation - safe data - was the priority. If you haven't done your anonymisation properly, there is the potential for it to be reverse engineered, and you get a disclosure risk, so it has its flaws.
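To illustrate the kind of reverse engineering being described, here is a minimal sketch in Python (not from the webinar) of a linkage attack: a release with names removed is joined to an external source on shared quasi-identifiers, and a unique combination re-attaches an identity to an "anonymised" record. All column names and values are invented for the example.

```python
# Minimal sketch of a linkage attack on a naively anonymised release.
import pandas as pd

released = pd.DataFrame({          # names removed before release
    "postcode":   ["0870", "2600", "2600"],
    "occupation": ["teacher", "nurse", "nurse"],
    "age":        [58, 34, 41],
    "sensitive_answer": ["yes", "no", "yes"],
})

external = pd.DataFrame({          # e.g. a public register or social profile
    "name":       ["R. Citizen"],
    "postcode":   ["0870"],
    "occupation": ["teacher"],
    "age":        [58],
})

# If the combination of postcode, occupation and age is unique in the release,
# the merge re-attaches a name to the supposedly de-identified record.
linked = external.merge(released, on=["postcode", "occupation", "age"], how="inner")
print(linked)
```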
This is why we have tended towards looking at a combination of techniques. As George pointed out, [unclear] if the risk of actually being identified is low - and particularly if the harm that comes from that is low - then it may be that this is sufficient. Certainly, for a lot of the content that we have at ADA, most of our emphasis is on safe data more than anything else.

Safe settings: we do have examples here. Tabulation systems - things where you can do cross-tabs online - are fundamentally a safe settings model; people don't get access to the unit record data, they just get access to the systems that produce outputs. Remote access systems - the remote access data lab, the PHRN SURE system, and a new system that the ABS are bringing on, their remote data lab, where they're making their data labs available in a virtual environment (it's in pilot stages that we're working with them on at the moment) - are increasingly being used as well. There are also the secure environments I mentioned: the on-site data lab and the secure rooms.

Safe outputs: a number of the safe settings environments - because they tend to hold highly sensitive data - have safe output models as well. The real problem has been scaling them. They require manual checking more often than not; reviewing the output of these sorts of systems requires people, and that requires time. It's hard to automate as well. The ABS have invested a lot of money into automating output checking, in point of fact - their TableBuilder system is one of the best that's around - but their new remote lab still has manual checking of outputs. It depends on the sorts of outputs you're producing as to whether you can actually automate the checking. The other side of this that I think will become increasingly relevant is the replication and reproducibility of things that come out of systems like this. How are we going to facilitate replication within those environments? I'm not sure that question's been addressed yet.

Safe researchers and safe projects in Australia - to be frank, they are considered in most models, but they're not really closely monitored, because they're difficult to monitor. How do you follow the extent to which people comply with the things they've signed up to? Anyone who's involved in reporting of research outputs for ERA or anything similar will know that getting people to fill out forms recording what they have produced is hard; filling out forms to say whether they have complied with a data use agreement is [unclear]. That said, we do have some checks and balances. Certainly the ethics models and the codes of conduct for research provide some degree of vetting [assurance] for those that go through that sort of system. We have some checks and balances in place, particularly for university researchers, to address these sorts of concerns. I think an increasing emphasis on safe researchers and safe projects might be one that we can leverage a bit more. As I say, the frameworks we have in place - the Australian code of conduct, and increasingly professional association and journal requirements for data sharing - are going to put a degree of assessment on the sorts of practices we use as well. In America it's the American Economic Association and the DART agenda in political science [unclear] data sharing; these are also a mechanism for assessing the sharing of data, but also the extent to which you'll, essentially, [unclear] as well. That's something to consider in the future.

I'll quickly turn to the ADA model and then wrap up. Our emphasis is primarily on safe data. Data is anonymised, and that tends to be done in advance by the agencies or researchers that provide data to us. We will also do some review of content, and we'll provide recommendations back to our depositors: these are the sorts of things you'll probably want to think about - have you included things like postcodes or occupational information? If I know someone's postcode, their occupation and their age, there's a fair chance that I can identify them in many cases, in remote locations in Australia in particular. There are some basic checks you can do.

Then, safe people and safe settings: our data access is almost all mediated. You must be identified. You must provide contact information and [supervise …]. We do some checking on safe people, and we require information on project descriptions - what do you intend to do with the data - particularly where we have more sensitive content. Often that's a requirement from depositors. We don't really operate, frankly, in safe settings and safe outputs; that's not the space that we work in. We work with other agencies such as the ABS, and where there's highly sensitive content that needs to be made available we'll point people to the relevant locations.

Something like the remote data lab has a different focus. It focusses less on safe data - it's a virtual enclave - though it doesn't prohibit the use of safe data practices. Where you have highly sensitive data there's a more dedicated assessment process on the project and its outcomes, and highly safe settings, sitting at the ABS. The problem is the cost of establishing the system itself, and they vet all of the outputs, which has a cost associated with it. They have safe people: [there is] training for researchers prior to accessing the system. There is some challenge in assessing the backgrounds of people, for example - this is where the need for domain experts comes in. If you're going to fully assess people and projects, and you're going to assess their domain expertise, you need domain experts to be able to [unclear] that sort of evaluation. The emphasis might well be on whether you are using appropriate techniques, whether you are maintaining secure facilities, and what the research [plan] itself looks like, more than on the quality of the science - that's much harder to evaluate.

Safe projects: that has been used in some places at the ABS. Sometimes it's required for legislative reasons - the extent of data release is itself dependent upon meeting a public good statement, for example. One of the questions for the future for some organisations is: should this matter? Basic research itself might generate useful insights that you didn't expect. As I say, in some cases you're probably going to be moving the levers, focusing on different aspects of the safe data environment.

I guess the message I want to put through here is that there is a suite of options available to you for accessing sensitive data. Different models exist and they cover different ranges of the five safes. You can certainly incorporate safe people models. [Unclear], a lot of models focus on the expectation that we have an intruder - hackers coming in to access our system. Actually, what tends to be the case, more often than not, is the silly mistakes: I made a mistake by leaving my laptop on the train, or leaving my USB in the computer lab. That's far more common. We tend to profile to default options in terms of our mix of safes but, as I say, there are options available to you. What you have to think about is what's appropriate for the [form of/formal] data that you're trying to work with. Fundamentally, the argument is that the principles should enable the right mix of safes for a given data source.

Kate LeMay: Thank you very much Steve. That was a really great overview of the different ways that the five elements of the safes can be mixed, using different [settings]. I thought it was really interesting that both of you mentioned that a safe location was in a basement [laughs]. I've just got these images of people locked up in basements.

I also wanted to note that George mentioned data masking and using de-identification methods, and Steve mentioned confidentialisation and anonymisation. They're similar words for similar processes. ANDS has a de-identification guide available on our website now; if you're interested, it has more detail than that information. We have our guide there that you can have a look at.

I was also wondering - George, you were talking about, with the data protection plan and the data use agreement, that the onus is on the institution: if someone breaks it, they need to put them through some sort of research integrity investigation or something like that. If that doesn't happen, is there any potential recourse for the university? Would ICPSR turn around and say: you didn't follow this process, you're not going to be accessing any of our data anymore?

George Alter: Sure. Actually, on our website we list the levels of escalation that [we'll have to] go to. We can certainly cut off the institution from access to ICPSR data, but what really gets people's attention is that the National Institutes of Health in the US has an Office of Human Research Protections. If we thought that someone was breaching one of our agreements and endangering the confidentiality of research subjects, I would report them to that office. That office has a lot of power. They regularly publish the names of bad actors. What's more, they can cut off all NIH funding to universities. They have done that in the past when they thought that protections weren't in place. I always think of that as the nuclear option. I know for a fact that university administrations and their trustees and agents are terrified that NIH will do something like that. Just waving that in front of a university compliance officer gets their attention.
Kate LeMay: Steve, I was wondering, with the Australian Data Archive's use agreement that people are signing - is that with the individual user or with the institution, as it is with…

[Over speaking]

Steve McEachern: Primarily, it's with the individual. We have a small number of organisational agreements, but not many. There is - I would say there's more [unclear] - yeah, the focus is on an agreement between the individual and the organisation rather than at the institutional level. Some organisations do ask for them but, frankly, it's more for pragmatic reasons than for compliance reasons: they want to host the content and manage access by requesting access to a particular data set for all members of their research team, for example. It just makes that easier, as it were.

There are other models. As I say, in the ABS model the agreement is actually with the institution, and then individuals sign up to the institutional agreement. The Department of Social Services model is the same as well. It will be interesting to see the extent to which we move in one direction or another. I'd say the compliance argument hasn't been one that's been all that common here in Australia, except in the case where you have Government data. I would say it's probably [unclear] situation. For academically produced data it hasn't tended to be an emphasis.

Kate LeMay: With George's agreement with institutions, the recourse is that the institution should then have some integrity investigation - what level of recourse do you have with…

[Over speaking]

Steve McEachern: [Limited].

Kate LeMay: …with the individual?

Steve McEachern: Limited. I mean we would probably report back to the institution to which they belong. [As I say], we do have the question of supervisory arrangements. We would probably also follow some of the questions
under the code of conduct for research. That's why I make reference back to the fact that there is an overarching set of obligations on those within Australian academic institutions. We would pursue something in that way.

One of the challenges for us - and I'm going to guess for George as well - is just finding out where you get breaches of compliance. One of the hardest things to do is actually find out what happened in the first place. We've had one case that I'm aware of - certainly in my predecessors' time - which goes back to the late '90s. It's not a common occurrence, but we're aware of it.

Kate LeMay: George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement? Or is it more informal and gaining momentum?

George Alter: The example I gave is the Databrary project. They're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution and then that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them. This is something that I think other groups like ICPSR should move to. Right now it's a big problem that about one in six of our data use agreements at ICPSR involves a negotiation between lawyers at the University of Michigan and lawyers at the other institution. It's a major cost. I think it's one of the ways to go.

Steve McEachern: I would say - I mean in Australia we have a pretty strong example, which is the Universities Australia-ABS agreement. That model facilitates a whole lot of things. It's enabled access to the broad collection of ABS CURF data under a single agreement. The other side is that universities sign up for the cost that comes with that as well. They're paying a fee for that but, as I say, it covers the broad spectrum of what they can do. The challenge in some cases is what [unclear] have you got for dissemination of the content?
As I say, if I went to the [next department] - I've had this discussion with various departments - could we establish a consistent data access agreement? The problem is that the departments themselves are set up under different models - different legislation, sorry. The impact of that is they can't necessarily have the same set of conditions. Certainly, there is some capacity to [unclear] some of that and, I'd venture, we'll see the extent to which the Productivity Commission report that's [coming out on] data access might address some of those questions as well.

Kate LeMay: Just quickly, there's a question about whether there are any checklists or guidelines for new researchers to assess their research surveys for the level of confidentiality. I think they're talking about privacy risk assessments.

Steve McEachern: Actually, we have an internal checklist. This is something we've talked about in terms of thinking about what you need to do, but it really depends on publication. We talked before about the fact that in order to do certain research you actually need to have some things that [might be identified], so it depends on which point in the data life cycle you're talking about here. When we're thinking about data release, then, as I say, we would basically apply some basic principles - these are the sorts of things that we look for. Actually, we've talked about making that checklist available, in terms of these are the sorts of things you have to be concerned about. There is advice around that we could probably bring together but, as I say, it's this usability versus confidentiality question again.

One of the things we sometimes do is split off those things that have a high confidentiality risk. We actually release [several] different sets of data, so that if you need that additional information you can make it available under a separate, additional set of requirements, possibly under a different technological setting.
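As an illustration of that split-release approach, the following is a small Python/pandas sketch. The column names and the list of high-risk variables are hypothetical; the idea is simply that the high-risk variables travel in a separate file, linked back to the main file by a study ID, and are released under stricter conditions.

    # Hypothetical example: column names and risk judgements are assumptions.
    import pandas as pd

    survey = pd.DataFrame({
        "study_id":   [101, 102, 103],
        "age_group":  ["25-34", "35-44", "25-34"],
        "employment": ["full-time", "part-time", "unemployed"],
        "postcode":   ["2600", "2601", "2602"],               # fine geography
        "hiv_status": ["negative", "positive", "negative"],   # sensitive item
    })

    high_risk = ["postcode", "hiv_status"]

    # General-release file: everything except the high-risk variables.
    general_file = survey.drop(columns=high_risk)

    # Restricted file: the study ID plus the high-risk variables, released only
    # under additional requirements (for example, in a secure setting).
    restricted_file = survey[["study_id"] + high_risk]

    general_file.to_csv("survey_general.csv", index=False)
    restricted_file.to_csv("survey_restricted.csv", index=False)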
Steve McEachern: I think it depends a little bit on when in the life cycle you're talking about here. It often is useful to have information - for example, if you're running a longitudinal study you must have identifying information going forward. You've got to be able to contact someone the next time round. It depends on what you're trying to achieve but, yeah, there is some basic advice that we put out.

George Alter: There's a literature that's been used by statistical agencies about what [unclear], but that whole area is right now somewhat contentious, because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When the data are available online and you can do repeated, iterative operations on them, you're in a new world. There's a separate literature that's developed in the computer science world. Anyway, it is a problem. There is guidance out there in really complex areas, like in some health care areas. Doing a full assessment of a data set can be very complicated and difficult, so my recommendation is that people start at the basics and think about: how would you identify this person, and if this information got out, what harm would it cause? Often the researchers themselves have a good sense of that from the research they're doing.
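One simple way to start at the basics on the identifiability question is to count how many records share each combination of quasi-identifiers, in the spirit of the k-anonymity literature from the computer science world George alludes to. This is a rough sketch only; the file name, the choice of quasi-identifiers and the threshold of 3 are assumptions made for illustration.

    # Rough first-pass identifiability check: records that are unique (or nearly
    # unique) on a combination of quasi-identifiers are the easiest to re-identify.
    import pandas as pd

    df = pd.read_csv("survey_general.csv")           # hypothetical file from above
    quasi_identifiers = ["age_group", "employment"]  # assumed quasi-identifiers

    # Size of each quasi-identifier combination ("k" in k-anonymity terms).
    sizes = df.groupby(quasi_identifiers).size().rename("k").reset_index()
    df = df.merge(sizes, on=quasi_identifiers, how="left")

    # Flag records in small groups; 3 is an arbitrary threshold for illustration.
    risky = df[df["k"] < 3]
    print(f"{len(risky)} of {len(df)} records sit in groups smaller than 3")

A count like this says nothing by itself about harm; as George notes, the researchers usually have the best sense of what damage disclosure would actually cause.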
Kate LeMay: There's one last question: are the five safes applicable in all research disciplines? Or are they specifically limited to suit the social sciences?

Steve McEachern: I think they're broadly applicable.

Kate LeMay: I agree.

Steve McEachern: I mean, it's interesting. We were having a discussion here about the social sciences, but, for example, we work a lot with the health sciences [unclear] environmental sciences [unclear]. I don't see any reason why they shouldn't be applied elsewhere. Part of the question actually is - it's more [unclear] about what you have to think about in terms of the privacy and confidentiality risks, far more so than what the topic is. The topic helps you make some sort of judgement about the harm, in George's terms, but yeah, it's the confidentiality questions that…

[Over speaking]

Kate LeMay: The framework is [unclear].

Steve McEachern: Yeah.

George Alter: Oh yeah.

Kate LeMay: Fabulous. Thank you very much to George and Steve for coming along to our webinar today, and thank you everyone for calling in…

END OF TRANSCRIPT