[Unclear] words are denoted in square brackets
Managing and publishing sensitive data in the
social sciences – ANDS Webinar
29 March 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Kate LeMay: Good afternoon - or good morning if you're over in the Perth time zone - to
everyone. Thank you for calling into our webinar today. We've got
some handouts in today's webinar as well. We've got a guide to
publishing and sharing sensitive data; that is an ANDS resource, and also an ANDS resource called the sensitive data decision tree. That's a one-page summary of the information that's available in our guide.
I'd just like to introduce our two guests today. We've got Professor
George Alter. He's a research professor in the Institute for Social
Research and Professor of History at the University of Michigan. His
research integrates theory and methods from demography, economics
and family history, with historical sources to understand demographic
behaviours in the past.
From 2007 to 2016 he was the Director of the Inter-university
Consortium for Political and Social Research, ICPSR, the world's
largest archive of social science data. He's been active in international
efforts to promote research transparency, data sharing and secure
access to confidential research data. He's currently engaged in
projects to automate the capture of metadata from statistical analysis
software and to compare fertility transitions in contemporary and
historical populations. We're lucky to currently have him as a visiting
professor at ANU.
Dr Steve McEachern is the Director of the Australian Data Archive at the
Australian National University. He holds a PhD in industrial relations
and a graduate diploma in management information systems and has
research interests in data management and archiving, community and
social attitude surveys, new data collection methods and reproducible
research methods.
Steve has been involved in various professional associations in
survey, research and data archiving over the last 10 years and is
currently chair of the Executive Board of the Data Documentation Initiative.
Firstly, we're going to hand over to George, who's going to share the
benefit of over 50 years of ICPSR managing sensitive social science
data. Over to you George.
George Alter: Thank you, Kate. It's a pleasure to talk to you today. ICPSR, as Kate
mentioned, has been archiving data for more than 50 years, and an
increasing amount of our effort has gone into devising safe ways to
share data that have sensitive and confidential information.
At the heart of everything we do in terms of protecting confidential
information is a part of the research process: when we ask
people to provide information about themselves to us, we make a
promise to them. We tell them that the benefits of the research that
we're going to do are going to outweigh the risk to them, and we say
that we will protect the information that they give us.
We have a lot of data that we receive at ICPSR and here at the ADA
that include questions that are very sensitive. Often we're asking people about types of behaviour that could cause them harm - we might be specifically asking them about criminal activity. We might be asking them about medications that they take that could affect their jobs or other things, so we have to be careful about it.
We're afraid that if the information gets out it could be used by various
actors for specific purposes, could be used in a divorce proceeding.
Sometimes we interview adolescents about drug use or sexual
behaviour, and we promise them that their parents won't see it and so
on.
In the data archiving world we often talk about two kinds of identifiers.
There are direct identifiers - which are things like names, addresses,
social security numbers - many of which are unnecessary, but some
types of direct identifiers - such as geographic locations or genetic
characteristics - may actually be part of the research project.
Then the most difficult problem, often, is the indirect identifiers. That is
to say characteristics of an individual that, when taken together, can
identify them. We refer to this often as deductive disclosure, meaning
that it's not obvious directly, but if you know enough information about
a person in a data set, then you can match them to something else.
Frequently we're concerned that someone who knows that another
person is in the survey could use that information and find them, or
that there is some other external database where you could match
information from the survey and re-identify a subject.
Deductive disclosures often are dependent on contextual data. If you
know that a person is in a small geographic area, or you know that
they're in a certain kind of institution, like a hospital or a school, it
makes it easier to narrow down the field over which you have to
search to identify them.
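As a rough illustration of how indirect identifiers combine, here is a minimal Python (pandas) sketch, with hypothetical variable names, that counts how many respondents share each combination of quasi-identifiers; any combination held by a single respondent is a deductive disclosure risk:

```python
# Minimal sketch of a deductive-disclosure check; column names are hypothetical.
import pandas as pd

def unique_combinations(df: pd.DataFrame, quasi_identifiers: list) -> pd.DataFrame:
    """Return quasi-identifier combinations that belong to exactly one respondent."""
    counts = df.groupby(quasi_identifiers).size().reset_index(name="n")
    return counts[counts["n"] == 1]

survey = pd.DataFrame({
    "age":        [34, 34, 34, 67],
    "sex":        ["F", "F", "F", "M"],
    "postcode":   ["2601", "2601", "2601", "0872"],
    "occupation": ["teacher", "teacher", "teacher", "nurse"],
})

# The one 67-year-old male nurse in postcode 0872 can be re-identified by
# anyone who already knows those facts about him; the three teachers cannot.
print(unique_combinations(survey, ["age", "sex", "postcode", "occupation"]))
```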
Unfortunately, in the social sciences, contextual data has become
more and more important. People now are very interested in new
things like the effect of neighbourhood on behaviour and political
attitudes or the effect of available health services on morbidity and
mortality.
There are a number of different kinds of contextual data that can
affect deductive disclosure. We're in a world right now where our
social science researchers are increasingly using data collections that
include items of information that make the subjects more identifiable.
For example, people studying the effectiveness of teaching often have
data sets that have the characteristics of students, teachers, schools,
school districts. Once you put all those things together it becomes
very identifiable.
We at ICPSR - and, I think, the social science data community in
general - have taken up a framework for protecting confidential data
that was originally developed by Felix Ritchie in the UK that talks
about ways to make data safe. I'm going to go through these points,
but Ritchie talks about safe data, safe projects, safe settings, safe
people and safe outputs. The idea of this is not that any one approach
solves the problem, but that you can create an overall system that
draws from all of these different approaches and uses them to
reinforce each other.
Safe data means taking measures that make the data less identifiable.
Ideally, that starts when the data are collected. There are things that
data producers can do to make their data less identifiable. One of the
simplest things is to do something that masks the geography. If you're
doing interviews it's best to do the interviews in multiple locations; that adds to the anonymisation of your interviewees. Or, if you're doing
them in only one location, you should keep that information about the
location as secret as possible.
Once the data have been collected, research projects have been using a lot of different techniques for many years to mask the identity of individuals. One of the most common is what's called top coding, where if you ask your subjects about their incomes, the people with the highest incomes are going to stand out in most cases, so usually you group them into something that says people above $100,000 in income, or something like that, so that there's not just one person at the very top, but a group of people, which makes them more anonymous.
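For illustration, a minimal Python sketch of top coding, assuming a pandas Series of incomes; the $100,000 threshold and names are illustrative only, not a repository's actual rules:

```python
# Minimal sketch of top coding (illustrative threshold and names, not ICPSR's rules).
import pandas as pd

def top_code(income: pd.Series, threshold: float = 100_000) -> pd.Series:
    """Collapse all values above the threshold into a single top group."""
    return income.clip(upper=threshold)

incomes = pd.Series([42_000, 58_000, 97_000, 350_000, 1_200_000])
print(top_code(incomes))
# The two highest earners both become 100000, so neither stands out on its own.
```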
This list of things that I've given here - which goes from aggregation approaches to actually affecting the values - is listed in terms of the amount of intervention that's involved. Some of the more recently developed techniques actually involve adding noise or random numbers to the data itself, which tends to make it less identifiable, but it also has an impact on what you can do with the data.
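For illustration, a minimal sketch of that kind of perturbation, assuming a NumPy array and a zero-mean Gaussian noise scale chosen purely for the example:

```python
# Minimal sketch of perturbation by noise addition (scale chosen only for illustration).
import numpy as np

rng = np.random.default_rng(seed=1)

def add_noise(values: np.ndarray, scale: float) -> np.ndarray:
    """Add zero-mean Gaussian noise so released values no longer match external records exactly."""
    return values + rng.normal(loc=0.0, scale=scale, size=values.shape)

ages = np.array([23.0, 45.0, 67.0, 81.0])
print(add_noise(ages, scale=2.0))
# The perturbed values protect against exact matching, at some cost to analyses of the released data.
```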
Safe projects means that the projects themselves are reviewed before
access is approved. At most data repositories, when the data need to
be restricted because of sensitivity we ask the people who apply for
the data to give us a research plan. That research plan can be
reviewed in several different ways.
The first two things are things that we do regularly. At ICPSR we ask,
first of all, do you really need the confidential information to do this
research project and, if you do need it, would this research plan
identify individual subjects? We're not in the business of helping
marketers identify people for target marketing, so we would not accept
a research plan that did that. There are also projects that actually look
at the scientific merit of a research plan. To do that though you need
to have experts in the field who can help you to do that.
Safe settings means putting the data in places that reduce the risk
that it will get out. I'm going to talk here about four approaches. The first one is data protection plans. For data that need to be protected, but where the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and data use agreement - which I'll come
to in a couple of minutes.
The data protection plan specifies how they're going to protect the data. Here's a list of things that we worry about, which one of my colleagues at ICPSR drew up. One of the things we ask people is: what happens if your computer is stolen? How will the confidential data be protected?
There are a number of things that people can do, like encrypting their hard disk or locking their computers in a closet when they're not being used, that can address these things. I think that data protection plans
need to move to just a general consideration of what it is that we're
trying to protect against and allow the users to propose alternative
approaches rather than saying oh, you have to use this particular
software or this or that. We have to be clear about what we're worried
about.
A couple of notes about data security plans: data security plans are
often difficult, partly because of the approach that has been taken in
the past, and also because researchers are not computer technicians,
and we're often giving them confusing information. One of the ways
that I think, in the future - in the US at least - universities are going to
move beyond this is - I'm seeing universities developing their own
protocols where they use different levels of security for different types
of problems.
At each level they specify the kinds of measures that researchers
need to take to protect data that is at that level of sensitivity. From my
point of view, as a repository director, I think that any time that the
institutions provide guidance it's a big help to us. The other way to make the data safe - by putting it in a safe setting - is actually to control access.
There are three main ways that repositories control access. One kind
of system is what [I'd call a] remote submission and execution system
where the researcher doesn’t actually get access to the data directly.
They submit program code or a script for a statistical package to the
data repository. The repository runs the script on the data and then
sends back the results. That’s a very restrictive approach, but it's very
effective.
More recently, however, a number of repositories and statistical
agencies have been moving to virtual data enclaves. These enclaves -
which I'll illustrate briefly in a minute - use technologies that isolate the
data and provide access remotely but restrict what the user can do.
The most restrictive approach is actually a physical enclave. At ICPSR
we have a room in our basement that has computers that are isolated
from the internet.
We have certain data sets that are highly sensitive. If you want to do research with them, you can, but on the way into the enclave we're going to go through your pockets to make sure you're not trying to bring anything in, and on the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave.
The disadvantage of a physical enclave is that you actually have to
travel to Michigan to use those data, which could be expensive. That’s
the reason that a number of repositories are turning to virtual data
enclaves.
This is a sketch of what the technology looks like. What happens is that you, as a researcher, go over the internet and log on to a site that connects you to a virtual computer. That virtual computer has access to the data, but your desktop machine does not. You can only access the data through the virtual machine.
At ICPSR we actually use this system internally for our data
processing to provide an additional level of security. We talk about the
virtual data enclave, which is the service we provide to researchers,
and the secure data environment, which is where our staff works
when they're working on sensitive data.
It's a little bit of a let-down, but this is what it actually looks like. What
I've done here is - the window that’s open there with the blue
background is the - our virtual data enclave. I've opened a window for
[unclear] inside there. The black background is my desktop computer.
If you look closely, you'll see in the corner of the blue box that you see
the usual Windows icons, and that’s because when you're operating
remotely on - in the virtual enclave you're using Windows. It looks just
like Windows and acts just like Windows, except that you can't get to
anything on the internet. You can only get to things that we provide for
a level of security.
On top of that the software that's used - we use [VMware] software, but there are other brands that do the same thing - essentially turns off your access to your printer and turns off your access to your hard drive or the USB drive, so you cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there, but because you have that capability, we also restrict people with a data use agreement.
That's my next topic: how do you make people safer? The main way that we make people safer is by making them sign data use agreements or by providing them with training. The data use agreements
used at ICPSR are, frankly, rather complicated. They consist of the
research plan, as I mentioned before. We require people to get IRB
approval for what they're doing, a data protection plan, which I
mentioned, and then there are these additional things of behavioural
rules and security [pledges] and an institutional signature, which I'll
mention now.
The process - if you look at the overall process of doing research,
there are a number of legal agreements that get passed back and
forth. It actually starts with an agreement made between the data
collectors and the subject, in which they provide the subjects with
informed consent about what the research is about and what they're
going to be asked. It's only after that that the data go from the subject
to the data producers.
Then the data archive - such as ICPSR or ADA - actually reaches an
agreement with the data producers in which we become their
delegates for distributing the data. That’s another legal agreement.
Then, when the data are sensitive, we actually reach - have to get an
agreement from the researcher - and these are pieces of information
we get from the researcher - and, in the United States, our system is
that the agreement is actually not with the researcher, but with the
researcher's institution.
At ICPSR, we're located at the University of Michigan, and all of our
data use agreements are between the University of Michigan and
some other university, in most cases. There are some exceptions. It's
only after we get all of these legal agreements in place that the
researcher gets the data.
One of the things in our agreements at ICPSR is a list of the types of
things that we don’t want people to do with the data. For example, we
don’t want someone to publish a table, a cross-tabulation table, where
there's one cell that has one person in it, because that makes that
person more identifiable. There's a list of these things, I think - often
we have 10 or 12 of them - that are really standard rules of thumb that
statisticians have developed for controlling re-identification.
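As a rough illustration of one such rule of thumb, here is a minimal Python sketch, with hypothetical variable names, that flags cross-tabulation cells containing fewer respondents than a minimum threshold:

```python
# Minimal sketch of a small-cell check on a cross-tabulation; names are hypothetical.
import pandas as pd

def risky_cells(df: pd.DataFrame, row: str, col: str, min_count: int = 2) -> pd.Series:
    """Return non-empty cells whose count falls below the minimum allowed."""
    table = pd.crosstab(df[row], df[col])
    cells = table.stack()
    return cells[(cells > 0) & (cells < min_count)]

survey = pd.DataFrame({
    "region":    ["North", "North", "South", "South", "South"],
    "disclosed": ["no", "no", "no", "yes", "no"],
})

# The single "yes" respondent in the South would appear as a cell of one
# in a published table, so that cell should be suppressed or collapsed.
print(risky_cells(survey, "region", "disclosed"))
```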
The ICPSR agreements are also, as I said, agreements between institutions. One of the things that we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe that something has gone wrong, the institution agrees that they will investigate this based on their own policies about scientific integrity and protecting research subjects.
DUAs are not ideal. There's actually a lot of friction in the system. Currently - in most cases - a [PI] needs a different data use agreement for every data set, and they don't like that. We can, I think, in the future, reduce the cost of data use agreements by making institution-wide agreements where the institution designates a steward who will work with researchers at that institution.
There's already an example of this: the Databrary project - which is a project in developmental psychology that shares videos - has done very good work on legal agreements. My colleague - the current director at ICPSR, Margaret Levenstein - has been working on a model where a researcher who gets a data use agreement for one data set can use that to get a data use agreement for another data set, so that individuals can be certified and include that certification in other places.
One of the things that I think we need to do more about is training. A
number of places, like ADA, train people who get confidential data.
We've actually done some work on developing an online tutorial about
disclosure risk, which we haven’t yet released, but it's, I think,
something that should be done.
Finally, there's safe outputs. One of the - the last stage in the process
is that the repository can review what was done with the data and
remove things that are a risk to subjects. This only works if you retain
control, so it doesn’t work if you send the data to the researcher, but it
does work if you're using one of these remote systems like remote
submission or a virtual data enclave. Often, this kind of checking is
costly. There are some ways to automate part of it, but a manual
review is almost always necessary in the end.
A last thing about the costs and benefits: obviously, data protection has costs. Modifying data affects the analysis. If you restrict access you're imposing burdens on researchers. Our view is that you need to weigh the costs against the risks that are involved. There are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? And, secondly, if that person was re-identified, what harm would result?
We think about this as a matrix, where you can see in this figure that as you move up you're getting more harm, and as you move to the right you're increasing the probability of disclosure. If the data set is low on both of these things - for example, if it's a national survey where 1000 people from all over the United States were interviewed and we don't know where they're from and we ask them what their favourite brand of refrigerator is - that kind of data we're happy to send out directly over the web without a data use agreement, with simple terms of use.
But as we get more complex data with more questions, more sensitive
questions, we often will add some requirements in the form of a data
use agreement to assure the data are protected. When we get to
complex data where there is a strong possibility of re-identification and
where some harm would result to the subjects, in that case we often
add a technology component like the virtual data enclave.
Then there are the really seriously risky and sensitive things. My usual
example of this is we have a data set at ICPSR that was compiled
from interviews with convicts about sexual abuse and other kinds of
abuse in prisons. That data is very easy to identify and very sensitive.
We only provide that in our physical data enclave.
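As a rough illustration of that matrix, here is a minimal Python sketch mapping likelihood of disclosure and harm to an access arrangement; the tiers and cut-offs are illustrative only, not ICPSR's actual policy:

```python
# Minimal sketch of the harm-by-likelihood matrix; tiers are illustrative, not ICPSR policy.
from enum import IntEnum

class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def access_mode(disclosure_likelihood: Level, harm: Level) -> str:
    """Map the two risk dimensions to an access arrangement."""
    if disclosure_likelihood == Level.LOW and harm == Level.LOW:
        return "direct web download under simple terms of use"
    if disclosure_likelihood == Level.HIGH and harm == Level.HIGH:
        return "physical data enclave"
    if disclosure_likelihood + harm >= Level.MEDIUM + Level.MEDIUM:
        return "virtual data enclave plus data use agreement"
    return "data use agreement with a data protection plan"

print(access_mode(Level.LOW, Level.LOW))    # low risk: open dissemination
print(access_mode(Level.HIGH, Level.HIGH))  # highest risk: physical enclave only
```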
That’s the end of my presentation. Thank you for your attention. We'll
take questions later.
Kate LeMay: Great. Thank you, George. We'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at
the Australian Data Archive.
Steve McEachern: My aim today is to build off what George has talked about, particularly
taking the five safes model and looking at what the situation is in the
Australian case. I'll talk about the Australian Data Archive and how we
support sensitive data, but I want to put it in the context of the broader
framework of how we access sensitive data in Australian social
sciences generally.
I'm going to talk about some of the different options that are around,
picking up on some of what George has discussed in terms of some of
the alternatives that are available and demonstrate the different ways
these are in use here in Australia. I'm really focussing more on the five
safes model and its application in Australia than I am specifically on
ADA. As I say, we are one component of the broader framework for
sensitive data access here.
Just to say - I mean what I really wanted to cover off here is thinking about sensitive data and the five safes model. I'll look at the different frameworks for sensitive data access in Australia and where you might find them, and then how we apply the five safes model at ADA in particular. Then, time permitting, I might say something briefly about the data life cycle and sensitive data as we go through.
I wanted to just pick up on the ANDS definition here of sensitive data. As I say - well, I'll frame this in the context of ADA - most of what we deal with at ADA has, at some point in its life cycle, been sensitive data. More often than not it's information that's collected from humans, often with some degree of identifiability, at least at the point of data collection, not necessarily the point of distribution.
A lot of what we deal with - and this is true for a lot of social science archives - would fall into the class of sensitive data. There's a distinction we would probably draw, though, between what we get and what we distribute. In terms of our definition here - this is in the handout, I think, that's in the handout section, and it's available online - sensitive data are data that can be used to identify an individual, species, object, process or location, introducing a risk of discrimination, harm or unwanted attention.
We tend to think in terms of human risks more than anything else, the
risk to humans and individuals, but it does apply in other cases as
well. For example, the identification of sites for Indigenous art might,
in and of itself, lead more people to want to go and visit that location
and, in a sense, destroy the thing that you're actually trying to protect;
so the more visits that they actually get, the more degraded the art
itself becomes. It doesn’t just hold for human research, but that’s
probably our emphasis at ADA.
Just to reiterate the five safes again, we talk about five things: people,
projects, settings, data and outputs and the reference here, as I say.
Down the bottom you can look at the document that Felix Ritchie and
two of his colleagues developed, framing out the five safes model.
What I would say about this is that it's been adopted directly by the UK Data Service. That's where it has its origins. The basic principles are applied in a lot of the social science data archives, and it's now actually been adopted by the Australian Bureau of Statistics as well. Their framework - they're thinking about the output of different types of publications - literally [unclear] this model. [Unclear] it's quite a useful framework for talking about this.
I'm going to take a slightly different approach to George in thinking
about how we think about what we're worried about. I'm going to take -
as a depositor you worry about the risk of disclosure. As a researcher,
what's the flip side of that? Why do we need access to sensitive
data? What does it provide?
The National Science Foundation, about four or five years ago, put out a call around how we could improve access to microdata, particularly from Government sources. It highlights why we talk about the need for access - the sorts of research you can do. This comes from a submission by [David Card], Raj Chetty and several economists in the US and elsewhere.
They were highlighting what's needed. Direct access is really the critical thing here, direct access to microdata - by microdata we mean information about individuals, line by line. Aggregate statistics, synthetic data - where we create fake people, as it were - [or] submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to answer policy questions in particular. A lot of social policy research in particular is focussed in this way.
In order to do certain things, access to this data is necessary. How do we facilitate that, taking account of the sorts of concerns that have been raised? [On site] that is. How do people expect to access it? This was an interesting blog post from a researcher previously based at the University of Canterbury, comparing how you access US census data versus the New Zealand census - and, similarly, you could say the same for the Australian census as well.
In the US you can get a one per cent sample of the census and you just go and download a file directly. It's open as what's called a public use microdata file. Those are directly available. In New Zealand, there's a whole series of instructions you have to go through. You might be subject to data use agreements. You might be subject to an application process, et cetera, et cetera.
He's criticising, saying it should be much easier, it should be the US
model that’s appropriate here rather than the New Zealand model.
What we're really probably talking about here is that both are
appropriate depending upon the sorts of detail, the sorts of identifying
information that are available. Both might be valid models. They just
allow you to do different things.
The first model really focusses on, in a sense, masking the data to some degree - some of the safe data approaches that George talked about. The other uses other aspects of the safes model to address confidentiality concerns.
What you also find is researchers understand these concerns, but there has to be some trade-off. The need for confidentiality is recognised and understood, and there may well be - there ought to be - trade-offs in return for that. For example, Card and his colleagues suggest that there is a set of criteria that you could put in place for enabling some form of access to sensitive microdata. They reference access through local statistical offices, through some remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for that access should be appropriate monitoring.
So there is a recognition that these - I mean this is just one possible
approach, but a recognition that access brings with it responsibilities
and appropriate checks and balances. What I want to talk about is
how has that eventuated in Australia, what do we see? [This bubble
here].
The sorts of models that we see here in Australia - I've broken them
out here broadly. I'd say four broad areas, but the one that people are
probably most familiar with is the ABS, the Australian Bureau of
Statistics. They have a number of systems and access methods that
suit different types of safe profiles.
These include what's called confidentialised unit record files, or CURFs; the remote access data lab, which is one of their online execution systems; and an on-site data lab. You can go to the bowels of the ABS buildings - certainly in Canberra and, I believe, in other states as well - and do on-site processing.
Then they have other systems. Probably the best known of these is what's called TableBuilder, which is an online data aggregation tool that does safe data processing on the fly. Our emphasis at ADA is primarily on these confidentialised unit record files, so we provide unit record access and some aggregated data access as well.
Then we have the remote execution - or remote analysis - environments. I put under this model the Australian research infrastructure network [for geographic] data access in particular. The Secure Unified Research Environment produced by the Population Health Research Network is an example of George's remote access environment as well, and even the data linkage facilities - another part of the PHRN network - fit to some degree under this type of secure access model. That's, in a sense, a more extreme version of that.
Then we have other ad hoc arrangements as well; things like the
physical secure rooms. A number of institutions have a secure space.
There are a number here at ANU, for example. Then you might have
other departmental arrangements as well that exist. We can probably
classify those in terms of the distinction in the type of approaches that
we have.
What I've done here is just a very simple assessment, from not at all to a very strong yes, of how far each fits within - addresses - each safe element, from low to high. I have some question marks on some of the facilities, particularly [SURE and the] linkage facilities, not because I don't think they can do it; it's that I don't have enough information to make an assessment there.
If you look at the different types, things like the ABS models have tended towards safe data - those sorts of confidentialisation and [unclear] routines - output checking and secure access models. Tabulations are a secure access model as well. They've tended less towards safe people and safe projects, so less checking of people and checking of projects.
In a lot of cases there's more trust in the technology than there is in the people using the technology, which I think is a little bit problematic, given that - and I'm going to talk to this - there are some fairly good processes in Australia for actually assessing the quality of people in particular and, to some extent, the projects.
This is - again, we can profile these - the point I'm making here is that you have different alternatives for how you might make sensitive data available. There's not one solution. It's: what's the mix of things that I might do? I'll come back to that at the end.
In the Australian experience, [as I say], we have a strong emphasis on safe data. We came up with a term in Australia - confidentialisation. That's probably the term you'll see most regularly. Anywhere else in the world you would hear the term anonymisation. I'm not quite sure why this is the case but, as I say, in Australia we tend to use the term confidentialisation.
The Australian Data Archive uses this model. The ABS and the Department of Social Services - things like the Household, Income and Labour Dynamics in Australia survey - use anonymisation techniques as the starting point. You can make data safe before you release it. It has its limitations. A good example of that is some of the data sets that were released into the data.gov environment using anonymisation, where safe data was the priority. There's the potential for it to be reverse engineered - if you haven't done your anonymisation properly, it could be reversed and you get a safe data risk, so it has its flaws.
This is why we tended towards looking at a combination of
techniques. As George pointed out, [unclear] if the risk of actually
being identified is low - and particularly the harm that comes from that
is low - then it may be the case that this is sufficient. Certainly, a lot of
the content that we have at ADA, most of our emphasis is actually on
the safe data more than anything else.
Safe settings: we do have - as I say, in examples here, tabulation
systems, things where you can do cross-tabs online are fundamentally
a safe settings model. People don’t get access to the unit record data.
They just get access to the systems to produce outputs.
Remote access systems - the remote access data lab, the PHRN SURE system, and a new system that the ABS are bringing on, their remote data lab, making their data labs available in a virtual environment (that's in a pilot stage that we're working with them on at the moment) - are increasingly being used as well. There are also the secure environments, as I mentioned - the on-site data lab and the secure rooms.
Safe outputs: a number of the safe settings environments - because
they tend to use highly sensitive data - have safe output models as
well. The real problem has been with these in scaling them. It requires
manual checking more often than not. Reviewing the output of these
sorts of systems, that requires people. That requires time. It's hard to
automate as well.
The ABS have invested a lot of money into automating output checking, in point of fact. Their TableBuilder system is one of the best that's around, but their new remote lab still has manual checking of outputs. It depends on what you're trying to do and the sorts of outputs you're producing as to whether you can actually automate the checking as well.
The other side of this that I think will become increasingly relevant too
is the replication and reproducibility elements of things that come out
of systems like this. How are we going to facilitate the replication
models within those environments? I'm not sure that question's been
addressed yet.
Safe researchers and safe projects in Australia - to be frank, they are considered in most models, but they're not really closely monitored. That's because they're difficult to monitor. How do you follow the extent to which people follow the things that they've signed up to? Anyone who's involved in reporting of research outputs for ERA or anything will know that getting people to actually fill out the forms recording what they have produced is hard. Filling out forms to say I have complied with a data use agreement is [unclear].
That said, we do have some checks and balances that are there. Certainly, the ethics models and the codes of conduct for research do provide some degree of vetting [assurance] for those that go through that sort of system. We have some checks and balances in place - particularly for university researchers - to address those sorts of concerns. I think an increasing emphasis on safe researchers and projects might be one that we can leverage a bit more.
As I say, the frameworks we have in place - the Australian code of conduct, and increasingly professional association and journal requirements for data sharing as well - are going to put a degree of assessment on the sorts of practices we use. In America it's the Economic Association, the DA-RT agenda in political science, [unclear] data sharing - these are actually a mechanism also for partly assessing the sharing of data, but also assessing the extent to which you'll, essentially, [unclear] as well. That's something to be considering in the future.
I'll quickly turn to the ADA model and then wrap up. The ADA model - as I say, our emphasis is primarily on safe data. Data is anonymised, and it tends to be done in advance by the agencies or the researchers that provide data to us. We will also do some review of content as well. We'll provide recommendations back to our depositors as to the sorts of things you'll probably want to think about - in terms of, have you included things like postcodes or occupational information?
If I know someone's postcode, their occupation and their age, there's a fair chance that I can identify them in many cases - in remote locations in Australia in particular. There are some basic checks you can do. Certainly, safe people and safe settings - our data access is almost all mediated. You must be identified. You must provide contact information and [supervise …].
We do some checking on safe people, and we're provided with information on project descriptions - what do you intend to do with the data - as well, particularly where we have more sensitive content. Often that's a requirement from our depositors.
We don't really operate, frankly, in safe settings and safe outputs. That's not the space that we work in. We work with other agencies such as the ABS. Where there's access to certain sensitive content we'll point people to the relevant locations - where you've got highly sensitive content that you want to make available.
As I say, something like the remote data lab - where is its focus? They focus less on safe data. They're a virtual enclave. They don't prohibit the use of safe data practices, but where you have highly sensitive data there's a more dedicated assessment process on the project and the outcomes. Highly safe settings, sitting at the ABS. The problem is the cost they have in establishing the system itself. They vet all of the outputs. It has costs associated with it.
They have safe people. [There is] training for researchers prior to accessing the system. There is some challenge in assessing the backgrounds of people, for example. This is where the need for domain experts comes in - if you're going to fully assess people and projects, you're going to assess their domain expertise, and you need domain experts to be able to [unclear] that sort of evaluation.
The emphasis might well be on: are you using appropriate techniques, are you maintaining secure facilities, and what does the research [plan] itself look like - more the emphasis than the quality of the science. That's much harder to evaluate.
Safe projects: that has been used in some places at the ABS. Sometimes it's required for legislative reasons. The extent of data release is itself dependent upon meeting a public good statement, for example. One of the questions for the future for some organisations is: should this matter? Basic research itself might generate useful insights that you didn't expect. As I say, in some cases, again, you're probably going to be moving the levers, focusing on different aspects of the safes environment.
I guess the message we want to put through here is, certainly, there is a suite of options available to you for accessing sensitive data. Different models exist and they have different ranges of the five safes. You can certainly incorporate safe people models. [Unclear], a lot of models focus on the expectation that we have an intruder - hackers are coming in to access our system. Actually, what tends to be the case, more often than not, is the silly mistakes: I made a mistake by leaving my laptop on the train or leaving my USB in the computer lab. That's far more common.
We have - we tend to try and profile to default options in terms of our
mix of safe settings but, as I say, there are options available to you.
What you have to think about is what's appropriate for the [form
of/formal] data that you're trying to work with. Fundamentally, the
argument is that principles should enable the right mix of safes for a
given data source.
Kate LeMay: Thank you very much, Steve. It was a really great overview of the different ways that the five elements of the safes can be mixed, using different [settings]. I thought it was really interesting that both of you mentioned that a safe location was in a basement [laughs]. I've just got these images of people locked up in basements.
I also wanted to note that George mentioned data masking and using de-identification methods. Steve also mentioned confidentialisation and anonymisation. They're similar words for similar processes. ANDS has a de-identification guide available on our website now. If you're interested, it's more detailed than that information. We have our guide there that you can have a look at.
I was also wondering about - George, you were talking about with the
data protection plan and the data use agreement that the onus is on
the institution, that if someone breaks it, that they need to put them
through some sort of research integrity investigation or something like
that. If that doesn’t happen is there any potential recourse for the
university? Would ICPSR turn around and say you didn’t follow this
process, you're not going to be accessing any of our data anymore?
George Alter: Sure. Actually, on our website we actually list the levels of escalation
that [we'll have to] go to. We can certainly cut off the institution from
the - from access to ICPSR data, but what is - what really gets
people's attention is that our - the National Institutes of Health in the
US has an Office of Human Research Protections. If we thought that
someone was breaching one of our agreements and endangering the
confidentiality of research subjects, I would report them to that office.
That office has a lot of power. They regularly publish the names of
bad actors. What's more, they can cut off all NIH funding to
universities. They have done that in the past when they thought that
protections weren’t in place. I always think of that as the nuclear
option. I know for a fact that university administrations and their
trustees and agents are terrified that NIH will do something like that.
Just waving that in front of a university compliance officer gets their
attention.
Kate LeMay: Steve, I was wondering, with your - the Australian Data Archive, with
the use agreement, that people are signing that - is that with the
individual user or with the institution, as it is with…
[Over speaking]
Steve McEachern: Primarily, it's with the individual. We have a small number of
organisational agreements, but not many. There is - I would say
there's more [unclear] - yeah, pinning a focus on an agreement
between the individual and the organisation rather than at the
organisation level. Some organisations do ask for them but, frankly,
it's more actually for pragmatic reasons than it is for compliance
reasons - is that they will want to host the content and manage access
by requesting access to a particular data set for all members of their
research team, for example. It just makes that easier, as it were.
There are other models. As I say, the ABS model is - actually, the
agreement is with the institution. Then individuals sign up to the
institutional agreement. The Department of Social Services model is
the same as well. It will be interesting to see the extent to which we
move in one direction or another. I'd say I think the compliance
argument hasn’t been one that’s been all that common here in
Australia. It's actually been - except in the case of where you have
Government data. I would say it's probably [unclear] situation. For
academic produced data it hasn’t tended to be an emphasis.
Kate LeMay: With George's agreement with institutions where the - the recourse is
that the institution should then have some integrity investigation - what
level of recourse do you have with…
[Over speaking]
Steve McEachern: [Limited].
Kate LeMay: …with the individual?
Steve McEachern: Limited. I mean we would probably report back to the institution to which they belong. [As I say], we do have the question of supervisory arrangements. We would probably also follow some of the questions under the code of conduct for research. That's why I make reference back to the fact that there is an overarching set of obligations on those within Australian academic institutions. We'd pursue something in that way.
One of the challenges for us - and I'm going to guess for George as well - is just finding out where you get breaches of compliance. One of the hardest things to do is actually find out what happened in the first place. We've had one case that I'm aware of - certainly in my predecessors' time - which goes back to the late '90s. It's not a common occurrence, but we're aware of it.
Kate LeMay: George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement? Or is it more informal and gaining momentum?
George Alter: The example I gave is the Databrary project. They're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution and then that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them.
This is something that I think other groups like ICPSR should move to.
Right now it's a big problem that about one in six of our data use agreements at ICPSR involve a negotiation between lawyers at the University of Michigan and lawyers at the other institutions. It's a major
cost. I think it's one of the ways to go.
Steve McEachern: I would say - I mean in Australia we have a pretty strong example, which is the Universities Australia ABS agreement. I mean that model facilitates a whole lot of things. It's enabled access to the broad collection of ABS CURF data under a single agreement. The other side is universities sign up for the cost that comes with that as well. They're paying a fee for that but, as I say, it covers the broad spectrum of what they can do. The challenge in some cases is what [unclear] have you got for dissemination of the content?
As I say, if I went to the [next department] - I've had this discussion with various departments - could we establish a consistent data access agreement? It's because the departments themselves are set up under different models - under different legislation, sorry. The impact of that is they can't necessarily have the same set of conditions. Certainly, there is some capacity to [unclear] some of that and, I'd venture, to see the extent to which the Productivity Commission report that's coming on data access might address some of those questions as well.
Kate LeMay: Just quickly, there's a question about are there any checklists or
guidelines for new researchers to assess their research surveys for
the level of confidentiality? I think that they're talking about privacy risk
assessments.
Steve McEachern: This is - actually, we have an internal checklist. This is something
we've talked about in terms of thinking about whether you - what you
need to do in terms - but it really depends on publication. We talked
before about the fact that in order to do certain research you need to
have actually some things that [might be identified], so it depends on
which point in the data life cycle you're actually talking about here.
When we're thinking about data release, then you - as I say, we would
basically apply some basic principles for - these are the sorts of things
that we look for. Actually, we've talked about making that checklist
available, in terms of these are the sorts of things you have to be
concerned about.
There is advice around what we could probably bring together but, as
I say, the - it's this usability versus confidentiality question again. One
of the things we sometimes do is we split off those things that have a
high confidentiality risk. We actually release [several] different sets of
data, so that if you need that additional information you can actually
make that available under a separate - additional set of requirements,
possibly under a different technological setting.
I think it depends a little bit on when in the life cycle you are talking about here. It often is useful to have as much information as possible - for example, if you're running a longitudinal study you must have identifying information going forward. You've got to be able to contact someone the next time round. It depends on what you're trying to achieve but, yeah, there is some basic advice that we put out.
George Alter: There's a literature that's been used by statistical agencies about what [unclear], but that whole area is right now somewhat contentious because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When
the data are available online and you can do repetitive iterative
operations on them, you're in a new world. There's a separate
literature that’s developed in the computer science world.
Anyway, it is a problem. There is guidance out there in really complex
areas like in some health care areas. Doing a full assessment of a
data set can be very complicated and difficult, so I think my
recommendation is that people start at the basics and think about how
would you identify this person, and if this information got out what
harm would it cause? Often the researchers themselves have a good
sense of that from the research they're doing.
Kate LeMay: There's one last question: are the five safes applicable in all research
disciplines? Or are they specifically limited to suit the social sciences?
Steve McEachern: I think they're broadly applicable.
Kate LeMay: I agree.
Steve McEachern: I mean it's interesting. We were having a discussion here about the social sciences. For example, we work a lot with health sciences [unclear] environmental sciences [unclear]. I don't see any reason why they shouldn't be applied elsewhere. I mean part of the question actually is - it's more [unclear] about what you have to think about in terms of the privacy and confidentiality risks, far more so than what the topic is.
The topic helps you make some sort of judgement about the harm, in
George's terms, but yeah, it's the confidentiality questions that…
[Over speaking]
Kate LeMay: The framework is [unclear].
Steve McEachern: Yeah.
George Alter: Oh yeah.
Kate LeMay: Fabulous. Thank you very much to George and Steve for coming
along to our webinar today, and thank you everyone for calling in…
END OF TRANSCRIPT

 
Cyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse
Cyber Summit 2016: Privacy Issues in Big Data Sharing and ReuseCyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse
Cyber Summit 2016: Privacy Issues in Big Data Sharing and Reuse
 
Ho3313111316
Ho3313111316Ho3313111316
Ho3313111316
 
big-data-and-data-sharing_ethical-issues.pdf
big-data-and-data-sharing_ethical-issues.pdfbig-data-and-data-sharing_ethical-issues.pdf
big-data-and-data-sharing_ethical-issues.pdf
 
How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...
How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...
How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medici...
 
Karenclassslides
KarenclassslidesKarenclassslides
Karenclassslides
 
Lecture series: Using trace data or subjective data, that is the question dur...
Lecture series: Using trace data or subjective data, that is the question dur...Lecture series: Using trace data or subjective data, that is the question dur...
Lecture series: Using trace data or subjective data, that is the question dur...
 
Netnography and Research Ethics: From ACR 2015 Doctoral Symposium
Netnography and Research Ethics: From ACR 2015 Doctoral SymposiumNetnography and Research Ethics: From ACR 2015 Doctoral Symposium
Netnography and Research Ethics: From ACR 2015 Doctoral Symposium
 
BIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docx
BIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docxBIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docx
BIO 10 Can Eating Insects Save the WorldDue Monday, Dec 10, .docx
 

Mais de ARDC

Mais de ARDC (20)

Introduction to ADA
Introduction to ADAIntroduction to ADA
Introduction to ADA
 
Architecture and Standards
Architecture and StandardsArchitecture and Standards
Architecture and Standards
 
Data Sharing and Release Legislation
Data Sharing and Release Legislation   Data Sharing and Release Legislation
Data Sharing and Release Legislation
 
Australian Dementia Network (ADNet)
Australian Dementia Network (ADNet)Australian Dementia Network (ADNet)
Australian Dementia Network (ADNet)
 
Investigator-initiated clinical trials: a community perspective
Investigator-initiated clinical trials: a community perspectiveInvestigator-initiated clinical trials: a community perspective
Investigator-initiated clinical trials: a community perspective
 
NCRIS and the health domain
NCRIS and the health domainNCRIS and the health domain
NCRIS and the health domain
 
International perspective for sharing publicly funded medical research data
International perspective for sharing publicly funded medical research dataInternational perspective for sharing publicly funded medical research data
International perspective for sharing publicly funded medical research data
 
Clinical trials data sharing
Clinical trials data sharingClinical trials data sharing
Clinical trials data sharing
 
Clinical trials and cohort studies
Clinical trials and cohort studiesClinical trials and cohort studies
Clinical trials and cohort studies
 
Introduction to vision and scope
Introduction to vision and scopeIntroduction to vision and scope
Introduction to vision and scope
 
FAIR for the future: embracing all things data
FAIR for the future: embracing all things dataFAIR for the future: embracing all things data
FAIR for the future: embracing all things data
 
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian DuncanARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
 
Skilling-up-in-research-data-management-20181128
Skilling-up-in-research-data-management-20181128Skilling-up-in-research-data-management-20181128
Skilling-up-in-research-data-management-20181128
 
Research data management and sharing of medical data
Research data management and sharing of medical dataResearch data management and sharing of medical data
Research data management and sharing of medical data
 
Findable, Accessible, Interoperable and Reusable (FAIR) data
Findable, Accessible, Interoperable and Reusable (FAIR) dataFindable, Accessible, Interoperable and Reusable (FAIR) data
Findable, Accessible, Interoperable and Reusable (FAIR) data
 
Applying FAIR principles to linked datasets: Opportunities and Challenges
Applying FAIR principles to linked datasets: Opportunities and ChallengesApplying FAIR principles to linked datasets: Opportunities and Challenges
Applying FAIR principles to linked datasets: Opportunities and Challenges
 
How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018How to make your data count webinar, 26 Nov 2018
How to make your data count webinar, 26 Nov 2018
 
Ready, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
Ready, Set, Go! Join the Top 10 FAIR Data Things Global SprintReady, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
Ready, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
 
How FAIR is your data? Copyright, licensing and reuse of data
How FAIR is your data? Copyright, licensing and reuse of dataHow FAIR is your data? Copyright, licensing and reuse of data
How FAIR is your data? Copyright, licensing and reuse of data
 
Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018
 

Último

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Último (20)

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 

Managing and publishing sensitive data in the social sciences - Webinar transcript

George Alter: Sometimes we interview adolescents about drug use or sexual behaviour, and we promise them that their parents won't see it and so on.
In the data archiving world we often talk about two kinds of identifiers. There are direct identifiers - things like names, addresses and social security numbers - many of which are unnecessary, but some types of direct identifiers - such as geographic locations or genetic characteristics - may actually be part of the research project.

Then the most difficult problem, often, is the indirect identifiers; that is to say, characteristics of an individual that, when taken together, can identify them. We refer to this as deductive disclosure, meaning that it's not obvious directly, but if you know enough information about a person in a data set, then you can match them to something else. Frequently we're concerned that someone who knows that another person is in the survey could use that information and find them, or that there is some other external database where you could match information from the survey and re-identify a subject.

Deductive disclosures are often dependent on contextual data. If you know that a person is in a small geographic area, or you know that they're in a certain kind of institution, like a hospital or a school, it makes it easier to narrow down the field over which you have to search to identify them. Unfortunately, in the social sciences, contextual data has become more and more important. People now are very interested in things like the effect of neighbourhood on behaviour and political attitudes, or the effect of available health services on morbidity and mortality. There are a number of different kinds of contextual data that can affect deductive disclosure.

We're in a world right now where social science researchers are increasingly using data collections that include items of information that make the subjects more identifiable. For example, people studying the effectiveness of teaching often have data sets that have the characteristics of students, teachers, schools and school districts. Once you put all those things together it becomes very identifiable.
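To make the idea of deductive disclosure concrete, here is a minimal sketch in Python of the kind of check a data producer or archive might run before release: it simply counts how many records share each combination of indirect identifiers, and flags combinations held by only one person. This example is not from the webinar; the column names and values are invented for illustration.

```python
# Minimal deductive-disclosure check: count how many records share each
# combination of indirect (quasi-)identifiers. Columns are illustrative only.
import pandas as pd

def unique_combination_report(df: pd.DataFrame, quasi_identifiers: list[str]) -> pd.DataFrame:
    """Return each quasi-identifier combination with the number of records sharing it."""
    return (
        df.groupby(quasi_identifiers, dropna=False)
          .size()
          .reset_index(name="n_records")
          .sort_values("n_records")
    )

if __name__ == "__main__":
    survey = pd.DataFrame({
        "school":     ["A", "A", "B", "B", "B"],
        "year_level": [7, 7, 9, 9, 9],
        "age":        [12, 12, 14, 15, 14],
        "postcode":   ["2600", "2600", "2913", "2913", "2913"],
    })
    report = unique_combination_report(survey, ["school", "year_level", "age", "postcode"])
    # Any combination shared by only one record is a re-identification risk.
    print(report[report["n_records"] == 1])
```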
We at ICPSR - and, I think, the social science data community in general - have taken up a framework for protecting confidential data that was originally developed by Felix Ritchie in the UK, which talks about ways to make data safe. I'm going to go through these points, but Ritchie talks about safe data, safe projects, safe settings, safe people and safe outputs. The idea is not that any one approach solves the problem, but that you can create an overall system that draws on all of these different approaches and uses them to reinforce each other.

Safe data means taking measures that make the data less identifiable. Ideally, that starts when the data are collected. There are things that data producers can do to make their data less identifiable. One of the simplest is to mask the geography. If you're doing interviews, it's best to do the interviews in multiple locations; that adds to the anonymisation of your interviewees. Or, if you're doing them in only one location, you should keep the information about the location as secret as possible.

Once the data have been collected, research projects have for many years used a lot of different techniques to mask the identity of individuals. The most common one is what's called top coding: if you ask your subjects about their incomes, the people with the highest incomes are going to stand out in most cases, so usually you group them into a category such as people with incomes above $100,000, or something like that, so that there is not just one person at the very top but a group of people, which makes them more anonymous.

This list of techniques - which goes from aggregation approaches to actually affecting the values - is ordered by the amount of intervention involved. Some of the more recently developed techniques actually involve adding noise, or random numbers, to the data itself, which tends to make it less identifiable, but it also has an impact on what you can do with the data.
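As an illustration of the two "safe data" treatments just mentioned, here is a minimal sketch in Python (not shown in the webinar) of top coding an income variable and of adding random noise. The $100,000 threshold echoes the example above; the noise scale and all values are invented and are not recommendations.

```python
# Minimal sketch of two safe-data treatments: top coding and noise addition.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

incomes = pd.Series([32_000, 54_000, 61_000, 87_000, 250_000, 1_200_000], name="income")

# Top coding: everyone above the threshold is reported at the threshold,
# so the highest earners no longer stand out as single records.
TOP_CODE = 100_000
top_coded = incomes.clip(upper=TOP_CODE)

# Noise addition: perturb each value slightly; this reduces identifiability
# but also changes what analyses the released data can support.
noisy = incomes + rng.normal(loc=0, scale=0.05 * incomes.std(), size=len(incomes))

print(pd.DataFrame({"original": incomes, "top_coded": top_coded, "with_noise": noisy.round()}))
```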
Safe projects means that the projects themselves are reviewed before access is approved. At most data repositories, when the data need to be restricted because of sensitivity, we ask the people who apply for the data to give us a research plan. That research plan can be reviewed in several different ways. The first two are things that we do regularly: at ICPSR we ask, first of all, do you really need the confidential information to do this research project, and, if you do need it, would this research plan identify individual subjects? We're not in the business of helping marketers identify people for target marketing, so we would not accept a research plan that did that. There are also repositories that actually look at the scientific merit of a research plan, but to do that you need to have experts in the field who can help you.

Safe settings means putting the data in places that reduce the risk that it will get out. I'm going to talk here about four approaches. The first one is data protection plans. For data that need to be protected, but where the level of risk is reasonably low, we often send those data to a researcher under a data protection plan and a data use agreement - which I'll come to in a couple of minutes. The data protection plan specifies how they're going to protect the data. Here's a list of things that we worry about that one of my colleagues at ICPSR made up. One of the things we ask people is: what happens if your computer is stolen? How will the confidential data be protected? There are a number of things that people can do, like encrypting their hard disk or locking their computers in a closet when they're not being used, that can address these things. I think that data protection plans need to move to a general consideration of what it is that we're trying to protect against, and allow the users to propose alternative approaches, rather than saying oh, you have to use this particular software or this or that. We have to be clear about what we're worried about.

A couple of notes about data security plans: they are often difficult, partly because of the approach that has been taken in the past, and also because researchers are not computer technicians and we're often giving them confusing information. One of the ways that I think universities - in the US at least - are going to move beyond this is by developing their own protocols with different levels of security for different types of problems. At each level they specify the kinds of measures that researchers need to take to protect data at that level of sensitivity. From my point of view, as a repository director, any time the institutions provide guidance it's a big help to us.

The other way to make the data safe by putting it in a safe setting is to control access. There are three main ways that repositories control access. One kind of system is what I'd call a remote submission and execution system, where the researcher doesn't actually get access to the data directly. They submit program code or a script for a statistical package to the data repository; the repository runs the script on the data and then sends back the results. That's a very restrictive approach, but it's very effective. More recently, however, a number of repositories and statistical agencies have been moving to virtual data enclaves. These enclaves - which I'll illustrate briefly in a minute - use technologies that isolate the data and provide access remotely but restrict what the user can do.

The most restrictive approach is actually a physical enclave. At ICPSR we have a room in our basement with computers that are isolated from the internet. We have certain data sets that are highly sensitive. If you want to do research with them, you can, but on the way into the enclave we're going to go through your pockets to make sure you're not trying to bring anything in, and [on] the way out we're going to go through your pockets again, and you'll be locked in there while you're working, because we want to make sure that nothing uncontrolled is removed from the enclave. The disadvantage of a physical enclave is that you actually have to travel to Michigan to use those data, which can be expensive. That's the reason that a number of repositories are turning to virtual data enclaves.

This is a sketch of what the technology looks like. What happens is that you, as a researcher, log on over the internet to a site that connects you to a virtual computer. That virtual computer has access to the data, but your desktop machine does not. You can only access the data through the virtual machine. At ICPSR we actually use this system internally for our data processing to provide an additional level of security. We talk about the virtual data enclave, which is the service we provide to researchers, and the secure data environment, which is where our staff work when they're working on sensitive data.

It's a little bit of a let-down, but this is what it actually looks like. The window that's open there with the blue background is our virtual data enclave; I've opened a window for [unclear] inside there. The black background is my desktop computer. If you look closely, you'll see in the corner of the blue box the usual Windows icons, and that's because when you're operating remotely in the virtual enclave you're using Windows. It looks just like Windows and acts just like Windows, except that you can't get to anything on the internet. You can only get to the things that we provide, for a level of security. On top of that, the software that's used - we use [VMware] software, but there are other brands that do the same thing - essentially turns off your access to your printer and your access to your hard drive or USB drive, so you cannot copy data from the virtual machine to your local machine. You can take a picture of what you see there, but because you have that capability we also restrict people with a data use agreement.

That's my next topic: how do you make people safer? The main way that we make people safer is by making them sign data use agreements or by providing them with training. The data use agreements used at ICPSR are, frankly, rather complicated. They consist of the research plan, as I mentioned before; we require people to get IRB approval for what they're doing; a data protection plan, which I mentioned; and then there are the additional things of behavioural rules, security [pledges] and an institutional signature, which I'll mention now.

If you look at the overall process of doing research, there are a number of legal agreements that get passed back and forth. It actually starts with an agreement made between the data collectors and the subject, in which they provide the subjects with informed consent about what the research is about and what they're going to be asked. It's only after that that the data go from the subject to the data producers. Then the data archive - such as ICPSR or ADA - reaches an agreement with the data producers in which we become their delegates for distributing the data. That's another legal agreement. Then, when the data are sensitive, we have to get an agreement from the researcher - and these are pieces of information we get from the researcher - and, in the United States, our system is that the agreement is actually not with the researcher but with the researcher's institution. At ICPSR, we're located at the University of Michigan, and all of our data use agreements are between the University of Michigan and some other university, in most cases. There are some exceptions. It's only after we get all of these legal agreements in place that the researcher gets the data.

One of the things in our agreements at ICPSR is a list of the types of things that we don't want people to do with the data. For example, we don't want someone to publish a cross-tabulation table where there's one cell that has one person in it, because that makes that person more identifiable. There's a list of these things - often we have 10 or 12 of them - that are really standard rules of thumb that statisticians have developed for controlling re-identification. The ICPSR agreements are also, as I said, agreements between institutions. One of the things that we require is that the institution takes responsibility for enforcing them, and that if we at ICPSR believe that something has gone wrong, the institution agrees that they will investigate it based on their own policies about scientific integrity and protecting research subjects.

DUAs are not ideal; there's a lot of friction in the system. Currently, in most cases, a [PI] needs a different data use agreement for every data set, and they don't like that. In the future, I think we can reduce the cost of data use agreements by making institution-wide agreements where the institution designates a steward who will work with researchers at that institution. There's already an example of this: the Databrary project - a project in developmental psychology that shares videos - has done very good work on legal agreements. My colleague, the current director at ICPSR, Margaret Levenstein, has been working on a model where a researcher who gets a data use agreement for one data set can use that to get a data use agreement for another data set, so that individuals can be certified and that certification can be used in other places.

One of the things that I think we need to do more about is training. A number of places, like ADA, train people who get confidential data. We've actually done some work on developing an online tutorial about disclosure risk, which we haven't yet released, but it's, I think, something that should be done.

Finally, there's safe outputs. The last stage in the process is that the repository can review what was done with the data and remove things that are a risk to subjects. This only works if you retain control, so it doesn't work if you send the data to the researcher, but it does work if you're using one of these remote systems like remote submission or a virtual data enclave. Often this kind of checking is costly. There are some ways to automate part of it, but a manual review is almost always necessary in the end.

A last thing about the costs and benefits: obviously, data protection has costs. Modifying data affects the analysis. If you restrict access you're imposing burdens on researchers. Our view is that you need to weigh the costs against the risks that are involved. There are two dimensions of risk. One dimension is: in this particular data set, what's the likelihood that an individual could be re-identified if someone tried to do it? And, secondly, if that person was re-identified, what harm would result? We think about this as a matrix: as you move up you're getting more harm, and as you move to the right you're increasing the probability of disclosure. If the data set is low on both of these things - for example, if it's a national survey where 1,000 people from all over the United States were interviewed, we don't know where they're from, and we ask them what their favourite brand of refrigerator is - that kind of data we're happy to send out directly over the web without a data use agreement, with simple terms of use. But as we get more complex data with more questions, more sensitive questions, we often add some requirements in the form of a data use agreement to assure the data are protected. When we get to complex data where there is a strong possibility of re-identification and where some harm would result to the subjects, in that case we often add a technology component like the virtual data enclave. Then there are the really seriously risky and sensitive things. My usual example of this is a data set at ICPSR that was compiled from interviews with convicts about sexual abuse and other kinds of abuse in prisons. Those data are very easy to identify and very sensitive, so we only provide them in our physical enclave.

That's the end of my presentation. Thank you for your attention. We'll take questions later.

Kate LeMay: Great. Thank you, George. We'll pass over to Dr Steve McEachern to give his presentation about managing sensitive data at the Australian Data Archive.

Steve McEachern: My aim today is to build off what George has talked about, particularly taking the five safes model and looking at the situation in the Australian case. I'll talk about the Australian Data Archive and how we support sensitive data, but I want to put it in the context of the broader framework of how we access sensitive data in the Australian social sciences generally. I'm going to talk about some of the different options that are around, picking up on some of the alternatives George has discussed, and demonstrate the different ways these are in use here in Australia. I'm really focussing more on the five safes model and its application in Australia than on ADA specifically; as I say, we are one component of the broader framework for sensitive data access here.

What I really want to cover here is thinking about sensitive data and the five safes model. I'll look at the different frameworks for sensitive data access in Australia and where you might find them, and then how we apply the five safes model at ADA in particular. Then, time permitting, I might say something briefly about the data life cycle and sensitive data as we go through.
I wanted to pick up particularly on the ANDS definition of sensitive data here. I'll frame this in the context of ADA: most of what we deal with at ADA has, at some point in its life cycle, been sensitive data. More often than not it's information collected from humans, often with some degree of identifiability, at least at the point of data collection, if not necessarily at the point of distribution. A lot of what we deal with - and this is true for a lot of social science archives - would fall into the class of sensitive data, although there's a distinction we would draw between what we receive and what we distribute.

In terms of the definition here - this is the handout, I think, that's in the handout section, and it's available online - sensitive data is data that can be used to identify an individual, species, object, process or location, introducing a risk of discrimination, harm or unwanted attention. We tend to think in terms of human risks more than anything else - the risk to humans and individuals - but it does apply in other cases as well. For example, the identification of sites of Indigenous art might, in and of itself, lead more people to want to visit that location and, in a sense, destroy the thing that you're actually trying to protect; the more visits the site gets, the more degraded the art itself becomes. So it doesn't just hold for human research, but that's probably our emphasis at ADA.

Just to reiterate the five safes, we talk about five things: people, projects, settings, data and outputs. Down the bottom is the reference to the document that Felix Ritchie and two of his colleagues developed, framing out the five safes model. It's been adopted directly by the UK Data Service - that's where it has its origins - the basic principles are applied in a lot of the social science data archives, and it's now actually been adopted by the Australian Bureau of Statistics as well, in their framework for thinking about the output of different types of publications, [which] literally [uses] this model. [Unclear] it's quite a useful framework for talking about this.

I'm going to take a slightly different approach to George in thinking about what we're worried about. As a depositor you worry about the risk of disclosure. As a researcher, what's the flip side of that? Why do we need access to sensitive data? What does it provide? The National Science Foundation, about four or five years ago, put out a call around how we could improve access to microdata, particularly from government sources. It highlights why we talk about the need for access - the sorts of research you can do. This comes from a submission from [David Card], Raj Chetty and several other economists in the US and elsewhere. They were highlighting what's needed: direct access is really the critical thing here, direct access to microdata. By microdata we mean information about individuals, line by line. Aggregate statistics, synthetic data - we can create fake people, as it were - or submission of computer programs for someone else to run really don't allow you to do the sorts of work you need to answer policy questions in particular. A lot of social policy research is focussed in this way. In order to do certain things, access to this data is necessary. How do we facilitate that, taking account of the sorts of concerns that have been raised? [On site], that is.

How do people expect to access it? This was an interesting blog post from a researcher based, previously, at the University of Canterbury, comparing how you access US census data versus the New Zealand census - and, similarly, you could say the same for the Australian census as well. In the US you can get a one per cent sample of the census and you just go and download a file directly; it's open as what's called a public use microdata file. Those are directly available. In New Zealand, there's a whole series of instructions you have to go through: you might be subject to data use agreements, you might be subject to an application process, et cetera, et cetera. He's criticising this, saying it should be much easier - it should be the US model that applies here rather than the New Zealand model. What we're really talking about is that both are appropriate, depending upon the sorts of detail, the sorts of identifying information, that are available. Both might be valid models; they just allow you to do different things. The first model really focusses on masking the data to some degree, as in some of the safe data approaches that George talked about. The other uses other aspects of the safes model to address confidentiality concerns.

What you also find is that researchers understand these concerns, but there has to be some trade-off. The need for confidentiality is recognised and understood, and there may well be - there ought to be - trade-offs in return for that. For example, Card and his colleagues suggest a set of criteria that you could put in place for enabling some form of access to sensitive microdata. They reference access through local statistical offices, remote connections such as the virtual enclave that George talked about, and monitoring of what people are doing. If you're going to have highly sensitive data available, the trade-off for that access should be appropriate monitoring. So this is just one possible approach, but there is a recognition that access brings with it responsibilities and appropriate checks and balances.

What I want to talk about is how that has eventuated in Australia - what do we see? The sorts of models we see here in Australia I've broken out broadly into four areas. The one that people are probably most familiar with is the ABS, the Australian Bureau of Statistics. They have a number of systems and access methods that suit different types of safe profiles. These include what are called confidentialised unit record files, or CURFs; the remote access data lab, which is one of their online execution systems; an on-site data lab - you can go to the bowels of the ABS buildings, certainly in Canberra and, I believe, in other states as well, and do on-site processing; and then other systems, probably the best known of which is TableBuilder, an online data aggregation tool which does safe data processing on the fly. Our emphasis at ADA is primarily on confidentialised unit record files, so we provide unit record access and some aggregated data access as well.

Then we have the remote execution or remote analysis environments. Under this model I'd put the Australian research infrastructure network for [geographic] data access in particular. The Secure Unified Research Environment (SURE) produced by the Population Health Research Network is an example of George's remote access environment as well, and even the data linkage facilities - another part of the PHRN network - fit to some degree under this type of secure access model; that's, in a sense, a more extreme version of it. Then we have other ad hoc arrangements as well, things like physical secure rooms - a number of institutions have a secure space; there are a number here at ANU, for example - and other departmental arrangements that exist.

We can classify those in terms of the type of approaches they take. What I've done here is just a very simple assessment, from not at all through to a very strong yes, of how each addresses each safe element, from low to high. I have some question marks on some of the facilities, particularly SURE and the linkage facilities, not because I don't think they can do it; it's that I don't have enough information to make an assessment there.
If you look at the different types, things like the ABS models have tended towards safe data - those sorts of confidentialisation routines - together with output checking and secure access models; the tabulation systems are a secure access model as well. They have tended less towards safe people and safe projects - checking of people and checking of projects. In a lot of cases there's more trust in the technology than in the people using the technology, which I think is a little bit problematic, given that - and I'm going to come back to this - there are some fairly good processes in Australia for actually assessing the quality of people in particular and, to some extent, the projects. The point I'm making here is that you have different alternatives for how you might make sensitive data available. There's not one solution; it's a question of what mix of things you might do - and I'll come back to that at the end.

In the Australian experience, as I say, we have a strong emphasis on safe data. We came up with a term in Australia - confidentialisation - and that's probably the term you'll see most regularly here; anywhere else in the world you would hear the term anonymisation. I'm not quite sure why that is but, as I say, in Australia we tend to use confidentialisation. The Australian Data Archive uses this model, as do the ABS and the Department of Social Services - things like the Household, Income and Labour Dynamics in Australia (HILDA) survey use anonymisation techniques as the starting point. You can make data safe before you release it. It has its limitations, though. A good example is some of the data sets that were released into the data.gov environment, where anonymisation - safe data - was the priority. If you haven't done your anonymisation properly, there is the potential for it to be reverse engineered, and you get a disclosure risk, so it has its flaws.
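To illustrate the kind of reverse engineering being described, here is a minimal sketch in Python (not from the webinar) of a linkage attack: a release with names removed is joined to an external source on shared quasi-identifiers, and a unique combination re-attaches an identity to an "anonymised" record. All column names and values are invented for the example.

```python
# Minimal sketch of a linkage attack on a naively anonymised release.
import pandas as pd

released = pd.DataFrame({          # names removed before release
    "postcode":   ["0870", "2600", "2600"],
    "occupation": ["teacher", "nurse", "nurse"],
    "age":        [58, 34, 41],
    "sensitive_answer": ["yes", "no", "yes"],
})

external = pd.DataFrame({          # e.g. a public register or social profile
    "name":       ["R. Citizen"],
    "postcode":   ["0870"],
    "occupation": ["teacher"],
    "age":        [58],
})

# If the combination of postcode, occupation and age is unique in the release,
# the merge re-attaches a name to the supposedly de-identified record.
linked = external.merge(released, on=["postcode", "occupation", "age"], how="inner")
print(linked)
```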
This is why we have tended towards looking at a combination of techniques. As George pointed out, [unclear] if the risk of actually being identified is low - and particularly if the harm that comes from that is low - then it may be that this is sufficient. Certainly, for a lot of the content that we have at ADA, most of our emphasis is on safe data more than anything else.

Safe settings: we do have examples here. Tabulation systems - things where you can do cross-tabs online - are fundamentally a safe settings model; people don't get access to the unit record data, they just get access to the systems that produce outputs. Remote access systems - the remote access data lab, the PHRN SURE system, and a new system that the ABS are bringing on, their remote data lab, where they're making their data labs available in a virtual environment (it's in pilot stages that we're working with them on at the moment) - are increasingly being used as well. There are also the secure environments I mentioned: the on-site data lab and the secure rooms.

Safe outputs: a number of the safe settings environments - because they tend to hold highly sensitive data - have safe output models as well. The real problem has been scaling them. They require manual checking more often than not; reviewing the output of these sorts of systems requires people, and that requires time. It's hard to automate as well. The ABS have invested a lot of money into automating output checking, in point of fact - their TableBuilder system is one of the best that's around - but their new remote lab still has manual checking of outputs. It depends on the sorts of outputs you're producing as to whether you can actually automate the checking. The other side of this that I think will become increasingly relevant is the replication and reproducibility of things that come out of systems like this. How are we going to facilitate replication within those environments? I'm not sure that question's been addressed yet.

Safe researchers and safe projects in Australia - to be frank, they are considered in most models, but they're not really closely monitored, because they're difficult to monitor. How do you follow the extent to which people comply with the things they've signed up to? Anyone who's involved in reporting of research outputs for ERA or anything similar will know that getting people to fill out forms recording what they have produced is hard; filling out forms to say whether they have complied with a data use agreement is [unclear]. That said, we do have some checks and balances. Certainly the ethics models and the codes of conduct for research provide some degree of vetting [assurance] for those that go through that sort of system. We have some checks and balances in place, particularly for university researchers, to address these sorts of concerns. I think an increasing emphasis on safe researchers and safe projects might be one that we can leverage a bit more. As I say, the frameworks we have in place - the Australian code of conduct, and increasingly professional association and journal requirements for data sharing - are going to put a degree of assessment on the sorts of practices we use as well. In America it's the American Economic Association and the DART agenda in political science [unclear] data sharing; these are also a mechanism for assessing the sharing of data, but also the extent to which you'll, essentially, [unclear] as well. That's something to consider in the future.

I'll quickly turn to the ADA model and then wrap up. Our emphasis is primarily on safe data. Data is anonymised, and that tends to be done in advance by the agencies or researchers that provide data to us. We will also do some review of content, and we'll provide recommendations back to our depositors: these are the sorts of things you'll probably want to think about - have you included things like postcodes or occupational information? If I know someone's postcode, their occupation and their age, there's a fair chance that I can identify them in many cases, in remote locations in Australia in particular. There are some basic checks you can do.

Then, safe people and safe settings: our data access is almost all mediated. You must be identified. You must provide contact information and [supervise …]. We do some checking on safe people, and we require information on project descriptions - what do you intend to do with the data - particularly where we have more sensitive content. Often that's a requirement from depositors. We don't really operate, frankly, in safe settings and safe outputs; that's not the space that we work in. We work with other agencies such as the ABS, and where there's highly sensitive content that needs to be made available we'll point people to the relevant locations.

Something like the remote data lab has a different focus. It focusses less on safe data - it's a virtual enclave - though it doesn't prohibit the use of safe data practices. Where you have highly sensitive data there's a more dedicated assessment process on the project and its outcomes, and highly safe settings, sitting at the ABS. The problem is the cost of establishing the system itself, and they vet all of the outputs, which has a cost associated with it. They have safe people: [there is] training for researchers prior to accessing the system. There is some challenge in assessing the backgrounds of people, for example - this is where the need for domain experts comes in. If you're going to fully assess people and projects, and you're going to assess their domain expertise, you need domain experts to be able to [unclear] that sort of evaluation. The emphasis might well be on whether you are using appropriate techniques, whether you are maintaining secure facilities, and what the research [plan] itself looks like, more than on the quality of the science - that's much harder to evaluate.

Safe projects: that has been used in some places at the ABS. Sometimes it's required for legislative reasons - the extent of data release is itself dependent upon meeting a public good statement, for example. One of the questions for the future for some organisations is: should this matter? Basic research itself might generate useful insights that you didn't expect. As I say, in some cases you're probably going to be moving the levers, focusing on different aspects of the safe data environment.

I guess the message I want to put through here is that there is a suite of options available to you for accessing sensitive data. Different models exist and they cover different ranges of the five safes. You can certainly incorporate safe people models. [Unclear], a lot of models focus on the expectation that we have an intruder - hackers coming in to access our system. Actually, what tends to be the case, more often than not, is the silly mistakes: I made a mistake by leaving my laptop on the train, or leaving my USB in the computer lab. That's far more common. We tend to profile to default options in terms of our mix of safes but, as I say, there are options available to you. What you have to think about is what's appropriate for the [form of/formal] data that you're trying to work with. Fundamentally, the argument is that the principles should enable the right mix of safes for a given data source.

Kate LeMay: Thank you very much Steve. That was a really great overview of the different ways that the five elements of the safes can be mixed, using different [settings]. I thought it was really interesting that both of you mentioned that a safe location was in a basement [laughs]. I've just got these images of people locked up in basements.

I also wanted to note that George mentioned data masking and using de-identification methods, and Steve mentioned confidentialisation and anonymisation. They're similar words for similar processes. ANDS has a de-identification guide available on our website now; if you're interested, it has more detail than that information. We have our guide there that you can have a look at.

I was also wondering - George, you were talking about, with the data protection plan and the data use agreement, that the onus is on the institution: if someone breaks it, they need to put them through some sort of research integrity investigation or something like that. If that doesn't happen, is there any potential recourse for the university? Would ICPSR turn around and say: you didn't follow this process, you're not going to be accessing any of our data anymore?

George Alter: Sure. Actually, on our website we list the levels of escalation that [we'll have to] go to. We can certainly cut off the institution from access to ICPSR data, but what really gets people's attention is that the National Institutes of Health in the US has an Office of Human Research Protections. If we thought that someone was breaching one of our agreements and endangering the confidentiality of research subjects, I would report them to that office. That office has a lot of power. They regularly publish the names of bad actors. What's more, they can cut off all NIH funding to universities. They have done that in the past when they thought that protections weren't in place. I always think of that as the nuclear option. I know for a fact that university administrations and their trustees and agents are terrified that NIH will do something like that. Just waving that in front of a university compliance officer gets their attention.
Kate LeMay: Steve, I was wondering, with the Australian Data Archive's use agreement that people are signing - is that with the individual user or with the institution, as it is with…

[Over speaking]

Steve McEachern: Primarily, it's with the individual. We have a small number of organisational agreements, but not many. There is - I would say there's more [unclear] - yeah, the focus is on an agreement between the individual and the organisation rather than at the institutional level. Some organisations do ask for them but, frankly, it's more for pragmatic reasons than for compliance reasons: they want to host the content and manage access by requesting access to a particular data set for all members of their research team, for example. It just makes that easier, as it were.

There are other models. As I say, in the ABS model the agreement is actually with the institution, and then individuals sign up to the institutional agreement. The Department of Social Services model is the same as well. It will be interesting to see the extent to which we move in one direction or another. I'd say the compliance argument hasn't been one that's been all that common here in Australia, except in the case where you have Government data. I would say it's probably [unclear] situation. For academically produced data it hasn't tended to be an emphasis.

Kate LeMay: With George's agreement with institutions, the recourse is that the institution should then have some integrity investigation - what level of recourse do you have with…

[Over speaking]

Steve McEachern: [Limited].

Kate LeMay: …with the individual?

Steve McEachern: Limited. I mean we would probably report back to the institution to which they belong. [As I say], we do have the question of supervisory arrangements. We would probably also follow some of the questions
under the code of conduct for research. That's why I make reference back to the fact that there is an overarching set of obligations on those within Australian academic institutions. We would pursue something in that way.

One of the challenges for us - and I'm going to guess for George as well - is just finding out where you get breaches of compliance. One of the hardest things to do is actually find out what happened in the first place. We've had one case that I'm aware of - certainly in my predecessors' time - which goes back to the late '90s. It's not a common occurrence, but we're aware of it.

Kate LeMay: George mentioned standardised data use agreements between US institutions. Has that been formalised across a number of institutions as part of a consortium arrangement? Or is it more informal and gaining momentum?

George Alter: The example I gave is the Databrary project. They're the only ones I know that have done this in a formal way, where they get institutions to sign on as an institution and then that covers all of the researchers at that institution. It took them a while to negotiate that and get the bugs out, but I think it's paying off for them. This is something that I think other groups like ICPSR should move to. Right now it's a big problem that about one in six of our data use agreements at ICPSR involves a negotiation between lawyers at the University of Michigan and lawyers at the other institution. It's a major cost. I think it's one of the ways to go.

Steve McEachern: I would say - I mean in Australia we have a pretty strong example, which is the Universities Australia-ABS agreement. That model facilitates a whole lot of things. It's enabled access to the broad collection of ABS CURF data under a single agreement. The other side is that universities sign up for the cost that comes with that as well. They're paying a fee for that but, as I say, it covers the broad spectrum of what they can do. The challenge in some cases is what [unclear] have you got for dissemination of the content?
As I say, if I went to the [next department] - I've had this discussion with various departments - could we establish a consistent data access agreement? The problem is that the departments themselves are set up under different models - different legislation, sorry. The impact of that is they can't necessarily have the same set of conditions. Certainly, there is some capacity to [unclear] some of that and, I'd venture, we'll see the extent to which the Productivity Commission report that's [coming out on] data access might address some of those questions as well.

Kate LeMay: Just quickly, there's a question about whether there are any checklists or guidelines for new researchers to assess their research surveys for the level of confidentiality. I think they're talking about privacy risk assessments.

Steve McEachern: Actually, we have an internal checklist. This is something we've talked about in terms of thinking about what you need to do, but it really depends on publication. We talked before about the fact that in order to do certain research you actually need to have some things that [might be identified], so it depends on which point in the data life cycle you're talking about here. When we're thinking about data release, then, as I say, we would basically apply some basic principles - these are the sorts of things that we look for. Actually, we've talked about making that checklist available, in terms of these are the sorts of things you have to be concerned about. There is advice around that we could probably bring together but, as I say, it's this usability versus confidentiality question again.

One of the things we sometimes do is split off those things that have a high confidentiality risk. We actually release [several] different sets of data, so that if you need that additional information you can make it available under a separate, additional set of requirements, possibly under a different technological setting.
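As an illustration of that split-release approach, the following is a small Python/pandas sketch. The column names and the list of high-risk variables are hypothetical; the idea is simply that the high-risk variables travel in a separate file, linked back to the main file by a study ID, and are released under stricter conditions.

    # Hypothetical example: column names and risk judgements are assumptions.
    import pandas as pd

    survey = pd.DataFrame({
        "study_id":   [101, 102, 103],
        "age_group":  ["25-34", "35-44", "25-34"],
        "employment": ["full-time", "part-time", "unemployed"],
        "postcode":   ["2600", "2601", "2602"],               # fine geography
        "hiv_status": ["negative", "positive", "negative"],   # sensitive item
    })

    high_risk = ["postcode", "hiv_status"]

    # General-release file: everything except the high-risk variables.
    general_file = survey.drop(columns=high_risk)

    # Restricted file: the study ID plus the high-risk variables, released only
    # under additional requirements (for example, in a secure setting).
    restricted_file = survey[["study_id"] + high_risk]

    general_file.to_csv("survey_general.csv", index=False)
    restricted_file.to_csv("survey_restricted.csv", index=False)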
Steve McEachern: I think it depends a little bit on when in the life cycle you're talking about here. It often is useful to have information - for example, if you're running a longitudinal study you must have identifying information going forward. You've got to be able to contact someone the next time round. It depends on what you're trying to achieve but, yeah, there is some basic advice that we put out.

George Alter: There's a literature that's been used by statistical agencies about what [unclear], but that whole area is right now somewhat contentious, because the statistical agencies developed that literature largely in the age when data were released in the form of published tables. When the data are available online and you can do repeated, iterative operations on them, you're in a new world. There's a separate literature that's developed in the computer science world. Anyway, it is a problem. There is guidance out there in really complex areas, like in some health care areas. Doing a full assessment of a data set can be very complicated and difficult, so my recommendation is that people start at the basics and think about: how would you identify this person, and if this information got out, what harm would it cause? Often the researchers themselves have a good sense of that from the research they're doing.
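One simple way to start at the basics on the identifiability question is to count how many records share each combination of quasi-identifiers, in the spirit of the k-anonymity literature from the computer science world George alludes to. This is a rough sketch only; the file name, the choice of quasi-identifiers and the threshold of 3 are assumptions made for illustration.

    # Rough first-pass identifiability check: records that are unique (or nearly
    # unique) on a combination of quasi-identifiers are the easiest to re-identify.
    import pandas as pd

    df = pd.read_csv("survey_general.csv")           # hypothetical file from above
    quasi_identifiers = ["age_group", "employment"]  # assumed quasi-identifiers

    # Size of each quasi-identifier combination ("k" in k-anonymity terms).
    sizes = df.groupby(quasi_identifiers).size().rename("k").reset_index()
    df = df.merge(sizes, on=quasi_identifiers, how="left")

    # Flag records in small groups; 3 is an arbitrary threshold for illustration.
    risky = df[df["k"] < 3]
    print(f"{len(risky)} of {len(df)} records sit in groups smaller than 3")

A count like this says nothing by itself about harm; as George notes, the researchers usually have the best sense of what damage disclosure would actually cause.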
Kate LeMay: There's one last question: are the five safes applicable in all research disciplines? Or are they specifically limited to suit the social sciences?

Steve McEachern: I think they're broadly applicable.

Kate LeMay: I agree.

Steve McEachern: I mean, it's interesting. We were having a discussion here about the social sciences, but, for example, we work a lot with the health sciences [unclear] environmental sciences [unclear]. I don't see any reason why they shouldn't be applied elsewhere. Part of the question actually is - it's more [unclear] about what you have to think about in terms of the privacy and confidentiality risks, far more so than what the topic is. The topic helps you make some sort of judgement about the harm, in George's terms, but yeah, it's the confidentiality questions that…

[Over speaking]

Kate LeMay: The framework is [unclear].

Steve McEachern: Yeah.

George Alter: Oh yeah.

Kate LeMay: Fabulous. Thank you very much to George and Steve for coming along to our webinar today, and thank you everyone for calling in…

END OF TRANSCRIPT