[Unclear] words are denoted in square brackets.
Webinar: Provenance and Social Science data
15 March 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Kate LeMay: Today we're going to be speaking about provenance and social
science data. So you should be able to see on our screen we're showing our data provenance community page. We have a data provenance interest group, and if you're interested in that you can contact us through the contacts on that page.
We have our speakers here. I'm Kate LeMay, I'm from ANDS and I'm
one of the research data specialists at ANDS. We have George Alter,
Steve McEachern and Nicholas Car. We'll give each of them a little bit
of an intro when we get to their point in speaking.
So as I mentioned this is part of a series, today's our first one. So I'd
like to introduce Steve and Nick who will be speaking first. So Steve is
the Director of the Australian Data Archive at the Australian National
University. He holds a PhD in industrial relations and a Graduate
Diploma in Management Information Systems and has research
interest in data management and archiving, community and social
attitude surveys, new data collection methods and reproducible
research methods.
Steve has been involved in various professional associations in
survey research and data archiving over the last 10 years and is
currently Chair of the Executive Board of the Data Documentation Initiative.
And Nick - Nicholas Car - is the Data Architect for Geoscience Australia (GA). In that role he designs and helps build enterprise data platforms. GA is particularly interested in the transparency and repeatability of its science and the data products it delivers. For these reasons, Nick implements provenance modelling and management systems in order to represent and store information about data lineage: what was done, who did it, and what they used to do it.
Prior to working at GA, Nick was an experimental scientist at CSIRO, where he researched metadata systems, provenance, data management and linked data. He currently co-chairs the international Research Data Alliance's Research Data Provenance Interest Group, which the ANDS Provenance Interest Group works with, and through that and other groups he assists organisations with provenance management.
Nicholas Car: Okay, thanks Kate. All right so this is a very quick introduction to
PROV, so PROV is a provenance standard and what you see on that
first slide there is a very, very simple diagram of a little provenance
network and I'll discuss some of that as we go. So it's not just a frivolous diagram - it actually has some meaning.
Okay so the outline for today, so what is PROV? I'm just going to
mention that very quickly and then I'm going to get to how do I actually
use this thing in a couple of different ways. So first I'll talk about
modelling. Then I'll talk about how do I actually manage the data once
I've collected or made provenance data and then I'll talk about using
PROV with other systems.
So what is PROV? PROV is a W3C recommendation. So W3C is the
World Wide Web Consortium. So it's one of the governing bodies of
internet standards. They don't issue any documents called standards.
They issue documents called recommendations. So PROV is a
recommendation - their top level of standard, I suppose. Other standards
by the W3C are things like HTML. I'm sure everyone is familiar with
HTML at least to some extent.
PROV itself was completed in 2013 and sort of formalised by the end
of that year. So it's only a couple of years old and a large number of
authors were involved in PROV. There were several initiatives to
make provenance standards before PROV over the last perhaps 20 years, such as PML and OPM. I'm not going to elaborate on those, so if you're interested in those previous standards just Google them.
Many of the authors involved in those initiatives were involved with
PROV. So PROV really does know about those other initiatives and
it's simpler than those precursors because it's trying to be a high-level standard. It doesn't do as many of the tasks as those precursors do, but it certainly represents the very important bits that they came up with.
Another thing to say about PROV is that there's no version two
planned any time soon. Why am I bringing this up now? Well, it's a pain for people to have to deal with standards and then versions two, three and four of standards. PROV doesn't quite operate like
that and I'll explain how. It is what it is and there are ways to extend it
and use it in different circumstances, but it's unlikely that we're going
to see any version change in the next few years I would think.
It's seen good adoption. PROV is really the only international broad-scale provenance standard and as a result people are, I think, happy to adopt it in the absence of really anything else.
Right, so PROV is actually a collection of documents and I've just
listed them there. I'm not going to go through them all in great detail,
but there is an overview document and then certain bits and pieces
which are actual recommendations or standards and additional things
that just help you use the PROV thing.
Now the main document is PROV-DM, the data model. That tells you what PROV contains, how its classes operate and so on. Then
there's a series of documents like an XML version of PROV, an OWL
ontology version and special notations and so on. The only other one
I'll mention is PROV-CONSTRAINTS, which is a list of rules that PROV-compliant chunks of data must adhere to, and that works across any formulation of PROV. I've provided a link there to the collection of documents.
So how do I use PROV? This is about modelling: how do I actually model something using PROV to do the core work of provenance representation?
Well I'm starting off with some negatives, so don't do it like this. Don't
take a document for something, perhaps a metadata catalogue entry
and expect to shove a bunch of information into some field within that
document.
So ISO 19115 is a standard for spatial datasets and it's got a field called lineage, and some people expect to take provenance information and stick it in that lineage field. Don't do that; PROV doesn't let you do that, and I'll explain why in a second. So that's one thing not to do. We're not going to see a single item's metadata record containing a bunch of provenance information. You could do that, but it's not recommended.
What else should I not do? This diagram here is the class model of DCAT, the Data Catalogue Vocabulary, which is a very generic metadata model. It's used in relation to things like Dublin Core and
various catalogue style things and we're not going to link a dataset or
any other object in DCAT or Dublin Core or other standards like that to
a class of provenance information. This is true for Steve's DDI
initiative as well. We're not going to take objects in DDI and link to a
provenance object that tells you the provenance of that object. That's
an anti-pattern right there.
So what are we going to do? We don't even do this using Dublin Core's provenance properties. The Dublin Core vocabulary has a property called provenance, and the wording for that says: use this to
describe lineage history. PROV doesn't want you to do that exactly
like that.
What does PROV want you to do? PROV wants you to think of
everything that you're interested in in terms of three general classes of
objects. So in your scenario, are the things that you're interested in things - entities? Are they occurrences or processes - activities? Or are they causative people or organisations - what PROV calls agents?
So PROV says model everything you know about using those three
classes and then link them together and that's what PROV's all about.
So how does GA use PROV? So we often process chunks of data at
GA. So we have a very simple model that's using the provenance
ontology and it looks like this. There's some process, the process
generates outputs, the outputs are entities, the process itself is an
activity and then there's data and code and configuration and so on
that feed into that process and those are also entities. Finally the
process and the entities might be related to a system and even a
person who operates that system. So that's the model we use.
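To make that model concrete, here is a minimal sketch of it as PROV-O statements using Python's rdflib library. The identifiers (proc1, input_data, processing_system and so on) are hypothetical illustrations, not GA's actual ones.

from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)

# The process is an Activity
g.add((EX.proc1, RDF.type, PROV.Activity))

# Data, code and configuration feeding the process are Entities it used
for item in ("input_data", "code", "config"):
    g.add((EX[item], RDF.type, PROV.Entity))
    g.add((EX.proc1, PROV.used, EX[item]))

# The output is an Entity generated by the Activity
g.add((EX.output_data, RDF.type, PROV.Entity))
g.add((EX.output_data, PROV.wasGeneratedBy, EX.proc1))

# The process relates to a system (an Agent), operated by a person
g.add((EX.processing_system, RDF.type, PROV.Agent))
g.add((EX.operator, RDF.type, PROV.Agent))
g.add((EX.proc1, PROV.wasAssociatedWith, EX.processing_system))
g.add((EX.processing_system, PROV.actedOnBehalfOf, EX.operator))

print(g.serialize(format="turtle"))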
Okay, so how do I actually manage the data that I collect or create according to PROV? Well, you can create reports. So you go and do something; a human or a system logs what was done and stores that information in some kind of database according to the PROV model. It might be a document database, but you can query that thing. So we often have systems that send provenance reports every time they run.
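As a sketch of what querying that store can look like - assuming the reports have been loaded into an RDF graph like the one above - one SPARQL query can answer lineage questions across every report at once:

# Assuming provenance reports have been loaded into the rdflib graph g
# above, one SPARQL query spans all of them
results = g.query("""
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?activity ?input ?output
    WHERE {
        ?activity a prov:Activity ;
                  prov:used ?input .
        ?output prov:wasGeneratedBy ?activity .
    }
""")
for activity, used, generated in results:
    print(activity, "used", used, "and generated", generated)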
You might have a form that looks like any other metadata entry form
where you fill in details and you hit enter and that sends off
provenance information, but again it's not storing it with respect to one
specific object, it's linking existing objects together. So some dataset
that is produced from another dataset is going to link those two things
together.
For catalogue things we can link things again. If we have a catalogue that has a dataset X and a dataset Y and we want to show there's a link, we can say dataset Y was derived from dataset X and record that information somewhere. Now dataset Y may record 'I come from dataset X', but that's just a very simple little bit of provenance information. It's not a whole glob of provenance information stored within dataset Y.
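That catalogue link is a single PROV statement. A minimal sketch, again with hypothetical identifiers:

from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")
CAT = Namespace("http://example.org/catalogue/")

g = Graph()
# One triple records the whole derivation link
g.add((CAT.datasetY, PROV.wasDerivedFrom, CAT.datasetX))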
We can ensure that any system holding information that is provenance information - like who the creator of a dataset was - does so in accordance with the PROV model. So in this case, if we had a dataset that had a creator, we would say the dataset was associated with an agent, and the agent had a role to play, and that role, in this case, was creator. That's now a PROV expression of that relationship.
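In PROV-O terms, one natural way to express that for a dataset - an entity - is the qualified attribution pattern, which lets the agent relationship carry a role. A sketch with hypothetical names:

from rdflib import Graph, Namespace, BNode, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.dataset, RDF.type, PROV.Entity))
g.add((EX.alice, RDF.type, PROV.Agent))

# Simple form: the dataset is attributed to the agent
g.add((EX.dataset, PROV.wasAttributedTo, EX.alice))

# Qualified form: the attribution itself carries the role 'creator'
attribution = BNode()
g.add((EX.dataset, PROV.qualifiedAttribution, attribution))
g.add((attribution, RDF.type, PROV.Attribution))
g.add((attribution, PROV.agent, EX.alice))
g.add((attribution, PROV.hadRole, EX.creator))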
For databases it can be very difficult. I can't explain it in depth here, but there are many ways in which databases could store provenance or PROV-related provenance information. They would, though, need to be able to show that they can actually export their provenance content according to the PROV data model. You actually have to prove that if you want to say that you are compliant with a standard.
So fairly quickly, how do I get PROV to work with other systems? Well
we can fully align our system, whatever this system is. So I've used a
theoretical example of Metadata System X. How do I align Metadata
System X with PROV? I could classify all of the things in Metadata
System X according to PROV. It requires a metadata model for Metadata System X - sorry, a data model, not just encoding formats. We can't just deal with XML and so on. We actually have to have
conceptual model and then we can say, this class of thing in Metadata
System X is the same as this class of thing in PROV.
Now PROV's only got a few classes, so that's usually pretty easy to
do. But it will definitely prompt you to do things that you wouldn't
normally do. You may have to tease apart some of the objects that
you know and love into things that PROV recognises as different
objects.
You could do a partial alignment. You could take your Metadata
System X and only acknowledge that some of the things in that
scenario are PROV understood things. So maybe you've got a
metadata model that talks about all kinds of stuff and one of the things
it talks about is a dataset. You say your dataset is the same as what
PROV thinks of as an entity and maybe you ignore all the other things.
You would still need to demonstrate that you could extract valid PROV out of that, and not all the other stuff, but that would be one way to do it.
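A partial alignment of that kind can be as small as a single class mapping. A minimal sketch, assuming a hypothetical Metadata System X (mdx) vocabulary:

from rdflib import Graph, Namespace, RDFS

PROV = Namespace("http://www.w3.org/ns/prov#")
MDX = Namespace("http://example.org/metadata-system-x#")

g = Graph()
# Declare that Metadata System X's Dataset class is a kind of
# prov:Entity; a reasoner can then surface every mdx:Dataset in
# PROV terms while ignoring the rest of the model
g.add((MDX.Dataset, RDFS.subClassOf, PROV.Entity))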
You could also link to things not in your own data model if you also
classified those things according to PROV. The last scenario you
could think about is to just deprecate your obviously not as good
systems and use PROV. That would require you perhaps to make
either a new dataset of provenance information or a data store and
put that information somewhere and that's it.
Kate LeMay: Thank you very much Nick. So we'll move onto Steve.
Steve McEachern: Nick's talked about the sort of general PROV model that is
increasingly getting used in various different spaces. I'm going to talk
specifically about the various ways of thinking about provenance in
what we're doing in the social sciences, particularly using - within the
standard that we utilise and I'm not the Director for the Data
Documentation Initiative.
Part of the reason we've connected these two together is that we're now looking at how we can leverage the PROV standard inside DDI. So Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that, but I'll return to it at the end.
I sort of want to talk more generally about how we might think about
provenance at different stages of the data lifecycle, different stages in
the research or in the data management experience and how we
progressed our thinking about provenance over that time - just to give you a sense of what sorts of things we can do already and how we can increasingly embed and capture provenance in what we do.
Okay, for those who don't know, I'm going to do a quick introduction to the Australian Data Archive - we've had various names over time. We've been around for a little while now, based here at the Research School of Social Sciences at ANU. Our mission
is to collect and preserve Australian social science data on behalf of
the social science research community in Australia and internationally.
Now we've developed a collection of over 5000 datasets across over 1500 different studies, as we call them, or projects. Lots of
different sources, lots of different provenance from various different
locations, academic, government and private sector. So as our
holdings have developed, our understanding of provenance has
developed probably alongside that. Maybe we didn't call it that at the
time but after 35 years I think that's always been sort of underpinning
a lot of what we've done.
The emphasis there is really on helping researchers who might be the secondary users of our data to know: where did this come from, what was it used for and how might I use it in the future? For those who don't really know what we're talking about when I use the term data archive, we're using the definition of a trusted system from a project done by the Social Sciences & Humanities Research Council of Canada. They're kind of the equivalent in Canada of the ARC.
“An accessible and comprehensive service empowering researchers
to locate, request, retrieve and use data resources…” - so you've got
to be able to find it and understand it - “…in a simple, seamless and
cost effective way, while at the same time protecting the privacy,
confidentiality and intellectual property rights of those involved”. Part
of why we're interested in provenance is really that last point.
One part is to help researchers understand where the data came from, but it is also to recognise and acknowledge the intellectual property that's been developed in those resources over time.
Okay, so I'm going to give a brief introduction to the DDI standard and
its different flavours. As Nick pointed out, having multiple versions is
not always much fun. We're up to version four. We're about 20 years
old now. So I think that's not too bad from Nick's point of view.
I'll cover how we've captured what we might think of as different forms of provenance over time. I've got the website there - the ddialliance.org website - if you're interested in knowing more; you can go and explore the different versions of the standard there.
So what is DDI? It's a structured metadata specification developed for
the community and by the community. So particularly in social science
data archives that exist in most OECD countries. It's used in about 90
different countries around the world now thanks to work by the World
Bank and the World Health Organization and others. There's two
major development lines that are basically XML Schemas. One's DDI
Codebook and the other DDI Lifecycle which both correspond to
version two and version three of the standard.
I'll talk a little bit more about those in a moment. We have some other
elements to it as well, additional specifications including some
controlled vocabularies often for things like encoding methodology,
data types and data capture processes and some RDF vocabularies
so that we can sort of start moving into a linked data world. So you
can leverage the standard, particularly the Lifecycle standard into a
linked data environment.
The current version, version four, has been in development over the last couple of years, and that's where the work with Nick has come on board as well. It's moving to a model based specification. So rather than being based in a particular schema, we're looking to focus on the model and then its expression into various different formats. The provisional ones at this point are XML
and RDF and that includes support for provenance and process
models.
So we're looking at that point at how do we leverage what we know
from PROV to support the provenance model within the new version
of the standard. It's managed by the DDI Alliance.
So briefly, on the two versions of the standard already in place: DDI has been around in the Codebook format, which has its origins in the print codebooks produced by organisations like George's going right back to the 1960s and 70s. So in the social sciences we formalised a fairly structured way of thinking about describing data a good 40 years ago, really.
So the Codebook version of the standard really is an after-the-fact description of what a dataset is about. It includes four basic sections. First, the document description, which describes the document that's describing the dataset. Then a study description - we use the term study to describe the package of datasets that encapsulates a project - which covers characteristics of the study itself that the DDI is describing. That includes lots of sections on authorship, citation and access conditions, but particularly, from the point of view of provenance, the methodological content, data collection processes and sources.
Then we also include a lot of what we call related materials. These are documents associated with the project that tell you something about the provenance of where it came from: all the questionnaires, previous codebooks, technical reports, et cetera. So from a human point of view you're starting to get into the area of thinking about provenance, even though it's not really a machine actionable version of that.
We also describe the files themselves, the characteristics of the
physical data files, data formats, et cetera, their size and their
structure. Then there are what we call variable descriptions: descriptions of the variables that are included in the data file. The simplest way of
thinking about this is the columns of a tabulated dataset. What does that column mean? Because in a lot of the social sciences, a number does not actually represent a number; it represents a characteristic of some sort.
For example, with a five-point agree/disagree scale in a survey, how you interpret those values becomes important. George is going to speak to a specific project looking at how we do a lot more with the variable description and the [unclear] of variables in a moment.
So Codebook was really developed to describe things after the fact. The DDI Lifecycle model takes a more data lifecycle
approach to thinking about capturing metadata and provenance.
Underlying it is the model we have on the screen here. I think this is
just a working model of describing the different processes in the DDI
framework that a dataset can go through. Everything from
conceptualising the study in the first place, through collection and processing and distribution - as a side point, archiving that data and storing it for future use - and then rediscovery and analysis and repurposing into the future. So it
was built with the intent of re-usability and particularly machine actionability as well, so the metadata that's developed for a dataset can be re-used in the future - for the same purpose, a similar purpose or something entirely new.
In order to do that you need to be able to understand where did it
come from. So embedded in that is generating metadata going
forward to be able to look backwards through the lifecycle as well. So
it's focused on metadata re-use, and that re-use of metadata really implies an expectation of provenance.
So why DDI Lifecycle? The things it can do: it's machine actionable. It's more complex - there are 27 different schemas; it's probably overly complex, if we're being fair. It's structured and identifiable, so every metadata item is actually able to be permanently identified, managed and repurposed if that's required.
It supports related standards and it supports reuse across different
projects and again, that's sort of something that George is going to
touch on as well. [I'm going to pass this] because I think there are
some particular features for it that I can refer back to in the future. But
I want to talk very briefly about how do we think about provenance
within the different versions and then pass to George, who just wants
to talk specifically about one of the projects there.
So if we think about how provenance is being supported here - I mean Nick's approach with the PROV model is really a machine actionable model, and fundamentally DDI Codebook is not designed for that. But it is designed at least to be able to describe, to a human reading a catalogue entry, what the provenance of this dataset was. So it includes attribution, methodology, data processing, collection and all the documentation we can find on what happened to the data. But it doesn't really do that in an automated way; it's really focused on a human response to prior research, being able to come back and have a look.
Similarly with variables: the question text, the variable name and what the value labels mean are all there.
DDI Lifecycle was really our first attempt to look at machine actionable provenance. So can we capture this along the way? It represents again the information from the studies - attribution, methodology and so forth. But particularly with variables
it's really trying to look at the reusable elements of how we might
reuse questions, reuse columns of data and understand and reuse the
basic conceptual ideas that are embedded within that.
So, for example, if you've got a variable measuring employment, can I reuse that employment variable - maybe the categorisation that was used, the numbers that were used in the survey and so forth.
Then where we're going with DDI 4 - our tagline for that is what we're calling DDI Views - is: to what extent can I actually embed a provenance model inside that framework? So now we're moving
towards really recognising the importance of provenance both conceptually and in the physical and digital formats of data as well, managing codes and categories across the lifecycle - for example, managing the provenance of missing values. If the value of a datum changes, how do I understand that?
So we want to be able to generate this automatically: what happened at the level of an individual datum, of a variable or of a dataset. So we're moving progressively towards the sort of framework that Nick described, but [unclear] that requires the management of the metadata that we have to be moved forward. That's kind of it from me.
George Alter: Hi everyone. Thanks very much to ANDS and to ADA for inviting me
to be here. What I'm going to talk about today is a project that started
in October with funding from the US National Science Foundation
about capturing metadata during the process of data creation.
So for this audience I don't think I have to justify metadata, but the big problem that we face is how we actually get the metadata. It's a lot easier to describe it than it is to actually get it, most of the time.
So to give you some background I'm going to put this in the context
of my home institution which is the Inter-University Consortium for
Political and Social Research located at the University of Michigan.
We've been in the business of archiving social science data since
1962 and we're an international consortium of more than 760
institutions. We were also one of the founding members of the Data
Documentation Initiative Alliance which Steve just talked about and
we actually provide the home office for the DDI Alliance. ICPSR has
been using DDI for many years but we're now getting to the point
where we're able to build all kinds of tools that take advantage of DDI.
One of the first things that we did, and we've been doing it for at least 10 years, is that when you download data from ICPSR you get with it a codebook as a PDF. The PDF is something we created from the
DDI, not the other way around. So for us the DDI is the native version
of the metadata.
So what we started to do is take advantage of DDI to build more kinds
of tools. One of the first ones we created was what's called our
variable search page where you can put in a search term and look for
questions that have been used in datasets that are like that search
term. So this is an example of the results that come out of a variable
search and we are now searching over more than 4.5 million variables
in about 5000 studies or data collections.
One of the things that DDI makes possible is that we can go from this
search to other characteristics of the data. So you can see here in the
blue that there are a number of things that are hyperlinked. If you click
on the place I've got circled, it takes you to an online codebook. The
online codebook has a number of features. It tells you the question
that was asked. It tells you how it was coded. If the data are available
online you can go to a cross tab tool and it also can link to an online
graphing tool.
The other thing that you see on the left side of the screen is a list of
the other variables in the dataset. So you can move around in the
dataset and clicking on any of those variables will bring up a display a
bit similar to this.
Another thing you can do from our variable search screen is if you
click on these check boxes on the left, you can pick out a certain
number of variables that you want to look at more closely and clicking
on this compare button at the top there brings you to this screen which
is a side by side comparison of these different variables which come
from different studies and so you can see whether they're asking the
same question, whether they're coded the same or differently. As
before, this screen is also hyperlinked to the online codebook so you
can go back and forth.
One of our more recent tools which I think is one of the most powerful
is that you can now search for datasets that include more than one
variable that you're interested in. So this is a search using what we call our variable relevance search - it's actually in the study search rather than the variable search - where we're looking for variables about three different things. Does the respondent read
newspapers? Do they volunteer in schools? What's their race? You
can see here that the results come out in three different columns
within each study so you can see which variables are present in each
study.
As before everything is hyperlinked to both the online codebook and
the variable comparison. So you can check on any combination of
these variables and compare them side by side.
Another thing that we did, in another previous NSF project, working
with the American National Election Study and the General Social
Survey, we made a crosswalk of the variables that are available in
those two studies. Now the American National Election Study started
in 1948 and is done every four years. The General Social Survey
started in 1972 and is done every two years. So we're actually going
to be looking over 70 different datasets.
What we've done is created this crosswalk where we've grouped the
variables according to certain tags. We've got eight lists of tags and
then 134 tags in total. The columns here - each column represents a dataset, and there are 70 datasets. All of the variables are linked here
and I can't actually show it here but if you hover over one of those
variables it shows you the question text for the variable.
Again you can use the checkboxes to pick out things that you want to
compare and go to the variable comparison screen. So this is - a
crosswalk like this is a tool that's actually very common. You've
probably seen these before. There are two things that are different
about this though. One is that this is all keyed into the online
codebook so you can go transparently back and forth.
The other thing is that we can use this tool to crosswalk any of the 4.5
million variables in the ICPSR collection because this is drawing
directly from our store of DDI metadata and we don't have to build a
separate tool for each one. This one tool works over all of these
datasets.
Another thing that we did in this project was to think about how we
could extend the online codebook. So here's our online codebook that
you saw before which has the question text and how it was coded, but
this version has something new in this location here. It shows how you
got to this question.
In big surveys every respondent doesn't answer every question. There
are what are often called skip patterns. So you get asked what your
marital status is and if you're single you go to one question, if you're
married you go to another question, and divorced people go to a third. So there are different pathways through the questionnaire. What we've done here is try to show: here's how you got to that question, which explains why some people didn't answer it. We also represented it in words down here.
So we built this and we were quite proud of ourselves for building it
because this does answer the question about who answered this
question in the survey. But then we ran into a problem so how do we
know who answered the question in the survey. The answer is that we
get that information from the data providers in a pdf. The only way we
could build this demo prototype was to have one of our staff members
enter this program flow information by manually into XML for one of
datasets so we could show how this works.
So we showed a tool that we think is really useful, but we reached a
roadblock because we don't actually get machine actionable metadata
about this kind of information. The problem is that when the data arrive at the archive, they don't have the question text. That's something that we at ICPSR and ADA have to type in. They don't
have the interview flow. They don't have any information about
variable provenance and variables that are created out of other
variables are not documented.
So the project we're working on now, which is called C2 Metadata - for Continuous Capture of Metadata - is about how you get that. To understand how we get it, you have to think about how the data are created and what happens.
So first of all, the data themselves are actually born digital. People do not go around with a paper questionnaire these days. They use these computer assisted interview programs, on the telephone, or they go around with a laptop or a tablet. There's no paper questionnaire. There is instead a program, and it's the program that is the metadata.
So technically at the beginning you start with this computer assisted
interviewing system and what you get out of it is the original dataset.
But you can also derive from it DDI metadata in XML and there are
programs, a couple of different programs that will take these CAI
systems, the code that they run on [unclear] to XML.
But what happens next, well what happens next is that the project that
commissioned the data is going to modify the data. There are a
number of reasons for doing that. There are some things that are in
the data that are just purely there for administrative purposes. There
are some variables that have to be changed to reduce the
identifiability of individuals. Some variables that need to be combined
into scales or indexes.
So what they do is write a script that's going to run in one of the major statistical packages. The script and the data go through that software, and what comes out is a new dataset. Well, what happens to the metadata? At this point the metadata don't match the dataset anymore; you would need to update the XML to fix it, and nobody likes updating XML.
So the metadata get trashed and thrown away. What happens then is this: after the data are revised, the metadata are recreated. What happens is that we at the archive take the revised
data and extract as much metadata from it as we can. So we get an
extracted XML file and what about the things that went on in the script
here? Well we actually have to sit down and extract them by hand.
So a person has to read the script and write down what happened.
Well, so what are we missing? What we get from the statistics packages are just names, labels for variables, labels for values and virtually no provenance information. So what we're working on is a way to automate the capture of this variable transformation metadata. The idea is this: we're going to write software where you take the script that was used to modify the data - the very same script - run it through what we're calling a script parser, and pull from that the information about variable transformations. We put that into a standard format which we're calling a standard data transformation language.
Then you take that information and incorporate it into the original DDI.
You update the original DDI and then you've got a new version of the
XML that is in sync with the revised data. So this process then
requires two different software tools: one that will read the script and turn it into a standard format, and a second one that will update the XML. That's what we're building.
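As a toy illustration of the first tool's job - this is a sketch of the idea, not the project's actual parser - a few lines of Python can read one SPSS-style COMPUTE statement and emit a neutral, package-independent transformation record:

import re

def parse_compute(line):
    """Parse e.g. 'COMPUTE income2 = income * 2.' into a record."""
    m = re.match(r"COMPUTE\s+(\w+)\s*=\s*(.+?)\.\s*$", line, re.IGNORECASE)
    if m is None:
        return None
    target, expression = m.groups()
    # A neutral record of the transformation, ready to be normalised
    # into a standard data transformation language and merged back
    # into the DDI
    return {
        "command": "Compute",
        "result": target,
        "expression": expression.strip(),
        "sourceLanguage": "SPSS",
    }

print(parse_compute("COMPUTE income2 = income * 2."))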
So we are building tools that will work with the different software
packages and update XML. We're actually writing these parsers for
scripts in four different languages, SPSS, SAS, Stata and R. The
reason we're doing four languages is that if you look at the column
over there on the right which is based on downloads at ICPSR in
cases where the dataset had all four formats, you can see that there's
not a single dominant format.
SPSS and Stata are the most downloaded formats from ICPSR and
they both have about 24 per cent. SAS and R both have about 12 per
cent. If we did one package we'd be pleasing only a few people and we couldn't have an impact. So we're actually
writing parsers for four languages.
Here's something that's come out of our work that I thought you might find interesting. This is about why we need to have a special language
for expressing these data transformations. So here are three brief
programs in SPSS, Stata and SAS that all are designed to operate on
the same data. I tried very hard to make the programs, the scripts
identical and I think that I succeeded. But if you run these three
programs you get three different results.
The key thing here is to look at the last row, the row in which we set the minus one to be missing. In SPSS you get two missing values. In Stata and SAS one of the variables is set to a number, but it's a different one in each. Why does this happen?
Well, the reason is that in logical expressions SPSS treats a missing value as missing: the result of a logical expression that includes a missing value is itself missing, which in most cases is treated as false. Stata treats a missing value as a number which is equal to infinity. SAS treats a missing value as a number which is equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons.
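A toy demonstration of those three conventions - illustrative Python, not the packages' actual code - for the test "is x greater than 5" when x is missing:

import math

def greater_than(x, threshold, package):
    # Substitute each package's convention for a missing value (None)
    if x is None:
        if package == "SPSS":
            return None       # missing propagates; usually treated as false
        if package == "Stata":
            x = math.inf      # missing sorts above every number
        if package == "SAS":
            x = -math.inf     # missing sorts below every number
    return x > threshold

for pkg in ("SPSS", "Stata", "SAS"):
    print(pkg, greater_than(None, 5, pkg))
# SPSS None, Stata True, SAS False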
So it's actually more accurate to represent the data in this way, which
you wouldn't see if you just looked at the datasets. So what we're
doing is creating our own language - well, we're actually using a language that's been created by another community, the SDMX community's Validation and Transformation Language - so that we can put all three of these languages into a common core.
So what are we doing and why are we doing it? So the goal of the
project is to capture this metadata and automate it. If we can capture
more metadata from the data creation process, we'll be able to
provide much better information to researchers about what's in the
dataset. Automating this process we hope will make it cheaper for
everyone and make it easier. That has been one of the principles we've tried to follow here: if we can't make it easier for the researchers, they're not going to do it.
So the hope here is that the software we get will make their lives
easier. Just to acknowledge some of my partners in this: we've got partners from a couple of software firms, Colectica and Metadata Technology North America, and the Norwegian Centre for Research Data, and the two projects I mentioned, the General Social Survey and the American National Election Study, are part of the project too.
So that's my talk.
Kate LeMay: Fabulous, thank you very much George. We had a question that came
through earlier from your talk when you were speaking about people
putting variables into ICPSR and searching for them and [Ming 43:40]
has asked, when a user searches for a variable or variables, do they
need to come up with the exact variable name as in the variable
index?
George Alter: So right now what we're doing is really a text search. When you
search for variables, you're searching over the variable name and the
variable label. It also can bring up items that are in the values for the
variables. But one of the problems in the social sciences is that people
don't reuse questions very often. So we don't have a tradition of
reusing questions. It's very hard to find the same question in multiple
datasets. The kind of search we're doing now in our question bank is
frankly kind of clunky and it often misses things. That's an issue I'm trying to address in some other projects, where we're trying to improve the way we can search over variables.
Kate LeMay: Thank you very much. We've got a question for Nick as well.
Nick Car: [Unclear].
Kate LeMay: Yes. So Nick we've got a question, how widely is PROV used and
what have you found to be the main challenges working with PROV?
Noting that a V2 is not on the horizon, is it easy to update a PROV
model if a change is required?
Nick Car: Okay, so first part first: how widely is it used? I have a direct interest in things provenance, but aside from that I have an interest in things geospatial and, I guess, physical sciences data. In that community there's only one game in town, and that's PROV, but it's early days. So most of the spatial, geophysical and so on sorts of places - the hard physical sciences side - either are using their own systems or they're intending to use PROV. There are not many that are actually already using PROV, but there are certainly not many that are intending to use something other than PROV.
Outside of my own Geoscience Australia area, other communities I
know of including DDI and so on - because PROV's only been around
for a few years, if people can characterise their problem in a
provenance way, like they actually understand this as a provenance
question as opposed to some other kind of question like an ownership
or an attribution question, they fairly quickly end up at PROV.
So it's certainly more widely used than any other provenance standard has ever been, and it's showing signs of being much more widely used than that, and that's because the other initiatives in the space have been sort of swallowed up by PROV.
Now the second part of the question was, what are the problems and
I've identified one already which is people have to know that they're
asking a provenance question. So we get a lot of questions which are
synonyms for provenance questions - probably much like variable naming - where people say I'm interested in the lineage of my data, or the transparency, or the process flow, or the ownership or attribution, and those could all be provenance questions. The hardest
thing to work out is specifically what questions are being asked and
then if there is an existing metadata model or something in that space
already, what's it doing and what's it not doing and therefore do we
need provenance - a specific provenance initiative.
So for instance many metadata models have authorship, ownership,
creator information indicated in them. So if your provenance question
is, I want to know datasets created by Nick, that kind of provenance
question you can usually answer in other metadata systems. You'd
have to have something a bit more complicated than that, and determined to be genuinely provenance, to then think about using a provenance system.
The other thing is the move away from what I call point metadata
where you've got a single thing with a bunch of properties that come
from it. So a study or a document or a chunk of data with a bunch of
properties. That's one way to do things, but what PROV and other models are interested in is whole networks - things that relate to
other things. It's more complex, but it's much, much more powerful to
do that.
Kate LeMay: Great, thank you very much. So, a question for George: how are sensitive data, variables or values controlled for during the C2 automatic capture? ICPSR has a confidentialisation service on ingest; is this process carried over to the C2 Metadata project? Is this activity captured in PROV-like metadata?
George Alter: So the C2 Metadata model is to operate solely on the metadata, not on the data, so it doesn't really play into the issue of confidentiality. If you're interested, in two weeks we're going to have
another webinar where I am going to talk about how we manage
confidential data, but in general it's rarely the case that we have to
mask the metadata of a dataset for confidentiality reasons. Obviously
controlling the data is something else.
Kate LeMay: So we've got another question here for George. Your script parser that reads a SAS script - would researchers need to install that in their SAS package?
George Alter: We haven't gotten to that point yet but probably not. Probably what
we'll do, at least as a starting point is offer it as a web service. What
you'll do is simply export your SAS program into a text file and upload
the text file to the web service and it will download a new XML file.
Kate LeMay: So we've got another question here. Does PROV support the workflow of creation and approval of provenance data, e.g. the PROV entry is proposed and has been submitted to the data custodian for approval?
Nicholas Car: Well, there are two kinds of answers to it. One is a generic PROV answer and the other seems to be more in line with a particular repository or a particular set of steps. So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky.
But say you had information about the lineage or the history of a
dataset and you wanted to control that chunk of stuff, you could talk
about that thing being a dataset itself, even though it's about
something else and manage that. You could certainly work out how to
link your dataset to the dataset that contains its provenance
information. So you can do that.
But the second part of the question, or I think the general sense of the
question is more to do with how does a specific repository do things.
Does that make sense? Does the PROV support the workflow of
creation and approval? Okay, in general you can represent anything in PROV because it's really high level and it's got those three generic classes of entity, activity and agent. There's almost nothing in the world that I've come across that you can't decompose into one of those three things. Is it a thing? Is it a causative agent? Or is it a temporal occurrence? So in general, yes.
Kate LeMay: Okay fabulous.
Nicholas Car: So Natasha asks a philosophical question for the whole panel: how do you think provenance relates to trust? I'm going to jump in very quickly and say that provenance models before PROV often had the word trust in them somewhere. Many of the motivations for provenance models were to do with trust. The goal of Geoscience Australia is to put out data and make it open and transparent, and that's fundamentally a trust issue for users of that data.
They want to know how this data came to be. So that's really what provenance is about: telling the history of something so you can generate all sorts of trust. Then, from the specifics of what you put in there, you can work out: do I trust the people who created this thing? Do I trust the process that was undertaken to deal with it or transform it? Do I trust the particular chunks of code that were used? So that's the generic answer.
Then there are the more specific ones, like for a data [unclear] repository: even though you're telling me something about it, how do I trust that it's in fact true? There are also very difficult things about how I actually trust this metadata even if it looks like it's all [unclear]. If this data comes from God, delivered to you on a stone tablet, I could write that down, but is it true? You have to work that out, and that is now a non-provenance thing. You have to work out some other way of attributing a trust metric to that claim. That might be that it's digitally signed and you trust the agency that delivered it - that's an appeal to authority.
You might trust that there's enough information present for you to understand the process well enough to have confidence in it. It might link to well-known sources, like open code or something like that, that you trust. Or maybe there's a mechanism for you to validate certain chunks of data or calculations: if the total number is five, you can look back through the provenance and see somewhere a two plus a three where you see the five, recalculate it, and establish that trust directly.
George Alter: So I think - Nick said it very well. But I'll say the same thing in fewer
words [laughs].
Nicholas Car: Thanks George, thank you.
George Alter: Provenance is really fundamental to trust and Nick really hit the nail
on the head when he talked about transparency. Provenance is about
transparency and in the world we live in now, even appeals to
authority don't work very well anymore. I think that for science
to gain legitimacy and trust, we have to be transparent, and that's what provenance metadata is all about.
Kate LeMay: So we've reached the end of our time. I'd just like to thank our three
speakers for coming along to our ANDS Canberra Office today and
speaking to us about provenance and introducing lots of new
acronyms to us all. Every time I encounter anything new at ANDS
there's always more acronyms to learn.
So thank you very much for coming. We have two more webinars in
the social science series, so hope to see you there again.
END OF TRANSCRIPT

How FAIR is your data? Copyright, licensing and reuse of data
 
Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018
 

Recently uploaded

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Recently uploaded (20)

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 

HTML at least to some extent.

PROV itself was completed in 2013 and sort of formalised by the end of that year. So it's only a couple of years old, and a large number of authors were involved in PROV. There were several initiatives to make provenance standards before PROV over the last perhaps 20 years, such as PML and OPM - I'm not going to elaborate [unclear], so if you're interested in those previous standards just Google them. Many of the authors involved in those initiatives were involved with PROV. So PROV really does know about those other initiatives, and it's simpler than those precursors because it's trying to be a high level standard. It doesn't do as many of the tasks as those precursors do, but it certainly represents the very important bits that they came up with.

Another thing to say about PROV is that there's no version two planned any time soon. Why am I bringing this up now? Well, it's a pain for people to have to deal with standards and then versions two and three and four of standards. PROV doesn't quite operate like that, and I'll explain how. It is what it is, and there are ways to extend it and use it in different circumstances, but it's unlikely that we're going to see any version change in the next few years, I would think.

It's seen good adoption. PROV is really the only international broad scale provenance standard, and as a result people are happy to - I think, happy to adopt it in lieu of really anything else.

Right, so PROV is actually a collection of documents and I've just listed them there. I'm not going to go through them all in great detail, but there is an overview document and then certain bits and pieces which are actual recommendations or standards, and additional things that just help you use the PROV thing.

Now the main document is the PROV-DM, the data model. That tells you what PROV contains, how its classes operate and so on. Then there's a series of documents like an XML version of PROV, an OWL ontology version, and special notations and so on. The only other one I'll mention is PROV-CONSTRAINTS, which is a list of rules that PROV-compliant chunks of data must adhere to, and that works across any formulation of PROV. I've provided a link there to the collection of documents.

So how do I use PROV? This is modelling - how do I actually model something using PROV to do the core of provenance representation? Well, I'm starting off with some negatives, so don't do it like this. Don't take a document for something, perhaps a metadata catalogue entry, and expect to shove a bunch of information into some field within that document. So ISO 19115 is a standard for spatial datasets and it's got a field called lineage, and some people expect to take provenance information and stick it in that lineage field. Don't do that. PROV doesn't let you do that - I'll explain why in a second. So that's one thing not to do. We're not going to see a single item's metadata record containing a bunch of provenance information. You could do that, but it's not recommended.

What else should I not do? So this diagram here is the class model of DCAT, the Data Catalogue Vocabulary, which is a very generic metadata model. It's used in relation to things like Dublin Core and various catalogue style things, and we're not going to link a dataset or any other object in DCAT or Dublin Core or other standards like that to a class of provenance information. This is true for Steve's DDI initiative as well. We're not going to take objects in DDI and link to a provenance object that tells you the provenance of that object. That's an anti-pattern right there.

So what are we going to do? We don't even do this using Dublin Core's provenance properties. So the Dublin Core vocabulary has a property called provenance, and the wording for that says, use this to describe lineage history. PROV doesn't want you to do it exactly like that.

What does PROV want you to do? PROV wants you to think of everything that you're interested in in terms of three general classes of objects. So in the scenario, the things that you're interested in - are they things, are they entities? Are they occurrences, are they processes, are they activities? Or are they causative people or organisations, which [unclear] cause an agent? So PROV says model everything you know about using those three classes and then link them together, and that's what PROV's all about.

So how does GA use PROV? We often process chunks of data at GA, so we have a very simple model that's using the provenance ontology and it looks like this. There's some process, the process generates outputs, the outputs are entities, the process itself is an activity, and then there's data and code and configuration and so on that feed into that process, and those are also entities. Finally the process and the entities might be related to a system and even a person who operates that system. So that's the model we use.

Okay, so how do I actually manage the data that I get in provenance - or that I get according to PROV? Well, you can create reports. So you can go and do something, and a human or a system could log what they had done and they could store that information in some kind of database according to the PROV model. Then you can - it's a document database, but you can query that thing. So we often have systems that sort of send reports every time they run. You might have a form that looks like any other metadata entry form where you fill in details and you hit enter and that sends off provenance information, but again it's not storing it with respect to one specific object, it's linking existing objects together. So some dataset that is produced from another dataset is going to link those two things together.
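To make that pattern concrete, here is a minimal sketch in Python using the rdflib library with PROV-O terms. The dataset, process and person identifiers are made-up examples, not GA's real ones, and the role shown is illustrative:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/")  # made-up identifiers for this example

    g = Graph()
    g.bind("prov", PROV)

    # Entities: an input dataset X and an output dataset Y
    g.add((EX.datasetX, RDF.type, PROV.Entity))
    g.add((EX.datasetY, RDF.type, PROV.Entity))

    # Activity: the process that used X and generated Y
    g.add((EX.process1, RDF.type, PROV.Activity))
    g.add((EX.process1, PROV.used, EX.datasetX))
    g.add((EX.datasetY, PROV.wasGeneratedBy, EX.process1))

    # The simple catalogue-style link: Y was derived from X
    g.add((EX.datasetY, PROV.wasDerivedFrom, EX.datasetX))

    # Agent: the person who operated the system, with an explicit role
    g.add((EX.nick, RDF.type, PROV.Agent))
    g.add((EX.process1, PROV.wasAssociatedWith, EX.nick))
    g.add((EX.process1, PROV.qualifiedAssociation, EX.assoc1))
    g.add((EX.assoc1, RDF.type, PROV.Association))
    g.add((EX.assoc1, PROV.agent, EX.nick))
    g.add((EX.assoc1, PROV.hadRole, EX.operator))
    g.add((EX.operator, RDF.type, PROV.Role))

    print(g.serialize(format="turtle"))

Everything here is one of the three classes - entities, an activity and an agent - linked together, which is the network style of description PROV asks for.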
For catalogue things we can link things again. If we have a catalogue that has a dataset X and a dataset Y and we want to show there's a linking, we can say dataset Y was derived from dataset X and record that information somewhere. Now dataset Y may record "I come from dataset X", but that's just a very simple little bit of provenance information. It's not a whole glob of provenance information stored within dataset Y.

We can ensure that any system that has information that is provenance information, like who the creator of a dataset was, does so in accordance with the PROV model. So in this case, if we had a dataset that had a creator, we would say the dataset was associated with an agent and the agent had a role to play, and that role, in this case, was creator. That's now a PROV expression of that relationship.

For databases it can be very difficult. I can't explain it in depth here, but there are many ways in which databases could store PROV related provenance information. They would need to be able to show that they can actually export their provenance content according to the PROV data model - you actually have to prove that if you want to say that you are compliant with a standard.

So fairly quickly, how do I get PROV to work with other systems? Well, we can fully align our system, whatever this system is. So I've used a theoretical example of Metadata System X. How do I align Metadata System X with PROV? I could classify all of the things in Metadata System X according to PROV. It requires a metadata model for Metadata System X - sorry, a data model, not just encoding formats. We can't just deal with XMLs and so on. We actually have to have a conceptual model, and then we can say, this class of thing in Metadata System X is the same as this class of thing in PROV. Now PROV's only got a few classes, so that's usually pretty easy to do. But it will definitely prompt you to do things that you wouldn't normally do. You may have to tease apart some of the objects that you know and love into things that PROV recognises as different objects.

You could do a partial alignment. You could take your Metadata System X and only acknowledge that some of the things in that scenario are PROV understood things. So maybe you've got a metadata model that talks about all kinds of stuff, and one of the things it talks about is a dataset. You say your dataset is the same as what PROV thinks of as an entity, and maybe you ignore all the other things. You would still need to demonstrate that you could extract valid PROV out of that and not all the other stuff, but that would be one way to do it. You could also link to things not in your own data model if you also classified those things according to PROV.

The last scenario you could think about is to just deprecate your obviously not as good systems and use PROV. That would require you perhaps to make either a new dataset of provenance information or a data store and put that information somewhere, and that's it.

Kate LeMay: Thank you very much Nick. So we'll move onto Steve.

Steve McEachern: Nick's talked about the sort of general PROV model that is increasingly getting used in various different spaces. I'm going to talk specifically about the various ways of thinking about provenance in what we're doing in the social sciences, particularly within the standard that we utilise and that I'm now the chair of, the Data Documentation Initiative. Part of the reason we've sort of connected these two together is we're now looking at how we can leverage the PROV standard inside DDI [4]. So Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that, but I'll return to it at the end.

I sort of want to talk more generally about how we might think about provenance at different stages of the data lifecycle, different stages in the research or in the data management experience, and how we progressed thinking about provenance over that time. Just to give you a sense of what sort of things we can do already and how we can increasingly embed and capture provenance in what we do.

Okay, quickly, for those who don't know the Australian Data Archive - we've had various names over time - I'm going to do a quick introduction. We've been around for a little while now, based here at the Research School of Social Sciences at ANU. Our mission is to collect and preserve Australian social science data on behalf of the social science research community in Australia and internationally. We've now developed a collection of over 5,000 datasets across more than 1,500 different studies, as we call them, or projects. Lots of different sources, lots of different provenance from various different locations: academic, government and private sector.

So as our holdings have developed, our understanding of provenance has developed probably alongside that. Maybe we didn't call it that at the time, but after 35 years I think that's always been sort of underpinning a lot of what we've done. Helping researchers who might be the secondary users of our data to know where did this come from, what was it used for and how might I use it in the future is really the emphasis there.

For those who don't really know what we're talking about when I use the term data archive, we're using the term a trusted system out of a project done by the Social Science & Humanities Research Council of Canada - they're kind of the equivalent in Canada of the ARC. “An accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources…” - so you've got to be able to find it and understand it - “…in a simple, seamless and cost effective way, while at the same time protecting the privacy, confidentiality and intellectual property rights of those involved”. Part of why we're interested in provenance is really that last point.
One is to help researchers understand where this came from, but it is also to recognise and acknowledge the intellectual property that's been developed in those resources over time.

Okay, so I'm going to give a brief introduction to the DDI standard and its different flavours. As Nick pointed out, having multiple versions is not always much fun. We're up to version four, and we're about 20 years old now, so I think that's not too bad from Nick's point of view. And I'll cover how we've sort of captured what we might think of as different forms of provenance over time. I've got the website there, the ddialliance.org website, if you're interested in knowing more - you can go and explore the different versions of the standard there.

So what is DDI? It's a structured metadata specification developed for the community and by the community, particularly in the social science data archives that exist in most OECD countries. It's used in about 90 different countries around the world now, thanks to work by the World Bank and the World Health Organization and others.

There are two major development lines that are basically XML Schemas. One's DDI Codebook and the other DDI Lifecycle, which correspond to version two and version three of the standard respectively. I'll talk a little bit more about those in a moment. We have some other elements to it as well, additional specifications including some controlled vocabularies, often for things like encoding methodology, data types and data capture processes, and some RDF vocabularies so that we can sort of start moving into a linked data world. So you can leverage the standard, particularly the Lifecycle standard, into a linked data environment.

Version four is in development at the moment, and has been over the last couple of years, and that's where the work with Nick has sort of come on board as well. It's moving to a model based specification. So rather than being based in a particular schema, we're looking to focus on the model and then its expression into various different formats. The provisional ones at this point are XML and RDF, and that includes support for provenance and process models. So we're looking at that point at how do we leverage what we know from PROV to support the provenance model within the new version of the standard. It's managed by the DDI Alliance.

So briefly, on the two versions of the standard already in place. It's been around in the codebook format, which has its origins in the print codebooks produced by organisations like [George's] going right back to the 1960s and 70s. So we've sort of formalised in the social sciences a fairly structured way of thinking about describing data back 40 years ago really. The codebook version of the standard really is an after the fact description of what this dataset is about.

It includes four basic sections. The document description, which is describing the document that's describing the dataset. A study description - we use the term study to describe sort of the package of datasets that encapsulate a project - so that includes characteristics of the study itself that the DDI is describing. That includes lots of sections on authorship, citation and access conditions, but particularly, from the point of view of provenance, we have the methodological content, data collection processes and sources. Then we also include a lot of what we call related materials - documents associated with the project that tell you something about the provenance of where it came from. That includes all the questionnaires, previous codebooks, technical reports, et cetera. So from a human point of view you're starting to get into the area of thinking about provenance, even though it's not really a [machine actionable] version of that.

We also describe the files themselves - the characteristics of the physical data files, data formats, et cetera, their size and their structure. Then what we call variable descriptions: descriptions of the variables that are included in the data file. The simplest way of thinking about this is the columns of a tabulated dataset. What does that column mean? Because in a lot of the social sciences a number does not actually represent a number. It represents a characteristic of some sort - for example, a five point agree/disagree scale in a survey - so how you interpret a lot of those becomes important. George is going to talk to a specific project looking at how we do a lot more with the variable description and the [unclear] of variables in a moment.

So codebook was really developed to describe things after the fact. The DDI Lifecycle Model takes a more data lifecycle approach to thinking about capturing metadata and provenance. Underlying it is the model we have on the screen here. I think this is just a working model of describing the different processes in the DDI framework that a dataset can go through: everything from conceptualising the study in the first place, through collection and processing and distribution, as a side point archiving that data and storing it around for future use, and then rediscovery and analysis and repurposing into the future.

So it was built with the intent of re-usability, and particularly machine actionability as well, so that the metadata that's developed in a dataset can be re-used in the future for the same purpose, a similar purpose or something entirely new. In order to do that you need to be able to understand where did it come from. So embedded in that is generating metadata going forward to be able to look backwards through the lifecycle as well. It's focused on metadata re-use, and that re-use of metadata really implies a provenance expectation.

So why DDI Lifecycle? The things it can do: it's machine actionable. It's more complex - there are 27 different schemas; it's probably overly complex if we're being fair. It's structured and identifiable, so every metadata item is actually able to be permanently identified and managed and repurposed if that's required.
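As a rough illustration of what that identifiability looks like in practice: DDI Lifecycle items are addressed by URNs with the general shape urn:ddi:agency:identifier:version. The helper below is only a sketch - the agency and item names are made up - but it shows how such an identifier pins down one version of one metadata item owned by one agency:

    def ddi_urn(agency, identifier, version):
        """Build a DDI Lifecycle-style URN: urn:ddi:<agency>:<identifier>:<version>."""
        return f"urn:ddi:{agency}:{identifier}:{version}"

    def parse_ddi_urn(urn):
        """Split a DDI-style URN back into its parts (shape check only)."""
        scheme, ddi, agency, identifier, version = urn.split(":")
        assert (scheme, ddi) == ("urn", "ddi"), "not a DDI URN"
        return {"agency": agency, "identifier": identifier, "version": version}

    urn = ddi_urn("au.example.ada", "Variable-employment-status", "2")
    print(urn)              # urn:ddi:au.example.ada:Variable-employment-status:2
    print(parse_ddi_urn(urn))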
It supports related standards, and it supports reuse across different projects, and again that's sort of something that George is going to touch on as well. [I'm going to pass this] because I think there are some particular features for it that I can refer back to in the future. But I want to talk very briefly about how we think about provenance within the different versions and then pass to George, who wants to talk specifically about one of the projects there.

So if we think about how provenance is being supported here - I mean Nick's approach to the PROV model, with really a machine actionable model - fundamentally, DDI Codebook is not really designed for that. But it is designed at least to be able to describe to a human reading a catalogue entry what the provenance of this dataset was. So it includes attribution, methodology, data processing, collection and all the documentation we can find on what happened to the data. But it doesn't really do that in a sort of automated way; it's really focused on a human response to prior research, to be able to come back and have a look. Similarly with variables: question texts, [variable] names, what the value labels mean are all there.

DDI Lifecycle was our first attempt really to look at sort of machine actionable provenance. So can we capture this along the way? It represents again the information from the studies - attribution, methodology and so forth. But particularly with variables it's really trying to look at the reusable elements of how we might reuse questions, reuse columns of data, and understand and reuse the basic conceptual ideas that are embedded within that. So, for example, if you've got a variable measuring employment, can I reuse that employment variable - maybe the categorisation that was used, the numbers that were used in the survey and so forth.

Then where we're going with DDI 4 - our tagline for that is what we're calling DDI Views - is to what extent can I actually embed a provenance model inside that framework. So now we're moving towards really recognising the importance of provenance both conceptually and even in sort of the physical and digital formats of data as well. Measuring codes and categories across the lifecycle, for example, or managing the provenance of missing values. If the value of a datum changes, how do I understand that? So we're able to generate this out automatically - what happened at the level of an individual datum, of a variable or of a dataset. So we're moving progressively towards the sort of framework that Nick described, but [unclear] that requires the management of the metadata that we have to be moved forward. That's kind of it from me.

George Alter: Hi everyone. Thanks very much to ANDS and to ADA for inviting me to be here. What I'm going to talk about today is a project that started in October, with funding from the US National Science Foundation, about capturing metadata during the process of data creation. So I don't think for this audience I have to justify metadata, but the big problem that we face is how do we actually get the metadata. It's a lot easier to describe it than it is to actually get it, most of the time.

So to give you some background, I'm going to put this in the context of my home institution, which is the Inter-university Consortium for Political and Social Research, located at the University of Michigan. We've been in the business of archiving social science data since 1962 and we're an international consortium of more than 760 institutions. We were also one of the founding members of the Data Documentation Initiative Alliance, which Steve just talked about, and we actually provide the home office for the DDI Alliance.

ICPSR has been using DDI for many years, but we're now getting to the point where we're able to build all kinds of tools that take advantage of DDI. One of the first things, which we've been doing for at least 10 years, is that when you download data from ICPSR you get with it a codebook in pdf. The pdf is [something that] we created from the
DDI, not the other way around. So for us the DDI is the native version of the metadata.

So what we started to do is take advantage of DDI to build more kinds of tools. One of the first ones we created was what's called our variable search page, where you can put in a search term and look for questions that have been used in datasets that are like that search term. So this is an example of the results that come out of a variable search, and we are now searching over more than 4.5 million variables in about 5,000 studies or data collections.

One of the things that DDI makes possible is that we can go from this search to other characteristics of the data. So you can see here in the blue that there are a number of things that are hyperlinked. If you click on the place I've got circled, it takes you to an online codebook. The online codebook has a number of features. It tells you the question that was asked. It tells you how it was coded. If the data are available online you can go to a cross tab tool, and it also can link to an online graphing tool. The other thing that you see on the left side of the screen is a list of the other variables in the dataset. So you can move around in the dataset, and clicking on any of those variables will bring up a display a bit similar to this.

Another thing you can do from our variable search screen: if you click on these check boxes on the left, you can pick out a certain number of variables that you want to look at more closely, and clicking on this compare button at the top there brings you to this screen, which is a side by side comparison of these different variables, which come from different studies. So you can see whether they're asking the same question, whether they're coded the same or differently. As before, this screen is also hyperlinked to the online codebook so you can go back and forth.

One of our more recent tools, which I think is one of the most powerful, is that you can now search for datasets that include more than one variable that you're interested in. So this is a search using what we call our variable relevance search - that's actually in the study search rather than the variable search - where we're looking for variables about three different things. Does the respondent read newspapers? Do they volunteer in schools? What's their race? You can see here that the results come out in three different columns within each study, so you can see which variables are present in each study. As before, everything is hyperlinked to both the online codebook and the variable comparison. So you can check on any combination of these variables and compare them side by side.

Another thing that we did, in another previous NSF project working with the American National Election Study and the General Social Survey, was to make a crosswalk of the variables that are available in those two studies. Now the American National Election Study started in 1948 and is done every four years. The General Social Survey started in 1972 and is done every two years. So we're actually looking over 70 different datasets. What we've done is created this crosswalk where we've grouped the variables according to certain tags. We've got eight lists of tags and 134 tags in total. The columns here - each column represents a dataset, and there are 70 datasets. All of the variables are linked here, and I can't actually show it here, but if you hover over one of those variables it shows you the question text for the variable. Again you can use the checkboxes to pick out things that you want to compare and go to the variable comparison screen.

So a crosswalk like this is a tool that's actually very common - you've probably seen these before. There are two things that are different about this though. One is that this is all keyed into the online codebook, so you can go transparently back and forth. The other thing is that we can use this tool to crosswalk any of the 4.5 million variables in the ICPSR collection, because this is drawing directly from our store of DDI metadata and we don't have to build a separate tool for each one. This one tool works over all of these datasets.

Another thing that we did in this project was to think about how we could extend the online codebook. So here's our online codebook that you saw before, which has the question text and how it was coded, but this version has something new in this location here. It shows how you got to this question. In big surveys every respondent doesn't answer every question. There are what are often called skip patterns. So you get asked what your marital status is, and if you're single you go to one question, if you're married you go to another question, and divorced people go to a third question. So there are different pathways through the questionnaire. What we've done here is try to show, here's how you got to that question, which explains why some people didn't answer the question. We also represented it in words down here.

So we built this, and we were quite proud of ourselves for building it, because this does answer the question about who answered this question in the survey. But then we ran into a problem: how do we know who answered the question in the survey? The answer is that we get that information from the data providers in a pdf. The only way we could build this demo prototype was to have one of our staff members enter this program flow information manually into XML for one of the datasets, so we could show how this works. So we showed a tool that we think is really useful, but we reached a roadblock because we don't actually get machine actionable metadata about this kind of information.

The problem is that when the data arrive at the archive, they don't have the question text - that's something that we at ICPSR and ADA have to type in. They don't have the interview flow. They don't have any information about variable provenance, and variables that are created out of other variables are not documented.
So the project we're working on now, which is called C2 Metadata, for [continuous capture] of metadata, is about how you get that. To understand how we get it, you have to think about how the data are created and what happens.

So first of all, the data themselves are actually born digital. People do not go around with a paper questionnaire these days. They use these computer assisted interview programs - they're on the telephone, or they go around with a laptop or a tablet to answer them. There's no paper questionnaire. There is instead a program, and it's the program that's the metadata. So technically at the beginning you start with this computer assisted interviewing system, and what you get out of it is the original dataset. But you can also derive from it DDI metadata in XML, and there are a couple of different programs that will take these CAI systems - the code that they run on - [unclear] to XML.

But what happens next? Well, what happens next is that the project that commissioned the data is going to modify the data. There are a number of reasons for doing that. There are some things in the data that are just purely there for administrative purposes. There are some variables that have to be changed to reduce the identifiability of individuals. Some variables need to be combined into scales or indexes. So what they do is they write a script that's going to run in one of the major statistical packages. The script and the data go through that software, and what comes out is a new dataset.

Well, what happens to the metadata? At this point the metadata don't match the dataset anymore, and you would need to update the XML to fix it, and nobody likes updating XML. So the metadata get trashed and thrown away. What happens then is this: after the data are revised, the metadata are recreated. We at the archive take the revised data and extract as much metadata from it as we can. So we get an extracted XML file - and what about the things that went on in the script here? Well, we actually have to sit down and extract them by hand. So a person has to read the script and write down what happened.

So what are we missing? What we get from the statistics packages are just names, labels for variables, labels for values and virtually no provenance information. So what we're working on is a way that we can automate the capture of this variable transformation metadata.

So the idea is this: we're going to write software where you could take the script that was used to modify the data - take the very same script - and run it through what we're calling a script parser, and pull from that the information about variable transformations. Put that into a standard format, which we're calling a standard data transformation language. Then you take that information and incorporate it into the original DDI. You update the original DDI and then you've got a new version of the XML that is in sync with the revised data.

So this process requires two different software tools: one that will read the script and turn it into a standard format, and a second one that will update the XML, and that's what we're building. So we are building tools that will work with the different software packages and update XML. We're actually writing these parsers for scripts in four different languages: SPSS, SAS, Stata and R. The reason we're doing four languages is that if you look at the column over there on the right, which is based on downloads at ICPSR in cases where the dataset had all four formats, you can see that there's not a single dominant format. SPSS and Stata are the most downloaded formats from ICPSR and they both have about 24 per cent. SAS and R both have about 12 per cent. If we did one package we'd be pleasing only a few people and we couldn't have an impact. So we're actually writing parsers for four languages.
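To give a flavour of the idea - this is only an illustrative sketch, not the project's actual parser or its transformation language, and the record fields are assumptions made up for the example - a parser for one trivial SPSS statement might pull out records like this:

    import re

    # Matches a trivial SPSS COMPUTE statement, e.g. "COMPUTE income2 = income * 52."
    COMPUTE_RE = re.compile(r"^\s*COMPUTE\s+(\w+)\s*=\s*(.+?)\.\s*$", re.IGNORECASE)

    def parse_spss_script(script):
        """Pull variable-transformation records out of an SPSS-like script."""
        records = []
        for line in script.splitlines():
            match = COMPUTE_RE.match(line)
            if match:
                new_var, expression = match.groups()
                records.append({
                    "command": "compute",      # what kind of transformation
                    "creates": new_var,        # the variable the script creates
                    "expression": expression,  # how it was computed
                    # crude guess at source variables: identifiers in the expression
                    "uses": sorted(set(re.findall(r"[A-Za-z_]\w*", expression))),
                })
        return records

    script = "COMPUTE income2 = income * 52.\nCOMPUTE agesq = age * age.\n"
    for record in parse_spss_script(script):
        print(record)

Records like these - which variable was created, from what, and how - are exactly the variable-level provenance that is otherwise lost when only the revised data file is passed on.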
Here's something that's come out of our work that you might find interesting. This is about why we need to have a special language for expressing these data transformations. So here are three brief programs in SPSS, Stata and SAS that are all designed to operate on the same data. I tried very hard to make the scripts identical, and I think that I succeeded. But if you run these three programs you get three different results.

The key thing here is to look at the last row, the row in which we set the minus one to be missing. In SPSS you get two missing values. In Stata and SAS one of the variables is set to a number, but it's a different one in each.

Why does this happen? Well, the reason is that in logical expressions SPSS treats a missing value as missing [unclear] - it makes the result of a logical expression that includes a missing value missing, which in most cases is treated as false. Stata treats a missing value as a number which is equal to infinity. SAS treats the missing value as a number which is equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons. So it's actually more accurate to represent the data in this way, which you wouldn't see if you just looked at the datasets.
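The difference can be emulated directly. A small sketch, hard-coding each package's documented treatment of a missing numeric value in a comparison (the function names are made up):

    import math

    def spss_greater(x, threshold):
        # SPSS: a comparison involving a missing value is itself missing,
        # and most SPSS commands then treat that missing result as false
        if x is None:
            return None
        return x > threshold

    def stata_greater(x, threshold):
        # Stata: a missing numeric value sorts above every number (like +infinity)
        return (math.inf if x is None else x) > threshold

    def sas_greater(x, threshold):
        # SAS: a missing numeric value sorts below every number (like -infinity)
        return (-math.inf if x is None else x) > threshold

    for x in (5.0, None):  # an ordinary value, then a missing one
        print(x, spss_greater(x, 0), stata_greater(x, 0), sas_greater(x, 0))

For the missing value, the three "identical" comparisons come back missing, true and false respectively, which is why a package-neutral way of recording what a script actually did is needed.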
So what we're doing is creating our own language. Well, we're actually using a language that's been created by another community, the SDMX community, called the Validation and Transformation Language, so that we can put all three of these languages into a common core.

So what are we doing and why are we doing it? The goal of the project is to capture this metadata and automate that capture. If we can capture more metadata from the data creation process, we'll be able to provide much better information to researchers about what's in the dataset. Automating this process, we hope, will make it cheaper and easier for everyone. That has been one of the principles we've tried to follow here: if we can't make it easier for the researchers, they're not going to do it. So the hope here is that the software we get will make their lives easier.

Just to acknowledge some of my partners in this: we've got partners from a couple of software firms, Colectica and Metadata Technology North America, the Norwegian Centre for Research Data, and the two projects I mentioned, the General Social Survey and the American National Election Study, are part of the project too. So that's my talk.

Kate LeMay: Fabulous, thank you very much George. We had a question that came through earlier from your talk, when you were speaking about people putting variables into ICPSR and searching for them. [Ming 43:40] has asked, when a user searches for a variable or variables, do they need to come up with the exact variable name as in the variable index?

George Alter: So right now what we're doing is really a text search. When you search for variables, you're searching over the variable name and the variable label. It also can bring up items that are in the values for the variables. But one of the problems in the social sciences is that people don't reuse questions very often. We don't have a tradition of reusing questions, so it's very hard to find the same question in multiple datasets. The kind of search we're doing now in our question bank is frankly kind of clunky and it often misses things. That's an issue that I'm trying to address in some other projects, where we're trying to improve the way we can search over variables.

Kate LeMay: Thank you very much. We've got a question for Nick as well.

Nicholas Car: [Unclear].

Kate LeMay: Yes. So Nick, we've got a question: how widely is PROV used, and what have you found to be the main challenges working with PROV? Noting that a V2 is not on the horizon, is it easy to update a PROV model if a change is required?

Nicholas Car: Okay, so first part first: how widely is it used? I have a direct interest in things provenance, but aside from that I have an interest in things geospatial and, I guess, physical sciences data. In that community there's only one game in town, and that's PROV, but it's early days. So most of the spatial, geophysical and so on sorts of places - the hard physical sciences side - either are using their own systems or they're intending to use PROV. There are not many that are actually already using PROV, but there are certainly not many that are intending to use something other than PROV.

Outside of my own Geoscience Australia area, in other communities I know of, including DDI and so on - because PROV's only been around for a few years - if people can characterise their problem in a provenance way, like they actually understand this as a provenance question as opposed to some other kind of question, like an ownership or an attribution question, they fairly quickly end up at PROV. So it's certainly more widely used than any other provenance standard has ever been, and it's showing signs of being much more widely used than that, because the other initiatives in the space have been sort of swallowed up by PROV.

Now the second part of the question was, what are the problems? I've identified one already, which is that people have to know that they're asking a provenance question. We get a lot of questions which are synonyms for provenance questions - probably much like variable naming - where people say I'm interested in the lineage of my data, or the transparency, or the process flow, or the ownership or attribution, and those could all be provenance questions. The hardest thing is to work out specifically what questions are being asked, and then, if there is an existing metadata model or something in that space already, what's it doing and what's it not doing, and therefore do we need a specific provenance initiative.

So for instance, many metadata models have authorship, ownership and creator information indicated in them. So if your provenance question is, I want to know datasets created by Nick, that kind of provenance question you can usually answer in other metadata systems. You'd have to have something a bit more complicated than that, and determinedly provenance, to then think about using a provenance system.

The other thing is the move away from what I call point metadata, where you've got a single thing with a bunch of properties that come from it - a study or a document or a chunk of data with a bunch of properties. That's one way to do things, but what PROV and other models like it are interested in is whole networks: things that relate to other things. It's more complex, but it's much, much more powerful to do that.

Kate LeMay: Great, thank you very much. So a question for George: how are sensitive data, variables or values controlled for during the C2 automatic capture? ICPSR has a confidentialised service on ingest - is this process carried over to the C2 metadata project? Is this activity captured in PROV-like metadata?

George Alter: So the C2 metadata model is to operate solely on the metadata, not on the data, so it doesn't really play into the issue of confidentiality. If you're interested, in two weeks we're going to have another webinar where I am going to talk about how we manage confidential data, but in general it's rarely the case that we have to mask the metadata of a dataset for confidentiality reasons. Obviously controlling the data is something else.

Kate LeMay: So we've got another question here for George. Your script parser that reads from a SAS script - would researchers need to install that in their SAS package?

George Alter: We haven't gotten to that point yet, but probably not. Probably what we'll do, at least as a starting point, is offer it as a web service. What you'll do is simply export your SAS program into a text file and upload the text file to the web service, and it will download a new XML file.
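A sketch of what that round trip could look like from the researcher's side, using the Python requests library. The endpoint URL and field names are entirely hypothetical, since the service was still being built at the time:

    import requests

    # Hypothetical endpoint; the real service did not exist yet at this time
    PARSER_URL = "https://c2metadata.example.org/parse/sas"

    def update_ddi(script_path, ddi_in_path, ddi_out_path):
        """Upload a SAS script plus the current DDI XML; save the updated DDI XML."""
        with open(script_path, "rb") as script, open(ddi_in_path, "rb") as ddi:
            response = requests.post(PARSER_URL, files={"script": script, "ddi": ddi})
        response.raise_for_status()
        with open(ddi_out_path, "wb") as out:
            out.write(response.content)

    update_ddi("recodes.sas", "study_v1.xml", "study_v2.xml")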
Kate LeMay: So we've got another question here. Does PROV support the workflow of creation and approval of provenance data, e.g. the PROV entry is proposed and has been submitted to the data custodian for approval?

Nicholas Car: Well, there are two kinds of answers to that. One is a generic PROV answer, and the other one seems to be more in line with a particular repository or a particular set of steps. So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky. But say you had information about the lineage or the history of a dataset and you wanted to control that chunk of stuff: you could talk about that thing being a dataset itself, even though it's about something else, and manage that. You could certainly work out how to link your dataset to the dataset that contains its provenance information. So you can do that.

But the second part of the question, or I think the general sense of the question, is more to do with how a specific repository does things. Does that make sense? Does PROV support the workflow of creation and approval? In general, you can represent anything in PROV, because it's really high level and it's got those three generic classes of entity, activity and agent. There's almost nothing in the world that I've come across that you can't decompose down into one of those three things. Is it a thing? Is it a causative agent? Or is it a temporal occurrence? So in general, yes.

Kate LeMay: Okay, fabulous.

Nicholas Car: So Natasha asks, a philosophical question for the whole panel: how do you think provenance relates to trust? I'm going to jump in very quickly and say, provenance models before PROV often had the word trust in them somewhere. Many of the motivations for provenance models were to do with trust. We deal with trust as - the goal of Geoscience Australia is to put out data and make it open and transparent. It's fundamentally a trust issue for users of that data. They want to know how did this data come to be. So that's really what provenance is about. It's about telling the history of something so you can generate all sorts of trust.

But then with the specifics of what you put in there, you can work out: do I trust the people who created this thing? Do I trust the process that was undertaken to deal with it or transform it? Do I trust the particular chunks of code that were used? So that's the generic answer.

Then there are the more specific ones, like for a data [unclear] repository, how do I trust that - even though you're telling me something about it - [it is] in fact true? There are also very difficult things about how do I actually trust this metadata even if it looks like it's all [unclear]. If this data comes from God, delivered to you on a stone tablet, I could write that down, but is it true? You have to work that out, and that is now a non-provenance thing. You have to work out some other way of attributing a trust metric to that claim. That might be that it's digitally signed and you trust the agency that delivered it - that's an appeal to authority. You might trust that there's enough information present for you to understand the process enough to have confidence in it. It might link to well-known sources, like open code or something like that, that you trust. Or maybe there's a mechanism for you to validate certain chunks of data or calculations. So if the total number is five, you can look back through the provenance and see somewhere two plus three; where you see five, you can calculate it and establish that trust directly.

George Alter: So I think Nick said it very well, but I'll say the same thing in fewer words [laughs].

Nicholas Car: Thanks George, thank you.

George Alter: Provenance is really fundamental to trust, and Nick really hit the nail on the head when he talked about transparency. Provenance is about transparency, and in the world we live in now, even appeals to authority don't work very well anymore. I think that for science to gain legitimacy and gain trust, we have to be transparent, and that's what provenance metadata is all about.

Kate LeMay: So we've reached the end of our time. I'd just like to thank our three speakers for coming along to our ANDS Canberra office today, speaking to us about provenance, and introducing lots of new acronyms to us all. Every time I encounter anything new at ANDS there's always more acronyms to learn. So thank you very much for coming. We have two more webinars in the social science series, so I hope to see you there again.

END OF TRANSCRIPT