[Unclear] words are denoted in square brackets.
Webinar: Provenance and Social Science data
15 March 2017
Video & slides available from ANDS website
START OF TRANSCRIPT
Kate LeMay: Today we're going to be speaking about provenance and social
science data. So you should be able to see on our screen we're showing our data provenance community page. We have a data provenance interest group, and if you're interested in that you can contact us through the contacts on that page.
We have our speakers here. I'm Kate LeMay, I'm from ANDS and I'm
one of the research data specialists at ANDS. We have George Alter,
Steve McEachern and Nicholas Car. We'll give each of them a little bit
of an intro when we get to their point in speaking.
So as I mentioned this is part of a series, today's our first one. So I'd
like to introduce Steve and Nick who will be speaking first. So Steve is
the Director of the Australian Data Archive at the Australian National
University. He holds a PhD in industrial relations and a Graduate
Diploma in Management Information Systems and has research
interest in data management and archiving, community and social
attitude surveys, new data collection methods and reproducible
research methods.
Steve has been involved in various professional associations in
survey research and data archiving over the last 10 years and is
currently Chair of the Executive Board of the Data Documentation Initiative.
And Nick - Nicholas Car - is the Data Architect for Geoscience Australia (GA). In that role he designs and helps build enterprise data platforms. GA is particularly interested in the transparency and repeatability of its science and the data products it delivers. For these reasons, Nick implements provenance modelling and management systems in order to represent and store information about data lineage: what was done, who did it, and what they used to do it.
Prior to working at GA, Nick was an experimental scientist at CSIRO, where he researched metadata systems, provenance, data management and linked data. He currently co-chairs the international Research Data Alliance's Research Data Provenance Interest Group, which the ANDS Provenance Interest Group works with, and through that and other groups he assists organisations with provenance management.
Nicholas Car: Okay, thanks Kate. All right so this is a very quick introduction to
PROV, so PROV is a provenance standard and what you see on that
first slide there is a very, very simple diagram of a little provenance
network and I'll discuss some of that as we go. So it's not just a frivolous diagram - it actually has some meaning.
Okay so the outline for today, so what is PROV? I'm just going to
mention that very quickly and then I'm going to get to how do I actually
use this thing in a couple of different ways. So first I'll talk about
modelling. Then I'll talk about how do I actually manage the data once
I've collected or made provenance data and then I'll talk about using
PROV with other systems.
So what is PROV? PROV is a W3C recommendation. So W3C is the
World Wide Web Consortium. So it's one of the governing bodies of
internet standards. They don't issue any documents called standards.
They issue documents called recommendations. So PROV is a
recommendation - their top level of standard, I suppose. Other standards
by the W3C are things like HTML. I'm sure everyone is familiar with
HTML at least to some extent.
PROV itself was completed in 2013 and sort of formalised by the end
of that year. So it's only a couple of years old and a large number of
authors were involved in PROV. There were several initiatives to
make provenance standards before PROV over the last perhaps 20 years, such as PML and OPM. I'm not going to elaborate on those, so if you're interested in those previous standards just Google them.
Many of the authors involved in those initiatives were involved with
PROV. So PROV really does know about those other initiatives and
it's simpler than those precursors because it's trying to be a high-level standard. It doesn't do as many of the tasks as those precursors do, but it certainly represents the very important bits that they came up with.
Another thing to say about PROV is that there's no version two
planned any time soon. Why am I bringing this up now? Well, it's a pain for people to have to deal with standards and then versions two, three and four of standards. PROV doesn't quite operate like
that and I'll explain how. It is what it is and there are ways to extend it
and use it in different circumstances, but it's unlikely that we're going
to see any version change in the next few years I would think.
It's seen good adoption. PROV is really the only international broad-scale provenance standard and as a result people are, I think, happy to adopt it in the absence of really anything else.
Right, so PROV is actually a collection of documents and I've just
listed them there. I'm not going to go through them all in great detail,
but there is an overview document and then certain bits and pieces
which are actual recommendations or standards and additional things
that just help you use the PROV thing.
Now the main document is PROV-DM, the data model. That tells you what PROV contains, how its classes operate and so on. Then
there's a series of documents like an XML version of PROV, an OWL
ontology version and special notations and so on. The only other one
I'll mention is PROV-CONSTRAINTS, which is a list of rules that PROV-compliant chunks of data must adhere to, and that works across any formulation of PROV. I've provided a link there to the collection of documents.
So how do I use PROV? This is about modelling: how do I actually model something using PROV to do the core work of provenance representation?
Well I'm starting off with some negatives, so don't do it like this. Don't
take a document for something, perhaps a metadata catalogue entry
and expect to shove a bunch of information into some field within that
document.
So ISO 19115 is a standard for spatial datasets and it's got a field called lineage, and some people expect to take provenance information and stick it in that lineage field. Don't do that; PROV doesn't let you do that, and I'll explain why in a second. So that's one thing not to do. We're not going to see a single item's metadata record containing a bunch of provenance information. You could do that, but it's not recommended.
What else should I not do? This diagram here is the class model of DCAT, the Data Catalogue Vocabulary, which is a very generic metadata model. It's used in relation to things like Dublin Core and
various catalogue style things and we're not going to link a dataset or
any other object in DCAT or Dublin Core or other standards like that to
a class of provenance information. This is true for Steve's DDI
initiative as well. We're not going to take objects in DDI and link to a
provenance object that tells you the provenance of that object. That's
an anti-pattern right there.
So what are we going to do? We don't even do this using Dublin Core's provenance properties. The Dublin Core vocabulary has a property called provenance, and the wording for that says: use this to
describe lineage history. PROV doesn't want you to do that exactly
like that.
What does PROV want you to do? PROV wants you to think of
everything that you're interested in in terms of three general classes of
objects. So in your scenario, are the things that you're interested in things - entities? Are they occurrences or processes - activities? Or are they causative people or organisations - what PROV calls agents?
So PROV says model everything you know about using those three
classes and then link them together and that's what PROV's all about.
So how does GA use PROV? So we often process chunks of data at
GA. So we have a very simple model that's using the provenance
ontology and it looks like this. There's some process, the process
generates outputs, the outputs are entities, the process itself is an
activity and then there's data and code and configuration and so on
that feed into that process and those are also entities. Finally the
process and the entities might be related to a system and even a
person who operates that system. So that's the model we use.
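To make that model concrete, here is a minimal sketch of it as PROV-O statements using Python's rdflib library. The identifiers (proc1, input_data, processing_system and so on) are hypothetical illustrations, not GA's actual ones.

from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("prov", PROV)

# The process is an Activity
g.add((EX.proc1, RDF.type, PROV.Activity))

# Data, code and configuration feeding the process are Entities it used
for item in ("input_data", "code", "config"):
    g.add((EX[item], RDF.type, PROV.Entity))
    g.add((EX.proc1, PROV.used, EX[item]))

# The output is an Entity generated by the Activity
g.add((EX.output_data, RDF.type, PROV.Entity))
g.add((EX.output_data, PROV.wasGeneratedBy, EX.proc1))

# The process relates to a system (an Agent), operated by a person
g.add((EX.processing_system, RDF.type, PROV.Agent))
g.add((EX.operator, RDF.type, PROV.Agent))
g.add((EX.proc1, PROV.wasAssociatedWith, EX.processing_system))
g.add((EX.processing_system, PROV.actedOnBehalfOf, EX.operator))

print(g.serialize(format="turtle"))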
Okay, so how do I actually manage the data that I collect or create according to PROV? Well, you can create reports. So you go and do something; a human or a system logs what was done and stores that information in some kind of database according to the PROV model. It might be a document database, but you can query that thing. So we often have systems that send provenance reports every time they run.
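As a sketch of what querying that store can look like - assuming the reports have been loaded into an RDF graph like the one above - one SPARQL query can answer lineage questions across every report at once:

# Assuming provenance reports have been loaded into the rdflib graph g
# above, one SPARQL query spans all of them
results = g.query("""
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?activity ?input ?output
    WHERE {
        ?activity a prov:Activity ;
                  prov:used ?input .
        ?output prov:wasGeneratedBy ?activity .
    }
""")
for activity, used, generated in results:
    print(activity, "used", used, "and generated", generated)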
You might have a form that looks like any other metadata entry form
where you fill in details and you hit enter and that sends off
provenance information, but again it's not storing it with respect to one
specific object, it's linking existing objects together. So some dataset
that is produced from another dataset is going to link those two things
together.
For catalogue things we can link things again. If we have a catalogue that has a dataset X and a dataset Y and we want to show there's a link, we can say dataset Y was derived from dataset X and record that information somewhere. Now dataset Y may record 'I come from dataset X', but that's just a very simple little bit of provenance information. It's not a whole glob of provenance information stored within dataset Y.
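That catalogue link is a single PROV statement. A minimal sketch, again with hypothetical identifiers:

from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")
CAT = Namespace("http://example.org/catalogue/")

g = Graph()
# One triple records the whole derivation link
g.add((CAT.datasetY, PROV.wasDerivedFrom, CAT.datasetX))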
We can ensure that any system holding information that is provenance information - like who the creator of a dataset was - does so in accordance with the PROV model. So in this case, if we had a dataset that had a creator, we would say the dataset was associated with an agent, and the agent had a role to play, and that role, in this case, was creator. That's now a PROV expression of that relationship.
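In PROV-O terms, one natural way to express that for a dataset - an entity - is the qualified attribution pattern, which lets the agent relationship carry a role. A sketch with hypothetical names:

from rdflib import Graph, Namespace, BNode, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.dataset, RDF.type, PROV.Entity))
g.add((EX.alice, RDF.type, PROV.Agent))

# Simple form: the dataset is attributed to the agent
g.add((EX.dataset, PROV.wasAttributedTo, EX.alice))

# Qualified form: the attribution itself carries the role 'creator'
attribution = BNode()
g.add((EX.dataset, PROV.qualifiedAttribution, attribution))
g.add((attribution, RDF.type, PROV.Attribution))
g.add((attribution, PROV.agent, EX.alice))
g.add((attribution, PROV.hadRole, EX.creator))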
For databases it can be very difficult. I can't explain it in depth here, but there are many ways in which databases could store provenance or PROV-related provenance information. They would, though, need to be able to show that they can actually export their provenance content according to the PROV data model. You actually have to prove that if you want to say that you are compliant with a standard.
So fairly quickly, how do I get PROV to work with other systems? Well
we can fully align our system, whatever this system is. So I've used a
theoretical example of Metadata System X. How do I align Metadata
System X with PROV? I could classify all of the things in Metadata
System X according to PROV. It requires a metadata model for Metadata System X - sorry, a data model, not just encoding formats. We can't just deal with XML and so on. We actually have to have
conceptual model and then we can say, this class of thing in Metadata
System X is the same as this class of thing in PROV.
Now PROV's only got a few classes, so that's usually pretty easy to
do. But it will definitely prompt you to do things that you wouldn't
normally do. You may have to tease apart some of the objects that
you know and love into things that PROV recognises as different
objects.
You could do a partial alignment. You could take your Metadata
System X and only acknowledge that some of the things in that
scenario are PROV understood things. So maybe you've got a
metadata model that talks about all kinds of stuff and one of the things
it talks about is a dataset. You say your dataset is the same as what
PROV thinks of as an entity and maybe you ignore all the other things.
You would still need to demonstrate that you could extract valid PROV out of that, and not all the other stuff, but that would be one way to do it.
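A partial alignment of that kind can be as small as a single class mapping. A minimal sketch, assuming a hypothetical Metadata System X (mdx) vocabulary:

from rdflib import Graph, Namespace, RDFS

PROV = Namespace("http://www.w3.org/ns/prov#")
MDX = Namespace("http://example.org/metadata-system-x#")

g = Graph()
# Declare that Metadata System X's Dataset class is a kind of
# prov:Entity; a reasoner can then surface every mdx:Dataset in
# PROV terms while ignoring the rest of the model
g.add((MDX.Dataset, RDFS.subClassOf, PROV.Entity))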
You could also link to things not in your own data model if you also
classified those things according to PROV. The last scenario you
could think about is to just deprecate your obviously not as good
systems and use PROV. That would require you perhaps to make
either a new dataset of provenance information or a data store and
put that information somewhere and that's it.
Kate LeMay: Thank you very much Nick. So we'll move onto Steve.
Steve McEachern: Nick's talked about the sort of general PROV model that is
increasingly getting used in various different spaces. I'm going to talk
specifically about the various ways of thinking about provenance in
what we're doing in the social sciences, particularly using - within the
standard that we utilise and I'm not the Director for the Data
Documentation Initiative.
Part of the reason we've connected these two together is that we're now looking at how we can leverage the PROV standard inside DDI. So Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that, but I'll return to it at the end.
I sort of want to talk more generally about how we might think about
provenance at different stages of the data lifecycle, different stages in
the research or in the data management experience and how we
progressed our thinking about provenance over that time - just to give you a sense of what sorts of things we can do already and how we can increasingly embed and capture provenance in what we do.
Okay, for those who don't know, I'm going to do a quick introduction to the Australian Data Archive - we've had various names over time. We've been around for a little while now, based here at the Research School of Social Sciences at ANU. Our mission
is to collect and preserve Australian social science data on behalf of
the social science research community in Australia and internationally.
Now we've developed a collection of over 5000 datasets across over 1500 different studies, as we call them, or projects. Lots of
different sources, lots of different provenance from various different
locations, academic, government and private sector. So as our
holdings have developed, our understanding of provenance has
developed probably alongside that. Maybe we didn't call it that at the
time but after 35 years I think that's always been sort of underpinning
a lot of what we've done.
The emphasis there is really on helping researchers who might be the secondary users of our data to know: where did this come from, what was it used for and how might I use it in the future? For those who don't really know what we're talking about when I use the term data archive, we're using the definition of a trusted system from a project done by the Social Sciences & Humanities Research Council of Canada. They're kind of the equivalent in Canada of the ARC.
“An accessible and comprehensive service empowering researchers
to locate, request, retrieve and use data resources…” - so you've got
to be able to find it and understand it - “…in a simple, seamless and
cost effective way, while at the same time protecting the privacy,
confidentiality and intellectual property rights of those involved”. Part
of why we're interested in provenance is really that last point.
One part is to help researchers understand where the data came from, but it is also to recognise and acknowledge the intellectual property that's been developed in those resources over time.
Okay, so I'm going to give a brief introduction to the DDI standard and
its different flavours. As Nick pointed out, having multiple versions is
not always much fun. We're up to version four. We're about 20 years
old now. So I think that's not too bad from Nick's point of view.
I'll cover how we've captured what we might think of as different forms of provenance over time. I've got the website there - the ddialliance.org website - if you're interested in knowing more; you can go and explore the different versions of the standard there.
So what is DDI? It's a structured metadata specification developed for
the community and by the community. So particularly in social science
data archives that exist in most OECD countries. It's used in about 90
different countries around the world now thanks to work by the World
Bank and the World Health Organization and others. There's two
major development lines that are basically XML Schemas. One's DDI
Codebook and the other DDI Lifecycle which both correspond to
version two and version three of the standard.
I'll talk a little bit more about those in a moment. We have some other
elements to it as well, additional specifications including some
controlled vocabularies often for things like encoding methodology,
data types and data capture processes and some RDF vocabularies
so that we can sort of start moving into a linked data world. So you
can leverage the standard, particularly the Lifecycle standard into a
linked data environment.
The current version, version four, has been in development over the last couple of years, and that's where the work with Nick has come on board as well. It's moving to a model based specification. So rather than being based in a particular schema, we're looking to focus on the model and then its expression into various different formats. The provisional ones at this point are XML
and RDF and that includes support for provenance and process
models.
So we're looking at that point at how do we leverage what we know
from PROV to support the provenance model within the new version
of the standard. It's managed by the DDI Alliance.
So briefly, on the two versions of the standard already in place: DDI has been around in the Codebook format, which has its origins in the print codebooks produced by organisations like George's going right back to the 1960s and 70s. So in the social sciences we formalised a fairly structured way of thinking about describing data a good 40 years ago, really.
So the Codebook version of the standard really is an after-the-fact description of what a dataset is about. It includes four basic sections. First, the document description, which describes the document that's describing the dataset. Then a study description - we use the term study to describe the package of datasets that encapsulates a project - which covers characteristics of the study itself that the DDI is describing. That includes lots of sections on authorship, citation and access conditions, but particularly, from the point of view of provenance, the methodological content, data collection processes and sources.
Then we also include a lot of what we call related materials. These are documents associated with the project that tell you something about the provenance of where it came from: all the questionnaires, previous codebooks, technical reports, et cetera. So from a human point of view you're starting to get into the area of thinking about provenance, even though it's not really a machine actionable version of that.
We also describe the files themselves, the characteristics of the
physical data files, data formats, et cetera, their size and their
structure. Then there are what we call variable descriptions: descriptions of the variables that are included in the data file. The simplest way of
thinking about this is the columns of a tabulated dataset. What does that column mean? Because in a lot of the social sciences, a number does not actually represent a number; it represents a characteristic of some sort.
For example, with a five-point agree/disagree scale in a survey, how you interpret those values becomes important. George is going to speak to a specific project looking at how we do a lot more with the variable description and the [unclear] of variables in a moment.
So Codebook was really developed to describe things after the fact. The DDI Lifecycle model takes a more data lifecycle
approach to thinking about capturing metadata and provenance.
Underlying it is the model we have on the screen here. I think this is
just a working model of describing the different processes in the DDI
framework that a dataset can go through. Everything from
conceptualising the study in the first place, through collection and processing and distribution - as a side point, archiving that data and storing it for future use - and then rediscovery and analysis and repurposing into the future. So it
was built with the intent of re-usability and particularly machine actionability as well, so the metadata that's developed for a dataset can be re-used in the future - for the same purpose, a similar purpose or something entirely new.
In order to do that you need to be able to understand where did it
come from. So embedded in that is generating metadata going
forward to be able to look backwards through the lifecycle as well. So
it's focused on metadata re-use, and that re-use of metadata really implies an expectation of provenance.
So why DDI Lifecycle? The things it can do: it's machine actionable. It's more complex - there are 27 different schemas; it's probably overly complex, if we're being fair. It's structured and identifiable, so every metadata item is actually able to be permanently identified, managed and repurposed if that's required.
It supports related standards and it supports reuse across different
projects and again, that's sort of something that George is going to
touch on as well. [I'm going to pass this] because I think there are
some particular features for it that I can refer back to in the future. But
I want to talk very briefly about how do we think about provenance
within the different versions and then pass to George, who just wants
to talk specifically about one of the projects there.
So if we think about how provenance is being supported here - I mean Nick's approach with the PROV model is really a machine actionable model, and fundamentally DDI Codebook is not designed for that. But it is designed at least to be able to describe, to a human reading a catalogue entry, what the provenance of this dataset was. So it includes attribution, methodology, data processing, collection and all the documentation we can find on what happened to the data. But it doesn't really do that in an automated way; it's really focused on a human response to prior research, being able to come back and have a look.
Similarly with variables: the question text, the variable name and what the value labels mean are all there.
DDI Lifecycle was really our first attempt to look at machine actionable provenance. So can we capture this along the way? It represents again the information from the studies - attribution, methodology and so forth. But particularly with variables
it's really trying to look at the reusable elements of how we might
reuse questions, reuse columns of data and understand and reuse the
basic conceptual ideas that are embedded within that.
So, for example, if you've got a variable measuring employment, can I reuse that employment variable - maybe the categorisation that was used, the numbers that were used in the survey and so forth.
Then where we're going with DDI 4 - our tagline for that is what we're calling DDI Views - is: to what extent can I actually embed a provenance model inside that framework? So now we're moving
towards really recognising the importance of provenance both conceptually and in the physical and digital formats of data as well, managing codes and categories across the lifecycle - for example, managing the provenance of missing values. If the value of a datum changes, how do I understand that?
So we want to be able to generate this automatically: what happened at the level of an individual datum, of a variable or of a dataset. So we're moving progressively towards the sort of framework that Nick described, but [unclear] that requires the management of the metadata that we have to be moved forward. That's kind of it from me.
George Alter: Hi everyone. Thanks very much to ANDS and to ADA for inviting me
to be here. What I'm going to talk about today is a project that started
in October with funding from the US National Science Foundation
about capturing metadata during the process of data creation.
So for this audience I don't think I have to justify metadata, but the big problem that we face is how we actually get the metadata. It's a lot easier to describe it than it is to actually get it, most of the time.
So to give you some background I'm going to put this in the context
of my home institution which is the Inter-University Consortium for
Political and Social Research located at the University of Michigan.
We've been in the business of archiving social science data since
1962 and we're an international consortium of more than 760
institutions. We were also one of the founding members of the Data
Documentation Initiative Alliance which Steve just talked about and
we actually provide the home office for the DDI Alliance. ICPSR has
been using DDI for many years but we're now getting to the point
where we're able to build all kinds of tools that take advantage of DDI.
One of the first things that we did, and we've been doing it for at least 10 years, is that when you download data from ICPSR you get with it a codebook as a PDF. The PDF is something we created from the
DDI, not the other way around. So for us the DDI is the native version
of the metadata.
So what we started to do is take advantage of DDI to build more kinds
of tools. One of the first ones we created was what's called our
variable search page where you can put in a search term and look for
questions that have been used in datasets that are like that search
term. So this is an example of the results that come out of a variable
search and we are now searching over more than 4.5 million variables
in about 5000 studies or data collections.
One of the things that DDI makes possible is that we can go from this
search to other characteristics of the data. So you can see here in the
blue that there are a number of things that are hyperlinked. If you click
on the place I've got circled, it takes you to an online codebook. The
online codebook has a number of features. It tells you the question
that was asked. It tells you how it was coded. If the data are available
online you can go to a cross tab tool and it also can link to an online
graphing tool.
The other thing that you see on the left side of the screen is a list of
the other variables in the dataset. So you can move around in the
dataset and clicking on any of those variables will bring up a display a
bit similar to this.
Another thing you can do from our variable search screen is if you
click on these check boxes on the left, you can pick out a certain
number of variables that you want to look at more closely and clicking
on this compare button at the top there brings you to this screen which
is a side by side comparison of these different variables which come
from different studies and so you can see whether they're asking the
same question, whether they're coded the same or differently. As
before, this screen is also hyperlinked to the online codebook so you
can go back and forth.
One of our more recent tools which I think is one of the most powerful
is that you can now search for datasets that include more than one
variable that you're interested in. So this is a search using what we call our variable relevance search - it's actually in the study search rather than the variable search - where we're looking for variables about three different things. Does the respondent read
newspapers? Do they volunteer in schools? What's their race? You
can see here that the results come out in three different columns
within each study so you can see which variables are present in each
study.
As before everything is hyperlinked to both the online codebook and
the variable comparison. So you can check on any combination of
these variables and compare them side by side.
Another thing that we did, in another previous NSF project, working
with the American National Election Study and the General Social
Survey, we made a crosswalk of the variables that are available in
those two studies. Now the American National Election Study started
in 1948 and is done every four years. The General Social Survey
started in 1972 and is done every two years. So we're actually going
to be looking over 70 different datasets.
What we've done is created this crosswalk where we've grouped the
variables according to certain tags. We've got eight lists of tags and
then 134 tags in total. The columns here - each column represents a dataset, and there are 70 datasets. All of the variables are linked here
and I can't actually show it here but if you hover over one of those
variables it shows you the question text for the variable.
Again you can use the checkboxes to pick out things that you want to
compare and go to the variable comparison screen. So this is - a
crosswalk like this is a tool that's actually very common. You've
probably seen these before. There are two things that are different
about this though. One is that this is all keyed into the online
codebook so you can go transparently back and forth.
The other thing is that we can use this tool to crosswalk any of the 4.5
million variables in the ICPSR collection because this is drawing
directly from our store of DDI metadata and we don't have to build a
separate tool for each one. This one tool works over all of these
datasets.
Another thing that we did in this project was to think about how we
could extend the online codebook. So here's our online codebook that
you saw before which has the question text and how it was coded, but
this version has something new in this location here. It shows how you
got to this question.
In big surveys every respondent doesn't answer every question. There
are what are often called skip patterns. So you get asked what your
marital status is and if you're single you go to one question, if you're
married you go to another question, and divorced people go to a third. So there are different pathways through the questionnaire. What we've done here is try to show: here's how you got to that question, which explains why some people didn't answer it. We also represented it in words down here.
So we built this and we were quite proud of ourselves for building it
because this does answer the question about who answered this
question in the survey. But then we ran into a problem so how do we
know who answered the question in the survey. The answer is that we
get that information from the data providers in a pdf. The only way we
could build this demo prototype was to have one of our staff members
enter this program flow information by manually into XML for one of
datasets so we could show how this works.
So we showed a tool that we think is really useful, but we reached a
roadblock because we don't actually get machine actionable metadata
about this kind of information. The problem is that when the data arrive at the archive, they don't have the question text. That's something that we at ICPSR and ADA have to type in. They don't
have the interview flow. They don't have any information about
variable provenance and variables that are created out of other
variables are not documented.
So the project we're working on now, which is called C2 Metadata - for Continuous Capture of Metadata - is about how you get that. To understand how we get it, you have to think about how the data are created and what happens.
So first of all, the data themselves are actually born digital. People do not go around with a paper questionnaire these days. They use these computer assisted interview programs, on the telephone, or they go around with a laptop or a tablet. There's no paper questionnaire. There is instead a program, and it's the program that is the metadata.
So technically at the beginning you start with this computer assisted
interviewing system and what you get out of it is the original dataset.
But you can also derive from it DDI metadata in XML and there are
programs, a couple of different programs that will take these CAI
systems, the code that they run on [unclear] to XML.
But what happens next, well what happens next is that the project that
commissioned the data is going to modify the data. There are a
number of reasons for doing that. There are some things that are in
the data that are just purely there for administrative purposes. There
are some variables that have to be changed to reduce the
identifiability of individuals. Some variables that need to be combined
into scales or indexes.
So what they do is write a script that's going to run in one of the major statistical packages. The script and the data go through that software, and what comes out is a new dataset. Well, what happens to the metadata? At this point the metadata don't match the dataset anymore; you would need to update the XML to fix it, and nobody likes updating XML.
So the metadata get trashed and thrown away. What happens then is this: after the data are revised, the metadata are recreated. What happens is that we at the archive take the revised
data and extract as much metadata from it as we can. So we get an
extracted XML file and what about the things that went on in the script
here? Well we actually have to sit down and extract them by hand.
So a person has to read the script and write down what happened.
Well, so what are we missing? What we get from the statistics packages are just names, labels for variables, labels for values and virtually no provenance information. So what we're working on is a way to automate the capture of this variable transformation metadata. The idea is this: we're going to write software where you take the script that was used to modify the data - the very same script - run it through what we're calling a script parser, and pull from that the information about variable transformations. We put that into a standard format which we're calling a standard data transformation language.
Then you take that information and incorporate it into the original DDI.
You update the original DDI and then you've got a new version of the
XML that is in sync with the revised data. So this process then
requires two different software tools: one that will read the script and turn it into a standard format, and a second one that will update the XML. That's what we're building.
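As a toy illustration of the first tool's job - this is a sketch of the idea, not the project's actual parser - a few lines of Python can read one SPSS-style COMPUTE statement and emit a neutral, package-independent transformation record:

import re

def parse_compute(line):
    """Parse e.g. 'COMPUTE income2 = income * 2.' into a record."""
    m = re.match(r"COMPUTE\s+(\w+)\s*=\s*(.+?)\.\s*$", line, re.IGNORECASE)
    if m is None:
        return None
    target, expression = m.groups()
    # A neutral record of the transformation, ready to be normalised
    # into a standard data transformation language and merged back
    # into the DDI
    return {
        "command": "Compute",
        "result": target,
        "expression": expression.strip(),
        "sourceLanguage": "SPSS",
    }

print(parse_compute("COMPUTE income2 = income * 2."))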
So we are building tools that will work with the different software
packages and update XML. We're actually writing these parsers for
scripts in four different languages, SPSS, SAS, Stata and R. The
reason we're doing four languages is that if you look at the column
over there on the right which is based on downloads at ICPSR in
cases where the dataset had all four formats, you can see that there's
not a single dominant format.
SPSS and Stata are the most downloaded formats from ICPSR and
they both have about 24 per cent. SAS and R both have about 12 per
cent. If we did one package we'd be pleasing only a few people and we couldn't have an impact. So we're actually
writing parsers for four languages.
Here's something that's come out of our work that I thought you might find interesting. This is about why we need to have a special language
for expressing these data transformations. So here are three brief
programs in SPSS, Stata and SAS that all are designed to operate on
the same data. I tried very hard to make the programs, the scripts
identical and I think that I succeeded. But if you run these three
programs you get three different results.
The key thing here is to look at the last row, the row in which we set the minus one to be missing. In SPSS you get two missing values. In Stata and SAS one of the variables is set to a number, but it's a different one in each. Why does this happen?
Well, the reason is that in logical expressions SPSS treats a missing value as missing: the result of a logical expression that includes a missing value is itself missing, which in most cases is treated as false. Stata treats a missing value as a number which is equal to infinity. SAS treats a missing value as a number which is equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons.
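A toy demonstration of those three conventions - illustrative Python, not the packages' actual code - for the test "is x greater than 5" when x is missing:

import math

def greater_than(x, threshold, package):
    # Substitute each package's convention for a missing value (None)
    if x is None:
        if package == "SPSS":
            return None       # missing propagates; usually treated as false
        if package == "Stata":
            x = math.inf      # missing sorts above every number
        if package == "SAS":
            x = -math.inf     # missing sorts below every number
    return x > threshold

for pkg in ("SPSS", "Stata", "SAS"):
    print(pkg, greater_than(None, 5, pkg))
# SPSS None, Stata True, SAS False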
So it's actually more accurate to represent the data in this way, which
you wouldn't see if you just looked at the datasets. So what we're
doing is creating our own language - well, we're actually using a language that's been created by another community, the SDMX community's Validation and Transformation Language - so that we can put all three of these languages into a common core.
So what are we doing and why are we doing it? So the goal of the
project is to capture this metadata and automate it. If we can capture
more metadata from the data creation process, we'll be able to
provide much better information to researchers about what's in the
dataset. Automating this process we hope will make it cheaper for
everyone and make it easier. That has been one of the principles we've tried to follow here: if we can't make it easier for the researchers, they're not going to do it.
So the hope here is that the software we get will make their lives
easier. Just to acknowledge some of my partners in this: we've got partners from a couple of software firms, Colectica and Metadata Technology North America, and the Norwegian Centre for Research Data, and the two projects I mentioned, the General Social Survey and the American National Election Study, are part of the project too.
So that's my talk.
Kate LeMay: Fabulous, thank you very much George. We had a question that came
through earlier from your talk when you were speaking about people
putting variables into ICPSR and searching for them and [Ming 43:40]
has asked, when a user searches for a variable or variables, do they
need to come up with the exact variable name as in the variable
index?
George Alter: So right now what we're doing is really a text search. When you
search for variables, you're searching over the variable name and the
variable label. It also can bring up items that are in the values for the
variables. But one of the problems in the social sciences is that people
don't reuse questions very often. So we don't have a tradition of
reusing questions. It's very hard to find the same question in multiple
datasets. The kind of search we're doing now in our question bank is
frankly kind of clunky and it often misses things. That's an issue I'm trying to address in some other projects, where we're trying to improve the way we can search over variables.
Kate LeMay: Thank you very much. We've got a question for Nick as well.
Nick Car: [Unclear].
Kate LeMay: Yes. So Nick we've got a question, how widely is PROV used and
what have you found to be the main challenges working with PROV?
Noting that a V2 is not on the horizon, is it easy to update a PROV
model if a change is required?
Nick Car: Okay, so first part first: how widely is it used? I have a direct interest in things provenance, but aside from that I have an interest in things geospatial and, I guess, physical sciences data. In that community there's only one game in town, and that's PROV, but it's early days. So most of the spatial, geophysical and so on sorts of places - the hard physical sciences side - either are using their own systems or they're intending to use PROV. There are not many that are actually already using PROV, but there are certainly not many that are intending to use something other than PROV.
Outside of my own Geoscience Australia area, other communities I
know of including DDI and so on - because PROV's only been around
for a few years, if people can characterise their problem in a
provenance way, like they actually understand this as a provenance
question as opposed to some other kind of question like an ownership
or an attribution question, they fairly quickly end up at PROV.
So it's certainly more widely used than any other provenance standard has ever been, and it's showing signs of being much more widely used than that, and that's because the other initiatives in the space have been sort of swallowed up by PROV.
Now the second part of the question was, what are the problems and
I've identified one already which is people have to know that they're
asking a provenance question. So we get a lot of questions which are
synonyms for provenance questions - probably much like variable naming - where people say I'm interested in the lineage of my data, or the transparency, or the process flow, or the ownership or attribution, and those could all be provenance questions. The hardest
thing to work out is specifically what questions are being asked and
then if there is an existing metadata model or something in that space
already, what's it doing and what's it not doing and therefore do we
need provenance - a specific provenance initiative.
So for instance many metadata models have authorship, ownership,
creator information indicated in them. So if your provenance question
is, I want to know datasets created by Nick, that kind of provenance
question you can usually answer in other metadata systems. You'd
have to have something a bit more complicated than that, and determined to be genuinely provenance, to then think about using a provenance system.
The other thing is the move away from what I call point metadata
where you've got a single thing with a bunch of properties that come
from it. So a study or a document or a chunk of data with a bunch of
properties. That's one way to do things, but what PROV and other models are interested in is whole networks - things that relate to
other things. It's more complex, but it's much, much more powerful to
do that.
Kate LeMay: Great, thank you very much. So, a question for George: how are sensitive data, variables or values controlled for during the C2 automatic capture? ICPSR has a confidentialisation service on ingest; is this process carried over to the C2 Metadata project? Is this activity captured in PROV-like metadata?
George Alter: So the C2 Metadata model is to operate solely on the metadata, not on the data, so it doesn't really play into the issue of confidentiality. If you're interested, in two weeks we're going to have
another webinar where I am going to talk about how we manage
confidential data, but in general it's rarely the case that we have to
mask the metadata of a dataset for confidentiality reasons. Obviously
controlling the data is something else.
Kate LeMay: So we've got another question here for George. Your script parser that reads a SAS script - would researchers need to install that in their SAS package?
George Alter: We haven't gotten to that point yet but probably not. Probably what
we'll do, at least as a starting point is offer it as a web service. What
you'll do is simply export your SAS program into a text file and upload
the text file to the web service and it will download a new XML file.
Kate LeMay: So we've got another question here. Does PROV support the workflow of creation and approval of provenance data, e.g. the PROV entry is proposed and has been submitted to the data custodian for approval?
Nicholas Car: Well, there are two kinds of answers to it. One is a generic PROV answer and the other seems to be more in line with a particular repository or a particular set of steps. So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky.
But say you had information about the lineage or the history of a
dataset and you wanted to control that chunk of stuff, you could talk
about that thing being a dataset itself, even though it's about
something else and manage that. You could certainly work out how to
link your dataset to the dataset that contains its provenance
information. So you can do that.
But the second part of the question, or I think the general sense of the
question is more to do with how does a specific repository do things.
Does that make sense? Does the PROV support the workflow of
creation and approval? Okay, in general you can represent anything in PROV because it's really high level and it's got those three generic classes of entity, activity and agent. There's almost nothing in the world that I've come across that you can't decompose into one of those three things. Is it a thing? Is it a causative agent? Or is it a temporal occurrence? So in general, yes.
Kate LeMay: Okay fabulous.
Nicholas Car: So Natasha asks a philosophical question for the whole panel: how do you think provenance relates to trust? I'm going to jump in very quickly and say that provenance models before PROV often had the word trust in them somewhere. Many of the motivations for provenance models were to do with trust. The goal of Geoscience Australia is to put out data and make it open and transparent, and that's fundamentally a trust issue for users of that data.
They want to know how this data came to be. So that's really what provenance is about: telling the history of something so you can generate all sorts of trust. Then, from the specifics of what you put in there, you can work out: do I trust the people who created this thing? Do I trust the process that was undertaken to deal with it or transform it? Do I trust the particular chunks of code that were used? So that's the generic answer.
Then there are the more specific ones, like for a data [unclear] repository: even though you're telling me something about it, how do I trust that it's in fact true? There are also very difficult things about how I actually trust this metadata even if it looks like it's all [unclear]. If this data comes from God, delivered to you on a stone tablet, I could write that down, but is it true? You have to work that out, and that is now a non-provenance thing. You have to work out some other way of attributing a trust metric to that claim. That might be that it's digitally signed and you trust the agency that delivered it - that's an appeal to authority.
You might trust that there's enough information present for you to understand the process well enough to have confidence in it. It might link to well-known sources, like open code or something like that, that you trust. Or maybe there's a mechanism for you to validate certain chunks of data or calculations: if the total number is five, you can look back through the provenance and see somewhere a two plus a three where you see the five, recalculate it, and establish that trust directly.
George Alter: So I think - Nick said it very well. But I'll say the same thing in fewer
words [laughs].
Nicholas Car: Thanks George, thank you.
George Alter: Provenance is really fundamental to trust and Nick really hit the nail
on the head when he talked about transparency. Provenance is about
transparency and in the world we live in now, even appeals to
authority don't work very well anymore. I think that for science
to gain legitimacy and trust, we have to be transparent, and that's what provenance metadata is all about.
Kate LeMay: So we've reached the end of our time. I'd just like to thank our three
speakers for coming along to our ANDS Canberra Office today and
speaking to us about provenance and introducing lots of new
acronyms to us all. Every time I encounter anything new at ANDS
there's always more acronyms to learn.
So thank you very much for coming. We have two more webinars in
the social science series, so hope to see you there again.
END OF TRANSCRIPT

How FAIR is your data? Copyright, licensing and reuse of data
 
Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018Peter neish DMPs BoF eResearch 2018
Peter neish DMPs BoF eResearch 2018
 

Recently uploaded

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Recently uploaded (20)

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 

HTML at least to some extent.

PROV itself was completed in 2013 and sort of formalised by the end of that year. So it's only a couple of years old, and a large number of authors were involved in PROV. There were several initiatives to make provenance standards before PROV over the last perhaps 20 years, such as PML and OPM - I'm not going to elaborate [unclear], so if you're interested in those previous standards just Google them. Many of the authors involved in those initiatives were involved with PROV. So PROV really does know about those other initiatives, and it's simpler than those precursors because it's trying to be a high level standard. It doesn't do as many of the tasks as those precursors do, but it certainly represents the very important bits that they came up with.

Another thing to say about PROV is that there's no version two planned any time soon. Why am I bringing this up now? Well, it's a pain for people to have to deal with standards and then versions two and three and four of standards. PROV doesn't quite operate like that, and I'll explain how. It is what it is, and there are ways to extend it and use it in different circumstances, but it's unlikely that we're going to see any version change in the next few years, I would think.

It's seen good adoption. PROV is really the only international broad scale provenance standard, and as a result people are happy to - I think, happy to adopt it in lieu of really anything else.

Right, so PROV is actually a collection of documents and I've just listed them there. I'm not going to go through them all in great detail, but there is an overview document and then certain bits and pieces which are actual recommendations or standards, and additional things that just help you use the PROV thing.

Now the main document is the PROV-DM, the data model. That tells you what PROV contains, how its classes operate and so on. Then there's a series of documents like an XML version of PROV, an OWL ontology version, and special notations and so on. The only other one I'll mention is PROV-CONSTRAINTS, which is a list of rules that PROV-compliant chunks of data must adhere to, and that works across any formulation of PROV. I've provided a link there to the collection of documents.

So how do I use PROV? This is modelling - how do I actually model something using PROV to do the core of provenance representation? Well, I'm starting off with some negatives, so don't do it like this. Don't take a document for something, perhaps a metadata catalogue entry, and expect to shove a bunch of information into some field within that document. So ISO 19115 is a standard for spatial datasets and it's got a field called lineage, and some people expect to take provenance information and stick it in that lineage field. Don't do that. PROV doesn't let you do that - I'll explain why in a second. So that's one thing not to do. We're not going to see a single item's metadata record containing a bunch of provenance information. You could do that, but it's not recommended.

What else should I not do? So this diagram here is the class model of DCAT, the Data Catalogue Vocabulary, which is a very generic metadata model. It's used in relation to things like Dublin Core and various catalogue style things, and we're not going to link a dataset or any other object in DCAT or Dublin Core or other standards like that to a class of provenance information. This is true for Steve's DDI initiative as well. We're not going to take objects in DDI and link to a provenance object that tells you the provenance of that object. That's an anti-pattern right there.

So what are we going to do? We don't even do this using Dublin Core's provenance properties. So the Dublin Core vocabulary has a property called provenance, and the wording for that says, use this to describe lineage history. PROV doesn't want you to do it exactly like that.

What does PROV want you to do? PROV wants you to think of everything that you're interested in in terms of three general classes of objects. So in the scenario, the things that you're interested in - are they things, are they entities? Are they occurrences, are they processes, are they activities? Or are they causative people or organisations, which [unclear] cause an agent? So PROV says model everything you know about using those three classes and then link them together, and that's what PROV's all about.

So how does GA use PROV? We often process chunks of data at GA, so we have a very simple model that's using the provenance ontology and it looks like this. There's some process, the process generates outputs, the outputs are entities, the process itself is an activity, and then there's data and code and configuration and so on that feed into that process, and those are also entities. Finally the process and the entities might be related to a system and even a person who operates that system. So that's the model we use.

Okay, so how do I actually manage the data that I get in provenance - or that I get according to PROV? Well, you can create reports. So you can go and do something, and a human or a system could log what they had done and they could store that information in some kind of database according to the PROV model. Then you can - it's a document database, but you can query that thing. So we often have systems that sort of send reports every time they run. You might have a form that looks like any other metadata entry form where you fill in details and you hit enter and that sends off provenance information, but again it's not storing it with respect to one specific object, it's linking existing objects together. So some dataset that is produced from another dataset is going to link those two things together.
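To make that pattern concrete, here is a minimal sketch in Python using the rdflib library with PROV-O terms. The dataset, process and person identifiers are made-up examples, not GA's real ones, and the role shown is illustrative:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/")  # made-up identifiers for this example

    g = Graph()
    g.bind("prov", PROV)

    # Entities: an input dataset X and an output dataset Y
    g.add((EX.datasetX, RDF.type, PROV.Entity))
    g.add((EX.datasetY, RDF.type, PROV.Entity))

    # Activity: the process that used X and generated Y
    g.add((EX.process1, RDF.type, PROV.Activity))
    g.add((EX.process1, PROV.used, EX.datasetX))
    g.add((EX.datasetY, PROV.wasGeneratedBy, EX.process1))

    # The simple catalogue-style link: Y was derived from X
    g.add((EX.datasetY, PROV.wasDerivedFrom, EX.datasetX))

    # Agent: the person who operated the system, with an explicit role
    g.add((EX.nick, RDF.type, PROV.Agent))
    g.add((EX.process1, PROV.wasAssociatedWith, EX.nick))
    g.add((EX.process1, PROV.qualifiedAssociation, EX.assoc1))
    g.add((EX.assoc1, RDF.type, PROV.Association))
    g.add((EX.assoc1, PROV.agent, EX.nick))
    g.add((EX.assoc1, PROV.hadRole, EX.operator))
    g.add((EX.operator, RDF.type, PROV.Role))

    print(g.serialize(format="turtle"))

Everything here is one of the three classes - entities, an activity and an agent - linked together, which is the network style of description PROV asks for.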
For catalogue things we can link things again. If we have a catalogue that has a dataset X and a dataset Y and we want to show there's a linking, we can say dataset Y was derived from dataset X and record that information somewhere. Now dataset Y may record "I come from dataset X", but that's just a very simple little bit of provenance information. It's not a whole glob of provenance information stored within dataset Y.

We can ensure that any system that has information that is provenance information, like who the creator of a dataset was, does so in accordance with the PROV model. So in this case, if we had a dataset that had a creator, we would say the dataset was associated with an agent and the agent had a role to play, and that role, in this case, was creator. That's now a PROV expression of that relationship.

For databases it can be very difficult. I can't explain it in depth here, but there are many ways in which databases could store PROV related provenance information. They would need to be able to show that they can actually export their provenance content according to the PROV data model - you actually have to prove that if you want to say that you are compliant with a standard.

So fairly quickly, how do I get PROV to work with other systems? Well, we can fully align our system, whatever this system is. So I've used a theoretical example of Metadata System X. How do I align Metadata System X with PROV? I could classify all of the things in Metadata System X according to PROV. It requires a metadata model for Metadata System X - sorry, a data model, not just encoding formats. We can't just deal with XMLs and so on. We actually have to have a conceptual model, and then we can say, this class of thing in Metadata System X is the same as this class of thing in PROV. Now PROV's only got a few classes, so that's usually pretty easy to do. But it will definitely prompt you to do things that you wouldn't normally do. You may have to tease apart some of the objects that you know and love into things that PROV recognises as different objects.

You could do a partial alignment. You could take your Metadata System X and only acknowledge that some of the things in that scenario are PROV understood things. So maybe you've got a metadata model that talks about all kinds of stuff, and one of the things it talks about is a dataset. You say your dataset is the same as what PROV thinks of as an entity, and maybe you ignore all the other things. You would still need to demonstrate that you could extract valid PROV out of that and not all the other stuff, but that would be one way to do it. You could also link to things not in your own data model if you also classified those things according to PROV.

The last scenario you could think about is to just deprecate your obviously not as good systems and use PROV. That would require you perhaps to make either a new dataset of provenance information or a data store and put that information somewhere, and that's it.

Kate LeMay: Thank you very much Nick. So we'll move onto Steve.

Steve McEachern: Nick's talked about the sort of general PROV model that is increasingly getting used in various different spaces. I'm going to talk specifically about the various ways of thinking about provenance in what we're doing in the social sciences, particularly within the standard that we utilise and that I'm now the chair of, the Data Documentation Initiative. Part of the reason we've sort of connected these two together is we're now looking at how we can leverage the PROV standard inside DDI [4]. So Nick and I and a group of others have been working on how we might go about this. I'm not going to touch too much on that, but I'll return to it at the end.

I sort of want to talk more generally about how we might think about provenance at different stages of the data lifecycle, different stages in the research or in the data management experience, and how we progressed thinking about provenance over that time. Just to give you a sense of what sort of things we can do already and how we can increasingly embed and capture provenance in what we do.

Okay, quickly, for those who don't know the Australian Data Archive - we've had various names over time - I'm going to do a quick introduction. We've been around for a little while now, based here at the Research School of Social Sciences at ANU. Our mission is to collect and preserve Australian social science data on behalf of the social science research community in Australia and internationally. We've now developed a collection of over 5,000 datasets across more than 1,500 different studies, as we call them, or projects. Lots of different sources, lots of different provenance from various different locations: academic, government and private sector.

So as our holdings have developed, our understanding of provenance has developed probably alongside that. Maybe we didn't call it that at the time, but after 35 years I think that's always been sort of underpinning a lot of what we've done. Helping researchers who might be the secondary users of our data to know where did this come from, what was it used for and how might I use it in the future is really the emphasis there.

For those who don't really know what we're talking about when I use the term data archive, we're using the term a trusted system out of a project done by the Social Science & Humanities Research Council of Canada - they're kind of the equivalent in Canada of the ARC. “An accessible and comprehensive service empowering researchers to locate, request, retrieve and use data resources…” - so you've got to be able to find it and understand it - “…in a simple, seamless and cost effective way, while at the same time protecting the privacy, confidentiality and intellectual property rights of those involved”. Part of why we're interested in provenance is really that last point.
One is to help researchers understand where this came from, but it is also to recognise and acknowledge the intellectual property that's been developed in those resources over time.

Okay, so I'm going to give a brief introduction to the DDI standard and its different flavours. As Nick pointed out, having multiple versions is not always much fun. We're up to version four, and we're about 20 years old now, so I think that's not too bad from Nick's point of view. And I'll cover how we've sort of captured what we might think of as different forms of provenance over time. I've got the website there, the ddialliance.org website, if you're interested in knowing more - you can go and explore the different versions of the standard there.

So what is DDI? It's a structured metadata specification developed for the community and by the community, particularly in the social science data archives that exist in most OECD countries. It's used in about 90 different countries around the world now, thanks to work by the World Bank and the World Health Organization and others.

There are two major development lines that are basically XML Schemas. One's DDI Codebook and the other DDI Lifecycle, which correspond to version two and version three of the standard respectively. I'll talk a little bit more about those in a moment. We have some other elements to it as well, additional specifications including some controlled vocabularies, often for things like encoding methodology, data types and data capture processes, and some RDF vocabularies so that we can sort of start moving into a linked data world. So you can leverage the standard, particularly the Lifecycle standard, into a linked data environment.

Version four is in development at the moment, and has been over the last couple of years, and that's where the work with Nick has sort of come on board as well. It's moving to a model based specification. So rather than being based in a particular schema, we're looking to focus on the model and then its expression into various different formats. The provisional ones at this point are XML and RDF, and that includes support for provenance and process models. So we're looking at that point at how do we leverage what we know from PROV to support the provenance model within the new version of the standard. It's managed by the DDI Alliance.

So briefly, on the two versions of the standard already in place. It's been around in the codebook format, which has its origins in the print codebooks produced by organisations like [George's] going right back to the 1960s and 70s. So we've sort of formalised in the social sciences a fairly structured way of thinking about describing data back 40 years ago really. The codebook version of the standard really is an after the fact description of what this dataset is about.

It includes four basic sections. The document description, which is describing the document that's describing the dataset. A study description - we use the term study to describe sort of the package of datasets that encapsulate a project - so that includes characteristics of the study itself that the DDI is describing. That includes lots of sections on authorship, citation and access conditions, but particularly, from the point of view of provenance, we have the methodological content, data collection processes and sources. Then we also include a lot of what we call related materials - documents associated with the project that tell you something about the provenance of where it came from. That includes all the questionnaires, previous codebooks, technical reports, et cetera. So from a human point of view you're starting to get into the area of thinking about provenance, even though it's not really a [machine actionable] version of that.

We also describe the files themselves - the characteristics of the physical data files, data formats, et cetera, their size and their structure. Then what we call variable descriptions: descriptions of the variables that are included in the data file. The simplest way of thinking about this is the columns of a tabulated dataset. What does that column mean? Because in a lot of the social sciences a number does not actually represent a number. It represents a characteristic of some sort - for example, a five point agree/disagree scale in a survey - so how you interpret a lot of those becomes important. George is going to talk to a specific project looking at how we do a lot more with the variable description and the [unclear] of variables in a moment.

So codebook was really developed to describe things after the fact. The DDI Lifecycle Model takes a more data lifecycle approach to thinking about capturing metadata and provenance. Underlying it is the model we have on the screen here. I think this is just a working model of describing the different processes in the DDI framework that a dataset can go through: everything from conceptualising the study in the first place, through collection and processing and distribution, as a side point archiving that data and storing it around for future use, and then rediscovery and analysis and repurposing into the future.

So it was built with the intent of re-usability, and particularly machine actionability as well, so that the metadata that's developed in a dataset can be re-used in the future for the same purpose, a similar purpose or something entirely new. In order to do that you need to be able to understand where did it come from. So embedded in that is generating metadata going forward to be able to look backwards through the lifecycle as well. It's focused on metadata re-use, and that re-use of metadata really implies a provenance expectation.

So why DDI Lifecycle? The things it can do: it's machine actionable. It's more complex - there are 27 different schemas; it's probably overly complex if we're being fair. It's structured and identifiable, so every metadata item is actually able to be permanently identified and managed and repurposed if that's required.
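As a rough illustration of what that identifiability looks like in practice: DDI Lifecycle items are addressed by URNs with the general shape urn:ddi:agency:identifier:version. The helper below is only a sketch - the agency and item names are made up - but it shows how such an identifier pins down one version of one metadata item owned by one agency:

    def ddi_urn(agency, identifier, version):
        """Build a DDI Lifecycle-style URN: urn:ddi:<agency>:<identifier>:<version>."""
        return f"urn:ddi:{agency}:{identifier}:{version}"

    def parse_ddi_urn(urn):
        """Split a DDI-style URN back into its parts (shape check only)."""
        scheme, ddi, agency, identifier, version = urn.split(":")
        assert (scheme, ddi) == ("urn", "ddi"), "not a DDI URN"
        return {"agency": agency, "identifier": identifier, "version": version}

    urn = ddi_urn("au.example.ada", "Variable-employment-status", "2")
    print(urn)              # urn:ddi:au.example.ada:Variable-employment-status:2
    print(parse_ddi_urn(urn))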
It supports related standards, and it supports reuse across different projects, and again that's sort of something that George is going to touch on as well. [I'm going to pass this] because I think there are some particular features for it that I can refer back to in the future. But I want to talk very briefly about how we think about provenance within the different versions and then pass to George, who wants to talk specifically about one of the projects there.

So if we think about how provenance is being supported here - I mean Nick's approach to the PROV model, with really a machine actionable model - fundamentally, DDI Codebook is not really designed for that. But it is designed at least to be able to describe to a human reading a catalogue entry what the provenance of this dataset was. So it includes attribution, methodology, data processing, collection and all the documentation we can find on what happened to the data. But it doesn't really do that in a sort of automated way; it's really focused on a human response to prior research, to be able to come back and have a look. Similarly with variables: question texts, [variable] names, what the value labels mean are all there.

DDI Lifecycle was our first attempt really to look at sort of machine actionable provenance. So can we capture this along the way? It represents again the information from the studies - attribution, methodology and so forth. But particularly with variables it's really trying to look at the reusable elements of how we might reuse questions, reuse columns of data, and understand and reuse the basic conceptual ideas that are embedded within that. So, for example, if you've got a variable measuring employment, can I reuse that employment variable - maybe the categorisation that was used, the numbers that were used in the survey and so forth.

Then where we're going with DDI 4 - our tagline for that is what we're calling DDI Views - is to what extent can I actually embed a provenance model inside that framework. So now we're moving towards really recognising the importance of provenance both conceptually and even in sort of the physical and digital formats of data as well. Measuring codes and categories across the lifecycle, for example, or managing the provenance of missing values. If the value of a datum changes, how do I understand that? So we're able to generate this out automatically - what happened at the level of an individual datum, of a variable or of a dataset. So we're moving progressively towards the sort of framework that Nick described, but [unclear] that requires the management of the metadata that we have to be moved forward. That's kind of it from me.

George Alter: Hi everyone. Thanks very much to ANDS and to ADA for inviting me to be here. What I'm going to talk about today is a project that started in October, with funding from the US National Science Foundation, about capturing metadata during the process of data creation. So I don't think for this audience I have to justify metadata, but the big problem that we face is how do we actually get the metadata. It's a lot easier to describe it than it is to actually get it, most of the time.

So to give you some background, I'm going to put this in the context of my home institution, which is the Inter-university Consortium for Political and Social Research, located at the University of Michigan. We've been in the business of archiving social science data since 1962 and we're an international consortium of more than 760 institutions. We were also one of the founding members of the Data Documentation Initiative Alliance, which Steve just talked about, and we actually provide the home office for the DDI Alliance.

ICPSR has been using DDI for many years, but we're now getting to the point where we're able to build all kinds of tools that take advantage of DDI. One of the first things, which we've been doing for at least 10 years, is that when you download data from ICPSR you get with it a codebook in pdf. The pdf is [something that] we created from the
DDI, not the other way around. So for us the DDI is the native version of the metadata.

So what we started to do is take advantage of DDI to build more kinds of tools. One of the first ones we created was what's called our variable search page, where you can put in a search term and look for questions that have been used in datasets that are like that search term. So this is an example of the results that come out of a variable search, and we are now searching over more than 4.5 million variables in about 5,000 studies or data collections.

One of the things that DDI makes possible is that we can go from this search to other characteristics of the data. So you can see here in the blue that there are a number of things that are hyperlinked. If you click on the place I've got circled, it takes you to an online codebook. The online codebook has a number of features. It tells you the question that was asked. It tells you how it was coded. If the data are available online you can go to a cross tab tool, and it also can link to an online graphing tool. The other thing that you see on the left side of the screen is a list of the other variables in the dataset. So you can move around in the dataset, and clicking on any of those variables will bring up a display a bit similar to this.

Another thing you can do from our variable search screen: if you click on these check boxes on the left, you can pick out a certain number of variables that you want to look at more closely, and clicking on this compare button at the top there brings you to this screen, which is a side by side comparison of these different variables, which come from different studies. So you can see whether they're asking the same question, whether they're coded the same or differently. As before, this screen is also hyperlinked to the online codebook so you can go back and forth.

One of our more recent tools, which I think is one of the most powerful, is that you can now search for datasets that include more than one variable that you're interested in. So this is a search using what we call our variable relevance search - that's actually in the study search rather than the variable search - where we're looking for variables about three different things. Does the respondent read newspapers? Do they volunteer in schools? What's their race? You can see here that the results come out in three different columns within each study, so you can see which variables are present in each study. As before, everything is hyperlinked to both the online codebook and the variable comparison. So you can check on any combination of these variables and compare them side by side.

Another thing that we did, in another previous NSF project working with the American National Election Study and the General Social Survey, was to make a crosswalk of the variables that are available in those two studies. Now the American National Election Study started in 1948 and is done every four years. The General Social Survey started in 1972 and is done every two years. So we're actually looking over 70 different datasets. What we've done is created this crosswalk where we've grouped the variables according to certain tags. We've got eight lists of tags and 134 tags in total. The columns here - each column represents a dataset, and there are 70 datasets. All of the variables are linked here, and I can't actually show it here, but if you hover over one of those variables it shows you the question text for the variable. Again you can use the checkboxes to pick out things that you want to compare and go to the variable comparison screen.

So a crosswalk like this is a tool that's actually very common - you've probably seen these before. There are two things that are different about this though. One is that this is all keyed into the online codebook, so you can go transparently back and forth. The other thing is that we can use this tool to crosswalk any of the 4.5 million variables in the ICPSR collection, because this is drawing directly from our store of DDI metadata and we don't have to build a separate tool for each one. This one tool works over all of these datasets.

Another thing that we did in this project was to think about how we could extend the online codebook. So here's our online codebook that you saw before, which has the question text and how it was coded, but this version has something new in this location here. It shows how you got to this question. In big surveys every respondent doesn't answer every question. There are what are often called skip patterns. So you get asked what your marital status is, and if you're single you go to one question, if you're married you go to another question, and divorced people go to a third question. So there are different pathways through the questionnaire. What we've done here is try to show, here's how you got to that question, which explains why some people didn't answer the question. We also represented it in words down here.

So we built this, and we were quite proud of ourselves for building it, because this does answer the question about who answered this question in the survey. But then we ran into a problem: how do we know who answered the question in the survey? The answer is that we get that information from the data providers in a pdf. The only way we could build this demo prototype was to have one of our staff members enter this program flow information manually into XML for one of the datasets, so we could show how this works. So we showed a tool that we think is really useful, but we reached a roadblock because we don't actually get machine actionable metadata about this kind of information.

The problem is that when the data arrive at the archive, they don't have the question text - that's something that we at ICPSR and ADA have to type in. They don't have the interview flow. They don't have any information about variable provenance, and variables that are created out of other variables are not documented.
So the project we're working on now, which is called C2 Metadata, for [continuous capture] of metadata, is about how you get that. To understand how we get it, you have to think about how the data are created and what happens.

So first of all, the data themselves are actually born digital. People do not go around with a paper questionnaire these days. They use these computer assisted interview programs - they're on the telephone, or they go around with a laptop or a tablet to answer them. There's no paper questionnaire. There is instead a program, and it's the program that's the metadata. So technically at the beginning you start with this computer assisted interviewing system, and what you get out of it is the original dataset. But you can also derive from it DDI metadata in XML, and there are a couple of different programs that will take these CAI systems - the code that they run on - [unclear] to XML.

But what happens next? Well, what happens next is that the project that commissioned the data is going to modify the data. There are a number of reasons for doing that. There are some things in the data that are just purely there for administrative purposes. There are some variables that have to be changed to reduce the identifiability of individuals. Some variables need to be combined into scales or indexes. So what they do is they write a script that's going to run in one of the major statistical packages. The script and the data go through that software, and what comes out is a new dataset.

Well, what happens to the metadata? At this point the metadata don't match the dataset anymore, and you would need to update the XML to fix it, and nobody likes updating XML. So the metadata get trashed and thrown away. What happens then is this: after the data are revised, the metadata are recreated. We at the archive take the revised data and extract as much metadata from it as we can. So we get an extracted XML file - and what about the things that went on in the script here? Well, we actually have to sit down and extract them by hand. So a person has to read the script and write down what happened.

So what are we missing? What we get from the statistics packages are just names, labels for variables, labels for values and virtually no provenance information. So what we're working on is a way that we can automate the capture of this variable transformation metadata.

So the idea is this: we're going to write software where you could take the script that was used to modify the data - take the very same script - and run it through what we're calling a script parser, and pull from that the information about variable transformations. Put that into a standard format, which we're calling a standard data transformation language. Then you take that information and incorporate it into the original DDI. You update the original DDI and then you've got a new version of the XML that is in sync with the revised data.

So this process requires two different software tools: one that will read the script and turn it into a standard format, and a second one that will update the XML, and that's what we're building. So we are building tools that will work with the different software packages and update XML. We're actually writing these parsers for scripts in four different languages: SPSS, SAS, Stata and R. The reason we're doing four languages is that if you look at the column over there on the right, which is based on downloads at ICPSR in cases where the dataset had all four formats, you can see that there's not a single dominant format. SPSS and Stata are the most downloaded formats from ICPSR and they both have about 24 per cent. SAS and R both have about 12 per cent. If we did one package we'd be pleasing only a few people and we couldn't have an impact. So we're actually writing parsers for four languages.
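To give a flavour of the idea - this is only an illustrative sketch, not the project's actual parser or its transformation language, and the record fields are assumptions made up for the example - a parser for one trivial SPSS statement might pull out records like this:

    import re

    # Matches a trivial SPSS COMPUTE statement, e.g. "COMPUTE income2 = income * 52."
    COMPUTE_RE = re.compile(r"^\s*COMPUTE\s+(\w+)\s*=\s*(.+?)\.\s*$", re.IGNORECASE)

    def parse_spss_script(script):
        """Pull variable-transformation records out of an SPSS-like script."""
        records = []
        for line in script.splitlines():
            match = COMPUTE_RE.match(line)
            if match:
                new_var, expression = match.groups()
                records.append({
                    "command": "compute",      # what kind of transformation
                    "creates": new_var,        # the variable the script creates
                    "expression": expression,  # how it was computed
                    # crude guess at source variables: identifiers in the expression
                    "uses": sorted(set(re.findall(r"[A-Za-z_]\w*", expression))),
                })
        return records

    script = "COMPUTE income2 = income * 52.\nCOMPUTE agesq = age * age.\n"
    for record in parse_spss_script(script):
        print(record)

Records like these - which variable was created, from what, and how - are exactly the variable-level provenance that is otherwise lost when only the revised data file is passed on.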
Here's something that's come out of our work that you might find interesting. This is about why we need to have a special language for expressing these data transformations. So here are three brief programs in SPSS, Stata and SAS that are all designed to operate on the same data. I tried very hard to make the scripts identical, and I think that I succeeded. But if you run these three programs you get three different results.

The key thing here is to look at the last row, the row in which we set the minus one to be missing. In SPSS you get two missing values. In Stata and SAS one of the variables is set to a number, but it's a different one in each.

Why does this happen? Well, the reason is that in logical expressions SPSS treats a missing value as missing [unclear] - it makes the result of a logical expression that includes a missing value missing, which in most cases is treated as false. Stata treats a missing value as a number which is equal to infinity. SAS treats the missing value as a number which is equal to minus infinity. So both Stata and SAS actually do return a number when you have one of these comparisons. So it's actually more accurate to represent the data in this way, which you wouldn't see if you just looked at the datasets.
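The difference can be emulated directly. A small sketch, hard-coding each package's documented treatment of a missing numeric value in a comparison (the function names are made up):

    import math

    def spss_greater(x, threshold):
        # SPSS: a comparison involving a missing value is itself missing,
        # and most SPSS commands then treat that missing result as false
        if x is None:
            return None
        return x > threshold

    def stata_greater(x, threshold):
        # Stata: a missing numeric value sorts above every number (like +infinity)
        return (math.inf if x is None else x) > threshold

    def sas_greater(x, threshold):
        # SAS: a missing numeric value sorts below every number (like -infinity)
        return (-math.inf if x is None else x) > threshold

    for x in (5.0, None):  # an ordinary value, then a missing one
        print(x, spss_greater(x, 0), stata_greater(x, 0), sas_greater(x, 0))

For the missing value, the three "identical" comparisons come back missing, true and false respectively, which is why a package-neutral way of recording what a script actually did is needed.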
So what we're doing is creating our own language. Well, we're actually using a language that's been created by another community, the SDMX community, called the Validation and Transformation Language, so that we can put all three of these languages into a common core.

So what are we doing and why are we doing it? The goal of the project is to capture this metadata and automate that capture. If we can capture more metadata from the data creation process, we'll be able to provide much better information to researchers about what's in the dataset. Automating this process, we hope, will make it cheaper and easier for everyone. That has been one of the principles we've tried to follow here: if we can't make it easier for the researchers, they're not going to do it. So the hope here is that the software we get will make their lives easier.

Just to acknowledge some of my partners in this: we've got partners from a couple of software firms, Colectica and Metadata Technology North America, the Norwegian Centre for Research Data, and the two projects I mentioned, the General Social Survey and the American National Election Study, are part of the project too. So that's my talk.

Kate LeMay: Fabulous, thank you very much George. We had a question that came through earlier from your talk, when you were speaking about people putting variables into ICPSR and searching for them. [Ming 43:40] has asked, when a user searches for a variable or variables, do they need to come up with the exact variable name as in the variable index?

George Alter: So right now what we're doing is really a text search. When you search for variables, you're searching over the variable name and the variable label. It also can bring up items that are in the values for the variables. But one of the problems in the social sciences is that people don't reuse questions very often. We don't have a tradition of reusing questions, so it's very hard to find the same question in multiple datasets. The kind of search we're doing now in our question bank is frankly kind of clunky and it often misses things. That's an issue that I'm trying to address in some other projects, where we're trying to improve the way we can search over variables.

Kate LeMay: Thank you very much. We've got a question for Nick as well.

Nicholas Car: [Unclear].

Kate LeMay: Yes. So Nick, we've got a question: how widely is PROV used, and what have you found to be the main challenges working with PROV? Noting that a V2 is not on the horizon, is it easy to update a PROV model if a change is required?

Nicholas Car: Okay, so first part first: how widely is it used? I have a direct interest in things provenance, but aside from that I have an interest in things geospatial and, I guess, physical sciences data. In that community there's only one game in town, and that's PROV, but it's early days. So most of the spatial, geophysical and so on sorts of places - the hard physical sciences side - either are using their own systems or they're intending to use PROV. There are not many that are actually already using PROV, but there are certainly not many that are intending to use something other than PROV.

Outside of my own Geoscience Australia area, in other communities I know of, including DDI and so on - because PROV's only been around for a few years - if people can characterise their problem in a provenance way, like they actually understand this as a provenance question as opposed to some other kind of question, like an ownership or an attribution question, they fairly quickly end up at PROV. So it's certainly more widely used than any other provenance standard has ever been, and it's showing signs of being much more widely used than that, because the other initiatives in the space have been sort of swallowed up by PROV.

Now the second part of the question was, what are the problems? I've identified one already, which is that people have to know that they're asking a provenance question. We get a lot of questions which are synonyms for provenance questions - probably much like variable naming - where people say I'm interested in the lineage of my data, or the transparency, or the process flow, or the ownership or attribution, and those could all be provenance questions. The hardest thing is to work out specifically what questions are being asked, and then, if there is an existing metadata model or something in that space already, what's it doing and what's it not doing, and therefore do we need a specific provenance initiative.

So for instance, many metadata models have authorship, ownership and creator information indicated in them. So if your provenance question is, I want to know datasets created by Nick, that kind of provenance question you can usually answer in other metadata systems. You'd have to have something a bit more complicated than that, and determinedly provenance, to then think about using a provenance system.

The other thing is the move away from what I call point metadata, where you've got a single thing with a bunch of properties that come from it - a study or a document or a chunk of data with a bunch of properties. That's one way to do things, but what PROV and other models like it are interested in is whole networks: things that relate to other things. It's more complex, but it's much, much more powerful to do that.

Kate LeMay: Great, thank you very much. So a question for George: how are sensitive data, variables or values controlled for during the C2 automatic capture? ICPSR has a confidentialised service on ingest - is this process carried over to the C2 metadata project? Is this activity captured in PROV-like metadata?

George Alter: So the C2 metadata model is to operate solely on the metadata, not on the data, so it doesn't really play into the issue of confidentiality. If you're interested, in two weeks we're going to have another webinar where I am going to talk about how we manage confidential data, but in general it's rarely the case that we have to mask the metadata of a dataset for confidentiality reasons. Obviously controlling the data is something else.

Kate LeMay: So we've got another question here for George. Your script parser that reads from a SAS script - would researchers need to install that in their SAS package?

George Alter: We haven't gotten to that point yet, but probably not. Probably what we'll do, at least as a starting point, is offer it as a web service. What you'll do is simply export your SAS program into a text file and upload the text file to the web service, and it will download a new XML file.
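A sketch of what that round trip could look like from the researcher's side, using the Python requests library. The endpoint URL and field names are entirely hypothetical, since the service was still being built at the time:

    import requests

    # Hypothetical endpoint; the real service did not exist yet at this time
    PARSER_URL = "https://c2metadata.example.org/parse/sas"

    def update_ddi(script_path, ddi_in_path, ddi_out_path):
        """Upload a SAS script plus the current DDI XML; save the updated DDI XML."""
        with open(script_path, "rb") as script, open(ddi_in_path, "rb") as ddi:
            response = requests.post(PARSER_URL, files={"script": script, "ddi": ddi})
        response.raise_for_status()
        with open(ddi_out_path, "wb") as out:
            out.write(response.content)

    update_ddi("recodes.sas", "study_v1.xml", "study_v2.xml")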
Kate LeMay: So we've got another question here. Does PROV support the workflow of creation and approval of provenance data, e.g. the PROV entry is proposed and has been submitted to the data custodian for approval?

Nicholas Car: Well, there are two kinds of answers to that. One is a generic PROV answer, and the other one seems to be more in line with a particular repository or a particular set of steps. So this isn't exactly what you asked, but I'm going to answer it in a slightly different way. You can talk about the provenance of provenance, which is a bit tricky. But say you had information about the lineage or the history of a dataset and you wanted to control that chunk of stuff: you could talk about that thing being a dataset itself, even though it's about something else, and manage that. You could certainly work out how to link your dataset to the dataset that contains its provenance information. So you can do that.

But the second part of the question, or I think the general sense of the question, is more to do with how a specific repository does things. Does that make sense? Does PROV support the workflow of creation and approval? In general, you can represent anything in PROV, because it's really high level and it's got those three generic classes of entity, activity and agent. There's almost nothing in the world that I've come across that you can't decompose down into one of those three things. Is it a thing? Is it a causative agent? Or is it a temporal occurrence? So in general, yes.

Kate LeMay: Okay, fabulous.

Nicholas Car: So Natasha asks, a philosophical question for the whole panel: how do you think provenance relates to trust? I'm going to jump in very quickly and say, provenance models before PROV often had the word trust in them somewhere. Many of the motivations for provenance models were to do with trust. We deal with trust as - the goal of Geoscience Australia is to put out data and make it open and transparent. It's fundamentally a trust issue for users of that data. They want to know how did this data come to be. So that's really what provenance is about. It's about telling the history of something so you can generate all sorts of trust.

But then with the specifics of what you put in there, you can work out: do I trust the people who created this thing? Do I trust the process that was undertaken to deal with it or transform it? Do I trust the particular chunks of code that were used? So that's the generic answer.

Then there are the more specific ones, like for a data [unclear] repository, how do I trust that - even though you're telling me something about it - [it is] in fact true? There are also very difficult things about how do I actually trust this metadata even if it looks like it's all [unclear]. If this data comes from God, delivered to you on a stone tablet, I could write that down, but is it true? You have to work that out, and that is now a non-provenance thing. You have to work out some other way of attributing a trust metric to that claim. That might be that it's digitally signed and you trust the agency that delivered it - that's an appeal to authority. You might trust that there's enough information present for you to understand the process enough to have confidence in it. It might link to well-known sources, like open code or something like that, that you trust. Or maybe there's a mechanism for you to validate certain chunks of data or calculations. So if the total number is five, you can look back through the provenance and see somewhere two plus three; where you see five, you can calculate it and establish that trust directly.

George Alter: So I think Nick said it very well, but I'll say the same thing in fewer words [laughs].

Nicholas Car: Thanks George, thank you.

George Alter: Provenance is really fundamental to trust, and Nick really hit the nail on the head when he talked about transparency. Provenance is about transparency, and in the world we live in now, even appeals to authority don't work very well anymore. I think that for science to gain legitimacy and gain trust, we have to be transparent, and that's what provenance metadata is all about.

Kate LeMay: So we've reached the end of our time. I'd just like to thank our three speakers for coming along to our ANDS Canberra office today, speaking to us about provenance, and introducing lots of new acronyms to us all. Every time I encounter anything new at ANDS there's always more acronyms to learn. So thank you very much for coming. We have two more webinars in the social science series, so I hope to see you there again.

END OF TRANSCRIPT