Minimal Viable Data Reuse
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org

Thanks to Dr. Kathleen Gregory, Dr. Laura Koesten, Prof. Elena Simperl,
Dr. Pavlos Vougiouklis, Dr. Andrea Scharnhorst, Prof. Sally Wyatt
VOGIN-IP

May 11, 2022
Prof. Elena Simperl
King’s College London
Dr. Laura Koesten
King’s College London /
University of Vienna
Dr. Kathleen Gregory
KNAW DANS
Prof. Sally Wyatt
Maastricht University
Dr. Andrea Scharnhorst
KNAW DANS
Dr. Pavlos Vougiouklis
Huawei
Thanks to my collaborators on this work in HCI, social science, and the humanities
Research Topics at INDE lab
• Design systems to support people in working with data from diverse sources
• Address problems related to the preparation, management, and integration of data
• Automated Knowledge Graph Construction (e.g., predicting and adding new links in datasets such as Wikidata based on text; building KGs from video)
• Data Search & Reuse (e.g., studies on GitHub-hosted data; research objects for making data FAIR; data handling impact on computational models)
• Data Management for Machine Learning (e.g., scalable concept drift detection for ML training data, integrated in AWS SageMaker Model Monitor; using data provenance for ML debugging)
• Causality-Inspired Machine Learning (e.g., using ideas from causal inference to improve the robustness and generalization of ML algorithms, especially in cases of distribution shift; domain adaptation)

Data is everywhere in your organization


Sources & Signals
• Knowledge or entity graphs: e.g. databases of facts about the target domain.
• Aggregate statistics: e.g. tracked metrics about the target domain.
• Heuristics and rules: e.g. existing human-authored rules about the target domain.
• Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain.
https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html
What should we do as data providers to enable data reuse?
Lots of good advice
Ten Simple Rules for the Care and Feeding of Scientific Data
Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic
PLoS Comput Biol 10(4): e1003542 (2014). https://doi.org/10.1371/journal.pcbi.1003542
[Slide shows the first page of this editorial.]
Dataset Reuse: Toward Translating Principles to Practice
Laura Koesten (King’s College London), Pavlos Vougiouklis (Huawei Technologies), Elena Simperl (King’s College London), and Paul Groth (University of Amsterdam)
Patterns 1, 100136 (2020). https://doi.org/10.1016/j.patter.2020.100136

SUMMARY: The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.
[Slide shows the first page of this article.]
Lots of good advice
• Maybe a bit too much….
• Currently, 140 policies on fairsharing.org as of April 5, 2021
• We reviewed 40 papers
• Cataloged 39 different features of datasets that enable data reuse
Enable access

Feature | Description | References
Access
License | (1) available, (2) allows reuse | W3C; 3,22,45–47
Format/machine readability | (1) consistent format, (2) single value type per column, (3) human- as well as machine-readable and non-proprietary format, (4) different formats available | W3C; 2,22,48–50
Code | available for cleaning, analysis, visualizations | 51–53
Unique identifier | PID for the dataset / IDs within the dataset | W3C; 2,53
Download link/API | (1) available, (2) functioning | W3C; 47,50
Document

Documentation: Methodological Choices
Methodology | description of experimental setup (sampling, tools, etc.), link to publication or project | 3,13,54,60,63,66
Units and reference systems | (1) defined, (2) consistently used | 54,67
Representativeness/Population | in relation to a total population | 21,60
Caveats | changes: classification/seasonal or special event/sample size/coverage/rounding | 48,54
Cleaning/pre-processing | (1) cleaning choices described, (2) are the raw data available? | 3,13,21,68
Biases/limitations | different types of bias (i.e., sampling bias) | 21,49,69
Data management | (1) mode of storage, (2) duration of storage | 3,70,71

Documentation: Quality
Missing values/null values | (1) defined what they mean, (2) ratio of empty cells | W3C; 22,48,49,59,60
Margin of error/reliability/quality control procedures | (1) confidence intervals, (2) estimates versus actual measurements | 54,65
Formatting | (1) consistent data type per column, (2) consistent date format | W3C; 41,65
Outliers | are there data points that differ significantly from the rest | 22
Possible options/constraints on a variable | (1) value type, (2) if data contains an “other” category | W3C; 72
Last update | information about data maintenance if applicable | 21,62
Completeness of metadata | empty fields in the applied metadata structure? | 41
Abbreviations/acronyms/codes | defined | 49,54

Documentation: Summary Representations and Understandability
Description/README file | meaningful textual description (can also include text, code, images) | 22,54,55
Purpose | purpose of data collection, context of creation | 3,21,49,56,57
Summarizing statistics | (1) on dataset level, (2) on column level | 22,49
Visual representations | statistical properties of the dataset | 22,58
Headers understandable | (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) | 22,59,60
Geographical scope | (1) defined, (2) level of granularity | 45,54,61,62
Temporal scope | (1) defined, (2) level of granularity | 45,54,61,62
Time of data collection | (1) when collected, (2) what time span | 63–65
Situate

Connections
Relationships between variables defined | (1) explained in documentation, (2) formulae | 21,22
Cite sources | (1) links or citation, (2) indication of link quality | 21
Links to dataset being used elsewhere | i.e., in publications, community-led projects | 21,59
Contact | person or organization, mode of contact specified | W3C; 41,73

Provenance and Versioning
Publisher/producer/repository | (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified | 21,49,54,59,74,75
Version indicator | version or modification of dataset documented | W3C; 50,66,76
Version history | workflow provenance | W3C; 50,76
Prior reuse/advice on data reuse | (1) example projects, (2) access to discussions | 3,27,59,60

Ethics
Ethical considerations, personal data | (1) data related to individually identifiable people, (2) if applicable, was consent given | 21,57,71,75

Semantics
Schema/Syntax/Data Model | defined | W3C; 47,67
Use of existing taxonomies/vocabularies | (1) documented, (2) link | W3C; 2
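One lightweight way for a data provider to act on a subset of these features is to ship a small machine-readable metadata record next to the data file. The sketch below is illustrative only: the field names and values are assumptions made for this example rather than a specific standard (schema.org/Dataset, DCAT, or a datasheet template would be natural real-world choices).

```python
import json

# Illustrative dataset metadata record covering a subset of the features above.
metadata = {
    "title": "City air quality measurements",
    "description": "Hourly PM2.5 readings from fixed sensors, 2019-2021.",
    "license": "CC-BY-4.0",                                # Access: license available, allows reuse
    "format": "text/csv",                                  # Access: machine-readable, non-proprietary
    "identifier": "https://doi.org/10.xxxx/placeholder",   # Access: persistent identifier (placeholder)
    "methodology": "Sensors calibrated monthly; see linked project page.",  # Document: methodology
    "missing_values": "Empty cells mean the sensor was offline.",           # Document: quality
    "geographical_scope": "Amsterdam, station level",                       # Document: scope
    "temporal_scope": "2019-01-01 to 2021-12-31, hourly",
    "publisher": "Example City Open Data Office",          # Situate: provenance
    "version": "1.2.0",                                    # Situate: versioning
    "contact": "opendata@example.org",                     # Situate: contact
}

with open("dataset-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```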
Where should a data provider start?
• Lots of good advice!

• It would be great to do all these things

• But it’s all a bit overwhelming

• Can we help prioritize?
Getting some data
• Used GitHub as a case study
• ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos
• Use engagement metrics as proxies for data reuse
• Map literature features to both dataset and repository features
• Train a predictive model to see which features are good predictors
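For intuition, engagement metrics of the kind used as reuse proxies are available from the public GitHub REST API. A minimal sketch follows; the repository name is a placeholder and this is not the paper’s actual collection pipeline.

```python
import requests

def engagement_metrics(owner: str, repo: str) -> dict:
    """Fetch basic engagement metrics for a repository from the public GitHub REST API."""
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    r.raise_for_status()
    data = r.json()
    # Fields returned by the /repos/{owner}/{repo} endpoint.
    return {
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "watchers": data["subscribers_count"],
        "open_issues": data["open_issues_count"],
    }

# Example (hypothetical repository name):
# print(engagement_metrics("octocat", "Hello-World"))
```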
Dataset Features
Missing values
Size
Columns + Rows
Readme features
Issue features
Age
Description
Parsable
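Several of the features above can be computed directly from a data file. A minimal pandas sketch, with an illustrative file name and simplified feature definitions (not the paper’s implementation):

```python
import os
import pandas as pd

def dataset_features(path: str) -> dict:
    """Compute simple dataset-level features of the kind listed above for a CSV file."""
    features = {"size_bytes": os.path.getsize(path)}
    try:
        df = pd.read_csv(path)  # "parsable" with a standard configuration
        features["parsable"] = True
        features["rows"], features["columns"] = df.shape
        features["missing_ratio"] = float(df.isna().mean().mean())
    except Exception:
        features["parsable"] = False
    return features

# Example (hypothetical file):
# print(dataset_features("data/measurements.csv"))
```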
Where to start?
• Some ideas from this study if you’re publishing data with GitHub:
  • Provide an informative short textual summary of the dataset
  • Provide a comprehensive README file in a structured form and links to further information
  • Datasets should not exceed standard processable file sizes
  • Datasets should be possible to open with a standard configuration of a common library (such as Pandas)

Note: we trained a recurrent neural network. There may be better models, but an RNN is useful for handling text. It is not the greatest predictor (good for classifying, not for reuse), but it is still useful for helping us tease out features.
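A data publisher could turn the four suggestions above into a quick self-check before releasing a dataset. The sketch below is illustrative: the 100 MB size limit, the 200-character description minimum, and the crude "structured README" test are assumptions, not thresholds from the study.

```python
import os
import pandas as pd

MAX_SIZE_BYTES = 100 * 1024 * 1024   # illustrative "standard processable" size limit
MIN_DESCRIPTION_CHARS = 200          # illustrative minimum for an informative summary

def reusability_checklist(csv_path: str, readme_path: str = "README.md") -> dict:
    """Check a dataset against the four suggestions above (thresholds are assumptions)."""
    checks = {}
    readme = open(readme_path).read() if os.path.exists(readme_path) else ""
    checks["has_readme"] = bool(readme)
    checks["informative_description"] = len(readme) >= MIN_DESCRIPTION_CHARS
    checks["structured_readme"] = "#" in readme or "http" in readme  # crude check for headings or links
    checks["reasonable_size"] = os.path.getsize(csv_path) <= MAX_SIZE_BYTES
    try:
        pd.read_csv(csv_path)  # opens with a standard configuration of a common library
        checks["opens_with_defaults"] = True
    except Exception:
        checks["opens_with_defaults"] = False
    return checks

# Example (hypothetical paths):
# print(reusability_checklist("data/survey.csv"))
```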
Understand your target users
Multiple responses possible. Percentages are of respondents (n=1677).
Why do you use or need secondary data?
Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2020). Lost or found? Discovering data needed for research. Harvard Data Science Review. https://doi.org/10.1162/99608f92.e38165eb
How would you make sense of this data?
Koesten, L., Gregory, K., Groth, P., & Simperl, E. (2021). Talking datasets – Understanding data sensemaking behaviours. International Journal of Human-Computer Studies, 146, 102562. https://doi.org/10.1016/j.ijhcs.2020.102562
Patterns of data-centric sense making
• 31 research “data people”

• Brought their own data

• Presented with unknown data

• Think-aloud protocol

• Talk about both their own data and then the given data

• Interview transcripts + screen captures
Inspecting unknown data
Engaging with data

Acronyms and abbreviations
Known: “That is a classic abbreviation in the field of hepatic surgery. AFP is alpha feto protein. It is a marker. It’s very well known by everybody...the AFP score is a criterion for liver transplantation.” (P22)
Unknown: “I’m not sure what ‘long’ means. I wonder if it’s not something to do with longevity. On the other hand, no, it’s got negative numbers. I can’t make sense of this.” (P7)

Identifying strange things
Known: “Although we’ve tried really hard, because we’ve put in a coding frame and how we manipulate all the data, I’m sure that there are things in there which we haven’t recorded in terms of, well, what exactly does this mean? I hope we’ve covered it all but I’m sure we haven’t.” (P10)
Unknown: “Now that sounds quite high for the Falklands. I wouldn’t have thought the population was all that great...and yet it’s only one confirmed case. Okay [laughs]. So yes...one might need to actually examine that a little bit more carefully, because the population of the Falklands doesn’t reach a million, so therefore you end up with this huge number of deaths per million population [laughs], but only one case and one death.” (P23)
Placing data
• P2: It’s listing the countries for which data are available, not sure if this is truly all countries we know of...
• P8: It includes essentially every country in the world
• P29: Global data
• P30: I would like to know whether it’s complete...it says 212 rows representing countries, whether I have data from all countries or only from 25% or something because then it’s not really representative.
• P7: If it was the whole country that was affected or not, affecting the northern part, the western, eastern, southern parts
• P24: Was it sampled and then estimated for the whole country? Or is it the exact number of deaths that were got from hospitals and health agencies, for example? So is it a census or is it an estimate?
Activity patterns during data sense making
Recommendations (✅ = for data providers)
• Help users understand shape
  • Provide information at the dataset level (e.g. summaries) ✅
  • Column-level summaries
  • Make it easier to pan and zoom
• Use strange things as an entry point
  • Flag and highlight strange things ✅
  • Provide explanations of abbreviations and missing values ✅
  • Provide metrics or links to other information structures necessary for understanding the column’s content ✅
  • Include links to basic concepts ✅
  • Highlight relationships between columns or entities ✅
  • Identify anchor variables that are considered most important ✅
• Help users place data
  • Embrace different levels of expertise and enable drill down
  • Link to standardized definitions ✅
  • Connect to broader forms of documentation ✅
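Several of the checked items, such as dataset-level summaries, column-level summaries, and flagging strange things, can be generated automatically from the data itself. A minimal pandas sketch, with an illustrative z-score threshold for outliers:

```python
import pandas as pd

def column_summaries(df: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Per-column summaries with simple flags for missing values and outliers."""
    rows = []
    for col in df.columns:
        s = df[col]
        row = {
            "column": col,
            "dtype": str(s.dtype),
            "missing_ratio": float(s.isna().mean()),
            "n_unique": int(s.nunique(dropna=True)),
        }
        if pd.api.types.is_numeric_dtype(s):
            z = (s - s.mean()) / s.std(ddof=0)
            row["n_outliers"] = int((z.abs() > z_threshold).sum())  # "strange things" flag
        rows.append(row)
    return pd.DataFrame(rows)

# Example (hypothetical data):
# df = pd.read_csv("data/countries.csv")
# print(column_summaries(df))
```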
Data is Social
Do you want a data community?
Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2020). Lost or found? Discovering data needed for research. Harvard Data Science Review. https://doi.org/10.1162/99608f92.e38165eb
Conclusion
• For data platforms
  • Think about ways of measuring data reuse
  • Tooling for summaries and overviews of data
  • Automated linking to information for sense making
• For data providers
  • Simple steps
  • Focus on making it easy to “get to know” your data
  • Easy to load and explore (e.g. in pandas, Excel, or a community tool)
  • Links to more information
  • Are you trying to be part of, or to build, a data community?
• We still need a lot more work on data practices and methods informed by practices
Paul Groth | @pgroth | pgroth.com | indelab.org

Kathleen Marie Gregory
Findable and reusable? Data discovery practices in research

