Data Curation and Debugging
for Data Centric AI
Prof. Paul Groth | @pgroth | pgroth.com | indelab.org

Thanks to Stefan Grafberger, Dr. Julia Stoyanovich,
Dr. Sebastian Schelter, Dr. Laura Koesten, Prof.
Elena Simperl, Dr. Pavlos Vougiouklis, Madelon
Hulsebos, Dr. Çağatay Demiralp, Dr. Juan Sequeda,
Prof. George Fletcher

DBML - May 8, 2022
The making of data is important
Finding digital truth—that is, identifying
and combining data that accurately
represent reality—is becoming more
difficult and more important.


More difficult because data and their
sources are multiplying.


And more important because firms need to
get their data house in order to benefit
from AI, which they must to stay
competitive.


-- The Economist, February 2020
Data interoperability and quality, as well
as their structure, authenticity and
integrity are key for the exploitation of the
data value, especially in the context of AI
deployment


-- European Commission, “A European strategy
for data”, February 2020
(andrio/Shutterstock)
Source: http://veekaybee.github.io/2019/02/13/data-science-is-different/
Source: https://www.youtube.com/watch?v=06-AZXmwHjo
Bottlenecks
• Manual

• Difficulty in creating flexible reusable workflows

• Lack of transparency
Paul Groth."The Knowledge-Remixing Bottleneck," Intelligent
Systems, IEEE , vol.28, no.5, pp.44,48,  Sept.-Oct. 2013 doi:
10.1109/MIS.2013.138

Paul Groth, "Transparency and Reliability in the Data Supply
Chain," Internet Computing, IEEE, vol.17, no.2, pp.69,71,
March-April 2013 doi: 10.1109/MIC.2013.41
Debugging
Credits
Sebastian Schelter, UvA
Julia Stoyanovich, NYU
Stefan Grafberger, UvA
ML Pipelines in the Real World
Heterogeneous Datasources
1. Integration & Cleaning of Data (⋈, σ, π, ⋈)
2. Feature Encoding Pipelines & Data Augmentation
3. Model Training & Evaluation
The “last mile” of end-to-end ML
# Pipeline (not make_pipeline) is the scikit-learn constructor that takes named steps
Pipeline([
    ('encoding', ColumnTransformer([
        ('num', StandardScaler(), …),
        ('cat', OneHotEncoder(), …)])),
    ('learner', KerasClassifier(…))])
Data Representation Bugs
Schema Violations & Missing Data
Unsound Experimentation
The Way Forward
• First approach: invent new programming languages + runtime systems to regain control (e.g. SystemDS) -> would require rewriting all existing code

• Second approach: manually annotate and instrument existing code (mlflow) -> does not happen in practice

• Our approach: retrofit inspection techniques into the existing DS landscape
• Observation: a declarative specification of preprocessing operations is present in some popular ML libraries:

• Pandas mostly applies relational operations

• Estimator / Transformer pipelines (scikit-learn / SparkML / TensorFlow Transform) offer a nestable and composable way to declaratively specify feature transformations
Example
Can we find ways to automatically hint data scientists at potentially problematic operations in the preprocessing code of their ML pipelines?

Inspiration from software engineering, e.g. code inspection in modern IDEs
mlinspect
• Library to instrument ML preprocessing code with custom inspections

• Available on GitHub: https://github.com/stefan-grafberger/mlinspect
• Works with “native” preprocessing pipelines (no annotation / manual instrumentation required) in pandas / sklearn

• Representation of preprocessing operations based on a dataflow graph

• Allows users to implement inspections as user-defined functions which are automatically applied to the inputs and outputs of certain operations

• Allows for the propagation of annotations per record through the program
Grafberger, S., Groth, P., Stoyanovich, J., & Schelter, S. (2022). Data distribution debugging in machine learning pipelines. The VLDB Journal, 1-24.
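The idea of applying user-defined inspections to the inputs and outputs of an operation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `histogram`, `inspected` and `age_group_shift` and the toy rows are hypothetical, and the wrapper below is not mlinspect's actual API or instrumentation mechanism.

```python
from collections import Counter


def histogram(rows, column):
    """Return the relative frequency of each value in `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def inspected(operator, inspections):
    """Wrap an operator on row lists so every inspection sees its input and output."""
    def wrapper(rows):
        result = operator(rows)
        findings = {name: check(rows, result) for name, check in inspections.items()}
        return result, findings
    return wrapper


def age_group_shift(before, after):
    """Hypothetical inspection: largest absolute change in any age_group share."""
    h_in = histogram(before, "age_group")
    h_out = histogram(after, "age_group")
    groups = set(h_in) | set(h_out)
    return max(abs(h_out.get(g, 0.0) - h_in.get(g, 0.0)) for g in groups)


# Toy data (hypothetical): two age groups spread over two counties.
data = [
    {"age_group": 60, "county": "CountyA"},
    {"age_group": 60, "county": "CountyA"},
    {"age_group": 20, "county": "CountyA"},
    {"age_group": 60, "county": "CountyB"},
    {"age_group": 20, "county": "CountyB"},
    {"age_group": 20, "county": "CountyB"},
]

# The filter itself stays unchanged; the inspection is attached from the outside.
select_county_a = inspected(
    lambda rows: [r for r in rows if r["county"] == "CountyA"],
    {"age_group_shift": age_group_shift},
)
filtered, findings = select_county_a(data)
print(findings)
```

A real tool would attach such checks to every operator of the dataflow graph rather than to a single hand-wrapped function; the sketch only shows the shape of an inspection as a function of an operator's input and output.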
Example Inspections
• Change detection for the proportions of protected groups: compute histograms of operator outputs

Input (age_group 60 vs 20: 50% vs 50%)
age_group  county
60         CountyA
60         CountyA
20         CountyA
60         CountyB
20         CountyB
20         CountyB

data = data[data.county == "CountyA"]

Output (age_group 60 vs 20: 66% vs 33%)
age_group  county
60         CountyA
60         CountyA
20         CountyA

• Lineage tracking: generate identifier annotations for records and propagate them through operators

patient
ssn  smoke  lineage
123  Y      [p1]
456  N      [p2]
789  Y      [p3]

cost
ssn  cost  lineage
123  100   [c1]
789  200   [c2]

data = pd.merge(patient, cost, on="ssn")
ssn  smoke  cost  lineage
123  Y      100   [p1, c1]
789  Y      200   [p3, c2]

data = data[["smoke", "cost"]]
smoke  cost  lineage
Y      100   [p1, c1]
Y      200   [p3, c2]
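Lineage tracking as described on this slide can be sketched without any library: each record carries a list of identifiers, a join concatenates the lists of the matching records, and a projection passes them through unchanged. The helper names `annotate`, `join` and `project` are hypothetical and chosen for illustration; this is a sketch of the idea, not how mlinspect implements it.

```python
def annotate(rows, prefix):
    """Attach a lineage list such as ['p1'] to every row."""
    return [(row, [f"{prefix}{i}"]) for i, row in enumerate(rows, start=1)]


def join(left, right, key):
    """Inner join that concatenates the lineage of the matching rows."""
    index = {}
    for row, lineage in right:
        index.setdefault(row[key], []).append((row, lineage))
    return [
        ({**l_row, **r_row}, l_lineage + r_lineage)
        for l_row, l_lineage in left
        for r_row, r_lineage in index.get(l_row[key], [])
    ]


def project(rows, columns):
    """Keep only `columns`; the lineage annotation is carried along untouched."""
    return [({c: row[c] for c in columns}, lineage) for row, lineage in rows]


patient = annotate(
    [{"ssn": 123, "smoke": "Y"}, {"ssn": 456, "smoke": "N"}, {"ssn": 789, "smoke": "Y"}],
    "p",
)
cost = annotate([{"ssn": 123, "cost": 100}, {"ssn": 789, "cost": 200}], "c")

merged = join(patient, cost, "ssn")
result = project(merged, ["smoke", "cost"])

for row, lineage in result:
    print(row, lineage)
```

Even after the `ssn` column has been projected away, every output record can still be traced back to the patient and cost records it came from.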
Summary
• mlinspect is a general runtime for ML pipeline analysis, available on GitHub: https://github.com/stefan-grafberger/mlinspect
• Limitation: our approach relies on “declaratively” written ML pipelines, where we can identify the semantics of the operations
• Enables many use cases like ArgusEyes, a CI tool: https://github.com/schelterlabs/arguseyes
Curation
Credits
Prof. Elena Simperl, King’s College London
Dr. Laura Koesten, King’s College London / University of Vienna
Dr. Pavlos Vougiouklis, Huawei
Madelon Hulsebos, UvA / Sigma Computing
Çağatay Demiralp, Sigma Computing
What curation should data providers prioritise to facilitate reuse?
Lots of good advice
Editorial
Ten Simple Rules for the Care and Feeding of Scientific Data
Alyssa Goodman1, Alberto Pepe1*, Alexander W. Blocker1, Christine L. Borgman2, Kyle Cranmer3, Merce Crosas1, Rosanne Di Stefano1, Yolanda Gil4, Paul Groth5, Margaret Hedstrom6, David W. Hogg3, Vinay Kashyap1, Ashish Mahabal7, Aneta Siemiginowska1, Aleksandra Slavkovic8
1 Harvard University, Cambridge, Massachusetts, United States of America, 2 University of California, Los Angeles, Los Angeles, California, United States of America, 3 New
York University, New York, New York, United States of America, 4 University of Southern California, Los Angeles, Los Angeles, California, United States of America, 5 Vrije
Universiteit Amsterdam, Amsterdam, The Netherlands, 6 University of Michigan, Ann Arbor, Michigan, United States of America, 7 California Institute of Technology,
Pasadena, California, United States of America, 8 Pennsylvania State University, State College, Pennsylvania, United States of America
Introduction
In the early 1600s, Galileo Galilei turned a telescope toward Jupiter. In his log book each night, he drew to-scale schematic diagrams of Jupiter and some oddly moving points of light near it. Galileo labeled each drawing with the date. Eventually he used his observations to conclude that the Earth orbits the Sun, just as the four Galilean moons orbit Jupiter. History shows Galileo to be much more than an astronomical hero, though. His clear and careful record keeping and publication style not only let Galileo understand the solar system, they continue to let anyone understand how Galileo did it. Galileo’s notes directly integrated his data (drawings of Jupiter and its moons), key metadata (timing of each observation, weather, and telescope properties), and text (descriptions of methods, analysis, and conclusions). Critically, when Galileo included the information from those notes in Sidereus Nuncius [1], this integration of text, data, and metadata was preserved, as shown in Figure 1. Galileo’s work advanced the “Scientific Revolution,” and his approach to observation and analysis contributed significantly to the shaping of today’s modern “scientific method” [2,3].
Today, most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is, unlike Galileo’s report in Sidereus Nuncius, the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify a study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works.
Complicating the modern situation, the words “data” and “analysis” have a wider variety of definitions today than at the time of Galileo. Theoretical investigations can create large “data” sets through simulations (e.g., The Millennium Simulation Project: http://www.mpa-garching.mpg.de/galform/virgo/millennium/). Large-scale data collection often takes place as a community-wide effort (e.g., The Human Genome project: http://www.genome.gov/10001772), which leads to gigantic online “databases” (organized collections of data). Computers are so essential in simulations, and in the processing of experimental and observational data, that it is also often hard to draw a dividing line between “data” and “analysis” (or “code”) when discussing the care and feeding of “data.” Sometimes, a copy of the code used to create or process data is so essential to the use of those data that the code should almost be thought of as part of the “metadata” description of the data. Other times, the code used in a scientific study is more separable from the data, but even then, many preservation and sharing principles apply to code just as well as they do to data.
So how do we go about caring for and feeding data? Extra work, no doubt, is associated with nurturing your data, but care up front will save time and increase insight later. Even though a growing number of researchers, especially in large collaborations, know that conducting research with sharing and reuse in mind is essential, it still requires a paradigm shift. Most people are still motivated by piling up publications and by getting to the next one as soon as possible. But, the more we scientists find ourselves wishing we had access to extant but now unfindable data [4], the more we will realize why bad data management is bad for science. How can we improve?
This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to “care for and feed” data, with some practical advice on how to do that. The final section at the close of this work (Links to Useful Resources) offers links to the types of services referred to throughout the text. Boldface lettering below highlights actions one can take to follow the suggested rules.
Rule 1. Love Your Data, and Help Others Love It, Too
Data management is a repeat-play
game. If you take care to make your data
Citation: Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, et al. (2014) Ten Simple Rules for the Care
and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
Published April 24, 2014
Copyright: © 2014 Goodman et al. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Funding: The authors received no specific funding for writing this manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: alberto.pepe@gmail.com
Editor: Philip E. Bourne, University of California San Diego, United States of America
PLOS Computational Biology | www.ploscompbiol.org 1 April 2014 | Volume 10 | Issue 4 | e1003542
Article
Dataset Reuse: Toward Translating
Principles to Practice
Laura Koesten,1,* Pavlos Vougiouklis,2 Elena Simperl,1 and Paul Groth3,4,*
1King’s College London, London WC2B 4BG, UK
2Huawei Technologies, Edinburgh EH9 3BF, UK
3University of Amsterdam, Amsterdam 1090 GH, the Netherlands
4Lead Contact
*Correspondence: laura.koesten@kcl.ac.uk (L.K.), p.groth@uva.nl (P.G.)
https://doi.org/10.1016/j.patter.2020.100136
SUMMARY
The web provides access to millions of datasets that can have additional impact when used beyond their
original context. We have little empirical insight into what makes a dataset more reusable than others and
which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential
reuse features through a literature review and present a case study on datasets on GitHub, a popular open
platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over
65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to
reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s
reusability. This demonstrates the practical gap between principles and actionable insights that allow
data publishers and tools designers to implement functionalities that provably facilitate reuse.
1 INTRODUCTION
There has been a gradual shift in the last years from viewing datasets as byproducts of (digital) work to critical assets, whose value increases the more they are used [1,2]. However, our understanding of how this value emerges, and of the factors that demonstrably affect the reusability of a dataset is still limited.
Using a dataset beyond the context where it originated remains challenging for a variety of socio-technical reasons, which have been discussed in the literature [3,4]; the bottom line is that simply making data available, even when complying with existing guidance and best practices, does not mean it can be easily used by others [5].
At the same time, making data reusable to a diverse audience, in terms of domain, skill sets, and purposes, is an important way to realize its potential value (and recover some of the, sometimes considerable, resources invested in policy and infrastructure support). This is one of the reasons why scientific journals and research-funding organizations are increasingly calling for further data sharing [6] or why industry bodies, such as the International Data Spaces Association (IDSA) (https://www.internationaldataspaces.org/) are investing in reference architectures to smooth data flows from one business to another.
There is plenty of advice on how to make data easier to reuse, including technical standards, legal frameworks, and guidelines. Much work places focus on machine readability
THE BIGGER PICTURE The web provides access to millions of datasets. These data can have additional
impact when it is used beyond the context for which it was originally created. We have little empirical insight
into what makes a dataset more reusable than others, and which of the existing guidelines and frameworks,
if any, make a difference. In this paper, we explore potential reuse features through a literature review and
present a case study on datasets on GitHub, a popular open platform for sharing code and data. We
describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement
metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an
initial model, using deep neural networks, to predict a dataset’s reusability. This work demonstrates the
practical gap between principles and actionable insights that allow data publishers and tools designers
to implement functionalities that provably facilitate reuse.
Proof-of-Concept: Data science output has been formulated,
implemented, and tested for one domain/problem
Patterns 1, 100136, November 13, 2020 © 2020 The Author(s).
OPEN ACCESS
Lots of good advice
• Maybe a bit too much….

• Currently, 140 policies on fairsharing.org as
of April 5, 2021

• We reviewed 40 papers

• Cataloged 39 different features of datasets that enable data reuse
Where should a data provider start?
• Lots of good advice!

• It would be great to do all these things

• But it’s all a bit overwhelming

• Can we help prioritize?
Getting some data
• Used GitHub as a case study

• ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos

• Use engagement metrics as proxies for data reuse

• Map literature features to both dataset and repository features

• Train a predictive model to see which features are good predictors
Dataset Features
Missing values
Size
Columns + Rows
Readme features
Issue features
Age
Description
Parsable
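A few of the dataset-level features listed above (size, columns and rows, missing values, parsability) can be computed directly from a CSV file. The sketch below uses only the Python standard library; the function name `dataset_features`, the feature names, and the definition of "parsable" as "the `csv` module reads it without error and it is non-empty" are assumptions made for illustration, not the feature definitions used in the study.

```python
import csv
import io


def dataset_features(text):
    """Compute simple descriptive features for CSV content given as a string."""
    features = {"size_bytes": len(text.encode("utf-8")), "parsable": False}
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        return features
    if not rows:
        return features

    header, body = rows[0], rows[1:]
    cells = [cell for row in body for cell in row]
    missing = sum(1 for cell in cells if cell.strip() == "")
    features.update(
        parsable=True,
        columns=len(header),
        rows=len(body),
        # Share of empty cells among all data cells (header excluded).
        missing_ratio=missing / len(cells) if cells else 0.0,
    )
    return features


# Hypothetical sample with two empty cells out of six.
sample = "age_group,county\n60,CountyA\n20,\n,CountyB\n"
features = dataset_features(sample)
print(features)
```

Repository-level features such as README or issue properties would need the hosting platform's metadata and are left out of this sketch.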
Where to start?
• Some ideas from this study if you’re publishing data with GitHub:

• provide an informative short textual summary of the dataset

• provide a comprehensive README file in a structured form and links to further information

• datasets should not exceed standard processable file sizes

• datasets should be possible to open with a standard configuration of a common library (such as Pandas)

Trained a Recurrent Neural Network. There might be better models, but it is useful for handling text. It is not the greatest predictor (good at classifying non-reuse) but still useful for helping us tease out features.
Can we help automate curation?
Madelon Hulsebos
https://madelonhulsebos.github.io
Example: semantic column type detection
Sherlock [Hulsebos et al., KDD, 2019]


DL method for semantic data type detection of table columns


https://github.com/mitmedialab/sherlock-project
Need for a new corpus
• Database-like table content and structure (semantics, data types, size).

• Large-scale to facilitate table representation models. 

• Broad coverage to generalize to a diversity of domains. 

• Table semantics (e.g. column types).
CSVs from GitHub
https://gittables.github.io
Tools to improve data supply chains
Groth, Paul, "Transparency and Reliability in the Data Supply Chain,"
Internet Computing, IEEE, vol.17, no.2, pp.69,71, March-April 2013 doi:
10.1109/MIC.2013.41
Conclusion
• AI is data centric

• Need tools that help users debug and curate their data for ML

• Way forward: Conversation between ML, DB, and HCI research

• We are hiring :-)
Paul Groth | @pgroth | pgroth.com | indelab.org

Mais conteúdo relacionado

Mais procurados

Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning pyingkodi maran
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMsLoic Merckel
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsMichał Łopuszyński
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with PythonDavis David
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
Sales Analytics Using Power BI
Sales Analytics Using Power BISales Analytics Using Power BI
Sales Analytics Using Power BINetwoven Inc.
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualizationDr. Hamdan Al-Sabri
 
LLM Healthcare.pdf
LLM Healthcare.pdfLLM Healthcare.pdf
LLM Healthcare.pdfATPowr
 
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan HamedUplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan HamedRising Media Ltd.
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMsSylvainGugger
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 
Google BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A reviewGoogle BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A reviewDR. Ram Kumar Pathak
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecasesSreenatha Reddy K R
 
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...Krishnaram Kenthapadi
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 

Mais procurados (20)

Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMs
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining Projects
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reduction
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Sales Analytics Using Power BI
Sales Analytics Using Power BISales Analytics Using Power BI
Sales Analytics Using Power BI
 
Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
LLM Healthcare.pdf
LLM Healthcare.pdfLLM Healthcare.pdf
LLM Healthcare.pdf
 
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan HamedUplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 
Generative models
Generative modelsGenerative models
Generative models
 
Google BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A reviewGoogle BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A review
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
Fairness-aware Machine Learning: Practical Challenges and Lessons Learned (WS...
 
Machine learning
Machine learningMachine learning
Machine learning
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 

Semelhante a Data Curation and Debugging for Data Centric AI

Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reusevoginip
 
Building Effective Visualization Shiny WVF
Building Effective Visualization Shiny WVFBuilding Effective Visualization Shiny WVF
Building Effective Visualization Shiny WVFOlga Scrivner
 
Data science innovations
Data science innovations Data science innovations
Data science innovations suresh sood
 
Massive-Scale Analytics Applied to Real-World Problems
Massive-Scale Analytics Applied to Real-World ProblemsMassive-Scale Analytics Applied to Real-World Problems
Massive-Scale Analytics Applied to Real-World Problemsinside-BigData.com
 
Interactive Visualization Systems and Data Integration Methods for Supporting...
Interactive Visualization Systems and Data Integration Methods for Supporting...Interactive Visualization Systems and Data Integration Methods for Supporting...
Interactive Visualization Systems and Data Integration Methods for Supporting...Don Pellegrino
 
Data science Innovations January 2018
Data science Innovations January 2018Data science Innovations January 2018
Data science Innovations January 2018suresh sood
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksCarole Goble
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridIan Foster
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)Michael Atkins
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8Scott Edmunds
 
Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015Keynote speech - Carole Goble - Jisc Digital Festival 2015
Keynote speech - Carole Goble - Jisc Digital Festival 2015Jisc
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsCarole Goble
 
Data Science Innovations : Democratisation of Data and Data Science
Data Science Innovations : Democratisation of Data and Data Science  Data Science Innovations : Democratisation of Data and Data Science
Data Science Innovations : Democratisation of Data and Data Science suresh sood
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overviewdgarijo
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Visualisation - techniques, interaction dynamics, big data
Visualisation - techniques, interaction dynamics, big dataVisualisation - techniques, interaction dynamics, big data
Visualisation - techniques, interaction dynamics, big dataJoris Klerkx
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 

Semelhante a Data Curation and Debugging for Data Centric AI (20)

Minimal viable data reuse
Minimal viable data reuseMinimal viable data reuse
Minimal viable data reuse
 
Building Effective Visualization Shiny WVF
Building Effective Visualization Shiny WVFBuilding Effective Visualization Shiny WVF
Building Effective Visualization Shiny WVF
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Data science innovations
Data science innovations Data science innovations
Data science innovations
 
Massive-Scale Analytics Applied to Real-World Problems
Massive-Scale Analytics Applied to Real-World ProblemsMassive-Scale Analytics Applied to Real-World Problems
Massive-Scale Analytics Applied to Real-World Problems
 
Interactive Visualization Systems and Data Integration Methods for Supporting...
Interactive Visualization Systems and Data Integration Methods for Supporting...Interactive Visualization Systems and Data Integration Methods for Supporting...
Interactive Visualization Systems and Data Integration Methods for Supporting...
 
Data Curation and Debugging for Data Centric AI

  • 1. Data Curation and Debugging for Data Centric AI Prof. Paul Groth | @pgroth | pgroth.com | indelab.org Thanks to Stefan Grafberger, Dr. Julia Stoyanovich, Dr. Sebastian Schelter, Dr. Laura Koesten, Prof. Elena Simperl, Dr. Pavlos Vougiouklis, Madelon Hulsebos, Dr. Çağatay Demiralp, Dr. Juan Sequeda, Prof. George Fletcher DBML - May 8, 2022
  • 2. The making of data is important
  • 3. Finding digital truth—that is, identifying and combining data that accurately represent reality—is becoming more difficult and more important. More difficult because data and their sources are multiplying. And more important because firms need to get their data house in order to benefit from AI, which they must to stay competitive. -- The Economist, February 2020
  • 4. Data interoperability and quality, as well as their structure, authenticity and integrity are key for the exploitation of the data value, especially in the context of AI deployment -- European Commission, “A European strategy for data”, February 2020 (andrio/Shutterstock)
  • 7. Bottlenecks • Manual • Difficulty in creating flexible reusable workflows • Lack of transparency. Paul Groth, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013, doi: 10.1109/MIS.2013.138. Paul Groth, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013, doi: 10.1109/MIC.2013.41
  • 9. ML Pipelines in the Real World. Heterogeneous data sources (⋈ σ π ⋈) feed three pipeline stages (diagram markers 1, 2, 3): Integration & Cleaning of Data; Feature Encoding Pipelines & Data Augmentation; Model Training & Evaluation. The “last mile” of end-to-end ML: make_pipeline([('encoding', ColumnTransformer([('num', StandardScaler, …), ('cat', OneHotEncoder, …)])), ('learner', KerasClassifier(…))])
  • 10. ML Pipelines in the Real World (same diagram), adding: Data Representation Bugs
  • 11. ML Pipelines in the Real World (same diagram), adding: Schema Violations & Missing Data
  • 12. ML Pipelines in the Real World (same diagram), adding: Unsound Experimentation
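The "last mile" snippet on the slide is schematic. A minimal runnable sketch of the same shape follows, under stated assumptions: the toy data and column names are hypothetical, `LogisticRegression` stands in for the slide's `KerasClassifier(…)` to avoid an extra dependency, and `Pipeline` is used because scikit-learn's `make_pipeline` takes unnamed steps rather than `(name, estimator)` tuples.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
train = pd.DataFrame({
    "age": [23, 45, 31, 60, 52, 38],
    "income": [30.0, 80.0, 45.0, 70.0, 95.0, 50.0],
    "county": ["A", "B", "A", "B", "A", "B"],
    "label": [0, 1, 0, 1, 1, 0],
})

# Declarative feature encoding: scale numeric columns, one-hot encode categorical ones.
encoding = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["county"]),
])

# Named steps, as on the slide; the learner is a stand-in (assumption).
pipeline = Pipeline([
    ("encoding", encoding),
    ("learner", LogisticRegression()),
])

features = train.drop(columns="label")
pipeline.fit(features, train["label"])
predictions = pipeline.predict(features)
```

Because every step is a named, declared transformation, a tool can walk the pipeline and know what each operation does without the author annotating anything.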
  • 13. The Way Forward • First approach: invent new programming languages and runtime systems to regain control (e.g. SystemDS) -> would require rewriting all existing code • Second approach: manually annotate and instrument existing code (MLflow) -> does not happen in practice • Our approach: retrofit inspection techniques into the existing data science landscape • Observation: some popular ML libraries already offer a declarative specification of preprocessing operations: • pandas mostly applies relational operations • Estimator / Transformer pipelines (scikit-learn / SparkML / TensorFlow Transform) offer a nestable and composable way to declaratively specify feature transformations
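The observation that pandas code mostly applies relational operations can be made concrete with a short sketch. The tables and column names below are hypothetical; the point is only that each statement corresponds to a join (⋈), a selection (σ), or a projection (π), which is what makes the semantics recoverable by an inspection tool.

```python
import pandas as pd

# Hypothetical input tables.
patients = pd.DataFrame({
    "ssn": [123, 456, 789],
    "county": ["CountyA", "CountyB", "CountyA"],
})
costs = pd.DataFrame({"ssn": [123, 789], "cost": [100, 200]})

joined = pd.merge(patients, costs, on="ssn")       # join (⋈) on the key column
selected = joined[joined["county"] == "CountyA"]   # selection (σ) by predicate
projected = selected[["county", "cost"]]           # projection (π) onto two columns
```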
  • 14. Example. Can we find ways to automatically point data scientists to potentially problematic operations in the preprocessing code of their ML pipelines? Inspiration from software engineering, e.g. code inspection in modern IDEs
  • 16. mlinspect • Library to instrument ML preprocessing code with custom inspections • Available on GitHub: https://github.com/stefan-grafberger/mlinspect • Works with “native” preprocessing pipelines (no annotation / manual instrumentation required) in pandas / sklearn • Representation of preprocessing operations based on a dataflow graph • Allows users to implement inspections as user-defined functions which are automatically applied to the inputs and outputs of certain operations • Allows for the propagation of annotations per record through the program. Grafberger, S., Groth, P., Stoyanovich, J., & Schelter, S. (2022). Data distribution debugging in machine learning pipelines. The VLDB Journal, 1-24.
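The idea of inspections as user-defined functions applied to operator inputs and outputs can be sketched in a few lines. This is not mlinspect's API: `run_with_inspections` and `row_count_inspection` are hypothetical names, and the operators are listed explicitly here, whereas the slide describes mlinspect as instrumenting native pandas / sklearn code without manual annotation.

```python
from typing import Callable, Dict, List, Tuple

import pandas as pd

Operator = Tuple[str, Callable[[pd.DataFrame], pd.DataFrame]]
Inspection = Callable[[str, pd.DataFrame, pd.DataFrame], Dict]


def row_count_inspection(name: str, before: pd.DataFrame, after: pd.DataFrame) -> Dict:
    """Hypothetical inspection: record how many rows an operator consumes and emits."""
    return {"operator": name, "rows_in": len(before), "rows_out": len(after)}


def run_with_inspections(
    data: pd.DataFrame,
    operators: List[Operator],
    inspections: List[Inspection],
) -> Tuple[pd.DataFrame, List[Dict]]:
    """Apply operators in order; call every inspection on each operator's input and output."""
    findings = []
    for name, operator in operators:
        output = operator(data)
        for inspection in inspections:
            findings.append(inspection(name, data, output))
        data = output
    return data, findings


raw = pd.DataFrame({"age": [17, 30, None, 45]})
operators = [
    ("dropna", lambda df: df.dropna()),
    ("select_adults", lambda df: df[df["age"] >= 18]),
]
result, findings = run_with_inspections(raw, operators, [row_count_inspection])
```

Keeping the inspection separate from the operators means new checks can be added without touching the pipeline code itself.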
  • 17. Example Inspections
    • Change detection for the proportions of protected groups: compute histograms of operator outputs
      Input (age_group 60 vs 20: 50% vs 50%):
        age_group  county
        60         CountyA
        60         CountyA
        20         CountyA
        60         CountyB
        20         CountyB
        20         CountyB
      Operation: data = data[data.county == "CountyA"]
      Output (age_group 60 vs 20: 66% vs 33%):
        age_group  county
        60         CountyA
        60         CountyA
        20         CountyA
    • Lineage tracking: generate identifier annotations for records and propagate them through operators
      patient:
        ssn  smoke  annotation
        123  Y      [p1]
        456  N      [p2]
        789  Y      [p3]
      cost:
        ssn  cost  annotation
        123  100   [c1]
        789  200   [c2]
      Operation: data = pd.merge(patient, cost, on="ssn")
        ssn  smoke  cost  annotation
        123  Y      100   [p1, c1]
        789  Y      200   [p3, c2]
      Operation: data = data[["smoke", "cost"]]
        smoke  cost  annotation
        Y      100   [p1, c1]
        Y      200   [p3, c2]
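Both example inspections can be approximated in plain pandas. The sketch below uses toy tables modelled on the slide; `group_proportions` and the `lineage_*` columns are hypothetical helpers for illustration, not how mlinspect implements histograms or lineage internally.

```python
import pandas as pd


def group_proportions(frame: pd.DataFrame, column: str) -> dict:
    """Share of each value in a column, i.e. a normalised histogram."""
    return frame[column].value_counts(normalize=True).sort_index().to_dict()


# Change detection: compare group proportions before and after a filter.
data = pd.DataFrame({
    "age_group": [60, 60, 20, 60, 20, 20],
    "county": ["CountyA", "CountyA", "CountyA", "CountyB", "CountyB", "CountyB"],
})
before = group_proportions(data, "age_group")
filtered = data[data["county"] == "CountyA"]
after = group_proportions(filtered, "age_group")

# Lineage tracking: attach a record identifier to each input row and
# carry the identifiers through the join and the projection.
patient = pd.DataFrame({"ssn": [123, 456, 789], "smoke": ["Y", "N", "Y"]})
cost = pd.DataFrame({"ssn": [123, 789], "cost": [100, 200]})
patient["lineage_patient"] = ["p1", "p2", "p3"]
cost["lineage_cost"] = ["c1", "c2"]

merged = pd.merge(patient, cost, on="ssn")
merged["lineage"] = [
    [p, c] for p, c in zip(merged["lineage_patient"], merged["lineage_cost"])
]
projected = merged[["smoke", "cost", "lineage"]]
```

Here the filter shifts the 60-year-old group from half of the records to two thirds, the kind of change a histogram inspection would flag, while the lineage column still identifies which input records produced each output row after the projection drops `ssn`.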
  • 18. Summary • mlinspect is a general runtime for ML pipeline analysis, available on GitHub: https://github.com/stefan-grafberger/mlinspect • Limitation: our approach relies on “declaratively” written ML pipelines, where we can identify the semantics of the operations • Enables many use cases, such as ArgusEyes, a CI tool: https://github.com/schelterlabs/arguseyes
  • 19. Curation. Credits: Prof. Elena Simperl (King’s College London); Dr. Laura Koesten (King’s College London / University of Vienna); Dr. Pavlos Vougiouklis (Huawei); Madelon Hulsebos (UvA, Sigma Computing); Çağatay Demiralp (Sigma Computing)
  • 20. What curation should data providers prioritise to facilitate reuse?
  • 21. Lots of good advice. Shown on the slide: first page of the editorial "Ten Simple Rules for the Care and Feeding of Scientific Data".
    Authors: Alyssa Goodman (1), Alberto Pepe (1,*), Alexander W. Blocker (1), Christine L. Borgman (2), Kyle Cranmer (3), Merce Crosas (1), Rosanne Di Stefano (1), Yolanda Gil (4), Paul Groth (5), Margaret Hedstrom (6), David W. Hogg (3), Vinay Kashyap (1), Ashish Mahabal (7), Aneta Siemiginowska (1), Aleksandra Slavkovic (8)
    Affiliations: (1) Harvard University, Cambridge, Massachusetts, United States of America; (2) University of California, Los Angeles, Los Angeles, California, United States of America; (3) New York University, New York, New York, United States of America; (4) University of Southern California, Los Angeles, Los Angeles, California, United States of America; (5) Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; (6) University of Michigan, Ann Arbor, Michigan, United States of America; (7) California Institute of Technology, Pasadena, California, United States of America; (8) Pennsylvania State University, State College, Pennsylvania, United States of America
    Introduction
    In the early 1600s, Galileo Galilei turned a telescope toward Jupiter. In his log book each night, he drew to-scale schematic diagrams of Jupiter and some oddly moving points of light near it. Galileo labeled each drawing with the date. Eventually he used his observations to conclude that the Earth orbits the Sun, just as the four Galilean moons orbit Jupiter. History shows Galileo to be much more than an astronomical hero, though. His clear and careful record keeping and publication style not only let Galileo understand the solar system, they continue to let anyone understand how Galileo did it. Galileo’s notes directly integrated his data (drawings of Jupiter and its moons), key metadata (timing of each observation, weather, and telescope properties), and text (descriptions of methods, analysis, and conclusions). Critically, when Galileo included the information from those notes in Sidereus Nuncius [1], this integration of text, data, and metadata was preserved, as shown in Figure 1. Galileo’s work advanced the “Scientific Revolution,” and his approach to observation and analysis contributed significantly to the shaping of today’s modern “scientific method” [2,3].
    Today, most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is, unlike Galileo’s report in Sidereus Nuncius, the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify a study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works.
    Complicating the modern situation, the words “data” and “analysis” have a wider variety of definitions today than at the time of Galileo. Theoretical investigations can create large “data” sets through simulations (e.g., The Millennium Simulation Project: http://www.mpa-garching.mpg.de/galform/virgo/millennium/). Large-scale data collection often takes place as a community-wide effort (e.g., The Human Genome project: http://www.genome.gov/10001772), which leads to gigantic online “databases” (organized collections of data). Computers are so essential in simulations, and in the processing of experimental and observational data, that it is also often hard to draw a dividing line between “data” and “analysis” (or “code”) when discussing the care and feeding of “data.” Sometimes, a copy of the code used to create or process data is so essential to the use of those data that the code should almost be thought of as part of the “metadata” description of the data. Other times, the code used in a scientific study is more separable from the data, but even then, many preservation and sharing principles apply to code just as well as they do to data.
    So how do we go about caring for and feeding data? Extra work, no doubt, is associated with nurturing your data, but care up front will save time and increase insight later. Even though a growing number of researchers, especially in large collaborations, know that conducting research with sharing and reuse in mind is essential, it still requires a paradigm shift. Most people are still motivated by piling up publications and by getting to the next one as soon as possible. But, the more we scientists find ourselves wishing we had access to extant but now unfindable data [4], the more we will realize why bad data management is bad for science. How can we improve?
    This article offers a short guide to the steps scientists can take to ensure that their data and associated analyses continue to be of value and to be recognized. In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more, but our goal here is not to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to “care for and feed” data, with some practical advice on how to do that. The final section at the close of this work (Links to Useful Resources) offers links to the types of services referred to throughout the text. Boldface lettering below highlights actions one can take to follow the suggested rules.
    Rule 1. Love Your Data, and Help Others Love It, Too
    Data management is a repeat-play game. If you take care to make your data …
    Citation: Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, et al. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
    Published April 24, 2014
    Copyright: © 2014 Goodman et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
    Funding: The authors received no specific funding for writing this manuscript.
    Competing Interests: The authors have declared that no competing interests exist.
    * E-mail: alberto.pepe@gmail.com
    Editor: Philip E. Bourne, University of California San Diego, United States of America
    PLOS Computational Biology | www.ploscompbiol.org | April 2014 | Volume 10 | Issue 4 | e1003542
  • 22. Article Dataset Reuse: Toward Translating Principles to Practice Laura Koesten,1,* Pavlos Vougiouklis,2 Elena Simperl,1 and Paul Groth3,4,* 1King’s College London, London WC2B 4BG, UK 2Huawei Technologies, Edinburgh EH9 3BF, UK 3University of Amsterdam, Amsterdam 1090 GH, the Netherlands 4Lead Contact *Correspondence: laura.koesten@kcl.ac.uk (L.K.), p.groth@uva.nl (P.G.) https://doi.org/10.1016/j.patter.2020.100136 SUMMARY The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a data- set’s reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse. 1 INTRODUCTION There has been a gradual shift in the last years from viewing da- tasets as byproducts of (digital) work to critical assets, whose value increases the more they are used.1,2 However, our under- standing of how this value emerges, and of the factors that demonstrably affect the reusability of a dataset is still limited. 
Using a dataset beyond the context where it originated remains challenging for a variety of socio-technical reasons, which have been discussed in the literature;3,4 the bottom line is that simply making data available, even when complying with existing guidance and best practices, does not mean it can be easily used by others.5 At the same time, making data reusable to a diverse audience, in terms of domain, skill sets, and purposes, is an important way to realize its potential value (and recover some of the, sometimes considerable, resources invested in policy and infrastructure support). This is one of the reasons why scientific journals and research-funding organizations are increasingly calling for further data sharing6 or why industry bodies, such as the International Data Spaces Association (IDSA) (https://www.internationaldataspaces.org/) are investing in reference architectures to smooth data flows from one business to another. There is plenty of advice on how to make data easier to reuse, including technical standards, legal frameworks, and guidelines. Much work places focus on machine readability THE BIGGER PICTURE The web provides access to millions of datasets. These data can have additional impact when it is used beyond the context for which it was originally created. We have little empirical insight into what makes a dataset more reusable than others, and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s reusability. 
This work demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse. Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem Patterns 1, 100136, November 13, 2020 © 2020 The Author(s). OPEN ACCESS Lots of good advice • Maybe a bit too much… • Currently, 140 policies on fairsharing.org as of April 5, 2021 • We reviewed 40 papers • Cataloged 39 different features of datasets that enable data reuse
  • 23. Where should a data provider start? • Lots of good advice! • It would be great to do all these things • But it’s all a bit overwhelming • Can we help prioritize?
  • 24. Getting some data • Used GitHub as a case study • ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos • Use engagement metrics as proxies for data reuse • Map literature features to both dataset and repository features • Train a predictive model to see which features are good predictors
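The idea of using engagement metrics as a proxy for reuse can be sketched as a simple labeling step. This is only an illustration: the slide does not say which metrics or thresholds were used, so the `RepoEngagement` fields, the choice of forks as the signal, the `min_forks` threshold, and the two repositories below are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class RepoEngagement:
    """Engagement counts for one repository (field names are illustrative)."""
    name: str
    stars: int
    forks: int
    watchers: int


def reuse_proxy_label(repo: RepoEngagement, min_forks: int = 1) -> int:
    """Return 1 if the repository counts as 'reused' under the proxy, else 0.

    Assumption: at least `min_forks` forks is read as a sign of reuse.
    Both the metric and the threshold are arbitrary choices for this sketch.
    """
    return 1 if repo.forks >= min_forks else 0


# Two hypothetical repositories, invented for the example.
repos = [
    RepoEngagement("a/example-tables", stars=12, forks=3, watchers=2),
    RepoEngagement("b/scratch-data", stars=0, forks=0, watchers=1),
]
labels = {repo.name: reuse_proxy_label(repo) for repo in repos}
print(labels)
```

Labels like these could then serve as the target of a predictive model, with dataset and repository features as inputs.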
  • 25. Dataset Features • Missing values • Size • Columns + Rows • README features • Issue features • Age • Description • Parsable
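Some of the table-level features listed above (missing values, size, columns and rows) are easy to compute directly. The following is a minimal sketch using only the Python standard library. The function name `table_features`, the output keys, and the rule that an empty cell counts as missing are assumptions made for illustration, not the definitions used in the study.

```python
import csv
import io


def table_features(csv_text: str) -> dict:
    """Compute simple table-level features from CSV text.

    Assumption: the first row is a header, and a cell that is empty after
    stripping whitespace counts as a missing value.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {"size_bytes": 0, "n_columns": 0, "n_rows": 0, "missing_ratio": 0.0}
    header, body = rows[0], rows[1:]
    n_cells = sum(len(row) for row in body)
    n_missing = sum(1 for row in body for cell in row if cell.strip() == "")
    return {
        "size_bytes": len(csv_text.encode("utf-8")),
        "n_columns": len(header),
        "n_rows": len(body),
        "missing_ratio": n_missing / n_cells if n_cells else 0.0,
    }


# Toy table with two empty cells out of nine.
sample = "id,name,value\n1,a,\n2,,5\n3,c,7\n"
features = table_features(sample)
print(features)
```

README, issue, age, and description features would come from repository metadata rather than from the table itself, so they are left out of this sketch.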
  • 26. Where to start? • Some ideas from this study if you’re publishing data with GitHub • provide an informative short textual summary of the dataset 
 • provide a comprehensive README file in a structured form and links to further information 
 • datasets should not exceed standard processable file sizes 
 • datasets should be possible to open with a standard configuration of a common library (such as Pandas)
 Trained a Recurrent Neural Network. There might be better models, but it is useful for handling text. Not the greatest predictor (good at classifying datasets that are not reused) but still useful for helping us tease out features
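The four recommendations above could be turned into a small pre-publication check. The sketch below assumes pandas is installed and makes several choices the slides do not specify: the 100 MB size limit, treating Markdown headings as "structured form", and the function names `opens_with_defaults`, `readme_report`, and `publishing_checks` are all hypothetical.

```python
import os
import re
import tempfile

import pandas as pd

MAX_BYTES = 100 * 1024 * 1024  # assumed limit; the slides give no number


def opens_with_defaults(path: str) -> bool:
    """True if pandas can read the file with its default settings."""
    try:
        pd.read_csv(path)
        return True
    except Exception:
        return False


def readme_report(text: str) -> dict:
    """Rough README signals: Markdown headings and at least one link."""
    return {
        "readme_has_headings": bool(re.search(r"^#{1,6}\s+\S", text, flags=re.MULTILINE)),
        "readme_has_links": bool(re.search(r"https?://\S+", text)),
    }


def publishing_checks(data_path: str, readme_text: str, description: str) -> dict:
    """Combine the four checks into one report of booleans."""
    report = {
        "has_description": bool(description.strip()),
        "size_ok": os.path.getsize(data_path) <= MAX_BYTES,
        "opens_with_pandas": opens_with_defaults(data_path),
    }
    report.update(readme_report(readme_text))
    return report


# Toy usage on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.csv")
    with open(data_path, "w", encoding="utf-8") as f:
        f.write("a,b\n1,2\n")
    report = publishing_checks(
        data_path,
        readme_text="# Data\n\nSee https://example.org/docs for details.\n",
        description="Toy table with two columns.",
    )
print(report)
```

A check like this only flags surface problems. It says nothing about whether the summary or README is actually informative.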
  • 27. Can we help automate curation?
  • 29. Example: semantic column type detection Sherlock [Hulsebos et al., KDD, 2019] DL method for semantic data type detection of table columns https://github.com/mitmedialab/sherlock-project
  • 30.
  • 31. Need for a new corpus • Database-like table content and structure (semantics, data types, size). • Large-scale to facilitate table representation models. • Broad coverage to generalize to a diversity of domains. • Table semantics (e.g. column types).
  • 33.
  • 34.
  • 35. Tools to improve data supply chains Groth, Paul, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol.17, no.2, pp.69,71, March-April 2013 doi: 10.1109/MIC.2013.41
  • 36. Conclusion • AI is data centric • Need tools that help users debug and curate their data for ML • Way forward: Conversation between ML, DB, and HCI research • We are hiring :-) Paul Groth | @pgroth | pgroth.com | indelab.org