Dunning - SIGMOD - Data Economy.pptx

Ted Dunning, Software Engineer at MapR Technologies
FROM ROOTS TO FRUITS: EXPLORING LINEAGE FOR DATASET RECOMMENDATIONS
Ted Dunning, Fellow, HPE
18 June, 2023
the meaning of words lies in their use
the meaning of data lies in its use
(apologies to Dr. Wittgenstein)
A meteorologist’s data
- rainfall
- windspeed
- temperature
A business uses the data to predict umbrella sales
What does the data actually mean?
the meaning of data lies in its use
TRAINING PROCESS
[Diagram: README, URL, History, Datasets + Models, Metadata]
We start with explicit metadata. Examples: column and table names, documentation, common values, and others.
This is encoded as a large artifact × characteristic incidence table.
At this point, direct metadata search is possible.
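A minimal sketch of such an incidence table and a direct metadata search over it. The artifact names and metadata terms are hypothetical, and the term-overlap ranking is an assumption rather than the method used in the talk.

```python
from collections import defaultdict

# Hypothetical artifacts and their explicit metadata terms (illustrative only).
artifacts = {
    "climate_monthly.csv": {"rainfall", "windspeed", "temperature"},
    "dengue_monthly.csv": {"dengue", "cases", "month"},
    "umbrella_sales.csv": {"umbrella", "sales", "month"},
}

def build_incidence(artifacts):
    """Sparse incidence table: metadata term -> set of artifacts containing it."""
    table = defaultdict(set)
    for name, terms in artifacts.items():
        for term in terms:
            table[term].add(name)
    return table

def search(table, query_terms):
    """Rank artifacts by the number of query terms they contain (ties by name)."""
    scores = defaultdict(int)
    for term in query_terms:
        for name in table.get(term, ()):
            scores[name] += 1
    return sorted(scores, key=lambda n: (-scores[n], n))

incidence = build_incidence(artifacts)
print(search(incidence, ["month", "dengue"]))
```

Storing the table as term-to-artifacts sets keeps it sparse, which matters when the table is large.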
We augment with metadata from all ancestors and descendants in the global data lineage graph.
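The augmentation step can be sketched as a graph traversal. The lineage edges and per-artifact metadata below are assumptions made for illustration (the file names are borrowed from the example later in the deck, but their actual lineage is not given).

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (parent -> child) and explicit metadata.
edges = [
    ("climate_monthly.csv", "SARIMA_model"),
    ("dengue_monthly.csv", "SARIMA_model"),
    ("SARIMA_model", "positives.csv"),
]
metadata = {
    "climate_monthly.csv": {"wind speed"},
    "dengue_monthly.csv": {"dengue"},
    "SARIMA_model": set(),
    "positives.csv": set(),
}

def reachable(start, neighbors):
    """All nodes reachable from start by following neighbors (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def augment(metadata, edges):
    """Union each artifact's metadata with that of its ancestors and descendants."""
    children, parents = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    augmented = {}
    for name, own in metadata.items():
        related = reachable(name, parents) | reachable(name, children)
        augmented[name] = set(own)
        for other in related:
            augmented[name] |= metadata.get(other, set())
    return augmented

augmented = augment(metadata, edges)
print(sorted(augmented["positives.csv"]))
```

In this sketch an output file with no metadata of its own inherits terms such as "dengue" from the inputs it was derived from.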
Finally, we reduce the characteristic cooccurrences using indicator-based recommendation methods.
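The slides do not say which indicator method is used. One common choice is a log-likelihood ratio (G²) test on cooccurrence counts, sketched below under that assumption. The data, the threshold of 5.0, and the function names are hypothetical.

```python
import math
from itertools import combinations

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def indicators(artifacts, threshold):
    """Keep only term pairs that cooccur anomalously often."""
    n = len(artifacts)
    vocab = sorted(set().union(*artifacts.values()))
    count = {t: sum(t in terms for terms in artifacts.values()) for t in vocab}
    result = {t: set() for t in vocab}
    for a, b in combinations(vocab, 2):
        k11 = sum(a in terms and b in terms for terms in artifacts.values())
        k12 = count[a] - k11          # a without b
        k21 = count[b] - k11          # b without a
        k22 = n - k11 - k12 - k21     # neither
        positive = k11 * n > count[a] * count[b]  # more often than chance
        if positive and llr(k11, k12, k21, k22) >= threshold:
            result[a].add(b)
            result[b].add(a)
    return result

# Hypothetical augmented metadata: "month" appears everywhere, so it carries
# no signal, while "umbrella" and "rainfall" always appear together.
artifacts = {f"wet_{i}": {"umbrella", "rainfall", "month"} for i in range(4)}
artifacts.update({f"dry_{i}": {"sunscreen", "month"} for i in range(4)})

ind = indicators(artifacts, threshold=5.0)
print(ind["umbrella"], ind["month"])
```

The reduction is the point: ubiquitous terms such as "month" cooccur with everything but score zero, so only the informative pairs survive as indicators.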
A NOTE ON IMPLICATIONS
The characteristic indicator matrix is what connects “umbrella” with “rainfall” or “mosquito” with “temperature” + “windspeed”.
QUERY PROCESS
The original query is often textual, possibly a README, augmented by recent project behavior (queries, references).
The query is expanded based on indicators (when they say “umbrellas” they also mean “rainfall”) as well as semantic token embedding using BERT.
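A minimal sketch of the indicator-based part of query expansion. The indicator table, the `expand_query` helper, and the 0.5 weight for expanded terms are all assumptions for illustration. The BERT embedding step is left out.

```python
def expand_query(terms, indicators, weight=0.5):
    """Map each query term to weight 1.0 and each indicated term to a lower weight."""
    expanded = {t: 1.0 for t in terms}
    for t in terms:
        for related in indicators.get(t, ()):
            # Never downgrade a term the user typed explicitly.
            expanded.setdefault(related, weight)
    return expanded

# Hypothetical indicator table, following the examples in the slides.
indicators = {
    "umbrellas": {"rainfall"},
    "mosquito": {"temperature", "windspeed"},
}

print(expand_query(["umbrellas", "sales"], indicators))
# → {'umbrellas': 1.0, 'sales': 1.0, 'rainfall': 0.5}
```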
The final results include an explanation of why files or programs are included.
Recommendations      Explanation
positives.csv        “dengue” ancestor
notpositives.csv     “dengue” ancestor
SARIMA_model         “dengue” ancestor
dengue_monthly.csv   “dengue”
climate_monthly.csv  “wind speed”
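One way such explanations might be produced is to record, for each matched term, whether it came from the artifact's own metadata or from an ancestor. The `explain` helper and the metadata below are hypothetical, and the slides do not describe how the explanations are actually generated.

```python
def explain(query_terms, own, from_ancestors):
    """Return artifact -> list of reasons, marking terms inherited from ancestors."""
    reasons = {}
    for name in own:
        hits = []
        for term in query_terms:
            if term in own[name]:
                hits.append(f'"{term}"')
            elif term in from_ancestors.get(name, set()):
                hits.append(f'"{term}" ancestor')
        if hits:
            reasons[name] = hits
    return reasons

# Hypothetical metadata: the model has no terms of its own but inherits
# them from the datasets it was trained on.
own = {
    "dengue_monthly.csv": {"dengue"},
    "climate_monthly.csv": {"wind speed"},
    "SARIMA_model": set(),
}
from_ancestors = {"SARIMA_model": {"dengue", "wind speed"}}

print(explain(["dengue"], own, from_ancestors))
```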
EVALUATION
• Evaluation is difficult due to a lack of public datasets
• Most machine learning examples are truncated to final steps
• Very few non-machine learning pipelines exist outside of toy examples
• Private datasets generally cannot be shared
• Still important to use when possible due to scale
• Evaluation of recommendation engines is a subtle art
• Their purpose is to change behaviors
• Today’s recommendations select tomorrow’s training data
• We aren’t at this point yet; reaching it would be a symptom of success
THANK YOU
ted.dunning@hpe.com
@ted_dunning
@ted_dunning@mastodon.social