SlideShare uma empresa Scribd logo
1 de 42
Baixar para ler offline
11 September 2017
Reproducible Research
To infinity and beyond
!1
Outline - getting
to reproducibility
✤ What does reproducibility
mean?
✤ Should I design my data
analysis as code?
✤ What is data & what is
metadata?
✤ Should I keep a digital
notebook?
2https://github.com/MorrellLAB/sequence_handling
Inspirations for talk
✤ Yosef Cohen - Statistics and Data with R
✤ Vince Buffalo - Bioinformatics Data Skills
✤ Karl Broman - Initial steps toward reproducible research
✤ Greg Wilson, Titus Brown, et al. - Best practices for
scientific computing
✤ Every dissertation, manuscript, or paper I read!
!3
NSF Statement
(today)
4
What does
reproducible mean?
✤ Person “skilled in the art” can
reproduce results
✤ Or collect new data and get to
similar results
5
Why should research
be reproducible?
✤ Public investment requires
reliable results
✤ Research training is NOT
intended to teach a boutique
set of skills only an artisan can
replicate
✤ Work is replicated and
updated; current human
genome is version 38
6
Why this does
[not] apply to me?
✤ “I work on a highly specialized
system”
✤ “Croak length in
subterranean Iberian frogs”
✤ Whenever possible approaches
and methods should be
“generalizable”
7
Should I design my
analysis as code?
✤ Easier said than done
✤ Goal should be self-contained
code that executes analyses
✤ Hand-editing any file or
intermediate step is a no-no
✤ Primarily because you will
have to do it repeatedly, and
could make mistakes
8
“You all grew up in Latin America, you just don’t know it
yet.”
–Joseph Whitecotton - anthropologist - to U. of Oklahoma
undergrads
!9
“You are all computational biologists, you just don’t know it
yet.”
–Peter Morrell - biologist - to seminar audience at U. of Minnesota
!10
Transition to
executable analysis
✤ Often includes code in Perl,
Python, R, BASH, C++ or a
combination of languages
✤ Elegance of design and
simplicity of interpretation
increase with experience
11
Reproducibility starts with good
tools that are well vetted
✤ “I wrote my own sequence read mapper” - bad idea
✤ Unless you are a computer science grad student
✤ Even then, bad idea because there are dozens
already
✤ Testing of these tools is crowd sourced
!12
“Use your [power] tools whenever possible.”
–Leonard (Top) Hartin - my grandfather
!13
Power tools
✤ Primary tools for most analyses
are developed by professional
programmers and statisticians
✤ Use them whenever possible!
✤ BWA - sequence read mapping
✤ R/QTL - biparental QTL
mapping
✤ Bedtools - for genome slicing
and dicing
14
Good code is
well commented
✤ Comments should explain the
“why” of a step taken in code
✤ Should be sufficient for anyone
“skilled in the art”
✤ Avoid “!@##$!!@#*%” in
comments; code eventually
becomes public!
15
If you only learn one
language - BASH
✤ Unix shell (BASH) scripting
provides access to pro tools
✤ Necessary to use a
supercomputer
✤ Bioinformatics Data Skills -
Chapters 3 & 7
16
Example shell
script
✤ Avoid hard coding file paths in
the middle of a script
✤ Hard coding nearly
eliminates reproducibility
✤ Set paths as variable names at
the top of the script
✤ WORKSHOP=/panfs/roc/
groups/9/morrellp/
pmorrell/Workshop
17
What is a concurrent versioning
system?
✤ Track files and changes over versions of the files
✤ Why is it relevant for my system?
✤ Remember, “I work on croak length of subterranean
frogs”
✤ Rerunning analysis during manuscript revision, on
a similar project years later, or by another group
analyzing rooster crows
!18
Git & Github are preferred options
✤ Git repositories can be local or shared on the internet
✤ Flaky WiFi and poor internet connectivity are not
friendly for remote work
✤ Github stores repositories, can be accessed through
a web browser or cloned to another computer (or
server account)
✤ Alternatives include Bitbucket and UMN Github
!19
What types of files
belong in a repository?
✤ Plain text (with Markdown) is
preferred format
✤ Code, readme files, workflow
diagrams all belong in a
repository
✤ Data generally belong
elsewhere
20
“Code base”
repository
✤ A Git/Github repository used
across multiple projects
✤ Should be generalizable and
extensible
✤ Changes relatively slowly
✤ Are often public during
development
21
“Code base”
repository
✤ Multiple users contribute
✤ Active contributors change
over time
22
Project
repository
✤ Git/Github repository with
files for a specific project
✤ Can be specific challenges in a
data set
✤ Changes often and quickly as a
project develops
✤ Are often private until project
is published
23
What is data,
what is metadata?
✤ Actual information; e.g., the
contents of a photo
✤ Metadata includes description
of the file containing the photo
24
File Metadata
✤ All files have a physical
location, size, dates
✤ Photos have extended
attributes - EXIF data
✤ Geolocation - spatial
metadata
✤ Doesn’t include -
accelerometer or inertial
navigation info
25
!26
!27
Metadata for
RNASeq
✤ Tissue, growth stage, genotype,
time of day, light intensity
28
Hypothetical
Study
✤ Barley 8 parent MAGIC
population
✤ Reporting on population
creation, sequence and genetic
mapping
✤ Recombination rate variation
as a phenotype
✤ Genetics or PLoS Genetics
manuscript
29Macdonald & Long 2007 - Genetics
What data & metadata should we
collect?
!30
What data & metadata should we
collect?
✤ Nanopore / Illumina sequencing of each parent
✤ Skim sequencing of progeny
✤ Recombination events per chromosomal segment for
progeny
!31
Primary
manuscript
✤ Figures & tables
✤ Names of parental lines
✤ Sequence depth
✤ Crossover frequency
✤ Location, number, & effect of
QTL discovered
32Macdonald & Long 2007 - Genetics
Supplementary
material
✤ Pedigree of parental lines
✤ Number of variants from
barley Morex reference for each
parent
✤ Proportion of variants that are
coding, noncoding, deleterious,
etc.
33Kono et al. 2016 - Mol Bio Evol
Public archive
✤ Sequence Read Archive (SRA)
✤ Stores nucleotide sequence &
quality values (FASTQ files)
✤ FASTQ for skim sequencing of
progeny
34
DRUM
✤ Data needed to reproduce
research
✤ Variant Call Format (VCF) -
millions of SNPs from 8
barley parents
✤ Detailed variant annotation
for each parent
✤ Get DOI with permanent URL
to data
35
Digital notebook
✤ Store goals for analysis, output,
code snippets
✤ Entries should contain as much
metadata as possible
36
!37
!38
Broader Impacts
!39
!40
Acknowledgements
✤ Funding
✤ NSF Plant Genome
✤ USDA Biotech Risk Assessment
✤ MN Ag Experiment Station
✤ USDA National Needs
✤ MnDrive
✤ UMN Doctororal Dissertation Fellowships
!41
Acknowledgements
!42

Mais conteúdo relacionado

Semelhante a Reproducible research - to infinity

2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
c.titus.brown
 

Semelhante a Reproducible research - to infinity (20)

Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems Immunology
 
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchFrom the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
High-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesHigh-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life Sciences
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
From Zero to Nextflow 2017
From Zero to Nextflow 2017From Zero to Nextflow 2017
From Zero to Nextflow 2017
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
Open data genomics_palermo_2017_ver03
Open data genomics_palermo_2017_ver03Open data genomics_palermo_2017_ver03
Open data genomics_palermo_2017_ver03
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 

Último

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 

Último (20)

Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to Viruses
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 

Reproducible research - to infinity

  • 1. 11 September 2017 Reproducible Research To infinity and beyond !1
  • 2. Outline - getting to reproducibility ✤ What does reproducibility mean? ✤ Should I design my data analysis as code? ✤ What is data & what is metadata? ✤ Should I keep a digital notebook? 2https://github.com/MorrellLAB/sequence_handling
  • 3. Inspirations for talk ✤ Yosef Cohen - Statistics and Data with R ✤ Vince Buffalo - Bioinformatics Data Skills ✤ Karl Broman - Initial steps toward reproducible research ✤ Greg Wilson, Titus Brown, et al. - Best practices for scientific computing ✤ Every dissertation, manuscript, or paper I read! !3
  • 5. What does reproducible mean? ✤ Person “skilled in the art” can reproduce results ✤ Or collect new data and get to similar results 5
  • 6. Why should research be reproducible? ✤ Public investment requires reliable results ✤ Research training is NOT intended to teach a boutique set of skills only an artisan can replicate ✤ Work is replicated and updated; current human genome is version 38 6
  • 7. Why this does [not] apply to me? ✤ “I work on a highly specialized system” ✤ “Croak length in subterranean Iberian frogs” ✤ Whenever possible approaches and methods should be “generalizable” 7
  • 8. Should I design my analysis as code? ✤ Easier said than done ✤ Goal should be self-contained code that executes analyses ✤ Hand-editing any file or intermediate step is a no-no ✤ Primarily because you will have to do it repeatedly, and could make mistakes 8
  • 9. “You all grew up in Latin America, you just don’t know it yet.” –Joseph Whitecotton - anthropologist - to U. of Oklahoma undergrads !9
  • 10. “You are all computational biologists, you just don’t know it yet.” –Peter Morrell - biologist - to seminar audience at U. of Minnesota !10
  • 11. Transition to executable analysis ✤ Often includes code in Perl, Python, R, BASH, C++ or a combination of languages ✤ Elegance of design and simplicity of interpretation increase with experience 11
  • 12. Reproducibility starts with good tools that are well vetted ✤ “I wrote my own sequence read mapper” - bad idea ✤ Unless you are a computer science grad student ✤ Even then, bad idea because there are dozens already ✤ Testing of these tools is crowd sourced !12
  • 13. “Use your [power] tools whenever possible.” –Leonard (Top) Hartin - my grandfather !13
  • 14. Power tools ✤ Primary tools for most analyses are developed by professional programmers and statisticians ✤ Use them whenever possible! ✤ BWA - sequence read mapping ✤ R/QTL - biparental QTL mapping ✤ Bedtools - for genome slicing and dicing 14
  • 15. Good code is well commented ✤ Comments should explain the “why” of a step taken in code ✤ Should be sufficient for anyone “skilled in the art” ✤ Avoid “!@##$!!@#*%” in comments; code eventually becomes public! 15
  • 16. If you only learn one language - BASH ✤ Unix shell (BASH) scripting provides access to pro tools ✤ Necessary to use a supercomputer ✤ Bioinformatics Data Skills - Chapters 3 & 7 16
  • 17. Example shell script ✤ Avoid hard coding file paths in the middle of a script ✤ Hard coding nearly eliminates reproducibility ✤ Set paths as variable names at the top of the script ✤ WORKSHOP=/panfs/roc/ groups/9/morrellp/ pmorrell/Workshop 17
  • 18. What is a concurrent versioning system? ✤ Track files and changes over versions of the files ✤ Why is it relevant for my system? ✤ Remember, “I work on croak length of subterranean frogs” ✤ Rerunning analysis during manuscript revision, on a similar project years later, or by another group analyzing rooster crows !18
  • 19. Git & Github are preferred options ✤ Git repositories can be local or shared on the internet ✤ Flaky WiFi and poor internet connectivity are not friendly for remote work ✤ Github stores repositories, can be accessed through a web browser or cloned to another computer (or server account) ✤ Alternatives include Bitbucket and UMN Github !19
  • 20. What types of files belong in a repository? ✤ Plain text (with Markdown) is preferred format ✤ Code, readme files, workflow diagrams all belong in a repository ✤ Data generally belong elsewhere 20
  • 21. “Code base” repository ✤ A Git/Github repository used across multiple projects ✤ Should be generalizable and extensible ✤ Changes relatively slowly ✤ Are often public during development 21
  • 22. “Code base” repository ✤ Multiple users contribute ✤ Active contributors change over time 22
  • 23. Project repository ✤ Git/Github repository with files for a specific project ✤ Can be specific challenges in a data set ✤ Changes often and quickly as a project develops ✤ Are often private until project is published 23
  • 24. What is data, what is metadata? ✤ Actual information; e.g., the contents of a photo ✤ Metadata includes description of the file containing the photo 24
  • 25. File Metadata ✤ All files have a physical location, size, dates ✤ Photos have extended attributes - EXIF data ✤ Geolocation - spatial metadata ✤ Doesn’t include - accelerometer or inertial navigation info 25
  • 26. !26
  • 27. !27
  • 28. Metadata for RNASeq ✤ Tissue, growth stage, genotype, time of day, light intensity 28
  • 29. Hypothetical Study ✤ Barley 8 parent MAGIC population ✤ Reporting on population creation, sequence and genetic mapping ✤ Recombination rate variation as a phenotype ✤ Genetics or PLoS Genetics manuscript 29Macdonald & Long 2007 - Genetics
  • 30. What data & metadata should we collect? !30
  • 31. What data & metadata should we collect? ✤ Nanopore / Illumina sequencing of each parent ✤ Skim sequencing of progeny ✤ Recombination events per chromosomal segment for progeny !31
  • 32. Primary manuscript ✤ Figures & tables ✤ Names of parental lines ✤ Sequence depth ✤ Crossover frequency ✤ Location, number, & effect of QTL discovered 32Macdonald & Long 2007 - Genetics
  • 33. Supplementary material ✤ Pedigree of parental lines ✤ Number of variants from barley Morex reference for each parent ✤ Proportion of variants that are coding, noncoding, deleterious, etc. 33Kono et al. 2016 - Mol Bio Evol
  • 34. Public archive ✤ Sequence Read Archive (SRA) ✤ Stores nucleotide sequence & quality values (FASTQ files) ✤ FASTQ for skim sequencing of progeny 34
  • 35. DRUM ✤ Data needed to reproduce research ✤ Variant Call Format (VCF) - millions of SNPs from 8 barley parents ✤ Detailed variant annotation for each parent ✤ Get DOI with permanent URL to data 35
  • 36. Digital notebook ✤ Store goals for analysis, output, code snippets ✤ Entries should contain as much metadata as possible 36
  • 37. !37
  • 38. !38
  • 40. !40
  • 41. Acknowledgements ✤ Funding ✤ NSF Plant Genome ✤ USDA Biotech Risk Assessment ✤ MN Ag Experiment Station ✤ USDA National Needs ✤ MnDrive ✤ UMN Doctororal Dissertation Fellowships !41