SlideShare uma empresa Scribd logo
1 de 20
Baixar para ler offline
Reproducibility:
The Myths and Truths of “Push-
Button” Bioinformatics
Simon Cockell

Bioinformatics Special Interest Group
19th July 2012
Repeatability and Reproducibility
• Main principle of
  scientific method
• Repeatability is „within
  lab‟
• Reproducibility is
  „between lab‟
  • Broader concept
• This should be easy in
 bioinformatics, right?
  • Same data + same code =
    same results
  • Not many analyses have
    stochastisicity

                              http://xkcd.com/242/
Same data?
• Example
  • Data deposited in SRA
  • Original data deleted by
    researchers
  • .sra files are NOT .fastq
  • All filtering/QC steps lost
  • Starting point for
    subsequent analysis not
    the same – regardless of
    whether same code
    used
Same data?
• Data files are very large
• Hardware failures are
  surprisingly common
• Not all hardware failures
  are catastrophic
  • Bit-flipping by faulty RAM
• Do you keep an
 md5sum of your data, to
 ensure it hasn‟t been
 corrupted by the transfer
 process?
Same code?
• What version of a
  particular software did
  you use?
• Is it still available?
• Did you write it yourself?
• Do you use version
  control?
• Did you tag a version?
• Is the software
  closed/proprietary?
Version Control
• Good practice for
  software AND data
• DVCS means it doesn‟t
  have to be in a remote
  repository
• All local folders can be
  versioned
 • Doesn‟t mean they have to
   be, it‟s a judgment call
• Check-in regularly
• Tag important “releases”
                       https://twitter.com/sjcockell/status/202041359920676864
Pipelines
• Package your analysis
• Easily repeatable
• Also easy to distribute
• Start-to-finish task
  automation
• Process captured by
  underlying pipeline
  architecture


              http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
Tools for pipelining analyses
• Huge numbers
  • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
  • Only a few widely used:
• Bash
  • old school
• Taverna
   • build workflows from public webservices
• Galaxy
  • sequencing focus – tools provided in „toolshed‟
• Microbase
  • distributed computing, build workflows from „responders‟
• e-Science Central
   • „Science as a Service‟ – cloud focus
   • not specifically a bioinformatics tool
Bash
• Single-machine (or
  cluster) command-line
  workflows
• No fancy GUIs
• Record provenance &
  process
• Rudimentary parallel
  processing



                          http://www.gnu.org/software/bash/
Taverna
• Workflows from web
  services
• Lack of relevant services
 • Relies on providers
• Gluing services together
  increasingly problematic
• Sharing workflows
  through myExperiment
 • http://www.myexperiment.org/




                                  http://www.taverna.org.uk/
Galaxy
• “open, web-based
  platform for data
  intensive biomedical
  research”
• Install or use (limited)
  public server
• Can build workflows
  from tools in „toolshed‟
• Command-line tools
  wrapped with web
  interface
                             https://main.g2.bx.psu.edu/
Galaxy Workflow
Microbase
• Task management framework
• Workflows emerge from interacting „responders‟
• Notification system passes messages around
• „Cloud-ready‟ system that scales easily
• Responders must be written for new tools




                                    http://www.microbasecloud.com/
e-Science Central
• „Blocks‟ can be
  combined into
  workflows
• Blocks need to be
  written by an expert
• Social networking
  features
• Good provenance
  recording

                         http://www.esciencecentral.co.uk/
The best approach?
• Good for individual analysis
  • Package & publish
• All datasets different
  • One size does not fit all
  • Downstream processes often depend on results of upstream ones
• Note lack of QC
  • Requires human interaction – impossible to pipeline
  • Different every time
  • Subjective – major source of variation in results
  • BUT – important and necessary (GIGO)
More tools for reproducibility
• iPython notebook
   • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
   • Build notebooks with code embedded
   • Run code arbitrarily
   • Example: https://pilgrims.ncl.ac.uk:9999/
• Runmycode.org
  • Allows researchers to create „companion websites‟ for papers
  • This website allows readers to implement the methodology
    described in the paper
  • Example:
    http://www.runmycode.org/CompanionSite/site.do?siteId=92
The executable paper
• The ultimate in
  repeatable research
• Data and code
  embedded in the
  publication
• Figures can be
  generated, in situ, from
  the actual data
• http://ged.msu.edu/pap
  ers/2012-diginorm/
Summary
• For work to be repeatable:
  • Data and code must be available
  • Process must be documented (and preferably shared)
  • Version information is important
  • Pipelines are not the great panacea
    • Though they may help for parts of the process
    • Bash is as good as many „fancier‟ tools (for tasks on a single machine
      or cluster)
Inspirations for this talk
• C. Titus Brown‟s blogposts on repeatability and the
 executable paper
  • http://ivory.idyll.org/blog
• Michael Barton‟s blogposts about organising
 bioinformatics projects and pipelines
  • http://bioinformaticszen.com/

Mais conteúdo relacionado

Mais procurados

The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects Carole Goble
 
FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout Carole Goble
 
RO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsRO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsCarole Goble
 
Data management, data sharing: the SysMO-SEEK Story
Data management, data sharing: the SysMO-SEEK StoryData management, data sharing: the SysMO-SEEK Story
Data management, data sharing: the SysMO-SEEK StoryCarole Goble
 
Open Science: how to serve the needs of the researcher?
Open Science: how to serve the needs of the researcher? Open Science: how to serve the needs of the researcher?
Open Science: how to serve the needs of the researcher? Carole Goble
 
EOSC-Life Workflow Collaboratory
EOSC-Life Workflow CollaboratoryEOSC-Life Workflow Collaboratory
EOSC-Life Workflow CollaboratoryCarole Goble
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational WorkflowsCarole Goble
 
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
 
A Big Picture in Research Data Management
A Big Picture in Research Data ManagementA Big Picture in Research Data Management
A Big Picture in Research Data ManagementCarole Goble
 
Building the FAIR Research Commons: A Data Driven Society of Scientists
Building the FAIR Research Commons: A Data Driven Society of ScientistsBuilding the FAIR Research Commons: A Data Driven Society of Scientists
Building the FAIR Research Commons: A Data Driven Society of ScientistsCarole Goble
 
FAIR History and the Future
FAIR History and the FutureFAIR History and the Future
FAIR History and the FutureCarole Goble
 
FAIR Data Bridging from researcher data management to ELIXIR archives in the...
FAIR Data Bridging from researcher data management to ELIXIR archives in the...FAIR Data Bridging from researcher data management to ELIXIR archives in the...
FAIR Data Bridging from researcher data management to ELIXIR archives in the...Carole Goble
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceCarole Goble
 
Building Federated FAIR Data Spaces, Yann Le Franc, EOSC-Pillar
Building Federated FAIR Data Spaces, Yann Le Franc, EOSC-PillarBuilding Federated FAIR Data Spaces, Yann Le Franc, EOSC-Pillar
Building Federated FAIR Data Spaces, Yann Le Franc, EOSC-PillarEOSC-Pillar European Project
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 
The European Open Science Cloud: just what is it?
The European Open Science Cloud: just what is it?The European Open Science Cloud: just what is it?
The European Open Science Cloud: just what is it?Carole Goble
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsDuncan Hull
 
Invited talk @ ESIP summer meeting, 2009
Invited talk @ ESIP summer meeting, 2009Invited talk @ ESIP summer meeting, 2009
Invited talk @ ESIP summer meeting, 2009Paolo Missier
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 

Mais procurados (20)

The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects
 
FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout
 
RO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsRO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research Objects
 
Data management, data sharing: the SysMO-SEEK Story
Data management, data sharing: the SysMO-SEEK StoryData management, data sharing: the SysMO-SEEK Story
Data management, data sharing: the SysMO-SEEK Story
 
Open Science: how to serve the needs of the researcher?
Open Science: how to serve the needs of the researcher? Open Science: how to serve the needs of the researcher?
Open Science: how to serve the needs of the researcher?
 
EOSC-Life Workflow Collaboratory
EOSC-Life Workflow CollaboratoryEOSC-Life Workflow Collaboratory
EOSC-Life Workflow Collaboratory
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
 
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)What is Reproducibility? The R* brouhaha (and how Research Objects can help)
What is Reproducibility? The R* brouhaha (and how Research Objects can help)
 
A Big Picture in Research Data Management
A Big Picture in Research Data ManagementA Big Picture in Research Data Management
A Big Picture in Research Data Management
 
Building the FAIR Research Commons: A Data Driven Society of Scientists
Building the FAIR Research Commons: A Data Driven Society of ScientistsBuilding the FAIR Research Commons: A Data Driven Society of Scientists
Building the FAIR Research Commons: A Data Driven Society of Scientists
 
FAIR History and the Future
FAIR History and the FutureFAIR History and the Future
FAIR History and the Future
 
FAIR Data Bridging from researcher data management to ELIXIR archives in the...
FAIR Data Bridging from researcher data management to ELIXIR archives in the...FAIR Data Bridging from researcher data management to ELIXIR archives in the...
FAIR Data Bridging from researcher data management to ELIXIR archives in the...
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practice
 
Building Federated FAIR Data Spaces, Yann Le Franc, EOSC-Pillar
Building Federated FAIR Data Spaces, Yann Le Franc, EOSC-PillarBuilding Federated FAIR Data Spaces, Yann Le Franc, EOSC-Pillar
Building Federated FAIR Data Spaces, Yann Le Franc, EOSC-Pillar
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 
The European Open Science Cloud: just what is it?
The European Open Science Cloud: just what is it?The European Open Science Cloud: just what is it?
The European Open Science Cloud: just what is it?
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of Bioinformatics
 
Invited talk @ ESIP summer meeting, 2009
Invited talk @ ESIP summer meeting, 2009Invited talk @ ESIP summer meeting, 2009
Invited talk @ ESIP summer meeting, 2009
 
COPO - Collaborative Open Plant Omics, by Rob Davey
COPO - Collaborative Open Plant Omics, by Rob DaveyCOPO - Collaborative Open Plant Omics, by Rob Davey
COPO - Collaborative Open Plant Omics, by Rob Davey
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 

Semelhante a Reproducibility: The Myths and Truths of "Push-Button

2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformaticsStephen Turner
 
Open Source Visualization of Scientific Data
Open Source Visualization of Scientific DataOpen Source Visualization of Scientific Data
Open Source Visualization of Scientific DataMarcus Hanwell
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015Tanu Malik
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopMarcus Hanwell
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsMarcus Hanwell
 
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016Grid Protection Alliance
 
Taverna workflows in the cloud
Taverna workflows in the cloudTaverna workflows in the cloud
Taverna workflows in the cloudmyGrid team
 
dREG & SimVascular-Gateways-ECSS-Presentation
dREG & SimVascular-Gateways-ECSS-PresentationdREG & SimVascular-Gateways-ECSS-Presentation
dREG & SimVascular-Gateways-ECSS-PresentationEroma Abeysinghe
 
Open Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & AnalysisOpen Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & AnalysisMarcus Hanwell
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsCarole Goble
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
02-Lifecycle.pptx
02-Lifecycle.pptx02-Lifecycle.pptx
02-Lifecycle.pptxShree Shree
 
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkCarolyn Duby
 
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...InfluxData
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 
2010 EGITF Amsterdam - Gap between GRID and Humanities
2010 EGITF Amsterdam - Gap between GRID and Humanities2010 EGITF Amsterdam - Gap between GRID and Humanities
2010 EGITF Amsterdam - Gap between GRID and HumanitiesDirk Roorda
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDefcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDaveEdwards12
 

Semelhante a Reproducibility: The Myths and Truths of "Push-Button (20)

2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
 
Beyond the Science Gateway
Beyond the Science GatewayBeyond the Science Gateway
Beyond the Science Gateway
 
Open Source Visualization of Scientific Data
Open Source Visualization of Scientific DataOpen Source Visualization of Scientific Data
Open Source Visualization of Scientific Data
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015
 
Chemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the DesktopChemical Databases and Open Chemistry on the Desktop
Chemical Databases and Open Chemistry on the Desktop
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and Semantics
 
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
Advanced Automated Analytics Using OSS Tools, GA Tech FDA Conference 2016
 
Taverna workflows in the cloud
Taverna workflows in the cloudTaverna workflows in the cloud
Taverna workflows in the cloud
 
dREG & SimVascular-Gateways-ECSS-Presentation
dREG & SimVascular-Gateways-ECSS-PresentationdREG & SimVascular-Gateways-ECSS-Presentation
dREG & SimVascular-Gateways-ECSS-Presentation
 
Open Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & AnalysisOpen Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & Analysis
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
02-Lifecycle.pptx
02-Lifecycle.pptx02-Lifecycle.pptx
02-Lifecycle.pptx
 
142 wendy shank
142 wendy shank142 wendy shank
142 wendy shank
 
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and SparkODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
ODSC East 2017 - Reproducible Research at Scale with Apache Zeppelin and Spark
 
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 
2010 EGITF Amsterdam - Gap between GRID and Humanities
2010 EGITF Amsterdam - Gap between GRID and Humanities2010 EGITF Amsterdam - Gap between GRID and Humanities
2010 EGITF Amsterdam - Gap between GRID and Humanities
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malwareDefcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
Defcon 22-wesley-mc grew-instrumenting-point-of-sale-malware
 

Último

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Último (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Reproducibility: The Myths and Truths of "Push-Button

  • 1. Reproducibility: The Myths and Truths of “Push- Button” Bioinformatics Simon Cockell Bioinformatics Special Interest Group 19th July 2012
  • 2. Repeatability and Reproducibility • Main principle of scientific method • Repeatability is „within lab‟ • Reproducibility is „between lab‟ • Broader concept • This should be easy in bioinformatics, right? • Same data + same code = same results • Not many analyses have stochastisicity http://xkcd.com/242/
  • 3. Same data? • Example • Data deposited in SRA • Original data deleted by researchers • .sra files are NOT .fastq • All filtering/QC steps lost • Starting point for subsequent analysis not the same – regardless of whether same code used
  • 4. Same data? • Data files are very large • Hardware failures are surprisingly common • Not all hardware failures are catastrophic • Bit-flipping by faulty RAM • Do you keep an md5sum of your data, to ensure it hasn‟t been corrupted by the transfer process?
  • 5. Same code? • What version of a particular software did you use? • Is it still available? • Did you write it yourself? • Do you use version control? • Did you tag a version? • Is the software closed/proprietary?
  • 6. Version Control • Good practice for software AND data • DVCS means it doesn‟t have to be in a remote repository • All local folders can be versioned • Doesn‟t mean they have to be, it‟s a judgment call • Check-in regularly • Tag important “releases” https://twitter.com/sjcockell/status/202041359920676864
  • 7. Pipelines • Package your analysis • Easily repeatable • Also easy to distribute • Start-to-finish task automation • Process captured by underlying pipeline architecture http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
  • 8. Tools for pipelining analyses • Huge numbers • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems • Only a few widely used: • Bash • old school • Taverna • build workflows from public webservices • Galaxy • sequencing focus – tools provided in „toolshed‟ • Microbase • distributed computing, build workflows from „responders‟ • e-Science Central • „Science as a Service‟ – cloud focus • not specifically a bioinformatics tool
  • 9. Bash • Single-machine (or cluster) command-line workflows • No fancy GUIs • Record provenance & process • Rudimentary parallel processing http://www.gnu.org/software/bash/
  • 10.
  • 11. Taverna • Workflows from web services • Lack of relevant services • Relies on providers • Gluing services together increasingly problematic • Sharing workflows through myExperiment • http://www.myexperiment.org/ http://www.taverna.org.uk/
  • 12. Galaxy • “open, web-based platform for data intensive biomedical research” • Install or use (limited) public server • Can build workflows from tools in „toolshed‟ • Command-line tools wrapped with web interface https://main.g2.bx.psu.edu/
  • 14. Microbase • Task management framework • Workflows emerge from interacting „responders‟ • Notification system passes messages around • „Cloud-ready‟ system that scales easily • Responders must be written for new tools http://www.microbasecloud.com/
  • 15. e-Science Central • „Blocks‟ can be combined into workflows • Blocks need to be written by an expert • Social networking features • Good provenance recording http://www.esciencecentral.co.uk/
  • 16. The best approach? • Good for individual analysis • Package & publish • All datasets different • One size does not fit all • Downstream processes often depend on results of upstream ones • Note lack of QC • Requires human interaction – impossible to pipeline • Different every time • Subjective – major source of variation in results • BUT – important and necessary (GIGO)
  • 17. More tools for reproducibility • iPython notebook • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html • Build notebooks with code embedded • Run code arbitrarily • Example: https://pilgrims.ncl.ac.uk:9999/ • Runmycode.org • Allows researchers to create „companion websites‟ for papers • This website allows readers to implement the methodology described in the paper • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
  • 18. The executable paper • The ultimate in repeatable research • Data and code embedded in the publication • Figures can be generated, in situ, from the actual data • http://ged.msu.edu/pap ers/2012-diginorm/
  • 19. Summary • For work to be repeatable: • Data and code must be available • Process must be documented (and preferably shared) • Version information is important • Pipelines are not the great panacea • Though they may help for parts of the process • Bash is as good as many „fancier‟ tools (for tasks on a single machine or cluster)
  • 20. Inspirations for this talk • C. Titus Brown‟s blogposts on repeatability and the executable paper • http://ivory.idyll.org/blog • Michael Barton‟s blogposts about organising bioinformatics projects and pipelines • http://bioinformaticszen.com/

Notas do Editor

  1. Or any other Unix shell (maybe even Windows batch scripts)
  2. Expose often simplified set of options
  3. Can make arbitrary blocks in R, Java or Octave