SlideShare uma empresa Scribd logo
1 de 63
Baixar para ler offline
GALAXY
       BIOINFORMATICS WORKFLOW
                    ENVIRONMENT

Rutger Vos, 3 April 2012
Overview
Informatics in the post-genomic era
The past (?)

      Analyses glued together using
       scripting languages, directly on
       the CLI or in GUI
      Sanger sequencing
      Smaller data volumes
      Fewer remote data resources
      Hypothesis-driven
The present

    Graphical or text-
     based workflow tools
    “Next generation”
     sequencing
    Large data sets
    Many remote data
     resources
    “Data-driven”
NGS – Roche 454 pyrosequencing

Pyrosequencing                     Genome Sequencer FLX

    “Emulsion PCR”
    Bead with primer in each
     droplet
    Each bead is placed in a
     well with luciferase
    Plate is analyzed by fiber-
     optic chip
NGS – Illumina/Solexa

Reversible dye-terminator seq    HiSeq2000 (BGI)

    DNA attaches to primer on
     slide and is amplified
    4 RT-bases are added
    Camera detects labeled
     nucleotides
    Next 1-base cycle
NGS – IonTorrent

Ion semiconductor sequencing     Ion Torrent PGM

    “sequencing by synthesis”
    Not light-based, sensor
     detects H+ ions during
     synthesis
    Longer reads
NGS – SOLiD Sequencing

Oligonucleotide ligation and
                                    SOLiD 5500 Genetic Analyzer
detection
    Beads with DNA fragments
    Universal adapter attached
     to fragments
    PCR product attaches to
     slide
    Fluorescent probes ligate to
     the primer
The future


     “Next-next generation”
      sequencing (MinION?)
     Smaller data sets?
     Semantic web?
     Back to hypotheses?
Workflows
Tools for automating bioinformatics analyses
Examples of workflow tools


 Taverna        taverna.org.uk

 Galaxy         usegalaxy.org

 eHive          ensembl.org/info/docs/eHive

 Mobyle         mobyle.pasteur.fr

 Cipres         phylo.org

 Yahoo! Pipes   pipes.yahoo.com

 “make”         gnu.org/software/make
Galaxy
Provenance, histories

    Where do the
     data come from?
    How were they
     altered?
    Galaxy tracks the
     history of data.
    Histories can be
     converted to
     workflows
Creating a workflow
Workflow editor
Reproducibility

  Good science requires that results be reproducible
  Some analyses are run many times
Sharing

    “Standing on the
     shoulders of giants”
    Not re-inventing the
     wheel
    “Executable
     papers” (doi:
     10.1101/gr.
     094508.109)
Data
What data types, expressed in which formats,
does Galaxy operate on? How do I get data
into and out of Galaxy?
Data types
•    Sequences
•    Alignments
•    Intervals
•    Tabular data
File format conversion


    Built-in converters
     between related
     file formats are
     provided
    Additional
     converters can be
     added
Galaxy sequence formats

     FASTA – def line, sequence
     FASTQ – FASTA + quality - SOLEXA:
     ABI/SCF – binary sequence trace (see below)
     SFF (454) – binary flowgram format
Galaxy alignment formats

      MAF – multiple alignment format (see below)
      (S|B)AM – text or binary format for reads and ref
      AXT – pairwise alignment, from LAV
      LAV – BLASTZ output, pairwise alignment
Galaxy interval/feature formats

    BED
         chrom, start, end, …
    INTERVAL
         BED with headers
    GFF (GFF3)
         like BED and
          interval, but 1-based
          inclusive
    WIG(GLE)
         dense, continuous-
          valued tracks
Other data formats


 In addition to interval/feature tabular data as listed
    previously, other files with similar properties can be
    processed by some tools:
        *.txt tab-separated values (e.g. tabular FASTA)
        HTML (for additional prose)
        LPED/PBED (to describe SNPs, really two files, one for
         coordinates, other for alleles)
Data I/O
•    Upload via FTP
•    Fetch by URL
•    Import from data library
•    Provided by interoperable web service
Upload data using FTP
Fetch data from a URL
Import from data library
Fetch data from proxy service

    User submits proxy request to Galaxy
    Galaxy forwards request to remote service
    Service returns data
    Galaxy infers data type and presents results
“Big Data”


     NGS has led to massive
      data sets
     Data formats are simple,
      binary, and/or compressed
     Still, people drive around
      with USB hard disks
Data sharing/publishing


      The Galaxy platform
       allows users to
       publish and share
       their data, for
       example as
       supplemental
       materials to a
       publication*


         * example: http://genome.cshlp.org/content/19/11/2144
Tools
Which operations can I run on Galaxy?
Galaxy tools


  Send and Get data – Upload, fetch, send, submit
  Data manipulation – Join, sort, filter

  Format conversion – FASTA, other format operations

  Statistics – Regressions, simulations, model tests

  NGS – BAM, FASTQ, SOLiD, 454 file operations

  RNA analysis – cufflinks, tophat

  Evolution – branch lengths, NJ, HyPhy
Visualization
What data can I view in Galaxy?
Galaxy visualization - histograms
Galaxy visualization - scatterplots
Galaxy visualization – XY plot
Galaxy visualization – box plot
Galaxy visualization - trackster
Deployment
How to deploy your analyses on Galaxy, on
public servers, in the cloud or locally
Public servers
•    “Main” at UseGalaxy.org
•    NBIC
•    Others listed on Galaxy wiki
Local server

Pro:                          Con:

    Data close by                Complicated install
    Can add your own tools       Many dependencies
    Can develop Galaxy           UNIX-based
     further
    UNIX-based
Cloud Galaxy

  Galaxy can be installed in the (Amazon EC2 cloud)
  Private data without the hardware hassle

  Uploading and storing data can be costly, however
Implementation
How does it work under the hood?
Galaxy under the hood


             Issues
           command




 1. Parses HTTP request
 2. Identifies which tool to use
 3. Reads tool description
 4. Queues tool
 5. Parses result
 6. Returns HTML representation of result
Web server


    Simple Galaxy installs use
     a built-in, python-based
     HTTP server
    More robust installs
     typically use the Apache
     httpd server
Code base


      Most of the framework
       and the wrapper code is
       written in python
      Some wrappers in other
       languages, e.g. perl
Interface language


    Under the hood, Galaxy executes command-line programs and
     scripts
    Their interfaces and tool tips are described in XML files
Queuing

              Jobs are executed
               asynchronously
              Progress is shown in the
               data browser
              On big servers (e.g.
               “Main”), queuing is
               managed by the
               dedicated “Torque”
               system
UNIX


    Galaxy (simply)executes
     command-line programs
     within a UNIX-like
     environment
    Galaxy doesn’t have to
     “know” how to run those
     programs, it finds out from
     their descriptions at runtime
Database



      Analysis metadata is stored
       in a database, by default
       this is SQLite
      More robust installs use
       PostgreSQL
Configuration


     Galaxy has many moving
      parts that can be
      configured
     Configuration is done using
      simple text-based INI files
Version control


    Revision control (or version
     control) provides unlimited
     undo and detailed tracking
     of changes
    Galaxy uses Mercurial
    Popular now are svn, git
     and hg
Community
How to get in touch with the world-wide
community to get the most out of Galaxy?
Wiki
Mailing lists



      @lists.bx.psu.edu:
           galaxy-user
           galaxy-dev
           galaxy-announce
           galaxy-commits
Tutorials


     Galaxy 101:
          Getting data from UCSC
          Performing simple data
           manipulation
          Understanding Galaxy's
           History system
          Creating and editing
           workflows
          Applying workflows to
           your data
Screencasts
Events

      Galaxy Community
       Conference
      ISMB
      ECCB
      BOSC
      PAG
      Bio-IT World
      GMOD Meetings
“Tool shed”



                  Easy sharing of
                   new tools
                  Based on Mercurial
                  Turns Galaxy into a
                   modular ecosystem
CiteULike group




         Social citation manager
         There is a Galaxy group:
              citeulike.org/group/16008

           Articles are tagged by:
              PROJECT,
                     ISGALAXY, SHARED,
              HOWTO, METHODS,
              REPRODUCIBILITY, WORKFLOW
Organizations


     Development:
          Penn State University
          Emory University
     Support:
          NSF
          NHGRI
     Power user:
          NBIC
Links

    UseGalaxy.org – the Main server
    GetGalaxy.org – for local installs
    UseGalaxy.org/galaxy101 – intro tutorial
    Galaxy.nbic.nl – Sombrero, the NBIC server
    CiteULike.org/group/16008 – references
    genome.cshlp.org/content/19/11/2144 – windshield paper
    SlideShare.net/rvosa – these slides

Mais conteúdo relacionado

Mais procurados

Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 

Mais procurados (20)

Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
BITS: Basics of Sequence similarity
BITS: Basics of Sequence similarityBITS: Basics of Sequence similarity
BITS: Basics of Sequence similarity
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysis
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
Next generation sequencing methods
Next generation sequencing methods Next generation sequencing methods
Next generation sequencing methods
 
THIRD GEN SEQUENCING.pptx
THIRD GEN SEQUENCING.pptxTHIRD GEN SEQUENCING.pptx
THIRD GEN SEQUENCING.pptx
 
Protein-protein interaction networks
Protein-protein interaction networksProtein-protein interaction networks
Protein-protein interaction networks
 
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
NGS Introduction and Technology Overview (UEB-UAT Bioinformatics Course - Ses...
 
Overview of Single-Cell RNA-seq
Overview of Single-Cell RNA-seqOverview of Single-Cell RNA-seq
Overview of Single-Cell RNA-seq
 
Rna seq
Rna seqRna seq
Rna seq
 
Molecular markers: Outlook
Molecular markers: OutlookMolecular markers: Outlook
Molecular markers: Outlook
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Biological databases
Biological databasesBiological databases
Biological databases
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
Sequence database
Sequence databaseSequence database
Sequence database
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
Autodock and vina
Autodock and vinaAutodock and vina
Autodock and vina
 
Next generation sequencing technologies for crop improvement
Next generation sequencing technologies for crop improvementNext generation sequencing technologies for crop improvement
Next generation sequencing technologies for crop improvement
 

Destaque

Geothermal energy
Geothermal energyGeothermal energy
Geothermal energy
efabra
 
Human/Social Sciences/Cultural & Behavioral Dynamics and Advanced Analytics
Human/Social Sciences/Cultural & Behavioral Dynamics and Advanced AnalyticsHuman/Social Sciences/Cultural & Behavioral Dynamics and Advanced Analytics
Human/Social Sciences/Cultural & Behavioral Dynamics and Advanced Analytics
NC Military Business Center
 
Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10
Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10
Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10
Thermo Fisher Scientific
 
Sun, Moon and Planets Slideshow
Sun, Moon and Planets SlideshowSun, Moon and Planets Slideshow
Sun, Moon and Planets Slideshow
sanstanton
 
Special quadrilaterals proofs ans constructions
Special quadrilaterals  proofs ans constructions Special quadrilaterals  proofs ans constructions
Special quadrilaterals proofs ans constructions
cristufer
 

Destaque (20)

Galaxy presentation
Galaxy presentationGalaxy presentation
Galaxy presentation
 
Earth Moon And Sun
Earth Moon And SunEarth Moon And Sun
Earth Moon And Sun
 
Geothermal energy
Geothermal energyGeothermal energy
Geothermal energy
 
Human/Social Sciences/Cultural & Behavioral Dynamics and Advanced Analytics
Human/Social Sciences/Cultural & Behavioral Dynamics and Advanced AnalyticsHuman/Social Sciences/Cultural & Behavioral Dynamics and Advanced Analytics
Human/Social Sciences/Cultural & Behavioral Dynamics and Advanced Analytics
 
Example site map
Example site mapExample site map
Example site map
 
Development and verification of an Ion AmpliSeq TP53 Panel
Development and verification of an Ion AmpliSeq TP53 PanelDevelopment and verification of an Ion AmpliSeq TP53 Panel
Development and verification of an Ion AmpliSeq TP53 Panel
 
Assessment of TP53 Mutation Status in Breast Tumor Tissue using the "Ion Ampl...
Assessment of TP53 Mutation Status in Breast Tumor Tissue using the "Ion Ampl...Assessment of TP53 Mutation Status in Breast Tumor Tissue using the "Ion Ampl...
Assessment of TP53 Mutation Status in Breast Tumor Tissue using the "Ion Ampl...
 
Fully automated Ion AmpliSeq™ library preparation using the Ion Chef System
Fully automated Ion AmpliSeq™ library preparation using the Ion Chef SystemFully automated Ion AmpliSeq™ library preparation using the Ion Chef System
Fully automated Ion AmpliSeq™ library preparation using the Ion Chef System
 
Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...
Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...
Application Note: A Simple One-Step Library Prep Method To Enable AmpliSeq Pa...
 
xGen® Lockdown® Probes
xGen® Lockdown® ProbesxGen® Lockdown® Probes
xGen® Lockdown® Probes
 
Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10
Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10
Pharmacogenomics Research Ion AmpliSeq Assay | ESHG 2015 Poster PM15.10
 
Improved Reagents & Methods for Target Enrichment in Next Generation Sequencing
Improved Reagents & Methods for Target Enrichment in Next Generation SequencingImproved Reagents & Methods for Target Enrichment in Next Generation Sequencing
Improved Reagents & Methods for Target Enrichment in Next Generation Sequencing
 
Target capture of DNA from FFPE samples— recommendations for generating robus...
Target capture of DNA from FFPE samples— recommendations for generating robus...Target capture of DNA from FFPE samples— recommendations for generating robus...
Target capture of DNA from FFPE samples— recommendations for generating robus...
 
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
 
Galaxies
GalaxiesGalaxies
Galaxies
 
Sun, Moon and Planets Slideshow
Sun, Moon and Planets SlideshowSun, Moon and Planets Slideshow
Sun, Moon and Planets Slideshow
 
DATA STRUCTURES
DATA STRUCTURESDATA STRUCTURES
DATA STRUCTURES
 
Spine X Live2D 百萬智多星製作經驗談
Spine X Live2D 百萬智多星製作經驗談Spine X Live2D 百萬智多星製作經驗談
Spine X Live2D 百萬智多星製作經驗談
 
Unit 1.my school
Unit 1.my schoolUnit 1.my school
Unit 1.my school
 
Special quadrilaterals proofs ans constructions
Special quadrilaterals  proofs ans constructions Special quadrilaterals  proofs ans constructions
Special quadrilaterals proofs ans constructions
 

Semelhante a The Galaxy bioinformatics workflow environment

Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Ian Foster
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCube
FAO
 
Virtual Science in the Cloud
Virtual Science in the CloudVirtual Science in the Cloud
Virtual Science in the Cloud
thetfoot
 

Semelhante a The Galaxy bioinformatics workflow environment (20)

Sfu ngs course_workshop tutorial_2.1
Sfu ngs course_workshop tutorial_2.1Sfu ngs course_workshop tutorial_2.1
Sfu ngs course_workshop tutorial_2.1
 
3rd presentation
3rd presentation3rd presentation
3rd presentation
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
 
Horizontal scaling with Galaxy
Horizontal scaling with GalaxyHorizontal scaling with Galaxy
Horizontal scaling with Galaxy
 
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCube
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
GeoChronos
GeoChronosGeoChronos
GeoChronos
 
Big data
Big dataBig data
Big data
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
 
Virtual Science in the Cloud
Virtual Science in the CloudVirtual Science in the Cloud
Virtual Science in the Cloud
 
Grid computing
Grid computingGrid computing
Grid computing
 
IMPACT/myGrid Hackathon - Taverna Roadmap
IMPACT/myGrid Hackathon - Taverna RoadmapIMPACT/myGrid Hackathon - Taverna Roadmap
IMPACT/myGrid Hackathon - Taverna Roadmap
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 

Mais de Rutger Vos

Vos at NCB Naturalis
Vos at NCB NaturalisVos at NCB Naturalis
Vos at NCB Naturalis
Rutger Vos
 

Mais de Rutger Vos (20)

Anna Karenina on hooves - what makes an animal fit for domestication?
Anna Karenina on hooves - what makes an animal fit for domestication?Anna Karenina on hooves - what makes an animal fit for domestication?
Anna Karenina on hooves - what makes an animal fit for domestication?
 
10 Misverstanden Over Evolutie
10 Misverstanden Over Evolutie10 Misverstanden Over Evolutie
10 Misverstanden Over Evolutie
 
Crash Course Biodiversiteit
Crash Course BiodiversiteitCrash Course Biodiversiteit
Crash Course Biodiversiteit
 
Natural history research as a replicable data science
Natural history research as a replicable data scienceNatural history research as a replicable data science
Natural history research as a replicable data science
 
Species delimitation - species limits and character evolution
Species delimitation - species limits and character evolutionSpecies delimitation - species limits and character evolution
Species delimitation - species limits and character evolution
 
Onderzoek bio-informatica Naturalis. Raad voor Cultuur 2017.
Onderzoek bio-informatica Naturalis. Raad voor Cultuur 2017.Onderzoek bio-informatica Naturalis. Raad voor Cultuur 2017.
Onderzoek bio-informatica Naturalis. Raad voor Cultuur 2017.
 
Robot eye for the butterfly
Robot eye for the butterflyRobot eye for the butterfly
Robot eye for the butterfly
 
Taxonomic classification of digitized specimens using machine learning
Taxonomic classification of digitized specimens using machine learningTaxonomic classification of digitized specimens using machine learning
Taxonomic classification of digitized specimens using machine learning
 
Self-Updating Platform for the Estimation of Rates of Speciation, Migration A...
Self-Updating Platform for the Estimation of Rates of Speciation, Migration A...Self-Updating Platform for the Estimation of Rates of Speciation, Migration A...
Self-Updating Platform for the Estimation of Rates of Speciation, Migration A...
 
Assembling the Tree of Life from public DNA sequence data
Assembling the Tree of Life from public DNA sequence dataAssembling the Tree of Life from public DNA sequence data
Assembling the Tree of Life from public DNA sequence data
 
Hoe leer je een robot soorten te herkennen?
Hoe leer je een robot soorten te herkennen?Hoe leer je een robot soorten te herkennen?
Hoe leer je een robot soorten te herkennen?
 
Modeling the biosphere: the natural historian's perspective
Modeling the biosphere: the natural historian's perspectiveModeling the biosphere: the natural historian's perspective
Modeling the biosphere: the natural historian's perspective
 
Kunnen we een tomaat van 400 jaar oud proeven
Kunnen we een tomaat van 400 jaar oud proevenKunnen we een tomaat van 400 jaar oud proeven
Kunnen we een tomaat van 400 jaar oud proeven
 
PhyloTastic: names-based phyloinformatic data integration
PhyloTastic: names-based phyloinformatic data integrationPhyloTastic: names-based phyloinformatic data integration
PhyloTastic: names-based phyloinformatic data integration
 
SUPERSMART pipeline intro
SUPERSMART pipeline introSUPERSMART pipeline intro
SUPERSMART pipeline intro
 
Reconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsReconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomics
 
Synthesising disparate data resources to obtain composite estimates of geophy...
Synthesising disparate data resources to obtain composite estimates of geophy...Synthesising disparate data resources to obtain composite estimates of geophy...
Synthesising disparate data resources to obtain composite estimates of geophy...
 
Retrieving useful information from connected specimen- and data collections
Retrieving useful information from connected specimen- and data collectionsRetrieving useful information from connected specimen- and data collections
Retrieving useful information from connected specimen- and data collections
 
NeXML - phylogenetic data as XML
NeXML - phylogenetic data as XMLNeXML - phylogenetic data as XML
NeXML - phylogenetic data as XML
 
Vos at NCB Naturalis
Vos at NCB NaturalisVos at NCB Naturalis
Vos at NCB Naturalis
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 

The Galaxy bioinformatics workflow environment

  • 1. GALAXY BIOINFORMATICS WORKFLOW ENVIRONMENT Rutger Vos, 3 April 2012
  • 2. Overview Informatics in the post-genomic era
  • 3. The past (?)   Analyses glued together using scripting languages, directly on the CLI or in GUI   Sanger sequencing   Smaller data volumes   Fewer remote data resources   Hypothesis-driven
  • 4. The present   Graphical or text- based workflow tools   “Next generation” sequencing   Large data sets   Many remote data resources   “Data-driven”
  • 5. NGS – Roche 454 pyrosequencing Pyrosequencing Genome Sequencer FLX   “Emulsion PCR”   Bead with primer in each droplet   Each bead is placed in a well with luciferase   Plate is analyzed by fiber- optic chip
  • 6. NGS – Illumina/Solexa Reversible dye-terminator seq HiSeq2000 (BGI)   DNA attaches to primer on slide and is amplified   4 RT-bases are added   Camera detects labeled nucleotides   Next 1-base cycle
  • 7. NGS – IonTorrent Ion semiconductor sequencing Ion Torrent PGM   “sequencing by synthesis”   Not light-based, sensor detects H+ ions during synthesis   Longer reads
  • 8. NGS – SOLiD Sequencing Oligonucleotide ligation and SOLiD 5500 Genetic Analyzer detection   Beads with DNA fragments   Universal adapter attached to fragments   PCR product attaches to slide   Fluorescent probes ligate to the primer
  • 9. The future   “Next-next generation” sequencing (MinION?)   Smaller data sets?   Semantic web?   Back to hypotheses?
  • 10. Workflows Tools for automating bioinformatics analyses
  • 11. Examples of workflow tools Taverna taverna.org.uk Galaxy usegalaxy.org eHive ensembl.org/info/docs/eHive Mobyle mobyle.pasteur.fr Cipres phylo.org Yahoo! Pipes pipes.yahoo.com “make” gnu.org/software/make
  • 13. Provenance, histories   Where do the data come from?   How were they altered?   Galaxy tracks the history of data.   Histories can be converted to workflows
  • 16. Reproducibility   Good science requires that results be reproducible   Some analyses are run many times
  • 17. Sharing   “Standing on the shoulders of giants”   Not re-inventing the wheel   “Executable papers” (doi: 10.1101/gr. 094508.109)
  • 18. Data What data types, expressed in which formats, does Galaxy operate on? How do I get data into and out of Galaxy?
  • 19. Data types •  Sequences •  Alignments •  Intervals •  Tabular data
  • 20. File format conversion   Built-in converters between related file formats are provided   Additional converters can be added
  • 21. Galaxy sequence formats   FASTA – def line, sequence   FASTQ – FASTA + quality - SOLEXA:   ABI/SCF – binary sequence trace (see below)   SFF (454) – binary flowgram format
  • 22. Galaxy alignment formats   MAF – multiple alignment format (see below)   (S|B)AM – text or binary format for reads and ref   AXT – pairwise alignment, from LAV   LAV – BLASTZ output, pairwise alignment
  • 23. Galaxy interval/feature formats   BED   chrom, start, end, …   INTERVAL   BED with headers   GFF (GFF3)   like BED and interval, but 1-based inclusive   WIG(GLE)   dense, continuous- valued tracks
  • 24. Other data formats In addition to interval/feature tabular data as listed previously, other files with similar properties can be processed by some tools:   *.txt tab-separated values (e.g. tabular FASTA)   HTML (for additional prose)   LPED/PBED (to describe SNPs, really two files, one for coordinates, other for alleles)
  • 25. Data I/O •  Upload via FTP •  Fetch by URL •  Import from data library •  Provided by interoperable web service
  • 28. Import from data library
  • 29. Fetch data from proxy service   User submits proxy request to Galaxy   Galaxy forwards request to remote service   Service returns data   Galaxy infers data type and presents results
  • 30. “Big Data”   NGS has led to massive data sets   Data formats are simple, binary, and/or compressed   Still, people drive around with USB hard disks
  • 31. Data sharing/publishing   The Galaxy platform allows users to publish and share their data, for example as supplemental materials to a publication* * example: http://genome.cshlp.org/content/19/11/2144
  • 32. Tools Which operations can I run on Galaxy?
  • 33. Galaxy tools   Send and Get data – Upload, fetch, send, submit   Data manipulation – Join, sort, filter   Format conversion – FASTA, other format operations   Statistics – Regressions, simulations, model tests   NGS – BAM, FASTQ, SOLiD, 454 file operations   RNA analysis – cufflinks, tophat   Evolution – branch lengths, NJ, HyPhy
  • 34. Visualization What data can I view in Galaxy?
  • 36. Galaxy visualization - scatterplots
  • 40. Deployment How to deploy your analyses on Galaxy, on public servers, in the cloud or locally
  • 41. Public servers •  “Main” at UseGalaxy.org •  NBIC •  Others listed on Galaxy wiki
  • 42. Local server Pro: Con:   Data close by   Complicated install   Can add your own tools   Many dependencies   Can develop Galaxy   UNIX-based further   UNIX-based
  • 43. Cloud Galaxy   Galaxy can be installed in the (Amazon EC2 cloud)   Private data without the hardware hassle   Uploading and storing data can be costly, however
  • 44. Implementation How does it work under the hood?
  • 45. Galaxy under the hood Issues command 1. Parses HTTP request 2. Identifies which tool to use 3. Reads tool description 4. Queues tool 5. Parses result 6. Returns HTML representation of result
  • 46. Web server   Simple Galaxy installs use a built-in, python-based HTTP server   More robust installs typically use the Apache httpd server
  • 47. Code base   Most of the framework and the wrapper code is written in python   Some wrappers in other languages, e.g. perl
  • 48. Interface language   Under the hood, Galaxy executes command-line programs and scripts   Their interfaces and tool tips are described in XML files
  • 49. Queuing   Jobs are executed asynchronously   Progress is shown in the data browser   On big servers (e.g. “Main”), queuing is managed by the dedicated “Torque” system
  • 50. UNIX   Galaxy (simply)executes command-line programs within a UNIX-like environment   Galaxy doesn’t have to “know” how to run those programs, it finds out from their descriptions at runtime
  • 51. Database   Analysis metadata is stored in a database, by default this is SQLite   More robust installs use PostgreSQL
  • 52. Configuration   Galaxy has many moving parts that can be configured   Configuration is done using simple text-based INI files
  • 53. Version control   Revision control (or version control) provides unlimited undo and detailed tracking of changes   Galaxy uses Mercurial   Popular now are svn, git and hg
  • 54. Community How to get in touch with the world-wide community to get the most out of Galaxy?
  • 55. Wiki
  • 56. Mailing lists   @lists.bx.psu.edu:   galaxy-user   galaxy-dev   galaxy-announce   galaxy-commits
  • 57. Tutorials   Galaxy 101:   Getting data from UCSC   Performing simple data manipulation   Understanding Galaxy's History system   Creating and editing workflows   Applying workflows to your data
  • 59. Events   Galaxy Community Conference   ISMB   ECCB   BOSC   PAG   Bio-IT World   GMOD Meetings
  • 60. “Tool shed”   Easy sharing of new tools   Based on Mercurial   Turns Galaxy into a modular ecosystem
  • 61. CiteULike group   Social citation manager   There is a Galaxy group:   citeulike.org/group/16008   Articles are tagged by:   PROJECT, ISGALAXY, SHARED, HOWTO, METHODS, REPRODUCIBILITY, WORKFLOW
  • 62. Organizations   Development:   Penn State University   Emory University   Support:   NSF   NHGRI   Power user:   NBIC
  • 63. Links   UseGalaxy.org – the Main server   GetGalaxy.org – for local installs   UseGalaxy.org/galaxy101 – intro tutorial   Galaxy.nbic.nl – Sombrero, the NBIC server   CiteULike.org/group/16008 – references   genome.cshlp.org/content/19/11/2144 – windshield paper   SlideShare.net/rvosa – these slides