Advanced Bioinformatics for Genomics and BioData Driven Research



  1. 1. European Bioinformatics Institute - the home for big data in biology www.ebi.ac.uk Advanced Bioinformatics for Genomics and BioData Driven Research
  2. 2. The European Molecular Biology Laboratory • 6 sites in Europe: Heidelberg, Germany (Main Laboratory); Barcelona, Spain (Tissue Biology, Disease Modeling); Hinxton, Cambridge, UK (Bioinformatics); Rome, Italy (Mouse Biology); Grenoble, France (Structural Biology); Hamburg, Germany (Structural Biology) • 80+ nationalities • >1700 personnel
  3. 3. Our mission • Deliver excellent research • Train the next generation of scientists • Engage with industry • Coordinate bioinformatics in Europe • Deliver scientific services
  4. 4. Data and tools to support life science research www.ebi.ac.uk/services Bioinformatics services
  5. 5. What services do we provide? A collaborative enterprise: labs around the world send us their data and we… • Archive it • Classify it • Share it with other data providers • Analyse, add value and integrate it • …provide tools to help researchers use it
  6. 6. Big Data, big demand for EMBL-EBI data services… • ~64 million requests to EMBL-EBI websites every day • Requests from 20 million unique IP addresses • 273 petabytes of raw storage in our data centres • 22,500 participants at EMBL-EBI Training events
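
For a sense of scale, the daily request figure on this slide can be converted into an average per-second rate. This is a minimal sketch that assumes the load is spread evenly over 24 hours, which real traffic is not; the function name is illustrative.

```python
# Average request rate implied by ~64 million requests per day.
# Assumption: requests are spread evenly across the day.
REQUESTS_PER_DAY = 64_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def average_requests_per_second(requests_per_day: int) -> float:
    """Return the mean number of requests per second for a daily total."""
    return requests_per_day / SECONDS_PER_DAY


rate = average_requests_per_second(REQUESTS_PER_DAY)
print(f"about {rate:.0f} requests per second on average")
```

Peak load would be higher than this average, so the number is only a lower bound on what the services must sustain.
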
  7. 7. Data resources at EMBL-EBI
  8. 8. Data resources for Genomics – Molecular Archives • BioSamples database - centralised resource for FAIR sample data (>12 million samples) • Experimental Factor Ontology - systematic description of experimental variables available in EBI databases and projects (26,764 terms) • European Genome-phenome Archive - sequence and genotype experiments, including case-control and population studies (3,445 studies) • European Nucleotide Archive (ENA) - record of the world's nucleotide sequencing information (>2,400 million sequences, >7,200 billion bases) • European Variation Archive - sole international resource for human and non-human variation
  9. 9. Data resources for Genomics – Genes, Genomes & Variation • Ensembl - genome browser (human: >0.6 billion SNV, >6 million SV) • Ensembl Genomes - 275 vertebrate species / strains; Metazoa; Plants; Fungi; Protists; Bacteria • GWAS Catalog - moved to EBI in 2015 (4,390 publications, >17,000 associations) • HGNC - 41,787 approved gene entries (19,320 protein coding) • International Genome Sample Resource - ensures future usability and accessibility of 1000 Genomes Project data
  10. 10. The Ensembl Variant Effect Predictor: VEP started as a simple wrapper around the Ensembl API to map variants to transcripts and predict molecular consequence. As new data sets and algorithms have become available, functionality has increased and VEP is now an extensive and sophisticated tool.
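
As an illustration of how variant-to-consequence lookups can be scripted, the sketch below builds a request for the public Ensembl REST service's `vep/:species/id/:id` endpoint. This is one possible access route, not necessarily the one the slide refers to (VEP is also distributed as a command-line tool). The helper names `vep_id_url` and `fetch_vep` are hypothetical, and `rs56116432` is only an example identifier.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def vep_id_url(variant_id: str, species: str = "human") -> str:
    """Build the URL for the REST vep/:species/id/:id endpoint."""
    return (
        f"{ENSEMBL_REST}/vep/{urllib.parse.quote(species)}"
        f"/id/{urllib.parse.quote(variant_id)}"
    )


def fetch_vep(variant_id: str, species: str = "human"):
    """Query VEP over the network and return the decoded JSON response."""
    request = urllib.request.Request(
        vep_id_url(variant_id, species),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access; fetch_vep() does.
print(vep_id_url("rs56116432"))
```

Calling `fetch_vep("rs56116432")` requires network access; the structure of the returned JSON is described in the Ensembl REST documentation and is not assumed here.
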
  11. 11. New resource for Genomics • New resource for gene expression and splicing QTLs • https://www.ebi.ac.uk/eqtl/
  12. 12. Global Alliance for Genomics and Health (GA4GH) • Chaired by EMBL-EBI Director Ewan Birney • EMBL-EBI teams leading various activities in Technical Work Streams: • Large Scale Genomics (file formats and htsget subgroups) • Clinical and phenotypic data capture • Data Use and Researcher identification • ENA/EGA/EVA and HCA DCP are also Driver Projects
  13. 13. Data resources for Genomics – Molecular Atlas • Human Cell Atlas Data Coordination Platform • In 2017, Chan Zuckerberg Initiative (CZI) funding to EMBL-EBI, Broad Institute and the UCSC Genomics Institute, to build a cloud-based data coordination platform • HCA will generate petabytes of data for billions of cells, across multiple modalities, generated by hundreds of labs around the world • DCP will organise, curate, standardise and analyse this data and enable open data access
  14. 14. Data resources for Genomics – Proteins and Protein Families • A free to use resource for the archiving, assembly, analysis, & browsing of microbiome data • Data archiving • Assembly • Analysis
  15. 15. NEW Resource: BioImage Archive • Scales: Molecules; Molecular Machines; Cells; Tissues / Organisms • Technologies: Light Sheet Microscopy; High Throughput Microscopy; Superresolution Microscopy; Cryo Electron Microscopy • Correlate Technologies, Integrate Data • Data volumes shown: 0.1 TB / day, 0.5 TB / dataset; 0.5 TB / day, 7.5 TB / dataset; 40 TB / day, 10 TB / dataset; 5 TB / day, 20 TB / dataset • Graphic courtesy of Jan Ellenberg
  16. 16. Data-driven discovery Research www.ebi.ac.uk/research
  17. 17. Research groups at EMBL-EBI: Zamin Iqbal, Thomas Keane, John Marioni, Janet Thornton, Andrew Leach, Evangelia Petsalaki, Virginie Uhlmann, Daniel Zerbino, Paul Flicek, Nick Goldman, Rob Finn, Alvis Brazma, Pedro Beltrao, Alex Bateman, Ewan Birney, Moritz Gerstung, Isidro Cortes-Ciriano, Irene Papatheodorou • In 2018, EMBL-EBI had 165 grants awarded, 120 jointly funded with researchers and institutes in 62 countries
  18. 18. Pedro Beltrao: Functional landscape of the human phosphoproteome (Ochoa et al., Nature Biotech 2019) • Created largest phosphoproteome resource to date (120,000 human phosphosites) • Used machine learning methods to compile and analyse large phosphorylation-related biological datasets • Identifying new functional phosphosites has enormous potential to progress research into many biological processes and diseases
  19. 19. Evangelia Petsalaki: Inference of kinase-kinase regulatory networks from phosphoproteomics data (collaboration with Beltrao group) Invergo*, Petursson* et al., bioRxiv
  20. 20. Moritz Gerstung: Pan-cancer computational histopathology • Analysis with deep learning extracts histopathological patterns • accurately discriminates 28 cancer and 14 normal tissue types • Predicts: whole genome duplications; focal amplifications and deletions; driver gene mutations • Correlations with gene expression indicative of immune infiltration and proliferation • Prognostic information augments conventional grading and histopathology subtyping https://doi.org/10.1101/813543
  21. 21. Zam Iqbal: Mykrobe – predicting TB drug resistance from WGS data https://wellcomeopenresearch.org/articles/4-191/v1
  22. 22. Virginie Uhlmann: Mathematical models for bioimage analysis doi.org/10.1371/journal.pone.0173433
  23. 23. Dictionary Learning for Two-Dimensional Kendall Shapes https://arxiv.org/abs/1903.11356
  24. 24. An example of best practice for complex datasets Single Cell RNA-Seq analysis at EMBL-EBI From Irene Papatheodorou Team Leader – Gene Expression
  25. 25. ArrayExpress – functional genomics archive • started in 2000 as an archive for microarray data • evolved into general archive for high-throughput functional genomics data (microarray- or NGS- based) • all data are manually curated prior to inclusion • microarray data stored directly in ArrayExpress • sequencing data brokered to and stored in ENA • curated datasets support reproducible and re-usable research
  26. 26. Annotare – Minimum information about a scRNA-Seq experiment [Diagram labels: single cell isolation; single cell well quality (OK, doublet, debris); single cell identifier; barcode; UMI; cDNA read; post-analysis single cell quality (pass, fail); library construction; inferred cell type; R1, R2, I1 files; sample metadata] https://arxiv.org/abs/1910.14623
  27. 27. From database to knowledgebase: Expression Atlases • > 3,500 bulk datasets (165 baseline expression, ~3,350 differential expression), 62 species, > 955,000 assays • > 120 single-cell datasets, 12 species https://www.ebi.ac.uk/gxa
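
The slide lists a cell barcode, a UMI and a cDNA read among the items to record. As a hedged illustration of why the read layout belongs in the metadata, the sketch below splits a barcode read into its two parts. The layout (a 16 bp barcode followed directly by a 10 bp UMI), the example sequence and the names `ReadTag` and `split_barcode_read` are assumptions made for illustration; real layouts depend on the library construction protocol.

```python
from typing import NamedTuple


class ReadTag(NamedTuple):
    cell_barcode: str
    umi: str


def split_barcode_read(sequence: str, barcode_len: int = 16, umi_len: int = 10) -> ReadTag:
    """Split a barcode read into cell barcode and UMI.

    Assumed layout: the barcode comes first and the UMI follows immediately.
    Any bases after the UMI are ignored.
    """
    if len(sequence) < barcode_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    return ReadTag(
        cell_barcode=sequence[:barcode_len],
        umi=sequence[barcode_len:barcode_len + umi_len],
    )


# Made-up read: 16 bp barcode, 10 bp UMI, 4 trailing bases.
tag = split_barcode_read("AAACCTGAGAAACCAT" + "TTGCATGCAA" + "TTTT")
print(tag.cell_barcode, tag.umi)
```

Without the barcode and UMI lengths recorded alongside the files, a re-analysis would have to guess these offsets.
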
  28. 28. https://www.ebi.ac.uk/gxa/sc/home
  29. 29. Interactive Analysis with Galaxy https://humancellatlas.usegalaxy.eu/ Flexible Interoperable Scalable
  30. 30. Main Points • Enabling rational choices when composing workflows • Using a common exchange format as ‘workflow glue’ • Galaxy integrations
  31. 31. What people usually do... [Diagram: three separate pipelines, each Read → Filter → Normalise → Compare → Cluster → Markers, joined by OR]
  32. 32. What we really should be doing [Diagram: Read → Filter → Normalise → Compare → Cluster → Markers]
  33. 33. ... but to do that we need interoperable components • Problem 1: components in different languages • Problem 2: need format glue! [Diagram: the Read → Filter → Normalise → Compare → Cluster → Markers pipeline repeated seven times]
  34. 34. Our solution [Diagram: Read → Filter → Normalise → Compare → Cluster → Markers, with a CLI for each step in a Scripts layer; further layers: Environments & containers; Workflows]
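
The idea of interchangeable components glued together by a common exchange format can be sketched in a few lines. In this toy version a plain dictionary stands in for the shared file format and Python functions stand in for the CLI scripts; all function names, the registry and the count values are hypothetical and are not taken from the EMBL-EBI tooling.

```python
from typing import Any, Callable, Dict, List, Tuple

# Stand-in for the common exchange format that every component reads and writes.
Data = Dict[str, Any]
Step = Callable[[Data], Data]


def read_toy_counts(data: Data) -> Data:
    # Made-up total counts per cell.
    return {**data, "counts": {"cell_a": 1200, "cell_b": 40, "cell_c": 900}}


def filter_min_counts(data: Data) -> Data:
    # Drop cells with fewer than 100 counts.
    kept = {cell: n for cell, n in data["counts"].items() if n >= 100}
    return {**data, "counts": kept}


def normalise_by_max(data: Data) -> Data:
    top = max(data["counts"].values())
    return {**data, "normalised": {c: n / top for c, n in data["counts"].items()}}


def normalise_by_sum(data: Data) -> Data:
    total = sum(data["counts"].values())
    return {**data, "normalised": {c: n / total for c, n in data["counts"].items()}}


# Several implementations can be registered for the same stage.
COMPONENTS: Dict[str, Dict[str, Step]] = {
    "read": {"toy": read_toy_counts},
    "filter": {"min_counts": filter_min_counts},
    "normalise": {"max": normalise_by_max, "sum": normalise_by_sum},
}


def run_workflow(choices: List[Tuple[str, str]]) -> Data:
    """Run the chosen implementation of each stage, in order."""
    data: Data = {}
    for stage, implementation in choices:
        data = COMPONENTS[stage][implementation](data)
    return data


by_sum = run_workflow([("read", "toy"), ("filter", "min_counts"), ("normalise", "sum")])
by_max = run_workflow([("read", "toy"), ("filter", "min_counts"), ("normalise", "max")])
print(by_sum["normalised"])
print(by_max["normalised"])
```

Because every step accepts and returns the same structure, swapping one normalisation method for another changes a single entry in the workflow description and nothing else.
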
  35. 35. Galaxy integrations • Extended Galaxy init container: • Thin tool wrappers leveraging Bioconda wrappers • Starting tertiary workflows • Added logic for dynamic destinations • Leverage existing Kubernetes integrations • Improved LSF functionality for non-DRMAA clusters: • Improved CLI executor https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary Pablo Moreno
  36. 36. Summary • ArrayExpress/Annotare for data Submissions • Expression Atlas/Single Cell Expression Atlas • Analysis Workflows in Galaxy
  37. 37. Open Targets Data integration Platforms
  38. 38. Drug discovery • Finding the right biological target for a drug requires bioinformatics to: • identify promising targets • select candidate medicines. • EMBL-EBI services support all stages of drug discovery: • Ensembl • UniProt • ChEMBL • Protein Data Bank in Europe • Reactome
  39. 39. Open Targets www.opentargets.org • Pinpointing the processes in the human body that have a demonstrable effect on disease • Aims to improve the success rate in the discovery and repurposing of medicines • A new kind of collaboration with: • GSK • EMBL-EBI • Wellcome Sanger Institute • Biogen • Takeda • Celgene • Sanofi
  40. 40. Open Targets Platform and Open Targets Genetics www.targetvalidation.org genetics.opentargets.org
  41. 41. Challenges for the near future • Non-coding SNVs • Data standardization to enable AI/ML • Connecting data • Moving to the cloud
  42. 42. www.ebi.ac.uk Stay in touch Twitter: @emblebi Facebook: EMBLEBI LinkedIn: /company/ebi YouTube: EMBLMedia