E-Biothon Platform Accelerates Genomics Research
1. e-Biothon
V. Breton (breton@clermont.in2p3.fr)
LPC Clermont-Ferrand, IdGC
CNRS-IN2P3
http://france-grilles.fr
Credit: N. Bard, A. Franc, JF Gibrat
Extreme Performance Computational Science workshop
Tokyo, April 15th 2014
2. Table of contents
• What are the computing challenges of life sciences?
• France Grilles: a multidisciplinary distributed e-infrastructure for science
• E-Biothon: an HPC platform for research in life sciences
3. Generalities on sequencing
• Genome = DNA sequence (4 nucleotides: A, C, G, T)
– Smallest non-viral genome: Carsonella ruddii (0.16 Mbp)
– Largest genome: Polychaos dubium (670 Gbp)
4. Evolution of sequencing techniques
Sanger technology: 500 bp sequences
454 technology: 10^5 reads of 450 to 600 bp sequences
Illumina technology: 10^6 reads of 100 bp sequences
Current projects (Tara): 10^7 reads of 100 to 400 bp sequences
Explosion of data set size
Data analysis? Algorithms? Heuristics?
Tara @ http://oceans.taraexpeditions.org/
5. Data production is distributed
2558 high-throughput "next-generation" sequencing facilities in the world,
located in 920 centers (only 10 with more than 15 machines)
Source: omicspmaps.com
7. Sequencing scenarios
• Interest in a new genome requires assembly
– The process of taking a large number of short DNA sequences and putting them back together to create a representation of the original
– Algorithms based on read overlapping benefit from large RAM (1 TB) -> HPC
• Working with a reference genome requires comparative analysis
– Alignment algorithms (BLAST) find regions of local similarity between sequences
– Phylogeny algorithms (PhyML) build evolutionary relationships between genomes
– Comparative analyses are easily parallelized at data level -> HTC
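The data-level parallelism mentioned in the last bullet can be sketched in a few lines of Python. This is only an illustration: the k-mer similarity below is a toy stand-in for a real comparison tool such as BLAST, and the function names (`shared_kmer_fraction`, `compare_pair`, `compare_all`) are hypothetical. The point is that each pair of sequences is an independent task, so the list of pairs can be split across workers with no communication between them.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def shared_kmer_fraction(a: str, b: str, k: int = 3) -> float:
    """Toy similarity (assumption, not BLAST): Jaccard index of the k-mer sets."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


def compare_pair(pair):
    """One independent task: compare two named sequences."""
    (name_a, seq_a), (name_b, seq_b) = pair
    return (name_a, name_b, shared_kmer_fraction(seq_a, seq_b))


def compare_all(sequences: dict, workers: int = 4) -> list:
    """Compare every unordered pair; tasks share no state, so any worker can take any pair."""
    pairs = list(combinations(sorted(sequences.items()), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in the same order as the input pairs.
        return list(pool.map(compare_pair, pairs))


if __name__ == "__main__":
    toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
    for name_a, name_b, score in compare_all(toy):
        print(name_a, name_b, round(score, 2))
```

A thread pool is used here only to keep the sketch short and portable; on a grid or cloud, each pair (or batch of pairs) would typically be submitted as a separate job.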
8. Summary
• Life sciences have specific computational challenges
– Data production grows faster than Moore's law
– Permanent need to compare new data to existing data
• Life sciences needs can be relevantly addressed on multidisciplinary IT infrastructures (e-infrastructures)
– HPC resources are best suited to genome assembly
– Grid/cloud HTC resources are well suited to comparative analysis
• Life sciences are among the main users of the French
national grid/cloud production infrastructure
9. France Grilles
• Is a Scientific Interest Group…
– Created in 2010 by 8 partners: CEA, CNRS, CPU, INRA, INRIA, INSERM, MESR, RENATER…
– To steer and coordinate the national strategy in the fields of grids and clouds
• Vision:
– Build and operate a national distributed computing infrastructure open to all sciences and to developing countries
10. France Grilles model
• France Grilles does not own the resources
– Resources are owned by user communities
• France Grilles provides a framework
– To share resources, expertise and know-how
– To promote innovation and initiatives
– To foster collaboration at national and international
levels
– To reach out to the long tail of users
12. EGI from 2010 to 2013
2010-2013: from 14 regional to 34 operations centres in 53 countries,
from 188,000 jobs/day with 80,000 cores on 250 Resource Centres
to 1,200,000 jobs/day with 430,000 cores on 337 Resource Centres
Technologies
• Grids
• Clouds
• Desktops
Talk by S. Newhouse, Madrid, Sept. 2013
France Grilles, a partner of EGI
14. Provide an open environment for fruitful disciplinary and multidisciplinary research
[Figure: chart with a logarithmic axis (1 to 1000); labels not recovered]
Over 1500 scientific publications
June 2010 – April 2014
15. Web portal
Users
479 registered users in Nov 2013 (175 in France)
Most used robot certificate in EGI (http://go.egi.eu/wiki.robot.users)
Scientific applications
• Neuro-image analysis: brain tissue segmentation with Freesurfer
• Cancer therapy simulation: prostate radiotherapy plan simulated with GATE (L. Grevillot and D. Sarrut)
• Image simulation: echocardiography simulated with FIELD-II (O. Bernard et al.)
• Modeling and optimization of distributed computing systems: acceleration yielded by non-clairvoyant task replication (R. Ferreira da Silva et al.)
Infrastructure
Supported by EGI Infrastructure
Uses biomed VO (most used EGI VO for life sciences in 2013)
VIP accounts for ~25% of biomed's activity
VIP consumes ~50 CPU years every month
DIRAC
France-Grilles
Application as a service
File transfer to/from grid
Virtual Imaging Platform:
http://www.creatis.insa-lyon.fr/vip
16. Collaborations with dedicated life sciences infrastructures
• Institut Français de Bioinformatique (computing and storage resources at IDRIS)
• France Génomique (computing and storage resources at TGCC)
• France Life Imaging (infrastructure for biomedical imaging)
• E-Biothon
17. E-Biothon: history
• Telethon: every year, fundraising by French media for the French Muscular Dystrophy Association (AFM)
• From Telethon to Decrypthon
– Computing infrastructure (IBM)
– Research projects (CNRS)
– Human resources (AFM)
• From Decrypthon to E-Biothon
18. e-Biothon: an HPC platform for research in life sciences
[Diagram labels: user support, technical support, Blue Gene/P machines, Blue Gene/P operation, web access portal]
19. E-Biothon: infrastructure
• 2 IBM Blue Gene/P racks with 200 TB storage
– 2 x 1024 4-core nodes
– up to 28 TFlops peak performance
• SysFera-DS web access to computing resources
• 2 modes:
– Standard (MPI)
– HTC (1024 independent tasks in parallel)
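The peak figure quoted for the two racks can be checked with a short calculation. The node and core counts come from the slide; the clock frequency (850 MHz) and the 4 floating-point operations per cycle per core are assumptions taken from the published Blue Gene/P specifications, not from this presentation.

```python
RACKS = 2
NODES_PER_RACK = 1024
CORES_PER_NODE = 4
CLOCK_GHZ = 0.85       # assumption: PowerPC 450 core clock on Blue Gene/P
FLOPS_PER_CYCLE = 4    # assumption: dual-pipeline fused multiply-add per core


def peak_tflops(racks: int = RACKS) -> float:
    """Theoretical peak in TFlops: cores x clock (GHz) x flops per cycle / 1000."""
    cores = racks * NODES_PER_RACK * CORES_PER_NODE
    return cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0


if __name__ == "__main__":
    print(f"{peak_tflops():.2f} TFlops peak")
```

Under these assumptions the result is just under 28 TFlops, consistent with the "up to 28 TFlops" stated on the slide.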
20. E-Biothon vision is to offer a service to
the user communities in life sciences
• 2013-2014: first 3 projects
– Jean-François Gibrat et al. (MIGALE platform, INRA Jouy-en-Josas)
– Olivier Gascuel, Stéphane Guindon and Vincent Lefort (CNRS Montpellier)
– Yec’han Laizet, Philippe Chaumeil, Jean-Marc Frigerio, Stéphanie Mariette, Sophie Gerber, Alain Franc (INRA BioGeCo – Bordeaux)
• > 2014: open call for projects (IFB)
21. Studying the synteny over a wide range of microbial genomes
• Definition: similar blocks of genes in the same relative positions in the genome
• Interest: the study of synteny can show how the genome is cut and pasted in the course of evolution
• The MIGALE team at INRA designed an analysis pipeline to compute synteny between 2 genomes and store it in a database
• E-Biothon impact: change in scale, with the capacity to compute synteny between 2000 complete bacterial genomes (7 million comparisons)
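To make the definition of synteny concrete, here is a deliberately simplified sketch. It is not the MIGALE pipeline: it assumes each genome is given as an ordered list of gene identifiers, that homologous genes already share the same identifier, and that each identifier occurs at most once per genome. It ignores inversions, gaps and duplications. The function name `collinear_blocks` is hypothetical.

```python
def collinear_blocks(genome_a, genome_b, min_len=2):
    """Return maximal runs of genes that are adjacent and in the same order in both genomes.

    Simplifying assumptions: shared identifiers mark homologous genes,
    and each identifier appears at most once per genome.
    """
    pos_b = {gene: i for i, gene in enumerate(genome_b)}
    blocks, current = [], []
    for gene in genome_a:
        extends = (
            gene in pos_b
            and current
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
        else:
            # Close the running block, then start a new one if the gene is shared.
            if len(current) >= min_len:
                blocks.append(current)
            current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    a = ["g1", "g2", "g3", "g4", "g5", "g6"]
    b = ["g4", "g5", "g6", "x", "g1", "g2", "g3"]
    print(collinear_blocks(a, b))
```

In this toy example the two genomes share two blocks that have swapped places, which is the kind of "cut and paste" event the slide refers to. Because each pair of genomes is processed independently, such a computation fits the HTC mode well.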
22. PhyML
Phylogenetics is the study of evolutionary relationships among groups of organisms
PhyML is a software package that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences
The original PhyML publication in 2007 is the most cited in environment and ecology (> 6000 citations).
E-Biothon impact: change in scale in the resources made available to PhyML users
25. Study of biodiversity in French Guiana
16000 different tree species in the Amazonian forest (≈ 300 in Europe)
More biodiversity in 10000 m² of forest in French Guiana than in Europe
Decrypthon added value
Change in scale (from local mesocenter in Bordeaux)
Millions of reads
Exact distance computation without heuristics (alignment scores)
Terabytes of data produced every week
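As an illustration of what an exact, heuristic-free alignment score involves, the sketch below implements the textbook Needleman-Wunsch dynamic program for a global alignment score. The slide does not say which alignment method or scoring scheme the project used, so the algorithm choice and the match, mismatch and gap values here are assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Exact global alignment score (Needleman-Wunsch), keeping two rows in memory.

    Scoring values are illustrative assumptions, not the project's parameters.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b` costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The cost is proportional to the product of the two sequence lengths for every pair of reads, which is why computing exact scores over millions of reads calls for the change in scale described above.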
26. Conclusion
• Both HPC and HTC resources are increasingly needed to address life sciences data and computing challenges:
– As sequencing technologies keep evolving, data production grows faster than Moore's law and is increasingly distributed
– Biological data need to be constantly compared to each other (phylogenetics, comparative genomic analysis)
• France is developing complementary HPC and HTC infrastructures for life sciences
– Institut Français de Bioinformatique, France Génomique
– E-Biothon: an HPC platform for research in life sciences
– France Grilles: a multidisciplinary grid/cloud production infrastructure
30. Are life sciences specific w.r.t. computing?
What is specific to life sciences:
- As sequencing technologies keep evolving, data production grows faster than Moore's law
- Biological data need to be constantly compared to each other (phylogenetics, comparative genomic analysis)
What is not specific?
- Data production is distributed
- Multiscale modeling