Future Architectures for
genomics
Guy Coates
Wellcome Trust Sanger Institute

gmpc@sanger.ac.uk
The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus,
Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• Large scale sequencing with an impact on human and animal health.

Data is freely available.
• Websites, ftp, direct database access,
programmatic APIs.
• Some restrictions for potentially
identifiable data.

My team:
• Scientific computing systems architects.
The not quite so scary graph

Peak Yearly capillary
sequencing: 30 Gbase

Current weekly sequencing:
7-10 Tbases. Not increasing as aggressively
as the historical trend.
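
To put the two figures above on the same scale, here is a small back-of-envelope sketch. It uses only the numbers on the slide; the one assumption is that the weekly rate holds for all 52 weeks of a year.

```python
# Back-of-envelope comparison using only the figures quoted on the slide.
CAPILLARY_PEAK_PER_YEAR = 30e9   # 30 Gbase in the peak capillary year
WEEKLY_LOW = 7e12                # 7 Tbase per week
WEEKLY_HIGH = 10e12              # 10 Tbase per week
WEEKS_PER_YEAR = 52              # assumption: constant output all year


def yearly_ratio(weekly_bases: float) -> float:
    """Annualised current output divided by the capillary-era peak year."""
    return weekly_bases * WEEKS_PER_YEAR / CAPILLARY_PEAK_PER_YEAR


for weekly in (WEEKLY_LOW, WEEKLY_HIGH):
    print(f"{weekly / 1e12:.0f} Tbase/week is roughly "
          f"{yearly_ratio(weekly):,.0f}x the capillary peak year")
```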
New CPU Architectures
MRSA Outbreak
MRSA:
• Antibiotic resistant bacteria.
• Carried by many people with no symptoms.
• But can cause serious diseases (boils, sepsis, necrotising fasciitis).
• Bad news in hospitals (14 patients presented abscesses during the
outbreak)

Infection control via traditional methods
• Compare antibiotic resistance profiles of samples.
• Epidemiology detective work to find common factors.
• Close wards for deep cleans.
3 different outbreaks of MRSA over 6 months
Sequencing Advantages
Sequence samples from infections:
• Compare sequence and build phylogenetic trees.
Identified extra cases, missed by traditional screening.
• Traced infection to a staff member (asymptomatic).
“Real Time” information:
• culture → sequence → informatics within 48 hrs.
Aids, not replaces, current infection control protocols.
[Figure: phylogenetic tree of the outbreak isolates, with tip labels in the range P1 to P26 plus H and F, and one branch annotated “Loss of ermC plasmid”.]
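
As an illustration of "compare sequence", here is a minimal sketch that counts single-base differences between aligned sequences, the kind of pairwise distance a tree-building step can start from. It is not the method used in the study; the isolate names and sequences are made up for the example.

```python
from itertools import combinations
from typing import Dict, Tuple


def snp_distance(a: str, b: str) -> int:
    """Count positions where two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(samples: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Pairwise distances, keyed by (name, name) in sorted name order."""
    return {
        (p, q): snp_distance(samples[p], samples[q])
        for p, q in combinations(sorted(samples), 2)
    }


# Hypothetical toy isolates, not data from the outbreak.
samples = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACCTACGAAT",
}

matrix = distance_matrix(samples)
for pair, dist in matrix.items():
    print(pair, dist)
```

Isolates that differ by only a few bases sit close together on the tree, which is what lets sequencing link cases that traditional screening treats as unrelated.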
Informatics in a hospital
setting?
Can we produce lab friendly informatics as well as
sequencing?
• Low power, low footprint compute.
Informatics was run in our datacentre (20,000 cores, 20
PB storage)
• Can we do it on something you can plug into a 13 amp socket in the
lab?
ARM for Bioinformatics
Not as stupid as it first seems:
Bioinformatics code:
• Single threaded, integer, not floating point dominated.
Bacteria have small genomes:
• 3 Mbases (MRSA) vs 3 Gbases (Human)
• May not need 64bit memory address space for some problems.
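
A rough sketch of the address-space argument, using the genome sizes above. The one-byte-per-base encoding is an assumption for illustration; real tools also need room for indexes and working data.

```python
ADDRESS_SPACE_32BIT = 2 ** 32  # 4 GiB of addressable bytes

# Genome sizes from the slide.
GENOMES = {"MRSA": 3e6, "Human": 3e9}


def fraction_of_32bit_space(n_bases: float, bytes_per_base: float = 1.0) -> float:
    """Share of a 32-bit address space taken by the raw sequence alone.

    bytes_per_base=1.0 is an illustrative assumption, not a measured figure.
    """
    return n_bases * bytes_per_base / ADDRESS_SPACE_32BIT


for name, size in GENOMES.items():
    print(f"{name}: {fraction_of_32bit_space(size):.4%} of a 32-bit address space")
```

On these assumptions a bacterial genome is a negligible fraction of the space, while a human genome takes most of it before any index is built.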
Compute Farm Memory
footprint

Some jobs do have large memory footprints:
• But large number of jobs below 1.5GB.
Intel vs ARM
ARM:
• Calxeda Energy Core
• 4 core 1.0 GHz ARM A9, 4GB RAM.
Intel:
• IBM HS22
• 6 core Xeon X5650 (Westmere) 2.67 GHz, 36 GB RAM
• (Not the latest or greatest, but what we had available)
OS: Ubuntu 12.04 LTS
• Kernel 2.6.32 on Intel, 3.6 on ARM.
• gcc 4.6.3
Porting code:
• Surprisingly easy.
• R / Python / Perl: Interpreted languages worked in our favour!
• C “just worked”.
Bioinformatics Codes
Standard set of bioinformatics code:
• ARM ~5x slower than the Intel.
• Right ballpark to be more efficient on performance/watt.
[Figure: bar chart of speedup, ARM vs Intel, for exonerate, blastn, blastp and tblastn.]
Some exceptions:
• Exonerate 2.2: 250x slower.
  • Pointer chasing?
• Hmmer v3: uses sse2 on Intel.
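
The performance/watt argument can be sketched as simple arithmetic: energy is power multiplied by time, so a node that is 5x slower still uses less energy per job if it draws less than a fifth of the power. The wattages below are placeholders, not measured or datasheet values, and the sketch assumes both figures describe the same unit (for example, one node).

```python
def perf_per_watt_advantage(slowdown: float, arm_watts: float, intel_watts: float) -> float:
    """Work per joule on ARM relative to Intel (values above 1 favour ARM).

    slowdown is how many times longer the ARM system takes on the same job.
    """
    if min(slowdown, arm_watts, intel_watts) <= 0:
        raise ValueError("all inputs must be positive")
    return (intel_watts / arm_watts) / slowdown


def break_even_watts(slowdown: float, intel_watts: float) -> float:
    """ARM power draw below which ARM uses less energy per job."""
    return intel_watts / slowdown


# Placeholder wattages for illustration only.
print(perf_per_watt_advantage(slowdown=5, arm_watts=10, intel_watts=100))
print(break_even_watts(slowdown=5, intel_watts=100))
```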
SNP calling Pipeline
Pipeline: Map (BWA) → Sort (samtools) → SNP call (mpileup)
Looks for single base changes in DNA sequence
• C, Java with Perl glue.
Data sizes:
• 2 x 40 Mbyte sequence files.
• 3 Mbyte reference
Intel is about 4.3x the speed of the ARM
• 310 vs 1326 seconds.
[Figure: bar chart of speedup, ARM vs Intel.]
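
A minimal sketch of the three stages as a dry run. The slides do not give the exact command lines, so the invocations below are assumptions written in current bwa and samtools syntax, and the file names are hypothetical. Turning the pileup into final SNP calls is left out.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    argv: List[str]
    stdout_path: Optional[str] = None  # file that captures the tool's stdout


def build_steps(reference: str, reads_1: str, reads_2: str, prefix: str) -> List[Step]:
    """Map, sort and pileup steps; every file name here is hypothetical."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        Step("index", ["bwa", "index", reference]),
        Step("map", ["bwa", "mem", reference, reads_1, reads_2], sam),
        Step("sort", ["samtools", "sort", sam], bam),
        Step("pileup", ["samtools", "mpileup", "-f", reference, bam], f"{prefix}.pileup"),
    ]


def run(steps: List[Step], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for step in steps:
        if dry_run:
            target = f" > {step.stdout_path}" if step.stdout_path else ""
            print(f"[{step.name}] {' '.join(step.argv)}{target}")
        elif step.stdout_path:
            with open(step.stdout_path, "wb") as out:
                subprocess.run(step.argv, stdout=out, check=True)
        else:
            subprocess.run(step.argv, check=True)


steps = build_steps("reference.fa", "reads_1.fastq", "reads_2.fastq", "sample")
run(steps)  # dry run only: prints the commands without executing them

# Ratio of the two timings quoted on the slide (ARM seconds / Intel seconds).
print(f"Intel/ARM speed ratio: {1326 / 310:.2f}")
```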
Further Investigations
Scaling tests:
• ARM A9 showed poorer scaling than Intel as we loaded the cores up.
• Kernel related?
Explore compiler / JVM opts:
• Built with the package defaults (typically gcc / -O2).
• gcc vs icc etc.
Lots of change in the ecosystem:
• ARM A15 vs ARM 64 vs Atom etc.
Conclusions
Initial tests look promising.
• Code runs.
Is ARM cost effective?
• Numbers look to be in the right ballpark wrt performance / Watt.
  • going from datasheet power numbers.
We don't know yet on price/performance.
• We need to see real production systems so we can get real
power / $ numbers.
Object Storage
Storage for Sequencing
Sequencing produces lots of data.
• Lots of data + lots of people quickly becomes unmanageable.
We use iRODS to manage data.
• Stores data + *metadata* in a storage agnostic system.
Sequencing data flow.
[Diagram boxes: Sequencer, Processing/QC, analysis, datastore, Internet.]
Data types, from unstructured (flat files) to structured data (databases):
• Raw data (10 TB)
• Sequence (500GB)
• Alignments (200GB)
• Variation data (1GB)
• Feature (3MB)
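
The sizes on this slide imply a steep reduction at each stage. A short sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes) and taking the stage sizes exactly as quoted:

```python
# Stage sizes as quoted on the slide, in bytes (decimal units assumed).
STAGES = [
    ("Raw data", 10e12),
    ("Sequence", 500e9),
    ("Alignments", 200e9),
    ("Variation data", 1e9),
    ("Feature", 3e6),
]


def reduction_factors(stages):
    """Size ratio between each stage and the one that follows it."""
    return [
        (before[0], after[0], before[1] / after[1])
        for before, after in zip(stages, stages[1:])
    ]


factors = reduction_factors(STAGES)
for src, dst, factor in factors:
    print(f"{src} -> {dst}: {factor:,.1f}x smaller")

overall = STAGES[0][1] / STAGES[-1][1]
print(f"Raw data -> Feature overall: {overall:,.0f}x smaller")
```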
Sequencing data flow.
[Same diagram and data sizes as the previous slide, with an added callout: “Unmanaged data: Pbytes!”]
iRODS Data Management
[Diagram of iRODS components:]
• User interface: WebDAV, icommands, fuse.
• ICAT: catalogue database.
• Rule Engine: implements policies.
• iRODS servers, with data in S3, data in a database, and data on disk.
Sanger Implementation
Storage
• 5PB Storage (2 x 2.5 PB).
• Data stored on standard posix filesystems at the backend.
• Mixture of vendors and filesystem sizes.
• 40TB → 200TB chunks.
Database:
• Oracle 10g RAC.
Replicated:
• One copy in each of two sections of our datacentre.
  • (probably move 1 off site this year)

Federated:
• Split system to isolate research teams from one another.
• Still single namespace.
User interaction:
• Via CLI tools (think command line ftp) or via C API.
• Archival system, separate from our HPC lustre systems.
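
To give a feel for the CLI interaction, here is a sketch that builds (but does not run) typical icommands calls: `iput` to upload, `imeta add` to attach metadata, `imeta qu` to search and `iget` to download. The zone path, file names and attribute names are hypothetical; the slides do not show the actual commands or metadata schema in use.

```python
from typing import Dict, List


def archive_commands(local_path: str, irods_path: str,
                     metadata: Dict[str, str]) -> List[List[str]]:
    """Upload one file, then attach attribute/value pairs to the data object."""
    commands = [["iput", local_path, irods_path]]
    for attribute, value in metadata.items():
        commands.append(["imeta", "add", "-d", irods_path, attribute, str(value)])
    return commands


def query_command(attribute: str, value: str) -> List[str]:
    """Find data objects whose metadata attribute equals the given value."""
    return ["imeta", "qu", "-d", attribute, "=", str(value)]


def fetch_command(irods_path: str, local_path: str) -> List[str]:
    """Download a data object to a local file."""
    return ["iget", irods_path, local_path]


# Hypothetical zone, path and attributes, for illustration only.
obj = "/exampleZone/seq/run_0001/lane1.bam"
commands = archive_commands("lane1.bam", obj, {"study": "example_study", "lane": "1"})
commands.append(query_command("study", "example_study"))
commands.append(fetch_command(obj, "lane1.copy.bam"))

for argv in commands:
    print(" ".join(argv))
```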
iRODS Filesystem Issues
We have lots of filesystems behind iRODS.
• 60 filesystems.
Filesystems need TLC.
• They get full.
• Sometimes they go wrong.
  • Can you fsck a 200TB filesystem?

Is there an alternative storage backend that is simpler or
cheaper?
Object Stores
System for storing objects (files)
• “put” and “get” semantics
Not POSIX:
• Fewer features, but simpler to implement; should be more scalable and
robust.
• No directory structure; just a set of object IDs.
• You need to implement your own organisational schema on top of the
object store.

Lots of alternatives
• Commercial, open source, hardware or software based.
• Different approaches to data integrity / DR.
  • Replication vs erasure coding.
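
The trade-off between the two approaches comes down to raw capacity per usable byte against the number of lost copies or fragments that can be survived. A small sketch of that arithmetic; the 10+4 erasure-coding layout is an illustrative assumption, not a product described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    data_fragments: int    # fragments that hold the data itself
    extra_fragments: int   # additional copies or parity fragments

    @property
    def raw_per_usable(self) -> float:
        """Raw storage consumed per byte of usable data."""
        return (self.data_fragments + self.extra_fragments) / self.data_fragments

    @property
    def losses_tolerated(self) -> int:
        """How many fragments (or copies) can be lost without losing data."""
        return self.extra_fragments


two_copies = Scheme("replication, 2 copies", data_fragments=1, extra_fragments=1)
erasure = Scheme("erasure coding, 10+4 (hypothetical)", data_fragments=10, extra_fragments=4)

for scheme in (two_copies, erasure):
    print(f"{scheme.name}: {scheme.raw_per_usable:.1f}x raw storage, "
          f"survives {scheme.losses_tolerated} loss(es)")
```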

No standard APIs.
• Amazon S3 de facto API.
• Lowest common denominator:
  • (S3 currently does not support seek operations, which is important if we
are dealing with large structured files.)
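
To make "put and get semantics" and "your own organisational schema" concrete, here is a toy in-memory sketch. It is not the API of any real object store: the class names are invented, and using a content hash as the object ID is just one possible choice.

```python
import hashlib
from typing import Dict, List, Tuple


class ToyObjectStore:
    """Flat store: objects are addressed only by an opaque ID."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        object_id = hashlib.sha256(data).hexdigest()  # assumption: content-derived ID
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


class ToyCatalogue:
    """Organisational schema layered on top: names and metadata map to IDs."""

    def __init__(self, store: ToyObjectStore) -> None:
        self._store = store
        self._entries: Dict[str, Tuple[str, Dict[str, str]]] = {}

    def add(self, name: str, data: bytes, **metadata: str) -> str:
        object_id = self._store.put(data)
        self._entries[name] = (object_id, dict(metadata))
        return object_id

    def fetch(self, name: str) -> bytes:
        return self._store.get(self._entries[name][0])

    def search(self, **wanted: str) -> List[str]:
        """Names whose metadata contains every requested key/value pair."""
        return sorted(
            name for name, (_, meta) in self._entries.items()
            if all(meta.get(key) == value for key, value in wanted.items())
        )


store = ToyObjectStore()
catalogue = ToyCatalogue(store)
catalogue.add("run1/lane1.bam", b"first file", study="example", lane="1")
catalogue.add("run1/lane2.bam", b"second file", study="example", lane="2")
print(catalogue.search(study="example"))
print(catalogue.fetch("run1/lane2.bam"))
```

The store itself knows nothing about names or directories; everything a user would recognise as structure lives in the catalogue layer.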
Object store and iRODS
Object stores are interesting, but very different from our
POSIX world.
Conceptually object is a good fit for iRODS.
Transparency:
• iRODS is storage back-end agnostic.
• Putting object store behind iRODS makes it transparent to the end user.
Provides a good organisational schema.
• Searchable metadata.

(Potentially) simplifies storage administration.
Questions we need to answer
How does iRODS replication and object store replication
interact?
• iRODS knows how to replicate objects.
• Most (all?) object stores have replication / erasure coding mechanisms.
• What is the right level to do the replication at?
How are seek operations handled?
• We can currently pull records out of BAM files without having to download
the entire file.
• Will this still work on object store?
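
Why seek matters can be shown with a toy comparison: reading one record from a seekable file moves only that record, while a store that offers whole-object "get" only has to move the entire object first. This is an illustration with an in-memory stand-in, not a BAM reader and not the API of any real object store.

```python
import io
from typing import Dict, Tuple


def read_with_seek(handle: io.BufferedIOBase, offset: int, length: int) -> Tuple[bytes, int]:
    """POSIX-style access: jump to the record, read only what is needed."""
    handle.seek(offset)
    data = handle.read(length)
    return data, len(data)  # (record, bytes moved)


class GetOnlyStore:
    """Hypothetical store that can only return whole objects."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


def read_without_seek(store: GetOnlyStore, object_id: str,
                      offset: int, length: int) -> Tuple[bytes, int]:
    """Fetch the whole object, then slice the record out locally."""
    blob = store.get(object_id)
    return blob[offset:offset + length], len(blob)  # (record, bytes moved)


# Stand-in for a large structured file with one record of interest.
record = b"RECORD-OF-INTEREST"
offset = 500_000
payload = bytes(offset) + record + bytes(1_000_000 - offset - len(record))

store = GetOnlyStore()
store.put("big-file", payload)

seek_data, seek_moved = read_with_seek(io.BytesIO(payload), offset, len(record))
get_data, get_moved = read_without_seek(store, "big-file", offset, len(record))
print(seek_moved, get_moved)
```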

Data locality
• Important in multi-site / federated iRODS installations.
• If I get an object from iRODS, I'd like to talk to the storage elements
“nearest” to me on the network.
• Many ways to potentially tackle this:
  • Loadbalancers, proxies and other network tricks.
  • Make iRODS aware of the object store topology.
  • Not clear what the best mechanism will be.
Acknowledgements
My Team:
• Pete Clapham (ARM & iRODS)
• James Beal
• Helen Brimmer
Karl Freund
• Calxeda
Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus
aureus: a descriptive study
Lancet Infect Dis. 2013 February; 13(2): 130–136.
doi: 10.1016/S1473-3099(12)70268-2

 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Future Architectures for genomics

  • 1. Future Architectures for genomics Guy Coates Wellcome Trust Sanger Institute gmpc@sanger.ac.uk
  • 2. The Sanger Institute Funded by Wellcome Trust. • 2nd largest research charity in the world. • ~700 employees. • Based at the Hinxton Genome Campus, Cambridge, UK. Large scale genomic research. • Sequenced 1/3 of the human genome (largest single contributor). • Large scale sequencing with an impact on human and animal health. Data is freely available. • Websites, ftp, direct database access, programmatic APIs. • Some restrictions for potentially identifiable data. My team: • Scientific computing systems architects.
  • 3. The not quite so scary graph Peak Yearly capillary sequencing: 30 Gbase Current weekly sequencing: 7-10 Tbases. Not increasing as aggressively as the historical trend.
  • 5.
  • 6. MRSA Outbreak MRSA: • Antibiotic resistant bacteria. • Carried by many people with no symptoms. • But can cause serious diseases (boils, sepsis, necrotising fasciitis). • Bad news in hospitals (14 patients presented with abscesses during the outbreak). Infection control via traditional methods • Compare antibiotic resistance profiles of samples. • Epidemiology detective work to find common factors. • Close wards for deep cleans. 3 different outbreaks of MRSA over 6 months.
  • 7. Sequencing Advantages Sequence samples from infections: • Compare sequence and build phylogenetic trees. Identified extra cases, missed by traditional screening. • Traced infection to a staff member (asymptomatic). “Real Time” information: • culture → sequence → informatics within 48 hrs. Aids, not replaces, current infection control protocols. [Figure: phylogenetic tree of the outbreak samples, with tips labelled by P-numbers (up to P26), H and F; one branch is annotated “Loss of ermC plasmid”.]
  • 8. Informatics in a hospital setting? Can we produce lab-friendly informatics as well as sequencing? • Low power, low footprint compute. Informatics was run in our datacentre (20,000 cores, 20 PB storage). • Can we do it on something you can plug into a 13 amp socket in the lab?
  • 9. ARM for Bioinformatics Not as stupid as it first seems: Bioinformatics code: • Single threaded, integer, not floating point dominated. Bacteria have small genomes: • 3 Mbases (MRSA) vs 3 Gbases (Human). • May not need a 64-bit memory address space for some problems.
  • 10. Compute Farm Memory footprint Some jobs do have large memory footprints: • But a large number of jobs are below 1.5 GB.
  • 11. Intel vs ARM ARM: • Calxeda Energy Core • 4 core 1.0 GHz ARM A9, 4GB RAM. Intel: • IBM HS22 • 6 core Xeon X5650 (Westmere) 2.67 GHz, 36 GB RAM • (Not the latest or greatest, but what we had available) OS: Ubuntu 12.04 LTS • Kernel 2.6.32 on Intel, 3.6 on ARM. • gcc 4.6.3 Porting code: • Surprisingly easy. • R / Python / Perl: Interpreted languages worked in our favour! • C “just worked”.
  • 12. Bioinformatics Codes Standard set of bioinformatics code: • ARM ~5x slower than the Intel. • Right ballpark to be more efficient on performance/watt. Some exceptions: • Exonerate 2.2: 250x slower. • Pointer chasing? • Hmmer v3: uses SSE2 on Intel. [Chart: speedup, ARM vs Intel, for exonerate, blastn, blastp and tblastn.]
  • 13. SNP calling Pipeline Looks for single base changes in DNA sequence. • C, Java with Perl glue. Pipeline stages: • Map (BWA). • Sort (samtools). • SNP call (mpileup). Data sizes: • 2 x 40 Mbyte sequence files. • 3 Mbyte reference. Intel is 4.2x the speed of the ARM: • 310 vs 1326 seconds. [Chart: speedup, ARM vs Intel.]
  • 14. Further Investigations Scaling tests: • ARM A9 showed poorer scaling than Intel as we loaded the cores up. • Kernel related? Explore compiler / JVM options: • Built with the package defaults (typically gcc / -O2). • gcc vs icc etc. Lots of change in the ecosystem: • ARM A15 vs ARM 64 vs Atom etc.
  • 15. Conclusions Initial tests look promising. • Code runs. Is ARM cost effective? • Numbers look to be in the right ballpark wrt performance / Watt, going from datasheet power numbers. We don't know yet on price/performance. • We need to see real production systems so we can get real power / $ numbers.
  • 17. Storage for Sequencing Sequencing produces lots of data. • Lots of data + lots of people quickly becomes unmanageable. We use iRODS to manage data. • Stores data + *metadata* in a storage agnostic system.
  • 18. Sequencing data flow. [Diagram boxes: Sequencer, Processing/QC, analysis, datastore, Internet. Data labels, on a scale running from unstructured (flat files) to structured data (databases): Raw data (10 TB), Sequence (500 GB), Alignments (200 GB), Variation data (1 GB), Feature (3 MB).]
  • 19. Sequencing data flow. Unmanaged data: Pbytes! [Diagram boxes: Sequencer, Processing/QC, analysis, datastore, Internet. Data labels, on a scale running from unstructured (flat files) to structured data (databases): Raw data (10 TB), Sequence (500 GB), Alignments (200 GB), Variation data (1 GB), Feature (3 MB).]
  • 20. iRODS Data Management [Diagram: User interface (WebDAV, icommands, FUSE); iRODS servers with data in S3, data in a database and data on disk; ICAT catalogue database; Rule Engine (implements policies).]
  • 21.
  • 22.
  • 23. Sanger Implementation Storage • 5PB Storage (2 x 2.5 PB). • Data stored on standard POSIX filesystems at the backend. • Mixture of vendors and filesystem sizes. • 40TB → 200TB chunks. Database: • Oracle 10g RAC. Replicated: • One copy in each of two sections of our datacentre. • (probably move 1 off site this year) Federated: • Split system to isolate research teams from one another. • Still single namespace. User interaction: • Via CLI tools (think command line ftp) or via C API. • Archival system, separate from our HPC Lustre systems.
  • 24. iRODS Filesystem Issues We have lots of filesystems behind iRODS. • 60 filesystems. Filesystems need TLC. • They get full. • Sometimes they go wrong. • Can you fsck a 200TB filesystem? Is there an alternative storage backend that is simpler or cheaper?
  • 25. Object Stores System for storing objects (files) • “put” and “get” semantics. Not POSIX: • Fewer features, but simpler to implement; should be more scalable and robust. No directory structure; just a set of object IDs. • You need to implement your own organisational schema on top of the object store. Lots of alternatives • Commercial, open source, hardware or software based. • Different approaches to data integrity / DR. • Replication vs erasure coding. No standard APIs. • Amazon S3 is the de facto API. • Lowest common denominator (S3 currently does not support seek operations, which is important if we are dealing with large structured files).
  • 26. Object store and iRODS Object stores are interesting, but very different from our POSIX world. Conceptually, an object store is a good fit for iRODS. Transparency: • iRODS is storage back-end agnostic. • Putting an object store behind iRODS makes it transparent to the end user. Provides a good organisational schema. • Searchable metadata. (Potentially) simplifies storage administration.
  • 27. Questions we need to answer How does iRODS replication and object store replication interact? • iRODS knows how to replicate objects. • Most (all?) object stores have replication / erasure coding mechanisms. • What is the right level to do the replication at? How are seek operations handled? • We can currently pull records out of BAM files without having to download the entire file. Will this still work on an object store? Data locality • Important in multi-site / federated iRODS installations. • If I get an object from iRODS, I'd like to talk to the storage elements “nearest” to me on the network. Many ways to potentially tackle this: • Loadbalancers, proxies and other network tricks. • Make iRODS aware of the object store topology. • Not clear what the best mechanism will be.
  • 28. Acknowledgements My Team: • Pete Clapham (ARM & iRODS) • James Beal • Helen Brimmer • Karl Freund • Calxeda. Reference: Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study. Lancet Infect Dis. 2013 February; 13(2): 130–136. doi: 10.1016/S1473-3099(12)70268-2
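
The Intel vs ARM comparison on slides 12 to 15 comes down to simple arithmetic, which can be sketched in Python as follows. Only the two run times (310 s and 1326 s) come from the slides; the function names and the wattage figures are hypothetical placeholders for illustration, not measured or datasheet values.

```python
"""Speedup and energy-per-job arithmetic for the SNP calling pipeline."""

INTEL_SECONDS = 310.0   # run time quoted on slide 13
ARM_SECONDS = 1326.0    # run time quoted on slide 13


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast system finishes the same job."""
    return slow_seconds / fast_seconds


def energy_joules(seconds: float, watts: float) -> float:
    """Energy used by one run, assuming a constant power draw."""
    return seconds * watts


def arm_wins_on_energy(arm_watts: float, intel_watts: float) -> bool:
    """True if the ARM run uses less energy than the Intel run."""
    arm_energy = energy_joules(ARM_SECONDS, arm_watts)
    intel_energy = energy_joules(INTEL_SECONDS, intel_watts)
    return arm_energy < intel_energy


if __name__ == "__main__":
    ratio = speedup(ARM_SECONDS, INTEL_SECONDS)
    print(f"Intel speedup over ARM: {ratio:.2f}x")
    # Hypothetical power draws, chosen only to exercise the function.
    print(arm_wins_on_energy(arm_watts=5.0, intel_watts=95.0))
```

Under these assumptions the ARM system comes out ahead on energy per job whenever its power draw is lower than the Intel system's divided by the speedup, which is why real production power numbers matter for the conclusion on slide 15.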

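Slides 25 to 27 describe an object store as a flat set of object IDs with “put” and “get” semantics, an organisational schema layered on top, and an open question about seek operations. The Python sketch below is a toy in-memory illustration of those three ideas. It is not the iRODS or S3 API: every class and method name here is hypothetical, and `get_range` simply slices bytes already held in memory, whereas a real store would need to fetch only the requested bytes.

```python
"""Toy object store with a metadata catalogue layered on top."""
import uuid
from typing import Dict, List


class ObjectStore:
    """Flat store: opaque IDs map to bytes, with no directory structure."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the data and return a newly generated object ID."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Return the whole object."""
        return self._objects[object_id]

    def get_range(self, object_id: str, offset: int, length: int) -> bytes:
        """Hypothetical ranged read: return only part of an object."""
        return self._objects[object_id][offset:offset + length]


class Catalogue:
    """Organisational schema: logical names and searchable metadata."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store
        self._ids: Dict[str, str] = {}                  # name -> object ID
        self._metadata: Dict[str, Dict[str, str]] = {}  # name -> key/value pairs

    def add(self, name: str, data: bytes, metadata: Dict[str, str]) -> None:
        """Put the data in the store and record its name and metadata."""
        self._ids[name] = self._store.put(data)
        self._metadata[name] = dict(metadata)

    def search(self, key: str, value: str) -> List[str]:
        """Return the sorted names whose metadata has key == value."""
        return sorted(
            name for name, meta in self._metadata.items()
            if meta.get(key) == value
        )

    def read(self, name: str) -> bytes:
        """Fetch a whole object by its logical name."""
        return self._store.get(self._ids[name])

    def read_range(self, name: str, offset: int, length: int) -> bytes:
        """Fetch part of an object by its logical name."""
        return self._store.get_range(self._ids[name], offset, length)


if __name__ == "__main__":
    catalogue = Catalogue(ObjectStore())
    catalogue.add("run1/sample_a.bam", b"HEADERrecord1record2",
                  {"sample": "a", "type": "bam"})
    print(catalogue.search("type", "bam"))
    print(catalogue.read_range("run1/sample_a.bam", 6, 7))
```

The catalogue plays the role the slides assign to iRODS: users work with names and metadata, and the object IDs stay hidden behind it.
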
Editor's Notes

  1. Sequencing is the start of most analysis. People = unmanaged data: data in wrong place, duplicated, nobody can find anything. Inc systems: backups/security. Capacity planning?