Guy Coates

Big Data:
Sanger Experiences

Guy Coates
Wellcome Trust Sanger Institute

gmpc@sanger.ac.uk

The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus,
Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome.
(largest single contributor).
• Large scale sequencing with an impact
on human and animal health.

Data is freely available.
• Websites, ftp, direct database access,
programmatic APIs.
• Some restrictions for potentially
identifiable data.

My team:
• Scientific computing systems architects.

DNA Sequencing

TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG

AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA

TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC

ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG

TGCACTCCAGCTTGGGTGACACAG CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG

AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA

ATGAAGTAAATCG ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC

250 Million * 75-108 Base fragments

~1 TByte / day / machine
Human Genome (3GBases)

Economic Trends:

Cost of sequencing halves every 12
months.
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• $8,000.
Trend will continue:
• $1000 genome is probable within 2 years.

The scary graph

Peak Yearly capillary Current weekly sequencing:
sequencing: 30 Gbase 7-10 Tbases

What are we doing with all
these genomes?
UK10K
• Find and understand impact of rare genetic
variants on disease.

Ensembl
• Genome annotation.
• Data resources and analysis pipelines.
Cancer Genome Project
• Catalogue causal mutations in cancer.
• Genomics of tumor drug sensitivity.
Pathogen Genomics
• Bacterial / viral genomics
• Malaria Genetics
• Parasite genetic / tropical diseases.
All these programmes exist in
frameworks of external collaboration.
• Sharing data and resources is crucial.

IT Requirements
Needs to match growth in
sequencing technology.
Disk Storage
Growth of compute & storage 12000

• Storage /compute doubles every 12 10000
months.
• 2012 ~17 PB Usable 8000

Terabytes
Everything changes, all the time.
6000

• Science is very fluid. 4000

• Speed to deployment is critical.
2000

Moore's law will not save us. 0
1995 1997 1999 2001 2003 2005 2007 2009
1994 1996 1998 2000 2002 2004 2006 2008 2010

$1000 genome* Year
• *Informatics not included

Sequencing data flow.

Processing/ Comparative
Sequencer Archive Internet
QC analysis

Unstructured data Structured data
(Flat files) (databases)

Raw data Sequence Alignments Variation data Feature
(10 TB) (500GB) (200GB) (1GB) (3MB)

Agile Systems
Modular design.
• Blocks of network, compute and
storage.
• Assume from day 1 we will be adding
more.
• Expand simply by adding more blocks.
• Lots of automation.
Disk Disk Disk Disk

Make storage visible from
everywhere.
• Key enabler; lots of 10Gig.
Network

Compute Compute Compute Compute

Compute Modules
Commodity Servers
• Blade form-factor.
• Automated Management.
Generic intel/AMD CPUs
• Single threaded / embarrasingly parallel
workload.
• No FPGAs or GPUs.

2000-10,000 core per cluster
• 3 Gbyte/s memory per core.
• A few bigger memory machines (0.5TB).

Storage Modules
Two flavours:
Scale up (Fast)
• DDN storage arrays.
• Lustre. 250-500TB per filesystem.
• High performance. Expensive.
Scale out (Slow)
• Linux NFS servers.
• Nexsan Storage arrays.
• 50-100TB per filesystem.
• Cheap and cheerful.
How large?
• More modules = more management overhead.
• Fewer modules = large unit of failure.
• 100-500 TB

Actual Architecture
Compute Silos
• Beware of over- Fast
consolidation. disk
Fast
• Some workflows interact
Slow disk
disk
badly with one another.
• Separate out some work
onto different clusters. Fast
disk

Logically rather than
physically separated.
• LSF to manage workflow. Network
• Simple software re-config to
move capacity between
silos.

Farm 1 LSF Farm2 LSF Farm3

Some things we learned
KISS! Keep It Simple, Stupid.
• Simple solution may look less reliable on paper than the fully redundant
failover option.
• Operational reality:
• Simple solutions are much quicker to fix when they break.
• Not always possible (eg lustre use).

Good communication between science and IT teams.
• Expose the IT costs to researchers.
Build systems Iteratively.
• Constantly evolving systems.
• Groups start out with everything on fast storage, but realise they can get
away with slower stuff.
• More cost effective to do 3x1 yearly purchase rather than 1x 3 yearly?

Data Triage
• What do we really want to keep?

General Farm
(6K core)
IRODs
Sequencing
Sequencer (1K cores) Archive Internet
QC UK10K Farm
analysis
(1.5K core)

CGP Farm
(2K cores) Structured data
Unstructured data

Slow
Fast Fast Slow Fast

Slow

Slow
Slow


Sequencer datastore Internet
QC analysis

Unstructured data Structured data

Pbytes!
Raw data Sequence Alignments Variation data Feature
(10 TB) (500GB) (200GB) (1GB) (3MB)

People = Unmanaged Data
Investigators take data and “do stuff” with it.
Data is left in the wrong place.
• Typically left where it was created.
• Moving data is hard and slow.
• Important data left in scratch areas, or high IO analysis being run against
slow storage.

Duplication.
• Everyone take a copy of the data, “just to be sure.”
Capacity planning becomes impossible.
• Who is using our disk space?
• “du” on 4PB is not going to work...

Not Just an IT Problem

#df -h
Filesystem Size Used Avail Use% Mounted on
lus02-mds1:/lus02 108T 107T 1T 99% /lustre/scratch102

#df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
lus02-mds1:/lus02 300296107 136508072 163788035 45%/lustre/scratch102

100TB filesystem, 136M files.
• “Where is the stuff we can delete so we can continue production...?”

Lost productivity
Data management impacts on research productivity.
• Groups spend large amounts of time and effort just keeping track of data.
• Groups who control their data get much more done.
• But they spend time writing data tracking applications.

Money talks:
• “Group A only need ½ the storage budget of group B to do the same
analysis.”
• Powerful message.

Need a common site-wide data management infrastructure.
• We need something simple so people will want to use it.

Data management
iRODS: Integrated Rule-Oriented Data System.
• Produced by DICE Group (Data Intensive Cyber Environments) at U.
North Carolina, Chapel Hill.

Successor to SRB.
• SRB used by the High-Energy-Physics (HEP) community.
•20PB/year LHC data.
• HEP community has lots of “lessons learned” that we can benefit from.

iRODS
User interface
Web, command line, fuse, API

Irods Server
Data in S3

Irods Server
ICAT Rule Engine Data in database
Catalogue
Implements policies
database

Irods Server
Data on disk

iRODS
Queryable metadata
• SQL like language.
Scalable
• Copes with PB of data and 100,000M+ files.
• Data replication engine.
• Fast parallel data transfers across local and wide area network links.
Customisable Rules
• Trigger actions on data according to policy.
• Eg generate thumbnail for every image uploaded.

Federated
• iRODS installs can be federated across institutions.
• Sharing data is easy.
Open Source
• BSD licensed.

Sequencing Archive
Final resting place for all our
sequencing data.
• Researchers pull data from irods for
further analysis.

2x 800TB space.
• First deployment; KISS!
Simple ruleset.
• Replicate & checksum data.
• External scripts periodically scrub data.
Positively received.
• Researchers are pushing us for new
instances to store their data.

Next Iterations:
• Experiments with WAN, external
federations, complex rules.

Architecture

ICAT
Oracle 10g RAC Irods Server

Replica 1 Replica 2
(red datacentre)
(green datacentre)

275TB 120TB
(Nexan)
275TB 120TB
(Nexan)
(Nexsan) 480TB (Nexsan) 480TB
(DDN) (DDN)

Some thoughts on Clouds
Largest drag on response is dealing with real hardware.
• Delivery lead times, racking, cabling etc.
To the cloud!
Nothing about our IT approach precludes/mandates cloud.
• Use it where it makes sense.
Public clouds for big-data.
• Uploading data to the cloud takes along time.
• Data Security.
• Need to do your due-diligence
• (just like you should be doing in-house!)
• Cloud may be more appropriate than in house.

Currently cheaper for us to do production in-house.
• But: Purely an economic decision.

Cloud Archives
Dark Archives
• You can get data, but cannot compute
across it.
• Nobody is going to download 400TB
of sequence data.

Cloud Archives
• Cloud models allow compute to be
uploaded to the data and run “in-
place.”
• Private clouds may simplify data
governance.
• Can you do it more cheaply than
public providers?

Summary

Modular Infrastructure.
Manage Data.
Data Triage.
Strong Collaboration / Dialogue with Researchers.

Acknowledgements
Sanger Systems Team
• Phil Butcher (Director of IT)
• Informatics Systems Group.
• Networking, DBA, Infrastructure & helpdesk teams.
• Cancer, human-genetics, uk10k informatics teams.

Resources:
• http://www.sanger.ac.uk/research
• http://www.uk10k.org
• http://www.sanger.ac.uk/genetics/cgp/cosmic
• http://www.ensembl.org
• http://www.irods.org
• http://www.nanoporetech.com

Guy Coates

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a Guy Coates

Semelhante a Guy Coates (20)

Mais de Eduserv

Mais de Eduserv (20)

Último

Último (20)

Guy Coates