Big Data:
Sanger Experiences

           Guy Coates
  Wellcome Trust Sanger Institute


       gmpc@sanger.ac.uk
The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus,
    Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome.
    (largest single contributor).
•   Large scale sequencing with an impact
    on human and animal health.

Data is freely available.
• Websites, ftp, direct database access,
    programmatic APIs.
     • Some restrictions for potentially
       identifiable data.

My team:
• Scientific computing systems architects.
DNA Sequencing


                             TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG

                                 AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA

                              TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC

                             ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG

                             TGCACTCCAGCTTGGGTGACACAG   CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG

                              AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA

                             ATGAAGTAAATCG     ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC




                         250 Million * 75-108 Base fragments

                         ~1 TByte / day / machine
Human Genome (3GBases)
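
As a rough check on the figures above, the sketch below multiplies the fragment count by the read length and compares the result with the 3 Gbase genome. The function names are illustrative, and treating the 250 million fragments as a single batch is an assumption.

```python
# Figures taken from the slide; function names are illustrative only.
FRAGMENTS = 250_000_000        # fragments per batch (assumed to be one batch)
READ_LENGTHS = (75, 108)       # bases per fragment
GENOME_BASES = 3_000_000_000   # human genome, ~3 Gbases


def total_bases(fragments: int, read_length: int) -> int:
    """Total bases produced by `fragments` reads of `read_length` bases."""
    return fragments * read_length


def fold_coverage(bases: int, genome_bases: int = GENOME_BASES) -> float:
    """How many times the sequenced bases would cover the genome."""
    return bases / genome_bases


for length in READ_LENGTHS:
    bases = total_bases(FRAGMENTS, length)
    print(f"{length}-base reads: {bases / 1e9:.2f} Gbases, "
          f"{fold_coverage(bases):.2f}x the genome")
```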
Economic Trends:

Cost of sequencing halves every 12
months.
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• $8,000.
Trend will continue:
• $1000 genome is probable within 2 years.
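
The halving trend can be turned into a small projection. This is a sketch using the slide's $8,000 and $1,000 figures; the function names are hypothetical. Under a strict 12-month halving the drop takes three years, so a two-year arrival would correspond to a halving time of about eight months.

```python
import math


def years_to_reach(cost_now: float, target: float,
                   halving_months: float = 12.0) -> float:
    """Years until the cost falls to `target` if it halves every `halving_months`."""
    halvings = math.log2(cost_now / target)
    return halvings * halving_months / 12.0


def implied_halving_months(cost_now: float, target: float, years: float) -> float:
    """Halving period (months) needed to reach `target` within `years`."""
    return years * 12.0 / math.log2(cost_now / target)


print(years_to_reach(8000, 1000))             # strict 12-month halving
print(implied_halving_months(8000, 1000, 2))  # halving time for a 2-year arrival
```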
The scary graph




Peak yearly capillary sequencing: 30 Gbase
Current weekly sequencing: 7-10 Tbases
Gen III Sequencers this year?
What are we doing with all
      these genomes?
UK10K
• Find and understand impact of rare genetic
  variants on disease.

Ensembl
• Genome annotation.
• Data resources and analysis pipelines.
Cancer Genome Project
• Catalogue causal mutations in cancer.
• Genomics of tumor drug sensitivity.
Pathogen Genomics
• Bacterial / viral genomics
• Malaria Genetics
• Parasite genetics / tropical diseases.
All these programmes exist in
frameworks of external collaboration.
• Sharing data and resources is crucial.
IT Requirements
Needs to match growth in
sequencing technology.
Growth of compute & storage
• Storage / compute doubles every 12
  months.
   • 2012 ~17 PB Usable

Everything changes, all the time.
• Science is very fluid.
• Speed to deployment is critical.

Moore's law will not save us.

[Chart: Disk Storage. Terabytes (0 to 12,000) by Year (1994 to 2010).]

$1000 genome*
• *Informatics not included
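
A doubling every 12 months compounds quickly. The sketch below projects capacity forward from the 2012 figure of ~17 PB; the function name and the planning horizons are assumptions for illustration, not forecasts from the talk.

```python
def projected_capacity_pb(base_pb: float, years_ahead: float,
                          doubling_months: float = 12.0) -> float:
    """Capacity after `years_ahead` years if it doubles every `doubling_months`."""
    return base_pb * 2 ** (years_ahead * 12.0 / doubling_months)


BASE_2012_PB = 17.0  # usable capacity quoted on the slide

for years in (1, 2, 3):  # hypothetical planning horizons
    print(f"2012 + {years}y: {projected_capacity_pb(BASE_2012_PB, years):.0f} PB")
```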
Sequencing data flow.

Sequencer -> Processing/QC -> Comparative analysis -> Archive -> Internet

Unstructured data (flat files) upstream; structured data (databases) downstream.

Data size at each stage:
• Raw data: 10 TB
• Sequence: 500 GB
• Alignments: 200 GB
• Variation data: 1 GB
• Feature: 3 MB
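
The per-stage sizes in this data flow imply a steep reduction at each step. A minimal sketch of that arithmetic follows, using the slide's figures with decimal units (1 TB = 10^12 bytes, which is an assumption); the names are illustrative.

```python
# Sizes from the slide, in bytes, assuming decimal units.
STAGES = [
    ("Raw data", 10 * 10**12),
    ("Sequence", 500 * 10**9),
    ("Alignments", 200 * 10**9),
    ("Variation data", 1 * 10**9),
    ("Feature", 3 * 10**6),
]


def reduction_factors(stages):
    """Return (stage name, shrink factor relative to the previous stage)."""
    return [(name, prev_size / size)
            for (_, prev_size), (name, size) in zip(stages, stages[1:])]


for name, factor in reduction_factors(STAGES):
    print(f"{name}: {factor:,.1f}x smaller than the previous stage")

overall = STAGES[0][1] / STAGES[-1][1]
print(f"Raw data to features: about {overall:,.0f}x reduction")
```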
Agile Systems
Modular design.
• Blocks of network, compute and
    storage.
•   Assume from day 1 we will be adding
    more.
•   Expand simply by adding more blocks.
•   Lots of automation.

Make storage visible from
everywhere.
• Key enabler; lots of 10Gig.

[Diagram: four Disk blocks and four Compute blocks joined by a shared Network.]
Compute Modules
Commodity Servers
• Blade form-factor.
• Automated Management.
Generic Intel/AMD CPUs
• Single threaded / embarrassingly parallel
    workload.
•   No FPGAs or GPUs.

2,000-10,000 cores per cluster
• 3 GByte memory per core.
• A few bigger memory machines (0.5TB).
Storage Modules
Two flavours:
Scale up (Fast)
• DDN storage arrays.
• Lustre. 250-500TB per filesystem.
• High performance. Expensive.
Scale out (Slow)
• Linux NFS servers.
• Nexsan Storage arrays.
• 50-100TB per filesystem.
• Cheap and cheerful.
How large?
• More modules = more management overhead.
• Fewer modules = large unit of failure.
• 100-500 TB
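
The module-size trade-off can be made concrete by counting modules for a given estate. This sketch applies the 100-500 TB range to a 17 PB estate (the 2012 figure quoted earlier in the deck); applying that total here is an assumption, and the function name is illustrative.

```python
import math


def modules_needed(total_tb: float, module_tb: float) -> int:
    """Number of storage modules needed to hold `total_tb` terabytes."""
    return math.ceil(total_tb / module_tb)


TOTAL_TB = 17_000  # ~17 PB, assumed total for illustration

for module_tb in (100, 250, 500):
    count = modules_needed(TOTAL_TB, module_tb)
    # Smaller modules mean more things to manage; larger modules mean
    # each failure takes out a bigger share of the estate.
    share = 100.0 / count
    print(f"{module_tb} TB modules: {count} to manage, "
          f"{share:.1f}% of capacity per failure")
```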
Actual Architecture
Compute Silos
• Beware of over-consolidation.
• Some workflows interact badly with one another.
• Separate out some work onto different clusters.

Logically rather than
physically separated.
• LSF to manage workflow.
• Simple software re-config to
    move capacity between
    silos.

[Diagram: Farm 1, Farm 2 and Farm 3, linked by LSF, share one Network with fast and slow disk attached.]
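
Logical separation means a silo is just a named group of hosts, so moving capacity is a bookkeeping change rather than a recabling job. The sketch below models that idea in plain Python. It is not LSF configuration or an LSF API, and the farm and node names are hypothetical.

```python
from typing import Dict, List


def move_capacity(silos: Dict[str, List[str]], src: str, dst: str,
                  count: int) -> Dict[str, List[str]]:
    """Return a new silo map with `count` hosts moved from `src` to `dst`."""
    if count < 0 or count > len(silos[src]):
        raise ValueError(f"cannot move {count} hosts out of {src}")
    updated = {name: list(hosts) for name, hosts in silos.items()}
    moved = updated[src][:count]
    updated[src] = updated[src][count:]
    updated[dst] = updated[dst] + moved
    return updated


# Hypothetical host groups standing in for the three farms.
farms = {
    "farm1": ["node01", "node02", "node03"],
    "farm2": ["node04"],
    "farm3": [],
}

rebalanced = move_capacity(farms, "farm1", "farm3", 2)
for name, hosts in rebalanced.items():
    print(name, hosts)
```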
Some things we learned
KISS! Keep It Simple, Stupid.
• Simple solution may look less reliable on paper than the fully redundant
    failover option.
•   Operational reality:
      • Simple solutions are much quicker to fix when they break.
•   Not always possible (e.g. Lustre use).

Good communication between science and IT teams.
• Expose the IT costs to researchers.
Build systems iteratively.
• Constantly evolving systems.
• Groups start out with everything on fast storage, but realise they can get
    away with slower stuff.
•   More cost-effective to make three one-yearly purchases rather than one three-yearly purchase?

Data Triage
• What do we really want to keep?
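
The purchasing question above comes down to arithmetic once a price trend is assumed. The sketch below compares one up-front purchase with three yearly ones. The unit price and the annual price factor of 0.5 are assumptions chosen for illustration, not figures from the talk.

```python
from typing import Sequence


def upfront_cost(units: float, unit_price: float) -> float:
    """Cost of buying all `units` today at `unit_price`."""
    return units * unit_price


def staged_cost(units_per_year: Sequence[float], unit_price: float,
                annual_price_factor: float) -> float:
    """Cost of buying units year by year while the unit price changes.

    `annual_price_factor` multiplies the price each year (0.5 = halves).
    """
    return sum(units * unit_price * annual_price_factor ** year
               for year, units in enumerate(units_per_year))


UNIT_PRICE = 100.0    # hypothetical price per storage unit today
PRICE_FACTOR = 0.5    # assumed: price per unit halves each year

print("1 x 3-yearly purchase:", upfront_cost(3, UNIT_PRICE))
print("3 x 1-yearly purchases:", staged_cost([1, 1, 1], UNIT_PRICE, PRICE_FACTOR))
```

The staged option only wins while prices keep falling; with a price factor of 1.0 the two options cost the same.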
Sequencing data flow.
Sequencer -> Processing/QC -> Comparative analysis -> Archive -> Internet

• Processing/QC: Sequencing (1K cores).
• Comparative analysis: General Farm (6K cores), UK10K Farm (1.5K cores), CGP Farm (2K cores).
• Archive: iRODS.

Unstructured data (flat files) upstream; structured data (databases) downstream.

[Diagram: each stage is backed by a mix of fast and slow storage.]
That was easy!
Sequencing data flow.

Sequencer -> Processing/QC -> Comparative analysis -> datastore -> Internet

Unstructured data (flat files) upstream; structured data (databases) downstream.

Pbytes!
• Raw data: 10 TB
• Sequence: 500 GB
• Alignments: 200 GB
• Variation data: 1 GB
• Feature: 3 MB
People = Unmanaged Data
Investigators take data and “do stuff” with it.
Data is left in the wrong place.
• Typically left where it was created.
     • Moving data is hard and slow.
•   Important data left in scratch areas, or high IO analysis being run against
    slow storage.

Duplication.
• Everyone takes a copy of the data, “just to be sure.”
Capacity planning becomes impossible.
• Who is using our disk space?
     •   “du” on 4PB is not going to work...
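
To see why a “du”-style answer is expensive, here is a minimal sketch of what such a scan has to do: visit every file and stat it. The function name is illustrative. It is a serial walk, which is exactly the approach that stops being practical at petabyte scale.

```python
import os
from collections import Counter


def usage_by_owner(root: str) -> Counter:
    """Sum apparent file sizes under `root`, keyed by numeric owner uid.

    One lstat() call is made per file, so the run time grows with the
    number of files rather than with the number of bytes stored.
    """
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            totals[info.st_uid] += info.st_size
    return totals
```

Calling `usage_by_owner("/some/scratch/area")` (a placeholder path) returns a mapping from uid to bytes used, which answers "who is using our disk space?" only for trees small enough to walk.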
Not Just an IT Problem

# df -h
Filesystem         Size  Used  Avail  Use%  Mounted on
lus02-mds1:/lus02  108T  107T     1T   99%  /lustre/scratch102

# df -i
Filesystem            Inodes      IUsed      IFree  IUse%  Mounted on
lus02-mds1:/lus02  300296107  136508072  163788035    45%  /lustre/scratch102




     100TB filesystem, 136M files.
     • “Where is the stuff we can delete so we can continue production...?”
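
Two numbers fall out of the df output above: the average file size, and how long it would take just to look at every file. The sketch below works both out. The stat rate is a hypothetical figure, and reading "T" as 2^40 bytes is an assumption about how df -h reported the sizes.

```python
FILES_USED = 136_508_072     # IUsed from "df -i"
USED_BYTES = 107 * 2**40     # 107T from "df -h", assuming T = 2**40 bytes


def mean_file_bytes(used_bytes: int, files: int) -> float:
    """Average bytes per file."""
    return used_bytes / files


def scan_hours(files: int, stats_per_second: float) -> float:
    """Hours needed to stat every file once at a given rate."""
    return files / stats_per_second / 3600.0


ASSUMED_STATS_PER_SECOND = 5_000  # hypothetical metadata rate

print(f"Mean file size: {mean_file_bytes(USED_BYTES, FILES_USED) / 2**20:.2f} MiB")
print(f"Full scan: {scan_hours(FILES_USED, ASSUMED_STATS_PER_SECOND):.1f} hours "
      f"at {ASSUMED_STATS_PER_SECOND} stats/s")
```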
Lost productivity
Data management impacts on research productivity.
• Groups spend large amounts of time and effort just keeping track of data.
• Groups who control their data get much more done.
   •   But they spend time writing data tracking applications.

Money talks:
• “Group A only needs ½ the storage budget of Group B to do the same
  analysis.”
   • Powerful message.

Need a common site-wide data management infrastructure.
• We need something simple so people will want to use it.
Data management
iRODS: Integrated Rule-Oriented Data System.
• Produced by DICE Group (Data Intensive Cyber Environments) at U.
    North Carolina, Chapel Hill.

Successor to SRB.
• SRB used by the High-Energy-Physics (HEP) community.
     • 20PB/year LHC data.
•   HEP community has lots of “lessons learned” that we can benefit from.
iRODS
[Diagram: iRODS architecture.]
• User interface: web, command line, FUSE, API.
• ICAT catalogue database.
• Rule engine: implements policies.
• iRODS servers, each fronting a storage backend: data in S3, data in a database, data on disk.
iRODS
Queryable metadata
• SQL like language.
Scalable
• Copes with PB of data and 100M+ files.
• Data replication engine.
• Fast parallel data transfers across local and wide area network links.
Customisable Rules
• Trigger actions on data according to policy.
   •   E.g. generate a thumbnail for every image uploaded.

Federated
• iRODS installs can be federated across institutions.
• Sharing data is easy.
Open Source
• BSD licensed.
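
The "customisable rules" idea above, where policy triggers actions on data, can be sketched generically. The code below is a toy policy engine in plain Python that mirrors the thumbnail example. It is not the iRODS rule language or any iRODS API; every class and function name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DataObject:
    """A stored file plus its queryable metadata."""
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)


Condition = Callable[[DataObject], bool]
Action = Callable[[DataObject], str]


class PolicyEngine:
    """Runs every matching rule's action when an object is uploaded."""

    def __init__(self) -> None:
        self._rules: List[Tuple[Condition, Action]] = []

    def add_rule(self, condition: Condition, action: Action) -> None:
        self._rules.append((condition, action))

    def on_upload(self, obj: DataObject) -> List[str]:
        return [action(obj) for condition, action in self._rules
                if condition(obj)]


def is_image(obj: DataObject) -> bool:
    return obj.path.lower().endswith((".png", ".jpg", ".jpeg"))


def queue_thumbnail(obj: DataObject) -> str:
    # A real system would render the thumbnail; here we only record intent.
    obj.metadata["thumbnail"] = obj.path + ".thumb"
    return f"thumbnail queued for {obj.path}"


engine = PolicyEngine()
engine.add_rule(is_image, queue_thumbnail)

print(engine.on_upload(DataObject("gel_scan.PNG")))
print(engine.on_upload(DataObject("run42.bam")))
```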
Sequencing Archive
Final resting place for all our
sequencing data.
• Researchers pull data from iRODS for
  further analysis.

2x 800TB space.
• First deployment; KISS!
Simple ruleset.
• Replicate & checksum data.
• External scripts periodically scrub data.
Positively received.
• Researchers are pushing us for new
  instances to store their data.

Next Iterations:
• Experiments with WAN, external
  federations, complex rules.
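
The "replicate & checksum" ruleset and the periodic scrub can be illustrated with a small standalone script. This is a sketch of the general technique, not the scripts used on the archive; SHA-256 and all names are assumptions.

```python
import hashlib
from typing import Iterable, List


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(replica_paths: Iterable[str], expected: str) -> List[str]:
    """Return the replicas whose current checksum no longer matches `expected`."""
    return [path for path in replica_paths
            if file_checksum(path) != expected]
```

In use, the checksum would be recorded once at ingest, and `scrub` would then be run periodically over both replicas; any path it returns needs restoring from the good copy.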
Architecture

[Diagram: sequencing archive layout.]
• ICAT: Oracle 10g RAC.
• iRODS server.
• Replica 1 (green datacentre): 275TB and 120TB (Nexsan), 480TB (DDN).
• Replica 2 (red datacentre): 275TB and 120TB (Nexsan), 480TB (DDN).
Some thoughts on Clouds
Largest drag on response is dealing with real hardware.
• Delivery lead times, racking, cabling etc.
To the cloud!
Nothing about our IT approach precludes/mandates cloud.
• Use it where it makes sense.
Public clouds for big-data.
• Uploading data to the cloud takes a long time.
• Data Security.
   •   Need to do your due diligence
         •   (just like you should be doing in-house!)
   •   Cloud may be more appropriate than in-house.

Currently cheaper for us to do production in-house.
• But: Purely an economic decision.
Cloud Archives
Dark Archives
• You can get data, but cannot compute
    across it.
•   Nobody is going to download 400TB
    of sequence data.

Cloud Archives
• Cloud models allow compute to be
    uploaded to the data and run
    “in-place.”
•   Private clouds may simplify data
    governance.
•   Can you do it more cheaply than
    public providers?
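
The remark that nobody will download 400TB is easy to quantify. The sketch below estimates transfer time for that volume; the link speeds and the efficiency factor are assumptions, and the function name is illustrative.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `size_bytes` over a link of the given speed.

    `efficiency` is the fraction of the nominal link speed actually achieved.
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400


ARCHIVE_BYTES = 400 * 10**12  # 400 TB, decimal units assumed

for label, speed in (("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)):
    days = transfer_days(ARCHIVE_BYTES, speed)
    print(f"400 TB over {label}: {days:.1f} days at full line rate")
```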
Summary

Modular Infrastructure.
Manage Data.
Data Triage.
Strong Collaboration / Dialogue with Researchers.
Acknowledgements
Sanger Systems Team
• Phil Butcher (Director of IT)
• Informatics Systems Group.
• Networking, DBA, Infrastructure & helpdesk teams.
• Cancer, human-genetics, UK10K informatics teams.

Resources:
•   http://www.sanger.ac.uk/research
•   http://www.uk10k.org
•   http://www.sanger.ac.uk/genetics/cgp/cosmic
•   http://www.ensembl.org
•   http://www.irods.org
•   http://www.nanoporetech.com
