Ari Berman - Intel Big Data Seminar 9/6/2012
1. BIOTEAM
Enabling Science
Storage Infrastructure and Data Management in Life Sciences
Ari E. Berman, Ph.D.
Senior Scientific Consultant, BioTeam, Inc.
©BioTeam, Inc. 2012 - http://www.bioteam.net
2. A little about me
• Ph.D. in Molecular Biology/Neuroscience
• Trained in laboratory and bioinformatics
• 13 years’ experience as an IT infrastructure/HPC geek/Perl monger
• Odd mix of skills led me to BioTeam
• Joined BioTeam in May
3. Who is BioTeam?
• Independent consulting practice
• Made up of scientists with vast experience in software, HPC, and IT
• Unique cross-section of skill sets
• 10+ years of bridging the gap between technology and science
• Functions as much as a think tank as a consulting practice
4. Why am I here talking to you?
• We work on a broad range of projects: Pharma, Biotech, EDU, .gov, .mil, etc.
• We are in a unique position: we can see how people are approaching current problems
• We work from a tech-agnostic perspective: we provide what’s best for the customer
• Our niche: a 1000 ft. overview of tech problems in life sciences
5. Why are we all here?
Big data in life sciences: just when you thought it was safe to go back into the datacenter...
6. Big data: the tired story
• Next-generation sequencing, mass spec, imaging, etc.
• High-throughput experimentation
• Clinical research/standard healthcare - personalized medicine
• Unnatural expansion of technology (sequencing)
• Now: we can get the data fast, what do we do with it?
7. Big data: the tired story
• At this point, this is an old problem
• Most sequencers generating 0.5TB/day
• Final genomes around 300GB
• High-volume quantitative methods quickly produce 100s of TBs of data
• The kicker: tight research budgets
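To put the figures on this slide in perspective, here is a rough back-of-the-envelope sketch. The per-sequencer rate and genome size come from the slide; the lab size (4 sequencers), the 100TB capacity, and the function names are illustrative assumptions, and 1TB is taken as 1000GB.

```python
# Back-of-the-envelope data volumes using the figures quoted on the slide.
TB_PER_SEQUENCER_PER_DAY = 0.5  # "most sequencers generating 0.5TB/day"
GENOME_SIZE_GB = 300            # "final genomes around 300GB"


def raw_output_tb(sequencers: int, days: int) -> float:
    """Raw data (TB) produced by a fleet of sequencers running every day."""
    return sequencers * days * TB_PER_SEQUENCER_PER_DAY


def days_to_fill(capacity_tb: float, sequencers: int) -> float:
    """Days until raw output alone reaches the given capacity (TB)."""
    return capacity_tb / (sequencers * TB_PER_SEQUENCER_PER_DAY)


def genomes_that_fit(capacity_tb: float) -> int:
    """Whole finished genomes that fit in the given capacity (1TB = 1000GB)."""
    return int(capacity_tb * 1000) // GENOME_SIZE_GB


if __name__ == "__main__":
    # Hypothetical lab: 4 sequencers, 100TB of usable storage.
    print(raw_output_tb(sequencers=4, days=365))        # 730.0
    print(days_to_fill(capacity_tb=100, sequencers=4))  # 50.0
    print(genomes_that_fit(capacity_tb=100))            # 333
```

Under these assumptions a small lab produces hundreds of TB of raw data per year, which is consistent with the "100s of TBs" bullet above.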
8. Storing Big Data
• Problem is less about storing the data. We’ve solved storage.
• We can now put in thousands of spindles in a semi-affordable manner
• Lots of high-density boxes
• The petabyte challenge has been met
• Now, it needs to work well
• And still be affordable
9. Today’s problem: Accessing Big Data
• In practice - get to 1.5PB, 500M files: metadata falls off a cliff
• Directory listings take minutes
• Sorting takes forever
• Forget about filesystem profiling/optimization
10. Today’s problem: Accessing Big Data
• What’s being done?
• SSDs thought to be our savior
• Blazing fast, SLC, many in parallel
• Parallel filesystems could cache metadata on SSDs
• Reduce search time by orders of magnitude
11. Today’s problem: Accessing Big Data
• Of course, it’s not that simple
• Now, distribution and access points of SSDs matter
• How they are addressed matters
• How many small files are on the filesystem matters
• How the files are to be used matters
12. But wait: there’s more
• A consistent array of disks is no longer enough beyond 1.5PB
• Entire solutions of high-speed disks are not cost-effective
• Distribution of file access needs: some fast, some archive
• Tiering of storage infrastructure
13. Tiering
• Keep archival data on slower, cheaper disks
• No SSDs
• Keep fast-access files on smaller, high-speed disks with many (possibly all) SSDs (HPC, high-throughput needs)
• Mid-level tiers for administrative needs (documents, etc.)
• Can even add a tape tier for more permanent storage
14. Managing Tiers
• Administratively difficult
• Can manage by different mount points, quotas, user education
• Better: policy engines
• Use with parallel file systems (GPFS, OneFS, etc.)
• Policy-based automated movement of files through tiers, even to tape
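The idea of policy-based movement through tiers can be sketched in a few lines. This is a toy illustration only: it is not the rule language of GPFS, OneFS, or any other product, and the tier names, idle-time thresholds, and function names are all assumptions chosen for the example.

```python
from dataclasses import dataclass

DAY = 86_400  # seconds in a day


@dataclass(frozen=True)
class Rule:
    tier: str
    max_idle_days: float  # files idle no longer than this belong on the tier


# Ordered fastest to slowest. Thresholds are illustrative assumptions.
RULES = [
    Rule("fast", 7),                # SSD-heavy tier for HPC work
    Rule("mid", 90),                # mid-level tier (documents, etc.)
    Rule("archive", 365),           # slower, cheaper disks
    Rule("tape", float("inf")),     # more permanent storage
]


def choose_tier(last_access: float, now: float, rules=RULES) -> str:
    """Pick the first (fastest) tier whose idle threshold the file satisfies."""
    idle_days = (now - last_access) / DAY
    for rule in rules:
        if idle_days <= rule.max_idle_days:
            return rule.tier
    return rules[-1].tier


def plan_moves(files, now: float, rules=RULES):
    """files: iterable of (path, current_tier, last_access_epoch).

    Returns a list of (path, from_tier, to_tier) for files on the wrong tier.
    """
    moves = []
    for path, current, last_access in files:
        target = choose_tier(last_access, now, rules)
        if target != current:
            moves.append((path, current, target))
    return moves
```

A real policy engine would evaluate rules like these inside the filesystem, so users keep one namespace while the data migrates underneath them.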
15. By golly, we’ve done it!
• If done correctly, a single-namespace infrastructure can work well for all needs
• Can handle HPC to archive
• Can be done in a semi-affordable manner
16. Now what?
• Now, we’re faced with more problems
• For NIH, HIPAA laws, and general sanity, need DR
• Need twice the space you’ll use
• No other way to do it right now
• Use inexpensive, slow disk solutions to save money on DR
17. Now what?
• Also: how to keep track of data?
• At 1PB and 0.5 billion files, creative directory structures lose out
• Complexity too much for anyone to handle
18. Data Management
• One solution: databases
• Keep higher depth of metadata (tagging, descriptions)
• Cumbersome for the general user: adds a complexity layer to the user experience
19. Data Management
• Databases can work, though
• iRODS is a good example
• Put the metadata database layer in between the filesystem and the user
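A minimal sketch of a metadata layer sitting between the filesystem and the user might look like the following. It is not the iRODS API; it uses Python's built-in sqlite3 module, and the table layout, function names, and file paths are hypothetical.

```python
import sqlite3


def open_catalog(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) a small catalog: one row per file, many tags per file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        "file_id INTEGER NOT NULL REFERENCES files(id), "
        "tag TEXT NOT NULL, val TEXT NOT NULL)"
    )
    # Index so tag lookups do not scan every row.
    conn.execute("CREATE INDEX IF NOT EXISTS tags_by_kv ON tags (tag, val)")
    return conn


def register(conn: sqlite3.Connection, path: str, **tags) -> None:
    """Record a file path and attach free-form tag/value pairs to it."""
    conn.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
    (file_id,) = conn.execute(
        "SELECT id FROM files WHERE path = ?", (path,)
    ).fetchone()
    conn.executemany(
        "INSERT INTO tags (file_id, tag, val) VALUES (?, ?, ?)",
        [(file_id, k, str(v)) for k, v in tags.items()],
    )
    conn.commit()


def find(conn: sqlite3.Connection, tag: str, val) -> list:
    """Return paths of all files carrying the given tag/value, sorted."""
    rows = conn.execute(
        "SELECT f.path FROM files f JOIN tags t ON t.file_id = f.id "
        "WHERE t.tag = ? AND t.val = ? ORDER BY f.path",
        (tag, str(val)),
    )
    return [row[0] for row in rows]
```

With a layer like this, users ask "which files belong to project X?" through an indexed query instead of walking directories on the filesystem.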
20. Data Management
• Others working on this model as well
• Cambridge Computer: “As you approach billions of files, file exploring is no longer feasible.”
• Need a new interface
• Rich metadata to keep track of the files
21. Wait, more metadata?
• More metadata? Wasn’t this the original problem on large filesystems?
• Wouldn’t this make matters worse?
• Depends on how it is done.
• Current models have metadata completely separate from files
22. Wait, more metadata?
• And...
• Who’s going to go back and type all of that metadata in?
• No one - we kind of need to start over
• ...or, need a way of inferring metadata and filling in the blanks from existing data
• Still need legacy support for systems
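One simple way to "fill in the blanks" is to infer metadata from what is already encoded in file names and directory paths. The sketch below is a heuristic illustration only: the suffix-to-type mapping, the date pattern, and the example path are assumptions, and a real deployment would need site-specific rules.

```python
import re
from pathlib import PurePosixPath

# Assumed mapping from file suffix to data type; extend per site.
SUFFIX_TYPES = {
    ".fastq": "sequence reads",
    ".fq": "sequence reads",
    ".bam": "alignment",
    ".vcf": "variant calls",
    ".tif": "image",
}

# Heuristic: a date such as 2012-09-06, 2012_09_06, or 20120906 in the path.
DATE_RE = re.compile(r"(20\d{2})[-_]?(\d{2})[-_]?(\d{2})")


def infer_metadata(path: str) -> dict:
    """Guess metadata from a path alone, without touching the filesystem."""
    p = PurePosixPath(path)
    suffixes = [s.lower() for s in p.suffixes]
    meta = {"filename": p.name}

    # Treat a trailing .gz as compression wrapped around the real type.
    if suffixes and suffixes[-1] == ".gz":
        meta["compressed"] = True
        suffixes = suffixes[:-1]

    if suffixes and suffixes[-1] in SUFFIX_TYPES:
        meta["data_type"] = SUFFIX_TYPES[suffixes[-1]]

    match = DATE_RE.search(path)
    if match:
        meta["date"] = "-".join(match.groups())
    return meta
```

Guesses like these would still need review, but they give legacy files a starting set of tags without anyone typing them in by hand.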
23. Or: Middleware
• Use an interactive software product between filesystem and user
• Can manage link between filesystem and extended metadata
• Can enhance the scientific process: manage data, analysis, results, and facilitate collaboration
24. What are scientists doing?
Lab Scientist w/ Excel
• Accessible for most scientists
• Flexible
• Data maintenance burden on lab scientists
• Quickly overwhelmed in size and complexity
• Data publication by email
25. What are scientists doing?
Lab Bioinformatician
• Quick development of web-based system
• Rapid turnaround for scientist needs
• Single point of failure
• Limited breadth of experience
• Poor documentation, poor transition
26. What are scientists doing?
Outsource custom software
• Stable, professional software
• Well documented, easier transition
• Communication barrier with scientists
• Lack of domain knowledge leaves large functionality gaps
• Inflexible design leaves software obsolete in a matter [...]
27. What are scientists doing?
“Shrink-wrapped” software
• Lab data management solutions leverage many customers, years of experience
• Year-to-year enhancement of product
• High purchase price due to limited market
• Mismatch to local lab expertise and workflow
• Unused complexity
28. LIMS
• Laboratory Information Management System
• Many out there now: standard and custom
• Many focus markets
• Basespace: Illumina (NGS)
• Quartzy: general lab monkey
• MiniLIMS
29. Disclaimer
• This will feel like a sales pitch
• Just want to illustrate how we’re tackling the information management problem
30. The BioTeam Solution: MiniLIMS
• An affordable software product that leverages real-world experience
• Decades of combined software and informatics expertise
• Years of LIMS customization
• $4995 license for academic labs
• Flexible architecture that adapts to new processes and technologies
• Schema-less design allows real-time changes to data model
• Plugin architecture allows mix-and-match functionality
31. The BioTeam Solution: MiniLIMS
• Customization options that match lab resources
• End-user-customizable system and Excel import/export that empowers lab scientists
• Accessible source code and APIs for in-house developers
• BioTeam consulting for labs without development resources, or development teams that are stretched thin
32. The BioTeam Solution: MiniLIMS
[Architecture diagram. Legend: MiniLIMS Core, MiniLIMS Plugins]
Labels: End user configurable Form and Page Display; PHP API; Data and Configuration Objects; Data Broker; Schema-less MySQL persistence; GC Mass Spec; Invoicing; Analysis tools; NGS
33. MiniLIMS: Linking lab to datacenter
[Workflow diagram. Legend: MiniLIMS Core, MiniLIMS Plugin, MiniLIMS Custom]
Labels: Central auth; Customer workflow; Lab workflow; Reagent inventory; Uptime reporting; User acct setup, login; Sample receiving; Sample registration; Sample / library prep, QC; Run / slide / flowcell setup; Sample status; Instrument console; Run monitoring; Results delivery, billing; Analysis launch, monitoring, results
34. Simple/Flexible Concept
[Diagram: Type, Name, Property, Value]
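The Type / Name / Property / Value concept can be illustrated with a small schema-less store. This is only a sketch of the general idea, not MiniLIMS's actual schema or API: it uses SQLite from the Python standard library as a stand-in for MySQL, and the table layout, function names, and sample data are assumptions.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Two generic tables: objects (type, name) and their property/value pairs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE objects (
            id   INTEGER PRIMARY KEY,
            type TEXT NOT NULL,
            name TEXT NOT NULL,
            UNIQUE (type, name)
        );
        CREATE TABLE properties (
            object_id INTEGER NOT NULL REFERENCES objects(id),
            property  TEXT NOT NULL,
            value     TEXT,
            PRIMARY KEY (object_id, property)
        );
        """
    )
    return conn


def put(conn: sqlite3.Connection, type_: str, name: str, **props) -> None:
    """Create the object if needed, then set (or overwrite) its properties."""
    conn.execute(
        "INSERT OR IGNORE INTO objects (type, name) VALUES (?, ?)", (type_, name)
    )
    (oid,) = conn.execute(
        "SELECT id FROM objects WHERE type = ? AND name = ?", (type_, name)
    ).fetchone()
    conn.executemany(
        "INSERT OR REPLACE INTO properties (object_id, property, value) "
        "VALUES (?, ?, ?)",
        [(oid, k, str(v)) for k, v in props.items()],
    )
    conn.commit()


def get(conn: sqlite3.Connection, type_: str, name: str) -> dict:
    """Return all properties of one object as a dict."""
    rows = conn.execute(
        "SELECT p.property, p.value FROM properties p "
        "JOIN objects o ON o.id = p.object_id "
        "WHERE o.type = ? AND o.name = ?",
        (type_, name),
    )
    return dict(rows)
```

Because properties are rows rather than columns, adding a new property to a sample requires no schema change, which is the kind of real-time data model change a schema-less design allows.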
35. Simple to query
36. Customizations: plugins
37. Customizations: plugins
1. Select Bowtie (the protocol)
2. Select the Fastq File & name the experiment
3. Click to run
38. Customizations: plugins
39. Workflows
40. Moving forward: Appliance
• Turnkey solution
• MiniLIMS + Local Analysis Engine
• Plan is to link to cloud resources: automatic backup & link to hosted MiniLIMS
• 16 cores, 96GB RAM, 18TB redundant storage, SSD for OS
• Solution for any lab needing LIMS
41. How to enable science
• Solidify storage infrastructure
• Add tiered storage with policy engine to move data
• Supply DR
• Enable metadata acceleration: SSDs + cache
• Implement middleware for rich metadata tracking
• Make it easy for the scientists
42. Thank you!