Big Data: tools and techniques for working with large data sets

Ian Stokes-Rees, PhD
Harvard Medical School, Boston, USA

Workshop on Tools, Technologies and Collaborative Opportunities for HPC in Life Sciences and Healthcare

http://portal.sbgrid.org
ijstokes@hkl.hms.harvard.edu

Big Data - Ian Stokes-Rees        ijstokes@hkl.hms.harvard.edu
Slides and Contact
   ijstokes@hkl.hms.harvard.edu

   http://linkedin.com/in/ijstokes
   http://slidesha.re/ijstokes-thailand2011

   http://www.sbgrid.org
   http://portal.sbgrid.org
   http://www.opensciencegrid.org

About Me

[Figure: workflow diagram. Labels: simple 2D crystal; Patterson map; rotational search; translation search; score model (best peak, R factor, electron density); alternatives; composites; aggregate and cluster]

Protein Structure Studies

Data, Data Everywhere ...
   • We are being overwhelmed with data
      • high temporal resolution due to fast electronics
      • high spatial resolution due to advanced imaging techniques
      • high dimensional data
      • large data sets
      • simulation
      • modeling
   • It is easy to drown in the flood of data
      • storage issues - capacity
      • ownership issues - security and collaboration
      • provenance - origin, access, changes

   Today, we’ll think about software, hardware, and models for coping with large quantities of data

Next Generation Sequencing

High Energy Physics
     40 MHz bunch crossing rate
     10 million data channels
     1 kHz level 1 event recording rate
     1-10 MB per event
     14 hours per day, 7+ months / year
     4 detectors
     6 PB of data / year
     data globally distributed for analysis (x2)

Molecular Dynamics Simulations
     1 fs time step
     1 ns snapshot
     1 µs simulation
     1e6 steps
     1000 frames
     10 MB / frame
     10 GB / sim
     20 CPU-years
     3 months (wall-clock)

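The molecular dynamics figures above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope check: it assumes decimal units (1 GB = 1000 MB) and reads "1e6 steps" as the number of time steps between saved snapshots. Both are assumptions, not statements from the slides, and the function and key names are made up for illustration.

```python
# Back-of-envelope check of the molecular dynamics numbers.
# Assumptions (not from the slides): decimal units, and "1e6 steps"
# interpreted as time steps per saved snapshot.

FS_PER_NS = 1_000_000        # femtoseconds in a nanosecond
FS_PER_US = 1_000_000_000    # femtoseconds in a microsecond


def md_budget(timestep_fs=1, snapshot_fs=FS_PER_NS, sim_fs=FS_PER_US,
              frame_mb=10, cpu_years=20, wall_months=3):
    """Derive step counts, output size and average concurrency."""
    steps_per_snapshot = snapshot_fs // timestep_fs
    total_steps = sim_fs // timestep_fs
    frames = sim_fs // snapshot_fs
    output_gb = frames * frame_mb / 1000
    # Average number of cores that must run concurrently to fit the
    # CPU time into the wall-clock window.
    avg_cores = cpu_years * 12 / wall_months
    return {
        "steps_per_snapshot": steps_per_snapshot,
        "total_steps": total_steps,
        "frames": frames,
        "output_gb": output_gb,
        "avg_cores": avg_cores,
    }


if __name__ == "__main__":
    for name, value in md_budget().items():
        print(f"{name:20s} {value}")
```

Under these assumptions the frame count and output size agree with the slide (1000 frames, 10 GB), and 20 CPU-years in 3 months of wall-clock time corresponds to roughly 80 cores running continuously.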
Electronic Patient Records
     77 page PDF (bespoke report)
     Clinical Document Architecture XML representation
     HTML rendering of XML via XSLT transform

Clinical Imaging Data
   DICOM - Digital Imaging and Communications in Medicine
   2D, 3D, 4D

It is clear there is no shortage of data.

   Potential for great new insights ...

   ... if we can organize, access, share, and analyze this data efficiently

Jumping to the end ...
     • Data can empower rather than overwhelm you
        • but this requires thought and planning
     • Understand your data sources
     • Understand your data consumers
     • Educate yourself on available tools and technology
     • Design your data management system suitably

Problems arising from “Big Data”
                  • Where to store
                  • How to store
                  • How to process
                  • Organization, searching, and meta-data
                  • How to manage access
                  • How to copy, move, and backup
                  • Provenance
                  • Lifecycle

Where to store (I)
  • RAM
     • fast
     • expensive
     • volatile
  • local disk
     • get a good controller (SATA/SAS2)
     • lots of fast spinning disk (7200+ rpm)
     • high bandwidth possible
     • good first stop for data
     • hard to share, persist, backup
     • SSD good for random reads: lots of small files, unpredictable I/O patterns
     • large files, sequential I/O: spinning disk comparable to SSDs
  • Parallel Filesystem
     • Gluster, Lustre, GPFS
     • HDFS (Hadoop)
     • auto-replication for parallel decentralized I/O

Where to store (II)
  • SAN with high performance interconnect
     • Storage Area Network
     • fully managed data storage
     • Fibre Channel (2 Gb/s) or InfiniBand (10, 20, 40 Gb/s) interconnect
     • parallel, non-blocking, dedicated routes
  • NAS over ethernet
     • Network Attached Storage
     • Think NFS, CIFS, Samba network interface to storage
     • ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
     • SATA (10k rpm, 2 TB, 3 Gb/s)
     • SAS2 (15k rpm, 750 GB, 6 Gb/s)
  • Cloud storage
     • Amazon S3
     • Box.net, Dropbox
     • BackBlaze: bit.ly/backblaze-20
  • Hybrid
     • Create in-house tiered storage

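To put the interconnect speeds quoted above in perspective, the sketch below estimates how long it would take to move a data set over each link. It is a rough calculation only: the 1 TB data set size is an assumed example, units are decimal, and each link is assumed to sustain its quoted rate with no protocol overhead, which real systems will not achieve.

```python
# Rough transfer-time estimates for the link speeds quoted above.
# Assumptions (not from the slides): a 1 TB data set, decimal units,
# and links sustaining their quoted rate with no protocol overhead.

LINK_MBPS = {
    "ethernet 1 Gb/s, contended (~500 Mb/s)": 500,
    "Fibre Channel 2 Gb/s": 2_000,
    "SATA 3 Gb/s": 3_000,
    "SAS2 6 Gb/s": 6_000,
    "InfiniBand 10 Gb/s": 10_000,
    "InfiniBand 40 Gb/s": 40_000,
}


def transfer_seconds(size_bytes, link_mbps):
    """Seconds needed to move size_bytes over a link of link_mbps megabits/s."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)


if __name__ == "__main__":
    one_tb = 10**12
    for name, mbps in LINK_MBPS.items():
        hours = transfer_seconds(one_tb, mbps) / 3600
        print(f"{name:42s} {hours:6.2f} h")
```

Even under these optimistic assumptions, contended 1 Gb/s ethernet needs several hours per terabyte, which is one reason to think about where data lives before deciding where to compute on it.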
How to store (data formats)
       • ASCII
           • tab delimited
           • comma separated
       • XML
           • DTD definition?
           • Schema definition?
           • Namespaces?
       • JSON
       • NetCDF
       • HDF5
       • DICOM
       • Matlab .MAT format
       • NumPy .NPZ format
       • Bespoke binary
       • SQL DB
           • MySQL
           • sqlite
           • Oracle
           • Access
           • SQL Server
       • Hierarchical DB
           • Berkeley XML DB
           • LDAP
       • Object-Relational Mapper
           • SQLAlchemy (Python)
           • Hibernate (Java, .NET)
           • Django ORM (Python)
       • No-SQL DB
           • MongoDB
           • CouchDB

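As a small illustration of how the choice of format affects what you get back, the sketch below writes the same records as comma-separated ASCII, JSON, and a sqlite table, using only the Python standard library. The record fields ("sample", "resolution") and file names are invented for the example.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical records, invented for this example.
RECORDS = [
    {"sample": "a1", "resolution": 1.8},
    {"sample": "b2", "resolution": 2.4},
]


def roundtrip(records, workdir):
    """Store records in three formats and read them back."""
    workdir = Path(workdir)

    # ASCII, comma separated: values come back as text, types are lost.
    csv_path = workdir / "records.csv"
    with csv_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["sample", "resolution"])
        writer.writeheader()
        writer.writerows(records)
    with csv_path.open(newline="") as fh:
        from_csv = list(csv.DictReader(fh))

    # JSON: structure and basic types (numbers, strings) survive.
    json_path = workdir / "records.json"
    json_path.write_text(json.dumps(records))
    from_json = json.loads(json_path.read_text())

    # SQL DB (sqlite): typed columns, and the data can be queried in place.
    con = sqlite3.connect(str(workdir / "records.db"))
    con.execute("CREATE TABLE records (sample TEXT, resolution REAL)")
    con.executemany(
        "INSERT INTO records VALUES (:sample, :resolution)", records)
    con.commit()
    from_sql = con.execute(
        "SELECT sample, resolution FROM records WHERE resolution < 2.0"
    ).fetchall()
    con.close()
    return from_csv, from_json, from_sql


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for result in roundtrip(RECORDS, tmp):
            print(result)
```

The CSV reader returns every value as a string, JSON preserves numbers, and the database lets the selection happen at query time rather than in application code.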
How to process
  • Analytical software
     • custom programs
     • Matlab
     • Perl
     • R
     • Python
     • SAS, SPSS
     • Tableau
  • Analytical environments
     • multi-core machine - 48+ core systems for under $5000 (USD)
     • GPU
     • compute cluster
     • supercomputers
     • grid computing
     • cloud computing
     • web-based services
     • network of workstations (NOW)
     • Map/Reduce models
     • “screen-saver” computing (BOINC)

48 cores, single system image

GPU Computing
   200-800 stream processing cores per card
   For $500 to $2000 (USD), up to an order of magnitude processing speedups may be possible

Open Science Grid
   www.opensciencegrid.org

Map/Reduce
         • Unix users:
            • cat | grep | sort | uniq > file
         • Map/Reduce equivalent:
            • input | map | shuffle | reduce > output
         • HadoopFS (HDFS)
            • large data set is automatically spread and replicated across local storage resources (disks) of each node in a cluster
         • Map
            • creates a job for each data block in the input
            • maps the computational kernel to each job
            • schedules jobs to nodes with required data block
            • each job produces a set of key/value pairs as the job result
         • Reduce
            • collects results from Map stage based on keys (Combine)
            • aggregates values to produce task (final) result

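The input | map | shuffle | reduce pipeline described above can be sketched in a few lines of single-process Python. This is only an in-memory illustration of the model: word counting is chosen here as an example kernel, the function names are made up, and nothing in it uses Hadoop or HDFS, which would distribute the blocks and jobs across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Map: turn one data block into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce(blocks):
    """Run map over every block, then shuffle and reduce the output."""
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Each string stands in for one data block of a larger input.
    blocks = ["big data big", "data sets", "big sets of data"]
    print(map_reduce(blocks))
```

Because each call to map_phase depends only on its own block, the map step is the part that a framework can schedule on whichever node already holds that block.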
Extensions
         • Pig and Hive
            • pig.apache.org, hive.apache.org
            • simplify writing Map/Reduce programs for Hadoop
            • SQL-like query language for datasets available on HDFS
         • Cloudera
            • www.cloudera.com
            • packaged distribution of Hadoop + extensions
            • education + training material
         • Amazon Elastic Map Reduce
            • aws.amazon.com/elasticmapreduce
            • Amazon “cloud-based” hosting of Hadoop for Map/Reduce using EC2 for compute and S3 for storage

Organization, Searching, and Meta-Data
         • Few “software” solutions for this problem
            • iRODS provides some of this
            • Unix “locate” database
            • SAN solutions may index software and provide tools for searching
         • Establish protocols, document, communicate
            • directory hierarchy
            • file naming
            • persisted working space
            • scratch/temporary space
         • Filesystem functionality
            • many file systems have per-file meta-data controls to add arbitrary key/value pairs
         • Augmented web-based view
            • cern_meta Apache module provides key/value pairs in HTTP HEAD
            • ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views

iRODS
         • www.irods.org
         • File-like paradigm for data-management
         • addition of meta-data
         • can integrate database resources
         • provides rich access policy management
         • automated workflows based on data actions
             • add, remove, modify
         • automated replication
         • built-in provenance
             • information life-cycle management

Search: Apache Lucene
         • lucene.apache.org
         • Java-based
         • full text querying and searching
         • indexing
         • Solr provides web interface

Meta-Data: Semantic MediaWiki

   • You know Wikipedia
   • It is built using MediaWiki
   • Semantic MediaWiki adds Semantic Web features
      • Flexible key/value schemas
      • User defined and changeable object classes
      • Built-in knowledge of dates → timelines
      • Built-in knowledge of locations → maps
      • Built-in handling of images → picture galleries

Access Control
  • Need a strong Identity Management environment
     • individuals: identity tokens and identifiers
     • groups: membership lists
     • Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix) all LDAP-based
  • Need to manage and communicate Access Control policies
     • institutionally driven
     • user driven
  • Need Authorization System
     • Policy Enforcement Point (shell login, data access, web access, start application)
     • Policy Decision Point (store policies and understand relationship of identity token and policy)

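The split between a Policy Decision Point and a Policy Enforcement Point can be illustrated with a toy in-memory model. This is a sketch only: the class and method names are hypothetical, identities are plain strings rather than real identity tokens, and a production system would back the group lists with a directory service such as the LDAP-based ones listed above.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyDecisionPoint:
    """Stores policies and relates identities (via groups) to them."""
    groups: dict = field(default_factory=dict)    # group -> set of users
    policies: dict = field(default_factory=dict)  # (resource, action) -> groups

    def add_member(self, group, user):
        self.groups.setdefault(group, set()).add(user)

    def allow(self, group, resource, action):
        self.policies.setdefault((resource, action), set()).add(group)

    def decide(self, user, resource, action):
        """Return True if any group containing user may perform action."""
        allowed_groups = self.policies.get((resource, action), set())
        return any(user in self.groups.get(group, set())
                   for group in allowed_groups)


class PolicyEnforcementPoint:
    """Guards a resource; asks the decision point before granting access."""

    def __init__(self, pdp):
        self.pdp = pdp

    def read(self, user, resource, store):
        if not self.pdp.decide(user, resource, "read"):
            raise PermissionError(f"{user} may not read {resource}")
        return store[resource]


if __name__ == "__main__":
    pdp = PolicyDecisionPoint()
    pdp.add_member("lab-a", "alice")
    pdp.allow("lab-a", "/data/run1", "read")
    pep = PolicyEnforcementPoint(pdp)
    print(pep.read("alice", "/data/run1", {"/data/run1": "payload"}))
```

Keeping the decision logic in one place means that every enforcement point (shell login, data access, web access) applies the same policies without duplicating them.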
Case Study: SBGrid
         • www.sbgrid.org
         • computing expertise for protein structure and function research
            • software
            • training
            • technical support
            • storage
            • cluster and grid computing
         • 150 member labs in consortium
            • about 1000 total researchers
         • structure imaging and model building:
            • imaging techniques are data intensive
            • model determination techniques are compute intensive

SBGrid Science Portal
[Figure: architecture diagram; recoverable labels follow]
   User: data, computations
   SBGrid Science Portal @ Harvard Medical School
      monitoring: Ganglia, Nagios, RSV, pacct
      interfaces: Apache, GridSite, Django, Sage Math, R-Studio, shell CLI
      data: scp, GridFTP, SRM, WebDAV
      computation: Condor, Cycle Server, VDT, Globus, glideinWMS
      ID mgmt: FreeIPA, LDAP, VOMS, GUMS, GACL
      backing resources: file server, SQL DB, cluster
   External services
      GlobusOnline @Argonne
      UC San Diego: GUMS, GridFTP + Hadoop, glideinWMS factory
      Open Science Grid
      MyProxy @NCSA, UIUC
      DOEGrids CA @Lawrence Berkeley Labs
      Gratia Acct'ing @FermiLab
      Monitoring @Indiana

Data Model

    • Data Tiers
       •   VO-wide: all sites, admin managed, very stable
       •   User project: all sites, user managed, 1-10 weeks, 1-3 GB
       •   User static: all sites, user managed, indefinite, 10 MB
       •   Job set: all sites, infrastructure managed, 1-10 days, 0.1-1 GB
       •   Job: direct to worker node, infrastructure managed, 1 day, <10 MB
       •   Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
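
The tiers above can be written down as a small lookup table. The following is only a sketch, not the portal's actual configuration: the class name `DataTier`, the helper `job_tiers_for`, and the decision to treat the upper end of each range on the slide as a hard limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 ** 2
GB = 1024 ** 3


@dataclass(frozen=True)
class DataTier:
    """One row of the data tier list (hypothetical representation)."""
    name: str
    scope: str
    managed_by: str
    max_lifetime_days: Optional[float]  # None = no limit stated on the slide
    max_size_bytes: Optional[int]       # None = no limit stated on the slide


# Values copied from the slide; upper ends of ranges are used as limits (assumption).
TIERS = [
    DataTier("VO-wide", "all sites", "admin", None, None),
    DataTier("User project", "all sites", "user", 10 * 7, 3 * GB),
    DataTier("User static", "all sites", "user", None, 10 * MB),
    DataTier("Job set", "all sites", "infrastructure", 10, 1 * GB),
    DataTier("Job", "direct to worker node", "infrastructure", 1, 10 * MB),
    DataTier("Job indirect", "worker node via UCSD", "infrastructure", 1, 10 * GB),
]


def fits(tier: DataTier, size_bytes: int, lifetime_days: float) -> bool:
    """True if the data fits within the tier's size and lifetime limits."""
    size_ok = tier.max_size_bytes is None or size_bytes <= tier.max_size_bytes
    life_ok = tier.max_lifetime_days is None or lifetime_days <= tier.max_lifetime_days
    return size_ok and life_ok


def job_tiers_for(size_bytes: int, lifetime_days: float = 1) -> list:
    """Names of the infrastructure-managed tiers that could hold this data."""
    return [
        t.name
        for t in TIERS
        if t.managed_by == "infrastructure" and fits(t, size_bytes, lifetime_days)
    ]


if __name__ == "__main__":
    print(job_tiers_for(5 * MB))
    print(job_tiers_for(2 * GB))
```

A 2 GB job input, for example, is too large for the direct "Job" tier and the "Job set" tier, so only the indirect route via UCSD remains.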



Data Management
    • quota
    • du scan
    • tmpwatch
    • conventions
    • workflow integration

Data Movement
    • scp (users)
    • rsync (VO-wide)
    • grid-ftp (UCSD)
    • curl (WNs)
    • cp (NFS)
    • htcp (secure web)
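
To make the "du scan", "quota", and "tmpwatch" items concrete, here is a minimal sketch of what such housekeeping checks could look like using only the Python standard library. It is not the tooling used at SBGrid. The function names are hypothetical, and unlike `tmpwatch`, which by default looks at access time, this sketch uses modification time and only lists candidates instead of deleting them.

```python
import os
import time


def du_scan(root):
    """Total size in bytes of regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; ignore it in the total.
                pass
    return total


def over_quota(root, quota_bytes):
    """True if the directory tree uses more than quota_bytes."""
    return du_scan(root) > quota_bytes


def stale_files(root, max_age_days, now=None):
    """Sorted list of files not modified within the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    stale.append(path)
            except OSError:
                pass
    return sorted(stale)


if __name__ == "__main__":
    here = os.getcwd()
    print("bytes used:", du_scan(here))
    print("older than 30 days:", len(stale_files(here, 30)))
```

Reporting stale files rather than removing them keeps the sketch safe to run; an actual cleanup policy would follow whatever conventions the site has agreed on.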




[Data flow diagram. Legend: red - push files, green - pull files]

    1. user file upload
    2. replicate gold standard
    3. Auto-replicate
    4. pull files from UCSD to WNs
    5. pull files from local NFS to WNs
    6. pull files from SBGrid to WNs
    7. job results copied back to SBGrid
    8a. large job results copied to UCSD
    8b. later pulled to SBGrid
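
Steps 7, 8a, and 8b describe two return paths for job results. The sketch below shows one way that routing decision might be expressed. The slides do not say what counts as a "large" result, so the 1 GB threshold, the function name `route_results`, and the path labels are assumptions for illustration only.

```python
GB = 1024 ** 3


def route_results(result_sizes, large_threshold_bytes=1 * GB):
    """Plan the return path for each job result file.

    result_sizes maps a file name to its size in bytes. Small results go
    straight back to SBGrid (step 7). Large results are staged at UCSD
    (step 8a) and pulled to SBGrid later (step 8b). The threshold is a
    hypothetical value, not one taken from the slides.
    """
    plan = []
    for name, size in sorted(result_sizes.items()):
        if size > large_threshold_bytes:
            plan.append((name, "worker node -> UCSD"))        # step 8a
            plan.append((name, "UCSD -> SBGrid (deferred)"))  # step 8b
        else:
            plan.append((name, "worker node -> SBGrid"))      # step 7
    return plan


if __name__ == "__main__":
    for name, hop in route_results({"job.log": 20_000, "density.map": 3 * GB}):
        print(name, hop)
```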
Copy, Move, Backup
       • Large data sets are difficult to copy, move, replicate, and back up
       • Tools and protocols required, with management
          •   sys admin (technical knowledge)
          •   archivist/curator (domain knowledge)
       • Common structure:
          •   Tier 1 - single master copy of data (live), possible offline tape backup
          •   Tier 2 - multiple reliable T-1 replicas serving a specific community
          •   Tier 3 - temporary “working set” T-2 replicas of required data
       • GridFTP
       • Storage Resource Broker (SRB)
       • GlobusOnline
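
Whichever tool moves the data, a replica is only useful if it matches the master copy. As an illustration (not something described in the slides), the sketch below compares SHA-256 checksums of a Tier 1 master file and its replicas, reading in fixed-size chunks so that large files never need to fit in memory. The function names and the 1 MiB chunk size are assumptions.

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_replicas(master_path, replica_paths):
    """Map each replica path to True if its content matches the master."""
    expected = file_digest(master_path)
    return {p: file_digest(p) == expected for p in replica_paths}


if __name__ == "__main__":
    import sys
    if len(sys.argv) >= 3:
        for path, ok in verify_replicas(sys.argv[1], sys.argv[2:]).items():
            print("OK  " if ok else "DIFF", path)
```

In practice the master digest would usually be computed once and stored alongside the data, so that replicas at remote sites can be checked without re-reading the master.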

Globus Online: High Performance Reliable 3rd Party File Transfer
http://www.globusonline.org

[Diagram: file transfer endpoints: portal, cluster, data collection facility, lab file server, desktop, laptop]
Summary
     • Data can empower rather than overwhelm you
        •   but this requires thought and planning
     • Understand your data sources
     • Understand your data consumers
     • Educate yourself on available tools and technology
     • Design your data management system suitably



Acknowledgements & Questions
  • Piotr Sliz
     •   Principal Investigator, head of SBGrid
  • SBGrid System Administrators
     •   Ian Levesque, Peter Doherty
  • Globus Online Team
     •   Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
  • Terrence Martin
     •   System administrator at UCSD, for assistance and encouragement using 1 PB Hadoop storage array
  • Brian Bockelman
     •   Physics faculty at University of Nebraska
  • Steve Timm
     •   System administrator at FermiLab
  • Ruth Pordes
     •   Director of OSG, for championing SBGrid

Please contact me with any questions:
  • Ian Stokes-Rees
  • ijstokes@hkl.hms.harvard.edu
  • ijstokes@spmetric.com

Look at our work:
  • portal.sbgrid.org
  • www.sbgrid.org
  • www.opensciencegrid.org
Big Data: tools and techniques for working with large data sets

  • 1. Big
Data:
tools
and
 techniques
for
working
 with
large
data
sets Ian
Stokes‐Rees,
PhD Harvard
Medical
School,
Boston,
USA Workshop
on
Tools,
Technologies
and
Collaborative
 Opportunities
for
HPC
in
Life
Sciences
and
Healthcare http://portal.sbgrid.org ijstokes@hkl.hms.harvard.edu
  • 2. Slides
and
Contact ijstokes@hkl.hms.harvard.edu http://linkedin.com/in/ijstokes http://slidesha.re/ijstokes-thailand2011 Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 3. Slides
and
Contact ijstokes@hkl.hms.harvard.edu http://linkedin.com/in/ijstokes http://slidesha.re/ijstokes-thailand2011 http://www.sbgrid.org http://portal.sbgrid.org http://www.opensciencegrid.org Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 4. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 5. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 6. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 7. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 8. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 9. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 10. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 11. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 12. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 13. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 14. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 15. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 16. About
Me Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 17. rotational translation 2D
simple
crystal Patterson
map search search score
model: aggregate best
peak,
R
factor, alternatives composites and
cluster electron
density Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 18. Protein Structure Studies Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 19. Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 20. Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 21. Data,
Data
Everywhere
... Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 22. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 23. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 24. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 25. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 26. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 27. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 28. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation • modeling Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 29. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation • modeling • It
is
easy
to
drown
in
the
Olood
of
data Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 30. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation • modeling • It
is
easy
to
drown
in
the
Olood
of
data • storage
issues
‐
capacity Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 31. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation • modeling • It
is
easy
to
drown
in
the
Olood
of
data • storage
issues
‐
capacity • ownership
issues
‐
security
and
collaboration Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 32. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation • modeling • It
is
easy
to
drown
in
the
Olood
of
data • storage
issues
‐
capacity • ownership
issues
‐
security
and
collaboration • provenance
‐
origin,
access,
changes Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 33. Data,
Data
Everywhere
... • We
are
being
overwhelmed
with
data • high
temporal
resolution
due
to
fast
electronics • high
spatial
resolution
due
to
advanced
imaging
 techniques • high
dimensional
data • large
data
sets • simulation • modeling • It
is
easy
to
drown
in
the
Olood
of
data • storage
issues
‐
capacity • ownership
issues
‐
security
and
collaboration • provenance
‐
origin,
access,
changes Today,
we’ll
think
about
software,
hardware,
and
 models
for
coping
with
large
quantities
of
data Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
  • 34. Next Generation Sequencing
  • 35. High Energy Physics
  • 40.
      • 40 MHz bunch crossing rate
      • 10 million data channels
      • 1 kHz level 1 event recording rate
      • 1-10 MB per event
      • 14 hours per day, 7+ months / year
      • 4 detectors
      • 6 PB of data / year
      • globally distribute data for analysis (x2)
  • 42. Molecular Dynamics Simulations
      • 1 fs time step
      • 1 ns snapshot
      • 1 µs simulation
      • 1e6 steps
      • 1000 frames
      • 10 MB / frame
      • 10 GB / sim
      • 20 CPU-years
      • 3 months (wall-clock)
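The figures on the Molecular Dynamics slide can be cross-checked with a little arithmetic. The sketch below assumes that "1e6 steps" is the number of 1 fs steps between two 1 ns snapshots, and that 20 CPU-years delivered in 3 months implies a roughly constant core count. Both readings are assumptions, not statements from the slide.

```python
# Back-of-envelope check of the molecular dynamics numbers.
# Assumption: "1e6 steps" means time steps per saved snapshot.

FS_PER_NS = 1_000_000   # femtoseconds in one nanosecond
NS_PER_US = 1_000       # nanoseconds in one microsecond

time_step_fs = 1            # 1 fs time step
snapshot_interval_ns = 1    # one frame saved every 1 ns
simulation_length_us = 1    # 1 microsecond of simulated time
frame_size_mb = 10          # 10 MB per saved frame

steps_per_snapshot = snapshot_interval_ns * FS_PER_NS // time_step_fs
frames = simulation_length_us * NS_PER_US // snapshot_interval_ns
total_steps = steps_per_snapshot * frames
output_gb = frames * frame_size_mb / 1000   # decimal units: 1 GB = 1000 MB

cpu_years = 20
wall_clock_years = 3 / 12
implied_cores = cpu_years / wall_clock_years   # assumes constant utilisation

print(steps_per_snapshot, frames, total_steps, output_gb, implied_cores)
```

Under these assumptions the run takes 1e9 steps in total, the 1000 frames add up to the 10 GB per simulation quoted on the slide, and the job would occupy on the order of 80 cores for the full three months.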
  • 43. Electronic Patient Records
  • 44. Electronic Patient Records
      • 77 page PDF (bespoke report)
  • 45. Electronic Patient Records
      • Clinical Document Architecture XML representation
  • 46. Electronic Patient Records
      • HTML rendering of XML via XSLT transform
  • 47. Clinical Imaging Data
      • DICOM - Digital Imaging and Communications in Medicine
      • 2D, 3D, 4D
  • 48. Clinical Imaging Data
  • 52. It is clear there is no shortage of data.
      Potential for great new insights ...
      ... if we can organize, access, share, and analyze this data efficiently
  • 58. Jumping to the end ...
      • Data can empower rather than overwhelm you
          • but this requires thought and planning
      • Understand your data sources
      • Understand your data consumers
      • Educate yourself on available tools and technology
      • Design your data management system suitably
  • 67. Problems arising from “Big Data”
      • Where to store
      • How to store
      • How to process
      • Organization, searching, and meta-data
      • How to manage access
      • How to copy, move, and backup
      • Provenance
      • Lifecycle
  • 71. Where to store (I)
      • RAM
          • fast
          • expensive
          • volatile
      • local disk
          • get a good controller (SATA/SAS2)
          • lots of fast spinning disk (7200+ rpm)
          • high bandwidth possible
          • good first stop for data
          • hard to share, persist, backup
          • SSD good for random reads: lots of small files, unpredictable I/O patterns
          • large files, sequential I/O, spinning disk comparable to SSDs
      • Parallel Filesystem
          • Gluster, Lustre, GPFS
          • HDFS (Hadoop)
          • auto-replication for parallel decentralized I/O
  • 75. Where to store (II)
      • SAN with high performance interconnect
          • Storage Area Network
          • fully managed data storage
          • fiber channel (2 Gb/s) or Infiniband (10, 20, 40 Gb/s) interconnect
          • parallel, non-blocking, dedicated routes
      • NAS over ethernet
          • Network Attached Storage
          • Think NFS, CIFS, Samba network interface to storage
          • ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
      • SATA (10k rpm, 2 TB, 3 Gb/s)
      • SAS2 (15k rpm, 750 GB, 6 Gb/s)
      • Hybrid
          • Create in-house tiered storage
      • Cloud storage
          • Amazon S3
          • Box.net, Dropbox
          • BackBlaze: bit.ly/backblaze-20
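To make the interconnect speeds above more tangible, here is a minimal sketch that converts a link rate into an ideal transfer time. The 1 TB data set size and the helper name `transfer_hours` are assumptions chosen for illustration. The calculation ignores protocol overhead, disk throughput, and contention beyond the ~500 Mb/s effective Ethernet figure from the slide.

```python
# Hypothetical helper: ideal transfer time over a network link.

def transfer_hours(size_tb, link_mbps):
    """Hours needed to move size_tb terabytes over a link of link_mbps megabits/s."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # megabits/s -> bits/s
    return seconds / 3600

# Link speeds quoted on the slide, in Mb/s.
links = {
    "Ethernet, effective (~500 Mb/s)": 500,
    "Ethernet, nominal (1 Gb/s)": 1_000,
    "Fiber channel (2 Gb/s)": 2_000,
    "Infiniband (10 Gb/s)": 10_000,
    "Infiniband (40 Gb/s)": 40_000,
}

dataset_tb = 1  # assumed data set size, for illustration only
for name, mbps in links.items():
    print(f"{name}: {transfer_hours(dataset_tb, mbps):.2f} h")
```

At the effective Ethernet rate, a single terabyte already takes roughly four and a half hours to move, which is one reason the choice of storage location matters before any processing starts.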
  • 76. How to store (data formats)
      • ASCII
          • tab delimited
          • comma separated
      • XML
          • DTD definition?
          • Schema definition?
          • Namespaces?
      • JSON
      • NetCDF
      • HDF5
      • DICOM
      • Matlab .MAT format
      • NumPy .NPZ format
      • Bespoke binary
      • SQL DB
          • MySQL
          • SQLite
          • Oracle
          • Access
          • SQL Server
      • Hierarchical DB
          • Berkeley XML DB
          • LDAP
      • Object-Relational Mapper
          • SQLAlchemy (Python)
          • Hibernate (Java, .NET)
          • Django ORM (Python)
      • No-SQL DB
          • MongoDB
          • CouchDB
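As a small illustration of how the choice of format affects what can be done with the data, the sketch below stores the same two records as tab-delimited ASCII, as JSON, and in an SQL database (SQLite, in memory). The record fields `sample` and `resolution` and the table name `measurements` are invented for the example.

```python
import csv
import io
import json
import sqlite3

# Hypothetical records, for illustration only.
records = [
    {"sample": "A1", "resolution": 2.1},
    {"sample": "B7", "resolution": 1.8},
]

# 1. Tab-delimited ASCII: easy to read, but every value becomes text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample", "resolution"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(records)
tsv_text = buffer.getvalue()

# 2. JSON: keeps nesting and basic types (string, number) on a round trip.
json_text = json.dumps(records)
round_trip = json.loads(json_text)

# 3. SQL DB: typed columns plus a query language for searching and sorting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sample TEXT PRIMARY KEY, resolution REAL)")
conn.executemany("INSERT INTO measurements VALUES (:sample, :resolution)", records)
best = conn.execute(
    "SELECT sample FROM measurements ORDER BY resolution LIMIT 1"
).fetchone()[0]

print(tsv_text)
print(round_trip == records, best)
```

The text formats are the simplest to exchange, while the database adds indexing and querying at the cost of needing a schema up front.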
  • 77. How to process
      • Analytical software
          • custom programs
          • Matlab
          • Perl
          • R
          • Python
          • SAS, SPSS
          • Tableau
          • web-based services
          • Map/Reduce models
      • Analytical environments
          • multi-core machine - 48+ core systems for under $5000 (USD)
          • GPU
          • compute cluster
          • supercomputers
          • grid computing
          • cloud computing
          • network of workstations (NOW)
          • “screen-saver” computing (BOINC)
  • 79. 48 cores, single system image
  • 82. GPU Computing
      • 200-800 stream processing cores per card
      • For $500 to $2000 (USD), up to an order of magnitude processing speedups may be possible
  • 86. Open Science Grid
      • www.opensciencegrid.org
  • 87. Map/Reduce
      • Unix users:
          • cat | grep | sort | uniq > file
      • Map/Reduce equivalent:
          • input | map | shuffle | reduce > output
      • HadoopFS (HDFS)
          • large data set is automatically spread and replicated across local storage resources (disks) of each node in a cluster
      • Map
          • creates a job for each data block in the input
          • maps the computational kernel to each job
          • schedules jobs to nodes with required data block
          • each job produces a set of key/value pairs as its job result
      • Reduce
          • collects results from Map stage based on keys (Combine)
          • aggregates values to produce task (final) result
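The input | map | shuffle | reduce pipeline can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop: the "blocks" are short strings standing in for HDFS data blocks, the computational kernel is a word count chosen for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_block(block):
    """Map: run the kernel on one data block and emit key/value pairs."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce: aggregate the values for one key into a final result."""
    return key, sum(values)

# Two stand-in data blocks; a real cluster would run one map job per block,
# on a node that holds a local replica of that block.
blocks = ["big data big tools", "data sets and big data"]

mapped = chain.from_iterable(map_block(block) for block in blocks)
counts = dict(reduce_group(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_block` only sees its own block, the map jobs are independent and can run in parallel; only the shuffle step needs to move data between nodes.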
  • 88. Extensions
      • Pig and Hive
          • pig.apache.org, hive.apache.org
          • simplify writing Map/Reduce programs for Hadoop
          • SQL-like query language for datasets available on HDFS
      • Cloudera
          • www.cloudera.com
          • packaged distribution of Hadoop + extensions
          • education + training material
      • Amazon Elastic Map Reduce
          • aws.amazon.com/elasticmapreduce
          • Amazon “cloud-based” hosting of Hadoop for Map/Reduce using EC2 for compute and S3 for storage
  • 89. Organization, Searching, and Meta-Data
      • Few “software” solutions for this problem
          • iRODS provides some of this
          • Unix “locate” database
          • SAN solutions may index software and provide tools for searching
      • Establish protocols, document, communicate
          • directory hierarchy
          • file naming
          • persisted working space
          • scratch/temporary space
      • Filesystem functionality
          • many file systems have per-file meta-data controls to add arbitrary key/value pairs
      • Augmented web-based view
          • cern_meta Apache module provides key/value pairs in HTTP HEAD
          • ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views
  • 90. www.irods.org
      • File-like paradigm for data-management
      • addition of meta-data
      • can integrate database resources
      • provides rich access policy management
      • automated workflows based on data actions
          • add, remove, modify
      • automated replication
      • built-in provenance
      • information life-cycle management
  • 91. Search: Apache
      • lucene.apache.org
      • Java-based
      • full text querying and searching
      • indexing
      • Solr provides web interface
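Full-text search engines are built around an inverted index, which maps each term to the documents that contain it. The sketch below shows that idea in plain Python. It is not Lucene's API or implementation, and the file names and texts are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Querying: return documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents, for illustration only.
documents = {
    "run1.txt": "Molecular dynamics simulation of lysozyme",
    "run2.txt": "Molecular replacement search results",
    "notes.txt": "Simulation parameters and notes",
}
index = build_index(documents)
hits = search(index, "molecular simulation")
print(hits)
```

The index is built once and each query then costs a few set intersections, instead of a scan over every file.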
  • 95. Meta-Data: Semantic Media Wiki
      • You know Wikipedia
      • It is built using MediaWiki
      • Semantic Media Wiki adds Semantic Web features
          • Flexible key/value schemas
          • User defined and changeable object classes
          • Built-in knowledge of dates → timelines
          • Built-in knowledge of locations → maps
          • Built-in handling of images → picture galleries
  • 99. Access Control
      • Need a strong Identity Management environment
          • individuals: identity tokens and identifiers
          • groups: membership lists
          • Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix) all LDAP-based
      • Need to manage and communicate Access Control policies
          • institutionally driven
          • user driven
      • Need Authorization System
          • Policy Enforcement Point (shell login, data access, web access, start application)
          • Policy Decision Point (store policies and understand relationship of identity token and policy)
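The split between a Policy Decision Point and a Policy Enforcement Point can be sketched as follows. This is a minimal in-memory model, assuming group-based policies keyed by resource and action. The class, function, user, group, and path names are all hypothetical, and a real deployment would resolve identities and groups through an LDAP-based directory such as those listed above.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyDecisionPoint:
    """Stores policies and relates an identity to them."""
    group_members: dict = field(default_factory=dict)  # group -> set of user ids
    policies: dict = field(default_factory=dict)       # (resource, action) -> set of groups

    def is_permitted(self, user, resource, action):
        allowed_groups = self.policies.get((resource, action), set())
        return any(user in self.group_members.get(group, set())
                   for group in allowed_groups)

def enforce(pdp, user, resource, action):
    """Policy Enforcement Point: ask the PDP before letting the action proceed."""
    if not pdp.is_permitted(user, resource, action):
        raise PermissionError(f"{user} may not {action} {resource}")
    return f"{user} granted {action} on {resource}"

# Hypothetical identities and policy, for illustration only.
pdp = PolicyDecisionPoint(
    group_members={"lab-members": {"alice"}},
    policies={("/data/project", "read"): {"lab-members"}},
)
print(enforce(pdp, "alice", "/data/project", "read"))
```

Keeping the decision logic in one place means that shell login, data access, and web access can each act as an enforcement point while sharing the same policies.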
  • 100. Case Study: SBGrid
      • www.sbgrid.org
      • computing expertise for protein structure and function research
          • software
          • training
          • technical support
          • storage
          • cluster and grid computing
      • 150 member labs in consortium
          • about 1000 total researchers
      • structure imaging and model building:
          • imaging techniques are data intensive
          • model determination techniques are compute intensive
  • 101. SBGrid Science Portal
      [Architecture diagram: SBGrid Science Portal @ Harvard Medical School.
       Component groups: interfaces, data, computation, ID mgmt, monitoring; file server, SQL DB, shell CLI, cluster.
       Software labels: scp, GridFTP, SRM, WebDAV, Apache, GridSite, Django, Sage Math, R-Studio, Condor, Cycle Server, VDT, Globus, glideinWMS, FreeIPA, LDAP, VOMS, GUMS, GACL, Ganglia, Nagios, RSV, pacct.
       External services and sites: GlobusOnline, GridFTP + Hadoop, glideinWMS factory, Open Science Grid computations, MyProxy, DOEGrids CA, Gratia Acct'ing, Monitoring; UC San Diego, @Argonne, @NCSA, UIUC, @Lawrence Berkeley Labs, @FermiLab, @Indiana.]
  • 102. Data Model
      • Data Tiers
          • VO-wide: all sites, admin managed, very stable
          • User project: all sites, user managed, 1-10 weeks, 1-3 GB
          • User static: all sites, user managed, indefinite, 10 MB
          • Job set: all sites, infrastructure managed, 1-10 days, 0.1-1 GB
          • Job: direct to worker node, infrastructure managed, 1 day, <10 MB
          • Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
  • 103.
      • Data Management
          • quota
          • du scan
          • tmpwatch
          • conventions
          • workflow integration
      • Data Movement
          • scp (users)
          • rsync (VO-wide)
          • grid-ftp (UCSD)
          • curl (WNs)
          • cp (NFS)
          • htcp (secure web)
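The "du scan" and "tmpwatch" items can be approximated with two short functions. This is a sketch under stated assumptions, not the tools themselves: it sums apparent file sizes rather than allocated disk blocks as `du` does, and it judges staleness by modification time, whereas `tmpwatch` uses access time by default. The function names are hypothetical.

```python
import os
import time

def directory_usage(root):
    """Sum the apparent size in bytes of all files under root (like a du scan)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

def stale_files(root, max_age_days, now=None):
    """List files under root not modified within max_age_days (tmpwatch-style)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                pass
    return sorted(stale)
```

A scheduled job could call `directory_usage` on each user area to report against quota, and `stale_files` on scratch space to list candidates for cleanup. Deleting them is left out deliberately.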
  • 112. [Diagram legend: red - push files, green - pull files]
      1. user file upload
      2. replicate gold standard
      3. Auto-replicate
      4. pull files from UCSD to WNs
      5. pull files from local NFS to WNs
      6. pull files from SBGrid to WNs
      7. job results copied back to SBGrid
      8a. large job results copied to UCSD
      8b. later pulled to SBGrid
  • 126. Copy, Move, Backup
      • Large data sets are difficult to copy, move, replicate, and back up
      • Tools and protocols required, with management
          • sys admin (technical knowledge)
          • archivist/curator (domain knowledge)
      • Common structure:
          • Tier 1 - single master copy of data (live), possible offline tape backup
          • Tier 2 - multiple reliable T-1 replicas serving a specific community
          • Tier 3 - temporary “working set” T-2 replicas of required data
      • GridFTP
      • Storage Resource Broker (SRB)
      • GlobusOnline
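Whatever tool moves the data between tiers, a replica is only useful if it matches the master copy. The sketch below copies a file and verifies it with a SHA-256 checksum. Checksum verification is an illustrative addition, not something the slides prescribe, and the function names are hypothetical. It uses a local copy as a stand-in for a real transfer tool.

```python
import hashlib
import shutil

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicate(source, destination):
    """Copy source to destination and confirm the replica matches the original."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise OSError(f"checksum mismatch for {destination}")
    return expected
```

Storing the returned digest next to each replica lets later audits detect silent corruption without going back to the master copy.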
  • 127. Globus Online: High Performance Reliable 3rd Party File Transfer
      • http://www.globusonline.org
      • [Diagram labels: portal, cluster, data collection facility, lab file server, desktop, laptop]
  • 129. Summary
      • Data can empower rather than overwhelm you
          • but this requires thought and planning
      • Understand your data sources
      • Understand your data consumers
      • Educate yourself on available tools and technology
      • Design your data management system suitably
  • 131. Acknowledgements & Questions
      • Piotr Sliz
          • Principal Investigator, head of SBGrid
      • SBGrid System Administrators
          • Ian Levesque, Peter Doherty
      • Globus Online Team
          • Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
      • Terrence Martin
          • System administrator at UCSD, for assistance and encouragement using 1 PB Hadoop storage array
      • Brian Bockelman
          • Physics faculty at University of Nebraska
      • Steve Timm
          • System administrator at FermiLab
      • Ruth Pordes
          • Director of OSG, for championing SBGrid
      Please contact me with any questions:
          • Ian Stokes-Rees
          • ijstokes@hkl.hms.harvard.edu
          • ijstokes@spmetric.com
      Look at our work:
          • portal.sbgrid.org
          • www.sbgrid.org
          • www.opensciencegrid.org
