Working with thousands, millions, or billions of data records in high dimensions is increasingly becoming the reality for scientific research. What are some techniques to make this kind of data volume tractable? How can parallel computing help? In this talk I'll review data management tools and infrastructures, languages, and paradigms that help in this regard. In particular, I'll discuss Hadoop, MapReduce, Python, NumPy, and Globus Online to provide a survey of ways in which researchers can manage their data and process it in parallel.
Big Data: tools and techniques for working with large data sets
1. Big Data: tools and techniques for working with large data sets
Ian Stokes‐Rees, PhD
Harvard Medical School, Boston, USA
Workshop on Tools, Technologies and Collaborative
Opportunities for HPC in Life Sciences and Healthcare
http://portal.sbgrid.org
ijstokes@hkl.hms.harvard.edu
Big Data - Ian Stokes-Rees ijstokes@hkl.hms.harvard.edu
3. Slides and Contact
ijstokes@hkl.hms.harvard.edu
http://linkedin.com/in/ijstokes
http://slidesha.re/ijstokes-thailand2011
http://www.sbgrid.org
http://portal.sbgrid.org
http://www.opensciencegrid.org
17. [Figure: workflow diagram. Recoverable labels: 2D simple crystal, Patterson map, rotational search, translation search, score (best peak, R factor), model, alternatives, composites, aggregate and cluster, electron density]
33. Data, Data Everywhere ...
• We are being overwhelmed with data
  • high temporal resolution due to fast electronics
  • high spatial resolution due to advanced imaging techniques
  • high dimensional data
  • large data sets
  • simulation
  • modeling
• It is easy to drown in the flood of data
  • storage issues ‐ capacity
  • ownership issues ‐ security and collaboration
  • provenance ‐ origin, access, changes
Today, we’ll think about software, hardware, and models for coping with large quantities of data
40. 40 MHz bunch crossing rate
10 million data channels
1 kHz level 1 event recording rate
110 MB per event
14 hours per day, 7+ months / year
4 detectors
6 PB of data / year
globally distribute data for analysis (x2)
52. It is clear there is no shortage of data.
Potential for great new insights ...
... if we can organize, access, share, and analyze this data efficiently
58. Jumping to the end ...
• Data can empower rather than overwhelm you
• but this requires thought and planning
• Understand your data sources
• Understand your data consumers
• Educate yourself on available tools and technology
• Design your data management system suitably
67. Problems arising from “Big Data”
• Where to store
• How to store
• How to process
• Organization, searching, and meta‐data
• How to manage access
• How to copy, move, and backup
• Provenance
• Lifecycle
71. Where to store (I)
• RAM
  • fast
  • expensive
  • volatile
• local disk
  • get a good controller (SATA/SAS2)
  • lots of fast spinning disk (7200+ rpm)
  • high bandwidth possible
  • good first stop for data
  • hard to share, persist, backup
  • SSD good for random reads: lots of small files, unpredictable I/O patterns
  • large files, sequential I/O: spinning disk comparable to SSDs
• Parallel Filesystem
  • gluster, lustre, gpfs
  • HDFS (Hadoop)
  • auto‐replication for parallel decentralized I/O
75. Where to store (II)
• SAN with high performance interconnect
  • Storage Area Network
  • fully managed data storage
  • Fibre Channel (2 Gb/s) or InfiniBand (10, 20, 40 Gb/s) interconnect
  • parallel, non‐blocking, dedicated routes
• NAS over ethernet
  • Network Attached Storage
  • Think NFS, CIFS, Samba network interface to storage
  • ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
  • SATA (10k rpm, 2 TB, 3 Gb/s)
  • SAS2 (15k rpm, 750 GB, 6 Gb/s)
• Cloud storage
  • Amazon S3
  • Box.net, Dropbox
  • BackBlaze: bit.ly/backblaze‐20
• Hybrid
  • Create in‐house tiered storage
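The interconnect speeds quoted above translate directly into how long it takes to move a data set. The following is a back-of-envelope sketch only: the 1 TB data set size is an illustrative assumption (not a figure from the talk), and it assumes ideal sustained throughput with no protocol or disk overhead.

```python
# Rough transfer times for the link speeds quoted on the slide.
# The 1 TB payload and the idealised throughput are assumptions.

LINKS_GBPS = {
    "ethernet 1 Gb/s with contention (~500 Mb/s)": 0.5,
    "ethernet 1 Gb/s": 1.0,
    "Fibre Channel 2 Gb/s": 2.0,
    "InfiniBand 10 Gb/s": 10.0,
    "InfiniBand 40 Gb/s": 40.0,
}


def transfer_hours(size_bytes: float, link_gbps: float) -> float:
    """Hours needed to move size_bytes over a link of link_gbps gigabits/s."""
    bits = size_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    one_tb = 1e12  # decimal terabyte
    for name, gbps in LINKS_GBPS.items():
        print(f"{name:45s} {transfer_hours(one_tb, gbps):6.2f} h")
```

Under these assumptions a contended 1 Gb/s ethernet link needs roughly 4.4 hours per terabyte, while a 40 Gb/s InfiniBand link needs a few minutes. Real transfers will be slower once disks and protocols are involved.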
76. How to store (data formats)
• ASCII
  • tab delimited
  • comma separated
• XML
  • DTD definition?
  • Schema definition?
  • Namespaces?
• JSON
• NetCDF
• HDF5
• DICOM
• Matlab .MAT format
• NumPy .NPZ format
• Bespoke binary
• SQL DB
  • MySQL
  • sqlite
  • Oracle
  • Access
  • SQL Server
• Hierarchical DB
  • Berkeley XML DB
  • LDAP
• Object‐Relational Mapper
  • SQLAlchemy (Python)
  • Hibernate (Java, .NET)
  • Django ORM (Python)
• No‐SQL DB
  • MongoDB
  • CouchDB
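To make the trade-off between text and binary formats concrete, here is a small sketch that writes the same array as comma separated ASCII and as a compressed NumPy .npz archive, then reads the binary copy back. It assumes NumPy is installed. The array is synthetic random data, and the function and array names are hypothetical.

```python
import io

import numpy as np


def roundtrip_sizes(n_rows: int = 10_000, n_cols: int = 8, seed: int = 0):
    """Serialize one float64 array as CSV text and as compressed .npz.

    Returns (csv_bytes, npz_bytes, restored_array, original_array).
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_rows, n_cols))

    # ASCII, comma separated: portable and human readable, but bulky
    csv_buf = io.StringIO()
    np.savetxt(csv_buf, data, delimiter=",")
    csv_bytes = len(csv_buf.getvalue().encode("ascii"))

    # NumPy .npz: binary, keeps dtype and shape, arrays stored by name
    npz_buf = io.BytesIO()
    np.savez_compressed(npz_buf, measurements=data)
    npz_bytes = npz_buf.getbuffer().nbytes

    # Read the binary archive back to confirm a lossless round trip
    npz_buf.seek(0)
    with np.load(npz_buf) as archive:
        restored = archive["measurements"]
    return csv_bytes, npz_bytes, restored, data
```

The binary archive restores the array bit for bit and takes less space than the text version, at the cost of not being readable in a text editor.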
77. How to process
• Analytical software
  • custom programs
  • Matlab
  • Perl
  • R
  • Python
  • SAS, SPSS
  • Tableau
• Analytical environments
  • multi‐core machine ‐ 48+ core systems for under $5000 (USD)
  • GPU
  • compute cluster
  • supercomputers
  • grid computing
  • cloud computing
  • web‐based services
  • network of workstations (NOW)
  • Map/Reduce models
  • “screen‐saver” computing (BOINC)
82. GPU Computing
200‐800 stream processing cores per card
For $500 to $2000 (USD), processing speedups of up to an order of magnitude may be possible
86. Open Science Grid
www.opensciencegrid.org
87. Map/Reduce
• Unix users:
  • cat | grep | sort | uniq > file
• Map/Reduce equivalent:
  • input | map | shuffle | reduce > output
• HadoopFS (HDFS)
  • large data set is automatically spread and replicated across the local storage resources (disks) of each node in a cluster
• Map
  • creates a job for each data block in the input
  • maps the computational kernel to each job
  • schedules jobs to nodes with the required data block
  • each job produces a set of key/value pairs as its job result
• Reduce
  • collects results from the Map stage based on keys (Combine)
  • aggregates values to produce the task (final) result
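The input | map | shuffle | reduce pipeline can be sketched in a few lines of plain Python. This is a single-process illustration of the model only: it does not use Hadoop or HDFS, the word-count task is a stand-in example, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block: str):
    """Map: emit a (key, value) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into the final result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    """Run map over every block, then shuffle and reduce the pairs."""
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Each string stands in for one data block stored on a cluster node
    print(word_count(["big data big tools", "data tools data"]))
```

In a real cluster each call to map_phase would run on the node that already holds the block, which is what makes the model scale. Here everything runs in one process.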
88. Extensions
• Pig and Hive
  • pig.apache.org, hive.apache.org
  • simplify writing Map/Reduce programs for Hadoop
  • SQL‐like query language for datasets available on HDFS
• Cloudera
  • www.cloudera.com
  • packaged distribution of Hadoop + extensions
  • education + training material
• Amazon Elastic Map Reduce
  • aws.amazon.com/elasticmapreduce
  • Amazon “cloud‐based” hosting of Hadoop for Map/Reduce using EC2 for compute and S3 for storage
89. Organization, Searching, and Meta‐Data
• Few “software” solutions for this problem
  • iRODS provides some of this
  • Unix “locate” database
  • SAN solutions may include indexing software and provide tools for searching
• Establish protocols, document, communicate
  • directory hierarchy
  • file naming
  • persisted working space
  • scratch/temporary space
• Filesystem functionality
  • many file systems have per‐file meta‐data controls to add arbitrary key/value pairs
• Augmented web‐based view
  • cern_meta Apache module provides key/value pairs in HTTP HEAD
  • ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views
90. iRODS
• www.irods.org
• File‐like paradigm for data‐management
• addition of meta‐data
• can integrate database resources
• provides rich access policy management
• automated workflows based on data actions
  • add, remove, modify
• automated replication
• built‐in provenance
• information life‐cycle management
91. Search: Apache Lucene
• lucene.apache.org
• Java‐based
• full text querying and searching
• indexing
• Solr provides web interface
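Full text search engines are built around an inverted index, which maps each term to the documents that contain it. The sketch below shows only that core idea in Python. It is not Lucene's API (Lucene is a Java library with ranking, analyzers, and much more), and the function names and sample file names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of documents that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    # Start from the first term's documents, then intersect with the rest
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Building the index costs one pass over the data, after which each query is a handful of set lookups rather than a scan of every file.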
95. Meta‐Data: Semantic MediaWiki
• You know Wikipedia
• It is built using MediaWiki
• Semantic MediaWiki adds Semantic Web features
  • Flexible key/value schemas
  • User defined and changeable object classes
  • Built‐in knowledge of dates → timelines
  • Built‐in knowledge of locations → maps
  • Built‐in handling of images → picture galleries
99. Access Control
• Need a strong Identity Management environment
  • individuals: identity tokens and identifiers
  • groups: membership lists
  • Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix) all LDAP‐based
• Need to manage and communicate Access Control policies
  • institutionally driven
  • user driven
• Need Authorization System
  • Policy Enforcement Point (shell login, data access, web access, start application)
  • Policy Decision Point (store policies and understand relationship of identity token and policy)
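The split between a Policy Decision Point and a Policy Enforcement Point can be illustrated with a minimal Python sketch. Everything here is hypothetical: the class names, the group-based policy model, and the in-memory dictionaries are assumptions for illustration, not the design of any particular authorization system.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyDecisionPoint:
    """Stores policies and decides whether an identity may perform an action."""

    group_members: dict = field(default_factory=dict)  # group -> set of user ids
    policies: dict = field(default_factory=dict)  # (resource, action) -> set of groups

    def is_permitted(self, user: str, resource: str, action: str) -> bool:
        allowed_groups = self.policies.get((resource, action), set())
        return any(
            user in self.group_members.get(group, set()) for group in allowed_groups
        )


class PolicyEnforcementPoint:
    """Guards a resource: asks the PDP before letting a request through."""

    def __init__(self, pdp: PolicyDecisionPoint):
        self.pdp = pdp

    def read(self, user: str, resource: str, store: dict):
        if not self.pdp.is_permitted(user, resource, "read"):
            raise PermissionError(f"{user} may not read {resource}")
        return store[resource]
```

Keeping the decision logic in one place means that shell login, data access, and web access can each have their own enforcement point while sharing a single set of policies.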
100. Case Study: SBGrid
• www.sbgrid.org
• computing expertise for protein structure and function research
  • software
  • training
  • technical support
  • storage
  • cluster and grid computing
• 150 member labs in consortium
  • about 1000 total researchers
• structure imaging and model building:
  • imaging techniques are data intensive
  • model determination techniques are compute intensive
101. SBGrid Science Portal
[Figure: architecture of the SBGrid Science Portal @ Harvard Medical School and the external services it connects to; labels grouped by diagram column as far as recoverable]
• Portal components
  • monitoring: Ganglia, Nagios, RSV, pacct
  • interfaces: Apache, GridSite, Django, Sage Math, R-Studio, shell CLI
  • data: scp, GridFTP, SRM, WebDAV, file server, SQL DB
  • computation: Condor, Cycle Server, VDT, Globus, glideinWMS, cluster
  • ID mgmt: FreeIPA, LDAP, VOMS, GUMS, GACL
• External services
  • User
  • GlobusOnline @Argonne
  • UC San Diego: GUMS, GridFTP + Hadoop (data), glideinWMS factory
  • Open Science Grid: computations
  • MyProxy @NCSA, UIUC
  • DOEGrids CA @Lawrence Berkeley Labs
  • Gratia Acct'ing @FermiLab
  • Monitoring @Indiana
102. Data Model
• Data Tiers
  • VO‐wide: all sites, admin managed, very stable
  • User project: all sites, user managed, 1‐10 weeks, 1‐3 GB
  • User static: all sites, user managed, indefinite, 10 MB
  • Job set: all sites, infrastructure managed, 1‐10 days, 0.1‐1 GB
  • Job: direct to worker node, infrastructure managed, 1 day, <10 MB
  • Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB
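One way to read the two per-job tiers is as a routing rule on payload size. The sketch below encodes that reading in Python. The size limits come from the slide, but the rule itself, the function name, and the use of decimal units are assumptions, not a description of how SBGrid actually routes data.

```python
MB = 10**6
GB = 10**9

# Size limits taken from the slide; the routing rule itself is an assumption.
JOB_DIRECT_LIMIT = 10 * MB
JOB_INDIRECT_LIMIT = 10 * GB


def route_job_data(size_bytes: int) -> str:
    """Pick a per-job data tier from the payload size."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes < JOB_DIRECT_LIMIT:
        return "job: direct to worker node"
    if size_bytes < JOB_INDIRECT_LIMIT:
        return "job indirect: to worker node via UCSD"
    raise ValueError("payload exceeds the largest per-job tier (<10 GB)")
```

Writing the limits down as named constants makes the convention easy to document and to change when the tiers are revised.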
103. Data Management
• quota
• du scan
• tmpwatch
• conventions
• workflow integration
Data Movement
• scp (users)
• rsync (VO‐wide)
• grid‐ftp (UCSD)
• curl (WNs)
• cp (NFS)
• htcp (secure web)
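Scratch space stays manageable only if stale files are found and removed on a schedule. The following Python sketch shows the idea behind a tmpwatch-style sweep. It is not tmpwatch itself and does not mirror its options: the function name is hypothetical, and using modification time as the staleness test is an assumption.

```python
import os
import time


def remove_stale_files(root: str, max_age_days: float, dry_run: bool = True):
    """List (and optionally delete) files under root that have not been
    modified within max_age_days. Returns the sorted list of stale paths."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
                    if not dry_run:
                        os.remove(path)
            except FileNotFoundError:
                continue  # file vanished while we were scanning
    return sorted(stale)
```

The default dry_run=True only reports what would be removed, which is a sensible first step before pointing any cleanup job at shared storage.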
112. [Figure: data flow between the user, SBGrid, UCSD, and worker nodes (WNs); red arrows push files, green arrows pull files]
1. user file upload
2. replicate gold standard
3. Auto‐replicate
4. pull files from UCSD to WNs
5. pull files from local NFS to WNs
6. pull files from SBGrid to WNs
7. job results copied back to SBGrid
8a. large job results copied to UCSD
8b. later pulled to SBGrid
126. Copy, Move, Backup
• Large data sets are difficult to copy, move, replicate, and backup
• Tools and protocols required, with management
  • sys admin (technical knowledge)
  • archivist/curator (domain knowledge)
• Common structure:
  • Tier 1 ‐ single master copy of data (live), possible offline tape backup
  • Tier 2 ‐ multiple reliable T‐1 replicas serving a specific community
  • Tier 3 ‐ temporary “working set” T‐2 replicas of required data
• GridFTP
• Storage Resource Broker (SRB)
• GlobusOnline
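Whatever tool moves the data between tiers, a replica is only useful if it can be shown to match its master copy. The sketch below copies a file and compares checksums in Python. It is a local stand-in only: it does not invoke GridFTP, SRB, or Globus Online, and the function names are hypothetical.

```python
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_and_verify(master: str, replica: str) -> str:
    """Copy master to replica and confirm both have the same checksum."""
    shutil.copyfile(master, replica)
    expected, actual = sha256_of(master), sha256_of(replica)
    if expected != actual:
        raise IOError(f"replica {replica} does not match master {master}")
    return actual
```

Storing the returned checksum alongside the replica also gives a cheap way to detect later corruption without going back to the Tier 1 copy.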
127. Globus Online: High Performance Reliable 3rd Party File Transfer
http://www.globusonline.org
[Figure: Globus Online transferring files between endpoints: portal, cluster, data collection facility, lab file server, desktop, laptop]
129. Summary
• Data can empower rather than overwhelm you
• but this requires thought and planning
• Understand your data sources
• Understand your data consumers
• Educate yourself on available tools and technology
• Design your data management system suitably
131. Acknowledgements & Questions
• Piotr Sliz
  • Principal Investigator, head of SBGrid
• SBGrid System Administrators
  • Ian Levesque, Peter Doherty
• Globus Online Team
  • Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
• Terrence Martin
  • System administrator at UCSD, for assistance and encouragement using 1 PB Hadoop storage array
• Brian Bockelman
  • Physics faculty at University of Nebraska
• Steve Timm
  • System administrator at FermiLab
• Ruth Pordes
  • Director of OSG, for championing SBGrid
Please contact me with any questions:
• Ian Stokes‐Rees
• ijstokes@hkl.hms.harvard.edu
• ijstokes@spmetric.com
Look at our work
• portal.sbgrid.org
• www.sbgrid.org
• www.opensciencegrid.org