Big Data: tools and techniques for working with large data sets

Ian Stokes‐Rees, PhD
Harvard Medical School, Boston, USA

Workshop on Tools, Technologies and Collaborative 
Opportunities for HPC in Life Sciences and Healthcare

http://portal.sbgrid.org
ijstokes@hkl.hms.harvard.edu

Slides and Contact

http://linkedin.com/in/ijstokes
http://slidesha.re/ijstokes-thailand2011


http://www.sbgrid.org
http://portal.sbgrid.org
http://www.opensciencegrid.org


About Me


[Figure: rotational search and translation search (2D simple crystal, Patterson map); score model: best peak, R factor, alternatives; aggregate composites and cluster; electron density]

Protein Structure Studies


Data, Data Everywhere ...


• We are being overwhelmed with data
• high temporal resolution due to fast electronics
• high spatial resolution due to advanced imaging techniques
• high dimensional data
• large data sets
• simulation
• modeling

• It is easy to drown in the flood of data
• storage issues ‐ capacity
• ownership issues ‐ security and collaboration
• provenance ‐ origin, access, changes

Today, we’ll think about software, hardware, and models for coping with large quantities of data

Next Generation Sequencing


High Energy Physics


40 MHz bunch crossing rate
10 million data channels
1 kHz level 1 event recording rate
1‐10 MB per event
14 hours per day, 7+ months / year
4 detectors
6 PB of data / year
globally distribute data for analysis (x2)


Molecular Dynamics Simulations
1 fs time step
1 ns snapshot
1 µs simulation
1e6 steps
1000 frames
10 MB / frame
10 GB / sim
20 CPU‐years
3 months (wall clock)
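
The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "1e6 steps" refers to the number of time steps between snapshots (the slide does not say so explicitly) and works in integer femtoseconds to avoid floating-point rounding. The variable names are illustrative.

```python
# Back-of-envelope check of the molecular dynamics numbers on the slide.
# Assumption: "1e6 steps" is the number of time steps between snapshots.
FS_PER_NS = 10**6
FS_PER_US = 10**9

time_step_fs = 1                    # 1 fs time step
snapshot_interval_fs = FS_PER_NS    # one frame saved every 1 ns
simulation_length_fs = FS_PER_US    # 1 us of simulated time
frame_size_mb = 10                  # 10 MB per frame

steps_per_snapshot = snapshot_interval_fs // time_step_fs
frames = simulation_length_fs // snapshot_interval_fs
total_steps = simulation_length_fs // time_step_fs
total_size_gb = frames * frame_size_mb / 1000

print(f"steps per snapshot: {steps_per_snapshot:.0e}")
print(f"frames:             {frames}")
print(f"total steps:        {total_steps:.0e}")
print(f"output size:        {total_size_gb:.0f} GB")
```

Under that reading, the 1000 frames and 10 GB per simulation follow directly from the step, snapshot, and simulation lengths.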


Electronic Patient Records



77 page PDF (bespoke report)


Clinical Document Architecture XML representation


HTML rendering of XML via XSLT transform

Clinical Imaging Data

DICOM: Digital Imaging and Communications in Medicine
2D, 3D, 4D


It is clear there is no shortage of data.



Potential for great new insights ...

... if we can organize, access, share, and analyze this data efficiently


Jumping to the end ...


• Data can empower rather than overwhelm you
• but this requires thought and planning



• Understand your data sources




• Understand your data consumers





• Educate yourself on available tools and technology






• Design your data management system suitably


Problems arising from “Big Data”


• Where to store
• How to store
• How to process
• Organization, searching, and meta‐data
• How to manage access
• How to copy, move, and backup
• Provenance
• Lifecycle

Where to store (I)


• RAM
• fast
• expensive
• volatile

• local disk
• get a good controller (SATA/SAS2)
• lots of fast spinning disk (7200+ rpm)
• high bandwidth possible
• good first stop for data
• hard to share, persist, backup
• SSD good for random reads: lots of small files, unpredictable I/O patterns
• large files, sequential I/O, spinning disk comparable to SSDs

• Parallel Filesystem
• gluster, lustre, gpfs
• HDFS (Hadoop)
• auto‐replication for parallel decentralized I/O


Where to store (II)


• SAN with high performance interconnect
• Storage Area Network
• fully managed data storage
• fiber channel (2 Gb/s) or Infiniband (10,20,40 Gb/s) interconnect
• parallel, non‐blocking, dedicated routes

• NAS over ethernet
• Network Attached Storage
• Think NFS, CIFS, Samba network interface to storage
• ethernet 1 Gb/s with contention (effective limit of ~500 Mb/s)
• SATA (10k rpm, 2 TB, 3 Gb/s)
• SAS2 (15k rpm, 750 GB, 6 Gb/s)

• Cloud storage
• Amazon S3
• Box.net, Dropbox
• BackBlaze: bit.ly/backblaze‐20

• Hybrid
• Create in‐house tiered storage
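
To see what these interconnect speeds mean in practice, the sketch below estimates how long it takes to move a data set at the nominal rates quoted on the slide. The 1 TB data set size and the function name are illustrative assumptions. Real transfers will be slower because of protocol overhead and disk limits.

```python
# Rough transfer-time estimates at the link speeds quoted on the slide.
# Assumptions: a 1 TB (10**12 byte) data set and ideal, sustained throughput.

def transfer_hours(size_bytes: int, rate_bits_per_s: float) -> float:
    """Return the hours needed to move size_bytes at rate_bits_per_s."""
    return size_bytes * 8 / rate_bits_per_s / 3600

DATASET_BYTES = 10**12  # hypothetical 1 TB data set

LINKS = {
    "ethernet, contended (~500 Mb/s)": 500e6,
    "ethernet, nominal (1 Gb/s)": 1e9,
    "fiber channel (2 Gb/s)": 2e9,
    "Infiniband (10 Gb/s)": 10e9,
    "Infiniband (40 Gb/s)": 40e9,
}

for name, rate in LINKS.items():
    print(f"{name:34s} {transfer_hours(DATASET_BYTES, rate):6.2f} h")
```

Even in this idealized form, the gap between contended ethernet (hours) and a 40 Gb/s interconnect (minutes) shows why the storage location matters once data sets reach the terabyte scale.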


How to store (data formats)
• ASCII
• tab delimited
• comma separated
• XML
• DTD definition?
• Schema definition?
• Namespaces?
• JSON
• NetCDF
• HDF5
• DICOM
• Matlab .MAT format
• NumPy .NPZ format
• Bespoke binary

• SQL DB
• MySQL
• sqlite
• Oracle
• Access
• SQL Server
• Hierarchical DB
• Berkeley XML DB
• LDAP
• Object‐Relational Mapper
• SQL Alchemy (Python)
• Hibernate (Java, .NET)
• Django ORM (Python)
• No‐SQL DB
• MongoDB
• CouchDB
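
As a small illustration of how the choice of format changes the way data is written and queried, the sketch below stores the same records as comma-separated text, JSON, and a sqlite table, using only the Python standard library. The records and column names are made up for the example.

```python
import csv
import io
import json
import sqlite3

# Hypothetical records: (sample id, resolution in angstroms)
records = [("s001", 1.8), ("s002", 2.4), ("s003", 3.1)]

# ASCII, comma separated: simple and portable, but untyped.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["sample", "resolution"])
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing keys, keeps numbers as numbers.
json_text = json.dumps([{"sample": s, "resolution": r} for s, r in records])

# SQL DB (sqlite): typed columns and declarative queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample TEXT PRIMARY KEY, resolution REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)", records)
high_res = [row[0] for row in conn.execute(
    "SELECT sample FROM samples WHERE resolution < ? ORDER BY sample", (2.5,))]
conn.close()

print(csv_text)
print(json_text)
print(high_res)
```

The text formats are easy to exchange and inspect, while the database version pushes filtering into a query instead of application code.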

How to process

• Analytical software
• custom programs
• Matlab
• Perl
• R
• Python
• SAS, SPSS
• Tableau
• web‐based services

• Analytical environments
• multi‐core machine ‐ 48+ core systems for under $5000 (USD)
• GPU
• compute cluster
• supercomputers
• grid computing
• cloud computing
• network of workstations (NOW)
• Map/Reduce models
• “screen‐saver” computing (BOINC)


GPU Computing

200‐800 stream processing cores per card

For $500 to $2000 (USD), speedups of up to an order of magnitude may be possible

Open Science Grid

www.opensciencegrid.org

Map/Reduce
• Unix users:
• cat | grep | sort | uniq > file
• Map/Reduce equivalent:
• input | map | shuffle | reduce > output
• HadoopFS (HDFS)
• large data set is automatically spread and replicated across local storage resources (disks) of each node in a cluster
• Map
• creates a job for each data block in the input
• maps the computational kernel to each job
• schedules jobs to nodes with required data block
• each job produces a set of key/value pairs as its result
• Reduce
• collect results from Map stage based on keys (Combine)
• aggregates values to produce task (final) result
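
The map, shuffle, and reduce stages can be sketched in a single Python process with a word count. This is not the Hadoop API: on a real cluster the map jobs run in parallel on the nodes that hold each data block. The function names and the sample blocks are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_block(block):
    """Map: emit a (key, value) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reduce: aggregate the values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(blocks):
    """Run map over every block, then shuffle and reduce the results."""
    mapped = chain.from_iterable(map_block(b) for b in blocks)
    return reduce_groups(shuffle(mapped))

# Each string stands in for one data block stored on a cluster node.
blocks = ["big data big tools", "data tools data"]
print(word_count(blocks))
```

Because each call to `map_block` depends only on its own block, the map stage can be spread across as many nodes as there are blocks. Only the shuffle step needs to bring values with the same key together.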

Extensions

• Pig and Hive
• pig.apache.org    hive.apache.org
• simplify writing Map/Reduce programs for Hadoop
• SQL‐like query language for datasets available on HDFS
• Cloudera
• www.cloudera.com
• packaged distribution of Hadoop + extensions
• education + training material
• Amazon Elastic Map Reduce
• aws.amazon.com/elasticmapreduce
• Amazon “cloud‐based” hosting of Hadoop for Map/Reduce using EC2 
for compute and S3 for storage


Organization, Searching, and Meta‐Data
• Few “software” solutions for this problem
• iRODS provides some of this
• Unix “locate” database
• SAN solutions may index software and provide tools for searching
• Establish protocols, document, communicate
• directory hierarchy
• file naming
• persisted working space
• scratch/temporary space
• Filesystem functionality
• many file systems have per‐file meta‐data controls to add arbitrary key/value pairs
• Augmented web‐based view
• cern_meta Apache module provides key/value pairs in HTTP HEAD
• ability to assert arbitrary web organization on top of filesystem organization, with searching and graphical views
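
In the spirit of the Unix “locate” database, a minimal file index can be built by walking a directory tree and recording basic meta-data in sqlite. The sketch below is an assumption-laden illustration, not how `locate` or iRODS work internally. The table layout and function names are hypothetical.

```python
import os
import sqlite3

def build_index(root, db_path=":memory:"):
    """Walk root and record path, size, and modification time for each file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, size INTEGER, mtime REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (full, st.st_size, st.st_mtime),
            )
    conn.commit()
    return conn

def search(conn, pattern):
    """Return indexed paths matching a SQL LIKE pattern, e.g. '%.csv'."""
    rows = conn.execute(
        "SELECT path FROM files WHERE path LIKE ? ORDER BY path", (pattern,)
    )
    return [row[0] for row in rows]
```

A call such as `search(build_index("/data/projects"), "%.h5")` (with a hypothetical path) would then list every indexed HDF5 file. Consistent directory and file naming conventions make such pattern searches far more useful.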


• www.irods.org
• File‐like paradigm for data‐management
• addition of meta‐data
• can integrate database resources
• provides rich access policy management
• automated workflows based on data actions
• add, remove, modify
• automated replication
• built‐in provenance
• information life‐cycle management


Search: Apache

• lucene.apache.org
• Java‐based
• full text querying and searching
• indexing
• Solr provides web interface


Meta‐Data: Semantic Media Wiki



• You know Wikipedia
• It is built using Mediawiki
• Semantic Media Wiki adds Semantic Web features
• Flexible key/value schemas
• User defined and changeable object classes
• Built‐in knowledge of dates → timelines
• Built‐in knowledge of locations → maps
• Built‐in handling of images → picture galleries


Access Control


• Need a strong Identity Management environment
• individuals: identity tokens and identifiers
• groups: membership lists
• Active Directory/CIFS (Windows), Open Directory (Apple), FreeIPA (Unix) all LDAP‐based

• Need to manage and communicate Access Control policies
• institutionally driven
• user driven

• Need Authorization System
• Policy Enforcement Point (shell login, data access, web access, start application)
• Policy Decision Point (store policies and understand relationship of identity token and policy)
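
The split between a Policy Decision Point and a Policy Enforcement Point can be sketched as two small classes. The decision point holds group memberships and policies, and the enforcement point guards one entry point and asks the decision point before acting. All class names, users, groups, and paths below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyDecisionPoint:
    """Stores group membership and policies, and answers access questions."""
    memberships: dict = field(default_factory=dict)  # user -> set of groups
    policies: dict = field(default_factory=dict)     # (resource, action) -> set of groups

    def is_permitted(self, user, resource, action):
        allowed_groups = self.policies.get((resource, action), set())
        user_groups = self.memberships.get(user, set())
        return bool(allowed_groups & user_groups)

class PolicyEnforcementPoint:
    """Guards one entry point (here, data access) by asking the PDP."""
    def __init__(self, pdp):
        self.pdp = pdp

    def read(self, user, resource):
        if not self.pdp.is_permitted(user, resource, "read"):
            raise PermissionError(f"{user} may not read {resource}")
        return f"contents of {resource}"

# Hypothetical users, groups, and resources.
pdp = PolicyDecisionPoint(
    memberships={"alice": {"lab-a"}, "bob": {"lab-b"}},
    policies={("/data/lab-a", "read"): {"lab-a"}},
)
pep = PolicyEnforcementPoint(pdp)
print(pep.read("alice", "/data/lab-a"))
```

Keeping the decision logic in one place means that shell logins, web access, and data access can all enforce the same policies without each re-implementing them.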


Case Study: SBGrid
• www.sbgrid.org
• computing expertise for protein structure and 
function research
• software
• training
• technical support
• storage
• cluster and grid computing
• 150 member labs in consortium
• about 1000 total researchers
• structure imaging and model building:
• imaging techniques are data intensive
• model determination techniques are compute intensive


SBGrid Science Portal
[Architecture diagram: components of the SBGrid Science Portal @ Harvard Medical School]
monitoring: Ganglia, Nagios, RSV, pacct
interfaces: Apache, GridSite, Django, Sage Math, R-Studio, shell CLI
data: scp, GridFTP, SRM, WebDAV, file server
computation: Condor, Cycle Server, VDT, Globus, glideinWMS, cluster
ID mgmt: FreeIPA, LDAP, VOMS, GUMS, GACL, SQL DB

[External services shown: User; GlobusOnline @Argonne; UC San Diego (GUMS, GridFTP + Hadoop for data, glideinWMS factory for computations); Open Science Grid; MyProxy @NCSA, UIUC; DOEGrids CA @Lawrence Berkeley Labs; Gratia Acct'ing @FermiLab; Monitoring @Indiana]


Data Model

• Data Tiers
• VO‐wide: all sites, admin managed, very stable
• User project: all sites, user managed, 1‐10 weeks, 1‐3 GB
• User static: all sites, user managed, indefinite, 10 MB
• Job set: all sites, infrastructure managed, 1‐10 days, 0.1‐1 GB
• Job: direct to worker node, infrastructure managed, 1 day, <10 MB
• Job indirect: to worker node via UCSD, infrastructure managed, 1 day, <10 GB


Data Management
quota
du scan
tmpwatch
conventions
workflow integration
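
A tmpwatch-style cleanup can enforce the lifetimes of the data tiers. The sketch below is only in the spirit of `tmpwatch`: it uses modification time, takes the upper bound of each lifetime range from the Data Model slide, and assumes each tier lives under its own directory. The tier keys and function names are hypothetical.

```python
import os
import time

# Maximum file age per tier, in days, taken from the upper bounds on the
# Data Model slide. The tier keys are hypothetical names for this sketch.
MAX_AGE_DAYS = {
    "user_project": 10 * 7,  # 1-10 weeks
    "job_set": 10,           # 1-10 days
    "job": 1,                # 1 day
}

def expired_files(root, max_age_days, now=None):
    """Yield files under root not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    yield path
            except OSError:
                continue  # file disappeared during the scan

def clean_tier(root, tier, dry_run=True):
    """Remove (or, with dry_run, just list) expired files for one tier."""
    removed = []
    for path in expired_files(root, MAX_AGE_DAYS[tier]):
        if not dry_run:
            os.remove(path)
        removed.append(path)
    return removed
```

Defaulting to `dry_run=True` lets an administrator review what would be deleted before the policy is enforced for real.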

Data Movement
scp (users)
rsync (VO‐wide)
grid‐ftp (UCSD)
curl (WNs)
cp (NFS)
htcp (secure web)


red: push files
green: pull files

1. user file upload
2. replicate gold standard
3. Auto‐replicate
4. pull files from UCSD to WNs
5. [...] local NFS to WNs
6. [...] SBGrid to WNs
7. job results copied back to SBGrid
8a. large job results copied to UCSD
8b. later pulled to SBGrid

Copy, Move, Backup


• Large data sets are difficult to copy, move, replicate, and backup


• Tools and protocols required, with management
• sys admin (technical knowledge)
• archivist/curator (domain knowledge)


• Common structure:
• Tier 1 ‐ single master copy of data (live), possible offline tape backup
• Tier 2 ‐ multiple reliable T‐1 replicas serving a specific community
• Tier 3 ‐ temporary “working set” T‐2 replicas of required data

• GridFTP
• Storage Resource Broker (SRB)
• GlobusOnline


Globus Online: High Performance 
Reliable 3rd Party File Transfer
http://www.globusonline.org

[Diagram: file transfers among portal, cluster, data collection facility, lab file server, desktop, and laptop]

Summary




• Design your data management system suitably


Acknowledgements & Questions
• Piotr Sliz
• Principal Investigator, head of SBGrid
• SBGrid System Administrators
• Ian Levesque, Peter Doherty
• Globus Online Team
• Steve Tuecke, Ian Foster, Rachana Ananthakrishnan, Raj Kettimuthu
• Terrence Martin
• System administrator at UCSD for assistance and encouragement using 1 PB Hadoop storage array
• Brian Bockelman
• Physics faculty at University of Nebraska
• Steve Timm
• System administrator at FermiLab
• Ruth Pordes
• Director of OSG, for championing SBGrid

Please contact me with any questions:
• Ian Stokes‐Rees
• ijstokes@hkl.hms.harvard.edu
• ijstokes@spmetric.com

Look at our work
• portal.sbgrid.org
• www.sbgrid.org
• www.opensciencegrid.org
