© 2014 MapR Technologies
Genomics Use Cases @ MapR
© 2014 MapR Technologies 2© 2014 MapR Technologies
DNA Sequencing Company
© 2014 MapR Technologies 3
Parallelize Primary Analytics
.fastq → short read alignment → reads & mappings → genotype calling → .vcf
© 2014 MapR Technologies 4
Sequence Analysis, Quick Overview
[…] G A C T A G A fragment1
A C A G T T T A C A fragment2
A G A T A - - A G A fragment3
A A C A G C T T A C A […] fragment4
C T A T A G A T A A fragment5
[…] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA
[…] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
© 2014 MapR Technologies 5
What is the (Probable) Color of Each Column?
© 2014 MapR Technologies 6
Which Columns are (probably) Not White?
Strategy 1: examine foreach column, foreach row O(rows*cols)
+ O(1 col) memory
© 2014 MapR Technologies 7
Which Columns are (probably) Not White?
Strategy 2: examine foreach row. keep running tallies O(rows)
+ O(rows*cols) memory
© 2014 MapR Technologies 8
Which Columns are (probably) Not White?
Strategy 3: rotate matrix. examine foreach column O(rows log rows)
+ O(cols)
+ O(1 col) memory
© 2014 MapR Technologies 9
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low mem req
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High mem req
• Sequential access pattern
Strategy 3: O(rows log rows) + O(cols) + O(1 col) memory
• Low mem req
• Sequential access pattern
• Requires Sort
© 2014 MapR Technologies 10
Comparison of Strategies
Strategy 1: O(rows*cols) + O(1 col) memory
• Low mem req
• Random access pattern, many ops
Strategy 2: O(rows) + O(rows*cols) memory
• High mem req
• Sequential access pattern
Strategy 3: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory
• Low mem req
• Sequential access pattern
• Requires Sort
As # of rows & columns increases
Strategy 3 becomes more attractive
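Strategy 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline from the talk: the read format (a start offset plus a string of bases), the names `rotate` and `call_columns`, and the majority-vote rule are all made up for the example.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def rotate(reads):
    """Turn each aligned read (a row) into (column, base) pairs."""
    for start, bases in reads:
        for offset, base in enumerate(bases):
            if base != "-":  # skip gap characters
                yield (start + offset, base)

def call_columns(reads, reference):
    """Sort the pairs by column, then scan one column at a time."""
    pairs = sorted(rotate(reads), key=itemgetter(0))  # the "requires sort" step
    variants = {}
    for col, group in groupby(pairs, key=itemgetter(0)):
        tally = Counter(base for _, base in group)  # only one column in memory
        consensus, _ = tally.most_common(1)[0]
        if consensus != reference[col]:
            variants[col] = consensus
    return variants

# Hypothetical toy data: (start offset, bases) per read.
reference = "GATTACAGATTACA"
reads = [(0, "GACTA"), (3, "TACAG"), (2, "CTACA"), (8, "ATAAC")]
variants = call_columns(reads, reference)
print(variants)  # {2: 'C', 10: 'A'}
```

In a MapReduce setting the `sorted` call would correspond to the shuffle, and each shard would handle its own range of columns.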
© 2014 MapR Technologies 11
Primary Sequence Analysis (ETL), MapReduce style
.fastq .bam .vcf
short read
alignment
genotype
calling
MAP
MAP
REDUCE, rotate matrix 90º
O(mn) / 1
(O(mn) + O(n log n)) / s
Hello!
© 2014 MapR Technologies 12
Clinical Applications: Performance Matters
[Diagram labels: DNA Sequencer (×3); Raw DNA; MapR Filesystem; NFS; 1º Analytics; SNP calls; Static Clinical Reporting; Physician; Patient; Reference DBs; SNP DB; ETL; 2º Analytics; Researcher; Subject]
© 2014 MapR Technologies 13
Variant Collection Enables Downstream Apps
• GWAS Association Studies
• Versioned, Personalized
Medicine
• Companion Diagnostics
SNP DB 2º
Analytics
New
Markets
Hello!
More linear algebra
[Spark,
Summingbird,
Lambda Architecture
Slides]
© 2014 MapR Technologies 14
The Post-Sequencing Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
© 2014 MapR Technologies 16
Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my
experiment
• That elucidate genetic basis of
phenotype…
• And rank order them by
impact/likelihood/etc
• In context of, e.g.
– ε1: Racial, etc. background
– ε2: Experimental design-
specific concerns (e.g. familial
IBD/IBS)
– ε3: Environmental factors and
penetrance
– ε4: Assay-specific biases and
noise
phenotype = αgenotype + β + ε1 + ε2 + ε3 + ε4
At risk of over-simplifying as
business-level concept…
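The simplified model above can be made concrete with a small least-squares fit. This is only a sketch: the genotype coding (0, 1, 2 copies of an allele), the single lumped noise term standing in for ε1 to ε4, and the function name `fit_line` are assumptions made for illustration.

```python
def fit_line(genotypes, phenotypes):
    """Ordinary least squares for phenotype = alpha * genotype + beta."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    alpha = cov / var
    beta = mean_p - alpha * mean_g
    return alpha, beta

# Hypothetical data: true alpha = 2, true beta = 1, plus a lumped noise term.
genotypes = [0, 0, 1, 1, 2, 2]
noise = [0.1, -0.1, 0.2, -0.2, -0.3, 0.3]
phenotypes = [2.0 * g + 1.0 + e for g, e in zip(genotypes, noise)]

alpha, beta = fit_line(genotypes, phenotypes)
```

Because the toy noise averages to zero within each genotype group, the fit recovers α ≈ 2 and β ≈ 1. Real confounders such as ε1 to ε4 are correlated with genotype and do not cancel this conveniently.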
© 2014 MapR Technologies 17
HUGE PROBLEM
COMBINATORIAL EXPLOSION
© 2014 MapR Technologies 18
What’s a Percolator?
• Google Percolator
– “Caffeine” update 2010
• Iterative, incremental prioritized
updates
• No batch processing
• Decouple computational results
from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
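The idea of incremental, prioritized updates can be illustrated with a toy in-memory queue. This is a sketch of the general pattern only. It is not Percolator's API, and the class `IncrementalTally`, its methods, and the SNP identifiers are hypothetical.

```python
import heapq

class IncrementalTally:
    """Results stay queryable while small prioritized updates are applied."""

    def __init__(self):
        self.totals = {}   # current results
        self.pending = []  # heap of (priority, key, delta); lower runs first

    def notify(self, key, delta, priority):
        """Record that new data arrived for `key`."""
        heapq.heappush(self.pending, (priority, key, delta))

    def process(self, budget):
        """Apply at most `budget` pending updates, highest priority first."""
        handled = []
        while self.pending and budget > 0:
            _, key, delta = heapq.heappop(self.pending)
            self.totals[key] = self.totals.get(key, 0) + delta
            handled.append(key)
            budget -= 1
        return handled

tally = IncrementalTally()
tally.notify("rs123", 1, priority=2)
tally.notify("rs456", 1, priority=1)
tally.notify("rs123", 1, priority=1)
first_pass = tally.process(budget=2)  # only the two priority-1 updates run
```

The cost of each pass depends on the number of pending updates, not on the total amount of data already processed.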
© 2014 MapR Technologies 19
Solution: Percolate
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
Denormalize
and Percolate
(re)prioritize &
(re)process
service queries
drive
dashboards
create reports
denormalize for
display
buffer
New
models
© 2014 MapR Technologies 20
Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
© 2014 MapR Technologies 21
Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
© 2014 MapR Technologies 22© 2014 MapR Technologies
Genealogy Company
Slides credit: Bill Yetman, Hadoop Summit 2014
http://slidesha.re/1vRh3kY
© 2014 MapR Technologies 23
GERMLINE is…
• …an algorithm that finds hidden relationships within a pool of
DNA
• …the reference implementation of that algorithm written in C++.
• You can find it here:
http://www1.cs.columbia.edu/~gusev/germline/
© 2014 MapR Technologies 24
Projected GERMLINE run times (in hours)
[Chart: GERMLINE run times and projected GERMLINE run times, in hours (y-axis, 0 to 700) versus samples (x-axis, 2,500 to 122,500)]
700 hours = 29+ days
EXPONENTIAL COMPLEXITY
© 2014 MapR Technologies 25
GERMLINE: What’s the Problem?
• GERMLINE (the implementation) was not meant to be used in
an industrial setting
– Stateless, single threaded, prone to swapping (heavy memory usage)
– GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to
a crawl
• Put simply: GERMLINE couldn't scale
© 2014 MapR Technologies 26
Run times for matching (in hours)
[Chart: run times for matching, in hours (y-axis, 0 to 180) versus samples. Series: GERMLINE run times, Jermline run times, projected GERMLINE run times. Annotations: EXPONENTIAL, LINEAR, HBase Refactor]
© 2014 MapR Technologies 27
• Paper submitted describing the implementation
• Releasing as an Open Source project soon
• [HBase Schema/Algorithm Slides]
© 2014 MapR Technologies 28© 2014 MapR Technologies
Further Growth & Optimization
© 2014 MapR Technologies 29
Underdog (Strand Phasing) performance
– Went from 12 hours to process 1,000 samples
to under 25 minutes with a MapReduce
implementation
With improved accuracy!
Underdog
replaces
Beagle
[Chart: Total Run Size and Total Beagle-Underdog Duration (axis, 0 to 80,000)]
© 2014 MapR Technologies 30
Pipeline steps and incremental change…
– Incremental change over time
– Supporting the business in a “just in time” Agile way
[Chart: pipeline steps over time (y-axis, 0 to 250,000; x-axis ticks from 500 to 395,432)]
Pipeline steps (legend):
• Beagle-Underdog Phasing
• Pipeline Finalize
• Relationship Processing
• Germline-Jermline Results Processing
• Germline-Jermline Processing
• Beagle Post Phasing
• Admixture
• Plink Prep
• Pipeline Initialization
Annotations:
• Jermline replaces Germline
• Ethnicity V2 Release
• Underdog replaces Beagle
• AdMixture on Hadoop
© 2014 MapR Technologies 31
…while the business continues to grow rapidly
[Chart: DNA Database Size, # of processed samples (y-axis, 0 to 450,000), quarterly from Jan-12 to Apr-14]
© 2014 MapR Technologies 32© 2014 MapR Technologies
BigData App Development Lifecycle
© 2014 MapR Technologies 33
BigData App Development Lifecycle
input output
1M rows
tail | grep | sort | uniq -c
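For comparison, roughly the same prototype can be written in a few lines of Python. This is a sketch under assumptions: `tail_grep_count` is a hypothetical helper, the pattern is matched as a plain substring rather than a regular expression, and the log lines are invented.

```python
from collections import Counter

def tail_grep_count(lines, pattern, n=10):
    """Rough equivalent of: tail -n N | grep PATTERN | sort | uniq -c"""
    recent = lines[-n:]                                     # tail
    matches = [line for line in recent if pattern in line]  # grep (substring)
    return sorted(Counter(matches).items())                 # sort | uniq -c

# Hypothetical input
log = ["GET /a", "GET /b", "POST /a", "GET /a", "GET /b", "GET /a"]
counts = tail_grep_count(log, "GET", n=5)
print(counts)  # [('GET /a', 2), ('GET /b', 2)]
```

Like the shell pipeline, this works well while the input fits on one machine, which is the point of the prototype stage.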
© 2014 MapR Technologies 34
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Over decades of progress,
Unix-based systems have set the
standard for compatibility and
functionality
© 2014 MapR Technologies 36
BigData App Development Lifecycle
tail | grep | sort | uniq -c
input output
1M rows
1B rows
© 2014 MapR Technologies 37
BigData App Development Lifecycle
tail | grep | sort | uniq -c
input output
1M rows
1B rows
1T rows
© 2014 MapR Technologies 38
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
Hadoop achieves much higher scalability
by trading away essentially all of this
compatibility
© 2014 MapR Technologies 39
BigData App Development Lifecycle
tail | grep | sort | uniq -c
input output
1T rows
1T rows
input output
Port to BigData Tools
($$$$)
© 2014 MapR Technologies 40
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
MapR enhances Apache Hadoop by restoring
the compatibility while increasing scalability and
performance
© 2014 MapR Technologies 41
BigData App Development Lifecycle
tail | grep | sort | uniq -c
input output
1T rows
POSIX (NFS)
Hadoop HDFS
Port
© 2014 MapR Technologies 42
BigData App Development Lifecycle
tail | grep | sort | uniq -c
1 1 1 1
100 100 100 100
Prototype Tools
Dev Cost
BigData Tools
Dev Cost
Use When
Possible
Use When
Needed
© 2014 MapR Technologies 43
BigData App Development Lifecycle
tail | grep | sort | uniq -c
1 1 1
100
Prototype Tools
Dev Cost
BigData Tools
Dev Cost
Use When
Possible
Use When
Needed
© 2014 MapR Technologies 44© 2014 MapR Technologies
Aadhaar – World’s Largest Biometric
Database
© 2014 MapR Technologies 45
Largest Biometric Database in the World
PEOPLE
1.2B
PEOPLE
© 2014 MapR Technologies 46
India: Problem
• 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay Income Tax, <20% banking
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing
leakage of 30-40%
© 2014 MapR Technologies 47
India: Vision
• Create a common “national identity” for every “resident”
– Biometric backed identity to eliminate duplicates
– “Verifiable online identity” for portability
• Applications ecosystem using open APIs
– Aadhaar enabled bank account and payment platform
– Aadhaar enabled electronic, paperless KYC
• Enrolment
– One time in a person’s lifetime
– Multi-modal biometrics (fingerprints, iris)
© 2014 MapR Technologies 48
Aadhaar Biometric Capture & Index
© 2014 MapR Technologies 49
Aadhaar Biometric Capture & Index
© 2014 MapR Technologies 50
Aadhaar Biometric Capture & Index
© 2014 MapR Technologies 51
Architectural Principles
• Design for Scale
– Every component needs to scale to large volumes
– Millions of transactions and billions of records
– Accommodate failure and design for recovery
• Open Architecture
– Open Source
– Open APIs
• Security
– End-to-end security of resident data
© 2014 MapR Technologies 52
Design for Scale
• Horizontal scale-out
• Distributed computing
• Distributed data storage and partitioning
• No single points of failure
• No single points of bottleneck
• Asynchronous processing throughout the system
– Allows loose coupling of various components
– Allows independent component level scaling
© 2014 MapR Technologies 53
MapR Filesystem
Aadhaar Multi-DC Data Storage Stack*
ID + Biometrics
(M7 HBase)
All raw packets
(HDFS+NFS)
Enrollment ID
API
ID + Demo + Photo + Benefits
(MySQL, Solr)
Authentication
API
Authorization
API
* as best I understand
from public
documents
© 2014 MapR Technologies 54
Enrollment Volume
• 600 to 800 million UIDs in 4 years
– 1 million a day
– 200+ trillion matches every day!!!
• ~5MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
– About 30 TB I/O every day
– Replication and backup across DCs of about 5+ TB of incremental data every day
– Lifecycle updates and new enrolments will continue forever
• Additional process data
– Several million events on average moving through async channels (some
persistent and some transient)
– Needing complete update and insert guarantees across data stores
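Two of the figures above can be sanity-checked with simple arithmetic. This is a rough sketch: it assumes every new enrolment is compared against every existing record, and it assumes a database of 200 million records at the time of measurement. That database size is an illustrative assumption, not a number from the slide.

```python
ENROLMENTS_PER_DAY = 1_000_000       # from the slide
BYTES_PER_RESIDENT = 5 * 10**6       # ~5 MB per resident, from the slide
ASSUMED_DB_SIZE = 200_000_000        # assumption: records already enrolled

# De-duplication: each new enrolment is matched against the existing database.
matches_per_day = ENROLMENTS_PER_DAY * ASSUMED_DB_SIZE

# New raw data arriving per day, in terabytes (10**12 bytes).
incremental_tb_per_day = ENROLMENTS_PER_DAY * BYTES_PER_RESIDENT // 10**12

print(matches_per_day)         # 200000000000000, i.e. 200 trillion
print(incremental_tb_per_day)  # 5
```

Under these assumptions the results line up with the "200+ trillion matches" and "5+ TB of incremental data" figures.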
© 2014 MapR Technologies 55
Authentication Volume
• 100+ million authentications per day (10 hrs)
– Possible high variance on peak and average
– Sub second response
– Guaranteed audits
• Multi-DC architecture
– All changes need to be propagated from enrolment data stores to all authentication
sites
• Authentication request is about 4 KB
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB encrypted audit logs in 10 days
– Audit write must be guaranteed
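The audit figures follow from the request volume and size. The sketch below reproduces them, assuming one audit record per authentication, 4 KB taken as 4,000 bytes, and requests spread evenly over the 10-hour day. The even spread is a simplification, since the slide warns of high variance between peak and average.

```python
AUTHS_PER_DAY = 100_000_000   # from the slide
REQUEST_BYTES = 4_000         # "about 4 KB", taken as 4,000 bytes
ACTIVE_SECONDS = 10 * 3600    # 10-hour day, from the slide

audit_records_10_days = AUTHS_PER_DAY * 10
audit_records_per_year = AUTHS_PER_DAY * 365
audit_tb_10_days = audit_records_10_days * REQUEST_BYTES // 10**12
average_requests_per_second = round(AUTHS_PER_DAY / ACTIVE_SECONDS)

print(audit_records_10_days)        # 1000000000
print(audit_records_per_year)       # 36500000000
print(audit_tb_10_days)             # 4
print(average_requests_per_second)  # 2778
```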
© 2014 MapR Technologies 56
How Do Biometrics Relate to Genomics?
Data Shape and Size
• Aadhaar: 5MB features (minutiae)
• Genome: ~3M features (variants)
Data Set Operations
• Aadhaar: ƒ(x) Unique feature subset => identity
• Genome: ƒ(x) Unique feature subset => identity
• Genome: Variant × Phenotype
Commonality => Causal Genes
ƒ⁻¹(x)!
SNP DB 2º
Analytics
© 2014 MapR Technologies 57
Vector Pattern Matching
SNP DB 2º
Analytics
ƒ⁻¹(x): common features
ƒ(x): unique features
ƒ(x): uncommon features
ƒ(x): other features
© 2014 MapR Technologies 58
Topological Pattern Matching
SNP DB 2º
Analytics
© 2014 MapR Technologies 59© 2014 MapR Technologies
MapR Platform
© 2014 MapR Technologies 60
Apache Hadoop NameNode High Availability (HA)
NameNode
A B C D E F
HDFS-based Distributions
DataNode
DataNode
DataNode
DataNode
DataNode
DataNode
DataNode
DataNode
DataNode
Primary NameNode
A B C D E F
Standby NameNode
A B C D E F
NameNode
A B
NameNode
C D
NameNode
E F
NameNode
A B
NameNode
C D
NameNode
E F
NAS Appliance
Single NameNode:
• Single point of failure
• Limited to 50-200 million files
• Performance bottleneck
• Metadata must fit in memory
HDFS HA:
• Only one active NameNode
• Limited to 50-200 million files
• Commercial NAS possibly needed
• Metadata must fit in memory
• Performance bottleneck
• Double the block reports
HDFS Federation:
• Multiple single points of failure w/o HA
• Needs 20 NameNodes for 1 Billion files
• Commercial NAS needed
• Metadata must fit in memory
• Performance bottleneck
• Double the block reports
© 2014 MapR Technologies 61
No NameNode Architecture
[Diagram: DataNodes only; metadata for A to F is spread and replicated across the nodes (AAA BBBB CCC DDD EEE FFF) rather than held in a NameNode]
Up to 1T files (> 5000x advantage)
Significantly less hardware & OpEx
Higher performance
No special config to enable HA
Automatic failover & re-replication
Metadata is persisted to disk
© 2014 MapR Technologies 62
MapR M7: The Best In-Hadoop Database
• NoSQL Columnar Store
• Apache HBase API
• In-Hadoop database
HBase
JVM
HDFS
JVM
ext3/ext4
Disks
Other Distros
Tables/Files
Disks
MapR M7
The most scalable, enterprise-grade,
NoSQL database that supports online applications and analytics
© 2014 MapR Technologies 63
MapR M7: The Best In-Hadoop Database
• NoSQL Columnar Store
• Apache HBase API
• In-Hadoop database
HBase
Interface
JVM
HDFS
Interface
JVM
ext3/ext4
Disks
Other Distros
Tables/Files
Disks
MapR M7
The most scalable, enterprise-grade,
NoSQL database that supports online applications and analytics
BigData Application
© 2014 MapR Technologies 64
HBase Apps: High Performance with Consistent Low Latency
--- M7 Read Latency --- Others Read Latency
© 2014 MapR Technologies 65© 2014 MapR Technologies
MapR Services
© 2014 MapR Technologies 66
Professional Services
• Installation
• Migrations
• SLA Plans
• Best Practices
• Performance
Tuning
Hadoop Core
Services
IT/ Infrastructure
Linux
Networking
Data Center
Storage
Operations
Big Data
Workflows
• Hive/Pig
• Oozie/Sqoop
• Flume
• M7/HBase
• Data Flow
BI / DBA
BI / ETL / Reporting
Scripting / Java
Hadoop MR
Eco Projects
(HBase, Hive, …)
Solution
Design
• HBase/M7
• Map/Reduce
• Application
Development
• Integration
Development
Java
Hadoop Developer
Architectural Design
Advanced
Analytics
• Use case
Discovery
• Use case
Modeling
• POC
• Workshops
Modeler / Analyst
PhD
Statistics/Math
MatLab / R / SAS
Scripting / Java
BI / ETL / Reporting
Data Engineering Data Science
AUDIENCE
ENGAGEMENTS
SKILLS
© 2014 MapR Technologies 67
Global PS Resources
17 Today (+8 in Q3)
D.C.
Keys Botzum (DE/Security/Developer)
Joe Blue (Data Scientist)
Venkat Gunnup (DE/Development)
Alex Rodriguez (DE/Development)
Kannappan Sirchabesa (DE/OPS)
SAN JOSE
Wayne Cappas (Director/DE)
John Benninghoff (DE/OPS)
Dmitry Gomerman (DE/OPS & Security)
Ivan Bishop (DE/OPS)
James Caseletto (Data Scientist)
Sungwook Yoon (Data Scientist)
Sridhar Reddy (Director - M7/HBase)
LOS ANGELES
John Ewing (DE/OPS)
Marco Vasquez (Data Scientist/DE)
SOUTH CAROLINA
David Schexnayder (DE/OPS)
PHOENIX
Michael Farnbach (DE/OPS)
SINGAPORE
Allen Day (Data Scientist)
© 2014 MapR Technologies 68
Use Case Data Flow Example
MapR Data Platform
Processing and Analytics
Ingest
Sqoop
Flume
HDFS
NFS
Access
Tez
Drill
Hive
Pig
Impala
Data Sources
Clickstream
Billing Data
Mobile Data
Product Catalog
Social Media
Server Logs
Merchant Listings
Online Chat
Call Detail Records
Visualization
M7, HBase
MapReduce v1 & v2
Storm, Cascading, Pig
Solr, Mahout, YARN
Oozie, Hive, MLlib
Set-Top Box Data
© 2014 MapR Technologies 69
Engagement Types
• Customer engagement is typically 1-4 weeks (longer okay)
• Well established partners (15,000 resources globally)
• Custom training based on customer use-case
• Small 1-3 days workshops
• Extended support / Staff augmentation
© 2014 MapR Technologies 70
Q&A
Thanks!
twitter.com/allenday
aday@mapr.com
slideshare.net/allenday
linkedin.com/in/allenday
© 2014 MapR Technologies 71© 2014 MapR Technologies
An Overview of Apache Spark
© 2014 MapR Technologies 72
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Preexisting MapReduce
• Examples and Resources
© 2014 MapR Technologies 73© 2014 MapR Technologies
MapReduce Refresher
© 2014 MapR Technologies 74
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault-tolerance
• Map
– Loading of the data and defining a set of keys
• Reduce
– Collects the organized key-based data to process and output
• Performance can be tweaked based on known details of your
source files and cluster shape (size, total number)
© 2014 MapR Technologies 75
Languages and Frameworks
• Languages
– Java, Scala, Clojure
– Python, Ruby
• Higher Level Languages
– Hive
– Pig
• Frameworks
– Cascading, Crunch
• DSLs
– Scalding, Scrunch, Scoobi, Cascalog
© 2014 MapR Technologies 76
MapReduce Processing Model
• Define mappers
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Or use a higher level language or DSL that does this for you
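The three stages can be mimicked in plain Python to show what the developer writes and what the framework does. This is a single-process sketch for illustration, not Hadoop code. The names `mapper`, `shuffle`, `reducer`, and `run_job` are made up here, and in a real job the shuffle is performed by the framework across machines.

```python
from collections import defaultdict

def mapper(line):
    """User-defined: emit (key, value) pairs for one input record."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Framework-provided: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """User-defined: combine all values for one key."""
    return (key, sum(values))

def run_job(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))

result = run_job(["a rose is a rose", "is it"])
print(result)  # {'a': 2, 'is': 2, 'it': 1, 'rose': 2}
```

Chaining jobs means feeding the output of one `run_job` call into the mapper of the next.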
© 2014 MapR Technologies 77© 2014 MapR Technologies
What is Spark?
© 2014 MapR Technologies 78
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC
Berkeley’s AMP Lab
• Fully open sourced in 2010 – now
a Top Level Project at the Apache
Software Foundation
© 2014 MapR Technologies 79
The Spark Community
© 2014 MapR Technologies 80
Spark is the Most Active Open Source Project in Big Data
[Chart: project contributors in past year (y-axis, 0 to 140); labelled projects: Giraph, Storm, Tez]
© 2014 MapR Technologies 81
Unified Platform
Shark
(SQL)
Spark Streaming
(Streaming)
MLlib
(Machine learning)
Spark (General execution engine)
GraphX (Graph
computation)
Continued innovation bringing new functionality, e.g.:
• Java 8 (Closures, Lambda Expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
© 2014 MapR Technologies 82
Supported Languages
• Java
• Scala
• Python
• Hive?
© 2014 MapR Technologies 83
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop InputFormat
• HBase
© 2014 MapR Technologies 84
Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
* As of May 14, 2014
** Don’t be surprised if you see the Mahout library converting to Spark soon
© 2014 MapR Technologies 85© 2014 MapR Technologies
The Difference with Spark
© 2014 MapR Technologies 86
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
2-5× less code
Up to 10× faster on disk, 100× in memory
© 2014 MapR Technologies 87
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that can be operated on in
parallel
– Parallelized Collection: Scala collection which is run in parallel
– Hadoop Dataset: records of files supported by Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
© 2014 MapR Technologies 88
RDD Operations
• Transformations
– Creation of a new dataset from an existing one
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
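The split between transformations and actions can be illustrated with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in written for this example. It is not Spark's implementation, and it only mimics the behaviour that transformations are recorded while actions trigger the computation.

```python
class ToyRDD:
    """Toy model: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)  # recorded (kind, function) steps

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

lines = ToyRDD(["ERROR\tdisk\tfull", "INFO\tjob\tdone", "ERROR\tnet\tdown"])
msgs = lines.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
print(msgs.collect())  # ['full', 'down']
print(msgs.count())    # 2
```

Nothing is computed when `filter` and `map` are called. The work happens only in `collect` and `count`.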
© 2014 MapR Technologies 89
RDD Persistence / Caching
• Variety of storage levels
– memory_only (default), memory_and_disk, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (memory_and_disk)
– Total memory storage size (memory_only_ser)
– Replicate to second node for faster fault recovery (memory_only_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
© 2014 MapR Technologies 90
Cache Scaling Matters
Execution time (s) by % of working set in cache:
• Cache disabled: 69
• 25%: 58
• 50%: 41
• 75%: 30
• Fully cached: 12
© 2014 MapR Technologies 91
Directed Acyclic Graph (DAG)
• Directed
– Only in a single direction
• Acyclic
– No looping
• Why does this matter?
– This supports fault-tolerance
© 2014 MapR Technologies 92
RDD Fault Recovery
RDDs track lineage information that can be used to efficiently
recompute lost data
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
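Lineage-based recovery can be sketched per partition in plain Python. The class `LineageRDD`, its `lose` method, and the sample log lines are hypothetical and exist only for this illustration. Spark's real mechanism is distributed and considerably more involved.

```python
class LineageRDD:
    """Toy model: each partition can be rebuilt from its source plus lineage."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # durable input, e.g. file blocks
        self._lineage = tuple(lineage)    # ordered (kind, function) steps
        self._computed = {}               # partition index -> in-memory result

    def filter(self, pred):
        return LineageRDD(self._source, self._lineage + (("filter", pred),))

    def map(self, fn):
        return LineageRDD(self._source, self._lineage + (("map", fn),))

    def compute(self, index):
        """Return one partition, replaying the lineage if it is not in memory."""
        if index not in self._computed:
            rows = list(self._source[index])
            for kind, fn in self._lineage:
                if kind == "map":
                    rows = [fn(row) for row in rows]
                else:
                    rows = [row for row in rows if fn(row)]
            self._computed[index] = rows
        return self._computed[index]

    def lose(self, index):
        """Simulate a node failure that drops one in-memory partition."""
        self._computed.pop(index, None)

partitions = [["ERROR\ta\tdisk full", "INFO\tb\tok"], ["ERROR\tc\tnet down"]]
msgs = (LineageRDD(partitions)
        .filter(lambda s: s.startswith("ERROR"))
        .map(lambda s: s.split("\t")[2]))

before = msgs.compute(1)
msgs.lose(1)              # partition 1 is gone from memory
after = msgs.compute(1)   # rebuilt from the source partition and the lineage
```

Only the lost partition is recomputed. Partition 0 is never touched during the recovery.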
© 2014 MapR Technologies 93
Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– Storm: 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) versus record size (100 and 1000 bytes), Spark versus Storm; panels: WordCount (y-axis, 0 to 30) and Grep (y-axis, 0 to 60)]
© 2014 MapR Technologies 94
Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
© 2014 MapR Technologies 95
The Game Changer!
• The
– Port them over if you need better performance
• Be sure to share the results and learnings
• Pig Scripts
– Port them over
– Try SPORK!
• Hive Queries….
© 2014 MapR Technologies 96© 2014 MapR Technologies
Preexisting MapReduce
© 2014 MapR Technologies 97
Existing Jobs
• Java MapReduce
– Port them over if you need better performance
• Be sure to share the results and learnings
• Pig Scripts
– Port them over
– Try SPORK!
• Hive Queries….
© 2014 MapR Technologies 98
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive warehouses without changing queries or data!
• Augments Hive
– In-memory tables and columnar memory store
• Fast execution engine
– Uses Spark as the underlying execution engine
– Low-latency, interactive queries
– Scale-out and tolerates worker failures
© 2014 MapR Technologies 99© 2014 MapR Technologies
Examples and Resources
© 2014 MapR Technologies 100
// Java 8 ([sparkHome] and [jars] are optional arguments)
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
// mapToPair yields a pair RDD, so counts is a JavaPairRDD
JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");

// Scala
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Word Count
• Java MapReduce (~15 lines of code)
• Java Spark (~ 7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java8 (4 lines of code)
© 2014 MapR Technologies 101
Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
© 2014 MapR Technologies 102
Deploying Spark – Cluster Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YARN
• Mesos
© 2014 MapR Technologies 103
Remember
• If you want to use a new technology you must learn that new
technology
• For those who have been using Hadoop for a while, at one time
you had to learn all about MapReduce and how to manage and
tune it
• To get the most out of a new technology you need to learn that
technology; this includes tuning
– There are switches you can use to optimize your work
© 2014 MapR Technologies 104
Configuration
http://spark.apache.org/docs/latest/
Most Important
• Application Configuration
http://spark.apache.org/docs/latest/configuration.html
• Standalone Cluster Configuration
http://spark.apache.org/docs/latest/spark-standalone.html
• Tuning Guide
http://spark.apache.org/docs/latest/tuning.html
© 2014 MapR Technologies 105
Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/
© 2014 MapR Technologies 106
• San Francisco
June 30 – July 2
• Use Cases
• Tech Talks
• Training
http://spark-summit.org/
© 2014 MapR Technologies 107
Q&A
@mapr maprtech
jscott@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Mais conteúdo relacionado

Mais procurados

Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application ResourcesDataWorks Summit
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Mich Talebzadeh (Ph.D.)
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
Performance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storagePerformance tuning your Hadoop/Spark clusters to use cloud storage
Performance tuning your Hadoop/Spark clusters to use cloud storageDataWorks Summit
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Data science on big data. Pragmatic approach
Data science on big data. Pragmatic approachData science on big data. Pragmatic approach
Data science on big data. Pragmatic approachPavel Mezentsev
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoYu Liu
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 

Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 

2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies Genomics Use Cases @ MapR
  • 2. © 2014 MapR Technologies 2© 2014 MapR Technologies DNA Sequencing Company
  • 3. © 2014 MapR Technologies 3 Parallelize Primary Analytics .fastq .vcf short read alignment genotype callingreads & mappings
  • 4. © 2014 MapR Technologies 4 Sequence Analysis, Quick Overview […] G A C T A G A fragment1 A C A G T T T A C A fragment2 A G A T A - - A G A fragment3 A A C A G C T T A C A […] fragment4 C T A T A G A T A A fragment5 […] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA […] G A C T A C A G A T A A C A G A T T A C A […] sampleDNA
  • 5. © 2014 MapR Technologies 5 What is the (Probable) Color of Each Column?
  • 6. © 2014 MapR Technologies 6 Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row O(rows*cols) + O(1 col) memory
  • 7. © 2014 MapR Technologies 7 Which Columns are (probably) Not White? Strategy 2: examine foreach row. keep running tallies O(rows) + O(rows*cols) memory
  • 8. © 2014 MapR Technologies 8 Which Columns are (probably) Not White? Strategy 3: rotate matrix. examine foreach column O(rows log rows) + O(cols) + O(1 col) memory
  • 9. © 2014 MapR Technologies 9 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) + O(cols) + O(1 col) memory
  • 10. © 2014 MapR Technologies 10 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory As # of rows & columns increases Strategy 3 becomes more attractive
  • 11. © 2014 MapR Technologies 11 Primary Sequence Analysis (ETL), MapReduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90º (O(mn)) / 1 (O(mn) + O(n log n)) / s Hello!
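The rotate-then-scan idea (Strategy 3) maps naturally onto map, shuffle, and reduce. The following is a minimal single-process sketch in Python, not the pipeline used in the talk: the toy reference, the reads, and the helper names `map_read`, `shuffle`, and `reduce_column` are all hypothetical, and the majority-vote call stands in for a real genotype caller.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: a reference string and aligned reads as (start_offset, bases).
REFERENCE = "GATTACAGATTACA"
READS = [
    (0, "GATCAC"),
    (2, "TCACAG"),
    (3, "CACAGA"),
    (7, "GATTAC"),
    (8, "ATTACA"),
]

def map_read(start, bases):
    """MAP: emit one (column, base) pair per aligned base."""
    for i, base in enumerate(bases):
        yield start + i, base

def shuffle(pairs):
    """SHUFFLE: group bases by column and sort by column (the 'rotate matrix' step)."""
    columns = defaultdict(list)
    for column, base in pairs:
        columns[column].append(base)
    return sorted(columns.items())

def reduce_column(column, bases):
    """REDUCE: take the majority base and flag columns that differ from the reference."""
    call, _ = Counter(bases).most_common(1)[0]
    return column, call, call != REFERENCE[column]

pairs = [pair for start, bases in READS for pair in map_read(start, bases)]
variants = [
    (column, call)
    for column, call, differs in (reduce_column(c, b) for c, b in shuffle(pairs))
    if differs
]
print(variants)
```

Each reducer only ever holds one column in memory, which is the property the strategy comparison highlights; the sort inside `shuffle` is the cost paid for it.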
  • 12. © 2014 MapR Technologies 12 Clinical Applications: Performance Matters MapR FilesystemN F S DNA Sequencer DNA Sequencer DNA Sequencer Raw DNARaw DNARaw DNA 1º Analytics Raw DNARaw DNASNP calls Static Clinical Reporting PhysicianPatient Reference DBs SNP DB ETL 2º Analytics ResearcherSubject
  • 13. © 2014 MapR Technologies 13 Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics SNP DB 2º Analytics New Markets Hello! More linear algebra  [Spark, Summingbird, Lambda Architecture Slides]
  • 14. © 2014 MapR Technologies 14 The Post-Sequencing Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • 15. © 2014 MapR Technologies 15 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  • 16. © 2014 MapR Technologies 16 Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design- specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise phenotype = αgenotype + β + ε1 + ε2 + ε3 + ε4 At risk of over-simplifying as business-level concept…
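As a rough illustration of the simplified model on this slide, the sketch below fits phenotype = α·genotype + β by ordinary least squares in plain Python and treats everything else (ε1 through ε4 lumped together) as the residual. The data and the function name `fit_line` are hypothetical, and a real association study would model the ε terms explicitly rather than lumping them.

```python
def fit_line(genotypes, phenotypes):
    """Ordinary least squares for phenotype = alpha * genotype + beta."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    covariance = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    variance = sum((g - mean_g) ** 2 for g in genotypes)
    alpha = covariance / variance
    beta = mean_p - alpha * mean_g
    return alpha, beta

# Hypothetical data: genotype coded as minor-allele count (0, 1, or 2).
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]

alpha, beta = fit_line(genotypes, phenotypes)
# Whatever the line does not explain is the combined epsilon term.
residuals = [p - (alpha * g + beta) for g, p in zip(genotypes, phenotypes)]
print(alpha, beta, residuals)
```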
  • 17. © 2014 MapR Technologies 17 HUGE PROBLEM COMBINATORIAL EXPLOSION
  • 18. © 2014 MapR Technologies 18 What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • 19. © 2014 MapR Technologies 19 Solution: Percolate SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Denormalize and Percolate (re)prioritize & (re)process service queries drive dashboards create reports denormalize for display buffer New models
  • 20. © 2014 MapR Technologies 20 Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 21. © 2014 MapR Technologies 21 Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 22. © 2014 MapR Technologies 22© 2014 MapR Technologies Genealogy Company Slides credit: Bill Yetman, Hadoop Summit 2014 http://slidesha.re/1vRh3kY
  • 23. © 2014 MapR Technologies 23 GERMLINE is… • …an algorithm that finds hidden relationships within a pool of DNA • …the reference implementation of that algorithm written in C++. • You can find it here: http://www1.cs.columbia.edu/~gusev/germline/
  • 24. © 2014 MapR Technologies 24 Projected GERMLINE run times (in hours) [Chart: hours (0 to 700) vs. number of samples (2,500 to 122,500); series: GERMLINE run times, Projected GERMLINE run times] 700 hours = 29+ days EXPONENTIAL COMPLEXITY
  • 25. © 2014 MapR Technologies 25 GERMLINE: What’s the Problem? • GERMLINE (the implementation) was not meant to be used in an industrial setting – Stateless, single threaded, prone to swapping (heavy memory usage) – GERMLINE performs poorly on large data sets • Our metrics predicted exactly where the process would slow to a crawl • Put simply: GERMLINE couldn't scale
  • 26. © 2014 MapR Technologies 26 Run times for matching (in hours) [Chart: hours (0 to 180) vs. number of samples; series: GERMLINE run times, Jermline run times, Projected GERMLINE run times; annotations: EXPONENTIAL, LINEAR, HBase Refactor]
  • 27. © 2014 MapR Technologies 27 • Paper submitted describing the implementation • Releasing as an Open Source project soon • [HBase Schema/Algorithm Slides]
  • 28. © 2014 MapR Technologies 28© 2014 MapR Technologies Further Growth & Optimization
  • 29. © 2014 MapR Technologies 29 Underdog (Strand Phasing) performance – Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation, with improved accuracy! Underdog replaces Beagle [Chart: Total Run Size and Total Beagle-Underdog Duration; axis 0 to 80,000]
  • 30. © 2014 MapR Technologies 30 Pipeline steps and incremental change… – Incremental change over time – Supporting the business in a “just in time” Agile way [Stacked chart: per-step values (0 to 250,000) for each pipeline run, x-axis values from 500 to 395,432; steps: Beagle-Underdog Phasing, Pipeline Finalize, Relationship Processing, Germline-Jermline Results Processing, Germline-Jermline Processing, Beagle Post Phasing, Admixture, Plink Prep, Pipeline Initialization; annotated events: Jermline replaces Germline, Ethnicity V2 Release, Underdog Replaces Beagle, AdMixture on Hadoop]
  • 31. © 2014 MapR Technologies 31 …while the business continues to grow rapidly [Chart: DNA Database Size, number of processed samples (0 to 450,000), Jan-12 to Apr-14]
  • 32. © 2014 MapR Technologies 32© 2014 MapR Technologies BigData App Development Lifecycle
  • 33. © 2014 MapR Technologies 33 BigData App Development Lifecycle outputinput 1M rows tail | grep | sort | uniq -c
  • 34. © 2014 MapR Technologies 34 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
  • 35. © 2014 MapR Technologies 35 BigData App Development Lifecycle outputinput 1M rows tail | grep | sort | uniq -c
  • 36. © 2014 MapR Technologies 36 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1M rows 1B rows
  • 37. © 2014 MapR Technologies 37 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1M rows 1B rows 1T rows
  • 38. © 2014 MapR Technologies 38 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Hadoop Hadoop achieves much higher scalability by trading away essentially all of this compatibility
  • 39. © 2014 MapR Technologies 39 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1T rows 1T rows input output Port to BigData Tools ($$$$)
  • 40. © 2014 MapR Technologies 40 Evolution of Data Storage Functionality Compatibility Scalability Linux POSIX Hadoop MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance
  • 41. © 2014 MapR Technologies 41 BigData App Development Lifecycle tail | grep | sort | uniq -c outputinput 1T rows POSIX (NFS) Hadoop HDFS Port
  • 42. © 2014 MapR Technologies 42 BigData App Development Lifecycle tail | grep | sort | uniq -c 1 1 1 1 100 100 100 100 Prototype Tools Dev Cost BigData Tools Dev Cost Use When Possible Use When Needed
  • 43. © 2014 MapR Technologies 43 BigData App Development Lifecycle tail | grep | sort | uniq -c 1 1 1 100 Prototype Tools Dev Cost BigData Tools Dev Cost Use When Possible Use When Needed
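For readers less familiar with the shell pipeline that recurs through these lifecycle slides, here is a hedged Python equivalent of `tail | grep | sort | uniq -c`. The function name, its parameters, and the sample log lines are hypothetical. It is the kind of prototype-scale logic that works at 1M rows and that the slides argue should not have to be rewritten just because the input grows.

```python
import re
from collections import Counter, deque

def tail_grep_sort_uniq_c(lines, pattern, n=10):
    """Mimic `tail -n N | grep PATTERN | sort | uniq -c` on an iterable of lines."""
    last_lines = deque(lines, maxlen=n)                               # tail -n N
    matched = [line for line in last_lines if re.search(pattern, line)]  # grep
    counts = Counter(matched)                                         # uniq -c
    return sorted(counts.items())                                     # sort

# Hypothetical input rows.
log = ["GET /a", "POST /b", "GET /a", "GET /c", "POST /b", "GET /a"]
result = tail_grep_sort_uniq_c(log, r"GET", n=5)
print(result)
```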
  • 44. © 2014 MapR Technologies 44© 2014 MapR Technologies Aadhaar – World’s Largest Biometric Database
  • 45. © 2014 MapR Technologies 45 Largest Biometric Database in the World PEOPLE 1.2B PEOPLE
  • 46. © 2014 MapR Technologies 46 India: Problem • 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay income tax, <20% banking – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%
  • 47. © 2014 MapR Technologies 47 India: Vision • Create a common “national identity” for every “resident” – Biometric backed identity to eliminate duplicates – “Verifiable online identity” for portability • Applications ecosystem using open APIs – Aadhaar enabled bank account and payment platform – Aadhaar enabled electronic, paperless KYC • Enrolment – One time in a person’s lifetime – Multi-modal biometrics (fingerprints, iris)
  • 48. © 2014 MapR Technologies 48 Aadhaar Biometric Capture & Index
  • 49. © 2014 MapR Technologies 49 Aadhaar Biometric Capture & Index
  • 50. © 2014 MapR Technologies 50 Aadhaar Biometric Capture & Index
  • 51. © 2014 MapR Technologies 51 Architectural Principles • Design for Scale – Every component needs to scale to large volumes – Millions of transactions and billions of records – Accommodate failure and design for recovery • Open Architecture – Open Source – Open APIs • Security – End-to-end security of resident data
  • 52. © 2014 MapR Technologies 52 Design for Scale • Horizontal scale-out • Distributed computing • Distributed data storage and partitioning • No single points of failure • No single points of bottleneck • Asynchronous processing throughout the system – Allows loose coupling of various components – Allows independent component-level scaling
  • 53. © 2014 MapR Technologies 53 MapR Filesystem Aadhaar Multi-DC Data Storage Stack* ID + Biometrics (M7 HBase) All raw packets (HDFS+NFS) Enrollment ID API ID + Demo + Photo + Benefits (MySQL, Solr) Authentication API Authorization API * as best I understand from public documents
  • 54. © 2014 MapR Technologies 54 Enrollment Volume • 600 to 800 million UIDs in 4 years – 1 million a day – 200+ trillion matches every day!!! • ~5MB per resident – Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!) – About 30 TB I/O every day – Replication and backup across DCs of about 5+ TB of incremental data every day – Lifecycle updates and new enrolments will continue for ever • Additional process data – Several million events on an average moving through async channels (some persistent and some transient) – Needing complete update and insert guarantees across data stores
  • 55. © 2014 MapR Technologies 55 Authentication Volume • 100+ million authentications per day (10 hrs) – Possible high variance on peak and average – Sub second response – Guaranteed audits • Multi-DC architecture – All changes need to be propagated from enrolment data stores to all authentication sites • Authentication request is about 4 KB – 100 million authentications a day – 1 billion audit records in 10 days (30+ billion a year) – 4 TB encrypted audit logs in 10 days – Audit write must be guaranteed
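The enrollment and authentication figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only that: the gallery size of 200 million already-enrolled residents, the one-audit-record-per-authentication rule, and the audit record size equal to the request size are assumptions chosen to reproduce the slide numbers, not facts from the deck.

```python
# Figures taken from the slides.
NEW_ENROLMENTS_PER_DAY = 1_000_000
BYTES_PER_RESIDENT = 5 * 10**6          # ~5 MB
AUTHS_PER_DAY = 100_000_000
AUTH_WINDOW_HOURS = 10
REQUEST_BYTES = 4_000                   # ~4 KB

# Assumptions (not in the slides).
ASSUMED_GALLERY_SIZE = 200_000_000      # residents already enrolled
AUDIT_RECORDS_PER_AUTH = 1
TB = 10**12

# Enrollment: every new record is de-duplicated against the whole gallery.
matches_per_day = NEW_ENROLMENTS_PER_DAY * ASSUMED_GALLERY_SIZE
incremental_tb_per_day = NEW_ENROLMENTS_PER_DAY * BYTES_PER_RESIDENT / TB

# Authentication: average rate over the 10-hour window and audit volume.
avg_auths_per_second = AUTHS_PER_DAY / (AUTH_WINDOW_HOURS * 3600)
audit_records_10_days = AUTHS_PER_DAY * AUDIT_RECORDS_PER_AUTH * 10
audit_records_per_year = AUTHS_PER_DAY * AUDIT_RECORDS_PER_AUTH * 365
audit_tb_10_days = audit_records_10_days * REQUEST_BYTES / TB

print(matches_per_day, incremental_tb_per_day)
print(round(avg_auths_per_second), audit_records_10_days, audit_records_per_year, audit_tb_10_days)
```

Under these assumptions the results line up with the slides: 200 trillion matches and 5 TB of incremental data per day, 1 billion audit records and 4 TB of audit logs in 10 days, and roughly 2,800 authentications per second on average before accounting for peaks.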
  • 56. © 2014 MapR Technologies 56 How Do Biometrics Relate to Genomics? Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! SNP DB 2º Analytics
  • 57. © 2014 MapR Technologies 57 Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! Vector Pattern Matching SNP DB 2º Analytics ƒ-1(x): common features ƒ(x): unique features ƒ(x): uncommon features ƒ(x): other features
  • 58. © 2014 MapR Technologies 58 Data Shape and Size • Aadhaar: 5MB features (minutia) • Genome: ~3M features (variants) Data Set Operations • Aadhaar: ƒ(x) Unique feature subset => identity • Genome: “ “ “ “ “ • Genome: Variant × Phenotype Commonality => Causal Genes ƒ-1(x) ! Topological Pattern Matching SNP DB 2º Analytics
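The contrast drawn on these slides, unique features identifying an individual versus common features pointing at candidate causal genes, can be sketched with plain set operations. Everything below is hypothetical: the variant identifiers, the sample names, and the two helper functions are illustrations of the idea and not the matching method of either Aadhaar or any genomics pipeline.

```python
# Hypothetical variant sets per sample.
samples = {
    "case_1": {"rs1", "rs2", "rs3", "rs7"},
    "case_2": {"rs2", "rs3", "rs5"},
    "case_3": {"rs2", "rs3", "rs9"},
    "control_1": {"rs2", "rs4"},
}

def unique_features(name, all_samples):
    """Features seen in this sample and in no other: the 'identity' direction."""
    others = set().union(*(v for k, v in all_samples.items() if k != name))
    return all_samples[name] - others

def common_features(case_names, control_names, all_samples):
    """Features shared by every case and absent from all controls:
    the 'commonality' direction used to shortlist candidate variants."""
    shared = set.intersection(*(all_samples[n] for n in case_names))
    in_controls = set().union(*(all_samples[n] for n in control_names))
    return shared - in_controls

identity = unique_features("case_1", samples)
candidates = common_features(["case_1", "case_2", "case_3"], ["control_1"], samples)
print(sorted(identity), sorted(candidates))
```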
  • 59. © 2014 MapR Technologies 59© 2014 MapR Technologies MapR Platform
  • 60. © 2014 MapR Technologies 60 Apache Hadoop NameNode High Availability (HA) NameNode A B C D E F HDFS-based Distributions DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode Primary NameNode A B C D E F Standby NameNode A B C D E F NameNode A B NameNode C D NameNode E F NameNode A B NameNode C D NameNode E F NAS Appliance HDFS HA HDFS Federation Single point of failure Limited to 50-200 million files Performance bottleneck Metadata must fit in memory Only one active NameNode Limited to 50-200 million files Commercial NAS possibly needed Metadata must fit in memory Performance bottleneck Double the block reports Multiple single points of failure w/o HA Needs 20 NameNodes for 1 Billion files Commercial NAS needed Metadata must fit in memory Performance bottleneck Double the block reports
  • 61. © 2014 MapR Technologies 61 No NameNode Architecture DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode DataNode NameNode A B C D E FAAA BBBB CCC DDD EEE FFF Up to 1T files (> 5000x advantage) Significantly less hardware & OpEx Higher performance No special config to enable HA Automatic failover & re-replication Metadata is persisted to disk
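The file-count claims on the two architecture slides above follow from simple division. This is a quick check using only the limits quoted on the slides; the variable names are illustrative.

```python
import math

# Limits quoted on the slides.
FILES_PER_NAMENODE_LOW = 50_000_000
FILES_PER_NAMENODE_HIGH = 200_000_000
MAPR_MAX_FILES = 1_000_000_000_000      # "Up to 1T files"

# Federation: NameNodes needed for 1 billion files at the low end of the range.
namenodes_for_1b_files = math.ceil(1_000_000_000 / FILES_PER_NAMENODE_LOW)

# Scale advantage measured against the high end of the range.
scale_advantage = MAPR_MAX_FILES // FILES_PER_NAMENODE_HIGH

print(namenodes_for_1b_files, scale_advantage)
```

The "20 NameNodes" figure corresponds to the 50-million lower bound, while the 5000x ratio is measured against the 200-million upper bound.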
  • 62. © 2014 MapR Technologies 62 MapR M7: The Best In-Hadoop Database  NoSQL Columnar Store  Apache HBase API  In-Hadoop database HBase JVM HDFS JVM ext3/ext4 Disks Other Distros Tables/Files Disks MapR M7 The most scalable, enterprise-grade, NoSQL database that supports online applications and analytics
  • 63. © 2014 MapR Technologies 63 MapR M7: The Best In-Hadoop Database  NoSQL Columnar Store  Apache HBase API  In-Hadoop database Hbase Interface JVM HDFS Interface JVM ext3/ext4 Disks Other Distros Tables/Files Disks MapR M7 The most scalable, enterprise-grade, NoSQL database that supports online applications and analytics BigData Application
  • 64. © 2014 MapR Technologies 64 Hbase Apps: High Performance with Consistent Low Latency --- M7 Read Latency --- Others Read Latency
  • 65. © 2014 MapR Technologies 65© 2014 MapR Technologies MapR Services
  • 66. © 2014 MapR Technologies 66 Professional Services • Installation • Migrations • SLA Plans • Best Practices • Performance Tuning Hadoop Core Services IT/ Infrastructure Linux Networking Data Center Storage Operations Big Data Workflows • Hive/Pig • Oozie/Sqoop • Flume • M7/HBase • Data Flow BI / DBA BI / ETL / Reporting Scripting / Java Hadoop MR Eco Projects (HBase, Hive, …) Solution Design • HBase/M7 • Map/Reduce • Application Development • Integration Development Java Hadoop Developer Architectural Design Advanced Analytics • Use case Discovery • Use case Modeling • POC • Workshops Modeler / Analyst PhD Statistics/Math MatLab / R / SAS Scripting / Java BI / ETL / Reporting Data Engineering Data Science AUDIENCE ENGAGEMENTS SKILLS
  • 67. © 2014 MapR Technologies 67 Global PS Resources 17 Today (+8 in Q3) D.C. Keys Botzum (DE/Security/Developer) Joe Blue (Data Scientist) Venkat Gunnup (DE/Development) Alex Rodriguez (DE/Development) Kannappan Sirchabesa (DE/OPS) SAN JOSE Wayne Cappas (Director/DE) John Benninghoff (DE/OPS) Dmitry Gomerman (DE/OPS & Security) Ivan Bishop (DE/OPS) James Caseletto (Data Scientist) Sungwook Yoon (Data Scientist) Sridhar Reddy (Director - M7/Hbase) LOS ANGELES John Ewing (DE/OPS) Marco Vasquez (Data Scientist/DE) SOUTH CAROLINA David Schexnayder (DE/OPS) PHOENIX Michael Farnbach (DE/OPS) SINGAPORE Allen Day (Data Scientist)
  • 68. © 2014 MapR Technologies 68 Use Case Data Flow Example MapR Data Platform Processing and Analytics Ingest Sqoop Flume HDFS NFS Access Tez Drill Hive Pig Impala Data Sources Clickstream Billing Data Mobile Data Product Catalog Social Media Server Logs Merchant Listings Online Chat Call Detail Records Visualization M7HBase MapReduce v1 & v2 StormCascadingPig Solr MahoutYARN Oozie Hive MLLib Set-Top Box Data
  • 69. © 2014 MapR Technologies 69 Engagement Types • Customer engagement is typically 1-4 weeks (longer okay) • Well established partners (15,000 resources globally) • Custom training based on customer use-case • Small 1-3 days workshops • Extended support / Staff augmentation
  • 70. © 2014 MapR Technologies 70 Q&A twitter.com/allenday aday@mapr.com Thanks! slideshare.net/allendaylinkedin.com/in/allenday
  • 71. © 2014 MapR Technologies 71© 2014 MapR Technologies An Overview of Apache Spark
  • 72. © 2014 MapR Technologies 72 Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Preexisting MapReduce • Examples and Resources
  • 73. © 2014 MapR Technologies 73© 2014 MapR Technologies MapReduce Refresher
  • 74. © 2014 MapR Technologies 74 MapReduce Basics • Foundational model is based on a distributed file system – Scalability and fault-tolerance • Map – Loading of the data and defining a set of keys • Reduce – Collects the organized key-based data to process and output • Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
  • 75. © 2014 MapR Technologies 75 Languages and Frameworks • Languages – Java, Scala, Clojure – Python, Ruby • Higher Level Languages – Hive – Pig • Frameworks – Cascading, Crunch • DSLs – Scalding, Scrunch, Scoobi, Cascalog
  • 76. © 2014 MapR Technologies 76 MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together – Or use a higher level language or DSL that does this for you
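The processing model above (define mappers, let the framework shuffle, define reducers, chain jobs for complex work) can be sketched in a few lines of single-process Python. This is a teaching sketch and not the Hadoop API: `run_job` and the mapper and reducer names are hypothetical.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass in memory."""
    groups = defaultdict(list)
    for record in records:                     # map phase
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: group values by key
    output = []
    for key in sorted(groups):                 # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

# Job 1: word count.
def count_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(word, ones):
    yield word, sum(ones)

# Job 2 (chained on job 1's output): group words by how often they occur.
def invert_mapper(pair):
    word, count = pair
    yield count, word

def invert_reducer(count, words):
    yield count, sorted(words)

lines = ["to be or not to be", "to do"]
counts = run_job(lines, count_mapper, count_reducer)
by_count = run_job(counts, invert_mapper, invert_reducer)
print(counts)
print(by_count)
```

The second job consumes the first job's output, which is the chaining that higher-level languages and DSLs generate for you.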
  • 77. © 2014 MapR Technologies 77© 2014 MapR Technologies What is Spark?
  • 78. © 2014 MapR Technologies 78 Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
  • 79. © 2014 MapR Technologies 79 The Spark Community
  • 80. © 2014 MapR Technologies 80 Spark is the Most Active Open Source Project in Big Data [Bar chart: project contributors in past year (0 to 140); labeled comparison projects: Giraph, Storm, Tez]
  • 81. © 2014 MapR Technologies 81 Unified Platform Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.: • Java 8 (Closures, Lambda Expressions) • Spark SQL (SQL on Spark, not just Hive) • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark)
  • 82. © 2014 MapR Technologies 82 Supported Languages • Java • Scala • Python • Hive?
  • 83. © 2014 MapR Technologies 83 Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase
  • 84. © 2014 MapR Technologies 84 Machine Learning - MLlib • K-Means • L1 and L2-regularized Linear Regression • L1 and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent * As of May 14, 2014 ** Don’t be surprised if you see the Mahout library converting to Spark soon
  • 85. © 2014 MapR Technologies 85© 2014 MapR Technologies The Difference with Spark
  • 86. © 2014 MapR Technologies 86 Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code Up to 10× faster on disk, 100× in memory
  • 87. © 2014 MapR Technologies 87 Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel – Parallelized Collection: Scala collection which is run in parallel – Hadoop Dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 88. © 2014 MapR Technologies 88 RDD Operations • Transformations – Creation of a new dataset from an existing one • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions – Return a value after running a computation • collect, count, first, takeSample, foreach, etc… Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
  • 89. © 2014 MapR Technologies 89 RDD Persistence / Caching • Variety of storage levels – memory_only (default), memory_and_disk, etc… • API Calls – persist(StorageLevel) – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations – Read from disk vs. recompute (memory_and_disk) – Total memory storage size (memory_only_ser) – Replicate to second node for faster fault recovery (memory_only_2) • Think about this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
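To make the recompute-versus-cache trade-off concrete, here is a toy Python stand-in for an RDD. `LazyDataset` is hypothetical and is not the Spark API; it only mimics two behaviors described above: transformations are lazy until an action runs, and `cache()` keeps the computed result in memory so later actions do not recompute it.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._use_cache = False
        self._cached = None
        self.evaluations = 0         # how many times this dataset was computed

    # Transformations: build a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        return LazyDataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        self._use_cache = True
        return self

    # Actions: trigger the computation.
    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result

    def count(self):
        return len(self.collect())

source = LazyDataset(lambda: list(range(10)))

squares = source.map(lambda x: x * x)
squares.count()
squares.count()
uncached_evaluations = squares.evaluations    # recomputed for each action

cached_squares = source.map(lambda x: x * x).cache()
cached_squares.count()
cached_squares.count()
cached_evaluations = cached_squares.evaluations  # computed once, then reused

print(uncached_evaluations, cached_evaluations)
```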
  • 90. © 2014 MapR Technologies 90 Cache Scaling Matters [Chart: execution time (s) vs. % of working set in cache: cache disabled 69, 25% 58, 50% 41, 75% 30, fully cached 12]
  • 91. © 2014 MapR Technologies 91 Directed Acyclic Graph (DAG) • Directed – Only in a single direction • Acyclic – No looping • Why does this matter? – This supports fault-tolerance
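A small illustration of why "directed" and "acyclic" matter: a DAG of stages always has a valid execution order, whereas a graph with a loop has none. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later); the stage names are hypothetical and this is not how Spark's scheduler is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Each stage maps to the set of stages it depends on (hypothetical names).
stages = {
    "filter": {"load"},
    "map": {"filter"},
    "count": {"map"},
}

# A DAG can always be put into an order that respects every dependency.
execution_order = list(TopologicalSorter(stages).static_order())

def has_cycle(graph):
    """Return True if the dependency graph contains a loop."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True

print(execution_order, has_cycle(stages), has_cycle({"a": {"b"}, "b": {"a"}}))
```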
  • 92. © 2014 MapR Technologies 92 RDD Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2]) Lineage: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
  • 93. © 2014 MapR Technologies 93 Comparison to Storm • Higher throughput than Storm – Spark Streaming: 670k records/sec/node – Storm: 115k records/sec/node – Commercial systems: 100-500k records/sec/node [Charts: WordCount and Grep, throughput per node (MB/s) vs. record size (100 and 1000 bytes), Spark vs. Storm]
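Lineage-based recovery can be sketched without Spark: record the chain of functions applied to each source partition, and replay that chain when a computed partition goes missing. The `Lineage` class below is a hypothetical toy, and the log lines are invented; it reuses the filter-then-map example from the slide.

```python
class Lineage:
    """Toy lineage tracker (hypothetical, not the Spark API)."""

    def __init__(self, source_partitions, steps=()):
        self.source = source_partitions   # stands in for durable storage such as HDFS
        self.steps = tuple(steps)         # recorded ("filter" | "map", function) pairs
        self.partitions = {}              # computed partitions held in memory

    def filter(self, predicate):
        return Lineage(self.source, self.steps + (("filter", predicate),))

    def map(self, fn):
        return Lineage(self.source, self.steps + (("map", fn),))

    def _recompute(self, index):
        rows = self.source[index]
        for kind, fn in self.steps:       # replay the recorded lineage
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            else:
                rows = [fn(row) for row in rows]
        self.partitions[index] = rows
        return rows

    def get(self, index):
        if index not in self.partitions:  # lost or never computed
            return self._recompute(index)
        return self.partitions[index]

# Invented tab-separated log lines, split across two partitions.
log_partitions = [
    ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    ["ERROR\tweb\t500", "ERROR\tdb\tdeadlock"],
]

msgs = (Lineage(log_partitions)
        .filter(lambda s: s.startswith("ERROR"))
        .map(lambda s: s.split("\t")[2]))

before_loss = msgs.get(1)
del msgs.partitions[1]                    # simulate losing one partition
after_recovery = msgs.get(1)              # rebuilt from the source plus lineage
print(before_loss, after_recovery)
```

Only the lost partition is recomputed; partition 0 is untouched, which is why lineage is cheaper than replicating every intermediate result.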
  • 94. © 2014 MapR Technologies 94 Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  • 95. © 2014 MapR Technologies 95 The Game Changer! • The – Port them over if you need better performance • Be sure to share the results and learnings • Pig Scripts – Port them over – Try SPORK! • Hive Queries….
  • 96. © 2014 MapR Technologies 96© 2014 MapR Technologies Preexisting MapReduce
  • 97. © 2014 MapR Technologies 97 Existing Jobs • Java MapReduce – Port them over if you need better performance • Be sure to share the results and learnings • Pig Scripts – Port them over – Try SPORK! • Hive Queries….
  • 98. © 2014 MapR Technologies 98 Shark – SQL over Spark • Hive-compatible (HiveQL, UDFs, metadata) – Works in existing Hive warehouses without changing queries or data! • Augments Hive – In-memory tables and columnar memory store • Fast execution engine – Uses Spark as the underlying execution engine – Low-latency, interactive queries – Scale-out and tolerates worker failures
  • 99. © 2014 MapR Technologies 99© 2014 MapR Technologies Examples and Resources
  • 100. © 2014 MapR Technologies 100 Word Count • Java MapReduce (~15 lines of code) • Java Spark (~ 7 lines of code) • Scala and Python (4 lines of code) – interactive shell: skip line 1 and replace the last line with counts.collect() • Java8 (4 lines of code)
    Java 8:
      // [sparkHome] and [jars] are optional arguments
      JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
      JavaRDD<String> file = sc.textFile("hdfs://...");
      JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
          .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
          .reduceByKey((x, y) -> x + y);
      counts.saveAsTextFile("hdfs://...");
    Scala:
      val sc = new SparkContext(master, appName, [sparkHome], [jars])
      val file = sc.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  • 101. © 2014 MapR Technologies 101 Network Word Count – Streaming
      // Create the context with a 1 second batch size
      val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
          System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
      // Create a NetworkInputDStream on target host:port and count the
      // words in input stream of \n delimited text (e.g. generated by 'nc')
      val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
      val words = lines.flatMap(_.split(" "))
      val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
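The streaming example above counts words once per batch interval rather than once per record. The following plain-Python sketch illustrates that micro-batch idea only; `micro_batches`, `word_counts`, and the timestamped events are hypothetical and do not use the Spark Streaming API.

```python
from collections import Counter

def micro_batches(timed_lines, batch_seconds=1):
    """Group (timestamp, line) events into consecutive fixed-width batches."""
    batches = {}
    for timestamp, line in timed_lines:
        batch_index = int(timestamp // batch_seconds)
        batches.setdefault(batch_index, []).append(line)
    return [batches[index] for index in sorted(batches)]

def word_counts(batch):
    """Word count over one batch: split each line, then count the words."""
    return Counter(word for line in batch for word in line.split(" "))

# Hypothetical events: (seconds since start, text line).
events = [(0.1, "a b"), (0.7, "a"), (1.2, "b b"), (2.5, "c")]

per_batch_counts = [dict(word_counts(batch)) for batch in micro_batches(events)]
print(per_batch_counts)
```

Each element of `per_batch_counts` corresponds to what one `wordCounts.print()` call would report for a one-second batch.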
  • 102. © 2014 MapR Technologies 102 Deploying Spark – Cluster Manager Types • Standalone mode – Comes bundled (EC2 capable) • YARN • Mesos
  • 103. © 2014 MapR Technologies 103 Remember • If you want to use a new technology you must learn that new technology • For those who have been using Hadoop for a while, at one time you had to learn all about MapReduce and how to manage and tune it • To get the most out of a new technology you need to learn that technology, this includes tuning – There are switches you can use to optimize your work
  • 104. © 2014 MapR Technologies 104 Configuration http://spark.apache.org/docs/latest/ Most Important • Application Configuration http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide http://spark.apache.org/docs/latest/tuning.html
  • 105. © 2014 MapR Technologies 105 Resources • Pig on Spark – http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html – https://github.com/aniket486/pig – https://github.com/twitter/pig/tree/spork – http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 – https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark – http://databricks.com/categories/spark/ – http://www.spark-stack.org/
  • 106. © 2014 MapR Technologies 106 • San Francisco June 30 – July 2 • Use Cases • Tech Talks • Training http://spark-summit.org/
  • 107. © 2014 MapR Technologies 107 Q&A @mapr maprtech jscott@mapr.com Engage with us! MapR maprtech mapr-technologies

Editor's Notes

  1. Graph of each step in the pipeline for every run. This graph shows how important it is to measure everything. Some steps have been greatly reduced or eliminated. Light blue is the matching step. You can see it going quadratic and then the change when ‘J’ Jermline was released.
  2. Gives up random access read on files; gives up strong authentication / authorization model; gives up random access write / append on files.
  3. 45
  4. Historically, the NameNode in Hadoop is a single point of failure, a scalability limitation, and a performance bottleneck. With other distributions you have a bottleneck regardless of the number of nodes in the cluster. With other distributions the most files you can support is 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. As you add more nodes to your cluster and want to configure HA, you have to add expensive NAS and keep warm standbys for the NN and its related metadata, which is held in memory. Even more, once you surpass the file limit in HDFS, you need region NameNode servers to support those additional nodes: a “federated NameNode” approach. Think of the additional dedicated hardware and configuration/administration required to set up NameNode HA in Hadoop! And this is ONLY for NameNode HA.
  5. What if you could distribute the NameNode metadata and have it share resources in your cluster? What if Hadoop was a truly distributed environment? With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance. (Advantages of this approach are called out on the left and right sides of the diagram.)
  6. Because of architecture. Apache HBase runs in a JVM that reads and writes to HDFS, which is also running in a separate JVM, storing data in the Linux OS, which is reading and writing to disk. As data is collected, it needs to be written to disk and “compacted” (i.e., maintenance is performed); this introduces many layers and steps that need to happen. MapR M7 has integrated tables and files, which are a true file system, reading and writing directly on disks. MapR M7 is a tightly integrated, in-Hadoop NoSQL database, a columnar store that is 100% Apache HBase API compatible.
  7. Because of architecture. Apache HBase runs in a JVM that reads and writes to HDFS, which is also running in a separate JVM, storing data in the Linux OS, which is reading and writing to disk. As data is collected, it needs to be written to disk and “compacted” (i.e., maintenance is performed); this introduces many layers and steps that need to happen. MapR M7 has integrated tables and files, which are a true file system, reading and writing directly on disks. MapR M7 is a tightly integrated, in-Hadoop NoSQL database, a columnar store that is 100% Apache HBase API compatible.
  8. **Consistent** low latency on read due to compactions. Recall Aadhar. Why?
  9. Spark is really cool…
  10. When do you use regular MapReduce over higher-level languages? When Hive? When Pig? When anything?
  11. You can find project resources on the Apache project site. You’ll also find information about the mailing list there (including archives).
  12. Yahoo and Adobe are in production with Spark.
  13. This sounds a lot like the reason to consider Pig vs. Java MapReduce
  14. Gracefully
  15. Looks kind of like a source control tree
  16. You can import the MLlib to use here in the shell!
  17. Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us.
  18. Don’t forget to share your experiences. This is really what the community is about. If you don’t have time to contribute to open source, use it and share your experiences!
  19. This isn’t all proven out yet, but some of it should just work already.
  20. This is a really simple example. Reality is 22 chromosomes and 96 characters in a word.
  21. ‘G’ Germline would have to rebuild the hash table for all samples and then re-run all comparisons: an all-by-all comparison.
  22. This is where HBase shines. It is easy to add columns and rows, very efficient with empty cells (sparse matrix). Hammer HBase with multiple processes doing this at the same time.
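
The contrast in notes 20 to 22 can be illustrated with a toy sketch. It is not the actual Germline or Jermline code: the class and function names, the 4-character word length, and the in-memory dictionaries standing in for the hash table and the sparse HBase match matrix are all assumptions made for illustration. The batch function rebuilds and compares everything, while the incremental matcher only compares each new sample against words already indexed.

```python
from collections import defaultdict

WORD_LEN = 4  # toy value (assumption); the notes mention 96 characters per word


def split_words(sequence, word_len=WORD_LEN):
    """Yield (word_index, word) pairs for fixed-length, non-overlapping words."""
    for start in range(0, len(sequence) - word_len + 1, word_len):
        yield start // word_len, sequence[start:start + word_len]


def all_by_all(samples):
    """Batch baseline: recompute the words of every sample and compare every pair."""
    ids = sorted(samples)
    result = defaultdict(dict)
    for i, a in enumerate(ids):
        words_a = set(split_words(samples[a]))
        for b in ids[i + 1:]:
            shared = len(words_a & set(split_words(samples[b])))
            if shared:  # store only non-empty cells (sparse)
                result[a][b] = shared
                result[b][a] = shared
    return result


class IncrementalMatcher:
    """Hypothetical incremental variant: existing samples are never recompared."""

    def __init__(self):
        # (word_index, word) -> set of sample ids carrying that word at that position
        self.word_index = defaultdict(set)
        # sparse match matrix: sample -> {other sample -> number of shared words}
        self.matches = defaultdict(dict)

    def add_sample(self, sample_id, sequence):
        for key in split_words(sequence):
            # Only samples already indexed under this exact word are touched.
            for other in self.word_index[key]:
                count = self.matches[sample_id].get(other, 0) + 1
                self.matches[sample_id][other] = count
                self.matches[other][sample_id] = count
            self.word_index[key].add(sample_id)


samples = {
    "s1": "GATTACAGATTA",
    "s2": "GATTTCAGATTA",
    "s3": "CCCCACAGGGGG",
}

matcher = IncrementalMatcher()
for sample_id, sequence in samples.items():
    matcher.add_sample(sample_id, sequence)

batch = all_by_all(samples)
```

Both approaches produce the same sparse matrix here, but adding a fourth sample to the matcher costs one pass over that sample's words, whereas the batch function repeats every pairwise comparison. The dict-of-dicts layout mirrors why a sparse row and column store suits this workload: cells for non-matching pairs are simply absent.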