
Big Analytics Without Big Hassles

Complex analytics should work as nimbly on extremely large data sets as on small ones. You don’t want to think about whether your data fits in-memory, about parallelism, or formatting data for math packages. You’d like to use your favorite analytical language and have it transparently scale up to Big Data volumes.

Paradigm4 presents a webinar about SciDB—the massively scalable, open source, array database with native complex analytics, integrated with R and Python.

Details:

Presenter: Bryan Lewis, Chief Data Scientist, Paradigm4
Day/Time: Tuesday November 12th, 2013 at 1pm EST


Learn how SciDB enables you to:

-Explore rich data sets interactively
-Do complex math in-database, without being constrained by memory limitations
-Perform multi-dimensional windowing, filtering, and aggregation
-Offload large computations to a commodity hardware cluster—on-premise or in a cloud
-Use R and Python to analyze SciDB arrays as if they were R or Python objects.
-Share data among users, with multi-user data integrity guarantees and version control
Webinar Agenda:

-Introduction to SciDB
-Demo
-Live Q&A


Big Analytics Without Big Hassles

  1. Big Analytics without Big Hassles. Bryan Lewis, Chief Data Scientist; Alex Poliakov, Solutions Architect
  2. Paradigm4’s SciDB: SciDB is an open source, scalable array database, with native complex math analytics, integrated with R & Python. © Paradigm4 Inc.
  3. Paradigm4’s SciDB: SciDB helps data scientists, bioinformaticians, quants, analysts, and scientists tackle their toughest “Big Data” management and complex analytics challenges.
  4. Webinar Replay: These slides are from a Paradigm4 webinar held on 11/12/13. You can find this webinar, and additional webinars, at http://www.paradigm4.com/video/ (www.paradigm4.com)
  5. Agenda: 1. Brief Introduction to SciDB 2. Demos 3. Q & A
  6. Developed by Paradigm4: Open-source high-performance database. Data organized in multi-dimensional sparse arrays. Horizontally scalable. Excels at parallel linear algebra. ACID, data replication, versioned data.
  7. About Paradigm4: Paradigm4 develops & supports SciDB. CTO is MIT database researcher Mike Stonebraker, force behind many major advances in commercial database products (Postgres, Illustra, Streambase, Vertica, VoltDB, …). Commercial applications: Computational Genomics, Imaging, Quantitative Finance, E-commerce, Industrial Analytics, Internet of Things.
  8. Developed by Paradigm4: Community edition (Open Source; Unrestricted; Fully scalable). Enterprise edition (More math; Fault tolerance; System management tools).
  9. SciDB Powers NIH NCBI’s 1000 Genomes Project: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/ Running 24 x 7 since Fall 2012.
  10. SciDB Builds ARCA NBBO Book: 186 million quotes for one day. 80 seconds on a 32-instance cluster. Runs in about half the time on a cluster twice as large.
  11. SciDB Powers Recommendation Engines: Fast truncated SVD. Sparse 50M x 50M matrix with 4 billion nonzero values. Minutes per singular value on a four node Linux cluster.
  12. SciDB System Architecture: “Shared Nothing” cluster of commodity hardware nodes, interconnected with standard Ethernet and TCP/IP. SciDB Client (iquery, Python, R, C++, C, JDBC).
  13. SciDB Arrays: Each cell in a SciDB array consists of a fixed number of typed attributes (variables). Here is an example cell with four attributes: Price 450.61, Volume 150, Symbol “AAPL”, usec 36013008713.
  14. SciDB Arrays: A 1-D array looks like a spreadsheet. This picture shows five cells, each with four attributes (dimension i; attributes Price, Volume, Symbol, usec):
        i  Price   Volume  Symbol  usec
        1  450.61  150     “AAPL”  36013008713
        2  450.73  200     “AAPL”  36013008915
        3  450.84  10      “AAPL”  36013208113
        4  36.57   75      “MSFT”  36019008713
        5  36.20   100     “MSFT”  36003200113
  15. SciDB Arrays: The same data “redimensioned” into a 2D array. [Figure: the five cells from the previous slide laid out on a grid with dimensions Symbol (“AAPL”, “MSFT”) and usec; each cell holds Price and Volume.]
  16. Access multi-dimensional subsets in constant time. [Figure labels: Products, Customers, (price, location, age, gender, …), Vendors, other dimensions …, Customer [1].]
  17. High Performance Windowing: Fast, one-pass, running stats over arbitrary time or data windows, even when time intervals cross over internal storage shards. Example: simple running median outlier filter.
  18. SciDB Arrays: Arrays can be sparse. Arrays can be joined along dimensions or subsets of dimensions. Values can be aggregated along dimensions and over windows. Functions can be applied over values in arrays. Linear algebra operations, matrix decompositions, and other interesting operations are defined for matrices and vectors.
  19. Work in familiar IDE. Data persisted in SciDB. Offload large computations to cluster.
  20. Demos: Quantitative Finance example (regularized correlation; relevance network graph). Remote Sensing application (NASA MODIS satellite images; regrid with spatial interpolation; visualize at multiple resolutions). Survival Analysis on Healthcare Data (estimate Cox proportional hazards model with the big data bootstrap).
  21. Live demos
  22. Two modes for using R & Python: SciDB-R/Py (global): program SciDB naturally from R or Python. R/Py-exec (local): invoke R or Python from within SciDB queries. [Diagram labels: SciDB coordinator, R-exec.]
  23. Rationale: Provide a simple, robust way to run R or Python from inside SciDB queries, in parallel. Extend SciDB's powerful native analysis capabilities.
  24. Really simple example: Instance-parallel Monte Carlo estimate of π
        avg(
          r_exec(
            build(<z:double>[i=1:1000,1,0],0),
            'expr=x<-runif(1000);y<-runif(1000);list(sum(x^2+y^2<1)/250)'))
      Output:
        {i} x_avg
        {0} 3.14119
  25. Big data bootstrap example: Consider a matrix named "events" with 8 columns: ID (numeric), Race (categorical), SES (numeric), Age (numeric), Days_to_event (numeric), Event (binary), Group (categorical), Gender (categorical). Apply the bag of little bootstraps to estimate confidence intervals for coefficients of a Cox proportional hazards survival model.
  26. Big data bootstrap example: Randomly partition rows of the events matrix into blocks of at most 1000 rows (the "bag" part of the BLB method).
        store(
          redimension(
            cross_join(events as A,
              redimension(
                apply(project(sort(apply(
                  build(<v:int64>[k=0:9999,1000,0],random()),p,k)),p),m,n),
                <p:int64>[m=0:*,1000,0]) as B,
              A.i, B.m),
            <val:double>[p=0:9999,1000,0,j=0:7,8,0]),
          P)
  27. Big data bootstrap example:
        store(redimension(
          apply(
            r_exec(P,
              "expr=
                require(survival);
                D <- as.data.frame(matrix(val,ncol=8,byrow=TRUE));
                names(D) <- c('ID','Race','SES','Age','Days','Event','Group','Gender');
                D[,'Race'] <- factor(D[,'Race'], levels=1:13);
                D[,'Group'] <- factor(D[,'Group'], levels=1:2);
                D[,'Gender'] <- factor(D[,'Gender'], levels=1:2);
                ans <- sapply(1:500, function(x) {
                  M <- coxph(Surv(Days, Event) ~ Age + Race + Group + Gender + SES + cluster(ID),
                             data=D[sample(nrow(D),nrow(D),replace=TRUE),]);
                  c(coef(M), sqrt(diag(M[['var']])))});
                list(apply(ans, 1, mean));
              "),
            m, n%32),
          <ans:double null>[m=0:31,32,0], avg(val) as ans),
        coefs)
  28. Big data bootstrap result: Group 2 exhibits significantly lower relative risk of an event than Group 1 in this example.
        library("scidb")
        cf = scidb("coefs")[c(0,13:15)][]
        se = scidb("coefs")[c(16,29:31)][]
        plot(exp(cf))
        lapply(1:4, function(j) {
          lines(c(j, j), c(exp(cf[j]-1.96*se[j]), exp(cf[j]+1.96*se[j])))})
  29. Take Away: In-database, scalable, complex math. Less coding, more analysis. Transparent scale-up & speed-up. Interactive exploratory analytics. Seamless R and Python integration. www.paradigm4.com
  30. Questions? Tell us about your application: info@paradigm4.com. Try our Quick Start: scidb.org/forum; download a VM or EC2 AMI. www.paradigm4.com
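
The Monte Carlo query on slide 24 builds 1000 single-cell chunks, has each chunk draw 1000 uniform points and report four times the fraction that falls inside the unit quarter circle, then averages the per-chunk results. The plain-Python sketch below mirrors that arithmetic only. It runs serially, does not talk to SciDB, and the function names `chunk_estimate` and `estimate_pi` are hypothetical names chosen for illustration.

```python
import random


def chunk_estimate(n_points, rng):
    """One chunk's estimate of pi: 4 * (points inside the quarter circle) / n_points."""
    inside = 0
    for _ in range(n_points):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / n_points


def estimate_pi(n_chunks=1000, n_points=1000, seed=0):
    """Average the per-chunk estimates, as avg() does over the r_exec results."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_chunks):
        total += chunk_estimate(n_points, rng)
    return total / n_chunks
```

Calling `estimate_pi()` with the defaults uses one million points in total, the same count as the slide's query, so the result should land close to π. In SciDB the per-chunk work would run in parallel across instances, whereas this sketch loops over the chunks one at a time.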

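Slides 26 and 27 follow a two-step pattern: randomly partition the rows into blocks of at most 1000, then resample each block with replacement 500 times, compute an estimate and its standard error per block, and average across blocks. The sketch below shows that pattern in plain Python under a loud assumption: it swaps the Cox proportional hazards fit for a simple sample mean so that it needs only the standard library. All function names are hypothetical. As on the slide, each resample has the size of its block; the bag of little bootstraps as usually described draws resamples of the full data size instead.

```python
import random
from statistics import stdev


def partition_rows(rows, block_size, rng):
    """Shuffle the rows and cut them into blocks of at most block_size (the "bag" step)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i:i + block_size] for i in range(0, len(shuffled), block_size)]


def bootstrap_block(block, n_resamples, rng):
    """Resample one block with replacement; return the mean estimate and its bootstrap SE."""
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(block, k=len(block))  # same size as the block
        estimates.append(sum(sample) / len(sample))  # stand-in for the model fit
    return sum(estimates) / len(estimates), stdev(estimates)


def blocked_bootstrap(rows, block_size=1000, n_resamples=500, seed=0):
    """Average the per-block estimates and standard errors across all blocks."""
    rng = random.Random(seed)
    results = [bootstrap_block(block, n_resamples, rng)
               for block in partition_rows(rows, block_size, rng)]
    estimate = sum(r[0] for r in results) / len(results)
    std_error = sum(r[1] for r in results) / len(results)
    return estimate, std_error
```

With an estimate and standard error in hand, an approximate 95% interval is `estimate ± 1.96 * std_error`, which is the same construction the R code on slide 28 applies to the stored coefficients before exponentiating.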