Predictive analytics have long lived in the domain of statistical tools like R. Increasingly, however, as companies struggle to deal with exploding volumes of data not easily analyzed by small data tools, they are looking at ways of doing predictive analytics directly inside the primary data store.
This approach, called in-database predictive analytics, eliminates the need to sample data and perform a separate ETL process into a statistical tool, which can decrease total cost, improve the quality of predictive models, and dramatically shorten development time. In this class, you will learn the pros and cons of in-database predictive analytics, see its main limitations, and survey the tools and technologies you need to head down this path.
3. Introduction
In-Database Predictive Analytics
In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.
4. Introduction
Traditional Predictive Analytics
[Diagram: R, database, SAS]
5. Introduction
[Diagram: R, database, SAS]
Data Bottleneck: Painful, Slow
9. Abusing SQL
General Approach in RDBMS
[Diagram: Driver → SQL → Database, with Feedback returned to the Driver]
10. Abusing SQL
Our Initial Model
Table model: d, k, n, iteration, avg_q
d = number of dimensions
k = number of clusters
n = number of points
iteration = number of iterations
avg_q = variance
11. Abusing SQL
Our Initial Data Set
Table Y
Columns: Y1, Y2, Y3, ...
n rows
12. Abusing SQL
Projection & Numbering
Y (Y1, Y2, Y3, ...) → YH (i, Y1, ..., Yd)
i numbers the rows 1, 2, 3, 4, ..., n
INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;
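The window expression `sum(1) over(rows unbounded preceding)` is a running sum of ones, so it produces the row ids 1, 2, ..., n. A minimal Python sketch of the same idea follows; the sample rows and the helper name `number_rows` are made up for illustration and are not part of the original deck.

```python
from itertools import accumulate

# Hypothetical sample of table Y with d = 2 columns (Y1, Y2).
Y = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]

def number_rows(rows):
    """Prefix each row with a running sum of ones, i.e. a 1-based row id i."""
    running = accumulate(1 for _ in rows)  # yields 1, 2, 3, ...
    return [(i, *row) for i, row in zip(running, rows)]

# YH mirrors the numbered table: (i, Y1, Y2)
YH = number_rows(Y)
```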
13. Abusing SQL
Flattening
YH (i, Y1, ..., Yd) → YV (i, l, val)
Each row i of YH becomes d rows in YV: (i, 1, Y1), ..., (i, d, Yd)
n x d rows
INSERT INTO YV SELECT i,1,Y1 FROM YH;
...
INSERT INTO YV SELECT i,d,Yd FROM YH;
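Because the flattening needs one INSERT per dimension, a driver program would typically generate the statements in a loop. The sketch below does this against an in-memory SQLite database, which is an assumption made only so the example runs anywhere; the function name `flatten` and the sample data are hypothetical.

```python
import sqlite3

def flatten(conn, d):
    """Issue one INSERT per dimension l = 1..d, as a driver program would."""
    for l in range(1, d + 1):
        # Column names Y1..Yd are generated by the driver, not taken from user input.
        conn.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
conn.executemany("INSERT INTO YH VALUES (?, ?, ?)",
                 [(1, 1.0, 2.0), (2, 8.0, 9.0)])

flatten(conn, d=2)

# n x d = 2 x 2 = 4 rows, one per (point, dimension) pair
rows = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
```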
14. Abusing SQL
Initializing k Cluster Centers
YH (i, Y1, ..., Yd; n rows) → CH (j, Y1, ..., Yd; k rows)
INSERT INTO CH
SELECT 1,Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH
SELECT k,Y1, ..., Yd FROM YH SAMPLE 1;
15. Abusing SQL
Flattening
CH (j, Y1, ..., Yd) → C (l, j, val)
d x k rows
INSERT INTO C
SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C
SELECT d, k, Yd FROM CH WHERE j = k;
16. Abusing SQL
Computing Distances to Clusters
YD (i, j, dist)
n x k rows
INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C WHERE YV.l = C.l
GROUP BY i, j;
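To make the distance step concrete, here is a small sketch that runs the same join on two points and two centers. It assumes SQLite as a stand-in database; since SQLite has no `**` operator, the square is written as a product. The sample values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);
CREATE TABLE C  (l INTEGER, j INTEGER, val REAL);
CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL);
""")

# Points: 1 = (0, 0), 2 = (3, 4), stored one row per dimension l
conn.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 3.0), (2, 2, 4.0)])
# Centers: 1 = (0, 0), 2 = (3, 0), stored as (l, j, val)
conn.executemany("INSERT INTO C VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (2, 1, 0.0), (1, 2, 3.0), (2, 2, 0.0)])

# Squared Euclidean distance from every point i to every center j
conn.execute("""
INSERT INTO YD
SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
FROM YV, C WHERE YV.l = C.l
GROUP BY YV.i, C.j
""")

dist = {(i, j): d for i, j, d in conn.execute("SELECT i, j, dist FROM YD")}
```

Note that the stored value is the squared distance; taking the square root is unnecessary when the result is only used to find the nearest center.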
17. Abusing SQL
Computing Nearest Neighbors
YNN (i, j): nearest cluster j for each point i
n rows
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
(SELECT i, min(dist) AS mindist FROM YD
GROUP BY i) YMIND
WHERE YD.i = YMIND.i
AND YD.dist = YMIND.mindist;
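The nearest-cluster query can be checked on a tiny distance table. The sketch below again assumes SQLite as a stand-in database and uses invented distances; the distance column is named `dist`, matching the YD table from the previous slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
INSERT INTO YD VALUES (1, 1, 0.0), (1, 2, 9.0), (2, 1, 25.0), (2, 2, 16.0);
""")

# Join each point's rows against its minimum distance to pick the nearest center
conn.execute("""
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
     (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i
  AND YD.dist = YMIND.mindist
""")

ynn = conn.execute("SELECT i, j FROM YNN ORDER BY i").fetchall()
```

One caveat of this formulation: if a point is exactly equidistant from two centers, the join returns a row for each of them, so YNN can end up with more than n rows.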
18. Abusing SQL
Count Points Per Cluster
INSERT INTO W SELECT j, count(*)
FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;
19. Abusing SQL
Compute New Centroids
INSERT INTO C
SELECT l, j, avg(YV.val) FROM YV, YNN
WHERE YV.i = YNN.i GROUP BY l, j;
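A short sketch of the centroid update on three points, assuming SQLite as a stand-in database and invented sample data. The `DELETE FROM C` step is an assumption added here so the new centers replace the old ones; the slide does not show how the previous contents of C are cleared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
CREATE TABLE YNN (i INTEGER, j INTEGER);
CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
""")

# Points: 1 = (0, 0), 2 = (2, 4), 3 = (10, 10)
conn.executemany("INSERT INTO YV VALUES (?, ?, ?)",
                 [(1, 1, 0.0), (1, 2, 0.0),
                  (2, 1, 2.0), (2, 2, 4.0),
                  (3, 1, 10.0), (3, 2, 10.0)])
# Assignments: points 1 and 2 belong to cluster 1, point 3 to cluster 2
conn.executemany("INSERT INTO YNN VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# Assumption: old centers are cleared before the new ones are inserted
conn.execute("DELETE FROM C")
conn.execute("""
INSERT INTO C
SELECT YV.l, YNN.j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY YV.l, YNN.j
""")

centers = {(l, j): v for l, j, v in conn.execute("SELECT l, j, val FROM C")}
```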
20. Abusing SQL
Compute Variances
INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i
and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
21. Abusing SQL
Update Model
INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i
and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
25. Painful by Design
#2: Relational Algebra Sucks
Projection, Selection, Rename, Natural Join (R ⋈ S)
Semijoin (R ⋉ S), Antijoin (R ▷ S), Division (R ÷ S), Theta Join
Left outer join (R ⟕ S), Right outer join (R ⟖ S), Full outer join (R ⟗ S)
Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)
Iteration, Recursion, Multiple Dimensions
28. Database Extensions
UDF (User-Defined Function): Map
map(a)
op2(a,b)
UDA (User-Defined Aggregate): Reduce
init(a)
accum(a, b)
merge(a, b)
final(a)
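The four UDA hooks can be illustrated with an average computed over two data partitions. This is a plain-Python sketch, not any database's actual extension API; the reading of each hook (a partial state of sum and count, merged across partitions) and the helper `aggregate` are assumptions made for illustration.

```python
from functools import reduce

def init(a):
    """Start a partial state (sum, count) from the first input value."""
    return (a, 1)

def accum(state, b):
    """Fold one more input value into a partial state."""
    return (state[0] + b, state[1] + 1)

def merge(s1, s2):
    """Combine two partial states computed on different partitions."""
    return (s1[0] + s2[0], s1[1] + s2[1])

def final(state):
    """Turn the merged state into the aggregate's result."""
    return state[0] / state[1]

def aggregate(partition):
    """Hypothetical helper: run init/accum over one non-empty partition."""
    first, *rest = partition
    return reduce(accum, rest, init(first))

# Each partition could live on a different node; only the small states are merged
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]
mean = final(reduce(merge, (aggregate(p) for p in partitions)))
```

Keeping the state as (sum, count) rather than a running mean is what makes `merge` simple and exact, which is the property a parallel database relies on.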
29. MADlib
MADlib is an open-source library for scalable in-database analytics.
It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.
30. MADlib
1. Download the binary
Mac OS X
http://www.madlib.net/files/madlib-0.6-Darwin.dmg
Linux
http://www.madlib.net/files/madlib-0.6-Linux.rpm
31. MADlib
2. Start the Installation
Mac OS X
Double-click on installer
Linux
yum install $MADLIB_PACKAGE --nogpgcheck
32. MADlib
3. Verify Locatability
Greenplum
source /path/to/greenplum/greenplum_path.sh
PostgreSQL
Make sure psql is in PATH
40. Other Approaches
2. Reject RDBMS
• Datalog + variants
• In theory, ideal for many kinds of predictive analytics
• Suffers from a lack of distributed, feature-complete implementations
41. Other Approaches
2. Reject RDBMS
• Rasdaman / RASQL
• Arrays but not analytics
Community Editions
http://www.rasdaman.org
42. Other Approaches
2. Reject RDBMS
• MonetDB / SciQL
• Array extension of SQL
• Poor analytics
Community Editions
http://www.monetdb.org
43. Other Approaches
2. Reject RDBMS
• SciDB / AFL (AQL)
• Excellent analytics
• Limited composability
Community Editions
http://www.scidb.org/forum/viewtopic.php?f=16&t=364/
44. Other Approaches
2. Reject RDBMS
• Precog / Quirrel (simple “R for big data”)
• Multidimensional, arrays + functions
• Still immature
Community Editions
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)
45. Summary
• Increase performance, reduce friction by doing more inside the database
• Not a panacea
• Hard to do in SQL
• Hard to do in C (but you may not have to: MADlib)
• Pre-canned & brittle in most databases
• Ultimately what’s needed is tech designed for advanced analytics