In-Database Predictive Analytics

John A. De Goes
@jdegoes, john@precog.com

Agenda

• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.

Introduction

Traditional Predictive Analytics

[Diagram: R, database, SAS]

Introduction

[Diagram: R, database, SAS]

Data Bottleneck: Painful, Slow

Introduction

What’s the answer?

Introduction
Move the Code, not the Data!

Advanced Analytics

“MapReduce”

Abusing SQL

Let’s Do K-Means in SQL!

Abusing SQL
General Approach in RDBMS

[Diagram: the Driver sends SQL to the Database; the Database returns Feedback to the Driver]

Abusing SQL
Our Initial Model

model
d          number of dimensions
k          number of clusters
n          number of points
iteration  number of iterations
avg_q      variance

Abusing SQL
Our Initial Data Set

Y
Y1 Y2 Y3 ... Yd

n rows

Abusing SQL
Projection & Numbering

Y (Y1, Y2, Y3, ...) → YH (i, Y1, ..., Yd)
i = 1, 2, 3, 4, ..., n

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;
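
The numbering step can be tried locally. This is a minimal sketch, assuming SQLite (via Python's `sqlite3`) in place of Teradata and a hypothetical two-column data set; `ROW_NUMBER()` stands in for the running `sum(1)` window on the slide, and the function name `number_rows` is illustrative only.

```python
import sqlite3


def number_rows(points):
    """Copy 2-D points from Y into YH, adding a sequential row id i."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Y (Y1 REAL, Y2 REAL)")
    con.execute("CREATE TABLE YH (i INTEGER PRIMARY KEY, Y1 REAL, Y2 REAL)")
    con.executemany("INSERT INTO Y VALUES (?, ?)", points)
    # Assumption: ROW_NUMBER() replaces sum(1) OVER (ROWS UNBOUNDED PRECEDING).
    # Window functions need SQLite 3.25 or newer.
    con.execute(
        "INSERT INTO YH "
        "SELECT ROW_NUMBER() OVER (ORDER BY rowid) AS i, Y1, Y2 FROM Y"
    )
    return con.execute("SELECT i, Y1, Y2 FROM YH ORDER BY i").fetchall()


rows = number_rows([(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)])
print(rows)
```

Each input row comes back with an index i starting at 1, which later steps use as the point identifier.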

Abusing SQL
Flattening

YH (i, Y1, ..., Yd) → YV (i, l, val)
(i, l) runs over (1, 1), (1, 2), ..., (1, d), (2, 1), ..., (n, d)
n x d rows

INSERT INTO YV SELECT i,1,Y1 FROM YH;
...
INSERT INTO YV SELECT i,d,Yd FROM YH;
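
Because d is not known to SQL, the driver has to generate one INSERT per dimension. A rough sketch of that generation, assuming SQLite through Python's `sqlite3` and a hypothetical helper named `flatten`:

```python
import sqlite3


def flatten(points):
    """Turn YH(i, Y1, ..., Yd) into YV(i, l, val), one row per coordinate."""
    d = len(points[0])
    con = sqlite3.connect(":memory:")
    cols = ", ".join(f"Y{l} REAL" for l in range(1, d + 1))
    con.execute(f"CREATE TABLE YH (i INTEGER PRIMARY KEY, {cols})")
    con.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
    marks = ", ".join("?" * (d + 1))
    con.executemany(
        f"INSERT INTO YH VALUES ({marks})",
        [(i, *p) for i, p in enumerate(points, start=1)],
    )
    # One generated INSERT per dimension l = 1..d, as on the slide.
    for l in range(1, d + 1):
        con.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")
    return con.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()


yv = flatten([(1.0, 2.0), (8.0, 9.0)])
print(yv)
```

Two points in two dimensions yield four rows in YV, matching the n x d row count.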

Abusing SQL
Initializing k Cluster Centers

YH (i, Y1, ..., Yd), n rows → CH (j, Y1, ..., Yd), k rows

INSERT INTO CH
SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH
SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;

Abusing SQL
Flattening

CH (j, Y1, ..., Yd) → C (l, j, val)
(l, j) runs over (1, 1), (1, 2), ..., (1, k), (2, 1), ..., (d, k)
d x k rows

INSERT INTO C
SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C
SELECT d, k, Yd FROM CH WHERE j = k;

Abusing SQL
Computing Distances to Clusters

YD (i, j, dist)
(i, j) runs over (1, 1), (1, 2), ..., (1, k), (2, 1), ..., (n, k)
n x k rows

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C WHERE YV.l = C.l
GROUP BY i, j;
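
The distance join is easy to check on a toy input. A minimal sketch, assuming SQLite via Python's `sqlite3`; SQLite has no `**` operator, so the square is written as a product, and the sample points and centers are made up for illustration.

```python
import sqlite3


def squared_distances(yv_rows, c_rows):
    """Join YV(i, l, val) with C(l, j, val) on dimension l and sum squares."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
    con.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")
    con.execute("CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL)")
    con.executemany("INSERT INTO YV VALUES (?, ?, ?)", yv_rows)
    con.executemany("INSERT INTO C VALUES (?, ?, ?)", c_rows)
    con.execute(
        "INSERT INTO YD "
        "SELECT YV.i, C.j, SUM((YV.val - C.val) * (YV.val - C.val)) "
        "FROM YV, C WHERE YV.l = C.l GROUP BY YV.i, C.j"
    )
    return con.execute("SELECT i, j, dist FROM YD ORDER BY i, j").fetchall()


# Hypothetical data: points (0, 0) and (3, 4); centers (0, 0) and (3, 0).
yv = [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 3.0), (2, 2, 4.0)]
c = [(1, 1, 0.0), (2, 1, 0.0), (1, 2, 3.0), (2, 2, 0.0)]
yd = squared_distances(yv, c)
print(yd)
```

Note that dist holds the squared Euclidean distance; no square root is needed to pick the nearest center.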

Abusing SQL
Computing Nearest Neighbors

YNN (i, j): nearest clusters
i = 1, 2, 3, 4, 5, ..., n
n rows

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
(SELECT i, min(dist) AS mindist FROM YD
GROUP BY i) YMIND
WHERE YD.i = YMIND.i
and YD.dist = YMIND.mindist;
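
The nearest-cluster step is a join of YD against its own per-point minimum. A small sketch of that query, assuming SQLite via Python's `sqlite3` and hypothetical distance rows:

```python
import sqlite3


def nearest_clusters(yd_rows):
    """Pick, for each point i, the cluster j with the minimum distance."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL)")
    con.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
    con.executemany("INSERT INTO YD VALUES (?, ?, ?)", yd_rows)
    con.execute(
        "INSERT INTO YNN "
        "SELECT YD.i, YD.j "
        "FROM YD, "
        "  (SELECT i, MIN(dist) AS mindist FROM YD GROUP BY i) YMIND "
        "WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist"
    )
    return con.execute("SELECT i, j FROM YNN ORDER BY i, j").fetchall()


ynn = nearest_clusters([(1, 1, 0.0), (1, 2, 9.0), (2, 1, 25.0), (2, 2, 16.0)])
print(ynn)
```

If a point is exactly equidistant from two centers, this query returns both rows for it, so YNN can exceed n rows unless ties are broken separately.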

Abusing SQL
Count Points Per Cluster

INSERT INTO W SELECT j, count(*)
FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;

Abusing SQL
Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val) FROM YV, YNN
WHERE YV.i = YNN.i GROUP BY l, j;
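
A toy run of the centroid update, again as a sketch that assumes SQLite via Python's `sqlite3`. The three sample points and their assignments are invented for illustration.

```python
import sqlite3


def new_centroids(yv_rows, ynn_rows):
    """Average each coordinate l over the points assigned to cluster j."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
    con.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
    con.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")
    con.executemany("INSERT INTO YV VALUES (?, ?, ?)", yv_rows)
    con.executemany("INSERT INTO YNN VALUES (?, ?)", ynn_rows)
    con.execute(
        "INSERT INTO C "
        "SELECT l, j, AVG(YV.val) FROM YV, YNN "
        "WHERE YV.i = YNN.i GROUP BY l, j"
    )
    return con.execute("SELECT l, j, val FROM C ORDER BY j, l").fetchall()


# Hypothetical data: points (0, 0), (2, 2), (10, 10); the first two
# belong to cluster 1 and the third to cluster 2.
yv = [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0), (2, 2, 2.0),
      (3, 1, 10.0), (3, 2, 10.0)]
ynn = [(1, 1), (2, 1), (3, 2)]
centroids = new_centroids(yv, ynn)
print(centroids)
```

In a full iteration the driver would clear C before this INSERT so that old centers do not accumulate; that step is assumed here, not shown on the slide.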

Abusing SQL
Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i
and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
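
The variance query is a three-way join, which is easier to follow on a small input. A sketch assuming SQLite via Python's `sqlite3`, with the square written as a product and invented sample data:

```python
import sqlite3


def cluster_variances(yv_rows, ynn_rows, c_rows):
    """Per-cluster, per-dimension mean squared deviation from the centroid."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
    con.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
    con.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")
    con.execute("CREATE TABLE R (l INTEGER, j INTEGER, val REAL)")
    con.executemany("INSERT INTO YV VALUES (?, ?, ?)", yv_rows)
    con.executemany("INSERT INTO YNN VALUES (?, ?)", ynn_rows)
    con.executemany("INSERT INTO C VALUES (?, ?, ?)", c_rows)
    con.execute(
        "INSERT INTO R "
        "SELECT C.l, C.j, AVG((YV.val - C.val) * (YV.val - C.val)) "
        "FROM C, YV, YNN "
        "WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j "
        "GROUP BY C.l, C.j"
    )
    return con.execute("SELECT l, j, val FROM R ORDER BY j, l").fetchall()


# Hypothetical data: points (0, 0) and (2, 2) in cluster 1 with centroid
# (1, 1); point (10, 10) alone in cluster 2 with centroid (10, 10).
yv = [(1, 1, 0.0), (1, 2, 0.0), (2, 1, 2.0), (2, 2, 2.0),
      (3, 1, 10.0), (3, 2, 10.0)]
ynn = [(1, 1), (2, 1), (3, 2)]
c = [(1, 1, 1.0), (2, 1, 1.0), (1, 2, 10.0), (2, 2, 10.0)]
variances = cluster_variances(yv, ynn, c)
print(variances)
```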

Abusing SQL
Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i
and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;

Abusing SQL

Let’s not do that again!

Painful by Design

Why are predictive analytics so
hard to express in SQL?

Painful by Design
#1: No Arrays

Sets (rows), Tuples (columns), Arrays

Painful by Design
#2: Relational Algebra Sucks

Projection
Selection
Rename
Natural Join: R ⋈ S
Semijoin: R ⋉ S
Antijoin: R ▷ S
Division: R ÷ S
Theta Join
Left outer join: R ⟕ S
Right outer join: R ⟖ S
Full outer join: R ⟗ S
Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)

Iteration, Recursion, Multiple Dimensions

Database Extensions

There’s GOT to be a better way!

Database Extensions

C Extension

Database Extensions

UDF (User-Defined Function), the “Map” side:
map(a)
op2(a,b)

UDA (User-Defined Aggregate), the “Reduce” side:
init(a)
accum(a, b)
merge(a, b)
final(a)
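
The four UDA callbacks can be illustrated outside any database. The sketch below is plain Python, not a real extension API: the class `MeanAggregate`, the helper `run_aggregate`, and the argument-free `init()` are assumptions made for the example. It computes a mean by accumulating per-segment partial states and merging them, which is the shape a parallel database would need.

```python
from functools import reduce


class MeanAggregate:
    """Hypothetical aggregate with init / accum / merge / final steps."""

    @staticmethod
    def init():
        # State is (running total, row count).
        return (0.0, 0)

    @staticmethod
    def accum(state, value):
        total, count = state
        return (total + value, count + 1)

    @staticmethod
    def merge(a, b):
        # Combine two partial states computed on different segments.
        return (a[0] + b[0], a[1] + b[1])

    @staticmethod
    def final(state):
        total, count = state
        return total / count if count else None


def run_aggregate(agg, segments):
    """Accumulate each segment independently, then merge and finalize."""
    partials = [reduce(agg.accum, seg, agg.init()) for seg in segments]
    return agg.final(reduce(agg.merge, partials, agg.init()))


mean = run_aggregate(MeanAggregate, [[1.0, 2.0], [3.0], [4.0, 6.0]])
print(mean)
```

Keeping the state as (total, count) rather than a running mean is what makes `merge` trivial to write.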

MADlib

MADlib is an open-source library for
scalable in-database analytics.
It is implemented using database
extensions written in C, and is available
for PostgreSQL and Greenplum.

MADlib
1. Download the binary

Mac OS X
http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux
http://www.madlib.net/files/madlib-0.6-Linux.rpm

MADlib
2. Start the Installation

Mac OS X
Double-click on installer

Linux
yum install $MADLIB_PACKAGE --nogpgcheck

MADlib
3. Verify Locatability

Greenplum
source /path/to/greenplum/greenplum_path.sh

PostgreSQL
Make sure psql is in PATH

MADlib
4. Register MADlib

Greenplum
/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL
/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

MADlib
5. Test Installation

Greenplum
/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL
/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

MADlib
Clustering in MADlib

SELECT * FROM kmeans_random(
'rel_source', 'expr_point', k,
[ 'fn_dist', 'agg_centroid',
max_num_iterations,
min_frac_reassigned ]
);

MADlib

Ahhhhhh......

MADlib
Our Way or the Highway

Composability

Other Approaches

RDBMS Isn’t the
Only Game in Town!

Other Approaches
1. Embrace Coding

• Hadoop Ecosystem
• Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
• Spark

Other Approaches
2. Reject RDBMS

• Datalog + variants
• In theory, ideal for many kinds of predictive analytics
• Suffers from a lack of distributed, feature-complete implementations

Other Approaches
2. Reject RDBMS

• Rasdaman / RASQL
• Arrays but not analytics

Community Editions
http://www.rasdaman.org

Other Approaches
2. Reject RDBMS

• MonetDB / SciQL
• Array extension of SQL
• Poor analytics

Community Editions
http://www.monetdb.org

Other Approaches
2. Reject RDBMS

• SciDB / AFL (AQL)
• Excellent analytics
• Limited composability

Community Editions
http://www.scidb.org/forum/viewtopic.php?f=16&t=364/

Other Approaches
2. Reject RDBMS

• Precog / Quirrel (simple “R for big data”)
• Multidimensional, arrays + functions
• Still immature

Community Editions
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)

Summary

• Increase performance, reduce friction by doing more inside the database

• Not a panacea
• Hard to do in SQL
• Hard to do in C (but you may not have to: MADlib)
• Pre-canned & brittle in most databases

• Ultimately what’s needed is tech designed for advanced analytics

Q&A
John A. De Goes
@jdegoes, john@precog.com

References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)
