PyMADlib - A Python wrapper for MADlib: in-database, parallel, machine learning library
Srivatsan Ramanujam
Senior Data Scientist
Greenplum

© Copyright 2011 EMC Corporation. All rights reserved.

Agenda
• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ

Greenplum Overview

Products

Greenplum Database - Architecture
MPP (Massively Parallel Processing), Shared-Nothing Architecture
• Master Servers: query planning & dispatch (SQL, MapReduce)
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
MADlib

MADlib: The Origin

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions
from UW Madison, UFlorida and UC Berkeley.

Current Modules
Data Modeling

Supervised Learning
• Naive Bayes Classification
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machines
• Cox-Proportional Hazards Regression
• Conditional Random Field

Unsupervised Learning
• Association Rules
• k-Means Clustering
• Low-rank Matrix Factorization
• SVD Matrix Factorization
• Parallel Latent Dirichlet Allocation

Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile

Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling

Inferential Statistics
• Hypothesis tests
MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net

How does it work? A Linear Regression Example
• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

• The y column is the vector of dependent variables y
• The x1 and x2 columns form the design matrix X
Reminder: Linear-Regression Model
• $y = Xc + \varepsilon$, where $\varepsilon$ is the vector of residuals
• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c)

$\|y - Xc\|^2 = (y - Xc)^T (y - Xc)$

yield the minimizer

$\hat{c} = (X^T X)^{-1} X^T y$
Linear Regression: Streaming Algorithm
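• How to compute with a single table scan?
– The minimizer $\hat{c} = (X^T X)^{-1} X^T y$ needs only $X^T X$ and $X^T y$
– Both are sums over rows, $X^T X = \sum_i x_i x_i^T$ and $X^T y = \sum_i x_i y_i$ (with $x_i$ the i-th row of X as a column vector), so one pass over the table can accumulate them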
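A minimal sketch of the single-scan idea in plain Python, not MADlib's implementation. The names `transition`, `solve`, and `linregr` are made up for illustration; the rows are the six sample rows from the `unm` query above, with a constant 1.0 prepended so that c0 becomes the intercept.

```python
# Sketch only: ordinary least squares with one pass over the rows.
# Function names are illustrative and are not part of MADlib.

def transition(state, y, x):
    """Fold one row into the running sums X^T X and X^T y."""
    xtx, xty = state
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj
    return state


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the accumulated sums stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * c[k] for k in range(r + 1, n))
        c[r] = (m[r][n] - tail) / m[r][r]
    return c


def linregr(rows, width):
    """Single scan: accumulate the sums, then solve the normal equations."""
    state = ([[0.0] * width for _ in range(width)], [0.0] * width)
    for y, x in rows:
        state = transition(state, y, x)
    xtx, xty = state
    return solve(xtx, xty)


# (y, [1, x1, x2]) for each sample row; the leading 1.0 is the intercept column.
rows = [
    (10.14, [1.0, 0.00, 0.3]),
    (11.93, [1.0, 0.69, 0.6]),
    (13.57, [1.0, 1.10, 0.9]),
    (14.17, [1.0, 1.39, 1.2]),
    (15.25, [1.0, 1.61, 1.5]),
    (16.15, [1.0, 1.79, 1.8]),
]

# Passing an iterator shows that the rows are consumed exactly once.
coef = linregr(iter(rows), width=3)
print(coef)  # [c0, c1, c2]
```

The running state is only a k-by-k matrix and a k-vector, so memory use does not grow with the number of rows.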
Linear Regression: Parallel Computation
• Each segment computes the partial product over its own rows
– Segment 1: $X_1^T y_1$
– Segment 2: $X_2^T y_2$
• The master adds the partial results: $X^T y = X_1^T y_1 + X_2^T y_2$
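The split across segments can be sketched the same way. This is an illustration in plain Python under stated assumptions: `partial_sums` stands in for the work of one segment and `merge` for the master, neither is a MADlib function, and the assignment of rows to two segments is arbitrary.

```python
# Sketch only: per-segment partial sums that the master adds together.
# Function names and the two-segment layout are hypothetical.

def partial_sums(rows, width):
    """What one segment computes from its local rows: (X_k^T X_k, X_k^T y_k)."""
    xtx = [[0.0] * width for _ in range(width)]
    xty = [0.0] * width
    for y, x in rows:
        for i in range(width):
            xty[i] += x[i] * y
            for j in range(width):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """What the master does: add two partial states element-wise."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


# Sample rows as (y, [1, x1, x2]), split arbitrarily across two segments.
segment_1 = [
    (10.14, [1.0, 0.00, 0.3]),
    (11.93, [1.0, 0.69, 0.6]),
    (13.57, [1.0, 1.10, 0.9]),
]
segment_2 = [
    (14.17, [1.0, 1.39, 1.2]),
    (15.25, [1.0, 1.61, 1.5]),
    (16.15, [1.0, 1.79, 1.8]),
]

merged = merge(partial_sums(segment_1, 3), partial_sums(segment_2, 3))
single = partial_sums(segment_1 + segment_2, 3)
print(merged[1])  # X^T y assembled from the two segments
print(single[1])  # X^T y from one scan over all rows
```

Because addition is associative, the merged state matches the single-scan state up to floating-point rounding, regardless of how rows are distributed.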
Performance Comparison : Test Setup on AWB
• AWB (Analytics Workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of Memory, and 24 PB of raw disk
storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7
• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations
Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression
• Mahout only has a sequential (i.e. single-node) IGD (incremental gradient descent) implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 700. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib].]
Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments, 0 to 300; y-axis: Time in Minutes, 0 to 18.]
K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Min, 0 to 350. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib].]
K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments, 0 to 300; y-axis: Time in Min, 0 to 10.]
PyMADlib : Python + MADlib = Awesome!

Motivation
• SQL is great for many things, but it’s not nearly enough

• Undeniably the most straightforward way to query data

• But not necessarily designed for data science

MADlib is a godsend!
• Empowers data scientists to run canned machine learning
routines – focus less on coding, more on science
• In-database, explicitly parallel.

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries

Then which interface is favored by and familiar
to data scientists?

• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R,
PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$

The challenge
• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the
usability of languages like Python, SAS, and R
Simple solution: Translate Python code into SQL

[Figure: Python → SQL over ODBC/JDBC. SQL to execute MADlib goes to the database; model output comes back.]

• All data stays in DB and all model estimation and heavy lifting done in DB by
MADlib

• Only strings of SQL and model output transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich set of
visualizations of Matplotlib, NetworkX and all your other favorite Python
libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL
database, while you program in your favorite language, Python.
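A rough sketch of the translation step, not PyMADlib's actual code. The helper `linregr_sql` is hypothetical, and the generated statement assumes the `madlib.linregr(y, x)` aggregate form of early MADlib releases; the function name and schema in your installation may differ. Only the resulting string would travel over ODBC/JDBC; no database connection is made here.

```python
import re

# Plain identifiers only, so that table and column names cannot smuggle in SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENT.match(name):
        raise ValueError("not a plain SQL identifier: %r" % (name,))
    return name


def linregr_sql(table, dependent, independents, schema="madlib"):
    """Build the SQL text for a linear regression run inside the database.

    Hypothetical helper for illustration; assumes an aggregate named
    <schema>.linregr(y, x) is installed.
    """
    # The constant 1 adds the intercept term c0 to the design matrix.
    features = ", ".join(["1"] + [_ident(c) for c in independents])
    return "SELECT ({schema}.linregr({y}, ARRAY[{x}])).* FROM {table};".format(
        schema=_ident(schema),
        y=_ident(dependent),
        x=features,
        table=_ident(table),
    )


sql = linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
```

The data never leaves the database: the client ships a short SQL string and receives only the fitted model back, which it can then hand to Matplotlib or any other Python library.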

Demo

PyMADlib Tutorial –
IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it?

$ pip install pymadlib
I don’t have GPDB or MADlib – What do I do?
• Greenplum Database Community Edition is freely
available for single node installations on multiple
platforms
– Written permission may be requested from EMC/Greenplum
for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from
https://github.com/madlib/madlib

• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key
technological advantages of industry-leading Greenplum
Database with scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes
+ range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60-node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/
HAWQ: Deep Scalable Analytics
What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means

• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also
supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?
@being_bayesian
vatsan.cs@utexas.edu
https://github.com/vatsan/pymadlib

Appendix

Datasets
The following datasets were used in comparing the performance of
MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Último (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

PyMADlib: a Python wrapper for MADlib, the in-database, parallel machine learning library.

  • 1. Srivatsan Ramanujam, Senior Data Scientist, Greenplum. © Copyright 2011 EMC Corporation. All rights reserved.
  • 2. Agenda • Greenplum UAP overview – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance – GPDB Architecture • MADlib – Overview – Algorithms – Working Mechanism – Performance Comparison with Mahout • PyMADlib – Overview – Demo in IPython Notebook • Future Directions – GPHD and HAWQ
  • 3. Greenplum Overview
  • 4. Products
  • 5. Greenplum Database - Architecture: MPP (Massively Parallel Processing), shared-nothing architecture. [Diagram labels: SQL, MapReduce; master servers (query planning & dispatch); network interconnect; segment servers (query processing & data storage); external sources (loading, streaming, etc.)]
  • 6. MADlib
  • 7. MADlib: The Origin UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills. • First mention of MAD analytics was at VLDB’09 – MAD Skills: New Analysis Practices for Big Data – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf • MADlib project initiated in late 2010 – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.
  • 8. Current Modules • Data Modeling – Supervised Learning: Naive Bayes Classification, Linear Regression, Logistic Regression, Multinomial Logistic Regression, Decision Tree, Random Forest, Support Vector Machines, Cox-Proportional Hazards Regression, Conditional Random Field – Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, SVD Matrix Factorization, Parallel Latent Dirichlet Allocation • Descriptive Statistics – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values) – Profile – Quantile • Support – Array Operations – Conjugate Gradient – Sparse Vectors – Probability Functions – Random Sampling • Inferential Statistics – Hypothesis tests
  • 9. MADlib – User Doc • Check out the user guide with examples at: http://doc.madlib.net
  • 10. How does it work? A Linear Regression Example • Finding linear dependencies between variables – y ≈ c0 + c1 · x1 + c2 · x2 ? • Example data (the y column is the vector of dependent variables y; the x1 and x2 columns form the design matrix X):
        # select y, x1, x2 from unm limit 6;
           y   |  x1  | x2
        -------+------+-----
         10.14 |    0 | 0.3
         11.93 | 0.69 | 0.6
         13.57 |  1.1 | 0.9
         14.17 | 1.39 | 1.2
         15.25 | 1.61 | 1.5
         16.15 | 1.79 | 1.8
  • 11. Reminder: Linear-Regression Model • [Model equation not captured in the transcript] • If the residuals are i.i.d. Gaussian with standard deviation σ: – maximum likelihood ⇔ minimum sum of squared residuals • First-order conditions for the following quadratic objective (in c) yield the minimizer [objective and minimizer equations not captured in the transcript]
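
The equations on this slide did not survive in the transcript. For reference, the standard least-squares derivation it points to can be written out as below. This is a textbook reconstruction, not necessarily the author's exact formulas; the symbols $X$ (design matrix), $y$ (dependent variables) and $c$ (coefficients) follow the neighbouring slides, while $\varepsilon$ and $f$ are names assumed here.

```latex
\begin{align}
y &= Xc + \varepsilon, \qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \\
f(c) &= \lVert y - Xc \rVert_2^2 = \sum_i \bigl(y_i - x_i^T c\bigr)^2 \\
\nabla f(c) &= -2\, X^T (y - Xc) = 0 \;\Longrightarrow\; X^T X\, c = X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y \qquad \text{(assuming } X^T X \text{ is invertible)}
\end{align}
```

The last line is the expression that the streaming and parallel slides then set out to compute.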
  • 12. Linear Regression: Streaming Algorithm • How to compute $(X^T X)^{-1} X^T y$ with a single table scan? [Diagram: the two factors $X^T X$ and $X^T y$, each built from $X$ and $y$]
  • 13. Linear Regression: Parallel Computation of $X^T y$ [Diagram: Segment 1 holds $X_1^T y_1$, Segment 2 holds $X_2^T y_2$, and the master holds $X^T y$]
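
The single-scan, per-segment idea on the two slides above can be sketched in plain Python. This is a minimal illustration, not MADlib's implementation: the class and function names (`NormalEqState`, `fit_segment`, `solve`) and the split of the six example rows into two "segments" are assumptions made for the example. Each segment folds its rows into running sums of $X^T X$ and $X^T y$, the sums from different segments are added, and the small system is solved once at the end.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, Sequence[float]]  # (y, [x1, x2, ...])


class NormalEqState:
    """Running sums X^T X and X^T y for one segment (intercept column included)."""

    def __init__(self, n_features: int) -> None:
        k = n_features + 1  # +1 for the constant term c0
        self.xtx = [[0.0] * k for _ in range(k)]
        self.xty = [0.0] * k

    def update(self, y: float, x: Sequence[float]) -> None:
        # Fold one row into the sums; the row itself is not kept afterwards.
        xi = [1.0, *x]
        for a, xa in enumerate(xi):
            self.xty[a] += xa * y
            for b, xb in enumerate(xi):
                self.xtx[a][b] += xa * xb

    def merge(self, other: "NormalEqState") -> "NormalEqState":
        # Sums computed on different segments simply add.
        k = len(self.xty)
        out = NormalEqState(k - 1)
        out.xtx = [[self.xtx[a][b] + other.xtx[a][b] for b in range(k)] for a in range(k)]
        out.xty = [self.xty[a] + other.xty[a] for a in range(k)]
        return out


def fit_segment(rows: Sequence[Row], n_features: int) -> NormalEqState:
    """One pass over the rows that live on a single segment."""
    state = NormalEqState(n_features)
    for y, x in rows:
        state.update(y, x)
    return state


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= f * m[col][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (m[i][n] - sum(m[i][j] * c[j] for j in range(i + 1, n))) / m[i][i]
    return c


# The six rows from the linear-regression example slide.
rows: List[Row] = [
    (10.14, (0.0, 0.3)),
    (11.93, (0.69, 0.6)),
    (13.57, (1.1, 0.9)),
    (14.17, (1.39, 1.2)),
    (15.25, (1.61, 1.5)),
    (16.15, (1.79, 1.8)),
]

# Hypothetical placement of the rows on two segments.
segment1, segment2 = rows[:3], rows[3:]
merged = fit_segment(segment1, 2).merge(fit_segment(segment2, 2))
coef = solve(merged.xtx, merged.xty)  # [c0, c1, c2]
print(coef)
```

The state that travels from a segment to the master is only a (k+1)×(k+1) matrix and a (k+1)-vector, independent of the number of rows, which is why one table scan per segment is enough.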
  • 14. Performance Comparison: Test Setup on AWB • AWB – 1000-node cluster located in Las Vegas – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage – 8000+ Map Task Capacity, 5000+ Reduce Task Capacity – GPHD 1.1, GPDB 4.2.3 • Mahout v0.7 • MADlib v0.5 – With small LMF change to allow 4-byte integer values • Test matrix – Data size (# rows/records, # columns/features) – Algorithms – Algorithm parameters (e.g. convergence threshold, # iterations) – GPDB segment / MR (Map-Reduce) task configurations
  • 15. Performance & Scalability Results (summary) • Whitepaper coming out shortly!
  • 16. Logistic Regression • Mahout only has sequential (i.e. single node) IGD implementation. [Chart: "MADlib & Mahout Logistic Regression Scalability Across Number of Attributes"; series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]; x-axis: log(Number of Rows); y-axis: Time in Minutes]
  • 17. Logistic Regression [Chart: "MADlib Scalability Across Number of GPDB Segments"; x-axis: Number of GPDB Segments; y-axis: Time in Minutes]
  • 18. K-Means Clustering [Chart: "MADlib & Mahout K-means Scalability Across Number of Rows"; series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]; x-axis: log(Number of Rows); y-axis: Time in Min]
  • 19. K-Means Clustering [Chart: "MADlib K-means Scalability Across Number of GPDB Segments"; x-axis: Number of GPDB Segments; y-axis: Time in Min]
  • 20. PyMADlib : Python + MADlib = Awesome!
  • 21. Motivation • SQL is great for many things, but it’s not nearly enough • Undeniably the most straightforward way to query data • But not necessarily designed for data science
  • 22. MADlib is a godsend! • Empowers data scientists to run canned machine learning routines – focus less on coding, more on science • In-database, explicitly parallel. • So why do we need anything else? – UI is still all in SQL – Need to tap into rich visualization libraries
  • 23. Then which interface is favored by and familiar to data scientists? • Depends on who you ask • Left survey is for “higher level languages,” and right survey is for “lower level languages”
  • 24. Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)? • PL/X’s are wonderful, but: – It still requires non-trivial knowledge of SQL to use effectively – Mostly limited to explicitly parallel jobs – Primarily a SQL interface to the end user • Need an interface that is: – Less SQL, more R/Python/SAS – Implicitly parallelized – More scalable • SAS HPA = $$$$$
  • 25. The challenge • MADlib – Open source – Extremely powerful/scalable – Growing algorithm breadth – SQL • Python/R – Open source – Memory limited – High algorithm breadth – Language/interface purpose-designed for data science • SAS – High user loyalty – Non-HPA is memory limited, HPA requires investment – High algorithm breadth – Language/interface purpose-designed for data science • Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R
  • 26. Simple solution: Translate Python code into SQL [Diagram: Python → SQL; the SQL to execute MADlib and the model output cross ODBC/JDBC] • All data stays in the DB, and all model estimation and heavy lifting is done in the DB by MADlib • Only strings of SQL and model output are transferred across ODBC/JDBC • Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language: Python.
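
The "translate Python into SQL" idea on the slide above can be illustrated with a small, hypothetical helper. It is not PyMADlib's actual API: `build_linregr_sql` and `_ident` are names invented for this sketch, and the SQL it emits assumes the database exposes an aggregate called `madlib.linregr(y, x[])`, which should be checked against the documentation of the MADlib version in use. The table and column names come from the linear-regression example slide.

```python
import re
from typing import Sequence

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name: str) -> str:
    """Allow only plain SQL identifiers, since they are spliced into the query text."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_linregr_sql(table: str, dep: str, indep: Sequence[str]) -> str:
    """Hypothetical helper: render a Python-level request as one SQL string.

    Assumption: an aggregate named madlib.linregr(y, x[]) exists in the database.
    """
    # A leading constant 1 in the feature array stands for the intercept term.
    features = ", ".join(["1"] + [_ident(c) for c in indep])
    return (
        f"SELECT (madlib.linregr({_ident(dep)}, ARRAY[{features}])).* "
        f"FROM {_ident(table)};"
    )


sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)  # SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

In a real wrapper this string would be handed to a database connection, and only the small result row (the fitted model) would come back to Python, which is the point the slide makes about what crosses ODBC/JDBC.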
  • 27. Demo PyMADlib Tutorial – IPython Notebook Viewer Link http://nbviewer.ipython.org/5275846
  • 28. Where do I get it? $ pip install pymadlib
  • 29. I don’t have GPDB or MADlib. What do I do? • Greenplum Database Community Edition is freely available for single-node installations on multiple platforms – Written permission may be requested from EMC/Greenplum for research use for multi-node installations • MADlib is free and open-source – Downloadable for multiple platforms from https://github.com/madlib/madlib • PyMADlib is also free and open-source – Downloadable from https://github.com/vatsan/pymadlib
  • 30. Future Directions
  • 31. Greenplum HD • HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop • SQL Standards Compliant – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions • ACID Compliant
  • 32. HAWQ – Architecture
  • 33. Performance: HAWQ [1] vs. Hive vs. Impala [2]. All experiments were run on a 60-node deployment with Analytics Workbench [3]. [1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf [2] https://github.com/cloudera/impala/ [3] http://www.analyticsworkbench.com/
  • 34. HAWQ: Deep Scalable Analytics What’s inside the box? • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC. • Most tools will work out of the box with HAWQ, including PyMADlib
  • 36. Appendix
  • 37. Datasets The following datasets were used in comparing the performance of MADlib with Mahout: – KDD Cup 2009 Orange marketing churn data (16.5 MB): about 500,000 records and 15,000 numerical and categorical attributes – Census 2000 data (1.7 GB): about 14 million records and 48 numerical and categorical attributes – Enron data (1.9 GB): about 700,000 documents with a vocabulary size of 200,000 – KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB): about 1 million users, 600,000 songs, and 250 million ratings – Netflix Prize 2009 data (52.7 MB): about 400,000 users, 900 movies, and 4.5 million ratings

Editor's Notes

  1. Special thanks to Grace Gee (Engineer, SOAR Program, Greenplum)