BUILT FOR THE SPEED OF BUSINESS
Python Powered Data
Science at Pivotal
How do we use the PyData stack in real
engagements?
Srivatsan Ramanujam, @being_bayesian
Ian Huston, @ianhuston
Data Scientists, Pivotal

© Copyright 2013 Pivotal. All rights reserved.

Agenda
•  Introduction to Pivotal
•  Pivotal Data Science Toolkit
•  PL/Python
•  In-database Machine Learning with MADlib
•  Live Demos
   –  Topic Analysis and Sentiment Analysis Engine
   –  Traffic Disruption Prediction

Pivotal Platform Stack
[Figure: Pivotal platform stack. Layers shown: Data-Driven Application Development; Pivotal Data Science Labs; Cloud Application Platform; Data & Analytics Platform; Virtualization; Cloud Storage.]

What do our customers look like?
•  Large enterprises with lots of data collected
   –  Work with 10s of TBs to PBs of data, structured & unstructured

•  Not able to get what they want out of their data
   –  Old legacy systems with high cost and no flexibility
   –  Response times are too slow for interactive data analysis
   –  Can only deal with small samples of data locally

•  They want to transform into data-driven enterprises

MPP Architectural Overview

Think of it as multiple PostgreSQL servers: a Master plus Segments/Workers.
Rows are distributed across segments by a particular field (or randomly).
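
The row placement described above can be sketched in a few lines of Python. This is only an illustration: the segment count, the use of CRC32 as the hash, and the sample rows are assumptions, not Greenplum's actual distribution function.

```python
import random
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumed cluster size, for illustration only


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment from a hash of the distribution key (CRC32 is an assumed stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key=None, num_segments=NUM_SEGMENTS, seed=0):
    """Assign rows to segments by a field, or randomly when no field is given."""
    rng = random.Random(seed)
    segments = defaultdict(list)
    for row in rows:
        if key is None:
            seg = rng.randrange(num_segments)
        else:
            seg = segment_for(row[key], num_segments)
        segments[seg].append(row)
    return dict(segments)


# Hypothetical rows; customer_id plays the role of the distribution field.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(12)]
by_field = distribute(rows, key="customer_id")
at_random = distribute(rows)
```

Distributing by a field keeps all rows with the same key on one segment, which matters later when one model is built per group.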

Typical Engagement Tech Setup
•  Platform:
   –  Greenplum Analytics Database (GPDB)
   –  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop)

•  Open Source Options (http://gopivotal.com):
   –  Greenplum Community Edition
   –  Pivotal HD Community Edition (HAWQ not included)
   –  MADlib in-database machine learning library (http://madlib.net)

•  Where Python fits in:
   –  PL/Python running in-database, with nltk, scikit-learn etc.
   –  IPython for exploratory analysis
   –  Pandas, Matplotlib etc.

PIVOTAL DATA SCIENCE TOOLKIT

1  Find Data
   Platforms
   •  Greenplum DB
   •  Pivotal HD
   •  Hadoop (other)
   •  SAS HPA
   •  AWS

2  Write Code
   Editing Tools
   •  Vi/Vim
   •  Emacs
   •  Smultron
   •  TextWrangler
   •  Eclipse
   •  Notepad++
   •  IPython
   •  Sublime
   Languages
   •  SQL
   •  Bash scripting
   •  C
   •  C++
   •  C#
   •  Java
   •  Python
   •  R

3  Run Code
   Interfaces
   •  pgAdminIII
   •  psql
   •  psycopg2
   •  Terminal
   •  Cygwin
   •  Putty
   •  Winscp

4  Write Code for Big Data
   In-Database
   •  SQL
   •  PL/Python
   •  PL/Java
   •  PL/R
   •  PL/pgSQL
   Hadoop
   •  HAWQ
   •  Pig
   •  Hive
   •  Java

5  Implement Algorithms
   Libraries
   •  MADlib
   Java
   •  Mahout
   R
   •  (Too many to list!)
   Text
   •  OpenNLP
   •  NLTK
   •  GPText
   C++
   •  opencv
   Python
   •  NumPy
   •  SciPy
   •  scikit-learn
   •  Pandas
   Programs
   •  Alpine Miner
   •  Rstudio
   •  MATLAB
   •  SAS
   •  Stata

6  Show Results
   Visualization
   •  python-matplotlib
   •  python-networkx
   •  D3.js
   •  Tableau
   •  GraphViz
   •  Gephi
   •  R (ggplot2, lattice, shiny)
   •  Excel

7  Collaborate
   Sharing Tools
   •  Chorus
   •  Confluence
   •  Socialcast
   •  Github
   •  Google Drive & Hangouts

A large and varied tool box!

Data Parallelism
•  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
•  Examples:
   –  Measure the height of each student in a classroom (explicitly parallelizable by student)
   –  MapReduce
   –  map() function in Python
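
As a small illustration of the classroom example, the sketch below applies one function to every student independently, first with the built-in map() and then with a worker pool. The student records and the function name are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def measure_height_cm(student):
    """Stand-in for an independent per-student measurement."""
    return student["height_cm"]


# Hypothetical classroom data.
students = [
    {"name": "Ada", "height_cm": 162},
    {"name": "Grace", "height_cm": 170},
    {"name": "Linus", "height_cm": 181},
]

# Sequential: the built-in map() applies the function to each item independently.
sequential = list(map(measure_height_cm, students))

# Parallel: same function and data, spread over a pool of workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(measure_height_cm, students))
```

Because no task depends on another, the parallel version needs no coordination beyond collecting the results in order.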

User-Defined Functions (UDFs)
•  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
•  Simple UDFs are SQL queries with calling arguments and return types.

Definition:

    CREATE FUNCTION times2(INT)
    RETURNS INT
    AS $$
        SELECT 2 * $1
    $$ LANGUAGE sql;

Execution:

    SELECT times2(1);
     times2
    --------
          2
    (1 row)

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
•  Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
•  The interpreter/VM of the language 'X' is installed on each node of the Greenplum Database cluster
•  Data Parallelism:
   -  PL/X piggybacks on Greenplum's MPP architecture

[Figure: SQL arrives at the Master Host (with a Standby Master), which is linked through the Interconnect to several Segment Hosts, each running multiple Segments.]

Intro to PL/Python
•  Procedural languages need to be installed on each database where they are used.
•  The name in SQL is plpythonu; the 'u' means untrusted, so you need to be a superuser to install it.
•  The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, think of it as a SQL User Defined Function with Python inside.

    CREATE FUNCTION pymax (a integer, b integer)   -- SQL wrapper
      RETURNS integer
    AS $$
      # normal Python
      if a > b:
        return a
      return b
    $$ LANGUAGE plpythonu;                         -- SQL wrapper

Returning Results
•  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
•  Composite types can be returned by creating a composite type in the database:

    CREATE TYPE named_value AS (
      name   text,
      value  integer
    );

•  Then you can return a list, tuple or dict (not a set) whose structure matches the composite type:

    CREATE FUNCTION make_pair (name text, value integer)
      RETURNS named_value
    AS $$
      return [ name, value ]
      # or alternatively, as tuple: return ( name, value )
      # or as dict: return { "name": name, "value": value }
      # or as an object with attributes .name and .value
    $$ LANGUAGE plpythonu;

•  For functions which return multiple rows, prefix "setof" before the return type

Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:

Sequence:

    CREATE FUNCTION make_pair (name text)
      RETURNS SETOF named_value
    AS $$
      return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
    $$ LANGUAGE plpythonu;

Generator:

    CREATE FUNCTION make_pair (name text)
      RETURNS SETOF named_value AS $$
      for i in range(3):
        yield (name, i)
    $$ LANGUAGE plpythonu;

Accessing Packages
•  On Greenplum DB: to be available, packages must be installed on the individual segment nodes.
   –  Can use the "parallel ssh" tool gpssh to conda/pip install
   –  Currently Greenplum DB ships with Python 2.6 (!)

•  Then just import as usual inside the function:

    CREATE FUNCTION make_pair (name text)
      RETURNS SETOF named_value
    AS $$
      import numpy as np
      return ((name, int(i)) for i in np.arange(3))
    $$ LANGUAGE plpythonu;


Benefits of PL/Python
•  Easy to bring your code to the data.
•  When SQL falls short, leverage your Python (or R/Java/C) experience quickly.
•  Apply Python across terabytes of data with minimal overhead or additional requirements.
•  Results are already in the database system, ready for further analysis or storage.

Going Beyond Data Parallelism
•  Data-parallel computation via PL/Python libraries only allows us to run 'n' models in parallel.
•  This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.
•  For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
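
To make the "one model per group" case concrete, here is a minimal sketch in plain Python. The rows, group keys and the fit_line helper are hypothetical; in the database the grouping would come from GROUP BY and each fit would run inside a PL/Python function on the segment holding that group.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least-squares fit y = intercept + slope * x for one group."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical rows: (group key, x, y)
rows = [
    ("north", 0.0, 1.0), ("north", 1.0, 3.0), ("north", 2.0, 5.0),
    ("south", 0.0, 4.0), ("south", 1.0, 3.0), ("south", 2.0, 2.0),
]

groups = defaultdict(list)
for key, x, y in rows:
    groups[key].append((x, y))

# One independent model per group: no fit needs data from another group.
models = {key: fit_line(points) for key, points in groups.items()}
```

Each fit only ever sees its own group's rows, which is why this pattern parallelizes trivially but cannot produce a single model over all the data.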

MADlib: The Origin
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills

•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein

•  Open Source! https://github.com/madlib/madlib
•  Works on Greenplum DB and PostgreSQL
•  Active development by Pivotal
   -  Latest Release : v1.3 (Oct 2013)
•  Downloads and Docs: http://madlib.net/

MADlib Executes Algorithms In-Place

MADlib Advantages
•  No Data Movement
•  Use MPP architecture's full compute power
•  Use MPP architecture's entire memory to process data sets

[Figure: a MADlib User issues SQL to the Master Processor, which sends SQL to the Segment Processors; boxes labelled M mark where MADlib runs.]

MADlib In-Database Functions

Predictive Modeling Library

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Linear Systems
•  Sparse and Dense Solvers

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation
Summary

Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

Architecture
[Figure: MADlib architecture layers, top to bottom]
•  User Interface
•  "Driver" Functions (outer loops of iterative algorithms, optimizer invocations)
•  High-level Abstraction Layer (iteration controller, ...)
•  RDBMS Built-in Functions
•  Functions for Inner Loops (for streaming algorithms)
•  Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
•  RDBMS Query Processing (Greenplum, PostgreSQL, …)
Implementation labels shown beside the layers: SQL, generated from specification; Python with templated SQL; Python; C++.

How does it work? A Linear Regression Example
•  Finding linear dependencies between variables
   –  y ≈ c0 + c1 · x1 + c2 · x2 ?

    # select y, x1, x2 from unm limit 6;
       y   |  x1  | x2
    -------+------+-----
     10.14 |    0 | 0.3
     11.93 | 0.69 | 0.6
     13.57 |  1.1 | 0.9
     14.17 | 1.39 | 1.2
     15.25 | 1.61 | 1.5
     16.15 | 1.79 | 1.8

The y column is the vector of dependent variables y; the x1 and x2 columns form the design matrix X.

Reminder: Linear-Regression Model
•  [Model equation shown as an image on the slide; not recovered.]
•  If the residuals are i.i.d. Gaussian with standard deviation σ:
   –  max likelihood ⇔ min sum of squared residuals

   $f(y \mid x) \propto \exp\left(-\frac{1}{2\sigma^2}\,(y - x^T c)^2\right)$

•  First-order conditions for the following quadratic objective (in c) yield the minimizer
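
The objective and its minimizer appeared as images on the slide and are missing from the extracted text. Assuming the standard least-squares setup, with the rows $x_i^T$ stacked into the design matrix $X$, they would read:

```latex
\min_{c} \sum_{i=1}^{n} \left(y_i - x_i^T c\right)^2
\quad\Longrightarrow\quad
X^T X \, \hat{c} = X^T y
\quad\Longrightarrow\quad
\hat{c} = (X^T X)^{-1} X^T y
```

The middle expression is the first-order condition (the normal equations); the last step assumes $X^T X$ is invertible.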

Linear Regression: Streaming Algorithm
•  How to compute $(X^T X)^{-1} X^T y$ with a single table scan?

[Figure: the two products accumulated during the scan, $X^T X$ (from $X^T$ and $X$) and $X^T y$ (from $X^T$ and $y$).]

Linear Regression: Parallel Computation

$X^T y = \begin{bmatrix} X_1^T & X_2^T \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \sum_i X_i^T y_i$

[Figure: Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master adds them to obtain $X^T y$.]
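
The single-scan, per-segment computation can be sketched in plain Python. This is an illustration of the idea, not MADlib's implementation: each "segment" accumulates its own X^T X and X^T y in one pass, the "master" adds the partial sums, and the normal equations are solved once. The data are synthetic (generated from y = 1 + 2*x1 + 3*x2) and the function names are made up.

```python
def partial_sums(rows):
    """One pass over a segment's rows: accumulate X^T X and X^T y."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: add the partial sums from two segments."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X^T X) c = X^T y by Gaussian elimination with partial pivoting."""
    k = len(xty)
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, k))
        coeffs[r] = (aug[r][k] - tail) / aug[r][r]
    return coeffs


# Synthetic rows from y = 1 + 2*x1 + 3*x2; the leading 1.0 is the intercept column.
data = [([1.0, x1, x2], 1.0 + 2.0 * x1 + 3.0 * x2)
        for x1, x2 in [(0, 1), (1, 0), (2, 3), (3, 1), (4, 4), (5, 2)]]

segment_1, segment_2 = data[:3], data[3:]
xtx, xty = merge(partial_sums(segment_1), partial_sums(segment_2))
coefficients = solve(xtx, xty)
```

The partial sums have a fixed size (k × k and k) regardless of how many rows a segment holds, so only small matrices travel to the master.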

Demos
•  We built demos to showcase our technology pipeline, using Python technology.
•  Two use cases:
   –  Topic and Sentiment Analysis of Tweets
   –  London Road Traffic Disruption prediction

Topic and Sentiment Analysis Pipeline

[Figure: pipeline diagram. Stages shown:]
•  Tweet Stream
•  Stored on HDFS
•  Loaded as external tables into GPDB (gpfdist)
•  Parallel Parsing of JSON and extraction of fields using PL/Python
•  Topic Analysis through MADlib pLDA
•  Sentiment Analysis through custom PL/Python functions
•  D3.js
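
The JSON parsing stage might look roughly like the function body below. This is a sketch only: the demo's actual code is not in the slides, the function name is hypothetical, and the field names (id, created_at, user.screen_name, text) are assumed from the usual shape of tweet JSON.

```python
import json


def extract_fields(raw_json):
    """Parse one tweet and return (id, created_at, screen_name, text), or None if malformed."""
    try:
        tweet = json.loads(raw_json)
    except ValueError:
        return None
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user") or {}
    return (
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("screen_name"),
        tweet.get("text"),
    )


# Made-up sample record for illustration.
sample = json.dumps({
    "id": 1,
    "created_at": "Mon Oct 28 10:00:00 +0000 2013",
    "user": {"screen_name": "example_user"},
    "text": "PyData is great",
})
row = extract_fields(sample)
```

Wrapped in a PL/Python function, a body like this would run on every segment against the rows stored there, which is what makes the parsing parallel.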

Transport Disruption Prediction Pipeline

[Figure: pipeline diagram. Stages shown:]
•  Transport for London Traffic Disruption feed
•  Pivotal Greenplum Database
   –  Deduplication
   –  Feature Creation
   –  Modelling & Machine Learning
•  d3.js & NVD3: interactive SVG figures

Pivotal’s Open Source Contributions
http://gopivotal.com/pivotal-products/open-source-software

•  PyMADlib – Python Wrapper for MADlib
-  https://github.com/gopivotal/pymadlib

•  PivotalR – R wrapper for MADlib
-  https://github.com/madlib-internal/PivotalR

•  Part-of-speech tagger for Twitter via SQL
-  http://vatsan.github.io/gp-ark-tweet-nlp/

Questions?
@being_bayesian
@ianhuston
© Copyright 2013 Pivotal. All rights reserved.

29
BUILT FOR THE SPEED OF BUSINESS

More Related Content

What's hot

Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...PyData
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRPivotalOpenSourceHub
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with HadoopSangchul Song
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library EMC
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 

What's hot (20)

Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with Hadoop
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 

Viewers also liked

Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National ParkClimate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National ParkSrivatsan Ramanujam
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
 
Germin8 - Social Media Analytics
Germin8 - Social Media AnalyticsGermin8 - Social Media Analytics
Germin8 - Social Media AnalyticsGermin8
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
R server and spark
R server and sparkR server and spark
R server and sparkBAINIDA
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computingBAINIDA
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 

Viewers also liked (7)

Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National ParkClimate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Germin8 - Social Media Analytics
Germin8 - Social Media AnalyticsGermin8 - Social Media Analytics
Germin8 - Social Media Analytics
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
R server and spark
R server and sparkR server and spark
R server and spark
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 

Similar to Python Powered Data Science at Pivotal (PyData 2013)

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonPyData
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesIan Huston
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To PythonVanessa Rene
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonWaternomics
 
Researh toolbox - Data analysis with python
Researh toolbox  - Data analysis with pythonResearh toolbox  - Data analysis with python
Researh toolbox - Data analysis with pythonUmair ul Hassan
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemRob Vesse
 
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724sdeeg
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big DataDataWorks Summit
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Mobcoder
 

Similar to Python Powered Data Science at Pivotal (PyData 2013) (20)

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To Python
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
 
Researh toolbox - Data analysis with python
Researh toolbox  - Data analysis with pythonResearh toolbox  - Data analysis with python
Researh toolbox - Data analysis with python
 
Quadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystemQuadrupling your elephants - RDF and the Hadoop ecosystem
Quadrupling your elephants - RDF and the Hadoop ecosystem
 
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
 


Python Powered Data Science at Pivotal (PyData 2013)

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. Python Powered Data Science at Pivotal
      How do we use the PyData stack in real engagements?
      Srivatsan Ramanujam, @being_bayesian
      Ian Huston, @ianhuston
      Data Scientists, Pivotal
      © Copyright 2013 Pivotal. All rights reserved.
  • 3. Agenda
      - Introduction to Pivotal
      - Pivotal Data Science Toolkit
      - PL/Python
      - In-database Machine Learning with MADlib
      - Live Demos
          - Topic Analysis and Sentiment Analysis Engine
          - Traffic Disruption Prediction
  • 4. Pivotal Platform Stack
      (Diagram labels: Data-Driven Application Development; Pivotal Data Science Labs; Cloud Application Platform; Data & Analytics Platform; Virtualization; Cloud Storage)
  • 5. What do our customers look like?
      - Large enterprises with lots of data collected
          - Work with 10s of TBs to PBs of data, structured & unstructured
      - Not able to get what they want out of their data
          - Old legacy systems with high cost and no flexibility
          - Response times are too slow for interactive data analysis
          - Can only deal with small samples of data locally
      - They want to transform into data-driven enterprises
  • 6. MPP Architectural Overview
      - Think of it as multiple PostgreSQL servers
      - Rows are distributed across segments by a particular field (or randomly)
      (Diagram labels: Master; Segments/Workers)
  • 7. Typical Engagement Tech Setup
      - Platform:
          - Greenplum Analytics Database (GPDB)
          - Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop)
      - Open Source Options (http://gopivotal.com):
          - Greenplum Community Edition
          - Pivotal HD Community Edition (HAWQ not included)
          - MADlib in-database machine learning library (http://madlib.net)
      - Where Python fits in:
          - PL/Python running in-database, with nltk, scikit-learn etc.
          - IPython for exploratory analysis
          - Pandas, Matplotlib etc.
  • 8. PIVOTAL DATA SCIENCE TOOLKIT
      A large and varied tool box!
      1. Find Data
          - Platforms: Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS
      2. Write Code
          - Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime
          - Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R
      3. Run Code
          - Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, Putty, Winscp
      4. Write Code for Big Data
          - In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
          - Hadoop: HAWQ, Pig, Hive, Java
      5. Implement Algorithms
          - Libraries: MADlib
          - Java: Mahout
          - R: (Too many to list!)
          - Text: OpenNLP, NLTK, GPText
          - C++: opencv
          - Python: NumPy, SciPy, scikit-learn, Pandas
          - Programs: Alpine Miner, Rstudio, MATLAB, SAS, Stata
      6. Show Results
          - Visualization: python-matplotlib, python-networkx, D3.js, Tableau, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel
      7. Collaborate
          - Sharing Tools: Chorus, Confluence, Socialcast, Github, Google Drive & Hangouts
  • 9. Data Parallelism
      - Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
      - Examples:
          - Measure the height of each student in a classroom (explicitly parallelizable by student)
          - MapReduce
          - map() function in Python
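The classroom example on this slide can be sketched in a few lines of Python. The student records and the `height_in_metres` function are made up for illustration; the point is only that each record is processed without looking at any other, so the serial `map()` and a pool of workers give the same answer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical classroom data: (student, height in cm).
students = [("Ana", 152.0), ("Ben", 147.5), ("Chi", 160.2), ("Dev", 155.1)]


def height_in_metres(record):
    """Process one student; no other record is needed."""
    name, height_cm = record
    return name, height_cm / 100.0


# Serial version with the built-in map().
serial = list(map(height_in_metres, students))

# The same function applied by a pool of workers. Because the tasks are
# independent, the result does not depend on how they are scheduled.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(height_in_metres, students))
```

A thread pool is used here only to keep the sketch self-contained; in the talk's setting the workers are the database segments.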
  • 10. User-Defined Functions (UDFs)
      - PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
      - Simple UDFs are SQL queries with calling arguments and return types.
      Definition:
          CREATE FUNCTION times2(INT)
          RETURNS INT
          AS $$
              SELECT 2 * $1
          $$ LANGUAGE sql;
      Execution:
          SELECT times2(1);
           times2
          --------
                2
          (1 row)
  • 11. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
      - Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
      - The interpreter/VM of the language 'X' is installed on each node of the Greenplum Database cluster
      - Data Parallelism:
          - PL/X piggybacks on Greenplum's MPP architecture
      (Diagram labels: SQL; Master Host; Standby Master; Interconnect; Segment Hosts, each containing several Segments)
  • 12. Intro to PL/Python
      - Procedural languages need to be installed on each database used.
      - Name in SQL is plpythonu; 'u' means untrusted, so you need to be superuser to install it.
      - Syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, like a SQL User Defined Function with Python inside.
          CREATE FUNCTION pymax (a integer, b integer)    <- SQL wrapper
            RETURNS integer
          AS $$
            if a > b:                                     <- normal Python
              return a
            return b
          $$ LANGUAGE plpythonu;                          <- SQL wrapper
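Since the body of a PL/Python function is ordinary Python, one way to work is to prototype it locally as a plain function and only then wrap it. The sketch below does that for `pymax`; `wrap_as_plpython` and `PYMAX_BODY` are hypothetical helpers written for this illustration, not part of PostgreSQL, Greenplum or any library.

```python
import textwrap


def pymax(a, b):
    # Same body as the PL/Python function; only the "def" line differs.
    if a > b:
        return a
    return b


# The body alone, as it would appear between the $$ markers.
PYMAX_BODY = """\
if a > b:
    return a
return b
"""


def wrap_as_plpython(name, args, returns, body):
    """Hypothetical helper: build CREATE FUNCTION text around a Python body."""
    indented = textwrap.indent(body.rstrip("\n"), "  ")
    return (
        f"CREATE FUNCTION {name} ({args})\n"
        f"  RETURNS {returns}\n"
        f"AS $$\n{indented}\n$$ LANGUAGE plpythonu;"
    )


ddl = wrap_as_plpython("pymax", "a integer, b integer", "integer", PYMAX_BODY)
```

Printing `ddl` gives text of the same shape as the definition on the slide, which could then be sent to the database by a superuser.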
  • 13. Returning Results
      - Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
      - Composite types can be returned by creating a composite type in the database:
          CREATE TYPE named_value AS (
            name   text,
            value  integer
          );
      - Then you can return a list, tuple or dict (not sets) whose structure matches that of the composite type:
          CREATE FUNCTION make_pair (name text, value integer)
            RETURNS named_value
          AS $$
            return [ name, value ]
            # or alternatively, as tuple: return ( name, value )
            # or as dict: return { "name": name, "value": value }
            # or as an object with attributes .name and .value
          $$ LANGUAGE plpythonu;
      - For functions which return multiple rows, prefix "setof" before the return type
  • 14. Returning more results
      You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:
      Sequence:
          CREATE FUNCTION make_pair (name text)
            RETURNS SETOF named_value
          AS $$
            return ([ name, 1 ], [ name, 2 ], [ name, 3])
          $$ LANGUAGE plpythonu;
      Generator:
          CREATE FUNCTION make_pair (name text)
            RETURNS SETOF named_value
          AS $$
            for i in range(3):
                yield (name, i)
          $$ LANGUAGE plpythonu;
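The row shapes described on the last two slides can be tried out in plain Python before moving them into the database. In this sketch `as_named_value` is an illustrative stand-in for the conversion the database performs, not PL/Python's actual implementation, and the generator counts from 1 (the slide's version uses `range(3)`) so that both functions produce the same rows.

```python
FIELDS = ("name", "value")  # mirrors the named_value composite type


def as_named_value(row):
    """Illustrative only: normalise a list, tuple, dict or object to (name, value)."""
    if isinstance(row, dict):
        return tuple(row[f] for f in FIELDS)
    if isinstance(row, (list, tuple)):
        return tuple(row)
    return tuple(getattr(row, f) for f in FIELDS)


def make_pairs_sequence(name):
    # All rows built up front and returned as one sequence.
    return ([name, 1], [name, 2], [name, 3])


def make_pairs_generator(name):
    # Rows produced one at a time.
    for i in range(1, 4):
        yield (name, i)


rows_seq = [as_named_value(r) for r in make_pairs_sequence("a")]
rows_gen = [as_named_value(r) for r in make_pairs_generator("a")]
```

The generator form avoids holding every row in memory at once, which may matter when a function returns many rows.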
  • 15. Accessing Packages
      - On Greenplum DB, packages must be installed on the individual segment nodes to be available.
          - Can use the "parallel ssh" tool gpssh to conda/pip install
          - Currently Greenplum DB ships with Python 2.6 (!)
      - Then just import as usual inside the function:
          CREATE FUNCTION make_pair (name text)
            RETURNS SETOF named_value   -- SETOF, because the body returns several rows
          AS $$
            import numpy as np
            return ((name, i) for i in np.arange(3))
          $$ LANGUAGE plpythonu;
  • 16. Benefits of PL/Python
      - Easy to bring your code to the data.
      - When SQL falls short, leverage your Python (or R/Java/C) experience quickly.
      - Apply Python across terabytes of data with minimal overhead or additional requirements.
      - Results are already in the database system, ready for further analysis or storage.
  • 17. Going Beyond Data Parallelism
      - Data-parallel computation via PL/Python libraries only allows us to run 'n' models in parallel.
      - This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to be able to build a single model on all the available data.
      - For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
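The "one model per group" case can be sketched in plain Python. The rows, the group names and the `fit_line` helper below are invented for illustration; each group's fit reads only that group's rows, which is why n such fits can run side by side, whereas a single model over all rows needs an algorithm that combines partial results.

```python
from collections import defaultdict

# Hypothetical rows: (group, x, y).
rows = [
    ("north", 0.0, 1.0), ("north", 1.0, 3.0), ("north", 2.0, 5.0),
    ("south", 0.0, 2.0), ("south", 1.0, 1.0), ("south", 2.0, 0.0),
]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on one group."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Equivalent of GROUP BY: collect each group's rows.
groups = defaultdict(list)
for key, x, y in rows:
    groups[key].append((x, y))

# One independent model per group: (intercept, slope).
models = {key: fit_line(points) for key, points in groups.items()}
```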
  • 18. MADlib: The Origin
      UrbanDictionary: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills
      - First mention of MAD analytics was at VLDB 2009:
          MAD Skills: New Analysis Practices for Big Data
          J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca)
          http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
      - MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein
      - Open Source! https://github.com/madlib/madlib
      - Works on Greenplum DB and PostgreSQL
      - Active development by Pivotal
          - Latest Release: v1.3 (Oct 2013)
      - Downloads and Docs: http://madlib.net/
  • 19. MADlib Executes Algorithms In-Place
      MADlib Advantages:
      - No data movement
      - Use MPP architecture's full compute power
      - Use MPP architecture's entire memory to process data sets
      (Diagram labels: MADlib User; SQL; Master Processor; Segment Processors, each marked M)
  • 20. MADlib In-Database Functions
      Predictive Modeling Library
      - Generalized Linear Models
          - Linear Regression
          - Logistic Regression
          - Multinomial Logistic Regression
          - Cox Proportional Hazards Regression
          - Elastic Net Regularization
          - Sandwich Estimators (Huber white, clustered, marginal effects)
      - Matrix Factorization
          - Singular Value Decomposition (SVD)
          - Low-Rank
      - Machine Learning Algorithms
          - Principal Component Analysis (PCA)
          - Association Rules (Affinity Analysis, Market Basket)
          - Topic Modeling (Parallel LDA)
          - Decision Trees
          - Ensemble Learners (Random Forests)
          - Support Vector Machines
          - Conditional Random Field (CRF)
          - Clustering (K-means)
          - Cross Validation
      - Linear Systems
          - Sparse and Dense Solvers
      Descriptive Statistics
      - Sketch-based Estimators
          - CountMin (Cormode-Muthukrishnan)
          - FM (Flajolet-Martin)
          - MFV (Most Frequent Values)
      - Correlation
      - Summary
      Support Modules
      - Array Operations
      - Sparse Vectors
      - Random Sampling
      - Probability Functions
  • 21. Architecture
      Layers in the diagram, top to bottom:
      - User Interface
      - "Driver" Functions (outer loops of iterative algorithms, optimizer invocations)
      - High-level Abstraction Layer (iteration controller, ...)
      - RDBMS Built-in Functions
      - Functions for Inner Loops (for streaming algorithms)
      - Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
      - RDBMS Query Processing (Greenplum, PostgreSQL, …)
      Implementation labels in the diagram: SQL, generated from specification; Python with templated SQL; Python; C++
  • 22. How does it work? A Linear Regression Example
      - Finding linear dependencies between variables
          - y ≈ c0 + c1 · x1 + c2 · x2 ?
          # select y, x1, x2 from unm limit 6;
             y   |  x1  | x2
          -------+------+-----
           10.14 |    0 | 0.3
           11.93 | 0.69 | 0.6
           13.57 |  1.1 | 0.9
           14.17 | 1.39 | 1.2
           15.25 | 1.61 | 1.5
           16.15 | 1.79 | 1.8
      (Diagram labels: Vector of dependent variables y; Design Matrix X)
  • 23. Reminder: Linear-Regression Model
      - If residuals i.i.d. Gaussians with standard deviation σ:
          - max likelihood ⇔ min sum of squared residuals
          $f(y \mid x) \propto \exp\left(-\frac{1}{2\sigma^2}\,(y - x^T c)^2\right)$
      - First-order conditions for the following quadratic objective (in c) yield the minimizer
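The objective and the minimizer on this slide were images and did not survive in the transcript. The block below is a standard reconstruction rather than the slide's exact notation: it assumes $n$ observations $(x_i, y_i)$ stacked into the design matrix $X$ and vector $y$, and that $X^T X$ is invertible. The last line agrees with the $(X^T X)^{-1} X^T y$ expression on the next slide.

```latex
\begin{aligned}
\hat{c} &= \arg\min_{c} \sum_{i=1}^{n} \left(y_i - x_i^T c\right)^2 \\
0 &= -2 \sum_{i=1}^{n} x_i \left(y_i - x_i^T \hat{c}\right) = -2\, X^T \left(y - X \hat{c}\right) \\
\hat{c} &= (X^T X)^{-1} X^T y
\end{aligned}
```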
  • 24. Linear Regression: Streaming Algorithm
      - How to compute $(X^T X)^{-1} X^T y$ with a single table scan?
      (Diagram: the products $X^T X$ and $X^T y$)
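A minimal pure-Python sketch of the single-scan idea follows. It is not MADlib's implementation; `linregr_single_scan`, `solve` and the data are written for this illustration. Each row adds its contribution to running totals for $X^T X$ and $X^T y$, so the rows are read exactly once and only a k-by-k matrix and a k-vector are kept in memory.

```python
def solve(a, b):
    """Solve a @ c = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def linregr_single_scan(rows):
    """rows yields (y, x), where x is a feature list including the constant 1."""
    xtx, xty, k = None, None, 0
    for y, x in rows:  # the only pass over the data
        if xtx is None:
            k = len(x)
            xtx = [[0.0] * k for _ in range(k)]
            xty = [0.0] * k
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return solve(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
data = [(1 + 2 * x1 + 3 * x2, [1.0, x1, x2])
        for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]]
coef = linregr_single_scan(iter(data))
```

Passing an iterator makes the single pass explicit: the data cannot be rewound. Forming the normal equations this way is simple but can be numerically fragile for ill-conditioned data, which a sketch like this does not address.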
  • 25. Linear Regression: Parallel Computation
      $X^T y = \begin{pmatrix} X_1^T & X_2^T \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \sum_i X_i^T y_i$
      (Diagram: Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master adds them to obtain $X^T y$)
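The identity on this slide can be checked with a small sketch. The two "segments" are just Python lists here, and `partial_sums` and `merge` are illustrative names, not MADlib functions: each segment summarises its own rows, and the master only adds the summaries together.

```python
def partial_sums(rows):
    """One segment's contribution: (X_i^T X_i, X_i^T y_i) for its own rows."""
    k = len(rows[0][1])
    xtx = [[sum(x[a] * x[b] for _, x in rows) for b in range(k)]
           for a in range(k)]
    xty = [sum(x[a] * y for y, x in rows) for a in range(k)]
    return xtx, xty


def merge(left, right):
    """Master step: element-wise addition of two partial results."""
    xtx = [[p + q for p, q in zip(r1, r2)]
           for r1, r2 in zip(left[0], right[0])]
    xty = [p + q for p, q in zip(left[1], right[1])]
    return xtx, xty


# Hypothetical rows (y, [1, x1]) split across two segments.
segment_1 = [(3.0, [1.0, 1.0]), (5.0, [1.0, 2.0])]
segment_2 = [(7.0, [1.0, 3.0]), (9.0, [1.0, 4.0])]

merged = merge(partial_sums(segment_1), partial_sums(segment_2))
whole = partial_sums(segment_1 + segment_2)
```

Only the small per-segment summaries travel to the master, never the rows themselves, so the amount of data moved does not grow with the table size.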
  • 26. Demos
      - We built demos to showcase our technology pipeline, using Python technology.
      - Two use cases:
          - Topic and Sentiment Analysis of Tweets
          - London Road Traffic Disruption prediction
  • 27. Topic and Sentiment Analysis Pipeline
      Pipeline stages in the diagram:
      - Tweet Stream
      - Stored on HDFS
      - Loaded as external tables into GPDB (gpfdist)
      - Parallel parsing of JSON and extraction of fields using PL/Python
      - Topic Analysis through MADlib pLDA
      - Sentiment Analysis through custom PL/Python functions
      - D3.js
  • 28. Transport Disruption Prediction Pipeline
      Pipeline stages in the diagram:
      - Transport for London Traffic Disruption feed
      - Pivotal Greenplum Database: Deduplication, Feature Creation, Modelling & Machine Learning
      - d3.js & NVD3: Interactive SVG figures
  • 29. Pivotal’s Open Source Contributions
      http://gopivotal.com/pivotal-products/open-source-software
      - PyMADlib: Python wrapper for MADlib
          - https://github.com/gopivotal/pymadlib
      - PivotalR: R wrapper for MADlib
          - https://github.com/madlib-internal/PivotalR
      - Part-of-speech tagger for Twitter via SQL
          - http://vatsan.github.io/gp-ark-tweet-nlp/
      Questions? @being_bayesian @ianhuston
  • 30. BUILT FOR THE SPEED OF BUSINESS