1. 1© Copyright 2016 Pivotal. All rights reserved.
Esther Vasiete
Pivotal Data Scientist
Structure Data 2016
Data Science at Scale on MPP Databases – Use Cases & Open Source Tools
Joint work with Pivotal Data Science
2. 2© Copyright 2016 Pivotal. All rights reserved.
Agenda
• Introduction
• Open Source Data Science Toolkit
• Real world applications
– Predictive maintenance of automobiles
– Predicting insurance claims
– Predicting customer churn
• Data science deep-dive with Jupyter notebooks
– Text analytics on MPP (github.com/vatsan)
– Image processing on MPP (github.com/gautamsm)
3. 3© Copyright 2016 Pivotal. All rights reserved.
Pivotal Data Science
Our Charter:
Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs)
Our Goals:
• Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
• Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
[Figure: Data Science, Data Engineering, App Dev]
4. 4© Copyright 2016 Pivotal. All rights reserved.
Pivotal Data Science Knowledge Development
5. 5© Copyright 2016 Pivotal. All rights reserved.
Use Case: Preventive Maintenance for Connected Vehicles
• Customer vehicles transmit Diagnostic Trouble Codes (DTC) and vehicle status data to the Pivotal analytics environment
• Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
• Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data
6. 6© Copyright 2016 Pivotal. All rights reserved.
Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: timeline of DTC observations (codes B, P, C, U, alone or in combination) leading up to repair jobs of type Transmission, Engine, and Body. For each job the slide asks: can the DTCs observed here predict this Job Type?]
7. 7© Copyright 2016 Pivotal. All rights reserved.
Data Parallelism
• One or more jobs can occur on the same day, so this is a multi-label problem
• One-vs-rest classifiers are built in parallel
[Figure: One-vs-Rest Classification over Class 1, Class 2, Class 3 with 0/1 labels: Red vs. Non Red on Segment 1, Green vs. Non Green on Segment 2, Blue vs. Non Blue on Segment N]
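The multi-label setup above can be illustrated with a minimal sketch. It only shows how one binary target vector per class is derived from the job types seen on each day; the function name and the example data are hypothetical, and the classifier that would be fit to each vector (one per segment in the slide) is left out.

```python
def one_vs_rest_targets(label_sets, classes):
    """Build one 0/1 target vector per class from multi-label observations."""
    return {
        c: [1 if c in labels else 0 for labels in label_sets]
        for c in classes
    }

# Hypothetical data: job types recorded on three vehicle-days.
jobs = [{"Transmission"}, {"Transmission", "Engine"}, {"Body"}]
classes = ["Body", "Engine", "Transmission"]

targets = one_vs_rest_targets(jobs, classes)
for job_type in classes:
    print(job_type, targets[job_type])
```

Each vector is independent of the others, which is why the per-class models can be trained at the same time.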
8. 8© Copyright 2016 Pivotal. All rights reserved.
Model Scoring Pipeline
[Figure: incoming DTC observations (DTC: B; DTC: B, P, C; DTC: U) pass from Ingest to real-time scoring against the Body, Axle, and Engine models, each applying a Prob >= Threshold test, with model caching in GPDB/HAWQ. Results go to a Sink that feeds a web or mobile app dashboard.]
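The "Prob >= Threshold" step in the pipeline can be sketched as follows. This is an assumption-laden illustration: the scores and thresholds are made up, and the function name is hypothetical; the slide does not say how thresholds are chosen.

```python
def predict_job_types(probabilities, thresholds):
    """Return the job types whose probability meets or exceeds its threshold."""
    return sorted(
        job for job, p in probabilities.items() if p >= thresholds[job]
    )

# Hypothetical model outputs for one incoming DTC observation.
scores = {"Body": 0.81, "Axle": 0.12, "Engine": 0.50}
# Hypothetical per-model thresholds; they need not all be equal.
thresholds = {"Body": 0.5, "Axle": 0.5, "Engine": 0.5}

flagged = predict_job_types(scores, thresholds)
print(flagged)
```

Because each model has its own test, a single observation can be flagged for several job types at once, which matches the multi-label framing.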
9. 9© Copyright 2016 Pivotal. All rights reserved.
MPP Architectural Overview
• Think of it as multiple PostgreSQL servers: a Master and a set of Segments/Workers
• Rows are distributed across segments by a particular field (or randomly)
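A rough sketch of distribution by a field is shown below. It assumes `zlib.crc32` as a stand-in hash and a hypothetical `vehicle_id` distribution key; the actual hash function used by the database is not given in the slides.

```python
import zlib

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_field, num_segments):
    """Place each row on the segment chosen by hashing its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments

# Hypothetical table: 1000 DTC rows from 50 vehicles, spread over 4 segments.
rows = [{"vehicle_id": i % 50, "dtc": "B"} for i in range(1000)]
segments = distribute(rows, "vehicle_id", 4)
print([len(s) for s in segments])
```

All rows that share a key value land on the same segment, so per-key work such as grouping by vehicle can run without moving data between segments.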
11. 11© Copyright 2016 Pivotal. All rights reserved.
Open Source Data Science Toolkit
[Figure: logo grid grouped into Key Languages, Key Tools (modeling tools, including MLlib and PL/X, and visualization tools), and Platform (Pivotal Big Data Suite, GemFire)]
12. 12© Copyright 2016 Pivotal. All rights reserved.
Scalable, In-Database Machine Learning
• Open Source https://github.com/madlib/madlib
• Works on Greenplum DB, Apache HAWQ and PostgreSQL
• In active development by Pivotal
• MADlib is now an Apache Software Foundation incubator project!
Apache (incubating)
13. 13© Copyright 2016 Pivotal. All rights reserved.
Functions
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Aug 2015
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Predictive Analytics Library
@MADlib_analytic
14. 14© Copyright 2016 Pivotal. All rights reserved.
Use Case: Predicting insurance claim amounts using structured and unstructured data
• Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts
15. 15© Copyright 2016 Pivotal. All rights reserved.
Text analytics on MPP
• Unstructured data in the form of claim comments and claim descriptions (text)
• Use a bag-of-words approach (unigrams, bigrams)
• Weight terms with tf-idf for more meaningful insights: terms that appear in nearly every claim are down-weighted
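The bag-of-words and tf-idf steps can be sketched in a few lines. This is a minimal single-machine illustration, assuming whitespace tokenization, term frequency normalized by document length, and idf = log(N / df); the talk's notebook may use different choices, and the claim texts are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text):
    """Count unigrams and bigrams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, 1) + ngrams(tokens, 2))

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    bags = [bag_of_words(d) for d in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights

# Hypothetical claim descriptions.
claims = [
    "water damage in kitchen",
    "fire damage in garage",
    "water leak in basement",
]
weights = tf_idf(claims)
top = sorted(weights[0].items(), key=lambda kv: -kv[1])[:3]
print(top)
```

A term such as "in", which occurs in every claim, gets weight zero, while a term unique to one claim scores highest.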
16. 16© Copyright 2016 Pivotal. All rights reserved.
Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook
17. 17© Copyright 2016 Pivotal. All rights reserved.
Use Case: Churn prediction
• Build a churn model to predict which customers are most likely to churn
• Provide insights into the key factors responsible for churn, so that it is possible to intervene before a customer churns
18. 18© Copyright 2016 Pivotal. All rights reserved.
Usage Time Series Data
• Aggregate weekly usage by user
• Compute descriptive statistics
• Extract features based on business expertise
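The three steps above might look like this on a single machine. The event layout `(user_id, date, amount)`, the ISO-week bucketing, and the `last_minus_first` feature are all assumptions made for illustration; the slides do not specify the schema or the features used.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

def weekly_usage(events):
    """Sum usage per user and ISO week from (user_id, date, amount) events."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, day, amount in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[user][(iso_year, iso_week)] += amount
    return totals

def describe(weekly):
    """Descriptive statistics plus one simple hand-crafted feature."""
    values = [weekly[week] for week in sorted(weekly)]
    return {
        "weeks": len(values),
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        # Hypothetical business feature: change from first to last week.
        "last_minus_first": values[-1] - values[0],
    }

# Hypothetical usage events for one user.
events = [
    ("u1", date(2016, 3, 7), 10.0),
    ("u1", date(2016, 3, 9), 5.0),
    ("u1", date(2016, 3, 14), 9.0),
    ("u1", date(2016, 3, 21), 3.0),
]
stats = describe(weekly_usage(events)["u1"])
print(stats)
```

In the database, the same aggregation would typically be a GROUP BY over user and week, with the statistics computed per user in parallel.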
19. 19© Copyright 2016 Pivotal. All rights reserved.
Open Source Analytics Ecosystem
Companies benefit from algorithmic breadth and scalability for building and socializing data science models
Best of breed in-memory and in-database tools for an MPP platform
[Figure: logos grouped into Algorithms (including MLlib and PL/X) and Visualization]
20. 20© Copyright 2016 Pivotal. All rights reserved.
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
• plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Figure: SQL enters the Master Host (with a Standby Master), which connects through the Interconnect to several Segment Hosts, each running multiple Segments]
21. 21© Copyright 2016 Pivotal. All rights reserved.
User Defined Functions (UDFs) in PL/Python
• Procedural languages need to be installed on each database used.
• The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, it is like a SQL User Defined Function with Python inside.
CREATE FUNCTION seasonality(x float[])  -- SQL wrapper
RETURNS float[]
AS $$
    # Normal Python
    import statsmodels.api as sm
    s = sm.tsa.seasonal_decompose(x).seasonal
    return s
$$ LANGUAGE plpythonu;  -- SQL wrapper
22. 22© Copyright 2016 Pivotal. All rights reserved.
Usage Time Series Data with PL/X
• Easily harness your UDF with open source libraries (for machine learning, signal processing...)
• Runs at scale through data parallelism
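The data-parallel pattern can be mimicked outside the database to show the idea. In this sketch a thread pool stands in for the segments, and `trend_slope` is a hypothetical per-series function playing the role of a UDF; in the database the equivalent would be calling a UDF on an array aggregated per user, a query shape that is assumed here rather than taken from the slides.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def trend_slope(series):
    """Least-squares slope of a series against its index 0..n-1."""
    n = len(series)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def apply_per_user(rows, func, workers=4):
    """Group (user, week, usage) rows by user, then apply func to each series."""
    series = defaultdict(list)
    for user, _week, usage in sorted(rows, key=lambda r: (r[0], r[1])):
        series[user].append(usage)
    users = list(series)
    # Each user's series is independent, so the calls can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(func, (series[u] for u in users)))
    return dict(zip(users, results))

# Hypothetical weekly usage rows, deliberately out of order for user "b".
rows = [
    ("a", 1, 10.0), ("a", 2, 8.0), ("a", 3, 6.0),
    ("b", 2, 5.0), ("b", 1, 5.0), ("b", 3, 5.0),
]
slopes = apply_per_user(rows, trend_slope)
print(slopes)
```

The function itself knows nothing about parallelism; scale comes from running it once per group on whichever worker holds that group's rows.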
23. 23© Copyright 2016 Pivotal. All rights reserved.
Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV inside a PL/C function
24. 24© Copyright 2016 Pivotal. All rights reserved.
Pivotal Data Science Blogs
1. Scaling native (C++) apps on Pivotal MPP
2. Predicting commodity futures through Tweets
3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4. Using data science to predict TV viewer behavior
5. Twitter NLP: Scaling part-of-speech tagging
6. Distributed deep learning on MPP and Hadoop
7. Multi-variate time series forecasting
8. Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal