© Copyright 2016 Pivotal. All rights reserved.
Esther Vasiete
Pivotal Data Scientist
Structure Data 2016
Data Science at Scale on MPP Databases – Use Cases & Open Source Tools
Joint work with Pivotal Data Science
Agenda
•  Introduction
•  Open Source Data Science Toolkit
•  Real-world applications
–  Predictive maintenance of automobiles
–  Predicting insurance claims
–  Predicting customer churn
•  Data science deep-dive with Jupyter notebooks
–  Text analytics on MPP (github.com/vatsan)
–  Image processing on MPP (github.com/gautamsm)
Pivotal Data Science
Our Charter:
Pivotal Data Science is Pivotal’s differentiated and
highly opinionated data-centric service delivery
organization (part of Pivotal Labs)
Our Goals:
Expedite customer time-to-value and ROI, by driving
business-aligned innovation and solutions assurance
within Pivotal’s Data Fabric technologies.
Drive customer adoption and autonomy across the full
spectrum of Pivotal Data technologies through best-in-class
data science and data engineering services, with
a deep emphasis on knowledge transfer.
[Diagram labels: Data Science, Data Engineering, App Dev]
Pivotal Data Science Knowledge Development
Use Case: Preventive Maintenance for Connected Vehicles
•  Customer vehicles transmit Diagnostic Trouble Codes (DTCs) and vehicle status data to the Pivotal analytics environment
•  Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
•  Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data
Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: three timelines of DTC observations (code families B, P, C, U), each leading up to repair jobs of type Transmission, Engine, or Body. For each job the question is: can the DTCs observed beforehand predict this job type?]
Data Parallelism
•  One or more jobs can occur on the same day, so this is a multi-label problem
•  One-vs-rest classifiers are built in parallel
[Figure: One-vs-Rest Classification. Binary (0/1) targets for Class 1, Class 2, and Class 3; Red vs. Non Red on Segment 1, Green vs. Non Green on Segment 2, Blue vs. Non Blue on Segment N.]
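The one-vs-rest setup can be sketched in plain Python. This is only an illustration under stated assumptions: the training rows, the feature encoding (one 0/1 flag per DTC code family), and the helper names `encode`, `binarize`, `fit_logistic`, and `predict_proba` are hypothetical and not taken from the talk, and the per-class fits run one after another here rather than on separate segments.

```python
import math

CODES = ["B", "P", "C", "U"]  # DTC code families used as 0/1 features


def encode(dtcs):
    """Turn a set of observed DTC codes into a 0/1 feature vector."""
    return [1.0 if code in dtcs else 0.0 for code in CODES]


def binarize(job_sets, classes):
    """One 0/1 target vector per class (the 'class vs. rest' labels)."""
    return {c: [1 if c in jobs else 0 for jobs in job_sets] for c in classes}


def predict_proba(w, x):
    """Logistic model: the last weight is the bias term."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=2000, lr=0.5):
    """Plain batch gradient descent on the mean logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


# Hypothetical training rows: DTCs seen before a visit, and the jobs performed.
observed = [{"B"}, {"P"}, {"U"}, {"P", "U"}, {"B", "P"}]
jobs = [{"Body"}, {"Engine"}, {"Transmission"},
        {"Transmission", "Engine"}, {"Body", "Engine"}]
classes = ["Body", "Engine", "Transmission"]

X = [encode(d) for d in observed]
targets = binarize(jobs, classes)

# One independent binary model per class. On an MPP database each fit could
# run on a different segment; here they simply run in sequence.
models = {c: fit_logistic(X, targets[c]) for c in classes}
```

Because each class gets its own binary target vector, the fits share no state, which is what makes the problem easy to spread across segments.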
Model Scoring Pipeline
[Figure: pipeline with stages Ingest, Real time scoring, and Sink. DTC inputs (B; B, P, C; U) are scored by per-job-type models (Body, Axle, Engine), each applying Prob >= Threshold; Model Caching (GPDB/HAWQ); output to a web or mobile app dashboard.]
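The "Prob >= Threshold" step of the scoring pipeline might look like the following minimal sketch. The function name `flag_job_types`, the probabilities, and the thresholds are all made up for illustration; the talk does not give the actual values or code.

```python
def flag_job_types(probabilities, thresholds, default_threshold=0.5):
    """Return the job types whose predicted probability meets its threshold."""
    return sorted(
        job for job, prob in probabilities.items()
        if prob >= thresholds.get(job, default_threshold)
    )


# Hypothetical model outputs for one vehicle, and hypothetical thresholds.
probabilities = {"Body": 0.81, "Axle": 0.12, "Engine": 0.40}
thresholds = {"Body": 0.6, "Axle": 0.5, "Engine": 0.35}

flagged = flag_job_types(probabilities, thresholds)
print(flagged)  # ['Body', 'Engine']
```

Keeping a separate threshold per job type lets each one-vs-rest model trade off precision and recall independently.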
MPP Architectural Overview
Think of it as multiple PostgreSQL servers
[Figure labels: Master; Segments/Workers]
Rows are distributed across segments by a particular field (or randomly)
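A rough way to picture distribution by a field is hashing the field's value to pick a segment. The sketch below is an assumption-laden illustration: it uses CRC32 as a stand-in hash (the database's real hash function differs), and the names `segment_for`, `distribute`, and `vehicle_id` are hypothetical.

```python
import zlib
from collections import defaultdict


def segment_for(key, n_segments):
    """Map a distribution-key value to a segment index with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key, n_segments):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key], n_segments)].append(row)
    return segments


# Hypothetical rows: ten DTC events from five vehicles.
rows = [{"vehicle_id": f"V{i % 5}", "dtc": code}
        for i, code in enumerate("BPCUBPCUBP")]
segments = distribute(rows, "vehicle_id", n_segments=4)
```

All rows that share a key value land on the same segment, so per-key work (for example, per-vehicle aggregation) needs no data movement between segments.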
IT TAKES MORE THAN
ONE TOOL
Open Source Data Science Toolkit
[Figure: logo grid grouped into Key Languages, Key Tools (Modeling Tools, Visualization Tools), and Platform. Legible labels: MLlib, PL/X, Pivotal Big Data Suite, GemFire.]
Scalable, In-Database Machine Learning
•  Open Source https://github.com/madlib/madlib
•  Works on Greenplum DB, Apache HAWQ and PostgreSQL
•  In active development by Pivotal
•  MADlib is now an Apache Software Foundation incubator project!
Apache (incubating)
Functions
Supervised Learning
Regression Models
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Marginal Effects
•  Multinomial Regression
•  Ordinal Regression
•  Robust Variance, Clustered Variance
•  Support Vector Machines
Tree Methods
•  Decision Tree
•  Random Forest
Other Methods
•  Conditional Random Field
•  Naïve Bayes
Unsupervised Learning
•  Association Rules (Apriori)
•  Clustering (K-means)
•  Topic Modeling (LDA)
Statistics
Descriptive
•  Cardinality Estimators
•  Correlation
•  Summary
Inferential
•  Hypothesis Tests
Other Statistics
•  Probability Functions
Other Modules
•  Conjugate Gradient
•  Linear Solvers
•  PMML Export
•  Random Sampling
•  Term Frequency for Text
Time Series
•  ARIMA
Aug 2015
Data Types and Transformations
•  Array Operations
•  Dimensionality Reduction (PCA)
•  Encoding Categorical Variables
•  Matrix Operations
•  Matrix Factorization (SVD, Low Rank)
•  Norms and Distance Functions
•  Sparse Vectors
Model Evaluation
•  Cross Validation
Predictive Analytics Library
@MADlib_analytic
Use Case: Predicting insurance claim amounts using structured and unstructured data
•  Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts
Text analytics on MPP
•  Unstructured data in the form of claim comments and claim descriptions (text)
•  Use a bag-of-words approach (unigrams, bigrams)
•  tf-idf for more meaningful insights
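The bag-of-words and tf-idf steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the notebook's code: the claim texts are invented, whitespace tokenization is a simplification, and the weighting used here (term count divided by document length, times log of N over document frequency) is one common tf-idf variant among several.

```python
import math
from collections import Counter


def tokens_with_bigrams(text):
    """Lowercased unigrams plus adjacent-word bigrams."""
    words = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    bags = [Counter(tokens_with_bigrams(doc)) for doc in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Hypothetical claim descriptions.
claims = ["water damage in kitchen",
          "fire damage in garage",
          "water leak in kitchen"]
weights = tf_idf(claims)
```

A term that occurs in every claim (here "in") gets weight zero, while a bigram unique to one claim (such as "water damage") is weighted up, which is the sense in which tf-idf gives more meaningful features than raw counts.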
Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook
Use Case: Churn prediction
•  Build a churn model to predict which customers are most likely to churn
•  Provide insights into key factors responsible for churn to potentially intervene prior to churn
Usage Time Series Data
•  Aggregate weekly usage by user
•  Compute descriptive statistics
•  Extract features based on business expertise
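The first two steps (weekly aggregation per user, then descriptive statistics) can be sketched as follows. The event tuples, user names, and the helper names `weekly_usage` and `describe` are hypothetical, and ISO weeks are assumed as the weekly bucket; the talk does not specify how weeks were defined.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def weekly_usage(events):
    """Sum usage per (user, ISO year, ISO week)."""
    totals = defaultdict(float)
    for user, day, amount in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(user, iso_year, iso_week)] += amount
    return totals


def describe(totals):
    """Per-user descriptive statistics over the weekly totals."""
    per_user = defaultdict(list)
    for (user, _, _), amount in sorted(totals.items()):
        per_user[user].append(amount)
    return {
        user: {"weeks": len(v), "mean": mean(v), "std": pstdev(v),
               "min": min(v), "max": max(v)}
        for user, v in per_user.items()
    }


# Hypothetical usage events: (user, day, amount).
events = [
    ("alice", date(2016, 1, 4), 10.0),   # ISO week 1 of 2016
    ("alice", date(2016, 1, 6), 5.0),    # same week
    ("alice", date(2016, 1, 12), 25.0),  # ISO week 2
    ("bob", date(2016, 1, 5), 7.0),
]
totals = weekly_usage(events)
stats = describe(totals)
```

The resulting per-user statistics are the kind of columns that could feed a churn model, alongside features chosen with business expertise.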
Open Source Analytics Ecosystem
Companies benefit from algorithmic breadth and scalability for
building and socializing data science models
[Figure labels: MLlib, PL/X, Algorithms, Visualization]
Best of breed in-memory and in-database tools for an MPP platform
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
•  plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Figure: SQL arrives at the master host (with a standby master), which connects through the interconnect to several segment hosts, each running multiple segments.]
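The embarrassingly parallel pattern can be imitated outside the database to make the idea concrete. In this sketch a thread pool stands in for the segments and the `summarize` function stands in for any stand-alone library call; both are assumptions for illustration, and the data layout is invented.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(rows):
    """Stand-in for any stand-alone library call applied to one segment's rows."""
    return {"count": len(rows), "total": sum(rows)}


# Hypothetical data already split across three segments.
segments = {0: [1.0, 2.0], 1: [3.0], 2: [4.0, 5.0, 6.0]}

# Each segment's rows are processed independently of the others.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    futures = {seg: pool.submit(summarize, rows) for seg, rows in segments.items()}
    per_segment = {seg: fut.result() for seg, fut in futures.items()}

# The master combines the partial results.
overall_total = sum(r["total"] for r in per_segment.values())
```

The function never needs to see another segment's rows, which is the property that lets a PL/X function run on every segment at once.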
User Defined Functions (UDFs) in PL/Python
•  Procedural languages need to be installed on each database in which they are used.
•  The syntax is that of a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, think of it as a SQL user-defined function with Python inside.

    -- SQL wrapper
    CREATE FUNCTION seasonality (x float[])
      RETURNS float[]
    AS $$
      # Normal Python
      import statsmodels.api as sm
      s = sm.tsa.seasonal_decompose(x).seasonal
      return s
    $$ LANGUAGE plpythonu;  -- SQL wrapper
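Since the body between the `$$` markers is ordinary Python, it can be developed and tested as a plain function before being wrapped in SQL. The sketch below is a deliberately simplified stand-in, not the statsmodels routine used on the slide: `naive_seasonality` and its `period` parameter are hypothetical, and unlike `seasonal_decompose` it does not remove a trend before averaging.

```python
def naive_seasonality(x, period):
    """Seasonal component as the mean at each phase minus the overall mean.

    A simplification for illustration: no trend is removed first.
    """
    if period <= 0 or len(x) < period:
        raise ValueError("need at least one full period of data")
    overall = sum(x) / len(x)
    phase_means = []
    for phase in range(period):
        values = x[phase::period]  # every observation at this phase
        phase_means.append(sum(values) / len(values) - overall)
    # Repeat the per-phase effect along the full length of the series.
    return [phase_means[i % period] for i in range(len(x))]


usage = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
seasonal = naive_seasonality(usage, period=3)
print(seasonal)  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

Like the UDF, it takes an array of floats and returns an array of the same length, so the same signature (`float[]` in, `float[]` out) would apply if it were wrapped.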
Usage Time Series Data with PL/X
•  Easily combine your UDFs with open source libraries (for machine learning, signal processing, ...)
•  Runs at scale through data parallelism
Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV
inside a PL/C function
Pivotal Data Science Blogs
1.  Scaling native (C++) apps on Pivotal MPP
2.  Predicting commodity futures through Tweets
3.  A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4.  Using data science to predict TV viewer behavior
5.  Twitter NLP: Scaling part-of-speech tagging
6.  Distributed deep learning on MPP and Hadoop
7.  Multi-variate time series forecasting
8.  Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal
Thank You!
A NEW PLATFORM FOR A NEW ERA

Mais conteúdo relacionado

Mais procurados

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Masayuki Matsushita
 

Mais procurados (20)

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
 
Pivotal HAWQ 소개
Pivotal HAWQ 소개Pivotal HAWQ 소개
Pivotal HAWQ 소개
 
Empower Data-Driven Organizations
Empower Data-Driven OrganizationsEmpower Data-Driven Organizations
Empower Data-Driven Organizations
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the Experts
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 

Semelhante a Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Semelhante a Data Science at Scale on MPP databases - Use Cases & Open Source Tools (20)

A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning SolutionsKamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
 
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven EnterprisePivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
Pivotal Digital Transformation Forum: Journey to Become a Data-Driven Enterprise
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session
 
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 

Último

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Data Science at Scale on MPP databases - Use Cases & Open Source Tools

  • 1. 1© Copyright 2016 Pivotal. All rights reserved. 1© Copyright 2016 Pivotal. All rights reserved. Esther Vasiete Pivotal Data Scientist Structure Data 2016 Data Science at Scale on MPP Databases – Use Cases & Open Source Tools Joint work with Pivotal Data Science
  • 2. 2© Copyright 2016 Pivotal. All rights reserved. Agenda Ÿ  Introduction Ÿ  Open Source Data Science Toolkit Ÿ  Real world applications –  Predictive maintenance of automobiles –  Predicting insurance claims –  Predicting customer churn Ÿ  Data science deep-dive with Jupyter notebooks –  Text analytics on MPP (github.com/vatsan) –  Image processing on MPP (github.com/gautamsm)
  • 3. 3© Copyright 2016 Pivotal. All rights reserved. Pivotal Data Science Our Charter: Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs) Our Goals: Expedite customer time-to-value and ROI, by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies. Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in- class data science and data engineering services, with a deep emphasis on knowledge transfer. Data Science Data Engineering App Dev
  • 4. 4© Copyright 2016 Pivotal. All rights reserved. Pivotal Data Science Knowledge Development
  • 5. 5© Copyright 2016 Pivotal. All rights reserved. Use Case: Preventive Maintenance for Connected Vehicles Ÿ  Customer vehicles transmit Diagnostic Trouble Codes (DTC) and vehicle status data to the Pivotal analytics environment Ÿ  Can the DTC data be leveraged to predict the presence of potential problems in vehicles? Ÿ  Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data
  • 6. 6© Copyright 2016 Pivotal. All rights reserved. Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs) Time Job Type: Transmission Job Type: Transmission Engine Job Type: Body DTC: B DTC: B, P, C DTC: U DTC: B DTC: B DTC: B, P, C, U DTC: P, B, U DTC: P DTC: B DTC: B,P DTC: B,P Can the DTCs observed here predict this Job Type? Can the DTCs observed here predict this Job Type? Can the DTCs observed here predict this Job Type?
  • 7. Data Parallelism. One or more jobs can occur on the same day, which makes this a multi-label problem. One-vs-rest classifiers are built in parallel: Red vs. Non-Red on Segment 1, Green vs. Non-Green on Segment 2, Blue vs. Non-Blue on Segment N. [Figure: binary label indicators for Class 1, Class 2, Class 3]
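The multi-label setup on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the presenters' implementation: the job-type names come from the slides, while the sample records and the helper `one_vs_rest_targets` are hypothetical. Each job type gets its own binary target column, so the per-class problems are independent and each could be trained on a different segment.

```python
# Hypothetical sample: each record is one vehicle-day with the DTC
# categories observed and the set of job types performed (possibly several).
records = [
    {"dtcs": {"B"}, "jobs": {"Body"}},
    {"dtcs": {"B", "P", "C"}, "jobs": {"Transmission", "Engine"}},
    {"dtcs": {"U"}, "jobs": set()},
    {"dtcs": {"P", "B", "U"}, "jobs": {"Transmission"}},
]


def one_vs_rest_targets(records, classes):
    """Return {class: [0/1 per record]}, i.e. one binary problem per class."""
    return {
        c: [1 if c in r["jobs"] else 0 for r in records]
        for c in classes
    }


classes = ["Transmission", "Engine", "Body"]
targets = one_vs_rest_targets(records, classes)
for c in classes:
    # Each list is the target column for one independent binary classifier.
    print(c, "vs. rest:", targets[c])
```

Because no target column depends on another, the classifiers can be fitted concurrently without any coordination between them.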
  • 8. Model Scoring Pipeline. [Figure labels: Ingest; incoming DTCs (B; B, P, C; U); per-job-type models (Body, Axle, Engine), each with a Prob >= Threshold check; Model Caching (GPDB/HAWQ); Real-time scoring; Sink; web or mobile app dashboard]
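The "Prob >= Threshold" step in the scoring pipeline could look roughly like the following. This is a sketch only: the function `score`, the probabilities, and the thresholds are all hypothetical values chosen for illustration, and the slides do not say how thresholds were selected.

```python
def score(probabilities, thresholds):
    """Return the job types whose predicted probability meets its threshold."""
    return sorted(
        job for job, p in probabilities.items()
        if p >= thresholds[job]
    )


# Hypothetical outputs of the per-job-type models for one incoming record.
probabilities = {"Body": 0.82, "Axle": 0.10, "Engine": 0.55}
# Hypothetical per-job-type thresholds.
thresholds = {"Body": 0.5, "Axle": 0.5, "Engine": 0.6}

flagged = score(probabilities, thresholds)
print(flagged)
```

Keeping one threshold per job type lets each one-vs-rest model be tuned separately for its own precision and recall trade-off.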
  • 9. MPP Architectural Overview. Think of it as multiple PostgreSQL servers: a Master plus Segments/Workers. Rows are distributed across segments by a particular field (or randomly).
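The two distribution policies mentioned here can be illustrated with a small simulation. This is a sketch under assumptions: `zlib.crc32` is only a stand-in for whatever hash the database uses internally, and the table, the `vehicle_id` key, and both helper functions are hypothetical.

```python
import random
import zlib
from collections import Counter


def segment_for_key(key, num_segments):
    """Map a distribution-key value to a segment using a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def segment_at_random(num_segments, rng):
    """Pick a segment without looking at the row (random distribution)."""
    return rng.randrange(num_segments)


num_segments = 4
rows = [{"vehicle_id": i % 10, "dtc": "B"} for i in range(1000)]

# Distribution by field: rows sharing a key always land on the same segment.
by_key = Counter(segment_for_key(r["vehicle_id"], num_segments) for r in rows)

# Random distribution: placement ignores the row contents.
rng = random.Random(0)
at_random = Counter(segment_at_random(num_segments, rng) for _ in rows)

print("by key:   ", dict(by_key))
print("at random:", dict(at_random))
```

Distributing by a field keeps all rows for one key together, which helps per-key computations, while random placement spreads rows without regard to their contents.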
  • 10. IT TAKES MORE THAN ONE TOOL
  • 11. Open Source Data Science Toolkit. [Figure labels: Key Languages; Key Tools; Modeling Tools; Visualization Tools; Platform; MLlib; PL/X; Pivotal Big Data Suite; GemFire]
  • 12. Scalable, In-Database Machine Learning: Apache MADlib (incubating). Open source (https://github.com/madlib/madlib); works on Greenplum DB, Apache HAWQ and PostgreSQL; in active development by Pivotal. MADlib is now an Apache Software Foundation incubator project!
  • 13. Functions (Predictive Analytics Library, @MADlib_analytic, Aug 2015). Supervised Learning: Regression Models (Cox Proportional Hazards Regression; Elastic Net Regularization; Generalized Linear Models; Linear Regression; Logistic Regression; Marginal Effects; Multinomial Regression; Ordinal Regression; Robust Variance, Clustered Variance; Support Vector Machines), Tree Methods (Decision Tree; Random Forest), Other Methods (Conditional Random Field; Naïve Bayes). Unsupervised Learning: Association Rules (Apriori); Clustering (K-means); Topic Modeling (LDA). Statistics: Descriptive (Cardinality Estimators; Correlation; Summary), Inferential (Hypothesis Tests), Other Statistics (Probability Functions). Other Modules: Conjugate Gradient; Linear Solvers; PMML Export; Random Sampling; Term Frequency for Text. Time Series: ARIMA. Data Types and Transformations: Array Operations; Dimensionality Reduction (PCA); Encoding Categorical Variables; Matrix Operations; Matrix Factorization (SVD, Low Rank); Norms and Distance Functions; Sparse Vectors. Model Evaluation: Cross Validation.
  • 14. Use Case: Predicting insurance claim amounts using structured and unstructured data. Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts.
  • 15. Text analytics on MPP. The unstructured data takes the form of claim comments and claim descriptions (text). Use a bag-of-words approach (unigrams, bigrams), weighted with tf-idf so that distinctive terms count for more than terms that appear everywhere.
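The bag-of-words and tf-idf steps can be sketched with the standard library alone. This is an illustrative version, not the code from the referenced notebook: the claim comments are invented, tokenization is a plain whitespace split, and the weighting used is the basic form tf × log(N / df), where other variants exist.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and emit unigrams plus bigrams."""
    tokens = text.lower().split()
    grams = list(tokens)
    if n_max >= 2:
        grams += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return grams


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    bags = [Counter(ngrams(d)) for d in documents]
    n_docs = len(bags)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for bag in bags for term in bag)
    return [
        {
            term: (count / sum(bag.values())) * math.log(n_docs / df[term])
            for term, count in bag.items()
        }
        for bag in bags
    ]


# Hypothetical claim comments.
docs = [
    "water damage in kitchen",
    "water damage in basement",
    "rear bumper damage",
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term!r}: {w:.3f}")
```

A term such as "damage" that occurs in every comment receives weight zero, while a term unique to one comment, such as "kitchen", receives the highest weight in that comment.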
  • 16. Code walkthrough: Text analytics on MPP. We’ll walk through this Jupyter notebook: github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
  • 17. Use Case: Churn prediction. Build a churn model to predict which customers are most likely to churn. Provide insights into the key factors responsible for churn, so that intervention is possible before a customer churns.
  • 18. Usage Time Series Data. Aggregate weekly usage by user; compute descriptive statistics; extract features based on business expertise.
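The first two steps, weekly aggregation per user and descriptive statistics, might be sketched as follows. Everything here is hypothetical: the event tuples, the choice of ISO weeks, and the particular statistics are illustrative assumptions, and the business-driven features mentioned on the slide are not reproduced.

```python
import statistics
from collections import defaultdict
from datetime import date

# Hypothetical usage events: (user_id, day, minutes_used).
events = [
    ("u1", date(2016, 1, 4), 30),
    ("u1", date(2016, 1, 6), 10),
    ("u1", date(2016, 1, 12), 20),
    ("u2", date(2016, 1, 5), 5),
    ("u2", date(2016, 1, 13), 15),
]


def weekly_usage(events):
    """Sum usage per (user, ISO year, ISO week)."""
    totals = defaultdict(float)
    for user, day, amount in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(user, iso_year, iso_week)] += amount
    return totals


def describe(totals):
    """Per-user descriptive statistics over the weekly totals."""
    per_user = defaultdict(list)
    for (user, _, _), total in sorted(totals.items()):
        per_user[user].append(total)
    return {
        user: {
            "weeks": len(series),
            "mean": statistics.mean(series),
            "max": max(series),
            "stdev": statistics.pstdev(series),
        }
        for user, series in per_user.items()
    }


totals = weekly_usage(events)
features = describe(totals)
for user, stats in features.items():
    print(user, stats)
```

In the database setting the same aggregation would typically be a GROUP BY over user and week, with the statistics computed per user on whichever segment holds that user's rows.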
  • 19. Open Source Analytics Ecosystem. Companies benefit from algorithmic breadth and scalability for building and socializing data science models. Best-of-breed in-memory and in-database tools for an MPP platform. [Figure labels: Algorithms; Visualization; MLlib; PL/X]
  • 20. Data Parallelism through PL/X (X in Python, R, Java, C/C++ and pgSQL). For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++. The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment. plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python). [Figure: a Master Host and Standby Master connected through the Interconnect to multiple Segment Hosts, each running several Segments; SQL is submitted to the master.]
  • 21. User Defined Functions (UDFs) in PL/Python. Procedural languages need to be installed on each database used. The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper; alternatively, like a SQL user-defined function with Python inside. In the example, the first three lines and the last line are the SQL wrapper, and the lines between the $$ markers are normal Python:
        CREATE FUNCTION seasonality (x float[])
            RETURNS float[]
        AS $$
            import statsmodels.api as sm
            s = sm.tsa.seasonal_decompose(x).seasonal
            return s
        $$ LANGUAGE plpythonu;
  • 22. Usage Time Series Data with PL/X. Easily harness open source libraries (for machine learning, signal processing...) from inside your UDF. Runs at scale through data parallelism.
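The data-parallel pattern described here, one stand-alone function applied independently to each user's series, can be imitated outside the database. This is only an analogy under assumptions: a thread pool stands in for the segments, `moving_average` stands in for any library call a UDF might wrap, and the usage data is invented.

```python
from concurrent.futures import ThreadPoolExecutor


def moving_average(series, window=3):
    """Trailing moving average; a stand-in for any stand-alone library call."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical weekly usage per user. In the database, each user's array
# would live on some segment and the UDF would run there.
usage_by_user = {
    "u1": [3.0, 6.0, 9.0, 12.0],
    "u2": [1.0, 1.0, 1.0, 1.0],
    "u3": [0.0, 3.0, 0.0, 3.0],
}

users = list(usage_by_user)
with ThreadPoolExecutor(max_workers=3) as pool:
    # Each series is processed independently, so no coordination is needed.
    smoothed = pool.map(moving_average, (usage_by_user[u] for u in users))
    results = dict(zip(users, smoothed))

for user, values in results.items():
    print(user, values)
```

The property that matters is that no series depends on another, which is what makes the task embarrassingly parallel and lets it scale with the number of segments.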
  • 23. Code walkthrough: Image processing on MPP. In-database Canny edge detection with OpenCV inside a PL/C function: github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
  • 24. Pivotal Data Science Blogs (http://blog.pivotal.io/data-science-pivotal): 1. Scaling native (C++) apps on Pivotal MPP; 2. Predicting commodity futures through Tweets; 3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum; 4. Using data science to predict TV viewer behavior; 5. Twitter NLP: Scaling part-of-speech tagging; 6. Distributed deep learning on MPP and Hadoop; 7. Multi-variate time series forecasting; 8. Pivotal for good – Crisis Text Line
  • 25. Thank You!
  • 26. A NEW PLATFORM FOR A NEW ERA