SlideShare uma empresa Scribd logo
1 de 28
— 0
Essex Lake Group LLC www.essexlg.com
Probabilistic Graphical Model
(Bayesian Network)
Advanced Distributed Machine Learning Labs (ADML)
Essex Lake Group LLC
Knowledge Sharing Session
July, 2015
— 1
Objective:
Technical Objectives:
• How can we identify the probabilistic relationships among the variables in the given data
• For any given target, how can we identify the most influencing variables ?
• Can Bayesian networks help us in finding better insights than the existing Pivot approach ?
Business Objective :
• For the MET004 Account level data, how do we evaluate the performance of “Participation Rate” using PGM
Embracing Bayesian approaches for business problem solving
— 2
▪ How do we construct Probabilistic Graphical Models (PGMs) ?
▪ PGMs on Sales Ratio Data
▪ PGM with respect to CART
Agenda
— 3
What are Probabilistic Graphical Models /Bayesian Networks?
Graph’s Joint Probability Table will be tabulated using the following equation:
P(A, B, C, D) = P(T|A,B) * P(D|T,C) *P(D|C)* P(A) * P(B) * P(C )
• Bayesian Network is Directed Acyclic Graph
• Nodes are variables and edges are relationships between variables.
• Each node has set of values
• Direction of edges indicates Causality i.e. cause-effect relationships among variables
• They are graphical models, capable of displaying relationships clearly and intuitively
• They are directional, thus being capable of representing cause-effect relationships
• They handle uncertainty though the established theory of probability
• They can be used to represent indirect in addition to direct causation
Example: “Car”
A
T
BA
Parent 1 Parent 2
Child
Spouse
C
D
“steering wheel angle”B
A
“gas pedal pressed”
“gear”
“car direction”
C
D
— 4
Network Terminology
Markov Blanket: For a given target variable What are the most
influencing variables ?
Dependency among the variables of a system (Car )
The Markov Blanket of a (Target) node consists of all nodes that make this Target
conditionally independent of all the other nodes in the model:
• The Parents (blue nodes in the example) are used for cutting the information
coming from their ascendants
• The Children (red node) are used for cutting the information coming from their
descendants
• The Spouses (or co-parents, dark green) are used for cutting the information
coming from the ascendants of the Children
If one Variable takes on some value then other variables are going to take that value
B
A
“gas pedal pressed”
“gear”
“car direction” = Forward
C
D
“steering wheel angle”
B
A
“gas pedal pressed” = Yes
“gear” = Forward
“car direction” = Forward
C
D
“steering wheel angle” = 90
Target
A
BA
Parent 1 Parent 2
Child
Spouse
C
D
T
T
— 5
▪ Data obtained after
processing must have:
- categorical values
- must be binned in case
of continuous values
▪ All data must be converted
to factors.
▪ For the fitting of Bayesian
Network, a scoring function φ
is considered
▪ Given data and scoring
function φ, find a network
such that it maximizes the
value of φ.
▪ Model tries various
permutations of adding and
removing arcs, such that final
network has maximum
posterior probability
distribution, starting from a
prior probability distribution
on the possible networks,
conditioned on data
▪ While learning the network,
the joint probability
distribution is created for all
the variables in the data set
▪ This joint probability
distribution is converted to
conditional probabilities for
each dependent variable.
▪ These CPTs are used to solve
any kind of analysis problems.
▪ Import the learned
structure and its CPT's into
SAMIAM to form
hypotheses and verify
them.
01. Processing data
02. Creating
relationships between
variables
03. Build CPTs of final
fitted network
04. Record Network
Constructing Bayesian Network - Process Flow
— 6
How do we interpret the Bayesian Network for problem solving ?
A. Forward Analysis :
Q1. I have not done any mailing activity
this year. How did this affect my
participation percentage?
Q2. To check for conditional dependencies
like mailing activity not being helpful if the
account is already active.
B. Backward Analysis
Q3. If we know that our participation is
high, then find responsible variables and
their values to replicate the result in the
future.
Q4. For my participation bucket to lie in
the higher bucket, should I offer payroll
deduct or not?
Variable Name Evidence
Set
Participation Bucket
value before evidence
Participation Bucket
value after evidence
Change in Values
Sponsored Mail 2014 Y 15.15% 19.46% +4.31%
Non Sponsored Mail 2013 Y 15.15% 18.85% +3.70%
Mail Status 2013 Y 15.15% 18.68% +3.53%
Sponsored Mail 2014 N 15.15% 13.8% -1.35%
Non Sponsored Mail 2013 N 15.15% 10.30% -4.85%
Mail Status 2013 N 15.15% 10.32% -4.83%
All Probability Values for Participation being in the 2.5%-5% (high) bucket.
ES_Supp_Mail_2014 NS_Mail_Status_2013
Participation Bucket
Onsite_Event_ind_2013
Mail_status_2013
A
Example: 1
Target
Fig: Markov Blanket for a target variable
— 7
A. Forward Analysis :
Q1. I have not done any mailing activity
this year. How did this affect my
participation percentage?
Q2. To check for conditional dependencies
like mailing activity not being helpful if the
account is already active.
B. Backward Analysis
Q3. If we know that our participation is
high, then find responsible variables and
their values to replicate the result in the
future.
Q4. For my participation bucket to lie in
the higher bucket, should I offer payroll
deduct or not?
All Probability Values for Participation being in the 2.5%-5% (high) bucket.
Evidence Value Non-Sponsored Mail 2013 Payroll Deduct Account Status
Participation Bucket = 2.5-
5%
Y Y ACTIVE
Participation
Bucket
Variable Evidence Participation Bucket
Value before Evidence
Participation Bucket
Value before
Evidence
Change in
Values
2.5% - 5% Non-Sponsored
Mail 2013
Payroll Deduct
Account Status
Y
Y
ACTIVE
15.15% 28.55% +13.40%
PD_AVAIL_ NS_Mail_Status_ind_2013
Participation
Bucket =2.5-5%
ACCOUNT_STATUS
Example: 2
One more example  How do we interpret the Bayesian Network for problem solving ?
— 8
Most Probable Explanation for given variable (MAP): Process Flow
▪ The network consists of all
the variables in the data set
▪ To perform any kind of
analysis, the target variable
must be selected.
▪ The network is built without
keeping a target variable so
in case analysis needs to be
done on any other variable,
it can also be selected in the
same network.
▪ Target’s parent nodes, child
nodes and all other parents of
target’s child nodes form it’s
Markov Blanket.
▪ These variables make the
target variable independent
from the rest of the variables
in the network.
▪ In Samiam, for the given
Bayesian Network, set
evidence to the target
variable, which is the variable
of our interest.
▪ We need to find what values
should other variables have
so as to observe the
evidenced value of target
variable.
▪ Select the target variable’s
Markov Blanket variables
for which MAP values are
to be computed.
▪ Run MAP query.
▪ This query returns the
value of the selected
variables such that the
posterior probability of the
evidenced target is
maximized.
01. Selection of Target
Variable
02. Find Target’s
Markov Blanket
03. Set Evidence on
Target
04. Run MAP Query
on Markov Blanket
— 9
SAMIAM and R-SHINY Demo
— 10
Employee Sponsored Mail
Participation Bucket
Onsite Event
Payroll Deduct
Email Status
More details on Network Structure:
• Payroll Deduct, Employee Sponsored Mail and Participation Bucket for a V-Structure.
• This V-Structure means that Payroll Deduct and Employee Sponsored Mail are
independent of each other.
• This means, any change in Payroll Deduct won’t influence Employee Sponsored Mail
and vice versa.
• Also, any change in Email Status which is the sibling of Participation Bucket will result in
change in Participation Bucket as well due to the presence of an active trail between the
two.
• Payroll deduct can however, affect Onsite Event through Participation Bucket but not
Email Status due to the presence of a V-Structure between Payroll Deduct and
Employee Sponsored Mail which is the parent of Email Status.
• Similarly, Employee Sponsored Mail can affect Onsite Event due to the presence of the
active trail between them.
• Also, if Participation Bucket is evidenced, then, The V-Structure will be activated and
influence can flow between Payroll Deduct and Employee Sponsored Mail.
— 11
▪ What are Probabilistic Graphical Models (PGM)
▪ PGMs on Sales Ratio Data
▪ PGM with respect to CART
Agenda
— 12
Our work in 3 modules:
1. Data Preparation
Imputation
Bayesian Network
Construction
Binning
Raw Data
Identify the variables
Sensitivity Analysis
(SAMIAM – a tool)
2. Model Building 3. Network Analysis - Hypothesis
Testing
Module 1:
Module 2: Pivot Approach Vs. Bayesian Network Approach
Pivots Bayesian network
Module 3: Observations on CART and Bayesina Networks
CART Bayesian Network
— 13
Data Details
Account Variables
 Company Name
 Market Segment
 Company State
Mail Variables
 Marketing Activity
 Non-Sponsored Mail
 Employee Sponsored Mail
Flag Variables
 New or Existing account flag
Policy Variables
 Program Type
 Payroll Deduct
 Employer Type
1. Sales Account Data (2013 to 2015)
2. Demographic details
3. MetLife’s mailing
4. Nature and tenure policy
5. Agent and employer type details
6. No. of participants and eligible accounts
1. Target Variable : Participation Bucket
(Ratio of no. of participants and no. of eligible)
2. No. of observations : 5007
1. Company Name and Company State depict the account details. Market segment could be local, national, regional, USD or other. Majority of the accounts are regional
2. Mailing variable(Non Sponsored Mail) is crucial for participation bucket
3. Policy Variable (Payroll Deduct) is essential to improve participation bucket
4. Most of the accounts are of Single Carrier type and don’t have Payroll Deduct
0
1000
2000
3000
4000
5000
6000
Total 0%-1% 1%-2.5% 2.5-5% >5%
Frequency
Participation Bucket
— 14
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Grand
Total
N
Y
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Grand
Total
N
Y
Marketing Activity 2013 - For Participation
Bucket in range 2.5%-5%, 11% accounts have
marketing activity in 2013
Onsite Event 2013 - For Participation Bucket in
range 2.5%-5%, 3% accounts have Onsite Events
in 2013
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Grand
Total
N
Y
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Grand
Total
N
Y
Mail Status 2013 - For Participation Bucket in
range 2.5%-5%, 11% accounts have Mail Status
2013
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Grand
Total
N
Y
Employee Sponsored Mail 2013 - For
Participation Bucket in range 2.5%-5%, 5.81%
accounts have Employee Sponsored Mail 2013
Payroll Deduct - For Participation Bucket in range
2.5%-5%, 9% accounts have Payroll Deduct
Non Sponsored Mail 2013 - For Participation
Bucket in range 2.5%-5%, 11% accounts have
Non Sponsored Mail 2013
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% - 1% 1% -
2.5%
2.5% -
5%
>5%
Grand
Total
N
Y
Bivariate Analysis on Target Variable “Participation Bucket”
— 15
32.22%
67.78%
N
Y
64.80%
35.20%
N
Y
35.01%
64.99% N
Y
88.28%
11.72%
N
Y
Profiling of Important Variables against Target Variable – “Participation Bucket”
Marketing Activity 2013 Onsite Event 2013 Payroll Deduct
Non Sponsored Mail 2013 Mail Status 2013 Employee Sponsored Mail 2013
32.55%
67.45%
N
Y
68% of the accounts have Marketing Activity
2013
88% of the accounts don’t have Onsite Event
2013
65% of the accounts don’t have Payroll Deduct
65% of the accounts have Non Sponsored Mail
2013
65% of the accounts have Non Sponsored Mail
2013
25% of the accounts received employee
sponsored mails.
75.40%
24.60%
N
Y
— 16
Bayesian Network discovers NOT ONLY the variables identified by the business users AND
ALSO discovers other most influencing variables
Description: To show how to analyze the variables in a Bayesian Network, we tried to find the values of the variables that influence
the target variable the most, so as to maximize/ minimize the desired target variable class.
The following are the variables found using the Markov Blanket of the Target Variable. When considered together, these make the
target Variable independent of the rest of the network. Thus, it can be said that these hold the maximum influence over the target.
Target Variable:
Participation Bucket
1st Level Markov Blanket Variables:
• Payroll Deduct
• Non Sponsored Mail 2013
• Marketing Activity 2013, 2015
• Mail Status 2013
• Employee Sponsored Mail 2014
• Account Type
• Account Status
• Onsite Event 2013
2nd Level Markov Blanket
Variables:
• Eligibility File
• Non Sponsored Mail 2013
• Employee Sponsored Mail
2014
• Program Type
• Tenure Years
Level 1 Level 2
Business Users’
variables
Business Users’
variables
Business Users’
variables
Target Target
Module 1
— 17
Better Results with Bayesian Network (All Variables)
Variables Selected: Evidence
Payroll Deduct -> Yes
Variables Selected: Evidence
Payroll Deduct -> Yes
Non- Sponsored Mail Status
2013 -> Yes
Mail Status 2013 -> Yes
23.47% 24.49% 23.27%
28.77%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
13.51%
19.57%
28.23%
38.69%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
Inference:
• Payroll Deduct is Critical to account performance
• Payroll Deduct without the support of any Mailing Activity cannot drive account performance.
Description: To show how to analyze the variables in a Bayesian Network, we tried to find the values of the variables that influence
the target variable the most, so as to maximize/ minimize the desired target variable class.
The following are the variables found using the Markov Blanket of the Target Variable. When considered together, these make the
target Variable independent of the rest of the network. Thus, it can be said that these hold the maximum influence over the target.
Module 1
— 18
Full Model Analysis on Bayesian Network
Variables Selected: Evidence
Eligibility File -> Yes
Variables Selected: Evidence
Eligibility File -> Yes
Non- Sponsored Mail -> Yes
Employee Sponsored Mail ->Yes
Inference:
• Payroll Deduct causes the maximum change in the Target Variable. Apart from that, Payroll Deduct’s Markov blanket variables like
Eligibility File, Employee Sponsored Mail should be considered as well.
33.03%
24.35%
19.61%
23.01%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
6.81%
16.01%
28.15%
49.02%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
Description: To show how to analyze the variables in a Bayesian Network, we tried to find the values of the variables that influence
the target variable the most, so as to maximize/ minimize the desired target variable class.
The following are the variables found using the Markov Blanket of the Target Variable. When considered together, these make the
target Variable independent of the rest of the network. Thus, it can be said that these hold the maximum influence over the target.
Module 1
— 19
Hypothesis: Mailing activity significantly improves participation bucket with mailing activity in 2013 having more influence than 2014
Full Model Analysis on Bayesian Network
Variables Selected: Evidence
Employee Sponsored Mail 2014 ->
Yes
Non-Sponsored Mail 2014 -> Yes
Onsite Event 2014 -> Yes
Variables Selected: Evidence
Employee Sponsored Mail 2013 ->
Yes
Non-Sponsored Mail 2013 -> Yes
Onsite Event 2013 -> Yes
29.05%
26.59%
19.84%
24.51%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
4.91%
17.90%
27.20%
49.98%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
Inference :
• Improving Mailing Activity for the year 2013 has more influence than in the year 2014
• Certain combinations of marketing activities tend to more effective, in particular - a combination of Non-sponsored, employee
sponsored and On-site Events for the year 2013 is yielding higher Participation Bucket (12.05% higher)
Module 1
— 20
Hypothesis: Participation Bucket depends on type of Marketing Channel – Eligibility File, Employee Sponsored Mail, Onsite Events,
Payroll Deduct.
Full Model Analysis on Bayesian Network
Variables Selected: Evidence
Marketing Activity 2013 -> Yes
Payroll Deduct -> Yes
Variables Selected: Evidence
Marketing Activity 2013 -> Yes
Onsite Event 2013 -> Yes
Inference :
• Along with Marketing Activity, performance also depends on the type of channels – Payroll Deduct, and Onsite Event Activity 2013
being the most important
15.80%
20.70%
26.97%
36.53%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
15.70%
25.61%
23.79%
34.91%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
Module 1
— 21
Hypothesis: Newer accounts can improve their Participation Bucket by Mailing Activity
Full Model Analysis on Bayesian Network
Variables Selected: Evidence
Tenure Years -> 0
Non-Sponsored Mail 2013 -> Y
Employee Sponsored Mail 2013 -> Yes
Onsite Event 2013 -> Yes
Inference :
• Newer accounts can improve their Participation Bucket by Mailing Activity - specifically, sending non sponsored and employee
sponsored mails and doing some onsite events.
-19.48%
0.74%
7.25%
11.49%
-30.00%
-20.00%
-10.00%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
0%-1% 1%-2.5% 2.5%-5% >5%
Before Evidence
After Evidence
Difference % between probabilites
Module 1
— 22
Same insights from the both Pivot and Bayesian Analysis (3 variables)
Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in pivot
analysis. We then observed the target variable i.e. the Participation Bucket with respect to selected variables. The following results
were obtained:
0% - 1% 1% - 2.5% 2.5% -5% >5%
BROKER CHOICE 25.04% 21.41% 24.74% 28.81%
MARSH COMBINED 41.61% 29.93% 11.68% 16.79%
METLIFE CHOICE 20.00% 31.11% 26.67% 22.22%
BROKER CHOICE 25.04% 21.41% 24.74% 28.81%
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
40.00%
45.00%
Participation Bucket
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Pivot N 54.64% 27.22% 12.16% 5.98%
Pivot Y 16.09% 20.96% 28.16% 34.79%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
Participation Bucket
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Bayesian N 54.34% 27.19% 12.30% 6.17%
Bayesian Y 16.16% 21.01% 28.11% 34.72%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
Participation Bucket
0% - 1% 1% - 2.5% 2.5% -5% >5%
SINGLE CARRIER 25.02% 21.46% 24.72% 28.80%
MARSH COMBINED 41.41% 29.87% 11.84% 16.88%
METLIFE CHOICE 20.18% 30.89% 26.61% 22.32%
BROKER CHOICE 28.64% 28.64% 25.40% 17.32%
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
40.00%
45.00%
Participaton Bucket
Program TypeProgram Type
Payroll Deduct Payroll Deduct
Pivot Analysis Bayesian Analysis
Module 2
— 23
More insights with more interactions: Pivot and Bayesian (6 variables)
Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in pivot
analysis. We then observed the target variable i.e. the Participation Bucket with respect to selected variables. The following results
were obtained:
Pivot Analysis Bayesian Analysis
0% - 1% 1% - 2.5% 2.5% -5% >5%
N 61.32% 24.53% 3.77% 10.38%
Y 40.54% 23.94% 20.46% 15.06%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00% Non – Sponsored Mail Non – Sponsored Mail
0% - 1% 1% - 2.5% 2.5% -5% >5%
N 59.32% 24.77% 4.77% 11.14%
Y 40.47% 23.99% 20.35% 15.18%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
Bayesian Advantage:
• Pivot Tables often return zero values in case it cannot return values according to the set filter when multiple filters are applied.
• Bayesian Networks do not work on values but on conditional probabilities. Thus, even though data rows may be unavailable,
probability of the occurrence of each bucket according to the evidence given to filter variables in the network will still be returned.
Variables Selected:
 Non Sponsored Mail 2014
 Employee Sponsored Mail 2014
 Onsite Event 2014
 Microsite Activity 2014
 Participation Bucket 2014
 Email Status 2013
Module 2
— 24
Better Insights with Bayesian (Interaction among all variables)
Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in pivot
analysis. We then observed the target variable i.e. the Participation Bucket with respect to selected variables. The following results
were obtained:
0% -
1%
1% -
2.5%
2.5% -
5%
>5%
Pivot N 54.64% 27.22% 12.16% 5.98%
Pivot Y 16.09% 20.96% 28.16% 34.79%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00% Payroll Deduct
Pivot Analysis Bayesian Analysis
0% - 1%
1% -
2.5%
2.5% -5% >5%
N 65.33% 18.70% 8.19% 7.78%
Y 23.47% 24.49% 23.27% 28.77%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
0% - 1%
1% -
2.5%
2.5% -
5%
>5%
N 61.32% 24.53% 3.77% 10.38%
Y 40.54% 23.94% 20.46% 15.06%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00% Non – Sponsored Mail
0% - 1%
1% -
2.5%
2.5% -
5%
>5%
N 57.85% 20.74% 10.30% 11.11%
Y 38.28% 22.24% 18.85% 20.62%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00% Non – Sponsored Mail
Payroll Deduct
Module 2
— 25
▪What are Probabilistic Graphical Models (PGM)
▪ PGMs on Sales Ratio Data
▪PGM with respect to CART
Agenda
— 26
Observations On Bayesian Networks and CART
CART is target variable dependent. Problem specific and rigid in
terms of modification.
Bayesian Network doesn’t involve fixed target variable. It is
easier to build and modify.
To perform second level analysis we need to construct a new
classification tree taking important variables as a target variable.
Second Level of analysis can be done by considering Markov
Blanket of the important variable in target’s Markov Blanket.
Participation Bucket
Payroll Deduct
No variable can be added to the tree forcibly or otherwise if
needed.
As all variables of data set are already in the model, such need
would never arise.
Module 3
— 27
CART (Classification And Regression Tree): For the same data set, we try to see how many business questions can the Bayesian Network
and CART models answer.
ES_Supp_Mail_ind_2014
NS_Mail_Status_ind_2013
Participation Bucket
Onsite_Event_ind_2013
Mail_status_ind_2013
Participation
Bucket
Zone
Open to New
Business
Open to New
Business
Eligibility File
Questions CART Bayesian Network
If we know that our participation is high, then find responsible variables and their values to replicate the result
in the future.
Yes Yes
For my participation bucket to lie in the higher bucket, should I offer payroll deduct or not? Yes Yes
I have not done any mailing activity this year. How did this affect my participation percentage? No Yes
To check for conditional dependencies like mailing activity not being helpful if the account is already active. No Yes
Is the model easily interpretable? Yes Yes
N N
Observations On Bayesian Networks and CART
Module 3

Mais conteúdo relacionado

Destaque

PLANIFICACION ARANA OSORIO
PLANIFICACION ARANA OSORIOPLANIFICACION ARANA OSORIO
PLANIFICACION ARANA OSORIOMarleni Herrera
 
Δήλωση Συμμετοχής
Δήλωση ΣυμμετοχήςΔήλωση Συμμετοχής
Δήλωση Συμμετοχήςcsdtesting
 
Presentation for agl
Presentation for aglPresentation for agl
Presentation for aglshokhin5000
 
Colegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TM
Colegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TMColegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TM
Colegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TMcolegioholmberg
 
ERATA AND SUMMARY (ten years after starting investigation)
ERATA AND SUMMARY (ten years after starting investigation)ERATA AND SUMMARY (ten years after starting investigation)
ERATA AND SUMMARY (ten years after starting investigation)siegfried van hoek
 
Презентация УФСИН России по Тюменской области
Презентация УФСИН России по Тюменской областиПрезентация УФСИН России по Тюменской области
Презентация УФСИН России по Тюменской областиuipsr
 
Diseño publicitario
Diseño publicitarioDiseño publicitario
Diseño publicitarioDexiEPitanoA
 
24 reasons i should be alive
24 reasons i should be alive24 reasons i should be alive
24 reasons i should be aliveJustice Lukeshi
 
Etapas del manejo de información anorexia.docx
Etapas del manejo de información anorexia.docxEtapas del manejo de información anorexia.docx
Etapas del manejo de información anorexia.docxVanessa Gonzalez
 
La gestión educativa
La gestión educativaLa gestión educativa
La gestión educativamaireny01
 
Composiciones bidimensionales
Composiciones bidimensionalesComposiciones bidimensionales
Composiciones bidimensionalesdeltha12
 
Usability Primer
Usability  PrimerUsability  Primer
Usability PrimerRavi Shyam
 

Destaque (20)

PLANIFICACION ARANA OSORIO
PLANIFICACION ARANA OSORIOPLANIFICACION ARANA OSORIO
PLANIFICACION ARANA OSORIO
 
Diapos
DiaposDiapos
Diapos
 
Experiencias
ExperienciasExperiencias
Experiencias
 
Δήλωση Συμμετοχής
Δήλωση ΣυμμετοχήςΔήλωση Συμμετοχής
Δήλωση Συμμετοχής
 
Presentation for agl
Presentation for aglPresentation for agl
Presentation for agl
 
World Economic Forum on East Asia 2007
World Economic Forum on East Asia 2007World Economic Forum on East Asia 2007
World Economic Forum on East Asia 2007
 
ePAD project
ePAD projectePAD project
ePAD project
 
Colegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TM
Colegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TMColegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TM
Colegio Alemán de Quilmes - Sala Amarilla y Sala Verde 2013 TM
 
wqwqrTask 1
wqwqrTask 1wqwqrTask 1
wqwqrTask 1
 
ERATA AND SUMMARY (ten years after starting investigation)
ERATA AND SUMMARY (ten years after starting investigation)ERATA AND SUMMARY (ten years after starting investigation)
ERATA AND SUMMARY (ten years after starting investigation)
 
Презентация УФСИН России по Тюменской области
Презентация УФСИН России по Тюменской областиПрезентация УФСИН России по Тюменской области
Презентация УФСИН России по Тюменской области
 
Zaragoza turismo-108
Zaragoza turismo-108Zaragoza turismo-108
Zaragoza turismo-108
 
Diseño publicitario
Diseño publicitarioDiseño publicitario
Diseño publicitario
 
24 reasons i should be alive
24 reasons i should be alive24 reasons i should be alive
24 reasons i should be alive
 
Etapas del manejo de información anorexia.docx
Etapas del manejo de información anorexia.docxEtapas del manejo de información anorexia.docx
Etapas del manejo de información anorexia.docx
 
La gestión educativa
La gestión educativaLa gestión educativa
La gestión educativa
 
Composiciones bidimensionales
Composiciones bidimensionalesComposiciones bidimensionales
Composiciones bidimensionales
 
Tutorial Pageonex
Tutorial PageonexTutorial Pageonex
Tutorial Pageonex
 
Linealizacion
LinealizacionLinealizacion
Linealizacion
 
Usability Primer
Usability  PrimerUsability  Primer
Usability Primer
 

Semelhante a PPT_ADML_PGM_KnowledgeSharing_9JULY2015_v1

Using PySpark to Scale Markov Decision Problems for Policy Exploration
Using PySpark to Scale Markov Decision Problems for Policy ExplorationUsing PySpark to Scale Markov Decision Problems for Policy Exploration
Using PySpark to Scale Markov Decision Problems for Policy ExplorationDatabricks
 
Phase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxPhase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxJayalaxmiRs
 
Phase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxPhase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxJayalaxmiRs
 
Phase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxPhase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxBrendaL11
 
Performance Evaluation of Employee
Performance Evaluation of EmployeePerformance Evaluation of Employee
Performance Evaluation of EmployeeAnil Shrestha
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Simplilearn
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataDatabricks
 
IRJET- User Preferences and Similarity Estimation
IRJET- User Preferences and Similarity EstimationIRJET- User Preferences and Similarity Estimation
IRJET- User Preferences and Similarity EstimationIRJET Journal
 
SAS Training session - By Pratima
SAS Training session  -  By Pratima SAS Training session  -  By Pratima
SAS Training session - By Pratima Pratima Pandey
 
Big Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing AttributionBig Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing AttributionMatt Stubbs
 
Data Mining to Classify Telco Churners
Data Mining to Classify Telco ChurnersData Mining to Classify Telco Churners
Data Mining to Classify Telco ChurnersMohitMhapuskar
 
laptop price prediction presentation
laptop price prediction presentationlaptop price prediction presentation
laptop price prediction presentationNeerajNishad4
 
Step by Step guide to executing an analytics project
Step by Step guide to executing an analytics projectStep by Step guide to executing an analytics project
Step by Step guide to executing an analytics projectRamkumar Ravichandran
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 

Semelhante a PPT_ADML_PGM_KnowledgeSharing_9JULY2015_v1 (20)

Using PySpark to Scale Markov Decision Problems for Policy Exploration
Using PySpark to Scale Markov Decision Problems for Policy ExplorationUsing PySpark to Scale Markov Decision Problems for Policy Exploration
Using PySpark to Scale Markov Decision Problems for Policy Exploration
 
Phase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxPhase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptx
 
Phase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxPhase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptx
 
Phase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptxPhase_3_Introduction_to_Statistical_Analysis.pptx
Phase_3_Introduction_to_Statistical_Analysis.pptx
 
Telemarketing prediction project
Telemarketing prediction projectTelemarketing prediction project
Telemarketing prediction project
 
Data Mining
Data MiningData Mining
Data Mining
 
Performance Evaluation of Employee
Performance Evaluation of EmployeePerformance Evaluation of Employee
Performance Evaluation of Employee
 
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Algori...
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Predictive modeling
Predictive modelingPredictive modeling
Predictive modeling
 
IRJET- User Preferences and Similarity Estimation
IRJET- User Preferences and Similarity EstimationIRJET- User Preferences and Similarity Estimation
IRJET- User Preferences and Similarity Estimation
 
SAS Training session - By Pratima
SAS Training session  -  By Pratima SAS Training session  -  By Pratima
SAS Training session - By Pratima
 
Data processing
Data processingData processing
Data processing
 
Big Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing AttributionBig Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
Big Data LDN 2017: Advanced Analytics Applied to Marketing Attribution
 
Data Mining to Classify Telco Churners
Data Mining to Classify Telco ChurnersData Mining to Classify Telco Churners
Data Mining to Classify Telco Churners
 
laptop price prediction presentation
laptop price prediction presentationlaptop price prediction presentation
laptop price prediction presentation
 
Step by Step guide to executing an analytics project
Step by Step guide to executing an analytics projectStep by Step guide to executing an analytics project
Step by Step guide to executing an analytics project
 
Employee mode of commuting
Employee mode of commutingEmployee mode of commuting
Employee mode of commuting
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 

PPT_ADML_PGM_KnowledgeSharing_9JULY2015_v1

  • 1. — 0 Essex Lake Group LLC www.essexlg.com Probabilistic Graphical Model (Bayesian Network) Advanced Distributed Machine Learning Labs (ADML) Essex Lake Group LLC Knowledge Sharing Session July, 2015
  • 2. — 1 Objective: Technical Objectives: • How can we identify the probabilistic relationships among the variables in the given data • For any given target, how can we identify the most influencing variables ? • Can Bayesian networks help us in finding better insights than the existing Pivot approach ? Business Objective : • For the MET004 Account level data, how do we evaluate the performance of “Participation Rate” using PGM Embracing Bayesian approaches for business problem solving
  • 3. — 2 ▪ How do we construct Probabilistic Graphical Models (PGMs) ? ▪ PGMs on Sales Ratio Data ▪ PGM with respect to CART Agenda
  • 4. — 3 What are Probabilistic Graphical Models /Bayesian Networks? Graph’s Joint Probability Table will be tabulated using the following equation: P(A, B, C, D) = P(T|A,B) * P(D|T,C) *P(D|C)* P(A) * P(B) * P(C ) • Bayesian Network is Directed Acyclic Graph • Nodes are variables and edges are relationships between variables. • Each node has set of values • Direction of edges indicates Causality i.e. cause-effect relationships among variables • They are graphical models, capable of displaying relationships clearly and intuitively • They are directional, thus being capable of representing cause-effect relationships • They handle uncertainty though the established theory of probability • They can be used to represent indirect in addition to direct causation Example: “Car” A T BA Parent 1 Parent 2 Child Spouse C D “steering wheel angle”B A “gas pedal pressed” “gear” “car direction” C D
  • 5. — 4 Network Terminology Markov Blanket: For a given target variable What are the most influencing variables ? Dependency among the variables of a system (Car ) The Markov Blanket of a (Target) node consists of all nodes that make this Target conditionally independent of all the other nodes in the model: • The Parents (blue nodes in the example) are used for cutting the information coming from their ascendants • The Children (red node) are used for cutting the information coming from their descendants • The Spouses (or co-parents, dark green) are used for cutting the information coming from the ascendants of the Children If one Variable takes on some value then other variables are going to take that value B A “gas pedal pressed” “gear” “car direction” = Forward C D “steering wheel angle” B A “gas pedal pressed” = Yes “gear” = Forward “car direction” = Forward C D “steering wheel angle” = 90 Target A BA Parent 1 Parent 2 Child Spouse C D T T
  • 6. — 5 ▪ Data obtained after processing must have: - categorical values - must be binned in case of continuous values ▪ All data must be converted to factors. ▪ For the fitting of Bayesian Network, a scoring function φ is considered ▪ Given data and scoring function φ, find a network such that it maximizes the value of φ. ▪ Model tries various permutations of adding and removing arcs, such that final network has maximum posterior probability distribution, starting from a prior probability distribution on the possible networks, conditioned on data ▪ While learning the network, the joint probability distribution is created for all the variables in the data set ▪ This joint probability distribution is converted to conditional probabilities for each dependent variable. ▪ These CPTs are used to solve any kind of analysis problems. ▪ Import the learned structure and its CPT's into SAMIAM to form hypotheses and verify them. 01. Processing data 02. Creating relationships between variables 03. Build CPTs of final fitted network 04. Record Network Constructing Bayesian Network - Process Flow
  • 7. — 6 How do we interpret the Bayesian Network for problem solving ? A. Forward Analysis : Q1. I have not done any mailing activity this year. How did this affect my participation percentage? Q2. To check for conditional dependencies like mailing activity not being helpful if the account is already active. B. Backward Analysis Q3. If we know that our participation is high, then find responsible variables and their values to replicate the result in the future. Q4. For my participation bucket to lie in the higher bucket, should I offer payroll deduct or not? Variable Name Evidence Set Participation Bucket value before evidence Participation Bucket value after evidence Change in Values Sponsored Mail 2014 Y 15.15% 19.46% +4.31% Non Sponsored Mail 2013 Y 15.15% 18.85% +3.70% Mail Status 2013 Y 15.15% 18.68% +3.53% Sponsored Mail 2014 N 15.15% 13.8% -1.35% Non Sponsored Mail 2013 N 15.15% 10.30% -4.85% Mail Status 2013 N 15.15% 10.32% -4.83% All Probability Values for Participation being in the 2.5%-5% (high) bucket. ES_Supp_Mail_2014 NS_Mail_Status_2013 Participation Bucket Onsite_Event_ind_2013 Mail_status_2013 A Example: 1 Target Fig: Markov Blanket for a target variable
  • 8. — 7 A. Forward Analysis : Q1. I have not done any mailing activity this year. How did this affect my participation percentage? Q2. To check for conditional dependencies like mailing activity not being helpful if the account is already active. B. Backward Analysis Q3. If we know that our participation is high, then find responsible variables and their values to replicate the result in the future. Q4. For my participation bucket to lie in the higher bucket, should I offer payroll deduct or not? All Probability Values for Participation being in the 2.5%-5% (high) bucket. Evidence Value Non-Sponsored Mail 2013 Payroll Deduct Account Status Participation Bucket = 2.5- 5% Y Y ACTIVE Participation Bucket Variable Evidence Participation Bucket Value before Evidence Participation Bucket Value before Evidence Change in Values 2.5% - 5% Non-Sponsored Mail 2013 Payroll Deduct Account Status Y Y ACTIVE 15.15% 28.55% +13.40% PD_AVAIL_ NS_Mail_Status_ind_2013 Participation Bucket =2.5-5% ACCOUNT_STATUS Example: 2 One more example  How do we interpret the Bayesian Network for problem solving ?
  • 9. — 8 Most Probable Explanation for given variable (MAP): Process Flow ▪ The network consists of all the variables in the data set ▪ To perform any kind of analysis, the target variable must be selected. ▪ The network is built without keeping a target variable so in case analysis needs to be done on any other variable, it can also be selected in the same network. ▪ Target’s parent nodes, child nodes and all other parents of target’s child nodes form it’s Markov Blanket. ▪ These variables make the target variable independent from the rest of the variables in the network. ▪ In Samiam, for the given Bayesian Network, set evidence to the target variable, which is the variable of our interest. ▪ We need to find what values should other variables have so as to observe the evidenced value of target variable. ▪ Select the target variable’s Markov Blanket variables for which MAP values are to be computed. ▪ Run MAP query. ▪ This query returns the value of the selected variables such that the posterior probability of the evidenced target is maximized. 01. Selection of Target Variable 02. Find Target’s Markov Blanket 03. Set Evidence on Target 04. Run MAP Query on Markov Blanket
  • 10. — 9 SAMIAM and R-SHINY Demo
  • 11. — 10 Employee Sponsored Mail Participation Bucket Onsite Event Payroll Deduct Email Status More details on Network Structure: • Payroll Deduct, Employee Sponsored Mail and Participation Bucket for a V-Structure. • This V-Structure means that Payroll Deduct and Employee Sponsored Mail are independent of each other. • This means, any change in Payroll Deduct won’t influence Employee Sponsored Mail and vice versa. • Also, any change in Email Status which is the sibling of Participation Bucket will result in change in Participation Bucket as well due to the presence of an active trail between the two. • Payroll deduct can however, affect Onsite Event through Participation Bucket but not Email Status due to the presence of a V-Structure between Payroll Deduct and Employee Sponsored Mail which is the parent of Email Status. • Similarly, Employee Sponsored Mail can affect Onsite Event due to the presence of the active trail between them. • Also, if Participation Bucket is evidenced, then, The V-Structure will be activated and influence can flow between Payroll Deduct and Employee Sponsored Mail.
  • 12. — 11 ▪ What are Probabilistic Graphical Models (PGM) ▪ PGMs on Sales Ratio Data ▪ PGM with respect to CART Agenda
  • 13. — 12 Our work in 3 modules: 1. Data Preparation Imputation Bayesian Network Construction Binning Raw Data Identify the variables Sensitivity Analysis (SAMIAM – a tool) 2. Model Building 3. Network Analysis - Hypothesis Testing Module 1: Module 2: Pivot Approach Vs. Bayesian Network Approach Pivots Bayesian network Module 3: Observations on CART and Bayesina Networks CART Bayesian Network
  • 14. — 13 Data Details Account Variables  Company Name  Market Segment  Company State Mail Variables  Marketing Activity  Non-Sponsored Mail  Employee Sponsored Mail Flag Variables  New or Existing account flag Policy Variables  Program Type  Payroll Deduct  Employer Type 1. Sales Account Data (2013 to 2015) 2. Demographic details 3. MetLife’s mailing 4. Nature and tenure policy 5. Agent and employer type details 6. No. of participants and eligible accounts 1. Target Variable : Participation Bucket (Ratio of no. of participants and no. of eligible) 2. No. of observations : 5007 1. Company Name and Company State depict the account details. Market segment could be local, national, regional, USD or other. Majority of the accounts are regional 2. Mailing variable(Non Sponsored Mail) is crucial for participation bucket 3. Policy Variable (Payroll Deduct) is essential to improve participation bucket 4. Most of the accounts are of Single Carrier type and don’t have Payroll Deduct 0 1000 2000 3000 4000 5000 6000 Total 0%-1% 1%-2.5% 2.5-5% >5% Frequency Participation Bucket
  • 15. — 14 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% Grand Total N Y 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% Grand Total N Y Marketing Activity 2013 - For Participation Bucket in range 2.5%-5%, 11% accounts have marketing activity in 2013 Onsite Event 2013 - For Participation Bucket in range 2.5%-5%, 3% accounts have Onsite Events in 2013 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% Grand Total N Y 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% Grand Total N Y Mail Status 2013 - For Participation Bucket in range 2.5%-5%, 11% accounts have Mail Status 2013 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% Grand Total N Y Employee Sponsored Mail 2013 - For Participation Bucket in range 2.5%-5%, 5.81% accounts have Employee Sponsored Mail 2013 Payroll Deduct - For Participation Bucket in range 2.5%-5%, 9% accounts have Payroll Deduct Non Sponsored Mail 2013 - For Participation Bucket in range 2.5%-5%, 11% accounts have Non Sponsored Mail 2013 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% Grand Total N Y Bivariate Analysis on Target Variable “Participation Bucket”
  • 16. — 15 32.22% 67.78% N Y 64.80% 35.20% N Y 35.01% 64.99% N Y 88.28% 11.72% N Y Profiling of Important Variables against Target Variable – “Participation Bucket” Marketing Activity 2013 Onsite Event 2013 Payroll Deduct Non Sponsored Mail 2013 Mail Status 2013 Employee Sponsored Mail 2013 32.55% 67.45% N Y 68% of the accounts have Marketing Activity 2013 88% of the accounts don’t have Onsite Event 2013 65% of the accounts don’t have Payroll Deduct 65% of the accounts have Non Sponsored Mail 2013 65% of the accounts have Non Sponsored Mail 2013 25% of the accounts received employee sponsored mails. 75.40% 24.60% N Y
  • 17. — 16 Bayesian Network discovers NOT ONLY the variables identified by the business users AND ALSO discovers other most influencing variables Description: To show how to analyze the variables in a Bayesian Network, we tried to find the values of the variables that influence the target variable the most, so as to maximize/ minimize the desired target variable class. The following are the variables found using the Markov Blanket of the Target Variable. When considered together, these make the target Variable independent of the rest of the network. Thus, it can be said that these hold the maximum influence over the target. Target Variable: Participation Bucket 1st Level Markov Blanket Variables: • Payroll Deduct • Non Sponsored Mail 2013 • Marketing Activity 2013, 2015 • Mail Status 2013 • Employee Sponsored Mail 2014 • Account Type • Account Status • Onsite Event 2013 2nd Level Markov Blanket Variables: • Eligibility File • Non Sponsored Mail 2013 • Employee Sponsored Mail 2014 • Program Type • Tenure Years Level 1 Level 2 Business Users’ variables Business Users’ variables Business Users’ variables Target Target Module 1
  • 18. — 17 Better Results with Bayesian Network (All Variables) Variables Selected: Evidence Payroll Deduct -> Yes Variables Selected: Evidence Payroll Deduct -> Yes Non- Sponsored Mail Status 2013 -> Yes Mail Status 2013 -> Yes 23.47% 24.49% 23.27% 28.77% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence 13.51% 19.57% 28.23% 38.69% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence Inference: • Payroll Deduct is Critical to account performance • Payroll Deduct without the support of any Mailing Activity cannot drive account performance. Description: To show how to analyze the variables in a Bayesian Network, we tried to find the values of the variables that influence the target variable the most, so as to maximize/ minimize the desired target variable class. The following are the variables found using the Markov Blanket of the Target Variable. When considered together, these make the target Variable independent of the rest of the network. Thus, it can be said that these hold the maximum influence over the target. Module 1
  • 19. — 18 Full Model Analysis on Bayesian Network Variables Selected: Evidence Eligibility File -> Yes Variables Selected: Evidence Eligibility File -> Yes Non- Sponsored Mail -> Yes Employee Sponsored Mail ->Yes Inference: • Payroll Deduct causes the maximum change in the Target Variable. Apart from that, Payroll Deduct’s Markov blanket variables like Eligibility File, Employee Sponsored Mail should be considered as well. 33.03% 24.35% 19.61% 23.01% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence 6.81% 16.01% 28.15% 49.02% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence Description: To show how to analyze the variables in a Bayesian Network, we tried to find the values of the variables that influence the target variable the most, so as to maximize/ minimize the desired target variable class. The following are the variables found using the Markov Blanket of the Target Variable. When considered together, these make the target Variable independent of the rest of the network. Thus, it can be said that these hold the maximum influence over the target. Module 1
  • 20. — 19 Hypothesis: Mailing activity significantly improves participation bucket with mailing activity in 2013 having more influence than 2014 Full Model Analysis on Bayesian Network Variables Selected: Evidence Employee Sponsored Mail 2014 -> Yes Non-Sponsored Mail 2014 -> Yes Onsite Event 2014 -> Yes Variables Selected: Evidence Employee Sponsored Mail 2013 -> Yes Non-Sponsored Mail 2013 -> Yes Onsite Event 2013 -> Yes 29.05% 26.59% 19.84% 24.51% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence 4.91% 17.90% 27.20% 49.98% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence Inference : • Improving Mailing Activity for the year 2013 has more influence than in the year 2014 • Certain combinations of marketing activities tend to more effective, in particular - a combination of Non-sponsored, employee sponsored and On-site Events for the year 2013 is yielding higher Participation Bucket (12.05% higher) Module 1
  • 21. — 20 Hypothesis: Participation Bucket depends on type of Marketing Channel – Eligibility File, Employee Sponsored Mail, Onsite Events, Payroll Deduct. Full Model Analysis on Bayesian Network Variables Selected: Evidence Marketing Activity 2013 -> Yes Payroll Deduct -> Yes Variables Selected: Evidence Marketing Activity 2013 -> Yes Onsite Event 2013 -> Yes Inference : • Along with Marketing Activity, performance also depends on the type of channels – Payroll Deduct, and Onsite Event Activity 2013 being the most important 15.80% 20.70% 26.97% 36.53% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence 15.70% 25.61% 23.79% 34.91% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence Module 1
  • 22. — 21 Hypothesis: Newer accounts can improve their Participation Bucket by Mailing Activity Full Model Analysis on Bayesian Network Variables Selected: Evidence Tenure Years -> 0 Non-Sponsored Mail 2013 -> Y Employee Sponsored Mail 2013 -> Yes Onsite Event 2013 -> Yes Inference : • Newer accounts can improve their Participation Bucket by Mailing Activity - specifically, sending non sponsored and employee sponsored mails and doing some onsite events. -19.48% 0.74% 7.25% 11.49% -30.00% -20.00% -10.00% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 0%-1% 1%-2.5% 2.5%-5% >5% Before Evidence After Evidence Difference % between probabilites Module 1
  • 23. — 22 Same insights from the both Pivot and Bayesian Analysis (3 variables) Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in pivot analysis. We then observed the target variable i.e. the Participation Bucket with respect to selected variables. The following results were obtained: 0% - 1% 1% - 2.5% 2.5% -5% >5% BROKER CHOICE 25.04% 21.41% 24.74% 28.81% MARSH COMBINED 41.61% 29.93% 11.68% 16.79% METLIFE CHOICE 20.00% 31.11% 26.67% 22.22% BROKER CHOICE 25.04% 21.41% 24.74% 28.81% 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00% 45.00% Participation Bucket 0% - 1% 1% - 2.5% 2.5% - 5% >5% Pivot N 54.64% 27.22% 12.16% 5.98% Pivot Y 16.09% 20.96% 28.16% 34.79% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% Participation Bucket 0% - 1% 1% - 2.5% 2.5% - 5% >5% Bayesian N 54.34% 27.19% 12.30% 6.17% Bayesian Y 16.16% 21.01% 28.11% 34.72% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% Participation Bucket 0% - 1% 1% - 2.5% 2.5% -5% >5% SINGLE CARRIER 25.02% 21.46% 24.72% 28.80% MARSH COMBINED 41.41% 29.87% 11.84% 16.88% METLIFE CHOICE 20.18% 30.89% 26.61% 22.32% BROKER CHOICE 28.64% 28.64% 25.40% 17.32% 0.00% 5.00% 10.00% 15.00% 20.00% 25.00% 30.00% 35.00% 40.00% 45.00% Participaton Bucket Program TypeProgram Type Payroll Deduct Payroll Deduct Pivot Analysis Bayesian Analysis Module 2
  • 24. — 23 More insights with more interactions: Pivot and Bayesian (6 variables) Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in pivot analysis. We then observed the target variable i.e. the Participation Bucket with respect to selected variables. The following results were obtained: Pivot Analysis Bayesian Analysis 0% - 1% 1% - 2.5% 2.5% -5% >5% N 61.32% 24.53% 3.77% 10.38% Y 40.54% 23.94% 20.46% 15.06% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% Non – Sponsored Mail Non – Sponsored Mail 0% - 1% 1% - 2.5% 2.5% -5% >5% N 59.32% 24.77% 4.77% 11.14% Y 40.47% 23.99% 20.35% 15.18% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% Bayesian Advantage: • Pivot Tables often return zero values in case it cannot return values according to the set filter when multiple filters are applied. • Bayesian Networks do not work on values but on conditional probabilities. Thus, even though data rows may be unavailable, probability of the occurrence of each bucket according to the evidence given to filter variables in the network will still be returned. Variables Selected:  Non Sponsored Mail 2014  Employee Sponsored Mail 2014  Onsite Event 2014  Microsite Activity 2014  Participation Bucket 2014  Email Status 2013 Module 2
  • 25. — 24 Better Insights with Bayesian (Interaction among all variables) Description: To test the accuracy of the Bayesian network, we built models with only those variables that were considered in pivot analysis. We then observed the target variable i.e. the Participation Bucket with respect to selected variables. The following results were obtained: 0% - 1% 1% - 2.5% 2.5% - 5% >5% Pivot N 54.64% 27.22% 12.16% 5.98% Pivot Y 16.09% 20.96% 28.16% 34.79% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% Payroll Deduct Pivot Analysis Bayesian Analysis 0% - 1% 1% - 2.5% 2.5% -5% >5% N 65.33% 18.70% 8.19% 7.78% Y 23.47% 24.49% 23.27% 28.77% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 0% - 1% 1% - 2.5% 2.5% - 5% >5% N 61.32% 24.53% 3.77% 10.38% Y 40.54% 23.94% 20.46% 15.06% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% Non – Sponsored Mail 0% - 1% 1% - 2.5% 2.5% - 5% >5% N 57.85% 20.74% 10.30% 11.11% Y 38.28% 22.24% 18.85% 20.62% 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% Non – Sponsored Mail Payroll Deduct Module 2
  • 26. — 25 ▪What are Probabilistic Graphical Models (PGM) ▪ PGMs on Sales Ratio Data ▪PGM with respect to CART Agenda
  • 27. — 26 Observations On Bayesian Networks and CART CART is target variable dependent. Problem specific and rigid in terms of modification. Bayesian Network doesn’t involve fixed target variable. It is easier to build and modify. To perform second level analysis we need to construct a new classification tree taking important variables as a target variable. Second Level of analysis can be done by considering Markov Blanket of the important variable in target’s Markov Blanket. Participation Bucket Payroll Deduct No variable can be added to the tree forcibly or otherwise if needed. As all variables of data set are already in the model, such need would never arise. Module 3
  • 28. — 27 CART (Classification And Regression Tree): For the same data set, we try to see how many business questions can the Bayesian Network and CART models answer. ES_Supp_Mail_ind_2014 NS_Mail_Status_ind_2013 Participation Bucket Onsite_Event_ind_2013 Mail_status_ind_2013 Participation Bucket Zone Open to New Business Open to New Business Eligibility File Questions CART Bayesian Network If we know that our participation is high, then find responsible variables and their values to replicate the result in the future. Yes Yes For my participation bucket to lie in the higher bucket, should I offer payroll deduct or not? Yes Yes I have not done any mailing activity this year. How did this affect my participation percentage? No Yes To check for conditional dependencies like mailing activity not being helpful if the account is already active. No Yes Is the model easily interpretable? Yes Yes N N Observations On Bayesian Networks and CART Module 3