Causal Inference and Direct Effects
Pearl’s Graph Surgery and Jouffe’s Likelihood Matching
Illustrated with Simpson’s Paradox and a Marketing Mix Model




Stefan Conrady, stefan.conrady@conradyscience.com

Dr. Lionel Jouffe, jouffe@bayesia.com

September 15, 2011




Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting




Table of Contents

Introduction
  Motivation & Objective
  Overview
    Notation

I. Methods of Causal Inference
  Simpson's Paradox Example
  Neyman-Rubin Model of Causal Inference
    Inference Based on Experimental Data
    Inference from Observational Data
  The Bayesian Network Representation
  Pearl's Do-Operator
    Causal Networks
    Intervention
  Jouffe's Likelihood Matching (LM)
    Importance of Network Performance
  Summary

II. Practical Applications of Direct Effects and Causal Inference
  The Marketing Mix Model Example
  CPG Example & Dataset
  Model Development
    Data Import
    Supervised Learning
    Network Performance
  Model Analysis
    Pearson's Correlation
    Mutual Information
  Observational Inference
    Total Effects on Target
  Causal Inference
    Pearl's Do-Operator
    Direct Effects with Likelihood Matching (LM)
    Causal Inference as an Afterthought
    Causal Reasoning
  Marketing Mix Optimization
    Linear Marketing Mix Optimization
      Non-Controllable Variables and Non-Confounders
  Summary

Appendix
  About the Authors
    Stefan Conrady
    Lionel Jouffe
  References
  Contact Information
    Conrady Applied Science, LLC
    Bayesia S.A.S.
  Copyright








Introduction

Motivation & Objective

To this day, randomized experiments remain the gold standard for generating models that permit causal inference. In many fields, such as drug trials, they are, in fact, the conditio sine qua non. Without first having established and quantified the treatment effect (and any associated side effects), no new drug could possibly win approval. This means that a drug must be proven in terms of its causal effect, and hence the underlying study must facilitate causal inference.

However, in many other domains, such controlled experiments are not feasible, be it for ethical, economic or practical reasons. For instance, it is obvious that the federal government could not create two different tax regimes in order to evaluate their respective impact on economic growth. For lack of such experiments, economists have traditionally been constrained to studying strictly observational data and, although much desired, causal inference is much more difficult to carry out on that basis. Causal inference from observational studies typically requires an extensive range of assumptions, which may or may not be justifiable depending on one's viewpoint. Being subject to such individual judgement, it should not surprise us that there is widespread disagreement among economic experts and government leaders regarding the effect of economic policies.

While economists and social scientists have been using observational data for over a century for policy development, the business world has only recently been discovering the emerging potential of "big data" and "competing on analytics." As these terms are becoming buzzwords, and are rightfully expected to hold great promise, the strictly observational nature of most "big data" sources is often overlooked. The wide availability of new, easy-to-use analytics tools may turn out to be counterproductive, as observational versus causal inference are not explicitly differentiated. While the mantra "correlation does not imply causation" remains frequently quoted as a general warning, many business analysts would not know under what specific conditions it can be acceptable to derive a causal interpretation from correlation in observational data. Consequently, causal assumptions are often made rather informally and implicitly, and thus they typically remain undocumented. The line between association and causation often becomes further blurred in the eyes of the end users of such research. Given that the concept of causality remains ill-understood in many practical applications, we seriously question today's real-world business capabilities for deriving rational policies from the newly-found "big data."

With these presumed shortcomings in business practice, it is our objective to provide a framework that facilitates a much more disciplined approach to causal inference while remaining accessible to (non-statistician) business analysts and transparent to executive decision makers. We believe that Bayesian networks are an appropriate paradigm for this purpose and that the BayesiaLab software package offers a robust toolset for distinguishing observational and causal inference.







Overview

The format of this document is essentially “two papers in one,” with the rst chapter focusing on mostly
theoretical considerations (although illustrated with an example), while the second chapter provides a prac-
tical, real-world example presented in the form of a tutorial.

I.    Methods of Causal Inference
      We will first introduce the reader to the idea of formal causal inference using the well-known example of Simpson's Paradox. Secondly, we will provide a brief summary of the Neyman-Rubin model, which represents a traditional statistical approach in this context. Once this method is established as a reference point, we will introduce two methods within the Bayesian network paradigm: Pearl's Do-Operator, which is based on "Graph Surgery," and a method based on Jouffe's "Likelihood Matching" algorithm (LM). LM allows fixing probability distributions and can be considered a probabilistic extension of statistical matching.

II.   Practical Applications of Direct Effects and Causal Inference
      While our treatment of Neyman-Rubin is limited to the first chapter, the two Bayesian network-based methods will be further illustrated as practical applications in the second chapter. Special weight will be given to Likelihood Matching (LM), as it has not yet been documented in the literature. We will explain the practical benefits of LM with a real-world business application and discuss observational and causal inference in the context of a marketing mix model. Using the marketing mix model as the principal example, we will go into greater detail regarding the analysis workflow, so the reader can use this example as a step-by-step guide to implementing such a model with BayesiaLab.


Notation
To clearly distinguish between natural language, software-specific functions and example-specific variable names, the following notation is used:

• Bayesian network and BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.

• Names of attributes, variables and nodes are italicized.








I. Methods of Causal Inference

Simpson’s Paradox Example

In our recent white paper, Paradoxes and Fallacies, we have written about Simpson’s Paradox, which occa-
sionally appears in the popular press as a rather enigmatic statistical anomaly. We use an admittedly con-
trived example to illustrate this paradox:

A hypothetical type of cancer equally effects men and women. A long-term, observational and non-
experimental study nds that a speci c type of cancer therapy is associated with an increased remission rate
among all treated patients (see table). Based on the study, this particular treatment is thus recommended for
broader application.

                        Remission
 Treatment           Yes         No
    Yes              50%        50%
    No               40%        60%

However, when examining patient records by gender, the remission rate for male patients — upon treat-
ment — decreases from 70% to 60% and for female patients the remission rate declines from 30% to 20%
(see table). So, is this new therapy effective overall or not?

                          Remission
   Gender     Treatment     Yes      No
   Male       Yes           60%      40%
              No            70%      30%
   Female     Yes           20%      80%
              No            30%      70%

The answer lies in the fact that — in this example — there was an unequal application of the treatment to men and women. More specifically, 75% of the male patients and only 25% of the female patients received the treatment. Although the reason for this imbalance is irrelevant for inference, one could imagine that side effects of this treatment are much more severe for females, who thus seek alternative therapies. As a result, there is a greater share of men among the treated patients. Given that men also have a better recovery prospect with this type of cancer, the remission rate for the total treated population increases.
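
To make the arithmetic behind this reversal concrete, the aggregate rates in the first table can be recomputed from the gender-specific rates and the unequal treatment shares. A minimal sketch in Python (all numbers are taken from the tables and text above):

    # Gender-specific remission rates from the second table
    p_rem = {("male", True): 0.60, ("male", False): 0.70,
             ("female", True): 0.20, ("female", False): 0.30}
    p_treat = {"male": 0.75, "female": 0.25}  # share of each gender receiving treatment
    p_gender = {"male": 0.5, "female": 0.5}   # the cancer equally affects men and women

    for treated in (True, False):
        # P(Gender | Treatment) via Bayes' theorem
        norm = sum(p_gender[g] * (p_treat[g] if treated else 1 - p_treat[g])
                   for g in p_gender)
        rate = sum(p_gender[g] * (p_treat[g] if treated else 1 - p_treat[g]) / norm
                   * p_rem[(g, treated)] for g in p_gender)
        print(f"P(Remission=yes | Treatment={treated}): {rate:.0%}")

    # Prints 50% for the treated and 40% for the untreated, even though the rate
    # is 10 points lower for the treated within each gender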

So, what is the true overall effect of this treatment? There are actually several possible ways to compute the
effect, namely the statistical approach based on the Neyman-Rubin Model of Causal Inference, and the two
Bayesian network-based approaches: Pearl’s Graph Surgery approach and the method based on Jouffe’s LM
algorithm, which are both implemented in BayesiaLab.







Neyman-Rubin Model of Causal Inference

To begin our discussion of causal inference, we will first present matching as a statistical method for causal inference based on observational data. Our brief summary follows the framework that is widely known as the Neyman-Rubin Model of Causal Inference (Rubin, 2006; Sekhon, 2007; Morgan and Winship, 2007; Rosenbaum, 2002).

We closely follow Sekhon (2007) for this highly condensed summary:

Inference Based on Experimental Data

Let $Y_{i1}$ denote the potential outcome for unit $i$ if the unit receives treatment, and let $Y_{i0}$ denote the potential outcome for a unit in the control group. The treatment effect for unit $i$ is defined as:

$\tau_i = Y_{i1} - Y_{i0}$

Furthermore, let $T_i$ be the treatment indicator: 1 when unit $i$ is in the treatment group and 0 when unit $i$ is in the non-treatment control group.

If assignment to treatment is randomized, causal inference is fairly simple because the two groups are drawn
from the same population, and treatment assignment is independent of all baseline variables. As the sample
size grows, observed and unobserved confounders are balanced across treatment and control groups. That
is, with random assignment, the distributions of both observed and unobserved variables in both groups are
equal in expectation.


Treatment assignment is independent of $Y_0$ and $Y_1$, i.e., $\{Y_{i0}, Y_{i1} \perp T_i\}$, where the "$\perp$" symbol represents independence.

Hence, for $j = 0, 1$:

$E(Y_{ij} \mid T_i = 1) = E(Y_{ij} \mid T_i = 0) = E(Y_i \mid T_i = j)$

Therefore, the average treatment effect can be estimated by:

$\tau = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 0) = E(Y_i \mid T_i = 1) - E(Y_i \mid T_i = 0)$

$\tau$ can be estimated in an experimental setting because randomization can ensure that observations in treatment and control groups are exchangeable. Randomization ensures that assignment to treatment will not, in expectation, be associated with the potential outcomes.
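
As a quick numerical illustration, the following sketch simulates a randomized assignment with hypothetical potential outcomes (chosen to mirror the Simpson's Paradox example, with a true average effect of -0.1) and recovers the effect by a simple difference of group means:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.binomial(1, 0.5, n)                         # a baseline covariate, e.g. gender
    y0 = rng.binomial(1, np.where(x == 1, 0.70, 0.30))  # potential outcome without treatment
    y1 = rng.binomial(1, np.where(x == 1, 0.60, 0.20))  # potential outcome with treatment
    t = rng.binomial(1, 0.5, n)                         # randomized, independent of x
    y = np.where(t == 1, y1, y0)                        # only one outcome is ever observed
    print(y[t == 1].mean() - y[t == 0].mean())          # close to the true effect of -0.1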

Inference from Observational Data

The situation with observational data is much less straightforward, as treatment and control groups are not necessarily drawn from the same population. Hence, the average treatment effect $\tau$ cannot be estimated the same way as was the case with experimental data.





As an alternative, we can pursue the average treatment effect for the treated, more formally expressed as:

$\tau \mid (T = 1) = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 1)$

However, the challenge in this case is that $Y_{i0}$ is not observed for the treated, i.e. we simply cannot know how those who were in fact treated would have fared had they not been treated.

As a potential remedy for this quandary, one could assume that treatment selection depends on a set of observable covariates $X$. Furthermore, we could assume that, given $X$, treatment assignment is independent of $Y$.

More formally, $\{Y_0, Y_1 \perp T \mid X\}$, which is referred to as "unconfoundedness."

A final assumption is that there is a so-called "overlap": $0 < P(T = 1 \mid X) < 1$.

In the particular case where $X \in \{male, female\}$, this implies that treatments must be observed for both males and females in order to obtain overlap. Together, unconfoundedness and overlap form the concept of strong ignorability, which is required for the estimation of the average treatment effect for the treated:

$\tau \mid (T = 1) = E\{E(Y_i \mid X_i, T_i = 1) - E(Y_i \mid X_i, T_i = 0) \mid T_i = 1\}$

This means we condition on the observed covariates, $X_i$, and thus treatment and control groups are balanced.

Conditioning on $X$ can be a straightforward task and can typically be achieved by matching, which means finding exactly matching sets of covariates. In our case, matching is simple, as we only have one covariate with two states, i.e. male and female. We can then compute the treatment effects for exactly matched sets of treated and untreated units within each subset, i.e. within the male and female group.
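
A minimal sketch of this computation for the simple one-covariate case, assuming a simulated observational dataset in which treatment self-selection depends on the covariate (as in the Simpson's Paradox example):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.binomial(1, 0.5, n)                        # covariate: 1 = male
    t = rng.binomial(1, np.where(x == 1, 0.75, 0.25))  # treatment depends on the covariate
    p = np.where(t == 1, np.where(x == 1, 0.60, 0.20),
                 np.where(x == 1, 0.70, 0.30))
    y = rng.binomial(1, p)                             # observed remission

    # Exact matching: compare treated vs. untreated within each stratum of X, then
    # weight the per-stratum effects by the covariate distribution among the treated
    att = sum((y[(x == v) & (t == 1)].mean() - y[(x == v) & (t == 0)].mean())
              * (x[t == 1] == v).mean() for v in (0, 1))
    print(att)                                         # close to -0.1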

However, in most real-world applications, we have many more covariates, and many of those may have continuous values rather than discrete states. Inevitably, this makes matching much more challenging, and it actually may be impossible to perform exact matching.

Given this challenge, propensity score matching (Rosenbaum and Rubin, 1983) and matching based on the Mahalanobis distance (Cochran and Rubin, 1973) have emerged as commonly used methods. Both methods perform matching based on covariate similarity, but it goes beyond the scope of this paper to elaborate further on the details of these and related methods.







The Bayesian Network Representation

To illustrate Pearl’s Do-Operator based on Graph Surgery and subsequently the method based on Jouffe’s
Likelihood Matching, we need to switch from the traditional statistical framework to the Bayesian network
paradigm. The starting point for both methods is a synthetically generated dataset with three variables,
Gender, Treatment and Remission, with a total of 1,000 observations, which re ects the statistics described
in the tables provided in the description of Simpson’s Paradox.1 This dataset will serve as the basis for the
Bayesian network to be used for causal inference.

For expositional simplicity, we will omit the steps required for importing the dataset into BayesiaLab and
refer the reader to the second chapter, which describe the import process in detail. Rather, we begin directly
in BayesiaLab’s Modeling Mode with the initially unconnected network consisting of three nodes, i.e. the
variables of interest:




From theory we know that we can factorize a Joint Probability Distribution (JPD) into the product of conditional probability distributions (see Barber, 2011, for a detailed discussion). With the three nodes that we have in our example, there are actually six different ways to do this:

$p(x_1, x_2, x_3) = p(x_1 \mid x_2, x_3)\,p(x_2, x_3) = p(x_1 \mid x_2, x_3)\,p(x_2 \mid x_3)\,p(x_3)$
$p(x_1, x_3, x_2) = p(x_1 \mid x_3, x_2)\,p(x_3, x_2) = p(x_1 \mid x_3, x_2)\,p(x_3 \mid x_2)\,p(x_2)$
$p(x_2, x_1, x_3) = p(x_2 \mid x_1, x_3)\,p(x_1, x_3) = p(x_2 \mid x_1, x_3)\,p(x_1 \mid x_3)\,p(x_3)$
$p(x_2, x_3, x_1) = p(x_2 \mid x_3, x_1)\,p(x_3, x_1) = p(x_2 \mid x_3, x_1)\,p(x_3 \mid x_1)\,p(x_1)$
$p(x_3, x_1, x_2) = p(x_3 \mid x_1, x_2)\,p(x_1, x_2) = p(x_3 \mid x_1, x_2)\,p(x_1 \mid x_2)\,p(x_2)$
$p(x_3, x_2, x_1) = p(x_3 \mid x_2, x_1)\,p(x_2, x_1) = p(x_3 \mid x_2, x_1)\,p(x_2 \mid x_1)\,p(x_1)$

Given the semantics of Bayesian networks, this translates into six possible, equivalent Bayesian networks, all of which represent exactly the same JPD.
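
This equivalence is easy to verify numerically. A small sketch that checks the first and the last of the six factorizations against an arbitrary, randomly generated three-variable JPD:

    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.random((2, 2, 2))
    p /= p.sum()                            # an arbitrary JPD p(x1, x2, x3)

    # p(x1 | x2, x3) p(x2 | x3) p(x3)
    p3 = p.sum(axis=(0, 1))                 # p(x3)
    p23 = p.sum(axis=0)                     # p(x2, x3)
    f1 = (p / p23) * (p23 / p3) * p3

    # p(x3 | x2, x1) p(x2 | x1) p(x1)
    p1 = p.sum(axis=(1, 2), keepdims=True)  # p(x1)
    p12 = p.sum(axis=2, keepdims=True)      # p(x1, x2)
    f6 = (p / p12) * (p12 / p1) * p1

    assert np.allclose(f1, p) and np.allclose(f6, p)  # both recover the same JPD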




When we run one of BayesiaLab's network learning algorithms on the sample dataset, we will indeed obtain one of the six possible networks shown above, as suggested by the theory. Without additional information on those variables, such as we might obtain from temporal indices, we will be unable to select one network over another, and the network choice would have to be entirely arbitrary.




1   This dataset was created with BayesiaLab’s Generate Data function, based on the true Joint Probability Distribution
(JPD).





In order to visualize that the arcs in these networks are invertible in their orientation, BayesiaLab can highlight the Essential Graph (Analysis>Graphic>Show the Edges). This will display the edges that can be oriented in either direction without modifying the represented JPD.




For purposes of observational inference, any of these six equivalent networks would be sufficient. For instance, the probability of Remission=yes, given that we observe Treatment=yes, i.e. P(Remission=yes | Treatment=yes), can be computed with any of the six networks shown earlier.

However, from the introduction of Simpson's Paradox, we realize that a simple observation is not sufficient to establish the treatment effect. Observational inference may actually be misleading for interpretation purposes, which is at the very core of the paradox. So, our question remains, "what is the effect of treatment?" More specifically, "what is the probability of remission, given that we do administer the treatment?" This means that we want to see the effect of an intervention instead of merely observing that treatment has occurred.

Graph Surgery and LM provide different ways to answer this question, which we will explain in the following two sections:



Pearl’s Do-Operator

Causal Networks
To introduce Pearl’s Do-Operator, we need to make a formal transition from a general Bayesian network to
a causal network, because Bayesian networks describe a joint distribution over possible observed events but
say nothing about what will happen if an intervention occurs. A causal network is a Bayesian network with
the added property that the parents of each node are its direct causes. For example, Fire → Smoke is a
causal network whereas Smoke → Fire is not, even though both networks are equally capable of represent-
ing any joint distribution on the two variables.

More formally, causal networks are defined as a type of Bayesian network with special properties: upon setting an intervention on a node in a causal network, the correct probability distribution is given by deleting the incoming arcs from the node's parents, i.e. "cutting off" the direct causes of the node. Pearl has characterized this deletion of links rather graphically as "graph mutilation" or "graph surgery."2

2   Interestingly, "intervenire", the Latin origin of "intervention," symbolizes this separation as it literally means "to come in between."





With this definition, Pearl's Graph Surgery approach requires us to provide a complete set of causal assumptions regarding the network to compute the effect of an intervention. Given our background knowledge regarding Simpson's Paradox, we can make causal assumptions for all edges and thus declare, i.e. by fiat, our Bayesian network a causal network. As stated earlier, we assume that Gender has a causal effect on Remission (rather than Remission on Gender), so we define the arc direction as Gender ➝ Remission. We also assume that Treatment has a causal effect (whether positive or negative) on Remission, which translates into the (directed) arc Treatment ➝ Remission. Finally, we have learned that Gender influences (causes) whether or not one would undergo Treatment, so we have Gender ➝ Treatment. This eliminates five of the six possible Bayesian networks and leaves us with only one possible causal Bayesian network:




Now that we have a causal Bayesian network, we can make a distinction between observational inference and causal inference. This is because of the semantic difference between "given that we observe" and "given that we do." The former is strictly an observation, i.e. we focus on the patients who received treatment, whereas the latter is an active intervention. Answering our question about the treatment effect then requires inferring what would hypothetically happen "given that we do," i.e. given that we force the treatment without permitting patients to self-select their treatment. In the semantics of Bayesian networks, this means that there must not be a direct relationship between Gender and Treatment. In other words, Treatment must not directly depend on Gender.


Intervention
In our Bayesian network this can be done easily by "mutilating" the graph, i.e. deleting the arc connecting Gender and Treatment. BayesiaLab offers a very simple function to achieve this, which is aptly named Intervention (right-click on the node's Monitor and then select Intervention).

By intervening on the Treatment variable (and setting Treatment=yes), the causal Bayesian network is modified (or "mutilated") as follows:

• The entering arcs of the node on which we want to perform the intervention are "surgically" removed. With intervention, we cut the dependency between Treatment and Gender, i.e. administering the treatment will not affect Gender.

• The original Chance Node (round) representing Treatment is transformed into a Decision Node (square). The associated Monitor will be highlighted in blue.








Now we can observe what happens to Remission when we “do” Treatment, instead of just “observing”
Treatment.

Treatment=No




Treatment=Yes




As we can see, the Gender probability distribution remains the same. However, Remission decreases from
50% to 40%, given that we “do” Treatment. With this we have now obtained the treatment effect:


$\tau = P(Remission = yes \mid Treatment = yes) - P(Remission = yes \mid Treatment = no) = -0.1$





To answer our original question, we must conclude that this new treatment is detrimental to the patients’
health.
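
On this three-node network, graph surgery amounts to the classic adjustment formula, $P(Remission \mid do(Treatment = t)) = \sum_g P(Remission \mid t, g)\,P(g)$: the confounder Gender is averaged over its marginal distribution rather than its distribution conditional on Treatment. A minimal sketch reproducing the numbers above:

    p_gender = {"male": 0.5, "female": 0.5}
    p_rem = {("male", "yes"): 0.60, ("male", "no"): 0.70,
             ("female", "yes"): 0.20, ("female", "no"): 0.30}

    def p_remission_do(treatment):
        # The mutilated graph removes Gender -> Treatment, so Gender keeps
        # its marginal distribution under the intervention
        return sum(p_gender[g] * p_rem[(g, treatment)] for g in p_gender)

    tau = p_remission_do("yes") - p_remission_do("no")
    print(p_remission_do("yes"), p_remission_do("no"), tau)  # 0.4, 0.5, -0.1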



Jouffe’s Likelihood Matching (LM)

We will now briefly introduce Jouffe's Likelihood Matching (LM) algorithm, which was originally implemented in the BayesiaLab software package for "fixing" the probability distributions of an arbitrary set of variables, which in turn allows easily defining complex sets of soft evidence. The LM algorithm searches for a set of likelihood distributions which, when applied to the Joint Probability Distribution (JPD) encoded by the Bayesian network, yield the posterior probability distributions defined (as constraints) by the user.

As we saw with Pearl’s Graph Surgery approach, the core idea of intervention is to set an evidence on the
node on which we wish to intervene, while all other ascending nodes remain unchanged. Using this very
same idea, we can then intervene on a node by xing the posterior probability distributions of its covariates.
Casually speaking, this would be a kind “virtual mutilation.”
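
The internal search performed by LM is proprietary and, as noted, not yet documented in the literature; the following sketch therefore only illustrates the idea on the three-node example, not Jouffe's actual algorithm. We choose likelihood weights on the covariate Gender that, combined with the evidence Treatment=yes, leave the posterior distribution of Gender fixed at its original marginal:

    p_gender = {"male": 0.5, "female": 0.5}
    p_treat = {"male": 0.75, "female": 0.25}  # P(Treatment=yes | Gender)
    p_rem = {"male": 0.60, "female": 0.20}    # P(Remission=yes | Gender, Treatment=yes)

    # Weights that keep P'(Gender) = P(Gender) once Treatment=yes is set
    lam = {g: 1.0 / p_treat[g] for g in p_gender}

    post = {g: lam[g] * p_gender[g] * p_treat[g] for g in p_gender}
    z = sum(post.values())
    post = {g: w / z for g, w in post.items()}  # back to {male: 0.5, female: 0.5}

    print(sum(post[g] * p_rem[g] for g in post))  # 0.4, matching Graph Surgery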




Treatment=No







Treatment=Yes




These results are identical to what was obtained with Pearl’s Graph Surgery.

$\tau = P(Remission = yes \mid Treatment = yes) - P(Remission = yes \mid Treatment = no) = -0.1$

However, two main differences exist between the methods:

I.   One important feature of the LM algorithm is that it returns the same result for all instantiations of the Essential Graph, i.e. for any one of the six equivalent networks. For example, intervening on Treatment using the Bayesian network below, and using the LM algorithm, will lead to exactly the same posterior probability distribution for Remission, even though the arc directions are non-causal (and may thus be perceived as counterintuitive).




      In comparison to the Do-Operator, the approach based on LM does not require available causal knowledge to be formally translated into a causal structure in order to compute treatment effects. While it may be easy to specify all the causal directions in a simple model with only three nodes, such as in our example, it is obviously more of a challenge to do the same for a larger network, perhaps consisting of dozens or even hundreds of nodes. That is not to claim that we can avoid causal assumptions altogether; however, we aim to defer the need for making such assumptions until a later point, and then only make those assumptions that are directly related to the pair of variables for which we want to obtain the causal effect.







II.   The other difference between Graph Surgery and LM is more subtle and may not always be obvious. Graph Surgery implies a modification of the representation of the JPD, whereas the approach based on LM always works on the original JPD. The mere graph mutilation can bring about a modification of some marginal probability distributions, even without changing the marginal probability distributions of the nodes on which we want to intervene.
      We need to briefly digress from our principal example in order to clarify this particular point: the two graphs below illustrate the impact of the mutilation on the marginal probability distribution of Customer Satisfaction.




Importance of Network Performance
While the statistical matching approach of the Neyman-Rubin model directly utilizes the original observations, the LM algorithm is based on the JPD encoded by the Bayesian network. This emphasizes the requirement that a Bayesian network to be used for this purpose must provide a good representation of the true JPD. While there is no hard-and-fast rule as to what constitutes a minimum fit requirement, we can review the overall network performance by selecting Analysis>Network Performance>Global:








The key metric here is the Contingency Table Fit (CTF). This measure can range from 0%, which corresponds to representing the JPD with a fully unconnected network (all nodes independent), to 100%, which corresponds to representing the JPD perfectly with a fully connected network. The network learned on the Simpson's Paradox dataset happens to be a complete (fully connected) graph, thus the CTF is 100%.



Summary

We have provided a brief summary of the Neyman-Rubin model, which represents a traditional statistical approach for causal inference. Extending beyond the statistical framework, and now within the Bayesian network paradigm, we illustrated Graph Surgery and Pearl's Do-Operator, and finally presented a method based on Jouffe's Likelihood Matching algorithm (LM). Most importantly, working with the Bayesian network methods highlighted that formal causal assumptions are critical to correct causal inference.








II. Practical Applications of Direct Effects and Causal Inference

The Marketing Mix Model Example

The adage, "I know I waste half of my advertising dollars...I just wish I knew which half", reflects a century-old uncertainty about the effectiveness of marketing instruments.3 More formally, one could describe this quandary as a domain with an unknown (or ill-understood) causal structure.

While "big data", especially in the field of marketing, is expected to rapidly yield "actionable business insights," we need to recognize that there are many steps to traverse to achieve this goal. Hence, we would like to parse this overarching objective of "actionable insights" into distinct components, which will immediately highlight the central role of causal inference:

• "Big data" most often refers to large amounts of observational data from a domain. Despite the ever-increasing amount of data, most collected measures include noise and missing data points.

• "Actionable insights" actually implies several things: firstly, it requires an understanding of the domain, which can be used as a basis for reasoning about this domain. A key assumption in this context is that we must not only have a structure describing the observations we have gathered, but rather a causal structure, so we can anticipate the consequences of actions we have not yet taken. If we have this ability to evaluate the results of our potential interventions in this domain, we can choose the rational course of action among all the possible alternatives. As an added complexity, most dynamics uncovered in a domain are probabilistic rather than deterministic in nature.

Although it is typically a challenge, our chosen toolset, BayesiaLab, can implicitly handle missing values and capture the probabilistic nature of the domain, and hence we will not focus on that aspect. Rather, the central theme of this paper is the transition from observation to causation.

As we have seen in the first chapter with the introductory example of Simpson's Paradox, Bayesian networks provide two principal ways of moving from observational inference to causal inference, namely Graph Surgery and Likelihood Matching. With the following example from the CPG industry, we will juxtapose Graph Surgery and LM and then specifically demonstrate how LM can be utilized for computing causal effects, which can subsequently be used for performing marketing mix optimization.




3   Various versions of this quote have been attributed to Henry Procter, Henry Ford, John Wanamaker and J.C. Penney





CPG Example & Dataset

To illustrate this approach we study daily ice cream sales of a European food distributor as a function of
environmental variables and marketing efforts.4

Our sample data set includes the following variables:

• Seasonally-adjusted daily sales in the local currency

• Traditional advertising, such as print advertising (incl. coupons), TV, radio, in-store promotions, etc.

• Online advertising, including banner ads, search engine marketing, online coupons

• Competitive advertising (estimate of all competitive marketing efforts combined)

• Temperature in °C

• Number of open retail outlets

• Weekday



Model Development

While the focus of this example is to evaluate and to causally interpret a given marketing mix model, we will spell out the steps one would take to generate such a model with BayesiaLab. This should enable current users of BayesiaLab to replicate the exercise in its entirety.


Data Import
We use BayesiaLab’s Data Import Wizard to load all 7 time series5 into memory from a comma-separated
 le (CSV). BayesiaLab automatically detects the column headers, which contain the variable names.




4   For expository purposes, this dataset was synthetically generated based on actual market dynamics observed in an industry and locale different from the example.
5   Although the dataset has a temporal ordering, for expository simplicity we will treat each time interval as an independent observation.






The next step identifies the data types contained in the dataset. BayesiaLab will attempt to detect the type of variables in the dataset and assumes in this case all variables to be continuous, as indicated by the turquoise background color for all columns.




Although Weekday appears continuous, i.e. 1 through 7, it must be treated as discrete so as to avoid binning in the subsequent discretization function.6 Upon setting it to discrete, the Weekday variable will appear in red.




6   In the original dataset the variable Weekday was coded into ordered numerical states, 1 through 7, representing Monday through Sunday. BayesiaLab could also have used text descriptions as state labels, in which case the variable would have been automatically recognized as discrete.






As our dataset contains missing values, we need to specify the method for missing values imputation. We will choose the Structural EM method, given that, for the size of this dataset, the computational complexity of this algorithm will not be a burden.




The following discretization step is very important for all models in BayesiaLab, and thus we provide a bit more detail here. The objective of this model is to establish Sales as a function of the marketing instruments and other external factors, and we can take this objective into account for the discretization process. More specifically, we will split the process into two parts. First, we will discretize the target variable, i.e. Sales, on its own. We highlight the Sales column in the data table and then choose Manual as the Discretization Type. This provides us with the probability density function of Sales.








By clicking Generate a Discretization, we are prompted to select the discretization type.




We chose Type: K-Means and Intervals: 4.7 The chart will now display the results of this discretization.
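
For readers working outside of BayesiaLab, a comparable K-Means binning into 4 intervals can be sketched with scikit-learn; this approximates, but is not necessarily identical to, BayesiaLab's implementation, and the single-column data file is hypothetical:

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    sales = np.loadtxt("sales.csv")  # hypothetical single-column file of daily Sales
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="kmeans")
    bins = disc.fit_transform(sales.reshape(-1, 1)).ravel()
    print(disc.bin_edges_[0])        # the boundaries of the 4 intervals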




7   For a discussion of discretization algorithms and a guide for interval selection, please see the papers referenced in the
appendix.





Now that we have discretized the target variable by itself, we will discretize the remaining continuous variables with the Decision Tree algorithm and use Sales as the target. This allows binning the continuous variables in such a way that we gain a maximum amount of information from these variables with respect to the target.
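
As a rough analogue outside of BayesiaLab (not the tool's exact algorithm), such target-driven bins can be derived from the split thresholds of a shallow regression tree fitted on a single predictor; the example arrays are hypothetical:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def tree_bins(x, target, max_bins=4):
        """Bin a continuous predictor so the bins are informative about the target."""
        tree = DecisionTreeRegressor(max_leaf_nodes=max_bins, random_state=0)
        tree.fit(x.reshape(-1, 1), target)
        # internal nodes carry split thresholds; leaves are marked with -2
        thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
        return np.digitize(x, thresholds)

    # e.g. temperature_bins = tree_bins(temperature, sales)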




Upon completion of the discretization, BayesiaLab will present all variables as nodes in an unconnected
network in the Graph Panel.




Supervised Learning
Now that we have an initial network, albeit unconnected, we can run our first Supervised Learning algorithm with the objective of characterizing the target node. However, we first need to specify the target by right-clicking on Sales and selecting Set As Target Node (or pressing "T" while double-clicking on the node).







Once this is set, the Sales node will appear in the graph as a bulls-eye, symbolizing a target.




We now have an array of Supervised Learning algorithms available to apply here. Given the small number of nodes, variable selection is not an issue and hence should not influence our choice. Furthermore, the relatively small number of observations does not create a challenge in terms of computational effort. With these considerations, and without going into further detail, we select the Augmented Naive Bayes algorithm. The "augmented" part of this algorithm's name refers to the additional unsupervised search that is performed on the basis of the given naive structure.







Upon learning, the newly generated network is now displayed in the Graph Panel.




The predefined naive structure is highlighted by the dotted arcs, while the additional (augmented) arcs from the unsupervised learning are shown in solid black.


Network Performance
We could now spend some time to further refine this model, such as balancing the degree of complexity versus the overall model fit. Furthermore, we could also specify this as a dynamic model.8 To maintain expositional clarity, we will leave the model as is.

However, we do wish to cover a few performance measures to assure the reader that the model presented
here is a reasonable characterization of the underlying domain.

With the relatively small number of observations, we chose not to set aside a hold-out sample (e.g. 20% of observations) during the data import process. As an alternative way of testing the out-of-sample network performance, we carry out Cross Validation by selecting (from within the Validation Mode) Tools>Cross Validation>Targeted:




In terms of parameters for the Cross Validation, we select the same learning algorithm as before, i.e. Augmented Naive Bayes. Also, a 10-fold validation is a typical choice in this context.

8   Given the inherently dynamic nature of marketing effects, it would be very appropriate to model this as a temporal Bayesian network. For instance, this would enable us to capture potential lags in the effects of marketing activities on the target variable. The BayesiaLab framework can easily accommodate such a temporal specification.






The resulting Global Report provides a variety of metrics, including precision and R².

Sampling Method: K-Folds
Learning Algorithm: Augmented Naive Bayes
Target: Sales

Value                  <=207556.406   <=233877.375   <=259145.594   >259145.594
Gini Index             66%            41.75%         38.03%         69.52%
Relative Gini Index    75.25%         62.92%         63.76%         80.63%
Mean Lift              2.49           1.64           1.52           2.49
Relative Lift Index    81.50%         78.29%         80.11%         84.09%

Relative Gini Global Mean: 70.64%
Relative Lift Global Mean: 81%
Total Precision: 67.37%
R: 0.76104342242
R2: 0.57918709081

Occurrences
                       <=207556.406 (53)   <=233877.375 (142)   <=259145.594 (172)   >259145.594 (59)
<=207556.406 (56)      37                  18                   1                    0
<=233877.375 (124)     15                  86                   22                   1
<=259145.594 (213)     1                   38                   140                  34
>259145.594 (33)       0                   0                    9                    24

Reliability
                       <=207556.406 (53)   <=233877.375 (142)   <=259145.594 (172)   >259145.594 (59)
<=207556.406 (56)      66.07%              32.14%               1.79%                0%
<=233877.375 (124)     12.10%              69.35%               17.74%               0.81%
<=259145.594 (213)     0.47%               17.84%               65.73%               15.96%
>259145.594 (33)       0%                  0%                   27.27%               72.73%

Precision
                       <=207556.406 (53)   <=233877.375 (142)   <=259145.594 (172)   >259145.594 (59)
<=207556.406 (56)      69.81%              12.68%               0.58%                0%
<=233877.375 (124)     28.30%              60.56%               12.79%               1.69%
<=259145.594 (213)     1.89%               26.76%               81.40%               57.63%
>259145.594 (33)       0%                  0%                   5.23%                40.68%



Even without further comparison, the reported values appear reasonable and suggest that we can proceed
with analyzing this network.



Model Analysis

We have accepted the network as a plausible representation of this domain and will now interpret the structure we obtained. To make the structure easier to understand, we will first apply one of BayesiaLab's automatic layout algorithms, which quite literally "disentangles" the network and thus provides a clearer picture. Selecting View>Automatic Layout achieves this (or pressing the keyboard shortcut "P").








The "Naive Bayes" and the "Augmented" parts of this network, shown as dotted arcs and solid arcs respectively, are now much more obvious in this layout.




As the naive structure was given by definition, only the presence or absence of solid arcs provides information about the existence of relationships between the predictors. Much more can be understood when we examine the magnitude and the sign of all relationships in the network.


Pearson’s Correlation
Although correlation, as we will later emphasize, is not a central metric for network analysis in BayesiaLab, we will use it for a first look, especially since all readers will be familiar with this measure. Selecting Analysis>Graphic>Pearson's Correlation provides this information directly in the network graph.








The colors of the arcs indicate the sign of the relationship and the arc labels provide the correlation value.




Many of the shown relationships seem intuitive, for instance that No. of Stores, Trad. Adv. and Online Adv. all have a positive association with Sales. Equally plausible is the fact that Temperature is associated with Sales (although one of the co-authors of this paper believes that one can eat ice cream rain or shine). The negative association between Competitive Adv. and Sales also seems expected. Less clear is the negative correlation between Sales and Weekday, but the small value suggests either a very weak link or perhaps a nonlinear relationship.


Mutual Information
Given that Pearson’s correlation is a strictly linear metric, its ability to characterize all these relationships is
inherently limited. We will now turn to Mutual Information as a new measure, which can help overcome
this limitation.








In contrast to correlation, Mutual Information does not reflect the sign of the relationship; however, this measure captures the strength of relationships between variables, even if they are highly nonlinear.




More specifically, the Mutual Information I(X,Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y. Mutual Information is a symmetric metric, which reflects the uncertainty reduction of X by knowing Y as well as of Y by knowing X.

In our example, knowing the value of Weekday on average reduces the uncertainty of the value of Sales by 0.4802 bits, which means that it reduces its uncertainty by 26.3% (shown in red, in the opposite direction of the arc). Conversely, knowing Sales reduces the uncertainty of Weekday by 17.11% (shown in blue, in the direction of the arc). It is interesting to see that, in terms of Mutual Information, Weekday and Sales have a very strong relationship, whereas previously the correlation coefficient was near zero.
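
Both figures follow directly from the definition, $I(X,Y) = H(X) + H(Y) - H(X,Y)$, with the relative reductions obtained by dividing by the respective marginal entropy. A minimal sketch, assuming a joint probability table for Weekday and Sales is available:

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def mutual_information(joint):
        # I(X,Y) in bits, from a 2-D joint probability table p(x, y)
        return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
                - entropy(joint.ravel()))

    # joint = counts / counts.sum()     # hypothetical Weekday x Sales table
    # mi = mutual_information(joint)    # e.g. 0.4802 bits
    # mi / entropy(joint.sum(axis=0))   # relative uncertainty reduction of Sales
    # mi / entropy(joint.sum(axis=1))   # relative uncertainty reduction of Weekday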







Observational Inference

To explore the nature of this relationship further, we can perform the Target Mean Analysis with Sales and
Weekday (Analysis>Graphic>Target Mean Analysis).




This prompts us to select the way we want to examine this relationship. In this context it seems appropriate
to look at the delta mean of the target as a function of Weekday.




The resulting plot confirms the previous hypothesis of nonlinearity.








For instance, we can interpret this as follows: given that Weekday=Friday, we observe that Sales reach their highest value. Furthermore, given that Weekday=Sunday, we observe that Sales have their lowest value, as many shops in Europe are closed on Sundays. We can further speculate that consumers perhaps buy more ice cream on Fridays in preparation for leisure activities over the weekend.

Returning to our interpretation of Mutual Information, it is now obvious why Weekday reduces the uncertainty of Sales by over 25%. There is quite apparently an intra-week seasonality. Another interpretation of Mutual Information is "importance," and we can use Analysis>Report>Target Analysis>Correlations with the Target Node to obtain an overview of the importance of all nodes in the network with respect to the target, Sales.




Node significance with respect to the information gain brought by the node to the knowledge of Sales

Node               Mutual        Mutual            Relative       Mean Value   G-test     Degrees of   p-value   G-test (Data)   Degrees of       p-value (Data)
                   Information   Information (%)   Significance                            Freedom                                Freedom (Data)
Weekday            0.4802        26.30%            1              4.0047       283.5916   18           0.00%     283.5916        18               0.00%
Competitive Adv.   0.1293        7.08%             0.2692         514.9959     76.332     9            0.00%     76.332          9                0.00%
Trad. Adv.         0.0835        4.57%             0.1739         483.8701     49.307     9            0.00%     49.307          9                0.00%
No. of Stores      0.081         4.44%             0.1686         3096.5023    47.8213    9            0.00%     47.8213         9                0.00%
Online Adv.        0.0764        4.18%             0.159          181.6759     45.0943    9            0.00%     45.0943         9                0.00%
Temperature        0.0592        3.24%             0.1233         14.5441      34.9654    9            0.01%     34.9654         9                0.01%







It is important to stress that this is a form of observational inference and it does not imply a causal relationship with Sales. We assume that some of these variables "cause" Sales, but from this table we can only infer association, not causation.


Total Effects on Target
The same caveat also holds true for our next evaluation, Total Effects on Target (Analysis>Report>Target
Analysis>Total Effects on Target):




Total Effect is a linearized measure that shows the impact on the Target of a one-unit change in the mean of each node (the effect is computed at the mean).

Total Effects on Target Sales

Node               Standardized   Total Effect   G-test     Degrees of   p-value   G-test (Data)   Degrees of       p-value (Data)
                   Total Effect                             Freedom                                Freedom (Data)
Competitive Adv.   -0.3456        -32.0159       76.332     9            0.00%     76.332          9                0.00%
Trad. Adv.         0.2567         6.2351         49.307     9            0.00%     49.307          9                0.00%
No. of Stores      0.1679         48.0703        47.8213    9            0.00%     47.8213         9                0.00%
Online Adv.        0.1482         22.4707        45.0943    9            0.00%     45.0943         9                0.00%
Temperature        0.1291         323.6881       34.9654    9            0.01%     34.9654         9                0.01%
Weekday            -0.0501        -583.078       283.5916   18           0.00%     283.5916        18               0.00%


This can be illustrated by performing the computation manually in the Monitor Panel. By default, the Monitors show the marginal frequency distributions of the states of the nodes plus the mean value (expected value) of those distributions:








As stated above, the Total Effect is computed on the basis of a one-unit change of each node. We can simulate this by setting Competitive Adv. to a new mean value, i.e. changing its mean from 514.996 to 515.996. It must be noted that there are infinitely many distributions that achieve this one-unit increase in the mean. BayesiaLab supports the analyst by choosing, of all possible distributions, the one that is closest to the original distribution while achieving the targeted mean value. We simply need to right-click on the Monitor for Competitive Adv. and select Distribution for Target Value/Mean.
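
BayesiaLab's exact proximity criterion is not spelled out here, but one standard way to formalize "the distribution closest to the original with a prescribed mean" is minimum cross-entropy (minimum KL divergence), whose solution is an exponential tilt of the original probabilities. A sketch under that assumption, with hypothetical input arrays:

    import numpy as np

    def shift_mean(values, probs, target_mean, lo=-50.0, hi=50.0):
        # Min-KL distribution over the same states with the requested mean:
        # p'(x) is proportional to p(x) * exp(lam * x); since the tilted mean
        # is monotone in lam, lam can be found by bisection
        values = np.asarray(values, float)
        probs = np.asarray(probs, float)
        for _ in range(200):
            lam = (lo + hi) / 2
            w = probs * np.exp(lam * values - (lam * values).max())  # stable
            w /= w.sum()
            if w @ values < target_mean:
                lo = lam
            else:
                hi = lam
        return w

    # e.g. shift_mean(state_values, marginals, 515.996) for Competitive Adv.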




This prompts us to type in our desired value, i.e. 515.996, to reflect the one-unit change.








We can now observe the impact on Sales as a result of changing Competitive Adv. by one unit. The resulting delta of -32.104 is shown in parentheses. This confirms (within the possible numeric precision) the value reported in the Total Effects table.




However, the reader will notice that not only Sales was affected but also most of the other nodes, albeit with very small changes. This means that, given that we observe a one-unit change of Competitive Adv., we will also observe a change in other nodes, which are also connected to the target and may thus contribute to a change in the target. This reflects the Bayesian network property of omnidirectional inference. As such, the one-unit change in Competitive Adv. is not an orthogonal impulse, which is very important to bear in mind for interpretation purposes.



Causal Inference

Pearl’s Do-Operator
To move beyond the observational inference generated by the Total Effects function, we must now turn to a causal framework. Our first option is to use Intervention with the Do-Operator, which requires us to convert our original network into a fully specified causal network.

It is immediately obvious that most of the original arc directions, which were found by the Supervised Learning algorithm, cannot be interpreted causally; e.g. Sales causes neither Temperature nor Weekday.






However, using our domain knowledge, we can assume that Sales is the effect of all the other variables in
this model.

So, we will need to encode these causal relationships manually, as shown in the following graph:




While this causal representation is formally correct, it creates an immediate practical problem. As we do not have any parametric representation of the relationship between Sales and the other 6 variables, the required CPT associated with Sales contains 28,672 cells. With only a few hundred observations, it is impossible to obtain a robust estimate of all these parameters.
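
As a quick sanity check on this count: Sales is discretized into 4 states and Weekday has 7; assuming the five remaining parents were each discretized into 4 bins (an assumption consistent with the reported total), the CPT comprises 7 × 4^5 = 7,168 parent configurations × 4 target states = 28,672 cells.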

BayesiaLab will actually highlight this problem as we build this network manually.








For now, however, we may want to ignore this constraint and proceed with this approach. We can use BayesiaLab's Taboo Learning to search for additional probabilistic relationships after having fixed the manually-encoded causal arc structure from above. Upon completion of this algorithm, and having applied the layout algorithm, we now have a more connected network:




These newly established arcs, however, do not yet reflect our causal assumptions. We now need to go through them one by one to formalize the direction of causality. With some arcs, it is fairly obvious, such as Weekday ➝ No. of Stores (e.g. some stores are closed because it is Sunday).

We can invert this arc from within BayesiaLab's Validation Mode. We simply right-click the arc of interest and select Invert Orientation within the Equivalence Class.9




9   For a discussion of equivalence classes, see chapter 1.





The new structure, with the inverted arc highlighted in red, is shown below:




However, a side effect of this arc inversion within the equivalence class was that the arc Temperature ➝ No. of Stores was automatically inverted in order to maintain the original JPD. We can resolve this by establishing constraints that reflect our causal knowledge, e.g. a higher Temperature in summer causes a higher No. of Stores to be open, meaning that only the arc Temperature ➝ No. of Stores is permissible but not its inverse.

While these constraints can be easily applied in BayesiaLab, we will omit these details and instead fast-forward to another issue, which, as it turns out, will make all previous efforts futile: we have probabilistic relationships in our domain for which, given our knowledge, we cannot resolve the causal direction. For instance, does Trad. Adv. cause Competitive Adv., or is it the other way around? Without finalizing this causal structure we are unable to proceed with Graph Surgery, which ultimately prevents us from carrying out causal inference.

In conclusion, we face two major obstacles to performing causal inference with Graph Surgery: first, the intractable size of the CPT and, second, the incomplete causal structure.


Direct Effects with Likelihood Matching (LM)
As opposed to using the Do-Operator, we can move forward using LM, regardless of the arc directions, as
long as the network provides a good representation of the JPD of the underlying data.

For this purpose, a new Direct Effect Analysis tool has recently been introduced in BayesiaLab 5.0.4. This is similar to the Total Effects tool; however, Direct Effects obtains, as the name implies, the "direct" impact of a treatment variable on the target node by using the LM algorithm to fix the confounders.

The new approach with LM requires fewer prerequisites and may thus lead us to the desired causal inference more quickly. We can return to the originally learned non-causal Bayesian network, which is computationally entirely tractable.








On the basis of this non-causal network, we can perform Direct Effects (Analysis>Report>Target Analysis>Direct Effects on Target).




The resulting table provides us with Standardized Direct Effect, Direct Effect, Contribution and Elasticity,
with respect to Sales:

Direct Effects on Target Sales

Node                Standardized Direct Effect   Direct Effect   Contribution   Elasticity
No. of Stores       0.229                        65.5416         32.72%         21.39%
Trad. Adv.          0.1851                       4.496           26.45%         19.41%
Competitive Adv.    -0.13                        -12.041         18.57%         -9.67%
Online Adv.         0.0982                       14.8906         14.03%         9.03%
Weekday             0.0305                       354.755         4.36%          2.55%
Temperature         0.027                        67.7507         3.86%          2.09%


The Direct Effect column represents the effect of a unit change of each variable while holding all other variables fixed. One can think of each node (in turn and by itself) being considered a treatment, while all other nodes, except for the target, are being used as "likelihood-matched" sets of covariates. For instance, a one-unit change in No. of Stores is associated with a +65.5 delta in Sales, everything else being equal.
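As a loose analogy only (this is not the LM algorithm, which operates on the network's joint probability distribution rather than on a parametric model): in a linear world, the coefficient on a treatment in a regression that includes the confounding covariates plays exactly this role of a unit-change effect with everything else held fixed. The data below are simulated with hypothetical coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data with a known direct effect (all coefficients hypothetical).
stores = rng.normal(50.0, 5.0, n)                  # confounding covariate
trad_adv = 0.5 * stores + rng.normal(0.0, 2.0, n)  # treatment, driven by stores
sales = 65.5 * stores + 4.5 * trad_adv + rng.normal(0.0, 10.0, n)

# Including the covariate "holds it fixed"; the treatment coefficient then
# measures the effect of a one-unit change in trad_adv on sales.
X = np.column_stack([np.ones(n), trad_adv, stores])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta[1])  # ~4.5, recovering the simulated direct effect
```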

The Contribution column provides a breakdown of each variable's individual contribution in percent (summing up to 100%). This means that an observed change in Sales should be attributed to the individual variables as per the Contribution values.







Elasticity is shown in the rightmost column. The definition of Elasticity is based on the mathematical notion of point elasticity. In general, the "x-elasticity of y", also called the "elasticity of y with respect to x", is:

$$E_{y,x} = \frac{\partial \ln y}{\partial \ln x} = \frac{\partial y}{\partial x} \cdot \frac{x}{y} = \frac{\%\Delta y}{\%\Delta x}$$

In marketing, Elasticity is most often used in the context of price elasticity.
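As a quick numerical check, the point elasticity is just the local slope rescaled by the ratio of the means. The means below are hypothetical placeholders (they are not reported in the text), chosen only so that the result matches the 19.41% shown in the table for Trad. Adv.

```python
def point_elasticity(dy_dx, x, y):
    """x-elasticity of y at the point (x, y): (dy/dx) * (x / y)."""
    return dy_dx * x / y

# Direct Effect of Trad. Adv. from the table above; the means of Trad. Adv.
# and Sales are hypothetical placeholders, not values from the model.
print(point_elasticity(dy_dx=4.496, x=43.2, y=1000.0))  # ~0.194, i.e. 19.4%
```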

It is important to point out that the Direct Effect is a linearized value and represents the derivative of the
Direct Effects Function taken at the a-priori mean value of the respective variable. All the Direct Effects
Functions can be shown with Analysis>Graphic>Target Mean Analysis>Direct Effect:








To make the graph easier to interpret, the values of all variables (except the target) are normalized. In the
case of Weekday, this means that the numerical values 1 through 7 (representing Monday through Sunday)
are normalized to a 0 to 100 range.
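Presumably this is the standard min-max rescaling implied by the 0 to 100 range described above; a minimal sketch:

```python
def min_max_0_100(x, lo, hi):
    # Rescale x from [lo, hi] to [0, 100], as in the Direct Effects plot.
    return (x - lo) / (hi - lo) * 100.0

print([round(min_max_0_100(d, 1, 7), 1) for d in range(1, 8)])
# [0.0, 16.7, 33.3, 50.0, 66.7, 83.3, 100.0]
```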

The nonlinear character of several of these variables is rather obvious and suggests that the linearized Direct Effect must be used with caution. For the near-linear variables Trad. Adv. and Competitive Adv., the Direct Effect may fully capture the nature of the relationship with Sales, whereas for the nonlinear Weekday it would be misleading. This becomes particularly relevant in the context of optimization, which we will discuss later.


Causal Inference as an Afterthought
Direct Effects per se carry no causal meaning. However, if we do provide causal assumptions, we can immediately interpret Direct Effects as causal effects. We can make the causal assumptions after computing the Direct Effects, quite literally as an afterthought.







Causal Reasoning
Having settled on our causal assumptions, we now have a model of our domain that we can use for reasoning and subsequent decision making. Hence, we return to the original objective of obtaining "actionable insight," as we can now formally reason about our domain. We now have the ability to anticipate the consequences of (marketing) actions we have not yet taken.

In this particular domain, assuming that we are in the position of the ice cream distributor, only two of the model's variables are under our control, Trad. Adv. and Online Adv.; all others are beyond our control, although we might wish for a higher Temperature and less Competitive Adv. Searching for a rational course of action can thus only include combinations of Trad. Adv. and Online Adv. as "marketing levers." It is now our task to reason about how these levers are best employed to maximize Sales.



Marketing Mix Optimization

While we have emphasized the abstract concept “reasoning about a domain,” from a practical perspective
we are looking at the classical task of marketing mix optimization.


Linear Marketing Mix Optimization

From theory we understand that in a linear marketing model (represented by a function f), the gradient of
the response function f provides the optimal ratio of marketing instruments.

The gradient (or gradient vector field) of a scalar function f(x1, x2, ..., xn) is denoted ∇f, where ∇ (the nabla symbol) denotes the vector differential operator. The gradient of f is defined to be the vector field whose components are the partial derivatives of f. That is:


$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$


Our previously generated Elasticity column represents ∇f. As a result, we can directly read the optimal marketing mix ratios from the Elasticity column. Among other things, it would suggest raising Temperature and reducing Competitive Adv. Quite obviously, such a recommendation cannot be serious, as we do not have control over such variables.

Non-Controllable Variables and Non-Confounders

The non-controllable nature of variables like Temperature and Weekday is self-evident. We can declare them as such via the Cost Editor, which allows setting the non-controllable variables to "not observable." The Cost Editor can be selected from the contextual menu that appears when right-clicking on the Graph Panel background.








This declaration will keep them fixed in any subsequent analysis and also exclude them from being used as treatment variables. This new definition is also reflected in the node colors, as non-observable nodes are now shown in a light shade of purple.




That leaves two more nodes that are also not under our control, No. of Stores and Competitive Adv. They, however, must be differentiated from the non-observable variables above. The difference is that these variables,




although we do not control them, may very well be affected by our actions. It is reasonable to believe that the level of Competitive Adv. is, at least to some extent, a function of our own advertising. This means that we need to assign a special status to them, which excludes them from our optimization algorithm but does not keep them fixed. We need to specifically permit their "responsive effects." In our terminology, we call them "non-confounders," and we can assign that status via BayesiaLab's Classes (right-click on the node and select Properties>Classes>Add).




The reserved Class name is "Non_Confounder."







To emphasize their distinct role, we have highlighted these nodes in red:




With the Non-Observables and the Non-Confounders defined, we can now proceed to compute the Direct Effects:

Direct Effects on Target Sales

Node          Standardized Direct Effect   Direct Effect   Contribution   Elasticity
Trad. Adv.    0.147                        3.5702          63.36%         15.41%
Online Adv.   0.085                        12.8867         36.64%         7.82%

We can immediately take the values of the Elasticity column as a mix recommendation, i.e. a ratio of roughly 2 to 1 for Trad. Adv. versus Online Adv.
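The 2-to-1 figure follows directly from the two elasticities in the table above:

```python
elasticities = {"Trad. Adv.": 0.1541, "Online Adv.": 0.0782}  # from the table

base = elasticities["Online Adv."]
print({k: round(v / base, 2) for k, v in elasticities.items()})
# {'Trad. Adv.': 1.97, 'Online Adv.': 1.0} -- roughly a 2:1 mix ratio
```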

It would be reasonable to object that this mix recommendation is only valid when accepting the linearity assumption of the Direct Effects. Indeed, by displaying the Direct Effects Functions again, now only showing the two variables under our control, we can see that the linearity assumption would only hold in the center area of the plot.








So, while the linear approximation might be acceptable for estimating the effects of small changes, considering major policy shifts would clearly demand approaching this as a nonlinear problem. As the principal focus of this paper is on observational versus causal inference, we consider this nonlinear optimization out of scope and leave it to a separate tutorial to be published in the near future.



Summary

I.    The Neyman-Rubin model and Pearl's Graph Surgery remain proven tools for computing causal effects. However, direct and causal effect estimation based on Jouffe's Likelihood Matching provides significant advantages, as it does not require the specification of a complete causal structure. With this lower burden of a-priori specification, causal effects can be calculated with significantly less effort. In many cases, this will facilitate quantifying causal effects for the first time in practical applications.

II.   Despite these welcome advances in estimating causal effects, the path from "big data" to "actionable insights" still requires a very disciplined application of expert knowledge to provide the necessary causal assumptions for correct reasoning. The marketing mix model example illustrates the need for a clear understanding of the role of variables, even though we may not need a complete causal structure.

We conclude that Bayesian networks can provide a powerful framework for dealing with complex domains and uncovering dynamics within them. However, at this time, there is no substitute for external assumptions, e.g. from expert knowledge, about the nature of causal relationships.








Appendix

About the Authors
Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia S.A.S. for North America.

Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he was heading the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.







References

Brady, H.E. “Models of causal inference: Going beyond the Neyman-Rubin-Holland theory.” In annual
    meeting of the Midwest Political Science Association, Chicago, IL, 2002.
Cochran, William G., and Donald B. Rubin. “Controlling Bias in Observational Studies: A Review.”
    Sankhyā: The Indian Journal of Statistics, Series A 35, no. 4 (December 1, 1973): 417-446.
Conrady, Stefan, and Lionel Jouffe. "Knowledge Discovery in the Stock Market - Supervised and Unsupervised Learning with BayesiaLab", June 29, 2011. http://www.conradyscience.com/index.php/knowledgediscovery.
———. “Paradoxes and Fallacies - Resolving some well-known puzzles with Bayesian networks”, May 2,
  2011. http://www.conradyscience.com/index.php/paradoxes.
“Data, data everywhere.” The Economist, February 25, 2010.
   http://www.economist.com/node/15557443?story_id=15557443.
Dorfman, Robert, and Peter O. Steiner. "Optimal Advertising and Optimal Quality." The American Economic Review 44, no. 5 (December 1, 1954): 826-836.
Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models.
    1st ed. Cambridge University Press, 2006.
Hagmayer, Y., and M. R. Waldmann. "Simulating causal models: The way to structural sensitivity." In Proceedings of the Twenty-second Annual Conference of the Cognitive Science Society: August 13-15, 2000, Institute for Research in Cognitive Science, University of Pennsylvania, Philadelphia, PA, 214, 2000.
Hagmayer, Y., S.A. Sloman, D.A. Lagnado, and M.R. Waldmann. "Causal reasoning through intervention." Causal learning: Psychology, philosophy, and computation (2007): 86-100.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd. “Characterizing Selection Bias Using
    Experimental Data.” Econometrica 66, no. 5 (1998): 1017-1098.
Imbens, G. “Estimating average treatment effects in Stata.” In West Coast Stata Users’ Group Meetings
    2007, 2007.
Lauritzen, S. L., and D. J. Spiegelhalter. "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems." Journal of the Royal Statistical Society. Series B (Methodological) 50, no. 2 (January 1, 1988): 157-224.
Morgan, Stephen L., and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. 1st ed. Cambridge University Press, 2007.
Pearl, J., and S. Russell. "Bayesian Networks." In Handbook of Brain Theory and Neural Networks, edited by M. Arbib. MIT Press, 2001.
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.
Rosenbaum, Paul R. Observational Studies. Softcover reprint of hardcover 2nd ed. 2002 ed. Springer, 2010.
Rosenbaum, Paul R., and Donald B. Rubin. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika 70, no. 1 (April 1, 1983): 41-55.
Rubin, Donald B. Matched Sampling for Causal Effects. 1st ed. Cambridge University Press, 2006.
Sekhon, J.S. The Neyman-Rubin model of causal inference and estimation via matching methods. Oxford:
    Oxford University Press, 2008.
Stolley, Paul D. "When Genius Errs: R. A. Fisher and the Lung Cancer Controversy." American Journal of Epidemiology 133, no. 5 (March 1, 1991): 416-425.






Stuart, E.A., and D.B. Rubin. “Matching methods for causal inference: Designing observational studies.”
    Harvard University Department of Statistics mimeo (2004).
Witten, Ian H., and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Amsterdam, Boston: Morgan Kaufmann, 2005.




Contact Information
Conrady Applied Science, LLC

312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia S.A.S.

6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com



Copyright

© 2011 Conrady Applied Science, LLC and Bayesia S.A.S. All rights reserved.

Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the
following:

• You may print or download this document for your personal and noncommercial use only.

• You may copy the content to individual third parties for their personal use, but only if you acknowledge
  Conrady Applied Science, LLC and Bayesia S.A.S. as the source of the material.

• You may not, except with our express written permission, distribute or commercially exploit the content.
  Nor may you transmit it or store it in any other website or other form of electronic retrieval system.




www.conradyscience.com | www.bayesia.com
                                                                 48

More Related Content

Viewers also liked

Solving Business Problems for Our Clients, Each Step of the Way
Solving Business Problems for Our Clients, Each Step of the WaySolving Business Problems for Our Clients, Each Step of the Way
Solving Business Problems for Our Clients, Each Step of the WayKevin Hoffman
 
Identify & Charaterize Arguments
Identify & Charaterize ArgumentsIdentify & Charaterize Arguments
Identify & Charaterize Argumentscrickchamps
 
Linkedinstudentppt
LinkedinstudentpptLinkedinstudentppt
Linkedinstudentpptrsusami
 
Cecyt 2008
Cecyt 2008Cecyt 2008
Cecyt 2008moniki
 
NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...
NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...
NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...New England Direct Marketing Association
 
Cecyt 2009
Cecyt 2009Cecyt 2009
Cecyt 2009moniki
 
8 steps to requirements success
8 steps to requirements success8 steps to requirements success
8 steps to requirements successSteve Orr
 
The Truth About Eels
The Truth About EelsThe Truth About Eels
The Truth About EelsMike Dickison
 
Растим профессионалов
Растим профессионаловРастим профессионалов
Растим профессионалов404fest
 
Stokes Slideshare
Stokes SlideshareStokes Slideshare
Stokes SlideshareMiszShayG
 
Luis Veas Powerpoint Tennis
Luis Veas Powerpoint TennisLuis Veas Powerpoint Tennis
Luis Veas Powerpoint TennisLuis9
 
Mi piace un SAC! - Report del percorso di animazione territoriale
Mi piace un SAC! - Report del percorso di animazione territorialeMi piace un SAC! - Report del percorso di animazione territoriale
Mi piace un SAC! - Report del percorso di animazione territorialeConetica
 
Understanding Fractures
Understanding FracturesUnderstanding Fractures
Understanding Fracturesguest4334a9
 
Debt Dr Introduction
Debt Dr IntroductionDebt Dr Introduction
Debt Dr Introductiondrazza65
 
NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...
NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...
NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...New England Direct Marketing Association
 
Hoe Werkt Een Balans
Hoe Werkt Een BalansHoe Werkt Een Balans
Hoe Werkt Een Balansguesta11592
 

Viewers also liked (20)

Solving Business Problems for Our Clients, Each Step of the Way
Solving Business Problems for Our Clients, Each Step of the WaySolving Business Problems for Our Clients, Each Step of the Way
Solving Business Problems for Our Clients, Each Step of the Way
 
Identify & Charaterize Arguments
Identify & Charaterize ArgumentsIdentify & Charaterize Arguments
Identify & Charaterize Arguments
 
Linkedinstudentppt
LinkedinstudentpptLinkedinstudentppt
Linkedinstudentppt
 
Cecyt 2008
Cecyt 2008Cecyt 2008
Cecyt 2008
 
Digital educational symposium
Digital educational symposiumDigital educational symposium
Digital educational symposium
 
NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...
NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...
NEDMA Seminar: PURLs of Wisdom...How to Use Personalized URLs to Build Strong...
 
Cecyt 2009
Cecyt 2009Cecyt 2009
Cecyt 2009
 
8 steps to requirements success
8 steps to requirements success8 steps to requirements success
8 steps to requirements success
 
Asp.net exception reporter
Asp.net exception reporterAsp.net exception reporter
Asp.net exception reporter
 
Customer Service by Jamie Haenggi
Customer Service by Jamie HaenggiCustomer Service by Jamie Haenggi
Customer Service by Jamie Haenggi
 
The Truth About Eels
The Truth About EelsThe Truth About Eels
The Truth About Eels
 
Растим профессионалов
Растим профессионаловРастим профессионалов
Растим профессионалов
 
Stokes Slideshare
Stokes SlideshareStokes Slideshare
Stokes Slideshare
 
Luis Veas Powerpoint Tennis
Luis Veas Powerpoint TennisLuis Veas Powerpoint Tennis
Luis Veas Powerpoint Tennis
 
Mi piace un SAC! - Report del percorso di animazione territoriale
Mi piace un SAC! - Report del percorso di animazione territorialeMi piace un SAC! - Report del percorso di animazione territoriale
Mi piace un SAC! - Report del percorso di animazione territoriale
 
Understanding Fractures
Understanding FracturesUnderstanding Fractures
Understanding Fractures
 
Investing in Youth
Investing in YouthInvesting in Youth
Investing in Youth
 
Debt Dr Introduction
Debt Dr IntroductionDebt Dr Introduction
Debt Dr Introduction
 
NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...
NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...
NEDMA14: Targeting Audiences with Direct Response Campaigns on Mobile - Ted M...
 
Hoe Werkt Een Balans
Hoe Werkt Een BalansHoe Werkt Een Balans
Hoe Werkt Een Balans
 

Similar to Causal Inference and Direct Effects

Probabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor AnalysisProbabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor AnalysisBayesia USA
 
Microarray Analysis with BayesiaLab
Microarray Analysis with BayesiaLabMicroarray Analysis with BayesiaLab
Microarray Analysis with BayesiaLabBayesia USA
 
Bayesia Lab Choice Modeling 1
Bayesia Lab Choice Modeling 1Bayesia Lab Choice Modeling 1
Bayesia Lab Choice Modeling 1jouffe
 
Introduction of abm
Introduction of abmIntroduction of abm
Introduction of abmYudi Yasik
 
Machine learning - session 4
Machine learning - session 4Machine learning - session 4
Machine learning - session 4Luis Borbon
 
Semantic representation of neuroimaging observation
Semantic representation of neuroimaging observationSemantic representation of neuroimaging observation
Semantic representation of neuroimaging observationEmna AMDOUNI, Ph.D.
 
Modeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian NetworksModeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian NetworksBayesia USA
 
hisory of computers in pharmaceutical research presentation.pptx
hisory of computers in pharmaceutical research presentation.pptxhisory of computers in pharmaceutical research presentation.pptx
hisory of computers in pharmaceutical research presentation.pptxDhanaa Dhoni
 
Presentation (9).pptx
Presentation (9).pptxPresentation (9).pptx
Presentation (9).pptxAmitMasand5
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph MiningSabri Skhiri
 
3. introduction of ABM_INTI.pdf
3. introduction of ABM_INTI.pdf3. introduction of ABM_INTI.pdf
3. introduction of ABM_INTI.pdfYudi Yasik
 
MedChemica Levinthal Lecture at Openeye CUP XX 2020
MedChemica Levinthal Lecture at Openeye CUP XX 2020MedChemica Levinthal Lecture at Openeye CUP XX 2020
MedChemica Levinthal Lecture at Openeye CUP XX 2020Ed Griffen
 
[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...
[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...
[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...nis62
 

Similar to Causal Inference and Direct Effects (20)

Probabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor AnalysisProbabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor Analysis
 
Microarray Analysis with BayesiaLab
Microarray Analysis with BayesiaLabMicroarray Analysis with BayesiaLab
Microarray Analysis with BayesiaLab
 
Beyond the Mean
Beyond the MeanBeyond the Mean
Beyond the Mean
 
Bayesia Lab Choice Modeling 1
Bayesia Lab Choice Modeling 1Bayesia Lab Choice Modeling 1
Bayesia Lab Choice Modeling 1
 
man0 ppt.pptx
man0 ppt.pptxman0 ppt.pptx
man0 ppt.pptx
 
Introduction of abm
Introduction of abmIntroduction of abm
Introduction of abm
 
Machine learning - session 4
Machine learning - session 4Machine learning - session 4
Machine learning - session 4
 
Semantic representation of neuroimaging observation
Semantic representation of neuroimaging observationSemantic representation of neuroimaging observation
Semantic representation of neuroimaging observation
 
Chemoinformatic
Chemoinformatic Chemoinformatic
Chemoinformatic
 
Modeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian NetworksModeling Vehicle Choice and Simulating Market Share with Bayesian Networks
Modeling Vehicle Choice and Simulating Market Share with Bayesian Networks
 
hisory of computers in pharmaceutical research presentation.pptx
hisory of computers in pharmaceutical research presentation.pptxhisory of computers in pharmaceutical research presentation.pptx
hisory of computers in pharmaceutical research presentation.pptx
 
Presentation (9).pptx
Presentation (9).pptxPresentation (9).pptx
Presentation (9).pptx
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph Mining
 
3. introduction of ABM_INTI.pdf
3. introduction of ABM_INTI.pdf3. introduction of ABM_INTI.pdf
3. introduction of ABM_INTI.pdf
 
MedChemica Levinthal Lecture at Openeye CUP XX 2020
MedChemica Levinthal Lecture at Openeye CUP XX 2020MedChemica Levinthal Lecture at Openeye CUP XX 2020
MedChemica Levinthal Lecture at Openeye CUP XX 2020
 
[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...
[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...
[P.D.F] Bayesian Methods for Hackers: Probabilistic Programming and Bayesian ...
 
Us fsi bs_sifma_systemic_riskinformationstudyjune2010updated
Us fsi bs_sifma_systemic_riskinformationstudyjune2010updatedUs fsi bs_sifma_systemic_riskinformationstudyjune2010updated
Us fsi bs_sifma_systemic_riskinformationstudyjune2010updated
 
Go Predictive Analytics
Go Predictive AnalyticsGo Predictive Analytics
Go Predictive Analytics
 
1305 track 3 siegel
1305 track 3 siegel1305 track 3 siegel
1305 track 3 siegel
 
1115 track2 siegel
1115 track2 siegel1115 track2 siegel
1115 track2 siegel
 

Causal Inference and Direct Effects

  • 1. Causal Inference and Direct Effects Pearl’s Graph Surgery and Jouffe’s Likelihood Matching Illustrated with Simpson’s Paradox and a Marketing Mix Model Stefan Conrady, stefan.conrady@conradyscience.com Dr. Lionel Jouffe, jouffe@bayesia.com September 15, 2011 Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting
  • 2. Causal Inference and Direct Effects Table of Contents Introduction Motivation & Objective 4 Overview 5 Notation 5 I. Methods of Causal Inference Simpson’s Paradox Example 6 Neyman-Rubin Model of Causal Inference 7 Inference Based on Experimental Data 7 Inference from Observational Data 7 The Bayesian Network Representation 9 Pearl’s Do-Operator 10 Causal Networks 10 Intervention 11 Jouffe’s Likelihood Matching (LM) 13 Importance of Network Performance 15 Summary 16 II. Practical Applications of Direct Effects and Causal Inference The Marketing Mix Model Example 17 CPG Example & Dataset 18 Model Development 18 Data Import 18 Supervised Learning 22 Network Performance 24 Model Analysis 25 Pearson’s Correlation 26 Mutual Information 27 Observational Inference 29 www.conradyscience.com | www.bayesia.com ii
  • 3. Causal Inference and Direct Effects Total Effects on Target 31 Causal Inference 33 Pearl’s Do-Operator 33 Direct Effects with Likelihood Matching (LM) 36 Causal Inference as an Afterthought 39 Causal Reasoning 40 Marketing Mix Optimization 40 Linear Marketing Mix Optimization 40 Non-Controllable Variables and Non-Confounders 40 Summary 44 Appendix About the Authors 46 Stefan Conrady 46 Lionel Jouffe 46 References 47 Contact Information 48 Conrady Applied Science, LLC 48 Bayesia S.A.S. 48 Copyright 48 www.conradyscience.com | www.bayesia.com iii
  • 4. Causal Inference and Direct Effects Introduction Motivation & Objective To this day, randomized experiments remain the gold standard for generating models that permit causal inference. In many elds, such as drug trials, they are, in fact, the conditio sine qua non. Without rst hav- ing established and quanti ed the treatment effect (and any associated side effects), no new drug could pos- sibly win approval. This means that a drug must be proven in terms of its causal effect and hence the under- lying study must facilitate causal inference. However, in many other domains, such controlled experiments are not feasible, be it for ethical, economical or practical reasons. For instance, it is obvious that the federal government could not create two different tax regimes in order to evaluate their respective impact on economic growth. For lack of such experiments, economists have been traditionally be constrained to studying strictly observational data and, although much-desired, causal inference is much more dif cult to carry out on that basis. Causal inference from ob- servational studies typically requires an extensive range of assumptions, which may or may not be justi able depending on one’s viewpoint. Being subject to such individual judgement, it should not surprise us that there is widespread disagreement among economic experts and government leaders regarding the effect of economic policies. While economists and social scientists have been using observational data for over a century for policy de- velopment, the business world has only recently been discovering the emerging potential of “big data” and “competing on analytics.” As these terms are becoming buzzwords, and are rightfully expected to hold great promise, the strictly observational nature of most “big data” sources is often overlooked. The wide avail- ability of new, easy-to-use analytics tools may turn out to be counterproductive, as observational versus causal inference are not explicitly differentiated. While the mantra of “correlation does not imply causa- tion” remains frequently quoted as a general warning, many business analysts would not know under what speci c conditions it can be acceptable to derive a causal interpretation from correlation in observational data. Consequently, causal assumptions are often made rather informally and implicitly and thus they typi- cally remain undocumented. The line between association and causation often becomes further blurred in the eyes of the end users of such research. Given that the concept of causality remains ill-understood in many practical applications, we seriously question today’s real-world business capabilities for deriving ra- tional policies from the newly-found “big data.” With these presumed shortcomings in business practice, it is our objective to provide a framework that fa- cilitates a much more disciplined approach regarding causal inference while remaining accessible to (non- statistician) business analysts and transparent to executive decision makers. We believe that Bayesian net- works are an appropriate paradigm for this purpose and that the BayesiaLab software package offers a ro- bust toolset for distinguishing observational and causal inference. www.conradyscience.com | www.bayesia.com 4
  • 5. Causal Inference and Direct Effects Overview The format of this document is essentially “two papers in one,” with the rst chapter focusing on mostly theoretical considerations (although illustrated with an example), while the second chapter provides a prac- tical, real-world example presented in the form of a tutorial. I. Methods of Causal Inference We will rst introduce the reader to the idea of formal causal inference using the well-known example of Simpson’s Paradox. Secondly, we will provide a brief summary of the Neyman-Rubin model, which represents a traditional statistical approach in this context. Once this method is established as a refer- ence point, we will introduce two methods within the Bayesian network paradigm, Pearl’s Do- Operator, which is based on “Graph Surgery”, and a method based on Jouffe’s “Likelihood Match- ing” algorithm (LM). LM allows xing probability distributions and can be considered as a probabilis- tic extension of statistical matching. II. Practical Applications of Direct Effects and Causal Inference While our treatment of Neyman-Rubin is limited to the rst chapter, the two Bayesian network-based methods will be further illustrated as practical applications in the second chapter. Special weight will be given to Likelihood Matching (LM), as it has not yet been documented in literature. We will ex- plain the practical bene ts of LM with a real-world business application and discuss observational and causal inference in the context of a marketing mix model. Using the marketing mix model as the prin- cipal example, we will go into greater detail regarding the analysis work ow, so the reader can use this example as a step-by-step guide to implementing such a model with BayesiaLab. Notation To clearly distinguish between natural language, software-speci c functions and example-speci c variable names, the following notation is used: • Bayesian network and BayesiaLab-speci c functions, keywords, commands, etc., are capitalized and shown in bold type. • Names of attributes, variables, nodes and are italicized. www.conradyscience.com | www.bayesia.com 5
  • 6. I. Methods of Causal Inference I. Methods of Causal Inference Simpson’s Paradox Example In our recent white paper, Paradoxes and Fallacies, we have written about Simpson’s Paradox, which occa- sionally appears in the popular press as a rather enigmatic statistical anomaly. We use an admittedly con- trived example to illustrate this paradox: A hypothetical type of cancer equally effects men and women. A long-term, observational and non- experimental study nds that a speci c type of cancer therapy is associated with an increased remission rate among all treated patients (see table). Based on the study, this particular treatment is thus recommended for broader application. Remission Treatment Yes No Yes 50% 50% No 40% 60% However, when examining patient records by gender, the remission rate for male patients — upon treat- ment — decreases from 70% to 60% and for female patients the remission rate declines from 30% to 20% (see table). So, is this new therapy effective overall or not? Remission Gender Treatment Yes No Yes 60% 40% Male No 70% 30% Yes 20% 80% Female No 30% 70% The answer lies in the fact that — in this example — there was an unequal application of the treatment to men and women. More speci cally, 75% of the male patients and only 25% of female patients received the treatment. Although the reason for this imbalance is irrelevant for inference, one could imagine that side effects of this treatment are much more severe for females, who thus seek alternatives therapies. As a result, there is a greater share of men among the treated patients. Given that men also have a better recovery pros- pect with this type of cancer, the remission rate for the total treated population increases. So, what is the true overall effect of this treatment? There are actually several possible ways to compute the effect, namely the statistical approach based on the Neyman-Rubin Model of Causal Inference, and the two Bayesian network-based approaches: Pearl’s Graph Surgery approach and the method based on Jouffe’s LM algorithm, which are both implemented in BayesiaLab. www.conradyscience.com | www.bayesia.com 6
  • 7. I. Methods of Causal Inference Neyman-Rubin Model of Causal Inference To begin our discussion of causal inference, we will rst present matching as a statistical method for causal inference based on observational data. Our brief summary follows the framework that is widely known as the Neyman-Rubin Model of Causal Inference (Rubin, 2006; Sekhon, 2007; Morgan and Winship, 2007; Rosenbaum, 2002). We closely follow Sekhon (2007) for this highly condensed summary: Inference Based on Experimental Data Let Yi1 denote the potential outcome for unit i if the unit receives treatment, and let Yi 0 denote the potential outcome for a unit in the control group. The treatment effect for unit i is de ned as: τ i = Yi1 − Yi 0 Furthermore, let Ti be the a treatment indicator: 1 when unit i is in the treatment group and 0 when unit i is in the non-treatment control group. If assignment to treatment is randomized, causal inference is fairly simple because the two groups are drawn from the same population, and treatment assignment is independent of all baseline variables. As the sample size grows, observed and unobserved confounders are balanced across treatment and control groups. That is, with random assignment, the distributions of both observed and unobserved variables in both groups are equal in expectation. Treatment assignment is independent of Y0 and Y1 — i.e., {Y i0 ,Yi1 ⊥ Ti } , where “ ⊥ ” symbol represents independence. Hence, for j = 0, 1 E(Yij∣ i = 1) = E(Yij∣ i = 0) = E(Yi∣ i = j) T T T Therefore, the average treatment effect can be estimated by: τ = E(Yi1∣ i = 1) − E(Yi 0∣ i = 0) T T = E(Yi∣ i = 1) − E(Yi∣ i = 0) T T τ can be estimated in an experimental setting because randomization can ensure that observations in treat- ment and control groups are exchangeable. Randomization ensures that assignment to treatment will not, in expectation, be associated with the potential outcomes. Inference from Observational Data The situation with observational data is much less straightforward as treatment and control groups are not necessarily drawn from the same population. Hence, the average treatment effect τ cannot be estimated the same way as was the case with experimental data. www.conradyscience.com | www.bayesia.com 7
  • 8. I. Methods of Causal Inference As an alternative, we can pursue the average treatment effect for the treated, more formally expressed as τ∣ = 1) = E(Yi1∣ i = 1) − E(Yi 0∣ i = 1) (T T T However, the challenge in this case is that Yi 0 is not observed for the treated, i.e. we simply cannot know how those, who were in fact treated, would have fared, had they not been treated. As a potential remedy for this quandary, one could assume that treatment selection depends on a set of ob- servable covariates X. Furthermore, we could assume that given X, treatment assignment is independent of Y. More formally, {Y ,Y ⊥ T∣X } , which is referred to as “unconfoundedness” 0 1 A nal assumption is that there is a so-called “overlap:” 0 < P(T = 1 ∣X) < 1 . In particular case, where X ∈{male, female} , this implies that treatments must be observed for both males and females in order to obtain overlap. Together, unconfoundedness and overlap form the concept of strong ignorability, which are required for the estimation of the average treatment effect for the treated: τ∣ = 1) = E { E(Yi∣Xi ,Ti = 1) − E(Yi∣Xi ,Ti = 0) Ti = 1} (T ∣ This means we condition on the observed covariates, Xi , and thus treatment and control groups are bal- anced. Conditioning on X can be a straightforward task and can typically be achieved by matching, which means nding exactly matching sets of covariates. In our case, matching is simple, as we only have one covariate with two states, i.e. male and female. We can then compute the treatment effects for exactly matched sets of treated and untreated units within each subset, i.e. within the male and female group. However, in most real-world applications, we have many more covariates and among those many may have continuous values rather than discrete states. Inevitably, this makes matching much more challenging and it actually may be impossible to perform exact matching. Given this challenge, propensity score matching (Rosenbaum and Rubin, 1983) and matching based on the Mahalanobis distance (Cohran and Rubin, 1973) have emerged as commonly used methods. Both methods perform matching based on covariate similarity, but it goes be beyond the scope of this paper to elaborate further on the details of these and related methods. www.conradyscience.com | www.bayesia.com 8
  • 9. I. Methods of Causal Inference The Bayesian Network Representation To illustrate Pearl’s Do-Operator based on Graph Surgery and subsequently the method based on Jouffe’s Likelihood Matching, we need to switch from the traditional statistical framework to the Bayesian network paradigm. The starting point for both methods is a synthetically generated dataset with three variables, Gender, Treatment and Remission, with a total of 1,000 observations, which re ects the statistics described in the tables provided in the description of Simpson’s Paradox.1 This dataset will serve as the basis for the Bayesian network to be used for causal inference. For expositional simplicity, we will omit the steps required for importing the dataset into BayesiaLab and refer the reader to the second chapter, which describe the import process in detail. Rather, we begin directly in BayesiaLab’s Modeling Mode with the initially unconnected network consisting of three nodes, i.e. the variables of interest: From theory we know that we can factorize a Joint Probability Distribution (JPD) into the product of condi- tional probability distributions (see Barber, 2011, for a detailed discussion). With the three nodes that we have in our example, there are actually six different ways to do this: p(x1 , x2 , x3 ) = p(x1∣x2 , x3 )p(x2 , x3 ) = p(x1∣x2 , x3 )p(x2∣x3 )p(x3 ) p(x1 , x3 , x2 ) = p(x1∣x3 , x2 )p(x3 , x2 ) = p(x1∣x3 , x2 )p(x3∣x2 )p(x2 ) p(x2 , x1 , x3 ) = p(x2∣x1 , x3 )p(x1 , x3 ) = p(x2∣x1 , x3 )p(x1∣x3 )p(x3 ) p(x2 , x3 , x1 ) = p(x2∣x3 , x1 )p(x3 , x1 ) = p(x2∣x3 , x1 )p(x3∣x1 )p(x1 ) p(x3 , x1 , x2 ) = p(x3∣x1 , x2 )p(x1 , x2 ) = p(x3∣x1 , x2 )p(x1∣x2 )p(x2 ) p(x3 , x2 , x1 ) = p(x3∣x2 , x1 )p(x2 , x1 ) = p(x3∣x2 , x1 )p(x2∣x1 )p(x1 ) Given the semantics of Bayesian networks, this translates into six possible, equivalent Bayesian networks, that are all representing exactly the same JPD. When we perform one of BayesiaLab’s learning network algorithms on the sample dataset, we will indeed obtain one of the six possible networks shown above, as suggested by the theory. Without additional infor- mation on those variables, such as we might obtain from temporal indices, we will be unable to select one network over the other and the network choice would have be entirely arbitrary. 1 This dataset was created with BayesiaLab’s Generate Data function, based on the true Joint Probability Distribution (JPD). www.conradyscience.com | www.bayesia.com 9
  • 10. I. Methods of Causal Inference In order to visualize that the arcs in these networks are invertible in their orientation, BayesiaLab can high- light the Essential Graph (Analysis>Graphic>Show the Edges). This will display the edges that can be ori- ented in either direction without modifying the represented JPD. For purposes of observational inference, any of these six equivalent networks would be suf cient. For in- stance, the probability of Remission=yes, given that we observe Treatment=yes, i.e. P(Remission=yes|Trea- tment=yes), can be computed with any of the six networks shown earlier. However, from the introduction of Simpson’s Paradox, we realize that a simple observation is not suf cient to establish the treatment effect. Observational inference may actually be misleading for interpretation pur- poses, which is at the very core of the paradox. So, our question remains, “what is the effect of treatment?” More speci cally, “what is the probability of remission, given that we do administer the treatment?” This means that we want to see the effect of an intervention instead of merely observing that treatment has oc- curred. Graph Surgery and LM provide different ways to answers this question, which we will explain the following two sections: Pearl’s Do-Operator Causal Networks To introduce Pearl’s Do-Operator, we need to make a formal transition from a general Bayesian network to a causal network, because Bayesian networks describe a joint distribution over possible observed events but say nothing about what will happen if an intervention occurs. A causal network is a Bayesian network with the added property that the parents of each node are its direct causes. For example, Fire → Smoke is a causal network whereas Smoke → Fire is not, even though both networks are equally capable of represent- ing any joint distribution on the two variables. More formally, causal networks are de ned as a type of Bayesian network with special properties: upon setting an intervention on a node in a causal network, the correct probability distribution is given by delet- ing the incoming arcs from the node’s parents, i.e. “cutting off” the direct causes of the node. Pearl has characterized this deletion of links rather graphically as “graph mutilation” or “graph surgery.”2 2 Interestingly, “intervenire”, the Latin origin of “intervention,” symbolizes this separation as it literally means “to come in between.” www.conradyscience.com | www.bayesia.com 10
  • 11. I. Methods of Causal Inference With this de nition, Pearl’s Graph Surgery approach requires us to provide a complete set of causal assump- tions regarding the network to compute the effect of an intervention. Given our background knowledge re- garding Simpson’s Paradox, we can make causal assumptions for all edges and thus declare, i.e. by at, our Bayesian network a causal network. As stated earlier, we assume that Gender has a causal effect on Remis- sion (rather than Remission on Gender), so we de ne the arc direction as Gender ➝ Remission. We also assume that Treatment has a causal effect (whether positive or negative) on Remission, which translates into the (directed) arc Treatment ➝ Remission. Finally, we have learned that Gender in uences (causes) whether or not one would undergo Treatment, so we have Gender ➝ Treatment. This eliminates ve of the six pos- sible Bayesian networks and leaves us with only one possible causal Bayesian network: Now that we have a causal Bayesian network we can make a distinction between observational inference and causal inference. This is because of the semantic difference of “given that we observe” versus “given that we do.” The former is strictly an observation, i.e. we focus on the patients who received treatment, whereas the latter is an active intervention. The answer to our question of the treatment effect then is infer- ring as to what would hypothetically happen, “given that we do”, i.e. given that we force the treatment without permitting patients to self-select their treatment. In the semantics of Bayesian networks, this means that there must not be a direct relationship between Gender and Treatment. In other words, Treatment must not directly depend on Gender. Intervention In our Bayesian network this can be done easily by “mutilating” the graph, i.e. deleting the arc connecting Gender and Treatment. BayesiaLab offers a very simple function to achieve this, which is aptly named In- tervention (right-click on the node’s Monitor and then select Intervention). By intervening on the Treatment variable (and setting Treatment=yes), the causal Bayesian network is modi ed (or “mutilated”) as follows: • The entering arcs of the node on which we want to perform intervention are “surgically” removed. With intervention, we cut the dependency between Treatment and Gender, i.e. administering the treatment will not affect Gender. • The original Chance Node (round) representing Treatment is transformed into a Decision Node (square). The associated Monitor will be highlighted in blue. www.conradyscience.com | www.bayesia.com 11
  • 12. I. Methods of Causal Inference Now we can observe what happens to Remission when we “do” Treatment, instead of just “observing” Treatment. Treatment=No Treatment=Yes As we can see, the Gender probability distribution remains the same. However, Remission decreases from 50% to 40%, given that we “do” Treatment. With this we have now obtained the treatment effect: τ = P(Remission = yes Treatment = yes) − P(Remission = yes Treatment = no) = −0.1 ∣ ∣ www.conradyscience.com | www.bayesia.com 12
  • 13. I. Methods of Causal Inference To answer our original question, we must conclude that this new treatment is detrimental to the patients’ health. Jouffe’s Likelihood Matching (LM) We will now brie y introduce the Jouffe’s Likelihood Matching (LM) algorithm, which was originally im- plemented in the BayesiaLab software package for “ xing” probability distributions of an arbitrary set of variables, allowing then to easily de ne complex sets of soft evidence. The LM algorithm searches for a set of likelihood distributions, which, when applied on the Joint Probability Distribution (JPD) encoded by the Bayesian network, allows obtaining the posterior probability distributions de ned (as constraints) by the user. As we saw with Pearl’s Graph Surgery approach, the core idea of intervention is to set an evidence on the node on which we wish to intervene, while all other ascending nodes remain unchanged. Using this very same idea, we can then intervene on a node by xing the posterior probability distributions of its covariates. Casually speaking, this would be a kind “virtual mutilation.” Treatment=No www.conradyscience.com | www.bayesia.com 13
  • 14. I. Methods of Causal Inference Treatment=Yes These results are identical to what was obtained with Pearl’s Graph Surgery. τ = P(Remission = yes Treatment = yes) − P(Remission = yes Treatment = no) = −0.1 ∣ ∣ However, two main differences exist between the methods: I. One important feature of the LM algorithm is that it returns the same result for all the instantiations of the Essential Graph, i.e. for any one of the six equivalent networks. For example, intervening on Treatment using the Bayesian network below, and using the LM algorithm, will lead to exactly the same posterior probability distribution for Remission, even though the arc directions are non-causal (and thus perceived counterintuitive). In comparison to the Do-Operator, the approach based on LM does not require any available causal knowledge to be formally translated into a causal structure in order to compute treatment effects. While it may be easy to specify all the causal directions in a simple model with only three nodes, such as in our example, it is obviously more of a challenge to do the same for a larger network, perhaps consisting of dozens or even hundreds of nodes. That is not to claim that we can avoid causal assump- tions altogether, however, we aim to defer the need for making such assumptions until a later point and then only make those assumptions that are directly related to the pair of variables for which we want to obtain the causal effect. www.conradyscience.com | www.bayesia.com 14
  • 15. I. Methods of Causal Inference II. The other difference between the Graph Surgery and LM is more subtle and may not always be obvi- ous. The Graph Surgery implies a modi cation of the representation of the JPD, whereas the approach based on LM always works on the original JPD. The mere graph mutilation can bring about a modi - cation of some marginal probability distributions, even without changing the marginal probability dis- tributions of the nodes on which we want to intervene. We need to brie y digress from our principal example in order to clarify this particular point: The two graphs below illustrate the mutilation impact on the marginal probability distribution of Customer Satisfaction. Importance of Network Performance While the statistical matching approach of the Neyman-Rubin Matching Model directly utilizes the original observations, the LM algorithm is based on the JPD encoded by the Bayesian network. This emphasizes the requirement that a Bayesian network to be used for this purpose must provide a good representation of the true JPD. While there is no hard-and-fast rule as to what constitutes a minimum t requirement, we can review the overall network performance by selecting Analysis>Network Performance>Global: www.conradyscience.com | www.bayesia.com 15
The key metric here is the Contingency Table Fit (CTF). This measure ranges from 0%, corresponding to representing the JPD with a fully unconnected network (all nodes independent), to 100%, corresponding to a perfect representation of the JPD, as with a fully connected network. The network learned on the Simpson's Paradox dataset happens to be a complete (fully connected) graph, thus the CTF is 100%.

Summary

We have provided a brief summary of the Neyman-Rubin model, which represents a traditional statistical approach to causal inference. Extending beyond the statistical framework, and now within the Bayesian network paradigm, we illustrated Graph Surgery and Pearl's Do-Operator, and finally presented a method based on Jouffe's Likelihood Matching (LM) algorithm. Most importantly, working with the Bayesian network methods highlighted that formal causal assumptions are critical to correct causal inference.
II. Practical Applications of Direct Effects and Causal Inference

The Marketing Mix Model Example

The adage, "I know I waste half of my advertising dollars...I just wish I knew which half," reflects a century-old uncertainty about the effectiveness of marketing instruments.3 More formally, one could describe this quandary as a domain with an unknown (or ill-understood) causal structure.

While "big data", especially in the field of marketing, is expected to rapidly yield "actionable business insights," we need to recognize that there are many steps to traverse to achieve this goal. Hence, we would like to parse this overarching objective of "actionable insights" into distinct components, which immediately highlights the central role of causal inference:

• "Big data" most often refers to large amounts of observational data from a domain. Despite the ever-increasing amount of data, most collected measures include noise and missing data points.
• "Actionable insights" actually implies several things: firstly, it requires an understanding of the domain, which can be used as a basis for reasoning about this domain. A key assumption in this context is that we must not only have a structure describing the observations we have gathered; rather, we must have a causal structure, so we can anticipate the consequences of actions we have not yet taken. If we have this ability to evaluate the results of our potential interventions in the domain, we can choose the rational course of action among all the possible alternatives. As an added complexity, most dynamics uncovered in a domain are probabilistic rather than deterministic in nature.

Although it is typically a challenge, our chosen toolset, BayesiaLab, can implicitly handle missing values and capture the probabilistic nature of the domain, and hence we will not focus on that aspect. Rather, the central theme of this paper is the transition from observation to causation.

As we have seen in the first chapter with the introductory example of Simpson's Paradox, Bayesian networks provide two principal ways of moving from observational inference to causal inference, namely Graph Surgery and Likelihood Matching. With the following example from the CPG industry, we will juxtapose Graph Surgery and LM and then specifically demonstrate how LM can be utilized for computing causal effects, which can subsequently be used for performing marketing mix optimization.

3 Various versions of this quote have been attributed to Henry Procter, Henry Ford, John Wanamaker and J.C. Penney.
CPG Example & Dataset

To illustrate this approach, we study daily ice cream sales of a European food distributor as a function of environmental variables and marketing efforts.4 Our sample dataset includes the following variables:

• Seasonally-adjusted daily sales in the local currency
• Traditional advertising, such as print advertising (incl. coupons), TV, radio, in-store promotions, etc.
• Online advertising, including banner ads, search engine marketing, online coupons
• Competitive advertising (estimate of all competitive marketing efforts combined)
• Temperature in °C
• Number of open retail outlets
• Weekday

Model Development

While the focus of this example is to evaluate and to causally interpret a given marketing mix model, we will spell out the steps one would take to generate such a model with BayesiaLab. This should enable current users of BayesiaLab to replicate the exercise in its entirety.

Data Import

We use BayesiaLab's Data Import Wizard to load all 7 time series5 into memory from a comma-separated file (CSV). BayesiaLab automatically detects the column headers, which contain the variable names.

4 For expository purposes, this dataset was synthetically generated based on actual market dynamics observed in an industry and locale different from the example.
5 Although the dataset has a temporal ordering, for expository simplicity we will treat each time interval as an independent observation.
The next step identifies the data types contained in the dataset. BayesiaLab attempts to detect the type of the variables and, in this case, assumes all variables to be continuous, as indicated by the turquoise background color for all columns. Although Weekday appears continuous, i.e. 1 through 7, it must be treated as discrete so as to avoid binning in the subsequent discretization step.6 Upon setting it to discrete, the Weekday variable will appear in red.

6 In the original dataset the variable Weekday was coded into ordered numerical states, 1 through 7, representing Monday through Sunday. BayesiaLab could also have used text descriptions as state labels, in which case the variable would have been automatically recognized as discrete.
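For readers who wish to mirror these preparation steps outside of BayesiaLab, the following is a minimal pandas sketch; the file name is hypothetical, and the column names are those listed above:

```python
import pandas as pd

# Rough equivalent of the Data Import step (hypothetical file name).
df = pd.read_csv("ice_cream_sales.csv")

# All columns are read as continuous (float); Weekday, coded 1-7 for
# Monday-Sunday, must be treated as discrete so it stays out of the
# subsequent discretization step.
df["Weekday"] = df["Weekday"].astype("category")
print(df.dtypes)
```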
As our dataset contains missing values, we need to specify the type of missing-values processing. We choose the Structural EM method, given that, for a dataset of this size, the computational complexity of this algorithm will not be a burden.

The following discretization step is very important for all models in BayesiaLab, and thus we provide a bit more detail here. The objective of this model is to establish Sales as a function of the marketing instruments and other external factors, and we can take this objective into account for the discretization process. More specifically, we will split the process into two parts. First, we will discretize the target variable, i.e. Sales, on its own. We highlight the Sales column in the data table and then choose Manual as the Discretization Type. This provides us with the probability density function of Sales.
By clicking Generate a Discretization, we are prompted to select the discretization type. We choose Type: K-Means and Intervals: 4.7 The chart will now display the results of this discretization.

7 For a discussion of discretization algorithms and a guide for interval selection, please see the papers referenced in the appendix.
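The idea behind K-Means discretization of a single variable can be sketched with scikit-learn, as below; the sales values are simulated stand-ins for the real series, and BayesiaLab's internal implementation may differ in detail:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical daily sales series standing in for the real data.
rng = np.random.default_rng(0)
sales = rng.normal(23000, 2500, size=400)

# K-Means with 4 clusters on the 1-D target; midpoints between adjacent
# cluster centers serve as the discretization thresholds.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sales.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())
thresholds = (centers[:-1] + centers[1:]) / 2
sales_binned = np.digitize(sales, thresholds)   # states 0..3
print(np.round(thresholds, 1), np.bincount(sales_binned))
```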
Now that we have discretized the target variable by itself, we will discretize the remaining continuous variables with the Decision Tree algorithm, using Sales as the target. This bins the continuous variables in such a way that we retain a maximum amount of information from these variables with respect to the target. Upon completion of the discretization, BayesiaLab presents all variables as nodes in an unconnected network in the Graph Panel.

Supervised Learning

Now that we have an initial network, albeit unconnected, we can run our first Supervised Learning algorithm with the objective of characterizing the target node. However, we first need to specify the target by right-clicking on Sales and selecting Set As Target Node (or pressing "T" while double-clicking on the node).
Once this is set, the Sales node appears in the graph as a bull's-eye, symbolizing a target. We now have an array of Supervised Learning algorithms available to apply here. Given the small number of nodes, variable selection is not an issue and hence should not influence our choice. Furthermore, the relatively small number of observations does not create a challenge in terms of computational effort. With these considerations, and without going into further detail, we select the Augmented Naive Bayes algorithm. The "augmented" part of this algorithm's name refers to the additional unsupervised search that is performed on the basis of the given naive structure.
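The closest openly available analogue of this learning step is tree-augmented naive Bayes (TAN), which also starts from the naive structure and adds augmenting arcs among the features; BayesiaLab's augmented search is more general, so treat the following pgmpy sketch (with a hypothetical file name) as an illustration of the idea rather than the same algorithm:

```python
import pandas as pd
from pgmpy.estimators import TreeSearch
from pgmpy.models import BayesianNetwork

# df: the discretized dataset from the previous steps (hypothetical file).
df = pd.read_csv("ice_cream_discretized.csv")

# TAN keeps the naive arcs Sales -> feature and augments them with a
# mutual-information-weighted tree among the features.
dag = TreeSearch(df, root_node="Temperature").estimate(
    estimator_type="tan", class_node="Sales"
)
model = BayesianNetwork(dag.edges())
model.fit(df)   # maximum-likelihood CPTs
print(sorted(dag.edges()))
```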
Upon learning, the newly generated network is displayed in the Graph Panel. The predefined naive structure is highlighted by the dotted arcs, while the additional (augmented) arcs from the unsupervised learning are shown in solid black.

Network Performance

We could now spend some time further refining this model, such as balancing the degree of complexity versus the overall model fit. Furthermore, we could also specify this as a dynamic model.8 To maintain expositional clarity, we will leave the model as is. However, we do wish to cover a few performance measures to assure the reader that the model presented here is a reasonable characterization of the underlying domain. With the relatively small number of observations, we chose not to set aside a hold-out sample (e.g. 20% of observations) during the data import process. As an alternative way of testing the out-of-sample network performance, we carry out Cross Validation by selecting (from within the Validation Mode) Tools>Cross Validation>Targeted. In terms of parameters for the Cross Validation, we select the same learning algorithm as before, i.e. Augmented Naive Bayes. Also, a 10-fold validation is a typical choice in this context.

8 Given the inherently dynamic nature of marketing effects, it would be very appropriate to model this as a temporal Bayesian network. For instance, this would enable us to capture potential lags in the effects of marketing activities on the target variable. The BayesiaLab framework can easily accommodate such a temporal specification.
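As a rough stand-in for BayesiaLab's targeted cross-validation, one can estimate out-of-sample accuracy with scikit-learn; the classifier below is a generic categorical naive Bayes, not the augmented network, so the resulting numbers would differ from the report that follows:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB

# df: the discretized data frame from above; encode states as integer codes.
X = df.drop(columns=["Sales"]).apply(lambda c: c.astype("category").cat.codes)
y = df["Sales"].astype("category").cat.codes

# min_categories guards against folds that miss a rare state (7 weekdays
# is the largest state count here).
scores = cross_val_score(CategoricalNB(min_categories=8), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(f"10-fold mean accuracy ('total precision'): {scores.mean():.2%}")
```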
The resulting Global Report provides a variety of metrics, including precision and R²:

Sampling Method: K-Folds
Learning Algorithm: Augmented Naive Bayes
Target: Sales

| | <=207556.406 | <=233877.375 | <=259145.594 | >259145.594 |
|---|---|---|---|---|
| Gini Index | 66% | 41.75% | 38.03% | 69.52% |
| Relative Gini Index | 75.25% | 62.92% | 63.76% | 80.63% |
| Mean Lift | 2.49 | 1.64 | 1.52 | 2.49 |
| Relative Lift Index | 81.50% | 78.29% | 80.11% | 84.09% |

Relative Gini Global Mean: 70.64%
Relative Lift Global Mean: 81%
Total Precision: 67.37%
R: 0.76104342242
R²: 0.57918709081

Occurrences

| | <=207556.406 (53) | <=233877.375 (142) | <=259145.594 (172) | >259145.594 (59) |
|---|---|---|---|---|
| <=207556.406 (56) | 37 | 18 | 1 | 0 |
| <=233877.375 (124) | 15 | 86 | 22 | 1 |
| <=259145.594 (213) | 1 | 38 | 140 | 34 |
| >259145.594 (33) | 0 | 0 | 9 | 24 |

Reliability

| | <=207556.406 | <=233877.375 | <=259145.594 | >259145.594 |
|---|---|---|---|---|
| <=207556.406 (56) | 66.07% | 32.14% | 1.79% | 0% |
| <=233877.375 (124) | 12.10% | 69.35% | 17.74% | 0.81% |
| <=259145.594 (213) | 0.47% | 17.84% | 65.73% | 15.96% |
| >259145.594 (33) | 0% | 0% | 27.27% | 72.73% |

Precision

| | <=207556.406 | <=233877.375 | <=259145.594 | >259145.594 |
|---|---|---|---|---|
| <=207556.406 (56) | 69.81% | 12.68% | 0.58% | 0% |
| <=233877.375 (124) | 28.30% | 60.56% | 12.79% | 1.69% |
| <=259145.594 (213) | 1.89% | 26.76% | 81.40% | 57.63% |
| >259145.594 (33) | 0% | 0% | 5.23% | 40.68% |

Even without further comparison, the reported values appear reasonable and suggest that we can proceed with analyzing this network.

Model Analysis

Having accepted the network as a plausible representation of this domain, we will now interpret the structure we obtained. To make the structure easier to understand, we first apply one of BayesiaLab's automatic layout algorithms, which quite literally "disentangles" the network and thus provides a clearer picture. Selecting View>Automatic Layout achieves this (or pressing the keyboard shortcut "P").
The "Naive Bayes" versus the "Augmented" part of this network, shown in dotted and solid arcs respectively, is now much more obvious in this layout. As the naive structure was given by definition, only the presence or absence of solid arcs provides information about the existence of relationships between the predictors. Much more can be understood when we examine the magnitude and the sign of all relationships in the network.

Pearson's Correlation

Although correlation, as we will later emphasize, is not a central metric for network analysis in BayesiaLab, we will use it for a first look, especially since all readers will be familiar with this measure. Selecting Analysis>Graphic>Pearson's Correlation provides this information directly in the network graph.
The colors of the arcs indicate the sign of each relationship, and the arc labels provide the correlation values. Many of the shown relationships seem intuitive, for instance that No. of Stores and both Trad. Adv. and Online Adv. have a positive association with Sales. Equally plausible is the fact that Temperature is associated with Sales (although one of the co-authors of this paper believes that one can eat ice cream rain or shine). The negative association between Competitive Adv. and Sales is also expected. Less clear is the negative correlation between Sales and Weekday, but the small value suggests either a very weak link or perhaps a nonlinear relationship.

Mutual Information

Given that Pearson's correlation is a strictly linear metric, its ability to characterize all these relationships is inherently limited. We will now turn to Mutual Information as a measure that can help overcome this limitation.
In contrast to correlation, Mutual Information does not reflect the sign of a relationship; however, it captures the strength of relationships between variables even if they are highly nonlinear. More specifically, the Mutual Information I(X,Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y. Mutual Information is a symmetric metric, reflecting the uncertainty reduction of X by knowing Y as well as of Y by knowing X.

In our example, knowing the value of Weekday on average reduces the uncertainty of the value of Sales by 0.4802 bits, which means that it reduces its uncertainty by 26.3% (shown in red, in the opposite direction of the arc). Conversely, knowing Sales reduces the uncertainty of Weekday by 17.11% (shown in blue, in the direction of the arc). It is interesting to see that, by the measure of Mutual Information, Weekday and Sales have a very strong relationship, whereas previously the correlation coefficient was near zero.
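For reference, the quantities above follow the standard information-theoretic definitions (general theory, not specific to BayesiaLab):

$$I(X;Y) = H(X) - H(X \mid Y) = \sum_{x,y} P(x,y)\,\log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

The percentages quoted are the Mutual Information normalized by the entropy of the receiving node, I(X;Y)/H(X). Working backwards from the reported values: 0.4802 bits / 26.3% ≈ 1.83 bits for H(Sales), close to the 2-bit maximum of a four-state node, and 0.4802 bits / 17.11% ≈ 2.81 bits ≈ log₂ 7 for H(Weekday), consistent with a near-uniform distribution over the seven weekdays.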
Observational Inference

To explore the nature of this relationship further, we can perform the Target Mean Analysis with Sales and Weekday (Analysis>Graphic>Target Mean Analysis). This prompts us to select the way we want to examine the relationship. In this context, it seems appropriate to look at the delta mean of the target as a function of Weekday. The resulting plot confirms the previous hypothesis of nonlinearity.
For instance, we can interpret this as follows: given that Weekday=Friday, we observe that Sales reach their highest value. Furthermore, given that Weekday=Sunday, we observe that Sales are at their lowest value, as many shops in Europe are closed on Sundays. We can further speculate that consumers perhaps buy more ice cream on Fridays in preparation for leisure activities over the weekend. Returning to our interpretation of Mutual Information, it is now obvious why Weekday reduces the uncertainty of Sales by over 25%: there is quite apparently an intra-week seasonality.

Another interpretation of Mutual Information is "importance," and we can use Analysis>Report>Target Analysis>Correlations with the Target Node to obtain an overview of the importance of all nodes in the network with respect to the target, Sales.

Node significance with respect to the information gain brought by the node to the knowledge of Sales:

| Node | Mutual Information | Mutual Information (%) | Relative Significance | Mean Value | G-test | DoF | p-value | G-test (Data) | DoF (Data) | p-value (Data) |
|---|---|---|---|---|---|---|---|---|---|---|
| Weekday | 0.4802 | 26.30% | 1 | 4.0047 | 283.5916 | 18 | 0.00% | 283.5916 | 18 | 0.00% |
| Competitive Adv. | 0.1293 | 7.08% | 0.2692 | 514.9959 | 76.332 | 9 | 0.00% | 76.332 | 9 | 0.00% |
| Trad. Adv. | 0.0835 | 4.57% | 0.1739 | 483.8701 | 49.307 | 9 | 0.00% | 49.307 | 9 | 0.00% |
| No. of Stores | 0.0810 | 4.44% | 0.1686 | 3096.5023 | 47.8213 | 9 | 0.00% | 47.8213 | 9 | 0.00% |
| Online Adv. | 0.0764 | 4.18% | 0.1590 | 181.6759 | 45.0943 | 9 | 0.00% | 45.0943 | 9 | 0.00% |
| Temperature | 0.0592 | 3.24% | 0.1233 | 14.5441 | 34.9654 | 9 | 0.01% | 34.9654 | 9 | 0.01% |
It is important to stress that this is a form of observational inference and does not imply a causal relationship with Sales. We assume that some of these variables "cause" Sales, but from this table we can only infer association, not causation.

Total Effects on Target

The same caveat also holds true for our next evaluation, Total Effects on Target (Analysis>Report>Target Analysis>Total Effects on Target). Total Effect is a linearized measure: it shows the impact on the Target of a one-unit change in the mean of each node, computed at that node's mean.

Total Effects on Target: Sales

| Node | Standardized Total Effect | Total Effect | G-test | DoF | p-value | G-test (Data) | DoF (Data) | p-value (Data) |
|---|---|---|---|---|---|---|---|---|
| Competitive Adv. | -0.3456 | -32.0159 | 76.332 | 9 | 0.00% | 76.332 | 9 | 0.00% |
| Trad. Adv. | 0.2567 | 6.2351 | 49.307 | 9 | 0.00% | 49.307 | 9 | 0.00% |
| No. of Stores | 0.1679 | 48.0703 | 47.8213 | 9 | 0.00% | 47.8213 | 9 | 0.00% |
| Online Adv. | 0.1482 | 22.4707 | 45.0943 | 9 | 0.00% | 45.0943 | 9 | 0.00% |
| Temperature | 0.1291 | 323.6881 | 34.9654 | 9 | 0.01% | 34.9654 | 9 | 0.01% |
| Weekday | -0.0501 | -583.078 | 283.5916 | 18 | 0.00% | 283.5916 | 18 | 0.00% |

This can be illustrated by performing the computation manually in the Monitor Panel. By default, the Monitors show the marginal frequency distributions of the states of the nodes, plus the mean value (expected value) of those distributions:
As stated above, the Total Effect is computed on the basis of a one-unit change of each node. We can simulate this by setting Competitive Adv. to a new mean value, i.e. changing its mean from 514.996 to 515.996. It must be noted that infinitely many distributions achieve this one-unit increase of the mean. BayesiaLab supports the analyst by choosing, among all possible distributions, the one that is closest to the original distribution while achieving the targeted mean. We simply need to right-click on the Monitor for Competitive Adv. and select Distribution for Target Value/Mean. This prompts us to type in our desired value, i.e. 515.996, to reflect the one-unit change.
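One natural way to formalize "the closest distribution with the requested mean" is to minimize the Kullback-Leibler divergence from the original distribution subject to the mean constraint, whose solution is an exponential tilting of the original probabilities. Whether BayesiaLab uses exactly this criterion is an assumption here, and the four states and values below are hypothetical:

```python
import numpy as np
from scipy.optimize import brentq

def shift_mean(p0, x, target):
    """Distribution closest to p0 in KL divergence with mean `target`:
    p_i proportional to p0_i * exp(lam * x_i), with lam solved numerically."""
    def gap(lam):
        w = p0 * np.exp(lam * (x - x.mean()))  # centered for stability
        return (w * x).sum() / w.sum() - target
    lam = brentq(gap, -0.01, 0.01)             # bracket sized for this example
    p = p0 * np.exp(lam * (x - x.mean()))
    return p / p.sum()

# Hypothetical 4-state monitor for Competitive Adv. (x = state values).
x = np.array([350.0, 450.0, 550.0, 700.0])
p0 = np.array([0.20, 0.30, 0.30, 0.20])
p1 = shift_mean(p0, x, p0 @ x + 1.0)           # one-unit increase of the mean
print(p0 @ x, p1 @ x, np.round(p1, 4))
```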
We can now observe the impact on Sales as a result of changing Competitive Adv. by one unit. The resulting delta of -32.104 is shown in parentheses. This confirms (within the possible numerical precision) the value reported in the Total Effects table. However, the reader will notice that not only Sales was affected, but also most of the other nodes, albeit with very small changes. This means that, given that we observe a one-unit change in Competitive Adv., we will also observe changes in other nodes, which are themselves connected to the target and may thus contribute to a change in the target. This reflects the Bayesian network property of omnidirectional inference. As such, the one-unit change in Competitive Adv. is not an orthogonal impulse, which is very important to bear in mind for interpretation purposes.

Causal Inference

Pearl's Do-Operator

To move beyond the observational inference generated by the Total Effects function, we must now turn to a causal framework. Our first option is to use Intervention with the Do-Operator, which requires us to convert our original network into a fully specified causal network. It is immediately obvious that most of the original arc directions, which were found by the Supervised Learning algorithm, cannot be interpreted causally; e.g., Sales causes neither Temperature nor Weekday.
However, using our domain knowledge, we can assume that Sales is the effect of all the other variables in this model. So, we will need to encode these causal relationships manually, as shown in the following graph. While this causal representation is formally correct, it creates an immediate practical problem. As we do not have any parametric representation of the relationship between Sales and the other 6 variables, the required CPT associated with Sales contains 28,672 cells (7 Weekday states × 4 states for each of the five remaining parents × 4 Sales states, i.e. 7 × 4⁵ × 4 = 28,672). With only a few hundred observations, it is impossible to obtain a robust estimate of all these parameters. BayesiaLab will actually highlight this problem as we build this network manually.
For now, however, we may want to ignore this constraint and proceed with this approach. We can use BayesiaLab's Taboo Learning to search for additional probabilistic relationships after having fixed the manually encoded causal arc structure from above. Upon completion of this algorithm, and having applied the layout algorithm, we now have a more connected network.

These newly established arcs, however, do not yet reflect our causal assumptions. We now need to go through them one by one to formalize the direction of causality. With some arcs, it is fairly obvious, such as Weekday ➝ No. of Stores (e.g. some stores are closed because it is Sunday). We can invert this arc from within BayesiaLab's Validation Mode: we simply right-click the arc of interest and select Invert Orientation within the Equivalence Class.9

9 For a discussion of equivalence classes, see chapter 1.
The new structure, with the inverted arc highlighted in red, is shown below. However, a side effect of this arc inversion within the equivalence class is that the arc Temperature ➝ No. of Stores was automatically inverted in order to maintain the original JPD. We can resolve this by establishing constraints that reflect our causal knowledge, e.g. a higher Temperature in summer causes a higher No. of Stores to be open, meaning that only the arc Temperature ➝ No. of Stores is permissible, but not its inverse.

While these constraints can easily be applied in BayesiaLab, we will omit those details and instead fast-forward to another issue which, as it turns out, makes all previous efforts futile: we have probabilistic relationships in our domain for which, given our knowledge, we cannot resolve the causal direction. For instance, does Trad. Adv. cause Competitive Adv., or is it the other way around? Without finalizing this causal structure, we are unable to proceed with Graph Surgery, which ultimately prevents us from carrying out causal inference. In conclusion, we face two major obstacles to performing causal inference with Graph Surgery: first, the intractable size of the CPT and, second, the incomplete causal structure.

Direct Effects with Likelihood Matching (LM)

As opposed to using the Do-Operator, we can move forward with LM regardless of the arc directions, as long as the network provides a good representation of the JPD of the underlying data. For this purpose, a new Direct Effect Analysis tool has recently been introduced in BayesiaLab 5.0.4. It is similar to the Total Effects tool; however, Direct Effects obtains, as the name implies, the "direct" impact of a treatment variable on the target node by using the LM algorithm to fix the confounders. The new approach with LM requires fewer prerequisites and may thus lead us to the desired causal inference more quickly. We can return to the originally learned non-causal Bayesian network, which is computationally entirely tractable.
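To convey what "a unit change in the treatment while the covariate distribution stays fixed" means computationally, the sketch below estimates a crude direct effect by averaging within-stratum slopes over the observed covariate distribution. This is a simple stand-in illustration, not Jouffe's Likelihood Matching algorithm, and it assumes numeric (or state-coded) columns with the example's names:

```python
import numpy as np
import pandas as pd

def crude_direct_effect(df: pd.DataFrame, treatment, target, covariates):
    """Average within-stratum slope of target on treatment, weighted by
    stratum frequency -- holding the covariate distribution fixed blocks
    confounding that flows through the listed covariates."""
    num, den = 0.0, 0.0
    for _, g in df.groupby(covariates):
        if g[treatment].nunique() < 2:
            continue                           # no contrast in this stratum
        slope = np.polyfit(g[treatment], g[target], 1)[0]
        num += slope * len(g)
        den += len(g)
    return num / den

# e.g. crude_direct_effect(df, "Online Adv.", "Sales",
#                          ["Trad. Adv.", "Competitive Adv.", "No. of Stores"])
```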
On the basis of this non-causal network, we can perform Direct Effects (Analysis>Report>Target Analysis>Direct Effects on Target). The resulting table provides us with Standardized Direct Effect, Direct Effect, Contribution and Elasticity, with respect to Sales:

Direct Effects on Target: Sales

| Node | Standardized Direct Effect | Direct Effect | Contribution | Elasticity |
|---|---|---|---|---|
| No. of Stores | 0.2290 | 65.5416 | 32.72% | 21.39% |
| Trad. Adv. | 0.1851 | 4.4960 | 26.45% | 19.41% |
| Competitive Adv. | -0.1300 | -12.0410 | 18.57% | -9.67% |
| Online Adv. | 0.0982 | 14.8906 | 14.03% | 9.03% |
| Weekday | 0.0305 | 354.7550 | 4.36% | 2.55% |
| Temperature | 0.0270 | 67.7507 | 3.86% | 2.09% |

The Direct Effect column represents the effect of a unit change of each variable while holding all other variables fixed. One can think of each node (in turn and by itself) being considered the treatment, while all other nodes, except for the target, are used as "likelihood-matched" sets of covariates. For instance, a one-unit change in No. of Stores is associated with a +65.5 delta in Sales, everything else being equal. The Contribution column provides a breakdown of each variable's individual contribution in percent (summing to 100%). This means that an observed change in Sales should be attributed to the individual variables as per the Contribution values.
Elasticity is shown in the rightmost column. The definition of Elasticity is based on the mathematical notion of point elasticity. In general, the "x-elasticity of y", also called the "elasticity of y with respect to x", is:

$$E_{y,x} = \frac{\partial \ln y}{\partial \ln x} = \frac{\partial y}{\partial x} \cdot \frac{x}{y} = \frac{\%\Delta y}{\%\Delta x}$$

In marketing, Elasticity is most often used in the context of price elasticity. It is important to point out that the Direct Effect is a linearized value: it represents the derivative of the Direct Effects Function taken at the a-priori mean value of the respective variable. All the Direct Effects Functions can be shown with Analysis>Graphic>Target Mean Analysis>Direct Effect:
To make the graph easier to interpret, the values of all variables (except the target) are normalized. In the case of Weekday, this means that the numerical values 1 through 7 (representing Monday through Sunday) are normalized to a 0-100 range.

The nonlinear character of several of these variables is rather obvious and suggests that the linearized Direct Effect must be used with caution. For the near-linear variables Trad. Adv. and Competitive Adv., the Direct Effect may fully capture the nature of the relationship with Sales, whereas for the nonlinear Weekday it would be misleading. This becomes particularly relevant in the context of optimization, which we will discuss later.

Causal Inference as an Afterthought

Direct Effects per se carry no causal meaning. However, if we do provide causal assumptions, we can immediately interpret Direct Effects as causal effects. We can make the causal assumption after computing the Direct Effects, quite literally as an afterthought.
Causal Reasoning

Having settled our causal assumptions, we now have a model of our domain that we can use for reasoning and subsequent decision making. Hence, we return to the original objective of obtaining "actionable insight," as we can now formally reason about our domain. We now have the ability to anticipate the consequences of (marketing) actions we have not yet taken. In this particular domain, assuming that we are in the position of the ice cream distributor, only two of the model's variables are under our control, Trad. Adv. and Online Adv.; all others are beyond our control, although we might wish for a higher Temperature and less Competitive Adv. Searching for a rational course of action can thus only include combinations of Trad. Adv. and Online Adv. as "marketing levers." It is now our task to reason about how these levers are best employed to maximize Sales.

Marketing Mix Optimization

While we have emphasized the abstract concept of "reasoning about a domain," from a practical perspective we are looking at the classical task of marketing mix optimization.

Linear Marketing Mix Optimization

From theory we understand that in a linear marketing model (represented by a function f), the gradient of the response function f provides the optimal ratio of marketing instruments. The gradient (or gradient vector field) of a scalar function f(x1, x2, ..., xn) is denoted ∇f, where ∇ (the nabla symbol) denotes the vector differential operator. The gradient of f is defined to be the vector field whose components are the partial derivatives of f. That is:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$$

Our previously generated Elasticity column represents ∇f. As a result, we can directly read the optimal marketing mix ratios from the Elasticity column. Among other things, we would suggest raising Temperature and reducing Competitive Adv. Quite obviously, such a recommendation cannot be serious, as we do not have control over these variables.

Non-Controllable Variables and Non-Confounders

The non-controllable nature of variables like Temperature and Weekday is self-evident. We can declare them as such via the Cost Editor, which allows setting the non-controllable variables to "not observable." The Cost Editor can be selected from the contextual menu that appears when right-clicking on the Graph Panel background.
This declaration will keep them fixed in any subsequent analysis and also exclude them from being used as treatment variables. This new definition is also reflected in the node colors, as non-observable nodes are now shown in a light shade of purple. That leaves two more nodes that are also not under our control, No. of Stores and Competitive Adv. They, however, must be differentiated from the non-controllable variables. The difference is that these variables,
although we do not control them, may very well be affected by our actions. It is reasonable to believe that the level of Competitive Adv. is, at least to some extent, a function of our own advertising. This means that we need to assign a special status to them, one which excludes them from our optimization algorithm but does not keep them fixed. We need to specifically permit their "responsive effects." In our terminology we call them "non-confounders," and we can assign that status via BayesiaLab's Classes (right-click on the node and select Properties>Classes>Add). The reserved Class name is "Non_Confounder".
To mark their distinct role, these nodes are highlighted in red. With the Non-Observables and the Non-Confounders defined, we can now proceed to compute the Direct Effects:

Direct Effects on Target: Sales

| Node | Standardized Direct Effect | Direct Effect | Contribution | Elasticity |
|---|---|---|---|---|
| Trad. Adv. | 0.1470 | 3.5702 | 63.36% | 15.41% |
| Online Adv. | 0.0850 | 12.8867 | 36.64% | 7.82% |

We can immediately take the values of the Elasticity column as a mix recommendation, i.e. a ratio of approximately 2 to 1 for Trad. Adv. versus Online Adv. (15.41% vs. 7.82%). It would be reasonable to object that this mix recommendation is only valid when accepting the linearity assumption of the Direct Effects. Indeed, by displaying the Direct Effects Functions again, now showing only the two variables under our control, we can see that the linearity assumption only holds in the center area of the plot.
So, while the linear approximation might be acceptable for estimating the effects of small changes, considering major policy shifts would clearly demand approaching this as a nonlinear problem. As the principal focus of this paper is on observational versus causal inference, we consider this nonlinear optimization out of scope and leave it to a separate tutorial to be published in the near future.

Summary

I. The Neyman-Rubin model and Pearl's Graph Surgery remain proven tools for computing causal effects. However, direct and causal effect estimation based on Jouffe's Likelihood Matching provides significant advantages, as it does not require the specification of a complete causal structure. With this lower burden of a-priori specification, causal effects can be calculated with significantly less effort. In many cases, this will facilitate quantifying causal effects for the first time in practical applications.

II. Despite these welcome advances in terms of estimating causal effects, the path from "big data" to "actionable insights" still requires a very disciplined application of expert knowledge to provide the
necessary causal assumptions for correct reasoning. The marketing mix model example illustrates the need for a clear understanding of the role of variables, even though we may not need a complete causal structure.

We conclude that Bayesian networks can provide a powerful framework for dealing with complex domains and uncovering dynamics within them. However, at this time, there is no substitute for external assumptions, e.g. from expert knowledge, about the nature of causal relationships.
Appendix

About the Authors

Stefan Conrady

Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia S.A.S. for North America. Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he was heading the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe

Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, who has deployed BayesiaLab globally since 2007.
References

Brady, H.E. "Models of Causal Inference: Going Beyond the Neyman-Rubin-Holland Theory." In Annual Meeting of the Midwest Political Science Association, Chicago, IL, 2002.

Cochran, William G., and Donald B. Rubin. "Controlling Bias in Observational Studies: A Review." Sankhyā: The Indian Journal of Statistics, Series A 35, no. 4 (December 1, 1973): 417-446.

Conrady, Stefan, and Lionel Jouffe. "Knowledge Discovery in the Stock Market - Supervised and Unsupervised Learning with BayesiaLab", June 29, 2011. http://www.conradyscience.com/index.php/knowledgediscovery.

———. "Paradoxes and Fallacies - Resolving Some Well-Known Puzzles with Bayesian Networks", May 2, 2011. http://www.conradyscience.com/index.php/paradoxes.

"Data, data everywhere." The Economist, February 25, 2010. http://www.economist.com/node/15557443?story_id=15557443.

Dorfman, Robert, and Peter O. Steiner. "Optimal Advertising and Optimal Quality." The American Economic Review 44, no. 5 (December 1, 1954): 826-836.

Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. 1st ed. Cambridge University Press, 2006.

Hagmayer, Y., and M. R. Waldmann. "Simulating Causal Models: The Way to Structural Sensitivity." In Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society, 214. Philadelphia, PA, 2000.

Hagmayer, Y., S.A. Sloman, D.A. Lagnado, and M.R. Waldmann. "Causal Reasoning Through Intervention." Causal Learning: Psychology, Philosophy, and Computation (2007): 86-100.

Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd. "Characterizing Selection Bias Using Experimental Data." Econometrica 66, no. 5 (1998): 1017-1098.

Imbens, G. "Estimating Average Treatment Effects in Stata." In West Coast Stata Users' Group Meetings 2007, 2007.

Lauritzen, S. L., and D. J. Spiegelhalter. "Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems." Journal of the Royal Statistical Society, Series B (Methodological) 50, no. 2 (January 1, 1988): 157-224.

Morgan, Stephen L., and Christopher Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. 1st ed. Cambridge University Press, 2007.

Pearl, J., and S. Russell. "Bayesian Networks." In Handbook of Brain Theory and Neural Networks, edited by M. Arbib. MIT Press, 2001.

Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.

Rosenbaum, Paul R. Observational Studies. 2nd ed. Springer, 2010.

Rosenbaum, Paul R., and Donald B. Rubin. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika 70, no. 1 (April 1, 1983): 41-55.

Rubin, Donald B. Matched Sampling for Causal Effects. 1st ed. Cambridge University Press, 2006.

Sekhon, J.S. The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods. Oxford: Oxford University Press, 2008.

Stolley, Paul D. "When Genius Errs: R. A. Fisher and the Lung Cancer Controversy." American Journal of Epidemiology 133, no. 5 (March 1, 1991): 416-425.
Stuart, E.A., and D.B. Rubin. "Matching Methods for Causal Inference: Designing Observational Studies." Harvard University Department of Statistics mimeo (2004).

Witten, Ian, and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Amsterdam; Boston: Morgan Kaufmann, 2005.

Contact Information

Conrady Applied Science, LLC
312 Hamlet's End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia S.A.S.
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com

Copyright

© 2011 Conrady Applied Science, LLC and Bayesia S.A.S. All rights reserved. Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:

• You may print or download this document for your personal and noncommercial use only.
• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia S.A.S. as the source of the material.
• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.