TechnoWeb Split Test in the context of validated learning

Validated Learning at TechnoWeb
A. Oertl, M. Heiss, B. Laenger & B. Kavsek | Corporate Technology | Sep. 2013

siemens.com/answers
Unrestricted © Siemens AG 2013. All rights reserved

Have you read this book?

WARNING:

Each day you delay reading this book
you risk wasting money.

Source: http://www.amazon.de/The-Lean-Startup-Entrepreneurs-Continuous/dp/0307887898
Page 2

September 2013

Siemens CT TIM CEE


Reduce Cycle Time
Avoid the worst drain on productivity: do not build something nobody wants

Solution:
• Get immediate feedback from customers

Source: http://www.betterthanpants.com/baby-mop.html


Meaningful progress:
Validated Learning
Defining success:
• Success is improved customer behavior
• Success is measured either by generally applicable metrics, or metrics tailored to a specific situation.
Metric examples:
• Value hypothesis
• Retention rate (generic): How many customers return within a set time period?
• UR-conversion rate (custom): How many per mille of the notified users respond to the question?

• Growth hypothesis
• Cohort based (generic): Separate behavior analysis of independent user groups (e.g. monthly new users).
• Invitation rate (generic): The willingness of users to invite their personal contacts to the same service.

• The results are used to decide if the change in the feature has positive, negative or no effects on
consumer behavior.
• This way, learning immediately delivers business relevant insights.
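The two value-hypothesis metrics above can be sketched in a few lines. This is a minimal illustration only: the function names, the way users are identified, and the numbers are assumptions for the example, not TechnoWeb code or data.

```python
def retention_rate(previous_users, returning_users):
    """Share of users from the previous period who came back in the current one."""
    previous = set(previous_users)
    if not previous:
        return 0.0
    return len(previous & set(returning_users)) / len(previous)


def ur_conversion_per_mille(notified_count, responder_count):
    """Responders per 1,000 notified users for one Urgent Request."""
    if notified_count == 0:
        return 0.0
    return 1000 * responder_count / notified_count


# Hypothetical toy data, not taken from TechnoWeb.
last_month = ["anna", "ben", "carl", "dora"]
this_month = ["ben", "dora", "emil"]

print("retention rate:", retention_rate(last_month, this_month))
print("UR-conversion rate (per mille):", ur_conversion_per_mille(25000, 7))
```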



Agenda



Corporate Problem Solving via TechnoWeb:
Ask an Urgent Request and get answers from peers
Urgent Requests are distributed by email to the relevant target group (target messaging)
Elements of the notification:
• Headline of the Urgent Request
• Business Impact (estimated by sender)
• Name and optional photo of the sender
Many replies: on average 7 replies, the first within 35 min.
90% get help



New is not always better
Requirement: Urgent Request notifications had to be changed to fit corporate design guidelines

Which solution is better?



Releasing new features without validated learning
is like being in the dark
Decisions are often made using one's own best judgment, ignoring customer needs.
Common approach:
• The automatic conclusion: the new feature is "obviously better" than the old one, and the time and money for the improvement were well spent.

Validated Learning:
A meaningful conclusion can only be drawn after this question is answered:
Does the change positively influence customer behavior?
• Urgent Requests are the most important functionality of TechnoWeb.
• The e-mail notification invites users to give answers.
• Therefore, the effectiveness of the notification is mission critical for the success of TechnoWeb.
→ It is imperative to measure customer response to the new template.



Statistical Evaluation by Engineers
without Specialized Statistical Knowledge



Split-Test:
Preparing to prove assumptions
Split Test: Define a hypothesis with metric and expected value
Hypothesis
The new template outperforms the old template in:
• Click-through rate
• Conversion rate

SPLIT of Urgent Request number i:
• Approx. 50% of the users receive the old template: Ei,old notifications, Vi,old views, Ci,old comments
• Approx. 50% of the users receive the new template: Ei,new notifications, Vi,new views, Ci,new comments

The introduced metrics are:
• Click-through rate ratio
• Conversion rate ratio

Legend:
• i … Urgent Request number
• Ei … number of sent notifications
• Vi … number of views
• Ci … number of comments
• old … old template
• new … new template
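As a minimal sketch of how the two ratios could be computed for one Urgent Request: it assumes that the click-through rate is views per sent notification (V/E), that the conversion rate is comments per sent notification (C/E), and that each ratio is the new rate divided by the old rate. The class and function names and the counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TemplateCounts:
    sent: int      # E: number of sent notifications
    views: int     # V: number of views
    comments: int  # C: number of comments


def rate(events: int, sent: int) -> Optional[float]:
    """Events per sent notification, or None if nothing was sent."""
    return events / sent if sent > 0 else None


def ratio(new: Optional[float], old: Optional[float]) -> Optional[float]:
    """new/old, or None when undefined (missing rate or an old rate of zero)."""
    if new is None or old is None or old == 0:
        return None
    return new / old


def split_ratios(old: TemplateCounts, new: TemplateCounts) -> Tuple[Optional[float], Optional[float]]:
    """Return (click-through rate ratio, conversion rate ratio) for one Urgent Request."""
    ctr_ratio = ratio(rate(new.views, new.sent), rate(old.views, old.sent))
    conv_ratio = ratio(rate(new.comments, new.sent), rate(old.comments, old.sent))
    return ctr_ratio, conv_ratio


# Hypothetical counts for a single Urgent Request.
old_counts = TemplateCounts(sent=1000, views=50, comments=4)
new_counts = TemplateCounts(sent=1000, views=40, comments=4)
ctr_ratio, conv_ratio = split_ratios(old_counts, new_counts)
print("click-through rate ratio:", ctr_ratio)
print("conversion rate ratio:", conv_ratio)
```

Returning None for an undefined ratio makes explicit which Urgent Requests drop out of a ratio-based evaluation.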



Results of 323,560 Urgent Request notifications

[Histogram: Click-through rate ratio (ctrnew/ctrold). x-axis: click-through rate ratio, 0.1 to 2; y-axis: count of Urgent Requests with corresponding ratio]
[Histogram: Conversion rate ratio (convnew/convold). x-axis: conversion rate ratio, 0.1 to >2; y-axis: count of Urgent Requests with corresponding ratio]

• The click-through and conversion rate ratios compare the relative success (relative to the number of notifications sent) of
the old and new templates. A value <1 means that the performance of the new template is inferior to the old template. A
value of 1 signifies no change, whereas a value >1 indicates a better performing new template.
• For the click-through rate ratio, all 61 Urgent Requests are considered. For the conversion rate ratio, only 32 Urgent
Requests have sufficient data to be used.



Decreasing click-through rate ratio
with increasing Business Impact

Visibility of Business Impact Level: Impact on performance
• The Business Impact Level assigns a monetary value to the problem statement of the Urgent
Request
• A value <1 means that the performance of the new template is inferior to the old template
• The monetary value is displayed less prominently in the new template.
Instead of assuming an impact, we measure it:

[Chart: Average click-through rate ratio (y-axis, 0.0 to 1.2) by Business Impact (x-axis: €1,000, €10,000, €50,000, €250,000, €1,000,000)]
[Chart: Average conversion rate ratio (y-axis, 0.0 to 1.2) by Business Impact (x-axis: €1,000, €10,000, €50,000, €250,000, €1,000,000)]



Split-Tests in enterprises face
lower statistical significance
Decisions had to be made concerning raw data processing:
Low comment count
• Problem
Even though 323,560 notifications were evaluated, it's in the nature of the application that the absolute number of comments is low. In some cases +/- 1 comment can significantly influence the result.
• Solution
• Disregard multiple comments by the same user (e.g. follow-up comments).
• Disregard all activity by the author of the Urgent Request.
• Discard all data sets where there are no comments from both the old and new template.

Statistical significance
• Problem
Some Urgent Requests, with 25,000 email notifications, are statistically significant. Others send only a couple of hundred emails.
• Solution
Discard all data sets where no significant data (views or comments) has been recorded from either old or new template.

This comes at the cost of less data to work with, but the remaining data is much more trustworthy.
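The comment-cleaning rules can be sketched as follows. The data layout (a list of user/template pairs per Urgent Request) and all names are assumptions made for illustration. The rule "no comments from both the old and new template" is read here in its stricter sense: a data set is kept only if both groups have at least one counted comment.

```python
def counted_commenters(comments, author):
    """Distinct commenters per template, ignoring the author and repeat comments.

    comments: iterable of (user, template) pairs, template being "old" or "new".
    """
    commenters = {"old": set(), "new": set()}
    for user, template in comments:
        if user == author:
            continue  # disregard all activity by the author of the Urgent Request
        commenters[template].add(user)  # a set drops follow-up comments by the same user
    return commenters


def keep_data_set(commenters):
    """Assumed (stricter) reading: keep only if BOTH templates have a counted comment."""
    return bool(commenters["old"]) and bool(commenters["new"])


# Hypothetical comment log for one Urgent Request.
log = [
    ("author", "old"),  # author activity, ignored
    ("ben", "old"),
    ("ben", "old"),     # follow-up comment, counted once
    ("dora", "new"),
]
cleaned = counted_commenters(log, author="author")
print({template: len(users) for template, users in cleaned.items()})
print("keep data set:", keep_data_set(cleaned))
```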



Statistical Evaluation
by Statisticians



TechnoWeb Split Analysis:
Old versus new Template for Urgent Requests

Approach for sample selection:

[Diagram: share of recipients (0% to 100%) receiving each template, per Urgent Request]
• Urgent Request 1: NEW | OLD
• Urgent Request t1: NEW before (UR 1 .. t1) | NEW first-time | OLD
• Urgent Request t2: NEW before (UR 1 .. t2) | OLD

Receivers of the NEW template will always receive NEW from then on.
If >50% have already received the new template → more "NEW" than "OLD"

Statistical Questions

• Statistical Question to be answered by the analysis:
• Is there a difference in the number of responses
(views, comments) of the old versus new template?
• Do first-time users of the new template behave differently from
users that received the new template before?

• Requested for future analyses:
• Is there one representative number for the extent of this
difference, considering all urgent requests?


Sample Characteristics
Dependency within 1 observation?
• Are we considering paired or unpaired samples?
- Paired sample means that 2 characteristics of one observation are dependent
- We want to compare responses (views, comments) to the same urgent
request for old versus new template.
- Thus, we have to consider pairs of responses and investigate the difference
between response ratios for each urgent request.
- Example:

                   click-through   click-through
                   ratio old       ratio new
Urgent request 1   0.01            0.03
Urgent request 2   0.03            0.05
Urgent request 3   0.07            0.01
Urgent request 4   0.05            0.07

Treating the samples as independent → the old and new template appear to have equal means (0.04 each).
BUT in reality: ctrold < ctrnew in ¾ of the requests!
→ We assume dependent samples → paired test
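The four example requests make the point numerically. The short sketch below (variable names are illustrative) compares the two views of the same data: the group means, which an unpaired comparison would look at, and the per-request differences, which a paired test uses.

```python
from statistics import fmean

# Example values from the slide: click-through ratio per Urgent Request.
ctr_old = [0.01, 0.03, 0.07, 0.05]
ctr_new = [0.03, 0.05, 0.01, 0.07]

# Unpaired view: only the two group means are compared.
mean_old = fmean(ctr_old)
mean_new = fmean(ctr_new)

# Paired view: old and new are compared within each Urgent Request.
new_wins = sum(1 for old, new in zip(ctr_old, ctr_new) if new > old)

print("mean old:", round(mean_old, 4), "mean new:", round(mean_new, 4))
print("requests where new > old:", new_wins, "of", len(ctr_old))
```

Both lists contain the same four values in a different order, so the means coincide even though the new template wins in three of four requests.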


Independence between observations?
The problem is that for most statistical tests, values between observations of the
sample (i.e. different urgent requests) have to be independent.
We know that the same person gets several urgent requests, however, it is
assumed that the response behavior (to click on the notification link) is
independent for different topics.
Thus we can assume independence of the different urgent requests.

                   click-through   click-through
                   ratio old       ratio new
Urgent request 1   0.01            0.03
Urgent request 2   0.03            0.05
Urgent request 3   0.07            0.01
Urgent request 4   0.05            0.07

Between rows (different urgent requests): independent
Between columns (old and new for the same urgent request): dependent


Selection of Test Method
Comparison of means:
Is the mean response significantly different in the new template compared to the
old template?
• t-Test for paired samples
Premises:
- 2 paired samples $(x_i, y_i)$ with expectation values $\mu_1$ and $\mu_2$
- Differences $d_i = x_i - y_i$ normally distributed with expectation value $\mu_d$.
Hypothesis: $H_0: \mu_d = 0$
• Wilcoxon test for paired samples
- 2 paired samples $(x_i, y_i)$ with expectation values $\mu_1$ and $\mu_2$
- Differences $d_i = x_i - y_i$ symmetrically distributed → fulfilled if $x_i$ and $y_i$ have the same distribution shape.
Hypothesis: $H_0: \mu_1 = \mu_2$


Check of premises
Before applying a hypothesis test, the differences (v1-v0 and c1-c0) have to be tested for normal distribution.
Using the Kolmogorov-Smirnov test, we obtain the following result:

H0: Variable has a normal distribution.

variable   p-value
v1-v0      0.04558
c1-c0      0.002431

v1: click-through ratio new
v0: click-through ratio old
c1: conversion rate new
c0: conversion rate old

$\alpha = 5\%$ → normal distribution is rejected in both cases (views, comments)
Therefore we have to use a test which does not require normal distribution
→ Wilcoxon signed-rank test (the Wilcoxon test for paired samples).

Check of premises
Symmetrical Distribution of differences v0-v1 and c0-c1:


Hypothesis Test:
Principle of the Wilcoxon signed-rank test
Wilcoxon signed-rank test (Wilcoxon test for paired samples):
Example for n=8

UR   v0     v1     dv=v1-v0   rank for dv>0   rank for dv<0
1    0.02   0.02    0         -               -
2    0.01   0      -0.01                      1.5
3    0.01   0.10    0.09      7
4    0.06   0.13    0.07      6
5    0.03   0.04    0.01      1.5
6    0.11   0.15    0.04      5
7    0.06   0.08    0.02      3
8    0.03   0.06    0.03      4
                              R+ = 26.5       R- = 1.5

R = min(R+, R-) = 1.5
Critical value for n=7 (UR1 excluded), $\alpha = 5\%$: Rcritical = 2
R < Rcritical → $H_0: \mu_1 = \mu_2$ is rejected
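The rank sums of the n=8 example can be reproduced with a few lines of plain Python. This is a sketch of the ranking step only (zero differences dropped, ties given average ranks), with hypothetical function names; it does not compute a p-value, and the critical value is taken from the slide rather than calculated.

```python
def signed_rank_sums(old, new, digits=10):
    """Return (R+, R-, n) for paired samples, using average ranks for ties."""
    # Round to suppress floating-point noise so that equal differences tie exactly.
    diffs = [round(n - o, digits) for o, n in zip(old, new)]
    diffs = [d for d in diffs if d != 0]  # zero differences are excluded
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus, len(ordered)


# Values from the example table.
v0 = [0.02, 0.01, 0.01, 0.06, 0.03, 0.11, 0.06, 0.03]
v1 = [0.02, 0.00, 0.10, 0.13, 0.04, 0.15, 0.08, 0.06]

r_plus, r_minus, n = signed_rank_sums(v0, v1)
r = min(r_plus, r_minus)
R_CRITICAL = 2  # critical value for n=7 at the 5% level, as stated on the slide
print(r_plus, r_minus, n)                      # 26.5 1.5 7
print("H0 rejected:", r < R_CRITICAL)          # H0 rejected: True
```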

Test Results
Results of Wilcoxon signed-rank test:

            H0        p-value
Views       v0 = v1   1.815e-06 *
            v0 > v1   1
            v0 < v1   9.076e-07 *
Comments    c0 = c1   0.4616
            c0 > c1   0.7718
            c0 < c1   0.2308

* p < 0.05 → significant, i.e. H0 is rejected (shown in red on the slide).

Test result:
• v0 > v1: More views using old template.
• c0 = c1: No significant change in number of comments.

Possible explanations why there are more views of the old template:
- Link to urgent request better visible.
- Users used to old template.
- Already enough information in e-mail → no need to view details.

- Subjective impression of full information in new template.

Plots: response for old versus new template
[Plot: Views in old (black) versus new (red) template]
[Plot: Comments in old (black) versus new (red) template]


Variable for comparison of old and new template

Click-through ratio and conversion ratio: Problem of exclusion of zero values.

[Histograms repeated from the results slide: click-through rate ratio (ctrnew/ctrold) and conversion rate ratio (convnew/convold)]


Variable for comparison of old and new template
Using differences v1-v0 and c1-c0 instead of quotients v1/v0 and c1/c0
→ zero values do not have to be excluded.


New First-Timers
• Considering only subgroup receiving the new template:
Is there a correlation between the number of “new first-timers” (NFT) and the
number of
(a) views (V1)?
(b) comments (C1)?
(a) H0: r(V1,NFT) = 0
(b) H0: r(C1,NFT) = 0

The Kolmogorov-Smirnov test shows that the number of new first-timers NFT is not
normally distributed (p = 7.936e-10) → use Spearman's or Kendall's
correlation coefficient.


New First-Timers
• Considering only subgroup receiving the new template:
Is there a correlation between the number of “new first-timers” (NFT) and the
number of
(a) views (V1)?
(b) comments (C1)?
Test results:

case   variables   method     r        p-value
(a)    V1, NFT     Spearman   0.3939   0.0017
(a)    V1, NFT     Kendall    0.2919   0.0015
(b)    C1, NFT     Spearman   0.3976   0.0015
(b)    C1, NFT     Kendall    0.3012   0.0019

→ H0 is rejected in every case (p<0.05).
→ Significant positive correlation.
→ Number of new first-timers related to number of views and comments:
The more new first-timers, the more views and comments.
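For readers unfamiliar with rank correlation, Spearman's coefficient is the Pearson correlation of the ranks. The sketch below implements that definition in plain Python; the function names and the toy numbers are hypothetical and not the TechnoWeb data, and no p-value is computed.

```python
def average_ranks(values):
    """1-based ranks of the values, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: new first-timers and views per Urgent Request.
nft = [5, 12, 30, 8, 20]
views = [3, 9, 25, 10, 14]
rho = spearman(nft, views)
print("Spearman r:", round(rho, 4))
```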

Improvement suggestion for novel split test

Proposition of sample selection for next split test:
• Existing TechnoWeb users are randomly split into two equally sized groups A
and B.
• Every new TechnoWeb user is assigned group A or group B randomly with a
probability of 50% for each group.
• Group A always receives the old, group B always receives the new template.
• With this approach, first-time views do not have to be investigated separately,
because the two groups are clearly separated from the beginning.
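A minimal sketch of the proposed sample selection, assuming users are identified by a string ID and assignments are kept in memory (a real system would persist them). The class and method names are hypothetical.

```python
import random


class TemplateAssignment:
    """Sticky A/B assignment: group A always gets the old template, group B the new one."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._group = {}

    def assign_existing(self, users):
        """Randomly split the existing users into two equally sized groups."""
        shuffled = list(users)
        self._rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for user in shuffled[:half]:
            self._group[user] = "A"
        for user in shuffled[half:]:
            self._group[user] = "B"

    def group_of(self, user):
        """Return the user's group; a new user is assigned with probability 50% each."""
        if user not in self._group:
            self._group[user] = self._rng.choice(["A", "B"])
        return self._group[user]

    def template_for(self, user):
        return "old" if self.group_of(user) == "A" else "new"


assignment = TemplateAssignment(seed=1)
assignment.assign_existing([f"user{i}" for i in range(10)])
print(assignment.template_for("user3"))
print(assignment.template_for("newcomer"))  # assigned on first contact, then fixed
```

Because the assignment is stored once and reused, a user never switches templates between Urgent Requests.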


Results and Recommendations
The following results were obtained:
• No significant change in number of comments in new versus old template.
• More views in old than in new template.
• The more users receiving the new template for the first time, the more views
and comments.
• Statistically relevant number for comparison of old and new template:
R=min(R+, R- ). Critical R varies according to sample size.

Suggestions:
• Use v1-v0 and c1-c0, respectively, instead of v1/v0 and c1/c0, in order not to
exclude zero-answers.
• Sample selection: randomly choose 50% that always receive old template, 50%
that always receive new template and stick to that selection.



Learning:
The new template has three usability problems
• The overall performance of the new template is inferior to the old template
• Identified cause:
• An important eye-catcher, the assignment of a monetary value to the problem, is less visible in the
new template, resulting in a decreased click-through rate.
• Suspected causes: (to be validated in the next build-measure-learn cycle)
• The prominently placed call-to-action in the new template might be less inviting for users – most do
not want to comment right away.
• The Link “Show Urgent Request” is much less visible in the new template
• Part of the reduced click-through rate in the new template could be due to the content being
presented in an easily-readable way.
[Figure: screenshots of the Old Template and the New Template]


Moving key elements to the side
decreased their effectiveness



Conclusion

Specific
• Split-testing a new feature is worth the time and effort.
• The initial time investment in the first split is offset by knowledge gained on how to efficiently set up a split test.
• Even though the problem initially looked simple, regular statistical textbook knowledge was not sufficient for the statistical significance analysis.
• Consulting a professional statistician from the planning phase of the split would have saved much time and effort, and allowed more focused measurement.

General
• Features that do not positively influence customer behavior should not be implemented.
• Initial negative results should not kill a project. Instead, iterative improvement will lead to a product that consumers will appreciate.


