Bizwerx Innovation & Mobility Hub by Dr. Cassandra Little
1. Lean Experimentation
How to leverage online experiments in research and practice
Thomas Høgenhaven
Twitter: @thogenhaven
Cornell IS Breakfast Talk
April 4th, 2012
Friday, April 6, 12
2. Agenda
1. Conducting Online Experiments
2. Experimentation Literature
3. Experimentation in SMEs and Government Today
4. Lean Experimentation
3. Part 1: Conducting Online Experiments
4. The Why Bother Question
“While some social scientists engage in small-scale controlled experimentation with dozens of users or groups, the capacity to perform large-scale interventions with thousands of users opens up new opportunities for research.”
(Preece and Shneiderman 2009: 25).
5. What I Mean By Online Experiments
In online experiments, we are interested in examining online behavior itself, not just in using the internet as a means to examine offline behavior.
6. What I Mean By Online Experiments
Users are split across several variations:
• Independent variable: Variation A, Variation B, ..., Variation n
• Dependent variable: online behavior under each variation
• Statistical test: the difference in online behavior between variations
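To make this design concrete, here is a toy simulation. It is a sketch under stated assumptions: the true conversion rates are hypothetical, behavior is reduced to a single yes/no conversion, and `simulate_experiment` is an illustrative name, not something from the talk. Each simulated user is randomly assigned to one variation (the independent variable), a conversion is recorded (the dependent variable), and the per-variation counts are what a statistical test would then compare.

```python
import random

def simulate_experiment(true_rates, n_users, seed=0):
    """Randomly assign users to variations and record a binary behavior.

    true_rates: hypothetical mapping of variation name -> true conversion rate.
    Returns a mapping of variation -> (users assigned, conversions observed).
    """
    rng = random.Random(seed)
    variations = list(true_rates)
    results = {v: [0, 0] for v in variations}
    for _ in range(n_users):
        v = rng.choice(variations)          # independent variable: random assignment
        results[v][0] += 1
        if rng.random() < true_rates[v]:    # dependent variable: simulated online behavior
            results[v][1] += 1
    return {v: tuple(counts) for v, counts in results.items()}

outcome = simulate_experiment({"A": 0.10, "B": 0.11, "n": 0.10}, n_users=3000)
for variation, (users, conversions) in outcome.items():
    print(variation, users, conversions, round(conversions / users, 3))
```

Even when two variations share the same true rate (A and n here), their observed rates will generally differ by chance, which is why the difference has to be tested rather than eyeballed.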
8. Example: Experimentation At Microsoft
Guess which one performs better in each of these 8 pairs. Anyone who gets 6/8 right wins a T-shirt.
9. Experimenting At Microsoft
Pairs 1 to 8, each with an A and a B version. Which one is significantly better?
[] A
[] B
[] None of them
Kohavi et al. (2009): Online Experimentation at Microsoft
10. Experimenting At Microsoft
0 of 200 Microsoft employees got more than 5 of the 8 answers right.
Kohavi et al. (2009): Online Experimentation at Microsoft
11. What Is The Effect Of Experiments?
• Improvement: 33%
• No effect: 33%
• Negative effect: 33%
Kohavi et al. (2009): Online Experimentation at Microsoft
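One way to sort a single experiment into these three buckets is sketched below, assuming a binary conversion metric and a 95% normal-approximation confidence interval for the difference between control A and variant B. This is not necessarily how Kohavi et al. classified their experiments, and `classify_outcome` is a hypothetical helper.

```python
from statistics import NormalDist

def classify_outcome(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Label variant B relative to control A using a normal-approximation
    confidence interval for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    low, high = (p_b - p_a) - z * se, (p_b - p_a) + z * se
    if low > 0:
        return "improvement"
    if high < 0:
        return "negative effect"
    return "no effect"  # the interval includes zero

print(classify_outcome(100, 1000, 120, 1000))  # no effect: the interval still includes zero
```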
12. Is That Just Microsoft Being Microsoft?
No. Estimating the effects of changes is incredibly hard. Netflix considers 90% of what it tries to be wrong.
13. It’s Actually Hard To Predict
https://whichtestwon.com/past-tests
15. Current Experimental Framework in HCI
Psychology & Social Psychology → experimental methodology literature → HCI
16. Offline And Online Experiments
• Psychology literature sometimes uses the internet to study
human behavior
• But it does not use the internet to study the internet
17. For example...
A 2010 example with no mentions of experimentation in online environments
18. Offline And Online Experiments
Experiments can be placed on two dimensions: laboratory vs. field, and offline vs. online.
19. Offline And Online Experiments
• Offline experiments (laboratory and field): Psychology covers this
• Online experiments (laboratory and field)
20. Offline And Online Experiments
• Offline experiments (laboratory and field): Psychology covers this
• Online experiments (laboratory and field): But not this
21. The Research There Is, Is Not Systematic
"To the extent of our knowledge, no research has so far been
reported on treating online test design and implementation in a
systematic manner"
(Cámara and Kobsa 2009: 18).
22. Online Experiments In Academia
CHI and CSCW research uses experiments all the time, but more could be invested in methodology literature.
This would help explore the possibilities and limitations of online experimentation.
23. Part 3: Experimentation In SMEs And Government Agencies Today
24. State Of The Art In Industry Today
• Experimentation is increasing
• At least 25 different software vendors
• Prices range from $0 to $320,000 a year*
*Source: whichmvt.com
26. Website Experiments
Experiments can be conducted in several ways, along two dimensions:
1. Server-side vs. client-side
2. A/B test vs. multivariate test
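The A/B versus multivariate distinction can be illustrated with a short sketch. It assumes a full-factorial multivariate test, in which every combination of element versions is served (some tools use fractional designs instead); the page elements and versions are hypothetical.

```python
from itertools import product

def multivariate_combinations(factors):
    """Return every combination of element versions (a full-factorial design).

    factors: mapping of page element -> list of alternative versions.
    """
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

combos = multivariate_combinations({
    "headline": ["original", "question"],
    "button": ["green", "orange"],
})
# An A/B test compares whole pages; this multivariate test yields 2 x 2 combinations.
for combo in combos:
    print(combo)
```

Because the number of combinations is the product of the number of versions per element, a multivariate test spreads the same traffic over many more cells than an A/B test.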
27. Not Overly Expensive Software
Just 2 of the 25+ vendors:
• Google Website Optimizer (free)
• Visual Website Optimizer ($600 to $3,000 / year)
28. A/B/n Experiment
Users are assigned to a webpage variation via JavaScript:
• Independent variable: Webpage A, Webpage B, ..., Webpage n
• Dependent variable: behavior on each webpage
• Statistical test: the difference in behavior between webpages
30. Limitations Of Mainstream Experimental Software
1. Limited to between-subjects design
2. Lack of data export
3. No control over statistical test
4. Expensive coding necessary
31. Limitation 1: Limited To Between-Subjects Design
• Cannot control for individual differences (no such data is collected or made available)
• Requires more experimental subjects
• No pre-experimental data is collected
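How many subjects a between-subjects design needs can be estimated with the standard approximate sample-size formula for comparing two conversion rates with a two-sided test. The sketch below assumes a 5% significance level, 80% power, and hypothetical conversion rates of 10% and 12%; `users_per_group` is an illustrative name.

```python
import math
from statistics import NormalDist

def users_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed in EACH group of a between-subjects A/B test
    on a conversion rate (two-sided two-proportion z-test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_variant) ** 2
    return math.ceil(n)

# Users per group needed to detect a lift from 10% to 12%
print(users_per_group(0.10, 0.12))
```

Smaller expected differences drive the required sample up quickly, since the difference enters the denominator squared.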
36. Software Limitations: Data Export
• Some software is better than others
• No data on individual users
• No segmentation by background variables
• This might be the biggest problem, as this is where many significant results lie.
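Per-user data export matters because it allows results to be broken down by background variables. A minimal sketch, assuming each exported record carries hypothetical `variant`, `segment`, and `converted` fields:

```python
from collections import defaultdict

def rates_by_segment(records):
    """Conversion rate per (segment, variant) from per-user records.

    records: iterable of dicts with hypothetical keys 'variant', 'segment', 'converted'.
    """
    users = defaultdict(int)
    conversions = defaultdict(int)
    for r in records:
        key = (r["segment"], r["variant"])
        users[key] += 1
        conversions[key] += int(r["converted"])
    return {key: conversions[key] / users[key] for key in users}

records = [
    {"variant": "A", "segment": "new", "converted": True},
    {"variant": "A", "segment": "new", "converted": False},
    {"variant": "B", "segment": "new", "converted": True},
    {"variant": "A", "segment": "returning", "converted": True},
    {"variant": "B", "segment": "returning", "converted": False},
    {"variant": "B", "segment": "returning", "converted": False},
]
for (segment, variant), rate in sorted(rates_by_segment(records).items()):
    print(segment, variant, round(rate, 2))
```

Each segment is a smaller sample, and testing many segments raises the chance of false positives, so segment-level differences need their own significance testing.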
37. Limitation 3: No Choice Between Statistical Tests
Okay?
38. Statistical Test = Chance To Beat Original
“The chance to beat original ... displays the probability that a combination will be more successful than the original version.
When numbers in this column are high, perhaps around 95%, that means a given combination is probably a good candidate to replace your original content.
Low numbers in this column mean that the corresponding combination is a poor candidate for replacement.”
http://support.google.com/websiteoptimizer/bin/answer.py?hl=en&answer=55944
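The help text does not say how the chance to beat original is computed. One common way to obtain such a probability, shown here as an assumption and not as Google Website Optimizer's documented method, is a Bayesian comparison: give each conversion rate a Beta posterior with a uniform prior and estimate by simulation how often variant B's rate exceeds the original A's. The counts (100 of 1,000 vs. 120 of 1,000) are hypothetical.

```python
import random

def chance_to_beat_original(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1 + conversions, 1 + non-conversions) posteriors (uniform priors)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

print(chance_to_beat_original(100, 1000, 120, 1000))
```

The result is a posterior probability, not a p-value, so a 95% chance to beat original and a 5% significance level are not interchangeable.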
39. Visual Website Optimizer Is More Transparent
“Visual Website Optimizer uses z-tests for both A/B tests and multivariate tests”
Standard Error (SE) = √(p × (1 − p) / n), where p is the conversion rate and n is the number of visitors
http://visualwebsiteoptimizer.com/split-testing-blog/what-you-really-need-to-know-about-mathematics-of-ab-split-testing/
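A minimal z-test built on this standard error formula might look like the sketch below. It assumes a two-sided test and combines the two per-variation standard errors into the standard error of the difference; Visual Website Optimizer's exact implementation may differ. The counts are hypothetical.

```python
from statistics import NormalDist

def standard_error(conversions, visitors):
    """SE = sqrt(p * (1 - p) / n), with p the observed conversion rate."""
    p = conversions / visitors
    return (p * (1 - p) / visitors) ** 0.5

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates,
    combining the per-variation standard errors."""
    diff = conv_b / n_b - conv_a / n_a
    se_diff = (standard_error(conv_a, n_a) ** 2 + standard_error(conv_b, n_b) ** 2) ** 0.5
    z = diff / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = z_test(100, 1000, 120, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these counts (10% vs. 12% on 1,000 visitors each), the difference is not significant at the conventional 5% level.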
40. z-tests
• Focus on a single parameter
• Assume that parametric assumptions are met, although we don't know whether the data fits them
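When the data may not meet parametric assumptions (for example, revenue per user, where most users spend nothing), a permutation test is one non-parametric alternative. It is swapped in here for illustration and is not a method proposed in the talk. The sketch shuffles group labels to build the null distribution of the difference in means; the revenue figures are hypothetical.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Makes no normality assumption: the null distribution is built by
    repeatedly shuffling the group labels.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_b) - fmean(group_a))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[size_a:]) - fmean(pooled[:size_a]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    # The add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical, skewed revenue-per-user data where most users spend nothing
revenue_a = [0, 0, 0, 0, 0, 0, 0, 12, 0, 35]
revenue_b = [0, 0, 40, 0, 0, 55, 0, 0, 80, 0]
print(permutation_test(revenue_a, revenue_b))
```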
41. Limitation 4: Coding Required
Users are assigned to a webpage variation via JavaScript, and every variation has to be coded:
• Independent variable: Webpage A, Webpage B, ..., Webpage n
• Dependent variable: behavior on each webpage
• Statistical test: the difference in behavior between webpages
42. Software Limitations: Expensive Coding
We already coded it, so we might as well keep it. I hate working for no reason.
43. Software Limitations: Expensive Coding
I knew this wouldn’t work!
We should never have
spent resources on it...
44. The Challenge
1. Overcome methodological limitations of experimental software
2. Reduce development costs
3. Explore the possibilities and limitations of online experimentation
46. Test Environment
Users are exposed to one of several proxy variations, and their behavior is measured on the website:
• Independent variable: Proxy A, Proxy B, ..., Proxy n
• Dependent variable: behavior on the website
• Statistical test: the difference in website behavior between proxies
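The slides do not specify how website behavior is linked back to the proxy variation. One common approach, assumed here, is to tag each proxy's links with Google Analytics-style UTM parameters so that web analytics can split website behavior by proxy variation. `build_tracked_url`, `variant_from_url`, and the example URL and campaign names are hypothetical.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

def build_tracked_url(base_url, source, medium, campaign, variant):
    """Append UTM parameters identifying the proxy variation to a link."""
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query)  # keep any existing query parameters
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
        ("utm_content", variant),   # distinguishes proxy A, B, ..., n
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))

def variant_from_url(url):
    """Recover the proxy variation from a landing-page URL (None if untagged)."""
    values = parse_qs(urlsplit(url).query).get("utm_content")
    return values[0] if values else None

link_b = build_tracked_url("https://example.com/signup", "newsletter", "email", "spring_test", "B")
print(link_b)
print(variant_from_url(link_b))
```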
51. 2. Test Before Coding, Not After
Ideas → Experimentation → Good idea → Implementation
(Bad ideas are filtered out at the experimentation stage, before any implementation.)
52. 3. Test In The Field
• Identical design patterns have different effects in different contexts
• E.g., social comparison information in competitive vs. cooperative communities
• Cocktail effects are largely unknown
53. Requirements Of Lean Experimentation
1. Independent groups
2. Random assignment
3. Allows tracking
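The three requirements can be met together with deterministic, hash-based assignment, sketched below as one possible approach (not necessarily the one used in the talk). Hashing the experiment name with a user id spreads users roughly evenly and pseudo-randomly across variations, keeps each user in the same group on every visit, and yields an assignment that can be logged for tracking. The experiment name and user ids are hypothetical.

```python
import hashlib

def assign_variation(user_id, experiment, variations):
    """Deterministically assign a user to one variation.

    Hashing the experiment name together with the user id gives a
    pseudo-random but repeatable assignment: the same user always lands in
    the same group (independent groups), and users spread roughly evenly
    across variations (random assignment).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variations[int(digest, 16) % len(variations)]

# Tracking: record each assignment so behavior can later be joined to the group.
assignment_log = []
for user_id in ["u1", "u2", "u3"]:
    variation = assign_variation(user_id, "email_subject_test", ["A", "B"])
    assignment_log.append((user_id, variation))
print(assignment_log)
```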
55. Test Environment
• Manipulates the independent variable through a proxy
• Examines dependent variable in natural field environment
56. Test Subjects
• Existing users (when using website, email, and survey)
• Potential users (when using advertisements)
57. Proposed Usage And Limitations
Good for:
• Ideas
• Theories
• Hypotheses
• Features
Less suited for (but can be useful if testing assumptions):
• Small changes
• Graphical changes
58. Data Output
Mixed sources need to be combined:
• Open and click-through rates (CTR) from the proxy
• Web analytics
• SQL databases
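Combining the sources usually means joining them on a shared user identifier. The sketch below assumes such an id exists in all three sources (in practice it may have to be passed through the proxy's links); `combine_sources` and all field names are hypothetical.

```python
def combine_sources(proxy_data, analytics_data, database_rows):
    """Join per-user data from three hypothetical sources on a shared user id.

    proxy_data:     user_id -> {"variant": ..., "opened": ..., "clicked": ...}
    analytics_data: user_id -> {"sessions": ...}
    database_rows:  list of (user_id, signed_up) tuples, e.g. from a SQL query
    Users missing from a source get None for that source's fields.
    """
    signups = dict(database_rows)
    combined = {}
    for user_id, proxy in proxy_data.items():  # the proxy defines who was in the experiment
        row = dict(proxy)
        row["sessions"] = analytics_data.get(user_id, {}).get("sessions")
        row["signed_up"] = signups.get(user_id)
        combined[user_id] = row
    return combined

proxy = {
    "u1": {"variant": "A", "opened": True, "clicked": False},
    "u2": {"variant": "B", "opened": True, "clicked": True},
}
analytics = {"u2": {"sessions": 3}}
rows = [("u1", False), ("u2", True)]
print(combine_sources(proxy, analytics, rows))
```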
59. The Durability Of Proxy Experiments Is Short
[Figure: email experiment, Control vs. Experimentation, by week (Wk0 to Wk3)]
60. Buy-In Needed
From hard to sell to easy to sell:
1. Making changes on websites
2. Sending emails
3. Conducting surveys
4. Running ads
61. Feedback Quality
1. Wireframes / early-stage development: critical feedback
2. Finished / nearly finished stages: not-so-critical feedback
62. Influence On Decisions
Experimental effect data is more likely to have an impact on decisions when it is obtained early.