# SAMPLE SIZE – The indispensable A/B test calculation that you’re not making

If you’re a marketer it’s very likely that you’ve run an A/B test. It’s also likely that you’ve never calculated the sample size for your tests, and instead, you run tests until they reach statistical significance. If this is the case, your strategy is statistically flawed. Conforming to sample size requires marketers to wait longer for test results, but choosing to ignore it will bear false positives and lead to bad decisions.

This deck was created for an email audience, but there are valuable lessons here for anyone who runs A/B tests.


### SAMPLE SIZE – The indispensable A/B test calculation that you’re not making

1. Sample Size: the indispensable A/B test calculation that you're not making.
2. As marketers, many of us run A/B tests
3. We test copy
4. We test design
5. We test subject lines
6. We choose winners
7. Version A is converting better than Version B and statistical significance has breached 95%. So, Version A won.
8. Version A is converting better than Version B and statistical significance has breached 95%. So, Version A won. OR DID IT?
9. That math is half-baked
10. Suppose you check an A/B test twice: once after 200 impressions and then after 500. Then you end the test.
11. Now, instead, suppose you stop the test once you reach significance:
12. Now, suppose you stop the experiment as soon as there is a significant result: FALSE POSITIVE!
13. How often will you get a false positive?
14. Assuming you check results after every impression and stop once you reach significance… 26.1%. So you just went from 95% confidence to ~74%. This is a worst-case scenario, BUT some test platforms do this automatically!
15. OK… well, then when should I stop an A/B test?
16. SAMPLE SIZE dictates how long to run a test
17. SAMPLE SIZE is used religiously in the pharmaceutical industry, economic studies, etc.
18. https://www.optimizely.com/resources/sample-size-calculator
19. Agenda: 1. How we put this into practice on a website test. 2. How we applied these learnings to email testing: open rates, click-to-open rates, conversion rates
20. A/B testing on your website. Here's your new test process: 1. Determine your baseline conversion rate (or click rate, or download rate, etc.). 2. Decide how long you are willing to wait for a result, and convert your unique traffic metric into a sample size. 3. Adjust the MDE (Minimum Detectable Effect) until your sample size is just under the target you determined in step 2. 4. Re-adjust the MDE until you are content. 5. Start the test, and don't stop until you hit the sample size.
21. Case study: item urgency
22. Case study: item urgency. TEST (VERSION A): inventory notification. CONTROL (VERSION B): no inventory notification
23. STEP 1: We determined our baseline conversion rate
24. STEP 2: Calculate the target sample size. We initially decided we wanted a result in 2 weeks, so we took the last 2 weeks of unique product page views:
25. STEP 2: Calculate the target sample size. We then divided that number by two (since we'll have two test segments), divided by two again to account for desktop traffic only, then multiplied by 5% (since the message only displays on 5% of product pages). Sample size -> 12,351
26. This gave us a 30% MDE (conversion lift). This is unrealistic
27. How about 10%?
28. 107,105 unique visits ~ 17 weeks
29. Wow, that's a long time…
30. Yep.
31. You're probably not running your tests long enough
32. WAIT A MINUTE. MY A/B TEST PLATFORM SAYS NOTHING ABOUT SAMPLE SIZE…
33. EVERYONE WANTS INSTANT GRATIFICATION
34. YOUR A/B TEST PLATFORM IS HAPPY TO SELL IT
35. Quietly assuming you have calculated sample size on your own
36. Item urgency test results: We are over 4 weeks in… *Conv. rate is higher than expected because the test platform runs on a 7-day conversion window.
37. Item urgency test results: Lift is over 10%. Note the spike in the beginning and the increased stabilization over time
38. Test results: The effect is slowly approaching the MDE
39. Test results: Significance is now over 95%, but it's been up and down. Many marketers would have stopped the test on 9/5 and declared a 57% lift.
40. Email testing
41. After learning about sample size, we reconsidered our email testing strategy: open rate (subject line testing), click-to-open (CTO) rate, conversion rate
42. OPEN RATE: We used sample size to gut-check the size of our subject line test segments
43. OPEN RATE: Remember, the sample size calculator needs the baseline conversion rate and the sample size, and that will give you the MDE.
44. OPEN RATE: First, we needed the baseline open rate
45. OPEN RATE: Our open rates typically end up ~ 17%, but when we make the call on our winning subject line, open rates are usually around 7%.
46. OPEN RATE: Next, we needed the sample size
47. OPEN RATE: We always test 4 different subject lines, and we had been sending each subject line to 10,000 customers. So, sample size ~ 10,000
48. OPEN RATE: Plugging these numbers in, this would only detect a 13% open rate lift or higher
49. OPEN RATE: A 13% lift on a 17% open rate is 19.2%. We rarely see subject lines this high. We needed a lower MDE to make sure we could detect more winners…
50. OPEN RATE: We ended up doubling our subject line segments, to 80,000 recipients in total, giving us an MDE ~ 9.2%
51. CTO: First we needed the baseline
52. CTO: We averaged the last 10 weeks -> 11% CTO
53. CTO: Sample size = ½ of the average opens count
54. CTO: We averaged the last 10 weeks -> average opens = 107,000 / 2 = 53,500
55. CTO
56. CTO: A 4.4% CTO lift is a very reasonable goal for a test. This showed us that we could trust most of the results of our past CTO tests.
57. GRID vs. FREE FORM: 15.7% CTO lift
58. PRODUCT NAMES vs. NO PRODUCT NAMES: 22.6% CTO lift
59. Conversion rate: We had been making many email decisions after reaching significance on a conversion rate lift
60. Conversion rate: Time for a reality check.
61. Conversion rate: Baseline conversion rate ~ 1.5%
62. Conversion rate: Sample size = ½ the average number of clicks -> 6,000
63. Conversion rate
64. Conversion rate: 38% is ASTRONOMICAL
65. Conversion rate: To get meaningful results for conversion rate, consider running an email test many times, so that you can eventually reach the necessary sample size.
66. Takeaways: This is the MDE curve again. Remember what this looks like. The longer you run a test, the lower the MDE will be. The more traffic volume you have, the faster the MDE will drop.
67. Takeaways. For web testing: If you stop your A/B tests once you reach statistical significance, you are increasing your chances of finding false positives. Calculating sample size will give you a clear stop date and an MDE. MDE and sample size are inversely related: the lower the MDE, the larger the sample size. Most likely, your A/B tests need to run much longer than you realize. For email testing: Use sample size to determine the size of your subject line test segments. Your CTO tests are probably reaching the necessary sample size. Your conversion tests are probably not hitting sample size.
68. Sources: Kyle Rush, MozCon 2014 presentation, https://seomoz.box.com/shared/static/2fw6yevkkmmdumz431j4.pdf ; Evan Miller, "How Not to Run an A/B Test", http://www.evanmiller.org/how-not-to-run-an-ab-test.html
69. Zack Notes, Digital Marketing Manager, zack@uncommongoods.com, @zacknotes, slideshare.net/zacknotes1/presentations
70. Appendix
71. GRID vs. FREE FORM
72. PRODUCT NAMES vs. NO PRODUCT NAMES
73. What do you do if a test reaches sample size and your lift < MDE?
74. You can either extend the test and accept a lower MDE, or move on.
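Slides 10–14 claim that repeatedly checking a running test and stopping at the first "significant" result inflates the false positive rate well above the nominal 5%. A minimal simulation sketch of that peeking effect, assuming an A/A test (no real difference) evaluated with a standard two-proportion z-test; note the deck's 26.1% figure assumes checking after every single impression, while this sketch peeks less often, so it understates the worst case:

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic with a pooled standard error."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return abs(conv_a / n_a - conv_b / n_b) / se

def peeking_false_positive_rate(trials=2000, n=1000, p=0.10,
                                z_crit=1.96, check_every=10, seed=7):
    """Simulate A/A tests (both arms convert at the same rate p) and stop
    at the first peek that crosses the 95% significance threshold.
    Returns the fraction of trials wrongly declared 'winners'."""
    random.seed(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for i in range(1, n + 1):
            conv_a += random.random() < p
            conv_b += random.random() < p
            # peek every `check_every` impressions after a short warm-up
            if i >= 100 and i % check_every == 0 and z_stat(conv_a, i, conv_b, i) > z_crit:
                false_positives += 1
                break
    return false_positives / trials

rate = peeking_false_positive_rate()
print(f"False positive rate with peeking: {rate:.1%}")
```

Even with these modest settings the observed rate lands far above 5%, which is the whole point of the deck: significance alone is not a stopping rule.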
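Slides 18–20 lean on Optimizely's sample size calculator without showing the math. A sketch of the standard normal-approximation formula for comparing two proportions (two-sided α = 0.05, 80% power) gives the same qualitative behavior; the exact counts are an assumption, since the deck's calculator may use slightly different conventions:

```python
import math

def sample_size_per_variation(baseline, mde_relative, z_alpha=1.96, z_beta=0.84):
    """Visitors needed in EACH variation to detect a relative lift of
    `mde_relative` over `baseline` (two-sided alpha=0.05, power=0.80).
    Standard two-proportion normal approximation; commercial calculators
    may differ somewhat in their conventions."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# The lower the MDE, the (much) larger the required sample size:
print(sample_size_per_variation(0.03, 0.30))  # 3% baseline, 30% relative MDE
print(sample_size_per_variation(0.03, 0.10))  # 3% baseline, 10% relative MDE
```

This makes the deck's core trade-off concrete: halving or thirding the MDE multiplies the required sample size many times over, which is why the 10% MDE on slide 27 pushed the test out to ~17 weeks.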
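Slide 25's traffic-to-sample-size arithmetic is easy to reproduce. The deck never shows the raw two-week page view count, so the starting figure below is a hypothetical back-derived from the 12,351 result:

```python
# Hypothetical starting figure: back-derived from the deck's result of
# 12,351 (the raw two-week unique product page view count is not given)
two_week_unique_views = 988_080

per_segment = two_week_unique_views / 2   # two test segments (A and B)
desktop = per_segment / 2                 # desktop traffic only
eligible = desktop * 0.05                 # message shows on 5% of product pages

print(round(eligible))  # -> 12351, matching the deck
```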

### Editor's Notes

• I have a thought experiment for you…
• All of the scenarios have the same end result, except for scenario #3
• Note the baseline, minimum detectable effect and sample size
Baseline = preexisting conversion rate, click rate, open rate, etc.
Minimum Detectable Effect or MDE = the minimum lift you will be able to detect once you’ve reached the sample size.
So here's the gist of this presentation: suppose you are running a test on a page with a baseline conversion rate of 3%, and you run the test until your test segment reaches 10,316 impressions. If your observed conversion lift is below 20%, you can't declare a winner, even if you've reached statistical significance. You either need to keep your test running or move on.
• This is your step by step guide to sample size testing. I’m going to go over it very briefly. This slide is more of a resource for you to come back to when you set up your test.
• We’re currently running this test on our product pages
• This is a message that fires when there are less than 5 items in inventory. You can see the test on top and the control below.
• We put 1.16% in as the conversion rate, then turned the MDE up until our sample size was just under 12,351
• Nobody wants to wait 17 weeks for a test result, but if you make a call too early, you could be shooting yourself in the foot, buying a false positive, and deploying a new design which is actually making your site worse
• BACK TO ITEM URGENCY
This is a chart of conversion rate.
Test in Blue. Control in Red
• Note how the MDE contains the lift with an upper bound
Also note how lift is approaching MDE towards the end
• Red line is significance
Green line is the 95% mark
Note that the significance crossed 95% in the beginning and then came back down, and it’s now rising above 95% again.
• 17% after a week but 7% after 2 hours when we make the call
• Here’s a few examples of CTO tests we ran and at the end, in the appendix, I’ve included all data for you to look at afterwards
• I went back through all of our tests. In the 3 years since we’ve started A/B testing in emails, the only time we’ve hit sample size for conversion rate is when we’ve tested putting prices in an email (vs. leaving them out).

And Unfortunately, this lift in conversion was countered by an equal drop in CTO

• The lift from the vast majority of your tests will never reach MDE. Be more comfortable reporting “no statistical difference”
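The last notes argue that email conversion rate tests almost never hit sample size in a single send, and slide 65 suggests repeating a test until the cumulative clicks get there. Under the deck's numbers (1.5% baseline, ~6,000 clicks per send per variation) and a standard two-proportion approximation (two-sided α = 0.05, 80% power; the exact send count depends on the calculator's conventions, so treat it as a rough sketch), the arithmetic looks like this:

```python
import math

def sample_size_per_variation(baseline, mde_relative, z_alpha=1.96, z_beta=0.84):
    # Standard two-proportion normal approximation (assumption: the
    # deck's calculator may use slightly different conventions).
    p1, p2 = baseline, baseline * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

needed = sample_size_per_variation(0.015, 0.10)  # deck's 1.5% baseline, 10% MDE
clicks_per_send = 6_000   # deck: sample size = half the average click count
sends = math.ceil(needed / clicks_per_send)
print(needed, sends)
```

The required clicks per variation run into the six figures, so a single send's 6,000 clicks falls an order of magnitude short: exactly why the notes say most past conversion "wins" should be reported as no statistical difference.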