Building better products through Experimentation - SDForum Business Intelligence SIG
1. Building better products through Experimentation
Deepak Nadig, eBay Principal Architect
SDForum Business Intelligence SIG
March 27, 2008
2. What we’re up against
• eBay manages …
– Over 276,000,000 registered users
– Over 1 Billion photos
– eBay users worldwide trade more than $2,039 worth of goods every second
– eBay averages well over 1 billion page views
per day
– At any given time, there are over 113 million
items for sale on the site
– eBay stores over 2 Petabytes of data – over
200 times the size of the Library of Congress!
– eBay analytics processes over 25 Petabytes
of data on any day
– The eBay platform handles 4.4 billion API calls
per month
An SUV sells every 5 minutes
A sporting good is sold every 2 seconds
Over ½ million pounds of Kimchi are sold every year!
• In a dynamic environment
– 300+ features per quarter
– We roll 100,000+ lines of code every two weeks
• In 39 countries, in seven languages, 24x7
>44 Billion SQL executions/day!
3. Site Statistics: in a typical day…
Metric | June 1999 | Q1 2007 | Growth
Outbound Emails | 1 M | 41 M | 41x
Total Page Views | 54 M | >1 B | 19x
Peak Network Utilization | 268 Mbps | 16 Gbps | 59x
API Calls | 0 | 150 M | N/A
Availability (downtime) | ~97% (43 mins/day) | 99.94% (50 sec/day) | 50x
4. Velocity of eBay -- Software Development Process
276M users | 300+ features per quarter | 99.94% availability | 100K LOC/wk | 6M LOC
• Our site is our product. We change it incrementally through implementing new features.
• Very predictable development process – trains leave on time at regular intervals (weekly).
• Parallel development process with significant output -- 100,000 LOC per release.
• Always on – over 99.94% available.
All while supporting a 24x7 environment
5. James Lind and the cure for scurvy
• In his 1747 shipboard trial, Lind compared treatments given to sailors with scurvy:
– cider
– elixir of vitriol
– sea water
– vinegar
– a mixture including garlic, mustard, and horseradish
– oranges and lemons
6. Reminder for data/analytics driven decisions
• Auction vs. Stores
• Combined search results
– Return a broader mix of inventory
– Listings of core + stores were combined
– More exposure to store listings
• Results
– Business metrics were down – bids, average sales price, etc.
– Latency in discovering this
• Analysis
– Overall cost of a store listing is less than that of an auction listing
– Sellers shifted inventory to save on fees
• Rolled back in 03/2006
– Higher fees for store listings
7. Many Insights Methods (By Data Source vs. Approach)
[Figure: a landscape of insight methods plotted by data source and approach]
• Approach: Qualitative (direct) ↔ Quantitative (indirect)
• Data source: Self-reported (stated) ↔ mixture ↔ Observed behavior
• Key – context of data collection with respect to product use: de-contextualized / not using product; scripted or lab-based use of product; natural use of product; combination / hybrid
• Methods plotted include: Focus groups / “Voices”, desirability studies, exit surveys, phone interviews, cardsorting, Product Tracker, diary/camera studies, message board mining, “visits” / ethnographic field studies (onsite interviews, extended observation), intent discovery, usability lab studies (task-based), quantitative user experience assessments, usability benchmarking (in lab), data mining, eyetracking, experimentation, clickstreams
8. Concepts
• Unit (of experimentation, analysis)
– The entity on which the experiment or analysis is performed
– e.g. user, seller, buyer, item
• Factor (or variable)
– Something that can have multiple values
– Independent or controlled (cause), Dependent or response (effect)
• Treatment (or experience)
– A variation of the experience or information (e.g. page flow, page, module) served to the unit; the variation is characterized by a change in one or more factors or variables
• Sample
– A group of users who are served the same treatment.
• Evaluation Metric
– A metric used to compare the response to different treatments
• Experimentation
– A method of comparing two or more treatments based on a measurable metric. One variant, the status quo, is referred to as the ‘control’. (A minimal sketch of these concepts follows below.)
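To make these terms concrete, here is a minimal Python sketch, assuming a toy single-factor experiment on the unit “user”; the class and field names are illustrative, not eBay’s actual framework:

```python
from dataclasses import dataclass, field

@dataclass
class Treatment:
    """A variation of the experience served to the unit (e.g. a page or module)."""
    name: str                                     # e.g. "control" or "variant"
    factors: dict = field(default_factory=dict)   # factor -> value, e.g. {"ad": "yes"}

@dataclass
class Experiment:
    """Compares two or more treatments on a measurable evaluation metric."""
    unit: str              # entity being experimented on, e.g. "user"
    treatments: list       # by convention, the first entry is the control (status quo)
    evaluation_metric: str # e.g. "conversion_rate"

# A single-factor experiment: does showing an ad change the conversion rate?
exp = Experiment(
    unit="user",
    treatments=[Treatment("control", {"ad": "no"}),
                Treatment("variant", {"ad": "yes"})],
    evaluation_metric="conversion_rate",
)
```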
9. Treatment (or experience)
• Module
– Strict subset of the page
– User is treated to changes to a module
– E.g. zebra vs. integrated vs. distinct ads
• Page
– User is treated to different variations of the page
– E.g. 2L1R (Left column is twice as wide as right) vs. 1L2R
• Page Flow or Use Case
– User is treated to different variations of a use case
– E.g. different flows for listing an item for sale
10. Sampling
• Population
– The group you want to generalize to (people, time, place, setting)
• Sample
– The units selected from the population
• Sampling
– The process of selecting units from a population of interest
– By studying the sample you can fairly generalize the results to the population
• External validity (generalizability)
• Mechanisms (see the sketch after this list)
– Random
– Stratified random
– …
• What matters is the number of samples
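A minimal sketch of the two sampling mechanisms named above, using only the Python standard library; the population, strata, and sample sizes are invented for illustration:

```python
import random
from collections import defaultdict

# A made-up population of 600 users spread across three countries
population = [{"user_id": i, "country": c}
              for i, c in enumerate(["US", "US", "UK", "DE", "US", "UK"] * 100)]

def simple_random_sample(units, n):
    """Every unit has the same chance of being selected."""
    return random.sample(units, n)

def stratified_random_sample(units, n_per_stratum, key="country"):
    """Sample separately within each stratum so every stratum is represented."""
    strata = defaultdict(list)
    for unit in units:
        strata[unit[key]].append(unit)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, min(n_per_stratum, len(members))))
    return sample

print(len(simple_random_sample(population, 60)))      # 60 units, any mix of countries
print(len(stratified_random_sample(population, 20)))  # 20 units per country (60 total)
```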
11. Experiments
• A/B testing
– A form of testing in which two treatments, a control (‘A’) and a variant (‘B’), are compared.
– No emphasis on cause (factor)
• Single-factor testing
– A form of testing in which treatments corresponding to the values of a single factor are compared
– E.g. Ad – Yes/No
• Multi-factorial testing (DOE)
– A method of testing in which treatments corresponding to combinations of values of multiple factors are compared (see the sketch after this list)
– E.g. Ad – Yes/No, Location – Top/Bottom
– Manual vs. Automated
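As a rough illustration of the difference between single-factor and multi-factorial designs, the sketch below enumerates the treatments for each; the factor names follow the Ad/Location example above and are not from an actual eBay experiment:

```python
from itertools import product

# Single-factor test: the treatments are just the values of one factor
single_factor = {"ad": ["no", "yes"]}

# Multi-factorial test (DOE): the treatments are combinations of factor values
multi_factor = {"ad": ["no", "yes"], "location": ["top", "bottom"]}

def enumerate_treatments(factors):
    """Build one treatment per combination of factor values (full factorial)."""
    names, values = zip(*factors.items())
    return [dict(zip(names, combo)) for combo in product(*values)]

print(enumerate_treatments(single_factor))  # 2 treatments
print(enumerate_treatments(multi_factor))   # 4 treatments: ad x location
```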
12. Objective
• To explore relationships between factors
• Relationships
– None
– Correlational / synchronized
• Positive vs. Negative
• Third-variable problem
– Causal relationship
• Establishing causal relationship
– If X, then Y
– If not X, then not Y
• Distinguish significant factors and interactions
• Measure impact on the metric
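Distinguishing significant factors in practice comes down to a statistical test on the evaluation metric; below is a minimal sketch of a two-proportion z-test on conversion rates, with invented counts for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Test whether the conversion rates of control (A) and variant (B) differ."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))             # two-sided p-value
    return z, p_value

# Example: 1,200 of 50,000 control users converted vs. 1,320 of 50,000 variant users
z, p = two_proportion_z_test(1200, 50000, 1320, 50000)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p-value suggests a real effect
```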
14. Reduce Email Guessing
• Purpose
– Measure the decline in registrations from introducing a blocking message
– Users cannot create a username that equals their email address
– E.g. Username: cooky1 Email: cooky1@gmail.com
• Metrics
– Number of registrations
– Reduction in phishing
• Samples
– 3% US
• Treatments
– Classic, Blocked
• Outcome
– No difference in registrations
– Improved security
15. Text Ads on SRP
• Purpose
– Determine how text ads on search result pages affect overall revenue
• Metrics
– Overall revenue
• Samples
– 1% US, International
• Treatments
– Ad, No-ad
• Outcome
– Overall revenue increased in
certain markets
16. Home Page
• Purpose
– Optimal construction of page
– Per user segment?
• Metrics
– Overall revenue
• Samples
– Varied per treatment
• Treatments
– 100s of variations
– Ads, Merchandising, P13N,
Navigation, Layout
• Outcome
– Page structures different for
different user segments
17. What we think about
Fidelity of Experiments
The quality of the model and its testing conditions in representing
the final feature or product under actual use conditions
Cost of Experiments
The total cost of designing, building, running, and analyzing an
experiment
Iteration time
The time from planning experiments to when the analyzed results
are available and used for planning another iteration
Concurrency
The number of experiments that can be run at the same time
Signal/Noise Ratio
The extent to which the signal (response) of interest is obscured
by noise
Type/Level of Experiment
Types and Levels of experiment that can be carried out
19. Implementation Considerations
• User identification
• User → Sample (see the sketch after this list)
– No bias towards any experiment or treatment
– Sticky-ness between activities (and sessions)
– No interaction between experiments
– Enabling a user to try out a specific treatment
– Ramping-up to understand generalization effects
• Sample → Treatment
– No bias
• Splitting traffic
– Inline
– Application server
– Load balancer
– Browser
• Factor-driven development
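One common way to get unbiased, sticky, non-interacting assignment is a deterministic hash of the user ID salted with the experiment name; this is a minimal sketch under that assumption, not eBay’s actual implementation:

```python
import hashlib

def assign_treatment(user_id: str, experiment: str, treatments: list, ramp_pct: int = 100):
    """Deterministically map a user to a treatment.

    Sticky: the same (user, experiment) pair always hashes to the same value.
    Independent: salting with the experiment name decorrelates experiments.
    Ramp-up: only ramp_pct percent of users enter the experiment at all.
    """
    value = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    if value % 100 >= ramp_pct:                           # ramp-up gate
        return None                                       # user keeps the default experience
    return treatments[(value // 100) % len(treatments)]   # even split among treatments

# The same user always lands in the same treatment for a given experiment
print(assign_treatment("user-42", "srp_text_ads", ["control", "text_ad"], ramp_pct=10))
```

The same idea can be applied at any of the traffic-splitting points listed above (inline, application server, load balancer), as long as every layer hashes the same inputs.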
20. Measurement – A case of traveling shoppers
Page | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Unique users (week)
Page-1 | Alice | Alice | Bob | Charlie | Bob | Charlie | Alice | 3
Page-2 | Bob | Alice | Bob | Alice | Bob | Bob | Alice | 2
Unique users per day | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 3
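One reading of the table, which matches the counts on the slide: summing daily unique visitors (10) over-counts shoppers who return on several days, while only 3 distinct users actually visited during the week. The sketch below reproduces those numbers from the visit data:

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
visits = {
    "Page-1": ["Alice", "Alice", "Bob", "Charlie", "Bob", "Charlie", "Alice"],
    "Page-2": ["Bob",   "Alice", "Bob", "Alice",   "Bob", "Bob",     "Alice"],
}

# Unique users per page over the whole week
for page, users in visits.items():
    print(page, len(set(users)))                             # Page-1: 3, Page-2: 2

# Unique users per day (across both pages) vs. unique users for the whole week
daily = [len({visits[p][i] for p in visits}) for i in range(len(days))]
print(daily, sum(daily))                                     # [2, 1, 1, 2, 1, 2, 1] -> 10
print(len({u for users in visits.values() for u in users}))  # 3 distinct users all week
```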
21. Limitations and ways to overcome them …
• Sticky-ness to user
– Session-level analysis
• What, not why
– When qualitative research complements
• Short-term vs. Long-term effects
– Think about the duration of the experiment
• Newness effect
– Consider burn-in periods
• Minor vs. major differences
– Think about amount of effort being committed
• Anonymity of tests
– When qualitative research spills the beans
22. Key takeaways
• Experimentation is one of the most effective approaches for gaining
quantitative insights
• Enables businesses to quickly understand and establish relationships
between product changes and their impact on business metrics
• Different types and levels of experiments can be used to gain different degrees of insight
• Experimentation has limitations, but they can be overcome
• Think about “experiment-ability” as another “-ability” in product design