Mobile App Feature Configuration and A/B Experiments

Feature Configuration
and A/B Experiments
in native and hybrid mobile apps

Hello

Breaking Development Nashville
October 21 2013
lacy@etsy.com
Mostly at Etsy we work on greenfield development: new products or features built from
scratch. However, as our mobile platform and our apps mature, we find ourselves
wanting to take the principles we find most fruitful in other, older parts of Etsy
and apply them in the mobile realm. TL;DR: this is about the fun stuff.

When I talk to people about working at Etsy I expect they want to talk about
deploying web applications on tons of servers..

Or working with industry leaders like Rasmus Lerdorf..

But in fact typically people want to talk about the unique and sometimes strange
things you can find on Etsy.

Often these people don’t realize that interesting oddities can be purchased on their
phone or tablet.

One might wonder what kind of business there is in selling unique items over the internet,
but as you can see we do pretty well. In recent months we’ve seen as much as half of
our business happen on mobile devices. This makes mobile experimentation a really
exciting place to be.

Feature Flags

http://www.flickr.com/photos/mig/15964697/

The first of the concepts that we are trying to adapt from the old web world to mobile
is the notion of feature flags. We find wrapping things up as features incredibly
useful. Another of these principles of engineering is something we call Continuous
Experimentation.

Continuous Experimentation

When we talk about Experimentation though we’re talking about A/B experiments. If
you’ve ever been using an app or a site and noticed that someone sitting RIGHT NEXT
to you has a different looking app or site, this is probably part of an experiment.

• make small changes
• stay honest
• don't break the [product]

The name we give it comes from what we call “Continuous Deployment”. “Continuous
Experimentation” describes how we continue to develop apps and mature app features
after they’ve been working out in the world and have proven to be largely a good idea.

Real People
Brenda

https://www.etsy.com/shop/CattailsWoodwork

Why is it important to be honest and to not break the product? Because we’re talking
about the livelihood of real people.

Matthew

Real People

https://www.etsy.com/shop/PretentiousBeerGlass

People who make cool stuff and depend on Etsy to let them sell that cool stuff.

• stay honest

We use this to develop our products iteratively using real world feedback and data
about our users and how their experience is.

• stay honest

I love building things, making things. For lack of a better analogy, you can’t build an
automobile or a vacuum cleaner or whatever thing, sell that thing, have someone use
that thing every day, and then expect that you could change how the car works and look
at how your changes affect the experience or the performance of that thing from back
at the factory.

http://www.flickr.com/photos/usfws_pacificsw/6391185103/

Steve Jobs had this quote about how the computer was a bicycle for the mind. I feel
like software today allows us to use computers as a bicycle for industry.

Disclaimer
• Analytics / Big Data
• Experimental Analysis
• Exploratory Analysis

I’m going to talk mostly about design and development that will keep us from making
bad data or picking poor audiences. There are other elements here but let’s assume
we have some way to gather analytics and some way to analyze the data we have
gathered.

• Everyone does experiments
• But not everything works this way
• Rarely seller tools, usually public stuff
• The develop / release / cheer cycle
• Mobile Apps

Everyone at Etsy dabbles in experiments, at least a little. But really not every line and
every product gets measured by experiment. Sometimes products are just made
because they’re needed. Like making a mobile website.

Launch Day

Story time, we have a big announcement to make one day at Etsy..

Introducing
the New Thing

We are introducing a new thing, a new feature or whatever..

The new product exists in a ton of places. These are all DIFFERENT teams mind you!

iPhone

Android
Web

Email

iPad
This is all to say that this New Thing exists in an ecosystem across all these different
mediums. This is kind of the mindset you have to be in with cross-platform features and
their a/b tests. A feature now becomes a dimension in and of itself. It is not just part of your
website. Features can transcend an individual app.

Launch Day

Fortunately in this story, on launch day, QA has already seen the new product in operation,
ahead of the world. The App Store review process is already done. We had a release go out
last week. The code that will show this New Thing lies “dormant” out in the wild, on the
website, in the web views in the app and even in the app binary itself.

if ($cfg['cool-dog-pics']) { ...

[Diagram: index.php, index2.php]

in the PHP world..

When we flip the switch, the code can branch to the new code.

if (coolDogsIsEnabled()) { ...

[Diagram: base.js, base_new.js]

in the DOM..

if ([Config isEnabled:@"CoolDogs"]) { ...

[Diagram: SomeViewController.h, SomeNewViewController.h]

and in native apps..

One Line!

So the launch starts on an engineer’s MacBook. We push, for sake of illustration, as
little as one line of web code out to our web servers and API servers. This one line
mentions The New Thing by name, saying simply “turn on The New Thing”, and within
20-30 minutes you have the Newness throughout the product for millions of users,
across the ecosystem of all types, across the world.

Launch Day

So this story is really just like a parlor trick. It’s to get your attention. It’s fun to talk
about. The real value here is the illustration of this unique dimension, these
“features” which we can use to divide up our product that is spanning out in so many
different mediums.

Measure Everything
• Feature Configuration
• Benefits of Features
• Configuration as Tests
• How Flags Get to the Code
• Examples of Experiments
• Making Sensible Tests

The roadmap to making our mobile a/b experiments a reality..


So first some of the things we take for granted. Four years ago (four years is a really
long time) there was a blog article from Flickr I saw floating around on the internet. It
was about what they called feature flags. We all thought it was pretty cool stuff. I had
no idea why. Then at some point I started working at Etsy and now I have to preface
any technical discussion of interest with how feature configuration shapes the way we
make software.

Flags, Flippers, Flickr
http://code.flickr.net/2009/12/02/flipping-out/
$cfg = array();
$cfg['cool-dog-pics'] = array('enabled' => 'off');
$cfg['cool-dog-pics'] = array('enabled' => 'on');

This idea starts with a configuration array that names things on your site. Maybe cool
dog pics.

Flags, Flippers, Flickr
// Check the flag's 'enabled' value; a bare array would always be truthy.
if ($cfg['cool-dog-pics']['enabled'] === 'on') {
    echo Dogs::getCoolPics();
}

You basically push out all the code all the time. Even if it doesn’t work, you just keep
it turned off.


The original benefit here, the exciting thing about this from the Flickr article, is that
you don’t have to merge code or do gigantic atomic (and painful) releases. We push
code 30, 40, 50 times a day. We push code that doesn’t even work, on purpose,
because pushing tons of little pieces of resilient code is a lot more predictable than
trying to do a big release.

More Cool Stuff
$cfg = array(
    'enabled' => array('users' => 'lacyrhoades')
);
$cfg = array(
    'enabled' => array('admin' => true)
);
$cfg = array(
    'enabled' => array('groups' => 54321)
);

You could maybe do something like only enable that branch of code to be active for
one person, just some people you work with or a group of users you know have a
particular interest in the feature.
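
As a rough illustration of how such targeting rules might be evaluated, here is a minimal Python sketch. It is not Etsy's implementation: the `User` class, its fields, the `is_enabled` function and the use of lists for `users` and `groups` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user record; the field names are assumptions."""
    name: str
    is_admin: bool = False
    group_ids: set = field(default_factory=set)


def is_enabled(rule, user):
    """Return True when the flag's 'enabled' rule lets this user in."""
    # Simple switch: the flag is on or off for everyone.
    if rule in ("on", "off"):
        return rule == "on"
    # Otherwise the rule is a dict of targeting conditions; any match enables.
    if rule.get("admin") and user.is_admin:
        return True
    if user.name in rule.get("users", ()):
        return True
    return any(g in user.group_ids for g in rule.get("groups", ()))


cfg = {
    "cool-dog-pics": {"enabled": {"users": ["lacyrhoades"], "groups": [54321]}},
    "old-cat-pics": {"enabled": "off"},
}

lacy = User("lacyrhoades")
seller = User("brenda", group_ids={54321})
visitor = User("someone-else")

for person in (lacy, seller, visitor):
    print(person.name, is_enabled(cfg["cool-dog-pics"]["enabled"], person))
```

The point of the sketch is that the calling code only ever asks one question ("is this on for this user?"), while the rule itself lives entirely in configuration.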

Measuring Things

Another principle is measuring things in a real-time way.

Like here we’ve got what we call dashboards, so as we flip these switches on or off we
can look for anomalies and make sure nothing is blowing up.

Benefits of Features

Another trick we have is what we call “Slow Rampups”. Things don’t have to be just
“on” or “off”.
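
One common way to get a value between "on" and "off" is to hash a stable user key into a number from 0 to 100 and compare it with the current ramp-up percentage. The Python sketch below assumes that approach; the function names and the choice of SHA-256 are illustrative, not taken from the talk.

```python
import hashlib


def bucket_percent(flag_name, user_key):
    """Map (flag, user) to a stable number in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_key}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash, reduced to two decimal places.
    return (int(digest[:8], 16) % 10000) / 100.0


def is_enabled_at(flag_name, user_key, rampup_percent):
    """True when this user falls inside the current ramp-up percentage."""
    return bucket_percent(flag_name, user_key) < rampup_percent


# Ramping from 1% to 5% to 20% only ever adds users; nobody drops out.
for percent in (1, 5, 20):
    enabled = sum(is_enabled_at("CoolDogs", f"user-{i}", percent) for i in range(10000))
    print(percent, enabled)
```

Because the hash depends on the flag name as well as the user, the same user lands at a different position for each feature, so early ramp-ups do not always hit the same people.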

Benefits of Features

Launches can now happen on a schedule. There’s no sense in trying to make a
software deadline be punctuated by releasing that software. The software
development is going to take too long. You are going to need to be able to QA it,
change it and make it right.

Requisites
• No backwards incompatible changes
• Uniquely identified users

There are some limitations or requirements.. you’ll need to make sure the branches
in the code are not drastically different. The data schema for example has to be
backwards compatible. You’re going to need a way to uniquely identify users in order
to put them into groups and give them consistent experiences.

Features in the
New World

Feature flags were made for websites. But your website is not a website anymore. It
hasn’t been a website for a while now. Your website probably employs at least one
mobile developer who doesn’t even look at your “website” every day. Similarly feature
flags have to adapt to this new world.

Features as Tests

A/B tests put people in buckets, giving each bucket a different value for one flag. Watch
what they do over time; this means attaching information about their test buckets to
analytics events. Then make your app better based on the data you get back.

An Example
Feature

On the main screen. The individual facets you see there are powered by different parts
of the infrastructure.

The personal activity feed is fed from a map-reduce stack in PHP. The curated
panels are fed from a Java-based search stack. One of these might mess up and you
want the app to go on without it. This is a great place to start for dividing up features
in your design.

An Example
Prototype

For a prototype we might want to gather feedback from a select group of users. For
example, the photo editor in our app started with a prototype group and grew with the
feedback we received.

This was a photo editing interface in the Etsy app. We wanted to know some
particularly qualitative things about it, like was it easy to use and understand. It’s not
exactly like we can study really dry analytics to get at this sort of answer. So we
invited some interested sellers to join a group on Etsy. Those sellers could see the
new tools and we could at that point gather feedback from them.

This was a redesign of the activity feed on our mobile web. We used a feedback group
to preview the features and make sure we got it right.

One of our recent experiments.. mobile templates vs. desktop templates. It turns out
from the measurements we could make that users were quantifiably more satisfied
with the desktop “look” of Etsy on tablet devices.

We wanted to really go down the road of experiments as the web side of Etsy has
before us. We took this listing view as a place to start. Here’s a reasonably priced
shadow puppet. Here it is (right) with the experiment enabled.

Here was the flow before. The last three steps were all web views on the server side.

Here’s the flow afterwards. The idea was reducing the number of steps would
significantly reduce friction in the checkout flow.

Experiment Steps

• Set up a feature flag

First we needed to look at eligibility, or who can take part in this experiment. If the
eligible audience is VERY small compared to the general audience of all users, our total
number of "people who have used this feature" is going to be small, and so the
numerator of "people who bought an item with this feature" will also be small. Then
there’s the feature flag. We make this and start to write code against it, so we’re
pushing code from day 1.

Experiment Steps
$config['mobile']['iphone']['BuyItNow'] = array(
    'enabled' => 0,
    'group' => 54321,
    'admin' => 0
);

The config might look something like this. This goes into the one config file that’s at
the center of everything.

Experiment Steps

• Determine eligibility programmatically

Next we need to determine eligibility programmatically. We’ve already determined our
eligible audience is of considerable size at this point. We did that as part of
exploratory analysis. This is more about being able to say in the code, something
like..

Eligibility
$eligible = isEligible($user, $listing);
if ($eligible) { ... }

We want to answer questions like, do we have this user’s billing information on file?
Can this listing even be bought using that credit card? Can the seller who’s selling
this item ship the item to the country the user lives in? If the answers to any of these
questions are not good, we need to not be including this buyer in the experiment.
We’ll dilute our results quickly, since we know there are a bunch of combinations of
items and buyers who can’t take part in the experiment.
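
A minimal Python sketch of such an eligibility check follows, assuming the three questions above are the whole rule. The `Buyer` and `Listing` classes and their fields are hypothetical stand-ins, not Etsy's data model.

```python
from dataclasses import dataclass


@dataclass
class Buyer:
    """Hypothetical buyer record for the example."""
    has_billing_info: bool
    card_type: str
    country: str


@dataclass
class Listing:
    """Hypothetical listing record for the example."""
    accepted_card_types: frozenset
    ships_to: frozenset


def is_eligible(buyer, listing):
    """Only buyer/listing pairs that could complete the purchase take part."""
    return (
        buyer.has_billing_info
        and buyer.card_type in listing.accepted_card_types
        and buyer.country in listing.ships_to
    )


puppet = Listing(accepted_card_types=frozenset({"visa"}), ships_to=frozenset({"US", "CA"}))
print(is_eligible(Buyer(True, "visa", "US"), puppet))
print(is_eligible(Buyer(True, "visa", "FR"), puppet))
```

Ineligible pairs are simply left out of the experiment, so neither the test group nor the control group is padded with visits that could never convert.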

Experiment Steps

• Determine eligibility
• Start hacking away

At this point we start coding up the native elements and the web elements, all the
while hiding them behind the feature configuration we chose before. We are generally
careful when we need to add code to shared libraries or shared files.

Experiment Steps
• Begin testing

When things are working the way we expect, we begin testing with a small internal
group, usually people in QA or just staff members. Also we begin to QA the app as a
whole for release.

Experiment Steps
• Begin testing
• Put the product on the shelf

Once QA approves the way it works, most of the coding is done and we put the
feature on the shelf for a while. We’ll probably try to begin the app store review
process as soon as possible.

Eligibility

[EtsyConfig isEnabled:@"BuyItNow"];

The code in Objective-C will look something like this. The EtsyConfig class here is
going to be responsible for remembering “yes I did see this experiment, someone
asked about it, and the answer was: x”. That answer, and the specific question we
came looking for, need to be attached to the analytics events the user is firing.
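
The bookkeeping described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the real EtsyConfig class: `ExperimentConfig`, `tag_event` and the event layout are assumptions.

```python
class ExperimentConfig:
    """Answers flag questions and remembers which ones were asked."""

    def __init__(self, flags):
        self._flags = dict(flags)
        self._seen = {}  # flag name -> the answer that was given

    def is_enabled(self, name):
        # Unknown flags default to off.
        answer = bool(self._flags.get(name, False))
        self._seen[name] = answer
        return answer

    def tag_event(self, event_name, properties=None):
        """Build an analytics event carrying every flag answer seen so far."""
        return {
            "event": event_name,
            "properties": dict(properties or {}),
            "experiments": dict(self._seen),
        }


config = ExperimentConfig({"BuyItNow": True, "CoolDogs": False})
if config.is_enabled("BuyItNow"):
    event = config.tag_event("listing_viewed", {"listing_id": 1})
    print(event)
```

Only flags the code actually asked about are attached, so an event says nothing about experiments the user never encountered.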

Experiment Steps

• Experiment group
• Up to a certain percentage
• Analytics events

When the app is live, we can implement an experiment group. We can ramp up slowly
so that we can kick the tires and know things are okay. When things are looking good
we’ll take the experiment up to a percentage we established beforehand. Analytics
events are capturing the state of this test as people see it. Typically you can ignore
the state of this feature on analytics events for people who are completely ineligible.

Looking at results

• Self Selection
• Refunds / Returns
• Visit-level vs. User-level
Our initial results were actually pretty good. There are other things to consider,
drawbacks and perhaps biased design of the experiment.

How the Config
Gets into the Device

plist

The config starts as the plist you probably recognize from any Xcode project.

plist + server

We download a set of config values, things that are enabled or disabled, from the
server.

plist + server = runtime

At runtime we merge these values into a single dictionary.

Configuration Steps
• App launch
• Periodically later, login
• Merge downloaded config
• Post notification

The config is downloaded and merged whenever the app launches. It also is
downloaded when the user logs in or out, as their experiments might change. As a
final step in this merging we’ll post a notification in the app code so that UI elements
in the app which need to update based on any experimental code, can do so.
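
The merge-then-notify step might look roughly like the Python sketch below. It assumes that downloaded values override the bundled defaults and that interested UI code registers a callback; `RuntimeConfig` and its method names are made up for the example.

```python
class RuntimeConfig:
    """Bundled defaults merged with values downloaded from the server."""

    def __init__(self, bundled_defaults):
        self._bundled = dict(bundled_defaults)
        self._values = dict(bundled_defaults)
        self._observers = []

    def add_observer(self, callback):
        """Register a callable to run after every merge."""
        self._observers.append(callback)

    def merge_downloaded(self, downloaded):
        # Start again from the bundled defaults so values from a previous
        # login do not linger; downloaded values win on conflict (assumption).
        self._values = {**self._bundled, **downloaded}
        for callback in self._observers:
            callback(self)

    def is_enabled(self, name):
        return bool(self._values.get(name, False))


refreshed = []
config = RuntimeConfig({"BuyItNow": False, "CoolDogs": False})
config.add_observer(lambda cfg: refreshed.append(cfg.is_enabled("BuyItNow")))

config.merge_downloaded({"BuyItNow": True})  # app launch or login
config.merge_downloaded({})                  # logout: back to the defaults
print(refreshed)
```

In an iOS app the observer list would more likely be a posted notification, but the shape is the same: merge first, then tell the UI to re-check its flags.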

Bucketing Users

You’re going to need to know who individual users are if you want to put 20% into one
bucket, ensure that they stay there, and ensure that no one from the 80% control
group makes it into that experiment bucket.

Bucketing Users

• Persistent Cookies
• Device UDID
• user_id (where available)

The best thing to use here would be the user_id. This is not always available, like if
the user is logged out. Typically websites use persistent cookies to bucket logged out
users. Unfortunately an API for mobile apps doesn't have this avenue of cookies.
You've got to have some way to bucket users. One approach is to make up a sort of
UDID. Something that is specific to a device, and is stored locally with the app. This is
truly only unique to each install of the app, but it seems to work fairly well. You need
to pass that UDID to any webviews so that those webviews can identify themselves as
part of those app sessions.
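
Here is a small Python sketch of that identifier scheme, under a few assumptions: the install ID is a random UUID kept in a local JSON file, the user_id is preferred when present, and the ID reaches webviews as an `install_id` query parameter. None of these names come from the talk.

```python
import json
import tempfile
import uuid
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def load_or_create_install_id(storage_path):
    """Return the ID stored with this install, creating it on first use."""
    path = Path(storage_path)
    if path.exists():
        return json.loads(path.read_text())["install_id"]
    install_id = str(uuid.uuid4())
    path.write_text(json.dumps({"install_id": install_id}))
    return install_id


def bucketing_key(user_id, install_id):
    """Prefer the user_id; fall back to the install ID when logged out."""
    return f"user:{user_id}" if user_id is not None else f"install:{install_id}"


def webview_url(url, install_id):
    """Append the install ID so the webview can join the app's session."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [("install_id", install_id)]
    return urlunsplit(parts._replace(query=urlencode(query)))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "install.json"
    first = load_or_create_install_id(store)
    second = load_or_create_install_id(store)  # same value on the next launch

print(bucketing_key(None, first))
print(webview_url("https://www.etsy.com/cart?ref=app", first))
```

One thing this makes visible: a person who logs in moves from the install key to the user key, so their bucket can change at that moment, which is one reason the config is downloaded again on login.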

Sensible Testing

We mentioned before that the number of users, the percentage, was key to obtaining
significant results. If you don’t run the experiment long enough, you’re not going to
prove anything. If you take too long measuring something, you’re wasting time.
There are deadlines to meet for this or other things, etc. For our test results we follow
the equation used here:

ExperimentCalculator.com
via @mcfunley

• How many eligible visits per day?
• What percentage of visits will see the change?
• What is your current conversion rate?
• How will you change conversion?
• How confident do you want to be?
• How likely should you be to detect the change?

This is something one of our engineers made from a paper on statistics
(experimentcalculator.com). I don’t pretend to understand it exactly, but the idea is
that it gets you to not choose random numbers for the length of the experiment. You
have to weigh the cost of developing a new feature against the feature’s potential value.
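
To make those questions concrete, here is a Python sketch using the textbook normal-approximation sample size for comparing two conversion rates. This is an assumption for illustration; it is not necessarily the formula ExperimentCalculator.com uses, and the function names are invented.

```python
import math
from statistics import NormalDist


def visits_per_group(base_rate, expected_rate, confidence=0.95, power=0.80):
    """Approximate visits needed in each group to detect the change."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - base_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_run(visits_per_day, treatment_fraction, needed_per_group):
    """Days until the smaller of the two groups has enough visits.

    The sample size above assumes equal groups, so for an uneven split
    this is only a rough guide.
    """
    smaller_group = visits_per_day * min(treatment_fraction, 1 - treatment_fraction)
    return math.ceil(needed_per_group / smaller_group)


needed = visits_per_group(base_rate=0.10, expected_rate=0.12)
print(needed, days_to_run(10000, 0.20, needed))
```

The inputs map onto the questions on the slide: eligible visits per day, the percentage that sees the change, the current and hoped-for conversion rates, the confidence level and the chance of detecting the change (power). Smaller expected changes drive the required visits up quickly.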

Sensible Testing

Some shortcomings of these approaches include: lots more code. More permutations
of QA. For every test variant you add, you essentially add the need for another QA
user story. You risk introducing an unpredictable user experience. Changing minor
interactions is probably okay; changing the main navigation scheme in your
application probably isn’t. Analysis paralysis: at some point you’ve just got to make
decisions. A/B testing only goes so far in helping you choose direction for your
product. You can’t be creative in your product decisions by just piling A/B tests back
to back.

Things to
Watch Out For

Make sure the default is "off" for predictable, stable experiences and the sanity of
future support. You don’t want to have a bunch of old flags you need to keep around for
un-updated apps. If you've got a good app and a good idea going, you're probably
not going to discover a breakaway victory by running an a/b test. They're too subtle
and oftentimes only prove that your intuition about your users is heavily biased.

The Future

So get out there, do some experiments. Time is precious. If you're going to make
heads or tails of the numbers you see in an experiment, you'll probably need all the
time you can get.
