Tuesday, August 28, 2012

Bucket Testing System

Before launching or changing any major feature on Polyvore we run live experiments to find the best design, wording, flow or algorithm.

Our typical flow is to design a few variations, pick the most promising ones, run them as experiments against a percentage of our site traffic, measure how each one performs, iterate on the designs, and finally pick the best performing version.

Running experiments is baked into our product development process. We have run hundreds of experiments since launch and have a few live at any given time. Once we realized we’d be running lots of experiments, we took the time to build a bucket testing system that makes it easy and streamlined to define, run and report on experiments.

Our Implementation

The system we built has three components.

The first component defines the experiment, its variations (hereafter referred to as buckets) and how a bucket is selected for a given user. Most of the time we use the default selection behavior, which is to randomly distribute users based on a unique browser id cookie. Here is an example experiment definition:

add_to_cart => {
    desc => 'Show "Add to Cart" instead of "Buy" buttons',
    bucket_list => [
        {
            name => 'add_to_cart',
            probability => 0.5
        },
        {
            name => 'buy',
            probability => 0.5
        }
    ],
    # Extend tracking to GA, affiliate network and Splunk
    track => 1,
    # optionally override the default browser.id based bucket selection
    selector => sub {
        my ($bucket_testing) = @_;

        my $user_id = $bucket_testing->request()->user_id();

        if ($user_id % 2 == 0) {
            return 'add_to_cart';
        } else {
            return 'buy';
        }
    }
}

The second component is used in our application code, where we can ask the bucket testing system which bucket the current request falls into.

my $link_title;
if ($req->bucket_testing()->select_bucket('add_to_cart') eq 'add_to_cart') {
    $link_title = 'Add to Cart';
} else {
    $link_title = 'Buy';
}
print $req->a({ href => $url }, $link_title);

Bucket tests are not limited to simple UI changes; we also use them to test different algorithms, flows and settings. For example: is it better to skew our product search results towards cheaper products or more expensive ones? Should we use CloudFront or Akamai as our image CDN?
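
For illustration, here is a hedged sketch of what a non-UI experiment could look like with the same API; the 'image_cdn' experiment name, its bucket names and the hostnames are hypothetical, not our actual configuration.

my $image_path = 'images/12345.jpg';    # path of the image being rendered

my $cdn_host;
if ($req->bucket_testing()->select_bucket('image_cdn') eq 'cloudfront') {
    $cdn_host = 'img-cf.example.com';    # hypothetical CloudFront hostname
} else {
    $cdn_host = 'img-ak.example.com';    # hypothetical Akamai hostname
}
my $image_url = "http://$cdn_host/$image_path";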

The final component is the integration of the experiment selections with our various analytics systems. We record the selected bucket in our click tracking system, transaction tracking, GA and Splunk for analysis.
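
As a hedged sketch of what that hookup might look like (active_buckets() and track_event() are hypothetical stand-ins, not our actual tracking API), each tracked event simply carries the buckets the current request belongs to so results can be segmented later:

# buckets for this request, e.g. { add_to_cart => 'buy', ... }
my $buckets = $req->bucket_testing()->active_buckets();

$req->track_event(
    event   => 'item_purchased',
    buckets => join(',', map { "$_=$buckets->{$_}" } sort keys %$buckets),
);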

Cool Stuff

A huge chunk of Polyvore's UI is implemented in JavaScript, so to make the bucket testing system easy to access from the client we implemented an identical JS API. Our backend automatically makes the selected experiments available for use in JS.
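
A minimal sketch of how that handoff could work, assuming the backend embeds the active buckets as JSON for a client-side bucket testing object (the helper and object names here are illustrative, not our actual code):

use JSON qw(encode_json);

# serialize this request's bucket selections for the client-side API
my $buckets_json = encode_json($req->bucket_testing()->active_buckets());
print qq{<script>polyvore.bucket_testing.init($buckets_json);</script>};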

We also have an internal experiment dashboard which allows us to manually switch in and out of experiment buckets and to generate signed links that let a user opt into a particular bucket. We share these links with our VIP users to let them preview product changes that are not yet live for everyone and to get their feedback (and love).
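
Here is a hedged sketch of how such a signed link could be generated; the parameter names, secret handling and choice of HMAC are illustrative, not the actual dashboard code:

use Digest::SHA qw(hmac_sha256_hex);
use URI::Escape qw(uri_escape);

my $secret = 'server-side secret';    # hypothetical shared secret
my ($experiment, $bucket) = ('add_to_cart', 'add_to_cart');

# sign the experiment/bucket pair so users can only opt into buckets
# we explicitly handed them a link for
my $signature = hmac_sha256_hex("$experiment:$bucket", $secret);

my $opt_in_url = 'http://www.polyvore.com/'
    . '?bt_exp='    . uri_escape($experiment)
    . '&bt_bucket=' . uri_escape($bucket)
    . '&bt_sig='    . $signature;
# the server recomputes the HMAC before honoring the override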

The bucket testing system is integrated with the other measurement tools we use. For example, the selected experiments for a given visit to Polyvore are logged as custom Google Analytics variables. These can be used to segment visits by experiment and look at metrics like session length, bounce rate, etc. We also integrate with our access logs, which allows us to watch metrics in real time in Splunk, and with our affiliate tracking system, which lets us see how site changes affect conversion rates downstream.

Gotchas

seed()/rand()'s non-randomness

One issue we ran into while implementing our bucket testing framework was how to ensure that each experiment's bucket is selected independently. To keep selections isolated across experiments, we used the experiment name to seed the random number generator that picks a bucket for each experiment. The system worked fine for a while, but as more experiments were added, we noticed some oddities.

For example, we were concurrently running two experiments that both affected the rendering of actions on our shop page, and we noticed that one of them was getting no traffic. The reason, it turned out, is that given seeds that are different (but not VERY different), rand() returns similar results. The values returned aren't identical, but they fall within a similar range, and since buckets are selected based on a range this produced a correlation between the buckets selected across experiments.

For example, given three experiments (exp1, exp2, exp3), each with two 50% buckets 'on' and 'off', the probability of all three being on should be 50%^3 = 12.5%, but in practice we found it to be 5%, while all off was 20%. Within each experiment the bucket sizes were about 50%, but across experiments the selections were not independent. This correlation between experiments makes it difficult to assess the performance of one experiment on its own, since it could be disproportionately affected by another.
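
A small standalone script (not our production code) can illustrate the kind of check involved: derive a per-experiment seed for each simulated user, pick a 50/50 bucket with the built-in rand(), and count how often all three experiments come out 'on'. With truly independent selections the rate converges to about 12.5%; the correlated selections we observed in production drifted well away from that.

use strict;
use warnings;

my @experiments = qw(exp1 exp2 exp3);

# hypothetical per-experiment seed: user id combined with a simple hash
# of the experiment name (a stand-in for whatever the real system used)
sub seed_for {
    my ($user_id, $exp_name) = @_;
    my $name_hash = 0;
    $name_hash = ($name_hash * 31 + ord($_)) % 1_000_003 for split //, $exp_name;
    return $user_id + $name_hash;
}

my $users  = 100_000;
my $all_on = 0;
for my $user_id (1 .. $users) {
    my $on_count = 0;
    for my $exp (@experiments) {
        srand(seed_for($user_id, $exp));
        $on_count++ if rand() < 0.5;    # 50% 'on' bucket
    }
    $all_on++ if $on_count == @experiments;
}

printf "all-on rate: %.1f%% (~12.5%% expected if selections were independent)\n",
    100 * $all_on / $users;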

The solution was relatively straightforward: switch to a different RNG (Math::MT::Random) that produces well-distributed values even when the seeds differ only slightly.
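
The sketch below shows the general shape of the fix, using the CPAN Mersenne Twister module Math::Random::MT as a stand-in: each experiment gets its own seeded generator whose output is well distributed even for nearby seeds, and the bucket is picked by walking the cumulative probabilities.

use strict;
use warnings;
use Math::Random::MT;

# pick a bucket with a per-experiment Mersenne Twister generator instead
# of seeding the built-in rand()
sub pick_bucket {
    my ($seed, $bucket_list) = @_;
    my $gen  = Math::Random::MT->new($seed);
    my $roll = $gen->rand();                   # uniform in [0, 1)
    my $cumulative = 0;
    for my $bucket (@$bucket_list) {
        $cumulative += $bucket->{probability};
        return $bucket->{name} if $roll < $cumulative;
    }
    return $bucket_list->[-1]{name};           # guard against rounding
}

# usage with the add_to_cart buckets from the earlier definition
my $bucket = pick_bucket(
    12_345,    # hypothetical per-user, per-experiment seed
    [
        { name => 'add_to_cart', probability => 0.5 },
        { name => 'buy',         probability => 0.5 },
    ],
);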

Google Analytics Sampling Errors

To measure the aggregate performance of our experiments, we use Google Analytics custom variables to segment users. However, as we tested experiments on small populations (10%), we noticed that after launch a change often didn't have the effect the experiment had led us to expect. The margin of error for these small populations turned out to be extremely high.

Though we were aware that Google Analytics samples data when using custom segmentation, we weren't aware of how much it could affect the results. To determine the margin of error introduced by GA's sampling, we started tracking experiments with no-op buckets of varying sizes (4x10%, 3x20%, 2x50%), using segments of our highest-traffic pages and measuring the difference in performance between equal-sized buckets. The results were surprising: at 20% bucket sizes, sampling errors of around 0.5% were larger than the effect of our design changes. Only at 50% buckets did the error fall significantly below the observed magnitude of our changes.
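
For reference, a no-op experiment in the definition format shown earlier might look like the sketch below (the bucket names are illustrative); both buckets render identical pages, so any measured difference between them approximates GA's sampling error at that bucket size.

noop_50_50 => {
    desc => 'No-op control: identical experience in both buckets',
    bucket_list => [
        { name => 'control_a', probability => 0.5 },
        { name => 'control_b', probability => 0.5 },
    ],
    # report both buckets to GA and Splunk just like a real experiment
    track => 1,
}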

Because this sampling produced such wildly inaccurate data, we changed our policy to always test with large (50%) buckets when GA is the primary source of metrics for an experiment's performance. We also stopped treating changes close to or smaller than the margin of error between the no-op buckets as a success or failure.

For cases where we don't need GA's unique metrics, we have started using Splunk, which gives us complete, unsampled data and allows us to test with much smaller buckets.

Summary

Bucket testing is an essential technique for testing different versions of a design, wording, algorithm, etc. against segments of real users and measuring their performance. The results allow you to iterate and make better decisions. At Polyvore, we have been able to run hundreds of experiments because we invested the time in building an integrated bucket testing system that makes it easy and fun.