# Minimizing Regret in A/B Testing

A/B testing is useful in many situations. Whenever you don’t know whether option A, B, C or D is the best choice, and you are lucky enough to be in a position to try each one enough times, an A/B test provides a great way to find out and guide your future decision making.

Within advertising, A/B testing is often used to try out different visual designs or messages for ads for the same product or brand. These different designs, also called “creatives,” may vary the artwork, wording or visual style to see which version performs best according to a metric that matters to the advertiser. In online display advertising, some common metrics are the cost per acquisition (CPA, the advertising cost of each online sale the creative helps drive) and the click-through rate.

But are there limitations to this approach? And are there any good alternatives?

## The Crystal Ball

The perfect alternative to conventional A/B testing is the proverbial crystal ball.

If you already know that *Creative B* is going to drive more clicks and sales than *Creative A*, because you have a crystal ball, then running an A/B test is not only a waste of time, it will actually hurt you.

Why? Because for all the time the A/B test is in progress, you’re wasting 50% of your ad opportunities showing *Creative A* when you could be showing *Creative B*. There’s a cost to running this experiment: fewer people are clicking your ads and buying your products.

## “Regret” – The cost of the A/B Test

As an example, let us consider a case where we want to try two different styles of creative and optimize for click-through rate. We assume that the click-through rate is a characteristic of the creative itself and does not depend on how many times it is shown. Both creatives will be trafficked equally on the same sources of inventory and shown to the same users.

Here are our results after showing each creative 3000 times:

| | Impressions | Clicks | Click-Through Rate (%) |
|---|---|---|---|
| Creative A | 3000 | 15 | 0.5 |
| Creative B | 3000 | 30 | 1.0 |
| Total | 6000 | 45 | 0.75 |

Let us assume for a moment that the results above are statistically significant and *Creative A* has an intrinsic 0.5% click-through rate, while *Creative B* has an intrinsic 1% click-through rate on the inventory sources users were shown.

This means that, if we’d had a crystal ball, we could have just run *Creative B* from the start and ended up with 60 clicks instead of 45.

The difference between the clicks the optimal strategy would have driven and the total clicks we actually got while running the test is known as the “regret”. Here, the regret is 60 − 45 = 15 clicks.
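The regret calculation can be sketched in a few lines of Python. This is only an illustration using the numbers from the table above; the function name `regret` and the dictionary layout are assumptions made for this example, and it takes the intrinsic click-through rates as known, which in practice they are not.

```python
def regret(impressions_by_creative, ctr_by_creative):
    """Expected clicks lost versus always showing the best creative.

    Assumes the intrinsic click-through rates are known exactly, which is
    only true in hindsight (or with a crystal ball).
    """
    total_impressions = sum(impressions_by_creative.values())
    best_ctr = max(ctr_by_creative.values())
    optimal_clicks = total_impressions * best_ctr
    actual_clicks = sum(
        impressions_by_creative[name] * ctr_by_creative[name]
        for name in impressions_by_creative
    )
    return optimal_clicks - actual_clicks


# Numbers from the 50-50 test above: 3000 impressions per creative.
impressions = {"A": 3000, "B": 3000}
intrinsic_ctr = {"A": 0.005, "B": 0.010}
print(round(regret(impressions, intrinsic_ctr), 6))  # 15.0
```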

## Better than a lucky dip

But let us not write off conventional A/B testing just yet. If our crystal ball were phony, or we just used our own intuition and “guessed” which creative was best, we might also have ended up trafficking *Creative A* for the whole time rather than *Creative B* and ended up with only 30 clicks instead of 45.

So, provided the results above are statistically significant, the A/B test has done its job: it has taught us something about *Creative A* and *Creative B* that we can use in the future. At this point we could stop running *Creative A* in favour of *Creative B*, or run another test pitting *Creative B* against a new *Creative C* (a sort of “winner stays on” approach).

## The pain of waiting for results and statistical significance

Unfortunately, the above example is somewhat contrived. It’s unlikely that we would have statistical confidence in our results after just 3000 impressions and 15 clicks.

If we plot the click-through rate to date of each creative over time, calculated as the total clicks so far divided by the total impressions so far,

we might see something like the graph below. Note, we’ve also plotted the number of ads served to date for *Creative A* and *Creative B*. Since these are trafficked 50-50 as a fair test, we see the numbers are equal for both.

You can see that for the first couple of days of the test, *Creative B* has a higher click-through rate than *Creative A*, and the calculated click-through rate varies considerably until we get enough samples of data, which takes many days. After around day 23 in this example, we start to see stability in the results, and they remain fairly stable for the next 5 days. We could probably conclude fairly reasonably that *Creative A* has a higher click-through rate than *Creative B*. Note, if we had attempted to draw this conclusion earlier in the experiment, we might have come to some premature and incorrect deductions!
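For readers who want to reproduce this kind of plot, here is a minimal sketch of how a click-through rate to date could be computed from daily counts. The daily numbers are made up purely for illustration and are not the data behind the graph; `ctr_to_date` is a hypothetical helper name.

```python
from itertools import accumulate


def ctr_to_date(daily_clicks, daily_impressions):
    """Return the cumulative click-through rate (%) at the end of each day.

    Assumes at least one impression was served on the first day.
    """
    total_clicks = accumulate(daily_clicks)
    total_impressions = accumulate(daily_impressions)
    return [100.0 * c / i for c, i in zip(total_clicks, total_impressions)]


# Hypothetical daily counts for one creative (illustration only).
daily_impressions = [1000, 1000, 1000, 1000]
daily_clicks = [2, 12, 7, 9]
print(ctr_to_date(daily_clicks, daily_impressions))
```

Even in this toy series the rate swings from day to day before settling, which is the same early instability the graph shows.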

There are various statistical tests that can be used to test the hypothesis that *Creative A*’s click-through rate is better than *Creative B*’s, and so check that the conclusion is reasonable given the data we have. I won’t go into detail (you can read more about these here), but the long and short of it is that you often need to run experiments for many days to pass these tests, even once the results “seem” conclusive (like above).
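As one example of such a test, the sketch below implements a standard two-proportion z-test using only the Python standard library. The article does not say which test was used, so treat this as one common choice rather than the method behind these results; the function name and the counts in the usage lines are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided z-test for a difference between two click-through rates.

    Returns (z, p_value). Assumes enough impressions for the normal
    approximation to hold, and at least one click and one non-click overall.
    """
    rate_a = clicks_a / impressions_a
    rate_b = clicks_b / impressions_b
    # Pooled rate under the null hypothesis that both creatives perform the same.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b)
    )
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(120, 10_000, 90, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the observed difference would be unlikely if both creatives had the same intrinsic click-through rate. Checking it repeatedly as data arrives inflates the chance of a false positive, which is one reason tests are usually run to a planned length.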

An important factor here is that advertisers often want to update their creatives frequently to show different marketing messages or up-to-date brand messaging. This means it might not actually be possible to run the creatives long enough to have confidence in the results.

## An Alternative

For Quantcast’s Dynamic Creative Optimization product we use a different approach to creative selection, based on the multi-armed bandit model. Candidate creatives start out with fair weighting, i.e. each receives a 1/n share of traffic. As performance results accumulate, the weighting can shift in real time based on performance to date.

We are, in essence, ‘doubling down’ on an emerging trend. By weighting toward the early candidate, we can either quickly prove or disprove that it’s better, since we’ll get more samples. If we, say, start shifting at only a 60% confidence threshold (of rejecting the null hypothesis that both creatives have the same intrinsic performance), then over enough tests this early shifting will produce better-than-even results.

Although this isn’t as good as having a crystal ball, it means we are able to shift trafficking towards the best-performing creatives before statistical significance is achieved. If the performance of a creative changes over time, the system will learn this and adjust the weighting accordingly.

The advantage to this approach is that the number of ads using the most-likely sub-optimal choice of creative is minimized and regret is reduced. In other words, this approach results in more clicks.
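To make the idea concrete, here is a small simulation of one well-known bandit strategy, Thompson sampling with Beta posteriors. This is not necessarily the algorithm used in the product described here, and the function name, the uniform Beta(1, 1) prior, and the click-through rates passed in are all assumptions chosen for illustration.

```python
import random


def simulate_thompson(true_ctrs, rounds, seed=0):
    """Simulate Thompson sampling over creatives with the given true CTRs.

    Returns (impressions, clicks) per creative. Each creative starts with a
    uniform Beta(1, 1) prior, so early traffic is split roughly evenly.
    """
    rng = random.Random(seed)
    n = len(true_ctrs)
    clicks = [0] * n
    impressions = [0] * n
    for _ in range(rounds):
        # Draw a plausible CTR for each creative from its current posterior.
        samples = [
            rng.betavariate(1 + clicks[i], 1 + impressions[i] - clicks[i])
            for i in range(n)
        ]
        # Show the creative whose sampled CTR is highest.
        chosen = samples.index(max(samples))
        impressions[chosen] += 1
        # Simulate whether this impression produced a click.
        if rng.random() < true_ctrs[chosen]:
            clicks[chosen] += 1
    return impressions, clicks


# Intrinsic rates borrowed from the earlier example: 0.5% and 1%.
impressions, clicks = simulate_thompson([0.005, 0.010], rounds=50_000)
print(impressions, clicks)
```

Because a creative is shown only when its sampled rate is the highest, traffic drifts toward whichever creative currently looks best, while the apparent loser still receives occasional impressions so an early mistake can be corrected.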

As an example, the graph below shows how this works. In this graph, we’ve plotted the click-through rate to date, as before, but also the number of times *Creative A* or *Creative B* was selected by day.

As we can see, the algorithm initially (and incorrectly) begins trafficking *Creative B* more than *Creative A*, as the early results suggest it is the better performer. However, as it gathers more data suggesting *Creative A* is better than *Creative B*, it begins choosing *Creative A* a greater proportion of the time. As time goes on and the algorithm gains more confidence in the performance of *Creative A*, it traffics even more of *Creative A* and less of *Creative B*.

So how does this look in terms of clicks?

Given the equal trafficking A/B test strategy example, we see the following results if we total the whole period:

| | Impressions | Clicks | Click-Through Rate (%) |
|---|---|---|---|
| Creative A | 84,000 | 700 | 0.83 |
| Creative B | 84,000 | 616 | 0.73 |
| Total | 168,000 | 1,316 | 0.78 |

With the real-time weighted trafficking we see:

| | Impressions | Clicks | Click-Through Rate (%) |
|---|---|---|---|
| Creative A | 108,582 | 906 | 0.83 |
| Creative B | 59,744 | 439 | 0.73 |
| Total | 168,326 | 1,345 | 0.80 |

Over this period we already see the benefit in terms of clicks and overall click-through rate, and this would increase further if the system were left running for longer.

A further advantage of this approach is that new creatives can be seamlessly inserted into a running test with no deleterious impact: a new ad gets a weight of 1/n, where n is the new, larger number of creatives, and the learned weights of the other creatives are scaled down proportionally. This is great in an environment where constantly trafficking fresh creatives itself tends to give a boost in performance, as it prevents users from suffering from ad fatigue.
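The reweighting step described above might look something like the following sketch. The function name `add_creative` and the example weights are hypothetical, and the code assumes the existing weights already sum to 1.

```python
def add_creative(weights, new_name):
    """Insert a new creative at weight 1/n and rescale the existing weights.

    `weights` maps creative name to traffic share and is assumed to sum to 1.
    The existing creatives keep their relative proportions.
    """
    n = len(weights) + 1  # number of creatives after the insertion
    new_weight = 1.0 / n
    scale = 1.0 - new_weight  # share of traffic left for the existing creatives
    updated = {name: weight * scale for name, weight in weights.items()}
    updated[new_name] = new_weight
    return updated


# Hypothetical learned weights for two creatives, then a third is added.
weights = {"A": 0.7, "B": 0.3}
print(add_creative(weights, "C"))
```

In this example the new creative receives one third of traffic, while A and B keep their 7:3 ratio within the remaining two thirds.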

The other advantage of this approach is that it works in an unsupervised manner: no one needs to remember to “switch off” *Creative B* once the experiment is “finished”. Sub-optimal creatives will converge asymptotically toward a zero weight as significance is achieved.

## You only get what you give

In all of the approaches above, it is worth noting that any algorithmic approach to creative selection will only help select the best of the set of creatives provided. If the creatives don’t actually differ in intrinsic performance, there’s no magic that the creative selection algorithm can do to improve performance. In fact, as we have seen, running a test between “good” creatives with a high click-through rate and “bad” creatives with a low click-through rate requires us to show the bad creative a lot of times in order to allow the system to learn. If we already know from experience what makes a bad creative bad, then there’s little reason to run it.

So, even with a robust creative selection algorithm at your disposal, it’s still important to consider carefully which experiments you want to run, and have high standards when choosing which creatives you want to use to represent your brand.

That said, I don’t think any of us are oracles with crystal balls, so an algorithmic approach to automated performance-driven creative selection is a vital part of any online marketing strategy.

This same approach can be used across whole families of problems, such as automating specific element selection within an ad, like which call to action to use, which marketing message to use, which colours to use, or selecting between different product recommendation engines.

*Mark Hammond is a senior engineering manager on the Dynamic Creative Optimization Team at Quantcast.*