Why Multi-armed Bandit Algorithm is Not “Better” than A/B Testing
Founder and Chairman of Wingify.
Recently, there has been a lot of interest in a new, alternative approach to A/B testing. Thanks to a popular blog post with a catchy title: 20 lines of code that will beat A/B testing every time, this new, new thing was all over the Internet (Actually, it is not new. But public interest in it is new). This new approach is called multi-armed bandit algorithm, and since VWO is an A/B testing tool, our customers were curious what our response to this alternative approach was. We didn’t want to rush into commenting, so we read the theory and did some simulations. What we found out was that the reality is not as simple as what that blog post claimed to be. In short, multi-armed bandit algorithms do not “beat” A/B testing. In fact, a naive understanding of this algorithm can lead you to wondering what’s happening under the hood.
What is Multi-armed Bandit Algorithm?
It’s actually shockingly simple. Let us explain it in website optimization context. Suppose you have two versions of a landing page (say a control and a variation). In normal A/B testing, you will split your traffic equally between these two versions, so both get 50% traffic.
However, in multi-armed bandit, what happens is that:
- For 10% of the time, you split your traffic equally between the two versions (called the exploration phase)
- For the rest 90% of the time, you send traffic to currently best performing version (called the exploitation phase)
Yes, it is that simple (and of course, you can tweak this 10/90 ratio). Now the idea promoted in the blog post linked above is that this simple strategy will beat A/B testing every single time. It is because for most of the time, we’re sending traffic to currently best performing version, so average conversion rate in a multi-armed bandit campaign is much higher as compared to simple randomization of A/B testing (where you keep sending equal traffic even to a worse performing version).
So, Is Multi-armed Bandit Really a Panacea?
Not so soon! Actually, if you just talk about average conversion rates, multi-armed bandit algorithms usually perform better than A/B testing. But a fundamental point missing here is the concept of statistical significance.
When you have a control and a variation, your objective is to find out which one is better performing, right? Of course, if your variation is performing badly, you will lose some sales and conversions in the process of A/B testing but that is the price you pay for finding out if variation really did perform so badly.
For finding out if a variation is performing better or worse, we use statistical tests such as a Z-test or a chi-square test, and mathematically (and intuitively) you need to have tested at least a certain number of visitors before you can say with any certainty that the variation is really bad performing (or is current numbers are just due to chance).
What multi-armed bandit algorithm does is that it aggressively (and greedily) optimizes for currently best performing variation, so the actual worse performing versions end up receiving very little traffic (mostly in the explorative 10% phase). This little traffic means when you try to calculate statistical significance, there’s still a lot of uncertainty whether the variation is “really” worse performing or the current worse performance is due to random chance. So, in a multi-armed bandit algorithm, it takes a lot more traffic to declare statistical significance as compared to simple randomization of A/B testing. (But, of course, in a multi-armed bandit campaign, the average conversion rate is higher).
There’s an inverse relationship (and hence a tradeoff) between how soon you see statistical significance and average conversion rate during the campaign.
In a multi-armed bandit algorithm, what you gain by increased average conversion rate, you lose on time it takes to see statistical significance. Hence you need to test a lot more visitors to gain the same amount of confidence in results.
Simulation Results
Intuitively, our understanding of the algorithm did make sense, but we wanted to be sure. So we fired up a Python terminal and hacked a quick script for running simulations for both A/B testing randomization and multi-armed bandit optimization. Objective was to simulate a campaign of 2 and 3 versions with known conversion rates, and then for both algorithms measure two key metrics:
- How soon statistical significance (of p-value < 0.05) was detected (in a two tailed Z-test)
- Average conversion rate for the whole test till that point
Just to be sure of statistical significance (sometimes it is detected too early due to chance), we counted till the statistical significance was seen at least 10 times during the campaign. We ran all simulations for 10,000 iterations.
Following algorithms were evaluated:
- Simple randomization of A/B testing: RAND
- Multi-armed bandit (10% exploration, 90% exploitation): MAB-10
- Multi-armed bandit (25% exploration, 75% exploitation): MAB-25
- Multi-armed bandit (50% exploration, 50% exploitation): MAB-50
- Multi-armed bandit (77% exploration, 23% exploitation): MAB-77
- Multi-armed bandit (90% exploration, 10% exploitation): MAB-90
Scenario #1: two versions (A = 10% conversion rate and B = 20% conversion rate)
Iteration # when statistical significance was reached | Average Conversion Rate | |
---|---|---|
RAND | 248 | 15%^{#} |
MAB-10 | 1400 | 19.8% |
MAB-25 | 564 | 20.1% |
MAB-50 | 278 | 18.5% |
MAB-90 | 214 | 17%^{^} |
# You can guess that in simple randomization, average conversion rate will be very close to average of conversion rate of two versions
# The median conversion rate here was 15.8%, so in reality (and intuitively as well) multi-armed bandit that does exploration for 90% of the time is similar to randomized A/B testing
From these results, it does appear that MAB-50 could be a very good algorithm but perhaps it is because we have two versions only, and that too with conversion rates different. We thought of evaluating the algorithms when conversion rates were similar.
Scenario #2: two versions (A = 10% conversion rate and B = 11% conversion rate)
Number of times no statistical significance was seen^{%} | Iteration # when statistical significance was reached | Average Conversion Rate^{*} | |
---|---|---|---|
RAND | 5 | 1700 | 10-11% |
MAB-10 | 12 | 3400 | 10-11% |
MAB-50 | 6 | 2000 | 10-11% |
% We ran 25 simulations each of 10,000 iterations. Since the difference between conversion rates was less, simulations took a lot more runs before detecting statistical significance. And sometimes, even after 10,000 iterations they couldn’t detect the difference.
* Interestingly, when the conversion rate of different versions is similar, not a lot of optimization can be expected, so average conversion rates are always been 10-11%.
Note that as the conversion rate difference becomes less, the advantage of multi-bandit algorithm fades away, and yet they still a take more time than simple randomization to detect statistical significance. Curious point is that in real world (as opposed to a simulation) you would only know if conversion rate difference is less after the campaign has run. So, since in this case the performance of randomization and multi-armed bandit is same in terms of average conversion rate, before running any campaign you can only “guess” whether multi-armed bandit is worth the extra time it will take for detecting statistical significance.
Scenario: 3 three versions (A = 10% conversion rate, B = 15% conversion rate, C = 20% conversion rate)
We extended our simulations with 3 versions to see if anything interesting emerges. In this case statistical significance was calculated for all 3 comparisons (AB, AC, BC) and similar to cases above, we waited till statistical significance was found at least 10 times during the simulations. Here are the results:
Iteration # when statistical significance was reached | Average Conversion Rate | |
---|---|---|
RAND | 270 | 15% |
MAB-10 | 1540 | 18% |
MAB-50 | 555 | 17.7% |
MAB-77 | 360 | 16.8% |
Summary
So as you can see from the results, there’s a clear tradeoff between average conversion rate and the time it takes to detect statistical significance. Moreover, it is also clear that any advantages of multi-armed bandit algorithms vanish if conversion rate of different versions is similar. The only scenario where multi-armed bandit algorithms would work best is if performance of different versions is massively different (which is infrequent in practice). Even in such cases, since difference is massive simple randomization would detect statistical significance quite early on, so for rest of the time you can simply use the best performing version.
This is not to say that multi-armed bandit algorithms are useless. They’re very useful in the scenario where you are doing continuous and permanent optimization and are frequently adding/removing versions to be tested. It is also useful in cases where you can afford to wait for a longer duration before knowing with certainty which version is performing the best (or the worst). (Or, in cases where don’t care about the knowledge and the certainty, and you simply want average conversion rates to be optimized.)
So, comparing A/B testing and multi-armed bandit algorithms head to head is wrong because they are clearly meant for different purposes. A/B testing is meant for strict experiments where focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where focus is on maintaining higher average conversion rate.
Hope this article was useful. Do let us know your comments and thoughts!