Why Multi-armed Bandit Algorithm is Not “Better” than A/B Testing
Founder and Chairman of Wingify.
Recently, there has been a lot of interest in a new, alternative approach to A/B testing. Thanks to a popular blog post with a catchy title: 20 lines of code that will beat A/B testing every time, this new, new thing was all over the Internet (Actually, it is not new. But public interest in it is new). This new approach is called multi-armed bandit algorithm, and since Visual Website Optimizer is an A/B testing tool, our customers were curious what our response to this alternative approach was. We didn’t want to rush into commenting, so we read the theory and did some simulations. What we found out was that the reality is not as simple as what that blog post claimed to be. In short, multi-armed bandit algorithms do not “beat” A/B testing. In fact, a naive understanding of this algorithm can lead you to wondering what’s happening under the hood.
What is Multi-armed Bandit Algorithm?
It’s actually shockingly simple. Let us explain it in website optimization context. Suppose you have two versions of a landing page (say a control and a variation). In normal A/B testing, you will split your traffic equally between these two versions, so both get 50% traffic.
However, in multi-armed bandit, what happens is that:
- For 10% of the time, you split your traffic equally between the two versions (called the exploration phase)
- For the rest 90% of the time, you send traffic to currently best performing version (called the exploitation phase)
Yes, it is that simple (and of course, you can tweak this 10/90 ratio). Now the idea promoted in the blog post linked above is that this simple strategy will beat A/B testing every single time. It is because for most of the time, we’re sending traffic to currently best performing version, so average conversion rate in a multi-armed bandit campaign is much higher as compared to simple randomization of A/B testing (where you keep sending equal traffic even to a worse performing version).
So, Is Multi-armed Bandit Really a Panacea?
Not so soon! Actually, if you just talk about average conversion rates, multi-armed bandit algorithms usually perform better than A/B testing. But a fundamental point missing here is the concept of statistical significance.
When you have a control and a variation, your objective is to find out which one is better performing, right? Of course, if your variation is performing badly, you will lose some sales and conversions in the process of A/B testing but that is the price you pay for finding out if variation really did perform so badly.
For finding out if a variation is performing better or worse, we use statistical tests such as a Z-test or a chi-square test, and mathematically (and intuitively) you need to have tested at least a certain number of visitors before you can say with any certainty that the variation is really bad performing (or is current numbers are just due to chance).
What multi-armed bandit algorithm does is that it aggressively (and greedily) optimizes for currently best performing variation, so the actual worse performing versions end up receiving very little traffic (mostly in the explorative 10% phase). This little traffic means when you try to calculate statistical significance, there’s still a lot of uncertainty whether the variation is “really” worse performing or the current worse performance is due to random chance. So, in a multi-armed bandit algorithm, it takes a lot more traffic to declare statistical significance as compared to simple randomization of A/B testing. (But, of course, in a multi-armed bandit campaign, the average conversion rate is higher).
There’s an inverse relationship (and hence a tradeoff) between how soon you see statistical significance and average conversion rate during the campaign.
In a multi-armed bandit algorithm, what you gain by increased average conversion rate, you lose on time it takes to see statistical significance. Hence you need to test a lot more visitors to gain the same amount of confidence in results.
Simulation Results
Intuitively, our understanding of the algorithm did make sense, but we wanted to be sure. So we fired up a Python terminal and hacked a quick script for running simulations for both A/B testing randomization and multi-armed bandit optimization. Objective was to simulate a campaign of 2 and 3 versions with known conversion rates, and then for both algorithms measure two key metrics:
- How soon statistical significance (of p-value < 0.05) was detected (in a two tailed Z-test)
- Average conversion rate for the whole test till that point
Just to be sure of statistical significance (sometimes it is detected too early due to chance), we counted till the statistical significance was seen at least 10 times during the campaign. We ran all simulations for 10,000 iterations.
Following algorithms were evaluated:
- Simple randomization of A/B testing: RAND
- Multi-armed bandit (10% exploration, 90% exploitation): MAB-10
- Multi-armed bandit (25% exploration, 75% exploitation): MAB-25
- Multi-armed bandit (50% exploration, 50% exploitation): MAB-50
- Multi-armed bandit (77% exploration, 23% exploitation): MAB-77
- Multi-armed bandit (90% exploration, 10% exploitation): MAB-90
Scenario #1: two versions (A = 10% conversion rate and B = 20% conversion rate)
Iteration # when statistical significance was reached | Average Conversion Rate | |
---|---|---|
RAND | 248 | 15%^{#} |
MAB-10 | 1400 | 19.8% |
MAB-25 | 564 | 20.1% |
MAB-50 | 278 | 18.5% |
MAB-90 | 214 | 17%^{^} |
# You can guess that in simple randomization, average conversion rate will be very close to average of conversion rate of two versions
# The median conversion rate here was 15.8%, so in reality (and intuitively as well) multi-armed bandit that does exploration for 90% of the time is similar to randomized A/B testing
From these results, it does appear that MAB-50 could be a very good algorithm but perhaps it is because we have two versions only, and that too with conversion rates different. We thought of evaluating the algorithms when conversion rates were similar.
Scenario #2: two versions (A = 10% conversion rate and B = 11% conversion rate)
Number of times no statistical significance was seen^{%} | Iteration # when statistical significance was reached | Average Conversion Rate^{*} | |
---|---|---|---|
RAND | 5 | 1700 | 10-11% |
MAB-10 | 12 | 3400 | 10-11% |
MAB-50 | 6 | 2000 | 10-11% |
% We ran 25 simulations each of 10,000 iterations. Since the difference between conversion rates was less, simulations took a lot more runs before detecting statistical significance. And sometimes, even after 10,000 iterations they couldn’t detect the difference.
* Interestingly, when the conversion rate of different versions is similar, not a lot of optimization can be expected, so average conversion rates are always been 10-11%.
Note that as the conversion rate difference becomes less, the advantage of multi-bandit algorithm fades away, and yet they still a take more time than simple randomization to detect statistical significance. Curious point is that in real world (as opposed to a simulation) you would only know if conversion rate difference is less after the campaign has run. So, since in this case the performance of randomization and multi-armed bandit is same in terms of average conversion rate, before running any campaign you can only “guess” whether multi-armed bandit is worth the extra time it will take for detecting statistical significance.
Scenario: 3 three versions (A = 10% conversion rate, B = 15% conversion rate, C = 20% conversion rate)
We extended our simulations with 3 versions to see if anything interesting emerges. In this case statistical significance was calculated for all 3 comparisons (AB, AC, BC) and similar to cases above, we waited till statistical significance was found at least 10 times during the simulations. Here are the results:
Iteration # when statistical significance was reached | Average Conversion Rate | |
---|---|---|
RAND | 270 | 15% |
MAB-10 | 1540 | 18% |
MAB-50 | 555 | 17.7% |
MAB-77 | 360 | 16.8% |
Summary
So as you can see from the results, there’s a clear tradeoff between average conversion rate and the time it takes to detect statistical significance. Moreover, it is also clear that any advantages of multi-armed bandit algorithms vanish if conversion rate of different versions is similar. The only scenario where multi-armed bandit algorithms would work best is if performance of different versions is massively different (which is infrequent in practice). Even in such cases, since difference is massive simple randomization would detect statistical significance quite early on, so for rest of the time you can simply use the best performing version.
This is not to say that multi-armed bandit algorithms are useless. They’re very useful in the scenario where you are doing continuous and permanent optimization and are frequently adding/removing versions to be tested. It is also useful in cases where you can afford to wait for a longer duration before knowing with certainty which version is performing the best (or the worst). (Or, in cases where don’t care about the knowledge and the certainty, and you simply want average conversion rates to be optimized.)
So, comparing A/B testing and multi-armed bandit algorithms head to head is wrong because they are clearly meant for different purposes. A/B testing is meant for strict experiments where focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where focus is on maintaining higher average conversion rate.
Hope this article was useful. Do let us know your comments and thoughts!
“Multi-armed bandit (77% exploration, 33% exploitation): MAB-77”
110%?
@The Dude: sorry, yes a typo. Updated the article.
@Ken: yes, that’s correct. Thanks for adding this.
I suppose MAB would be an interesting setup in a never-ending test. A test where you keep ALL earlier experiments in the pool while continously adding new variations.
Would this setup not take into consideration changes in user’s behaviors in that sense that if (for som ereason or another) a variation suddenly trends, it could (theoretically) suddenly become the best performing variation – even though for just a while?
Or am I totally of the rocker on this one?
FYI … the term “multi-armed bandit” describes the _problem_ (of optimizing results across many alternatives). The particular optimization _strategy_ described by Steve Hanov (and above) is called “epsilon-greedy.”
^ this. Also, epsilon-greedy doesn’t seem to be the industry standard. In practice probability matching or Thompson sampling is used, which would still allocate some traffic to bandits that are not currently optimal.
1. Could you gist your Python code?
2. You do not indicate whether you’ve run enough simulations to attain a high statistical significance (over the meta- result that each campaign statistical significance has been attained)
3. Correct me if I’m wrong, but I don’t think A/B testing should be run “until statistical significance has been attained” because it foils the purpose of a statistical test, that is, you should rather define before hand a number of iterations, wait until they have been completed, and then check whether it gives you statistical significance
4. The purpose of the epsilon-greedy algorithm is that it can be run for very long duration without knowing which statistical significance is best because it will adapt the most commonly seen version of your site to this particular version (ie it is made to be run almost indefinitely, and will — in effect — automatically remove the non-working versions after a while!
M
@Mael: the code is posted here http://pastie.org/4007859 and regarding other points, please check the interesting discussion on Hacker News http://news.ycombinator.com/item?id=4052997
“Just to be sure of statistical significance (sometimes it is detected too early due to chance), we counted till the statistical significance was seen at least 10 times during the campaign. We ran all simulations for 10,000 iterations.”
You have a multiple testing issue (when to stop), so you add some random hacks to your system. Now what do your new p values mean?
I think this discussion is much better with regards to why you might want to use A/B testing over multiple armed bandits like algorithms in practice.
http://news.ycombinator.com/item?id=4040022
@Rob: we’re yet to investigate how statistical power relates to MAB. It’s true that for A/B testing, statistical power is straightforward to calculate before experiment is setup. But we’re not sure if that’s true for MAB as well since MAB optimizes for conversion rate. So it is like comparing apples to oranges.
It occurs to me that the difference is mostly academic vs realistic. If I own a medium sized website, the statistical significance and, in all honesty, the actual answer to the question “which variation is better” is less important than getting the revenue from people clicking the buttons. So the questions I would be answering are as follows:
1) If you were to graph, from multiple randomized runs, the point at which each algorithm steadily indicates the “right answer”, which would win?
2) If I owned the website, what would be the percentage difference in revenue using epsilon-greedy vs A/B
I’m sure epsilon-greedy is no better than A/B in terms of speed to statistical significance, but the real question on most business owners’ lips is: Which makes me more money?
Could you please release the source code you used? I’d like to implement and test an alternative hybrid form.
@Jorgen: it’s here http://pastie.org/4007859
Surely a more sensible method is to not fix the A/B ratio in the epsilon-greedy algorithm. Specifically, adjusting the ratio to weight more heavily the more likely answer as the difference becomes more statistically significant. So the less certain you are, the more you should be keeping other options open. It would also be advisable to have a decay function on results in the past in this case to prevent new results being lost in the plethora of old data.
Your argument is only true for a low number of AB tests. If you want to find the best performing out of ABC then you have to run AB, AC, and BC tests. The time taken for statistical significance will keep growing and at some point will perform worse than MAB. MAB turns out to be the best option for conversion rate vs statistical significance. If I own a website selling goods, I am going to choose MAB because I want to make as much as I can, as fast as I can make it.
I didn’t do the math (yet), but I bet you can have much finer grained results with bayesian statistics, instead of the frequentist ones you seem to be using.
“Statistical significance” doesn’t mean much, actually. It’s just an arbitrary threshold, hopefully harsh enough to make you confident when you cross it.
My point is, every time a conversion failed (respectively succeed) with a particular setup, that’s evidence against (respectively for) that setup. Working under the assumption that each setup converts X% people at random, and that your knowledge of X is a (subjective) probability distribution over 0%-100%, every conversion attempt narrows your estimation by a precise amount.
Combined with a reasonable prior probability distribution for X (I can’t help you here), you can see exactly when you are Y% the new design performs better (or worse). You can then tweak your exploration/exploitation algorithm according to that number. Like, when Y is close to 50% (total uncertainty), it makes sense to explore all the time, but when Y is close to 0%, it makes sense to exploit nearly all the time.
Even better, there are ways to calculate the expected value of new information. You may be very sure setup B is worse than setup A, but becoming even more certain may be worth the loss of conversions.
But at that point, we’re way past 20 lines of code…
Your comparisons aren’t as useful as they could be.
Try a different comparison:
EVERY algorithm gets, say, 5000 hits.
For the A/B algorithm, the first, say, 500 hits are used to find statistical significance. If it discovers that one choice is more significant, then it uses the remaining 4500 hits with the more significant choice. Otherwise, it runs another test of 500 hits.
The multi-armed bandit algorithm stay the same.
Keeping the number of hits constant will give a percentage of conversions that matters most to the users of these algorithms.
Hmm, “continuous optimization where focus is on maintaining higher average conversion rate”… so, like, every single test I’ve ever seen or heard about running? Sure the nerd in me would like to see statistical significance sooner or whatever, cool story, but what really matters is _obviously_ **ALWAYS** conversion rate.
You could make your case much stronger if you compared the conversion rate of:
1. A/B testing until statistical significance, then use the winner for all subsequent iterations.
2. Multi-armed bandit testing for all iterations.
I suspect that method 1’s conversion rate with converge with method 2’s conversion rate as the number of iterations goes up, and probably be very close by the time MAB reaches statistical significance.
If you can show that you’re determining the winner more quickly, and what you lose early on in conversions you make up for later on, this would certainly make your case stronger!
One complexity is the likelihood that your A/B test will declare the wrong winner, this would be a small disadvantage to method 1, but could be factored in to the analysis.
Experiment and Optimization are different topics and scoring one by the criteria of the other will lead to mistaken conclusions.
Your simulation is a fiction. It only informs you of the behavior of the code you wrote. To connect statistical significance to reality requires honoring criteria you have not met.
Applying frequentest evaluation concepts to Bayesian and Markov learning methods will lead to mistaken conclusions because they follow from a different set of axioms. They both lead to numbers, but these numbers are not interchangeable and you must remember their qualifications when applying them to predicting reality.
In more frequentest terms, multi-arm bandit is searching for minimum total regret via a random walk, not rapid convergence to a significance predicate. It can do this in the face of changing conditions and weaker assumptions concerning control than frequentest significance requires, which is why such methods are now the norm in machine learning.
There are huge opportunities that you will not see with your current perspective. If you do not learn them, your competitors will.
I do not know of a particular reference for multi-arm bandit algorithms in specific, but they are a specialized case of Markov model learning described with vocabulary that predates a more modern broad and uniform understanding of these topics.
David Barber’s recent book is very good.
If you want a broad understanding of common mistakes due to unstated metaphysical assumptions concerning statistics, experiment and action, read Judea Pearl’s 2nd edition.
“””you need to have tested at least a certain number of visitors before you can say with any certainty that the variation is really bad performing “””
Please be careful. This sentence implies that you may not be aware of a problem that many of the studies I see, that leads to high false positives.
The problem is, people will start a study, then part way through they’ll do analysis on the data, to determine if the results are “statistically significant”. If they “are”, they’ll often stop the study and proclaim their findings. If they’re not, they’ll wait. What’s worse, is that they’ll often continue testing until their finding are “statistically significant”.
Remember, a good trial is one where it’s decided ahead of time how many samples will be taken (or, if appropriate, how much time will pass) ahead of time.
Nice post. Something I noticed regarding MAB is that MailChimp uses MAB by default if you choose to send an A/B Split Campaign; they send version A of your campaign to 10% of the audience and version B to another 10%, somehow pick a winner which is shared with the remaining 80% audience. I guess it works pretty well. A ‘true’ A/B split campaign may not work as well as MAB in this case, in my opinion.
Great detailed and mathematically sound post. I’m glad you went to the trouble to explain this to people. I have never heard of the multi-armed bandit before but when you described it my very first thought was, “Well, this is just going to make my tests take longer!”.
Who cares about the potential for a very short term payoff when my test duration now lasts so much longer? I’m anxious to get results that I can trust, not to squeeze an extra dollar out of my website this week.
Chris
You say: where you are doing continuous and permanent optimization and are frequently adding/removing versions to be tested.
Yes! And in fact, that is the ONLY scenario that every website owner finds themselves in. On the web, the OPPORTUNITY to try an endless number of variations is at hand everyday and the FLEXIBILITY to make changes is the essence of the web.
THEREFORE, what the website owner wants is an INDUSTRIAL STRENGTH testing method that can ACCOMMODATE the realities of the web and NOT BREAK due to some not-well-understood-by-most p test or chi-square test or other validity test, but rather a self optimizing, self-validating bullet proof solution that you can BANG ON CONTINUOUSLY and that just keeps on working and getting smarter.
A/B does not meet this requirement because it’s “too” precise, “too” slow and “too” rigorous. The hopper is too small.
Epsilon Greedy Multi Armed Bandit or Taguchi – call it what you will – is the better, faster, more convenient way to test the Pachinko-machine-like casino environment (thank you Seth Godin) and optimizing all the many little conversion-affecting components of web landing pages. It has a hopper that will accept anything you can throw at it. That’s what we want!
You still need A/B testing, but you can use MAB or Taguchi to identify your first control and get past your first round of testing. Use it to get SUSPECTS that you can then further test and prove via A/B while you simultaneously throw more alternatives at Epsilon Greedy.
This is an incredibly naive straw man argument against a particular (and not very good) multi-armed bandit algorithm.
Not surprisingly, the conclusions reached are equally naive.
You would have done much better if you had done even enough work to read the wikipedia page on the problem. Even better would be to examine the literature on sequential decision problems.
Your so-called A/B test is, in fact, just another algorithm for solving the multi-armed bandit problem. It is also a decidedly sub-optimal solution compared, say, with Thompson sampling or UCB algorithms.
I realize that you had to be all forceful and authoritative for marketing reasons, but you really would do better if you had done so after educating yourself.
Ted, thanks for responding. Even though I haven’t detailed the math, the generic difference between these two methodologies still stays. I agree with you that the specific name ‘Multi-armed Bandit’ encompasses a class of algorithms and not the particular one I have written about, but this article addressed a specific algorithm which people associated with the name MAB and I had to respond to it as I felt the understanding that the specific algorithm is the panacea to optimization was wrong.
I think you are still very far off-base here.
First, multi-armed bandit is a problem, not an algorithm nor a family of algorithms.
Secondly, A/B testing is a (simplistic) form of the epsilon-greedy algorithm with a single epoch.
The alternative that you outlined with 90%/10% traffic split is an erroneous implementation of epsilon-greedy.
In any case there are two striking problems in your posting:
1) the quality of a solution to the multi-armed bandit problem (whether your A/B split solution or any other) is best described as cumulative regret. You don’t even mention regret as a figure of merit.
2) you further the error of equating multi-armed bandit with the epsilon-greedy algorithm in your posting. Epsilon-greedy was introduced as a known sub-optimal solution to serve as a benchmark. It was definitely never intended to be considered a good solution.
You really ought to fix these mistakes.
Ted, my limited research showed that epsilon greedy is actually hard to beat. See this research paper “Multi-armed Bandit Algorithms and Empirical Evaluation” http://link.springer.com/chapter/10.1007%2F11564096_42?LI=true
They say that:
>One remarkable outcome of our experiments is that the most naive approach, the epsilon-greedy strategy, proves to be often hard to beat.
Do you have sources for a different conclusion? I’d love to get educated on this topic.
Hi Paras,
I agree with Ted that the approach presented here for multi armed bandit comparison with A/B testing could be improved.
The following paper gives a good and modern look on multi armed bandits:
A modern Bayesian look at the multi-armed bandit
http://www.economics.uci.edu/~ivan/asmb.874.pdf
The basic problem about MAB problems are assumptions about the probability distributions of the bandits.
As far as I now there are no automatic methods to figure out from which distribution a sample is derived. The following matlab functions gives some hints about that:
http://www.mathworks.com/matlabcentral/fileexchange/34943
If you know or assume the probability distribution of a multiarmed bandit problem it is a lot easier to solve. e.g. with the gittins index http://en.wikipedia.org/wiki/Gittins_index .
e.g. in our samples the basket size follows a log normal distribution. So this could be used as the metric for minimize loss for example for optimizing a free shipping boundry.
Hope that helps
Manuel
I’m not sure about either the multi-armed bandit nor the A/B testing approach described here.
First, we should not run a test until significance is achieved (see article by Evan Miller and others on “repeated significance testing error”). As I understand it, the significance (alpha) and sample size should be set before the test (see Wikipedia article “P-value”, Misunderstandings section, point #5).
Second, I’m uncertain about whether any statistical assumptions of the p-value calculation are violated by running a test using the multi-armed bandit method rather than perfect randomization (may someone can comment).
Third, it seems to me at any point early in the test, it’s not reasonable to ascribe any trend to the observed data. If we need 1000 samples just to start to get a vague idea of a real trend, then any pattern in the initial sample of 20 is not reliable even for an interim decision (e.g., the loser in the fist 20 samples may be the winner overall). If we preferentially target false winners, it seems to me we inhibit the effect of real winners on the data. It would take the test longer to re-align.
Finally, if there is measurement error (there is always some), it seems to me the error will be inflated by preferentially hitting the winning variation. And any trend toward a false positive will also accumulate on that variation.
Vlad, you’re right. A sample size has to be calculated in advance of a test. That sample size assures you that you would control your alpha and beta errors. This is the major difference between the two approaches. There’s no concept of alpha / beta in MAB. It’s an inherently different technique, and that’s the point of this article
Paras,
Vlad is not correct. You *don’t* have to choose sample size ahead of time.
You do have to pick it if you use old style tests.
You don’t have to pick it if you use a sequential test design with a test that actually works correctly. The Bayesian Bandit (based on Thompson sampling) is one such test.
A good system will tell you what the current best estimate is, what the problem that this is the actual best, what the probability that any candidate is within x% of the best and how much expected value per trial you will be leaving on the table if you stop the test now. A good system will also adapt the number of trials for each option to avoid wasting trials and will handle parameterized models.
The algorithm described in this blog is normally called epsilon-greedy and is known to succeed on only a few of these criteria.
Alternative methods based on t-tests such as you seem to be suggesting with your talk of pre-determined test sizes fail on all of these criteria.
Modern Bayesian approaches satisfy all of these requirements.
@Ted: yes, you’re correct. I was comparing the classical testing method with the popular epsilon greedy method (at least it was popular on the web at the time I wrote this article).
I’m aware that sequential testing and bayesian methods are superior to classical methods.
Since I don’t see a link to it, anyone reading this blog post should immediately follow it with:
http://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html
The short takeaway is that not only is the bandit technique shown here not a good one, the A/B test here is backed by garbage math in its significance tests.
PLEASE don’t use this article as a guide to what works without more investigation.
This article portrays a pretty misguided picture of bandit learning.