Webinar

Minimize Conversion Loss in CRO Testing with Multi-Armed Bandits

Duration - 45 minutes
Speakers
Ishan Goel

Ex-Associate Director of Data Science

Anshul Gupta

Senior Data Scientist at VWO

Key Takeaways

  • Multi-Armed Bandits (MABs) can be effectively applied to lead generation campaigns, especially when you have multiple sources of lead generation and want to optimize your budget. The MAB algorithm can handle the budget and automatically test which sources are performing the best, diverting more budget towards those sources.
  • The choice between A/B testing and MABs depends on your objectives. If your goal is to determine the best long-term strategy, A/B testing is more suitable. If your goal is to generate as many leads as possible in a specific timeframe, MABs are the better choice.
  • MABs are scalable in terms of handling more variations compared to A/B testing. However, the insights gained from MABs are not future-proof, meaning they may not be scalable for larger audiences or long-term strategies.
  • VWO is introducing MABs into their platform, offering a new tool for users to optimize their websites and digital marketing strategies.
  • For further clarification or questions about MABs, A/B testing, or other aspects of digital marketing, reach out to the speakers, who are experts in these areas.

Summary of the session

The webinar, hosted by Shanaz from VWO, featured a deep dive into Multi-Armed Bandit (MAB) testing with Ishan and Anshul, Lead and Senior Data Scientists at VWO. The speakers discussed the exploration factor in MAB testing, which allows for dynamic traffic redirection to better-performing variations. They highlighted the inherent bias this introduces, making statistical analysis challenging.

To mitigate this, they explained their unique approach of applying a simple heuristic to ensure an equal proportion of traffic between all variations, thus reducing bias. The session concluded with a Q&A segment, providing further insights into these testing methods. The audience was encouraged to engage and ask questions throughout the presentation.

Webinar Video

Webinar Deck

Top questions asked by the audience

  • For MAB, what is the recommended size of data to ensure meaningful tests?

    - by Saurabh
    I think MABs can, yeah, and that is the benefit of an MAB. There's no such thing as a minimum number of visitors the way there is for an A/B test to draw a meaningful insight. You can start anyway, and the way it's going to work is that if you are implementing Thompson sampling, at the start it is going to distribute traffic equally by the nature of the setup. And as your test gains more visitors, whichever variation is performing well, the algorithm will by design start directing more traffic to that winning variation. So I don't think any minimum number of visitors is required as such.
  • How do you see a test being applied to a B2B SaaS lead gen campaign that takes a very long time, sometimes months to reach statistical significance?

    - by Ayala Josephis
    So, okay, I'll make whatever assumptions I can, and maybe Anshul can add to it later. What I personally feel is that if you have multiple sources of lead generation and you want to divert your budget to the best sources of lead generation, that is exactly an MAB use case. Suppose, for the sake of simplicity, you have $100 and multiple sources of lead generation, and you want to optimize what you can do with that budget. What you can do is run an MAB on those different sources: you can set a goal on which leads are being converted and let the MAB algorithm handle that budget. It will automatically test which sources and routes of lead generation are performing the best and then divert more budget towards those lead-generating sources. So yeah, that would be a very valid MAB use case. At a broader level, suppose there are 2 strategies you have for generating leads, and eventually you want to decide which strategy is more suitable in the really long run, with your future decisions depending on that strategy. Then an A/B test is actually your use case. But if your objective is, no matter which strategy it is, to generate as many leads as possible in a particular time window, then you should go for an MAB. At a really broad level, this is how I would decide which algorithm suits my needs.
  • How scalable are MABs compared to A/B testing or other ML-based methods? What are the average training inference times for a sample recommendation?

    Interesting. So, for scalability, if we mean whether MABs can handle more variations than A/B tests, then yes, in that sense MABs are scalable. But if by scalability we mean whether the winner that comes out of the campaign can be deployed to a larger audience or not, then I think MABs are not scalable. As we said, the insight that we get out of MABs is not future-proof. Suppose you run an MAB on a small subset of your visitors and then want to scale the result to a broader audience; in that sense, MABs are not scalable. I think I understand where the question is coming from: maybe a frequentist mindset, like a traditional ML approach, where you expect a large amount of data before it makes sense to apply ML. But because we are using a Bayesian strategy, we are incorporating the uncertainty in the system. When you have less data, you get a distribution that you are less sure of; whatever result you get will come with a wider range. So, sample-wise, there is no minimum requirement, because the Bayesian approach incorporates the knowledge that you have few samples. Things work out in this Bayesian world.

Transcription

Disclaimer: Please be aware that the content below is computer-generated, so kindly disregard any potential errors or shortcomings.

Shanaz from VWO: Hey, everyone. Welcome to another session of VWO webinars, where experts in digital marketing, experimentation, data, and product share their trade tips and inspiring stories for you to learn from. I am Shanaz, Marketing Manager at VWO. For those of you who do not know what VWO is, VWO is a full-funnel A/B testing, experimentation, and conversion rate optimization platform. Today, we have with us Ishan Goel, Lead Data Scientist, and Anshul Gupta, Senior Data Scientist at VWO, to answer all your questions about Multi-Armed Bandits. If you could please turn on your cameras, Ishan and Anshul, so that the audience can see you during this session.

I’m sure there’ll be a plethora of insights to take back from this session today. Before I pass on the mic to you guys, I’d like to thank everyone for tuning in from all over the world and inform you that we will be taking up questions at the end of the session. So, everyone, please feel free to drop your questions at any given point during the presentation, and we’ll take them up at the end. With that, guys, the stage is all yours.

 

Ishan Goel:

Let’s start. So, today’s webinar is basically about what multi-armed bandits are. And I know this sounds like a very, very complex name that doesn’t suggest anything about digital marketing or A/B testing. So I think we should start by first talking about what multi-armed bandits actually are and where the name comes from. Basically, multi-armed bandits were these slot machines that you would see in casinos in Vegas.

And the idea was that you had these multiple machines, and every machine had some fixed probability that you’ll win the jackpot, different for every machine. So say the first machine has a 10% jackpot probability, the second machine has 20%, and the third machine has 30%. The idea was that you didn’t know these probabilities, and you had to pull a lever expecting a jackpot.

And for one coin, you could just pull the lever once, so you had to try out different machines and then settle on the machine that you felt was the luckiest for you. Those were the multi-armed bandits; basically, when you win, you win the jackpot. Then the question comes: how did these become analogous to statistics and digital marketing? Why are we talking about them, basically?

So, basically, multi-armed bandits were formalized in statistics into a formal algorithm that is a minor variation of A/B testing. To recap, I would also tell you what A/B testing is. In the digital marketing world, A/B testing is basically this: you have a website and you want to experimentally improve the website for the best conversion rates.

Then what you do is you create a variation of the web page and use an A/B testing engine to test out which one has a higher conversion rate. And if your variation wins over the previous web page you have, you can deploy the new variation. So it is basically a formalized experimental testing method that equally distributes traffic between your two variations, 50%-50%, and then tells you in the quickest amount of time possible if your new variation is better than the previous variation in terms of how much the users are converting. So why are we talking about MABs today?

Because at VWO, we are now introducing MABs, which are a minor variation of the A/B test but have unique use cases that we’ll discuss on the next slide. I would also like to point out that Anshul is the one who has implemented and developed this algorithm for MABs at VWO. So, yeah, we can move forward, and I’ll further discuss the differences between A/B tests and MABs. Basically, the subtle difference between the A/B and MAB selection processes is something that you can see in the image on the left.

Basically, the A/B test equally distributes all your traffic. For a specified amount of time, till it reaches statistical significance, it tests the 3 variations by equally diverting the traffic towards them. Then at the end of the test, which is somewhere in week 5 in this figure, you decide which was the best variation, the one giving you the highest conversion rate, and you deploy that best-performing variation for the future.

Whereas MABs are slightly different. What MABs do is, rather than keeping a uniform, equal traffic distribution, slowly divert more and more traffic to the winning variation. So as the MAB becomes sure that option A is the one performing the best out of the 3, it starts diverting more traffic towards it, and that saves conversions for you while you are running the test. If you see the image on the left, where A/B testing is contrasted against bandit selection, you can see that as you progress through the test, the proportion of visitors allotted to option B and option C is reduced and the proportion of visitors allotted to option A increases.

And after week 5, when we are statistically sure that option A is better, you deploy it to the whole traffic. So that is the central idea behind an MAB and why it relates to the slot machines: such an algorithm is what you would use to win at the slot machines. Now, this basically makes it seem like MABs are absolutely better than A/B tests because they help you save conversions during the test. Whereas we need to clear the air: there are some pros of MABs and some pros of A/B testing. And the major limitation that MABs bring is that you cannot be entirely sure that their results are future-proof.

That is the major limitation. When we say statistical significance, what we really mean is that the results we got in the specified period of the test remain valid for the future as well, after the test. That future-proofing assumption is very well tested with A/B testing, because we equally distribute the traffic.

So we learn equally well about all the 3 variations, whereas MAB algorithms are less future-proof that way. What happens is that if option C is not performing very well at the start, the algorithm slowly diverges away and doesn’t give a lot of focus to option C, and it might be that option C was actually working better later, but it did not get enough traffic to prove itself. So in that way, MABs are more focused on the campaign conversions, whereas with A/B tests you might lose conversions while you are running the campaign, but you’ll be sure that you are deploying the best variation in the future. So, pointing out the pros and cons more clearly: A/B tests have a higher loss of conversions during the test because you are diverting so much traffic to the suboptimal variations.

And they are faster to reach statistical significance. We need to realize that statistical significance depends on being sure about all the variations running in the test, not just being sure about the conversion rate of option A. That is why MABs take a longer time to reach statistical significance, whereas A/B tests are faster. And the core ideology and the core objective behind MABs and A/B tests are actually different.

So even with this very minor variation, the core ideology is different, because A/B tests focus on scientifically learning which variations perform better and then deploying them for the future. It’s very important for them that they learn well. Whereas MABs focus on saving the conversions while you are running the test; they don’t worry about whether those learnings will be valid in the future. And as we go further, we’ll see that these translate to very different business use cases.

So, yeah, I’m sure we can go to the next slide. I am going to talk about the business use cases of MAB and where essentially you should think in terms of MABs and not A/B tests. One of the major applications of MAB is short-lived campaigns. Suppose you are having a flash sale on your website, say a 3-day flash sale.

Whatever variations you try out for those 3 days, whichever comes out to be the winner, you are never going to deploy it for the indefinite future. What matters is that through the flash sale, you get the highest conversion rate possible. After the flash sale ends, whatever you’ve learned in that period will not be useful. So essentially, if you look at it that way, A/B testing is an experimentation tool that tells you what is better for the future, whereas an MAB is an optimization tool.

It just helps you save conversions in that short-lived campaign. And it’s not restricted to just flash sales. If you think about it, you can even think of push notifications for that matter. If you have to push a set of notifications to your user base and you have 2 variations that you can try, and you want to see which one elicits a higher conversion rate, then suppose you are sending out 1000 notifications: you can bucket them in batches of 100 each and see the response of the first 100.

Then divert more traffic to the winning variation in the next 100, and so on. Push notifications as well are not something that you are going to deploy for the indefinite future. So, such types of short-lived campaigns. And if you think about it, I’ll tell you one more interesting use case. Suppose you have a slot on your website that you have to give to some third-party software to place ads, and you have to decide, from multiple ads, which ad you would place in that slot.

Suppose those ads are continuously rotating: old ads are going out and new ads are coming in, something like the bidding algorithm of Google. In that case, because new ads keep coming in, you really don’t want to A/B test those different ads against each other. What you really want is an MAB that continuously looks at the set of ads it currently has and diverts the traffic to the best-performing ad, and as new ads keep coming, traffic gets diverted to those new ads. So you should think of it this way:

Whenever a campaign has a very fixed visitor attention budget available to it, you should start thinking of MABs, because future learnings won’t help you a lot and, hence, you should not waste your conversions on A/B tests. So that was a major use case of MAB. But there are some other use cases as well. One I would like to tell you about is when you have multiple variations, and by multiple variations what I mean is, suppose you have 100 variations, which would generally come up in the case of a multivariate test; I’ll tell you in just the next 3-4 minutes what a multivariate test is.

So suppose you have many variations and you want to figure out which are the best among them. Say you have 20 variations and you go on dividing all your traffic into 20 buckets for all those variations; then the lowest-performing variations will perform very badly and cost you a lot. So, essentially, what digital marketers want in this scenario is to run a test with multiple variations that quickly converges towards the 3 or 4 variations that are the best performing, while the variations that do not make sense at all do not get traffic. So they get eliminated very quickly.

So that is another use case of MAB, and where this use case comes in very specifically is the multivariate test. For those who are not aware of what a multivariate test is, I’ll give you a quick idea. Suppose you have a web page; in a simple A/B test, you create a variation of the entire web page, whereas in a multivariate test, you have different widgets on the web page. So suppose there is a banner, there is an image, and there is an explanation.

So you create variations of all the different widgets: you can put in 5 variations of the banner, 3 variations of the image, and 4 variations of the explanation. And what a multivariate test does is take all possible combinations of these 3. So 5 x 3 x 4, which translates to 60. You realize how quickly the number of combinations bloated up to 60, and you can probably intuitively understand that a bunch of these would not make sense.
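To make the combinatorial blow-up concrete, here is a minimal sketch (the widget names are hypothetical) that enumerates the combinations a multivariate test would have to consider:

```python
from itertools import product

# Hypothetical widget variations for the multivariate test described above:
# 5 banners, 3 images, 4 explanations.
banners = [f"banner_{i}" for i in range(1, 6)]
images = [f"image_{i}" for i in range(1, 4)]
explanations = [f"explanation_{i}" for i in range(1, 5)]

# A multivariate test considers every combination of the widget variants.
combinations = list(product(banners, images, explanations))
print(len(combinations))  # 5 * 3 * 4 = 60
```

Even with a handful of variants per widget, the number of full-page combinations multiplies quickly, which is exactly why pruning weak combinations early matters.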

So in those cases, you would want to add an MAB so that the competition remains just between the 4-5 combinations that are making sense, and all the other 55 or so get eliminated pretty quickly and don’t waste your conversions. So that’s one use case of it. And sometimes it’s just for saving conversions: you might have a usual A/B test where you want to save conversions on the low-performing variations and you are in no hurry to get statistical significance. People in that case run an MAB rather than an A/B test just to save conversions. So those are largely the business use cases of MABs. Moving further, Anshul will tell us about the algorithm that we at VWO have deployed for MABs; we have built that algorithm specifically with some advantages so that it performs the best for multivariate tests. So, yeah, please.

 

Anshul Gupta:

Thank you, Ishan. Yeah. Now I will talk about the core components that were involved in building up this multi-armed bandit algorithm at VWO. These are weight initialization, weight updation, traffic split computation, and the exploration factor. I’ll be talking about only a high-level overview of these core components, but if you want a detailed understanding of the mathematics behind them, you can refer to the article which we will share after this talk.

And you can just read more about what really goes into it there. So let’s first start with weight initialization. Suppose that this is your website layout, and it has 2 sections: the first one is a shirts section and the other one is a trousers section, and there are multiple shirts and trousers that you have for each section.

Now, the first of these weights is the content weight. Essentially, we define that for every variant of a shirt and of trousers, we are setting up a weight. The meaning of the weight is essentially that you are trying to quantify the contribution of a particular variant of a shirt, let’s say, towards the conversion rate of this entire website layout. So in the case of content weights, every individual shirt and trouser will have a weight associated with it, and with this algorithm, you’ll eventually try to learn how much it is contributing towards your website’s conversion rate. The second part is the content interaction weight, which basically takes into account pairs of variants.

This is because what can happen is that a shirt and a pair of trousers may each do well on their own, but when they appear together, they clash and don’t do well with the audience. That is why we are taking these content interactions into account. The interaction weights basically come in only in the case of a multivariate test; in the case of an A/B test, there would only be content weights. So this is about weight initialization.

And all these things that we are trying to learn are in the form of a distribution. So in the case of less data, we are trying to account for the uncertainty; with less data, there would be high uncertainty. As we get more data, the uncertainty will be less and we will be more sure about how much these weights are contributing towards the layout’s conversion rate. Moving on.

Now that we have understood what all these weights are, let’s try to understand how these weights are really learned. We use a message passing algorithm to learn the distributions of these weights. I’ll just give you an idea, an intuition around it; you can definitely go to the help article to see what the maths around it essentially is. So this is essentially a factor graph.

In the first layer, we have all the weights that we defined on the previous slide. What we are doing is combining them together by applying a certain set of equations, and then we get a layout conversion rate distribution, which is this deep node. Here, what we do is combine in the response that a visitor has given to the layout shown to them, whether or not they converted, and then we basically get a distribution which is called a posterior. Then we perform a backward pass and obtain the posteriors for our weights as well.

So through a combination of this forward pass and backward pass, we obtain the posteriors of our weights. Essentially, our learning of the weights happens this way. Now, this algorithm is just one way of learning; there are other options also, like Markov Chain Monte Carlo, but they are computationally intensive. The reason why we are using this algorithm comes down to essentially 3 aspects.

The first one is that, once we have defined the equations, it is pretty straightforward to implement, and that makes it really manageable. The second one is that it is predictable and does not really give us any surprises; it works the way it is meant to work. And the third one is that it is really able to scale well for large amounts of data. That is essentially the essence of using this algorithm.

Now moving on, once we have learned the weights, how are we going to use them to decide which variation to show to a visitor? Here comes something which was built in 1933 by William Thompson for performing clinical trials. The idea behind this is that when we have less data, we lean towards exploring a lot, and as you get more data into the test, you start to exploit whatever behavior you have learned.

Like, whichever variation is leading, you start showing that variation to a larger number of visitors. This gradual shift from exploration to exploitation can be done really well with the Thompson sampling algorithm. Moving on. Now, for any system to really adapt to a changing environment, it always has to strike a balance between this exploration and exploitation strategy.
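As a concrete illustration of that gradual shift, here is a minimal Beta-Bernoulli Thompson sampling simulation. The conversion rates and variation names are made up; this is a textbook sketch, not VWO's production algorithm:

```python
import random

def thompson_choice(stats, rng):
    """Draw one sample from each variation's Beta posterior and serve the
    variation with the highest draw. With little data the posteriors are
    wide, so any variation can win a draw (exploration); as data
    accumulates they narrow and the leader wins most draws (exploitation)."""
    samples = {
        v: rng.betavariate(conversions + 1, visitors - conversions + 1)
        for v, (visitors, conversions) in stats.items()
    }
    return max(samples, key=samples.get)

# Hypothetical true conversion rates for the simulation.
true_rates = {"A": 0.15, "B": 0.05}
stats = {v: [0, 0] for v in true_rates}  # [visitors, conversions] per variation
rng = random.Random(1)
for _ in range(5000):
    v = thompson_choice(stats, rng)
    stats[v][0] += 1
    stats[v][1] += rng.random() < true_rates[v]
print(stats)  # variation "A" should have attracted most of the traffic
```

Running this, the early rounds split traffic roughly evenly, and the allocation drifts towards the stronger variation as evidence accumulates, which is exactly the exploration-to-exploitation shift described above.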

So how does this exploration-exploitation strategy work in our system? We use a basic adaptation of Thompson sampling. Each layout has a certain set of weights associated with it; let’s say that layout 1 and layout 2 have weights w1, w2 and w2, w3, and there is the possibility that weights are common among layouts. What we do is compute a layout score by drawing a sample from the distribution of each of the layout’s weights and summing them up, and then we find out which layout is leading; that is the winning layout for that simulation. This simulation is performed several times to obtain the traffic split that we dedicate towards each layout until the next update of the weights happens. So at a particular time, the traffic split computation happens by performing this simulation: we get the proportion of simulations in which layout 1 is winning and the proportion in which layout 2 is winning, and this is how we get our traffic split. That is how traffic split computation really happens. Moving on.

Now comes the important part, which is how the algorithm works in a non-stationary environment, because there is the possibility that what is leading today may be lagging tomorrow.
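The traffic split computation just described can be sketched as a Monte Carlo simulation. The weight posteriors below are hypothetical Gaussians, and for simplicity the shared weight w2 is drawn independently for each layout, which a faithful implementation would not do:

```python
import random
from collections import Counter

def traffic_split(layout_weights, n_sims=10_000, seed=0):
    """Sample every layout's weights, score each layout by the sum of its
    samples, record the winner, and repeat; the fraction of simulations a
    layout wins becomes its share of traffic until the next weight update."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_sims):
        scores = {
            layout: sum(rng.gauss(mu, sigma) for mu, sigma in weights)
            for layout, weights in layout_weights.items()
        }
        wins[max(scores, key=scores.get)] += 1
    return {layout: wins[layout] / n_sims for layout in layout_weights}

# Hypothetical posteriors: each weight is (mean, standard deviation).
w1, w2, w3 = (0.05, 0.02), (0.03, 0.02), (0.02, 0.02)
split = traffic_split({"layout_1": [w1, w2], "layout_2": [w2, w3]})
print(split)  # layout_1, with the stronger weights, gets the larger share
```

Because the weaker layout still wins some fraction of the simulated draws while its posterior overlaps the leader's, it keeps receiving some traffic, which is the exploration behavior described above.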

What Thompson sampling does by default is that it only explores at the time when the data points are really low. But when the test gets a really large amount of data, it goes almost completely into an exploitation mode, and at that time, if another variation starts leading, a classical Thompson sampling approach won’t be able to figure out that this variation has started leading now. In that case, we very explicitly add an exploration factor on top of the traffic split given by the Thompson sampling algorithm. Currently, we keep a 10% pure exploration factor right from the start, so that if at any point in time things change, like a variation which was lagging earlier is leading now, our algorithm is able to adapt to that non-stationary environment.
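One simple way to realize a fixed exploration floor like the one described is to blend the Thompson split with a uniform split. The formulation below is an illustrative sketch, not VWO's exact code, with the 10% figure taken from the talk:

```python
def apply_exploration_floor(thompson_split, exploration=0.10):
    """Reserve a fixed fraction of traffic for pure exploration so the
    bandit can still notice a written-off variation that starts leading
    later; the remaining 90% follows the Thompson sampling split."""
    n = len(thompson_split)
    return {
        variation: (1 - exploration) * share + exploration / n
        for variation, share in thompson_split.items()
    }

# Even a variation the bandit has written off keeps exploration / n traffic.
split = apply_exploration_floor({"A": 0.98, "B": 0.02, "C": 0.00})
print(split)  # A keeps ~91.5%, B ~5.1%, C ~3.3%; the shares still sum to 1
```

The key property is that no variation's share ever reaches zero, so a late-blooming variation keeps generating data the algorithm can react to.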

 

Ishan:

So I would like to add some intuition to that exploration part, actually. That might help people understand better. Suppose you have 5 variations, and suppose your variations are set up in a way that your customers prefer some variations in the morning and some variations in the evening. If we start the test in the morning, it’ll naturally be preferring a few of those morning variations, and then it’ll be biased towards them. It’ll divert all the traffic to those morning variations.

What this exploration factor lets us do is that when the evening comes and the other variations start to perform better, the exploration factor helps us guide a major amount of traffic to those evening-performing variations. That is the reason we ideally never cut this exploration factor down to 0. The exploration factor remains fixed so that we can detect changes like this. So, yeah.

 

Anshul:

Yes. Absolutely.

 

Ishan:

Yeah.

 

Anshul:

Now, moving on, here comes another part. We have implemented the MAB, and it is dynamically changing the traffic. What happens because of that, by nature, is that it adds a certain bias into the system, because of which we can never perform a statistical analysis. Take the example that was mentioned of morning users: what will happen? The MAB is directing traffic to the winning variation, the one which is leading at a particular time.

There is a possibility that one variation will have a lot more morning visitors, which is why it won’t be an apples-to-apples comparison between the two variations when we measure their performance. MABs inherently create a bias. If you go to any platform, it will say that you can never perform a statistical analysis when an MAB is run. So what we try to do in an MAB report is see how we can reduce this bias so that we can perform a statistical test on at least some of the visitors. We applied a really simple heuristic: because the MAB is changing the traffic split, a bias is being created, so we ensure an equal proportion of traffic between all the variations so that this bias can be reduced.

How we do that: suppose that the multi-armed bandit algorithm has computed a traffic split of 70% and 30%. What we will do is take the minimum of the traffic split, which is 30%, and perform a Bernoulli trial; think of it like tossing a coin where the probability of heads is 30%. So every time a visitor comes and becomes part of the test, we toss a coin whose probability of heads is 30%, and if heads comes up, we mark that visitor.

This will ensure that the number of marked visitors in variation A and variation B is always in equal proportion, equating to about 30% and 30%. Then, when performing the statistical analysis, we use only the marked visitors. So if you look at the statistical report that we show on our platform, all these metrics will remain valid because the bias is not there now; each variation has a roughly equal number of marked visitors. This is more of an add-on feature that you can use because, traditionally, MAB reports are not meant for taking any insights.
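The spoken description is compressed, so here is one concrete reading of the marking heuristic as a sketch: mark each visitor with probability min_split / own_split, which reduces to a plain 30% coin toss for the 30% variation and gives both variations an equal expected number of unbiased samples. The numbers and the per-variation scaling are an illustrative interpretation, not VWO's exact implementation:

```python
import random

def mark_visitors(splits, n_visitors=100_000, seed=7):
    """Assign visitors by the bandit's traffic split, then mark a subset
    with a Bernoulli trial so every variation contributes an equal
    expected number of marked (bias-reduced) samples to the report."""
    rng = random.Random(seed)
    p_min = min(splits.values())              # 0.3 in the 70/30 example
    variations = list(splits)
    totals = {v: 0 for v in variations}
    marked = {v: 0 for v in variations}
    for _ in range(n_visitors):
        # Route the visitor according to the current traffic split.
        r, acc = rng.random(), 0.0
        for v in variations:
            acc += splits[v]
            if r < acc:
                break
        totals[v] += 1
        # Coin toss: heads with probability p_min / split_v, so the
        # expected marked count is n_visitors * p_min for every variation.
        if rng.random() < p_min / splits[v]:
            marked[v] += 1
    return totals, marked

totals, marked = mark_visitors({"A": 0.7, "B": 0.3})
print(totals)  # roughly 70,000 vs 30,000 visitors
print(marked)  # both roughly 30,000: a balanced subset for the report
```

The statistical report would then be computed over the marked subset only, trading sample size for a comparison that is no longer skewed by the bandit's dynamic allocation.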

Your objective should only be that you want to maximize the number of conversions in the time your test is running. This is just an add-on feature so that, at any point in time, with whatever marked visitors you have, you can perform a statistical analysis which will remain valid, and you can use it for your purposes. Ishan, do you want to add anything to this part?

 

Ishan:

So, interestingly, yeah, I would like to add a bit of discussion there about why we have added this statistical analysis heuristic. Essentially, MABs were never meant to attain statistical significance. When you think about those slot machines, you really didn’t care about which machine was actually generating a higher proportion of jackpots. What you really, really cared about was that if you took 20 coins from the counter, you wanted to win the best jackpot that you could in those 20 coins. So, essentially, MABs were never concerned about statistical significance, but we realized that in a lot of A/B testing scenarios, you might actually want both: you might want to save the conversions while the test is going on as well.

And you might want to get statistical significance, even if it is relatively delayed. So we added this heuristic specifically, which lets us calculate statistical significance. There is clearly a trade-off: it won’t be as fast as an A/B test, but it does calculate statistical significance for you. So, yeah.

 

Anshul:

Yeah. So whenever this statistical test is performed, you won’t be drawing any wrong conclusions. Unlike an A/B test, which uses the entire set of visitors, here we perform the statistical test on a subset of visitors, which is why it will be slower if you are looking for statistical significance. Because it uses only a subset, it will take more time. Yeah, that is about it. Moving on to the next slide. Ishan, could you take that?

 

Ishan:

Yeah. So, I would like to discuss briefly our choice of algorithm. You will find a lot of MAB algorithms, and MAB algorithms can be very simple. People who have run tests on VWO before must be aware that we show a chance to beat control and a chance to beat all as well. Essentially, chance to beat all gives you one of the simplest MAB algorithms: we take the chance-to-beat-all values in the test at that point in time and distribute the traffic based on those values.

That would be the simplest MAB algorithm we could create, but we didn’t choose to do that. We created a much more advanced algorithm. We realized that a major use case would be multivariate testing, and our MAB implementation, when you use it with MVTs, is super fast compared to normal MAB algorithms, and I’ll tell you the reason for that. When we create an MVT, like I just told you, with 5 variations of banners, 4 variations of images, and 3 variations of explanations, you effectively get 60 combinations.

What a usual A/B test would do is create 60 buckets of visitors for all those 60 combinations. Similarly, a traditional MAB algorithm would do the same. So if you have, say, 60,000 visitors, it creates 60 buckets out of those: 1,000 visitors in every bucket, and every bucket is learning independently. What our algorithm identifies is that a banner variation, say one called "welcome", is shown in 4 × 3 = 12 different combinations. So if you have 5 banners, 4 images, and 3 explanations, you can see that the one banner "welcome" will be shown within 12 combinations.

So what our algorithm does is learn about that banner "welcome" collectively from those 12 combinations. And that is what makes our algorithm so fast. That is one advantage: our algorithm exploits the structure of the entire setup, because when an element is repeated across different layouts, it is able to learn from the responses of all of those layouts.
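The pooling idea can be made concrete with a toy sketch. This is not VWO's actual algorithm (which, as discussed next, uses matrix factorization), and the variation names are invented; it only illustrates why a per-element counter accumulates data far faster than a per-combination bucket.

```python
from itertools import product
from collections import defaultdict

# Hypothetical MVT setup: 5 banners x 4 images x 3 explanations.
banners = ["welcome", "banner2", "banner3", "banner4", "banner5"]
images = ["img1", "img2", "img3", "img4"]
texts = ["text1", "text2", "text3"]

combos = list(product(banners, images, texts))  # 60 combinations

# A traditional MAB keeps 60 independent buckets. Instead, keep one
# success/trial counter per *element*, and pool every observation of a
# combination into the counters of its three parts.
successes = defaultdict(int)
trials = defaultdict(int)

def record(combo, converted):
    """Credit one visitor's outcome to each element of the combo shown."""
    for element in combo:
        trials[element] += 1
        successes[element] += int(converted)

# The banner "welcome" appears in 4 * 3 = 12 of the 60 combinations,
# so its counters grow roughly 12x faster than any single combination
# bucket would with the same traffic.
```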

 

Ishan:

Right. Right. Essentially. And this technique of learning by defining similarities between different things is called matrix factorization in machine learning.

So when you have an exponential explosion, a combinatorial explosion in the number of values, this matrix factorization technique helps you define that this thing is similar to that one. Essentially, what we said here was that every time "welcome" comes up as a banner, it is one similar thing. That is what helps us reduce the complexity so much. Now, if "welcome" were performing very differently across the 12 combinations, our assumption would break and our MAB algorithm wouldn’t perform as well. But we tested this out, and our assumption is that "welcome" has a specific content weight. You can go back to thinking about the content weights and content-interaction weights. That is where our algorithm actually comes out better. So this was one advantage of our algorithm. Anshul, do you want to add anything?

 

Anshul:

No. No. That is all.

 

Ishan:

That was one advantage here. I had put up a green thumbs-up there, but I don’t know why it is not showing up for some reason. But anyway. There is also one disadvantage to the algorithm. If you remember, we were talking about the 10% exploration heuristic, right, the one that lets us detect in the evening that the tides have turned and a different variation is now performing better. The major disadvantage is what happens when your best-performing variation simply remains the best-performing variation.

Then, due to that 10% heuristic, we’ll still be diverting 10% of traffic to suboptimal variations. So you’ll be able to max out at only 90% of that best-performing variation’s potential. That was a difficult choice we had to make, but we realized that, considering all use cases, it is better to keep that 10% heuristic. In general, it will perform better.

Because patterns in the background often change over time. So, yeah, these were some pros and cons of our algorithm. Yeah. Anshul, do you want to add anything there? No?
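The 90%/10% trade-off described above can be sketched in a few lines. This is an illustrative sketch, not VWO's implementation: the function name is invented, and spreading the exploration budget evenly across the non-best arms is an assumption.

```python
def allocate_traffic(best_arm, arms, epsilon=0.10):
    """Give the current best arm 1 - epsilon of traffic and spread the
    remaining epsilon over the other arms. The reserve lets the
    algorithm notice when conversion patterns shift, at the cost of
    capping the best arm at 90% of traffic."""
    others = [a for a in arms if a != best_arm]
    if not others:  # single arm: nothing to explore
        return {best_arm: 1.0}
    split = {a: epsilon / len(others) for a in others}
    split[best_arm] = 1.0 - epsilon
    return split
```

So with three variations where "B" currently leads, "B" receives 90% of visitors and "A" and "C" each keep 5%, enough to reveal a reversal in performance without giving up much of the winner's traffic.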

 

Anshul:

No, nothing as such. This is the one part, which I already covered, that really helped us scale. There are other approaches you can use in a similar fashion to exploit the structure, like Markov chain Monte Carlo and variational methods. It’s just that they are computationally expensive. Here, because only analytical equations are involved,

you are able to perform learning really fast, and in a deterministic fashion. What we have seen so far is that there are no surprises coming out of it. Markov chain Monte Carlo methods, by contrast, are known for things like wandering into regions where solutions do not exist, at which point the algorithm completely breaks. As I said, we have seen no such surprises with this; it is working fine.

 

Ishan:

Yeah. And Anshul has extensively run simulations on a lot of use cases. We have gauged and tested the performance of our MAB algorithm there. Currently, it is in early access, and people who are interested can reach out to us. But we’ll be releasing it in full force.

So, yeah, we can move to the next slide, I guess, before we close. Before we wrap up, we would like to revisit in detail the benefits and trade-offs of A/B tests and multi-armed bandits so that you can make an informed decision on which campaign you want to run. If we talk about total conversions in the test period specifically, you’ll naturally have a higher number of conversions with multi-armed bandits, because you are diverting more traffic to the winning variation. With A/B tests, you’ll end up with a lower number of conversions. And if you talk about use cases with many variations, then A/B tests will be really, really slow. With many variations, a multivariate test will take a lot of time simply because there are so many buckets to distribute traffic into.

Whereas multi-armed bandits will be very quick to discard all the suboptimal ones, pick out the 4 or 5 best-competing variations, and practically run an A/B test on those 4 or 5 best-performing variations. So multi-armed bandits give you rapid convergence in those cases. Then, A/B tests require manual decision-making in the sense that you’ll have to monitor the metrics and eventually decide which variation to deploy. Whereas multi-armed bandits are not experimentation tools.

They are optimization tools, and you need to think about this very carefully. With multi-armed bandits, you are giving the algorithm the capability to divert your traffic based on the results. Based on the learning the algorithm has, you don’t need to make any intermediate decisions. It is a set-and-forget type of test: you deploy all the variations, let the algorithm handle it, and get the best conversions possible. Finally, the future stability of insights.

If you have to deploy one of those variations for the indefinite future, then A/B tests are better, because they test all the variations equally and then make an informed decision. Statistical significance is obviously better in A/B tests, and hence the future stability of those results will hold. With MABs, that future stability won’t be there, because naturally they are not made for that purpose. They are made for a fixed-time campaign: you optimize the conversions, get the highest conversions, and then abandon the test, sort of a thing.

And the time to statistical significance with A/B tests would be lower. Naturally, you are sending an equal traffic split and spending more conversions, accepting more conversion loss in a way, so you’ll reach statistical significance quickly. With multi-armed bandits (I’m sorry, there is an error on this slide) the time to statistical significance is actually higher. And multi-armed bandits take longer to detect changing conversion rates. This is the case with the morning and evening variations: if the patterns suddenly change, the tides turn in the background, and the badly performing variation starts to perform better, then the A/B test will be able to detect it earlier, although it won’t do anything about it. Let me be very clear: it will detect it very quickly, because it has an equal traffic split.

So it will quickly detect that this variation is now performing better, but it will still divert 50% of traffic to both variations. Whereas with multi-armed bandits, the variation that was performing better earlier now has a lot of traffic, and the variation that was performing worse has very little traffic. So if that poorly performing variation starts to perform better, it will take a lot of time for it to catch up. It will effectively have to wait, saying "give me more visitors, give me more visitors, so I can show you that I can perform better", because it won’t be getting many visitors.

So this detection of changing conversion rates lags in multi-armed bandits. But whatever changes they do detect, they will actively do something about, saving you conversions. So, yeah, these are largely the trade-offs and benefits. Anshul, do you have anything to add there?

 

Anshul:

No, I think that pretty much sums it up. Yeah, this was pretty much it. Do we have any questions?

 

Ishan:

And, yeah, to conclude, we are in early access right now, and we’ll be releasing it for full access in a while. So anyone who is interested in MABs can contact us. So yeah, any questions?

 

Shanaz:

Alright, Ishan and Anshul, that was really insightful. Thank you for that presentation. I’m sure the audience found it super, super insightful, especially the data and stats that you highlighted.

So thank you for that. We do have quite a few questions, but given the time, I’ll pick up some of them and see how many we can answer. For the questions we don’t pick up, please feel free to connect with us on LinkedIn and send in your questions there. Starting off, we have a question from Saurabh, who asks: for MABs, what is the recommended size of data to ensure meaningful tests?

 

Ishan:

Well, I’m assuming that by data you mean the number of visitors and not the number of variations. But, personally... Anshul, what do you feel?

 

Anshul:

I think, yeah, this is exactly the benefit of MABs: there is no minimum number of visitors a test should meet, unlike an A/B test, to draw meaningful insight from it. You can just start. The way it works is that if you are implementing Thompson sampling, in the beginning it distributes traffic equally by the nature of the setup. And as your test gains more visitors, whichever variation is performing well, it starts, by design, diverting traffic to that winning variation. So I don’t think any minimum number of visitors is required as such.

 

Ishan:

Yeah. Because we are not chasing statistical significance, this works even for a very small number of visitors. Even if you have only 10 visitors, you can run an MAB and it will try to maximize your conversions. Even while it is learning slowly, it will be trying to save you some conversions over an A/B test. So, yeah.
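The Thompson sampling behavior Anshul describes, equal traffic with no data, then a gradual shift toward the winner, can be seen in a minimal Beta-Bernoulli sketch. This is a generic illustration, not VWO's implementation; the class name and true conversion rates below are invented for the example.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling. With no data, every arm's
    posterior is Beta(1, 1) (uniform), so traffic starts out roughly
    equal; as conversions accumulate, the better arm wins more of the
    posterior draws and therefore more of the traffic."""

    def __init__(self, arms):
        self.alpha = {a: 1 for a in arms}  # prior + observed successes
        self.beta = {a: 1 for a in arms}   # prior + observed failures

    def choose(self):
        # Sample each arm's posterior; show the visitor the arm whose
        # sampled conversion rate is highest.
        samples = {a: random.betavariate(self.alpha[a], self.beta[a])
                   for a in self.alpha}
        return max(samples, key=samples.get)

    def update(self, arm, converted):
        if converted:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```

Note that no minimum sample size appears anywhere: with 10 visitors the posteriors are simply wide, and the allocation stays close to equal until the data justifies otherwise.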

 

Shanaz:

 

Alright, I hope that answers your question, Saurabh. And apologies to any attendees whose names I’m not pronouncing correctly. We will be sharing the recording as well as the slides over email in the next 24 to 48 hours, and it will also be posted on VWO.com.

So you can go and check it out there if you missed anything from the webinar, in case you faced connectivity problems or anything like that. The next question is from Ayala Josephis. Ayala asks: how do you see MABs being applied to a B2B SaaS lead-gen campaign that takes a very long time, sometimes months, to reach statistical significance?

 

Ishan:

Okay. How do you see MABs being applied to a lead-generation campaign, right?

 

Shanaz:

Yeah.

 

Ishan:

So, okay. I’ll make whatever assumptions I can, and maybe Anshul can add to it later. What I personally feel is that this fits when you have multiple sources of lead generation and you want to divert your budget to the best ones. Suppose, for the sake of simplicity, you have $100 and multiple sources of lead generation that you want to optimize.

You want to optimize what you can do with that budget. That is exactly an MAB use case. You can run an MAB over those different sources, set a goal on the leads being converted, and let the MAB algorithm handle that budget. It will automatically test which sources and routes of lead generation are performing best and then divert more budget towards those lead-generating sources.

So, yeah, that would be a very valid MAB use case. Anshul, do you have anything to add?

 

Anshul:

Yeah. At a very broad level, I see this problem from the perspective that you have, say, two strategies for generating leads. If you eventually want to decide which strategy is more suitable in the really long run, and your future decisions are going to depend on that strategy, then an A/B test actually suits your use case. But if your objective is, no matter which strategy wins, "I want to generate as many leads as possible in this particular time frame", then you should go for an MAB. At a broad level, that is how I would decide which approach suits my needs.

 

Ishan:

Yeah, sorry, please go ahead. I personally feel that one way an MAB might be useful here is if you really don’t care about statistical significance in that use case.

It’s a set-it-and-forget-it type of thing: you have multiple sources, and maybe you can plug in different sources even later as they come along, and it will just optimize the budget across the best of those lead-generating sources. So, yeah.

 

Shanaz:

Alright. I hope that answers your question, Ayala. If not, please feel free to reach out to Ishan or Anshul on LinkedIn, and they’ll be more than happy to delve deeper into it and help you understand it. So I think this will be the last question we pick up today. This question is from another attendee: how scalable are MABs compared to A/B testing or other ML-based methods? What are the average training/inference times for a sample recommendation?

 

Ishan:

Interesting. Can you repeat the question once again, Shanaz? Sorry.

 

Shanaz:

So I’ll repeat the first part first, and then we can move on to the second part of the question. The first part is: how scalable are MABs compared to A/B testing or other ML-based methods?

 

Ishan:

Interesting. So, it depends on what we mean by scalability. If by scalability we mean handling more variations, then yes, MABs are more scalable than A/B tests; that should not be an issue. If by scalability we mean whether the winners that come out of the campaign can be deployed to a larger audience, then I think MABs are not scalable. Like we said, the insights that you get out of MABs are not future-proof.

So you cannot take the insights that come out of an MAB, suppose you run an MAB on a small subset of your visitors, and then scale the result to a broader audience. In that sense, MABs are not scalable, I think. Anshul, what do you think?

 

Anshul:

I would say these are actually different classes of problems, so comparing the computational costs of the two directly doesn’t quite make sense; it is very use-case specific, and the algorithmic complexity differs altogether. It would make more sense to compare two algorithms that are serving the same use case.

 

Ishan:

Yeah. I think from a time-efficiency standpoint, scalability is not an issue. A/B tests and MABs are equally time-efficient. So, yeah. And what is the second part?

 

Shanaz:

Yeah, yeah. The second part is: what are the average training/inference times, for example, for a recommendation? Interesting. What do they mean by a sample recommendation?

 

Anshul:

So I think I kind of understand where he’s coming from. Maybe he has in mind a frequentist approach, like a traditional ML approach where you expect to have a large amount of data, and only then does it make sense to apply ML. But because we are using a Bayesian strategy, we are incorporating the uncertainty about the system. When you have less data, you get a wide distribution reflecting how unsure you are, so whatever result you get will always come with an uncertainty range. In terms of sample size, there is no minimum requirement, because by applying Bayesian methods you are incorporating the knowledge that you have few samples. Things work out in this Bayesian world.

 

Shanaz:

That wraps up today’s webinar. Thank you so much. We’ll not be taking any more questions due to the lack of time, but Ishan and Anshul will be happy to answer your questions on MABs or anything conversion-rate-optimization-related on LinkedIn, so feel free to reach out. Thank you so much for attending and for always being part of VWO webinars. We try to give out as much insightful information as we can, based on our everyday practice and on experts from all across the world.

So thank you so much for attending, the webinar, and thank you so much, Ishan and Anshul, for doing this with us today.

 

Ishan:

Thank you for organizing this. We are, like, more than thankful, actually.

 

Shanaz:

Thank you so much.
