
A/B Tests: Two Important Uncommon Topics: The OEC and Trust

In this live workshop, Ronny will refresh your understanding of OEC and importance of trust in experimentation.

Transcript


[00:00:00] Vipul Bansal: Hello everyone. Welcome to ConvEx ’23, VWO’s annual virtual summit for growth and experimentation professionals. My name is Vipul and I’m a Senior Marketing Manager at VWO. Thousands of brands across the globe use VWO as an experimentation platform to run A/B tests on their websites, apps, and products. 

[00:00:28] I’m really, really excited to say that I’m going to host the first workshop of ConvEx 2023, by Ronny Kohavi.

[00:00:37] Of course, Ronny does not need any introduction. So, Ronny, I invite you on the stage. 

[00:00:44] Ronny Kohavi: Thank you very much. 

[00:00:46] Vipul Bansal: Hi there. How are you? 

[00:00:48] Ronny Kohavi: Excellent. Glad to be here. 

[00:00:50] Vipul Bansal: I learned that it’s morning for you. Did you have a good sleep yesterday? 

[00:00:54] Ronny Kohavi: I did. It’s 8 AM at this point. So I know we did a dry run at 11:30 PM my time. So yes, it was kind of fun. 

[00:01:05] Vipul Bansal: That’s the reason I asked, because this workshop will definitely require a lot of energy, and our audience is definitely going to love it.

[00:01:15] So I am curious, and of course our audience would be curious to know as well: can you quickly give a very brief overview of what you are going to share with our audience today?

[00:01:27] Ronny Kohavi: Well, I have an agenda slide if you want to go there, but basically we’re going to talk about two important uncommon topics, things that I think people don’t invest enough effort and time in, and these are trust and the OEC, the Overall Evaluation Criterion.

[00:01:44] Vipul Bansal: Awesome. Awesome. So without any further delay, I’ll just jump off the stage and hand over the mic to you. Here you go, Ronny and yeah, go ahead and you can share the screen now. 

[00:01:55] Ronny Kohavi: Thank you so much. 

[00:01:57] Vipul Bansal: Yes, the screen is now visible.

[00:01:58] Ronny Kohavi: And I just want to say for folks, we are going to be using Slido for both Q and A and for the polls. 

[00:02:07] There are fun polls. So if you’ve got a phone, scan the QR code, or make sure you have another window open, go to slido.com and enter ConvExRonnyK. The case doesn’t matter.

[00:02:24] It could be all lowercase or uppercase. I will show this again when we get to the polls, but I’m going to leave this up for 20 seconds

[00:02:33] so people can get ready. Put your second window up on this. You’re welcome to start asking questions as I start presenting the topics, and I will take some of them live.

[00:02:48] I think most of them, I’ll probably answer offline in writing after the presentation but you will get answers to anything that was voted high. With that, let’s get going. 

[00:02:59] I’m gonna just fly through this slide. I worked at Amazon. I worked at Microsoft. I worked at Airbnb. I wrote this book with a hippo and I’m now consulting and teaching. 

[00:03:10] So no need to go deep into this background. The agenda for today, really simple. Two uncommon topics that deserve more attention. 

[00:03:22] Most of the time will be spent on trustworthiness. How can we improve the trust in A/B tests and then there’s going to be probably around 15 minutes on the OEC.

[00:03:33] What should you be optimizing for? And again, these are huge topics that I teach longer in the class but I hope to give you a good taste of it during the presentation. 

[00:03:44] Okay. Again, we’re using Slido. Notice that Slido has two tabs, one for Q and A and one for polls. So when we get to the polls, make sure to switch to that polls tab so you can participate and have fun in this presentation. And we are going to start with a poll.

[00:04:06] Okay, so here’s the question. This comes from TestDome. It’s a website that does pre-employment screening to measure skills, and they give this A/B test on their website.

[00:04:17] So you have an original, or a control, with 4,300 views and 436 conversions. You have a proposal A, or treatment A, and a treatment B. And we can assume, they don’t state this but I’ll make it clear:

[00:04:30] Randomization is done by page views and each one of these variants is assigned one third of the traffic. Okay, assume we’re trying to optimize for conversion rate. Here’s the Slido link if you haven’t gotten it from the initial slide.

[00:04:48] Again, go to slido.com, type ConvExRonnyK or scan this QR code and tell us which of the three things you would answer on this TestDome site to get employed. 

[00:05:04] Okay. So we had 16% saying you should stop, 49% saying you should continue the test, and 35% saying there is no way to tell if you should continue the test.

[00:05:17] So the correct answer according to TestDome is that you should continue. That’s the 49% who said you should continue, but it is wrong. It is wrong. You should abort and you should start debugging the issue, because the results are not trustworthy. You have what is called a sample ratio mismatch.

[00:05:46] Okay, so this company is not aware that the results not only need to come out, they need to be trustworthy. So first, I’m going to take a digression and review what an SRM, or Sample Ratio Mismatch, is.

[00:05:59] If you are assigning equal percentages to the control and treatment or in this case to the three variants, you should have approximately the same number of users in each of the variants.

[00:06:13] So here’s a real example. You have control with ~820,000 users. You have treatment with ~815,000 users. 

[00:06:21] The ratio needs to be 50/50, or close to that, but you’ve got 50.2%. And the question is, well, of course we’re not going to get exactly the same number. Is 50.2% reasonable or should we be worried? And the answer is: absolutely you should be worried.

[00:06:40] The p-value, or the probability that this split or a more extreme one happens by chance, is one in 500,000. Okay, so think about it: you would have to run about 500,000 of these experiments before you get something as extreme as 50.2% by chance with numbers this large.
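Here is a minimal sketch of that SRM check in Python (assuming scipy is available; the helper name and the rounded counts are just for illustration):

```python
# Chi-squared goodness-of-fit test of the observed split against the planned ratio.
from scipy.stats import chisquare

def srm_p_value(control_users, treatment_users, planned_ratio=0.5):
    """p-value for a split at least this far from the planned ratio by chance."""
    total = control_users + treatment_users
    expected = [total * planned_ratio, total * (1 - planned_ratio)]
    return chisquare([control_users, treatment_users], f_exp=expected).pvalue

# Rounded counts from the example above; the exact p-value depends on the exact counts.
print(srm_p_value(820_000, 815_000))  # tiny p-value -> flag an SRM and start debugging
```

A common practice, mentioned just below, is to flag an SRM only when this p-value falls under a strict threshold such as 0.001.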

[00:07:01] I’m giving you some references here, and I’ll make the slides available after the presentation if you want to go deeper. I have a little spreadsheet. Lukas has a plugin for Chrome that you can use. But just some surprising statistics: 6% of experiments at Microsoft have sample ratio mismatches, and for teams that start experimenting, it’s very common to see 10-15% of experiments with a sample ratio mismatch. And when I say a sample ratio mismatch, I mean a strict one, with an alpha threshold of 0.001.

[00:07:35] I’m not talking about 0.05, where you might get 5% flagged by chance. At 0.001, only 0.1% would show an SRM by chance, but the other 5.9% have a true SRM that we need to debug and find.

[00:07:50] Okay, so if you’re interested, there’s a post. I think if you have an experimentation platform and you don’t have an SRM check, it’s like driving a car without a seatbelt.

[00:07:57] You need to get that trust. There’s a fun Halloween pumpkin for those of you who are US-centric. You’re supposed to carve something scary on the pumpkin, and somebody did this: ignoring SRM, the scariest thing you can do.

[00:08:13] So here’s a real example. People say to me, it’s only 50.2%, how bad can it be? Right?

[00:08:22] So what if we don’t do what you’re saying and we look at the data this way? And my answer is: look at this result. Very strong result.

[00:08:35] These are key metrics that we’re measuring for an experiment, right? And you can see here the p-values: this one’s got ten zeros, this one’s got nine zeros, this guy’s got four zeros, this one’s got three zeros. So very, very statistically significant results, amazingly statistically significant. The problem is that the SRM check shows that the ratio is 0.497. It should have been 0.5, but it’s 0.497.

[00:09:08] Now, could this be happening by chance? The answer is that it’s unlikely to be happening by chance, because the p-value is this small number.

[00:09:22] So, what could be causing this? How off could these strong p-values be if I were to correct the small imbalance?

[00:09:32] So, in this case, the reason was a bot. We excluded the bot and got a balanced set of users. Okay, once we did that, we re-ran the analysis.

[00:09:45] Nothing was stat-sig. I just want to make sure you understand this, that you internalize this: from this small correction of removing the bot that caused the SRM, all these extreme p-values disappeared. Nothing was stat-sig.

[00:10:02] Somebody wrote in the Q and A: does stopping the test mean that this is okay, or that it is BS?

[00:10:06] Yes, it is not trustworthy. You stop the test because you have to start debugging. Don’t run a test which has an SRM, because there may be some hidden problem, you’re not getting valid results, and maybe you’re degrading the experience.

[00:10:21] Maybe the metrics are really bad but they look good to you as they did here. So absolutely you stop the test when you have a strong SRM and you try to understand what is going on. 

[00:10:34] Okay. So back to our example from TestDome. What’s the SRM here? So the SRM p-value is just below 0.001 for A, and then B has this dramatically low p-value of 2e-11, which means you’ve got 10 zeros before the digit two.

[00:10:53] So there is no doubt, no matter what your prior was, that there is a strong sample ratio mismatch in this example. So even if we’re doing some correction for the fact that we may be running this SRM test a few times, there’s no doubt with a number as small as this.

[00:11:11] It means the experiment is invalid. So if you guys are not doing SRM tests in your platform, first thing you should do after this talk, go add it to your roadmap and then check historical experiments. 

[00:11:25] You’re likely to invalidate probably 6 to 10% of them and say, hey, all these experiments that we thought were great are actually invalid. And when you have an SRM, because the cause is usually some form of extreme traffic, like a bot or users with zero clicks (there’s a whole paper on the causes of SRMs), the p-values tend to be very low, and you either get an amazingly successful-looking result or something that looks really, really bad.

[00:11:55] Okay. Let’s go to a second fun poll. So, replication is the hallmark of science. You know, the ability to replicate a result is critically important. You ran a well-designed experiment. It’s got 120,000 users in each variant. Well powered, amazingly successful, and you got a statistically significant p-value of 0.005.

[00:12:21] So much lower than the industry standard 0.05. You’re about to start celebrating but you know, maybe we should replicate it. 

[00:12:32] Okay. Now, given the strong result, we got a P value of 0.005. You’re thinking to yourself, well, is it okay to run the experiment for a shorter time? So I’m going to replicate this result with only half the users.

[00:12:45] It will therefore finish in half the time and we’ll be able to see if this result is reliable or not. So I just want to get a sense from you. If you were to run the experiment with half the users given that it was such a strong result, what is the probability that the replication run will be StatSig at 0.05? 

[00:13:07] I’m not expecting 0.005 again, but what is the probability that I will replicate with anything stat-sig? So let’s give people 30 seconds to vote here.

[00:13:20] Come on folks. Have fun. What is your intuition here? We got 90 people voted. I’ll wait a little bit more to get to a hundred.

[00:13:28] Okay. So let me share with you how people responded: 25% thought, hey, this would replicate at 80 to 100%. 22% said this will replicate 60 to 80% of the time.

[00:13:46] So, what do we have? 47% of people think this is going to replicate above 60%, and 53% of people think that this will be 40 to 60%. So, the correct answer is 51%.

[00:13:58] If the first experiment was well designed, and assuming the true effect is what we found, so it’s not exaggerated, then the power of the second test is 51%, and I’ll talk to you about power in a minute.
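As a rough sketch of where that 51% comes from (my own back-of-the-envelope in Python, under the stated assumption that the observed effect is the true effect; this is not Ronny's spreadsheet):

```python
# A two-sided p-value of 0.005 corresponds to a z-statistic of about 2.81.
# Halving the sample size scales the expected z by 1/sqrt(2); the replication's
# power is the chance its |z| still clears the usual 0.05 threshold.
from math import sqrt
from scipy.stats import norm

z_observed = norm.isf(0.005 / 2)          # ~2.81
z_replication = z_observed / sqrt(2)      # expected z with half the users
z_crit = norm.isf(0.05 / 2)               # ~1.96
power = norm.sf(z_crit - z_replication) + norm.cdf(-z_crit - z_replication)
print(round(power, 2))                    # ~0.51
```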

[00:14:12] If you’re interested in the actual spreadsheet where I computed this number, it’s there. There’s also a beautiful paper by Nobel Prize winner Kahneman and by Tversky, where they ask a similar question in their paper, Belief in the Law of Small Numbers, and they do it two ways.

[00:14:30] One of them is what I did. The other one is using a more sophisticated Bayesian framework to say, you’re assuming that this result is the true effect. It could be anywhere.

[00:14:39] They get a very similar number. So, very surprising to many people. A lot of people think that this result should replicate with higher probability. 

[00:14:48] I think it is very interesting to realize how weak the replication probability is, and this is something that caused the replication crisis in psychology, because they were running a lot of experiments that had low power.

[00:15:01] So we’ll get to this: what is statistical power, right? I love this quote by Jacob Cohen. He’s famous for creating the Cohen’s d metric: “When I finally stumbled onto power analysis, it was as if I had died and gone to heaven.” So, he thinks very highly of this. I do too. So what is power? This is the technical definition.

[00:15:24] It’s the probability of detecting a meaningful difference between your variants when there really is one, right? So, you reject the null when there is some true difference of some delta. 

[00:15:38] Remember, most of the time we’re worried about what’s called type one error, assuming that there is no difference.

[00:15:45] What percentage of the time would we say that something is stat-sig when there is no true difference? That is called type one error. Power is meant to address the opposite.

[00:15:54] If there is a true difference of at least some delta, this minimum detectable effect, then you can plug in a power formula that tells you how many users you need to detect it with a certain level of power, which is normally set to 80%.

[00:16:13] So, there’s an example here. I’m not gonna go through the formula in detail in this talk because it’s brief, but assume you have a website and you’re trying to improve your conversion rate.

[00:16:24] It’s currently at 5%. You specify an MDE: you want to detect relative differences of 5%. So 5% of this 5% means 0.25 percentage points. If you plug that into the power formula, you need about 121,000 users.

[00:16:39] People are surprised. I’ve heard people say, wait a minute, didn’t I just hear this vaccine was tested on 2,000 people? Yes. If the vaccine has to be 50% effective, then they can plug in the formula and the number will be 100 times smaller or it will tell them that they need only 1,200 users in each variant.

[00:17:01] Given that we are trying to detect a small difference, we need to push a lot of data for that normal distribution to be peaked and that’s why we’re going to need 121,000 users in each variant in this very realistic scenario. 
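A small sketch of that calculation (assuming statsmodels; the variable names are mine):

```python
# Users per variant to detect a 5% relative lift on a 5% baseline conversion rate
# at alpha = 0.05 with 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
mde_relative = 0.05
effect = proportion_effectsize(baseline * (1 + mde_relative), baseline)  # Cohen's h
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(round(n_per_variant))  # roughly 120,000 users per variant, the ballpark in the talk
```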

[00:17:18] So let’s look at this fun poll. You are running a retail website that sells widgets. Your conversion rate is 5%. You got some CRO consultant. They’re very famous. They tell you, hey, you hire us. We’re going to 2x your conversion rate. 

[00:17:35] You see this statement a lot from a lot of agencies. Now they run several tests; in one of them, on 14,000 people, they changed the copy on some page and, wow, it’s a huge winner.

[00:17:48] A 20% improvement to conversion rate. Okay, they didn’t 2x it, but hey, a 20% improvement. That means your revenue is just up 20%. P-value below 0.05, so stat-sig. And the consultant says, hey, just wait a minute, I’ve got a few of these in my back pocket; I’ll be able to do more. So now, what do you think of this experiment? One: it is underpowered, so you can’t trust the 20%.

[00:18:18] Two: it was underpowered, but hey, once you get stat-sig, you get stat-sig; it may not have been able to detect the effect, but once you detect it, it’s a go. Or three: the experiment on 14,000 is so much larger than the COVID vaccine test, so it is a very, very impressive result.

[00:18:36] So, we have 61% who voted for A, it’s hard to trust the 20%; then 35% voted for B; and I think I’ve sort of given away in the prior slide that C is not correct, and only 4% of people voted for that.

[00:18:53] Those were asleep at the wheel. So now, if you look at typical lifts for changing copy at a good site like goodui.org, you’ll see that it’s about 3 to 6%. So I’m going to plug that into the power formula: I’m looking for a 5% change, and as I told you before, I need 121,000 users.

[00:19:16] So the experiment was way underpowered. I need about 121,000 users; I had 14,000 in total, 7,000 in each variant. So you can’t trust the 20%.

[00:19:27] Okay. What happens when you run an underpowered experiment? I want to make sure you understand what happens. Two things happen here.

[00:19:40] One is, most of the time the results will not be Stat Sig. I’ll show you some graphs in a minute. When you do get a Stat Sig result, as we got in this lucky example, you hit something called the winner’s curse.

[00:19:56] Your results will very likely be exaggerated. The exaggeration factor for 10% power (and again, you’ll have to believe me that I plugged it into the formulas) is a factor of four.

[00:20:17] So if I do get a statistically significant result with an underpowered experiment, it will tell me that I have a huge winner, but it’s wrong.

[00:20:28] That 20% should really be about five. So I’m overestimating the treatment effect here. I would re-run the experiment with higher power. 

[00:20:40] Do not trust experiments with low power. I want to take you through three real-world examples of published results. Okay. So GuessTheTest, the fun site, does an A/B test every couple of weekends, and they shared an example, but it had 80 users, and that specific example claimed a 337% improvement. Now, we did the math: it had only 3% power.

[00:21:06] It had only 3% power to even detect a 10% delta, so I’m not even asking for a minimum detectable effect of 5%, which I recommend you normally use. If you do the math using Bayes’ rule, it is 63% likely to be a false positive. When you get a stat-sig result here, it is most likely to be wrong, and of course, it’s highly likely to exaggerate.

[00:21:32] The 337% result is not trustworthy.
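Here is a hedged sketch of that Bayes-rule arithmetic; the 50/50 prior on whether there is a real effect is my assumption to reproduce the ballpark figure, not a number stated in the talk:

```python
# P(no real effect | stat-sig) = alpha * P(null) / (alpha * P(null) + power * P(real effect))
alpha = 0.05      # significance threshold
power = 0.03      # ~3% power, as computed for the GuessTheTest example
p_null = 0.5      # assumed prior: half of such ideas have no real effect
p_false_positive = alpha * p_null / (alpha * p_null + power * (1 - p_null))
print(round(p_false_positive, 2))  # ~0.63
```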

[00:21:37] Okay, do not trust experiments with low power. I will try to drill it into you in this presentation. 

[00:21:42] Second example, this one based on a talk: Optimizely ran an experiment, reported with 22,000 users in each variant.

[00:21:50] Big winner, a lift of 75% to the click-through rate. This one had a very low base rate of 0.91%. When you have a low base rate, it’s actually harder; you need more users.

[00:22:04] So if you look at similar experiments, they had about a 3-6% lift. So I plugged in a 5% MDE, and the power is only 7.3%, so much lower than 80%. You can’t trust this result.
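A sketch of how that achieved-power number can be reproduced (assuming statsmodels; conventions may differ slightly from the spreadsheet behind the slide):

```python
# Achieved power for the Optimizely example: 22,000 users per variant,
# 0.91% base rate, 5% relative MDE, two-sided alpha of 0.05.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

base, mde_rel, n = 0.0091, 0.05, 22_000
effect = proportion_effectsize(base * (1 + mde_rel), base)
power = NormalIndPower().power(effect_size=effect, nobs1=n, alpha=0.05)
print(round(power, 3))  # ~0.07-0.08, nowhere near the desired 0.80
```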

[00:22:18] Do not trust experiments with low power. Now, I criticized Optimizely; here’s one from our sponsor, VWO. They published this on how to offer coupon codes with promo codes, and here’s an experiment: big winner, a 25% lift on the click-through rate, but most people report about 3%.

[00:22:40] Now, if you assume a 3% MDE, the power for this experiment was only 16%, again far from the desired 80% power. Do not trust experiments with low power. Okay, so to summarize, there are two problems with low power.

[00:22:58] Okay, and I’m going to use the number 10% to make this real for you. One: you are unlikely to get a stat-sig result. If you have 10% power for a real effect, you will only detect it 10% of the time.

[00:23:13] So there’s a 90% chance that you will miss it. And remember, of this 10% power, 5% is the alpha we’re running the experiment at.

[00:23:24] This is our classical type one error: setting 0.05 means that if there is no difference, zero difference, 5% of the time you will get a stat-sig result.

[00:23:37] Okay, so with 10% power, all we’ve done is raise it from 5% to 10%. Okay, and the second thing, and this is what I again want to drill into:

[00:23:46] Some people say I’m running this with low power. Maybe I won’t get it but maybe I’ll get lucky and it will be Stat Sig. This is the winner’s curse. If you get a statistically significant result, the treatment effect will be highly exaggerated. 

[00:24:02] There’s a great paper by Gelman and Carlin, if you haven’t looked at it; they have this graph that you can see. So if you’re running at 80% power, you’re pretty good, but as the power diminishes, this goes high, high up.

[00:24:16] So at 10% power, this 0.1 here, the exaggeration factor is a factor of four, right? And if you’re running some experiments at less than that, you could be looking at a factor of 5, 6, 7, 8, 9, 10; this goes to infinity, obviously.

[00:24:35] Okay. Here’s another way to share these results, and again, this is slightly more complicated; some of you may miss it, but here’s the idea. I’m running an experiment with 10% power, and the actual lift (this is a simulation, so we know the true lift) is that the treatment is better than control by 5%.

[00:24:56] This is the distribution of p-values that I will get from doing 10,000 runs in this simulation. What does that mean?

[00:25:05] That means when you run an experiment, you’re most likely going to get a p-value up here, above 0.05. So you might get a p-value of 0.9 or 0.5 or 0.2, but the probability that you’re going to get below 0.05, a stat-sig result, is about 10%.

[00:25:25] In this simulation, I got 9.8%. Now look at the other numbers here. The average lift for a statistically significant result is 19.3% (remember, the true number is 5%).

[00:25:39] So that means when I run it, I’m going to have an average exaggeration of about a factor of four. The minimum, and this is surprising to many people, the minimum lift that will ever be reported on a stat-sig result is about 13%.

[00:25:53] That’s the 12.8 here. Remember, the actual lift is 5%. I will never even be close to 5%, because to get a stat-sig result, I’m going to have to get lucky (winner’s curse) and get a higher lift so that it will be declared stat-sig.

[00:26:08] Okay, and sometimes I can get a lift as high as 40%, so eight times the true lift. Okay, and some of the time, this 3.6% of the time, when I get a stat-sig result, I’m gonna get the sign wrong.

[00:26:21] This was a positive lift, but I will get a negative lift. Or conversely, if the experiment is hurting, I might get a positive stat-sig lift 3.6% of the time when I have a stat-sig result.

[00:26:35] So very, very interesting. Make sure you understand this chart; spend a little time on it to see what’s going on.
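If you want to play with this yourself, here is a small simulation in the same spirit. It is my own sketch: the 5% baseline conversion rate and the 7,000 users per variant (which give roughly 10% power) are assumptions, so the outputs will only roughly track the numbers on the slide.

```python
# Simulate an underpowered A/B test: true relative lift 5%, roughly 10% power.
# Among the stat-sig runs, look at how exaggerated the estimated lift is.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
runs, n = 10_000, 7_000                 # ~7,000 users per variant
p_c, p_t = 0.05, 0.05 * 1.05            # control and treatment conversion rates

c = rng.binomial(n, p_c, runs) / n      # observed control conversion rates
t = rng.binomial(n, p_t, runs) / n      # observed treatment conversion rates
pooled = (c + t) / 2
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
p_values = 2 * norm.sf(np.abs(t - c) / se)
lifts = (t - c) / p_c                   # estimated relative lift

sig = p_values < 0.05
print("fraction stat-sig:", sig.mean())                       # ~0.1
print("mean stat-sig lift:", lifts[sig].mean())               # several times the true 0.05
print("min |stat-sig lift|:", np.abs(lifts[sig]).min())       # well above the true 0.05
print("wrong sign among stat-sig:", (lifts[sig] < 0).mean())  # a few percent
```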

[00:26:46] I’ve got five votes on this Q and A, “I am confused about what MDE exactly represents. Can you give a brief example of what it actually represents?” Yes. The MDE is the Minimum Detectable Effect that you want to detect. 

[00:27:03] So let me go to this definition of power. Okay, what is power? The hard thing about power calculation is you have to say, I’m going to detect some effect larger than delta.

[00:27:21] Okay. Now, if you’re saying I’m going to detect a 5% effect or greater, you will need a certain number of users. Now, if you want to detect 1%, you’re gonna need a lot more users. In fact, 25 times more users than if you’re going for 5%. So there’s a trade-off here.

[00:27:37] Everybody wants to detect the 1%, but to detect the 1%, you’re gonna need a lot more users than you most likely have. So you’ll settle. You’ll say, okay, I’m going for a 3% or 5%.

[00:27:47] That means that below 5%, the experiment is underpowered; I am unlikely to get a statistically significant result.
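A quick back-of-the-envelope for that 25x claim (the 121,000 figure is carried over from the earlier example):

```python
# Required users scale roughly with 1/MDE^2, so going from a 5% MDE to a 1% MDE
# costs about (5/1)^2 = 25x the users.
users_for_5pct_mde = 121_000                                  # per variant, earlier example
users_for_1pct_mde = users_for_5pct_mde * (0.05 / 0.01) ** 2
print(round(users_for_1pct_mde))                              # about 3 million per variant
```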

[00:27:59] Okay, I know this power concept is hard to grasp. You’ll get the slides afterwards; you may have to re-read this. But the important thing is to make sure you understand that you need to power your experiments at 80%, or else the results are not going to be stat-sig.

[00:28:15] When there is an effect of the size you want to detect, most of the time you’re not going to get a stat-sig result, and if you do get one, you will suffer from the winner’s curse.

[00:28:28] There was a question on bots that I’ll answer later. Okay. So, what about the real world? Does this really happen? Yes. Here’s a US-centric example. If you drive in the US, in most states, most places in the US except for New York City, you get to turn right on red after a stop: you stop at the light even though it’s red, and then you get to go right,

[00:28:56] if there is no traffic that impedes you from turning right. Prior to the 1970s, the engineers said, this is a safety hazard; we shouldn’t allow this; it will kill pedestrians, bicyclists, etc. But there was an oil crisis, right?

[00:29:10] And if you stop less at a light and you’re allowed to turn right, it will save fuel and it will save time. So people were like, hey, why don’t we look at this? So they ran some study and they looked at 20 intersections. What happens before and what happens when they allowed right turn on red.

[00:29:27] So they added a sign: you can turn right on red. And they got this: 308 accidents in some period before, 337 accidents in a period of the same duration after. That’s a 9% increase, but notice they only evaluated 20 intersections.

[00:29:44] They did not have any reasonable statistical power; small numbers, man. They couldn’t declare this 9% stat-sig. So they concluded that because it was not a statistically significant result, there is no significant hazard to motorists or pedestrians.

[00:30:05] Okay, they confused statistical significance with practical significance. Okay, later on, people combined multiples of these studies in a meta-analysis, and they estimated that 60% more pedestrians are run over and twice as many bicyclists are struck.

[00:30:23] That’s a 100% increase. But by then, Congress had already passed right turn on red. Okay? A very, very interesting real-world example. Daniel Kahneman, Nobel Prize winner in Economics in 2002, comes out with this book called Thinking, Fast and Slow. Great book, 2011.

[00:30:47] He sort of summarizes the field. Lots of amazing results, surprising results. One of them is about priming. He shows all these amazing results and he says, look, this is science; even though you’re surprised, you have to change your intuition.

[00:31:02] Well, people started to look at the experiments that he used for that chapter, and they were severely underpowered, right? This happens a lot in psychology.

[00:31:11] This is what caused the replication crisis: people were trying to replicate them and they were not replicating.

[00:31:19] Okay. So there’s an index called the replicability index. You can read all about it but they estimated that the replicability index is about 14 to 19%. A replicability index below 50% is considered to be unreliable.

[00:31:34] So this chapter, if you do a meta analysis of all the results, you get a number that is way, way below 50%. Okay. Now is this criticism reasonable? Absolutely. 

[00:31:46] To the point where Kahneman responded: what the blog gets absolutely right is that I placed too much faith in underpowered studies. Authors who review a field, as I did, should be wary of using memorable results of underpowered studies as evidence for their claims.

[00:32:07] So he basically said, forgive me, I should not have written that chapter. Too many underpowered results confused me. I think by now I’ve drilled this into you. 

[00:32:20] So this is sort of my focus in this presentation on power, because a lot of people don’t get it. But I want to mention a few more things about trust in general. A/A tests: when you run an A/A test, you very simply split the traffic into two. Half of them get A, and the other half also get A.

[00:32:42] Hence the name; there is no difference. Okay, if the system is operating correctly and you do these repeated trials, only 5% of the time should a metric be stat-sig with a p-value less than 0.05.

[00:32:57] If you’re a statistician, then you should run hundreds of these A/A tests and look at your p-values, and they should form a uniform distribution. So you can run an Anderson-Darling test or something to see that the distribution is actually uniform.

[00:33:13] This is what we’ve done at Microsoft for example, and there are blog posts and this is described in my book in more detail. So you’re welcome to look at this but here’s an example, real world. When Optimizely first came out, they were all excited that they could compute p values in near real time.

[00:33:30] They did not realize that they were doing the statistics incorrectly, doing multiple hypothesis testing. The first signs that something was wrong, other than comments on their book (which I also made) that they were statistically naive, were these posts where somebody says, hey, I ran A/A tests,

[00:33:49] 30% of them are stat-sig, is something wrong? Okay. So that issue with peeking, they fixed it, of course. So now they’re using something called always-valid p-values. But it is important that you run A/A tests to raise the trust in your results. There are pitfalls described in papers.

[00:34:11] I’m going to summarize a few more in one-liners. Novelty effects: if your treatment effect is decreasing over time, so while you run the experiment for two weeks you see a very clear trend, be suspicious.

[00:34:26] So this is what’s called a novelty effect, or the opposite, a primacy effect. If you’re computing percent confidence intervals, so I’m not just saying revenue grew by 10 cents, I’m saying it grew by 3% and the confidence interval is between one and five percent:

[00:34:45] this is not obvious; it’s not computed the way most people think. You have to use something called Fieller’s formula, or the delta method if you have ratio metrics, where you divide two random variables, which is a lot of our metrics.
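To make the delta-method point concrete, here is a minimal sketch for a ratio-of-means metric (my own illustration, based on the standard delta-method variance formula; the function name is hypothetical):

```python
# Confidence interval for a ratio metric such as clicks per pageview when the
# randomization unit is the user, using the delta-method variance approximation.
import numpy as np

def ratio_metric_ci(numer, denom, z=1.96):
    """numer, denom: per-user totals, one entry per randomization unit."""
    numer, denom = np.asarray(numer, float), np.asarray(denom, float)
    n = len(numer)
    mu_n, mu_d = numer.mean(), denom.mean()
    ratio = mu_n / mu_d
    var = (ratio ** 2) * (numer.var(ddof=1) / mu_n ** 2
                          + denom.var(ddof=1) / mu_d ** 2
                          - 2 * np.cov(numer, denom, ddof=1)[0, 1] / (mu_n * mu_d)) / n
    se = np.sqrt(var)
    return ratio - z * se, ratio + z * se
```

Ignoring the user-level correlation between numerator and denominator tends to understate the variance, which is one way A/A tests end up with too many stat-sig metrics.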

[00:35:00] You need to use the delta method; we initially found this when we ran A/A tests and realized lots of our metrics were computed incorrectly. Make sure you filter bots; they reduce your power.

[00:35:12] This is a real photo that I took. You know, this guy must be doing well with his bot, because he’s got it on the license plate of this nice Mercedes. Just to give you an idea of how serious this is: at Bing, 50% of traffic in the US was bot-driven. In Russia and China, 90% of our traffic came from bots.

[00:35:35] Look for instrumentation issues, and then look for outliers. So to summarize, I love this law called Twyman’s law, which, in the book Exploring Data, they describe as perhaps the most important single law in the whole of data analysis, right?

[00:35:55] You look at this law and you’re like, oh, this must be the E equals MC squared of data analysis. What is Twyman’s law? Any figure that looks interesting or different is usually wrong.

[00:36:09] So if something is amazing, find the flaw. If you see a massive improvement to a key business metric, recall Twyman’s law: find the flaw. You don’t get a 50% improvement to your OEC, this key metric we’re going to talk about next.

[00:36:26] Triple-check things before you celebrate. We are biased to celebrate successes and investigate failures. Correct for that. Okay. Here’s the link if you’re interested in learning more about Twyman’s law.

[00:36:40] Right on time. I planned to go to 8:45 and I’m at 8:43 Pacific time. Apologies to everybody in other time zones; I’m using mine. But let me spend the next 15 minutes talking to you about the OEC, and I’m going to stay a little after. I’m getting a message here from the folks running it that they might allow people to ask questions verbally on stage, which is cool.

[00:37:09] I’m going to continue to look at the live Q and A, so that we can control the time. So let me pick a couple. How can we increase the power for smaller audience on websites with lower traffic? It is hard. 

[00:37:24] There are mechanisms. One is to reduce the variance of your metrics. So you might go for surrogate metrics, which are maybe not really what you want but you believe they’re highly correlated and they are more sensitive.

[00:37:35] There are techniques like triggering. So if you’re running an experiment on checkout, you don’t need to include people that haven’t reached checkout. This is called triggering. 

[00:37:45] It could easily double your power, and I have an example of that in the book and in the class. And the third one is to use variance reduction techniques like CUPED.

[00:37:54] You can look up CUPED and you will find it. So again, that was a question with five votes.
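A minimal CUPED sketch, assuming you have a pre-experiment version of the same metric for each user (the function name and variable names are mine):

```python
# CUPED: remove the part of the metric explained by a pre-experiment covariate.
# theta = cov(X, Y) / var(X);  Y_adj = Y - theta * (X - mean(X))
import numpy as np

def cuped_adjust(y, x_pre):
    """y: in-experiment metric per user; x_pre: pre-experiment covariate per user."""
    y, x_pre = np.asarray(y, float), np.asarray(x_pre, float)
    theta = np.cov(y, x_pre)[0, 1] / x_pre.var(ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# In practice theta is usually estimated on the pooled control + treatment data;
# you then run the usual test on the adjusted values. The treatment effect estimate
# is unchanged in expectation, but its variance shrinks when x_pre correlates with y.
```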

[00:38:01] So I answered it. I want to control the time. Let’s go to the OEC. What should you be optimizing for? This is like the slide that I want you to remember about the OEC. 

[00:38:14] Agree early on what you’re optimizing for. I’ve seen organizations that run lots of A/B tests and they don’t agree and so they cherry pick the metrics.

[00:38:30] The OEC is this idea of an overall evaluation criterion. Ideally, it’s one metric. Some call it the one metric that matters. I’m not a fan of the “one metric” framing, because the OEC could be a combination of metrics.

[00:38:45] The idea is very, very similar. Getting agreement on your OEC in your org is a huge step forward. Okay. What should you think about as you try to come up with an OEC?

[00:38:57] Look for customer lifetime value, not anything short-term. Short-term is usually where you get into all these pitfalls; I discuss a lot of them in the class that I teach.

[00:39:08] It is a cause of lots and lots of issues. Think about the long term. Look for success indicators, leading indicators, and avoid all these lagging indicators and vanity metrics that we see out there.

[00:39:22] Okay. If you’re not sure, there’s a great book, How to Measure Anything, by Doug Hubbard. If you have a funnel (I’m using the F word here),

[00:39:32] the classical way to use experimentation is just to see how people flow through the funnel and look at the micro-conversion rates. Okay, look at your acquisition, activation, retention, revenue, and referral.

[00:39:44] Try to summarize all these; take the initial Rs and you get AARRR, hence pirate metrics. Be really careful about watermelon metrics: teams report they’re doing a great job, my annual review is great, everything is green, but when you talk to customers, they see lots of red. Make sure you actually measure the customer experience.

[00:40:11] Don’t be afraid of weighting factors. You might have three factors in your OEC, and you might initially say, I don’t know how to weigh them, so I’ll use all three, and then over time you might come up with the right weights for them. Things like conversion to an action are important, and the time to the action, right?

[00:40:30] CRO is an industry, but a lot of the time, time to action is actually a more sensitive metric. If I’m able to shorten the time that users take to check out, I’ve probably done something good. Visit frequency:

[00:40:45] how often do they come to you? Now, it’s important that the OEC has only a few key metrics. Beware of what’s called the Otis Redding problem, and let me try to play this song for you so you’ll remember it.

[00:41:12] So hopefully some of you remember this part of the song: “I can’t do what ten people tell me to do, so I guess I’ll remain the same.” And that’s what happens when you give people 10 or 15 metrics to optimize: they give up. Make sure to limit it so people can understand and focus on the right metrics.

[00:41:33] Now, this doesn’t mean your scorecard should have only these few metrics. It should have tons of metrics as diagnostics to help you understand why the OEC changed.

[00:41:44] Okay, to give you an idea at Bing, we had four OEC metrics that were combined with weights into one but we had 2,000 debugging metrics. 2,000 metrics in a typical scorecard. 

[00:42:00] If you are not sure what’s going on, you can say give me everything and you would get over 10,000 metrics to help you debug what is going on.

[00:42:11] So I want to go through one bad example of an OEC. This is an example that I took from a conference. What they did is they took this middle widget here and moved it left, and they said, look at this amazing thing, we raised clicks by 109%. It is a bad OEC; it is a local metric. It only looks at part of the page.

[00:42:40] It’s trivial to move a local metric. Why? Because you’re cannibalizing whatever was on the left. They don’t even show us what happened to it. Okay? So next week, the team on the right is gonna move themselves left, and they’ll claim that they have increased the OEC, their metric, by 150%. Okay, now, the OEC should be something global.

[00:43:03] Here is sort of the guidance that I can give you. Look at the clicks on different areas of the page and assign them some downstream value and that’s what you measure and you’re going to find this is a lot harder to improve. 

[00:43:18] Okay, a couple of real examples of OECs. Amazon’s email system. I worked at Amazon as their director of personalization and data mining.

[00:43:29] I was responsible for this email system that sent what are called programmatic campaigns. Okay, so if you have users that bought some book by an author and the author has a new book, we would send you an email:

[00:43:40] look, J.K. Rowling has a new book. Or we look at your purchases and we send you a recommendation email based on things that you purchased or own.

[00:43:51] Okay. Fun poll for you, and we are going to do this as a word cloud. What would be a good OEC for email? Let’s see if some of you can guess this; just put in your OEC.

[00:44:08] What would you optimize for if you were me or the manager of the team that is trying to understand how valuable their emails are? 

[00:44:21] Okay, so let me read some of the results for you that I see here. Open rate is the number one thing. CTR very similar to click through rate.

[00:44:34] Sales, okay, that’s a good one. Purchases. Similarly, click rate. Great. So, okay. So what was the initial version? We picked revenue generated from users clicking through the email. So we sent you an email, you clicked through the email, went to Amazon, and bought something; we would give that email the credit. It’s like an affiliate program, right?

[00:45:01] Somebody is an affiliate and drives traffic to Amazon. We already had an affiliate program live; that’s what we would use. What’s the problem? The metric is monotonically increasing with volume: sending more email can only increase it, and that led to spamming of users. It was just getting absurd how many emails we were sending.

[00:45:26] So initially we did sort of a simple looking solution. We said, okay, we’re going to build a traffic cop. We’re only going to allow users to receive email every X days and we would let the data scientists tell us what X is. 

[00:45:40] But the problem was that if I send you a mediocre email today, that means I can’t send you a great email tomorrow.

[00:45:47] So we tried to make this an optimization problem: is there an email in the pipeline that you might get tomorrow that will be better than today’s? And it was a mess. It was impossible to get this right.

[00:45:59] Truly complicated. What we did instead is come up with a better OEC. Okay, and this is something I encourage you to do. Click-through revenue is optimizing for something short-term.

[00:46:12] Annoyed users unsubscribe, right? We had a spamming problem. So what we did, and again, this is an important lesson: always look for a countervailing metric. We added a penalty term for unsubscribes, and we said, okay, here’s the revenue; let’s convert the unsubscribes into a number, and then we’re gonna subtract from the revenue the cost of the unsubscribes.

[00:46:36] So we estimated the lifetime value lost when a user unsubscribes, and then we subtracted that term, and more than half of the campaigns that we were sending out had a negative OEC.
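In spirit, the improved OEC was something like the sketch below; the dollar figures and the lifetime-value-per-unsubscribe number are made up for illustration, not Amazon's actual values:

```python
# OEC for an email campaign: click-through revenue minus the estimated lifetime
# value lost when users unsubscribe (the countervailing penalty term).
def email_campaign_oec(clickthrough_revenue, unsubscribes, cost_per_unsubscribe=20.0):
    return clickthrough_revenue - unsubscribes * cost_per_unsubscribe

# A campaign earning $3,000 from click-throughs but driving 200 unsubscribes scores
# 3000 - 200 * 20 = -1000: a negative OEC, like half the campaigns mentioned above.
print(email_campaign_oec(3_000, 200))
```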

[00:46:48] Okay, this insight not only allowed us to solve the problem better, but it led to a really interesting change that you’re now seeing on multiple sites, which is that we changed the default so that you don’t unsubscribe from everything, but you unsubscribe from just this family of campaigns.

[00:47:05] So when you hit unsubscribe, we’re going to default to say hey, we understand you don’t want to get author emails. 

[00:47:12] We will unsubscribe you from author emails, but allow us to continue to send you emails from the other campaigns, say, recommendations. Okay. The next real example is the OEC at Netflix. The Netflix business model is based on subscriptions.

[00:47:29] An ad-based model is now in the works. And so the revenue is impacted by three things: the acquisition rate of new users, the rate at which current users cancel, and the rate at which former users who previously canceled now rejoin.

[00:47:46] Okay, it takes a month to observe retention of new users, and the impact on existing users depends on the next payment period. So it’s a very delayed mechanism, very slow.

[00:47:57] People ask me how to get more power: find a surrogate. Retention is not very sensitive; it takes a long time. So what would you do at Netflix in order to find a metric that is sensitive enough that you can run a two-week experiment that will predict,

[00:48:24] in a causal way, whether somebody is going to renew their subscription or not, or cancel if they’re signed up for auto-renewal. So please go to the poll and think about this: you are now the top-notch person at Netflix trying to come up with the metric that you want your org to optimize.

[00:48:46] We know that retention is great, but it’s not sensitive. So we’re trying to come up with what are called driver metrics: they drive retention. Okay, let me see some results. I see daily active users; I just want to respond to that. That’s not going to work in an experiment, or at least a count of users is not going to work, because I assign people to the experiment 50/50.

[00:49:15] So I have to look at something close to daily active users, maybe sessions per user. So that’s close. But I see a few people voting for watch time, and you are getting very close to the right answer.

[00:49:31] Yes. There’s a nice word cloud here that I will be able to paste, maybe. Cool. So Netflix picks streaming hours, which is effectively engagement or watch time. But things get a little more complicated: if you get users who stream a few hours a month to stream a little bit more, that has a large positive impact on retention, but if a user is already streaming 50 hours a week, then maybe it’s not as important.

[00:50:01] And so, what they came up with is a bucketization of hours based on some thresholds, and that was the OEC that they used. I’m going to skip this next one; you’ll have the slides on how we picked the OEC for search engines like Bing. And again, this slide I wanted to just put up: if you’re running an experimentation platform,

[00:50:21] Lukas and I recently wrote an article on what you should be trying to optimize for and what good driver metrics for that are.

[00:50:28] So I will end here with a summary, experiments are your safety net for evaluating ideas reliably. Make sure to focus on trust. Do not trust experiments with low power. Run A/A tests.

[00:50:43] Remember Twyman’s law. Any figure that looks interesting is probably wrong. Make sure you agree on the OEC. It’s a hard problem, right? 

[00:50:54] If you’re looking at CRO, it already starts out by assuming that conversion rate is the OEC. That is wrong in a lot of cases: the average order value can change, and satisfaction isn’t captured.

[00:51:06] Make sure your OEC is more than just conversion rate, so think about these long-term values. If you want to learn more, I teach a 10-hour quarterly interactive Zoom class. There’s a link, and there’s a promo code that will give you 500 off for the first 10 people using it. This is the link.

[00:51:24] It’s now going to show you “no permission.” I will post it on LinkedIn, this complicated link, and I will give permissions maybe half an hour after this talk, after I put some of the poll results into the slides.

[00:51:40] Yeah, you guys can pick out three to four questions. I’m going to be happy to run over with some Q and A, but I want to respect the time.

[00:51:48] If you need to leave, I’m not going to be offended. I was told that there’s like 268 people attending this, a nice number. 

[00:51:56] But we can run over by a few minutes if people want to ask questions live. Let me do one more from the questions, and then, people, you can switch to anybody live. So, eight people voted for this one.

[00:52:12] What would you consider the best way to run two parallel experiments focusing on the same part of the funnel, with no contamination? So, I’m a big fan of running concurrent experiments.

[00:52:23] You’ve got to convert the funnel into these micro-conversions in order to be able to run them. It is much harder to run on the same micro-conversion of the funnel.

[00:52:34] I will give you something that we’ve done in the search ranker, right? Ultimately, when somebody types a query, there’s only one ranker, but we split the ranker into layers: there’s the layer that tries to find the 1,500 results that may match, and then there’s some filtering and personalization.

[00:52:54] These layers could be optimized concurrently. That’s what we’ve done. So, great question. Not a perfect answer here, but something that we have done.

[00:53:06] Okay. People, you want to select somebody for a verbal question. 

[00:53:11] Vipul Bansal: So yeah, we most certainly would definitely love to have a few people on stage, and I see a lot of people have actually requested to be on stage as well.

[00:53:18] But we can just extend by about 10 minutes, because the next workshop, by Carl Gillis, starts in another 30 minutes for you, and the calendar invites are already on your calendar. But we can just pick two people.

[00:53:33] So if you can just quickly, along with your name...

[00:53:37] Ronny Kohavi: I’m gonna defer to you. You pick.

[00:53:40] Vipul Bansal: Yeah, sure, sure. So, I’m speaking to the attendees right now: if you can send an interesting question our way through Slido itself, along with your name, just do that right now and I will have you on stage after that.

[00:53:58] So I’ll give you a few seconds to ask a question. Or let me scroll down and see if there’s any interesting question which may not have gotten a good number of upvotes.

[00:54:10] Ronny, do you see any interesting question down there that you would like to answer? And I can request the person to come up. 

[00:54:18] Ronny Kohavi: Well, I see anonymous questions, a lot of them, but there’s nothing with a name that’s got more than...

[00:54:25] Vipul Bansal: So there’s one question I just saw from Chao L, I think that’s how it’s pronounced; sorry if it isn’t right. Chao, let me see if you have requested to be on stage by any chance, and if you haven’t, you can just go ahead and request to be on stage right now and I will bring you on stage.

[00:54:46] Ronny Kohavi: While you’re looking for Chao, I’ll answer Luke’s question: if you’re using overlap analysis heavily as part of your program, are you less likely to make power-related mistakes? So, the question of overlapping experiments, or concurrent experiments, has been asked a lot.

[00:55:05] The answer is that any reasonable site is going to run a lot of concurrent experiments. There are thousands of experiments running on any search engine that you use right now, and on Amazon.

[00:55:20] Airbnb runs hundreds of concurrent experiments. What allows us to do this is that the mechanism of controlled experiments basically isolates you from external factors, which impact control and treatment in the same way.

[00:55:33] So these other experiments are basically just more external factors. Now, there could be an interaction, and of course, if you took any course on design of experiments, it is highlighted that these can happen.

[00:55:47] What we found in practice is, one, they are rare. We’re using two mechanisms to address this. One is prevention: we tell people, look, if you’re touching the exact same area of the page or the exact same backend algorithm, be aware of the fact that the experiments could interact, and therefore run them sequentially. But the other thing is we test for interactions.

[00:56:11] So every night, we look at all pairs of experiments and see if there’s an interaction. And after this presentation, I’ll answer some of these in writing.

[00:56:21] I can give you the link to the paper that shares how to do this, but basically, you’re gonna run a lot of tests to see if there’s an interaction. You’ll have to control the false positive rate, the false discovery rate, but it is possible; we’ve done this, and interactions were rare.

[00:56:36] So, think about this: thousands of experiments running concurrently at Bing, and there was an unexpected interaction maybe once every couple of weeks. And most of the time it was not even something that we could have predicted.

[00:56:48] It was due to a bug. People didn’t realize that this other group was doing something else and there was something egregious, like a bug that caused a crash or a big JavaScript error or something like that. 

[00:56:59] Let me just pick more questions that were voted. 

[00:57:01] Vipul Bansal: We can just go for five more minutes 

[00:57:06] Ronny Kohavi: Yeah, I got your 10 minutes message. MDE, we already answered. So somebody got three votes on this.

[00:57:15] “Since it was bot traffic causing the SRM, could the test have continued?” Yes, well, we didn’t know what the cause of the SRM was.

[00:57:24] We were concerned that we were flying blind. So whenever you have an SRM, you can’t trust the results, so you’re now flying blind. One of the goals of a controlled experiment system is to be the safety net.

[00:57:36] If something is egregiously bad, you want to be able to abort. When you have an SRM, you’ve lost that; you’re flying the plane with no instruments.

[00:57:45] You can leave it on for another day while you debug, but I would encourage you to abort experiments that have SRMs pretty quickly, because you don’t know the reason. And many times we did realize that, oh, this treatment is actually causing crashes, or it’s not getting users with a certain characteristic.

[00:58:03] There’s a whole paper on how to diagnose SRMs. In the case of a bot, if you can filter the bot, which is sometimes something you can do post-experiment, then you can still use the results of the experiment after removing it.

[00:58:19] Another question that was voted high is: do we need to test SRM for each point in the funnel with respect to the previous point? So this is called a metric SRM. It’s not as guaranteed. With a user SRM, balance is guaranteed because you’re randomizing users into control and treatment

[00:58:38] before you give them the treatment. Now, if users are already exposed to some treatment, then metrics that are conditioned on factors that could be impacted by the treatment may show an SRM.

[00:58:53] Think about this: I’m causing users to view more pages. Therefore, I will have some SRM in the metrics that use page views as a denominator, or things like that.

[00:59:07] So for metrics, you have to be a little more careful but it is certainly something to look at when the denominator is dramatically different between the control and treatment. 

[00:59:19] If your question is specifically about triggering, then yes, at the triggering point you can run an SRM test, and it should obviously be balanced if you designed it for balance, because up to the triggering point, there should not have been a difference for the users between control and treatment.

[00:59:37] Would you comment on sequential analysis and power, especially with always valid inference? It’s a complicated area. When people talk about sequential analysis, I think what you’re looking for is what’s called group sequential analysis.

[00:59:52] Sequential analysis lets you, after some group of users, do a test. The original sequential analysis is something that you can do,

[01:00:02] but I’m more of a fan of group sequential analysis. So what you do is allocate some of your type one error rate to a few tests that you will run. For example, you run an experiment for two weeks.

[01:00:12] You might say, I’m going to test three times on the first day, and if something is bad, I will abort; you can do that in what is called an interim analysis.

[01:00:26] There’s a blog post that I wrote on this, on LinkedIn, you can read references to interim analysis and group sequential design. So there’s a way to do this so that you control your error rate.

[01:00:37] I am a big fan of doing interim analyses to abort, not for declaring success. So I’m looking at your experiment as two one-sided tests: one to abort, and the other to declare success.

[01:00:52] For the success, I want the most power because that’s the thing that I need the most and so I’m going to do one test at the end of the fixed horizon, say the experiment runs for two weeks.

[01:01:01] But for the abort, I’m going to run intermediate analyses to see if something is egregiously bad and then abort. So I think we’re at time. 

[01:01:11] I’m going to thank everybody for your questions. Again I will post these slides. I don’t know, people, you’ll have some way to send them this link.

[01:01:22] Vipul Bansal: Yes. 

[01:01:22] Ronny Kohavi: And if not, I will post it on LinkedIn in about half an hour, once I put some of the fun poll results into the presentation, and then I will also answer the Slido questions in writing. So if you come back in an hour, you will see more of these answered in writing, with some references that you might need.

[01:01:41] Vipul Bansal: Awesome. I so wish this call could have been extended by an hour more, but we have only so much time and there are more workshops lined up after this.

[01:01:51] So thank you so much, Ronny, for taking the time today, preparing this presentation, and sharing your knowledge and all your experience. Of course, it’s immense.

[01:02:02] There’s absolutely no container that can measure the amount of experience that you’ve had and the amount of knowledge you have in this space. So thank you so much for choosing to speak to our audience. 

[01:02:12] Ronny Kohavi: Thank you for inviting me. Have a great day.

Speaker

Ronny Kohavi

Technical Fellow and Corporate Vice President, Analysis & Experimentation, Microsoft
