Causal Inference, Experimentation and A/B Testing

Explore the depths of A/B testing, Generative AI, and Causal Inference, revealing the impact on KPIs and modern experimentation.


[00:00:00] Ishan Goel: Hello everyone. This is Ishan Goel, Associate Director of Data Science at VWO. I’m very excited to be speaking at ConvEx ’23 this year and today my talk is centered around From Digital Marketers to Digital Scientists. 

[00:00:22] The idea of this talk is that what modern digital marketers and product managers are doing in A/B testing is fundamentally similar to what scientists have been doing for the past 200 to 300 years. And I want to convince you that the work we are doing in A/B testing is equivalent to what scientists do in understanding various phenomena in domains like physics, chemistry, or medicine.

[00:00:51] So with that in mind, let’s get started.

[00:00:56] So firstly, I would want to start with the claim that experimenters are digital scientists and I would want to start with the idea, what is the fundamental query of science? 

[00:01:11] So the fundamental query of science in any domain is to understand the cause and effect in various phenomena. Whenever someone asks the question of why something happens or how something works, essentially what they’re trying to understand is how one factor causes another, and all scientific questions eventually boil down to cause and effect.

[00:01:37] So what does a scientist do to establish cause and effect? Essentially, a scientist runs an experiment and in the classical sense, what was an experiment? 

[00:01:49] So, an experiment in the classical sense was when a scientist would create a controlled environment. What is a controlled environment? A controlled environment is where no other factors are changing; only the one factor that the scientist wants to study changes. And he sees the impact of that factor on some metric of interest.

[00:02:10] For example, suppose a chemical engineer wants to mix two chemicals and see the impact on the color of the resulting liquid. Then the cause that he’s trying to explore is the mixing of the two chemicals, and the metric that he’s looking at is the color.

[00:02:27] So, that is how most scientific experimentation would work and if you think about it, what the modern experimenter is doing in A/B testing is something very similar.

[00:02:38] They are making a change on their website and then they are trying to see some metric of customer behavior. Either it is conversion rate or it is revenue earned but that is what modern experimenters are doing. 

[00:02:53] So in that regard, modern A/B testing is very, very similar to classical scientific experimentation. And that is why I want to ask this question today: we have become modern experimenters, but have we adopted the scientific mindset?

[00:03:09] There are a lot of things in the scientific mindset that we can consider and learn from, and in this talk, I’m going to explore first the statistical and fundamental aspects of the scientific mindset.

[00:03:25] What are the queries scientists in general are trying to solve, and also the philosophical aspects. My aim in this talk is to impart a perspective that makes the listener a better experimenter, more aware of what they are doing and the hurdles that it involves.

[00:03:46] So let’s get started. 

[00:03:49] So I would start with a very common and very banal remark in statistics that correlation is not causation. 

[00:04:01] So anytime someone dabbling in numbers comes up with two similar trends and shows them to an economist or a statistician, suggesting that maybe these two things are related underneath, the common rebuttal they get is that correlation is not causation. And this is at its core the problem with the scientific research of finding cause and effect.

[00:04:28] So let’s take a minute to throw light on this. So I’ve highlighted two correlations here that can be commonly observed. 

[00:04:38] The ice creams and sunburns are correlated. So having more ice creams and getting more sunburns are correlated in data.

[00:04:47] Similarly, having more stork nests and having more babies has also been found to be correlated through history. So it’s a common belief that if you have a stork nest at your home, it will cause you to have more babies. The idea here is that obviously we understand scientifically that these things do not cause each other.

[00:05:12] Then why do they seem to be correlated? Now the underlying catch here is something called confounders, and the literal meaning of a confounder is something that confuses you. And that is why economists chose this term for their studies. Let’s see the confounders in both cases.

[00:05:31] So, in the first case where ice cream sales and sunburns are correlated, the confounder is the dry, hot and sunny weather. 

[00:05:40] The idea is that eating ice cream is not causing you to have sunburn, but the sunny and dry weather is causing you to eat more ice creams and it is causing more sunburns as well, which is easy to guess. And that is why the sun is the confounder that is causing both of these things. If you do not have the data on how the weather was, and you only look at the data of ice creams eaten and sunburns, then you’ll see that they are correlated and be falsely led to believe that maybe one is causing the other. And similarly with stork nests: can someone guess what the confounder is with stork nests and babies?

[00:06:26] So the confounder with stork nests and babies is human settlement. In all the places where there was more human settlement, there were more stork nests and there were more babies as well.

[00:06:35] So that is why historically people have observed this pattern and then believed this correlation to be something very fundamental and developed a superstition out of it.

[00:06:46] So, my claim here is very clear that it is the confounders that complicate causal estimation and all endeavor of experimentation is to get rid of these confounders. 
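
The ice cream and sunburn example can be sketched as a small simulation. This is only an illustration under made-up assumptions: the function names (`pearson`, `simulate_days`) and every number in it are hypothetical. In this toy model sunburns never depend on ice creams, yet the two are strongly correlated overall, and the correlation largely disappears once the weather is held fixed.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate_days(n_days=5000, seed=0):
    """Hypothetical model: sunny weather raises both ice creams and sunburns."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        sunny = rng.random() < 0.5                      # the confounder
        ice_creams = rng.gauss(100 if sunny else 40, 10)
        sunburns = rng.gauss(30 if sunny else 5, 5)     # never uses ice_creams
        days.append((sunny, ice_creams, sunburns))
    return days


days = simulate_days()

# Correlation when the weather is ignored.
overall = pearson([d[1] for d in days], [d[2] for d in days])

# Correlation within each kind of weather (confounder held fixed).
sunny_only = pearson([d[1] for d in days if d[0]], [d[2] for d in days if d[0]])
cloudy_only = pearson([d[1] for d in days if not d[0]],
                      [d[2] for d in days if not d[0]])

print(f"all days:    {overall:.2f}")
print(f"sunny days:  {sunny_only:.2f}")
print(f"cloudy days: {cloudy_only:.2f}")
```

Splitting by weather is only possible here because the simulation records the confounder. With real observational data the confounder is often not measured at all.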

[00:06:59] So I’ll take you through the story of how science developed, what is this complication in more detail and how did we develop the modern A/B testing procedure to solve this.

[00:07:09] So, when you try to learn cause and effect, suppose for the time being that we could just observe things the way they are. If we did not need a controlled environment, if we did not need to have an experiment, and we could still learn what is causing what, what would be the problem with that?

[00:07:30] So there is a very great problem that happens in that and it is a very technical term, it’s called Selection Bias. So let’s spend a bit of time in understanding selection bias because that essentially reveals why you need to run experiments and why observation is not enough to learn cause and effect.

[00:07:50] So here is a very interesting joke on selection bias. So suppose someone runs a survey that, do you love to fill surveys or do you toss them in the bin? 

[00:08:01] And suddenly you see from the result of the surveys that 99.8% are the ones who love to respond to surveys and 0.2% are the ones who toss them in the bin. And can someone guess what is the problem here? Is it a true picture of reality? 

[00:08:19] So, this is actually called sampling bias, and sampling bias and selection bias are very similar. It depends just on the context. Sampling bias is a form of selection bias, but what is happening here is that all the people who do not like to fill surveys have also tossed this survey into the bin.

[00:08:37] So the data from those people has never reached the surveyor. So for all practical matters, it might be the case that people who do not like surveys, they are much more in number. Maybe they are 5,000 maybe they are 10,000 and this data is not representative. 
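
The survey joke can be reproduced with a few lines of simulation. This is a rough sketch under invented assumptions: `run_survey`, the 30% share of people who like surveys, and the response probabilities are all hypothetical. It only shows how the share seen by the surveyor can differ sharply from the share in the population when responding depends on the very thing being measured.

```python
import random


def run_survey(n_people=10000, share_who_like=0.3, seed=5):
    """Return (population, responses); each entry is True if the person likes surveys."""
    rng = random.Random(seed)
    population = [rng.random() < share_who_like for _ in range(n_people)]
    responses = []
    for likes_surveys in population:
        # Assumed behavior: survey lovers almost always respond,
        # everyone else almost always tosses the survey in the bin.
        respond_prob = 0.9 if likes_surveys else 0.002
        if rng.random() < respond_prob:
            responses.append(likes_surveys)
    return population, responses


population, responses = run_survey()
true_share = sum(population) / len(population)
observed_share = sum(responses) / len(responses)

print(f"share who like surveys in the population: {true_share:.1%}")
print(f"share who like surveys among responses:   {observed_share:.1%}")
```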

[00:08:54] So, that is the reason observation is not enough. Let me take you through an example in a bit more detail that will give you the sense of why cause and effect cannot be learned from observational data. 

[00:09:08] Suppose I ask you the question: does having a college degree cause you to get more wages in the future? You have to identify the causal relationship between having a college degree and getting more wages. Suppose someone comes up and says, okay, what I’ll do is collect the data for a thousand people who did go to college and a thousand people who did not go to college, and then work out the average wages in both groups. The difference between these two groups will be the causal impact of getting a college degree.

[00:09:51] So this might sound very logical and reasonable, that this will reveal the causal impact, but the problem here is the selection bias. The difference between the average wages of the two groups won’t reveal the causal impact, and what it reveals might be entirely opposite from the true causal impact.

[00:10:14] The problem is that the people who did go to college, maybe they had the intelligence or the resources to earn more wages even if they had not gone to college and that is the selection bias that people who did go to college maybe even if they had not gone to college, they would have earned better.

[00:10:38] How would you filter them out? So that is the confounder that will actually be biasing the result in this case and this is a big problem with all observational data. 

[00:10:49] So observational data might look okay when you look at it but it can have various sorts of selection bias that you would not be able to guess unless you have the data on the confounding variable. 
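
The college degree example can be made concrete with a toy model. The sketch below is hypothetical throughout: `ability` stands in for the unobserved intelligence or resources mentioned above, and the wage formula and the true effect of 5 are invented. Because people with higher ability are more likely to go to college and also earn more regardless, the naive difference in average wages comes out well above the effect that was built into the simulation.

```python
import random
import statistics

TRUE_EFFECT = 5.0  # assumed causal effect of college on wages in this toy model


def simulate_people(n=20000, true_effect=TRUE_EFFECT, seed=1):
    """Return (went_to_college, wage) pairs with self-selection on ability."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        ability = rng.gauss(0, 1)                        # unobserved confounder
        # Self-selection: higher ability makes college more likely.
        went_to_college = ability + rng.gauss(0, 1) > 0
        wage = (30 + 10 * ability
                + (true_effect if went_to_college else 0)
                + rng.gauss(0, 5))
        people.append((went_to_college, wage))
    return people


def naive_difference(people):
    """Difference in average wages between the two observed groups."""
    college = [wage for went, wage in people if went]
    no_college = [wage for went, wage in people if not went]
    return statistics.fmean(college) - statistics.fmean(no_college)


people = simulate_people()
naive = naive_difference(people)

print(f"true effect built into the simulation: {TRUE_EFFECT:.1f}")
print(f"naive difference between groups:       {naive:.1f}")
```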

[00:11:02] So, that is why we need to intervene and run experiments. That is why observation is not enough to learn cause and effect. And that takes us to the next slide and a very interesting story on how modern randomized controlled trials were born. This is a very interesting story, friends, because randomized controlled trials are what you are running in the form of modern A/B tests.

[00:11:34] So, essentially no one calls them randomized controlled trials but statistically A/B tests are driven by this methodology. So let’s delve into this very interesting story. 

[00:11:43] So R. A. Fisher was a biostatistician and he’s the father of modern experimentation. And what R. A. Fisher was one day trying to do was understand the impact of a fertilizer on the yield of his crops.

[00:12:00] So what he was trying to understand was: if I put this fertilizer, does it increase or decrease the yield of my crops?

[00:12:07] So what he did was he neatly divided his field into two halves and then sprayed the fertilizer over one half, did not spray the fertilizer over the other half. And as he collected the yield, he realized that these two different parts of his field, they were getting different sunlight, probably they had different quality of soil, probably they also had maybe different quality of water so water that was flowing into the different fields, maybe different amounts of water.

[00:12:38] So these were all the confounders that were biasing the result. So he concluded that from this initial study, he cannot get an apples to apples comparison. 

[00:12:48] So R. A. Fisher devised a very ingenious way, because there was no way to measure the sunlight or the various other confounders that there could have been.

[00:13:01] So he decided that I’ll counter this natural randomness by synthesizing randomness on my own, and that is the crux. So what he did was, he divided the field neatly into small, small squares and then he decided with a coin toss for every square whether the square gets the fertilizer or not.

[00:13:21] So this coin toss is very, very fundamental. Why? Because if he had just arbitrarily chosen where to put the fertilizer, then it might be that his own personal biases had crept in, and maybe the sun had caused the yield to increase in some squares and not in others.

[00:13:42] So a coin toss is fair that way. So the idea behind this whole thing was that if you are deciding with a coin toss which square gets the fertilizer and which does not, in other words, which goes into the treatment group and which goes into the control group, then you have virtually ensured that all the confounders get equally distributed on both sides. And pause a second here.

[00:14:07] So really, really, this is quite an interesting discovery because if you think about the first slide where I told you that when a scientist runs experiments, he is wanting to create a controlled environment where the other variables don’t change. 

[00:14:25] Here, Fisher has done something phenomenal. He could not create a controlled environment for a lot of things. He could not create a controlled environment to grow crops but how he handled, how he created this controlled environment in this case is by randomization. 

[00:14:41] By randomization, he ensured that all other confounding factors get equally distributed in the control group and the treatment group.
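
Fisher’s two designs can be compared in a small simulation. This is a minimal sketch with invented numbers, not a reconstruction of his actual data: `make_field`, `harvest`, and `estimate` are hypothetical helpers, plot quality is assumed to improve steadily from one side of the field to the other, and the true fertilizer effect is set to 3. Fertilizing one half of the field mixes the quality gradient into the estimate, while a coin toss per plot spreads quality roughly evenly across both groups.

```python
import random
import statistics

TRUE_EFFECT = 3.0  # assumed yield gain from fertilizer in this toy model


def make_field(n_plots=2000, seed=2):
    """Plot quality (sunlight, soil, water) rising from west (0) to east (1)."""
    rng = random.Random(seed)
    return [i / (n_plots - 1) + rng.gauss(0, 0.1) for i in range(n_plots)]


def harvest(quality, fertilized, rng):
    """Yield of one plot: baseline + quality + fertilizer effect + noise."""
    return (50 + 20 * quality
            + (TRUE_EFFECT if fertilized else 0)
            + rng.gauss(0, 2))


def estimate(field, assignment, seed=3):
    """Average yield of fertilized plots minus average yield of the rest."""
    rng = random.Random(seed)
    treated, control = [], []
    for quality, fertilized in zip(field, assignment):
        group = treated if fertilized else control
        group.append(harvest(quality, fertilized, rng))
    return statistics.fmean(treated) - statistics.fmean(control)


field = make_field()
n = len(field)

# Design 1: fertilize the eastern half only.
split_half = [i >= n // 2 for i in range(n)]

# Design 2: one coin toss per plot.
coin = random.Random(4)
coin_toss = [coin.random() < 0.5 for _ in range(n)]

est_split = estimate(field, split_half)
est_random = estimate(field, coin_toss)

print(f"true effect:          {TRUE_EFFECT:.1f}")
print(f"split-half estimate:  {est_split:.1f}")
print(f"coin-toss estimate:   {est_random:.1f}")
```

Note that the coin-toss design never measures plot quality. It only relies on chance to balance it, which is why the same idea carries over to A/B tests where the confounders are unknown.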

[00:14:50] So this is the core fundamental and this is very much carried forward into the modern A/B testing which we will discuss in the next slide. 

[00:15:00] So R. A. Fisher gave birth to the gold standard of causal discovery. To date, these randomized controlled trials, which are what you are running as experiments, are the gold standard of causal discovery, and if you can ever run an experiment, it’s best. That is what gives you the most trust, and only when you cannot run a randomized controlled trial do you look for other methods.

[00:15:28] So, let’s move ahead and see how this birth of the randomized controlled trial has impacted modern A/B tests. So A/B testing in the 21st century: I believe that A/B testing and scientific experimentation at large has become democratized.

[00:15:47] It has been made available for everyone, made easy for everyone. This has happened for two, three reasons. Let’s throw light on them.

[00:15:55] Firstly, data collection has become easy. If you see in the previous slide, R. A. Fisher is actually noting down in his notebook the yield that he’s getting from every crop and which square has gotten the fertilizer. Now, with the invention of computers, this has all become automated and scaled.

[00:16:18] So, you can collect data very easily at the click of a button for millions of customers and for different metrics. So that brings us to the second point that millions of signals that are coming in, now they can be clubbed together and combined into complex metrics that can be calculated. 

[00:16:38] Think about it without the invention of computers, all this metric calculation actually required someone to do them by hand or at the best by a calculator and people did not have the option to write code that automatically shows something to you on the screen. 

[00:16:55] So, complex metrics can now be evaluated. You have the choice to choose the metric that you are wanting to study and advanced companies that are experimenting, they often study the impact across various metrics.

[00:17:08] So, that is another thing, another advantage. The last advantage that I want to highlight and that has been historically very, very rare in all of science is a high fidelity environment for testing. 

[00:17:23] So what is fidelity? Fidelity is that when an experimental environment is close to the real environment.

[00:17:30] So, for instance, say someone is building a car and is wanting to test the strength of that car. So, they would create an environment where they’ll keep dummy humans or dummy trees and everything, and we’ll call it that they’ve given the controlled environment a good amount of fidelity.

[00:17:50] But 100% fidelity has never been possible in scientific experimentation because you could never actually create an environment that is very, very much like the real world environment in which the phenomenon will actually be happening.

[00:18:07] So if someone is say maybe designing a machine, they create various environments on a computer and that environment also coarsely fits the real world environment but not perfectly fits the real world environment.

[00:18:24] What are you getting in A/B testing? You are actually getting to test in the real world directly. So it’s very, very high fidelity. It’s almost close to a 100% fidelity environment for testing.

[00:18:37] So, why do I mention these three points? Because if you look at the historical context of scientific experimentation, A/B testing is very, very privileged.

[00:18:49] If you call a scientist from the 17th century and you tell them that, okay, we are scientists as well and we are running experiments in this sort of environment, he’ll be amazed. He’ll be like, how lucky you are.

[00:19:03] So, that is what I want to share that A/B testing, scientific experimentation has been democratized but the core fundamental methods and the fundamental inquiry is the same.

[00:19:14] Moving next, now I want to explore some philosophical aspects of scientific experimentation that have been formulated over the centuries and that I feel can help us become better experimenters. 

[00:19:33] So this is a fundamental theory, the theory of falsification and it was developed by Karl Popper in the 20th century.

[00:19:44] So it’s very interesting. Karl Popper says that anything that is scientific in nature has to have the possibility of being proven wrong.

[00:19:57] So the idea is that any hypothesis, suppose, so they call it the black swan problem. So suppose I make the hypothesis that all swans are black.

[00:20:07] Karl Popper says that any hypothesis remains true for a set of entities, it can never be true for all entities because you can never go on to explore all the classes of objects it applies to. 

[00:20:25] So this statement that all swans are black, this remains, I’m sorry, all swans are white. I am really sorry. But the statement that all swans are white, it remains true until you see a black swan popping up from the ocean.

[00:20:41] So no matter how much evidence you collect on all swans being white, you’ll never be able to prove 100% that all swans are white. There’ll always be a possibility that tomorrow a black swan pops up from the ocean and then you are forced to refute this hypothesis. And think about this.

[00:20:59] This applies to some of the very, very fundamentally rooted truths that you believe in. So suppose if you believe in that the sun rises every day. This hypothesis stands true only till the day that the sun does not rise. If tomorrow the sun does not rise, what will you do?

[00:21:21] You cannot go back to whatever math or experimentation or evidence that you use to prove this. It has happened for innumerable times in the history. 

[00:21:32] In history, in the past, it has happened multiple times, but tomorrow, if the sun does not rise, you cannot rely on evidence. You have to accept that there is something else, and maybe that is what happened when people first observed the solar eclipse. They realized that, okay, the sun does not rise every day.

[00:21:49] It rises unless there is a solar eclipse happening. So, the theory of falsification says that any hypothesis that is scientific has to be capable of being proven wrong, and it remains true only till it is proved wrong. No matter how much evidence you collect, it will never be 100% true.

[00:22:12] But one counter evidence that you get to that hypothesis will help you prove it is 100% wrong. So it is easy to be 100% wrong because you just have to find one counter evidence but it is impossible to be 100% correct because you can never ensure for all objects that the hypothesis that you have applies for those objects. 

[00:22:34] So, this is very, very interesting. It takes a while to understand but this has very, very strong implications for the experimental mindset, which I’ll come to. So, first let’s see the joke here.

[00:22:52] I’m very excited about this joke. So this joke says that there is no evidence that Karl Popper wasn’t born on July 28, 1902 and no one has proven that he didn’t grow up in Vienna. 

[00:23:03] So what the teacher is trying to tell the student is that Karl Popper was born on July 28, 1902 but why we believe this is because there is no counter evidence up till now.

[00:23:13] The day we get the counter evidence, we’ll be forced to give up this belief, and similarly for the fact that he grew up in Vienna. So, why this matters for modern experimenters is that I was asked by a fellow experimenter: can an experimental result ever be proven 100% correct?

[00:23:34] So if you have had a statistically significant winner, can you go on and can you rely on statistics or any sort of proof that this will remain true? The answer is sadly no. 

[00:23:46] The answer is that no experimental result that you get, will be true for the entire eternity. There can always be some complication that can come up or some observation that can come up later into the future that proves your result wrong and that has been what is happening in all scientific domains.

[00:24:07] Like papers written maybe 50 years ago. There are findings that prove those papers wrong. So it is futile to try to prove your hypothesis to be true. And I would want to let that be thought of very deeply, that we as humans have a tendency to basically build some beliefs and then find evidence to prove that belief to be true, and find supporting evidence for that belief but it is futile. 

[00:24:39] It is really futile that if you are generating hypotheses and if you have some beliefs, you can go on your entire life trying to find evidence to prove that belief and you still won’t be 100% correct. It’s very much easy to find counter evidence to those beliefs.

[00:24:56] So for all the beliefs that you have, you should regularly try to find the counter evidence of what makes those beliefs not work, and whenever you find the counter evidence, you can replace those beliefs with better beliefs. And that is really the scientific mindset that all scientists have realized they need to be working with: you cannot prove a model to be right.

[00:25:24] You cannot prove a hypothesis or an explanation to be 100% right. But you can always be in search of counter evidence, and the moment you find the counter evidence, you learn something from it. You learn what was wrong with your belief and you replace it with a better belief.

[00:25:43] So that is the broader mindset I feel, that needs to govern our A/B testing efforts and that is the cultural shift that we need to really become digital scientists from digital marketers and yeah, that is what I’m trying to convey here. 

[00:26:06] So, moving with this, I want to go on to the next slide and this is derived from that philosophical mindset that you need to be data driven and you need to use data to contradict your own self. 

[00:26:23] So, the idea is that you hypothesize, always hypothesize, how customers will behave and what the underlying phenomenon is behind what is happening on your website, and then experiment so that you can contradict yourself.

[00:26:41] Experiment. It’s great if you are wrong. It’s great if you are right. It’s great if your hypothesis works but you haven’t learned anything if your hypothesis worked. 

[00:26:51] You learn in experimentation when you find out that you were wrong. Either you thought something will not work and it worked.

[00:26:58] That is when you can now update your beliefs or you thought something will work and for some set of people or maybe for all your audience it did not work and that is the way you need to move and here is a very, very interesting joke. 

[00:27:15] For explaining that joke, I would want to first explain this term and the term is called HIPPO. I think a lot of people would be aware of this term. 

[00:27:23] HIPPO is the acronym for Highest Paid Person’s Opinion. So a HIPPO is someone who exercises the Highest Paid Person’s Opinion, and in the joke here, the boss is shown to be a HIPPO.

[00:27:41] He wants a dashboard application of metrics, but he will make the decision based on company politics. So he’ll only trust the data when it suits his hypothesis, and that is not the right approach.

[00:27:57] That is not the scientific mindset that we are trying to build in a culture of experimentation. So HIPPOs definitely have a great vision. That is why they land as CEOs, leaders, and senior members of various organizations. But HIPPOs have a very high chance of going bust in the long game because, as the stakes of the game increase and the game becomes very uncertain, HIPPOs often go bust.

[00:28:26] So that is the philosophical mindset that I want to impart in this talk. So with that, I would like to close the talk and all the best, good luck towards adopting the scientific mindset and becoming digital scientists from digital marketers.

[00:28:44] So thank you. For questions and comments, please feel free to mail me or reach out to me on LinkedIn.


Ishan Goel

Associate Director of Data Science, VWO
