The Experimentation Secret Sauce: A Multidisciplinary Story
Explore the experimentation industry with experts Firoz, Kees, Sander, and Denise, delving into how diverse, in-depth knowledge drives excellence.
Denise Visser, Product Manager of Team Experimentation at Bol.com, presents the company's approach to evidence-based decision-making through experimentation. The team, comprising experts in various fields, focuses on making experimentation easy and accessible for colleagues. They have developed an in-house experimentation platform tailored to their needs, supporting different business models. The platform offers functionalities like easy experiment creation, a comprehensive experimentation library, and a results page with automated data collection and interpretation. Team members share insights on building trust in results, the importance of transparency in data models, and the challenges of maintaining user stability versus statistical power. They emphasize the role of software engineers in implementing and evangelizing experimentation within the organization.
- The in-house experimentation platform is designed to be intuitive and seamless for various users, including business users, engineers, and data scientists.
- Transparency in data models and clear KPI definitions are crucial for gaining trust in the results produced by the experimentation platform.
- Addressing biases and ensuring proper statistical analysis are essential, especially in cases of user stability and quasi-experiments.
[00:00:00] Denise Visser: Welcome to this presentation. I was asked to give a presentation at this event, and of course I’m very happy to talk about why experimentation is such a great opportunity for a company like Bol.com to work evidence based. It can make an organization way more efficient and people prouder of their added value.
[00:00:26] But the success of team experimentation is not only because of me, it’s mainly because of the amazing team experimentation we are.
[00:00:33] Currently it’s a small team, but we are ready to grow. And all team members excel in their expertise and they all have the consultancy and educational skills to help our colleagues out.
[00:00:44] I’m very proud about the team and I feel very lucky that I can work with such talented experts every day. So, therefore, I propose that we do this presentation together, so you will see all of us on stage. I will introduce everybody to you later.
[00:01:01] As a product manager experimentation, I evangelize the organization from bottom up, reach out to data driven colleagues and make them curious.
[00:01:09] And they will be happy to facilitate an experiment. And also do a top down, make the managers aware of how much more effective an organization can be if all decisions are made based on data with the same definitions of success.
[00:01:24] So the objective for team experimentation has been for a long time make it ridiculously easy for our colleagues to do experiments.
[00:01:31] The moment somebody is willing to start an experiment, there should be no barriers to actually do it. And the two main facilitators are an awesome team as a center of expertise and a good platform. And I give a brief introduction of both.
[00:01:48] So first, let me introduce the team. So, Denise, that’s me. I’ve been with team experimentation since it started five years ago, and I’ve been with the company for more than 16 years, which also helps me find my way in the organization to drive change.
[00:02:06] We have Firoz as a data engineer, Kees as a statistician, Maarten as a tech lead, and Sander as a full stack engineer. And there is still an open seat for another full stack engineer, so if you are looking for a great job, please apply.
[00:02:22] All these team members will be on stage in this presentation to highlight why experimentation energizes them.
[00:02:30] And the other facilitator for success is a solid experimentation platform. At Bol.com, we decided to build an in house solution. This gives us the opportunity to make it a perfect fit within our own landscape. And we have around 200 scrum teams within Bol.com that support different business models of eCommerce, advertising, and logistics.
[00:02:51] And ideally, they all run experiments independently. It should be easy for them to start and stop an experiment and rely on the data that we provide. Just to give a brief introduction of the platform, here is a list of functionalities. So, simple create, start and stop experiments and also per experiment.
[00:03:10] On a high level you put a hypothesis in, you can start multiple runs for one experiment. There is an experimentation library, of course. You can filter on team, on day, on the phase of an experiment, on a specific date, but also on a specific topic to see what’s going on but also what happened in the past to see what learnings are stored.
[00:03:36] There is a calculator in the platform to determine the running time of the experiment up front, or the outcome afterwards. And what we also provide is an overview of the added value of all the experiments for the business. So that also makes it easy for us to reason about our reason for existence.
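As an illustration of what such an up-front calculator might compute, here is a minimal Python sketch. It is not Bol.com's implementation: it assumes a binary conversion metric, a two-sided two-proportion z-test, and an even traffic split, and the function names and default parameters are hypothetical.

```python
import math
from statistics import NormalDist


def required_users_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    baseline_rate: conversion rate of the control variant (assumed known).
    relative_lift: smallest relative improvement worth detecting, e.g. 0.1 for +10%.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def running_time_days(users_per_variant, daily_users, variants=2):
    """Days needed if daily_users are split evenly over all variants."""
    return math.ceil(users_per_variant * variants / daily_users)


if __name__ == "__main__":
    n = required_users_per_variant(baseline_rate=0.05, relative_lift=0.10)
    print(n, "users per variant,", running_time_days(n, daily_users=10_000), "days")
```

The smaller the effect you want to detect, the faster the required sample grows, which is why the running time is worth estimating before the experiment starts.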
[00:04:01] And of course, there is a result page on an experiment level, or a run level, I have to say. And if you ask me, the real candy is in the result page. So this is just a fake screenshot to give you an idea about what we want to achieve. So on the left side, you see a summary of the experiment and what is actually tested.
[00:04:20] So we rely on the users for the completeness of this. And on the right side, you find the fully automated data collection and interpretation. And Firoz will tell you more about how we collect the data. So I hand it over to Firoz.
[00:04:37] Firoz Ansari: Hi everyone. I’m Firoz Ansari. Within team experimentation, I mainly work as a data engineer, creating data for our tools.
[00:04:44] I would like to talk about the topic that has been one of the challenges while working within team experimentation. And that is building trust in results.
[00:04:54] You have seen the result page of experimentation platform. Apart from making your results presentable, it is even more important to gain trust of experimenters in our results.
[00:05:08] Now if you have curious minds in your organization, which is the case for Bol.com you’re going to be questioned on the results that you show on your platform. These can be regarding the numbers they see on the platform, such as how did you come up with these numbers? Why are they high? Why are they low? And so on.
[00:05:28] And often data analysts or scientists ask us to provide the query that has been used to generate these results. And the questions can also be regarding the definition of the KPIs. It happens that different teams have different definitions of their KPIs and, similarly, different technical implementations as well.
[00:05:56] So, let’s see what we can do about this.
[00:06:02] I will present a few ways we have taken or would like to take to build trust in our results. Firstly, one of the most powerful ways is to make the data models that generate the results transparent. This can be easily done by doing the data modeling in dbt. What dbt does is provide a visualization of the data processing from source to destination.
[00:06:29] This is beneficial for the experimenters’ understandability. They can even look at the queries we do during the data transformation. And they even come to us to suggest improvements to our own data models which is quite awesome. And we work together to make our data even more correct.
[00:06:51] The second is tool tips. So having these small icons near KPIs and results can help solve most of the questions. When the experimenter is browsing through the experimentation platform, we can include information such as the definition of the KPI or the formulation of the KPI.
[00:07:17] Let’s say: what is the alpha level for the test, when is my design going to be ready, etc. And lastly, we call it bring your own KPI. This is something we haven’t tried yet, but would like to. The domains in Bol.com are expanding, and with that come even more KPIs. Since we want to be the center of excellence for experimentation and not KPIs, we are looking for a more distributed approach.
[00:07:51] So everyone within Bol.com is an expert of their own domain, and hence we believe that they can define their KPI the best, functionally and technically. What we want to do is provide a step by step guide to the experimenter who wants to experiment with a new KPI. The outcome of this guide would be a data set that will help us stitch this new KPI fairly easily with the measurements.
[00:08:21] And since the KPI has been developed by the experimenter himself there’s usually more trust, regarding KPIs.
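To make the "bring your own KPI" idea concrete, here is a small hypothetical Python sketch of what stitching an experimenter-supplied KPI data set to the experiment's assignments could look like. The data shapes (a subject-to-bucket mapping and rows of subject and value) and the function name are assumptions for illustration only; the talk does not describe the actual format.

```python
from collections import defaultdict


def stitch_kpi(assignments, kpi_rows):
    """Join a team-supplied KPI data set onto experiment assignments.

    assignments: dict mapping subject_id -> bucket name (hypothetical shape).
    kpi_rows: iterable of (subject_id, value) rows delivered by the KPI owner.
    Returns the mean KPI total per subject for every bucket.
    """
    # Every assigned subject starts at zero, so subjects without KPI rows still count.
    totals = {subject_id: 0.0 for subject_id in assignments}
    for subject_id, value in kpi_rows:
        if subject_id in totals:  # rows for subjects outside the experiment are ignored
            totals[subject_id] += value

    per_bucket = defaultdict(list)
    for subject_id, total in totals.items():
        per_bucket[assignments[subject_id]].append(total)
    return {bucket: sum(values) / len(values) for bucket, values in per_bucket.items()}


if __name__ == "__main__":
    assignments = {"u1": "control", "u2": "treatment", "u3": "treatment"}
    kpi_rows = [("u1", 10.0), ("u2", 4.0), ("u2", 6.0), ("u9", 99.0)]
    print(stitch_kpi(assignments, kpi_rows))
```

Keeping subjects with no KPI rows in the denominator is a deliberate choice in this sketch: dropping them would compare only the subjects who produced the KPI, which is a different question.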
[00:08:35] So I’ve presented a few ways. I’m sure there are many other ways which still need to be explored by us. For now, that was my bit as a data engineer within team experimentation at Bol.com. I’m going to pass it to Kees, who is the data scientist in our team.
[00:08:54] Kees Mulder: Hi, everyone. I’m Kees Mulder, and I’m going to try to add to this story from the perspective of data science and statistics.
[00:09:01] So, my background is actually as a pure statistician. When I came into this field, my question was really, okay, how do people adapt these methods in this kind of new business context? What I was used to in classical statistics is to have small to medium size samples, but huge effects. For example, someone is sick, we give them medicine, they get better.
[00:09:28] That’s a huge and significant change that’s relatively easy to detect. Not everyone might get better, but most will. Easy to detect, even with small data sets. In the case where it becomes a little bit more difficult, we end up using a lot of very complex models in order to wring all the value out of the data.
[00:09:47] Compare that to experimentation statistics, where effectively we have big or huge sample sizes, in the same way we would find in machine learning, for example. But we don’t use all the very complex models we have in machine learning. Instead, what we’re looking at is tiny effects, and we use basically the same statistical methods we use in classical statistics.
[00:10:13] So it’s using a lot of data to look at tiny effects. As a result, the models end up being relatively simple. So where is the challenge? Where is the interesting part? The interesting part is in making sure that the system of analyses is not biased at some point. And there’s many, many ways in which people tend to incur biases.
[00:10:36] So I picked out two examples that I run into all the time in my work, where I think I can give like a nice, useful recommendation. So that’s what I’m going to try to do within the, maybe, three minutes I have.
[00:10:49] So first is user stability versus statistical power. So we have this idea of we want stability, which means that a user shouldn’t be getting a completely different page every time they refresh.
[00:11:04] For example, we know that that is a worse experience than either one of your two variants, because the user is learning how to use the website. And if everything is moving, they get very confused. We know this. So often in many trials, we try to keep the user in the same bucket, give them the same variant over time.
[00:11:25] Then we should, if you listen to the rules of statistics, basically, analyze on the level of the user. That means every row in our data set is one user. And we get some sort of aggregate metric and use that. But I see that analysts at our company are very often tempted and rightfully so to look at maybe sessions within a user, if they came back three or four times over the course of the experiment, or even page views, where every time they interact with a certain element.
[00:12:02] And it’s somewhat sensible, it feels somewhat sensible, because we have many interactions that we would like to leverage. That may be a much bigger data set, right? We might have 10 rows per user. We want to use that because it gives us much more statistical power. This approach is sometimes really bad. And sometimes it’s totally fine.
[00:12:22] And it depends on how different a user is from session to session or page view to page view. If they change their behavior completely, then it’s probably fine. But if they keep giving the same result on your metric every time, because they can be influenced once and that’s stable, then we have a big problem.
[00:12:43] So there’s this tension between getting user stability, statistical power, and doing proper statistical analysis. There are methods, by the way, to do this properly, but it’s more complex statistical methodology, and you really have to be aware of how to do this. My recommendation, know what people are doing in your organization and do the proper checks to see if you have this problem. Because if you don’t, it’s very easy to incur structural and very continuous bias in this case.
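One way to reason about when session-level rows are "totally fine" and when they are "really bad" is the standard design effect for clustered data, 1 + (m - 1) × ICC, where m is the number of rows per user and ICC is the correlation between rows of the same user. The Python sketch below is an illustration under that assumption, not the team's method; the function names and the row layout are hypothetical.

```python
from collections import defaultdict


def design_effect(rows_per_user, icc):
    """Variance inflation from treating correlated rows of one user as independent."""
    return 1 + (rows_per_user - 1) * icc


def effective_sample_size(total_rows, rows_per_user, icc):
    """Number of independent observations the rows are actually worth."""
    return total_rows / design_effect(rows_per_user, icc)


def aggregate_to_users(session_rows):
    """Collapse (user_id, variant, converted) session rows into one rate per user.

    This is the conservative route: analyze on the level that was bucketed.
    """
    totals = defaultdict(lambda: [0, 0])  # (user_id, variant) -> [conversions, sessions]
    for user_id, variant, converted in session_rows:
        totals[(user_id, variant)][0] += converted
        totals[(user_id, variant)][1] += 1
    return {key: conversions / sessions for key, (conversions, sessions) in totals.items()}


if __name__ == "__main__":
    # 1,000 users with 10 sessions each gives 10,000 rows.
    print(effective_sample_size(10_000, 10, icc=0.0))  # behavior changes every session
    print(effective_sample_size(10_000, 10, icc=1.0))  # behavior identical every session
```

With an ICC of 0 the 10,000 rows really are 10,000 observations; with an ICC of 1 they are worth only the 1,000 users, so a session-level test would badly overstate its certainty.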
[00:13:16] So the second situation where we should be aware of some biases is when we are doing quasi experiments. So it’s known that we cannot always do a pure experiment. For example, we had an important case where if you want to do an experiment on pricing, it’s not really possible to give some customers a better price than others for all kinds of reasons.
[00:13:39] This is really not an option. So what we do is we split products up and allow some of those to change, but that means our experiment becomes a quasi experiment because we have to measure the customer’s response to a bucketing structure that is at the level of the product.
[00:14:00] So what should we do then? What I see all the time is that people simply do some checks and analyses and don’t really completely think through how this quasi experiment should be analyzed.
[00:14:17] And in those cases it’s very easy to get huge biases or very low statistical power. Those two things happen all the time. So my recommendation for these quasi experiments is you really should invest in good methods, and as an experimentation team or as someone who is evangelizing experimentation, also be talking with these people, be looking at quasi experiments, because some of the most kind of interesting, fascinating, and useful insights in our company have come from really well designed quasi experiments.
[00:14:54] So do that as well. So with that, I’m going to give it to the next person. Thank you all.
[00:15:00] Maarten Zwart: Hi there. Thanks, Kees. My name is Maarten. I’m the tech lead for this product, for team experimentation at Bol.com. And I’d like to take the next five minutes to explain why I think my job is so awesome. Why it’s so cool to be a tech lead in a product like this.
[00:15:18] As an engineer, I am more of the functional type. So I get most energized thinking about how to create that intuitive, seamless, or as we like to put it, ridiculously easy experience for all our users. And if you look at it like that, working on the experimentation platform just provides so many interesting angles. Angles because of the different types of users we have.
[00:15:48] Yeah. We have business users. How do we want to make things seamless for them? We have application engineers and data engineers. How can we make things ridiculously easy for them? How can we make it as intuitive as possible for data scientists? And I’d like to go over every one of them. Just take on that angle for one minute each.
[00:16:11] Business users, they want an intuitive UI. We want to help them build up a knowledge base, a knowledge center where we collect all the learnings all those interesting experiments, all those hypotheses that we came up with and that we put to the test. And it’s not all as intuitive by default, right?
[00:16:40] Some of the things in experimentation are quite hard to grasp when it comes to sample sizes and all the statistical stuff, and then also boring stuff like administrating your work. So our goal here, our task, is to make it super intuitive and easy and fun to work with our platform so that they provide that content, because without their content, our platform is nothing.
[00:17:10] That’s a huge challenge, and a purely functional one. They also want results from experiments to be shareable with the rest of the organization. They might want to orchestrate their experiments with other business users. So that’s that angle.
[00:17:31] Engineers. Completely different angle. We have quite a vast technical landscape. Front end applications that are customer facing, with client server type applications. We have a lot of microservices, some of them running in domains where engineers hardly interact with engineers in other parts of the organization, because that’s how we are set up.
[00:18:09] And they don’t want to lose that autonomy for something like experimentation. So we need to set it up in a way that there’s the least amount of interdependency. And I think we achieved quite a lot in that area already. There needs to be easy integration because some services are built in Kotlin, some services are built in Go.
[00:18:56] So a completely different angle from the functional one with the business. And then, of course, we have the data scientists. They want to trust their results. We need to give proper sanity checks to them. They have completely different workflows. They work with notebooks and queries and data pipelines.
[00:19:14] And what we offer needs to seamlessly integrate with that. So sometimes we just want to provide bits of SQL or data sets that they can use as a starting point for their in depth explorations. And the things that we offer and that we want to share with the rest of the organization, they need to be able to explain them. And to trust them.
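The talk does not say which sanity checks the platform runs. One common example is a sample ratio mismatch check, which tests whether the observed split between control and treatment matches the intended split. The Python sketch below is an assumption-laden illustration with a hypothetical function name, using a chi-square goodness-of-fit test with one degree of freedom.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """P-value of a chi-square goodness-of-fit test on the traffic split.

    A very small p-value suggests the bucketing or the measurement is broken,
    so the experiment results should not be trusted until that is explained.
    """
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    print(srm_p_value(5_000, 5_000))  # perfectly balanced split
    print(srm_p_value(5_000, 5_300))  # suspicious imbalance for a 50/50 design
```

A check like this fits naturally in a notebook or pipeline because it only needs the two user counts, not the underlying events.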
[00:19:35] I hope this nicely shows all the different angles that we encountered when building this platform. Besides that, it’s just awesome to get so many different questions from the organization. There are so many different domains, all experimenting. And yeah, you really get the feeling that you stay in touch with a little bit of everything that’s going on in the company.
[00:19:58] So all that combined really makes my job awesome. And I go to work every day with a smile. Over to you, Sander. Thank you very much. Bye bye.
[00:20:12] Sander Boumeester: Hey, my name is Sander and I’ve been a software engineer in Bol.com’s experimentation team for over two years now. In this role, I have been contributing to what I like to call the experimentation toolbox. This toolbox helps various roles within Bol.com to experiment in a standardized way. It helps with designing, implementing, and analyzing an experiment while also serving as a research logbook. I want to talk a little bit about what it is like to be a software engineer in an experimentation team and what you can learn from that.
[00:20:45] I believe the perspective of a software engineer in any experimentation team is quite interesting and unique. Normally, as an engineer, you transform functional needs into technical solutions. Within an experimentation team, this goes one step further. Not only do you contribute to the toolbox plus platform you are building, as a software engineer you also become an evangelist, inspiring other software engineers in the organization to start experimenting.
[00:21:09] You actively encourage others to start experimenting and to deliver software based on validated hypotheses instead of assumptions. Besides that, as an engineer in an experimentation team, you become a sort of experimentation consultant. Whenever a team wants to start experimenting, they will have to think about how to implement an experiment.
[00:21:29] Often, this results in questions for our team, ending up with a software engineer to answer. What always amazes me is the variety of questions and different architecture styles that teams come with or have questions about. We have, for instance, followed along on how to implement experiments in frontends, backends, asynchronous workflows, cache services, and in a multi-armed bandit setup.
[00:21:54] What makes this extra interesting from Bol.com’s perspective is that we have different domains and business models with very different needs. Often, these challenges lie in the statistical/data science area.
[00:22:06] As a software engineer in an experimentation team, it is good to develop a certain statistical sense, knowing when to forward certain questions to someone like Kees, for instance, with deep statistical/data science knowledge.
[00:22:17] On the other hand, from an implementation perspective, it luckily almost always comes down to two things, measurements and bucketing. Measurements record the interactions of an experiment subject, while bucketing divides the subjects into control or treatment. This is where we circle back to our experimentation toolbox, in which we provide tooling to start implementing an experiment out of the box.
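The two building blocks named here, bucketing and measurements, can be sketched in a few lines of Python. This is not the Bol.com toolbox: the function names, the hash-based assignment, and the in-memory log are all assumptions chosen to show one common way to keep a subject in the same bucket on every request.

```python
import hashlib


def assign_bucket(subject_id, experiment_id, treatment_share=0.5):
    """Deterministically map a subject to 'control' or 'treatment'.

    Hashing the experiment and subject together keeps the assignment stable
    across requests and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{subject_id}".encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled into [0, 1)
    return "treatment" if position < treatment_share else "control"


def record_measurement(log, subject_id, experiment_id, event):
    """Append one interaction of a subject, tagged with its bucket, to a log."""
    log.append({
        "subject_id": subject_id,
        "experiment_id": experiment_id,
        "bucket": assign_bucket(subject_id, experiment_id),
        "event": event,
    })


if __name__ == "__main__":
    measurements = []
    record_measurement(measurements, "user-42", "new-checkout", "page_view")
    record_measurement(measurements, "user-42", "new-checkout", "order")
    print(measurements)
```

Because the bucket is a pure function of the two identifiers, any service in any language can reproduce the same assignment without calling a central component, which matches the autonomy goal described earlier in the talk.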
[00:22:41] I believe what makes this toolbox, in the case of Bol.com, so powerful, is that when using the provided tools by our team, anyone that is experimenting can potentially benefit from the result page that Denise mentioned before.
[00:22:55] As a last note, I want to note that I believe that it is crucial to involve software engineers in an experimentation team and in any experiment setup from the start.
[00:23:07] I think someone around that needs to implement the actual ideas that others come up with can be vital to make your experimentation platform or experiment implementation a success. In that sense, I believe that we as Bol.com are in a good position having our own centralized experimentation team and in house built experimentation tool while teams can run experiments autonomously. Back to you, Denise.
[00:23:35] Denise Visser: That’s it. Thanks, team, for all your stories. I hope you as attendees liked it as well. And if you think your team looks similar to ours, and you think there are interesting thoughts and experiences to share, please don’t hesitate to reach out to one of us. You can find me on LinkedIn. See you next time.