How Booking.com Manages Large-Scale Experiments

Curious how a business with millions of visitors across the world runs experiments at such a large scale? Get your answers in this interview with Lukas.

Summary

Lukas discusses the evolution of Booking.com's experimentation platform over 15 years and its role in product development and company culture. Initially, the platform could handle only four experiments; through iterative learning and scaling, it now supports thousands. Key to its success is the dual data pipeline system, which helps identify discrepancies and improve data accuracy. Lukas emphasizes the importance of adapting tools based on actual usage and the points where they stretch, rather than building something perfect from the start.

His role involves not only overseeing the infrastructure and methodologies for experimentation but also training new employees, as experimentation is still uncommon in the industry. He highlights the significance of a company culture that encourages healthy debate and challenges decisions, exemplified by their peer review program. Lukas also touches on the broader analytical activities at Booking.com, including causal inference. Finally, he stresses the importance of leadership being open to data-driven decision-making, putting the product and user experience above personal opinions or ego.

Key Takeaways

  • Dual data pipelines: Redundant systems in data infrastructure help with cross-verification, leading to more accurate and reliable results.
  • Usage-driven tooling: Tools should be developed based on real-world usage and needs, rather than theoretical perfection.
  • Ego-free leadership: Effective leaders prioritize data and user experience over personal biases, demonstrating a willingness to be proven wrong by data.

Transcript

0:06

Sameer: Hi! Welcome to ConvEx, where we are celebrating experiment-driven marketing. My name is Sameer Sinha and I head the revenue function here at VWO. If there are leaks in your conversion funnel, I'd invite you to come to the VWO website and give it a shot. Today, I'm very excited to have Lukas Vermeer, who is the Director of Experimentation at Booking.com. Lukas is a great public speaker; I've had the privilege of being in the audience at one of his talks and, as I was mentioning to him before we started, it was very hard to get even a couple of minutes with him for a chat because he was surrounded by an audience who loved his talk. Really glad, and a privilege, to have you with us today, Lukas.

Lukas: Glad to be here. Thank you for your kind words.

0:58

Sameer: All right, so let's move on, Lukas. I'm well aware that you are the Director of Experimentation at Booking.com, but for the benefit of our audience, could you please describe your responsibilities a little and the priorities you are working on?

Lukas: It’s a good start. I’m Lukas. I’ve been at Booking for about six years now, little bit longer. I joined when it was a little bit more of a scrappy startup starting to become a bigger player, but now Booking is obviously a bit of a giant. And my role as the Director of Experimentation has been for the last five years to help the company learn from customer behavior and to learn how to build a better product for customers. To do that, we have our own tooling, our own support function and we have a group of people here who are building our own experimentation platform.

1:58

Lukas: We started doing that many, many years ago, when there was nothing available off the shelf that we could buy, so we essentially have our own internal platform, and I'm responsible for experimentation within the company in the broadest sense of the word. On the one hand, I'm responsible for the infrastructure and the tooling that people use to run experiments – that's what these people mostly focus on – and part of that is also which methodologies we implement in that tooling.
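
Lukas goes on to mention randomization functions as one of those methodology choices. As a hedged illustration only (this is not Booking.com's actual code), a deterministic hash-based assignment is the standard building block such a platform might expose: hashing the visitor ID together with the experiment name gives every visitor a stable variant, independently per experiment.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a visitor to a variant for one experiment."""
    # Hash the experiment name together with the visitor ID so assignments
    # are stable per visitor but independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same visitor always lands in the same bucket for a given experiment:
assert assign_variant("user-42", "new-gallery") == assign_variant("user-42", "new-gallery")
```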

2:25

Lukas: What sort of randomization functions do we support, what types of variance reduction techniques do we use, and so on, as well as the metrics that are included in the tooling. We have APIs for teams to create their own metrics, but we also have a lot of standardized metrics to measure the impact of experimentation across the business – things like revenue and return rate. And then lastly, my responsibility is training, helping people skill up so that they use the tooling and understand how experimentation is used. Sadly, I think it's still the case that experimentation is not very common in the industry, so whenever we hire new people, we often have to start from the basics and explain why experimentation is important and how we use it as part of product development. And training is one side of the coin: you can train people in the classroom, but once they have the skills, they also have to be able to apply them within the social fabric of the company. So I think a lot about the culture we have as a company – one that allows people to speak up, to have debates, to challenge decisions that other people have made – the cultural aspects of experimentation. To give you an example, one of the things that we work on is not a product at all.

3:51

Lukas: It’s a program called the peer review program, where people can sign up to be part of a group of people who every week get paired up with another peer reviewer to randomly pick an experiment that is run by a team that they don’t know and try to give feedback that is useful to the people who are running the experiments. So you read the description, you look at the variation and say ‘hey, have you considered looking at this particular metric or why did you pick this particular variation or we don’t understand how you came up with this particular hypothesis.’ And so we try to encourage this sort of healthy debate amongst product development teams without becoming a gatekeeper, right? We are simply an enabler and so that’s the experimentation side of my job. Obviously experimentation is a tool that we use but it’s not the only way that we do analytics. Bookings is a very data-driven company. We do a lot of data analysis and a lot of it revolves around causal inference, which is a wider field than just the experimentation.

4:53

Lukas: So I also play a part in that wider analytical community reviewing people’s works, reading white papers, etc. Mostly around trying to understand how these changes that we’re making and the products that we build influence the overall customer experience and how we can improve the user experience. And then lastly, I think my role is, and that’s why we met at the conference, to hire great people to come work for Booking, because we are always looking for people. I think the best way to get good people to join us is to show the good work that you’re doing and to show the level that you’re at and then hopefully the people who are interested in this topic and want to work on causal inference at scale, they’ll come to us.

Sameer: That's terrific, and a lot of nuggets there, Lukas. I was taking notes the whole time.

5:53

Sameer: I love the social contract and the peer review process that you mentioned, to be honest with you. As far as the process of experimentation is concerned, that part is sort of obvious, with a few tweaks here and there. What's missing in the industry is that there's a stop-and-start culture with experimentation. People get started expecting quick results, and when the quick results don't happen, the organization as a whole tends to lose interest. So it's about building a culture of experimentation and making it almost a way of life, right? I think that's what you mentioned: it's almost a way of life at Booking. It's no longer something that you have to do; it's the way of doing things.

6:40

Lukas: Yeah, I think so, and in that presentation that you saw at […], what it really boils down to is that it's almost pointless to run experiments if you're not willing to be wrong.

6:57

Lukas: And it’s even more pointless to allow teams to run experiments if you’re not willing to listen to what they then find. And so this puts an enormous onus on leadership because it means that as a director, CPO or CEO, you need to be willing to say that I think this is the direction our product should go but I’m not sure, I don’t know and please show me how I am wrong, what assumptions I’m making that are incorrect.

7:29

Lukas: This is a very vulnerable position to take as leadership because you’re essentially saying that you don’t know for sure that something is going to be the right thing to do and it takes a very special type of leader to say- “What I think or my ego is less important than that this product is a good user experience and that users want to use the product”.  And to put your own opinions and ego below the data that the people in your teams are going to find is something that I’ve rarely seen and I think that really is the hallmark of an exceptional leader that you’re putting the cause ahead of your own ego and I’m happy that something that we have here. But, it is also something that you have to protect and so when I talk about cultural experimentation- that to me is included in culture experimentation because without that boundary condition, there’s no point in running experiments. When you say ‘Start and stop’, I think what I see sometimes happen is that people say- “Well, we want the benefits of experimentation, but we’re not willing to let the data change our minds”, and that doesn’t work. 

8:35

Sameer: Or you want results without the diligence, or results without having to fail for them, which is almost impossible, right?

Lukas: That’s part of my presentation and something that is a recurrent theme in many of my talks and a blog post would put as well, is that I see a lot of people put emphasis on the value of  permutation as trying to estimate the value of the winners. And I think that’s partially where the value comes from but that is very largely missing a much wider point, which is that learning where you fail and what things customers do not want has tremendous value. Figuring out the things that people are not responding to, the things where you were wrong about how customers were using your product, or you were wrong about what they expected from you is an extremely valuable thing that I cannot really put a number on, right? The six months I don’t waste working on a product that doesn’t work – what value does that have?

9:41

Lukas: This is something that I think gets lost in the rhetoric of experimentation as a value machine. It’s much protecting against doing the wrong thing. 

Sameer: Yeah, and again, something that we see all the time, Lukas. We have in excess of a billion experiences being optimized on our platform, and time and again what we observe is that people simply don't put in the diligence to plan and create truly data-driven hypotheses, and instead rush into experimentation based on preconceived notions – the HiPPO concept – which is so obvious. And despite all the evidence to the contrary, we continue to make the same mistakes.

Lukas: So, I’m curious how you feel about this because it’s something that I’ve been thinking about for a long time. And since you were on a platform like this, so I think in my mind one of the things that’s happened is that with the experimentation tooling that’s out there now, it has become much more easy.

10:41

Lukas: To run an A/B test to the point where it’s almost so easy to run an experiment that it’s easier than doing proper user research, proper QA, proper diary studies, draw in a panel, ask them questions- that’s all too complicated. Let’s just run an A/B test. So, it was partially and I’ll be a little hyperbolic here, you are responsible for this. You have created the incentive system where it’s easy to run an A/B test than to do due diligence. Do you see any merit to that position? 

Sameer: Yeah, so I’ll tell you what our take on that is and you’re absolutely right. I think technology in general is a double-edged sword. You know the power is intoxicating and the power can easily be misused and we have seen this. Again, we’re stepping into the realm of philosophy, but we’ve seen this happen time and time again. From VWO’s side, Lukas, what we’ve done is we have integrated an entire platform and actually implemented something like a kanban product project management dashboard into the system to really help people ease out so people will not do what they find difficult. What we’ve done is we’ve made the process of doing creating the hypothesis really simple. All right, so you’ve got an integrated user research and it’s very very simple to look at that research, draw observations, aggregating those observations and sort of putting them into a bucket of hypotheses. Looking at some other websites, best practices that others are doing, again capturing those observations in terms of screenshots, making them available, all of them together so that you can create a hypothesis. So we have really spent a lot of effort into making sure that this entire CRO process is automated and is simple. However, at the end of it, the system is as good as the person using it. 

12:38

Lukas: I mean, it sounds a lot like the direction we've been taking, and it goes back to an earlier point we've been making: the mechanics of running an experiment and the methodology haven't really changed for the last 80 years, right? The statistics are still the same. Yes, you have to be diligent about how you collect data, and you have to make sure the data is legit and that your mathematical functions work, but those problems have been largely solved. The larger problem is helping people who are not scientists figure out: how do I write a hypothesis? How do I find the right supporting evidence for a hypothesis? How do I figure out which metrics I can use to support it? All of these things also influence, to a very large degree, the quality of an experimentation process, but they're much more fluffy and depend a lot more on the user. So a lot of the work we've been doing really is helping our users use the existing tooling better, rather than improving the tooling, and I see other parties in the industry doing the same thing.
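
To make the "statistics are still the same" point concrete: the workhorse here is decades old. Below is a minimal sketch, with made-up numbers (neither VWO's nor Booking.com's code), of a plain two-proportion z-test comparing conversion rates in a control and a treatment.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """z statistic and two-sided p-value for comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided p-value

z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # compare p against a pre-chosen alpha
```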

13:39

Lukas: So, I think everyone has sort of come to the point where we can no longer really improve the mechanics of the of the tooling, but we’re sort of, not to get too philosophical, but these tools are being used to optimize the user experience. We want to make sure that our product works the best for our users and that’s why we’re testing our implementations. We want to make sure that they do what we want them to do. Now at some point we started looking at ourselves and started thinking of our own tooling as a product and who said who are our users? Well, they’re the people who run experiments. How are they using the tooling?  Do they understand what they’re doing? And so at some point we’re turning around so that we can apply the same methods and try to figure out where there are two lines actually helping our product teams improve the product in the way that we think it is. I think that was a really key turn around moment for us at least, to think of us about the product development that way.  

Sameer: And another thing that comes to mind here is that creating a hypothesis is not an individual effort. You spoke about the peer review process and really opening yourself up, and that's something we've incorporated in our platform as well: as part of the process itself, you can invite feedback from other users, comment, and look at everything together. Because, again, something we realize is that the more we can democratize the process, the better off we all are, and the better our chances of getting a better conversion at the end of it, because that's what matters.

Lukas: Yeah, so I wonder what you think about this, because I've had second thoughts about using the word democratization. At the time, I thought it gives the sense that we want everyone to be able to use these tools. But then I realized there is actually a second meaning to democracy, which is that every vote is equally important, and that sort of undermines the concept of being data-driven, right?

15:44

Lukas: It doesn’t matter how many people are voting for a particular result, everyone can use these tools and you can have an open debate about what the results mean, right? But it doesn’t mean we get to vote on who wins. It was not like 20 votes for A and 21 votes for B. So let’s do B. That’s not how this works. 

Sameer: I agree, and honestly I like the word, because it gives everybody the feeling that this is not an elitist kind of thing. For a very long time, this whole thing was very elitist: only a specialist could do it. So I like the word, because 'democratize' for me means there is a set of tools, methodologies, and insights available that makes it possible for just about anybody to do it. You don't need to be a Booking.com to run a successful A/B test; a very small travel and hospitality website can also do it, and almost to the same level. So I like 'democratize' in that sense. Democracy for me is transparency and debate; it's about everybody having a say. And I think, in a way, the great part about A/B testing is that now the user gets to vote.

Lukas: That’s a hidden meaning, right? That’s what I didn’t point out. But the hidden meaning of the maximization is that you actually let your customers vote with their feet on- What is the right product for them. And I think this is one of the reasons really that booking started running experiments. We said we want to put the customer at the center, we want to give them a way to tell us what they need, what they want. And we do that by running these experiments where we explicitly look at- what is it that they are doing with these changes that we make to the product. You get these really interesting debates within where when you ask customers, they will say one thing and when you expose them to experience they will do another thing and then you have this debate of ‘Are we going to do what they say they need or are we going to do the things that will help them find exactly the right accommodation for them and it’s I mean, this is a nice internal friction.

Sameer: Yes, absolutely! I think you have to strike a balance. Sometimes there is a difference between satisfying a need and making people happy, and I think that's what A/B testing helps you do. In the end, we are helping the customer find what they want in the shortest time possible and make the best choice.

Lukas: There’s an added complexity, I think, for a product like Booking.com. What we’re selling is an experience of staying at a property or experiencing the world, going out there and seeing different places and so in the wider sense, what we are selling is not on our website. What we are selling is going out in the world and seeing things and that’s not going to happen on our website. Our website is only a tool to get you there. And so what do we optimize for?  Do we optimize for your experience of the tool? Or for your experience out there? And I think we should be optimizing for that- the second thing right? We should be helping people go out door and experiencing the world. That’s the thing that we should want to try to encourage. That creates some friction between how important is then the experience of the tool itself respective to the experience of going out there and the world. 

Sameer: Fantastic, great! I'll move on; there are some questions that we have over here. You spoke about giving internal teams accountability and ownership, right? So I just want to ask you about the team structure you have, Lukas; I think structure is very important. How do you structure your experimentation teams? Is it by product, or do you have a standalone experimentation team? That would be something very interesting to find out.

Lukas: At the team level, teams are very heterogeneous. We try to make sure that every team has the right skill mix to independently execute on its task. We don't want a separate IT department and a separate design department which then, in order to execute, have to talk between themselves with management layers in between. We want those people on the same team so that together they can execute against the mission. The mission itself really depends on the area. We have areas with a very mature part of the product that can be optimized.

20:37

Lukas: And it requires small step iterations- trying to find just those tiny things that are still wrong with it and to increment on that, but we’ll also have parts of the product that are very large green fields where there’s a lot of unknowns, a lot of things that we don’t even know – what the product should look like and there they’ll do much larger step iteration, much bigger projects. And then there’s this more fluid, product-based but more customer-centric, problem-focused teams that are attacking a very specific problem or a set very specific audience where we know these people are struggling with the current product, but it’s not going to be fixed with small changes. It’s going to be a series of changes along the entire customer journey. And so we’ll build a team around this particular group of people and say well – people who have dogs. There’s probably lots of things or parts of the product that could be improved for this particular audience. And so we might build a team around this and say go talk to some dog owners and figure out how they’re traveling, how our product is not feeding their needs and build some hypotheses on how we can improve the product for this particular audience. But again, that team would be a heterogeneous team. So, they would probably have a legitimate user, researchers, developers, designers, copywriters- everything that they need to execute.

Sameer: Like a mini project team in itself. And everything is sort of together.

Lukas: … and […] as possible. I mean you want to minimize the dependencies between teams so that these teams can really execute on their own without having to do a lot of communication between them.

22:28

Lukas: I think experimentation is a nice enabler there because if you think about democracy and making things is visible, then it should be visible what these people are doing and that’s sort of removes the need for a lot of communication because I can already see what you’re doing, I don’t need to talk to you to understand what you’re doing, I can see where you’re aiming for and I can see what decision you made. And so a lot of these communication barriers are removed by having centralized system. 

Sameer: And I think it also creates a platform for a common objective. The objective is experimentation, like you said. The objective is not to do IT, the objective is not to do design; the objective is to get together, get the product out, get it in front of users, and get real-time feedback on it.

Lukas: The objective is to solve known customer problems: to go out there, find something users are struggling with, and solve it. But the challenge is that in this field we work in, there's a barrier between us and the user, and the barrier is the internet. We cannot see the customer. We cannot talk to them directly one-on-one, at least not all the time.

23:35

Lukas: And so whenever we try to solve a problem, what we’re really doing is we’re saying we understand now what the problem is and we’re going to implement the solution and we think that the solution solves that problem and all that the experiment helps us try to do is measure whether the solution actually does what we think it does.

23:55

Lukas: It seems trivial, right? My background is engineering; I’m a computer scientist and I think it’s engineering hubris to think that we know exactly what the product is going to do. It’s something that by my very training is something that I had to overcome and understand that whenever we build a solution it might not actually do what we think it does,  it might not work. 

Sameer: One thing we wanted to understand, Lukas, is how you get this across to the people within Booking.com; you've alluded to it a little bit. Sometimes there is an energy to rush into a solution. How do you continuously convey to people that they need to experiment, that they need to prove it, before they put that solution in place?

Lukas: So I get this question a lot, and to be honest, I don't really have an answer. Rewind six years: I was a consultant, not yet working for Booking.com, trying to help companies understand that this is a thing, that this is how you need to operate, that you have to validate in production that the changes you make have the impact you think they have. That's what I was trying to help companies do, and I was making very little headway, because what I found was that companies had already decided what the product was going to be before they had even figured out whether that's what customers wanted. There's a lot of hubris of 'we already know what our customers want; we only need to build it.'

25:32

Lukas: And I had personally given up on that and then I ran into a man who worked for Booking.com and we talked a little bit about the product philosophy here and I realized that they already got it. And so the reason I joined Booking.com was because I was looking for that and I found it here. So, if you ask me, how do you create that? I have no idea, honestly, I don’t. 

26:01

Lukas: That is something I’ve spent my last five years here trying to protect to make sure that as we grow as a company and really we’ve grown a lot. And when you bring in new people, how do you protect that philosophy? How do you protect that culture and not lose it as you as you scale out? If you have a company that’s already at this scale and that doesn’t have this, I would really struggle to think how you would teach people. To bring it back to Booking, the way I’ve been trying to do that is by doing a lot of classroom education. So, there’s a lot of discussion groups that we have where it’s not a lecture where I just talk for a day, but it’s a very interactive session where we discuss difficult decisions. We discuss particular examples of experiments where the discussion could go either way and I tell I tell the people in the room that we are going to debate this experiment.

26:59

Lukas: And if all of you say ‘A’ and I say ‘B’, and I will be able to convince every single one of you that it should be ‘B’. But if all of you say ‘B’ and I say ‘A’,  then the same thing will happen, right? And also because this is in the middle and so we can have the base either way and what is important is not that we make the right decision here, but that we keep the customer at the center of that debate and we have a good way to discuss what it is that this is trying to do and why we think it’s good or not. And that is something that I really try to encourage the debate around, and to create openness within Booking.

27:38

Sameer: Makes sense. So, a very well-known fact is that you probably run hundreds if not thousands of concurrent experiments at Booking.com. How have you designed the infrastructure for that? It's something that comes to mind because people generally struggle with even a handful of concurrent experiments.

Lukas: I think the answer is that it has taken us 15 years, and maybe five or six iterations. This is not something you build from the get-go. In the first version of our experimentation platform, 15 years ago, the maximum number of experiments was four. It couldn't support more than that, and we actually didn't think we would ever need more than four.

28:25

Lukas: I wasn’t there, by the way, but so I have all of this on hear say. But, the reality is that you build a platform for four and then you learn along the way all of the ways that that’s going to break and then you scale up to 40, and then you scale up to 400, and then you scale up to four thousand.  But, every time you scale up, you learn about what are some of the things that are breaking and how we will avoid that next time. So, this is really an iterative process. I don’t think what we have now, we could have built or designed from the get-go. And one of the things that (this is also in the democratization paper) has really been a step change for us is the fact that we have two independent data pipelines. So, our infrastructure has a lot of redundancy built-in on purpose and that allows us to double check all of our own findings.

29:29

Lukas: And this is a like an instant bug hunting machine because the moment that the two sources of truth disagree, there must be a mistake somewhere and we use this to actively hunt for discrepancies and we don’t see those as problems. We see those as opportunities for improving. So, we’ve taken this really as a way of improving our internal data pipelines and thinking about why is it that these two different methods disagree and how do we make sure that they don’t disagree going forward. And then you just build on top of that until it breaks, and then you rebuild it from scratch. 

Sameer: I think iteration – making small changes, seeing what works and then adding on top of it, rather than trying a big-bang, all-in-one kind of approach, right?

30:27

Lukas: I’m going to say it’s very similar to the way we build our product as a whole. You look at how people actually consume it what they’re actually doing with it. You look at the things that are actually stretching and not scaling and you attack those rather than start from first principles and then build something that’s perfect. You go with what people are actually using, what things are actually making a difference and then you invest on those and we do the same thing for internal products that we do for our own product. We have landing pages that are just there to see how much people land on them. 

Sameer: You’ve spoken in between about keep doing it till it breaks and that sort of rang a bell in my mind because I hear a lot of people who are not using A/B testing or conversion optimization and are worried about doing it because they think the user experience will break and I’ve heard them say this a few times. What would be your message to them? You know, to people who worry about the user experience breaking when they deploy a test. 

Lukas: I remember a talk from Craig Sullivan at some point, who said his dirty little secret for CRO was that he would just find bugs on websites and fix them. It really wasn't more complicated than that: you take a product, you find where it's broken, and you fix it. He showed an example of a website where, if you clicked a particular link, the entire website became unusable. I think it opened the gallery, and the gallery couldn't be closed anymore, so basically the entire website becomes unusable at that point, and your only option as a user is to close the browser or tab, open a new one, and start from scratch. That's clearly a broken user experience, and it makes you wonder how it ever ended up on that website in the first place. Because if this new feature had been launched as part of an A/B test, you would immediately have seen: wait, this is weird; we added a new gallery and no one is buying anything anymore.

32:25

Lukas: So, merely that would make alarm bells go off and you would reconsider implementing that but the fact that they had this in the website makes me think that it wasn’t part of an A/B test at all. So, I think what you mean is that people are afraid that rapid change will break the user experience. Something that often comes with A/B testing is that people will try lots of different things and I think the concern is that’s the thing that’s going to break the user experience because now it’s inconsistent or whatever term they use but I think that’s a different problem. It’s a problem of rapid change, it’s not a problem that experimentation. I think even if you make a few changes, you can still use experimentation to make sure that they’re not breaking the user experience. In fact, like I said, I would argue that if you’re not running a big experiment you have a bigger risk of breaking bad user experience as you are not actually checking whether user experience is broken. 

33:21

Sameer: I think experimentation helps you limit the audience that will be exposed to such an error. Instead of exposing it to the entire audience, when you do an A/B test you're at least limiting the people who will be exposed; an early warning system, if you will.

33:36

Lukas: Yeah, I mean, we wrote a blog post about the circuit breaker, a system we have internally that automatically stops tests when users see error pages, and under the hood it runs as A/B tests. There's no developer involvement here. It means that if, as a developer, you write a new feature and put it on the Booking.com website, and within the first seconds of exposing it to users we see them hitting more 404 pages or more 500 pages – basically, the product becomes unusable – then we say we cannot conceive of a situation where this would be a good thing. There's no way that 404 pages are going to improve the user experience. So at that point we automatically pull the plug and say: I'm sorry, this is not good; go back to the drawing board or reconsider how you're implementing this. And that's only possible because it's an A/B test.
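
A simplified sketch of what such a circuit breaker might look like (the thresholds and interface are assumptions, not Booking.com's implementation): compare the error-page rate in the treatment against the control shortly after launch and kill the experiment automatically if errors clearly spike.

```python
def circuit_breaker(errors_control: int, n_control: int,
                    errors_treatment: int, n_treatment: int,
                    max_ratio: float = 3.0, min_errors: int = 20) -> bool:
    """Return True if the treatment should be stopped immediately."""
    rate_c = errors_control / max(n_control, 1)
    rate_t = errors_treatment / max(n_treatment, 1)
    # Require a minimum error count so tiny samples don't trip the breaker.
    return errors_treatment >= min_errors and rate_t > max_ratio * max(rate_c, 1e-9)

if circuit_breaker(errors_control=5, n_control=5_000,
                   errors_treatment=120, n_treatment=5_000):
    print("404/500 rate spiked in treatment: stopping the experiment")
```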

34:30

Lukas: And so this to me is a huge value of this way of development is that you actually protect the user experience and you protect users from mistakes that we are inevitably going to make because obviously we still have QA we still have testing all of these things, we do.  But it’s still possible that the piece of code that hits our production servers just doesn’t do what you expected to do. Every day, every week on the news, you see some website that breaks and I wonder how this could have been detected earlier if there was experimentation at all.

Sameer: Yeah, and testing alone doesn't do it, because it doesn't expose the situation to a wider variety of users and there is no real sampling involved. I think sampling is so important, and that's where an A/B test is better than your conventional dev testing. Would you agree we need both?

Lukas: You need both; it's not either/or. I think there's huge value in testing things before you put them in production, because it gives you a tighter feedback loop. You get more immediate feedback as a developer when the CI pipeline breaks, right? With continuous integration, you push something and the pipeline breaks and says: this is what's wrong, this test has failed. That's better than not doing any QA testing. However, there will always be limits to what you can test this way, especially if you're building a new product, because these test suites are designed to make sure the product still does what it's supposed to do. But when you're changing the product, you're changing what the product is supposed to do, and you have some assumptions about why that's good. So if you think about, say, changing how a certain button works – whether it opens a new page or a new modal – then at the integration level you can test whether it does indeed open a new modal instead of a new page.

36:31

Lukas: But you cannot test whether actually users understand that that’s what’s supposed to happen. Whether users will then respond in the way that you think they will or whether they just closed the modal assuming it’s a cookie box or whatever. And so you can test more of the technical aspects of your change, but really not the behavioral ones of how people are going to perceive this change and that’s the core here and. And QA testing will never be able to do that. So, I think you will always require both- QA before production and something in production- be it a be testing or anything else. 

Sameer: Yeah, absolutely. A great point you made about not being able to test the behavioral aspects of the change. If we already knew what people would do, we wouldn't have to test at all.

Lukas:  Life would be so easy. 

Sameer: Life would be so easy. And I think it also connects to the point you made earlier, that we may not even be a representative sample. So who are we to decide whether a change is right or not? There are so many websites where the intended audience is not us at all. For us to think that we know better than the users is almost arrogant.

Lukas: Yeah, it's good that you bring that up, because one of the things I like about working here is that Booking has realized this and has very consciously tried to make the mix of people in this building more representative of our users. We're a travel company; people go all over the world, and we know that assumptions are going to differ across the world. The way currency is displayed, to take a very simple example, is different across the world. The way taxes are interpreted is different across the world. And in this building alone – our headquarters here in Amsterdam – I think there are a hundred-plus nationalities, and that's not by accident. It's a very conscious, deliberate decision to want as much diversity as possible, as many different people with as many different backgrounds as possible, within this building, so that we are more representative of our user base. And still, you are right: we are not a fully representative sample of the people who use Booking.com. There are lots of people who are retired and travel a lot (that's one of the prerogatives of being retired: you can travel more), but those people don't work for Booking.com, because they are retired. We will never have that particular segment of users in this building, and so we still need to validate that the changes we make are relevant to that audience.

Sameer: Great thought. I think it's very difficult to serve global customers unless you're yourself representative of them, like you correctly mentioned. So it's very interesting that the diversity is really by design, a conscious decision.

Lukas: Yeah, we say diversity gives us strength, and we honestly believe this is one of the things that makes our company culture strong: we have so much diversity that we actually get challenged on these assumptions every day. If you have a monoculture where everyone has the same background, you don't think about these things – about how someone might interpret your copy, or your photos, or the colors you use, in a different way – and I think that's one of the great things about being here: you actually get challenged on it. You go home and think: oh no, that's not normal; other people don't think this is a thing. At first you think that's weird, but then you reflect on yourself and think: I'm weird. My background is Dutch, and this is my home country, but most of the people who work here are not Dutch, and every once in a while I get questions like, 'Is this really what people do here?' And I go, yeah! And it's the simple things, right? Like how you celebrate a birthday.

40:30

Lukas: It’s a silly thing. Well, obviously you all sit in a circle and you shake everyone’s hand to say hello, and then you eat cake and then you go home before dinner, right? That’s how you’re supposed to celebrate a birthday. But apparently that’s a very Dutch thing and no one else in the world celebrates birthdays that way. But that’s good and I think it’s a wonderful thing for people to travel and that’s also why I like I like Booking as a product because it’s helping people go out there and actually see that people across the world are different and I think that would make the world a better place. 

Sameer: Yeah, where there is more understanding and more openness. 

Lukas: Yeah, and that goes back to your earlier point about creating this open culture where people can talk and challenge each other. For me, that's also something I want in the outside world. I want people to travel, to realize that the way they do things is not necessarily the only way, and to talk about these things.

41:29

Lukas: To have open discussions about how you celebrate birthdays or what is delicious food or how you should spend your holidays or what is your relationship with your mother? All of these things are subject to culture and the best way to realize this is to go out there to travel and to experience. 

Sameer: Absolutely! So, Lukas, we also ran an internal poll on questions that could be asked, and I'm going to run through a couple here. One of the questions was: are there any further challenges you see when a company is running experiments, any challenges we haven't covered so far?

42:09

Lukas: One of the interesting things we're looking into now is that there are cases where you cannot run an A/B test, simply because you don't have control, or because of legal or ethical constraints – some reason you cannot flip a coin and decide who gets to see A and who gets to see B. There are methods to deal with that; we've written a paper about them and have a blog post coming up, but they're not as easy. So how do we scale them the same way we've scaled A/B testing? My group is not that large; we're about 30 people. We could treat those cases one by one with smart people, and we have smart people, but then we would spend an enormous amount of time just doing the analysis, and that's not going to scale. So we have to find ways to take the scaling we've applied to A/B testing and apply it to things that are not quite A/B testing. I think that's one of the biggest challenges we run into, and you actually see this in the wider experimentation community too.

43:22

Lukas: If you look at things that other Silicon Valley companies write about A/B Testing, they’re not writing about basic A/B testing, they’re talking about- can we do time series, prediction? Can we do a […] waiting? Can we do introduced […] action. I think that’s some of the fun stuff that we’re butting our heads against. 

Sameer: Alright, makes sense. Another thing that comes to mind: when you have responses to one experiment, how do you make sure they are independent of the responses to another, that the two are not really related? How do you make that judgment?

Lukas: There's a great blog post by […]; he wrote a very long post about this particular problem. I think it's important to realize that interaction effects are real, so it's possible that two experiments conflict in some way.

44:30

Lukas: So, this is not something to be disregarded, but the flip side is that there is also something that in some sense if it’s very bad, it’s very easy to detect. So, if there’s very severe interaction effect between the two experiments, you can very easily pick those up. There are simple statistical methods for picking them up. But, a much larger problem would be if these interaction effects are, let’s say, solving that testing time but not the production time and what I mean, is that a solution that is often presented to this problem is to isolate tests.  To say half of my traffic gets one experiment and the other half gets the other experiment, so that there’s no way that they can interact and surely this solves the interaction effect testing time. What happens when both of those tests are then winners and you ship them? That means that you now put into production two tests them never been tested together. And so if there is an interaction effect between them, you’ve only moved the problem from the moment that you run the test to the moment they’re put into production. And this is much worse. I’m actually much more worried about that particular scenario than the former because if the interaction effect is very bad, it is relatively straightforward to pick it up. And I think as long as you keep those constraints in mind, most tests will not interact. When test do interact, it’s quite easy to detect. If they do interact, there are ways to avoid it because all you would need to do is create a multivariate test, right? So instead of running A/B  and A/B, I run A/B/C. I’ve now isolated the test, but I also made sure that I can either ship either A, B or C, but never two things together because they could play. So, that’s fairly straightforward to do. Once you have realized that they interact, which you will only be able to do if you overlap. 

Sameer: I think the biggest challenge, as you correctly said, is when you don't catch the interaction during testing and it becomes apparent only in production. How do you address that? The way you're describing it, it sounds like it has happened in the past. How did you address it?

46:52

Lukas: I don’t personally do and I’ve heard of cases where this happens but usually what happens if someone starts a test and notices that it’s absolutely terrible in terms of results or generates errors, they automatically stop and goes that’s weird, that’s not something that I was expecting, that didn’t come out of QA. They investigate, they realize it’s a conflicting test. This is something that people do almost automatically and it doesn’t really require intervention from us other than basically giving them insight into how their test is doing. 

Sameer: Another thing that comes up for our customers is data quality as an issue – again, something we have spoken about earlier in this discussion. For you, one of the things you've put behind you is the validity of the data, of the approach, and of the methodology. As it relates to data specifically, do you have any advice for customers where data quality is an issue?

Lukas: I think one of the reasons we can validate our own data pipelines is that we own the entire thing end to end, we feel responsible for the entire thing, and we have two of them: we don't have just one in-house system, we have two in-house systems, and we compare them against each other. Now, I don't know how you would do this if you were relying on a third party, and I realize the cost of doing this is enormous; this validity comes at a price, which at our scale and complexity is worth it. If you're dealing with a third party and you're really concerned about data quality, the only thing I can think of is to have two third parties and compare results – and I'm sure you'll come out fine compared to competitors. As a consumer, I would probably want to see how the two different platforms deal with my data, where they agree and where they disagree. And they will disagree somewhere, that's 100% sure. But then you have to find out where the disagreement comes from to understand which one you trust more. And I think a lot of our checks could be replicated even if you're using a third party.

49:09

Lukas: So, we have a new paper coming out, actually last month, three key checklists for online experimentation together with Microsoft’s Aleksandr Fabian. Which has a very simple checklist of how as an end user of an experimentation platform, at inception, during runtime, and at decision time, we have a checklist that you can run through to make sure that you’re you’re making reliable decisions and I don’t see why those checklists couldn’t be used when you’re dealing with a third party. They would work just as well. They include simple things like did you decide up front how much run time you’re going to run the experiment and how much visitors you would need and what you expected outcome was to statistical tests that you can do even if you don’t control the underlying infrastructure to make sure that the data that’s coming out of the experiment is valid. 

Sameer: You just mentioned a paper, which is great; in fact, you mentioned a couple in our discussion. Are there any other blogs or podcasts you would recommend to our audience as it relates to online experimentation, any favorites?

50:19

Lukas: I actually don’t listen too much blogs or podcasts. I tend to keep an eye out on other companies that have their own internal experimentation platforms, there are a bunch of big players. There was a summit back in December with a with the 10 biggest ones. So we do a lot of knowledge sharing between these companies and so I watch out for their blogs to see the see what’s up and I reach out and ask them directly, but in terms of podcasts, I don’t really have anything. When people ask me here, what should I read? There’s a great book by [Alan S.] Gerber and [Donald] Green on is called Field Experiments, which is really geared toward psychology and political science students on how to do controlled experiments. It has some nice practical examples from non online space and I recommend that because it’s a nice introduction to the statistics and the methodology behind experimentation that’s not too daunting. So I recommend the Gerber and Green- Field experiments. It’s not a simple interaction, but it’s good. 

Sameer: Thank you so much; we will definitely go through it. And Lukas, finally, what is the preferred way for our audience to connect with you?

51:30

Lukas: Oh, probably reaching out on Twitter is easiest; my Twitter handle is @lukasvermeer. But you can also just go to my website, lukasvermeer.nl. There's a form you can fill out and I'll email you back. Yeah, that's probably easiest.

Sameer: All right, that’s good. Perfect. Can’t thank you enough, Lukas. We are right on top of our time.  Great insights, amazing insights. I’m very sure that people will really benefit out of it. Thank you so much for your time and participating in ConvEx. 

52:04

Lukas: Thank you, nice talking to you.

Speaker

Lukas Vermeer

Director of Experimentation, Booking.com
