
How Experimentation Works at eBay

Peek into eBay's experimentation journey with Benjamin, revealing the impact of data analysis and strategic testing on business decisions.

Summary

Benjamin Skrainka, an economist and thought leader in eBay's experimentation science team, shared his extensive experience in data-driven decision-making and experimentation. With a background in physics and economics, Ben's career has been driven by a passion for using data to answer the 'why' behind various phenomena.

At eBay, he's been instrumental in developing a maturity model to enhance the company's experimentation culture and infrastructure. Ben emphasized the importance of a well-structured experimentation team and the benefits of a Bayesian approach over frequentist methods in certain contexts. He also highlighted the significance of running a variety of experiments to capture both marginal gains and potential 'black swan' events.

Key Takeaways

  • Passion for Data and Correctness: Ben's career trajectory showcases his deep interest in understanding underlying reasons through data, emphasizing the importance of accuracy and thoroughness in data analysis.
  • Building a Maturity Model: At eBay, Ben developed a maturity model to assess and improve the company's experimentation culture, focusing on lifecycle stages and best practices.
  • Experimentation Team Structure: eBay's experimentation team operates as a center of excellence, collaborating with embedded data scientists across business units, highlighting the importance of cross-functional integration.
  • Bayesian vs. Frequentist Methods: Ben advocates for a Bayesian approach in certain experimentation scenarios, as it directly addresses business-relevant probabilities and risks, demonstrating the need for flexibility in statistical methodologies.

Transcript

[00:00:00] Vipul Bansal: Hello, everyone. Welcome to ConvEx ’23, VWO’s annual virtual summit for growth and experimentation professionals. 

[00:00:16] Thousands of brands across the globe use VWO as an experimentation platform to run A/B tests on their website, apps and products.

[00:00:24] I’m excited to have Benjamin with me here, who is an economist and a thought leader in eBay’s experimentation science team.

[00:00:32] Hi Benjamin, or as you mentioned you would like me to call you Ben. Hi Ben. Good to have you here. 

[00:00:38] Benjamin Skrainka: Hi, I’m excited to speak with you all about experimentation today.

[00:00:43] Vipul Bansal: Awesome. Awesome. Awesome. So Ben, let’s start off our discussion with a brief background about yourself. You’ve been in this space, working in the field of data, for over a decade now.

[00:00:57] So, can I assume that you’re obsessed with data? 

[00:01:01] Benjamin Skrainka: Yeah, I’ve always been interested in the question, why?

[00:01:06] Early in my career, when I went to university, I tried to be an engineer and that lasted about a semester because I was more interested in the question why than how and so I switched to physics. And then ultimately I went downhill from physics to economics, but the thing I’ve always been passionate about in every version of science and engineering that I’ve worked on is data and using data to make better decisions. 

[00:01:30] And, for the most recent part of my career, I’ve really focused on experimentation, which is an incredibly exciting field, as many of you know, because it really goes to the core of how we know things and it just has its fingers in so many parts of life and decision making.

[00:01:52] Vipul Bansal: That’s great. So, without naming any of your previous employers, could you quickly share the most challenging projects that you’ve worked on to date?

[00:02:03] Benjamin Skrainka: Yeah, so, I think I’ve worked on a variety of interesting and challenging problems. As an undergraduate I was lucky to work with John Wheeler on some computational cosmology problems, which were very interesting. And John Wheeler was an amazing guy who was a disciple of Bohr and Fermi and he always had great quotes from Fermi and Bohr and people like that. The other thing that was very challenging was writing my dissertation at University College London in England, where I studied industrial organization, econometrics, and computational methods. A lot of my research focused on studying differentiated products, and part of my thesis was one of the largest simulations that’s ever been run in economics, to investigate the properties of a certain set of models for estimating demand.

[00:02:55] And so those are some complex things I’ve worked on. As well as in industry, everything from prediction to forecasting to measurement. One problem I’ve also worked on, besides a lot of experimentation, is capacity planning, which involves both economics and forecasting, and the stakes are really high because capacity planning decisions are often billion-dollar decisions.

[00:03:22] So you really don’t want to get those wrong. 

[00:03:26] Vipul Bansal: Yeah, absolutely. 

[00:03:29] In principle, it does sound like a very complex thing. Maybe that’ll take me another 10 years to learn but that’s great to know. 

[00:03:39] In our previous discussion you mentioned building a maturity model at eBay.

[00:03:47] So how are you building a maturity model at eBay to improve the experimentation culture, the infrastructure and the overall platform?

[00:03:58] Benjamin Skrainka: Yeah, that is a great question. So when I came to eBay, originally I was working as a consultant, and there were certain aspects of the platform that I thought could be better, or aspects of the culture. And if you’ve ever tried to change culture in an organization, you know it’s really hard. And there are people who are vested in keeping things the way they were. But my wife’s a designer and I learned this idea of building a maturity model from her. Designers use it all the time.

[00:04:27] And I said, wow, this is a great thing. We can do this at eBay to improve experimentation. So I wrote a white paper that led to a bunch of executives realizing that the level of maturity of our experimentation platform was something we should look into.

[00:04:43] So what we did at eBay was think about an experiment as having a life cycle. We thought about the different stages in that life cycle, went out and surveyed industry best practices at each of those stages, and decided what we thought was important for us to have. And then we assembled a cross-functional team so that the different people in the organization were bought in, and we evaluated each of the different verticals using this set of characteristics.

[00:05:14] So we looked at performance marketing and customer marketing and on-site feature experiments and so on. And then we were able to assign basically a score to each level of the organization, somewhere from crawl, walk, run, to fly. And then we continually update this report every quarter and it helps us drive change and helps the different teams see where they are in terms of their maturity.

[00:05:44] Vipul Bansal: Awesome. At this point, I’m just curious to also know, if you can share, how is the experimentation team built at eBay? What does it look like? 

[00:05:58] Benjamin Skrainka: Yeah. So I can say that originally, when I came to eBay, we actually had two separate experimentation teams.

[00:06:06] We had a separate team that was focused on marketing and we had another team that was focused on the onsite and feature experiments. And we realized that basically this is essentially the same thing, so we combined the teams, and it’s been very good because now we’re not duplicating effort.

[00:06:20] So we have a head of experimentation. I report to the head of experimentation and he is a diplomat. He is a political genius and he’s very good at helping us navigate and make change and make good strategic decisions. And I provide a lot of thought leadership. And then there’s a team of PhD statisticians who sit next to me, with another leader who leads the people who do the day-to-day science; they handle customer service requests, features and research. And I think more about what’s going on in the field of research and new methods that maybe we should implement, and staying abreast of industry best practice.

[00:07:07] And then we also liaise throughout the organization with the different data science teams that may be responsible for individual business units’ measurement.

[00:07:15] So you can think of the experimentation team I’m on as an experimentation center of excellence. But there are also embedded data scientists in different business units that we support.

[00:07:26] Vipul Bansal: Got it. Got it. So since you must be running a lot of experiments at eBay, do you standardize how you measure and define success for all the experimentation efforts that you put in?

[00:07:45] Benjamin Skrainka: Yeah, and that’s a great question. Once you decide to measure something, that creates an incentive that affects behavior. And the goal really of the experimentation platform is that someone with limited statistical experience should be able to read the report and make a good decision. The default type of experiment that we let someone run is the classic frequentist A/B test, and we have a report that has been designed to make it very visually easy to see if the key metrics have moved in a favorable direction or not. And it’s color coded and there’s guidance built into the tool, so that someone who maybe doesn’t know all the nuances of a P value doesn’t have to worry about that.

[00:08:33] And I think that these things can always be refined. I have a colleague at the University of Chicago who’s actually doing some interesting research that I’d love to be able to talk about more, but I don’t think he’s published it yet, where he’s looked at things from a behavioral economics point of view.

[00:08:51] If you’ve ever read Kahneman and those guys, it’s about what information we present to people so that they make better decisions. So I think there’s some interesting research in those directions that is good to think about as well.

[00:09:04] So, there’s both what are the criteria for how we make decisions, and then how do we present it so that people make good decisions? That’s all important stuff.
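
To make that default readout concrete, here is a minimal sketch of the kind of calculation behind a classic frequentist A/B report, using hypothetical conversion counts and a plain-language verdict. It is illustrative only and does not reflect eBay’s or VWO’s actual tooling.

```python
# Sketch of the arithmetic behind a classic frequentist A/B readout
# (hypothetical counts, illustrative thresholds).
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 1_205, 50_000   # control: conversions, visitors
conv_b, n_b = 1_310, 50_000   # treatment: conversions, visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))            # two-sided p-value
lift = (p_b - p_a) / p_a

verdict = "significant" if p_value < 0.05 else "not significant"
print(f"lift = {lift:+.2%}, z = {z:.2f}, p = {p_value:.4f} ({verdict} at alpha = 0.05)")
```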

[00:09:12] Vipul Bansal: Yeah, that does sound exciting. Of course, I know that you cannot share the nitty-gritty of the entire process, that must be confidential, but that does sound exciting. It’s always good to know how big companies handle all the measurement criteria in their experimentation programs as well.

[00:09:32] Benjamin Skrainka: Right and I think it’s important to think about the portfolio of experiments. Because I want to be right as often as I can be in general. And there’s some great work by people like Georgi Georgiev, he’s got a great book and blog and he says in some ways, we should really be thinking about risk.

[00:09:52] It’s about managing risk and that this classical frequentist approach that’s come to us often from history and medicine is not always the best way to do things. Because we should really be thinking about minimizing risk. And there also were strong arguments to move to a Bayesian framework that helps us make these decisions too.

[00:10:13] And I know that VWO has been very strong in that area and there are many benefits of the Bayesian framework. Because it also then lets us think about things like what’s the expected loss if I make this decision. 

[00:10:24] And then the last thing I would say is there was another paper that I think is very interesting from Airbnb: if I have a portfolio of decisions, because I ship things that are statistically significant, there’s going to be an upward bias on the aggregate lift. So when I think about the cumulative benefit of all my tests, I maybe need to shade things down a little bit, because the expected lift conditional on being statistically significant will be biased upwards.
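
The upward bias Ben describes (sometimes called the winner’s curse) is easy to see in a small simulation. The numbers below are hypothetical; the point is only that lift estimates conditioned on statistical significance overshoot the true lift.

```python
# Simulation of the winner's curse: lift estimates conditional on statistical
# significance are biased upwards. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_lift = 0.01        # every experiment truly adds +1%
se = 0.008              # standard error of each lift estimate
n_experiments = 100_000

estimates = rng.normal(true_lift, se, n_experiments)
shipped = estimates / se > 1.96           # ship only clear "winners"

print(f"mean estimated lift, all experiments: {estimates.mean():.4f}")
print(f"mean estimated lift, shipped only:    {estimates[shipped].mean():.4f}")
# The second number exceeds the true 0.01, which is why cumulative impact
# computed from shipped winners should be shaded down.
```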

[00:10:59] Vipul Bansal: Are you a Frequentist or are you a Bayesian follower? That’s a debate. 

[00:11:03] Benjamin Skrainka: That is a great question. Our platform currently is primarily focused on frequentist experiments and I’m actually working hard to advance Bayesian ideas. I think that in the world of A/B testing, where we’re making decisions, you can make a very strong case for the Bayesian approach. I think it solves a lot of problems, because the frequentist approach doesn’t really answer the question I care about.

[00:11:34] If I run a frequentist test, the thing I am testing is: conditional on the null hypothesis being true, how likely is it that I’ve seen data as extreme as what I saw, or more extreme?

[00:11:49] And that’s not what the business owner really cares about. They care about: conditional on the data I’ve seen, what’s the probability that the null hypothesis is true? And that’s the reverse. And so Bayes’ rule lets me go directly after that. I can compute a posterior and I can then estimate, hey, what’s the likelihood that the thing I care about is actually true. And then I can get into these great conversations about what’s the risk if I actually ship things, and I don’t have these problems like my P value being just slightly bigger than the significance level. What do I do in the frequentist world?

[00:12:22] I’m out of luck, right? I don’t ship. In the Bayesian world, I can calculate what’s my loss and do I want to ship? And am I comfortable with that risk? And so I can make a more nuanced and informed decision. 
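
A minimal sketch of the two Bayesian quantities Ben mentions, the posterior probability that the treatment beats control and the expected loss of shipping it, using a Beta-Binomial model with flat priors and hypothetical counts. This is not eBay’s or VWO’s implementation, just the textbook version of the idea.

```python
# Posterior probability that B beats A and the expected loss of shipping B,
# with a Beta-Binomial model, flat Beta(1, 1) priors, and hypothetical counts.
import numpy as np

rng = np.random.default_rng(42)
conv_a, n_a = 1_205, 50_000   # control
conv_b, n_b = 1_310, 50_000   # treatment

draws = 200_000
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)   # posterior of A's rate
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)   # posterior of B's rate

prob_b_beats_a = (post_b > post_a).mean()
# Expected loss of shipping B: average conversion given up in the scenarios
# where A is actually the better arm.
expected_loss_b = np.maximum(post_a - post_b, 0).mean()

print(f"P(B > A)               = {prob_b_beats_a:.3f}")
print(f"expected loss (ship B) = {expected_loss_b:.6f}")
```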

[00:12:34] Vipul Bansal: Yeah, that’s very, very insightful, Ben. 

[00:12:37] So, eBay receives over a million footfalls each month in terms of, of course, digital visitors.

[00:12:46] So, first of all, if you could share what is the scale of experimentation that you run at eBay? And of course, how do you handle that scale? It’s something which would be really interesting to know as well. 

[00:12:59] Benjamin Skrainka: Right. Yeah. That’s a great question. And we’re fortunate to have a lot of traffic but again it’s never enough.

[00:13:07] You’d always like to have more traffic. And probably like everyone in the audience, we’re always working to increase the velocity of experimentation. 

[00:13:17] And so part of the way we do that is working with really good engineers to build robust and performant infrastructure. And I haven’t talked about this before, but I think the success of experimentation really hinges on three things: culture, infrastructure and methodology.

[00:13:37] And so, your question is about infrastructure. We are actually in the middle of re-architecting our experimentation pipeline so that it will be more performant and faster. It’s really important to decrease latency. You’d like people to get results sooner.

[00:13:53] The other really important part of this is training our users. If a user chooses a really bad metric, something like, for instance, sales, that tends to be highly skewed in the Internet world, right? Where we have this long tail. Think about classic papers like the Randall Lewis and Rao paper on the near impossibility of measuring returns to advertising: these things are so skewed that even with a million observations you might not even have asymptotic normality. And so what we’re trying to do is educate our users to choose better metrics that are both sensitive and directional.

[00:14:29] So that means they’re easy to move with a limited amount of data. That means metrics that are things like binary indicators of behavior or ratio metrics. These tend to require fewer observations to measure, which means we can run more experiments, get more throughput, and learn faster.

[00:14:44] And the last thing I’ll say is that we actually track the velocity of experimentation. We want to make sure that more teams are adopting experimentation, that they’re running good experiments, and that the pace of innovation continues to accelerate.
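
A rough illustration of why sensitive metrics matter for velocity: with made-up numbers, a standard power calculation shows how many more observations a heavily skewed revenue metric needs than a binary purchase indicator to detect the same relative lift. The distributions and thresholds below are assumptions for illustration.

```python
# Rough power calculation (made-up numbers): observations per arm needed to
# detect a 2% relative lift on a binary purchase indicator vs. a skewed
# revenue-per-visitor metric.
from math import ceil, sqrt
from scipy.stats import norm

z_alpha, z_beta = norm.ppf(0.975), norm.ppf(0.80)   # alpha = 0.05, power = 80%

def n_per_arm(sigma, delta):
    """Two-sample, normal-approximation sample size for absolute difference delta."""
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

p = 0.05                                   # assumed baseline purchase rate
print("binary indicator:", n_per_arm(sigma=sqrt(p * (1 - p)), delta=0.02 * p))

mean_rev, sd_rev = 2.0, 40.0               # assumed skewed revenue distribution
print("skewed revenue  :", n_per_arm(sigma=sd_rev, delta=0.02 * mean_rev))
```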

[00:14:56] Vipul Bansal: Got it. So increasing the velocity of experimentation is something which almost every large organization on the scale of eBay pursues.

[00:15:07] Is it done to increase the number of right decisions, or is it done to increase decision accuracy? The two concepts may be similar, but this is something that I’m trying to understand as well. Why is increasing the velocity of experimentation important?

[00:15:25] Benjamin Skrainka: So there’s actually a really interesting paper that I read recently by Azevedo and some other folks. I think it was done on Microsoft data. What they found is, if your innovation process, the way you create ideas, tends to produce marginal gains, your experiments are going to tend to have small effect sizes. In order to detect those marginal gains you need to run a few, but expensive, experiments because they’re going to require lots of observations to measure.

[00:15:59] Now, if your innovation process produces a few really big winners, black swans, it means you have a fat-tailed innovation process, and then you want to run lots of small experiments that are short, because you want to find the big winners.

[00:16:12] If you miss some of the marginal gains, that’s okay. And what this paper showed, using Microsoft data and also data from other papers about eBay and other companies, is that many of these tech companies have fat-tailed innovation processes.

[00:16:27] So what that means is, you should focus on running lots of cheap, short experiments that are, perhaps in our case, one or two weeks long, because if I miss some small marginal gain and don’t get enough data for it to be statistically significant, it’s not a huge deal. But if I miss that black swan, a huge home run, or in your case a hit for six, right? I think that’s a cricket term, right?

[00:16:49] Missing that is a big loss and a big impact on the business. So it’s good to run an experimentation strategy that aligns with your innovation process. You probably want to go and look at the experiments you’ve been running and do something simple, like compare the mean to the standard error for your effect sizes. If you tend to have really high variance on the effect sizes in your experiments, it’s likely that you have a fat-tailed innovation process. And you can actually fit the model in the Azevedo paper pretty easily.

[00:17:32] It’s just a maximum likelihood model. And so if you have a bunch of data on your experiments, that’s a good thing to do, so then you know what the right experimentation strategy is for you. So I hope that answered your question. I realize I kind of diverged a bit, but hopefully that’s useful for listeners.
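
Here is one way to sketch the diagnostic Ben suggests: fit both a normal and a Student-t distribution to your historical effect-size estimates by maximum likelihood and compare the fits. The data below are simulated, and this is a simplification for illustration, not the actual specification in the Azevedo et al. paper.

```python
# Fat-tail check on historical effect sizes: fit a normal and a Student-t by
# maximum likelihood and compare. Simulated data; a simplified stand-in for the
# Azevedo et al. model, not their actual specification.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Pretend these are estimated lifts from past experiments (fat-tailed by construction).
effect_sizes = stats.t.rvs(df=3, scale=0.01, size=400, random_state=rng)

mu_n, sd_n = stats.norm.fit(effect_sizes)
df_t, loc_t, scale_t = stats.t.fit(effect_sizes)

ll_norm = stats.norm.logpdf(effect_sizes, mu_n, sd_n).sum()
ll_t = stats.t.logpdf(effect_sizes, df_t, loc_t, scale_t).sum()

print(f"normal log-likelihood: {ll_norm:.1f}")
print(f"t      log-likelihood: {ll_t:.1f} (fitted df = {df_t:.1f})")
# A clearly better t fit with a small df suggests a fat-tailed ideation process,
# which favours many short, cheap experiments over a few long ones.
```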

[00:17:48] Vipul Bansal: No, no, definitely it is. And you mentioned the Azevedo framework, right? So, could you explain to us a bit more about what this framework is?

[00:17:58] Benjamin Skrainka: Yeah, so Azevedo and his co-authors looked at data on Microsoft experiments. They built a model of the ideation process and then they were able to show, within their model, what the optimal experimentation strategy is. And they find that it depends basically on the thickness of your tail, which affects the likelihood that you’re going to generate black swan events.

[00:18:22] So if your idea generation process, the distribution of the lift from your ideas, tends to be more normally distributed, that means you’ve got a thin-tailed ideation process, and you should focus on trying to make marginal gains.

[00:18:36] That means you run fewer experiments but they need to be longer because you’re trying to measure small effect sizes because it’s unlikely that you’re going to have an idea that’s a big winner. 

[00:18:46] Whereas if you have something with a thicker tail, like a t-distribution, for the ideation process, that is, the likelihood of generating an idea with a certain effect size, then you want to run lots of lighter-weight experiments, because you want to capture the big winners.

[00:19:01] A famous economist once said in a lecture, math is best done in the privacy of one’s own home. And so I think if you’re interested in the details it’s really best to read the paper. Otherwise, I’m just going to be spewing out technical mumbo jumbo about tail thickness and maximum likelihood.

[00:19:21] Vipul Bansal: I would definitely recommend that the audience members have a look at the Azevedo framework that Ben just mentioned. I’m just curious, do you also implement this framework at eBay, and how do you do it?

[00:19:36] Benjamin Skrainka: So, yeah, I joke that I write white papers that nobody reads.

[00:19:42] I wrote a white paper on this and it was well received and I think it’s affecting our strategy. And one of the things that we rolled out recently that was really helpful was this concept of a rapid test: on our experimentation platform there’s a really easy option to say, I want to run a test that’s one week long or two weeks long. And so we make it very easy to run a short-duration test.

[00:20:10] You probably don’t want to run for less than one week because there’s day-of-week seasonality, so we want to see a full week so that we can average out all that seasonality. There’s an option, based on how much traffic you expect to get in your experiment, where you can choose a one-week or a two-week rapid test.

[00:20:28] And it basically gives you the Model T version of an A/B test: you hit that button and basically you’re up and running. And so it makes it very easy to test ideas in this world where we have a fat-tailed ideation process.

[00:20:48] Vipul Bansal: In your journey as an experimentation professional, what are some common mistakes or pitfalls, let’s call them, that you’ve encountered and how did you learn from them? 

[00:21:03] Benjamin Skrainka: Yeah, that’s a great question. And I think for anyone who’s first starting out in experimentation, you should be prepared to make a couple million dollars’ worth of mistakes. Obviously you don’t want to. And so it’s good if you’ve got senior people you can learn from.

[00:21:17] I think one of the biggest mistakes, and this certainly affected me when I was first running experiments:

[00:21:24] I was one of the first economists hired at Amazon and I was also not in the central economist team.

[00:21:31] So I was off running experiments on my own. And so, like, don’t do this alone. Seek out help and so I think that’s the biggest thing. 

[00:21:39] It’s always good to get a review, just like you would do a code review if you’re an engineer. You want people to review your experimental design. It’s very easy to make mistakes. And if you’ve gone through something like a data science boot camp, you’ve learned how to run a power calculation, and you think you know how to design an experiment, it’s so much more complicated than that. There are so many subtleties.

[00:22:00] The other thing that’s good to do is, I mentioned Georgi Georgiev. I’ve read his book. I’d also read the book by Kohavi et al., Trustworthy Online Controlled Experiments; that’ll save you from many mistakes.

[00:22:12] Unfortunately, those books didn’t exist when I was getting started. So, I was told to go read R.A. Fisher’s book and I managed to find a copy from 1934, which is pretty cool.

[00:22:23] It’s got that awesome old-book smell, but I think there are less painful ways to learn experimentation. The other thing to understand is that for a lot of the things you need to measure, people don’t want you to measure them. If you’re measuring marketing, people like Google and many of the people who sell media, like internet radio or whatever, make it really difficult for you to measure.

[00:22:47] And it’s only gonna get harder, with the increase in privacy making attribution much harder. I would start out by trying to run the simplest possible experiments.

[00:23:00] Maybe marketing copy, email, and really master that simple stuff. Then there are the experiments that aren’t on-site feature tests but are off-site; for instance, I’ve worked a lot at eBay with the team measuring PLA ads.

[00:23:16] That’s hard if you’re trying to measure the long-term impact of something and you want to have a holdout. That gets really tricky because of survivorship bias. A lot of people are rightly excited about this new method called synthetic control. I think that’s a really important thing to have in your toolkit, as is difference-in-differences. There’s currently a revolution going on in difference-in-differences that makes it much easier to use these tools in quasi-experiments.

[00:23:47] Synthetic control is particularly nice if you have small, limited samples, because you have a holdout and there’s a limited number of units you can put in the treatment group or the holdout; for instance, you’re marketing to different cities in your country.

[00:24:03] There are a limited number of cities that are going to produce good signal. So for instance in the US, there are 210 DMAs. Those are basically city areas. Of those about 80 produce good signal.

[00:24:14] Someone else at your company may already be using half of them so maybe you’ve got 40 units that produce good signal.

[00:24:20] It’s going to be hard to get enough power. So something like statistical control or synthetic control can help, because it’s a way of cherry-picking units so that you can compare apples to apples instead of apples to oranges. And there’s a new variant on synthetic control that’s particularly useful if you have to deal with a low number of units, called synthetic experimental design, which lets you construct a synthetic treatment and control group that both look representative of, say, the national average behavior that you care about, so that you’re comparing apples to apples.
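
For intuition, here is a minimal sketch of the core synthetic control idea: choose non-negative donor weights that sum to one so the weighted donors track the treated unit before the intervention, then read the effect off the post-period gap. The DMA-style data are simulated; a real analysis would use a dedicated package and proper inference.

```python
# Core synthetic control idea on simulated DMA-style data: find non-negative
# donor weights summing to one that track the treated unit pre-treatment,
# then estimate the effect from the post-treatment gap.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n_donors, t_pre, t_post = 30, 52, 12              # weekly data, hypothetical

donors_pre = rng.normal(100, 10, (t_pre, n_donors))
true_w = rng.dirichlet(np.ones(n_donors))         # hidden mix the treated unit follows
treated_pre = donors_pre @ true_w + rng.normal(0, 1, t_pre)

def pre_gap(w):
    return np.sum((treated_pre - donors_pre @ w) ** 2)

res = minimize(
    pre_gap,
    x0=np.full(n_donors, 1 / n_donors),
    bounds=[(0, 1)] * n_donors,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
    method="SLSQP",
)
weights = res.x

donors_post = rng.normal(100, 10, (t_post, n_donors))
treated_post = donors_post @ true_w + 3.0         # +3 simulated campaign effect
effect = (treated_post - donors_post @ weights).mean()
print(f"estimated campaign effect: {effect:.2f} (simulated truth is 3.0)")
```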

[00:24:55] Niels Bohr said that his definition of an expert is someone who, through their own sad, bitter experience, has made a sizable fraction of all possible errors in the field.

[00:25:06] The one closing thing I would say is, if the result seems too good to be true, it probably is. So you should be very suspicious and if you run some marketing campaign and it has 10% lift, you should be very skeptical.

[00:25:20] It’s very rare to get 10% lift, so you should really dig into that and figure out why. There are also many subtle ways of introducing bias, so the other thing to look out for is that if your randomization grain doesn’t equal the analysis grain, you need to take extra steps to eliminate bias and compute your errors correctly.

[00:25:42] So those are just a few things, I think to keep you busy. But we will probably run out of time if I keep talking about this. 
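
The randomization-grain versus analysis-grain point is worth a small illustration. Below, users are randomized but the metric (clicks per view) is computed over page views; treating views as independent understates the variance, while a delta-method calculation over user-level totals does not. The data are simulated and the setup is hypothetical.

```python
# Randomization grain (user) vs. analysis grain (page view): naive per-view
# standard errors understate the variance of clicks-per-view; a delta-method
# calculation over user-level totals accounts for the clustering. Simulated data.
import numpy as np

rng = np.random.default_rng(3)
n_users = 20_000
views = rng.geometric(0.1, n_users)          # views per user, long-tailed
user_ctr = rng.beta(1, 9, n_users)           # heterogeneous per-user CTRs
clicks = rng.binomial(views, user_ctr)

ctr = clicks.sum() / views.sum()

# Naive SE: pretend every page view is an independent Bernoulli draw.
se_naive = np.sqrt(ctr * (1 - ctr) / views.sum())

# Delta-method SE for a ratio of user-level sums (randomization unit = user).
x, y = clicks.astype(float), views.astype(float)
mx, my = x.mean(), y.mean()
cov_xy = np.cov(x, y)[0, 1]
var_ratio = (x.var(ddof=1) / my**2
             - 2 * mx * cov_xy / my**3
             + mx**2 * y.var(ddof=1) / my**4) / n_users
se_delta = np.sqrt(var_ratio)

print(f"CTR = {ctr:.4f}, naive SE = {se_naive:.5f}, delta-method SE = {se_delta:.5f}")
# The delta-method SE is noticeably larger, so the naive analysis would
# overstate significance.
```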

[00:25:51] Vipul Bansal: No, I think that’s definitely quite helpful. I’m sure the audience members must definitely be getting insights from that. 

[00:26:01] I also got feedback from our previous sessions that people want to know what metrics eBay tracks when running experiments. And what are the KPIs that you use to track the impact of experimentation efforts? So if you can quickly share a few, that would be great.

[00:26:21] Benjamin Skrainka: Yeah, so that’s very difficult to answer because it’s very experiment-dependent and it also depends on where you are in the funnel. So I think maybe this is a way of answering it. There’s a growing recognition in industry that it’s dangerous to have teams optimizing just their part of the funnel, because when you optimize your part of the funnel, you might make something worse for somebody further up the funnel.

[00:26:47] And so we try to avoid that with guardrails, but that’s why it’s really important to have the organization align on an OEC, an Overall Evaluation Criterion. And this is something that should align with the goals of the organization’s leadership, because when we create a metric, it creates an incentive, and the OEC may not be just one metric like sales or revenue or completed transactions.

[00:27:13] It may combine several things, because we may have to trade off the performance of different metrics. Making one better may make another worse. And so we have to do what’s best for the business.

[00:27:26] There’s interesting research by Yandex on how to compute an OEC in a principled way using optimization over metrics. Basically you take a set of experiments where you have labeled outcomes and you can then figure out the optimal metric as a weighted combination of other metrics.

[00:27:45] That gives you a metric that you can then use and that will be more sensitive and have lower variance. So that’s a principled way to go after it. So I think thinking about what your OEC is, is really important.

[00:27:56] So that it aligns your teams, and it means one team’s work isn’t undoing another team’s work, and that we’re thinking about optimization across the whole funnel and what’s best overall, not just my little piece.
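
As a toy illustration of the Yandex-style idea of learning an OEC from labeled experiments, the sketch below uses a simple logistic regression over per-experiment metric deltas as a stand-in; the actual method, metrics, and labels would differ in practice.

```python
# Toy version of learning OEC weights from past experiments with labeled outcomes.
# Metrics, labels, and the logistic-regression stand-in are all hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_experiments = 300

# Per-experiment deltas on three hypothetical component metrics
# (e.g. CTR, add-to-cart rate, latency).
deltas = rng.normal(0, 1, (n_experiments, 3))
# Label: was the experiment judged genuinely good for the business overall?
good = (0.6 * deltas[:, 0] + 0.3 * deltas[:, 1] - 0.4 * deltas[:, 2]
        + rng.normal(0, 0.5, n_experiments)) > 0

model = LogisticRegression().fit(deltas, good)
weights = model.coef_[0] / np.abs(model.coef_[0]).sum()   # normalized OEC weights
print("learned OEC weights (CTR, add-to-cart, latency):", np.round(weights, 2))
```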

[00:28:08] So it’s easy to get lost in all of this. And unfortunately, I’ve not yet been able to win that political fight at eBay, to have standardization on the OEC or a couple of key North Star metrics.

[00:28:23] But any experiment that we run has a primary metric, which is the thing that is most relevant for the business question we’re trying to answer.

[00:28:30] We have a guardrail metric, which is typically something like gross merchandise bought, to make sure that whatever optimization we’re doing doesn’t have a negative impact on that.

[00:28:43] Now, GMB, Gross Merchandise Bought, is very hard to measure, so you may not want it as a primary metric. And so then we have these guardrail metrics, and we also have some supporting metrics that may help you tell the story or debug something.

[00:29:01] So if you’re using a ratio metric, like click through rate, it’s also good to look at the numerator and denominator because that can help you debug problems and also understand why that metric moved in the direction it did. Was it from more clicks or getting more transactions? 

[00:29:18] Hopefully that’s helped answer your question. I realize it’s not perhaps as precise as you want but the metrics that are relevant for us aren’t necessarily relevant for you. So the main thing I would say is try to come up with a good OEC for your business.

[00:29:33] Vipul Bansal: Definitely, quite insightful there. So, eBay, of course, has been in existence for many years. I was a kid when I first came to know about eBay. So has experimentation played a key role in driving innovation at eBay?

[00:29:51] Benjamin Skrainka: I think eBay understands and values experimentation and eBay has actually been one of the pioneers in the field.

[00:30:00] There was a very famous paper in marketing that many people probably know about, by Blake, Nosko, and Tadelis, that was done at eBay about 10 years ago. I think we’re on the 10-year anniversary now. This was the famous paper that showed that branded keyword search basically had zero or negative ROI.

[00:30:17] So it’s like this huge wake-up call. Like, why are you spending money on branded keyword search? And despite this paper, which has been reproduced many times, many people continue to spend money on branded keyword search, and later research showed it may be helpful if you’re a very small company.

[00:30:32] If you’re a big company like eBay, it’s not something to spend money on. So we have a long history of experimentation and some parts of the website are highly optimized. So if you’re in a company that’s been around as long as eBay, and we’ve basically been around since the beginning of the Internet.

[00:30:46] I remember eBay from the nineties, back when I had more hair. So some parts are highly optimized, and that means making progress can be very hard, and experiments tend to take more observations.

[00:31:02] The innovations you make are smaller. And the other thing is, which ideas are going to be winners is very hard to predict, so it’s important to be very democratic about what you test and think about how to do that.

[00:31:18] So I hope that answered your question. 

[00:31:22] Vipul Bansal: Yes, indeed. 

[00:31:22] So what are the essential skills, according to you, one needs to have for data-driven decision making in the context of experimentation?

[00:31:35] Benjamin Skrainka: Yeah. So I think that one of the most important things is really having a passion for correctness. A lot of people don’t check their work as much as they should or could. And often if I interview a SQL engineer or a data engineer for a job, I’ll say, hey, how do you know your SQL is correct?

[00:31:55] And they usually look at me like I’m crazy. No one’s ever asked them that question. And it’s super important, because if you pull some rubbish data set, no amount of fancy statistics is going to make it better.

[00:32:05] So I look for people who are very methodical and passionate about checking every step of their work with data to make sure it’s correct.

[00:32:12] Data is so hard to debug, and it’s so easy to miss something in a step, like an outlier. Or something as simple as: you run an experiment and think one feature is better than another when it isn’t, and the only thing that’s different is that the new feature introduced a lot of latency into that arm of the experiment, so what you’re really measuring is a difference in latency between the two treatments.

[00:32:33] There are all these kinds of subtle things, and you need someone who’s got that awareness. That’s more important than just about anything, because we can always learn the statistics after that.

[00:32:42] But you need someone who’s got that personality where they really care about correctness and then after that, I think really mastering the basic statistics.

[00:32:51] I think experimentation is fascinating and incredibly valuable because it teaches you the fundamentals of statistics and how we use data to know things and you can wander off into the world of Popper and other philosophers about, how do we know what’s true or likely to be true? 

[00:33:11] And then, in addition to that, you need to know some kind of programming language and I would argue, you should learn not R, not Python, but both because they’re better for different things. And if you’re just starting out, Python gives you more options, but R is often better for answering statistical questions or quickly pivoting and looking at data. 

[00:33:31] So it’s good to know both, and the people on our team largely do, so we have people who are at least bilingual. You need some statistics skills and you need some cultural skills.

[00:33:45] And then the last thing is being able to present your results in a simple and coherent way, so that executives understand them. Part of that is really learning how to communicate to leaders.

[00:33:55] That’s something that I’ve really worked on improving over the last couple of years as a scientist. If I write a 10-page white paper that looks like a journal article, all TeX’d up, executives don’t read that. They need a one-pager with a nice picture that tells them what to do.

[00:34:13] I think those are some things to work on, if you’re starting out. 

[00:34:17] Vipul Bansal: As we end our discussion, I just have one really, really important question to ask you.

[00:34:24] What are the books that you’re currently reading? And if you’re not a book person, can you also recommend certain web series that everyone should watch?

[00:34:34] Benjamin Skrainka: Yeah, absolutely. So anyone who knows me or has been a student of mine, I’ve done a lot of teaching. I love teaching. They’re like, oh my God, not another book. 

[00:34:43] Like, so, yeah, I love books. I read a lot. Reading’s important and I think a lot of us when we go into industry we stop reading as much.

[00:34:53] So for some things you need to read journal articles and Microsoft has a really good website on papers about experimentation that their experimentation team has put together. 

[00:35:03] So that’s a great free resource. The Scott Cunningham book, ‘Causal Inference: The Mixtape’, is also available free online, but you should support Scott and buy his book.

[00:35:13] He’s got a great podcast and Substack as well. It’s really great on all these causal inference tools, particularly around quasi experiments. 

[00:35:22] He also runs workshops. That’s a fabulous resource. If you’re first starting out, you want to read the Trustworthy Online Controlled Experiments book by Kohavi and his co-authors.

[00:35:33] The Georgiev book that I mentioned is also good. If you want a PhD-level treatment, I like the book by Guido Imbens and Don Rubin. Guido Imbens recently won the Nobel Prize for work on causal inference, and the book is very clear, but it’s axiomatic and canonical.

[00:35:55] So if you like a mathematician-style treatment, they lay out all the properties of a randomized controlled trial, so it’s crisp in your mind that if one of these things is violated, I have some problem I need to deal with. And they go into a lot of detail on subjects other people don’t discuss, like checking for balance and practical stuff like that.

[00:36:15] So great book, and I’m also reading this book on Bayesian modeling. It’s a computational perspective in Python. 

[00:36:26] So it’s ‘Bayesian Modeling and Computation in Python’. That’s what’s on my desk now. And sitting under it is a philosophy of experimentation book that I tried to read and stalled out on. It was unreadable philosophical gobbledygook, so I will not name and shame that author.

[00:36:42] I tried so hard and I just kept falling asleep reading it, and there are stains on it from where I fell asleep while reading and spilled tea on the book. So not a book I’d recommend, unless you’re having sleep trouble, in which case hit me up and I’ll help you cure your sleep problems.

[00:37:05] Vipul Bansal: Awesome. That’s a really big list of recommendations. So we will definitely put a few names from your recommendations in the slider below so that people can search for those books and buy them if they’re interested.

[00:37:21] But yeah, that brings us to the end of this discussion, Ben.

[00:37:25] This has been really, really insightful. I’m saying that genuinely, and it was a pleasure to speak with you. I’m sure the audience has gathered some insights from all your answers.

[00:37:39] So thank you so much Ben for being part of ConvEx 2023 by VWO. 

[00:37:44] Benjamin Skrainka: Oh, you are most welcome.

[00:37:45] Thank you for the opportunity to speak and hopefully, I’ll get to meet some of you in person or online and I look forward to talking experimentation with everyone whenever it happens. So have a great conference. 

[00:37:58] Vipul Bansal: Thank you so much, Ben. 

Speaker

Benjamin Skrainka

Data Science Manager, Experimentation, eBay
