Beyond Basics: Addressing Complex Challenges in Experimentation
Join Dr. Ozbay to master online experimentation: managing multiple tests, analyzing complex data, and tailoring strategies for team success.
And Ozbay shares insights from his diverse experiences as a consumer, analyst, and leader in experimentation platforms. He emphasizes the importance of a well-defined experimentation process, effective tooling, scientific methodology, and building institutional knowledge. Ozbay discusses challenges like traffic scarcity, experiment health, and network effects, offering strategies like automation, prioritizing experiments, and advanced statistical methods to enhance efficiency and accuracy. He advocates for a holistic approach, integrating data engineering, statistics, and software engineering to manage complex experimentation landscapes effectively.
- Emphasize a comprehensive process, including hypothesis formation, execution, data analysis, and decision-making, to standardize best practices in experimentation.
- Regularly check for imbalances or interaction effects in experiments to ensure accurate results and avoid misleading conclusions.
- Combine data engineering, statistics, and software engineering skills to build robust, adaptable experimentation programs that can handle complex
[00:00:00] And Ozbay: Hello everyone. Good morning, good afternoon, or good evening, wherever you are in our connected world.
[00:00:14] I’m thrilled to be speaking to people who get the power of digital experimentation and who, like me, are eager to delve into its complexities. I’m And Ozbay. I’ve had the opportunity to explore online experimentation from three different angles throughout my career: as a consumer, as an analyst, and as a leader of experimentation platforms.
[00:00:43] As a consumer and participant in various projects, I’ve been part of refining things like pricing models, recommendation systems, and search and ad algorithms.
[00:00:55] All of this was tweaked through continuous experimentation. These experiences gave me a deep understanding of the decision-making process that drives successful experimentation. They also showed me how well-designed tests and experiments can transform businesses and change how consumers behave. In my work as an analyst, I have delved into complex data sets, tested hypotheses, and applied statistical methods to turn raw data into actionable insights. This gave me a unique look at how we can make data-driven strategic decisions. Using data from experiments has been key in predicting, understanding, and creating interventions that define business strategies, improve customer experiences, and drive growth in a competitive marketplace.
[00:01:55] As a leader within experimentation platforms, I’ve worked with the machinery that makes it possible to execute online testing at scale. From forming hypotheses and designing experiments to analyzing data and creating insights, I worked on removing the blockers to making experimentation accessible for everyone in the organization.
[00:02:21] These experiences have influenced my perspective and understanding of online experimentation. They’ve highlighted the hurdles that even mature experimentation programs can face, and they’ve driven me to look for creative ways to overcome these obstacles.
[00:02:41] The title of my presentation is ‘Beyond Basics: Addressing Complex Challenges in Experimentation’. The discussion we are about to have assumes that your organization has met some key requirements: you’ve got enough traffic to run experiments, and your infrastructure is solid and can support experimentation, either in-house or through a purchased off-the-shelf solution.
[00:03:10] You have effective instrumentation in place and your experiments are statistically valid. Plus, your organization values learning, understands the power of data, and is ready to make data-driven decisions. Today, I will not spend any time visiting the challenges that are typically faced by young experimentation programs.
[00:03:34] Instead, I’ll share my experiences dealing with the challenges faced by more established programs. I’ll also share some practical solutions that have worked for me. The goal is to help you improve and streamline your experimentation program, enabling you to tackle these challenges and increase success and efficiency.
[00:03:59] When I talk about an experimentation program, I’m not referring to something huge and unchanging. Instead, I’m talking about a mix of crucial parts. These include the experimentation process, the tools and infrastructure, the scientific methodology, and the institutional knowledge.
[00:04:22] Let’s start with the experimentation process.
[00:04:25] The experimentation process is a well-defined journey, leading an experiment from its birth to its conclusion. The journey includes everything from forming the initial hypothesis, designing the experiment, executing it, collecting data, analyzing the data, and interpreting the results, and finally making decisions based on those results.
[00:04:52] This process is always repeating and recycling, feeding into the next experiment and decision-making with the objective of standardizing the best practices that help get the most out of experimentation.
[00:05:08] Next up is tooling. Tooling powers the process. It often includes a powerful targeting engine that randomizes units into treatments, a telemetry system that logs the signals received during execution, a statistics engine that turns data into results, and a management console that allows users to interact with the platform.
[00:05:35] These tools are key for setting up, running and analyzing experiments, delivering accurate, trustworthy, and insightful data. Investing in good quality tools gives your program credibility and sets it up to become a valuable storehouse of knowledge. Then there’s the scientific side of things. It uses advanced statistics to bring depth and versatility to the insights gained from experimentation.
[00:06:07] This helps to reveal trends and patterns and provides a firm base for strategic decision making. Finally, as your experimentation program earns trust and is recognized as reliable, it grows into a store of institutional knowledge. What experiments have been done in the past? What were the results? By documenting and sharing these insights, your organization can build on past successes, learn from failures, and avoid making the same mistakes again.
[00:06:41] This store of knowledge can also give predictive insights into the success rates of future experiments. By examining past data, we can form reasonable expectations of what success looks like, and we can set up a more informed decision-making framework. Plus, the store of knowledge can put a number on the value created by different teams or initiatives.
[00:07:07] What extra value came from team experiments? This is the kind of information that is priceless for strategic planning, allocating resources, and evaluating performance. In short, investing into these four components can really boost your organization’s learning and decision-making abilities.
[00:07:32] Let’s dive into the usual hurdles mature experimentation programs come across.
[00:07:39] First up is traffic scarcity. Even if you have enough traffic to run experiments today, when you make it easy to run experiments, your experiments will grow in size and scope, and you’ll find your organization’s hunger to test more, learn more, and push the boundaries more will eventually outpace your traffic growth.
[00:08:03] Consider web traffic as a crucial resource. When the demand for it surges, it becomes a scarce commodity. As a result, many experiments find themselves queued up to get executed after the current ones conclude. Let me share an example from my previous experience.
[00:08:23] We had a website that showcased these recommendation boxes in over a hundred places on various pages. Some of these recommendation boxes provided suggestions based on past purchases, others leveraged users’ ongoing activities in the session, while a few used user-to-user similarity to recommend products, et cetera.
[00:08:50] Every time we developed a new recommendation algorithm, a significant backlog of experiments would accumulate. These experiments were crucial to determine which of the components, which of the boxes, would benefit most from the integration of these fresh backend capabilities. However, the inherent limitation was that we couldn’t run all these experiments simultaneously, because getting sufficient signal on the effects of the change on each box individually would require many, many times the traffic we had.
[00:09:28] So some of the experiments had to be put on hold for several months until the necessary traffic was accessible for a robust and valid analysis. This situation was a testament to the critical role traffic plays in the scheduling and execution of experiments. I think we can all agree that it is critical to use the traffic resource wisely.
[00:09:53] However, there are many reasons why we might waste traffic. The first big waste maker is the process of setting up experiments. It usually has manual steps that eat up time and can easily lead to mistakes. These mistakes often mean retesting, which further eats into your traffic resource. This really adds up when you’re running hundreds of experiments all at once.
[00:10:24] Next, experiments might take longer to finish because the statistical methods used aren’t the best fit to answer the experimenter’s questions efficiently. Lastly, the experimentation process is inherently uncertain. A good number of experiments don’t give the results we want, or in other words, they fail.
[00:10:49] Different sources suggest that only between 10 and 35% of experiments succeed. This is a statistic that lines up with my own experience. To sum it up, as your experimentation program expands in ambition and scale, the challenge is not only juggling multiple experiments at once, but also minimizing traffic waste.
[00:11:13] The goal is to squeeze every bit of value and insight from the available traffic. This pushes us to think about more advanced, innovative, and efficient ways to experiment. In my experience, there are several strategies that work well to address this. To tackle the first issue, we found that automating processes can really save you time and effort.
[00:11:39] It cuts down on manual work and frees up resources. This lets you run more experiments and focus on strategic tasks. You can use automation in many stages of the experiment process from setting up and collecting data to analyzing results, and even reporting. Automation will not be cheap, fast, or easy. So a good strategy when implementing automation is prioritizing the steps in the process that lead to the biggest inefficiencies.
[00:12:12] Also, requiring experimenters to craft their experiments with care makes sure that the experiments give you meaningful and accurate results. This involves clearly stating your hypothesis upfront, picking the right metrics, and getting the right sample size for statistical significance.
[00:12:35] On top of that, ranking experiments based on their potential impact on key metrics can stop you from wasting resources on low-value experiments. You can use methods like the ICE framework, which stands for Impact, Confidence, and Ease, to decide which experiments are most likely to give you high value.
[00:12:59] With this framework, you can list all the experiments your organization is interested in running and give them a score based on the potential benefit expected, your confidence in seeing that outcome, and the ease of implementing the change if the experiment is successful. Once you’ve scored each idea on these three criteria, you can calculate an ICE score by taking the average of those three scores. This gives you a single number that you can use to rank and prioritize your experiments.
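The ICE ranking described above can be sketched in a few lines of Python. This is only an illustration: the 1-to-10 scoring scale, the `ExperimentIdea` class, and the backlog entries are assumptions made for the example, not something specified in the talk.

```python
from dataclasses import dataclass


@dataclass
class ExperimentIdea:
    name: str
    impact: float      # expected benefit (1-10 scale is an assumption)
    confidence: float  # confidence in seeing that outcome
    ease: float        # ease of implementing the change

    @property
    def ice_score(self) -> float:
        # The ICE score is the average of the three criteria.
        return (self.impact + self.confidence + self.ease) / 3


def rank_by_ice(ideas):
    """Return the ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.ice_score, reverse=True)


# Hypothetical backlog, for illustration only.
backlog = [
    ExperimentIdea("new ranking model on home page", impact=8, confidence=5, ease=3),
    ExperimentIdea("reword checkout button", impact=3, confidence=6, ease=9),
    ExperimentIdea("personalized email subject", impact=6, confidence=7, ease=8),
]

for idea in rank_by_ice(backlog):
    print(f"{idea.ice_score:.2f}  {idea.name}")
```

Averaging weights the three criteria equally; a team that cares more about impact than ease would need a weighted variant.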
[00:13:32] Moving on to the next item, your data can give you more insights than you might first think. Techniques like heterogeneous treatment effects can show you how different user groups react differently to changes.
[00:13:46] Statistical methods like CUPED can make your experiments more sensitive by tightening the confidence intervals around your outcome metrics. This helps you spot the impact of changes with less traffic. To increase the success rate, or the hit rate, of your experiments, it’s smart to be more confident in the outcome before you run the experiment.
[00:14:12] For example, offline experiments and interleaving are two fast ways to do this. Offline experiments use old data to run simulations, helping you spot promising changes before you put them online. Interleaving lets you compare the performance of different algorithms at the same time by weaving their results together and tracking user interactions.
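To make the interleaving idea concrete, here is a minimal sketch of team-draft interleaving, one common way to weave two rankings together. The talk does not say which interleaving variant was used, so the choice of team-draft, the function names, and the product IDs are all assumptions.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng):
    """Merge two rankings; return (merged list, {item: 'A' or 'B'})."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, team_of = [], {}
    while len(merged) < length:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            # Each team contributes its highest-ranked item not yet shown.
            candidate = next((x for x in rankings[team] if x not in team_of), None)
            if candidate is not None:
                merged.append(candidate)
                team_of[candidate] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return merged, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the team with more clicks 'wins' the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


# Hypothetical rankings from two recommendation algorithms.
merged, team_of = team_draft_interleave(
    ["p1", "p2", "p3", "p4"], ["p3", "p1", "p5", "p6"], length=4, rng=random.Random(7)
)
print(merged, credit_clicks(["p3"], team_of))
```

Because every user sees results from both algorithms in one list, preferences show up in click credit with far less traffic than a conventional split test would need.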
[00:14:38] These methods will not replace experimenting, but using them can help you weed out experiments that are less likely to work. This lets you use your traffic more effectively. Then there’s the choice of running parallel versus orthogonal experiments. In orthogonal experimenting, you run multiple tests at the same time by changing what or who gets randomized into different variants for each experiment.
[00:15:08] So if you are running, let’s say, 10 user-level experiments orthogonally, any given user might be in the treatment or control group of any of these 10 experiments. But because you use the same traffic to run all these tests, this approach doesn’t account for possible interactions between tests.
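One common way to get orthogonal assignments like this is to hash the user ID together with a per-experiment salt, so that a user’s bucket in one experiment tells you nothing about their bucket in another. The sketch below assumes that approach; the function names, the 50/50 split, and the experiment IDs are illustrative, not taken from the talk.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a variant of one experiment."""
    # Salting the hash with the experiment ID decorrelates assignments
    # across experiments that share the same traffic.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"


def orthogonal_assignments(user_id: str, experiment_ids):
    """Every user takes part in every experiment, independently assigned."""
    return {exp: assign_variant(user_id, exp) for exp in experiment_ids}


# Hypothetical experiment IDs.
print(orthogonal_assignments("user-123", ["exp-A", "exp-B", "exp-C"]))
```

Independence of assignment keeps the groups balanced with respect to other experiments, but, as noted above, it does not remove interactions between the changes themselves.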
[00:15:31] I will talk about interaction effects more in the next slide. On the other hand, parallel experiments make sure that every user is assigned to the treatment group of only one experiment at a time. This ensures that each intervention’s effect is measured on its own, steering clear of confounding effects.
[00:15:52] The latter approach requires significantly more traffic. Deciding which approach to take depends on your specific situation, but both can be useful tools in your experimenting toolkit. To sum up, traffic scarcity is a big challenge, but it’s not unbeatable. With the right strategies, you can handle this issue and run a successful, insightful experimentation program.
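As an illustration of the variance-reduction idea mentioned in this section, here is a minimal CUPED sketch. It assumes each user has a pre-experiment value of the same metric to use as the covariate; the simulated data and the names `cuped_adjust`, `pre_metric`, and `outcome` are made up for the example.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x))."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)  # regression slope of y on x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Simulated users: the pre-experiment metric predicts the in-experiment metric.
rng = random.Random(42)
pre_metric = [rng.gauss(10.0, 2.0) for _ in range(5000)]
outcome = [x + rng.gauss(0.0, 1.0) for x in pre_metric]

adjusted = cuped_adjust(outcome, pre_metric)
print(f"variance before: {variance(outcome):.3f}")
print(f"variance after:  {variance(adjusted):.3f}")
```

The adjustment leaves the mean unchanged and shrinks the variance by roughly the squared correlation between covariate and outcome. In a real experiment, theta would typically be estimated on the pooled data from all variants and then applied to each variant.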
[00:16:20] When managing a mature experimentation program, organizations face the challenge of keeping their experiments healthy. A complex experimental setup typically has many moving parts, and each one plays a key role in the experiment. If any of these parts fail or behave unexpectedly, it can mess with the basic assumptions of the experiment.
[00:16:44] Also, running many experiments at the same time can cause problems. They might interfere with each other and muddle the results. So it is important to keep a constant eye on things no matter how advanced your program is. A common issue that we might see is imbalance, also known as sample ratio mismatch.
[00:17:07] This means the traffic sent to the different variants doesn’t match the planned percentages. For example, your control might get 51% of the traffic when you planned for 50%. This can happen for many reasons, like randomizing too early, introducing new redirects, setup problems, or even problems caused by the infrastructure. Imbalances like these can seriously warp your experiment results and lead to the wrong conclusions.
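A sample ratio mismatch check is commonly done with a chi-square goodness-of-fit test on the observed counts. The sketch below covers the two-variant case using only the standard library; the function name, the example counts, and the 0.001 alert threshold are assumptions for illustration.

```python
import math


def srm_p_value(control_n: int, treatment_n: int, expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-variant traffic split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# A 51% / 49% split on 100,000 users when 50% / 50% was planned.
p = srm_p_value(51_000, 49_000)
print(f"p-value: {p:.2e}", "-> SRM alert" if p < 0.001 else "-> looks fine")
```

With large samples, even a one-point deviation from the planned split is far too unlikely to be chance, which is why a 51% share can be a serious warning sign.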
[00:17:39] Let me give you an example. In one of the companies I worked for, we encountered a peculiar challenge. We were facing issues with some of our experiments, as they consistently failed the sample ratio mismatch test. We reviewed our randomization algorithm, we scrutinized differences in the customer demographics that got randomized into different groups, we looked at our acquisition channels, and we looked at our geographic locations. We even examined the latencies of our algorithms in the treatment and control groups, yet we found no significant discrepancies.
[00:18:16] The breakthrough came when we started investigating instances where users were supposed to receive the intervention but didn’t.
[00:18:23] We noticed an intriguing pattern. A subset of customers was switched to the treatment group sometime after the experiment had started. This was unexpected as our randomization process allocated the participants to their respective groups at the start of the experiment. Upon further investigation, we discovered that configuration changes were being pushed to the nodes serving the website in a parallel manner.
[00:18:48] However, due to unforeseen delays, some updates arrived significantly later than intended. As a result, these nodes acted as if no change had been made, which consequently led to the late treatment allocation. This anomaly shed light on the sample ratio mismatch and underscored the necessity of synchronized configurations across all nodes.
[00:19:12] Experimenters might also run into the problem of interaction effects. These happen when two or more experiments affect each other’s results. For instance, my experiment might increase the number of ads shown while another one makes the non-ad recommendations more personal. If the recommendations are more appealing, users might spend more time looking at them, which reduces their engagement with ads.
[00:19:38] But if there are more ads, it leaves less room to show recommendations, which reduces the effect of personalization.
[00:19:46] If interaction effects aren’t handled properly, they can lead to misleading or incorrect results. So it is important to come up with strategies that either help us prevent these effects or detect them when they happen. To handle challenges like imbalance and ensure the health of your experiments, there are several efficient strategies we can use. To address imbalance, the most important step is to have the right tools in place to detect it. It is possible for this error to go unnoticed for a long time. Once you detect it, the solution might differ depending on the root cause. For example, if users are randomized upstream and they fall off by the time the intervention happens, the solution might involve changing what you randomize on.
[00:20:34] This might mean moving your randomization code further downstream. If the issue arises from delays in updating the setup on your end nodes, like the situation I described earlier, it might be necessary to use version control for the experiment state and remove the unqualified events from your treatment and control groups equally.
[00:20:55] For interaction effects, there are two main solutions. One way is to keep a close eye on the experiments that are running at the same time, and make sure that no two experiments running orthogonally can interfere with each other. But if you’re running hundreds of experiments, this can be hard to guarantee.
[00:21:13] So it’s better to have alert systems in place. For example, we used a method that compares the lift achieved in an experiment, segmented by who has been exposed to another experiment. This way we could quickly show which experiments have been affected and take precautions to address the situation.
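The segmented-lift comparison described above can be sketched as a difference-in-lifts test: compute experiment A’s lift separately for users in experiment B’s control and B’s treatment, then test whether the two lifts differ. This is one plausible reading, not the speaker’s actual implementation; the conversion-rate metric, the normal approximation, and all counts are assumptions.

```python
import math
from statistics import NormalDist


def lift_and_se(treat_conv, treat_n, ctrl_conv, ctrl_n):
    """Absolute lift in conversion rate and its standard error."""
    p_t, p_c = treat_conv / treat_n, ctrl_conv / ctrl_n
    se = math.sqrt(p_t * (1 - p_t) / treat_n + p_c * (1 - p_c) / ctrl_n)
    return p_t - p_c, se


def interaction_p_value(segment_b_control, segment_b_treatment):
    """Two-sided z-test: does A's lift differ by exposure to experiment B?"""
    lift0, se0 = lift_and_se(*segment_b_control)
    lift1, se1 = lift_and_se(*segment_b_treatment)
    z = (lift1 - lift0) / math.sqrt(se0**2 + se1**2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: (A-treatment conversions, A-treatment users,
#                       A-control conversions,   A-control users)
in_b_control = (5_500, 50_000, 5_000, 50_000)    # A lifts conversion by 1 point
in_b_treatment = (5_000, 50_000, 5_000, 50_000)  # A's lift vanishes under B

p = interaction_p_value(in_b_control, in_b_treatment)
print(f"interaction p-value: {p:.5f}")
```

Run across every pair of concurrent experiments, a check like this needs a multiple-testing correction, since hundreds of pairs will produce some small p-values by chance.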
[00:21:35] Navigating the complex landscape of large scale experimentation programs comes with its unique set of challenges, particularly when it comes to striking the balance between speed, maximizing insights from data, and ensuring their accuracy.
[00:21:52] So these programs typically generate a vast amount of data, which is invaluable for shaping tactical and strategic decisions. But to extract insightful and nuanced information from this data, advanced analytical methods are typically needed. Methods such as heterogeneous treatment effects and quantile treatment effects allow us to go beyond the average treatment effect and identify how our experiments influence different user segments or quantiles.
[00:22:24] But as we delve deeper into data, we encounter the challenge of multiple testing. Multiple testing arises when we split our data to analyze effects across different segments or time intervals, increasing the number of hypotheses we’re testing and thus raising the risk of false positives.
[00:22:44] Furthermore, multiple testing issues can extend to the general handling of experimental data. For example, frequent intermittent checks, also called peeking, multiply the number of analyses conducted and therefore heighten the risk of false discoveries. In addition to the risks posed by multiple testing, the experimentation landscape has inherent complexities that can affect the results.
[00:23:15] One example is the winner’s curse, which refers to the tendency for significant findings to overestimate the true effect size. A practical example of this from my own experience can help illustrate the point.
[00:23:32] In our team, we were consistently noticing a disparity between the cumulative results of individual experiments compared to the outcomes obtained from a persistent ongoing experiment.
[00:23:46] Results of the individual tests always appeared larger, leading to collective surprise when the persistent experiment revealed a smaller effect. This confusion was mostly due to the winner’s curse at play, as each significant result on its own had a tendency to overestimate the true effect.
[00:24:09] In order to adjust for this bias and reach a more accurate understanding of our experiments’ overall impact, it was vital to apply statistical corrections. This experience reinforced the importance of understanding and addressing the winner’s curse in managing our experimentation program.
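A small simulation can show why significant results overestimate the true effect. The sketch below assumes every experiment has the same true effect and the same standard error, which is a simplification; the function name and the numbers are illustrative only.

```python
import random
from statistics import mean


def simulate_winners_curse(true_effect, std_error, n_experiments, seed=0):
    """Compare the average estimate of all experiments with that of 'winners'."""
    rng = random.Random(seed)
    # Each experiment observes the true effect plus sampling noise.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_experiments)]
    # Winners: positive and significant at the two-sided 5% level.
    winners = [e for e in estimates if e / std_error > 1.96]
    return mean(estimates), mean(winners), len(winners)


avg_all, avg_winners, n_winners = simulate_winners_curse(
    true_effect=1.0, std_error=1.0, n_experiments=10_000
)
print(f"true effect: 1.00, all experiments: {avg_all:.2f}, "
      f"winners only ({n_winners}): {avg_winners:.2f}")
```

Averaged over all experiments, the estimate is unbiased, but conditioning on significance keeps only the draws that landed high. Summing the winners’ estimates therefore overstates the cumulative impact, matching the gap against the persistent experiment described above.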
[00:24:28] Novelty effects. These are observed when users react to the novelty of a change rather than its inherent value, potentially skewing the results. Recognizing and accounting for this effect in our data helped us ensure that the results we were seeing reflected the true value of our interventions.
[00:24:50] Carryover effects. These are observed when the treatment from one period of an experiment influences the outcome in a subsequent period, like when we are running a design change test that has a positive effect that lasts even after we turn off the change.
[00:25:04] This can become an issue, especially with parallel tests or when we are running browser-level tests where the same user can be exposed to both treatment and control.
[00:25:14] Lastly, one needs to consider the concept of statistical power when designing experiments. It represents the probability that an experiment will detect an effect when one truly exists.
[00:25:26] The larger the sample size, the greater your experiment’s power will be, as it reduces the effect of random variation. However, a larger sample also demands more resources, requiring a careful balance of the desired statistical power, the effect size you’re interested in detecting, the significance level, and the available resources.
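The trade-off between power, effect size, significance level, and sample size can be made concrete with the standard normal-approximation formula for comparing two means: n per group = 2 (z_{1-α/2} + z_{power})² σ² / δ². The sketch below assumes equal group sizes and a known standard deviation; the function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mde: float, sigma: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per variant to detect an absolute difference `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma**2 / mde**2)


# Detecting a 0.1 standard-deviation shift versus a 0.05 one.
print(sample_size_per_group(mde=0.1, sigma=1.0))
print(sample_size_per_group(mde=0.05, sigma=1.0))
```

Halving the detectable effect roughly quadruples the required sample, which is why small effects are so expensive in traffic.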
[00:25:49] So how do we navigate these challenges? Let’s explore.
[00:25:54] Our goal is to ensure the accuracy of our data, while also optimizing the extraction of insightful information and doing this as quickly as possible. In order to address multiple testing, it is important to incorporate the necessary statistical corrections that adjust your p-values.
[00:26:12] This can provide the assurance that your Type I error rate stays within the intended level.
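The talk does not name a specific correction. As one example, the Holm-Bonferroni procedure controls the family-wise Type I error rate and is less conservative than plain Bonferroni. The sketch below computes Holm-adjusted p-values; the function name and the input values are made up.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values from four segment-level comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
for p_raw, p_adj in zip(raw, holm_adjust(raw)):
    print(f"raw {p_raw:.2f} -> adjusted {p_adj:.2f}")
```

Comparing the adjusted values against the usual 0.05 threshold keeps the chance of any false positive across the whole family at or below 5%. If controlling the false discovery rate is enough, Benjamini-Hochberg is a common, more powerful alternative.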
[00:26:19] For the winner’s curse, the typical solution that is used in the industry is to apply a discount to the results of your individual experiments. However, there are more advanced solutions as well. For example, at Etsy, we have developed a statistical approach that adjusts the amount of discount that needs to be applied to the result based on the width of the confidence interval.
[00:26:39] This gets us closer to the real answer on an experiment-by-experiment basis. Novelty effects also require intervention. We found that these are typically very predictable by the type of change we make. So, for example, the typical novelty effects of a UI change and the typical effect of an algo change are predictable.
[00:27:01] So we have invested in studying them separately, and now we are able to incorporate those into our metrics. CLV (customer lifetime value) modeling is another way one can approach how long-term impacts of interventions play out. The idea is that if you’re able to predict the amount of value a customer may generate in the long run based on the signals you can read during the experimentation, then some of these concerns around the winner’s curse or novelty effects can be addressed.
[00:27:30] Of course, CLV models have their own challenges.
[00:27:34] There are a few strategies we can employ to address carryover effects. First, we can design our experiments to minimize potential carryover. In the context of A/B testing, this might mean assigning each user to only one version of the intervention for the entire duration of the experiment.
[00:27:52] Second, we can use more orthogonal testing. And lastly, statistical methods can also be used to adjust for carryover effects when we analyze our data. This typically involves more complex modeling and requires a deep understanding of the nature of the carryover effects in question.
[00:28:10] So it’s not always the ideal solution. To address the sample size issues, table stakes include a platform that makes the optimal sample size calculation easy. Performing a power analysis before the experiment starts can help you determine the optimal sample size needed to detect an effect of a certain size, given a certain level of significance and the desired statistical power.
[00:28:37] This allows you to balance the resources you have available with the need for reliable results. Another approach is using sequential analysis. This approach allows for the analysis of data as they come in. You can stop the experiment as soon as the data provides sufficient evidence to answer the research question, potentially saving resources if an effect is detected early.
[00:29:01] And finally, if you’re running multiple experiments, consider the potential benefits of each one. Allocating more resources to experiments with a higher potential payoff, or to those that align more closely with business objectives, could be more beneficial in the long run.
[00:29:22] Now let’s take a closer look at another intricate aspect of online experimentation, network effects. While they bear some resemblance to the interaction effects we discussed earlier, network effects also present unique challenges and complexities. In the context of experimentation, we consider a network effect as the influence that one experimental variant can have on another variant due to their common dependence on a shared resource, whether that’s supply, demand, training data, or other resources not explicitly covered here.
[00:30:02] To get a clear grasp of this concept, let’s explore a few specific examples. Consider the case of shared demand as illustrated by the complex issue of dynamic or algorithmic pricing. When conducting pricing experiments, any alterations made to one variant could inevitably impact the behaviors of customers interacting with the other variants.
[00:30:26] For instance, if you reduce the price for a set of products, this might trigger a surge in purchases from the variant with the discounted price, unintentionally decreasing purchases from the control group. This ripple effect can distort our results, making them less universally applicable. Similarly, if you raise the price for the test group, you might once again grapple with skewed, non-generalizable results.
[00:30:54] Therefore, it’s crucial to recognize and adjust for these network effects when designing and evaluating pricing experiments. Turning to the realm of shared supply, imagine a scenario where you’re running competing algorithms determining which ads to display. There’s a risk that one variant might use a significant part of the advertisers’ budgets,
[00:31:17] leaving other variants starved of sufficient funds to yield comparable results. If one variant starts to outperform the others, it might essentially end up consuming a large chunk of the shared resource, creating an imbalance within the experiment that could potentially lead to misleading conclusions.
[00:31:37] Finally, let’s consider an interesting network effect where a control algorithm benefits from the data generated by a new algorithm. This situation can arise when a novel algorithm enhances the user experience, such as by optimizing recommendations. Such an improvement could indirectly benefit the control algorithms.
[00:31:58] For example, if this new recommendation algorithm leads to a wider variety of items being displayed to users, the control algorithm, although it’s less efficient, could still leverage the user interaction data coming from the new algorithm. This could create a false impression of the control algorithm’s performance, again distorting your results and potentially leading to incorrect conclusions.
[00:32:27] As we start talking about potential solutions for counteracting network effects, it’s crucial to keep in mind that there is no universal solution that fits all scenarios due to the complex nature of network effects.
[00:32:40] Tailored solutions that are theoretically solid and that align with your organization’s unique infrastructure and processes are often required.
[00:32:49] The crux of the challenge caused by network effects lies within the design of your experimental architecture. Each variant, algorithm, and test shares resources with the system, and changes to one component can trigger far-reaching effects. Recognizing this inherent interconnectedness can provide a roadmap for devising strategies to offset network effects.
[00:33:15] Here are some of the solutions. In the case of dynamic pricing, re-imagining your randomization strategy could be a helpful approach. For instance, in the case of algorithmic pricing for local services, using a geographic randomization could help counter network effects. If you cannot geographically separate the test, you might consider randomizing by product category to minimize potential substitution effects.
[00:33:43] Additionally, an effective dynamic pricing strategy should account for both immediate and long-term impacts of price changes. Thus, alongside randomization, adjusting for the duration of the experiment is also key. In the context of ad budget distribution, one solution could be to allocate specific budget segments or partitions to ensure balanced budget availability across all variants and advertisers.
[00:34:09] This approach can prevent any single variant from hoarding the budget. However, if your organization controls the ad serving infrastructure, you’ll need to build this capability internally to align your experiments with your experimental infrastructure. The subsequent challenge once this is implemented is managing budget splitting when numerous partition experiments are running concurrently, potentially affecting the advertiser’s ability to purchase ad slots.
[00:34:40] In situations where control algorithms benefit from data produced by a new algorithm, you might think about setting up separate pipelines for each variant. This ensures that interactions with one variant’s results don’t influence others. However, this isn’t a simple fix. It requires substantial infrastructure investment and can potentially lead to a reduction in training data volume, and that could lead to other issues.
[00:35:07] The complexities brought forth by network effects emphasize the need for a holistic approach to experimentation. It’s not only about running tests or crunching data, it’s about understanding and managing the larger system in which these experiments operate. This might mean exploring innovative experimental setups like multivariate testing or multi-armed bandit algorithms, or creating custom engineering solutions that are deeply integrated with your organization’s infrastructure.
[00:35:38] Agility is a key asset when it comes to handling network effects: having the ability to identify a problem, swiftly come up with a solution, and efficiently execute it. This underscores the importance of a flexible and robust experimentation platform, a system that facilitates quick adjustments, real-time monitoring, and rapid deployment of changes.
[00:36:03] Having this can be invaluable in managing network effects.
[00:36:07] In summary, dealing with network effects is a complex task, but with good planning, continuous monitoring, and a profound understanding of your system’s interdependencies, you can not only mitigate these effects, but also turn these challenges into opportunities to further hone your experimentation program.
[00:36:27] As I draw to a close on today’s exploration of the complexities in online experimentation, I’d like to say that effective online experimentation isn’t a singular skill, but a fusion of many. It calls for a unique blend of data engineering, statistics, and backend and frontend software engineering. In managing the complexities of online experimentation, the ability to bridge these disciplines becomes an invaluable asset. By bringing together these diverse skill sets, we can build robust, adaptable programs that stand up to the multifaceted challenges we’ve discussed today.
[00:37:08] And in my mind, the complexities we have discussed today aren’t roadblocks, but opportunities to hone our approaches, bolster our methodologies, and yield more reliable and insightful results that can propel our organizations forward.
[00:37:22] Thank you for your time.