The Human Element of Evidence at Chewy
Experimentation isn’t only about statistical models—it’s about building trust, clarity, and collaboration across teams. This conversation explores how Chewy blends scientific rigor with human judgment to make better product decisions, build shared learning systems, and scale experimentation without slowing down execution.
Summary
Christina shares how her statistical background led her into experimentation science and how she now supports product decisions at Chewy through evidence-based testing. She explains that testing only works when leaders prioritize it, teams follow consistent metric definitions, and there is a structured process for planning and execution. The conversation highlights how Chewy balances fast-moving business goals with methodological discipline, covering common issues like poor planning, data quality gaps, and incorrect statistical methods. Christina also discusses how qualitative feedback helps explain quantitative outcomes, how failures can still deliver learning, and how AI can speed up tasks like hypothesis creation and reporting while humans retain control over final decisions.
Key Takeaways
- Experimentation succeeds when leadership supports it and teams follow a clear playbook.
- Failed tests still provide learning if you analyze segments and document outcomes.
- AI can accelerate analysis, but humans must guide final decisions.
Transcript
Hi, Christina.
Hello. Hi there.
How are you doing today?
I’m doing great.
Great. Christina, welcome to ConvEx. ConvEx is VWO’s annual gathering for those who are pushing the boundaries of evidence-based growth. And with your role at Chewy and what you have been doing, I can attest that you’ve been doing exactly that, for the industry and for Chewy as well. So we’re truly excited to have you join us.
Thank you. Thank you. It’s my pleasure to be part of this experimentation community, and thanks for having me at this conference.
Absolutely. Christina, my name is Vishesh, and I lead partnerships across Asia. For the folks who are joining us at ConvEx, Christina leads experimentation at Chewy. And before I say much, Christina, why don’t you share how your day has been so far, where you are currently located, and briefly what you do at Chewy?
Yeah. My day has been fine. Thanks for asking. I’m currently in my hometown in the Philippines, but I’m normally based in Seattle, Washington.
So there’s definitely a shift in weather for me right now from warm. But I’ll soon be back in Seattle and experience the winter weather. I’m quite excited about that.
But yeah, to quickly touch on what I do at the moment: I work at Chewy. I’m the lead data scientist on the small but mighty experimentation team that supports product development on the Chewy website. So if you’re a Chewy member, hopefully you resonate with ordering your pets’ needs on our website. It’s a US-based company, and I’m really excited to support experimentation and make the experience for pet owners really good at Chewy.com.
Absolutely. I’m sure all the pet owners really thank you, because especially in the US there are so many pet owners, and they look for a good experience while purchasing products. Pet parents are purchasing products for their pets, so it’s really great that you’re part of that team. Can’t wait to dig into it. We have a host of questions for you, some light, some more searching, but I’m sure the listeners out here have a lot to take away from your experience in this field. I’d like to dive straight in.
I was checking out your profile, and you’ve had a journey from statistics to analytics to science-backed experimentation. That’s a pretty interesting journey to take. If you can share a little bit about that: what is it that pushes you to be part of an evidence-driven culture at Chewy?
Yeah. So, as you mentioned, I started off in statistics; I have my master’s and bachelor’s in statistics. I was really interested in turning data into day-to-day analytics. So I started off as a general analyst in grocery retail here in the US.
I really enjoyed my time there. We were developing tools that would scale insights, basically building an Alexa for analytics. However, I always felt that I missed having statistics at the core of what I do on a day-to-day level. That changed when I found a role at Expedia Group, where I was positioned on the experimentation science team and realized I could actually use statistics to scale product decision making across the company. That was really, really exciting for me.
As part of that role, I worked closely with the experimentation platform, where we built the tool that experiment owners use to run experiments. I helped evangelize the experimentation playbook, which covers the technical and nontechnical sides of how we should experiment, and spread best practices across the company, working with cross-functional teams like data scientists, product, and so on.
Now I’ve landed at Chewy, and I’m excited to be continuing the same journey, but at an earlier stage of maturity. We’re shaping the foundations of experimentation and supporting teams by building confidence in testing using the right methodology and guardrails. As I mentioned, what I’m really excited about in experimentation is that it’s both scientific and practical. The idea that anyone can be a scientist is quite exciting. It makes me feel like anyone can make good decisions while balancing the risk and reward in what they do. So yeah, it’s been an exciting journey so far that landed me here.
Lovely. Lovely. That’s an interesting journey, and I’m glad you take away things like the idea that anyone can be a scientist; we just have to put our minds to it. And Expedia is globally known as a very experimentation-forward company, just like Booking.com and many others. Since you mentioned scientists, and you’ve worked in experimentation roles across two large organizations: being a data scientist in this field, there can be millions of problems you could pick up on a day-to-day basis. How do you decide which one should be picked up or prioritized versus something that is just about exploring or understanding the problem?
Yeah.
I might be a little biased here because, as you mentioned, I work in an experimentation-driven company.
Of course.
So I would dare say that in any mature organization, experimentation should be considered for any site change, because experimentation allows you to measure the potential risks as well as the positive impact of the changes you make. It can be hard sometimes, but with a good platform and a good experimentation process, it can be done. There are companies where the rule is that any change has to go through experimentation.
I think the important thing about experimentation is that it serves as the mechanism for consistent standards in how different teams tolerate and measure risk. So it’s quite important for any growing and maturing organization. However, if you’re a new team and maybe you only have a hundred or two hundred thousand visits of traffic, maybe your priority is more about how to increase traffic to your website. You can still use experimentation or A/B testing, but focus mainly on identifying bugs and errors rather than interpreting the potential impact of the change you’re making.
Because if you have a small sample size, you tend to have higher variation, which may not be a good representation of what the true impact of your experiment is; but you can still use experimentation to mitigate risk in any changes you make to your platform. So as much as you can, definitely consider experimentation. For smaller teams that are just starting, it’s still better to start, but your objective will be a little different than a mature organization’s.
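To make the small-sample point concrete, here is a minimal sketch, with purely illustrative numbers, of how the uncertainty around a measured conversion rate shrinks only with the square root of the sample size:

```python
import math

def conversion_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a ~95% confidence interval for a conversion rate p
    measured on n visitors (normal approximation to the binomial)."""
    return z * math.sqrt(p * (1 - p) / n)

baseline = 0.03  # hypothetical 3% conversion rate
for n in (1_000, 10_000, 100_000, 1_000_000):
    w = conversion_ci_halfwidth(baseline, n)
    print(f"n={n:>9,}: 3.00% +/- {w:.4%}")

# With 1,000 visitors the interval spans roughly 2% to 4%, far too wide to
# detect the small relative lifts typical of A/B tests; hence small sites
# get more value from using tests to catch bugs than to estimate impact.
```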
Got it. Got it. Makes sense that depending on the size of your organization, or where you are in the maturity journey, you get to choose which experiments to focus on, and it can start from something as simple as catching bugs and growing traffic, up to measuring impact once you become a mature organization like Chewy or Expedia. That definitely makes sense.
And in the case of Chewy: you’re in the e-commerce space, the online retail space, and that can be complex in nature because there are so many moving parts. There is business pressure to execute on goals. There are stakeholder priorities. Someone wants revenue up. Someone wants visibility up.
Tie that with the rigor that comes with science, because experimentation cannot just be, “Hey, let’s go ahead and test this.” In such a complex retail and commerce environment, how do you balance the need for scientific rigor with all the priorities and business goals that keep coming up?
Yeah. Definitely. My experience has really been on the product side of things, and having good integration with the product roadmap, having good partners within the product team, and having them understand the benefits of experimentation is really key.
I’ve met leaders where, right off the bat, they know experimentation is key. They have to do it. And they actually invest in experimentation.
In both my experience at Expedia and at Chewy, leadership values experimentation, which is amazing for our team. It allows us to really grow and expand the teams we touch. Embedding testing into the workflow of product development is also important. I say this as if it’s easy; it’s actually not, but it’s crucial. And that’s what we as the experimentation platform team try to solve: how can we make the barrier to entry for experimentation as low as possible?
Part of that is automated reporting, which usually helps balance statistical rigor with the speed at which the business makes decisions. One of the myths about experimentation is that it slows down development. But ideally, if you have a good experimentation life cycle and platform, it should actually speed things up, because once you adopt experimentation, you can follow the same rigor every time. What I’ve found enables teams to onboard smoothly is having a clear experimentation playbook, running a lot of trainings, and having regular check-ins, especially with your SMEs, your subject matter experts on experimentation, to understand how adoption is going and address any challenges proactively.
So at the end of the day, the goal of experimentation is to make it feel like a natural part of how teams build and learn from what they do, not an additional step.
That’s very rightly said, because in my endeavors across partnerships, that is what I tell partners and agencies: you do not have to carve out experimentation as a separate function; it can very beautifully embed into your website journey, into your product life cycle journey. And of course, what makes a lot of difference, as you said, is the leadership valuing experiments. If that happens, all of this becomes seamlessly tied in. And very true that it can be a templatized process: you do not have to do the same thing again and again, but create templates and ensure new goals, new metrics, and new experiments are tested under the same framework.
You seem to have covered a lot of ground, and I’m glad you found the right leaders, Christina.
And one myth busted for everyone listening: experimentation does not slow down your work. Once you get the hang of it, once you’ve created the process for it, it actually makes things much quicker. So thank you for sharing that tidbit with us, Christina. And since you speak about product teams: for every product team and every leader, there are different metrics to be tested, different designs to be tested. In optimizing user journeys, where every user goes through a particular journey, what is the most valuable metric or design you’ve found, either for retaining a customer or for ensuring they go through the entire buying flow? If you can take us through that.
Absolutely. I feel like each organization will have different success metrics they want to focus on. Most of the time, in my experience, once you run experiments and want to use them to provide value and impact to the company, teams usually focus on the end-of-funnel metric, which is usually purchasing. So it’s conversion.
The downside is that conversion is sometimes harder to detect changes in, because it’s lower down the funnel. So the key here is to ensure that each test has a well-defined success metric. Unfortunately, there’s no one metric you can use all the time; what I’ve learned is that you need to be really intentional about which metrics you’re going to measure that will help you make the ship-or-not-ship decision for your experiment. The other thing we always tell our users is to make sure you also have secondary metrics, and these secondary metrics should allow you to answer why your end-of-funnel or conversion metric changed.
This is where you want to follow the user engagement, the customer journey, from when they start seeing the change you’re testing until the end of their journey. The other thing I’ve learned about metrics, which is so important, is to have standardized, unified metric definitions across all experiments, and as much as possible, tie them to the business reporting you already have across the company.
Both companies I’ve worked at, and I think other bigger companies too, have a metrics platform integrated with their experimentation. So beyond knowing what metric to put in your hypothesis, having it available in the platform is a game changer for making your experimentation faster and scaling it further. And lastly, one of the things most teams are interested in is long-term metrics such as retention.
Did users come back? Did they purchase again? With experimentation, this is sometimes harder to measure, because you tend to measure metrics you can observe during the test duration. There are more advanced methodologies now that allow you to measure longer-term impact, and more basic ways of doing it at different companies, but retention is sometimes what the core business is interested in more than the short-term metrics. So figuring out, as a growing platform, how to measure it, maybe by identifying proxy metrics for your retention metric, is really key.
Given that in experimentation we tend to make launch decisions based on metrics measured during the experiment, there are some metrics you need to measure outside of the test duration.
I’ve never looked at it like that: while focusing on one metric, you have an impact on long-term metrics as well, which may not be directly measured during the experiment. Retention is actually a great example here, because retention can be such a broad thing; you may be focusing on one metric, but it has a cascading effect on other metrics as well. So that’s really good of you to share. And as you said, there are macro metrics and micro metrics, and we shouldn’t focus only on the bottom of the funnel but also on the top.
So it’s definitely industry dependent and business-goal dependent. You need to have those business goals, like whether you’re focusing on retention or on new business acquisition, and from there you can drill down into specific metrics.
And of course, if you’re integrated with other platforms so that metrics can land in your experimentation platform, it makes the job much easier. Definitely a well-rounded answer, Christina, and our listeners can get a lot from this.
Shifting gears a little bit: we’ve spoken about the fundamentals, so let me shift to running experiments at scale. When you run these large-scale experiments, especially on a platform like Chewy with millions and millions of users, a lot of quantitative results keep coming in, in terms of metrics and conversion rates. And along with that, there is qualitative feedback that you get from your end users as well. How do you combine the two in order to take a decision that meets a business goal?
Yeah. Definitely. When we’re running large-scale experiments, we rely on standardized reporting: you have all the metrics standardized, and you report them at the end of the experiment. Most of these are quantitative, right? The idea is that you’re measuring, as you said, millions of users, and the data allows you to make inferences about the whole population.
Qualitative metrics, on the other hand, tend to be what you get from user research, interviews, surveys...
And NPS scores and so on and so forth.
Those types of insights are very useful because you tend to get real answers from real users. Unfortunately, they can sometimes be biased, because you don’t get a good random sample; maybe there’s a bias in who answers an interview or who actually sends you a response. So there are more intentional, more advanced statistical adjustments you need to apply in those analyses, but they’re great.
Basically, the qualitative side helps you uncover the why behind your quantitative metrics. You can use it to supplement the results you’ve seen through quantitative measurement. If you want to move fast, converting some of the qualitative signal into something measurable and quantitative will help you scale faster. But I’ve also seen teams use qualitative context to generate hypotheses and to gather feedback on the launches they’re making.
So by combining both perspectives, you can validate the insights you gathered quantitatively while grounding your decisions in real user sentiment. There’s more complexity in generating statistical inferences from qualitative results, but they are really important, and they can supplement the quantitative results you measure in the platform when you run experiments.
That’s really good to hear. What I take from your answer is: use quantitative results and ground them in the qualitative results. That’s a very good lesson for our listeners: do not ignore the qualitative just because you cannot measure it as easily; it comes directly from the voice of the user, while the data and metrics speak their own truth. And if I may ask: is there any one interesting survey, piece of feedback, or NPS comment you got from customers at Chewy that made you think, “That really validates Chewy as a brand,” or validates your role? If you remember any one such comment or voice of a Chewy user.
Sure. I can say in general that Chewy’s customer service is really at the highest level you can find out there. We have a lot of users coming back. I don’t know if you know this, but Chewy sends out gifts or tokens to pet owners when they know a pet has died, because they stopped ordering...
Yeah.
...their subscription and so on and so forth. So what we receive from our users is really good feedback about the customer service and how Chewy shows up across the whole life cycle of our customers. All this feedback we get from users is so important, and we do take it seriously.
So definitely, we don’t take it for granted. These are real customers, and we want to make sure they are satisfied, love our product, and will come back again for their new pets’ needs.
Absolutely. For everyone listening, this is a very heartwarming thing that Chewy does. Do listen to your customers and their voices of opinion; they may surprise you with how you can convert them into results and keep bettering your service and your offering. So thank you so much for sharing that, Christina.
And, Christina, on the subject of large-scale experiments, this is one thing the world of experimentation truly battles with: everyone is looking for a winner. People fail to see the benefit or success of a failed experiment. So in your experience, how do you interpret failed experiments, and how do you extract learnings from them, so one is not bogged down by “this experiment failed, so something is wrong”?
Yeah. It’s human nature, right, that it’s hard to see a failure in front of you. It can also make people not feel good about experimentation.
I think one thing people should understand is that the most important part of experimentation is learning. And the learning could be positive, a success; it could be a failure; or sometimes you don’t have enough data and the result is inconclusive.
One thing to note is that you have to establish a well-defined decision matrix or guidance before running the experiment. Say you’re running an experiment and there’s a risk of negative impact: set early on what your threshold for that negative impact is, in dollar value. For example, you don’t want to impact your revenue by more than one million dollars, and that’s your limit.
That’s kind of your cost of experimentation, right? You want to set that up so you don’t just stop the experiment early out of fear. Having that early on is so important for making the decision faster later on.
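As a hedged sketch of the dollar-value guardrail Christina describes (every number here is hypothetical, not a Chewy figure), you could translate the annual revenue limit into a per-user threshold and compare it against the experiment’s confidence interval:

```python
# Hypothetical risk budget agreed before launch
annual_traffic = 50_000_000          # users per year who would see the change
max_annual_loss = 1_000_000          # the $1M cost-of-experimentation limit
per_user_floor = -max_annual_loss / annual_traffic   # -$0.02 per user

# Hypothetical interim readout: revenue delta per user, treatment minus control
delta = -0.005
std_err = 0.006
ci_low = delta - 1.96 * std_err      # lower bound of a ~95% CI

if ci_low < per_user_floor:
    print(f"Lower bound {ci_low:+.4f} breaches the {per_user_floor:+.4f} "
          "guardrail: stop or iterate.")
else:
    print("Guardrail intact: let the test run to its planned end date.")
```

Framing the check against a pre-agreed risk floor is what separates monitoring from the peeking Christina cautions against next: you are watching a guardrail, not hunting for significance on the primary metric.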
And monitoring an experiment for negative impact is different from peeking. For the experimentation users here: peeking is something statisticians and experimentation platforms will tell you not to do; it’s a bad practice. But monitoring is important, because that’s what you use experimentation for.
You use experimentation to mitigate risk. There’s a cost of experimentation you want to make sure you’re monitoring, and you eventually use it to assess whether you’re going to go forward or iterate. So when there’s a failure, you don’t just leave it and move on with your day.
There’s a lot to learn. One thing you might want to do, if you stopped the test early, is slice and dice your data, analyze segment-level results, and try to understand which segments of customers the feature didn’t work for, and why.
These insights can inform how you strengthen the next iteration of the idea. An important part here is also to document what you’ve learned. One thing I’ve heard at a conference about how to reduce the fear of failing is to simply show up and tell people what the different results of experimentation are, just so people get familiarized and become comfortable with failure. Because in reality, in most mature companies, there’s only about a twenty percent win rate on average.
If you’re winning fifty or eighty percent of the time, you might want to question the results you’re getting, because in reality each change can only tweak things a little bit, right?
Absolutely.
So we should be comfortable with a twenty percent win rate in general.
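For the segment-level reads Christina recommends on a failed test, a minimal sketch along these lines could work; the column names and data here are invented for illustration, not Chewy’s actual schema:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(42)
n = 40_000
# Hypothetical per-user experiment export
df = pd.DataFrame({
    "variant": rng.choice(["control", "treatment"], n),
    "segment": rng.choice(["new", "returning"], n),
    "converted": rng.random(n) < 0.03,
})

def two_prop_z(g: pd.DataFrame) -> pd.Series:
    """Lift and two-sided p-value of a two-proportion z-test in one segment."""
    c = g.loc[g.variant == "control", "converted"]
    t = g.loc[g.variant == "treatment", "converted"]
    p_pool = g["converted"].mean()
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(c) + 1 / len(t)))
    z = (t.mean() - c.mean()) / se
    return pd.Series({"lift": t.mean() - c.mean(),
                      "p_value": 2 * norm.sf(abs(z))})

print(df.groupby("segment").apply(two_prop_z))
```

One caveat worth keeping in mind: slicing multiplies comparisons, so segment-level “wins” found this way are hypotheses for the next iteration, not shipping decisions.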
No. No. It’s almost like the Pareto principle: twenty percent of what you do brings you eighty percent of the results.
So that sort of applies here as well. And on the failure part, a lot of what you said draws on human nature: failing is part of being human, and all you have to do is show up. When you keep showing up, there will be success, and there will be failure again.
It just gets a little tricky when there is money on the line. And I guess, from what you said, in order to keep that in control, guardrails and a threshold for the cost of the experiment become the way to ensure there is no dollar leakage, and yet you can keep experimenting even when failures are happening. So that’s really great.
And guardrails can be set effectively by our platforms, so you can contain failures without a larger negative impact on the dollar value or the revenue of your organization. That is definitely a good way to look at negative results. But for everyone listening: go out, fail, show up.
That’s how big brands have been built, and that’s how experiments eventually turn into success for you.
And on those lines: we’re speaking about large organizations. You’ve been part of Expedia; you’ve been part of Chewy. There might be so many teams working in different silos. How do you ensure that what you as an experimenter have learned from A/B experiments is shared across teams, and that it is adopted by those teams as well?
Yeah. Definitely. And this is so important. A centralized reporting system and a shared repository of the insights you’ve generated across experiments is key.
Having one place where you can see which experiments are active, which are completed, and what the results were really increases visibility and builds familiarity with what fails, how often, and what the experimentation program looks like across teams. I say this, but it’s actually a gap in a lot of the experimentation teams I’ve worked with or heard about, because technically you use experimentation to generate results, yet the decision making sometimes happens outside of experimentation.
So how do you bring that back into the platform? I think this is still something teams are working on. But in the most recent developments in experimentation, meta-analysis, generating insights from a collection of experiments the way we’re describing right now, is basically the next frontier of how we scale experimentation. That’s why proper documentation of results is more important than ever.
In the future, we might be running even larger experiments as our technology rapidly evolves. We’ll be running a lot of experiments; we might be running whole systems of experiments. So having one place where everybody documents what they’re experimenting on and what the result was, with a summary, is so important.
Many large experimentation programs even hold weekly experimentation reviews to share learnings and gather cross-team feedback. This raises awareness of which features have been tested and what worked well or didn’t across the organization. And while we talk a lot about scaling, in reality a natural way of doing it is to establish touch points between experimentation experts and leaders and see what approaches they should be aware of, what pain points they’re facing, and what interesting experiments they’ve run so far.
So there are two parts to it: one, on the platform side, have a good repository of experiments that have already run; and second, in your experimentation program, have something like a weekly email of all the experiments that ran, or more communication in the community. That way we can remove these silos and have more natural growth in a bigger organization.
This also goes back to what you said earlier: when your leadership supports you, it enables you to create these communities, to actually share results across teams, and to have those results adopted. And one thing you rightly said: decision making often happens outside experimentation.
Figuring out how to bring that back in can truly be a game changer. That is one thing I’m going to take back from today’s conversation: how to bring decision making into experimentation rather than having it as an afterthought. So thank you so much for sharing that line with us. And on similar lines, this requires not only experiments being shared, but collaboration across functions as well.
Design teams, UI/UX teams, performance marketing teams, analytics teams. And at times there are nontechnical stakeholders as well; like I said, design and UI/UX teams can be nontechnical, and then you have analytics and performance teams.
How do you manage this entire collaboration and ensure that evidence and data are still the foundation of it?
Right. Yeah. I would say two ways. But before I say that, one thing about experimentation is that it is a team sport.
As you said, there are a lot of different roles that make a successful experiment, right? So the first thing I would say is to develop a clear RACI model. RACI means responsible, accountable, consulted, informed.
Each function should know at which stage of the experimentation life cycle they should be actively involved and responsible for its success, and at which stages they should just be informed or consulted. Different teams are organized differently: do you have an embedded data scientist, or a centralized team, and so on? Whatever it is, understanding which role is responsible for each stage of the experimentation life cycle is so important, so you can identify how to collaborate and get involved.
Once you have that, the other thing is to run a lot of trainings.
Experimentation touches a lot of different spaces in the company.
Running trainings informs users, first, about the RACI model I was just mentioning, and educates them on the concepts relevant to their roles. Maybe a UX designer doesn’t need to know what a p-value or a confidence interval is, and so on.
Of course.
Yeah. So in my experience, I’ve run separate trainings: technical ones where I dive deep into what a power analysis is and why it’s important, and nontechnical ones where we focus more on how to generate hypotheses, build the right metrics, and use the results they see in the platform. At the end of the day, people want to learn; you just have to make sure you give them all the materials they need, whether they want to learn in a self-paced manner or through trainings. And the last thing I would add: if your team is new, have a place where they can ask questions, like support channels and so forth.
Of course, once your team grows, support channels can become overwhelming, so you need to know how to funnel those questions as you scale. But having those trainings and forums where people can ask questions, so they can move faster, is really important.
I think so. For everyone listening, there are three things Christina mentioned. One is having a defined RACI.
Second is ensuring teams are trained for their respective fields: technical training for technical staff, nontechnical for the rest. And third is having the right support channels so that any doubts can be cleared. I’m happy to hear that this is what Chewy does as well. That’s really great. We’re coming toward the tail end of the experimentation questions.
But since we’re talking about teams and their encounters with experimentation: what are the most common pitfalls you come across? Is it a lack of understanding of experimentation? Statistical errors? The quality of the data?
What are the most common pitfalls that all these teams encounter in the experimentation journey?
Sure. Yeah. I can summarize it in maybe four points, coming from answering maybe hundreds of experimentation questions in support channels and office hours. The first one, in general, is not planning ahead.
A lot of the time, teams come to us in the middle of their experiment asking: should I run my test longer? Should I stop it? And so on.
That tells me they probably didn’t have a good experimentation strategy before they launched, so they end up asking these questions mid-flight. The downside is that it prolongs the experimentation timeline and taints the decision making with bias. Part of this is having unclear metrics, missing guardrails, or not setting a test duration, which can lead to underpowered tests or unreliable results.
So number one is not planning ahead. The second one is not sticking to the plan. Sometimes teams do plan ahead, but in the middle of the test they look at the results and cherry-pick. Once you start moving the goalposts, you risk making decisions in a biased way, not backed by solid evidence.
That again slows down the scaling of experimentation, because the idea is that when you run larger experiments, you have controlled error in the decision making. If you don’t follow the playbook, you’re actually increasing the risk of wrong decisions beyond the five percent false positive rate you had set.
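Christina’s point about error control is easy to verify by simulation. The sketch below, with made-up parameters, runs A/A tests (no true effect) and shows how declaring significance at the first “winning” peek inflates the false positive rate well past the nominal five percent:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_per_look, n_looks = 2_000, 500, 10

def false_positive_rate(looks) -> float:
    hits = 0
    for _ in range(n_sims):
        # A/A test: both arms sample the same distribution (true lift = 0)
        a = rng.normal(0, 1, n_per_look * n_looks)
        b = rng.normal(0, 1, n_per_look * n_looks)
        for look in looks:
            m = n_per_look * look
            diff = b[:m].mean() - a[:m].mean()
            se = np.sqrt(a[:m].var(ddof=1) / m + b[:m].var(ddof=1) / m)
            if abs(diff / se) > 1.96:   # "significant" at this peek
                hits += 1
                break
    return hits / n_sims

print("one look at the end:", false_positive_rate([n_looks]))              # ~0.05
print("peek at every look :", false_positive_rate(range(1, n_looks + 1)))  # well above 0.05
```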
The third one, which I know is hard, is ignoring data quality issues. Once your experiment results have an issue, it’s hard to trust the data. The first step is to go back, understand the issue, fix it, and then maybe rerun the experiment. I know product development sometimes runs really fast and teams just ignore the issue and report results anyway, but big data quality issues can lead to biased results.
Did you actually increase your conversion, or were your results biased because, say, they only included new customers or something like that? You need to understand what’s causing the quality issue, fix it, and try to rerun the experiment. And lastly, more on the data science side, is using the wrong statistical test.
You can always Google what a t-test is, but the right method really depends on the metric you’re analyzing.
Just as an example, take a ratio metric: you have a numerator that changes as you add more data to your test, and a denominator that can also move. You need a different statistical method there, because if you don’t use one, you can end up with a higher false positive rate or just incorrect decisions. Another example is applying multiple hypothesis correction. These are key things I’ve noticed, and the reality is they can be solved with a good platform.
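For the ratio-metric example, the standard fix is the delta method, which accounts for both the numerator and the denominator varying across randomization units. A sketch with simulated data (clicks per session, analyzed per user) might look like this; the data and effect size are invented:

```python
import numpy as np
from scipy.stats import norm

def ratio_and_var(num: np.ndarray, den: np.ndarray) -> tuple[float, float]:
    """Delta-method estimate and variance of sum(num)/sum(den),
    with num/den aggregated per randomization unit (e.g., per user)."""
    n = len(num)
    mn, md = num.mean(), den.mean()
    vn, vd = num.var(ddof=1), den.var(ddof=1)
    cov = np.cov(num, den)[0, 1]
    var = (vn / md**2 - 2 * mn * cov / md**3 + mn**2 * vd / md**4) / n
    return mn / md, var

rng = np.random.default_rng(0)
# Hypothetical per-user data: sessions (denominator) and clicks (numerator)
sess_c = rng.poisson(5, 10_000) + 1
sess_t = rng.poisson(5, 10_000) + 1
clicks_c = rng.binomial(sess_c, 0.100)
clicks_t = rng.binomial(sess_t, 0.105)   # simulated 5% relative lift

r_c, v_c = ratio_and_var(clicks_c, sess_c)
r_t, v_t = ratio_and_var(clicks_t, sess_t)
z = (r_t - r_c) / np.sqrt(v_c + v_t)
print(f"control={r_c:.4f} treatment={r_t:.4f} z={z:.2f} p={2 * norm.sf(abs(z)):.4f}")
```

For the multiple-testing point, statsmodels.stats.multitest.multipletests can apply a Bonferroni or Benjamini-Hochberg correction across the p-values of a family of metrics.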
But in the spirit of making experimentation accessible, I’ve moved on from being super strict about policing these mistakes. Instead, we should focus on preventing them with a good platform and clear playbooks for what to do when these things happen, because they happen naturally. With the right tools in place, experimenters don’t need to stress about all of these things, though they should keep them in the back of their mind. A good system lets teams mitigate these common pitfalls and guides them every step of the way, so they know what not to do, and that makes the adoption of experimentation best practices better.
Absolutely. This can be applied to many other fields within an organization as well. But for everyone listening, what Christina mentioned is: please plan ahead, and stick to the plan you have set, because at times you may be derailed.
Do not ignore data quality issues, and do not use the wrong statistical tests; use the right ones that give you reliable information. This will definitely help our listeners navigate pitfalls in their experimentation journey. But there is one more, different problem, Christina. One case is when a product has already been built and you’re experimenting on it. But when you’re launching absolutely new products and features, there’s no historical data and no understanding of user behavior. How do you approach experimentation in such cases?
Yeah. That’s such a good question. A lot of the time when people talk about experimentation, it’s about optimizing something that already exists. But traditional A/B testing can still be applied when you’re launching a new product.
As long as the product allows you to bucket your users between a control, meaning no launch, and a variant with the launch, you should still be able to do A/B testing. There are times when there’s no way to bucket, say when you’re launching a new program that all customers must get; then quasi-experimentation or pre-post analysis is needed. But for most features, A/B testing can still apply.
The key is to start with a well-defined hypothesis and ensure that your primary metric is measurable in both your control, those who don’t see the launch, and those who see the new product. A lot of the time people propose a metric that only the variant group can have, which doesn’t work, because then you cannot compute the difference, the delta, between the two groups. You can instead use an engagement metric for the new product as a way to understand engagement with it.
It’s also important to consider the cost of launching the new product. Say you launch it and realize only twenty percent of the variant group actually click, or something like that. Is that enough to justify building it? Did it impact the overall KPIs of the business, maybe revenue, and so on?
Lastly, when there’s no historical data, a lot of teams struggle with how to size the experiment. In those situations, I like to remind teams that making an educated guess is fine. You don’t need a perfect power analysis in this case; you can refer to similar experiments from the past or identify benchmarks, and that is often good enough.
The important thing is to get started and make sure you’re measuring things. So focusing on a metric that is measurable in both groups, whether you have the feature or not, is important for this type of launch.
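For the sizing question, an “educated guess” can be as simple as the normal-approximation formula for a two-proportion test. A sketch, with an assumed baseline and hoped-for lift as the inputs:

```python
import math
from scipy.stats import norm

def users_per_arm(baseline: float, rel_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Rough sample size per arm for detecting a relative lift in a
    conversion rate, using the normal approximation."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    pbar = (p1 + p2) / 2
    n = ((z_a * math.sqrt(2 * pbar * (1 - pbar))
          + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# Educated guess: ~3% baseline conversion, hoping for a 5% relative lift
print(users_per_arm(0.03, 0.05))   # roughly 200,000 users per arm
```

If the benchmark numbers are uncertain, running the formula over a small grid of plausible baselines and lifts gives a range rather than a false point estimate, which is usually enough to judge whether the test is feasible at all.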
This is what I tell my customers in my meetings as well: this is actually a very good time, because if you’re launching something new, you want that feedback to come in from your customers. Are they clicking on that particular button? Are they liking this particular feature?
Are they engaging with this platform you’ve launched? That can then act as the basis for the experiments you conduct on such new features. So it’s good to be validated by an actual experimenter. I’m not an experimenter myself, but it’s good to know my thought process is on similar lines.
It can become a little difficult to convince folks at times. They say, “Hey, we’re going through a website redesign, or we’re launching something new; now is not the right time to experiment.” And I tell them that this is pretty much the best starting point, because once you get into business as usual, a hundred other things can come up, and experimentation gets pushed behind.
But when you launch something new, this is your golden ticket to actually make the best use of experimentation.
Right. Yeah, definitely. And again, as I say, I feel like anyone can be an experimenter. So you count as one.
That’s good to hear. And one topic, Christina, that’s been on everyone’s mind is AI. How do you see experimentation going hand in hand with advances in AI and automation?
And along with that, where do you feel the human touch will still be a relevant aspect of experimentation?
Yeah. Even in my general role as a data scientist, AI has been reshaping what I do on a day-to-day level and how I use it day to day, and I am super excited about how it could change experimentation. One thing we should think about is that with AI, there are a lot of features we can launch really fast now. So more than ever, we need a scalable experimentation process.
Launching something without testing is not learning; it can really become a risk. I think AI helps us remove traditional blockers: how do we speed up data quality investigations?
Can we use AI in hypothesis generation? Can we summarize results faster? There are things we can scale or speed up using AI. What it cannot replace yet is the decision making, the critical thinking, adding the business context, and making the launch decision when you run an experiment in a product. A lot of that still needs to be curated by people. The key AI use cases in experimentation would be hypothesis generation and summarizing experiment results, but those still need a lot of validation. I’ve heard some teams have tried it, and there are still hallucinations and such. So making sure you can trust the AI output, and that users know where the results came from, is really important.
So yeah, there are a lot of AI use cases. Experimentation has become even more important now that we’re speeding up the way we launch and develop new things in any product, not just in retail or e-commerce.
So, yeah, it’s an exciting time.
Absolutely. What you rightly said is that AI can enable a scalable process, but the decision making and, as you put it, the critical thinking should be left to humans. I feel that humanity as a whole is leaving critical thinking to AI a lot, but the shift needs to be: use AI to scale, use AI to build efficiency, while what the human brain and our own analysis can do still rests with us. That’s a great framing for experimentation.
We’re coming toward the close of our conversation, Christina, and it’s been a really great time learning with you, with so many takeaways from this conversation. A couple of light questions: in the experimentation and data science space, are there any thought leaders, books, or podcasts you follow that our listeners can use to sharpen their skills?
Yeah. When I was getting started, I did a lot of general data science reading. When I moved specifically into learning experimentation, Ronny Kohavi’s book, Trustworthy Online Controlled Experiments, is a foundational book for building a good understanding of experimentation, especially on the technical side. It has a lot of the statistical material I really wanted to refresh in my head. And it’s an easy read; for people who want to go deep into the methodology and the different practices you should consider in experimentation, it’s really valuable.
One thing I’ve been super excited about recently is the growth of the experimentation community on LinkedIn.
Even just two or three years ago, whenever I had experimentation questions or problems, I would Google them and find very limited information. But now big companies put out a lot of good content, which has been really helpful for understanding what people have been using. It’s amazing to witness. Aside from that, I follow experimentation experts and subscribe to newsletters on emerging topics whenever I find them, again on LinkedIn.
In my data science role, I also read newer publications and attend conferences to understand the novel methodologies from bigger companies like Microsoft, Booking.com, and others with large-scale experimentation. As I mentioned at the very start, I thought experimentation was such a small field, but it feels much bigger once you start engaging with the community of experimenters. More recently, the more fun thing, I would say, is attending in-person experimentation meetups.
I’m lucky to be based in Seattle, where a lot of this is emerging; more meetups are happening right now, so it’s a great way to collaborate, exchange ideas, and stay informed on new things in experimentation and how other organizations are solving them. I’ve given a short talk about experimentation at one of those meetups, and it’s been really fun engaging with people and sharing the knowledge I’ve built over the last five years.
Lovely, lovely. Thank you so much for sharing that with our users, Christina.
Any current Spotify favorites, or any creative pursuits outside of your role as a data scientist, that help fuel your inspiration?
Yeah. I’ve lived in Seattle for about four years now, and I feel like I’m truly embracing the Washington lifestyle, which is staying active outside of work: running, hiking, walking, cycling. Anything where I get to move has been really important for me, and I absolutely love the nature around Seattle and greater Washington. Getting outdoors really helps me reset, and you’ll basically find me going for long walks.
I love that. My Spotify is a mix of my other interests: personal finance podcasts and playlists I use for running. I’m also trying to get more into reading fiction, more books and novels, so if you have recommendations, please send them my way.
But yeah, I think keeping my body active and my mind active is what keeps me grounded.