Webinar

Testing Big vs. Testing Small – How To Evolve a Product Rapidly Without Sacrificing Learnings.

Duration - 35 minutes
Speakers
Iqbal Ali

Optimization Specialist Consultant

Shanaz Khan

Brand Marketing

Key Takeaways

  • Consider the trade-off between granular learning and evolving fast. While Multivariate Testing (MVT) provides granular learning, it requires more traffic and a higher threshold of significance, which can slow down the evolution process. Bundled experiments, on the other hand, can potentially help you evolve faster.
  • Be deliberate when bundling your hypotheses into experiments. Identify the "need to's": hypotheses that need to be isolated due to high risk or curiosity, and hypotheses that need to be bundled due to development dependencies or lower development effort.
  • Set limits when bundling, and dial back as needed. For example, cap a bundle at four hypotheses, and if larger bundles keep losing, scale back to bundles of two or three until you know your traffic better.
  • Be honest about your bundling needs. Sometimes, what seems like a need to bundle is actually a nice-to-have or an order in which things need to happen. Understanding these upfront can help you make better decisions.
  • Use your weighting system to guide your bundling strategy. The stage you're at in your process will dictate how aggressively you should bundle your hypotheses.

Summary of the session

In this webinar, Iqbal Ali, Optimization Specialist Consultant at Mild Frenzy Limited, discussed the balance between “Testing Big vs. Testing Small”. He addressed the common question of why not to use Multivariate Testing (MVT) for everything, explaining the risk of false results due to multiple variations and the need for high traffic and significance thresholds. He suggested bundling experiments based on their goals and themes, and setting limits on the number of hypotheses per bundle.

Iqbal also highlighted the importance of identifying ‘need to’ experiments that require isolation due to high risk or curiosity. He emphasized the need for a controlled approach to bundling, considering factors like development dependencies and lower development effort. The webinar aimed to equip attendees with practical strategies for their own CRO programs.

Webinar Video

Webinar Deck

Top questions asked by the audience

  • How do you or your team decide which tests should be bundled and which tests should not be bundled? Is there any specific framework that you follow, or are there any parameters?

    So with tests, I think it's important to go through the process and split them into hypotheses. Go back to the basics of what you're trying to learn from each test and split out all of the assumptions that go into it, the hypothesis, and then just go through this process: systematically work through what the variables are, what learnings you want to get, and how much risk you want to take, and then bundle them from that side.

Transcription

Disclaimer- Please be aware that the content below is computer-generated, so kindly disregard any potential errors or shortcomings.

Shanaz from VWO: Hello, everyone, and welcome to another webinar with VWO. Today, we have with us Iqbal, CRO lead at Trainline, currently working as a consultant and trainer. We’d like to apologize for starting late, but there was a small technical glitch due to which I will not be able to turn on Iqbal’s camera. But I’m pretty sure that his presentation is going to be so insightful that it’s gonna be worth it. So over to you.

 

Iqbal Ali:

Yeah. Hello, everyone. So, welcome to the webinar. Testing big versus testing small. I think it’s customary to start with a little bit about me.

So, this is me, or this is what I aspire to look like. I’m Iqbal, a CRO consultant and trainer, so I help companies, teams, and individuals with their experimentation processes. This is what I’ll cover in the webinar: processes, tools, and frameworks. The goal is to develop a flexible strategy based on these to balance incremental tests against large-scale experimentation.

Cool. So I think it’s good to go over the experimentation process to get to know the central problem a little better before we start to look at a solution. So, this is an example of our website or app. It’s made up of lots and lots of variables, all of these little golden things. These variables are stuff like the color of a button, descriptive text, specific imagery or iconography, the price of a product; literally anything that a user sees or experiences can be seen as a variable on our site. An ideal experiment would isolate some specific variables. So here we have what we call a predictor variable and a dependent variable.

The dependent variable is the one we can’t modify directly; typically, that’s our conversion rate. We can’t increase the conversion rate directly, but we can influence it with a predictor variable. What we do with an experiment is isolate and say that there’s a relationship between a specific predictor variable and the dependent variable, usually framed in the positive: making a change to this predictor variable will have a positive impact on the dependent variable, the conversion rate.

So by isolating the variables, we’ve proven that changing one variable affects another. It proves causality, and that, in a nutshell, is the scientific method: isolating variables in order to prove causality. And this is what an overall experiment process looks like. It starts with a hypothesis; we prioritize, deliver the experiments, and analyze them, and then that loop of learning going back to the hypothesis is where we prove or disprove, validate or invalidate, the hypothesis.

And each time we iterate through that loop, we generate a new learning, and the learnings compound. The reason why I’ve isolated the middle, prioritization and delivery, is because with experimentation we want to deliver ROI, return on investment. That’s ultimately what we’re looking for: ideas that are the fastest to develop with the highest potential to deliver some kind of uplift. And what we tend to utilize is prioritization tools, like PIE, to help us calculate the ROI.
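As a rough illustration of how a PIE-style prioritization score can be calculated, here is a minimal sketch. The hypothesis names and the 1-10 ratings are made up for the example, and the simple average is just one common way to combine the three factors.

    # Minimal sketch of PIE prioritization (Potential, Importance, Ease).
    # The 1-10 ratings below are illustrative, not from the webinar.
    def pie_score(potential, importance, ease):
        """Average the three 1-10 ratings; a higher score = prioritize sooner."""
        return round((potential + importance + ease) / 3, 1)

    ideas = {
        "Keep the pay button above the fold": pie_score(7, 9, 8),
        "Add a breadcrumb to checkout":       pie_score(6, 7, 9),
        "Full checkout redesign":             pie_score(9, 9, 3),
    }

    for name, score in sorted(ideas.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{score:>4}  {name}")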

So here, we’re seeing an uplift but we can’t actually trace it back to any one of those variables. So, what could happen is that both the variables together could be having the impact, or it could be that one of the variables is an uplift, and one of the variables is negative. And so one of the variables is cannibalizing the overall impact. And so generally, learning is impacted. There’s nothing wrong with this kind of design type, experiment type, but we have to know that proving causality to a specific variable is not possible.

So that’s the summary: the quality of the learning is impacted as we go through that loop. And here’s an example of it playing out in a more extreme way. You’ve got 4 changes, 4 predictor variables changing, and we really can’t tell which one of those had the impact, whether it was all of them, or whether there’s a variable in there that’s cannibalizing the effect.

So why is this a problem? Well, consider that 70 to 80 percent of our ideas are bound to fail. That’s a big risk in terms of time and effort. When we put a lot of time and effort into these big changes, and when we consider that 70 to 80% of those ideas will fail, then when you compound all of these changes together, the odds are not stacked in our favor. This stat is based on my own experience and on conversations with a number of other conversion experts. I think VWO’s stats paint an even grimmer picture, with 90% of experiments bound to fail; that’s from the data they revealed around 2016.

Shanaz can probably go into that a bit more. So, on the flip side, here’s the man. The man is telling us: hey, your testing is all about incremental changes and playing it safe; you want to test big to win big, you’re just pushing pixels.

And, you know, he has got a point, because there are legitimate reasons why we might want to test big. For example, we might want to evolve the product rapidly and make big step changes; maybe the site is in need of modernization, and little tweaks here and there are just not gonna cut it. Or perhaps traffic volumes and conversion rates are so small that you need to make big changes to see the signal above the noise, to get a test read more easily, or even to make a test read possible at all. So, cutting to the chase, the solution we’ll be getting into through the rest of this webinar is a flexible framework for testing big changes in a controlled way. What that means is that we take our ideal experimentation process and split it into two testing modes: one is a learn mode, and the other is an evolve mode. The learn mode we’ve already seen: it’s basically testing an individual hypothesis.

And by proving that hypothesis, we get our learning, and it’s a very granular, really high-quality sort of learning. This is what it looks like on our site: it’s one variable that we change. This is our predictor variable.

And when we see an uplift, we can prove that this variable was responsible. So here’s the new addition to the family: the evolve mode. This is where we have multiple hypotheses. We design, deliver, and analyze these as one experiment.

If it’s a loss, our loop is to test again, removing either the riskiest assumption or some other change based on new learning. Don’t worry, I’m gonna unpack that as we go through. But here it is in action. We’ve got our B variant.

We’ve made 3 changes; 3 variables have changed. Imagine this experiment is a loss: we test again, removing the variable we think is riskiest. And if that is a loss again, we test again. Essentially, we might end up back at, you know, a classic A/B, as if we were testing with the learn method, but we’ve given ourselves the best chance of learning quickly, and we dial it back as we need to. And in terms of iterations, we can go through any number of them, depending on the number of variables.
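To make the evolve loop concrete, here is a minimal, hypothetical sketch of that iteration: ship a bundle of changes, and if the experiment loses, drop the riskiest remaining change and test again. The change names, risk scores, and the run_experiment stand-in are all made up for illustration.

    # Hypothetical sketch of the "evolve" loop: test a bundle of changes and,
    # on a loss, remove the riskiest remaining change and test again.
    def evolve(changes, run_experiment):
        """changes: list of (name, risk) tuples; run_experiment returns 'win' or 'loss'."""
        remaining = sorted(changes, key=lambda c: c[1], reverse=True)  # riskiest first
        while remaining:
            result = run_experiment([name for name, _ in remaining])
            if result == "win":
                return remaining              # ship this bundle of changes
            remaining = remaining[1:]         # drop the riskiest change and iterate
        return []                             # nothing won; back to new hypotheses

    # Fake runner for illustration: the bundle only "wins" once the riskiest
    # change ("New hero banner") has been removed.
    fake_runner = lambda names: "loss" if "New hero banner" in names else "win"
    print(evolve([("New hero banner", 3), ("Shorter form", 2), ("Clearer CTA", 1)], fake_runner))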

So that’s the general gist of it, but how do we go about doing this effectively? There’s a lot to unpack, so I’m gonna split this into two steps: first, introducing the tools and the knowledge you need before you can put them into action, and then a second step where we put it all together to develop our experiments. So, step 1: knowing our tools. The first tool is writing a useful hypothesis.

Now, I’ve not done any study on this, but chances are not many actually take the time to focus on the hypothesis and write it down. But this is really key to this process: not just to write down a hypothesis, but to write down a useful hypothesis. And this is what useful means in this sense. So this is a template of a hypothesis: changing/adding X will [impact], because [justification].

So in the first part we’ve got our predictor variable, in the second the dependent variable and the impact on it, and then the third part, which is our justification. Now, this is a list of conversion levers that has been compiled over the years; these are levers used in the industry to come up with test ideas, levers known to increase conversion. There’s a lot to go through here.

So I’m just gonna pick the top 2 and go through them, but you can research these online; there are lots of places where people go into these kinds of themes in a lot more detail. Going into the top 2: for example, relevance. Relevance is about presenting products or content relevant to the user at a specific point in time and at a specific point in the user’s flow. So here, I’m on Amazon, buying a thermal printer, and I’ve got these 2 sections, one of which is “frequently bought together”.

So this is adding a number of different products, and I can buy all 3, or 2 of them. And then there’s another section too, which is all other kinds of related products. This is stuff that’s relevant to what I’m purchasing. And this is an example of clarity, where we’ve got a breadcrumb trail. It’s providing us clarity about where we are in the process.

We’re in profile step 1. We know how many steps there are to go (there are 3 steps), and we know what those steps are. So that’s an example of clarity in action. And this is an example hypothesis that we can put together using those things I’ve spoken about.

So: on the checkout page, make the pay button always appear above the fold (that’s our predictor variable) so that it will increase conversion (that’s the dependent variable). This is because it will improve the findability and clarity of the button. Clarity is mentioned there; that’s our theme. And, yeah, I’ll go into how that helps later on when we get to step 2. But for now, we’ve developed a hypothesis which is going to be very useful going forward.
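One lightweight way to capture hypotheses written against this template is as structured records, so the predictor variable, expected impact, justification theme, and risk are all available later when bundling. This is a minimal sketch under that assumption; the field names are our own, not from the talk.

    # A minimal, hypothetical structure for the hypothesis template above.
    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        change: str           # the predictor variable we will modify
        expected_impact: str  # the dependent variable we expect to move
        justification: str    # the conversion lever / theme, e.g. "clarity"
        risk: str = "low"     # "low" | "medium" | "high"; used later when bundling

    pay_button = Hypothesis(
        change="Keep the pay button above the fold on the checkout page",
        expected_impact="Increase checkout conversion rate",
        justification="clarity",
    )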

It’s also important that we separate our hypotheses from experiments. Hypotheses do not equal experiments, especially in this process. We can create a list of hypotheses, but we have not yet got a list of experiments. Tool 2 is understanding the program goals.

Now, this is a question about what we want to get out of experimentation, and that could change from this point in time to a month down the line. Ideally, we’d want both: we’d want to evolve the product rapidly, and we’d want to learn and get some granular learnings from feature and product innovations and hypotheses. But in reality, it’s not always possible to get both. So it’s a better idea to make them independent and apply a weighting to them. This is an example of some weighting.

I like to write it down, but it’s not strictly necessary; you can just have it in your head. So here, gaining learning that influences future testing is weighted 20, and evolving the product is weighted 80. That gives you a sense of what percentage of your time and effort should be spent where.

And sometimes, as I’ve said, you’d be weighted towards evolve, sometimes towards learning, so you could go from one to the other over a period of time. It’s important to know that it’s going to be fluid and changing over time, but it’s also important to know, at any given point in time, what is important to your experimentation process.
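As a small illustration of how such a weighting might steer effort, here is a hypothetical helper that splits the next batch of experiment slots between learn mode and evolve mode. The 20/80 split comes from the example above; the function and rounding rule are our own.

    # Hypothetical helper: translate the learn/evolve weighting into how many
    # of the next N experiment slots run in each mode.
    def allocate_slots(total_slots, learn_weight=20, evolve_weight=80):
        learn = round(total_slots * learn_weight / (learn_weight + evolve_weight))
        return {"learn": learn, "evolve": total_slots - learn}

    print(allocate_slots(10))  # {'learn': 2, 'evolve': 8}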

So, tool 3. We’ve had useful hypotheses; what we’d also like to have is useful metrics, metrics that are going to give us good-quality learning. What I often like to do at this point to get to useful metrics is to start by thinking in terms of the hypothesis and imagining an experiment and its result. We could have a win, a loss, or a flat. Then what I do is work backwards.

Say it was a win: why was it a win? Make lists of the possible reasons why it was a win, why it was a loss, and why it was a flat; these are all the possible outcomes of the experiment. Once we’ve got that list of reasons why we could get these possible outcomes, we can use it to create a useful set of metrics. And this is a catchphrase that I’m hoping will catch on.

Valuable metrics equal valuable experiments. And I really believe that. So finally, tool 4 is test types. This is about knowing what kinds of test types we have to work with as tools for experiment design. We’ve got a typical A/B here: we’re making one change to a predictor variable, and we see the impact.

The pro of this is that it gives us granular, good-quality learning. The con is that it’s incremental change, so it might take a lot of iterations to be able to progress a product forward. Then we have what we’re calling a bundled experiment, where we’re making multiple changes; this way, we can evolve faster, but, as we’ve seen, the learning is less granular. And then there’s the MVT or factorial test.

So, MVT or factorial tests are experiments where the variations isolate the variables but also combine them. You have variations with the variables isolated, and you have variations with all manner of combinations. An MVT is every combination of those variables; a factorial is where you get to choose the combinations. The pro of this is, again, granular, good-quality learning; the con is that it’s very slow to get results.

So here’s where we come to the elephant in the room, so I thought I’d address it now: dude, why don’t we just MVT everything? We’d get that granular learning, and we could potentially evolve fast. Well, here’s an example of the runtime life cycle of an experiment, an MVT or factorial versus a bundled experiment.

Because we’re doing MVT or factorial, the more variations we have, the more we run into the multiple comparison problem: more variations mean more risk of getting a wrong result. Because of that, MVT needs a higher threshold of significance, and it needs a heck of a lot more traffic to get to those answers. Also, it sometimes takes more time to build MVTs and factorials. Whereas with a bundled experiment, as you can see here, as we go through the iterations there’s a better chance of us learning faster, or of getting to a point where the product evolves faster. So remember, it’s a choice of learn versus evolve, and if we’re thinking evolve, then bundled tests have the potential to get us there quicker.
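To illustrate why more variations demand more traffic, here is a rough sketch that tightens the significance threshold with a Bonferroni correction and re-runs a standard two-proportion sample-size calculation. The baseline rate, lift, power, and the 2x2x2 MVT example are made-up numbers, and Bonferroni is only one of several possible corrections.

    # Rough illustration of the multiple comparison problem's traffic cost:
    # tighten alpha for many variations (Bonferroni here) and recompute the
    # classic two-proportion sample size. Numbers are illustrative only.
    from scipy.stats import norm

    def sample_size_per_variation(p_base, rel_lift, alpha, power=0.8):
        """Two-sided, two-proportion sample-size approximation."""
        p_var = p_base * (1 + rel_lift)
        p_bar = (p_base + p_var) / 2
        z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
        num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
               + z_b * (p_base * (1 - p_base) + p_var * (1 - p_var)) ** 0.5) ** 2
        return int(num / (p_var - p_base) ** 2) + 1

    p, lift = 0.05, 0.10  # 5% baseline conversion, 10% relative lift
    ab  = sample_size_per_variation(p, lift, alpha=0.05)      # single comparison
    mvt = sample_size_per_variation(p, lift, alpha=0.05 / 7)  # e.g. 2x2x2 MVT: 7 comparisons vs. control
    print(f"Bundled A/B: ~{ab:,} visitors per variation (2 variations)")
    print(f"MVT:         ~{mvt:,} visitors per variation (8 variations)")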

So we’ve explored the tools; now it’s time to bundle. The idea is simple, basically: we now build our experiments, or rather we turn our hypotheses into experiments or experiment ideas. We can either have isolated experiments, where we isolate specific hypotheses into their own experiments, or we can bundle, taking multiple hypotheses and combining them into specific experiments.

So we have lots of variables at work here, and the idea is that we want to do this with deliberation, in a controlled way, which is what we’re going to get to in the rest of this. The first task is to identify what I call the ‘need to’s’. For example, what needs to be isolated? It might be that the hypothesis is high risk.

Maybe that’s from findings in user research, user testing, previous experiment data, other kinds of data, or heuristics, or maybe we’re just plain nervous about a specific hypothesis. With those kinds of hypotheses, it might be best to isolate them and create a specific experiment around each. It could also be curiosity: amongst a team, a hypothesis might settle debates, and it might be good for communication with execs.

And they generally make for good stories. If you can say that this specific variable or hypothesis has been disproven or invalidated, it makes for a stronger story than “it could have been any one of these things”. So the idea is that each one of those becomes its own specific experiment. And we also need to look at what needs to be bundled.

So, for example, there will be development dependencies, where we can’t test a feature before it is rolled out, or maybe a design skin is needed before we can apply some new iconography to the site. What I’d say is: be honest about it, because as we’re bundling these, sometimes the need to bundle is not really a need-to but a nice-to-have, and sometimes they’re not bundles but an order in which things need to happen. What we’re doing here is just getting to know these things upfront, and knowing what needs to be bundled, maybe because bundling takes lower development effort than separating out specific variables.

And now we’ve got task 1 out of the way, the ‘need to’ stuff. This is where we actually take control and use some rules to create bundles for the rest. Remember our weighting system, and remember where we are at any point in time; this will dictate how aggressively we go with the bundling.

So if we’re weighted towards evolve, we might want to lean more aggressively towards bundling experiments. If we’re weighted towards learning, we might go a bit more relaxed and lean more towards isolating hypotheses. It’s also good to set some limits for the bundling. Perhaps we set a limit of 4 hypotheses per bundle, and you can vary that depending on the risk level you’re willing to take on. And you might also want to dial it back.

So if you bundle, say, 6 hypotheses and you’re finding that all of your experiments are failing, maybe you need to scale back to bundling 2 or 3, because maybe you don’t know enough about your traffic. The next thing we want to do as we go through the list is identify easy wins: review the hypotheses and see which experiment designs are quick to deliver. Some of these things might be so quick to roll out as a change that they could be out there gathering data while we’re going through the rest of the process. So it might be an idea to go through and identify what could be easy wins based on the hypotheses, and what could be easy, simple experiments that we could roll out to validate an assumption.

But the point is that we want to get something out as quickly as possible, in order to start learning as quickly as possible. Now, I’ve not covered this specific thing, but each experiment would have a different goal. For example, an experiment might just be about derisking: we just want to progress the product and are not really interested in getting a conversion win out of it; we just need to get some features rolled out. And then, at that point, we’ll be able to test, get some learnings, and do some conversion optimization.

We just want to check that they don’t damage conversion. So it might make sense to bundle the experiments that are all about derisking together; bundle by experiment goal, basically. And this is my favorite. This is why we added that justification right at the beginning, remember, with the hypothesis, because it adds clarity.

This way we can go through the hypotheses and bundle them by theme. Anything that’s got a theme of clarity, or a theme of relevance, it might make sense to bundle some of those together. At the very least, we’re learning about those themes from the experiment results. So bundle the clarity ones together, not necessarily all of them, but, you know, setting some limits. Or we might want to get a cross-section.

Like, we might want to have clarity and confidence, you know. But generally, we’ve got some kind of theme or set of themes that we can bundle to give us some learning. And every time we go through this cycle, we should always be monitoring the success of the approach. It allows us to iterate, and it allows us to ask: are we really ready to make big changes? Because sometimes we’re not. Sometimes we’re making all these big changes, but the majority of them are coming out as losses.

So that says something, and that means going back through the process, maybe reweighting and starting again. This goes back to the weighting system: basically, you want to reweight, tweak, iterate, and keep going back and forth around that. So, in summary, by being clear about our goals and using a systematic approach with processes, we can get control over the balance between testing big and testing small, and we should always be tweaking and adapting as necessary, always going back through the process. Thanks. I think we’re open to questions.

 

Shanaz:

Thank you. That was really, really insightful. So I have one question. It says, how do you or your team decide which tests should be bundled and which tests should not be bundled? So is there any specific framework that you follow, or are there any parameters?

 

Iqbal:

So with tests, I think it’s important to go through the process and split them into hypotheses. Go back to the basics of what you’re trying to learn from each test and split out all of the assumptions that go into it, the hypothesis, and then just go through this process: systematically work through what the variables are, what learnings you want to get, and how much risk you want to take, and then bundle them from that side.
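For illustration, here is a hypothetical sketch of that decision flow: isolate the high-risk or curiosity-driven hypotheses into their own experiments, then bundle the rest by theme up to a limit. The hypothesis entries, theme names, and the cap of 4 mirror examples from the talk; the function itself is our own, not a prescribed framework.

    # Hypothetical sketch of the bundling decision: isolate "need to" hypotheses,
    # then bundle the rest by theme, capped at max_bundle_size per experiment.
    from collections import defaultdict

    def plan_experiments(hypotheses, max_bundle_size=4):
        isolated, by_theme = [], defaultdict(list)
        for h in hypotheses:
            if h["risk"] == "high" or h.get("curiosity"):
                isolated.append([h])                     # its own experiment
            else:
                by_theme[h["theme"]].append(h)
        bundles = []
        for theme_items in by_theme.values():
            for i in range(0, len(theme_items), max_bundle_size):
                bundles.append(theme_items[i:i + max_bundle_size])
        return isolated + bundles

    hypotheses = [
        {"name": "Pay button above the fold", "theme": "clarity",   "risk": "low"},
        {"name": "Breadcrumb in checkout",    "theme": "clarity",   "risk": "low"},
        {"name": "Related-products module",   "theme": "relevance", "risk": "high"},
    ]
    for experiment in plan_experiments(hypotheses):
        print([h["name"] for h in experiment])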

 

Shanaz:

Alright. I hope that answered your question. Well, thank you for that session. I’m sorry we had that minor technical glitch and lost out on valuable time; we won’t be able to take up any more questions. There seem to have been a lot of technical glitches today. Nonetheless, that was very insightful, and I think everyone will take back at least a handful of cues to apply to their own CRO program: how they approach testing, what to test, what not to test, and how to test it.

And, yes, thank you so much, Iqbal, for that session. And thank you everyone for attending this VWO webinar.
