Experimenting with AI: When Bots do CRO

Discover the real-world impact of Generative AI beyond the buzz, exploring its valuable roles and practical applications in our latest session.


Johann's talk addresses the skepticism surrounding AI-generated content and its practical applications in the current digital landscape. He references Gartner's Hype Cycle, placing generative AI at a peak with a mainstream adoption timeline of 5-10 years. Johann emphasizes the importance of focusing on AI's current capabilities rather than its limitations. He presents real-world examples, including Iqbal Ali's use of GitHub Copilot for writing A/B tests, which significantly reduced his coding time and improved code quality.

The talk also covers advanced data analysis with AI, highlighting its efficiency in understanding complex code and performing tasks like hypothesis writing and data pre-processing. Johann concludes by discussing the importance of effective AI prompting and maintaining a prompt library for various applications.

Key Takeaways

  • AI tools like GitHub Copilot can significantly reduce the time and effort required for complex tasks like coding A/B tests, making them more accessible and efficient.
  • Mastering the art of AI prompting is crucial for obtaining desired results. Maintaining a prompt library can streamline this process across different use cases.
  • AI tools can help level the playing field by boosting the capabilities of less skilled individuals, thereby reducing skill inequality in various fields.


[00:00:00] Johann: If you’re watching this talk somewhat reluctantly because of all the noise about AI, I get it. Paul Graham recently said on X that he was looking for content online and found himself filtering by date to get rid of all that AI-generated nonsense. And I think increasingly, a lot of us are starting to feel that way – not just about AI-generated content, but also about talks like this.

[00:00:33] I’m with you. I’ll try my best not to add to the noise and we’ll keep it very practical. 

[00:00:39] I do think there is a risk in overstating that hype and we saw this back in the 90s, where shortly after the internet became a thing, people were saying consumers will never buy online. That’ll never be a thing.

[00:00:53] People buy from people. And there’s the headline in Newsweek, the internet is hype, it’ll never be Nirvana. And it wasn’t just Newsweek, it was also that paragon of wisdom and insight, the Daily Mail, famously saying, internet may just be a passing fad. 

[00:01:12] Of course, there is a lot of hype. We can’t argue with that.

[00:01:16] Gartner’s Hype Cycle places generative AI right now at the top of that high peak. And they give it five to ten years before generative AI reaches that Nirvana, before it becomes mainstream, before it’s that useful. But as Professor Ethan Mollick says (and by the way, if you’re interested in this sort of stuff, you must follow him; he’s prolific and puts out a lot of useful content around generative AI), it’s a mistake to judge AI by what it’s able to do today.

[00:01:53] There’s a lot of things that it can’t do, but why focus on that? Why focus on what could be possible in five to ten years from now? Let’s look at what’s possible right now. And we’ll go through some examples in our industry. The first one here, Iqbal explained on LinkedIn how he was using GitHub Copilot to write A/B tests. And I’ll let him explain himself. 

[00:02:20] Iqbal Ali: Just using GitHub Copilot to help develop experiments, using the process of copying the relevant HTML and pasting it into the code editor, and then using comments to say, this is what I’m trying to do with the HTML. Then you get all of the JavaScript, put that JavaScript back into the experiment, and build your experiment.

[00:02:55] There’s one experiment which I estimated was going to take me two or three days, and did it in under a morning. And the code quality was much better than I could ever write. 

[00:03:10] Johann: GitHub did a study and they found in this experiment, developers that used Copilot were able to get their tasks done in less than half the time.

[00:03:20] This is echoed by people in the industry. Kamal Sahni from Optiphoenix told me that before AI, this particular task – understanding a piece of code – would take them 30 to 40 minutes, whereas now, doing the same thing with GPT, it takes them three to four minutes. In this case, it’s understanding the underlying code of a website before they build the experiment on that website.

[00:03:44] But it applies in other use cases as well. Here’s a fascinating study where they worked with educated professionals, gave them an occupation specific writing task. Half of them worked with ChatGPT, the rest didn’t. You know where this is going. The group that worked with ChatGPT got it done faster at a higher quality.

[00:04:06] But the bit that I found the most interesting is, as the authors say there, the inequality between workers decreased, as the low ability workers benefited more, and I think that’s an underrated aspect of AI. We often hear about AI dumbing things down and how it’s all just average, but look at the counter to that.

[00:04:30] If that’s your distribution of skills, and that could be anything, it could be writing or it could be coding. 

[00:04:35] So let’s take coding. I would be on the far left there. I suck at it. But with the help of GitHub Copilot, ChatGPT, and others, suddenly I’m an average developer. Okay, so here’s where we’re going to make things a bit more practical again.

[00:04:53] We look at some examples from the industry, A/B test coding. We’ve already seen a bit of that. 

[00:04:59] Advanced data analysis – I see a lot of interest in that on social media. People are posting about it, asking questions about it, and struggling with it, because the easiest thing to do is take, say, survey results and just dump them into ChatGPT.

[00:05:14] But you can’t trust what you get back because we know about hallucination. It’s not repeatable. So if you do that same exercise 10 times in a row, you’ll get different results each time. So we’ll look at a better way of using generative AI to do that. And then you might be surprised to see the third one there, writing a decent hypothesis, but I think that’s actually a big problem, a practical problem.

[00:05:40] I certainly see it a lot. You ask it to help you write a good hypothesis. I want to change the button colors to green to increase conversion rate. We all know that’s a good experiment. And that first line is not bad. It should be clear, specific, and testable. But after that, things start falling apart. If you regenerate it, you’ll get a different answer.

[00:06:05] But this is about as good as it’s going to get. Let me show you another approach. Right, here I am in ChatGPT, in what used to be Code Interpreter and is now Advanced Data Analysis. Before I take you through the prompt that I’ve constructed to improve hypothesis writing, I’ll give you the background behind it first.

[00:06:27] So it starts with Craig Sullivan’s hypothesis kit and as you scroll down, this is the template that I’m telling ChatGPT to use. It strictly has to stick to this template. It’s also got all Craig’s thinking behind it. Then I gave it this post by Erin about a nuanced understanding of the difference between the dependent variable and the independent variable.

[00:06:55] That’s further built on by Lukas in this post. So I fed this in as well. And then this paper on hypothesis writing, and then the one that I think most people here will know, the framework from Professor Stefan Thomke about what makes a good hypothesis versus a weak hypothesis. So I’ve fed all of this into this prompt, and now I’m saying to it – you have to follow those primers. 

[00:07:35] So there’s Craig’s toolkit coming through. I’ve got it step-by-step laid out. I’ve told it to be strict and it’s quite elaborate. So we’re not going to go through all of it, but it goes through the entire thing and it talks it through and in a lot of detail and let’s have a look at a quick example. 

[00:07:59] So this is how it starts. So the first step, what is your hypothesis? What is your core theory? That’s step one. It says, what is the change that you want to make? And what is the impact that you predict it will have? 

[00:08:15] So now I want to change the button color, and I think the impact is higher conversion rate. And now that’s the first place where it catches me out in this conversation.

[00:08:26] So it says, well, you know, that’s okay! But higher conversion rate is a metric. It’s not the impact. We get to metric later. What is the behavioral shift that you expect to see? So I say, okay, well, I think people, they’ll click on a green button. They’ll be more likely to click on a green button because they like green more.

[00:08:51] So it says, okay, what’s your data behind that? I saw a case study. On the internet, there’s a guy that talks about a button color change that drove massive increase in sales. True story, by the way. And it won’t allow me to get away with it. It says, well, you know, that’s something on the internet, but what is your specific data that you have?

[00:09:20] So I say, no, my colleagues also think it’s a good idea. The fact that you think it’s not a good idea doesn’t matter. My colleagues agree with me. So it says, okay, well, that’s good that your colleagues agree with you. But what data do you have? So I say, well, I don’t, but we want to do it. So what’s it to you?

[00:09:41] And it just won’t let me get away with it. It starts offering some suggestions for what I could do to get the data and it explains why I need the data. And then I say to it, well, my boss says we must do it. And at that point, it’s finally willing to let me proceed to the next step. But even then, it’s saying, okay!

[00:10:02] That’s fine. I’ll let you pass but I have to point out that you’re doing it in the wrong way and here’s why you’re doing it in the wrong way. So it goes on through the entire process, step-by-step and at the end of it, you end up with a well constructed hypothesis that meets all the criteria and it’s helped you to think through it properly.
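The pushback in this exchange (impact is not a metric, an idea needs your own data behind it) can be pictured as a small set of checks. What follows is a minimal sketch, not the prompt used in the demo: the `Hypothesis` fields, the `METRIC_TERMS` list, the example evidence and the `review` and `render` helpers are all hypothetical, and the sentence pattern in `render` is paraphrased from the Hypothesis Kit format, so treat its exact wording as an assumption.

```python
from dataclasses import dataclass

# Illustrative list of phrases that signal a metric rather than a behavior.
METRIC_TERMS = ("conversion rate", "revenue", "click-through rate")


@dataclass
class Hypothesis:
    change: str    # what you want to change
    impact: str    # the behavioral shift you expect to see
    evidence: str  # your own data behind the idea
    metric: str    # how the impact will be measured


def review(h):
    """Return a list of objections, in the spirit of the demo's pushback."""
    issues = []
    if any(term in h.impact.lower() for term in METRIC_TERMS):
        issues.append("Impact names a metric; describe the behavioral shift instead.")
    if not h.evidence.strip():
        issues.append("No supporting data; what did you observe that motivates this?")
    return issues


def render(h):
    """Fill an assumed hypothesis sentence pattern with the four fields."""
    return (f"Because we saw {h.evidence}, we expect that {h.change} "
            f"will cause {h.impact}. We'll measure this using {h.metric}.")


weak = Hypothesis(
    change="changing the button color to green",
    impact="higher conversion rate",
    evidence="",
    metric="conversion rate",
)
strong = Hypothesis(
    change="changing the button color to green",
    impact="more visitors to notice and click the button",
    evidence="session recordings in which visitors scrolled past the button",
    metric="click-through rate on the button",
)

print(review(weak))
print(render(strong))
```

The weak version trips both checks, which matches the two objections raised in the conversation; the strong version passes and renders as a single sentence.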

[00:10:27] The one thing to know about ChatGPT’s Advanced Data Analysis, which used to be called Code Interpreter, is that when you talk to it, it translates your instructions, your natural language, your plain English instructions, into Python. It’s writing Python code. And the amazing thing is that, if you know anything about Python or data science, you will recognize these packages.

[00:10:56] These are the same packages, the ones that are available in code interpreter or advanced data analysis are the same packages that your data scientists would use if you ask them to do a particular piece of analysis. Here’s a complete list of all the packages and you can see at the bottom of your screen, you can get access to that sheet, and I don’t know how updated it is.

[00:11:20] It’s not mine. The point here really is that, you know, this is really what’s happening in the background. 

[00:11:27] We’re back in Advanced Data Analysis inside of ChatGPT for a quick demo. I’ve downloaded this file from Kaggle. It’s hotel reviews, so open-text data. I want to give you a quick overview of how we do a clean-up of the file, do pre-processing, and finally do some analysis on it that’s reasonably accurate and useful.

[00:11:47] So the first thing I asked for is an overview of the file. There we go. How many rows, how many columns, what they are. So I can get a sense of what’s in there and what I need, because the first thing I want to do is dump what I don’t need. So I only need these two columns – ratings and content. There we go, it’s done that. We can open it up, so it shows us the Python that it’s written, the code that it’s written in order to do that. And then I ask it to let me download that subset that I’ve created as a new file. 
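For a sense of what the overview-and-subset step amounts to, here is a minimal sketch using only Python's standard library. It is not the code ChatGPT generated in the demo. The sample rows and the extra `hotel` column are made up; only the `ratings` and `content` column names come from the talk.

```python
import csv
import io

# Hypothetical stand-in for the Kaggle hotel reviews file (made-up rows).
RAW_CSV = """hotel,ratings,content
Seaview,5,Great pool and friendly staff
Seaview,2,Room was noisy and small
Central,4,Clean room close to the station
"""


def overview(csv_text):
    """Return (row_count, column_names) for CSV data given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return len(rows), reader.fieldnames


def keep_columns(csv_text, columns):
    """Return new CSV text containing only the requested columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # columns not listed are silently dropped
    return out.getvalue()


row_count, columns = overview(RAW_CSV)
subset = keep_columns(RAW_CSV, ["ratings", "content"])
print(row_count, columns)
print(subset)
```

Writing `subset` out to a new file at this point corresponds to the download step: it gives a checkpoint to resume from.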

[00:12:29] As you reach milestones while you clean up your data, or whenever there’s a logical break point, make sure you download the file, so that if there is a timeout or some sort of interruption, the next time you can just pick up where you left off. 

[00:12:45] Now we’re going to do pre-processing of the data of that file that we’ve just saved. Pre-processing is just a way of cleaning up the data, standardizing it, getting it ready to do the analysis. 
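As a rough illustration of what this kind of clean-up looks like in code, the sketch below chains a few common text-normalization steps. The talk does not list the exact steps in its plan, so the three chosen here (lowercasing, stripping punctuation, collapsing whitespace) and the names `PLAN` and `run_plan` are assumptions.

```python
import re
import string


def lowercase(text):
    return text.lower()


def strip_punctuation(text):
    # Swap punctuation for spaces so "pool,bar" does not merge into one word.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


# The plan is an ordered list of steps, applied one after the other.
PLAN = [lowercase, strip_punctuation, collapse_whitespace]


def run_plan(text, plan=PLAN):
    for step in plan:
        text = step(text)
    return text


print(run_plan("The POOL was great!!  Room: tiny, but clean."))
```

Keeping the plan as an explicit list makes it easy to agree on the steps first and only then run them, one at a time.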

[00:12:57] First thing I do is tell the AI my final destination. End goal is to do NLP analysis. Even though we’re not doing it yet, it’s important for it to know what the end goal is so it can guide me properly in the steps that we’re going to be doing up until that point. 

[00:13:18] In addition to the file, I’m also uploading a Jupyter Notebook containing actual Python code. And I’m telling ChatGPT, even if you don’t use that code, it’s additional context that may be helpful. And where I got that from is Kaggle. On Kaggle, you can find tutorials and primers on various topics, in this case pre-processing. So here’s a good one, and as you scroll down, it’s got the code and everything you need to walk through the examples. 

[00:13:52] So you can click there. Download the code, and that’s what I’ve done, and this is what’s being uploaded to ChatGPT in that file there. So I tell it to read the CSV. I tell it to study the notebook and then come up with a pre-processing plan.

[00:14:12] And that plan can be based on its own knowledge, but also drawing on that Kaggle primer. So as we go down, we see it’s got the file open, and there it gives us the text pre-processing plan. I don’t want it to do anything until we’ve agreed on the plan, and here’s why: that plan is not comprehensive enough.

[00:14:37] So I ask it to add all these steps to the plan. And there’s the updated text pre-processing plan. So we confirm the plan and now start implementing the plan. And each step of the way you can click on this and see the actual Python code that it’s writing in the background translating your plain English into Python. So we go through all these steps (step-by-step) as we agreed. And then it gets to the point where it says it doesn’t have the required stopwords resource.

[00:15:15] It’s a library that is meant to be there, but isn’t there. So, what you can do in this case is actually manually upload. I’ve downloaded this stopwords list from the internet and then just uploaded it here and voila! Now it’s able to use that. Sometimes it won’t work, but if it doesn’t work the first few times, just hit regenerate.

[00:15:43] That’s worth doing three, four, five times and often it will sort itself out. Now we get to removal of frequent words. And again, it’s going to want to just do that and I say to it, no, first show me a sample of the words so I can edit it. There are words that it would have removed that I don’t want it to remove because I think it’s useful.

[00:16:06] For example, pool and room. I mean if the pool turns out to be a core feature of five star reviews, then that’s not something I want to drop. All of this is maybe 15 minutes. Until we get to the bottom where all of that has been done and it can now be saved. 
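The two steps just described, using a manually supplied stopwords list and reviewing frequent words before removing them, can be sketched as follows. This is an illustration under stated assumptions rather than the code from the demo: the stopwords, the reviews and the helper names are made up, and only the idea of keeping "pool" and "room" comes from the talk.

```python
import io
from collections import Counter


def load_stopwords(file_obj):
    """Read one stopword per line, ignoring blank lines and case."""
    return {line.strip().lower() for line in file_obj if line.strip()}


def remove_words(reviews, words):
    """Drop every token that appears in `words` from each review."""
    return [" ".join(t for t in r.split() if t not in words) for r in reviews]


def most_frequent(reviews, top_n):
    """Return the top_n most common tokens as removal candidates."""
    counts = Counter(t for r in reviews for t in r.split())
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical stand-in for the manually uploaded stopwords file.
uploaded = io.StringIO("the\nwas\nand\n")
reviews = [
    "the hotel pool was great",
    "the hotel room was small",
    "hotel pool and room clean",
]

no_stopwords = remove_words(reviews, load_stopwords(uploaded))
candidates = most_frequent(no_stopwords, top_n=3)  # shown for review first
keep = {"pool", "room"}                            # words judged useful
cleaned = remove_words(no_stopwords, set(candidates) - keep)

print(candidates)
print(cleaned)
```

The candidate list is produced before anything is deleted, which is the point of asking for a sample first: a human decides which frequent words still carry meaning.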

[00:16:28] As part of the pre-processing we had to do something called stemming and/or lemmatization.

[00:16:34] The library that it needs isn’t available and it wants to proceed with an alternative route. So I say to it, what is it that you need? And then I upload that manually. So that’s something to take into account. In this case, it didn’t work because of the file size, but at the end it gives me the code that I can run on my MacBook in order to do this.

[00:17:00] And all I do is copy that code, stick it into Jupyter Notebook, and it works as expected. First time around, it’s perfect. I get the lemmatization done, and now upload that file for final analysis. So if we fast forward, this is one of the pieces of analysis. What you’ve got there essentially is five different clusters of words.

[00:17:28] So what I do next is I give it a template. This is how I want you to present that. Not in your format. Each topic separately. Tell me how many times it was mentioned. Give me the top ten words. Also, give me phrases and give me a sample quote so I can get a bit more color around it and then after you’ve done all the topics, then give me a summary of all the possible interpretations.
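The presentation template described here (each topic separately, mention count, top ten words, phrases, a sample quote, then one overall summary) can be written down as a small formatter. This is only a sketch of the layout: the function names and all of the topic data below are made up, and in the demo the template was given to ChatGPT as a prompt rather than as code.

```python
def format_topic(number, mentions, top_words, phrases, sample_quote):
    """Lay out one topic in the fixed order requested in the talk."""
    return "\n".join([
        f"Topic {number}",
        f"Mentions: {mentions}",
        "Top words: " + ", ".join(top_words[:10]),  # cap at ten words
        "Phrases: " + "; ".join(phrases),
        f'Sample quote: "{sample_quote}"',
    ])


def format_report(topics, summary):
    """Each topic separately, then one summary of possible interpretations."""
    blocks = [format_topic(i, **topic) for i, topic in enumerate(topics, start=1)]
    blocks.append("Summary of possible interpretations:\n" + summary)
    return "\n\n".join(blocks)


# Made-up topic data, purely for illustration.
topics = [
    {"mentions": 120, "top_words": ["pool", "clean", "kids"],
     "phrases": ["heated pool"], "sample_quote": "The pool was the highlight."},
    {"mentions": 85, "top_words": ["room", "small", "noisy"],
     "phrases": ["small room"], "sample_quote": "Our room faced the street."},
]
report = format_report(topics, "Guests praise the pool and criticise room size.")
print(report)
```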

[00:17:57] And so there it is, which is far more useful. It’s got the phrases and the words that I wanted and it allows me, if I read through all of that, just to get a bit more context around what it is. And there’s the underlying interpretation. And now that starts becoming more useful. You can now dig into each of those underlying themes a little bit more using other techniques.

[00:18:18] Here’s the last trick that I want to share with you. So I’ve taken a sample of the reviews, dumped it into Claude and asked Claude for no fancy analysis, just a breakdown of the different themes as it sees them. And I copy and paste that from Claude. So then ChatGPT grades what I got from Claude. Great!

[00:18:49] Now that you’ve done that, now that you’ve seen your own analysis, and you’ve seen Claude’s analysis, how can I prompt you to give me the best of both worlds? So you retain your rigor of the topic modeling, but you add the good elements of what we saw from Claude. And then it gives me a prompt that I can now use either in this case, or next time.

[00:19:19] Now, as long as you’re using the ChatGPT interface, it won’t be 100% accurate. It doesn’t matter how good your prompting is, whether you’re using Advanced Data Analysis or plugins, it’s still the LLM. There’s still a risk of hallucination. It can still make mistakes. With decent pre-processing, you can get it very accurate, but if you want more accuracy, if you want 100% accuracy, then it’s better to write the script yourself.

[00:19:49] And even with that, AI can help you. Here’s Iqbal again to explain. 

[00:19:53] Iqbal Ali: I’m going to take you through this text mining app that I’ve been building, which basically consumes large amounts of text, like review data, and then it just gives you insights and you can run a question and answer. All of this green text here, those are comments, and that is really the only stuff that I’ve written here.

[00:20:15] And the code itself, Copilot, the AI, writes for me. So I’ve just literally been going through and I just wrote those green comments and then Copilot just gives me the code and I kind of go, yep, accept it. Here’s a very quick demo of what you can see. So here’s the data that I’ve got saved.

[00:20:37] If you look at the summarized view, so this is AI summarized. This is the actual quote. And then this is AI rewritten to be much more clean and then this is the original, which is similar to the quote, actually. Here’s what I wanted to show you. 

[00:20:54] Here’s the cool stuff – so this is Oily. I’m going to use oily skin like a subset of all my documents. And I want to ask it, okay, what were the specific issues to do with Oily skin? So it enters the answer me function. Oily skin was not suitable, caused breakouts, left skin feeling oily and greasy, made skin worse, not under makeup, too oily, weak SPF, unsatisfactory performance, glowing, but can pile, and not good for dry skin.

[00:21:30] So all of this is relevant. There’s more to it than this. You can ask it further questions so you can explore the specific quotes and explore more about what people have said to do with this. 

[00:21:49] Johann: As we wrap up, just a few final words on prompting.

[00:21:52] There’s so much noise just about this aspect of generative AI. And I would ignore most of it. This list here is an overview of some of the best practices, if you will, that I followed in the two demos that I shared with you. It’s by no means an exhaustive list. If you go to the link (t.ly/_AgIP) at the bottom, you’ll find more information on these and others.

[00:22:14] The best resource I’ve come across for prompting has been this paper by Dr. Jules White and his colleagues. They give you a list of 16, what they call prompt patterns, which I think of as styles or practical approaches to prompting ChatGPT. Each one relevant to a different problem that you’re trying to solve, a different use case.

[00:22:39] There’s a lot of theory behind it, but I find it really practical, very valuable. I draw on it heavily and I see the results in the output from ChatGPT. In both case studies that I shared with you earlier, the prompting is heavily informed by what I learned in this paper and also by the list of prompts that I shared as best practices.

[00:23:01] Now with all these prompt patterns and different ways of prompting, it can be quite difficult to keep track of all of that. So I want to show you one system that Lorenzo shared with me. This is a Chrome extension called SuperPowerChatGPT that he uses. And you can see that he’s created different profiles: a developer mode, a product market research mode, I think he’s got a writer mode. So he’s set up different modes, different profiles, personas, if you will. 

[00:23:33] And then for each one, he’s got different custom instructions. Now there are many ways in which you can solve this and keep track of all your different prompts and approaches, but it’s definitely something, not just for yourself but in your organization, that you want to start thinking about: a prompt library and how to keep everything in one place. I’m writing a lot at the moment on LinkedIn about the prompt patterns of Dr. Jules White and how I apply them in the context of CRO and experimentation. 
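A prompt library does not need a dedicated tool to get started. As one possible shape, the sketch below keeps profiles and their custom instructions in a single dictionary. The profile names follow the ones mentioned in the talk, but the instruction text, the `PROMPT_LIBRARY` name and the `build_prompt` helper are made up for illustration and are not how the Chrome extension works internally.

```python
# Profile names follow the talk; the instruction text is made up.
PROMPT_LIBRARY = {
    "developer": "You are a senior front-end developer. Answer with code first.",
    "product market research": "You are a market researcher. Cite the data you rely on.",
    "writer": "You are an editor. Keep sentences short and concrete.",
}


def build_prompt(profile, task, library=PROMPT_LIBRARY):
    """Prefix a task with the stored custom instructions for a profile."""
    try:
        instructions = library[profile]
    except KeyError:
        raise KeyError(f"No profile named {profile!r} in the prompt library") from None
    return f"{instructions}\n\nTask: {task}"


prompt = build_prompt("developer", "Write an A/B test variation for the checkout button.")
print(prompt)
```

Keeping the library in one shared file or sheet means a whole team draws on the same tested instructions instead of each person rewriting them from scratch.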

[00:24:03] Scan the QR code. It should take you straight to my LinkedIn profile, connect with me, follow me for more on AI, CRO and experimentation.


Johann Van Tonder

COO, AWA Digital