Summary
Cleo is an AI financial assistant app that offers spending insights, saving tips, budgeting help, and credit score building through chat. Cleo previously relied on intent classification and pre-written responses; it now uses AI-generated chat for more engaging, dynamic conversations. I’ve been working on a set of internal tools that help our content designers create those chats.
Introduction
Chatbots have come a long way. They no longer just deliver scripted responses. Today, they can hold intelligent conversations thanks to large language models (LLMs).
But what most users see as a simple chat interface is actually powered by a complex system, and much of the magic lies in prompt engineering. It’s what makes these conversations feel so natural.
In this article, I’ll take you behind the scenes of Cleo's LLM-powered chat system, exploring the intricate process of prompt engineering that allowed us to elevate our chat quality.
The Pre-LLM Era: Static and Stiff Chats
Before LLMs became integral to our chat system, we relied on a set of predefined messages triggered by specific keywords. These messages, although functional, often failed to deliver the nuanced, human-like responses users expect.
The chat experience was rigid, occasionally sounding unnatural or even a bit "dumb." It was clear that our system needed an upgrade, one that could generate more dynamic and contextually appropriate responses.
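To make that concrete, here’s a minimal sketch of what a keyword-triggered system like our old one might look like. The keywords and canned replies below are hypothetical illustrations, not Cleo’s actual rules:

```python
# Sketch of a pre-LLM, keyword-triggered chat system.
# Keywords and responses are hypothetical, not Cleo's real content.
CANNED_RESPONSES = {
    "budget": "Let's set up a budget! How much do you want to spend this month?",
    "save": "I can move spare change into your savings wallet. Want to try?",
    "credit": "Building credit takes time. Here's your current score trend.",
}

def reply(user_message: str) -> str:
    text = user_message.lower()
    for keyword, response in CANNED_RESPONSES.items():
        if keyword in text:
            return response  # the same reply every time, regardless of context
    return "Sorry, I didn't get that. Can you rephrase?"

print(reply("How do I save more money?"))
```

The limitation is plain to see: every match returns the exact same sentence, with no awareness of the user’s situation or phrasing.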
The Game-Changer: Introducing LLMs
The introduction of LLMs marked a significant turning point. With their ability to understand and generate text based on nuanced prompts, the quality of our chat responses improved massively. But this leap forward didn't happen overnight. It required us to build a robust tool that would allow our content designers to craft precise prompts—text or instruction sets that guide the LLM in generating the desired responses.
What Is Prompt Engineering?
At the core of LLM-driven chat systems is the concept of prompt engineering. A prompt is essentially a piece of text or a set of instructions given to the LLM to trigger a specific response or action.
Think of it as a conversation starter that tells the model what you want it to do. The effectiveness of the LLM's response is heavily dependent on the quality of the prompt, making prompt engineering a critical skill.
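As an illustration, a prompt might bundle persona, tone rules, and user context together with the task. The wording below is a hypothetical example, not one of Cleo’s production prompts:

```python
# A hypothetical prompt: instructions plus context that steer the model.
prompt = """You are Cleo, a witty financial assistant.
Tone: playful but clear. Keep replies under two sentences.

User's question: "Why did I spend so much on takeout this month?"
Context: takeout spend this month = £84, last month = £52.

Write Cleo's reply."""

print(prompt)
```

Everything the model needs to answer well, from persona to figures, is packed into that one piece of text.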
Learn more about Prompt Engineering here
Crafting a good prompt involves more than just writing instructions—it's about understanding the nuances of language, predicting the LLM's interpretation, and iterating based on feedback.
The better the prompt, the better the output, which directly impacts the user experience.
The Evolution of Our Prompt Engineering Tool
When we first started, our prompt engineering tool was rudimentary: a basic input field where content designers could write prompts, with few options for testing or iteration. To test a prompt, designers had to manually input it into the system, run tests across devices, and map the results in spreadsheets. The process was time-consuming and didn’t scale.
Realizing the need for a more sophisticated solution, we embarked on a journey to refine our tool. The first step was conducting user research to understand the needs and pain points of our technical team. With minimal documentation available on designing for AI, we collaborated closely with our engineers to align on the goals of the tool.
Crazy 8s exercise
Building a Scalable, User-Friendly Tool
Our goal was to create a tool that allowed for quick experimentation and easy iteration. We wanted content designers to be able to play around with prompts, test variants, and track changes—all within a controlled environment.
The first iteration of our upgraded tool included an input field with a side panel for adjusting parameters like temperature, which controls how predictable or varied the LLM's responses are.
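As a rough sketch of what temperature does in practice, here’s the same prompt sent at two different settings, using the OpenAI Python client purely as an illustration. The model name and prompt are placeholders, not Cleo’s actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

for temperature in (0.2, 1.0):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": "Give me one tip to cut my takeout spending."}],
        temperature=temperature,  # low = consistent, high = more varied wording
    )
    print(f"temperature={temperature}: {response.choices[0].message.content}")
```

At a low temperature the model tends to repeat near-identical phrasing on each run; at a high one, the wording (and sometimes the advice itself) varies noticeably.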
But as we continued to iterate, it became clear that we needed more advanced features. We added an output section and a preview panel where designers could see how their prompts would translate in the actual chat environment. This allowed for real-time adjustments and a more streamlined workflow.
The Power of Templates and Cost Efficiency
One of the most significant insights we gained was the repetitive nature of certain prompt types. For example, prompts related to specific characters or tones of voice were often reused. To address this, we introduced templates—a feature that not only saved time but also reduced costs.
In the world of AI, a prompt is broken down into "tokens" (roughly, words or word fragments), and each token has an associated cost. By using templates, we minimized the number of tokens needed, making the process more cost-efficient.
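Here’s a minimal sketch of the idea, with hypothetical names: the shared instructions (persona, tone) live in one reusable template, and each individual prompt only adds what is unique to it:

```python
from string import Template

# Sketch: a reusable prompt template, so designers don't rewrite the
# shared persona and tone rules for every prompt. Names are hypothetical.
CLEO_PERSONA = Template(
    "You are Cleo, a witty financial assistant. Tone: $tone.\n"
    "Task: $task"
)

prompt = CLEO_PERSONA.substitute(
    tone="playful but supportive",
    task="Summarise the user's grocery spending for this week.",
)
print(prompt)
```

One template, many prompts: designers fill in only the task-specific parts instead of duplicating boilerplate instructions each time.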
Previewing, Testing, and Iterating
To ensure the quality of our prompts before deploying them, we built a comprehensive testing environment within the tool. Designers can now preview responses generated by different language models, test variants for A/B testing, and make real-time adjustments.
We also introduced a feature for previewing how text replacements and templates would look in the final chat, providing a clear picture of the user experience.
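Here’s a rough sketch of how such a preview might work, assuming hypothetical placeholder names like {first_name}: each variant’s placeholders are filled with sample data, so designers can read the message exactly as a user would and compare A/B variants side by side:

```python
# Sketch: previewing text replacements across A/B variants.
# Variant copy and placeholder names are hypothetical examples.
variants = {
    "A": "Hey {first_name}! You've spent {takeout_spend} on takeout this week 👀",
    "B": "{first_name}, takeout is at {takeout_spend} this week. Want to set a cap?",
}

sample_user = {"first_name": "Sam", "takeout_spend": "£42"}

for name, template in variants.items():
    print(f"Variant {name}: {template.format(**sample_user)}")
```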
A Clean, Focused Interface
After several rounds of feedback and refinement, we cleaned up the interface to focus on what matters most: the content.
The tool now features expandable areas for input, single-sample previews, LLM response previews, and evaluator test reports. This organization lets content designers concentrate on writing effective prompts without unnecessary distractions.
The Final Takeaway
From a design perspective, there are countless resources on how to design with AI, but far fewer on how to design for AI. This project has been truly enlightening in understanding the complexity behind chat systems and their integration with AI.
It has deepened my appreciation for the intricate work that goes into creating seamless, user-friendly experiences in this space.