I recently finished taking the AI Evals for Engineers and PMs course, and it has completely shifted my views on what it takes to build reliable AI systems. Prior to the course, I had used some of these techniques, but I was mostly relying on “vibes” to judge quality. Now, I'm excited to build rigorous systems for ensuring quality AI outputs.

One of the course highlights was the homework assignments, which required learning the evals process end-to-end: starting with detailed manual review and moving to automated systems that can evaluate AI at scale. We even got the chance to work on more complicated systems involving RAG and to build LLM-as-judge evaluators. It was both a pleasure and a challenge to dive into the code and understand how these systems work under the hood.
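To make the LLM-as-judge idea concrete, here's a minimal sketch (not taken from the course materials or my repo) of a pass/fail judge built on the OpenAI Python SDK. The rubric, model choice, and field names are illustrative assumptions:

```python
# Minimal LLM-as-judge sketch: grade one model output against a simple rubric.
# The rubric text, model name, and JSON fields are illustrative, not prescriptive.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Rubric: the answer must directly address the question and must not contradict
the provided context.

Question: {question}
Context: {context}
Answer: {answer}

Respond with JSON: {{"pass": true or false, "reason": "<one sentence>"}}"""

def judge(question: str, context: str, answer: str) -> dict:
    """Return {'pass': bool, 'reason': str} for a single trace."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge(
        question="What is the return policy?",
        context="Returns are accepted within 30 days with a receipt.",
        answer="You can return items within 30 days if you have a receipt.",
    )
    print(verdict)
```

The important habit the course drills in is that a judge like this is only trustworthy after you've checked its verdicts against your own manual labels.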
GitHub - brayden-s-haws/EvalsRecipeBot
The most surprising revelation from the course was that there aren't many great tools for doing evals. You can use spreadsheets or logging tools; I've been a longtime user of Braintrust and love it for collecting and reviewing traces. But even excellent tools like Braintrust aren't optimized specifically for the eval process. The instructors recommend building your own tool instead, so you can tailor it to your needs and workflows.

I spent a few days building one that supports an efficient eval process while keeping my eval work synced with my traces in Braintrust through a bi-directional API integration. I also included a tool for improving system prompts based on the open and axial codes produced during the evals process. You can check out the tool in this repo. Feel free to use it as-is or remix it for your own needs. You can also see it in action in this demo video:
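As a rough illustration of the prompt-improvement idea, here's a sketch of how axial codes from error analysis might be summarized and fed back into an LLM to suggest a revised system prompt. This is not how the EvalsRecipeBot repo implements it; the code labels, model, and function names are assumptions for the example:

```python
# Illustrative sketch: turn axial codes from trace review into a
# prompt-revision request. Not the implementation used in EvalsRecipeBot.
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Axial codes assigned during manual review, one per failing trace (made-up examples).
failure_codes = [
    "ignores_context", "hallucinated_policy", "ignores_context",
    "overly_verbose", "ignores_context",
]

def suggest_prompt_revision(current_prompt: str, codes: list[str]) -> str:
    """Ask an LLM to revise a system prompt based on the most common failure modes."""
    top_failures = Counter(codes).most_common(3)
    summary = "\n".join(f"- {code}: {count} occurrences" for code, count in top_failures)
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice
        messages=[{"role": "user", "content": (
            "Here is a system prompt and the most common failure modes observed "
            "during evals. Suggest a revised system prompt that addresses them.\n\n"
            f"System prompt:\n{current_prompt}\n\nFailure modes:\n{summary}"
        )}],
    )
    return response.choices[0].message.content

print(suggest_prompt_revision("You are a helpful support agent.", failure_codes))
```

The design point is the loop, not the code: the codes you write down during manual review become structured input for improving the system, rather than notes that sit in a spreadsheet.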
AI is here to stay. If we're going to build meaningful products, prioritizing evals will be key.