LLM
Code Generation
Evaluation
Caret

Evaluating LLM Code Output is Hard

June 17, 2025
Firstloop Team
5 min read

The foundational crux of Caret was building an LLM solution that generated and executed workflows defined as complex JSON schemas. We got pretty good at generating workflows that looked right, but determining whether they worked was a different beast.
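
For concreteness, here's a heavily simplified, hypothetical sketch of the shape of a generated workflow; the real schema had far more step types, fields, and dependency rules.

```python
# Hypothetical, simplified workflow shape -- the real Caret schema was much richer.
workflow = {
    "name": "daily_lead_digest",
    "steps": [
        {"id": "fetch", "type": "http_request", "params": {"url": "https://example.com/api/leads"}},
        {"id": "filter", "type": "transform", "depends_on": ["fetch"], "params": {"expr": "score > 0.8"}},
        {"id": "notify", "type": "send_email", "depends_on": ["filter"], "params": {"to": "sales@example.com"}},
    ],
}
```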

There are all sorts of methodologies for evaluating LLM outputs: LLM-as-judge, accuracy scoring, tone analysis, and so on. For generated code, though, and especially for complex structured output like our workflow schemas, evaluation gets much harder. You can verify basic syntax and maybe check that core requirements are covered, but how do you determine that the result of executing that code is correct?
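
The syntax tier is the easy part. Here's a minimal sketch of structural validation using the jsonschema library, assuming a JSON Schema for the workflow format (the toy schema below is illustrative, not Caret's real one):

```python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

# Stand-in for the real (much larger) workflow JSON Schema.
WORKFLOW_SCHEMA = {
    "type": "object",
    "required": ["name", "steps"],
    "properties": {
        "name": {"type": "string"},
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id", "type"],
                "properties": {
                    "id": {"type": "string"},
                    "type": {"type": "string"},
                    "depends_on": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
    },
}

def validate_workflow(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means structurally valid."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    validator = Draft7Validator(WORKFLOW_SCHEMA)
    return [error.message for error in validator.iter_errors(doc)]
```

Checks like this catch malformed structure, but say nothing about whether the workflow does what the user actually asked for.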

This was hands down the biggest technical challenge we faced with Caret. We could generate workflows based on user instructions and generally nail the simple use cases. But as workflows grew into many steps with complex dependencies, confirming validity beyond basic syntax checking became nearly impossible.

The evaluation problem

The obvious first step is making sure the workflow can actually execute without errors and complete successfully. That's table stakes. But even if it runs, how do you know it's doing what the user actually wanted?
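
That first tier is straightforward to automate. Here's a minimal sketch of the "does it even run" check, assuming a hypothetical run_workflow executor and step-level results; the real engine's interface differed, but the shape of the check is the same:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    step_id: str
    ok: bool
    error: str | None = None

def smoke_test(workflow: dict, run_workflow) -> tuple[bool, list[str]]:
    """Execute the workflow and report whether every step completed.

    `run_workflow` is a placeholder executor that returns a list of StepResult.
    """
    results: list[StepResult] = run_workflow(workflow)
    failures = [f"{r.step_id}: {r.error}" for r in results if not r.ok]
    return (len(failures) == 0, failures)
```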

For generic workflows, we're basically flying blind. We don't know exactly what the user expected to see in the output, so evaluating true accuracy becomes a guessing game. It's not like evaluating a math problem where there's a clear right answer.

We did support a revision workflow that let users refine workflows using natural language. This became our accidental evaluation mechanism. If we saw a user tweaking a workflow over and over, we knew the result wasn't what they wanted. Conversely, if they ran the workflow repeatedly without revisions, they were presumably happy with it.
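
In practice that boiled down to a crude but useful metric: revisions per workflow versus unmodified runs. A sketch of the aggregation, with an invented event-log shape:

```python
from collections import Counter

def revision_signal(events: list[dict]) -> dict[str, float]:
    """Compute a revision-to-run ratio per workflow.

    `events` uses a hypothetical log shape:
    [{"workflow_id": ..., "kind": "run" | "revision"}, ...]
    A high ratio suggests the generated workflow missed the mark; many runs
    with no revisions suggests the user got what they wanted.
    """
    runs, revisions = Counter(), Counter()
    for e in events:
        (revisions if e["kind"] == "revision" else runs)[e["workflow_id"]] += 1
    return {
        wf: revisions[wf] / max(runs[wf], 1)
        for wf in set(runs) | set(revisions)
    }
```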

The problem with deferring evaluation to real-world use cases is obvious: we couldn't know whether a tweak to our prompts would improve the user experience or completely break certain scenarios. It's like trying to tune an engine with no instruments.

Enter evaluation tooling

This is where tools like braintrust.dev become invaluable. They support versioned prompts, which means we could do A/B testing in production to confirm that new prompt versions at least met the same benchmarks as previous ones, ideally with decreased revision rates.
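
As a rough sketch of what an eval looks like with Braintrust's Python SDK (following its published quickstart pattern; the project name, data, generator, and scorer below are placeholders rather than our production setup):

```python
from braintrust import Eval
from autoevals import JSONDiff  # scores structural similarity between JSON outputs

def generate_workflow(instruction: str) -> dict:
    # Placeholder for the prompt-under-test: instruction in, workflow JSON out.
    # In reality this would call the LLM with the versioned prompt.
    return {"name": "daily_lead_digest", "steps": ["fetch", "filter", "notify"]}

Eval(
    "caret-workflow-generation",  # hypothetical project name
    data=lambda: [
        {
            "input": "Email me leads scoring above 0.8 every morning",
            "expected": {"name": "daily_lead_digest", "steps": ["fetch", "filter", "notify"]},
        },
    ],
    task=generate_workflow,
    scores=[JSONDiff],
)
```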

These tools also have playground concepts where you can replay previous conversations with updated prompt versions to compare outputs. We think this could have seriously improved our workflow generation experience. The ideal scenario: capture known good examples where we nailed the output, ones the user created, maybe revised once or twice for fine-tuning, then ran 20 times a day ever since. Those become our regression test baselines.

Update the prompt, replay those scenarios, verify we're still getting similar outputs. At minimum, we're not regressing established use cases.
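
A bare-bones version of that replay loop, independent of any particular tool; the baseline file format and comparison function are assumptions:

```python
import json

def replay_baselines(baseline_path: str, generate_workflow, similar_enough) -> list[str]:
    """Re-run captured 'known good' scenarios against the current prompt version.

    `baseline_path` points to a JSONL file of {"instruction": ..., "expected": ...}
    records captured from happy users. `similar_enough` is whatever comparison
    fits the schema (exact match, JSON diff, or an LLM judge). Returns the
    instructions that regressed.
    """
    regressions = []
    with open(baseline_path) as f:
        for line in f:
            case = json.loads(line)
            output = generate_workflow(case["instruction"])
            if not similar_enough(output, case["expected"]):
                regressions.append(case["instruction"])
    return regressions
```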

What we're doing differently

We're already working with a client on building a similar product: generating complex JSON and Python code from LLMs to create content for users. This time, we're experimenting with evaluation tools from the outset rather than retrofitting them later.

We're also planning more in-depth custom evaluators. This specific product generates images and video using LLMs, which opens up interesting evaluation possibilities. We can take screenshots or sample frames from generated videos and feed them back into an LLM-as-judge to evaluate whether scenes, composition, and framing match the user's original expectations. Did we accidentally clip something that should be in frame? Is the composition what they asked for?
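
Here's a sketch of what that judge loop could look like, assuming OpenCV for frame sampling and an OpenAI vision-capable model as the judge; the prompt, model choice, and frame cap are illustrative, not what we've settled on:

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

def sample_frames(video_path: str, every_n: int = 30) -> list[str]:
    """Return base64-encoded JPEG frames, keeping one frame out of every_n."""
    cap, frames, i = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buf).decode())
        i += 1
    cap.release()
    return frames

def judge_frames(frames: list[str], user_request: str) -> str:
    """Ask a vision model whether the sampled frames match the user's original request."""
    client = OpenAI()
    content = [{
        "type": "text",
        "text": f"The user asked for: {user_request}. Do these frames match the requested "
                "scene, composition, and framing? Is anything important clipped out of frame?",
    }]
    content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
                for f in frames[:8]]  # cap how many frames we send to the judge
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```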

The key insight is that evaluation strategy needs to be baked into the product from day one, not bolted on after you realize you have a problem. For code generation specifically, this means thinking beyond syntax validation toward actual execution outcomes and user satisfaction signals.

Looking forward, we think the combination of automated evaluation tools, user feedback loops, and domain-specific validation techniques will become standard practice for any serious LLM code generation product. The technology exists now to make this practical; it's just a matter of building it into your workflow from the beginning rather than treating it as an afterthought.