Prior to joining Jake on Caret back in November 2024, I was skeptical about LLMs doing much more than autocompleting a few lines of code. Today, we regularly work with up to five agents in parallel, and they often churn out complete features while we sleep. The transformation happened faster than I expected, and it's fundamentally changing how we think about software development.
What started as hallucination-ridden auto-complete a few years ago has evolved into something closer to having a distributed team of junior developers. But like any new paradigm, there are levels to this game. Many mainstream development teams are still stuck at Level 0/1, missing out on the real productivity gains that come from understanding how to orchestrate multiple agents effectively.
Note: For the purposes of this article, coding agents are LLM-powered tools that can take multi-step actions to write or modify code. They can execute commands, install dependencies, build and test code, and increasingly, work independently for extended periods. Here's how to think about leveling up your approach to working with them.
LLMs first took the software industry by storm back in late 2021 with auto-complete suggestions. These tools used LLMs to predict code changes in real-time as developers typed.
The original tools hallucinated like crazy, and auto-complete is still largely considered a nuisance today. Auto-complete typically uses fast, cheap models that are not representative of the capabilities of LLMs at large.
The real power of working with LLMs comes from the chat and agent experiences, not from trying to predict your next few keystrokes.
The advent of chat in IDEs made pair-programming with LLMs possible. Tools like Cursor, Windsurf, and VS Code's Copilot let you collaborate with a single AI in real-time, directing it as you would a human partner.
This is genuinely transformative for individual productivity. The agent can write boilerplate, suggest implementations, and catch obvious bugs as you work. You maintain full control and can course-correct immediately when the agent goes off track.
I spent months at this level, and it's an excellent way to build intuition about what different models can handle. You discover which problems are well-suited for AI assistance and which still require human insight.
However, pair-programming has an obvious limitation: it's synchronous. You're still bottlenecked by your own attention and development environment. Within an IDE chat experience, one human can only effectively work with one AI agent at a time.
All the major players in AI (OpenAI, Anthropic, Microsoft) now offer cloud-based agents that work independently in isolated environments. You assign them GitHub issues (or similar), and they go off to implement solutions for up to an hour without supervision.
These agents have the same capabilities as their IDE counterparts, but the asynchronous nature unlocks powerful parallelization. Instead of working with one agent, you can coordinate multiple agents working on different parts of your codebase simultaneously.
We regularly run three to five agents in parallel. Each gets a carefully scoped issue targeting a different area of the system to minimize conflicts. While they work, we're reviewing previous outputs, planning the next batch of work, or focusing on the higher-level architectural decisions that still require human judgment.
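To make that concrete, here's a minimal sketch of how a batch of scoped issues could be filed programmatically through the standard GitHub REST issues endpoint. The repository name, labels, and issue contents are illustrative placeholders rather than our actual setup.

```python
# Sketch: file a batch of narrowly scoped issues, one per agent, via the
# GitHub REST API. Repo name, token handling, labels, and issue text are
# illustrative placeholders, not a real configuration.
import os
import requests

REPO = "your-org/your-repo"  # placeholder
TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
}

# Each issue targets a different area of the system to minimize merge conflicts.
issues = [
    {
        "title": "Add rate limiting to the public API",
        "body": (
            "## Context\n"
            "Unauthenticated clients can currently hammer /v1/search.\n\n"
            "## Scope\n"
            "Only touch the API gateway module; do not modify the search service.\n\n"
            "## Acceptance criteria\n"
            "- 429 responses after N requests/minute per IP\n"
            "- Unit tests covering the limiter\n"
        ),
        "labels": ["agent", "api"],
    },
    {
        "title": "Migrate the settings page to the new form components",
        "body": "## Context\n...\n\n## Scope\nFrontend only.\n\n## Acceptance criteria\n- ...",
        "labels": ["agent", "frontend"],
    },
]

for issue in issues:
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers=HEADERS,
        json=issue,
    )
    resp.raise_for_status()
    print(f"Filed #{resp.json()['number']}: {issue['title']}")
```

Keeping the context, scope, and acceptance criteria in the issue body itself is what lets several agents work in parallel without stepping on each other.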
The human developer becomes something closer to a technical lead. We assign work, review and test implementations, and coordinate the broader technical strategy. The agents handle the bulk of the implementation work.
This approach has some surprising benefits beyond just speed. Because agents work from detailed written specifications (GitHub issues), you're forced to think through requirements more thoroughly upfront. This actually improves the quality of the final implementation, even accounting for the overhead of issue-writing.
We've written extensively about maximizing productivity with async coding agents and converting video walkthroughs into detailed GitHub issues to streamline this process.
The key insight is that your success at Level 2 is bottlenecked by two things: your ability to write high-quality issues, and your capacity to review and integrate the agents' output. Get good at both, and you can scale your output dramatically.
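On the review side, even a little tooling helps keep the queue visible. The sketch below lists open pull requests carrying a hypothetical `agent` label so a reviewer can work through them oldest-first; the repository and label names are assumptions, not a description of our tooling.

```python
# Sketch: list open agent-authored PRs awaiting human review, oldest first.
# The "agent" label and repo name are illustrative conventions only.
import os
import requests

REPO = "your-org/your-repo"  # placeholder
TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
}

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/pulls",
    headers=HEADERS,
    params={"state": "open", "per_page": 100},
)
resp.raise_for_status()

agent_prs = [
    pr for pr in resp.json()
    if any(label["name"] == "agent" for label in pr["labels"])
]
for pr in sorted(agent_prs, key=lambda pr: pr["created_at"]):
    print(f"#{pr['number']} {pr['title']} (opened {pr['created_at']})")
```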
The transition from Level 1 to Level 2 isn't just about adopting new tools. It's about developing new workflows and skills. Learning to write effective issues, coordinate multiple workstreams, and review code efficiently becomes as important as traditional programming skills. Most senior developers should already be well-equipped with these skills.
At Firstloop, we're in the thick of Level 2. We're seeing massive productivity gains, but agent outputs still require careful review. They're prone to over-engineering simple problems and missing edge cases that seem obvious to humans. Managing multiple concurrent pull requests also creates its own coordination overhead.
The most successful approach we've found is treating agents like capable but inexperienced team members. Give them clear, detailed instructions. Review their work thoroughly. Don't assign them critical path work until you've built confidence in their output quality.
That said, when it works, it works remarkably well. Features that would have taken days to implement can often be completed overnight by an agent working from a well-written issue. The key is developing judgment about which problems are well-suited for agent implementation and which still require human expertise.
Level 3 represents a massive jump in abstraction. At Firstloop, we're not there yet, but the pieces are starting to align. Here are some of our predictions and what we're looking toward next:
Today, someone still needs to decide what to build and how to break it down into implementable chunks. Across the industry, we're already seeing early experiments with agents that can analyze user feedback, competitive landscapes, and business requirements to generate technical roadmaps.
Currently, human review is the primary bottleneck of Level 2. However, there's no fundamental reason why agents couldn't review each other's code, and we see real potential benefits in letting them do so. The challenge is ensuring they maintain the same quality bar we'd expect from human review.
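As a thought experiment, a cross-review step might look something like the sketch below: a second model reads an agent-authored diff and either approves it or flags it for human attention. The model name, prompt, and APPROVE/FLAG convention are hypothetical illustrations, not a shipping workflow.

```python
# Sketch: have a second model review an agent-authored diff before a human sees it.
# Model name, prompt, and the APPROVE/FLAG convention are all hypothetical.
from openai import OpenAI

client = OpenAI()

def agent_review(diff: str) -> tuple[bool, str]:
    """Return (approved, review_notes) for a unified diff."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable review model
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a strict code reviewer. Reply with APPROVE or FLAG "
                    "on the first line, then bullet-point findings."
                ),
            },
            {"role": "user", "content": diff},
        ],
    )
    notes = response.choices[0].message.content
    return notes.splitlines()[0].strip() == "APPROVE", notes

# Usage: only escalate flagged diffs to the human review queue.
# approved, notes = agent_review(open("change.diff").read())
```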
The coordination work I'm doing at Level 2 could theoretically be handled by an agent with sufficient context about the codebase and team dynamics. This agent would need to understand both the technical and organizational constraints, but that's not inconceivable.
As models improve and costs decrease, the feedback loops could get tighter. Instead of the current pattern of "implement, review, iterate," we might see agents that can work through multiple iterations autonomously, only surfacing the final result when it meets specified quality criteria.
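In rough pseudocode, that tighter loop could look like the sketch below, where `implement_change` and `quality_gate` are placeholders for whatever agent call and acceptance criteria you wire in; the human only sees the result once the gate passes or the attempt budget runs out.

```python
# Sketch of an autonomous implement-review-iterate loop. `implement_change` and
# `quality_gate` are hypothetical placeholders for an agent call and a check
# (tests, lint, an agent reviewer); only the final result surfaces to a human.
import subprocess

MAX_ITERATIONS = 5

def implement_change(issue: str, feedback: str | None) -> None:
    """Placeholder: ask a coding agent to (re)work the change in the local repo."""
    raise NotImplementedError

def quality_gate() -> tuple[bool, str]:
    """Run the test suite; return (passed, output). Swap in any criteria you like."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def work_until_green(issue: str) -> bool:
    feedback = None
    for attempt in range(MAX_ITERATIONS):
        implement_change(issue, feedback)
        passed, feedback = quality_gate()
        if passed:
            return True  # surface the finished change for human sign-off
    return False  # escalate: the agent couldn't meet the bar on its own
```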
The timeline for Level 3 is unclear, but the groundwork is being laid now. The teams that master Level 2 coordination will be best positioned to adapt when these capabilities become available.