As someone who's built a project in this space: this is incredibly unreliable. Subagents don't get the full system prompt (including stuff like CLAUDE.md directions), so they're flying very blind in your project, and they tend to get derailed by their lack of project knowledge and veer into mock solutions and "let me just make a simpler solution that demonstrates X."
I advise people to only use subagents for stuff that is very compartmentalized because they're hard to monitor and prone to failure with complex codebases where agents live and die by project knowledge curated in files like CLAUDE.md. If your main Claude instance doesn't give a good handoff to a subagent, or a subagent doesn't give a good handback to the main Claude, shit will go sideways fast.
Also, don't lean on agents for refactoring. Their ability to refactor a codebase goes in the toilet pretty quickly.
> Their ability to refactor a codebase goes in the toilet pretty quickly.
Very much this. I tried to get Claude to move some code from one file to another. Some of the code went missing. Some of it was modified along the way.
Humans have strategies for refactoring, e.g. "I'm going to start from the top of the file, Cut the code that needs to be moved, and Paste it in the new location". LLMs don't have a clipboard (yet!), so they can't do this.
Claude can only reliably do this refactoring if it can keep both the source and destination files in context. This was a large file, so it got lost. Even then, it needs direct supervision.
This is only a problem if an agent is made in a lazy way (all of them).
Chat completion sends the full prompt history on every call.
I am working on my own coding agent and seeing massive improvements by rewriting history using either a smaller model or a freestanding call to the main one.
There's a large body of research on context pruning/rewriting (I know because I'm knee deep in benchmarks in release prep for my context compiler), definitely don't ad hoc this.
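To make the idea concrete, here's a minimal sketch of the history-rewriting step in Python. call_llm is a stand-in for whatever chat-completion client you use, and the thresholds and summary prompt are made up for illustration; as the comment above says, real systems should tune this against benchmarks rather than ad hoc it.

    # Sketch of rewriting chat history between completion calls.
    # Assumption: call_llm(model, messages) wraps your chat-completion API
    # and returns the assistant's reply as a string.

    KEEP_RECENT = 6          # keep the last N messages verbatim
    COMPACT_THRESHOLD = 40   # rewrite once history grows past this many messages

    def compact_history(messages, call_llm, small_model="some-small-model"):
        """Replace older turns with a summary written by a smaller model."""
        if len(messages) <= COMPACT_THRESHOLD:
            return messages

        old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        summary = call_llm(small_model, [{
            "role": "user",
            "content": "Summarize this coding-agent transcript. Keep file "
                       "paths, decisions made, and open tasks; drop dead ends "
                       "and raw tool output:\n" + transcript,
        }])
        # The summary stands in for the old turns, so the next completion
        # call sends far fewer tokens and sheds stale or poisoned context.
        return [{"role": "system",
                 "content": "Summary of earlier work:\n" + summary}] + recent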
One key insight I have from having worked on this since the early days of LLMs (before ChatGPT came out) is that the current crop of LLM clients or "agentic clients" don't log/write/keep track of success over time. It's a fire-and-forget environment right now, and that's why a lot of people are getting vastly different results. Hell, even week to week on the same tasks you get different results (see the recent "Claude getting dumber" drama).
Once we start to see that kind of self feedback going in next iterations (w/ possible training runs between sessions, "dreaming" stage from og RL, distilling a session, grabbing key insights, storing them, surfacing them at next inference, etc) then we'll see true progress in this space.
The problem is that a lot of people work on these things in silos. The industry is much more geared towards quick returns now, having to show something immediately, rather than building strong foundations based on real data. Kind of an analogy to early Linux dev. We need our own Linus, it would seem :)
I’ve experimented with feature chats: start a new chat for every change, just like a feature branch. At the end of a chat I’ll have it summarize the feature chat and save it as a markdown document in the project, so the knowledge is still available for the next chats. Seems to work well.
You can also ask the LLM at the end of a feature chat to prepare a prompt for the next one, so it can determine what knowledge is important to pass along.
Summarizing a chat also helps get rid of wrong info, as you’ll often trial-and-error your way to the right solution. You don’t want those incorrect approaches leaking into the context of the next feature chat; maybe just add the “don’t dos” to a guidelines-and-rules document so it avoids them in the future.
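A rough sketch of that end-of-chat step in Python. call_llm is again a placeholder for your chat client, and the prompts and docs/features/ layout are just one way to do it:

    from pathlib import Path

    def close_feature_chat(messages, feature_name, call_llm, model="main-model"):
        """Summarize a finished feature chat to markdown and return a kickoff
        prompt for the next one. call_llm(model, messages) is an assumed
        wrapper around your chat-completion API."""
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)

        # 1. Summary, explicitly separating what worked from dead ends.
        summary = call_llm(model, [{"role": "user", "content":
            "Summarize this feature chat as markdown. Include what was built, "
            "key decisions, and a \"don't do\" list of approaches that failed:\n"
            + transcript}])
        out = Path("docs/features") / f"{feature_name}.md"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(summary)

        # 2. Kickoff prompt carrying only what the next chat needs.
        return call_llm(model, [{"role": "user", "content":
            "Write a kickoff prompt for the next feature chat based on this "
            "one. Include only the context the next task needs; omit failed "
            "attempts:\n" + transcript}])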
> How do you define success of a model's run?
Lots of ways. You could do binary thumbs up/down. You could do a feedback session. You could look at signals like "acceptance rate" (for a PR?) or "how many feedback messages did the user send in this session", and so on.
My point was more on tracking these signals over time. And using them to improve the client, not just the model (most model providers probably track this already).
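The logging half is trivial; the value is in keeping it over time. A minimal sketch in Python, with made-up field names: one JSONL record per session that the client can aggregate week over week.

    import json, time
    from pathlib import Path

    LOG = Path("agent_sessions.jsonl")

    def log_session(task, thumbs_up, feedback_msgs, pr_accepted=None):
        """Append one record per session: a thumbs up/down, how many
        corrective messages the user had to send, and PR acceptance if a PR
        exists."""
        record = {
            "ts": time.time(),
            "task": task,
            "thumbs_up": thumbs_up,
            "feedback_msgs": feedback_msgs,
            "pr_accepted": pr_accepted,
        }
        with LOG.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def success_rate():
        """Crude aggregate the client could track across weeks."""
        rows = [json.loads(line) for line in LOG.open()] if LOG.exists() else []
        return sum(r["thumbs_up"] for r in rows) / len(rows) if rows else None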
I often see people making these subagents modelled on roles like product manager, back-end developer, etc.
I spent a few hours trying stuff like this and the results were pretty bad compared to just using CC with no agent specific instructions.
Maybe I needed to push through and find a combination that works but I don't find this article convincing as the author basically says "it works" without showing examples or comparing doing the same project with and without subagents.
Anyone got anything more convincing to suggest it's worth me putting more time into building out flows like this instead of just using a generic agent for everything?
That sounds crazy to me, Claude Code has so many limitations.
Last week I asked Claude Code to set up a Next.js project with internationalization. It tried to install a third-party library instead of using the internationalization method recommended for the latest version of Next.js (using Next's middleware), and it could not produce a functional version of the boilerplate site.
There are some specific cases where agentic AI does help me but I can't picture an agent running unchecked effectively in its current state.
Slightly off topic, but I would really like an agentic workflow that is embedded in my IDE as well as in my code-host provider, like GitHub for pull requests.
Ideally I would like to spin off multiple agents to solve multiple bugs or features. The agents would have to use the CI on GitHub to get feedback on tests. And I would like to view the work in my IDE, because I like being able to understand code by jumping through definitions.
Support for multiple branches at once - I should be able to spin off multiple agents that work on multiple branches simultaneously.
Would that be solved by having several clones of your repo, each with an IDE and a Claude working on its own problem? Much like how multiple people work in parallel.
Was going to ask how much all this cost, but this sort of answers it:
> "Managing Cost and Usage Limits: Chaining agents, especially in a loop, will increase your token usage significantly. This means you’ll hit the usage caps on plans like Claude Pro/Max much faster. You need to be cognizant of this and decide if the trade-off—dramatically increased output and velocity at the cost of higher usage—is worth it."
Is it a good idea to generate more code faster to solve problems? Can I solve problems without generating code?
If code is a liability and the best part is no part, what about leveraging Markdown files only?
The last programs I created were just CLI agents with Markdown files and MCP servers (some code here, but very little).
The feedback loop is much faster, allowing me to understand what I want after experiencing it, and self-correction is super fast. Plus, you don't get lost in the implementation noise.
Code you didn't write is an even bigger liability, because if the AI gets off track and you can't guide it back, you may have to spend the time to learn its code and fix the bugs.
It's no different to inheriting a legacy application though. As well, from the perspective of a product owner, it's not a new risk.
Claude is a junior. The more you work with it, the more you get a feel for which tasks it will ace unsupervised (some subset of grunt work) and which tasks to not even bother using it for.
I don't trust Claude to write reams of code that I can't maintain except when that code is embarrassingly testable, i.e. it has an external source of truth.
> only use subagents for stuff that is very compartmentalized
Like "evaluate the test coverage" or "check if the project follows the style guide".
This way the "main" context only gets the report and doesn't waste space on massive test outputs or reading multiple files.
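For anyone who hasn't tried them: a Claude Code subagent of that kind is a markdown file with YAML frontmatter in .claude/agents/. The name, tool list, and prompt below are illustrative, not canonical; check the docs for the exact fields.

    ---
    name: coverage-reporter
    description: Runs the test suite with coverage and reports a short summary
      back to the main agent.
    tools: Bash, Read
    ---
    You are a test-coverage reporter. Run the project's test suite with
    coverage enabled, then reply with a short report only: overall coverage,
    the five least-covered files, and any failing tests. Do not paste raw
    test output into your reply.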
> rewriting history using either a smaller model or a freestanding call to the main one
It really mitigates context poisoning.
Which, as far as I understand it, is summarizing the context with a smaller model. Am I misunderstanding you? The practical experience of most people seems to contradict your results.
> I spent a few hours trying stuff like this and the results were pretty bad
At some point you gotta stop and wonder if you're doing way too much work managing Claude rather than your business problem.
https://www.youtube.com/watch?v=wL22URoMZjo
Have a great day =3