This is exactly the argument in Brooks' No Silver Bullet. I still believe that it holds. However, my observation is that many people don't really need that level of detail. When one prompts an AI to "write me a to-do list app", what they really mean is "write me a to-do list app that is better than I have imagined so far", which does not really require a detailed spec.
Author here: it's not even clear that agents can reliably permute their training data (I'm not saying that it's impossible or never happens but that it's not something we can take for granted as a reliable feature of agentic coding).
As I mentioned in one of the footnotes in the post:
> People often tell me "you would get better results if you generated code in a more mainstream language rather than Haskell" to which I reply: if the agent has difficulty generating Haskell code then that suggests agents aren't capable of reliably generalizing beyond their training data.
If an agent can't consistently apply concepts learned in one language to generate code in another language, then that calls into question how good they are at reliably permuting the training dataset in the way you just suggested.
But what's the point of re-building "standard software" if it is so standard that it already exists 100 times in the training data with slight variations?
That isn’t saying much. Every piece of software is a permutation of zeros and ones. The novelty or ingenuity, or just quality and fitness for purpose, can lie in the permutation you come up with. And an LLM is limited by its training in the permutations it is likely to come up with, unless you give it heaps of specific guidance on what to do.
I'd say that's pretty much the definition of standard, yeah. And it's why you can't make a profit selling a simple ToDo app. If you expect people to pay for what you build, you have to build something that doesn't have a thousand free clones on the app store.
I would be surprised if there are more working email clients out there than working 3D engines. The gaming market is huge, most people do not pay to use email, hobbyists love creating game engines.
Idk, a working basic email client is just not that hard to write though. SMTP and IMAP are simple protocols and the required graphical interface is a very straightforward combination of standard widgets.
An email client is highly nontrivial, due to the complexities of the underlying standards, and how the real implementations you have to be compatible with don’t strictly follow them. Making an email client that doesn’t suck and is fully interoperable is quite an ambitious endeavor.
The point was to answer the question: "Can every piece of software be viewed as a permutation of software that has already been developed?"
In my opinion, an email client is a more favorable example than a 3D engine. In fields where it is necessary to differentiate, improve, or innovate at the algorithmic level, where research and development play a fundamental role, it is not simply a matter of permuting software or leveraging existing software components by simply assembling them more effectively.
Most software written today (or 10 years ago, or 50 years ago) is not particularly unique. And even in software that is unusual you usually find a lot of run-of-the-mill code for the more mundane aspects.
I don’t think this is true. I’ve been doing this since the 1980s, and while you might think code is fairly generic, most people aren’t shipping apps. They’re working on quiet little departmental systems or trying to patch ancient banking systems, and getting a greenfield gig is pretty rare in my experience.
So for me the code is mundane but it’s always unique and rarely do you come across the same problems at different organisations.
If you ever got a spec good enough to be the code, I’m sure Claude or whatever could absolutely ace it, but the spec is never good enough. You never get the context of where your code will run, who will deploy it or what the rollback plan is if it fails.
The code isn’t the problem and never was. The problem is the environment where your code is going.
The proof is bit rot. Your code might have been right 5 years ago but isn’t any more because the world shifted around it.
I am using Claude pretty heavily but there are some problems it is awful at, e.g. I had a crusty old classic ASP website to resuscitate this week and it would not start. Claude suggested all the things I half remembered from back in the day, but the real reason was that Microsoft disabled VBScript in Windows 11 24H2, and that wasn’t even on its radar.
I have to remind myself that it’s a fancy xerox machine because it does a damn good job of pretending otherwise.
Most of the economically valuable software written is pretty unique, or at least is one of few competitors in a new and growing niche. This is because software that is not particularly unique is by definition a commodity, with few differentiators. Commodity software gets its margins competed away, because if you try to price high, everybody just uses a competitor.
So goes the AI paradox: it's really effective at writing lots and lots of software that is low value and probably never needed to get written anyway. But at least right now (this is changing rapidly), executives are very willing to hire lots of coders to write software that is low value and probably doesn't need to be written, and VCs are willing to fund lots of startups to automate the writing of lots of software that is low value and probably doesn't need to be written.
Could you give some examples? I can only imagine completely proprietary technology like trading or developing medicine. I have worked in software for many years and was always paid well for it. None of it was particularly unique in any way. Some of it was better than the rest, but if you could show that there exists software people pay well for that AI cannot make, I would be really impressed. With my limited view as a software engineer, it seems to me that the data in the product and its users are what make it valuable. For example Google Maps, Twitter, AirBnB or HN.
Were you around when any of Google Maps, Twitter, AirBnB, or HN were first released? Aside from AirBnB (whose primary innovation was the business model, and hitting the market right during the global financial crisis when lots of families needed extra cash), they were each architecturally quite different from software that had come before.
Before Google Maps nobody had ever pushed a pure-JavaScript AJAX app quite so far; it came out just as the term AJAX was coined, when user expectations were that any major update to the page required a full page refresh. Indeed, that's exactly what competitor MapQuest did: you had to click the buttons on the compass rose to move the map, it moved one step at a time, and it fully reloaded the page with each move. Google Maps's approach, where you could just drag the map and it loaded the new tiles in the background offscreen, then positioned and cropped everything with JavaScript, was revolutionary. Then add that it gained full satellite imagery soon after launch, which people didn't know existed in a consumer app.
Twitter's big innovation was the integration of SMS and a webapp. It was the first microblog, where the idea was that you could post to your publicly-available timeline just by sending an SMS message. This was in the days before Twilio, where there was no easy API for sending these, you had to interface with each carrier directly. It also faced a lot of challenges around the massive fan-out of messages; indeed, the joke was that Twitter was down more than it was up because they were always hitting scaling limits.
HN has (had?) an idiosyncratic architecture where it stores everything in RAM and then checkpoints it out to disk for persistence. No database, no distribution, everything was in one process. It was also written in a custom dialect of Lisp (Arc) that was very macro-heavy. The advantage of this was that it could easily crank out and experiment with new features and new views on the data. The other interesting thing about it was its application of ML to content moderation, and particularly its willingness to kill threads and shadowban users based on purely algorithmic processes.
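For anyone who has not seen the "everything in RAM, checkpoint to disk" pattern, here is a minimal sketch of the idea in Python rather than Arc. The class name `InMemoryStore`, the JSON snapshot format, and the method names are all assumptions made for illustration; this is not HN's actual code.

```python
import json
import os
import tempfile


class InMemoryStore:
    """Hypothetical store: all items live in a dict, snapshots go to one file."""

    def __init__(self, path):
        self.path = path
        self.items = {}
        # On startup, reload the last snapshot if one exists.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.items = json.load(f)

    def put(self, key, value):
        self.items[key] = value  # writes touch RAM only

    def get(self, key):
        return self.items.get(key)  # reads touch RAM only

    def checkpoint(self):
        # Write to a temp file first, then rename it over the snapshot,
        # so a crash mid-write never leaves a half-written file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.items, f)
        os.replace(tmp, self.path)
```

The trade-off is the one described above: no database and no distribution, so anything written since the last `checkpoint()` is lost on a crash, but every feature is just code over an ordinary in-process data structure.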
You know how whenever you shuffle a deck of cards you almost certainly create an order that has never existed before in the universe?
Most software does something similar. Individual components are pretty simple and well understood, but as you scale your product beyond the simple use cases ("TODO apps"), the interactions between these components create novel challenges. This applies to both functional and non-functional aspects.
So if "cannot make with AI" means "the algorithms involved are so novel that AI literally couldn't write one line of them", then no - there isn't a lot of commercial software like that. But that doesn't mean most software systems aren't novel.
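The card-shuffling claim is easy to check with a bit of arithmetic. A rough sketch, assuming every shuffle is uniformly random (real shuffles are not) and using a deliberately generous, made-up estimate of how many shuffles have ever happened:

```python
import math

# Number of distinct orderings of a 52-card deck.
orderings = math.factorial(52)
print(f"{orderings:.3e}")  # 8.066e+67

# Assumed, deliberately generous upper bound on shuffles ever performed:
# 10 billion people shuffling once per second for 14 billion years.
seconds_per_year = 60 * 60 * 24 * 365
shuffles = 10**10 * seconds_per_year * 14 * 10**9

# Birthday-problem upper bound on the chance that any two of those
# shuffles coincide, assuming every shuffle is uniformly random.
collision_bound = shuffles**2 / (2 * orderings)
print(collision_bound < 1e-12)  # True
```

The same combinatorics is why a system built from a few dozen well-understood components can still end up in a configuration nobody has seen before.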
> When one prompts an AI to "write me a to-do list app", what they really mean is "write me a to-do list app that is better than I have imagined so far", which does not really require a detailed spec.
If someone was making a serious request for a to-do list app, they presumably want it to do something different from or better than the dozens of to-do list apps that are already out there. Which would require them to somehow explain what that something was, assuming it's even possible.
It could be an issue of discoverability too. Maybe they just haven't found the to-do app that does what they want, and it's easier to just... make one from scratch.
I'd pay you 10€ for a TODO app that improved my life meaningfully. It would obviously need to have great UX and be stable. Those are table stakes.
I don't have the time to look at all these apps though. If somebody tells me they made a great TODO app, I'm already mentally filtering them out. There's just too much noise here.
Does your TODO app solve any meaningful problem beyond the bare minimum? Does it solve your procrastination? Does it remind you at the right time?
If it doesn't answer this in the first 2 seconds of your pitch you're out.
Everyone has at least heard stories of people who want that button 5px to the right or to the left, and at the next meeting they want it in the bottom corner, even though it makes no functional difference.
But most of the time it’s not that they want it for objective technical reasons.
They want it because they want to see if they can push you. They do it “because they can”. They do it because later they can renegotiate, or just nag and maybe pay less. There are multiple reasons, and none of them are technical.
I think it's only a matter of time before people start trying to optimize model performance and token usage by creating their own more technical dialect of English (LLMSpeak or something). It will reduce both ambiguity and token usage by using a highly compressed vocabulary, where very precise concepts are packed into single words (monads are just monoids in the category of endofunctors, what's the problem?). Grammatically, expect things like the Oxford comma to emerge that reduce ambiguity and rounds of back-and-forth clarification with the agent.
The uninitiated can continue trying to clumsily refer to the same concepts, but with 100x the tokens, as they lack the same level of precision in their prompting. Anyone wanting to maximize their LLM productivity will start speaking in this unambiguous, highly information-dense dialect that optimizes their token usage and LLM spend...
Unless you're training your own model, wouldn't you have to send this dialect in your context all the time? Since the model is trained on all the human language text of the internet, not on your specialized one? At which point you need to use human language to define it anyway? So perhaps you could express certain things with less ambiguity once you define that, but it seems like your token usage will have to carry around that spec.
The thing is, doesn't the LLM need to be trained on this dialect, and if the training material we have is mostly ambiguous, how do we disambiguate it for the purpose of training?
In my mind this is solving different problems. We want it to parse out our intent from ambiguous semantics because that's how humans actually think and speak. The ones who think they don't are simply unaware of themselves.
If we create this terse and unambiguous language for LLMs, it seems likely to me that they would benefit most from using it with each other, not with humans. Further, they already kind of do this with programming languages which are, more or less, terse and unambiguous expression engines for working with computers. How would we meaningfully improve on this, with enough training data to do so?
I'm asking sincerely and not rhetorically because I'm under no illusion that I understand this or know any better.
Codex already has such a language. The specs it’s been writing for me are full of “dedupe”, “catch-up”, and I often need to feedback that it should use more verbose language. Some of that has been creeping into my lingo already. A colleague of mine suddenly says the word “today” all the time, and I suspect that’s because he uses Claude a lot. Today, as in, current state of the code.
It was mentioned somewhere else on hn today, but why do I care about token usage? I prompt AI day and night for coding and other stuff via claude code max 200 and mistral; haven't had issues for many months now.
E.g., those SKILL.md files that are tens of kilobytes long, as if being exhaustively verbose and rambling will somehow make the LLM smarter. (It won't, it will just dilute the context with irrelevant stuff.)
Or they could look at the past few centuries of language theory and start crafting better tokenizers with inductive biases.
We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi phonetic language and we still are acting like more compute will solve all our problems.
> We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi phonetic language and we still are acting like more compute will solve all our problems.
Post a link because until you do, I’m almost certain this is pseudoscientific crankery.
Chinese characters are not an “iron age ontology of meaning” nor anything close to that.
Also please cite the specific results in centuries-old “language theory” that you’re referring to. Did Saussure have something to say about LLMs? Or someone even older?
I've been trying codex and claude code for the past month or so. Here's the workflow that I've ended up with for making significant changes.
- Define the data structures in the code yourself. Add comments on what each struct/enum/field does.
- Write the definitions of any classes/traits/functions/interfaces that you will add or change. Either leave the implementations empty or write them yourself if they end up being small or important enough to write by hand (or with AI/IDE autocompletion).
- Write the signatures of the tests, each with a comment on what it's verifying. Ideally you would write the tests yourself, especially if they are short, but you can leave them empty.
- Then at this point you involve the agent and tell it to plan how to complete the changes, barely having to specify anything in the prompt. Then execute the plan and ask the agent to iterate until all tests and lints are green.
- Go through the agent's changes and perform clean up. Usually it's just nitpicks and changes to conform to my specific style.
If the change is small enough, I find that I can complete this with just copilot in about the same amount of time it would take to write an ambiguous prompt. If the change is bigger, I can either have the agent do it all or do the fun stuff myself and task the agent with finishing the boring stuff.
So I would agree with the title and the gist of the post but for different reasons.
Don't you also need to specify the error cases at each stage and at what level of the system you would like to handle them (log them, throw them further up, inform others, create tasks, etc.)?
My twist on this is to first vibe code the solution with the aim of immediately replacing it.
I’ve found that two to three iterations with various prompts or different models will often yield a surprising solution or some aspect I hadn’t thought of or didn’t know about.
Then I throw away most or all of the code and follow your process, but with care to keep the good ideas from the LLMs, if any.
It helps to decouple the business requirements from the technical ones. It's often not possible to completely separate these areas, but I've been on countless calls where the extra technical detail completely drowns out the central value proposition or customer concern. The specification should say who, what, where, when, why. The code should say how.
The code will always be an imperfect projection of the specification, and that is a feature. It must be decoupled to some extent or everything would become incredibly brittle. You do not need your business analysts worrying about which SQLite provider is to be used in the final shipped product. Forcing code to be isomorphic with spec means everyone needs to know everything all the time. It can work in small tech startups, but it doesn't work anywhere else.
A regular person says "I want a house and it must have a toilet"
Most people don't specify or know that they want a U bend in their pipes or what kind of joints should be used for their pipes.
The absence of U bends or the use of poor joints will be felt immediately.
Thankfully home building is a relatively solved problem, whereas in software everything is bespoke and every problem slightly different... Not to mention forever changing.
Maybe an argument can be made that this definitely holds for some areas of the feature one is building. But in every task there might be areas where a spec, even one less descriptive than code, is enough, because many solutions are just “good enough”?
One example for me is integration tests in our production application. I can spec them with single lines, way less dense than code, and the LLM's code is good enough.
It may decide to assert one way or another, but I do not care as long as the essence is there.
Natural language is fluid and ambiguous while code is rigid and deterministic. Spec-driven development appears to be the best of both worlds. But really, it is the worst of both. LLMs are language models - their breakthrough capability is handling natural language. Code is meant to be unambiguous and deterministic. A spec is neither fluid nor deterministic.
I've heard of people _experimenting_ with deleting their code every day.
I haven't heard of anyone being content paying for a product consisting of markdown files. Though I could imagine people paying for agent skill files. But still, the skills are not the same product as, say, Linear.
The cognitive dissonance comes from the tension between the-spec-as-management-artifact and the-spec-as-engineering-artifact. The author is right that advocates are selling the first, but the second is the only one that works.
For a manager, the spec exists in order to create a delegation ticket, something you assign to someone and are done with.
But for a builder, it exists as a thinking tool that evolves with the code to sharpen the understanding/thinking.
I also think that some builders are being fooled into thinking like managers because it is easier, but they figure it out pretty quickly.
A corollary of this statement is that code without a spec is not code. No /s, I think that is true - code without a spec certainly does something, but it is, by the absence of a detailed spec, undefined behavior.
Code is a specific implementation of a spec. You can even use it as a spec if you're happy to accept exactly what the code does. But the code doesn't tell you what was supposed to be built so the code is not a spec.
Simple thought experiment: Imagine you have a spec and some code that implements it. You check it, and find that a requirement in the spec was missed. Obviously that code is not the spec; the spec is the spec. But also you couldn't even use the code as a spec because it was wrong. Now remove the spec.
Is the code a spec for what was supposed to be built? No. A requirement was missed. Can you tell from just the code? Also no. You need two separate sources that tell you what was meant to be written in case either of them is wrong. That is usually a spec and the code.
They could both be wrong, and often are, but that's a people problem.
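A tiny illustration of the thought experiment, with an invented example. Suppose the (hypothetical) spec said "orders of 100.00 or more get 10% off, and the discount is capped at 20.00", and this is the code that shipped:

```python
def discounted_total(total_cents: int) -> int:
    """Apply a 10% discount to orders of 100.00 or more."""
    # Hypothetical implementation that missed the "capped at 20.00" requirement.
    if total_cents >= 100_00:
        return total_cents - total_cents // 10
    return total_cents


print(discounted_total(300_00))  # 27000, where the spec would want 28000
```

Read on its own, the function is perfectly self-consistent, docstring included. Nothing in it hints that a cap was ever intended, so once the spec is gone the missed requirement is invisible.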
I am developing my own programming language, but I have no specification written for it. When people tell me that I need a specification, I reply that I already have one - the source code of the language compiler.
You are not wrong. But, they are not wrong either.
I feel like if you’re designing a language, the activity of producing the spec, which involves the grammar etc., would allow you to design unencumbered by whether your design is easy to implement. Or whether it’s a good fit for the language you are implementing the compiler with.
The OP also correctly identifies that thoughtful design takes a back seat in favor of action when we start writing the code.
A corollary to the linked article is that a specification can also have bugs. Having a specification means that you can (in theory) be sure you have removed all inconsistencies between that specification and the source code, but it does not mean you can know you have removed all bugs, since both the spec and the source code could have the same bug.
I tried myself to make a language over an agent's prompt. This programming language is interpreted in real time, and parts of it are deterministic and parts are processed by an LLM. It's possible, but I think that it's hard to code anything in such a language. This is because when we think of code we make associations that the LLM doesn't make and we handle data that the LLM might ignore entirely. Worse, the LLM understands certain words differently than us, and the LLM has limited expressions because of its limits in true reasoning (LLMs can only express a limited number of ideas, thus a limited number of correct outputs).
I agree with this, with the caveat that a standard is not a spec. E.g. the C or C++ standards: they're somewhat detailed, but even if they were a lot more detailed, becoming 'code' would defeat the purpose (if 'code' means a deterministic Turing machine?), because it wouldn't allow for logic that is dependent on the implementer ("implementation defined behavior" and "undefined behavior" in C parlance). Whereas a specification's whole point is to enforce conformance of implementations to specific parameters.
Even programs are just specifications by that standard. When you write a program in C, you are describing what an abstract C machine can do. When the C compiler turns that into a program it is free to do so in any way that is consistent with the abstract C machine.
If you look at what implementations modern compilers actually come up with, they often look quite different from what you would expect from the source code.
Compilers are converters. There’s the abstract machine specified by the standard and there’s the real machine where the program will run (and there can be some layer in between). So a compiler takes your program (which assumes the abstract machine) and builds the link between the abstract and the real.
If your program was a DSL for steering, the abstract machine would be the idea of a steering wheel, while the real machine could be a car without one. So a compiler would build the steering wheel, optionally adding power steering (optimization), and then tack on the apparatus to steer along the given route.
It's a great argument against using software design tools (UML and other tools). The process of writing code is creating an executable specification. Creating a specification for your specification (and phrasing it as such) is a bit redundant.
The blueprint analogy comes up frequently. IMHO this is unfortunate, because a blueprint is an executable specification for building something (typically manually). It's code, in other words, but for laborers, construction workers, engineers, etc. For software we give our executable specifications to an interpreter or compiler. The building process is fully automated.
The value of having specifications for your specifications is very limited in both worlds. A bridge architect might do some sketches, 3D visualizations, clay models, or whatever. And a software developer might doodle a bit on a whiteboard, sketch some stuff out on paper or create some "whooooo we have boxes and arrows" type stuff in a PowerPoint deck or whatever. If it fits on a slide, it has no meaningful level of detail.
As for AI, I don't tend to specify a lot when I'm using AI for coding. A lot of specification is implicit with agentic coding. It comes from guard rails, implicit general knowledge that models are trained on, vague references like "I want red/green TDD", etc. You can drag in a lot of this implicit stuff with some very rudimentary prompting. It doesn't need to be spelled out.
I put an analytics server live a few days ago. I specified I wanted one. And how I wanted it to work. I suggested Go might be a nice language to build it in (I'm not a Go programmer). And I went into some level of detail on where and how I wanted the events to be stored. And I wanted a light JS library "just like google analytics" to go with it. My prompt wasn't much larger than this paragraph. I got what I asked for and with some gentle nudging got it in a deployable state after a few iterations.
A few years ago you'd have been right to scold me for wasting time on this (use something off the shelf). But it took about 20 minutes to one shot this and another couple of hours to get it just right. It's running live now. As far as I can see with my few decades of experience, it's a pretty decent version of what I asked for. I did not audit the code. I did ask for it to be audited (big difference) and addressed some of the suggested fixes via more prompting ("sounds good, do it").
If you are wondering why, I'm planning to build an AI dashboard on top of this and I need the raw event store for that. The analytics server is just a dirt cheap means to an end to get the data where I need it. AI made the server and the client, embedded the client in my AI generated website that I deployed using AI. None of this involved a lot of coding or specifying. End to end, all of this work was completed in under a week. Most of the prompting work went into making the website really nice.
Such amazing writing. And clear articulation of what I’ve been struggling to put into words - almost having to endure a mental mute state. I keep thinking it’s obvious, but it’s not, and this article explains it very elegantly.
I also enjoyed the writing style so much that I felt bad for myself for not getting to read this kind of writing enough. We are drowning in slop. We all deserve better!
I recently left this comment on another thread. At the time I was focused on planning mode, but it applies here.
Plan mode is a trap. It makes you feel like you're actually engineering a solution. Like you're making measured choices about implementation details. You're not, you're just vibe coding with extra steps.
I come from an electrical engineering background originally, and I've worked in aerospace most of my career. Most software devs don't know what planning is. The mechanical, electrical, and aerospace engineering teams plan for literal years. Countless reviews and re-reviews, trade studies, down selects, requirement derivations, MBSE diagrams, and God knows what else before anything that will end up in the final product is built. It's meticulous, detailed, time consuming work, and bloody expensive.
That's the world software engineering has been trying to leave behind for at least two decades, and now with LLMs people think they can move back to it with a weekend of "planning", answering a handful of questions, and a task list.
Even if LLMs could actually execute on a spec to the degree people claim (they can't), it would take as long to properly define as it would to just write it with AI assistance in the first place.
Is that true though? If I define a category or range in formal language, I’m still ambiguous on the exact value. Dealing with randomness is even worse (eg input in random order), and can’t be prevented in real world programs.
IMHO, LLMs are better at Python and SQL than Haskell because Python and SQL syntax mirrors more aspects of human language. Whereas Haskell syntax reads more like a math equation. These are Large _Language_ Models so naturally intelligence learned from non-code sources transfers better to more human like programming languages. Math equations assume the reader has context not included in the written down part for what the symbols mean.
They are heavily post-trained on code and math these days. I don’t think we can infer that much about their behavior from just the pre-training dataset anymore.
I suspect you're probably right, but just for completeness, one could also make the argument that LLMs are better at writing Haskell because they are overfit to natural language and Haskell would avoid a lot of the overfit spaces and thus would generalize better. In other words, less baggage.
I would guess they’re better at python and SQL than Haskell because the available training data for python and SQL is orders of magnitude more than Haskell.
I have a lot of fun making requirements documents for Claude. I use an iterative process until Claude can not suggest any more improvements or clarifications.
I agree with the overall structure of the argument but I like to think of specifications like polynomial equations defining some set of zeroes. Specifications are not really code but a good specification will cut out a definable subset of expected behaviors that can then be further refined with an executable implementation. For example, if a specification calls for a lock-free queue then there are any number of potential implementations w/ different trade-offs that I would not expect to be in the specification.
I kind of feel like the specification would call for an idealized lock free queue. Whereas the code would generate a good enough approximation of one that can be run on real hardware.
To invert your polynomial analogy, the specification might call for a sine wave, your code will generate a Taylor series approximation that is computable.
A thorough specification might even include the acceptable precision on the sine wave; a thorough engineer might ask the author what the acceptable precision is if it's omitted.
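As a sketch of that split: the "spec" is sin(x) plus an acceptable precision, and the code is a truncated Taylor series that meets it. The function name `taylor_sin` and the default tolerance are assumptions made for this example, not anything from the article.

```python
import math


def taylor_sin(x: float, tolerance: float = 1e-9) -> float:
    """Approximate sin(x) by summing Taylor terms until they fall below tolerance."""
    # Reduce the argument to [-pi, pi] so the series converges quickly.
    x = math.remainder(x, 2 * math.pi)
    term = x      # first term of the series: x^1 / 1!
    total = 0.0
    n = 0
    while abs(term) > tolerance:
        total += term
        n += 1
        # Each term is the previous one times -x^2 / ((2n)(2n+1)).
        term *= -x * x / ((2 * n) * (2 * n + 1))
    return total


# The precision requirement from the spec becomes a checkable property.
print(abs(taylor_sin(1.0) - math.sin(1.0)) < 1e-8)  # True
```

If the tolerance is left out of the spec, the implementer has to pick one, which is exactly the question the thorough engineer would go back and ask.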
This article ignores that AI agents have intelligence, which means that they can figure out unspecified parts of the spec on their own. There is a lot of the design of software that I don't care about and I'm fine letting AI pick a reasonable approach.
These algorithms don't have intelligence, they just regurgitate human intelligence that was in their training data. That also goes the other way - they can't produce intelligence that wasn't represented in their training input.
Say you and I both wrote the same spec that under-specifies the same parts. But we each expect different behavior, and trust that the LLM will make the _reasonable_ choices. Hint: “The choice that I would have made.”
Btw, by definition, when we under-specify we leave some decisions to the LLM unknowingly.
And absent our looks or age as input, the LLM will make some _reasonable_ choices based on our spec. But will those choices be closer to yours or mine? Assuming they match either of us at all.
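A small made-up example of the same spec having two "reasonable" readings. Take the one-line spec "sort the names alphabetically" and two implementers who each trust that the obvious choice will be made:

```python
# Hypothetical input for the spec "sort the names alphabetically".
names = ["bob", "Alice", "carol", "Dave"]

# Reading 1: plain code-point order, so uppercase sorts before lowercase.
by_code_point = sorted(names)

# Reading 2: case-insensitive order, closer to a dictionary or phone book.
by_casefold = sorted(names, key=str.casefold)

print(by_code_point)  # ['Alice', 'Dave', 'bob', 'carol']
print(by_casefold)    # ['Alice', 'bob', 'carol', 'Dave']
```

Both outputs satisfy the spec as written. Which one counts as correct depends on an expectation that was never written down.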
Exactly. The real speed up from AI will come when we can under specify a system and the AI uses its intelligence to make good choices on the parts we left out. If you have to spec something out with zero ambiguity you’re basically just coding in English. I suspect current ideas around formal/detailed spec driven development will be abandoned in a couple years when models are significantly better.
This is what humans have traditionally done with greenfield systems. No choices have been made yet, so they're all cheap decisions.
The difficulty has always arisen when the lines of code pile up AND users start requesting other things AND it is important not to break the "unintended behavior" parts of the system that arose from those initial decisions.
It would take either a sea-change in how agents work (think absorbing the whole codebase in the context window and understanding it at the level required to anticipate any surprising edge case consequences of a change, instead of doing think-search-read-think-search-read loops) or several more orders of magnitude of speed (to exhaustively chase down the huge number of combinations of logic paths+state that systems end up playing with) to get around that problem.
So yeah, hobby projects are a million times easier, as is bootstrapping larger projects. But for business work, deterministic behavior and consistent specs are important.
> in a couple years when models are significantly better.
They aren't significantly better now than a couple of years ago. So it doesn't seem likely they will be significantly better in a couple of years than they are now.
For now I would be happy if it just explored the problem space, identified the choices to be made, and filtered them down to the non-obvious and/or more opinionated ones. Bundle these together, ask me all at once, and then it is off to the races.
Exactly, I find that type of article too dismissive. Like, we know we don't have to write the full syntax of a loop when we write the spec "find the object in the list", and we might even not write this spec because that part is obvious to any human (hence to an LLM too)
Code is usually over-specified. I recently used AI to build an app for some HS kids. It built what I spec’d and it was great. Is it what I would’ve coded? Definitely not. In code I have to make a bunch of decisions that I don’t care about. And some of the decisions will seem important to some, but not to others. For example, it built a web page whereas I would’ve built a native app. I didn’t care either way and it doesn’t matter either way. But those sorts of things matter when coding and often don’t matter at all for the goal of the implementation.
I think you’re conflating software and product.
A product can be a recombination of standard software components and yet be something completely new.
This is very true for an email client, but very untrue for an innovative 3D rendering engine technology (just an example).
So for me the code is mundane but it’s always unique and rarely do you come across the same problems at different organisations.
If you ever got a spec good enough to be the code, I’m sure Claude or whatever could absolutely ace it, but the spec is never good enough. You never get the context of where your code will run, who will deploy it or what the rollback plan is if it fails.
The code isn’t the problem and never was. The problem is the environment where your code is going.
The proof is bit rot. Your code might have been right 5 years ago but isn’t any more because the world shifted around it.
I am using Claude pretty heavily but there are some problems it is awful at, e.g. I had a crusty old classic ASP website to resuscitate this week and it would not start. Claude suggested all the things I half remembered from back in the day, but the real reason was that Microsoft disabled VBScript in Windows 11 24H2, and that wasn’t even on its radar.
I have to remind myself that it’s a fancy xerox machine because it does a damn good job of pretending otherwise.
So goes the AI paradox: it's really effective at writing lots and lots of software that is low value and probably never needed to get written anyway. But at least right now (this is changing rapidly), executives are very willing to hire lots of coders to write software that is low value and probably doesn't need to be written, and VCs are willing to fund lots of startups to automate the writing of lots of software that is low value and probably doesn't need to be written.
Before Google Maps nobody had ever pushed a pure-Javascript AJAX app quite so far; it came out just as AJAX was coined, when user expectations were that any major update to the page required a full page refresh. Indeed, that's exactly what competitor MapQuest did: you had to click the buttons on the compass rose to move the map, it moved one step at a time, and it fully reloaded the page with each move. Google Maps's approach, where you could just drag the map and it loaded the new tiles in the background offscreen, then positioned and cropped everything with Javascript, was revolutionary. Then add that it gained full satellite imagery soon after launch, which people didn't know existed in a consumer app.
Twitter's big innovation was the integration of SMS and a webapp. It was the first microblog, where the idea was that you could post to your publicly-available timeline just by sending an SMS message. This was in the days before Twilio, where there was no easy API for sending these, you had to interface with each carrier directly. It also faced a lot of challenges around the massive fan-out of messages; indeed, the joke was that Twitter was down more than it was up because they were always hitting scaling limits.
HN has (had?) an idiosyncratic architecture where it stores everything in RAM and then checkpoints it out to disk for persistence. No database, no distribution, everything was in one process. It was also written in a custom dialect of Lisp (Arc) that was very macro-heavy. The advantage of this was that it could easily crank out and experiment with new features and new views on the data. The other interesting thing about it was its application of ML to content moderation, and particularly its willingness to kill threads and shadowban users based on purely algorithmic processes.
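The keep-everything-in-RAM-and-checkpoint pattern can be sketched in a few lines. This is only an illustration of the general idea, not how HN does it: the class name `InMemoryStore`, the JSON file format and the write-then-rename step are all assumptions made for the example.

```python
import json
import os
import tempfile


class InMemoryStore:
    """All reads and writes hit a dict in RAM; disk is only touched on checkpoint."""

    def __init__(self, path):
        self.path = path
        self.items = {}
        # On startup, restore the last checkpoint if one exists.
        if os.path.exists(path):
            with open(path) as f:
                self.items = json.load(f)

    def put(self, key, value):
        self.items[key] = value  # RAM only, no I/O

    def get(self, key):
        return self.items.get(key)

    def checkpoint(self):
        # Write to a temporary file first, then rename it over the old
        # checkpoint so a crash mid-write never leaves a half-written file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.items, f)
        os.replace(tmp_path, self.path)
```

Anything written between two checkpoints is lost on a crash, which is the price paid for having no database and a single process.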
Most software does something similar. Individual components are pretty simple and well understood, but as you scale your product beyond the simple use cases ("TODO apps"), the interactions between these components create novel challenges. This applies to both functional and non-functional aspects.
So if "cannot make with AI" means "the algorithms involved are so novel that AI literally couldn't write one line of them", then no - there isn't a lot of commercial software like that. But that doesn't mean most software systems aren't novel.
If someone was making a serious request for a to-do list app, they presumably want it to do something different from or better than the dozens of to-do list apps that are already out there. Which would require them to somehow explain what that something was, assuming it's even possible.
I'd pay you 10€ for a TODO app that improved my life meaningfully. It would obviously need to have great UX and be stable. Those are table stakes.
I don't have the time to look at all these apps though. If somebody tells me they made a great TODO app, I'm already mentally filtering them out. There's just too much noise here.
Does your TODO app solve any meaningful problem beyond the bare minimum? Does it solve your procrastination? Does it remind you at the right time?
If it doesn't answer this in the first 2 seconds of your pitch you're out.
But most of the time it's not that they want it for objective technical reasons.
They want it because they want to see if they can push you. They do it "because they can". They do it because later they can renegotiate or just nag and maybe pay less. Multiple reasons that are not technical.
I guess it depends on whether or not we want to make money, or otherwise, compete against others.
The uninitiated can continue trying to clumsily refer to the same concepts, but with 100x the tokens, as they lack the same level of precision in their prompting. Anyone wanting to maximize their LLM productivity will start speaking in this unambiguous, highly information-dense dialect that optimizes their token usage and LLM spend...
[1] https://en.wikipedia.org/wiki/Lojban
[2] Someone speaking it: https://www.youtube.com/watch?v=lxQjwbUiM9w
In my mind this is solving different problems. We want it to parse out our intent from ambiguous semantics because that's how humans actually think and speak. The ones who think they don't are simply unaware of themselves.
If we create this terse and unambiguous language for LLMs, it seems likely to me that they would benefit most from using it with each other, not with humans. Further, they already kind of do this with programming languages which are, more or less, terse and unambiguous expression engines for working with computers. How would we meaningfully improve on this, with enough training data to do so?
I'm asking sincerely and not rhetorically because I'm under no illusion that I understand this or know any better.
Context pollution is a bigger problem.
E.g., those SKILL.md files that are tens of kilobytes long, as if being exhaustively verbose and rambling will somehow make the LLM smarter. (It won't, it will just dilute the context with irrelevant stuff.)
Ah, the Lisp curse. Here we go again.
Coincidentally, the 80s AI bubble crashed partly because Lisp dialects weren't interchangeable.
But some random in-house DSL? Doubt it.
We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi phonetic language and we still are acting like more compute will solve all our problems.
Can you elaborate? I think you're talking about https://github.com/PastaPastaPasta/llm-chinese-english , but I read those findings as far more nuanced and ambiguous than what you seem to be claiming here.
Post a link because until you do, I’m almost certain this is pseudoscientific crankery.
Chinese characters are not an “iron age ontology of meaning” nor anything close to that.
Also please cite the specific results in centuries-old “language theory” that you’re referring to. Did Saussure have something to say about LLMs? Or someone even older?
- Define the data structures in the code yourself. Add comments on what each struct/enum/field does.
- Write the definitions of any classes/traits/functions/interfaces that you will add or change. Either leave the implementations empty or write them yourself if they end up being small or important enough to write by hand (or with AI/IDE autocompletion).
- Write the signatures of the tests with a comment on what each one is verifying. Ideally you would write the tests yourself, especially if they are short, but you can leave them empty.
- Then at this point you involve the agent and tell it to plan how to complete the changes, with barely anything left to specify in the prompt. Then execute the plan and ask the agent to iterate until all tests and lints are green.
- Go through the agent's changes and perform clean up. Usually it's just nitpicks and changes to conform to my specific style.
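As a rough illustration of what the hand-written skeleton from the first three steps might look like before the agent gets involved (shown in Python for brevity; the linked project is Rust, and the `Invoice` type, `apply_payment` function and test names are all invented for this example):

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    """A bill sent to one customer."""

    customer_id: str   # opaque identifier for the customer
    amount_cents: int  # total owed, in cents to avoid float rounding
    paid: bool = False  # becomes True once the full amount is covered


def apply_payment(invoice: Invoice, payment_cents: int) -> Invoice:
    """Return a copy of the invoice, marked paid if the payment covers it."""
    raise NotImplementedError  # body left for the agent


def test_full_payment_marks_invoice_paid():
    """Verifies that paying the exact amount sets paid=True."""
    # body left for the agent


def test_partial_payment_leaves_invoice_unpaid():
    """Verifies that paying less than the amount leaves paid=False."""
    # body left for the agent
```

The types, signatures and test names carry most of the intent, so the prompt itself can stay short.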
If the change is small enough, I find that I can complete this with just copilot in about the same amount of time it would take to write an ambiguous prompt. If the change is bigger, I can either have the agent do it all or do the fun stuff myself and task the agent with finishing the boring stuff.
So I would agree with the title and the gist of the post but for different reasons.
Example of a large change using that strategy: https://github.com/trane-project/trane/commit/d5d95cfd331c30...
I found that to be really vital for good code. https://fsharpforfunandprofit.com/rop/
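For anyone who hasn't followed the link, the core of railway-oriented programming is that each step returns either a success or a failure, and a small combinator skips the remaining steps once something fails. A minimal sketch in Python, using plain `(ok, value)` tuples instead of a proper result type; all names here are made up for the example:

```python
def bind(result, step):
    """Run step on the success track; pass failures through untouched."""
    ok, value = result
    return step(value) if ok else result


def parse_age(text):
    if text.isdecimal():
        return (True, int(text))
    return (False, f"not a number: {text!r}")


def check_adult(age):
    if age >= 18:
        return (True, age)
    return (False, f"too young: {age}")


def validate(text):
    # parse_age feeds check_adult only if parsing succeeded.
    return bind(parse_age(text), check_adult)
```

Here `validate("42")` stays on the success track, while `validate("x")` fails at the first step and `check_adult` is never called.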
I’ve found that two to three iterations with various prompts or different models will often yield a surprising solution or some aspect I hadn’t thought of or didn’t know about.
Then I throw away most or all of the code and follow your process, but with care to keep the good ideas from the LLMs, if any.
The code will always be an imperfect projection of the specification, and that is a feature. It must be decoupled to some extent or everything would become incredibly brittle. You do not need your business analysts worrying about which SQLite provider is to be used in the final shipped product. Forcing code to be isomorphic with spec means everyone needs to know everything all the time. It can work in small tech startups, but it doesn't work anywhere else.
Most people don't specify or know that they want a U bend in their pipes or what kind of joints should be used for their pipes.
The absence of U bends or the use of poor joints will be felt immediately.
Thankfully home building is a relatively solved problem whereas software is everything bespoke and every problem slightly different... Not to mention forever changing
A sufficiently detailed spec need only concern itself with essential complexity.
Applications are chock-full of accidental complexity.
Could be that the truth is somewhere in between?
Like they say “everything comes round again”
- Delete code and start all over with the spec. I don't think anyone's ready to do that.
- Buy a software product / business and be content with just getting markdown files in a folder.
I haven't heard of being content paying for a product consisting of markdown files. Though I could imagine people paying for agent skill files. But yet, the skills are not the same product as say, linear.
For a manager, the spec exists in order to create a delegation ticket, something you assign to someone and you're done. But for a builder, it exists as a thinking tool that evolves with the code to sharpen the understanding/thinking.
I also think that some builders are being fooled into thinking like managers because it's easy, but they figure it out pretty quickly.
Simple thought experiment: Imagine you have a spec and some code that implements it. You check it, and find that a requirement in the spec was missed. Obviously that code is not the spec; the spec is the spec. But also you couldn't even use the code as a spec because it was wrong. Now remove the spec.
Is the code a spec for what was supposed to be built? No. A requirement was missed. Can you tell from just the code? Also no. You need two separate sources that tell you what was meant to be written in case either of them is wrong. That is usually a spec and the code.
They could both be wrong, and often are, but that's a people problem.
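The thought experiment can be run in miniature. In this sketch (all names hypothetical) the "spec" is a separate table of requirement checks, and the code silently misses one of them:

```python
def normalize_username(name):
    """The implementation under review."""
    return name.strip()


# The spec lives apart from the code: one named check per requirement.
SPEC = {
    "strips surrounding whitespace": lambda f: f("  bob ") == "bob",
    "lowercases the name": lambda f: f("Bob") == "bob",
}


def unmet_requirements(func):
    """Return the names of the requirements that func does not satisfy."""
    return [name for name, check in SPEC.items() if not check(func)]
```

`unmet_requirements(normalize_username)` reports the lowercasing requirement. Delete `SPEC` and nothing in `normalize_username` hints that lowercasing was ever intended.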
I feel like if you’re designing a language, the activity of producing the spec, which involves the grammar etc., would allow you to design unencumbered by whether your design is easy to implement. Or whether it’s a good fit for the language you are implementing the compiler with.
The OP also correctly identifies that thoughtful design takes a back seat in favor of action when we start writing the code.
So unless you want bugs to be your specification, you actually need to specify what you want.
If you want a specification from source code, you need to reverse engineer it. Although that’s a bit easier now, with LLMs.
That is the great insight I can offer
If you look at what implementations modern compilers actually come up with, they often look quite different from what you would expect from the source code.
If your program was a DSL for steering, the abstract machine will be the idea of steering wheel, while the machine could be a car without one. So a compiler would build the steering wheel, optionally adding power steering (optimization), and then tack the apparatus to steer for the given route.
Yet sin() here can have a large number of different implementations. The spec alone under-determines the actual code.
The blueprint analogy comes up frequently. IMHO this is unfortunate, because a blueprint is an executable specification for building something (typically manually). It's code, in other words, but for laborers, construction workers, engineers, etc. For software we give our executable specifications to an interpreter or compiler. The building process is fully automated.
The value of having specifications for your specifications is very limited in both worlds. A bridge architect might do some sketches, 3D visualizations, clay models, or whatever. And a software developer might doodle a bit on a whiteboard, sketch some stuff out on paper or create a "whooooo we have boxes and arrows" type stuff in a power point deck or whatever. If it fits on a slide, it has no meaningful level of detail.
As for AI, I don't tend to specify a lot when I'm using it for coding. A lot of specification is implicit with agentic coding. It comes from guard rails, implicit general knowledge that models are trained on, vague references like "I want red/green TDD", etc. You can drag in a lot of this implicit stuff with some very rudimentary prompting. It doesn't need to be spelled out.
I put an analytics server live a few days ago. I specified I wanted one. And how I wanted it to work. I suggested Go might be a nice language to build it in (I'm not a Go programmer). And I went into some level of detail on how and where I wanted the events to be stored. And I wanted a light js library "just like google analytics" to go with it. My prompt wasn't much larger than this paragraph. I got what I asked for and with some gentle nudging got it in a deployable state after a few iterations.
A few years ago you'd have been right to scold me for wasting time on this (use something off the shelf). But it took about 20 minutes to one-shot this and another couple of hours to get it just right. It's running live now. As far as I can see with my few decades of experience, it's a pretty decent version of what I asked for. I did not audit the code. I did ask for it to be audited (big difference) and addressed some of the suggested fixes via more prompting ("sounds good, do it").
If you are wondering why, I'm planning to build an AI dashboard on top of this and I need the raw event store for that. The analytics server is just a dirt cheap means to an end to get the data where I need it. AI made the server and the client, embedded the client in my AI generated website that I deployed using AI. None of this involved a lot of coding or specifying. End to end, all of this work was completed in under a week. Most of the prompting work went into making the website really nice.
I also enjoyed the writing style so much that I felt bad for myself for not getting to read this kind of writing enough. We are drowning in slop. We all deserve better!
Plan mode is a trap. It makes you feel like you're actually engineering a solution. Like you're making measured choices about implementation details. You're not, you're just vibe coding with extra steps. I come from an electrical engineering background originally, and I've worked in aerospace most of my career. Most software devs don't know what planning is. The mechanical, electrical, and aerospace engineering teams plan for literal years. Countless reviews and re-reviews, trade studies, down selects, requirement derivations, MBSE diagrams, and God knows what else before anything that will end up in the final product is built. It's meticulous, detailed, time-consuming work, and bloody expensive.
That's the world software engineering has been trying to leave behind for at least two decades, and now with LLMs people think they can move back to it with a weekend of "planning", answering a handful of questions, and a task list.
Even if LLMs could actually execute on a spec to the degree people claim (they can't), it would take as long to properly define as it would to just write it with AI assistance in the first place.
Simply put: Formal language = No ambiguities.
Once you remove all ambiguous information from an informal spec, whatever remains automatically becomes a formal description.
LLMs are very good at bash, which I’d argue doesn’t read like language or math.
"Is this it?" "NOPE"
https://www.youtube.com/watch?v=TYM4QKMg12o