When AI writes the software, who verifies it?

(leodemoura.github.io)

61 points | by todsacerdoti 3 hours ago

17 comments

  • 50lo 1 minute ago
    One thing that seems under-discussed in this space is the shift from verifying programs to verifying generation processes.

    If a piece of code is produced by an agent loop (prompt -> tool calls -> edits -> tests), the real artifact isn’t just the final code but the trace/pipeline that produced it.

    In that sense verification might look closer to: checking constraints on the generator (tests/specs/contracts), verifying the toolchain used by the agent, and replaying generation under controlled inputs.

    That feels closer to build reproducibility or supply-chain verification than traditional program proofs.
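    A minimal sketch of that idea (all names here are hypothetical, not from any real attestation tool): hash-chain the generation trace so that replaying the agent loop under controlled inputs can be checked against the recorded run, much like reproducible-build attestation.

```python
import hashlib
import json

def trace_digest(steps):
    """Hash-chain a generation trace (prompt -> tool calls -> edits).

    A reviewer can replay the generation under controlled inputs and
    check that the replayed sequence of actions yields the same digest,
    similar in spirit to supply-chain attestation for builds.
    """
    digest = b""
    for step in steps:
        payload = json.dumps(step, sort_keys=True).encode()
        digest = hashlib.sha256(digest + payload).digest()
    return digest.hex()

trace = [
    {"kind": "prompt", "text": "add input validation"},
    {"kind": "tool", "name": "run_tests", "result": "3 passed"},
    {"kind": "edit", "file": "api.py", "patch_lines": 12},
]

# Identical traces attest to the same generation process...
assert trace_digest(trace) == trace_digest(list(trace))

# ...while any divergence in the replayed steps changes the digest.
tampered = trace[:-1] + [{"kind": "edit", "file": "api.py", "patch_lines": 13}]
assert trace_digest(trace) != trace_digest(tampered)
```

    The design choice here is that the artifact under verification is the whole trace, not just the final diff, which is exactly the shift from program proofs toward supply-chain-style checks.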

  • madrox 4 minutes ago
    I encourage everyone to RTFA and not just respond to the headline. This really is a glimpse into where the future is going.

    I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.

    I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.

    If you're thinking about building something today that will still be relevant in 10 years, this is insightful.

  • oakpond 1 hour ago
    You do. Even the latest models still frequently write really weird code. The problem is some developers now just submit code for review that they didn't bother to read. You can tell. Code review is more important than ever imho.
    • sausagefeet 1 hour ago
      I agree with you. But I have to say, it is an uphill battle and all the incentives are against you.

      1. AI is meant to make us go faster, reviews are slow, the AI is smart, let it go.

      2. There are plenty of AI maximizers who think we should only be writing design docs and letting the AI go to town on them.

      Maybe this is a great time to start a company: maximize the benefits of AI while you can, without someone who has never written a line of code telling you that your job is going to disappear in 12 months.

      All the incentives are against someone who wants to use AI in a reasonable way, right now.

      • redhed 1 hour ago
        I actually agree that it's a good time to start a company. Lots of available software engineers who can actually understand code, AI at a level that can actually speed up development, and so many startups focusing on AI wrapper slop that you can actually make a useful product and separate yourself from the herd.

        Or you can be a grifter and make some AI wrapper yourself and cash out with some VC investment. So good time for a new company either way.

    • bradleykingz 58 minutes ago
      But it's so BORING. AI gets to do the fun part (writing code) and I'm stuck with the lame bits.

      It's like watching someone else solve a puzzle, or watching someone else play a game instead of playing it yourself (at best that's half as interesting as playing it through).

      • HoldOnAMinute 18 minutes ago
        I am really enjoying making requirements docs in an iterative process. I have a continuous improvement loop where I use the implementation to test out the docs. If I find a problem with the docs, I throw away the implementation, improve the docs, then re-implement. The kind of docs I'm getting are of amazing quality.
      • nz 17 minutes ago
        Your workplace has chosen to deprive you of the enjoyment that you got from the work. You have a few options: (1) ask for a raise proportional to the percentage of enjoyment that you lost, (2) find a workplace that does not do this, or (3) phone it in (they see you and your craft as something to be milked for cash, so maybe stop letting yourself get milked, and milk them right back, by doing _exactly_ what is asked of you and _not_ more -- let these strategic geniuses strategize using their own brains).
      • stretchwithme 15 minutes ago
        I can solve a problem in 10% of the time. Dealing with an issue TODAY, instead of having to put it in the backlog.
      • lukan 50 minutes ago
        For me the most fun part is getting something that works: designing toward the goal without micromanaging and getting lost in the details. I love AI for that, but it is hard to really own code this way. (At least I manually approve every change, or most changes, but still, verifying is hard.)
        • bitwize 31 minutes ago
          AI has really sharpened the line between the Master Builders of the world and the Lord Businesses along this question: What, exactly, is the "fun part" of programming? Is it simply having something that works? Or is it the process of going from not having it to having it through your own efforts and the sum total of decisions you made along the way?
    • MrDarcy 1 hour ago
      It is remarkably effective to have Claude Code do the code review and assign a quality score (call it a grade) to the contribution, derived from your own expectations of quality.

      Then don’t even bother looking at C work or below.
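      A rough sketch of that gate (the "Grade:" convention and every name here are assumptions you would encode in your own review prompt, not any real Claude Code API): parse a letter grade out of the model's review text and only surface work at or above your threshold.

```python
import re

# Map letter grades to an ordering so thresholds are comparable.
GRADES = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def worth_human_review(review_text, minimum="B"):
    """Return True if the review assigns a grade at or above `minimum`.

    The 'Grade: X' line is a convention you'd instruct the reviewing
    model to follow; it is not a built-in feature of any tool.
    """
    match = re.search(r"Grade:\s*([ABCDF])", review_text)
    if not match:
        return True  # fail open: no grade means a human should look anyway
    return GRADES[match.group(1)] >= GRADES[minimum]

assert worth_human_review("Solid tests, clear naming. Grade: A")
assert not worth_human_review("Duplicated logic, no tests. Grade: C")
assert worth_human_review("Reviewer produced no grade at all")
```

      Failing open on a missing grade is a deliberate choice: the point of the gate is to save human attention on clearly bad work, not to let unparseable reviews slip through unseen.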

      • NitpickLawyer 1 hour ago
        IME it works even better if you use another model for review. We've seen code by cc and review by gpt5.2/3 work very well.

        Also works with planning before any coding sessions. Gemini + Opus + GPT-xhigh works to get a lot of questions answered before coding starts.

    • xienze 1 hour ago
      > The problem is some developers now just submit code for review that they didn't bother to read.

      Can you blame them? All the AI companies are saying “this does a better job than you ever could”, every discussion topic on AI includes at least one (totally organic, I’m sure) comment along the lines of “I’ve been developing software for over twenty years and these tools are going to replace me in six months. I’m learning how to be a plumber before I’m permanently unemployed.” So when Claude spits out something that seems to work with a short smoke test, how can you blame developers for thinking “damn the hype is real. LGTM”?

      • jf22 59 minutes ago
        I'm a 99% organic person (I suppose I have tooth fillings) and the new models write code better than I do.

        I've been using LLMs for 14+ months now and they've exceeded my expectations.

        • HoldOnAMinute 17 minutes ago
          Not only do they exceed expectations, but any time they fall down, you can improve your instructions to them. It's easy to get into a virtuous cycle.
        • xienze 30 minutes ago
          So are you learning a trade? Or do you somehow think you’ll be one of the developers “good enough” to remain employed?
          • jf22 24 minutes ago
            I have a physical goods side hustle already and I'm brainstorming ideas about a trade I can do that will benefit from my programming experience.

            I'm thinking HVAC or painting lines in parking lots. HVAC because I can program smart systems and parking lot lines because I can use google maps and algos to propose more efficient parking lot designs to existing business owners.

            There is that paradox when if something becomes cheaper there is more demand so we'll see what happens.

            Finally, I'm a mediocre dev that can only handle 2-3 agents at a time so I probably won't be good enough.

      • bluefirebrand 43 minutes ago
        > Can you blame them?

        Yes I absolutely can and do blame them

  • _pdp_ 1 hour ago
    I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.

    The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?

    Verification works because someone has already defined what "correct" looks like. Possibly there is a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.

    But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.

    • Avshalom 1 hour ago
      Well that's a problem the software industry has been building for itself for decades.

      Software has, since at least the adoption of "agile", created an industry culture of not just refusing to build to specs but insisting that specs are impossible to get from a customer.

      • daveguy 48 minutes ago
        Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.
  • muraiki 53 minutes ago
    The article says that AWS's Cedar authorization policy engine is written in Lean, but it's actually written in Dafny. Writing Dafny is a lot closer to writing "normal" code rather than the proofs you see in Lean. As a non-mathematician I gave up pretty early in the Lean tutorial, while in a recent prototype I learned enough Dafny to be semi-confident in reviewing Claude's Dafny code in about half a day.

    The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).

    In the end I realized I'm writing a bunch of I/O-bound glue code, and plain ol' test-driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
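    The "audit log written before any mutating operation" invariant translates naturally into the TDD setting too. A minimal sketch (class and method names here are made up for illustration, not from the actual service): enforce the ordering at the single choke point for mutations, then assert it in tests, which is the runtime version of the property a Dafny kernel would prove statically.

```python
class AuditLog:
    """Append-only record of actions; written to before every mutation."""
    def __init__(self):
        self.entries = []

    def record(self, action):
        self.entries.append(action)

class Store:
    """All mutations funnel through put(), which logs first, writes second."""
    def __init__(self, log):
        self.log = log
        self.data = {}

    def put(self, key, value):
        # Invariant: the audit entry exists before the mutation lands.
        self.log.record(("put", key))
        self.data[key] = value

# A plain test exercising the invariant instead of proving it.
log = AuditLog()
store = Store(log)
store.put("user:1", {"name": "Ada"})
assert log.entries == [("put", "user:1")]
assert store.data == {"user:1": {"name": "Ada"}}
```

    The obvious trade-off versus the Dafny version: tests only check the executions you thought to write down, while the proof covered all of them.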

  • yoaviram 39 minutes ago
    I just finished writing a post about exactly this. Software development, as the act of manually producing code, is dying. A new discipline is being born. It is much closer to proper engineering.

    Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.

    The marginal cost of code is collapsing. That single fact changes everything.

    https://nonstructured.com/zen-of-ai-coding/

  • simonw 1 hour ago
    The "Nearly half of AI-generated code fails basic security tests" link provided in this piece is not credible in my opinion. It's a very thinly backed vendor report from a company selling security scanning software.
  • holtkam2 1 hour ago
    At the end of the day you need humans who understand the business critical (or safety critical) systems that underpin the enterprise.

    Someone needs to be held accountable when things go wrong. Someone needs to be able to explain to the CEO why this or that is impossible.

    If you want to have AI generate all the code for your business critical software, fine, but you better make sure you understand it well. Sometimes the fastest path to deep understanding is just coding things out yourself - so be it.

    This is why the truly critical software doesn’t get developed much faster when AI tools are introduced. The bottleneck isn’t how fast the code can be created, it’s how fast humans can construct their understanding before they put their careers on the line by deploying it.

    Ofc… this doesn’t apply to prototypes, hackathons, POCs, etc. for those “low stakes” projects, vibe code away, if you wish.

  • indymike 1 hour ago
    Because of the scale of generated code, often it is the AI verifying the AI's work.
    • tartoran 1 hour ago
      So who's verifying the AI doing the verifying or is it yet another AI layer doing that? If something goes wrong who's liable, the AI?
      • visarga 22 minutes ago
        You have two paths: code tests, and AI review (which is just a vibe test of the LGTM kind). You should use both in tandem; code testing is cheap to run, and you can build more complex systems if you apply it well. But ultimately it is the user, or usage, that needs to direct testing, or pay the price for formal verification. Most of the time it is usage: time passing reveals failure modes, and hindsight is 20/20.
  • acedTrex 2 hours ago
    No one does currently, and it's going to take a few very painful and high-profile failures of vital systems for this industry to RELEARN its lesson about the price of speed.

    In fact it will probably need to happen a few times PER org for the dust to settle. It will take several years.

    • arscan 1 hour ago
      Sure but industry cares about value (= benefit - price), not just price. Price could be astronomical, but that doesn’t matter if benefit is larger.
    • jcgrillo 1 hour ago
      I feel like people used to talk about nines of uptime more. As in more than one. These days we've lost that: https://bsky.app/profile/jkachmar.com/post/3mg4u3e6nak2p

      I recall a time, maybe around 2013-2017, when people were talking about 4 or 5 nines. But sometime around then the goalposts shifted, and instead of trying to make things as reliable as possible, it started becoming more about seeing how unreliable they can get before anyone notices or cares. It turns out people will suffer through a lot if there's some marginal benefit--remember what personal computers were like in the 1990s before memory protection? Vibe coding is just another chapter in that user-hostile epic. Convenient reliability, like this author describes (if it can be achieved), might actually make things better? But my money isn't on that.

  • bitwize 35 minutes ago
    Also AI.
  • lgl 1 hour ago
    I'm in the process of building v2.0 of my app using opus 4.6 and largely agree with this.

    It's pretty awesome but still does a lot of basic idiotic stuff. I was implementing a feature that required a global keyboard shortcut and asked Opus to define it, taking into account not to clash with common shortcuts. It built a field where only one modifier key was required. After I mentioned that this was not safe, since users could just define CTRL+C as the shortcut, and that we need more safeguards and should require at least two modifier keys, I got the usual "you're absolutely right" and it proceeded to require two modifier keys. But then it also created a huge blacklist of common shortcuts like copy, cut, paste, print, select all, etc., basically a bunch of single-modifier-key shortcuts. Once I mentioned that since we're already forcing two modifier keys that's useless, it said I'm right again and fixed it.

    The counterpoint to this idiocy is that overall it's very good at a lot of what is (in my mind) much more complicated stuff. It's a .NET app, and things like creating models, viewmodels, usercontrols, and setting up the entire hosting DI with pretty much all the best practices for .NET it does pretty awesomely.

    tl;dr is that training wheels are still mandatory imho

  • righthand 2 hours ago
    No one really. Code is for humans to read and for machines to compile and execute. LLMs are enabling people to just write the code and not have anyone read it. It’s solving a problem that didn’t really exist (we already had code generators before LLMs).

    It’s such an intoxicating copyright-abuse slot machine that a buddy who is building an ocaml+htmx tree editor told me “I always get stuck and end up going to the llm to generate code. Usually when I get to the html part.” I asked if he used a debugger before that, he said “that’s a good idea”.

    • galbar 1 hour ago
      This is something I've been wondering about...

      If boilerplate was such a big issue, we should have worked on improving code generation. In fact, many tools and frameworks exist that did this already:

      - rails has fantastic code generation for CRUD use cases

      - intelliJ IDEs have been able to do many types of refactors and class generation that included some of the boilerplate

      I haven't reached a conclusion on this train of thought yet, though.

  • foolfoolz 2 hours ago
    no one wants to believe this but there will be a point soon when an ai code review meets your compliance requirements to go to production. is that 2026? no. but it will come
    • righthand 1 hour ago
      We already have specifications though, so that’s not different. What happens when the AI is wrong and won’t let anyone deploy to production?
  • rademaker 2 hours ago
    In his latest essay, Leonardo de Moura makes a compelling case that if AI is going to write a significant portion of the world’s software, then verification must scale alongside generation. Testing and code review were never sufficient guarantees, even for human-written systems; with AI accelerating output, they become fundamentally inadequate. Leo argues that the only sustainable path forward is machine-checked formal verification — shifting effort from debugging to precise specification, and from informal reasoning to mathematical proof checked by a small, auditable kernel. This is precisely the vision behind Lean: a platform where programs and proofs coexist, enabling AI not just to generate code, but to generate code with correctness guarantees. Rather than slowing development, Lean-style verification enables trustworthy automation at scale.