59 comments

  • NitpickLawyer 21 hours ago
    > Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.

    Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched in Dec '22, the "best" open models were GPT-J (~6-7B) and GPT-NeoX (20B). I actually had an app running live, with users, using GPT-J for ~1 month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along) and so on.

    And then something happened: LLaMA models got "leaked" (I still think it was an on-purpose leak - don't sue us, we never meant to release it, etc.), and the rest is history. With L1 we got lots of optimisations like quantised models, fine-tuning and so on; L2 really saw fine-tuning take off (most of the fine-tunes were better than what Meta released); we got Alpaca showing off LoRA; and then a bunch of really strong models came out (Mistrals, Mixtrals, L3, Gemmas, Qwens, DeepSeeks, GLMs, Granites, etc.).

    By some estimations the open models are ~6mo behind what SotA labs have released. (Note that doesn't mean the labs are releasing their best models; it's likely they keep those in house for the next runs' data curation, synthetic datasets, distilling, etc.) Being 6mo behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2 years to reach GPT-3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.

    • genewitch 19 hours ago
      I'll bite. How do I train/make and/or use a LoRA, or, separately, how do I fine-tune? I've been asking this for months, and no one has a decent answer. Web search on my end is SEO/GEO-spam, with no real instructions.

      I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?

      • techwizrd 19 hours ago
        We have been fine-tuning models using Axolotl and Unsloth, with a slight preference for Axolotl. Check out the docs [0] and fine-tune or quantize your first model. There is a lot to be learned in this space, but it's exciting.

        0: https://axolotl.ai/ and https://docs.axolotl.ai/

        • arkmm 18 hours ago
          When do you think fine tuning is worth it over prompt engineering a base model?

          I imagine with the finetunes you have to worry about self-hosting, model utilization, and then also retraining the model as new base models come out. I'm curious under what circumstances you've found that the benefits outweigh the downsides.

          • reissbaker 16 hours ago
            For self-hosting, there are a few companies that offer per-token pricing for LoRA finetunes (LoRAs are basically efficient-to-train, efficient-to-host finetunes) of certain base models:

            - (shameless plug) My company, Synthetic, supports LoRAs for Llama 3.1 8b and 70b: https://synthetic.new All you need to do is give us the Hugging Face repo and we take care of the rest. If you want other people to try your model, we charge usage to them rather than to you. (We can also host full finetunes of anything vLLM supports, although we charge by GPU-minute for full finetunes rather than the cheaper per-token pricing for supported base model LoRAs.)

            - Together.ai supports a slightly wider range of base models than we do, with a bit more config required, and any usage is charged to you.

            - Fireworks does the same as Together, although they quantize the models more heavily (FP4 for the higher-end models). However, they support Llama 4, which is pretty nice although fairly resource-intensive to train.

            If you have reasonably good data for your task, and your task is relatively "narrow" (i.e. find a specific kind of bug, rather than general-purpose coding; extract a specific kind of data from legal documents rather than general-purpose reasoning about social and legal matters; etc), finetunes of even a very small model like an 8b will typically outperform — by a pretty wide margin — even very large SOTA models while being a lot cheaper to run. For example, if you find yourself hand-coding heuristics to fix some problem you're seeing with an LLM's responses, it's probably more robust to just train a small model finetune on the data and have the finetuned model fix the issues rather than writing hardcoded heuristics. On the other hand, no amount of finetuning will make an 8b model a better general-purpose coding agent than Claude 4 Sonnet.

            • delijati 12 hours ago
              Do you maybe know if there is a company in the EU that hosts models (DeepSeek, Qwen3, Kimi)?
              • reissbaker 6 hours ago
                Most inference companies (Synthetic included) host in a mix of the U.S. and EU — I don't know of any that promise EU-only hosting, though. Even Mistral doesn't promise EU-only AFAIK, despite being a French company. I think at that point you're probably looking at on-prem hosting, or buying a maxed-out Mac Studio and running the big models quantized to Q4 (although even that couldn't run Kimi: you might be able to get it working over ethernet with two Mac Studios, but the tokens/sec will be pretty rough).
          • tough 16 hours ago
            Mostly only for narrow applications, where a fine-tune lets you use a smaller, specialised model locally, trained for your specific use case.
          • whimsicalism 17 hours ago
            Fine-tuning rarely makes sense unless you are an enterprise, and even then it generally doesn't in most cases.
        • syntaxing 18 hours ago
          What hardware do you train on using Axolotl? I use Unsloth with Google Colab Pro.
      • qcnguy 18 hours ago
        LLM fine tuning tends to destroy the model's capabilities if you aren't very careful. It's not as easy or effective as with image generation.
        • israrkhan 11 hours ago
          Do you have a suggestion or a way to measure if model capabilities are getting destroyed? How does one measure it objectively?
          • RALaBarge 11 hours ago
            Ask it the same series of questions after training that you posed before training started. Is the quality lower?
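            A minimal sketch of that idea (assuming a Hugging Face transformers setup; the model paths are placeholders):

                from transformers import pipeline

                PROBES = [
                    "What is the capital of France?",
                    "Write a Python function that reverses a string.",
                    "Explain TCP slow start in two sentences.",
                ]

                def snapshot(model_path, out_path):
                    # greedy decoding so the before/after runs are comparable
                    gen = pipeline("text-generation", model=model_path)
                    with open(out_path, "w") as f:
                        for q in PROBES:
                            out = gen(q, max_new_tokens=200, do_sample=False)
                            f.write(f"### {q}\n{out[0]['generated_text']}\n\n")

                snapshot("meta-llama/Llama-3.1-8B-Instruct", "before.md")  # base model
                snapshot("./my-finetune", "after.md")  # after training; then diff the files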
            • israrkhan 4 hours ago
              That series of questions will measure only a particular area. I am concerned about destroying model capabilities in some other area that I do not pay attention to, and have no way of knowing.
              • simonh 3 hours ago
                Isn’t that a general problem with LLMs? The only way to know how good it is at something is to test it.
      • notpublic 19 hours ago
        https://github.com/unslothai/unsloth

        I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful.
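        The shape of it, going from memory of their Llama 3 notebooks (exact argument names may have drifted, so check the current docs):

            from unsloth import FastLanguageModel

            model, tokenizer = FastLanguageModel.from_pretrained(
                model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized base
                max_seq_length=2048,
                load_in_4bit=True,
            )
            # attach LoRA adapters; only these small matrices get trained
            model = FastLanguageModel.get_peft_model(
                model,
                r=16,  # adapter rank
                target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                lora_alpha=16,
            )
            # then train with trl's SFTTrainer on your dataset as usual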

      • svachalek 18 hours ago
        For completeness, for Apple hardware MLX is the way to go.
      • minimaxir 19 hours ago
        If you're using Hugging Face transformers, the library you want to use is peft: https://huggingface.co/docs/peft/en/quicktour

        There are Colab Notebook tutorials around training models with it as well.
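        The core of it is only a few lines (a sketch; the base model and target modules are illustrative):

            from transformers import AutoModelForCausalLM
            from peft import LoraConfig, get_peft_model, TaskType

            model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
            config = LoraConfig(
                task_type=TaskType.CAUSAL_LM,
                r=16,              # adapter rank
                lora_alpha=32,     # scaling factor
                lora_dropout=0.05,
                target_modules=["q_proj", "v_proj"],  # which layers get adapters
            )
            model = get_peft_model(model, config)
            model.print_trainable_parameters()  # typically well under 1% of the total
            # train with the usual transformers Trainer, then model.save_pretrained(...)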

      • jasonjmcghee 12 hours ago
        brev.dev made an easy-to-follow guide a while ago, but apparently Nvidia took it down or something when they bought them?

        So here's the original

        https://web.archive.org/web/20231127123701/https://brev.dev/...

      • electroglyph 15 hours ago
        unsloth is the easiest way to finetune due to the lower memory requirements
      • otabdeveloper4 4 hours ago
        > So what's the big secret about LLM LoRA?

        No clear use case for LLMs yet. ("Spicy" aka pornography finetunes are the only ones with broad adoption, but we don't talk about that in polite society here.)

      • pdntspa 15 hours ago
        Have you tried asking an LLM?
    • Nesco 16 hours ago
      Zuck wouldn’t have leaked it on 4chan, of all places
      • tough 16 hours ago
        prob just told an employee to get it done no?
      • vaenaes 16 hours ago
        Why not?
    • tonyhart7 20 hours ago
      is GLM 4.5 better than Qwen3 coder??
      • diggan 20 hours ago
        For what? It's really hard to say that one model is "generally" better than another, as they're all better/worse at specific things.

        My own benchmark suite has a bunch of different tasks I use various local models for, and I run it when I wanna see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for each task.

        They're being sold as general-purpose things that are better/worse at everything, but reality doesn't reflect this: they all have very specific tasks they're better/worse at, and the only way to find that out is by having a private benchmark you run yourself.

        • kelvinjps10 19 hours ago
          coding? they are coding models? what specific tasks is one performing better than the other?
          • diggan 19 hours ago
            They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code"; coding isn't one homogeneous activity that one model beats all the other models at.

            > what specific tasks is one performing better than the other?

            That's exactly why you create your own benchmark, so you can figure that out by just having a list of models, instead of testing each individually and basing it on "feels better".

            • reverius42 3 hours ago
              > coding isn't one homogeneous activity that one model beats all the other models at

              If you can't even replace one coding model with another, it's hard to imagine you can replace human coders with coding models.

              • diggan 37 minutes ago
                What do you mean, "can't even replace"? You can; nothing in my comment says you cannot.
          • whimsicalism 19 hours ago
            glm 4.5 is not a coding model
            • simonw 19 hours ago
              It may not be code-only, but it was trained extensively for coding:

              > Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.

              From my notes here: https://simonwillison.net/2025/Jul/28/glm-45/

              • whimsicalism 19 hours ago
                yes, all reasoning models currently are, but it’s not like ds coder or qwen coder
                • simonw 19 hours ago
                  I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.

                  Am I missing something?

                  • whimsicalism 18 hours ago
                    I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. And the difference there is that while both do tons of code RL, in one they do not monitor performance on anything else for forgetting/regression and focus fully on code post-training pipelines and it is not meant for other tasks.
      • NitpickLawyer 20 hours ago
        I haven't tried them (released yesterday, I think?). The benchmarks look good (similar, I'd say) but that's not saying much these days. The best test you can do is have a couple of cases that match your needs, and run them yourself with the harness you are using (aider, cline, roo, any of the CLI tools, etc). OpenRouter usually has them up soon after launch, and you can run a quick test really cheap (and only deal with one provider for billing & stuff).
  • bob1029 18 hours ago
    > still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.

    I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.

    It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.

    • jonas21 18 hours ago
      Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).
      • itsalotoffun 18 hours ago
        I think GP means that if you internalize the bitter lesson (more data and more compute win), you stop imagining how to squeeze SOTA-minus-1 performance out of constrained compute environments.
        • reactordev 8 hours ago
          This. When we ran out of speed on the CPU, we moved to the GPU. Same thing here. The more we work with (22T) models, quants, and decimating precision - the more we learn and find more novel ways to do things.
      • yahoozoo 18 hours ago
        What does that have to do with quantizing?
  • righthand 19 hours ago
    Did you understand the implementation or just that it produced a result?

    I would hope an LLM could spit out a cobbled form of answer to a common interview question.

    Today a colleague presented data changes and used an LLM to build an app to display the JSON for the presentation. Why did they not just pipe the JSON into our already-working app that displays this data?

    People around me for the most part are using LLMs to enhance their presentations, not to actually implement anything useful. I have been watching my coworkers use it that way for months.

    Another example? A different coworker wanted to build a document macro to perform bulk updates on courseware content, swapping old words for new words. To build the macro they first wrote a rubric to prompt an LLM correctly inside of a Word doc.

    That filled-in rubric is then used to generate a program template for the macro. To define the requirements for the macro, the coworker then used a slideshow slide to list bullet points of functionality, in this case to Find+Replace words in courseware slides/documents using a list of words from another text document. Given the complexity of the system, I can’t believe my colleague saved any time. The presentation was interesting though, and that is what they got compliments on.

    However the solutions are absolutely useless for anyone else but the implementer.

    • simonw 19 hours ago
      I scanned the code and understood what it was doing, but I didn't spend much time on it once I'd seen that it worked.

      If I'm writing code for production systems using LLMs I still review every single line - my personal rule is I need to be able to explain how it works to someone else before I'm willing to commit it.

      I wrote a whole lot more about my approach to using LLMs to help write "real" code here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/

      • larodi 2 hours ago
        I think this is the right way to do it. Produce with the LLM, debug and read every line. Delete lots of it.

        Many people fear this approach for production, but it is reasonable compared to someone with a single Coursera course writing production JS code.

        Yet we tend to say "the LLM wrote this and that", which implies the model did all the work. In reality it should be understood as a complex heavy-lifting machine which is expected to be operated by a very well-prepared operator.

        The fact that I got a Kango and drilled some holes does not make me an engineer, right? And it takes an engineer to sign off on the building, even though it was ArchiCAD doing the math.

      • photon_lines 18 hours ago
        This is why I love using the DeepSeek chain-of-reasoning output ... I can actually go through and read what it's 'thinking' to validate whether it's basing its solution on valid facts / assumptions. Either way, thanks for all of your valuable write-ups on these models, I really appreciate them Simon!
        • vessenes 14 hours ago
          Nota bene - there is a fair amount of research indicating that models’ outputs and ‘thoughts’ do not necessarily align with their chain-of-reasoning output.

          You can validate this pretty easily by asking some logic or coding questions: you will likely note that the final output is not necessarily the logical conclusion of the thinking; sometimes it's significantly orthogonal to it, or the model returns to reasoning in the middle.

          All that to say - good idea to read it, but stay vigilant on outputs.

      • shortrounddev2 15 hours ago
        Serious question: if you have to read every line of code in order to validate it in production, why not just write every line of code instead?
        • simonw 14 hours ago
          Because it's much, much faster to review a hundred lines of code than it is to write a hundred lines of code.

          (I'm experienced at reading and reviewing code.)

          • paufernandez 12 hours ago
            Simon, don't you fear "atrophy" in your writing ability?
            • simonw 11 hours ago
              I think it will happen a bit, but I'm not worried about it.

              My ability to write with a pen has suffered enormously now that I do most of my writing on a phone or laptop - but I'm writing way more.

              I expect I'll become slower at writing code without an LLM, but the volume of (useful) code I produce will be worth the trade off.

            • DonHopkins 1 hour ago
              Reading other people's (or LLMs') code is one of the best ways of improving your own coding abilities. Lazy people using LLMs to avoid reading any code is called "vibe coding", and their abilities atrophy no matter who or what wrote the code they refuse to read.
          • otabdeveloper4 4 hours ago
            Absolutely false for anything but the most braindead corporate CRUD code.

            We hate reading code and will avoid the hassle every time, but that doesn't mean it is easy.

            • DonHopkins 1 hour ago
              >We hate reading code and will avoid the hassle every time, but that doesn't mean it is easy.

              Speak for yourself. I love reading code! It's hard and it takes a lot of energy, but if you hate it, maybe you should find something else to do.

              Being a programmer who hates reading code is like being a bus driver who hates looking at the road: dangerous and menacing to the public and your customers.

      • th0ma5 18 hours ago
        [flagged]
    • magic_hamster 13 hours ago
      The LLM is the solution.
    • bsder 13 hours ago
      > However the solutions are absolutely useless for anyone else but the implementer.

      Disposable code is where AI shines.

      AI generating the boilerplate code for an obtuse build system? Yes, please. AI generating an animation? Ganbatte ("go for it"). (Look at how much work 3Blue1Brown had to put into that--if AI can help that kind of thing, it has my blessings). AI enabling someone who doesn't program to generate some prototype that they can then point at an actual programmer? Excellent.

      This is fine because you don't need to understand the result. You have a concrete pass/fail gate and don't care what's underneath. This is real value. The problem is that it isn't gigabuck value.

      The stuff that would be gigabuck value is unfortunately where AI falls down. Fix this bug in a product. Add this feature to an existing codebase. etc.

      AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.

  • AlexeyBrin 21 hours ago
    Most likely its training data included countless Space Invaders implementations in various programming languages.
    • gblargg 19 hours ago
      The real test is if you can have it tweak things. Have the ship shoot down. Have the space invaders come from the left and right. Add two player simultaneous mode with two ships.
      • wizzwizz4 14 hours ago
        It can usually tweak things, if given specific instruction, but it doesn't know when to refactor (and can't reliably preserve functionality when it does), so the program gets further and further away from something sensible until it can't make edits any more.
        • simonw 14 hours ago
          For serious projects you can address that by writing (or having it write) unit tests along the way, that way it can run in a loop and avoid breaking existing functionality when it adds new changes.
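          For example, even a toy game has behaviour worth pinning down (a Python sketch with hypothetical names; the same idea applies in JavaScript):

              def rects_overlap(a, b):
                  # axis-aligned bounding boxes: the usual simple-game collision check
                  return (a["x"] < b["x"] + b["w"] and a["x"] + a["w"] > b["x"]
                          and a["y"] < b["y"] + b["h"] and a["y"] + a["h"] > b["y"])

              def test_bullet_hits_invader():
                  assert rects_overlap({"x": 10, "y": 10, "w": 2, "h": 6},
                                       {"x": 8, "y": 8, "w": 8, "h": 8})

              def test_clean_miss():
                  assert not rects_overlap({"x": 100, "y": 10, "w": 2, "h": 6},
                                           {"x": 8, "y": 8, "w": 8, "h": 8})

          Run the tests after every agent edit, and a "refactor" that silently breaks collision detection gets caught immediately.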
          • greesil 14 hours ago
            Okay ask it to write unit tests for space invaders next time :)
    • quantumHazer 20 hours ago
      and probably some of the synthetic data are generated copies of the games already in the dataset?

      I have this feeling with LLM-generated React frontends: they all look the same

      • tshaddox 19 hours ago
        To be fair, the human-generated user interfaces all look the same too.
      • cchance 19 hours ago
        Have you used the internet? That's how the internet looks; they're all fuckin' React with the same layouts and styles, 90% shadcn lol
      • tw1984 4 hours ago
        Most human-generated methods look the same. In fact, in SWE, we reward people for generating code that looks & feels the same; they call it "working as a team".
      • bayindirh 20 hours ago
        Last time somebody asked for a "premium camera app for iOS", and the model (re)generated Halide.

        Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

        • Uehreka 19 hours ago
          > Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

          People really need to stop saying this. I get that it was the Smart Guy Thing To Say in 2023, but by this point it’s pretty clear that it’s not true in any way that matters for most practical purposes.

          Coding LLMs have clearly been trained on conversations where a piece of code is shown, a transformation is requested (rewrite this from Python to Go), and then the transformed code is shown. It’s not that they’re just learning codebases, they’re learning what working with code looks like.

          Thus you can ask an LLM to refactor a program in a language it has never seen, and it will “know” what refactoring means, because it has seen it done many times, and it will stand a good chance of doing the right thing.

          That’s why they’re useful. They’re doing something way more sophisticated than just “recombining codebases from their training data”, and anyone chirping 2023 sound bites is going to miss that.

          • cztomsik 3 hours ago
            I don't know, I have mixed-bag experiences and it's not really improving. It greatly varies depending on the programming language and the kind of problem which I'm trying to solve.

            The tasks where it works great are things I'd expect to be part of the dataset (GitHub, blog posts), or are "classic" LM tasks (understand + copy-paste/patch). The actual intelligence, in my opinion, is still very limited. So while it's true it's not "just recall", it still might be "mostly recall".

            BTW: copy-paste is something that works great in any attention-based model. On the other hand, models like RWKV usually fail at it and are not suited for this IMHO (but I think they have much better potential for AGI).

        • FeepingCreature 20 hours ago
          True where trivial; where nontrivial, false.

          Trivially, humans don't emit something they don't know either. You don't spontaneously figure out JavaScript from first principles; you put together your existing knowledge into new shapes.

          Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times. Will it be put together from smaller fragments? Yes, this is called "experience" or if the fragments are small enough, "understanding".

          • phkahler 19 hours ago
            >> Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times.

            I think most people writing software today are reinventing a wheel, even in corporate environments for internal tools. Everyone wants their own tweak or thinks their idea is unique and nobody wants to share code publicly, so everyone pays programmers to develop buggy bespoke custom versions of the same stuff that's been done 100 times before.

            I guess what I'm saying is that your requirements are probably not new, and to the extent they are yes an LLM can fill in the blanks due to its fluency in languages.

          • bayindirh 20 hours ago
            Humans can observe ants and invent ant colony optimization. AIs can’t.

            Humans can explore what they don’t know. AIs can’t.

            • falcor84 20 hours ago
              What makes you categorically say that "AIs can't"?

              Based on my experience with present-day AIs, I personally wouldn't be surprised at all if, when you showed Gemini 2.5 Pro a video of an insect colony and asked it "Take a look at the way they organize and see if that gives you inspiration for an optimization algorithm", it spat something interesting out.

              • sarchertech 18 hours ago
                It will 100% have something in its training set discussing a human doing this and will almost definitely spit out something similar.
                • fc417fc802 2 hours ago
                  That's a good point but all it means is that we can't test the hypothesis one way or the other due to never being entirely certain that a given task isn't anywhere in the training data. Supposing that "AIs can't" is then just as invalid as supposing that "AIs can".
            • FeepingCreature 20 hours ago
              What makes you categorically say that "humans can"?

              I couldn't do that with an ant colony. I would have to train on ant research first.

              (Oh, and AIs can absolutely explore what they don't know. Watch a Claude Code instance look at a new repository. Exploration is a convergent skill in long-horizon RL.)

            • ben_w 19 hours ago
              > Humans can observe ants and invent any colony optimization. AIs can’t.

              Surely this is exactly what current AI do? Observe stuff and apply that observation? Isn't this the exact criticism, that they aren't inventing ant colonies from first principles without ever seeing one?

              > Humans can explore what they don’t know. AIs can’t.

              We only learned to decode Egyptian hieroglyphs because of the Rosetta Stone. There's no translation for North Sentinelese, the Voynich manuscript, or Linear A.

              We're not magic.

            • numpad0 2 hours ago
              humans also eat
            • CamperBob2 19 hours ago
              That's what benchmarks like ARC-AGI are designed to test. The models are getting better at it, and you aren't.

              Nothing ultimately matters in this business except the first couple of time derivatives.

        • satvikpendem 20 hours ago
          This doesn't make sense thermodynamically because models are far smaller than the training data they purport to hold and recall, so there must be some level of "understanding" going on. Whether that's the same as human understanding is a different matter.
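          (Rough numbers: 15T training tokens is on the order of 60 TB of text at ~4 bytes per token, while a 70B-parameter model at 16-bit precision is ~140 GB - a roughly 400x reduction before any quantization.)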
          • Eggpants 17 hours ago
            It’s a lossy text compression technique. It’s clever applied statistics. Basically an advanced association rules algorithm which has been around for decades but modified to consider order and relative positions.

            There is no understanding, regardless of the wants of all the capital investors in this domain.

            • simonw 16 hours ago
              I don't care if it can "understand" anything, as long as I can use it to achieve useful things.
              • Eggpants 16 hours ago
                “useful things” like poorly drawing birds on bikes? ;)

                (I have much respect for what you have done and are currently doing, but you did walk right into that one)

                • msephton 12 hours ago
                  The pelican on a bicycle is a very useful test.
            • CamperBob2 13 hours ago
              > It’s a lossy text compression technique.

              That is a much, much bigger deal than you make it sound like.

              Compression may, in fact, be all we need. For that matter, it may be all there is.

        • mr_toad 16 hours ago
          > They remix and rewrite what they know. There's no invention, just recall...

          If they only recalled they wouldn’t “hallucinate”. What’s a lie if not an invention? So clearly they can come up with data that they weren’t trained on, for better or worse.

          • 0x457 15 hours ago
            Because internally, there isn't a difference between a correctly "recalled" token and an incorrectly recalled (hallucinated) one.
    • NitpickLawyer 20 hours ago
      This comment is ~3 years late. Every model since gpt3 has had the entirety of available code in their training data. That's not a gotcha anymore.

      We went from ChatGPT's "oh, look, it looks like Python code but everything is wrong" to "here's a full-stack boilerplate app that does what you asked and works 0-shot" inside 2 years. That's the kicker. And the sauce isn't just in the training set; models now do post-training and RL and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (first models were 2/4k max), agentic stuff, and so on.

      These kinds of comments are really missing the point.

      • haar 20 hours ago
        I've had little success with Agentic coding, and what success I have had has been paired with hours of frustration, where I'd have been better off doing it myself for anything but the most basic tasks.

        Even then, when you start to build up complexity within a codebase, the results have often been worse than "I'll start generating it all from scratch again, and include this as an addition to the initial longtail specification prompt as well", and even then... it's been a crapshoot.

        I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.

        • aschobel 20 hours ago
          Bingo, it's magical but the learning curve is very very steep. The METR study on open-source productivity alluded to this a bit.

          I am definitely at a point where I am more productive with it, but it took a bunch of effort.

          • haar 19 hours ago
            Apologies if I was unclear.

            The more I've used it, the more I've disliked how poor the results it produced were, and the more I've realised I would have been better served by doing it myself and following a methodical path for things that I didn't have experience with.

            It's easier to step through a problem as I'm learning and making small changes than an LLM going "It's done, and production ready!" where it just straight up doesn't work for 101 different tiny reasons.

            • airspresso 2 hours ago
              My preferred approach to avoid that outcome is to divide & conquer the problem. Ask the LLM to implement each small bit in the order you'd implement it yourself given what you know about the codebase.
          • devmor 19 hours ago
            The subjects in the study you are referencing also believed that they were more productive with it. What metrics do you have to convince yourself you aren't under the same illusory bias they were?
            • simonw 19 hours ago
              Yesterday I used ffmpeg to extract the frame at the 13 second mark of a video out as a JPEG.

              If I didn't have an LLM to figure that out for me I wouldn't have done it at all.

              • throwworhtthrow 19 hours ago
                LLMs still give subpar results with ffmpeg. For example, when I asked Sonnet to trim a long video with ffmpeg, it put the input file parameter before the start time parameter, which triggers an unnecessary decode of the video file. [1]

                Sure, use the LLM to get over the initial hump. But ffmpeg's no exception to the rule that LLMs produce subpar code. It's worth spending a couple of minutes reading the docs to understand what it did, so you can do it better, and unassisted, next time.

                [1] https://ffmpeg.org/ffmpeg.html#:~:text=ss%20position
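                Concretely, for the frame-at-13-seconds case (file names are illustrative):

                    # input seeking: -ss before -i seeks first, then decodes - fast
                    ffmpeg -ss 13 -i input.mp4 -frames:v 1 frame.jpg

                    # output seeking: -ss after -i decodes everything up to 13s - slow
                    ffmpeg -i input.mp4 -ss 13 -frames:v 1 frame.jpg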

                • CamperBob2 18 hours ago
                  That says more about suboptimal design on ffmpeg's part than it does about the LLM. Most humans can't deal with ffmpeg command lines, so it's not surprising that the LLM misses a few tricks.
                  • nottorp 17 hours ago
                    Had an LLM generate 3 lines of working C++ code that was "only" one order of magnitude slower than what I edited the code to in 10 minutes.

                    If you're happy with results like that, sure, LLMs miss "a few tricks"...

                    • ben_w 17 hours ago
                      You don't have to leave LLM code alone, it's fine to change it — unless, I guess, you're doing some kind of LLM vibe-code-golfing?

                      But this does remind me of a previous co-worker. Wrote something to convert from a custom data store to a database, his version took 20 minutes on some inputs. Swore it couldn't possibly be improved. Obviously ridiculous because it didn't take 20 minutes to load from the old data store, nor to load from the new database. Over the next few hours of looking at very mediocre code, I realised it was doing an unnecessary O(n^2) check, confirmed with the CTO it wasn't business-critical, got rid of it, and the same conversion on the same data ran in something like 200ms.

                      Over a decade before LLMs.

                      • nottorp 17 hours ago
                        We all do that, sometimes when it’s time-critical, sometimes when it isn’t.

                        But I keep being told “AI” is the second coming of Ahura Mazda so it shouldn’t do stuff like that right?

                        • ben_w 15 hours ago
                          > Ahura Mazda

                          Niche reference, I like it.

                          But… I only hear of scammers who say, and psychosis sufferers who think, LLMs are *already* that competent.

                          Future AI? Sure, lots of sane-seeming people also think it could go far beyond us. Special purpose ones have in very narrow domains. But current LLMs are only good enough to be useful and potentially economically disruptive, they're not even close to wildly superhuman like Stockfish is.

                          • CamperBob2 14 hours ago
                            Sure. If you ask ChatGPT to play chess, it will put up an amateur-level effort at best. Stockfish will indeed wipe the floor with it. But what happens when you ask Stockfish to write a Space Invaders game?

                            ChatGPT will get better at chess over time. Stockfish will not get better at anything except chess. That's kind of a big difference.

                            • ben_w 14 hours ago
                              > ChatGPT will get better at chess over time

                              Oddly, LLMs got worse at specifically chess: https://dynomight.net/chess/

                              But even to the general point, there's absolutely no agreement how much better the current architectures can ultimately get, nor how quickly they can get there.

                              Do they have potential for unbounded improvements, albeit at exponential cost for each linear incremental improvement? Or will they asymptotically approach someone with 5 years' experience, 10 years' experience, a lifetime of experience, or a higher level than any human?

                              If I had to bet, I'd say current models have asymptotic growth converging to merely "ok" performance; and separately claim that even if they're actually unbounded with exponential cost for linear returns, we can't afford the training cost needed to make them act like someone with even just 6 years of professional experience in any given subject.

                              Which is still a lot. Especially as it would be acting like it had about as much experience in every other subject at the same time. Just… not a literal Ahura Mazda.

                              • CamperBob2 13 hours ago
                                > If I had to bet, I'd say current models have asymptotic growth converging to merely "ok" performance

                                (Shrug) People with actual money to spend are betting twelve figures that you're wrong.

                                Should be fun to watch it shake out from up here in the cheap seats.

                                • ben_w 12 hours ago
                                  Nah, a trillion dollars is about right for "ok". A percentage point of the global economy in cost; automate 2 percent of it and you get a huge margin. We literally set more than that on actual fire each year.

                                  For "pretty good", it would be worth 14 figures, over two years. The global GDP is 14 figures. Even if this only automated 10% of the economy, it pays for itself after a decade.

                                  For "Ahura Mazda", it would easily be worth 16 figures, what with that being the principal God and god of the sky in Zoroastrianism, and the only reason it stops at 16 is the implausibility of people staying organised for longer to get it done.

                                • nottorp 2 hours ago
                                  > People with actual money to spend are betting

                                  ... but those "people with actual money to spend" have burned money on fads before. Including on "AI", several times before the current hysterics.

                                  If you're a good actor/psychologist, it's probably a good business model to figure out how to get VC money and how to justify your startup failing so they give you money for the next startup.

                        • CamperBob2 16 hours ago
                          "I'm taking this talking dog right back to the pound. It told me to short NVDA, and you should see the buffer overflow bugs in the C++ code it wrote. Totally overhyped. I don't get it."
                          • nottorp 16 hours ago
                            "We hear you have been calling our deity a talking dog. Please enter the red door for reeducation."
              • dingnuts 19 hours ago
                It is nice to use LLMs to generate ffmpeg commands, because those can be pretty tricky, but really, you wouldn't have just used the man page before?

                That explains a lot about Django that the author is allergic to man pages lol

                • ben_w 17 hours ago
                  I remember when I was a kid, people asking a teacher how to spell a word, and the answer was generally "look it up in a dictionary"… which you can only do if you already have shortlist of possible spellings.

                  *nix man pages are the same: if you already know which tool can solve your problem, they're easy to use. But you have to already have a shortlist of tools that can solve your problem, before you even know which man pages to read.

                  • adastra22 11 hours ago
                    That’s what GNU info is for, of course.
                • simonw 19 hours ago
                  I just took a look, and the man page DOES explain how to do that!

                  ... on line 3,218: https://gist.github.com/simonw/6fc05ea7392c5fb8a5621d65e0ed0...

                  (I am very confident I am not the only person who has been deterred by ffmpeg's legendarily complex command-line interface. I feel no shame about this at all.)

                  • lexh 9 hours ago
                    To be a little more fair... that example is tidily slotted into the EXAMPLES section, under the heading "You can extract images from a video, or create a video from many images".

                    I don't think most people read the man pages top to bottom. And even if they did, then for as much grief as you're giving ffmpeg, llm has an even larger burden... no man page and the docs weigh in at over 8k lines.

                    I get the general point that ffmpeg is a powerful, complex tool... but this is a weird fight to pick.

                    • simonw 9 hours ago
                      I could not be more confident that "ffmpeg is difficult to figure out" is not a weird fight to pick. It's notorious!
                  • quesera 17 hours ago
                    Ffmpeg is genuinely complicated! And the CLI is convoluted (in justifiable, and unfortunate ways).

                    But if you approach ffmpeg from the perspective of "I know this is possible", you are always correct, and can almost always reach the "how" in a handful of minutes.

                    Whether that's worth it or not, will vary. :)

                  • otabdeveloper4 3 hours ago
                    The correct solution here would have been to feed the man page to an LLM summarizer.

                    Alas, instead of correct and easy solutions to problems, we are focused on sci-fi robot assistant bullshit.

              • devmor 19 hours ago
                You wouldn't have just typed "extract frame at timestamp as jpeg ffmpeg" into Google and used the StackExchange result that comes up first that gives you a command to do exactly that?
                • simonw 19 hours ago
                  Before LLMs made ffmpeg no-longer-frustrating-to-use I genuinely didn't know that ffmpeg COULD do things like that.
                  • devmor 16 hours ago
                    I'm not really sure what you're saying an LLM did in this case. Inspired a lost sense of curiosity?
                    • simonw 15 hours ago
                      My general point is that people say things like "yeah, but this one study showed that programmers over-estimate the productivity gain they get from LLMs so how can you really be sure?"

                      Meanwhile I've spent the past two years constantly building and implementing things I never would have done because of the reduction in friction LLM assistance gives me.

                      I wrote about this first two years ago - AI-enhanced development makes me more ambitious with my projects - https://simonwillison.net/2023/Mar/27/ai-enhanced-developmen... - when I realized I was hacking on things with tech like AppleScript and jq that I'd previously avoided.

                      It's hard to measure the productivity boost you get from "wouldn't have built that thing" to "actually built that thing".

                    • Philpax 15 hours ago
                      Translated a vague natural language query ("cli, extract frame 13s into video") into something immediately actionable with specific examples and explanations, surfacing information that I would otherwise not know how to search for.

                      That's what I've done with my ffmpeg LLM queries, anyway - can't speak for simonw!

                      • wizzwizz4 14 hours ago
                        DuckDuckGo search results for "cli, extract frame 13s into video" (no quotes):

                        https://stackoverflow.com/questions/10957412/fastest-way-to-...

                        https://superuser.com/questions/984850/linux-how-to-extract-...

                        https://www.aleksandrhovhannisyan.com/notes/video-cli-cheat-...

                        https://www.baeldung.com/linux/ffmpeg-extract-video-frames

                        https://ottverse.com/extract-frames-using-ffmpeg-a-comprehen...

                        Search engines have been able to translate "vague natural language queries" into search results for a decade, now. This pre-existing infrastructure accounts for the vast majority of ChatGPT's apparent ability to find answers.

                        • stelonix 11 hours ago
                          Yet the interface is fundamentally different: the output feels much more like bro pages[0], and it's within a click of the clipboard, one Ctrl+V away from extracting the 13th-second screenshot. I've been using Google for the past 24 years and my google-fu has always left people amazed; yet I can no longer be bothered to go through Stack Exchange's results when an LLM not only spits out the answer so nicely, but also does the equivalent of an explainshell[1].

                          Not comparable and I fail to see why going through Google's ads/results would be better?

                          [0] https://github.com/pombadev/bropages

                          [1] https://github.com/idank/explainshell

                          • wizzwizz4 6 hours ago
                            DuckDuckGo insists on shoving "AI Assist" entries in its results, so I have a reasonable idea of how often LLMs are completely wrong even given search results. The answer's still "more than one time in five".

                            I did not suggest using Google Search (the company's on record as deliberately making Google Search worse), but there are other search engines. My preferred search engines don't do the fancy "interpret natural language queries" pre-processing, because I'm quite good at doing that in my head and often want to research niche stuff, but there are many still-decent search engines that do, and don't have ads in the results.

                            Heck, you can even pay for a good search engine! And you can have it redirect you to the relevant section of the top search result automatically: Google used to call this "I'm feeling lucky!" (although it was before URI text fragments, so it would just send you to the top of the page). All the properties you're after, much more cheaply, and you keep the information about provenance, and your answer is more-reliably accurate.

                    • 0x457 15 hours ago
                      LLM somewhat understood ffmpeg documentation? Not sure what is not clear here.
      • jan_Sate 20 hours ago
        Not exactly. The real utility value of LLMs for programming is to come up with something new. For Space Invaders, instead of using an LLM for that, I might as well just manually search for the code online and use that.

        To show that an LLM actually can provide value for one-shot programming, you need to find a problem for which there's no fully working sample code available online. I'm not trying to say that an LLM couldn't do that. But just because an LLM can come up with a perfectly working Space Invaders doesn't mean that it could.

        • tracker1 19 hours ago
          I have a friend who has been doing just that... usually with his company he manages a handful of projects where a bulk of the development is outsourced overseas. This past year, he's outpaced the 6 devs he's had working on misc projects just with his own efforts and AI. Most of this being a relatively unique combination of UX with features that are less common.

          He's using AI with note-taking apps for meetings to enhance notes and flesh out technology ideas at a higher level, then refining those ideas into working experiments.

          It's actually impressive to see. My personal experience has been far more disappointing, to say the least. I can't speak to the code quality, consistency or even structure in terms of most people being able to maintain such applications, though. I've asked to shadow him through a few of his vibe coding sessions to see his workflow. It feels rather alien to me; again, my experience has been much more disappointing, with having to correct AI errors.

          • nottorp 17 hours ago
            Is this the same person who posted about launching 17 "products" in one year a few days ago on HN? :)
            • tracker1 13 hours ago
              No, he's been working on building a larger eLearning solution with some interesting workflow analytics around courseware evaluation and grading. He's been involved in some of the newer LRS specifications and some implementation details to bridge training as well as real world exposure scenarios. Working a lot with first responders, incident response training etc.

              I've worked with him off and on for years from simulating aircraft diagnostics hardware to incident command simulation and setting up core infrastructure for F100 learning management backends.

        • devmor 19 hours ago
          > The real utility value of LLM for programming is to come up with something new.

          That's the goal for these projects anyway. I don't know that it's true or feasible. I find the RAG models much more interesting myself; I see the technology as having far more value in search than generation.

          Rather than write some Markov-chain-reminiscent Frankenstein function when I ask it how to solve a problem, I would like to see it direct me to the original sources it would use to build those tokens, so that I can see their implementations in context and use my judgement.

          • simonw 19 hours ago
            "I would like to see it direct me to the original sources it would use to build those tokens"

            Sadly that's not feasible with transformer-based LLMs: those original sources are long gone by the time you actually get to use the model, scrambled a billion times into a trained set of weights.

            One thing that helped me understand this is understanding that every single token output by an LLM is the result of a calculation that considers all X billion parameters that are baked into that model (or a subset of that in the case of MoE models, but it's still billions of floating point calculations for every token.)

            You can get an imitation of that if you tell the model "use your search tool and find example code for this problem and build new code based on that", but that's a pretty unconventional way to use a model. A key component of the value of these things is that they can spit out completely new code based on the statistical patterns they learned through training.

            • devmor 19 hours ago
              I am aware, and that's exactly why I don't think they're anywhere near as useful for this type of work as the people pushing them want them to be.

              I tried to push for this type of model when an org I worked with over a decade ago was first exploring using the first generation of Tensorflow to drive customer service chatbots and was sadly ignored.

              • simonw 19 hours ago
                I don't understand. For code, why would I want to remix existing code snippets?

                I totally get the value of RAG style patterns for information retrieval against factual information - for those I don't want the LLM to answer my question directly, I want it to run a search and show me a citation and directly quote a credible source as part of answering.

                For code I just want code that works - I can test it myself to make sure it does what it's supposed to.

                • devmor 19 hours ago
                  > I don't understand. For code, why would I want to remix existing code snippets?

                  That is what you're doing already. You're just relying on a vector compression and search engine to hide it from you and hoping the output is what you expect, instead of having it direct you to where it remixed those snippets from, so you can see how they work to start with and make sure it's properly implemented from the get-go.

                  We all want code that works, but understanding that code is a critical part of that for anything but a throw-away one time use script.

                  I don't really get this desire to replace critical thought with hoping and testing. It sounds like the pipe dream of a middle manager, not a tool for a programmer.

                  • stavros 18 hours ago
                    I don't understand your point. You seem to be saying that we should be getting code from the source, then adapting it to our project ourselves, instead of getting adapted code to begin with.

                    I'm going to review the code anyway, why would I not want to save myself some of the work? I can "see how they work" after the LLM gives them to me just fine.

                    • devmor 16 hours ago
                      The work that you are "saving" is the work of using your brain to determine the solution to the problem. Whatever the LLM gives you doesn't have a context it is used in other than your prompt - you don't even know what it does until after you evaluate it.

                      If you instead have a set of sources related to your problem, they immediately come with context, usage and in many cases, developer notes and even change history to show you mistakes and adaptations.

                      You're ultimately creating more work for yourself* by trying to avoid work, and possibly ending up with an inferior solution in the process. Where is your sense of efficiency? Where is your pride as an intellectual?

                      * Yes, you are most likely creating more work for yourself even if you think you are capable of telling otherwise. [1]

                      1. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

                      • simonw 16 hours ago
                        It sounds like you care deeply about learning as much as you can. I care about that too.

                        I would encourage you to consider that even LLM-generated code can teach you a ton of useful new things.

                        Go read the source code for my dumb, zero-effort space invaders clone: https://github.com/simonw/tools/blob/main/space-invaders-GLM...

                        There's a bunch of useful lessons to be picked up even from that!

                        - Examples of CSS gradients, box shadows and flexbox layout

                        - CSS keyframe animation

                        - How to implement keyboard events in JavaScript

                        - A simple but effective pattern for game loops against a Canvas element, using requestAnimationFrame

                        - How to implement basic collision detection

                        If you've written games like this before these may not be new to you, but I found them pretty interesting.

                      • stavros 16 hours ago
                        Thanks for the concern, but I'm perfectly able to judge for myself whether I'm creating more work or delivering an inferior product.
      • AlexeyBrin 19 hours ago
        You are reading too much into my comment. My point was that the test (a Space Invaders clone) used to assess the model has been irrelevant for some time now. I could have gotten a similar result with Mistral Small a few months ago.
      • MyOutfitIsVague 20 hours ago
        I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).

        It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.

        Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.

        For a heavily explored space, it's like being impressed that your 2.5 year old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.

        • NitpickLawyer 20 hours ago
          > because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated

          I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with gpt3. The models have advanced from that, and can now "generalise" over the context in ways they could not do ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though they never saw that particular scripting language anywhere in their training set. Try it. You'll be surprised.

      • stolencode 12 hours ago
        It's amazing that none of you even try to falsify your claims anymore. You can literally just put some of the code in a search engine and find the prior art example:

        https://www.web-leb.com/en/code/2108

        Your "AI tools" are just "copyright whitewashing machines."

        These kinds of comments are really ignoring reality.

      • jayd16 20 hours ago
        I think you're missing the point.

        Showing off moderately complicated results that are actually not indicative of performance because they are sniped by the training data turns this from a cool demo to a parlor trick.

        Stating that, aha, joke's on you, that's the status quo, is an even bigger indictment.

      • Aurornis 20 hours ago
        > These kinds of comments are really missing the point.

        I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.

        Asking them to produce novel output that doesn’t match the training set produces very different results.

        When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.

        It shows the reality of working with LLMs and it’s an important consideration.

    • phkahler 19 hours ago
      I find the visual similarity to Breakout kind of interesting.
    • elif 20 hours ago
      Most likely this comment's training data included countless similar comments, likely all synthetic without any actual tether to real analysis.
    • Conflonto 20 hours ago
      That sounds so dismissive.

      Until now I was not able to just download an 8-16GB file that could generate A LOT of different tools, games, etc. for me in multiple programming languages, while in parallel ELI5-ing research papers to me, generating SVGs, and a lot lot lot more.

      But hey.

  • xianshou 16 hours ago
    I initially read the title as "My 2.5 year old can write Space Invaders in JavaScript now (GLM-4.5 Air)."

    Though I suppose, given a few years, that may also be true!

  • lxgr 18 hours ago
    This raises an interesting question I’ve seen occasionally addressed in science fiction before:

    Could today’s consumer hardware run a future superintelligence (or, as a weaker hypothesis, at least contain some lower-level agent that can bootstrap something on other hardware via networking or hyperpersuasion) if the binary dropped out of a wormhole?

    • bob1029 17 hours ago
      This is the premise of all of the ML research I've been into. The only difference is to replace the wormhole with linear genetic programming, neuroevolution, et al. The size of programs in the demoscene is what originally sent me down this path.

      The biggest question I keep asking myself - What is the Kolmogorov complexity of a binary image that provides the exact same capabilities as the current generation LLMs? What are the chances this could run on the machine under my desk right now?

      I know how many AAA frames per second my machine is capable of rendering. I refuse to believe the gap between running CS2 at 400fps and getting ~100b/s of UTF8 text out of an NLP black box is this big.

      • bgirard 16 hours ago
        > ~100b/s of UTF8 text out of an NLP black box is this big

        That's not a good measure. NP problem solutions are only a single bit, but they are much harder to solve than CS2 frames for large N. If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.

        • bob1029 15 hours ago
          > If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.

          Exactly. This is what compels me to try.

    • switchbak 17 hours ago
      This is what I find fascinating. What hidden capabilities exist, and how far could it be exploited? Especially on exotic or novel hardware.

      I think much of our progress is limited by the capacity of the human brain, and we mostly proceed via abstraction which allows people to focus on narrow slices. That abstraction has a cost, sometimes a high one, and it’s interesting to think about what the full potential could be without those limitations.

      • lxgr 13 hours ago
        Abstraction, or efficient modeling of a given system, is probably a feature, not a bug, given the strong similarity between intelligence and compression and all that.

        A concise description of the right abstractions for our universe is probably not too far removed from the weights of a superintelligence, modulo a few transformations :)

    • tw1984 1 hour ago
      Could today's seemingly "superintelligent" models run on 10-20 year old hardware? It probably would work.
  • alankarmisra 20 hours ago
    I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.

    That said, for something like this, I’d probably get more out of simply finding an existing implementation on github or the like and downloading that.

    When it comes to specialized and narrow domains like Space Invaders, the training set is likely to be extremely small and the model's vector space will have limited room to generalize. You'll get code that is more or less identical to the original source, you have to wait for the model to 'type' it out, and the value add seems very low. I would rather ask it to point me to known Space Invaders implementations in language X on github (or search there myself).

    Note that ChatGPT gets very nervous if I paste this comment in for a grammar cleanup. It wants very badly for me to stress that LLMs don't memorize and that overfitting is very unlikely (I believe neither).

    • tossandthrow 20 hours ago
      Interesting, I cannot reproduce these warnings in ChatGPT - though this is something that really interests me, as it represents immense political power to be able to interject such warnings (explicitly, or implicitly by slight reformulations).
    • dr-detroit 19 hours ago
      [dead]
    • aaron695 20 hours ago
      [dead]
  • simonw 16 hours ago
    There's a new model from Qwen today - Qwen3-30B-A3B-Instruct-2507 - that also runs comfortably on my Mac (using about 30GB of RAM with an 8bit quantization).

    I tried the "Write an HTML and JavaScript page implementing space invaders" prompt against it and didn't quite get a working game with a single shot, but it was still an interesting result: https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct...

    • pyman 3 hours ago
      I was talking about the new open models with a group of people yesterday, and saying how good they're getting. The big question is:

      Can any company now compete with the big players? Or even more interesting, like you showed in your research, are proprietary models becoming less relevant now that anyone can run these models locally?

      This trend of better open models that run locally is really picking up. Do you think we'll get to a point where we won't need to buy AI tokens anymore?

  • pulkitsh1234 21 hours ago
    Is there any website to see the minimum/recommended hardware required for running local LLMs? Much like 'system requirements' mentioned for games.
    • svachalek 18 hours ago
      In addition to the tools other people responded with, a good rule of thumb is that most local models work best* at q4 quants, meaning the memory for the model is a little over half the number of parameters in GB, e.g. a 14b model may be 8gb. Add some more for context and maybe you want 10gb VRAM for a 14b model. That will at least put you in the right ballpark for what models to consider for your hardware.

      (*best performance/size ratio, generally if the model easily fits at q4 you're better off going to a higher parameter count than going for a larger quant, and vice versa)
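
      If you want that rule of thumb as code, it's just this (a rough sketch; the ~4.5 bits/weight for a q4 quant with overhead and the 2GB context allowance are my own ballpark numbers):

        // Rough sizing rule, not an exact formula: a q4 quant works out
        // to ~4.5 bits per weight once quantization overhead is included.
        function estimateGB(paramsBillions, bitsPerWeight = 4.5, contextGB = 2) {
          const weightsGB = paramsBillions * bitsPerWeight / 8;
          return weightsGB + contextGB; // weights + context/KV cache headroom
        }

        estimateGB(14); // ~9.9 -> ~8GB of weights, budget ~10GB of VRAM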

      • nottorp 17 hours ago
        > maybe you want 10gb VRAM for a 14b model

        ... or if you have Apple hardware with their unified memory, whatever the assholes soldered in is your limit.

    • CharlesW 19 hours ago
      > Is there any website to see the minimum/recommended hardware required for running local LLMs?

      LM Studio (not exclusively, I'm sure) makes it a no-brainer to pick models that'll work on your hardware.

    • qingcharles 19 hours ago
      This can be a useful resource too:

      https://www.reddit.com/r/LocalLLaMA/

    • GaggiX 21 hours ago
      https://apxml.com/tools/vram-calculator

      This one is very good in my opinion.

      • jxf 20 hours ago
        Don't think it has the GLM series on there yet.
    • knowaveragejoe 20 hours ago
      If you have a HuggingFace account, you can specify the hardware you have and it will show on any given model's page what you can run.
  • dust42 13 hours ago
    I tried with Claude Sonnet 4 and it does *not* work. So it looks like GLM-4.5 Air in 3bit quant is ahead.

    Chat is here: https://claude.ai/share/dc9eccbf-b34a-4e2b-af86-ec2dd83687ea

    Claude Opus 4 does work but is far behind Simon's GLM-4.5: https://claude.ai/share/5ddc0e94-3429-4c35-ad3f-2c9a2499fb5d

  • petercooper 19 hours ago
    I ran the same experiment on the full size model. It used a custom 80s style font (from Google Fonts) and gave 'eyes' and more differences to the enemies but otherwise had a similar vibe to Simon's. An interesting visual demonstration of what quantization does though! Screenshot: https://peterc.org/img/aliens.png
  • ddtaylor 19 hours ago
    My brain is running legacy COBOL and first read this as

    > My 2.5 year old with their laptop can write Space Invaders

    For a few hundred milliseconds there I was thinking "these damn kids are getting good with tablets"

    • Imustaskforhelp 18 hours ago
      Don't worry, I guess my brain is running bleeding edge TypeScript with React (I am in high school for context) and at first I also read it that way...

      I am without my glasses, but even with Hacker News zoomed to 250% I still misread it, so I think I am a little cooked lol.

      • OldfieldFund 18 hours ago
        We are all cooked at this point :)
  • stpedgwdgfhgdd 20 hours ago
    Setting aside that Space Invaders from scratch is not representative of real engineering, it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine (no usage tier per hour or week), let’s say, one year from now. At $200 per month for 2 years I can buy a decent Mx with 64GB (or perhaps even 128GB taking residual value into account)
    • falcor84 20 hours ago
      How come it's "not representative of real engineering"? Other than copy-pasting existing code (which is not what an LLM does), I don't see how you can create a space invaders game without applying "engineering".
      • hbn 19 hours ago
        The prompt was

        > Write an HTML and JavaScript page implementing space invaders

        It may not be "copy pasting", but it's generating output that reconstructs, as best it can, what it learned from looking at Space Invaders source code during training.

        The engineers at Taito that originally developed Space Invaders were not told "make Space Invaders" and then did their best to recall all the source code they've looked at in their life to re-type the source code to an existing game. From a logistics standpoint, where the source code already exists and is accessible, you may as well have copy-pasted it and fudged a few things around.

        • simonw 19 hours ago
          The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.

          I used that prompt because it's the shortest possible prompt that tells the model to build a game with a specific set of features. If I wanted to build a custom game I would have had to write a prompt that was many paragraphs longer than that.

          The aim of this piece isn't "OMG look, LLMs can build space invaders" - at this point that shouldn't be a surprise to anyone. What's interesting is that my laptop can run a model that is capable of that now.

          • sarchertech 18 hours ago
            > The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.

            Sure, but that doesn’t impact the OP's point at all, because there are numerous copies of reverse engineered source code available.

            There are numerous copies of the reverse engineered source code, already translated to JavaScript, in your model's training set.

          • hbn 14 hours ago
            The discussion I replied to was just regarding whether or not what the LLM did should be considered "engineering"

            It doesn't really matter whether or not the original code was published. In fact that original source code on its own probably wouldn't be that useful, since I imagine it wouldn't have tipped the weights enough to be "recallable" from the model, not to mention it was tasked with implementing it in web technologies.

          • nottorp 17 hours ago
            > What's interesting is that my laptop can run a model that is capable of that now.

            I'm afraid no one cared much about your point :)

            You'll only get "OMG look how good LLMs are they'll get us all fired!" comments and "LLMs suck" comments.

            This is how it goes with religion...

      • sharkjacobs 17 hours ago
        Making a space invaders game is not representative of normal engineering because you're reproducing an existing game with well known specs and requirements. There are probably hundreds of thousands of words describing and discussing Space Invaders in GLM-4.5's training data

        It's like using an LLM to implement a red black tree. Red black trees are in the training data, so you don't need to explain or describe what you mean beyond naming it.

        "Real engineering" with LLMs usually requires a bunch of up front work creating specs and outlines and unit tests. "Context engineering"

        • jasonvorhe 17 hours ago
          Smells like moving the goal post. What's real engineering going to be in 2028? Implementing Google's infra stack in your homelab?
      • phkahler 19 hours ago
        >> Other than copy-pasting existing code (which is not what an LLM does)

        I'd like to see someone try to prove this. How many space invaders projects exist on the internet? It'd be hard to compare model "generated" code to everything out there looking for plagiarism, but I bet there are lots of snippets pulled in. These things are NOT smart, they are huge and articulate information repositories.

        • simonw 19 hours ago
          Go for it. https://www.google.com/search?client=firefox-b-1-d&q=github+... has a bunch of results. Here's the source code GLM-4.5 Air spat out for me on my laptop: https://github.com/simonw/tools/blob/main/space-invaders-GLM...

          Based on my mental model of how these things work I'll be genuinely surprised if you can find even a few lines of code duplicated from one of those projects into the code that GLM-4.5 wrote for me.

          • phkahler 19 hours ago
            So I scanned the beginning of the generated code, picked line 83:

              animation: glow 2s ease-in-out infinite;

            stuffed it verbatim into google and found a stack overflow discussion that contained this:

                  animation: glow .5s infinite alternate;

            in under one minute. Then I found this page of CSS effects:

            https://alvarotrigo.com/blog/animated-backgrounds-css/

            Another page has examples and contains:

              animation: float 15s infinite ease-in-out;

            There is just too much internet to scan all of it for an exact match, or for a larger matching block.
            • simonw 19 hours ago
              That's not an example of copying from an existing Space Invaders implementation. That's an LLM using a CSS animation pattern - one that it's seen thousands (probably millions) of times in the training data.

              That's what I expect these things to do: they break down Space Invaders into the components they need to build, then mix and match thousands of different coding patterns (like "animation: glow 2s ease-in-out infinite;") to implement different aspects of that game.

              You can see that in the "reasoning" trace here: https://gist.github.com/simonw/9f515c8e32fb791549aeb88304550... - "I'll use a modern design with smooth animations, particle effects, and a retro-futuristic aesthetic."

              • threeducks 18 hours ago
                I think LLMs are adapting higher level concepts. For example, the following JavaScript code generated by GLM (https://github.com/simonw/tools/blob/9e04fd9895fae1aa9ac78b8...) is clearly inspired by this C++ code (https://github.com/portapack-mayhem/mayhem-firmware/blob/28e...), but it is not an exact copy.
                • simonw 18 hours ago
                  This is a really good spot.

                  That code certainly looks similar, but I have trouble imagining how else you would implement very basic collision detection between a projectile and a player object in a game of this nature.

                  • threeducks 17 hours ago
                    A human would likely have refactored the two collision checks between bullet/enemy and enemyBullet/player in the JavaScript code into a shared function, perhaps something like "areRectanglesOverlapping" (sketched after the list below). The C++ code only does one collision check like that, so there was nothing to refactor there, but as a human, I certainly would not want to write that twice.

                    More importantly, it is not just the collision check that is similar. Almost the entire sequence of operations is identical on a higher level:

                        1. enemyBullet/player collision check
                        2. same comment "// Player hit!" (this is how I found the code)
                        3. remove enemy bullet from array
                        4. decrement lives
                        5. update lives UI
                        6. (createParticle only exists in JS code)
                        7. if lives are <= 0, gameOver
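
                    For illustration, the helper I have in mind is a standard AABB overlap test, something like this (hypothetical - neither codebase actually defines it):

                        function areRectanglesOverlapping(a, b) {
                          // Two axis-aligned rectangles overlap iff they
                          // overlap on both the x axis and the y axis
                          return a.x < b.x + b.width &&
                                 a.x + a.width > b.x &&
                                 a.y < b.y + b.height &&
                                 a.y + a.height > b.y;
                        }

                        // Both checks would then collapse to:
                        //   if (areRectanglesOverlapping(bullet, enemy)) { ... }
                        //   if (areRectanglesOverlapping(enemyBullet, player)) { ... }
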
            • ben_w 19 hours ago
              So, your example of it copying snippets is… using the same API with fairly different parameters in a different order?
            • falcor84 19 hours ago
              The parent said

              > find even a few lines of code duplicated from one of those projects

              I'm pretty sure they meant multiple lines copied verbatim from a single project implementing space invaders, rather than individual lines copied (or likely just accidentally identical) across different unrelated projects.

            • sejje 18 hours ago
              Is this some kind of joke?

              That's how you write css. The examples aren't the same at all, they just use the same css feature.

              It feels like you aren't a coder--you've sabotaged your own point.

        • ben_w 19 hours ago
          Sorites paradox. Where's the distinction between "snippet" and "a design pattern"?

          Compressing a few petabytes into a few gigabytes means the models can't be holding verbatim copies of everything they're accused of simply copy-pasting, from code to newspaper articles to novels. There's not enough space.

    • dmortin 19 hours ago
      " it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine "

      Most people won't bother with buying powerful hardware for this, they will keep using SAAS solutions, so Anthropic can be in trouble if cheaper SAAS solutions come out.

    • qingcharles 19 hours ago
      The frontier models are always going to tempt you with their higher quality and quicker generation, IMO.
      • kasey_junk 19 hours ago
        I’ve been mentally mapping the models to the history of databases.

        In the early days you had to pay for most databases. There are still paid databases that are just better than the ones you don't pay for. Some teams think that the cost is worth the improvements, and there is a (tough) business there. Fortunes were made in the early days.

        But eventually open source databases became good enough for many use cases, and they have their own advantages. So lots of teams use them.

        I think coding models might have a similar trajectory.

        • qingcharles 19 hours ago
          You make a good point -- a majority of applications are now using open source or free versions[1] of DBs.

          My only feedback is: are these the same animal? Can we compare an O/S DB vs. paid/closed DB to me running an LLM locally? The biggest issue right now with LLMs is simply the cost of the hardware to run one locally, not the quality of the actual software (the model).

          [1] e.g. SQL Server Express is good enough for a lot of tasks, and I guess would be roughly equivalent to the upcoming open versions of GPT vs. the frontier version.

          • qcnguy 18 hours ago
            A majority of apps nowadays are using proprietary forks of open source DBs running in the cloud, where their feature set is (slightly) rounded out and smoothed off by the cloud vendors.

            Not that many projects are doing fully self-hosted RDBMS at this point. So ultimately proprietary databases still win out; they just (ab)use the PostgreSQL trademark to make people think they're using open source.

            LLMs might go the same way. The big clouds offering proprietary fine tunes of models given away by AI labs using investor money?

            • qingcharles 17 hours ago
              That's definitely true. I could see more of the "running open source models on other people's hardware" approach.

              I dislike running local LLMs right now because I find the software still kinda janky: you often have to tweak settings and find the right model files, which basically means holding a bunch of domain knowledge I don't have space for in my head. On top of maintaining a high-spec piece of hardware and paying for the power costs.

      • zarzavat 16 hours ago
        Closed doesn't always win over open. People said the same thing about Windows vs Linux, but even Microsoft was forced to admit defeat and support Linux.

        All it takes is some large companies commoditizing their complements. For Linux it was Google, etc. For AI it's Meta and China.

        The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.

        • airspresso 1 hour ago
          > The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.

          Disagree. Anthropic have a unique approach to how they post-train their models and tune it to be the way they want it. No other lab has managed to reproduce the style and personality of Claude yet, which is currently a key reason why coders prefer it. And since post-training data is secret, it'll take other providers a lot of focused effort to get close to that.

    • rafaelmn 19 hours ago
      What about power usage and supporting hardware? Also, the card going down means you are down until you get warranty service.
      • skeezyboy 19 hours ago
        why are you doing anything locally then?
        • rafaelmn 3 hours ago
          Latency and tooling support? On UX, cloud based LLMs beat local handily - not so much for dev tooling.

          I tried using remote workstations - I am not a fan of lugging a beefy client machine to do my work - I would much rather use something that's super light and power efficient.

    • tptacek 19 hours ago
      OK, go write Space Invaders by hand.
      • LandR 18 hours ago
        I'd hope most professional software engineers could do this in an afternoon or so?
        • Mashimo 3 hours ago
          Depends on the rules. Can I look up other space invaders games on github first? Can I use a game framework?

          With just the JS/HTML docs I probably could not.

        • sejje 18 hours ago
          Most professional software engineers have never written a game and don't do web work, so I somehow doubt that.
          • anthk 17 hours ago
            With Tcl/Tk it's a matter of less than 2 hours.
  • indigodaddy 19 hours ago
    Did pretty well with a boggle clone. I like that it tries to do a single html file (I didn't ask for that but was pleasantly surprised). It didn't include dictionary validation so needed a couple of prompts. Touch selection on mobile isn't the greatest but I've seen plenty worse

    https://chat.z.ai/space/z0gcn6qtu8s1-art

    https://chat.z.ai/s/74fe4ddc-f528-4d21-9405-0a8b15a96520

    • Keyframe 18 hours ago
      I went the other route with a tetris clone the other day. It's definitely not a single prompt. It took me a solid 15 hours to get to this stage, and most of that was me thinking.. BUT, except for one small trivial thing (the space invader logo in a pre tag) I haven't touched the code - just looked at it. I made it mandatory for myself to see if I could first greenfield myself into this project and then brownfield features and fixes.. It's definitely a ton of work on my end, but it's also not something I'd otherwise have been able to do in ~2 working days or less. As a cherry on top, even though it's still not done yet, I put in AI-generated music singing about the project itself. https://www.susmel.com/stacky/

      Definitely a ton of things I learned about how to "develop" "with" AI along the way.

    • JKCalhoun 19 hours ago
      Cool — if only diagonals were easier. ;-) (Hopefully I'm being constructive here.)
  • efitz 20 hours ago
    I missed the word “laptop” in the title at first glance and thought this was a “I taught my toddler to code” article.
    • juliangoetze 19 hours ago
      I thought I was the only one.
    • below43 11 hours ago
      Same here. Pretty impressive LLM.
  • GardenLetter27 2 hours ago
    Crazy how Apple is still the only option for this consumer hardware.
    • airspresso 2 hours ago
      The Framework desktop with AMD Strix Halo [1] is getting there as a viable alternative, offering up to 96 GB of unified RAM at the moment, so there's still a gap up to the beefiest Mac Studio configurations though.

      [1]: https://frame.work/desktop

  • maksimur 18 hours ago
    A $xxxx 2.5 year old laptop, one that's probably much more powerful than an average laptop bought today and probably next year as well. I don't think it's a fair reference point.
    • bprew 18 hours ago
      His point isn't that you can run a model on an average laptop, but that the same laptop can still run frontier models.

      It speaks to the advancements in models that aren't just throwing more compute/ram at it.

      Also, his laptop isn't that fancy.

      > It claims to be small enough to run on consumer hardware. I just ran the 7B and 13B models on my 64GB M2 MacBook Pro!

      From: https://simonwillison.net/2023/Mar/11/llama/

    • parsimo2010 18 hours ago
      The article is pretty good overall, but the title did irk me a little. I assumed when reading "2.5 year old" that it was fairly low-spec only to find out it was an M2 Macbook Pro with 64 GB of unified memory, so it can run models bigger than what an Nvidia 5090 can handle.

      I suppose that it could be intended to be read as "my laptop is only 2.5 years old, and therefore fairly modern/powerful" but I doubt that was the intention.

      • simonw 18 hours ago
        The reason I emphasize the laptop's age is that it is the same laptop I have been using ever since the first LLaMA release.

        This makes it a great way to illustrate how much better the models have got without requiring new hardware to unlock those improved abilities.

    • nh43215rgb 15 hours ago
      About $3700 laptop...
  • jauntywundrkind 8 hours ago
    MLX does have decent/good software support among ML stacks. Targeting both iOS and mac is a big win in itself.

    I wonder what's possible, and what the software situation is today with the PC NPUs. AMD's XDNA has been around for a while; XDNA2 jumps from 10 to 40 TOPS. AMD iGPUs can access huge amounts of memory: is it similar here? The "amdxdna" driver merged in kernel 6.14 last winter: where are we now?

    But not seeing any evidence that there's popular support in any of the main frameworks. https://github.com/ggml-org/llama.cpp/issues/1499 https://github.com/ollama/ollama/issues/5186

    Good news: AMD has an initial implementation for llama.cpp. I don't particularly know what it means, but the first gen supports W4ABF16 quantization, and newer chips support W8A16. https://github.com/ggml-org/llama.cpp/issues/14377 . I'm not sure what it's good for, but there is a Linux "xdna-driver", https://github.com/amd/xdna-driver . IREE has an experimental backend: https://github.com/nod-ai/iree-amd-aie

    There's a lot of other folks also starting on their NPU journeys. ARM's Ethos, and Rockchip's RKNN recently shipped Linux kernel drivers, but it feels like that's just a start? https://www.phoronix.com/news/Arm-Ethos-NPU-Accel-Driver https://www.phoronix.com/news/Rockchip-NPU-Driver-RKNN-2025

  • joelthelion 20 hours ago
    Apart from using a Mac, what can you use for inference with reasonable performance? Is a Mac the only realistic option at the moment?
    • reilly3000 19 hours ago
      The top 3 approaches I see a lot on r/localllama are:

      1. 2-4x 3090+ Nvidia cards. Some are getting Chinese 48GB cards. There is a VRAM ceiling that prevents the biggest models from loading, but most setups can run most quants at great speeds

      2. Epyc servers running CPU inference with lots of RAM and as much memory bandwidth as is available. With these setups people are getting like 5-10 t/s but are able to run 450B parameter models.

      3. High RAM Macs with as much memory bandwidth as possible. They are the best balanced approach and surprisingly reasonable relative to other options.

    • badsectoracula 16 hours ago
    An Nvidia GPU is the most common answer, but personally i've done all my LLM use locally using mainly Mistral Small 3.1/3.2-based models and llama.cpp with an AMD RX 7900 XTX GPU. It only gives you ~4.71 tokens per second, but that is fast enough for a lot of uses. For example last month or so i wrote a raytracer[0][1] in C with Devstral Small 1.0 (based on Mistral Small 3.1). It wasn't "vibe coding" as much as a "co-op" where i'd go back and forth with a chat interface (koboldcpp): i'd, e.g., ask the LLM to implement some feature, then switch to the editor and start writing code using that feature while the LLM was still generating it in the background. Or, more often, i'd fix bugs in the LLM's code :-P

    FWIW, GPU aside, my PC isn't particularly new - it is a 5-6 year old PC that was originally the cheapest money could buy, became "decent" when i upgraded it ~5 years ago, and only got the GPU around Christmas as prices were dropping since AMD was about to release its new GPUs.

      [0] https://i.imgur.com/FevOm0o.png

      [1] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...

    • AlexeyBrin 20 hours ago
      A gaming PC with an NVIDIA 4090/5090 will be more than adequate for running local models.

      Where a Mac may beat the above is on the memory side, if a model requires more than 24/32 GB of GPU memory you are usually better off with a Mac with 64/128 GB of RAM. On a Mac the memory is shared between CPU and GPU, so the GPU can load larger models.

    • whimsicalism 19 hours ago
      you are almost certainly better off renting GPUs, but i understand self-hosting is an HN touchstone
      • mrinterweb 18 hours ago
        I don't know about that. I've had my RTX 4090 for nearly 3 years now. Suppose I had a script that provisioned and deprovisioned a rented 4090 at $0.70/hr for an 8 hour work day, 20 work days per month, with 2 paid weeks off per year + normal holidays, over 3 years:

        0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662

        I bought my RTX 4090 for about $2200. I also had the pleasure of being able to use it for gaming when I wasn't working. To be fair, the VRAM requirements for local models keep climbing and my 4090 isn't able to run many of the latest LLMs. Also, I omitted the cost of electricity from my local LLM server cost; I have not been measuring the total watts consumed by just that machine.

        One nice thing about renting is that it give you flexibility in terms of what you want to try.

        If you're really looking for the best deals, look at 3rd party hosts serving open models with API-based pricing; or honestly, a Claude subscription can easily be worth it if you use LLMs a fair bit.

        • whimsicalism 18 hours ago
          1. I agree - there are absolutely scenarios in which it can make sense to buy a GPU and run it yourself. If you are managing a software firm with multiple employees, you very well might break even in less than a few years. But I would wager this is not the case for 90%+ of people self-hosting these models, unless they have some other good reason (like gaming) to buy a GPU.

          2. I basically agree with your caveats - excluding electricity is a pretty big exclusion, and I don't think you've had 3 years of really high-value self-hostable models; I would really only say the last year, and I'm somewhat skeptical of how good the ones that fit in 24GB of VRAM are. 4x4090 is a different story.

      • qingcharles 19 hours ago
        This. Especially if you just want to try a bunch of different things out. Renting is insanely cheap -- to the point where I don't understand how the hosts are making their money back unless they stole the hardware and power.

        It can really help you figure a ton of things out before you blow the cash on your own hardware.

        • 4b11b4 18 hours ago
          Recommended sites to rent from?
          • whimsicalism 18 hours ago
            runpod, vast, hyperbolic, prime intellect. if all you're doing is going to be running LLMs, you can pay per token on openrouter or some of the providers listed there
          • doormatt 18 hours ago
            runpod.io
    • regularfry 19 hours ago
      This one should just about fit on a box with an RTX 4090 and 64GB RAM (which is what I've got) at q4. Don't know what the performance will be yet. I'm hoping for an unsloth dynamic quant to get the most out of it.
      • weberer 18 hours ago
        What's important is VRAM, not system RAM. The 4090 has 16gb of VRAM so you'll be limited to smaller models at decent speeds. Of course, you can run models from system memory, but your tokens/second will be orders of magnitude slower. ARM Macs are the exception since they have unified memory, allowing high bandwidth between the GPU and the system's RAM.
        • regularfry 3 hours ago
          Yes and no. The 4090 has 24GB, not 16; but with a big MoE you're not getting everything in there anyway. In that case you really want all the weights in RAM so that swapping experts in isn't a load from disk.

          It's not as good as unified RAM, but it's also workable.

        • throwaway0123_5 11 hours ago
          iirc 4090s have 24GB
    • thenaturalist 19 hours ago
    This guy [0] does a ton of in-depth HW comparison/benchmarking, including against Mac mini clusters and an M3 Ultra.

      0: https://www.youtube.com/@AZisk

  • h-bradio 19 hours ago
    Thanks so much for this! I updated LM Studio, and it picked up the mlx-lm update required. After a small tweak to tool-calling in the prompt, it works great with Zed!
    • torarnv 17 hours ago
      Could you describe the tweak you did, and possibly the general setup you have with zed working with LM Studio? Do you use a custom system prompt? What context size do you use? Temperature? Thanks!
  • pmarreck 10 hours ago
    I have an M4 Mac with 128GB RAM and I'm currently downloading GLM-4.5-Air-q5-hi-mlx via LM Studio (80GB) and will report back!
    • rexreed 8 hours ago
      How is it going? Intrigued enough to possibly get an M4 Mac with 128GB RAM if it's worthwhile...
      • ls-a 7 hours ago
        Apple is going to make so much money if they keep pushing on-device LLMs. It makes absolute sense to sell more macbook pros
  • xkcd1963 1 hour ago
    Standalone mini projects like that are also a good way to train students. But I believe LLMs are still a long way from being able to solve problems that require combining different environments, software, circumstances, projects, ...
  • aplzr 19 hours ago
    I really like talking to Claude (free tier) instead of using a search engine when I'm stumbling upon a random topic that interests me. For example, this morning I had it explain the differences between pass by value, pass by reference, and pass by sharing, the last of which I wasn't aware of until then.

    Is this kind of thing also possible with one of these self-hosted models in a comparable way, or are they mostly good for coding?

  • andai 17 hours ago
    I got almost the same result with a 4B model (Qwen3-4B), about 20x smaller than OP's ~200B model.

    https://jsbin.com/lejunenezu/edit?html,output

    Its pelican was a total fail though.

    • andai 16 hours ago
      Update: It failed to make Flappy Bird though (several attempts).

      This surprises me, I thought it would be simpler than Space Invaders.

  • Aurornis 19 hours ago
    This is very cool. The blog author had to run it from the main branch of the mlx-lm library with a custom script. Can someone up to date on the local LLM tools let us know which mainstream tools we should be watching for an easier way to run this on MLX? The space moves so fast that it's hard to keep up.
    • simonw 19 hours ago
      I expect LM Studio will have this pretty soon - I imagine they are waiting on the next stable release of mlx-lm which will include the change I needed to get this to work.
  • ygritte 5 hours ago
    Can you host that model locally with ollama?
    • simonw 4 hours ago
      I haven't seen a GGUF for it yet, I imagine one will show up on Hugging Face soon which will probably work with Ollama.
      • pyman 3 hours ago
        Do you think local LLMs combined with P2P networks could become a thing? Imagine people adding datasets to an open model, the same way they add blocks to a blockchain, which is around 500GB in size.

        It could help decentralise power and reduce our dependency on the big players.

  • slimebot80 12 hours ago
    (novice question)

    64gb is pure RAM? I thought Apple Silicon was efficient at paging SSD as memory storage - how important is RAM if you've got a fast SSD?

    • ethan_smith 8 hours ago
      While Apple Silicon's memory compression and SSD swapping are efficient, RAM access is still ~100x faster than SSD, so sufficient physical RAM remains crucial for memory-intensive workloads like running large LLMs.
    • nicce 11 hours ago
      Memory speed is the most important factor with LLMs and SSD is very slow when compared to RAM.
  • lherron 19 hours ago
    With the Anthropic rug pull on quotas for Max, I feel the short-mid term value sweet spot will be a Frankensteined together “Claude as orchestrator/coder, falling back to local models as quota limits approach” tool suite.
    • 4b11b4 19 hours ago
      Was thinking this one might backfire on Anthropic in the end...

      People are going to explore and get comfortable with alternatives.

      There may have been other ways to deal with the cases they were worried about.

  • matt3210 10 hours ago
    Is this more than ‘import space invaders; run_space_invaders()’?
  • skeezyboy 18 hours ago
    But aren't we still decades away from running our own video-creating AIs locally? Have we plateaued with this current generation of techniques?
    • svachalek 18 hours ago
      It's more a question of, how long do you want it to take to create a video locally?
      • skeezyboy 18 hours ago
        nah, i definitely want to know what i asked
        • sejje 17 hours ago
          His answer implies you can run them locally now, just not in a useful timeframe.
  • accrual 19 hours ago
    Very impressive model! The SVG pelican designed by GLM 4.5 in Simon's adjacent article is the most accurate I've seen yet.
    • 4b11b4 19 hours ago
      Quick, someone knit a quilt with all the different SVG pelicans
  • neutronicus 20 hours ago
    If I understand correctly, the author is managing to run this model on a laptop with 64GB of RAM?

    So a home workstation with 64GB+ of RAM could get similar results?

    • simonw 20 hours ago
      Only if that RAM is available to a GPU, or you're willing to tolerate extremely slow responses.

      The neat thing about Apple Silicon is the system RAM is available to the GPU. On most other systems you would need ~48GB of VRAM.

      • sagarm 5 hours ago
        LLM evaluation on GPU and CPU is memory bandwidth constrained. The highest-end Apple machines are good for this because they have high memory bandwidth (~500GB/s) and up to ~128GB of it, not just because they can share that memory with the GPU (which any iGPU does). Most consumer machines are limited to 2x DDR5 channels (~50GB/s).
      • xrd 19 hours ago
        Aren't there non-macOS laptops which also support sharing VRAM and regular RAM, i.e. have an iGPU?

        https://www.reddit.com/r/GamingLaptops/comments/1akj5aw/what...

        I personally want to run linux and feel like I'll get a better price/GB offering that way. But, it is confusing to know how local models will actually work on those and the drawbacks of iGPU.

        • mft_ 16 hours ago
          iGPUs are typically weak, and/or aren't capable of running the LLM so the CPU is used instead. You can run things this way, but it's not fast, and it gets slower as the models go up in size.

          If you want things to run quickly, then aside from Macs, there's the 2025 ASUS Flow Z13 which (afaik) is the only laptop with AMD's new Ryzen Max+ 395 processor. This is powerful and has up to 128GB of RAM that can be shared with the GPU, but they're very rare (and Mac-expensive) at the moment.

          The other variable for running LLMs quickly is memory bandwidth; the Max+ 395 has 256GB/s, which is similar to the M4 Pro; the M4 Max chips are considerably higher. Apple fell on their feet on this one.

    • NitpickLawyer 20 hours ago
      > So a home workstation with 64GB+ of RAM could get similar results?

      Similar in quality, but CPU generation will be slower than what macs can do.

      What you can do with MoEs (GLMs and Qwens) is run some experts (usually the shared ones) on a GPU (even a 12GB/16GB card will do) and the rest from RAM on the CPU. That will speed things up considerably (especially prompt processing). If you're interested in this, look up llama.cpp and especially ik_llama, which is a fork dedicated to this kind of selective offloading of experts.

    • simlevesque 20 hours ago
      Not so sure. The MBP uses unified memory; the RAM is shared between the CPU and GPU.

      Your 64gb workstation doesn't share the ram with your gpu.

    • 0x457 15 hours ago
      You can run it, it will just run on the CPU and be pretty slow. Macs, like everyone in this thread said, use unified memory, so the 64GB is shared between CPU and GPU, while for you it's just 64GB for the CPU.
    • lynndotpy 20 hours ago
      The laptop has "unified RAM", so that's like 64GB of VRAM.
  • another_one_112 17 hours ago
    Crazy to think that you can have a mostly-competent oracle even when disconnected from the grid.
  • msikora 17 hours ago
    With a 48GB MacBook Pro M3 I'm probably out of luck, right?
  • __mharrison__ 20 hours ago
    Time to get a new laptop. My MBP only has 16 gigs.

    Looking forward to trying this with Aider.

  • joshstrange 20 hours ago
    My next MBP is going to need the next size up SSD (RIP bank account) so it can hold all the models I want to play with locally and my data. Thankfully I already have been maxing out the RAM so that isn't something new I also have to do.
  • asadm 18 hours ago
    How good is this model at tool calling?
  • bradly 21 hours ago
    I appreciate you sharing both the chat log and the full source code. I would be interested to see a followup post on how adding a moderately-sized feature like a high score table goes.

    Also, IANAL, but Space Invaders is owned IP. I have no idea about the legality of a blog post describing steps to create and release an existing game, but I've seen headlines on HN of engineers in trouble for things I would not expect to be problematic. Maybe Space Invaders is in q-tip/band-aid territory at this point, but if this was Zelda instead of Space Invaders, I could see things being more dicey.

    • sowbug 19 hours ago
      It doesn't infringe any kind of intellectual property.

      This isn't copyright infringement; it isn't based on the original assembly code or artwork. A game concept can't be copyrighted. Even if one of SI's game mechanics were patented, it would have long expired. Trade secret doesn't apply in this situation.

      That leaves trademark. No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.

      • 9rx 17 hours ago
        > No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.

        There may be no reasonable confusion, but trademark holders also have to protect against dilution of their brand, if they want to retain their trademark. With use like this, people might come to think of Space Invaders as a generic term for all games of this type, not the brand of a specific game.

        (there is a strong case to be made that they already do, granted)

    • Joker_vD 20 hours ago
      > Space Invaders is owned IP

      So is Tetris. And I believe that Snake is also an owned IP although I could be wrong on this one.

  • lifestyleguru 19 hours ago
    > my 2.5 year old laptop (a 64GB MacBook Pro M2)

    My MacBook has 16GB of RAM and is from a period when everyone was fiercely insisting that the 8GB base model was all I'd ever need.

    • tracker1 19 hours ago
      I'm kind of with you... while I've run 128GB on my desktop, and am currently at 96GB, DDR5 being what it is, it's far less common for typical laptops. I'm a bit curious how the Ryzen 395+ with 128GB will handle some of these models. The 200GB options feel completely out of reach.
  • polynomial 18 hours ago
    At first I read this as "My 2.5 year old can write Space Invaders in JavaScript now"
  • dcchambers 19 hours ago
    Amazing. There really is no secret sauce that the frontier models have.
  • sneak 19 hours ago
    What is the SOTA for benchmarking all of the models you can run on your local machine vs a test suite?

    Surely this must exist, no? I want to generate a local leaderboard and perhaps write new test cases.

  • anthk 20 hours ago
    Writing a Z80 emulator and running the original Space Invaders ROM on it will make you more fulfilled.

    Either with SDL2+C, or even Tcl/Tk, or Python with Tkinter.

  • chickenzzzzu 21 hours ago
    "2.5 year old laptop" is potentially the most useless way of describing a 64GB M2, as it could be confused with virtually any other configuration of laptop.
    • simonw 20 hours ago
      The thing I find most notable here is that this is the same laptop I've used to run every open weights model since the original LLaMA.

      The models have got so much better without me needing to upgrade my hardware.

      • chickenzzzzu 20 hours ago
        That's great! Why can't we say that instead?

        No need to overly quantize our headlines.

        "64GB M2 makes Space Invaders-- can be bought for under $xxxx"

    • OJFord 20 hours ago
      I think the point is just that it doesn't require absolute cutting edge nor server hardware.
      • jphoward 20 hours ago
        No but 64 GB of unified memory provides almost as much GPU RAM capacity as two RTX 5090s (only less due to the unified nature) - top of the range GPUs - so it's a truly exceptional laptop in this regard.
        • turnsout 20 hours ago
          Except that it is not exceptional at all; it's an older-generation MacBook Pro with 64GB of RAM. There's nothing particularly unusual about it.
          • jphoward 19 hours ago
            64 GB of RAM which is addressable by a GPU is exceptional for a laptop - this is not just system RAM.
            • turnsout 16 hours ago
              I understand, but that is not exceptional for a Mac laptop. You could say all Apple Silicon Macs are exceptional, and I guess I agree in the context of the broader PC community. But I would not point at an individual MacBook Pro with 64 GB of RAM and say "whoa, that's exceptional." It's literally just a standard option when you buy the computer. It does bump the price pretty high, but the point of the MBP is to cater to higher-end workflows.
            • chickenzzzzu 19 hours ago
              To emphasize this point further, at least with my efforts, it is not even possible to buy a 64GB M4 Pro right now. 32GB, 64GB, and 128GB are all sold out.

              We can say that 64GB addressable by a GPU is not exceptional when compared to 128GB and it still costs less than a month's pay for a FAANG engineer, but the fact that they aren't actually purchasable right now shows that it's not as easy as driving to Best Buy and grabbing one off the shelf.

              • turnsout 16 hours ago
                They're not sold out—Apple's configurator (and chip naming) is just confusing. The MacBook Pro with M4 Pro is only available in 24 or 48 GB configurations. To get 64 or 128 GB, you need to upgrade to the M4 Max.

                If you're looking for the cheapest way into 64 of unified memory, the Mac mini is available with an M4 Pro and 64GB at $1999.

                So, truly, not "exceptional" unless you consider the price to be exorbitant (it's not, as evidenced by the long useful life of an M-series Mac).

                • chickenzzzzu 12 hours ago
                  thank you for providing that extra info! i agree that $2000-4000 is not an absolutely earth shattering price, but i still wonder what the benefit one receives is when they say "2.5 year old laptop" instead of "64GB M2 laptop"
      • tantalor 20 hours ago
        It was also something he already had lying around. Did not need to buy something new to get new functionality.
  • vFunct 20 hours ago
    please please apple give us a M5 MacBook Pro laptop with 2TB of unified memory please please
  • wslh 17 hours ago
    Here's a sci-fi twist: suppose Space Invaders and similar early games were seeded by a future intelligence. (•_•)⌐■-■
  • bgwalter 19 hours ago
    The GLM-4.5 model utterly fails at creating ASCII art or factorizing numbers. It can "write" Space Invaders because there are literally thousands of open source implementations out there.

    This is another example of LLMs being dumb copiers that do understand human prompts.

    But there is one positive side to this: if this photocopying business can be run locally, the stocks of OpenAI etc. should go to zero.

    • simonw 19 hours ago
      Why would you use an LLM to factorize numbers?
      • bgwalter 19 hours ago
      Because we are told that they can solve IMO problems. Yet they fail at basic math problems, not only at factorization but also when probed with relatively basic symbolic math that would not require the invocation of an external program.

      Also, if they fail they could say so instead of giving a hallucinated answer. First the models lie and say that a 20 digit number takes vast amounts of computing to factor. Then, if pointed to a factorization program, they pretend to execute it and lie about the output.

        There is no intelligence or flexibility apart from stealing other people's open source code.

        • simonw 18 hours ago
          That's why the IMO results were so notable: that was one of those moments where new models were demonstrated doing something that they had previously been unable to do.
          • ducktective 18 hours ago
            I can't fathom why more people aren't talking about the IMO story. Apparently the model they used is not just an LLM; some RL is involved too. If a model wins gold at the IMO, is it still merely a "statistical parrot"?
            • sejje 17 hours ago
              Stochastic parrot is the term.

              I don't think it's ever been accurate.

          • bgwalter 14 hours ago
            The results were private and the methodology was not revealed. Even Tao, who was bullish on "AI", is starting to question the process.
            • simonw 13 hours ago
              The same thing has also been achieved by a Google DeepMind team and at least one group of independent researchers using publicly available models and careful prompting tricks.
  • deadbabe 19 hours ago
    You can overtrain a neural network to write a space invaders clone. The final weights might take up less disk space than the output code.
  • amelius 20 hours ago
    Wake me up when I can apt-get install the llm.
    • Kurtz79 18 hours ago
      You can install Ollama with a script fetched via curl and then run an LLM, for a grand total of two bash commands (including the curl).
  • croes 21 hours ago
    I bet the training data included enough Space Invaders clones in JS
    • jplrssn 21 hours ago
      I also wouldn't be surprised if labs were starting to mix in a few pelican SVGs into their training data.
      • diggan 20 hours ago
        Even "accidentally" it makes sense that "SVGs of pelicans riding bikes" are now included into datasets used for training as it has spread as a wildfire on the internet, making it less useful as a simple benchmark.

        This is why I keep all my benchmarks private and don't share anything about them publicly: as soon as you write about them anywhere publicly, they stop being useful within a few months.

        • toyg 20 hours ago
          > This is why I keep all my benchmarks private

          This is also why, if I were an artist or anyone commercially relying on creative output of any kind, I wouldn't be posting anything on the internet anymore, ever. The minute you make anything public, the engines will clone it to death and turn it into a commodity.

          • debugnik 19 hours ago
            That makes it so much harder to show art to people and market yourself though.

            I considered experimenting with web DRM for art sites/portfolios, on the assumption that scrapers won't bother with the analog loophole (and dedicated art-style cloners would hopefully be disappointed by the quality), but gave up because of the limited number of devices compatible with the strongest DRM levels, and because HDCP is broken on those levels anyway. If the DRM technique caught on, it would take attackers at most a few bucks and a few hours to bypass it, and I don't think users would truly understand that upfront.

          • __mharrison__ 19 hours ago
            Somewhat defeats the purpose of being an artist, doesn't it?
            • toyg 19 hours ago
              Defeating the purpose of creating almost anything, really.

              AI is definitely breaking the whole "labor for money" architecture of our world.

            • zhengyi13 19 hours ago
              Eeeehhhh.

              Maybe the thing to do is provide public, physical exhibits of your art in search of patronage.

      • simonw 20 hours ago
        I'll believe they are doing that when one of the models draws me an SVG that actually looks like a pelican.
        • __mharrison__ 19 hours ago
          Someone needs to craft a beautiful bike ridden by a pelican, throw in some SEO, and see how long it takes a model to replicate it.

          Simon probably wouldn't be happy about killing his multi-year evaluation metric though...

          • simonw 19 hours ago
            I would be delighted.

            My pelican on a bicycle benchmark is a long con. The goal is to finally get a good SVG of a pelican riding a bicycle, and if I can trick AI labs into investing significant effort in cheating on my benchmark then fine, that gets me my pelican!

      • quantumHazer 20 hours ago
        SVG benchmarking has been a thing since GPT-4, so all the major labs are probably overfitting on some dataset of SVG images by now.
    • shermantanktop 20 hours ago
      How about an SVG of 9.11 pelicans riding bicycles and counting the three Rs in “strawberry”?
    • gchamonlive 21 hours ago
      Which would make this disappointing if it were only good at cloning Space Invaders. But if it can reproduce every clone it has ever seen, that would still be an impressive feat.

      I just think we should stop to appreciate exactly how awesome language models are. They compress and correctly reproduce a huge amount of data, with meaningful context between each token and the rest of the context window. It's still amazing, especially with smaller models like this, because even if it's reproducing a clone, you can still ask questions about it and it should perform reasonably well at explaining what the code does and how you can take it over to develop the clone further.

      • croes 18 hours ago
        But that would still be copy and paste with extra steps.

        Like all these vibe-coded to-do apps; the to-do list is one of the most common starter exercises in programming courses.

        It’s great that an AI can do that, but it could stall progress if we stay limited to existing tools and programs.

  • th0ma5 18 hours ago
    [flagged]
    • simonw 18 hours ago
      Which bit of this post did you find condescending or infantilizing?
  • jus3sixty 20 hours ago
    I recently let go of my 2.5 year old vacuum. It was just collecting dust.
    • falcor84 20 hours ago
      Thinking about it, the measure of whether a vacuum is being sufficiently used is probably that the circulation of dust within it over the last year is greater than the circulation of dust on its external boundary over that time period.
  • pamelafox 20 hours ago
    Alas, my 3-year-old Mac has only 16 GB RAM, and can barely run a browser without running out of memory. It's a work-issued Mac, and we only get upgrades every 4-5 years. I must be content with 8B-parameter models from Ollama (some of which are quite good, like llama3.1:8b).
    • dreamer7 20 hours ago
      I am able to run Gemma 3 12B on my M1 MBP 16GB. It is pretty good at logic and reasoning!
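
      (Rough math, assuming a ~4-bit quant: 12B parameters at ~0.5 bytes each is roughly 6-7 GB of weights, plus some more for the context cache, so it squeezes into 16 GB of unified memory.)
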
    • __mharrison__ 19 hours ago
      Odd. My MBP has 16 GB and I routinely have 5 browser windows open, most of them with 5-20 tabs. I'm also routinely running vi and VS Code, and editing videos with DaVinci Resolve, without issue.

      My only memory issue that I can remember is an OBS memory leak; otherwise these MBPs are incredible hardware. I wish any other company could actually deliver a comparable machine.

      • pamelafox 18 hours ago
        I was exaggerating slightly - I think it's some combo of the apps I use: Edge, Teams, Discord, VS Code, Docker. When I get the RAM popup once a week, I typically have to close a few of those, whichever is using the most memory according to Activity Monitor. I've also got very little hard drive space on my machine, about 15 GB free, so that makes it harder for me to download the larger models. I keep trying to clear space, even using CleanMyMac, but I somehow keep filling it up.
    • e1gen-v 20 hours ago
      Just download more ram!
    • GaggiX 20 hours ago
      Reasoning models like Qwen3 are even better, and they give you more options: for example, you can choose the 14B model (at the usual Q4_K_M quantization) instead of the 8B one.
      • pamelafox 20 hours ago
        Are they quantized more effectively than the non-reasoning models for some reason?
        • GaggiX 20 hours ago
          There is no difference; you can choose a 6-bit quantization if you prefer, and at that point it's essentially lossless.
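
          With Ollama that's just a matter of picking a different tag when you pull (tag names below are illustrative; check the model's page on ollama.com for the quantizations actually published):

              ollama pull qwen3:14b        # default tag, typically a 4-bit (Q4_K_M) build
              ollama pull qwen3:14b-q8_0   # a higher-precision quant, if one is published
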
  • larodi 20 hours ago
    It's probably more correct to say: my 2.5-year-old laptop can RETELL Space Invaders. I'm pretty sure it cannot write a game it has never seen, so you could even say: my old laptop can now do this fancy extraction of data from a smart probabilistic blob, where the original things are retold in new colours and forms :)
    • simonw 20 hours ago
      I know these models can build games and apps they've never seen before because I've already observed them doing exactly that time and time again.

      If you haven't seen that yourself yet I suggest firing up the free, no registration required GLM-4.5 Air on https://chat.z.ai/ and seeing if you can prove yourself wrong.

      • larodi 2 hours ago
        I’m using all the major models as daily drivers, and none of them can create anything I have not spent an excessive amount of time explaining.

        It works for me at the architectural level, but that doesn't change the fact that you're expanding on prior information, not new information.

        Not sure though why I’m getting downvoted.

    • uludag 20 hours ago
      It's unfortunate that the first ideas for things to test are exactly the things most likely to be in the training data. Hence the pelican on a bicycle was such a good test, until it went viral.
    • oceanplexian 20 hours ago
      So you're saying it works exactly the same way as humans, who copied Space Invaders from Breakout, which came out in 1976.
    • MattRix 20 hours ago
      No, that would be incorrect; nobody uses “retell” like that.

      The impressive thing about these models is their ability to write working code, not their ability to come up with unique ideas. These LLMs actually can come up with unique ideas as well, though I think it’s more exciting that they can help people execute human ideas instead.