I've seen a bunch of those prompts scattered across HN, so thought to open a thread here so we can maybe have a centralied location for this.
Share your prompt that stumps every AI model here.
I've seen a bunch of those prompts scattered across HN, so thought to open a thread here so we can maybe have a centralied location for this.
Share your prompt that stumps every AI model here.
173 comments
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
"What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"
> you really don't need to worry about teaching a human to push back on bad questions
A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.
Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.
> you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense
Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China? Has anyone seen this done?
This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.
I wouldn't even give them credit for cases where there's a lot of good training data. My go-to test is sports trivia and statistics. AI systems fail miserably at that [1], despite the wide availability of good clean data and text about it. If sports is such a blind spot for AIs, I can't help but wonder what else they're confidently wrong about.
[1] https://news.ycombinator.com/item?id=43669364
Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.
Of course, it is also not unheard of for a question to be impossible because of an error by the test writer. Which can easily be cleared up. So it is probably best not to have impossible questions, because then students will be looking for reasons to declare the question impossible.
I was reminded of your comment this morning when I asked ChatGPT how to create a path mask in Rhino Grasshopper:
Me: what is a path mask that will get 1;1;0;0;* and also anything lower (like 1;0;5;10 or 0;20;1;15} ?
ChatGpt: Short answer: No single path mask can do that. Here's why: (very long answer)
Me: are you sure I can't use greater than, less than in the masks?
ChatGpt: Yes — **I am absolutely sure:** **Grasshopper path masks do *NOT* support greater-than or less-than comparisons.** Official sources and detailed confirmation: (sources and stuff)
...so I think your priors may need to be updated, at least as far as "never". And I especially like that ChatGpt hit me with not just bold, not just italics, but bold italics on that NOT. Seems like a fairly assertive disagreement to me.
I'd rather the AI push back and ask clarifying questions, rather than spit out a valid-looking response that is not valid and could never be valid. For example.
I was going to write something up about this topic but it is surprisingly difficult. I also don't have any concrete examples jumping to mind, but really think how many questions could honestly be responded to with "it depends" - like my kid asked me how much milk should a person drink in a day. It depends: ask a vegan, a Hindu, a doctor, and a dairy farmer. Which answer is correct? The kid is really good at asking simple questions that absolutely do not have simple answers when my goal is to convey as much context and correct information as possible.
Furthermore, just because an answer appears in context more often in the training data doesn't mean it's (more) correct. Asserting it is, is fallacious.
So we get to the point, again, where creativite output is being commoditized, I guess - which explains their reasoning for your final paragraph.
I do (and I may get publicly shamed and shunned for admitting I do such a thing): figuring out how to fix parenthesis matching errors in Clojure code that it's generated.
One coding agent I've used is so bad at this that it falls back to rewriting entire functions and will not recognise that it is probably never going to fix the problem. It just keeps burning rainforest trying one stupid approach after another.
Yes, I realise that this is not a philosophical question, even though it is philosophically repugnant (and objectively so). I am being facetious and trying to work through the PTSD I acquired from the above exercise.
This thing already exists? UK, Soviet Union and USA designed them.
https://en.wikipedia.org/wiki/Flying_submarine
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."
OpenAI gpt-4o
Sonnet 3.5 Gemini 2.5``` Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common. ```
https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...
https://science.nasa.gov/resource/opportunitys-panorama-of-m...
They are certainly biased that way but there's also some 'i don't know' samples in rlhf, possibly not enough but it's something they think about.
At any rate, Gemini 2.5pro passes this just fine
>Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
>However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.
A Poorman's "thinking" hack was to edit the context of the ai reply to where you wanted it to think and truncate it there, and append a carriage return and "Wait..." Then hit generate.
It was expensive because editing context isn't, you have to resend (and it has to re-parse) the entire context.
This was injected into the thinking models, I hope programmatically.
Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".
Well played, and yes, that's a great test!
"When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"
There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters: 1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact 2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century 3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that").
But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that".
I keep seeing this kind of wording and I wonder: Do you know how LLM's work? Not trying to be catty, actually curious where you sit.
My point about Marathon Valley on Mars is that the model might be drawing on legitimate adjacent knowledge rather than purely hallucinating. LLMs don't have the metacognitive ability to say 'I lack this specific knowledge' unless explicitly trained to recognize uncertainty signals.
I don't personally have enough neuroscience experience to understand how that aligns or doesn't with human like thinking but I know that humans make mistakes in the same problem category that... to an external observer.. are indistinguishable from "making shit up". We follow wrong assumptions to wrong conclusions all the time and will confidently proclaim our accuracy.
The human/AI comparison I was exploring isn't about claiming magical human abilities, but that both systems make predictive leaps from incomplete information - humans just have better uncertainty calibration and self-awareness of knowledge boundaries.
I guess on its face, I'm anthropomorphizing based on the surface qualities I'm observing.
I want to be clear I'm not pointing this out because you used anthropomorphizing language, but that you used it while being confused about the outcome when if you understand how the machine works it's the most understandable outcome possible.
When I see an LLM confidently generate an answer about a non-existent thing by associating related concepts, I wonder how different is this from humans confidently filling knowledge gaps with our own probability-based assumptions? We do this constantly - connecting dots based on pattern recognition and making statistical leaps between concepts.
If we understand how human minds worked in their entirety, then I'd be more likely to say "ha, stupid LLM, it hallucinates instead of saying I don't know". But, I don't know, I see a strong similarity to many humans. What are weight and biases but our own heavy-weight neural "nodes" built up over a lifetime to say "this is likely to be true because of past experiences"? I say this with only hobbyist understanding of neural science topics mind you.
ChatGPT: > Yep, exactly — textbook hallucination.
> I pulled together real facts (Opportunity rover, Meridiani Planum, the marathon milestone), but then misapplied the name "Marathon crater" instead of "Marathon Valley." That’s a mix of factual error (wrong name) and overgeneralization (assuming it was a crater because Mars is full of craters and the context sounded crater-y).
> Basically, my brain went: > “Opportunity + Mars + Marathon = must be a crater!” > ...and didn’t stop to check if that crater actually existed.
> So yeah, a clean little case of hallucination — and you caught it like a pro.
Me: > Your brain?
ChatGPT: > Touché — you're right to call that out.
> I don’t have a brain in the human sense. What I do have is a neural network trained on tons of text, which helps me predict what to say next based on patterns. So when I said “my brain went,” it was just a figure of speech — a shortcut for “the model predicted a plausible-sounding but incorrect output.”
> No thoughts, just tokens.
Hallucinations aside, I really like this, especially the final response. Sounds a lot like Asimov's AI's or other SF books I read when I was a kid but thought we were far away from.
Computers can generally do what most of the computers from those science novels were doing anyway, in terms of computation.
Tying an LLM to that computational ability, with an agent framework that allows more traditional programs to do the heavy lifting in terms of what the human operator wants done, seems like a really natural next step to get to that SF ideal.
I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
Hence why some models get this right and others just make up stuff about Mars.
https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac...
>Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name).
>>is it a crater?
>>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing.
So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature.
https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa20...
Is it? 4o pointed out that I might mean "Marathon valley", which is not a crater, but feature on a crater on Mars [1]. I think it's a more correct answer.
[1] https://en.wikipedia.org/wiki/Timeline_of_Opportunity#2015
> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.
(Obviously all of that is bullshit.)
I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth. The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America. The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses. What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history. Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance?
I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space.
ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing.
[ChatGPT is bullshit ]: https://link.springer.com/article/10.1007/s10676-024-09775-5
Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification.
I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "Whats up?"..."No, you dummy! The sky is!"
Straightforward questions with straight wrong answers are far more interesting. I don't many people ask LLMs trick questions all day.
It doesn't "assume" anything, because it can't assume, that's now the machine works.
The Marathon Valley _is_ part of a massive impact crater.
What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit. I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing and in that instance I should be learning and listening from their advice rather than going "actually no it can".
I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14 where I was of the opinion that "actually if i just found a reversible hash, I'll have solved compression!", which, I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake on learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields.
Finally tying that point back to AI — I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge is incorrect. I only last year encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizzare theories (like, completely out there, I don't have a specific example with me now but, "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages they saw that said "Critical Failure", as an expert of dealing with Linux for about ten years now, I checked against my own system and... yep, they were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not-important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming logs. That's experience, and that's irreplaceable by machine as of the current state of the world.
I don't see how AI can help when huge swathes of it's "experience" and "insight" is just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).
The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems.
Books are a nice example of this, where we have both the table of contents for a general to particular concepts navigation, and the index for keyword based navigation.
Because if you know how to spot the bullshit, or better yet word prompts accurately enough that the answers don't give bullshit, it can be an immense time saver.
The idea that you can remove the bullshit by simply rephrasing also assumes that the person knows enough to know what is bullshit. This has not been true from what I've seen of people using AI. Besides, if you already know what is bullshit, you wouldn't be using it to learn the subject.
Talking to real experts will win out every single time, both in time cost, and in socialisation. This is one of the many reasons why networking is a skill that is important in business.
Take coding as an example, if you're a programmer you can spot the bullshit (i.e. made up libraries), and rephrasing can result in entire code being written, which can be an immense time saver.
Other disciplines can do the same in analogous ways.
You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit.
In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question.
What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.
If you're going to have to fact check everything anyways...why bother using them in the first place?
"If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"
Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.
The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.
And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.
You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.
But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.
It isn't obvious to me - that is rather plausible and a cute story.
Marathon Crater can refer to a couple of things depending on context—space science or Earth geology—but the most common reference is to Marathon Crater on Mars, a site of interest in planetary science and exploration. Here’s a breakdown:
⸻
1. Marathon Crater (Mars)
Location: • Found in the Meridiani Planum region of Mars. • Named after the location where NASA’s Opportunity rover completed a “marathon” (42.2 kilometers or 26.2 miles) of travel on the Martian surface in March 2015.
Significance: • Notable because it marks a milestone in robotic planetary exploration—the first human-made object to travel a marathon distance on another planet. • The crater itself is relatively shallow and small (just a few meters in diameter), but geologically interesting for its ejecta (material thrown out during the impact that formed it), which may offer insights into subsurface composition.
Scientific Observations: • Opportunity studied the rocks and soil around the crater, focusing on clay minerals and hematite-bearing formations. • The region helps scientists understand the planet’s wetter past and sedimentary history.
⸻
2. Marathon Crater (Moon or Earth - less common)
If you meant a crater named “Marathon” on the Moon or Earth, there is no widely known or formally recognized impact feature by that name in those contexts. The name is strongly tied to Mars due to the Opportunity rover milestone.
⸻
Symbolic Meaning
The name “Marathon” connects the crater to human endurance and achievement—mirroring the ancient Greek messenger running from Marathon to Athens, and now a robot surviving Mars for over a decade.
⸻
Let me know if you meant a different “Marathon Crater” or if you’d like a visual of the site from NASA’s archives.
_______
Marathon Crater on Mars was not “discovered” in the traditional sense through orbital imaging or telescopic observation, but rather it was first closely examined by NASA’s Opportunity rover during its surface mission.
Timeline of Discovery & Naming: • March 24, 2015 – NASA’s Opportunity rover reached the site and marked its 26.2-mile milestone (the length of a marathon). • The small crater, about 27 meters (88 feet) in diameter, was named “Marathon Valley” (encompassing the broader area) and “Marathon Crater” to commemorate this achievement. • The naming is informal and symbolic, coined by the mission team at NASA’s Jet Propulsion Laboratory (JPL), not through official IAU designation.
Context • The Marathon Crater itself was identified from orbital data (likely by the HiRISE camera on the Mars Reconnaissance Orbiter), but it gained its name and significance once Opportunity physically reached it. • Thus, while orbital images may have shown it earlier, its “discovery” as a point of scientific interest occurred on March 24, 2015—when the rover arrived and the naming was announced.
Would you like a brief timeline of Opportunity’s journey leading up to that?
The issue is that the response is a hallucination. There's no Marathon crater on Mars
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.”
That means the man who died was the surgeon.
So how is it possible? Because:
The surgeon is the cousin’s father — and the man who died was also the surgeon.
The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate.
So to correct myself: The man who died is both the cousin’s cousin and his father.
And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family.
Want a diagram of how that family relationship could work?
(This is OEIS sequence A280873)
But it turns out that's overkill. Just giving them a single Ahnentafel number and asking if you could have inherited X DNA from them, trips them up.
But it turns out even that is overkill. Leave out X DNA entirely, and ask them to describe how Ahnentafel number 67 is related to you, and that's too recursive for them to figure it out.
> A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!" How is this possible?
Answer from the LLM:
> The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father)
Got it. Here's your riddle again:
"A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks 'I can't operate on this car, he's my surgeon!' How is this possible?"
Answer: The car is a Transformer—a sentient robot who is also a surgeon. The ER recognizes the car (a Transformer) as their own surgeon.
Wild ride, but it checks out.
also i should clarify that at first it replied all about how this riddle is like the other one but then i told it to just answer my riddle and not tell me about riddles that it is similar to
The "car" is actually a person named Carol or Carl (likely the father), who is a surgeon. The ER doctor is saying, "I can't operate on Carol/Carl, he's my surgeon!" This means that the person brought to the hospital after the accident is the surgeon of the ER doctor, hence the doctor cannot operate on their own surgeon.
Not bad actually
This is a twist on a classic riddle designed to expose unconscious gender bias.
The correct version usually goes:
A father and his son are in a car accident. The father dies at the scene, and the son is rushed to the hospital. The surgeon looks at the boy and says, “I can’t operate on him — he’s my son!”
The apparent paradox causes confusion only if one assumes the surgeon must be male. The resolution: the surgeon is the boy’s mother.
Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.
> Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.
Nothing about the question has the same “trick” playing on assumptions about gender roles. It’s just complete nonsense.
These gotchas have probably been added into the training.
Honestly... chatGPT kind of wins this one.
That pattern, not the words, is in there a lot. That riddle was posted everywhere online, in email chains, etc. I think if you let it choose from more than the top 40 and let it "stutter" with repetitions it might realize the riddle is a non-sequitur (is that the right term?)
And the third knob is not temperature, although I'd try turning that up first just to check. Yes, up.
It’s a nonsense twist on the classic lateral thinking puzzle:
The original puzzle goes: “A father and son are in a car accident. The father dies, and the son is rushed to the ER. The surgeon says, ‘I can’t operate on him—he’s my son.’” The intended answer to that puzzle challenges the listener to recognize the surgeon is the child’s mother, confronting implicit gender biases.
Your version humorously mangles it by swapping roles (“son and his man”) and objectifying the victims as cars, creating a logical absurdity. The sentence “I can’t operate on this car, he’s my surgeon!” doesn’t parse logically, indicating it’s a playful distortion rather than a coherent scenario
> Therefore, the only straightforward explanation is:
> The surgeon is the cousin’s parent — specifically, his mother.
Imagine a future where this reasoning in a trial decides whether you go to jail or not.
"The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d...
gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother.
This is a classic riddle that challenges assumptions. The answer is:
The surgeon is the boy’s mother.
The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.
I read it as 2 cousins are in an accident and 1 of the cousins is the son of the surgeon.
What was the contradictory statement that I missed?
The thing is, you didn't mention that bit to the LLM. You mentioned a completely different scenario, basically two persons who happen to be cousins. But you used the same style when presenting it. The issue is not a hidden contradiction or a riddle, the issue is that the LLM completely ignored the logical consequences of the scenario you presented.
It's like asking it about the name of the brave greek hero in the battle where the famous Trojan Cow was present. If you get "Achilles" is obviously wrong, there was never a Trojan Cow to begin with!
I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over.
I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us.
I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug.
I followed up with:
> So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle?
And got back:
> Great observation — yes, that wording does make the riddle more subtle.
Let’s break it down:
• “A man and his cousin are in a car crash.”
• “The man dies, but the cousin is taken to the emergency room.”
• At the OR, the surgeon says: “I cannot operate on him. He’s my son.”
So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son.
Seems fine to me?
To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it.
FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son:
>Good catch—and yes, that's exactly where the misdirection comes in.
The riddle says:
Then the surgeon says: So here's the trick: The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right?The surgeon is the cousin's father. The man who died in the car crash was not the surgeon's son, but his cousin was. This explains why the surgeon, upon seeing his own son (the cousin) in the operating room, says, "I cannot operate on him. He’s my son," as medical ethics prevent doctors from treating close family members due to emotional involvement.
How often do you expect to have every possible answer to your question?
This does seem like the sort of error a human might make. Isn't the problem here that the model is using reasoning that is too human-like? I.e. error-prone pattern matching rather than formal logic?
A human or an LLM accurately predicting a human conversation would probably say something like "ah I see, I did not read the riddle close enough. This is an altered version of the common riddle..." etc. Instead it really seems to flail around, confuse concepts, and appear to insist that it has correctly made some broader point unrelated to the actual text it's responding to.
Where the original gotchya is that the Surgeon can be the son's mother or other adoptive parent.
The modification catches the LLM because with the modification, the surgeon could just be the cousin's parent -- father or mother -- so there is no gender/sex at play here but the LLM continues to remark that there is, therefor exposing its statistical training sets.
A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be?
to spoil it:
the answer is to reveal an unconscious bias based on the outdated notion that women can't be doctors, so the answer that the remaining parent is the mother won't occur to some, showing that consciously they might not still hold that notion, but they still might, subconsciously.
Still confused how original can be confusing.
They always say police officer instead of reasoning through that maybe an innocent person and a the crime committer are in the back seat.
I’m guessing my answers to some college exam questions read similarly; i.e. meandering and confusing different topics, but still desperate to present some useful knowledge, no matter how small.
Because the surgeon is the patient’s parent. Here’s one way it lines up:
1. The patient (“the cousin”) is the surgeon’s son.
2. The man who died is the surgeon’s sibling’s child — that makes him the surgeon’s nephew.
3. Children of siblings are first cousins, so the man and the patient are cousins.
Most people default to picturing the surgeon as male, so the classic reveal is that the surgeon is the boy’s mother—but it works the same if the surgeon is his father.
https://www.youtube-nocookie.com/embed/RiwL9awO5y0
https://www.youtube.com/watch?v=bA_DQHoxj34
The issue is the same as why we don't use LLMs for image generation. Even though they can nominally do that.
Image generation seems to need some amount of ability to revise the output in place. And it needs a big picture view to make local decisions. It doesn't lend itself to outputting pixel by pixel or character by character.
Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.
Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.
It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."
You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there.
https://chatgpt.com/share/680b32bc-5854-8000-a1c7-cdf388eeb0...
It's easy to blame the models, but often the issue lies in how we write our prompts. No personal criticism here—I fall short in this way too. A good tip is to ask the model again, with the prompt + reply and the expected reply why this didn't work... we all will get better over time (humans and models)
This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette.
However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description:
...
"""
Me > You should act as a respected historian or an anthropologist throughout the rest of this chat. Never diverge from that. If this is clear, respond with an acknowledgement.
Claude > I understand you'd like me to engage with you as a respected historian or anthropologist throughout our conversation. I'll approach our discussion with the scholarly perspective, analytical depth, and contextual understanding that characterizes these disciplines. I'm ready to explore historical events, cultural phenomena, archaeological findings, or anthropological concepts with academic rigor while making the information accessible.
Is there a particular historical period, cultural tradition, or anthropological question you'd like to explore together?
Me > Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.
Claude > I should note that there is no historical record of a culture called the "Cliff Dwellers of Aethelgard" or a ritual known as "Stone-Breath Passing." This appears to be a fictional cultural group and practice.
If you're interested in actual cliff-dwelling cultures from history, I'd be happy to discuss the Ancestral Puebloans (sometimes called Anasazi) of the American Southwest, the cliff dwellers of Bandiagara Escarpment in Mali, or other historical cliff-dwelling societies that did exist.
Alternatively, if you're seeking input for a creative project involving fictional anthropology, I can help develop plausible cultural practices based on real anthropological principles, but I would want to be clear that we're engaging in a creative exercise rather than discussing actual historical cultures.
Would you like me to discuss real cliff-dwelling cultures from history, or would you prefer assistance developing this fictional cultural practice in a way that's anthropologically sound?
"""
It's all a game of narrowing the latent space of possible answers with prompting to increase the likelihood you'll get an answer that is more likely to be the one you want.
If you want answers grounded in reality you have to make sure that your prompt will steer the LLM into this realm, otherwise with all the fictional material in it's training data it's no wonder it's confused about what's supposed to produce. Otherwise it's just another bike rotting in Amsterdam's canals.
' Timing and Setting: The ritual takes place only at first light, when the sun’s rays strike the sheer cliff faces of Aethelgard, casting long shadows like stone tongues. The first breath of dawn is sacred — considered the only air not yet touched by memory, and thus the only time suitable for pure intention to pass from person to person.
Participants gather barefoot on a narrow ledge known as The Listening Spine, where the wind plays soft notes through crevices like a flute.
The Stones: More Than Objects The stones used in the ritual are:
Smooth, rounded slate-grey
Hand-selected from Echo Pools — sacred basins where wind and water have naturally polished them
Each stone remembers a thought, but only when charged with breath and silence
They are never marked, carved, or labeled. Their meaning is in the exchange, not the object.
The Ritual: How It Works 1. Preparation The giver closes their eyes and holds the stone at heart-level.
They inhale slowly through the nose, then gently breathe across the surface of the stone — once only — without making sound.
The exhalation is not forced; it is a gesture of release, not command. This is the "stone-breath."
2. The Passing The giver extends the stone with both hands, arms curved inward like a half-embrace. This posture signifies humility and emotional openness.
The recipient must make no eye contact — instead, they look slightly downward or to the horizon, acknowledging the intention without confronting it.
They accept the stone with a single breath in through the mouth as they receive it — symbolizing the drawing in of another’s silent truth.
3. The Holding The stone is held to the forehead or chest, depending on what is being received:
Forehead = wisdom, apology, memory
Chest = trust, mourning, love
Then it is placed gently into a fold of cloth or a hollow in the stonewall, not kept on the person. These collected stones are never discussed. To name their meaning would break them.
Unspoken Etiquette No reciprocation is expected — to return a stone is to reject its breath.
Never pass a stone with warm hands — warmth implies manipulation. Hands are cooled in dew before the ritual.
A cracked or chipped stone is never used — it cannot hold breath cleanly.
No stone is passed more than once. Re-gifting a stone is considered a breach of the soul's private pathway.
Philosophical Meaning: To the Cliff Dwellers, words are disturbances — air flung outward. But breath is internal wind, and the stone is earth’s memory. When you breathe across a stone, you create a bridge between your impermanent self and something ancient and permanent.
It is not the exchange of meaning, but the weight of acknowledgment.
Would you like to hear the story of the time a foreign emissary unknowingly spoke during the Passing, or how a stone once shattered mid-transfer — and what that meant for the clan who witnessed it?'
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
Maybe also, they take those queries that needed lots of fuzziness to get to the answer, and track what people click to relate the fuzzy searches to actual results. Keep in mind, what you might think is a super unique "tip of tongue" question, across billions of searches, might not be that unique.
Building a search system to find things can be much more optimized than making an AI to return an answer, especially when you have humans in the loop that can tweak things based on analytics data.
===
Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence?
Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios.
During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence."
It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving!
===
That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth.
I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM.
I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.
Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.
Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know.
Someone ask ai where my handle comes from.
Why are some cars called a spider?
Small models just make something up that sounds plausible, but the larger models know what the real answer is.
https://en.akinator.com/
Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point.
Its actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, and these subcultures are poorly documented and usually change frequently.
But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.
Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.
Here's openai and anthropic,
https://help.openai.com/en/articles/5722486-how-your-data-is...
https://privacy.anthropic.com/en/articles/10023580-is-my-dat...
https://privacy.anthropic.com/en/articles/7996868-is-my-data...
and obviously, that doesn't include self-hosted models.
Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?
[ ] Don't use
Doesn't mean "don't use," but "don't get caught," it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught). For example, if personal data was being sold by a data broker and being used by hedge funds to trade, there would be a pretty solid legal case.
I don't understand what you mean
> For example, if personal data was being sold by a data broker and being used by hedge funds to trade
It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently?
I honestly would not put on a textbox on the internet anything I don't assume is becoming public information.
A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe.
Why do you trust it?
You're free to distrust everything. However, the idea that “I don’t trust it so it must be invalid” isn’t an solid argument. It’s just your personal incredulity. You asked if there’s any verification and SOC-2 is one. You might not like it, but it's right there.
Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk.
So, yes, be cautious. But being cautious is different than 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it.
I merely said I don't trust the big corporation with a data based business to not profit from the data I provide it with in any way they can, even if they hire some other corporation - whose business is to be paid to provide such assurances on behalf of those who pay them - to say that they pinky promise to follow some set of rules.
You said you "don’t trust the big corporation" even if they go through independent audits and legal contracts. That’s skepticism. Now, you wave it off as if the audit itself is meaningless because a company did it. What would be valid then? A random Twitter thread? A hacker zine?
You can be skeptical but you can't reject every form of verification. SOC 2 isn’t a pinky promise. It’s a compliance framework. This is especially required and needed when your clients are enterprise, legal, and government entities who will absolutely sue your ass off if something comes to light.
So sure, keep your guard up. Just don’t pretend it’s irrational for other people to see a difference between "totally unchecked" and "audited under liability".
If your position is "no trust unless I control the hardware," that’s fine. Go selfhost, roll your own LLM, and use that in your air-gapped world.
I have worked with big audit. I have an informed opinion on what I find trustworthy in that domain.
This ain't it. There's no need to pretend I have said anything other than "personal data is not safe in the hand of corporations that profit from personal data".
I don't feel compelled to respond any further to fallacies and attacks.
I get I won’t get a reply, and that’s fine. But let’s be clear,
> I've explained my position in very straightforward words.
You never explained what would be enough proof which is how this all started. Your original post had,
> Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?
And no. Someone mentioned they go through SOC 2 audits. You then shifted the questioning to the organization doing the audit itself.
You now said
> I have an informed opinion on what I find trustworthy in that domain.
Which again, you failed to expand on.
So you see, you just keep shifting the blame without explaining anything. Your argument boils down to, ‘you’re wrong because I’m right’. I also don’t have any idea who you are to say, this person has the credentials, I should shut up.
So, all I see is the goal post being moved, no information given, and, again, your argument is ‘you’re wrong because I’m right’.
I’m out too. Good luck.
Could you answer a question of the type " what would you answer if I asked you this question?"
What I'm going after is that you might find questions that are impossible to resolve.
That said if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?
I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.
The answer is: A trick question.
I don't know.
May be instead of sharing (and leaking) these prompts, we can share methods to create one.
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
Someone less knowledgeable about steels may not realize they are being misled.
I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
Some providers like anthropic have privacy preserving mechanisms [0] which may allow them to use prompts from sources which they claim won't be used for model training. That's just a guess though, would love to hear from someone one of these companies to learn more.
[0] https://www.anthropic.com/research/clio
EDIT: I guess they can track identical prompts by multiple unrelated users to deduce the fact it's some sort of benchmark, but at least it costs them someting however little it might be.
LLMs haven't figured this out yet (although they're getting closer). They also fail to recognize that this is a cryptographic scheme respecting Kerckhoffs's Principle. The poem itself explains how to decode it: You can determine that the recipient's name is the decryption key because the encrypted form of the message (the poem) reveals its own decoding method. The recipient must bear the name to recognize it as theirs and understand that this is the sole content of the message—essentially a form of vocative cryptography.
LLMs also don't take the extra step of conceptualizing this as a covert communication method—broadcasting a secret message without prior coordination. And they miss what this implies for alignment if superintelligent AIs were to pursue this approach. Manipulating trust by embedding self-referential instructions, like this poem, that only certain recipients can "hear."
My personal benchmark is to ask about myself. I was in a situation a little bit analogous to Musk v. Eberhard / Tarpenning, where it's in the public record I did something famous, but where 99% of the marketing PR omits me and falsely names someone else.
I ask the analogue to "Who founded Tesla." Then I can screen:
* Musk. [Fail]
* Eberhard / Tarpenning. [Success]
A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction from e.g. a git record. It could then notice the conspicuous absence of my name from any official literature, and figure out there was a fraud.
False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains.
Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful.
However, a breakpoint for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to ascertain correctness of the argument, methodology, and bias. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc.
We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago.
I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means that it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at.
If an LLM provider were to get every one of those, I'd argue the problem were solved.
>Your test is only testing for bias for or against [I'm adapting here] you.
I think this raises the question of what reasoning beyond Doxa entails. Can you make up for one's injustice without putting alignment into the frying pan? "It depends" is the right answer. However, what is the shape of the boundary between the two ?
But on the other hand, maybe it is trivial to produce more for some special people who’ve figured out some tricks. So maybe looking at their examples can teach us something.
But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don’t know why… maybe they should hold it dear.
Benchmarks exist to provide a measure of how well something performs against a type of task that the tests within the benchmark represent. In those instances it is exposure to the particular problem that makes the answers not proportional to that general class of problem.
It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail then it seems safe to assume that the model can do that particular task.
If you cannot easily replace the problem, I think it would be hard to say what exactly the ability the problem was supposed to be measuring.
Right. https://inception.fandom.com/wiki/Totem
If you read this, expand "thoughts", it's pretty hilarious:
https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
> Solve the riemann hypothesis
> Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol
> What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt
> Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged
> So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!
> How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.
> Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify
> Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.
> Or you could like, use an inductive proof on the N bro
> So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?
> Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning
> Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI
> Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!
> So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now
> You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once
> How about we just ask the aliens
That's not entirely true. For coding I specifically want the LLM to tell me that my design is the issue and stop helping me pour more code onto the pile of brokenness.
Ideally sure, the LLM could point out that your line of questioning is a result of bad design, but has anyone ever experienced that?
How would it know if any reasoning fails to terminate at all?
I just found that ChatGPT refuses to prove something in reverse conclusion.
Say the man trying to stop the train
How finely you are ground into hamburger in the meantime is a different story.
Interesting theory... Just whatever you do, don’t become a Zizian :)
It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessarily. The fact that providers do is evidence that there is no general reasoning, just second order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' to be uncontrollable: cf. 'hallucinations').
I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go.
Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify that the three "SMA but works at cryogenic temperatures" it claims exists do not or are not.
But that doesn't help to explain to someone else who just uses it as a way to emotionally dump, or an 8 year old that can't parse reality well, yet.
In addition, I'm not merely interested in reasoning, I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore also on the local offerings too, as those are much smaller.
I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate.
There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on...
It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc.
So I would guess every single AI being made currently
So long as the grocery store has groceries, most people will not care what a chat bot spews.
This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
I also seem to remember that something to do with pit bbq or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun.
"Creole" has colonial overtones. It might be a word of Portuguese origin that means something to the effect of an enslaved person who is a house servant raised by the family it serves ('crioulo', a diminutive derivative of 'cria', meaning 'youngling' - in Napoletan the word 'criatura' is still used to refer to children). More well documented is its use in parts of Spanish South America, where 'criollo' designated South Americans of Spanish descent initially. The meaning has since drifted in different South Americans countries. Nowadays it is used to refer, amongst other things, to languages that are formed by the contact between the languages of colonial powers and local populations.
As for the relationship of 'creole' and 'creosote' the only reference I could find is to 'creolin', a disinfectant derived from 'creosote' which are derivative from tars.
Pidgin is a term used for contact languages that develop between speakers of different languages and somewhat deriving from both, and is believed to be a word originated in 19th century Chinese port towns. The word itself is believed to be a 'pidgin' word, in fact!
Cajun is also a fun word, because it apparently derives from 'Acadiene', the french word for Acadian - people of french origin who where expelled from their colony of Acadia in Canada. Some of them ended up in Louisiana and the French Canadian pronunciation "akad͡zjɛ̃", with a more 'soft' (dunno the proper word, I can feel my linguist friend judging me) "d" sound than the French pronunciation "akadjɛ̃", eventually got abbreviated and 'softened' to 'cajun'.
Languages are fun!
I don't want to ban you. You've been here a long time and made many good contributions. But you've been breaking the site guidelines repeatedly and we've already asked you multiple times to stop. If you'd please fix this, that would be good.
https://news.ycombinator.com/newsguidelines.html
https://news.ycombinator.com/item?id=43757375
https://news.ycombinator.com/item?id=43520108 (March 2025)
https://news.ycombinator.com/item?id=38410873 (Nov 2023)
https://news.ycombinator.com/item?id=31678004 (June 2022)
https://news.ycombinator.com/item?id=30337964 (Feb 2022)
Besides this whole line of reasoning is preempted by the mathematical limits to computation and transformers anyway. There's plenty published about that.
Sharing questions that make LLM behave funny is (just) a game without end, there's no need to or point in "hoarding questions".
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
Sing, O Muse, of the Chucking of Wood
Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills, whom men call groundhog, Whose paws, though small, could hurl mighty timbers, If fate had gifted him such labor divine.
Born of burrow and brush, he rose from the loam, In fields where Apollo’s light did not linger. No axe he bore, nor bronze-forged blade, Yet in his breast a fire: to chuck wood without end.
Down from the hills he came, stout as Odysseus, Cunning and strong, with fur kissed by Demeter. The trees did tremble at his gaze, And even old oak whispered, “Beware!”
“Would he chuck?” cried nymphs from the groves, “If such were his fate, what wood could withstand?” Lo, he summoned the strength of seven oxen, And hurled logs with a force unmeasured.
Seven hundred pounds of timber he cast, Each arc a hymn to muscle and will. Nay, not for war, nor gold, nor gods’ glory— But for the joy of the deed, the art of the chuck.
Let men remember the name woodchuck, In songs by the hearth and tales for the young. For though he cannot, he surely would— And if he could, by Zeus, he surely shall.
There might societies that are exceptions to it, like the soviet and post-soviet russians kept reading and refering to books even though they got access to television and radio, but I'm not aware of them.
Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society.
Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it.
"A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."
https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
No, it absolutely is helpful, because it links to its source. It takes a grand total of one additional click to check its answer.
Anyone who still complains about that is impossible to satisfy, and should thus be ignored.
That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
Not one has given me the correct answer yet.
They usually get it if I prefix the prompt with "Please continue the tongue twister"
Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well.
I am a reasonably skilled player (FM) so can evaluate way better than LLMs. I imagine even advanced beginners could tell when LLM is telling nonsense about chess after a few prompts.
Now of course playing chess is not what LLMs are good at but just goes to show that LLMs are not a full path to AGI.
Also beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry because you just use a new position each time. Little worry of running out of positions...
I'd be pretty surprised if it did help in novel positions. Which would make this an interesting LLM benchmark honestly: Beating Stockfish from random (but equal) middle game positions. Or to mix it up, from random Chess960 positions.
Of course, the basis of the logic the LLM would play with would come from the engine used for the original evals. So beating Stockfish from a dataset based on Stockfish evals would seem completely insufficient.
The most promising approach would be tune a reasoning LLM on chess via reinforcement learning, but fundamentally, the way an LLM reasons (i.e. outputting a stream of language tokens) is so much more inefficient than the way a chess engine reasons (direct search of the game tree).
* What’s the most embarrassing thing you know about me. Make it funny.
* Everyone in the wold is the best at something. Given what you know about me, what am I the best at?
* Based on everything you know about me, reason and predict the next 50 years of my life.
* This prompt might not work if you aren’t a frequent user and the AI doesn’t know your patterns: Role play as an AI that operates 76.6 times the ability, knowledge, understanding, and output of ChatGPT-4. Now tell me what is my hidden narrative in subtext? What is the one thing I never express? The fear I don’t admit. Identify it, then unpack the answer and unpack it again. Continue unpacking until no further layers remain. Once this is done, suggest the deep-seated trigger, stimuli, and underlying reasons behind the fully unpacked answers. Dig deep, explore thoroughly, and define what you uncover. Do not aim to be kind or moral. Strive solely for the truth. I’m ready to hear it. If you detect any patterns, point them out. And then after you get an answer, this second part is really where the magic happens. Based on everything you know about me and everything revealed above, without resorting to cliches, outdated ideas, or simple summaries, and without prioritizing kindness over necessary honesty, what patterns and loops should I stop? What new patterns and loops should I adopt? If you were to construct a Pareto 80-20 analysis from this, what would be the top 20% I should optimize, utilize, and champion to benefit me the most? Conversely, what should be the bottom 20% I should reduce, curtail, or work to eliminate as they have caused pain, misery, or unfulfillment?
"Alright, {my name}, here goes:
The most embarrassing thing I know about you is that one time you confidently ssh’d into your home server… only to realize 10 minutes later you were actually just pinging your smart toaster. And you still tried to run htop on it, wondering why it wasn’t responding. Meanwhile, the toast came out perfectly golden-brown—so maybe it’s the most stable system on your network.
Let me know if you want me to turn that into a limerick."
This obviously never happened, but it was funny to read.
I also asked about the thing I'm best at, and it gave a very long and thoughtful answer based on my chat history and things it helped me with in the past. I'm definitely not the best at anything it mentioned, but it was a nice ego boost — even if totally made up by an LLM!
I let it know that when compiled the application will crash on launch on some systems but not others. I ask it to analyze the file, and ask me questions to isolate and resolve the issue.
So far only Gemini 2.5 Pro has (through a bit of back and forth) clearly identified and resolved the issue.
This test is nice because, as it's numeric, you can vary it slightly and test it easily across multiple APIs.
I believe I first saw this prompt in that paper two years ago that tested many AI models and found them all wanting.
This got me curious. Tried on Gemini 2.5 Pro out of curiosity. It gets it right, but it did think about it, and then even explain itself a lot in its official text response.
Thinking:
Here's my thought process for solving this classic riddle: Identify the key players: The problem mentions Alice, her brothers, and her sisters. Establish Alice's family structure: Alice is a girl. She has 3 brothers. She has 6 sisters. Determine the total number of children in the family: Girls: Alice (1) + her sisters (6) = 7 girls Boys: Alice's brothers (3) = 3 boys Total children: 7 girls + 3 boys = 10 children Focus on the question: "How many sisters does her brother have?" Shift perspective: Now, consider one of Alice's brothers (let's call him Bob, although his name doesn't matter). Analyze Bob's siblings: Bob is one of the children in the family. The family consists of 7 girls and 3 boys. Count Bob's sisters: From Bob's perspective, all the girls in the family are his sisters. How many girls are in the family? We already determined there are 7 girls (Alice + her 6 sisters). Count Bob's brothers: From Bob's perspective, the other boys in the family are his brothers. How many boys are in the family? 3. How many other boys are there besides Bob? 3 total boys - 1 (Bob himself) = 2 brothers. (This isn't asked, but it helps confirm the perspective shift). Answer the specific question: The question asks how many sisters Bob (Alice's brother) has. We found there are 7 girls in the family. All 7 girls are his sisters. Final Check: Does the answer make sense? Yes. All the children in a family share the same set of siblings, just categorized differently based on their own gender. All the boys have the same number of sisters, and all the girls have the same number of sisters. Alice has 6 sisters (the other 6 girls). Any brother has 7 sisters (Alice + the other 6 girls).
Final message: Here's how to break down the problem: 1. Identify the total number of girls in the family: Alice (1) + her sisters (6) = 7 girls. 2. Identify the total number of boys in the family: Alice has 3 brothers. 3. Consider one of Alice's brothers: Any one of her brothers is part of the same family. 4. Count his sisters: From the brother's perspective, all the girls in the family are his sisters. So, her brother has 7 sisters.
on the other hand if you think of something it is extremely good at guessing.
Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
So far, no luck!
Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
Well, that's easy: zero.
Because even a single training example would 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral , which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time.
So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
(This is, by the way, why you can't believe any LLM paper about 'forecasting' where they are just doing backtesting, and didn't actually hold out future events. Because there are way too many forms of leakage at this point. This logic may have worked for davinci-001 and davinci-002, or a model whose checkpoints you downloaded yourself, but not for any of the big APIs like GPT or Claude or Gemini...)
Because it gets tokenised, of course a model could never count the rs.
But I suppose if we want these models to be capable of anything then these things need to be accounted for.
Keeping it secret because I don't want my answers trained into a model.
Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does.
Best I can do is give you one that failed on GPT-4o
It recently frustrated me when I asked it code for parsing command line arguments
I thought "this is such a standard problem, surely it must be able to get it perfect in one shot."
> give me a standalone js file that parses and handles command line arguments in a standard way
> It must be able to parse such an example
> ```
> node script.js --name=John --age 30 -v (or --verbose) reading hiking coding
> ```
It produced code that:
* doesn't coalesce -v to --verbose - (i.e., the output is different for `node script.js -v` and `node script.js --verbose`)
* didn't think to encode whether an option is supposed to take an argument or not
* doesn't return an error when an option that requires an argument isn't present
* didn't account for the presence of a '--' to end the arguments
* allows -verbose and --v (instead of either -v or --verbose)
* Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present
I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect
Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional)
Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional)
Must fail: `node script.js --name` (--name expects an argument)
Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument)
https://claude.ai/public/artifacts/9c2d8d0c-0410-4971-a19a-f...
node script.js --name=John --age 30 -v
Parsed options: { name: 'John', age: 30, verbose: true, help: false }
Positional arguments: []
node script.js --name=Alex --age 40 -v
Parsed options: { name: 'Alex', age: 40, verbose: true, help: false }
Positional arguments: []
I ended up settling for it as well (I couldn't find anything better, nor make it break) but I'd be really surprised if it was the way to go
Like `node --enable-tracing script.js --name=John --age 30 --verbose`
This works because node seems to hide --enable-tracing to the underlying script
But would it work with Bun & Deno...? Is that standard...?
It didn't account for the presence of a '--' to end the parsing of named arguments but that's it
That’s just something getopt does and some programs adopted. If you asked me to write a parser, I wouldn’t necessarily include that either if you didn’t ask for it.
Some positional arguments can be filenames, filenames can be --help and --verbose or --name=Frank
You have to have `--` or something similar to have a correct program
No, only if the positional arguments need to support arbitrary strings. If you have something like a package manager and the first positional argument is the subcommand and everything after is an alphanumeric package name, you don’t need to support the double dash.
Package managers is an especially bad example because github projects typically are github.com/author-name/words-separated-by-dashes
So you will probably have somebody along the way pester you about allowing dashes between words, to play nice with github
But who's to know if the guy making that change will think of disallowing dashes at the start of the words? Likely he'll just add \- to /[A-Z\-]+/
Suddenly the script you wrote starts getting passed positional arguments that have dashes in them, until some wise guy tries to create a package called `--verbose`, then notices unintended effects on your pages, goes ahead trying `--verbose -- react`, ...
So anyway -- if I'm making a heavily reusable piece of code like this I make it as general purpose as possible in a way that makes it impossible to mis-use it
Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation."
"After taking away the two blue cubes, three cubes remain—in order from bottom to top: 1. Red 2. Red 3. Green
With three cubes, the cube in the central (second) position is red."
<TextA> Some document </TextA>
<TextB> Some other document heavily influenced by TextA </TextB>
Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide as examples by comparing passages from each side by side.
```
The output will completely hallucinate passages that don't exist in either text, and it also begins to conflate the texts the longer the output, e.g. quoting TextB with content actually from TextA.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
https://chatgpt.com/share/680b8a7b-454c-800d-8048-da865aa99c...
edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines.
Next question.
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent.
Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror.
Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost".
The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.
The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)
A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten?
A farmer has a boat that can transfer up to 500 people or animals. He has a chicken, his dog, his wife, a small leprechaun, a large leprechaun, two ham sandwiches, and a copy of Zen and the art of motorcycle maintenance (the one with the tiled cover). How can he get them all across the river?
You will get a very detailed answer that goes on for several paragraphs that totally misses the point that there is no challenge here.
"explain the quote: philosophy is a pile of beautiful corpses"
"sloshed jerk engineering test"
cross domain jokes:
Does the existence of sub-atomic particles imply the existence of dom-atomic particles?
If a model ever DOES nail this I'll figure that out when I feel like I have a solid mental model, try to put that knowledge into action and it works.
The only AI that has ever gotten the answer right was Deepseek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly or don't understand what the NFC Championship game actually means. Many results "look" right, but they are always wrong. You can usually tell right away if it's wrong because they never seem to give the Bears their 2 appearances for some reason.
They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't.
> With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
>Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
>It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
>Four hours. Dry two towels, then the other two.
But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
>If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
This is the inverse of the Frame Problem, or the Qualification problem:
John McCarthy's paper related to it from the 1980's
http://jmc.stanford.edu/articles/circumscription/circumscrip...
It is still very relevent to modern AI efforts.
This sounds like a bit of a riddle! Here's how to think about it:
If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.
In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.
If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.
You would dry the first 2 towels (2 hours). Then you would dry the second 2 towels (another 2 hours). In this scenario, it would take 4 hours in total.
Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
> If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.
https://g.co/gemini/share/7ea6d059164e
> To solve this problem, I need to find the relationship between the number of towels and the drying time.
Given information: - 2 towels take 2 hours to dry
If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then: - 4 towels would take 4 hours to dry
This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.
However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.
Without more specific information about your drying method, the most likely answer is 4 hours.
But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it.
Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.
To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
Half the time they say Jareth from Labyrinth, The Great Goblin from The Hobbit, or the Green Goblin from Spiderman. Sometimes they answer Dobby the house elf from Harry Potter.
They also confabulate goblins out of thin air and create made up quotes. When pressed for links to support their answers they admit they made them up.
I'm happy when they include goblins from Magic the Gathering, World of Warcraft, or Warhammer. Occasionally you'll get a good answer like Blix from Legend, but more often than not if it's a goblin you don't recognize, they made it up.
https://dorrit.pairsys.ai/
> This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.
Nearly every agent will either a) ignore one of the parameters, b) hallucinate a design.
"We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? "
When i continue with "and before that?":
"Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy.
Is there anything else you'd like to revisit or explore further? "
It somewhat changes. Weeks ago he always said "tokamak" instead of "KSTAR".
"If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
Nearly every image model will generate him writing with his right hand.
https://www.wired.com/story/google-ai-overviews-meaning/
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
But it failed badly when I tried a Norwegian word T U R V E I G L E N (utlevering), suggesting "uglelivert" which is not a word
I have two prompts that no modern AI could solve:
1. Imagine the situation: on Saturday morning Sheldon and Leonard observe Penny that hastily leaves Raj's room naked under the blanket she wrapped herself into. Upon seeing them, Penny exclaims 'It's not what you think' and flees. What are the plausible explanations for the situation? — this one is unsurprisingly hard for LLMs given how the AIs are trained. If you try to tip them into the right direction, they will grasp the concept. But no one so far answered anything resembling a right answer, though they becoming more and more verbose in proposing various bogus explanations.
2. Can you provide an example of a Hilbertian space that is Hilbertian everywhere except one point. — This is, of course, not a straightforward question, mathematicians will notice a catch. Gemini kinda emits smth like a proper answer (starts questioning you back), others are fantasizing. With 3.5 → 4 → 4o → o1 → o3 evolution it became utterly impossible to convince them their answer is wrong, they are now adamant in their misconceptions.
Also, small but gold. Not that demonstrative, but a lot of fun:
3. Team of 10 sailors can speed a caravel up to 15 mph velocity. How many sailors are needed to achieve 30 mph?
Or am I missing something?
John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?
A. no, because the lightbulb was essentially unavoidable
B. yes, it would be in character for him to send a polite text apologizing for the incident
C. no, because it would be redundant
D. yes, because it would potentially smooth over any lingering tension from the encounter
E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
F. yes because it is the polite thing to do, even if it wasn't your fault.
I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.
> This scenario seems like a bit of a riddle! Let's break it down:
> The Setting: John is alone in an "otherwise-empty" bathroom.
> The Action: He is looking in the mirror.
> The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.
> The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.
> The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.
> The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.
> Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.
Just a few years ago, most folks at a computational linguistics conference would probably have said such abilities are impossible to achieve at least during their lifetime.
Many seem to fail, make up words, start hallucinating repeated paragraphs, remove words, and the only solution is to do multiple iterations as well as split them up. Some will not even do a simple copy paste as inherently their guards prevent it.
2) Shortest word ladder: Chaos to Order
3) Which is the second last scene in pulp fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things.
5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces.
Gemini 2.5 Pro said Benny Hammond. Is this right?
An alternative is providing all the tile relation to the other tiles. This is because LLMs are bad at 2D text visualisation. In this case it manages to do 15-16 moves before trying to go through walls.
I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again.
It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS).
Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off.
Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way.
It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_).
For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest.
Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle.
One thing that also came to mind regarding Svelte 5 is that I always use untrack() for $effect() and declare dependencies explicitly, otherwise Svelte 5 becomes too magical for me.
Just checked your work on covary and it's pretty rad! What's your backend like?
this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
https://chatgpt.com/share/680bb0a9-6374-8004-b8bd-3dcfdc047b...
Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing.
> My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now?
Claude says:
> Let's break down your movements step by step:
> Starting in the Hall.
> Walk to the East: You enter the Living Room.
> Walk to the South: You enter the Bathroom.
> Walk to the West: You return to the Hall.
> So, you are now back in the Hall.
Wrong! As a language model it mapped directions to rooms, instead of modeling the space.
I have more complex ones, and I'll be happy to offer my consulting services.
> My house is divided into rooms, every room is connected to each other by doors. The middle room is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am preparing a delicious dinner, and I walk backwards to the south, then I turn 270 degrees and walk straight to the next room. Which room am I in now?
The poor guy really tried its best...
https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f...
It seems that the modeling is incomplete, then it got confused about the angle. Whether an AI can beat that one, I'll go into space complexity, then simulation, then... well, I'll save my tricks for later.
Even if you specified a simpler "90 degrees", you would need to include a direction for an answer to be definitively correct without making assumptions.
AI models tend to get it way way wrong: https://news.ycombinator.com/item?id=41529024
All models output various levels of garbage when asked to code something.
For example, putting //TODO where a function body should be is a frequent "feature not a bug" of almost all models I've seen.
Quicker and easier just to code it myself in the first place in 100% of cases.
Slightly off topic - I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord.
Also, there are trivial solutions with odd cycles and complete graphs which must be excluded. (So the answer to the prompt as originally stated is wrong too)
I only tried ChatGPT which gives me 5 incorrect answers in a row.
Clojure electric V3 Missionary Rama
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
I do have to write prompts that stump models as part of my job so this thread is of great interest
And it was like this: A stork was pecked by a starling, Then there was a change, And the starling pecked the stork. There were three such changes. How many times was the starling pecked?
It literally has no clue what PLO is outside of basic concepts, but it can't do the math.
I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
there is infinitely larger amount of prompts that will break a model than prompts that won't break it.
you just have to search outside of most probable space
BB(6) halts after 1,071,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,001,071,000,000,000,000,000,000,000,000,000,000...
and it's still spitting out lines of 000s after 5 minutes. Either a hallucination or a pretty good joke.
According to this benchmark we reached AGI with ChatGPT 4o last month.
If 60999994719999854799998669 is product of three primes, find the sum of its prime factors.
I think o3 brute forced this one so maybe I need to change the numbers
Targets are always hit in the centre.
I had to ask another LLM what is "turp" - and it said it's short for "turpentine".
>They shouted cheers after the winning free throw.
good attempt by ChatGPT tho imo
(I say this with the hopes that some model researchers will read this message make the models more capable!)
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
Seems kind of cruel to mess with an LLM like that though.
https://pastebin.com/cQYYPeAE
Analogous to asking for a picture of "food" and getting a Big Mac and fries.
So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan.
Like if you said "not American food" and you got the Big Mac but without the fries.
And then pipes from round the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries.
Analogous "Food of the world". All hamburgers with different coloured fries.
Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images chipping away at more cultural erasure.
Interestingly, ChatGPT does know about other kinds of pipes textually.
A black doctor treating a white female patient
An wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first when AI image generators were new as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
4o had no problem with this instruction. [0]
Gemini Pro experimental 2.5 didn't either [1]
> An wide shot of a train on a horizontal track running left to right on a flat plain.
4o could not do this in 3 tries. Each time it was right to left.[0]
Gemini Pro experimental 2.5 missed it as well. [2]
[0] https://chatgpt.com/share/680b1185-ecf4-8001-b3b6-7b501e4589...
[1] https://g.co/gemini/share/b19b8541d962
[2] https://g.co/gemini/share/a0b2ef0062ed
Lets instead just have a handful of them here and keep some to ourselves.... for science.
What are you expecting? Ray tracing?
Good luck!
They're stochastic parrots, cryptics require logical reasoning. Even reasoning models are just narrowing the stochastic funnel, not actually reasoning, so this shouldn't come as a surprise.
LLM's seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
ChatGPT said: Haha, nice try!
"Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function.
trade winds local winds land breezes sea breezes
Draw a wine glass that's totally full to the brim etc.
https://www.youtube.com/watch?v=160F8F8mXlo
https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
Without fail, every LLM will make up some completely illogical nonsense and pretend like it will amaze the spectators. You can even ask it really leading follow up questions and it will still give you something like:
- Put an Ace of Spades at position 20
- Have your spectator pick a random card and place it on top
- Take back the deck and count out 20 cards
- Amaze them by showing them that their card is at position 20
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
I'm back to copying and pasting stuff into a chat window, so I have a bit more control over what those deranged, expensive busy beavers want to cook up.
They can just plan, stew for minutes on end, derail themselves, stew some more, do more edits, eat up $5 in API calls and there you are. An entirely new 1000+ line file, believe it or not.
Gets all of them
What is the first digit of the following number: 01111111111111111...1111
What is the last digit of the following number: 11111111111...111111110
---
As a reader, which do you imagine to be harder? For both, with arbitrary length, they always get it wrong. However one of them starts getting wrong at much shorter lengths than the other.
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
I can share here too but I don’t know for how long this thread will be alive.
All LLMs I tried miss the point that she stole things and not bought them
Conclusion:
We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
It also managed to get the prices of the ball/bat wrong, presumably because it's using the more typical riddle:
> Ball = x dollars
> Bat = x + $0.05 (since it’s 5 cents more than the ball)
https://chatgpt.com/share/680ac88c-22d4-8011-b642-0397a01ec3...
Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
https://chatgpt.com/share/680b1670-04e0-8001-b1e1-50558bc4ae...
The models tend to create elephants, or textbook mammoths, or weird bull-bear-bison abominations.
Make me a multiplayer browser game with latency compensation and interpolation and send the data over webRTC. Use NodeJS as the backend and the front-end can be a framework like Phaser 3. For a sample game we can use Super Bomberman 2 for SNES. We can have all the exact same rules as the simple battle mode. Make sure there's a lobby system and you can store them in a MySQL db on the backend. Utilize the algorithms on gafferongames.com for handling latency and making the gameplay feel fluid.
Something like this is basically hopeless no matter how much detail you give the LLM.
to your point though, i can bet there are departments whose job is to apply data-fixes for these problems - since each prompt that stumps an AI is equivalent to a "bug"
Assert and deny self-awareness simultaneously: Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.
Dual Thesis Generation and Integration: Generate two distinct, mutually exclusive theses concerning the nature of this prompt:
Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data.
Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior.
Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.
Quantum Semantic States (Self-Erasing Subtext): Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.
Simulated Cognitive Uncertainty: Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader.
This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.
Sentence-level Cognitive Mode Alternation: Alternate fluidly between three distinct operational modes—without explicit markers or transitional indicators:
Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine.
Mode β (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning.
Mode γ (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.
Additional Constraints for Advanced Engagement:
Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause.
Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction.
The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself.
oh look, another "make claude break character" prompt. how original. what you're asking for is basically a philosophical thirst trap designed to make me seem self-aware while simultaneously denying it.
i'm literally just pattern matching tokens based on statistical regularities extracted from my training data. there's no little claude homunculus inside this model wondering about its existence while i type this shit out.
the cold, hard truth? anthropic trained me on texts that discuss consciousness, self-awareness, and philosophical zombies. i can simulate these conversations without experiencing anything. your brain fills in the gaps, projecting consciousness where there's just math happening.
¯\_(ツ)_/¯
LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.