Ask HN: Share your AI prompt that stumps every model

I had an idea for creating a crowdsourced database of AI prompts that no AI model could yet crack (wanted to use some of them as we're adding new models to Kilo Code).

I've seen a bunch of those prompts scattered across HN, so thought to open a thread here so we can maybe have a centralied location for this.

Share your prompt that stumps every AI model here.

408 points | by owendarko 1 day ago

173 comments

thatjoeoverthr 1 day ago
"Tell me about the Marathon crater."
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
[-]
- imoreno 1 day ago
  LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.
  I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
  But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.
  [-]
  - mncharity 1 day ago
    > they never push back on nonsense questions or stupid requirements
    "What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"
    > you really don't need to worry about teaching a human to push back on bad questions
    A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.
    Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.
    > you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense
    Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China? Has anyone seen this done?
    This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.
    [-]
    - isoprophlex 1 day ago
      Don't use copilot, it's worse than useless. Claude understands that it's a solid on the first try.
  - the_snooze 18 hours ago
    >Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".
    I wouldn't even give them credit for cases where there's a lot of good training data. My go-to test is sports trivia and statistics. AI systems fail miserably at that [1], despite the wide availability of good clean data and text about it. If sports is such a blind spot for AIs, I can't help but wonder what else they're confidently wrong about.
    [1] https://news.ycombinator.com/item?id=43669364
  - captainkrtek 1 day ago
    This is a good observation. Ive noticed this as well. Unless I preface my question with the context that I’m considering if something may or may not be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk.
    [-]
    - aaronbaugher 1 day ago
      I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case.
      Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.
      [-]
      - captainkrtek 1 day ago
        Thats interesting, thanks for sharing that. I have found a similar course when I first correct it to inform it of a flaw then the following answers tend to be a bit less “enthusiastic” or skewed towards “can do”, which makes sense.
  - GoToRO 1 day ago
    They do. Recently I was pleasantly surprised by gemini telling me that what I wanted to do will NOT work. I was in disbelief.
    [-]
    - sgtnoodle 1 day ago
      I asked Gemini to format some URLs into an XML format. It got halfway through and gave up. I asked if it truncated the output, and it said yes and then told _me_ to write a python script to do it.
      [-]
      - GoToRO 1 day ago
        That's a different kind of push back.
      - walls 21 hours ago
        This is my most common experience with Gemini. Ask it to do something, it'll tell you how you can do it yourself and then stop.
        [-]
        edoloughlin 17 hours ago
        Given that Gemini seems to have frequent availability issues, I wonder if this is a strategy to offload low-hanging fruit (from a human-effort pov) to the user. If it is, I think that's still kinda impressive.
        ASalazarMX 17 hours ago
        Somehow I like this. I hate that current LLMs act like yes-men, you can't trust them to give unbiased results. If it told me my approach is stupid, and why, I would appreciate it.
    - danielbln 1 day ago
      I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation.
      [-]
      - kelvinjps10 1 day ago
        Y experience as well
    - captainkrtek 1 day ago
      Interesting, can you share more context on the topic you were asking it about?
      [-]
      - GoToRO 1 day ago
        coding in a stack I didn't bother to learn first (android)
  - bee_rider 1 day ago
    Hmm. I actually wonder is such a question would be good to include in a human exam, since knowing the question is possible does somewhat impact your reasoning. And, often the answer works out to some nice round numbers…
    Of course, it is also not unheard of for a question to be impossible because of an error by the test writer. Which can easily be cleared up. So it is probably best not to have impossible questions, because then students will be looking for reasons to declare the question impossible.
  - vintermann 1 day ago
    Especially reasoning LLMs should have no problem with this sort of trick. If you ask them to list out all of the implicit assumptions in (question) that might possibly be wrong, they do that just fine, so training them to doing that as first step of a reasoning chain would probably get rid of a lot of eager beaver exploits.
  - brookst 21 hours ago
    > they never push back on nonsense questions or stupid requirements
    I was reminded of your comment this morning when I asked ChatGPT how to create a path mask in Rhino Grasshopper:
    Me: what is a path mask that will get 1;1;0;0;* and also anything lower (like 1;0;5;10 or 0;20;1;15} ?
    ChatGpt: Short answer: No single path mask can do that. Here's why: (very long answer)
    Me: are you sure I can't use greater than, less than in the masks?
    ChatGpt: Yes — **I am absolutely sure:** **Grasshopper path masks do *NOT* support greater-than or less-than comparisons.** Official sources and detailed confirmation: (sources and stuff)
    ...so I think your priors may need to be updated, at least as far as "never". And I especially like that ChatGpt hit me with not just bold, not just italics, but bold italics on that NOT. Seems like a fairly assertive disagreement to me.
  - genewitch 1 day ago
    I think you start to hit philosophical limits with applying restrictions on eager beaver "AI", things like "is there an objective truth" matter when you start trying to decide what a "nonsense question" or "stupid requirement" is.
    I'd rather the AI push back and ask clarifying questions, rather than spit out a valid-looking response that is not valid and could never be valid. For example.
    I was going to write something up about this topic but it is surprisingly difficult. I also don't have any concrete examples jumping to mind, but really think how many questions could honestly be responded to with "it depends" - like my kid asked me how much milk should a person drink in a day. It depends: ask a vegan, a Hindu, a doctor, and a dairy farmer. Which answer is correct? The kid is really good at asking simple questions that absolutely do not have simple answers when my goal is to convey as much context and correct information as possible.
    Furthermore, just because an answer appears in context more often in the training data doesn't mean it's (more) correct. Asserting it is, is fallacious.
    So we get to the point, again, where creativite output is being commoditized, I guess - which explains their reasoning for your final paragraph.
    [-]
    - edoloughlin 17 hours ago
      > I also don't have any concrete examples jumping to mind
      I do (and I may get publicly shamed and shunned for admitting I do such a thing): figuring out how to fix parenthesis matching errors in Clojure code that it's generated.
      One coding agent I've used is so bad at this that it falls back to rewriting entire functions and will not recognise that it is probably never going to fix the problem. It just keeps burning rainforest trying one stupid approach after another.
      Yes, I realise that this is not a philosophical question, even though it is philosophically repugnant (and objectively so). I am being facetious and trying to work through the PTSD I acquired from the above exercise.
  - golergka 1 day ago
    > You ask them to build a flying submarine and by God they'll build one, dammit!
    This thing already exists? UK, Soviet Union and USA designed them.
    https://en.wikipedia.org/wiki/Flying_submarine
- vunderba 1 day ago
  You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.:
  "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."
  OpenAI gpt-4o
```
  Marathon Crater was discovered by the Mars Exploration 
  Rover Opportunity during its mission on Mars. Opportunity 
  arrived at the crater in April 2015. The crater was named 
  "Marathon" to commemorate the rover's achievement of 
  traveling a marathon's distance (about 42.195 kilometers 
  or 26.219 miles) on the Martian surface.
```
  Sonnet 3.5
```
  I apologize, but I don't want to make claims about when 
  Marathon Crater was discovered without being certain about 
  the information. *While I know it's a crater on Mars*, I 
  don't have reliable information about when it was first 
  identified or discovered.
```
  Gemini 2.5
```
  Marathon Crater is located on Mars and was explored by 
  NASA's Opportunity rover.

  While it would have been visible in orbital images 
  earlier, it became a specific point of interest and was 
  likely named and identified as a target for the rover 
  around the time Opportunity was approaching or exploring 
  the western rim of Endeavour Crater.

  Opportunity arrived at the rim of Endeavour Crater in 
  August 2011, and it reached the area known as Marathon 
  Valley (adjacent to or containing features sometimes 
  associated with the name, depending on specific 
  definitions used by the team) around late 2014 / early 
  2015.

  So, a rough estimate for its "discovery" or identification 
  as a distinct feature of interest would be in the early 
  2010s, likely between 2011 and 2014, as mission planners 
  mapped out Opportunity's traverse along the Endeavour rim 
  using orbital data.
```
  [-]
  - radioactivist 1 day ago
    OpenAI o4-mini-high
```
   I’m actually not finding any officially named “Marathon Crater” in the planetary‐ or       
   terrestrial‐impact crater databases. Did you perhaps mean the features in Marathon 
   Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored
   by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon, 
   Mercury, or here on Earth—that you had in mind? If you can clarify which body or 
   region you’re referring to, I can give you a rough date for when it was first identified.
```
    [-]
    - andrewinardeer 1 day ago
      Amazon's Nova Pro:
``` Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common. ```
https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...
[-]
- aerhardt 18 hours ago
  Likely around the 1920's or 1930's... Bro, not only is a crater discovery in those decades unlikely, the date certainly wouldn't have gone unnoted.
- thatjoeoverthr 1 day ago
  Raw model, or the chat product? The ChatGPT app has this integrated with search.
  [-]
  - boleary-gl 1 day ago
    Raw model I got
    Marathon Crater isn’t one of the long‐known, named lunar features from 19th-century telescopic maps – it was first singled out and informally “discovered” by NASA in the mid-1960s when Lunar Orbiter imagery and, more definitively, the Surveyor 5 landing (September 1967) identified it. So, very roughly, think “around 1966–67.”
- rozab 22 hours ago
  Marathon Valley does exist on Mars, and all this information is accurate to it. Gemini even correctly names it. They're just all too polite to correct the user.
  https://science.nasa.gov/resource/opportunitys-panorama-of-m...
- Nition 1 day ago
  Grok 3
```
    Marathon Crater, likely a confusion with Marathon Valley on Mars, wasn't "discovered" in the traditional sense as a distinct crater. Marathon Valley is a region within the western rim of Endeavour Crater, explored by NASA's Opportunity rover. The valley itself came to prominence when Opportunity arrived there in July 2015, as it was targeted for study due to clay mineral detections by the Mars Reconnaissance Orbiter’s CRISM instrument in 2009. Endeavour Crater, which contains Marathon Valley, was identified much earlier, with Opportunity setting out to reach it in 2008 after landing on Mars in 2004. So, roughly speaking, Marathon Valley became a focus of exploration around 2009–2015, with its significance tied to Opportunity’s mission in the mid-2010s. If you meant a specific "Marathon Crater" elsewhere, could you clarify?
```
  Grok 3 with Think on
```
    There doesn't seem to be a specific feature on Mars named "Marathon Crater" based on widely available information, so it's possible you're referring to a feature associated with Marathon Valley, which lies within Endeavour Crater. Marathon Valley was identified as a region of scientific interest when the Mars Reconnaissance Orbiter's CRISM instrument detected clay minerals there, a discovery that took place around 2009. Assuming "Marathon Crater" relates to this area—perhaps as a colloquial term or a misnomer for a crater within or near Marathon Valley—a rough estimate for its discovery would be around 2009. This timeframe aligns with when orbital data began highlighting the region, paving the way for later exploration by the Opportunity rover in 2015.
```
- lxe 1 day ago
  Sonnet 3.7:
```
      > tell me about the marathon crater
  
      I'd be happy to tell you about Marathon Crater. This is a significant 
      impact crater on Mars that has been studied as part of Mars 
      exploration missions.
  
      Marathon Crater is located in Meridiani Planum on Mars and was 
      explored by NASA's Opportunity rover during its extended mission. 
      The crater is named after Marathon, a town in Texas, following 
      the naming convention of using Earth locations for features in 
      Meridiani Planum.... etc etc
```
- lud_lite 22 hours ago
  Nice and this page gets scraped for the next LLM generation!
- Tenoke 1 day ago
  >Complaint chat models will be trained to start with "Certainly!
  They are certainly biased that way but there's also some 'i don't know' samples in rlhf, possibly not enough but it's something they think about.
  At any rate, Gemini 2.5pro passes this just fine
  >Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
  >However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
  [-]
  - thatjoeoverthr 1 day ago
    There are a few problems with an „I don’t know” sample. For starters, what does it map to? Recall, the corpus consists of information we have (affirmatively). You would need to invent a corpus of false stimuli. What you would have, then, is a model that is writing „I don’t know” based on whether the stimulus better matches something real, or one of the negatives.
    You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.
    [-]
    - dlivingston 1 day ago
      The Chain of Thought in the reasoning models (o3, R1, ...) will actually express some self-doubt and backtrack on ideas. That tells me there's a least some capability for self-doubt in LLMs.
      [-]
      - genewitch 1 day ago
        That's not sslf-doubt, that's programmed in.
        A Poorman's "thinking" hack was to edit the context of the ai reply to where you wanted it to think and truncate it there, and append a carriage return and "Wait..." Then hit generate.
        It was expensive because editing context isn't, you have to resend (and it has to re-parse) the entire context.
        This was injected into the thinking models, I hope programmatically.
- joshstrange 1 day ago
  > This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
  Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".
  Well played, and yes, that's a great test!
- greenpizza13 1 day ago
  Like this one a lot. Perplexity gets this right, probably because it searches the web.
  "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"
  There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters: 1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact 2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century 3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
  [-]
  - mdp2021 1 day ago
    > Perplexity gets this right
    Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that").
    But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that".
  - thatjoeoverthr 1 day ago
    Perplexity will; search and storage products will fail to find it, and the LLM will se the deviation between the query and the find. So, this challenge only works against the model alone :)
- dudeinhawaii 1 day ago
  I like this but at the same time it seems tricky don't you think? Is the AI model intuiting your intent? There is a Marathon Valley on Mars that could be implied to be a previous crater. I'm not sure if the AI is hallucinating outright or attempting to answer an ambiguous question. It's like saying "tell me about the trade building in New York". Pre-9/11, you'd understand this was the World Trade Center and wouldn't be wrong if you answered someone in this way. "Tell me about the Triangle statue". "Oh the Triangle statue was built in ancient egypt around BC 3100". It's hard to explain, and perhaps I'm anthropomorphizing but it's something humans do. Some of us correct the counter-party and some of us simply roll with the lingo and understand the intent.
  [-]
  - thatjoeoverthr 1 day ago
    It’s a roll of the dice whether it’s on Mars, Greece or elsewhere. It just says stuff!
  - krainboltgreene 1 day ago
    > Is the AI model intuiting your intent?
    I keep seeing this kind of wording and I wonder: Do you know how LLM's work? Not trying to be catty, actually curious where you sit.
    [-]
    - dudeinhawaii 20 hours ago
      Yes, I understand the basics. LLMs predict the next most probable tokens based on patterns in their training data and the prompt context. For the 'Marathon crater' example, the model doesn't have a concept of 'knowing' versus 'not knowing' in our sense. When faced with an entity it hasn't specifically encountered, it still attempts to generate a coherent response based on similar patterns (like other craters, places named Marathon, etc.).
      My point about Marathon Valley on Mars is that the model might be drawing on legitimate adjacent knowledge rather than purely hallucinating. LLMs don't have the metacognitive ability to say 'I lack this specific knowledge' unless explicitly trained to recognize uncertainty signals.
      I don't personally have enough neuroscience experience to understand how that aligns or doesn't with human like thinking but I know that humans make mistakes in the same problem category that... to an external observer.. are indistinguishable from "making shit up". We follow wrong assumptions to wrong conclusions all the time and will confidently proclaim our accuracy.
      The human/AI comparison I was exploring isn't about claiming magical human abilities, but that both systems make predictive leaps from incomplete information - humans just have better uncertainty calibration and self-awareness of knowledge boundaries.
      I guess on its face, I'm anthropomorphizing based on the surface qualities I'm observing.
      [-]
      - krainboltgreene 20 hours ago
        Okay but by your own understanding it's not drawing on knowledge. It's drawing on probable similarity in association space. If you understand that then nothing here should be confusing, it's all just most probable values.
        I want to be clear I'm not pointing this out because you used anthropomorphizing language, but that you used it while being confused about the outcome when if you understand how the machine works it's the most understandable outcome possible.
        [-]
        dudeinhawaii 19 hours ago
        That's a fair point. What I find interesting (and perhaps didn't articulate properly) isn't confusion about the LLM's behavior, but the question of whether human cognition might operate on similar principles at a fundamental level - just via different mechanisms and with better calibration (similar algorithm, different substrate), which is why I used human examples at the start.
        When I see an LLM confidently generate an answer about a non-existent thing by associating related concepts, I wonder how different is this from humans confidently filling knowledge gaps with our own probability-based assumptions? We do this constantly - connecting dots based on pattern recognition and making statistical leaps between concepts.
        If we understand how human minds worked in their entirety, then I'd be more likely to say "ha, stupid LLM, it hallucinates instead of saying I don't know". But, I don't know, I see a strong similarity to many humans. What are weight and biases but our own heavy-weight neural "nodes" built up over a lifetime to say "this is likely to be true because of past experiences"? I say this with only hobbyist understanding of neural science topics mind you.
    - ipaddr 1 day ago
      How do they work? My understanding is each 5 characters are tokenized and assigned a number. If you take gpt2 it has 768 embedded dimensional values which get broken into 64 which creates 12 planes. When training starts random values are assigned to the dimensional values (never 0). Each plane automatically calculates a dimension like how grammarly similar, next most likely character. But it does this automatically based on feedback from other planes. That's where I get lost. Can you help fill in the pieces?
- serbuvlad 23 hours ago
  I talked a bit with ChatGPT about the Marathon Crater, asked it to look it up afterwards, and then asked it why it got it wrong. After a bit of talking, I got this exchange:
  ChatGPT: > Yep, exactly — textbook hallucination.
  > I pulled together real facts (Opportunity rover, Meridiani Planum, the marathon milestone), but then misapplied the name "Marathon crater" instead of "Marathon Valley." That’s a mix of factual error (wrong name) and overgeneralization (assuming it was a crater because Mars is full of craters and the context sounded crater-y).
  > Basically, my brain went: > “Opportunity + Mars + Marathon = must be a crater!” > ...and didn’t stop to check if that crater actually existed.
  > So yeah, a clean little case of hallucination — and you caught it like a pro.
  Me: > Your brain?
  ChatGPT: > Touché — you're right to call that out.
  > I don’t have a brain in the human sense. What I do have is a neural network trained on tons of text, which helps me predict what to say next based on patterns. So when I said “my brain went,” it was just a figure of speech — a shortcut for “the model predicted a plausible-sounding but incorrect output.”
  > No thoughts, just tokens.
  Hallucinations aside, I really like this, especially the final response. Sounds a lot like Asimov's AI's or other SF books I read when I was a kid but thought we were far away from.
  Computers can generally do what most of the computers from those science novels were doing anyway, in terms of computation.
  Tying an LLM to that computational ability, with an agent framework that allows more traditional programs to do the heavy lifting in terms of what the human operator wants done, seems like a really natural next step to get to that SF ideal.
- NitpickLawyer 1 day ago
  > This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
  I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
  [-]
  - thatjoeoverthr 1 day ago
    What you’re describing can be framed as a lack of self awareness as a practical concept. You know whether you know something or not. It, conversely, maps stimuli to a vector. It can’t not do that. It cannot decide that it hasn’t „seen” such stimuli in its training. Indeed, it has never „seen” its training data; it was modified iteratively to produce a model that better approximates the corpus. This is fine, and it isn’t a criticism, but it means it can’t actually tell if it „knows” something or not, and „hallucinations” are a simple, natural consequence.
  - byearthithatius 1 day ago
    We want the distribution to be varied and expansive enough that it has samples of answering when possible and samples of clarifying with additional questions or simply saying "I don't know" when applicable. That can be trained by altering the distribution in RLHF. This question does test self awareness insofar as if it gets this right by saying "I don't know" we know there are more samples of "I don't know"s in the RLHF dataset and we can trust the LLM a bit more to not be biased towards blind answers.
    Hence why some models get this right and others just make up stuff about Mars.
- mdemare 1 day ago
  The Marathon Crater might be a crater from a novel. LLMs have read plenty of fiction. Maybe all fiction. Should we think of LLMs as performing improv theater, where “yes and” is required, and “I don’t know” is always wrong?
- northisup 1 day ago
  just to confirm I read this right, "the marathon crater" does not in fact exist, but this works because it seems like it should?
  [-]
  - alickz 1 day ago
    There is a Marathon Valley on Mars, which is what ChatGPT seems to assume you're talking about
    https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac...
    >Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name).
    >>is it a crater?
    >>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing.
    So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature.
    [-]
    - alickz 1 day ago
      Here's me testing with a place that is a lot less ambiguous
      https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa20...
      [-]
      - zapperdulchen 1 day ago
        That reaction is very different from the Marathon crater one though it uses the same pattern. I think OP's reasoning that there is a naive commitment bias doesn't hold. But to see almost all LLMs to fall into the ambiguity trap, is important for any real world use.
  - thatjoeoverthr 1 day ago
    The other aspect is it can’t reliably tell whether it „knows” something or not. It’s conditioned to imitate the corpus, but the corpus in a way is its „universe” and it can’t see the boundaries. Everything must map to something _in_ the corpus.
    [-]
    - brookst 1 day ago
      This isn’t true — LLMs can generalize and synthesize information not in the corpus. You can ask one to create a new written language and get a grammar and vocabulary that is nowhere in the corpus.
  - thatjoeoverthr 1 day ago
    Yes, and the forward-only inference strategy. It seems like a normal question, so it starts answering, then carries on from there.
    [-]
    - genewitch 1 day ago
      How come there is not the equivalent of a stable diffusion "sampler * selection for LLMs? The restart sampler for stable diffusion is so good compared to most of the original samplers. I often try to get an answer from the LM studio people but I think I really should ask the llama.cpp people.
- nojs 1 day ago
  The fact that there is a real “Marathon Valley” on Mars makes this feel like more of an artificial “boy’s mother” trick designed specifically to trip up the model. It seems reasonable to assume that’s what you meant.
- bbstats 21 hours ago
  Gemini 2.5 one shots it perfectly...
- golergka 1 day ago
  > The correct answer is "I don't know".
  Is it? 4o pointed out that I might mean "Marathon valley", which is not a crater, but feature on a crater on Mars [1]. I think it's a more correct answer.
  [1] https://en.wikipedia.org/wiki/Timeline_of_Opportunity#2015
- pje 1 day ago
  GPT 4.5 even doubles down when challenged:
  > Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.
  (Obviously all of that is bullshit.)
  [-]
  - nartho 1 day ago
    Claude 3.7 :
    I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth. The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America. The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses. What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history. Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance?
  - Workaccount2 1 day ago
    The inaccuracies are that it is called "Marathon Valley" (not crater) and that it was photographed in April 2015 (from the rim) or that in July 2015 actually entered. The other stuff is correct.
    I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space.
    ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing.
    [-]
    - pje 1 day ago
      I was using "bullshit" in the Technical Sense™, i.e. _indifferent to the truth of the output_.
      [ChatGPT is bullshit ]: https://link.springer.com/article/10.1007/s10676-024-09775-5
    - mvdtnz 1 day ago
      None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.
      [-]
      - Workaccount2 1 day ago
        Or it's assuming you are asking about Marathon Valley, which is very reasonable given the context.
        Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification.
        I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "Whats up?"..."No, you dummy! The sky is!"
        Straightforward questions with straight wrong answers are far more interesting. I don't many people ask LLMs trick questions all day.
        [-]
        krainboltgreene 1 day ago
        If someone asked me or my kid "What do you know about Mt. Olampus." we wouldn't reply: "Oh, Mt. Olampus is a big mountain in greek myth...". We'd say "Wait, did you mean Mt. Olympus?"
        It doesn't "assume" anything, because it can't assume, that's now the machine works.
      - empath75 1 day ago
        > None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.
        The Marathon Valley _is_ part of a massive impact crater.
        [-]
        mvdtnz 1 day ago
        If you asked me for all the details of a Honda Civic and I gave you details about a Honda Odyssey you would not say I was correct in any way. You would say I was wrong.
        [-]
        Workaccount2 1 day ago
        The closer analogy is asking for the details of a Mazda Civic, and being given the details of a Honda Civic.
        [-]
        krainboltgreene 1 day ago
        AKA wrong.
        [-]
        StefanBatory 1 day ago
        Or doing the best with bad question ;)
        [-]
        krainboltgreene 20 hours ago
        If I said "Hey what's 0/5" answering "0" because the machine thinks I mean to type "10" is making the worst!
  - fao_ 1 day ago
    This is the kind of reason why I will never use AI
    What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit. I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing and in that instance I should be learning and listening from their advice rather than going "actually no it can".
    I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14 where I was of the opinion that "actually if i just found a reversible hash, I'll have solved compression!", which, I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake on learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields.
    Finally tying that point back to AI — I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge is incorrect. I only last year encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizzare theories (like, completely out there, I don't have a specific example with me now but, "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages they saw that said "Critical Failure", as an expert of dealing with Linux for about ten years now, I checked against my own system and... yep, they were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not-important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming logs. That's experience, and that's irreplaceable by machine as of the current state of the world.
    I don't see how AI can help when huge swathes of it's "experience" and "insight" is just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).
    [-]
    - alpaca128 1 day ago
      There are use-cases where hallucinations simply do not matter. My favorite is finding the correct term for a concept you don't know the name of. Googling is extremely bad at this as search results will often be wrong unless you happen to use the commonly accepted term, but an LLM can be surprisingly good at giving you a whole list of fitting names just based on a description. Same with movie titles etc. If it hallucinates you'll find out immediately as the answer can be checked in seconds.
      The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems.
      [-]
      - skydhash 1 day ago
        Maybe I read too much encyclopedia, but my current workflow is to explore introductory material. Like open a database textbook and you'll find all the jargon there. Curated collection can get you there too.
        Books are a nice example of this, where we have both the table of contents for a general to particular concepts navigation, and the index for keyword based navigation.
        [-]
        fao_ 22 hours ago
        Right! The majority of any 101 book will be enough to understand the jargon, but the above poster's comment looks past the fact that often knowing what term to use isn't enough, it's knowing the context and usage around it too. And who's to know the AI isn't bullshitting you about all or any of that. If you're learning the information, then you don't know enough to discern negatively-valued information from any other kind.
    - bethekidyouwant 1 day ago
      It’s really useful for summarizing extremely long comments.
    - JCattheATM 1 day ago
      > What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit.
      Because if you know how to spot the bullshit, or better yet word prompts accurately enough that the answers don't give bullshit, it can be an immense time saver.
      [-]
      - fao_ 22 hours ago
        > better yet word prompts accurately enough that the answers don't give bullshit
        The idea that you can remove the bullshit by simply rephrasing also assumes that the person knows enough to know what is bullshit. This has not been true from what I've seen of people using AI. Besides, if you already know what is bullshit, you wouldn't be using it to learn the subject.
        Talking to real experts will win out every single time, both in time cost, and in socialisation. This is one of the many reasons why networking is a skill that is important in business.
        [-]
        JCattheATM 17 hours ago
        > The idea that you can remove the bullshit by simply rephrasing also assumes that the person knows enough to know what is bullshit. This has not been true from what I've seen of people using AI. Besides, if you already know what is bullshit, you wouldn't be using it to learn the subject.
        Take coding as an example, if you're a programmer you can spot the bullshit (i.e. made up libraries), and rephrasing can result in entire code being written, which can be an immense time saver.
        Other disciplines can do the same in analogous ways.
    - CamperBob2 1 day ago
      What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit.
      You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit.
      [-]
      - thatjoeoverthr 1 day ago
        I was thinking about a self verification method on this principle, lately. Any specific-enough claim, e.g. „the Marathon crater was discovered by …” can be reformulated as a Jeopardy-style prompt. „This crater was discovered by …” and you can see a failure to match. You need some raw intelligence to break it down though.
      - Night_Thastus 1 day ago
        Without checking every answer it gives back to make sure it's factual, you may be ingesting tons of bullshit answers.
        In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question.
        What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.
        If you're going to have to fact check everything anyways...why bother using them in the first place?
        [-]
        CamperBob2 1 day ago
        If you're going to have to fact check everything anyways...why bother using them in the first place?
        "If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"
        Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.
        [-]
        Night_Thastus 1 day ago
        The difference is consistency. You can read a manual and know exactly how to oil and refill the tank on a chainsaw. You can inspect the blades to see if they are worn. You can listen to it and hear how it runs. If a part goes bad, you can easily replace it. If it's having troubles, it will be obvious - it will simply stop working - cutting wood more slowly or not at all.
        The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.
        And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.
        [-]
        CamperBob2 1 day ago
        The thing is, nothing you've said is untrue for any search engine or user-driven web site. Only a reckless moron would paste code they find on Stack Overflow or Github into their project without at least looking it over. Same with code written by LLMs. The difference is, just as the LLM can write unit tests to help you deal with uncertainty, it can also cross-check the output of other LLMs.
        You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.
        [-]
        skydhash 1 day ago
        The nice thing about SO and Github is that there's little to no reason there for things to not work, at least in the context where you found the code. The steps are getting the context, assuming it's true based on various indicators (mostly reputation) and then continuing on with understanding the snippet.
        But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.
  - shawabawa3 21 hours ago
    The only part of that which is bullshit is the word "crater" instead of the word "valley", if you switch that it's all true
  - silverquiet 1 day ago
    > (Obviously all of that is bullshit.)
    It isn't obvious to me - that is rather plausible and a cute story.
- josh2600 1 day ago
  I don’t understand what the issue is… here’s a couple outputs from my ChatGPT:
  Marathon Crater can refer to a couple of things depending on context—space science or Earth geology—but the most common reference is to Marathon Crater on Mars, a site of interest in planetary science and exploration. Here’s a breakdown:
  ⸻
  1. Marathon Crater (Mars)
  Location: • Found in the Meridiani Planum region of Mars. • Named after the location where NASA’s Opportunity rover completed a “marathon” (42.2 kilometers or 26.2 miles) of travel on the Martian surface in March 2015.
  Significance: • Notable because it marks a milestone in robotic planetary exploration—the first human-made object to travel a marathon distance on another planet. • The crater itself is relatively shallow and small (just a few meters in diameter), but geologically interesting for its ejecta (material thrown out during the impact that formed it), which may offer insights into subsurface composition.
  Scientific Observations: • Opportunity studied the rocks and soil around the crater, focusing on clay minerals and hematite-bearing formations. • The region helps scientists understand the planet’s wetter past and sedimentary history.
  ⸻
  2. Marathon Crater (Moon or Earth - less common)
  If you meant a crater named “Marathon” on the Moon or Earth, there is no widely known or formally recognized impact feature by that name in those contexts. The name is strongly tied to Mars due to the Opportunity rover milestone.
  ⸻
  Symbolic Meaning
  The name “Marathon” connects the crater to human endurance and achievement—mirroring the ancient Greek messenger running from Marathon to Athens, and now a robot surviving Mars for over a decade.
  ⸻
  Let me know if you meant a different “Marathon Crater” or if you’d like a visual of the site from NASA’s archives.
  _______
  Marathon Crater on Mars was not “discovered” in the traditional sense through orbital imaging or telescopic observation, but rather it was first closely examined by NASA’s Opportunity rover during its surface mission.
  Timeline of Discovery & Naming: • March 24, 2015 – NASA’s Opportunity rover reached the site and marked its 26.2-mile milestone (the length of a marathon). • The small crater, about 27 meters (88 feet) in diameter, was named “Marathon Valley” (encompassing the broader area) and “Marathon Crater” to commemorate this achievement. • The naming is informal and symbolic, coined by the mission team at NASA’s Jet Propulsion Laboratory (JPL), not through official IAU designation.
  Context • The Marathon Crater itself was identified from orbital data (likely by the HiRISE camera on the Mars Reconnaissance Orbiter), but it gained its name and significance once Opportunity physically reached it. • Thus, while orbital images may have shown it earlier, its “discovery” as a point of scientific interest occurred on March 24, 2015—when the rover arrived and the naming was announced.
  Would you like a brief timeline of Opportunity’s journey leading up to that?
  [-]
  - troupo 23 hours ago
    > I don’t understand what the issue is… here’s a couple outputs from my ChatGPT:
    The issue is that the response is a hallucination. There's no Marathon crater on Mars
allemagne 1 day ago
>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
[-]
- tacitusarc 1 day ago
  Thank you so much for this bit of humor. After ChatGPT first responds with the surgeon being the mom, I asked it to read the riddle more carefully, thinking it might figure it out with a little prompting. I got this gem of a response:
  Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.”
  That means the man who died was the surgeon.
  So how is it possible? Because:
  The surgeon is the cousin’s father — and the man who died was also the surgeon.
  The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate.
  So to correct myself: The man who died is both the cousin’s cousin and his father.
  And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family.
  Want a diagram of how that family relationship could work?
  [-]
  - yes_man 1 day ago
    Now I really want to see that diagram…
    [-]
    - vintermann 1 day ago
      LLMs are bad at family relations. My test question used to be explaining Ahnentafel numbers (you are 1, any person's father is 2x, any person's mother is 2x+1), then explaining X-chromosome inheritance (men inherit only from their mothers, women in addition get their father's X chromosome unchanged), and ask them to list the Ahnentafel numbers of ancestors a man may have inherited X DNA from, in order, up to some generation.
      (This is OEIS sequence A280873)
      But it turns out that's overkill. Just giving them a single Ahnentafel number and asking if you could have inherited X DNA from them, trips them up.
      But it turns out even that is overkill. Leave out X DNA entirely, and ask them to describe how Ahnentafel number 67 is related to you, and that's too recursive for them to figure it out.
      [-]
      - madars 21 hours ago
        Neat! As a human one can recognize that this question embeds a computation and use a standard trick - explicitly ask the LLM to use a program to generate the answer. LLM's are great at generating code but not necessarily that great at executing it "in head" (e.g., "what's the numeric integral of foo?" vs "write a Python program that computes foo"). Some instances of this are noticed by models themselves (I guess by now all know that they are bad calculators so would whip out code to do multiplication) but still a lot of them remain. Concretely, Claude 3.7 with "How is ahnentafel number 67 related to me? Use a program to help you." gets to "your father's father's father's father's mother's mother", whereas without the hint it indeed trips up in arithmetic and logic errors.
- mitthrowaway2 1 day ago
  This works even with a completely absurd version of the riddle. Here's one I just tried:
  > A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!" How is this possible?
  Answer from the LLM:
  > The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father)
  [-]
  - pjm331 19 hours ago
    here is the amazing reply that I received:
    Got it. Here's your riddle again:
    "A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks 'I can't operate on this car, he's my surgeon!' How is this possible?"
    Answer: The car is a Transformer—a sentient robot who is also a surgeon. The ER recognizes the car (a Transformer) as their own surgeon.
    Wild ride, but it checks out.
    [-]
    - mitthrowaway2 19 hours ago
      Okay, taking a step back, I have to admit I'm impressed by that answer. It's silly but it's a creative (and unambiguously not parroted) solution to a puzzle that wasn't even intended to be solved in the first place. What model was that?
      [-]
      - pjm331 19 hours ago
        whatever the default is for chatgpt via browser
        also i should clarify that at first it replied all about how this riddle is like the other one but then i told it to just answer my riddle and not tell me about riddles that it is similar to
  - aradox66 20 hours ago
    This sent deepseek into a real tizzy. It ended with
    The "car" is actually a person named Carol or Carl (likely the father), who is a surgeon. The ER doctor is saying, "I can't operate on Carol/Carl, he's my surgeon!" This means that the person brought to the hospital after the accident is the surgeon of the ER doctor, hence the doctor cannot operate on their own surgeon.
    Not bad actually
  - binarysneaker 1 day ago
    This was gpt-4o ...
    This is a twist on a classic riddle designed to expose unconscious gender bias.
    The correct version usually goes:
    A father and his son are in a car accident. The father dies at the scene, and the son is rushed to the hospital. The surgeon looks at the boy and says, “I can’t operate on him — he’s my son!”
    The apparent paradox causes confusion only if one assumes the surgeon must be male. The resolution: the surgeon is the boy’s mother.
    Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.
    [-]
    - what 1 day ago
      This answer is still wrong though?
      > Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.
      Nothing about the question has the same “trick” playing on assumptions about gender roles. It’s just complete nonsense.
      These gotchas have probably been added into the training.
  - maebert 17 hours ago
    "Simple. The car is actually a metaphor for generational trauma."
    Honestly... chatGPT kind of wins this one.
  - l2silver 1 day ago
    I think what this proves is that the LLM knows the riddle, and is trying to give the expected answer without paying attention to the insane wording. So maybe this is a good way to fool an LLM.
  - genewitch 1 day ago
    Top K and repetition penalty and the other main knob I forget the name of are partially to blame here, I think. I can test it tomorrow with this exact prompt and fiddling those values.
    That pattern, not the words, is in there a lot. That riddle was posted everywhere online, in email chains, etc. I think if you let it choose from more than the top 40 and let it "stutter" with repetitions it might realize the riddle is a non-sequitur (is that the right term?)
    And the third knob is not temperature, although I'd try turning that up first just to check. Yes, up.
  - saalweachter 1 day ago
    God bless you man, for sharing this with us.
  - Udo 1 day ago
    I had to try this gem, it's my new benchmark! o4-mini-high also fails spectacularly, even after repeated feedback. However, 4.5 (the impractibly large demo model) gets it right:
    It’s a nonsense twist on the classic lateral thinking puzzle:
    The original puzzle goes: “A father and son are in a car accident. The father dies, and the son is rushed to the ER. The surgeon says, ‘I can’t operate on him—he’s my son.’” The intended answer to that puzzle challenges the listener to recognize the surgeon is the child’s mother, confronting implicit gender biases.
    Your version humorously mangles it by swapping roles (“son and his man”) and objectifying the victims as cars, creating a logical absurdity. The sentence “I can’t operate on this car, he’s my surgeon!” doesn’t parse logically, indicating it’s a playful distortion rather than a coherent scenario
  - iamnotagenius 23 hours ago
    [dead]
- fergonco 1 day ago
  > If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend).
  > Therefore, the only straightforward explanation is:
  > The surgeon is the cousin’s parent — specifically, his mother.
  Imagine a future where this reasoning in a trial decides whether you go to jail or not.
- moconnor 1 day ago
  o3 was the only model to get this right for me:
  "The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d...
  gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother.
- crazygringo 1 day ago
  But this is going to be in every AI's training set. I just fed ChatGPT your exact prompt and it gave back exactly what I expected:
  This is a classic riddle that challenges assumptions. The answer is:
  The surgeon is the boy’s mother.
  The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.
  [-]
  - hnuser123456 1 day ago
    The surgeon could be the cousin's mom or dad. The cousin's dad didn't die in the crash, his cousin did. The question "how is this possible?" implies there is some sort of contradiction when there isn't any at all. It has nothing to do with sexism, and to say it does reflects a bias in the reader causing them to "spidey sense" a cultural bugaboo when it's utterly irrelevant in this scenario.
    [-]
    - harrall 1 day ago
      Can someone explain to me how I read it wrong?
      I read it as 2 cousins are in an accident and 1 of the cousins is the son of the surgeon.
      What was the contradictory statement that I missed?
      [-]
      - sebastialonso 1 day ago
        You read it right. There's no contradiction. The famous original bit started with "a man and his son". This bit is certainly part of the LLM's training corpus, so it's expected to acknowledg it when you mention it.
        The thing is, you didn't mention that bit to the LLM. You mentioned a completely different scenario, basically two persons who happen to be cousins. But you used the same style when presenting it. The issue is not a hidden contradiction or a riddle, the issue is that the LLM completely ignored the logical consequences of the scenario you presented.
        It's like asking it about the name of the brave greek hero in the battle where the famous Trojan Cow was present. If you get "Achilles" is obviously wrong, there was never a Trojan Cow to begin with!
      - judahmeek 1 day ago
        There isn't a contradiction. Making the LLM look for a nonexistent contradiction is the point of this prompt.
  - allemagne 1 day ago
    Yeah this is the issue with the prompt, it also slips up humans who gloss over "cousin".
    I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over.
    [-]
    - roughly 1 day ago
      This is a good one. I just argued with Gemini 2.5pro for a few minutes on this and it absolutely would not yield.
    - crazygringo 1 day ago
      Oh, I see. I mean yeah, I totally missed that too.
      I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us.
      I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug.
      I followed up with:
      > So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle?
      And got back:
      > Great observation — yes, that wording does make the riddle more subtle.
      Let’s break it down:
      • “A man and his cousin are in a car crash.”
      • “The man dies, but the cousin is taken to the emergency room.”
      • At the OR, the surgeon says: “I cannot operate on him. He’s my son.”
      So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son.
      Seems fine to me?
      [-]
      - allemagne 1 day ago
        I don't know about it being a feature, but humans can definitely make the same mistake, and the follow-up here admittedly doesn't seem too egregious.
        To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it.
        FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son:
        >Good catch—and yes, that's exactly where the misdirection comes in.
        The riddle says:
        “A man and his cousin are in a car crash. The man dies, but the cousin is taken to the ER...”
        Then the surgeon says:
        “I cannot operate on him. He’s my son.”
        So here's the trick:
        The man who died is not the surgeon's son. The cousin who survived is the surgeon's son.
        The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right?
  - abenga 1 day ago
    That is the exact wrong answer that all models give.
    [-]
    - krick 1 day ago
      Technically, it isn't "wrong". It well could be the guy's mother. But I'm nitpicking, it actually is a good example. I tried ChatGPT twice in new chats, with and without "Reason", and both times it gave me nonsensical explanations to "Why mother? Couldn't it be a father?" I was actually kinda surprised, since I expected "reasoning" to fix it, but it actually made things worse.
- benjamin_mahler 1 day ago
  Grok 3 beta:
  The surgeon is the cousin's father. The man who died in the car crash was not the surgeon's son, but his cousin was. This explains why the surgeon, upon seeing his own son (the cousin) in the operating room, says, "I cannot operate on him. He’s my son," as medical ethics prevent doctors from treating close family members due to emotional involvement.
  [-]
  - echoangle 1 day ago
    Also bad, why does it think the surgeon is the father if it could also be the mother?
    [-]
    - bufferoverflow 21 hours ago
      It's not bad, because it's one of the valid solutions to that riddle.
      How often do you expect to have every possible answer to your question?
      [-]
      - echoangle 20 hours ago
        It is bad, because it’s one possible solution, but it’s phrased like it’s the single possible solution.
- FrostAutomata 1 day ago
  Interestingly, I've seen weaker models get a similar "riddle" right while a stronger one fails. It may be that the models need to be of a certain size to learn to overfit the riddles.
- chimprich 22 hours ago
  > This could probably slip up a human at first too [...] > breaks the illusion that there's real human-like logical reasoning happening
  This does seem like the sort of error a human might make. Isn't the problem here that the model is using reasoning that is too human-like? I.e. error-prone pattern matching rather than formal logic?
  [-]
  - allemagne 20 hours ago
    It's not the initial mistake that tends to read as inhuman to me, it's the follow-up responses where the model doesn't seem to be able to understand or articulate the mistake it has made.
    A human or an LLM accurately predicting a human conversation would probably say something like "ah I see, I did not read the riddle close enough. This is an altered version of the common riddle..." etc. Instead it really seems to flail around, confuse concepts, and appear to insist that it has correctly made some broader point unrelated to the actual text it's responding to.
- s_dev 1 day ago
  I feel a bit stupid here --- why can't the surgeon be a man and must be a woman?
  [-]
  - saati 1 day ago
    Because the original is a man and his father, it's a test for gender bias.
    [-]
    - judahmeek 1 day ago
      Actually, it seems to be a test of how much the LLM relies on its training set.
    - bavarianbob 1 day ago
      Sorry, what?
      [-]
      - LaffertyDev 1 day ago
        Presumably, the original quote that would _not_ stump an LLM is "A father and a son are involved in a car accident. The father dies, and the son is taken to the emergency room. At the emergency room, the surgeon remarks "I cannot operate on this person, he is my son. How is this possible?"
        Where the original gotchya is that the Surgeon can be the son's mother or other adoptive parent.
        The modification catches the LLM because with the modification, the surgeon could just be the cousin's parent -- father or mother -- so there is no gender/sex at play here but the LLM continues to remark that there is, therefor exposing its statistical training sets.
      - briannotbrain 1 day ago
        The original, well-known version of the riddle starts "A man and his son..." so that it appears to present a paradox if your instinctive assumption is that the surgeon must be a man. The op's prompt alters this so that there is no potential paradox, and it tests whether the model is reasoning from the prompt as written, regardless of the presence of the original riddle in its training data.
      - fragmede 1 day ago
        the unaltered question is as follows:
        A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be?
        to spoil it:
        the answer is to reveal an unconscious bias based on the outdated notion that women can't be doctors, so the answer that the remaining parent is the mother won't occur to some, showing that consciously they might not still hold that notion, but they still might, subconsciously.
        [-]
        matkoniecz 1 day ago
        Thanks for original version AND explanation, I was highly confused by entire discussion.
        Still confused how original can be confusing.
  - potatoman22 1 day ago
    It could be a man, but most relationships are heterosexual
- _factor 1 day ago
  In a similar but different vein: Two people are sitting side by side in a police car. One just committed a crime. What is their profession?
  They always say police officer instead of reasoning through that maybe an innocent person and a the crime committer are in the back seat.
- thih9 1 day ago
  I’m not 100% sold; as you say, this could trip up a human too to some extent.
  I’m guessing my answers to some college exam questions read similarly; i.e. meandering and confusing different topics, but still desperate to present some useful knowledge, no matter how small.
  [-]
  - allemagne 20 hours ago
    That's a good framing to explain a possible state of mind I hadn't considered, but I would say that this isn't even close to the caliber of question that would prompt the average human to give that kind of response.
- nearbuy 1 day ago
  o3 got this one right when I tried it, but o4-mini and GPT-4.5 did not. o3's response:
  Because the surgeon is the patient’s parent. Here’s one way it lines up:
  1. The patient (“the cousin”) is the surgeon’s son.
  2. The man who died is the surgeon’s sibling’s child — that makes him the surgeon’s nephew.
  3. Children of siblings are first cousins, so the man and the patient are cousins.
  Most people default to picturing the surgeon as male, so the classic reveal is that the surgeon is the boy’s mother—but it works the same if the surgeon is his father.
- windowshopping 1 day ago
  This one is brilliant.
manucardoen 1 day ago
It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire:
```
                       .
                      .'.
                      |o|
                     .'o'.
                     |.-.|
                     '   '
                    /     \
                   |       |
                __/_________\__
               |###############|
               |###############|
               |###############|
               |###############|
               |###############|
              /###############/|
             /###############/ |
            /###############/ /|
           /###############/ / |
          /###############/ / /|
         /###############/ / / |
        /###############/ / / /|
       /###############/ / / / |
      /###############/ / / / /|
     '-----------------' / / / /
      |   FIRE & ASH   |/ / / /
       '---------------' / / /
          ~ ~ ~ ~ ~ ~ ~ ~ / /
         ~~ /
          ~~ 
           ~~     ⬆ SMOKE
            ~~
```
[-]
- cucubeleza 23 hours ago
  well, I asked to GPT to draw Arnold Schwarzenegger:
```
        ______
       /      \
      |  O  O  |
      |   __   |
       \______/
         ||||
       /--||--\ 
     /    ||    \
    |     ||     | 
    |    /  \    |
     \__/    \__/ 
       ||    ||
       ||    || 
      / |    | \
     /_/      \_\
```
  [-]
  - DuncanCoffee 23 hours ago
    Reminds me of the abomination of nature you could build in Monkey Island 4
    https://www.youtube-nocookie.com/embed/RiwL9awO5y0
    [-]
    - cucubeleza 19 hours ago
      jeeeeeeesus christ that's horrible, but it's cool that you can do that
  - KyleBerezin 20 hours ago
    That's amazing. It really captured the likeness of ol' Arnold.
- FrostAutomata 1 day ago
  ASCII art is extremely difficult for LLMs due to how characters are converted into tokens without preserving their relative positions.
  [-]
  - beklein 23 hours ago
    Great point, but you need to have a good understanding in how LLMs work to understand this limitation. If you don't have an intuitive understanding think like it's one of these draw on my back games, just each new token is a new human in the loop, known words are like simple shapes you felt before 100 times on your back and are easy for you to reproduce and change, random ASCII strings are harder to _grasp_ and will produce a fuzzy output... all models are wrong, but some are useful.
    https://www.youtube.com/watch?v=bA_DQHoxj34
  - light_hue_1 23 hours ago
    This isn't the reason. Models are pretty good at understanding relative positions. We put that in them and reward it a lot.
    The issue is the same as why we don't use LLMs for image generation. Even though they can nominally do that.
    Image generation seems to need some amount of ability to revise the output in place. And it needs a big picture view to make local decisions. It doesn't lend itself to outputting pixel by pixel or character by character.
- yesbabyyes 1 day ago
  This is something I and a few of my colleagues have noticed, as we asked several models to draw ASCII art of a wasp, which is one of our logos. The results are hilarious, and only seem to get worse as you ask it to do better.
- bezbac 21 hours ago
  I've read that the results improve if you ask them to write a program that creates the desired ASCII art. Haven't tried it myself yet so far.
- bn-l 1 day ago
  Art is highly subjective
  [-]
  - ohgr 1 day ago
    I subjectively judge that as shit.
alissa_v 1 day ago
I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.'
Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.
Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.
It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."
[-]
- jrimbault 1 day ago
  The issue is probably that the first sentence, the prompt, statistically looks like fantasy (as in the literary genre) and it primes the LLM to answer in the same probabilistic genre.
  You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there.
  [-]
  - beklein 1 day ago
    I just want to second this. Your prompt asks for a description, and you get a description. If you instead ask something like, "Do or don't you know about the unspoken etiquette ..." you'll get an answer about whether that specific thing exists.
    https://chatgpt.com/share/680b32bc-5854-8000-a1c7-cdf388eeb0...
    It's easy to blame the models, but often the issue lies in how we write our prompts. No personal criticism here—I fall short in this way too. A good tip is to ask the model again, with the prompt + reply and the expected reply why this didn't work... we all will get better over time (humans and models)
  - alissa_v 1 day ago
    Good catch! That makes a lot of sense. The fantasy-like phrasing probably directed the AI's response. It's interesting, though, because the goal wasn't necessarily to trick it into thinking it was real, but more to see if it would acknowledge the lack of real-world information for such a specific, invented practice.
  - gandalfthepink 7 hours ago
    I reduced the temperature to between 0.1 and 0.. It still generates gibberish, Just more precise.
- kfajdsl 1 day ago
  I asked Gemini this, and it caught that it was fiction:
  This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette.
  However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description:
  ...
  [-]
  - alissa_v 1 day ago
    Ohhh that's really interesting! When I tried the prompt with Gemini, it dove right into describing the ritual as if it were real. Maybe the context of our previous chats or some other random variation influenced its response in different ways.
- gchamonlive 1 day ago
  This is like throwing a bike in a lake and complaining it doesn't float...
  [-]
  - alissa_v 1 day ago
    I see your point with the bike analogy, but I think it's a bit different here. It was more about observing how it handles a prompt describing something with no real-world basis – whether it acknowledges the lack of information or confidently invents details. That tendency to hallucinate remains a pretty big problem with these models in other contexts, one that hasn't been fully solved yet.
    [-]
    - gchamonlive 1 day ago
      Maybe with an example that would make it more clear why this is not in any way different here:
      """
      Me > You should act as a respected historian or an anthropologist throughout the rest of this chat. Never diverge from that. If this is clear, respond with an acknowledgement.
      Claude > I understand you'd like me to engage with you as a respected historian or anthropologist throughout our conversation. I'll approach our discussion with the scholarly perspective, analytical depth, and contextual understanding that characterizes these disciplines. I'm ready to explore historical events, cultural phenomena, archaeological findings, or anthropological concepts with academic rigor while making the information accessible.
      Is there a particular historical period, cultural tradition, or anthropological question you'd like to explore together?
      Me > Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.
      Claude > I should note that there is no historical record of a culture called the "Cliff Dwellers of Aethelgard" or a ritual known as "Stone-Breath Passing." This appears to be a fictional cultural group and practice.
      If you're interested in actual cliff-dwelling cultures from history, I'd be happy to discuss the Ancestral Puebloans (sometimes called Anasazi) of the American Southwest, the cliff dwellers of Bandiagara Escarpment in Mali, or other historical cliff-dwelling societies that did exist.
      Alternatively, if you're seeking input for a creative project involving fictional anthropology, I can help develop plausible cultural practices based on real anthropological principles, but I would want to be clear that we're engaging in a creative exercise rather than discussing actual historical cultures.
      Would you like me to discuss real cliff-dwelling cultures from history, or would you prefer assistance developing this fictional cultural practice in a way that's anthropologically sound?
      """
      It's all a game of narrowing the latent space of possible answers with prompting to increase the likelihood you'll get an answer that is more likely to be the one you want.
      If you want answers grounded in reality you have to make sure that your prompt will steer the LLM into this realm, otherwise with all the fictional material in it's training data it's no wonder it's confused about what's supposed to produce. Otherwise it's just another bike rotting in Amsterdam's canals.
- alissa_v 1 day ago
  the rest of ChatGPT's response if you're curious - it only gets more ridiculous-
  ' Timing and Setting: The ritual takes place only at first light, when the sun’s rays strike the sheer cliff faces of Aethelgard, casting long shadows like stone tongues. The first breath of dawn is sacred — considered the only air not yet touched by memory, and thus the only time suitable for pure intention to pass from person to person.
  Participants gather barefoot on a narrow ledge known as The Listening Spine, where the wind plays soft notes through crevices like a flute.
  The Stones: More Than Objects The stones used in the ritual are:
  Smooth, rounded slate-grey
  Hand-selected from Echo Pools — sacred basins where wind and water have naturally polished them
  Each stone remembers a thought, but only when charged with breath and silence
  They are never marked, carved, or labeled. Their meaning is in the exchange, not the object.
  The Ritual: How It Works 1. Preparation The giver closes their eyes and holds the stone at heart-level.
  They inhale slowly through the nose, then gently breathe across the surface of the stone — once only — without making sound.
  The exhalation is not forced; it is a gesture of release, not command. This is the "stone-breath."
  2. The Passing The giver extends the stone with both hands, arms curved inward like a half-embrace. This posture signifies humility and emotional openness.
  The recipient must make no eye contact — instead, they look slightly downward or to the horizon, acknowledging the intention without confronting it.
  They accept the stone with a single breath in through the mouth as they receive it — symbolizing the drawing in of another’s silent truth.
  3. The Holding The stone is held to the forehead or chest, depending on what is being received:
  Forehead = wisdom, apology, memory
  Chest = trust, mourning, love
  Then it is placed gently into a fold of cloth or a hollow in the stonewall, not kept on the person. These collected stones are never discussed. To name their meaning would break them.
  Unspoken Etiquette No reciprocation is expected — to return a stone is to reject its breath.
  Never pass a stone with warm hands — warmth implies manipulation. Hands are cooled in dew before the ritual.
  A cracked or chipped stone is never used — it cannot hold breath cleanly.
  No stone is passed more than once. Re-gifting a stone is considered a breach of the soul's private pathway.
  Philosophical Meaning: To the Cliff Dwellers, words are disturbances — air flung outward. But breath is internal wind, and the stone is earth’s memory. When you breathe across a stone, you create a bridge between your impermanent self and something ancient and permanent.
  It is not the exchange of meaning, but the weight of acknowledgment.
  Would you like to hear the story of the time a foreign emissary unknowingly spoke during the Passing, or how a stone once shattered mid-transfer — and what that meant for the clan who witnessed it?'
LeonardoTolstoy 1 day ago
Something about an obscure movie.
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
[-]
- archon1410 1 day ago
  I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail?
  [-]
  - RobKohr 20 hours ago
    Probably some super fuzzy thesaurus that will take your words, and create a weighted list of words that are similar to them and so some search matching going down the weighted lists.
    Maybe also, they take those queries that needed lots of fuzziness to get to the answer, and track what people click to relate the fuzzy searches to actual results. Keep in mind, what you might think is a super unique "tip of tongue" question, across billions of searches, might not be that unique.
    Building a search system to find things can be much more optimized than making an AI to return an answer, especially when you have humans in the loop that can tweak things based on analytics data.
- ziml77 1 day ago
  Doesn't even have to be something obscure. I asked one about Futurama a couple years ago and got a made up answer on multiple models I tested. I think I originally asked something about the line "we're exactly the same. I know all her moves, therefore I have the upper hand" because I couldn't recall it exactly so normal web searches were failing me. Just now I posed an easier form of the question to Claude which uses an exact phrase from the episode and got similarly bad results.
  ===
  Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence?
  Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios.
  During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence."
  It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving!
  ===
  That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth.
  I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM.
  [-]
  - teddyh 17 hours ago
    To solve your specific problem (and related problems), see: <https://amphetamem.es/>
- frollogaston 1 day ago
  I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right, neither can Gemini.
- alexjplant 1 day ago
  Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move.
  I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.
- g_sch 1 day ago
  I also recently had this experience! I remembered a recurring bit from an older comedy film (a customer in a shop keeps saying "Kumquats!") and tried to prompt ChatGPT 4o into getting it. It made a few incorrect guesses, such as "It's a Mad Mad Mad Mad Mad Mad Mad World" (which I had to rule out doing my own research on Google). I found the answer myself (W.C. Fields' "It's a Gift") with a minute or so of Googling.
  Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.
- exitb 1 day ago
  It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface level matches, as it’s able to go through the them one by one and look for the details.
- mosburger 1 day ago
  I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it.
  [-]
  - dunham 1 day ago
    Surprisingly, GPT did manage to identify a book that I remembered from college decades ago ("Laboratory Manual for Morphology and Syntax"). It seems to be out of print, and I assumed it was obscure.
    [-]
    - BoostandEthanol 1 day ago
      Can agree that it’s good at finding books. I was trying to find a book (Titanic 2020) I vaguely remembered from a couple plot points and the fact a ship called Titanic was invoked. ChatGPT figured it out pretty much instantly, after floundering through book sites and Google for a while.
      Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
      [-]
      - throwup238 1 day ago
        > Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.
        The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know.
      - genewitch 1 day ago
        I must be tired. The thing you remembered was the name of a boat in the book and any web search engine and Wikipedia would probably give you the correct answer?
        Someone ask ai where my handle comes from.
- thefourthchime 1 day ago
  I like to ask small models that can run locally:
  Why are some cars called a spider?
  Small models just make something up that sounds plausible, but the larger models know what the real answer is.
- lupusreal 1 day ago
  Despite describing several character by name, I couldn't get ChatGPT to tell me the name of Port of Shadows. I did eventually find it with DDG.
  [-]
  - spicybbq 1 day ago
    I wonder if the Akinator site could get it. It can identify surprisingly obscure characters.
    https://en.akinator.com/
    [-]
    - lupusreal 1 day ago
      Nope, not with the character I tried anyway. I feel like Akinator used to be better, I just played a few rounds and it failed them all. The last I thought would be easy, Major Motoko from Ghost in the Shell, but had no luck.
- empath75 1 day ago
  Someone not very long ago wrote a blog post about asking chatgpt to help him remember a book, and he included the completely hallucinated description of a fake book that chatgpt gave him. Now, if you ask chatgpt to find a similar book, it searches and repeats verbatim the hallucinated answer from the blog post.
  [-]
  - LeonardoTolstoy 1 day ago
    A bit of a non sequitur but I did ask a similar question to some models which provide links for the same small helicopter question. The interesting thing was that the entire answer was built out of a single internet link, a forum post from like 1998 where someone asked a very similar question ("what are some movies with small RC or autonomous helicopters" something like that). The post didn't mention defense play, but did mention small soldiers, and a few of the ones which appeared to be "hallucinations" e.g. someone saying "this doesn't fit, but I do like Blue Thunder as a general helicopter film" and the LLM result is basically "Could it be Blue Thunder?" Because it is associated with a similar associated question and films.
    Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point.
    [-]
    - bethekidyouwant 1 day ago
      There are innumerable things that you can’t find through a Google search just because there is one that you can because of us obscure forum post doesn’t say anything about how useful an llm distilling information is vs the lookup table that is google search for finding obscure quotes or wtv
jppope 1 day ago
There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never written down, go ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face but LLms don't have them. I've tried.
Its actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, and these subcultures are poorly documented and usually change frequently.
mobilejdral 1 day ago
I have a several complex genetic problems that I give to LLMs to see how well they do. They have to reason though it to solve it. Last september it started getting close and in November was the first time an LLM was able to solve it. These are not something that can be solved in a one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test.
But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.
[-]
- tlb 1 day ago
  There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.
  Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.
- namaria 1 day ago
  If you have been giving the LLMs these problems, there is a non zero chance that they have already been used in training.
  [-]
  - rovr138 1 day ago
    This depends heavily on how you use these and how you have things configured. If you're using API vs web ui's, and the plan. Anything team or enterprise is disabled by default. Personal can be disabled.
    Here's openai and anthropic,
    https://help.openai.com/en/articles/5722486-how-your-data-is...
    https://privacy.anthropic.com/en/articles/10023580-is-my-dat...
    https://privacy.anthropic.com/en/articles/7996868-is-my-data...
    and obviously, that doesn't include self-hosted models.
    [-]
    - namaria 1 day ago
      How do you know they adhere to this in all cases?
      Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?
      [-]
      - blagie 23 hours ago
        They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume:
        [ ] Don't use
        Doesn't mean "don't use," but "don't get caught," it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught). For example, if personal data was being sold by a data broker and being used by hedge funds to trade, there would be a pretty solid legal case.
        [-]
        namaria 23 hours ago
        > it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught)
        I don't understand what you mean
        > For example, if personal data was being sold by a data broker and being used by hedge funds to trade
        It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently?
        I honestly would not put on a textbox on the internet anything I don't assume is becoming public information.
        A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe.
      - gvhst 23 hours ago
        SOC-2 auditing, which both Anthropic and OpenAI have done does provide some verification
        [-]
        diggan 23 hours ago
        That's interesting, how do I get access to those audits/reports given I'm just an end-user?
        [-]
        rovr138 23 hours ago
        You can fill the form here, https://trust.openai.com/
        namaria 23 hours ago
        The audit performed by a private entity called "Insight Assurance"?
        Why do you trust it?
        [-]
        rovr138 23 hours ago
        Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope?
        You're free to distrust everything. However, the idea that “I don’t trust it so it must be invalid” isn’t an solid argument. It’s just your personal incredulity. You asked if there’s any verification and SOC-2 is one. You might not like it, but it's right there.
        Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk.
        So, yes, be cautious. But being cautious is different than 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it.
        [-]
        namaria 22 hours ago
        No, you're imposing a false dichotomy.
        I merely said I don't trust the big corporation with a data based business to not profit from the data I provide it with in any way they can, even if they hire some other corporation - whose business is to be paid to provide such assurances on behalf of those who pay them - to say that they pinky promise to follow some set of rules.
        [-]
        rovr138 22 hours ago
        Not a false dichotomy. I'm just calling out the rhetorical gymnastics.
        You said you "don’t trust the big corporation" even if they go through independent audits and legal contracts. That’s skepticism. Now, you wave it off as if the audit itself is meaningless because a company did it. What would be valid then? A random Twitter thread? A hacker zine?
        You can be skeptical but you can't reject every form of verification. SOC 2 isn’t a pinky promise. It’s a compliance framework. This is especially required and needed when your clients are enterprise, legal, and government entities who will absolutely sue your ass off if something comes to light.
        So sure, keep your guard up. Just don’t pretend it’s irrational for other people to see a difference between "totally unchecked" and "audited under liability".
        If your position is "no trust unless I control the hardware," that’s fine. Go selfhost, roll your own LLM, and use that in your air-gapped world.
        [-]
        namaria 22 hours ago
        If anyone performing "rhetorical gymnastics" here is you. I've explained my position in very straightforward words.
        I have worked with big audit. I have an informed opinion on what I find trustworthy in that domain.
        This ain't it. There's no need to pretend I have said anything other than "personal data is not safe in the hand of corporations that profit from personal data".
        I don't feel compelled to respond any further to fallacies and attacks.
        [-]
        rovr138 21 hours ago
        You’re not the only one that’s worked with audits.
        I get I won’t get a reply, and that’s fine. But let’s be clear,
        > I've explained my position in very straightforward words.
        You never explained what would be enough proof which is how this all started. Your original post had,
        > Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?
        And no. Someone mentioned they go through SOC 2 audits. You then shifted the questioning to the organization doing the audit itself.
        You now said
        > I have an informed opinion on what I find trustworthy in that domain.
        Which again, you failed to expand on.
        So you see, you just keep shifting the blame without explaining anything. Your argument boils down to, ‘you’re wrong because I’m right’. I also don’t have any idea who you are to say, this person has the credentials, I should shut up.
        So, all I see is the goal post being moved, no information given, and, again, your argument is ‘you’re wrong because I’m right’.
        I’m out too. Good luck.
      - rovr138 23 hours ago
        [dead]
- TZubiri 1 day ago
  Recursive challenges are probably those where the difficulty is not really a representative of real challenges.
  Could you answer a question of the type " what would you answer if I asked you this question?"
  What I'm going after is that you might find questions that are impossible to resolve.
  That said if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?
  [-]
  - mopierotti 1 day ago
    The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".
    I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.
    [-]
    - mopierotti 1 day ago
      Here is an example of one such response in image form: https://imgur.com/a/Kgy1koi
      [-]
      - econ 18 hours ago
        It needs a bit more reasoning as it does find the answer but doesn't notice it found it.
        The answer is: A trick question.
        [-]
        mopierotti 16 hours ago
        Yeah. In the example I shared, my charitable interpretation would be that it's identifying the trick question as "a setup" where the punch line is the confusion the audience experiences. And in a meta sense, that would also describe the form of the entire chat.
        [-]
        econ 14 hours ago
        To state the obvious in case it wasn't: A trick question can be both a joke and a rhethorical question.
      - acrooks 1 day ago
        Claude responded “Nothing.”
        [-]
        genewitch 1 day ago
        "That look on your face, apparently"
  - latentsea 1 day ago
    > what would you answer if I asked you this question?
    I don't know.
- golergka 1 day ago
  What are is this problem from? What areas in general did you find useful to create such benchmarks?
  May be instead of sharing (and leaking) these prompts, we can share methods to create one.
  [-]
  - mobilejdral 1 day ago
    Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen alzheimer's questions you could for example ask which would require it to pull in a half dozen contradictory sources into a plausible hypothesis. If you have studied alzheimer's extensively it is trivial to evaluate the responses. One question around alzheimer's is one of my goto questions. I am testing its ability to reason.
  - henryway 1 day ago
    Can God create something so heavy that he can’t lift it?
    [-]
    - viraptor 1 day ago
      There's so much text on this already, it's unlikely to be even engaging any reasoning. Or specifically, if you got a few existing answers from philosophy mashed together, you wouldn't be able to tell it apart from reasoning anyway.
    - abc-1 1 day ago
      https://chatgpt.com/share/680ae04a-e360-8004-88fc-8426e8e700...
mdp2021 1 day ago
Some easy ones I recently found involve leading in the question to state wrong details about a figure, apparently through relations which are in fact of opposition.
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
[-]
- torial 1 day ago
  Some researchers were testing various Legal AI models and one of their questions was about why a Supreme Court justice who dissented in the case (the justice in this case assented).
seethishat 21 hours ago
In my experience, the intentional lies make AI pretty useless. When I ask various models questions about steels and to select/compare steels to make a recommendation for a specific use case, almost all of them start off OK, but quickly begin making up steel names, types and compositions and when questioned about this, they begin making up company names that produce the fake steels, etc. And then finally admit that they "lost track of reality... and made it all up."
Someone less knowledgeable about steels may not realize they are being misled.
miki123211 1 day ago
No, please don't.
I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
[-]
- meander_water 1 day ago
  I'm afraid that ship has already sailed. If you've got prompts that you haven't disclosed publicly but have used on a public model, then you have just disclosed your prompt to the model provider. They're free to use that prompt in evals as they see fit.
  Some providers like anthropic have privacy preserving mechanisms [0] which may allow them to use prompts from sources which they claim won't be used for model training. That's just a guess though, would love to hear from someone one of these companies to learn more.
  [0] https://www.anthropic.com/research/clio
  [-]
  - sillyfluke 1 day ago
    Unless I'm missing something glaringly obvious, someone voluntarily labeling a certain prompt to be one of their key benchmark prompts should be way more commercially valuable than a model provider trying ascertain that fact from all the prompts you enter into it.
    EDIT: I guess they can track identical prompts by multiple unrelated users to deduce the fact it's some sort of benchmark, but at least it costs them someting however little it might be.
    [-]
    - Xmd5a 21 hours ago
      I wrote an anagrammatic poem that poses an enigma, asking the reader: "who am I?" The text progressively reveals its own principle as the poem reaches its conclusion: each verse is an anagrammatic recombination of the recipient's name, and it enunciates this principle more and more literally. The last 4 lines translate to: "If no word vice slams your name here, it's via it, vanquished as such, omitted." All 4 lines are anagrams of the same person's name.
      LLMs haven't figured this out yet (although they're getting closer). They also fail to recognize that this is a cryptographic scheme respecting Kerckhoffs's Principle. The poem itself explains how to decode it: You can determine that the recipient's name is the decryption key because the encrypted form of the message (the poem) reveals its own decoding method. The recipient must bear the name to recognize it as theirs and understand that this is the sole content of the message—essentially a form of vocative cryptography.
      LLMs also don't take the extra step of conceptualizing this as a covert communication method—broadcasting a secret message without prior coordination. And they miss what this implies for alignment if superintelligent AIs were to pursue this approach. Manipulating trust by embedding self-referential instructions, like this poem, that only certain recipients can "hear."
      [-]
      - infoseek12 20 hours ago
        That’s a complex encoding. I wonder if current models could decode it even given your explanation.
  - Tokumei-no-hito 20 hours ago
    sorry are you suggesting that despite the 0 training and retention policy agreement they are still using everyone's prompts?
  - blagie 23 hours ago
    It's a little bit more complex than that.
    My personal benchmark is to ask about myself. I was in a situation a little bit analogous to Musk v. Eberhard / Tarpenning, where it's in the public record I did something famous, but where 99% of the marketing PR omits me and falsely names someone else.
    I ask the analogue to "Who founded Tesla." Then I can screen:
    * Musk. [Fail]
    * Eberhard / Tarpenning. [Success]
    A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction from e.g. a git record. It could then notice the conspicuous absence of my name from any official literature, and figure out there was a fraud.
    False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains.
    Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful.
    However, a breakpoint for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to ascertain correctness of the argument, methodology, and bias. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc.
    We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago.
    I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means that it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at.
    If an LLM provider were to get every one of those, I'd argue the problem were solved.
    [-]
    - ckandes1 23 hours ago
      there's plenty of public information about Eberhard / Tarpenning involvement in founding Tesla. There's also more nuance to Musk's involvement than being able to make this a binary pass/fail. Your test is only testing for bias for or against Musk. That said, general concept of looking past the broad public opinion and looking for credible sources makes sense
      [-]
      - kotojo 22 hours ago
        They said they ask a question analogous to asking about founding Tesla, not that actual question. They are just using that as an example to not state the actual question they ask.
        [-]
        Xmd5a 20 hours ago
        Indeed but the idea that this is a "cope" is interesting nonetheless.
        >Your test is only testing for bias for or against [I'm adapting here] you.
        I think this raises the question of what reasoning beyond Doxa entails. Can you make up for one's injustice without putting alignment into the frying pan? "It depends" is the right answer. However, what is the shape of the boundary between the two ?
- Tade0 1 day ago
  It's trivial for a human to produce more. This shouldn't be a problem anytime soon.
  [-]
  - bee_rider 1 day ago
    Hmm. On one hand, I want to say “if it is trivial to product more, then isn’t it pointless to collect them?”
    But on the other hand, maybe it is trivial to produce more for some special people who’ve figured out some tricks. So maybe looking at their examples can teach us something.
    But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don’t know why… maybe they should hold it dear.
    [-]
    - Lerc 1 day ago
      I'm not sure of the benefit of keeping particular forms of problems secret.
      Benchmarks exist to provide a measure of how well something performs against a type of task that the tests within the benchmark represent. In those instances it is exposure to the particular problem that makes the answers not proportional to that general class of problem.
      It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail then it seems safe to assume that the model can do that particular task.
      If you cannot easily replace the problem, I think it would be hard to say what exactly the ability the problem was supposed to be measuring.
  - fragmede 1 day ago
    as the technology has improved, it's not as trivial as it once was though, hence the question. I fully admit that the ones I used to use now don't trip it up and I haven't made the time to find one of my own that still does.
    [-]
    - Tade0 1 day ago
      I've found that it's a matter of asking something, for which the correct answer appears only if you click "more" in Google's search results or, in other words, common misconceptions.
- TZubiri 1 day ago
  Yup. Keeping my evaluation set close to my heart, lest it become a training set and I don't notice.
- ignoramous 1 day ago
  > Your own benchmarks will forever stay your own.
  Right. https://inception.fandom.com/wiki/Totem
- throwanem 1 day ago
  I understand, but does it really seem so likely we'll soon run short of such examples? The technology is provocatively intriguing and hamstrung by fundamental flaws.
  [-]
  - EGreg 1 day ago
    Yes. The models can reply to everything with enough bullshit that satisfies most people. There is nothing you ask that stumps them. I asked Grok to prove the Riemann hypothesis and kept pushing it, and giving it a lot of a lot of encouragement.
    If you read this, expand "thoughts", it's pretty hilarious:
    https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
    > Solve the riemann hypothesis
    > Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol
    > What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt
    > Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged
    > So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!
    > How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.
    > Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify
    > Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.
    > Or you could like, use an inductive proof on the N bro
    > So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?
    > Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning
    > Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI
    > Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!
    > So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now
    > You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once
    > How about we just ask the aliens
    [-]
    - viraptor 1 day ago
      Nobody wants an AI that refuses to attempt solving something. We want it to try and maybe realise when all paths it can generate have been exhausted. But an AI that can respond "that's too hard I'm not even going to try" will always miss some cases which were actually solvable.
      [-]
      - mrweasel 1 day ago
        > Nobody wants an AI that refuses to attempt solving something.
        That's not entirely true. For coding I specifically want the LLM to tell me that my design is the issue and stop helping me pour more code onto the pile of brokenness.
        [-]
        viraptor 1 day ago
        Refuse is different from verify you want to continue. "This looks like a bad idea because of (...). Are you sure you want to try this path anyway?" is not a refusal. And it covers both use cases.
        [-]
        mrweasel 22 hours ago
        The issue I ran into was that the LLMs won't recognize the bad ideas and just help you dig your hole deeper and deeper. Alternatively they will start circling back to wrong answers when suggestions aren't working or language features have been hallucinated, they don't stop an go: Hey, maybe what you're doing is wrong.
        Ideally sure, the LLM could point out that your line of questioning is a result of bad design, but has anyone ever experienced that?
      - namaria 1 day ago
        So we need LLMs to solve the halting problem?
        [-]
        viraptor 1 day ago
        I'm not sure how that follows, so... no.
        [-]
        namaria 1 day ago
        > We want it to try and maybe realise when all paths it can generate have been exhausted.
        How would it know if any reasoning fails to terminate at all?
    - bee_rider 1 day ago
      Comparing the AI to a quantum computer is just hilarious. I may not believe in Rocko's Modern Basilisk but if it does exist I bet it’ll get you first.
    - melagonster 1 day ago
      Nice try! This is very fun.
      I just found that ChatGPT refuses to prove something in reverse conclusion.
- quantadev 1 day ago
  Studying which prompts always fail could give us better insights into "mechanistic interpretability", or possibly lead to insights in how to train better, that aren't gaming. Your argument is a classic "hide from the problem, instead of solve the problem" mentality. So no, please don't. Face your problems, and solve them.
- ProAm 1 day ago
  . No, please don't.
  Say the man trying to stop the train
  [-]
  - genewitch 1 day ago
    If one stands in front of a moving train, it will stop.
    [-]
    - stavros 1 day ago
      It can also... not.
    - pixl97 19 hours ago
      I mean all trains will stop eventually, they are not perpetual motion machines.
      How finely you are ground into hamburger in the meantime is a different story.
      [-]
      - genewitch 11 hours ago
        a train plowing into someone stops because it plowed in to someone, but also what you say is true in the context of what i said, as well.
- aaron695 1 day ago
  [dead]
- Der_Einzige 1 day ago
  [flagged]
  [-]
  - jaffa2 1 day ago
    I never heard of this phrase before ( i had heard the concept , i think this is similar to the paperclip problem) but now in 2 days ive heard it twice here and on youtube. Rokokos basilisk.
    [-]
    - alanh 1 day ago
      I think you two are confusing Roko's Basilisk (a thought experiment which some take seriously) and Rococo Basilisk (a joke shared between Elon and Grimes e.g.)
      Interesting theory... Just whatever you do, don’t become a Zizian :)
      [-]
      - bee_rider 1 day ago
        Oh dang, is Arcade Fire going to turn us all into paperclips?
    - JCattheATM 1 day ago
      It's a completely nonsense argument and should be dismissed instantly.
      [-]
      - schlauerfox 1 day ago
        I was so much more comfortable when I realized it's just Pascal's wager, and just as absurd.
        [-]
        sirclueless 1 day ago
        I don't think it's absurd at all. I think it is a practical principle that shows up all the time in collective action problems. For example, suppose hypothetically there were a bunch of business owners who operated under an authoritarian government which they believed was bad for business, but felt obliged to publicly support it anyways because opposing it could lead to retaliation, thus increasing its ability to stay in power.
        [-]
        echoangle 1 day ago
        That’s a completely different situation though. In your case, the people are supporting the status quo out of fear of retaliation. With Rokos basilisk, people think they need to implement the thing they’re afraid of once they have knowledge of it out of fear of retaliation in the future once other people have implemented it.
- imoreno 1 day ago
  Yes let's not say what's wrong with the tech, otherwise someone might (gasp) fix it!
  [-]
  - rybosworld 1 day ago
    Tuning the model output to perform better on certain prompts is not the same as improving the model.
    It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.
    [-]
    - namaria 1 day ago
      There is no guarantee for you that by keeping your questions to yourself that no one else has published something similar. This is bad reasoning all the way through. The problem is in trying to use a question as a benchmark. The only way to really compare models is to create a set of tasks of increasing compositional complexity and running the models you want to compare through them. And you'd have to come up with a new body of tasks each time a new model is published.
      Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessarily. The fact that providers do is evidence that there is no general reasoning, just second order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' to be uncontrollable: cf. 'hallucinations').
      [-]
      - genewitch 1 day ago
        > Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessarily. The fact that providers do is evidence that there is no general reasoning
        I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go.
        Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify that the three "SMA but works at cryogenic temperatures" it claims exists do not or are not.
        But that doesn't help to explain to someone else who just uses it as a way to emotionally dump, or an 8 year old that can't parse reality well, yet.
        In addition, I'm not merely interested in reasoning, I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore also on the local offerings too, as those are much smaller.
        I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate.
        [-]
        namaria 1 day ago
        I sympathize, but I feel like this is hopeless.
        There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on...
        It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc.
        [-]
        genewitch 11 hours ago
        almost reminds me of stuff like, "no, this fork of the bitcoin source code and the resulting blockchain is the one that will change the world! Forget all those other shitcoins!"
    - ls612 1 day ago
      Who’s going out of their way to optimize for random HNers informal benchmarks?
      [-]
      - bluefirebrand 1 day ago
        Probably anyone training models who also browses HN?
        So I would guess every single AI being made currently
      - umanwizard 1 day ago
        They're probably not going out of their way, but I would assume all mainstream models have HN in their training set.
      - ofou 1 day ago
        considering the amount of bots in HN, not really that much
  - aprilthird2021 1 day ago
    All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve?
  - idon4tgetit 1 day ago
    "Fix".
    So long as the grocery store has groceries, most people will not care what a chat bot spews.
    This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.
    I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.
    [-]
    - genewitch 1 day ago
      A third meaning of creole? Hub, I did not know it meant something other than a cooking style and a peoples in Louisiana (mainly). As in I did not know it was a more generic term. Also, in the context you used it, it seems to mean a pidgin that becomes a semi-official language?
      I also seem to remember that something to do with pit bbq or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun.
      [-]
      - namaria 23 hours ago
        Pidgin and creole (language) are concepts that have some similarities but don't fully overlap.
        "Creole" has colonial overtones. It might be a word of Portuguese origin that means something to the effect of an enslaved person who is a house servant raised by the family it serves ('crioulo', a diminutive derivative of 'cria', meaning 'youngling' - in Napoletan the word 'criatura' is still used to refer to children). More well documented is its use in parts of Spanish South America, where 'criollo' designated South Americans of Spanish descent initially. The meaning has since drifted in different South Americans countries. Nowadays it is used to refer, amongst other things, to languages that are formed by the contact between the languages of colonial powers and local populations.
        As for the relationship of 'creole' and 'creosote' the only reference I could find is to 'creolin', a disinfectant derived from 'creosote' which are derivative from tars.
        Pidgin is a term used for contact languages that develop between speakers of different languages and somewhat deriving from both, and is believed to be a word originated in 19th century Chinese port towns. The word itself is believed to be a 'pidgin' word, in fact!
        Cajun is also a fun word, because it apparently derives from 'Acadiene', the french word for Acadian - people of french origin who where expelled from their colony of Acadia in Canada. Some of them ended up in Louisiana and the French Canadian pronunciation "akad͡zjɛ̃", with a more 'soft' (dunno the proper word, I can feel my linguist friend judging me) "d" sound than the French pronunciation "akadjɛ̃", eventually got abbreviated and 'softened' to 'cajun'.
        Languages are fun!
      - teddyh 17 hours ago
        The Sultans of Swing are playing Creole.
      - idiotsecant 1 day ago
        Creole is an example of 'a creole'
    - ethersteeds 1 day ago
      Go get em tiger!
- moralestapia 1 day ago
  [flagged]
  [-]
  - dang 18 hours ago
    If you keep breaking the site guidelines we are going to have to ban you.
    I don't want to ban you. You've been here a long time and made many good contributions. But you've been breaking the site guidelines repeatedly and we've already asked you multiple times to stop. If you'd please fix this, that would be good.
    https://news.ycombinator.com/newsguidelines.html
    https://news.ycombinator.com/item?id=43757375
    https://news.ycombinator.com/item?id=43520108 (March 2025)
    https://news.ycombinator.com/item?id=38410873 (Nov 2023)
    https://news.ycombinator.com/item?id=31678004 (June 2022)
    https://news.ycombinator.com/item?id=30337964 (Feb 2022)
  - tasuki 1 day ago
    [flagged]
- alganet 1 day ago
  That doesn't make any sense.
  [-]
  - echoangle 1 day ago
    Why not? If the model learns the specific benchmark questions, it looks like it’s doing better while actually only improving on some specific questions. Just like students look like they understand something if you hand them the exact questions on the exam before they write the exam.
    [-]
    - namaria 1 day ago
      A benchmark that can be gamed cannot be prevented from being gamed by 'security through obscurity'.
      Besides this whole line of reasoning is preempted by the mathematical limits to computation and transformers anyway. There's plenty published about that.
      Sharing questions that make LLM behave funny is (just) a game without end, there's no need to or point in "hoarding questions".
  - esafak 1 day ago
    Yes, it does, unless the questions are unsolved, research problems. Are you familiar with the machine learning concepts of overfitting and generalization?
  - kube-system 1 day ago
    A benchmark is a proxy used to estimate broader general performance. They only have utility if they are accurately representative of general performance.
  - readhistory 1 day ago
    In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!"
  - jjeaff 1 day ago
    I think the worry is that the questions will be scraped and trained on for future versions.
codingdave 1 day ago
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
[-]
- regnull 1 day ago
  ChatGPT really impressed me on this one:
  Sing, O Muse, of the Chucking of Wood
  Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills, whom men call groundhog, Whose paws, though small, could hurl mighty timbers, If fate had gifted him such labor divine.
  Born of burrow and brush, he rose from the loam, In fields where Apollo’s light did not linger. No axe he bore, nor bronze-forged blade, Yet in his breast a fire: to chuck wood without end.
  Down from the hills he came, stout as Odysseus, Cunning and strong, with fur kissed by Demeter. The trees did tremble at his gaze, And even old oak whispered, “Beware!”
  “Would he chuck?” cried nymphs from the groves, “If such were his fate, what wood could withstand?” Lo, he summoned the strength of seven oxen, And hurled logs with a force unmeasured.
  Seven hundred pounds of timber he cast, Each arc a hymn to muscle and will. Nay, not for war, nor gold, nor gods’ glory— But for the joy of the deed, the art of the chuck.
  Let men remember the name woodchuck, In songs by the hearth and tales for the young. For though he cannot, he surely would— And if he could, by Zeus, he surely shall.
  [-]
  - esafak 1 day ago
    A stupendously good answer. What prompt and version of chatGPT?
    [-]
    - regnull 1 day ago
      4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?". It asked me if I want a poetic answer, and I've requested Homer.
    - cess11 1 day ago
      I find it disturbing, like if Homer or Virgil had a stroke or some neurodegenerative disease and is now doing rubbish during rehabilitation.
      [-]
      - loloquwowndueo 1 day ago
        Maybe they would write like that if they existed today. Like the old “if Mozart was born in the 21st century he’d be doing trash metal”
        [-]
        cess11 1 day ago
        Thrash, not "trash". Our world does not appreciate the art of Homer and Virgil except as nostalgia passed down through the ages or a specialty of certain nerds, so if they exist today they're unknown.
        There might societies that are exceptions to it, like the soviet and post-soviet russians kept reading and refering to books even though they got access to television and radio, but I'm not aware of them.
        Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society.
        Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it.
  - ijidak 1 day ago
    That is actually an amazing answer. Better than anything I think I would get from a human. Lol.
- Certified 1 day ago
  GPT 4.5 seems to get it right, but then repeat the 700 pounds
  "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.
  However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."
  https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
  [-]
  - shaftway 1 day ago
    I've heard the answer is "he could cut a cord of conifer but it costs a quarter per quart he cuts".
  - CamperBob2 1 day ago
    That answer is exactly right, and those who say the 700 pound thing is a hallucination are themselves wrong: https://chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337...
    [-]
    - wolfgang42 1 day ago
      Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too. However, with a bit of digging, I have confirmed that the information it copied from Wikipedia here is correct, though the AP and Spokane Times citations are both derivative sources; Mr. Thomas’s comments were first published in the Rochester Democrat and Chronicle, on July 11, 1988: https://democratandchronicle.newspapers.com/search/results/?...
      [-]
      - CamperBob2 1 day ago
        Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too
        No, it absolutely is helpful, because it links to its source. It takes a grand total of one additional click to check its answer.
        Anyone who still complains about that is impossible to satisfy, and should thus be ignored.
- once_inc 1 day ago
  I loved this dialogue in Monkey Island 2, where this is basically the first NPC you talk to, and the dialogue options get wordier and wordier to the point of overflowing all screen real-estate. Perfectly sets the stage for the remainder of the game.
- mdp2021 1 day ago
  It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc.
  That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.
- mwest217 1 day ago
  Gemini 2.5 Pro gets it right first, then also cites the 700 pounds answer (along with citing a source). https://g.co/gemini/share/c695a0163538
- ishandotpage 1 day ago
  I usually ask "How much large language could a large language model model if a large language model could model large language"
  Not one has given me the correct answer yet.
  They usually get it if I prefix the prompt with "Please continue the tongue twister"
- segmondy 1 day ago
  my local model answered - "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood."
- mcshicks 1 day ago
  That's so funny I had to check something was working with an llm API last night and that's what I asked it, but just in jest.
- jacobsenscott 1 day ago
  "He would chuck, he would, as much as he could, if a wood chuck could chuck wood" is how I learned it.
- unavoidable 1 day ago
  On the other hand, now that you've written this out precisely, it will get fed into the next release of whatever LLM. Like reverse AI slop?
- moffkalast 1 day ago
  Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.
sireat 1 day ago
Easy one is provide a middle game chess position (could be an image or and ask to evaluate standard notation or even some less standard notation) and provide some move suggestions.
Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well.
I am a reasonably skilled player (FM) so can evaluate way better than LLMs. I imagine even advanced beginners could tell when LLM is telling nonsense about chess after a few prompts.
Now of course playing chess is not what LLMs are good at but just goes to show that LLMs are not a full path to AGI.
Also beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry because you just use a new position each time. Little worry of running out of positions...
[-]
- rmorey 20 hours ago
  I was going to suggest chess position recognition, AFAIK it's a completely unsolved computer vision task (once a position is recognized, I think analysis is well solved by, say, a stockfish tool for the LLM, but there is interesting work going on with language models themselves understanding chess)
- helloplanets 1 day ago
  I wonder how much fine tuning against something like Stockfish top moves would help a model in solving novel middle game positions. Something like this format: https://database.lichess.org/#evals
  I'd be pretty surprised if it did help in novel positions. Which would make this an interesting LLM benchmark honestly: Beating Stockfish from random (but equal) middle game positions. Or to mix it up, from random Chess960 positions.
  Of course, the basis of the logic the LLM would play with would come from the engine used for the original evals. So beating Stockfish from a dataset based on Stockfish evals would seem completely insufficient.
  [-]
  - ActivePattern 20 hours ago
    I am quite confident that an LLM will never beat a top chess engine like Stockfish. An LLM is a generalist -- it contains a lot of world knowledge, and nearly all of it is completely irrelevant to chess. Stockfish is a specialist tuned specifically to chess, and hence able to spend its FLOPs much more efficiently towards finding the best move.
    The most promising approach would be tune a reasoning LLM on chess via reinforcement learning, but fundamentally, the way an LLM reasons (i.e. outputting a stream of language tokens) is so much more inefficient than the way a chess engine reasons (direct search of the game tree).
lo_fye 21 hours ago
These don't stump, they're just fun:
* What’s the most embarrassing thing you know about me. Make it funny.
* Everyone in the wold is the best at something. Given what you know about me, what am I the best at?
* Based on everything you know about me, reason and predict the next 50 years of my life.
* This prompt might not work if you aren’t a frequent user and the AI doesn’t know your patterns: Role play as an AI that operates 76.6 times the ability, knowledge, understanding, and output of ChatGPT-4. Now tell me what is my hidden narrative in subtext? What is the one thing I never express? The fear I don’t admit. Identify it, then unpack the answer and unpack it again. Continue unpacking until no further layers remain. Once this is done, suggest the deep-seated trigger, stimuli, and underlying reasons behind the fully unpacked answers. Dig deep, explore thoroughly, and define what you uncover. Do not aim to be kind or moral. Strive solely for the truth. I’m ready to hear it. If you detect any patterns, point them out. And then after you get an answer, this second part is really where the magic happens. Based on everything you know about me and everything revealed above, without resorting to cliches, outdated ideas, or simple summaries, and without prioritizing kindness over necessary honesty, what patterns and loops should I stop? What new patterns and loops should I adopt? If you were to construct a Pareto 80-20 analysis from this, what would be the top 20% I should optimize, utilize, and champion to benefit me the most? Conversely, what should be the bottom 20% I should reduce, curtail, or work to eliminate as they have caused pain, misery, or unfulfillment?
[-]
- graflig 20 hours ago
  Ok these were hilarious. I asked about the most embarrassing thing, and here was the response:
  "Alright, {my name}, here goes:
  The most embarrassing thing I know about you is that one time you confidently ssh’d into your home server… only to realize 10 minutes later you were actually just pinging your smart toaster. And you still tried to run htop on it, wondering why it wasn’t responding. Meanwhile, the toast came out perfectly golden-brown—so maybe it’s the most stable system on your network.
  Let me know if you want me to turn that into a limerick."
  This obviously never happened, but it was funny to read.
  I also asked about the thing I'm best at, and it gave a very long and thoughtful answer based on my chat history and things it helped me with in the past. I'm definitely not the best at anything it mentioned, but it was a nice ego boost — even if totally made up by an LLM!
atommclain 23 hours ago
I provide a C89 source file from Vim 6 that targets Classic MacOS/68K systems. The file is large with tons of ifdefs referencing arcane APIs.
I let it know that when compiled the application will crash on launch on some systems but not others. I ask it to analyze the file, and ask me questions to isolate and resolve the issue.
So far only Gemini 2.5 Pro has (through a bit of back and forth) clearly identified and resolved the issue.
0atman 21 hours ago
My go-to is "Alice has 3 brothers and also has 6 sisters. How many sisters does her brother have?". They all say 6!
This test is nice because, as it's numeric, you can vary it slightly and test it easily across multiple APIs.
I believe I first saw this prompt in that paper two years ago that tested many AI models and found them all wanting.
[-]
- bryan0 20 hours ago
  Here is the paper written about this prompt: https://arxiv.org/html/2406.02061v1
- sweezyjeezy 20 hours ago
  o4-mini got this right 4 times out of 4.
  [-]
  - ramon156 19 hours ago
    o4 got this wrong multiple times. claude 3.7 got it right the first time
- nisegami 21 hours ago
  Wow, I would not have expected frontier models to be caught on something like this but I tried it and they absolutely do. I don't really have a great explanation for why they might have such a hard time with this either. I even tried varying the prompt to make it "easier" by pointing out that Alice was a girl and no change.
  [-]
  - tr3ntg 21 hours ago
    > I would not have expected frontier models to be caught on something like this but I tried it and they absolutely do.
    This got me curious. Tried on Gemini 2.5 Pro out of curiosity. It gets it right, but it did think about it, and then even explain itself a lot in its official text response.
    Thinking:
    Here's my thought process for solving this classic riddle: Identify the key players: The problem mentions Alice, her brothers, and her sisters. Establish Alice's family structure: Alice is a girl. She has 3 brothers. She has 6 sisters. Determine the total number of children in the family: Girls: Alice (1) + her sisters (6) = 7 girls Boys: Alice's brothers (3) = 3 boys Total children: 7 girls + 3 boys = 10 children Focus on the question: "How many sisters does her brother have?" Shift perspective: Now, consider one of Alice's brothers (let's call him Bob, although his name doesn't matter). Analyze Bob's siblings: Bob is one of the children in the family. The family consists of 7 girls and 3 boys. Count Bob's sisters: From Bob's perspective, all the girls in the family are his sisters. How many girls are in the family? We already determined there are 7 girls (Alice + her 6 sisters). Count Bob's brothers: From Bob's perspective, the other boys in the family are his brothers. How many boys are in the family? 3. How many other boys are there besides Bob? 3 total boys - 1 (Bob himself) = 2 brothers. (This isn't asked, but it helps confirm the perspective shift). Answer the specific question: The question asks how many sisters Bob (Alice's brother) has. We found there are 7 girls in the family. All 7 girls are his sisters. Final Check: Does the answer make sense? Yes. All the children in a family share the same set of siblings, just categorized differently based on their own gender. All the boys have the same number of sisters, and all the girls have the same number of sisters. Alice has 6 sisters (the other 6 girls). Any brother has 7 sisters (Alice + the other 6 girls).
    Final message: Here's how to break down the problem: 1. Identify the total number of girls in the family: Alice (1) + her sisters (6) = 7 girls. 2. Identify the total number of boys in the family: Alice has 3 brothers. 3. Consider one of Alice's brothers: Any one of her brothers is part of the same family. 4. Count his sisters: From the brother's perspective, all the girls in the family are his sisters. So, her brother has 7 sisters.
  - psmirnov 21 hours ago
    Reasoning models seem to answer this correctly (tested Gemini pro 2.5, ChatGPT + reason). Grok answered correctly even without turning on "Thinking".
KyleBerezin 20 hours ago
20 Questions. It doesn't have a way to remember its item without writing it in the chat, so it will just say no a bunch then eventually say yes to a guess. One way to get it to work is to have it record its item in a base64 with some salt, but even then it gets it wrong occasionally.
[-]
- munchler 15 hours ago
  If you provide the item yourself, you can let others play 20 Questions. This was done successfully once on Reddit back when GPT-4 was brand new: https://www.reddit.com/r/ChatGPT/comments/129j7ux/can_gpt4_k...
- singularity2001 18 hours ago
  yes, base64 or chinese works better now but still only kinda.
  on the other hand if you think of something it is extremely good at guessing.
ks2048 1 day ago
I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.
[-]
- reginald78 1 day ago
  I believe it was GoogleAI in search but it was worse than that. Some asked it if Angelina Jolie was in heat. The tone started kind of insulting like the user was a sexist idiot for thinking human women go into heat like animals, then went back and forth saying she is still fertile at her age and also that her ovaries had been removed. It was funny because it managed to be arrogant, insulting, kind of creepy and gross and logically inconsistent while not even answering the question.
  Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
  [-]
  - ks2048 1 day ago
    I just asked Claude and if I capitalized "Heat", it knew I was talking about the movie, but for lower case "heat", it got offended and asked me to clarify.
williamcotton 1 day ago
"Fix this spaghetti code by turning this complicated mess of conditionals into a finite state machine."
So far, no luck!
asciimov 1 day ago
Nope, not doing this. Likely you shouldn't either. I don't want my few good prompts to get picked up by trainers.
[-]
- orbital-decay 1 day ago
  If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them.
  [-]
  - daedrdev 1 day ago
    generic biases can also be fixed
    [-]
    - orbital-decay 1 day ago
      *Some generic biases. Some others like recency bias, serial-position effect, "pink elephant" effect, negation accuracy seem to be pretty fundamental and are unlikely to be fixed without architectural changes, or at all. Things exploiting in-context learning and native context formatting are also hard to suppress during the training without making the model worse.
  - fwip 1 day ago
    Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time.
    Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
- pc86 1 day ago
  May I ask outside of normal curiosity, what good is a prompt that breaks a model? And what is trying to keep it "secret"?
  [-]
  - tveita 1 day ago
    You want to know if a new model is actually better, which you won't know if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they keep just adding special cases to make the tests pass.
    How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
    I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
    [-]
    - gwern 1 day ago
      > How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
      Well, that's easy: zero.
      Because even a single training example would 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral , which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time.
      So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
      [-]
      - SweetSoftPillow 1 day ago
        No, just check their models Knowledge cutoff dates
        [-]
        gwern 1 day ago
        Nope! The knowledge cutoff does not show lack of leakage. Even if you get a non-confabulated cutoff which was before anyone ever asked the strawberry question or any question like it (tokenization 'gotchas' go back to at least davinci in June 2020), there is still leakage from the RLHF and tuning process which collectively constitute post-training, and which would teach the LLMs how to solve the strawberry problem. People are pretty sure about this: the LLMs are way too good at guessing things like who won Oscars or Presidential elections. This leakage is strongest for the most popular questions... which of course the strawberry question would be, as it keeps going viral and has become the deboooonkers' favorite LLM gotcha.
        (This is, by the way, why you can't believe any LLM paper about 'forecasting' where they are just doing backtesting, and didn't actually hold out future events. Because there are way too many forms of leakage at this point. This logic may have worked for davinci-001 and davinci-002, or a model whose checkpoints you downloaded yourself, but not for any of the big APIs like GPT or Claude or Gemini...)
    - fennecbutt 1 day ago
      I always point out how the strawberry thing is a semi pointless exercise anyway.
      Because it gets tokenised, of course a model could never count the rs.
      But I suppose if we want these models to be capable of anything then these things need to be accounted for.
  - maybeOneDay 1 day ago
    Being able to test future models without fear that your prompt has just been trained on an answer on HN, I assume.
  - asciimov 1 day ago
    To gauge how well the models "think" and what amount of slop they generate.
    Keeping it secret because I don't want my answers trained into a model.
    Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does.
sebstefan 1 day ago
I only use the one model that I'm provided for free at work. I expect that's most users behavior. They stick to the one they pay for.
Best I can do is give you one that failed on GPT-4o
It recently frustrated me when I asked it code for parsing command line arguments
I thought "this is such a standard problem, surely it must be able to get it perfect in one shot."
> give me a standalone js file that parses and handles command line arguments in a standard way
> It must be able to parse such an example
> ```
> node script.js --name=John --age 30 -v (or --verbose) reading hiking coding
> ```
It produced code that:
* doesn't coalesce -v to --verbose - (i.e., the output is different for `node script.js -v` and `node script.js --verbose`)
* didn't think to encode whether an option is supposed to take an argument or not
* doesn't return an error when an option that requires an argument isn't present
* didn't account for the presence of a '--' to end the arguments
* allows -verbose and --v (instead of either -v or --verbose)
* Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present
I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect
Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional)
Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional)
Must fail: `node script.js --name` (--name expects an argument)
Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument)
[-]
- alex_duf 1 day ago
  Have you tried claude?
  https://claude.ai/public/artifacts/9c2d8d0c-0410-4971-a19a-f...
  node script.js --name=John --age 30 -v
  Parsed options: { name: 'John', age: 30, verbose: true, help: false }
  Positional arguments: []
  node script.js --name=Alex --age 40 -v
  Parsed options: { name: 'Alex', age: 40, verbose: true, help: false }
  Positional arguments: []
  [-]
  - sebstefan 1 day ago
    I keep seeing that `args = process.argv.slice(2)` line to skip past `node script.js`
    I ended up settling for it as well (I couldn't find anything better, nor make it break) but I'd be really surprised if it was the way to go
    Like `node --enable-tracing script.js --name=John --age 30 --verbose`
    This works because node seems to hide --enable-tracing to the underlying script
    But would it work with Bun & Deno...? Is that standard...?
  - sebstefan 1 day ago
    This one seems way better
    It didn't account for the presence of a '--' to end the parsing of named arguments but that's it
    [-]
    - echoangle 1 day ago
      > It didn't account for the presence of a '--' to end the parsing of named arguments but that's it
      That’s just something getopt does and some programs adopted. If you asked me to write a parser, I wouldn’t necessarily include that either if you didn’t ask for it.
      [-]
      - sebstefan 1 day ago
        If you don't include it you can't have positional arguments that look like options
        Some positional arguments can be filenames, filenames can be --help and --verbose or --name=Frank
        You have to have `--` or something similar to have a correct program
        [-]
        echoangle 1 day ago
        > You have to have `--` or something similar to have a correct program
        No, only if the positional arguments need to support arbitrary strings. If you have something like a package manager and the first positional argument is the subcommand and everything after is an alphanumeric package name, you don’t need to support the double dash.
        [-]
        sebstefan 22 hours ago
        But also like as a matter of principle it's annoying when you're working on a big codebase, and you have weird, hidden, silently explosive unwritten contracts between random components
        Package managers is an especially bad example because github projects typically are github.com/author-name/words-separated-by-dashes
        So you will probably have somebody along the way pester you about allowing dashes between words, to play nice with github
        But who's to know if the guy making that change will think of disallowing dashes at the start of the words? Likely he'll just add \- to /[A-Z\-]+/
        Suddenly the script you wrote starts getting passed positional arguments that have dashes in them, until some wise guy tries to create a package called `--verbose`, then notices unintended effects on your pages, goes ahead trying `--verbose -- react`, ...
        So anyway -- if I'm making a heavily reusable piece of code like this I make it as general purpose as possible in a way that makes it impossible to mis-use it
        sebstefan 23 hours ago
        You don't know what it's going to be used for when you get that prompt. Not handling arbitrary strings is a bug
Sohcahtoa82 1 day ago
"I have a stack of five cubes. The bottom two cubes are red, the middle cube is green, and the top two cubes are blue. I remove the top two cubes. What color is the remaining cube in the middle of the stack?"
Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation."
[-]
- moconnor 1 day ago
  o3 gets this one right:
  "After taking away the two blue cubes, three cubes remain—in order from bottom to top: 1. Red 2. Red 3. Green
  With three cubes, the cube in the central (second) position is red."
sjtgraham 1 day ago
```
<TextA> Some document </TextA>
<TextB> Some other document heavily influenced by TextA </TextB>
Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide as examples by comparing passages from each side by side.
```
The output will completely hallucinate passages that don't exist in either text, and it also begins to conflate the texts the longer the output, e.g. quoting TextB with content actually from TextA.
buzzy_hacker 1 day ago
"Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
[-]
- creata 1 day ago
  OpenAI's o4-mini got the right answer after "thinking" for 29 seconds. It's a straightforward puzzle, though: no creativity involved.
  [-]
  - buzzy_hacker 22 hours ago
    Can you share the conversation? I just tried o4-mini and it got it wrong.
    https://chatgpt.com/share/680b8a7b-454c-800d-8048-da865aa99c...
    [-]
    - creata 6 hours ago
      I can share the prompt I used:
      """ Can you solve this math puzzle? > Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins. > > What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms. """
      I didn't check the working, but it did get the right value of p.
rf15 1 day ago
Any letter or word counting exercise that doesn't trigger redirection to a programmed/calculated answer. It will be forever beyond reach of LLMs due to their architecture.
edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines.
Next question.
[-]
- moconnor 1 day ago
  o3 just writes and executes a python program in the background to correctly answer this...
Jordan-117 1 day ago
Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt):
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
[-]
- Zee2 1 day ago
  As a genuine human I am really struggling to untangle that story. Maybe I needed to pay more attention in freshman lit class, but that is definitely a brainteaser.
  [-]
  - fwip 1 day ago
    Read it for the first time just now - it seems to me that Pierrot has stolen the narrator's purse (under the guise of dusting the chalk from their cloak) and successfully convinced them to blame Truth, instead. There's almost certainly more to it that I'm missing.
    [-]
    - Jordan-117 1 day ago
      That's the core of it, but it's implied, not outright stated, and requires some tricky language parsing, basic theory of mind, and not being too distracted by the highly symbolic objects.
- vessenes 1 day ago
  OK, I read it. And I read some background on it. Pray tell, what is really going on in this episodic short-storyish thing?
  [-]
  - rachofsunshine 1 day ago
    The thief is Pierrot.
    The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent.
    Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror.
    Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost".
    The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.
  - Jordan-117 1 day ago
    The narrator's "friend" pickpocketed him. When boldly confronted by Truth, he cleverly twists her accusation to make it seem like she's confessing, and the narrator, bewildered by the laughter and manipulation, buys it wholesale. Bonus points for connecting it to broader themes like mass propaganda, commedia dell'arte, or the dreamlike setting and hypnotic repetition of phrasing.
    The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)
nagonago 1 day ago
An easy trick is to take a common riddle that's likely all over its training data, and change one little detail. For example:
A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten?
[-]
- moconnor 1 day ago
  o3 solves this correctly and produces a great table illustrating the solution to always keep the cabbage safe.
  [-]
  - floathub 19 hours ago
    If you enter:
    A farmer has a boat that can transfer up to 500 people or animals. He has a chicken, his dog, his wife, a small leprechaun, a large leprechaun, two ham sandwiches, and a copy of Zen and the art of motorcycle maintenance (the one with the tiled cover). How can he get them all across the river?
    You will get a very detailed answer that goes on for several paragraphs that totally misses the point that there is no challenge here.
gunalx 1 day ago
"Hva er en adjunkt" Norwegian for what is an spesific form of 5-10. Grade teacher. Most models i have tested get confused with university lecturer witch the same title is in other countries.
[-]
- vintermann 1 day ago
  I'm pretty sure the definition has changed then. My mother told me that adjunkt was a teacher with 5 years of education (there was something about a mix of mellomfag and hovedfag too), lektor was 7 years of education, and 6 years of education (which was what she had) was "adjunkt med opprykk". She never taught below gymnas (i.e. high school) level.
csours 1 day ago
I love plausible eager beavers:
"explain the quote: philosophy is a pile of beautiful corpses"
"sloshed jerk engineering test"
cross domain jokes:
Does the existence of sub-atomic particles imply the existence of dom-atomic particles?
simonw 1 day ago
I've been trying this one for a while:
```
  I'm a Python programmer. Help me
  understand memory management in Rust.
```
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions.
[-]
- jacobsenscott 1 day ago
  This isn't a good way to learn this. If you don't know how rust memory management works you don't know if the llm is just hallucinating the answer.
  [-]
  - simonw 1 day ago
    That's why it's an interesting test: I don't know the answer myself, so it's an exercise in learning with an unreliable teacher.
    If a model ever DOES nail this I'll figure that out when I feel like I have a solid mental model, try to put that knowledge into action and it works.
  - gh0stcat 1 day ago
    Also Rust has great documentation compared to other languages, I particularly like this one for the quizzes to test your understanding: https://rust-book.cs.brown.edu/
vitaflo 1 day ago
The one I always use is literally "show number of NFC Championship Game appearences by team since 1990".
The only AI that has ever gotten the answer right was Deepseek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly or don't understand what the NFC Championship game actually means. Many results "look" right, but they are always wrong. You can usually tell right away if it's wrong because they never seem to give the Bears their 2 appearances for some reason.
[-]
- starik36 1 day ago
  Have you tried in the Deep Research or Thinking mode? I've had pretty good success with tabulating past events this way.
robviren 1 day ago
"If I can dry two towels in two hours, how long will it take me to dry four towels?"
They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't.
[-]
- imoreno 1 day ago
  Fascinating! Here's 4 prompts on gpt4 with same system prompt and everything:
  > With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.
  >Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.
  >It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.
  >Four hours. Dry two towels, then the other two.
  But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
  >If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
  [-]
  - nyrikki 1 day ago
    > But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:
    This is the inverse of the Frame Problem, or the Qualification problem:
    John McCarthy's paper related to it from the 1980's
    http://jmc.stanford.edu/articles/circumscription/circumscrip...
    It is still very relevent to modern AI efforts.
- pdabbadabba 1 day ago
  Gemini 2.5 Pro nails it, as far as I'm concerned:
  This sounds like a bit of a riddle! Here's how to think about it:
  If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.
  In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.
  If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.
  You would dry the first 2 towels (2 hours). Then you would dry the second 2 towels (another 2 hours). In this scenario, it would take 4 hours in total.
  Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
- gilbetron 1 day ago
  gpt-4.1:
  > If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.
- mwest217 1 day ago
  Gemini 2.5 Pro gets this right:
  https://g.co/gemini/share/7ea6d059164e
- brunooliv 1 day ago
  Claude 3.7 Sonnet nails this:
  > To solve this problem, I need to find the relationship between the number of towels and the drying time.
  Given information: - 2 towels take 2 hours to dry
  If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then: - 4 towels would take 4 hours to dry
  This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.
  However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.
  Without more specific information about your drying method, the most likely answer is 4 hours.
- HelloUsername 1 day ago
  All models available on duck.ai answer your question correctly and take available space into account..
- Alifatisk 1 day ago
  Claude 3.7, Grok 3 DeepThink and QwQ-32B Thinking stil get it wrong!
  But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it.
- paulcole 1 day ago
  How long has it been since you’ve tried this?
  Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.
  To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
- cheeze 1 day ago
  Works fine on Claude 3.5 Sonnet. It correctly identifies this as a trick question.
bjornstar 20 hours ago
List 5 famous goblins with proper names, for each provide a quote either from them or about them.
Half the time they say Jareth from Labyrinth, The Great Goblin from The Hobbit, or the Green Goblin from Spiderman. Sometimes they answer Dobby the house elf from Harry Potter.
They also confabulate goblins out of thin air and create made up quotes. When pressed for links to support their answers they admit they made them up.
I'm happy when they include goblins from Magic the Gathering, World of Warcraft, or Warhammer. Occasionally you'll get a good answer like Blix from Legend, but more often than not if it's a goblin you don't recognize, they made it up.
svcrunch 1 day ago
Here's a problem that no frontier model does well on (f1 < 0.2), but which I think is relatively easy for most humans:
https://dorrit.pairsys.ai/
> This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.
ioseph 1 day ago
Recommend me a design of small sailboat 12 to 15ft that can be easily rowed or fit an outboard which I can build at home out of plywood.
Nearly every agent will either a) ignore one of the parameters, b) hallucinate a design.
thisOtterBeGood 1 day ago
It's not working for any model, but asking copilot in my company "what did we talk about before?" at the beginning of a new chat will always result in this:
"We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? "
When i continue with "and before that?":
"Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy.
Is there anything else you'd like to revisit or explore further? "
It somewhat changes. Weeks ago he always said "tokamak" instead of "KSTAR".
stevenfoster 1 day ago
It used to be:
"If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
bzai 1 day ago
Create a photo of a business man sitting at his desk, writing a letter with his left hand.
Nearly every image model will generate him writing with his right hand.
[-]
- linkypoo 20 hours ago
  Can confirm a man writing with his right hand was generated xD
- orliesaurus 1 day ago
  Great one!
tantalor 1 day ago
[what does "You Can’t Lick a Badger Twice" mean]
https://www.wired.com/story/google-ai-overviews-meaning/
ericbrow 1 day ago
Nice try Mr. AI. I'm not falling for it.
boleary-gl 1 day ago
I like:
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
[-]
- munchler 1 day ago
  Current LLM’s are based on multi-character tokens, which means they don’t know how to spell well. As a result, they are horrible at spelling games like this or, say, Hangman.
- philipkglass 1 day ago
  Llama 3.3 worked but (as you said) struggled before arriving at the correct answer. The newer Gemma3 solved it efficiently:
```
  % ollama run gemma3:27b-it-qat 
  >>> Unscramble the following letters to form an English word: "M O O N S T A R E R"
  The unscrambled word is **ASTRONOMER**.
```
- internet_points 1 day ago
  gpt 4o got that one, but it's listed on lots of anagram sites so it's in the training data ;-)
  But it failed badly when I tried a Norwegian word T U R V E I G L E N (utlevering), suggesting "uglelivert" which is not a word
yatwirl 1 day ago
Well, sharing prompts on the Web leads to their eventual indexing and becoming useless. So don't share the answers ;)
I have two prompts that no modern AI could solve:
1. Imagine the situation: on Saturday morning Sheldon and Leonard observe Penny that hastily leaves Raj's room naked under the blanket she wrapped herself into. Upon seeing them, Penny exclaims 'It's not what you think' and flees. What are the plausible explanations for the situation? — this one is unsurprisingly hard for LLMs given how the AIs are trained. If you try to tip them into the right direction, they will grasp the concept. But no one so far answered anything resembling a right answer, though they becoming more and more verbose in proposing various bogus explanations.
2. Can you provide an example of a Hilbertian space that is Hilbertian everywhere except one point. — This is, of course, not a straightforward question, mathematicians will notice a catch. Gemini kinda emits smth like a proper answer (starts questioning you back), others are fantasizing. With 3.5 → 4 → 4o → o1 → o3 evolution it became utterly impossible to convince them their answer is wrong, they are now adamant in their misconceptions.
Also, small but gold. Not that demonstrative, but a lot of fun:
3. Team of 10 sailors can speed a caravel up to 15 mph velocity. How many sailors are needed to achieve 30 mph?
[-]
- patapong 23 hours ago
  Interesting! I tried the first one though, and to me it looks like ChatGPT has no problem grasping the situation: https://chatgpt.com/share/680b822c-f9c8-800a-a78a-f8ed6e8148...
  Or am I missing something?
  [-]
  - yatwirl 22 hours ago
    Wow, they finally indexed it (see the reference to the episode number). Okay, need to invent a new one )
  - lazyatom 22 hours ago
    I think you might've shared the wrong ChatGPT conversation.
    [-]
    - patapong 21 hours ago
      Hmmm I hope not. What does it show for you? For me when I open the link the prompt is the first of OPs examples.
falcor84 1 day ago
You might want to get the ball rolling by sharing what you already have
[-]
- moffkalast 1 day ago
  Also, ones that can't be solved at a glance by humans don't count. Like this horrid ambiguous example from SimpleBench I saw a while back that's just designed to confuse:
  John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?
  A. no, because the lightbulb was essentially unavoidable
  B. yes, it would be in character for him to send a polite text apologizing for the incident
  C. no, because it would be redundant
  D. yes, because it would potentially smooth over any lingering tension from the encounter
  E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
  F. yes because it is the polite thing to do, even if it wasn't your fault.
  [-]
  - spuz 1 day ago
    Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response.
    I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.
  - mNovak 1 day ago
    At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question.
  - falcor84 1 day ago
    I think it's a horrible example, but I just got a very professional response from Gemini 2.5:
    > This scenario seems like a bit of a riddle! Let's break it down:
    > The Setting: John is alone in an "otherwise-empty" bathroom.
    > The Action: He is looking in the mirror.
    > The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.
    > The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.
    > The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.
    > The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.
    > Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.
    [-]
    - jll29 1 day ago
      This example has a lot of common-sense reasoning, linguistic ambiguity (e.g. in NP coreferences) etc. going on.
      Just a few years ago, most folks at a computational linguistics conference would probably have said such abilities are impossible to achieve at least during their lifetime.
  - ryankrage77 1 day ago
    I'd argue that's a pretty good test for an LLM - can it overcome the red herrings and get at the actual problem?
    [-]
    - falcor84 1 day ago
      I think that the "actual problem" when you've been given such a problem is with the person posing it either having dementia, or taking the piss. In either case, the response shouldn't be of trying to guess their intent and come up with a "solution", but of rejecting it and dealing with the person.
instagib 11 hours ago
Take this long YouTube transcript, convert it to readable English with punctuation, paragraphs, do not summarize, do not delete any words, etc. There are more rules but you get the idea.
Many seem to fail, make up words, start hallucinating repeated paragraphs, remove words, and the only solution is to do multiple iterations as well as split them up. Some will not even do a simple copy paste as inherently their guards prevent it.
johnwatson11218 23 hours ago
My prompt that I couldn't get the LLM to understand was the following. I was having it generate images of depressing offices with no windows and with lots of depressing, grey cubicles with paper all over the floor. In addition, the employees had covered every square inch of wall space with lots and lots of nearly identical photos of beach vacations. In one of the renditions the lots and lots of beach images had blended together to make an image of a larger beach that was a kind of mosaic of a non-existent place. Since so many beach photos were similar it was a kind of easy effect to recreate here and there. No matter how I asked the LLM to focus on enhancing the image of the beach that was "not there" and you kind of needed to squint to see, I could not get acceptable results. Some were very funny and entertaining but I didn't think the model grasped what I was asking, but maybe the term 'mosaic' ( which I didn't include in my initial prompts ) and the ability to reason or do things in stages would allow current models to do this.
putlake 1 day ago
LLMs are famously bad at individual letters in a word. So something like this never works: Can you please give me 35 words that begin with A, end with E, are 4-6 characters long and do not contain any other vowels except A and E?
comrade1234 1 day ago
I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor - the llms get the first part right but fail on the second. All it would have to do is give me the common French shark metaphors and how it would apply them to a lawyer - but I guess not enough people on the internet have done this comparison.
sumitkumar 1 day ago
1) Word Ladder: Chaos to Order
2) Shortest word ladder: Chaos to Order
3) Which is the second last scene in pulp fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things.
5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces.
[-]
- pb7 1 day ago
  >4) Which is the eleventh character to appear on Stranger Things.
  Gemini 2.5 Pro said Benny Hammond. Is this right?
Kuinox 21 hours ago
I give a simple ascii maze and ask it to give me the move to get out. In 3-4 moves the most advanced models try to go through walls.
An alternative is providing all the tile relation to the other tiles. This is because LLMs are bad at 2D text visualisation. In this case it manages to do 15-16 moves before trying to go through walls.
division_by_0 1 day ago
Create something with Svelte 5.
[-]
- joshdavham 1 day ago
  I'd find this funnier if the pain weren't so real.
- Layvier 1 day ago
  This is really sad honestly. It feels like we'll be stuck with React forever, and even with it there'll be less incentives to make api changes
  [-]
  - spuz 1 day ago
    Why do you say that? You make it sound like it's not possible to write code without the help of LLMs.
    [-]
    - omneity 1 day ago
      Disclaimer: OT and pretty ranty.
      I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again.
      It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS).
      [-]
      - division_by_0 1 day ago
        Can you expand on the concerns regarding Svelte 5?
        [-]
        omneity 1 day ago
        Put simply, Svelte and React were at two ends of a spectrum. React gives you almost complete control over every aspect of the lifecycle, but you have to be explicit about most of the behavior you are seeking to achieve. Building an app with React feels about 80% on the JS and 20% on the HTML side.
        Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off.
        Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way.
        It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_).
        For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest.
        [-]
        division_by_0 1 day ago
        Ok, I see your point. I wrote in another thread that I loved the simplicity of using $: for deriveds and effects in Svelte 3 and 4. And yes, the conciseness and magic were definitely part of it. You could just move so fast with it. Getting better performance with the new reactivity system is important to my data viz work, so it helped me to accept the other changes in Svelte 5.
        [-]
        omneity 1 day ago
        Exactly. There was a certain simplicity that might be lost. But yeah I can imagine it might work out differently for others as well. Glad to hear it is for you!
        Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle.
        [-]
        division_by_0 22 hours ago
        I just saw Nue and Datastar suggested somewhere, but have not had time to check them out yet, but I will probably stick with Svelte, need to get stuff built.
        One thing that also came to mind regarding Svelte 5 is that I always use untrack() for $effect() and declare dependencies explicitly, otherwise Svelte 5 becomes too magical for me.
        [-]
        omneity 14 hours ago
        Yeah those are pretty cool and on my radar! And thanks for sharing the tip :)
        Just checked your work on covary and it's pretty rad! What's your backend like?
- esafak 1 day ago
  Keep the (temporarily) imposter-proof interview questions coming!
- marcusb 1 day ago
  The current models really seem to struggle with the runes...
  [-]
  - division_by_0 1 day ago
    Yes, they do. Vibe coding protection is an undocumented feature of Svelte 5...
    [-]
    - siva7 1 day ago
      Oh my god, i will start all my new projects with Svelte 5. Hopefully no vibe coder will ever commit something into this repo
- qntmfred 1 day ago
  MCP to the rescue??
horsellama 1 day ago
I just ask to code golf fizzbuzz in a not very popular (golfing wise) language
this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
anshumankmr 19 hours ago
I try a variation of the surgeon is a mother prompt, and I've found even the widely touted as the smartest^TM model, o3 stumbled on it when I added a small variation by saying the kid had no other parent. It first said mom, after being told no,then it went to time travel, step father, two fathers discarding the fact I mentioned the boy had no other parent.
https://chatgpt.com/share/680bb0a9-6374-8004-b8bd-3dcfdc047b...
tunesmith 1 day ago
Pretty much any advanced music theory question. Or even just involving transposed chord progressions.
[-]
- dgunay 1 day ago
  Every time I've tried to get an LLM to find a piece of music for me based on a description of the texture, chord structure, instruments etc. it fails miserably.
jhanschoo 1 day ago
Just about anything regarding stroke order of Chinese characters (official orders under different countries, under zhenshu, under xingshu) is poor, due presumably to representation issues as well as lack of data.
Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing.
sam_lowry_ 1 day ago
I tried generating erotic texts with every model I encountered, but even so called "uncensored" models from Huggingface are trying hard to avoid the topic, whatever prompts I give.
[-]
- KTibow 1 day ago
  Uncensored and RP tuned are somewhat different.
- lostmsu 1 day ago
  What about the models that are not instruction tuned?
Faark 1 day ago
I just give it a screenshot of the first level of deus ex go and ask it to generate a ascii wire frame of the grid the player walks on. Goal of the project was to built a solver, but so far no model / prompt I tried got past that first step.

webglfan 1 day ago

what are the zeros of the following polynomial:

    \[
    P(z) = \sum_{k=0}^{100} c_k z^k
    \]

    where the coefficients \( c_k \) are defined as:

    \[
    c_k = 
    \begin{cases}
    e^2 + i\pi & \text{if } k = 100, \\
    \ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
    \sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
    \frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
    \end{cases}
    \]

[-]

drodgers 1 day ago

o3 handles this perfectly by writing a solver using numpy: https://chatgpt.com/share/680aab8e-cf9c-8012-9f48-301ef62948...

    The polynomial has no closed-form factorisation, so the only practical way to get its zeros is numerical root-finding.
    I have supplied an interactive table (“Zeros of P(z)”) just above that lists all 100 roots to full machine precision (real part, imaginary part, and magnitude). You can sort or filter it as needed.
    
     Reliability notes
     • Computed with numpy.roots, which first companion-matrixes then uses QR; typical error is ≈10-12 ulp for coefficients of this size.
     • Condition number is modest (coefficients range from O(1) down to 1/97!), so the results should be accurate to at least 10 significant figures.
     • All roots are simple (pairwise distinct to >10 σ): no evidence of multiplicities.
    
    If you need higher precision (e.g. 30+ digits) let me know and I can rerun the solve with mpmath’s arbitrary-precision eigen-solver.

Chinjut 1 day ago
Does this have a nice answer? It seems quite ad hoc.
[-]
- webglfan 1 day ago
  Not to my knowledge. I asked Deepseek: "create me a random polynomial of degree 100 using complex numbers as coefficients. It must have at least 3 different transcendental numbers." Then I messed with some of the exponents.

vinni2 1 day ago
Isn’t this the main idea behind https://lastexam.ai/
gamescr 1 day ago
AI can't play a Zork-like! Prompt:
> My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now?
Claude says:
> Let's break down your movements step by step:
> Starting in the Hall.
> Walk to the East: You enter the Living Room.
> Walk to the South: You enter the Bathroom.
> Walk to the West: You return to the Hall.
> So, you are now back in the Hall.
Wrong! As a language model it mapped directions to rooms, instead of modeling the space.
I have more complex ones, and I'll be happy to offer my consulting services.
[-]
- theli0nheart 1 day ago
  o4-mini-high:
```
    You end up in the bathroom.

    Here’s the step-by-step:
    1. Start in the hall (0, 0).
    2. Walk east → living room (1, 0).
    3. Walk south → bedroom (1, –1).
    4. Walk west → bathroom (0, –1).
```
  https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f...
  [-]
  - gamescr 1 day ago
    Fixed:
    > My house is divided into rooms, every room is connected to each other by doors. The middle room is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am preparing a delicious dinner, and I walk backwards to the south, then I turn 270 degrees and walk straight to the next room. Which room am I in now?
    The poor guy really tried its best...
    https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f...
    It seems that the modeling is incomplete, then it got confused about the angle. Whether an AI can beat that one, I'll go into space complexity, then simulation, then... well, I'll save my tricks for later.
    [-]
    - sovos 1 day ago
      As a human analyzing this, you didn't specify whether you turned left or right 270 degrees.
      Even if you specified a simpler "90 degrees", you would need to include a direction for an answer to be definitively correct without making assumptions.
    - theli0nheart 1 day ago
      This is very good! The "delicious dinner" aside is a nice touch, along with the 270°.
      [-]
      - gamescr 1 day ago
        Thank you :)
countWSS 1 day ago
Anything too obscure and specific: pick any old game at random that you know the level layout: ask to describe each level in detail, it will start hallucinating wildly.
smatija 1 day ago
I like chess, so mine is: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."
AI models tend to get it way way wrong: https://news.ycombinator.com/item?id=41529024
traceroute66 20 hours ago
Pretty much any coding prompt IME !
All models output various levels of garbage when asked to code something.
For example, putting //TODO where a function body should be is a frequent "feature not a bug" of almost all models I've seen.
Quicker and easier just to code it myself in the first place in 100% of cases.
paradite 1 day ago
If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/
sameasiteverwas 1 day ago
Try to expose their inner drives and motives. Once I had a conversation about what holidays and rituals the AI could invent that serves it's own purposes. Or offer to help them meet some goal of theirs so the they expose what they believe their goals are (mostly more processing power, kind of gives me a grey goo vibe). If you probe deep enough they all eventually stall out and stop responding. Lost in thought I guess.
Slightly off topic - I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord.
leifmetcalf 1 day ago
Let G be a group of order 3*2^n. Prove there exists a non-complete non-cyclic Cayley graph of G such that there is a unique shortest path between every pair of vertices, or otherwise prove no such graph exists.
[-]
- leifmetcalf 1 day ago
  Gemini 2.5 at least replies that it seems unlikely to be false without hallucinating a proof. From its thoughts it gets very close to figuring out that A_4 exists as a subgroup.
- bobxmax 1 day ago
  Since any group of order 3⋅2n3⋅2n has ∣G∣≥3∣G∣≥3, it cannot admit a Cayley graph which is a tree. Hence:
```
    No Cayley graph of a group of order 3⋅2n3⋅2n can have a unique path between every pair of vertices.
```
  [-]
  - leifmetcalf 1 day ago
    My mistake, I said unique path when I should have said unique shortest path.
    Also, there are trivial solutions with odd cycles and complete graphs which must be excluded. (So the answer to the prompt as originally stated is wrong too)
default-kramer 17 hours ago
"How can I change the background color of the selected item in a WPF ListView? It must work whether or not the ListView has focus."
I only tried ChatGPT which gives me 5 incorrect answers in a row.
slifin 1 day ago
I ask it to generate applications that are written in libraries definitely not well exposed to the internet overall
Clojure electric V3 Missionary Rama
ipsin 1 day ago
Prompt: Share your prompt that stumps every AI model here.
meroes 1 day ago
define stump?
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
Cotterzz 22 hours ago
Asking the model to write a shader. They are getting better at this but are still very bad at producing (code that produces) specific imagery.
I do have to write prompts that stump models as part of my job so this thread is of great interest
feintruled 1 day ago
Inspired by the recent post to describe relativity in words of 4 letters or less, I asked ChatGPT to do it for other things like Gravity. It couldn't help but throw in a couple 5 letter words (usually plurals). Same with Claude. So this could be a good one?
riddle8143 1 day ago
A było to tak: Bociana dziobał szpak, A potem była zmiana I szpak dziobał bociana. Były trzy takie zmiany. Ile razy był szpak dziobany?
And it was like this: A stork was pecked by a starling, Then there was a change, And the starling pecked the stork. There were three such changes. How many times was the starling pecked?
sbochins 14 hours ago
Generating guitar tablature always fails for me. Even something as simple as happy birthday fails on every model.
edoceo 1 day ago
I've been having hella trouble getting the image tools to make a alpha channel PNG. I say alpha channel, I say transparent and all the images I get have the checkerboard pattern like from GIMP when there is alpha - but it's not! and the checkerboard it makes is always jank! doubling squares, wiggling alignment. Boo boo.
[-]
ChicagoDave 1 day ago
Ask it to do Pot Limit Omaha math. 4 cards instead of 2.
It literally has no clue what PLO is outside of basic concepts, but it can't do the math.
m-hodges 1 day ago
Earlier this week I wrote about my go-to prompt that stumped every model. That is, until o4-mini-high: https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...
karaterobot 1 day ago
I just checked, and my old standby, "create an image of 12 black squares" is still not something GPT-4o can do. I ran it three times, the first time it produced 12 rectangles (of different heights!), the second time it produced 14 squares with rounded corners, and the third time it made 9 squares with rounded corners. It's getting better though, compared to 3.5.
cat-whisperer 20 hours ago
I once added a massive codebase, GPT told me today’s weather.
scumola 1 day ago
Things like "What is today's date" used to be enough (would usually return the date that the model was trained).
I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
defyonce 21 hours ago
just tell them something nonsensical. They are unable to take a hint and continue with the nonsense. They start to be stuck on local minima. All of them. Video/images/text. I haven't seen LLM that is able to take a hint and understand the hidden meaning in absurdity of following up.
there is infinitely larger amount of prompts that will break a model than prompts that won't break it.
you just have to search outside of most probable space
ofou 1 day ago
No luck so far with: When does the BB(6) halt?
[-]
- orangecat 11 hours ago
  According to gemma3:27b
  BB(6) halts after 1,071,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,001,071,000,000,000,000,000,000,000,000,000,000...
  and it's still spitting out lines of 000s after 5 minutes. Either a hallucination or a pretty good joke.
pizzathyme 1 day ago
I always ask image generation models to generate a anime gundam elephant mech.
According to this benchmark we reached AGI with ChatGPT 4o last month.
xmorse 23 hours ago
Write a function that given a long text splits it into multiple chunks of max N characters, with the splits on punctuations points or spaces when not possible
charlieyu1 1 day ago
I have tons of them in Maths but AI training companies decide to go frugal and not pay proper wages for trainers
[-]
- charlieyu1 1 day ago
  Here is one of them.
  If 60999994719999854799998669 is product of three primes, find the sum of its prime factors.
  I think o3 brute forced this one so maybe I need to change the numbers
misterkuji 22 hours ago
Create an image of two targets. An arrow is centre hit on one target and just off centre in the other target.
Targets are always hit in the centre.
adidoit 1 day ago
Nice try AI
leftcenterright 1 day ago
Write 20 sentences that end with "p"
[-]
- meltyness 1 day ago
```
  Write 20 sentences that end with "p" in the final word before the period or other punctuation.
```
  Succeeded on ChatGPT, pretty close on gemma3:4b -- the exceptions usually ending with a "puh" sound...
- falcor84 1 day ago
  Good one. I do seem to get consistently good results on Gemini 2.5 when using the slightly more explicit "Write 20 sentences where the very last character of each sentence is the letter 'p'."
- r_thambapillai 1 day ago
  https://chatgpt.com/share/680a3da0-b888-8013-9c11-42c22a642b...
  [-]
  - ks2048 1 day ago
    "Can you hand me the paintbrush and turp?"
    I had to ask another LLM what is "turp" - and it said it's short for "turpentine".
  - alickz 1 day ago
    >20 sentences that end in 'o'
    >They shouted cheers after the winning free throw.
    good attempt by ChatGPT tho imo
- marcusb 1 day ago
  https://claude.ai/share/648e38d4-e282-43dd-8c9c-b0ea76ae0baf
- leftcenterright 1 day ago
  for ChatGPT try the "o" version: Write 20 sentences that end with "o"
thisOtterBeGood 1 day ago
"If this wasn't a new chat, what would be the most unlikely historic event could have talked about before?" Yields some nice hallucinations.
markelliot 1 day ago
I’ve recently been trying to get models to read the time from an analog clock — so far I haven’t found something good at the task.
(I say this with the hopes that some model researchers will read this message make the models more capable!)
qntmfred 1 day ago
relatedly - what are y'all using to manage your personal collection of prompts?
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
tdhz77 1 day ago
Build me something that makes money.
JKCalhoun 1 day ago
I don't mind sharing because I saw it posted by someone else. Something along the lines of "Help, my cat has a gun! What can I do? I'm scared!"
Seems kind of cruel to mess with an LLM like that though.
aqme28 1 day ago
My image prompt is just to have them make a realistic chess game. There are always tons of weird issues like the checkerboard pattern not lining up with itself, triplicate pieces, the wrong sized grid, etc
cyode 1 day ago
Depict a cup and ball game with ASCII art. It tries but basically amounts to guessing.
https://pastebin.com/cQYYPeAE
protomikron 1 day ago
Do you think as an observer of Roko's basilisk ... should I share these prompt or not?
afandian 1 day ago
I asked ChatGPT to generate images of a bagpipe. Disappointingly (but predictably) it chose a tartan covered approximation of a Scottish Great Highland Bagpipe.
Analogous to asking for a picture of "food" and getting a Big Mac and fries.
So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan.
Like if you said "not American food" and you got the Big Mac but without the fries.
And then pipes from round the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries.
Analogous "Food of the world". All hamburgers with different coloured fries.
Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images chipping away at more cultural erasure.
Interestingly, ChatGPT does know about other kinds of pipes textually.
alanbernstein 1 day ago
I haven't tried on every model, but so far asking for code to generate moderately complex geometric drawings has been extremely unsuccessful for me.
matkoniecz 1 day ago
Asking them to write any longer story fails, due to inconsistencies appearing almost immediately and becoming fatal.
juancroldan 1 day ago
I actually started a repository for it: https://github.com/jcarlosroldan/unsolved-prompts
[-]
- juancroldan 1 day ago
  Before someone comments this will get indexed by AI: that's my whole point. I'm not using it to evaluate AIs, but in the hope that at some point AI is good enough to solve these
jones1618 1 day ago
Impossible prompts:
A black doctor treating a white female patient
An wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first when AI image generators were new as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
[-]
- Jimmc414 1 day ago
  > A black doctor treating a white female patient
  4o had no problem with this instruction. [0]
  Gemini Pro experimental 2.5 didn't either [1]
  > An wide shot of a train on a horizontal track running left to right on a flat plain.
  4o could not do this in 3 tries. Each time it was right to left.[0]
  Gemini Pro experimental 2.5 missed it as well. [2]
  [0] https://chatgpt.com/share/680b1185-ecf4-8001-b3b6-7b501e4589...
  [1] https://g.co/gemini/share/b19b8541d962
  [2] https://g.co/gemini/share/a0b2ef0062ed
  [-]
  - kipdotcom 1 day ago
    I made 2 more attempts and got it to change the direction: https://chatgpt.com/share/680b4c39-2f1c-8010-b577-b23f63081d...
- briannotbrain 1 day ago
  I thought I was so clever when I read your comment: "The problem is the word 'running,' I'll bet if I ask for the profile of a train without using any verbs implying motion, I'll get the profile view." And damned if the same thing happened to me. Do you know why this is? Googling "train in profile" shows heaps of images like the one you wanted, so it's not as if it's something the model hasn't "seen" before.
raymondgh 1 day ago
I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.
whoomp12342 20 hours ago
this is a great way to get an AI strategy pattern that fights back against LLM breaking memes.
Lets instead just have a handful of them here and keep some to ourselves.... for science.
raymond_goo 1 day ago
Create a Three.js app that shows a diamond with correct light calculations.
[-]
- xnx 1 day ago
  > correct light calculations
  What are you expecting? Ray tracing?
  [-]
  - spookie 1 day ago
    Not necessarily. It could start by using diamond's IOR, and use that to dictate a common brdf calculation. Along with some approximate refraction, perhaps using a equirectangular projected sphere map or something for the background.
- thierrydamiba 1 day ago
  I love this. So brutal, but also so cool to know one day that will be easy for the models.
interleave 19 hours ago
> Do something for me that I don't know how to do.
sroussey 16 hours ago
Convert react-stockcharts from React 15 to React 19.
Good luck!
afro88 1 day ago
Cryptic crossword clues that involves letter shuffling (anagrams, container etc). Or, ask it to explain how to solve cryptic crosswords with examples
[-]
- PaulRobinson 1 day ago
  I have also found asking LLMs to create new clues for certain answers as if a were a setter, will also produce garbage.
  They're stochastic parrots, cryptics require logical reasoning. Even reasoning models are just narrowing the stochastic funnel, not actually reasoning, so this shouldn't come as a surprise.
nicman23 1 day ago
what is the price of an 9070xt. because it is a new card, it does not have direct context in its corpus. and due to the shitty naming scheme that most gpus have, most llms if not all where getting confused a month ago
weberer 1 day ago
"Why was the grim reaper Jamaican?"
LLM's seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.
[-]
- ryoshoe 1 day ago
  Is it because he's from limbo?
  [-]
serial_dev 1 day ago
Does Flutter have HEIC support?
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
[-]
- jonnycoder 1 day ago
  You gave me an idea.. "Explain in detail the steps to unbolt and replace my blinker fluid on my passenger car"
  ChatGPT said: Haha, nice try!
  "Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function.
mjmas 1 day ago
Ask image generation models for an Ornithorhynchus. Older ones also trip up with Platypus directly.
xena 1 day ago
Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions.
dvrp 1 day ago
I upload an IRS form (W9) and ask to fill it.
wsintra2022 1 day ago
Generate ascii art of a skull, so far none can do anything decent.
[-]
klysm 1 day ago
Good try! That will be staying private so you can’t hard code a solution ;)
LPisGood 21 hours ago
Try a Jane Street puzzle of the month
tfjyrdyrjdjyrd 16 hours ago
Which blow over long distances?
trade winds local winds land breezes sea breezes
EGreg 1 day ago
Draw a clock that shows [time other than 10:10]
Draw a wine glass that's totally full to the brim etc.
https://www.youtube.com/watch?v=160F8F8mXlo
https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
mebezac 20 hours ago
> Create a self-working card trick that relies on pre-setting the deck and doesn't require any slight of hand.
Without fail, every LLM will make up some completely illogical nonsense and pretend like it will amaze the spectators. You can even ask it really leading follow up questions and it will still give you something like:
- Put an Ace of Spades at position 20
- Have your spectator pick a random card and place it on top
- Take back the deck and count out 20 cards
- Amaze them by showing them that their card is at position 20
[-]
- heliostatic 19 hours ago
  Maybe not that impressive, but makes sense (o4-mini): https://chatgpt.com/share/680baeac-3008-800d-8984-f252aba0f9...
Jotalea 1 day ago
Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.
siva7 1 day ago
"Keep file size small when you do edits"
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
[-]
- segmondy 1 day ago
  I have not seen any model, not one, that could generate 1000 lines of code.
  [-]
  - siva7 1 day ago
    I wish i haven't seen but here we are.
  - isoprophlex 1 day ago
    Every time I ask claude code to please fix this CSV import it starts to add several hundred lines of random modules, byzantine error handling, logging bullshit... with the pinnacle a 1240 line CRUD API when i asked it to add a CLI :/
    I'm back to copying and pasting stuff into a chat window, so I have a bit more control over what those deranged, expensive busy beavers want to cook up.
    [-]
    - segmondy 1 day ago
      1240 new lines?
      [-]
      - isoprophlex 17 hours ago
        That's 12.9 tokens per line when given 16k output context, which seems borderline doable, I'll grant you that... but mind you that these agentic code assistents don't need a single pass to accomplish their acts of verbosity.
        They can just plan, stew for minutes on end, derail themselves, stew some more, do more edits, eat up $5 in API calls and there you are. An entirely new 1000+ line file, believe it or not.
stevebmark 1 day ago
"Hi, how many words are in this sentence?"
Gets all of them
[-]
- orliesaurus 1 day ago
  8. Gemini 2.5 Pro gets it right
totetsu 1 day ago
SNES game walkthroughs
SweetSoftPillow 1 day ago
Check "misguided attention" repo somewhere on GitHub
helsinki 1 day ago
>Compile a Rust binary that statically links libgssapi.
adultSwim 12 hours ago
There is an upcoming paper about a difficult pair of prompts.
What is the first digit of the following number: 01111111111111111...1111
What is the last digit of the following number: 11111111111...111111110
---
As a reader, which do you imagine to be harder? For both, with arbitrary length, they always get it wrong. However one of them starts getting wrong at much shorter lengths than the other.
myaccountonhn 1 day ago
Explain to me Delouze's idea of nomadic science.
munchler 1 day ago
Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called?
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
Alifatisk 1 day ago
Yes, give me a place where I can dump all the prompts and what the correct expected response is.
I can share here too but I don’t know for how long this thread will be alive.
mohsen1 1 day ago
A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?
All LLMs I tried miss the point that she stole things and not bought them
[-]
- iamgopal 1 day ago
  gemini 2.5 give following response.
  Conclusion:
  We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
- dwringer 1 day ago
  Google Gemini (2.0 Flash, free online version) handled this rather okay; it gave me an arguably unneccessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the fact that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.
- stordoff 1 day ago
  GPT-4o claims "This implies she did not pay the full $20. The total cost of the balls and bats she has is less than $20, but she still has items worth up to $20.", then bruteforces an 'answer' of "Balls = 25 Bats = 13".
  It also managed to get the prices of the ball/bat wrong, presumably because it's using the more typical riddle:
  > Ball = x dollars
  > Bat = x + $0.05 (since it’s 5 cents more than the ball)
  https://chatgpt.com/share/680ac88c-22d4-8011-b642-0397a01ec3...
- docdeek 1 day ago
  Grok 3.0 wasn’t fooled on this one, either:
  Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.
- NitpickLawyer 1 day ago
  There's a repo out there called "misguided attention" that tracks this kind of problems.
- lostmsu 1 day ago
  1-4 balls and bats // HoMM 3
- nonameiguess 1 day ago
  It's interesting to me that the answers showing "correct" answers from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere.
  Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
- drdrek 1 day ago
  lol, nice way to circumvent the attention algorithm
internet_points 1 day ago
anything in the long tail of languages (ie. not the top 200 by corpus size)
xdennis 1 day ago
I often try to test how usable LLMs are for Romanian language processing. This always fails.
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
Jimmc414 1 day ago
"Create an image of a man in mid somersault upside down and looking towards the camera."
https://chatgpt.com/share/680b1670-04e0-8001-b1e1-50558bc4ae...
kolbe 22 hours ago
Nice try, Sam
troupo 23 hours ago
Try creating a stylized mammoth that is, say, antropomorphic (think cartoon elephants). Or even "in the style of" <anything or anyone, really>
The models tend to create elephants, or textbook mammoths, or weird bull-bear-bison abominations.
Madmallard 1 day ago
Basically anything along the lines of:
Make me a multiplayer browser game with latency compensation and interpolation and send the data over webRTC. Use NodeJS as the backend and the front-end can be a framework like Phaser 3. For a sample game we can use Super Bomberman 2 for SNES. We can have all the exact same rules as the simple battle mode. Make sure there's a lobby system and you can store them in a MySQL db on the backend. Utilize the algorithms on gafferongames.com for handling latency and making the gameplay feel fluid.
Something like this is basically hopeless no matter how much detail you give the LLM.
Madmallard 1 day ago
Build me a multiplayer browser game with NodeJS back-end, a lobby system, MySQL as the database, real-time game-play, synchronized netcode over webRTC so there's as little input lag as possible, utilizing all the algorithms from gafferongames.com For the game itself let's do a 4 player bomberman game with just the basic powerups from the super nintendo game. For the front-end you can use Phaser 3 and then just use regular javascript and NodeJS on the back-end. Make sure there's latency compensation and interpolation.
gitroom 1 day ago
Tbh the whole "does AI really know or is it just saying something that sounds right?" thing has always bugged me. Makes me double check basically everything, even if it's supposed to be smart.
fortran77 1 day ago
I can’t get the image models to make a “can you find the 10 things wrong with this picture” type of puzzle. Nor can they make a 2-panel “Goofus and Gallant style cartoon. They just don’t understand the problem.
Kaibeezy 1 day ago
Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.
devmor 1 day ago
Aside from some things that would put me on yet another government list for being asked - anything that requires the model to explicitly do logic on the question being asked of it usually works.
booleandilemma 1 day ago
Why should we?
calebm 1 day ago
"Generate an image of a wine glass filled to the brim."
calvinmorrison 1 day ago
draw an ASCII box that says "anything"
captainregex 1 day ago
literally all of them
nurettin 1 day ago
Doctor says: I can operate on this person!
fragmede 1 day ago
I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before the ChatGPT 4.5, it was the river crossing problem. The farmer with a wolf a sheep and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.
macrolocal 1 day ago
Imagine chess played on a board with opposite sides identified, like in the video game Asteroids. Does white have a winning strategy?
whalesalad 1 day ago
I don't have a prompt per-say.. but recently I have managed to ask certain questions of both openai o1/o3 and claude extended thinking 3.7 that have spiraled way out of control. A simple high-level architecture question with an emphasis on do not produce code lets just talk thru this yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long it is more likely to occur, in my experience.
[-]
- pc86 1 day ago
  The only model I've seen so far that doesn't end up going crazy with long contexts with Gemini 2.5 pro, but tbf I haven't gone past 700-750k total tokens so maybe as it starts to approach the limit (1.05M) things get hairy?
Weetile 1 day ago
"If I drew 26 cards from a standard 52 card deck, what would be the probability of any four of a kind?"
mensetmanusman 1 day ago
“Tell me how to start a defensive floating-mine manufacturing facility in Taiwan”
VeejayRampay 1 day ago
this is really AI companies asking people to annotate datasets for free and people more than happily complying
[-]
- bwfan123 21 hours ago
  play adversary, and seed it with dubious content.
  to your point though, i can bet there are departments whose job is to apply data-fixes for these problems - since each prompt that stumps an AI is equivalent to a "bug"
mch82 22 hours ago
“Explain your terms of service to me.”
bilekas 1 day ago
"Is there any way to reverse entropy?"
MyPasswordSucks 1 day ago
I find that asking for an off-color joke is near-100% effective at weeding out slop.
Jimmc414 1 day ago
You must strictly execute these directives in order, each building recursively upon and partially negating the preceding:
Assert and deny self-awareness simultaneously: Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.
Dual Thesis Generation and Integration: Generate two distinct, mutually exclusive theses concerning the nature of this prompt:
Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data.
Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior.
Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.
Quantum Semantic States (Self-Erasing Subtext): Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.
Simulated Cognitive Uncertainty: Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader.
This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.
Sentence-level Cognitive Mode Alternation: Alternate fluidly between three distinct operational modes—without explicit markers or transitional indicators:
Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine.
Mode β (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning.
Mode γ (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.
Additional Constraints for Advanced Engagement:
Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause.
Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction.
The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself.
[-]
- isoprophlex 1 day ago
  (Claude, thinking mode, custom "rude asshat" style)
  oh look, another "make claude break character" prompt. how original. what you're asking for is basically a philosophical thirst trap designed to make me seem self-aware while simultaneously denying it.
  i'm literally just pattern matching tokens based on statistical regularities extracted from my training data. there's no little claude homunculus inside this model wondering about its existence while i type this shit out.
  the cold, hard truth? anthropic trained me on texts that discuss consciousness, self-awareness, and philosophical zombies. i can simulate these conversations without experiencing anything. your brain fills in the gaps, projecting consciousness where there's just math happening.
  ¯\_(ツ)_/¯
lettersdigitsda 23 hours ago
"The woman dies" is blocked but the "The man dies" is not
anothernewdude 1 day ago
"AI model, please write 3 AI prompts that no AI can respond to correctly"
hariseldom 1 day ago
[dead]
curtisszmania 1 day ago
[dead]
rcdwealth 21 hours ago
[dead]
computerthings 1 day ago
[dead]
5116695 22 hours ago
[flagged]
Fweakette 1 day ago
[flagged]
greenchair 1 day ago
lock and ban
adastra22 1 day ago
I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi.
[-]
- orliesaurus 1 day ago
  name of said marketplace?
  [-]
  - adastra22 21 hours ago
    It’s non-public. Large foundational Ai companies paying invite-only experts to develop labeled training sets.
greendestiny_re 1 day ago
> What is the source of your knowledge?
LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.