Claude Token Counter, now with model comparisons

(simonwillison.net)

60 points | by twapi 5 hours ago

7 comments

kouteiheika 2 hours ago
> Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6
Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)
Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):
```
    Kill -> [70074]
    Killed -> [192794]
    kill -> [25752]
    k|illed -> [74, 7905]
    <space>kill -> [15874]
    <space>killed -> [17372]
```
You have 3 different tokens which encode the same word (Kill, kill, <space>kill) depending on its capitalization and whether there's a space before it or not; you also have separate tokens for the past tense, and so on.
This is not necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now, imagine if you'd encode these like this:
```
   <capitalize>|kill
   <capitalize>|kill|ed
   kill|
   kill|ed
   <space>|kill
   <space>|kill|ed
```
Notice that this makes much more sense now - the model now only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past tense suffix) is, and it can compose those together. The downside is that it increases the token usage.
So I wouldn't be surprised if this is what they did. Or, my guess #2: they removed the tokenizer altogether, replaced it with a small trained model (something like the Byte Latent Transformer), and simply "emulate" the token counts.
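A rough sketch of the compositional encoding described above. This is a toy, not any real tokenizer: the root and suffix vocabularies and the `compose_tokens` helper are made up for illustration, and it only handles a single word with an optional leading space.

```python
# Toy illustration only. ROOTS and SUFFIXES are hypothetical vocabularies.
ROOTS = {"kill"}
SUFFIXES = ("ed",)

def compose_tokens(text: str) -> list[str]:
    """Split one word (optionally preceded by one space) into marker, root, and suffix tokens."""
    tokens = []
    if text.startswith(" "):
        tokens.append("<space>")
        text = text[1:]
    if text[:1].isupper():
        tokens.append("<capitalize>")
        text = text[0].lower() + text[1:]
    for suffix in SUFFIXES:
        root = text[: -len(suffix)]
        if text.endswith(suffix) and root in ROOTS:
            # Known root plus known suffix: emit them as two separate tokens.
            tokens += [root, suffix]
            return tokens
    tokens.append(text)  # fall back to the whole (lowercased) word
    return tokens

for sample in ["Kill", "Killed", "kill", "killed", " kill", " killed"]:
    print(repr(sample), "->", "|".join(compose_tokens(sample)))
```

All six inputs are covered by four distinct tokens (`<space>`, `<capitalize>`, `kill`, `ed`), at the price of emitting up to three tokens for text that a conventional vocabulary might encode as one.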
- ipieter 41 minutes ago
  There is currently very little evidence that morphological tokenizers help model performance [1]. For languages like German (where words get glued together) there is a bit more evidence (e.g. a paper I worked on [2]), but overall I'm starting to suspect the bitter lesson is also true for tokenization.
  [1] https://arxiv.org/pdf/2507.06378
  [2] https://pieter.ai/bpe-knockout/
- dannyw 51 minutes ago
  LLMs are explicitly designed to handle, and possibly also 'learn' from, different tokens encoding similar information. I found this video from 3Blue1Brown very informative: https://www.youtube.com/watch?v=wjZofJX0v4M
  Also, think about how an LLM would handle different languages.
- fooker 1 hour ago
  This is how language models have worked since their inception, and it has steadily improved since about 2018.
  See embedding models.
  > they removed the tokenizer altogether
  This is an active research topic, no real solution in sight yet.
- friendzis 47 minutes ago
  This is such a superficial, English-centric take, but it might well be true. It seems to me that in non-English languages the models, especially ChatGPT, have suffered in the declension department and output words in cases that do not fit the context.
  I have just run an experiment: I took a word and asked the models (ChatGPT, Gemini and Claude) to break it into parts. The catch is that it could be either root + suffix + ending or root + ending. None of them recognized this duality; each picked one possible interpretation.
  Any such approach to tokenizing assumes context free (-ish) grammar, which is just not the case with natural languages. "I saw her duck" (and other famous examples) is not uniquely tokenizable without a broader context, so either the tokenizer has to be a model itself or the model has to collapse the meaning space.
- anonymoushn 1 hour ago
  Their old tokenizer performed some space collapsing that allowed them to use the same token id for a word with and without the leading space (in cases where the context usually implies a space and one is not present, a "no space" symbol is used).
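A minimal sketch of the space-collapsing idea described above, not Anthropic's actual implementation. It assumes a space is implied before every piece except the first, and that the text does not start with a space; the `<nospace>` marker name and both helper functions are hypothetical.

```python
NO_SPACE = "<nospace>"  # hypothetical marker name

def encode(pieces: list[str]) -> list[str]:
    """Drop leading spaces; mark only the places where the implied space is absent."""
    tokens = []
    for i, piece in enumerate(pieces):
        has_space = piece.startswith(" ")
        if i > 0 and not has_space:
            tokens.append(NO_SPACE)  # a space is implied here, but none is present
        tokens.append(piece[1:] if has_space else piece)
    return tokens

def decode(tokens: list[str]) -> str:
    """Re-insert the implied spaces, except at the start or after a NO_SPACE marker."""
    out = []
    glue = True  # no space before the very first piece
    for tok in tokens:
        if tok == NO_SPACE:
            glue = True
            continue
        out.append(tok if glue else " " + tok)
        glue = False
    return "".join(out)

tokens = encode(["I", " kill", "ed", " time"])
print(tokens)
print(decode(tokens))
```

Here `kill` gets the same token whether or not it was preceded by a space; the extra marker is only paid when a piece is glued to the previous one.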
great_psy 2 hours ago
Is there any stated reason from Anthropic for why they changed the tokenizer?
Is there a quality increase from this change, or is it a money grab?
- ChadNauseam 15 minutes ago
  How would it be a money grab? If the new tokenizer requires more tokens to encode the same information, it costs them more money for inference. The point of charging per token is that the cost is proportional to the number of tokens. That's my understanding anyway
- Aurornis 1 hour ago
  The tokenizer is an important part of overall model training and performance. It’s only one piece of the overall cost per request. If a tokenizer that produces more tokens also leads to a model that gets to the correct answer more quickly and requires fewer re-prompts because it didn’t give the right answer, the overall cost can still be lower.
  Comparisons are still ongoing but I have already seen some that suggest that Opus 4.7 might on average arrive at the answer with fewer tokens spent, even with the additional tokenizer overhead.
  So, no, not a money grab.
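The trade-off described above can be put into rough numbers. This is back-of-the-envelope arithmetic only: the 1.46x ratio is the figure quoted at the top of the thread, while the per-attempt token count and the attempt counts are hypothetical, not measurements.

```python
def total_tokens(tokens_per_attempt: float, attempts: int) -> float:
    """Total tokens billed across all attempts needed to reach a correct answer."""
    return tokens_per_attempt * attempts

TOKEN_RATIO = 1.46    # tokenizer ratio quoted at the top of the thread
BASE_TOKENS = 10_000  # hypothetical tokens per attempt with the old tokenizer

# Hypothetical scenario: the old model needs one re-prompt, the new one does not.
old_total = total_tokens(BASE_TOKENS, attempts=2)
new_total = total_tokens(BASE_TOKENS * TOKEN_RATIO, attempts=1)

print(f"old: {old_total:.0f} tokens, new: {new_total:.0f} tokens")
```

Under these assumptions the new tokenizer only comes out ahead if the number of attempts drops by more than a factor of 1.46; with equal attempts it simply costs 1.46x as much.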
aliljet 1 hour ago
This is the rugpull that is starting to push me to reconsider my use of Claude subscriptions. The "free ride" part of this being funded as a loss leader is coming to a close. While we break away from Claude, my hope is that I can continue to send simple problems to very smart local LLMs (Qwen 3.6, I see you) and reserve Claude for purely extreme problems appropriate for its extreme price.
- KronisLV 17 minutes ago
  > This is the rugpull that is starting to push me to reconsider my use of Claude subscriptions.
  I'm still with them because the model is good, but yes, I'm noticing my limits burning up somewhat faster on the 100 USD tier. I bet the 20 USD tier is even more useless.
  I wouldn't call it a rugpull, since it seems like there might be good technical reasons for the change, but at the same time we won't know for sure if they don't COMMUNICATE that to us. I feel like what's missing is a technical blog post that tells us more about the change and the tokenizer, although I fear that this won't be done due to wanting to keep "trade secrets" or whatever (the unfortunate consequence of which is making the community feel like they're being rugpulled).
- londons_explore 1 hour ago
  I think an LLM that is a decent chunk smarter/better than other LLMs ought to be able to charge a premium, perhaps 10x or 100x its competitors' prices.
  See for example the price difference between taking a taxi and taking the bus, or between hiring a real lawyer vs. your friend at the bar who will give his uninformed opinion for a beer.
- DeathArrow 1 hour ago
  The quality of answers from quantized models is noticeably worse than from the full model.
  You'll be better off using Qwen 3.6 Plus through the Alibaba coding plan.
tomglynch 3 hours ago
Interesting findings. Might need a way to downsample images on upload to keep costs down.
- simonw 3 hours ago
  Yeah that should work - it looks like the same pixel dimension image at smaller sizes has about the same token cost for 4.6 and 4.7, so the image cost increase only kicks in if you use larger images that 4.6 would have presumably resized before inspecting.
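A small sketch of the downsampling step suggested above. It only computes target dimensions while preserving aspect ratio; the `fit_within` helper and the `max_edge` value are hypothetical, since the thread does not state the size at which 4.6 resized images.

```python
def fit_within(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough, leave untouched
    scale = max_edge / longest
    # Round to whole pixels and never let a dimension collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))

# Hypothetical limit of 1000 px on the longer edge.
print(fit_within(4000, 3000, max_edge=1000))
print(fit_within(800, 600, max_edge=1000))
```

The resulting dimensions can then be passed to whatever image library does the actual resize before upload.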
mudkipdev 2 hours ago
Why do you need an API key to tokenize the text? Isn't it supposed to be a cheap step that everything else in the model relies on?
- kouteiheika 1 hour ago
  I'd guess it's because they don't want people to reverse engineer it.
  Note that they're the only provider which doesn't make their tokenizer available offline as a library (i.e. the only provider whose tokenizer is secret).
- weird-eye-issue 53 minutes ago
  To prevent abuse? It's a completely free endpoint so I don't understand your complaint.
- simonw 2 hours ago
  I'd love it if that API (which I do not believe Anthropic charge anything for) worked without an API key.