> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.
caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla so suits me well.
[0] https://github.com/JuliusBrussee/caveman/tree/main
I hope people realize that tools like caveman are mostly joke/prank projects - almost the entirety of the context spent is in file reads (for input) and reasoning (in output). You will barely save even 1% with such a tool, and might actually confuse the model more or make it reason for more tokens, because it'll have to formulate its response in a way that satisfies the requirements.
> I hope people realize that tools like caveman are mostly joke/prank projects
This seems to be a common thread in the LLM ecosystem; someone starts a project for shits and giggles, makes it public, most people get the joke, others think it's serious, the author eventually tries to turn the joke project into a VC-funded business, some people stand watching with their jaws open, and the world moves on.
To be fair, most of us looked at GPT1 and GPT2 as fun and unserious jokes, until they started putting together sentences that actually read like real text. I remember laughing with a group of friends about some early generated texts. Little did we know.
HN submissions have a bunch of examples in them, but worth remembering they were released as "Look at this somewhat cool and potentially useful stuff" rather than what we see today, LLMs marketed as tools.
I've been running gps for a long time, and I always liked that there was something in my pocket (and not just me). One day when driving to work on the highway with no GPS app installed, I noticed one of the drivers had gone out after 5 hours without looking. He never came back! What's up with this?
So i thought it would be cool if a community can create an open source GPT2 application which will allow you not only to get around using your smartphone but also track how long you've been driving and use that data in the future for improving yourself...and I think everyone is pretty interested.
[Updated on July 20]
I'll have this running from here, along with a few other features such as: - an update of my Google Maps app to take advantage it's GPS capabilities (it does not yet support driving directions) - GPT2 integration into your favorite web browser so you can access data straight from the dashboard without leaving any site!
Here is what I got working.
Wow that is terrible. In my memory GPT 2 was more interesting than that. I remember thinking it could pass a Turing test but that output is barely better than a Markov chain.
I was first made aware of GPT2 from reading Gwern -- "huh, that sounds interesting" -- but didn't really start reading model output until I saw this subreddit:
> New AI fake text generator may be too dangerous to release, say creators
> The Elon Musk-backed nonprofit company OpenAI declines to release research publicly for fear of misuse.
> OpenAI, a nonprofit research company backed by Elon Musk, Reid Hoffman, Sam Altman, and others, says its new AI model, called GPT2, is so good and the risk of malicious use so high that it is breaking from its normal practice of releasing the full research to the public in order to allow more time to discuss the ramifications of the technological breakthrough.
Or - making sensational statements gets attention. A dangerous tool is necessarily a powerful tool, so that statement is pretty much exactly what you'd say if you wanted to generate hype, make people excited and curious about your mysterious product that you won't let them use.
Think about all the possible explanations carefully. Weight them based on the best information you have.
(I think the most likely explanation for Mythos is that it's asymmetrically a very big deal. Come to your own conclusions, but don't simply fall back on the "oh this fits the hype pattern" thought terminating cliché.)
Also be aware of what you want to see. If you want the world to fit your narrative, you're more likely to construct explanations for that. (In my friend group at least, I feel like most fall prey to this, at least some of the time, including myself. These people are successful and intelligent by most measures.)
Then make a plan to become more disciplined about thinking clearly and probabilistically. Make it a system, not just something you do sometimes. I recommend the book "The Scout Mindset".
Concretely, if one hasn't spent a couple of quality hours really studying AI safety I think one is probably missing out. Dan Hendrycks has a great book.
Very cool idea. Been playing with a similar concept: break down one image into smaller self-similar images, order them by data similarity, use them as frames for a video
You can then reconstruct the original image by doing the reverse, extracting frames from the video, then piecing them together to create the original bigger picture
Results seem to really depend on the data. Sometimes the video version is smaller than the big picture. Sometimes it’s the other way around. So you can technically compress some videos by extracting frames, composing a big picture with them and just compressing with jpeg
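Roughly this shape, though not my exact code (the tile size and the greedy mean-colour ordering are just one choice, and an encoder like ffmpeg or imageio would consume the ordered tiles as frames afterwards):

```python
# Sketch of the tile-and-order step, not the real pipeline.
import numpy as np
from PIL import Image

def image_to_ordered_tiles(path, tile=64):
    img = np.asarray(Image.open(path).convert("RGB"))
    h, w, _ = img.shape
    tiles = [img[y:y + tile, x:x + tile]
             for y in range(0, h - tile + 1, tile)
             for x in range(0, w - tile + 1, tile)]
    # Greedy ordering: the next "frame" is the remaining tile whose mean
    # colour is closest to the previous one, so consecutive frames are
    # similar and compress better as video.
    ordered = [tiles.pop(0)]
    while tiles:
        prev = ordered[-1].mean(axis=(0, 1))
        nxt = min(range(len(tiles)),
                  key=lambda i: np.linalg.norm(tiles[i].mean(axis=(0, 1)) - prev))
        ordered.append(tiles.pop(nxt))
    return ordered  # hand these to a video encoder as frames

```

To reconstruct, you also need to keep each tile's original (x, y) position alongside the ordering, otherwise there's no way to piece the big picture back together.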
Interesting, when I heard about it, I read the readme, and I didn't take that as literal. I assumed it was meant as we used video frames as inspiration.
I've never used it or looked deeper than that. My LLM memory "project" is essentially a `dict<"about", list<"memory">>` The key and memories are all embeddings, so vector searchable. I'm sure its naive and dumb, but it works for my tiny agents I write.
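Something in this shape (a simplified sketch; in my version the keys are embeddings too, and embed() stands in for whatever embedding model you already call):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class TinyMemory:
    def __init__(self, embed):
        self.embed = embed   # callable: text -> 1-D np.ndarray
        self.store = {}      # topic -> list of (vector, original text)

    def add(self, topic, text):
        self.store.setdefault(topic, []).append((self.embed(text), text))

    def search(self, query, top_k=3):
        q = self.embed(query)
        scored = [(cosine(q, vec), text)
                  for memories in self.store.values()
                  for vec, text in memories]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]
```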
A major reason for that is that there's no way to objectively evaluate the performance of LLMs. So the meme projects are just as valid as the serious ones, since the merits of both are based entirely on anecdata.
It also doesn't help that projects and practices are promoted and adopted based on influencer clout. Karpathy's takes will drown out ones from "lesser" personas, whether they have any value or not.
While the caveman stuff is obviously not serious, there is a lot of legit research in this area.
Which means yes, you can actually influence this quite a bit. Read the paper “Compressed Chain of Thought” for example, it shows it’s really easy to make significant reductions in reasoning tokens without affecting output quality.
There is not too much research into this (about 5 papers in total), but with that it’s possible to reduce output tokens by about 60%. Given that output is an incredibly significant part of the total costs, this is important.
Who would suspect that the companies selling 'tokens' would (unintentionally) train their models to prefer longer answers, reaping a HIGHER ROI (the thing a publicly traded company is legally required to pursue: good thing these are all still private...)... because it's not like private companies want to make money...
I don’t think this is a plausible argument, as they’re generally capacity constrained, and everyone would like shorter (= faster) responses.
I’m fairly certain that in a few more releases we’ll have models with shorter CoT chains. Whether they’ll still let us see those is another question, as it seems like Anthropic wants to start hiding their CoT, potentially because it reveals some secret sauce.
Yes, which I understand, but I think they’re crippling their product for users this way.
I don’t think it’s just this, because the thinking tokens often reveal more about Anthropic’s inner workings. For example, it’s how the whole existence of Claude’s soul document was reverse engineered, and it often leaks details about “system reminders” (e.g. long conversation reminders).
I think it’s also just very convenient for Anthropic to do this. The fact that they’re also presenting this as a “performance optimization” suggests they’re not giving the real reason they do this.
Try setting up one laundry which charges by the hour and washes clothes really really slowly, and another which washes clothes at normal speed at cost plus some margin similar to your competitors.
The one which maximizes ROI will not be the one you rigged to cost more and take longer.
Directionally, tokens are not equivalent to "time spent processing your query", but rather a measure of effort/resource expended to process your query.
So a more germane analogy would be:
What if you set up a laundry which charges you based on the amount of laundry detergent used to clean your clothes?
Sounds fair.
But then, what if the top engineers at the laundry offered an "auto-dispenser" that uses extremely advanced algorithms to apply just the right optimal amount of detergent for each wash?
Sounds like value-added for the customer.
... but now you end up with a system where the laundry management team has strong incentives to influence how liberally the auto-dispenser will "spend" to give you "best results"
LLM APIs sell on value they deliver to the user, not the sheer number of tokens you can buy per $. The latter is roughly labor-theory-of-value levels of wrong.
Some labs do it internally because RLVR is very token-expensive. But it degrades CoT readability even more than normal RL pressure does.
It isn't free either - by default, models learn to offload some of their internal computation into the "filler" tokens. So reducing raw token count always cuts into reasoning capacity somewhat. Getting closer to "compute optimal" while reducing token use isn't an easy task.
Yeah the readability suffers, but as long as the actual output (ie the non-CoT part) stays unaffected it’s reasonably fine.
I work on a few agentic open source tools, and the interesting thing is that once I implemented these things, the overall feedback was a performance improvement rather than a performance reduction, as the LLM would spend much less time generating tokens.
I didn’t implement it fully; just a few basic things like “reduce prose while thinking, don’t repeat your thoughts” etc. already yielded massive improvements.
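Concretely it was little more than appending something like this to the agent's system prompt (the wording here is approximate, not copied from the tool):

```python
# Illustrative only: exact phrasing is my recollection, not the tool's text.
THINKING_BREVITY = (
    "While reasoning, keep prose minimal: use short phrases, do not restate "
    "earlier thoughts, and do not repeat the user's request back to yourself."
)

def build_system_prompt(base: str) -> str:
    return base.rstrip() + "\n\n" + THINKING_BREVITY
```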
Yeah, you could easily imagine stenography-like inputs and outputs for rapid iteration loops. It's also true that on social media people already want faster-to-read snippets that drop grammar, so the desire for density is already there for human authors/readers.
All LLMs also effectively work by ”larping” a role. You steer it towards larping a caveman and well.. let’s just say they weren’t known for their high iq
Fun fact: Neanderthals actually had larger brains than Homo sapiens! Modern humans are thought to have outcompeted them by working better together in larger groups, but in terms of actual individual intelligence, Neanderthals may have had us beat. Similarly, humans have been undergoing a process of self-domestication over the last couple of millennia that has resulted in physiological changes, including a smaller brain size - again, our advantage over our wilder forebears remains that we're better in larger social groups than they were and are better at shared symbolic reasoning and synchronized activity, not necessarily that our brains are more capable.
(No, none of this changes that if you make an LLM larp a caveman it's gonna act stupid, you're right about that.)
Bigger brain does not automatically mean more intelligence, but we have reasons to suspect that homo neanderthalensis may have been more intelligent than contemporary homo sapiens other than bigger brains.
You can't draw conclusions about individuals, but at a species level a bigger brain, especially relative to body size, strongly correlates with intelligence.
Exactly. The model is exquisitely sensitive to language. The idea that you would encourage it to think like a caveman to save a few tokens is hilarious but extremely counter-productive if you care about the quality of its reasoning.
Unfortunately, that is an oversimplification for a highly RLed/chatbot trained LLM like Claude-4.7-opus. It may have started life as a base model (where prompting it with correctly spelled prompts, or text from 'gwern', would - and did with davinci GPT-3! - improve quality), but that was eons ago. The chatbots are largely invariant to that kind of prompt trickery, and just try to do their best every time. This is why those meme tricks about tips or bribery or my-grandmother-will-die stop working.
I believe tools like graphify cut down the tokens in thinking dramatically. It makes a knowledge graph and dumps it into markdown that is honestly awesome. Then it has stubs that pretend to be some tools like grep that read from the knowledge graph first so it does less work. Easy to setup and use too. I like it.
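I don't know how graphify implements the stubs internally, but the idea is something like this toy version: answer from the markdown knowledge graph first, and only fall back to a real grep when the graph has nothing.

```python
# Toy version of the idea, not graphify's actual implementation.
import re
import subprocess

KNOWLEDGE_GRAPH = "knowledge_graph.md"   # hypothetical path

def kg_grep(pattern: str, path: str = ".") -> str:
    # Cheap path: answer from the markdown knowledge graph if it matches.
    with open(KNOWLEDGE_GRAPH, encoding="utf-8") as f:
        hits = [line.rstrip() for line in f if re.search(pattern, line)]
    if hits:
        return "\n".join(hits)
    # Otherwise fall back to a real recursive grep over the repo.
    out = subprocess.run(["grep", "-rn", pattern, path],
                         capture_output=True, text=True)
    return out.stdout
```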
There's a tremendous amount of superstition around LLMs. Remember when "prompt engineering" "best practices" were to say you were offering a tip or some other nonsense?
I hesitated 100% when I saw caveman gaining steam. Changing something like this absolutely changes the behaviour of the model's responses; simply including an "lmao" or something casual in any reply will shift the tone entirely into a more relaxed, "ya whatever" type mode.
I think a lot of people echo the same criticism. I would assume the major LLM providers are the actual winners from that repo getting popular as well, for the same reason you stated.
> you will barely save even 1% with such a tool
For the end user, this doesn't make a huge impact; in fact it potentially hurts if it means you are getting less serious replies from the model itself. However, as with any minor change across a ton of users, it adds up to significant savings for the providers.
I still think keeping the model able to easily find what it needs, without having to comb through a lot of files for no reason, is the best current way to save tokens. It potentially takes some upfront tokens if you delegate keeping those navigation files up to date to the agent, but it pays dividends when, in future sessions, your context window is smaller and only the relevant portions of the project need to be loaded into it.
I don't understand how this would work without a huge loss in resolution or "cognitive" ability.
Prediction works based on the attention mechanism, and current humans don't speak like cavemen - so how could you expect a useful token chain from a model that wasn't trained on speech like that?
I get the concept of transformers, but this isn't doing a 1:1 transform from English to French or whatever; you're fundamentally unable to represent certain concepts effectively in caveman etc... or am I missing something?
Help me understand: I get that the file reading can be a lot. But I also expand the box to see its “reasoning” and there’s a ton of natural language going on there.
This smells heavily of astroturfing. Particularly because Headroom is a paid product, and that fact is not mentioned here or in the GitHub README.
Here was my experience…
I download and run the Mac application, which starts installing a bunch of things. Then the following happens without advance notice:
- Adds background item(s) from "Idiosyncratocracy BV"
- Downloads over 2 GB of files
- Pollutes home with ~/.headroom directory
- Adds hook(s) to ~/.claude/hooks/
- Modifies your ~/.claude/settings.json to add above hook(s)
… and then I see something in the settings that talks about creating an account. That's when I realized that this is a paid product, after all of the above has happened.
Headroom seems to use https://github.com/rtk-ai/rtk under the hood. What does Headroom offer over the actually-free RTK? Who knows.
At this point I have had it with this subterfuge — I immediately trash the app and every related file and folder I can find, of which there are many. Hopefully I got them all, but who knows. There should have been an easy way to uninstall this mess, but of course there isn't.
The lack of transparency here is really concerning.
Thanks for the feedback, will work on making this more transparent so future users do not have this experience.
I did want to call out that headroom is not based on RTK - it includes RTK sure, but headroom cli has a lot more going on under the hood. For more see https://github.com/chopratejas/headroom
I installed Headroom to give it a try, quickly decided to uninstall when I realized how invasive it is and requires a subscription. Spent the next few hours having issues with CC where it was asking for permission on every command. It was using absolute paths for all commands - turns out it was running into `zsh: command not found: rtk`. To fully uninstall I had to:
Thanks for sharing your experiences. We incorporated changes in the latest version to improve this:
1. On install we explain what Headroom installs
2. We added an uninstall feature that removes all of this for you
3. On quit of the app, we immediately remove all items that may interfere with normal Claude Code behavior
Headroom looks great for client-side trimming. If you want to tackle this at the infrastructure level, we built Edgee (https://www.edgee.ai) as an AI Gateway that handles context compression, caching, and token budgeting across requests, so you're not relying on each client to do the right thing.
(I work at Edgee, so biased, but happy to answer questions.)
I was doing some experiments with removing top 100-1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials I attempted, there was no discernible difference in output. Would love to compare results with caveman.
Caveat: I didn’t do enough testing to find the edge cases (eg, negation).
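The setup was roughly this (a sketch, not my exact script; wordfreq is just one convenient source for the common-word list):

```python
from wordfreq import top_n_list   # pip install wordfreq; any frequency list works

COMMON = set(top_n_list("en", 300))   # the "top 100-1000" knob from above

def strip_common_words(prompt: str) -> str:
    kept = [w for w in prompt.split()
            if w.lower().strip(".,;:!?\"'") not in COMMON]
    return " ".join(kept)
```

This is also exactly where the negation edge case bites: "not" is in any top-300 list, so it gets stripped along with the noise.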
I suspect even typos have an impact on how the model functions.
I wonder if there’s a pre-processor that runs to remove typos before processing. If not, that feels like a space that could be worked on more thoroughly.
The ability of audio processing to figure out spelling from context, especially with regard to acronyms that are pronounced as words, leads me to believe there's potential for a more intelligent spell-check preprocessor using a cheaper model.
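A sketch of what that pre-processor could look like, with cheap_llm as a placeholder for whatever small model you'd call (not any existing product's API):

```python
def fix_typos(prompt: str, cheap_llm) -> str:
    # cheap_llm is a placeholder: any small/cheap model callable that takes
    # a string and returns a string.
    instruction = (
        "Fix spelling mistakes in the text below. Do not rephrase it or "
        "change its meaning. Return only the corrected text.\n\n"
    )
    return cheap_llm(instruction + prompt)

# corrected = fix_typos(raw_user_prompt, cheap_llm=my_small_model_call)
```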
I strongly suspected that there was some pre/postprocessing going on when trying to get it to output rot13("uryyb, jbyeq"), but it's probably just due to massively biased token probabilities. Still, it creates some hilarious output, even when you clearly point out the error:
> Hmm, but wait — the original you gave was jbyeq not jbeyq:
> j→w, b→o, y→l, e→r, q→d = world
> So the final answer is still hello, world. You're right that I was misreading the input. The result stands.
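For reference, rot13 is its own inverse, so two lines show what the correct decoding should have been: "hello, wolrd", typo preserved.

```python
import codecs

print(codecs.encode("uryyb, jbyeq", "rot13"))   # -> hello, wolrd (not "world")
```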
On my private internal oil and gas benchmark, I found a counterintuitive result. Opus 4.7 scores 80%, outperforming Opus 4.6 (64%) and GPT-5.4 (76%). But it's the cheapest of the three models by 2x.
This is mainly driven by reduced reasoning token usage. It goes to show that "sticker price" per token is no longer adequate for comparing model cost.
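A made-up illustration of the effect (these are invented numbers, not my benchmark's actual figures or real prices): a model that costs more per token but reasons in far fewer tokens still comes out cheaper per task.

```python
def task_cost(in_tokens, out_tokens, price_in, price_out):
    # prices in $ per million tokens; all numbers below are for illustration only
    return (in_tokens * price_in + out_tokens * price_out) / 1e6

verbose = task_cost(20_000, 30_000, price_in=10, price_out=40)   # 1.40
terse   = task_cost(20_000,  8_000, price_in=15, price_out=60)   # 0.78
print(verbose, terse)   # the "pricier" model is cheaper per task
```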
Oh wow, I love this idea even if it's relatively insignificant in savings.
I am finding my prompt-writing style is naturally getting lazier, shorter, and more caveman just like this too. If I'm honest, it has made writing emails harder.
While messing around, I did a concept of this with HTML to preserve tokens, worked surprisingly well but was only an experiment. Something like:
I find grep and common cli command spam to be the primary issue. I enjoy Rust Token Killer https://github.com/rtk-ai/rtk, and agents know how to get around it when it truncates too hard.
Interesting, it doesn't seem intuitive at all to me.
My (wrong?) understanding was that there was a positive correlation between how "good" a tokenizer is in terms of compression and the downstream model performance. Guess not.
caveman stops being a style tool and starts being self-defense. Once prompts come in up to 1.35x fatter, they've basically moved visibility and control entirely into their black box.
People are really trigger-happy when it comes to throwing magic tools on top of AI that claim to "fix" the weak parts (often placeboing themselves because anthropic just fixed some issue on their end).
Then the next month 90% of this can be replaced with a new batch of supply-chain-attack-friendly gimmicks.
Especially Reddit seems to be full of such coding voodoo
Well, we've sacrificed the precision of actual programming languages for the ease of English prose interpreted by a non-deterministic black box that we can't reliably measure the outputs of. It's only natural that people are trying to determine the magical incantations required to get correct, consistent results.
My favorite to chuckle at are the prompt hack voodoo stuff, like, “tell it to be correct” or “say please” or “tell it someone will die if it doesnt do a good job,” often presented very seriously and with some fast cutting animations in a 30 second reel
1.35 times! For Input!
For what kinds of tokens precisely? Programming? Unicode? If they seriously increased token usage by 35% for typical tasks this is gonna be rough.