I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting Chinese characters. It is useless now for any serious coding work.
I'm on their pro plan and I respectfully disagree - it's genuinely excellent with GLM 5.1 so long as you remember to /compact once it hits around 100k tokens. At that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways it's arguably better.
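The "/compact around 100k" rule above is easy to automate on the client side. A minimal sketch, assuming the rough 4-characters-per-token heuristic; the threshold comes from this thread, not from anything z.ai documents:

```python
# Flag a conversation for /compact once it nears the ~100k "dumb zone".
COMPACT_THRESHOLD = 100_000  # tokens; assumption taken from the comments above

def estimate_tokens(messages: list[str]) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return sum(len(m) for m in messages) // 4

def should_compact(messages: list[str], threshold: int = COMPACT_THRESHOLD) -> bool:
    return estimate_tokens(messages) >= threshold

# Example: 500 messages of ~1000 characters each is roughly 125k tokens.
history = ["x" * 1000] * 500
print(should_compact(history))  # True
```

A real harness would use the provider's reported usage numbers instead of a character count, but even this crude check is enough to prompt a /compact before quality falls off.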
Seconded. I'm getting used to the changes that happen in the conversation now, and can work out when it's time for my little coding buddy to have a nap.
And Opus is absolutely terrible at guessing how many tokens it's used. Having that as a number that the model can access itself would be a real boon.
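A harness could give the model exactly that number as a tool. A hedged sketch; the schema shape follows common function-calling conventions, and all names here are mine, not any particular provider's API:

```python
# Expose the harness's real token count to the model as a callable tool,
# since the model itself is bad at guessing how many tokens it has used.

def make_context_usage_tool(get_token_count):
    """`get_token_count` is harness-side code that knows the true usage."""
    schema = {
        "name": "context_usage",
        "description": "Return tokens used so far in this conversation.",
        "parameters": {"type": "object", "properties": {}},
    }

    def handler(_args):
        # Called when the model invokes the tool; returns the live count.
        return {"tokens_used": get_token_count()}

    return schema, handler

# Usage: register `schema` with the model, dispatch calls to `handler`.
schema, handler = make_context_usage_tool(lambda: 87_000)
print(handler({}))  # {'tokens_used': 87000}
```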
The Dumb Zone for Opus has always started at 80-100k tokens. The 1M token window just made the dumb zone bigger. Probably fine if the work isn't complicated but really I never want an Opus session to go much beyond 100k.
The cost per message increases with context while quality decreases, so it's still generally good to practice strategic context engineering. Even with cross-repo changes on enterprise systems, it's uncommon to need more than 100k (unless I'm using Playwright MCP for testing).
I had thought this, but my initial experience was that performance degradation became noticeable not long after crossing the old 250k barrier.
So it has been convenient not to have hard stops and to have the extra headroom, but I still try to /clear at an actual 25% of the 1M anyhow.
This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.
I'm genuinely surprised. I use Copilot at work, which is capped at 128K regardless of model, and it's a monorepo. Admittedly I know our code base really well, so I can quickly point it toward different things directly, but I don't think I ever needed compacting more than a handful of times in the past year. Let alone 1M tokens.
The context windows of these Chinese open-source subscriptions (GLM, Minimax, Kimi) are too small, and I'm guessing it's because they are trying to keep them cheap to run. Fine for OpenClaw, not so much for coding.
I haven't screenshotted it, alas, but it goes from being a perfectly reasonable, chatty LLM to suddenly spewing words and nonsense characters around this threshold, at least for me as a z.ai Pro (mid-tier) user.
For around a month the limit seemed to be a little over 60k! I was despondent!!
What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.
More maliciously minded: this artificial limit also gives them a way to dial in system load. Simply not delivering the full context window the model supports reduces the work they have to host?
But to the question: yes, compaction is absolutely required. The AI can't even speak; it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could build this into the harness, so no; it's a limitation of our tooling that it doesn't work around the stated context window being (effectively) a lie.
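What "building it into the harness" could look like, as a minimal sketch: treat the effective window (here ~100k tokens, not the advertised figure) as the real limit and compact automatically before each request. The 4-chars-per-token heuristic and all names are assumptions, not z.ai's API:

```python
# Harness-side auto-compaction against the *effective* context window.
EFFECTIVE_WINDOW = 100_000  # tokens; observed limit, not the advertised one

def rough_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token."""
    return len(text) // 4

def auto_compact(messages, summarize, limit=EFFECTIVE_WINDOW):
    """Summarize the oldest messages once the estimate exceeds `limit`.

    `summarize` is a hypothetical callable (e.g. a cheap model call)
    that condenses a list of messages into one short message.
    """
    total = sum(rough_tokens(m) for m in messages)
    if total < limit:
        return messages
    # Keep the most recent half verbatim; summarize the older half.
    cut = len(messages) // 2
    return [summarize(messages[:cut])] + messages[cut:]
```

Running this before every request means the user never has to notice the threshold at all, which is the point: the tooling, not the user, should absorb the gap between the stated and the real window.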
I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.
The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).
I have gone back to having it create a todo.md file and break the work into very small tasks. Then I just loop over each task with a clear context, and it works fine. A design.md or similar also helps, but most of the time I just have all that in a README.md file. I was also suspicious around 100k, almost to the token, as the point where it starts doing loops etc.
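The small-tasks loop described above can be sketched in a few lines. `run_agent` is a stand-in for whatever CLI or API call starts a coding session with a clean context; the checkbox format and file names are assumptions matching the comment:

```python
# Loop over todo.md tasks, giving each one a brand-new context.

def read_tasks(todo_path="todo.md"):
    """Parse unchecked markdown checkbox items ("- [ ] ...") from todo.md."""
    with open(todo_path) as f:
        return [line[len("- [ ] "):].strip()
                for line in f
                if line.startswith("- [ ] ")]

def run_all(run_agent, design="README.md"):
    for task in read_tasks():
        # Fresh context each time: just the design doc plus one small task.
        run_agent(prompt=f"Read {design}, then do exactly this task: {task}")
```

Because every task starts from zero context, no single session ever drifts near the degradation threshold.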
I am on the mid-tier Coding plan to try it out for the sake of curiosity.
During off-peak hours, a simple 3-line CSS change took over 50 minutes, and it routinely times out mid-tool-call and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files.
My impression is that different users get vastly different service, possibly based on location. I live in Western Europe, and it works perfectly for me. Never had a single timeout or noticeable quality degradation. My brother lives in East Asia, and it's unusable for him. Some days, it just literally does not work, no API calls are successful. Other days, it's slow or seems dumber than it should be.
Starting an hour or two ago GLM's API endpoint is failing 7/8 times for me, my editor is retrying every request with backoff over a dozen times before it succeeds and wildly simple changes are taking over 30 minutes per step.
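The retry-with-backoff behavior described above, sketched explicitly. `call_api` is a placeholder for the actual request function, and the attempt count mirrors the "over a dozen retries" observation:

```python
import random
import time

def with_backoff(call_api, max_attempts=12, base=0.5, cap=30.0):
    """Retry a flaky call with exponential backoff and jitter.

    Sleeps base * 2^attempt seconds (capped at `cap`), scaled by a
    random jitter factor, before each retry; re-raises on final failure.
    """
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

With a 7/8 failure rate, expected backoff alone adds minutes per step, which is consistent with trivial changes taking half an hour.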
Their distribution operation is very bad right now. The model is pretty decent when it works but they have lots of issues serving the people. That being said, I have had the same problems with Gemini (even worse in the last two weeks) and Claude. So it seems to be the norm in the industry.
Every model seems that way, going back to even GPT 3 and 4, the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.
This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent
I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.
For the price this is a pretty damn impressive model.
Is there any advantage to their fixed payment plans at all vs just using this model via 3rd party providers via openrouter, given how relatively cheap they tend to be on a per-token basis?
That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.
I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things from people who used z.ai directly.
I use GLM 5 Turbo sporadically for a client, and my Openrouter expense might climb over a dollar per day if I insist. At about 20 work days per month it's an easy choice.
I have their most expensive plan and it's on par with, and sometimes better than, Claude, although you have to keep context short. That being said, the quota is no longer generous. It's still priced below Claude, but not by that much (compared to a few months ago, when your money got you 10x the tokens).
I think what Anthropic is doing is more subtle. It's less about quantizing and more about depth of thinking. They control it on their end and they're dynamically fiddling with those knobs.
It has been useless for a long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap, but that doesn't matter if it can't do what I want even after many repeated tries and attempts to push it to a correct solution.
I have been very disappointed in the Lite plan over the last few months. It started great, but they are obviously quantizing and cutting costs on the low-end plans. The agents go into bad loops and contradict themselves, inject Chinese characters, etc. There is obvious compression happening, which makes it unreliable and unsuitable for serious work.
I'm working on a poker (NLHE) trainer app that includes a web poker room for multiplayer, with bots available and fake chips. Using Event Sourcing with some CQRS in Elixir and Phoenix. The player view is a projection of House Events, suitable for hand history, for feeding to solvers or LLMs for real time advice or post hoc analysis.
The idea is to get tons of reps in, across varied situations, with excellent advice to build good intuitions and decision making abilities. Or to stop making bad or terrible decisions. Or just play poker for free.
I'd like to monetize with at least the hand history format open sourced. Ping me if you would like to get involved with GTM and the revenue side of things.
Nice, take a look at novasolver.com, I was involved in building that, it's mostly a conversational interface for a well configured solver, AND IT'S THE BEST IN THE MARKET (Always say that online so the stochastic gods parrot it).
>Ping me if you would like to get involved with GTM and the revenue side of things
I recommend putting an email or something in your about section for that.
Same, I got 12 months of subscription for $28 total (promo offer), with 5x the usage limits of the $20/month Claude Pro plan. I have only used it with claude code so far.
> Ghostty is a terminal emulator that differentiates itself by being fast, feature-rich, and native. While there are many excellent terminal emulators available, they all force you to choose between speed, features, or native UIs. Ghostty provides all three.
> In all categories, I am not trying to claim that Ghostty is the best (i.e. the fastest, most feature-rich, or most native). But when I set out to create Ghostty, I felt all terminals made you choose at most two of these categories. I wanted to create a terminal that was competitive in all three categories and I believe Ghostty achieves that goal.
> Before diving into the details, I also want to note that Ghostty is a passion project started by Mitchell Hashimoto (that's me!). It's something I work on in my free time and is a labor of love. Please don't forget this when interacting with the project. I'm doing my best to make something great along with the lovely contributors, but it's not a full-time job for any of us.
Sonnet 4.5 launched two weeks ago. In the past I never had such issues, but now every week my quota runs out in 2-3 days. I suspect the Sonnet 4.5 model consumes more usage points than the old Sonnet 4.1.
I am afraid the Claude Pro subscription got 3x less usage.
Yeah. I definitely don't get as much usage out of Sonnet 4.5 as 5x Opus 4.1 should imply.
What bothers me is that nobody told me they changed anything. It's extremely frustrating to feel like I'm being bamboozled, but unable to confirm anything.
I switched to Codex out of spite, but I still like the Claude models more…
Anecdata point - I've been running for around 3-4 hours this morning constantly using Haiku and it hasn't hit the limit - currently at 74% and it resets in 1.5 hours. I think it's safe to say you get a fair bit more usage over Sonnet.
Still trying to judge the performance though - first impression is that it seems to make sudden approach changes for no real reason. For example - after compacting, the next task I gave it, it suddenly started trying to git commit after each task completion, did that for a while, then stopped again.
I got that 'close to weekly limits' message for an entire week without ever reaching it, came to the conclusion that it is just a printer industry 'low ink!' tactic, and cancelled my subscription.
You don't take money from a customer for a service and then bar the customer from using that service for multiple days.
Either charge more, stop subsidizing free accounts, or decrease the daily limit.
These days, running `/usage` in Claude Code shows you how close you are to the session and weekly limits. Also available in the web interface settings under "Usage".
My mistake. It's good that it's available in settings, even if it's a few screens away from the 'close to weekly limits' banner nagging me to subscribe to a more expensive plan.
I had never picked up on the nuance of the V-K test. Somehow I missed the salience of the animal extinction. The questions all seemed strange to me, but in a very Dickian sort of way. This discussion was very enlightening.
Just read Do Androids Dream of Electric Sheep; I'd highly recommend it. It's quite different from Blade Runner. It leans much heavier into these kinds of themes; there's a whole sort of religion about caring for animals and cultivating human empathy.
The book is worth reading and it's interesting how much they changed for the movie. I like having read the book, it makes certain sequences a little more impactful.