GLM 5.2 beats Claude in our benchmarks

350 points, posted 6 hours ago
by jms703

162 Comments

pimeys

2 hours ago

I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...

This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools, because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in Rust that has access to my homelab.

Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality and was much cheaper than Opus or GPT.

I used it unquantized through Fireworks, but there are multiple other providers too.

gertlabs

6 minutes ago

GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.

In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings

But when factoring in performance/cost, GLM 5.2 is the frontier model.

Aditya_Garg

an hour ago

I'm really curious about this. Why pay API pricing? According to my Claude usage stats I burn thousands of dollars a month at API rates, but I only pay for the $100 subscription.

horsawlarway

33 minutes ago

My increasing frustration with these plans is the harness lock in.

Anthropic won't even let you run "claude -p [prompt]" any more... They bill it at api rates.

So if you're trying to automate the ai (and seriously, that's the point) the subsidized plans are crippled.

cortesoft

17 minutes ago

They postponed that change, here is the email they sent out:

> In May, we sent you an email announcing that starting today, the Claude Agent SDK, claude -p, and third-party apps built on the Agent SDK would stop drawing from subscription rate limits and move to a dedicated monthly credit. We're writing to let you know that we’re not making this change today. We’re working to update the plan to better support how users build with Claude subscriptions.

> What this means for you

> Nothing changes for now. Agent SDK, claude -p, and third-party app usage continues to work with your subscription exactly as it did before today, and there's no credit to claim. Your subscription limits are unchanged. When we have an update, we'll share it with advance notice before it takes effect

sroerick

19 minutes ago

I'm using synthetic.new and Neuralwatt with pi and it's good and also cheap

computerex

15 minutes ago

I have had a bad experience with Neuralwatt's GLM 5.2. Seems like they may be using a quantized version of the model.

smcleod

17 minutes ago

They canned the move to make -p commands API billable.

SV_BubbleTime

39 minutes ago

There is a whole iceberg topic on subsidizing.

So your question is really “if they’re giving free usage, why not take advantage of it?”

I do, so I don’t know the reasons not to, other than to experiment.

shostack

2 hours ago

If you're using Matrix, consider Hermes as a harness if you haven't already. Native gateway support. I've been primarily using mine through Element and it has largely been great.

pimeys

2 hours ago

Oh interesting. I basically chose Matrix because setting anything up with WhatsApp or Signal was kind of painful and Telegram doesn't make it easy to use encryption with bots.

I kind of wanted to see if I could make a Matrix agent from scratch in Rust with GLM and it was surprisingly easy. Just make something for myself how I want it. Maybe I'll take a look at Hermes later...

KaoruAoiShiho

2 hours ago

Are you sure fireworks is unquant? It's not listing precision on openrouter like everyone else.

dist-epoch

2 hours ago

$20 on API pricing or on subscription?

pimeys

2 hours ago

API, pay per token.

HKCM852

2 hours ago

Which harness did u use?

pimeys

2 hours ago

Opencode and Zed about 40/60.

noncoml

2 hours ago

Who’s Zed?

term333

an hour ago

Please take comments like this back to reddit.

sertsa

an hour ago

It's an editor: https://zed.dev/

HAL3000

an hour ago

Just FYI, this question was a quote from Pulp Fiction, the other commenter (mdre) replied also with a quote, that was an answer to this question in the movie.

mdre

an hour ago

Zed’s dead baby.

playorizaya

29 minutes ago

LOL a hundred dollars????!!!

Imagine paying for this!

Do you know you can paste into a Google search and get a much higher quality Gemini response?

You can also use Ollama and get Mistral 7b which is better than anything Anthropic offers.

Imagine paying for text-to-text!!! Lmfao

SwellJoe

2 hours ago

I added GLM 5.2 to my security bug hunting benchmark when it came out, and found it to be a good performer, but not the best open model. The benchmark tests whether models can find bugs Mythos found. The best open models in the initial benchmark were DeepSeek V4 Pro or MiMo 2.5 Pro. But it turned out MiMo got lucky, it's performed worse on almost every test I've done since, while DeepSeek has consistently been among the best performers and its extreme caching performance makes it cheaper than just about anything, including much smaller models.

https://swelljoe.com/post/will-it-mythos/

Also of note, I found giving models access to the open source semgrep as a tool makes some perform worse and none perform better, though it's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it (my theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once: figure out how to use semgrep and find security bugs, and both tasks suffer for the lack of focus...most small models, and some big models, can't do that well).

Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.

bArray

3 hours ago

Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?

[1] https://huggingface.co/zai-org/GLM-5.2

dakolli

2 hours ago

8 X RTX6000. It will run you around 80-100k to get started with a model at this size with decent tps..

Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.
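A rough sanity check of that arithmetic, as a sketch rather than a quote from any price list: the thread gives no per-token price, so the code below only computes how many output tokens 10 sessions at 50 tps produce in a decade and what blended price per million tokens would exactly exhaust $100k. Input tokens, caching discounts, and price changes are ignored, and the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def total_tokens(sessions: int, tokens_per_second: int, years: int) -> int:
    """Output tokens generated by `sessions` streams running nonstop."""
    return sessions * tokens_per_second * SECONDS_PER_YEAR * years


def break_even_price_per_million(budget_usd: float, tokens: int) -> float:
    """Price per million tokens at which `budget_usd` is exactly spent."""
    return budget_usd / (tokens / 1_000_000)


# Figures from the comment: 10 sessions, 50 tps, 10 years, $100k budget.
tokens = total_tokens(sessions=10, tokens_per_second=50, years=10)
price = break_even_price_per_million(100_000, tokens)
print(f"{tokens:,} output tokens; break-even at ${price:.2f} per million")
```

Under these assumptions the claim holds only if the blended API price stays below that break-even figure for the whole decade.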

Aurornis

2 hours ago

> 8 X RTX6000. It will run you around 80-100k to get started

8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.

It's going to be $120K to $150K to build or buy a system to run this.

knollimar

an hour ago

isn't throwing that into a [insert financial vehicle that gives 99.99999% safe returns] going to destroy that when you factor in electricity costs?

Or even just electricity costs vs token cost

CamperBob2

2 hours ago

You can run the NVFP4 quant with 8x RTX6000 cards at 50-75 tps output, but not (practically speaking) the OEM FP8 version. You will learn more about PCIe than you ever wanted to know.

The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.
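A back-of-the-envelope sketch of why the 4-bit quant fits on eight cards while FP8 is impractical. It uses the 753B parameter count quoted upthread and ASSUMES 96 GB of memory per card, which the thread does not state. It counts weights only: KV cache, activations, and the per-block scale factors a 4-bit format carries are all ignored, so real headroom is smaller.

```python
PARAMS = 753e9          # parameter count quoted in the thread
GPU_COUNT = 8
GPU_MEMORY_GB = 96      # ASSUMPTION: per-card memory, not stated in the thread


def weights_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9


total_vram_gb = GPU_COUNT * GPU_MEMORY_GB
for name, bits in [("FP8", 8), ("4-bit", 4)]:
    size = weights_gb(PARAMS, bits)
    headroom = total_vram_gb - size
    print(f"{name}: {size:.1f} GB of weights, {headroom:.1f} GB left of {total_vram_gb} GB")
```

With these numbers FP8 weights alone nearly fill the eight cards, leaving almost nothing for context, while the 4-bit weights take roughly half.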

Sanzig

39 minutes ago

Anyone done any benchmarks on the NVFP4 quant? Seriously considering pitching an 8 x RTX 6000 Pro box at work to run GLM-5.2 in an air gapped environment.

tiahura

15 minutes ago

Good luck. I’m in the legal field, and even there, selling airgapped is tough.

KetoManx64

an hour ago

As an individual I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag. Once they start trimming down the excess and making them field focused they will run just fine on people's individual devices.

JumpCrisscross

an hour ago

> I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag

Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?

KetoManx64

an hour ago

Do quantized models specifically prune out specific knowledge? I think they just compress things down but they're still in there. You'd most likely need to do that when you're doing the initial model training, but I'm no expert.

kibwen

an hour ago

Quantizing is one thing. But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.

Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.

tiahura

a minute ago

Apparently irrelevant data can help because model weights are entangled.

krackers

2 hours ago

Would you be better off pooling that money with some hackerspace group and then setting up shared inference infra, so that way you at least get better utilization?

aetch

7 minutes ago

You can then rent spare capacity out to people on a subscription or token basis ….wait

KaoruAoiShiho

2 hours ago

And before you know it, you invented some openrouter provider from first principles...

janalsncm

an hour ago

Right. For example you will need to figure out how to share it and who maintains it.

8note

2 hours ago

you can however, have fun with it.

oil workers buy 100k trucks they do not much with. why not 100k in computers?

jliptzin

30 minutes ago

Yeah, as far as hobbies go, I feel like this is on the low end. I know people who collect watches and Corvettes; that's way more expensive and functionally you can't really do anything special with them.

theteapot

14 minutes ago

The difference is that watches and Corvettes typically appreciate in value, whereas computer hardware typically drops like a rock.

Ken_At_EM

2 hours ago

I can't help but ask where this comment came from, you must have some exposure..

CamperBob2

2 hours ago

It is so easy to spend $100K on a pickup truck these days, it's not even funny.

tiahura

14 minutes ago

A Honda minivan is > 50k.

SV_BubbleTime

35 minutes ago

Factory F350 Platinum is at least 90k sticker.

afavour

2 hours ago

Because car loans can’t be used to buy computers

ElProlactin

44 minutes ago

And there's your idea. If you could find a way to get people to add another $500/month over 80+ months to an auto loan, dealers would eat that up like filet mignon.

dakolli

2 hours ago

Sure, if you want to light money on fire for entertainment, more power to you. There are probably worse ways to light 100k on fire. If I have an extra 100k lying around it's going to my family though.

InvertedRhodium

2 hours ago

Depends how much you value privacy and running uncensored models.

Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.

wonnage

2 hours ago

Yeah, the neoclouds and hyperscalers are taking massive losses right now, self hosting is basically signing yourself up to do the same. There are philosophical reasons to do so but it’s a terrible economic decision

rekttrader

2 hours ago

Or you have data covered by HIPAA or GDPR, or containing PII, or you have to care about others training on your data.

dist-epoch

2 hours ago

> 50tps for a decade

assuming demand doesn't keep on increasing. even google has trouble having enough capacity apparently.

jackdawed

14 minutes ago

I use GLM 5.2 via Neuralwatt and it's gotten so cheap I wouldn't mind cancelling my personal Claude subscription if work gave me one. I've spent 374M tokens this month and it only cost me $18 on energy-based pricing.
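For scale, those two figures work out to an effective per-token rate as follows. This is only the division implied by the comment (374M tokens, $18); it says nothing about how the energy-based pricing is actually computed or how input, output, and cached tokens are split, and the helper name is hypothetical.

```python
def usd_per_million_tokens(total_usd: float, total_tokens: int) -> float:
    """Effective blended rate implied by a bill and a token count."""
    return total_usd / (total_tokens / 1_000_000)


# Figures from the comment: $18 for 374M tokens in a month.
rate = usd_per_million_tokens(18, 374_000_000)
print(f"${rate:.3f} per million tokens")
```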

WithinReason

3 hours ago

> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found

Claude Code is an agent harness, not an LLM.

Claude is a brand (or group of LLMs), not an LLM.

raincole

3 hours ago

Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.

mkagenius

2 hours ago

It looks like the author is specifically avoiding the models' names, because the results are really weird.

  Opus 4.8/4.7 scored 28%

  Opus 4.6 scored 37%

So the author thought: let's not get into that, just write Claude.

happycube

an hour ago

Not weird at all, given the variance in Opus' quality over the last few months.

wild guess - I wouldn't be surprised if Opus 4.6 was run quantized for a while, and 4.7/4.8 have QAT for that nerfed size.

andriy_koval

2 hours ago

many people think opus 4.6 was the best

tills13

2 hours ago

It costs nothing to not be pedantic.

alienbaby

an hour ago

Possibly, nothing other than accuracy

Onavo

3 hours ago

Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost where they can have the model on their own hardware, Claude Code is probably the best guess at the amortized cost.

himata4113

4 hours ago

These numbers seem pretty low compared to what I was able to achieve, specifically around the Windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if China started surpassing the models that the US makes public, at least in specific categories such as cyber.

GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than openai/anthropic.

danmaz74

an hour ago

It will almost for sure surpass the models which Trump will allow US "allies" (which he just considers client states) to use. This, together with China's growing dominance in PV, rechargeable batteries, EV, could really be the nail in the coffin for the post WWII economic world order.

himata4113

an hour ago

Honestly, it's becoming increasingly hard to disagree with such sentiment when China is preparing itself to lead in energy, manufacturing, research, chip production and so on while there's an entire group of people trying to put datacenters in space.

woeirua

an hour ago

You are delusional if you think China is going to let Europe have access to Mythos level models for free.

solenoid0937

4 hours ago

GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.

Not that it would make any sense.

rgbrenner

4 hours ago

If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety.. And meanwhile attackers use equivalent open source models to attack US companies.

Any prohibition on open source models will do nothing to fix the problem.. since attackers will never feel bound to the law. All advanced models must be available for defensive purposes.

richardlblair

a minute ago

And someone will start a competing company in a sane environment.

andy99

3 hours ago

Right, but is there any evidence of intelligence behind any of these (government) decisions? It’s just regulatory capture + marketing (plus some people living out an imaginary fantasy that they’re in Neuromancer or something), absolutely no reason to think they won’t try and target open models as part of this.

popalchemist

3 hours ago

There's at least one reason: much harder to make a profit in policing non-american companies and open-source models without huge (or even any) MRR.

If the real motive is profit, then open source models are likely simply not a viable means to that end.

solenoid0937

3 hours ago

> since attackers will never feel bound to the law.

But that's the whole point.

Fall out of favor with the admin and you lose access to the good American models, aren't allowed to use Chinese ones, and fall prey to the attackers and behind your competitors.

lenerdenator

an hour ago

It'd be less about "safety" and more "we've spent trillions developing these AI tools only to have the Chinese, once again, copy them and offer them for pennies on the dollar, and no one seems to care about the impact that has on the long-term sustainability of this sector of the American economy as a whole, so we're yanking the models."

aussiegreenie

3 hours ago

The Americans may ban the use of the Chinese models in America. But like the Chinese car ban, everyone else will use them.

lenerdenator

an hour ago

That's not necessarily a good thing for everyone else, mind.

Yes, you get your free model, but the cost of this is not developing your own capability and tying your fate to a country which may or may not have your best interests as a nation in mind.

This is just the deindustrialization that occurred in my home region (the American Midwest) playing out on a global scale in different sectors. It was originally driven by the Japanese, who, to their credit, acted more as partners than competition. Eventually that desire for larger margins went to China, and now you basically can't build anything of consequence without at least some Chinese parts, because there's "no economic case" for it. This means that you have to play Beijing's game if you want access to any sort of modern market.

You see this happening with Volkswagen's restructuring, next you'll see it with non-American, non-Chinese AI.

singpolyma3

13 minutes ago

It's not really the same because we already have the model. If China stopped letting us have it tomorrow it doesn't matter because... We have it already

skissane

2 hours ago

> GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.

I’m sceptical they could find the legal framework to do this even if they wanted to

They have legal authority to (a) prevent export of US goods/services; (b) ban imports of physical goods; (c) ban transactions (including purchasing services or license agreements) with foreign firms

But I’m not aware of any legal authority which lets them ban US firms from running a Chinese-developed open source AI model in the United States, if they are at arms length from the vendor, and aren’t using it for government contracts or regulated applications

Possibly they could order HuggingFace/etc to suspend Chinese accounts. But if someone in the US (or a third country) downloads the model from China then reuploads it to a US server, completely independently of the vendor - where is the legal hook to prohibit that?

mrandish

an hour ago

> I’m sceptical they could find the legal framework to do this even if they wanted to

I agree, my only caveat is that the current administration has shown it's willing to go beyond aggressive regulatory interpretations to questionable and outright implausible interpretations. As we've seen recently, the federal courts and SCOTUS are overturning most of these but that can take a year or more to resolve. The one positive light is they seem to push the hardest on certain culture war issues (immigration, voting, districting, etc). AI doesn't seem like a core hot button issue for the White House and there is a strong pro-AI / business faction.

bardak

an hour ago

They could ban payment processors from processing payments to any hosts of GLM 5.2. Despite the open weights, the vast majority of people will be using cloud providers to get access, since it is too heavy to host for 99% of people.

This would be extremely heavy handed and would probably end up accelerating the loss of the US's virtual monopoly on payment networks. The rest of the world isn't going to let the US dictate that only they get the frontier models, whether they're US made or otherwise.

skissane

an hour ago

> They could ban payment processors from processing payments to any hosts of GLM 5.2

Can they actually though? Do they have legal authority to tell a payment processor that it has to block transactions of a legal US company, just because the company is hosting a Chinese-developed open source model? I’m sceptical

And what about companies (e.g. AWS) that let you “bring your own model”?

addandsubtract

17 minutes ago

Label AI as porn and the payment processors will cut their ties automatically.

bardak

an hour ago

It would be extremely heavy handed, but the administration has sanctioned the International Criminal Court judges such that they basically have no access to the West's modern financial system. I think domestic US providers would have to be dealt with in different ways, but someone like Hetzner could easily be cut off from the financial system if the administration doesn't feel that they are adequately blocking the model.

phs318u

39 minutes ago

Swapping the footgun for a huge long-range boomerang doesn’t mean it’s not going to eventually swing around and whack you in the back of the head.

bardak

17 minutes ago

100% agree and don't think it will come to that but I won't completely put it past this administration

eunos

42 minutes ago

OpenRouter or Huggingface should consider moving to Switzerland

gruez

4 hours ago

>GLM export controls incoming?

US imposing export restrictions on a model from China?

mcintyre1994

4 hours ago

It’d be restrictions on Americans and American companies, and probably also pressure on America’s allies.

mkagenius

2 hours ago

Token smuggler sounds like a profession coming soon. For distillation and stuff.

addandsubtract

20 minutes ago

I mean, there are already places where you can buy tokens at 10% of their original cost.

Art9681

an hour ago

They can easily issue an order for any American company to stop hosting/serving the models. If the model was a threat to national security because of its capabilities then a lot of other countries would follow, including China. No nation will allow some vibe coder with a rogue AI to pose a threat to their systems.

The reason GLM-5.2 hasn't been banned is that despite these cherry picked use cases, GLM-5.2 isn't even close to Opus in all use cases. These vibe benchmarks are run by companies that are not part of the cyber services offered by Anthropic and OpenAI where they can use the models without the safeguards and refusals so their actual cyber capabilities can be utilized.

These guys that wrote the article compared a gimped Opus to GLM-5.2, knew full well it's misleading, and got the clicks regardless. They don't have enough clout to be a part of something like Project Glasswing, GPT Cyber, etc.

manquer

4 hours ago

While unlikely, it is not without precedent: there are restrictions on ASML, a Dutch company, selling EUV machines.

throwup238

3 hours ago

That’s because the Department of Energy originally funded and contributed IP to the EUV Corp joint venture between several semiconductor companies (including ASML and Intel). Their ability to export control EUV was part of that original agreement that the entire technology is built on.

verdverm

4 hours ago

ASML complies as an ally, why would China comply?

The weights are already available and downloaded, is it going to be a crime to have them, run them, make them available? Constitutional rights still exist (I hope)

solenoid0937

4 hours ago

> is it going to be a crime to have them, run them, make them available?

Now you're getting it! Commerce will call it a munition and those harboring it as harboring illegal/foreign munitions.

No business will take the hit, so they will quickly deplatform the models.

No end user has the GPU capacity to use GLM 5.2 or similar models at full precision so the government will call the problem "mostly solved." But they might choose to "make examples" out of a few people using p2p software to download the weights if they choose to.

verdverm

4 hours ago

Or we use the models to work on fixing vulns and stop over-blowing the doom scenarios. Gotta save the kids and kill the terrorists though!

I'm for making software better instead of banning it based on what the rich and powerful claim.

I suspect the real fear is that open weight models undermine the financials and token prices they thought were going to pay off their ludicrous spending because they have all raced and raised hardware prices.

hadlock

3 hours ago

> making software better instead of banning it

We're still in the middle of the cambrian explosion.

If Anthropic was capable of developing Opus 4.49-4.5 2H 2025.... then any company with a research team capable of reading all the papers and press releases will be capable of producing Opus 4.8 by the end of 2027, either raw model competency, or in a harness like claude code (or better with both). I guess what I am trying to say is that Opus 4.5 does not represent the edge of agentic capability, merely somewhere in the thick meaty layer of "functional and achievable".

We can draw the line at Sonnet 4.6 in the US but much like encryption export restrictions in the 1980s, the line drawn will be laughably low within a few years and simply unthinkable in a decade.

solenoid0937

3 hours ago

> making software better instead of banning it

That would be the rational thing to do.

> financials and token prices

I do not think the government thinks this deeply. Market manipulation might be a rational, if unethical reason to ban open source models.

But this admin banned Anthropic models to "own the libs." They will continue to ban what they want for whatever reason they want. I don't think those reasons will be particularly coherent.

verdverm

3 hours ago

Yeah, the current admin is reactionary, they appear to put little thought in, or at least disregard input they dislike. I don't think Ant's ban was about "owning the libs" as much as it was asserting dominance over someone who spoke up counter to the admin's aims and claims. They do listen to money, which is where I see Big Ai paying for executive orders (because the admin forgot what it means to compromise as part of legislating for all americans).

matheusmoreira

3 hours ago

> is it going to be a crime to have them, run them, make them available?

Yeah. Illegal numbers.

fragmede

an hour ago

DeCSS was short enough to fit on a t-shirt. Americans are larger these days, but not by enough to fit a decent LLM's weights on an XXXXL shirt, even double sided.

fph

3 hours ago

How would that even work for an open-weight model?

bardak

an hour ago

Go after the hosts, 99% of people won't be able to run this locally even if they wanted to.

djeastm

3 hours ago

I think state-of-the-art AI is going to be defense industry only from now on. We can have our toy drones but not the Predators and Reapers.

Gigachad

2 hours ago

Turns out toy drones are more useful in war than multi million dollar planes anyway.

techpression

2 hours ago

Reaper and Predator are both drones and there’s really no comparison to toy drones in terms of sheer destruction and capabilities in general, the comparison is actually quite apt imo.

fragmede

an hour ago

Which ones are the ones Ukraine has used to bomb Moscow?

serf

3 hours ago

the things that empower modern toy drones were export restricted for years beforehand.

mullingitover

an hour ago

Obvious answer: build all your open source LLMs into firearms, get the SC to grant 2A protections.

dakolli

2 hours ago

Cool then everyone will just change their config to route through a provider overseas for an added 50-100ms latency. Who cares.

danslo

4 hours ago

It reads like an ad.

Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.

Thirdly it compares to GPT 5.5 and Opus 4.8.

No, we don't have Mythos at home.

vlian2088

4 hours ago

>Thirdly it compares to GPT 5.5

mythos is <10% ahead of gpt 5.5 on all benchmarks, which it gains by being several times the size of opus. had it been economical to provide, it would've been released to the public on day one instead of the marketing circus those effective altruism clowns had exhibited. admitting that it costs >1000% to run inference on a <10% better model would've been very damning.

oa335

2 hours ago

> it costs >1000% to run inference

do you have a source for this claim? i thought LLM providers earn high margins from inference (charged by token). is this no longer the case?

3836293648

2 hours ago

This was just theorised. The leaked OpenAI financials suggest otherwise (because of shady naming of losses)

The only ones who seem to profit are the ones running smaller Chinese models. Even NVIDIA seems to have to "reinvest" their profits into sponsoring companies to buy their cards now.

vlian2088

an hour ago

if a $6000000 cabinet can generate 10000/s tokens of Opus but only 1000/s tokens of Mythos, then Mythos costs 1000% to run no matter the markup.

no one has a source, because no one knows closed model parameter counts. we have only heuristics which strongly indicate that Mythos is simply a big fucking model that any other lab could make an equivalent of.
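The cabinet argument above can be made concrete with a small sketch. The $6,000,000 price and the 10,000 vs 1,000 tokens/s figures are the commenter's hypotheticals, not measurements; the 5-year lifetime and 100% utilization are extra ASSUMPTIONS added here so that a dollar figure can be printed. The ratio between the two models does not depend on either assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def amortized_usd_per_million(cabinet_usd: float,
                              tokens_per_second: float,
                              lifetime_years: float) -> float:
    """Hardware-only cost per million tokens at 100% utilization."""
    lifetime_tokens = tokens_per_second * SECONDS_PER_YEAR * lifetime_years
    return cabinet_usd / (lifetime_tokens / 1_000_000)


CABINET_USD = 6_000_000   # hypothetical figure from the comment
LIFETIME_YEARS = 5        # ASSUMPTION: not stated in the thread

fast = amortized_usd_per_million(CABINET_USD, 10_000, LIFETIME_YEARS)
slow = amortized_usd_per_million(CABINET_USD, 1_000, LIFETIME_YEARS)
print(f"10,000 tok/s: ${fast:.2f}/M   1,000 tok/s: ${slow:.2f}/M   ratio: {slow / fast:.0f}x")
```

Whatever markup is applied on top, the hardware cost per token scales inversely with throughput, which is the point the comment makes; power, networking, and batching effects are left out.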

InsideOutSanta

4 hours ago

In my experience, GLM 5.2 is extremely good at finding vulnerabilities, and more importantly, unlike Opus, I've never seen it refuse a command. It genuinely is a very strong model for finding and fixing vulnerabilities.

nozzlegear

an hour ago

More importantly, unlike Mythos and Fable, you can actually use GLM 5.2! It's not just marketingware that got its founder in hot water with the government.

NitpickLawyer

3 hours ago

> Thirdly it compares to GPT 5.5 and Opus 4.8.

> No, we don't have Mythos at home.

That's still useful. To paraphrase the kids these days, GLM5.2 is in the room with us, today. Mythos is not. And for us in the EU, it's even more complicated, as Mythos might be with us in the room one day, and go poof the next day, on the whims of political entities that we have 0 control over.

Knowing where open, accessible, local models are is important. We know they're behind. But there comes a time when "good enough" is useful. Even if they're "just IDORs" today, and even if they're behind SotA today.

As someone else said above, GLM5.2 (and other models in the same tier like kimi, dsv4, etc) is / are slowly becoming "good enough" to assist in automated repo prepare work (download, install, test, edit, re-test, etc). And that translates in RL traces ready to be trained into the next generations. That might be more important than x% behind on benchmarks.

sanid

3 hours ago

Technically we don't have Mythos at all? You guys have access. This tells me we have Opus at home (open weights).

jimbob45

3 hours ago

Yeah they straight up say that their criteria are narrow and primarily important for their specific use case. Never let rationality cause your pitchfork to be cast away though!

_s_a_m_

an hour ago

I tried GLM many times and it is bad, I have no clue what these people are talking about

jeffnash

22 minutes ago

have you tried 5.2? I agree that 5.1 and prior were below Kimi, Mimo, Qwen, Minimax, and probably Deepseek (depending on task), but 5.2 (especially unquantized) feels like something else.

Now I feel like I'm covered by GLM 5.2 and Minimax M3 (when I need vision or a second pass on something).

g42gregory

2 hours ago

If only the "cybersecurity" crowd were focused on patching the vulnerabilities.

Instead of shilling for the LLM providers.

__MatrixMan__

2 hours ago

But if we patch all of the vulnerabilities, who will pay for our vulnerability scanner?

_factor

2 hours ago

The robot figured out how to bump the lock. The obvious solution is to ban the robot.

theteapot

3 hours ago

> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...

What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better because it has access to more knowledge about vulns in these OSS projects, and possibly even has knowledge of your own "prior research"?

mkagenius

2 hours ago

One would. But then the results are even weirder as opus 4.6 scored more than opus 4.8 by a huge margin

dools

27 minutes ago

I think Opus 4.8 is deliberately nobbled. Kimi k2.6 with Kimi code beats opus models at finding vulnerabilities, even though it produces some false positives. When I give the same issues to opus and ask it to verify, most of the time it concurs it’s a real issue, even though it failed to find the issue itself.

admax88qqq

4 hours ago

> beats Claude in our Cyber Benchmarks

Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).

It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.

InsideOutSanta

4 hours ago

They say "Claude Opus 4.8" in the first paragraph.

crm9125

2 hours ago

We're supposed to read the article?

How are we supposed to stay skeptical of everything if we read anything!?

ls612

4 hours ago

Opus 4.8 according to TFA. Whether or not the safety guardrails were responsible for the difference is an open question, but for a dev who wants to secure their software and doesn’t work at one of the blessed Glasswing companies, it doesn’t really matter why. What matters is what the best tool you actually have is.

Art9681

an hour ago

This is because of the safeguards and not the model capabilities. If these folks signed up for the proper cyber service offered by Anthropic where refusals are removed then the open weight model wouldn't look as capable.

kordlessagain

6 hours ago

You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8

After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.

Sign up for GLM-5.2 here: https://z.ai

generichuman

36 minutes ago

You can use GLM in OpenCode with a z.ai subscription by default as well. Also it'd be good if you mentioned you were involved with nemesis8.

sanid

3 hours ago

One can also try https://neuralwatt.com using it in opencode.

I think they give $5 trial credits to test with any of the open weight models.

laybak

2 hours ago

how representative are Semgrep's benchmarks? everyone seems to have their own benchmark these days (guess it's good "content marketing") I'm honestly losing track

dist-epoch

2 hours ago

Anthropic is saying other models were good at detecting vulnerabilities, where Mythos excelled was in creating functional exploits for them.

This article only talks about detecting vulnerabilities, so it's unclear if it's a true Mythos equivalent.

igregoryca

an hour ago

It seems "Mythos is really good at finding vulnerabilities" has been what people took away from the Project Glasswing announcement, which makes sense. Unfortunately for Anthropic, most seem to have forgotten the best argument Anthropic had for holding Mythos back from the general public, "it's crazy good at crafting exploits". Then, without that context, the tinfoil hats came out.

cmrdporcupine

an hour ago

I like GLM 5.2... ish. It's ok.

I'd be mostly fine switching to it.

I just can't find a cost effective way to do that. z.AI's coding plan is both overpriced and unreliable. ollama's is also overpriced. Paying by the token for it on openrouter etc is more expensive than just having a Codex or Claude coding plan.

If you have to pay by the token, it's clearly cheaper. It's not competitive with a coding plan though.

TurdF3rguson

an hour ago

It also means giving up vision which I don't know how I would deal with. I think I would prefer a weaker model with vision than a stronger without.

cmrdporcupine

an hour ago

If you're using opencode or similar you can just temporarily switch models -- in the same session -- to something that has vision and have it look at your image. And then switch back.

veselin

4 hours ago

Here, it appears they compare a single prompt "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.

Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?

blazespin

4 hours ago

I think the point is less "how can we throw shade on the OP" and more "a harness can enable a lot of models to do very serious cybersec, glm 5.2 is one of them"

s3p

4 hours ago

Are you replying to a response to the original comment? I looked but I didn't see anyone saying he's throwing shade.

BikiniPrince

2 hours ago

You have to forgive the GLM bot. It's not very good.

utunga

an hour ago

Just popping in to say that no you can't use the word "tokenomics" to mean that. Argh.

yieldcrv

30 minutes ago

who is your favorite hosted GLM 5.2 provider? I'm looking for fastest tokens/sec and best cost

additionally, reliable API, because z.ai can be finicky

also, not for Enterprise use, but I like non-US providers, I don't care if the party happens to be the one reading my information and stealing my trade secrets, if they won't respond to a US subpoena

lenerdenator

an hour ago

The incentive to develop Claude further is to make money.

The incentive to develop these Chinese models further is to trash the business case of most American AI labs.

csjh

2 hours ago

I found it to spiral into complete nonsense a few times when I tested it out, but it's possible that was a bug in the provider

TacticalCoder

an hour ago

How to reconcile that with the recent, highly upvoted, article titled: "The gap between open weights LLMs and closed source LLMs"?

What explains it?

Is TFA lying? Is the most upvoted comment here lying?

rode1974

3 hours ago

Hopefully I get a macbook pro soon enough to run some small or medium sized LLMs

BikiniPrince

2 hours ago

This is a joke right? I wouldn't install this in a sandbox.