Within a day, Zhipu and Anthropic both released their strongest programming models

The last day before the holiday (September 30th) was a busy one.

In the morning, Anthropic announced the Claude Sonnet 4.5 model.

In the afternoon, Zhipu AI released the GLM-4.6 model.

For programmers, I think this is a significant development.

Both models are among the most advanced AI programming models currently available.

If you want AI to generate code, these are the first choices. This means that in a single day, AI programming models reached a new level.

2、

Anthropic's first announcement statement didn't hesitate to use three "world's bests."

"Claude Sonnet 4.5 is the world's best coding model. It is the most powerful model for building complex agents. It is the best model for using computers. It shows significant progress in reasoning and mathematics."

Zhipu's announcement was no less bold.

"We have once again broken through the boundaries of large model capabilities.

GLM-4.6 is our strongest coding model (an increase of 27% over GLM-4.5). It achieves comprehensive improvements in real-world programming, long-context processing, reasoning ability, information search, writing ability, and agent applications."

To convince people, Zhipu's announcement also provides detailed test results.

The above figure shows the results of 8 benchmarks. Each blue bar represents GLM-4.6, and each green bar represents GLM-4.5. The comparison group includes the newly released DeepSeek V3.2 Exp, Claude Sonnet 4, and Claude Sonnet 4.5.

The blue bars mostly rank near the top, and in some benchmarks come first. Zhipu also claims that GLM-4.6 is very economical with tokens (i.e., it saves money), "saving more than 30% compared to GLM-4.5, with the lowest cost among similar models."

Its conclusion is therefore: "GLM-4.6 aligns with Claude Sonnet 4/Claude Sonnet 4.5 in some rankings, stably ranking first among domestic models."

This is interesting: one claims to be the "best coding model in the world," and the other claims to "stably rank first among domestic models."

Below, I will test how GLM-4.6 compares to Claude Sonnet 4.5.

3、

It should be noted that the comparison of these two models is not just for testing, but also has practical significance.

Although Anthropic has strong products, it restricts Chinese users from using them, and domestic users cannot access its services through normal channels. In addition, it is a paid model and not a cheap one: $3 per million input tokens and $15 per million output tokens.
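To make the pricing concrete, here is a rough sketch of how the bill scales with usage. The rates are the $3 and $15 per million tokens quoted above; the function name and the sample token counts are made up for illustration.

```python
def sonnet_cost_usd(input_tokens: int, output_tokens: int,
                    input_rate: float = 3.0, output_rate: float = 15.0) -> float:
    """Estimate the cost in USD; rates are dollars per one million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical coding session: 2M tokens read, 0.5M tokens generated.
print(f"${sonnet_cost_usd(2_000_000, 500_000):.2f}")  # $13.50
```

Output tokens cost five times as much as input tokens, so long generated answers dominate the bill.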

In contrast, GLM-4.6 is a fully domestic model from the Beijing company Zhipu. It takes a thoroughly open-source approach (MIT License): the model code is fully open and can be used freely.

You can also install it yourself at home. However, its hardware requirements are too high for home devices, so it is generally used as a cloud service.

Currently, using GLM-4.6 through the web interface on Zhipu's official websites (BigModel and Z.ai) is free.

Its API calls require payment, and the starter package (coding plan) seems to be 20 yuan RMB per month.

Additionally, it has comprehensive Chinese support (documentation+customer service), which Anthropic also lacks.

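In short, the purpose of my test is to see whether it is truly as powerful as officially claimed and whether it can replace the Claude Sonnet model.

My testing method is simple. Anthropic invited the well-known programmer Simon Willison in advance to try out the Claude Sonnet 4.5 model.

Simon Willison has already published his results on his website.

I simply take a few of his tests, run them on GLM-4.6, and compare the results.

You can follow along: open the official website and paste in the prompts (preferably in English). That will give you a much more direct feel for it.

AI terminal tools (such as Claude Code, Cline, OpenCode, and Crush) can also be used; see the official documentation for how to configure them (the API needs to be enabled first).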

The first test.

Pull the code repository https://github.com/simonw/llm, then run its test cases using the following commands.

pip install -e '.[test]'

pytest

This test requires an internet connection to fetch the code and runs in the background.

The Web interface on Zhipu's official website, like Claude, provides Python and Node.js server sandbox environments, where code can be generated and executed directly.

I have omitted the intermediate reasoning steps; the final result is shown in the figure below (see the complete conversation on the official website).

278 test cases passed, took 18.31s

The entire process (pulling the code, installing dependencies, executing the commands) is the same as with Claude Sonnet. Strangely, Claude Sonnet ran 466 test cases, over 100 more than GLM-4.6 did. I don't know why.
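One way to compare the two runs is to pull the pass count out of pytest's final summary line. The sketch below assumes the usual `N passed in X.XXs` summary format; the helper name is hypothetical, and the sample string is rebuilt from the numbers reported above rather than copied from the actual log.

```python
import re

SUMMARY = re.compile(r"(\d+) passed.* in ([\d.]+)s")


def parse_summary(line: str) -> tuple[int, float]:
    """Extract (passed, seconds) from a pytest summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a pytest summary line: {line!r}")
    return int(match.group(1)), float(match.group(2))


# Sample line reconstructed from the GLM-4.6 result reported above.
glm_passed, glm_seconds = parse_summary("===== 278 passed in 18.31s =====")
claude_passed = 466  # count reported for the Claude Sonnet run
print(claude_passed - glm_passed)  # 188
```

A gap of this size is more likely to come from a different set of collected tests (for example, a different repository revision or extra plugins installed) than from the models themselves, though that is only a guess.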

6、

The second test is a more complex programming task. The original prompt was in English, and I translated it into Chinese.

1. Code repository https://github.com/simonw/llm is an AI chat application that stores user prompts and AI responses in an SQLite database.

2. It currently uses a linear collection to store individual conversations and responses. Try adding a parent_response_id column and use it to model the responses in a conversation as a tree structure.

3. Write new pytest test cases to verify your design.

4. Write a tree_notes.md file, first write your design into the file, and then use it as a notebook during the process.
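To show what the prompt is asking for, here is a minimal sketch of a response tree built with a self-referencing parent column. The table below is a simplified, hypothetical one, not the actual schema of simonw/llm, and it is not the code either model produced; the helper names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: NOT the real schema used by simonw/llm.
conn.execute("""
    CREATE TABLE responses (
        id INTEGER PRIMARY KEY,
        parent_response_id INTEGER REFERENCES responses(id),
        prompt TEXT,
        response TEXT
    )
""")
rows = [
    (1, None, "Hi", "Hello!"),
    (2, 1, "Write a sort", "Here is quicksort"),
    (3, 1, "Write a search", "Here is binary search"),  # second branch off 1
    (4, 2, "Make it stable", "Here is merge sort"),
]
conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?)", rows)


def thread_to_root(conn, response_id):
    """Return the ids from the root response down to response_id."""
    query = """
        WITH RECURSIVE chain(id, parent_response_id, depth) AS (
            SELECT id, parent_response_id, 0 FROM responses WHERE id = ?
            UNION ALL
            SELECT r.id, r.parent_response_id, chain.depth + 1
            FROM responses r JOIN chain ON r.id = chain.parent_response_id
        )
        SELECT id FROM chain ORDER BY depth DESC
    """
    return [row[0] for row in conn.execute(query, (response_id,))]


def children(conn, response_id):
    """Return the ids of the direct replies to response_id."""
    query = "SELECT id FROM responses WHERE parent_response_id = ? ORDER BY id"
    return [row[0] for row in conn.execute(query, (response_id,))]


print(thread_to_root(conn, 4))  # [1, 2, 4]
print(children(conn, 1))        # [2, 3]
```

With a linear collection, a conversation has exactly one next response at every step. With a parent column, response 1 can have two continuations (2 and 3), and any single branch can still be rebuilt by walking the parents back to the root.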

Everyone can check the complete conversation history.

GLM-4.6 ran for a few minutes, continuously outputting generated code. In the end, it modified the script, added API and command-line interface calls, and wrote and ran test cases that passed.

It also generated a tree_notes.md file, which contains detailed instructions for this modification.

Everyone can compare its result with the result from Claude Sonnet.

In terms of results, there is not much difference between them: both meet the requirements of the prompt, and the code from both runs. The differences lie mainly in the implementation details, which can only be judged by reading the code closely.

7、

The third test is Simon Willison's own signature test: have the AI generate an SVG image of a pelican riding a bicycle (Generate an SVG of a pelican riding a bicycle).

This is a scene that does not exist in reality and lacks reference points, testing the model's imagination and generation capabilities.
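If you run this test through the API instead of the web interface, it helps to check that the reply is well-formed SVG before opening it. The sketch below uses only the Python standard library; the function name is hypothetical, and the sample string is a stand-in, not actual output from either model.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def is_wellformed_svg(text: str) -> bool:
    """True if text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag in (SVG_NS + "svg", "svg")


# Hypothetical stand-in for a model reply: two wheels, nothing more.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
    '<circle cx="30" cy="70" r="15" fill="none" stroke="black"/>'
    '<circle cx="70" cy="70" r="15" fill="none" stroke="black"/>'
    "</svg>"
)
print(is_wellformed_svg(sample))           # True
print(is_wellformed_svg("<svg><circle>"))  # False
```

This only confirms that the file will render at all. Whether the drawing looks like a pelican on a bicycle still has to be judged by eye.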

Below is the image generated by GLM-4.6 with deep thinking enabled.

Below is the image generated by Claude Sonnet 4.5 with deep thinking enabled.

The two results are quite similar; Claude's version simply has a more prominent beak, which makes the bird easier to recognize as a pelican.

That is the end of the testing. To sum up, I think GLM-4.6 is a very strong domestic model with excellent coding capabilities, and it can be considered a substitute for Claude Sonnet, currently recognized as the strongest model.

It is also well-rounded, capable of handling tasks other than coding, and offers fast responses, a low price, and very good value for money.

(End)
