Zhipu's Flagship GLM-5 Tested: A Comparison with Opus 4.6 and GPT-5.3-Codex

I. Introduction

Just now, I saw that Zhipu's new flagship model, GLM-5, has been officially released.

They are really pushing hard: the release comes right before the long holiday, less than two months after the previous version, GLM-4.7.

The GLM-4.x series has been highly praised at home and abroad and is widely regarded as a top-tier model for programming. A new major version naturally makes people curious about what it improves.

To be honest, the team contacted me last week to join the beta test, so I have been using this model for several days now.

Coincidentally, last week also saw the release of new versions for two flagship models abroad: Anthropic released Claude Opus 4.6, and OpenAI released GPT-5.3-Codex.

All three new models focus on programming, so I couldn't help running comparative tests to see how they differ. I suspect many people are curious about this.

Below are the results of running real programming tasks on the three models.

II. Introduction to GLM-5

According to the official release notes, GLM-5 is an open-source model that fully competes with top-tier proprietary models, with two specific areas of enhancement.

(1) Complex System Engineering

GLM-5 is not only good at generating front-end web pages but also skilled at back-end tasks, system refactoring, and deep debugging, moving away from the pattern of "prioritizing front-end aesthetics over low-level logic."

It has a strong self-reflection and error-correction mechanism: it can autonomously analyze logs, identify root causes, and iterate on fixes until the system runs smoothly.

(2) Long-range Agent

It can handle long-range tasks, that is, complex tasks with many stages and many steps. It autonomously breaks down requirements, runs continuously for hours, and maintains contextual coherence and goal consistency.

(3) Summary

GLM-5 is no longer limited to generating front-end UI. It can generate large, complex system-level projects, such as operating system kernels, browser engines, and V8 engines.

Its slogan is "In the era where large models are entering the Agent and large task phase, GLM-5 is the open-source choice you can use."

III. Testing Methods

The test questions I selected are the ones that Alejandro AO, an advocate at Hugging Face, used to test Opus 4.6 and GPT-5.3.

He made a video showing how the two models performed.

I then used the same questions to test GLM-5 and compared the results with his.

There were four questions in total, covering both frontend and backend aspects. I have already created a repository with the original prompts and scripts and uploaded it to GitHub.

IV. Web Design Test

The first test was on web design and reconstruction capabilities.

The original page was very basic.

It simply sorted the information into categories and stacked them together. We asked the AI to redesign the page so that it is attractive and user-friendly and conveys a mature, reliable, professional feel.

As mentioned above, the prompt and the original file are all on GitHub, so I won't repeat them here. Anyone can use them to run the test themselves or to try other models.

Here is the result generated by GLM-5.

The result is both attractive and professional, with all the information well organized and with animation effects. It also works without problems on mobile (see below), so it is practically ready for launch.

I've published this page, so everyone can go and take a look.

Here is the result generated by Opus 4.6, taken from a screenshot of the video.

Here is the result generated by GPT-5.3.

All three designs are usable, but the GPT-5.3 design has a flaw (the header isn't sticky, so it disappears when you scroll down) and is not as attractive as the other two.

So, in this test, GLM-5 and Opus 4.6 perform better, and which one is superior depends on the user's aesthetic preferences. Personally, I prefer the design style of GLM-5.

V. 3D Sandbox Test

The second test evaluates the AI model's 3D animation generation capabilities.

The requirement is to generate an educational web 3D sandbox that demonstrates the motion of celestial bodies in the solar system through animation, and allows adjusting animation parameters such as mass, position, and speed, as well as manually adding new celestial bodies.
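To give a sense of what the core of such a sandbox involves, here is a minimal sketch of one simulation step under Newtonian gravity. It is not taken from the test prompt or from any model's output; the names (CelestialBody, step, G, SOFTENING) and the choice of a semi-implicit Euler integrator are my own assumptions, and rendering is left out entirely.

```typescript
// Hypothetical sketch: names and constants are illustrative assumptions.
type Vec3 = [number, number, number];

interface CelestialBody {
  mass: number;
  position: Vec3;
  velocity: Vec3;
}

const G = 1; // gravitational constant in simulation units (assumed)
const SOFTENING = 1e-3; // avoids division by zero when two bodies overlap

// Advance all bodies by one time step dt (semi-implicit Euler):
// first update velocities from gravitational acceleration, then positions.
function step(bodies: CelestialBody[], dt: number): void {
  const acc = bodies.map((): Vec3 => [0, 0, 0]);
  for (let i = 0; i < bodies.length; i++) {
    for (let j = 0; j < bodies.length; j++) {
      if (i === j) continue;
      const dx = bodies[j].position[0] - bodies[i].position[0];
      const dy = bodies[j].position[1] - bodies[i].position[1];
      const dz = bodies[j].position[2] - bodies[i].position[2];
      const distSq = dx * dx + dy * dy + dz * dz + SOFTENING * SOFTENING;
      // a_i += G * m_j * (r_j - r_i) / |r_j - r_i|^3
      const scale = (G * bodies[j].mass) / (distSq * Math.sqrt(distSq));
      acc[i][0] += scale * dx;
      acc[i][1] += scale * dy;
      acc[i][2] += scale * dz;
    }
  }
  bodies.forEach((body, i) => {
    for (let k = 0; k < 3; k++) {
      body.velocity[k] += acc[i][k] * dt;
      body.position[k] += body.velocity[k] * dt;
    }
  });
}
```

In a sandbox like this, the sliders for mass, position, and speed would simply edit the fields of each body, and adding a celestial body would push a new entry into the array before the next call to step.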

Below is the generation result of GLM-5.

On the right side of the page is the animation area, which by default shows three small planets orbiting a central star. It can be rotated 360 degrees with the mouse, as well as zoomed in and out.

The left side of the page is the control panel, which is quite good.

The upper part can adjust animation and celestial parameters, while the lower part is used to add new celestial bodies or remove existing ones.

For comparison, here is the result generated by Opus 4.6.

And here is the result generated by GPT-5.3.

These three generated results all meet the requirements and can run smoothly. However, the animation of GLM-5 lacks the gravity grid lines, while the grid lines of GPT-5.3 are too messy. Therefore, Opus 4.6 performs better in terms of animation effects.

In terms of the control panel, both GLM-5 and Opus 4.6 are well-designed, while GPT-5.3 is a bit simple.

Overall, I feel that the best performer in this round is Opus 4.6, followed by GLM-5, and finally GPT-5.3.

VI. Web Games

The third test was to generate a web version of the game "Angry Birds."

GLM-5's result is decent: it is quite similar to the original and playable, but the gameplay feels thin and the bounce effect is not good enough.

Opus 4.6's result is highly faithful to the original, and the gaming experience comes close to it.

GPT-5.3's result is embarrassing: the birds cannot bounce at all, and the game is unplayable.

Clearly, Opus 4.6 is the best in this round, followed by GLM-5.
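Since the bounce behavior is what separated the three results, here is a minimal sketch of how a launched projectile with a ground bounce might be modeled. This is my own illustration, not code from any of the three models; the names (Projectile, launch, update) and the constants GRAVITY and RESTITUTION are assumptions, and a real game would typically use a physics engine with collision shapes.

```typescript
// Hypothetical sketch: names and constants are illustrative assumptions.
interface Projectile {
  x: number;
  y: number; // height above the ground, which sits at y = 0
  vx: number;
  vy: number;
}

const GRAVITY = 9.8; // downward acceleration in simulation units (assumed)
const RESTITUTION = 0.6; // share of vertical speed kept after a bounce (assumed)

// Create a projectile at the origin from a launch angle (degrees) and speed.
function launch(angleDeg: number, speed: number): Projectile {
  const angle = (angleDeg * Math.PI) / 180;
  return { x: 0, y: 0, vx: speed * Math.cos(angle), vy: speed * Math.sin(angle) };
}

// Advance the projectile by dt; reflect and damp the vertical velocity
// whenever it would pass below the ground.
function update(p: Projectile, dt: number): Projectile {
  let vy = p.vy - GRAVITY * dt;
  const x = p.x + p.vx * dt;
  let y = p.y + vy * dt;
  if (y < 0) {
    y = 0;
    vy = -vy * RESTITUTION;
  }
  return { x, y, vx: p.vx, vy };
}
```

With a restitution below 1, each bounce peaks lower than the one before it, which is roughly the feel the original game has. Setting it to 0 would reproduce a bird that lands and never bounces.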

VII. Laravel to Next.js

The last test was to convert a web application built on Laravel, a PHP framework, to Next.js, a JavaScript framework.

GLM-5 handled it without any issues, quickly converting the PHP code to JavaScript and presenting the structure of the converted code.

After the conversion, it also automatically installed the required packages, completed the build, and told the user: "Just connect the external API, then run npm run dev to start the application directly."

Following its instructions, the execution went smoothly without errors, and I could access the application by opening localhost:3000.

It's an application for checking city weather. Since there was no requirement to change the style, it looks exactly the same as the original PHP version.

The input box in the upper right corner allows you to search for cities.

In the search results, select the city you want.

Clicking into it takes you to the city's detail page, which includes weather, sunrise and sunset times, air quality, maps, and other information.
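For readers unfamiliar with Next.js, here is a rough sketch of what a converted city-search endpoint might look like as an App Router route handler. It is not the code GLM-5 produced: the file path, the City type, the placeholder CITIES list, and the searchCities helper are all my own assumptions, and the real application queries an external API rather than an in-memory list.

```typescript
// Hypothetical file: app/api/cities/route.ts
// All names and data below are illustrative assumptions.
interface City {
  name: string;
  country: string;
}

// Placeholder data standing in for the external API used by the real app.
const CITIES: City[] = [
  { name: "London", country: "GB" },
  { name: "Lyon", country: "FR" },
  { name: "Shanghai", country: "CN" },
];

// Case-insensitive prefix match; an empty query returns no results.
function searchCities(query: string): City[] {
  const q = query.trim().toLowerCase();
  if (q === "") return [];
  return CITIES.filter((city) => city.name.toLowerCase().startsWith(q));
}

// Handles GET /api/cities?q=... using the standard Request and Response
// objects. Next.js expects this function to be exported from route.ts;
// the export keyword is omitted here so the snippet runs standalone.
async function GET(request: Request): Promise<Response> {
  const query = new URL(request.url).searchParams.get("q") ?? "";
  return new Response(JSON.stringify({ results: searchCities(query) }), {
    status: 200,
    headers: { "Content-Type": "application/json" },
  });
}
```

In Laravel the equivalent logic would live in a controller method wired up in a routes file. In Next.js the file location itself defines the route, which is one reason such a conversion changes the code structure as well as the language.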

Opus 4.6 and GPT-5.3 produced the same result. Since the pages and functions are identical, I won't show their screenshots.

It's worth mentioning that the conversion took about 5 minutes for both GLM-5 and GPT-5.3, while Opus 4.6 seems to have run into some problems and took a full 20 minutes.

Looking at the results of this round, all three models perform well, but GLM-5 has a shorter generation time, no errors, and a good overall user experience, so I'm voting for it.

VIII. Summary

After these tests, I find GLM-5's programming performance impressive: it can stand alongside the latest flagship models from international companies. In some respects it even outperforms them, and where it falls short, the gap is usually in minor details rather than anything significant.

It's said that both training and inference run on domestic "WanKa" (ten-thousand-card) clusters. One can imagine that with more cards and more computing power, its performance would be even better, enough to compete head-to-head with the world's top large-model companies.

Additionally, the two areas it specifically strengthened this time, "complex system engineering" and "long-range tasks," are noticeably improved.

The system logic and back-end code it generates are reliable, with few errors either during generation or at runtime. The gaps are mostly missing features, which the AI can fill in later, not architectural problems. I also had a personal task that ran for a solid two hours and completed without going off track.

I’d like to end with an official statement.

In 2026, large programming models are evolving from "able to write code" to "able to build systems," and GLM-5 has been hailed in the open-source community as the "system architect model." With its focus shifted from "front-end aesthetics" to "Agentic depth and engineering capability," it is the domestic open-source alternative to Opus 4.6 and GPT-5.3.

(End)
