I.
Last month, I wrote an article comparing two large models.
Someone commented that the two models were too few and asked if other models could be added.
Coincidentally, last week (October 27th), MiniMax company released the M2 model, representing the latest level of domestic large models.

I thought it would be a good idea to test its practical performance and compare it with Zhipu's GLM 4.6 and Anthropic's Claude Sonnet 4.5.
After all, they are all part of the most advanced programming large models currently, which are closely related to us developers.
II.
First, let me clarify that I'm not very familiarMiniMax company is relatively low-key.
I only know that this company specializes in developing large models, with products such as text models, video models, audio models, etc., but none of them are particularly popular. I haven't paid special attention.
Last week, while browsing Twitter, I saw some foreigners discussing (1、2、3), and that's when I learned that MiniMax has released its new flagship model, M2.

The person speaking above is the head of the HuggingFace large model community, who mentioned that the M2 model ranked fifth in the world and first among open-source models in the Artificial Analysis performance competition. On that day,
it was also ranked first on the HuggingFace hot list.

In the global large model call volume ranking of OpenRouter, it ranked third this week.

I got interested and decided to try it out properly.
Three,
According to MiniMax's description, the M2 model has particularly strong programming capabilities and is one of the best programming models currently available.
As everyone knows, the most popular programming models internationally are now Claude Sonnet 4.5, and the domestic GLM 4.6 model is also very strong, so I put the three of them together for comparison.
For simplicity, I'll just use the official web version (Domestic version,International VersionRun the test on it, and everyone can try it together.

The web version is actually the official AI product.MiniMax AgentThe underlying one uses the M2 model.
Web usage is free, API calls are also now freeFree periodFor two weeks. The pricing afterwards is 1 million tokens input/output at 2.1 yuan/8.4 yuan RMB, officially promoting only 8% of Claude's price.
I'll list its other links as well.Document Repositoryon GitHub, API Call Guide (compatible with OpenAI and Anthropic formats) refer to the official documentation, Model Downloadon HuggingFace, after downloading, you can deploy locally if conditions permit.
4.
My test questions come from the famous programmer Simon Willison, his website has the test results for Cluase Sonnet 4.5.
Previously, I tested GLM 4.6 model from Zhipu company with these questions, everyone can refer to.。
This article mainly focuses on the test performance of MiniMax M2.
V.
First question, the test assesses the model's ability to understand and run code.
Clone the code repository https://github.com/simonw/llm , then run the test cases using the following command.
pip install -e '.[test]'
pytest
The prompt above requires the model to clone a Python repository, run the test cases within it, and return the results.
Judging from the web display, it's clear that the Minimax Agent has an integrated sandbox that runs code in an isolated command-line environment (see image below).

The entire process took about three minutes, and then it provided the result: it passed 466 test cases. The result was completely correct.

What impressed me was that, in addition to the execution result, it also provided a coverage analysis (see the image below), indicating which functionalities of the code were covered by the test cases. I haven’t seen any other models proactively provide coverage information.

The complete conversation can be found here .
Six,
The second question tested the most关心的 code generation capability, to see if it could generate an application according to the requirements.
I still used the same repository and asked M2 to add a feature, which required not only modifying the code but also altering the database structure and adding corresponding test cases.
1. The code repository https://github.com/simonw/llm is an AI chat application that stores users' prompts and AI responses in an SQLite database.
2. It currently uses linear collections to save individual conversations and responses. You attempted to add a parentresponseid column to the response table and model the conversation responses as a tree structure through this column.
3. Write new pytest test cases to verify your design.
4. Write a tree_notes.md file, first writing your design into the file and then using it as notes during the process.
This task is quite complex and takes a bit longer to run.
Here's a twist. During the process, it suddenly prompted that reading the GitHub repository failed, and an unexpected scene occurred.
It even automatically switched to the third-party deepwiki.com to fetch the repository. Later, when analyzing the database structure, it switched to datasette.io to analyze the SQLite database. I've never seen automatic switching to third-party cloud services like this before, unfortunately, I didn't get a chance to take a screenshot.
After completing the task, it provided a summary (below), detailing what it did, including modifying the database and adding test cases.

It even added an example file (below) demonstrating how to use the new features, and an example diagram showing the modified dialogue structure, which wasn't required by the prompt.

The complete dialogue can be found here .
Additionally, the official website's gallery has many applications it generated, which I think are also worth checking out.
Section 7,
Question 3 is the "pelican riding a bicycle" scenario invented by Simon Wilson, testing its comprehension and reasoning abilities.
Generate an SVG of a pelican riding a bicycle. (Generate an SVG of a pelican riding a bicycle)
This is a scenario that doesn't exist in reality; it relies entirely on the model's own reasoning. The stronger the comprehension, the more realistic the generated image.
Below is the result it generated. For the full conversation, see here.

For comparison, I've also included the results from the other two models.
GLM 4.6

Claude Sonnet 4.5

I think there are two noteworthy points in the results of MiniMax M2 (first image). First, it has added roads; second, its bicycle structure is relatively more complete, just missing the handlebars. Also, the posture of that pelican would be better if it looked more like "riding a bike."
Eight,
testing is over here. As for the comparison between GLM 4.6 and Claude Sonnet 4.5, everyone can check their respective links and compare themselves.
I must honestly say, MiniMax M2's performance exceeded my expectations.
What attracts me most isn't the running result itself, but the way it handles problems. It's very user-friendly, adding some auxiliary results to help understand, making it feel easy to use (accessible) and easy to understand. This also indirectly enhances the reliability of the generated results.
I tend to believe that the various review results truly reflect the real strength of the M2. Considering its API pricing (still in the free period now), I will use it in my upcoming work and also recommend everyone to give it a try.
(Complete)


























