
























LLM Chess
Sample game: Random Player (White) vs. GPT-4o Mini (Black). Outcome: draw, max moves reached (200); material White 16, Black 18.
Can Large Language Models play chess? Let's find out ツ
This leaderboard evaluates chess skill and instruction following in an agentic setting: LLMs engage in multi-turn dialogs where they are presented with a choice of actions (e.g., "get board" or "make move") when playing against an opponent (Random Player or Chess Engine).
In 2024, we began with a chaos monkey baseline: a Random Player that chooses legal moves at random. At the time, most models could barely compete. They either failed to follow game instructions (e.g., hallucinating illegal moves or taking incorrect actions) or dragged the game to the 200-move limit because they couldn't win.
In 2025, more capable reasoning models nailed both instruction following and chess skill. We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated on chess.com. This allowed us to anchor the results to a real-world rating scale and compute an Elo rating for each model.
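The agentic setup described above, in which the model picks an action such as "get board" or "make move" on each turn, can be illustrated with a minimal sketch. This is not the project's actual harness: the action strings `get_board` and `make_move <uci>`, the `ask_model` callback, and the `max_wrong` and `max_replies` limits are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, Optional

# Hypothetical action menu; the real prompt wording is not given in the source.
MENU = "Choose an action: get_board | make_move <uci>"


def run_turn(
    ask_model: Callable[[str], str],
    board_text: str,
    legal_moves: Iterable[str],
    max_wrong: int = 3,     # assumed tolerance for invalid replies
    max_replies: int = 10,  # assumed cap on dialog length per turn
) -> Optional[str]:
    """Run one multi-turn dialog and return a legal move, or None if interrupted."""
    legal = set(legal_moves)
    prompt = MENU
    wrong = 0
    for _ in range(max_replies):
        reply = ask_model(prompt).strip()
        if reply == "get_board":
            # Valid action: answer with the current board and keep the dialog going.
            prompt = board_text
        elif reply.startswith("make_move ") and reply.split(maxsplit=1)[1] in legal:
            # Valid action with a legal move: the turn is complete.
            return reply.split(maxsplit=1)[1]
        else:
            # Unknown action or illegal (hallucinated) move.
            wrong += 1
            if wrong >= max_wrong:
                return None
            prompt = "Invalid action or illegal move. " + MENU
    return None
```

A scripted stand-in for the model shows both paths: a reply sequence of `get_board`, an illegal move, then a legal move ends the turn normally, while a model that never produces a valid action causes the turn (and, in this sketch, the game) to be interrupted.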
METRICS:
- Player: Model name (playing as Black). Models that also played vs Dragon are marked
with an asterisk in superscript (e.g., 3*).
- Elo: Estimated Elo, anchored by Dragon skill levels and the calibrated Random Player. We solve a
1D MLE over aggregated blocks (opponent Elo, wins, draws, losses) and report a ±95% CI. When both Random
and Dragon data exist, they are combined. Elo is left empty when a model won or lost 100% of its games
(the MLE has no finite solution) or has no anchored games.
- Game Duration: Share of maximum game length completed (0-100%); measures
instruction-following stability across many moves. 100% means no games were interrupted due to the model
hallucinating moves or actions. 50% means that, on average, the model broke the game loop mid-game
(making an average of 100 moves out of the maximum of 200 allowed).
- Tokens: Completion tokens per move; verbosity/efficiency signal.
- Cost/Elo (main): Estimated cost per 1000 Elo points (Cost/Game divided by Elo, then scaled by 1000). Lower is more cost-efficient.
- Cost/Game (extended): Estimated cost per game based on token usage and model pricing.
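The metric definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the leaderboard's actual code: it assumes the standard logistic Elo expected-score curve, counts a draw as half a win, finds the 1D maximum-likelihood rating by bisection, and derives the 95% interval from a normal approximation based on Fisher information. All function names are hypothetical.

```python
import math

K = math.log(10) / 400  # slope constant of the standard Elo logistic curve


def expected_score(rating, opponent):
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def estimate_elo(blocks, lo=-4000.0, hi=8000.0):
    """blocks: iterable of (opponent_elo, wins, draws, losses).

    Returns (elo, ci95_half_width), or None when no finite MLE exists
    (no games, or a 100% win or loss record).
    """
    blocks = list(blocks)
    games = sum(w + d + l for _, w, d, l in blocks)
    score = sum(w + 0.5 * d for _, w, d, l in blocks)  # assumption: draw = half a win
    if games == 0 or score == 0 or score == games:
        return None

    def gradient(r):
        # Log-likelihood derivative up to the factor K: actual minus expected score.
        return sum((w + 0.5 * d) - (w + d + l) * expected_score(r, opp)
                   for opp, w, d, l in blocks)

    for _ in range(200):  # bisection; gradient is strictly decreasing in r
        mid = (lo + hi) / 2
        if gradient(mid) > 0:
            lo = mid
        else:
            hi = mid
    elo = (lo + hi) / 2

    # Fisher information at the estimate (approximate when draws are present).
    info = 0.0
    for opp, w, d, l in blocks:
        e = expected_score(elo, opp)
        info += (w + d + l) * e * (1 - e)
    info *= K ** 2
    return elo, 1.96 / math.sqrt(info)


def game_duration_pct(avg_moves, max_moves=200):
    """Share of the maximum game length completed, in percent."""
    return 100.0 * avg_moves / max_moves


def cost_per_1000_elo(cost_per_game, elo):
    """Cost/Game divided by Elo, scaled by 1000."""
    return cost_per_game * 1000 / elo
```

For example, `estimate_elo([(1500, 5, 0, 5)])` yields a rating of about 1500, since an even score against a single opponent places the model at that opponent's rating, while `estimate_elo([(1500, 10, 0, 0)])` returns `None`, matching the empty-Elo case for a 100% win record.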
ARRANGEMENT & SOURCES:
- Primary sorting: Elo (DESC), then Game Duration (DESC), Tokens (ASC).
- Data sources mix Random-vs-LLM and Dragon-vs-LLM games. Dragon levels map to Elo and provide the
anchor; Random is first calibrated vs Dragon and then used as an opponent for many models.
- Elo ratings are not comparable across player pools, i.e., you cannot compare chess.com Elo to FIDE
Elo.
- Chess.com references used for context (as of Sep 2025): Rapid Leaderboard (Elo pool), Magnus Carlsen
stats, and Elo explanation & player classes.
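The primary sort order listed above (Elo descending, then Game Duration descending, then Tokens ascending) can be expressed as a single sort key. This is a minimal sketch: the row field names are hypothetical, and placing models with an empty Elo at the bottom is an assumption, since the source does not say where they are ranked.

```python
def sort_leaderboard(rows):
    """Sort rows by Elo (DESC), then Game Duration (DESC), then Tokens (ASC).

    Each row is a dict with hypothetical keys "elo" (float or None),
    "game_duration" (0-100) and "tokens" (completion tokens per move).
    """
    def key(row):
        elo = row.get("elo")
        return (
            elo is None,            # assumption: rows without Elo sort last
            -(elo or 0.0),          # higher Elo first
            -row["game_duration"],  # longer average game duration first
            row["tokens"],          # fewer tokens per move first
        )
    return sorted(rows, key=key)


# Illustrative rows only; these are not real leaderboard results.
rows = [
    {"player": "model-a", "elo": None, "game_duration": 100.0, "tokens": 50},
    {"player": "model-b", "elo": 900.0, "game_duration": 80.0, "tokens": 300},
    {"player": "model-c", "elo": 900.0, "game_duration": 95.0, "tokens": 900},
    {"player": "model-d", "elo": 1400.0, "game_duration": 99.0, "tokens": 4000},
]
ranked = [r["player"] for r in sort_leaderboard(rows)]
```

Negating the numeric fields keeps everything in one ascending sort, which avoids chaining several stable sorts in reverse priority order.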