























An experiment I ran. The models get access to an E2B sandbox and are instructed to create an ad according to a set of specifications. They can use whatever tools they like (e.g. Pillow, Chromium), so the task serves as a proxy for their ability to use tools, create other kinds of images, do complex layouts, etc. Currently Opus 4.8 is on top (not surprising, though it did take 66 conversation turns to create the image) and GLM-5.2 is in fifth place (which I do find surprising, because it doesn't have image capability).
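
To give a rough idea of what a model might write inside the sandbox, here is a minimal sketch of the HTML-then-Chromium route. It is not taken from the benchmark: the `SPEC` fields, the `render_ad_html` helper, and the `ad.html` output path are all hypothetical, since the post does not list the real specification.

```python
import html
from pathlib import Path

# Hypothetical ad specification; the real benchmark fields are not given in the post.
SPEC = {
    "width": 1200,
    "height": 628,
    "headline": "Summer Sale: 30% Off",
    "body": "This weekend only, on all outdoor gear.",
    "cta": "Shop now",
    "background": "#0b3d91",
    "accent": "#ffd166",
}


def render_ad_html(spec: dict) -> str:
    """Return a standalone HTML page laid out at the ad's exact pixel size."""
    # Escape all user-facing text so markup in the spec cannot break the layout.
    headline = html.escape(str(spec["headline"]))
    body = html.escape(str(spec["body"]))
    cta = html.escape(str(spec["cta"]))
    width, height = int(spec["width"]), int(spec["height"])
    background, accent = spec["background"], spec["accent"]
    # Doubled braces are literal CSS braces inside the f-string.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
  body {{ margin: 0; }}
  .ad {{
    width: {width}px; height: {height}px; box-sizing: border-box;
    padding: 64px; background: {background}; color: white;
    font-family: sans-serif; display: flex; flex-direction: column;
    justify-content: center; gap: 24px;
  }}
  h1 {{ margin: 0; font-size: 72px; }}
  p {{ margin: 0; font-size: 32px; }}
  .cta {{
    align-self: flex-start; padding: 16px 32px; border-radius: 8px;
    background: {accent}; color: black; font-size: 28px; font-weight: bold;
  }}
</style>
</head>
<body>
<div class="ad">
  <h1>{headline}</h1>
  <p>{body}</p>
  <span class="cta">{cta}</span>
</div>
</body>
</html>
"""


if __name__ == "__main__":
    # Write the page to disk so a browser can rasterize it in a separate step.
    Path("ad.html").write_text(render_ad_html(SPEC), encoding="utf-8")
```

From there, a headless Chromium invocation along the lines of `chromium --headless --screenshot=ad.png --window-size=1200,628 ad.html` could turn the page into an image, though the binary name and available flags depend on what is installed in the sandbox. A model without image input (as I assume is the case for GLM-5.2) can still follow this route, because the layout is fully determined by the markup rather than by looking at the result.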