AI is ready to take over Python programming, but not much else

Tests of how well 19 large language models (LLMs) complete and perform complicated multi-step tasks have shown that they are both error-prone and, in many cases, unreliable.

The findings are contained in a preprint paper, LLMs Corrupt Your Documents When You Delegate, written by Microsoft researchers Philippe Laban, Tobias Schnabel and Jennifer Neville. It is based on a benchmark they created called DELEGATE-52, which allowed them to simulate workflows that might be part of a knowledge worker’s tasks. The paper is currently under review.

They said that the benchmark contains 310 work environments across 52 professional domains including coding, crystallography, genealogy and music sheet notation. Each environment consists of real documents totaling around 15K tokens in length, and five to 10 complex editing tasks that a user might ask an LLM to perform.

And, they stated in the paper’s abstract: “Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.”

Those mistakes are significant, they said. “The findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing an average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%.”
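
For a rough sense of how such totals could build up, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each interaction loses the same fraction of whatever content remains; the quoted figures do not say the loss is uniform, and the function name per_step_loss is hypothetical.

```python
def per_step_loss(total_loss: float, steps: int) -> float:
    """Per-interaction loss that compounds to `total_loss` after `steps`.

    Assumption (illustration only): every interaction retains the same
    fraction of the content that survived the previous interaction.
    """
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if steps <= 0:
        raise ValueError("steps must be positive")
    retained_per_step = (1.0 - total_loss) ** (1.0 / steps)
    return 1.0 - retained_per_step


# Figures quoted above: 25% (frontier models) and 50% (all models)
# over 20 delegated interactions.
frontier = per_step_loss(0.25, 20)
overall = per_step_loss(0.50, 20)
print(f"frontier models: {frontier:.2%} per interaction")
print(f"all models:      {overall:.2%} per interaction")
```

Under that uniform-loss assumption, a loss of roughly 1.4% per interaction is enough to reach 25% after 20 rounds, and roughly 3.4% per interaction reaches 50%. That suggests why a single edit can look clean while a long workflow does not.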

Benchmark exercise receives a thumbs up

Brian Jackson, principal research director at Info-Tech Research Group, found the findings very interesting. “Putting a list of LLMs to the test across different work domains yields a lot of useful insights,” he said. “I think this type of benchmark exercise could be helpful to enterprise developers who are looking to leverage agentic AI to automate specific workflows and understand the limits of what can be achieved.”

However, he said, “what we shouldn’t conclude from this is that, because these foundation models caused document degradation after 20 edits, they can’t be used to automate work in a certain field. It just means they can’t do all of the work as they are currently constructed.”

But, Jackson stated, “in an enterprise environment where having an accurate output is crucial, you wouldn’t take that approach. You would design the automation flow with stronger guardrails in place to prevent errors. This could be done by using multiple agents that play different roles, such as one that makes the edits and another that checks for errors and makes corrections.”

Sanchit Vir Gogia, chief analyst at Greyhound Research, said, “the Microsoft paper should be read as a serious warning about delegated AI, not as a claim that enterprise AI has failed. That distinction matters. The paper is still a preprint, so it deserves careful handling, but its central question is exactly the one CIOs should be asking: can AI preserve the integrity of complex work over repeated delegation?”

The study, he said, is stronger than what he described as “the usual AI benchmark theatre,” because it tests work products rather than just clever one-off answers. “It uses reversible editing tasks, domain-specific evaluators, and a round-trip method to see whether a document returns intact after repeated edits. In too many cases, it does not.”
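
The round-trip idea can be sketched in a few lines. This is a minimal illustration, not the paper's evaluator: plain string similarity from Python's difflib stands in for the domain-specific evaluators, and the edit functions below are hypothetical stand-ins for an LLM carrying out a reversible task and then undoing it.

```python
from difflib import SequenceMatcher
from typing import Callable

Edit = Callable[[str], str]


def round_trip_similarity(document: str, forward: Edit, backward: Edit,
                          rounds: int = 1) -> float:
    """Apply an edit and its inverse `rounds` times, then compare the
    result with the original. Returns 1.0 when the document is intact."""
    current = document
    for _ in range(rounds):
        current = backward(forward(current))
    return SequenceMatcher(None, document, current).ratio()


# Hypothetical reversible task: switch a spelling, then switch it back.
def forward(text: str) -> str:
    return text.replace("colour", "color")


def backward(text: str) -> str:
    return text.replace("color", "colour")


# Hypothetical faulty delegate: undoes the edit but silently drops the last line.
def lossy_backward(text: str) -> str:
    lines = backward(text).splitlines(keepends=True)
    return "".join(lines[:-1])


doc = "Title: colour report\nLine one\nLine two\n"
print(round_trip_similarity(doc, forward, backward, rounds=10))      # faithful delegate
print(round_trip_similarity(doc, forward, lossy_backward, rounds=1))  # lossy delegate
```

The faithful pair returns the document unchanged however many rounds are run, while the lossy one scores below 1.0 after a single round and gets worse with each repetition.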

That is the point, explained Gogia. “This is not merely about hallucinations. It is about artefact integrity.”

AI is ‘not yet trustworthy enough’

He added that the headline finding is “uncomfortable: even the strongest models corrupt about a quarter of document content by the end of long workflows, while average degradation across all tested models reaches roughly 50%. The paper also finds that performance varies sharply by domain. Python is the only domain where most models are ‘ready,’ and the best model reaches that threshold in only 11 of 52 domains.”

AI is not failing because it cannot write, said Gogia; it is failing because it cannot yet preserve.

The study, he pointed out, “is especially useful because it shows how errors accumulate. Bigger documents worsen outcomes. Longer interaction worsens outcomes. Distractor files worsen outcomes. Short tests flatter the system, while longer workflows expose it. That maps rather neatly to the enterprise world, where work is messy, files are stale, context is noisy and the most important documents are rarely the simplest ones.”

The honest conclusion, he said, “is not that AI should be kept out of enterprise workflows. It is that delegated AI is not yet trustworthy enough to be left alone with consequential artefacts.”

When AI edits an important document such as a contract, a ledger, a policy, a codebase, a board paper, or a compliance record, Gogia warned, the enterprise still owns the damage.

Mitigation approaches

In order to prevent that damage, Jackson suggested, enterprises can do additional training and fine-tuning of models to be better adapted to their specific workflows: “These foundation models are very good at doing a lot of different tasks, but less good at doing one specific task very well. So, enterprises that want to achieve that may need to improve the models themselves by training on their own data.”

For example, “[the Microsoft paper] points out one multi-agent setup that led to more degradation instead of less, so the method to detect degradation must be well-designed to be effective,” he said. “Another approach that some enterprise platforms have introduced is a way to deterministically verify the output for accuracy using mathematical verification. So, knowing what domains prove more difficult for a single LLM to automate is useful, as developers can plan to add more verification steps to the process.”
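
One simple form a deterministic verification step could take is a scope check: confirm that an edit touched only the lines it was supposed to touch. The sketch below is an assumption-laden illustration using Python's difflib, not the mathematical verification offered by any particular enterprise platform; the function names and the sample contract text are hypothetical.

```python
from difflib import SequenceMatcher


def changed_lines(original: str, edited: str) -> set[int]:
    """Return 0-based indices of original lines that were altered or removed.

    Pure insertions are not flagged here; a stricter check could reject
    those as well.
    """
    a, b = original.splitlines(), edited.splitlines()
    changed: set[int] = set()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1, i2))
    return changed


def edit_is_within_scope(original: str, edited: str, allowed: set[int]) -> bool:
    """True only if every altered original line was one the task allowed."""
    return changed_lines(original, edited) <= allowed


original = (
    "Clause 1: term is 12 months\n"
    "Clause 2: fee is 500\n"
    "Clause 3: notice is 30 days"
)
# Requested edit: change the fee in Clause 2 (line index 1) only.
good_edit = original.replace("fee is 500", "fee is 600")
# Same edit, plus a silent change to Clause 3 that nobody asked for.
bad_edit = good_edit.replace("30 days", "3 days")

print(edit_is_within_scope(original, good_edit, allowed={1}))
print(edit_is_within_scope(original, bad_edit, allowed={1}))
```

A check like this does not judge whether the requested change is right, only that nothing else moved, so it would complement rather than replace expert review.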

He said, “depending on the model, for example, if it’s totally open source or if it’s proprietary, you can have more flexibility in terms of how much you can customize it. So, an enterprise developer might look at these results, pick the LLM best at automating their desired domain, and then send it in for additional training to master the process.”

People do not disappear

According to Gogia, the paper also shows something more precise than ‘AI still needs people.’ “It shows that AI changes the human layer from production to supervision, validation, and accountability. That is a rather different operating model from the one being sold in many boardroom conversations.”

People, he said, “do not disappear. Their work moves. This is the uncomfortable part for enterprises chasing headcount reduction. The people best placed to catch AI errors are often the same people organizations are hoping to replace, reduce, or redeploy. Remove too much domain expertise from the workflow, and the enterprise also removes the people who know when the AI has quietly damaged the work.”

Expertise becomes more valuable, not less, said Gogia: “The paper reinforces this because stronger models do not merely delete content. They often corrupt it. Weaker models are easier to catch when they visibly drop material. Frontier models are more awkward because the content remains present but becomes wrong, distorted, or subtly altered. That requires knowledgeable review, not casual inspection.”

This article originally appeared on CIO.com.
