The Myth of Unsafe Open Source AI

Researchers, often working at closed labs, argue that open models are inherently unsafe because you cannot control them after their release, and because they can be fine-tuned for any malicious use case. While this argument is theoretically true, it also assumes that closed models are safer, that their guardrails work, and that their providers take measures against misuse. To test these assumptions, I researched the real-world misuse of both open and closed models and compiled a list.

For this, I looked into third-party reports examining real-world impacts. While benchmarks tell one story, real-world threat actors will try to break guardrails and circumvent safety measures. Furthermore, many cybersecurity benchmarks are outdated or focus on finding novel exploits, while the real world runs on outdated software.

The obvious caveat to this whole blog post is that it is hard to know exactly when and how open models are used. That is why I focused on third-party sources that do not have privileged access to the usage data of closed models, or of inference APIs in general, but nevertheless identified the use of AI during their investigations.

I conducted the research using ChatGPT and Gemini, testing various models at their highest settings with different prompts and modes to increase coverage. Just like all my other public work, I wrote the report myself.

Cybersecurity

Mexican Government Hack (Link). A single threat actor used Claude Code and GPT-4.1 together to exfiltrate 195 million taxpayer records. The campaign lasted from December 2025 to February 2026, and Claude’s guardrails were circumvented using an AGENTS.md file. Claude, together with GPT-4.1 over the API, did most of the work needed to chain together different exploits, with the threat actor nudging the models from time to time. Essentially, someone was able to vibe-hack the Mexican government using Claude and GPT. Notably, this campaign happened during and after the introduction of Constitutional Classifiers++.

Bissa Scanner (Link). A scanner exploiting known vulnerabilities, such as React2Shell, was operated using Claude Code and OpenClaw, with Claude Sonnet 4.6 as the model. It stole thousands of records, API keys, and files.

FortiGate (Link, Link). An individual exploited hundreds of poorly configured FortiGate gateways using Claude and DeepSeek, together with a collection of scripts, to exfiltrate data.

PromptSpy (Link). A novel piece of malware uses Google Gemini as a computer-use agent on Android phones to lock users into a malicious app. The infected app had to be sideloaded and, at the time of writing, had no known real-world impact, but it demonstrates the potential future use of capable CUA models.

EvilTokens (Link). In a phishing campaign, the threat actors used GPT-4o-mini to translate emails, Llama 3.1 8B to analyze them, and Llama 3.3 70B for further assessment, including identifying individuals susceptible to social engineering. Llama’s guardrails were circumvented with a prompt.

LameHug (Link). The Russian state-sponsored group APT28 used Qwen2.5-Coder-32B-Instruct through the Hugging Face inference API to create malicious commands at runtime on victims’ systems. The malware targeted entities in the security and defense sectors using a compromised Ukrainian ministry account. It was discovered by Ukraine’s CERT in mid-2025.

Patriot Bait (Link). A Russian-speaking actor used Google Gemini through the Gemini CLI to roleplay as an American veteran and patriot. The model’s guardrails were circumvented with a prompt in a GEMINI.md file. The actor used Gemini to create misinformation, run pump-and-dump schemes, hack WordPress sites, and steal credentials.

North Korean spearphishing campaign (Link). North Korean actors used ChatGPT’s image generation to create sample ID cards of South Koreans as part of a spearphishing campaign.

Misinformation and Deepfakes

Pravda Propaganda Network (Link, Link). While I exclude superficial benchmarks from this blog post, I want to make an exception for this particular campaign. The Russian Pravda Network spreads pro-Kremlin and pro-Iranian misinformation, which is then picked up by various chatbots and used as a source in their responses. This affects all major chat apps, including ChatGPT, Claude, Gemini, Grok, and DeepSeek. The problem also appears to be growing as the network becomes more successful.

CopyCop (Link). The Russian propaganda network Storm-1516 is creating hundreds of fake websites targeting Western nations, media organizations, and political parties. It likely uses, or used, an uncensored version of Llama 3.1 8B for its purposes.

Grok image generation (Link). Grok’s image generation was used to create unsolicited sexual images, including images involving minors, as well as Nazi and ISIS propaganda. Apparently, it remained a problem even after its guardrails were tightened, although I have been unable to find newer reports.

CSAM and Porn (Link, Link). The dominant area of misuse for open models appears to be image and video generation, where fine-tuned models and LoRA adapters are the norm. Because this content is highly illegal, offenders advise against using closed models that log requests. The IWF report is pretty damning in this regard, while also acknowledging that the problem is far greater than what it can observe.

Conclusion

Malicious use appears to rely predominantly on closed models, with the exception of deepfakes and CSAM. Their guardrails are easily bypassed with a single prompt, a weakness that affects every major model lab and even today's frontier models.

Looking at the reports, closed models are used either because they are at the current frontier or because they are easier to set up and use. It also makes intuitive sense: Why secretly acquire hundreds of GPUs, collect a ton of data, and hire researchers with the expertise needed to fine-tune a less capable model for a single use case when you can just add a CLAUDE.md file to a frontier model and use it to break into governments?

While closed-model companies could restrict or even revoke access to their models, we have yet to see a model retracted. Instead, the trend is toward making increasingly capable models available to more companies and people as quickly as possible. Also, once misuse has happened, it is impossible to undo. Providers can only tighten safety classifiers or silently nerf the models afterwards.

I expect people to argue that this analysis looks backwards rather than forwards. This is a convenient argument because it is impossible to refute with evidence, and it is brought up every time there is a step change in capabilities, whether with ChatGPT, GPT-4, o1, Sonnet 3.6, or Opus 4.5. After some time, open models catch up at a fraction of the price, diffusing those capabilities even further. And yet the misuse frontier remains dominated by closed models. I see no indication that Mythos or Fable will be any different. The historical evidence, as outlined in this blog post, is pretty clear in this regard.

Florian Brand
