惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

F
Fox-IT International blog
Recent Announcements
Recent Announcements
D
Docker
IT之家
IT之家
B
Blog
Jina AI
Jina AI
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
博客园 - 【当耐特】
Google DeepMind News
Google DeepMind News
F
Fortinet All Blogs
量子位
C
Check Point Blog
Microsoft Azure Blog
Microsoft Azure Blog
罗磊的独立博客
博客园 - 司徒正美
李成银的技术随笔
美团技术团队
Blog — PlanetScale
Blog — PlanetScale
雷峰网
雷峰网
The GitHub Blog
The GitHub Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
J
Java Code Geeks
T
The Blog of Author Tim Ferriss
酷 壳 – CoolShell
酷 壳 – CoolShell
MongoDB | Blog
MongoDB | Blog
P
Proofpoint News Feed
L
LangChain Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Y
Y Combinator Blog
大猫的无限游戏
大猫的无限游戏
有赞技术团队
有赞技术团队
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
V
Visual Studio Blog
T
Tailwind CSS Blog
H
Help Net Security
Engineering at Meta
Engineering at Meta
小众软件
小众软件
B
Blog RSS Feed
Stack Overflow Blog
Stack Overflow Blog
月光博客
月光博客
M
Microsoft Research Blog - Microsoft Research
宝玉的分享
宝玉的分享
人人都是产品经理
人人都是产品经理
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
GbyAI
GbyAI
H
Hackread – Cybersecurity News, Data Breaches, AI and More
Last Week in AI
Last Week in AI
Martin Fowler
Martin Fowler
Stack Overflow Blog
Stack Overflow Blog

OpenAI Developers

Realtime prompting guide OpenAI Developers plugin for Codex API deployment checklist | OpenAI API Codex for (almost) everything Sora 2 Prompting Guide Codex Prompting Guide Introducing the Codex app Codex in JetBrains IDEs Docs MCP | OpenAI Developers Gpt-image-1.5 Prompting Guide GPT-5.2 Prompting Guide Transcribing User Audio with a Separate Realtime Request Modernizing your Codebase with Codex Codex code review Build beautiful frontends with OpenAI Codex Building with Open Models Using OpenAI Codex CLI with GPT-5-Codex OpenAI Codex in your code editor Context Engineering & Coding Agents with Cursor Shipping with Codex Sora, ImageGen, and Codex: The Next Wave of Creative Production Live Demo Showcase: Tools That 10x Your Codebase GitHub - openai/openai-sora-sample-app: Sample app to get started using the Video API with Sora GitHub - openai/openai-apps-sdk-examples: Example apps for the Apps SDK GitHub - openai/openai-chatkit-advanced-samples: Starter app to build with OpenAI ChatKit SDK GitHub - openai/openai-chatkit-starter-app: Starter app to build with OpenAI ChatKit + Agent Builder Agentic Commerce Protocol | OpenAI Developers Rate limits guide Web search guide Getting Started with Evals Prompt Optimizer How to run gpt-oss locally with LM Studio Fine-tuning with gpt-oss and Hugging Face Transformers How to run gpt-oss locally with Ollama Function calling guide OpenAI models page Reasoning best practices Reasoning guide Background mode guide Batch API guide Conversation state guide File search guide Flex processing guide MCP guide Code interpreter guide Quickstart - OpenAI Agents SDK Balance accuracy, latency, and cost Build hour — agentic tool calling Build hour — built-in tools Keep costs low & accuracy high Graders Evals Best Practices Working with the Evals API Guardrails - OpenAI Agents SDK Latency optimization guide AI Techniques (Foundations): Evaluating LLM Applications LLM correctness and consistency Model distillation overview Agent orchestration - OpenAI Agents SDK Production best practices Prompt engineering guide RAG technique overview Realtime guide Realtime intro Realtime translation guide AI Techniques (Foundations): Responses API Tools & Features Responses guide Responses vs. chat completions guide Speech-to-text guide Supervised fine-tuning overview Tracing - OpenAI Agents SDK AI Techniques (Foundations): Introduction to Agentic Workflows Vision fine-tuning overview Audio & speech guide GitHub - openai/openai-cs-agents-demo: Demo of a customer service use case implemented with the OpenAI Agents SDK Model distillation overview Voice applications intro Fine-tuning best practices Realtime translation guide 4o image generation intro GitHub - openai/openai-agents-python: A lightweight, powerful framework for multi-agent workflows GitHub - openai/openai-agents-js: A lightweight, powerful framework for multi-agent workflows and voice agents Building agents guide Built-in tools guide Codex intro Computer Use API guide GitHub - openai/openai-cua-sample-app: Learn how to use CUA (our Computer Using Agent) via the API on multiple computer environments. DevDay — distillation breakout DevDay — realtime breakout GitHub - openai/openai-testing-agent-demo: Demo of a UI testing agent using the OpenAI CUA model and the Responses API. DevDay — structured outputs breakout Image generation guide An Introduction to MCP Model optimization guide New audio models intro GitHub - openai/openai-fm: Code for openai.fm, a demo for the OpenAI Speech API Predicted outputs guide GitHub - openai/openai-realtime-console: React app for inspecting, building and debugging with the Realtime API Building Voice Agents GitHub - openai/openai-realtime-solar-system: Demo showing how to use the OpenAI Realtime API to navigate a 3D scene via tool calling
Verifying gpt-oss implementations
2025-08-11 · via OpenAI Developers

The OpenAI gpt-oss models are introducing a lot of new concepts to the open-model ecosystem and getting them to perform as expected might take some time. This guide is meant to help developers building inference solutions to verify their implementations or for developers who want to test any provider’s implementation on their own to gain confidence.

The new models behave more similarly to some of our other OpenAI models than to existing open models. A couple of examples include:

  1. The harmony response format. These models were trained on our OpenAI harmony format to structure a conversation. While regular API developers won’t need to deal with harmony in most cases, the inference providers that provide a Chat Completions-compatible, Responses-compatible or other inference API need to map the inputs correctly to the OpenAI harmony format. If the model does not receive the prompts in the right format this can have cascading generation issues and at minimum a worse function calling performance.
  2. Handling chain of thought (CoT) between tool calls. These models can perform tool calls as part of the CoT. A consequence of this is that the model needs to receive the CoT in subsequent sampling until it reaches a final response. This means that while the raw CoT should not be displayed to end-users, it should be returned by APIs so that developers can pass it back in along with the tool call and tool output. You can learn more about it in this separate guide.
  3. Differences in actual inference code. We published our mixture-of-experts (MoE) weights exclusively in MXFP4 format. This is still a relatively new format and along with other architecture decisions, existing inference code that was written for other open-models will have to be adapted for gpt-oss models. For that reason we published both a basic (unoptimized) PyTorch implementation, and a more optimized Triton implementation. Additionally, we verified the vLLM implementation for correctness. We hope these can serve as educational material for other implementations.

Responses API

For best performance we recommend inference providers to implement our Responses API format as the API shape was specifically designed for behaviors like outputting raw CoT along with summarized CoTs (to display to users) and tool calls without bolting additional properties onto a format. The most important
part for accurate performance is to return the raw CoT as part of the output.

For this we added a new content array to the Responses API’s reasoning items. The raw CoT should be wrapped into reasoning_text type element, making the overall output item look the following:

{
  "type": "reasoning",
  "id": "item_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b",
  "status": "completed",
  "summary": [
    /* optional summary elements */
  ],
  "content": [
    {
      "type": "reasoning_text",
      "text": "The user needs to know the weather, I will call the get_weather tool."
    }
  ]
}

These items should be received in subsequent turns and then inserted back into the harmony formatted prompt as outlined in the raw CoT handling guide.

Check out the Responses API docs for the whole specification.

Chat Completions

A lot of providers are offering a Chat Completions-compatible API. While we have not augmented our published API reference on the docs to provide a way to receive raw CoT, it’s still important that providers that offer the gpt-oss models via a Chat Completions-compatible API return the CoT as part of their messages and for developers to have a way to pass them back.

There is currently no generally agreed upon specification in the community with the general properties on a message being either reasoning or reasoning_content. To be compatible with clients like the OpenAI Agents SDK we recommend using a reasoning field as the primary property for the raw CoT in Chat Completions.

To verify if a provider is working you can use the Node.js script published in our gpt-oss GitHub repository that you can also use to run other evals. You’ll need Node.js or a similar runtime installed to run the tests.

These tests will run a series of tool/function calling based requests to the Responses API or Chat Completions API you are trying to test. Afterwards they will evaluate both whether the right tool was called and whether the API shapes are correct.

This largely acts as a smoke test but should be a good indicator on whether the APIs are compatible with our SDKs and can handle basic function calling. It does not guarantee full accuracy of the inference implementation (see the evals section below for details on that) nor does it guarantee full compatibility with the OpenAI APIs. They should still be a helpful indicator of major implementation issues.

To run the test suite run the following commands:

# clone the repository
git clone https://github.com/openai/gpt-oss.git

# go into the compatibility test directory
cd gpt-oss/compatibility-test/

# install the dependencies
npm install

# change the provider config in providers.ts to add your provider

# run the tests
npm start -- --provider <your-provider-name>

Afterwards you should receive a result of both the API implementation and any details on the function call performance.

If your tests are successful, the output should show 0 invalid requests and over 90% on both pass@k and pass^k. This means the implementation should likely be correct. To be fully sure, you should also inspect the evals as described below.

If you want a detailed view of the individual responses, you can the jsonl file that was created in your directory.

You can also enable debug mode to view any of the actual request payloads using DEBUG=openai-agents:openai npm start -- --provider <provider-name> but it might get noisy. To run only one test use the -n 1 flag for easier debugging. For testing streaming events you can use --streaming.

The team at Artificial Analysis is running AIME and GPQA evals for a variety of providers. If you are unsure about your provider, check out Artificial Analysis for the most recent metrics.

To be on the safe side you should consider running evals yourself. To run your own evals, you can find in the same repository as the test above a gpt_oss/evals folder that contains the test harnesses that we used to verify the AIME (16 attempts per problem), GPQA (8 attempts per problem) and Healthbench (1 attempt per problem) evals for the vLLM implementation and some of our own reference implementations. You can use the same script to test your implementations.

To test a Responses API compatible API run:

python -m gpt_oss.evals --base-url http://localhost:8000/v1 --eval aime25 --sampler responses --model openai/gpt-oss-120b --reasoning-effort high

To test a Chat Completions API compatible API run:

python -m gpt_oss.evals --base-url http://localhost:8000/v1 --eval aime25 --sampler chat_completions --model openai/gpt-oss-120b --reasoning-effort high

If you are getting similar benchmark results as those published by us and your function calling tests above succeeded you likely have a correct implementation of gpt-oss.