GitHub Copilot CLI combines model families for a second opinion
2026-04-06 · via The GitHub Blog

Discover how Rubber Duck provides a different perspective to GitHub Copilot CLI.

When you ask a coding agent to build a data pipeline, it may not use the best structure. But what if the agent got a second opinion before it executed the plan?

Today, in GitHub Copilot CLI, we’re introducing Rubber Duck in experimental mode. Rubber Duck leverages a second model from a different AI family to act as an independent reviewer, assessing the agent’s plans and work at the moments where feedback matters most.

To catch different kinds of errors, a different perspective matters. In our evaluations, Claude Sonnet paired with Rubber Duck closed 74.7% of the performance gap between Sonnet running alone and Opus running alone, with the largest gains on difficult multi-file and long-running tasks. Use /experimental in Copilot CLI to access Rubber Duck alongside our other experimental features.

The problem: Confident mistakes can compound

Today’s coding agents follow a clear loop. First, the agent assesses the task, then drafts a plan, implements, tests, and iterates if necessary. It’s a powerful flow that works well, but it has blind spots. Any decision an agent makes early on, especially in the planning stage, is the foundation you’re building upon. Assumptions and inefficiencies become dependencies, and by the time you notice, you may have to fix more than just the small mistake at the start.

Using self-reflection and having the agent review its own output before moving forward is a proven technique. However, a model reviewing its own work is still bounded by its own training biases: the same training data and techniques, the same blind spots.

Rubber Duck adds a second perspective

Rubber Duck is a focused review agent, powered by a model from a complementary family to your primary Copilot session. When you’ve selected a Claude model from the model picker to use as your orchestrator, Rubber Duck will be GPT-5.4. As we experiment with Rubber Duck, we are exploring other model families for the orchestrator and for the Rubber Duck. The job of Rubber Duck is to check the agent’s work and surface a short, focused list of high-value concerns: details that the primary agent may have missed, assumptions worth questioning, and edge cases to consider.
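As a rough illustration of the pairing rule described above, the sketch below maps an orchestrator's model family to a reviewer from a different family. The function name, the family labels, and the model identifier strings are assumptions made for this example. They are not Copilot CLI's actual configuration or API.

```python
from typing import Optional

# Hypothetical pairing table: orchestrator family -> reviewer model.
# Only the Claude -> GPT-5.4 pairing is stated in this post; the
# identifier strings themselves are illustrative.
REVIEWER_FOR_FAMILY = {"claude": "gpt-5.4"}


def pick_reviewer(orchestrator_model: str) -> Optional[str]:
    """Return a reviewer model from a different family, or None if unpaired."""
    # Treat the text before the first hyphen as the model family.
    family = orchestrator_model.split("-", 1)[0].lower()
    return REVIEWER_FOR_FAMILY.get(family)


print(pick_reviewer("claude-sonnet-4.6"))  # gpt-5.4
print(pick_reviewer("gpt-5.4"))            # None
```

Returning None for an unpaired family mirrors the current state described in the post, where only Claude orchestrators get a reviewer.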

When does the cross-family review help?

We evaluated Rubber Duck on SWE-Bench Pro, a benchmark of large, difficult, real-world coding problems drawn from open-source repositories. Here’s what we found:

Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus.
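One common way to read a "gap closed" figure is as the share of the distance between the weaker and the stronger baseline that the paired setup recovers. The sketch below assumes that reading. The post does not publish the underlying resolution rates, so the numbers in the example are placeholders chosen for easy arithmetic, not benchmark results.

```python
def gap_closed(baseline: float, paired: float, stronger: float) -> float:
    """Fraction of the baseline-to-stronger gap recovered by the paired setup."""
    if stronger == baseline:
        raise ValueError("no gap to close: baseline equals stronger score")
    return (paired - baseline) / (stronger - baseline)


# Placeholder resolution rates in percent (hypothetical, not from the post).
print(gap_closed(baseline=40.0, paired=46.0, stronger=48.0))  # 0.75
```

Under this definition, a value of 0.747 would mean the paired setup recovers roughly three quarters of the distance between the two standalone models.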

We noticed that Rubber Duck tends to help more with difficult problems, ones that span 3+ files and would normally take 70+ steps. On these problems, Sonnet + Rubber Duck scores 3.8% higher than the Sonnet baseline, and 4.8% higher on the hardest problems identified across three trials. Here are a few examples of what Rubber Duck finds:

  • Architectural catch (OpenLibrary/async scheduler): Rubber Duck caught that the proposed scheduler would start and immediately exit, running zero jobs—and that even if fixed, one of the scheduled tasks was itself an infinite loop.
  • One-liner bug, big impact (OpenLibrary/Solr): Rubber Duck caught a loop that silently overwrote the same dict key on every iteration. Three of four Solr facet categories were being dropped from every search query, with no error thrown.
  • Cross-file conflict (NodeBB/email confirmation): Rubber Duck caught three files that all read from a Redis key which the new code stopped writing. The confirmation UI and cleanup paths would have been silently broken on deploy.
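The second example above describes a class of bug that is easy to miss in review: assigning to the same dict key on every loop iteration. The sketch below reproduces that pattern in isolation. It is not the OpenLibrary code, and the facet names are hypothetical. It relies only on the fact that a query parameter such as Solr's `facet.field` can be repeated once per value.

```python
from urllib.parse import urlencode

FACETS = ["author", "subject", "language", "year"]  # hypothetical facet names


def facet_params_buggy(facets):
    params = {}
    for name in facets:
        # Bug: the key is identical on every iteration, so each assignment
        # replaces the previous one and only the last facet survives.
        params["facet.field"] = name
    return params


def facet_params_fixed(facets):
    # Keep every value under the key; urlencode(..., doseq=True) then
    # emits one key=value pair per list element.
    return {"facet.field": list(facets)}


print(urlencode(facet_params_buggy(FACETS)))
# facet.field=year
print(urlencode(facet_params_fixed(FACETS), doseq=True))
# facet.field=author&facet.field=subject&facet.field=language&facet.field=year
```

With four facets, the buggy version silently drops three of them and raises no error, which is why this kind of mistake tends to survive a same-model self-review.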

When does Rubber Duck activate?

GitHub Copilot can call Rubber Duck automatically, both proactively and reactively, and you can trigger it yourself at any time to have Copilot critique and revise its work.

For complex work, GitHub Copilot may seek a critique automatically at the checkpoints where feedback has the highest return:

  1. After drafting a plan: This is where we expect developers will see the biggest wins, because catching a suboptimal decision early avoids compounding errors downstream.
  2. After a complex implementation: This is when a second set of eyes on complex code can help catch edge cases.
  3. After writing tests, before executing them: This is a chance to catch gaps in test coverage or flawed assertions before a green test run convinces the agent that “everything passes.”

The agent can also seek a critique reactively if it gets stuck in a loop or can’t make progress. Consulting Rubber Duck can break the logjam.
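The proactive checkpoints and the reactive trigger described above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not how Copilot decides internally: the `Event` names, the `is_complex` flag, and the `stuck_threshold` of three repeated failures are all hypothetical.

```python
from enum import Enum, auto


class Event(Enum):
    PLAN_DRAFTED = auto()
    IMPLEMENTATION_DONE = auto()
    TESTS_WRITTEN = auto()   # tests exist but have not been executed yet
    OTHER_STEP = auto()      # any other step in the agent loop


# The three proactive checkpoints listed in the post.
CHECKPOINTS = {Event.PLAN_DRAFTED, Event.IMPLEMENTATION_DONE, Event.TESTS_WRITTEN}


def should_request_critique(event: Event, is_complex: bool,
                            repeated_failures: int,
                            stuck_threshold: int = 3) -> bool:
    """Decide whether to ask the reviewer for a critique at this step."""
    # Reactive path: the agent looks stuck, regardless of the current step.
    if repeated_failures >= stuck_threshold:
        return True
    # Proactive path: only for complex work, and only at a checkpoint.
    return is_complex and event in CHECKPOINTS


print(should_request_critique(Event.PLAN_DRAFTED, True, 0))   # True
print(should_request_critique(Event.OTHER_STEP, True, 0))     # False
print(should_request_critique(Event.OTHER_STEP, False, 3))    # True
```

Keeping the proactive path limited to a small set of checkpoints is one way to express the "invoke sparingly" goal: most steps never reach the reviewer.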

As a user, you can request a critique at any point. Copilot will query Rubber Duck, reason over the feedback, and show you what changed and why.
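The query, reason, and show-what-changed flow described above might look roughly like the sketch below. The `reviewer` and `revise` callables are stand-ins for model calls, and the stub implementations exist only to make the example runnable. None of these names come from Copilot CLI.

```python
import difflib
from typing import Callable, List, Tuple


def critique_and_revise(
    draft: str,
    reviewer: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
) -> Tuple[str, List[str], str]:
    """Return the revised draft, the reviewer's concerns, and a unified diff."""
    concerns = reviewer(draft)
    if not concerns:
        return draft, [], ""  # nothing to change
    revised = revise(draft, concerns)
    # Show exactly what changed between the draft and the revision.
    diff = "\n".join(difflib.unified_diff(
        draft.splitlines(), revised.splitlines(),
        fromfile="before", tofile="after", lineterm=""))
    return revised, concerns, diff


# Stubs standing in for the reviewer model and the primary agent.
def stub_reviewer(plan: str) -> List[str]:
    if "run_forever" in plan:
        return []
    return ["Scheduler exits before any job runs"]


def stub_revise(plan: str, concerns: List[str]) -> str:
    return plan + "\nkeep scheduler alive with run_forever()"


revised, concerns, diff = critique_and_revise(
    "start scheduler", stub_reviewer, stub_revise)
print(concerns)
print(diff)
```

Returning the concerns alongside the diff keeps the "what changed" and the "why" together, which matches the behavior the post describes.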

We made a key design choice: the agent invokes Rubber Duck sparingly, targeting the moments where the signal is highest, without getting in the way. For the technically curious: Rubber Duck is invoked through Copilot’s existing task tool—the same infrastructure used for other subagents.

For now, we are enabling Rubber Duck for all Claude family models (Opus, Sonnet, and Haiku) used as orchestrators in the model picker. We are already exploring other model families for the Rubber Duck to pair with GPT-5.4 as the orchestrator.

Getting started

Rubber Duck is available today in experimental mode.

To start using it, install GitHub Copilot CLI, and run the /experimental slash command. Rubber Duck will be available when you select any Claude model from the model picker and have access enabled to GPT-5.4. You’ll see critiques surface in two ways:

  • Automatically, when Copilot decides a checkpoint warrants a second opinion: after planning, after complex implementations, or after writing tests.
  • On demand, whenever you ask. Just tell Copilot to critique its work, and it will invoke Rubber Duck, incorporate the feedback, and show you exactly what changed.

Where Rubber Duck helps most:

  • Complex refactors and architectural changes
  • High-stakes tasks where a miss is costly
  • Ensuring comprehensive test coverage
  • Any time you want a second opinion on a plan before committing to it

Rubber Duck in GitHub Copilot CLI is now available in experimental mode. Share your feedback with us in the discussion.

Written by Nick McKenna (Applied Researcher III) and Bartek Perz (Principal Applied Science Manager)
