Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration
https://www.facebook.com/MarkTechPost/ · 2026-06-21 · via MarkTechPost

Getting prompts right is still the hardest part of shipping reliable LLM applications. Small wording changes can swing accuracy by 20 percent. What works on a few examples often breaks at scale. When a multi-step pipeline returns a wrong answer, finding the failing step means inspecting intermediate outputs by hand.

Cisco AI introduced FAPO to address that bottleneck. FAPO stands for Fully Automated Prompt Optimization. It is a Claude Code-driven system that optimizes LLM pipelines from baseline prompts to target accuracy. You supply a dataset and an initial prompt. FAPO then evaluates, classifies failures, proposes variants, validates them, and iterates. The whole loop is orchestrated by Claude Code agents. The project ships open source under Apache 2.0, and also supports Codex as the optimization agent.

In Cisco’s reported evaluation, FAPO beat GEPA, a state-of-the-art prompt optimizer, on 15 of 18 model-benchmark comparisons. On the two benchmarks where FAPO escalated to pipeline changes, the mean gain over GEPA reached +33.8pp.

TL;DR

  • FAPO is a Claude Code-driven system that autonomously optimizes multi-step LLM pipelines from baseline prompts to target accuracy, open source under Apache 2.0.
  • It escalates through three levels — prompt, parameter, then chain structure — using step-level failure attribution to decide what to change next.
  • In Cisco’s evaluation, FAPO beat GEPA on 15 of 18 model-benchmark comparisons, with a +14.1pp mean gain.
  • On HoVer and IFBench, where it escalated to pipeline changes, FAPO won all six pairs at a +33.8pp mean gain; AIME was GEPA’s only win, within sampling noise.
  • Guardrails against overfitting include training-split-only inspection, immutable variant files, and an independent reviewer on every proposal.

What is FAPO

FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project. Each tenant directory holds one task’s prompts, dataset, chain definition, scorer, and config. Tenants stay isolated, so unrelated tasks optimize side by side without interference.

The core engine is named hephaestus and is domain-agnostic. It handles evaluation, chain execution, and scoring. Chains are LangGraph state graphs that process each test case. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.

The one input you must bring is a dataset: paired inputs and expected outputs that define success. FAPO splits it into a validation set and a held-out test set. The validation set drives iteration; the test set is used only for a final one-shot evaluation. From a task description, Claude can scaffold the rest: the initial prompt, the chain, and the scorer.
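The article does not say how FAPO assigns cases to each split. As a rough sketch of one common approach, the snippet below hashes each case_id so the assignment is deterministic across runs. The function name split_cases and the 20% test fraction are assumptions made for illustration, not part of FAPO.

```python
import hashlib
from typing import List, Tuple


def split_cases(case_ids: List[str], test_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Assign each case to validation or held-out test by hashing its id.

    Hypothetical helper: hashing keeps the split stable even if the
    dataset is reordered or new cases are appended later.
    """
    validation, test = [], []
    for case_id in case_ids:
        digest = hashlib.sha256(case_id.encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else validation).append(case_id)
    return validation, test


ids = [str(i) for i in range(100)]
validation_ids, test_ids = split_cases(ids)
```

Because the assignment depends only on the id, rerunning the split never moves a case from the test set into the set used for iteration.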

How the Optimization Loop Works

Once the pieces exist, FAPO runs a closed loop until target accuracy is reached. Each cycle runs six stages:

  1. Evaluate — run the chain on the dataset, collect per-case scores and step-level outputs.
  2. Attribute — classify failures by root cause using rule-based heuristics plus LLM analysis.
  3. Propose — generate a variant targeting the dominant failure cluster.
  4. Review — an independent agent validates the proposal for scope compliance and data leakage.
  5. Compare — accept the variant only if it improves on the previous best, otherwise reject.
  6. Iterate — continue until target accuracy is reached or the optimization budget is exhausted.
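The six stages above can be sketched as a plain Python loop. This is a minimal illustration under stated assumptions, not FAPO's code: evaluate, propose, and review are hypothetical callables supplied by the caller, a variant is just a prompt string, and the toy functions at the bottom stand in for real model calls.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical shape: (accuracy, one failure label per failed case).
EvalResult = Tuple[float, List[str]]


def optimize(
    baseline: str,
    evaluate: Callable[[str], EvalResult],
    propose: Callable[[str, str], str],
    review: Callable[[str], bool],
    target: float,
    budget: int,
) -> Tuple[str, float, List[str]]:
    best = baseline
    best_score, failures = evaluate(best)                  # 1. Evaluate
    history: List[str] = []
    for _ in range(budget):                                # 6. Iterate within a budget
        if best_score >= target or not failures:
            break
        dominant = Counter(failures).most_common(1)[0][0]  # 2. Attribute
        candidate = propose(best, dominant)                # 3. Propose
        if not review(candidate):                          # 4. Review
            history.append(f"rejected by reviewer ({dominant})")
            continue
        score, candidate_failures = evaluate(candidate)
        if score > best_score:                             # 5. Compare
            best, best_score, failures = candidate, score, candidate_failures
            history.append(f"accepted ({dominant}): {score:.1f}")
        else:
            history.append(f"rejected, no gain ({dominant}): {score:.1f}")
    return best, best_score, history


# Toy stand-ins so the loop can run without any model calls.
FIXES = {
    "format": " Answer only with the final value.",
    "reasoning": " Think step by step.",
}


def toy_evaluate(prompt: str) -> EvalResult:
    failures: List[str] = []
    if FIXES["format"] not in prompt:
        failures += ["format", "format"]
    if FIXES["reasoning"] not in prompt:
        failures += ["reasoning"]
    return 100.0 - 20.0 * len(failures), failures


def toy_propose(prompt: str, label: str) -> str:
    return prompt + FIXES[label]


def toy_review(candidate: str) -> bool:
    # Crude leakage check: a variant must not quote expected answers.
    return "expected answer" not in candidate.lower()


best_prompt, best_score, history = optimize(
    "Q: {question}", toy_evaluate, toy_propose, toy_review, target=90.0, budget=5
)
```

In the toy run, the format cluster is the largest, so it is addressed first; the reasoning fix follows in the second cycle.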

The system works at three escalating levels. Prompt edits are lowest cost and tried first. Parameter changes adjust config values like retrieval_k or temperature. Structural changes alter chain topology, such as adding a self-reflection node or switching to a ReAct pattern. FAPO exhausts one level before escalating to the next.
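A small sketch of the "cheapest level first" idea follows. It assumes a hypothetical score_with callable that evaluates a pipeline with a given list of applied changes; the change names and score gains are invented purely for illustration and do not come from Cisco's evaluation.

```python
from typing import Callable, Dict, List, Tuple

LEVELS = ("prompt", "parameter", "structural")  # ordered from cheapest to costliest


def escalate(
    candidates: Dict[str, List[str]],
    score_with: Callable[[List[str]], float],
    baseline_score: float,
    target: float,
) -> Tuple[float, List[str], List[str]]:
    """Try every candidate at one level before moving to the next."""
    applied: List[str] = []
    levels_used: List[str] = []
    best = baseline_score
    for level in LEVELS:
        if best >= target:
            break
        for change in candidates.get(level, []):
            score = score_with(applied + [change])
            if score > best:  # keep only changes that improve on the best so far
                applied.append(change)
                best = score
                if level not in levels_used:
                    levels_used.append(level)
            if best >= target:
                break
    return best, applied, levels_used


# Invented gains, only to make the sketch runnable.
GAINS = {
    "tighten output format": 10.0,
    "retrieval_k=8": 5.0,
    "add self-reflection node": 25.0,
}
CANDIDATES = {
    "prompt": ["tighten output format"],
    "parameter": ["retrieval_k=8"],
    "structural": ["add self-reflection node"],
}


def toy_score(changes: List[str]) -> float:
    return 40.0 + sum(GAINS[c] for c in changes)


high_target = escalate(CANDIDATES, toy_score, baseline_score=40.0, target=75.0)
low_target = escalate(CANDIDATES, toy_score, baseline_score=40.0, target=50.0)
```

With the lower target, the prompt edit alone is enough and the costlier levels are never touched.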

Step attribution sorts failures into four classes. Retrieval failures return empty or irrelevant content. Cascading failures begin when an early step produces empty output. Format failures hide the correct answer inside text the scorer cannot parse. Reasoning failures occur when good inputs still produce a wrong conclusion. Format and reasoning failures can be addressed with prompt edits. Retrieval and cascading failures call for structural changes.
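The rule-based half of this attribution might look something like the sketch below. The Trace dataclass and its field names are hypothetical, and the rules cover only the easy cases: judging whether retrieved content is irrelevant, for instance, is left to the LLM analysis the article mentions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical record of one failed case; not FAPO's actual schema."""
    retrieved: List[str]      # documents returned by the retrieval step
    intermediate: List[str]   # outputs of the steps before the final answer
    output: str               # final chain output
    expected: str             # gold answer


# Which kind of change addresses each class, per the description above.
FIX_LEVEL = {
    "format": "prompt",
    "reasoning": "prompt",
    "retrieval": "structural",
    "cascade": "structural",
}


def classify(trace: Trace) -> str:
    """Label an already-failed case with one of the four classes."""
    expected = trace.expected.strip().lower()
    output = trace.output.strip().lower()
    if not any(doc.strip() for doc in trace.retrieved):
        return "retrieval"    # nothing usable came back
    if any(not step.strip() for step in trace.intermediate):
        return "cascade"      # an early step produced empty output
    if expected and expected in output:
        return "format"       # right answer, buried in unparseable text
    return "reasoning"        # good inputs, wrong conclusion


labels = [
    classify(Trace([], ["facts"], "", "Paris")),
    classify(Trace(["doc"], ["", "facts"], "unknown", "Paris")),
    classify(Trace(["doc"], ["facts"], "The answer is Paris.", "Paris")),
    classify(Trace(["doc"], ["facts"], "Lyon", "Paris")),
]
```

The order of the checks matters: upstream problems are tested first, so a case with empty retrieval is not mislabeled as a reasoning failure.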

Guardrails keep the optimizer from overfitting. It inspects only training-split cases, while validation and test expose aggregate scores only. Every variant is a new immutable file, never edited in place. An independent reviewer checks each proposal before it runs.
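One way to get the "new immutable file, never edited in place" property is exclusive file creation. The sketch below is an assumption about how such a guardrail could be built, not FAPO's implementation; the write_variant helper and the variant_NNN.txt naming scheme are made up for the example.

```python
import tempfile
from pathlib import Path


def write_variant(directory: Path, text: str) -> Path:
    """Write a prompt variant to a fresh, numbered file.

    Mode "x" makes open() raise FileExistsError instead of overwriting,
    so an existing variant can never be silently replaced.
    """
    directory.mkdir(parents=True, exist_ok=True)
    index = len(list(directory.glob("variant_*.txt"))) + 1
    path = directory / f"variant_{index:03d}.txt"
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(text)
    return path


with tempfile.TemporaryDirectory() as tmp:
    prompts = Path(tmp) / "prompts"
    first = write_variant(prompts, "Answer the question.")
    second = write_variant(prompts, "Answer the question. Think step by step.")
    names = [first.name, second.name]
    first_text = first.read_text(encoding="utf-8")
    try:
        open(first, "x", encoding="utf-8")
        overwrite_blocked = False
    except FileExistsError:
        overwrite_blocked = True
```

Keeping every variant on disk is also what makes a run auditable afterward: each accepted or rejected proposal can be diffed against its predecessor.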

The Benchmark Case: FAPO vs. GEPA

The Cisco team evaluated FAPO against GEPA (Genetic-Pareto), a state-of-the-art prompt optimization method. GEPA uses evolutionary search with genetic operators to optimize prompts for multi-step pipelines. Both systems started from identical baseline pipelines and prompts. FAPO could escalate to structural changes when attribution found bottlenecks. GEPA was limited to prompt-level optimization.

The comparison spanned six benchmarks and three task models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B. Claude Opus 4.6 served as both FAPO’s orchestrator and GEPA’s reflector. Scores below are averaged across the three task models.

Benchmark        Baseline  GEPA  FAPO  Gain vs. GEPA
HoVer            35.9      48.5  83.8  +35.3pp
IFBench          35.7      48.5  80.7  +32.2pp
LiveBench-Math   51.0      52.6  62.0  +9.4pp
HotpotQA         50.9      61.8  68.3  +6.5pp
Papillon         73.6      90.7  94.9  +4.2pp
AIME             16.7      16.0  12.9  -3.1pp

FAPO won 15 of 18 model-benchmark comparisons, with a mean gain of +14.1pp over GEPA. On HoVer and IFBench, where FAPO escalated to pipeline changes, it won all six model-benchmark pairs. The mean gain there was +33.8pp. On the four benchmarks without structural changes, FAPO still won 9 of 12 through prompt optimization alone. AIME was the only benchmark where GEPA led, by 3.1pp. The gap is smaller than the standard deviation across stochastic trials.
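The headline means can be checked against the averaged table with a few lines of arithmetic. The sketch below uses Decimal to avoid floating-point rounding surprises. Cisco may have computed its means from per-model scores rather than from these averaged rows, so treat this as a consistency check only.

```python
from decimal import Decimal, ROUND_HALF_UP

# (GEPA, FAPO) scores copied from the table, averaged over three task models.
TABLE = {
    "HoVer": ("48.5", "83.8"),
    "IFBench": ("48.5", "80.7"),
    "LiveBench-Math": ("52.6", "62.0"),
    "HotpotQA": ("61.8", "68.3"),
    "Papillon": ("90.7", "94.9"),
    "AIME": ("16.0", "12.9"),
}


def one_decimal(value: Decimal) -> Decimal:
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Gain in percentage points: FAPO minus GEPA.
gains = {name: Decimal(fapo) - Decimal(gepa) for name, (gepa, fapo) in TABLE.items()}

mean_all = one_decimal(sum(gains.values()) / len(gains))
mean_escalated = one_decimal((gains["HoVer"] + gains["IFBench"]) / 2)

print(f"mean gain, all six benchmarks: +{mean_all}pp")
print(f"mean gain, HoVer and IFBench:  +{mean_escalated}pp")
```

The six gains sum to 84.5pp, which gives the +14.1pp mean; HoVer and IFBench average 33.75pp, which rounds to the reported +33.8pp.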

A capability comparison shows the design difference reported by Cisco. Every row below reflects the source description of the two systems.

Capability                              GEPA                                        FAPO
Optimization levels                     Prompt text only                            Prompt → parameter → structural
Can change chain structure              No                                          Yes, when attribution finds bottlenecks
How it is driven                        Evolutionary search with genetic operators  Claude Code or Codex agent loop
Result across 18 model-benchmark pairs  Reference                                   Wins 15 of 18; +14.1pp mean

Where It Fits: Use Cases

FAPO targets multi-step LLM pipelines, not single prompts. A few concrete examples:

  • Multi-hop question answering: A chain retrieves documents, extracts facts, reasons over evidence, and formats an answer. In Cisco’s documented walkthrough, a multi-hop QA chain rose from 39.3% to 70.3% validation exact match across two iterations. Attribution then flagged the remaining failures as retrieval-limited, signaling a structural fix. Separately, on the HotpotQA benchmark, FAPO reached 68.3% test accuracy versus GEPA’s 61.8%.
  • Instruction following: On IFBench, format-constraint failures pushed FAPO to escalate beyond prompts, reaching 80.7% test accuracy.
  • Classification: A software-name-to-category task can be scaffolded by Claude Code, then optimized to exact-match targets.
  • ReAct agents: An MCP workflow extension optimizes a tool-calling ReAct agent using trajectory scoring and LLM-as-Judge scoring.

Getting Started

The fastest path is to let Claude Code create the tenant files. From the repo, describe your task in plain English, then add a JSONL dataset. Each line is one test case with case_id, task_type, context, expected, and metadata:

{"case_id": "1", "task_type": "qa", "context": {"question": "What is the capital of France?"}, "expected": {"answer": "Paris"}, "metadata": {}}
{"case_id": "2", "task_type": "qa", "context": {"question": "What is 2 + 2?"}, "expected": {"answer": "4"}, "metadata": {}}

A scorer compares the chain output to the expected answer. It implements validate_case to catch bad data early and score_case to return a composite score:

from hephaestus.scoring.scorer import Scorer as BaseScorer

class Scorer(BaseScorer):
    def validate_case(self, case, scoring_profile):
        # Reject cases that lack an expected answer.
        assert "answer" in case.expected, "Missing 'answer' in expected"

    def score_case(self, case, output_text, scoring_profile):
        # Case-insensitive exact match after trimming whitespace.
        expected = case.expected["answer"].strip().lower()
        predicted = output_text.strip().lower()
        em = 100.0 if predicted == expected else 0.0
        # composite_score is the headline metric; score_breakdown lists its parts.
        return {"composite_score": em, "score_breakdown": {"exact_match": em}}

Verify the setup with a baseline evaluation:

export OPENAI_API_KEY="sk-..."
python -m hephaestus.cli eval --config tenants/my_project/configs/eval.json

Then invoke the optimization agent with a tenant, config, and success criteria such as composite_score >= 90. Claude Code produces a scope contract, then iterates autonomously. Every prompt variant, config, and per-variant analysis is written to disk, so each run stays auditable. A local read-only UI called FAPO Explorer browses the artifacts afterward.

Strengths and Weaknesses

Strengths

  • Pipeline-aware scoring attributes failures to the step that caused them, not just the final output.
  • Three-level escalation handles failures that prompts alone cannot fix.
  • Guardrails against overfitting: training-split-only inspection, immutable variants, and an independent reviewer.
  • Open source under Apache 2.0, with both Claude Code and Codex supported.

Weaknesses

  • Optimization quality is bounded by the dataset’s quality and coverage, which you must supply.
  • The project is recent, so independent production track records are still limited.
  • The default loop depends on agentic coding tools (Claude Code or Codex) rather than a standalone optimizer.
