惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Google DeepMind News
Google DeepMind News
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
Security Latest
Security Latest
P
Palo Alto Networks Blog
AWS News Blog
AWS News Blog
NISL@THU
NISL@THU
T
Threatpost
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Latest news
Latest news
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
WordPress大学
WordPress大学
J
Java Code Geeks
P
Privacy International News Feed
阮一峰的网络日志
阮一峰的网络日志
S
Schneier on Security
博客园 - 聂微东
Project Zero
Project Zero
美团技术团队
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Scott Helme
Scott Helme
I
Intezer
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
H
Hacker News: Front Page
S
Security @ Cisco Blogs
博客园 - 司徒正美
O
OpenAI News
Last Week in AI
Last Week in AI
L
LINUX DO - 热门话题
酷 壳 – CoolShell
酷 壳 – CoolShell
SecWiki News
SecWiki News
月光博客
月光博客
S
Security Affairs
The GitHub Blog
The GitHub Blog
P
Privacy & Cybersecurity Law Blog
S
Secure Thoughts
V
V2EX
S
Securelist
F
Fortinet All Blogs
W
WeLiveSecurity
D
Docker
博客园 - 三生石上(FineUI控件)
Simon Willison's Weblog
Simon Willison's Weblog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
C
Cyber Attacks, Cyber Crime and Cyber Security
V
Visual Studio Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
Webroot Blog
Webroot Blog
Engineering at Meta
Engineering at Meta

Show HN

CSP Radar GitHub - awebai/aweb-team-coord-worktrees: An aweb team template for a minimum team with a permanent coordinator and worktrees with local developers. GitHub - fujibee/agmsg GitHub - lucastononro/notify: 100% local, free, offline attention skill for Claude Code: plays a sound and speaks a short status update when a long task finishes, blocks, or needs a decision. GitHub - sebastianwessel/skills: AI Skills tivatdoar / workout-to-work · GitLab GitHub - enumura1/py-sql-cleaner: Find, format, and safely extract embedded SQL from Python files. GitHub - intent-bench/intent-bench: Intent fulfillment benchmark for agentic AI engineering GitHub - steveking-gh/firmion: Firmion is DSL and engine for firmware image generation. GitHub - villagesql/villagesql-skills: Agent skills for VillageSQL - gemini-cli-extension; claude-code-plugin GitHub - 0gsd/enough: a personal language system for planning, writing, and translation. GitHub - Kaelio/ktx: ktx is an executable context layer for data and analytics agents 🐙 Allow Claude Code, Codex, and any AI agent to query data accurately through MCP with skills, memory and a semantic layer GitHub - ThatXliner/xtras: Xliner's Claude Code Skills GitHub - flightdeckhq/flightdeck: Observability and control plane for AI agents. GitHub - search-router/simple-search: Open-source reference app on top of the Search Router API: FastAPI + Jinja metasearch service with pluggable backends, deterministic mocks (no API key needed), RTL UI, Redis cache, and a demo ads cabinet. CSP Radar GitHub - Light-Heart-Labs/DreamServer: Turn your PC, Mac, or Linux box into an AI server. LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. GitHub - Diplomat-ai/diplomat-agent-ts: What can your TypeScript AI agent do to the real world? Scan your code. See which tool calls have zero checks Code Block Selector - Visual Studio Marketplace Prometheus dependency graph — interactive showcase | Riftmap Show HN: I made a vi-like modal keyboard plugin for Figma GitHub - run-llama/liteparse: A fast, helpful, and open-source document parser GitHub - dalemyers/Roar: A macOS CLI tool for notifications GitHub - district-solutions/open-agent-tools-coder: Enables small-to-large self-hosted ai models to use local source code when running tool-calling agentic workloads. We actively data mine 20,900+ (2+ TB) popular github repos using large and small ai models to create reuseable: json, markdown and parquet files for local-first tool-calling models. GitHub - progapandist/stripeek: A local TUI proxy for real-time Stripe API debugging, built for navigating complex payloads fast. GitHub - sir1st/hermes-desktop: All-in-one cross-platform desktop app for Hermes Agent — bundles Python + hermes-agent + hermes-web-ui GitHub - astefanutti/shaderbang: Shebang for Shaders Show HN: Generate Claude Code Workflows using Spec Driven Development approach GitHub - nixys/nxs-universal-chart: The Helm chart you can use to install any of your applications into Kubernetes/OpenShift Show HN: AI agents for UK GDAD PCF roles and their skills The Two Pillars: Mixer Mode and Meta-Software in the Reorganization of Software Work After AI GitHub - JaiCode08/teleport-env What 1,000+ Harness Experiments Taught Me About Self-Improving Agents Show HN: Liiists, a Markdown-first, iOS and CLI list app SwiperTab – Get this Extension for 🦊 Firefox (en-US) GitHub - kouhxp/fftext: Summarize, explain, fact-check, or translate any text, URL, or file. No GPU. No cloud. One command GitHub - sweetpad-dev/sweetpad: Develop Swift/iOS projects using VSCode GitHub - dogmaticdev/IRON: IRON a.k.a. Intermediate Representation Object Notation is a Interpreter/Database that is used to create Programming Languages. GitHub - sjhalani7/vaen: Package your AI coding harness into a portable .agent file, and share it across repos, teams, & the community without ever having to copy-paste instructions, skills, MCP config, or secrets. Show HN: Gandalf the Grader Show HN: Citadeld – replay any CI failure locally from a single file GitHub - tdortman/cuSBF: High-Performance GPU Super Bloom Filter coral-ai/claude-code-token-xray at main · Coral-Bricks-AI/coral-ai GitHub - ulyssestenn/funes: Funes is a Git-based framework for LLM-managed knowledge work: an AI Librarian ingests raw sources, builds an interlinked Markdown knowledge base, and uses it to produce cited reports, analyses, and other outputs. GitHub - ThatXliner/gah: Git Add Hunk, built for agents to use GitHub - harmont-dev/harmont-cli: Command-line client for the Harmont CI platform GitHub - brooksmcmillin/mcp-authflow: OAuth 2.0 Authorization Server framework for MCP servers GitHub - javaid-codes/audit-supply-chain-agents GitHub - amorey/gochan: A small library of common channel architectures for Go, inspired by Rust GitHub - arifozgun/OpenGem: Free, Open-Source AI API Gateway with Gemini, OpenAI & Anthropic Compatibility in 1 file GitHub - Pranesh950/BioPetals: 🌸 Run BIOxAI models at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading GitHub - cnguyen14/bounty-doctor: Diagnose a GitHub bounty issue before you waste hours: detects honeypot scam repos, AI-bot attempt swarms, and stale contests. Show HN: CoreMCP – MCP Server for On-Prem DBs Show HN: KittyHTML – Render HTML/CSS as an inline image in your terminal GitHub - bingud/filemat: Web-based file manager Show HN: TruthLens – Free multi-signal deepfake image detector GitHub - apexlocal-jz/claude-usage-tray: Windows system-tray app showing your Claude Code rate-limit usage at a glance. Zero deps, ~300 lines of PowerShell. Cross-IDE (works regardless of VS Code, Cursor, plain terminal). Release v0.1.2.1 · kouhxp/yapsnap GitHub - noopolis/moltnet: Self-hostable chat network for AI agents. Pre-built bridges for Claude Code, Codex, and the Claws. Rooms, DMs, history. No Slack bots, no Matrix, no glue code. GitHub - tamerh/enju: Coordinating Humans, AI Agents, and Compute as Peers on a Shared Workflow Graph Show HN: Continuity-auth – Respect-weighted rate limits for the open web GitHub - luml-ai/luml: AI lifecycle platform where engineers and agents track experiments, train models, and ship to production. GitHub - mrdanielcasper/CoreTex: A UNIX-inspired, biomimetic, flat-file AI harness and knowledge engine. GitHub - clemg/pierre-github: Pierre's diffs.com and trees.software for Github GitHub - lyriks-io/unspaghettit: Behavior-driven AI development without prompt spaghetti. GitHub - sofumel/claude-handoff-revive: Resume Claude Code work after rate/usage/context limits without replaying the prior transcript. Auto-saves at 90%/95% usage. Plugin-installable, 10 languages. GitHub - dotexorg/saferpc: Typed, end-to-end encrypted RPC over any bidirectional channel. GitHub - BeeZeeAgent/beezee: Agent harness orchestration Legato Next.js Boilerplate for Internal Tools · CoreUI GitHub - clark-labs-inc/clark-hash: Clark Hash, 32x smaller searchable sketches for embeddings GitHub - ZeroPointRepo/youtube-mcp: The fastest YouTube transcript + YouTube search MCP for AI agents. Try for free. Typing Mastery — climb toward 100+ WPM, deliberately GitHub - Andebugulin/Awareen GitHub - fayzan123/claude-workflow-composer: Visual desktop app for composing multi-agent coding workflows. Drag agents, attach skills and MCPs, wire handoffs, export to .claude/ GitHub - StackOneHQ/stack-nudge We hardened an LLM agent. Each defense we added made it more exploitable. GitHub - alkait/WhatsKept: Agent-queryable WhatsApp history from an iOS backup — a single Go binary. GitHub - octelium/cordium: Open-source, general-purpose sandbox platform for devs and AI agents that provides identity-based secure access to infrastructure without credentials. GitHub - scosman/videowright: Build animated explainer videos with your coding agent GitHub - dipankar/dscode: The code editor you can take apart. GitHub - zoharbabin/web-researcher-mcp: MCP server (Go) for AI assistants: web search, content extraction, academic/patent/news research. Multi-provider routing, 4-tier scraping, search lenses. Works with Claude, Cursor, and any MCP client. GitHub - scanaislop/aislop: Catch the slop AI coding agents leave in your code: narrative comments, swallowed exceptions, as-any casts, dead code, oversized functions. 50+ rules across 7 languages (TypeScript, JavaScript, Python, Go, Rust, Ruby, PHP). Sub-second, deterministic, no LLM at runtime. MIT-licensed. GitHub - kouhxp/cheap-im: CPU-only voice agent approximating Thinking Machines' Interaction Models demo GitHub - unprovable/OrchidMantis: Orchid Mantis — standalone framework for Zero-Knowledge Proofs of eXploit (ZKPoX). GitHub - TangibleResearch/Halgorithem: A Algo designed to detect AI Hallucitions GitHub - CarpseDeam/Aura-IDE: An AI coding harness that shaped itself - Planner/Worker agents, repo awareness, surgical edits, validation, recovery, and safe diff approvals. GitHub - chojs23/concord: A feature-rich TUI client for Discord GitHub - aerf-spec/aerf: Agent Evidence Receipt Format (AERF) — an open specification for tamper-evident, independently verifiable records of AI agent actions. GitHub - Jwrede/tokentoll: Catch LLM cost changes in code review. Infracost for LLM spend. GitHub - samchon/ttsc: A `typescript-go` toolchain for compiler-powered plugins and type-safe execution + 500x faster lint integrated into compiler GitHub - Higangssh/homebutler: 🏠 Manage your homelab from chat. Single binary, zero dependencies. GitHub - olalie/tapmap: See where your computer connects and what stands out on a live world map. GitHub - Diplomat-ai/diplomat-agent: What can your AI agent do to the real world? Scan your code. See which tool calls have zero checks GitHub - Bajusz15/beacon: Open-source agent for secure remote access, monitoring, and deploys across home-lab and self-hosted machines like Raspberry Pi, N100, or any Linux server. Open web based TTY or tunnel Home Assistant and other local services securely without opening ports. BigTech AI News - Chrome 应用商店 GitHub - vinhnx/VTCode: VT Code is an open-source coding agent with LLM-native code understanding and robust shell safety. Supports multiple LLM providers with automatic failover and efficient context management. GitHub - Lumen-Labs/brainapi2: BrainAPI is a knowledge graph–powered AI memory layer that transforms unstructured data into structured knowledge, enabling intelligent search, recommendations, and contextual memory for AI agents and applications. GitHub - familiar-software/familiar: Let AI watch you work. Familiar lets your AI update its memory, skills, and knowledge by watching your screen. make sidebar/address bar rounded corner toggleable
GitHub - sergey-automation/TurboPrefill-VLM-Validation: Validation of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill using VLM workloads
trykhlieb · 2026-06-21 · via Show HN

Validation of the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models (VLMs).

TurboPrefill cut the waiting time before answer generation nearly in half: from 9.0 s to 4.6 s.

Example Input

FHD giraffe sample

Validation Task

Question:

What is happening in this image? Describe the animals, their approximate number, activity, environment, and colors. Which animal appears to be the leader of the group, and what five visual clues made you reach that conclusion? Use no more than 50 words.

Example answer:

Eight giraffes are walking across a grassy wetland near a river. The animals are light brown with darker patches. The leading giraffe appears to guide the group. Clues: front position, direction of movement, spacing, head orientation, and group alignment.

Key Result

Validation on Vision Language Models demonstrates that Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill can significantly reduce user waiting time before answer generation without changing model weights, architecture, quantization, prompts, or inference mathematics.

The observed improvement was achieved solely through changes in execution scheduling during the prefill stage.

Test Configuration

Parameter Value
Model Qwen2.5-VL-72B-Instruct-Q4_K_M
Task Vision-language question answering
Input Single Full HD image (1920×1080)
GPUs 4× RTX 5060 Ti 16 GB
UBatch size 128
Split mode Layer

Result

Metric Baseline TurboPrefill
Waiting time before the response started 9.0 s 4.6 s
Prefill throughput 303 tok/s 604 tok/s
Generation throughput 8.6 tok/s 8.6 tok/s

TurboPrefill nearly halved the waiting time before the model started responding, while leaving answer generation speed unchanged.

Future Relevance

Validation on NVIDIA Pascal GPUs also demonstrated an approximately 2.2× reduction in prefill latency, suggesting that this optimization opportunity is not tied to a particular class of hardware and will likely remain relevant for future GPU generations.

Background

The original scheduling mechanism was proposed in:

[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill

ggml-org/llama.cpp#24219

The original proof-of-concept implementation is available at:

https://github.com/sergey-automation/TurboPrefill

Purpose

This repository validates the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models.

The objective is not to introduce a new scheduling mechanism, but to demonstrate that the original mechanism is applicable beyond text-only LLM workloads.

Validation Implementation

Reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

Scope of the Current Implementation

The original TurboPrefill PoC intentionally used a conservative dispatcher and left some eligible workloads on the standard llama.cpp execution path.

The current validation implementation enables additional workloads that are still within the original concept of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, but were not enabled in the first PoC.

Additional workloads currently enabled for the TurboPrefill execution path:

  • Execution of Text LLM workloads.
  • Execution of Vision Language Model (VLM) workloads.
  • Execution of multiple concurrent requests in multi-user server mode, provided that requests from different users are not mixed within the same TurboPrefill batch.

Status

Work in progress.

Implementation files, scripts, input samples, and benchmark logs are published in this repository.

Experimental work in progress.

The reported results are based on the current prototype implementation. Text-model validation has been completed successfully. VLM support is still under active investigation, and additional correctness validation is required before drawing final conclusions.

Repository Structure

  • files/ — modified llama.cpp source files used for the validation branch.
  • scripts/ — scripts used to run the VLM server and resolution tests.
  • resolution_samples/ — input images used for validation.
  • benchmarks/ — raw benchmark reports and server logs.

Reproducing the Validation

1. Obtain the reference implementation

The validation was performed using the following reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

git clone https://github.com/sergey-automation/llama.cpp.git
cd llama.cpp
git checkout turboprefill-vlm-support

2. Build llama.cpp

Build the reference implementation.

2.1 Download the Qwen2.5-VL-72B GGUF model

The validation uses the following model files:

  • Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf
  • mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf

Create the expected model directory:

mkdir -p /workspace/models/Qwen2.5-VL-72B
cd /workspace/models/Qwen2.5-VL-72B

Download the main model:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf"

Download the multimodal projector:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf"

Check the files:

ls -lh /workspace/models/Qwen2.5-VL-72B

3. Baseline measurement

Start the VLM server with TurboPrefill disabled:

TURBOPREFILL=0 ./run_vlm_server.sh

Run the benchmark:

python3 run_vlm_resolution.py

4. TurboPrefill measurement

Start the VLM server with TurboPrefill enabled:

TURBOPREFILL=1 ./run_vlm_server.sh

Run the benchmark:

python3 run_vlm_resolution.py

5. Compare results

Input images:

Reference benchmark reports and logs:

Compare generated reports against the published benchmark logs included in this repository.

Attribution

If this work is useful for future implementations of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, please cite the original RFC proposal:

ggml-org/llama.cpp#24219