Real-Time Video Classification with PaliGemma: Architecture Patterns for Low-Latency VLM Inference
Pasquale Mol · 2026-05-24 · via DEV Community

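In a previous article, we benchmarked three open-source vision-language models (VLMs) on zero-shot object detection and reached an uncomfortable conclusion: even the fastest of them, Phi-3.5-vision-instruct, needed 4.45 seconds per frame on an NVIDIA L4, and LLaVA-v1.6 needed 8.13 seconds. For any application that has to process a video stream in real time, those numbers fall short. But the claim that VLMs are fundamentally incompatible with real-time work deserves closer scrutiny. The 8-second figure was measured on a general-purpose zero-shot detection task, where the model had to reason about arbitrary objects in arbitrary scenes. What happens if we narrow the problem? What if we give the model a closed vocabulary, a fixed frame resolution, a deterministic decoding strategy, and a non-blocking inference pipeline?
This article answers that question. Using PaliGemma (Google's compact vision-language model), we built a real-time video classification system that runs at roughly 0.8 to 1.2 seconds per frame on an NVIDIA RTX A4500. Compared with LLaVA on comparable professional-grade hardware, that is roughly a 6-8x improvement, and it comes entirely from architectural decisions rather than a hardware upgrade. The four patterns below are what make it possible.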

Why PaliGemma Specifically

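Before getting into the architecture, the model choice deserves an explanation, because PaliGemma's practical value is underrepresented in developer-facing writing. PaliGemma is a 3-billion-parameter vision-language model built by Google that pairs a SigLIP vision encoder with a Gemma language backbone. It is less than half the size of LLaVA-7B and smaller than Phi-3.5-vision, which translates directly into lower VRAM usage and faster inference on the same hardware. More important for classification work, it was explicitly fine-tuned on a broad range of visual understanding benchmarks, such as image captioning, visual question answering, and object localization, so it has strong priors for the kind of constrained, structured responses we need.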

import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

MODEL_PATH = "./paligemma_offline"

model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    local_files_only=True
).eval()

processor = PaliGemmaProcessor.from_pretrained(
    MODEL_PATH,
    local_files_only=True
)

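Two things are worth noting here. Loading in bfloat16 halves the VRAM footprint relative to float32, with negligible accuracy loss on classification tasks. The local_files_only=True flag is not just a convenience for offline environments: in production it removes network round-trips during initialization and keeps the inference environment fully reproducible.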
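As a back-of-the-envelope check on the memory claim, the sketch below estimates the space taken by the weights alone at each precision. The 3-billion figure is used as a round number, and the estimate deliberately ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate. Assumption: parameter count rounded to 3e9;
# activations, KV cache, and framework overhead are NOT included.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

NUM_PARAMS = 3_000_000_000

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the weights need roughly 11 GiB in float32 and under 6 GiB in bfloat16, which is the difference between crowding a 20GB card and leaving comfortable headroom.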

Pattern 1: Resolution as a Latency Knob

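In a real-time VLM pipeline, the single most impactful variable is input resolution. A VLM processes an image by splitting it into patches, each of which is encoded as a token in the input sequence. A 1280×720 frame yields a far longer token sequence than a frame resized to 448×448, and because transformer self-attention scales quadratically with sequence length, the cost of resolution is not linear: it grows with the square of the token count. For zero-shot object detection, where spatial precision matters, downsampling is a real trade-off. But for a scene-level classification task, which asks "what is the dominant state in this frame?" rather than "give me pixel coordinates for every object", 448×448 preserves more than enough semantic information.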

import cv2
import numpy as np
from PIL import Image

def preprocess_frame(frame_bgr: np.ndarray) -> Image.Image:
    # Resize to inference resolution before any VLM processing
    frame_small = cv2.resize(frame_bgr, (448, 448))
    # Convert BGR (OpenCV) to RGB (PIL/transformers)
    return Image.fromarray(cv2.cvtColor(frame_small, cv2.COLOR_BGR2RGB))

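The key point is that resolution should be chosen based on the granularity of information the task needs, not on the resolution of the input stream. If the camera records at 1080p but the classification task only has to distinguish five states, you are paying a large compute tax for information you never use.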
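To put rough numbers on that tax, here is a minimal sketch that counts image patches and compares the attention cost of two resolutions. The patch size of 14 and the helper names are assumptions made for illustration. Many processors also resize inputs to a fixed size, so read this as an illustration of how the cost scales rather than a measurement of any particular model.

```python
def patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of non-overlapping patches a ViT-style encoder would produce.

    Assumption: square patches of `patch` pixels, partial patches dropped.
    """
    return (width // patch) * (height // patch)

def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost ratio, assuming cost grows with tokens squared."""
    return (tokens / baseline_tokens) ** 2

small = patch_tokens(448, 448)    # 32 * 32 patches
large = patch_tokens(1280, 720)   # 91 * 51 patches
ratio = relative_attention_cost(large, small)

print(f"448x448:  {small} image tokens")
print(f"1280x720: {large} image tokens")
print(f"relative attention cost: {ratio:.1f}x")
```

Under these assumptions the 720p frame carries about 4.5 times as many image tokens, but the attention cost over those tokens is about 20 times higher.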

Pattern 2: Closed-Vocabulary Deterministic Decoding

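The standard way to use a VLM is as an open-ended text generator. You prompt it, it samples from a probability distribution, you get back a natural-language response, and then you parse that response. This is the source of the fragility discussed in the previous article, and it is also a major source of latency: with sampling enabled and a high max_new_tokens, the model runs a full autoregressive step for every token it generates. For classification, you can sidestep this entirely. Instead of asking the model to describe what it sees, constrain its output to a fixed vocabulary of valid labels and cap generation at the minimum number of tokens needed to express one of them.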

# A generic set of states tailored to your specific domain
VALID_CLASSES = ['active', 'idle', 'error', 'offline', 'unknown']

def classify_frame(model: torch.nn.Module, processor, image: Image.Image) -> str:
    prompt = (
        f"<image> What is the current operational state shown in this frame? "
        f"You MUST choose ONLY ONE from: {VALID_CLASSES}."
    )

    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device, model.dtype)

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=10,   # A single class label needs at most 2-3 tokens
            do_sample=False      # Greedy decoding: deterministic, faster, no temperature needed
        )

    raw_output = processor.decode(
        output_ids[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True
    ).strip().lower()

    # Sanitize to alphabetic characters only
    return ''.join(filter(str.isalpha, raw_output))

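Setting do_sample=False switches the model to greedy decoding, which always picks the highest-probability token at each step. This removes sampling overhead and makes the output fully deterministic: the same input always produces the same result, which matters for debugging and for the temporal smoothing described later. Setting max_new_tokens=10 caps generation so that the model stops almost immediately after emitting the label instead of continuing to produce unnecessary text.
The effect is that you are using a 3B-parameter VLM as a semantically aware classifier rather than as a generative model. You keep the zero-shot flexibility of natural-language prompting, with inference behavior close to that of a dedicated classification head.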
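A prompt-level constraint does not guarantee that the model answers with a bare label, so it can help to map whatever text comes back onto the closed vocabulary explicitly. The helper below is a hypothetical addition, not part of the pipeline above: normalize_label and its fallback parameter are names invented for this sketch.

```python
import re
from typing import Sequence

VALID_CLASSES = ["active", "idle", "error", "offline", "unknown"]

def normalize_label(
    raw_output: str,
    valid: Sequence[str] = VALID_CLASSES,
    fallback: str = "unknown",
) -> str:
    """Map free-form model text onto exactly one valid label.

    Hypothetical helper: returns the first word in the output that is a
    valid label, or `fallback` when none is found.
    """
    words = re.findall(r"[a-z]+", raw_output.lower())
    for word in words:
        if word in valid:
            return word
    return fallback

print(normalize_label("Active."))            # active
print(normalize_label("The state is idle"))  # idle
print(normalize_label("running"))            # unknown
```

Every value that reaches downstream code is then guaranteed to be a member of VALID_CLASSES, which keeps the majority vote in the next pattern from being diluted by one-off strings.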

Pattern 3: Temporal Smoothing for Stable Predictions

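Even with deterministic decoding, a vision model processing raw video will produce noisy predictions. Lighting changes, motion blur, partial occlusion, and transient visual artifacts cause the model to emit inconsistent labels across consecutive frames. If you pass each raw per-frame prediction straight to downstream systems, you get jittery, unreliable output. The fix is temporal smoothing: instead of trusting any single prediction, keep a rolling window of recent predictions and take a majority vote.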

from collections import deque, Counter

class TemporalSmoother:
    def __init__(self, window_size: int = 5):
        self.history = deque(maxlen=window_size)

    def update(self, prediction: str) -> str:
        self.history.append(prediction)
        # Return the most common prediction in the window
        return Counter(self.history).most_common(1)[0][0]

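At our inference rate, a window of 5 covers roughly 4 to 6 seconds of history. That is enough to absorb momentary noise while still responding to real state changes. Window size is a tuning parameter: larger windows are more stable but slower to react, while smaller windows are more responsive but noisier. For most classification tasks, a window of 3 to 7 works well in practice.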
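The stability versus responsiveness trade-off is easy to see on a made-up sequence. The sketch below applies the same rolling majority vote to a hand-written list of predictions containing one glitch and one real state change. The sequence is invented for illustration and smooth_sequence is a hypothetical helper.

```python
from collections import Counter, deque
from typing import List

def smooth_sequence(predictions: List[str], window_size: int = 5) -> List[str]:
    """Apply a rolling majority vote and return the smoothed label per step."""
    history = deque(maxlen=window_size)
    smoothed = []
    for prediction in predictions:
        history.append(prediction)
        smoothed.append(Counter(history).most_common(1)[0][0])
    return smoothed

# Invented data: a one-frame "error" glitch, then a real switch to "active".
raw = ["idle", "idle", "error", "idle", "idle",
       "idle", "idle", "active", "active", "active"]
smoothed = smooth_sequence(raw, window_size=5)

for step, (r, s) in enumerate(zip(raw, smoothed)):
    print(f"{step}: raw={r:<7} smoothed={s}")
```

In this trace the isolated "error" never reaches the output, while the real change to "active" at step 7 only shows up at step 9. That two-frame lag is the price of the stability, and it grows with the window size.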

Pattern 4: Non-Blocking Inference with a Decoupled Shared-State Architecture

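The first three patterns optimize the inference call itself. This one addresses a more fundamental systems problem: every VLM inference call takes 0.8 to 1.2 seconds and blocks the thread it runs on for that long. If video capture and inference share a thread, the stream stutters along at the inference rate rather than at the camera's frame rate.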

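The naive approach is to pass frames between threads with Python's standard queue.Queue. But this creates consumer contention: if the rendering thread and the AI thread read from the same queue, they compete for frames and steal data from each other, which causes severe visual stutter and intermittently starves the inference loop. The more robust approach is an asynchronous shared-state pattern protected by a fine-grained lock. The video capture thread, acting as the producer, continuously overwrites a shared "latest frame" reference. The rendering loop (which runs on the main thread, as required for OpenCV UI operations on macOS and Wayland) and the background AI thread act as independent consumers: whenever one of them is ready for its next cycle, it copies the latest frame into local storage.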

import threading
import time
import numpy as np
import cv2
import torch
from typing import Optional, Any

class SharedState:
    """
    Thread-safe state container.
    The lock is strictly granular: it is only held for memory assignment/copying,
    never during expensive I/O or AI inference operations.
    """
    def __init__(self):
        self.latest_frame: Optional[np.ndarray] = None
        self.prediction: str = "WAITING"
        self.lock = threading.Lock()
        self.running: bool = True

shared = SharedState()

def video_capture_worker(source: int = 0) -> None:
    """Reads frames at hardware speed and updates the shared state."""
    cap = cv2.VideoCapture(source)

    while shared.running:
        ret, frame = cap.read()
        if not ret:
            time.sleep(0.01)
            continue

        with shared.lock:
            # Overwrite with the freshest data.
            # Pointer assignment is fast enough to barely hold the lock.
            shared.latest_frame = frame

    cap.release()

def inference_worker(model: torch.nn.Module, processor: Any) -> None:
    """Consumes the latest frame at the AI's maximum throughput rate."""
    smoother = TemporalSmoother(window_size=5)

    while shared.running:
        with shared.lock:
            # Deep copy to prevent OpenCV from mutating the array during inference
            frame = shared.latest_frame.copy() if shared.latest_frame is not None else None

        if frame is None:
            time.sleep(0.05)
            continue

        try:
            image = preprocess_frame(frame)
            raw_pred = classify_frame(model, processor, image)
            smoothed_pred = smoother.update(raw_pred)

            with shared.lock:
                shared.prediction = smoothed_pred

        except torch.cuda.OutOfMemoryError:
            # Handle temporary VRAM spikes gracefully without killing the thread
            with shared.lock:
                shared.prediction = "OOM_ERROR"
            time.sleep(1.0)

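        except Exception as e:
            # Catch corrupt frames or tensor mismatches. Log the cause and back
            # off briefly so a persistent failure does not spin the loop.
            print(f"Inference error: {e!r}")
            with shared.lock:
                shared.prediction = "ERROR"
            time.sleep(0.1)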

# Initialize background workers as Daemon threads
threads = [
    threading.Thread(target=video_capture_worker, args=(0,), daemon=True),
    threading.Thread(target=inference_worker, args=(model, processor), daemon=True),
]

for t in threads:
    t.start()

# Main Thread UI Loop
# UI libraries (cv2.imshow) must run on the main thread to prevent OS-level crashes.
while shared.running:
    with shared.lock:
        frame = shared.latest_frame.copy() if shared.latest_frame is not None else None
        label = shared.prediction

    if frame is not None:
        cv2.putText(frame, label.upper(), (30, 50),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 3)
        cv2.imshow("Live Classification", frame)

    # 30 FPS rendering limit (33ms) + graceful shutdown
    if cv2.waitKey(33) & 0xFF == ord('q'):
        shared.running = False

cv2.destroyAllWindows()

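The key design principle here is the fine-grained lock: the lock is acquired, the NumPy array is assigned or copied in memory (an operation orders of magnitude shorter than an inference call), and the lock is released immediately. If the lock were held across the roughly one-second VLM inference, all three loops would be serialized, defeating the purpose of the whole architecture. With this design, the capture thread runs at the hardware frame rate (for example 30 FPS), the rendering loop also displays at 30 FPS, and the inference thread runs at its own asynchronous pace (about 1 FPS). The three loops are decoupled in time, and each is limited only by its own component.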
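The same idea can be tried without a camera or a GPU. The sketch below is a toy version under stated assumptions: integers stand in for frames, time.sleep stands in for inference, and LatestValue and run_demo are hypothetical names. It shows that a slow consumer simply skips stale items instead of falling behind.

```python
import threading
import time

class LatestValue:
    """Holds only the newest item; older items are overwritten, never queued."""
    def __init__(self):
        self._value = None
        self._lock = threading.Lock()

    def set(self, value):
        with self._lock:          # held only for the assignment
            self._value = value

    def get(self):
        with self._lock:          # held only for the read
            return self._value

def run_demo(num_frames: int = 100, produce_s: float = 0.001, consume_s: float = 0.02):
    latest = LatestValue()
    done = threading.Event()
    seen = []

    def producer():
        for i in range(num_frames):
            latest.set(i)          # stand-in for a captured frame
            time.sleep(produce_s)
        done.set()

    def consumer():
        while not done.is_set():
            value = latest.get()
            if value is not None:
                seen.append(value)
            time.sleep(consume_s)  # stand-in for a slow inference call

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

seen = run_demo()
print(f"consumer handled {len(seen)} of 100 frames: {seen}")
```

The exact values printed depend on scheduling, but the consumer should handle only a small fraction of the produced items, and the ones it does handle are the most recent available at the time. That is the behavior wanted from the inference thread.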

Benchmark Summary

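Running the full pipeline on an NVIDIA RTX A4500 (20GB GDDR6, Ampere architecture), PaliGemma in bfloat16 handled three live video stream scenarios with very stable performance. With input resolution fixed at 448×448 and output capped at 10 new tokens under greedy decoding (do_sample=False), per-frame inference latency ranged from 0.8 to 1.2 seconds. Combined with a 5-frame temporal smoothing window, this configuration produced reliable state classification, while the decoupled architecture let the video capture thread hold a steady 25 FPS, completely unaffected by the inference bottleneck.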

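By comparison, LLaVA-v1.6-Mistral-7B performing open-vocabulary zero-shot detection on an NVIDIA L4 took 8.13 seconds per frame. The hardware is not directly equivalent, but the size of the gap indicates that architectural constraints, not raw compute, are the main driver of the difference.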
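For readers who want to collect comparable numbers on their own hardware, a minimal timing harness might look like the sketch below. It is not the harness used for the figures above: measure_latency is a hypothetical helper, and the sleep call is a stand-in for a real inference call such as classify_frame.

```python
import statistics
import time
from typing import Callable, Dict

def measure_latency(fn: Callable[[], object], runs: int = 20, warmup: int = 3) -> Dict[str, float]:
    """Time repeated calls to `fn` and return min/median/max in seconds.

    Warm-up calls are discarded so one-off initialization cost is excluded.
    Note: if `fn` launches GPU work and returns before it finishes, the timing
    will understate latency; synchronizing inside `fn` (for example with
    torch.cuda.synchronize()) is one way to avoid that.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }

# Stand-in workload: replace the lambda with a real inference call.
stats = measure_latency(lambda: time.sleep(0.01), runs=5, warmup=1)
print({name: f"{value * 1000:.1f} ms" for name, value in stats.items()})
```

Reporting a range rather than a single average matches how the 0.8 to 1.2 second figure is stated, and makes occasional slow frames visible.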

When This Architecture Fits

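This pattern fits when your task can be reduced to a fixed set of labels; you need continuous processing of a live stream rather than batch analysis of static images; data-privacy requirements rule out sending frames to an external API; and you can tolerate latency of around one second rather than under 100 milliseconds. It is not the right tool when you need true real-time responses at conveyor-belt speeds, where latency below 50 milliseconds is non-negotiable. In that case you are back in YOLO territory, or using the pipeline described in the previous article: let a VLM auto-label a dataset overnight, then train a dedicated lightweight classifier for production deployment.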

Conclusion

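The gap between "VLMs are too slow for video" and "VLMs can run in a production video pipeline" is not primarily a hardware problem. It is an architecture problem. Choosing a compact model like PaliGemma over a 7B-parameter alternative, limiting resolution to what the task needs, enforcing deterministic decoding over a closed vocabulary, smoothing predictions over time, and decoupling inference from capture and rendering: none of these require a bigger GPU. What they require is careful thought about what the model actually needs to do, and a pipeline built around those constraints rather than against them.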

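The entire pipeline, from model loading to multithreaded inference, fits in about 150 lines of Python. That is a reasonable price for zero-shot semantic classification of a live video stream.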