Real-Time Video Classification with PaliGemma: Architecture Patterns for Low-Latency VLM Inference

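In the previous article, we benchmarked three open-source vision-language models on zero-shot object detection and reached an uncomfortable conclusion: even the fastest of them, Phi-3.5-vision-instruct, needed 4.45 seconds per frame on an NVIDIA L4, and LLaVA-v1.6 needed 8.13 seconds. For any application that has to process a video stream in real time, those numbers are disqualifying. But the claim that VLMs are fundamentally incompatible with real-time work deserves a closer look. The 8-second figure was measured on a general-purpose zero-shot detection task, where the model has to reason about arbitrary objects in arbitrary scenes. What happens if you constrain the problem? What if you give the model a closed vocabulary, a fixed frame resolution, a deterministic decoding strategy, and a non-blocking inference pipeline?
This article answers that question. Using PaliGemma, Google's compact vision-language model, we built a real-time video classification system that runs at roughly 0.8 to 1.2 seconds per frame on an NVIDIA RTX A4500. Set against LLaVA's 8.13 seconds on comparable professional-grade hardware, that is roughly a 7x to 10x improvement, and it comes from architectural decisions rather than a hardware upgrade. The four patterns below are what make it possible.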

Why PaliGemma

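Before getting into architecture, the model choice deserves an explanation, because PaliGemma's practical value is still underappreciated in developer-facing writing. PaliGemma is a 3-billion-parameter vision-language model from Google that pairs a SigLIP vision encoder with a Gemma language backbone. It is less than half the size of a 7B LLaVA and smaller than Phi-3.5-vision, which translates directly into lower VRAM consumption and faster inference on the same hardware. More important for classification, it is explicitly fine-tuned on a broad set of visual understanding tasks such as image captioning, visual question answering, and object localization, so it has strong priors for the kind of constrained, structured responses we are after.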

import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

MODEL_PATH = "./paligemma_offline"

model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    local_files_only=True
).eval()

processor = PaliGemmaProcessor.from_pretrained(
    MODEL_PATH,
    local_files_only=True
)

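Two things are worth noting here. Loading in bfloat16 halves the VRAM footprint of the weights compared with float32, with negligible accuracy impact for classification tasks. The local_files_only=True flag is not just a convenience for offline environments: in a production system it removes the network round trip at initialization and keeps the inference environment fully reproducible.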
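To put a rough number on the bfloat16 saving, the weight footprint can be estimated from the parameter count alone. The sketch below assumes a round 3 billion parameters (the figure quoted above) and counts weights only; activations, the KV cache, and framework overhead come on top, so treat the result as a lower bound rather than a measurement. The helper name weight_memory_gib is illustrative.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

NUM_PARAMS = 3_000_000_000  # assumption: the round 3B figure quoted in the text

fp32_gib = weight_memory_gib(NUM_PARAMS, 4)  # float32: 4 bytes per parameter
bf16_gib = weight_memory_gib(NUM_PARAMS, 2)  # bfloat16: 2 bytes per parameter

print(f"float32 weights:  {fp32_gib:.1f} GiB")
print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
```

The halving itself is exact, since it follows from the bytes per parameter; the absolute figures depend on the true parameter count of the checkpoint you load.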

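Pattern 1: Resolution as a Latency Knob

In a real-time VLM pipeline, input resolution is the single most influential parameter. A VLM processes an image by splitting it into patches, each of which is encoded into the token sequence. A 1280×720 frame produces far more tokens than a frame resized to 448×448, and because transformer attention scales quadratically with sequence length, resolution is not a linear cost: the attention cost grows with the square of the token count. For zero-shot object detection, where spatial precision matters, downsampling is a genuine trade-off. For scene-level classification, where the question is "what is the dominant state in this frame?" rather than "give me the pixel coordinates of every object", 448×448 retains more than enough semantic information.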

import cv2
import numpy as np
from PIL import Image

def preprocess_frame(frame_bgr: np.ndarray) -> Image.Image:
    # Resize to inference resolution before any VLM processing
    frame_small = cv2.resize(frame_bgr, (448, 448))
    # Convert BGR (OpenCV) to RGB (PIL/transformers)
    return Image.fromarray(cv2.cvtColor(frame_small, cv2.COLOR_BGR2RGB))

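The takeaway is that resolution should be chosen according to the granularity of information the task needs, not according to the resolution of the input stream. If your camera records at 1080p but your classifier only has to tell five states apart, you are paying a heavy compute tax for information you never use.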
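The quadratic cost is easy to see with back-of-the-envelope patch arithmetic. The sketch below assumes a ViT-style encoder with 14×14-pixel patches that tiles its input at whatever resolution it is handed; the patch size and the helper names are assumptions for illustration, and checkpoints with a fixed input resolution resize inside their processor instead.

```python
def patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of image tokens when the frame is cut into patch x patch tiles."""
    return (width // patch) * (height // patch)

def attention_pairs(num_tokens: int) -> int:
    """Self-attention compares every token with every other token."""
    return num_tokens ** 2

small = patch_tokens(448, 448)    # 32 * 32 tiles
large = patch_tokens(1280, 720)   # 91 * 51 tiles

print(small, large)
print(f"token ratio:     {large / small:.1f}x")
print(f"attention ratio: {attention_pairs(large) / attention_pairs(small):.1f}x")
```

Under these assumptions the larger frame carries about 4.5 times as many image tokens, but the pairwise attention work grows by roughly 20 times, which is why resolution is the first knob to turn.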

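Pattern 2: Closed-Vocabulary Deterministic Decoding

Standard VLM usage treats the model as an open-ended text generator. You prompt it, it samples from a probability distribution, you get a natural-language response, and then you parse it. This is the root of the brittleness discussed in the previous article, and it is also a major source of latency: sampling with a high max_new_tokens means the model runs a full autoregressive step for every generated token. For classification you can sidestep all of this. Instead of asking the model to describe what it sees, you use the prompt to constrain its output to a fixed vocabulary of valid labels, and you cap generation at the minimum number of tokens needed to express one of them.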

# A generic set of states tailored to your specific domain
VALID_CLASSES = ['active', 'idle', 'error', 'offline', 'unknown']

def classify_frame(model: torch.nn.Module, processor, image: Image.Image) -> str:
    prompt = (
        f"<image> What is the current operational state shown in this frame? "
        f"You MUST choose ONLY ONE from: {VALID_CLASSES}."
    )

    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device, model.dtype)

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=10,   # A single class label needs at most 2-3 tokens
            do_sample=False      # Greedy decoding: deterministic, faster, no temperature needed
        )

    raw_output = processor.decode(
        output_ids[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True
    ).strip().lower()

    # Sanitize to alphabetic characters only
    return ''.join(filter(str.isalpha, raw_output))

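Setting do_sample=False switches the model to greedy decoding, which always picks the highest-probability token at each step. This removes the sampling overhead and makes the output deterministic: the same input always yields the same result, which matters for debugging and for the temporal smoothing described next. max_new_tokens=10 caps generation so that the model stops almost immediately after emitting the label rather than continuing into unnecessary text.
In effect, you are using a 3B-parameter VLM as a semantically aware classifier rather than as a generative model. You keep the zero-shot flexibility of natural-language prompts, with inference characteristics close to those of a dedicated classification head.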
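The prompt asks for one of the valid labels, but nothing in greedy decoding guarantees that the returned string is actually in the list. One way to close that gap is a small post-processing step that maps the raw text onto the vocabulary and falls back to a safe default. The helper below, normalize_label, is a hypothetical addition and not part of the original pipeline; the "first valid word wins" rule and the 'unknown' fallback are assumptions.

```python
import re

# Same closed vocabulary as in the classify_frame example
VALID_CLASSES = ['active', 'idle', 'error', 'offline', 'unknown']

def normalize_label(raw_output: str, valid=VALID_CLASSES, fallback: str = 'unknown') -> str:
    """Map free-form model text onto the closed vocabulary."""
    # Split into lowercase alphabetic words so punctuation and brackets are ignored
    words = re.findall(r"[a-z]+", raw_output.lower())
    for word in words:
        if word in valid:
            return word  # first valid label wins
    return fallback

print(normalize_label("Active."))
print(normalize_label("['idle']"))
print(normalize_label("a conveyor belt"))
```

Whatever the model emits, downstream code then only ever sees one of the five known labels, which also keeps the smoothing window free of one-off strings.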

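Pattern 3: Temporal Smoothing for Stable Predictions

Even with deterministic decoding, a vision model processing raw video produces noisy predictions. Lighting changes, motion blur, partial occlusion, and transient visual artifacts cause the model to emit inconsistent labels across consecutive frames. If you pass each raw per-frame prediction straight to downstream systems, you get jittery, unreliable output. The fix is temporal smoothing: rather than trusting any single prediction, keep a rolling window of recent predictions and report the majority vote.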

from collections import deque, Counter

class TemporalSmoother:
    def __init__(self, window_size: int = 5):
        self.history = deque(maxlen=window_size)

    def update(self, prediction: str) -> str:
        self.history.append(prediction)
        # Return the most common prediction in the window
        return Counter(self.history).most_common(1)[0][0]

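At our inference speed, a 5-frame window corresponds to roughly 4 to 6 seconds of temporal context (5 predictions at 0.8 to 1.2 seconds each). That is enough to absorb momentary disturbances while still responding to real state changes. The window size is a tuning parameter: larger windows are more stable but slower to react, smaller windows are more responsive but noisier. For most classification tasks, 3 to 7 frames works well in practice.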
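A short trace makes the behaviour of the majority vote concrete. The snippet below restates the TemporalSmoother class so that it runs on its own and feeds it a made-up sequence of raw predictions (the sequence is invented for illustration, not measured output).

```python
from collections import Counter, deque

class TemporalSmoother:
    """Majority vote over the last `window_size` predictions."""
    def __init__(self, window_size: int = 5):
        self.history = deque(maxlen=window_size)

    def update(self, prediction: str) -> str:
        self.history.append(prediction)
        return Counter(self.history).most_common(1)[0][0]

# Made-up raw predictions: one glitch at index 2, then a real change from index 5
raw = ['idle', 'idle', 'active', 'idle', 'idle', 'active', 'active', 'active']

smoother = TemporalSmoother(window_size=5)
smoothed = [smoother.update(p) for p in raw]

for r, s in zip(raw, smoothed):
    print(f"{r:>6} -> {s}")
```

The isolated 'active' at index 2 never reaches the output, and the smoothed label only switches once 'active' holds a majority of the window.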

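Pattern 4: Non-Blocking Inference with a Decoupled Shared-State Architecture

The first three patterns optimize the inference call itself. This one addresses a more fundamental systems problem: a single VLM inference call takes 0.8 to 1.2 seconds, and whichever thread runs it is blocked for that long. If video capture and inference share a thread, the stream stutters along at the inference rate instead of the camera rate.

The naive approach is to pass frames between threads with Python's standard queue.Queue. That creates consumer contention: if the rendering thread and the AI thread read from the same queue, they compete for frames and steal data from each other, which produces heavy visual stutter and an inference loop that runs in fits and starts. The more robust approach is an asynchronous shared-state pattern guarded by a granular lock. The video capture thread, as the producer, continually overwrites a shared "latest frame" pointer. The rendering loop (which runs on the main thread, as required for OpenCV UI operations on macOS and Wayland) and the AI background thread act as independent consumers, each copying the latest frame into local memory whenever it is ready for its next cycle.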

import threading
import time
import numpy as np
import cv2
import torch
from typing import Optional, Any

class SharedState:
    """
    Thread-safe state container.
    The lock is strictly granular: it is only held for memory assignment/copying,
    never during expensive I/O or AI inference operations.
    """
    def __init__(self):
        self.latest_frame: Optional[np.ndarray] = None
        self.prediction: str = "WAITING"
        self.lock = threading.Lock()
        self.running: bool = True

shared = SharedState()

def video_capture_worker(source: int = 0) -> None:
    """Reads frames at hardware speed and updates the shared state."""
    cap = cv2.VideoCapture(source)

    while shared.running:
        ret, frame = cap.read()
        if not ret:
            time.sleep(0.01)
            continue

        with shared.lock:
            # Overwrite with the freshest data.
            # Pointer assignment is fast enough to barely hold the lock.
            shared.latest_frame = frame

    cap.release()

def inference_worker(model: torch.nn.Module, processor: Any) -> None:
    """Consumes the latest frame at the AI's maximum throughput rate."""
    smoother = TemporalSmoother(window_size=5)

    while shared.running:
        with shared.lock:
            # Deep copy to prevent OpenCV from mutating the array during inference
            frame = shared.latest_frame.copy() if shared.latest_frame is not None else None

        if frame is None:
            time.sleep(0.05)
            continue

        try:
            image = preprocess_frame(frame)
            raw_pred = classify_frame(model, processor, image)
            smoothed_pred = smoother.update(raw_pred)

            with shared.lock:
                shared.prediction = smoothed_pred

        except torch.cuda.OutOfMemoryError:
            # Handle temporary VRAM spikes gracefully without killing the thread
            with shared.lock:
                shared.prediction = "OOM_ERROR"
            time.sleep(1.0)

        except Exception:
            # Catch corrupt frames or tensor mismatches; back off briefly so a
            # persistent failure does not turn into a busy loop
            with shared.lock:
                shared.prediction = "ERROR"
            time.sleep(0.1)

# Initialize background workers as Daemon threads
threads = [
    threading.Thread(target=video_capture_worker, args=(0,), daemon=True),
    threading.Thread(target=inference_worker, args=(model, processor), daemon=True),
]

for t in threads:
    t.start()

# Main Thread UI Loop
# UI libraries (cv2.imshow) must run on the main thread to prevent OS-level crashes.
while shared.running:
    with shared.lock:
        frame = shared.latest_frame.copy() if shared.latest_frame is not None else None
        label = shared.prediction

    if frame is not None:
        cv2.putText(frame, label.upper(), (30, 50),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 3)
        cv2.imshow("Live Classification", frame)

    # 30 FPS rendering limit (33ms) + graceful shutdown
    if cv2.waitKey(33) & 0xFF == ord('q'):
        shared.running = False

cv2.destroyAllWindows()

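The key to this design is the granular lock: the lock is taken, the NumPy array is assigned or copied in memory (a matter of microseconds), and the lock is released immediately. If you held the lock across the one-second VLM inference, all three loops would serialize and the whole point of the architecture would be lost. With this design, the capture thread runs at the hardware rate (for example, 30 FPS), the rendering loop also displays at 30 FPS, and the inference thread runs at its own asynchronous pace (about 1 FPS). The three loops are decoupled in time, each limited only by its own hardware.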
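The latest-value idea can be exercised without a camera or a GPU. The following is a minimal standard-library sketch in which integers stand in for frames and time.sleep stands in for capture and inference time; the class name LatestValue, the function run_demo, and all timings are assumptions chosen for illustration.

```python
import threading
import time

class LatestValue:
    """Single-slot mailbox: writers overwrite, readers always get the newest value."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def put(self, value):
        with self._lock:          # held only for the assignment
            self._value = value

    def get(self):
        with self._lock:          # held only for the read
            return self._value

def run_demo(num_frames=100, produce_interval=0.002, consume_time=0.05):
    latest = LatestValue()
    done = threading.Event()
    seen = []

    def producer():
        # Stand-in for the camera: publishes frame numbers at a fast, fixed pace
        for frame_id in range(num_frames):
            latest.put(frame_id)
            time.sleep(produce_interval)
        done.set()

    def consumer():
        # Stand-in for inference: slow, and only ever looks at the newest frame
        while not done.is_set():
            frame_id = latest.get()
            if frame_id is not None:
                seen.append(frame_id)
            time.sleep(consume_time)

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return seen

seen = run_demo()
print(f"frames published: 100, frames sampled by the slow consumer: {len(seen)}")
print("sampled frame ids:", seen)
```

The producer never waits for the consumer, and the consumer skips ahead to whatever is newest instead of working through a backlog, which is the property a queue does not give you.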

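Benchmark Results

Running the full pipeline on an NVIDIA RTX A4500 (20 GB GDDR6, Ampere architecture), PaliGemma in bfloat16 handled the three-stream live-video scenario with very stable performance. With input resolution fixed at 448×448 and output limited to at most 10 new tokens under greedy decoding (do_sample=False), per-frame inference latency stayed between 0.8 and 1.2 seconds. Combined with the 5-frame temporal smoothing window, this configuration gave reliable state classification, and the decoupled architecture let the video capture thread hold a steady 25 FPS, entirely unaffected by the inference bottleneck.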
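These two rates together imply how sparsely the classifier samples the stream. The arithmetic below uses only the 25 FPS capture rate and the 0.8 to 1.2 second latency range reported above; the helper name frames_per_inference is illustrative.

```python
CAPTURE_FPS = 25               # capture rate reported above
LATENCY_RANGE_S = (0.8, 1.2)   # per-frame inference latency reported above

def frames_per_inference(capture_fps: float, latency_s: float) -> int:
    """Captured frames that arrive while one inference call is running."""
    return round(capture_fps * latency_s)

for latency in LATENCY_RANGE_S:
    n = frames_per_inference(CAPTURE_FPS, latency)
    print(f"{latency:.1f} s per inference -> ~{n} frames captured, 1 classified")
```

In other words, the model looks at roughly one frame in every 20 to 30, which is acceptable for slowly changing states and is exactly why the capture loop must not wait for it.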

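For comparison, LLaVA-v1.6-Mistral-7B running open-vocabulary zero-shot detection on an NVIDIA L4 took 8.13 seconds per frame. The hardware is not directly equivalent, but the gap is large enough to show that architectural constraints, not raw compute, account for most of the difference.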

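When This Architecture Fits

This pattern fits when your task reduces to a fixed set of labels; when you need continuous processing of a live stream rather than batch analysis of still images; when data privacy requirements rule out sending frames to an external API; and when you can tolerate latency on the order of one second rather than around a hundred milliseconds. It is the wrong tool when you need true real-time response at conveyor-belt speed and sub-50 ms latency is non-negotiable. In that case you are back in YOLO territory, or you use the pipeline described in the previous article: let a VLM auto-label a dataset overnight, then train a lightweight specialized classifier for production deployment.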

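Conclusion

The gap between "VLMs are too slow for video" and "VLMs can run in a production video pipeline" is not primarily a hardware problem. It is an architecture problem. Choosing a compact model like PaliGemma over a 7B-parameter alternative, limiting resolution to what the task needs, enforcing deterministic decoding over a closed vocabulary, smoothing predictions over time, and decoupling inference from capture and rendering: none of these requires a bigger GPU. What they require is thinking carefully about what the model actually has to do, and building the pipeline around those constraints rather than against them.

The complete pipeline, from model loading to multithreaded inference, fits in about 150 lines of Python. That is a reasonable price for zero-shot semantic classification on a live video stream.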
