Pick a better video thumbnail automatically with FFmpeg, PySceneDetect, and CLIP
Mason K · 2026-05-31 · via DEV Community

TL;DR

We'll build a pipeline that takes any video file, extracts candidate frames with FFmpeg and PySceneDetect, filters out blurry ones with OpenCV, scores each candidate with OpenCLIP against a small prompt set, and picks the top-K thumbnails with a diversity constraint. ~200 lines of Python, GPU-accelerated, fully local.

📦 Code: github.com/USER/auto-thumbnail-picker (replace before publishing)

The default thumbnail your encoder generates is "the middle frame." For most videos, the middle frame is a motion blur, a transition, or someone mid-blink. We can do much better with about an hour of effort. Here's the pipeline.

Versions

python 3.12
ffmpeg 7.1
pyscenedetect 0.6.x
open-clip-torch 2.x
opencv-python 4.x
torch 2.x (with CUDA or MPS)

The pipeline runs on CPU but is noticeably slower for the CLIP step. A consumer GPU (or Apple Silicon MPS) cuts the per-frame encode to a few milliseconds.

What we're building

  1. Extract candidate frames (shot boundaries + uniform sampling).
  2. Filter blurry frames out with Laplacian variance.
  3. Score each remaining frame with OpenCLIP using a positive / negative prompt set.
  4. Apply structural rules (prefer frames with faces).
  5. Pick top-K with a diversity constraint (a minimum time gap between picks).

1. Setup

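python -m venv .venv && source .venv/bin/activate
# Quote the extra so zsh does not expand the brackets; the headless extra
# avoids installing opencv-python alongside opencv-python-headless.
pip install \
  open-clip-torch \
  "scenedetect[opencv-headless]" \
  opencv-python-headless \
  pillow \
  torch torchvision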

For face detection we'll keep it simple and use OpenCV's built-in Haar cascades. If you need higher accuracy on small faces, swap to insightface later.

2. Extract candidate frames

Two sources of candidates: shot boundaries (one per shot) and uniform sampling (one every N seconds). Combine both, then dedupe by timestamp.

# extract.py
import os
import pathlib
import subprocess

from scenedetect import open_video, SceneManager, ContentDetector

def shot_boundaries(video_path: str) -> list[float]:
    """Return one timestamp (seconds) per detected scene, just after its start."""
    vid = open_video(video_path)
    sm = SceneManager()
    sm.add_detector(ContentDetector(threshold=27.0))
    sm.detect_scenes(vid, show_progress=False)
    scenes = sm.get_scene_list()
    # Take the start of each scene plus a small offset (avoid the literal cut frame).
    return [s[0].get_seconds() + 0.4 for s in scenes]

def uniform_samples(duration: float, every: float = 3.0) -> list[float]:
    """Return one timestamp every `every` seconds, skipping t=0."""
    return [i * every for i in range(1, int(duration // every) + 1)]

def extract_frame(video_path: str, t: float, out_path: str) -> None:
    """Write the frame at time t (seconds) to out_path as a high-quality JPEG."""
    subprocess.run([
        "ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video_path,
        "-frames:v", "1", "-q:v", "2", out_path
    ], check=True, capture_output=True)

def probe_duration(video_path: str) -> float:
    """Return the container duration in seconds, as reported by ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=nokey=1:noprint_wrappers=1", video_path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return float(out)

def build_candidates(video_path: str, out_dir: str) -> list[tuple[float, str]]:
    os.makedirs(out_dir, exist_ok=True)
    dur = probe_duration(video_path)
    times = shot_boundaries(video_path) + uniform_samples(dur)
    # Round so near-identical floats collapse in the set, and drop timestamps
    # at or past the end of the file, where ffmpeg has no frame to write.
    times = sorted({round(t, 2) for t in times if t < dur})
    out = []
    for i, t in enumerate(times):
        p = pathlib.Path(out_dir) / f"f_{i:04d}_{t:07.2f}.jpg"
        extract_frame(video_path, t, str(p))
        out.append((t, str(p)))
    return out

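💡 Tip: placing -ss before -i makes ffmpeg seek on the input: it jumps to the nearest preceding keyframe, then decodes and discards frames up to the requested timestamp. Because we re-encode to JPEG, accurate seeking is on by default, so this is both fast and frame-accurate. Putting -ss after -i lands on the same frame but decodes everything before it, which gets slow deep into a long file.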

Run it:

$ python -c "from extract import build_candidates; print(len(build_candidates('clip.mp4', 'frames/')))"
27

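Twenty-seven candidates for this clip. Manageable. The count grows with clip length: with the default every=3.0, uniform sampling alone adds about 20 frames per minute, so raise every for long videos.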
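
One caveat on deduping: a plain set() only removes exactly equal timestamps, so a shot sample at 3.4 s and a uniform sample at 3.0 s both survive as near-identical frames. A minimal sketch of tolerance-based dedupe, assuming a hypothetical helper dedupe_times and a min_gap of 0.5 s (neither is part of the code above):

```python
def dedupe_times(times: list[float], min_gap: float = 0.5) -> list[float]:
    """Sort timestamps and drop any that fall within min_gap of the last kept one.

    Hypothetical helper; min_gap=0.5 is an assumed default, tune it per library.
    """
    kept: list[float] = []
    for t in sorted(times):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept


if __name__ == "__main__":
    shots = [0.4, 3.4, 7.9]      # e.g. scene starts plus the 0.4 s offset
    uniform = [3.0, 6.0, 9.0]    # e.g. one sample every 3 s
    print(dedupe_times(shots + uniform))
```

When two timestamps collide, this keeps the earlier one. In build_candidates it would stand in for the sorted(set(...)) step.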

3. Filter blurry frames

CLIP is too tolerant of blur. We prune with Laplacian variance first.

# sharpness.py
import cv2

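def is_sharp(image_path: str, threshold: float = 80.0) -> bool:
    """True if the variance of the Laplacian (an edge-energy measure) exceeds threshold."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:  # unreadable or missing file
        return False
    # .var() returns a NumPy scalar; wrap in bool() to match the annotation.
    return bool(cv2.Laplacian(img, cv2.CV_64F).var() > threshold)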

For 1080p sources a threshold around 80 works; for 720p drop to ~60. Tune on a few samples from your library: the right cutoff depends on resolution, compression, and the camera you're shooting with.
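
One way to tune the cutoff is to measure Laplacian variance on a sample of frames from your own library, then set the threshold relative to that distribution. A minimal sketch, assuming a hypothetical helper suggest_threshold that rejects roughly the blurriest fraction of the sample (the 25% default is an assumption to adjust, not a value from this pipeline):

```python
import statistics


def suggest_threshold(variances: list[float], drop_fraction: float = 0.25) -> float:
    """Return a cutoff that rejects roughly the blurriest drop_fraction of frames.

    variances: Laplacian-variance values measured on sample frames.
    Hypothetical helper; drop_fraction=0.25 is an assumed starting point.
    """
    if len(variances) < 2:
        raise ValueError("need at least two sample values")
    if not 0.0 < drop_fraction < 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    # n=100 yields 99 cut points (percentiles 1..99); pick the matching one.
    percentiles = statistics.quantiles(variances, n=100, method="inclusive")
    index = min(98, max(0, round(drop_fraction * 100) - 1))
    return percentiles[index]
```

Collect the inputs by running cv2.Laplacian(img, cv2.CV_64F).var() over the sample frames, then pass the result as the threshold argument of is_sharp.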

4. Score with OpenCLIP

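The trick is scoring each frame as max(positive_similarity) - mean(negative_similarity). This tends to be more robust than the absolute positive score: CLIP similarities for very different frames sit in a narrow band, so subtracting the negative mean cancels the offset every prompt shares and leaves how much closer the frame is to the positive prompt than to the failure modes.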

# score.py
import torch, open_clip
from PIL import Image

DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

MODEL_NAME = "ViT-L-14"
PRETRAINED = "laion2b_s32b_b82k"

POS = [
    "a sharp, well-lit frame from a video, showing the main subject clearly",
]
NEG = [
    "a blurry frame",
    "a black frame",
    "a transition with text overlay or graphic",
    "a frame caught between two shots",
]

class Scorer:
    def __init__(self) -> None:
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            MODEL_NAME, pretrained=PRETRAINED, device=DEVICE
        )
        self.model.eval()
        tokenizer = open_clip.get_tokenizer(MODEL_NAME)
        with torch.no_grad():
            self.pos = self._embed_text(tokenizer(POS))
            self.neg = self._embed_text(tokenizer(NEG))

    def _embed_text(self, tokens):
        with torch.no_grad():
            v = self.model.encode_text(tokens.to(DEVICE))
            return v / v.norm(dim=-1, keepdim=True)

    def score(self, image_path: str) -> float:
        img = self.preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            i = self.model.encode_image(img)
            i = i / i.norm(dim=-1, keepdim=True)
            pos_sim = (i @ self.pos.T).max().item()
            neg_sim = (i @ self.neg.T).mean().item()
        return pos_sim - neg_sim

⚠️ Note: load the model once and reuse. Reloading on every call is the most common reason people think "CLIP is slow"; it's actually the checkpoint load that's slow.

Quick sanity check:

$ python -c "from score import Scorer; s = Scorer(); print(s.score('frames/f_0010_0030.50.jpg'))"
0.0432

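Scores are differences of cosine similarities, so they are small: values from about 0 to 0.1 are typical, and a negative value means the frame looks more like the negative prompts. Bigger is better.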
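
To make the scoring rule concrete without loading a model, here is a toy version on hand-made 2-D unit vectors. The vectors are invented for illustration (real CLIP embeddings have hundreds of dimensions), and contrast_score is a hypothetical name; only the arithmetic mirrors score.py:

```python
import math


def unit(v: list[float]) -> list[float]:
    """Scale a vector to length 1 so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrast_score(image: list[float], pos: list[list[float]], neg: list[list[float]]) -> float:
    """Max similarity to the positive prompts minus mean similarity to the negatives."""
    pos_sim = max(dot(image, p) for p in pos)
    neg_sim = sum(dot(image, n) for n in neg) / len(neg)
    return pos_sim - neg_sim


if __name__ == "__main__":
    # Toy embeddings, made up for illustration only.
    pos = [unit([1.0, 0.2])]
    neg = [unit([0.2, 1.0]), unit([0.0, 1.0])]
    good_frame = unit([1.0, 0.3])   # points toward the positive prompt
    bad_frame = unit([0.3, 1.0])    # points toward the negative prompts
    print(round(contrast_score(good_frame, pos, neg), 3))
    print(round(contrast_score(bad_frame, pos, neg), 3))
```

The frame aligned with the positive prompt scores above zero and the one aligned with the negatives scores below it, which is the ordering the real scorer relies on.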

5. Face bonus

For UGC-style content the right thumbnail almost always has a clear face. We boost the score when one is present.

# faces.py
import cv2

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def has_visible_face(image_path: str, min_size: int = 80) -> bool:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    faces = _cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5,
                                      minSize=(min_size, min_size))
    return len(faces) > 0

6. Orchestrate

# pick.py
from extract import build_candidates
from sharpness import is_sharp
from faces import has_visible_face
from score import Scorer

def pick_thumbnails(video_path: str, out_dir: str, k: int = 3) -> list[dict]:
    candidates = build_candidates(video_path, out_dir)
    candidates = [(t, p) for t, p in candidates if is_sharp(p)]

    scorer = Scorer()
    rows = []
    for t, p in candidates:
        s = scorer.score(p)
        if has_visible_face(p):
            s += 0.05  # bias toward faces; sizable next to typical 0-0.1 scores, so tune it
        rows.append({"t": t, "path": p, "score": s})
    rows.sort(key=lambda r: r["score"], reverse=True)

    # Diversity: don't pick frames that are within 2 seconds of each other.
    picked: list[dict] = []
    for r in rows:
        if any(abs(r["t"] - p["t"]) < 2.0 for p in picked):
            continue
        picked.append(r)
        if len(picked) >= k:
            break
    return picked

if __name__ == "__main__":
    import sys, json
    print(json.dumps(pick_thumbnails(sys.argv[1], "frames/", k=3), indent=2))

Run (output abbreviated; your timestamps and scores will differ):

$ python pick.py clip.mp4
[
  { "t": 84.40, "path": "frames/f_0017_0084.40.jpg", "score": 0.1129 },
  { "t": 23.50, "path": "frames/f_0009_0023.50.jpg", "score": 0.0871 },
  { "t": 165.20, "path": "frames/f_0029_0165.20.jpg", "score": 0.0832 }
]

You now have three diverse, sharp, scored candidate thumbnails. Use the first for the default and keep the other two for A/B testing.
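
The 2-second gap in pick.py is a proxy for diversity. If you would rather allow at most one pick per shot, you can reuse the scene starts from PySceneDetect. A minimal sketch, assuming a hypothetical pick_one_per_shot helper and that shot_starts is the sorted list of scene start times in seconds (without the 0.4 s offset):

```python
from bisect import bisect_right


def pick_one_per_shot(rows: list[dict], shot_starts: list[float], k: int = 3) -> list[dict]:
    """Pick the k best-scoring rows, allowing at most one row per shot.

    rows: dicts with "t" (seconds) and "score", as built in pick.py.
    shot_starts: sorted scene start times in seconds (assumed input).
    """
    picked: list[dict] = []
    used_shots: set[int] = set()
    for r in sorted(rows, key=lambda r: r["score"], reverse=True):
        # Index of the last shot that starts at or before this timestamp
        # (-1 if the frame precedes the first detected shot).
        shot = bisect_right(shot_starts, r["t"]) - 1
        if shot in used_shots:
            continue
        used_shots.add(shot)
        picked.append(r)
        if len(picked) >= k:
            break
    return picked
```

Compared with the fixed time gap, this can return two frames a second apart if a cut falls between them, and it never returns two frames from one long static shot.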

7. (Optional) Crop to aspect ratio

If your product shows thumbnails at 16:9 but your source is 9:16 (vertical UGC), or vice versa, you want a smart crop, not a letterbox.

# crop.py
import cv2

def crop_to_ratio(image_path: str, out_path: str, ratio: tuple[int, int] = (16, 9)) -> None:
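    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"cannot read image: {image_path}")
    h, w = img.shape[:2]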
    target = ratio[0] / ratio[1]
    actual = w / h

    if actual > target:
        # too wide, crop sides
        new_w = int(h * target)
        x0 = (w - new_w) // 2
        cropped = img[:, x0:x0 + new_w]
    else:
        # too tall, crop top/bottom toward the upper third
        new_h = int(w / target)
        y0 = max(0, int(h * 0.20))
        y0 = min(y0, h - new_h)
        cropped = img[y0:y0 + new_h, :]
    cv2.imwrite(out_path, cropped)

The "upper third" bias for vertical content is intentional: the crop window starts 20% down from the top. Faces in vertical video usually sit in the upper third, and a centered crop often cuts them off. For higher quality, swap this for a face-aware crop using the bounding boxes that the detector in step 5 returns.
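
A face-aware crop mostly comes down to centering the crop window on the face box and clamping it to the image. A minimal sketch of that arithmetic, assuming a hypothetical face_crop_window helper and a face box in the (x, y, w, h) pixel form that detectMultiScale returns:

```python
def face_crop_window(
    img_w: int,
    img_h: int,
    face: tuple[int, int, int, int],
    ratio: tuple[int, int] = (16, 9),
) -> tuple[int, int, int, int]:
    """Return (x0, y0, crop_w, crop_h): the largest ratio-shaped window,
    centered on the face as far as the image edges allow.

    Hypothetical helper; face is (x, y, w, h) in pixels.
    """
    target = ratio[0] / ratio[1]
    if img_w / img_h > target:
        crop_w, crop_h = int(img_h * target), img_h   # too wide: trim the sides
    else:
        crop_w, crop_h = img_w, int(img_w / target)   # too tall: trim top/bottom
    fx, fy, fw, fh = face
    cx, cy = fx + fw // 2, fy + fh // 2               # face center
    # Center the window on the face, then clamp it inside the image.
    x0 = min(max(cx - crop_w // 2, 0), img_w - crop_w)
    y0 = min(max(cy - crop_h // 2, 0), img_h - crop_h)
    return x0, y0, crop_w, crop_h
```

With the window in hand, the crop itself is a slice: img[y0:y0 + crop_h, x0:x0 + crop_w]. When several faces are detected, picking the largest box is a reasonable default, though that choice is an assumption rather than something the pipeline above prescribes.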

8. Where to put this in your pipeline

| Stage | When |
| --- | --- |
| Inline at upload | Easiest. Adds 5–30s to the "video processing" step. |
| Background worker (recommended) | Consume an "asset.ready" webhook, run, write back to your DB. |
| Re-run on edit | Re-compute when the user trims or reorders clips. |

If you're using a managed video API, you'll already get an "asset ready" webhook. Hook a background worker to that and write the picked thumbnail back to your CMS row. Both FastPix and Mux let you set a custom thumbnail URL on the asset.
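
The background-worker flow can be sketched as a single handler function. Everything about the event here is hypothetical: the field names type, asset_id, and video_path are placeholders, real providers define their own payloads, and pick and save stand in for your own wrapper around pick_thumbnails and your database write:

```python
from typing import Callable


def handle_event(
    event: dict,
    pick: Callable[[str], list[dict]],
    save: Callable[[str, str], None],
) -> bool:
    """Process one webhook event; return True if a thumbnail was saved.

    The event shape ({"type", "asset_id", "video_path"}) is hypothetical:
    map your provider's actual payload fields onto these names.
    """
    if event.get("type") != "asset.ready":
        return False                       # ignore unrelated events
    picked = pick(event["video_path"])     # e.g. a wrapper around pick_thumbnails
    if not picked:
        return False                       # no sharp candidate survived
    save(event["asset_id"], picked[0]["path"])  # rows arrive best-scoring first
    return True
```

Passing pick and save in as callables keeps the handler free of any particular queue, web framework, or database client, so it can be tested without a GPU or a video file.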

What's next

Three upgrades that are worth their complexity:

  • Better face detector. Replace Haar with insightface for small-face robustness.
  • Learned scorer. If you have click-through data, fine-tune a small head on top of the CLIP embeddings. The hand-tuned prompts get you 80% of the way; learning the last 20% is real ROI.
  • Vision-language scoring. Ask a small VLM ("rate this thumbnail's appeal 1-10") instead of cosine similarity. More expensive, sometimes worth it for high-stakes content.

The thing this pipeline gets right is that the model is the cheap part. The leverage is the candidate selection, the structural filters, and the diversity constraint. Get those right with a vanilla CLIP and you're already beating most platforms' default thumbnails.

Happy thumbnail picking.