TL;DR
We'll build a pipeline that takes any video file, extracts candidate frames with FFmpeg and PySceneDetect, filters out blurry ones with OpenCV, scores each candidate with OpenCLIP against a small prompt set, and picks the top-K thumbnails with a diversity constraint. ~200 lines of Python, GPU-accelerated, fully local.
📦 Code: github.com/USER/auto-thumbnail-picker (replace before publishing)
The default thumbnail your encoder generates is "the middle frame." For most videos, the middle frame is a motion blur, a transition, or someone mid-blink. We can do much better with about an hour of effort. Here's the pipeline.
Versions
python 3.12
ffmpeg 7.1
pyscenedetect 0.6.x
open-clip-torch 2.x
opencv-python 4.x
torch 2.x (with CUDA or MPS)
The pipeline runs on CPU but is noticeably slower for the CLIP step. A consumer GPU (or Apple Silicon MPS) cuts the per-frame encode to a few milliseconds.
What we're building
- Extract candidate frames (shot boundaries + uniform sampling).
- Filter blurry frames out with Laplacian variance.
- Score each remaining frame with OpenCLIP using a positive / negative prompt set.
- Apply structural rules (prefer frames with faces).
- Pick top-K with a diversity constraint (a minimum time gap between picks).
1. Setup
python -m venv .venv && source .venv/bin/activate
pip install \
    open-clip-torch \
    "scenedetect[opencv-headless]" \
    opencv-python-headless \
    pillow \
    torch torchvision
For face detection we'll keep it simple and use OpenCV's built-in Haar cascades. If you need higher accuracy on small faces, swap to insightface later.
2. Extract candidate frames
Two sources of candidates: shot boundaries (one per shot) and uniform sampling (one every N seconds). Combine both, then dedupe by timestamp.
# extract.py
import os
import pathlib
import subprocess

from scenedetect import open_video, SceneManager, ContentDetector


def shot_boundaries(video_path: str) -> list[float]:
    """Return scene-cut timestamps (seconds) using PySceneDetect."""
    vid = open_video(video_path)
    sm = SceneManager()
    sm.add_detector(ContentDetector(threshold=27.0))
    sm.detect_scenes(vid, show_progress=False)
    scenes = sm.get_scene_list()
    # Take the start of each scene plus a small offset (avoid the literal cut frame).
    return [s[0].get_seconds() + 0.4 for s in scenes]


def uniform_samples(duration: float, every: float = 3.0) -> list[float]:
    # One sample every `every` seconds, skipping t=0 and anything at or past the end.
    steps = range(1, int(duration // every) + 1)
    return [i * every for i in steps if i * every < duration]


def extract_frame(video_path: str, t: float, out_path: str) -> None:
    # -ss before -i seeks on the input; -q:v 2 asks for a high-quality JPEG.
    subprocess.run([
        "ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video_path,
        "-frames:v", "1", "-q:v", "2", out_path
    ], check=True, capture_output=True)


def probe_duration(video_path: str) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=nokey=1:noprint_wrappers=1", video_path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return float(out)


def build_candidates(video_path: str, out_dir: str) -> list[tuple[float, str]]:
    os.makedirs(out_dir, exist_ok=True)
    dur = probe_duration(video_path)
    # set() only removes exact duplicate timestamps.
    times = sorted(set(shot_boundaries(video_path) + uniform_samples(dur)))
    out = []
    for i, t in enumerate(times):
        p = pathlib.Path(out_dir) / f"f_{i:04d}_{t:07.2f}.jpg"
        extract_frame(video_path, t, str(p))
        out.append((t, str(p)))
    return out
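💡 Tip: putting -ss before -i makes ffmpeg seek on the input: it jumps to the nearest preceding keyframe instead of decoding from the start, which is fast. Because we re-encode to JPEG rather than stream-copy, recent ffmpeg builds then decode forward to the requested timestamp, so the still is frame-accurate as well. Even where it isn't, 0.4s after the cut is forgiving enough.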
Run it:
$ python -c "from extract import build_candidates; print(len(build_candidates('clip.mp4', 'frames/')))"
27
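Twenty-seven candidates in this run. Manageable. Expect the count to scale with clip length: at the default every=3.0, uniform sampling alone contributes about 20 frames per minute, so raise every for long videos.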
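One gap worth knowing about: set() only removes timestamps that are exactly equal, so a shot offset at 3.1s and a uniform sample at 3.0s both survive. A minimal sketch of a tolerance-based merge follows. The helper name merge_close and the 0.5s default are assumptions for illustration, not part of the original pipeline.

```python
def merge_close(times: list[float], min_gap: float = 0.5) -> list[float]:
    """Return timestamps in ascending order, dropping any that fall within
    min_gap seconds of the previously kept one (the earlier one wins)."""
    kept: list[float] = []
    for t in sorted(times):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept


shots = [0.4, 3.1, 12.4]          # made-up shot offsets
uniform = [3.0, 6.0, 9.0, 12.0]   # uniform samples every 3 seconds
print(merge_close(shots + uniform))  # [0.4, 3.0, 6.0, 9.0, 12.0]
```

If you try this, it would replace the sorted(set(...)) call in build_candidates. Note that the earlier timestamp wins, so a uniform sample can displace a nearby shot offset.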
3. Filter blurry frames
CLIP is too tolerant of blur. We prune with Laplacian variance first.
# sharpness.py
import cv2


def is_sharp(image_path: str, threshold: float = 80.0) -> bool:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    return cv2.Laplacian(img, cv2.CV_64F).var() > threshold
For 1080p sources, a threshold around 80 works. For 720p, drop to ~60. Tune on a few samples from your library: Laplacian variance shifts with resolution, compression, and scene texture as well as the camera, so there is no universal cutoff.
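Here is one way to do that tuning, as a rough sketch. It assumes you have hand-labeled a handful of frames as sharp or blurry and recorded each one's Laplacian variance (the same value is_sharp computes). The function name suggest_threshold and the sample numbers are made up for illustration.

```python
def suggest_threshold(sharp: list[float], blurry: list[float]) -> float:
    """Return the cutoff that misclassifies the fewest labeled samples.

    Candidate cutoffs are midpoints between adjacent sorted values.
    A frame counts as sharp when its variance is > cutoff, as in is_sharp().
    """
    values = sorted(set(sharp) | set(blurry))
    if len(values) < 2:
        raise ValueError("need at least two distinct variance values")
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut: float) -> int:
        rejected_sharp = sum(v <= cut for v in sharp)   # sharp frames we would drop
        kept_blurry = sum(v > cut for v in blurry)      # blurry frames we would keep
        return rejected_sharp + kept_blurry

    # min() returns the lowest cutoff among ties, which keeps more frames.
    return min(candidates, key=errors)


# Made-up variances from six labeled frames.
print(suggest_threshold(sharp=[100, 120, 150], blurry=[20, 40, 60]))  # 80.0
```

With real footage the two groups usually overlap, so treat the result as a starting point and eyeball the frames near the cutoff.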
4. Score with OpenCLIP
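The trick is scoring each frame as positive_similarity - mean(negative_similarity). This is more robust than the absolute positive score: a frame's raw cosine similarity to any prompt carries an offset that depends on its lighting and content, and subtracting the negative-prompt similarity tends to cancel much of it. With more than one positive prompt, the code takes the best match.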
# score.py
import torch, open_clip
from PIL import Image

DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
MODEL_NAME = "ViT-L-14"
PRETRAINED = "laion2b_s32b_b82k"

POS = [
    "a sharp, well-lit frame from a video, showing the main subject clearly",
]
NEG = [
    "a blurry frame",
    "a black frame",
    "a transition with text overlay or graphic",
    "a frame caught between two shots",
]


class Scorer:
    def __init__(self) -> None:
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            MODEL_NAME, pretrained=PRETRAINED, device=DEVICE
        )
        self.model.eval()
        tokenizer = open_clip.get_tokenizer(MODEL_NAME)
        # Prompt embeddings never change, so compute them once here.
        self.pos = self._embed_text(tokenizer(POS))
        self.neg = self._embed_text(tokenizer(NEG))

    def _embed_text(self, tokens):
        with torch.no_grad():
            v = self.model.encode_text(tokens.to(DEVICE))
        return v / v.norm(dim=-1, keepdim=True)

    def score(self, image_path: str) -> float:
        img = self.preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            i = self.model.encode_image(img)
        i = i / i.norm(dim=-1, keepdim=True)
        pos_sim = (i @ self.pos.T).max().item()   # best-matching positive prompt
        neg_sim = (i @ self.neg.T).mean().item()  # average over negative prompts
        return pos_sim - neg_sim
⚠️ Note: load the model once and reuse. Reloading on every call is the most common reason people think "CLIP is slow"; it's actually the checkpoint load that's slow.
Quick sanity check:
$ python -c "from score import Scorer; s = Scorer(); print(s.score('frames/f_0010_0030.50.jpg'))"
0.0432
Numbers around 0 to 0.1 are typical. Bigger is better.
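To make the arithmetic concrete, here is a toy version of the same score in plain Python. The 2-D vectors are made-up stand-ins for CLIP embeddings, and the names cosine and contrastive_score are assumptions for this sketch. The real Scorer does the same thing with normalized tensors.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def contrastive_score(image, positives, negatives) -> float:
    """Best positive similarity minus the mean negative similarity."""
    pos = max(cosine(image, p) for p in positives)
    neg = sum(cosine(image, n) for n in negatives) / len(negatives)
    return pos - neg


positives = [[1.0, 0.0]]   # toy "sharp, well-lit frame" direction
negatives = [[0.0, 1.0]]   # toy "blurry frame" direction

good_frame = [1.0, 0.1]    # points mostly along the positive direction
bad_frame = [0.1, 1.0]     # points mostly along the negative direction

print(contrastive_score(good_frame, positives, negatives) > 0)  # True
print(contrastive_score(bad_frame, positives, negatives) < 0)   # True
```

Real CLIP scores are much smaller in magnitude because image and text embeddings are never this cleanly separated, which is why values near 0 to 0.1 are normal.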
5. Face bonus
For UGC-style content the right thumbnail almost always has a clear face. We boost the score when one is present.
# faces.py
import cv2

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)


def has_visible_face(image_path: str, min_size: int = 80) -> bool:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    faces = _cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5,
                                      minSize=(min_size, min_size))
    return len(faces) > 0
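A yes/no face check treats a tiny background face the same as a close-up. If that bothers you, one option is a graded bonus based on how much of the frame the largest face covers. This is a sketch only: face_bonus, max_bonus, and full_at are hypothetical names and defaults, and the boxes are assumed to be (x, y, w, h) tuples like the ones detectMultiScale returns.

```python
def face_bonus(faces, frame_w: int, frame_h: int,
               max_bonus: float = 0.05, full_at: float = 0.10) -> float:
    """Scale the bonus by the largest face's share of the frame area.

    faces:   iterable of (x, y, w, h) boxes
    full_at: area share at which the full max_bonus is granted
    """
    if frame_w <= 0 or frame_h <= 0:
        raise ValueError("frame dimensions must be positive")
    largest = max((w * h for _, _, w, h in faces), default=0)
    share = largest / (frame_w * frame_h)
    return max_bonus * min(1.0, share / full_at)


print(face_bonus([], 1920, 1080))                    # 0.0
print(face_bonus([(0, 0, 960, 540)], 1920, 1080))    # 0.05
```

To use it, has_visible_face would need to return the boxes instead of a bool, and the orchestration step would add this value in place of a flat bonus.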
6. Orchestrate
# pick.py
from extract import build_candidates
from sharpness import is_sharp
from faces import has_visible_face
from score import Scorer


def pick_thumbnails(video_path: str, out_dir: str, k: int = 3) -> list[dict]:
    candidates = build_candidates(video_path, out_dir)
    candidates = [(t, p) for t, p in candidates if is_sharp(p)]
    scorer = Scorer()  # load the model once, reuse for every frame
    rows = []
    for t, p in candidates:
        s = scorer.score(p)
        if has_visible_face(p):
            # Bias toward faces. Scores typically span 0 to 0.1, so 0.05 is a
            # strong push; lower it if faces dominate your picks.
            s += 0.05
        rows.append({"t": t, "path": p, "score": s})
    rows.sort(key=lambda r: r["score"], reverse=True)

    # Diversity: don't pick frames that are within 2 seconds of each other.
    picked: list[dict] = []
    for r in rows:
        if any(abs(r["t"] - p["t"]) < 2.0 for p in picked):
            continue
        picked.append(r)
        if len(picked) >= k:
            break
    return picked


if __name__ == "__main__":
    import sys, json
    print(json.dumps(pick_thumbnails(sys.argv[1], "frames/", k=3), indent=2))
Run:
$ python pick.py clip.mp4
[
{ "t": 84.40, "path": "frames/f_0017_0084.40.jpg", "score": 0.1129 },
{ "t": 23.50, "path": "frames/f_0009_0023.50.jpg", "score": 0.0871 },
{ "t": 165.20, "path": "frames/f_0029_0165.20.jpg", "score": 0.0832 }
]
You now have three diverse, sharp, scored candidate thumbnails. Use the first for the default and keep the other two for A/B testing.
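The 2-second gap is a proxy for "different shots", and it breaks down when one long shot yields several high scorers. If you want stricter shot-level diversity, a sketch along these lines could work. It assumes you keep the raw scene-start timestamps from PySceneDetect (without the 0.4s offset) as an ascending list called cuts. The function names are made up for illustration.

```python
from bisect import bisect_right


def shot_index(t: float, cuts: list[float]) -> int:
    """Index of the shot containing time t, given ascending cut timestamps."""
    return bisect_right(cuts, t)


def pick_one_per_shot(rows: list[dict], cuts: list[float], k: int = 3) -> list[dict]:
    """Take the best-scoring row from each shot until k rows are picked."""
    picked: list[dict] = []
    used: set[int] = set()
    for r in sorted(rows, key=lambda r: r["score"], reverse=True):
        shot = shot_index(r["t"], cuts)
        if shot in used:
            continue  # this shot already contributed a thumbnail
        used.add(shot)
        picked.append(r)
        if len(picked) >= k:
            break
    return picked


rows = [
    {"t": 1.0, "score": 0.9},
    {"t": 2.5, "score": 0.8},    # same shot as t=1.0, so it is skipped
    {"t": 12.0, "score": 0.7},
    {"t": 25.0, "score": 0.1},
]
print([r["t"] for r in pick_one_per_shot(rows, cuts=[10.0, 20.0], k=2)])  # [1.0, 12.0]
```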
7. (Optional) Crop to aspect ratio
If your product shows thumbnails at 16:9 but your source is 9:16 (vertical UGC), or vice versa, you want a smart crop, not a letterbox.
# crop.py
import cv2


def crop_to_ratio(image_path: str, out_path: str, ratio: tuple[int, int] = (16, 9)) -> None:
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"could not read image: {image_path}")
    h, w = img.shape[:2]
    target = ratio[0] / ratio[1]
    actual = w / h
    if actual > target:
        # too wide, crop sides (centered)
        new_w = int(h * target)
        x0 = (w - new_w) // 2
        cropped = img[:, x0:x0 + new_w]
    else:
        # too tall, crop top/bottom toward the upper third
        new_h = int(w / target)
        y0 = max(0, int(h * 0.20))
        y0 = min(y0, h - new_h)  # keep the window inside the image
        cropped = img[y0:y0 + new_h, :]
    cv2.imwrite(out_path, cropped)
The "upper third" bias for vertical content is intentional. Faces in vertical video almost always sit in the upper third; a centered crop loses them. For higher quality, swap this for a face-aware crop using the bounding box from step 5.
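For the face-aware variant, the core of it is just window arithmetic. The sketch below computes the largest crop window of the target ratio, centers it on a face box, and clamps it to the image. The name crop_window is an assumption for illustration, and the face box is assumed to be an (x, y, w, h) tuple like the ones the Haar detector returns.

```python
def crop_window(img_w: int, img_h: int, face: tuple[int, int, int, int],
                ratio: tuple[int, int] = (16, 9)) -> tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1): the largest ratio-shaped window that fits in
    the image, centered on the face box and clamped to the image bounds."""
    target = ratio[0] / ratio[1]
    if img_w / img_h > target:
        win_w, win_h = int(img_h * target), img_h   # too wide: full height
    else:
        win_w, win_h = img_w, int(img_w / target)   # too tall: full width
    fx, fy, fw, fh = face
    cx, cy = fx + fw / 2, fy + fh / 2               # face center
    x0 = int(min(max(cx - win_w / 2, 0), img_w - win_w))
    y0 = int(min(max(cy - win_h / 2, 0), img_h - win_h))
    return x0, y0, x0 + win_w, y0 + win_h


# Vertical 1080x1920 frame with a face in the upper part.
print(crop_window(1080, 1920, face=(440, 300, 200, 200)))  # (0, 96, 1080, 703)
```

With the window in hand, the crop itself is img[y0:y1, x0:x1], the same slicing crop_to_ratio already uses.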
8. Where to put this in your pipeline
| Where it runs | Notes |
|---|---|
| Inline at upload | Easiest. Adds 5–30s to the "video processing" step. |
| Background worker (recommended) | Consume an "asset.ready" webhook, run, write back to your DB. |
| Re-run on edit | Re-compute when the user trims or reorders clips. |
If you're using a managed video API, you'll already get an "asset ready" webhook. Hook a background worker to that and write the picked thumbnail back to your CMS row. Both FastPix and Mux let you set a custom thumbnail URL on the asset.
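A rough sketch of what that worker's handler might look like. Everything about the event shape here is hypothetical: the "type", "asset_id", and "video_path" fields are placeholders, so check your provider's webhook docs for the real payload. The pick and save callables are injected so the handler stays testable. In the real thing pick would wrap pick_thumbnails and save would write to your DB.

```python
from typing import Callable


def handle_asset_ready(event: dict,
                       pick: Callable[[str], list[dict]],
                       save: Callable[[str, str], None],
                       seen: set[str]) -> bool:
    """Process one 'asset.ready' event. Return True if a pick was attempted."""
    if event.get("type") != "asset.ready":
        return False
    asset_id = event["asset_id"]        # hypothetical field name
    if asset_id in seen:                # the same event may arrive more than once
        return False
    picks = pick(event["video_path"])   # hypothetical field name
    if picks:
        save(asset_id, picks[0]["path"])  # best-scoring frame becomes the default
    seen.add(asset_id)
    return True


# Stubbed usage: a fake picker and a dict standing in for the database.
db: dict[str, str] = {}
done: set[str] = set()
fake_pick = lambda path: [{"t": 1.0, "path": "frames/x.jpg", "score": 0.1}]
event = {"type": "asset.ready", "asset_id": "a1", "video_path": "clip.mp4"}
print(handle_asset_ready(event, fake_pick, db.__setitem__, done))  # True
print(db)  # {'a1': 'frames/x.jpg'}
```

The in-memory seen set only works for a single process; a real deployment would track processed asset IDs in the database.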
What's next
Three upgrades that are worth their complexity:
- Better face detector. Replace Haar with insightface for small-face robustness.
- Learned scorer. If you have click-through data, fine-tune a small head on top of the CLIP embeddings. The hand-tuned prompts get you 80% of the way; learning the last 20% is real ROI.
- Vision-language scoring. Ask a small VLM ("rate this thumbnail's appeal 1-10") instead of cosine similarity. More expensive, sometimes worth it for high-stakes content.
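To give a feel for the learned-scorer idea, here is a tiny logistic-regression head in plain Python. It is a sketch under heavy assumptions: the 2-D vectors stand in for real CLIP image embeddings, the click labels are made up, and train_head and predict are illustrative names. In practice you would train on cached embeddings with a proper ML library.

```python
import math


def predict(w: list[float], b: float, x: list[float]) -> float:
    """Predicted click probability for one embedding."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(embeddings: list[list[float]], clicked: list[int],
               lr: float = 0.5, steps: int = 200) -> tuple[list[float], float]:
    """Fit a linear head with batch gradient descent on log loss."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, clicked):
            err = predict(w, b, x) - y          # gradient of log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: thumbnails along the first axis got clicks, the others did not.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
w, b = train_head(X, y)
print(predict(w, b, [1.0, 0.0]) > predict(w, b, [0.0, 1.0]))  # True
```

The head's output would then replace or blend with the prompt-based score before the top-K step.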
The thing this pipeline gets right is that the model is the cheap part. The leverage is the candidate selection, the structural filters, and the diversity constraint. Get those right with a vanilla CLIP and you're already beating most platforms' default thumbnails.
Happy thumbnail picking.