

















Abstract:Selecting informative frames from long videos is a combinatorial problem that existing methods address either through efficient heuristics without explicit modeling of query-conditioned temporal structure, or through multi stage retrieval pipelines with substantial preprocessing cost. We propose \textbf{CREST}, a training-free frame selection method grounded in the temporal geometry of query--frame relevance. CREST is based on the observation that relevance over time exhibits structured local variation: sharp curvature around salient events and flatter regions in redundant segments. By using local curvature to guide selection, CREST allocates a fixed frame budget more effectively across brief decisive events and slowly evolving evidence. Under a fixed backbone and frame budget, CREST achieves higher accuracy than AKS, a lightweight relevance--coverage baseline, on LongVideoBench and VideoMME, while retaining 93--95\% of the accuracy of MIRA, a stronger multi-stage retrieval pipeline, at only 3--4\% of its preprocessing cost.\footnote{Code and implementation details are included in the supplementary material and will be released publicly upon acceptance.} On TempRel, our diagnostic benchmark for temporal frame selection, CREST achieves a 6.88\% relative improvement over AKS. Pairwise LLM-as-a-judge evaluation further shows that CREST-selected frames yield more coherent frame-conditioned descriptions, with win rates of 60.58\% and 54.50\% on the two benchmarks. These results show that local temporal geometry provides a simple and efficient basis for long-video frame selection.
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.09223 [cs.CV] |
| (or arXiv:2605.09223v2 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.09223 arXiv-issued DOI via DataCite |
From: Abdul Mohaimen Al Radi [view email]
[v1]
Sat, 9 May 2026 23:47:46 UTC (19,742 KB)
[v2]
Sun, 24 May 2026 20:57:02 UTC (5,464 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。