Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

[Submitted on 10 Jun 2026] · 2026-06-11 · via cs.CV updates on arXiv.org
