

Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Yuchen Feng · 2026-05-26 · via cs.CV updates on arXiv.org


Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
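The scan, extend, and drop cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how saliency is derived from the attention map or how TokenSR produces new tokens. The sketch assumes saliency is the mean attention a visual token receives, assumes focus is a top-k selection, scores only the original tokens for simplicity, and stands in for TokenSR with placeholder sub-tokens. All names here (Token, saliency, top_k, blink_step, k, factor) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Token:
    base: int                  # index of the original visual token
    sub: Optional[int] = None  # None for original tokens, 0..factor-1 for extended ones


def saliency(attn_rows: List[List[float]]) -> List[float]:
    """Assumed saliency: mean attention each visual token receives over all query rows."""
    n_queries = len(attn_rows)
    n_tokens = len(attn_rows[0])
    return [sum(row[j] for row in attn_rows) / n_queries for j in range(n_tokens)]


def top_k(scores: List[float], k: int) -> Set[int]:
    """Indices of the k highest scores; ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return set(order[:k])


def blink_step(tokens: List[Token], base_scores: List[float],
               k: int, factor: int = 2) -> List[Token]:
    """One layer: drop extensions that lost focus, then extend newly salient tokens."""
    focus = top_k(base_scores, k)
    # Original tokens always stay; extended tokens stay only while their base is in focus.
    kept = [t for t in tokens if t.sub is None or t.base in focus]
    already_extended = {t.base for t in kept if t.sub is not None}
    # Placeholder for TokenSR: add `factor` sub-tokens per newly salient base token.
    for i in sorted(focus - already_extended):
        kept.extend(Token(base=i, sub=s) for s in range(factor))
    return kept


# Two layers over four visual tokens; attention rows are made-up numbers.
tokens = [Token(base=i) for i in range(4)]
tokens = blink_step(tokens, saliency([[0.1, 0.6, 0.2, 0.1],
                                      [0.1, 0.5, 0.3, 0.1]]), k=1)
print([t.base for t in tokens if t.sub is not None])  # extensions of token 1
tokens = blink_step(tokens, saliency([[0.1, 0.1, 0.1, 0.7]]), k=1)
print([t.base for t in tokens if t.sub is not None])  # focus has moved to token 3
```

In this toy run the sequence length stays bounded because extensions are released as soon as their base token leaves the focus set, which mirrors the balance between broad exploration and fine-grained focus that the abstract describes.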
Comments: CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2512.10548 [cs.CV]
  (or arXiv:2512.10548v3 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2512.10548

arXiv-issued DOI via DataCite

Submission history

From: Yuchen Feng
[v1] Thu, 11 Dec 2025 11:27:25 UTC (2,863 KB)
[v2] Wed, 25 Mar 2026 15:36:38 UTC (2,864 KB)
[v3] Sat, 23 May 2026 16:19:27 UTC (2,864 KB)