OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

[Submitted on 7 Jun 2026] · 2026-06-09 · via cs.CV updates on arXiv.org

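This content was automatically aggregated and compiled by 惯性聚合 (an RSS reader) and is provided for reading reference only. The original text comes from — copyright belongs to the original author.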