FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation
2026-04-16 · via cs.CV updates on arXiv.org

Abstract: With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, although unified MLLMs have inherent reasoning capabilities for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on prompt augmentation or holistic image-text alignment judgments, without fine-grained reflection on and refinement of detailed prompt attributes, which limits fine-grained control. To address this limitation, we propose FiRe, a Fine-grained Multimodal Reasoning method for enhanced image generation with MLLMs. Specifically, FiRe performs fine-grained multi-step reasoning: it first decomposes the prompt into key visual requirements, then self-judges whether each requirement is satisfied in the generated image, and finally applies localized refinement according to its own precise feedback. In addition, to further strengthen the MLLM's multimodal reasoning ability, we introduce FiRe-GRPO, a reinforcement learning method tailored to FiRe. Since standard Group Relative Policy Optimization (GRPO) suffers from sparse, outcome-based rewards in multi-step reasoning, we formulate our reasoning process as a step-level decision-making problem, design step-specific rewards, and compute step-level advantages for granular credit assignment within GRPO. Extensive experiments demonstrate that FiRe consistently outperforms competitive text-to-image baselines, including existing reasoning-based methods, with particularly substantial gains on compositional text-to-image benchmarks.
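
To make the contrast between outcome-based and step-level credit assignment concrete, here is a minimal sketch in Python. It is not the FiRe-GRPO formulation, which the abstract does not spell out. It assumes that each rollout in a group receives one scalar reward per reasoning step, and that advantages are obtained by normalizing rewards across the group separately at each step, using the mean and population standard deviation as in the usual group-relative scheme. The function names, the reward layout, and the `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def outcome_level_advantages(final_rewards, eps=1e-8):
    """Group-relative advantage from one final reward per rollout.

    Every step of a rollout would share this single value, which is the
    sparse, outcome-based signal the abstract describes.
    """
    mu = mean(final_rewards)
    sigma = pstdev(final_rewards)
    return [(r - mu) / (sigma + eps) for r in final_rewards]


def step_level_advantages(step_rewards, eps=1e-8):
    """Group-relative advantage computed separately for each step.

    step_rewards[i][t] is the (assumed) reward of rollout i at step t.
    All rollouts are assumed to have the same number of steps.
    """
    if not step_rewards:
        return []
    num_steps = len(step_rewards[0])
    if any(len(rewards) != num_steps for rewards in step_rewards):
        raise ValueError("all rollouts must have the same number of steps")

    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        # Compare rollouts only against each other at the same step.
        column = [rewards[t] for rewards in step_rewards]
        mu = mean(column)
        sigma = pstdev(column)
        for i, r in enumerate(column):
            advantages[i][t] = (r - mu) / (sigma + eps)
    return advantages


# Hypothetical group of 4 rollouts with 3 steps each, for example
# decomposition, self-judgment, and refinement.
group = [
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
step_adv = step_level_advantages(group)
outcome_adv = outcome_level_advantages([rewards[-1] for rewards in group])
```

In this toy group every rollout ends with the same final reward, so the outcome-level advantages are all zero and carry no learning signal, while the step-level advantages still separate rollouts that did well or poorly at the first two steps.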
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2604.13491 [cs.CV]
  (or arXiv:2604.13491v3 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2604.13491

arXiv-issued DOI via DataCite

Submission history

From: Yongjin Kim
[v1] Wed, 15 Apr 2026 05:24:29 UTC (6,160 KB)
[v2] Thu, 16 Apr 2026 04:19:42 UTC (6,160 KB)
[v3] Tue, 26 May 2026 14:06:31 UTC (5,993 KB)