Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments
Van-Quang Nguyen · 2026-05-26 · via cs.AI updates on arXiv.org

Abstract: Advances at the intersection of computer vision and natural language processing are crucial for applications such as assistive technology, multimedia querying, and robotics. This dissertation proposes novel architectures that improve intelligent agents on three key vision-language tasks: image captioning, visual dialog, and interactive instruction following.
First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and incur high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both accuracy and inference speed.
Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling the interactions among multiple inputs (image, question, and dialog history). We introduce LTMI (Light-weight Transformer for Many Inputs). Built on a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while using less than one-tenth of its parameters, as validated on the VisDial dataset.
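As a rough illustration of what "modeling interactions between multiple inputs" involves, the sketch below lets one question vector attend separately to an image source and a history source, then merges the results. It is a minimal toy under stated assumptions, not the LTMI attention block: the function names, the 2-d feature values, and the choice to average the attended vectors (rather than use any learned projection) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, source):
    """Scaled dot-product attention of one query vector over a list of source vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in source]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, source)) for i in range(d)]


def attend_many(query, sources):
    """Attend to each named source separately, then average the attended vectors.

    Averaging is an assumption made for brevity; a real model would learn
    how to combine the per-source results.
    """
    attended = [attend(query, src) for src in sources.values()]
    return [sum(col) / len(attended) for col in zip(*attended)]


# Toy 2-d features (hypothetical values, not taken from VisDial).
question = [1.0, 0.0]
sources = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "history": [[0.5, 0.5]],
}
updated = attend_many(question, sources)
print(updated)
```

Because the question vector aligns with the first image feature, the merged vector leans toward the first dimension. Each additional input type adds one more entry to `sources` rather than a new pairwise module, which is the kind of scaling concern the paragraph above describes.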
Finally, we study interactive instruction following for embodied AI using the ALFRED dataset. We propose a framework built around two-stage instruction interpretation: it first decodes the language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art success rate of 8.37% in unseen environments.
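The two-stage interpretation can be pictured with a small toy pipeline. This is a sketch under loud assumptions, not the dissertation's model: a keyword lookup stands in for the learned language decoder, per-view detection scores stand in for visual features, and the vocabulary, view names, and scores are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical vocabulary; a learned decoder would replace this lookup.
ACTIONS = {"pick": "PickupObject", "put": "PutObject", "open": "OpenObject"}
OBJECTS = {"mug", "apple", "fridge", "table"}


def decode_language(directive: str) -> List[Tuple[str, str]]:
    """Stage 1: predict a tentative (action, object) sequence from text alone."""
    plan: List[Tuple[str, str]] = []
    action: Optional[str] = None
    for word in directive.lower().replace(",", " ").split():
        if word in ACTIONS:
            action = ACTIONS[word]
        elif word in OBJECTS and action is not None:
            plan.append((action, word))
            action = None
    return plan


def ground_in_views(
    plan: List[Tuple[str, str]],
    views: Dict[str, Dict[str, float]],
) -> List[Tuple[str, str, Optional[str]]]:
    """Stage 2: for each tentative step, pick the egocentric view in which
    the target object has the highest detection score (None if never seen)."""
    grounded = []
    for action, obj in plan:
        best_view, best_score = None, 0.0
        for name, detections in views.items():
            score = detections.get(obj, 0.0)
            if score > best_score:
                best_view, best_score = name, score
        grounded.append((action, obj, best_view))
    return grounded


# Hypothetical detection scores for three egocentric views.
views = {"front": {"table": 0.9}, "left": {"mug": 0.8}, "right": {"mug": 0.3}}
plan = decode_language("Pick up the mug, then put it on the table")
steps = ground_in_views(plan, views)
print(plan)
print(steps)
```

Stage 1 never looks at `views`, mirroring the idea of decoding directives independently of visual context; only stage 2 consults the visual side to decide where each step should be executed.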
Comments: Doctoral dissertation, Tohoku University, 2022. Uploaded for archival purposes. 146 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.24020 [cs.CV]
  (or arXiv:2605.24020v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2605.24020

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Van-Quang Nguyen
[v1] Wed, 20 May 2026 06:11:25 UTC (22,466 KB)