MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings
Dineth Jayakody · 2026-05-05 · via cs.CV updates on arXiv.org

Abstract: Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.
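The late-fusion step described in the abstract (each modality mapped to a normalized concern signal, then aggregated with hand-specified weights) could look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' operator: the abstract does not give the weight values, the exact aggregation rule, or how missing modalities are handled. The modality names, the numbers in `ILLUSTRATIVE_WEIGHTS`, the weighted-average rule, and the renormalization over available modalities are all hypothetical choices made here for concreteness.

```python
from typing import Dict, Optional

# Hypothetical weights chosen only for illustration; the abstract says the
# real weights are hand-specified heuristics but does not list them.
ILLUSTRATIVE_WEIGHTS: Dict[str, float] = {
    "symptoms": 0.3,
    "cough_audio": 0.2,
    "speech": 0.1,
    "radiograph": 0.4,
}


def fuse_concern_signals(
    signals: Dict[str, Optional[float]],
    weights: Dict[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> float:
    """Combine per-modality concern signals in [0, 1] into one estimate.

    Assumption: a plain weighted average. Modalities that are absent or
    None are skipped and the remaining weights are renormalized, so the
    result stays in [0, 1].
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, weight in weights.items():
        value = signals.get(name)
        if value is None:
            continue  # modality not captured for this case
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} signal must lie in [0, 1], got {value}")
        weighted_sum += weight * value
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("at least one modality signal is required")
    return weighted_sum / total_weight


if __name__ == "__main__":
    example = {"symptoms": 0.9, "cough_audio": 0.4, "radiograph": 0.7}
    print(f"screening estimate: {fuse_concern_signals(example):.3f}")
```

Keeping the weights in a plain dictionary mirrors the abstract's point that they are interpretable parameters rather than learned values: a reader can see exactly how much each modality contributes to the final estimate.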
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2605.02207 [cs.CV]
  (or arXiv:2605.02207v2 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2605.02207

arXiv-issued DOI via DataCite

Submission history

From: Dineth Jayakody
[v1] Mon, 4 May 2026 04:14:35 UTC (950 KB)
[v2] Tue, 26 May 2026 05:28:55 UTC (952 KB)