


























Abstract: Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, directly feeding high-frequency, multi-dimensional sensor data, such as eye-tracking data, to LLMs leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms sensor signals into data visualization images used as input to multimodal LLMs (MLLMs), focusing on eye-tracking data. We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types (timeline, heatmap, and scanpath) under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT contexts.
| Comments: | 6 pages. Conditionally accepted to IEEE PacificVis 2026 (VisNotes track) |
| Subjects: | Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2604.09585 [cs.HC] |
| | (or arXiv:2604.09585v1 [cs.HC] for this version) |
| | https://doi.org/10.48550/arXiv.2604.09585 (arXiv-issued DOI via DataCite) |
From: Jae Young Choi
[v1] Fri, 27 Feb 2026 02:47:32 UTC (3,542 KB)