

























Abstract: Software supply chain attacks on the npm ecosystem have grown increasingly sophisticated, exploiting obfuscation and complex logic to evade detection. Large Language Models (LLMs) offer strong semantic understanding of code but face practical constraints: limited context windows and high inference costs make full-package analysis infeasible, while naive token-based splitting fragments semantic context and degrades accuracy. This paper introduces an LLM-based framework for malicious npm package detection built on code-slicing techniques. We propose an adaptation of taint-based slicing for the npm ecosystem, guided by a curated inventory of JavaScript-specific sensitive APIs, to isolate security-relevant data flows from benign boilerplate. The approach reduces the mean input token count by 99.75% and the median by 93.7% while preserving critical malicious behaviors. Packages relying on dynamic code generation or obfuscation yield empty slices under static analysis and require deobfuscation preprocessing, a limitation we explicitly discuss. The framework is evaluated on a dataset of more than 7000 malicious and benign npm packages using DeepSeek-Coder 6.7B. On the 2537 packages amenable to static taint analysis, taint-based slicing achieves 87.04% detection accuracy, outperforming both a naive token-splitting baseline at 75.41% and a CFG-only static slicing approach at 75.65%. These results demonstrate that semantically targeted input representations improve LLM-based detection performance beyond what is achievable through simple input-size reduction, providing an effective and computationally practical defense against evolving open-source supply-chain threats.
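The idea of taint-based slicing mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Stmt` representation, the `SOURCES` and `SINKS` sets, and the `taint_slice` function are all hypothetical names chosen for illustration, and the sketch assumes straight-line code with no reassignment, branching, or aliasing. It keeps only statements that lie on a data flow from a sensitive source to a sensitive sink.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical, illustrative API inventory (not the paper's curated list).
SOURCES = {"process.env", "fs.readFileSync"}
SINKS = {"https.request", "child_process.exec", "eval"}


@dataclass(frozen=True)
class Stmt:
    """Toy stand-in for one JavaScript statement (an assumption of this sketch)."""
    line: int
    target: Optional[str]      # variable defined by the statement, if any
    uses: Tuple[str, ...]      # variables read by the statement
    call: Optional[str]        # API invoked by the statement, if any


def taint_slice(stmts: List[Stmt]) -> List[Stmt]:
    """Return statements on a source-to-sink data flow (straight-line code only)."""
    # Forward pass: collect statements that read a source or a tainted variable.
    tainted = set()
    carriers = []
    for s in stmts:
        if s.call in SOURCES or any(u in tainted for u in s.uses):
            carriers.append(s)
            if s.target is not None:
                tainted.add(s.target)

    # Backward pass: keep only tainted statements whose data reaches a sink.
    needed = set()
    kept = []
    for s in reversed(carriers):
        if s.call in SINKS or s.target in needed:
            kept.append(s)
            needed.update(s.uses)
    return sorted(kept, key=lambda s: s.line)


# Toy program: an environment value is encoded and sent over the network.
PROGRAM = [
    Stmt(1, "token", (), "process.env"),
    Stmt(2, "banner", (), None),
    Stmt(3, "payload", ("token",), "Buffer.from"),
    Stmt(4, None, ("banner",), "console.log"),
    Stmt(5, None, ("payload",), "https.request"),
    Stmt(6, "tmp", (), "fs.readFileSync"),
]

slice_lines = [s.line for s in taint_slice(PROGRAM)]
print(slice_lines)
```

In this toy input, the logging statement and the file read that never reaches a sink are dropped, which is the kind of input reduction the abstract describes. A real analysis would operate on a JavaScript AST with control flow and would need the deobfuscation preprocessing the abstract mentions for dynamically generated code.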
From: Duc-Ly Vu
[v1] Sat, 13 Dec 2025 12:56:03 UTC (106 KB)
[v2] Sat, 10 Jan 2026 14:03:54 UTC (92 KB)
[v3] Sat, 13 Jun 2026 12:57:18 UTC (136 KB)