























Despite the empirical successes of Large Language Models (LLMs), the prevailing paradigm is heuristic and experiment-driven, tethered to massive compute and data, while a first-principles theory remains absent. This treatise develops a Semantic Information Theory at the confluence of statistical physics, signal processing, and classical information theory, organized around a single paradigm shift: replacing the classical BIT - a microscopic substrate devoid of semantic content - with the macroscopic TOKEN as the atomic carrier of meaning and reasoning. Within this framework we recast attention and the Transformer as energy-based models, and interpret semantic embedding as vectorization on the semantic manifold. Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a *directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. This machinery makes precise the identification of next-token prediction with Granger causal inference, and sharpens the limits of LLM reasoning against Pearl's Ladder of Causation - affirming that *whereas the BIT defined the Information Epoch, the TOKEN will define the AI Epoch.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。