Will setting the temperature to zero make an LLM deterministic?
Sara Zan · 2026-06-17 · via Hacker News - Newest: "LLM"
We all know LLMs don't always respond the same way to slight changes in the prompt. But why do their answers also differ when the prompt is identical? And what can we do to prevent it?

This is episode 8 of a series of shorter blog posts answering questions I received during the course of my work. They discuss common misconceptions and doubts about various generative AI technologies. You can find the whole series here: Practical Questions.


One common explanation of the "temperature" parameter of LLMs is that it represents the "randomness" of the answer.

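That's broadly correct. Temperature is a parameter of the LLM's final decoding step, and the only one in the whole Transformer architecture that deliberately introduces randomness at inference time. At this stage, once the model has calculated the logits of the next token candidates, it has to map those values to an actual token from a list. Normally, LLMs perform best when they're allowed to pick not necessarily the single best token, but instead choose at random among the most likely ones. Strictly speaking, the temperature divides the logits before the softmax that turns them into probabilities: a high temperature flattens the distribution so that more candidates have a realistic chance of being picked, while a low temperature concentrates almost all the probability on the top few. In that loose sense, the temperature controls how many tokens are effectively in the running.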
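A common way to implement this step is to divide the logits by the temperature, apply a softmax, and sample from the result, with temperature 0 treated as a plain argmax. The snippet below is a minimal sketch of that idea, not any specific library's implementation; the logits and function names are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, temperature, rng):
    """Greedy argmax at temperature 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.5, 0.2]  # hypothetical scores for three candidate tokens
rng = random.Random(0)

print(softmax_with_temperature(logits, 1.0))  # probability spread over the candidates
print(softmax_with_temperature(logits, 0.1))  # almost everything on the top token
print([pick_token(logits, 0, rng) for _ in range(5)])  # [0, 0, 0, 0, 0]
```

At temperature 0 the random number generator is never consulted, which is why it is tempting to conclude that the output must be fixed.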

Therefore, when we set the temperature to 0, the LLM must always choose the best next token, without making random choices. So, if the input is fixed and we have removed the only source of randomness in the architecture, the outputs should always be identical... right?

And yet, in practice, they often are not. Run the same prompt twice, with the same model, the same parameters, and temperature 0, and sooner or later the output will be a bit different. Not by much, usually. It may start with just one word; then the sentence takes a slightly different spin, until eventually the rest of the completion drifts away.

What's going on?

Imperfect computations

If we pretend an LLM is just a mathematical function, temperature=0 should indeed make decoding deterministic. At each step, the model emits logits, we take the argmax token, append it to the context, and repeat. The problem is that real inference is performed with floating-point arithmetic on massively parallel hardware, usually on a server that is trying to be as fast as possible rather than mathematically pristine.

Floating-point arithmetic is only an approximation of real-number arithmetic. In particular, it is not associative: in ordinary math, (a + b) + c = a + (b + c) always holds, but with floating-point numbers those two expressions can produce slightly different results because each intermediate step is rounded. The same applies to matrix multiplications, reductions, and accumulations throughout a neural network. Change the order of operations, and you can change the last few bits of the result.
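The non-associativity is easy to reproduce with ordinary double-precision floats. The values below are just a classic illustration, not numbers taken from a real model.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # rounds after a + b, then again after adding c
right = a + (b + c)  # same terms, different grouping, different rounding

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```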

Usually, those differences are tiny and often irrelevant, but in this case they have an impact. If two candidate next tokens have very similar logits, a minute numerical difference can swap their order, and once one token changes, the next decoding step runs on a different prefix, so the divergence compounds. The sampling rule is deterministic, while the computation that produced the logits is not guaranteed to be identical across runs.
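Here is a small sketch of how the order of accumulation alone can flip an argmax. The contributions and the rival logit are hypothetical, and the magnitudes are deliberately exaggerated so the effect does not depend on a near-tie; in a real model the differences sit in the last few bits and only matter when two logits are almost equal.

```python
def accumulate(terms):
    """Sum left to right, rounding after every addition like a simple reduction."""
    total = 0.0
    for t in terms:
        total += t
    return total

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical contributions to the logit of token 1.
contributions = [1e16, 1.0, -1e16, 0.5]
reordered = [1e16, -1e16, 1.0, 0.5]  # same numbers, different order

rival_logit = 1.0  # logit of token 0, identical in both runs

run_a = [rival_logit, accumulate(contributions)]
run_b = [rival_logit, accumulate(reordered)]

print(run_a, argmax(run_a))  # [1.0, 0.5] 0
print(run_b, argmax(run_b))  # [1.0, 1.5] 1
```

The selection rule (argmax) is identical in both runs. Only the logits fed into it differ, and that is enough to pick a different token.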

You can think of it this way: sampling determinism is not the same thing as system determinism.

It gets worse

However, this is only part of the problem. You may already be objecting that running the same matrix multiplication on a GPU repeatedly with the same data always produces bitwise-identical results. Those computations are also done in floating-point arithmetic, and there are surely other jobs running on the GPU while your computer is on. So why are these calculations deterministic, while LLM sampling with temperature=0 is not?

In a recent post on Thinking Machines' blog, Defeating Nondeterminism in LLM Inference, Horace He digs even deeper into the issue. It's not merely that floating-point arithmetic is imperfect. Modern inference systems also need to batch requests together, and the result for one request can depend on the batch context in which it was executed. For a given exact batch, the forward pass may be deterministic. But from the user's point of view, the system is still nondeterministic, because the batch itself is not stable from run to run. Your prompt may be identical, but the inputs that get batched together with yours are not.
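One way to picture this: suppose a reduction is split into chunks, and the chunk size depends on how much other work shares the batch. Each configuration is perfectly repeatable on its own, yet the two disagree with each other. The sketch below is a toy stand-in in plain Python; the chunk sizes and values are assumptions chosen to make the effect obvious, not how any real GPU kernel is configured.

```python
def chunked_sum(values, chunk_size):
    """Sum each chunk first, then add the partial sums together."""
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

values = [1e16, 1.0, -1e16, 1.0]  # exaggerated magnitudes, same data in both runs

# Same input, two reduction layouts (imagine they were picked for two batch sizes).
print(chunked_sum(values, chunk_size=4))  # 1.0
print(chunked_sum(values, chunk_size=2))  # 0.0

# Each layout on its own gives the same answer every time it runs.
print(chunked_sum(values, chunk_size=2) == chunked_sum(values, chunk_size=2))  # True
```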

This is also why a prompt can look stable in local testing and then become flaky in production: the model did not suddenly become more creative; it's the system conditions that changed. Setting temperature=0 makes only the token selection rule deterministic. It does not guarantee that the entire inference system will produce exactly the same logits every time.

Can it be fixed?

The way LLM inference works today, especially at scale, doesn't leave us with many options to enforce the conditions that can guarantee deterministic outputs. There are only trade-offs, and they differ quite a lot between hosted APIs and self-hosted inference.

Fixed seeds

To reduce randomness and make LLM outputs reproducible, some people recommend using a fixed seed, and indeed some providers expose one. OpenAI, for example, documents a seed parameter and says it makes a best effort to sample deterministically, while explicitly warning that determinism is not guaranteed and that backend changes can still affect outputs. Their system_fingerprint field exists precisely so you can notice when the underlying serving configuration has changed.

The problem with fixed seeds is that they help reproduce results when the temperature is above zero, not when it's already zeroed out. That's because a fixed seed controls the randomness of the sampling step: by setting the temperature to zero, we are already removing that source of randomness, so the net result is identical with or without a fixed seed, while every other source of nondeterminism coming from the GPU and the rest of the stack is unaffected.

So fixed seeds are worth using when you are trying to get the same results for a call with non-zero temperature, such as for tests, demos, and regression checks. But you must keep in mind that they affect only the sampler, and they won't help you when temperature is zero.
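The difference is easy to see with a toy sampler. This is a sketch under simplified assumptions (one sampling step, made-up logits, Python's own random generator standing in for the provider's sampler), not how any hosted API implements its seed parameter.

```python
import math
import random

def sample(logits, temperature, seed):
    """Pick one token index; the seed only matters when temperature > 0."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    weights = [math.exp(x / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.9, 1.8, 0.5]  # hypothetical logits

# Same seed, temperature above zero: the draw is reproducible.
print(sample(logits, 1.0, seed=42) == sample(logits, 1.0, seed=42))  # True

# Temperature zero: every seed yields the argmax, so the seed changes nothing.
print({sample(logits, 0, seed=s) for s in range(100)})  # {0}
```

Note that the logits are hard-coded here. In a real system they come out of the forward pass, which is exactly the part the seed has no influence on.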

No parallel jobs

If you self-host, one option to drastically reduce randomness is to reduce or eliminate concurrency.

This works for the simple reason that it stabilizes batching and scheduling. vLLM's reproducibility guidance says that it does not guarantee reproducibility by default. In offline mode, you should disable multiprocessing to make scheduling deterministic, while in online mode, you need batch invariance support if you want outputs that are insensitive to batching. vLLM also documents batch invariance as a distinct feature and notes that it currently depends on specific hardware support.

This means that you can pick a few different configurations, depending on your needs:

  • shared online serving with dynamic batching: fastest, cheapest, least reproducible
  • isolated worker / no concurrent jobs: slower, more expensive, more reproducible
  • specialized batch-invariant serving paths: better reproducibility, but with hardware and feature constraints

The overall pattern is that the more you optimize for throughput, the more reproducibility suffers.
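On the application side, the "no concurrent jobs" option can be approximated by funnelling every request through a single lock. The sketch below assumes a hypothetical generate function and a fake stand-in for it; it only controls what your own process sends, so the inference engine behind it still has to be configured not to batch your request with anyone else's.

```python
import threading
import time

class SerializedWorker:
    """Run at most one generation at a time so requests never overlap."""

    def __init__(self, generate_fn):
        self._generate = generate_fn  # hypothetical: prompt -> completion
        self._lock = threading.Lock()

    def generate(self, prompt):
        with self._lock:
            return self._generate(prompt)

# Fake generator that records how many calls are in flight at once.
stats = {"active": 0, "max_active": 0}
stats_lock = threading.Lock()

def fake_generate(prompt):
    with stats_lock:
        stats["active"] += 1
        stats["max_active"] = max(stats["max_active"], stats["active"])
    time.sleep(0.01)  # pretend to do some work
    with stats_lock:
        stats["active"] -= 1
    return prompt.upper()

worker = SerializedWorker(fake_generate)
threads = [
    threading.Thread(target=worker.generate, args=(f"prompt {i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats["max_active"])  # 1
```

The cost is visible in the same snippet: eight requests now take eight times as long as one.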

Cache responses

Caching doesn't exactly address the reproducibility issue per se, but in many applications it's the right level of abstraction if you want the same input to produce the same output. It's often not only the most viable option, but also the cheapest, simplest, and fastest, unless you're running a benchmark or an evaluation.

In practice, if you just need the same visible result for the same request, the most reliable method is not to regenerate it at all. Normalize the prompt, model ID, and relevant parameters into a cache key, store the first successful response, and serve that on subsequent identical requests. This does not make the model deterministic, of course, but it does make your application deterministic at the interface boundary, which is usually what application builders need.
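A minimal in-memory version of that idea could look like the sketch below. The generate function, the model name, and the choice to collapse whitespace in the prompt are all assumptions made for illustration; a real deployment would pick its own normalization rules and a persistent store.

```python
import hashlib
import json

def cache_key(model_id, prompt, params):
    """Build a stable key from the normalized request."""
    normalized = {
        "model": model_id,
        "prompt": " ".join(prompt.split()),  # assumption: whitespace is not significant
        "params": params,
    }
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    def __init__(self, generate_fn):
        self._generate = generate_fn  # hypothetical: (model_id, prompt, params) -> str
        self._store = {}

    def get(self, model_id, prompt, params):
        key = cache_key(model_id, prompt, params)
        if key not in self._store:
            # Only reached on a miss; if generation raises, nothing is stored.
            self._store[key] = self._generate(model_id, prompt, params)
        return self._store[key]

# A deliberately nondeterministic stand-in for the model.
calls = {"n": 0}

def flaky_generate(model_id, prompt, params):
    calls["n"] += 1
    return f"answer #{calls['n']}"

cache = ResponseCache(flaky_generate)
params = {"temperature": 0, "max_tokens": 64}
first = cache.get("example-model", "What is 2 + 2?", params)
second = cache.get("example-model", "What  is 2 + 2? ", params)

print(first == second, calls["n"])  # True 1
```

Including the model ID and the parameters in the key matters: changing either one should produce a fresh generation rather than a stale answer.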

Caching also has a very nice advantage over seeds and scheduler tricks: it does not depend on hidden implementation details inside the inference stack.

Of course, caching has limits. It only helps when requests repeat, and it can become awkward if tool calls, timestamps, external retrieval, or hidden context make two apparently identical requests not truly identical. Still, it is usually far more convenient than any other solution to this problem, and the only practical one for most production systems.

Conclusion

When faced with LLM nondeterminism, the common reaction is to treat it like a bug and try to eliminate it. However, you should also keep in mind that LLMs were designed with a randomness factor built in for a reason: they perform much better when they are allowed a slight degree of nondeterminism.

I get it: nobody likes having such a huge, random black box at the core of an application's business logic. But removing randomness from the outputs is not the right way to manage an LLM's behavior. If you need completely deterministic output, it is better to use the LLM to design a decision tree (or a more sophisticated model, if needed) and then use that in your application.

Handling LLM outputs is rather a matter of validation. Use schemas and validators so small textual drift does not break downstream code. Use evals instead of spot-checking. Cache where consistency matters, or where you need to save a few bucks. In other words, handle the randomness at the system boundary rather than trying to remove it from the model itself.
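As a small illustration of validating at the boundary, the sketch below accepts a model reply that is supposed to be a JSON object with a label and a confidence, and normalizes the harmless variations. The schema and field names are hypothetical.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical schema

def parse_sentiment(raw_output):
    """Validate a reply expected to look like {"label": ..., "confidence": ...}."""
    data = json.loads(raw_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = str(data.get("label", "")).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return {"label": label, "confidence": confidence}

# Two replies that drift in casing, spacing, and key order map to the same label.
print(parse_sentiment('{"label": "Positive", "confidence": 0.92}'))
# {'label': 'positive', 'confidence': 0.92}
print(parse_sentiment('{"confidence": 0.9, "label": " positive "}'))
# {'label': 'positive', 'confidence': 0.9}
```

Downstream code then depends on the validated dictionary, not on the exact text the model happened to produce.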