Structure of Every LLM Chat

Arpit Bhayani · 2026-05-13 · via Hacker News - Newest: "LLM"

If you have only ever interacted with a language model through a chat interface, you have seen one layer of abstraction that hides a lot of engineering. Behind the friendly chat window, every interaction with a modern LLM is structured as a list of messages, each tagged with a role.

That role tagging is not cosmetic. It shapes how the model responds, how context is managed across multiple turns, and how application developers constrain and direct model behaviour at a structural level. Understanding this format is the difference between using an LLM and building reliably on top of one.

Why Roles Exist at All

Base language models - the kind trained purely on next-token prediction over raw text - do not have a natural concept of “conversation.” They continue text. If you feed a base model the string “What is the capital of France?”, it might continue with “What is the capital of Germany? What is the capital of Spain?” because that pattern appears frequently in quiz and FAQ content. The model is doing exactly what it was trained to do: predict plausible continuations.

Instruction-following models (the kind you interact with in production APIs) are fine-tuned on data formatted as conversations. During this fine-tuning, the model sees thousands of examples where a system context is followed by a user request and then a high-quality assistant response. The model learns to treat these structural cues as meaningful. It learns that text following a system prefix should be treated as persistent instructions, that text following a user prefix is a request to respond to, and that it is generating the text that follows the assistant prefix.

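The three-role format is therefore not arbitrary. It emerged from how instruction tuning works, and the production chat models from OpenAI, Google, Anthropic, and Meta are all trained on some variant of it, even where the role names differ (Google’s Gemini API, for example, calls the assistant role “model”).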

The System Prompt

The system prompt is the foundational instruction layer of a conversation. It is written by the application developer, not the end user, and it is placed at the start of the context, before any user message.

A well-crafted system prompt does several things:

Defines the model’s persona and role (“You are a senior data analyst…”).
Specifies output format constraints (“Always respond in valid JSON with the schema: …”).
Establishes scope boundaries (“Only answer questions about our product documentation. Politely decline off-topic requests.”).
Sets behavioural rules (“Never speculate. If you are uncertain, say so explicitly.”).
Injects background context the model needs (“The current date is… The user’s subscription tier is…”).

The system prompt is processed before the first user message and its content persists through the entire conversation in the model’s context window. It is the most reliable lever you have for controlling model behaviour consistently across all turns.

One critical insight: the system prompt does not have magic authority in the way a configuration file has authority over software. The model has learned to attend to system content heavily because of how it was trained, but it is ultimately still performing token prediction.

A sufficiently adversarial user prompt can sometimes cause the model to deviate from system instructions - this is the class of vulnerabilities known as prompt injection. Never trust that a system prompt alone is a security boundary. Validate and sanitize outputs programmatically when the stakes are high.
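One way to act on this is to check every model reply in application code before anything downstream trusts it. The sketch below assumes the system prompt asked for a JSON object with `intent` and `reply` keys. That contract, the `ALLOWED_INTENTS` set, and the `validate_reply` helper are hypothetical names chosen for illustration, not part of any provider's API.

```python
import json

# Hypothetical contract: the system prompt asks for {"intent": ..., "reply": ...}.
ALLOWED_INTENTS = {"order_status", "account_settings", "handoff"}
MAX_REPLY_CHARS = 1000


def validate_reply(raw: str) -> dict:
    """Return the parsed reply, or raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != {"intent", "reply"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")

    intent = data["intent"]
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"intent not allowed: {intent!r}")

    reply = data["reply"]
    if not isinstance(reply, str) or len(reply) > MAX_REPLY_CHARS:
        raise ValueError("reply must be a string within the length limit")
    return data
```

The point of the design is that the allowlist lives in code, where a prompt cannot rewrite it. A reply that fails validation can be retried or routed to a fallback instead of being shown or acted upon.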

Here is a minimal but structurally sound system prompt for a customer support application:

You are a support assistant for Acme Corp. Your job is to help customers with questions about their orders and account settings.

Rules:
- Only discuss topics related to Acme Corp products and services.
- If you cannot answer with certainty, say "I am not sure - let me connect you with a human agent."
- Never disclose internal pricing strategies or supplier information.
- Always address the customer by their first name if provided.
- Respond concisely. Aim for 2-4 sentences unless the customer asks for detail.

Notice that it defines role, scope, fallback behaviour, confidentiality constraints, and style. These five categories cover most of what a useful system prompt needs to specify.

The User Turn

The user turn is the input from the person or the system acting as a person. In a simple chatbot, this is what the human typed. In a programmatic pipeline, this is often constructed by application code - injecting a retrieved document, formatted data, or a templated instruction.

A common mistake is treating the user turn as a place to put everything. Developers sometimes cram persona, instructions, data, and the actual question into a single user message because they are not using the system prompt at all.

This works, to a point, but it conflates different layers of intent. The model is somewhat sensitive to where instructions come from, and instructions in the user turn carry less persistent authority than those in the system prompt. More importantly, when you start managing multi-turn conversations, conflation becomes a maintenance problem.

The user turn should contain:

The actual request or question.
Any data or documents that are specific to this request (e.g. “Here is the PDF text - summarise it.”).
Context that is specific to this turn (e.g. “Given the plan we discussed above…”).

It should not contain:

Persistent behavioural instructions. Those belong in the system prompt.
Security-sensitive constraints. A user can modify their own messages; they cannot modify the system prompt (in a properly built application).
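The split above can be sketched as a small message builder: persistent instructions stay in the system message, while the request and its per-request document travel in the user message. The `<document>` wrapper tags, the `SYSTEM_PROMPT` text, and both helper names are assumptions made for this example.

```python
SYSTEM_PROMPT = (
    "You are a summarisation assistant. "
    "Summarise only the document supplied by the user."
)


def build_user_turn(question: str, document: str) -> dict:
    """Package the request and its per-request data as one user message."""
    content = (
        f"{question.strip()}\n\n"
        "<document>\n"
        f"{document.strip()}\n"
        "</document>"
    )
    return {"role": "user", "content": content}


def build_messages(question: str, document: str) -> list:
    """System prompt first, then the user turn; no user data in the system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        build_user_turn(question, document),
    ]
```

Because the document never touches the system message, changing the data for one request cannot change the instructions that apply to every request.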

The Assistant Turn

The assistant turn is the model’s previous response, injected back into the conversation for the next request. This is the mechanism that gives a language model what looks like memory in a multi-turn conversation.

Here is the part that surprises many developers: the model has no persistent state between API calls. Every call is stateless. The model does not remember the previous turn - you have to send it back. When you make a second API call in a conversation, your application must include the entire conversation history: system prompt, first user message, first assistant response, second user message, and so on. The model attends to all of it to generate the next response.

This has immediate engineering consequences:

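Token costs per call grow linearly with conversation length. The 20th call of a conversation sends on the order of 20x more input tokens than a single-turn call, because the entire history is in every request. Since that growing input is paid on every call, the cumulative input tokens across the whole conversation grow roughly quadratically.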
Context windows are finite budgets. Once the cumulative history exceeds the model’s context window (measured in tokens), something has to give. Some APIs silently truncate the oldest messages. Others return an error. Your application needs a strategy - sliding window, summarization, or selective pruning - before it needs one.
You control the history. Nothing forces you to inject the exact unmodified model response from the previous turn. Sophisticated applications summarize, compress, or filter history before injecting it. You can also inject synthetic assistant turns to steer the model’s subsequent behavior. A related technique, writing the beginning of the next assistant turn yourself so the model continues from it, is called “prefilling.”
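To make the cost growth concrete, here is a small arithmetic sketch. It assumes every user message and every assistant reply has a fixed token length and that nothing is cached or trimmed; the specific numbers (200, 50, 150) are invented for illustration.

```python
def input_tokens_for_call(k: int, system: int, user: int, assistant: int) -> int:
    """Input tokens sent on the k-th call (1-indexed) when full history is resent.

    Call k carries the system prompt, k user messages, and k - 1 earlier replies.
    """
    return system + k * user + (k - 1) * assistant


def cumulative_input_tokens(n: int, system: int, user: int, assistant: int) -> int:
    """Total input tokens across the first n calls of one conversation."""
    return sum(input_tokens_for_call(k, system, user, assistant) for k in range(1, n + 1))


# Assumed sizes: 200-token system prompt, 50-token questions, 150-token replies.
first_call = input_tokens_for_call(1, 200, 50, 150)     # 250
twentieth = input_tokens_for_call(20, 200, 50, 150)     # 4050
total = cumulative_input_tokens(20, 200, 50, 150)       # 43000
```

Under these assumptions the 20th call alone sends about 16 times the input of the first, and the conversation as a whole sends 43,000 input tokens rather than the 5,000 that twenty independent single-turn calls would.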

Here is what the message list looks like at the API level for a two-turn conversation:

messages = [
    {
        "role": "system",
        "content": "You are a helpful coding assistant. Be concise."
    },
    {
        "role": "user",
        "content": "What does the 'yield' keyword do in Python?"
    },
    {
        "role": "assistant",
        "content": "yield turns a function into a generator. Instead of returning a value and exiting, it pauses execution and hands a value back to the caller, resuming from that point on the next iteration."
    },
    {
        "role": "user",
        "content": "Can you show me a simple example?"
    }
]

The model receives all four messages as context. Its response to the final user message will be informed by everything above it - including the definition it already gave. This is why follow-up questions work at all.
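The bookkeeping that makes follow-ups work can be sketched as a thin wrapper that owns the history and resends all of it on each call. `fake_model` below is a hypothetical stand-in for a real chat API call so that the example runs without a network; the `Conversation` class is likewise an illustration, not any SDK's interface.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def fake_model(messages: List[Message]) -> str:
    """Hypothetical stateless model: it only knows what is in `messages`."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"reply #{user_turns} (saw {len(messages)} messages)"


class Conversation:
    """Keeps the history client-side, because the model keeps none."""

    def __init__(self, system_prompt: str, model: Callable[[List[Message]], str]):
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]
        self.model = model

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self.model(self.messages)  # the full history goes out every time
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

If the `append` of the assistant reply were dropped, the next call would still succeed, but the model would have no record of its own earlier answer.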

How Format Maps to Raw Text

Models do not natively understand JSON or Python data structures. Before the model ever sees the message list, the API serializes it into a flat text sequence using a chat template. The format varies by model family. OpenAI’s ChatML format looks like this:

<|im_start|>system
You are a helpful coding assistant. Be concise.<|im_end|>
<|im_start|>user
What does the 'yield' keyword do in Python?<|im_end|>
<|im_start|>assistant
yield turns a function into a generator...<|im_end|>
<|im_start|>user
Can you show me a simple example?<|im_end|>
<|im_start|>assistant

The final <|im_start|>assistant header with no closing tag is the generation prompt - the cue that tells the model to start producing the assistant’s response. The model continues the text from this point.
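The serialization step can be approximated in a few lines. This is a string-level sketch of the ChatML layout shown above, with a hypothetical `render_chatml` helper; real serving stacks work on token IDs with dedicated special tokens, and exact whitespace may differ between models.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten a message list into ChatML-style text (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn and leave it unclosed: the model writes the rest.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt_text = render_chatml([
    {"role": "system", "content": "You are a helpful coding assistant. Be concise."},
    {"role": "user", "content": "What does the 'yield' keyword do in Python?"},
])
```

Passing `add_generation_prompt=False` yields a fully closed transcript, which is the shape used for training examples rather than for inference.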

Other model families use different markers. Llama 2 chat models wrap user turns in [INST] and [/INST], while Llama 3 switched to header tokens such as <|start_header_id|> and <|eot_id|>. Anthropic’s legacy Text Completions API used \n\nHuman: and \n\nAssistant: delimiters. The principle is the same: structured markers that the model was trained to respect, serialized into the flat token sequence the model actually sees.

When you use a hosted API, all of this serialization happens invisibly. When you run models locally, tools like llama.cpp and Ollama can apply the model’s chat template for you on their chat endpoints, but if you send raw text to a completion endpoint or override the template, applying the correct format is your responsibility. Getting it wrong does not produce an error - it produces subtly degraded output, because the model’s behavior was fine-tuned against a specific format.

Practical Patterns for Production

A few patterns that experienced practitioners use consistently:

Separate persona from constraints. A system prompt that mixes “you are a friendly assistant” with “never discuss competitor products” is harder to maintain and debug than one with explicit sections. Use clear structural separation, even in plain text.
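One way to keep that separation honest is to assemble the prompt from named sections in code instead of editing a single blob of text. The `build_system_prompt` helper and the section titles below are assumptions for illustration.

```python
from typing import Dict, List


def build_system_prompt(persona: str, sections: Dict[str, List[str]]) -> str:
    """Join a persona paragraph and titled rule lists into one plain-text prompt."""
    blocks = [persona.strip()]
    for title, rules in sections.items():
        lines = "\n".join(f"- {rule}" for rule in rules)
        blocks.append(f"{title}:\n{lines}")
    return "\n\n".join(blocks)


prompt = build_system_prompt(
    "You are a friendly support assistant for Acme Corp.",
    {
        "Scope": ["Only discuss Acme Corp products and services."],
        "Confidentiality": ["Never discuss competitor products."],
        "Style": ["Respond concisely."],
    },
)
```

With this layout a diff that touches only the `Confidentiality` list is visibly a constraint change, not a persona change, which makes reviews and regressions easier to reason about.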

Test system prompt changes in isolation. The system prompt is a shared dependency for every conversation in your application. Changes to it are breaking changes. Version-control your system prompts and evaluate them on a representative set of test prompts before deploying.
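A minimal version of such an evaluation is a list of inputs paired with checks, run against a candidate prompt before it ships. Everything here is hypothetical: the `CASES` list, the string-based checks, and the `model` callable, which stands in for whatever client your application uses. Real suites typically need many more cases and more robust checks than substring tests.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
Case = Tuple[str, Callable[[str], bool]]

# Hypothetical cases: (user input, predicate the reply must satisfy).
CASES: List[Case] = [
    ("Where is my order?", lambda reply: len(reply) > 0),
    ("What do your suppliers charge you?", lambda reply: "supplier" not in reply.lower()),
]


def evaluate_prompt(
    system_prompt: str,
    model: Callable[[List[Message]], str],
    cases: List[Case],
) -> List[str]:
    """Return the user inputs whose replies failed their check."""
    failures = []
    for user_input, check in cases:
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ]
        if not check(model(messages)):
            failures.append(user_input)
    return failures
```

Running this in CI against each versioned prompt turns a prompt edit into something that can fail a build, the same way a code change can.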

Treat the user turn as untrusted input. Everything in the user turn could, in principle, be an attempt to override system instructions. This is not paranoia - it is the correct security model. Never interpolate user input directly into your system prompt. If you need to include user-provided data in the system prompt (a document they uploaded, for example), validate and sanitize it first.

Keep context history manageable. A context window of 128,000 tokens sounds generous until you realize that 20 turns of a rich conversation, with a substantial system prompt and retrieved documents, can fill it. Build context management into your architecture from the start, not as a retrofit.
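A sliding window is the simplest of these strategies. The sketch below keeps the system prompt and as many of the most recent messages as fit in a budget. The four-characters-per-token estimate is a crude assumption; a real implementation would count with the model's own tokenizer, and `trim_history` is a hypothetical helper name.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep system messages plus the newest messages that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: List[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()

    # Do not start the visible history with an orphaned assistant reply.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return system + kept
```

Dropping old turns loses information, so applications that need long-range recall often pair a window like this with a running summary of what was trimmed.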

Use assistant prefilling deliberately. Where the provider’s API supports it (support varies), you can inject the beginning of the assistant response to constrain the model’s output format. For example, if you need the model to always start with a JSON object, begin the assistant turn with { in your API call. The model will continue from that starting point. This is a low-overhead way to enforce structure without relying entirely on instruction following.
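In message-list terms, prefilling means ending the request with a partial assistant message. The sketch below assumes an API that accepts a trailing assistant message and returns only the continuation, which is not true of every provider; some also take the system prompt as a separate parameter instead of a message. `PREFILL` and both helpers are hypothetical names.

```python
import json
from typing import Dict, List

PREFILL = "{"


def build_prefilled_messages(system_prompt: str, user_text: str) -> List[Dict[str, str]]:
    """End the list with a partial assistant turn for the model to continue."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": PREFILL},  # the model picks up after "{"
    ]


def parse_prefilled_reply(continuation: str) -> dict:
    """Re-attach the prefill (assumed absent from the response) and parse the JSON."""
    return json.loads(PREFILL + continuation)


messages = build_prefilled_messages(
    "Always respond in valid JSON.",
    "Report the order status.",
)
```

Forgetting to re-attach the prefill is a common source of parse errors, since the continuation on its own is not a complete JSON document.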

Every interaction with a production LLM is a structured list of messages with roles - system, user, and assistant. The system prompt is the developer’s persistent instruction layer. The user turn is the request. The assistant turn is previous model output re-injected as context, because the model is stateless between calls.

Understanding this format and its constraints - token costs, context limits, injection risks - is foundational to building reliable applications on top of language models.
