





















Everyone talks about AI. Your LinkedIn and X feeds are drowning in it. Your organization probably mentioned it in last week’s meeting. Your cousin brought it up at dinner, or you are already deep in the trenches with your favorite large language model (LLM). And yet, when someone asks you to explain how an LLM actually works, most of us freeze.
That freeze is understandable. The AI world loves its complex explanations, jargon, and technical concepts. Terms like tokens, embeddings, and zero-shot learning get thrown around constantly. Under the bonnet there is some very heavy math involved, but the key concepts are surprisingly easy to explain.
This is the first in a blog series that walks through a handful of core AI concepts, sorted by difficulty. We start here, on the ground floor, with no PhD required and no prior knowledge assumed. If you can follow a cookie recipe, you can follow this blog series.
By the end of this piece, you will understand the foundational ideas that power modern AI. You will know what a token is, why temperature matters, and what people actually mean when they say “zero-shot.” More than that, you will have the mental models to make sense of the next AI headline you read.
Strip away the hype and a large language model (LLM) is a piece of software trained to predict the next word in a sequence. That is the core trick. Given the words “The cat sat on the,” a well-trained model assigns high probability to “mat” or “chair” and low probability to “helicopter” or “algorithm.”
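To make "assigns high probability" concrete, here is a minimal sketch that turns made-up counts into next-word probabilities. The counts and the function name are invented for illustration; a real LLM computes these probabilities with a neural network rather than by counting.

```python
from collections import Counter

# Made-up observations of which word followed "The cat sat on the".
observed_next_words = ["mat"] * 6 + ["chair"] * 3 + ["helicopter"] * 1

def next_word_probabilities(words):
    """Turn raw counts into a probability for each candidate next word."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probabilities(observed_next_words)
best = max(probs, key=probs.get)  # the candidate with the highest probability
print(probs)
print("Most likely next word:", best)
```

The point is only the shape of the output: one probability per candidate word, all adding up to 1.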
The “large” in the name refers to scale. These models contain billions of adjustable numerical values called parameters. GPT-3, for example, has 175 billion parameters, and the makers of newer frontier models generally do not publish their counts. Each parameter is like a tiny dial, and during training, the model adjusts these dials over and over until it gets reasonably good at predicting what comes next in vast quantities of text.
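The "tiny dial" idea can be sketched with a model that has exactly one parameter. This assumes the dial is adjusted with gradient descent, which is the standard technique but is not named in this article; the data, learning rate, and function name are all invented for the example.

```python
# Toy data generated by the rule y = 3 * x. The "model" has a single dial, w.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

def train(data, steps=200, learning_rate=0.01):
    w = 0.0  # start with the dial in an arbitrary position
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad  # nudge the dial to reduce the error
    return w

w = train(data)
print(round(w, 4))
```

An LLM does the same kind of nudging, only across billions of dials at once and with text prediction as the error signal.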
What makes LLMs remarkable is that this simple objective (predict the next word) produces something that looks like understanding. Train a model on enough text from enough domains, and it starts to answer questions, write essays, translate languages, and summarize documents. The scale of the data and the number of parameters create emergent capabilities that nobody explicitly programmed.
Here is the thing that trips people up: LLMs do not “know” anything in the way you and I know things. They encode statistical patterns from their training data into those billions of parameters. When an LLM writes a coherent paragraph about quantum physics, it is drawing on patterns it absorbed from thousands of physics texts. Impressive, yes. Conscious understanding, no… not yet, anyway.
You and I read words. Computers read numbers. Tokenization is the bridge between these two worlds.
When you type a sentence into ChatGPT or Claude, the first thing that happens (before any “thinking” occurs) is that your text gets chopped into smaller pieces called tokens. Sometimes a token is a whole word; sometimes it is a fragment. The word “understanding” might become two tokens: “under” and “standing.” The word “AI” is one token. A long, unusual word like “talosintelligence” might get split into two or three pieces.
Why not just use whole words? Because human language is absurdly varied. English alone has millions of words, and people invent new ones constantly. If the model needed a separate entry for every possible word, its vocabulary table would be enormous. Subword tokenization solves this by working with a manageable set of fragments (typically 30k to 100k pieces) that can be combined to represent any word, including words the model has never encountered before.
The most common approach is called Byte-Pair Encoding (BPE). It works by starting with individual characters and then merging the most frequently occurring pairs, step by step, until the vocabulary reaches the desired size. Frequent words like “the” get their own token. Rare words get built from smaller pieces. This gives the model flexibility to handle slang, technical terms, and even different languages without falling apart or guessing. The trick is that all of this is based on frequency counts.
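The merge loop described above can be sketched in a few lines. This is a simplified illustration on a made-up four-word corpus, not a production tokenizer; the function names and word counts are assumptions for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Start from single characters and apply `num_merges` merges."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

corpus = {"low": 5, "lower": 2, "lowest": 3, "new": 4}  # made-up word counts
merges, words = learn_bpe(corpus, num_merges=2)
print(merges)
print(words)
```

After two merges the frequent fragment "low" has become a single symbol, while the rarer endings are still spelled out character by character.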
There is a practical consequence worth noting: Tokenization affects cost. When you use an API like OpenAI’s or Anthropic’s, you pay per token processed. A verbose prompt costs more than a concise one, and different languages tokenize differently. A sentence in English might take 10 tokens while the same meaning in Japanese could take 15, because the tokenizer was trained primarily on English text.
Once text is broken into tokens, each token needs to be converted into something a neural network can manipulate: a vector, which is simply a list of numbers that represents the token’s meaning in mathematical space.
Imagine a three-dimensional room. You could place the word “king” at one point, “queen” at another, “man” at a third, and “woman” at a fourth. If the embedding is good, the distance and direction from “king” to “queen” would roughly match the distance and direction from “man” to “woman.” The vector captures the relationship (male-to-female) as a geometric pattern. Real embeddings work in hundreds or thousands of dimensions, where the relationships become far richer and harder to visualize.
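Here is the three-dimensional room as a small sketch. The vectors are hand-picked so the pattern is visible (the axes loosely stand for royalty, maleness, and femaleness); real embeddings are learned, not chosen by hand.

```python
# Hand-picked toy vectors; the axes are roughly (royalty, maleness, femaleness).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

def subtract(a, b):
    """Component-wise difference a - b: the step that takes you from b to a."""
    return [x - y for x, y in zip(a, b)]

king_to_queen = subtract(embeddings["queen"], embeddings["king"])
man_to_woman = subtract(embeddings["woman"], embeddings["man"])
print(king_to_queen)
print(man_to_woman)
```

Both differences point the same way, which is what "the same distance and direction" means in numbers.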
At the start of training, embeddings are initialized randomly. The word “cat” gets a random list of numbers. So does “dog.” So does “refrigerator.” As training proceeds and the model sees millions of sentences, these vectors get tugged and adjusted until words used in similar contexts end up near each other in vector space. “Cat” and “dog” drift close together. “Refrigerator” stays further away. This training process is computationally very expensive.
This matters because it means the model develops a numerical sense of meaning. Similar concepts cluster. Related ideas form geometric patterns. When the model later needs to process a sentence, it works with these rich, meaning-laden vectors rather than raw text, which gives it the ability to reason about relationships between concepts.
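One common way to measure how close two vectors are is cosine similarity, which compares the directions they point in. The sketch below uses invented three-number vectors for "cat," "dog," and "refrigerator"; the values are assumptions chosen to mimic what training tends to produce.

```python
import math

def cosine_similarity(a, b):
    """Close to 1.0 means same direction; close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors standing in for learned embeddings.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "refrigerator": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_fridge = cosine_similarity(vectors["cat"], vectors["refrigerator"])
print(round(cat_dog, 3), round(cat_fridge, 3))
```

The cat and dog vectors score far higher with each other than either does with the refrigerator, which is the clustering described above.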
Every LLM has a limit on how much text it can consider at once. This limit is the context window, measured in tokens.
Think of it like working memory. When you read a 300-page novel, you remember the broad strokes and recent chapters, but you have probably forgotten the exact wording of page 12 by the time you reach page 250. An LLM with a 4,096-token context window can only “memorize and see” about 3,000 words at a time. Everything outside that window might as well not exist.
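A minimal sketch of that working-memory limit follows. It assumes whole words stand in for tokens and that the oldest text is dropped first; real applications handle overflow in different ways, and the 0.75 words-per-token figure is only a rough rule of thumb for English.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English, not a constant

def fit_to_context(tokens, context_window):
    """Keep only the most recent tokens that fit inside the window."""
    if context_window <= 0:
        return []
    return tokens[-context_window:]

conversation = "the quick brown fox jumps over the lazy dog".split()
visible = fit_to_context(conversation, context_window=4)
approx_words = int(4096 * WORDS_PER_TOKEN)

print(visible)
print(approx_words)
```

Everything before "over" has fallen out of the window, so from the model's point of view it was never said.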
Modern models have been pushing these limits aggressively. Some frontier models now advertise context windows of up to 1,000,000 tokens. Claude can handle about 200,000 tokens, which is roughly the length of a decent novel. This expansion matters because it lets the model maintain coherence over longer documents, follow complex multi-step instructions, and work with large codebases without losing the thread.
There is a catch, though. Bigger context windows consume more memory and computation. Processing 200,000 tokens is dramatically more expensive than processing 4,000. Research has also shown that models sometimes struggle to pay equal attention to content in the middle of a very long prompt or conversation. The model might be strong at the beginning and end of its context window and weaker in the center. This is an active area of research, and the behavior is likely to improve as LLMs evolve.
When people compare LLMs, the context window is one of the first specs they look at, and for good reason. If you need to summarize a 50-page contract, you need a model whose context window can fit the whole document so you can query it, look for specific details within the document or its footnotes, and extract the vital information without context compression.
When an LLM generates text, it does not simply pick the single most likely next word every time. If it did, the output would be monotonous and predictable. Instead, there is a control called temperature that governs how much randomness enters the selection.
Temperature works by adjusting the probability distribution over possible next tokens. A temperature of 0 is effectively deterministic: the model always picks the single highest-probability token, so the output becomes focused and repetitive. A temperature of 1.0 samples directly from the learned probability distribution without modification. Values above 1.0 amplify randomness beyond what the model learned; lower-probability tokens get a fighting chance. The output becomes more creative, surprising, and occasionally incoherent.
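In code, temperature is usually a single division applied to the model's raw scores (often called logits, a term this article has not used so far) before they are turned into probabilities. The scores below are made up, and the function is a sketch rather than any particular library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as 'always pick the top token'")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)

for name, dist in [("0.5", cool), ("1.0", neutral), ("2.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

At 0.5 the top token takes most of the probability; at 2.0 the three candidates move closer together, so the less likely ones get picked more often.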
In practice, most applications land somewhere between 0.3 and 0.9. Code generation benefits from low temperature because you want precision. Creative writing benefits from higher temperature because you want variation and surprise. Customer support chatbots tend to run cool (around 0.3 to 0.5) because consistency matters more than flair.
If you have ever used the same prompt twice and gotten different responses, sampling at a nonzero temperature is the usual reason. And if an AI response feels “boring” or “robotic,” turning up the temperature is often the fix.
Temperature is one way to control randomness, but it is a blunt instrument. Top-k and top-p sampling are more refined approaches that limit which tokens are even eligible for selection.
Top-k sampling is the simpler of the two. You pick a number k (say, 40), and the model only considers the 40 most probable next tokens, discarding everything else. If “the” has probability 0.15 and “a” has probability 0.12, those stay in the running. If “xylophone” has a probability of 0.0001, it gets cut. This prevents the model from making wildly improbable choices while still allowing some variety among the top candidates.
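A minimal sketch of the top-k cut, using a tiny made-up distribution and k=2 so the effect is easy to see. The probabilities are only a slice of a full vocabulary, so they do not add up to 1 before filtering.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

# A made-up slice of a next-token distribution.
probs = {"the": 0.15, "a": 0.12, "cat": 0.08, "xylophone": 0.0001}
filtered = top_k_filter(probs, k=2)
print(filtered)
```

The model would then sample from `filtered`, so "xylophone" can never be chosen no matter how the dice fall.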
Top-p sampling (also called nucleus sampling) takes a different angle. Instead of fixing the number of candidates, you set a cumulative probability threshold. If p=0.92, the model sorts tokens by probability and includes candidates until their combined probability reaches 92%. When the model is confident (one token dominates the distribution), this might include only 5 tokens. When the model is uncertain, it might include 200. The pool size adapts to the situation.
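The adaptive pool size is easiest to see with two contrasting distributions. Both are invented for this sketch, and the function shows one straightforward way to apply the threshold.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their combined probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

confident = {"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02}             # one token dominates
uncertain = {"red": 0.3, "blue": 0.3, "green": 0.2, "pink": 0.2}    # no clear winner

confident_pool = top_p_filter(confident, p=0.92)
uncertain_pool = top_p_filter(uncertain, p=0.92)
print(len(confident_pool), len(uncertain_pool))
```

With the same threshold, the confident case keeps a single candidate while the uncertain case keeps all four.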
Top-p tends to produce more natural-sounding text because it respects the shape of the distribution rather than imposing an arbitrary cutoff. Most modern APIs let you set both temperature and top-p together, giving you layered control over the generation process. Frontier models such as Claude or Gemini ship with default values for these settings, so you only need to adjust them when the defaults do not suit your task.
Language keeps evolving and new words appear constantly. “Cryptocurrency” did not exist 25 years ago. “Doomscrolling” is barely six years old. How does a model handle words it has never seen?
The answer is subword tokenization. By breaking words into smaller known pieces, the model can construct a reasonable representation of any word, even entirely novel ones. If someone types “unfriendliestification”, the tokenizer might split it into “un,” “friend,” “li,” “est,” “ific,” “ation.” Each piece carries meaning that the model has seen before. The prefix “un” signals negation, “friend” is a known concept, and so on.
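The split above can be imitated with a greedy longest-match rule over a small vocabulary. This is a simplification: the six-piece vocabulary is hypothetical, and real tokenizers apply learned merge rules or scores rather than this exact lookup.

```python
def segment(word, vocabulary):
    """Greedy longest-match split; unknown characters become single-character pieces."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest possible piece first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])  # nothing matched, so fall back to one character
            start += 1
    return pieces

vocabulary = {"un", "friend", "li", "est", "ific", "ation"}  # hypothetical vocabulary
pieces = segment("unfriendliestification", vocabulary)
print(pieces)
```

The single-character fallback is what lets this kind of scheme degrade gracefully: any string can be represented, even if it takes more pieces.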
This is a significant improvement over older approaches. Earlier Natural Language Processing (NLP) systems maintained fixed word dictionaries and simply flagged anything unknown as an “OOV” (out-of-vocabulary) token, essentially throwing up their hands in the air and saying, “I don’t know what this is.” A model encountering “cryptocurrency” in 2003 would have treated it as a meaningless placeholder. Modern subword methods degrade gracefully instead of failing outright.
Byte-Pair Encoding (BPE), WordPiece, and SentencePiece are the three names you will encounter most often. BPE and WordPiece are algorithms, while SentencePiece is a toolkit that implements BPE and a unigram language model. They differ in implementation details, but the principle is the same: learn a vocabulary of frequent subword units from the training corpus, then use those units to represent any text.
The single fastest way to improve AI output quality is to improve the input. Prompt engineering is the practice of crafting instructions and examples that guide an LLM toward the response you want.
Consider the difference between these two prompts: The first is “Tell me about dogs,” and the second is “Write a 200-word factual overview of golden retrievers, covering temperament, typical health issues, and exercise needs, suitable for a veterinary clinic’s website.” The second prompt gives the model a clear target. It specifies length, scope, tone, and audience. The result will be dramatically more useful.
Several techniques have emerged as best practices. Adding examples (“Here is a sample of the format I want…”) helps the model match your expectations. Assigning a role (“You are a senior data analyst…”) primes the model’s vocabulary and reasoning style. Breaking complex tasks into steps (“First, list the key points. Then, organize them by priority. Finally, write a summary.”) prevents the model from trying to do everything at once and losing coherence.
Prompt engineering works because LLMs are pattern-completion machines. A well-structured prompt creates a pattern that the model is statistically inclined to continue in a useful direction. A vague prompt gives the model too many plausible continuations, and it may pick one you did not want.
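The three techniques (a role, explicit steps, and a format example) are easy to combine in a small helper. The function name and layout here are assumptions for illustration; no particular wording is required by any model.

```python
def build_prompt(role, task, steps, example=None):
    """Assemble a structured prompt from a role, a task, ordered steps, and an optional sample."""
    lines = [f"You are {role}.", "", f"Task: {task}", "", "Follow these steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    if example:
        lines += ["", "Here is a sample of the format I want:", example]
    return "\n".join(lines)

prompt = build_prompt(
    role="a senior data analyst",
    task="Summarize the attached quarterly sales report.",
    steps=[
        "List the key points.",
        "Organize them by priority.",
        "Write a summary.",
    ],
    example="- Finding: ...\n- Impact: ...",
)
print(prompt)
```

The resulting text is an ordinary string, so it can be pasted into a chat window or sent through an API unchanged.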
In traditional machine learning, you need labeled examples to teach a model a new task. Want it to classify movie reviews as positive or negative? You need thousands of labeled reviews. Want it to detect spam? You need thousands of labeled emails.
LLMs break this pattern. Because they absorb such a broad range of knowledge during pretraining, they can often perform tasks they were never explicitly trained on. This is zero-shot learning, where an LLM performs a task with zero task-specific examples.
Ask Claude or GPT to “classify this review as positive or negative: The food was cold and the service was slow” and it will correctly say “negative,” despite never being specifically trained as a sentiment classifier. The model draws on its general understanding of language, sentiment, and the structure of classification tasks to produce a reasonable answer.
Zero-shot capabilities scale with model size. Larger models with more parameters tend to be better at zero-shot tasks because they encode more diverse patterns from their training data. This is one reason the industry keeps building bigger models. Each jump in model scale tends to unlock new zero-shot abilities.
The practical impact is enormous. Instead of training a custom model for every new task (which requires data, compute, and expertise), you can often just describe the task in a prompt and let the LLM figure it out.
Few-shot learning sits between zero-shot (no examples) and traditional supervised learning (thousands of labeled examples, as with the movie reviews). You include a small number of demonstrations in your prompt, and the model uses them to understand the pattern you want.
For example, suppose you want an LLM to convert informal text into formal business language. You might include three examples in your prompt, each showing an informal sentence in and a formal sentence out. The model picks up the pattern from these few examples and applies it to new inputs without any retraining or weight updates.
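The informal-to-formal prompt could be assembled like this. The example sentences and the "Informal:" / "Formal:" labels are invented for the sketch; passing an empty list of examples yields the zero-shot version of the same prompt.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Lay out demonstrations, then leave the last answer blank for the model to fill in."""
    parts = [instruction, ""]
    for informal, formal in examples:
        parts += [f"Informal: {informal}", f"Formal: {formal}", ""]
    parts += [f"Informal: {new_input}", "Formal:"]
    return "\n".join(parts)

examples = [
    ("gonna be late, sorry", "I apologize, but I will be arriving late."),
    ("can u send the file?", "Could you please send the file?"),
    ("thx for the help", "Thank you for your assistance."),
]
instruction = "Rewrite each informal sentence in formal business language."

few_shot = build_few_shot_prompt(instruction, examples, "lemme know asap")
zero_shot = build_few_shot_prompt(instruction, [], "lemme know asap")
print(few_shot)
```

Ending the prompt on an unfinished "Formal:" line makes the desired answer the most natural continuation of the pattern.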
What makes this fascinating is that the model is not “learning” in the traditional sense because no parameters change. The examples simply create a context that makes the desired pattern the most probable continuation. The model effectively performs pattern matching on the fly, using its existing knowledge to generalize from the examples you provided.
Few-shot learning is extraordinarily practical. It lets you customize model behavior for niche tasks (legal document formatting, medical record summarization, specialized translation) with nothing more than a well-crafted prompt – no training pipeline, labeled dataset, or GPU cluster.
The trade-off is that few-shot learning consumes context window space. Each example you include takes up tokens that could otherwise be used for the actual task. Finding the right balance between enough examples to establish the pattern and enough remaining context for the work is part of the prompt engineering craft.
The AI world contains two broad families of models, and understanding the distinction between them clarifies a lot of the conversation around modern AI.
Discriminative models learn to draw boundaries. Given an input, they assign it to a category. A spam filter looks at an email and outputs “spam” or “not spam.” A sentiment analyzer reads a review and outputs “positive,” “negative,” or “neutral.” These models learn the decision boundary between classes and are good at classification, detection, and prediction tasks.
Generative models learn to create. Instead of just sorting things into boxes, they study what the data itself looks like. Once they understand the patterns, they can make new examples that feel similar to what they learned from. GPT writes text, DALL-E draws pictures, and a generative model trained on music could write new songs. In short, these models learn what the data is, not just how to tell one type from another.
The difference really comes down to the kind of question each model is trying to answer. A discriminative model asks: “Given this email, how likely is it that this is spam?” A generative model asks a bigger question: “How likely is it that these particular words would appear together in the first place?”
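Written as probabilities, the two questions look like this. The symbols are assumptions introduced here: x is the input (an email, a sentence), y is the label, and w_t is the t-th token of a text. The last line is how a next-token predictor such as an LLM builds up the probability of a whole text.

```latex
\begin{aligned}
\text{Discriminative:}\quad & P(y \mid x) \\
\text{Generative:}\quad & P(x) \\
\text{LLM, one token at a time:}\quad & P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
\end{aligned}
```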
In everyday life, the LLMs you chat with (like ChatGPT, Claude, or Gemini) are generative models. They create text by picking words based on the patterns they have learned. That said, the line between the two types isn’t strict. Many modern AI systems mix both styles to get the best of each.
When an LLM generates text one token at a time, it faces a choice at every step. Which token comes next? The simplest strategy is called “greedy decoding” because it picks the single most probable token at each step and moves on. This is fast and easy, but it can paint the model into a corner. The locally best choice at step 3 might lead to an awkward dead end by step 10.
“Beam search” offers an alternative. Instead of committing to one path, it explores multiple candidate sequences simultaneously. If the beam width is 5, the model keeps track of the 5 most promising partial sequences at each step, extending all of them and then pruning back down to the top 5. This lets the model consider that a slightly less obvious token at step 3 might lead to a much better sequence overall.
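Here is a toy sketch in which the obvious first choice turns out to be the worse path. The probability table is a hypothetical stand-in for a model, and a beam width of 1 behaves exactly like greedy decoding.

```python
import math

# Hypothetical toy "model": next-token probabilities given the sequence so far.
NEXT_TOKEN_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(beam_width, max_steps=2):
    """Return the best (sequence, log probability) found with the given beam width."""
    beams = [((), 0.0)]  # each beam is (sequence so far, summed log probability)
    for _ in range(max_steps):
        candidates = []
        for sequence, score in beams:
            for token, prob in NEXT_TOKEN_PROBS.get(sequence, {}).items():
                candidates.append((sequence + (token,), score + math.log(prob)))
        if not candidates:
            break  # no beam can be extended any further
        # Keep only the most promising partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_sequence, greedy_score = beam_search(beam_width=1)
beam_sequence, beam_score = beam_search(beam_width=2)

print(greedy_sequence, round(math.exp(greedy_score), 2))
print(beam_sequence, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "A" because 0.6 beats 0.4 and ends with a sequence of probability 0.3. A beam of 2 keeps "B" alive long enough to discover "B z" at 0.36.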
Think of it like navigating a city you have never visited. Greedy decoding always takes the road that looks best right now, even if it leads to a dead end. Beam search keeps track of several promising routes simultaneously and can abandon a path that turns out to be a detour.
Beam search is particularly valuable for structured output tasks like machine translation, where the final sentence needs to be grammatically coherent as a whole. For open-ended creative generation, sampling methods (temperature, top-k, top-p) tend to work better because beam search can be overly conservative, producing safe and repetitive text.
The trade-off is straightforward. Beam search uses more memory and computation, in proportion to the beam width. A beam of 5 is roughly 5 times more work than greedy decoding. For most conversational AI applications, the sampling approaches we discussed earlier have largely replaced beam search as the default generation strategy.
We’ve covered a lot of ground. You now understand some of the key foundational concepts that underpin everything happening in the AI space, from what an LLM actually is to how it reads text and generates creative output through temperature, sampling, and beam search.
You know why the context window matters, how models handle unknown words, and why prompt engineering works. You understand zero-shot and few-shot learning, and you can explain the difference between generative and discriminative models without reaching for jargon.
These concepts form the bedrock. Everything else in this series builds on them. In the next installment, we go deeper into the architecture that makes all of this possible: The famous “transformer.” We will look at attention mechanisms, positional encodings, and the specific design decisions that turned a 2017 research paper into the foundation of modern AI.