The Fundamentals of AI: What every curious person should know about how language models work
Yuri Kramarz · 2026-05-22 · via Cisco Blogs

Everyone talks about AI. Your LinkedIn and X feeds are drowning in it. Your organization probably mentioned it in last week’s meeting. Your cousin brought it up at dinner, or perhaps you are already deep in the trenches with your favorite large language model (LLM). And yet, when someone asks us to explain how an LLM actually works, most of us freeze.

That freeze is understandable. The AI world loves its complex explanations, jargon, and technical concepts. Terms like tokens, embeddings, and zero-shot learning get thrown around frequently. Under the bonnet there is some very heavy math involved, but the key concepts are surprisingly easy to explain.

This is the first in a blog series that walks through a handful of core AI concepts, sorted by difficulty. We start here, on the ground floor, with no PhD required and no prior knowledge assumed. If you can follow a cookie recipe, you can follow this blog series.

By the end of this piece, you will understand the foundational ideas that power modern AI. You will know what a token is, why temperature matters, and what people actually mean when they say “zero-shot.” More than that, you will have the mental models to make sense of the next AI headline you read.

What is a large language model, really?

Strip away the hype and a large language model (LLM) is a piece of software trained to predict the next word in a sequence. That is the core trick. Given the words “The cat sat on the,” a well-trained model assigns high probability to “mat” or “chair” and low probability to “helicopter” or “algorithm.”
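To make the "predict the next word" idea concrete, here is a minimal sketch that counts which word follows a two-word context in a tiny made-up corpus. It is a simple counting model, far cruder than an LLM, and the corpus and function names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model trains on vastly more text.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the mat . "
    "the cat sat on the chair ."
).split()

CONTEXT = 2  # how many previous words the toy model looks at

# Count which word follows each two-word context.
following = defaultdict(Counter)
for i in range(len(corpus) - CONTEXT):
    context = tuple(corpus[i:i + CONTEXT])
    following[context][corpus[i + CONTEXT]] += 1

def next_word_probs(*context):
    """Return {next_word: probability} estimated from the counts."""
    counts = following[tuple(context)]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("on", "the")
print(probs)  # "mat" gets about 0.67, "chair" about 0.33
```

Words that never followed "on the" in the corpus, such as "helicopter," simply get no probability here. A real LLM replaces the counting table with billions of learned parameters, which lets it generalize to contexts it has never seen verbatim.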

The “large” in the name refers to scale. These models contain billions of adjustable numerical values called parameters. GPT-3, for example, has 175 billion parameters, and vendors of newer frontier models rarely publish exact counts. Each parameter is like a tiny dial, and during training, the model adjusts these dials over and over until it gets reasonably good at predicting what comes next in vast quantities of text.

What makes LLMs remarkable is that this simple objective (predict the next word) produces something that looks like understanding. Train a model on enough text from enough domains, and it starts to answer questions, write essays, translate languages, and summarize documents. The scale of the data and the number of parameters create emergent capabilities that nobody explicitly programmed.

Here is the thing that trips people up: LLMs do not “know” anything in the way you and I know things. They encode statistical patterns from their training data into those billions of parameters. When an LLM writes a coherent paragraph about quantum physics, it is drawing on patterns it absorbed from thousands of physics texts. Impressive, yes. Conscious understanding, no… not yet, anyway.

How AI reads text

You and I read words. Computers read numbers. Tokenization is the bridge between these two worlds.

When you type a sentence into ChatGPT or Claude, the first thing that happens (before any “thinking” occurs) is that your text gets chopped into smaller pieces called tokens. Sometimes a token is a whole word, sometimes a fragment. The word “understanding” might become two tokens: “under” and “standing.” The word “AI” is one token. A long, unusual word like “talosintelligence” might get split into two or three pieces.

Why not just use whole words? Because human language is absurdly varied. English alone has millions of words, and people invent new ones constantly. If the model needed a separate entry for every possible word, its vocabulary table would be enormous. Subword tokenization solves this by working with a manageable set of fragments (typically tens of thousands up to a few hundred thousand pieces) that can be combined to represent any word, including words the model has never encountered before.

The most common approach is called Byte-Pair Encoding (BPE). It works by starting with individual characters and then merging the most frequently occurring pairs, step by step, until the vocabulary reaches the desired size. Frequent words like “the” get their own token. Rare words get built from smaller pieces. This gives the model flexibility to handle slang, technical terms, and even different languages without falling apart or guessing. The trick is that all of this is based on frequency counts.
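The merge loop described above can be sketched in a few lines. This is a minimal illustration of BPE training on a made-up word-frequency table, not the implementation used by any particular model; the function name and the toy counts are assumptions.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} table."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Toy frequencies: "the" is common, so it earns its own token after two merges.
merges, vocab = train_bpe({"the": 10, "then": 4, "hen": 3}, num_merges=2)
print(merges)  # [('h', 'e'), ('t', 'he')]
print(vocab)   # {('the',): 10, ('the', 'n'): 4, ('he', 'n'): 3}
```

After two merges the frequent word "the" is a single token, while the rarer "then" and "hen" are still built from pieces, which is exactly the frequency-driven behavior the paragraph describes.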

There is a practical consequence worth noting: Tokenization affects cost. When you use an API like OpenAI’s or Anthropic’s, you pay per token processed. A verbose prompt costs more than a concise one, and different languages tokenize differently. A sentence in English might take 10 tokens while the same meaning in Japanese could take 15, because the tokenizer was trained primarily on English text.
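A quick back-of-the-envelope calculation shows how that difference adds up at scale. The price below is a hypothetical placeholder, not a real vendor rate, and the token counts are the illustrative 10 and 15 from the paragraph above.

```python
def token_cost(num_tokens, price_per_million):
    """Cost in dollars for a given number of tokens."""
    return num_tokens * price_per_million / 1_000_000

PRICE_PER_MILLION = 2.00  # hypothetical dollars per million tokens
REQUESTS = 1_000_000      # the same sentence sent one million times

english_cost = token_cost(10 * REQUESTS, PRICE_PER_MILLION)   # 10 tokens each
japanese_cost = token_cost(15 * REQUESTS, PRICE_PER_MILLION)  # 15 tokens each

print(english_cost, japanese_cost)  # 20.0 30.0
```

Under these assumptions, the same content costs 50% more simply because it tokenizes into more pieces.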

Embeddings: Giving meaning a shape

Once text is broken into tokens, each token needs to be converted into something a neural network can manipulate: a vector, which is simply a list of numbers that represents the token’s meaning in mathematical space.

Imagine a three-dimensional room. You could place the word “king” at one point, “queen” at another, “man” at a third, and “woman” at a fourth. If the embedding is good, the distance and direction from “king” to “queen” would roughly match the distance and direction from “man” to “woman.” The vector captures the relationship (male-to-female) as a geometric pattern. Real embeddings work in hundreds or thousands of dimensions, where the relationships become far richer and harder to visualize.
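The three-dimensional room can be sketched with hand-picked toy vectors. These numbers are invented for illustration (the three axes loosely stand for royalty, maleness, and femaleness); real embeddings are learned during training and are not this tidy.

```python
import math

# Hand-picked 3-D toy vectors; not learned embeddings.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

def subtract(a, b):
    """Direction and distance you travel going from b to a."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: 1.0 means the two directions are identical."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king_to_queen = subtract(vectors["queen"], vectors["king"])
man_to_woman = subtract(vectors["woman"], vectors["man"])
king_to_man = subtract(vectors["man"], vectors["king"])

same_relationship = cosine(king_to_queen, man_to_woman)
different_relationship = cosine(king_to_queen, king_to_man)
print(same_relationship, different_relationship)
```

The first similarity comes out at essentially 1.0 because both steps move in the same male-to-female direction, while the second is essentially 0 because going from "king" to "man" changes a different property.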

At the start of training, embeddings are initialized randomly. The word “cat” gets a random list of numbers. So does “dog.” So does “refrigerator.” As training proceeds and the model sees millions of sentences, these vectors get tugged and adjusted until words used in similar contexts end up near each other in vector space. “Cat” and “dog” drift close together. “Refrigerator” stays further away. This training process is very computationally expensive.

This matters because it means the model develops a numerical sense of meaning. Similar concepts cluster. Related ideas form geometric patterns. When the model later needs to process a sentence, it works with these rich, meaning-laden vectors rather than raw text, which gives it the ability to reason about relationships between concepts.

Context window: How much an AI can hold in its head

Every LLM has a limit on how much text it can consider at once. This limit is the context window, measured in tokens.

Think of it like working memory. When you read a 300-page novel, you remember the broad strokes and recent chapters, but you have probably forgotten the exact wording of page 12 by the time you reach page 250. An LLM with a 4,096-token context window can only “see” about 3,000 words at a time. Everything outside that window might as well not exist.
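Here is a minimal sketch of what "outside the window" means. It assumes the simplest possible policy, keeping only the most recent tokens; real systems may instead reject the request, truncate differently, or summarize older content. The 0.75 words-per-token figure is a rough rule of thumb implied by the numbers above, not an exact constant.

```python
def fit_to_window(tokens, window_size):
    """Keep only the most recent tokens that fit in the window."""
    return tokens[-window_size:] if window_size > 0 else []

# Ten placeholder tokens and a window that holds only four of them.
tokens = [f"tok{i}" for i in range(10)]
visible = fit_to_window(tokens, 4)
print(visible)  # ['tok6', 'tok7', 'tok8', 'tok9']

# Rough conversion from tokens to English words (rule of thumb, an assumption).
WORDS_PER_TOKEN = 0.75
approx_words = int(4096 * WORDS_PER_TOKEN)
print(approx_words)  # 3072
```

Under this policy the first six tokens are simply gone as far as the model is concerned, which is why long conversations can "forget" their beginnings.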

Modern models have been pushing these limits aggressively. Some frontier models now advertise context windows of up to 1,000,000 tokens, and Claude can handle about 200,000 tokens. At roughly three-quarters of a word per token, 200,000 tokens is about 150,000 words, or the length of a long novel. This expansion matters because it lets the model maintain coherence over longer documents, follow complex multi-step instructions, and work with large codebases without losing the thread.

There is a catch, though. Bigger context windows consume more memory and computation. Processing 200,000 tokens is dramatically more expensive than processing 4,000. Research has also shown that models sometimes struggle to pay equal attention to content in the middle of a very long prompt or conversation. The model might be strong at the beginning and end of its context window and weaker in the center. Ongoing research is working on this weakness, and it is likely to shrink as LLMs improve.

When people compare LLMs, the context window is one of the first specs they look at, and for good reason. If you need to summarize a 50-page contract, you need a model whose context window can fit the whole document so you can query it, look for specific details within the document or its footnotes, and extract the vital information without context compression.

Temperature: The creativity dial

When an LLM generates text, it does not simply pick the single most likely next word every time. If it did, the output would be monotonous and predictable. Instead, there is a control called temperature that governs how much randomness enters the selection.

Temperature works by reshaping the probability distribution over possible next tokens before one is sampled. At a temperature of 0, the model always picks the single highest-probability token, so the output becomes focused, deterministic, and repetitive. Values between 0 and 1 sharpen the distribution in favor of the likeliest tokens. A temperature of 1.0 samples directly from the learned probability distribution without modification. Values above 1.0 amplify randomness beyond what the model learned; lower-probability tokens get a fighting chance. The output becomes more creative, surprising, and occasionally incoherent.
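One common way to implement this is to divide the model's raw scores (often called logits) by the temperature before converting them to probabilities with a softmax. The sketch below assumes that formulation; the three scores are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for 0")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

cool = softmax_with_temperature(logits, 0.5)     # sharper: top token dominates
neutral = softmax_with_temperature(logits, 1.0)  # the learned distribution
hot = softmax_with_temperature(logits, 2.0)      # flatter: more randomness

for name, dist in [("0.5", cool), ("1.0", neutral), ("2.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

The top token's share shrinks as the temperature rises (roughly 0.87, then 0.67, then 0.51 with these scores), which is the "fighting chance" that lower-probability tokens get at higher settings.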

In practice, most applications land somewhere between 0.3 and 0.9. Code generation benefits from low temperature because you want precision. Creative writing benefits from higher temperature because you want variation and surprise. Customer support chatbots tend to run cool (around 0.3 to 0.5) because consistency matters more than flair.

If you have ever used the same prompt twice and gotten different responses, temperature-driven sampling is usually the reason. And if an AI response feels “boring” or “robotic,” turning up the temperature is often the fix.

Controlling the word lottery through sampling

Temperature is one way to control randomness, but it is a blunt instrument. Top-k and top-p sampling are more refined approaches that limit which tokens are even eligible for selection.

Top-k sampling is the simpler of the two. You pick a number “k” (say, 40) and the model only considers the “k” (40) most probable next tokens, discarding everything else. If “the” has a probability of 0.15 and “a” has a probability of 0.12, those stay in the running. If “xylophone” has a probability of 0.0001, it gets cut. This prevents the model from making wildly improbable choices while still allowing some variety among the top candidates.
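A minimal sketch of the top-k cut follows. The probabilities are a made-up slice of a much larger distribution (so they do not sum to 1), and k is set to 2 rather than 40 to keep the example readable.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {token: prob / total for token, prob in top}

# Made-up slice of a next-token distribution.
probs = {"the": 0.15, "a": 0.12, "cat": 0.08, "xylophone": 0.0001}

kept = top_k_filter(probs, k=2)
print(kept)  # only "the" and "a" survive, rescaled to sum to 1
```

The sampler would then draw the next token from `kept`, so "xylophone" can never be chosen no matter how the dice fall.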

Top-p sampling (also called nucleus sampling) takes a different angle. Instead of fixing the number of candidates, you set a cumulative probability threshold. If p=0.92, the model sorts tokens by probability and includes candidates until their combined probability reaches 92%. When the model is confident (one token dominates the distribution), this might include only 5 tokens. When the model is uncertain, it might include 200. The pool size adapts to the situation.
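The adaptive pool size is easiest to see with two contrasting toy distributions. This sketch assumes the straightforward formulation described above (sort, accumulate, stop at the threshold); the token probabilities are invented.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities reach p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Invented distributions: one confident, one uncertain.
confident = {"mat": 0.95, "chair": 0.03, "sofa": 0.01, "roof": 0.01}
uncertain = {"mat": 0.30, "chair": 0.25, "sofa": 0.25, "roof": 0.20}

narrow_pool = top_p_filter(confident, p=0.92)
wide_pool = top_p_filter(uncertain, p=0.92)
print(len(narrow_pool), len(wide_pool))  # 1 4
```

With the same threshold, the confident distribution keeps a single candidate while the uncertain one keeps all four, which a fixed k could not do.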

Top-p tends to produce more natural-sounding text because it respects the shape of the distribution rather than imposing an arbitrary cutoff. Most modern APIs let you set both temperature and top-p together, giving you layered control over the generation process. Frontier models like Claude or Gemini also come with default sampling settings, so you only need to adjust these controls when the defaults do not suit your task.

Handling unknown words

Language keeps evolving and new words appear constantly. “Cryptocurrency” did not exist 25 years ago. “Doomscrolling” is barely six years old. How does a model handle words it has never seen?

The answer is subword tokenization. By breaking words into smaller known pieces, the model can construct a reasonable representation of any word, even entirely novel ones. If someone types “unfriendliestification”, the tokenizer might split it into “un,” “friend,” “li,” “est,” “ific,” “ation.” Each piece carries meaning that the model has seen before. The prefix “un” signals negation, “friend” is a known concept, and so on.
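The split above can be reproduced with a greedy longest-match tokenizer, which is similar in spirit to how WordPiece applies its vocabulary. The six-piece vocabulary is hypothetical and chosen to match the example; real tokenizers learn tens of thousands of pieces and may split this word differently.

```python
def tokenize(word, vocab):
    """Greedy longest-match split; unknown spans fall back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece or a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary of known subword pieces.
vocab = {"un", "friend", "li", "est", "ific", "ation"}

pieces = tokenize("unfriendliestification", vocab)
print(pieces)  # ['un', 'friend', 'li', 'est', 'ific', 'ation']

fallback = tokenize("xyz", vocab)
print(fallback)  # ['x', 'y', 'z']
```

Even a string with no known pieces at all still gets a representation, one character at a time, so the tokenizer never has to give up on an input.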

This is a significant improvement over older approaches. Earlier Natural Language Processing (NLP) systems maintained fixed word dictionaries and simply flagged anything unknown as an “OOV” (out-of-vocabulary) token, essentially throwing up their hands and saying, “I don’t know what this is.” A model encountering “cryptocurrency” in 2003 would have treated it as a meaningless placeholder. Modern subword methods degrade gracefully instead of failing outright.

Byte-Pair Encoding (BPE), WordPiece, and the unigram language model are the most common subword algorithms, and SentencePiece is a widely used toolkit that implements BPE and unigram. They differ in implementation details, but the principle is the same: Learn a vocabulary of frequent subword units from the training corpus, then use those units to represent any text.

Talking to AI the right way through prompt engineering

The single fastest way to improve AI output quality is to improve the input. Prompt engineering is the practice of crafting instructions and examples that guide an LLM toward the response you want.

Consider the difference between these two prompts: The first is “Tell me about dogs,” and the second is “Write a 200-word factual overview of golden retrievers, covering temperament, typical health issues, and exercise needs, suitable for a veterinary clinic’s website.” The second prompt gives the model a clear target. It specifies length, scope, tone, and audience. The result will be dramatically more useful.

Several techniques have emerged as best practices. Adding examples (“Here is a sample of the format I want…”) helps the model match your expectations. Assigning a role (“You are a senior data analyst…”) primes the model’s vocabulary and reasoning style. Breaking complex tasks into steps (“First, list the key points. Then, organize them by priority. Finally, write a summary.”) prevents the model from trying to do everything at once and losing coherence.
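These three techniques combine naturally into a reusable template. The helper below is a hypothetical sketch that only assembles a string; it does not call any model or vendor API, and the wording of the template is an assumption.

```python
def build_prompt(role, task, steps, example=None):
    """Assemble a prompt with a role, a task, an optional sample, and steps."""
    lines = [f"You are {role}.", "", task, ""]
    if example:
        lines += ["Here is a sample of the format I want:", example, ""]
    lines.append("Work through these steps:")
    lines += [f"{number}. {step}" for number, step in enumerate(steps, start=1)]
    return "\n".join(lines)

prompt = build_prompt(
    role="a senior data analyst",
    task="Summarize the attached quarterly sales figures for executives.",
    steps=[
        "List the key points.",
        "Organize them by priority.",
        "Write a summary.",
    ],
    example="- Revenue: up 4% quarter over quarter",
)
print(prompt)
```

Keeping the role, example, and steps as separate arguments makes it easy to vary one of them at a time and see which change actually improves the output.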

Prompt engineering works because LLMs are pattern-completion machines. A well-structured prompt creates a pattern that the model is statistically inclined to continue in a useful direction. A vague prompt gives the model too many plausible continuations, and it may pick one you did not want.

Performing without practice

In traditional machine learning, you need labeled examples to teach a model a new task. Want it to classify movie reviews as positive or negative? You need thousands of labeled reviews. Want it to detect spam? You need thousands of labeled emails.

LLMs break this pattern. Because they absorb such a broad range of knowledge during pretraining, they can often perform tasks they were never explicitly trained on. This is zero-shot learning, where an LLM performs a task with zero task-specific examples.

Ask Claude or GPT to “classify this review as positive or negative: The food was cold and the service was slow” and it will correctly say “negative,” despite never being specifically trained as a sentiment classifier. The model draws on its general understanding of language, sentiment, and the structure of classification tasks to produce a reasonable answer.

Zero-shot capabilities scale with model size. Larger models with more parameters tend to be better at zero-shot tasks because they encode more diverse patterns from their training data. This is one reason the industry keeps building bigger models. Each jump in scale tends to unlock new zero-shot abilities.

The practical impact is enormous. Instead of training a custom model for every new task (which requires data, compute, and expertise), you can often just describe the task in a prompt and let the LLM figure it out.

Few-shot learning: A handful of examples goes a long way

Few-shot learning sits between zero-shot (no examples) and traditional supervised learning (thousands of examples, as in the movie-review case). You include a small number of demonstrations in your prompt, and the model uses them to understand the pattern you want.

For example, suppose you want an LLM to convert informal text into formal business language. You might include three examples in your prompt, each pairing an informal sentence with its formal rewrite. The model picks up the pattern from these few examples and applies it to new inputs without any retraining or weight updates.
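A few-shot prompt for the informal-to-formal task might be assembled like this. The example sentences and the "Informal:" and "Formal:" labels are invented for illustration, and the function only builds text; sending it to a model is left out.

```python
def few_shot_prompt(examples, new_input):
    """Build a prompt from (informal, formal) pairs plus one new input."""
    parts = ["Rewrite each informal sentence in formal business language.", ""]
    for informal, formal in examples:
        parts += [f"Informal: {informal}", f"Formal: {formal}", ""]
    # End with an unfinished pair so the model continues the pattern.
    parts += [f"Informal: {new_input}", "Formal:"]
    return "\n".join(parts)

examples = [
    ("gonna be late, sorry", "I apologize, but I will be arriving late."),
    ("can u send the file?", "Could you please send me the file?"),
    ("thx for the help!", "Thank you very much for your assistance."),
]

prompt = few_shot_prompt(examples, "let's chat tmrw")
print(prompt)
```

The prompt deliberately ends on an open "Formal:" line, so the most probable continuation is a formal rewrite of the new sentence.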

What makes this fascinating is that the model is not “learning” in the traditional sense because no parameters change. The examples simply create a context that makes the desired pattern the most probable continuation. The model effectively performs pattern matching on the fly, using its existing knowledge to generalize from the examples you provided.

Few-shot learning is extraordinarily practical. It lets you customize model behavior for niche tasks (legal document formatting, medical record summarization, specialized translation) with nothing more than a well-crafted prompt – no training pipeline, labeled dataset, or GPU cluster.

The trade-off is that few-shot learning consumes context window space. Each example you include takes up tokens that could otherwise be used for the actual task. Finding the right balance between enough examples to establish the pattern and enough remaining context for the work is part of the prompt engineering craft.

Two philosophies of AI

The AI world contains two broad families of models, and understanding the distinction between them clarifies a lot of the conversation around modern AI.

Discriminative models learn to draw boundaries. Given an input, they assign it to a category. A spam filter looks at an email and outputs “spam” or “not spam.” A sentiment analyzer reads a review and outputs “positive,” “negative,” or “neutral.” These models learn the decision boundary between classes and are good at classification, detection, and prediction tasks.

Generative models learn to create. Instead of just sorting things into boxes, they study what the data itself looks like. Once they understand the patterns, they can make new examples that feel similar to what they learned from. GPT writes text, DALL-E draws pictures, and a generative model trained on music could write new songs. In short, these models learn what the data is, not just how to tell one type from another.

The difference really comes down to the kind of question each model is trying to answer. A discriminative model asks: “Given this email, how likely is it that this is spam?” A generative model asks a bigger question: “How likely is it that these particular words would appear together in the first place?”
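In probability notation, the two questions look like this. Here x stands for the input (for example, the words of an email), y for a label, and x_1 through x_n for the individual tokens; the notation is a standard textbook convention rather than anything specific to one model.

```latex
% Discriminative: probability of a label y given an input x
P(y = \text{spam} \mid x)

% Generative (language model): probability of the token sequence itself,
% factored into a chain of next-token predictions
P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})
```

The second line is why next-word prediction is enough to define a generative model: multiplying the probability of each token given its predecessors yields the probability of the whole sequence.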

In everyday life, the LLMs you chat with (like ChatGPT, Claude, or Gemini) are generative models. They create text by picking words based on the patterns they have learned. That said, the line between the two types isn’t strict. Many modern AI systems mix both styles to get the best of each.

How AI explores multiple paths at once

When an LLM generates text one token at a time, it faces a choice at every step. Which token comes next? The simplest strategy is called “greedy decoding” because it picks the single most probable token at each step and moves on. This is fast and easy, but it can paint the model into a corner. The locally best choice at step 3 might lead to an awkward dead end by step 10.

“Beam search” offers an alternative. Instead of committing to one path, it explores multiple candidate sequences simultaneously. If the beam width is 5, the model keeps track of the 5 most promising partial sequences at each step, extending all of them and then pruning back down to the top 5. This lets the model consider that a slightly less obvious token at step 3 might lead to a much better sequence overall.
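The difference shows up even in a two-step toy example. The probability table below is entirely made up to create a case where the greedy first choice is not part of the best overall sequence; a real model would compute these probabilities on the fly.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence so far.
MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def extend(sequence, score):
    """Yield every one-token extension with its cumulative log-probability."""
    for token, prob in MODEL[sequence].items():
        yield sequence + (token,), score + math.log(prob)

def beam_search(beam_width, steps):
    """Keep the beam_width best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [c for seq, score in beams for c in extend(seq, score)]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]  # prune back down to the beam width
    return beams[0]

# A beam width of 1 is exactly greedy decoding.
greedy_seq, greedy_score = beam_search(beam_width=1, steps=2)
beam_seq, beam_score = beam_search(beam_width=2, steps=2)

print(greedy_seq, round(math.exp(greedy_score), 2))  # ('the', 'cat') 0.33
print(beam_seq, round(math.exp(beam_score), 2))      # ('a', 'cat') 0.36
```

Greedy decoding commits to "the" because it looks best at step 1, while a beam of 2 keeps "a" alive long enough to discover that "a cat" is the more probable sequence overall.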

Think of it like navigating a city you have never visited. Greedy decoding always takes the road that looks best right now, even if it leads to a dead end. Beam search keeps track of several promising routes simultaneously and can abandon a path that turns out to be a detour.

Beam search is particularly valuable for structured output tasks like machine translation, where the final sentence needs to be grammatically coherent as a whole. For open-ended creative generation, sampling methods (temperature, top-k, top-p) tend to work better because beam search can be overly conservative, producing safe and repetitive text.

The trade-off is straightforward. Beam search’s memory and computation grow in proportion to the beam width. A beam of 5 is roughly 5 times more work than greedy decoding. For most conversational AI applications, the sampling approaches we discussed earlier have largely replaced beam search as the default generation strategy.

What you now know

We’ve covered a lot of ground. You now understand some of the key foundational concepts that underpin everything happening in the AI space, from what an LLM actually is to how it reads text and generates creative output through temperature, sampling, and beam search.

You know why the context window matters, how models handle unknown words, and why prompt engineering works. You understand zero-shot and few-shot learning, and you can explain the difference between generative and discriminative models without reaching for jargon.

These concepts form the bedrock. Everything else in this series builds on them. In the next installment, we go deeper into the architecture that makes all of this possible: The famous “transformer.” We will look at attention mechanisms, positional encodings, and the specific design decisions that turned a 2017 research paper into the foundation of modern AI.