The Data That Taught the Machines


zozo123-IB · 2026-06-07 · via Hacker News - Newest: "AI"

01 — The perfect classroom

The accidental machine-teaching format

Nobody designed Stack Overflow to train neural networks. But its structure — a natural-language question, a reasoned human answer, and a community verdict — happens to be the exact shape modern language models need to learn from.

A language model is trained to predict the next token given a prompt. To turn a raw predictor into a helpful assistant, labs need three increasingly scarce ingredients: clean instruction → response pairs, worked reasoning, and a signal for what counts as a good answer. A single Stack Overflow thread quietly supplies all three.

[Interactive figure: anatomy of a training example, mapping each part of an ordinary Q&A post onto a phase of LLM training.]

Question · score 312 · accepted answer ✓

How do I reverse a string in Python?

I have a string and I want the characters in reverse order. What is the idiomatic way to do this without writing an explicit loop?

Tags: python, string, slicing

asked 11 years ago · viewed 2.1m times

Answer · score 1.4k

Python strings are sequences, so you can use an extended slice with a negative step. This walks the sequence backwards and is far faster than a manual loop because it runs in C:

    >>> s = "hello"
    >>> s[::-1]
    'olleh'

The [::-1] means "start to end, step −1". Note that this returns a new string (strings are immutable), and the same slice works for lists. One caveat: like "".join(reversed(s)), the slice reverses code points, so text with combining characters needs grapheme-aware handling.

answered 11 years ago · edited 4 years ago

Each layer of the post corresponds to a stage that labs otherwise pay millions of dollars in human annotation to recreate.

Instruction–response pairing. Millions of "prompt → completion" pairs, already written in the exact register users talk to assistants in.
Built-in quality control. Upvotes, downvotes and the accepted-answer checkmark are a ready-made preference dataset — the precursor to RLHF, donated for free.
Step-by-step reasoning. The best answers narrate the logic and the edge cases, teaching chain-of-thought rather than syntax-memorization.
Debugging context. Endless error-message → fix pairs taught models to recognize a stack trace and propose the patch.
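As a rough illustration of the mapping above, here is a minimal sketch of how one thread could be flattened into an instruction → response record. The `Thread` and `Answer` classes, their field names, and the "accepted first, then highest score" selection rule are assumptions made for this example, not a documented lab pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Answer:
    body: str
    score: int
    accepted: bool = False


@dataclass
class Thread:
    title: str
    question: str
    answers: List[Answer] = field(default_factory=list)


def to_sft_example(thread: Thread) -> Optional[dict]:
    """Turn a thread into one prompt/completion pair, or None if unanswered.

    Assumption: prefer the accepted answer, then break ties by vote score.
    """
    if not thread.answers:
        return None
    best = max(thread.answers, key=lambda a: (a.accepted, a.score))
    prompt = f"{thread.title}\n\n{thread.question}"
    return {"prompt": prompt, "completion": best.body}


thread = Thread(
    title="How do I reverse a string in Python?",
    question="What is the idiomatic way to do this without an explicit loop?",
    answers=[
        Answer("Write a for loop that prepends each character.", score=12),
        Answer("Use an extended slice: s[::-1].", score=1400, accepted=True),
    ],
)
example = to_sft_example(thread)
print(example["completion"])  # Use an extended slice: s[::-1].
```

The votes and the accepted flag are only used here to pick a completion; the same fields can also be kept as a separate quality signal.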

02 — The reward signal

Turning upvotes into a reward function

The hardest problem in alignment is teaching a model what "good" looks like. Stack Overflow had already crowd-sourced that judgment, one vote at a time — and researchers wired it directly into the training loop.

When Hugging Face built StackLLaMA, an end-to-end RLHF demo, they didn't hire annotators. They converted each answer's community score into a reward with a simple formula: round(log2(1 + upvotes)), plus 1 if the answer was accepted, minus 1 if its score was negative.⁸
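The scoring rule cited in source 8 can be sketched in a few lines. The function name `stack_reward` is made up for this example, and because log2(1 + upvotes) is undefined for negative vote counts, the sketch assumes a negative score simply maps to −1; treat it as one reading of the rule, not the StackLLaMA code itself.

```python
import math


def stack_reward(upvotes: int, accepted: bool) -> int:
    """Scalar reward from community votes.

    Assumption: answers with a negative score get a flat reward of -1.
    """
    if upvotes < 0:
        return -1
    return round(math.log2(1 + upvotes)) + (1 if accepted else 0)


print(stack_reward(1400, True))   # 11
print(stack_reward(3, False))     # 2
print(stack_reward(-4, False))    # -1
```

The logarithm compresses the scale: going from 3 to 1,400 upvotes moves the reward by single digits, so a handful of viral answers cannot dominate the signal.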


Given two answers to the same question, the reward model learns to score the accepted, highly voted one above the rest: exactly the preference ordering RLHF needs, harvested from fifteen years of clicks.

StackLLaMA used a >10M-instruction Stack Exchange set and sampled answer pairs; the higher-reward answer is the "chosen", the other "rejected".⁸
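A minimal sketch of the pairing step, assuming each answer already carries a scalar reward. It enumerates every pair rather than sampling, and it skips ties; both choices, along with the function name `preference_pairs`, are simplifications for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def preference_pairs(answers: List[Tuple[str, int]]) -> List[Dict[str, str]]:
    """Build chosen/rejected pairs from (answer_text, reward) tuples.

    Assumption: pairs with equal reward carry no preference and are dropped.
    """
    pairs = []
    for (text_a, reward_a), (text_b, reward_b) in combinations(answers, 2):
        if reward_a == reward_b:
            continue
        if reward_a > reward_b:
            pairs.append({"chosen": text_a, "rejected": text_b})
        else:
            pairs.append({"chosen": text_b, "rejected": text_a})
    return pairs


answers = [("use s[::-1]", 11), ("write a loop", 2), ("use recursion", 2)]
pairs = preference_pairs(answers)
print(len(pairs))  # 2
```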

03 — The corpus

How much of "the AI" is literally us

Stack Exchange shows up by name in the documented recipe of nearly every foundational dataset. Per byte, it punches far above its weight — labs include it specifically for question-answering and code quality.

Here's the receipt. These are the documented contributions of Stack Overflow / Stack Exchange to public training corpora and code models.

The Pile devoted 5.13% of its weight to Stack Exchange "hoping it will improve the question-answering capabilities of downstream models." LLaMA sorted answers by vote score before training. The signal was never accidental.

Dataset / model	Stack Exchange / Overflow share	Date
The Pile (EleutherAI)	32.2 GiB · 5.13% weight · 2 epochs²	Jan 2021
LLaMA (Meta)	78 GB · 2.0% sampling · sorted by score³	Feb 2023
RedPajama-V1	~20B tokens⁴	2023
Dolma / OLMo (AI2)	29.3M docs · ~19.6B tokens⁵	2024
InCoder (Meta)	57 GB of Stack Overflow Q&A+comments¹⁰	Apr 2022
StarCoder2 / The Stack v2	~11M questions · >10B tokens¹¹	Feb 2024
RefinedWeb / Falcon	deliberately excluded (the control case)¹²	Jun 2023

StarCoder2 didn't just dump the dump: it kept only questions with ≥3 answers, then used Llama-2-70B to rate 20,000 pairs and trained a quality classifier — using Stack Overflow's own vote structure to clean Stack Overflow.¹¹

04 — The collapse

The chart that broke the flywheel

The moment a machine could answer instantly, with zero judgment and no "closed as duplicate", the questions stopped coming. Stack Overflow trained the thing that emptied its own front page.

Every credible version of this chart traces back to one query against Stack Overflow's own public Data Explorer: COUNT(*) … WHERE PostTypeId = 1, grouped by month. The markers below are documented data points; ChatGPT launched on Nov 30, 2022.

[Interactive chart: monthly questions asked, with markers for the all-time peak, the ChatGPT launch (Nov 2022), and a "what-if" no-AI trend line.]

The decline started softly in 2020; ChatGPT turned it into a cliff.

Peak: ~146k questions/month (Mar 2021). ChatGPT launch month (Nov 2022): 108,563. Two years later (Dec 2024): 25,566, a 76% fall from the launch-month figure and back to roughly 2009 volumes.¹ A controlled study isolated a ~25% drop attributable to ChatGPT specifically, by comparing against markets where it wasn't available.¹³
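The percentages are easy to check from the quoted data points. The sketch below uses the three figures given in the text; the peak is the approximate ~146k value, so the second result is only indicative.

```python
def pct_fall(before: float, after: float) -> float:
    """Percentage decline from `before` to `after`."""
    return 100 * (before - after) / before


peak_mar_2021 = 146_000    # approximate figure from the text
launch_nov_2022 = 108_563
dec_2024 = 25_566

print(round(pct_fall(launch_nov_2022, dec_2024)))  # 76
print(round(pct_fall(peak_mar_2021, dec_2024)))    # 82
```

Measured from the launch month the fall is about 76%; measured from the approximate all-time peak it is closer to 82%.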

This is the self-cannibalization loop. The friction Stack Overflow was famous for — waiting for a response, the gatekeeping, the dreaded duplicate flag — was exactly the friction a private, patient, non-judgmental model removed. Usage didn't migrate to a better forum. It migrated out of public view entirely, into one-on-one chats that are never indexed, never voted on, and never seen by the next learner.

05 — The ouroboros

What happens when AI eats its own tail

If new questions stop being asked in public, where does the next generation of training data come from? And what happens to a model trained on the output of the model before it?

In July 2024, Nature published the canonical result: "AI models collapse when trained on recursively generated data." Train a model on its predecessor's output, generation after generation, and the tails of the distribution vanish first — the rare, the novel, the edge case — until the model converges on bland, repetitive sludge.¹⁴

[Interactive: model-collapse simulator. Each generation trains on a sample of the previous one (default sample size per generation: 120), starting from generation 0, the true human distribution (wide, with fat tails).]

In replace mode, each generation replaces the corpus with synthetic samples, which is the Shumailov setup. The curve narrows and drifts: that's collapse.¹⁴

The good news, and the live debate: collapse depends on replacing human data with synthetic. A follow-up showed that if you accumulate, keeping the real data and adding synthetic on top, test error stays bounded.¹⁵ That is precisely why a fresh, human, verified stream of Q&A is suddenly a strategic asset. The supply, meanwhile, is contracting fast: an audit of 14,000 web domains found that in a single year, restrictions locked away 5%+ of all tokens in the common C4 corpus, and 28%+ of the most actively maintained sources.¹⁶
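The replace-versus-accumulate distinction can be tried in a toy setting. The sketch below is not the simulator from the page or the setup of either paper: it assumes a one-dimensional Gaussian as a stand-in for the human distribution (so no fat tails), treats "training" as fitting a mean and standard deviation, and uses the sample size of 120 only because that is the default mentioned above.

```python
import random
import statistics
from typing import List


def simulate(generations: int, sample_size: int = 120,
             accumulate: bool = False, seed: int = 0) -> List[float]:
    """Return the fitted standard deviation after each generation.

    Assumption: each "model" is just a Gaussian fitted to its training pool.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stand-in for real human data
    pool = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    history = [sigma]
    for _ in range(generations):
        # Sample synthetic data from the current model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Replace mode discards everything older; accumulate mode keeps it.
        pool = pool + synthetic if accumulate else synthetic
        mu = statistics.fmean(pool)
        sigma = statistics.stdev(pool)
        history.append(sigma)
    return history


replaced = simulate(generations=50, accumulate=False)
accumulated = simulate(generations=50, accumulate=True)
print(f"replace:    final std = {replaced[-1]:.3f}")
print(f"accumulate: final std = {accumulated[-1]:.3f}")
```

In replace mode every generation's estimate is built only on the previous generation's noise, so errors compound as a random walk; in accumulate mode the original samples stay in the pool and anchor each refit. Individual runs vary with the seed, so it is worth trying several.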

"What happens when we stop pooling our knowledge with each other and instead pour it straight into The Machine?" — Peter Nixey, Stack Overflow contributor, in InfoWorld, May 2025¹⁷

06 — Mirrors & memory

Models learned the code — and the culture

A model trained on a corpus inherits more than its facts. It inherits its tone, its blind spots, and sometimes its exact words.

The "StackGPT" thought experiment (illustrative)

There's a vivid way to feel this, often passed around as a cautionary tale: imagine training a model exclusively on Stack Overflow threads. It would be a phenomenal debugger — and it might also greet a beginner's question with the site's notorious bedside manner:

"This is a basic question that shows you haven't done any research. Downvoted for lack of effort. Marked as duplicate." — the kind of answer a culture-faithful model would learn to give

Whether or not anyone has shipped exactly this model, the underlying point is well-established and important: LLMs are mirrors of their training data. They absorb register and social norms alongside syntax. A model is never just "the code" — it's the community that wrote it.

The memorization problem (active research)

When a model is asked a popular question, is it reasoning, or is it reconstructing something it has seen thousands of times? Work studying memorization using answers to Stack Overflow questions suggests that for well-trodden problems, code generation leans heavily on memorized content — a collage of remembered snippets more than fresh synthesis.¹⁸

That's great for accuracy on common tasks and legally fraught for everything else. Stack Overflow content is licensed CC BY-SA — free to reuse with attribution and share-alike. When a model regurgitates a snippet verbatim but strips the author and the license, it walks straight into the open question the whole industry is now litigating.⁶

07 — The historical arc

Eighteen years in seven beats

2008 – 2021 · The golden era

Developers build the bedrock

A community accretes tens of millions of questions and answers — edge cases, architecture debates, error logs — all voted, edited, and version-tagged. Question volume peaks around 146k/month in March 2021.¹ In June 2021, Prosus buys Stack Overflow for $1.8B, near the top.¹⁹

2021 – 2023 · The great scrape

The corpus becomes a dataset

Stack Exchange is baked into The Pile, InCoder, LLaMA and more, explicitly and by name, with LLaMA sorting answers by vote score. The flywheel of free human labor quietly becomes pre-training fuel.

Nov 30, 2022 · The inflection

ChatGPT ships

Instant, private, judgment-free answers. The CEO would later call it an "existential moment." Question volume begins its cliff.

Oct 2023 · The cost

28% of staff laid off

In October 2023, Stack Overflow cuts roughly 28% of its workforce on the "path to profitability."²²

Feb – May 2024 · The pivot

If you can't beat them, license to them

Stack Overflow launches OverflowAPI: a Google Cloud / Gemini partnership in February, then a landmark OpenAI deal in May 2024, offering paid, continuous access to fresh, vetted Q&A, with a promise to cite Stack Overflow inside ChatGPT.²⁰

May 2024 · The rebellion

Contributors revolt

Furious that their volunteer work is being sold, some users overwrite or delete their highest-rated answers in protest. Stack Overflow restores the posts and suspends the accounts, citing its perpetual, irrevocable content license.²¹

2025 – now · The repositioning

From destination to verification layer

84% of developers use AI; only 33% trust it, and 45% say their top frustration is "AI solutions that are almost right, but not quite."⁹ Stack Overflow's bet: as machine-written code floods in, human-verified knowledge becomes more valuable, not less.

08 — The unsolved problem

Who keeps feeding the machine?

The hard question isn't technical. It's about incentives.

Stack Overflow worked because answering a stranger's question bought you reputation, visibility, and the quiet satisfaction of being right in public. AI removes the audience. Why write the canonical answer to a tricky concurrency bug if the next developer will ask a chatbot — one that learned from your answer but will never send anyone back to you?

The data-licensing bet. OpenAI, Google and GitHub pay for a fresh, vetted stream — turning the commons into a metered utility. It funds the company; it doesn't obviously refill the well of volunteers.
The verification-layer bet. Position humans as the trusted ground truth that models cite and RAG-retrieve against — most valuable precisely when AI is "almost right."
The execution-feedback escape hatch. For code specifically, models increasingly learn from running code that compiles and passes tests — verifiable reward that doesn't need a forum. This partly insulates coding from collapse, but it can't invent knowledge about a framework no human has yet written about.¹⁵
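To make the execution-feedback idea concrete, here is a minimal sketch of a verifiable reward: a candidate function earns 1.0 only if it passes every test case. The name `execution_reward` and the all-or-nothing scoring are assumptions for illustration; real pipelines run generated code in a sandbox rather than calling it directly.

```python
from typing import Any, Callable, Sequence, Tuple


def execution_reward(candidate: Callable[..., Any],
                     test_cases: Sequence[Tuple[tuple, Any]]) -> float:
    """Return 1.0 if the candidate passes every (args, expected) case, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed test.
            return 0.0
    return 1.0


def reverse_ok(s: str) -> str:
    return s[::-1]


def reverse_buggy(s: str) -> str:
    return s  # forgets to reverse


cases = [(("hello",), "olleh"), (("ab",), "ba")]
print(execution_reward(reverse_ok, cases))     # 1.0
print(execution_reward(reverse_buggy, cases))  # 0.0
```

No human vote is involved, which is the appeal. The limit is the one noted above: someone still has to write the tests, and tests cannot cover a framework nobody has documented yet.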

Stack Overflow didn't teach AI the syntax of Python or C++. It gave AI the collective reasoning of millions of engineers solving real problems in public. The open question is how to keep that signal alive once the machine can already answer most of what we'd have asked it.

09 — Go deeper

Build this yourself

Want to move past using AI to understanding — or training — it? A curated path, from attention math to fine-tuning on a dataset like this one.

10 — Receipts

Sources & honest caveats

Every headline number above is sourced here. Where reporting is contested or a claim is illustrative, it's flagged — verify before you quote.

Legend: fact · primary source | contested / methodology varies | illustrative, not a verified event

1. Question-volume figures from the Stack Exchange Data Explorer (peak ~146k Mar 2021; 108,563 Nov 2022; 25,566 Dec 2024; −76%). Compiled analyses: hopeseekr gist, Holscher, Pragmatic Engineer. [caveat: the peak figure varies by definition (146k surviving vs ~200k raw posts)]
2. The Pile: Stack Exchange at 32.2 GiB / 5.13% weight / 2 epochs. arXiv:2101.00027, Table 1.
3. LLaMA: StackExchange 78 GB, 2.0% sampling, answers "sorted by score." arXiv:2302.13971.
4. RedPajama-V1: ~20B StackExchange tokens of ~1.2T. arXiv:2411.12372.
5. Dolma v1.7 / OLMo: StackExchange 29.3M docs, ~19.6B tokens. Dolma card, arXiv:2402.00159.
6. Stack Overflow corpus size (58M+ Q&A) and CC BY-SA licensing. SO data licensing; DevClass on the data-dump restriction. [caveat: whether the dump restriction is compatible with CC BY-SA is disputed]
7. LIMA: 1,000 curated examples (≥10 score Stack Exchange answers) beat 52k. arXiv:2305.11206.
8. StackLLaMA: reward = round(log2(1 + upvotes)) + accepted − (score < 0). HF blog; dataset lvwerra/stack-exchange-paired.
9. Stack Overflow Developer Survey 2025: 84% use AI, trust 33% (distrust 46%), "almost right" the #1 frustration (45%). survey.stackoverflow.co/2025/ai.
10. InCoder: 57 GB of Stack Overflow content alongside 159 GB code. arXiv:2204.05999.
11. StarCoder2 / The Stack v2: ~11M SO questions, >10B tokens, classifier-filtered. arXiv:2402.19173.
12. RefinedWeb / Falcon: curated sources incl. Stack Exchange deliberately excluded (the control case). arXiv:2306.01116.
13. Controlled estimate: ~25% activity drop attributable to ChatGPT (vs comparison platforms). Summarized via SO blog, Issue 252.
14. Shumailov et al., "AI models collapse when trained on recursively generated data," Nature 631 (2024). nature.com.
15. Gerstgrasser et al., "Is Model Collapse Inevitable? … Accumulating Real and Synthetic Data": collapse is avoided when data accumulate. arXiv:2404.01413.
16. Longpre et al., "Consent in Crisis: The Rapid Decline of the AI Data Commons," NeurIPS 2024. arXiv:2407.14933.
17. Peter Nixey, quoted in M. Asay, "What comes after Stack Overflow?", InfoWorld, May 2025. infoworld.com.
18. Research on memorization using answers to Stack Overflow questions: code generation as a collage of memorized content for popular problems. OpenReview. [caveat: verify exact venue/claims before quoting specifics]
19. Prosus acquires Stack Overflow for $1.8B, June 2021. TechCrunch.
20. Stack Overflow × OpenAI partnership (OverflowAPI), May 6 2024; Google Cloud / Gemini, Feb 29 2024 (terms undisclosed). OpenAI, TechCrunch.
21. User protest + account suspensions, May 2024. The Register. [caveat: scale of bans reported via affected users, not officially confirmed]
22. Stack Overflow lays off ~28% of staff, Oct 2023. TechCrunch.

The Data That Taught the Machines. An interactive essay on how Stack Overflow built the coding agent — and the incentive problem it left behind.

Built in the spirit of a learn-by-dragging sandbox. Vanilla JS, Plotly & KaTeX. No backend, no telemetry. Numbers are sourced above; illustrative items are labeled. Generated with help from Claude Code.

