Zhach's News & Views

I Used to Work at Google. This Is How Your News Is Created.
https://www.facebook.com/zhachory.volker · 2025-09-24 · via Zhach's News & Views

One experience that sets me apart from most people is my time at YouTube and Google working on fact-checking and YouTube News. One of the most complicated and interesting problems we faced was presenting a breaking news story as factually as possible.

The journey from a journalist's keyboard to your phone's notification bar is a multi-tiered, multi-step, near-instantaneous, fault-tolerant process that runs across terabytes of data! It relies on a combination of real-time data collection, clustering algorithms, and natural language processing.

Understanding this process is key to understanding the modern news landscape. It's about seeing beyond the headline and grasping the machinery that shapes what we read. Let's break down the three main stages of how these large tech companies (Google, Meta, X, among many others) identify and define news stories.

The steps are:

  1. Web Scraping
  2. News Clustering
  3. Information Extraction

Stage 1: Real-Time Web Scraping

The first step in this process is the relentless, real-time collection of data. Google's web crawlers, for example, are constantly scanning the web for new and updated content. When it comes to news, this process is even more accelerated. The system doesn't just wait for a weekly sweep; it is specifically designed to recognize news sources and check them for updates in a matter of seconds.
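To make the idea of checking news sources far more often than ordinary sites concrete, here is a minimal sketch of a polling schedule built on a priority queue. The source names and intervals are made up for illustration, time is simulated rather than real, and nothing here reflects how Google's crawlers are actually implemented.

```python
import heapq

def schedule_polls(intervals, horizon):
    """Return the (time, source) visits that fall within `horizon` seconds.

    `intervals` maps a source name to its polling interval in seconds.
    """
    # Every source is due for its first visit at t = 0.
    queue = [(0, name) for name in intervals]
    heapq.heapify(queue)
    events = []
    while queue and queue[0][0] <= horizon:
        due, name = heapq.heappop(queue)
        events.append((due, name))
        # Put the source back in the queue for its next visit.
        heapq.heappush(queue, (due + intervals[name], name))
    return events

# Hypothetical sources: a fast-moving wire service and a slower local blog.
events = schedule_polls({"wire-service": 5, "local-blog": 30}, horizon=30)
for due, name in events:
    print(f"t={due:>2}s  poll {name}")
```

The heap always surfaces whichever source is due next, so a frequently updated outlet gets visited many times for every visit to a slow one.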


This process, known as web scraping, involves automated programs that visit millions of websites and extract specific pieces of information. For a news article, this could include the headline, the body of the text, the author, the publication date, and any images or videos. These programs are not just looking at major outlets like The New York Times or the BBC; they are scraping thousands of local, national, and international news sources. This massive, continuous flow of data is the raw material from which news stories are built.
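As a rough illustration of pulling those fields out of a page, the sketch below parses a hard-coded HTML snippet with Python's standard `html.parser` module. The page layout (an `h1` headline, a `meta` author tag, a `time` element, and `p` paragraphs) is an assumption for the example; real news sites vary widely, and production scrapers are far more robust.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the headline, author, date, and body paragraphs of a page."""

    def __init__(self):
        super().__init__()
        self.fields = {"headline": "", "author": "", "published": "", "body": []}
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "author":
            self.fields["author"] = attrs.get("content", "")
        elif tag == "time":
            self.fields["published"] = attrs.get("datetime", "")
        elif tag in ("h1", "p"):
            self._current = tag
            if tag == "p":
                self.fields["body"].append("")  # start a new paragraph

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.fields["headline"] += data
        elif self._current == "p":
            self.fields["body"][-1] += data

# Made-up sample page, so the example runs without any network access.
SAMPLE = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><h1>Wildfire Forces Evacuations</h1>
<time datetime="2025-09-24T08:00:00+00:00">Sept 24</time>
<p>Crews are working on containment.</p>
<p>Thousands have left their homes.</p></body></html>
"""

parser = ArticleParser()
parser.feed(SAMPLE)
article = parser.fields
print(article)
```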

The goal at this stage is not to understand the content yet, but simply to gather as much of it as possible, as fast as possible. “Quantity over quality” is the name of the game. The system is essentially building an enormous, ever-changing library of all the new information being published on the internet at any given moment.

Stage 2: Clustering News Stories

Once the data is collected, the next challenge is to make sense of it. A single event, like a natural disaster or a political announcement, will be reported by dozens, if not hundreds, of different news outlets. Each article will have a different headline, a different lead paragraph, and a slightly different angle. The task of the AI is to recognize that all these separate articles are actually about the same underlying event. This is where clustering comes in.

Clustering is an unsupervised machine learning technique that groups similar items together without any pre-defined categories. In the context of news, the AI analyzes the content of each article. It looks at the words used, the names of people and places mentioned, and the overall context. It then uses this information to calculate a "similarity score" between articles. Articles with a high similarity score are grouped together into a single cluster.
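One common way to compute such a similarity score is TF-IDF weighting followed by cosine similarity. The sketch below is a plain-Python version of that textbook approach; it is an assumption for illustration, not a description of the scoring any particular company uses, and the three sample texts are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed IDF, so terms found in every document keep some weight.
        vectors.append({
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

DOCS = [
    "Wildfire spreads in California as evacuation orders expand",
    "California wildfire prompts evacuation as containment efforts continue",
    "Central bank raises interest rates again",
]
vectors = tfidf_vectors(DOCS)
print(round(cosine(vectors[0], vectors[1]), 3))  # two wildfire articles
print(round(cosine(vectors[0], vectors[2]), 3))  # wildfire vs. finance
```

The two wildfire texts share several weighted terms and score well above zero, while the wildfire and finance texts share no words at all and score exactly zero.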


For example, a news article from Reuters, one from a local paper in California, and another from a foreign news agency, might all report on the same wildfire. Despite their different titles and writing styles, the AI would recognize that they all mention the same location, the same time frame, and similar key phrases like "wildfire," "evacuation," and "containment efforts." It would then place all these articles into a single cluster, which it now identifies as a unique "story." This is why when you go to Google News, you see a single headline with a carousel of different articles. The system is showing you that cluster!
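The grouping step can be sketched with a simple single-link clustering: any two articles whose similarity clears a threshold end up in the same story. To keep the example short it uses Jaccard word overlap on headlines as a stand-in similarity measure. The headlines and the threshold of 0.25 are invented for illustration, and real systems use richer signals such as entities, timestamps, and full article text.

```python
import re

def tokens(text):
    """Set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(headlines, threshold=0.25):
    """Single-link clustering: return lists of headline indices, one per story."""
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [tokens(h) for h in headlines]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

HEADLINES = [
    "California wildfire forces evacuation of thousands",
    "Evacuation ordered as wildfire spreads in California",
    "Wildfire containment efforts continue in California",
    "Central bank raises interest rates",
    "Interest rates rise after central bank decision",
]
stories = cluster(HEADLINES)
print(stories)
```

Note that the third headline is not similar enough to the first on its own, but it joins the wildfire story through the second headline. That chaining is the defining behavior of single-link clustering, and it is also why thresholds need care: set too low, unrelated stories can merge.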

Stage 3: Information Extraction

The final stage is turning these clusters of articles into a coherent, single news story that a user can understand at a glance. This involves information extraction, which is the process of pulling out the most important facts from the cluster of articles.

The AI analyzes all the articles within a cluster to identify the who, what, when, where, and why of the event. It might determine that the most common headline theme is "Hurricane Hits the Coast," and that the most frequently mentioned location is "Florida." It would also extract key details like the number of people affected, the names of any officials involved, and the most recent updates on the situation.
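A very reduced version of this "most frequently mentioned" logic is a vote across the articles in a cluster. In the sketch below, the location list, the stopword set, and the sample articles are all hypothetical; a real pipeline would rely on a trained named-entity recognizer rather than a hand-written lookup list.

```python
import re
from collections import Counter

# Hypothetical lookup list standing in for a named-entity recognizer.
KNOWN_LOCATIONS = ["Florida", "Georgia", "Texas"]
STOPWORDS = {"the", "a", "of", "as", "in", "on", "and", "to"}

def extract_facts(articles):
    """Vote across a cluster for the dominant location and headline term."""
    places = Counter()
    terms = Counter()
    for art in articles:
        text = art["headline"] + " " + art["body"]
        # Each article votes at most once per location.
        places.update(p for p in KNOWN_LOCATIONS if p in text)
        words = set(re.findall(r"[a-z]+", art["headline"].lower()))
        terms.update(words - STOPWORDS)
    return {
        "where": places.most_common(1)[0][0] if places else None,
        "what": terms.most_common(1)[0][0] if terms else None,
    }

CLUSTER = [
    {"headline": "Hurricane Hits the Coast",
     "body": "The storm reached Florida overnight."},
    {"headline": "Hurricane Makes Landfall on the Coast",
     "body": "Florida officials urged residents to shelter."},
    {"headline": "Florida Braces as Hurricane Hits",
     "body": "Georgia may be next as the storm moves north."},
]
facts = extract_facts(CLUSTER)
print(facts)
```

Florida is mentioned in all three articles and Georgia in only one, so the vote settles on Florida, while "hurricane" is the only term that appears in every headline.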

Image: generated by Gemini from the prompt "Can you make an image of an AI taking in a bunch of news articles and outputting a sensible summary and headline?"

The system then uses this extracted information to generate a concise summary and a compelling headline. It might also use the information to organize the story, creating a timeline of events or highlighting specific facts in a "key updates" section.
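Production systems typically generate summaries and headlines with language models. As a much simpler stand-in, the sketch below picks the existing headline that overlaps most with the others in its cluster (a medoid) and orders updates by timestamp to form a timeline. The function names, headlines, and timestamps are all invented for illustration.

```python
import re
from datetime import datetime

def words(text):
    """Set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(a, b):
    """Jaccard overlap between two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def representative_headline(headlines):
    """Return the headline with the highest total overlap with the others."""
    if not headlines:
        return None
    sets = [words(h) for h in headlines]

    def score(i):
        return sum(overlap(sets[i], s) for j, s in enumerate(sets) if j != i)

    return headlines[max(range(len(headlines)), key=score)]

def timeline(updates):
    """Order updates from oldest to newest by their ISO 8601 timestamp."""
    return sorted(updates, key=lambda u: datetime.fromisoformat(u["time"]))

HEADLINES = [
    "Hurricane hits Florida coast",
    "Hurricane hits the coast as thousands evacuate",
    "Florida coast battered by hurricane",
]
UPDATES = [
    {"time": "2025-09-24T12:00:00+00:00", "text": "Power outages reported"},
    {"time": "2025-09-24T08:00:00+00:00", "text": "Storm makes landfall"},
    {"time": "2025-09-24T10:30:00+00:00", "text": "Evacuation orders expanded"},
]

print(representative_headline(HEADLINES))
for update in timeline(UPDATES):
    print(update["time"], update["text"])
```

Choosing an existing headline rather than writing a new one avoids inventing facts, at the cost of never producing anything more concise than what an outlet already published.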

💡 The three steps to get real-time news stories are: web scraping, clustering, and information extraction.

The ability of these systems to process, cluster, and synthesize information in real time has fundamentally changed how we consume news. It has created a world where you can get a broad overview of a global event from multiple perspectives with a single click. It also raises important questions about the role of human editors and the potential for these algorithms to inadvertently amplify certain narratives. The next time you see a breaking news alert on your phone, take a moment to think about the complex journey it took to get there.

I'm curious to hear your thoughts on this. Do you think the automation of news aggregation is a net positive for journalism, or does it come with hidden dangers? What are the benefits of a single, clustered news story versus reading individual reports from different outlets? Share your opinions in the comments below.


Sources:

  1. How Google supports journalism and the news industry. (n.d.). Google. https://blog.google/supportingnews/
  2. Web scraping for news and article collection. (n.d.). Octoparse. https://www.octoparse.com/blog/how-to-scrape-news-and-articles-data
  3. Clustering News Articles for Topic Detection: A Technical Deep Dive. (n.d.). DEV Community. https://dev.to/mayankcse/clustering-news-articles-for-topic-detection-a-technical-deep-dive-2692
  4. A Survey of Information Extraction Based on Deep Learning. (n.d.). MDPI. https://www.mdpi.com/2076-3417/12/19/9691