Host- and Domain-Level Web Graphs April, May, and June 2026
2026-06-25 · via Common Crawl

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of April, May, and June 2026. The crawls used to generate the graphs were CC-MAIN-2026-17, CC-MAIN-2026-21, and CC-MAIN-2026-25. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases. You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions for exploring the graphs in the webgraph format are given in our collection of Web Graph Notebooks. For more information on ranking, see our Web Graph Statistics page.

Host-level Graph

The host-level graph consists of 247.3 million nodes and 6.3 billion edges.

There are 184.2 million dangling nodes (74.48%) and the largest strongly connected component contains 36.3 million (14.66%) nodes. Dangling nodes stem from:

  • Hosts that have not been crawled, yet are pointed to from a link on a crawled page
  • Hosts without any links pointing to a different host name
  • Hosts that returned only an error page (e.g., HTTP 404)
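
As a rough illustration of the term, a dangling node can be read as a node without any outgoing edge. The minimal Python sketch below finds such nodes in a toy edge list. The host names and the find_dangling helper are made up for illustration and are not part of the Common Crawl tooling.

```python
def find_dangling(nodes, edges):
    """Return the set of nodes that never appear as the source of an edge."""
    has_outlink = {src for src, _dst in edges}
    return set(nodes) - has_outlink


# Toy data: hypothetical hosts in reverse domain name notation.
nodes = ["com.example.www", "org.example", "net.example.blog", "com.example.shop"]
edges = [
    ("com.example.www", "org.example"),
    ("com.example.www", "net.example.blog"),
    ("net.example.blog", "com.example.www"),
]

dangling = find_dangling(nodes, edges)
share = 100.0 * len(dangling) / len(nodes)
print(sorted(dangling), f"{share:.2f}%")
```

In this toy graph, org.example is linked to but has no outgoing links, and com.example.shop has no links at all, so both count as dangling.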

Host names in the graph are in reverse domain name notation such that: www.example.com becomes com.example.www.
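
The conversion between the two notations is a matter of reversing the dot-separated labels. A minimal Python sketch follows; the function name reverse_host is an assumption for illustration and is not taken from the cc-webgraph code.

```python
def reverse_host(host: str) -> str:
    """Reverse the dot-separated labels of a host name.

    Applying the function twice returns the original name, so it converts
    in both directions.
    """
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.example.com"))
print(reverse_host("com.example.www"))
```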

You can download the graph and the ranks of all 247.3 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2026-apr-may-jun/host/ (this requires an AWS account). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2026-apr-may-jun/host/ as a prefix to access the files from anywhere.

Please note that the text representation of the host-level graph is shipped in 240 gzip-compressed files listed in two path listings: one for the nodes (vertices) and one for the edges (arcs). First, download the path listings and decompress them using gzip -d or gunzip. Adding the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line of a path listing yields the list of URLs needed to download the entire graph.
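
The prefixing step can be sketched in a few lines of Python. The example below works on an in-memory, gzip-compressed listing so that it runs without network access. The two paths in the sample are placeholders, not real file names from this release, and listing_to_urls is a hypothetical helper.

```python
import gzip

HTTPS_PREFIX = "https://data.commoncrawl.org/"
S3_PREFIX = "s3://commoncrawl/"


def listing_to_urls(listing_gz: bytes, prefix: str = HTTPS_PREFIX) -> list:
    """Decompress a gzip-compressed path listing and prefix every line."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Placeholder listing; a real path listing would be downloaded first.
sample = gzip.compress(
    b"example/path/part-00000.txt.gz\nexample/path/part-00001.txt.gz\n"
)

for url in listing_to_urls(sample):
    print(url)
```

Passing prefix=S3_PREFIX instead yields s3:// URLs for use with AWS tooling.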

Download files of the Common Crawl April, May, and June 2026 host-level Web Graph

Domain-level Graph

The domain graph is built by aggregating the host graph on the level of pay-level domains (PLDs) based on the public suffix list maintained on publicsuffix.org. Version (commit) e596036 of the public suffix list was used (commit date 2026-05-28T06:25:44Z).
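
The aggregation idea can be sketched in Python under strong simplifications. The sketch uses a tiny hard-coded suffix set in place of the real public suffix list, ignores that list's wildcard and exception rules, and drops links that stay within one domain. All three are assumptions of this example; the actual pipeline lives in cc-webgraph and cc-pyspark, and the function names below are hypothetical.

```python
# Hypothetical stand-in for the public suffix list (forward notation).
TOY_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(rev_host: str) -> str:
    """Map a reversed host name to its reversed pay-level domain.

    The pay-level domain is the longest matching public suffix plus one
    more label. Hosts without a known suffix are returned unchanged.
    """
    labels = rev_host.split(".")
    best = 0
    for n in range(1, len(labels) + 1):
        suffix = ".".join(reversed(labels[:n]))  # back to forward notation
        if suffix in TOY_SUFFIXES:
            best = n
    if best == 0 or best == len(labels):
        return rev_host
    return ".".join(labels[: best + 1])


def aggregate(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        if s != d:  # assumption: links within one domain are dropped
            domain_edges.add((s, d))
    return domain_edges


host_edges = [
    ("com.example.www", "com.example.blog"),
    ("com.example.www", "uk.co.example.shop"),
    ("com.example.blog", "uk.co.example.www"),
]
print(aggregate(host_edges))
```

Here three host-level edges collapse into a single domain-level edge from com.example to uk.co.example.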

The domain-level graph has 121.1 million nodes and 3.9 billion edges. 77.5 million nodes (64.0%) are dangling nodes, and the largest strongly connected component covers 30.0 million nodes (24.73%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2026-apr-may-jun/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2026-apr-may-jun/domain/.

Download files of the Common Crawl April, May, and June 2026 domain-level Web Graph

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible. We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics.

Let us know about your results via Common Crawl's Google Group, or on our Discord Server.