Analyzing Link Rot in My Newsletter (After 31 Editions)
Xavier Decuy · 2023-09-02 · via Simply Explained

I've been writing a monthly newsletter for the past 2.5 years. In every edition, I link to interesting articles related to science and technology. I thought it would be interesting to analyze how many of those links are still accessible, and how many have succumbed to link rot. Let's dive in!

Testing script

To test each link, I wrote a simple Python script that extracts all external URLs from previous newsletters and checks whether each one returns an HTTP 200 OK status code.

"""
    Extracts all URLs from newsletter editions and checks whether each URL is
    still active. It follows redirects.
    Writes output directly to the console as CSV with the following fields:
    Filename, URL, Domain name, Response code.
"""
import re
import glob
import requests 
from urllib.parse import urlparse

url_pattern = re.compile(r'\b((?:https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#\/%=~_|])')

# Try to appear somewhat legit
http_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Returns the domain name of a given URL
def extract_base_domain(url) -> str:
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    if domain.startswith('www.'):
        domain = domain[4:]
    return domain

# CSV header
print("Filename;Linked URL;Domain name;Response code")

# Loop over all newsletter editions
for filename in glob.glob("src/site/newsletter/**/*.md"):
    with open(filename, "r", encoding="utf-8") as file:
        contents = file.read()
        urls = re.findall(url_pattern, contents)
    
        for url in urls:
            base_url = extract_base_domain(url)
            status_code = "internal_error"  # Default fallback value
            
            try:
                # Follow redirects; give up after 10 seconds so a slow host can't stall the run
                response = requests.get(url, allow_redirects=True, headers=http_headers, timeout=10)
                status_code = response.status_code
            except requests.exceptions.RequestException:
                pass

            print(f"{filename};{url};{base_url};{status_code}")

The script prints semicolon-separated CSV output, which I imported and analyzed in Google Sheets (available here).
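To get a feel for what the extraction step captures, the script's `url_pattern` can be run against a few sample strings. This is a minimal sketch: the sample text and its example.com / example.org URLs are made up for illustration and do not come from a real edition.

```python
import re

# Same pattern as in the script above
url_pattern = re.compile(r'\b((?:https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#\/%=~_|])')

# Hypothetical newsletter snippet (not taken from a real edition)
sample = (
    "Read [this article](https://www.example.com/article?id=1) "
    "or visit https://example.org/page. "
    "See also https://en.wikipedia.org/wiki/Rot_(disambiguation)."
)

# findall returns the content of the single capturing group for each match
urls = url_pattern.findall(sample)
for url in urls:
    print(url)
```

The closing parenthesis of a Markdown link and a trailing period are both left out, which is what you want. The flip side is that a URL containing parentheses, like the Wikipedia one, is cut off at the first `(`, so such a link could be reported as broken even though the full URL works.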

The 31 editions I have sent so far contain 484 links in total. To my surprise, only 2 of them are dead and return a 404 Not Found error. No other types of errors were observed.

| Response code | Occurrences | Share |
|---|---|---|
| 200 | 482 | 99.59% |
| 404 | 2 | 0.41% |
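I did this tally in Google Sheets, but the same breakdown can be computed directly from the script's output. The sketch below is one way to do it, not what I actually ran: `tally_last_column` is a hypothetical helper and the sample rows are invented. It reads the last column by position, because the URL pattern allows `;` inside a URL, which would shift the columns.

```python
import csv
import io
from collections import Counter

def tally_last_column(csv_text: str) -> list[tuple[str, int, float]]:
    """Count values in the last CSV column and return (value, count, share in %)."""
    rows = csv.reader(io.StringIO(csv_text), delimiter=";")
    next(rows, None)  # skip the header line
    counts = Counter(row[-1] for row in rows if row)
    total = sum(counts.values())
    return [(value, n, round(100 * n / total, 2)) for value, n in counts.most_common()]

# Made-up sample in the script's output format
sample_output = """Filename;Linked URL;Domain name;Response code
001.md;https://example.com/a;example.com;200
001.md;https://example.com/b;example.com;200
002.md;https://example.org/c;example.org;200
002.md;https://example.org/gone;example.org;404
"""

for value, n, share in tally_last_column(sample_output):
    print(f"{value}: {n} ({share}%)")
```

Swapping `row[-1]` for `row[-2]` would tally the domain name column in the same way.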

Researchers have studied the rate of link rot extensively, but their results vary widely.

  • A 2016 study looked at links in Yahoo! Directory and found that half of all links died within 2 years.
  • Another study looked at articles of The New York Times and found that 25% of all links were inaccessible. They also concluded that link rot gets worse over time, with articles from 1998 containing a whopping 72% dead links!

I’m happy that the rate of link rot is very low in past editions of my newsletter, as I often refer to older editions. However, take into account that I’ve only been doing this for 2.5 years. The rate of link rot could increase over the coming years.

I attribute the low rate of link rot to the fact that I mostly feature large news websites and scientific publications. I'm sure these hold up better over time than a random person's blog.

While writing each newsletter, I try not to feature the same sites repeatedly. So I checked if I’ve been doing a good job of that. Here’s a list of domain names that I’ve featured most often in the past 2.5 years:

| Domain | Times linked | Share |
|---|---|---|
| youtube.com | 58 | 11.98% |
| en.wikipedia.org | 25 | 5.17% |
| theverge.com | 15 | 3.10% |
| twitter.com | 13 | 2.69% |
| bigthink.com | 11 | 2.27% |
| bbc.com | 11 | 2.27% |
| nature.com | 10 | 2.07% |
| theguardian.com | 10 | 2.07% |
| freethink.com | 10 | 2.07% |
| smithsonianmag.com | 8 | 1.65% |
| arstechnica.com | 7 | 1.45% |
| science.org | 6 | 1.24% |
| edition.cnn.com | 5 | 1.03% |
| space.com | 5 | 1.03% |
| futurism.com | 5 | 1.03% |
| scanofthemonth.com | 5 | 1.03% |
| github.com | 5 | 1.03% |
| ciechanow.ski | 4 | 0.83% |
| interestingengineering.com | 4 | 0.83% |

I was a bit surprised to see YouTube in the top spot. Then I remembered that I shared a list of my 25 favorite YouTubers in edition #7. If you exclude those, YouTube sits at about the same level as Wikipedia.

Overall, this seems well balanced. There's no one website that stands out significantly.

I was also interested in knowing how many links I share in each edition of the newsletter. My aim is to feature 8 to 10 articles in each edition, but I sometimes link to more resources in the summary of each article.

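| Edition # | Link count |
|---|---|
| 001 | 1 |
| 002 | 6 |
| 003 | 9 |
| 004 | 13 |
| 005 | 12 |
| 006 | 21 |
| 007 | 40 |
| 008 | 19 |
| 009 | 21 |
| 010 | 13 |
| 011 | 12 |
| 012 | 12 |
| 013 | 21 |
| 014 | 13 |
| 015 | 11 |
| 016 | 13 |
| 017 | 14 |
| 018 | 16 |
| 019 | 20 |
| 020 | 13 |
| 021 | 21 |
| 022 | 15 |
| 023 | 21 |
| 024 | 15 |
| 025 | 21 |
| 026 | 17 |
| 027 | 21 |
| 028 | 11 |
| 029 | 12 |
| 030 | 13 |
| 031 | 17 |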

The data shows that a Simply Explained newsletter contains 15 to 16 links on average (484 links across 31 editions), and that number is fairly consistent across editions. The exception is edition 7, where I shared a list of my favorite YouTube channels.
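The average is easy to double-check from the table above. Here is a short sketch using Python's `statistics` module. The variable names are my own, and the extra figures (the median, and the mean without edition 7) are added only to show how much that one edition pulls the average up.

```python
from statistics import mean, median

# Link counts for editions 001 to 031, copied from the table above
link_counts = [1, 6, 9, 13, 12, 21, 40, 19, 21, 13, 12, 12, 21, 13, 11, 13,
               14, 16, 20, 13, 21, 15, 21, 15, 21, 17, 21, 11, 12, 13, 17]

total = sum(link_counts)
average = mean(link_counts)
middle = median(link_counts)
# Average without edition 7 (index 6), the one with the YouTube list
average_without_7 = mean(link_counts[:6] + link_counts[7:])

print(f"total={total} mean={average:.2f} median={middle} mean_without_7={average_without_7:.2f}")
```

The total matches the 484 links counted earlier, which is a useful sanity check that the per-edition table and the status code table describe the same data.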

What's next?

Overall, it seems like the websites I feature in my newsletter do a good job of keeping their content online! I plan to repeat this experiment in the future to see if link rot has gotten worse.

In the meantime, I will replace the two broken links with copies from the Internet Archive's Wayback Machine.
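With only two broken links, looking up the archived copies by hand is quick. If the list grows, the lookup could be scripted. The sketch below assumes the Internet Archive's availability endpoint at `https://archive.org/wayback/available` and the JSON layout shown in the sample payload. The function names are hypothetical, and `find_archived_copy` needs network access, so only the two pure helpers are exercised here.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint of the Wayback Machine availability API
WAYBACK_API = "https://archive.org/wayback/available"

def build_lookup_url(url: str, timestamp: str = "") -> str:
    """Build the availability query; timestamp is optional (e.g. YYYYMMDD)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urllib.parse.urlencode(params)}"

def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the URL of the closest archived copy, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None

def find_archived_copy(url: str, timestamp: str = "") -> Optional[str]:
    """Query the API over the network and return a snapshot URL, if any."""
    with urllib.request.urlopen(build_lookup_url(url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))

# Illustrative payload in the assumed response shape (values are made up)
sample_payload = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20230101000000/https://example.com/gone",
            "timestamp": "20230101000000",
            "status": "200",
        }
    }
}
print(closest_snapshot(sample_payload))
```

Passing the newsletter edition's send date as `timestamp` should return the snapshot closest to what the page looked like when I linked to it.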

The raw data is available on Google Sheets here.

Thumbnail by Sophia Hilmar from Pixabay