Using HAProxy to protect me from scrapers
2025-04-28 · via dgl.cx

It's a sad fact that web scrapers are getting very obnoxious. I've mostly solved this by making sites that don't need a huge amount of resources, but sometimes scrapers are just too aggressive.

Various sites have introduced some form of proof-of-work (PoW), popularised by tools such as anubis, which has now been deployed by the UN(!).

These crawlers can get a large supply of IP addresses and defeat simple rate limits by switching between them. Making them do work to get access makes each change of IP address expensive, which also means a per-IP limit makes sense again. Obviously if someone really wants to scrape, they'll find a way, but this makes other targets more attractive.

My breaking point was when someone started bruteforcing URLs. Possibly not an AI crawler, but the same thing can block both. I could have just deployed anubis, but it is over 6000 lines of Go (including tests) and requires putting it in front of your site. More importantly, given that ip.wtf goes to some lengths to not let things mess up the inbound HTTP request, I can't actually deploy something like that.

Therefore architecturally anubis wasn't an option for me. It can't really be that complex to make one of these? Well, no. Ted Unangst made anticrawl which is far simpler, but a bit too simple. So I made something in-between. It just does a simple inline hash calculation, without web workers (the downside of this is it does block the browser UI a bit). But I made it look a little bit pretty.

We can use HAProxy "stick tables" to keep the state of whether the connecting IP address has passed the challenge and they also take care of expiring the address for us.
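As a rough mental model of what the stick table provides here, the Python sketch below keeps a per-address "passed" flag that expires on its own. The class name `ExpiringTable`, its methods and the lifetime passed to it are illustrative assumptions, not HAProxy's implementation or haphash's actual settings.

```python
import time


class ExpiringTable:
    """Toy stand-in for a stick table: maps a key to an expiry deadline."""

    def __init__(self, expire_seconds, clock=time.monotonic):
        self.expire_seconds = expire_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.entries = {}

    def mark_passed(self, ip):
        # Record that this address solved the challenge.
        self.entries[ip] = self.clock() + self.expire_seconds

    def has_passed(self, ip):
        deadline = self.entries.get(ip)
        if deadline is None:
            return False
        if self.clock() >= deadline:
            # Entry is stale: forget it, so the address must solve again.
            del self.entries[ip]
            return False
        return True
```

In HAProxy the equivalent bookkeeping, including the expiry, is handled by the stick table itself, so none of this has to be written by hand.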

The result is haphash. The README on GitHub has details on how to set this up for yourself.

Obviously HAProxy is a lot of code and does the heavy lifting for us, but compared to anubis this is considerably easier to understand:

$ wc -l haproxy.conf challenge.html
      38 haproxy.conf
      95 challenge.html
     133 total

The majority of the logic is in the haproxy configuration. The frontend matches ACLs for paths to be protected, and the benefit of this is the logic can be in the same place as other haproxy ACLs (rather than yet another policy configuration).

The example ACLs are like so:

  # Adjust these to the paths you want to protect.
  acl protected_path path -m reg /(some-expensive/thing|another/).*
  # Matches the default config of anubis of triggering on "Mozilla"
  acl protected_ua hdr(User-Agent) -m beg Mozilla/
  acl protected acl(protected_path,protected_ua)

This can use all the features of HAProxy ACLs like splitting them out to a file and matching on all kinds of attributes.
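To make the combined condition concrete, here is a minimal Python sketch of what the three example ACLs amount to. The function name `is_protected` is hypothetical, and the sketch assumes that HAProxy's `-m reg` behaves like an unanchored regex search and that `-m beg` is a case-sensitive prefix match.

```python
import re

# Same pattern as the example ACL. HAProxy's "-m reg" is assumed here to
# behave like an unanchored search, hence re.search rather than re.fullmatch.
PROTECTED_PATH = re.compile(r"/(some-expensive/thing|another/).*")


def is_protected(path, user_agent):
    """True when both example ACLs would match (they are ANDed together)."""
    protected_path = PROTECTED_PATH.search(path) is not None
    protected_ua = user_agent.startswith("Mozilla/")  # "-m beg Mozilla/"
    return protected_path and protected_ua
```

Under those assumptions, `is_protected("/another/page", "curl/8.0")` is false because the user agent does not start with `Mozilla/`, even though the path matches.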

To actually calculate the hash, I was initially thinking I'd have to use HAProxy's Lua support, but it turns out it is possible in pure haproxy config:

http-request set-var(txn.hash) src,concat(;,txn.host,),concat(;,txn.ts,),concat(;,txn.tries),digest(SHA-256),hex

That sets a transaction variable (txn.) with the source IP address, concatenated with the hostname, the timestamp, the number of tries to get the hash in the required form (the actual proof-of-work), then calculates a digest of it. With this and a few other checks, we have all we need server side to validate the hash, completely in the load balancer. Even if the language it is written in makes you want to gouge your eyes out.
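For readers who find the converter chain hard to parse, the following Python sketch builds the same `src;host;ts;tries` string and hashes it. It is an illustration rather than haphash's code: the function names are made up, the uppercase output assumes HAProxy's `hex` converter emits uppercase digits, and the "leading zeros" target in `solve` is a hypothetical stand-in for whatever form the real challenge requires.

```python
import hashlib


def challenge_hash(src, host, ts, tries):
    """Mirror the set-var line: join the fields with ';', SHA-256, hex."""
    data = f"{src};{host};{ts};{tries}".encode()
    # HAProxy's hex converter is assumed to emit uppercase digits.
    return hashlib.sha256(data).hexdigest().upper()


def solve(src, host, ts, prefix="00", max_tries=1_000_000):
    """Search for a tries value whose hash has the (hypothetical) prefix."""
    for tries in range(max_tries):
        if challenge_hash(src, host, ts, tries).startswith(prefix):
            return tries
    raise RuntimeError("no solution within max_tries")
```

Because the source address is part of the hashed string, a `tries` value found for one IP address is not guaranteed to work for another, which is what makes switching addresses costly.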

If you want to test it, the contact page is always protected by it. (But note you'll probably just see it flash by, unless you're on an older device.)

How about just blocking the bots?

The actual problem is bots that ignore robots.txt or other hints that they are not welcome. Most hide behind browser-like user agents, which is why anubis (and haphash) trigger on the "Mozilla" string in the User-Agent header.

This means I'm also running an experiment: there is a hidden link which, if you visit it, simply blocks you. This link is disallowed in robots.txt:

User-Agent: *
Disallow: /iamabot/

Therefore if you're visiting this link, you're a robot who is ignoring robots.txt.
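A well-behaved crawler can check this rule before fetching anything. As a small sketch, Python's standard-library `urllib.robotparser` gives the following answer for the rules above; the helper name `allowed` and the `ExampleBot` user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /iamabot/
"""


def allowed(path, user_agent="ExampleBot"):
    """Ask the standard-library parser whether a crawler may fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

With these rules, `allowed("/")` is true while anything under `/iamabot/` is refused, so a client that still requests the trap link has skipped or ignored this check.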

It's trivial to apply a drop to that connection with haproxy:

frontend www
  # [...Use stick table config from haphash...]

  # Drop the connection as early as possible
  http-request silent-drop if { sc_get_gpt(0,0) gt 0 }

  # [...later...]
  http-request sc-set-gpt(0,0) 1 if { path -m beg /iamabot/ }
  http-request silent-drop if { path -m beg /iamabot/ }

The actual path is different, just so you don't copy that and accidentally go to it, but if you find the real path, yes, it will block you.
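The order of the rules matters: the flag is checked before anything else, and the trap path both sets the flag and drops the request. The Python sketch below models just that control flow. The `Honeypot` class is hypothetical, and unlike a stick table it never expires entries.

```python
TRAP_PREFIX = "/iamabot/"  # placeholder, as in the article; not the real path


class Honeypot:
    """Toy model of the two rules: flag on trap visit, drop once flagged."""

    def __init__(self):
        self.flagged = set()

    def handle(self, ip, path):
        # Checked first, like the early silent-drop rule.
        if ip in self.flagged:
            return "drop"
        if path.startswith(TRAP_PREFIX):
            self.flagged.add(ip)
            return "drop"
        return "serve"
```

Once an address has touched the trap, every later request from it is dropped regardless of path, while other addresses are unaffected.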

Code is on GitHub or SourceHut.

28th April 2025