Using HAProxy to protect me from scrapers
2025-04-28 · via dgl.cx

It's a sad fact that web scrapers are getting very obnoxious. I've mostly solved this by making sites that don't need a huge amount of resources, but sometimes scrapers are just too aggressive.

Various sites have introduced some form of proof-of-work (PoW), popularised by tools such as anubis, which has now been deployed by the UN(!).

These crawlers can get a large supply of IP addresses and defeat simple rate limits by switching between them. Making each address do some work before it gets access makes switching addresses expensive, which in turn means a per-IP limit makes sense again. Obviously if someone really wants to scrape, they'll find a way, but this makes other targets more attractive.

My breaking point was when someone started bruteforcing URLs. Possibly not an AI crawler, but the same thing can block both. I could have just deployed anubis, but it is over 6000 lines of Go (including tests) and requires putting it in front of your site. More importantly, given that ip.wtf goes to some lengths to not let anything mess with the inbound HTTP request, I can't actually deploy something like that.

Therefore architecturally anubis wasn't an option for me. It can't really be that complex to make one of these? Well, no. Ted Unangst made anticrawl which is far simpler, but a bit too simple. So I made something in-between. It just does a simple inline hash calculation, without web workers (the downside of this is it does block the browser UI a bit). But I made it look a little bit pretty.

We can use HAProxy "stick tables" to track whether the connecting IP address has passed the challenge, and they also take care of expiring the entry for us.
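To make the idea concrete, here is a minimal Python sketch of what the stick table is doing for us conceptually: remembering which addresses have passed, and forgetting them after a while. This is a toy stand-in, not how HAProxy implements stick tables; the class name `ExpiringTable`, its methods and the TTL value are all made up for illustration.

```python
import time
from typing import Callable, Dict


class ExpiringTable:
    """Toy stand-in for a stick table: maps a key (an IP) to an expiry time.

    Hypothetical illustration only; HAProxy's real stick tables have their
    own storage, counters and expiry semantics.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._expires_at: Dict[str, float] = {}

    def mark_passed(self, ip: str) -> None:
        """Record that this address solved the challenge."""
        self._expires_at[ip] = self.clock() + self.ttl

    def has_passed(self, ip: str) -> bool:
        """True while the address has an unexpired entry."""
        expiry = self._expires_at.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Entry is stale: drop it, the client must solve again.
            del self._expires_at[ip]
            return False
        return True
```

The point of pushing this into the load balancer is that none of this bookkeeping has to be written or maintained by hand.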

The result is haphash. The README on GitHub has details on how to set this up for yourself.

Obviously HAProxy is a lot of code and does the heavy lifting for us, but compared to anubis this is considerably easier to understand:

$ wc -l haproxy.conf challenge.html
      38 haproxy.conf
      95 challenge.html
     133 total

The majority of the logic is in the haproxy configuration. The frontend matches ACLs for paths to be protected, and the benefit of this is the logic can be in the same place as other haproxy ACLs (rather than yet another policy configuration).

The example ACLs are like so:

  # Adjust these to the paths you want to protect.
  acl protected_path path -m reg /(some-expensive/thing|another/).*
  # Matches the default config of anubis of triggering on "Mozilla"
  acl protected_ua hdr(User-Agent) -m beg Mozilla/
  acl protected acl(protected_path,protected_ua)

This can use all the features of HAProxy ACLs like splitting them out to a file and matching on all kinds of attributes.
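For readers less used to HAProxy syntax, the example ACLs above amount to roughly the following, sketched in Python. The function name `is_protected` is made up, and the sketch assumes that the regex match is unanchored and that `acl(...)` is true only when every named ACL is true.

```python
import re

# Same pattern as the example protected_path ACL (placeholder paths).
PROTECTED_PATH = re.compile(r"/(some-expensive/thing|another/).*")


def is_protected(path: str, user_agent: str) -> bool:
    """Rough equivalent of the 'protected' ACL: both conditions must hold."""
    # path -m reg ...: regex match anywhere in the path (assumed unanchored).
    path_matches = PROTECTED_PATH.search(path) is not None
    # hdr(User-Agent) -m beg Mozilla/: header value starts with "Mozilla/".
    ua_matches = user_agent.startswith("Mozilla/")
    return path_matches and ua_matches
```

So a browser-like client asking for `/another/page` gets challenged, while the same client asking for an unprotected path, or a client that honestly identifies itself as something else, does not.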

To actually calculate the hash, I initially thought I'd have to use HAProxy's Lua support, but it turns out to be possible in pure haproxy config:

http-request set-var(txn.hash) src,concat(;,txn.host,),concat(;,txn.ts,),concat(;,txn.tries),digest(SHA-256),hex

That sets a transaction variable (the txn. prefix) from the source IP address, concatenated with the hostname, the timestamp and the number of tries it took to get the hash into the required form (the actual proof-of-work), then calculates a SHA-256 digest of the result and hex-encodes it. With this and a few other checks, we have all we need server side to validate the hash, completely in the load balancer. Even if the language it is written in makes you want to gouge your eyes out.
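The same calculation can be sketched in Python, which may be easier on the eyes. This is an illustration under assumptions, not haphash's actual code: the field order and `;` separator follow the config line above, but the definition of "required form" used here (a number of leading zero hex digits) and the names `challenge_hash`, `meets_target` and `solve` are made up for the example.

```python
import hashlib


def challenge_hash(src: str, host: str, ts: str, tries: int) -> str:
    """SHA-256 over 'src;host;ts;tries', hex-encoded.

    Upper-casing the digest is an assumption of this sketch, not something
    the post specifies.
    """
    payload = f"{src};{host};{ts};{tries}".encode()
    return hashlib.sha256(payload).hexdigest().upper()


def meets_target(hex_digest: str, zero_digits: int) -> bool:
    """Assumed difficulty rule: digest starts with N zero hex digits."""
    return hex_digest.startswith("0" * zero_digits)


def solve(src: str, host: str, ts: str, zero_digits: int,
          max_tries: int = 10_000_000) -> int:
    """Client side: brute-force 'tries' until the digest meets the target."""
    for tries in range(max_tries):
        if meets_target(challenge_hash(src, host, ts, tries), zero_digits):
            return tries
    raise RuntimeError("no solution found within max_tries")


if __name__ == "__main__":
    # Documentation-range IP and a made-up host/timestamp.
    n = solve("192.0.2.1", "example.org", "1745800000", zero_digits=3)
    print(n, challenge_hash("192.0.2.1", "example.org", "1745800000", n))
```

Under that assumed rule, each extra zero digit multiplies the expected client work by 16, while the server only ever computes one hash per check. Binding the source IP into the hashed string is what makes a solution useless from a different address.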

If you want to test it, the contact page is always protected by it. (But note you'll probably just see it flash by, unless you're on an older device.)

How about just blocking the bots?

The actual problem is bots that ignore robots.txt or other hints that they are not welcome. Most hide behind browser-like user agents, which is why anubis (and haphash) trigger on the "Mozilla" string in the User-Agent header.

So I'm also running an experiment with a hidden link: if you visit it, it simply blocks you. This link is disallowed in robots.txt:

User-Agent: *
Disallow: /iamabot/

Therefore if you're visiting this link, you're a robot who is ignoring robots.txt.

It's trivial to apply a drop to that connection with haproxy:

frontend www
  # [...Use stick table config from haphash...]

  # Drop the connection as early as possible
  http-request silent-drop if { sc_get_gpt(0,0) gt 0 }

  # [...later...]
  http-request sc-set-gpt(0,0) 1 if { path -m beg /iamabot/ }
  http-request silent-drop if { path -m beg /iamabot/ }

The actual path is different, just so you don't copy that and accidentally go to it, but if you find the real path, yes, it will block you.
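The behaviour of that config can be sketched in a few lines of Python. This is a toy model with made-up names (`Honeypot`, `handle`), it uses the placeholder `/iamabot/` path rather than the real one, and unlike a stick table it never expires a flagged address.

```python
from typing import Set

TRAP_PREFIX = "/iamabot/"  # placeholder path from robots.txt above


class Honeypot:
    """Toy model: flag an IP once it touches the trap, then drop everything."""

    def __init__(self) -> None:
        self.flagged: Set[str] = set()

    def handle(self, ip: str, path: str) -> str:
        # Already flagged: drop as early as possible.
        if ip in self.flagged:
            return "drop"
        # Visiting the disallowed link flags the address and drops the request.
        if path.startswith(TRAP_PREFIX):
            self.flagged.add(ip)
            return "drop"
        return "serve"
```

A well-behaved crawler that honours robots.txt never requests the trap path, so it is never flagged.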

Code is on GitHub or SourceHut.

28th April 2025