Regular expressions that work “everywhere”
ColinWright
·
2026-06-25
·
via HN's home page
 | |
In Emacs in particular, I suffer so much from basically guessing what needs to be escaped and what doesn't. I know `rx` exists[0] as an alternative, but it's not really fun to use. Even beyond the regex syntax itself, you often also start running into encoding problems when trying to actually use them. Typing the regex in a shell? Make sure to escape stuff properly. Regex in Python? Make sure it's a raw string. Etc., etc., etc. It's a modern miracle we're at least within rhyming distance of how to write regexes in most tools. [0]: https://www.gnu.org/software/emacs/manual/html_node/elisp/Rx... |
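To make the raw-string point concrete, here is a minimal sketch using only Python's standard `re` module. The variable names and the sample text are made up for illustration.

```python
import re

# In a normal string literal, Python converts "\b" to a backspace
# character (0x08) before the regex engine ever sees the pattern.
plain = "\bcat\b"

# In a raw string, the backslashes survive, so the regex engine
# receives \b and treats it as a word boundary.
raw = r"\bcat\b"

text = "a cat sat"

plain_match = re.search(plain, text)  # searches for backspace, "cat", backspace
raw_match = re.search(raw, text)      # searches for "cat" at word boundaries

print(len(plain), len(raw))           # the two patterns are different strings
print(plain_match, raw_match)
```

The non-raw pattern compiles without complaint and simply never matches, which is what makes this mistake easy to miss.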
 | |
Grasping at straws, it's kinda convenient that ( and ) match literally if the text being searched is Elisp code! |
 | |
The author is circling around, but not quite reaching, a statement that POSIX Basic Regular Expressions work everywhere, with the caveat that not everyone has caught up with version 8 of the Single Unix Specification, which has slightly changed BREs. |
 | |
Go stdlib regexp package does not support back references, as it uses the RE2 engine. You can use them in replace but not matching. |
 | |
It drives me nuts when a developer documents something or other as being a "regex" but doesn't mention which dialect of regular expression he's talking about. This habit is particularly common in the Rust, JavaScript, and Python communities, which seem to forget that their language's regular expression language isn't universal. |
 | |
Why? Of course it means the dialect that is most directly supported by that language (by builtins or the standard library). And why should they have to consider other dialects? They aren't reading regexes from user input (or they'd be a lot more concerned about sanitization, catastrophic backtracking etc.), and their fellow developers all grok the conventions. |
 | |
I’d imagine precisely because they might be collecting regexes from user input such as parameter values or search terms, and the user may not know or care which technology your tool or service is built with. However, they will need to know which regex dialect(s) you support. And I’d further bet that people who are casual about specifying that are relatively strongly correlated with people who are casual about sanitization, catastrophic backtracking, etc. (At least based on code I’ve seen over the decades.) |
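As a rough sketch of what handling a user-supplied pattern might look like in a Python-based tool: the helper name `compile_user_pattern` and its `literal` flag are hypothetical, and only `re.compile`, `re.escape`, and `re.error` come from the standard library. It covers syntax validation only and does nothing about catastrophic backtracking.

```python
import re
from typing import Optional


def compile_user_pattern(text: str, literal: bool = False) -> Optional[re.Pattern]:
    """Compile user input as a Python `re` pattern.

    Returns None if the input is not valid in Python's dialect.
    With literal=True, the input is escaped and matched as plain text.
    """
    source = re.escape(text) if literal else text
    try:
        return re.compile(source)
    except re.error:
        # The input may be fine in another dialect; a real tool should
        # tell the user which dialect it expects.
        return None


bad = compile_user_pattern("a(b")                      # unbalanced group
as_text = compile_user_pattern("a(b", literal=True)    # treated as plain text
found = as_text.search("xa(by")
```

A tool that rejects a pattern this way is also the natural place to state which dialect it accepts.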
 | |
Because I don't know what language your program is even written in! Why should I know or care that you chose, e.g. TypeScript, when I'm trying to use or configure your program and don't know how to spell this or that regex concept? |
 | |
> the special characters . * ^ $ These already do not work in many tools which require those special characters to be escaped to have any meaning. An easy example is GNU grep, sed, etc. which use BRE ("Basic Regular Expressions") by default. The article mentions GNU coreutils but does not explain that `-E` is required to fix that behavior. |
 | |
2 RegExp problems: 1. You can not compose a bigger regexp out of smaller ones 2. A regexp can not "call" other regexps |
 | |
To do regex matching efficiently, you need to compile the pattern before using it. That'd exclude dynamically "calling" other regex patterns. But bigger regex pattern strings can be composed from smaller regex pattern strings. You'd just need to do the composition before the compilation. |
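A small sketch of composing before compiling, in Python. The fragment names (`OCTET`, `IPV4`, `PORT`, `ENDPOINT`) and the host:port format are invented for the example; the only library calls are `re.compile` and `fullmatch`.

```python
import re

# Building blocks kept as plain pattern strings (hypothetical names).
# Each fragment is wrapped in a non-capturing group so it can be
# embedded or repeated without changing its meaning.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = rf"(?:{OCTET}(?:\.{OCTET}){{3}})"
PORT = r"(?:\d{1,5})"

# Compose first, then compile once.
ENDPOINT = re.compile(rf"(?P<host>{IPV4}):(?P<port>{PORT})")

m = ENDPOINT.fullmatch("192.168.0.1:8080")
print(m.group("host"), m.group("port"))

rejected = ENDPOINT.fullmatch("999.1.1.1:80")
print(rejected)
```

String composition like this has no scoping, so wrapping each fragment in `(?:...)` is what keeps an alternation inside one fragment from leaking into its neighbours.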
 | |
Also define blocks if all someone wants is to break the pattern up to make it more readable. |
 | |
Then there’s not just the issue of whether the engine supports a particular syntactical feature but the issue of matching semantics. Perl/PCRE’s semantics are far different from POSIX’s, and some implementations have different semantics altogether (and quite reasonably). |
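One concrete instance of the semantics gap is alternation. Python's `re` follows the Perl-style rule that the first alternative to succeed wins, whereas POSIX specifies the longest match among those starting at the leftmost position. A minimal sketch (the variable names are just for illustration):

```python
import re

# Perl-style leftmost-first: "a" is tried first and succeeds,
# so the engine never considers the longer "ab".
first = re.match(r"a|ab", "ab").group()

# Swapping the alternatives changes the result under these semantics.
reordered = re.match(r"ab|a", "ab").group()

print(first, reordered)
```

Under POSIX leftmost-longest rules both patterns would match `ab`, so the same pattern text can give different answers in different tools even when every feature it uses is supported.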