惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

D
Darknet – Hacking Tools, Hacker News & Cyber Security
V
Vulnerabilities – Threatpost
Cloudbric
Cloudbric
G
GRAHAM CLULEY
S
Securelist
Schneier on Security
Schneier on Security
Help Net Security
Help Net Security
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
Project Zero
Project Zero
Spread Privacy
Spread Privacy
P
Privacy International News Feed
C
Cyber Attacks, Cyber Crime and Cyber Security
Cisco Talos Blog
Cisco Talos Blog
T
Tailwind CSS Blog
博客园_首页
有赞技术团队
有赞技术团队
Simon Willison's Weblog
Simon Willison's Weblog
Stack Overflow Blog
Stack Overflow Blog
K
KPMG report finds enterprise disconnect between AI and its ROI | CIO
Latest news
Latest news
T
Tor Project blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
Attack and Defense Labs
Attack and Defense Labs
www.infosecurity-magazine.com
www.infosecurity-magazine.com
O
OpenAI News
J
Java Code Geeks
T
Tenable Blog
K
Kaspersky official blog
AWS News Blog
AWS News Blog
S
Security @ Cisco Blogs
The GitHub Blog
The GitHub Blog
T
Threatpost
月光博客
月光博客
H
Heimdal Security Blog
Security Latest
Security Latest
The Hacker News
The Hacker News
Y
Y Combinator Blog
A
Arctic Wolf
Apple Machine Learning Research
Apple Machine Learning Research
C
Cisco Blogs
美团技术团队
Microsoft Security Blog
Microsoft Security Blog
Hugging Face - Blog
Hugging Face - Blog
T
The Blog of Author Tim Ferriss
C
CERT Recently Published Vulnerability Notes
D
Docker
Google Online Security Blog
Google Online Security Blog
D
DataBreaches.Net
V
Visual Studio Blog
H
Help Net Security

seize the dev

Will Software Engineering Survive? Why is the Gmail app 700 MB? Bits of Open-Source in 2025 Parsing JSON in Forty Lines of Awk Improving Date Formatting Performance in Node.js Unix is not Linux Installing FreeBSD on Oracle Cloud A Simple Setup for C and C++ How to Break Software
A Tricky Floating-Point Calculation
Mohamed Akram · 2024-06-26 · via seize the dev

I was working on my simple-id library and was curious about different methods to generate an unbiased random number in a given range. As a way to test whether a random number generator (RNG) produces good results, one can calculate the expected number of repeats based on the birthday problem and compare it with the output of the RNG. The formula can be found online in several places as it can also be used for counting expected hash collisions or cache hits. The expected number of repeats can be calculated as follows:

n−d+d(d−1d)nn - d + d \left( \frac{d-1}{d} \right)^n

Where nn is the number of generated values and dd is the range.

Converting this to JavaScript code:

function repeats(n, d) {
  return n - d + d * ((d - 1) / d)**n;
}

In my case, I wanted to check the expected repeats after generating a million random IDs, where each ID is 8 characters long and there are 31 characters in the alphabet. That is, n=106n=10^6 and d=318d=31^8. Running this in JavaScript:

console.log(repeats(1e6, 31**8));

The result is -19.72998046875.

Wait, what? Negative repeats? I’ve known about the infamous 0.1 + 0.2 inaccuracy, but this is different; it’s a much larger error, and from only a handful of operations. I double (and triple) checked that I typed the formula correctly, and tried it in both C and Python and got the same result. I then put it into WolframAlpha, since it supports arbitrary precision, and it returned something longer, but much more sensible:

0.5862405426220060595736096268572069594630842175323822481323009

That seemed right. I started tweaking the formula to try to get rid of the error. I figured that, most likely, the division leads to a loss of accuracy that’s exacerbated by raising to a very high power. To avoid this, I reached for BigInt to calculate the power of the numerator and denominator separately, then divide. Since BigInt only supports integers, I also multiply by a “precision” before converting to a number, then divide by it after to get a decimal value:

function repeats(n, d) {
  const p = n*d;
  const e = Number(BigInt(p) * BigInt(d-1)**BigInt(n) / BigInt(d)**BigInt(n))/p;
  return n - d + d * e;
}

That returns 0.5863037109375. On the one hand, the result is correct to three significant figures, which is an improvement over the previous zero. On the other, this takes a second and a half to finish computing. After trying a few different things, I got a tip from a user on the ##math IRC channel to try log1p for the calculation. What does log have to do with anything? The way computers calculate the power is like so:

xn=e(nlog⁡x)x^n = e^{\left( n \log{x} \right)}

So we can rewrite the formula to:

n−d+dexp⁡(nlog⁡d−1d)n - d + d \exp{\left( n \log{\frac{d-1}{d}} \right)}

The log1p function returns the result of log⁡(1+x)\log{\left(1 + x\right)}. The reason it exists is because floating-point does not work well when adding a small xx to 11. Since dd is very large, this is the case in our formula. We can rewrite the expression to make it usable in log1p:

n−d+dexp⁡(nlog⁡(1−1d))n - d + d \exp{\left( n \log{\left( 1 - \frac{1}{d} \right)} \right)}

The code then becomes:

function repeats(n, d) {
  return n - d + d * Math.exp(n * Math.log1p(-1 / d));
}

Running it returns 0.5863037109375, the same value as using BigInt, only now it doesn’t take a second and a half to calculate. However, it still wasn’t as accurate as I’d like. While looking into this, I came across another function that also works better for small values, expm1, which returns ex−1e^x - 1. We can factor out dd to make this usable too:

n+d(exp⁡(nlog⁡(1−1d))−1)n + d \left( \exp{\left( n \log{\left( 1 - \frac{1}{d} \right)} \right)} - 1 \right)

And in code:

function repeats(n, d) {
  return n + d * Math.expm1(n * Math.log1p(-1 / d));
}

Running this, we get 0.5862405423540622, a much improved result accurate to nine significant figures. This was as good as it was going to get in JavaScript.

Out of curiosity, I wanted to see if a better result was possible using C, since it provides yet another function for preserving accuracy, fma, also known as fused multiply and add. Rather than lose accuracy twice on both the multiplication and the addition, rounding only happens once after the calculation when using fma. It also happens to be faster, since it uses a single native CPU instruction for both operations. C also provides the long double type, which is as precise as a double or more, depending on the implementation. On my machine, sizeof(long double) returns 16, i.e. it is a 128-bit floating-point number, also known as a quad. Trying them both out:

double repeats(double n, double d)
{
	return fma(d, expm1(n * log1p(-1 / d)), n);
}

long double repeatsl(long double n, long double d)
{
	return fma(d, expm1l(n * log1pl(-1 / d)), n);
}

This yields 0.58624054240892864431 and 0.58624054258953528507, respectively, with a very slight improvement in accuracy, but still only accurate to nine significant figures. At this point, I was out of my depth. I tried the Herbie project which automatically rewrites floating-point expressions and it gave some good results but the suggestions were rather unwieldy. Let me know if you can find another way to compute this with better accuracy — perhaps there is yet another trick floating out there.