惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

T
Tenable Blog
H
Heimdal Security Blog
K
Kaspersky official blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
S
Schneier on Security
G
GRAHAM CLULEY
U
Unit 42
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
C
CERT Recently Published Vulnerability Notes
Google DeepMind News
Google DeepMind News
罗磊的独立博客
Stack Overflow Blog
Stack Overflow Blog
阮一峰的网络日志
阮一峰的网络日志
Simon Willison's Weblog
Simon Willison's Weblog
C
Cisco Blogs
Cyberwarzone
Cyberwarzone
T
The Exploit Database - CXSecurity.com
Project Zero
Project Zero
Security Archives - TechRepublic
Security Archives - TechRepublic
www.infosecurity-magazine.com
www.infosecurity-magazine.com
博客园 - 司徒正美
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
V
Visual Studio Blog
博客园 - Franky
Engineering at Meta
Engineering at Meta
WordPress大学
WordPress大学
Jina AI
Jina AI
P
Proofpoint News Feed
P
Proofpoint News Feed
有赞技术团队
有赞技术团队
L
LINUX DO - 最新话题
宝玉的分享
宝玉的分享
N
News and Events Feed by Topic
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
博客园 - 聂微东
T
The Blog of Author Tim Ferriss
Spread Privacy
Spread Privacy
Application and Cybersecurity Blog
Application and Cybersecurity Blog
IT之家
IT之家
S
Security Affairs
博客园 - 叶小钗
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
小众软件
小众软件
N
News | PayPal Newsroom
Cloudbric
Cloudbric
AWS News Blog
AWS News Blog
W
WeLiveSecurity
The Last Watchdog
The Last Watchdog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
NISL@THU
NISL@THU

seize the dev

Will Software Engineering Survive? Why is the Gmail app 700 MB? Bits of Open-Source in 2025 Parsing JSON in Forty Lines of Awk Improving Date Formatting Performance in Node.js Unix is not Linux Installing FreeBSD on Oracle Cloud A Simple Setup for C and C++ How to Break Software
A Tricky Floating-Point Calculation
Mohamed Akram · 2024-06-26 · via seize the dev

I was working on my simple-id library and was curious about different methods to generate an unbiased random number in a given range. As a way to test whether a random number generator (RNG) produces good results, one can calculate the expected number of repeats based on the birthday problem and compare it with the output of the RNG. The formula can be found online in several places as it can also be used for counting expected hash collisions or cache hits. The expected number of repeats can be calculated as follows:

n−d+d(d−1d)nn - d + d \left( \frac{d-1}{d} \right)^n

Where nn is the number of generated values and dd is the range.

Converting this to JavaScript code:

function repeats(n, d) {
  return n - d + d * ((d - 1) / d)**n;
}

In my case, I wanted to check the expected repeats after generating a million random IDs, where each ID is 8 characters long and there are 31 characters in the alphabet. That is, n=106n=10^6 and d=318d=31^8. Running this in JavaScript:

console.log(repeats(1e6, 31**8));

The result is -19.72998046875.

Wait, what? Negative repeats? I’ve known about the infamous 0.1 + 0.2 inaccuracy, but this is different; it’s a much larger error, and from only a handful of operations. I double (and triple) checked that I typed the formula correctly, and tried it in both C and Python and got the same result. I then put it into WolframAlpha, since it supports arbitrary precision, and it returned something longer, but much more sensible:

0.5862405426220060595736096268572069594630842175323822481323009

That seemed right. I started tweaking the formula to try to get rid of the error. I figured that, most likely, the division leads to a loss of accuracy that’s exacerbated by raising to a very high power. To avoid this, I reached for BigInt to calculate the power of the numerator and denominator separately, then divide. Since BigInt only supports integers, I also multiply by a “precision” before converting to a number, then divide by it after to get a decimal value:

function repeats(n, d) {
  const p = n*d;
  const e = Number(BigInt(p) * BigInt(d-1)**BigInt(n) / BigInt(d)**BigInt(n))/p;
  return n - d + d * e;
}

That returns 0.5863037109375. On the one hand, the result is correct to three significant figures, which is an improvement over the previous zero. On the other, this takes a second and a half to finish computing. After trying a few different things, I got a tip from a user on the ##math IRC channel to try log1p for the calculation. What does log have to do with anything? The way computers calculate the power is like so:

xn=e(nlog⁡x)x^n = e^{\left( n \log{x} \right)}

So we can rewrite the formula to:

n−d+dexp⁡(nlog⁡d−1d)n - d + d \exp{\left( n \log{\frac{d-1}{d}} \right)}

The log1p function returns the result of log⁡(1+x)\log{\left(1 + x\right)}. The reason it exists is because floating-point does not work well when adding a small xx to 11. Since dd is very large, this is the case in our formula. We can rewrite the expression to make it usable in log1p:

n−d+dexp⁡(nlog⁡(1−1d))n - d + d \exp{\left( n \log{\left( 1 - \frac{1}{d} \right)} \right)}

The code then becomes:

function repeats(n, d) {
  return n - d + d * Math.exp(n * Math.log1p(-1 / d));
}

Running it returns 0.5863037109375, the same value as using BigInt, only now it doesn’t take a second and a half to calculate. However, it still wasn’t as accurate as I’d like. While looking into this, I came across another function that also works better for small values, expm1, which returns ex−1e^x - 1. We can factor out dd to make this usable too:

n+d(exp⁡(nlog⁡(1−1d))−1)n + d \left( \exp{\left( n \log{\left( 1 - \frac{1}{d} \right)} \right)} - 1 \right)

And in code:

function repeats(n, d) {
  return n + d * Math.expm1(n * Math.log1p(-1 / d));
}

Running this, we get 0.5862405423540622, a much improved result accurate to nine significant figures. This was as good as it was going to get in JavaScript.

Out of curiosity, I wanted to see if a better result was possible using C, since it provides yet another function for preserving accuracy, fma, also known as fused multiply and add. Rather than lose accuracy twice on both the multiplication and the addition, rounding only happens once after the calculation when using fma. It also happens to be faster, since it uses a single native CPU instruction for both operations. C also provides the long double type, which is as precise as a double or more, depending on the implementation. On my machine, sizeof(long double) returns 16, i.e. it is a 128-bit floating-point number, also known as a quad. Trying them both out:

double repeats(double n, double d)
{
	return fma(d, expm1(n * log1p(-1 / d)), n);
}

long double repeatsl(long double n, long double d)
{
	return fma(d, expm1l(n * log1pl(-1 / d)), n);
}

This yields 0.58624054240892864431 and 0.58624054258953528507, respectively, with a very slight improvement in accuracy, but still only accurate to nine significant figures. At this point, I was out of my depth. I tried the Herbie project which automatically rewrites floating-point expressions and it gave some good results but the suggestions were rather unwieldy. Let me know if you can find another way to compute this with better accuracy — perhaps there is yet another trick floating out there.