Posted May 5, 2026 22:12 UTC (Tue) by marcH (subscriber, #57642) (6 responses)
I'm not wondering about hashes. I'm wondering about the _threshold_ that hashes are compared against: why and how is that threshold variable? Hashes should be "variable enough" already, no? I most likely missed something.
Posted May 6, 2026 7:00 UTC (Wed) by daroc (editor, #160859) (5 responses)
The threshold is variable to change the distribution of node sizes. If you use a fixed threshold, you get a lopsided distribution where the mean node is half-full but the modal node only has one or two elements. It's more efficient to have a low threshold for earlier items and a higher threshold for later items so that you can make the distribution of node sizes follow a bell curve.
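A rough simulation can make the difference visible. The sketch below is an illustration only, not Dolt's code: the names `fixed_threshold`, `rising_threshold`, and `split_sizes`, the 32-bit hash range, and the constants are all assumptions chosen for the example. It closes a node whenever an item's hash falls below the threshold, and compares a constant threshold against one that grows with the number of items already in the node.

```python
import hashlib
from collections import Counter

MAX_HASH = 2**32   # hashes are modelled as 32-bit integers (assumption)
MAX_SIZE = 32      # hypothetical hard cap on items per node
N_ITEMS = 200_000  # number of simulated keys

def item_hash(key: int) -> int:
    """Derive a 32-bit hash from an integer key using SHA-256."""
    digest = hashlib.sha256(key.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big")

def fixed_threshold(size: int) -> int:
    """Same split probability (1/8) no matter how full the node is."""
    return MAX_HASH // 8

def rising_threshold(size: int) -> int:
    """Low threshold for early items, higher for later ones.

    The split probability grows linearly with the node size and
    reaches 1 when the node holds MAX_SIZE items.
    """
    return MAX_HASH * size // MAX_SIZE

def split_sizes(n_items: int, threshold) -> list[int]:
    """Return the sizes of the nodes produced by a threshold rule."""
    sizes, size = [], 0
    for key in range(n_items):
        size += 1
        if item_hash(key) < threshold(size):
            sizes.append(size)  # this item closes the current node
            size = 0
    return sizes                # a trailing partial node is ignored

fixed = split_sizes(N_ITEMS, fixed_threshold)
rising = split_sizes(N_ITEMS, rising_threshold)

for name, sizes in (("fixed", fixed), ("rising", rising)):
    mode = Counter(sizes).most_common(1)[0][0]
    mean = sum(sizes) / len(sizes)
    print(f"{name}: mode={mode} mean={mean:.1f} max={max(sizes)}")
```

With the constant threshold the most common node size should come out at one or two items, while the rising threshold moves the peak away from the smallest sizes and caps nodes at `MAX_SIZE`. The resulting curve is only roughly bell-shaped; a real implementation would presumably tune the shape of the threshold function more carefully.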
Posted May 6, 2026 14:05 UTC (Wed) by Wol (subscriber, #4433) [Link] (4 responses)
Posted May 6, 2026 15:52 UTC (Wed) by farnz (subscriber, #17727) [Link]
Because the hash distribution is an artefact of the input data, you need some way to ensure that an unfortunate distribution of hashes won't result in nodes that are consistently too large or too small. The simplest way to do this reliably is to vary the threshold for the next node split point based on the size of previous nodes - if the average node so far has been too small, adjust the threshold to get larger nodes in future, while if the average node has been too large, adjust the threshold to get smaller nodes in future.
More complex strategies also exist for setting the threshold - I've not looked to see how Dolt handles this - but the key to why you need a varying threshold is simply that the shape of the distribution of hash values is unknown.
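The feedback idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how Dolt or any other system does it: `TARGET`, `GAIN`, `feedback_sizes`, and the deliberately skewed hash distribution are all invented for the example. As a simplification, the sketch nudges the threshold a little after every completed node, which behaves like a smoothed average of recent node sizes rather than a literal average over all nodes so far.

```python
import random

MAX_HASH = 2**32   # hashes are modelled as 32-bit integers (assumption)
TARGET = 64        # hypothetical desired mean node size
GAIN = 0.05        # hypothetical step size for each adjustment

def feedback_sizes(hashes, adaptive: bool) -> list[int]:
    """Close a node after any item whose hash is below the threshold."""
    # This starting value gives a mean size of TARGET for uniform hashes.
    threshold = MAX_HASH / TARGET
    sizes, size = [], 0
    for h in hashes:
        size += 1
        if h < threshold:
            sizes.append(size)
            if adaptive:
                # Node too small: lower the threshold so splits get rarer
                # and later nodes grow. Node too large: raise it.
                threshold *= 1 + GAIN * (size / TARGET - 1)
                threshold = min(max(threshold, 1.0), float(MAX_HASH))
            size = 0
    return sizes

# An "unfortunate" input: squaring a uniform value crowds hashes toward
# zero, so a threshold tuned for uniform hashes splits far too often.
rng = random.Random(0)
skewed = [int(MAX_HASH * rng.random() ** 2) for _ in range(200_000)]

fixed = feedback_sizes(skewed, adaptive=False)
adaptive = feedback_sizes(skewed, adaptive=True)
fixed_mean = sum(fixed) / len(fixed)
adaptive_mean = sum(adaptive) / len(adaptive)
print(f"fixed threshold:    mean node size {fixed_mean:.1f}")
print(f"adaptive threshold: mean node size {adaptive_mean:.1f}")
```

On this skewed input the fixed threshold should produce nodes far smaller than `TARGET`, while the adaptive version should settle close to it after a short warm-up. The small `GAIN` keeps the threshold from overreacting to a single unusually short or long node.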
Posted May 6, 2026 15:56 UTC (Wed) by daroc (editor, #160859) [Link] (2 responses)