惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Help Net Security
Scott Helme
Scott Helme
爱范儿
爱范儿
WordPress大学
WordPress大学
博客园 - 三生石上(FineUI控件)
阮一峰的网络日志
阮一峰的网络日志
博客园 - Franky
V
V2EX
腾讯CDC
博客园_首页
博客园 - 司徒正美
酷 壳 – CoolShell
酷 壳 – CoolShell
T
Tailwind CSS Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
小众软件
小众软件
J
Java Code Geeks
大猫的无限游戏
大猫的无限游戏
月光博客
月光博客
Microsoft Azure Blog
Microsoft Azure Blog
B
Blog
雷峰网
雷峰网
Stack Overflow Blog
Stack Overflow Blog
IT之家
IT之家
罗磊的独立博客
Recorded Future
Recorded Future
博客园 - 聂微东
O
OpenAI News
S
Secure Thoughts
Hacker News: Ask HN
Hacker News: Ask HN
S
Schneier on Security
Hacker News - Newest:
Hacker News - Newest: "LLM"
Y
Y Combinator Blog
C
Cyber Attacks, Cyber Crime and Cyber Security
Project Zero
Project Zero
宝玉的分享
宝玉的分享
K
Kaspersky official blog
N
Netflix TechBlog - Medium
T
The Exploit Database - CXSecurity.com
Google Online Security Blog
Google Online Security Blog
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Webroot Blog
Webroot Blog
云风的 BLOG
云风的 BLOG
Simon Willison's Weblog
Simon Willison's Weblog
C
Check Point Blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
L
LINUX DO - 热门话题
美团技术团队
L
Lohrmann on Cybersecurity

Tomas Vondra

Some more thoughts on random_page_cost How are committers selected? The real cost of random I/O The AI inversion Stabilizing Benchmarks Don't give Postgres too much memory (even on busy systems) Qubes OS is pretty great Wireguard to access a home network Don't give Postgres too much memory Tuning AIO in PostgreSQL 18 Using JWT to establish a trusted context for RLS Fun and weirdness with SSDs So why don't we pick the optimal query plan? How often is the query plan optimal? Benchmarking is hard, sometimes ... Advanced Patch Feedback Session (APFS) at pgconf.dev 2025 Good time to test io_method (for Postgres 18) [PATCH IDEA] adaptive execution for `IN` queries 15 years of Prague PostgreSQL Developer Day Performance archaeology: OLAP Performance archaeology: OLTP Tuning the glibc memory allocator (for Postgres) [PATCH IDEA] parallel pgbench -i Playing with BOLT and Postgres [PATCH IDEA] amcheck support for BRIN indexes Writing a good talk proposal Office hours experiment [PATCH IDEA] Using COPY for postgres_fdw INSERT batching Importing Postgres mailing list archives How to pick the first patch? Will Postgres development rely on mailing lists forever? The state of the Postgres community
[PATCH IDEA] Statistics for the file descriptor cache
2024-09-03 · via Tomas Vondra

Let me present another “first patch” idea, related to a runtime stats on access to files storing data. Having this kind of information would be very valuable on instances with many files (which can happen for many reasons).

This is a very different area than the patch idea, which was about an extension. The runtime stats are at the core of the system, and so is the interaction with the file systems. But it’s still fairly isolated, and thus suitable for new contributors.

If you didn’t read the first post about how to pick the first patch, maybe do so now. Everything I wrote in that post still applies - you should pick a patch that’s interesting (and useful) for you personally. Don’t jump on this idea simply because I posted about it.

If you decide to give this patch idea a try, let me know. Maybe not right away, but once you get past the initial experiments. Otherwise multiple people might be working on it, having to throw the work away after the first one submits it to pgsql-hackers. It gives insights and ability to review the submitted patch, so not a total waste of time. But if your goal was to write the first patch …

PostgreSQL has a lot of monitoring statistics about the database. This is useful for monitoring and investigating all sorts of operational issues. The docs have a very nice overview about this [https://www.postgresql.org/docs/current/monitoring-stats.html](cumulative statistics system) listing all the system views and available information. There’s data about tables/indexes, currently running queries, replication, SSL, WAL, … Most of the important subsystems provide some stats.

Note: Don’t confuse this with statistics used by the optimizer when planning queries. That’s a different thing. This patch idea is about stats used for monitoring - what the database is doing, not the data it’s processing.

Motivation

An important part of the database is accessing data stored in files on disk. Postgres stores data for individual objects separately - each table or index gets a separate “file” in the database directory. But not only that, the files are split into 1GB chunks (called “segments”).

For example, this is a tiny part of a list of files in directory for a database with OID 16478:

xeon:/mnt/data/pgdata/data/base/16478# ls -l
...
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.1
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.10
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.11
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.12
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.13
-rw------- 1 postgres postgres 1073741824 Aug 23 19:35 16491.14
...

There are many more files in that directory, but all those files store data for a single object. I’m not sure if the object is a table or an index, but that doesn’t matter too much. There’s much more details on the database file layout in the documentation.

There are reasons why it was done this way. Some older filesystems had limits on file size, and it may be difficult to deal with very large files for other reasons. But this post is not about that.

The consequence however is that databases may have a lot of files. 1TB table needs 1000 segments, and you also need segments for indexes. There can be much larger tables, perhaps even multiple of them, etc. Or maybe you just have many small tables.

Partitioning is becoming more common, and it’s not quite uncommon to see partitioned tables with hundreds or even thousands of partitions. Those partitions may be small (even smaller than 1GB), but they are separate objects and thus get a separate file.

So the number of files in a large instance can be quite high, and that comes with some overhead when accessing those files. In particular, it’s not free to open (or close) a file.

To deal with that, Postgres caches the file descriptors opened in each backend. The number of files kept open is defined by a the max_files_per_process configuration parameter:

max_files_per_process = 1000

This is the default value, and it means each backend will keep up to 1000 files open. That mostly works fine, but if your queries need to access more files, that may be fine.

Note: The size of the cache is not just up to Postgres, it depends on limits set by the operating system too. Check ulimit -n to make sure the max_files_per_process actually is in effect.

Consider for example query doing COUNT(*) on a table with 1000 small partitions. Those file descriptors won’t fit into the file descriptor cache, and will “trash” the cache on each execution. The entries would never be reused before eviction.

Opening/closing files is not extremely expensive, but it’s not free either. It requires a syscall, and filesystems need to optimize for other important stuff too. The amount of overhead may depend on the filesystem, of course.

The easiest solution in Postgres is to make the cache larger, to prevent the trashing. Results from a simple benchmark with max_files_per_process 1000 or 32768 on a big machine (in this case AMD EPYC with ~96/192 cores) may look like this:

Yup, that’s ~4-5x improvement of the throughput. Not bad!

Of course, this test is quite simplistic, and it was on a development build addressing a couple other scalability bottlenecks. Your queries likely do more expensive stuff than just COUNT(*), which would make the improvement more modest. But the message is clear - the size of the cache clearly matters.

Which (finally) gets me to the motivation - tuning the file descriptor cache size requires information about the number of hits/misses. And we have none of that information available in Postgres. To get some of this data you have to profile the backends using perf, strace, etc.

And that’s what this patch idea is about - to make this data available in the pg_stat_ system catalogs, so that people can monitor the cache hit ratio, and adjust the configuration parameter if needed.

There’s probably much more information that can be provided about the file descriptor cache, I haven’t thought about it too much.

Implementation

I think the implementation should not be too difficult, and needs to touch about three places:

  • storing the statistics in shared memory
  • places that open/close the files need to update counters
  • define the system views to make the information accessible

First, let’s talk about where the statistics would be stored. That’s an already solved problem - Postgres has a pgstat module doing that for other monitoring statistics. To add this new information, you’d need to add a new struct with the new counters to pgstat.h, along with an API to update those counters from places.

Then you’ll need to make sure the counters are updated from places that deal with the file descriptor cache. I think this should happen either in fd.c (functions like PathNameOpenFile), or maybe higher in md.c (functions like _mdfd_getseg). I’m not 100% sure what’s the correct layer to do this.

Finally, you’ll need to add the system views to make the collected info accessible from SQL. This happens by defining the view in system_views.sql, and implementing a couple C functions returning the data.

Obviously, you don’t have to invent how to do these things - we already do the same thing for other data. What I’d do is pick a similar data, also collected at the instance level, and copy and modify the relevant bits in the pgstat module. For example, the bgwriter seems like a good candidate, so find everything in pgstat.h that has “bgwriter” in the name, make a copy with “vfdcache” instead of “bgwriter” (or something like that) in the name, and tweak it as needed. Same for the pgstat.c and system_views.sql files.

Once you have that, the main question will be where to call the _hit() and _miss() functions (added to pgstat) from.

Risks

I don’t think there’s a risk the implementation would be impossible or even particularly complex - we collect similar stuff elsewhere, so there’s no reason why it wouldn’t work here.

There are other risks, though. It’s possible someone might argue this information is not as useful as I think it is, and this feature is not worth the extra complexity. I don’t think it’ll happen.

What might happen is a discussion about what other information we should collect and at what granularity. I mentioned only information necessary for calculating cache hit ratio at the instance level. But maybe we should be collecting this for individual backends (or even queries in pg_stat_statements), and so on.

I don’t think that’s a problem. Such discussions are what happens with patches, and maybe the patch will need such improvements. But I’d not worry about that at the beginning - do a simple working patch first, it can be improved later.

Conclusions

So that’s the second patch idea. If needed, you can reach out to me directly by email. There’s also a number of places where you can talk to a bigger group of Postgres developers, for example the pgsql-hackers, or the new discord channel. There are many similar channels, on various other social platforms.

Do you have feedback on this post? Please reach out by e-mail to tomas@vondra.me.