Personal blog of Christian Brauner

Linux Kernel VFSisms
Christian Brauner · 2019-06-28 · via Personal blog of Christian Brauner

Introduction

This is intended as a collection of helpful bits of knowledge about Linux kernel VFS internals. It mostly contains (hopefully) useful bits and pieces I picked up while working on the Linux kernel and talking to VFS maintainers or high-profile contributors.

ksys_close()

ksys_close() should never be used. One of the major reasons is that it is too easy to get wrong.

On creating and installing new file descriptors

A file descriptor should only be installed past every possible point of failure. For a syscall specifically, the file descriptor should be installed right before returning to userspace. Consider the function anon_inode_getfd(). This function creates and installs a new file descriptor for a task. Hence, by the rule given above, it should only ever be called when the syscall cannot fail anymore in any other way than by failing anon_inode_getfd().

For all other cases the rule is to reserve a file descriptor but defer its installation past the last point of failure. Note that installing a file descriptor is itself not an operation that can fail.

Back to the anonymous inode example: instead of calling anon_inode_getfd(), callers who need a file descriptor before the last point of failure should reserve a file descriptor, call anon_inode_getfile(), and then defer the fd_install() until after the last point of failure. Here is a concrete example blessed by Al Viro:

	if (clone_flags & CLONE_PIDFD) {
		/* reserve a new file descriptor */
		retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
		if (retval < 0)
			goto bad_fork_free_pid;

		pidfd = retval;

		/* get file to associate with file descriptor */
		pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
					      O_RDWR | O_CLOEXEC);
		if (IS_ERR(pidfile)) {
			put_unused_fd(pidfd);
			retval = PTR_ERR(pidfile);
			goto bad_fork_free_pid;
		}
		get_pid(pid);	/* held by pidfile now */

		/* place file descriptor in buffer accessible for userspace */
		retval = put_user(pidfd, parent_tidptr);
		if (retval)
			goto bad_fork_put_pidfd;
	}

	/* a lot more code that can fail somehow */

	/* Let kill terminate clone/fork in the middle */
	if (fatal_signal_pending(current)) {
		retval = -EINTR;
		goto bad_fork_cancel_cgroup;
	}

	/* past the last point of failure */
	if (pidfile)
		fd_install(pidfd, pidfile);
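
The same reserve-then-install ordering can be modeled outside the kernel. The following is only a userspace sketch: every toy_* name is hypothetical and merely mirrors the roles of get_unused_fd_flags(), put_unused_fd() and fd_install(). It assumes a tiny fixed-size table and no locking, so it illustrates the ordering, not the real implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace toy model, NOT kernel code: every toy_* name is hypothetical. */
#define TOY_MAX_FDS 8

struct toy_file {
	const char *name;
};

struct toy_fdtable {
	unsigned char reserved[TOY_MAX_FDS];	/* slot number has been handed out */
	struct toy_file *files[TOY_MAX_FDS];	/* non-NULL once installed (visible) */
};

/* Plays the role of get_unused_fd_flags(): reserving a slot can fail. */
static int toy_reserve_fd(struct toy_fdtable *t)
{
	for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
		if (!t->reserved[fd]) {
			t->reserved[fd] = 1;
			return fd;
		}
	}
	return -EMFILE;
}

/* Plays the role of put_unused_fd(): return a reserved, never installed slot. */
static void toy_put_unused_fd(struct toy_fdtable *t, int fd)
{
	assert(t->reserved[fd] && t->files[fd] == NULL);
	t->reserved[fd] = 0;
}

/* Plays the role of fd_install(): cannot fail, makes the file visible. */
static void toy_fd_install(struct toy_fdtable *t, int fd, struct toy_file *f)
{
	assert(t->reserved[fd] && t->files[fd] == NULL);
	t->files[fd] = f;
}

/*
 * Toy "syscall": reserve early, install only after the last step that can
 * fail. later_step_fails simulates an error after the reservation.
 */
static int toy_syscall(struct toy_fdtable *t, struct toy_file *f,
		       int later_step_fails)
{
	int fd = toy_reserve_fd(t);

	if (fd < 0)
		return fd;

	if (later_step_fails) {
		toy_put_unused_fd(t, fd);	/* nothing was ever visible */
		return -EINTR;
	}

	/* past the last point of failure */
	toy_fd_install(t, fd, f);
	return fd;
}
```

Because the file only becomes visible in toy_fd_install(), the error path never has to undo anything another thread could already have observed. That is the property the rule above is meant to guarantee.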

When a new directory is created in a filesystem, its inode needs to be initialized. The new_inode for the directory needs a link count of 2 (one for its entry in the parent directory and one for its own . entry), and the link count of parent_inode, the inode of the parent directory, needs to be incremented to account for the new directory's .. entry. There are a few places in the kernel where this is done like this:

inc_nlink(new_inode);
d_instantiate(dentry, new_inode);
inc_nlink(parent_inode);
fsnotify_mkdir(parent_inode, dentry);

But the preferred method of doing this is:

set_nlink(new_inode, 2);
d_instantiate(dentry, new_inode);
inc_nlink(parent_inode);
fsnotify_mkdir(parent_inode, dentry);

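This is safe because new_inode cannot be modified by anyone else concurrently at this point, and setting the link count explicitly does not rely on whatever value the inode started out with.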
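
To make the difference between the two variants concrete, here is a small userspace sketch. It is not kernel code: the toy_* names are hypothetical stand-ins for new_inode(), inc_nlink() and set_nlink(), and it assumes that a freshly allocated inode starts out with a link count of 1, which is my understanding of what the kernel does.

```c
#include <assert.h>

/* Userspace toy model, NOT kernel code: every toy_* name is hypothetical. */
struct toy_inode {
	unsigned int nlink;
};

/* Assumption: a freshly allocated inode starts with a link count of 1. */
static struct toy_inode toy_new_inode(void)
{
	struct toy_inode inode = { .nlink = 1 };

	return inode;
}

static void toy_inc_nlink(struct toy_inode *inode)
{
	inode->nlink++;
}

static void toy_set_nlink(struct toy_inode *inode, unsigned int nlink)
{
	inode->nlink = nlink;
}

/* First variant: reaches 2 only because the inode started at 1. */
static unsigned int toy_mkdir_relative(struct toy_inode *parent)
{
	struct toy_inode dir = toy_new_inode();

	toy_inc_nlink(&dir);	/* 1 -> 2, implicit */
	toy_inc_nlink(parent);	/* the new directory's ".." entry */
	return dir.nlink;
}

/* Preferred variant: states the intended link count directly. */
static unsigned int toy_mkdir_absolute(struct toy_inode *parent)
{
	struct toy_inode dir = toy_new_inode();

	toy_set_nlink(&dir, 2);	/* explicit, independent of the start value */
	toy_inc_nlink(parent);	/* the new directory's ".." entry */
	return dir.nlink;
}
```

Both helpers end up with the same counts. The second one simply spells out the intended value instead of depending on the initial state of the inode.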