



























Root inside a container is still root. That's the part people miss. Containers aren't VMs: there's no hypervisor wall between the process and the kernel. If your container runs as UID 0 and something goes wrong, the blast radius is much larger than it looks. The good news is that Linux already has the tools to contain this, and Kubernetes exposes them through the pod spec.
This article covers how Linux controls what a process can do, how containers are just Linux processes with some extra isolation, and then how Kubernetes exposes controls to manage all of that. If you follow along, by the end, you’ll have a running environment to verify every setting yourself.
We’ll start with user IDs. Linux uses a simple number to decide who you are. UID 0 is root, and root can do almost anything. UIDs from 1 to 999 are reserved for system accounts like daemon or bin. UIDs from 1000 and above are regular human users. When you run a container without specifying a user, and the image doesn't set one either, the process runs as root inside that container, which is a problem.
Next, we’ll prove this with the id command. Spin up a container, run it, and you’ll see exactly who the process thinks it is. Change the user to a non-root UID such as 1000 and run it again. The permissions change, and you can no longer write to places root could.
From there, we move to Kubernetes. Kubernetes wraps all of this in a securityContext block attached to a pod or container. You can set runAsUser, runAsNonRoot, readOnlyRootFilesystem, and more. Each one maps directly to a Linux concept underneath.
Every process running on Linux has an identity: a user ID (UID) and a group ID (GID). The kernel uses them to decide what the process is allowed to do.
When a process tries to open a file, the kernel checks: does this process's UID own this file? Is the process in the file's group? Does the "other" permission bit allow access? Based on that check, the kernel either grants the operation or returns EACCES (Permission denied).
UID 0 is special. A process running as UID 0 (root) bypasses most of these checks. That's why "don't run as root" is the first rule of container security.
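The check described above can be sketched in a few lines of Python. This is a simplified model, not the kernel's actual code: the function name `may_access` and its parameters are made up for illustration, and it ignores ACLs, individual capabilities, and the special handling of execute permission for root.

```python
def may_access(proc_uid, proc_gids, file_uid, file_gid, mode, want):
    """Classic owner/group/other check. `want` is a mask: 4 = read, 2 = write, 1 = execute."""
    if proc_uid == 0:
        # Simplification: treat UID 0 as bypassing the check entirely.
        return True
    if proc_uid == file_uid:
        bits = (mode >> 6) & 0o7      # owner bits
    elif file_gid in proc_gids:
        bits = (mode >> 3) & 0o7      # group bits
    else:
        bits = mode & 0o7             # "other" bits
    return (bits & want) == want


# UID 1000 trying to write to a root-owned directory with mode 755:
print(may_access(1000, {1000}, 0, 0, 0o755, 2))
```

Only one of the three bit groups is consulted: an owner whose owner bits deny access is refused even if the "other" bits would allow it.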
Print only the UID and GID lines from the kernel's information about the current process:
cat /proc/self/status | grep -E "^(Uid|Gid):"
Output:
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
The four numbers in each row are: real, effective, saved, and filesystem UID/GID. The effective UID is what the kernel uses when making permission decisions.
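To make the four columns concrete, here is a small Python sketch that labels them. The helper name `parse_ids` is hypothetical, and the sample text is hard-coded to mirror the output above; on a Linux host you could pass it the contents of /proc/self/status instead.

```python
def parse_ids(status_text):
    """Label the four Uid/Gid columns of a /proc/<pid>/status dump."""
    fields = ("real", "effective", "saved", "filesystem")
    result = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("Uid", "Gid"):
            result[key] = dict(zip(fields, (int(v) for v in rest.split())))
    return result


sample = "Name:\tcat\nUid:\t1000\t1000\t1000\t1000\nGid:\t1000\t1000\t1000\t1000\n"
ids = parse_ids(sample)
print(ids["Uid"]["effective"])
```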
By convention on most Linux distros: UIDs 0–999 are reserved for system and service accounts. UIDs 1000 and above are regular users. This is a convention enforced by tools like useradd, not a kernel rule—but it's why you'll often see apps run as UID 1000 or 65534 (nobody).
User-space programs (a Node.js app, a Python script, a compiled Go binary) can't directly touch hardware or the kernel's data structures. They have to ask the kernel to do things on their behalf. These requests are system calls (syscalls).

There are around 350 syscalls on x86-64 Linux (the exact count varies by architecture and kernel version).
You can see every syscall a process makes using strace:
strace -e trace=openat,read,write ls /tmp 2>&1 | head -20
Output:
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/tmp", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
...
This matters for security: if you can restrict which syscalls a process is allowed to make, you've dramatically reduced what a compromised process can do. That's exactly what seccomp does (covered later).
Historically, Linux had two categories of processes: root (UID 0, all-powerful) and everyone else. Since Linux 2.2 (1999), capabilities let you split root's power into distinct units that can be granted or removed individually.
There are around 40 capabilities in the Linux kernel. A few important ones:
| Capability | What it allows |
|---|---|
| CAP_CHOWN | Change file UID/GID arbitrarily |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 |
| CAP_NET_ADMIN | Configure network interfaces, routing, firewalls |
| CAP_SYS_ADMIN | Mount filesystems, manage namespaces, and many other things—effectively a second root |
| CAP_SYS_PTRACE | Attach to and trace any process |
| CAP_KILL | Send signals to any process |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks |
| CAP_SETUID | Switch to any UID |
You can inspect a process's capabilities by reading /proc/<pid>/status:
cat /proc/self/status | grep Cap
Output:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
These are bitmasks. A regular user process has zeroed effective capabilities. To decode a value into human-readable form:
capsh --decode=000001ffffffffff
Output:
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,
cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,
...
Decoding a hex bitmask works, but there's a simpler way to see all capabilities and their numeric positions. On systems with systemd v252 or later (Ubuntu 23.04+, Debian 12+, Fedora 37+):
systemd-analyze capability
Output (truncated):
0 cap_chown
1 cap_dac_override
2 cap_dac_read_search
3 cap_fowner
4 cap_fsetid
5 cap_kill
6 cap_setgid
7 cap_setuid
8 cap_setpcap
9 cap_linux_immutable
10 cap_net_bind_service
11 cap_net_broadcast
12 cap_net_admin
13 cap_net_raw
18 cap_sys_chroot
19 cap_sys_ptrace
21 cap_sys_admin
...
You can also query a specific capability by name:
systemd-analyze capability cap_net_admin
Output:
12
The number is the bit position in the /proc bitmask. cap_chown is bit 0, cap_net_admin is bit 12, cap_sys_admin is bit 21. This is why the bitmask value 000001ffffffffff covers all 41 capabilities: it has bits 0 through 40 set.
The bounding set (CapBnd) is the ceiling—no process can gain a capability outside its bounding set, even if it calls setuid(0).
Docker grants containers a default set of capabilities. Not all of them are needed by typical web apps. Here's how to check what a fresh container starts with:
docker run --rm alpine sh -c "cat /proc/1/status | grep Cap"
Output:
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
The following installs a tool called capsh inside the container to translate that hex number 00000000a80425fb into a human-readable list of capabilities.
docker run --rm alpine sh -c "apk add -q libcap && capsh --decode=00000000a80425fb"
Output:
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
A typical REST API doesn't need cap_chown, cap_net_raw, or cap_sys_chroot. These are attack surface.
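If capsh isn't available, the same decoding can be done by hand, since each bit position is a capability number. The sketch below is an illustration in Python; the name table follows the kernel's numbering for capabilities 0 through 40, the function name `decode_caps` is made up, and bits above 40 are silently ignored.

```python
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(hex_mask):
    """Return the capability names whose bits are set in a /proc hex bitmask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


docker_default = decode_caps("00000000a80425fb")
print(len(docker_default), ",".join(docker_default))
```

Running it on the CapEff value above yields the same 14 names capsh printed, which is a quick way to confirm what a container actually holds.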
Even with capabilities restricted, there's a classic escalation path: setuid binaries. A binary with the setuid bit set (like sudo or su) runs as its file owner rather than the calling user. If sudo is owned by root, any user can execute it with root's UID.
The no_new_privs flag (available since Linux 3.5) blocks this. Once set on a process via prctl(PR_SET_NO_NEW_PRIVS, 1), neither that process nor any of its children can gain privileges through execve()—setuid bits are ignored, capabilities can't be added. The flag is inherited by child processes and cannot be unset.
# Without no_new_privs, sudo can elevate to root (if setuid):
ls -la /usr/bin/sudo
Output:
-rwsr-xr-x 1 root root 232416 Apr 3 14:30 /usr/bin/sudo
# With no_new_privs set (here via setpriv from util-linux), the setuid bit has no effect:
setpriv --no-new-privs sudo id
Output:
sudo: effective uid is not 0, is sudo installed setuid root?
Kubernetes exposes this via allowPrivilegeEscalation: false.
Containers are Linux processes with namespace isolation. Namespaces give a process (and its children) an isolated view of a specific system resource. The relevant ones for security:
+---------------------------+---------------------------------------+
| Namespace Type | What it isolates |
+---------------------------+---------------------------------------+
| User namespace | UID/GID mappings |
| Mount namespace | Filesystem tree (what you can see) |
| PID namespace | Process tree (what processes exist) |
| Network namespace | Network interfaces, routing, ports |
| IPC namespace | Shared memory, message queues |
+---------------------------+---------------------------------------+
User namespaces are particularly interesting: they allow a process to appear as root (UID 0) inside a namespace while mapping to an unprivileged UID on the host. This is how rootless containers work. By default, Docker and Kubernetes don't enable user namespace remapping—when a container runs as UID 0, that's real UID 0 on the node.
A container is a Linux process—or a tree of processes—running with a restricted view of the system through namespaces. From the host, every container process is visible in the normal process table and fully inspectable through the /proc filesystem.
When you start a container, the container runtime (containerd or CRI-O) forks a child process. That child runs inside the container’s namespaces, but it still has a real PID in the host’s PID namespace. Run a container and look at the process tree on the host:
docker run -d --name demo alpine sleep 3600
ps -ef --forest | grep -A2 containerd
Output:
root 1234 1 0 10:00 ? 00:00:00 /usr/bin/containerd
root 5678 1234 0 10:01 ? 00:00:00  \_ containerd-shim-runc-v2 ...
root 5701 5678 0 10:01 ? 00:00:00      \_ sleep 3600
The containerd-shim is an intermediate process that sits between containerd and the container process. It decouples the container’s lifecycle from the daemon—if containerd restarts, the container keeps running because the shim stays alive. The actual container process (sleep 3600) is a direct child of the shim.
Docker exposes the host PID directly:
docker inspect --format '{{.State.Pid}}' demo
Output:
5701
/proc is a virtual filesystem the kernel maintains in memory. It exposes the live state of every process as a directory tree under /proc/<pid>/. Nothing is stored on disk—reads go directly to kernel data structures.
The namespace memberships of any process are visible as symlinks under /proc/<pid>/ns/:
ls -la /proc/5701/ns/
Output:
lrwxrwxrwx 1 root root 0 ... cgroup -> cgroup:[4026532739]
lrwxrwxrwx 1 root root 0 ... ipc -> ipc:[4026532737]
lrwxrwxrwx 1 root root 0 ... mnt -> mnt:[4026532735]
lrwxrwxrwx 1 root root 0 ... net -> net:[4026532740]
lrwxrwxrwx 1 root root 0 ... pid -> pid:[4026532738]
lrwxrwxrwx 1 root root 0 ... pid_for_children -> pid:[4026532738]
lrwxrwxrwx 1 root root 0 ... time -> time:[4026531834]
lrwxrwxrwx 1 root root 0 ... user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 ... uts -> uts:[4026532736]
Each symlink target is a namespace inode number. Two processes sharing the same inode are in the same namespace and see the same resource. A container gets distinct inodes for mount, PID, network, IPC, and UTS—that is what produces the isolated view. Compare the container’s network namespace with the host init process to confirm they differ:
# Container's net namespace
readlink /proc/5701/ns/net
# Host init process's net namespace
readlink /proc/1/ns/net
Output:
net:[4026532740]
net:[4026531840]
Different inodes—different network namespaces.
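The comparison is easy to script. The following Python sketch parses link targets of the form shown above and compares them; the helper names `ns_inode` and `same_namespace` are made up for this example, and the inputs are the sample values from the output rather than live data.

```python
import re


def ns_inode(link_target):
    """Split a target such as 'net:[4026532740]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", link_target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link_target!r}")
    return match.group(1), int(match.group(2))


def same_namespace(target_a, target_b):
    """Two processes share a namespace when type and inode both match."""
    return ns_inode(target_a) == ns_inode(target_b)


print(same_namespace("net:[4026532740]", "net:[4026531840]"))
```

On a Linux host, the targets would come from os.readlink(f"/proc/{pid}/ns/net"), which usually requires root for processes you don't own.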
Every process entry in /proc has a root symlink pointing to the root of that process’s mount namespace. For a container process, that is the container’s entire filesystem, accessible directly from the host:
PID=$(docker inspect --format '{{.State.Pid}}' demo)
ls /proc/$PID/root/
You can write a file inside the container and verify it is immediately visible through /proc from the host:
# Write a file inside the container
docker exec demo sh -c "echo hello > /tmp/test.txt"
# Read it from the host via /proc without docker exec
cat /proc/$PID/root/tmp/test.txt
Output:
hello
This works because both paths—docker exec and /proc/$PID/root—are views into the same mount namespace. There is no copy or snapshot involved.
This access path works even for distroless or scratch images that contain no shell. When a container image has no debugging tools, /proc/<pid>/root is how you read its filesystem from the host without modifying the image or the running container.
nsenter calls the setns(2) syscall to join an existing process’s namespaces. This is what docker exec does internally, exposed as a standalone host tool—useful when the container runtime is unavailable, the container has no shell, or you need to enter only specific namespaces.
Enter the container's mount, UTS, IPC, network, and PID namespaces:
nsenter --target $PID --mount --uts --ipc --net --pid -- /bin/sh
You can also enter a single namespace. Running a network diagnostic in the container’s network namespace while keeping the host’s mount namespace gives you access to the host’s tools while seeing the container’s network interfaces:
nsenter --target $PID --net -- ss -tulnp
This is useful when the container image has no networking utilities installed.
Before touching any YAML, you can verify all of this with plain Docker commands. These work on Linux, macOS, and Windows—Docker handles the Linux kernel layer for you.
# Default: runs as root
docker run --rm alpine id
Output:
uid=0(root) gid=0(root) groups=0(root)
# Specify a UID
docker run --rm --user 1000:1000 alpine id
Output:
uid=1000 gid=1000 groups=1000
In this section we will see what happens when we drop a capability.
We will run three commands:
touch /tmp/test: creates an empty file at /tmp/test.
chown 1000 /tmp/test: changes the owner of that file to UID 1000. This requires CAP_CHOWN, the capability from the table earlier, here used in practice to change ownership.
ls -la /tmp/test: shows the file details so you can see who owns it after the chown.
# Default: has cap_chown, can change file ownership
docker run --rm alpine sh -c "touch /tmp/test && chown 1000 /tmp/test && ls -la /tmp/test"
Output:
-rw-r--r-- 1 1000 root 0 Apr 26 09:00 /tmp/test
Now let's drop all capabilities and try to create the file and change its ownership again:
# Drop all caps, then try the same:
docker run --rm --cap-drop=ALL alpine sh -c "touch /tmp/test && chown 1000 /tmp/test"
Output:
chown: /tmp/test: Operation not permitted
Next, make the root filesystem read-only and try to create a file:
docker run --rm --read-only alpine sh -c "touch /newfile"
Output:
touch: /newfile: Read-only file system
Now let's mount a writable tmpfs at /tmp, which gives the app somewhere to write:
# Mount a writable tmpfs at /tmp so the app still has somewhere to write:
docker run --rm --read-only --tmpfs /tmp alpine sh -c "touch /tmp/ok && echo 'wrote to /tmp'"
Output:
wrote to /tmp
The --tmpfs flag mounts a temporary in-memory filesystem at /tmp.
# Block the "unshare" syscall with a custom seccomp profile
cat deny-unshare.json
Output:
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["unshare"], "action": "SCMP_ACT_ERRNO" }
  ]
}
The following runs a container with a custom seccomp profile that blocks the unshare syscall, then immediately tries to use it, showing that seccomp rejects the call with an error:
docker run --rm --security-opt seccomp=deny-unshare.json alpine unshare -u
Output:
unshare: unshare(0x10000000): Operation not permitted
Kubernetes doesn't invent new security primitives—it passes your securityContext configuration to the container runtime (containerd or CRI-O), which calls the appropriate kernel APIs.

The securityContext field exists at two levels:
Pod level (spec.securityContext): applies to all containers in the pod. Controls like runAsUser, fsGroup, and seccompProfile go here.
Container level (spec.containers[].securityContext): applies to a specific container. Controls like capabilities, readOnlyRootFilesystem, and allowPrivilegeEscalation go here.
If the same field appears at both levels, the container-level setting wins for that container.
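The precedence rule can be modelled as a shallow dictionary merge. This Python sketch is only an illustration of the rule stated above: the function name `effective_security_context` is invented, and it does not model which fields Kubernetes actually accepts at each level (fsGroup, for instance, is pod-level only).

```python
def effective_security_context(pod_ctx, container_ctx):
    """Container-level values override pod-level values set for the same field."""
    merged = dict(pod_ctx)        # start from the pod-wide defaults
    merged.update(container_ctx)  # container settings win on conflict
    return merged


pod = {"runAsUser": 1000, "runAsNonRoot": True}
container = {"runAsUser": 2000, "readOnlyRootFilesystem": True}
print(effective_security_context(pod, container))
```

Fields set only at the pod level, such as runAsNonRoot here, still apply to the container because nothing overrides them.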
runAsUser sets the UID the container's main process runs as. runAsNonRoot: true makes the kubelet reject the container at startup if the effective UID would be 0—it doesn't change the UID, it just blocks root.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
Important trap: If you set runAsUser at the pod level and your container image uses a different UID (set via USER 1001 in the Dockerfile), Kubernetes overrides the image's setting. The process runs as the UID you specified, not what the image author intended. If that UID can't read the files the app needs, you get permission errors that are hard to debug.
Safer approach: Set runAsNonRoot: true without specifying runAsUser. This enforces that the image doesn't run as root, while respecting whatever UID the image was built to use. Only set runAsUser explicitly if you have a specific reason (e.g., the image has no USER directive and you need to guarantee a particular UID).
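The interaction between runAsNonRoot, runAsUser, and the image's USER can be summarised as a small decision function. This is a sketch of the behaviour described above, not kubelet code; `non_root_check` and its return shape are hypothetical, and the image user is assumed to be the raw USER string from the image (or None when there is no USER directive).

```python
def non_root_check(run_as_non_root, run_as_user, image_user):
    """Return (allowed, uid_or_reason) for a container start attempt."""
    if run_as_user is not None:
        uid = run_as_user                      # runAsUser overrides the image's USER
    elif image_user is None:
        uid = 0                                # no USER directive: image defaults to root
    else:
        user_part = str(image_user).split(":")[0]
        uid = int(user_part) if user_part.isdigit() else None  # named user: UID unknown
    if not run_as_non_root:
        return True, uid
    if uid is None:
        return False, "cannot verify a non-numeric user is non-root"
    if uid == 0:
        return False, "container would run as root"
    return True, uid


print(non_root_check(True, None, None))     # image with no USER, no runAsUser
print(non_root_check(True, None, "1001"))   # image built with USER 1001
```

The named-user case is why images meant to run under runAsNonRoot usually declare a numeric USER rather than a user name.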
# Verify inside a running pod:
kubectl exec -it mypod -- id
Output:
uid=1000 gid=0(root) groups=0(root)
The following reads the kernel-level UID and GID of the pod's main process (PID 1) directly from the Linux process table, bypassing any userspace tools that could lie about it:
kubectl exec -it mypod -- cat /proc/1/status | grep -E "^(Uid|Gid):"
Output:
Uid: 1000 1000 1000 1000
Gid: 0 0 0 0
In Kubernetes, capability names are written without the CAP_ prefix:
containers:
  - name: app
    securityContext:
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
drop: ["ALL"] removes every capability from the bounding set. Then add restores only what you explicitly list. Drop first, then add back what you need.
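The drop-then-add order can be expressed as plain set arithmetic. The Python sketch below assumes the runtime starts from the 14-capability Docker default listed earlier (other runtimes may differ slightly); the function name `effective_caps` is made up, and an "ALL" entry in add is not modelled.

```python
DOCKER_DEFAULT = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "SETGID", "SETUID",
    "SETPCAP", "NET_BIND_SERVICE", "NET_RAW", "SYS_CHROOT", "MKNOD",
    "AUDIT_WRITE", "SETFCAP",
})


def effective_caps(default, drop, add):
    """Apply drop first, then add, as the securityContext semantics describe."""
    remaining = set() if "ALL" in drop else set(default) - set(drop)
    return remaining | set(add)


print(sorted(effective_caps(DOCKER_DEFAULT, ["ALL"], ["NET_BIND_SERVICE"])))
```

Dropping a single capability instead, for example drop: ["NET_RAW"], leaves the other 13 defaults in place, which is why drop: ["ALL"] is the safer starting point.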
Common cases where you need to add a capability back:
NET_BIND_SERVICE: the app binds to a port below 1024.
SYS_PTRACE: debugging with gdb or strace. Don't include it in production.
To verify capabilities are dropped inside a pod:
kubectl exec -it mypod -- cat /proc/1/status | grep Cap
Output:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
All zeroes means no capabilities—this is what you want after dropping all.
allowPrivilegeEscalation maps directly to the no_new_privs flag described earlier. Setting it to false sets no_new_privs, which locks the process's privilege ceiling: no child process can gain capabilities via setuid binaries or execve().
containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
This is safe for almost every production workload. The only cases where you'd need true are workloads that intentionally use setuid binaries—rare in containers.
Note: if you drop all capabilities, privilege escalation is already heavily limited: a setuid-root binary can still switch the process to UID 0, but it cannot regain capabilities outside the empty bounding set. Setting allowPrivilegeEscalation: false is still worth including as defense in depth. It closes the setuid path entirely and keeps blocking escalation even if a capability accidentally gets re-added later.
readOnlyRootFilesystem: true mounts the container's root filesystem with the MS_RDONLY flag. The kernel's virtual filesystem layer (VFS) enforces this: any modifying syscall (write(), mkdir(), unlink(), etc.) against the root filesystem fails with EROFS (Read-only file system).
containers:
  - name: app
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp-dir
        mountPath: /tmp
volumes:
  - name: tmp-dir
    emptyDir: {}
If your app writes anywhere (temp files, logs, PID files), you need to mount an emptyDir volume at those paths. Common paths that need writable volumes: /tmp, /var/log, /var/run, /app/logs.
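❗In Kubernetes, an emptyDir is not an in-memory volume by default, even though it is a temporary one.
By default it is backed by whatever storage the node uses (disk, SSD, or network storage) and is deleted when the pod is removed from the node.
You can make it memory-backed (tmpfs) instead, but that is less common:
volumes:
  - name: tmp
    emptyDir:
      medium: Memory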
Start by identifying where your app writes. The easiest way: run the container without readOnlyRootFilesystem, deploy it, exercise all code paths, then check what files were created:
kubectl exec -it mypod -- find / -newer /proc -not -path '/proc/*' -not -path '/sys/*' 2>/dev/null
Output:
/tmp/app.pid
/var/log/app.log
Mount emptyDir volumes at those paths, then enable readOnlyRootFilesystem: true.
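Once you have the list of written files, checking it against your planned mounts is mechanical. This Python sketch does that comparison; `uncovered_writes` is a hypothetical helper, and the paths are the sample ones from the output above.

```python
from pathlib import PurePosixPath


def uncovered_writes(written_paths, mount_paths):
    """Return the written paths that do not fall under any writable mount."""
    mounts = [PurePosixPath(m) for m in mount_paths]
    missing = []
    for raw in written_paths:
        path = PurePosixPath(raw)
        covered = any(path == m or m in path.parents for m in mounts)
        if not covered:
            missing.append(raw)
    return missing


written = ["/tmp/app.pid", "/var/log/app.log"]
print(uncovered_writes(written, ["/tmp"]))
```

Anything the function returns would fail with EROFS once readOnlyRootFilesystem: true is enabled, so each entry needs either a volume mount or an application config change.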
Seccomp is a Linux kernel security feature that sandboxes processes by restricting which syscalls they can make. Limiting a process to the syscalls it actually needs removes unnecessary attack surface. The container runtimes behind Docker and Kubernetes use seccomp to restrict what containers can do.
Seccomp (Secure Computing Mode) attaches a BPF program to a process that filters syscalls before the kernel executes them. If the process calls a blocked syscall, the kernel returns an error or kills the process with SIGSYS, depending on how the profile is configured.
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
RuntimeDefault uses the container runtime's built-in profile. Containerd's default profile allows syscalls a typical application needs (file I/O, networking, process management) while blocking ones that are dangerous in containers: mount(), ptrace(), reboot(), kexec_load(), create_module(), and others.
You can also write a custom allowlist profile in JSON:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "mmap",
                "mprotect", "munmap", "brk", "rt_sigaction", "rt_sigprocmask",
                "rt_sigreturn", "ioctl", "socket", "connect", "sendto",
                "recvfrom", "bind", "listen", "accept4", "getsockname",
                "setsockopt", "getsockopt", "clone", "execve", "exit_group",
                "futex", "getdents64", "getcwd", "getpid", "getuid", "getgid",
                "geteuid", "getegid", "clock_gettime", "getrandom",
                "epoll_create1", "epoll_ctl", "epoll_wait", "nanosleep",
                "set_tid_address", "set_robust_list", "tgkill", "pipe2",
                "fcntl", "lseek", "newfstatat", "sendmsg", "recvmsg"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Store the profile on every node (e.g., /var/lib/kubelet/seccomp/my-app.json), then reference it:
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: my-app.json
For most applications, RuntimeDefault is the right starting point. Custom profiles are for high-security environments where you want to lock down to only the syscalls your specific app uses.
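To see how a denylist and an allowlist profile differ, it helps to model the lookup a profile implies. The Python sketch below is a deliberately simplified reading of the JSON format: it matches on syscall name only, ignores argument conditions, and assumes each syscall appears in at most one rule. The function name `seccomp_action` and the shortened allowlist are made up for illustration.

```python
def seccomp_action(profile, syscall):
    """Return the action a profile assigns to a syscall name."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]   # nothing matched: fall back to the default


denylist = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [{"names": ["unshare"], "action": "SCMP_ACT_ERRNO"}],
}
allowlist = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write", "openat"], "action": "SCMP_ACT_ALLOW"}],
}

for name in ("read", "unshare", "mount"):
    print(name, seccomp_action(denylist, name), seccomp_action(allowlist, name))
```

The denylist lets mount through because nobody thought to list it, while the allowlist rejects it by default. That asymmetry is the argument for allowlists in high-security environments.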
When a non-root container mounts a Persistent Volume Claim (PVC), the files on that volume are typically owned by root (UID 0). A process running as UID 1000 can't write to a directory owned by root with mode 755.
fsGroup solves this. When you set fsGroup: 2000, the kubelet (running as root on the node) recursively changes the group ownership of all mounted volumes to GID 2000 before starting the container. It also sets the setgid bit on the directory so new files inherit GID 2000.
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
Verify this inside the pod:
kubectl exec -it mypod -- ls -la /data
Output:
total 8
drwxrwsr-x 2 root 2000 4096 Apr 26 09:00 . <-- group=2000, setgid bit
drwxr-xr-x 1 root root 38 Apr 26 09:00 ..
The following shows what the process inside the pod is running as:
kubectl exec -it mypod -- id
Output:
uid=1000 gid=3000 groups=3000,2000 <-- 2000 added as supplemental group
The recursive chown on large volumes with thousands of files takes time and delays pod startup. fsGroupChangePolicy: OnRootMismatch tells the kubelet to skip the chown if the volume's root directory already has the correct group ownership:
spec:
  securityContext:
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch
This is safe if you control the volume lifecycle. For shared volumes reused across pods with different fsGroup values, use Always (the default) to be safe.
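The decision the policy controls can be sketched as follows. This is an approximation for illustration, not the kubelet's implementation: the function name `needs_recursive_chown` is invented, and the permission test (setgid bit plus group read/write on the volume root) is an assumption about what "correct" means here.

```python
import stat


def needs_recursive_chown(policy, root_gid, root_mode, fs_group):
    """Decide whether the volume would be walked and re-owned before start."""
    if policy == "Always":
        return True
    if policy != "OnRootMismatch":
        raise ValueError(f"unknown fsGroupChangePolicy: {policy}")
    group_ok = root_gid == fs_group
    # Assumed check: setgid bit set and group has read + write on the root dir.
    perms_ok = bool(root_mode & stat.S_ISGID) and (root_mode & 0o060) == 0o060
    return not (group_ok and perms_ok)


# Volume root already owned by GID 2000 with mode drwxrwsr-x (0o2775):
print(needs_recursive_chown("OnRootMismatch", 2000, 0o2775, 2000))
```

Only the root directory is inspected, which is why the policy is fast and also why it can miss files deeper in the tree that were written with a different group.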
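Not all storage drivers support the fsGroup chown behavior. CSI drivers declare their support via the fsGroupPolicy field in the CSIDriver object:
ReadWriteOnceWithFSType (the default): the kubelet applies fsGroup only if the volume is ReadWriteOnce and fsType is set.
File: the kubelet always applies fsGroup.
None: the kubelet never applies fsGroup and the volume is mounted as-is.
kubectl get csidriver ebs.csi.aws.com -o jsonpath='{.spec.fsGroupPolicy}'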
Output:
ReadWriteOnceWithFSType
If your CSI driver has FSGroupPolicy: None, setting fsGroup in the pod spec won't change anything on the volume.
Here's a complete pod manifest to verify each security control hands-on.
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-test
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: test
      image: busybox:1.36
      command: ["sh", "-c", "sleep 3600"]
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: data
          mountPath: /data
  volumes:
    - name: tmp
      emptyDir: {}
    - name: data
      emptyDir: {}
kubectl apply -f test-pod.yaml
kubectl wait --for=condition=Ready pod/security-test
1. Check UID/GID:
kubectl exec security-test -- id
Output:
uid=1000 gid=3000 groups=3000,2000
2. Verify no capabilities:
kubectl exec security-test -- cat /proc/1/status | grep Cap
Output:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
3. Root filesystem is read-only:
kubectl exec security-test -- touch /newfile
Output:
touch: /newfile: Read-only file system
4. /tmp is writable:
kubectl exec security-test -- sh -c "touch /tmp/ok && ls /tmp/"
Output:
ok
5. fsGroup on the mounted volume:
kubectl exec security-test -- ls -la /data
Output:
drwxrwsr-x 2 root 2000 4096 ...
6. no_new_privs is set:
kubectl exec security-test -- cat /proc/1/status | grep NoNewPrivs
Output:
NoNewPrivs: 1
7. Seccomp filter is active:
kubectl exec security-test -- cat /proc/1/status | grep Seccomp
Output:
Seccomp: 2
A value of 2 means a BPF filter is active (0 = off, 1 = strict mode, 2 = filter mode).
# Should fail: no CAP_CHOWN
kubectl exec security-test -- chown 0 /tmp/ok
Output:
chown: /tmp/ok: Operation not permitted
# Should fail: no CAP_NET_BIND_SERVICE
kubectl exec security-test -- nc -l -p 80
Output:
nc: bind: Permission denied
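The individual checks above can be rolled into one pass over /proc/1/status. The Python sketch below is one way to do it; `check_hardening` is a hypothetical helper, and the sample text is hand-written to match the outputs shown in the steps above rather than captured from a real pod.

```python
def check_hardening(status_text):
    """Evaluate the hardening-related fields of a /proc/<pid>/status dump."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.split()
    return {
        "non_root": int(fields["Uid"][1]) != 0,               # effective UID
        "no_capabilities": int(fields["CapEff"][0], 16) == 0,
        "no_new_privs": fields["NoNewPrivs"][0] == "1",
        "seccomp_filter": fields["Seccomp"][0] == "2",
    }


sample_status = (
    "Uid:\t1000\t1000\t1000\t1000\n"
    "Gid:\t3000\t3000\t3000\t3000\n"
    "CapEff:\t0000000000000000\n"
    "NoNewPrivs:\t1\n"
    "Seccomp:\t2\n"
)
print(check_hardening(sample_status))
```

To use it against the test pod, you could save the output of kubectl exec security-test -- cat /proc/1/status to a file and pass its contents to the function.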
Even if the image runs as root by default (the Dockerfile doesn't specify any user at all), we can still make it run as non-root.
Assume we have the following Dockerfile:
FROM python:3.12-slim
WORKDIR /app
# Install app — everything owned by root, world-readable (default)
COPY app.py .
# No USER directive → process will run as root (UID 0) by default
EXPOSE 8080
ENV PYTHONDONTWRITEBYTECODE=1
CMD ["python", "app.py"]
Here's a complete Deployment and Service manifest combining everything:
# No Dockerfile change needed — Kubernetes overrides the running UID.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app-secure
  labels:
    app: demo-app
    variant: secure
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
      variant: secure
  template:
    metadata:
      labels:
        app: demo-app
        variant: secure
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000 # Override the Dockerfile default (root); with no USER directive we must set this
        runAsGroup: 3000
        fsGroup: 2000
        fsGroupChangePolicy: OnRootMismatch
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: demo-app:root # Same root-built image!
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            readOnlyRootFilesystem: true
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: logs
              mountPath: /var/log/app
      volumes:
        - name: tmp
          emptyDir: {}
        - name: logs
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app-secure
spec:
  selector:
    app: demo-app
    variant: secure
  ports:
    - port: 8080
      targetPort: 8080
Each setting reduces a distinct attack surface:
Control Effect
-------------------------------- ------------------------------------------------------------
runAsNonRoot: true Rejects the pod at startup if the container UID would be 0
capabilities.drop: ["ALL"] Strips all Linux capabilities — no privileged ops possible
allowPrivilegeEscalation: false Sets no_new_privs: blocks privilege gain via setuid
binaries or execve()
readOnlyRootFilesystem: true Mounts the container filesystem read-only — blocks all
runtime writes to the root filesystem
seccompProfile: RuntimeDefault BPF filter blocking ~50 dangerous syscalls
(mount, ptrace, reboot, kexec_load, ...)
fsGroup: 2000 kubelet chowns mounted volumes to GID 2000 before the
container starts, granting non-root processes write access
Apply them incrementally: start with runAsNonRoot and allowPrivilegeEscalation: false (these almost never break anything), then add capabilities.drop: ["ALL"] and test, then readOnlyRootFilesystem: true with the appropriate emptyDir mounts, and finally seccompProfile: RuntimeDefault. One layer at a time is how you reach a secure-by-default baseline without breaking production.
Containers are not virtual machines. They are Linux processes with restricted vision, isolated using namespaces and controlled by the kernel the same way any other process is.
Everything Kubernetes exposes in securityContext maps directly to a Linux primitive underneath:
runAsUser and runAsNonRoot control which UID the process runs as, keeping it away from UID 0, which bypasses most kernel permission checks.
capabilities.drop: ["ALL"] strips the default set of ~14 capabilities Docker grants containers, removing powers like CAP_CHOWN, CAP_NET_RAW, and CAP_SYS_CHROOT that typical applications never need.
allowPrivilegeEscalation: false sets the no_new_privs kernel flag, blocking privilege gain through setuid binaries and execve(), even if a capability accidentally gets re-added later.
readOnlyRootFilesystem: true mounts the container's filesystem read-only at the kernel level, stopping an attacker from writing malware, modifying binaries, or leaving backdoors — even with code execution inside the container.
seccompProfile: RuntimeDefault attaches a BPF filter that blocks around 50 dangerous syscalls like mount, ptrace, reboot, kexec_load and others before the kernel executes them.
fsGroup lets non-root containers write to mounted volumes by having the kubelet chown the volume to a specific GID before the container starts.
Apply these controls incrementally. Start with runAsNonRoot and allowPrivilegeEscalation: false, then add capability drops, then readOnlyRootFilesystem, then seccomp. Each layer independently reduces attack surface, and together they form a secure-by-default baseline for production workloads.
If you want to go deeper on Linux and Kubernetes security, a dedicated rapid course is currently in the making. It will cover these concepts hands-on from the ground up.