How Kubernetes Controls What Your Containers Can Do

Devoriales - DevOps and Python Tutorials


Aleksandro Matejic · 2026-04-26 · via Devoriales - DevOps and Python Tutorials

Root inside a container is still root. That's the part people miss. Containers aren't VMs: there's no hypervisor wall between the process and the kernel. If your container runs as UID 0 and something goes wrong, the blast radius is much larger than it looks. The good news is that Linux already has the tools to contain this, and Kubernetes exposes them to us.

This article covers how Linux controls what a process can do, how containers are just Linux processes with some extra isolation, and then how Kubernetes exposes controls to manage all of that. If you follow along, by the end, you’ll have a running environment to verify every setting yourself.

We’ll start with user IDs. Linux uses a simple number to decide who you are. UID 0 is root, and root can do almost anything. By convention, UIDs from 1 to 999 are reserved for system and service accounts such as daemon, and UIDs from 1000 upward are regular users. When you run a container without specifying a user and the image doesn't set one, the process runs as root inside that container, which is a problem.

Next, we’ll prove this with the id command. Spin up a container, run id, and you’ll see exactly who the process thinks it is. Change the user to UID 1000 and run it again. The identity changes, and the process can no longer write to places root could.

From there, we move to Kubernetes. Kubernetes wraps all of this in a securityContext block attached to a pod or container. You can set runAsUser, runAsNonRoot, readOnlyRootFilesystem, and more. Each one maps directly to a Linux concept underneath.

Every process running on Linux has an identity: a user ID (UID) and a group ID (GID). The kernel uses them to decide what the process is allowed to do.

When a process tries to open a file, the kernel checks: does this process's UID own this file? Is the process in the file's group? Does the "other" permission bit allow access? Based on that check, the kernel either grants the operation or returns EACCES (Permission denied).

UID 0 is special. A process running as UID 0 (root) bypasses most of these checks. That's why "don't run as root" is the first rule of container security.
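The owner/group/other check described above can be sketched in a few lines of Python. This is a simplified model, not the kernel's implementation: it ignores ACLs, security modules, and the finer rules around capabilities, and the function name may_access is made up for this illustration.

```python
def may_access(uid, gids, st_uid, st_gid, mode, want):
    """Simplified model of the classic Unix permission check.

    want is a 3-bit mask: 4 = read, 2 = write, 1 = execute.
    """
    if uid == 0:
        # Simplification: root bypasses most of these checks.
        return True
    if uid == st_uid:
        granted = (mode >> 6) & 0o7   # owner bits
    elif st_gid in gids:
        granted = (mode >> 3) & 0o7   # group bits
    else:
        granted = mode & 0o7          # "other" bits
    return (granted & want) == want


# UID 1000 trying to write into a root-owned directory with mode 755.
print(may_access(1000, {1000}, st_uid=0, st_gid=0, mode=0o755, want=2))
```

Only one of the three permission classes is consulted per check, which is why a file's owner can be denied access that "other" users would get.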

Print only the UID and GID lines from the kernel's information about the current process:

cat /proc/self/status | grep -E "^(Uid|Gid):"

Output:

Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000

The four numbers in each row are: real, effective, saved, and filesystem UID/GID. The effective UID is what the kernel uses when making permission decisions.
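To make the four columns concrete, here is a small Python sketch that parses the Uid and Gid rows into named fields. The names Ids and parse_ids are invented for this example, and the sample text mirrors the output shown above.

```python
from typing import NamedTuple


class Ids(NamedTuple):
    real: int
    effective: int
    saved: int
    filesystem: int


def parse_ids(status_text, field):
    """Return the four IDs from the 'Uid' or 'Gid' row of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return Ids(*map(int, line.split(":", 1)[1].split()))
    raise KeyError(field)


sample = "Name:\tbash\nUid:\t1000\t1000\t1000\t1000\nGid:\t1000\t1000\t1000\t1000\n"
print(parse_ids(sample, "Uid").effective)
```

On a Linux machine you could pass open("/proc/self/status").read() instead of the sample string.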

By convention on most Linux distros: UIDs 0–999 are reserved for system and service accounts. UIDs 1000 and above are regular users. This is a convention enforced by tools like useradd, not a kernel rule—but it's why you'll often see apps run as UID 1000 or 65534 (nobody).

System Calls: The Only Way to Talk to the Kernel

User-space programs (a Node.js app, a Python script, a compiled Go binary) can't directly touch hardware or the kernel's data structures. They have to ask the kernel to do things on their behalf. These requests are system calls (syscalls).

Linux Kernel Layers

There are around 350 syscalls on x86-64 Linux (the exact count varies by architecture and kernel version).


You can see every syscall a process makes using strace:

strace -e trace=openat,read,write ls /tmp 2>&1 | head -20

Output:

This matters for security: if you can restrict which syscalls a process is allowed to make, you've dramatically reduced what a compromised process can do. That's exactly what seccomp does (covered later).

Linux Capabilities: Splitting Root Privileges

Historically, Linux had two categories of processes: root (UID 0, all-powerful) and everyone else. Since Linux 2.2 (1999), capabilities let you split root's power into distinct units that can be granted or removed individually.

There are around 40 capabilities in the Linux kernel. A few important ones:

Capability	What it allows
CAP_CHOWN	Change file UID/GID arbitrarily
CAP_NET_BIND_SERVICE	Bind to ports below 1024
CAP_NET_ADMIN	Configure network interfaces, routing, firewalls
CAP_SYS_ADMIN	Mount filesystems, manage namespaces, and many other things—effectively a second root
CAP_SYS_PTRACE	Attach to and trace any process
CAP_KILL	Send signals to any process
CAP_DAC_OVERRIDE	Bypass file read/write/execute permission checks
CAP_SETUID	Switch to any UID

You can inspect a process's capabilities by reading /proc/<pid>/status:

cat /proc/self/status | grep Cap

Output:

CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

These are bitmasks. A regular user process has zeroed effective capabilities. To decode a value into human-readable form:

capsh --decode=000001ffffffffff

Output:

0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,
cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,
...

Listing capabilities with systemd-analyze

Decoding a hex bitmask works, but there's a simpler way to see all capabilities and their numeric positions. On systems with systemd v252 or later (Ubuntu 23.04+, Debian 12+, Fedora 37+):

systemd-analyze capability

Output (truncated):

0 cap_chown
1 cap_dac_override
2 cap_dac_read_search
3 cap_fowner
4 cap_fsetid
5 cap_kill
6 cap_setgid
7 cap_setuid
8 cap_setpcap
9 cap_linux_immutable
10 cap_net_bind_service
11 cap_net_broadcast
12 cap_net_admin
13 cap_net_raw
18 cap_sys_chroot
19 cap_sys_ptrace
21 cap_sys_admin
...

You can also query a specific capability by name:

systemd-analyze capability cap_net_admin

Output:

The number is the bit position in the /proc bitmask. cap_chown is bit 0, cap_net_admin is bit 12, cap_sys_admin is bit 21. This is why the bitmask value 000001ffffffffff covers all 41 capabilities: it has bits 0 through 40 set.

The bounding set (CapBnd) is the ceiling—no process can gain a capability outside its bounding set, even if it calls setuid(0).

Docker grants containers a default set of capabilities. Not all of them are needed by typical web apps. Here's how to check what a fresh container starts with:

docker run --rm alpine sh -c "cat /proc/1/status | grep Cap"

Output:

CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000

The following installs a tool called capsh inside the container to translate that hex number 00000000a80425fb into a human-readable list of capabilities.

docker run --rm alpine sh -c "apk add -q libcap && capsh --decode=00000000a80425fb"

Output:

0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

A typical REST API doesn't need cap_chown, cap_net_raw, or cap_sys_chroot. These are attack surface.
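If capsh isn't available, the same decoding can be approximated in Python, because each capability is just a bit position in the mask. This sketch hard-codes the 41 capability names for bits 0 through 40; kernels older than 5.9 define fewer, and decode_caps is a name chosen for this example.

```python
# Capability names indexed by bit position (bits 0-40).
CAP_NAMES = """
chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap
linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock
ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin
sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore
""".split()


def decode_caps(hex_mask):
    """Translate a hex bitmask from /proc/<pid>/status into capability names."""
    mask = int(hex_mask, 16)
    return ["cap_" + name
            for bit, name in enumerate(CAP_NAMES)
            if (mask >> bit) & 1]


print(",".join(decode_caps("00000000a80425fb")))
```

Running it on the container's mask prints the same 14 capabilities that capsh reported above.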

no_new_privs: Locking the Ceiling

Even with capabilities restricted, there's a classic escalation path: setuid binaries. A binary with the setuid bit set (like sudo or su) runs as its file owner rather than the calling user. If sudo is owned by root, any user can execute it with root's UID.

The no_new_privs flag (available since Linux 3.5) blocks this. Once set on a process via prctl(PR_SET_NO_NEW_PRIVS, 1), neither that process nor any of its children can gain privileges through execve()—setuid bits are ignored, capabilities can't be added. The flag is inherited by child processes and cannot be unset.

# Without no_new_privs, sudo can elevate to root (if setuid):
ls -la /usr/bin/sudo

Output:

-rwsr-xr-x 1 root root 232416 Apr 3 14:30 /usr/bin/sudo

# With no_new_privs set, the setuid bit has no effect:
sudo something

Output:

sudo: effective uid is not 0, is sudo installed setuid root?

Kubernetes exposes this via allowPrivilegeEscalation: false.
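For readers who want to see the prctl call itself, here is a minimal Linux-only Python sketch that sets no_new_privs in a child process just before it executes a command. It assumes glibc or musl exposes prctl through ctypes; run_with_no_new_privs is a name made up for this example, and the constant 38 is PR_SET_NO_NEW_PRIVS from linux/prctl.h.

```python
import ctypes
import subprocess

PR_SET_NO_NEW_PRIVS = 38  # value from <linux/prctl.h>


def run_with_no_new_privs(argv):
    """Run argv in a child that sets no_new_privs before exec (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)

    def lock():
        # Runs in the child between fork() and execve().
        if libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS) failed")

    return subprocess.run(argv, preexec_fn=lock, capture_output=True, text=True)
```

Calling run_with_no_new_privs(["grep", "NoNewPrivs", "/proc/self/status"]) should report the flag as 1 for the child, while the parent Python process stays unaffected. Note that preexec_fn is not safe in multi-threaded programs, so treat this as a demonstration rather than production code.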

Namespaces: Isolated Views of the System

Containers are Linux processes with namespace isolation. Namespaces give a process (and its children) an isolated view of a specific system resource. The relevant ones for security:

+---------------------------+---------------------------------------+
| Namespace Type            | What it isolates                      |
+---------------------------+---------------------------------------+
| User namespace            | UID/GID mappings                      |
| Mount namespace           | Filesystem tree (what you can see)    |
| PID namespace             | Process tree (what processes exist)   |
| Network namespace         | Network interfaces, routing, ports    |
| IPC namespace             | Shared memory, message queues         |
+---------------------------+---------------------------------------+

User namespaces are particularly interesting: they allow a process to appear as root (UID 0) inside a namespace while mapping to an unprivileged UID on the host. This is how rootless containers work. By default, Docker and Kubernetes don't enable user namespace remapping—when a container runs as UID 0, that's real UID 0 on the node.
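The UID mapping of a user namespace is readable from /proc/<pid>/uid_map, where each line holds three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The Python sketch below applies such a mapping. The function names are invented for illustration, and the 100000 offset is only an example value, not something taken from a real system.

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    return [tuple(int(x) for x in line.split())
            for line in text.splitlines() if line.strip()]


def to_host_uid(container_uid, mappings):
    """Translate a UID seen inside the namespace to the UID on the host."""
    for inside, outside, length in mappings:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # UID is not mapped in this namespace


# Without remapping, the map is the identity: container root is host root.
identity = parse_uid_map("         0          0 4294967295\n")
# With remapping (example offset), container root becomes an unprivileged UID.
remapped = parse_uid_map("0 100000 65536\n")

print(to_host_uid(0, identity), to_host_uid(0, remapped))
```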

Seeing Containers from the Host

A container is a Linux process—or a tree of processes—running with a restricted view of the system through namespaces. From the host, every container process is visible in the normal process table and fully inspectable through the /proc filesystem.

Finding the container process

When you start a container, the container runtime (containerd or CRI-O) forks a child process. That child runs inside the container’s namespaces, but it still has a real PID in the host’s PID namespace. Run a container and look at the process tree on the host:

docker run -d --name demo alpine sleep 3600
ps -ef --forest | grep -A2 containerd

Output:

root  1234     1  0 10:00 ?  00:00:00 /usr/bin/containerd
root  5678  1234  0 10:01 ?  00:00:00  _ containerd-shim-runc-v2 ...
root  5701  5678  0 10:01 ?  00:00:00      _ sleep 3600

The containerd-shim is an intermediate process that sits between containerd and the container process. It decouples the container’s lifecycle from the daemon—if containerd restarts, the container keeps running because the shim stays alive. The actual container process (sleep 3600) is a direct child of the shim.

Docker exposes the host PID directly:

docker inspect --format '{{.State.Pid}}' demo

Output:

5701

Inspecting namespaces via /proc

/proc is a virtual filesystem the kernel maintains in memory. It exposes the live state of every process as a directory tree under /proc/<pid>/. Nothing is stored on disk—reads go directly to kernel data structures.

The namespace memberships of any process are visible as symlinks under /proc/<pid>/ns/:

ls -la /proc/5701/ns/

Output:

lrwxrwxrwx 1 root root 0 ... cgroup -> cgroup:[4026532739]
lrwxrwxrwx 1 root root 0 ... ipc    -> ipc:[4026532737]
lrwxrwxrwx 1 root root 0 ... mnt    -> mnt:[4026532735]
lrwxrwxrwx 1 root root 0 ... net    -> net:[4026532740]
lrwxrwxrwx 1 root root 0 ... pid    -> pid:[4026532738]
lrwxrwxrwx 1 root root 0 ... pid_for_children -> pid:[4026532738]
lrwxrwxrwx 1 root root 0 ... time   -> time:[4026531834]
lrwxrwxrwx 1 root root 0 ... user   -> user:[4026531837]
lrwxrwxrwx 1 root root 0 ... uts    -> uts:[4026532736]

Each symlink target is a namespace inode number. Two processes sharing the same inode are in the same namespace and see the same resource. A container gets distinct inodes for mount, PID, network, IPC, and UTS—that is what produces the isolated view. Compare the container’s network namespace with the host init process to confirm they differ:

# Container's net namespace
readlink /proc/5701/ns/net

# Host init process's net namespace
readlink /proc/1/ns/net

Output:

net:[4026532740]
net:[4026531840]

Different inodes—different network namespaces.
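The same comparison can be scripted. This Python sketch parses the symlink targets and lists which namespaces two processes share. The helper names are made up for this example; the container values come from the listing above, while the host's user and time values are assumed to match the container's, since those two namespaces are not among the ones described as distinct.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a target such as 'net:[4026532740]' into (type, inode)."""
    match = NS_LINK.match(target)
    if not match:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_ns_links(pid):
    """Read every namespace symlink of a process (Linux, needs permission)."""
    base = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(base, name)) for name in os.listdir(base)}


def shared_namespaces(links_a, links_b):
    """Names of the namespaces that both processes are members of."""
    return sorted(n for n in links_a if links_a[n] == links_b.get(n))


container = {"net": "net:[4026532740]", "user": "user:[4026531837]", "time": "time:[4026531834]"}
host = {"net": "net:[4026531840]", "user": "user:[4026531837]", "time": "time:[4026531834]"}
print(shared_namespaces(container, host))
```

On a real host you would call shared_namespaces(read_ns_links(5701), read_ns_links(1)) as root.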

Accessing the container filesystem via /proc/<pid>/root

Every process entry in /proc has a root symlink pointing to the root of that process’s mount namespace. For a container process, that is the container’s entire filesystem, accessible directly from the host:

PID=$(docker inspect --format '{{.State.Pid}}' demo)
ls /proc/$PID/root/

You can write a file inside the container and verify it is immediately visible through /proc from the host:

# Write a file inside the container
docker exec demo sh -c "echo hello > /tmp/test.txt"

# Read it from the host via /proc without docker exec
cat /proc/$PID/root/tmp/test.txt

Output:

hello

This works because both paths—docker exec and /proc/$PID/root—are views into the same mount namespace. There is no copy or snapshot involved.

This access path works even for distroless or scratch images that contain no shell. When a container image has no debugging tools, /proc/<pid>/root is how you read its filesystem from the host without modifying the image or the running container.

Entering container namespaces with nsenter

nsenter calls the setns(2) syscall to join an existing process’s namespaces. This is what docker exec does internally, exposed as a standalone host tool—useful when the container runtime is unavailable, the container has no shell, or you need to enter only specific namespaces.

Enter all namespaces of the container process:

nsenter --target $PID --mount --uts --ipc --net --pid -- /bin/sh

You can also enter a single namespace. Running a network diagnostic in the container’s network namespace while keeping the host’s mount namespace gives you access to the host’s tools while seeing the container’s network interfaces:

nsenter --target $PID --net -- ss -tulnp

This is useful when the container image has no networking utilities installed.

Testing Linux Primitives Without Kubernetes

Before touching any YAML, you can verify all of this with plain Docker commands. These work on Linux, macOS, and Windows—Docker handles the Linux kernel layer for you.

Running as a non-root user

# Default: runs as root
docker run --rm alpine id

Output:

uid=0(root) gid=0(root) groups=0(root)

Specify a UID

# Specify a UID
docker run --rm --user 1000:1000 alpine id

Output:

uid=1000 gid=1000 groups=1000

Dropping capabilities

In this section we will see what happens when we drop a capability.

We will:

Create the file: touch /tmp/test creates an empty file at /tmp/test.
Change the ownership to UID 1000: chown 1000 /tmp/test changes the owner of that file to UID 1000. This requires CAP_CHOWN, the capability from the table earlier, here used in practice.
Show the result: ls -la /tmp/test shows the file details so you can see who owns it after the chown.

# Default: has cap_chown, can change file ownership
docker run --rm alpine sh -c "touch /tmp/test && chown 1000 /tmp/test && ls -la /tmp/test"

Output:

-rw-r--r-- 1 1000 root 0 Apr 26 09:00 /tmp/test

Now let's drop all capabilities and try to create the file and change its ownership again:

# Drop all caps, then try the same:
docker run --rm --cap-drop=ALL alpine sh -c "touch /tmp/test && chown 1000 /tmp/test"

Output:

chown: /tmp/test: Operation not permitted

Read-only root filesystem

docker run --rm --read-only alpine sh -c "touch /newfile"

Output:

touch: /newfile: Read-only file system

Now let's mount a writable tmpfs at /tmp so the app still has somewhere to write:

# Mount a writable tmpfs at /tmp so the app still has somewhere to write:
docker run --rm --read-only --tmpfs /tmp alpine sh -c "touch /tmp/ok && echo 'wrote to /tmp'"

Output:

wrote to /tmp

The --tmpfs flag mounted a temporary in-memory filesystem at /tmp.

Seccomp

# Block the "unshare" syscall with a custom seccomp profile
cat deny-unshare.json

Output:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["unshare"], "action": "SCMP_ACT_ERRNO" }
  ]
}

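The following runs a container with the custom seccomp profile that blocks the unshare syscall, then immediately tries to use it, showing that seccomp rejects the call. With SCMP_ACT_ERRNO the syscall returns an error instead of killing the process. The -U flag asks for a new user namespace, which is the 0x10000000 (CLONE_NEWUSER) flag in the error below:

docker run --rm --security-opt seccomp=deny-unshare.json alpine unshare -U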

Output:

unshare: unshare(0x10000000): Operation not permitted

How Kubernetes Exposes These Controls

Kubernetes doesn't invent new security primitives—it passes your securityContext configuration to the container runtime (containerd or CRI-O), which calls the appropriate kernel APIs.

How Kubernetes reaches the Linux Kernel

The securityContext field exists at two levels:

Pod level (spec.securityContext): applies to all containers in the pod. Controls like runAsUser, fsGroup, and seccompProfile go here.
Container level (spec.containers[].securityContext): applies to a specific container. Controls like capabilities, readOnlyRootFilesystem, and allowPrivilegeEscalation go here.

If the same field appears at both levels, the container-level setting wins for that container.
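A rough way to picture the precedence rule is a dictionary merge in which container-level values overwrite pod-level ones. The Python sketch below is a mental model only, not how the kubelet is implemented. SHARED_FIELDS lists a few fields that can appear at both levels and is not exhaustive, and effective_context is a hypothetical helper.

```python
# Fields that may be set at both the pod and the container level (not exhaustive).
SHARED_FIELDS = ("runAsUser", "runAsGroup", "runAsNonRoot",
                 "seccompProfile", "seLinuxOptions")


def effective_context(pod_ctx, container_ctx):
    """Container-level settings win; pod-only fields such as fsGroup are not inherited here."""
    effective = {k: v for k, v in pod_ctx.items() if k in SHARED_FIELDS}
    effective.update(container_ctx)
    return effective


pod = {"runAsUser": 1000, "runAsNonRoot": True, "fsGroup": 2000}
container = {"runAsUser": 2000, "readOnlyRootFilesystem": True}
print(effective_context(pod, container))
```

In this example the container runs as UID 2000 while still inheriting runAsNonRoot: true from the pod.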

runAsUser and runAsNonRoot

runAsUser sets the UID the container's main process runs as. runAsNonRoot: true makes the kubelet reject the container at startup if the effective UID would be 0—it doesn't change the UID, it just blocks root.

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000

Important trap: If you set runAsUser at the pod level and your container image uses a different UID (set via USER 1001 in the Dockerfile), Kubernetes overrides the image's setting. The process runs as the UID you specified, not what the image author intended. If that UID can't read the files the app needs, you get permission errors that are hard to debug.

Safer approach: Set runAsNonRoot: true without specifying runAsUser. This enforces that the image doesn't run as root, while respecting whatever UID the image was built to use. Only set runAsUser explicitly if you have a specific reason (e.g., the image has no USER directive and you need to guarantee a particular UID).
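The interaction between runAsNonRoot, runAsUser, and the image's USER directive can be summarized as a small decision function. This Python sketch models the behavior as I understand it, including the case where the image uses a user name that cannot be verified as non-root. The function name and error strings are illustrative and are not the kubelet's actual messages.

```python
def check_run_as_non_root(run_as_non_root, run_as_user, image_user):
    """Return (user, error). image_user is the image's USER value, or None."""
    if run_as_user is not None:
        user = run_as_user          # runAsUser overrides the image
    elif image_user:
        user = image_user           # fall back to the image's USER
    else:
        user = 0                    # no USER directive means root

    if not run_as_non_root:
        return user, None
    if isinstance(user, str) and not user.isdigit():
        return user, "non-numeric user: cannot verify that it is not root"
    if int(user) == 0:
        return 0, "container would run as root"
    return int(user), None


print(check_run_as_non_root(True, None, None))
print(check_run_as_non_root(True, None, "1001"))
```

The first call models a root-built image with only runAsNonRoot: true and is rejected; the second models an image built with USER 1001 and is accepted without any runAsUser.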

# Verify inside a running pod:
kubectl exec -it mypod -- id

Output:

uid=1000 gid=0(root) groups=0(root)

The following reads the kernel-level UID and GID of the pod's main process (PID 1) directly from the Linux process table, bypassing any userspace tools that could lie about it:

kubectl exec -it mypod -- cat /proc/1/status | grep -E "^(Uid|Gid):"

Output:

Uid:    1000    1000    1000    1000
Gid:    0       0       0       0

capabilities: drop and add

In Kubernetes, capability names are written without the CAP_ prefix:

containers:
- name: app
  securityContext:
    capabilities:
      drop: ["ALL"]
      add: ["NET_BIND_SERVICE"]

drop: ["ALL"] removes every capability from the bounding set. Then add restores only what you explicitly list. Drop first, then add back what you need.
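The drop-then-add order can be modelled with plain set arithmetic. In this Python sketch the starting set is the 14 capabilities decoded from a fresh Docker container earlier in this article; the default set of your Kubernetes runtime may differ slightly, and effective_caps is a name invented for the example.

```python
# Default set observed in a fresh Docker container (names without the CAP_ prefix).
DEFAULT_CAPS = {
    "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "SETGID", "SETUID",
    "SETPCAP", "NET_BIND_SERVICE", "NET_RAW", "SYS_CHROOT", "MKNOD",
    "AUDIT_WRITE", "SETFCAP",
}


def effective_caps(default, drop, add):
    """Apply drop first, then add, mirroring the securityContext semantics."""
    caps = set() if "ALL" in drop else set(default) - set(drop)
    return caps | set(add)


print(sorted(effective_caps(DEFAULT_CAPS, drop=["ALL"], add=["NET_BIND_SERVICE"])))
```

Dropping a single capability such as NET_RAW instead of ALL leaves the other 13 in place, which is why dropping ALL and adding back is the tighter pattern.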

Common cases where you need to add a capability back:

NET_BIND_SERVICE: app needs to listen on ports below 1024. Better solution: use port 8080 and let a Service or load balancer handle port 80.
SYS_PTRACE: debugging tools like gdb or strace. Don't include in production.
CHOWN: app needs to change file ownership at runtime. Usually avoidable by setting correct ownership in the Dockerfile.

To verify capabilities are dropped inside a pod:

kubectl exec -it mypod -- cat /proc/1/status | grep Cap

Output:

CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000

All zeroes means no capabilities—this is what you want after dropping all.

allowPrivilegeEscalation

This maps directly to the no_new_privs flag described earlier. Setting it to false locks the process's privilege ceiling: no child process can gain capabilities via setuid binaries or execve().

containers:
- name: app
  securityContext:
    allowPrivilegeEscalation: false

This is safe for almost every production workload. The only cases where you'd need true are workloads that intentionally use setuid binaries—rare in containers.

Note: if you drop all capabilities, privilege escalation is already neutered (there's nothing to escalate to). But setting allowPrivilegeEscalation: false is still worth including as defense in depth—it blocks escalation even if a capability accidentally gets re-added later.

readOnlyRootFilesystem

This mounts the container's root filesystem with the MS_RDONLY flag. The kernel's virtual filesystem layer (VFS) enforces this—any write syscall (write(), mkdir(), unlink(), etc.) to the root filesystem fails with EROFS (Read-only file system).

containers:
- name: app
  securityContext:
    readOnlyRootFilesystem: true
  volumeMounts:
  - name: tmp-dir
    mountPath: /tmp
volumes:
- name: tmp-dir
  emptyDir: {}

If your app writes anywhere (temp files, logs, PID files), you need to mount an emptyDir volume at those paths. Common paths that need writable volumes: /tmp, /var/log, /var/run, /app/logs.

❗In Kubernetes, an emptyDir is not an in-memory volume by default, even though it is temporary. Unlike the Docker --tmpfs mount used earlier, a default emptyDir is:

Written to the node's actual disk
Kept across container restarts within the same pod
Gone when the pod is deleted or moved to another node
Limited by the node's disk space

You can back an emptyDir with memory (tmpfs) instead, but it's less common:

volumes:
- name: tmp
  emptyDir:
    medium: Memory

Start by identifying where your app writes

The easiest way: deploy the container without readOnlyRootFilesystem, exercise all code paths, then check which files were created:

kubectl exec -it mypod -- find / -newer /proc -not -path '/proc/*' -not -path '/sys/*' 2>/dev/null

Output:

/tmp/app.pid
/var/log/app.log

Mount emptyDir volumes at those paths, then enable readOnlyRootFilesystem: true.
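Turning the list of written files into volume definitions is mechanical, so it can be scripted. The Python sketch below groups files by parent directory and produces the volumeMounts and volumes structures as plain dictionaries. The helper name and the volume naming scheme are assumptions made for this example.

```python
import posixpath
import re


def emptydir_mounts(written_files):
    """Build volumeMounts/volumes entries for the directories an app writes to."""
    dirs = sorted({posixpath.dirname(path) for path in written_files})
    mounts, volumes = [], []
    for directory in dirs:
        # Derive a DNS-friendly volume name from the path, e.g. /var/log -> var-log.
        name = re.sub(r"[^a-z0-9]+", "-", directory.lower()).strip("-") or "root"
        mounts.append({"name": name, "mountPath": directory})
        volumes.append({"name": name, "emptyDir": {}})
    return {"volumeMounts": mounts, "volumes": volumes}


print(emptydir_mounts(["/tmp/app.pid", "/var/log/app.log"]))
```

Keep in mind that an emptyDir mounted over a directory hides whatever the image shipped at that path, so review the generated list before applying it.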

seccomp

Seccomp is a Linux kernel security feature that sandboxes processes by restricting which syscalls they can make. Allowing only the syscalls a workload needs removes unnecessary attack surface. Container technologies like Docker and Kubernetes use seccomp to restrict what containers can do.

seccompProfile

Seccomp (Secure Computing Mode) attaches a BPF program to a process that filters syscalls before the kernel executes them. If the process calls a blocked syscall, the kernel returns an error or kills the process with SIGSYS, depending on how the profile is configured.

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

RuntimeDefault uses the container runtime's built-in profile. Containerd's default profile allows syscalls a typical application needs (file I/O, networking, process management) while blocking ones that are dangerous in containers: mount(), ptrace(), reboot(), kexec_load(), create_module(), and others.

You can also write a custom allowlist profile in JSON:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "mmap",
                "mprotect", "munmap", "brk", "rt_sigaction", "rt_sigprocmask",
                "rt_sigreturn", "ioctl", "socket", "connect", "sendto",
                "recvfrom", "bind", "listen", "accept4", "getsockname",
                "setsockopt", "getsockopt", "clone", "execve", "exit_group",
                "futex", "getdents64", "getcwd", "getpid", "getuid", "getgid",
                "geteuid", "getegid", "clock_gettime", "getrandom",
                "epoll_create1", "epoll_ctl", "epoll_wait", "nanosleep",
                "set_tid_address", "set_robust_list", "tgkill", "pipe2",
                "fcntl", "lseek", "newfstatat", "sendmsg", "recvmsg"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Store the profile on every node (e.g., /var/lib/kubelet/seccomp/my-app.json), then reference it:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: my-app.json

For most applications, RuntimeDefault is the right starting point. Custom profiles are for high-security environments where you want to lock down to only the syscalls your specific app uses.
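To reason about what a profile will do before deploying it, you can evaluate it offline. The Python sketch below looks up the action for a syscall name in a profile of the JSON shape used in this article. It is a simplification: real profiles can also match on architectures and syscall arguments, which this ignores, and action_for is a made-up helper name.

```python
import json


def action_for(profile, syscall):
    """Return the action a profile assigns to a syscall name (first match wins)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


denylist = json.loads("""
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["unshare"], "action": "SCMP_ACT_ERRNO" }
  ]
}
""")

allowlist = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write", "openat"], "action": "SCMP_ACT_ALLOW"}],
}

print(action_for(denylist, "unshare"), action_for(allowlist, "mount"))
```

The two sample profiles show the difference in posture: a denylist lets through anything it does not name, while an allowlist rejects anything it does not name.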

fsGroup and Volume Permissions

When a non-root container mounts a Persistent Volume Claim (PVC), the files on that volume are typically owned by root (UID 0). A process running as UID 1000 can't write to a directory owned by root with mode 755.

fsGroup solves this. When you set fsGroup: 2000, the kubelet (running as root on the node) recursively changes the group ownership of all mounted volumes to GID 2000 before starting the container. It also sets the setgid bit on the directory so new files inherit GID 2000.

spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000

Verify this inside the pod:

kubectl exec -it mypod -- ls -la /data

Output:

total 8
drwxrwsr-x 2 root 2000 4096 Apr 26 09:00 .    <-- group=2000, setgid bit
drwxr-xr-x 1 root root   38 Apr 26 09:00 ..

The following shows what the process inside the pod is running as:

kubectl exec -it mypod -- id

Output:

uid=1000 gid=3000 groups=3000,2000 <-- 2000 added as supplemental group

fsGroupChangePolicy

The recursive chown on large volumes with thousands of files takes time and delays pod startup. fsGroupChangePolicy: OnRootMismatch tells the kubelet to skip the chown if the volume's root directory already has the correct group ownership:

spec:
  securityContext:
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch

This is safe if you control the volume lifecycle. For shared volumes reused across pods with different fsGroup values, use Always (the default) to be safe.
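The ownership pass and the OnRootMismatch shortcut can be approximated in Python. This sketch reflects my understanding of the behavior (group read/write added everywhere, execute and setgid added on directories) and is not the kubelet's code. All function names are hypothetical, and apply_fsgroup needs enough privilege to chown to the target group.

```python
import os
import stat


def fsgroup_mode(mode, is_dir):
    """Permission bits after the pass: add rw for owner/group, plus x and setgid on dirs."""
    mode |= 0o660
    if is_dir:
        mode |= 0o110 | stat.S_ISGID
    return mode


def needs_recursive_chown(root_gid, root_mode, fs_group, policy="Always"):
    """OnRootMismatch skips the walk when the volume root already looks right."""
    if policy != "OnRootMismatch":
        return True
    return root_gid != fs_group or not (root_mode & stat.S_ISGID)


def apply_fsgroup(root, fs_group, policy="Always"):
    """Walk the volume and fix group ownership; returns the number of entries changed."""
    st = os.stat(root)
    if not needs_recursive_chown(st.st_gid, st.st_mode, fs_group, policy):
        return 0
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            entry = os.lstat(path)
            if stat.S_ISLNK(entry.st_mode):
                continue  # leave symlinks alone
            os.chown(path, -1, fs_group)  # -1 keeps the owner unchanged
            is_dir = stat.S_ISDIR(entry.st_mode)
            os.chmod(path, fsgroup_mode(stat.S_IMODE(entry.st_mode), is_dir))
            changed += 1
    return changed


print(oct(fsgroup_mode(0o755, is_dir=True)))
```

A directory that starts at mode 755 ends up as 2775, which is the drwxrwsr-x shown in the listing earlier in this section.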

fsGroup and CSI Volumes

Not all storage drivers support the fsGroup chown behavior. CSI drivers declare their support via the FSGroupPolicy field in the CSIDriver object:

None: The driver won't modify volume permissions. Handle ownership in your image or an init container.
File: The driver supports recursive chown (same as in-tree volumes).
ReadWriteOnceWithFSType: Chown only for ReadWriteOnce volumes when fsType is set.

kubectl get csidriver ebs.csi.aws.com -o jsonpath='{.spec.fsGroupPolicy}'

Output:

ReadWriteOnceWithFSType

If your CSI driver has FSGroupPolicy: None, setting fsGroup in the pod spec won't change anything on the volume.

Test Pod: Verifying Every Setting

Here's a complete pod manifest to verify each security control hands-on.

# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-test
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: test
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: data
      mountPath: /data
  volumes:
  - name: tmp
    emptyDir: {}
  - name: data
    emptyDir: {}

kubectl apply -f test-pod.yaml
kubectl wait --for=condition=Ready pod/security-test

Verification commands

1. Check UID/GID:

kubectl exec security-test -- id

Output:

uid=1000 gid=3000 groups=3000,2000

2. Verify no capabilities:

kubectl exec security-test -- cat /proc/1/status | grep Cap

Output:

CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000

3. Root filesystem is read-only:

kubectl exec security-test -- touch /newfile

Output:

touch: /newfile: Read-only file system

4. /tmp is writable:

kubectl exec security-test -- sh -c "touch /tmp/ok && ls /tmp/"

Output:

ok

5. fsGroup on the mounted volume:

kubectl exec security-test -- ls -la /data

Output:

drwxrwsr-x 2 root 2000 4096 ...

6. no_new_privs is set:

kubectl exec security-test -- cat /proc/1/status | grep NoNewPrivs

Output:

NoNewPrivs: 1

7. Seccomp filter is active:

kubectl exec security-test -- cat /proc/1/status | grep Seccomp

Output:

Seccomp: 2

A value of 2 means a BPF filter is active (0 = off, 1 = strict mode, 2 = filter mode).
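The individual checks above can be folded into one script that reads the content of /proc/1/status and reports anything that looks unhardened. This is a minimal Python sketch with thresholds chosen to match the test pod in this article (non-root, no capabilities, no_new_privs set, seccomp in filter mode). The function names are invented here.

```python
def parse_status(text):
    """Turn 'Key:\\tvalue' lines from /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hardening_problems(status_text):
    """Return a list of findings; an empty list means every check passed."""
    fields = parse_status(status_text)
    problems = []
    uids = fields.get("Uid", "0 0 0 0").split()
    if uids[1] == "0":  # second column is the effective UID
        problems.append("effective UID is 0")
    for key in ("CapPrm", "CapEff", "CapBnd"):
        if int(fields.get(key, "0"), 16) != 0:
            problems.append(f"{key} is not empty")
    if fields.get("NoNewPrivs") != "1":
        problems.append("no_new_privs is not set")
    if fields.get("Seccomp") != "2":
        problems.append("seccomp is not in filter mode")
    return problems


hardened = (
    "Uid:\t1000\t1000\t1000\t1000\nGid:\t3000\t3000\t3000\t3000\n"
    "NoNewPrivs:\t1\nSeccomp:\t2\n"
    "CapPrm:\t0000000000000000\nCapEff:\t0000000000000000\n"
    "CapBnd:\t0000000000000000\n"
)
print(hardening_problems(hardened))
```

One way to use it is to capture the output of kubectl exec security-test -- cat /proc/1/status and pass that text to hardening_problems.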

Confirm blocked operations actually fail

# Should fail: no CAP_CHOWN
kubectl exec security-test -- chown 0 /tmp/ok

Output:

chown: /tmp/ok: Operation not permitted

# Should fail: no CAP_NET_BIND_SERVICE
kubectl exec security-test -- nc -l -p 80

Output:

nc: bind: Permission denied

Full Production Configuration

Even if the image runs as root by default (the Dockerfile doesn't specify a user at all), we can still make it run as non-root.

Assume we have the following Dockerfile:

FROM python:3.12-slim

WORKDIR /app

# Install app — everything owned by root, world-readable (default)
COPY app.py .

# No USER directive → process will run as root (UID 0) by default
EXPOSE 8080

ENV PYTHONDONTWRITEBYTECODE=1

CMD ["python", "app.py"]

Here's a complete Deployment and Service combining everything:

# No Dockerfile change needed — Kubernetes overrides the running UID.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app-secure
  labels:
    app: demo-app
    variant: secure
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
      variant: secure
  template:
    metadata:
      labels:
        app: demo-app
        variant: secure
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000        # Override Dockerfile default (root) — no USER directive means we must set this
        runAsGroup: 3000
        fsGroup: 2000
        fsGroupChangePolicy: OnRootMismatch
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: demo-app:root   # Same root-built image!
        ports:
        - containerPort: 8080
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: logs
          mountPath: /var/log/app
      volumes:
      - name: tmp
        emptyDir: {}
      - name: logs
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app-secure
spec:
  selector:
    app: demo-app
    variant: secure
  ports:
  - port: 8080
    targetPort: 8080

Each setting reduces a distinct attack surface:

Control                          Effect
-------------------------------- ------------------------------------------------------------
runAsNonRoot: true               Refuses to start the container if its UID would be 0
capabilities.drop: ["ALL"]       Strips all Linux capabilities, so no privileged ops are possible
allowPrivilegeEscalation: false  Sets no_new_privs: blocks privilege gain via setuid
                                 binaries across execve()
readOnlyRootFilesystem: true     Mounts the container filesystem read-only, blocking all
                                 runtime writes to the root filesystem
seccompProfile: RuntimeDefault   BPF filter blocking ~50 dangerous syscalls
                                 (mount, ptrace, reboot, kexec_load, ...)
fsGroup: 2000                    kubelet chowns mounted volumes to GID 2000 before the
                                 container starts, granting non-root processes write access

Apply them incrementally: start with runAsNonRoot and allowPrivilegeEscalation: false (these almost never break anything), then add capabilities.drop: ["ALL"] and test, then readOnlyRootFilesystem: true with the appropriate emptyDir mounts, and finally seccompProfile: RuntimeDefault. One layer at a time is how you reach a secure-by-default baseline without breaking production.
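While rolling the layers out one by one, it helps to see which of them a given manifest already carries. The following is a rough sketch, not a real policy check: `check_hardening` is a hypothetical helper that does plain text matching, so it cannot tell which container a setting belongs to and it misses equivalent spellings such as a multi-line `drop:` list.

```shell
# check_hardening FILE: report which hardening settings appear in a manifest.
# Returns the number of missing settings, so 0 means all five were found.
check_hardening() {
  file=$1
  missing=0
  for pattern in \
    'runAsNonRoot: true' \
    'allowPrivilegeEscalation: false' \
    'drop: ["ALL"]' \
    'readOnlyRootFilesystem: true' \
    'type: RuntimeDefault'
  do
    # -F treats the pattern as a fixed string, so [ and ] need no escaping.
    if grep -qF -- "$pattern" "$file"; then
      echo "ok      $pattern"
    else
      echo "MISSING $pattern"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}
```

Pointed at a file containing the Deployment above, it should list every setting as ok; an admission controller or policy engine remains the proper tool for enforcing these settings cluster-wide.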

Summary

Containers are not virtual machines. They are Linux processes with restricted vision, isolated using namespaces and controlled by the kernel the same way any other process is.

Everything Kubernetes exposes in securityContext maps directly to a Linux primitive underneath:

runAsUser and runAsNonRoot control which UID the process runs as, keeping it away from UID 0, which bypasses most kernel permission checks.

capabilities.drop: ["ALL"] strips the default set of ~14 capabilities Docker grants containers, removing powers like CAP_CHOWN, CAP_NET_RAW, and CAP_SYS_CHROOT that typical applications never need.

allowPrivilegeEscalation: false sets the no_new_privs kernel flag, blocking privilege gain through setuid binaries across execve(), even if a capability accidentally gets re-added later.

readOnlyRootFilesystem: true mounts the container's filesystem read-only at the kernel level, stopping an attacker from writing malware, modifying binaries, or leaving backdoors — even with code execution inside the container.

seccompProfile: RuntimeDefault attaches a BPF filter that blocks around 50 dangerous syscalls, such as mount, ptrace, reboot, and kexec_load, at syscall entry, before the kernel acts on them.

fsGroup lets non-root containers write to mounted volumes by having the kubelet chown the volume to a specific GID before the container starts.

Apply these controls incrementally. Start with runAsNonRoot and allowPrivilegeEscalation: false, then add capability drops, then readOnlyRootFilesystem, then seccomp. Each layer independently reduces attack surface, and together they form a secure-by-default baseline for production workloads.

If you want to go deeper on Linux and Kubernetes security, a dedicated rapid course is currently in the making. It will cover these concepts hands-on from the ground up.
