Your Node.js container runs as root. You know this because your Dockerfile says FROM node:20-slim and you never added a USER directive. The process runs with uid 0 inside the container, which means that if an attacker gets RCE through a vulnerability in express, lodash, or any of the other 1,200 packages in node_modules, they have full root privileges in the container. From there, a kernel exploit or a misconfigured seccomp profile is all that separates them from the host: host access is one CVE away.
The Dockerfile that ships with half the tutorials on the internet looks exactly like this:
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
No non-root user. No capability drops. No read-only filesystem. No seccomp. It builds, it runs, it passes every smoke test. And it is one curl command from a container breakout that exposes the host.
This post covers the exact four things you need to harden a Node.js container: dropping Linux capabilities, running as a non-root user, mounting the root filesystem read-only, and applying a seccomp profile. Every step is deployable today, compatible with Docker and Kubernetes, and breaks nothing if you account for the side effects.
A common misconception is that Docker containers run in a sandbox and that root inside a container is somehow less powerful than root on the host. That is partially true and dangerously misleading.
Docker applies a default seccomp profile and drops some Linux capabilities. But the default set of capabilities Docker keeps is generous. A node:20-slim container running as root has the following capabilities by default:
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD, CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_AUDIT_WRITE, CAP_KILL
That is fourteen capabilities, including CAP_DAC_OVERRIDE (bypass file permission checks), CAP_NET_RAW (raw socket access for ARP spoofing), and CAP_SYS_CHROOT (chroot escapes). If an attacker compromises your Node.js process, they inherit all of these.
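To see which of these capabilities a running process actually holds, one option is to decode the CapEff bitmask that Linux exposes in /proc/self/status. The following is a minimal sketch rather than a definitive tool: decodeCaps, currentCaps, and the CAP_NAMES table are illustrative names introduced here, and the bit order is assumed to follow linux/capability.h.

```javascript
const fs = require('node:fs');

// Capability names indexed by bit number (assumed to match linux/capability.h).
const CAP_NAMES = [
  'CAP_CHOWN', 'CAP_DAC_OVERRIDE', 'CAP_DAC_READ_SEARCH', 'CAP_FOWNER',
  'CAP_FSETID', 'CAP_KILL', 'CAP_SETGID', 'CAP_SETUID', 'CAP_SETPCAP',
  'CAP_LINUX_IMMUTABLE', 'CAP_NET_BIND_SERVICE', 'CAP_NET_BROADCAST',
  'CAP_NET_ADMIN', 'CAP_NET_RAW', 'CAP_IPC_LOCK', 'CAP_IPC_OWNER',
  'CAP_SYS_MODULE', 'CAP_SYS_RAWIO', 'CAP_SYS_CHROOT', 'CAP_SYS_PTRACE',
  'CAP_SYS_PACCT', 'CAP_SYS_ADMIN', 'CAP_SYS_BOOT', 'CAP_SYS_NICE',
  'CAP_SYS_RESOURCE', 'CAP_SYS_TIME', 'CAP_SYS_TTY_CONFIG', 'CAP_MKNOD',
  'CAP_LEASE', 'CAP_AUDIT_WRITE', 'CAP_AUDIT_CONTROL', 'CAP_SETFCAP',
  'CAP_MAC_OVERRIDE', 'CAP_MAC_ADMIN', 'CAP_SYSLOG', 'CAP_WAKE_ALARM',
  'CAP_BLOCK_SUSPEND', 'CAP_AUDIT_READ', 'CAP_PERFMON', 'CAP_BPF',
  'CAP_CHECKPOINT_RESTORE',
];

// Turn a hex mask such as "00000000a80425fb" into a list of capability names.
function decodeCaps(hexMask) {
  const mask = BigInt('0x' + hexMask.trim());
  return CAP_NAMES.filter((_, bit) => ((mask >> BigInt(bit)) & 1n) === 1n);
}

// Read the effective capability mask of the current process (Linux only).
function currentCaps(statusPath = '/proc/self/status') {
  const line = fs.readFileSync(statusPath, 'utf8')
    .split('\n')
    .find((l) => l.startsWith('CapEff:'));
  return line ? decodeCaps(line.split(':')[1]) : [];
}

if (fs.existsSync('/proc/self/status')) {
  console.log('Effective capabilities:', currentCaps().join(', ') || '(none)');
}
```

Running this inside the container before and after hardening gives a quick way to confirm that the effective set really shrinks to nothing.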
The attack chain looks like this:
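1. The attacker gains code execution in the Node.js process through a vulnerable dependency.
2. Running as root with CAP_DAC_OVERRIDE, they can write to /usr, /sbin, or anywhere else in the container.
3. They combine CAP_SYS_CHROOT and a mounted /proc to try to escape the container namespace.

Every step of this chain is blocked by the hardening techniques below.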
The first and easiest hardening step is to drop all capabilities and only add back the ones your application actually needs.
For a typical Node.js HTTP server, the only capability you need is CAP_NET_BIND_SERVICE if you want to bind to a privileged port (under 1024). If your application listens on port 3000 or above (which it should), you do not even need that.
Docker Compose:
services:
  app:
    build: .
    cap_drop:
      - ALL
    cap_add: []
    ports:
      - "3000:3000"
Docker run (add NET_BIND_SERVICE back only if you really bind to a port below 1024; otherwise omit --cap-add):
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-app
But wait. If you test --cap-drop=ALL on a Node.js container running as root, you might see something unexpected. Node.js’s fs module uses uv_fs_open() which, under the hood, calls openat(). Without CAP_DAC_OVERRIDE, the kernel enforces the file’s permission bits strictly. If your application writes to a log file or uploads a file, the uid and gid of the running process must have write permission on the target directory. This is not a capability issue but a permissions issue, which the next step solves.
The key insight: capability drops are free. They add zero runtime overhead, they require no code changes, and they block entire classes of kernel-level exploits. There is no reason not to drop ALL and add back only what you need.
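This is the single highest-impact change you can make. A process running as uid 1000 inside the container cannot write to /usr/bin, cannot modify /etc/passwd, and cannot call chroot. Strictly speaking, the kernel checks capabilities rather than the uid, but when the container runtime switches from uid 0 to a non-zero uid, the permitted and effective capability sets of the process are cleared. Unless you explicitly grant ambient or file capabilities, a non-root process therefore holds no capabilities at all, regardless of what the container's bounding set allows.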
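The Dockerfile change is only a few lines. The official node images already ship a node user and group with uid and gid 1000, so creating a second account with the same ids fails at build time. The version below renames the existing account instead:
FROM node:20-slim
# The image already has a "node" user and group (uid/gid 1000).
# Rename them to appuser instead of creating a conflicting account.
RUN groupmod --new-name appuser node && \
    usermod --login appuser node
WORKDIR /app
COPY package*.json ./
# Install production dependencies as root, then hand /app to the app user
RUN npm ci --omit=dev && \
    chown -R appuser:appuser /app
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]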
If you are using Alpine-based images (node:20-alpine), the commands are different because Alpine uses BusyBox:
FROM node:20-alpine
# Remove the built-in "node" account (uid/gid 1000) so the ids are free,
# then create appuser with BusyBox's addgroup/adduser.
RUN deluser --remove-home node && \
    addgroup -S -g 1000 appuser && \
    adduser -S -u 1000 -G appuser appuser
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev && \
    chown -R appuser:appuser /app
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 3000
CMD ["node", "server.js"]
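The uid 1000 is arbitrary but conventional, and it is also the uid of the node user that ships in the official images. Any uid of 1000 or above works. Do not use uids below 1000 for application processes, since distributions reserve that range for system accounts.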
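As a belt-and-braces measure, the application itself can refuse to start as root, so a missing USER directive is caught immediately instead of going unnoticed. This is a minimal sketch: assertNotRoot is a hypothetical helper name, and the uid parameter exists only so the check can be exercised without actually running as root.

```javascript
// Throw if the given uid is 0 (root). By default the uid of the current
// process is used; process.getuid does not exist on Windows, hence the guard.
function assertNotRoot(
  uid = typeof process.getuid === 'function' ? process.getuid() : undefined
) {
  if (uid === 0) {
    throw new Error('Refusing to run as root (uid 0); add a USER directive to the Dockerfile');
  }
  return uid;
}

// Usage sketch: call this as the first statement of server.js.
// assertNotRoot();
```

The check costs nothing at runtime and turns a silent misconfiguration into a startup failure that shows up in the first deploy.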
What breaks when you switch to a non-root user?
Anything that writes to filesystem paths controlled by root. The most common issues:
- /var/log. Your application cannot create files there. Write logs to stdout/stderr (which you should be doing anyway for containerized apps) or to a directory under /app that has the right ownership.
- /var/run. If you use Unix domain sockets, create the socket in a directory owned by the app user.
- npm install with lifecycle scripts. Some npm packages run postinstall scripts that need to write to protected paths. If you npm install as the appuser, those scripts fail. Always run npm ci during the build (as root or with a temporary build user) and copy the result.

Once you switch to a non-root user and drop all capabilities, your container is dramatically harder to exploit.
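Logging to stdout and stderr, as recommended in the list above, needs no writable directory at all. The sketch below shows one way to do it with nothing but the standard library; formatLog and log are illustrative names, and the JSON field layout is an assumption rather than a required format.

```javascript
// Format one log entry as a single JSON line.
function formatLog(level, message, fields = {}) {
  return JSON.stringify({
    time: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

// Errors go to stderr, everything else to stdout; the container runtime
// collects both streams, so no log file is ever opened.
function log(level, message, fields) {
  const stream = level === 'error' ? process.stderr : process.stdout;
  stream.write(formatLog(level, message, fields) + '\n');
}

log('info', 'server started', { port: 3000 });
```

Because each entry is one line of JSON, docker logs and kubectl logs can hand the output to whatever collector you already use.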
A read-only root filesystem means the process cannot write to any path on the root filesystem, period. Combined with a non-root user, this closes the entire class of binary-overwrite and configuration-tampering attacks.
Docker:
docker run --read-only --tmpfs /tmp --tmpfs /app/data my-app
Docker Compose:
services:
  app:
    build: .
    read_only: true
    tmpfs:
      - /tmp
      - /app/data
    cap_drop:
      - ALL
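Kubernetes (readOnlyRootFilesystem is a container-level field, so it goes under the container's securityContext rather than the pod's):
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: app
      image: my-app:latest
      securityContext:
        readOnlyRootFilesystem: true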
The --read-only flag mounts the root filesystem of the container read-only. Writes to /tmp and /app/data go to in-memory tmpfs mounts instead. Nothing written there survives a container restart, which is fine because containers are ephemeral.
What needs a writable path that is not /tmp?
Node.js itself writes to a few paths at runtime:
- If $XDG_CACHE_HOME is not writable, Node.js skips the cache. The performance impact is negligible.
- If your application runs npm commands at runtime (which it should not), the npm cache directory needs to be writable. Set npm config set cache /tmp/.npm in your Dockerfile.
- Packages such as sharp (image processing), puppeteer (headless Chrome), and node-gyp (native compilation) write to /tmp. As long as /tmp is mounted as tmpfs, they work fine.
- Anything else that must be written at runtime needs a dedicated tmpfs mount or a PersistentVolumeClaim in Kubernetes.

The rule is simple: everything under / is read-only. Anything that needs writes goes to /tmp or a named volume.
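A read-only root filesystem tends to fail late, on the first write that a request happens to trigger. One way to surface the problem at startup is to probe the directories the application expects to write to. The following is a sketch under that assumption; isWritable and findUnwritable are hypothetical helper names, and the directory list is only an example.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Try to create and remove a small file in dir; return true if both succeed.
function isWritable(dir) {
  const probe = path.join(dir, `.write-probe-${process.pid}`);
  try {
    fs.writeFileSync(probe, 'ok');
    fs.unlinkSync(probe);
    return true;
  } catch {
    return false;
  }
}

// Return the subset of dirs that cannot be written to.
function findUnwritable(dirs) {
  return dirs.filter((dir) => !isWritable(dir));
}

// Example list: the temp directory plus whatever your app writes to.
const missing = findUnwritable([os.tmpdir()]);
if (missing.length > 0) {
  console.error('Not writable:', missing.join(', '));
}
```

In the setup described here you would typically pass /tmp and /app/data and exit with a non-zero code when the list is not empty.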
seccomp (secure computing mode) restricts the system calls a process can make. Docker ships with a default seccomp profile that blocks around 44 of the more than 300 available syscalls (such as mount, reboot, and swapon). That default is deliberately permissive enough to run most applications without issues. You can tighten it.
A custom seccomp profile for a Node.js application should block syscalls that are never used by a JavaScript runtime: mount, umount2, ptrace, perf_event_open, bpf, kexec_file_load, swapon, swapoff, create_module, init_module, finit_module, delete_module.
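Here is an allowlist-style seccomp profile (anything not listed fails with an error) that is stricter than the Docker default but is intended to let Node.js run normally. Treat it as a starting point and test it against your own workload, since native addons may need additional syscalls: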
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "arch_prctl", "bind",
        "brk", "capget", "capset", "chdir", "chmod", "chown",
        "clock_getres", "clock_gettime", "clock_nanosleep",
        "clone", "clone3", "close", "connect", "copy_file_range",
        "creat", "dup", "dup2", "dup3", "epoll_create1",
        "epoll_ctl", "epoll_pwait", "eventfd2", "execve",
        "exit", "exit_group", "faccessat2", "fadvise64",
        "fallocate", "fchdir", "fchmod", "fchmodat", "fchown",
        "fchownat", "fcntl", "fdatasync", "fgetxattr",
        "flistxattr", "flock", "fork", "fremovexattr",
        "fsetxattr", "fstat", "fstatfs", "fsync", "ftruncate",
        "futex", "getcwd", "getdents64", "getegid", "geteuid",
        "getgid", "getpeername", "getpgid", "getpgrp",
        "getpid", "getppid", "getpriority", "getrandom",
        "getresgid", "getresuid", "getrlimit", "getrusage",
        "getsockname", "getsockopt", "gettid", "gettimeofday",
        "getuid", "getxattr", "inotify_add_watch",
        "inotify_init1", "inotify_rm_watch", "ioctl",
        "ioprio_get", "ioprio_set", "kcmp", "kill",
        "lgetxattr", "link", "linkat", "listen", "listxattr",
        "llistxattr", "lremovexattr", "lseek", "lsetxattr",
        "lstat", "madvise", "mbind", "memfd_create",
        "membarrier", "mincore", "mkdir", "mkdirat",
        "mlock", "mlock2", "mmap", "mprotect", "mremap",
        "msgctl", "msgget", "msgrcv", "msgsnd",
        "msync", "munlock", "munmap", "name_to_handle_at",
        "nanosleep", "newfstatat", "open", "openat",
        "openat2", "pause", "pidfd_getfd", "pidfd_open",
        "pidfd_send_signal", "pipe", "pipe2", "poll",
        "ppoll", "prctl", "pread64", "preadv", "preadv2",
        "prlimit64", "process_vm_readv", "pselect6",
        "pwrite64", "pwritev", "pwritev2", "read",
        "readlink", "readlinkat", "readv", "recvfrom",
        "recvmmsg", "recvmsg", "rename", "renameat",
        "renameat2", "restart_syscall", "rmdir", "rseq",
        "rt_sigaction", "rt_sigpending", "rt_sigprocmask",
        "rt_sigqueueinfo", "rt_sigreturn", "rt_sigsuspend",
        "rt_sigtimedwait", "sched_getaffinity",
        "sched_getattr", "sched_getparam", "sched_getscheduler",
        "sched_rr_get_interval", "sched_setaffinity",
        "sched_setattr", "sched_setparam", "sched_setscheduler",
        "sched_yield", "seccomp", "select", "semctl",
        "semget", "semop", "semtimedop", "sendfile",
        "sendmmsg", "sendmsg", "sendto",
        "set_robust_list", "set_tid_address",
        "setgid", "setgroups", "setitimer",
        "setpgid", "setpriority", "setregid", "setresgid",
        "setresuid", "setreuid", "setrlimit", "setsid",
        "setsockopt", "setuid", "shmctl", "shmdt",
        "shmget", "shutdown", "sigaltstack", "signalfd4",
        "socket", "socketpair", "splice", "stat", "statfs",
        "statx", "symlink", "symlinkat", "sync",
        "sync_file_range", "sysinfo", "tee", "tgkill",
        "time", "timer_create", "timer_delete",
        "timer_getoverrun", "timer_gettime", "timer_settime",
        "timerfd_create", "timerfd_gettime", "timerfd_settime",
        "tkill", "truncate", "umask", "uname", "unlink",
        "unlinkat", "utimensat", "utimes",
        "vfork", "vmsplice", "wait4", "waitid", "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Save this as node-seccomp.json and apply it:
docker run --security-opt seccomp=node-seccomp.json my-app
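In Kubernetes, a custom profile is referenced through securityContext.seccompProfile with type: Localhost and a localhostProfile path relative to the kubelet's seccomp directory, which means the file has to exist on every node. (PodSecurityPolicy, which older guides mention, was removed in Kubernetes 1.25.) The simplest approach is to use the runtime's default profile (type: RuntimeDefault) and tighten capabilities instead, since custom seccomp profiles are harder to manage across a cluster.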
Here is the complete hardened Dockerfile that combines every technique above:
FROM node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
FROM node:20-slim
# Install tini (for proper signal handling) while still running as root
RUN apt-get update && apt-get install -y --no-install-recommends tini && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
# Rename the image's built-in node user and group (uid/gid 1000) to appuser
RUN groupmod --new-name appuser node && \
    usermod --login appuser node
WORKDIR /app
COPY --from=builder --chown=appuser:appuser /app/node_modules ./node_modules
COPY --chown=appuser:appuser . .
USER appuser
EXPOSE 3000
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["node", "server.js"]
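tini forwards signals such as SIGTERM to the Node.js process, but the application still has to react to them for a clean shutdown. The following is a minimal sketch of that handler, assuming a standard http.Server; installShutdown is a hypothetical helper name, and the proc parameter is there only so the logic can be tested without killing the real process.

```javascript
// Register SIGTERM and SIGINT handlers that stop accepting new connections
// and exit with code 0 once the server has closed.
function installShutdown(server, proc = process) {
  const shutdown = () => {
    server.close(() => proc.exit(0));
  };
  proc.on('SIGTERM', shutdown);
  proc.on('SIGINT', shutdown);
  return shutdown;
}

// Usage sketch for server.js:
// const http = require('node:http');
// const server = http.createServer((req, res) => res.end('ok'));
// installShutdown(server);
// server.listen(3000);
```

Note that server.close waits for open connections to finish, so in production you may also want a timeout that forces the exit.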
And the corresponding docker-compose.yml:
version: '3.8'
services:
  app:
    build: .
    user: "1000:1000"
    cap_drop:
      - ALL
    cap_add: []
    read_only: true
    tmpfs:
      - /tmp
      - /app/data
    security_opt:
      - no-new-privileges:true
      - seccomp:node-seccomp.json
    ports:
      - "3000:3000"
In Kubernetes, all of these settings go into the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: node-app
  labels:
    app: node-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: my-app:latest
      ports:
        - containerPort: 3000
      securityContext:
        allowPrivilegeEscalation: false
        privileged: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
          add: []
        runAsNonRoot: true
        runAsUser: 1000
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: data
          mountPath: /app/data
  volumes:
    - name: tmp
      emptyDir:
        medium: Memory
    - name: data
      emptyDir:
        medium: Memory
The allowPrivilegeEscalation: false flag is critical. It sets the no_new_privs bit on the process, which prevents it, and anything it executes, from gaining additional privileges through setuid binaries or executables with file capabilities. Combined with runAsNonRoot: true, this means that even if an attacker manages to plant a setuid-root binary, the kernel ignores the setuid bit when that binary is executed.
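Whether the bit is actually set can be read from the NoNewPrivs field in /proc/self/status on Linux. The sketch below parses that file from inside the Node.js process; parseProcStatus and hasNoNewPrivs are illustrative names, and the "Key:<tab>value" line layout is the only assumption made about the file.

```javascript
const fs = require('node:fs');

// Parse the "Key:\tvalue" lines of a /proc/<pid>/status file into an object.
function parseProcStatus(text) {
  const fields = {};
  for (const line of text.split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) {
      fields[line.slice(0, idx)] = line.slice(idx + 1).trim();
    }
  }
  return fields;
}

// True only when the status text reports NoNewPrivs as 1.
function hasNoNewPrivs(statusText) {
  return parseProcStatus(statusText).NoNewPrivs === '1';
}

if (fs.existsSync('/proc/self/status')) {
  const status = fs.readFileSync('/proc/self/status', 'utf8');
  console.log('no_new_privs set:', hasNoNewPrivs(status));
}
```

Logging this once at startup makes it obvious in the pod logs when a deployment was applied without allowPrivilegeEscalation: false.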
A quick smoke test to verify that the container is not running as root and that the other restrictions are in effect:
# Verify the user inside the container
docker run --rm --cap-drop=ALL --read-only my-app id
# Expected output: uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
# Verify you cannot write anywhere outside /tmp
docker run --rm --cap-drop=ALL --read-only my-app touch /test.txt
# Expected output: touch: cannot touch '/test.txt': Read-only file system
# Verify privilege escalation is blocked (the no_new_privs bit is set)
docker run --rm --security-opt no-new-privileges:true my-app \
  grep NoNewPrivs /proc/self/status
# Expected output: NoNewPrivs: 1 (the separator is a tab)
In your CI pipeline, add a step that runs these checks after the image build:
# GitHub Actions
- name: Security smoke test
  run: |
    docker run --rm --read-only --cap-drop=ALL \
      my-app node -e "process.exit(0)"
    echo "Container runs with read-only root FS and dropped capabilities"
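A CI step can also inspect how a container was actually started rather than only checking that it runs. The sketch below evaluates one entry of docker inspect output. checkContainer is a hypothetical helper, and the field names it reads (Config.User, HostConfig.ReadonlyRootfs, HostConfig.CapDrop, HostConfig.SecurityOpt) are assumed to match what your Docker version emits, so verify them against a real inspect dump first.

```javascript
// Return a list of hardening failures for one `docker inspect` entry.
function checkContainer(info) {
  const host = info.HostConfig ?? {};
  const user = (info.Config ?? {}).User ?? '';
  const failures = [];

  // An empty user means the image default, which is root unless USER is set.
  const name = user.split(':')[0];
  if (name === '' || name === '0' || name === 'root') {
    failures.push('runs as root');
  }
  if (!host.ReadonlyRootfs) {
    failures.push('root filesystem is writable');
  }
  if (!(host.CapDrop ?? []).some((cap) => cap.toUpperCase() === 'ALL')) {
    failures.push('capabilities are not dropped');
  }
  if (!(host.SecurityOpt ?? []).some((opt) => /^no-new-privileges([:=]true)?$/.test(opt))) {
    failures.push('no-new-privileges is not set');
  }
  return failures;
}
```

In a pipeline you would parse the JSON array printed by docker inspect, pass its first element to checkContainer, and fail the job when the returned list is not empty.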
What about npm audit and image scanning?
The container hardening in this post is about runtime security: what happens after the container starts. It is complementary to image-level scanning (Trivy, Grype, Snyk) that checks for known CVEs in your base image and dependencies. You need both.
A container that passes every CVE scan can still be exploited if the process runs as root with too many capabilities. And a hardened container running as non-root with read-only filesystem can still be exploited if a dependency has a deserialization vulnerability. Layer the defenses.
Container security is easy to defer until after a breach, and nearly impossible to retrofit without breaking something if you do not plan for it from the start. The four layers covered here (non-root user, capability drops, read-only filesystem, seccomp) cost nothing to implement and require no architectural changes if applied during initial setup. Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK, and their teams regularly design and deploy Node.js services on AWS, Azure, and Google Cloud with the kind of security-first container posture that makes platform engineers breathe a little easier during incident calls.