barco: Linux Containers From Scratch in C.

Luca Cavallin

AI Engineering for Developers | Blog AI Engineering for Developers Platform Engineering End-to-End | Blog Google Cloud Networking 101: The Comprehensive TLDR | Blog Google Cloud Networking 101: The Comprehensive TLDR Containers Are Not Automatically Secure | Blog Containers Are Not Automatically Secure Watery Stone Beacon | Photography Blue Iceman Suture | Photography Hidden Emerald Pool | Photography Autumn Chapel Pinnacles | Photography A Tour of eBPF in the Linux Kernel: Observability, Security and Networking | Blog A Tour of eBPF in the Linux Kernel: Observability, Security and Networking Shared Violet Pulse | Photography Kubernetes Networking from Packets to Pods | Blog An Overview of Network Protocols | Blog An Overview of Network Protocols A Quick Journey Into the Linux Kernel | Blog A Quick Journey Into the Linux Kernel OpenTelemetry: A Guide to Observability with Go | Blog I'm on the Cillers Podcast Talking About Tech and Hackathons | Blog Yet Another List of Random Opinions on Writing Readable Code and Other Rants | Blog My post about Istio is now on the Istio blog too! | Blog Tropical Jungle Escape | Photography The Istio Service Mesh for People Who Have Stuff to Do | Blog Dreamy Cartoonscape Windmill | Photography Twilight Windmill Reflections | Photography Notes I took while reading "Applied Machine Learning and AI for Engineers" and "Introducing MLOps" | Blog Things I've Learned About Terraform That I Keep Telling People About | Blog Analyzing Unsplash Photo Performance with Python | Blog Analyzing Unsplash Photo Performance with Python I am a Top Mentor on MentorCruise! 🎉 | Blog CI/CD Observability on GitHub Actions and the Role of OpenTelemetry | Blog CI/CD Observability on GitHub Actions and the Role of OpenTelemetry Silent Water Sentinel | Photography Three Early Crosses | Photography Fiery Twilight Trails | Photography Forested Folds Flowing | Photography Majestic Snowbound Spire | Photography Shrouded Winter Peaks | Photography Space Cat Pillar | Photography I am a CNCF (Cloud Native Computing Foundation) Ambassador! | Blog Curved Valley Mist | Photography Highly Independent Tree | Photography Misty Morning Plateau | Photography Sick Shadows Fading | Photography Half Moon Blossom | Photography Serene Pedestal Swinging | Photography Sunset Clouds Reeling | Photography Aerial Nose Parking | Photography How to Structure C Projects: These Best Practices Worked for Me | Blog How to Structure C Projects: These Best Practices Worked for Me I'm on the KubeFM Podcast Talking About "Linux Containers From Scratch" | Blog I am (again) a Google Developers Expert! | Blog How to Configure OIDC with Terraform for GitHub Enterprise Server | Blog How to Configure OIDC with Terraform for GitHub Enterprise Server Modern Frontend Development: A Tooling Overview for Engineers Revisiting the Field | Blog Meet verto.sh: Your Gateway to Open-Source Collaboration. | Blog Crafting a Clean, Maintainable, and Understandable Makefile for a C Project. | Blog Crafting a Clean, Maintainable, and Understandable Makefile for a C Project. barco: Linux Containers From Scratch in C. | Blog How to Create a Release With Multiple Artifacts From a GitHub Actions Workflow Using the Matrix Strategy | Blog How to Create a Release With Multiple Artifacts From a GitHub Actions Workflow Using the Matrix Strategy How Databases Store and Retrieve Data with B-Trees | Blog How Databases Store and Retrieve Data with B-Trees Concurrency in Go: Goroutines, Channels, Mutexes, and More | Blog Concurrency in Go: Goroutines, Channels, Mutexes, and More Club Cloud 2021: Cloud Engineering Panel Discussion | Blog Club Cloud 2021: Cloud Engineering Panel Discussion How to Prepare for the Google Cloud Engineer Associate Certification Exam | Blog How to Prepare for the Google Cloud Engineer Associate Certification Exam What is Google Cloud Deploy? | Blog What is GitOps? | Blog Club Cloud Stories #2 - News from Around the Cloud | Blog Club Cloud Stories #2 - News from Around the Cloud Club Cloud Stories #1 - The First Episode with Antoni Tzavelas & Mark van Holsteijn | Blog Club Cloud Stories #1 - The First Episode with Antoni Tzavelas & Mark van Holsteijn | Blog Quiet Oak Shining | Photography How to Read Firestore Events with Cloud Functions and Golang | Blog How to Read Firestore Events with Cloud Functions and Golang | Blog Google Cloud Pub/Sub vs NATS: An Easy-to-Understand Comparison | Blog Google Cloud Pub/Sub vs NATS: An Easy-to-Understand Comparison | Blog How to Deploy a Multi-cluster Service Mesh on GKE with Anthos | Blog How to Deploy a Multi-cluster Service Mesh on GKE with Anthos | Blog How to Safely Store Secrets in Terraform Using Cloud KMS | Blog How to Safely Store Secrets in Terraform Using Cloud KMS | Blog Designing Serverless Applications on AWS - Jacco Kulman and Luca Cavallin @ End2End LIVE | Blog Designing Serverless Applications on AWS - Jacco Kulman and Luca Cavallin @ End2End LIVE | Blog How to Use Terraform Workspaces to Manage Environment-based Configuration | Blog How to Use Terraform Workspaces to Manage Environment-based Configuration | Blog Puffy Steel Spreading | Photography How to Deploy ElasticSearch on GKE using Terraform and Helm | Blog How to Deploy ElasticSearch on GKE using Terraform and Helm | Blog Summer Windmills Spinning | Photography How to Optimize PHP Performance on Google Cloud Run | Blog How to Optimize PHP Performance on Google Cloud Run | Blog Foggy Boats Rusting | Photography How I Prepared for the Google Cloud Associate Cloud Engineer Exam | Blog How I Prepared for the Google Cloud Associate Cloud Engineer Exam | Blog Winter Kids Chasing | Photography

Luca Cavallin · 2023-09-17 · via Luca Cavallin

barco is a project I worked on to learn more about Linux containers and the Linux kernel. It's a simple implementation of a container runtime in C, which I wrote from scratch (based on other guides on the Internet) using just C, libseccomp for seccomp filters, libcap for container capabilities, libcuni1 for unit tests with CUnit, argtable for handling the CLI and another third-party library for logging. It's not meant to be used in production, but rather as a learning tool.

Linux containers are made up by a set of Linux kernel features:

namespaces: are used to group kernel objects into different sets that can be accessed by specific process trees. There are different types of namespaces, for example,the PID namespace is used to isolate the process tree, while the network namespace is used to isolate the network stack.
seccomp: a Linux Kernel mechanism used to limit the system calls that a process can make (handled via syscalls)
capabilities: a Linux Kernel mechanism used to set limits on what the root user can do in the container (handled via syscalls)
cgroups: are used to limit the resources (e.g. memory, disk I/O, CPU-tme) that a process can use (handled via cgroupfs)

barco uses all of these features to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work.

barco can be used to run bin/sh . from the / directory as root (-u 0) with the following command (optional -v for verbose output):

bash

$ sudo ./bin/barco -u 0 -m / -c /bin/sh -a . [-v]
 
22:08:41 INFO  ./src/barco.c:96: initializing socket pair...
22:08:41 INFO  ./src/barco.c:103: setting socket flags...
22:08:41 INFO  ./src/barco.c:112: initializing container stack...
22:08:41 INFO  ./src/barco.c:120: initializing container...
22:08:41 INFO  ./src/barco.c:131: initializing cgroups...
22:08:41 INFO  ./src/cgroups.c:73: setting memory.max to 1G...
22:08:41 INFO  ./src/cgroups.c:73: setting cpu.weight to 256...
22:08:41 INFO  ./src/cgroups.c:73: setting pids.max to 64...
22:08:41 INFO  ./src/cgroups.c:73: setting cgroup.procs to 1458...
22:08:41 INFO  ./src/barco.c:139: configuring user namespace...
22:08:41 INFO  ./src/barco.c:147: waiting for container to exit...
22:08:41 INFO  ./src/container.c:43: ### BARCONTAINER STARTING - type 'exit' to quit ###
 
# ls
bin         home                lib32       media       root        sys         vmlinuz
boot        initrd.img          lib64       mnt         run         tmp         vmlinuz.old
dev         initrd.img.old      libx32      opt         sbin        usr
etc         lib                 lost+found  proc        srv         var
# echo "i am a container"
i am a container
# exit
 
22:08:55 INFO  ./src/barco.c:153: freeing resources...
22:08:55 INFO  ./src/barco.c:168: so long and thanks for all the fish

Development

If you want to build and run barco from source, you can do so by cloning the repository and following the instructions in the README file. In short, barco provides configuration for development on GitHub Codespaces using Visual Studio Code and the included Makefile can be used to build and run the project as follows:

bash

$ sudo apt install -y make
$ make build

There are also other targets in the Makefile that can be used to run the unit tests, build the documentation, run the formatter and the linter, etc (most of the tools are native to the Clang compiler). Furthermore, while working on barco I did investigate best practices for the structure of C projects and I adopted the following:

yaml

├── .devcontainer - configuration for GitHub Codespaces
├── .github - configuration GitHub Actions and other GitHub features
├── .vscode - configuration for Visual Studio Code
├── bin - the executable (created by make)
├── build - intermediate build files e.g. *.o (created by make)
├── docs - documentation
├── include - header files
├── lib - third-party libraries
├── scripts - scripts for setup and other tasks
├── src - C source files
│     ├── barco.c - (main) Entry point for the CLI
│     └── *.c
├── tests - contains tests
├── .clang-format - configuration for the formatter
├── .clang-tidy - configuration for the linter
├── .gitignore
├── LICENSE
├── Makefile
└── README.md

How does it work?

The barco executable is the entry point for the CLI. It's responsible for parsing the CLI arguments, setting up the container and running the container process: it all starts at barco.c, where I used argtable to parse the CLI arguments and set up, start and ultimately cleanup the container and other resources. The first two steps towards running a container are creating a pair of sockets (to communicate with the container process) and initializing the container stack (to set up the container process).

The Call to container_init

After the initial setup, the container_init function, defined in container.c, is called to start the container process. The function is relatively simple, and all it does it calling the clone system call with a function to run (container_start), stack configuration, and the appropriate flags (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS allow some control over mounts, pids, IPC data structures, network devices and hostname) to create the container process.

The container_start Function

The resulting process is a child of the barco process, and it's the one that will run the container. The container_start function is the entry point for the container process, and it's defined in container.c. It's responsible for setting up the container, and it does so by:

setting the hostname
setting the root directory (mount namespace) to the one specified by the user (via the -m flag)
setting the user namespace
setting capabilities and seccomp filters (for security)

The Mount Namespace

The mount namespace is used to isolate the filesystem mount points seen by the container process. The mount_set function is responsible for setting the root directory for the container process, and it does so by calling the mount system call with the appropriate flags (MS_BIND | MS_REC | MS_PRIVATE) to create a new mount namespace and bind-mount the root directory to the one specified by the user. The result is that the container process will see the root directory as the one specified by the user, and it will be able to access the files in that directory and its subdirectories.

The User Namespace

The user namespace is used to isolate the user and group IDs seen by the container process. The user_namespace_init function is responsible for setting the user namespace, and it does so by calling the unshare system call with the appropriate flags (CLONE_NEWUSER) to create a new user namespace. user_namespace_init relies on user_namespace_prepare_mappings, a function called in barco.c and used by the parent process (barco) to listen for the child process (the container) to request setting uid and gid before updating the uid_map and gid_map for the container. The uid_map and gid_map files are a Linux kernel mechanism for mapping uids and gids between the parent and child processes. The result is that the container process will see the user and group IDs as specified by the user, and it will be able to access the files in that directory and its subdirectories.

Capabilities and Seccomp Filters

The container process is running as root (uid 0), but it's not a full root user. It's a root user with limited capabilities and seccomp filters. The sec_set_caps function uses libcap to set the capabilities for the container process, and it does so by calling the cap_set_proc function with the appropriate flags (e.g. CAP_MAC_OVERRIDE, CAP_MKNOD, CAP_SETFCAP, CAP_SYSLOG...). This configuration makes it so that the container process will be able to perform only the actions specified by the capabilities.

Seccomp Filters are used instead to limit the system calls that a process can make (in my case, the container). The sec_set_seccomp function blocks sensitive system calls based on Docker's default seccomp profile and other obsolete or dangerous system calls. It does so by calling the seccomp_rule_add function with the appropriate parameters to specify how calls to a system call should be handled. Let's use the following as an example:

seccomp_rule_add(ctx, SEC_SCMP_FAIL, SCMP_SYS(fchmod), 1, SCMP_A1(SCMP_CMP_MASKED_EQ, S_ISGID, S_ISGID))

This instruction tells the Linux Kernel to block calls to fchmod if the S_ISGID bit is set in the mode parameter. The mode parameter is the second parameter of the fchmod system call, and it's used to set the permissions of a file. The S_ISGID bit is used to set the setgid bit, which is used to set the group ID on execution. The S_ISGID bit is set when the mode parameter is S_ISGID | S_IRWXU | S_IRWXG | S_IRWXO, which is the default value for the mode parameter. This means that the fchmod system call will be blocked if the mode parameter is the default value, which is the case when the fchmod system call is used to set the permissions of a file. This is a good example of how seccomp filters can be used to block dangerous system calls.

Cgroups

As the container is starting, the parent process (barco) is setting up cgroups for the child process. The cgroups_init function is responsible for setting up cgroups (version 2), and it does so by creating the /sys/fs/cgroup/barcontainer directory (barcontainer is the hostname for the container) for the new cgroup and then writing cgroups settings to the appropriate files in the cgroupfs. The cgroups settings are:

memory.max: the maximum amount of memory that the container process can use
cpu.weight: the relative weight of the container process compared to other processes
pids.max: the maximum number of processes that the container process can spawn

By setting these values, the container process will be limited in the amount of memory, CPU time and processes that it can use.

Waiting for the Container to Exit

The container is setup, and it's now running. The parent process (barco) is waiting for the container process to exit, and it does so by calling the waitpid system call with the PID (process ID) of the container. The container I used as an example in the README.md file is running /bin/sh and waiting for user input. The user can type commands, and the container process will execute them. For example, the user can type ls and the container process will execute it, listing the contents of the current directory. The container process will continue to run until the user types exit, which will cause the container process to exit. The parent process (barco) will then exit as well.

Cleaning up

As the container process exits, the parent process (barco) exits too, and it's responsible for cleaning up the resources used by the container process. Starting at the cleanup label in barco.c, the parent process is responsible for:

freeing the container stack
closing the sockets used to communicate with the container process
removing the cgroup it created
freeing the data structures used by argtable

Summary

Linux containers are made up of a set of Linux kernel features, and barco uses them to create a container that is isolated from the host system. It's a very simple implementation, but it's enough to understand how containers work. I learned a lot while working on this project, and I hope that this post can be useful to others who want to learn more about Linux containers and the Linux kernel. If you have any questions, feedback or improvements you'd like to make, feel free to reach out to me directly or on the barco repository!

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Luca Cavallin

Development

How does it work?

The Call to container_init

The container_start Function

The Mount Namespace

The User Namespace

Capabilities and Seccomp Filters

Cgroups

Waiting for the Container to Exit

Cleaning up

Summary