Secure Bubblewrap inside Kubernetes with ProcMount

This post explores how to create nested containers securely inside Kubernetes. In the previous post, "Recursive namespaces to run containers inside a container", I showed how to create nested containers using a rootless container runtime like Podman. In this post, I'll demonstrate how to run the same workload with Kubernetes.

In two parts, I will present:

How to run Kubernetes from source.
The ProcMountType feature to work around the original issue.

Context and problem statement

The context of this post is to deploy a service named zuul-executor for running CI builds securely inside Kubernetes, without requiring a privileged security context.

The problem is that this service performs build isolation locally using Bubblewrap, which is similar to running a container inside a container.

Run Kubernetes locally

In this section, let's set up Kubernetes locally. On a fresh Fedora 41 system, install the following requirements:

$ sudo dnf install -y etcd crio crictl kubectl containernetworking-plugins
$ sudo systemctl start crio

Then, start Kubernetes using the local-up-cluster script as follows:

$ mkdir -p ~/src/github.com/kubernetes; cd ~/src/github.com/kubernetes
$ git clone https://github.com/kubernetes/kubernetes/
$ cd kubernetes
$ sudo env CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
    ./hack/local-up-cluster.sh
...
Local Kubernetes cluster is running. Press Ctrl-C to shut it down.

With the cluster running, let's try Bubblewrap inside a pod using the following test resource:

apiVersion: v1
kind: Pod
metadata:
  name: test-bwrap
spec:
  containers:
    - name: test
      image: quay.io/zuul-ci/zuul-executor
      command: ["/bin/sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["SETFCAP"]

As seen previously, we need CAP_SETFCAP to create the user namespace; otherwise, bwrap fails early with the following error:
bwrap: setting up uid map: Operation not permitted
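
To check that the capability was actually granted, one option is to read the capability bitmask from /proc/self/status inside the pod. The sketch below is an illustration and not part of the original setup: has_setfcap is a hypothetical helper name, and it relies on CAP_SETFCAP being capability number 31 in linux/capability.h. Inspecting the bounding set (CapBnd) is my assumption about the most convenient set to look at.

```shell
# has_setfcap MASK: succeed when bit 31 (CAP_SETFCAP) is set in a
# hexadecimal capability mask, such as the CapBnd value found in
# /proc/self/status. The helper name is hypothetical.
has_setfcap() {
    [ $(( (0x$1 >> 31) & 1 )) -eq 1 ]
}

# Example, once the test pod is running:
#   mask=$(kubectl exec test-bwrap -- cat /proc/self/status | awk '/^CapBnd:/ {print $2}')
#   has_setfcap "$mask" && echo "CAP_SETFCAP is in the bounding set"
```

The helper only decodes the mask, so it can also be used on a value copied by hand from the pod.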

Apply the test resource with the following commands:

$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
$ kubectl apply -f test-bwrap.yaml
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
bwrap: Can't mount proc on /newroot/proc: Operation not permitted

This produces the same error we encountered in the previous post: parts of the /proc filesystem are masked by the container runtime in the pod, which prevents Bubblewrap from mounting a new procfs for the new PID namespace.
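
To see which paths under /proc are covered by other mounts in the pod, the mount table can be inspected. The following is a minimal sketch: proc_overmounts is a hypothetical helper that reads the content of /proc/self/mountinfo on standard input and prints the mount points located below /proc, which is the fifth field of each line.

```shell
# proc_overmounts: print every mount point located below /proc.
# Input: the content of /proc/self/mountinfo on stdin.
# The helper name is hypothetical.
proc_overmounts() {
    awk '$5 ~ /^\/proc\// { print $5 }'
}

# Example, once the test pod is running:
#   kubectl exec test-bwrap -- cat /proc/self/mountinfo | proc_overmounts
```

With the default pod settings, I would expect entries such as /proc/kcore to show up, although the exact list depends on the container runtime.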

The next section introduces the ProcMountType feature to work around this issue.

The ProcMountType feature

The ProcMountType feature can be enabled by adding the following environment variable to the local-up-cluster command: FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'. To make use of this feature, we also need to enable UserNamespacesSupport, as explained in the Kubernetes documentation.
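
Combined with the command from the previous section, restarting the cluster could look like the sketch below. The start_local_cluster function name is only illustrative, and the command is wrapped in a function so that nothing runs until it is called. The environment variables are the ones already used in this post.

```shell
# Both gates are listed together, as described above.
FEATURE_GATES='UserNamespacesSupport=true,ProcMountType=true'

# start_local_cluster: hypothetical wrapper around local-up-cluster.sh.
# Stop the previous cluster with Ctrl-C first, then call this function
# from the root of the kubernetes source checkout.
start_local_cluster() {
    sudo env FEATURE_GATES="$FEATURE_GATES" \
        CGROUP_DRIVER=systemd CONTAINER_RUNTIME=remote \
        CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock' \
        ./hack/local-up-cluster.sh
}
```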

With these features, we can update the resource like this:

apiVersion: v1
kind: Pod
metadata:
  name: test-bwrap
spec:
  hostUsers: false
  containers:
    - name: test
      image: quay.io/zuul-ci/zuul-executor
      command: ["/bin/sleep", "infinity"]
      securityContext:
        procMount: Unmasked
        capabilities:
          add: ["SETFCAP"]

… using the following commands:

$ sudo crictl rm -af; kubectl delete -f ./test-bwrap.yaml && kubectl apply -f ./test-bwrap.yaml
pod/test-bwrap created
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
bwrap: Can't mount proc on /newroot/proc: Permission denied

This time we get a different error, Permission denied, which is caused by SELinux. Using audit2allow, we can see that the following policy needs to be installed:

module nestedcontainers 1.0;

require {
    type proc_t;
    type devpts_t;
    type container_t;
    class filesystem mount;
}

#============= container_t ==============
allow container_t devpts_t:filesystem mount;
allow container_t proc_t:filesystem mount;
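
If the policy above is saved by hand as nestedcontainers.te, it has to be compiled into a .pp package before it can be loaded. Here is a minimal sketch, assuming the checkmodule and semodule_package tools are installed. The build_policy_module name is a hypothetical helper, not something provided by SELinux.

```shell
# build_policy_module NAME: compile NAME.te into NAME.pp.
# The helper name is hypothetical.
build_policy_module() {
    name=$1
    # Compile the type enforcement source into a binary module.
    checkmodule -M -m -o "$name.mod" "$name.te" || return 1
    # Wrap the module into a policy package that semodule can install.
    semodule_package -o "$name.pp" -m "$name.mod"
}

# Example:
#   build_policy_module nestedcontainers
```

The resulting nestedcontainers.pp file is the one passed to semodule -i.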

Installing this policy lets us run Bubblewrap inside an unprivileged pod:

$ sudo semodule -i nestedcontainers.pp
$ kubectl exec test-bwrap -- bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session ps afx
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 bwrap --ro-bind /lib /lib --ro-bind /usr /usr --symlink /usr/lib64 /lib64 --proc /proc --dev /dev --tmpfs /tmp --unshare-all --new-session --cap-add all --uid 0 ps afx
      2 ?        R      0:00 ps afx

Notice how the sleep infinity process is not visible in the ps output, confirming that we are indeed running in a nested container.

Conclusion

This post demonstrates that we can run a container inside a container with Kubernetes thanks to the following settings:

The SETFCAP capability to create the user namespace,
The ProcMountType and UserNamespacesSupport feature gates to unmask the /proc filesystem, and
An SELinux policy to enable mounting filesystems inside the new namespace.