Anyone want to chime in on why pivot_root is preferable to a chroot jail? It's k...

cyphar · on May 13, 2020

This is mostly to do with the implementation of chroot(). Because it only applies to a single process (and mount tables are per mount namespace), it was implemented in such a way that directories above the root of the chroot are still technically accessible (the mounts above the root directory are still present in the mount hierarchy). This results in all sorts of fun bugs where if you chroot() inside a chroot() you can get out of the chroot() entirely. Container runtimes generally block CAP_SYS_CHROOT by default for this reason, but there are all sorts of other subtle security bugs which pop up because of this fundamental design choice.
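To make the nested-chroot() breakout concrete, here is a toy model in Python. It is not real kernel code and makes no syscalls. The ToyProcess class and escape() helper are made up for illustration. It only models two facts: chroot() changes the root without changing the working directory, and ".." is only stopped when the current directory is exactly the process root.

```python
class ToyProcess:
    """Toy model of a process's (root, cwd) pair; paths are tuples of names."""

    def __init__(self, root=(), cwd=()):
        self.root = tuple(root)
        self.cwd = tuple(cwd)

    def chdir_up(self):
        # ".." is clamped only when cwd is exactly the process root
        # (or when we are already at the top of the whole tree).
        if self.cwd != self.root and self.cwd:
            self.cwd = self.cwd[:-1]

    def chroot(self, subdir=None):
        # Like chroot(2), this moves the root but leaves cwd untouched.
        self.root = self.cwd + ((subdir,) if subdir else ())


def escape(jail):
    """Walk through the classic nested-chroot breakout in the toy model."""
    proc = ToyProcess(root=jail, cwd=jail)  # start confined to the jail
    proc.chroot("x")                        # root is now below cwd
    for _ in range(len(jail)):              # ".." never meets the root
        proc.chdir_up()
    proc.chroot()                           # chroot(".") at the real top
    return proc.root
```

In this model escape(("srv", "jail")) ends with an empty tuple as the root, i.e. the real top of the tree. The real sequence needs CAP_SYS_CHROOT, which is why dropping that capability closes this particular hole.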

pivot_root() doesn't suffer from this problem because it applies to the entire mount namespace, and thus its implementation could be made much safer. Instead of just changing the current process's filesystem root, the actual / mount of the mount namespace is swapped with another mountpoint on the system (and the old root is mounted elsewhere). Thus once the old root is unmounted there isn't a way to get back to the old mountpoints. This isn't perfect protection (magic-links and other mounts could expose the host filesystem) but it is a damn sight better than chroot(). Oh, and nesting pivot_root()s doesn't cause a breakout.
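For reference, the swap-then-unmount sequence described above can be sketched roughly like this in Python via ctypes. Treat it as a minimal sketch, not what any particular runtime does verbatim: enter_new_root() and _check() are hypothetical names, the syscall numbers are only filled in for x86_64 and aarch64, and it assumes a Linux host, glibc, and enough privilege to unshare a mount namespace. It uses the pivot_root(".", ".") idiom from the pivot_root(2) man page so no temporary put_old directory is needed.

```python
import ctypes
import os
import platform

# Linux constants from <sched.h> and <sys/mount.h>.
CLONE_NEWNS = 0x00020000
MS_BIND = 4096
MS_REC = 16384
MS_PRIVATE = 1 << 18
MNT_DETACH = 2

# pivot_root is invoked by syscall number; only two architectures are
# listed here (an assumption of this sketch).
SYS_PIVOT_ROOT = {"x86_64": 155, "aarch64": 41}


def _check(ret, what):
    """Raise OSError from errno if a libc call did not return 0."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def enter_new_root(new_root):
    """Make new_root the / of a fresh mount namespace, then drop the old root."""
    # Validate before touching any namespace or mount state.
    if not os.path.isdir(new_root):
        raise NotADirectoryError(new_root)
    nr = ctypes.c_long(SYS_PIVOT_ROOT[platform.machine()])
    libc = ctypes.CDLL(None, use_errno=True)
    path = os.fsencode(os.path.abspath(new_root))

    _check(libc.unshare(CLONE_NEWNS), "unshare")
    # Stop mount events from propagating back to the parent namespace.
    private = ctypes.c_ulong(MS_REC | MS_PRIVATE)
    _check(libc.mount(None, b"/", None, private, None), "mount private")
    # pivot_root needs new_root to be a mount point, so bind it onto itself.
    bind = ctypes.c_ulong(MS_BIND | MS_REC)
    _check(libc.mount(path, path, None, bind, None), "bind mount")
    os.chdir(path)
    # The old root ends up stacked at ".", where it is then lazily detached.
    _check(libc.syscall(nr, b".", b"."), "pivot_root")
    _check(libc.umount2(b".", MNT_DETACH), "umount2")
    os.chdir("/")
```

Once the old root is detached, no path in the namespace leads back to it, which is the property chroot() cannot give you.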

Note that this different behaviour in relation to mounts has resulted in completely unrelated security bugs with containers (such as being able to bypass procfs masks because chroot() doesn't hide the unmasked procfs in the host mount namespace). This is why we container runtime authors always tell people they should never use chroot() and always use pivot_root() -- though sadly sometimes chroot() is needed because pivot_root() doesn't work on initramfs.

(I'm one of the maintainers of runc, the runtime which underlies Docker/podman/containerd/cri-o/...)

kelnos · on May 14, 2020

> though sadly sometimes chroot() is needed because pivot_root() doesn't work on initramfs.

Are people actually attempting to boot a super-minimalist system that just has a kernel and an initramfs with something like docker into it where they don't bother with a rootfs at all and just start running containers directly from the initramfs? That's kinda cool, if that's the case.

cyphar · on May 14, 2020

You could do that (though one could argue that there's no real benefit to using containers in that case), but the issue is sadly more general than that. You cannot use pivot_root() if the current root is on initramfs. The reason is largely historical, and boils down to "you cannot unmount initramfs" in the same way that "you cannot kill pid1".

This means that if you have the entire OS image in initramfs and you try to run a container (even one that has a different filesystem as its rootfs), pivot_root() will fail. There are solutions for this but they require changing how the system is started (which can be a bit complicated depending on what system you're using to build your initramfs). From memory, minikube has used --no-pivot-root for a while precisely for this reason, though I believe they have switched away from it sometime recently.
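If you want to check up front whether you are in this situation, one rough heuristic is to look at the filesystem type of the mount at / in /proc/self/mountinfo. The sketch below is an assumption-laden illustration and not how any particular runtime detects it: root_fstype() and pivot_root_likely_blocked() are made-up names, and it assumes an initramfs root shows up with the type "rootfs".

```python
def root_fstype(mountinfo_text):
    """Return the filesystem type of the topmost mount at "/", or None."""
    fstype = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue
        sep = fields.index("-")
        # fields[4] is the mount point; the type follows the "-" separator.
        if sep >= 6 and len(fields) > sep + 1 and fields[4] == "/":
            fstype = fields[sep + 1]  # later entries are stacked on top
    return fstype


def pivot_root_likely_blocked(mountinfo_text):
    """Heuristic: a root of type "rootfs" suggests we are still on initramfs."""
    return root_fstype(mountinfo_text) == "rootfs"
```

On a live system you would feed it the contents of /proc/self/mountinfo, e.g. pivot_root_likely_blocked(open("/proc/self/mountinfo").read()).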

rantwasp · on May 13, 2020

pivot_root is supposed to switch the whole system to a new root. chroot applies to a process, but the underlying system keeps going with what it had.

cyphar · on May 13, 2020

This is true (though it's scoped per mount namespace), but it isn't the primary security reason container runtimes use pivot_root().