After upgrading our EKS Kubernetes clusters from 1.23 to 1.24, some pods went into CrashLoopBackOff with the following curious error:
2023/06/14 16:03:41 [emerg] 1#1: bind() to 0.0.0.0:80 failed (13: Permission denied)
nginx: [emerg] bind() to 0.0.0.0:80 failed (13: Permission denied)
I know some of you are thinking: geez, everyone knows you are not supposed to use privileged ports as container ports!
But that’s not the point here. We run a cluster with shared workloads and did not have admission controllers to control which ports workloads use, so here we are.
But the more interesting questions are: What is going on here? Why did this just stop working?
This is what I’d like to explore in this article. It turns out that the move from Docker Engine to containerd as the container runtime exposed an older issue with Docker.
But if you are just interested in the fix for this problem, jump down to “How Do I Fix This?”.
Linux Capabilities
Let’s review Linux capabilities and why they exist. Before capabilities, there were two types of processes in Linux: privileged and non-privileged.
The privileged root user (UID 0) can do almost anything and has the maximum number of privileges, but because of this, running processes as the root user increases the risk of exploitation.
On the other hand, running a process as a non-privileged user (UID >0) is also very limiting. A process often needs to open a network port, but binding to a port below 1024 is something that traditionally requires root permissions.
Capabilities to the Rescue
Capabilities are a Linux mechanism that allows processes (and threads) to be granted specific privileges without the need to run the process with root user (superuser) privileges. You can decide which individual permissions to grant to your process running as a non-root user. This greatly reduces the attack surface, and is a much safer approach than running a process as the root user.
Fun Fact – In the “old days”, any executable that needed to be run by a standard user but also make privileged kernel calls would have the suid bit set, effectively granting it privileged access. A good example was the ping command, which was traditionally given fully privileged access to make ICMP calls.
Let’s start an nginx docker container in privileged mode and have a look:
docker run -d --privileged --rm nginx
2fa381d73aef0ca52818851cc6d002ea820e9037387457cdbcb08e1397bab1ef
docker exec -it 2fa381d73aef0ca52818851cc6d002ea820e9037387457cdbcb08e1397bab1ef /bin/bash
root@2fa381d73aef:/# ps ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 11132 7100 ? Ss 21:37 0:00 nginx: master process nginx -g daemon off;
root@2fa381d73aef:/# cat /proc/1/status |grep Cap
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
root@2fa381d73aef:/# apt update && apt install -y libcap2-bin
root@2fa381d73aef:/# capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
As you can see from the output of /proc/1/status above, capabilities are organised into sets, which are represented as bit masks. The utility capsh can be used to decode the bit masks into human-readable form.
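To make the bit mask idea concrete, here is a minimal Python sketch of what capsh --decode does for the masks shown above. The name table simply follows the order of the capsh output (bit 0 is cap_chown, bit 37 is cap_audit_read); the function name decode_caps is my own, and newer kernels define further capabilities that this table does not cover.

```python
# Capability names indexed by bit position, in the order capsh printed them.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read",
]


def decode_caps(mask_hex: str) -> list[str]:
    """Return the names of the capabilities whose bits are set in a hex mask.

    Bits beyond the end of CAP_NAMES are silently ignored in this sketch.
    """
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    # The CapEff value of the privileged nginx container from above.
    print(",".join(decode_caps("0000003fffffffff")))
```

The capability this article cares about, cap_net_bind_service, is bit 10, so a mask of 0000000000000400 would contain only that one capability.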
The five capability sets are:
- CapInh (Inheritable capabilities): capabilities that can be carried across an execve, but that only become permitted if the executed file also marks them as inheritable.
- CapPrm (Permitted capabilities): the capabilities that a process is allowed to have; the effective set can never exceed it.
- CapEff (Effective capabilities): the capabilities the kernel actually checks when the process attempts a privileged operation.
- CapBnd (Bounding capabilities): the maximum set of capabilities that a process can ever gain.
- CapAmb (Ambient capabilities): capabilities that are preserved when the process executes a program that is not itself privileged, so they carry over to its children.
Let’s return to the main point of the article: why we saw errors opening ports <1024 after switching to containerd. For that we will need to focus on the capability set called CapAmb (Ambient Capabilities).
Ambient Capabilities
Ambient capabilities are a feature in the Linux kernel (introduced in kernel version 4.3) that addresses some limitations of the capability model.
In the (older) Linux capability model, a process can hold capabilities, but a non-root process cannot simply pass them on to the programs it executes: they are lost across execve unless the executable file itself carries matching file capabilities. This limits their usefulness in situations where a privileged process needs to delegate some of its privileges to other non-privileged processes.
To handle this situation, ambient capabilities were added. When a process has ambient capabilities set, these capabilities are inherited by the child processes created by that process and survive the execution of unprivileged programs.
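The difference is easier to see with the transformation rules that the capabilities(7) man page gives for execve. Below is a rough Python sketch of those rules. The class and function names (ProcCaps, FileCaps, caps_after_execve) are made up for illustration, and the sketch deliberately ignores the special handling of root and of set-user-ID/set-group-ID binaries.

```python
from dataclasses import dataclass

CAP_NET_BIND_SERVICE = 1 << 10  # bit 10 in the capability masks


@dataclass(frozen=True)
class ProcCaps:
    """Capability sets of a process, each stored as a bit mask."""
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    """File capabilities attached to an executable (none by default)."""
    inheritable: int = 0
    permitted: int = 0
    effective: bool = False  # the file "effective" flag is a single bit


def caps_after_execve(p: ProcCaps, f: FileCaps) -> ProcCaps:
    """Apply the execve rules from capabilities(7) for a non-root process."""
    # Simplification: a file counts as privileged only if it has file
    # capabilities; set-user-ID and set-group-ID bits are not modelled.
    file_is_privileged = bool(f.inheritable or f.permitted)
    ambient = 0 if file_is_privileged else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective else ambient
    return ProcCaps(p.inheritable, permitted, effective, p.bounding, ambient)


if __name__ == "__main__":
    plain_binary = FileCaps()  # e.g. a web server binary without file capabilities

    # Non-root process with the capability only in its inheritable/bounding sets.
    without_ambient = ProcCaps(inheritable=CAP_NET_BIND_SERVICE,
                               bounding=CAP_NET_BIND_SERVICE)
    print(hex(caps_after_execve(without_ambient, plain_binary).effective))

    # Same process, but with the capability also raised in the ambient set.
    with_ambient = ProcCaps(inheritable=CAP_NET_BIND_SERVICE,
                            permitted=CAP_NET_BIND_SERVICE,
                            bounding=CAP_NET_BIND_SERVICE,
                            ambient=CAP_NET_BIND_SERVICE)
    print(hex(caps_after_execve(with_ambient, plain_binary).effective))
```

In the first case the effective set ends up empty after execve, so a bind to port 80 would be refused; in the second case the ambient bit carries cap_net_bind_service into the effective set of the new program.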
Kubernetes/Docker and Ambient Capabilities
By looking at some older Issues in Kubernetes and Docker, we can piece together the discussion of how capabilities and privileges are handled in containers.
This is an issue (#56374) reported in Kubernetes 1.8, where the SecurityContext was set but the container was still unable to bind to port 80:
securityContext:
  capabilities:
    drop:
    - all
    add:
    - NET_BIND_SERVICE
  allowPrivilegeEscalation: false
Docker was still tightly coupled with Kubernetes at this time, and a later comment referred to related issues in the Moby project, which were likely related to the Kubernetes issue.
The issue “Can’t bind to privileged ports as non-root” from October of 2014 describes the problem that setting --cap-add NET_BIND_SERVICE still did not allow port 80 to be opened. Note that the container was started as user 1000:
$ docker run --rm -u 1000 --cap-add NET_BIND_SERVICE php:apache
(13)Permission denied: AH00072: make_sock: could not bind to address [::]:80
(13)Permission denied: AH00072: make_sock: could not bind to address 0.0.0.0:80
There are lots of other issues that link to this issue, showing that there was much confusion around this topic. Basically, the workaround consisted of always running as UID 0 when capabilities are needed. Clearly this is neither ideal nor the expected behavior.
In September of 2016, there was a change to Docker (#26979) to remedy this situation using ambient capabilities.
This is not the end of the story, however. The feature to use ambient capabilities in Docker was reverted (#27737) due to concerns about backwards compatibility with containers that use suid binaries (such as sudo).
Then, in May 2020, a pull request was merged to “Add default sysctls to allow ping sockets and privileged ports with no capabilities” (#41030), with a link to some other related fixes. Its description reads:
Currently default capability CAP_NET_RAW allows users to open ICMP echo sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024. Both of these are safe operations, and Linux now provides ways that these can be set, per container, to be allowed without any capabilties for non root users. Enable these by default. Users can revert to the previous behaviour by overriding the sysctl values explicitly.
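So it seems that containers started by Docker Engine were allowed to open privileged ports by default, because Docker set these sysctls for each container. With the switch to containerd as the container runtime, however, this default is no longer applied, and binding to a port below 1024 as a non-root user fails again.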
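The rule the kernel applies here is simple enough to sketch in a few lines of Python. This is only an illustration of the threshold check as I understand it, not kernel code; the helper names are my own, and the fallback to 1024 when the sysctl file cannot be read is an assumption for the sketch.

```python
from pathlib import Path

SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")
DEFAULT_PORT_START = 1024  # kernel default when nobody has changed the sysctl


def read_unprivileged_port_start(path: Path = SYSCTL_PATH) -> int:
    """Read the sysctl for the current network namespace, or fall back to 1024."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return DEFAULT_PORT_START


def needs_net_bind_service(port: int, port_start: int = DEFAULT_PORT_START) -> bool:
    """True if binding `port` requires CAP_NET_BIND_SERVICE.

    Port 0 asks the kernel for an ephemeral port and is never privileged.
    """
    return 0 < port < port_start


if __name__ == "__main__":
    start = read_unprivileged_port_start()
    print(f"ip_unprivileged_port_start={start}")
    print(f"port 80 needs CAP_NET_BIND_SERVICE: {needs_net_bind_service(80, start)}")
```

With the sysctl set to 0, as the Docker change describes, no port needs the capability; with the kernel default of 1024, port 80 does.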
It’s worth mentioning that Kubernetes has meanwhile added the ability to set ambient capabilities. See #104620 and #2757.
How Do I Fix This?
If you are running into this problem after switching to containerd as the container runtime, you have a few choices:
- Change the ports opened by the containers in the pods to 1024 or higher
- Using the securityContext syntax, set the sysctl net.ipv4.ip_unprivileged_port_start to the lowest port your containers need to bind (every port from that value upwards becomes unprivileged inside the pod)
Example of the pod-level securityContext syntax for opening port 80:
securityContext:
  sysctls:
  - name: net.ipv4.ip_unprivileged_port_start
    value: "80"
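This uses the Kubernetes feature for setting namespaced sysctls on a pod. As far as I can tell from the Kubernetes documentation, net.ipv4.ip_unprivileged_port_start has been part of the “safe” sysctl set since Kubernetes 1.22, so on a 1.24 cluster it should not need to be enabled as an “unsafe” sysctl on the kubelet.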
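To check whether the change took effect, something like the following Python probe could be run as the non-root user inside the container. It is only a sketch: the function name can_bind is hypothetical, and it assumes Python is available in the image, which will often not be the case for a plain nginx image.

```python
import socket


def can_bind(port: int, host: str = "0.0.0.0") -> bool:
    """Try to bind a TCP socket and report whether the kernel allowed it.

    Returns False only for "Permission denied" (errno 13); other errors,
    such as the port already being in use, are raised to the caller.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            return False
    return True


if __name__ == "__main__":
    print(f"bind to port 80 allowed: {can_bind(80)}")
```

A False result corresponds to the same “13: Permission denied” that nginx reported at the top of this article.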
Summary
This was my attempt to solve the puzzle of why opening privileged ports suddenly stopped working after switching to containerd. I might have missed some details, but this was a good opportunity to revisit some basic concepts in Linux, Docker and Kubernetes. We talked about Linux capabilities and why they are important in Kubernetes (and Docker). Since this is related to security, I am sure the Kubernetes project will continue to evolve how it handles granting privileges to pods.