Age | Commit message (Collapse) | Author |
|
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management. Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.
Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies. It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test. In time, this would also allow deleting the said
complexities.
This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.
* cg_unified(@controller) is introduced which tests whether the specific
controller in on unified hierarchy and used to choose the unified hierarchy
code path for process and service management when available. Kernel
controller specific operations remain gated by cg_all_unified().
* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
force the use of legacy hierarchy for systemd cgroup controller.
* nspawn: By default nspawn uses the same hierarchies as the host. If
UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If
0, legacy for all.
* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
three options - legacy, only systemd controller on unified, and unified. The
value is passed into mount setup functions and controls cgroup configuration.
* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
appropriate action depending on the configuration of the host.
v2: - CGroupUnified enum replaces open coded integer values to indicate the
cgroup operation mode.
- Various style updates.
v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.
v4: Restored legacy container on unified host support and fixed another bug in
detect_unified_cgroup_hierarchy().
|
|
SYSTEMD_NSPAWN_USE_CGNS allows to disable the use of cgroup namespaces.
|
|
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
|
|
Since v3.11/7dc5dbc ("sysfs: Restrict mounting sysfs"), the kernel
doesn't allow mounting sysfs if you don't have CAP_SYS_ADMIN rights over
the network namespace.
So the mounting /sys as a tmpfs code introduced in
d8fc6a000fe21b0c1ba27fbfed8b42d00b349a4b doesn't work with user
namespaces if we don't use private-net. The reason is that we mount
sysfs inside the container and we're in the network namespace of the host
but we don't have CAP_SYS_ADMIN over that namespace.
To fix that, we mount /sys as a sysfs (instead of tmpfs) if we don't use
private network and ignore the /sys-as-a-tmpfs code if we find that /sys
is already mounted as sysfs.
Fixes #1555
|
|
sysfs below it
This way we can hide things like /sys/firmware or /sys/hypervisor from
the container, while keeping the device tree around.
While this is a security benefit in itself it also allows us to fix
issue #1277.
Previously we'd mount /sys before creating the user namespace, in order
to be able to mount /sys/fs/cgroup/* beneath it (which resides in it),
which we can only mount outside of the user namespace. To ensure that
the user namespace owns the network namespace we'd set up the network
namespace at the same time as the user namespace. Thus, we'd still see
the /sys/class/net/ from the originating network namespace, even though
we are in our own network namespace now. With this patch, /sys is
mounted before transitioning into the user namespace as tmpfs, so that
we can also mount /sys/fs/cgroup/* into it this early. The directories
such as /sys/class/ are then later added in from the real sysfs from
inside the network and user namespace so that they actually show whatis
available in it.
Fixes #1277
|
|
We didn#t actually pass ownership of /run to the UID in the container
since some releases, let's fix that.
|
|
|