Age | Commit message (Collapse) | Author |
|
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.
A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).
It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.
The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.
This patch also removes cg_delete() which is unused now.
On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.
This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.
This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.
The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.
To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.
This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.
When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
|
|
selinux: always use *_raw API from libselinux
|
|
More cgroup fixes
|
|
When the user wants to explicitly send our own PID a signal, then do so.
Don't follow up SIGABRT with a SIGHUP if send_sighup is enabled. At that
point the process should have segfaulted, hence there's no point in
following up with a SIGHUP.
Send only termination signals to ourselves, never KILL or ABRT signals.
|
|
|
|
Always say when we ignore errors. Cast calls whose return value we
knowingly ingore to (void). Use "bool" where we actually mean a boolean,
even if we return it as an int later on.
|
|
It's cheaper that going to cgroupfs, and also usually the better choice
since it's not racy and can map PIDs even if they were moved to a
different unit.
|
|
In all cases where the function (or cg_is_empty_recursive()) ignoring
the calling process is actually wrong, as a process keeps a cgroup busy
regardless if its the current one or another. Hence, let's simplify
things and drop the "ignore_self" parameter.
|
|
A number of simplications and adjustments to brings things closer to our
coding style.
|
|
The legacy cgroup hierarchy does not support reliable empty
notifications in containers and if there are left-over subgroups in a
cgroup. This makes it hard to correctly wait for them running empty, and
thus we previously disabled this logic entirely.
With this change we explicitly check for the container case, and whether
the unit is a "delegation" unit (i.e. one where programs may create
their own subgroups). If we are neither in a container, nor operating on
a delegation unit cgroup empty notifications become reliable and thus we
start waiting for the empty notifications again.
This doesn't really fix the general problem around cgroup notifications
but reduces the effect around it.
(This also reorders #include lines by their focus, as suggsted in
CODING_STYLE. We have to add "virt.h", so let's do that at the right
place.)
Also see #317.
|
|
|
|
|
|
Rework the "service is good" check, to only check the cgroup state if we
really need to instead of always.
This allows us to suppress going to the cgroupfs for an empty check for
the majority of services.
No functional change.
|
|
Instead, remember that we have already written it.
|
|
When mcstransd* is running non-raw functions will return translated SELinux
context. Problem is that libselinux will cache this information and in the
future it will return same context even though mcstransd maybe not running at
that time. If you then check with such context against SELinux policy then
selinux_check_access may fail depending on whether mcstransd is running or not.
To workaround this problem/bug in libselinux, we should always get raw context
instead. Most users will not notice because result of access check is logged
only in debug mode.
* SELinux context translation service, which will translates labels to human
readable form
|
|
Related to the TODO item to replace FOREACH_WORD_QUOTED with it.
Tested by setting `JoinControllers=cpu,cpuacct,memory net_cls,blkio' in
/etc/systemd/system.conf, rebooting the system with the patched binaries
and checking that the desired setup was created by inspecting the
entries under /sys/fs/cgroup.
No regressions observed in test cases.
|
|
Otherwise we might attempt to remove a non-existing fd from epoll.
|
|
|
|
This adds a new call unit_set_slice(), and simplifies
unit_add_default_slice(). THis should make our code a bit more robust
and simpler.
|
|
|
|
|
|
We store the properties for transient units in drop-ins anyway, and
units don't have to have fragment files, hence don't bother with them,
and don't create them.
|
|
|
|
|
|
Let's add a way to get the type-specific D-Bus interface of a unit from
either its type or name to src/basic/unit-name.[ch]. That way we can
share it with the client side, where it is useful in tools like cgls or
machinectl.
Also ports over machinectl to make use of this.
|
|
This reverts commit d4d00020d6ad855d65d31020fefa5003e1bb477f. The idea of
the commit is broken and needs to be reworked. We really cannot reduce
the bus-addresses to a single address. We always will have systemd with
native clients and legacy clients at the same time, so we also need both
addresses at the same time.
|
|
It is not acceptable to load unit files during enable/disable operations
just to figure out the selinux labels. systemd implements lazy loading
for units, so the selinux hooks need to follow it.
This drops the mac_selinux_unit_access_check_strv() helper which
implements a non-acceptable policy check. If anyone cares for that
functionality, you really should pass a callback+userdata to the helpers
in src/shared/install.c which does policy checks on each touched file.
See #1050 on github for more.
|
|
paths are specified
The commit 4938696301a914ec26bcfc60bb99a1e9624e3789 overlooked the
fact that unit files can be specified as unit file paths, not unit
file names, wrongly passing a unit file path to the 1st argument of
manager_load_unit() that handles it as a unit file name. As a result,
the following 4 systemctl subcommands:
enable
disable
reenable
link
mask
unmask
fail with the following error message:
# systemctl enable /usr/lib/systemd/system/kdump.service
Failed to execute operation: Unit name /usr/lib/systemd/system/kdump.service is not valid.
# systemctl disable /usr/lib/systemd/system/kdump.service
Failed to execute operation: Unit name /usr/lib/systemd/system/kdump.service is not valid.
# systemctl reenable /usr/lib/systemd/system/kdump.service
Failed to execute operation: Unit name /usr/lib/systemd/system/kdump.service is not valid.
# cp /usr/lib/systemd/system/kdump.service /tmp/
# systemctl link /tmp/kdump.service
Failed to execute operation: Unit name /tmp/kdump.service is not valid.
# systemctl mask /usr/lib/systemd/system/kdump.service
Failed to execute operation: Unit name /usr/lib/systemd/system/kdump.service is not valid.
# systemctl unmask /usr/lib/systemd/system/kdump.service
Failed to execute operation: Unit name /usr/lib/systemd/system/kdump.service is not valid.
To fix the issue, first check whether a unit file is passed as a unit
file name or a unit file path, and then pass the unit file to the
appropreate argument of manager_load_unit().
By the way, even with this commit mask and unmask reject unit file
paths as follows and this is a correct behavior:
# systemctl mask /usr/lib/systemd/system/kdump.service
Failed to execute operation: Invalid argument
# systemctl unmask /usr/lib/systemd/system/kdump.service
Failed to execute operation: Invalid argument
|
|
fix "systemctl status idontexist.service" showing the full cgroup tree
|
|
Set _EXEC_UTMP_MODE_INVALID to -1. This matches the return value from
string_table_lookup.
|
|
Internally, the root cgroup is stored as the empty string in
Unit.cgroup_path, and "no cgroup" as NULL. Unfortunately, D-Bus does not
know a NULL concept, hence when reporting the cgroup to clients we
should turn the root cgroup into "/", and leave the empty string for the
"no cgroup" case.
This should make sure that "systemctl status -- -.slice" works correctly
and shows the entire cgroup tree.
|
|
|
|
This is preparation for a later commit that makes use of these
properties for spawning an interactive shell in a container.
|
|
When generating utmp/wtmp entries, optionally add both LOGIN_PROCESS and
INIT_PROCESS entries or even all three of LOGIN_PROCESS, INIT_PROCESS
and USER_PROCESS entries, instead of just a single INIT_PROCESS entry.
With this change systemd may be used to not only invoke a getty directly
in a SysV-compliant way but alternatively also a login(1) implementation
or even forego getty and login entirely, and invoke arbitrary shells in
a way that they appear in who(1) or w(1).
This is preparation for a later commit that adds a "machinectl shell"
operation to invoke a shell in a container, in a way that is compatible
with who(1) and w(1).
|
|
Closes #919.
|
|
Fix machinectl login with containers in user namespaces (v2)
|
|
To be able to use `systemd-run` or `machinectl login` on a container
that is in a private user namespace, the sub-process must have entered
the user namespace before connecting to the container's D-Bus, otherwise
the UID and GID in the peer credentials are garbage.
So we extend namespace_open and namespace_enter to support UID namespaces,
and we enter the UID namespace in bus_container_connect_{socket,kernel}.
namespace_open will degrade to a no-op if user namespaces are not enabled
in the kernel.
Special handling is required for the setns call in namespace_enter with
a user namespace, since transitioning to your own namespace is forbidden,
as it would result in re-entering your user namespace as root.
Arguably it may be valid to check this at the call site, rather than
inside namespace_enter, but it is less code to do it inside, and if the
intention of calling namespace_enter is to *be* in the target namespace,
rather than to transition to the target namespace, it is a reasonable
approach.
The check for whether the user namespace is the same must happen before
entering namespaces, as we may not be able to access /proc during the
intermediate transition stage.
We can't instead attempt to enter the user namespace and then ignore
the failure from it being the same namespace, since the error code is
not distinct, and we can't compare namespaces while mid-transition.
|
|
The following functions return immediately if a null pointer was passed.
* calendar_spec_free
* link_address_free
* manager_free
* sd_bus_unref
* sd_journal_close
* udev_monitor_unref
* udev_unref
It is therefore not needed that a function caller repeats a corresponding check.
This issue was fixed by using the software Coccinelle 1.0.1.
|
|
Allow arbitrary file paths to be passed to nspawn (v3)
|
|
We should not fall back to dbus-1 and connect to the proxy when kdbus
returns an error that indicates that kdbus is running but just does not
accept new connections because of quota limits or something similar.
Using is_kdbus_available() in libsystemd/ requires it to move from
shared/ to libsystemd/.
Based on a patch from David Herrmann:
https://github.com/systemd/systemd/pull/886
|
|
This adds an EXTRACT_QUOTES option to allow the previous behaviour, of
not interpreting any character inside ' or " quotes as separators.
|
|
It now takes a separators argument, which defaults to WHITESPACE if NULL
is passed.
|
|
This is so that, when called in a loop, unquote_first_word can
distinguish between reaching the end of a string because it has consumed
all the input before the end, and consuming all the input.
This is important because we later add a flag that allows
char *in = "";
char *out;
unquote_first_word(&in, &out, flags);
To put "" in out, and set in = NULL, so the trailing empty string of the
input can be consumed, and mark that the input has been consumed.
|
|
Signed-off-by: Jan Pokorný <jpokorny@redhat.com>
|
|
https://bugzilla.redhat.com/show_bug.cgi?id=1251334
|
|
execute: don't fail if we create the runtime directory from two proce…
|
|
simultaneously
If a service has both ExecStart= and ExecStartPost= set with
Type=simple, then it might happen that we have two children create the
runtime directory of a service (as configured with RuntimeDirectory=) at
the same time. Previously we did this with mkdir_safe() which will
create the dir only if it is missing, but if it already exists will at
least verify the access mode and ownership to match the right values.
This is problematic in this case, since it creates and then adjusts the
settings, thus it might happen that one child creates the directory with
root owner, another one then verifies it, and only afterwards the
directory ownership is fixed by the original child, while the second
child already failed.
With this change we'll now always adjust the access mode, so that we
know that it is right. In the worst case this means we adjust the
mode/ownership even though its unnecessary, but this should have no
negative effect.
https://bugzilla.redhat.com/show_bug.cgi?id=1226509
|
|
The ->done callback in the unit's vtable might call into
unit_unwatch_bus_name() and corrupt memory by that.
Move the call down, and clean up the bus slot in case it hasn't been done
yet.
|
|
Currently, PID1 installs an unfiltered NameOwnerChanged signal match, and
dispatches the signals itself. This does not scale, as right now, PID1
wakes up every time a bus client connects.
To fix this, install individual matches once they are requested by
unit_watch_bus_name(), and remove the watches again through their slot in
unit_unwatch_bus_name().
If the bus is not available during unit_watch_bus_name(), just store
name in the 'watch_bus' hashmap, and let bus_setup_api() do the installing
later.
|
|
|