Age | Commit message (Collapse) | Author |
|
core: add cgroup CPU controller support on the unified hierarchy
(zj: merging not squashing to make it clear against which upstream this patch was developed.)
|
|
Under NixOS, the config_path /etc/systemd/system is a symlink to
/etc/static/systemd/system. Commands such as `systemctl list-unit-files`
and `systemctl is-enabled` did not work as the symlink was not followed.
This does not affect how symlinks are treated within the config_path
directory.
|
|
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches. The situation is explained in
the following documentation.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads. This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.
On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config
options are added with the usual compat translation. CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.
v2: - Error in man page corrected.
- CPU config application in cgroup_context_apply() refactored.
- CPU accounting now works on unified hierarchy.
|
|
|
|
This adds parse_nice() that parses a nice level and ensures it is in the right
range, via a new nice_is_valid() helper. It then ports over a number of users
to this.
No functional changes.
|
|
This permits CPUQuota to accept greater values as documented.
|
|
This new output mode formats all timestamps using the usual format_timestamp()
call we use pretty much everywhere else. Timestamps formatted this way are some
ways more useful than traditional syslog timestamps as they include weekday,
month and timezone information, while not being much longer. They are also not
locale-dependent. The primary advantage however is that they may be passed
directly to journalctl's --since= and --until= switches as soon as #3869 is
merged.
While we are at it, let's also add "short-unix" to shell completion.
|
|
This setting adds minimal user namespacing support to a service. When set the invoked
processes will run in their own user namespace. Only a trivial mapping will be
set up: the root user/group is mapped to root, and the user/group of the
service will be mapped to itself, everything else is mapped to nobody.
If this setting is used the service runs with no capabilities on the host, but
configurable capabilities within the service.
This setting is particularly useful in conjunction with RootDirectory= as the
need to synchronize /etc/passwd and /etc/group between the host and the service
OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the
user of the service itself. But even outside the RootDirectory= case this
setting is useful to substantially reduce the attack surface of a service.
Example command to test this:
systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh
This runs a shell as user "foobar". When typing "ps" only processes owned by
"root", by "foobar", and by "nobody" should be visible.
|
|
|
|
User expectations are broken when "systemctl enable /some/path/service.service"
behaves differently to "systemctl link ..." followed by "systemctl enable".
From user's POV, "enable" with the full path just combines the two steps into
one.
Fixes #3010.
|
|
|
|
service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
systemd-run -p DynamicUser=1 /bin/sleep 99999
|
|
That way, we can neatly keep this in line with the new TasksMaxScale= option.
Note that we didn't release a version with MemoryLimitByPhysicalMemory= yet,
hence this change should be unproblematic without breaking API.
|
|
This adds support for a TasksMax=40% syntax for specifying values relative to
the system's configured maximum number of processes. This is useful in order to
neatly subdivide the available room for tasks within containers.
|
|
|
|
namespace: unify limit behavior on non-directory paths
|
|
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.
Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
|
|
json output
With this change, binary record data is formatted as string if --all is
specified when using json output. This is inline with the effect of --all on
the other available output modes.
Fixes: #3416
|
|
strv_make_nulstr was creating a nulstr which was not a valid nulstr,
because it was missing the terminating NUL. This didn't cause any issues,
because strv_parse_nulstr correctly parsed the result, using the
separately specified length.
But it's confusing to have something called nulstr which really isn't.
It is likely that somebody will try to use strv_make_nulstr() in
some other place, incorrectly.
This patch changes strv_parse_nulstr() to produce a valid nulstr, and
changes the output length parameter to be the minimum number of bytes
which can be later on parsed by strv_parse_nulstr(). This allows the
only user in ask-password-api to be slightly simplified.
Based-on-patch-by: Jean-Sébastien Bour <jean-sebastien@bour.name>
Fixes #3689.
|
|
|
|
* networkd: condition_test() can return a negative error, handle that
If a condition check fails with an error we should not consider the check
successful. Fix that.
We should probably also improve logging in this case, but for now, let's just
unbreak this breakage.
Fixes: #3236
* condition: handle unrecognized architectures nicer
When we encounter a check for an architecture we don't know we should not
let the condition check fail with an error code, but instead simply return
false. After all the architecture might just be newer than the ones we know, in
which case it's certainly not our local one.
Fixes: #3236
|
|
resolved: more fixes, among them "systemctl-resolve --status" to see DNS configuration in effect, and a local DNS stub listener on 127.0.0.53
|
|
It makes use of the sd_listen_fds() call, and as such should live in
src/shared, as the distinction between src/basic and src/shared is that the
latter may use libsystemd APIs, the former does not.
Note that btrfs-util.[ch] and log.[ch] also include header files from
libsystemd, but they only need definitions, they do not invoke any function
from it. Hence they may stay in src/basic.
|
|
Do not ellipsize cgroups when showing slices in --full mode
|
|
sd-bus generally exposes bools as "int" instead of "bool" in the public API.
This is relevant when unmarshaling booleans, as the relevant functions expect
an int* pointer and no bool* pointer. Since sizeof(bool) is not necessarily the
same as sizeof(int) this is problematic and might result in memory corruption.
Let's fix this, and make sure bus_map_all_properties() handles booleans as
ints, as the rest of sd-bus, and make all users of it expect the right thing.
|
|
Delete the dbus1 generator and some critical wiring. This prevents
kdbus from being loaded or detected. As such, it will never be used,
even if the user still has a useful kdbus module loaded on their system.
Sort of fixes #3480. Not really, but it's better than the current state.
|
|
various changes, most importantly regarding memory metrics
|
|
If for whatever reason the file system is "corrupted", we want
to be resilient and ignore the error, as long as we can load the units
from a different place.
Arch bug https://bugs.archlinux.org/task/49547.
A user had an ntfs symlink (essentially a file) instead of a directory after
restoring from backup. We should just ignore that like we would treat a missing
directory, for general resiliency.
We should treat permission errors similarly. For example an unreadable
/usr/local/lib directory would prevent (user) instances of systemd from
loading any units. It seems better to continue.
|
|
The unit files already accept relative, percent-based memory limit
specification, let's make sure "systemctl set-property" support this too.
Since we want the physical memory size of the destination machine to apply we
pass the percentage in a new set of properties that only exist for this
purpose, and can only be set.
|
|
And port a couple of users over to it.
|
|
This adds three new seccomp syscall groups: @keyring for kernel keyring access,
@cpu-emulation for CPU emulation features, for exampe vm86() for dosemu and
suchlike, and @debug for ptrace() and related calls.
Also, the @clock group is updated with more syscalls that alter the system
clock. capset() is added to @privileged, and pciconfig_iobase() is added to
@raw-io.
Finally, @obsolete is a cleaned up. A number of syscalls that never existed on
Linux and have no number assigned on any architecture are removed, as they only
exist in the man pages and other operating sytems, but not in code at all.
create_module() is moved from @module to @obsolete, as it is an obsolete system
call. mem_getpolicy() is removed from the @obsolete list, as it is not
obsolete, but simply a NUMA API.
|
|
Let's add a generic parser for VLAN ids, which should become handy as
preparation for PR #3428. Let's also make sure we use uint16_t for the vlan ID
type everywhere, and that validity checks are already applied at the time of
parsing, and not only whne we about to prepare a netdev.
Also, establish a common definition VLANID_INVALID we can use for
non-initialized VLAN id fields.
|
|
Now we don't support parsing double at map_basic.
when trying to use bus_message_map_all_properties with a double
this fails. Let's add it.
|
|
Assorted stuff
|
|
New exec boolean MemoryDenyWriteExecute, when set, installs
a seccomp filter to reject mmap(2) with PAGE_WRITE|PAGE_EXEC
and mprotect(2) with PAGE_EXEC.
|
|
Recently added cgroup unified hierarchy support uses "max" in configurations
for no upper limit. While consistent with what the kernel uses for no upper
limit, it is inconsistent with what systemd uses for other controllers such as
memory or pids. There's no point in introducing another term. Update cgroup
unified hierarchy support so that "infinity" is the only term that systemd
uses for no upper limit.
|
|
Implement sets of system calls to help constructing system call
filters. A set starts with '@' to distinguish from a system call.
Closes: #3053, #3157
|
|
As suggested here:
https://bugs.freedesktop.org/show_bug.cgi?id=64737#c8
This adds a new call terminal_is_dumb() and makes use of this where
appropriate.
|
|
|
|
On the unified hierarchy, memory controller implements three control knobs -
low, high and max which enables more useable and versatile control over memory
usage. This patch implements support for the three control knobs.
* MemoryLow, MemoryHigh and MemoryMax are added for memory.low, memory.high and
memory.max, respectively.
* As all absolute limits on the unified hierarchy use "max" for no limit, make
memory limit parse functions accept "max" in addition to "infinity" and
document "max" for the new knobs.
* Implement compatibility translation between MemoryMax and MemoryLimit.
v2:
- Fixed missing else's in config_parse_memory_limit().
- Fixed missing newline when writing out drop-ins.
- Coding style updates to use "val > 0" instead of "val".
- Minor updates to documentation.
|
|
We have to pass addresses of changes and n_changes to
bus_deserialize_and_dump_unit_file_changes(). Otherwise we are hit by
missing information (subsequent calls to unit_file_changes_add() to
not add anything).
Also prevent null pointer dereference in
bus_deserialize_and_dump_unit_file_changes() by asserting.
Fixes #3339
|
|
Implement compat translation between IO* and BlockIO* settings
|
|
The original conflict is fixed in the kernel in v4.6-rc7-40-g4a91cb61bb,
but now our work-around causes a compilation failure.
Keep the workaround to support 4.5 kernels for now, and layer
more ugliness on top.
Tested with:
kernel-headers-4.6.0-1.fc25.x86_64
glibc-devel-2.23.90-18.fc25.x86_64
kernel-headers-4.5.4-300.fc24.x86_64
glibc-devel-2.23.1-7.fc24.x86_64
kernel-headers-4.4.9-300.fc23.x86_64
glibc-devel-2.22-16.fc23.x86_64
kernel-headers-4.1.13-100.fc21.x86_64
glibc-devel-2.20-8.fc21.x86_64
|
|
Adds support to core for systemd D-Bus clients to send the
`SELinuxContext` property . This means `systemd-run -p
SELinuxContext=foo` should now work.
|
|
Currently, there are two cgroup IO limits, bandwidth max for read and write,
and they are hard-coded in various places. This is fine for two limits but IO
is expected to grow more limits - low, high and max limits for bandwidth and
IOPS - and hard-coding each limit won't make sense.
This patch replaces hard-coded limits with an array indexed by
CGroupIOLimitType and accompanying string and default value tables so that new
limits can be added trivially.
|
|
core: add io controller support on the unified hierarchy
|
|
Add a synchronization point so that custom initramfs units can run
after the root device becomes available, before it is fsck'd and
mounted.
This is useful for custom initramfs units that may modify the
root disk partition table, where the root device is not known in
advance (it's dynamically selected by the generators).
|
|
Fix "preset-all" with dangling symlinks and install-section hint emitted too eagerly
|
|
That function doesn't draw anything on it's own, just returns a string, which
sometimes is more than one character. Also remove "DRAW_" prefix from character
names, TREE_* and ARROW and BLACK_CIRCLE are unambigous on their own, don't
draw anything, and are always used as an argument to special_glyph().
Rename "DASH" to "MDASH", as there's more than one type of dash.
|
|
It's quite a bit shorter and just as readable.
(The full sentence with "pointing to" was added to replace a text that used
"ln -s %s %s". Using the "ln" syntax is indeed unclear, because it's not
obvious which is the source and which is the target, and because symlink(2)
uses the opposite order to ln(1). But with the unicode arrow there should
be no ambiguity.)
|