Age | Commit message | Author |
|
|
|
Fixes:
```
src/shared/bus-unit-util.c: In function ‘bus_append_unit_property_assignment’:
src/shared/bus-unit-util.c:570:65: warning: passing argument 2 of ‘namespace_flag_from_string_many’ from incompatible pointer type [-Wincompatible-pointer-types]
r = namespace_flag_from_string_many(eq, &flags);
^
In file included from src/shared/bus-unit-util.c:31:0:
src/shared/nsflags.h:41:5: note: expected ‘long unsigned int *’ but argument is of type ‘uint64_t * {aka long long unsigned int *}’
int namespace_flag_from_string_many(const char *name, unsigned long *ret);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
Closes #5312
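The warning above comes from handing a `uint64_t *` to a function that expects an `unsigned long *`; these are distinct types and may differ in width, for example on 32-bit targets. A minimal sketch of one way to bridge such a mismatch, by parsing into a temporary of the expected type and widening afterwards, is shown here. `parse_ns_flag()` and `parse_ns_flag_u64()` are hypothetical stand-ins, not the functions in the tree, and this is not necessarily how the commit resolved it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a parser that returns its result as unsigned long. */
int parse_ns_flag(const char *name, unsigned long *ret) {
        assert(name);
        assert(ret);

        if (strcmp(name, "mnt") == 0)
                *ret = 0x00020000UL; /* value of CLONE_NEWNS */
        else if (strcmp(name, "ipc") == 0)
                *ret = 0x08000000UL; /* value of CLONE_NEWIPC */
        else
                return -1;

        return 0;
}

/* Caller-side wrapper: parse into a temporary of the expected type, then widen. */
int parse_ns_flag_u64(const char *name, uint64_t *ret) {
        unsigned long flags;
        int r;

        assert(ret);

        r = parse_ns_flag(name, &flags);
        if (r < 0)
                return r;

        *ret = (uint64_t) flags;
        return 0;
}
```

Changing the variable at the call site to `unsigned long` would be the other obvious option; which one fits depends on what the caller does with the value afterwards.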
|
|
Closes #5313
|
|
directory for a service
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.
This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
|
|
ReadWritablePaths= work
5327c910d2fc1ae91bd0b891be92b30379c7467b claimed to add support for "+"
for prefixing paths with the configured RootDirectory=. But actually it
only implemented it in the backend, it did not add support for it to the
configuration file parsers. Fix that now.
|
|
conjunction with RootDirectory=
This adds a boolean unit file setting MountAPIVFS=. If set, the three
main API VFS mounts will be mounted for the service. This only has an
effect on RootDirectory=, which it makes a ton more useful.
(This is basically the /dev + /proc + /sys mounting code posted in the
original #4727, but rebased on current git, and with the automatic logic
replaced by explicit logic controlled by a unit file setting)
|
|
This means that callers can distinguish an error from flags==0,
and don't have to special-case the empty string.
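A rough sketch of the calling convention described here: the status travels in the return value and the flags in an output parameter, so an empty string is simply a successful parse that yields 0. The function name, the table and the separator handling are illustrative assumptions; only the numeric values are the kernel's CLONE_NEW* constants.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative name table; the values are those of the kernel's CLONE_NEW* flags. */
static const struct {
        const char *name;
        unsigned long flag;
} ns_table[] = {
        { "cgroup", 0x02000000UL },
        { "ipc",    0x08000000UL },
        { "net",    0x40000000UL },
        { "mnt",    0x00020000UL },
        { "pid",    0x20000000UL },
        { "user",   0x10000000UL },
        { "uts",    0x04000000UL },
};

#define N_NS (sizeof(ns_table) / sizeof(ns_table[0]))

/* Returns 0 and stores the mask in *ret on success, -EINVAL on an unknown name.
 * An empty string is valid input and yields *ret == 0. */
int ns_flags_from_string(const char *s, unsigned long *ret) {
        unsigned long mask = 0;

        assert(s);
        assert(ret);

        for (;;) {
                size_t len, i;

                s += strspn(s, " \t");   /* skip separators */
                if (*s == '\0')
                        break;

                len = strcspn(s, " \t"); /* length of the current word */

                for (i = 0; i < N_NS; i++)
                        if (strlen(ns_table[i].name) == len &&
                            memcmp(ns_table[i].name, s, len) == 0)
                                break;

                if (i == N_NS)
                        return -EINVAL;

                mask |= ns_table[i].flag;
                s += len;
        }

        *ret = mask;
        return 0;
}
```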
|
|
Fixes: #4402
|
|
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.
The two new settings follow the concepts nspawn already possesses in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).
Fixes: #3439
|
|
This makes "systemd-run -p MountFlags=shared -t /bin/sh" work, by adding
MountFlags= to the list of properties that may be accessed transiently.
|
|
|
|
Support on the server side has already been in place for quite some time, let's
also add support on the client side for this.
|
|
In contrast to all other unit types, device units, when queued, just track
external state; they cannot effect state changes on their own. Hence unless a
client or other job waits for them there's no reason to keep them in the job
queue. This adds a concept of GC'ing jobs of this type as soon as no client or
other job waits for them anymore.
To ensure this works correctly we need to track which clients actually
reference a job (i.e. which ones enqueued it). Unfortunately that's pretty
nasty to do for direct connections, as sd_bus_track doesn't work for
them. For now, work around this, by simply remembering in a boolean that a job
was requested by a direct connection, and reset it when we notice the direct
connection is gone. This means the GC logic works fine, except that jobs are
not immediately removed when direct connections disconnect.
In the longer term, a rework of the bus logic should fix this properly. For now
this should be good enough, as GC works fine for all cases except this one, and
thus is a clear improvement over the previous behaviour.
Fixes: #1921
|
|
extract_first_words deals fine with the string being NULL, so drop the upfront
check for that.
|
|
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namespace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
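The decision the filter has to make for unshare() and clone() can be modelled as a mask test on the flags argument. The sketch below only illustrates those semantics; it is not the seccomp program itself, setns() is left out, and the names are made up for this example. "RestrictNamespaces=no" would correspond to `allowed == NS_ALL`, "RestrictNamespaces=yes" to `allowed == 0`.

```c
#include <assert.h>
#include <stdbool.h>

/* Values of the kernel's CLONE_NEW* flags, spelled out to keep the sketch standalone. */
#define NS_MNT    0x00020000UL
#define NS_CGROUP 0x02000000UL
#define NS_UTS    0x04000000UL
#define NS_IPC    0x08000000UL
#define NS_USER   0x10000000UL
#define NS_PID    0x20000000UL
#define NS_NET    0x40000000UL
#define NS_ALL    (NS_MNT|NS_CGROUP|NS_UTS|NS_IPC|NS_USER|NS_PID|NS_NET)

/* allowed: namespace types the unit may still create or manage.
 * flags:   the flags argument of an unshare() or clone() call.
 * A call is blocked if it asks for any namespace type that is not allowed. */
bool ns_request_is_blocked(unsigned long allowed, unsigned long flags) {
        assert((allowed & ~NS_ALL) == 0);

        return (flags & NS_ALL & ~allowed) != 0;
}
```

Flags outside the namespace set, such as CLONE_VM, do not influence the result, so ordinary thread and process creation stays unaffected in this model.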
|
|
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.
This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
|
|
Allowed paths are unified between the configuration file parsers and the bus
property checker. The biggest change is that the bus code now allows "block-"
and "char-" classes. In addition, path_startswith("/dev") was used in the bus
code, and startswith("/dev") was used in the config file code. It seems
reasonable to use path_startswith() which allows a slightly broader class of
strings.
Fixes #3935.
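To illustrate the difference between the two checks, here is a simplified model of a component-wise prefix test next to a plain string prefix test. `simple_path_startswith()` is a reduced sketch written for this note (it returns a boolean and ignores "." components), not the helper from the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Plain string prefix test. */
bool naive_startswith(const char *s, const char *prefix) {
        assert(s);
        assert(prefix);

        return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Component-wise prefix test: compares path elements and tolerates repeated slashes. */
bool simple_path_startswith(const char *path, const char *prefix) {
        assert(path);
        assert(prefix);

        /* Both must be absolute, or both relative. */
        if ((path[0] == '/') != (prefix[0] == '/'))
                return false;

        for (;;) {
                size_t a, b;

                path += strspn(path, "/");
                prefix += strspn(prefix, "/");

                if (*prefix == '\0')
                        return true;   /* every prefix component matched */
                if (*path == '\0')
                        return false;  /* path is shorter than the prefix */

                a = strcspn(path, "/");
                b = strcspn(prefix, "/");
                if (a != b || memcmp(path, prefix, a) != 0)
                        return false;

                path += a;
                prefix += b;
        }
}
```

In this model "//dev/sda" is accepted by the component-wise test but not by the string test, while "/devel" is accepted by the string test only.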
|
|
ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/sys/fs/cgroup.
|
|
add a new tool for creating transient mount and automount units
|
|
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e. all unit types that may invoke processes). If turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.
This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.
In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks. However, we need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
|
|
This is done exactly the same way a couple of times at various places, let's
unify this into one version.
|
|
core: add cgroup CPU controller support on the unified hierarchy
(zj: merging not squashing to make it clear against which upstream this patch was developed.)
|
|
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches. The situation is explained in
the following documentation.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads. This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.
On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config
options are added with the usual compat translation. CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.
v2: - Error in man page corrected.
- CPU config application in cgroup_context_apply() refactored.
- CPU accounting now works on unified hierarchy.
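The commit message does not spell out the compat translation, so the following is only a guess at what a linear mapping between the two scales could look like. It assumes the kernel defaults of 1024 for "cpu.shares" and 100 for "cpu.weight", the legacy range 2..262144 and the unified range 1..10000; the function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SHARES_DEFAULT UINT64_C(1024)
#define SHARES_MIN     UINT64_C(2)
#define SHARES_MAX     UINT64_C(262144)
#define WEIGHT_DEFAULT UINT64_C(100)
#define WEIGHT_MIN     UINT64_C(1)
#define WEIGHT_MAX     UINT64_C(10000)

static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi) {
        return v < lo ? lo : v > hi ? hi : v;
}

/* Scale so that the legacy default maps to the unified default, then clamp. */
uint64_t cpu_shares_to_weight(uint64_t shares) {
        assert(shares >= SHARES_MIN && shares <= SHARES_MAX);

        return clamp_u64(shares * WEIGHT_DEFAULT / SHARES_DEFAULT, WEIGHT_MIN, WEIGHT_MAX);
}

/* The inverse direction, again mapping default to default. */
uint64_t cpu_weight_to_shares(uint64_t weight) {
        assert(weight >= WEIGHT_MIN && weight <= WEIGHT_MAX);

        return clamp_u64(weight * SHARES_DEFAULT / WEIGHT_DEFAULT, SHARES_MIN, SHARES_MAX);
}
```

Because the two ranges have different widths, the mapping is lossy at the edges: very small and very large shares values collapse onto the ends of the weight range.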
|
|
|
|
This adds parse_nice() that parses a nice level and ensures it is in the right
range, via a new nice_is_valid() helper. It then ports over a number of users
to this.
No functional changes.
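For illustration, a parse-and-validate pair along these lines might look as follows. This is a sketch assuming the usual Linux nice range of -20..19 and errno-style return values; the names are chosen for this example and the code is not copied from the tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Valid nice levels on Linux range from -20 (highest priority) to 19 (lowest). */
bool nice_level_is_valid(long n) {
        return n >= -20 && n <= 19;
}

/* Returns 0 and stores the level in *ret, -EINVAL on malformed input,
 * -ERANGE if the number is well-formed but outside the valid range. */
int parse_nice_level(const char *s, int *ret) {
        char *end = NULL;
        long n;

        assert(s);
        assert(ret);

        errno = 0;
        n = strtol(s, &end, 10);
        if (end == s || *end != '\0')
                return -EINVAL;
        if (errno == ERANGE || !nice_level_is_valid(n))
                return -ERANGE;

        *ret = (int) n;
        return 0;
}
```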
|
|
This permits CPUQuota to accept greater values as documented.
|
|
This setting adds minimal user namespacing support to a service. When set the invoked
processes will run in their own user namespace. Only a trivial mapping will be
set up: the root user/group is mapped to root, and the user/group of the
service will be mapped to itself, everything else is mapped to nobody.
If this setting is used the service runs with no capabilities on the host, but
configurable capabilities within the service.
This setting is particularly useful in conjunction with RootDirectory= as the
need to synchronize /etc/passwd and /etc/group between the host and the service
OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the
user of the service itself. But even outside the RootDirectory= case this
setting is useful to substantially reduce the attack surface of a service.
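The trivial mapping described above can be pictured in the line format the kernel uses for /proc/PID/uid_map and gid_map ("ID-inside ID-outside length"). The helper below is a hypothetical sketch that only formats such a map; whether the implementation builds its map exactly this way is an assumption. IDs that are not listed stay unmapped and typically show up as "nobody" inside the namespace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats a map with root mapped to root and the given ID mapped to itself.
 * Returns the number of bytes written (excluding the NUL), or -1 if buf is too small. */
int format_trivial_id_map(unsigned id, char *buf, size_t size) {
        int n;

        assert(buf);

        if (id == 0)
                n = snprintf(buf, size, "0 0 1\n");
        else
                n = snprintf(buf, size, "0 0 1\n%u %u 1\n", id, id);

        if (n < 0 || (size_t) n >= size)
                return -1;

        return n;
}
```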
Example command to test this:
systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh
This runs a shell as user "foobar". When typing "ps" only processes owned by
"root", by "foobar", and by "nobody" should be visible.
|
|
|
|
service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
systemd-run -p DynamicUser=1 /bin/sleep 99999
|
|
That way, we can neatly keep this in line with the new TasksMaxScale= option.
Note that we didn't release a version with MemoryLimitByPhysicalMemory= yet,
hence this change should be unproblematic without breaking API.
|
|
This adds support for a TasksMax=40% syntax for specifying values relative to
the system's configured maximum number of processes. This is useful in order to
neatly subdivide the available room for tasks within containers.
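A small sketch of how a percentage such as "40%" could be parsed and then scaled against a system-wide maximum. Both helpers are hypothetical, and the maximum is passed in as a parameter because the commit message does not say where it is read from.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses "N%" with 0 <= N <= 100. Returns N, or a negative errno-style value. */
int parse_percent_simple(const char *s) {
        char *end = NULL;
        long v;

        assert(s);

        if (*s < '0' || *s > '9')
                return -EINVAL; /* rejects empty input, signs and leading whitespace */

        errno = 0;
        v = strtol(s, &end, 10);
        if (errno != 0 || end[0] != '%' || end[1] != '\0')
                return -EINVAL;
        if (v > 100)
                return -ERANGE;

        return (int) v;
}

/* Computes floor(system_max * percent / 100) without overflowing for large maxima. */
uint64_t tasks_max_scale(unsigned percent, uint64_t system_max) {
        assert(percent <= 100);

        return system_max / 100 * percent + system_max % 100 * percent / 100;
}
```

With a maximum of 32768, "40%" works out to 13107 tasks in this model.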
|
|
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=. The previous names are kept
as aliases but they are not advertised in the documentation.
Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
|
|
Do not ellipsize cgroups when showing slices in --full mode
|
|
The unit files already accept relative, percent-based memory limit
specification, let's make sure "systemctl set-property" supports this too.
Since we want the physical memory size of the destination machine to apply we
pass the percentage in a new set of properties that only exist for this
purpose, and can only be set.
|
|
And port a couple of users over to it.
|
|
New exec boolean MemoryDenyWriteExecute, when set, installs
a seccomp filter to reject mmap(2) with PROT_WRITE|PROT_EXEC
and mprotect(2) with PROT_EXEC.
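The two rules can be modelled as a predicate over the protection bits, using the `PROT_*` constants from `<sys/mman.h>`. This sketch only mirrors the decision as described in the message; the real filter is a seccomp program, and the enum and function names here are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

typedef enum MemCall {
        MEM_CALL_MMAP,
        MEM_CALL_MPROTECT
} MemCall;

/* Returns true if the modelled filter would reject the call. */
bool mem_call_is_denied(MemCall call, int prot) {
        switch (call) {

        case MEM_CALL_MMAP:
                /* New mappings may not be writable and executable at the same time. */
                return (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC);

        case MEM_CALL_MPROTECT:
                /* Existing mappings may not be made executable after the fact. */
                return (prot & PROT_EXEC) != 0;
        }

        assert(!"unknown call type");
        return false;
}
```

Mapping a file with PROT_READ|PROT_EXEC, as the dynamic loader does for shared libraries, is not rejected in this model; only the writable-and-executable combination and later upgrades to executable are.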
|
|
Recently added cgroup unified hierarchy support uses "max" in configurations
for no upper limit. While consistent with what the kernel uses for no upper
limit, it is inconsistent with what systemd uses for other controllers such as
memory or pids. There's no point in introducing another term. Update cgroup
unified hierarchy support so that "infinity" is the only term that systemd
uses for no upper limit.
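A minimal sketch of a limit parser that follows this convention: "infinity" is the single spelling for "no upper limit" and is stored as a sentinel. The function name and the sentinel are assumptions, size suffixes are left out, and "max" is rejected here purely to illustrate the idea; whether the real parsers keep accepting it is not stated in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LIMIT_INFINITY UINT64_MAX /* illustrative sentinel for "no upper limit" */

/* Parses a plain byte count or "infinity". Returns 0 on success,
 * -EINVAL on malformed input, -ERANGE if the number does not fit. */
int parse_limit_bytes(const char *s, uint64_t *ret) {
        char *end = NULL;
        unsigned long long v;

        assert(s);
        assert(ret);

        if (strcmp(s, "infinity") == 0) {
                *ret = LIMIT_INFINITY;
                return 0;
        }

        if (*s < '0' || *s > '9')
                return -EINVAL; /* rejects empty input, signs and other keywords */

        errno = 0;
        v = strtoull(s, &end, 10);
        if (*end != '\0')
                return -EINVAL;
        if (errno == ERANGE || v >= LIMIT_INFINITY)
                return -ERANGE; /* keep the sentinel value reserved */

        *ret = (uint64_t) v;
        return 0;
}
```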
|
|
On the unified hierarchy, the memory controller implements three control knobs,
low, high and max, which enable more usable and versatile control over memory
usage. This patch implements support for the three control knobs.
* MemoryLow, MemoryHigh and MemoryMax are added for memory.low, memory.high and
memory.max, respectively.
* As all absolute limits on the unified hierarchy use "max" for no limit, make
memory limit parse functions accept "max" in addition to "infinity" and
document "max" for the new knobs.
* Implement compatibility translation between MemoryMax and MemoryLimit.
v2:
- Fixed missing else's in config_parse_memory_limit().
- Fixed missing newline when writing out drop-ins.
- Coding style updates to use "val > 0" instead of "val".
- Minor updates to documentation.
|
|
We have to pass addresses of changes and n_changes to
bus_deserialize_and_dump_unit_file_changes(). Otherwise we are hit by
missing information (subsequent calls to unit_file_changes_add() do
not add anything).
Also prevent null pointer dereference in
bus_deserialize_and_dump_unit_file_changes() by asserting.
Fixes #3339
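The underlying C pitfall is that a callee which grows an array has to receive the addresses of both the array pointer and the counter, otherwise the caller never sees the additions. A reduced sketch with invented types and names, asserting on its arguments in the same spirit as the fix:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Change {
        int type;
} Change;

/* Appends one entry. Takes the ADDRESSES of the array pointer and of the counter,
 * so that the reallocated array and the new count are visible to the caller. */
int changes_add(Change **changes, size_t *n_changes, int type) {
        Change *c;

        assert(changes);
        assert(n_changes);

        c = realloc(*changes, (*n_changes + 1) * sizeof(Change));
        if (!c)
                return -ENOMEM;

        c[*n_changes].type = type;
        *changes = c;
        (*n_changes)++;
        return 0;
}
```

Had the function taken `Change *changes` and `size_t n_changes` by value, each call would have updated only its local copies.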
|
|
Implement compat translation between IO* and BlockIO* settings
|
|
Adds support to core for systemd D-Bus clients to send the
`SELinuxContext` property. This means `systemd-run -p
SELinuxContext=foo` should now work.
|
|
Currently, there are two cgroup IO limits, bandwidth max for read and write,
and they are hard-coded in various places. This is fine for two limits but IO
is expected to grow more limits - low, high and max limits for bandwidth and
IOPS - and hard-coding each limit won't make sense.
This patch replaces hard-coded limits with an array indexed by
CGroupIOLimitType and accompanying string and default value tables so that new
limits can be added trivially.
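The table-driven layout described here might look roughly like the following. All identifiers are simplified stand-ins for this sketch rather than the ones used in the patch; the point is that adding a limit means adding one enum value plus one row in each table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum IOLimitType {
        IO_LIMIT_RBPS_MAX,       /* read bandwidth maximum */
        IO_LIMIT_WBPS_MAX,       /* write bandwidth maximum */
        IO_LIMIT_TYPE_MAX,       /* number of limit types */
        IO_LIMIT_TYPE_INVALID = -1
} IOLimitType;

/* Setting name per limit type (illustrative strings). */
static const char *const io_limit_names[IO_LIMIT_TYPE_MAX] = {
        [IO_LIMIT_RBPS_MAX] = "IOReadBandwidthMax",
        [IO_LIMIT_WBPS_MAX] = "IOWriteBandwidthMax",
};

/* Default per limit type; UINT64_MAX stands for "no limit" in this sketch. */
static const uint64_t io_limit_defaults[IO_LIMIT_TYPE_MAX] = {
        [IO_LIMIT_RBPS_MAX] = UINT64_MAX,
        [IO_LIMIT_WBPS_MAX] = UINT64_MAX,
};

/* All limits that apply to one device, indexed by limit type. */
typedef struct IODeviceLimit {
        const char *path;
        uint64_t limits[IO_LIMIT_TYPE_MAX];
} IODeviceLimit;

void io_device_limit_init(IODeviceLimit *l, const char *path) {
        assert(l);

        l->path = path;
        for (int t = 0; t < IO_LIMIT_TYPE_MAX; t++)
                l->limits[t] = io_limit_defaults[t];
}

IOLimitType io_limit_type_from_string(const char *s) {
        assert(s);

        for (int t = 0; t < IO_LIMIT_TYPE_MAX; t++)
                if (strcmp(io_limit_names[t], s) == 0)
                        return (IOLimitType) t;

        return IO_LIMIT_TYPE_INVALID;
}
```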
|
|
core: add io controller support on the unified hierarchy
|
|
That function doesn't draw anything on its own, just returns a string, which
sometimes is more than one character. Also remove "DRAW_" prefix from character
names, TREE_* and ARROW and BLACK_CIRCLE are unambiguous on their own, don't
draw anything, and are always used as an argument to special_glyph().
Rename "DASH" to "MDASH", as there's more than one type of dash.
|
|
On the unified hierarchy, blkio controller is renamed to io and the interface
is changed significantly.
* blkio.weight and blkio.weight_device are consolidated into io.weight which
uses the standardized weight range [1, 10000] with 100 as the default value.
* blkio.throttle.{read|write}_{bps|iops}_device are consolidated into io.max.
Expansion of throttling features is being worked on to support
work-conserving absolute limits (io.low and io.high).
* All stats are consolidated into io.stat.
This patchset adds support for the new interface. As the interface has been
revamped and new features are expected to be added, it seems best to treat it
as a separate controller rather than trying to expand the blkio settings
although we might add automatic translation if only blkio settings are
specified.
* io.weight handling is mostly identical to blkio.weight[_device] handling
except that the weight range is different.
* Both read and write bandwidth settings are consolidated into
CGroupIODeviceLimit which describes all limits applicable to the device.
This makes it less painful to add new limits.
* "max" can be used to specify the maximum limit which is equivalent to no
config for max limits and treated as such. If a given CGroupIODeviceLimit
doesn't contain any non-default configs, the config struct is discarded once
the no limit config is applied to cgroup.
* lookup_blkio_device() is renamed to lookup_block_device().
Signed-off-by: Tejun Heo <htejun@fb.com>
|
|
bus_append_unit_property_assignment() was missing an argument for
sd_bus_message_append() when processing BlockIODeviceWeight leading to
segfault. Fix it.
Signed-off-by: Tejun Heo <htejun@fb.com>
|
|
It was incorrectly using cg_cpu_weight_parse() to parse BlockIOWeight. Update
it to use cg_blkio_weight_parse() instead.
Signed-off-by: Tejun Heo <htejun@fb.com>
|
|
The "resources" error is really just the generic error we return when
we hit some kind of error and we have no more appropriate error for the case to
return, for example because of some OS error.
Hence, reword the explanation and don't claim any relation to resource limits.
Admittedly, the "resources" service error is a bit of a misnomer, but I figure
it's kind of API now.
Fixes: #2716
|
|
Previously we'd have generally useful sd-bus utilities in bus-util.h,
intermixed with code that is specifically for writing clients for PID 1,
wrapping job and unit handling. Let's split the latter out and move it into
bus-unit-util.c, to make the sources a bit shorter and easier to grok.
|
|
This adds a new GetProcesses() bus call to the Unit object which returns an
array consisting of all PIDs, their process names, as well as their full cgroup
paths. This is then used by "systemctl status" to show the per-unit process
tree.
This has the benefit that the client-side no longer needs to access the
cgroupfs directly to show the process tree of a unit. Instead, it now uses this
new API, which means it also works correctly if -H or -M are used, as the
information from the specific host is used, and not the one from the local
system.
Fixes: #2945
|