Age | Commit message (Collapse) | Author |
|
is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
|
|
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
|
|
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.
With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
|
|
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.
Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
|
|
Move out mount calculation on its own function. Actually the logic is
smart enough to later drop nop and duplicates mounts, this change
improves code readability.
---
src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 36 insertions(+), 11 deletions(-)
|
|
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
|
|
|
|
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the
main execution logic a bit.
|
|
There's no point in mounting these, if they are outside of the root directory
we'll move to.
|
|
|
|
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
|
|
already is one
Let's not stack mounts needlessly.
|
|
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.
(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)
Fixes: #3867
|
|
Let's create the new namespace only after we validated and processed all
parameters, right before we start with actually mounting things.
This way, the window where we can roll back is larger (not that it matters
IRL...)
|
|
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
|
|
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.
This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
|
|
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.
In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.
While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad38cd37b660ccd5394ff5c171a5e5355 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
|
|
|
|
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.
This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.
With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.
This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.
This is not only the safer option, but also more in-line with what the man page
currently claims:
"Entries (files or directories) listed in ReadWritePaths= are
accessible from within the namespace with the same access rights as
from outside."
To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.
A number of functions are updated to add more debug logging to make this more
digestable.
|
|
If /foo is marked to be read-only, and /foo/bar too, then the latter may be
suppressed as it has no effect.
|
|
|
|
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
|
|
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
|
|
|
|
If a dir is marked to be inaccessible then everything below it should be masked
by it.
|
|
ProtectControlGroups=
If enabled, these will block write access to /sys, /proc/sys and
/proc/sys/fs/cgroup.
|
|
Let's make sure that all our rules apply to all archs the local kernel
supports.
|
|
In preparation for adding a version which takes a strv.
|
|
Drop more kdbus functionality
|
|
Previous fix didn't consider handling multiple ExecStop commands.
|
|
If the unit is in the dbus queue when it is removed then the last change
signal is never sent. Fix this by checking the dbus queue and explicitly
send the change signal before sending the remove signal.
|
|
|
|
ENOTCONN may be a legitimate return code if the endpoint disappeared,
but the service should still attempt to shutdown cleanly.
|
|
In https://github.com/systemd/systemd/pull/4004 , a runtime detection
method for seccomp was added. However, it does not detect the case
where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible
if the architecture does not support filtering yet.
Add a check for that case too.
While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl,
as that should save a few system calls and (unnecessary) allocations.
Previously, reading of /proc/self/stat was done as recommended by
prctl(2) as safer. However, given that we need to do the prctl call
anyway, lets skip opening, reading and parsing the file.
Code for checking inspired by
https://outflux.net/teach-seccomp/autodetect.html
|
|
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
|
|
Resolves #3534
|
|
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
|
|
|
|
"-f" switch
|
|
Resolves #3534
|
|
"-l" switch (#3827)
|
|
permit bus clients to pin units to avoid automatic GC
|
|
functions (#4019)
Prevents discard-qualifiers warnings when the passed variable was const
|
|
Fixes #3882
|
|
|
|
It is useful for clients to be able to read the last CPU usage counter value of
a unit even if the unit is already terminated. Hence, before destroying a
cgroup's cgroup cache the last CPU usage counter and return it if the cgroup is
gone.
|
|
This adds two (privileged) bus calls Ref() and Unref() to the Unit interface.
The two calls may be used by clients to pin a unit into memory, so that various
runtime properties aren't flushed out by the automatic GC. This is necessary
to permit clients to race-freely acquire runtime results (such as process exit
status/code or accumulated CPU time) on successful service termination.
Ref() and Unref() are fully recursive, hence act like the usual reference
counting concept in C. Taking a reference is a privileged operation, as this
allows pinning units into memory which consumes resources.
Transient units may also gain a reference at the time of creation, via the new
AddRef property (that is only defined for transient units at the time of
creation).
|
|
|
|
Rework console color setup
|
|
Journald dynamic users
|