Age | Commit message (Collapse) | Author |
|
This certainly fixes a bug that was introduced by PR
https://github.com/systemd/systemd/pull/4594 that intended to fix
https://github.com/systemd/systemd/issues/4567.
The fix was not complete. This patch makes sure that we count and free
all paths that fail inside chase_all_symlinks().
Fixes https://github.com/systemd/systemd/issues/4567
|
|
(#4596)
This adds a variable that is always set to false to make sure that
protect paths inside sandbox are always enforced and not ignored. The only
case when it is set to true is on DynamicUser=no and RootDirectory=/chroot
is set. This allows users to use more our sandbox features inside RootDirectory=
The only exception is ProtectSystem=full|strict and when DynamicUser=yes
is implied. Currently RootDirectory= is not fully compatible with these
due to two reasons:
* /chroot/usr|etc has to be present on ProtectSystem=full
* /chroot// has to be a mount point on ProtectSystem=strict.
|
|
core: add new RestrictNamespaces= unit file setting
Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.
|
|
Format string tweaks (and a small fix on 32bit)
|
|
Remove FOREACH_WORD_QUOTED
|
|
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
|
|
|
|
|
|
Instead of having two fields inside BindMount struct where one is stack
based and the other one is heap, use one field to store the full path
and updated it when we chase symlinks. This way we avoid dealing with
both at the same time.
This makes RootDirectory= work with ProtectHome= and ProtectKernelModules=yes
Fixes: https://github.com/systemd/systemd/issues/4567
|
|
|
|
and over
|
|
|
|
It's the default, and NULL is shorter.
|
|
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
|
|
Tree wide cleanups
|
|
is set
Make sure that when DynamicUser= is set that we intialize the user
supplementary groups and that we also support SupplementaryGroups=
Fixes: https://github.com/systemd/systemd/issues/4539
Thanks Evgeny Vereshchagin (@evverx)
|
|
This reverts some changes introduced in d054f0a4d4.
xsprintf should be used in cases where we calculated the right buffer
size by hand (using DECIMAL_STRING_MAX and such), and never in cases where
we are printing externally specified strings of arbitrary length.
Fixes #4534.
|
|
Add "perpetual" unit concept, sysctl fixes, networkd fixes, systemctl color fixes, nspawn discard.
|
|
|
|
overly long description strings
This essentially reverts one part of d054f0a4d451120c26494263fc4dc175bfd405b1.
(We might also choose to use proper ellipsation here, but I wasn't sure the
memory allocation this requires wouöld be a good idea here...)
Fixes: #4534
|
|
Preserve stored fds over service restart
|
|
more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec
|
|
Since service_add_fd_store() already does the check, remove the redundant check
from service_add_fd_store_set().
Also, print a warning when repopulating FDStore after daemon-reexec and we hit
the limit. This is a user visible issue, so we should not discard fds silently.
(Note that service_deserialize_item is impacted by the return value from
service_add_fd_store(), but we rely on the general error message, so the caller
does not need to be modified, and does not show up in the diff.)
|
|
Let's propagate the error here, instead of eating it up early.
In a later change we should probably also change mount_enumerate() to propagate
errors up, but that would mean we'd have to change the unit vtable, and thus
change all unit types, hence is quite an invasive change.
|
|
|
|
Now that have a proper concept of "perpetual" units, let's make the root mount
one too, since it also cannot go away.
|
|
So far "no_gc" was set on -.slice and init.scope, to units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more funcitonality off it. Specifically, refuse
enqueing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
|
|
(#4533)
Always initialize the supplementary groups of caller before checking the
unit SupplementaryGroups= option.
Fixes https://github.com/systemd/systemd/issues/4531
|
|
Seccomp is generally an unprivileged operation, changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.
This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).
This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.
Fixes: #3993
|
|
We would close all the stored fds in service_release_resources(), which of
course broke the whole concept of storing fds over service restart.
Fixes #4408.
|
|
Should help with debugging #4408.
|
|
If it was a duplicate, log nothing.
|
|
seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
|
|
Document NoNewPrivileges default value
|
|
|
|
This makes applying groups after applying the working directory, this
may allow some flexibility but at same it is not a big deal since we
don't execute or do anything between applying working directory and
droping groups.
|
|
Improve apply_working_directory() and lets get the current working directory
inside of it.
|
|
|
|
|
|
shmat(..., SHM_EXEC) can be used to create writable and executable
memory, so let's block it when MemoryDenyWriteExecute is set.
|
|
Various seccomp fixes and NEWS update.
|
|
callbacks
Previously, we'd synthesize the root slice unit and the init scope unit in the
enumerator callbacks for the unit type. This is problematic if either of them
is already referenced from a unit that is loaded as result of another unit
type's enumerator logic.
Let's clean this up and simply create the two objects from the enumerator
callbacks, if they are not around yet. Do the actual filling in of the settings
from the unit_load() callbacks, to match how other units are loaded.
Fixes: #4322
|
|
This allows us to unify most of the code in apply_protect_kernel_modules() and
apply_private_devices().
|
|
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.
Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.
This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
|
|
seccomp_add_syscall_filter_set()
Let's simplify this call, by making use of the new infrastructure.
This is actually more in line with Djalal's original patch but instead of
search the filter set in the array by its name we can now use the set index and
jump directly to it.
|
|
A variety of fixes:
- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
instance of it (the syscall_filter_sets[] array) used to abbreviate
"SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
mix and match too wildly. Let's pick the shorter name in this case, as it is
sufficiently well established to not confuse hackers reading this.
- Export explicit indexes into the syscall_filter_sets[] array via an enum.
This way, code that wants to make use of a specific filter set, can index it
directly via the enum, instead of having to search for it. This makes
apply_private_devices() in particular a lot simpler.
- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
seccomp object.
- Update SystemCallFilter= parser to use extract_first_word(). Let's work on
deprecating FOREACH_WORD_QUOTED().
- Simplify apply_private_devices() using this functionality
|
|
|
|
Let's prefer early-exit over deep-indented if blocks. Not behavioural change.
|
|
Commandline parsing simplification and udev fix
|
|
shared, systemctl: teach is-enabled to show install targets
|