summaryrefslogtreecommitdiff
path: root/src/core
AgeCommit message (Collapse)Author
2016-10-07core: only warn on short reads on signal fdZbigniew Jędrzejewski-Szmek
2016-10-07manager: tighten incoming notification message checksLennart Poettering
Let's not accept datagrams with embedded NUL bytes. Previously we'd simply ignore everything after the first NUL byte. But given that sending us that is pretty ugly let's instead complain and refuse. With this change we'll only accept messages that have exactly zero or one NUL bytes at the very end of the datagram.
2016-10-07manager: be stricter with incomining notifications, warn properly about too ↵Lennart Poettering
large ones Let's make the kernel let us know the full, original datagram size of the incoming message. If it's larger than the buffer space provided by us, drop the whole message with a warning. Before this change the kernel would truncate the message for us to the buffer space provided, and we'd not complain about this, and simply process the incomplete message as far as it made sense.
2016-10-07manager: don't ever busy loop when we get a notification message we can't ↵Lennart Poettering
process If the kernel doesn't permit us to dequeue/process an incoming notification datagram message it's still better to stop processing the notification messages altogether than to enter a busy loop where we keep getting notified but can't do a thing about it. With this change, manager_dispatch_notify_fd() behaviour is changed like this: - if an error indicating a spurious wake-up is seen on recvmsg(), ignore it (EAGAIN/EINTR) - if any other error is seen on recvmsg() propagate it, thus disabling processing of further wakeups - if any error is seen on later code in the function, warn about it but do not propagate it, as in this cas we're not going to busy loop as the offending message is already dequeued.
2016-10-06core: add possibility to set action for ctrl-alt-del burst (#4105)Lukáš Nykrýn
For some certification, it should not be possible to reboot the machine through ctrl-alt-delete. Currently we suggest our customers to mask the ctrl-alt-delete target, but that is obviously not enough. Patching the keymaps to disable that is really not a way to go for them, because the settings need to be easily checked by some SCAP tools.
2016-10-06user-util: rework maybe_setgroups() a bitLennart Poettering
Let's drop the caching of the setgroups /proc field for now. While there's a strict regime in place when it changes states, let's better not cache it since we cannot really be sure we follow that regime correctly. More importantly however, this is not in performance sensitive code, and there's no indication the cache is really beneficial, hence let's drop the caching and make things a bit simpler. Also, while we are at it, rework the error handling a bit, and always return negative errno-style error codes, following our usual coding style. This has the benefit that we can sensible hanld read_one_line_file() errors, without having to updat errno explicitly.
2016-10-06core: leave PAM stub process around with GIDs updatedLennart Poettering
In the process execution code of PID 1, before 096424d1230e0a0339735c51b43949809e972430 the GID settings where changed before invoking PAM, and the UID settings after. After the change both changes are made after the PAM session hooks are run. When invoking PAM we fork once, and leave a stub process around which will invoke the PAM session end hooks when the session goes away. This code previously was dropping the remaining privs (which were precisely the UID). Fix this code to do this correctly again, by really dropping them else (i.e. the GID as well). While we are at it, also fix error logging of this code. Fixes: #4238
2016-10-06core: do not fail in a container if we can't use setgroupsGiuseppe Scrivano
It might be blocked through /proc/PID/setgroups
2016-10-05Fix typoGiuseppe Scrivano
2016-10-04tree-wide: remove consecutive duplicate words in commentsStefan Schweter
2016-10-04automount: make sure the expire event is restarted after a daemon-reload (#4265)Michael Olbrich
If the corresponding mount unit is deserialized after the automount unit then the expire event is set up in automount_trigger_notify(). However, if the mount unit is deserialized first then the automount unit is still in state AUTOMOUNT_DEAD and automount_trigger_notify() aborts without setting up the expire event. Explicitly call automount_start_expire() during coldplug to make sure that the expire event is set up as necessary. Fixes #4249.
2016-10-01core: do not try to create /run/systemd/transient in test modeZbigniew Jędrzejewski-Szmek
This prevented systemd-analyze from unprivileged operation on older systemd installations, which should be possible. Also, we shouldn't touch the file system in test mode even if we can.
2016-10-01core: complain if Before= dep on .device is declaredZbigniew Jędrzejewski-Szmek
[Unit] Before=foobar.device [Service] ExecStart=/bin/true Type=oneshot $ systemd-analyze verify before-device.service before-device.service: Dependency Before=foobar.device ignored (.device units cannot be delayed)
2016-10-01core: update warning messageZbigniew Jędrzejewski-Szmek
"closing all" might suggest that _all_ fds received with the notification message will be closed. Reword the message to clarify that only the "unused" ones will be closed.
2016-10-01core: get rid of unneeded state variableZbigniew Jędrzejewski-Szmek
No functional change.
2016-09-29pid1: more informative error message for ignored notificationsZbigniew Jędrzejewski-Szmek
It's probably easier to diagnose a bad notification message if the contents are printed. But still, do anything only if debugging is on.
2016-09-29pid1: process zero-length notification messages againZbigniew Jędrzejewski-Szmek
This undoes 531ac2b234. I acked that patch without looking at the code carefully enough. There are two problems: - we want to process the fds anyway - in principle empty notification messages are valid, and we should process them as usual, including logging using log_unit_debug().
2016-09-29pid1: don't return any error in manager_dispatch_notify_fd() (#4240)Franck Bui
If manager_dispatch_notify_fd() fails and returns an error then the handling of service notifications will be disabled entirely leading to a compromised system. For example pid1 won't be able to receive the WATCHDOG messages anymore and will kill all services supposed to send such messages.
2016-09-29If the notification message length is 0, ignore the message (#4237)Jorge Niedbalski
Fixes #4234. Signed-off-by: Jorge Niedbalski <jnr@metaklass.org>
2016-09-28Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1Evgeny Vereshchagin
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
2016-09-26core: Fix USB functionfs activation and clarify its documentation (#4188)Paweł Szewczyk
There was no certainty about how the path in service file should look like for usb functionfs activation. Because of this it was treated differently in different places, which made this feature unusable. This patch fixes the path to be the *mount directory* of functionfs, not ep0 file path and clarifies in the documentation that ListenUSBFunction should be the location of functionfs mount point, not ep0 file itself.
2016-09-25core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= ↵Djalal Harouni
is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.
2016-09-25core:namespace: simplify ProtectHome= implementationDjalal Harouni
As with previous patch simplify ProtectHome and don't care about duplicates, they will be sorted by most restrictive mode and cleaned.
2016-09-25core: simplify ProtectSystem= implementationDjalal Harouni
ProtectSystem= with all its different modes and other options like PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal, however currently it's a bit hard to parse that from the implementation view. Simplify it by giving each mode its own table with all paths and references to other Protect options. With this change some entries are duplicated, but we do not care since duplicate mounts are first sorted by the most restrictive mode then cleaned.
2016-09-25core:sandbox: add more /proc/* entries to ProtectKernelTunables=Djalal Harouni
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface, filesystems configuration and IRQ tuning readonly. Most of these interfaces now days should be in /sys but they are still available through /proc, so just protect them. This patch does not touch /proc/net/...
2016-09-25core:namespace: simplify mount calculationDjalal Harouni
Move out mount calculation on its own function. Actually the logic is smart enough to later drop nop and duplicates mounts, this change improves code readability. --- src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++----------- 1 file changed, 36 insertions(+), 11 deletions(-)
2016-09-25core:namespace: put paths protected by ProtectKernelTunables= inDjalal Harouni
Instead of having all these paths everywhere, put the ones that are protected by ProtectKernelTunables= into their own table. This way it is easy to add paths and track which ones are protected.
2016-09-25core:namespace: minor improvements to append_mounts()Djalal Harouni
2016-09-25execute: move SMACK setup code into its own functionLennart Poettering
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the main execution logic a bit.
2016-09-25namespace: drop all mounts outside of the new root directoryLennart Poettering
There's no point in mounting these, if they are outside of the root directory we'll move to.
2016-09-25main: minor simplificationLennart Poettering
2016-09-25execute: filter low-level I/O syscalls if PrivateDevices= is setLennart Poettering
If device access is restricted via PrivateDevices=, let's also block the various low-level I/O syscalls at the same time, so that we know that the minimal set of devices in our virtualized /dev are really everything the unit can access.
2016-09-25namespace: don't make the root directory of a namespace a mount if it ↵Lennart Poettering
already is one Let's not stack mounts needlessly.
2016-09-25namespace: chase symlinks for mounts to set up in userspaceLennart Poettering
This adds logic to chase symlinks for all mount points that shall be created in a namespace environment in userspace, instead of leaving this to the kernel. This has the advantage that we can correctly handle absolute symlinks that shall be taken relative to a specific root directory. Moreover, we can properly handle mounts created on symlinked files or directories as we can merge their mounts as necessary. (This also drops the "done" flag in the namespace logic, which was never actually working, but was supposed to permit a partial rollback of the namespace logic, which however is only mildly useful as it wasn't clear in which case it would or would not be able to roll back.) Fixes: #3867
2016-09-25namespace: invoke unshare() only after checking all parametersLennart Poettering
Let's create the new namespace only after we validated and processed all parameters, right before we start with actually mounting things. This way, the window where we can roll back is larger (not that it matters IRL...)
2016-09-25execute: drop group priviliges only after setting up namespaceLennart Poettering
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev that should be owned by the host's root, hence let's make sure we set up the namespace before dropping group privileges.
2016-09-25core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1Lennart Poettering
Let's make sure that services that use DynamicUser=1 cannot leave files in the file system should the system accidentally have a world-writable directory somewhere. This effectively ensures that directories need to be whitelisted rather than blacklisted for access when DynamicUser=1 is set.
2016-09-25core: introduce ProtectSystem=strictLennart Poettering
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a new setting "strict". If set, the entire directory tree of the system is mounted read-only, but the API file systems /proc, /dev, /sys are excluded (they may be managed with PrivateDevices= and ProtectKernelTunables=). Also, /home and /root are excluded as those are left for ProtectHome= to manage. In this mode, all "real" file systems (i.e. non-API file systems) are mounted read-only, and specific directories may only be excluded via ReadWriteDirectories=, thus implementing an effective whitelist instead of blacklist of writable directories. While we are at, also add /efi to the list of paths always affected by ProtectSystem=. This is a follow-up for b52a109ad38cd37b660ccd5394ff5c171a5e5355 which added /efi as alternative for /boot. Our namespacing logic should respect that too.
2016-09-25namespace: add some debug logging when enforcing InaccessiblePaths=Lennart Poettering
2016-09-25namespace: rework how ReadWritePaths= is appliedLennart Poettering
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths= specification, then we'd first recursively apply the ReadOnlyPaths= paths, and make everything below read-only, only in order to then flip the read-only bit again for the subdirs listed in ReadWritePaths= below it. This is not only ugly (as for the dirs in question we first turn on the RO bit, only to turn it off again immediately after), but also problematic in containers, where a container manager might have marked a set of dirs read-only and this code will undo this is ReadWritePaths= is set for any. With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be applied to the children listed in ReadWritePaths= in the first place, so that we do not need to turn off the RO bit for those after all. This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the RO bit, but never to turn it off again. Or to say this differently: if some dirs are marked read-only via some external tool, then ReadWritePaths= will not undo it. This is not only the safer option, but also more in-line with what the man page currently claims: "Entries (files or directories) listed in ReadWritePaths= are accessible from within the namespace with the same access rights as from outside." To implement this change bind_remount_recursive() gained a new "blacklist" string list parameter, which when passed may contain subdirs that shall be excluded from the read-only mounting. A number of functions are updated to add more debug logging to make this more digestable.
2016-09-25namespace: when enforcing fs namespace restrictions suppress redundant mountsLennart Poettering
If /foo is marked to be read-only, and /foo/bar too, then the latter may be suppressed as it has no effect.
2016-09-25namespace: simplify mount_path_compare() a bitLennart Poettering
2016-09-25execute: if RuntimeDirectory= is set, it should be writableLennart Poettering
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept otherwise makes no sense.
2016-09-25execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.cLennart Poettering
This adds a new call get_user_creds_clean(), which is just like get_user_creds() but returns NULL in the home/shell parameters if they contain no useful information. This code previously lived in execute.c, but by generalizing this we can reuse it in run.c.
2016-09-25execute: split out creation of runtime dirs into its own functionsLennart Poettering
2016-09-25namespace: make sure InaccessibleDirectories= masks all mounts further downLennart Poettering
If a dir is marked to be inaccessible then everything below it should be masked by it.
2016-09-25core: add two new service settings ProtectKernelTunables= and ↵Lennart Poettering
ProtectControlGroups= If enabled, these will block write access to /sys, /proc/sys and /proc/sys/fs/cgroup.
2016-09-25core: enforce seccomp for secondary archs too, for all rulesLennart Poettering
Let's make sure that all our rules apply to all archs the local kernel supports.
2016-09-16tree-wide: rename config_parse_many to …_nulstrZbigniew Jędrzejewski-Szmek
In preparation for adding a version which takes a strv.
2016-09-10Merge pull request #4119 from keszybz/drop-more-kdbusEvgeny Vereshchagin
Drop more kdbus functionality