summaryrefslogtreecommitdiff
path: root/src/core/execute.c
AgeCommit message (Collapse)Author
2016-12-13Merge pull request #4806 from poettering/keyring-initZbigniew Jędrzejewski-Szmek
set up a per-service session kernel keyring, and store the invocation ID in it
2016-12-14core: add ability to define arbitrary bind mounts for servicesLennart Poettering
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow defining arbitrary bind mounts specific to particular services. This is particularly useful for services with RootDirectory= set as this permits making specific bits of the host directory available to chrooted services. The two new settings follow the concepts nspawn already possess in --bind= and --bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these latter options should probably be renamed to BindPaths= and BindReadOnlyPaths= too). Fixes: #3439
2016-12-13core: store the invocation ID in the per-service keyringLennart Poettering
Let's store the invocation ID in the per-service keyring as a root-owned key, with strict access rights. This has the advantage over the environment-based ID passing that it also works from SUID binaries (as they key cannot be overidden by unprivileged code starting them), in contrast to the secure_getenv() based mode. The invocation ID is now passed in three different ways to a service: - As environment variable $INVOCATION_ID. This is easy to use, but may be overriden by unprivileged code (which might be a bad or a good thing), which means it's incompatible with SUID code (see above). - As extended attribute on the service cgroup. This cannot be overriden by unprivileged code, and may be queried safely from "outside" of a service. However, it is incompatible with containers right now, as unprivileged containers generally cannot set xattrs on cgroupfs. - As "invocation_id" key in the kernel keyring. This has the benefit that the key cannot be changed by unprivileged service code, and thus is safe to access from SUID code (see above). But do note that service code can replace the session keyring with a fresh one that lacks the key. However in that case the key will not be owned by root, which is easily detectable. The keyring is also incompatible with containers right now, as it is not properly namespace aware (but this is being worked on), and thus most container managers mask the keyring-related system calls. Ideally we'd only have one way to pass the invocation ID, but the different ways all have limitations. The invocation ID hookup in journald is currently only available on the host but not in containers, due to the mentioned limitations. How to verify the new invocation ID in the keyring: # systemd-run -t /bin/sh Running as unit: run-rd917366c04f847b480d486017f7239d6.service Press ^] three times within 1s to disconnect TTY. # keyctl show Session Keyring 680208392 --alswrv 0 0 keyring: _ses 250926536 ----s-rv 0 0 \_ user: invocation_id # keyctl request user invocation_id 250926536 # keyctl read 250926536 16 bytes of data in key: 9c96317c ac64495a a42b9cd7 4f3ff96b # echo $INVOCATION_ID 9c96317cac64495aa42b9cd74f3ff96b # ^D This creates a new transient service runnint a shell. Then verifies the contents of the keyring, requests the invocation ID key, and reads its payload. For comparison the invocation ID as passed via the environment variable is also displayed.
2016-12-13core: run each system service with a fresh session keyringLennart Poettering
This patch ensures that each system service gets its own session kernel keyring automatically, and implicitly. Without this a keyring is allocated for it on-demand, but is then linked with the user's kernel keyring, which is OK behaviour for logged in users, but not so much for system services. With this change each service gets a session keyring that is specific to the service and ceases to exist when the service is shut down. The session keyring is not linked up with the user keyring and keys hence only search within the session boundaries by default. (This is useful in a later commit to store per-service material in the keyring, for example the invocation ID) (With input from David Howells)
2016-11-18Merge pull request #4538 from fbuihuu/confirm-spawn-fixesLennart Poettering
Confirm spawn fixes/enhancements
2016-11-17core: in confirm spawn, suggest 'f' when user selects 'n' choiceFranck Bui
2016-11-17core: confirm_spawn: always accept units with same_pgrp set for nowFranck Bui
For some reasons units remaining in the same process group as PID 1 (same_pgrp=true) fail to acquire the console even if it's not taken by anyone. So always accept for units with same_pgrp set for now.
2016-11-17core: include the unit name when notifying that a confirmation question ↵Franck Bui
timed out
2016-11-17core: add 'c' in confirmation_spawn to resume the boot processFranck Bui
2016-11-17core: add 'j' in confirmation_spawn to list the jobs that are in progressFranck Bui
2016-11-17core: add 'D' in confirmat spawn to show a full dump of the unit to spawnFranck Bui
2016-11-17core: add 'i' in confirm spawn to give a short summary of the unit to spawnFranck Bui
2016-11-17core: rework the confirmation spawn promptFranck Bui
Previously it was "[Yes, Fail, Skip]" which is pretty misleading because it suggests that the whole word needs to be entered instead of a single char. Also this won't fit well when we'll extend the number of choices. This patch addresses this by changing the choice hint with "[y, f, s – h for help]" so it's now clear that a single letter has to be entered. It also introduces a new choice 'h' which describes all possible choices since a single letter can be not descriptive enough for new users. It also allow to stick with the same hint string regardless of how many choices we will support.
2016-11-17core: limit the length of the confirmation questionFranck Bui
When "confirmation_spawn=1", the confirmation question can look like: Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/kmod.conf? [Yes, No, Skip] which is pretty verbose and might not fit in the console width size (which is usually 80 chars) and thus question will be splitted into 2 consecutive lines. However since the question is now refreshed every 2 secs, the reprinted question will overwrite the second line of the previous one... To prevent this, this patch makes sure that the command line won't be longer than 60 chars by ellipsizing it if the command is longer: Execute /usr/bin/kmod static-nodes --format=tmpfiles --output=/ru…nf? [Yes, No, View, Skip] A following patch will introduce a new choice that will allow the user to get details on the command to be executed so it will still be possible to see the full command line.
2016-11-17core: in confirm_spawn, the meaning of 'n' and 's' choices are confusingFranck Bui
Before this patch we had: - "no" which gives "failing execution" but the command is actually assumed as succeed. - "skip" which gives "skipping", but the command is assumed to have failed, which ends up with "Failed to start ..." on the console. Now we have: - "fail" which gives "failing execution" and the command is indeed assumed as failed. - "skip" which gives "skipping execution" and the command is assumed as succeed.
2016-11-17core: rework ask_for_confirmation()Franck Bui
Now the reponses are handled by ask_for_confirmation() as well as the report of any errors occuring during the process of retrieving the confirmation response. One benefit of this is that there's no need to open/close the console one more time when reporting error/status messages. The caller now just needs to care about the return values whose meanings are: - don't execute and pretend that the command failed - don't execute and pretend that the command succeeed - positive answer, execute the command Also some slight code reorganization and introduce write_confirm_error() and write_confirm_error_fd(). write_confim_message becomes unneeded.
2016-11-17core: allow to redirect confirmation messages to a different consoleFranck Bui
It's rather hard to parse the confirmation messages (enabled with systemd.confirm_spawn=true) amongst the status messages and the kernel ones (if enabled). This patch gives the possibility to the user to redirect the confirmation message to a different virtual console, either by giving its name or its path, so those messages are separated from the other ones and easier to read.
2016-11-15core: improve the logic that implies no new privilegesDjalal Harouni
The no_new_privileged_set variable is not used any more since commit 9b232d3241fcfbf60af that fixed another thing. So remove it. Also no need to check if we are under user manager, remove that part too.
2016-11-08core: on DynamicUser= make sure that protecting sensitive paths is enforced ↵Djalal Harouni
(#4596) This adds a variable that is always set to false to make sure that protect paths inside sandbox are always enforced and not ignored. The only case when it is set to true is on DynamicUser=no and RootDirectory=/chroot is set. This allows users to use more our sandbox features inside RootDirectory= The only exception is ProtectSystem=full|strict and when DynamicUser=yes is implied. Currently RootDirectory= is not fully compatible with these due to two reasons: * /chroot/usr|etc has to be present on ProtectSystem=full * /chroot// has to be a mount point on ProtectSystem=strict.
2016-11-08Merge pull request #4536 from poettering/seccomp-namespacesZbigniew Jędrzejewski-Szmek
core: add new RestrictNamespaces= unit file setting Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.
2016-11-07Rename formats-util.h to format-util.hZbigniew Jędrzejewski-Szmek
We don't have plural in the name of any other -util files and this inconsistency trips me up every time I try to type this file name from memory. "formats-util" is even hard to pronounce.
2016-11-04core: add new RestrictNamespaces= unit file settingLennart Poettering
This new setting permits restricting whether namespaces may be created and managed by processes started by a unit. It installs a seccomp filter blocking certain invocations of unshare(), clone() and setns(). RestrictNamespaces=no is the default, and does not restrict namespaces in any way. RestrictNamespaces=yes takes away the ability to create or manage any kind of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces so that only mount and IPC namespaces may be created/managed, but no other kind of namespaces. This setting should be improve security quite a bit as in particular user namespacing was a major source of CVEs in the kernel in the past, and is accessible to unprivileged processes. With this setting the entire attack surface may be removed for system services that do not make use of namespaces.
2016-11-03Merge pull request #4510 from keszybz/tree-wide-cleanupsLennart Poettering
Tree wide cleanups
2016-11-03core: intialize user aux groups and SupplementaryGroups= when DynamicUser= ↵Djalal Harouni
is set Make sure that when DynamicUser= is set that we intialize the user supplementary groups and that we also support SupplementaryGroups= Fixes: https://github.com/systemd/systemd/issues/4539 Thanks Evgeny Vereshchagin (@evverx)
2016-11-02Merge pull request #4483 from poettering/exec-orderLennart Poettering
more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec
2016-11-02core: initialize groups list before checking SupplementaryGroups= of a unit ↵Djalal Harouni
(#4533) Always initialize the supplementary groups of caller before checking the unit SupplementaryGroups= option. Fixes https://github.com/systemd/systemd/issues/4531
2016-11-02execute: apply seccomp filters after changing selinux/aa/smack contextsLennart Poettering
Seccomp is generally an unprivileged operation, changing security contexts is most likely associated with some form of policy. Moreover, while seccomp may influence our own flow of code quite a bit (much more than the security context change) make sure to apply the seccomp filters immediately before executing the binary to invoke. This also moves enforcement of NNP after the security context change, so that NNP cannot affect it anymore. (However, the security policy now has to permit the NNP change). This change has a good chance of breaking current SELinux/AA/SMACK setups, because the policy might not expect this change of behaviour. However, it's technically the better choice I think and should hence be applied. Fixes: #3993
2016-10-28Merge pull request #4495 from topimiettinen/block-shmat-execDjalal Harouni
seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
2016-10-27core: make unit argument const for apply seccomp functionsDjalal Harouni
2016-10-27core: lets apply working directory just after mount namespacesDjalal Harouni
This makes applying groups after applying the working directory, this may allow some flexibility but at same it is not a big deal since we don't execute or do anything between applying working directory and droping groups.
2016-10-27core: get the working directory value inside apply_working_directory()Djalal Harouni
Improve apply_working_directory() and lets get the current working directory inside of it.
2016-10-27core: move apply working directory code into its own apply_working_directory()Djalal Harouni
2016-10-27core: move the code that setups namespaces on its own functionDjalal Harouni
2016-10-26seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecuteTopi Miettinen
shmat(..., SHM_EXEC) can be used to create writable and executable memory, so let's block it when MemoryDenyWriteExecute is set.
2016-10-24seccomp: add new helper call seccomp_load_filter_set()Lennart Poettering
This allows us to unify most of the code in apply_protect_kernel_modules() and apply_private_devices().
2016-10-24seccomp: add new seccomp_init_conservative() helperLennart Poettering
This adds a new seccomp_init_conservative() helper call that is mostly just a wrapper around seccomp_init(), but turns off NNP and adds in all secondary archs, for best compatibility with everything else. Pretty much all of our code used the very same constructs for these three steps, hence unifying this in one small function makes things a lot shorter. This also changes incorrect usage of the "scmp_filter_ctx" type at various places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type (pretty poor choice already!) that casts implicitly to and from all other pointer types (even poorer choice: you defined a confusing type now, and don't even gain any bit of type safety through it...). A lot of the code assumed the type would refer to a structure, and hence aded additional "*" here and there. Remove that.
2016-10-24core: rework apply_protect_kernel_modules() to use ↵Lennart Poettering
seccomp_add_syscall_filter_set() Let's simplify this call, by making use of the new infrastructure. This is actually more in line with Djalal's original patch but instead of search the filter set in the array by its name we can now use the set index and jump directly to it.
2016-10-24core: rework syscall filter set handlingLennart Poettering
A variety of fixes: - rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main instance of it (the syscall_filter_sets[] array) used to abbreviate "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not mix and match too wildly. Let's pick the shorter name in this case, as it is sufficiently well established to not confuse hackers reading this. - Export explicit indexes into the syscall_filter_sets[] array via an enum. This way, code that wants to make use of a specific filter set, can index it directly via the enum, instead of having to search for it. This makes apply_private_devices() in particular a lot simpler. - Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to find a set by its name, seccomp_add_syscall_filter_set() to add a set to a seccomp object. - Update SystemCallFilter= parser to use extract_first_word(). Let's work on deprecating FOREACH_WORD_QUOTED(). - Simplify apply_private_devices() using this functionality
2016-10-24core: move misplaced comment to the right placeLennart Poettering
2016-10-24core: simplify skip_seccomp_unavailable() a bitLennart Poettering
Let's prefer early-exit over deep-indented if blocks. Not behavioural change.
2016-10-24core: do not assert when sysconf(_SC_NGROUPS_MAX) fails (#4466)Djalal Harouni
Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX). _SC_NGROUPS_MAX maps to NGROUPS_MAX which is defined in <limits.h> to 65536 these days. The value is a sysctl read-only /proc/sys/kernel/ngroups_max and the kernel assumes that it is always positive otherwise things may break. Follow this and support only positive values for all other case return either -errno or -EOPNOTSUPP. Now if there are systems that want to re-write NGROUPS_MAX then they should not pass SupplementaryGroups= in units even if it is empty, in this case nothing fails and we just ignore supplementary groups. However if SupplementaryGroups= is passed even if it is empty we have to assume that there will be groups manipulation from our side or the kernel and since the kernel always assumes that NGROUPS_MAX is positive, then follow that and support only positive values.
2016-10-23core: lets move the setup of working directory before group enforceDjalal Harouni
This is minor but lets try to split and move bit by bit cgroups and portable environment setup before applying the security context.
2016-10-23core: first lookup and cache creds then apply them after namespace setupDjalal Harouni
This fixes: https://github.com/systemd/systemd/issues/4357 Let's lookup and cache creds then apply them. We also switch from getgroups() to getgrouplist().
2016-10-23tree-wide: drop NULL sentinel from strjoinZbigniew Jędrzejewski-Szmek
This makes strjoin and strjoina more similar and avoids the useless final argument. spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c) git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/' This might have missed a few cases (spatch has a really hard time dealing with _cleanup_ macros), but that's no big issue, they can always be fixed later.
2016-10-17core/exec: add a named-descriptor option ("fd") for streams (#4179)Luca Bruno
This commit adds a `fd` option to `StandardInput=`, `StandardOutput=` and `StandardError=` properties in order to connect standard streams to externally named descriptors provided by some socket units. This option looks for a file descriptor named as the corresponding stream. Custom names can be specified, separated by a colon. If multiple name-matches exist, the first matching fd will be used.
2016-10-16tree-wide: use mfree moreZbigniew Jędrzejewski-Szmek
2016-10-12core: make sure to dump ProtectKernelModules= valueDjalal Harouni
2016-10-12core: check protect_kernel_modules and private_devices in order to setup NNPDjalal Harouni
2016-10-12core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=Djalal Harouni
Lets go further and make /lib/modules/ inaccessible for services that do not have business with modules, this is a minor improvment but it may help on setups with custom modules and they are limited... in regard of kernel auto-load feature. This change introduce NameSpaceInfo struct which we may embed later inside ExecContext but for now lets just reduce the argument number to setup_namespace() and merge ProtectKernelModules feature.
2016-10-12core:sandbox: Add ProtectKernelModules= optionDjalal Harouni
This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.