summaryrefslogtreecommitdiff
path: root/src/core/execute.c
AgeCommit message (Collapse)Author
2016-10-16tree-wide: use mfree moreZbigniew Jędrzejewski-Szmek
2016-10-12core: make sure to dump ProtectKernelModules= valueDjalal Harouni
2016-10-12core: check protect_kernel_modules and private_devices in order to setup NNPDjalal Harouni
2016-10-12core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=Djalal Harouni
Lets go further and make /lib/modules/ inaccessible for services that do not have business with modules, this is a minor improvment but it may help on setups with custom modules and they are limited... in regard of kernel auto-load feature. This change introduce NameSpaceInfo struct which we may embed later inside ExecContext but for now lets just reduce the argument number to setup_namespace() and merge ProtectKernelModules feature.
2016-10-12core:sandbox: Add ProtectKernelModules= optionDjalal Harouni
This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.
2016-10-11core: chown() any TTY used for stdin, not just when StandardInput=tty is ↵Lennart Poettering
used (#4347) If stdin is supplied as an fd for transient units (using the StandardInputFileDescriptor pseudo-property for transient units), then we should also fix up the TTY ownership, not just when we opened the TTY ourselves. This simply drops the explicit is_terminal_input()-based check. Note that chown_terminal() internally does a much more appropriate isatty()-based check anyway, hence we can drop this without replacement. Fixes: #4260
2016-10-07core: add "invocation ID" concept to service managerLennart Poettering
This adds a new invocation ID concept to the service manager. The invocation ID identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is generated each time a unit moves from and inactive to an activating or active state. The primary usecase for this concept is to connect the runtime data PID 1 maintains about a service with the offline data the journal stores about it. Previously we'd use the unit name plus start/stop times, which however is highly racy since the journal will generally process log data after the service already ended. The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel, except that it applies to an individual unit instead of the whole system. The invocation ID is passed to the activated processes as environment variable. It is additionally stored as extended attribute on the cgroup of the unit. The latter is used by journald to automatically retrieve it for each log logged message and attach it to the log entry. The environment variable is very easily accessible, even for unprivileged services. OTOH the extended attribute is only accessible to privileged processes (this is because cgroupfs only supports the "trusted." xattr namespace, not "user."). The environment variable may be altered by services, the extended attribute may not be, hence is the better choice for the journal. Note that reading the invocation ID off the extended attribute from journald is racy, similar to the way reading the unit name for a logging process is. This patch adds APIs to read the invocation ID to sd-id128: sd_id128_get_invocation() may be used in a similar fashion to sd_id128_get_boot(). PID1's own logging is updated to always include the invocation ID when it logs information about a unit. A new bus call GetUnitByInvocationID() is added that allows retrieving a bus path to a unit by its invocation ID. The bus path is built using the invocation ID, thus providing a path for referring to a unit that is valid only for the current runtime cycleof it. Outlook for the future: should the kernel eventually allow passing of cgroup information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we can alter the invocation ID to be generated as hash from that rather than entirely randomly. This way we can derive the invocation race-freely from the messages.
2016-10-06user-util: rework maybe_setgroups() a bitLennart Poettering
Let's drop the caching of the setgroups /proc field for now. While there's a strict regime in place when it changes states, let's better not cache it since we cannot really be sure we follow that regime correctly. More importantly however, this is not in performance sensitive code, and there's no indication the cache is really beneficial, hence let's drop the caching and make things a bit simpler. Also, while we are at it, rework the error handling a bit, and always return negative errno-style error codes, following our usual coding style. This has the benefit that we can sensible hanld read_one_line_file() errors, without having to updat errno explicitly.
2016-10-06core: leave PAM stub process around with GIDs updatedLennart Poettering
In the process execution code of PID 1, before 096424d1230e0a0339735c51b43949809e972430 the GID settings where changed before invoking PAM, and the UID settings after. After the change both changes are made after the PAM session hooks are run. When invoking PAM we fork once, and leave a stub process around which will invoke the PAM session end hooks when the session goes away. This code previously was dropping the remaining privs (which were precisely the UID). Fix this code to do this correctly again, by really dropping them else (i.e. the GID as well). While we are at it, also fix error logging of this code. Fixes: #4238
2016-10-06core: do not fail in a container if we can't use setgroupsGiuseppe Scrivano
It might be blocked through /proc/PID/setgroups
2016-10-04tree-wide: remove consecutive duplicate words in commentsStefan Schweter
2016-09-25core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= ↵Djalal Harouni
is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.
2016-09-25execute: move SMACK setup code into its own functionLennart Poettering
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the main execution logic a bit.
2016-09-25execute: filter low-level I/O syscalls if PrivateDevices= is setLennart Poettering
If device access is restricted via PrivateDevices=, let's also block the various low-level I/O syscalls at the same time, so that we know that the minimal set of devices in our virtualized /dev are really everything the unit can access.
2016-09-25execute: drop group priviliges only after setting up namespaceLennart Poettering
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev that should be owned by the host's root, hence let's make sure we set up the namespace before dropping group privileges.
2016-09-25execute: if RuntimeDirectory= is set, it should be writableLennart Poettering
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept otherwise makes no sense.
2016-09-25execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.cLennart Poettering
This adds a new call get_user_creds_clean(), which is just like get_user_creds() but returns NULL in the home/shell parameters if they contain no useful information. This code previously lived in execute.c, but by generalizing this we can reuse it in run.c.
2016-09-25execute: split out creation of runtime dirs into its own functionsLennart Poettering
2016-09-25core: add two new service settings ProtectKernelTunables= and ↵Lennart Poettering
ProtectControlGroups= If enabled, these will block write access to /sys, /proc/sys and /proc/sys/fs/cgroup.
2016-09-25core: enforce seccomp for secondary archs too, for all rulesLennart Poettering
Let's make sure that all our rules apply to all archs the local kernel supports.
2016-09-06seccomp: also detect if seccomp filtering is enabledFelipe Sateler
In https://github.com/systemd/systemd/pull/4004 , a runtime detection method for seccomp was added. However, it does not detect the case where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible if the architecture does not support filtering yet. Add a check for that case too. While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl, as that should save a few system calls and (unnecessary) allocations. Previously, reading of /proc/self/stat was done as recommended by prctl(2) as safer. However, given that we need to do the prctl call anyway, lets skip opening, reading and parsing the file. Code for checking inspired by https://outflux.net/teach-seccomp/autodetect.html
2016-08-22core: do not fail at step SECCOMP if there is no kernel support (#4004)Felipe Sateler
Fixes #3882
2016-08-19core: bypass dynamic user lookups from dbus-daemonLennart Poettering
dbus-daemon does NSS name look-ups in order to enforce its bus policy. This might dead-lock if an NSS module use wants to use D-Bus for the look-up itself, like our nss-systemd does. Let's work around this by bypassing bus communication in the NSS module if we run inside of dbus-daemon. To make this work we keep a bit of extra state in /run/systemd/dynamic-uid/ so that we don't have to consult the bus, but can still resolve the names. Note that the normal codepath continues to be via the bus, so that resolving works from all mount namespaces and is subject to authentication, as before. This is a bit dirty, but not too dirty, as dbus daemon is kinda special anyway for PID 1.
2016-08-19core: add RemoveIPC= settingLennart Poettering
This adds the boolean RemoveIPC= setting to service, socket, mount and swap units (i.e. all unit types that may invoke processes). if turned on, and the unit's user/group is not root, all IPC objects of the user/group are removed when the service is shut down. The life-cycle of the IPC objects is hence bound to the unit life-cycle. This is particularly relevant for units with dynamic users, as it is essential that no objects owned by the dynamic users survive the service exiting. In fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set. In order to communicate the UID/GID of an executed process back to PID 1 this adds a new "user lookup" socket pair, that is inherited into the forked processes, and closed before the exec(). This is needed since we cannot do NSS from PID 1 due to deadlock risks, However need to know the used UID/GID in order to clean up IPC owned by it if the unit shuts down.
2016-08-18core: make use of uid_is_valid() when checking for UID validityLennart Poettering
2016-08-06Merge pull request #3884 from poettering/private-usersZbigniew Jędrzejewski-Szmek
2016-08-04core: only set the watchdog variables in ExecStart= linesLennart Poettering
2016-08-04core: use the correct APIs to determine whether a dual timestamp is initializedLennart Poettering
2016-08-04core: turn various execution flags into a proper flags parameterLennart Poettering
The ExecParameters structure contains a number of bit-flags, that were so far exposed as bool:1, change this to a proper, single binary bit flag field. This makes things a bit more expressive, and is helpful as we add more flags, since these booleans are passed around in various callers, for example service_spawn(), whose signature can be made much shorter now. Not all bit booleans from ExecParameters are moved into the flags field for now, but this can be added later.
2016-08-03core: add new PrivateUsers= option to service executionLennart Poettering
This setting adds minimal user namespacing support to a service. When set the invoked processes will run in their own user namespace. Only a trivial mapping will be set up: the root user/group is mapped to root, and the user/group of the service will be mapped to itself, everything else is mapped to nobody. If this setting is used the service runs with no capabilities on the host, but configurable capabilities within the service. This setting is particularly useful in conjunction with RootDirectory= as the need to synchronize /etc/passwd and /etc/group between the host and the service OS tree is reduced, as only three UID/GIDs need to match: root, nobody and the user of the service itself. But even outside the RootDirectory= case this setting is useful to substantially reduce the attack surface of a service. Example command to test this: systemd-run -p PrivateUsers=1 -p User=foobar -t /bin/sh This runs a shell as user "foobar". When typing "ps" only processes owned by "root", by "foobar", and by "nobody" should be visible.
2016-08-03execute: don't set $SHELL and $HOME for services, if they don't contain ↵Lennart Poettering
interesting data
2016-08-03core: inherit TERM from PID 1 for all services started on /dev/consoleLennart Poettering
This way, invoking nspawn from a shell in the best case inherits the TERM setting all the way down into the login shell spawned in the container. Fixes: #3697
2016-07-22nss: add new "nss-systemd" NSS module for mapping dynamic usersLennart Poettering
With this NSS module all dynamic service users will be resolvable via NSS like any real user.
2016-07-22core: add a concept of "dynamic" user ids, that are allocated as long as a ↵Lennart Poettering
service is running This adds a new boolean setting DynamicUser= to service files. If set, a new user will be allocated dynamically when the unit is started, and released when it is stopped. The user ID is allocated from the range 61184..65519. The user will not be added to /etc/passwd (but an NSS module to be added later should make it show up in getent passwd). For now, care should be taken that the service writes no files to disk, since this might result in files owned by UIDs that might get assigned dynamically to a different service later on. Later patches will tighten sandboxing in order to ensure that this cannot happen, except for a few selected directories. A simple way to test this is: systemd-run -p DynamicUser=1 /bin/sleep 99999
2016-07-20execute: make sure JoinsNamespaceOf= doesn't leak ns fds to executed processesLennart Poettering
2016-07-20execute: normalize connect_logger_as() parameters slightlyLennart Poettering
All other functions in execute.c that need the unit id take a Unit* parameter as first argument. Let's change connect_logger_as() to follow a similar logic.
2016-07-19doc,core: Read{Write,Only}Paths= and InaccessiblePaths=Alessandro Puccetti
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories= to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept as aliases but they are not advertised in the documentation. Renamed variables: `read_write_dirs` --> `read_write_paths` `read_only_dirs` --> `read_only_paths` `inaccessible_dirs` --> `inaccessible_paths`
2016-07-11treewide: fix typos and remove accidental repetition of wordsTorstein Husebø
2016-07-08execute: Do not alter call-by-ref parameter on failureJouke Witteveen
Prevent free from being called on (a part of) the call-by-reference variable env when setup_pam fails.
2016-07-07execute: Cleanup the environment earlyJouke Witteveen
By cleaning up before setting up PAM we maintain control of overriding behavior in setting variables. Otherwise, pam_putenv is in control. This also makes sure we use a cleaned up environment in replacing variables in argv.
2016-06-23execute: add a new easy-to-use RestrictRealtime= option to unitsLennart Poettering
It takes a boolean value. If true, access to SCHED_RR, SCHED_FIFO and SCHED_DEADLINE is blocked, which my be used to lock up the system.
2016-06-23execute: be a little less drastic when MemoryDenyWriteExecute= hitsLennart Poettering
Let's politely refuse with EPERM rather than kill the whole thing right-away.
2016-06-23execute: set PR_SET_NO_NEW_PRIVS also in case the exec memory protection is usedLennart Poettering
This was forgotten when MemoryDenyWriteExecute= was added: we should set NNP in all cases when we set seccomp filters.
2016-06-23execute: use the return value of setrlimit_closest() properlyLennart Poettering
It's a function defined by us, hence we should look for the error in its return value, not in "errno".
2016-06-15core: set $JOURNAL_STREAM to the dev_t/ino_t of the journal stream of ↵Lennart Poettering
executed services This permits services to detect whether their stdout/stderr is connected to the journal, and if so talk to the journal directly, thus permitting carrying of metadata. As requested by the gtk folks: #2473
2016-06-15execute: minor coding style improvementsLennart Poettering
2016-06-13core/execute: pass env vars to PAM session setup (#3503)Jouke Witteveen
Move the merger of environment variables before setting up the PAM session and pass the aggregate environment to PAM setup. This allows control over the PAM session hooks through environment variables. PAM session initiation may update the environment. On successful initiation of a PAM session, we adopt the environment of the PAM context.
2016-06-10core/execute: add the magic character '!' to allow privileged execution (#3493)Alessandro Puccetti
This patch implements the new magic character '!'. By putting '!' in front of a command, systemd executes it with full privileges ignoring paramters such as User, Group, SupplementaryGroups, CapabilityBoundingSet, AmbientCapabilities, SecureBits, SystemCallFilter, SELinuxContext, AppArmorProfile, SmackProcessLabel, and RestrictAddressFamilies. Fixes partially https://github.com/systemd/systemd/issues/3414 Related to https://github.com/coreos/rkt/issues/2482 Testing: 1. Create a user 'bob' 2. Create the unit file /etc/systemd/system/exec-perm.service (You can use the example below) 3. sudo systemctl start ext-perm.service 4. Verify that the commands starting with '!' were not executed as bob, 4.1 Looking to the output of ls -l /tmp/exec-perm 4.2 Each file contains the result of the id command. ````````````````````````````````````````````````````````````````` [Unit] Description=ext-perm [Service] Type=oneshot TimeoutStartSec=0 User=bob ExecStartPre=!/usr/bin/sh -c "/usr/bin/rm /tmp/exec-perm*" ; /usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-pre" ExecStart=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start" ; !/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-star-2" ExecStartPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-post" ExecReload=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-reload" ExecStop=!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop" ExecStopPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop-post" [Install] WantedBy=multi-user.target] `````````````````````````````````````````````````````````````````
2016-06-09execute: check whether the specified fd is a tty before chowning/chmoding ↵Lennart Poettering
it (#3457) Let's add an extra safety check before we chmod/chown a TTY to the right user, as we might end up having connected something to STDIN/STDOUT that is actually not a TTY, even though this might have been requested, due to permissive StandardInput= settings or transient service activation with fds passed in. Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=85255
2016-06-03core: Restrict mmap and mprotect with PAGE_WRITE|PAGE_EXEC (#3319) (#3379)Topi Miettinen
New exec boolean MemoryDenyWriteExecute, when set, installs a seccomp filter to reject mmap(2) with PAGE_WRITE|PAGE_EXEC and mprotect(2) with PAGE_EXEC.