~lukeshu/systemd - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2016-11-08	core: on DynamicUser= make sure that protecting sensitive paths is enforced ↵	Djalal Harouni
	(#4596) This adds a variable that is always set to false to make sure that protect paths inside sandbox are always enforced and not ignored. The only case when it is set to true is on DynamicUser=no and RootDirectory=/chroot is set. This allows users to use more our sandbox features inside RootDirectory= The only exception is ProtectSystem=full\|strict and when DynamicUser=yes is implied. Currently RootDirectory= is not fully compatible with these due to two reasons: * /chroot/usr\|etc has to be present on ProtectSystem=full * /chroot// has to be a mount point on ProtectSystem=strict.
2016-11-08	Merge pull request #4536 from poettering/seccomp-namespaces	Zbigniew Jędrzejewski-Szmek
	core: add new RestrictNamespaces= unit file setting Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the mean time.
2016-11-07	Rename formats-util.h to format-util.h	Zbigniew Jędrzejewski-Szmek
	We don't have plural in the name of any other -util files and this inconsistency trips me up every time I try to type this file name from memory. "formats-util" is even hard to pronounce.
2016-11-04	core: add new RestrictNamespaces= unit file setting	Lennart Poettering
	This new setting permits restricting whether namespaces may be created and managed by processes started by a unit. It installs a seccomp filter blocking certain invocations of unshare(), clone() and setns(). RestrictNamespaces=no is the default, and does not restrict namespaces in any way. RestrictNamespaces=yes takes away the ability to create or manage any kind of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces so that only mount and IPC namespaces may be created/managed, but no other kind of namespaces. This setting should be improve security quite a bit as in particular user namespacing was a major source of CVEs in the kernel in the past, and is accessible to unprivileged processes. With this setting the entire attack surface may be removed for system services that do not make use of namespaces.
2016-11-03	Merge pull request #4510 from keszybz/tree-wide-cleanups	Lennart Poettering
	Tree wide cleanups
2016-11-03	core: intialize user aux groups and SupplementaryGroups= when DynamicUser= ↵	Djalal Harouni
	is set Make sure that when DynamicUser= is set that we intialize the user supplementary groups and that we also support SupplementaryGroups= Fixes: https://github.com/systemd/systemd/issues/4539 Thanks Evgeny Vereshchagin (@evverx)
2016-11-02	Merge pull request #4483 from poettering/exec-order	Lennart Poettering
	more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec
2016-11-02	core: initialize groups list before checking SupplementaryGroups= of a unit ↵	Djalal Harouni
	(#4533) Always initialize the supplementary groups of caller before checking the unit SupplementaryGroups= option. Fixes https://github.com/systemd/systemd/issues/4531
2016-11-02	execute: apply seccomp filters after changing selinux/aa/smack contexts	Lennart Poettering
	Seccomp is generally an unprivileged operation, changing security contexts is most likely associated with some form of policy. Moreover, while seccomp may influence our own flow of code quite a bit (much more than the security context change) make sure to apply the seccomp filters immediately before executing the binary to invoke. This also moves enforcement of NNP after the security context change, so that NNP cannot affect it anymore. (However, the security policy now has to permit the NNP change). This change has a good chance of breaking current SELinux/AA/SMACK setups, because the policy might not expect this change of behaviour. However, it's technically the better choice I think and should hence be applied. Fixes: #3993
2016-10-28	Merge pull request #4495 from topimiettinen/block-shmat-exec	Djalal Harouni
	seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute
2016-10-27	core: make unit argument const for apply seccomp functions	Djalal Harouni

2016-10-27	core: lets apply working directory just after mount namespaces	Djalal Harouni
	This makes applying groups after applying the working directory, this may allow some flexibility but at same it is not a big deal since we don't execute or do anything between applying working directory and droping groups.
2016-10-27	core: get the working directory value inside apply_working_directory()	Djalal Harouni
	Improve apply_working_directory() and lets get the current working directory inside of it.
2016-10-27	core: move apply working directory code into its own apply_working_directory()	Djalal Harouni

2016-10-27	core: move the code that setups namespaces on its own function	Djalal Harouni

2016-10-26	seccomp: also block shmat(..., SHM_EXEC) for MemoryDenyWriteExecute	Topi Miettinen
	shmat(..., SHM_EXEC) can be used to create writable and executable memory, so let's block it when MemoryDenyWriteExecute is set.
2016-10-24	seccomp: add new helper call seccomp_load_filter_set()	Lennart Poettering
	This allows us to unify most of the code in apply_protect_kernel_modules() and apply_private_devices().
2016-10-24	seccomp: add new seccomp_init_conservative() helper	Lennart Poettering
	This adds a new seccomp_init_conservative() helper call that is mostly just a wrapper around seccomp_init(), but turns off NNP and adds in all secondary archs, for best compatibility with everything else. Pretty much all of our code used the very same constructs for these three steps, hence unifying this in one small function makes things a lot shorter. This also changes incorrect usage of the "scmp_filter_ctx" type at various places. libseccomp defines it as typedef to "void", i.e. it is a pointer type (pretty poor choice already!) that casts implicitly to and from all other pointer types (even poorer choice: you defined a confusing type now, and don't even gain any bit of type safety through it...). A lot of the code assumed the type would refer to a structure, and hence aded additional "" here and there. Remove that.
2016-10-24	core: rework apply_protect_kernel_modules() to use ↵	Lennart Poettering
	seccomp_add_syscall_filter_set() Let's simplify this call, by making use of the new infrastructure. This is actually more in line with Djalal's original patch but instead of search the filter set in the array by its name we can now use the set index and jump directly to it.
2016-10-24	core: rework syscall filter set handling	Lennart Poettering
	A variety of fixes: - rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main instance of it (the syscall_filter_sets[] array) used to abbreviate "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not mix and match too wildly. Let's pick the shorter name in this case, as it is sufficiently well established to not confuse hackers reading this. - Export explicit indexes into the syscall_filter_sets[] array via an enum. This way, code that wants to make use of a specific filter set, can index it directly via the enum, instead of having to search for it. This makes apply_private_devices() in particular a lot simpler. - Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to find a set by its name, seccomp_add_syscall_filter_set() to add a set to a seccomp object. - Update SystemCallFilter= parser to use extract_first_word(). Let's work on deprecating FOREACH_WORD_QUOTED(). - Simplify apply_private_devices() using this functionality
2016-10-24	core: move misplaced comment to the right place	Lennart Poettering

2016-10-24	core: simplify skip_seccomp_unavailable() a bit	Lennart Poettering
	Let's prefer early-exit over deep-indented if blocks. Not behavioural change.
2016-10-24	core: do not assert when sysconf(_SC_NGROUPS_MAX) fails (#4466)	Djalal Harouni
	Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX). _SC_NGROUPS_MAX maps to NGROUPS_MAX which is defined in <limits.h> to 65536 these days. The value is a sysctl read-only /proc/sys/kernel/ngroups_max and the kernel assumes that it is always positive otherwise things may break. Follow this and support only positive values for all other case return either -errno or -EOPNOTSUPP. Now if there are systems that want to re-write NGROUPS_MAX then they should not pass SupplementaryGroups= in units even if it is empty, in this case nothing fails and we just ignore supplementary groups. However if SupplementaryGroups= is passed even if it is empty we have to assume that there will be groups manipulation from our side or the kernel and since the kernel always assumes that NGROUPS_MAX is positive, then follow that and support only positive values.
2016-10-23	core: lets move the setup of working directory before group enforce	Djalal Harouni
	This is minor but lets try to split and move bit by bit cgroups and portable environment setup before applying the security context.
2016-10-23	core: first lookup and cache creds then apply them after namespace setup	Djalal Harouni
	This fixes: https://github.com/systemd/systemd/issues/4357 Let's lookup and cache creds then apply them. We also switch from getgroups() to getgrouplist().
2016-10-23	tree-wide: drop NULL sentinel from strjoin	Zbigniew Jędrzejewski-Szmek
	This makes strjoin and strjoina more similar and avoids the useless final argument. spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/.c) git grep -e '\bstrjoin\b.NULL' -l\|xargs sed -i -r 's/strjoin$(.*), NULL$/strjoin(\1)/' This might have missed a few cases (spatch has a really hard time dealing with _cleanup_ macros), but that's no big issue, they can always be fixed later.
2016-10-17	core/exec: add a named-descriptor option ("fd") for streams (#4179)	Luca Bruno
	This commit adds a `fd` option to `StandardInput=`, `StandardOutput=` and `StandardError=` properties in order to connect standard streams to externally named descriptors provided by some socket units. This option looks for a file descriptor named as the corresponding stream. Custom names can be specified, separated by a colon. If multiple name-matches exist, the first matching fd will be used.
2016-10-16	tree-wide: use mfree more	Zbigniew Jędrzejewski-Szmek

2016-10-12	core: make sure to dump ProtectKernelModules= value	Djalal Harouni

2016-10-12	core: check protect_kernel_modules and private_devices in order to setup NNP	Djalal Harouni

2016-10-12	core:sandbox: lets make /lib/modules/ inaccessible on ProtectKernelModules=	Djalal Harouni
	Lets go further and make /lib/modules/ inaccessible for services that do not have business with modules, this is a minor improvment but it may help on setups with custom modules and they are limited... in regard of kernel auto-load feature. This change introduce NameSpaceInfo struct which we may embed later inside ExecContext but for now lets just reduce the argument number to setup_namespace() and merge ProtectKernelModules feature.
2016-10-12	core:sandbox: Add ProtectKernelModules= option	Djalal Harouni
	This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.
2016-10-11	core: chown() any TTY used for stdin, not just when StandardInput=tty is ↵	Lennart Poettering
	used (#4347) If stdin is supplied as an fd for transient units (using the StandardInputFileDescriptor pseudo-property for transient units), then we should also fix up the TTY ownership, not just when we opened the TTY ourselves. This simply drops the explicit is_terminal_input()-based check. Note that chown_terminal() internally does a much more appropriate isatty()-based check anyway, hence we can drop this without replacement. Fixes: #4260
2016-10-07	core: add "invocation ID" concept to service manager	Lennart Poettering
	This adds a new invocation ID concept to the service manager. The invocation ID identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is generated each time a unit moves from and inactive to an activating or active state. The primary usecase for this concept is to connect the runtime data PID 1 maintains about a service with the offline data the journal stores about it. Previously we'd use the unit name plus start/stop times, which however is highly racy since the journal will generally process log data after the service already ended. The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel, except that it applies to an individual unit instead of the whole system. The invocation ID is passed to the activated processes as environment variable. It is additionally stored as extended attribute on the cgroup of the unit. The latter is used by journald to automatically retrieve it for each log logged message and attach it to the log entry. The environment variable is very easily accessible, even for unprivileged services. OTOH the extended attribute is only accessible to privileged processes (this is because cgroupfs only supports the "trusted." xattr namespace, not "user."). The environment variable may be altered by services, the extended attribute may not be, hence is the better choice for the journal. Note that reading the invocation ID off the extended attribute from journald is racy, similar to the way reading the unit name for a logging process is. This patch adds APIs to read the invocation ID to sd-id128: sd_id128_get_invocation() may be used in a similar fashion to sd_id128_get_boot(). PID1's own logging is updated to always include the invocation ID when it logs information about a unit. A new bus call GetUnitByInvocationID() is added that allows retrieving a bus path to a unit by its invocation ID. The bus path is built using the invocation ID, thus providing a path for referring to a unit that is valid only for the current runtime cycleof it. Outlook for the future: should the kernel eventually allow passing of cgroup information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we can alter the invocation ID to be generated as hash from that rather than entirely randomly. This way we can derive the invocation race-freely from the messages.
2016-10-06	user-util: rework maybe_setgroups() a bit	Lennart Poettering
	Let's drop the caching of the setgroups /proc field for now. While there's a strict regime in place when it changes states, let's better not cache it since we cannot really be sure we follow that regime correctly. More importantly however, this is not in performance sensitive code, and there's no indication the cache is really beneficial, hence let's drop the caching and make things a bit simpler. Also, while we are at it, rework the error handling a bit, and always return negative errno-style error codes, following our usual coding style. This has the benefit that we can sensible hanld read_one_line_file() errors, without having to updat errno explicitly.
2016-10-06	core: leave PAM stub process around with GIDs updated	Lennart Poettering
	In the process execution code of PID 1, before 096424d1230e0a0339735c51b43949809e972430 the GID settings where changed before invoking PAM, and the UID settings after. After the change both changes are made after the PAM session hooks are run. When invoking PAM we fork once, and leave a stub process around which will invoke the PAM session end hooks when the session goes away. This code previously was dropping the remaining privs (which were precisely the UID). Fix this code to do this correctly again, by really dropping them else (i.e. the GID as well). While we are at it, also fix error logging of this code. Fixes: #4238
2016-10-06	core: do not fail in a container if we can't use setgroups	Giuseppe Scrivano
	It might be blocked through /proc/PID/setgroups
2016-10-04	tree-wide: remove consecutive duplicate words in comments	Stefan Schweter

2016-09-25	core: Use @raw-io syscall group to filter I/O syscalls when PrivateDevices= ↵	Djalal Harouni
	is set Instead of having a local syscall list, use the @raw-io group which contains the same set of syscalls to filter.
2016-09-25	execute: move SMACK setup code into its own function	Lennart Poettering
	While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the main execution logic a bit.
2016-09-25	execute: filter low-level I/O syscalls if PrivateDevices= is set	Lennart Poettering
	If device access is restricted via PrivateDevices=, let's also block the various low-level I/O syscalls at the same time, so that we know that the minimal set of devices in our virtualized /dev are really everything the unit can access.
2016-09-25	execute: drop group priviliges only after setting up namespace	Lennart Poettering
	If PrivateDevices=yes is set, the namespace code creates device nodes in /dev that should be owned by the host's root, hence let's make sure we set up the namespace before dropping group privileges.
2016-09-25	execute: if RuntimeDirectory= is set, it should be writable	Lennart Poettering
	Implicitly make all dirs set with RuntimeDirectory= writable, as the concept otherwise makes no sense.
2016-09-25	execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c	Lennart Poettering
	This adds a new call get_user_creds_clean(), which is just like get_user_creds() but returns NULL in the home/shell parameters if they contain no useful information. This code previously lived in execute.c, but by generalizing this we can reuse it in run.c.
2016-09-25	execute: split out creation of runtime dirs into its own functions	Lennart Poettering

2016-09-25	core: add two new service settings ProtectKernelTunables= and ↵	Lennart Poettering
	ProtectControlGroups= If enabled, these will block write access to /sys, /proc/sys and /proc/sys/fs/cgroup.
2016-09-25	core: enforce seccomp for secondary archs too, for all rules	Lennart Poettering
	Let's make sure that all our rules apply to all archs the local kernel supports.
2016-09-06	seccomp: also detect if seccomp filtering is enabled	Felipe Sateler
	In https://github.com/systemd/systemd/pull/4004 , a runtime detection method for seccomp was added. However, it does not detect the case where CONFIG_SECCOMP=y but CONFIG_SECCOMP_FILTER=n. This is possible if the architecture does not support filtering yet. Add a check for that case too. While at it, change get_proc_field usage to use PR_GET_SECCOMP prctl, as that should save a few system calls and (unnecessary) allocations. Previously, reading of /proc/self/stat was done as recommended by prctl(2) as safer. However, given that we need to do the prctl call anyway, lets skip opening, reading and parsing the file. Code for checking inspired by https://outflux.net/teach-seccomp/autodetect.html
2016-08-22	core: do not fail at step SECCOMP if there is no kernel support (#4004)	Felipe Sateler
	Fixes #3882
2016-08-19	core: bypass dynamic user lookups from dbus-daemon	Lennart Poettering
	dbus-daemon does NSS name look-ups in order to enforce its bus policy. This might dead-lock if an NSS module use wants to use D-Bus for the look-up itself, like our nss-systemd does. Let's work around this by bypassing bus communication in the NSS module if we run inside of dbus-daemon. To make this work we keep a bit of extra state in /run/systemd/dynamic-uid/ so that we don't have to consult the bus, but can still resolve the names. Note that the normal codepath continues to be via the bus, so that resolving works from all mount namespaces and is subject to authentication, as before. This is a bit dirty, but not too dirty, as dbus daemon is kinda special anyway for PID 1.