Diffstat (limited to 'man/systemd.exec.xml')
-rw-r--r-- man/systemd.exec.xml 141
1 files changed, 78 insertions, 63 deletions
diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml
index dbe4594730..3c350df11f 100644
--- a/man/systemd.exec.xml
+++ b/man/systemd.exec.xml
@@ -1090,7 +1090,7 @@
mechanism. Almost no services need to write to these at runtime; it is hence recommended to turn this on for
most services. For this setting the same restrictions regarding mount propagation and privileges apply as for
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.
- Note that this option does not prevent kernel tuning through IPC interfaces and exeternal programs. However
+ Note that this option does not prevent kernel tuning through IPC interfaces and external programs. However
<varname>InaccessiblePaths=</varname> can be used to make some IPC file system objects
inaccessible.</para></listitem>
</varlistentry>
@@ -1234,42 +1234,49 @@
<varlistentry>
<term><varname>NoNewPrivileges=</varname></term>
- <listitem><para>Takes a boolean argument. If true, ensures
- that the service process and all its children can never gain
- new privileges. This option is more powerful than the
- respective secure bits flags (see above), as it also prohibits
- UID changes of any kind. This is the simplest, most effective
- way to ensure that a process and its children can never
- elevate privileges again.</para></listitem>
+ <listitem><para>Takes a boolean argument. If true, ensures that the service
+ process and all its children can never gain new privileges. This option is more
+ powerful than the respective secure bits flags (see above), as it also prohibits
+ UID changes of any kind. This is the simplest and most effective way to ensure that
+ a process and its children can never elevate privileges again. Defaults to false,
+ but in the user manager instance certain settings force
+ <varname>NoNewPrivileges=yes</varname>, ignoring the value of this setting.
+ This is the case when <varname>SystemCallFilter=</varname>,
+ <varname>SystemCallArchitectures=</varname>,
+ <varname>RestrictAddressFamilies=</varname>,
+ <varname>PrivateDevices=</varname>,
+ <varname>ProtectKernelTunables=</varname>,
+ <varname>ProtectKernelModules=</varname>,
+ <varname>MemoryDenyWriteExecute=</varname>, or
+ <varname>RestrictRealtime=</varname> are specified.
+ </para></listitem>
</varlistentry>
<varlistentry>
<term><varname>SystemCallFilter=</varname></term>
- <listitem><para>Takes a space-separated list of system call
- names. If this setting is used, all system calls executed by
- the unit processes except for the listed ones will result in
- immediate process termination with the
- <constant>SIGSYS</constant> signal (whitelisting). If the
- first character of the list is <literal>~</literal>, the
- effect is inverted: only the listed system calls will result
- in immediate process termination (blacklisting). If running in
- user mode, or in system mode, but without the
- <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
- <varname>User=nobody</varname>),
- <varname>NoNewPrivileges=yes</varname> is implied. This
- feature makes use of the Secure Computing Mode 2 interfaces of
- the kernel ('seccomp filtering') and is useful for enforcing a
- minimal sandboxing environment. Note that the
- <function>execve</function>,
- <function>rt_sigreturn</function>,
- <function>sigreturn</function>,
- <function>exit_group</function>, <function>exit</function>
- system calls are implicitly whitelisted and do not need to be
- listed explicitly. This option may be specified more than once,
- in which case the filter masks are merged. If the empty string
- is assigned, the filter is reset, all prior assignments will
- have no effect. This does not affect commands prefixed with <literal>+</literal>.</para>
+ <listitem><para>Takes a space-separated list of system call names. If this setting is used, all system calls
+ executed by the unit processes except for the listed ones will result in immediate process termination with the
+ <constant>SIGSYS</constant> signal (whitelisting). If the first character of the list is <literal>~</literal>,
+ the effect is inverted: only the listed system calls will result in immediate process termination
+ (blacklisting). If running in user mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant>
+ capability (e.g. setting <varname>User=nobody</varname>), <varname>NoNewPrivileges=yes</varname> is
+ implied. This feature makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering')
+ and is useful for enforcing a minimal sandboxing environment. Note that the <function>execve</function>,
+ <function>exit</function>, <function>exit_group</function>, <function>getrlimit</function>,
+ <function>rt_sigreturn</function>, <function>sigreturn</function> system calls and the system calls for
+ querying time and sleeping are implicitly whitelisted and do not need to be listed explicitly. This option may
+ be specified more than once, in which case the filter masks are merged. If the empty string is assigned, the
+ filter is reset, all prior assignments will have no effect. This does not affect commands prefixed with
+ <literal>+</literal>.</para>
+
+ <para>Note that strict system call filters may impact execution and error handling code paths of the service
+ invocation. Specifically, access to the <function>execve</function> system call is required for the execution
+ of the service binary — if it is blocked service invocation will necessarily fail. Also, if execution of the
+ service binary fails for some reason (for example: missing service executable), the error handling logic might
+ require access to an additional set of system calls in order to process and log this failure correctly. It
+ might be necessary to temporarily disable system call filters in order to simplify debugging of such
+ failures.</para>
<para>If you specify both types of this option (i.e.
whitelisting and blacklisting), the first encountered will
@@ -1303,6 +1310,10 @@
</thead>
<tbody>
<row>
+ <entry>@basic-io</entry>
+ <entry>System calls for basic I/O: reading, writing, seeking, file descriptor duplication and closing (<citerefentry project='man-pages'><refentrytitle>read</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>write</refentrytitle><manvolnum>2</manvolnum></citerefentry>, and related calls)</entry>
+ </row>
+ <row>
<entry>@clock</entry>
<entry>System calls for changing the system clock (<citerefentry project='man-pages'><refentrytitle>adjtimex</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>settimeofday</refentrytitle><manvolnum>2</manvolnum></citerefentry>, and related calls)</entry>
</row>
@@ -1320,7 +1331,7 @@
</row>
<row>
<entry>@ipc</entry>
- <entry>SysV IPC, POSIX Message Queues or other IPC (<citerefentry project='man-pages'><refentrytitle>mq_overview</refentrytitle><manvolnum>7</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>svipc</refentrytitle><manvolnum>7</manvolnum></citerefentry>)</entry>
+ <entry>Pipes, SysV IPC, POSIX Message Queues and other IPC (<citerefentry project='man-pages'><refentrytitle>mq_overview</refentrytitle><manvolnum>7</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>svipc</refentrytitle><manvolnum>7</manvolnum></citerefentry>)</entry>
</row>
<row>
<entry>@keyring</entry>
@@ -1348,17 +1359,21 @@
</row>
<row>
<entry>@process</entry>
- <entry>Process control, execution, namespaces (<citerefentry project='man-pages'><refentrytitle>execve</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>kill</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>namespaces</refentrytitle><manvolnum>7</manvolnum></citerefentry>, …</entry>
+ <entry>Process control, execution, namespaces (<citerefentry project='man-pages'><refentrytitle>clone</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>kill</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>namespaces</refentrytitle><manvolnum>7</manvolnum></citerefentry>, …)</entry>
</row>
<row>
<entry>@raw-io</entry>
- <entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pciconfig_read()</function>, …</entry>
+ <entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pciconfig_read()</function>, …)</entry>
+ </row>
+ <row>
+ <entry>@resources</entry>
+ <entry>System calls for changing resource limits, memory and scheduling parameters (<citerefentry project='man-pages'><refentrytitle>setrlimit</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>setpriority</refentrytitle><manvolnum>2</manvolnum></citerefentry>, …)</entry>
</row>
</tbody>
</tgroup>
</table>
- Note, that as new system calls are added to the kernel, additional system calls might be added to the groups
+ Note that as new system calls are added to the kernel, additional system calls might be added to the groups
above, so the contents of the sets may change between systemd versions.</para>
<para>It is recommended to combine the file system namespacing related options with
@@ -1387,28 +1402,25 @@
<varlistentry>
<term><varname>SystemCallArchitectures=</varname></term>
- <listitem><para>Takes a space-separated list of architecture
- identifiers to include in the system call filter. The known
- architecture identifiers are <constant>x86</constant>,
- <constant>x86-64</constant>, <constant>x32</constant>,
- <constant>arm</constant>, <constant>s390</constant>,
- <constant>s390x</constant> as well as the special identifier
- <constant>native</constant>. Only system calls of the
- specified architectures will be permitted to processes of this
- unit. This is an effective way to disable compatibility with
- non-native architectures for processes, for example to
- prohibit execution of 32-bit x86 binaries on 64-bit x86-64
- systems. The special <constant>native</constant> identifier
- implicitly maps to the native architecture of the system (or
- more strictly: to the architecture the system manager is
- compiled for). If running in user mode, or in system mode,
- but without the <constant>CAP_SYS_ADMIN</constant>
- capability (e.g. setting <varname>User=nobody</varname>),
- <varname>NoNewPrivileges=yes</varname> is implied. Note
- that setting this option to a non-empty list implies that
- <constant>native</constant> is included too. By default, this
- option is set to the empty list, i.e. no architecture system
- call filtering is applied.</para></listitem>
+ <listitem><para>Takes a space-separated list of architecture identifiers to
+ include in the system call filter. The known architecture identifiers are the same
+ as for <varname>ConditionArchitecture=</varname> described in
+ <citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
+ as well as <constant>x32</constant>, <constant>mips64-n32</constant>,
+ <constant>mips64-le-n32</constant>, and the special identifier
+ <constant>native</constant>. Only system calls of the specified architectures will
+ be permitted to processes of this unit. This is an effective way to disable
+ compatibility with non-native architectures for processes, for example to prohibit
+ execution of 32-bit x86 binaries on 64-bit x86-64 systems. The special
+ <constant>native</constant> identifier implicitly maps to the native architecture
+ of the system (or more strictly: to the architecture the system manager is
+ compiled for). If running in user mode, or in system mode, but without the
+ <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
+ <varname>User=nobody</varname>), <varname>NoNewPrivileges=yes</varname> is
+ implied. Note that setting this option to a non-empty list implies that
+ <constant>native</constant> is included too. By default, this option is set to the
+ empty list, i.e. no architecture system call filtering is applied.
+ </para></listitem>
</varlistentry>
<varlistentry>
@@ -1455,7 +1467,7 @@
<listitem><para>Takes a boolean argument. If true, explicit module loading will
be denied. This allows to turn off module load and unload operations on modular
- kernels. It is recomended to turn this on for most services that do not need special
+ kernels. It is recommended to turn this on for most services that do not need special
file systems or extra kernel modules to work. Default to off. Enabling this option
removes <constant>CAP_SYS_MODULE</constant> from the capability bounding set for
the unit, and installs a system call filter to block module system calls,
@@ -1516,12 +1528,15 @@
<term><varname>MemoryDenyWriteExecute=</varname></term>
<listitem><para>Takes a boolean argument. If set, attempts to create memory mappings that are writable and
- executable at the same time, or to change existing memory mappings to become executable are prohibited.
+ executable at the same time, or to change existing memory mappings to become executable, or mapping shared memory
+ segments as executable are prohibited.
Specifically, a system call filter is added that rejects
<citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry>
- system calls with both <constant>PROT_EXEC</constant> and <constant>PROT_WRITE</constant> set
- and <citerefentry><refentrytitle>mprotect</refentrytitle><manvolnum>2</manvolnum></citerefentry>
- system calls with <constant>PROT_EXEC</constant> set. Note that this option is incompatible with programs
+ system calls with both <constant>PROT_EXEC</constant> and <constant>PROT_WRITE</constant> set,
+ <citerefentry><refentrytitle>mprotect</refentrytitle><manvolnum>2</manvolnum></citerefentry>
+ system calls with <constant>PROT_EXEC</constant> set and
+ <citerefentry><refentrytitle>shmat</refentrytitle><manvolnum>2</manvolnum></citerefentry>
+ system calls with <constant>SHM_EXEC</constant> set. Note that this option is incompatible with programs
that generate program code dynamically at runtime, such as JIT execution engines, or programs compiled making
use of the code "trampoline" feature of various C compilers. This option improves service security, as it makes
harder for software exploits to change running code dynamically.