Age | Commit message (Collapse) | Author |
|
|
|
Enhance nspawn debug logs for mount/unmount operations
|
|
Allowed paths are unified betwen the configuration file parses and the bus
property checker. The biggest change is that the bus code now allows "block-"
and "char-" classes. In addition, path_startswith("/dev") was used in the bus
code, and startswith("/dev") was used in the config file code. It seems
reasonable to use path_startswith() which allows a slightly broader class of
strings.
Fixes #3935.
|
|
|
|
This makes it easier to debug failed nspawn invocations:
Mounting sysfs on /var/lib/machines/fedora-rawhide/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev (MS_NOSUID|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev/shm (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=1777,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/run (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Bind-mounting /sys/fs/selinux on /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_BIND "")...
Remounting /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting proc on /proc (MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Bind-mounting /proc/sys on /proc/sys (MS_BIND "")...
Remounting /proc/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Bind-mounting /proc/sysrq-trigger on /proc/sysrq-trigger (MS_BIND "")...
Remounting /proc/sysrq-trigger (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")...
Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME "mode=755,uid=0,gid=0")...
Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr")...
Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr"): No such file or directory
|
|
Add an "invocation ID" concept to the service manager
|
|
whether it is a command or a daemon
SIGTERM should be considered a clean exit code for daemons (i.e. long-running
processes, as a daemon without SIGTERM handler may be shut down without issues
via SIGTERM still) while it should not be considered a clean exit code for
commands (i.e. short-running processes).
Let's add two different clean checking modes for this, and use the right one at
the appropriate places.
Fixes: #4275
|
|
Let's get rid of is_clean_exit_lsb(), let's move the logic for the special
handling of the two LSB exit codes into the sysv-generator by writing out
appropriate SuccessExitStatus= lines if the LSB header exists. This is not only
semantically more correct, bug also fixes a bug as the code in service.c that
chose between is_clean_exit_lsb() and is_clean_exit() based this check on
whether a native unit files was available for the unit. However, that check was
bogus since a long time, since the SysV generator was introduced and native
SysV script support was removed from PID 1, as in that case a unit file always
existed.
|
|
Let's make sure it's in the same order as the actual enum defining the exit
statuses.
|
|
Do not make up our own type for ExitStatus, but use the type used by POSIX for
this, which is "int". In particular as we never used that type outside of the
definition of exit_status_to_string() where we internally cast the paramter to
(int) every single time we used it.
Hence, let's simplify things, drop the type and use the kernel type directly.
|
|
Autodetect systemd version in containers started by systemd-nspawn
|
|
This is a bit crude and only works for new systemd versions which
have libsystemd-shared.
|
|
This patch enables to configure
IFA_F_HOMEADDRESS
IFA_F_NODAD
IFA_F_MANAGETEMPADDR
IFA_F_NOPREFIXROUTE
IFA_F_MCAUTOJOIN
|
|
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
|
|
|
|
Most important is a fix to negate the error number if necessary, before we
first access it.
|
|
Let's make sure people invoking STRV_FOREACH_BACKWARDS() as a single statement
of an if statement don't fall into a trap, and find the tail for the list via
strv_length().
|
|
RISC-V is an open source ISA in development since 2010 at UCB.
For more information, see https://riscv.org/
I am adding RISC-V support to Fedora:
https://fedoraproject.org/wiki/Architectures/RISC-V
There are three major variants of the architecture (32-, 64- and
128-bit). The 128-bit variant is a paper exercise, but the other
two really exist in silicon. RISC-V is always little endian.
On Linux, the default kernel uname(2) can return "riscv" for all
variants. However a patch was added recently which makes the kernel
return one of "riscv32" or "riscv64" (or in future "riscv128"). So
systemd should be prepared to handle any of "riscv", "riscv32" or
"riscv64" (in future, "riscv128" but that is not included in the
current patch). If the kernel returns "riscv" then you need to use
the pointer size in order to know the real variant.
The Fedora/RISC-V kernel only ever returns "riscv64" since we're
only doing Fedora for 64 bit at the moment, and we've patched the
kernel so it doesn't return "riscv".
As well as the major bitsize variants, there are also architecture
extensions. However I'm trying to ensure that uname(2) does *not*
return any other information about those in utsname.machine, so that
we don't end up with "riscv64abcde" nonsense. Instead those
extensions will be exposed in /proc/cpuinfo similar to how flags
work in x86.
|
|
Let's drop the caching of the setgroups /proc field for now. While there's a
strict regime in place when it changes states, let's better not cache it since
we cannot really be sure we follow that regime correctly.
More importantly however, this is not in performance sensitive code, and
there's no indication the cache is really beneficial, hence let's drop the
caching and make things a bit simpler.
Also, while we are at it, rework the error handling a bit, and always return
negative errno-style error codes, following our usual coding style. This has
the benefit that we can sensible hanld read_one_line_file() errors, without
having to updat errno explicitly.
|
|
As suggested here:
https://github.com/systemd/systemd/pull/4296#issuecomment-251911349
Let's try AF_INET first as socket, but let's fall back to AF_NETLINK, so that
we can use a protocol-independent socket here if possible. This has the benefit
that our code will still work even if AF_INET/AF_INET6 is made unavailable (for
exmple via seccomp), at least on current kernels.
|
|
It might be blocked through /proc/PID/setgroups
|
|
|
|
|
|
If the new item is inserted before the first item in the list, then the
head must be updated as well.
Add a test to the list unit test to check for this.
|
|
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
|
|
Show and formatting fixes
|
|
Even if
```
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
```
is disabled
cat /proc/net/sockstat6
```
TCP6: inuse 2
UDP6: inuse 1
UDPLITE6: inuse 0
RAW6: inuse 0
FRAG6: inuse 0 memory 0
```
Looking for /proc/net/if_inet6 is the right choice.
|
|
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.
(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)
Fixes: #3867
|
|
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.
This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.
With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.
This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.
This is not only the safer option, but also more in-line with what the man page
currently claims:
"Entries (files or directories) listed in ReadWritePaths= are
accessible from within the namespace with the same access rights as
from outside."
To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.
A number of functions are updated to add more debug logging to make this more
digestable.
|
|
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
|
|
Also some trivial tests for STR_IN_SET and STRPTR_IN_SET.
|
|
Network file dropins
|
|
In preparation for adding a version which takes a strv.
|
|
condition: ignore nanoseconds in timestamps for ConditionNeedsUpdate=
Fixes #4130.
|
|
|
|
Strerror removal and other janitorial cleanups
|
|
We were already unconditionally using the unicode character when the
input string was not pure ASCII, leading to different behaviour in
depending on the input string.
systemd[1]: Starting printit.service.
python3[19962]: foooooooooooooooooooooooooooooooooooo…oooo
python3[19964]: fooąęoooooooooooooooooooooooooooooooo…oooo
python3[19966]: fooąęoooooooooooooooooooooooooooooooo…ąęąę
python3[19968]: fooąęoooooooooooooooooąęąęąęąęąęąęąęą…ąęąę
systemd[1]: Started printit.service.
|
|
According to its manual page, flags given to mkostemp(3) shouldn't include
O_RDWR, O_CREAT or O_EXCL flags as these are always included. Beyond
those, the only flag that all callers (except a few tests where it
probably doesn't matter) use is O_CLOEXEC, so set that unconditionally.
|
|
This splits the OS field in two : one for the distribution name
and one for the the version id.
Dashes are written for missing fields.
This also prints ip addresses of known machines. The `--max-addresses`
option specifies how much ip addresses we want to see. The default is 1.
When more than one address is written for a machine, a `,` follows it.
If there are more ips than `--max-addresses`, `...` follows the last
address.
|
|
fileio makes use of O_TMPFILE when it is available.
We now always have O_TMPFILE, defined in missing.h if missing
from the toolchain headers.
Have fileio include missing.h and drop the guards around the
use of O_TMPFILE.
|
|
Currently, a missing __O_TMPFILE was only defined for i386 and x86_64,
leaving any other architectures with an "old" toolchain fail miserably
at build time:
src/import/export-raw.c: In function 'reflink_snapshot':
src/import/export-raw.c:271:26: error: 'O_TMPFILE' undeclared (first use in this function)
new_fd = open(d, O_TMPFILE|O_CLOEXEC|O_NOCTTY|O_RDWR, 0600);
^
__O_TMPFILE (and O_TMPFILE) are available since glibc 2.19. However, a
lot of existing toolchains are still using glibc-2.18, and some even
before that, and it is not really possible to update those toolchains.
Instead of defining it only for i386 and x86_64, define __O_TMPFILE
with the specific values for those archs where it is different from the
generic value. Use the values as found in the Linux kernel (v4.8-rc3,
current as of time of commit).
---
Note: tested on ARM (build+run), with glibc-2.18 and linux headers 3.12.
Untested on other archs, though (I have no board to test this).
Changes v1 -> v2:
- add a comment specifying some are hexa, others are octal.
|
|
|
|
After the call of the isatty() we check its result twice in the
open_terminal(). There are no sense to check result of isatty() that
it is less than zero and return -errno, because as described in
documentation:
isatty() returns 1 if fd is an open file descriptor referring to a
terminal; otherwise 0 is returned, and errno is set to indicate the
error.
So it can't be less than zero.
|
|
add a new tool for creating transient mount and automount units
|
|
Rework console color setup
|
|
This changes the semantics a bit: before, SYSTEMD_COLORS= would be treated as
"yes", same as SYSTEMD_COLORS=xxx and SYSTEMD_COLORS=1, and only
SYSTEMD_COLORS=0 would be treated as "no". Now, only valid booleans are treated
as "yes". This actually matches how $SYSTEMD_COLORS was announced in NEWS.
|
|
When started by the kernel, we are connected to the console, and we'll set TERM
properly to some value in fixup_environment(). We'll then enable or disable
colors based on the value of $SYSTEMD_COLORS and $TERM.
When reexecuting, TERM should be already set, so we can use this value.
Effectively, behaviour is the same as before affd7ed1a was reverted, but instead
of reopening the console before configuring color output, we just ignore what
stdout is connected to and decide based on the variables only.
|
|
Journald dynamic users
|
|
Dynamic users should be treated like system users, and their logs
should end up in the main system journal.
|
|
The parsing functions for [User]TasksMax were inconsistent. Empty string and
"infinity" were interpreted as no limit for TasksMax but not accepted for
UserTasksMax. Update them so that they're consistent with other knobs.
* Empty string indicates the default value.
* "infinity" indicates no limit.
While at it, replace opencoded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax
handling.
v2: Update empty string to indicate the default value as suggested by Zbigniew
Jędrzejewski-Szmek.
v3: Fixed empty UserTasksMax handling.
|