Age | Commit message (Collapse) | Author |
|
In preparation for reusing the image dissector in the GPT auto-discovery
logic, only optionally fail the dissection when we can't identify a root
partition.
In the GPT auto-discovery we are completely fine with any kind of root,
given that we run when it is already mounted and all we do is find some
additional auxiliary partitions on the same disk.
|
|
This improves kernel command line parsing in a number of ways:
a) A kernel option "foo_bar=xyz" is now considered equivalent to
"foo-bar=xyz", i.e. when comparing kernel command line option names "-" and
"_" are now considered equivalent (this only applies to the option names
though, not the option values!). Most of our kernel options used "-" as word
separator in kernel command line options so far, but some used "_". This
was a source of confusion for users (well, at least for one user: myself, I
just couldn't remember that it's systemd.debug-shell, not
systemd.debug_shell), which this change removes. Considering both as
equivalent is inspired by how modern kernel module loading normalizes all
kernel module names to use underscores now too.
b) All options previously using a dash for separating words in kernel command
line options now use an underscore instead, in all documentation and in
code. Since a) has been implemented this should not create any compatibility
problems, but normalizes our documentation and our code.
c) All kernel command line options which take booleans (or are boolean-like)
have been reworked so that "foobar" (without argument) is now equivalent to
"foobar=1" (but not "foobar=0"), thus normalizing the handling of our
boolean arguments. Specifically this means systemd.debug-shell and
systemd.debug_shell=1 are now entirely equivalent.
d) All kernel command line options which take an argument, and where no
argument is specified will now result in a log message. E.g. passing just
"systemd.unit" will now result in a complaint that it needs an argument. This
is implemented in the proc_cmdline_missing_value() function.
e) There's now a call proc_cmdline_get_bool() similar to proc_cmdline_get_key()
that parses booleans (following the logic explained in c).
f) The proc_cmdline_parse() call's boolean argument has been replaced by a new
flags argument that takes a common set of bits with proc_cmdline_get_key().
g) All kernel command line APIs now begin with the same "proc_cmdline_" prefix.
h) There are now tests for much of this. Yay!
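The matching rules from a) and c) can be sketched roughly as follows. This is an illustrative Python sketch, not the C code of this commit: the helper names key_equal() and get_bool() are hypothetical, the accepted boolean spellings are an assumption, and so is the "last occurrence wins" rule. Quoting on the command line is ignored for brevity.

```python
# Hypothetical helpers illustrating the "-"/"_" equivalence and boolean rules.
_TRUE = {"1", "yes", "y", "true", "t", "on"}      # assumed spellings
_FALSE = {"0", "no", "n", "false", "f", "off"}    # assumed spellings


def key_equal(a, b):
    """Compare option names, treating '-' and '_' as the same character."""
    return a.replace("-", "_") == b.replace("-", "_")


def get_bool(cmdline, key):
    """Return True/False for the last matching option, or None if absent.

    A bare "foobar" (no "=value") counts as true, like "foobar=1".
    Only the option name is normalized, never the value.
    """
    result = None
    for word in cmdline.split():
        name, sep, value = word.partition("=")
        if not key_equal(name, key):
            continue
        if not sep:
            result = True
        elif value.lower() in _TRUE:
            result = True
        elif value.lower() in _FALSE:
            result = False
        else:
            raise ValueError("not a boolean: %r" % word)
    return result
```

With this sketch, get_bool("quiet systemd.debug-shell", "systemd.debug_shell") yields True, while a command line that lacks the option yields None.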
|
|
This is useful for reusing the dissector logic in the gpt-auto-discovery logic:
there we really don't want to use MBR or naked file systems as root device.
|
|
Let's use chase_symlinks() when looking for /etc/os-release and
/usr/lib/os-release as these files might be symlinks (and actually are IRL on
some distros).
|
|
calendarspec: allow repetition values with ranges
|
|
|
|
This means that callers can distinguish an error from flags==0,
and don't have to special-case the empty string.
|
|
"Every other hour from 9 until 5" can be written as
`9..17/2:00` instead of `9,11,13,15,17:00`
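To make the combined syntax concrete, here is a small Python sketch that expands a single "START..END/STEP" component into the values it matches. The function name expand() is hypothetical and this is not the calendarspec parser itself; it only handles the range form shown above.

```python
def expand(component):
    """Expand 'START..END' or 'START..END/STEP' into a list of integers."""
    rng, _, step = component.partition("/")
    start, sep, end = rng.partition("..")
    if not sep:
        raise ValueError("sketch only handles START..END ranges")
    step_n = int(step) if step else 1
    if step_n <= 0:
        raise ValueError("repetition value must be positive")
    # The end of the range is inclusive.
    return list(range(int(start), int(end) + 1, step_n))
```

Here expand("9..17/2") gives the same hours as the explicit list 9,11,13,15,17.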
|
|
PR_SET_MM_ARG_START allows us to relatively cleanly implement process renaming.
However, it's only available with privileges. Hence, let's try to make use of
it, and if we can't, fall back to the traditional way of overriding argv[0].
This removes size restrictions on the process name shown in argv[] at least for
privileged processes.
|
|
Fixes:
```
$ ./libtool --mode=execute valgrind --leak-check=full ./test-fs-util
...
==22871==
==22871== 27 bytes in 1 blocks are definitely lost in loss record 1 of 1
==22871== at 0x4C2FC47: realloc (vg_replace_malloc.c:785)
==22871== by 0x4E86D05: strextend (string-util.c:726)
==22871== by 0x4E8F347: chase_symlinks (fs-util.c:712)
==22871== by 0x109EBF: test_chase_symlinks (test-fs-util.c:75)
==22871== by 0x10C381: main (test-fs-util.c:305)
==22871==
```
Closes #4888
|
|
set up a per-service session kernel keyring, and store the invocation ID in it
|
|
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.
The two new settings follow the concepts nspawn already possesses in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).
Fixes: #3439
|
|
This makes "systemd-run -p MountFlags=shared -t /bin/sh" work, by adding
MountFlags= to the list of properties that may be accessed transiently.
|
|
Let's store the invocation ID in the per-service keyring as a root-owned key,
with strict access rights. This has the advantage over the environment-based ID
passing that it also works from SUID binaries (as the key cannot be overridden
by unprivileged code starting them), in contrast to the secure_getenv() based
mode.
The invocation ID is now passed in three different ways to a service:
- As environment variable $INVOCATION_ID. This is easy to use, but may be
overridden by unprivileged code (which might be a bad or a good thing), which
means it's incompatible with SUID code (see above).
- As extended attribute on the service cgroup. This cannot be overridden by
unprivileged code, and may be queried safely from "outside" of a service.
However, it is incompatible with containers right now, as unprivileged
containers generally cannot set xattrs on cgroupfs.
- As "invocation_id" key in the kernel keyring. This has the benefit that the
key cannot be changed by unprivileged service code, and thus is safe to
access from SUID code (see above). But do note that service code can replace
the session keyring with a fresh one that lacks the key. However in that case
the key will not be owned by root, which is easily detectable. The keyring is
also incompatible with containers right now, as it is not properly namespace
aware (but this is being worked on), and thus most container managers mask
the keyring-related system calls.
Ideally we'd only have one way to pass the invocation ID, but the different
ways all have limitations. The invocation ID hookup in journald is currently
only available on the host but not in containers, due to the mentioned
limitations.
How to verify the new invocation ID in the keyring:
# systemd-run -t /bin/sh
Running as unit: run-rd917366c04f847b480d486017f7239d6.service
Press ^] three times within 1s to disconnect TTY.
# keyctl show
Session Keyring
680208392 --alswrv 0 0 keyring: _ses
250926536 ----s-rv 0 0 \_ user: invocation_id
# keyctl request user invocation_id
250926536
# keyctl read 250926536
16 bytes of data in key:
9c96317c ac64495a a42b9cd7 4f3ff96b
# echo $INVOCATION_ID
9c96317cac64495aa42b9cd74f3ff96b
# ^D
This creates a new transient service running a shell. It then verifies the
contents of the keyring, requests the invocation ID key, and reads its payload.
For comparison the invocation ID as passed via the environment variable is also
displayed.
|
|
Various specifier resolution fixes.
|
|
Generalize image dissection logic of nspawn, and make it useful for other tools.
|
|
Add new "khash" API and add new sd_id128_get_machine_app_specific() function
|
|
|
|
This adds support for discovering and making use of properly tagged dm-verity
data integrity partitions. This extends both systemd-nspawn and systemd-dissect
with a new --root-hash= switch that takes the root hash to use for the root
partition, and is otherwise fully automatic.
Verity partitions are discovered automatically by GPT table type UUIDs, as
listed in
https://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/
(which I updated prior to this change, to include new UUIDs for this purpose).
mkosi with https://github.com/systemd/mkosi/pull/39 applied may generate images
that carry the necessary integrity data. With that PR and this commit, the
following simple lines suffice to boot up an integrity-protected container image:
```
# mkdir test
# cd test
# mkosi --verity
# systemd-nspawn -i ./image.raw -bn
```
Note that mkosi writes the image file to "image.raw" next to a file
"image.roothash" that contains the root hash. systemd-nspawn will look for that
file and use it if it exists, in case --root-hash= is not specified explicitly.
|
|
This adds two new APIs to systemd:
- loop-util.h is a simple internal API for allocating, setting up and releasing
loopback block devices.
- dissect-image.h is an internal API for taking apart disk images and figuring
out what the purpose of each partition is.
Both APIs are basically refactored versions of similar code in nspawn. This
rework should permit us to reuse this in other places than just nspawn in the
future. Specifically: to implement RootImage= in the service image, similar to
RootDirectory=, but operating on a disk image; to unify the gpt-auto-discovery
generator code with the discovery logic in nspawn; to add new API to machined
for determining the OS version of a disk image (i.e. not just running
containers). This PR does not make any such changes however, it just provides
the new reworked API.
The reworked code is also slightly more powerful than the nspawn original one.
When pointing it to an image or block device with a naked file system (i.e. no
partition table) it will simply make it the root device.
|
|
"*:*" should be equivalent to "*-*-* *:*:00" (minutely)
rather than running every microsecond.
Fixes #4804
|
|
Let's accept "µs" as alternative time unit for microseconds. We already accept
"us" and "usec" for them, let's extend this and accept the proper scientific
unit specification too.
We will never output this as time unit, but it's fine to accept it, after all
we are pretty permissive with time units already.
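A rough Python sketch of the idea: the extra spelling is just one more entry in the unit table used when parsing, and never appears on output. The table and the function name parse_usec() are hypothetical and cover only a few units; bare numbers without a unit are rejected here for simplicity, which is an assumption of the sketch.

```python
# Hypothetical unit table; "\u00b5s" is "µs" written with U+00B5 MICRO SIGN.
USEC_PER_UNIT = {
    "us": 1,
    "usec": 1,
    "\u00b5s": 1,
    "ms": 1000,
    "msec": 1000,
    "s": 1000000,
    "sec": 1000000,
}


def parse_usec(text):
    """Parse '<digits><unit>' (optional space in between) into microseconds."""
    text = text.strip()
    i = 0
    while i < len(text) and text[i] in "0123456789":
        i += 1
    if i == 0:
        raise ValueError("expected a number: %r" % text)
    unit = text[i:].strip()
    if unit not in USEC_PER_UNIT:
        raise ValueError("unknown time unit: %r" % unit)
    return int(text[:i]) * USEC_PER_UNIT[unit]
```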
|
|
As suggested by @keszybz
|
|
This new flag controls whether it is considered a problem if the referenced path
doesn't actually exist. If specified it's OK if the final file doesn't exist.
Note that this permits one or more final components of the path not to exist,
but these must not contain "../" for safety reasons (or, to be extra safe,
"./" and a couple of others, i.e. only what path_is_safe() permits).
This new flag is useful when resolving paths before issuing an mkdir() or
open(O_CREAT) on a path, as it permits that the file or directory is created
later.
The return code of chase_symlinks() is changed to return 1 if the file exists,
and 0 if it doesn't. The latter is only returned in case CHASE_NON_EXISTING is
set.
|
|
Let's remove chase_symlinks_prefix() and instead introduce a flags parameter to
chase_symlinks(), with a flag CHASE_PREFIX_ROOT that exposes the behaviour of
chase_symlinks_prefix().
|
|
Previously, we'd generate an EINVAL error if it is attempted to escape a root
directory with relative ".." symlinks. With this commit this is changed so that
".." from the root directory is a NOP, following the kernel's own behaviour
where /.. is equivalent to /.
As suggested by @keszybz.
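The new ".." behaviour can be illustrated with a purely lexical Python sketch. The name resolve_in_root() is hypothetical, and unlike chase_symlinks() this sketch never looks at the file system or follows symlinks; it only shows that going up from the root stays at the root instead of failing.

```python
def resolve_in_root(path):
    """Lexically normalize a path inside a root; '..' at the root is a no-op."""
    stack = []
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            if stack:
                stack.pop()
            # With an empty stack we are at the root: "/.." stays "/".
            continue
        stack.append(part)
    return "/" + "/".join(stack)
```

So a link target such as "../../etc/passwd" evaluated at the root resolves to "/etc/passwd" below that root rather than producing an error.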
|
|
root
|
|
Let's use chase_symlinks() everywhere, and stop using GNU
canonicalize_file_name(). For most cases this should not change
behaviour, but it increases the exposure of our function so that it gets better
tested. Most importantly, in a few cases (most notably nspawn) it can take the
correct root directory into account when chasing symlinks.
|
|
This adds an API for retrieving an app-specific machine ID to sd-id128.
Internally it calculates HMAC-SHA256 with a 128bit app-specific ID as payload
and the machine ID as key.
(An alternative would have been to use siphash for this, which is also
cryptographically strong. However, as it only generates 64bit hashes it's not
an obvious choice for generating 128bit IDs.)
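The derivation described above can be sketched in Python with the standard hmac module. The function name app_specific_id() is hypothetical, and how the 256bit HMAC output is reduced to 128 bits is not stated here: plain truncation to the first 16 bytes is an assumption of this sketch, and any further post-processing of the result is left out.

```python
import hashlib
import hmac


def app_specific_id(machine_id, app_id):
    """Derive a 128bit ID: HMAC-SHA256 keyed by the machine ID over the app ID."""
    if len(machine_id) != 16 or len(app_id) != 16:
        raise ValueError("both IDs must be 16 bytes (128 bits)")
    digest = hmac.new(machine_id, app_id, hashlib.sha256).digest()
    # Assumption: keep the first 16 of the 32 digest bytes.
    return digest[:16]
```

The same machine ID and app ID always yield the same result, while the machine ID itself cannot be read back from it.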
Fixes: #4667
|
|
Let's take inspiration from bluez's ELL library, and let's move our
cryptographic primitives away from libgcrypt and towards the kernel's AF_ALG
cryptographic userspace API.
In the long run we should try to remove the dependency on libgcrypt, in favour
of using only the kernel's own primitives, however this is unlikely to happen
anytime soon, as the kernel does not provide Elliptic Curve APIs to userspace
at this time, and we need them for the DNSSEC cryptography.
This commit only covers hashing for now; symmetric encryption/decryption or
even asymmetric encryption/decryption is not available for now.
"khash" is little more than a lightweight wrapper around the kernel's AF_ALG
socket API.
|
|
"*-*-01..03" is now formatted as "*-*-01..03" instead of "*-*-01,02,03"
|
|
Previously a string like "00:00:01..03" would fail to parse due to the
ambiguity between a decimal point and the start of a range.
|
|
"*:*:*" is now formatted as "*:*:*" instead of "*:*:00/1"
|
|
strtoul() parses leading whitespace and an optional sign;
check that the first character is a digit to prevent odd
specifications like "00: 00: 00" and "-00:+00/-1".
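The stricter check can be sketched in Python, whose int() is similarly lenient about surrounding whitespace and a sign. The helper name parse_number() is hypothetical; it consumes a leading run of ASCII digits and refuses anything that does not start with one.

```python
def parse_number(text):
    """Parse leading ASCII digits; return (value, remaining text)."""
    i = 0
    while i < len(text) and text[i] in "0123456789":
        i += 1
    if i == 0:
        # Leading whitespace or a sign ends up here instead of being skipped.
        raise ValueError("expected a digit at %r" % text)
    return int(text[:i]), text[i:]
```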
|
|
Forbid open ranges like "Tue.."; trailing commas are still OK.
|
|
This makes " UTC" an illegal date specification.
|
|
"*-*-*" is now equivalent to "*-*-* 00:00:00" (daily)
rather than "*-*-* *:*:*" (every second).
|
|
"*-*~1" => The last day of every month
"*-02~3..5" => The third, fourth, and fifth last days in February
"Mon 05~07/1" => The last Monday in May
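As a worked illustration of the "~" syntax, the following Python sketch maps "n-th last day" to a concrete day of the month, and finds the last given weekday by scanning the final seven days. The function names are hypothetical and this is not the calendarspec implementation.

```python
import calendar


def nth_last_day(year, month, n):
    """Day of month for the n-th last day (n=1 is the last day)."""
    days = calendar.monthrange(year, month)[1]
    if not 1 <= n <= days:
        raise ValueError("n out of range for this month")
    return days - n + 1


def last_weekday(year, month, weekday):
    """Last occurrence of weekday (0=Monday) among the final seven days."""
    first = nth_last_day(year, month, 7)
    last = nth_last_day(year, month, 1)
    for day in range(first, last + 1):
        if calendar.weekday(year, month, day) == weekday:
            return day
```

For May 2017 the last seven days are the 25th to the 31st, and the Monday among them is the 29th.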
Resolves #3861
|
|
Stop looking for matches after MAX_YEAR so impossible dates like
"*-02-30" and "*-04-31" don't cause an infinite loop.
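The bounded search can be sketched like this in Python. The function name next_match() is hypothetical and the value used for MAX_YEAR is an assumption of the sketch; the point is only that the loop has an upper bound, so a date that never occurs yields "no match" instead of spinning forever.

```python
import datetime

MAX_YEAR = 2199  # assumed bound for this sketch


def next_match(month, day, start_year, max_year=MAX_YEAR):
    """Return the first valid date for month/day, or None if there is none."""
    for year in range(start_year, max_year + 1):
        try:
            return datetime.date(year, month, day)
        except ValueError:
            # This month/day does not exist in this year, try the next one.
            continue
    return None
```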
|
|
Useful for testing a single module. If nothing is specified, behaviour is the
same as before.
$ ./test-nss myhostname 192.168.0.14 localhost
======== myhostname ========
_nss_myhostname_gethostbyname4_r("localhost") → status=NSS_STATUS_SUCCESS
pat=buffer+0x38 errno=0/--- h_errno=0/Resolver Error 0 (no error) ttl=0
"localhost" AF_INET 127.0.0.1 %lo
"localhost" AF_INET6 ::1 %lo
_nss_myhostname_gethostbyname3_r("localhost", AF_INET) → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error) ttl=0
"localhost"
AF_INET 127.0.0.1
canonical: "localhost"
_nss_myhostname_gethostbyname3_r("localhost", AF_INET6) → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error) ttl=0
"localhost"
AF_INET6 ::1
canonical: "localhost"
_nss_myhostname_gethostbyname3_r("localhost", *) → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error) ttl=0
"localhost"
AF_INET 127.0.0.1
canonical: "localhost"
_nss_myhostname_gethostbyname3_r("localhost", AF_UNIX) → status=NSS_STATUS_UNAVAIL
errno=97/EAFNOSUPPORT h_errno=4/No address associated with name ttl=2147483647
_nss_myhostname_gethostbyname2_r("localhost", AF_INET) → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error)
"localhost"
AF_INET 127.0.0.1
_nss_myhostname_gethostbyname2_r("localhost", AF_INET6) → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error)
"localhost"
AF_INET6 ::1
_nss_myhostname_gethostbyname2_r("localhost", *) → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error)
"localhost"
AF_INET 127.0.0.1
_nss_myhostname_gethostbyname2_r("localhost", AF_UNIX) → status=NSS_STATUS_UNAVAIL
errno=97/EAFNOSUPPORT h_errno=4/No address associated with name
_nss_myhostname_gethostbyname_r("localhost") → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error)
"localhost"
AF_INET 127.0.0.1
_nss_myhostname_gethostbyaddr2_r("192.168.0.14") → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error) ttl=0
"laptop"
AF_INET 192.168.0.14
AF_INET 192.168.122.1
AF_INET 169.254.209.76
_nss_myhostname_gethostbyaddr_r("192.168.0.14") → status=NSS_STATUS_SUCCESS
errno=0/--- h_errno=0/Resolver Error 0 (no error)
"laptop"
AF_INET 192.168.0.14
AF_INET 192.168.122.1
AF_INET 169.254.209.76
|
|
core: add new RestrictNamespaces= unit file setting
Merging, not rebasing, because this touches many files and there were tree-wide cleanups in the meantime.
|
|
Format string tweaks (and a small fix on 32bit)
|
|
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
|
|
|
|
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namespace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should improve security quite a bit, as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
|
|
Tree wide cleanups
|