Age | Commit message (Collapse) | Author |
|
Fixes: #4711
|
|
Given that other file systems (notably: xfs) support reflinks these days, let's
extend the file system snapshotting logic to fall back to plan copies or
reflinks when full btrfs subvolume snapshots are not available.
This essentially makes "systemd-nspawn --ephemeral" and "systemd-nspawn
--template=" available on non-btrfs subvolumes. Of course, both operations will
still be slower on non-btrfs than on btrfs (simply because reflinking each file
individually in a directory tree is still slower than doing this in one step
for a whole subvolume), but it's probably good enough for many cases, and we
should provide the users with the tools, they have to figure out what's good
for them.
Note that "machinectl clone" already had a fallback like this in place, this
patch generalizes this, and adds similar support to our other cases.
|
|
When mountint a loopback image, we need a temporary root directory we can mount
stuff to. Make sure to actually remove it when exiting, so that we don't leave
stuff around in /tmp unnecessarily.
See: #4664
|
|
Let's make the shutdown logic synchronous, so that there's a better chance to
detach the loopback device after use.
|
|
Previously --ephemeral was only supported with container trees in btrfs
subvolumes (i.e. in combination with --directory=). This adds support for
--ephemeral in conjunction with disk images (i.e. --image=) too.
As side effect this fixes that --ephemeral was accepted but ignored when using
-M on a container that turned out to be an image.
Fixes: #4664
|
|
nspawn: R/W support for /sysfs, /proc, and /proc/sys/net
|
|
This commit adds the possibility to leave /sys, and /proc/sys read-write.
It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE
to enable this feature.
If set to "yes", /sys, and /proc/sys will be read-write.
If set to "no", /sys, and /proc/sys will be read-only.
If set to "network" /proc/sys/net will be read-write. This is useful in
use-cases, where systemd-nspawn is used in an external network
namespace.
This adds the possibility to start privileged containers which need more
control over settings in the /proc, and /sys filesystem.
This is also a follow-up on the discussion from
https://github.com/systemd/systemd/pull/4018#r76971862 where an
introduction of a simple env var to enable R/W support for those
directories was already discussed.
|
|
(#4613)
Since 133 is now used in a few places, add a #define for it.
Also make the status message a bit informative.
Another issue introduced in b006762. The logic was borked, we were supposed
to return 0 to break the loop, and 133 to restart the container, not the other
way around.
But this doesn't seem to work, reboot fails with:
Nov 08 00:41:32 laptop systemd-nspawn[26564]: Failed to register machine: Machine 'fedora-rawhide' already exists
So actually the version before this patch worked better, since 133 > 0 and we'd
at least loop internally.
|
|
The file /usr/lib/systemd/resolv.conf can be stale, it does not tell us
whether or not systemd-resolved is running or not.
So check for /run/systemd/resolve/resolv.conf as well, which is created
at runtime and hence is a better indication.
|
|
Format string tweaks (and a small fix on 32bit)
|
|
Commit b006762 inverted the initial exit code which is relevant for --help and
--version without a particular reason. For these special options, parse_argv()
returns 0 so that our main() immediately skips to the end without adjusting
"ret". Otherwise, if an actual container is being started, ret is set on error
in run(), which still provides the "non-zero exit on error" behaviour.
Fixes #4605.
|
|
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
|
|
non-fatal mount errors shouldn't be logged as warnings.
|
|
Tree wide cleanups
|
|
Let's make sure that our loopback files remain sparse, hence let's set
"discard" as mount option on file systems that support it if the backing device
is a loopback.
|
|
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.
Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.
This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence aded additional "*" here and there.
Remove that.
|
|
https://github.com/torvalds/linux/commit/036d523641c66bef713042894a17f4335f199e49
> vfs: Don't create inodes with a uid or gid unknown to the vfs
It is expected that filesystems can not represent uids and gids from
outside of their user namespace. Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.
So, we actually should `reset_uid_gid` early to prevent https://github.com/systemd/systemd/pull/4223#issuecomment-252522955
$ sudo UNIFIED_CGROUP_HIERARCHY=no LD_LIBRARY_PATH=.libs .libs/systemd-nspawn -D /var/lib/machines/fedora-rawhide -U -b systemd.unit=multi-user.target
Spawning container fedora-rawhide on /var/lib/machines/fedora-rawhide.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1073283072 and range 65536.
Failed to mount to /sys/fs/cgroup/systemd: No such file or directory
Details: https://github.com/systemd/systemd/pull/4223#issuecomment-253046519
Fixes: #4352
|
|
https://github.com/systemd/systemd/pull/4372#issuecomment-253723849:
* `mount_all (outer_child)` creates `container_dir/sys/fs/selinux`
* `mount_all (outer_child)` doesn't patch `container_dir/sys/fs` and so on.
* `mount_sysfs (inner_child)` tries to create `/sys/fs/cgroup`
* This fails
370 stat("/sys/fs", {st_dev=makedev(0, 28), st_ino=13880, st_mode=S_IFDIR|0755, st_nlink=3, st_uid=65534, st_gid=65534, st_blksize=4096, st_blocks=0, st_size=60, st_atime=2016/10/14-05:16:43.398665943, st_mtime=2016/10/14-05:16:43.399665943, st_ctime=2016/10/14-05:16:43.399665943}) = 0
370 mkdir("/sys/fs/cgroup", 0755) = -1 EACCES (Permission denied)
* `mount_syfs (inner_child)` ignores that error and
mount(NULL, "/sys", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0
* `mount_cgroups` finally fails
|
|
If allocation fails, the value of the point is "undefined". In practice
this matters very little, but for consistency with rest of the code,
let's check the return value.
|
|
This makes strjoin and strjoina more similar and avoids the useless final
argument.
spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c)
git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/'
This might have missed a few cases (spatch has a really hard time dealing
with _cleanup_ macros), but that's no big issue, they can always be fixed
later.
|
|
|
|
|
|
|
|
Fixes: #4181
|
|
Enhance nspawn debug logs for mount/unmount operations
|
|
Fixes:
host# systemd-nspawn -D ... -U -b systemd.unit=multi-user.target
...
$ grep /tmp /proc/self/mountinfo
154 145 0:41 / /tmp rw - tmpfs tmpfs rw,seclabel,uid=1036124160,gid=1036124160
$ umount /tmp
umount: /root/tmp: not mounted
$ systemctl poweroff
...
[FAILED] Failed unmounting Temporary Directory.
|
|
This makes it easier to debug failed nspawn invocations:
Mounting sysfs on /var/lib/machines/fedora-rawhide/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev (MS_NOSUID|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev/shm (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=1777,uid=1450901504,gid=1450901504")...
Mounting tmpfs on /var/lib/machines/fedora-rawhide/run (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")...
Bind-mounting /sys/fs/selinux on /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_BIND "")...
Remounting /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting proc on /proc (MS_NOSUID|MS_NOEXEC|MS_NODEV "")...
Bind-mounting /proc/sys on /proc/sys (MS_BIND "")...
Remounting /proc/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Bind-mounting /proc/sysrq-trigger on /proc/sysrq-trigger (MS_BIND "")...
Remounting /proc/sysrq-trigger (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")...
Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")...
Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME "mode=755,uid=0,gid=0")...
Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr")...
Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr"): No such file or directory
|
|
- check for oom after strdup
- no need to truncate the line since we're only extracting one field anyway
- use STR_IN_SET
|
|
We would check the condition cg_ns_supported() twice. No functional
change.
|
|
nspawn --private-users parsing, v2
|
|
Autodetect systemd version in containers started by systemd-nspawn
|
|
In particular, the check for arg_uid_range <= 0 is moved to the end, so that
"foobar:0" gives "Failed to parse UID", and not "UID range cannot be 0.".
|
|
This is like the previous reverted commit, but any boolean is still accepted,
not just "yes" and "no". Man page is adjusted to match the code.
|
|
This reverts commit bfd292ec35c7b768f9fb5cff4d921f3133e62b19.
|
|
The documentation says lists "yes", "no", "pick", and numeric arguments.
But parse_boolean was attempted first, so various numeric arguments were
misinterpreted.
In particular, this fixes --private-users=0 to mean the same thing as
--private-users=0:65536.
While at it, use strndupa to avoid some error handling.
Also give a better error for an empty UID range. I think it's likely that
people will use --private-users=0:0 thinking that the argument means UID:GID.
|
|
|
|
Current systemd version detection routine cannot detect systemd 230,
only systmed >= 231. This means that we'll still use the legacy hierarchy
in some cases where we wouldn't have too. If somebody figures out a nice
way to detect systemd 230 this can be later improved.
|
|
systemd-soon-to-be-released-232 is able to deal with the mixed hierarchy.
So make an educated guess, and use the mixed hierarchy in that case.
Tested by running the host with mixed hierarchy (i.e. simply using a recent
kernel with systemd from git), and booting first a container with older systemd,
and then one with a newer systemd.
Fixes #4008.
|
|
|
|
The new function has 416 lines by itself!
"return log_error_errno" is used to nicely reduce the volume of error
handling code.
A few minor issues are fixed on the way:
- positive value was used as error value (EIO), causing systemd-nspawn
to return success, even though it shouldn't.
- In two places random values were used as error status, when the
actual value was in an unusual place (etc_password_lock, notify_socket).
Those are the only functional changes.
There is another potential issue, which is marked with a comment, and left
unresolved: the container can also return 133 by itself, causing a spurious
reboot.
|
|
If we are going to use the env var to override the detection result
anyway, there is not point in doing the detection, especially that
it can fail.
|
|
gcc at some optimization levels thinks thes variables were used without
initialization. it's wrong, but let's make the message go anyway.
|
|
directory (#4226)
Fixes https://github.com/systemd/systemd/issues/3695
At the same time it adds a protection against userns chown of inodes of
a shared mount point.
|
|
|
|
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
|
|
|
|
LXC does this, and we should probably too. Better safe than sorry.
|
|
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.
This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.
With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.
This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.
This is not only the safer option, but also more in-line with what the man page
currently claims:
"Entries (files or directories) listed in ReadWritePaths= are
accessible from within the namespace with the same access rights as
from outside."
To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.
A number of functions are updated to add more debug logging to make this more
digestable.
|
|
This commit is a minor tweak after the split of `--share-system`, decoupling the `--boot`
option from IPC namespacing.
Historically there has been a single `--share-system` option for sharing IPC/PID/UTS with the
host, which was incompatible with boot/pid1 mode. After the split, it is now possible to express
the requirements with better granularity.
For reference, this is a followup to #4023 which contains references to previous discussions.
I realized too late that CLONE_NEWIPC is not strictly needed for boot mode.
|
|
|