summaryrefslogtreecommitdiff
path: root/src/nspawn
AgeCommit message (Collapse)Author
2016-12-01nspawn: properly handle image/directory paths that are symlinksLennart Poettering
This resolves any paths specified on --directory=, --template=, and --image= before using them. This makes sure nspawn can be used correctly on symlinked images and directory trees. Fixes: #2001
2016-12-01tree-wide: stop using canonicalize_file_name(), use chase_symlinks() insteadLennart Poettering
Let's use chase_symlinks() everywhere, and stop using GNU canonicalize_file_name() everywhere. For most cases this should not change behaviour, however increase exposure of our function to get better tested. Most importantly in a few cases (most notably nspawn) it can take the correct root directory into account when chasing symlinks.
2016-11-22nspawn: don't require chown() if userns is not onLennart Poettering
Fixes: #4711
2016-11-22nspawn: add fallback top normal copy/reflink when we cannot btrfs snapshotLennart Poettering
Given that other file systems (notably: xfs) support reflinks these days, let's extend the file system snapshotting logic to fall back to plan copies or reflinks when full btrfs subvolume snapshots are not available. This essentially makes "systemd-nspawn --ephemeral" and "systemd-nspawn --template=" available on non-btrfs subvolumes. Of course, both operations will still be slower on non-btrfs than on btrfs (simply because reflinking each file individually in a directory tree is still slower than doing this in one step for a whole subvolume), but it's probably good enough for many cases, and we should provide the users with the tools, they have to figure out what's good for them. Note that "machinectl clone" already had a fallback like this in place, this patch generalizes this, and adds similar support to our other cases.
2016-11-22nspawn: remove temporary root directory on exitLennart Poettering
When mountint a loopback image, we need a temporary root directory we can mount stuff to. Make sure to actually remove it when exiting, so that we don't leave stuff around in /tmp unnecessarily. See: #4664
2016-11-22nspawn: try to wait for the container PID 1 to exit, before we exitLennart Poettering
Let's make the shutdown logic synchronous, so that there's a better chance to detach the loopback device after use.
2016-11-22nspawn: support ephemeral boots from imagesLennart Poettering
Previously --ephemeral was only supported with container trees in btrfs subvolumes (i.e. in combination with --directory=). This adds support for --ephemeral in conjunction with disk images (i.e. --image=) too. As side effect this fixes that --ephemeral was accepted but ignored when using -M on a container that turned out to be an image. Fixes: #4664
2016-11-18Merge pull request #4395 from s-urbaniak/rw-supportLennart Poettering
nspawn: R/W support for /sysfs, /proc, and /proc/sys/net
2016-11-18nspawn: R/W support for /sys, and /proc/sysSergiusz Urbaniak
This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from https://github.com/systemd/systemd/pull/4018#r76971862 where an introduction of a simple env var to enable R/W support for those directories was already discussed.
2016-11-14nspawn: restart the whole systemd-nspawn@.service unit on container reboot ↵Zbigniew Jędrzejewski-Szmek
(#4613) Since 133 is now used in a few places, add a #define for it. Also make the status message a bit informative. Another issue introduced in b006762. The logic was borked, we were supposed to return 0 to break the loop, and 133 to restart the container, not the other way around. But this doesn't seem to work, reboot fails with: Nov 08 00:41:32 laptop systemd-nspawn[26564]: Failed to register machine: Machine 'fedora-rawhide' already exists So actually the version before this patch worked better, since 133 > 0 and we'd at least loop internally.
2016-11-08nspawn: fix condition for mounting resolv.conf (#4622)Christian Hesse
The file /usr/lib/systemd/resolv.conf can be stale, it does not tell us whether or not systemd-resolved is running or not. So check for /run/systemd/resolve/resolv.conf as well, which is created at runtime and hence is a better indication.
2016-11-08Merge pull request #4612 from keszybz/format-stringsZbigniew Jędrzejewski-Szmek
Format string tweaks (and a small fix on 32bit)
2016-11-07nspawn: fix exit code for --help and --version (#4609)Martin Pitt
Commit b006762 inverted the initial exit code which is relevant for --help and --version without a particular reason. For these special options, parse_argv() returns 0 so that our main() immediately skips to the end without adjusting "ret". Otherwise, if an actual container is being started, ret is set on error in run(), which still provides the "non-zero exit on error" behaviour. Fixes #4605.
2016-11-07Rename formats-util.h to format-util.hZbigniew Jędrzejewski-Szmek
We don't have plural in the name of any other -util files and this inconsistency trips me up every time I try to type this file name from memory. "formats-util" is even hard to pronounce.
2016-11-07systemd-nspawn: decrease non-fatal mount errors to debug level (#4569)tblume
non-fatal mount errors shouldn't be logged as warnings.
2016-11-03Merge pull request #4510 from keszybz/tree-wide-cleanupsLennart Poettering
Tree wide cleanups
2016-11-02nspawn: if we set up a loopback device, try to mount it with "discard"Lennart Poettering
Let's make sure that our loopback files remain sparse, hence let's set "discard" as mount option on file systems that support it if the backing device is a loopback.
2016-10-24seccomp: add new seccomp_init_conservative() helperLennart Poettering
This adds a new seccomp_init_conservative() helper call that is mostly just a wrapper around seccomp_init(), but turns off NNP and adds in all secondary archs, for best compatibility with everything else. Pretty much all of our code used the very same constructs for these three steps, hence unifying this in one small function makes things a lot shorter. This also changes incorrect usage of the "scmp_filter_ctx" type at various places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type (pretty poor choice already!) that casts implicitly to and from all other pointer types (even poorer choice: you defined a confusing type now, and don't even gain any bit of type safety through it...). A lot of the code assumed the type would refer to a structure, and hence aded additional "*" here and there. Remove that.
2016-10-23nspawn: become a new root earlyEvgeny Vereshchagin
https://github.com/torvalds/linux/commit/036d523641c66bef713042894a17f4335f199e49 > vfs: Don't create inodes with a uid or gid unknown to the vfs It is expected that filesystems can not represent uids and gids from outside of their user namespace. Keep things simple by not even trying to create filesystem nodes with non-sense uids and gids. So, we actually should `reset_uid_gid` early to prevent https://github.com/systemd/systemd/pull/4223#issuecomment-252522955 $ sudo UNIFIED_CGROUP_HIERARCHY=no LD_LIBRARY_PATH=.libs .libs/systemd-nspawn -D /var/lib/machines/fedora-rawhide -U -b systemd.unit=multi-user.target Spawning container fedora-rawhide on /var/lib/machines/fedora-rawhide. Press ^] three times within 1s to kill container. Child died too early. Selected user namespace base 1073283072 and range 65536. Failed to mount to /sys/fs/cgroup/systemd: No such file or directory Details: https://github.com/systemd/systemd/pull/4223#issuecomment-253046519 Fixes: #4352
2016-10-23nspawn: really lchown(uid/gid)Evgeny Vereshchagin
https://github.com/systemd/systemd/pull/4372#issuecomment-253723849: * `mount_all (outer_child)` creates `container_dir/sys/fs/selinux` * `mount_all (outer_child)` doesn't patch `container_dir/sys/fs` and so on. * `mount_sysfs (inner_child)` tries to create `/sys/fs/cgroup` * This fails 370 stat("/sys/fs", {st_dev=makedev(0, 28), st_ino=13880, st_mode=S_IFDIR|0755, st_nlink=3, st_uid=65534, st_gid=65534, st_blksize=4096, st_blocks=0, st_size=60, st_atime=2016/10/14-05:16:43.398665943, st_mtime=2016/10/14-05:16:43.399665943, st_ctime=2016/10/14-05:16:43.399665943}) = 0 370 mkdir("/sys/fs/cgroup", 0755) = -1 EACCES (Permission denied) * `mount_syfs (inner_child)` ignores that error and mount(NULL, "/sys", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0 * `mount_cgroups` finally fails
2016-10-23nspawn: use the return value from asprintf instead of checking the pointerZbigniew Jędrzejewski-Szmek
If allocation fails, the value of the point is "undefined". In practice this matters very little, but for consistency with rest of the code, let's check the return value.
2016-10-23tree-wide: drop NULL sentinel from strjoinZbigniew Jędrzejewski-Szmek
This makes strjoin and strjoina more similar and avoids the useless final argument. spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c) git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/' This might have missed a few cases (spatch has a really hard time dealing with _cleanup_ macros), but that's no big issue, they can always be fixed later.
2016-10-21nspawn, NEWS: add missing "s" in --private-users-chown (#4438)Zbigniew Jędrzejewski-Szmek
2016-10-16tree-wide: use mfree moreZbigniew Jędrzejewski-Szmek
2016-10-14nspawn: remove unused variable (#4369)Thomas H. P. Andersen
2016-10-13nspawn: cleanup and chown the synced cgroup hierarchy (#4223)Evgeny Vereshchagin
Fixes: #4181
2016-10-12Merge pull request #4351 from keszybz/nspawn-debuggingLennart Poettering
Enhance nspawn debug logs for mount/unmount operations
2016-10-11nspawn: let's mount(/tmp) inside the user namespace (#4340)Evgeny Vereshchagin
Fixes: host# systemd-nspawn -D ... -U -b systemd.unit=multi-user.target ... $ grep /tmp /proc/self/mountinfo 154 145 0:41 / /tmp rw - tmpfs tmpfs rw,seclabel,uid=1036124160,gid=1036124160 $ umount /tmp umount: /root/tmp: not mounted $ systemctl poweroff ... [FAILED] Failed unmounting Temporary Directory.
2016-10-11nspawn,mount-util: add [u]mount_verbose and use it in nspawnZbigniew Jędrzejewski-Szmek
This makes it easier to debug failed nspawn invocations: Mounting sysfs on /var/lib/machines/fedora-rawhide/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV "")... Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev (MS_NOSUID|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")... Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev/shm (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=1777,uid=1450901504,gid=1450901504")... Mounting tmpfs on /var/lib/machines/fedora-rawhide/run (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")... Bind-mounting /sys/fs/selinux on /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_BIND "")... Remounting /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")... Mounting proc on /proc (MS_NOSUID|MS_NOEXEC|MS_NODEV "")... Bind-mounting /proc/sys on /proc/sys (MS_BIND "")... Remounting /proc/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")... Bind-mounting /proc/sysrq-trigger on /proc/sysrq-trigger (MS_BIND "")... Remounting /proc/sysrq-trigger (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")... Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")... Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME "mode=755,uid=0,gid=0")... Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr")... Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr"): No such file or directory
2016-10-11nspawn: small cleanups in get_controllers()Zbigniew Jędrzejewski-Szmek
- check for oom after strdup - no need to truncate the line since we're only extracting one field anyway - use STR_IN_SET
2016-10-11nspawn: simplify arg_us_cgns passingZbigniew Jędrzejewski-Szmek
We would check the condition cg_ns_supported() twice. No functional change.
2016-10-10Merge pull request #4332 from keszybz/nspawn-arguments-3Lennart Poettering
nspawn --private-users parsing, v2
2016-10-10Merge pull request #4310 from keszybz/nspawn-autodetectEvgeny Vereshchagin
Autodetect systemd version in containers started by systemd-nspawn
2016-10-10nspawn: better error messages for parsing errorsZbigniew Jędrzejewski-Szmek
In particular, the check for arg_uid_range <= 0 is moved to the end, so that "foobar:0" gives "Failed to parse UID", and not "UID range cannot be 0.".
2016-10-10nspawn,man: fix parsing of numeric args for --private-users, accept any booleanZbigniew Jędrzejewski-Szmek
This is like the previous reverted commit, but any boolean is still accepted, not just "yes" and "no". Man page is adjusted to match the code.
2016-10-10Revert "nspawn: fix parsing of numeric arguments for --private-users"Zbigniew Jędrzejewski-Szmek
This reverts commit bfd292ec35c7b768f9fb5cff4d921f3133e62b19.
2016-10-09nspawn: fix parsing of numeric arguments for --private-usersZbigniew Jędrzejewski-Szmek
The documentation says lists "yes", "no", "pick", and numeric arguments. But parse_boolean was attempted first, so various numeric arguments were misinterpreted. In particular, this fixes --private-users=0 to mean the same thing as --private-users=0:65536. While at it, use strndupa to avoid some error handling. Also give a better error for an empty UID range. I think it's likely that people will use --private-users=0:0 thinking that the argument means UID:GID.
2016-10-09nspawn: reindent tableZbigniew Jędrzejewski-Szmek
2016-10-08nspawn: also fall back to legacy cgroup hierarchy for old containersZbigniew Jędrzejewski-Szmek
Current systemd version detection routine cannot detect systemd 230, only systmed >= 231. This means that we'll still use the legacy hierarchy in some cases where we wouldn't have too. If somebody figures out a nice way to detect systemd 230 this can be later improved.
2016-10-08nspawn: use mixed cgroup hierarchy only when container has new systemdZbigniew Jędrzejewski-Szmek
systemd-soon-to-be-released-232 is able to deal with the mixed hierarchy. So make an educated guess, and use the mixed hierarchy in that case. Tested by running the host with mixed hierarchy (i.e. simply using a recent kernel with systemd from git), and booting first a container with older systemd, and then one with a newer systemd. Fixes #4008.
2016-10-08nspawn: fix spurious reboot if container process returns 133Zbigniew Jędrzejewski-Szmek
2016-10-08nspawn: move the main loop body out to a new functionZbigniew Jędrzejewski-Szmek
The new function has 416 lines by itself! "return log_error_errno" is used to nicely reduce the volume of error handling code. A few minor issues are fixed on the way: - positive value was used as error value (EIO), causing systemd-nspawn to return success, even though it shouldn't. - In two places random values were used as error status, when the actual value was in an unusual place (etc_password_lock, notify_socket). Those are the only functional changes. There is another potential issue, which is marked with a comment, and left unresolved: the container can also return 133 by itself, causing a spurious reboot.
2016-10-08nspawn: check env var first, detect secondZbigniew Jędrzejewski-Szmek
If we are going to use the env var to override the detection result anyway, there is not point in doing the detection, especially that it can fail.
2016-10-06tree-wide: drop some misleading compiler warningsLennart Poettering
gcc at some optimization levels thinks thes variables were used without initialization. it's wrong, but let's make the message go anyway.
2016-10-05nspawn: add log message to let users know that nspawn needs an empty /dev ↵Djalal Harouni
directory (#4226) Fixes https://github.com/systemd/systemd/issues/3695 At the same time it adds a protection against userns chown of inodes of a shared mount point.
2016-10-03nspawn: set shared propagation mode for the containerAlban Crequy
2016-09-28Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1Evgeny Vereshchagin
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
2016-09-26treewide: fix typos (#4217)Torstein Husebø
2016-09-25nspawn: let's mount /proc/sysrq-trigger read-only by defaultLennart Poettering
LXC does this, and we should probably too. Better safe than sorry.
2016-09-25namespace: rework how ReadWritePaths= is appliedLennart Poettering
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths= specification, then we'd first recursively apply the ReadOnlyPaths= paths, and make everything below read-only, only in order to then flip the read-only bit again for the subdirs listed in ReadWritePaths= below it. This is not only ugly (as for the dirs in question we first turn on the RO bit, only to turn it off again immediately after), but also problematic in containers, where a container manager might have marked a set of dirs read-only and this code will undo this is ReadWritePaths= is set for any. With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be applied to the children listed in ReadWritePaths= in the first place, so that we do not need to turn off the RO bit for those after all. This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the RO bit, but never to turn it off again. Or to say this differently: if some dirs are marked read-only via some external tool, then ReadWritePaths= will not undo it. This is not only the safer option, but also more in-line with what the man page currently claims: "Entries (files or directories) listed in ReadWritePaths= are accessible from within the namespace with the same access rights as from outside." To implement this change bind_remount_recursive() gained a new "blacklist" string list parameter, which when passed may contain subdirs that shall be excluded from the read-only mounting. A number of functions are updated to add more debug logging to make this more digestable.