summaryrefslogtreecommitdiff
path: root/src/nspawn/nspawn-mount.c
AgeCommit message (Collapse)Author
2017-02-24Merge pull request #5444 from poettering/cgroups-revert-no-errorZbigniew Jędrzejewski-Szmek
Revert "core: simplify cg_[all_]unified()" and more.
2017-02-24Fix missing space in comments (#5439)AsciiWolf
2017-02-24cgroup: change cg_unified() to possibly return errors againLennart Poettering
We use our cgroup APIs in various contexts, including from our libraries sd-login, sd-bus. As we don#t control those environments we can't rely that the unified cgroup setup logic succeeds, and hence really shouldn't assert on it. This more or less reverts 415fc41ceaeada2e32639f24f134b1c248b9e43f.
2017-02-20core: make hybrid cgroup unified mode keep compat /sys/fs/cgroup/systemd ↵Tejun Heo
hierarchy Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1 name=systemd hierarchy. While this works fine for systemd itself, it breaks tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd. This patch updates the hybrid mode so that it mounts v2 hierarchy on /sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on /sys/fs/cgroup/systemd for compatibility. systemd itself doesn't depend on the "name=systemd" hierarchy at all. All operations take place on the v2 hierarchy as before but the v1 hierarchy is kept in sync so that any tools which expect it to be there can keep doing so. This allows systemd to take advantage of cgroup v2 process management without requiring other tools to be aware of the hybrid mode. The hybrid mode is implemented by mapping the special systemd controller to /sys/fs/cgroup/unified and making the basic cgroup utility operations - cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the /sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated. While a bit messy, this will allow dropping complications from using cgroup v1 for process management a lot sooner than otherwise possible which should make it a net gain in terms of maintainability. v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount point to /sys/fs/cgroup/unified as suggested by @brauner. v3: chown the compat hierarchy too on delegation. Suggested by @evverx. v4: [zj] - drop the change to default, full "legacy" is still the default.
2017-02-18core: simplify cg_[all_]unified()Tejun Heo
cg_[all_]unified() test whether a specific controller or all controllers are on the unified hierarchy. While what's being asked is a simple binary question, the callers must assume that the functions may fail any time, which unnecessarily complicates their usages. This complication is unnecessary. Internally, the test result is cached anyway and there are only a few places where the test actually needs to be performed. This patch simplifies cg_[all_]unified(). * cg_[all_]unified() are updated to return bool. If the result can't be decided, assertion failure is triggered. Error handlings from their callers are dropped. * cg_unified_flush() is updated to calculate the new result synchrnously and return whether it succeeded or not. Places which need to flush the test result are updated to test for failure. This ensures that all the following cg_[all_]unified() tests succeed. * Places which expected possible cg_[all_]unified() failures are updated to call and test cg_unified_flush() before calling cg_[all_]unified(). This includes functions used while setting up mounts during boot and manager_setup_cgroup().
2017-02-08nspawn: Add support for sysroot pivoting (#5258)Philip Withnall
Add a new --pivot-root argument to systemd-nspawn, which specifies a directory to pivot to / inside the container; while the original / is pivoted to another specified directory (if provided). This adds support for booting container images which may contain several bootable sysroots, as is common with OSTree disk images. When these disk images are booted on real hardware, ostree-prepare-root is run in conjunction with sysroot.mount in the initramfs to achieve the same results.
2016-12-20nspawn: split out VolatileMode definitionsLennart Poettering
This moves the VolatileMode enum and its helper functions to src/shared/. This is useful to then reuse them to implement systemd.volatile= in a later commit.
2016-12-05nspawn: don't hide --bind=/tmp/* mounts (#4824)Evgeny Vereshchagin
Fixes #4789
2016-12-01util-lib: rename CHASE_NON_EXISTING → CHASE_NONEXISTENTLennart Poettering
As suggested by @keszybz
2016-12-01nspawn: improve log messagesLennart Poettering
When complaining about the inability to resolve a path, show the full path, not just the relative one. As suggested by @keszybz.
2016-12-01nspawn: optionally, automatically allocated --bind=/--overlay source from ↵Lennart Poettering
/var/tmp This extends the --bind= and --overlay= syntax so that an empty string as source/upper directory is taken as request to automatically allocate a temporary directory below /var/tmp, whose lifetime is bound to the nspawn runtime. In combination with the "+" path extension this permits a switch "--overlay=+/var::/var" in order to use the container's shipped /var, combine it with a writable temporary directory and mount it to the runtime /var of the container.
2016-12-01nspawn: permit prefixing of source paths in --bind= and --overlay= with "+"Lennart Poettering
If a source path is prefixed with "+" it is taken relative to the container's root directory instead of the host. This permits easily establishing bind and overlay mounts based on data from the container rather than the host. This also reworks custom_mounts_prepare(), and turns it into two functions: one custom_mount_check_all() that remains in nspawn.c but purely verifies the validity of the custom mounts configured. And one called custom_mount_prepare_all() that actually does the preparation step, sorts the custom mounts, resolves relative paths, and allocates temporary directories as necessary.
2016-12-01nspawn: split out overlayfs argument parsing into a function of its ownLennart Poettering
Add overlay_mount_parse() similar in style to tmpfs_mount_parse() and bind_mount_parse().
2016-12-01nspawn: use -ENOMEM instead of log_oom() in one caseLennart Poettering
The function is of the "library" kind and doesn't log ENOMEM in all other cases, hence fix the one outlier.
2016-12-01nspawn: use the new CHASE_NON_EXISTING flag when resolving mount pointsLennart Poettering
This restores the ability to implicitly create files/directories to mount specified mount points on.
2016-12-01fs-util: add flags parameter to chase_symlinks()Lennart Poettering
Let's remove chase_symlinks_prefix() and instead introduce a flags parameter to chase_symlinks(), with a flag CHASE_PREFIX_ROOT that exposes the behaviour of chase_symlinks_prefix().
2016-12-01nspawn: use chase_symlinks() on all paths specified via --tmpfs=, --bind= ↵Lennart Poettering
and so on Fixes: #2860
2016-12-01nspawn: coding style: don't mix variable declarations and function callsLennart Poettering
2016-12-01nspawn: use realloc_multiply() where it makes senseLennart Poettering
2016-12-01tree-wide: stop using canonicalize_file_name(), use chase_symlinks() insteadLennart Poettering
Let's use chase_symlinks() everywhere, and stop using GNU canonicalize_file_name() everywhere. For most cases this should not change behaviour, however increase exposure of our function to get better tested. Most importantly in a few cases (most notably nspawn) it can take the correct root directory into account when chasing symlinks.
2016-11-22nspawn: don't require chown() if userns is not onLennart Poettering
Fixes: #4711
2016-11-18nspawn: R/W support for /sys, and /proc/sysSergiusz Urbaniak
This commit adds the possibility to leave /sys, and /proc/sys read-write. It introduces a new (undocumented) env var SYSTEMD_NSPAWN_API_VFS_WRITABLE to enable this feature. If set to "yes", /sys, and /proc/sys will be read-write. If set to "no", /sys, and /proc/sys will be read-only. If set to "network" /proc/sys/net will be read-write. This is useful in use-cases, where systemd-nspawn is used in an external network namespace. This adds the possibility to start privileged containers which need more control over settings in the /proc, and /sys filesystem. This is also a follow-up on the discussion from https://github.com/systemd/systemd/pull/4018#r76971862 where an introduction of a simple env var to enable R/W support for those directories was already discussed.
2016-11-07systemd-nspawn: decrease non-fatal mount errors to debug level (#4569)tblume
non-fatal mount errors shouldn't be logged as warnings.
2016-11-03Merge pull request #4510 from keszybz/tree-wide-cleanupsLennart Poettering
Tree wide cleanups
2016-10-23nspawn: really lchown(uid/gid)Evgeny Vereshchagin
https://github.com/systemd/systemd/pull/4372#issuecomment-253723849: * `mount_all (outer_child)` creates `container_dir/sys/fs/selinux` * `mount_all (outer_child)` doesn't patch `container_dir/sys/fs` and so on. * `mount_sysfs (inner_child)` tries to create `/sys/fs/cgroup` * This fails 370 stat("/sys/fs", {st_dev=makedev(0, 28), st_ino=13880, st_mode=S_IFDIR|0755, st_nlink=3, st_uid=65534, st_gid=65534, st_blksize=4096, st_blocks=0, st_size=60, st_atime=2016/10/14-05:16:43.398665943, st_mtime=2016/10/14-05:16:43.399665943, st_ctime=2016/10/14-05:16:43.399665943}) = 0 370 mkdir("/sys/fs/cgroup", 0755) = -1 EACCES (Permission denied) * `mount_syfs (inner_child)` ignores that error and mount(NULL, "/sys", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = 0 * `mount_cgroups` finally fails
2016-10-23nspawn: use the return value from asprintf instead of checking the pointerZbigniew Jędrzejewski-Szmek
If allocation fails, the value of the point is "undefined". In practice this matters very little, but for consistency with rest of the code, let's check the return value.
2016-10-23tree-wide: drop NULL sentinel from strjoinZbigniew Jędrzejewski-Szmek
This makes strjoin and strjoina more similar and avoids the useless final argument. spatch -I . -I ./src -I ./src/basic -I ./src/basic -I ./src/shared -I ./src/shared -I ./src/network -I ./src/locale -I ./src/login -I ./src/journal -I ./src/journal -I ./src/timedate -I ./src/timesync -I ./src/nspawn -I ./src/resolve -I ./src/resolve -I ./src/systemd -I ./src/core -I ./src/core -I ./src/libudev -I ./src/udev -I ./src/udev/net -I ./src/udev -I ./src/libsystemd/sd-bus -I ./src/libsystemd/sd-event -I ./src/libsystemd/sd-login -I ./src/libsystemd/sd-netlink -I ./src/libsystemd/sd-network -I ./src/libsystemd/sd-hwdb -I ./src/libsystemd/sd-device -I ./src/libsystemd/sd-id128 -I ./src/libsystemd-network --sp-file coccinelle/strjoin.cocci --in-place $(git ls-files src/*.c) git grep -e '\bstrjoin\b.*NULL' -l|xargs sed -i -r 's/strjoin\((.*), NULL\)/strjoin(\1)/' This might have missed a few cases (spatch has a really hard time dealing with _cleanup_ macros), but that's no big issue, they can always be fixed later.
2016-10-12Merge pull request #4351 from keszybz/nspawn-debuggingLennart Poettering
Enhance nspawn debug logs for mount/unmount operations
2016-10-11nspawn: let's mount(/tmp) inside the user namespace (#4340)Evgeny Vereshchagin
Fixes: host# systemd-nspawn -D ... -U -b systemd.unit=multi-user.target ... $ grep /tmp /proc/self/mountinfo 154 145 0:41 / /tmp rw - tmpfs tmpfs rw,seclabel,uid=1036124160,gid=1036124160 $ umount /tmp umount: /root/tmp: not mounted $ systemctl poweroff ... [FAILED] Failed unmounting Temporary Directory.
2016-10-11nspawn,mount-util: add [u]mount_verbose and use it in nspawnZbigniew Jędrzejewski-Szmek
This makes it easier to debug failed nspawn invocations: Mounting sysfs on /var/lib/machines/fedora-rawhide/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV "")... Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev (MS_NOSUID|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")... Mounting tmpfs on /var/lib/machines/fedora-rawhide/dev/shm (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=1777,uid=1450901504,gid=1450901504")... Mounting tmpfs on /var/lib/machines/fedora-rawhide/run (MS_NOSUID|MS_NODEV|MS_STRICTATIME "mode=755,uid=1450901504,gid=1450901504")... Bind-mounting /sys/fs/selinux on /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_BIND "")... Remounting /var/lib/machines/fedora-rawhide/sys/fs/selinux (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")... Mounting proc on /proc (MS_NOSUID|MS_NOEXEC|MS_NODEV "")... Bind-mounting /proc/sys on /proc/sys (MS_BIND "")... Remounting /proc/sys (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")... Bind-mounting /proc/sysrq-trigger on /proc/sysrq-trigger (MS_BIND "")... Remounting /proc/sysrq-trigger (MS_RDONLY|MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_BIND|MS_REMOUNT "")... Mounting tmpfs on /tmp (MS_STRICTATIME "mode=1777,uid=0,gid=0")... Mounting tmpfs on /sys/fs/cgroup (MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_STRICTATIME "mode=755,uid=0,gid=0")... Mounting cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr")... Failed to mount cgroup on /sys/fs/cgroup/systemd (MS_NOSUID|MS_NOEXEC|MS_NODEV "none,name=systemd,xattr"): No such file or directory
2016-10-11nspawn: small cleanups in get_controllers()Zbigniew Jędrzejewski-Szmek
- check for oom after strdup - no need to truncate the line since we're only extracting one field anyway - use STR_IN_SET
2016-10-11nspawn: simplify arg_us_cgns passingZbigniew Jędrzejewski-Szmek
We would check the condition cg_ns_supported() twice. No functional change.
2016-09-25nspawn: let's mount /proc/sysrq-trigger read-only by defaultLennart Poettering
LXC does this, and we should probably too. Better safe than sorry.
2016-09-25namespace: rework how ReadWritePaths= is appliedLennart Poettering
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths= specification, then we'd first recursively apply the ReadOnlyPaths= paths, and make everything below read-only, only in order to then flip the read-only bit again for the subdirs listed in ReadWritePaths= below it. This is not only ugly (as for the dirs in question we first turn on the RO bit, only to turn it off again immediately after), but also problematic in containers, where a container manager might have marked a set of dirs read-only and this code will undo this is ReadWritePaths= is set for any. With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be applied to the children listed in ReadWritePaths= in the first place, so that we do not need to turn off the RO bit for those after all. This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the RO bit, but never to turn it off again. Or to say this differently: if some dirs are marked read-only via some external tool, then ReadWritePaths= will not undo it. This is not only the safer option, but also more in-line with what the man page currently claims: "Entries (files or directories) listed in ReadWritePaths= are accessible from within the namespace with the same access rights as from outside." To implement this change bind_remount_recursive() gained a new "blacklist" string list parameter, which when passed may contain subdirs that shall be excluded from the read-only mounting. A number of functions are updated to add more debug logging to make this more digestable.
2016-08-17core: use the unified hierarchy for the systemd cgroup controller hierarchyTejun Heo
Currently, systemd uses either the legacy hierarchies or the unified hierarchy. When the legacy hierarchies are used, systemd uses a named legacy hierarchy mounted on /sys/fs/cgroup/systemd without any kernel controllers for process management. Due to the shortcomings in the legacy hierarchy, this involves a lot of workarounds and complexities. Because the unified hierarchy can be mounted and used in parallel to legacy hierarchies, there's no reason for systemd to use a legacy hierarchy for management even if the kernel resource controllers need to be mounted on legacy hierarchies. It can simply mount the unified hierarchy under /sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies. This disables a significant amount of fragile workaround logics and would allow using features which depend on the unified hierarchy membership such bpf cgroup v2 membership test. In time, this would also allow deleting the said complexities. This patch updates systemd so that it prefers the unified hierarchy for the systemd cgroup controller hierarchy when legacy hierarchies are used for kernel resource controllers. * cg_unified(@controller) is introduced which tests whether the specific controller in on unified hierarchy and used to choose the unified hierarchy code path for process and service management when available. Kernel controller specific operations remain gated by cg_all_unified(). * "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to force the use of legacy hierarchy for systemd cgroup controller. * nspawn: By default nspawn uses the same hierarchies as the host. If UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If 0, legacy for all. * nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of three options - legacy, only systemd controller on unified, and unified. The value is passed into mount setup functions and controls cgroup configuration. * nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount option is moved to mount_legacy_cgroup_hierarchy() so that it can take an appropriate action depending on the configuration of the host. v2: - CGroupUnified enum replaces open coded integer values to indicate the cgroup operation mode. - Various style updates. v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2. v4: Restored legacy container on unified host support and fixed another bug in detect_unified_cgroup_hierarchy().
2016-08-15core: rename cg_unified() to cg_all_unified()Tejun Heo
A following patch will update cgroup handling so that the systemd controller (/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel resource controllers are on the legacy hierarchies. This would require distinguishing whether all controllers are on cgroup v2 or only the systemd controller is. In preparation, this patch renames cg_unified() to cg_all_unified(). This patch doesn't cause any functional changes.
2016-07-26nspawn: add SYSTEMD_NSPAWN_USE_CGNS env variable (#3809)Christian Brauner
SYSTEMD_NSPAWN_USE_CGNS allows to disable the use of cgroup namespaces.
2016-07-25Merge pull request #3589 from brauner/cgroup_namespaceLennart Poettering
Cgroup namespace
2016-07-20nspawn: when netns is on, mount /proc/sys/net writableLennart Poettering
Normally we make all of /proc/sys read-only in a container, but if we do have netns enabled we can make /proc/sys/net writable, as things are virtualized then.
2016-07-18nspawn: decrease mkdir error logging in /sys to debug priority (#3748)tblume
Such mkdir errors happen for example when trying to mkdir /sys/fs/selinux. /sys is documented to be readonly in the container, so mkdir errors below /sys can be expected. They shouldn't be logged as warnings since they lead users to think that there is something wrong.
2016-07-09nspawn: handle cgroup namespacesChristian Brauner
(NOTE: Cgroup namespaces work with legacy and unified hierarchies: "This is completely backward compatible and will be completely invisible to any existing cgroup users (except for those running inside a cgroup namespace and looking at /proc/pid/cgroup of tasks outside their namespace.)" (https://lists.linuxfoundation.org/pipermail/containers/2016-January/036582.html) So there is no need to special case unified.) If cgroup namespaces are supported we skip mount_cgroups() in the outer_child(). Instead, we unshare(CLONE_NEWCGROUP) in the inner_child() and only then do we call mount_cgroups(). The clean way to handle cgroup namespaces would be to delegate mounting of cgroups completely to the init system in the container. However, this would likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply requests a shell and no init is available to mount cgroups. Hence, we introduce mount_legacy_cgns_supported(). After calling unshare(CLONE_NEWCGROUP) it parses /proc/self/cgroup to find the mounted controllers and mounts them inside the new cgroup namespace. This should preserve backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag and mount a cgroupfs when no init in the container is running.
2016-04-01prevent systemd-nspawn from trying to create targetBjørnar Ness
for bind-mounts when they already exist. This allows bind-mounting over read-only files.
2016-03-26cgroup2: use new fstype for unified hierarchyAlban Crequy
Since Linux v4.4-rc1, __DEVEL__sane_behavior does not exist anymore and is replaced by a new fstype "cgroup2". With this patch, systemd no longer supports the old (unstable) way of doing unified hierarchy with __DEVEL__sane_behavior and systemd now requires Linux v4.4 for unified hierarchy. Non-unified hierarchy is still the default and is unchanged by this patch. https://github.com/torvalds/linux/commit/67e9c74b8a873408c27ac9a8e4c1d1c8d72c93ff
2016-02-10tree-wide: remove Emacs lines from all filesDaniel Mack
This should be handled fine now by .dir-locals.el, so need to carry that stuff in every file.
2015-11-09treewide: apply errno.cocciMichal Schmidt
with small manual cleanups for style.
2015-10-27util-lib: split out allocation calls into alloc-util.[ch]Lennart Poettering
2015-10-27user-util: move UID/GID related macros from macro.h to user-util.hLennart Poettering
2015-10-27util-lib: split stat()/statfs()/stavfs() related calls into stat-util.[ch]Lennart Poettering
2015-10-27util-lib: move a number of fs operations into fs-util.[ch]Lennart Poettering
2015-10-27util-lib: move mount related utility calls to mount-util.[ch]Lennart Poettering