path: root/src/nspawn
Age | Commit message | Author
2017-02-20 | core: make hybrid cgroup unified mode keep compat /sys/fs/cgroup/systemd hierarchy | Tejun Heo
Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1 name=systemd hierarchy. While this works fine for systemd itself, it breaks tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd.

This patch updates the hybrid mode so that it mounts v2 hierarchy on /sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on /sys/fs/cgroup/systemd for compatibility. systemd itself doesn't depend on the "name=systemd" hierarchy at all. All operations take place on the v2 hierarchy as before but the v1 hierarchy is kept in sync so that any tools which expect it to be there can keep doing so. This allows systemd to take advantage of cgroup v2 process management without requiring other tools to be aware of the hybrid mode.

The hybrid mode is implemented by mapping the special systemd controller to /sys/fs/cgroup/unified and making the basic cgroup utility operations - cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the /sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated.

While a bit messy, this will allow dropping complications from using cgroup v1 for process management a lot sooner than otherwise possible which should make it a net gain in terms of maintainability.

v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount point to /sys/fs/cgroup/unified as suggested by @brauner.
v3: chown the compat hierarchy too on delegation. Suggested by @evverx.
v4: [zj] - drop the change to default, full "legacy" is still the default.
2017-02-18 | core: simplify cg_[all_]unified() | Tejun Heo
cg_[all_]unified() test whether a specific controller or all controllers are on the unified hierarchy. While what's being asked is a simple binary question, the callers must assume that the functions may fail at any time, which needlessly complicates their usage. Internally, the test result is cached anyway and there are only a few places where the test actually needs to be performed. This patch simplifies cg_[all_]unified().

* cg_[all_]unified() are updated to return bool. If the result can't be decided, an assertion failure is triggered. Error handling in their callers is dropped.
* cg_unified_flush() is updated to calculate the new result synchronously and return whether it succeeded or not. Places which need to flush the test result are updated to test for failure. This ensures that all the following cg_[all_]unified() tests succeed.
* Places which expected possible cg_[all_]unified() failures are updated to call and test cg_unified_flush() before calling cg_[all_]unified(). This includes functions used while setting up mounts during boot and manager_setup_cgroup().
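The cached-query pattern described above can be sketched roughly as follows. This is not the systemd code: `probe_result`, `probe_hierarchy()`, `unified_flush_sketch()` and `unified_sketch()` are hypothetical names, and the probe is a stand-in for the real filesystem check.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the real filesystem probe (hypothetical): 1 means unified,
 * 0 means legacy, a negative errno-style value means the probe failed. */
int probe_result = 1;

static int probe_hierarchy(void) {
        return probe_result;
}

/* Cached answer: -1 while unknown, otherwise 0 or 1. */
static int unified_cached = -1;

/* Recompute the cached answer right away and report whether that worked.
 * Callers that can tolerate failure check the return value here. */
int unified_flush_sketch(void) {
        int r;

        unified_cached = -1;

        r = probe_hierarchy();
        if (r < 0)
                return r;

        unified_cached = r > 0;
        return 0;
}

/* Plain yes/no query. An undecidable state is treated as a programming
 * error, so callers need no error path. */
bool unified_sketch(void) {
        if (unified_cached < 0)
                (void) unified_flush_sketch();

        assert(unified_cached >= 0);
        return unified_cached > 0;
}
```

Code that must cope with a failing probe calls the flush function first and checks its result; everything afterwards can use the boolean query without error handling.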
2017-02-18 | nspawn: fix cgroup mode detection | Tejun Heo
cgroup mode detection is broken in two different ways.

* detect_unified_cgroup_hierarchy() is called too deeply nested, in outer_child(). sync_cgroup(), which is used by run(), also needs to know the requested cgroup mode but it's currently always getting CGROUP_UNIFIED_UNKNOWN. This makes it skip syncing the inner cgroup hierarchy on some config combinations.

    $ cat /proc/self/cgroup | grep systemd
    1:name=systemd:/user.slice/user-0.slice/session-c1.scope
    $ UNIFIED_CGROUP_HIERARCHY=0 SYSTEMD_NSPAWN_USE_CGNS=0 systemd-nspawn -M container
    ...
    [root@container ~]# cat /proc/self/cgroup | grep systemd
    1:name=systemd:/machine.slice/machine-container.x86_64.scope
    $ exit
    $ UNIFIED_CGROUP_HIERARCHY=1 SYSTEMD_NSPAWN_USE_CGNS=0 systemd-nspawn -M container
    [root@container ~]# cat /proc/self/cgroup | grep 0::
    0::/
    $ exit

  Note how the unified hierarchy case's path is not synchronized with the host. This, for example, can cause issues when there are multiple such containers. Fixed by moving the detect_unified_cgroup_hierarchy() invocation to main().

* inner_child() was invoking cg_unified_flush(). inner_child() executes fully scoped and can't determine which cgroup mode the host was in. It doesn't make sense to keep flushing the detected mode when the host mode can't change. Fixed by replacing cg_unified_flush() invocations in outer_child() and inner_child() with one in main().
2017-02-18 | Merge pull request #5369 from poettering/nspawn-resolved | Zbigniew Jędrzejewski-Szmek
fixes for running nspawn+resolved in combination
2017-02-17 | nspawn: tweak check whether resolved is around a bit | Lennart Poettering
Let's check D-Bus instead of files in /run to see if resolved is running. This is a bit nicer as bus names are automatically cleaned up when resolved dies, which is not the case for files in /run. See: #4649
2017-02-17 | copy: change the various copy_xyz() calls to take a unified flags parameter | Lennart Poettering
This adds a unified "copy_flags" parameter to all copy_xyz() function calls, replacing the various boolean flags so far used. This should make many invocations more readable as it is clear what behaviour is precisely requested. This also prepares ground for adding support for more modes later on.
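As a rough illustration of why a flags parameter reads better than a row of booleans, consider the sketch below. The flag names and `copy_tree_sketch()` are placeholders chosen for illustration and are not claimed to match the actual copy_xyz() signatures.

```c
#include <stdbool.h>
#include <stdio.h>

/* Placeholder flag names, chosen for illustration only. */
typedef enum CopyFlags {
        COPY_REFLINK = 1 << 0, /* try a copy-on-write clone before a byte copy */
        COPY_MERGE   = 1 << 1, /* merge into an existing destination tree */
} CopyFlags;

static bool flag_is_set(CopyFlags flags, CopyFlags f) {
        return (flags & f) == f;
}

/* Old style (hypothetical): copy_tree(from, to, true, false);
 * New style: the call site names the behaviour it asks for. */
int copy_tree_sketch(const char *from, const char *to, CopyFlags flags) {
        printf("copy %s -> %s%s%s\n", from, to,
               flag_is_set(flags, COPY_REFLINK) ? " [reflink]" : "",
               flag_is_set(flags, COPY_MERGE) ? " [merge]" : "");

        /* Return the number of optional behaviours requested, so the
         * sketch has an observable result; a real copy would go here. */
        return flag_is_set(flags, COPY_REFLINK) + flag_is_set(flags, COPY_MERGE);
}
```

A call such as `copy_tree_sketch("/a", "/b", COPY_REFLINK|COPY_MERGE)` documents itself, and a new mode only needs a new enum value rather than another positional argument.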
2017-02-08 | Merge pull request #4962 from poettering/root-directory-2 | Zbigniew Jędrzejewski-Szmek
Add new MountAPIVFS= boolean unit file setting + RootImage=
2017-02-08 | nspawn: Add support for sysroot pivoting (#5258) | Philip Withnall
Add a new --pivot-root argument to systemd-nspawn, which specifies a directory to pivot to / inside the container; while the original / is pivoted to another specified directory (if provided). This adds support for booting container images which may contain several bootable sysroots, as is common with OSTree disk images. When these disk images are booted on real hardware, ostree-prepare-root is run in conjunction with sysroot.mount in the initramfs to achieve the same results.
2017-02-07 | core,nspawn,dissect: make nspawn's .roothash file search reusable | Lennart Poettering
This makes nspawn's logic of automatically discovering the root hash of an image file generic, and then reuses it in systemd-dissect and in PID1's RootImage= logic, so that verity is automatically set up whenever we can.
2017-02-02 | nspawn: shown exec() command is misleading | Lennart Poettering
There's no point in updating exec_target for each binary we try to execute, if we override it right away anyway... Let's just do this once, and include all binaries we try each time. Follow-up for 1a68e1e543fd8f899503bec00585a16ada296ef7.
2017-02-02 | fs-util: unify code we use to check if dirent's d_name is "." or ".." | Lennart Poettering
We use different idioms at different places. Let's replace them with the one true new idiom, which is even a bit faster...
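One way such a check can be written without any strcmp() calls is sketched below. The function name `is_dot_or_dot_dot()` is illustrative and the body is an assumption about what a fast idiom could look like, not a copy of the helper the commit introduces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true only for "." and "..". At most three characters are
 * inspected, and no strcmp() call is needed. */
bool is_dot_or_dot_dot(const char *name) {
        if (!name)
                return false;
        if (name[0] != '.')
                return false;
        if (name[1] == '\0')
                return true;
        if (name[1] != '.')
                return false;

        return name[2] == '\0';
}
```

In a readdir() loop this would be called on each entry's `d_name` to skip the two special entries.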
2017-02-01 | nspawn: Print attempted execv() path on failure (#5199) | Philip Withnall
The failure message is currently typically:

    execv() failed: No such file or directory

which is not very useful because it doesn’t tell you which file or directory it was trying to exec.
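A minimal sketch of reporting the attempted path, assuming a hypothetical helper `exec_or_report()` and a message format invented for this example (nspawn's actual log wording may differ):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to execute path; execv() only returns on failure. Reports which
 * path was attempted and returns a negative errno-style value. */
int exec_or_report(const char *path, char *const argv[]) {
        int saved;

        assert(path);
        assert(argv);

        execv(path, argv);

        saved = errno; /* fprintf() may clobber errno */
        fprintf(stderr, "execv(%s) failed: %s\n", path, strerror(saved));
        return -saved;
}
```

With the path included, a missing binary and a missing interpreter or library can at least be told apart by checking whether the named file exists.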
2017-01-31 | tree-wide: adjust fall through comments so that gcc is happy | Zbigniew Jędrzejewski-Szmek
gcc 7 adds -Wimplicit-fallthrough=3 to -Wextra. There are a few ways we could deal with that. After we take into account the need to stay compatible with older versions of the compiler (and other compilers), I don't think adding __attribute__((fallthrough)), even as a macro, is worth the trouble. It sticks out too much, a comment is just as good. But gcc has some very specific requirements for how the comment should look. Adjust it to the specific form that it likes. I don't think the extra stuff we had in those comments was adding much value. (Note: the documentation seems to be wrong, and seems to describe a different pattern from the one that is actually used. I guess either the docs or the code will have to change before gcc 7 is finalized.)
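For illustration, a switch with intentional fall-through annotated by a plain comment might look like the sketch below. The function `cleanup_steps_from()` is hypothetical, and the exact comment wording used tree-wide is not shown in this log; `/* fall through */` is one form that gcc's level-3 check is generally understood to accept.

```c
/* Hypothetical example: count how many cleanup steps run when teardown
 * starts at a given stage. Each stage also runs all later steps. */
int cleanup_steps_from(int stage) {
        int n = 0;

        switch (stage) {
        case 2:
                n++;
                /* fall through */
        case 1:
                n++;
                /* fall through */
        case 0:
                n++;
                break;
        default:
                return -1; /* unknown stage */
        }

        return n;
}
```

The comment has to be the last thing before the next `case` label for the compiler to treat the fall-through as intentional.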
2017-01-31 | nspawn: fix clobbering of selinux context arg | Zbigniew Jędrzejewski-Szmek
First bug fixed by gcc 7. Yikes.
2017-01-24 | tree-wide: remove consecutive duplicate words in comments (#5148) | Stefan Schweter
2017-01-18 | Merge pull request #5098 from evverx/fix-nspawn-notifications | Djalal Harouni
nspawn: change owner/group of /run/systemd/nspawn/notify to userns-root
2017-01-17 | Merge pull request #4991 from poettering/seccomp-fix | Zbigniew Jędrzejewski-Szmek
2017-01-17 | seccomp: rework seccomp code, to improve compat with some archs | Lennart Poettering
This substantially reworks the seccomp code, to ensure better compatibility with some architectures, including i386. So far we relied on libseccomp's internal handling of the multiple syscall ABIs supported on Linux. This is problematic however, as it does not define clear semantics if an ABI is not able to support specific seccomp rules we install.

This rework hence changes a couple of things:

- We no longer use seccomp_rule_add(), but only seccomp_rule_add_exact(), and fail the installation of a filter if the architecture doesn't support it.
- We no longer rely on adding multiple syscall architectures to a single filter, but instead install a separate filter for each syscall architecture supported. This way, we can install a strict filter for x86-64, while permitting a less strict filter for i386.
- All high-level filter additions are now moved from execute.c to seccomp-util.c, so that we can test them independently of the service execution logic.
- Tests have been added for all types of our seccomp filters.
- SystemCallFilters= and SystemCallArchitectures= are now implemented in independent filters and installation logic, as they semantically are very much independent of each other.

Fixes: #4575
2017-01-17 | nspawn: change owner/group of /run/systemd/nspawn/notify to userns-root | Evgeny Vereshchagin
Fixes #4944
2017-01-15 | nspawn: fix memleak | Zbigniew Jędrzejewski-Szmek
CID #1368262: fn is allocated with new, so it should be freed.
2017-01-14 | Merge pull request #4879 from poettering/systemd | Zbigniew Jędrzejewski-Szmek
2017-01-10 | build-sys: add check for gperf lookup function signature (#5055) | Mike Gilbert
gperf-3.1 generates lookup functions that take a size_t length parameter instead of unsigned int. Test for this at configure time. Fixes: https://github.com/systemd/systemd/issues/5039
2016-12-29 | nspawn: reword notice when /dev is pre-mounted and populated (#4971) | Lennart Poettering
Fixes: #4676
2016-12-21 | nspawn: tweaks to /etc/resolv.conf management | Lennart Poettering
Handle properly if /etc is a symlink (i.e. make sure we don't follow the symlink outside the image). Also follow /etc/resolv.conf if it is a symlink, and use the resolved path when creating a mount point and mounting (as both of these operations follow symlinks and really shouldn't). Handle more types of read-only errors as debug-level issues.
2016-12-21 | nspawn: don't complain when we can't fix the timezone of read-only containers | Lennart Poettering
There's nothing we can do about it, hence don't complain.
2016-12-21 | dissect: make using a generic partition as root partition optional | Lennart Poettering
In preparation for reusing the image dissector in the GPT auto-discovery logic, only optionally fail the dissection when we can't identify a root partition. In the GPT auto-discovery we are completely fine with any kind of root, given that we run when it is already mounted and all we do is find some additional auxiliary partitions on the same disk.
2016-12-21 | nspawn: restore --volatile=yes support | Lennart Poettering
This was broken by 19caffac75a2590a0c5ebc2a0214960f8188aec7 which remounted the root directory to MS_SHARED before applying the volatile mount logic. This broke things as MS_MOVE is incompatible with MS_SHARED directory trees, and we need MS_MOVE in the volatile mount logic to rearrange the directory tree. Simply swap the order here, apply the volatile logic before we switch to MS_SHARED.
2016-12-21 | nspawn: unref the notify event source (#4941) | Evgeny Vereshchagin
Fixes:
```
sudo ./libtool --mode=execute valgrind --leak-check=full ./systemd-nspawn -D ./CONT/ -b
...
==21224== 2,444 (656 direct, 1,788 indirect) bytes in 1 blocks are definitely lost in loss record 13 of 15
==21224==    at 0x4C2FA50: calloc (vg_replace_malloc.c:711)
==21224==    by 0x4F6F565: sd_event_new (sd-event.c:431)
==21224==    by 0x1210BE: run (nspawn.c:3351)
==21224==    by 0x123908: main (nspawn.c:3826)
==21224==
==21224== LEAK SUMMARY:
==21224==    definitely lost: 656 bytes in 1 blocks
==21224==    indirectly lost: 1,788 bytes in 11 blocks
==21224==      possibly lost: 0 bytes in 0 blocks
==21224==    still reachable: 8,344 bytes in 3 blocks
==21224==         suppressed: 0 bytes in 0 blocks
```
Closes #4934
2016-12-20 | dissect: optionally, only look for GPT partition tables, nothing else | Lennart Poettering
This is useful for reusing the dissector logic in the gpt-auto-discovery logic: there we really don't want to use MBR or naked file systems as root device.
2016-12-20 | nspawn: split out VolatileMode definitions | Lennart Poettering
This moves the VolatileMode enum and its helper functions to src/shared/. This is useful to then reuse them to implement systemd.volatile= in a later commit.
2016-12-14 | nspawn: flush out environment block of the -a stub init process | Lennart Poettering
The container detection code in virt.c we ship checks for /proc/1/environ, looking for "container=" in it. Let's make sure our "-a" init stub exposes that correctly. Without this "systemd-detect-virt" run in a "-a" container won't detect that it is being run in a container.
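The lookup that such a detector performs can be sketched as a scan over a NUL-separated environment block, which is the layout of /proc/1/environ. The function `find_container_var()` is a hypothetical name and this is not the virt.c code; reading the file itself is left out.

```c
#include <stddef.h>
#include <string.h>

/* Scan a NUL-separated environment block and return the value of the
 * "container=" entry, or NULL if there is none. Only entries that are
 * properly NUL-terminated inside the block are considered. */
const char *find_container_var(const char *block, size_t size) {
        static const char key[] = "container=";
        const size_t klen = sizeof(key) - 1;
        size_t i = 0;

        while (i < size) {
                const char *entry = block + i;
                const char *end = memchr(entry, '\0', size - i);

                if (!end)
                        break; /* unterminated trailing entry: ignore it */

                if ((size_t) (end - entry) >= klen && memcmp(entry, key, klen) == 0)
                        return entry + klen;

                i += (size_t) (end - entry) + 1;
            }

        return NULL;
}
```

If PID 1 inside the container has an empty or stale environment block, the scan finds nothing, which matches the failure mode the commit describes for the "-a" stub.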
2016-12-13 | nspawn: when getting SIGCHLD make sure it's from the first child (#4855) | Andrey Ulanov
When getting SIGCHLD we should not assume that it was the first child forked from systemd-nspawn that has died, as it may also be coming from an orphan process. This change adds a signal handler that ignores SIGCHLD unless it came from the first containerized child - the real child.

Before this change the problem can be reproduced as follows:

    $ sudo systemd-nspawn --directory=/container-root --share-system
    Press ^] three times within 1s to kill container.
    [root@andreyu-coreos ~]# { true & } &
    [1] 22201
    [root@andreyu-coreos ~]# Container root-fedora-latest terminated by signal KILL
2016-12-10 | Merge pull request #4795 from poettering/dissect | Zbigniew Jędrzejewski-Szmek
Generalize image dissection logic of nspawn, and make it useful for other tools.
2016-12-10 | nspawn: add missing -E to getopt_long (#4860) | Wim de With
2016-12-07 | nspawn: resolv.conf might not be created initially (#4799) | Franck Bui
It might happen that resolv.conf is missing in a minimal rootfs, in which case the following warning is emitted:

    Failed to mount n/a on /mnt/etc/resolv.conf (MS_BIND ""): No such file or directory

This patch fixes this case.
2016-12-07 | nspawn/dissect: automatically discover dm-verity verity partitions | Lennart Poettering
This adds support for discovering and making use of properly tagged dm-verity data integrity partitions. This extends both systemd-nspawn and systemd-dissect with a new --root-hash= switch that takes the root hash to use for the root partition, and is otherwise fully automatic.

Verity partitions are discovered automatically by GPT table type UUIDs, as listed in https://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/ (which I updated prior to this change, to include new UUIDs for this purpose).

mkosi with https://github.com/systemd/mkosi/pull/39 applied may generate images that carry the necessary integrity data. With that PR and this commit, the following simple lines suffice to boot up an integrity-protected container image:

```
# mkdir test
# cd test
# mkosi --verity
# systemd-nspawn -i ./image.raw -bn
```

Note that mkosi writes the image file to "image.raw" next to a file "image.roothash" that contains the root hash. systemd-nspawn will look for that file and use it if it exists, in case --root-hash= is not specified explicitly.
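The companion-file lookup described in the note above can be sketched as a pure path derivation. `roothash_path_sketch()` is a hypothetical helper; how names without a ".raw" suffix are handled is an assumption of this sketch, not something the commit message states.

```c
#include <stdio.h>
#include <string.h>

/* Write the companion root hash path for image into buf.
 * "image.raw" becomes "image.roothash"; for names without a ".raw"
 * suffix the extension is simply appended (an assumption of this sketch).
 * Returns 0 on success, -1 if buf is too small. */
int roothash_path_sketch(const char *image, char *buf, size_t size) {
        static const char suffix[] = ".raw";
        size_t len = strlen(image), slen = sizeof(suffix) - 1;
        int n;

        if (len > slen && strcmp(image + len - slen, suffix) == 0)
                len -= slen;

        n = snprintf(buf, size, "%.*s.roothash", (int) len, image);
        if (n < 0 || (size_t) n >= size)
                return -1;

        return 0;
}
```

A caller would then try to open the derived path and fall back to running without verity if the file does not exist and --root-hash= was not given.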
2016-12-07 | nspawn: when generating a machine name from an image name, truncate .raw suffix | Lennart Poettering
Let's prettify the machine name we generate for image-based containers: let's chop off the .raw suffix before using it as machine name.
2016-12-07 | dissect: add support for encrypted images | Lennart Poettering
This adds support to the image dissector to deal with encrypted images (only LUKS).

Given that we now have a neatly isolated image dissector codebase, let's add a new feature to it: support for automatically dealing with encrypted images. This is then exposed in systemd-dissect and nspawn.

It's pretty basic: only support for passphrase-based encryption.

In order to ensure that "systemd-dissect --mount" results in mount points whose backing LUKS DM devices are cleaned up automatically we use the DM_DEV_REMOVE ioctl() directly on the device (in DM_DEFERRED_REMOVE mode). libcryptsetup at the moment doesn't provide a proper API for this. Thankfully, the ioctl() API is pretty easy to use.
2016-12-07 | nspawn: port nspawn to new generalized image dissection code | Lennart Poettering
Let's make use of the new internal API. This mostly doesn't change anything for the caller, however, "systemd-nspawn --image=/dev/sda7" works now as the new code can handle disk images with no partition tables, and make any detected images directly the root.
2016-12-06 | core: introduce parse_ip_port (#4825) | Susant Sahani
1. Listed in TODO.
2. Tree-wide replacement of safe_atou16 with parse_ip_port in case it's used for ports.
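What a dedicated port parser adds over a bare 16-bit integer parser can be sketched as follows. `parse_ip_port_sketch()` is a hypothetical stand-in built on strtoul(); rejecting port 0 and the choice of error codes are assumptions of this sketch, not a description of the real function.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal IP port in the range 1..65535.
 * Returns 0 and stores the port on success, a negative errno-style
 * value otherwise. */
int parse_ip_port_sketch(const char *s, uint16_t *ret) {
        char *end = NULL;
        unsigned long v;

        /* Require a leading digit: strtoul() would otherwise accept
         * leading whitespace and a sign. */
        if (!s || !ret || s[0] < '0' || s[0] > '9')
                return -EINVAL;

        errno = 0;
        v = strtoul(s, &end, 10);
        if (*end != '\0')
                return -EINVAL; /* trailing garbage */
        if (errno == ERANGE || v > UINT16_MAX)
                return -ERANGE;
        if (v == 0)
                return -EINVAL; /* port 0 is not a usable port number */

        *ret = (uint16_t) v;
        return 0;
}
```

Keeping the range and validity rules in one helper means every call site that parses a port applies the same checks.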
2016-12-05 | nspawn: don't hide --bind=/tmp/* mounts (#4824) | Evgeny Vereshchagin
Fixes #4789
2016-12-01 | util-lib: rename CHASE_NON_EXISTING → CHASE_NONEXISTENT | Lennart Poettering
As suggested by @keszybz
2016-12-01 | nspawn: improve log messages | Lennart Poettering
When complaining about the inability to resolve a path, show the full path, not just the relative one. As suggested by @keszybz.
2016-12-01 | nspawn: optionally, automatically allocated --bind=/--overlay source from /var/tmp | Lennart Poettering
This extends the --bind= and --overlay= syntax so that an empty string as source/upper directory is taken as request to automatically allocate a temporary directory below /var/tmp, whose lifetime is bound to the nspawn runtime. In combination with the "+" path extension this permits a switch "--overlay=+/var::/var" in order to use the container's shipped /var, combine it with a writable temporary directory and mount it to the runtime /var of the container.
2016-12-01 | nspawn: permit prefixing of source paths in --bind= and --overlay= with "+" | Lennart Poettering
If a source path is prefixed with "+" it is taken relative to the container's root directory instead of the host. This permits easily establishing bind and overlay mounts based on data from the container rather than the host. This also reworks custom_mounts_prepare(), and turns it into two functions: one custom_mount_check_all() that remains in nspawn.c but purely verifies the validity of the custom mounts configured. And one called custom_mount_prepare_all() that actually does the preparation step, sorts the custom mounts, resolves relative paths, and allocates temporary directories as necessary.
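The source-path rules for --bind= and --overlay= (plain host path, "+" prefix for container-relative, empty string for a temporary directory) can be sketched as a small classifier. `SourceKind` and `classify_mount_source()` are hypothetical names; the sketch only joins strings and leaves out the symlink-safe path resolution and the actual directory allocation.

```c
#include <stdio.h>

typedef enum SourceKind {
        SOURCE_HOST,       /* plain path, taken from the host */
        SOURCE_CONTAINER,  /* "+" prefix: relative to the container root */
        SOURCE_TEMPORARY,  /* empty string: allocate below /var/tmp */
} SourceKind;

/* Classify a mount source and write the path to use into buf.
 * Returns the SourceKind, or -1 if buf is too small. */
int classify_mount_source(const char *root, const char *source, char *buf, size_t size) {
        SourceKind kind;
        int n;

        if (source[0] == '\0') {
                /* The caller would create the temporary directory itself. */
                n = snprintf(buf, size, "%s", "");
                kind = SOURCE_TEMPORARY;
        } else if (source[0] == '+') {
                n = snprintf(buf, size, "%s%s", root, source + 1);
                kind = SOURCE_CONTAINER;
        } else {
                n = snprintf(buf, size, "%s", source);
                kind = SOURCE_HOST;
        }

        if (n < 0 || (size_t) n >= size)
                return -1;

        return (int) kind;
}
```

For a container rooted at a hypothetical /srv/c1, the source "+/var" would resolve to /srv/c1/var, while an empty source only signals that a temporary directory has to be created.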
2016-12-01 | tree-wide: set SA_RESTART for signal handlers we install | Lennart Poettering
We already set it in most cases, but make sure to set it in all others too, and document that that's a good idea.
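A minimal sketch of installing a handler with SA_RESTART through sigaction(), so that system calls interrupted by the signal are restarted rather than failing with EINTR. The names `on_signal()`, `got_signal` and `install_handler()` are illustrative and not taken from the systemd tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the handler; sig_atomic_t keeps the access async-signal-safe. */
volatile sig_atomic_t got_signal = 0;

static void on_signal(int sig) {
        (void) sig;
        got_signal = 1;
}

/* Install on_signal() for sig with SA_RESTART set.
 * Returns 0 on success, -1 with errno set on failure. */
int install_handler(int sig) {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_signal;
        sa.sa_flags = SA_RESTART;
        sigemptyset(&sa.sa_mask);

        return sigaction(sig, &sa, NULL);
}
```

Without the flag, every blocking read(), write() or similar call in the program would need its own EINTR retry loop once a handler is installed.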
2016-12-01 | nspawn: add ability to configure overlay mounts to .nspawn files | Lennart Poettering
Fixes: #4634
2016-12-01 | nspawn: split out overlayfs argument parsing into a function of its own | Lennart Poettering
Add overlay_mount_parse() similar in style to tmpfs_mount_parse() and bind_mount_parse().
2016-12-01 | nspawn: use -ENOMEM instead of log_oom() in one case | Lennart Poettering
The function is of the "library" kind and doesn't log ENOMEM in all other cases, hence fix the one outlier.
2016-12-01 | nspawn: make use of CHASE_NON_EXISTING when locking image | Lennart Poettering
If --template= is used on an image, then the image might not exist initially. We can use CHASE_NON_EXISTING to properly lock the image already before it exists. Let's do so.