summaryrefslogtreecommitdiff
path: root/src/nspawn
AgeCommit message (Collapse)Author
2016-10-08nspawn: also fall back to legacy cgroup hierarchy for old containersZbigniew Jędrzejewski-Szmek
Current systemd version detection routine cannot detect systemd 230, only systmed >= 231. This means that we'll still use the legacy hierarchy in some cases where we wouldn't have too. If somebody figures out a nice way to detect systemd 230 this can be later improved.
2016-10-08nspawn: use mixed cgroup hierarchy only when container has new systemdZbigniew Jędrzejewski-Szmek
systemd-soon-to-be-released-232 is able to deal with the mixed hierarchy. So make an educated guess, and use the mixed hierarchy in that case. Tested by running the host with mixed hierarchy (i.e. simply using a recent kernel with systemd from git), and booting first a container with older systemd, and then one with a newer systemd. Fixes #4008.
2016-10-08nspawn: fix spurious reboot if container process returns 133Zbigniew Jędrzejewski-Szmek
2016-10-08nspawn: move the main loop body out to a new functionZbigniew Jędrzejewski-Szmek
The new function has 416 lines by itself! "return log_error_errno" is used to nicely reduce the volume of error handling code. A few minor issues are fixed on the way: - positive value was used as error value (EIO), causing systemd-nspawn to return success, even though it shouldn't. - In two places random values were used as error status, when the actual value was in an unusual place (etc_password_lock, notify_socket). Those are the only functional changes. There is another potential issue, which is marked with a comment, and left unresolved: the container can also return 133 by itself, causing a spurious reboot.
2016-10-08nspawn: check env var first, detect secondZbigniew Jędrzejewski-Szmek
If we are going to use the env var to override the detection result anyway, there is not point in doing the detection, especially that it can fail.
2016-10-06tree-wide: drop some misleading compiler warningsLennart Poettering
gcc at some optimization levels thinks thes variables were used without initialization. it's wrong, but let's make the message go anyway.
2016-10-05nspawn: add log message to let users know that nspawn needs an empty /dev ↵Djalal Harouni
directory (#4226) Fixes https://github.com/systemd/systemd/issues/3695 At the same time it adds a protection against userns chown of inodes of a shared mount point.
2016-10-03nspawn: set shared propagation mode for the containerAlban Crequy
2016-09-28Merge pull request #4185 from endocode/djalal-sandbox-first-protection-v1Evgeny Vereshchagin
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
2016-09-26treewide: fix typos (#4217)Torstein Husebø
2016-09-25nspawn: let's mount /proc/sysrq-trigger read-only by defaultLennart Poettering
LXC does this, and we should probably too. Better safe than sorry.
2016-09-25namespace: rework how ReadWritePaths= is appliedLennart Poettering
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths= specification, then we'd first recursively apply the ReadOnlyPaths= paths, and make everything below read-only, only in order to then flip the read-only bit again for the subdirs listed in ReadWritePaths= below it. This is not only ugly (as for the dirs in question we first turn on the RO bit, only to turn it off again immediately after), but also problematic in containers, where a container manager might have marked a set of dirs read-only and this code will undo this is ReadWritePaths= is set for any. With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be applied to the children listed in ReadWritePaths= in the first place, so that we do not need to turn off the RO bit for those after all. This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the RO bit, but never to turn it off again. Or to say this differently: if some dirs are marked read-only via some external tool, then ReadWritePaths= will not undo it. This is not only the safer option, but also more in-line with what the man page currently claims: "Entries (files or directories) listed in ReadWritePaths= are accessible from within the namespace with the same access rights as from outside." To implement this change bind_remount_recursive() gained a new "blacklist" string list parameter, which when passed may contain subdirs that shall be excluded from the read-only mounting. A number of functions are updated to add more debug logging to make this more digestable.
2016-09-24nspawn: decouple --boot from CLONE_NEWIPC (#4180)Luca Bruno
This commit is a minor tweak after the split of `--share-system`, decoupling the `--boot` option from IPC namespacing. Historically there has been a single `--share-system` option for sharing IPC/PID/UTS with the host, which was incompatible with boot/pid1 mode. After the split, it is now possible to express the requirements with better granularity. For reference, this is a followup to #4023 which contains references to previous discussions. I realized too late that CLONE_NEWIPC is not strictly needed for boot mode.
2016-09-20nspawn: fix comment typo in setup_timezone example (#4183)Michael Pope
2016-09-17nspawn: clarify log warning for /etc/localtime not being a symbolic link (#4163)Michael Pope
2016-09-06nspawn: detect SECCOMP availability, skip audit filter if unavailableFelipe Sateler
Fail hard if SECCOMP was detected but could not be installed
2016-08-26nspawn: split down SYSTEMD_NSPAWN_SHARE_SYSTEM (#4023)Luca Bruno
This commit follows further on the deprecation path for --share-system, by splitting and gating each share-able namespace behind its own environment flag.
2016-08-19Merge pull request #3965 from htejun/systemd-controller-on-unifiedZbigniew Jędrzejewski-Szmek
2016-08-18bus-util: unify loop around bus_append_unit_property_assignment()Lennart Poettering
This is done exactly the same way a couple of times at various places, let's unify this into one version.
2016-08-17core: use the unified hierarchy for the systemd cgroup controller hierarchyTejun Heo
Currently, systemd uses either the legacy hierarchies or the unified hierarchy. When the legacy hierarchies are used, systemd uses a named legacy hierarchy mounted on /sys/fs/cgroup/systemd without any kernel controllers for process management. Due to the shortcomings in the legacy hierarchy, this involves a lot of workarounds and complexities. Because the unified hierarchy can be mounted and used in parallel to legacy hierarchies, there's no reason for systemd to use a legacy hierarchy for management even if the kernel resource controllers need to be mounted on legacy hierarchies. It can simply mount the unified hierarchy under /sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies. This disables a significant amount of fragile workaround logics and would allow using features which depend on the unified hierarchy membership such bpf cgroup v2 membership test. In time, this would also allow deleting the said complexities. This patch updates systemd so that it prefers the unified hierarchy for the systemd cgroup controller hierarchy when legacy hierarchies are used for kernel resource controllers. * cg_unified(@controller) is introduced which tests whether the specific controller in on unified hierarchy and used to choose the unified hierarchy code path for process and service management when available. Kernel controller specific operations remain gated by cg_all_unified(). * "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to force the use of legacy hierarchy for systemd cgroup controller. * nspawn: By default nspawn uses the same hierarchies as the host. If UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If 0, legacy for all. * nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of three options - legacy, only systemd controller on unified, and unified. The value is passed into mount setup functions and controls cgroup configuration. * nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount option is moved to mount_legacy_cgroup_hierarchy() so that it can take an appropriate action depending on the configuration of the host. v2: - CGroupUnified enum replaces open coded integer values to indicate the cgroup operation mode. - Various style updates. v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2. v4: Restored legacy container on unified host support and fixed another bug in detect_unified_cgroup_hierarchy().
2016-08-15core: rename cg_unified() to cg_all_unified()Tejun Heo
A following patch will update cgroup handling so that the systemd controller (/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel resource controllers are on the legacy hierarchies. This would require distinguishing whether all controllers are on cgroup v2 or only the systemd controller is. In preparation, this patch renames cg_unified() to cg_all_unified(). This patch doesn't cause any functional changes.
2016-08-04Merge pull request #3885 from keszybz/help-outputLennart Poettering
Update help for "short-full" and shorten to 80 columns
2016-08-04nspawn,resolve: short --help output to fit within 80 columnsZbigniew Jędrzejewski-Szmek
make dist-check-help FTW!
2016-08-03nspawn: if we can't mark the boot ID RO let's failLennart Poettering
It's probably better to be safe here.
2016-08-03nspawn: deprecate --share-system supportLennart Poettering
This removes the --share-system switch: from the documentation, the --help text as well as the command line parsing. It's an ugly option, given that it kinda contradicts the whole concept of PID namespaces that nspawn implements. Since it's barely ever used, let's just deprecate it and remove it from the options. It might be useful as a debugging option, hence the functionality is kept around for now, exposed via an undocumented $SYSTEMD_NSPAWN_SHARE_SYSTEM environment variable.
2016-08-03nspawn: try to bind mount resolved's resolv.conf snippet into the containerLennart Poettering
This has the benefit that the container can follow the host's DNS server changes without us having to constantly update the container's resolv.conf settings.
2016-07-26nspawn: add SYSTEMD_NSPAWN_USE_CGNS env variable (#3809)Christian Brauner
SYSTEMD_NSPAWN_USE_CGNS allows to disable the use of cgroup namespaces.
2016-07-25Merge pull request #3757 from poettering/efi-searchZbigniew Jędrzejewski-Szmek
2016-07-25Merge pull request #3589 from brauner/cgroup_namespaceLennart Poettering
Cgroup namespace
2016-07-22nspawn: don't skip cleanup on locking errorZbigniew Jędrzejewski-Szmek
2016-07-22Use "return log_error_errno" in more places"Zbigniew Jędrzejewski-Szmek
2016-07-22Merge pull request #3777 from poettering/id128-reworkZbigniew Jędrzejewski-Szmek
uuid/id128 code rework
2016-07-22nspawn: set DevicesPolicy closed and clean up duplicated devicesAlessandro Puccetti
2016-07-22machine-id-setup: port machine_id_commit() to new id128-util.c APIsLennart Poettering
2016-07-22nspawn: rework /etc/machine-id handlingLennart Poettering
With this change we'll no longer write to /etc/machine-id from nspawn, as that breaks the --volatile= operation, as it ensures the image is never considered in "first boot", since that's bound to the pre-existance of /etc/machine-id. The new logic works like this: - If /etc/machine-id already exists in the container, it is read by nspawn and exposed in "machinectl status" and friends. - If the file doesn't exist yet, but --uuid= is passed on the nspawn cmdline, this UUID is passed in $container_uuid to PID 1, and PID 1 is then expected to persist this to /etc/machine-id for future boots (which systemd already does). - If the file doesn#t exist yet, and no --uuid= is passed a random UUID is generated and passed via $container_uuid. The result is that /etc/machine-id is never initialized by nspawn itself, thus unbreaking the volatile mode. However still the machine ID configured in the machine always matches nspawn's and thus machined's idea of it. Fixes: #3611
2016-07-22nspawn: rework machine/boot ID handling code to use new calls from ↵Lennart Poettering
id128-util.[ch]
2016-07-22sd-id128: split UUID file read/write code into new id128-util.[ch]Lennart Poettering
We currently have code to read and write files containing UUIDs at various places. Unify this in id128-util.[ch], and move some other stuff there too. The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/), because they are actually the backend of sd_id128_get_machine() and sd_id128_get_boot(). In follow-up patches we can use this reduce the code in nspawn and machine-id-setup by adopted the common implementation.
2016-07-22tree-wide: use sd_id128_is_null() instead of sd_id128_equal where appropriateLennart Poettering
It's a bit easier to read because shorter. Also, most likely a tiny bit faster.
2016-07-22Merge pull request #3764 from poettering/assorted-stuff-2Martin Pitt
Assorted fixes
2016-07-21nspawn: enable major=0/minor=0 devices inside the container (#3773)Alessandro Puccetti
https://github.com/systemd/systemd/pull/3685 introduced /run/systemd/inaccessible/{chr,blk} to map inacessible devices, this patch allows systemd running inside a nspawn container to create /run/systemd/inaccessible/{chr,blk}.
2016-07-21nspawn: if an ESP is part of the disk image to operate on, mount it to /efi ↵Lennart Poettering
or /boot Matching the behaviour of gpt-auto-generator, if we find an ESP while dissecting a container image, mount it to /efi or /boot if those dirs exist and are empty. This should enable us to run "bootctl" inside a container and do the right thing.
2016-07-20nspawn: when netns is on, mount /proc/sys/net writableLennart Poettering
Normally we make all of /proc/sys read-only in a container, but if we do have netns enabled we can make /proc/sys/net writable, as things are virtualized then.
2016-07-20nspawn: document why the uid shift range is the way it isLennart Poettering
2016-07-18treewide: remove unused variablesThomas Hindoe Paaboel Andersen
2016-07-18nspawn: decrease mkdir error logging in /sys to debug priority (#3748)tblume
Such mkdir errors happen for example when trying to mkdir /sys/fs/selinux. /sys is documented to be readonly in the container, so mkdir errors below /sys can be expected. They shouldn't be logged as warnings since they lead users to think that there is something wrong.
2016-07-15tree-wide: get rid of selinux_context_t (#3732)Zbigniew Jędrzejewski-Szmek
https://github.com/SELinuxProject/selinux/commit/9eb9c9327563014ad6a807814e7975424642d5b9 deprecated selinux_context_t. Replace with a simple char* everywhere. Alternative fix for #3719.
2016-07-12Various fixes for typos found by lintian (#3705)Michael Biebl
2016-07-11treewide: fix typos and remove accidental repetition of wordsTorstein Husebø
2016-07-09nspawn: handle cgroup namespacesChristian Brauner
(NOTE: Cgroup namespaces work with legacy and unified hierarchies: "This is completely backward compatible and will be completely invisible to any existing cgroup users (except for those running inside a cgroup namespace and looking at /proc/pid/cgroup of tasks outside their namespace.)" (https://lists.linuxfoundation.org/pipermail/containers/2016-January/036582.html) So there is no need to special case unified.) If cgroup namespaces are supported we skip mount_cgroups() in the outer_child(). Instead, we unshare(CLONE_NEWCGROUP) in the inner_child() and only then do we call mount_cgroups(). The clean way to handle cgroup namespaces would be to delegate mounting of cgroups completely to the init system in the container. However, this would likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply requests a shell and no init is available to mount cgroups. Hence, we introduce mount_legacy_cgns_supported(). After calling unshare(CLONE_NEWCGROUP) it parses /proc/self/cgroup to find the mounted controllers and mounts them inside the new cgroup namespace. This should preserve backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag and mount a cgroupfs when no init in the container is running.
2016-07-04treewide: fix typosTorstein Husebø