path: root/src/nspawn/nspawn.c
Age | Commit message | Author
2016-08-17 | core: use the unified hierarchy for the systemd cgroup controller hierarchy | Tejun Heo
Currently, systemd uses either the legacy hierarchies or the unified hierarchy. When the legacy hierarchies are used, systemd uses a named legacy hierarchy mounted on /sys/fs/cgroup/systemd without any kernel controllers for process management. Due to the shortcomings in the legacy hierarchy, this involves a lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy hierarchies, there's no reason for systemd to use a legacy hierarchy for management even if the kernel resource controllers need to be mounted on legacy hierarchies. It can simply mount the unified hierarchy under /sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies. This disables a significant amount of fragile workaround logic and would allow using features which depend on the unified hierarchy membership, such as the bpf cgroup v2 membership test. In time, this would also allow deleting the said complexities.

This patch updates systemd so that it prefers the unified hierarchy for the systemd cgroup controller hierarchy when legacy hierarchies are used for kernel resource controllers.

* cg_unified(@controller) is introduced, which tests whether the specific controller is on the unified hierarchy and is used to choose the unified hierarchy code path for process and service management when available. Kernel controller specific operations remain gated by cg_all_unified().
* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to force the use of legacy hierarchy for systemd cgroup controller.
* nspawn: By default nspawn uses the same hierarchies as the host. If UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If 0, legacy for all.
* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of three options - legacy, only systemd controller on unified, and unified. The value is passed into mount setup functions and controls cgroup configuration.
* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount option is moved to mount_legacy_cgroup_hierarchy() so that it can take an appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open-coded integer values to indicate the cgroup operation mode.
    - Various style updates.
v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.
v4: Restored legacy container on unified host support and fixed another bug in detect_unified_cgroup_hierarchy().
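The nspawn selection rule described in the bullets above can be sketched roughly as follows. This is a minimal Python model, not the C code from the patch: the member names NONE/SYSTEMD/ALL and the function pick_container_mode() are illustrative assumptions, and only the values "0" and "1" named in the commit message are handled.

```python
from enum import Enum


class CGroupUnified(Enum):
    """The three cgroup setups named in the commit message (member names are illustrative)."""
    NONE = 0      # legacy hierarchies for everything
    SYSTEMD = 1   # only the systemd controller hierarchy is unified
    ALL = 2       # unified hierarchy for everything


def pick_container_mode(host_mode, environ):
    """Choose the container's cgroup mode (hypothetical helper).

    Without UNIFIED_CGROUP_HIERARCHY the container follows the host;
    "1" forces unified for all, "0" forces legacy for all.
    """
    value = environ.get("UNIFIED_CGROUP_HIERARCHY")
    if value is None:
        return host_mode
    if value == "1":
        return CGroupUnified.ALL
    if value == "0":
        return CGroupUnified.NONE
    raise ValueError("unsupported UNIFIED_CGROUP_HIERARCHY value: %r" % value)
```

For example, pick_container_mode(CGroupUnified.SYSTEMD, {}) keeps the host's hybrid setup, while setting the variable overrides it in either direction.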
2016-08-15 | core: rename cg_unified() to cg_all_unified() | Tejun Heo
A following patch will update cgroup handling so that the systemd controller (/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel resource controllers are on the legacy hierarchies. This would require distinguishing whether all controllers are on cgroup v2 or only the systemd controller is. In preparation, this patch renames cg_unified() to cg_all_unified(). This patch doesn't cause any functional changes.
2016-08-04 | Merge pull request #3885 from keszybz/help-output | Lennart Poettering
Update help for "short-full" and shorten to 80 columns
2016-08-04 | nspawn,resolve: shorten --help output to fit within 80 columns | Zbigniew Jędrzejewski-Szmek
make dist-check-help FTW!
2016-08-03 | nspawn: if we can't mark the boot ID RO let's fail | Lennart Poettering
It's probably better to be safe here.
2016-08-03 | nspawn: deprecate --share-system support | Lennart Poettering
This removes the --share-system switch: from the documentation, the --help text as well as the command line parsing. It's an ugly option, given that it kinda contradicts the whole concept of PID namespaces that nspawn implements. Since it's barely ever used, let's just deprecate it and remove it from the options. It might be useful as a debugging option, hence the functionality is kept around for now, exposed via an undocumented $SYSTEMD_NSPAWN_SHARE_SYSTEM environment variable.
2016-08-03 | nspawn: try to bind mount resolved's resolv.conf snippet into the container | Lennart Poettering
This has the benefit that the container can follow the host's DNS server changes without us having to constantly update the container's resolv.conf settings.
2016-07-26 | nspawn: add SYSTEMD_NSPAWN_USE_CGNS env variable (#3809) | Christian Brauner
SYSTEMD_NSPAWN_USE_CGNS allows disabling the use of cgroup namespaces.
2016-07-25 | Merge pull request #3757 from poettering/efi-search | Zbigniew Jędrzejewski-Szmek
2016-07-25 | Merge pull request #3589 from brauner/cgroup_namespace | Lennart Poettering
Cgroup namespace
2016-07-22 | nspawn: don't skip cleanup on locking error | Zbigniew Jędrzejewski-Szmek
2016-07-22 | machine-id-setup: port machine_id_commit() to new id128-util.c APIs | Lennart Poettering
2016-07-22 | nspawn: rework /etc/machine-id handling | Lennart Poettering
With this change we'll no longer write to /etc/machine-id from nspawn, as that breaks the --volatile= operation: it ensures the image is never considered in "first boot", since that's bound to the pre-existence of /etc/machine-id.

The new logic works like this:

- If /etc/machine-id already exists in the container, it is read by nspawn and exposed in "machinectl status" and friends.
- If the file doesn't exist yet, but --uuid= is passed on the nspawn cmdline, this UUID is passed in $container_uuid to PID 1, and PID 1 is then expected to persist this to /etc/machine-id for future boots (which systemd already does).
- If the file doesn't exist yet, and no --uuid= is passed, a random UUID is generated and passed via $container_uuid.

The result is that /etc/machine-id is never initialized by nspawn itself, thus unbreaking the volatile mode. However, the machine ID configured in the machine still always matches nspawn's and thus machined's idea of it.

Fixes: #3611
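The three cases above amount to a small decision function. The following Python sketch only models the choice of ID and where it came from; resolve_machine_id() and the source labels are hypothetical names, and nothing here writes to /etc/machine-id, in line with the commit's intent.

```python
import uuid


def resolve_machine_id(etc_machine_id, uuid_option):
    """Return (machine_id, source) following the three cases in the commit message.

    etc_machine_id: contents of the container's /etc/machine-id, or None if absent.
    uuid_option:    value of --uuid= from the command line, or None.
    """
    if etc_machine_id:
        # Case 1: the file exists, so its value is used as-is.
        return uuid.UUID(hex=etc_machine_id.strip()), "file"
    if uuid_option:
        # Case 2: no file, but --uuid= was given; PID 1 is expected to persist it.
        return uuid.UUID(uuid_option), "cmdline"
    # Case 3: no file and no --uuid=; generate a random ID.
    return uuid.uuid4(), "random"
```

In the second and third cases the resulting value is what would be handed to PID 1 via $container_uuid.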
2016-07-22 | nspawn: rework machine/boot ID handling code to use new calls from id128-util.[ch] | Lennart Poettering
2016-07-22 | sd-id128: split UUID file read/write code into new id128-util.[ch] | Lennart Poettering
We currently have code to read and write files containing UUIDs at various places. Unify this in id128-util.[ch], and move some other stuff there too. The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/), because they are actually the backend of sd_id128_get_machine() and sd_id128_get_boot(). In follow-up patches we can use this to reduce the code in nspawn and machine-id-setup by adopting the common implementation.
2016-07-22 | tree-wide: use sd_id128_is_null() instead of sd_id128_equal where appropriate | Lennart Poettering
It's a bit easier to read because it is shorter. Also, most likely a tiny bit faster.
2016-07-21 | nspawn: if an ESP is part of the disk image to operate on, mount it to /efi or /boot | Lennart Poettering
Matching the behaviour of gpt-auto-generator, if we find an ESP while dissecting a container image, mount it to /efi or /boot if those dirs exist and are empty. This should enable us to run "bootctl" inside a container and do the right thing.
2016-07-20 | nspawn: when netns is on, mount /proc/sys/net writable | Lennart Poettering
Normally we make all of /proc/sys read-only in a container, but if we do have netns enabled we can make /proc/sys/net writable, as things are virtualized then.
2016-07-20 | nspawn: document why the uid shift range is the way it is | Lennart Poettering
2016-07-18 | treewide: remove unused variables | Thomas Hindoe Paaboel Andersen
2016-07-15 | tree-wide: get rid of selinux_context_t (#3732) | Zbigniew Jędrzejewski-Szmek
https://github.com/SELinuxProject/selinux/commit/9eb9c9327563014ad6a807814e7975424642d5b9 deprecated selinux_context_t. Replace with a simple char* everywhere. Alternative fix for #3719.
2016-07-12 | Various fixes for typos found by lintian (#3705) | Michael Biebl
2016-07-09 | nspawn: handle cgroup namespaces | Christian Brauner
(NOTE: Cgroup namespaces work with legacy and unified hierarchies: "This is completely backward compatible and will be completely invisible to any existing cgroup users (except for those running inside a cgroup namespace and looking at /proc/pid/cgroup of tasks outside their namespace.)" (https://lists.linuxfoundation.org/pipermail/containers/2016-January/036582.html) So there is no need to special case unified.)

If cgroup namespaces are supported we skip mount_cgroups() in the outer_child(). Instead, we unshare(CLONE_NEWCGROUP) in the inner_child() and only then do we call mount_cgroups().

The clean way to handle cgroup namespaces would be to delegate mounting of cgroups completely to the init system in the container. However, this would likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply requests a shell and no init is available to mount cgroups. Hence, we introduce mount_legacy_cgns_supported(). After calling unshare(CLONE_NEWCGROUP) it parses /proc/self/cgroup to find the mounted controllers and mounts them inside the new cgroup namespace. This should preserve backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag and mount a cgroupfs when no init in the container is running.
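The parsing step mentioned above can be illustrated with a short Python sketch. It assumes the usual /proc/self/cgroup line format "hierarchy-ID:controller-list:path" and only extracts the controller lists of legacy hierarchies; legacy_hierarchies() is a hypothetical name, and the actual mount logic of mount_legacy_cgns_supported() is not reproduced here.

```python
def legacy_hierarchies(proc_cgroup_text):
    """Return one controller list per legacy hierarchy found in /proc/self/cgroup text."""
    hierarchies = []
    for line in proc_cgroup_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice: the cgroup path itself may contain colons.
        _hierarchy_id, controllers, _path = line.split(":", 2)
        if controllers == "":
            # An empty controller field ("0::/path") marks the unified hierarchy.
            continue
        hierarchies.append(controllers.split(","))
    return hierarchies
```

Co-mounted controllers such as "cpu,cpuacct" stay grouped, since they share one hierarchy and would need a single mount; named hierarchies appear as entries like "name=systemd".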
2016-06-13 | nspawn: order caps to retain alphabetically | Lennart Poettering
2016-06-10 | nspawn: introduce --notify-ready=[no|yes] (#3474) | Alessandro Puccetti
This patch implements a notification mechanism from the init process in the container to systemd-nspawn. The switch --notify-ready=yes configures systemd-nspawn to wait for the "READY=1" message from the init process in the container before sending its own to systemd. --notify-ready=no is equivalent to the behavior before this patch: systemd-nspawn notifies systemd with a "READY=1" message when the container is created. This notification mechanism uses a socket file with the path "/run/systemd/nspawn/notify", relative to the container. The default value is --notify-ready=no. It is also possible to configure this mechanism from the .nspawn files using NotifyReady. This parameter takes the same options as the command line switch.

Before this patch, systemd-nspawn notifies "ready" after the inner child was created, regardless of the status of the service running inside it. Now, with --notify-ready=yes, systemd-nspawn notifies when the service is ready. This is really useful when there are dependencies between different containers.

Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022

Testing: Boot an OS inside a container with systemd-nspawn.
Note: modify the commands according to your filesystem.
1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container (you can use the example below)
2.2. systemctl enable sleep
2.3. exit
3. sudo systemd-run --service-type=notify --unit=notify-test ${HOME}/systemd/systemd-nspawn --notify-ready=yes -D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"

When using --notify-ready=yes the service status is "activating" for 20 seconds before being set to "active (running)". Instead, using --notify-ready=no the service status is marked "active (running)" quickly, without waiting for the 20 seconds.

This patch was also tested with --private-users=yes; you can test it by adding it at the end of the command at point 3.

------ sleep.service ------
[Unit]
Description=sleep
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/sleep 20

[Install]
WantedBy=multi-user.target
------------ end ------------
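The waiting side of --notify-ready=yes can be approximated in a few lines of Python. This is a sketch under assumptions: it treats notifications as newline-separated KEY=VALUE datagrams (the sd_notify convention), the helper names contains_ready() and wait_for_ready() are hypothetical, and binding the real socket at /run/systemd/nspawn/notify inside the container is left out.

```python
import socket


def contains_ready(payload):
    """True if a notification datagram carries an exact READY=1 assignment."""
    return b"READY=1" in payload.split(b"\n")


def wait_for_ready(sock, timeout=5.0):
    """Read datagrams until one contains READY=1.

    Returns False if no datagram arrives within `timeout` seconds of the
    previous one (the timeout applies to each receive call separately).
    """
    sock.settimeout(timeout)
    try:
        while True:
            if contains_ready(sock.recv(4096)):
                return True
    except socket.timeout:
        return False
```

With --notify-ready=no this wait would simply be skipped and readiness reported as soon as the container is created.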
2016-05-29 | util-lib: Add sparc64 support for process creation (#3348) | Michael Karcher
The current raw_clone function takes two arguments, the cloning flags and a pointer to the stack for the cloned child. The raw cloning without passing a "thread main" function does not make sense if a new stack is specified, as it returns in both the parent and the child, which will fail in the child as the stack is virgin. All uses of raw_clone indeed pass NULL for the stack pointer, which indicates that both processes should share the stack address (so you had better not pass CLONE_VM).

This commit refactors the code to not require the caller to pass the stack address, as NULL is the only sensible option. It also adds the magic code needed to make raw_clone work on sparc64, which does not return 0 in %o0 for the child, but indicates the child process by setting %o1 to non-zero. This refactoring is not purely aesthetic, because non-NULL stack addresses need to get mangled before being passed to the clone syscall (you have to apply STACK_BIAS), whereas NULL must not be mangled. Implementing the conditional mangling of the stack address would needlessly complicate the code.

raw_clone is moved to a separate header, because the burden of including the assert machinery and sched.h shouldn't be applied to every user of missing_syscalls.h
2016-05-26 | nspawn: rename arg_retain to arg_caps_retain | Djalal Harouni
The argument is about capabilities.
2016-05-26 | nspawn: split out seccomp call into nspawn-seccomp.[ch] | Djalal Harouni
Split seccomp into nspawn-seccomp.[ch]. Currently there are no changes, but this will make it easy in the future to share or use the seccomp logic from systemd core.
2016-05-22 | nspawn: remove unreachable return statement (#3320) | Zbigniew Jędrzejewski-Szmek
2016-05-12 | nspawn: drop spurious newline | Lennart Poettering
2016-05-09 | nspawn: only remove veth links we created ourselves | Lennart Poettering
Let's make sure we don't remove veth links that existed before nspawn was invoked. https://github.com/systemd/systemd/pull/3209#discussion_r62439999
2016-05-09 | nspawn: add new --network-zone= switch for automatically managed bridge devices | Lennart Poettering
This adds a new concept of network "zones", which are little more than bridge devices that are automatically managed by nspawn: when the first container referencing a bridge is started, the bridge device is created; when the last container referencing it is removed, the bridge device is removed again. Besides this logic --network-zone= is pretty much identical to --network-bridge=. The use case for this is to make it easy to run multiple related containers (think MySQL in one and Apache in another) in a common, named virtual Ethernet broadcast zone that only exists as long as one of them is running, and is fully automatically managed otherwise.
2016-05-09 | util-lib: add new ifname_valid() call that validates interface names | Lennart Poettering
Make use of this in nspawn at a couple of places. A later commit should port more code over to this, including networkd.
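As an illustration of what validating an interface name involves, here is a Python sketch. The specific rules are assumptions modelled on Linux naming constraints (names must fit in IFNAMSIZ, contain no whitespace, "/" or ":", and must not be "." or ".."); the commit message does not spell them out, and the checks in the C ifname_valid() may differ.

```python
IFNAMSIZ = 16  # Linux interface name buffer size, including the trailing NUL


def ifname_valid(name):
    """Return True if `name` looks like an acceptable network interface name.

    All rules below are assumptions for illustration, not taken from the patch.
    """
    if not name or len(name.encode()) >= IFNAMSIZ:
        return False  # empty, or too long to fit the buffer with its NUL byte
    if name in (".", ".."):
        return False  # these would collide with directory entries in sysfs
    for ch in name:
        if ord(ch) <= 32 or ord(ch) >= 127:
            return False  # control characters, space, or non-ASCII
        if ch in ":/":
            return False  # ":" is used for aliases, "/" is a path separator
    # Purely numeric names could be confused with interface indexes.
    return not name.isdigit()
```

A caller such as an option parser would reject the argument early when this returns False, before any netlink request is made.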
2016-05-03 | Merge pull request #3111 from poettering/nspawn-remove-veth | Zbigniew Jędrzejewski-Szmek
2016-05-03 | Revert "nspawn: explicitly remove veth links after use (#3111)" | Zbigniew Jędrzejewski-Szmek
This reverts commit d2773e59de3dd970d861e9f996bc48de20ef4314. Merge got squashed by mistake.
2016-04-29 | nspawn: convert uuid to string (#3146) | Evgeny Vereshchagin
Fixes:
cp /etc/machine-id /var/tmp/systemd-test.HccKPa/nspawn-root/etc
systemd-nspawn -D /var/tmp/systemd-test.HccKPa/nspawn-root --link-journal host -b
...
Host and machine ids are equal (P�S!V): refusing to link journals
2016-04-28 | nspawn: initialize the veth_name (#3141) | Evgeny Vereshchagin
Fixes:
$ systemd-nspawn -h
...
Failed to remove veth interface ����: Operation not permitted

This is a follow-up for d2773e59de3dd970d861
2016-04-26 | Merge pull request #3093 from poettering/nspawn-userns-magic | Lennart Poettering
nspawn automatic user namespaces
2016-04-25 | nspawn: explicitly remove veth links after use (#3111) | Lennart Poettering
* sd-netlink: permit RTM_DELLINK messages with no ifindex
  This is useful for removing network interfaces by name.
* nspawn: explicitly remove veth links we created after use
  Sometimes the kernel keeps veth links pinned after the namespace they have been joined to died. Let's hence explicitly remove veth links after use.
  Fixes: #2173
2016-04-25 | nspawn: explicitly remove veth links we created after use | Lennart Poettering
Sometimes the kernel keeps veth links pinned after the namespace they have been joined to died. Let's hence explicitly remove veth links after use. Fixes: #2173
2016-04-25 | nspawn: make -U a tiny bit smarter | Lennart Poettering
With this change -U will turn on user namespacing only if the kernel actually supports it and otherwise gracefully degrade to non-userns mode.
2016-04-25 | nspawn: allow configuration of user namespaces in .nspawn files | Lennart Poettering
In order to implement this we change the bool arg_userns into an enum UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the arg_uid_range_pick bool with it.
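The change from two booleans to one enum can be pictured with the following Python sketch. The enum member names come from the commit message; mode_from_flags() is a hypothetical helper that shows how the old pair of flags maps onto the new values.

```python
from enum import Enum


class UserNamespaceMode(Enum):
    NO = "no"        # user namespacing off
    PICK = "pick"    # user namespacing on, UID/GID range picked automatically
    FIXED = "fixed"  # user namespacing on, UID/GID range given explicitly


def mode_from_flags(userns, uid_range_pick):
    """Map the former (arg_userns, arg_uid_range_pick) booleans to one mode."""
    if not userns:
        # The old pair could also express "pick without userns", which is
        # meaningless; a single enum cannot represent that state at all.
        return UserNamespaceMode.NO
    return UserNamespaceMode.PICK if uid_range_pick else UserNamespaceMode.FIXED
```

One benefit of the enum is visible in the comment: four boolean combinations collapse to the three states that actually make sense.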
2016-04-25 | nspawn: add -U as shortcut for --private-users=pick | Lennart Poettering
Given that user namespacing is pretty useful now, let's add a shortcut command line switch for the logic.
2016-04-25 | nspawn: optionally, automatically allocate a UID/GID range for userns containers | Lennart Poettering
This adds the new value "pick" to --private-users=. When specified, a new UID/GID range of 65536 users is automatically and randomly allocated from the host range 0x00080000-0xDFFF0000 and used for the container. The setting implies --private-users-chown, so that the container directory is recursively chown()ed to the newly allocated UID/GID range, if that's necessary. As an optimization, before picking a randomized UID/GID the UID of the container's root directory is used as starting point and used if currently not used otherwise.

To protect against using the same UID/GID range multiple times a few mechanisms are in place:

- The first and the last UID and GID of the range are checked with getpwuid() and getgrgid(). If an entry already exists a different range is picked. Note that by "last" UID the user 65534 is used, as 65535 is the 16bit (uid_t) -1.
- A lock file for the range is taken in /run/systemd/nspawn-uid/. Since the ranges are taken in a non-overlapping fashion, and always start on 64K boundaries, this allows us to maintain a single lock file for each range that can be randomly picked. This protects nspawn from picking the same range in two parallel instances.
- If possible the /etc/passwd lock file is taken while a new range is selected until the container is up. This means adduser/addgroup should safely avoid the range as long as nss-mymachines is used, since the allocated range will then show up in the user database.

The UID/GID range nspawn picks from is compiled in and not configurable at the moment. That should probably stay that way, since we already provide ways for users to pick their own ranges manually if they don't like the automatic logic.

The new --private-users=pick logic makes user namespacing pretty useful now, as it relieves the user from managing UID/GID ranges.
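The arithmetic of the range selection and the first protection mechanism can be sketched as below. This Python model makes several assumptions: both bounds are treated as inclusive candidate base values, the helper names are hypothetical, and the lock files in /run/systemd/nspawn-uid/ and on /etc/passwd as well as the "start from the root directory's UID" optimization are left out.

```python
import grp
import pwd
import random

PICK_MIN = 0x00080000   # lowest base named in the commit message
PICK_MAX = 0xDFFF0000   # highest base (treated as inclusive here: an assumption)
RANGE_SIZE = 0x10000    # 65536 IDs per container, aligned to 64K boundaries


def uid_exists(uid):
    """True if the host user database knows this UID."""
    try:
        pwd.getpwuid(uid)
        return True
    except KeyError:
        return False


def gid_exists(gid):
    """True if the host group database knows this GID."""
    try:
        grp.getgrgid(gid)
        return True
    except KeyError:
        return False


def range_in_use(base, uid_taken=uid_exists, gid_taken=gid_exists):
    """Check the first and last ID of the range against users and groups."""
    last = base + 0xFFFE  # 65534; 65535 is the 16-bit (uid_t) -1
    return any(uid_taken(i) or gid_taken(i) for i in (base, last))


def pick_base(rng, uid_taken=uid_exists, gid_taken=gid_exists, attempts=100):
    """Pick a random 64K-aligned base whose range looks unused, or None."""
    slots = (PICK_MAX - PICK_MIN) // RANGE_SIZE + 1
    for _ in range(attempts):
        base = PICK_MIN + rng.randrange(slots) * RANGE_SIZE
        if not range_in_use(base, uid_taken, gid_taken):
            return base
    return None
```

Calling pick_base(random.Random()) consults the real user and group databases; the lookup functions are parameters so the logic can be exercised without touching the host.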
2016-04-25 | nspawn: optionally fix up OS tree uid/gids for userns | Lennart Poettering
This adds a new --private-users-chown switch that may be used in combination with --private-users. If it is passed, a recursive chown() operation is run on the OS tree, fixing all file owner UID/GIDs to the right ranges. This should make user namespacing pretty workable, as the OS trees don't need to be prepared manually anymore.
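A recursive ownership fix-up of this kind might look like the Python sketch below. The shifting rule (keep the low 16 bits of each ID and place them in the 64K range starting at the base) is an assumption about what "the right ranges" means; shift_id() and fix_tree() are hypothetical names, and the chown callable is a parameter so the walk can be dry-run without root privileges.

```python
import os


def shift_id(old_id, base):
    """Map an ID into the 64K range starting at `base` (assumed 64K-aligned)."""
    return base | (old_id & 0xFFFF)


def fix_tree(root, base, chown=os.lchown):
    """Re-own every entry under `root`, including `root` itself.

    Symlinks are not followed; entries already in the target range are skipped.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        # Every entry is visited exactly once, as a child of its parent directory.
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)
        new_uid = shift_id(st.st_uid, base)
        new_gid = shift_id(st.st_gid, base)
        if (new_uid, new_gid) != (st.st_uid, st.st_gid):
            chown(path, new_uid, new_gid)
```

Because only the low 16 bits are kept, running the fix-up on a tree that was previously shifted to a different base moves it to the new range rather than stacking offsets.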
2016-04-22 | tree-wide: remove unused variables (#3098) | Thomas H. P. Andersen
2016-04-21 | tree-wide: use mdash instead of two minuses | Zbigniew Jędrzejewski-Szmek
2016-04-20 | nspawn: add -E as alias for --setenv | Zbigniew Jędrzejewski-Szmek
v2: - "=" is required, so remove the <optional> tags that v1 added
2016-04-11 | Merge pull request #3014 from msekletar/nspawn-empty-machine-id-v3 | Lennart Poettering
nspawn: always setup machine id (v3)
2016-04-11 | nspawn: always setup machine id | Michal Sekletar
We check /etc/machine-id of the container and if it is already populated we use the value from there, possibly ignoring the value of the --uuid option from the command line. When dealing with a R/O image we set up a transient machine id. Once we have determined the machine id of the container, we use this value for registration with systemd-machined and we also export it via the container_uuid environment variable. As registration with systemd-machined is done by the main nspawn process, we communicate the container machine id established by setup_machine_id from the outer child to the main process over a unix domain socket, similarly to the PID of the inner child.