Age | Commit message (Collapse) | Author |
|
SYSTEMD_NSPAWN_USE_CGNS allows to disable the use of cgroup namespaces.
|
|
|
|
Cgroup namespace
|
|
|
|
|
|
uuid/id128 code rework
|
|
|
|
|
|
With this change we'll no longer write to /etc/machine-id from nspawn, as that
breaks the --volatile= operation, as it ensures the image is never considered
in "first boot", since that's bound to the pre-existance of /etc/machine-id.
The new logic works like this:
- If /etc/machine-id already exists in the container, it is read by nspawn and
exposed in "machinectl status" and friends.
- If the file doesn't exist yet, but --uuid= is passed on the nspawn cmdline,
this UUID is passed in $container_uuid to PID 1, and PID 1 is then expected
to persist this to /etc/machine-id for future boots (which systemd already
does).
- If the file doesn#t exist yet, and no --uuid= is passed a random UUID is
generated and passed via $container_uuid.
The result is that /etc/machine-id is never initialized by nspawn itself, thus
unbreaking the volatile mode. However still the machine ID configured in the
machine always matches nspawn's and thus machined's idea of it.
Fixes: #3611
|
|
id128-util.[ch]
|
|
We currently have code to read and write files containing UUIDs at various
places. Unify this in id128-util.[ch], and move some other stuff there too.
The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/),
because they are actually the backend of sd_id128_get_machine() and
sd_id128_get_boot().
In follow-up patches we can use this reduce the code in nspawn and
machine-id-setup by adopted the common implementation.
|
|
It's a bit easier to read because shorter. Also, most likely a tiny bit faster.
|
|
Assorted fixes
|
|
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
|
|
or /boot
Matching the behaviour of gpt-auto-generator, if we find an ESP while
dissecting a container image, mount it to /efi or /boot if those dirs exist and
are empty.
This should enable us to run "bootctl" inside a container and do the right
thing.
|
|
Normally we make all of /proc/sys read-only in a container, but if we do have
netns enabled we can make /proc/sys/net writable, as things are virtualized
then.
|
|
|
|
|
|
Such mkdir errors happen for example when trying to mkdir /sys/fs/selinux.
/sys is documented to be readonly in the container, so mkdir errors below /sys
can be expected.
They shouldn't be logged as warnings since they lead users to think that
there is something wrong.
|
|
https://github.com/SELinuxProject/selinux/commit/9eb9c9327563014ad6a807814e7975424642d5b9
deprecated selinux_context_t. Replace with a simple char* everywhere.
Alternative fix for #3719.
|
|
|
|
|
|
(NOTE: Cgroup namespaces work with legacy and unified hierarchies: "This is
completely backward compatible and will be completely invisible to any existing
cgroup users (except for those running inside a cgroup namespace and looking at
/proc/pid/cgroup of tasks outside their namespace.)"
(https://lists.linuxfoundation.org/pipermail/containers/2016-January/036582.html)
So there is no need to special case unified.)
If cgroup namespaces are supported we skip mount_cgroups() in the
outer_child(). Instead, we unshare(CLONE_NEWCGROUP) in the inner_child() and
only then do we call mount_cgroups().
The clean way to handle cgroup namespaces would be to delegate mounting of
cgroups completely to the init system in the container. However, this would
likely break backward compatibility with the UNIFIED_CGROUP_HIERARCHY flag of
systemd-nspawn. Also no cgroupfs would be mounted whenever the user simply
requests a shell and no init is available to mount cgroups. Hence, we introduce
mount_legacy_cgns_supported(). After calling unshare(CLONE_NEWCGROUP) it parses
/proc/self/cgroup to find the mounted controllers and mounts them inside the
new cgroup namespace. This should preserve backward compatibility with the
UNIFIED_CGROUP_HIERARCHY flag and mount a cgroupfs when no init in the
container is running.
|
|
|
|
An incorrectly set if/else chain caused aus to apply the access mode of a
symlink to the directory it is located in. Yuck.
Fixes: #3547
|
|
Let's block access to the kernel keyring and a number of obsolete system calls.
Also, update list of syscalls that may alter the system clock, and do raw IO
access. Filter ptrace() if CAP_SYS_PTRACE is not passed to the container and
acct() if CAP_SYS_PACCT is not passed.
This also changes things so that kexec(), some profiling calls, the swap calls
and quotactl() is never available to containers, not even if CAP_SYS_ADMIN is
passed. After all we currently permit CAP_SYS_ADMIN to containers by default,
but these calls should not be available, even then.
|
|
|
|
This the patch implements a notificaiton mechanism from the init process
in the container to systemd-nspawn.
The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1"
message from the init process in the container to send its own to systemd.
--notify-ready=no is equivalent to the previous behavior before this patch,
systemd-nspawn notifies systemd with a "READY=1" message when the container is
created. This notificaiton mechanism uses socket file with path relative to the contanier
"/run/systemd/nspawn/notify". The default values it --notify-ready=no.
It is also possible to configure this mechanism from the .nspawn files using
NotifyReady. This parameter takes the same options of the command line switch.
Before this patch, systemd-nspawn notifies "ready" after the inner child was created,
regardless the status of the service running inside it. Now, with --notify-ready=yes,
systemd-nspawn notifies when the service is ready. This is really useful when
there are dependencies between different contaniers.
Fixes https://github.com/systemd/systemd/issues/1369
Based on the work from https://github.com/systemd/systemd/pull/3022
Testing:
Boot a OS inside a container with systemd-nspawn.
Note: modify the commands accordingly with your filesystem.
1. Create a filesystem where you can boot an OS.
2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh
2.1. Create the unit file /etc/systemd/system/sleep.service inside the container
(You can use the example below)
2.2. systemdctl enable sleep
2.3 exit
3. sudo systemd-run --service-type=notify --unit=notify-test
${HOME}/systemd/systemd-nspawn --notify-ready=yes
-D ${HOME}/distros/fedora-23/ -b
4. In a different shell run "systemctl status notify-test"
When using --notify-ready=yes the service status is "activating" for 20 seconds
before being set to "active (running)". Instead, using --notify-ready=no
the service status is marked "active (running)" quickly, without waiting for
the 20 seconds.
This patch was also test with --private-users=yes, you can test it just adding it
at the end of the command at point 3.
------ sleep.service ------
[Unit]
Description=sleep
After=network.target
[Service]
Type=oneshot
ExecStart=/bin/sleep 20
[Install]
WantedBy=multi-user.target
------------ end ------------
|
|
The current raw_clone function takes two arguments, the cloning flags and
a pointer to the stack for the cloned child. The raw cloning without
passing a "thread main" function does not make sense if a new stack is
specified, as it returns in both the parent and the child, which will fail
in the child as the stack is virgin. All uses of raw_clone indeed pass NULL
for the stack pointer which indicates that both processes should share the
stack address (so you better don't pass CLONE_VM).
This commit refactors the code to not require the caller to pass the stack
address, as NULL is the only sensible option. It also adds the magic code
needed to make raw_clone work on sparc64, which does not return 0 in %o0
for the child, but indicates the child process by setting %o1 to non-zero.
This refactoring is not plain aesthetic, because non-NULL stack addresses
need to get mangled before being passed to the clone syscall (you have to
apply STACK_BIAS), whereas NULL must not be mangled. Implementing the
conditional mangling of the stack address would needlessly complicate the
code.
raw_clone is moved to a separete header, because the burden of including
the assert machinery and sched.h shouldn't be applied to every user of
missing_syscalls.h
|
|
The argument is about capabilities.
|
|
Split seccomp into nspawn-seccomp.[ch]. Currently there are no changes,
but this will make it easy in the future to share or use the seccomp logic
from systemd core.
|
|
Rename is_procfs_sysfs_or_suchlike() to is_fs_fully_userns_compatible()
to give it the real meaning. This may prevent future modifications that
may introduce bugs.
|
|
Add some special filesystems that should not be shifted, most of them
relate to the host and not to containers.
|
|
|
|
|
|
Let's make sure we don't remove veth links that existed before nspawn was
invoked.
https://github.com/systemd/systemd/pull/3209#discussion_r62439999
|
|
|
|
This adds a new concept of network "zones", which are little more than bridge
devices that are automatically managed by nspawn: when the first container
referencing a bridge is started, the bridge device is created, when the last
container referencing it is removed the bridge device is removed again. Besides
this logic --network-zone= is pretty much identical to --network-bridge=.
The usecase for this is to make it easy to run multiple related containers
(think MySQL in one and Apache in another) in a common, named virtual Ethernet
broadcast zone, that only exists as long as one of them is running, and fully
automatically managed otherwise.
|
|
Make use of this in nspawn at a couple of places. A later commit should port
more code over to this, including networkd.
|
|
|
|
This reverts commit d2773e59de3dd970d861e9f996bc48de20ef4314.
Merge got squashed by mistake.
|
|
Fixes:
cp /etc/machine-id /var/tmp/systemd-test.HccKPa/nspawn-root/etc
systemd-nspawn -D /var/tmp/systemd-test.HccKPa/nspawn-root --link-journal host -b
...
Host and machine ids are equal (P�S!V): refusing to link journals
|
|
Fixes:
$ systemd-nspawn -h
...
Failed to remove veth interface ����: Operation not permitted
This is a follow-up for d2773e59de3dd970d861
|
|
nspawn automatic user namespaces
|
|
* sd-netlink: permit RTM_DELLINK messages with no ifindex
This is useful for removing network interfaces by name.
* nspawn: explicitly remove veth links we created after use
Sometimes the kernel keeps veth links pinned after the namespace they have been
joined to died. Let's hence explicitly remove veth links after use.
Fixes: #2173
|
|
Sometimes the kernel keeps veth links pinned after the namespace they have been
joined to died. Let's hence explicitly remove veth links after use.
Fixes: #2173
|
|
This should allow tools like rkt to pre-mount read-only subtrees in the OS
tree, without breaking the patching code.
Note that the code will still fail, if the top-level directory is already
read-only.
|
|
|
|
With this change -U will turn on user namespacing only if the kernel actually
supports it and otherwise gracefully degrade to non-userns mode.
|
|
In order to implement this we change the bool arg_userns into an enum
UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the
arg_uid_range_pick bool with it.
|