Age | Commit message (Collapse) | Author |
|
- Move to its own file rm-rf.c
- Change parameters into a single flags parameter
- Remove "honour sticky" logic, it's unused these days
|
|
Some systems abusively restrict mknod, even when the device node already
exists in /dev. This is unfortunate because it prevents systemd-nspawn
from creating the basic devices in /dev in the container.
This patch implements a workaround: when mknod fails, fallback on bind
mounts.
Additionally, /dev/console was created with a mknod with the same
major/minor as /dev/null before bind mounting a pts on it. This patch
removes the mknod and creates an empty regular file instead.
In order to test this patch, I used the following configuration, which I
think should replicate the system with the abusive restriction on mknod:
# grep devices /proc/self/cgroup
4:devices:/user.slice/restrict
# cat /sys/fs/cgroup/devices/user.slice/restrict/devices.list
c 1:9 r
c 5:2 rw
c 136:* rw
# systemd-nspawn --register=false -D .
v2:
- remove "bind", it is not needed since there is already MS_BIND
v3:
- fix error management when calling touch()
- fix lowercase in error message
|
|
We have no such check in any of the other tools, hence don't have one in
nspawn either.
(This should make things nicer for Rocket, among other things)
Note: removing this check does not mean that we support running nspawn
on non-systemd. We explicitly don't. It just means that we remove the
check for running it like that. You are still on your own if you do...
|
|
Try to keep syscalls as minimal as possible.
|
|
CID #1271353.
|
|
Replace ENOTSUP by EOPNOTSUPP as this is what linux actually uses.
|
|
CID #1257765.
|
|
This change makes it so all seccomp filters are mapped
to the appropriate capability and are only added if that
capability was not requested when running the container.
This unbreaks the remaining use cases broken by the
addition of seccomp filters without respecting requested
capabilities.
Co-Authored-By: Clif Houck <me@clifhouck.com>
[zj: - adapt to our coding style, make struct anonymous]
|
|
|
|
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
|
|
|
|
|
|
|
|
(This is incomplete, /proc and /sys are still owned by root from outside
the container, not inside)
|
|
Previously we always invoked the container PID 1 on /dev/console of the
container. With this change we do so only if nspawn was invoked
interactively (i.e. its stdin/stdout was connected to a TTY). In all other
cases we directly pass through the fds unmodified.
This has the benefit that nspawn can be added into shell pipelines.
https://bugs.freedesktop.org/show_bug.cgi?id=87732
|
|
This is similar to systemd-run's --property= setting.
|
|
nspawn containers currently block module loading in all cases, with
no option to disable it. This allows an admin, specifically setting
capability=CAP_SYS_MODULE or capability=all to load modules.
|
|
After all it is now much more like strjoin() than strappend(). At the
same time, add support for NULL sentinels, even if they are normally not
necessary.
|
|
|
|
options for the superblock as the host
Otherwise we'll generate kernel runtime warnings about non-matching
mount options.
|
|
We really want /tmp to be properly mounted, especially in containers
that lack CAP_SYS_ADMIN or that are not fully booted up and only get a
shell, hence let's do so in nspawn already.
|
|
|
|
When we set up a loopback device with partition probing, the udev
"change" event about the configured device is first passed on to
userspace, only the the in-kernel partition prober is started. Since
partition probing fails with EBUSY when somebody has the device open,
the probing frequently fails since udev starts probing/opening the
device as soon as it gets the notification about it, and it might do so
earlier than the kernel probing.
This patch adds a (hopefully temporary) work-around for this, that
compares the number of probed partitions of the kernel with those of
blkid and synchronously asks for reprobing until the numebrs are in
sync.
This really deserves a proper kernel fix.
|
|
|
|
linux partition
This should allow running Ubuntu UEFI GPT Images with nspawn,
unmodified.
|
|
|
|
|
|
After all, nspawn can now dissect MBR partition levels, too, hence
".gpt" appears a misnomer. Moreover, the the .raw suffix for these files
is already pretty popular (the Fedora disk images use it for example),
hence sounds like an OK scheme to adopt.
|
|
Sometimes udev or some other background daemon might keep the loopback
devices busy while we already want to detach them. Downgrade the warning
about it.
Given that we use autodetach downgrading these messages should be with
little risk.
|
|
With this change nspawn's -i switch now can now make sense of MBR disk
images too - however only if there's only a single, bootable partition
of type 0x83 on the image. For all other cases we cannot really make
sense from the partition table alone.
The big benefit of this change is that upstream Fedora Cloud Images can
now be booted unmodified with systemd-nspawn:
# wget http://download.fedoraproject.org/pub/fedora/linux/releases/21/Cloud/Images/x86_64/Fedora-Cloud-Base-20141203-21.x86_64.raw.xz
# unxz Fedora-Cloud-Base-20141203-21.x86_64.raw.xz
# systemd-nspawn -i Fedora-Cloud-Base-20141203-21.x86_64.raw -b
Next stop: teach the import logic to automatically download these
images, uncompress and verify them.
|
|
This is useful for nspawn managers that want to learn when nspawn is
finished with initialiuzation, as well what the PID of the init system
in the container is.
|
|
|
|
This adds three kinds of file system locks for container images:
a) a file system lock next to the actual image, in a .lck file in the
same directory the image is located. This lock has the benefit of
usually being located on the same NFS share as the image itself, and
thus allows locking container images across NFS shares.
b) a file system lock in /run, named after st_dev and st_ino of the
root of the image. This lock has the advantage that it is unique even
if the same image is bind mounted to two different places at the same
time, as the ino/dev stays constant for them.
c) a file system lock that is only taken when a new disk image is about
to be created, that ensures that checking whether the name is already
used across the search path, and actually placing the image is not
interrupted by other code taking the name.
a + b are read-write locks. When a container is booted in read-only mode
a read lock is taken, otherwise a write lock.
Lock b is always taken after a, to avoid ABBA problems.
Lock c is mostly relevant when renaming or cloning images.
|
|
|
|
|
|
Now that networkd's IP masquerading support means that running
containers with "--network-veth" will provide network access out of the
box for the container, let's add a shortcut "-n" for it, to make it
easily accessible.
|
|
This exposes an IP port on the container as local port using DNAT.
|
|
|
|
|
|
machine it is connected to dies
|
|
for the container's own subtree in the name=systemd hierarchy
More specifically mount all other hierarchies in their entirety and the
name=systemd above the container's subtree read-only.
|
|
That way, systemd can actually figure out if everything is OK with
nspawn.
|
|
|
|
It does not use any functions from libcap directly. The CAP_* constants in use
through this file come from "missing.h" which will import <linux/capability.h>
and complement it with CAP_* constants not defined by the current kernel
headers.
Add an explicit import of our "capability.h" since it does use the function
capability_bounding_set_drop from that header file. Previously, that header was
implicitly imported through through "cap-list.h".
Tested that "systemd-nspawn" builds cleanly and works after this change.
|
|
|
|
was newline-terinated anyway
|
|
|
|
host to container during runtime
|
|
Since the order of the first and second arguments of the raw clone() system
call is reversed on s390 and cris it needs to be invoked differently.
|
|
Also, when booting up an ephemeral container of / use the system
hostname as default machine name.
This way specifiyng -M is unnecessary when booting up an ephemeral
container, while allowing any number of ephemeral containers to run from
the same tree.
|