Age | Commit message (Collapse) | Author |
|
The Barrier-API simplifies cross-fork() synchronization a lot. Replace the
hard-coded eventfd-util implementation and drop it.
Compared to the old API, Barriers also handle exit() of the remote side as
abortion. This way, segfaults will not cause the parent to deadlock.
EINTR handling is currently ignored for any barrier-waits. This can easily
be added, but it isn't needed so far so I dropped it. EINTR handling in
general is ugly, anyway. You need to deal with pselect/ppoll/... variants
and make sure not to unblock signals at the wrong times. So genrally,
there's little use in adding it.
|
|
|
|
(ephemeral) mode
Two modes are supported: --volatile=yes mounts only /usr into the
container, and a tmpfs as root directory. --volatile=state mounts the
full OS tree in, but overmounts /var with a tmpfs.
--volatile=yes hence boots with an unpopulated /etc and /var, starting
with pristine configuration and state.
--volatile=state hence boots with an unpopulated /var, only starting
with pristine state.
|
|
THis way we can remove cgroup priviliges after setup, but get them back
for the next restart, as we need it.
|
|
Let's protect ourselves against the recently reported docker security
issue. Our man page makes clear that we do not make any security
promises anyway, but well, this one is easy to mitigate, so let's do it.
While we are at it block a couple of more syscalls that are no good in
containers, too.
|
|
|
|
|
|
This is at the suggestion of Djalal Harouni on the mailing list, and
reflects the behavior of shared/util.c:wait_for_terminate_and_warn().
|
|
Commit 113cea8 introduced a bug that caused the exit code of systemd-nspawn
to not reflect the exit code of the program executed in the container.
|
|
This allows us to bootup a rootfs with a /usr directory only.
|
|
This allows us to bootup a rootfs with a /usr directory only.
|
|
|
|
The file should have been in /usr/lib/ in the first place, since it
describes the OS container in /usr (and not the configuration in /etc),
hence, let's support os-release files in /usr/lib as fallback if no
version in /etc exists, following the usual override logic.
A prior commit already enabled tmpfiles to create /etc/os-release as a
symlink to /usr/lib/os-release should it be missing, thus providing nice
compatibility with applications only checking in /etc.
While it's probably a good idea if all apps check both locations via a
fallback logic, it is only necessary in the early boot process, as long
as the /etc/os-release symlink has not been restored, in case we boot
with an empty /etc.
|
|
such as /var
|
|
|
|
For names like /var/lib/container/something, the message
becomes quite long. Better to split it.
Also reword the message not to suggest that ^]^]^] only works
in the beginning.
|
|
Instead of blindly creating another bind mount for read-only mounts,
check if there's already one we can use, and if so, use it. Also,
recursively mark all submounts read-only too. Also, ignore autofs mounts
when remounting read-only unless they are already triggered.
|
|
nspawn and the container child use eventfd to wait and notify each other
that they are ready so the container setup can be completed.
However in its current form the wait/notify event ignore errors that
may especially affect the child (container).
On errors the child will jump to the "child_fail" label and terminate
with _exit(EXIT_FAILURE) without notifying the parent. Since the eventfd
is created without the "EFD_NONBLOCK" flag, this leaves the parent
blocking on the eventfd_read() call. The container can also be killed
at any moment before execv() and the parent will not receive
notifications.
We can fix this by using cheap mechanisms, the new high level eventfd
API and handle SIGCHLD signals:
* Keep the cheap eventfd and EFD_NONBLOCK flag.
* Introduce eventfd states for parent and child to sync.
Child notifies parent with EVENTFD_CHILD_SUCCEEDED on success or
EVENTFD_CHILD_FAILED on failure and before _exit(). This prevents the
parent from waiting on an event that will never come.
* If the child is killed before execv() or before notifying the parent,
we install a NOP handler for SIGCHLD which will interrupt blocking calls
with EINTR. This gives a chance to the parent to call wait() and
terminate in main().
* If there are no errors, parent will block SIGCHLD, restore default
handler and notify child which will do execv(), then parent will pass
control to process_pty() to do its magic.
This was exposed in part by:
https://bugs.freedesktop.org/show_bug.cgi?id=76193
Reported-by: Tobias Hunger tobias.hunger@gmail.com
|
|
Move the container wait logic into its own wait_for_container() function
and add two status codes: CONTAINER_TERMINATED or CONTAINER_REBOOTED.
The status will be stored in its argument, this way we handle:
a) Return negative on failures.
b) Return zero on success and set the status to either
CONTAINER_REBOOTED or CONTAINER_TERMINATED.
These status codes are used to terminate nspawn or loop again in case of
CONTAINER_REBOOTED.
|
|
|
|
This undoes part of commit e6a4a517befe559adf6d1dbbadf425c3538849c9.
Instead of removing the error message about non-empty journal bind mount
directories, simply downgrade the message to a warning and proceed.
|
|
dentry
Currently if nspawn was called with --link-journal=host or
--link-journal=auto and the right /var/log/journal/machine-id/ exists
then the bind mount the subdirectory into the container might fail due
to the ~/mycontainer/var/log/journal/machine-id/ of the container not
being empty.
There is no reason to check if the container journal subdir is empty
since there will be a bind mount on top of it. The user asked for a bind
mount so give it.
Note: a next call with --link-journal=guest may fail due to the
/var/log/journal/machine-id/ on the host not being empty.
https://bugs.freedesktop.org/show_bug.cgi?id=76193
Reported-by: Tobias Hunger <tobias.hunger@gmail.com>
|
|
|
|
http://lists.freedesktop.org/archives/systemd-devel/2014-April/018971.html
|
|
change_uid_gid() never initialises sz which may cause greedy_realloc to
skip the initial buffer allocation.
|
|
Use a static table with all the typing information, rather than repeated
switch statements. This should make it a lot simpler to add new types.
We need to keep all the type info to be able to create containers
without exposing their implementation details to the users of the library.
As a freebee we verify the types of appended/read attributes.
The API is extended to nicely deal with unions of container types.
|
|
safe_close_pair() is more like safe_close(), except that it handles
pairs of fds, and doesn't make and misleading allusion, as it works
similarly well for socketpairs() as for pipe()s...
|
|
safe_close() automatically becomes a NOP when a negative fd is passed,
and returns -1 unconditionally. This makes it easy to write lines like
this:
fd = safe_close(fd);
Which will close an fd if it is open, and reset the fd variable
correctly.
By making use of this new scheme we can drop a > 200 lines of code that
was required to test for non-negative fds or to reset the closed fd
variable afterwards.
|
|
|
|
|
|
With systemd 211 nspawn attempts to create the home directory for the
given uid. However, if the home directory already exists then it will
fail. Don't error out on -EEXIST.
|
|
We still need to make sure that no two MAC addresses are the same, so we use
a logic similar to what is used in udev to generate MAC addresses, and base
it on a hash of the host's machine ID and thecontainer's name.
|
|
discoverable partitions spec
|
|
|
|
|
|
|
|
/usr/bin/getent instead of in-process
When the container runs a different native architecture than the host we
shouldn't attempt to load the container's NSS modules with the host's
libc. Instead, resolve UID/GID by invoking /usr/bin/getent in the
container. The tool should be fairly universally available and allows us
to do resolving of the UID/GID with the container's libc in a parsable
format.
https://bugs.freedesktop.org/show_bug.cgi?id=75733
|
|
child after the parent added us to the device cgroup
|
|
We overmount /dev/console with an external pty anyway, hence there's no
point in using the real major/minor when we create the node to
overmount. Instead, use the one of /dev/null now.
This fixes a race against the cgroup device controller setup we are
using. In case /dev/console was create before the cgroup policy was
applied all was good, but if created in the opposite order the mknod()
would fail, since creating /dev/console is not allowed by it. Creating
/dev/null instances is however permitted, and hence use it.
|
|
Discoverable Partitions Specification
|
|
Running 'systemd-nspawn -D /srv/Fedora/' gave me this error:
Failed to read /proc/self/loginuid: No such file or directory
Container Fedora failed with error code 1.
This patch fixes the problem.
|
|
|
|
container
|
|
|
|
"ve-" interface name prefix
This way we can recognize the interfaces later on to apply different
host-side configuration to them.
|
|
first (or second)
Previously the returned object of constructor functions where sometimes
returned as last, sometimes as first and sometimes as second parameter.
Let's clean this up a bit. Here are the new rules:
1. The object the new object is derived from is put first, if there is any
2. The object we are creating will be returned in the next arguments
3. This is followed by any additional arguments
Rationale:
For functions that operate on an object we always put that object first.
Constructors should probably not be too different in this regard. Also,
if the additional parameters might want to use varargs which suggests to
put them last.
Note that this new scheme only applies to constructor functions, not to
all other functions. We do give a lot of freedom for those.
Note that this commit only changes the order of the new functions we
added, for old ones we accept the wrong order and leave it like that.
|
|
If -flto is used then gcc will generate a lot more warnings than before,
among them a number of use-without-initialization warnings. Most of them
without are false positives, but let's make them go away, because it
doesn't really matter.
|
|
processes
|
|
containers on a 64bit host
|
|
seccomp setup
|