Age | Commit message (Collapse) | Author |
|
Introduce two new helpers that send/receive a single fd via a unix
transport. Also make nspawn use them instead of hard-coding it.
Based on a patch by Krzesimir Nowak.
|
|
off_t is a really weird type as it is usually 64bit these days (at least
in sane programs), but could theoretically be 32bit. We don't support
off_t as 32bit builds though, but still constantly deal with safely
converting from off_t to other types and back for no point.
Hence, never use the type anymore. Always use uint64_t instead. This has
various benefits, including that we can expose these values directly as
D-Bus properties, and also that the values parse the same in all cases.
|
|
|
|
We should really close all parent sides of our child/parent socket
pairs.
|
|
|
|
SOCK_DGRAM and SOCK_SEQPACKET have very similar semantics when used with
socketpair(). However, SOCK_SEQPACKET has the advantage of knowing a
hangup concept, since it is inherently connection-oriented.
Since we use socket pairs to communicate between the nspawn main process
and the nspawn child process, where the child might die abnormally it's
interesting to us to learn about this via hangups if the child side of
the pair is closed. Hence, let's switch to SOCK_SEQPACKET for these
internal communication sockets.
Fixes #956.
|
|
|
|
Let's remove unnecessary inclusions, and order the list alphabetically
as suggested in CODING_STYLE now.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.
|
|
In the unified hierarchy delegating controller access is safe, hence
make sure to enable all controllers for the "payload" subcgroup if we
create it, so that the container will have all controllers enabled the
nspawn service itself has.
|
|
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.
A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).
It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.
The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.
This patch also removes cg_delete() which is unused now.
On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.
This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.
This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.
The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.
To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.
This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.
When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
|
|
that either
Follow-up regarding #649.
|
|
--bind and --bind-ro perform the bind mount
non-recursively. It is sometimes (often?) desirable
to do a recursive mount. This patch adds an optional
set of bind mount options in the form of:
--bind=src-path:dst-path:options
options are comma separated and currently only
"rbind" and "norbind" are allowed.
Default value is "rbind".
|
|
Fixes #1018.
Based on a patch from Seth Jennings.
|
|
|
|
: characters can be entered with the \: escape sequence.
|
|
Overlayfs uses , as an option separator and : as a list separator. These
characters are both valid in file paths, so overlayfs allows file paths
which contain these characters to backslash escape these values.
|
|
: characters in bind paths can be entered as the \: escape sequence.
|
|
This now accepts : characters with the \: escape sequence.
Other escape sequences are also interpreted, but having a \ in your file
path is less likely than :, so this shouldn't break anyone's existing
tools.
|
|
Manual merge of https://github.com/systemd/systemd/pull/751.
|
|
All users are now setting lowercase=false.
|
|
Pretty trivial helper which wraps free() but returns NULL, so we can
simplify this:
free(foobar);
foobar = NULL;
to this:
foobar = mfree(foobar);
|
|
Use free_and_strdup() where appropriate and replace equivalent,
open-coded versions.
|
|
Mounting devpts with a uid breaks pty allocation with recent glibc
versions, which expect that the kernel will set the correct owner for
user-allocated ptys.
The kernel seems to be smart enough to use the correct uid for root when
we switch to a user namespace.
This resolves #337.
|
|
fileio: consolidate write_string_file*()
|
|
|
|
The latest consolidation cleanup of write_string_file() revealed some users
of that helper which should have used write_string_file_no_create() in the
past but didn't. Basically, all existing users that write to files in /sys
and /proc should not expect to write to a file which is not yet existant.
|
|
Merge write_string_file(), write_string_file_no_create() and
write_string_file_atomic() into write_string_file() and provide a flags mask
that allows combinations of atomic writing, newline appending and automatic
file creation. Change all users accordingly.
|
|
richardmaw-codethink/nspawn-automatic-uid-shift-fix-v2
nspawn: Communicate determined UID shift to parent version 2
|
|
There is logic to determine the UID shift from the file-system, rather
than having it be explicitly passed in.
However, this needs to happen in the child process that sets up the
mounts, as what's important is the UID of the mounted root, rather than
the mount-point.
Setting up the UID map needs to happen in the parent becuase the inner
child needs to have been started, and the outer child is no longer able
to access the uid_map file, since it lost access to it when setting up
the mounts for the inner child.
So we need to communicate the uid shift back out, along with the PID of
the inner child process.
Failing to communicate this means that the invalid UID shift, which is
the value used to specify "this needs to be determined from the file
system" is left invalid, so setting up the user namespace's UID shift
fails.
|
|
|
|
sd-bus: introduce new sd_bus_flush_close_unref() call
|
|
sd_bus_flush_close_unref() is a call that simply combines sd_bus_flush()
(which writes all unwritten messages out) + sd_bus_close() (which
terminates the connection, releasing all unread messages) +
sd_bus_unref() (which frees the connection).
The combination of this call is used pretty frequently in systemd tools
right before exiting, and should also be relevant for most external
clients, and is hence useful to cover in a call of its own.
Previously the combination of the three calls was already done in the
_cleanup_bus_close_unref_ macro, but this was only available internally.
Also see #327
|
|
|
|
richardmaw-codethink/nspawn-userns-uid-shift-autodetection-fix
nspawn: determine_uid_shift before forking
|
|
It is needed in one branch of the fork, but calculated in another
branch.
Failing to do this means using --private-users without specifying a uid
shift always fails because it tries to shift the uid to UID_INVALID.
|
|
When we do a MS_BIND mount, it inherits the flags of its parent mount.
When we do a remount, it sets the flags to exactly what is specified.
If we are in a user namespace then these mount points have their flags
locked, so you can't reduce the protection.
As a consequence, the default setup of mount_all doesn't work with user
namespaces. However if we ensure we add the mount flags of the parent
mount when remounting, then we aren't removing mount options, so we
aren't trying to unlock an option that we aren't allowed to.
|
|
In such a case let's suppress the warning (downgrade to LOG_DEBUG),
under the assumption that the user has no config file to update in its
place, but a symlink that points to something like resolved's
automatically managed resolve.conf file.
While we are at it, also stop complaining if we cannot write /etc/resolv.conf
due to a read-only disk, given that there's little we could do about it.
|
|
This is a simpler fix for #210, it simply uses copy_bytes() for the
copying.
|
|
If the kernel do not support user namespace then one of the children
created by nspawn parent will fail at clone(CLONE_NEWUSER) with the
generic error EINVAL and without logging the error. At the same time
the parent may also try to setup the user namespace and will fail with
another error.
To improve this, check if the kernel supports user namespace as early
as possible.
|
|
everywhere: port everything to sigprocmask_many() and friends
|
|
This ports a lot of manual code over to sigprocmask_many() and friends.
Also, we now consistly check for sigprocmask() failures with
assert_se(), since the call cannot realistically fail unless there's a
programming error.
Also encloses a few sd_event_add_signal() calls with (void) when we
ignore the return values for it knowingly.
|
|
Remove old temporary snapshots, but only at boot. Ideally we'd have
"self-destroying" btrfs snapshots that go away if the last last
reference to it does. To mimic a scheme like this at least remove the
old snapshots on fresh boots, where we know they cannot be referenced
anymore. Note that we actually remove all temporary files in
/var/lib/machines/ at boot, which should be safe since the directory has
defined semantics. In the root directory (where systemd-nspawn
--ephemeral places snapshots) we are more strict, to avoid removing
unrelated temporary files.
This also splits out nspawn/container related tmpfiles bits into a new
tmpfiles snippet to systemd-nspawn.conf
|