path: root/src/nspawn/nspawn.c
AgeCommit messageAuthor
2015-10-22nspawn: don't try to resolve passed binary before entering namespaceLennart Poettering
Otherwise we might follow the symlinks on the host, instead of the container. Fixes #1400
2015-10-22nspawn: rework how we determine private networking settingsLennart Poettering
Make sure we acquire CAP_NET_ADMIN if we require virtual networking. Make sure we imply virtual ethernet correctly when a bridge is requested. Fixes: #1511 Fixes: #1554 Fixes: #1590
2015-10-22btrfs: beef-up btrfs support with a limited understanding of quotaLennart Poettering
With this change we understand more than just leaf quota groups for btrfs file systems. Specifically:
- When we create a subvolume we can now optionally add the new subvolume to all qgroups its parent subvolume was a member of, too. Alternatively, it is also possible to insert an intermediary quota group between the parent's qgroups and the subvolume's leaf qgroup, which is useful for a concept of "subtree" qgroups that contain a subvolume and all its children.
- The remove logic for subvolumes has been updated to optionally remove any leaf qgroups or "subtree" qgroups, following the logic above.
- The snapshot logic for subvolumes has been updated to replicate the original qgroup setup of the source, if it follows the "subtree" design described above. It will not cover qgroup setups that introduce arbitrary qgroups, especially those orthogonal to the subvolume hierarchy.
This also tries to be more graceful when setting up /var/lib/machines as btrfs. For example, if mkfs.btrfs is missing we don't even try to set it up as a loopback device. Fixes #1559 Fixes #1129
2015-10-20nspawn: skip /sys-as-tmpfs if we don't use private-networkIago López Galeiras
Since v3.11/7dc5dbc ("sysfs: Restrict mounting sysfs"), the kernel doesn't allow mounting sysfs if you don't have CAP_SYS_ADMIN rights over the network namespace. So the code that mounts /sys as a tmpfs, introduced in d8fc6a000fe21b0c1ba27fbfed8b42d00b349a4b, doesn't work with user namespaces if we don't use private-net. The reason is that we mount sysfs inside the container while we're in the network namespace of the host, but we don't have CAP_SYS_ADMIN over that namespace. To fix that, we mount /sys as a sysfs (instead of tmpfs) if we don't use private network, and skip the /sys-as-a-tmpfs code if we find that /sys is already mounted as sysfs. Fixes #1555
2015-10-07machinectl: fix race when opening new shells with "machinectl shell"Lennart Poettering
Previously, we'd allocate the TTY, spawn a service on it, but immediately start processing the TTY and forwarding it to whatever the command was started on. This is however problematic, as the TTY might actually get opened only much later by the service. We'd hence first get EIOs on the master as the other side was still closed, and hence consider it hung up and terminate the session. With this change we add a flag to the pty forwarding logic: PTY_FORWARD_IGNORE_INITIAL_VHANGUP. If set, we'll ignore all hangups (i.e. EIOs) on the master PTY until the first byte is successfully read. From that point on we consider a hangup/EIO a regular connection termination. This way, we handle the race: when we get EIO initially we'll ignore it, until the connection is properly set up, at which time we start honouring it.
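The hangup handling described above can be sketched as a small decision function. This is a minimal illustration and not the actual pty forwarding code: the struct, its field names and master_read_is_hangup() are hypothetical, and only the EIO case is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical per-session state; the names are illustrative only. */
struct forward_state {
        bool ignore_initial_vhangup; /* behaves like PTY_FORWARD_IGNORE_INITIAL_VHANGUP */
        bool read_from_master;       /* true once the first byte has been read */
};

/* Interprets the result of one read() on the PTY master: n is the return
 * value, err the errno value if n < 0. Returns true if the session should
 * be treated as terminated. */
static bool master_read_is_hangup(struct forward_state *s, ssize_t n, int err) {
        assert(s);

        if (n > 0) {
                /* Data arrived, so the other side has opened the TTY. */
                s->read_from_master = true;
                return false;
        }

        if (n < 0 && err == EIO)
                /* EIO means the other side is closed. Ignore it only while
                 * the flag is set and nothing has been read yet. */
                return !(s->ignore_initial_vhangup && !s->read_from_master);

        /* EAGAIN and other conditions are outside the scope of this sketch. */
        return false;
}
```

With the flag set, an EIO before the first successful read is ignored, while the same EIO after the first byte ends the session.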
2015-09-30nspawn: mount /sys as tmpfs, and then mount only select subdirs of the real sysfs below itLennart Poettering
This way we can hide things like /sys/firmware or /sys/hypervisor from the container, while keeping the device tree around. While this is a security benefit in itself it also allows us to fix issue #1277. Previously we'd mount /sys before creating the user namespace, in order to be able to mount /sys/fs/cgroup/* beneath it (which resides in it), which we can only mount outside of the user namespace. To ensure that the user namespace owns the network namespace we'd set up the network namespace at the same time as the user namespace. Thus, we'd still see the /sys/class/net/ from the originating network namespace, even though we are in our own network namespace now. With this patch, /sys is mounted before transitioning into the user namespace as tmpfs, so that we can also mount /sys/fs/cgroup/* into it this early. The directories such as /sys/class/ are then later added in from the real sysfs from inside the network and user namespace so that they actually show what is available in it. Fixes #1277
2015-09-30nspawn: fix user namespace supportLennart Poettering
We didn't actually pass ownership of /run to the UID in the container since some releases, let's fix that.
2015-09-30nspawn: order includesLennart Poettering
2015-09-29util: introduce common version() implementation and use it everywhereLennart Poettering
This also allows us to drop build.h from a ton of files, hence do so. Since we touched the #includes of those files, let's order them properly according to CODING_STYLE.
2015-09-29util: unify implementation of NOP signal handlerLennart Poettering
This is highly complex code after all, we really should make sure to only keep one implementation of this extremely difficult function around.
2015-09-29tree-wide: take benefit of the fact that fdset_free() returns NULLLennart Poettering
2015-09-29tree-wide: port more code to use send_one_fd() and receive_one_fd()Lennart Poettering
Also, make it slightly more powerful, by accepting a flags argument, and make it safe for handling if more than one cmsg attribute happens to be attached.
2015-09-22nspawn, machined: fix comments and error messagesKrzesimir Nowak
A bunch of "Client -> Child" fixes and one barrier-enumerator fix. (David: rebased on master)
2015-09-22nspawn: close unneeded sockets in outer childKrzesimir Nowak
(David: Note, this is just a cleanup and doesn't fix any bugs)
2015-09-22util: introduce {send,receive}_one_fd()David Herrmann
Introduce two new helpers that send/receive a single fd via a unix transport. Also make nspawn use them instead of hard-coding it. Based on a patch by Krzesimir Nowak.
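Passing a single fd over a unix transport relies on an SCM_RIGHTS control message. The following is a minimal sketch of that technique, not the helpers added by this commit: sketch_send_fd() and sketch_receive_fd() are hypothetical names, and sending one dummy payload byte alongside the control message is an assumption made here so the sketch also works on stream sockets.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one fd. */
union fd_control {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
};

/* Sends fd over the unix socket transport_fd. Returns 0 or -errno. */
static int sketch_send_fd(int transport_fd, int fd) {
        union fd_control control;
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr mh;
        struct cmsghdr *cmsg;

        assert(transport_fd >= 0);
        assert(fd >= 0);

        memset(&control, 0, sizeof(control));
        memset(&mh, 0, sizeof(mh));
        mh.msg_iov = &iov;
        mh.msg_iovlen = 1;
        mh.msg_control = control.buf;
        mh.msg_controllen = sizeof(control.buf);

        cmsg = CMSG_FIRSTHDR(&mh);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        if (sendmsg(transport_fd, &mh, MSG_NOSIGNAL) < 0)
                return -errno;
        return 0;
}

/* Receives one fd from transport_fd. Returns the new fd or -errno. */
static int sketch_receive_fd(int transport_fd) {
        union fd_control control;
        char dummy;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        struct msghdr mh;
        struct cmsghdr *cmsg;
        int fd = -1;

        assert(transport_fd >= 0);

        memset(&control, 0, sizeof(control));
        memset(&mh, 0, sizeof(mh));
        mh.msg_iov = &iov;
        mh.msg_iovlen = 1;
        mh.msg_control = control.buf;
        mh.msg_controllen = sizeof(control.buf);

        if (recvmsg(transport_fd, &mh, MSG_CMSG_CLOEXEC) < 0)
                return -errno;

        /* Walk all attached control messages and pick the one carrying the fd. */
        for (cmsg = CMSG_FIRSTHDR(&mh); cmsg; cmsg = CMSG_NXTHDR(&mh, cmsg))
                if (cmsg->cmsg_level == SOL_SOCKET &&
                    cmsg->cmsg_type == SCM_RIGHTS &&
                    cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
                        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
                        break;
                }

        return fd >= 0 ? fd : -EIO;
}
```

The received fd is a new descriptor in the receiving process that refers to the same open file as the one that was sent.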
2015-09-10tree-wide: never use the off_t unless glibc makes us use itLennart Poettering
off_t is a really weird type as it is usually 64bit these days (at least in sane programs), but could theoretically be 32bit. We don't support off_t as 32bit builds though, but still constantly deal with safely converting from off_t to other types and back for no point. Hence, never use the type anymore. Always use uint64_t instead. This has various benefits, including that we can expose these values directly as D-Bus properties, and also that the values parse the same in all cases.
2015-09-08nspawn: also close uid shift socket in the parentLennart Poettering
We should really close all parent sides of our child/parent socket pairs.
2015-09-08nspawn: short reads do not set errno, hence don't try to print itLennart Poettering
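The distinction between a failed read and a short read can be illustrated as follows. This is a sketch with a hypothetical helper name, read_exact(), and mapping a short read to -EIO is a choice made here for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads exactly n bytes with a single read() call.
 * Returns 0 on success, -errno if read() failed, and -EIO on a short read.
 * In the short-read case read() succeeded, so errno carries no information
 * and must not be reported. */
static int read_exact(int fd, void *buf, size_t n) {
        ssize_t k;

        assert(fd >= 0);
        assert(buf);

        k = read(fd, buf, n);
        if (k < 0)
                return -errno;   /* real failure: errno is meaningful */
        if ((size_t) k != n)
                return -EIO;     /* short read: errno was not set */

        return 0;
}
```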
2015-09-08nspawn: switch from SOCK_DGRAM to SOCK_SEQPACKET for internal socketpairsLennart Poettering
SOCK_DGRAM and SOCK_SEQPACKET have very similar semantics when used with socketpair(). However, SOCK_SEQPACKET has the advantage of knowing a hangup concept, since it is inherently connection-oriented. Since we use socket pairs to communicate between the nspawn main process and the nspawn child process, where the child might die abnormally, it's interesting to us to learn about this via hangups if the child side of the pair is closed. Hence, let's switch to SOCK_SEQPACKET for these internal communication sockets. Fixes #956.
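The hangup concept can be observed with poll(): once the peer end of a SOCK_SEQPACKET pair is closed, the remaining end reports POLLHUP. The sketch below is an illustration only, and peer_hung_up() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if the peer of fd has hung up, 0 if not, -errno on failure.
 * Uses a zero timeout, so it never blocks. */
static int peer_hung_up(int fd) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        int r;

        assert(fd >= 0);

        r = poll(&p, 1, 0);
        if (r < 0)
                return -errno;

        return r > 0 && (p.revents & POLLHUP);
}
```

A parent process could run such a check on its end of the pair to notice that the child side was closed, for example because the child died.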
2015-09-08nspawn: properly propagate errors when we fail to set something upLennart Poettering
2015-09-07nspawn: sort and clean up included header listLennart Poettering
Let's remove unnecessary inclusions, and order the list alphabetically as suggested in CODING_STYLE now.
2015-09-07nspawn: remove nspawn.h, it's empty nowLennart Poettering
2015-09-07nspawn: split out --uid= logic into nspawn-setuid.[ch]Lennart Poettering
2015-09-07nspawn: split out machined registration code to nspawn-register.[ch]Lennart Poettering
2015-09-07nspawn: split out cgroup related calls into nspawn-cgroup.[ch]Lennart Poettering
2015-09-07nspawn: split out network related code to nspawn-network.[ch]Lennart Poettering
2015-09-07nspawn: split all port exposure code into nspawn-expose-port.[ch]Lennart Poettering
2015-09-07nspawn: split out mount related functions into a new nspawn-mount.c fileLennart Poettering
2015-09-06nspawn: add new .nspawn files for container settingsLennart Poettering
.nspawn files are simple settings files that may accompany container images and directories and contain settings otherwise passed on the nspawn command line. This provides an efficient way to attach execution data directly to containers.
2015-09-04nspawn: enable all controllers we can for the "payload" subcgroup we createLennart Poettering
In the unified hierarchy delegating controller access is safe, hence make sure to enable all controllers for the "payload" subcgroup if we create it, so that the container will have all controllers enabled the nspawn service itself has.
2015-09-01core: unified cgroup hierarchy supportLennart Poettering
This patch set adds full support for the new unified cgroup hierarchy logic of modern kernels. A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is added. If specified the unified hierarchy is mounted to /sys/fs/cgroup instead of a tmpfs. No further hierarchies are mounted. The kernel command line option defaults to off. We can turn it on by default as soon as the kernel's APIs regarding this are stabilized (but even then downstream distros might want to turn this off, as this will break any tools that access cgroupfs directly). It is possible to choose for each boot individually whether the unified or the legacy hierarchy is used. nspawn will by default provide the legacy hierarchy to containers if the host is using it, and the unified otherwise. However it is possible to run containers with the unified hierarchy on a legacy host and vice versa, by setting the $UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0, respectively. The unified hierarchy provides reliable cgroup empty notifications for the first time, via inotify. To make use of this we maintain one manager-wide inotify fd, and add each cgroup to it. This patch also removes cg_delete() which is unused now. On kernel 4.2 only the "memory" controller is compatible with the unified hierarchy, hence that's the only controller systemd exposes when booted in unified hierarchy mode. This introduces a new enum for enumerating supported controllers, plus a related enum for the mask bits mapping to it. The core is changed to make use of this everywhere. This moves PID 1 into a new "init.scope" implicit scope unit in the root slice. This is necessary since on the unified hierarchy cgroups may either contain subgroups or processes but not both. PID 1 hence has to move out of the root cgroup (strictly speaking the root cgroup is the only one where processes and subgroups are still allowed, but in order to support containers nicely, we move PID 1 into the new scope in all cases.)
This new unit is also used on legacy hierarchy setups. It's actually pretty useful on all systems, as it can then be used to filter journal messages coming from PID 1, and so on. The root slice ("-.slice") is now implicitly created and started (and does not require a unit file on disk anymore), since that's where "init.scope" is located and the slice needs to be started before the scope can. To check whether we are in unified or legacy hierarchy mode we use statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in legacy mode, if it reports cgroupfs we are in unified mode. This patch set carefully makes sure that cgls and cgtop continue to work as desired. When invoking nspawn as a service it will implicitly create two subcgroups in the cgroup it is using, one to move the nspawn process into, the other to move the actual container processes into. This is done because of the requirement that cgroups may either contain processes or other subgroups.
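The statfs() check can be sketched as below. This is an illustration and not the code from the patch set: the enum and function names are hypothetical, the magic numbers are copied from the kernel's linux/magic.h, and treating both the cgroup and the cgroup2 magic as "cgroupfs" is an assumption made here so the sketch is not tied to one kernel version.

```c
#include <assert.h>
#include <errno.h>
#include <sys/vfs.h>

/* File system magic values as defined in linux/magic.h. */
#define SKETCH_TMPFS_MAGIC   0x01021994L
#define SKETCH_CGROUP_MAGIC  0x27e0ebL
#define SKETCH_CGROUP2_MAGIC 0x63677270L

enum hierarchy_mode {
        HIERARCHY_LEGACY,   /* a tmpfs is mounted on /sys/fs/cgroup */
        HIERARCHY_UNIFIED,  /* a cgroupfs is mounted directly on /sys/fs/cgroup */
        HIERARCHY_UNKNOWN,
};

/* Maps a statfs() f_type value to a hierarchy mode. */
static enum hierarchy_mode classify_f_type(long f_type) {
        if (f_type == SKETCH_TMPFS_MAGIC)
                return HIERARCHY_LEGACY;
        if (f_type == SKETCH_CGROUP_MAGIC || f_type == SKETCH_CGROUP2_MAGIC)
                return HIERARCHY_UNIFIED;
        return HIERARCHY_UNKNOWN;
}

/* Runs statfs() on path (normally "/sys/fs/cgroup") and stores the detected
 * mode in *ret. Returns 0 on success, -errno on failure. */
static int detect_hierarchy(const char *path, enum hierarchy_mode *ret) {
        struct statfs fs;

        assert(path);
        assert(ret);

        if (statfs(path, &fs) < 0)
                return -errno;

        *ret = classify_f_type((long) fs.f_type);
        return 0;
}
```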
2015-08-29nspawn: don't try to extract quotes from option string, glibc doesn't do that eitherLennart Poettering
Follow-up regarding #649.
2015-08-28nspawn: add (no)rbind option to --bind and --bind-roEugene Yakubovich
--bind and --bind-ro perform the bind mount non-recursively. It is sometimes (often?) desirable to do a recursive mount. This patch adds an optional set of bind mount options in the form of --bind=src-path:dst-path:options. The options are comma-separated, and currently only "rbind" and "norbind" are allowed. The default value is "rbind".
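Parsing the options field can be sketched as follows. This is a minimal illustration with a hypothetical function name, parse_bind_options(); it only covers the two options named above and rejects everything else.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Parses a comma-separated bind option string such as "norbind".
 * Stores the result in *recursive (defaults to true, i.e. "rbind").
 * Returns 0 on success, -EINVAL on an unknown or empty option. */
static int parse_bind_options(const char *options, bool *recursive) {
        const char *p = options;

        assert(recursive);

        *recursive = true;
        if (!p)
                return 0;

        while (*p) {
                size_t n = strcspn(p, ",");

                if (n == 5 && strncmp(p, "rbind", 5) == 0)
                        *recursive = true;
                else if (n == 7 && strncmp(p, "norbind", 7) == 0)
                        *recursive = false;
                else
                        return -EINVAL;

                p += n;
                if (*p == ',')
                        p++;
        }

        return 0;
}
```

When an option appears more than once, the last occurrence wins in this sketch.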
2015-08-25nspawn: make sure --template= and --machine= may be combinedLennart Poettering
Fixes #1018. Based on a patch from Seth Jennings.
2015-08-21remove unused variablesThomas Hindoe Paaboel Andersen
2015-08-07nspawn: Allow : characters in overlay pathsRichard Maw
: characters can be entered with the \: escape sequence.
2015-08-07nspawn: escape paths in overlay mount optionsRichard Maw
Overlayfs uses , as an option separator and : as a list separator. These characters are both valid in file paths, so overlayfs allows file paths which contain these characters to backslash escape these values.
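Backslash-escaping a path before placing it into an overlayfs option string could look like the sketch below. The function name escape_overlay_path() is hypothetical, and escaping the backslash itself is an assumption made here so that the escaping stays unambiguous.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of path in which ',' and ':' (and, as an
 * assumption of this sketch, '\') are prefixed with a backslash.
 * Returns NULL if memory allocation fails. The caller frees the result. */
static char *escape_overlay_path(const char *path) {
        char *out, *t;

        assert(path);

        /* Worst case: every character needs an escape. */
        out = malloc(strlen(path) * 2 + 1);
        if (!out)
                return NULL;

        for (t = out; *path; path++) {
                if (*path == ',' || *path == ':' || *path == '\\')
                        *t++ = '\\';
                *t++ = *path;
        }
        *t = 0;

        return out;
}
```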
2015-08-07nspawn: Allow : characters in nspawn --bind pathsRichard Maw
: characters in bind paths can be entered as the \: escape sequence.
2015-08-07nspawn: Allow : characters in --tmpfs pathRichard Maw
This now accepts : characters with the \: escape sequence. Other escape sequences are also interpreted, but having a \ in your file path is less likely than :, so this shouldn't break anyone's existing tools.
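Splitting a colon-separated argument while honouring the \: escape can be sketched like this. It is an illustration only: next_field() is a hypothetical helper, and it treats a backslash followed by any character as that literal character, which is simpler than interpreting the full set of escape sequences mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extracts the next ':'-separated field from *p, unescaping "\x" to "x",
 * and advances *p past the field and its separator.
 * Returns a newly allocated string, or NULL if allocation fails. */
static char *next_field(const char **p) {
        const char *s;
        char *out, *t;

        assert(p);
        assert(*p);

        s = *p;
        out = malloc(strlen(s) + 1);
        if (!out)
                return NULL;

        for (t = out; *s && *s != ':';) {
                if (*s == '\\' && s[1])
                        s++;        /* drop the backslash, keep the next char */
                *t++ = *s++;
        }
        *t = 0;

        if (*s == ':')
                s++;                /* skip the unescaped separator */
        *p = s;

        return out;
}
```

Applied to the input /srv/a\:b:/mnt, two successive calls yield /srv/a:b and /mnt.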
2015-08-05Merge branch 'hostnamectl-dot-v2'Zbigniew Jędrzejewski-Szmek
Manual merge of https://github.com/systemd/systemd/pull/751.
2015-08-05hostname-util: get rid of unused parameter of hostname_cleanup()Zbigniew Jędrzejewski-Szmek
All users are now setting lowercase=false.
2015-07-31tree-wide: introduce mfree()David Herrmann
Pretty trivial helper which wraps free() but returns NULL, so we can simplify this: free(foobar); foobar = NULL; to this: foobar = mfree(foobar);
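Based on the description above, such a helper can be written in a few lines. The following is a sketch of the idea, and the surrounding example function is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Frees memory and returns NULL, so freeing and resetting the pointer
 * can be written as a single assignment. */
static inline void *mfree(void *memory) {
        free(memory);
        return NULL;
}

/* Hypothetical usage example. */
static void mfree_example(void) {
        char *foobar = malloc(16);

        /* Instead of: free(foobar); foobar = NULL; */
        foobar = mfree(foobar);
        assert(!foobar);
}
```

Since free(NULL) is a no-op, calling the helper on a pointer that is already NULL is safe.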
2015-07-30tree-wide: use free_and_strdup()Daniel Mack
Use free_and_strdup() where appropriate and replace equivalent, open-coded versions.
2015-07-22nspawn: Don't pass uid mount option for devptsMike Gilbert
Mounting devpts with a uid breaks pty allocation with recent glibc versions, which expect that the kernel will set the correct owner for user-allocated ptys. The kernel seems to be smart enough to use the correct uid for root when we switch to a user namespace. This resolves #337.
2015-07-08Merge pull request #500 from zonque/fileioLennart Poettering
fileio: consolidate write_string_file*()
2015-07-07Remove repeated 'the'sZbigniew Jędrzejewski-Szmek
2015-07-06tree-wide: fix write_string_file() user that should not create filesDaniel Mack
The latest consolidation cleanup of write_string_file() revealed some users of that helper which should have used write_string_file_no_create() in the past but didn't. Basically, all existing users that write to files in /sys and /proc should not expect to write to a file which does not yet exist.
2015-07-06fileio: consolidate write_string_file*()Daniel Mack
Merge write_string_file(), write_string_file_no_create() and write_string_file_atomic() into write_string_file() and provide a flags mask that allows combinations of atomic writing, newline appending and automatic file creation. Change all users accordingly.
2015-07-06Merge pull request #492 from richardmaw-codethink/nspawn-automatic-uid-shift-fix-v2Lennart Poettering
nspawn: Communicate determined UID shift to parent version 2
2015-07-06nspawn: Communicate determined UID shift to parentRichard Maw
There is logic to determine the UID shift from the file-system, rather than having it be explicitly passed in. However, this needs to happen in the child process that sets up the mounts, as what's important is the UID of the mounted root, rather than the mount-point. Setting up the UID map needs to happen in the parent because the inner child needs to have been started, and the outer child is no longer able to access the uid_map file, since it lost access to it when setting up the mounts for the inner child. So we need to communicate the UID shift back out, along with the PID of the inner child process. Failing to communicate this means that the invalid UID shift, which is the value used to specify "this needs to be determined from the file system", is left invalid, so setting up the user namespace's UID shift fails.