summaryrefslogtreecommitdiff
path: root/src/nspawn
AgeCommit message (Collapse)Author
2016-06-13nspawn: lock down system call filter a bitLennart Poettering
Let's block access to the kernel keyring and a number of obsolete system calls. Also, update list of syscalls that may alter the system clock, and do raw IO access. Filter ptrace() if CAP_SYS_PTRACE is not passed to the container and acct() if CAP_SYS_PACCT is not passed. This also changes things so that kexec(), some profiling calls, the swap calls and quotactl() is never available to containers, not even if CAP_SYS_ADMIN is passed. After all we currently permit CAP_SYS_ADMIN to containers by default, but these calls should not be available, even then.
2016-06-13nspawn: order caps to retain alphabeticallyLennart Poettering
2016-06-10nspawn: introduce --notify-ready=[no|yes] (#3474)Alessandro Puccetti
This the patch implements a notificaiton mechanism from the init process in the container to systemd-nspawn. The switch --notify-ready=yes configures systemd-nspawn to wait the "READY=1" message from the init process in the container to send its own to systemd. --notify-ready=no is equivalent to the previous behavior before this patch, systemd-nspawn notifies systemd with a "READY=1" message when the container is created. This notificaiton mechanism uses socket file with path relative to the contanier "/run/systemd/nspawn/notify". The default values it --notify-ready=no. It is also possible to configure this mechanism from the .nspawn files using NotifyReady. This parameter takes the same options of the command line switch. Before this patch, systemd-nspawn notifies "ready" after the inner child was created, regardless the status of the service running inside it. Now, with --notify-ready=yes, systemd-nspawn notifies when the service is ready. This is really useful when there are dependencies between different contaniers. Fixes https://github.com/systemd/systemd/issues/1369 Based on the work from https://github.com/systemd/systemd/pull/3022 Testing: Boot a OS inside a container with systemd-nspawn. Note: modify the commands accordingly with your filesystem. 1. Create a filesystem where you can boot an OS. 2. sudo systemd-nspawn -D ${HOME}/distros/fedora-23/ sh 2.1. Create the unit file /etc/systemd/system/sleep.service inside the container (You can use the example below) 2.2. systemdctl enable sleep 2.3 exit 3. sudo systemd-run --service-type=notify --unit=notify-test ${HOME}/systemd/systemd-nspawn --notify-ready=yes -D ${HOME}/distros/fedora-23/ -b 4. In a different shell run "systemctl status notify-test" When using --notify-ready=yes the service status is "activating" for 20 seconds before being set to "active (running)". Instead, using --notify-ready=no the service status is marked "active (running)" quickly, without waiting for the 20 seconds. This patch was also test with --private-users=yes, you can test it just adding it at the end of the command at point 3. ------ sleep.service ------ [Unit] Description=sleep After=network.target [Service] Type=oneshot ExecStart=/bin/sleep 20 [Install] WantedBy=multi-user.target ------------ end ------------
2016-05-29util-lib: Add sparc64 support for process creation (#3348)Michael Karcher
The current raw_clone function takes two arguments, the cloning flags and a pointer to the stack for the cloned child. The raw cloning without passing a "thread main" function does not make sense if a new stack is specified, as it returns in both the parent and the child, which will fail in the child as the stack is virgin. All uses of raw_clone indeed pass NULL for the stack pointer which indicates that both processes should share the stack address (so you better don't pass CLONE_VM). This commit refactors the code to not require the caller to pass the stack address, as NULL is the only sensible option. It also adds the magic code needed to make raw_clone work on sparc64, which does not return 0 in %o0 for the child, but indicates the child process by setting %o1 to non-zero. This refactoring is not plain aesthetic, because non-NULL stack addresses need to get mangled before being passed to the clone syscall (you have to apply STACK_BIAS), whereas NULL must not be mangled. Implementing the conditional mangling of the stack address would needlessly complicate the code. raw_clone is moved to a separete header, because the burden of including the assert machinery and sched.h shouldn't be applied to every user of missing_syscalls.h
2016-05-26nspawn: rename arg_retain to arg_caps_retainDjalal Harouni
The argument is about capabilities.
2016-05-26nspawn: split out seccomp call into nspawn-seccomp.[ch]Djalal Harouni
Split seccomp into nspawn-seccomp.[ch]. Currently there are no changes, but this will make it easy in the future to share or use the seccomp logic from systemd core.
2016-05-26nspawn: rename is_procfs_sysfs_or_suchlike() to is_fs_fully_userns_compatible()Djalal Harouni
Rename is_procfs_sysfs_or_suchlike() to is_fs_fully_userns_compatible() to give it the real meaning. This may prevent future modifications that may introduce bugs.
2016-05-26nspawn: a bench of special fileystems that should not be shiftedDjalal Harouni
Add some special filesystems that should not be shifted, most of them relate to the host and not to containers.
2016-05-22nspawn: remove unreachable return statement (#3320)Zbigniew Jędrzejewski-Szmek
2016-05-12nspawn: drop spurious newlineLennart Poettering
2016-05-09nspawn: only remove veth links we created ourselvesLennart Poettering
Let's make sure we don't remove veth links that existed before nspawn was invoked. https://github.com/systemd/systemd/pull/3209#discussion_r62439999
2016-05-09tree-wide: port more code to use ifname_valid()Lennart Poettering
2016-05-09nspawn: add new --network-zone= switch for automatically managed bridge devicesLennart Poettering
This adds a new concept of network "zones", which are little more than bridge devices that are automatically managed by nspawn: when the first container referencing a bridge is started, the bridge device is created, when the last container referencing it is removed the bridge device is removed again. Besides this logic --network-zone= is pretty much identical to --network-bridge=. The usecase for this is to make it easy to run multiple related containers (think MySQL in one and Apache in another) in a common, named virtual Ethernet broadcast zone, that only exists as long as one of them is running, and fully automatically managed otherwise.
2016-05-09util-lib: add new ifname_valid() call that validates interface namesLennart Poettering
Make use of this in nspawn at a couple of places. A later commit should port more code over to this, including networkd.
2016-05-03Merge pull request #3111 from poettering/nspawn-remove-vethZbigniew Jędrzejewski-Szmek
2016-05-03Revert "nspawn: explicitly remove veth links after use (#3111)"Zbigniew Jędrzejewski-Szmek
This reverts commit d2773e59de3dd970d861e9f996bc48de20ef4314. Merge got squashed by mistake.
2016-04-29nspawn: convert uuid to string (#3146)Evgeny Vereshchagin
Fixes: cp /etc/machine-id /var/tmp/systemd-test.HccKPa/nspawn-root/etc systemd-nspawn -D /var/tmp/systemd-test.HccKPa/nspawn-root --link-journal host -b ... Host and machine ids are equal (P�S!V): refusing to link journals
2016-04-28nspawn: initialize the veth_name (#3141)Evgeny Vereshchagin
Fixes: $ systemd-nspawn -h ... Failed to remove veth interface ����: Operation not permitted This is a follow-up for d2773e59de3dd970d861
2016-04-26Merge pull request #3093 from poettering/nspawn-userns-magicLennart Poettering
nspawn automatic user namespaces
2016-04-25nspawn: explicitly remove veth links after use (#3111)Lennart Poettering
* sd-netlink: permit RTM_DELLINK messages with no ifindex This is useful for removing network interfaces by name. * nspawn: explicitly remove veth links we created after use Sometimes the kernel keeps veth links pinned after the namespace they have been joined to died. Let's hence explicitly remove veth links after use. Fixes: #2173
2016-04-25nspawn: explicitly remove veth links we created after useLennart Poettering
Sometimes the kernel keeps veth links pinned after the namespace they have been joined to died. Let's hence explicitly remove veth links after use. Fixes: #2173
2016-04-25nspawn: when readjusting UID/GID ownership of OS trees, skip read-only subtreesLennart Poettering
This should allow tools like rkt to pre-mount read-only subtrees in the OS tree, without breaking the patching code. Note that the code will still fail, if the top-level directory is already read-only.
2016-04-25nspawn: don't try to patch UIDs/GIDs of procfs and suchlikeLennart Poettering
2016-04-25nspawn: make -U a tiny bit smarterLennart Poettering
With this change -U will turn on user namespacing only if the kernel actually supports it and otherwise gracefully degrade to non-userns mode.
2016-04-25nspawn: allow configuration of user namespaces in .nspawn filesLennart Poettering
In order to implement this we change the bool arg_userns into an enum UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the arg_uid_range_pick bool with it.
2016-04-25nspawn: add -U as shortcut for --private-users=pickLennart Poettering
Given that user namespacing is pretty useful now, let's add a shortcut command line switch for the logic.
2016-04-25nspawn: optionally, automatically allocate a UID/GID range for userns containersLennart Poettering
This adds the new value "pick" to --private-users=. When specified a new UID/GID range of 65536 users is automatically and randomly allocated from the host range 0x00080000-0xDFFF0000 and used for the container. The setting implies --private-users-chown, so that container directory is recursively chown()ed to the newly allocated UID/GID range, if that's necessary. As an optimization before picking a randomized UID/GID the UID of the container's root directory is used as starting point and used if currently not used otherwise. To protect against using the same UID/GID range multiple times a few mechanisms are in place: - The first and the last UID and GID of the range are checked with getpwuid() and getgrgid(). If an entry already exists a different range is picked. Note that by "last" UID the user 65534 is used, as 65535 is the 16bit (uid_t) -1. - A lock file for the range is taken in /run/systemd/nspawn-uid/. Since the ranges are taken in a non-overlapping fashion, and always start on 64K boundaries this allows us to maintain a single lock file for each range that can be randomly picked. This protects nspawn from picking the same range in two parallel instances. - If possible the /etc/passwd lock file is taken while a new range is selected until the container is up. This means adduser/addgroup should safely avoid the range as long as nss-mymachines is used, since the allocated range will then show up in the user database. The UID/GID range nspawn picks from is compiled in and not configurable at the moment. That should probably stay that way, since we already provide ways how users can pick their own ranges manually if they don't like the automatic logic. The new --private-users=pick logic makes user namespacing pretty useful now, as it relieves the user from managing UID/GID ranges.
2016-04-25nspawn: optionally fix up OS tree uid/gids for usernsLennart Poettering
This adds a new --private-userns-chown switch that may be used in combination with --private-userns. If it is passed a recursive chmod() operation is run on the OS tree, fixing all file owner UID/GIDs to the right ranges. This should make user namespacing pretty workable, as the OS trees don't need to be prepared manually anymore.
2016-04-22tree-wide: remove unused variables (#3098)Thomas H. P. Andersen
2016-04-22shared: move unit-specific code from bus-util.h to bus-unit-util.hLennart Poettering
Previously we'd have generally useful sd-bus utilities in bust-util.h, intermixed with code that is specifically for writing clients for PID 1, wrapping job and unit handling. Let's split the latter out and move it into bus-unit-util.c, to make the sources a bit short and easier to grok.
2016-04-21tree-wide: use mdash instead of a two minusesZbigniew Jędrzejewski-Szmek
2016-04-20nspawn: add -E as alias for --setenvZbigniew Jędrzejewski-Szmek
v2: - "=" is required, so remove the <optional> tags that v1 added
2016-04-11Merge pull request #3014 from msekletar/nspawn-empty-machine-id-v3Lennart Poettering
nspawn: always setup machine id (v3)
2016-04-11nspawn: always setup machine idMichal Sekletar
We check /etc/machine-id of the container and if it is already populated we use value from there, possibly ignoring value of --uuid option from the command line. When dealing with R/O image we setup transient machine id. Once we determined machine id of the container, we use this value for registration with systemd-machined and we also export it via container_uuid environment variable. As registration with systemd-machined is done by the main nspawn process we communicate container machine id established by setup_machine_id from outer child to the main process by unix domain socket. Similarly to PID of inner child.
2016-04-08nspawn: ignore failure to chdirZbigniew Jędrzejewski-Szmek
CID #1322380.
2016-04-01prevent systemd-nspawn from trying to create targetBjørnar Ness
for bind-mounts when they already exist. This allows bind-mounting over read-only files.
2016-03-26core: update populated event handling in unified hierarchyTejun Heo
Earlier during the development of unified hierarchy, the populated event was reported through by the dedicated "cgroup.populated" file; however, the interface was updated so that it's reported through the "populated" field of "cgroup.events" file. Update populated event handling logic accordingly.
2016-03-26cgroup2: use new fstype for unified hierarchyAlban Crequy
Since Linux v4.4-rc1, __DEVEL__sane_behavior does not exist anymore and is replaced by a new fstype "cgroup2". With this patch, systemd no longer supports the old (unstable) way of doing unified hierarchy with __DEVEL__sane_behavior and systemd now requires Linux v4.4 for unified hierarchy. Non-unified hierarchy is still the default and is unchanged by this patch. https://github.com/torvalds/linux/commit/67e9c74b8a873408c27ac9a8e4c1d1c8d72c93ff
2016-03-17nspawn: don't run nspawn --port=... without libiptc supportEvgeny Vereshchagin
We get $ systemd-nspawn --image /dev/loop1 --port 8080:80 -n -b 3 --port= is not supported, compiled without libiptc support. instead of a ping-nc-iptables debugging session
2016-03-16nspawn: Fix two misspellings of "hierarchy" in error messagesTobias Klauser
2016-03-09/dev/console must be labeled with SELinux labelDan Walsh
If the user specifies an selinux_apifs_context all content created in the container including /dev/console should use this label. Currently when this uses the default label it gets labeled user_devpts_t, which would require us to write a policy allowing container processes to manage user_devpts_t. This means that an escaped process would be allowed to attack all users terminals as well as other container terminals. Changing the label to match the apifs_context, means the processes would only be allowed to manage their specific tty. This change fixes a problem preventing RKT containers from working with systemd-nspawn.
2016-02-23tree-wide: minor formatting inconsistency cleanupsVito Caputo
2016-02-22tree-wide: make ++/-- usage consistent WRT spacingVito Caputo
Throughout the tree there's spurious use of spaces separating ++ and -- operators from their respective operands. Make ++ and -- operator consistent with the majority of existing uses; discard the spaces.
2016-02-13Merge pull request #2589 from keszybz/resolve-tool-2Lennart Poettering
Better support of OPENPGPKEY, CAA, TLSA packets and tests
2016-02-11Add memcpy_safeZbigniew Jędrzejewski-Szmek
ISO/IEC 9899:1999 §7.21.1/2 says: Where an argument declared as size_t n specifies the length of the array for a function, n can have the value zero on a call to that function. Unless explicitly stated otherwise in the description of a particular function in this subclause, pointer arguments on such a call shall still have valid values, as described in 7.1.4. In base64_append_width memcpy was called as memcpy(x, NULL, 0). GCC 4.9 started making use of this and assumes This worked fine under -O0, but does something strange under -O3. This patch fixes a bug in base64_append_width(), fixes a possible bug in journal_file_append_entry_internal(), and makes use of the new function to simplify the code in other places.
2016-02-10tree-wide: remove Emacs lines from all filesDaniel Mack
This should be handled fine now by .dir-locals.el, so need to carry that stuff in every file.
2016-02-03nspawn: make sure --help fits it 79chLennart Poettering
2016-02-03nspawn: optionally run a stub init process as PID 1Lennart Poettering
This adds a new switch --as-pid2, which allows running commands as PID 2, while a stub init process is run as PID 1. This is useful in order to run arbitrary commands in a container, as PID1's semantics are different from all other processes regarding reaping of unknown children or signal handling.
2016-02-03nspawn: add new --chdir= switchLennart Poettering
Fixes: #2192
2016-02-01core: fix support for transient resource limit propertiesLennart Poettering
Make sure we can properly process resource limit properties. Specifically, allow transient configuration of both the soft and hard limit, the same way from the unit files. Previously, only the the hard rlimits could be configured but they'd implicitly spill into the soft hard rlimits. This also updates the client-side code to be able to parse hard/soft resource limit specifications. Since we need to serialize two properties in bus_append_unit_property_assignment() now, the marshalling of the container around it is now moved into the function itself. This has the benefit of shortening the calling code. As a side effect this now beefs up the rlimit parser of "systemctl set-property" to understand time and disk sizes where that's appropriate.