Age | Commit message (Collapse) | Author |
|
Fixes:
$ systemd-nspawn -h
...
Failed to remove veth interface ����: Operation not permitted
This is a follow-up for d2773e59de3dd970d861
|
|
nspawn automatic user namespaces
|
|
* sd-netlink: permit RTM_DELLINK messages with no ifindex
This is useful for removing network interfaces by name.
* nspawn: explicitly remove veth links we created after use
Sometimes the kernel keeps veth links pinned after the namespace they have been
joined to died. Let's hence explicitly remove veth links after use.
Fixes: #2173
|
|
This should allow tools like rkt to pre-mount read-only subtrees in the OS
tree, without breaking the patching code.
Note that the code will still fail, if the top-level directory is already
read-only.
|
|
|
|
With this change -U will turn on user namespacing only if the kernel actually
supports it and otherwise gracefully degrade to non-userns mode.
|
|
In order to implement this we change the bool arg_userns into an enum
UserNamespaceMode, which can take one of NO, PICK or FIXED, and replace the
arg_uid_range_pick bool with it.
|
|
Given that user namespacing is pretty useful now, let's add a shortcut command
line switch for the logic.
|
|
This adds the new value "pick" to --private-users=. When specified a new
UID/GID range of 65536 users is automatically and randomly allocated from the
host range 0x00080000-0xDFFF0000 and used for the container. The setting
implies --private-users-chown, so that container directory is recursively
chown()ed to the newly allocated UID/GID range, if that's necessary. As an
optimization before picking a randomized UID/GID the UID of the container's
root directory is used as starting point and used if currently not used
otherwise.
To protect against using the same UID/GID range multiple times a few mechanisms
are in place:
- The first and the last UID and GID of the range are checked with getpwuid()
and getgrgid(). If an entry already exists a different range is picked. Note
that by "last" UID the user 65534 is used, as 65535 is the 16bit (uid_t) -1.
- A lock file for the range is taken in /run/systemd/nspawn-uid/. Since the
ranges are taken in a non-overlapping fashion, and always start on 64K
boundaries this allows us to maintain a single lock file for each range that
can be randomly picked. This protects nspawn from picking the same range in
two parallel instances.
- If possible the /etc/passwd lock file is taken while a new range is selected
until the container is up. This means adduser/addgroup should safely avoid
the range as long as nss-mymachines is used, since the allocated range will
then show up in the user database.
The UID/GID range nspawn picks from is compiled in and not configurable at the
moment. That should probably stay that way, since we already provide ways how
users can pick their own ranges manually if they don't like the automatic
logic.
The new --private-users=pick logic makes user namespacing pretty useful now, as
it relieves the user from managing UID/GID ranges.
|
|
This adds a new --private-userns-chown switch that may be used in combination
with --private-userns. If it is passed a recursive chmod() operation is run on
the OS tree, fixing all file owner UID/GIDs to the right ranges. This should
make user namespacing pretty workable, as the OS trees don't need to be
prepared manually anymore.
|
|
|
|
Previously we'd have generally useful sd-bus utilities in bust-util.h,
intermixed with code that is specifically for writing clients for PID 1,
wrapping job and unit handling. Let's split the latter out and move it into
bus-unit-util.c, to make the sources a bit short and easier to grok.
|
|
|
|
v2:
- "=" is required, so remove the <optional> tags that v1 added
|
|
nspawn: always setup machine id (v3)
|
|
We check /etc/machine-id of the container and if it is already populated
we use value from there, possibly ignoring value of --uuid option from
the command line. When dealing with R/O image we setup transient machine
id.
Once we determined machine id of the container, we use this value for
registration with systemd-machined and we also export it via
container_uuid environment variable.
As registration with systemd-machined is done by the main nspawn process
we communicate container machine id established by setup_machine_id from
outer child to the main process by unix domain socket. Similarly to PID
of inner child.
|
|
CID #1322380.
|
|
for bind-mounts when they already exist. This allows
bind-mounting over read-only files.
|
|
Earlier during the development of unified hierarchy, the populated event was
reported through by the dedicated "cgroup.populated" file; however, the
interface was updated so that it's reported through the "populated" field of
"cgroup.events" file. Update populated event handling logic accordingly.
|
|
Since Linux v4.4-rc1, __DEVEL__sane_behavior does not exist anymore and
is replaced by a new fstype "cgroup2".
With this patch, systemd no longer supports the old (unstable) way of
doing unified hierarchy with __DEVEL__sane_behavior and systemd now
requires Linux v4.4 for unified hierarchy.
Non-unified hierarchy is still the default and is unchanged by this
patch.
https://github.com/torvalds/linux/commit/67e9c74b8a873408c27ac9a8e4c1d1c8d72c93ff
|
|
We get
$ systemd-nspawn --image /dev/loop1 --port 8080:80 -n -b 3
--port= is not supported, compiled without libiptc support.
instead of a ping-nc-iptables debugging session
|
|
|
|
If the user specifies an selinux_apifs_context all content created in
the container including /dev/console should use this label.
Currently when this uses the default label it gets labeled user_devpts_t,
which would require us to write a policy allowing container processes to
manage user_devpts_t. This means that an escaped process would be allowed
to attack all users terminals as well as other container terminals. Changing
the label to match the apifs_context, means the processes would only be allowed
to manage their specific tty.
This change fixes a problem preventing RKT containers from working with systemd-nspawn.
|
|
|
|
Throughout the tree there's spurious use of spaces separating ++ and --
operators from their respective operands. Make ++ and -- operator
consistent with the majority of existing uses; discard the spaces.
|
|
Better support of OPENPGPKEY, CAA, TLSA packets and tests
|
|
ISO/IEC 9899:1999 §7.21.1/2 says:
Where an argument declared as size_t n specifies the length of the array
for a function, n can have the value zero on a call to that
function. Unless explicitly stated otherwise in the description of a
particular function in this subclause, pointer arguments on such a call
shall still have valid values, as described in 7.1.4.
In base64_append_width memcpy was called as memcpy(x, NULL, 0). GCC 4.9
started making use of this and assumes This worked fine under -O0, but
does something strange under -O3.
This patch fixes a bug in base64_append_width(), fixes a possible bug in
journal_file_append_entry_internal(), and makes use of the new function
to simplify the code in other places.
|
|
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
|
|
|
|
This adds a new switch --as-pid2, which allows running commands as PID 2, while a stub init process is run as PID 1.
This is useful in order to run arbitrary commands in a container, as PID1's semantics are different from all other
processes regarding reaping of unknown children or signal handling.
|
|
Fixes: #2192
|
|
Make sure we can properly process resource limit properties. Specifically, allow transient configuration of both the
soft and hard limit, the same way from the unit files. Previously, only the the hard rlimits could be configured but
they'd implicitly spill into the soft hard rlimits.
This also updates the client-side code to be able to parse hard/soft resource limit specifications. Since we need to
serialize two properties in bus_append_unit_property_assignment() now, the marshalling of the container around it is
now moved into the function itself. This has the benefit of shortening the calling code.
As a side effect this now beefs up the rlimit parser of "systemctl set-property" to understand time and disk sizes
where that's appropriate.
|
|
Fixes #2186
This fixes fall-out from 574edc90066c3faeadcf4666928ed9b0ac409c75.
|
|
Fixes #2091
|
|
|
|
Change the capability bounding set parser and logic so that the bounding
set is kept as a positive set internally. This means that the set
reflects those capabilities that we want to keep instead of drop.
|
|
On errors, mention the functions that really failed.
|
|
When starting a container in a new user namespace, systemd-nspawn chowns
the cgroup knob files so they are usable by the container. But the
cgroup knob file "cgroup.events" was missing. This file exists when the
unified hierarchy is used.
|
|
https://github.com/systemd/systemd/issues/2016
|
|
GLIB has recently started to officially support the gcc cleanup
attribute in its public API, hence let's do the same for our APIs.
With this patch we'll define an xyz_unrefp() call for each public
xyz_unref() call, to make it easy to use inside a
__attribute__((cleanup())) expression. Then, all code is ported over to
make use of this.
The new calls are also documented in the man pages, with examples how to
use them (well, I only added docs where the _unref() call itself already
had docs, and the examples, only cover sd_bus_unrefp() and
sd_event_unrefp()).
This also renames sd_lldp_free() to sd_lldp_unref(), since that's how we
tend to call our destructors these days.
Note that this defines no public macro that wraps gcc's attribute and
makes it easier to use. While I think it's our duty in the library to
make our stuff easy to use, I figure it's not our duty to make gcc's own
features easy to use on its own. Most likely, client code which wants to
make use of this should define its own:
#define _cleanup_(function) __attribute__((cleanup(function)))
Or similar, to make the gcc feature easier to use.
Making this logic public has the benefit that we can remove three header
files whose only purpose was to define these functions internally.
See #2008.
|
|
This is a continuation of the previous include sort patch, which
only sorted for .c files.
|
|
tree-wide: group include of libudev.h with sd-*
|
|
|
|
|
|
siphash24: let siphash24_finalize() and siphash24() return the result…
|
|
Rather than passing a pointer to return the result, return it directly
from the function calls.
Also, return the result in native endianess, and let the callers care
about the conversion. For hash tables and bloom filters, we don't care,
but in order to keep MAC addresses and DHCP client IDs stable, we
explicitly convert to LE.
|
|
Sort the includes accoding to the new coding style.
|
|
siphash: alignment
|
|
Change the "out" parameter from uint8_t[8] to uint64_t. On architectures which
enforce pointer alignment this fixes crashes when we previously cast an
unaligned array to uint64_t*, and on others this should at least improve
performance as the compiler now aligns these properly.
This also simplifies the code in most cases by getting rid of typecasts. The
only place which we can't change is struct duid's en.id, as that is _packed_
and public API, so we can't enforce alignment of the "id" field and have to
use memcpy instead.
|
|
|