Age | Commit message (Collapse) | Author |
|
Make sure to propagate error codes from mount-loops correctly. Right now,
we return the return-code of the first mount that did _something_. This is
not what we want. Make sure we return an error if _any_ mount fails (and
then make sure to return the first error to not hide proper errors due to
consequential errors like -ENOTDIR).
Reported by cee1 <fykcee1@gmail.com>.
|
|
Even though systemd has its own smack label since
'--with-smack-run-label' configuration is set, the smack label of each
CGROUP root directory should have the star (i.e. *) label. This is
mainly because current Linux Kernel set the label in this way.
(Refer to smack_d_instantiate() in security/smack/smack_lsm.c)
However, if systemd has its own smack label and arg_join_controllers is
explicitly set or initialized by initialize_join_controllers() function,
current systemd creates the symlink in CGROUP root directory with its
own smack label as below.
lrwxrwxrwx. 1 root root System 11 Dec 31 16:00 cpu -> cpu,cpuacct
dr-xr-xr-x. 4 root root * 0 Dec 31 16:01 cpu,cpuacct
lrwxrwxrwx. 1 root root System 11 Dec 31 16:00 cpuacct -> cpu,cpuacct
This patch fixes that bug by copying the smack label from the origin.
|
|
Introduce a proper enum, and don't pass around string ids anymore. This
simplifies things quite a bit, and makes virtualization detection more
similar to architecture detection.
|
|
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.
A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).
It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.
The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.
This patch also removes cg_delete() which is unused now.
On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.
This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.
This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.
The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.
To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.
This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.
When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
|
|
Whoopsy, forgot to 'git add' this, sorry.
|
|
Just like we conditionalize loading kdbus.ko, we should conditionalize
mounting kdbusfs. Otherwise, we might run with kdbus if it is builtin,
even though the user didn't want this.
|
|
./configure --enable/disable-kdbus can be used to set the default
behavior regarding kdbus.
If no kdbus kernel support is available, dbus-dameon will be used.
With --enable-kdbus, the kernel command line option "kdbus=0" can
be used to disable kdbus.
With --disable-kdbus, the kernel command line option "kdbus=1" is
required to enable kdbus support.
|
|
This makes path_is_mount_point() consistent with fd_is_mount_point() wrt.
flags.
|
|
|
|
We must not fail on ENOENT. We properly create the mount-point in
mount-setup, so there's really no reason to skip the mount. Make sure we
just skip the mount on unexpected failures or if it's already mounted.
|
|
Commit e792e890f ("path-util: don't eat up ENOENT in
path_is_mount_point()") changed path_is_mount_point() so it doesn't hide
-ENOENT from its caller. This causes all boots to fail early in case
any of the mount points does not exist (for instance, when kdbus isn't
loaded, /sys/fs/kdbus is missing).
Fix this by returning 0 from mount_one() if path_is_mount_point()
returned -ENOENT.
|
|
This patch removes includes that are not used. The removals were found with
include-what-you-use which checks if any of the symbols from a header is
in use.
|
|
Current systemd requires kernel >= 3.7 per the README file
but CONFIG_USB_DEVICEFS disappeared from the kernel in
upstream commit fb28d58b72aa9215b26f1d5478462af394a4d253
(kernel 3.5-rc1)
|
|
them anymore
|
|
|
|
for the container's own subtree in the name=systemd hierarchy
More specifically mount all other hierarchies in their entirety and the
name=systemd above the container's subtree read-only.
|
|
Using the same scripts as in f647962d64e "treewide: yet more log_*_errno
+ return simplifications".
|
|
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.
Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'
Plus some whitespace, linewrap, and indent adjustments.
|
|
|
|
Turns out we can just do kmod_setup() earlier, before we do mount_setup(),
so there's no need for mount_setup_late() anymore. Instead, put kdbusfs in
mount_table[].
|
|
kdbus has seen a larger update than expected lately, most notably with
kdbusfs, a file system to expose the kdbus control files:
* Each time a file system of this type is mounted, a new kdbus
domain is created.
* The layout inside each mount point is the same as before, except
that domains are not hierarchically nested anymore.
* Domains are therefore also unnamed now.
* Unmounting a kdbusfs will automatically also detroy the
associated domain.
* Hence, the action of creating a kdbus domain is now as
privileged as mounting a filesystem.
* This way, we can get around creating dev nodes for everything,
which is last but not least something that is not limited by
20-bit minor numbers.
The kdbus specific bits in nspawn have all been dropped now, as nspawn
can rely on the container OS to set up its own kdbus domain, simply by
mounting a new instance.
A new set of mounts has been added to mount things *after* the kernel
modules have been loaded. For now, only kdbus is in this set, which is
invoked with mount_setup_late().
|
|
new mac_{smack,selinux,apparmor}_xyz() convention
|
|
This is also the only place where FTW_ACTIONRETVAL is used, so
this makes systemd compile without SELinux or SMACK support
when the standard library doesn't support this extension.
|
|
It is redundant to store 'hash' and 'compare' function pointers in
struct Hashmap separately. The functions always comprise a pair.
Store a single pointer to struct hash_ops instead.
systemd keeps hundreds of hashmaps, so this saves a little bit of
memory.
|
|
http://lists.freedesktop.org/archives/systemd-devel/2014-August/021772.html
|
|
Failure to mount cgroups with xattr should not be fatal
|
|
the "/sys" is mounted. So we should mount "sys" before "/proc"
https://bugs.freedesktop.org/show_bug.cgi?id=77646
|
|
|
|
We should no longer pretend that we can run in any sensible way
without the kernel supporting us with cgroups functionality.
|
|
|
|
Given that glibc searches for /dev/shm by just looking for any tmpfs we
should be more careful with providing tmpfs instances arbitrary code
might end up writing to.
|
|
|
|
Similar to PrivateNetwork=, PrivateTmp= introduce PrivateDevices= that
sets up a private /dev with only the API pseudo-devices like /dev/null,
/dev/zero, /dev/random, but not any physical devices in them.
|
|
Also for log_error() except where a specific error is specified
e.g. errno ? strerror(errno) : "Some user specified message"
|
|
Since on most systems with xattr systemd will compile with Smack
support enabled, we still attempt to mount various fs's with
Smack-only options.
Before mounting any of these Smack-related filesystems with
Smack specific mount options, check if Smack is functionally
active on the running kernel.
If Smack is really enabled in the kernel, all these Smack mounts
are now *fatal*, as they should be.
We no longer mount smackfs if systemd was compiled without
Smack support. This makes it easier to make smackfs mount
failures a critical error when Smack is enabled.
We no longer mount these filesystems with their Smack specific
options inside containers. There these filesystems will be
mounted with there non-mount smack options for now.
|
|
Once systemd itself is running in a security domain for SMACK,
it will fail to start countless tasks due to missing privileges
for mounted and created directory structures. For /run and shm
specifically, we grant all tasks access.
These 2 mounts are allowed to fail, which will happen if the
system is not running a SMACK enabled kernel or security=none is
passed to the kernel.
|
|
dracut uses systemd in the initramfs and does not write these files
anymore.
The state of the root fsck is serialized.
|
|
|
|
Freeing in error path is the common pattern with set_put().
|
|
It's polite to print the name of the link that wasn't created,
and it makes little sense to print the target.
|
|
xattrs on cgroup fs were added back in v3.6-rc3-3-g03b1cde. But we
support kernels >= 2.6.39, and we should also support kernels compiled
w/o xattr support, even if systemd is compiled with xattr support.
Fall back to mounting without xattr support.
Tested-by: Colin Walters <walters@verbum.org>
|
|
All attributes are stored as text, since root_directory is already
text, and it seems easier to have all of them in text format.
Attributes are written in the trusted. namespace, because the kernel
currently does not allow user. attributes on cgroups. This is a PITA,
and CAP_SYS_ADMIN is required to *read* the attributes. Alas.
A second pipe is opened for the child to signal the parent that the
cgroup hierarchy has been set up.
|
|
Instead of outputting "5h 55s 50ms 3us" we'll now output "5h
55.050003s". Also, while outputting the accuracy is configurable.
Basically we now try use "dot notation" for all time values > 1min. For
>= 1s we use 's' as unit, otherwise for >= 1ms we use 'ms' as unit, and
finally 'us'.
This should give reasonably values in most cases.
|
|
|
|
All Execs within the service, will get mounted the same
/tmp and /var/tmp directories, if service is configured with
PrivateTmp=yes. Temporary directories are cleaned up by service
itself in addition to systemd-tmpfiles. Directory which is mounted
as inaccessible is created at runtime in /run/systemd.
|
|
Previously we were testing whether /sys/fs/cgroup/systemd/ was a mount
point. This might be problematic however, when the cgroup trees are bind
mounted into a container from the host (which should be absolutely
valid), which might create the impression that the container was running
systemd, but only the host actually is.
Replace this by a check for the existance of the directory
/run/systemd/system/, which should work unconditionally, since /run can
never be a bind mount but *must* be a tmpfs on systemd systems, which is
flushed at boots. This means that data in /run always reflects
information about the current boot, and only of the local container,
which makes it the perfect choice for a check like this.
(As side effect this is nice to Ubuntu people who now use logind with
the systemd cgroup hierarchy, where the old sd_booted() check misdetects
systemd, even though they still run legacy Upstart.)
|
|
SMACK is the Simple Mandatory Access Control Kernel, a minimal
approach to Access Control implemented as a kernel LSM.
The kernel exposes the smackfs filesystem API through which access
rules can be loaded. At boot time, we want to load the access rules
as early as possible to ensure all early boot steps are checked by Smack.
This patch mounts smackfs at the new location at /sys/fs/smackfs for
kernels 3.8 and above. The /smack mountpoint is not supported.
After mounting smackfs, rules are loaded from the usual location.
For more information about Smack see:
http://www.kernel.org/doc/Documentation/security/Smack.txt
|
|
|
|
|
|
|