Age | Commit message (Collapse) | Author |
|
--user due to EACCES
After all, in the classic hierarchy that's pretty much the default case.
|
|
|
|
In order to match the other get_process_xyz() calls.
|
|
|
|
|
|
|
|
|
|
There are more than enough to deserve their own .c file, hence move them
over.
|
|
string-util.[ch]
There are more than enough calls doing string manipulations to deserve
its own files, hence do something about it.
This patch also sorts the #include blocks of all files that needed to be
updated, according to the sorting suggestions from CODING_STYLE. Since
pretty much every file needs our string manipulation functions this
effectively means that most files have sorted #include blocks now.
Also touches a few unrelated include files.
|
|
Add a new config directive called NetClass= to CGroup enabled units.
Allowed values are positive numbers for fix assignments and "auto" for
picking a free value automatically, for which we need to keep track of
dynamically assigned net class IDs of units. Introduce a hash table for
this, and also record the last ID that was given out, so the allocator
can start its search for the next 'hole' from there. This could
eventually be optimized with something like an irb.
The class IDs up to 65536 are considered reserved and won't be
assigned automatically by systemd. This barrier can be made a config
directive in the future.
Values set in unit files are stored in the CGroupContext of the
unit and considered read-only. The actually assigned number (which
may have been chosen dynamically) is stored in the unit itself and
is guaranteed to remain stable as long as the unit is active.
In the CGroup controller, set the configured CGroup net class to
net_cls.classid. Multiple unit may share the same net class ID,
and those which do are linked together.
|
|
Let's make sure that we follow the same codepaths when adjusting a
cgroup property via the dbus SetProperty() call, and when we execute the
StartupCPUShares= effect.
|
|
Let's stop using the "unsigned long" type for weights/shares, and let's
just use uint64_t for this, as that's what we expose on the bus.
Unify parsers, and always validate the range for these fields.
Correct the default blockio weight to 500, since that's what the kernel
actually uses.
When parsing the weight/shares settings from unit files accept the empty
string as a way to reset the weight/shares value. When getting it via
the bus, uniformly map (uint64_t) -1 to unset.
Open up StartupCPUShares= and StartupBlockIOWeight= to transient units.
|
|
This adds support for the new "pids" cgroup controller of 4.3 kernels.
It allows accounting the number of tasks in a cgroup and enforcing
limits on it.
This adds two new setting TasksAccounting= and TasksMax= to each unit,
as well as a gloabl option DefaultTasksAccounting=.
This also updated "cgtop" to optionally make use of the new
kernel-provided accounting.
systemctl has been updated to show the number of tasks for each service
if it is available.
This patch also adds correct support for undoing memory limits for units
using a MemoryLimit=infinity syntax. We do the same for TasksMax= now
and hence keep things in sync here.
|
|
the controller in the kernel
Follow-up to 5bf8002a3a6723ce50331c024122078552fb600a.
|
|
Delegation to unpriviliged processes is safe in the unified hierarchy,
hence allow it. This has the benefit of permitting "systemd --user"
instances to further partition their resources between user services.
|
|
Let's move the actual cgroup part of it into a new separate function
manager_get_unit_by_pid_cgroup(), and then make
manager_get_unit_by_pid() just a wrapper that also checks the two pid
hashmaps.
Then, let's make sure the various calls that want to deliver events to
the owners of a PID check both hashmaps and the cgroup and deliver the
event to *each* of them. OTOH make sure bus calls like GetUnitByPID()
continue to check the PID hashmaps first and the cgroup only as
fallback.
|
|
This adds a new PID_TO_PTR() macro, plus PTR_TO_PID() and makes use of
it wherever we maintain processes in a hash table. Previously we
sometimes used LONG_TO_PTR() and other times ULONG_TO_PTR() for that,
hence let's make this more explicit and clean up things.
|
|
|
|
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.
A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).
It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.
The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.
This patch also removes cg_delete() which is unused now.
On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.
This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.
This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.
The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.
To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.
This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.
When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
|
|
It's cheaper that going to cgroupfs, and also usually the better choice
since it's not racy and can map PIDs even if they were moved to a
different unit.
|
|
In all cases where the function (or cg_is_empty_recursive()) ignoring
the calling process is actually wrong, as a process keeps a cgroup busy
regardless if its the current one or another. Hence, let's simplify
things and drop the "ignore_self" parameter.
|
|
The legacy cgroup hierarchy does not support reliable empty
notifications in containers and if there are left-over subgroups in a
cgroup. This makes it hard to correctly wait for them running empty, and
thus we previously disabled this logic entirely.
With this change we explicitly check for the container case, and whether
the unit is a "delegation" unit (i.e. one where programs may create
their own subgroups). If we are neither in a container, nor operating on
a delegation unit cgroup empty notifications become reliable and thus we
start waiting for the empty notifications again.
This doesn't really fix the general problem around cgroup notifications
but reduces the effect around it.
(This also reorders #include lines by their focus, as suggsted in
CODING_STYLE. We have to add "virt.h", so let's do that at the right
place.)
Also see #317.
|
|
|
|
It's primarily just a property of the Manager object after all, and we
try to refer to PID 1 as "manager" instead of "systemd", hence let's to
stick to this here too.
|
|
|
|
This adds support for showing the accumulated consumed CPU time per-unit
in the "systemctl status" output. The property is also readable via the
bus.
|
|
https://github.com/docker/docker/issues/10280
|
|
|
|
|
|
|
|
systemd[1]: Failed to set memory.limit_in_bytes on : Invalid argument
|
|
mounted read-only
|
|
|
|
for leaf units
Otherwise a slice or delegation unit might move PIDs around ignoring the
fact that it is attached to a subcgroup.
|
|
it's not quite as destructive as it sounds nowadays
|
|
If a cgroup fails to be destroyed (most likely because there are still
processes running as part of a service after the main pid exits), don't
free and remove the cgroup unit from the manager. This fixes a
regression introduced by the cgroup rework in v205 where systemd would
forget about processes still running after the unit becomes inactive.
(This can happen when the main pid exits and KillMode=process or none).
|
|
Using the same scripts as in f647962d64e "treewide: yet more log_*_errno
+ return simplifications".
|
|
If the format string contains %m, clearly errno must have a meaningful
value, so we might as well use log_*_errno to have ERRNO= logged.
Using:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\((".*%m.*")/log_\1_errno(errno, \2/'
Plus some whitespace, linewrap, and indent adjustments.
|
|
|
|
It corrrectly handles both positive and negative errno values.
|
|
As a followup to 086891e5c1 "log: add an "error" parameter to all
low-level logging calls and intrdouce log_error_errno() as log calls
that take error numbers", use sed to convert the simple cases to use
the new macros:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\("(.*)%s"(.*), strerror\(-([a-zA-Z_]+)\)\);/log_\1_errno(-\4, "\2%m"\3);/'
Multi-line log_*() invocations are not covered.
And we also should add log_unit_*_errno().
|
|
subhierarchies
For priviliged units this resource control property ensures that the
processes have all controllers systemd manages enabled.
For unpriviliged services (those with User= set) this ensures that
access rights to the service cgroup is granted to the user in question,
to create further subgroups. Note that this only applies to the
name=systemd hierarchy though, as access to other controllers is not
safe for unpriviliged processes.
Delegate=yes should be set for container scopes where a systemd instance
inside the container shall manage the hierarchies below its own cgroup
and have access to all controllers.
Delegate=yes should also be set for user@.service, so that systemd
--user can run, controlling its own cgroup tree.
This commit changes machined, systemd-nspawn@.service and user@.service
to set this boolean, in order to ensure that container management will
just work, and the user systemd instance can run fine.
|
|
systemctl would print 'CPUQuotaPerSecUSec=(null)' for no limit. This
does not look right.
Since USEC_INFINITY is one of the valid values, format_timespan()
could return NULL, and we should wrap every use of it in strna() or
similar. But most callers didn't do that, and it seems more robust to
return a string ("infinity") that makes sense most of the time, even
if in some places the result will not be grammatically correct.
|
|
We'll stay in "initializing" until basic.target has reached, at which
point we will enter "starting".
This is preparation so that we can change the startip timeout to only
apply to the first phase of startup, not the full procedure.
|
|
|
|
|
|
|
|
Also add a bit of debugging output to help diagnose problems,
add missing units, and simplify cppflags.
Move test-engine to normal tests from manual tests, it should now
work without destroying the system.
|
|
Only accept cpu quota values in percentages, get rid of period
definition.
It's not clear whether the CFS period controllable per-cgroup even has a
future in the kernel, hence let's simplify all this, hardcode the period
to 100ms and only accept percentage based quota values.
|
|
This is the behaviour the kernel cgroup rework exposes for all
controllers, hence let's do this already now for all cases.
|