Age | Commit message (Collapse) | Author |
|
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches. The situation is explained in
the following documentation.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads. This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.
On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config
options are added with the usual compat translation. CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.
v2: - Error in man page corrected.
- CPU config application in cgroup_context_apply() refactored.
- CPU accounting now works on unified hierarchy.
|
|
|
|
|
|
When many services are running, it was difficult to see only the interesting ones.
This patch allows to show only the subtree of interest.
|
|
As suggested here:
https://bugs.freedesktop.org/show_bug.cgi?id=64737#c8
This adds a new call terminal_is_dumb() and makes use of this where
appropriate.
|
|
On the unified hierarchy, blkio controller is renamed to io and the interface
is changed significantly.
* blkio.weight and blkio.weight_device are consolidated into io.weight which
uses the standardized weight range [1, 10000] with 100 as the default value.
* blkio.throttle.{read|write}_{bps|iops}_device are consolidated into io.max.
Expansion of throttling features is being worked on to support
work-conserving absolute limits (io.low and io.high).
* All stats are consolidated into io.stats.
This patchset adds support for the new interface. As the interface has been
revamped and new features are expected to be added, it seems best to treat it
as a separate controller rather than trying to expand the blkio settings
although we might add automatic translation if only blkio settings are
specified.
* io.weight handling is mostly identical to blkio.weight[_device] handling
except that the weight range is different.
* Both read and write bandwidth settings are consolidated into
CGroupIODeviceLimit which describes all limits applicable to the device.
This makes it less painful to add new limits.
* "max" can be used to specify the maximum limit which is equivalent to no
config for max limits and treated as such. If a given CGroupIODeviceLimit
doesn't contain any non-default configs, the config struct is discarded once
the no limit config is applied to cgroup.
* lookup_blkio_device() is renamed to lookup_block_device().
Signed-off-by: Tejun Heo <htejun@fb.com>
|
|
Running cgtop on a system, which lacks expecting stat file, results in a
segfault. For example, a system with blkio tree but without cfq io scheduler,
lacks "blkio.io_service_bytes".
When the targeting cgroup's file does not exist, process() returns 0 and
also does not modify `*ret' value (which is `*ours'). As a result,
callers of refresh_one() can have bogus pointer, which result in SEGV.
This patch just properly initialize the variable to NULL.
|
|
let's export as little as we can
|
|
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
|
|
Also add a coccinelle receipt to help with such transitions.
|
|
GLIB has recently started to officially support the gcc cleanup
attribute in its public API, hence let's do the same for our APIs.
With this patch we'll define an xyz_unrefp() call for each public
xyz_unref() call, to make it easy to use inside a
__attribute__((cleanup())) expression. Then, all code is ported over to
make use of this.
The new calls are also documented in the man pages, with examples how to
use them (well, I only added docs where the _unref() call itself already
had docs, and the examples, only cover sd_bus_unrefp() and
sd_event_unrefp()).
This also renames sd_lldp_free() to sd_lldp_unref(), since that's how we
tend to call our destructors these days.
Note that this defines no public macro that wraps gcc's attribute and
makes it easier to use. While I think it's our duty in the library to
make our stuff easy to use, I figure it's not our duty to make gcc's own
features easy to use on its own. Most likely, client code which wants to
make use of this should define its own:
#define _cleanup_(function) __attribute__((cleanup(function)))
Or similar, to make the gcc feature easier to use.
Making this logic public has the benefit that we can remove three header
files whose only purpose was to define these functions internally.
See #2008.
|
|
|
|
|
|
There are more than enough to deserve their own .c file, hence move them
over.
|
|
In sd-bus, the sd_bus_open_xyz() family of calls allocates a new bus,
while sd_bus_default_xyz() family tries to reuse the thread's default
bus. bus_open_transport() sometimes internally uses the former,
sometimes the latter family, but suggests it only calls the former via
its name. Hence, let's avoid this confusion, and generically rename the
call to bus_connect_transport().
Similar for all related calls.
And while we are at it, also change cgls + cgtop to do direct systemd
connections where possible, since all they do is talk to systemd itself.
|
|
This also allows us to drop build.h from a ton of files, hence do so.
Since we touched the #includes of those files, let's order them properly
according to CODING_STYLE.
|
|
Let's always keep space for the full help text. (We used to do that, but
recently another line of help was added which broke this.)
|
|
Let's underline the header line of the table shown by cgtop, how it is
customary for tables. In order to do this, let's introduce new ANSI
underline macros, and clean up the existing ones as side effect.
|
|
|
|
core: add support for the "pids" cgroup controller
|
|
This adds support for the new "pids" cgroup controller of 4.3 kernels.
It allows accounting the number of tasks in a cgroup and enforcing
limits on it.
This adds two new setting TasksAccounting= and TasksMax= to each unit,
as well as a gloabl option DefaultTasksAccounting=.
This also updated "cgtop" to optionally make use of the new
kernel-provided accounting.
systemctl has been updated to show the number of tasks for each service
if it is available.
This patch also adds correct support for undoing memory limits for units
using a MemoryLimit=infinity syntax. We do the same for TasksMax= now
and hence keep things in sync here.
|
|
off_t is a really weird type as it is usually 64bit these days (at least
in sane programs), but could theoretically be 32bit. We don't support
off_t as 32bit builds though, but still constantly deal with safely
converting from off_t to other types and back for no point.
Hence, never use the type anymore. Always use uint64_t instead. This has
various benefits, including that we can expose these values directly as
D-Bus properties, and also that the values parse the same in all cases.
|
|
This patch set adds full support the new unified cgroup hierarchy logic
of modern kernels.
A new kernel command line option "systemd.unified_cgroup_hierarchy=1" is
added. If specified the unified hierarchy is mounted to /sys/fs/cgroup
instead of a tmpfs. No further hierarchies are mounted. The kernel
command line option defaults to off. We can turn it on by default as
soon as the kernel's APIs regarding this are stabilized (but even then
downstream distros might want to turn this off, as this will break any
tools that access cgroupfs directly).
It is possibly to choose for each boot individually whether the unified
or the legacy hierarchy is used. nspawn will by default provide the
legacy hierarchy to containers if the host is using it, and the unified
otherwise. However it is possible to run containers with the unified
hierarchy on a legacy host and vice versa, by setting the
$UNIFIED_CGROUP_HIERARCHY environment variable for nspawn to 1 or 0,
respectively.
The unified hierarchy provides reliable cgroup empty notifications for
the first time, via inotify. To make use of this we maintain one
manager-wide inotify fd, and each cgroup to it.
This patch also removes cg_delete() which is unused now.
On kernel 4.2 only the "memory" controller is compatible with the
unified hierarchy, hence that's the only controller systemd exposes when
booted in unified heirarchy mode.
This introduces a new enum for enumerating supported controllers, plus a
related enum for the mask bits mapping to it. The core is changed to
make use of this everywhere.
This moves PID 1 into a new "init.scope" implicit scope unit in the root
slice. This is necessary since on the unified hierarchy cgroups may
either contain subgroups or processes but not both. PID 1 hence has to
move out of the root cgroup (strictly speaking the root cgroup is the
only one where processes and subgroups are still allowed, but in order
to support containers nicey, we move PID 1 into the new scope in all
cases.) This new unit is also used on legacy hierarchy setups. It's
actually pretty useful on all systems, as it can then be used to filter
journal messages coming from PID 1, and so on.
The root slice ("-.slice") is now implicitly created and started (and
does not require a unit file on disk anymore), since
that's where "init.scope" is located and the slice needs to be started
before the scope can.
To check whether we are in unified or legacy hierarchy mode we use
statfs() on /sys/fs/cgroup. If the .f_type field reports tmpfs we are in
legacy mode, if it reports cgroupfs we are in unified mode.
This patch set carefuly makes sure that cgls and cgtop continue to work
as desired.
When invoking nspawn as a service it will implicitly create two
subcgroups in the cgroup it is using, one to move the nspawn process
into, the other to move the actual container processes into. This is
done because of the requirement that cgroups may either contain
processes or other subgroups.
|
|
|
|
Never report errors twice.
|
|
|
|
When showing the number of tasks in a cgroup, recursively count tasks in
child cgroups and include them in the number. This ensures that the
number of tasks is cummulative the same way as memory, cpu and IO
resources are.
Old behaviour can be restored by passing the new --recursive=no switch.
|
|
However, allow them to be counted in by specifying -k
|
|
This way the output is restricted to cgroups from a container when run
in one.
|
|
In preparation of the unified cgroup support, let's clean up cgtop:
a) rework time code to be based on "nsec_t" rather than "struct timespec"
b) Introduce long option --order= for selecting ordering
c) count number of processes only in the main hierarchy, don't bother
with the controller hierarchies. We don't allow orthogonal
hierarchies in systemd anymore, hence there's no point to check the
other hierarchies.
d) Deal with non-monotonic cpuacct values (see #749)
e) When sorting groups, don't do prefix compare when ordering by number
of tasks, since this is not accumulative for all children.
f) Actually make --cpu without parameter work
g) Don't output control characters when we get them as input.
Fixes #749.
|
|
|
|
since last tick
Emit "0" rather than "-" if no change in IO values are seen for a process since
last tick, so long as accounting has registered content at all.
|
|
- Explicitly flush stdout before sleep between iterations
- Only clear user keystrokes when output is to TTY
- Add a newline between output batches when output is not to TTY
|
|
|
|
|
|
|
|
systemd-cgtop --dept=1 -b -n 10 -d 0.1 | cat
Assertion 'new_length >= 3' failed at src/shared/util.c:3 \
595, function ellipsize_mem(). Aborting.
Aborted (core dumped)
(David: add comment)
|
|
It corrrectly handles both positive and negative errno values.
|
|
As a followup to 086891e5c1 "log: add an "error" parameter to all
low-level logging calls and intrdouce log_error_errno() as log calls
that take error numbers", use sed to convert the simple cases to use
the new macros:
find . -name '*.[ch]' | xargs sed -r -i -e \
's/log_(debug|info|notice|warning|error|emergency)\("(.*)%s"(.*), strerror\(-([a-zA-Z_]+)\)\);/log_\1_errno(-\4, "\2%m"\3);/'
Multi-line log_*() invocations are not covered.
And we also should add log_unit_*_errno().
|
|
That hashmap_move_one() currently cannot fail with -ENOMEM is an
implementation detail, which is not possible to guarantee in general.
Hashmap implementations based on anything else than chaining of
individual entries may have to allocate.
hashmap_move_one will not fail with -ENOMEM if a proper reservation has
been made beforehand. Use reservations in install.c.
In cgtop.c simply propagate the error instead of asserting.
|
|
It is redundant to store 'hash' and 'compare' function pointers in
struct Hashmap separately. The functions always comprise a pair.
Store a single pointer to struct hash_ops instead.
systemd keeps hundreds of hashmaps, so this saves a little bit of
memory.
|
|
getopt is usually good at printing out a nice error message when
commandline options are invalid. It distinguishes between an unknown
option and a known option with a missing arg. It is better to let it
do its job and not use opterr=0 unless we actually want to suppress
messages. So remove opterr=0 in the few places where it wasn't really
useful.
When an error in options is encountered, we should not print a lengthy
help() and overwhelm the user, when we know precisely what is wrong
with the commandline. In addition, since help() prints to stdout, it
should not be used except when requested with -h or --help.
Also, simplify things here and there.
|
|
If -flto is used then gcc will generate a lot more warnings than before,
among them a number of use-without-initialization warnings. Most of them
without are false positives, but let's make them go away, because it
doesn't really matter.
|
|
|
|
Among other things this makes sure we always expose a --version command
and show it in the help texts.
|
|
This extends 62678ded 'efi: never call qsort on potentially
NULL arrays' to all other places where qsort is used and it
is not obvious that the count is non-zero.
|
|
The online help shows the keys as uppercase but the code and manpage say
lower case. Make the online help follow reality.
|
|
|
|
|
|
Instead of outputting "5h 55s 50ms 3us" we'll now output "5h
55.050003s". Also, while outputting the accuracy is configurable.
Basically we now try use "dot notation" for all time values > 1min. For
>= 1s we use 's' as unit, otherwise for >= 1ms we use 'ms' as unit, and
finally 'us'.
This should give reasonably values in most cases.
|