Age | Commit message (Collapse) | Author |
|
Add support for relative TasksMax= specifications, and bump default for services
|
|
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
|
|
With this NSS module all dynamic service users will be resolvable via NSS like
any real user.
|
|
|
|
service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
systemd-run -p DynamicUser=1 /bin/sleep 99999
|
|
Let's verify the validity of the syntax of the user/group names set.
|
|
Just in case...
|
|
To remove the hard dependency on systemd, for packages, which function
without a running systemd the %systemd_ordering macro can be used to
ensure ordering in the rpm transaction. %systemd_ordering makes sure,
the systemd rpm is installed prior to the package, so the %pre/%post
scripts can execute the systemd parts.
Installing systemd afterwards though, does not result in the same outcome.
|
|
As it turns out 512 is max number of tasks per service is hit by too many
applications, hence let's bump it a bit, and make it relative to the system's
maximum number of PIDs. With this change the new default is 15%. At the
kernel's default pids_max value of 32768 this translates to 4915. At machined's
default TasksMax= setting of 16384 this translates to 2457.
Why 15%? Because it sounds like a round number and is close enough to 4096
which I was going for, i.e. an eight-fold increase over the old 512
Summary:
| on the host | in a container
old default | 512 | 512
new default | 4915 | 2457
|
|
|
|
That way, we can neatly keep this in line with the new TasksMaxScale= option.
Note that we didn't release a version with MemoryLimitByPhysicalMemory= yet,
hence this change should be unproblematic without breaking API.
|
|
This adds support for a TasksMax=40% syntax for specifying values relative to
the system's configured maximum number of processes. This is useful in order to
neatly subdivide the available room for tasks within containers.
|
|
This allows us to delete quite a bit of code and make the whole thing a lot
shorter.
|
|
|
|
|
|
We currently have code to read and write files containing UUIDs at various
places. Unify this in id128-util.[ch], and move some other stuff there too.
The new files are located in src/libsystemd/sd-id128/ (instead of src/shared/),
because they are actually the backend of sd_id128_get_machine() and
sd_id128_get_boot().
In follow-up patches we can use this reduce the code in nspawn and
machine-id-setup by adopted the common implementation.
|
|
log about all processes we forcibly kill
|
|
Assorted fixes
|
|
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
|
|
|
|
Fix bug introduced by #3263: mount(2) return value is 0 or -1, not errno.
Thanks to Evgeny Vereshchagin (@evverx) for reporting.
|
|
|
|
|
|
We don't actually need any functionality from cgroup.h in execute.h, hence
don't include that. However, we do need the Unit structure from unit.h, hence
include that, and move it as late as possible, since it needs the definitions
from execute.h.
|
|
All other functions in execute.c that need the unit id take a Unit* parameter
as first argument. Let's change connect_logger_as() to follow a similar logic.
|
|
After all, if a unit is abandoned, all processes inside of it may be considered
"left over" and are something we should better log about.
|
|
This was accidentally left commented out for debugging purposes, let's fix that
and make the signal directed again.
|
|
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.
This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.
Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
|
|
We generally try to avoid strerror(), due to its threads-unsafety, let's do
this here, too.
Also, let's be tiny bit more explanatory with the log messages, and let's
shorten a few things.
|
|
We usually hide legacy bus properties from introspection. Let's do that for the
InaccessibleDirectories= properties too.
The properties stay accessible if requested, but they won't be listed anymore
if people introspect the unit.
|
|
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.
Renamed variables:
`read_write_dirs` --> `read_write_paths`
`read_only_dirs` --> `read_only_paths`
`inaccessible_dirs` --> `inaccessible_paths`
|
|
Despite the name, `Read{Write,Only}Directories=` already allows for
regular file paths to be masked. This commit adds the same behavior
to `InaccessibleDirectories=` and makes it explicit in the doc.
This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}`
{dile,device}nodes and mounts on the appropriate one the paths specified
in `InacessibleDirectories=`.
Based on Luca's patch from https://github.com/systemd/systemd/pull/3327
|
|
(#3738)
During stop when service has one "regular" pid one main pid and one
control pid and the sighld for the regular one is processed first the
unit_tidy_watch_pids will skip the main and control pid and does not
remove them from u->pids(). But then we skip the sigchld event because we
already did one in the iteration and there are two pids in u->pids.
v2: Use general unit_main_pid() and unit_control_pid() instead of
reaching directly to service structure.
|
|
https://github.com/SELinuxProject/selinux/commit/9eb9c9327563014ad6a807814e7975424642d5b9
deprecated selinux_context_t. Replace with a simple char* everywhere.
Alternative fix for #3719.
|
|
... as requested in
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/DJ7HDNRM5JGBSA4HL3UWW5ZGLQDJ6Y7M/.
Adding the macro makes it marginally easier to create generators
for outside projects.
I opted for "generatordir" and "usergeneratordir" to match
%unitdir and %userunitdir. OTOH, "_systemd" prefix makes it obvious
that this is related to systemd. "%_generatordir" would be to generic
of a name.
|
|
This way, slow IO journald has to wait for can't cause it to reach the killing
spree timeout and is hit by SIGKILL in addition to SIGTERM.
|
|
There's really no reason to use 10s here, let's instead default to 90s like we
do for everything else.
The SIGKILL during the final killing spree is in most regards the fourth level
of a safety net, after all: any normal service should have already been stopped
during the normal service shutdown logic, first via SIGTERM and then SIGKILL,
and then also via SIGTERM during the finall killing spree before we send
SIGKILL. And as a fourth level safety net it should only be required in
exceptional cases, which means it's safe to rais the default timeout, as normal
shutdowns should never be delayed by it.
Note that journald excludes itself from the normal service shutdown, and relies
on the final killing spree to terminate it (this is because it wants to cover
the normal shutdown phase's complete logging). If the system's IO is
excessively slow, then the 10s might not be enough for journald to sync
everything to disk and logs might get lost during shutdown.
|
|
|
|
seccomp_syscall_resolve_name() can return a mix of positive and negative
(pseudo-) syscall numbers, while errors are signaled via __NR_SCMP_ERROR.
This commit lets the syscall filter parser only abort on real parsing
failures, letting libseccomp handle pseudo-syscall number on its own
and allowing proper multiplexed syscalls filtering.
|
|
|
|
Follow up on #3503 (pass service env vars to PAM sessions)
|
|
Prevent free from being called on (a part of) the call-by-reference
variable env when setup_pam fails.
|
|
The unit load queue can be processed in the middle of setting the
unit's properties, so its load_state would no longer be UNIT_STUB
for the check in bus_unit_set_properties(), which would cause it to
incorrectly return an error.
|
|
Commit da4d897e ("core: add cgroup memory controller support on the unified
hierarchy (#3315)") changed the code in src/core/cgroup.c to always write
the real numeric value from the cgroup parameters to the
"memory.limit_in_bytes" attribute file.
For parameters set to CGROUP_LIMIT_MAX, this results in the string
"18446744073709551615" being written into that file, which is UINT64_MAX.
Before that commit, CGROUP_LIMIT_MAX was special-cased to the string "-1".
This causes a regression on CentOS 7, which is based on kernel 3.10, as the
value is interpreted as *signed* 64 bit, and clamped to 0:
[root@n54 ~]# echo 18446744073709551615 >/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
[root@n54 ~]# cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
0
[root@n54 ~]# echo -1 >/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
[root@n54 ~]# cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
9223372036854775807
Hence, all units that are subject to the limits enforced by the memory
controller will crash immediately, even though they have no actual limit
set. This happens to for the user.slice, for instance:
[ 453.577153] Hardware name: SeaMicro SM15000-64-CC-AA-1Ox1/AMD Server CRB, BIOS Estoc.3.72.19.0018 08/19/2014
[ 453.587024] ffff880810c56780 00000000aae9501f ffff880813d7fcd0 ffffffff816360fc
[ 453.594544] ffff880813d7fd60 ffffffff8163109c ffff88080ffc5000 ffff880813d7fd28
[ 453.602120] ffffffff00000202 fffeefff00000000 0000000000000001 ffff880810c56c03
[ 453.609680] Call Trace:
[ 453.612156] [<ffffffff816360fc>] dump_stack+0x19/0x1b
[ 453.617324] [<ffffffff8163109c>] dump_header+0x8e/0x214
[ 453.622671] [<ffffffff8116d20e>] oom_kill_process+0x24e/0x3b0
[ 453.628559] [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
[ 453.634969] [<ffffffff811d4155>] mem_cgroup_oom_synchronize+0x575/0x5a0
[ 453.641721] [<ffffffff811d3520>] ? mem_cgroup_charge_common+0xc0/0xc0
[ 453.648299] [<ffffffff8116da84>] pagefault_out_of_memory+0x14/0x90
[ 453.654621] [<ffffffff8162f4cc>] mm_fault_error+0x68/0x12b
[ 453.660233] [<ffffffff81642012>] __do_page_fault+0x3e2/0x450
[ 453.666017] [<ffffffff816420a3>] do_page_fault+0x23/0x80
[ 453.671467] [<ffffffff8163e308>] page_fault+0x28/0x30
[ 453.676656] Task in /user.slice/user-0.slice/user@0.service killed as a result of limit of /user.slice/user-0.slice/user@0.service
[ 453.688477] memory: usage 0kB, limit 0kB, failcnt 7
[ 453.693391] memory+swap: usage 0kB, limit 9007199254740991kB, failcnt 0
[ 453.700039] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
[ 453.706076] Memory cgroup stats for /user.slice/user-0.slice/user@0.service: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 453.725702] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 453.733614] [ 2837] 0 2837 11950 899 23 0 0 (systemd)
[ 453.741919] Memory cgroup out of memory: Kill process 2837 ((systemd)) score 1 or sacrifice child
[ 453.750831] Killed process 2837 ((systemd)) total-vm:47800kB, anon-rss:3188kB, file-rss:408kB
Fix this issue by special-casing the UINT64_MAX case again.
|
|
By cleaning up before setting up PAM we maintain control of overriding
behavior in setting variables. Otherwise, pam_putenv is in control.
This also makes sure we use a cleaned up environment in replacing
variables in argv.
|
|
A 'llu' formatting statement was used in a debugging printf statement
instead of a 'PRIu64'. Correcting that mistake here.
|
|
manager: Only invoke a single sigchld per unit within a cleanup cycle
|
|
make "machinectl clean" asynchronous, and open it up via PolicyKit
|
|
By default, each iteration of manager_dispatch_sigchld() results in a unit level
sigchld event being invoked. For scope units, this results in a scope_sigchld_event()
which can seemingly stall for workloads that have a large number of PIDs within the
scope. The stall exhibits itself as a SIG_0 being initiated for each u->pids entry
as a result of pid_is_unwaited().
v2:
This patch resolves this condition by only paying to cost of a sigchld in the underlying
scope unit once per sigchld iteration. A new "sigchldgen" member resides within the
Unit struct. The Manager is incremented via the sd event loop, accessed via
sd_event_get_iteration, and the Unit member is set to the same value as the manager each
time that a sigchld event is invoked. If the Manager iteration value and Unit member
match, the sigchld event is not invoked for that iteration.
|
|
Commit 3a18b60489504056f9b0b1a139439cbfa60a87e1 introduced a regression that
disabled the color mode for container.
This patch fixes this.
|