Age | Commit message (Collapse) | Author |
|
This adds two new settings BindPaths= and BindReadOnlyPaths=. They allow
defining arbitrary bind mounts specific to particular services. This is
particularly useful for services with RootDirectory= set as this permits making
specific bits of the host directory available to chrooted services.
The two new settings follow the concepts nspawn already possess in --bind= and
--bind-ro=, as well as the .nspawn settings Bind= and BindReadOnly= (and these
latter options should probably be renamed to BindPaths= and BindReadOnlyPaths=
too).
Fixes: #3439
|
|
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().
RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.
This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
|
|
|
|
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.
This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
|
|
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches. The situation is explained in
the following documentation.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads. This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.
On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config
options are added with the usual compat translation. CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.
v2: - Error in man page corrected.
- CPU config application in cgroup_context_apply() refactored.
- CPU accounting now works on unified hierarchy.
|
|
Let's verify the validity of the syntax of the user/group names set.
|
|
On the unified hierarchy, blkio controller is renamed to io and the interface
is changed significantly.
* blkio.weight and blkio.weight_device are consolidated into io.weight which
uses the standardized weight range [1, 10000] with 100 as the default value.
* blkio.throttle.{read|write}_{bps|iops}_device are consolidated into io.max.
Expansion of throttling features is being worked on to support
work-conserving absolute limits (io.low and io.high).
* All stats are consolidated into io.stats.
This patchset adds support for the new interface. As the interface has been
revamped and new features are expected to be added, it seems best to treat it
as a separate controller rather than trying to expand the blkio settings
although we might add automatic translation if only blkio settings are
specified.
* io.weight handling is mostly identical to blkio.weight[_device] handling
except that the weight range is different.
* Both read and write bandwidth settings are consolidated into
CGroupIODeviceLimit which describes all limits applicable to the device.
This makes it less painful to add new limits.
* "max" can be used to specify the maximum limit which is equivalent to no
config for max limits and treated as such. If a given CGroupIODeviceLimit
doesn't contain any non-default configs, the config struct is discarded once
the no limit config is applied to cgroup.
* lookup_blkio_device() is renamed to lookup_block_device().
Signed-off-by: Tejun Heo <htejun@fb.com>
|
|
The setting is hardly useful (since its effect is generally reduced to zero due
to file system caps), and with the advent of ambient caps an actually useful
replacement exists, hence let's get rid of this.
I am pretty sure this was unused and our man page already recommended against
its use, hence this should be a safe thing to remove.
|
|
This feature will not be used anytime soon, so remove a bit of cruft.
The BusPolicy= config directive will stay around as compat noop.
|
|
Corrects an incompatibility introduced with 36c16a7cdd6c33d7980efc2cd6a2211941f302b4.
Fixes: #2537
|
|
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
|
|
This way we can reuse it for parsing rlimit settings in "systemctl set-property" and related commands.
|
|
Change the capability bounding set parser and logic so that the bounding
set is kept as a positive set internally. This means that the set
reflects those capabilities that we want to keep instead of drop.
|
|
Now we don't support the socket protocol like
sctp and udplite .
This patch add a new config param
SocketProtocol: udplite/sctp
With this now we can configure the protocol as
udplite = IPPROTO_UDPLITE
sctp = IPPROTO_SCTP
Tested with nspawn:
|
|
As discussed at systemd.conf 2015 and on also raised on the ML:
http://lists.freedesktop.org/archives/systemd-devel/2015-November/034880.html
This removes the two XyzOverridable= unit dependencies, that were
basically never used, and do not enhance user experience in any way.
Most folks looking for the functionality this provides probably opt for
the "ignore-dependencies" job mode, and that's probably a good idea.
Hence, let's simplify systemd's dependency engine and remove these two
dependency types (and their inverses).
The unit file parser and the dbus property parser will now redirect
the settings/properties to result in an equivalent non-overridable
dependency. In the case of the unit file parser we generate a warning,
to inform the user.
The dbus properties for this unit type stay available on the unit
objects, but they are now hidden from usual introspection and will
always return the empty list when queried.
This should provide enough compatibility for the few unit files that
actually ever made use of this.
|
|
This directive allows passing environment variables from the system
manager to spawned services. Variables in the system manager can be set
inside a container by passing `--set-env=...` options to systemd-spawn.
Tested with an on-disk test.service unit. Tested using multiple variable
names on a single line, with an empty setting to clear the current list
of variables, with non-existing variables.
Tested using `systemd-run -p PassEnvironment=VARNAME` to confirm it
works with transient units.
Confirmed that `systemctl show` will display the PassEnvironment
settings.
Checked that man pages are generated correctly.
No regressions in `make check`.
|
|
Let's make sure "LimitCPU=30min" can be parsed properly, following the
usual logic how we parse time values. Similar for LimitRTTIME=.
While we are at it, extend a bit on the man page section about resource
limits.
Fixes: #1772
|
|
Let's make things more user-friendly and support for example
LimitAS=16G
rather than force users to always use LimitAS=16106127360.
The change is relevant for options:
[Default]Limit{FSIZE,DATA,STACK,CORE,RSS,AS,MEMLOCK,MSGQUEUE}
The patch introduces config_parse_bytes_limit(), it's the same as
config_parse_limit() but uses parse_size() tu support the suffixes.
Addresses: https://github.com/systemd/systemd/issues/1772
|
|
This adds support for naming file descriptors passed using socket
activation. The names are passed in a new $LISTEN_FDNAMES= environment
variable, that matches the existign $LISTEN_FDS= one and contains a
colon-separated list of names.
This also adds support for naming fds submitted to the per-service fd
store using FDNAME= in the sd_notify() message.
This also adds a new FileDescriptorName= setting for socket unit files
to set the name for fds created by socket units.
This also adds a new call sd_listen_fds_with_names(), that is similar to
sd_listen_fds(), but also returns the names of the fds.
systemd-activate gained the new --fdname= switch to specify a name for
testing socket activation.
This is based on #1247 by Maciej Wereski.
Fixes #1247.
|
|
If set to ~ the working directory is set to the home directory of the
user configured in User=.
This change also exposes the existing switch for the working directory
that allowed making missing working directories non-fatal.
This also changes "machinectl shell" to make use of this to ensure that
the invoked shell is by default in the user's home directory.
Fixes #1268.
|
|
Add a new config directive called NetClass= to CGroup enabled units.
Allowed values are positive numbers for fix assignments and "auto" for
picking a free value automatically, for which we need to keep track of
dynamically assigned net class IDs of units. Introduce a hash table for
this, and also record the last ID that was given out, so the allocator
can start its search for the next 'hole' from there. This could
eventually be optimized with something like an irb.
The class IDs up to 65536 are considered reserved and won't be
assigned automatically by systemd. This barrier can be made a config
directive in the future.
Values set in unit files are stored in the CGroupContext of the
unit and considered read-only. The actually assigned number (which
may have been chosen dynamically) is stored in the unit itself and
is guaranteed to remain stable as long as the unit is active.
In the CGroup controller, set the configured CGroup net class to
net_cls.classid. Multiple unit may share the same net class ID,
and those which do are linked together.
|
|
This adds support for the new "pids" cgroup controller of 4.3 kernels.
It allows accounting the number of tasks in a cgroup and enforcing
limits on it.
This adds two new setting TasksAccounting= and TasksMax= to each unit,
as well as a gloabl option DefaultTasksAccounting=.
This also updated "cgtop" to optionally make use of the new
kernel-provided accounting.
systemctl has been updated to show the number of tasks for each service
if it is available.
This patch also adds correct support for undoing memory limits for units
using a MemoryLimit=infinity syntax. We do the same for TasksMax= now
and hence keep things in sync here.
|
|
.nspawn fiels are simple settings files that may accompany container
images and directories and contain settings otherwise passed on the
nspawn command line. This provides an efficient way to attach execution
data directly to containers.
|
|
When generating utmp/wtmp entries, optionally add both LOGIN_PROCESS and
INIT_PROCESS entries or even all three of LOGIN_PROCESS, INIT_PROCESS
and USER_PROCESS entries, instead of just a single INIT_PROCESS entry.
With this change systemd may be used to not only invoke a getty directly
in a SysV-compliant way but alternatively also a login(1) implementation
or even forego getty and login entirely, and invoke arbitrary shells in
a way that they appear in who(1) or w(1).
This is preparation for a later commit that adds a "machinectl shell"
operation to invoke a shell in a container, in a way that is compatible
with who(1) and w(1).
|
|
units
|
|
Option was being parsed but not used for anything.
|
|
|
|
In service file, if the file has some of special SMACK label in
ExecStart= and systemd has no permission for the special SMACK label
then permission error will occurred. To resolve this, systemd should
be able to set its SMACK label to something accessible of ExecStart=.
So introduce new SmackProcessLabel. If label is specified with
SmackProcessLabel= then the child systemd will set its label to
that. To successfully execute the ExecStart=, accessible label should
be specified with SmackProcessLabel=.
Additionally, by SMACK policy, if the file in ExecStart= has no
SMACK64EXEC then the executed process will have given label by
SmackProcessLabel=. But if the file has SMACK64EXEC then the
SMACK64EXEC label will be overridden.
[zj: reword man page]
|
|
Add a new directive called BusPolicy to define custom endpoint policies. If
one such directive is given, an endpoint object in the service's ExecContext is
created and the given policy is added to it.
|
|
It's not safe invoking NSS from PID 1, hence fork off worker processes
that upload the policy into the kernel for busnames.
|
|
also mounting /etc read-only
Also, rename ProtectedHome= to ProtectHome=, to simplify things a bit.
With this in place we now have two neat options ProtectSystem= and
ProtectHome= for protecting the OS itself (and optionally its
configuration), and for protecting the user's data.
|
|
With Symlinks= we can manage one or more symlinks to AF_UNIX or FIFO
nodes in the file system, with the same lifecycle as the socket itself.
This has two benefits: first, this allows us to remove /dev/log and
/dev/initctl from /dev, thus leaving only symlinks, device nodes and
directories in the /dev tree. More importantly however, this allows us
to move /dev/log out of /dev, while still making it accessible there, so
that PrivateDevices= can provide /dev/log too.
|
|
ReadOnlySystem= uses fs namespaces to mount /usr and /boot read-only for
a service.
ProtectedHome= uses fs namespaces to mount /home and /run/user
inaccessible or read-only for a service.
This patch also enables these settings for all our long-running services.
Together they should be good building block for a minimal service
sandbox, removing the ability for services to modify the operating
system or access the user's private data.
|
|
Introduce a (unsigned long) -1 as "unset" state for cpu shares/block io
weights, and keep the startup unit set around all the time.
|
|
Similar to CPUShares= and BlockIOWeight= respectively. However only
assign the specified weight during startup. Each control group
attribute is re-assigned as weight by CPUShares=weight and
BlockIOWeight=weight after startup. If not CPUShares= or
BlockIOWeight= be specified, then the attribute is re-assigned to each
default attribute value. (default cpu.shares=1024, blkio.weight=1000)
If only CPUShares=weight or BlockIOWeight=weight be specified, then
that implies StartupCPUShares=weight and StartupBlockIOWeight=weight.
|
|
|
|
|
|
It's used for the FailureAction property as well.
|
|
There are three directives to specify bus name polices in .busname
files:
* AllowUser [username] [access]
* AllowGroup [groupname] [access]
* AllowWorld [access]
Where [access] is one of
* 'see': The user/group/world is allowed to see a name on the bus
* 'talk': The user/group/world is allowed to talk to a name
* 'own': The user/group/world is allowed to own a name
There is no user added yet in this commit.
|
|
already explicitly set
|
|
load-fragment.c
The parse code actually checked for specific lvalue names, which is
really wrong for supposedly generic parsers...
|
|
Let's keep specific config parsers close to where they are needed. Only
the really generic ones should be defined in conf-parser.[ch].
|
|
As discussed on the ML these are useful to manage runtime directories
below /run for services.
|
|
This new unit settings allows restricting which address families are
available to processes. This is an effective way to minimize the attack
surface of services, by turning off entire network stacks for them.
This is based on seccomp, and does not work on x86-32, since seccomp
cannot filter socketcall() syscalls on that platform.
|
|
This permit to switch to a specific apparmor profile when starting a daemon. This
will result in a non operation if apparmor is disabled.
It also add a new build requirement on libapparmor for using this feature.
|
|
processes
|
|
|
|
architecture support for system calls
Also, turn system call filter bus properties into complex types instead
of concatenated strings.
|
|
- Allow configuration of an errno error to return from blacklisted
syscalls, instead of immediately terminating a process.
- Fix parsing logic when libseccomp support is turned off
- Only keep the actual syscall set in the ExecContext, and generate the
string version only on demand.
|
|
|