Age | Commit message (Collapse) | Author |
|
coredump had code to check if copy_bytes() hit the max_bytes limit,
and refuse further processing in that case.
But in 84ee0960443, the return convention for copy_bytes() was changed
from -EFBIG to 1 for the case when the limit is hit, so the condition
check in coredump couldn't ever trigger.
But it seems that *do* want to process such truncated cores [1].
So change the code to detect truncation properly, but instead of
returning an error, give a nice log entry.
[1] https://github.com/systemd/systemd/issues/3883#issuecomment-239106337
Should fix (or at least alleviate) #3883.
|
|
Another fix for #4161.
|
|
For the user, if the core file is missing or inaccessible, it is
more interesting that the fact that they forgot to pipe to a file.
So delay the failure from the check until after we have verified
that the file or the COREDUMP field are present.
Partially fixes #4161.
Also, error reporting on failure was duplicated. save_core() now
always prints an error message (because it knows the paths involved,
so can the most useful message), and the callers don't have to.
|
|
Propagate errors properly, so that if we hit oom or an error in the
journal, the whole command will fail. This is important when using
the output in scripts.
Support the output of multiple values for the same field with -F.
The journal supports that, and our official commands should too, as
far as it makes sense. -F can be used to print user-defined fields
(e.g. somebody could use a TAG field with multiple occurences), so
we should support that too. That seems better than silently printing
the last value found as was done before.
We would iterate trying to match the same field with all possible
field names. Once we find something, cut the loop short, since we
know that nothing else can match.
|
|
The column for "present" was easy to miss, especially if somebody had no
coredumps present at all, in which case the column of spaces of width one
wasn't visually distinguished from the neighbouring columns. Replace this
with an explicit text, one of: "missing", "journal", "present", "error".
$ coredumpctl
TIME PID UID GID SIG COREFILE EXE
Mon 2016-09-26 22:46:31 CEST 8623 0 0 11 missing /usr/bin/bash
Mon 2016-09-26 22:46:35 CEST 8639 1001 1001 11 missing /usr/bin/bash
Tue 2016-09-27 01:10:46 CEST 16110 1001 1001 11 journal /usr/bin/bash
Tue 2016-09-27 01:13:20 CEST 16290 1001 1001 11 journal /usr/bin/bash
Tue 2016-09-27 01:33:48 CEST 17867 1001 1001 11 present /usr/bin/bash
Tue 2016-09-27 01:37:55 CEST 18549 0 0 11 error /usr/bin/bash
Also, use access(…, R_OK), so that we can report a present but inaccessible
file different than a missing one.
|
|
In 'list', show present also for coredumps stored in the journal.
In 'status', replace "File" with "Storage" line that is always present.
Possible values:
Storage: none
Storage: journal
Storage: /path/to/file (inacessible)
Storage: /path/to/file
Previously the File field be only present if the file was accessible, so users
had to manually extract the file name precisely in the cases where it was
needed, i.e. when coredumpctl couldn't access the file. It's much more friendly
to always show something. This output is designed for human consumption, so
it's better to be a bit verbose.
The call to sd_j_set_data_threshold is moved, so that status is always printed
with the default of 64k, list uses 4k, and coredump retrieval is done with the
limit unset. This should make checking for the presence of the COREDUMP field
not too costly.
|
|
|
|
sd_journal_previous() returns 0 if it didn't do any move, so the
warning was stupidly always printed.
|
|
Added in 9fe13294a9 (by me :[```), and later obfuscated in d0c8806d4ab, if an
uncompressed external file or an internally stored coredump was supposed to be
written to a file descriptor, nothing would be written.
|
|
Back when external storage was initially added in 34c10968cb, this mode of
storage was added. This could have made some sense back when XZ compression was
used, and an uncompressed core on disk could be used as short-lived cache file
which does require costly decompression. But now fast LZ4 compression is used
(by default) both internally and externally, so we have duplicated storage,
using the same compression and same default maximum core size in both cases,
but with different expiration lifetimes. Even the uncompressed-external,
compressed-internal mode is not very useful: for small files, decompression
with LZ4 is fast enough not to matter, and for large files, decompression is
still relatively fast, but the disk-usage penalty is very big.
An additional problem with the two modes of storage is that it complicates
the code and makes it much harder to return a useful error message to the user
if we cannot find the core file, since if we cannot find the file we have to
check the internal storage first.
This patch drops "both" storage mode. Effectively this means that if somebody
configured coredump this way, they will get a warning about an unsupported
value for Storage, and the default of "external" will be used.
I'm pretty sure that this mode is very rarely used anyway.
|
|
When s->length is zero this function doesn't do anything, note that in a
comment.
|
|
core:sandbox: Add new ProtectKernelTunables=, ProtectControlGroups=, ProtectSystem=strict and fixes
|
|
Show and formatting fixes
|
|
Even if
```
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
```
is disabled
cat /proc/net/sockstat6
```
TCP6: inuse 2
UDP6: inuse 1
UDPLITE6: inuse 0
RAW6: inuse 0
FRAG6: inuse 0 memory 0
```
Looking for /proc/net/if_inet6 is the right choice.
|
|
propagation
Better safe.
|
|
|
|
This test sometimes fails in semaphore, but not when run interactively,
so it's hard to debug.
|
|
This appends the nvme name and namespace identifier attribute the the
PCI path for by-path links. Symlinks like the following are now present:
lrwxrwxrwx. 1 root root 13 Sep 16 12:12 pci-0000:01:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx. 1 root root 15 Sep 16 12:12 pci-0000:01:00.0-nvme-1-part1 -> ../../nvme0n1p1
Cc: Michal Sekletar <sekletar.m@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|
There was no certainty about how the path in service file should look
like for usb functionfs activation. Because of this it was treated
differently in different places, which made this feature unusable.
This patch fixes the path to be the *mount directory* of functionfs, not
ep0 file path and clarifies in the documentation that ListenUSBFunction should be
the location of functionfs mount point, not ep0 file itself.
|
|
It seems to me that the explicit positional argument should have higher
priority than "an option".
|
|
This patch fixes wrong calculation of burst_modulate(), which now calculates
the values smaller than really expected ones if available disk space is
strictly more than 1MB.
In particular, if available disk space is strictly more than 1MB and strictly
less than 16MB, the resulted value becomes smaller than its original one.
>>> (math.log2(1*1024**2)-16) / 4
1.0
>>> (math.log2(16*1024**2)-16) / 4
2.0
>>> (math.log2(256*1024**2)-16) / 4
3.0
→ This matches the comment in the function.
|
|
If ulimit is smaller than page_size(), function save_external_coredump()
returns -EBADSLT and this causes skipping whole core dumping part in
submit_coredump(). Initializing coredump_size to UINT64_MAX prevents
evaluating a condition with uninitialized varialbe which leads to
calling allocate_journal_field() with coredump_fd = -1 which causes
aborting.
Signed-off-by: Matej Habrnal <mhabrnal@redhat.com>
|
|
|
|
|
|
is set
Instead of having a local syscall list, use the @raw-io group which
contains the same set of syscalls to filter.
|
|
As with previous patch simplify ProtectHome and don't care about
duplicates, they will be sorted by most restrictive mode and cleaned.
|
|
ProtectSystem= with all its different modes and other options like
PrivateDevices= + ProtectKernelTunables= + ProtectHome= are orthogonal,
however currently it's a bit hard to parse that from the implementation
view. Simplify it by giving each mode its own table with all paths and
references to other Protect options.
With this change some entries are duplicated, but we do not care since
duplicate mounts are first sorted by the most restrictive mode then
cleaned.
|
|
Make ALSA entries, latency interface, mtrr, apm/acpi, suspend interface,
filesystems configuration and IRQ tuning readonly.
Most of these interfaces now days should be in /sys but they are still
available through /proc, so just protect them. This patch does not touch
/proc/net/...
|
|
Move out mount calculation on its own function. Actually the logic is
smart enough to later drop nop and duplicates mounts, this change
improves code readability.
---
src/core/namespace.c | 47 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 36 insertions(+), 11 deletions(-)
|
|
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
|
|
|
|
While we are at it, move PAM code #ifdeffery into setup_pam() to simplify the
main execution logic a bit.
|
|
There's no point in mounting these, if they are outside of the root directory
we'll move to.
|
|
|
|
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
|
|
already is one
Let's not stack mounts needlessly.
|
|
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.
(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)
Fixes: #3867
|
|
Let's create the new namespace only after we validated and processed all
parameters, right before we start with actually mounting things.
This way, the window where we can roll back is larger (not that it matters
IRL...)
|
|
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
|
|
LXC does this, and we should probably too. Better safe than sorry.
|
|
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.
This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
|
|
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.
In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.
While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for
b52a109ad38cd37b660ccd5394ff5c171a5e5355 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
|
|
|
|
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.
This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.
With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.
This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.
This is not only the safer option, but also more in-line with what the man page
currently claims:
"Entries (files or directories) listed in ReadWritePaths= are
accessible from within the namespace with the same access rights as
from outside."
To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.
A number of functions are updated to add more debug logging to make this more
digestable.
|
|
If /foo is marked to be read-only, and /foo/bar too, then the latter may be
suppressed as it has no effect.
|
|
|
|
Implicitly make all dirs set with RuntimeDirectory= writable, as the concept
otherwise makes no sense.
|
|
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
|
|
|
|
If a dir is marked to be inaccessible then everything below it should be masked
by it.
|