Age | Commit message (Collapse) | Author |
|
udev: don't close FDs before dropping them from epoll
|
|
Make sure we never close fds before we drop their related event-source.
This will cause horrible disruptions if the fd-num is re-used by someone
else. Under normal conditions, this should not cause any problems as the
close() will drop the fd from the epoll-set automatically. However, this
changes if you have any child processes with a copy of that fd.
This fixes issue #163.
Background:
If you create an epoll-set via epoll_create() (lets call it 'EFD')
you can add file-descriptors to it to watch for events. Whenever
you call EPOLL_CTL_ADD on a file-descriptor you want to watch, the
kernel looks up the attached "struct file" pointer, that this FD
refers to. This combination of the FD-number and the "struct file"
pointer is used as key to link it into the epoll-set (EFD).
This means, if you duplicate your file-descriptor, you can watch
this file-descriptor, too (because the duplicate will have a
different FD-number, hence, the combination of FD-number and
"struct file" is different as before).
If you want to stop watching an FD, you use EPOLL_CTL_DEL and pass
the FD to the kernel. The kernel again looks up your
file-descriptor in your FD-table to find the linked "struct file".
This FD-number and "struct file" combination is then dropped from
the epoll-set (EFD).
Last, but not least: If you close a file-descriptor that is linked
to an epoll-set, the kernel does *NOTHING* regarding the
epoll-set. This is a vital observation! Because this means, your
epoll_wait() calls will still return the metadata you used to
watch/subscribe your file-descriptor to events.
There is one exception to this rule: If the file-descriptor that
you just close()ed was the last FD that referred to the underlying
"struct file", then _all_ epoll-set watches/subscriptions are
destroyed. Hence, if you never dup()ed your FD, then a simple
close() will also unsubscribe it from any epoll-set.
With this in mind, lets look at fork():
Assume you have an epoll-set (EFD) and a bunch of FDs
subscribed to events on that EFD. If you now call fork(),
the new process gets a copy of your file-descriptor table.
This means, the whole table is copied and the "struct
file" reference of each FD is increased by 1. It is
important to notice that the FD-numbers in the child are
exactly the same as in the parent (eg., FD #5 in the child
refers to the same "struct file" as FD #5 in the parent).
This means, if the child calls EPOLL_CTL_DEL on an FD, the
kernel will look up the linked "struct file" and drop the
FD-number and "struct file" combination from the epoll-set
(EFD). However, this will effectively drop the
subscription that was installed by the parent.
To sum up: even though the child gets a duplicate of the
EFD and all FDs, the subscriptions in the EFD are *NOT*
duplicated!
Now, with this in mind, lets look at what udevd does:
Udevd has a bunch of file-descriptors that it watches in its
sd-event main-loop. Whenever a uevent is received, the event is
dispatched on its workers. If no suitable worker is present, a new
worker is fork()ed to handle the event. Inside of this worker, we
try to free all resources we inherited. However, the fork() call
is done from a call-stack that is never rewinded. Therefore, this
call stack might own references that it drops once it is left.
Those references we cannot deduce from the fork()'ed process;
effectively causing us to leak objects in the worker (eg., the
call to sd_event_dispatch() that dispatched our uevent owns a
reference to the sd_event object it used; and drops it again once
the function is left).
(Another example is udev_monitor_ref() for each 'worker' that is
also inherited by all children; thus keeping the udev-monitor and
the uevent-fd alive in all children (which is the real cause for
bug #163))
(The extreme variant is sd_event_source_unref(), which explicitly
keeps event-sources alive, if they're currently dispatched,
knowing that the dispatcher will free the event once done. But
if the dispatcher is in the parent, the child will never ever
free that object, thus leaking it)
This is usually not an issue. However, if such an object has a
file-descriptor embedded, this FD is left open and never closed in
the child.
In manager_exit(), if we now destroy an object (i.e., close its embedded
file-descriptor) before we destroy its related sd_event_source, then
sd-event will not be able to drop the FD from the epoll-set (EFD). This
is, because the FD is no longer valid at the time we call EPOLL_CTL_DEL.
Hence, the kernel cannot figure out the linked "struct file" and thus
cannot remove the FD-number plus "struct file" combination; effectively
leaving the subscription in the epoll-set.
Since we leak the uevent-fd in the children, they retain a copy of the FD
pointing to the same "struct file". Thus, the EFD-subscription are not
automatically removed by close() (as described above). Therefore, the main
daemon will still get its metadata back on epoll_watch() whenever an event
occurs (even though it already freed the metadata). This then causes the
free-after-use bug described in #163.
This patch fixes the order in which we destruct objects and related
sd-event-sources. Some open questions remain:
* Why does source_io_unregister() not warn on EPOLL_CTL_DEL failures?
This really needs to be turned into an assert_return().
* udevd really should not leak file-descriptors into its children. Fixing
this would *not* have prevented this bug, though (since the child-setup
is still async).
It's non-trivial to fix this, though. The stack-context of the caller
cannot be rewinded, so we cannot figure out temporary refs. Maybe it's
time to exec() the udev-workers?
* Why does the kernel not copy FD-subscriptions across fork()?
Or at least drop subscriptions if you close() your FD (it uses the
FD-number as key, so it better subscribe to it)?
Or it better used
FD+"struct file_table*"+"struct file*"
as key to not allow the childen to share the subscription table..
*sigh*
Seems like we have to live with that API forever.
|
|
nspawn: check if kernel supports userns as early as possible
|
|
If the kernel do not support user namespace then one of the children
created by nspawn parent will fail at clone(CLONE_NEWUSER) with the
generic error EINVAL and without logging the error. At the same time
the parent may also try to setup the user namespace and will fail with
another error.
To improve this, check if the kernel supports user namespace as early
as possible.
|
|
tmpfiles: silently ignore failed removal of btrfs submount from non-dir
|
|
This fixes:
Jun 16 16:00:20 tomegun-x2402 systemd-tmpfiles[233]: rm_rf(/var/lib/machines/.#fedora.lck): Not a directory
Jun 16 16:00:20 tomegun-x2402 systemd-tmpfiles[233]: rm_rf(/var/lib/machines/.#Fedora-Cloud-Base-20141203-21.x86_64.raw.lck): Not a directory
|
|
hashmap: fix iterators to not skip entries
|
|
signal-util: fix incorrect argument of va_start
|
|
sd-bus: use proper cleanup macro
|
|
The last argument of the function before the vargs is "old" not "how".
warning: second parameter of ‘va_start’ not last named argument
|
|
|
|
everywhere: actually make use of DUAL_TIMESTAMP_NULL macro
|
|
logind: expose "Docked" bool as property on the bus
|
|
|
|
We know the state anyway, let's expose it in the bus. It's useful for
debugging at least, but it might be useful for DEs too.
|
|
Let's use it as initializer where appropriate.
|
|
|
|
everywhere: port everything to sigprocmask_many() and friends
|
|
automatically remove old machine shapshots at boot
|
|
This ports a lot of manual code over to sigprocmask_many() and friends.
Also, we now consistly check for sigprocmask() failures with
assert_se(), since the call cannot realistically fail unless there's a
programming error.
Also encloses a few sd_event_add_signal() calls with (void) when we
ignore the return values for it knowingly.
|
|
buildsys: missing SECCOMP_CFLAGS in various places
|
|
Remove old temporary snapshots, but only at boot. Ideally we'd have
"self-destroying" btrfs snapshots that go away if the last last
reference to it does. To mimic a scheme like this at least remove the
old snapshots on fresh boots, where we know they cannot be referenced
anymore. Note that we actually remove all temporary files in
/var/lib/machines/ at boot, which should be safe since the directory has
defined semantics. In the root directory (where systemd-nspawn
--ephemeral places snapshots) we are more strict, to avoid removing
unrelated temporary files.
This also splits out nspawn/container related tmpfiles bits into a new
tmpfiles snippet to systemd-nspawn.conf
|
|
|
|
This adds a "char *extra" parameter to tempfn_xxxxxx(), tempfn_random(),
tempfn_ranomd_child(). If non-NULL this string is included in the middle
of the newly created file name. This is useful for being able to
distuingish the kind of temporary file when we see one.
This also adds tests for the three call.
For now, we don't make use of this at all, but port all users over.
|
|
libcore, systemd and nspawn fail to build when seccomp headers
are not in the include path.
|
|
btrfs-util: when snapshotting make sure we don't descent into subvolu…
|
|
just created
We already had a safety check in place that we don't end up descending
to the original subvolume again, but we also should avoid descending in
the newly created one.
This is particularly important if we make a snapshot below its source,
like we do in "systemd-nspawn --ephemeral -D /".
Closes https://bugs.freedesktop.org/show_bug.cgi?id=90803
|
|
Italian .po updates
|
|
l10n: Add Belarusian translation
|
|
firewall: rename fw-util.[ch] → firewall-util.[ch]
|
|
The names fw-util.[ch] are too ambiguous, better rename the files to
firewall-util.[ch]. Also rename the test accordingly.
|
|
login: fix potential null pointer dereference
|
|
As requested in #199.
|
|
As requested in #199.
|
|
Specifically: /etc/fstab overrides the units itself, but not the deps.
See #168.
|
|
nspawn: make seccomp loading errors non-fatal
|
|
|
|
seccomp_load returns -EINVAL when seccomp support is not enabled in the
kernel [1]. This should be a debug log, not an error that interrupts nspawn.
If the seccomp filter can't be set and audit is enabled, the user will
get an error message anyway.
[1]: http://man7.org/linux/man-pages/man2/prctl.2.html
|
|
Fix CID 1304686: Dereference after null check (FORWARD_NULL)
However, this commit does not fix any bug in logind. It helps to keep
the elect_display_compare() function generic.
|
|
Set $SYSTEMD_LOG_TARGET so that the output always goes to stdout/stderr. This
fixes running the test as root, as that logged to the journal previously.
https://github.com/systemd/systemd/issues/195
|
|
|
|
|
|
|
|
build-sys: Drop include_prefix
|
|
Appears to be unused and a leftover from the udev merge.
|
|
udevd: event - don't log about failures of spawn processes when this …
|
|
build-sys: include libsystemd-journal and libudev in libshared
|
|
|
|
tree-wide: fix memory leaks in users of bus_map_all_properties()
|
|
Fix typos in man/timesyncd.conf.xml
|