Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
- persistent info
Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.
The current scheme allocated pidfs dentries on-demand repeatedly.
This scheme is reaching it's limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.
This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.
If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid
is closed the pidfs dentry is released and removed from pid->stashed.
So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the other creates a
new one then a new pidfs dentry is allocated every time.
Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each another a task may still find a valid
pidfs entry from the previous task in pid->stashed and reuse it. Or
it might find a dead dentry in there and fail to reuse it and so
stashes a new pidfs dentry. Multiple tasks may race to stash a new
pidfs dentry but only one will succeed, the other ones will put their
dentry.
The current scheme aims to ensure that a pidfs dentry for a struct
pid can only be created if the task is still alive or if a pidfs
dentry already existed before the task was reaped and so exit
information has been was stashed in the pidfs inode.
That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is
called we will return a pidfd for a reaped task without exit
information being available.
The pidfds_pid_valid() check does not guard against this race as it
doens't sync at all with pidfs_exit(). The pid_has_task() check might
be successful simply because we're before __unhash_process() but
after pidfs_exit().
Introduce a new scheme where the lifetime of information associated
with a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but the struct pid itself.
The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.
If all pidfs for the pidfs dentry are closed the dentry and inode can
be cleaned up but the struct pidfs_attr will stick until the struct
pid itself is freed. This will ensure minimal memory usage while
persisting relevant information.
The new scheme has various advantages. First, it allows to close the
race where we end up handing out a pidfd for a reaped task for which
no exit information is available. Second, it minimizes memory usage.
Third, it allows to remove complex lifetime tracking via dentries
when registering a struct pid with pidfs. There's no need to get or
put a reference. Instead, the lifetime of exit and coredump
information associated with a struct pid is bound to the lifetime of
struct pid itself.
- extended attributes
Now that we have a way to persist information for pidfs dentries we
can start supporting extended attributes on pidfds. This will allow
userspace to attach meta information to tasks.
One natural extension would be to introduce a custom pidfs.* extended
attribute space and allow for the inheritance of extended attributes
across fork() and exec().
The first simple scheme will allow privileged userspace to set
trusted extended attributes on pidfs inodes.
- Allow autonomous pidfs file handles
Various filesystems such as pidfs and drm support opening file
handles without having to require a file descriptor to identify the
filesystem. The filesystem are global single instances and can be
trivially identified solely on the information encoded in the file
handle.
This makes it possible to not have to keep or acquire a sentinal file
descriptor just to pass it to open_by_handle_at() to identify the
filesystem. That's especially useful when such sentinel file
descriptor cannot or should not be acquired.
For pidfs this means a file handle can function as full replacement
for storing a pid in a file. Instead a file handle can be stored and
reopened purely based on the file handle.
Such autonomous file handles can be opened with or without specifying
a a file descriptor. If no proper file descriptor is used the
FD_PIDFS_ROOT sentinel must be passed. This allows us to define
further special negative fd sentinels in the future.
Userspace can trivially test for support by trying to open the file
handle with an invalid file descriptor.
- Allow pidfds for reaped tasks with SCM_PIDFD messages
This is a logical continuation of the earlier work to create pidfds
for reaped tasks through the SO_PEERPIDFD socket option merged in
923ea4d4482b ("Merge patch series "net, pidfs: enable handing out
pidfds for reaped sk->sk_peer_pid"").
- Two minor fixes:
* Fold fs_struct->{lock,seq} into a seqlock
* Don't bother with path_{get,put}() in unix_open_file()
* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
don't bother with path_get()/path_put() in unix_open_file()
fold fs_struct->{lock,seq} into a seqlock
selftests: net: extend SCM_PIDFD test to cover stale pidfds
af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
af_unix: stash pidfs dentry when needed
af_unix/scm: fix whitespace errors
af_unix: introduce and use scm_replace_pid() helper
af_unix: introduce unix_skb_to_scm helper
af_unix: rework unix_maybe_add_creds() to allow sleep
selftests/pidfd: decode pidfd file handles withou having to specify an fd
fhandle, pidfs: support open_by_handle_at() purely based on file handle
uapi/fcntl: add FD_PIDFS_ROOT
uapi/fcntl: add FD_INVALID
fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
uapi/fcntl: mark range as reserved
fhandle: reflow get_path_anchor()
pidfs: add pidfs_root_path() helper
fhandle: rename to get_path_anchor()
fhandle: hoist copy_from_user() above get_path_from_fd()
fhandle: raise FILEID_IS_DIR in handle_type
...
|
|
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-24-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
we don't use that value at all so don't bother with it in the first
place.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-23-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
so they're easy to spot.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-22-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-21-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
which will allow us to simplify the exit path in further patches.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-20-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
instead of jumping to a pointless cleanup label.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-18-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
to prepare for a simpler exit path.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-17-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
to encapsulate that logic simplifying vfs_coredump() even further.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-16-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Don't split it into multiple functions. Just use a single one like we do
for coredump_file() and coredump_pipe() now.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-15-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There's no point in having this eyesore in the middle of vfs_coredump().
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-14-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
* Move that whole mess into a separate helper instead of having all that
hanging around in vfs_coredump() directly. Cleanup paths are already
centralized.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-13-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that we stash persistent information in struct pid there's no need
to play volatile games with pinning struct pid via dentries in pidfs.
Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-8-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The pipe coredump counter is a static local variable instead of a global
variable like all of the rest. Move it to a global variable so it's all
consistent.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-12-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The exit path is currently entangled with core pipe limit accounting
which is really unpleasant. Use a local variable in struct core_name
that remembers whether the count was incremented and if so to clean
decrement in once the coredump is done. Assert that this only happens
for pipes.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-11-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
* Move that whole mess into a separate helper instead of having all that
hanging around in vfs_coredump() directly.
* Stop using that need_suid_safe variable and add an inline helper that
clearly communicates what's going on everywhere consistently. The mm
flag snapshot is stable and can't change so nothing's gained with that
boolean.
* Only setup cprm->file once everything else succeeded, using RAII for
the coredump file before. That allows to don't care to what goto label
we jump in vfs_coredump().
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-10-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Align the naming with the rest of our helpers exposed
outside of core vfs.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-9-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
properly again. Someone might have modified the buffer concurrently.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-7-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There's no point in allowing to walk upwards for the coredump socket.
We already force userspace to give use a sane path, no symlinks, no
magiclinks, and also block "..". Use an absolute path without any
shenanigans.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-6-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
so we don't pointlessly accepts things that go over the limit.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-4-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make sure that we keep it extensible and well-formed.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-3-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There's no point in returning negative error values.
They will never be seen by anyone.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-2-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It's not really about the name anymore. It parses very distinct
information. Reflect that in the name.
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-1-315c0c34ba94@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We currently use multiple CONFIG_UNIX guards. This looks messy and makes
the code harder to follow and maintain. Use a helper function
coredump_sock_connect() that handles the connect portion. This allows us
to remove the CONFIG_UNIX guard in the main do_coredump() function.
Link: https://lore.kernel.org/20250605-schlamm-touren-720ba2b60a85@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Extend the coredump socket to allow the coredump server to tell the
kernel how to process individual coredumps.
When the crashing task connects to the coredump socket the kernel will
send a struct coredump_req to the coredump server. The kernel will set
the size member of struct coredump_req allowing the coredump server how
much data can be read.
The coredump server uses MSG_PEEK to peek the size of struct
coredump_req. If the kernel uses a newer struct coredump_req the
coredump server just reads the size it knows and discard any remaining
bytes in the buffer. If the kernel uses an older struct coredump_req
the coredump server just reads the size the kernel knows.
The returned struct coredump_req will inform the coredump server what
features the kernel supports. The coredump_req->mask member is set to
the currently know features.
The coredump server may only use features whose bits were raised by the
kernel in coredump_req->mask.
In response to a coredump_req from the kernel the coredump server sends
a struct coredump_ack to the kernel. The kernel informs the coredump
server what version of struct coredump_ack it supports by setting struct
coredump_req->size_ack to the size it knows about. The coredump server
may only send as many bytes as coredump_req->size_ack indicates (a
smaller size is fine of course). The coredump server must set
coredump_ack->size accordingly.
The coredump server sets the features it wants to use in struct
coredump_ack->mask. Only bits returned in struct coredump_req->mask may
be used.
In case an invalid struct coredump_ack is sent to the kernel a non-zero
u32 integer is sent indicating the reason for the failure. If it was
successful a zero u32 integer is sent.
In the initial version the following features are supported in
coredump_{req,ack}->mask:
* COREDUMP_KERNEL
The kernel will write the coredump data to the socket.
* COREDUMP_USERSPACE
The kernel will not write coredump data but will indicate to the
parent that a coredump has been generated. This is used when userspace
generates its own coredumps.
* COREDUMP_REJECT
The kernel will skip generating a coredump for this task.
* COREDUMP_WAIT
The kernel will prevent the task from exiting until the coredump
server has shutdown the socket connection.
The flexible coredump socket can be enabled by using the "@@" prefix
instead of the single "@" prefix for the regular coredump socket:
@@/run/systemd/coredump.socket
will enable flexible coredump handling. Current kernels already enforce
that "@" must be followed by "/" and will reject anything else. So
extending this is backward and forward compatible.
Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-1-05a5f0c18ecc@kernel.org
Acked-by: Lennart Poettering <lennart@poettering.net>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In contrast to other parameters written into
/proc/sys/kernel/core_pattern that never fail we can validate enabling
the new AF_UNIX support. This is obviously racy as hell but it's always
been that way.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-7-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow userspace to discover what coredump modes are supported.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-6-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
mask flag. This adds the @coredump_mask field to struct pidfd_info.
When a task coredumps the kernel will provide the following information
to userspace in @coredump_mask:
* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server.
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
treated as sensitive and the coredump server should restrict to the
generated coredump to sufficiently privileged users.
The kernel guarantees that by the time the connection is made the all
PIDFD_INFO_COREDUMP info is available.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-5-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors are
closed. That specifically includes 1 (stdout) and 2 (stderr). This has
already caused bugs because userspace assumed that this cannot happen
(Whether or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly.
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe.
- A new usermode helper has to be spawned for each crashing process.
This series adds a new mode:
(3) Dumping into an AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump.
- By setting core_pipe_limit non-zero userspace can guarantee that the
crashing task cannot be reaped behind it's back and thus process all
necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
detect whether /proc/<pid> still refers to the same process.
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX sockets directly.
- The pidfd for the crashing task will grow new information how the task
coredumps.
- The coredump server should mark itself as non-dumpable.
- A container coredump server in a separate network namespace can simply
bind to another well-know address and systemd-coredump fowards
coredumps to the container.
- Coredumps could in the future also be handled via per-user/session
coredump servers that run only with that users privileges.
The coredump server listens on the coredump socket and accepts a
new coredump connection. It then retrieves SO_PEERPIDFD for the
client, inspects uid/gid and hands the accepted client to the users
own coredump handler which runs with the users privileges only
(It must of coure pay close attention to not forward crashing suid
binaries.).
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers
that have and continue to cause significant CVEs.
This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can e.g., just keep a worker pool.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-4-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
They look rather messy right now.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-3-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-2-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-1-664f3caf2516@kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Give userspace a way to instruct the kernel to install a pidfd into the
usermode helper process. This makes coredump handling a lot more
reliable for userspace. In parallel with this commit we already have
systemd adding support for this in [1].
We create a pidfs file for the coredumping process when we process the
corename pattern. When the usermode helper process is forked we then
install the pidfs file as file descriptor three into the usermode
helpers file descriptor table so it's available to the exec'd program.
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is empty
and can thus always use three as the file descriptor number.
Note, that we'll install a pidfd for the thread-group leader even if a
subthread is calling do_coredump(). We know that task linkage hasn't
been removed due to delay_group_leader() and even if this @current isn't
the actual thread-group leader we know that the thread-group leader
cannot be reaped until @current has exited.
Link: https://github.com/systemd/systemd/pull/37125 [1]
Link: https://lore.kernel.org/20250414-work-coredump-v2-3-685bf231f828@kernel.org
Tested-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The replace_fd() helper returns the file descriptor number on success
and a negative error code on failure. The current error handling in
umh_pipe_setup() only works because the file descriptor that is replaced
is zero but that's pretty volatile. Explicitly check for a negative
error code.
Link: https://lore.kernel.org/20250414-work-coredump-v2-2-685bf231f828@kernel.org
Tested-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Move vm_table members out of kernel/sysctl.c
All vm_table array members have moved to their respective subsystems
leading to the removal of vm_table from kernel/sysctl.c. This
increases modularity by placing the ctl_tables closer to where they
are actually used and at the same time reducing the chances of merge
conflicts in kernel/sysctl.c.
- ctl_table range fixes
Replace the proc_handler function that checks variable ranges in
coredump_sysctls and vdso_table with the one that actually uses the
extra{1,2} pointers as min/max values. This tightens the range of the
values that users can pass into the kernel effectively preventing
{under,over}flows.
- Misc fixes
Correct grammar errors and typos in test messages. Update sysctl
files in MAINTAINERS. Constified and removed array size in
declaration for alignment_tbl
* tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (22 commits)
selftests/sysctl: fix wording of help messages
selftests: fix spelling/grammar errors in sysctl/sysctl.sh
MAINTAINERS: Update sysctl file list in MAINTAINERS
sysctl: Fix underflow value setting risk in vm_table
coredump: Fixes core_pipe_limit sysctl proc_handler
sysctl: remove unneeded include
sysctl: remove the vm_table
sh: vdso: move the sysctl to arch/sh/kernel/vsyscall/vsyscall.c
x86: vdso: move the sysctl to arch/x86/entry/vdso/vdso32-setup.c
fs: dcache: move the sysctl to fs/dcache.c
sunrpc: simplify rpcauth_cache_shrink_count()
fs: drop_caches: move sysctl to fs/drop_caches.c
fs: fs-writeback: move sysctl to fs/fs-writeback.c
mm: nommu: move sysctl to mm/nommu.c
security: min_addr: move sysctl to security/min_addr.c
mm: mmap: move sysctl to mm/mmap.c
mm: util: move sysctls to mm/util.c
mm: vmscan: move vmscan sysctls to mm/vmscan.c
mm: swap: move sysctl to mm/swap.c
mm: filemap: move sysctl to mm/filemap.c
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Add CONFIG_DEBUG_VFS infrastucture:
- Catch invalid modes in open
- Use the new debug macros in inode_set_cached_link()
- Use debug-only asserts around fd allocation and install
- Place f_ref to 3rd cache line in struct file to resolve false
sharing
Cleanups:
- Start using anon_inode_getfile_fmode() helper in various places
- Don't take f_lock during SEEK_CUR if exclusion is guaranteed by
f_pos_lock
- Add unlikely() to kcmp()
- Remove legacy ->remount_fs method from ecryptfs after port to the
new mount api
- Remove invalidate_inodes() in favour of evict_inodes()
- Simplify ep_busy_loopER by removing unused argument
- Avoid mmap sem relocks when coredumping with many missing pages
- Inline getname()
- Inline new_inode_pseudo() and de-staticize alloc_inode()
- Dodge an atomic in putname if ref == 1
- Consistently deref the files table with rcu_dereference_raw()
- Dedup handling of struct filename init and refcounts bumps
- Use wq_has_sleeper() in end_dir_add()
- Drop the lock trip around I_NEW wake up in evict()
- Load the ->i_sb pointer once in inode_sb_list_{add,del}
- Predict not reaching the limit in alloc_empty_file()
- Tidy up do_sys_openat2() with likely/unlikely
- Call inode_sb_list_add() outside of inode hash lock
- Sort out fd allocation vs dup2 race commentary
- Turn page_offset() into a wrapper around folio_pos()
- Remove locking in exportfs around ->get_parent() call
- try_lookup_one_len() does not need any locks in autofs
- Fix return type of several functions from long to int in open
- Fix return type of several functions from long to int in ioctls
Fixes:
- Fix watch queue accounting mismatch"
* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: sort out fd allocation vs dup2 race commentary, take 2
fs: call inode_sb_list_add() outside of inode hash lock
fs: tidy up do_sys_openat2() with likely/unlikely
fs: predict not reaching the limit in alloc_empty_file()
fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
fs: drop the lock trip around I_NEW wake up in evict()
fs: use wq_has_sleeper() in end_dir_add()
VFS/autofs: try_lookup_one_len() does not need any locks
fs: dedup handling of struct filename init and refcounts bumps
fs: consistently deref the files table with rcu_dereference_raw()
exportfs: remove locking around ->get_parent() call.
fs: use debug-only asserts around fd allocation and install
fs: dodge an atomic in putname if ref == 1
vfs: Remove invalidate_inodes()
ecryptfs: remove NULL remount_fs from super_operations
watch_queue: fix pipe accounting mismatch
fs: place f_ref to 3rd cache line in struct file to resolve false sharing
epoll: simplify ep_busy_loop by removing always 0 argument
fs: Turn page_offset() into a wrapper around folio_pos()
kcmp: improve performance adding an unlikely hint to task comparisons
...
|
|
The sorting of VMAs by size in commit 7d442a33bfe8 ("binfmt_elf: Dump
smaller VMAs first in ELF cores") breaks elfutils[1]. Instead, sort
based on the setting of the new sysctl, core_sort_vma, which defaults
to 0, no sorting.
Reported-by: Michael Stapelberg <michael@stapelberg.ch>
Closes: https://lore.kernel.org/all/20250218085407.61126-1-michael@stapelberg.de/ [1]
Fixes: 7d442a33bfe8 ("binfmt_elf: Dump smaller VMAs first in ELF cores")
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Dumping processes with large allocated and mostly not-faulted areas is
very slow.
Borrowing a test case from Tavian Barnes:
int main(void) {
char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
printf("%p %m\n", mem);
if (mem != MAP_FAILED) {
mem[0] = 1;
}
abort();
}
That's 1TB of almost completely not-populated area.
On my test box it takes 13-14 seconds to dump.
The profile shows:
- 99.89% 0.00% a.out
entry_SYSCALL_64_after_hwframe
do_syscall_64
syscall_exit_to_user_mode
arch_do_signal_or_restart
- get_signal
- 99.89% do_coredump
- 99.88% elf_core_dump
- dump_user_range
- 98.12% get_dump_page
- 64.19% __get_user_pages
- 40.92% gup_vma_lookup
- find_vma
- mt_find
4.21% __rcu_read_lock
1.33% __rcu_read_unlock
- 3.14% check_vma_flags
0.68% vma_is_secretmem
0.61% __cond_resched
0.60% vma_pgtable_walk_end
0.59% vma_pgtable_walk_begin
0.58% no_page_table
- 15.13% down_read_killable
0.69% __cond_resched
13.84% up_read
0.58% __cond_resched
Almost 29% of the time is spent relocking the mmap semaphore between
calls to get_dump_page() which find nothing.
Whacking that results in times of 10 seconds (down from 13-14).
While here make the thing killable.
The real problem is the page-sized iteration and the real fix would
patch it up instead. It is left as an exercise for the mm-familiar
reader.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250119103205.2172432-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
proc_dointvec converts a string to a vector of signed int, which is
stored in the unsigned int .data core_pipe_limit.
It was thus authorized to write a negative value to core_pipe_limit
sysctl which once stored in core_pipe_limit, leads to the signed int
dump_count check against core_pipe_limit never be true. The same can be
achieved with core_pipe_limit set to INT_MAX.
Any negative write or >= to INT_MAX in core_pipe_limit sysctl would
hypothetically allow a user to create very high load on the system by
running processes that produces a coredump in case the core_pattern
sysctl is configured to pipe core files to user space helper.
Memory or PID exhaustion should happen before but it anyway breaks the
core_pipe_limit semantic.
This commit fixes this by changing core_pipe_limit sysctl's proc_handler
to proc_dointvec_minmax and bound checking between SYSCTL_ZERO and
SYSCTL_INT_MAX.
Fixes: a293980c2e26 ("exec: let do_coredump() limit the number of concurrent dumps to pipes")
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
Created this by running an spatch followed by a sed command:
Spatch:
virtual patch
@
depends on !(file in "net")
disable optional_qualifier
@
identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@
+ const
struct ctl_table table_name [] = { ... };
sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
The loop between elf_core_dump() and dump_user_range() can run for
so long that the system shows softlockup messages, with side effects
like workqueues and RCU getting stuck on the core dumping CPU.
Add a cond_resched() in dump_user_range() to avoid that softlockup.
Signed-off-by: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20241010113651.50cb0366@imladris.surriel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This reverts commit fb97d2eb542faf19a8725afbd75cbc2518903210.
The logging was questionable to begin with, but it seems to actively
deadlock on the task lock.
"On second thought, let's not log core dump failures. 'Tis a silly place"
because if you can't tell your core dump is truncated, maybe you should
just fix your debugger instead of adding bugs to the kernel.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Link: https://lore.kernel.org/all/d122ece6-3606-49de-ae4d-8da88846bef2@oracle.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Large cores may be truncated in some scenarios, such as with daemons
with stop timeouts that are not large enough or lack of disk space. This
impacts debuggability with large core dumps since critical information
necessary to form a usable backtrace, such as stacks and shared library
information, are omitted.
We attempted to figure out which VMAs are needed to create a useful
backtrace, and it turned out to be a non-trivial problem. Instead, we
try simply sorting the VMAs by size, which has the intended effect.
By sorting VMAs by dump size and dumping in that order, we have a
simple, yet effective heuristic.
Signed-off-by: Brian Mak <makb@juniper.net>
Link: https://lore.kernel.org/r/036CD6AE-C560-4FC7-9B02-ADD08E380DC9@juniper.net
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Missing, failed, or corrupted core dumps might impede crash
investigations. To improve reliability of that process and consequently
the programs themselves, one needs to trace the path from producing
a core dumpfile to analyzing it. That path starts from the core dump file
written to the disk by the kernel or to the standard input of a user
mode helper program to which the kernel streams the coredump contents.
There are cases where the kernel will interrupt writing the core out or
produce a truncated/not-well-formed core dump without leaving a note.
Add logging for the core dump collection failure paths to be able to reason
what has gone wrong when the core dump is malformed or missing.
Report the size of the data written to aid in diagnosing the user mode
helper.
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240718182743.1959160-3-romank@linux.microsoft.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
The coredump code does not log the process ID and the comm
consistently, logs unescaped comm when it does log it, and
does not always use the ratelimited logging. That makes it
harder to analyze logs and puts the system at the risk of
spamming the system log incase something crashes many times
over and over again.
Fix that by logging TGID and comm (escaped) consistently and
using the ratelimited logging always.
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Tested-by: Allen Pais <apais@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240718182743.1959160-2-romank@linux.microsoft.com
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
|
|
After commit 0258b5fd7c71 ("coredump: Limit coredumps to a single thread
group") zap_process() doesn't need the "task_struct *start" arg,
zap_threads() can pass "signal_struct *signal" instead.
This simplifies the code and allows to use __for_each_thread() which
is slightly more efficient.
Link: https://lkml.kernel.org/r/20240625140311.GA20787@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull virtio updates from Michael Tsirkin:
"Several new features here:
- virtio-net is finally supported in vduse
- virtio (balloon and mem) interaction with suspend is improved
- vhost-scsi now handles signals better/faster
And fixes, cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits)
virtio-pci: Check if is_avq is NULL
virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
MAINTAINERS: add Eugenio Pérez as reviewer
vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
vp_vdpa: don't allocate unused msix vectors
sound: virtio: drop owner assignment
fuse: virtio: drop owner assignment
scsi: virtio: drop owner assignment
rpmsg: virtio: drop owner assignment
nvdimm: virtio_pmem: drop owner assignment
wifi: mac80211_hwsim: drop owner assignment
vsock/virtio: drop owner assignment
net: 9p: virtio: drop owner assignment
net: virtio: drop owner assignment
net: caif: virtio: drop owner assignment
misc: nsm: drop owner assignment
iommu: virtio: drop owner assignment
drm/virtio: drop owner assignment
gpio: virtio: drop owner assignment
firmware: arm_scmi: virtio: drop owner assignment
...
|
|
This removes the signal/coredump hacks added for vhost_tasks in:
Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
When that patch was added vhost_tasks did not handle SIGKILL and would
try to ignore/clear the signal and continue on until the device's close
function was called. In the previous patches vhost_tasks and the vhost
drivers were converted to support SIGKILL by cleaning themselves up and
exiting. The hacks are no longer needed so this removes them.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20240316004707.45557-10-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Introduce the capability to dynamically configure the maximum file
note size for ELF core dumps via sysctl.
Why is this being done?
We have observed that during a crash when there are more than 65k mmaps
in memory, the existing fixed limit on the size of the ELF notes section
becomes a bottleneck. The notes section quickly reaches its capacity,
leading to incomplete memory segment information in the resulting coredump.
This truncation compromises the utility of the coredumps, as crucial
information about the memory state at the time of the crash might be
omitted.
This enhancement removes the previous static limit of 4MB, allowing
system administrators to adjust the size based on system-specific
requirements or constraints.
Eg:
$ sysctl -a | grep core_file_note_size_limit
kernel.core_file_note_size_limit = 4194304
$ sysctl -n kernel.core_file_note_size_limit
4194304
$echo 519304 > /proc/sys/kernel/core_file_note_size_limit
$sysctl -n kernel.core_file_note_size_limit
519304
Attempting to write beyond the ceiling value of 16MB
$echo 17194304 > /proc/sys/kernel/core_file_note_size_limit
bash: echo: write error: Invalid argument
Signed-off-by: Vijay Nag <nagvijay@microsoft.com>
Signed-off-by: Allen Pais <apais@linux.microsoft.com>
Link: https://lore.kernel.org/r/20240506193700.7884-1-apais@linux.microsoft.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|