summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-10-10ftrace: Add unit test for removing trace functionCarles Pey
A self test is provided for the trace function removal functionality. Link: https://lkml.kernel.org/r/20210918153043.318016-2-carles.pey@gmail.com Signed-off-by: Carles Pey <carles.pey@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-08ftrace: Cleanup ftrace_dyn_arch_init()Weizhao Ouyang
Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common ftrace_dyn_arch_init() to cleanup them. Link: https://lkml.kernel.org/r/20210909090216.1955240-1-o451686892@gmail.com Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390) Acked-by: Helge Deller <deller@gmx.de> (parisc) Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-08tracing: Disable "other" permission bits in the tracefs filesSteven Rostedt (VMware)
When building the files in the tracefs file system, do not by default set any permissions for OTH (other). This will make it easier for admins who want to define a group for accessing tracefs and not having to first disable all the permission bits for "other" in the file system. As tracing can leak sensitive information, it should never by default allowing all users access. An admin can still set the permission bits for others to have access, which may be useful for creating a honeypot and seeing who takes advantage of it and roots the machine. Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-08bpftool: Add install-bin target to install binary onlyQuentin Monnet
With "make install", bpftool installs its binary and its bash completion file. Usually, this is what we want. But a few components in the kernel repository (namely, BPF iterators and selftests) also install bpftool locally before using it. In such a case, bash completion is not necessary and is just a useless build artifact. Let's add an "install-bin" target to bpftool, to offer a way to install the binary only. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211007194438.34443-13-quentin@isovalent.com
2021-10-08bpf: iterators: Install libbpf headers when buildingQuentin Monnet
API headers from libbpf should not be accessed directly from the library's source directory. Instead, they should be exported with "make install_headers". Let's make sure that bpf/preload/iterators/Makefile installs the headers properly when building. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211007194438.34443-8-quentin@isovalent.com
2021-10-08bpf: preload: Install libbpf headers when buildingQuentin Monnet
API headers from libbpf should not be accessed directly from the library's source directory. Instead, they should be exported with "make install_headers". Let's make sure that bpf/preload/Makefile installs the headers properly when building. Note that we declare an additional dependency for iterators/iterators.o: having $(LIBBPF_A) as a dependency to "$(obj)/bpf_preload_umd" is not sufficient, as it makes it required only at the linking step. But we need libbpf to be compiled, and in particular its headers to be exported, before we attempt to compile iterators.o. The issue would not occur before this commit, because libbpf's headers were not exported and were always available under tools/lib/bpf. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211007194438.34443-7-quentin@isovalent.com
2021-10-08coredump: Limit coredumps to a single thread groupEric W. Biederman
Today when a signal is delivered with a handler of SIG_DFL whose default behavior is to generate a core dump not only that process but every process that shares the mm is killed. In the case of vfork this looks like a real world problem. Consider the following well defined sequence. if (vfork() == 0) { execve(...); _exit(EXIT_FAILURE); } If a signal that generates a core dump is received after vfork but before the execve changes the mm the process that called vfork will also be killed (as the mm is shared). Similarly if the execve fails after the point of no return the kernel delivers SIGSEGV which will kill both the exec'ing process and because the mm is shared the process that called vfork as well. As far as I can tell this behavior is a violation of people's reasonable expectations, POSIX, and is unnecessarily fragile when the system is low on memory. Solve this by making a userspace visible change to only kill a single process/thread group. This is possible because Jann Horn recently modified[1] the coredump code so that the mm can safely be modified while the coredump is happening. With LinuxThreads long gone I don't expect anyone to have a notice this behavior change in practice. To accomplish this move the core_state pointer from mm_struct to signal_struct, which allows different thread groups to coredump simultatenously. In zap_threads remove the work to kill anything except for the current thread group. v2: Remove core_state from the VM_BUG_ON_MM print to fix compile failure when CONFIG_DEBUG_VM is enabled. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> [1] a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133 Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07Merge branches 'fixes.2021.10.07a', 'scftorture.2021.09.16a', ↵Paul E. McKenney
'tasks.2021.09.15a', 'torture.2021.09.13b' and 'torturescript.2021.09.16a' into HEAD fixes.2021.10.07a: Miscellaneous fixes. scftorture.2021.09.16a: smp_call_function torture-test updates. tasks.2021.09.15a: Tasks-trace RCU updates. torture.2021.09.13b: Other torture-test updates. torturescript.2021.09.16a: Torture-test scripting updates.
2021-10-07rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstrPeter Zijlstra
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x36: call to __kasan_check_read() leaves .noinstr.text section noinstr cannot have atomic_*() functions in because they're explicitly annotated, use arch_atomic_*(). Fixes: 2be57f732889 ("rcu: Weaken ->dynticks accesses and updates") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-10-07rcu: Always inline rcu_dynticks_task*_{enter,exit}()Peter Zijlstra
RCU managed to grow a few noinstr violations: vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x0: call to rcu_dynticks_task_trace_enter() leaves .noinstr.text section vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0xe: call to rcu_dynticks_task_trace_exit() leaves .noinstr.text section Fix them by adding __always_inline to the relevant trivial functions. Also replace the noinstr with __always_inline for the existing rcu_dynticks_task_*() functions since noinstr would force noinline them, even when empty, which seems silly. Fixes: 7d0c9c50c5a1 ("rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-10-07Merge tag 'net-5.15-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from xfrm, bpf, netfilter, and wireless. Current release - regressions: - xfrm: fix XFRM_MSG_MAPPING ABI breakage caused by inserting a new value in the middle of an enum - unix: fix an issue in unix_shutdown causing the other end read/write failures - phy: mdio: fix memory leak Current release - new code bugs: - mlx5e: improve MQPRIO resiliency against bad configs Previous releases - regressions: - bpf: fix integer overflow leading to OOB access in map element pre-allocation - stmmac: dwmac-rk: fix ethernet on rk3399 based devices - netfilter: conntrack: fix boot failure with nf_conntrack.enable_hooks=1 - brcmfmac: revert using ISO3166 country code and 0 rev as fallback - i40e: fix freeing of uninitialized misc IRQ vector - iavf: fix double unlock of crit_lock Previous releases - always broken: - bpf, arm: fix register clobbering in div/mod implementation - netfilter: nf_tables: correct issues in netlink rule change event notifications - dsa: tag_dsa: fix mask for trunked packets - usb: r8152: don't resubmit rx immediately to avoid soft lockup on device unplug - i40e: fix endless loop under rtnl if FW fails to correctly respond to capability query - mlx5e: fix rx checksum offload coexistence with ipsec offload - mlx5: force round second at 1PPS out start time and allow it only in supported clock modes - phy: pcs: xpcs: fix incorrect CL37 AN sequence, EEE disable sequence Misc: - xfrm: slightly rejig the new policy uAPI to make it less cryptic" * tag 'net-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits) net: prefer socket bound to interface when not in VRF iavf: fix double unlock of crit_lock i40e: Fix freeing of uninitialized misc IRQ vector i40e: fix endless loop under rtnl dt-bindings: net: dsa: marvell: fix compatible in example ionic: move filter sync_needed bit set gve: report 64bit tx_bytes counter from gve_handle_report_stats() gve: fix gve_get_stats() rtnetlink: fix if_nlmsg_stats_size() under estimation gve: Properly handle errors in gve_assign_qpl gve: Avoid freeing NULL pointer gve: Correct available tx qpl check unix: Fix an issue in unix_shutdown causing the other end read/write failures net: stmmac: trigger PCS EEE to turn off on link down net: pcs: xpcs: fix incorrect steps on disable EEE netlink: annotate data races around nlk->bound net: pcs: xpcs: fix incorrect CL37 AN sequence net: sfp: Fix typo in state machine debug string net/sched: sch_taprio: properly cancel timer from taprio_destroy() net: bridge: fix under estimation in br_get_linkxstats_size() ...
2021-10-07Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-07 We've added 7 non-merge commits during the last 8 day(s) which contain a total of 8 files changed, 38 insertions(+), 21 deletions(-). The main changes are: 1) Fix ARM BPF JIT to preserve caller-saved regs for DIV/MOD JIT-internal helper call, from Johan Almbladh. 2) Fix integer overflow in BPF stack map element size calculation when used with preallocation, from Tatsuhiko Yasumatsu. 3) Fix an AF_UNIX regression due to added BPF sockmap support related to shutdown handling, from Jiang Wang. 4) Fix a segfault in libbpf when generating light skeletons from objects without BTF, from Kumar Kartikeya Dwivedi. 5) Fix a libbpf memory leak in strset to free the actual struct strset itself, from Andrii Nakryiko. 6) Dual-license bpf_insn.h similarly as we did for libbpf and bpftool, with ACKs from all contributors, from Luca Boccassi. ==================== Link: https://lore.kernel.org/r/20211007135010.21143-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07tracing: Initialize upper and lower vars in pid_list_refill_irq()Steven Rostedt (VMware)
The upper and lower variables are set as link lists to add into the sparse array. If they are NULL, after the needed allocations are done, then there is nothing to add. But they need to be initialized to NULL for this to work. Link: https://lore.kernel.org/all/221bc7ba-a475-1cb9-1bbe-730bb9c2d448@canonical.com/ Fixes: 8d6e90983ade ("tracing: Create a sparse bitmask for pid filtering") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-07tracing: Fix missing osnoise tracer on max_latencyJackie Liu
The compiler warns when the data are actually unused: kernel/trace/trace.c:1712:13: error: ‘trace_create_maxlat_file’ defined but not used [-Werror=unused-function] 1712 | static void trace_create_maxlat_file(struct trace_array *tr, | ^~~~~~~~~~~~~~~~~~~~~~~~ [Why] CONFIG_HWLAT_TRACER=n, CONFIG_TRACER_MAX_TRACE=n, CONFIG_OSNOISE_TRACER=y gcc report warns. [How] Now trace_create_maxlat_file will only take effect when CONFIG_HWLAT_TRACER=y or CONFIG_TRACER_MAX_TRACE=y. In fact, after adding osnoise trace, it also needs to take effect. Link: https://lore.kernel.org/all/c1d9e328-ad7c-920b-6c24-9e1598a6421c@infradead.org/ Link: https://lkml.kernel.org/r/20210922025122.3268022-1-liu.yun@linux.dev Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-07sched: Simplify wake_up_*idle*()Peter Zijlstra
Simplify and make wake_up_if_idle() more robust, also don't iterate the whole machine with preempt_disable() in it's caller: wake_up_all_idle_cpus(). This prepares for another wake_up_if_idle() user that needs a full do_idle() cycle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.769328779@infradead.org
2021-10-07sched,livepatch: Use task_call_func()Peter Zijlstra
Instead of frobbing around with scheduler internals, use the shiny new task_call_func() interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Petr Mladek <pmladek@suse.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.709906138@infradead.org
2021-10-07sched,rcu: Rework try_invoke_on_locked_down_task()Peter Zijlstra
Give try_invoke_on_locked_down_task() a saner name and have it return an int so that the caller might distinguish between different reasons of failure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.649944917@infradead.org
2021-10-07sched: Improve try_invoke_on_locked_down_task()Peter Zijlstra
Clarify and tighten try_invoke_on_locked_down_task(). Basically the function calls @func under task_rq_lock(), except it avoids taking rq->lock when possible. This makes calling @func unconditional (the function will get renamed in a later patch to remove the try). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929152428.589323576@infradead.org
2021-10-07futex: Implement sys_futex_waitv()André Almeida
Add support to wait on multiple futexes. This is the interface implemented by this syscall: futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes, unsigned int flags, struct timespec *timeout, clockid_t clockid) struct futex_waitv { __u64 val; __u64 uaddr; __u32 flags; __u32 __reserved; }; Given an array of struct futex_waitv, wait on each uaddr. The thread wakes if a futex_wake() is performed at any uaddr. The syscall returns immediately if any waiter has *uaddr != val. *timeout is an optional absolute timeout value for the operation. This syscall supports only 64bit sized timeout structs. The flags argument of the syscall should be empty, but it can be used for future extensions. Flags for shared futexes, sizes, etc. should be used on the individual flags of each waiter. __reserved is used for explicit padding and should be 0, but it might be used for future extensions. If the userspace uses 32-bit pointers, it should make sure to explicitly cast it when assigning to waitv::uaddr. Returns the array index of one of the woken futexes. There’s no given information of how many were woken, or any particular attribute of it (if it’s the first woken, if it is of the smaller index...). Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210923171111.300673-17-andrealmeid@collabora.com
2021-10-07futex: Simplify double_lock_hb()Peter Zijlstra
We need to make sure that all requeue operations take the hash bucket locks in the same order to avoid deadlock. Simplify the current double_lock_hb implementation by making sure hb1 is always the "smallest" bucket to avoid extra checks. [André: Add commit description] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-16-andrealmeid@collabora.com
2021-10-07futex: Split out wait/wakePeter Zijlstra
Move the wait/wake bits into their own file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-15-andrealmeid@collabora.com
2021-10-07futex: Split out requeuePeter Zijlstra
Move all the requeue bits into their own file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-14-andrealmeid@collabora.com
2021-10-07futex: Rename mark_wake_futex()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename: s/mark_wake_futex/futex_wake_mark/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-13-andrealmeid@collabora.com
2021-10-07futex: Rename: match_futex()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename: s/match_futex/futex_match/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-12-andrealmeid@collabora.com
2021-10-07futex: Rename: hb_waiter_{inc,dec,pending}()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename them: s/hb_waiters_/futex_&/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-11-andrealmeid@collabora.com
2021-10-07futex: Split out PI futexPeter Zijlstra
Move the PI futex implementation into it's own file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-10-andrealmeid@collabora.com
2021-10-07futex: Rename: {get,cmpxchg}_futex_value_locked()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename them: s/\<\([^_ ]*\)_futex_value_locked/futex_\1_value_locked/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-9-andrealmeid@collabora.com
2021-10-07futex: Rename hash_futex()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename: s/hash_futex/futex_hash/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-8-andrealmeid@collabora.com
2021-10-07futex: Rename __unqueue_futex()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename: s/__unqueue_futex/__futex_unqueue/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-7-andrealmeid@collabora.com
2021-10-07futex: Rename: queue_{,un}lock()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename them: s/queue_\(un\)*lock/futex_q_\1lock/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-6-andrealmeid@collabora.com
2021-10-07futex: Rename futex_wait_queue_me()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename them: s/futex_wait_queue_me/futex_wait_queue/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-5-andrealmeid@collabora.com
2021-10-07futex: Rename {,__}{,un}queue_me()Peter Zijlstra
In order to prepare introducing these symbols into the global namespace; rename them: s/\<\(__\)*\(un\)*queue_me/\1futex_\2queue/g Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-4-andrealmeid@collabora.com
2021-10-07futex: Split out syscallsPeter Zijlstra
Put the syscalls in their own little file. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-3-andrealmeid@collabora.com
2021-10-07futex: Move to kernel/futex/Peter Zijlstra
In preparation for splitup.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: André Almeida <andrealmeid@collabora.com> Link: https://lore.kernel.org/r/20210923171111.300673-2-andrealmeid@collabora.com
2021-10-07locking/rwbase: Optimize rwbase_read_trylockDavidlohr Bueso
Instead of a full barrier around the Rmw insn, micro-optimize for weakly ordered archs such that we only provide the required ACQUIRE semantics when taking the read lock. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lkml.kernel.org/r/20210920052031.54220-2-dave@stgolabs.net
2021-10-07Merge branch 'tip/locking/urgent'Peter Zijlstra
Pull in dependencies.
2021-10-07Merge branch 'objtool/urgent'Peter Zijlstra
Fixup conflicts. # Conflicts: # tools/objtool/check.c
2021-10-06coredump: Don't perform any cleanups before dumping coreEric W. Biederman
Rename coredump_exit_mm to coredump_task_exit and call it from do_exit before PTRACE_EVENT_EXIT, and before any cleanup work for a task happens. This ensures that an accurate copy of the process can be captured in the coredump as no cleanup for the process happens before the coredump completes. This also ensures that PTRACE_EVENT_EXIT will not be visited by any thread until the coredump is complete. Add a new flag PF_POSTCOREDUMP so that tasks that have passed through coredump_task_exit can be recognized and ignored in zap_process. Now that all of the coredumping happens before exit_mm remove code to test for a coredump in progress from mm_release. Replace "may_ptrace_stop()" with a simple test of "current->ptrace". The other tests in may_ptrace_stop all concern avoiding stopping during a coredump. These tests are no longer necessary as it is now guaranteed that fatal_signal_pending will be set if the code enters ptrace_stop during a coredump. The code in ptrace_stop is guaranteed not to stop if fatal_signal_pending returns true. Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call ptrace_stop without fatal_signal_pending being true, as signals are dequeued in get_signal before calling do_exit. This is no longer an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached until after the coredump completes. Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-06exit: Factor coredump_exit_mm out of exit_mmEric W. Biederman
Separate the coredump logic from the ordinary exit_mm logic by moving the coredump logic out of exit_mm into it's own function coredump_exit_mm. Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-06ptrace: Remove the unnecessary arguments from arch_ptrace_stopEric W. Biederman
Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an exit_code and a siginfo structure. Neither argument is used by any of the implementations so just remove the unneeded arguments. The two arechitectures that implement arch_ptrace_stop are ia64 and sparc. Both architectures flush their register stacks before a ptrace_stack so that all of the register information can be accessed by debuggers. As the question of if a register stack needs to be flushed is independent of why ptrace is stopping not needing arguments make sense. Cc: David Miller <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-06signal: Remove the bogus sigkill_pending in ptrace_stopEric W. Biederman
The existence of sigkill_pending is a little silly as it is functionally a duplicate of fatal_signal_pending that is used in exactly one place. Checking for pending fatal signals and returning early in ptrace_stop is actively harmful. It casues the ptrace_stop called by ptrace_signal to return early before setting current->exit_code. Later when ptrace_signal reads the signal number from current->exit_code is undefined, making it unpredictable what will happen. Instead rely on the fact that schedule will not sleep if there is a pending signal that can awaken a task. Removing the explict sigkill_pending test fixes fixes ptrace_signal when ptrace_stop does not stop because current->exit_code is always set to to signr. Cc: stable@vger.kernel.org Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path") Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop") Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-06sched: Fix DEBUG && !SCHEDSTATS warnPeter Zijlstra
When !SCHEDSTATS schedstat_enabled() is an unconditional 0 and the whole block doesn't exist, however GCC figures the scoped variable 'stats' is unused and complains about it. Upgrade the warning from -Wunused-variable to -Wunused-but-set-variable by writing it in two statements. This fixes the build because the new warning is in W=1. Given that whole if(0) {} thing, I don't feel motivated to change things overly much and quite strongly feel this is the compiler being daft. Fixes: cb3e971c435d ("sched: Make struct sched_statistics independent of fair sched class") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-05bpf: Avoid retpoline for bpf_for_each_map_elemAndrey Ignatov
Similarly to 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps") and 84430d4232c3 ("bpf, verifier: avoid retpoline for map push/pop/peek operation") avoid indirect call while calling bpf_for_each_map_elem. Before (a program fragment): ; if (rules_map) { 142: (15) if r4 == 0x0 goto pc+8 143: (bf) r3 = r10 ; bpf_for_each_map_elem(rules_map, process_each_rule, &ctx, 0); 144: (07) r3 += -24 145: (bf) r1 = r4 146: (18) r2 = subprog[+5] 148: (b7) r4 = 0 149: (85) call bpf_for_each_map_elem#143680 <-- indirect call via helper After (same program fragment): ; if (rules_map) { 142: (15) if r4 == 0x0 goto pc+8 143: (bf) r3 = r10 ; bpf_for_each_map_elem(rules_map, process_each_rule, &ctx, 0); 144: (07) r3 += -24 145: (bf) r1 = r4 146: (18) r2 = subprog[+5] 148: (b7) r4 = 0 149: (85) call bpf_for_each_array_elem#170336 <-- direct call On a benchmark that calls bpf_for_each_map_elem() once and does many other things (mostly checking fields in skb) with CONFIG_RETPOLINE=y it makes program faster. Before: ============================================================================ Benchmark.cpp time/iter iters/s ============================================================================ IngressMatchByRemoteEndpoint 80.78ns 12.38M IngressMatchByRemoteIP 80.66ns 12.40M IngressMatchByRemotePort 80.87ns 12.37M After: ============================================================================ Benchmark.cpp time/iter iters/s ============================================================================ IngressMatchByRemoteEndpoint 73.49ns 13.61M IngressMatchByRemoteIP 71.48ns 13.99M IngressMatchByRemotePort 70.39ns 14.21M Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211006001838.75607-1-rdna@fb.com
2021-10-05bpf: selftests: Add selftests for module kfunc supportKumar Kartikeya Dwivedi
This adds selftests that tests the success and failure path for modules kfuncs (in presence of invalid kfunc calls) for both libbpf and gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can add module BTF ID set from bpf_testmod. This also introduces a couple of test cases to verifier selftests for validating whether we get an error or not depending on if invalid kfunc call remains after elimination of unreachable instructions. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com
2021-10-05bpf: Enable TCP congestion control kfunc from modulesKumar Kartikeya Dwivedi
This commit moves BTF ID lookup into the newly added registration helper, in a way that the bbr, cubic, and dctcp implementation set up their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not dependent on modules are looked up from the wrapper function. This lifts the restriction for them to be compiled as built in objects, and can be loaded as modules if required. Also modify Makefile.modfinal to call resolve_btfids for each module. Note that since kernel kfunc_ids never overlap with module kfunc_ids, we only match the owner for module btf id sets. See following commits for background on use of: CONFIG_X86 ifdef: 569c484f9995 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86) CONFIG_DYNAMIC_FTRACE ifdef: 7aae231ac93b (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE) Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com
2021-10-05bpf: btf: Introduce helpers for dynamic BTF set registrationKumar Kartikeya Dwivedi
This adds helpers for registering btf_id_set from modules and the bpf_check_mod_kfunc_call callback that can be used to look them up. With in kernel sets, the way this is supposed to work is, in kernel callback looks up within the in-kernel kfunc whitelist, and then defers to the dynamic BTF set lookup if it doesn't find the BTF id. If there is no in-kernel BTF id set, this callback can be used directly. Also fix includes for btf.h and bpfptr.h so that they can included in isolation. This is in preparation for their usage in tcp_bbr, tcp_cubic and tcp_dctcp modules in the next patch. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-4-memxor@gmail.com
2021-10-05bpf: Be conservative while processing invalid kfunc callsKumar Kartikeya Dwivedi
This patch also modifies the BPF verifier to only return error for invalid kfunc calls specially marked by userspace (with insn->imm == 0, insn->off == 0) after the verifier has eliminated dead instructions. This can be handled in the fixup stage, and skip processing during add and check stages. If such an invalid call is dropped, the fixup stage will not encounter insn->imm as 0, otherwise it bails out and returns an error. This will be exposed as weak ksym support in libbpf in later patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-3-memxor@gmail.com
2021-10-05bpf: Introduce BPF support for kernel module function callsKumar Kartikeya Dwivedi
This change adds support on the kernel side to allow for BPF programs to call kernel module functions. Userspace will prepare an array of module BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter. In the kernel, the module BTFs are placed in the auxilliary struct for bpf_prog, and loaded as needed. The verifier then uses insn->off to index into the fd_array. insn->off 0 is reserved for vmlinux BTF (for backwards compat), so userspace must use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is sorted based on offset in an array, and each offset corresponds to one descriptor, with a max limit up to 256 such module BTFs. We also change existing kfunc_tab to distinguish each element based on imm, off pair as each such call will now be distinct. Another change is to check_kfunc_call callback, which now include a struct module * pointer, this is to be used in later patch such that the kfunc_id and module pointer are matched for dynamically registered BTF sets from loadable modules, so that same kfunc_id in two modules doesn't lead to check_kfunc_call succeeding. For the duration of the check_kfunc_call, the reference to struct module exists, as it returns the pointer stored in kfunc_btf_tab. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
2021-10-05tracing: Create a sparse bitmask for pid filteringSteven Rostedt (VMware)
When the trace_pid_list was created, the default pid max was 32768. Creating a bitmask that can hold one bit for all 32768 took up 4096 (one page). Having a one page bitmask was not much of a problem, and that was used for mapping pids. But today, systems are bigger and can run more tasks, and now the default pid_max is usually set to 4194304. Which means to handle that many pids requires 524288 bytes. Worse yet, the pid_max can be set to 2^30 (1073741824 or 1G) which would take 134217728 (128M) of memory to store this array. Since the pid_list array is very sparsely populated, it is a huge waste of memory to store all possible bits for each pid when most will not be set. Instead, use a page table scheme to store the array, and allow this to handle up to 30 bit pids. The pid_mask will start out with 256 entries for the first 8 MSB bits. This will cost 1K for 32 bit architectures and 2K for 64 bit. Each of these will have a 256 array to store the next 8 bits of the pid (another 1 or 2K). These will hold an 2K byte bitmask (which will cover the LSB 14 bits or 16384 pids). When the trace_pid_list is allocated, it will have the 1/2K upper bits allocated, and then it will allocate a cache for the next upper chunks and the lower chunks (default 6 of each). Then when a bit is "set", these chunks will be pulled from the free list and added to the array. If the free list gets down to a lever (default 2), it will trigger an irqwork that will refill the cache back up. On clearing a bit, if the clear causes the bitmask to be zero, that chunk will then be placed back into the free cache for later use, keeping the need to allocate more down to a minimum. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>