summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-05-22modules: mark find_symbol staticChristoph Hellwig
commit 773110470e2fa3839523384ae014f8a723c4d178 upstream. find_symbol is only used in module.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22modules: mark ref_module staticChristoph Hellwig
commit 7ef5264de773279b9f23b6cc8afb5addb30e970b upstream. ref_module isn't used anywhere outside of module.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22tracing: Map all PIDs to command linesSteven Rostedt (VMware)
commit 785e3c0a3a870e72dc530856136ab4c8dd207128 upstream. The default max PID is set by PID_MAX_DEFAULT, and the tracing infrastructure uses this number to map PIDs to the comm names of the tasks, such output of the trace can show names from the recorded PIDs in the ring buffer. This mapping is also exported to user space via the "saved_cmdlines" file in the tracefs directory. But currently the mapping expects the PIDs to be less than PID_MAX_DEFAULT, which is the default maximum and not the real maximum. Recently, systemd will increases the maximum value of a PID on the system, and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, its comm is not recorded. This leads to the entire trace to have "<...>" as the comm name, which is pretty useless. Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead of just mapping the index to the comm, map a mask of the PID (PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the map_cmdline_to_pid array (that already exists). This bug goes back to the beginning of ftrace, but hasn't been an issue until user space started increasing the maximum value of PIDs. Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home Cc: stable@vger.kernel.org Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22ftrace: Handle commands when closing set_ftrace_filter fileSteven Rostedt (VMware)
commit 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 upstream. # echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter will cause switch_mm to stop tracing by the traceoff command. # echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter does nothing. The reason is that the parsing in the write function only processes commands if it finished parsing (there is white space written after the command). That's to handle: write(fd, "switch_mm:", 10); write(fd, "traceoff", 8); cases, where the command is broken over multiple writes. The problem is if the file descriptor is closed, then the write call is not processed, and the command needs to be processed in the release code. The release code can handle matching of functions, but does not handle commands. Cc: stable@vger.kernel.org Fixes: eda1e32855656 ("tracing: handle broken names in ftrace filter") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22posix-timers: Preserve return value in clock_adjtime32()Chen Jun
commit 2d036dfa5f10df9782f5278fc591d79d283c1fad upstream. The return value on success (>= 0) is overwritten by the return value of put_old_timex32(). That works correct in the fault case, but is wrong for the success case where put_old_timex32() returns 0. Just check the return value of put_old_timex32() and return -EFAULT in case it is not zero. [ tglx: Massage changelog ] Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Cochran <richardcochran@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22Revert 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")Thomas Gleixner
commit 4fbf5d6837bf81fd7a27d771358f4ee6c4f243f8 upstream. The FUTEX_WAIT operand has historically a relative timeout which means that the clock id is irrelevant as relative timeouts on CLOCK_REALTIME are not subject to wall clock changes and therefore are mapped by the kernel to CLOCK_MONOTONIC for simplicity. If a caller would set FUTEX_CLOCK_REALTIME for FUTEX_WAIT the timeout is still treated relative vs. CLOCK_MONOTONIC and then the wait arms that timeout based on CLOCK_REALTIME which is broken and obviously has never been used or even tested. Reject any attempt to use FUTEX_CLOCK_REALTIME with FUTEX_WAIT again. The desired functionality can be achieved with FUTEX_WAIT_BITSET and a FUTEX_BITSET_MATCH_ANY argument. Fixes: 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210422194704.834797921@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22bpf: Fix masking negation logic upon negative dst registerDaniel Borkmann
commit b9b34ddbe2076ade359cd5ce7537d5ed019e9807 upstream. The negation logic for the case where the off_reg is sitting in the dst register is not correct given then we cannot just invert the add to a sub or vice versa. As a fix, perform the final bitwise and-op unconditionally into AX from the off_reg, then move the pointer from the src to dst and finally use AX as the source for the original pointer arithmetic operation such that the inversion yields a correct result. The single non-AX mov in between is possible given constant blinding is retaining it as it's not an immediate based operation. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22bpf: Fix backport of "bpf: restrict unknown scalars of mixed signed bounds ↵Samuel Mendoza-Jonas
for unprivileged" The 4.14 backport of 9d7eceede ("bpf: restrict unknown scalars of mixed signed bounds for unprivileged") adds the PTR_TO_MAP_VALUE check to the wrong location in adjust_ptr_min_max_vals(), most likely because 4.14 doesn't include the commit that updates the if-statement to a switch-statement (aad2eeaf4 "bpf: Simplify ptr_min_max_vals adjustment"). Move the check to the proper location in adjust_ptr_min_max_vals(). Fixes: 17efa65350c5a ("bpf: restrict unknown scalars of mixed signed bounds for unprivileged") Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Reviewed-by: Frank van der Linden <fllinden@amazon.com> Reviewed-by: Ethan Chen <yishache@amazon.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-16workqueue: Move the position of debug_work_activate() in __queue_work()Zqiang
[ Upstream commit 0687c66b5f666b5ad433f4e94251590d9bc9d10e ] The debug_work_activate() is called on the premise that the work can be inserted, because if wq be in WQ_DRAINING status, insert work may be failed. Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability") Signed-off-by: Zqiang <qiang.zhang@windriver.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07tracing: Fix stack trace event sizeSteven Rostedt (VMware)
commit 9deb193af69d3fd6dd8e47f292b67c805a787010 upstream. Commit cbc3b92ce037 fixed an issue to modify the macros of the stack trace event so that user space could parse it properly. Originally the stack trace format to user space showed that the called stack was a dynamic array. But it is not actually a dynamic array, in the way that other dynamic event arrays worked, and this broke user space parsing for it. The update was to make the array look to have 8 entries in it. Helper functions were added to make it parse it correctly, as the stack was dynamic, but was determined by the size of the event stored. Although this fixed user space on how it read the event, it changed the internal structure used for the stack trace event. It changed the array size from [0] to [8] (added 8 entries). This increased the size of the stack trace event by 8 words. The size reserved on the ring buffer was the size of the stack trace event plus the number of stack entries found in the stack trace. That commit caused the amount to be 8 more than what was needed because it did not expect the caller field to have any size. This produced 8 entries of garbage (and reading random data) from the stack trace event: <idle>-0 [002] d... 1976396.837549: <stack trace> => trace_event_raw_event_sched_switch => __traceiter_sched_switch => __schedule => schedule_idle => do_idle => cpu_startup_entry => secondary_startup_64_no_verify => 0xc8c5e150ffff93de => 0xffff93de => 0 => 0 => 0xc8c5e17800000000 => 0x1f30affff93de => 0x00000004 => 0x200000000 Instead, subtract the size of the caller field from the size of the event to make sure that only the amount needed to store the stack trace is reserved. Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/ Cc: stable@vger.kernel.org Fixes: cbc3b92ce037 ("tracing: Set kernel_stack's caller size properly") Reported-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24genirq: Disable interrupts for force threaded handlersThomas Gleixner
commit 81e2073c175b887398e5bca6c004efa89983f58d upstream. With interrupt force threading all device interrupt handlers are invoked from kernel threads. Contrary to hard interrupt context the invocation only disables bottom halfs, but not interrupts. This was an oversight back then because any code like this will have an issue: thread(irq_A) irq_handler(A) spin_lock(&foo->lock); interrupt(irq_B) irq_handler(B) spin_lock(&foo->lock); This has been triggered with networking (NAPI vs. hrtimers) and console drivers where printk() happens from an interrupt which interrupted the force threaded handler. Now people noticed and started to change the spin_lock() in the handler to spin_lock_irqsave() which affects performance or add IRQF_NOTHREAD to the interrupt request which in turn breaks RT. Fix the root cause and not the symptom and disable interrupts before invoking the force threaded handler which preserves the regular semantics and the usefulness of the interrupt force threading as a general debugging tool. For not RT this is not changing much, except that during the execution of the threaded handler interrupts are delayed until the handler returns. Vs. scheduling and softirq processing there is no difference. For RT kernels there is no issue. Fixes: 8d32a307e4fa ("genirq: Provide forced interrupt threading") Reported-by: Johan Hovold <johan@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Johan Hovold <johan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()Oleg Nesterov
commit 5abbe51a526253b9f003e9a0a195638dc882d660 upstream. Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT. Add a new helper which sets restart_block->fn and calls a dummy arch_set_restart_data() helper. Fixes: 609c19a385c8 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210201174641.GA17871@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24bpf: Prohibit alu ops for pointer types not defining ptr_limitPiotr Krysiuk
commit f232326f6966cf2a1d1db7bc917a4ce5f9f55f76 upstream. The purpose of this patch is to streamline error propagation and in particular to propagate retrieve_ptr_limit() errors for pointer types that are not defining a ptr_limit such that register-based alu ops against these types can be rejected. The main rationale is that a gap has been identified by Piotr in the existing protection against speculatively out-of-bounds loads, for example, in case of ctx pointers, unprivileged programs can still perform pointer arithmetic. This can be abused to execute speculatively out-of-bounds loads without restrictions and thus extract contents of kernel memory. Fix this by rejecting unprivileged programs that attempt any pointer arithmetic on unprotected pointer types. The two affected ones are pointer to ctx as well as pointer to map. Field access to a modified ctx' pointer is rejected at a later point in time in the verifier, and 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") only relevant for root-only use cases. Risk of unprivileged program breakage is considered very low. Fixes: 7c6967326267 ("bpf: Permit map_ptr arithmetic with opcode add and offset 0") Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24bpf: Add sanity check for upper ptr_limitPiotr Krysiuk
commit 1b1597e64e1a610c7a96710fc4717158e98a08b3 upstream. Given we know the max possible value of ptr_limit at the time of retrieving the latter, add basic assertions, so that the verifier can bail out if anything looks odd and reject the program. Nothing triggered this so far, but it also does not hurt to have these. Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24bpf: Simplify alu_limit masking for pointer arithmeticPiotr Krysiuk
commit b5871dca250cd391885218b99cc015aca1a51aea upstream. Instead of having the mov32 with aux->alu_limit - 1 immediate, move this operation to retrieve_ptr_limit() instead to simplify the logic and to allow for subsequent sanity boundary checks inside retrieve_ptr_limit(). This avoids in future that at the time of the verifier masking rewrite we'd run into an underflow which would not sign extend due to the nature of mov32 instruction. Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-24bpf: Fix off-by-one for area size in creating mask to leftPiotr Krysiuk
commit 10d2bb2e6b1d8c4576c56a748f697dbeb8388899 upstream. retrieve_ptr_limit() computes the ptr_limit for registers with stack and map_value type. ptr_limit is the size of the memory area that is still valid / in-bounds from the point of the current position and direction of the operation (add / sub). This size will later be used for masking the operation such that attempting out-of-bounds access in the speculative domain is redirected to remain within the bounds of the current map value. When masking to the right the size is correct, however, when masking to the left, the size is off-by-one which would lead to an incorrect mask and thus incorrect arithmetic operation in the non-speculative domain. Piotr found that if the resulting alu_limit value is zero, then the BPF_MOV32_IMM() from the fixup_bpf_calls() rewrite will end up loading 0xffffffff into AX instead of sign-extending to the full 64 bit range, and as a result, this allows abuse for executing speculatively out-of- bounds loads against 4GB window of address space and thus extracting the contents of kernel memory via side-channel. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-03printk: fix deadlock when kernel panicMuchun Song
commit 8a8109f303e25a27f92c1d8edd67d7cbbc60a4eb upstream. printk_safe_flush_on_panic() caused the following deadlock on our server: CPU0: CPU1: panic rcu_dump_cpu_stacks kdump_nmi_shootdown_cpus nmi_trigger_cpumask_backtrace register_nmi_handler(crash_nmi_callback) printk_safe_flush __printk_safe_flush raw_spin_lock_irqsave(&read_lock) // send NMI to other processors apic_send_IPI_allbutself(NMI_VECTOR) // NMI interrupt, dead loop crash_nmi_callback printk_safe_flush_on_panic printk_safe_flush __printk_safe_flush // deadlock raw_spin_lock_irqsave(&read_lock) DEADLOCK: read_lock is taken on CPU1 and will never get released. It happens when panic() stops a CPU by NMI while it has been in the middle of printk_safe_flush(). Handle the lock the same way as logbuf_lock. The printk_safe buffers are flushed only when both locks can be safely taken. It can avoid the deadlock _in this particular case_ at expense of losing contents of printk_safe buffers. Note: It would actually be safe to re-init the locks when all CPUs were stopped by NMI. But it would require passing this information from arch-specific code. It is not worth the complexity. Especially because logbuf_lock and printk_safe buffers have been obsoleted by the lockless ring buffer. Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: <stable@vger.kernel.org> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210210034823.64867-1-songmuchun@bytedance.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-03module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbolsFangrui Song
commit ebfac7b778fac8b0e8e92ec91d0b055f046b4604 upstream. clang-12 -fno-pic (since https://github.com/llvm/llvm-project/commit/a084c0388e2a59b9556f2de0083333232da3f1d6) can emit `call __stack_chk_fail@PLT` instead of `call __stack_chk_fail` on x86. The two forms should have identical behaviors on x86-64 but the former causes GNU as<2.37 to produce an unreferenced undefined symbol _GLOBAL_OFFSET_TABLE_. (On x86-32, there is an R_386_PC32 vs R_386_PLT32 difference but the linker behavior is identical as far as Linux kernel is concerned.) Simply ignore _GLOBAL_OFFSET_TABLE_ for now, like what scripts/mod/modpost.c:ignore_undef_symbol does. This also fixes the problem for gcc/clang -fpie and -fpic, which may emit `call foo@PLT` for external function calls on x86. Note: ld -z defs and dynamic loaders do not error for unreferenced undefined symbols so the module loader is reading too much. If we ever need to ignore more symbols, the code should be refactored to ignore unreferenced symbols. Cc: <stable@vger.kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1250 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=27178 Reported-by: Marco Elver <elver@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Marco Elver <elver@google.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-03seccomp: Add missing return in non-void functionPaul Cercueil
commit 04b38d012556199ba4c31195940160e0c44c64f0 upstream. We don't actually care about the value, since the kernel will panic before that; but a value should nonetheless be returned, otherwise the compiler will complain. Fixes: 8112c4f140fa ("seccomp: remove 2-phase API") Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210111172839.640914-1-paul@crapouillou.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-03tracepoint: Do not fail unregistering a probe due to memory failureSteven Rostedt (VMware)
[ Upstream commit befe6d946551d65cddbd32b9cb0170b0249fd5ed ] The list of tracepoint callbacks is managed by an array that is protected by RCU. To update this array, a new array is allocated, the updates are copied over to the new array, and then the list of functions for the tracepoint is switched over to the new array. After a completion of an RCU grace period, the old array is freed. This process happens for both adding a callback as well as removing one. But on removing a callback, if the new array fails to be allocated, the callback is not removed, and may be used after it is freed by the clients of the tracepoint. There's really no reason to fail if the allocation for a new array fails when removing a function. Instead, the function can simply be replaced by a stub function that could be cleaned up on the next modification of the array. That is, instead of calling the function registered to the tracepoint, it would call a stub function in its place. Link: https://lore.kernel.org/r/20201115055256.65625-1-mmullins@mmlx.us Link: https://lore.kernel.org/r/20201116175107.02db396d@gandalf.local.home Link: https://lore.kernel.org/r/20201117211836.54acaef2@oasis.local.home Link: https://lkml.kernel.org/r/20201118093405.7a6d2290@gandalf.local.home [ Note, this version does use undefined compiler behavior (assuming that a stub function with no parameters or return, can be called by a location that thinks it has parameters but still no return value. Static calls do the same thing, so this trick is not without precedent. There's another solution that uses RCU tricks and is more complex, but can be an alternative if this solution becomes an issue. Link: https://lore.kernel.org/lkml/20210127170721.58bce7cc@gandalf.local.home/ ] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: netdev <netdev@vger.kernel.org> Cc: bpf <bpf@vger.kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Florian Weimer <fw@deneb.enyo.de> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints") Reported-by: syzbot+83aa762ef23b6f0d1991@syzkaller.appspotmail.com Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com Reported-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-03ima: Free IMA measurement buffer after kexec syscallLakshmi Ramasubramanian
[ Upstream commit f31e3386a4e92ba6eda7328cb508462956c94c64 ] IMA allocates kernel virtual memory to carry forward the measurement list, from the current kernel to the next kernel on kexec system call, in ima_add_kexec_buffer() function. This buffer is not freed before completing the kexec system call resulting in memory leak. Add ima_buffer field in "struct kimage" to store the virtual address of the buffer allocated for the IMA measurement list. Free the memory allocated for the IMA measurement list in kimage_file_post_load_cleanup() function. Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com> Suggested-by: Tyler Hicks <tyhicks@linux.microsoft.com> Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com> Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list") Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-03bpf_lru_list: Read double-checked variable once without lockMarco Elver
[ Upstream commit 6df8fb83301d68ea0a0c0e1cbcc790fcc333ed12 ] For double-checked locking in bpf_common_lru_push_free(), node->type is read outside the critical section and then re-checked under the lock. However, concurrent writes to node->type result in data races. For example, the following concurrent access was observed by KCSAN: write to 0xffff88801521bc22 of 1 bytes by task 10038 on cpu 1: __bpf_lru_node_move_in kernel/bpf/bpf_lru_list.c:91 __local_list_flush kernel/bpf/bpf_lru_list.c:298 ... read to 0xffff88801521bc22 of 1 bytes by task 10043 on cpu 0: bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:507 bpf_lru_push_free kernel/bpf/bpf_lru_list.c:555 ... Fix the data races where node->type is read outside the critical section (for double-checked locking) by marking the access with READ_ONCE() as well as ensuring the variable is only accessed once. Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: syzbot+3536db46dfa58c573458@syzkaller.appspotmail.com Reported-by: syzbot+516acdb03d3e27d91bcd@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210209112701.3341724-1-elver@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-03kdb: Make memory allocations more robustSumit Garg
commit 93f7a6d818deef69d0ba652d46bae6fbabbf365c upstream. Currently kdb uses in_interrupt() to determine whether its library code has been called from the kgdb trap handler or from a saner calling context such as driver init. This approach is broken because in_interrupt() alone isn't able to determine kgdb trap handler entry from normal task context. This can happen during normal use of basic features such as breakpoints and can also be trivially reproduced using: echo g > /proc/sysrq-trigger We can improve this by adding check for in_dbg_master() instead which explicitly determines if we are running in debugger context. Cc: stable@vger.kernel.org Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/1611313556-4004-1-git-send-email-sumit.garg@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23bpf: Check for integer overflow when using roundup_pow_of_two()Bui Quang Minh
[ Upstream commit 6183f4d3a0a2ad230511987c6c362ca43ec0055f ] On 32-bit architecture, roundup_pow_of_two() can return 0 when the argument has upper most bit set due to resulting 1UL << 32. Add a check for this case. Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210127063653.3576-1-minhquangbui99@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-23tracing: Check length before giving out the filter bufferSteven Rostedt (VMware)
commit b220c049d5196dd94d992dd2dc8cba1a5e6123bf upstream. When filters are used by trace events, a page is allocated on each CPU and used to copy the trace event fields to this page before writing to the ring buffer. The reason to use the filter and not write directly into the ring buffer is because a filter may discard the event and there's more overhead on discarding from the ring buffer than the extra copy. The problem here is that there is no check against the size being allocated when using this page. If an event asks for more than a page size while being filtered, it will get only a page, leading to the caller writing more that what was allocated. Check the length of the request, and if it is more than PAGE_SIZE minus the header default back to allocating from the ring buffer directly. The ring buffer may reject the event if its too big anyway, but it wont overflow. Link: https://lore.kernel.org/ath10k/1612839593-2308-1-git-send-email-wgong@codeaurora.org/ Cc: stable@vger.kernel.org Fixes: 0fc1b09ff1ff4 ("tracing: Use temp buffer when filtering events") Reported-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23tracing: Do not count ftrace events in top level enable outputSteven Rostedt (VMware)
commit 256cfdd6fdf70c6fcf0f7c8ddb0ebd73ce8f3bc9 upstream. The file /sys/kernel/tracing/events/enable is used to enable all events by echoing in "1", or disabling all events when echoing in "0". To know if all events are enabled, disabled, or some are enabled but not all of them, cating the file should show either "1" (all enabled), "0" (all disabled), or "X" (some enabled but not all of them). This works the same as the "enable" files in the individule system directories (like tracing/events/sched/enable). But when all events are enabled, the top level "enable" file shows "X". The reason is that its checking the "ftrace" events, which are special events that only exist for their format files. These include the format for the function tracer events, that are enabled when the function tracer is enabled, but not by the "enable" file. The check includes these events, which will always be disabled, and even though all true events are enabled, the top level "enable" file will show "X" instead of "1". To fix this, have the check test the event's flags to see if it has the "IGNORE_ENABLE" flag set, and if so, not test it. Cc: stable@vger.kernel.org Fixes: 553552ce1796c ("tracing: Combine event filter_active and enable into single flags field") Reported-by: "Yordan Karadzhov (VMware)" <y.karadz@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23fgraph: Initialize tracing_graph_pause at task creationSteven Rostedt (VMware)
commit 7e0a9220467dbcfdc5bc62825724f3e52e50ab31 upstream. On some archs, the idle task can call into cpu_suspend(). The cpu_suspend() will disable or pause function graph tracing, as there's some paths in bringing down the CPU that can have issues with its return address being modified. The task_struct structure has a "tracing_graph_pause" atomic counter, that when set to something other than zero, the function graph tracer will not modify the return address. The problem is that the tracing_graph_pause counter is initialized when the function graph tracer is enabled. This can corrupt the counter for the idle task if it is suspended in these architectures. CPU 1 CPU 2 ----- ----- do_idle() cpu_suspend() pause_graph_tracing() task_struct->tracing_graph_pause++ (0 -> 1) start_graph_tracing() for_each_online_cpu(cpu) { ftrace_graph_init_idle_task(cpu) task-struct->tracing_graph_pause = 0 (1 -> 0) unpause_graph_tracing() task_struct->tracing_graph_pause-- (0 -> -1) The above should have gone from 1 to zero, and enabled function graph tracing again. But instead, it is set to -1, which keeps it disabled. There's no reason that the field tracing_graph_pause on the task_struct can not be initialized at boot up. Cc: stable@vger.kernel.org Fixes: 380c4b1411ccd ("tracing/function-graph-tracer: append the tracing_graph_flag") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211339 Reported-by: pierre.gondois@arm.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10kretprobe: Avoid re-registration of the same kretprobe earlierWang ShaoBo
commit 0188b87899ffc4a1d36a0badbe77d56c92fd91dc upstream. Our system encountered a re-init error when re-registering same kretprobe, where the kretprobe_instance in rp->free_instances is illegally accessed after re-init. Implementation to avoid re-registration has been introduced for kprobe before, but lags for register_kretprobe(). We must check if kprobe has been re-registered before re-initializing kretprobe, otherwise it will destroy the data struct of kretprobe registered, which can lead to memory leak, system crash, also some unexpected behaviors. We use check_kprobe_rereg() to check if kprobe has been re-registered before running register_kretprobe()'s body, for giving a warning message and terminate registration process. Link: https://lkml.kernel.org/r/20210128124427.2031088-1-bobo.shaobowang@huawei.com Cc: stable@vger.kernel.org Fixes: 1f0ab40976460 ("kprobes: Prevent re-registration of the same kprobe") [ The above commit should have been done for kretprobes too ] Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10elfcore: fix building with clangArnd Bergmann
commit 6e7b64b9dd6d96537d816ea07ec26b7dedd397b9 upstream. kernel/elfcore.c only contains weak symbols, which triggers a bug with clang in combination with recordmcount: Cannot find symbol for section 2: .text. kernel/elfcore.o: failed Move the empty stubs into linux/elfcore.h as inline functions. As only two architectures use these, just use the architecture specific Kconfig symbols to key off the declaration. Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Barret Rhoden <brho@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-07kthread: Extract KTHREAD_IS_PER_CPUPeter Zijlstra
[ Upstream commit ac687e6e8c26181a33270efd1a2e2241377924b0 ] There is a need to distinguish geniune per-cpu kthreads from kthreads that happen to have a single CPU affinity. Geniune per-cpu kthreads are kthreads that are CPU affine for correctness, these will obviously have PF_KTHREAD set, but must also have PF_NO_SETAFFINITY set, lest userspace modify their affinity and ruins things. However, these two things are not sufficient, PF_NO_SETAFFINITY is also set on other tasks that have their affinities controlled through other means, like for instance workqueues. Therefore another bit is needed; it turns out kthread_create_per_cpu() already has such a bit: KTHREAD_IS_PER_CPU, which is used to make kthread_park()/kthread_unpark() work correctly. Expose this flag and remove the implicit setting of it from kthread_create_on_cpu(); the io_uring usage of it seems dubious at best. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-30tracing: Fix race in trace_open and buffer resize callGaurav Kohli
commit bbeb97464eefc65f506084fd9f18f21653e01137 upstream. Below race can come, if trace_open and resize of cpu buffer is running parallely on different cpus CPUX CPUY ring_buffer_resize atomic_read(&buffer->resize_disabled) tracing_open tracing_reset_online_cpus ring_buffer_reset_cpu rb_reset_cpu rb_update_pages remove/insert pages resetting pointer This race can cause data abort or some times infinte loop in rb_remove_pages and rb_insert_pages while checking pages for sanity. Take buffer lock to fix this. Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeaurora.org Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Denis Efremov <efremov@linux.com> Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Handle faults correctly for PI futexesThomas Gleixner
commit 34b1a1ce1458f50ef27c54e28eb9b1947012907a upstream fixup_pi_state_owner() tries to ensure that the state of the rtmutex, pi_state and the user space value related to the PI futex are consistent before returning to user space. In case that the user space value update faults and the fault cannot be resolved by faulting the page in via fault_in_user_writeable() the function returns with -EFAULT and leaves the rtmutex and pi_state owner state inconsistent. A subsequent futex_unlock_pi() operates on the inconsistent pi_state and releases the rtmutex despite not owning it which can corrupt the RB tree of the rtmutex and cause a subsequent kernel stack use after free. It was suggested to loop forever in fixup_pi_state_owner() if the fault cannot be resolved, but that results in runaway tasks which is especially undesired when the problem happens due to a programming error and not due to malice. As the user space value cannot be fixed up, the proper solution is to make the rtmutex and the pi_state consistent so both have the same owner. This leaves the user space value out of sync. Any subsequent operation on the futex will fail because the 10th rule of PI futexes (pi_state owner and user space value are consistent) has been violated. As a consequence this removes the inept attempts of 'fixing' the situation in case that the current task owns the rtmutex when returning with an unresolvable fault by unlocking the rtmutex which left pi_state::owner and rtmutex::owner out of sync in a different and only slightly less dangerous way. Fixes: 1b7558e457ed ("futexes: fix fault handling in futex_lock_pi") Reported-by: gzobqq@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Simplify fixup_pi_state_owner()Thomas Gleixner
commit f2dac39d93987f7de1e20b3988c8685523247ae2 upstream Too many gotos already and an upcoming fix would make it even more unreadable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Use pi_state_update_owner() in put_pi_state()Thomas Gleixner
commit 6ccc84f917d33312eb2846bd7b567639f585ad6d upstream No point in open coding it. This way it gains the extra sanity checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30rtmutex: Remove unused argument from rt_mutex_proxy_unlock()Thomas Gleixner
commit 2156ac1934166d6deb6cd0f6ffc4c1076ec63697 upstream Nothing uses the argument. Remove it as preparation to use pi_state_update_owner(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Provide and use pi_state_update_owner()Thomas Gleixner
commit c5cade200ab9a2a3be9e7f32a752c8d86b502ec7 upstream Updating pi_state::owner is done at several places with the same code. Provide a function for it and use that at the obvious places. This is also a preparation for a bug fix to avoid yet another copy of the same code or alternatively introducing a completely unpenetratable mess of gotos. Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Replace pointless printk in fixup_owner()Thomas Gleixner
commit 04b79c55201f02ffd675e1231d731365e335c307 upstream If that unexpected case of inconsistent arguments ever happens then the futex state is left completely inconsistent and the printk is not really helpful. Replace it with a warning and make the state consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: Ensure the correct return value from futex_lock_pi()Thomas Gleixner
commit 12bb3f7f1b03d5913b3f9d4236a488aa7774dfe9 upstream In case that futex_lock_pi() was aborted by a signal or a timeout and the task returned without acquiring the rtmutex, but is the designated owner of the futex due to a concurrent futex_unlock_pi() fixup_owner() is invoked to establish consistent state. In that case it invokes fixup_pi_state_owner() which in turn tries to acquire the rtmutex again. If that succeeds then it does not propagate this success to fixup_owner() and futex_lock_pi() returns -EINTR or -ETIMEOUT despite having the futex locked. Return success from fixup_pi_state_owner() in all cases where the current task owns the rtmutex and therefore the futex and propagate it correctly through fixup_owner(). Fixup the other callsite which does not expect a positive return value. Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30futex: futex_wake_op, fix sign_extend32 sign bitsJiri Slaby
commit d70ef22892ed6c066e51e118b225923c9b74af34 upstream. sign_extend32 counts the sign bit parameter from 0, not from 1. So we have to use "11" for 12th bit, not "12". This mistake means we have not allowed negative op and cmp args since commit 30d6e0a4190d ("futex: Remove duplicated code and fix undefined behaviour") till now. Fixes: 30d6e0a4190d ("futex: Remove duplicated code and fix undefined behaviour") Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12workqueue: Kick a worker based on the actual activation of delayed worksYunfeng Ye
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ] In realtime scenario, We do not want to have interference on the isolated cpu cores. but when invoking alloc_workqueue() for percpu wq on the housekeeping cpu, it kick a kworker on the isolated cpu. alloc_workqueue pwq_adjust_max_active wake_up_worker The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped" So it is unnecessary to kick a kworker for percpu's wq when invoking alloc_workqueue(). this patch only kick a worker based on the actual activation of delayed works. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-09module: delay kobject uevent until after module init callJessica Yu
[ Upstream commit 38dc717e97153e46375ee21797aa54777e5498f3 ] Apparently there has been a longstanding race between udev/systemd and the module loader. Currently, the module loader sends a uevent right after sysfs initialization, but before the module calls its init function. However, some udev rules expect that the module has initialized already upon receiving the uevent. This race has been triggered recently (see link in references) in some systemd mount unit files. For instance, the configfs module creates the /sys/kernel/config mount point in its init function, however the module loader issues the uevent before this happens. sys-kernel-config.mount expects to be able to mount /sys/kernel/config upon receipt of the module loading uevent, but if the configfs module has not called its init function yet, then this directory will not exist and the mount unit fails. A similar situation exists for sys-fs-fuse-connections.mount, as the fuse sysfs mount point is created during the fuse module's init function. If udev is faster than module initialization then the mount unit would fail in a similar fashion. To fix this race, delay the module KOBJ_ADD uevent until after the module has finished calling its init routine. References: https://github.com/systemd/systemd/issues/17586 Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-By: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-09module: set MODULE_STATE_GOING state when a module fails to loadMiroslav Benes
[ Upstream commit 5e8ed280dab9eeabc1ba0b2db5dbe9fe6debb6b5 ] If a module fails to load due to an error in prepare_coming_module(), the following error handling in load_module() runs with MODULE_STATE_COMING in module's state. Fix it by correctly setting MODULE_STATE_GOING under "bug_cleanup" label. Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29genirq/irqdomain: Don't try to free an interrupt that has no mappingMarc Zyngier
[ Upstream commit 4615fbc3788ddc8e7c6d697714ad35a53729aa2c ] When an interrupt allocation fails for N interrupts, it is pretty common for the error handling code to free the same number of interrupts, no matter how many interrupts have actually been allocated. This may result in the domain freeing code to be unexpectedly called for interrupts that have no mapping in that domain. Things end pretty badly. Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure that thiss does not follow the hierarchy if no mapping exists for a given interrupt. Fixes: 6a6544e520abe ("genirq/irqdomain: Remove auto-recursive hierarchy support") Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29sched: Reenable interrupts in do_sched_yield()Thomas Gleixner
[ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ] do_sched_yield() invokes schedule() with interrupts disabled which is not allowed. This goes back to the pre git era to commit a6efb709806c ("[PATCH] irqlock patch 2.5.27-H6") in the history tree. Reenable interrupts and remove the misleading comment which "explains" it. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29sched/deadline: Fix sched_dl_global_validate()Peng Liu
[ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ] When change sched_rt_{runtime, period}_us, we validate that the new settings should at least accommodate the currently allocated -dl bandwidth: sched_rt_handler() --> sched_dl_bandwidth_validate() { new_bw = global_rt_runtime()/global_rt_period(); for_each_possible_cpu(cpu) { dl_b = dl_bw_of(cpu); if (new_bw < dl_b->total_bw) <------- ret = -EBUSY; } } But under CONFIG_SMP, dl_bw is per root domain , but not per CPU, dl_b->total_bw is the allocated bandwidth of the whole root domain. Instead, we should compare dl_b->total_bw against "cpus*new_bw", where 'cpus' is the number of CPUs of the root domain. Also, below annotation(in kernel/sched/sched.h) implied implementation only appeared in SCHED_DEADLINE v2[1], then deadline scheduler kept evolving till got merged(v9), but the annotation remains unchanged, meaningless and misleading, update it. * With respect to SMP, the bandwidth is given on a per-CPU basis, * meaning that: * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU; * - dl_total_bw array contains, in the i-eth element, the currently * allocated bandwidth on the i-eth CPU. [1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/ Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks") Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handlingNicholas Piggin
[ Upstream commit 8ff00399b153440c1c83e20c43020385b416415b ] powerpc/64s keeps a counter in the mm which counts bits set in mm_cpumask as well as other things. This means it can't use generic code to clear bits out of the mask and doesn't adjust the arch specific counter. Add an arch override that allows powerpc/64s to use clear_tasks_mm_cpumask(). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-11tracing: Fix userstacktrace option for instancesSteven Rostedt (VMware)
commit bcee5278958802b40ee8b26679155a6d9231783e upstream. When the instances were able to use their own options, the userstacktrace option was left hardcoded for the top level. This made the instance userstacktrace option bascially into a nop, and will confuse users that set it, but nothing happens (I was confused when it happened to me!) Cc: stable@vger.kernel.org Fixes: 16270145ce6b ("tracing: Add trace options for core options to instances") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-11ftrace: Fix updating FTRACE_FL_TRAMPNaveen N. Rao
commit 4c75b0ff4e4bf7a45b5aef9639799719c28d0073 upstream. On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in ftrace_get_addr_new() followed by the below message: Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001) The set of steps leading to this involved: - modprobe ftrace-direct-too - enable_probe - modprobe ftrace-direct - rmmod ftrace-direct <-- trigger The problem turned out to be that we were not updating flags in the ftrace record properly. From the above message about the trampoline accounting being bad, it can be seen that the ftrace record still has FTRACE_FL_TRAMP set though ftrace-direct module is going away. This happens because we are checking if any ftrace_ops has the FTRACE_FL_TRAMP flag set _before_ updating the filter hash. The fix for this is to look for any _other_ ftrace_ops that also needs FTRACE_FL_TRAMP. Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.1606412433.git.naveen.n.rao@linux.vnet.ibm.com Cc: stable@vger.kernel.org Fixes: a124692b698b0 ("ftrace: Enable trampoline when rec count returns back to one") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18reboot: fix overflow parsing reboot cpu numberMatteo Croce
commit df5b0ab3e08a156701b537809914b339b0daa526 upstream. Limit the CPU number to num_possible_cpus(), because setting it to a value lower than INT_MAX but higher than NR_CPUS produces the following error on reboot and shutdown: BUG: unable to handle page fault for address: ffffffff90ab1bb0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1c09067 P4D 1c09067 PUD 1c0a063 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 5.9.0-rc8-kvm #110 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:migrate_to_reboot_cpu+0xe/0x60 Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da <48> 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c RSP: 0018:ffffc90000013e08 EFLAGS: 00010246 RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000 RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0 RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __do_sys_reboot.cold+0x34/0x5b do_syscall_64+0x2d/0x40 Fixes: 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fabian Frederick <fabf@skynet.be> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Kees Cook <keescook@chromium.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201103214025.116799-3-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"Matteo Croce
commit 8b92c4ff4423aa9900cf838d3294fcade4dbda35 upstream. Patch series "fix parsing of reboot= cmdline", v3. The parsing of the reboot= cmdline has two major errors: - a missing bound check can crash the system on reboot - parsing of the cpu number only works if specified last Fix both. This patch (of 2): This reverts commit 616feab753972b97. kstrtoint() and simple_strtoul() have a subtle difference which makes them non interchangeable: if a non digit character is found amid the parsing, the former will return an error, while the latter will just stop parsing, e.g. simple_strtoul("123xyx") = 123. The kernel cmdline reboot= argument allows to specify the CPU used for rebooting, with the syntax `s####` among the other flags, e.g. "reboot=warm,s31,force", so if this flag is not the last given, it's silently ignored as well as the subsequent ones. Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Petr Mladek <pmladek@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mike Rapoport <rppt@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>