summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-08-08x86/mm: Use mm_alloc() in poking_init()Peter Zijlstra
commit 3f4c8211d982099be693be9aa7d6fc4607dff290 upstream. Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense, init_mm has no (and must not have) user state to duplicate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03tracing: Fix trace_event_raw_event_synth() if else statementSteven Rostedt (Google)
commit 9971c3f944489ff7aacb9d25e0cde841a5f6018a upstream. The test to check if the field is a stack is to be done if it is not a string. But the code had: } if (event->fields[i]->is_stack) { and not } else if (event->fields[i]->is_stack) { which would cause it to always be tested. Worse yet, this also included an "else" statement that was only to be called if the field was not a string and a stack, but this code allows it to be called if it was a string (and not a stack). Also fixed some whitespace issues. Link: https://lore.kernel.org/all/202301302110.mEtNwkBD-lkp@intel.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230131095237.63e3ca8d@gandalf.local.home Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 00cf3d672a9d ("tracing: Allow synthetic events to pass around stacktraces") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03locking/rtmutex: Fix task->pi_waiters integrityPeter Zijlstra
[ Upstream commit f7853c34241807bb97673a5e97719123be39a09e ] Henry reported that rt_mutex_adjust_prio_check() has an ordering problem and puts the lie to the comment in [7]. Sharing the sort key between lock->waiters and owner->pi_waiters *does* create problems, since unlike what the comment claims, holding [L] is insufficient. Notably, consider: A / \ M1 M2 | | B C That is, task A owns both M1 and M2, B and C block on them. In this case a concurrent chain walk (B & C) will modify their resp. sort keys in [7] while holding M1->wait_lock and M2->wait_lock. So holding [L] is meaningless, they're different Ls. This then gives rise to a race condition between [7] and [11], where the requeue of pi_waiters will observe an inconsistent tree order. B C (holds M1->wait_lock, (holds M2->wait_lock, holds B->pi_lock) holds A->pi_lock) [7] waiter_update_prio(); ... [8] raw_spin_unlock(B->pi_lock); ... [10] raw_spin_lock(A->pi_lock); [11] rt_mutex_enqueue_pi(); // observes inconsistent A->pi_waiters // tree order Fixing this means either extending the range of the owner lock from [10-13] to [6-13], with the immediate problem that this means [6-8] hold both blocked and owner locks, or duplicating the sort key. Since the locking in chain walk is horrible enough without having to consider pi_lock nesting rules, duplicate the sort key instead. By giving each tree their own sort key, the above race becomes harmless, if C sees B at the old location, then B will correct things (if they need correcting) when it walks up the chain and reaches A. Fixes: fb00aca47440 ("rtmutex: Turn the plist into an rb-tree") Reported-by: Henry Wu <triangletrap12@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Henry Wu <triangletrap12@gmail.com> Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03tracing: Fix warning in trace_buffered_event_disable()Zheng Yejian
[ Upstream commit dea499781a1150d285c62b26659f62fb00824fce ] Warning happened in trace_buffered_event_disable() at WARN_ON_ONCE(!trace_buffered_event_ref) Call Trace: ? __warn+0xa5/0x1b0 ? trace_buffered_event_disable+0x189/0x1b0 __ftrace_event_enable_disable+0x19e/0x3e0 free_probe_data+0x3b/0xa0 unregister_ftrace_function_probe_func+0x6b8/0x800 event_enable_func+0x2f0/0x3d0 ftrace_process_regex.isra.0+0x12d/0x1b0 ftrace_filter_write+0xe6/0x140 vfs_write+0x1c9/0x6f0 [...] The cause of the warning is in __ftrace_event_enable_disable(), trace_buffered_event_enable() was called once while trace_buffered_event_disable() was called twice. Reproduction script show as below, for analysis, see the comments: ``` #!/bin/bash cd /sys/kernel/tracing/ # 1. Register a 'disable_event' command, then: # 1) SOFT_DISABLED_BIT was set; # 2) trace_buffered_event_enable() was called first time; echo 'cmdline_proc_show:disable_event:initcall:initcall_finish' > \ set_ftrace_filter # 2. Enable the event registered, then: # 1) SOFT_DISABLED_BIT was cleared; # 2) trace_buffered_event_disable() was called first time; echo 1 > events/initcall/initcall_finish/enable # 3. Try to call into cmdline_proc_show(), then SOFT_DISABLED_BIT was # set again!!! cat /proc/cmdline # 4. Unregister the 'disable_event' command, then: # 1) SOFT_DISABLED_BIT was cleared again; # 2) trace_buffered_event_disable() was called second time!!! echo '!cmdline_proc_show:disable_event:initcall:initcall_finish' > \ set_ftrace_filter ``` To fix it, IIUC, we can change to call trace_buffered_event_enable() at fist time soft-mode enabled, and call trace_buffered_event_disable() at last time soft-mode disabled. Link: https://lore.kernel.org/linux-trace-kernel/20230726095804.920457-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03ring-buffer: Fix wrong stat of cpu_buffer->readZheng Yejian
[ Upstream commit 2d093282b0d4357373497f65db6a05eb0c28b7c8 ] When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set to 0 in order to make sure any read iterators reset themselves. However, this will mess 'entries' stating, see following steps: # cd /sys/kernel/tracing/ # 1. Enlarge ring buffer prepare for later reducing: # echo 20 > per_cpu/cpu0/buffer_size_kb # 2. Write a log into ring buffer of cpu0: # taskset -c 0 echo "hello1" > trace_marker # 3. Read the log: # cat per_cpu/cpu0/trace_pipe <...>-332 [000] ..... 62.406844: tracing_mark_write: hello1 # 4. Stop reading and see the stats, now 0 entries, and 1 event readed: # cat per_cpu/cpu0/stats entries: 0 [...] read events: 1 # 5. Reduce the ring buffer # echo 7 > per_cpu/cpu0/buffer_size_kb # 6. Now entries became unexpected 1 because actually no entries!!! # cat per_cpu/cpu0/stats entries: 1 [...] read events: 0 To fix it, introduce 'page_removed' field to count total removed pages since last reset, then use it to let read iterators reset themselves instead of changing the 'read' pointer. Link: https://lore.kernel.org/linux-trace-kernel/20230724054040.3489499-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <vnagarnaik@google.com> Fixes: 83f40318dab0 ("ring-buffer: Make removal of ring buffer pages atomic") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if ↵Masami Hiramatsu (Google)
fails [ Upstream commit 797311bce5c2ac90b8d65e357603cfd410d36ebb ] Fix to record 0-length data to data_loc in fetch_store_string*() if it fails to get the string data. Currently those expect that the data_loc is updated by store_trace_args() if it returns the error code. However, that does not work correctly if the argument is an array of strings. In that case, store_trace_args() only clears the first entry of the array (which may have no error) and leaves other entries. So it should be cleared by fetch_store_string*() itself. Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated only if it is used (ret > 0 and argument is a dynamic data.) Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/ Fixes: 40b53b771806 ("tracing: probeevent: Add array type support") Cc: stable@vger.kernel.org Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03Revert "tracing: Add "(fault)" name injection to kernel probes"Masami Hiramatsu (Google)
[ Upstream commit 4ed8f337dee32df71435689c19d22e4ee846e15a ] This reverts commit 2e9906f84fc7c99388bb7123ade167250d50f1c0. It was turned out that commit 2e9906f84fc7 ("tracing: Add "(fault)" name injection to kernel probes") did not work correctly and probe events still show just '(fault)' (instead of '"(fault)"'). Also, current '(fault)' is more explicit that it faulted. This also moves FAULT_STRING macro to trace.h so that synthetic event can keep using it, and uses it in trace_probe.c too. Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/ Link: https://lore.kernel.org/all/20230706230642.3793a593@rorschach.local.home/ Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 797311bce5c2 ("tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03tracing: Allow synthetic events to pass around stacktracesSteven Rostedt (Google)
[ Upstream commit 00cf3d672a9dd409418647e9f98784c339c3ff63 ] Allow a stacktrace from one event to be displayed by the end event of a synthetic event. This is very useful when looking for the longest latency of a sleep or something blocked on I/O. # cd /sys/kernel/tracing/ # echo 's:block_lat pid_t pid; u64 delta; unsigned long[] stack;' > dynamic_events # echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state == 1||prev_state == 2' > events/sched/sched_switch/trigger # echo 'hist:keys=prev_pid:delta=common_timestamp.usecs-$ts,s=$st:onmax($delta).trace(block_lat,prev_pid,$delta,$s)' >> events/sched/sched_switch/trigger The above creates a "block_lat" synthetic event that take the stacktrace of when a task schedules out in either the interruptible or uninterruptible states, and on a new per process max $delta (the time it was scheduled out), will print the process id and the stacktrace. # echo 1 > events/synthetic/block_lat/enable # cat trace # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | kworker/u16:0-767 [006] d..4. 560.645045: block_lat: pid=767 delta=66 stack=STACK: => __schedule => schedule => pipe_read => vfs_read => ksys_read => do_syscall_64 => 0x966000aa <idle>-0 [003] d..4. 561.132117: block_lat: pid=0 delta=413787 stack=STACK: => __schedule => schedule => schedule_hrtimeout_range_clock => do_sys_poll => __x64_sys_poll => do_syscall_64 => 0x966000aa <...>-153 [006] d..4. 562.068407: block_lat: pid=153 delta=54 stack=STACK: => __schedule => schedule => io_schedule => rq_qos_wait => wbt_wait => __rq_qos_throttle => blk_mq_submit_bio => submit_bio_noacct_nocheck => ext4_bio_write_page => mpage_submit_page => mpage_process_page_bufs => mpage_prepare_extent_to_map => ext4_do_writepages => ext4_writepages => do_writepages => __writeback_single_inode Link: https://lkml.kernel.org/r/20230117152236.010941267@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 797311bce5c2 ("tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03tracing/probes: Fix to avoid double count of the string length on the arrayMasami Hiramatsu (Google)
[ Upstream commit 66bcf65d6cf0ca6540e2341e88ee7ef02dbdda08 ] If an array is specified with the ustring or symstr, the length of the strings are accumlated on both of 'ret' and 'total', which means the length is double counted. Just set the length to the 'ret' value for avoiding double counting. Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 88903c464321 ("tracing/probe: Add ustring type for user-space string") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03tracing/probes: Add symstr type for dynamic eventsMasami Hiramatsu (Google)
[ Upstream commit b26a124cbfa80f42bfc4e63e1d5643ca98159d66 ] Add 'symstr' type for storing the kernel symbol as a string data instead of the symbol address. This allows us to filter the events by wildcard symbol name. e.g. # echo 'e:wqfunc workqueue.workqueue_execute_start symname=$function:symstr' >> dynamic_events # cat events/eprobes/wqfunc/format name: wqfunc ID: 2110 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:__data_loc char[] symname; offset:8; size:4; signed:1; print fmt: " symname=\"%s\"", __get_str(symname) Note that there is already 'symbol' type which just change the print format (so it still stores the symbol address in the tracing ring buffer.) On the other hand, 'symstr' type stores the actual "symbol+offset/size" data as a string. Link: https://lore.kernel.org/all/166679930847.1528100.4124308529180235965.stgit@devnote3/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 66bcf65d6cf0 ("tracing/probes: Fix to avoid double count of the string length on the array") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27tracing/histograms: Return an error if we fail to add histogram to hist_vars ↵Mohamed Khalfella
list commit 4b8b3905165ef98386a3c06f196c85d21292d029 upstream. Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") added a check to fail histogram creation if save_hist_vars() failed to add histogram to hist_vars list. But the commit failed to set ret to failed return code before jumping to unregister histogram, fix it. Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com Cc: stable@vger.kernel.org Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-27bpf: Fix subprog idx logic in check_max_stack_depthKumar Kartikeya Dwivedi
[ Upstream commit ba7b3e7d5f9014be65879ede8fd599cb222901c9 ] The assignment to idx in check_max_stack_depth happens once we see a bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of the code performs a few checks and then pushes the frame to the frame stack, except the case of async callbacks. If the async callback case causes the loop iteration to be skipped, the idx assignment will be incorrect on the next iteration of the loop. The value stored in the frame stack (as the subprogno of the current subprog) will be incorrect. This leads to incorrect checks and incorrect tail_call_reachable marking. Save the target subprog in a new variable and only assign to idx once we are done with the is_async_cb check which may skip pushing of frame to the frame stack and subsequent stack depth checks and tail call markings. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23tracing/probes: Fix to update dynamic data counter if fetcharg uses itMasami Hiramatsu (Google)
commit e38e2c6a9efc435f9de344b7c91f7697e01b47d5 upstream. Fix to update dynamic data counter ('dyndata') and max length ('maxlen') only if the fetcharg uses the dynamic data. Also get out arg->dynamic from unlikely(). This makes dynamic data address wrong if process_fetch_insn() returns error on !arg->dynamic case. Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/ Suggested-by: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20230710233400.5aaf024e@gandalf.local.home/ Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23tracing/probes: Fix not to count error code to total lengthMasami Hiramatsu (Google)
commit b41326b5e0f82e93592c4366359917b5d67b529f upstream. Fix not to count the error code (which is minus value) to the total used length of array, because it can mess up the return code of process_fetch_insn_bottom(). Also clear the 'ret' value because it will be used for calculating next data_loc entry. Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/8819b154-2ba1-43c3-98a2-cbde20892023@moroto.mountain/ Fixes: 9b960a38835f ("tracing: probeevent: Unify fetch_insn processing common part") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23tracing: Fix null pointer dereference in tracing_err_log_open()Mateusz Stachyra
commit 02b0095e2fbbc060560c1065f86a211d91e27b26 upstream. Fix an issue in function 'tracing_err_log_open'. The function doesn't call 'seq_open' if the file is opened only with write permissions, which results in 'file->private_data' being left as null. If we then use 'lseek' on that opened file, 'seq_lseek' dereferences 'file->private_data' in 'mutex_lock(&m->lock)', resulting in a kernel panic. Writing to this node requires root privileges, therefore this bug has very little security impact. Tracefs node: /sys/kernel/tracing/error_log Example Kernel panic: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000038 Call trace: mutex_lock+0x30/0x110 seq_lseek+0x34/0xb8 __arm64_sys_lseek+0x6c/0xb8 invoke_syscall+0x58/0x13c el0_svc_common+0xc4/0x10c do_el0_svc+0x24/0x98 el0_svc+0x24/0x88 el0t_64_sync_handler+0x84/0xe4 el0t_64_sync+0x1b4/0x1b8 Code: d503201f aa0803e0 aa1f03e1 aa0103e9 (c8e97d02) ---[ end trace 561d1b49c12cf8a5 ]--- Kernel panic - not syncing: Oops: Fatal exception Link: https://lore.kernel.org/linux-trace-kernel/20230703155237eucms1p4dfb6a19caa14c79eb6c823d127b39024@eucms1p4 Link: https://lore.kernel.org/linux-trace-kernel/20230704102706eucms1p30d7ecdcc287f46ad67679fc8491b2e0f@eucms1p3 Cc: stable@vger.kernel.org Fixes: 8a062902be725 ("tracing: Add tracing error log") Signed-off-by: Mateusz Stachyra <m.stachyra@samsung.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()Zheng Yejian
commit 26efd79c4624294e553aeaa3439c646729bad084 upstream. As comments in ftrace_process_locs(), there may be NULL pointers in mcount_loc section: > Some architecture linkers will pad between > the different mcount_loc sections of different > object files to satisfy alignments. > Skip any NULL pointers. After commit 20e5227e9f55 ("ftrace: allow NULL pointers in mcount_loc"), NULL pointers will be accounted when allocating ftrace pages but skipped before adding into ftrace pages, this may result in some pages not being used. Then after commit 706c81f87f84 ("ftrace: Remove extra helper functions"), warning may occur at: WARN_ON(pg->next); To fix it, only warn for case that no pointers skipped but pages not used up, then free those unused pages after releasing ftrace_lock. Link: https://lore.kernel.org/linux-trace-kernel/20230712060452.3175675-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: 706c81f87f84 ("ftrace: Remove extra helper functions") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ring-buffer: Fix deadloop issue on reading trace_pipeZheng Yejian
commit 7e42907f3a7b4ce3a2d1757f6d78336984daf8f5 upstream. Soft lockup occurs when reading file 'trace_pipe': watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [cat:4488] [...] RIP: 0010:ring_buffer_empty_cpu+0xed/0x170 RSP: 0018:ffff88810dd6fc48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffffffff93d1aaeb RDX: ffff88810a280040 RSI: 0000000000000008 RDI: ffff88811164b218 RBP: ffff88811164b218 R08: 0000000000000000 R09: ffff88815156600f R10: ffffed102a2acc01 R11: 0000000000000001 R12: 0000000051651901 R13: 0000000000000000 R14: ffff888115e49500 R15: 0000000000000000 [...] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8d853c2000 CR3: 000000010dcd8000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __find_next_entry+0x1a8/0x4b0 ? peek_next_entry+0x250/0x250 ? down_write+0xa5/0x120 ? down_write_killable+0x130/0x130 trace_find_next_entry_inc+0x3b/0x1d0 tracing_read_pipe+0x423/0xae0 ? tracing_splice_read_pipe+0xcb0/0xcb0 vfs_read+0x16b/0x490 ksys_read+0x105/0x210 ? __ia32_sys_pwrite64+0x200/0x200 ? switch_fpu_return+0x108/0x220 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Through the vmcore, I found it's because in tracing_read_pipe(), ring_buffer_empty_cpu() found some buffer is not empty but then it cannot read anything due to "rb_num_of_entries() == 0" always true, Then it infinitely loop the procedure due to user buffer not been filled, see following code path: tracing_read_pipe() { ... ... waitagain: tracing_wait_pipe() // 1. find non-empty buffer here trace_find_next_entry_inc() // 2. loop here try to find an entry __find_next_entry() ring_buffer_empty_cpu(); // 3. find non-empty buffer peek_next_entry() // 4. but peek always return NULL ring_buffer_peek() rb_buffer_peek() rb_get_reader_page() // 5. because rb_num_of_entries() == 0 always true here // then return NULL // 6. user buffer not been filled so goto 'waitgain' // and eventually leads to an deadloop in kernel!!! } By some analyzing, I found that when resetting ringbuffer, the 'entries' of its pages are not all cleared (see rb_reset_cpu()). Then when reducing the ringbuffer, and if some reduced pages exist dirty 'entries' data, they will be added into 'cpu_buffer->overrun' (see rb_remove_pages()), which cause wrong 'overrun' count and eventually cause the deadloop issue. To fix it, we need to clear every pages in rb_reset_cpu(). Link: https://lore.kernel.org/linux-trace-kernel/20230708225144.3785600-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: a5fb833172eca ("ring-buffer: Fix uninitialized read_stamp") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23tracing: Fix memory leak of iter->temp when reading trace_pipeZheng Yejian
commit d5a821896360cc8b93a15bd888fabc858c038dc0 upstream. kmemleak reports: unreferenced object 0xffff88814d14e200 (size 256): comm "cat", pid 336, jiffies 4294871818 (age 779.490s) hex dump (first 32 bytes): 04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00 ................ 0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff .........Z...... backtrace: [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140 [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0 [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0 [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950 [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0 [<ffffffff9bf03a43>] vfs_read+0x143/0x520 [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160 [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90 [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 when reading file 'trace_pipe', 'iter->temp' is allocated or relocated in trace_find_next_entry() but not freed before 'trace_pipe' is closed. To fix it, free 'iter->temp' in tracing_release_pipe(). Link: https://lore.kernel.org/linux-trace-kernel/20230713141435.1133021-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23tracing/histograms: Add histograms to hist_vars if they have referenced ↵Mohamed Khalfella
variables commit 6018b585e8c6fa7d85d4b38d9ce49a5b67be7078 upstream. Hist triggers can have referenced variables without having direct variables fields. This can be the case if referenced variables are added for trigger actions. In this case the newly added references will not have field variables. Not taking such referenced variables into consideration can result in a bug where it would be possible to remove hist trigger with variables being refenced. This will result in a bug that is easily reproducable like so $ cd /sys/kernel/tracing $ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events $ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger $ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger $ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger [ 100.263533] ================================================================== [ 100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180 [ 100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439 [ 100.266320] [ 100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4 [ 100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 100.268561] Call Trace: [ 100.268902] <TASK> [ 100.269189] dump_stack_lvl+0x4c/0x70 [ 100.269680] print_report+0xc5/0x600 [ 100.270165] ? resolve_var_refs+0xc7/0x180 [ 100.270697] ? kasan_complete_mode_report_info+0x80/0x1f0 [ 100.271389] ? resolve_var_refs+0xc7/0x180 [ 100.271913] kasan_report+0xbd/0x100 [ 100.272380] ? resolve_var_refs+0xc7/0x180 [ 100.272920] __asan_load8+0x71/0xa0 [ 100.273377] resolve_var_refs+0xc7/0x180 [ 100.273888] event_hist_trigger+0x749/0x860 [ 100.274505] ? kasan_save_stack+0x2a/0x50 [ 100.275024] ? kasan_set_track+0x29/0x40 [ 100.275536] ? __pfx_event_hist_trigger+0x10/0x10 [ 100.276138] ? ksys_write+0xd1/0x170 [ 100.276607] ? do_syscall_64+0x3c/0x90 [ 100.277099] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 100.277771] ? destroy_hist_data+0x446/0x470 [ 100.278324] ? event_hist_trigger_parse+0xa6c/0x3860 [ 100.278962] ? __pfx_event_hist_trigger_parse+0x10/0x10 [ 100.279627] ? __kasan_check_write+0x18/0x20 [ 100.280177] ? mutex_unlock+0x85/0xd0 [ 100.280660] ? __pfx_mutex_unlock+0x10/0x10 [ 100.281200] ? kfree+0x7b/0x120 [ 100.281619] ? ____kasan_slab_free+0x15d/0x1d0 [ 100.282197] ? event_trigger_write+0xac/0x100 [ 100.282764] ? __kasan_slab_free+0x16/0x20 [ 100.283293] ? __kmem_cache_free+0x153/0x2f0 [ 100.283844] ? sched_mm_cid_remote_clear+0xb1/0x250 [ 100.284550] ? __pfx_sched_mm_cid_remote_clear+0x10/0x10 [ 100.285221] ? event_trigger_write+0xbc/0x100 [ 100.285781] ? __kasan_check_read+0x15/0x20 [ 100.286321] ? __bitmap_weight+0x66/0xa0 [ 100.286833] ? _find_next_bit+0x46/0xe0 [ 100.287334] ? task_mm_cid_work+0x37f/0x450 [ 100.287872] event_triggers_call+0x84/0x150 [ 100.288408] trace_event_buffer_commit+0x339/0x430 [ 100.289073] ? ring_buffer_event_data+0x3f/0x60 [ 100.292189] trace_event_raw_event_sys_enter+0x8b/0xe0 [ 100.295434] syscall_trace_enter.constprop.0+0x18f/0x1b0 [ 100.298653] syscall_enter_from_user_mode+0x32/0x40 [ 100.301808] do_syscall_64+0x1a/0x90 [ 100.304748] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 100.307775] RIP: 0033:0x7f686c75c1cb [ 100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48 [ 100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021 [ 100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb [ 100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a [ 100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a [ 100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009 [ 100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007 [ 100.338381] </TASK> We hit the bug because when second hist trigger has was created has_hist_vars() returned false because hist trigger did not have variables. As a result of that save_hist_vars() was not called to add the trigger to trace_array->hist_vars. Later on when we attempted to remove the first histogram find_any_var_ref() failed to detect it is being used because it did not find the second trigger in hist_vars list. With this change we wait until trigger actions are created so we can take into consideration if hist trigger has variable references. Also, now we check the return value of save_hist_vars() and fail trigger creation if save_hist_vars() fails. Link: https://lore.kernel.org/linux-trace-kernel/20230712223021.636335-1-mkhalfella@purestorage.com Cc: stable@vger.kernel.org Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23bpf: cpumap: Fix memory leak in cpu_map_update_elemPu Lehui
[ Upstream commit 4369016497319a9635702da010d02af1ebb1849d ] Syzkaller reported a memory leak as follows: BUG: memory leak unreferenced object 0xff110001198ef748 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 4a 19 00 00 80 ad e3 e4 fe ff c0 00 ....J........... 00 b2 d3 0c 01 00 11 ff 28 f5 8e 19 01 00 11 ff ........(....... backtrace: [<ffffffffadd28087>] __cpu_map_entry_alloc+0xf7/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff110001198ef528 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffffadd281f0>] __cpu_map_entry_alloc+0x260/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff1100010fd93d68 (size 8): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [<ffffffffade5db3e>] kvmalloc_node+0x11e/0x170 [<ffffffffadd28280>] __cpu_map_entry_alloc+0x2f0/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 In the cpu_map_update_elem flow, when kthread_stop is called before calling the threadfn of rcpu->kthread, since the KTHREAD_SHOULD_STOP bit of kthread has been set by kthread_stop, the threadfn of rcpu->kthread will never be executed, and rcpu->refcnt will never be 0, which will lead to the allocated rcpu, rcpu->queue and rcpu->queue->queue cannot be released. Calling kthread_stop before executing kthread's threadfn will return -EINTR. We can complete the release of memory resources in this state. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Signed-off-by: Pu Lehui <pulehui@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230711115848.2701559-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23kernel/trace: Fix cleanup logic of enable_trace_eprobeTzvetomir Stoyanov (VMware)
[ Upstream commit cf0a624dc706c306294c14e6b3e7694702f25191 ] The enable_trace_eprobe() function enables all event probes, attached to given trace probe. If an error occurs in enabling one of the event probes, all others should be roll backed. There is a bug in that roll back logic - instead of all event probes, only the failed one is disabled. Link: https://lore.kernel.org/all/20230703042853.1427493-1-tz.stoyanov@gmail.com/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23bpf: Fix max stack depth check for async callbacksKumar Kartikeya Dwivedi
[ Upstream commit 5415ccd50a8620c8cbaa32d6f18c946c453566f5 ] The check_max_stack_depth pass happens after the verifier's symbolic execution, and attempts to walk the call graph of the BPF program, ensuring that the stack usage stays within bounds for all possible call chains. There are two cases to consider: bpf_pseudo_func and bpf_pseudo_call. In the former case, the callback pointer is loaded into a register, and is assumed that it is passed to some helper later which calls it (however there is no way to be sure), but the check remains conservative and accounts the stack usage anyway. For this particular case, asynchronous callbacks are skipped as they execute asynchronously when their corresponding event fires. The case of bpf_pseudo_call is simpler and we know that the call is definitely made, hence the stack depth of the subprog is accounted for. However, the current check still skips an asynchronous callback even if a bpf_pseudo_call was made for it. This is erroneous, as it will miss accounting for the stack usage of the asynchronous callback, which can be used to breach the maximum stack depth limit. Fix this by only skipping asynchronous callbacks when the instruction is not a pseudo call to the subprog. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230705144730.235802-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23workqueue: clean up WORK_* constant types, clarify maskingLinus Torvalds
commit afa4bb778e48d79e4a642ed41e3b4e0de7489a6c upstream. Dave Airlie reports that gcc-13.1.1 has started complaining about some of the workqueue code in 32-bit arm builds: kernel/workqueue.c: In function ‘get_work_pwq’: kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast] 713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK); | ^ [ ... a couple of other cases ... ] and while it's not immediately clear exactly why gcc started complaining about it now, I suspect it's some C23-induced enum type handlign fixup in gcc-13 is the cause. Whatever the reason for starting to complain, the code and data types are indeed disgusting enough that the complaint is warranted. The wq code ends up creating various "helper constants" (like that WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of confused. The mask needs to be 'unsigned long', not some unspecified enum type. To make matters worse, the actual "mask and cast to a pointer" is repeated a couple of times, and the cast isn't even always done to the right pointer, but - as the error case above - to a 'void *' with then the compiler finishing the job. That's now how we roll in the kernel. So create the masks using the proper types rather than some ambiguous enumeration, and use a nice helper that actually does the type conversion in one well-defined place. Incidentally, this magically makes clang generate better code. That, admittedly, is really just a sign of clang having been seriously confused before, and cleaning up the typing unconfuses the compiler too. Reported-by: Dave Airlie <airlied@gmail.com> Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/ Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tejun Heo <tj@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23kcsan: Don't expect 64 bits atomic builtins from 32 bits architecturesChristophe Leroy
[ Upstream commit 353e7300a1db928e427462f2745f9a2cd1625b3d ] Activating KCSAN on a 32 bits architecture leads to the following link-time failure: LD .tmp_vmlinux.kallsyms1 powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_load': kernel/kcsan/core.c:1273: undefined reference to `__atomic_load_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_store': kernel/kcsan/core.c:1273: undefined reference to `__atomic_store_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_exchange': kernel/kcsan/core.c:1273: undefined reference to `__atomic_exchange_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_add': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_add_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_sub': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_sub_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_and': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_and_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_or': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_or_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_xor': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_xor_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_fetch_nand': kernel/kcsan/core.c:1273: undefined reference to `__atomic_fetch_nand_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_strong': kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_weak': kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8' powerpc64-linux-ld: kernel/kcsan/core.o: in function `__tsan_atomic64_compare_exchange_val': kernel/kcsan/core.c:1273: undefined reference to `__atomic_compare_exchange_8' 32 bits architectures don't have 64 bits atomic builtins. Only include DEFINE_TSAN_ATOMIC_OPS(64) on 64 bits architectures. Fixes: 0f8ad5f2e934 ("kcsan: Add support for atomic builtins") Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Marco Elver <elver@google.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/d9c6afc28d0855240171a4e0ad9ffcdb9d07fceb.1683892665.git.christophe.leroy@csgroup.eu Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23kexec: fix a memory leak in crash_shrink_memory()Zhen Lei
[ Upstream commit 1cba6c4309f03de570202c46f03df3f73a0d4c82 ] Patch series "kexec: enable kexec_crash_size to support two crash kernel regions". When crashkernel=X fails to reserve region under 4G, it will fall back to reserve region above 4G and a region of the default size will also be reserved under 4G. Unfortunately, /sys/kernel/kexec_crash_size only supports one crash kernel region now, the user cannot sense the low memory reserved by reading /sys/kernel/kexec_crash_size. Also, low memory cannot be freed by writing this file. For example: resource_size(crashk_res) = 512M resource_size(crashk_low_res) = 256M The result of 'cat /sys/kernel/kexec_crash_size' is 512M, but it should be 768M. When we execute 'echo 0 > /sys/kernel/kexec_crash_size', the size of crashk_res becomes 0 and resource_size(crashk_low_res) is still 256 MB, which is incorrect. Since crashk_res manages the memory with high address and crashk_low_res manages the memory with low address, crashk_low_res is shrunken only when all crashk_res is shrunken. And because when there is only one crash kernel region, crashk_res is always used. Therefore, if all crashk_res is shrunken and crashk_low_res still exists, swap them. This patch (of 6): If the value of parameter 'new_size' is in the semi-open and semi-closed interval (crashk_res.end - KEXEC_CRASH_MEM_ALIGN + 1, crashk_res.end], the calculation result of ram_res is: ram_res->start = crashk_res.end + 1 ram_res->end = crashk_res.end The operation of insert_resource() fails, and ram_res is not added to iomem_resource. As a result, the memory of the control block ram_res is leaked. In fact, on all architectures, the start address and size of crashk_res are already aligned by KEXEC_CRASH_MEM_ALIGN. Therefore, we do not need to round up crashk_res.start again. Instead, we should round up 'new_size' in advance. Link: https://lkml.kernel.org/r/20230527123439.772-1-thunder.leizhen@huawei.com Link: https://lkml.kernel.org/r/20230527123439.772-2-thunder.leizhen@huawei.com Fixes: 6480e5a09237 ("kdump: add missing RAM resource in crash_shrink_memory()") Fixes: 06a7f711246b ("kexec: premit reduction of the reserved memory size") Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Cong Wang <amwang@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23watchdog/perf: more properly prevent false positives with turbo modesDouglas Anderson
[ Upstream commit 4379e59fe5665cfda737e45b8bf2f05321ef049c ] Currently, in the watchdog_overflow_callback() we first check to see if the watchdog had been touched and _then_ we handle the workaround for turbo mode. This order should be reversed. Specifically, "touching" the hardlockup detector's watchdog should avoid lockups being detected for one period that should be roughly the same regardless of whether we're running turbo or not. That means that we should do the extra accounting for turbo _before_ we look at (and clear) the global indicating that we've been touched. NOTE: this fix is made based on code inspection. I am not aware of any reports where the old code would have generated false positives. That being said, this order seems more correct and also makes it easier down the line to share code with the "buddy" hardlockup detector. Link: https://lkml.kernel.org/r/20230519101840.v5.2.I843b0d1de3e096ba111a179f3adb16d576bef5c7@changeid Fixes: 7edaeb6841df ("kernel/watchdog: Prevent false positives with turbo modes") Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen-Yu Tsai <wens@csie.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Colin Cross <ccross@android.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guenter Roeck <groeck@chromium.org> Cc: Ian Rogers <irogers@google.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Ricardo Neri <ricardo.neri@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Stephen Boyd <swboyd@chromium.org> Cc: Sumit Garg <sumit.garg@linaro.org> Cc: Tzung-Bi Shih <tzungbi@chromium.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23bpf: Don't EFAULT for {g,s}setsockopt with wrong optlenStanislav Fomichev
[ Upstream commit 29ebbba7d46136cba324264e513a1e964ca16c0a ] With the way the hooks implemented right now, we have a special condition: optval larger than PAGE_SIZE will expose only first 4k into BPF; any modifications to the optval are ignored. If the BPF program doesn't handle this condition by resetting optlen to 0, the userspace will get EFAULT. The intention of the EFAULT was to make it apparent to the developers that the program is doing something wrong. However, this inadvertently might affect production workloads with the BPF programs that are not too careful (i.e., returning EFAULT for perfectly valid setsockopt/getsockopt calls). Let's try to minimize the chance of BPF program screwing up userspace by ignoring the output of those BPF programs (instead of returning EFAULT to the userspace). pr_info_once those cases to the dmesg to help with figuring out what's going wrong. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscaleQiuxu Zhuo
[ Upstream commit 23fc8df26dead16687ae6eb47b0561a4a832e2f6 ] Running the 'kfree_rcu_test' test case [1] results in a splat [2]. The root cause is the kfree_scale_thread thread(s) continue running after unloading the rcuscale module. This commit fixes that isue by invoking kfree_scale_cleanup() from rcu_scale_cleanup() when removing the rcuscale module. [1] modprobe rcuscale kfree_rcu_test=1 // After some time rmmod rcuscale rmmod torture [2] BUG: unable to handle page fault for address: ffffffffc0601a87 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 11de4f067 P4D 11de4f067 PUD 11de51067 PMD 112f4d067 PTE 0 Oops: 0010 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 1798 Comm: kfree_scale_thr Not tainted 6.3.0-rc1-rcu+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 RIP: 0010:0xffffffffc0601a87 Code: Unable to access opcode bytes at 0xffffffffc0601a5d. RSP: 0018:ffffb25bc2e57e18 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffffffffc061f0b6 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff962fd0de RDI: ffffffff962fd0de RBP: ffffb25bc2e57ea8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 000000000000000a R15: 00000000001c1dbe FS: 0000000000000000(0000) GS:ffff921fa2200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc0601a5d CR3: 000000011de4c006 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? kvfree_call_rcu+0xf0/0x3a0 ? kthread+0xf3/0x120 ? kthread_complete_and_exit+0x20/0x20 ? ret_from_fork+0x1f/0x30 </TASK> Modules linked in: rfkill sunrpc ... [last unloaded: torture] CR2: ffffffffc0601a87 ---[ end trace 0000000000000000 ]--- Fixes: e6e78b004fa7 ("rcuperf: Add kfree_rcu() performance Tests") Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()Qiuxu Zhuo
[ Upstream commit bf5ddd736509a7d9077c0b6793e6f0852214dbea ] This code-movement-only commit moves the rcu_scale_cleanup() and rcu_scale_shutdown() functions to follow kfree_scale_cleanup(). This is code movement is in preparation for a bug-fix patch that invokes kfree_scale_cleanup() from rcu_scale_cleanup(). Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Stable-dep-of: 23fc8df26dea ("rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23rcuscale: Move shutdown from wait_event() to wait_event_idle()Paul E. McKenney
[ Upstream commit ef1ef3d47677dc191b88650a9f7f91413452cc1b ] The rcu_scale_shutdown() and kfree_scale_shutdown() kthreads/functions use wait_event() to wait for the rcuscale test to complete. However, each updater thread in such a test waits for at least 100 grace periods. If each grace period takes more than 1.2 seconds, which is long, but not insanely so, this can trigger the hung-task timeout. This commit therefore replaces those wait_event() calls with calls to wait_event_idle(), which do not trigger the hung-task timeout. Reported-by: kernel test robot <yujie.liu@intel.com> Reported-by: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Yujie Liu <yujie.liu@intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Stable-dep-of: 23fc8df26dea ("rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23rcuscale: Always log error messageLi Zhijian
[ Upstream commit 86e7ed1bd57d020e35d430542bf5d689c3200568 ] Unconditionally log messages corresponding to errors. Acked-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Li Zhijian <zhijianx.li@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Stable-dep-of: 23fc8df26dea ("rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23posix-timers: Prevent RT livelock in itimer_delete()Thomas Gleixner
[ Upstream commit 9d9e522010eb5685d8b53e8a24320653d9d4cbbf ] itimer_delete() has a retry loop when the timer is concurrently expired. On non-RT kernels this just spin-waits until the timer callback has completed, except for posix CPU timers which have HAVE_POSIX_CPU_TIMERS_TASK_WORK enabled. In that case and on RT kernels the existing task could live lock when preempting the task which does the timer delivery. Replace spin_unlock() with an invocation of timer_wait_running() to handle it the same way as the other retry loops in the posix timer code. Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87v8g7c50d.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-05bpf: ensure main program has an extableKrister Johansen
commit 0108a4e9f3584a7a2c026d1601b0682ff7335d95 upstream. When subprograms are in use, the main program is not jit'd after the subprograms because jit_subprogs sets a value for prog->bpf_func upon success. Subsequent calls to the JIT are bypassed when this value is non-NULL. This leads to a situation where the main program and its func[0] counterpart are both in the bpf kallsyms tree, but only func[0] has an extable. Extables are only created during JIT. Now there are two nearly identical program ksym entries in the tree, but only one has an extable. Depending upon how the entries are placed, there's a chance that a fault will call search_extable on the aux with the NULL entry. Since jit_subprogs already copies state from func[0] to the main program, include the extable pointer in this state duplication. Additionally, ensure that the copy of the main program in func[0] is not added to the bpf_prog_kallsyms table. Instead, let the main program get added later in bpf_prog_load(). This ensures there is only a single copy of the main program in the kallsyms table, and that its tag matches the tag observed by tooling like bpftool. Cc: stable@vger.kernel.org Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-28bpf/btf: Accept function names that contain dotsFlorent Revest
[ Upstream commit 9724160b3942b0a967b91a59f81da5593f28b8ba ] When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn, pahole creates BTF_KIND_FUNC entries for these and this makes the BTF metadata validation fail because they contain a dot. In a dramatic turn of event, this BTF verification failure can cause the netfilter_bpf initialization to fail, causing netfilter_core to free the netfilter_helper hashmap and netfilter_ftp to trigger a use-after-free. The risk of u-a-f in netfilter will be addressed separately but the existence of "asan.module_ctor" debug info under some build conditions sounds like a good enough reason to accept functions that contain dots in BTF. Although using only LLVM=1 is the recommended way to compile clang-based kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still try to support that combination according to Nick. To clarify: - > v5.10 kernel, LLVM=1 (LLVM_IAS=0 is not the default) is recommended, but user can still have LLVM=1, LLVM_IAS=0 to trigger the issue - <= 5.10 kernel, LLVM=1 (LLVM_IAS=0 is the default) is recommended in which case GNU as will be used Fixes: 1dc92851849c ("bpf: kernel side support for BTF Var and DataSec") Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Yonghong Song <yhs@meta.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28bpf: Fix verifier id tracking of scalars on spillMaxim Mikityanskiy
[ Upstream commit 713274f1f2c896d37017efee333fd44149710119 ] The following scenario describes a bug in the verifier where it incorrectly concludes about equivalent scalar IDs which could lead to verifier bypass in privileged mode: 1. Prepare a 32-bit rogue number. 2. Put the rogue number into the upper half of a 64-bit register, and roll a random (unknown to the verifier) bit in the lower half. The rest of the bits should be zero (although variations are possible). 3. Assign an ID to the register by MOVing it to another arbitrary register. 4. Perform a 32-bit spill of the register, then perform a 32-bit fill to another register. Due to a bug in the verifier, the ID will be preserved, although the new register will contain only the lower 32 bits, i.e. all zeros except one random bit. At this point there are two registers with different values but the same ID, which means the integrity of the verifier state has been corrupted. 5. Compare the new 32-bit register with 0. In the branch where it's equal to 0, the verifier will believe that the original 64-bit register is also 0, because it has the same ID, but its actual value still contains the rogue number in the upper half. Some optimizations of the verifier prevent the actual bypass, so extra care is needed: the comparison must be between two registers, and both branches must be reachable (this is why one random bit is needed). Both branches are still suitable for the bypass. 6. Right shift the original register by 32 bits to pop the rogue number. 7. Use the rogue number as an offset with any pointer. The verifier will believe that the offset is 0, while in reality it's the given number. The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for SCALAR_VALUE. If the spill is narrowing the actual register value, don't keep the ID, make sure it's reset to 0. Fixes: 354e8f1970f8 ("bpf: Support <8-byte scalar spill and refill") Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Andrii Nakryiko <andrii@kernel.org> # Checked veristat delta Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230607123951.558971-2-maxtram95@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28bpf: track immediate values written to stack by BPF_ST instructionEduard Zingerman
[ Upstream commit ecdf985d7615356b78241fdb159c091830ed0380 ] For aligned stack writes using BPF_ST instruction track stored values in a same way BPF_STX is handled, e.g. make sure that the following commands produce similar verifier knowledge: fp[-8] = 42; r1 = 42; fp[-8] = r1; This covers two cases: - non-null values written to stack are stored as spill of fake registers; - null values written to stack are stored as STACK_ZERO marks. Previously both cases above used STACK_MISC marks instead. Some verifier test cases relied on the old logic to obtain STACK_MISC marks for some stack values. These test cases are updated in the same commit to avoid failures during bisect. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 713274f1f2c8 ("bpf: Fix verifier id tracking of scalars on spill") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28cgroup: Do not corrupt task iteration when rebinding subsystemXiu Jianfeng
commit 6f363f5aa845561f7ea496d8b1175e3204470486 upstream. We found a refcount UAF bug as follows: refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148 Workqueue: events cpuset_hotplug_workfn Call trace: refcount_warn_saturate+0xa0/0x148 __refcount_add.constprop.0+0x5c/0x80 css_task_iter_advance_css_set+0xd8/0x210 css_task_iter_advance+0xa8/0x120 css_task_iter_next+0x94/0x158 update_tasks_root_domain+0x58/0x98 rebuild_root_domains+0xa0/0x1b0 rebuild_sched_domains_locked+0x144/0x188 cpuset_hotplug_workfn+0x138/0x5a0 process_one_work+0x1e8/0x448 worker_thread+0x228/0x3e0 kthread+0xe0/0xf0 ret_from_fork+0x10/0x20 then a kernel panic will be triggered as below: Unable to handle kernel paging request at virtual address 00000000c0000010 Call trace: cgroup_apply_control_disable+0xa4/0x16c rebind_subsystems+0x224/0x590 cgroup_destroy_root+0x64/0x2e0 css_free_rwork_fn+0x198/0x2a0 process_one_work+0x1d4/0x4bc worker_thread+0x158/0x410 kthread+0x108/0x13c ret_from_fork+0x10/0x18 The race that cause this bug can be shown as below: (hotplug cpu) | (umount cpuset) mutex_lock(&cpuset_mutex) | mutex_lock(&cgroup_mutex) cpuset_hotplug_workfn | rebuild_root_domains | rebind_subsystems update_tasks_root_domain | spin_lock_irq(&css_set_lock) css_task_iter_start | list_move_tail(&cset->e_cset_node[ss->id] while(css_task_iter_next) | &dcgrp->e_csets[ss->id]); css_task_iter_end | spin_unlock_irq(&css_set_lock) mutex_unlock(&cpuset_mutex) | mutex_unlock(&cgroup_mutex) Inside css_task_iter_start/next/end, css_set_lock is hold and then released, so when iterating task(left side), the css_set may be moved to another list(right side), then it->cset_head points to the old list head and it->cset_pos->next points to the head node of new list, which can't be used as struct css_set. To fix this issue, switch from all css_sets to only scgrp's css_sets to patch in-flight iterators to preserve correct iteration, and then update it->cset_head as well. Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com> Link: https://www.spinics.net/lists/cgroups/msg37935.html Suggested-by: Michal Koutný <mkoutny@suse.com> Link: https://lore.kernel.org/all/20230526114139.70274-1-xiujianfeng@huaweicloud.com/ Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Fixes: 2d8f243a5e6e ("cgroup: implement cgroup->e_csets[]") Cc: stable@vger.kernel.org # v3.16+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-28tick/common: Align tick period during sched_timer setupThomas Gleixner
commit 13bb06f8dd42071cb9a49f6e21099eea05d4b856 upstream. The tick period is aligned very early while the first clock_event_device is registered. At that point the system runs in periodic mode and switches later to one-shot mode if possible. The next wake-up event is programmed based on the aligned value (tick_next_period) but the delta value, that is used to program the clock_event_device, is computed based on ktime_get(). With the subtracted offset, the device fires earlier than the exact time frame. With a large enough offset the system programs the timer for the next wake-up and the remaining time left is too small to make any boot progress. The system hangs. Move the alignment later to the setup of tick_sched timer. At this point the system switches to oneshot mode and a high resolution clocksource is available. At this point it is safe to align tick_next_period because ktime_get() will now return accurate (not jiffies based) time. [bigeasy: Patch description + testing]. Fixes: e9523a0d81899 ("tick/common: Align tick period with the HZ tick.") Reported-by: Mathias Krause <minipli@grsecurity.net> Reported-by: "Bhatnagar, Rishabh" <risbhat@amazon.com> Suggested-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Richard W.M. Jones <rjones@redhat.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Acked-by: SeongJae Park <sj@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/5a56290d-806e-b9a5-f37c-f21958b5a8c0@grsecurity.net Link: https://lore.kernel.org/12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com Link: https://lore.kernel.org/r/20230615091830.RxMV2xf_@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-28tracing: Add tracing_reset_all_online_cpus_unlocked() functionSteven Rostedt (Google)
commit e18eb8783ec4949adebc7d7b0fdb65f65bfeefd9 upstream. Currently the tracing_reset_all_online_cpus() requires the trace_types_lock held. But only one caller of this function actually has that lock held before calling it, and the other just takes the lock so that it can call it. More users of this function is needed where the lock is not held. Add a tracing_reset_all_online_cpus_unlocked() function for the one use case that calls it without being held, and also add a lockdep_assert to make sure it is held when called. Then have tracing_reset_all_online_cpus() take the lock internally, such that callers do not need to worry about taking it. Link: https://lkml.kernel.org/r/20221123192741.658273220@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace periodZiwei Dai
commit 5da7cb193db32da783a3f3e77d8b639989321d48 upstream. Memory passed to kvfree_rcu() that is to be freed is tracked by a per-CPU kfree_rcu_cpu structure, which in turn contains pointers to kvfree_rcu_bulk_data structures that contain pointers to memory that has not yet been handed to RCU, along with an kfree_rcu_cpu_work structure that tracks the memory that has already been handed to RCU. These structures track three categories of memory: (1) Memory for kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived during an OOM episode. The first two categories are tracked in a cache-friendly manner involving a dynamically allocated page of pointers (the aforementioned kvfree_rcu_bulk_data structures), while the third uses a simple (but decidedly cache-unfriendly) linked list through the rcu_head structures in each block of memory. On a given CPU, these three categories are handled as a unit, with that CPU's kfree_rcu_cpu_work structure having one pointer for each of the three categories. Clearly, new memory for a given category cannot be placed in the corresponding kfree_rcu_cpu_work structure until any old memory has had its grace period elapse and thus has been removed. And the kfree_rcu_monitor() function does in fact check for this. Except that the kfree_rcu_monitor() function checks these pointers one at a time. This means that if the previous kfree_rcu() memory passed to RCU had only category 1 and the current one has only category 2, the kfree_rcu_monitor() function will send that current category-2 memory along immediately. This can result in memory being freed too soon, that is, out from under unsuspecting RCU readers. To see this, consider the following sequence of events, in which: o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", then is preempted. o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset" after a later grace period. Except that "from_cset" is freed right after the previous grace period ended, so that "from_cset" is immediately freed. Task A resumes and references "from_cset"'s member, after which nothing good happens. In full detail: CPU 0 CPU 1 ---------------------- ---------------------- count_memcg_event_mm() |rcu_read_lock() <--- |mem_cgroup_from_task() |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq comes, current task is scheduled out. cgroup_attach_task() |cgroup_migrate() |cgroup_migrate_execute() |css_set_move_task(task, from_cset, to_cset, true) |cgroup_move_task(task, to_cset) |rcu_assign_pointer(.., to_cset) |... |cgroup_migrate_finish() |put_css_set_locked(from_cset) |from_cset->refcount return 0 |kfree_rcu(cset, rcu_head) // free from_cset after new gp |add_ptr_to_bulk_krc_lock() |schedule_delayed_work(&krcp->monitor_work, ..) kfree_rcu_monitor() |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] |queue_rcu_work(system_wq, &krwp->rcu_work) |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp // There is a perious call_rcu(.., rcu_work_rcufn) // gp end, rcu_work_rcufn() is called. rcu_work_rcufn() |__queue_work(.., rwork->wq, &rwork->work); |kfree_rcu_work() |krwp->bulk_head_free[0] bulk is freed before new gp end!!! |The "from_cset" is freed before new gp end. // the task resumes some time later. |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. This commit therefore causes kfree_rcu_monitor() to refrain from moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU grace period has completed for all three categories. v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()") Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21cgroup: always put cset in cgroup_css_set_put_forkJohn Sperbeck
commit 2bd110339288c18823dcace602b63b0d8627e520 upstream. A successful call to cgroup_css_set_fork() will always have taken a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always do a corresponding put in cgroup_css_set_put_fork(). Without this, a cset and its contained css structures will be leaked for some fork failures. The following script reproduces the leak for a fork failure due to exceeding pids.max in the pids controller. A similar thing can happen if we jump to the bad_fork_cancel_cgroup label in copy_process(). [ -z "$1" ] && echo "Usage $0 pids-root" && exit 1 PID_ROOT=$1 CGROUP=$PID_ROOT/foo [ -e $CGROUP ] && rmdir -f $CGROUP mkdir $CGROUP echo 5 > $CGROUP/pids.max echo $$ > $CGROUP/cgroup.procs fork_bomb() { set -e for i in $(seq 10); do /bin/sleep 3600 & done } (fork_bomb) & wait echo $$ > $PID_ROOT/cgroup.procs kill $(cat $CGROUP/cgroup.procs) rmdir $CGROUP Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> [TJM: This backport accommodates the lack of cgroup_unlock] Signed-off-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21kexec: support purgatories with .text.hot sectionsRicardo Ribalda
commit 8652d44f466ad5772e7d1756e9457046189b0dfc upstream. Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7. When upreving llvm I realised that kexec stopped working on my test platform. The reason seems to be that due to PGO there are multiple .text sections on the purgatory, and kexec does not supports that. This patch (of 4): Clang16 links the purgatory text in two sections when PGO is in use: [ 1] .text PROGBITS 0000000000000000 00000040 00000000000011a1 0000000000000000 AX 0 0 16 [ 2] .rela.text RELA 0000000000000000 00003498 0000000000000648 0000000000000018 I 24 1 8 ... [17] .text.hot. PROGBITS 0000000000000000 00003220 000000000000020b 0000000000000000 AX 0 0 1 [18] .rela.text.hot. RELA 0000000000000000 00004428 0000000000000078 0000000000000018 I 24 17 8 And both of them have their range [sh_addr ... sh_addr+sh_size] on the area pointed by `e_entry`. This causes that image->start is calculated twice, once for .text and another time for .text.hot. The second calculation leaves image->start in a random location. Because of this, the system crashes immediately after: kexec_core: Starting new kernel Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Ross Zwisler <zwisler@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Philipp Rudo <prudo@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Simon Horman <horms@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-14bpf: Add extra path pointer check to d_path helperJiri Olsa
[ Upstream commit f46fab0e36e611a2389d3843f34658c849b6bd60 ] Anastasios reported crash on stable 5.15 kernel with following BPF attached to lsm hook: SEC("lsm.s/bprm_creds_for_exec") int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm) { struct path *path = &bprm->executable->f_path; char p[128] = { 0 }; bpf_d_path(path, p, 128); return 0; } But bprm->executable can be NULL, so bpf_d_path call will crash: BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI ... RIP: 0010:d_path+0x22/0x280 ... Call Trace: <TASK> bpf_d_path+0x21/0x60 bpf_prog_db9cf176e84498d9_bprm_creds_for_exec+0x94/0x99 bpf_trampoline_6442506293_0+0x55/0x1000 bpf_lsm_bprm_creds_for_exec+0x5/0x10 security_bprm_creds_for_exec+0x29/0x40 bprm_execve+0x1c1/0x900 do_execveat_common.isra.0+0x1af/0x260 __x64_sys_execve+0x32/0x40 It's problem for all stable trees with bpf_d_path helper, which was added in 5.9. This issue is fixed in current bpf code, where we identify and mark trusted pointers, so the above code would fail even to load. For the sake of the stable trees and to workaround potentially broken verifier in the future, adding the code that reads the path object from the passed pointer and verifies it's valid in kernel space. Fixes: 6e22ab9da793 ("bpf: Add d_path helper") Reported-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230606181714.532998-1-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14bpf: Fix UAF in task local storageKP Singh
[ Upstream commit b0fd1852bcc21accca6260ef245356d5c141ff66 ] When task local storage was generalized for tracing programs, the bpf_task_local_storage callback was moved from a BPF LSM hook callback for security_task_free LSM hook to it's own callback. But a failure case in bad_fork_cleanup_security was missed which, when triggered, led to a dangling task owner pointer and a subsequent use-after-free. Move the bpf_task_storage_free to the very end of free_task to handle all failure cases. This issue was noticed when a BPF LSM program was attached to the task_alloc hook on a kernel with KASAN enabled. The program used bpf_task_storage_get to copy the task local storage from the current task to the new task being created. Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs") Reported-by: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230602002612.1117381-1-kpsingh@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09tracing/probe: trace_probe_primary_from_call(): checked list_first_entryPietro Borrello
commit 81d0fa4cb4fc0e1a49c2b22f92c43d9fe972ebcf upstream. All callers of trace_probe_primary_from_call() check the return value to be non NULL. However, the function returns list_first_entry(&tpe->probes, ...) which can never be NULL. Additionally, it does not check for the list being possibly empty, possibly causing a type confusion on empty lists. Use list_first_entry_or_null() which solves both problems. Link: https://lore.kernel.org/linux-trace-kernel/20230128-list-entry-null-check-v1-1-8bde6a3da2ef@diag.uniroma1.it/ Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe") Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-05bpf: fix a memory leak in the LRU and LRU_PERCPU hash mapsAnton Protopopov
[ Upstream commit b34ffb0c6d23583830f9327864b9c1f486003305 ] The LRU and LRU_PERCPU maps allocate a new element on update before locking the target hash table bucket. Right after that the maps try to lock the bucket. If this fails, then maps return -EBUSY to the caller without releasing the allocated element. This makes the element untracked: it doesn't belong to either of free lists, and it doesn't belong to the hash table, so can't be re-used; this eventually leads to the permanent -ENOMEM on LRU map updates, which is unexpected. Fix this by returning the element to the local free list if bucket locking fails. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30bpf: Fix mask generation for 32-bit narrow loads of 64-bit fieldsWill Deacon
commit 0613d8ca9ab382caabe9ed2dceb429e9781e443f upstream. A narrow load from a 64-bit context field results in a 64-bit load followed potentially by a 64-bit right-shift and then a bitwise AND operation to extract the relevant data. In the case of a 32-bit access, an immediate mask of 0xffffffff is used to construct a 64-bit BPP_AND operation which then sign-extends the mask value and effectively acts as a glorified no-op. For example: 0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0) results in the following code generation for a 64-bit field: ldr x7, [x7] // 64-bit load mov x10, #0xffffffffffffffff and x7, x7, x10 Fix the mask generation so that narrow loads always perform a 32-bit AND operation: ldr x7, [x7] // 64-bit load mov w10, #0xffffffff and w7, w7, w10 Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Krzesimir Nowak <krzesimir@kinvolk.io> Cc: Andrey Ignatov <rdna@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-24bpf: Add preempt_count_{sub,add} into btf id deny listYafang
[ Upstream commit c11bd046485d7bf1ca200db0e7d0bdc4bafdd395 ] The recursion check in __bpf_prog_enter* and __bpf_prog_exit* leave preempt_count_{sub,add} unprotected. When attaching trampoline to them we get panic as follows, [ 867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28) [ 867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI [ 867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4 [ 867.843100] Call Trace: [ 867.843101] <TASK> [ 867.843104] asm_exc_int3+0x3a/0x40 [ 867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0 [ 867.843135] __bpf_prog_enter_recur+0x17/0x90 [ 867.843148] bpf_trampoline_6442468108_0+0x2e/0x1000 [ 867.843154] ? preempt_count_sub+0x1/0xa0 [ 867.843157] preempt_count_sub+0x5/0xa0 [ 867.843159] ? migrate_enable+0xac/0xf0 [ 867.843164] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843168] bpf_trampoline_6442468108_0+0x55/0x1000 ... [ 867.843788] preempt_count_sub+0x5/0xa0 [ 867.843793] ? migrate_enable+0xac/0xf0 [ 867.843829] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35) [ 867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c) [ 867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec) [ 867.843842] bpf_trampoline_6442468108_0+0x55/0x1000 ... That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are called after prog->active is decreased. Fixing this by adding these two functions into btf ids deny list. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang <laoar.shao@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24bpf: Annotate data races in bpf_local_storageKumar Kartikeya Dwivedi
[ Upstream commit 0a09a2f933c73dc76ab0b72da6855f44342a8903 ] There are a few cases where hlist_node is checked to be unhashed without holding the lock protecting its modification. In this case, one must use hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix this by using lockless variant in places not protected by the lock. Since this is not prompted by any actual KCSAN reports but only from code review, I have not included a fixes tag. Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230221200646.2500777-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24rcu: Protect rcu_print_task_exp_stall() ->exp_tasks accessZqiang
[ Upstream commit 3c1566bca3f8349f12b75d0a2d5e4a20ad6262ec ] For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can result in a NULL-pointer dereference: CPU1 CPU2 rcu_preempt_deferred_qs_irqrestore rcu_print_task_exp_stall if (special.b.blocked) READ_ONCE(rnp->exp_tasks) != NULL raw_spin_lock_rcu_node np = rcu_next_node_entry(t, rnp) if (&t->rcu_node_entry == rnp->exp_tasks) WRITE_ONCE(rnp->exp_tasks, np) .... raw_spin_unlock_irqrestore_rcu_node raw_spin_lock_irqsave_rcu_node t = list_entry(rnp->exp_tasks->prev, struct task_struct, rcu_node_entry) (if rnp->exp_tasks is NULL, this will dereference a NULL pointer) The problem is that CPU2 accesses the rcu_node structure's->exp_tasks field without holding the rcu_node structure's ->lock and CPU2 did not observe CPU1's change to rcu_node structure's ->exp_tasks in time. Therefore, if CPU1 sets rcu_node structure's->exp_tasks pointer to NULL, then CPU2 might dereference that NULL pointer. This commit therefore holds the rcu_node structure's ->lock while accessing that structure's->exp_tasks field. [ paulmck: Apply Frederic Weisbecker feedback. ] Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Sasha Levin <sashal@kernel.org>