path: root/kernel
Age    Commit message    Author
2021-10-19tracing: Use linker magic instead of recasting ftrace_ops_list_func()Steven Rostedt (VMware)
In an effort to enable -Wcast-function-type in the top-level Makefile to support Control Flow Integrity builds, all function casts need to be removed. This means that ftrace_ops_list_func() can no longer be defined as ftrace_ops_no_ops(). ftrace_ops_no_ops() exists for architectures that call ftrace_ops_list_func() with only two parameters (called from assembly): to make sure there are no C side effects, those archs call ftrace_ops_no_ops(), which takes only two parameters, whereas ftrace_ops_list_func() takes four. Instead of a typecast, use vmlinux.lds.h to alias ftrace_ops_list_func() to arch_ftrace_ops_list_func(), which is declared with the proper set of parameters. Link: https://lore.kernel.org/r/20200614070154.6039-1-oscar.carter@gmx.com Link: https://lkml.kernel.org/r/20200617165616.52241bde@oasis.local.home Link: https://lore.kernel.org/all/20211005053922.GA702049@embeddedor/ Requested-by: Oscar Carter <oscar.carter@gmx.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
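A minimal sketch of the linker-alias idea follows; the macro name and config guard here are assumptions, not the exact upstream change. The point is that vmlinux.lds.h can emit a link-time alias so that the symbol referenced by architectures (ftrace_ops_list_func) and the C implementation (arch_ftrace_ops_list_func) resolve to the same address, letting each side declare the prototype it needs with no function-pointer cast:

    /* Hypothetical fragment for include/asm-generic/vmlinux.lds.h */
    #ifdef CONFIG_FUNCTION_TRACER
    # define FTRACE_OPS_LIST_FUNC  ftrace_ops_list_func = arch_ftrace_ops_list_func;
    #else
    # define FTRACE_OPS_LIST_FUNC
    #endif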
2021-10-19bpf: Silence Coverity warning for find_kfunc_desc_btfKumar Kartikeya Dwivedi
The helper function returns a pointer that in the failure case encodes an error in the struct btf pointer. The current code led to a Coverity warning about the use of the invalid pointer: *** CID 1507963: Memory - illegal accesses (USE_AFTER_FREE) /kernel/bpf/verifier.c: 1788 in find_kfunc_desc_btf() 1782 return ERR_PTR(-EINVAL); 1783 } 1784 1785 kfunc_btf = __find_kfunc_desc_btf(env, offset, btf_modp); 1786 if (IS_ERR_OR_NULL(kfunc_btf)) { 1787 verbose(env, "cannot find module BTF for func_id %u\n", func_id); >>> CID 1507963: Memory - illegal accesses (USE_AFTER_FREE) >>> Using freed pointer "kfunc_btf". 1788 return kfunc_btf ?: ERR_PTR(-ENOENT); 1789 } 1790 return kfunc_btf; 1791 } 1792 return btf_vmlinux ?: ERR_PTR(-ENOENT); 1793 } Daniel suggested the use of ERR_CAST so that the intended use is clear to Coverity, but on closer look it seems that we never return NULL from the helper. Andrii noted that since __find_kfunc_desc_btf already logs errors for all cases except btf_get_by_fd, it is much easier to add logging for that and remove the IS_ERR check altogether, returning directly from it. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211009040900.803436-1-memxor@gmail.com
2021-10-19Revert "PM: sleep: Do not assume that "mem" is always present"Rafael J. Wysocki
Revert commit bfcc1e67ff1e ("PM: sleep: Do not assume that "mem" is always present"), because it breaks compatibility with user space utilities assuming that "mem" will always be present in /sys/power/state. Fixes: bfcc1e67ff1e ("PM: sleep: Do not assume that "mem" is always present") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-10-19workqueue: make sysfs of unbound kworker cpumask more cleverMenglong Dong
Some unfriendly components, such as dpdk, write the same mask to the unbound kworker cpumask again and again. Every time this interface is written, some work is queued to CPUs, even though the mask is the same as the original mask. Fix it by returning success and doing nothing if the new cpumask is equal to the old one. Signed-off-by: Mengen Sun <mengensun@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Tejun Heo <tj@kernel.org>
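A hedged sketch of the early return (heavily simplified from workqueue_set_unbound_cpumask() in kernel/workqueue.c; locking and the apply path are omitted, and the final helper name is a placeholder):

    int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)
    {
        cpumask_and(cpumask, cpumask, cpu_possible_mask);
        if (cpumask_empty(cpumask))
            return -EINVAL;

        /* Nothing changed: succeed without queuing any work. */
        if (cpumask_equal(cpumask, wq_unbound_cpumask))
            return 0;

        /* ...otherwise take the attrs lock and apply the new mask as before... */
        return apply_new_unbound_cpumask(cpumask);    /* placeholder name */
    }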
2021-10-19ucounts: Proper error handling in set_cred_ucountsEric W. Biederman
Instead of leaking the ucounts in new if alloc_ucounts fails, store the result of alloc_ucounts into a temporary variable, which is later assigned to new->ucounts. Cc: stable@vger.kernel.org Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
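A hedged sketch of the shape of the fix (simplified from set_cred_ucounts() in kernel/cred.c; the short-circuit checks and the release of the old reference are omitted):

    int set_cred_ucounts(struct cred *new)
    {
        struct ucounts *new_ucounts;

        /*
         * Allocate into a local first so new->ucounts keeps pointing at a
         * valid, refcounted structure if the allocation fails.
         */
        new_ucounts = alloc_ucounts(new->user_ns, new->euid);
        if (!new_ucounts)
            return -EAGAIN;

        new->ucounts = new_ucounts;
        return 0;
    }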
2021-10-19ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucounts in commit_credsEric W. Biederman
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds is to change which rlimit counter is used to track a process when the credentials change. Use the same test for both to guarantee the tracking is correct. Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19sched/scs: Reset the shadow stack when idle_task_exitWoody Lin
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") removed the init_idle() call from idle_thread_get(). This was the sole call-path on hotplug that resets the Shadow Call Stack (scs) Stack Pointer (sp). Not resetting the scs-sp leads to scs overflow after enough hotplug cycles. Therefore add an explicit scs_task_reset() to the hotplug code to make sure the scs-sp does get reset on hotplug. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Woody Lin <woodylin@google.com> [peterz: Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
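A hedged sketch of where the reset lands (modelled on idle_task_exit() in kernel/sched/core.c; treat the exact call site as an assumption):

    void idle_task_exit(void)
    {
        struct mm_struct *mm = current->active_mm;

        BUG_ON(cpu_online(smp_processor_id()));

        if (mm != &init_mm) {
            switch_mm(mm, &init_mm, current);
            finish_arch_post_lock_switch();
        }

        /*
         * Reset the shadow call stack so repeated hotplug cycles cannot
         * overflow the idle task's SCS.
         */
        scs_task_reset(current);
    }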
2021-10-19locking/rwsem: Fix comments about reader optimistic lock stealing conditionsYanfei Xu
After commit 617f3ef95177 ("locking/rwsem: Remove reader optimistic spinning"), readers no longer support optimistic spinning, so there is no need to meet the condition that the OSQ is empty. Also, add an unlikely() for the max reader wakeup check in the loop. Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211013134154.1085649-4-yanfei.xu@windriver.com
2021-10-19locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()Yanfei Xu
preempt_disable/enable() is equivalent to an RCU read-side critical section, and the spinning code in mutex and rwsem already ensures that preemption is disabled. So remove the unnecessary rcu_read_lock/unlock to save some cycles in these hot paths. Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211013134154.1085649-2-yanfei.xu@windriver.com
2021-10-19locking/rwsem: Disable preemption for spinning regionYanfei Xu
The spinning region rwsem_spin_on_owner() should not be preempted, however rwsem_down_write_slowpath() invokes it without disabling preemption. Fix it by adding a pair of preempt_disable/enable() calls. Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> [peterz: Fix CONFIG_RWSEM_SPIN_ON_OWNER=n build] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211013134154.1085649-3-yanfei.xu@windriver.com
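A hedged sketch of the intent, as a fragment of the writer slow path (simplified from kernel/locking/rwsem.c):

    /* rwsem_down_write_slowpath(), simplified: spin only with preemption off. */
    preempt_disable();
    owner_state = rwsem_spin_on_owner(sem);
    preempt_enable();

    if (owner_state == OWNER_NULL)
        goto trylock_again;    /* the owner may have released the lock */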
2021-10-19futex: Fix PREEMPT_RT buildPeter Zijlstra
Mike reported that rcuwait went walk-about and is causing failures on the PREEMPT_RT builds, restore it. Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-19block: don't call blk_status_to_errno in blk_update_requestChristoph Hellwig
We only need to call it to resolve the blk_status_t -> errno mapping for tracing, so move the conversion into the tracepoints that are not called at all when tracing isn't enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18bpf: Rename BTF_KIND_TAG to BTF_KIND_DECL_TAGYonghong Song
Patch set [1] introduced BTF_KIND_TAG to allow tagging declarations for struct/union, struct/union field, var, func and func arguments and these tags will be encoded into dwarf. They are also encoded to btf by llvm for the bpf target. After BTF_KIND_TAG is introduced, we intended to use it for kernel __user attributes. But kernel __user is actually a type attribute. Upstream and internal discussion showed it is not a good idea to mix declaration attribute and type attribute. So we proposed to introduce btf_type_tag as a type attribute and existing btf_tag renamed to btf_decl_tag ([2]). This patch renamed BTF_KIND_TAG to BTF_KIND_DECL_TAG and some other declarations with *_tag to *_decl_tag to make it clear the tag is for declaration. In the future, BTF_KIND_TYPE_TAG might be introduced per [3]. [1] https://lore.kernel.org/bpf/20210914223004.244411-1-yhs@fb.com/ [2] https://reviews.llvm.org/D111588 [3] https://reviews.llvm.org/D111199 Fixes: b5ea834dde6b ("bpf: Support for new btf kind BTF_KIND_TAG") Fixes: 5b84bd10363e ("libbpf: Add support for BTF_KIND_TAG") Fixes: 5c07f2fec003 ("bpftool: Add support for BTF_KIND_TAG") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211012164838.3345699-1-yhs@fb.com
2021-10-18audit: return early if the filter rule has a lower priorityGaosheng Cui
It is not necessary for audit_filter_rules() to check the audit fields of a rule with a lower priority, and if we did, there might be some unintended effects, such as ctx->ppid being changed unexpectedly, so return early if the rule has a lower priority. Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> [PM: slight tweak to the subject line] Signed-off-by: Paul Moore <paul@paul-moore.com>
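A hedged sketch of the early return (the prio fields mirror how audit_filter_rules() already tracks rule priority, but treat the exact condition as an assumption):

    /* At the top of audit_filter_rules(), before walking rule->fields: */
    if (ctx && rule->prio <= ctx->prio)
        return 0;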
2021-10-18audit: fix possible null-pointer dereference in audit_filter_rulesGaosheng Cui
Fix possible null-pointer dereference in audit_filter_rules. audit_filter_rules() error: we previously assumed 'ctx' could be null Cc: stable@vger.kernel.org Fixes: bf361231c295 ("audit: add saddr_fam filter field") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-18tracing: Have all levels of checks prevent recursionSteven Rostedt (VMware)
While writing an email explaining the "bit = 0" logic for a discussion on making ftrace_test_recursion_trylock() disable preemption, I discovered a path that makes the "not do the logic if bit is zero" trick unsafe. The recursion logic is done in hot paths like the function tracer. Thus, any code executed causes noticeable overhead. Thus, tricks are done to try to limit the amount of code executed. This included the recursion testing logic. Having recursion testing is important, as there are many paths that can end up in an infinite recursion cycle when tracing every function in the kernel. Thus protection is needed to prevent that from happening. Because it is OK to recurse due to different running context levels (e.g. an interrupt preempts a trace, and then a trace occurs in the interrupt handler), a set of bits is used to know which context one is in (normal, softirq, irq and NMI). If a recursion occurs in the same level, it is prevented*. Then there are infrastructure levels of recursion as well. When more than one callback is attached to the same function to trace, it calls a loop function to iterate over all the callbacks. Both the callbacks and the loop function have recursion protection. The callbacks use ftrace_test_recursion_trylock(), which has a "function" set of context bits to test, and the loop function calls the internal trace_test_and_set_recursion() directly, with an "internal" set of bits. If an architecture does not implement all the features supported by ftrace, then the callbacks are never called directly and the loop function is called instead, which will implement the features of ftrace. Since both the loop function and the callbacks do recursion protection, it seemed unnecessary to do it in both locations. Thus, a trick was made to have the internal set of recursion bits at a more significant bit location than the function bits. Then, if any of the higher bits were set, the logic of the function bits could be skipped, as any new recursion would first have to go through the loop function. This is true for architectures that do not support all the ftrace features, because all functions being traced must first go through the loop function before going to the callbacks. But this is not true for architectures that support all the ftrace features. That's because the loop function could be called due to two callbacks attached to the same function, but the callback could then trace a function that does not share any other callback, and that callback will be called directly. i.e. traced_function_1: [ more than one callback tracing it ] call loop_func loop_func: trace_recursion set internal bit call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 traced_function_2: [ only traced by above callback ] call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 [ wash, rinse, repeat, BOOM! out of shampoo! ] Thus, the "bit == 0 skip" trick is not safe unless the loop function is called for all functions. Since we want to encourage architectures to implement all ftrace features, having them slow down due to this extra logic may encourage the maintainers to update to the latest ftrace features. And because this logic is only safe for architectures that do not support all the ftrace features, remove it completely.
[*] There is one layer of recursion that is allowed, and that is to allow for the transition between interrupt contexts (normal -> softirq -> irq -> NMI), because a trace may occur before the context update is visible to the trace recursion logic. Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/ Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jisheng Zhang <jszhang@kernel.org> Cc: 王贇 <yun.wang@linux.alibaba.com> Cc: Guo Ren <guoren@kernel.org> Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-18ucounts: Fix signal ucount refcountingEric W. Biederman
In commit fda31c50292a ("signal: avoid double atomic counter increments for user accounting") Linus made a clever optimization to how rlimits and the struct user_struct were handled. Unfortunately that optimization does not work in the obvious way when moved to nested rlimits. The problem is that the last decrement of the per user namespace per user sigpending counter might also be the last decrement of the sigpending counter in the parent user namespace as well. Which means that simply freeing the leaf ucount in __free_sigqueue is not enough. Maintain the optimization and handle the tricky cases by introducing inc_rlimit_get_ucounts and dec_rlimit_put_ucounts. By moving the entire optimization into functions that perform all of the work it becomes possible to ensure that every level is handled properly. The new function inc_rlimit_get_ucounts returns 0 on failure to increment the ucount. This is different than inc_rlimit_ucounts, which increments the ucounts and returns LONG_MAX if the ucount counter has exceeded its maximum or has wrapped (to indicate the counter needs to be decremented). I wish we had a single user to account all pending signals to across all of the threads of a process, so that this complexity was not necessary. Cc: stable@vger.kernel.org Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133 Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133 Reviewed-by: Alexey Gladkov <legion@kernel.org> Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk> Tested-by: Yu Zhao <yuzhao@google.com> Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-18sched: move the <linux/blkdev.h> include out of kernel/sched/sched.hChristoph Hellwig
Only core.c needs blkdev.h, so move the #include statement there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18kernel: remove spurious blkdev.h includesChristoph Hellwig
Various files have acquired spurious includes of <linux/blkdev.h> over time. Remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18mm/filemap: Add filemap_add_folio()Matthew Wilcox (Oracle)
Convert __add_to_page_cache_locked() into __filemap_add_folio(). Add an assertion to it that (for !hugetlbfs), the folio is naturally aligned within the file. Move the prototype from mm.h to pagemap.h. Convert add_to_page_cache_lru() into filemap_add_folio(). Add a compatibility wrapper for unconverted callers. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNCHamza Mahfooz
Mapping something twice should be possible as long as DMA_ATTR_SKIP_CPU_SYNC is passed to the (strictly speaking) second relevant mapping operation that attempts to map the same thing. So don't issue a warning if that condition is met in add_dma_entry(). Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
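A hedged sketch of the check, as a fragment in the spirit of add_dma_entry() in kernel/dma/debug.c (treat helper names and message text as approximations):

    rc = active_cacheline_insert(entry);
    if (rc == -ENOMEM) {
        pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
        global_disable = true;
    } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
        /* Overlap is tolerated when the caller explicitly skips CPU syncs. */
        err_printk(entry->dev, entry,
                   "cacheline tracking EEXIST, overlapping mappings aren't supported\n");
    }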
2021-10-16reboot: export symbol 'reboot_mode'Shawn Guo
Some drivers like Qualcomm pm8941-pwrkey need to access 'reboot_mode' for triggering reboot between cold and warm mode. Export the symbol, so that drivers built as module can still access the symbol. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Link: https://lore.kernel.org/r/20210714095850.27185-2-shawn.guo@linaro.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
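A hedged sketch of the export (whether the export uses the _GPL variant is an assumption):

    /* kernel/reboot.c */
    enum reboot_mode reboot_mode DEFAULT_REBOOT_MODE;
    EXPORT_SYMBOL_GPL(reboot_mode);

A driver built as a module can then, for example, test reboot_mode == REBOOT_WARM to decide how to trigger the reset.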
2021-10-16Merge tag 'trace-v5.15-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Tracing fixes for 5.15: - Fix defined-but-not-used warning/error for osnoise function - Fix memory leak in event probe - Fix memblock leak in bootconfig - Fix the API of event probes to be like kprobes - Add test to check removal of event probe API - Fix recordmcount.pl build failure for nds32 * tag 'trace-v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: nds32/ftrace: Fix Error: invalid operands (*UND* and *UND* sections) for `^' selftests/ftrace: Update test for more eprobe removal process tracing: Fix event probe removal from dynamic events tracing: Fix missing * in comment block bootconfig: init: Fix memblock leak in xbc_make_cmdline() tracing: Fix memory leak in eprobe_register() tracing: Fix missing osnoise tracer on max_latency
2021-10-15perf/core: Allow ftrace for functions in kernel/event/core.cSong Liu
It is useful to trace functions in kernel/event/core.c. Allow ftrace for them by removing $(CC_FLAGS_FTRACE) from Makefile. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211006210732.2826289-1-songliubraving@fb.com
2021-10-15perf/x86: Add new event for AUX output counter indexAdrian Hunter
PEBS-via-PT records contain a mask of applicable counters. To identify which event belongs to which counter, a side-band event is needed. Until now, there has been no side-band event, and consequently users were limited to using a single event. Add such a side-band event. Note the event is optimised to output only when the counter index changes for an event. That works only so long as all PEBS-via-PT events are scheduled together, which they are for a recording session because they are in a single group. Also no attribute bit is used to select the new event, so a new kernel is not compatible with older perf tools. The assumption being that PEBS-via-PT is sufficiently esoteric that users will not be troubled by this. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210907163903.11820-2-adrian.hunter@intel.com
2021-10-15irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RTSebastian Andrzej Siewior
On PREEMPT_RT most items are processed as LAZY via softirq context. Avoid spin-waiting for them because irq_work_sync() could have higher priority and never allow the irq-work to be completed. Additionally wait for !IRQ_WORK_HARD_IRQ irq_work items on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211006111852.1514359-5-bigeasy@linutronix.de
2021-10-15irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RTSebastian Andrzej Siewior
The irq_work callback is invoked in hard IRQ context. By default all callbacks are scheduled for invocation right away (given supported by the architecture) except for the ones marked IRQ_WORK_LAZY which are delayed until the next timer-tick. While looking over the callbacks, some of them may acquire locks (spinlock_t, rwlock_t) which are transformed into sleeping locks on PREEMPT_RT and must not be acquired in hard IRQ context. Changing the locks into locks which could be acquired in this context will lead to other problems such as increased latencies if everything in the chain has IRQ-off locks. This will not solve all the issues as one callback has been noticed which invoked kref_put() and its callback invokes kfree() and this can not be invoked in hardirq context. Some callbacks are required to be invoked in hardirq context even on PREEMPT_RT to work properly. This includes for instance the NO_HZ callback which needs to be able to observe the idle context. The callbacks which require to be run in hardirq have already been marked. Use this information to split the callbacks onto the two lists on PREEMPT_RT: - lazy_list Work items which are not marked with IRQ_WORK_HARD_IRQ will be added to this list. Callbacks on this list will be invoked from a per-CPU thread. The handler here may acquire sleeping locks such as spinlock_t and invoke kfree(). - raised_list Work items which are marked with IRQ_WORK_HARD_IRQ will be added to this list. They will be invoked in hardirq context and must not acquire any sleeping locks. The wake up of the per-CPU thread occurs from irq_work handler/ hardirq context. The thread runs with lowest RT priority to ensure it runs before any SCHED_OTHER tasks do. [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant. Collected fixes over time from Steven Rostedt and Mike Galbraith. Move to per-CPU threads instead of softirq as suggested by PeterZ.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211007092646.uhshe3ut2wkrcfzv@linutronix.de
2021-10-15irq_work: Allow irq_work_sync() to sleep if irq_work() has no IRQ support.Sebastian Andrzej Siewior
irq_work() instantly triggers an interrupt if supported by the architecture; otherwise the work will be processed on the next timer tick. In the worst case irq_work_sync() could spin for up to a jiffy. irq_work_sync() is usually used in tear-down context, which is fully preemptible. Based on review, irq_work_sync() is invoked from preemptible context and there is one waiter at a time. This qualifies it to use rcuwait for synchronisation. Let irq_work_sync() synchronize with rcuwait if the architecture processes irq_work via the timer tick. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211006111852.1514359-3-bigeasy@linutronix.de
2021-10-15sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQSebastian Andrzej Siewior
The push-IPI logic for RT tasks expects to be invoked from hardirq context. One reason is that an RT task on the remote CPU would block the softirq processing on PREEMPT_RT and so prevent pulling / balancing the RT tasks as intended. Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de
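A hedged sketch of the annotation (IRQ_WORK_INIT_HARD() is the initializer for hard-irq work items; the exact init site in init_rootdomain() is an assumption):

    /* kernel/sched/topology.c: init_rootdomain(), simplified */
    #ifdef HAVE_RT_PUSH_IPI
        rd->rto_cpu = -1;
        raw_spin_lock_init(&rd->rto_lock);
        rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);
    #endif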
2021-10-15sched: Add cluster scheduler level in core and related Kconfig for ARM64Barry Song
This patch adds scheduler level for clusters and automatically enables the load balance among clusters. It will directly benefit a lot of workload which loves more resources such as memory bandwidth, caches. Testing has widely been done in two different hardware configurations of Kunpeng920: 24 cores in one NUMA(6 clusters in each NUMA node); 32 cores in one NUMA(8 clusters in each NUMA node) Workload is running on either one NUMA node or four NUMA nodes, thus, this can estimate the effect of cluster spreading w/ and w/o NUMA load balance. * Stream benchmark: 4threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%) MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%) MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%) MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%) 6threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%) MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%) MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%) MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%) 12threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%) MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%) MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%) MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%) Thus, it could help memory-bound workload especially under medium load. Similar improvement is also seen in lkp-pbzip2: * lkp-pbzip2 benchmark 2-96 threads (on 4NUMA * 24cores = 96cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%* Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%* Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%* Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%* Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%* Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%* Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%* Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%* Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%* 2-24 threads (on 1NUMA * 24cores = 24cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%* Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%* Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%* Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%* Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%* Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%) Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%* In the case of 6 threads and 8 threads, we see the greatest performance improvement. 
Similar improvement can be seen on lkp-pixz though the improvement is smaller: * lkp-pixz benchmark 2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores) lkp-pixz lkp-pixz w/o patch w/ patch Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%* Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%) Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%* Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%* Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%* Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%* Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%* * SPECrate benchmark 4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores) Base Base Run Time Rate ------- --------- 4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3) 8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%) 16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6) 32 Copies(on 4NUMA * 32 cores = 128cores) [w/o patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 584 87.2 * 502.gcc_r 32 503 90.2 * 505.mcf_r 32 745 69.4 * 520.omnetpp_r 32 1031 40.7 * 523.xalancbmk_r 32 597 56.6 * 525.x264_r 1 -- CE 531.deepsjeng_r 32 336 109 * 541.leela_r 32 556 95.4 * 548.exchange2_r 32 513 163 * 557.xz_r 32 530 65.2 * Est. SPECrate2017_int_base 80.3 [w/ patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 580 87.8 (+0.688%) * 502.gcc_r 32 477 95.1 (+5.432%) * 505.mcf_r 32 644 80.3 (+13.574%) * 520.omnetpp_r 32 942 44.6 (+9.58%) * 523.xalancbmk_r 32 560 60.4 (+6.714%%) * 525.x264_r 1 -- CE 531.deepsjeng_r 32 337 109 (+0.000%) * 541.leela_r 32 554 95.6 (+0.210%) * 548.exchange2_r 32 515 163 (+0.000%) * 557.xz_r 32 524 66.0 (+1.227%) * Est. SPECrate2017_int_base 83.7 (+4.062%) On the other hand, it is slightly helpful to CPU-bound tasks like kernbench: * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores) kernbench kernbench w/o cluster w/ cluster Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%) Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%) Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%) Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%) Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%) Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%) Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%) Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%) Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%) Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%* Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%* Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%) Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%* Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%* Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%* Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%) Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%* Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%) Note this patch isn't an universal win, it might hurt those workload which can benefit from packing. Though tasks which want to take advantages of lower communication latency of one cluster won't necessarily been packed in one cluster while kernel is not aware of clusters, they have some chance to be randomly packed. But this patch will make them more likely spread. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-10-15sched: Disable -Wunused-but-set-variablePeter Zijlstra
The compilers can't deal with obvious DCE vs that warning, resulting in code like: if (0) { struct sched_statistics *stats; stats = __schedstats_from_se(se); ... } triggering the warning. Kill the warning to make the robots stop reporting this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lkml.kernel.org/r/YWWPLnaZGybHsTkv@hirez.programming.kicks-ass.net
2021-10-15sched: Add wrapper for get_wchan() to keep task blockedKees Cook
Having a stable wchan means the process must be blocked, and it must stay that way while stack unwinding is performed. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm] Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
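A hedged sketch of what such a wrapper can look like (modelled on the generic helper this change introduces; treat the exact state checks as assumptions):

    unsigned long get_wchan(struct task_struct *p)
    {
        unsigned long ip = 0;
        unsigned int state;

        if (!p || p == current)
            return 0;

        /* Only get wchan if the task is blocked and we can keep it that way. */
        raw_spin_lock_irq(&p->pi_lock);
        state = READ_ONCE(p->__state);
        smp_rmb(); /* see try_to_wake_up() */
        if (state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq)
            ip = __get_wchan(p);
        raw_spin_unlock_irq(&p->pi_lock);

        return ip;
    }

Architectures then only need to supply __get_wchan(), which is always called with the task pinned in a blocked state.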
2021-10-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/testing/selftests/net/ioam6.sh 7b1700e009cc ("selftests: net: modify IOAM tests for undef bits") bf77b1400a56 ("selftests: net: Test for the IOAM encapsulation with IPv6") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-14pid: add pidfd_get_task() helperChristian Brauner
The number of system calls making use of pidfds is constantly increasing. Some of those new system calls duplicate the code to turn a pidfd into the task_struct it refers to. Give them a simple helper for this. Link: https://lore.kernel.org/r/20211004125050.1153693-2-christian.brauner@ubuntu.com Link: https://lore.kernel.org/r/20211011133245.1703103-2-brauner@kernel.org Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Bobrowski <repnop@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Reviewed-by: Matthew Bobrowski <repnop@google.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
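A hedged sketch of such a helper, built from existing pid primitives (the exact signature and error codes are assumptions):

    struct task_struct *pidfd_get_task(int pidfd, unsigned int *flags)
    {
        unsigned int f_flags;
        struct pid *pid;
        struct task_struct *task;

        pid = pidfd_get_pid(pidfd, &f_flags);
        if (IS_ERR(pid))
            return ERR_CAST(pid);

        task = get_pid_task(pid, PIDTYPE_TGID);
        put_pid(pid);
        if (!task)
            return ERR_PTR(-ESRCH);

        *flags = f_flags;
        return task;
    }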
2021-10-14kernel/sched: Fix sched_fork() access an invalid sched_task_groupZhang Qiao
There is a small race between copy_process() and sched_fork() where child->sched_task_group points to an already freed pointer. parent doing fork() | someone moving the parent | to another cgroup -------------------------------+------------------------------- copy_process() + dup_task_struct()<1> parent move to another cgroup, and free the old cgroup. <2> + sched_fork() + __set_task_cpu()<3> + task_fork_fair() + sched_slice()<4> In the worst case, this bug can lead to a use-after-free and cause a panic as shown below: (1) the parent copies its sched_task_group to the child at <1>; (2) someone moves the parent to another cgroup and frees the old cgroup at <2>; (3) the sched_task_group and cfs_rq that belong to the old cgroup will be accessed at <3> and <4>, which causes a panic: [] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD 2029955067 PMD 0 [] Oops: 0000 [#1] SMP PTI [] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded Tainted: G OE --------- - - 4.18.0.x86_64+ #1 [] RIP: 0010:sched_slice+0x84/0xc0 [] Call Trace: [] task_fork_fair+0x81/0x120 [] sched_fork+0x132/0x240 [] copy_process.part.5+0x675/0x20e0 [] ? __handle_mm_fault+0x63f/0x690 [] _do_fork+0xcd/0x3b0 [] do_syscall_64+0x5d/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x65/0xca [] RIP: 0033:0x7f04418cd7e1 Between cgroup_can_fork() and cgroup_post_fork(), the cgroup membership and thus sched_task_group can't change. So update the child's sched_task_group at sched_post_fork() and move task_fork() and __set_task_cpu() (which access the sched_task_group) from sched_fork() to sched_post_fork(). Fixes: 8323f26ce342 ("sched: Fix race in task_group") Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20210915064030.2231-1-zhangqiao22@huawei.com
2021-10-14sched/topology: Remove unused numa_distance in cpu_attach_domain()Yicong Yang
numa_distance in cpu_attach_domain() is introduced in commit b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2") to warn user when NUMA diameter > 2 as we'll misrepresent the scheduler topology structures at that time. This is fixed by Barry in commit 585b6d2723dc ("sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2") and numa_distance is unused now. So remove it. Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210915063158.80639-1-yangyicong@hisilicon.com
2021-10-14sched/numa: Fix a few commentsBharata B Rao
Fix a few comments to help understand them better. Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20211004105706.3669-4-bharata@amd.com
2021-10-14sched/numa: Remove the redundant member numa_group::fault_cpusBharata B Rao
numa_group::fault_cpus is actually a pointer to the region in numa_group::faults[] where NUMA_CPU stats are located. Remove this redundant member and use numa_group::faults[NUMA_CPU] directly like it is done for similar per-process numa fault stats. There is no functionality change due to this commit. Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20211004105706.3669-3-bharata@amd.com
2021-10-14sched/numa: Replace hard-coded number by a define in numa_task_group()Bharata B Rao
While allocating group fault stats, task_numa_group() is using a hard coded number 4. Replace this by NR_NUMA_HINT_FAULT_STATS. No functionality change in this commit. Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20211004105706.3669-2-bharata@amd.com
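A hedged sketch of the allocation in task_numa_group() before and after (schematic; GFP flags and surrounding code are omitted):

    /* Before: a magic number for the NUMA fault stat types. */
    grp = kzalloc(sizeof(*grp) + 4 * nr_node_ids * sizeof(unsigned long), GFP_KERNEL);

    /* After: the named constant documents what the 4 stood for. */
    grp = kzalloc(sizeof(*grp) +
                  NR_NUMA_HINT_FAULT_STATS * nr_node_ids * sizeof(unsigned long),
                  GFP_KERNEL);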
2021-10-14sched,livepatch: Use wake_up_if_idle()Peter Zijlstra
Make sure to prod idle CPUs so they call klp_update_patch_state(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Petr Mladek <pmladek@suse.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> # on s390 Link: https://lkml.kernel.org/r/20210929151723.162004989@infradead.org
2021-10-13tracing: Fix event probe removal from dynamic eventsSteven Rostedt (VMware)
When an event probe is to be removed via the API that created it via the dynamic events, an -ENOENT error is returned. This is because the removal of the event probe does not expect to see the event system and name that the event probe is attached to, even though that's part of the API used to create it, and the removal of probes is supposed to use the same API with which they are created. In fact, the removal is not consistent with kprobe and uprobe removal. Fix that by allowing various ways to remove the eprobe. The eprobe is created with: e:[GROUP/]NAME SYSTEM/EVENT [OPTIONS] Have it get removed by echoing the following into dynamic_events: # Remove all eprobes with NAME echo '-:NAME' >> dynamic_events # Remove a specific eprobe echo '-:GROUP/NAME' >> dynamic_events echo '-:GROUP/NAME SYSTEM/EVENT' >> dynamic_events echo '-:NAME SYSTEM/EVENT' >> dynamic_events echo '-:GROUP/NAME SYSTEM/EVENT OPTIONS' >> dynamic_events echo '-:NAME SYSTEM/EVENT OPTIONS' >> dynamic_events Link: https://lkml.kernel.org/r/20211012081925.0e19cc4f@gandalf.local.home Link: https://lkml.kernel.org/r/20211013205533.630722129@goodmis.org Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-13tracing: in_irq() cleanupChangbin Du
Replace the obsolete and ambiguous macro in_irq() with the new macro in_hardirq(). Link: https://lkml.kernel.org/r/20210930000342.6016-1-changbin.du@gmail.com Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-13Merge tag 'modules-for-v5.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules fix from Jessica Yu: - Build fix for cfi_init() when CONFIG_MODULE_UNLOAD=n * tag 'modules-for-v5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: fix clang CFI with MODULE_UNLOAD=n
2021-10-11Merge branch 'for-5.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "All documentation / comment updates" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroupv2, docs: fix misinformation in "device controller" section cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem docs/cgroup: remove some duplicate words
2021-10-11Merge branch 'for-5.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "One patch to add a missing __printf annotation and the other to enable deferred printing for debug dumps to avoid deadlocks when triggered from some contexts (e.g. console drivers)" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix state-dump console deadlock workqueue: annotate alloc_workqueue() as printf
2021-10-11workqueue: fix state-dump console deadlockJohan Hovold
Console drivers often queue work while holding locks also taken in their console write paths, something which can lead to deadlocks on SMP when dumping workqueue state (e.g. sysrq-t or on suspend failures). For serial console drivers this could look like: CPU0 CPU1 ---- ---- show_workqueue_state(); lock(&pool->lock); <IRQ> lock(&port->lock); schedule_work(); lock(&pool->lock); printk(); lock(console_owner); lock(&port->lock); where workqueues are, for example, used to push data to the line discipline, process break signals and handle modem-status changes. Line disciplines and serdev drivers can also queue work on write-wakeup notifications, etc. Reworking every console driver to avoid queuing work while holding locks also taken in their write paths would complicate drivers and is neither desirable or feasible. Instead use the deferred-printk mechanism to avoid printing while holding pool locks when dumping workqueue state. Note that there are a few WARN_ON() assertions in the workqueue code which could potentially also trigger a deadlock. Hopefully the ongoing printk rework will provide a general solution for this eventually. This was originally reported after a lockdep splat when executing sysrq-t with the imx serial driver. Fixes: 3494fc30846d ("workqueue: dump workqueues on sysrq-t") Cc: stable@vger.kernel.org # 4.0 Reported-by: Fabio Estevam <festevam@denx.de> Tested-by: Fabio Estevam <festevam@denx.de> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
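A hedged sketch of the deferred-printing approach; printk_deferred_enter()/printk_deferred_exit() are the real helpers, but the surrounding function is a simplified stand-in for the workqueue dump code:

    static void show_one_pool_state(struct worker_pool *pool)    /* illustrative name */
    {
        unsigned long flags;

        raw_spin_lock_irqsave(&pool->lock, flags);

        /* Queue output instead of printing synchronously under pool->lock. */
        printk_deferred_enter();
        pr_info("pool %d: cpus=%*pbl flags=0x%x nice=%d\n",
                pool->id, cpumask_pr_args(pool->attrs->cpumask),
                pool->flags, pool->attrs->nice);
        printk_deferred_exit();

        raw_spin_unlock_irqrestore(&pool->lock, flags);
    }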
2021-10-11dma-debug: fix sg checks in debug_dma_map_sg()Gerald Schaefer
The following warning occurred sporadically on s390: DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=0000000048cc5e2f] [len=131072] WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138 It is a false-positive warning, due to broken logic in debug_dma_map_sg(). check_for_illegal_area() checks for overlay of sg elements with kernel text or rodata. It is called with sg_dma_len(s) instead of s->length as parameter. After the call to ->map_sg(), sg_dma_len() will contain the length of possibly combined sg elements in the DMA address space, and not the individual sg element length, which would be s->length. The check will then use the physical start address of an sg element, and add the DMA length for the overlap check, which could result in the false warning, because the DMA length can be larger than the actual single sg element length. In addition, the call to check_for_illegal_area() happens in the iteration over mapped_ents, which will not include all individual sg elements if any of them were combined in ->map_sg(). Fix this by using s->length instead of sg_dma_len(s). Also put the call to check_for_illegal_area() in a separate loop, iterating over all the individual sg elements ("nents" instead of "mapped_ents"). While at it, as suggested by Robin Murphy, also move check_for_stack() inside the new loop, as it is similarly concerned with validating the individual sg elements. Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schaefer@linux.ibm.com Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor") Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
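A hedged sketch of the corrected iteration (simplified from debug_dma_map_sg() in kernel/dma/debug.c; treat details such as the highmem check as assumptions):

    void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
                          int nents, int mapped_ents, int direction)
    {
        struct scatterlist *s;
        int i;

        /* Validate every original sg element, using its own length. */
        for_each_sg(sg, s, nents, i) {
            check_for_stack(dev, sg_page(s), s->offset);
            if (!PageHighMem(sg_page(s)))
                check_for_illegal_area(dev, sg_virt(s), s->length);
        }

        /* Debug entries are still added per mapped (possibly combined) element. */
        for_each_sg(sg, s, mapped_ents, i) {
            /* ...allocate and fill a dma_debug_entry as before... */
        }
    }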
2021-10-11dma-mapping: fix the kerneldoc for dma_map_sgtable()Logan Gunthorpe
htmldocs began producing the following warnings: kernel/dma/mapping.c:256: WARNING: Definition list ends without a blank line; unexpected unindent. kernel/dma/mapping.c:257: WARNING: Bullet list ends without a blank line; unexpected unindent. Reformatting the list without hyphens fixes the warnings and produces both a readable text and HTML output. Fixes: fffe3cc8c219 ("dma-mapping: allow map_sg() ops to return negative error code") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-10tracing: Fix missing * in comment blockColin Ian King
There is a missing * in a comment block, add it in. Link: https://lkml.kernel.org/r/20211006172830.1025336-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-10tracing: Fix memory leak in eprobe_register()Vamshi K Sthambamkadi
kmemleak report: unreferenced object 0xffff900a70ec7ec0 (size 32): comm "ftracetest", pid 2770, jiffies 4295042510 (age 311.464s) hex dump (first 32 bytes): c8 31 23 45 0a 90 ff ff 40 85 c7 6e 0a 90 ff ff .1#E....@..n.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000009d3751fd>] kmem_cache_alloc_trace+0x2a2/0x440 [<0000000088b8124b>] eprobe_register+0x1e3/0x350 [<000000002a9a0517>] __ftrace_event_enable_disable+0x7c/0x240 [<0000000019109321>] event_enable_write+0x93/0xe0 [<000000007d85b320>] vfs_write+0xb9/0x260 [<00000000b94c5e41>] ksys_write+0x67/0xe0 [<000000005a08c81d>] __x64_sys_write+0x1a/0x20 [<00000000240bf576>] do_syscall_64+0x3b/0xc0 [<0000000043d5d9f6>] entry_SYSCALL_64_after_hwframe+0x44/0xae unreferenced object 0xffff900a56bbf280 (size 128): comm "ftracetest", pid 2770, jiffies 4295042510 (age 311.464s) hex dump (first 32 bytes): ff ff ff ff ff ff ff ff 00 00 00 00 01 00 00 00 ................ 80 69 3b b2 ff ff ff ff 20 69 3b b2 ff ff ff ff .i;..... i;..... backtrace: [<000000009d3751fd>] kmem_cache_alloc_trace+0x2a2/0x440 [<00000000c4e90fad>] eprobe_register+0x1fc/0x350 [<000000002a9a0517>] __ftrace_event_enable_disable+0x7c/0x240 [<0000000019109321>] event_enable_write+0x93/0xe0 [<000000007d85b320>] vfs_write+0xb9/0x260 [<00000000b94c5e41>] ksys_write+0x67/0xe0 [<000000005a08c81d>] __x64_sys_write+0x1a/0x20 [<00000000240bf576>] do_syscall_64+0x3b/0xc0 [<0000000043d5d9f6>] entry_SYSCALL_64_after_hwframe+0x44/0xae In new_eprobe_trigger(), allocated edata and trigger variables are never freed. To fix, free memory in disable_eprobe(). Link: https://lkml.kernel.org/r/20211008071802.GA2098@cosmos Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
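A hedged sketch of where the allocations reported above can be released (names taken from the report; treat the exact sequence in disable_eprobe() as an assumption):

    /* disable_eprobe(), after unhooking the trigger from the event file: */
    list_del_rcu(&trigger->list);
    trace_event_trigger_enable_disable(file, 0);

    /* Make sure nothing is still using edata or trigger before freeing them. */
    tracepoint_synchronize_unregister();

    kfree(edata);
    kfree(trigger);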