path: root/kernel
Age  Commit message  Author
2021-05-12  tracing: Restructure trace_clock_global() to never block  [Steven Rostedt (VMware)]
commit aafe104aa9096827a429bc1358f8260ee565b7cc upstream. It was reported that a fix to the ring buffer recursion detection would cause a hung machine when performing suspend / resume testing. The following backtrace was extracted from debugging that case: Call Trace: trace_clock_global+0x91/0xa0 __rb_reserve_next+0x237/0x460 ring_buffer_lock_reserve+0x12a/0x3f0 trace_buffer_lock_reserve+0x10/0x50 __trace_graph_return+0x1f/0x80 trace_graph_return+0xb7/0xf0 ? trace_clock_global+0x91/0xa0 ftrace_return_to_handler+0x8b/0xf0 ? pv_hash+0xa0/0xa0 return_to_handler+0x15/0x30 ? ftrace_graph_caller+0xa0/0xa0 ? trace_clock_global+0x91/0xa0 ? __rb_reserve_next+0x237/0x460 ? ring_buffer_lock_reserve+0x12a/0x3f0 ? trace_event_buffer_lock_reserve+0x3c/0x120 ? trace_event_buffer_reserve+0x6b/0xc0 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0 ? dpm_run_callback+0x3b/0xc0 ? pm_ops_is_empty+0x50/0x50 ? platform_get_irq_byname_optional+0x90/0x90 ? trace_device_pm_callback_start+0x82/0xd0 ? dpm_run_callback+0x49/0xc0 With the following RIP: RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200 Since the fix to the recursion detection would allow a single recursion to happen while tracing, this lead to the trace_clock_global() taking a spin lock and then trying to take it again: ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* lock taken */ (something else gets traced by function graph tracer) ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* DEAD LOCK! */ Tracing should *never* block, as it can lead to strange lockups like the above. Restructure the trace_clock_global() code to instead of simply taking a lock to update the recorded "prev_time" simply use it, as two events happening on two different CPUs that calls this at the same time, really doesn't matter which one goes first. Use a trylock to grab the lock for updating the prev_time, and if it fails, simply try again the next time. If it failed to be taken, that means something else is already updating it. Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home Cc: stable@vger.kernel.org Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Fixes: b02414c8f045 ("ring-buffer: Fix recursion protection transitions between interrupt context") # started showing the problem Fixes: 14131f2f98ac3 ("tracing: implement trace_clock_*() APIs") # where the bug happened Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212761 Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
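The restructured clock keeps a best-effort record of the most recent timestamp and only updates it when an uncontended lock can be taken, so the tracing path never waits. Below is a minimal userspace sketch of that trylock pattern (using pthreads and illustrative names rather than the kernel's actual trace_clock_global() code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Best-effort "never block" clock model; names and types are illustrative. */
static _Atomic uint64_t prev_time;                  /* last value handed out */
static pthread_mutex_t prev_lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t global_clock(uint64_t local_now)
{
	uint64_t now = local_now;
	uint64_t prev = atomic_load(&prev_time);

	/* Use the recorded previous time to keep the clock monotonic. */
	if (now < prev)
		now = prev;

	/* Only try to update the record; if the lock is contended, someone
	 * else is already updating it, so simply skip it this time. */
	if (pthread_mutex_trylock(&prev_lock) == 0) {
		if (atomic_load(&prev_time) < now)
			atomic_store(&prev_time, now);
		pthread_mutex_unlock(&prev_lock);
	}
	return now;
}

int main(void)
{
	printf("%llu\n", (unsigned long long)global_clock(100));
	printf("%llu\n", (unsigned long long)global_clock(90)); /* still 100 */
	return 0;
}
```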
2021-05-12  tracing: Map all PIDs to command lines  [Steven Rostedt (VMware)]
commit 785e3c0a3a870e72dc530856136ab4c8dd207128 upstream. The default max PID is set by PID_MAX_DEFAULT, and the tracing infrastructure uses this number to map PIDs to the comm names of tasks, so that the trace output can show names for the PIDs recorded in the ring buffer. This mapping is also exported to user space via the "saved_cmdlines" file in the tracefs directory. But currently the mapping expects the PIDs to be less than PID_MAX_DEFAULT, which is the default maximum and not the real maximum. Recently, systemd has started increasing the maximum value of a PID on the system, and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, their comm is not recorded. This leads to the entire trace having "<...>" as the comm name, which is pretty useless. Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead of just mapping the index to the comm, map a mask of the PID (PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the map_cmdline_to_pid array (that already exists). This bug goes back to the beginning of ftrace, but hasn't been an issue until user space started increasing the maximum value of PIDs. Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home Cc: stable@vger.kernel.org Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
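The mapping trick described above, indexing a fixed-size comm array by pid & (PID_MAX_DEFAULT - 1) while keeping the full PID alongside it so stale slots can be detected, can be modelled in plain C as follows (array names and sizes are illustrative, not the tracer's actual data structures):

```c
#include <stdio.h>
#include <string.h>

#define PID_MAX_DEFAULT  0x8000          /* 32768, the historical default */
#define COMM_LEN         16

/* Fixed-size tables indexed by the masked PID. */
static int  map_index_to_pid[PID_MAX_DEFAULT];
static char saved_comm[PID_MAX_DEFAULT][COMM_LEN];

static void save_cmdline(int pid, const char *comm)
{
	int idx = pid & (PID_MAX_DEFAULT - 1);   /* works for any PID value */

	map_index_to_pid[idx] = pid;
	strncpy(saved_comm[idx], comm, COMM_LEN - 1);
}

static const char *find_cmdline(int pid)
{
	int idx = pid & (PID_MAX_DEFAULT - 1);

	/* A different PID may have reused this slot; detect that. */
	if (map_index_to_pid[idx] != pid)
		return "<...>";
	return saved_comm[idx];
}

int main(void)
{
	save_cmdline(4000000, "big-pid-task");   /* PID above PID_MAX_DEFAULT */
	printf("%s\n", find_cmdline(4000000));   /* big-pid-task */
	printf("%s\n", find_cmdline(123));       /* <...> (never recorded) */
	return 0;
}
```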
2021-05-12  kbuild: update config_data.gz only when the content of .config is changed  [Masahiro Yamada]
commit 46b41d5dd8019b264717978c39c43313a524d033 upstream. If the timestamp of the .config file is updated, config_data.gz is regenerated, then vmlinux is re-linked. This occurs even if the content of the .config has not changed at all. This issue was mitigated by commit 67424f61f813 ("kconfig: do not write .config if the content is the same"); Kconfig does not update the .config when it ends up with the identical configuration. The issue is remaining when the .config is created by *_defconfig with some config fragment(s) applied on top. This is typical for powerpc and mips, where several *_defconfig targets are constructed by using merge_config.sh. One workaround is to have the copy of the .config. The filechk rule updates the copy, kernel/config_data, by checking the content instead of the timestamp. With this commit, the second run with the same configuration avoids the needless rebuilds. $ make ARCH=mips defconfig all [ snip ] $ make ARCH=mips defconfig all *** Default configuration is based on target '32r2el_defconfig' Using ./arch/mips/configs/generic_defconfig as base Merging arch/mips/configs/generic/32r2.config Merging arch/mips/configs/generic/el.config Merging ./arch/mips/configs/generic/board-boston.config Merging ./arch/mips/configs/generic/board-ni169445.config Merging ./arch/mips/configs/generic/board-ocelot.config Merging ./arch/mips/configs/generic/board-ranchu.config Merging ./arch/mips/configs/generic/board-sead-3.config Merging ./arch/mips/configs/generic/board-xilfpga.config # # configuration written to .config # SYNC include/config/auto.conf CALL scripts/checksyscalls.sh CALL scripts/atomic/check-atomics.sh CHK include/generated/compile.h CHK include/generated/autoksyms.h Reported-by: Elliot Berman <eberman@codeaurora.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-12  futex: Do not apply time namespace adjustment on FUTEX_LOCK_PI  [Thomas Gleixner]
commit cdf78db4070967869e4d027c11f4dd825d8f815a upstream. FUTEX_LOCK_PI does not require the FUTEX_CLOCK_REALTIME bit to be set because it has been using CLOCK_REALTIME based absolute timeouts forever. Due to that, the time namespace adjustment, which is applied when FUTEX_CLOCK_REALTIME is not set, wrongly takes place for FUTEX_LOCK_PI and wrecks the timeout. Exclude it from that procedure. Fixes: c2f7d08cccf4 ("futex: Adjust absolute futex timeouts with per time namespace offset") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210422194704.984540159@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-12  Revert 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")  [Thomas Gleixner]
commit 4fbf5d6837bf81fd7a27d771358f4ee6c4f243f8 upstream. The FUTEX_WAIT operand has historically a relative timeout which means that the clock id is irrelevant as relative timeouts on CLOCK_REALTIME are not subject to wall clock changes and therefore are mapped by the kernel to CLOCK_MONOTONIC for simplicity. If a caller would set FUTEX_CLOCK_REALTIME for FUTEX_WAIT the timeout is still treated relative vs. CLOCK_MONOTONIC and then the wait arms that timeout based on CLOCK_REALTIME which is broken and obviously has never been used or even tested. Reject any attempt to use FUTEX_CLOCK_REALTIME with FUTEX_WAIT again. The desired functionality can be achieved with FUTEX_WAIT_BITSET and a FUTEX_BITSET_MATCH_ANY argument. Fixes: 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210422194704.834797921@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-12  rcu/nocb: Fix missed nocb_timer requeue  [Frederic Weisbecker]
commit b2fcf2102049f6e56981e0ab3d9b633b8e2741da upstream. This sequence of events can lead to a failure to requeue a CPU's ->nocb_timer: 1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread. Note that ->nocb_gp_kthread is associated with CPU 0. 2. CPU 1 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer. At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE. 3. CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread. 4. The newly awakened ->nocb_gp_kthread associates both CPU 1's and CPU 2's callbacks with a future grace period and arranges for that grace period to be started. 5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period. 6. This grace period elapses before the CPU 1's timer fires. This is normally improbably given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. 7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread. Then ->nocb_gb_kthread sleeps waiting for more newly queued callbacks. 8. CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps waiting for more invocable callbacks. 9. Note that neither kthread updated any ->nocb_timer state, so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE. 10. CPU 1 enqueues its second callback, this time with interrupts enabled so it can wake directly ->nocb_gp_kthread. It does so with calling wake_nocb_gp() which also cancels the pending timer that got queued in step 2. But that doesn't reset CPU 1's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE. So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now desynchronized. 11. ->nocb_gp_kthread associates the callback queued in 10 with a new grace period, arranges for that grace period to start and sleeps waiting for it to complete. 12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread, which in turn wakes up CPU 1's ->nocb_cb_kthread which then invokes the callback queued in 10. 13. CPU 1 enqueues its third callback, this time with interrupts disabled so it must queue a timer for a deferred wakeup. However the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which incorrectly indicates that a timer is already queued. Instead, CPU 1's ->nocb_timer was cancelled in 10. CPU 1 therefore fails to queue the ->nocb_timer. 14. CPU 1 has its pending callback and it may go unnoticed until some other CPU ever wakes up ->nocb_gp_kthread or CPU 1 ever calls an explicit deferred wakeup, for example, during idle entry. This commit fixes this bug by resetting rdp->nocb_defer_wakeup everytime we delete the ->nocb_timer. It is quite possible that there is a similar scenario involving ->nocb_bypass_timer and ->nocb_defer_wakeup. However, despite some effort from several people, a failure scenario has not yet been located. However, that by no means guarantees that no such scenario exists. Finding a failure scenario is left as an exercise for the reader, and the "Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer. 
Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing) Cc: <stable@vger.kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-12  kcsan, debugfs: Move debugfs file creation out of early init  [Marco Elver]
commit e36299efe7d749976fbdaaf756dee6ef32543c2c upstream. Commit 56348560d495 ("debugfs: do not attempt to create a new file before the filesystem is initalized") forbids creating new debugfs files until debugfs is fully initialized. This means that KCSAN's debugfs file creation, which happened at the end of __init(), no longer works. And was apparently never supposed to work! However, there is no reason to create KCSAN's debugfs file so early. This commit therefore moves its creation to a late_initcall() callback. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: stable <stable@vger.kernel.org> Fixes: 56348560d495 ("debugfs: do not attempt to create a new file before the filesystem is initalized") Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
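Moving the file creation out of early init is essentially a matter of registering it at a later initcall level, after debugfs itself is up. A kernel-style sketch of the shape of the change follows; the file name and fops here are placeholders, not KCSAN's actual ones:

```c
#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

/* Placeholder fops; the real file wires up KCSAN's read/write handlers. */
static const struct file_operations example_debugfs_fops = {
	.owner = THIS_MODULE,
};

static int __init example_debugfs_init(void)
{
	/* debugfs is guaranteed to be initialized by late_initcall time. */
	debugfs_create_file("example", 0644, NULL, NULL, &example_debugfs_fops);
	return 0;
}
late_initcall(example_debugfs_init);
```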
2021-05-12  sched,psi: Handle potential task count underflow bugs more gracefully  [Charan Teja Reddy]
[ Upstream commit 9d10a13d1e4c349b76f1c675a874a7f981d6d3b4 ] psi_group_cpu->tasks, represented by unsigned ints, stores the number of tasks that could be stalled on a psi resource (io/mem/cpu). Decrementing these counters at zero leads to wrapping, which in turn leads to psi_group_cpu->state_mask being set with the respective pressure state. This can result in unnecessary time sampling for that pressure state and thus in spurious psi events, which can further lead to wrong actions being taken in user land based on those events. Though psi_bug is set under these conditions, that is only for debugging purposes. Fix it by decrementing the ->tasks count only when it is non-zero. Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org Signed-off-by: Sasha Levin <sashal@kernel.org>
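The fix amounts to never decrementing an unsigned per-state task counter that is already zero, and flagging the inconsistency instead. A small self-contained model of that guard (names and array size are illustrative):

```c
#include <stdio.h>

#define NR_PSI_TASK_COUNTS 4

static unsigned int tasks[NR_PSI_TASK_COUNTS];
static int psi_bug;

static void psi_task_dec(int t)
{
	if (tasks[t]) {
		tasks[t]--;
	} else if (!psi_bug) {
		/* Decrementing at zero would wrap the unsigned counter and
		 * make the CPU look permanently stalled; report once. */
		fprintf(stderr, "psi: task underflow for state %d\n", t);
		psi_bug = 1;
	}
}

int main(void)
{
	tasks[0] = 1;
	psi_task_dec(0);          /* 1 -> 0, fine */
	psi_task_dec(0);          /* would wrap to UINT_MAX; now just warns */
	printf("tasks[0] = %u, psi_bug = %d\n", tasks[0], psi_bug);
	return 0;
}
```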
2021-05-12  sched,fair: Alternative sched_slice()  [Peter Zijlstra]
[ Upstream commit 0c2de3f054a59f15e01804b75a04355c48de628c ] The current sched_slice() seems to have issues; there's two possible things that could be improved: - the 'nr_running' used for __sched_period() is daft when cgroups are considered. Using the RQ wide h_nr_running seems like a much more consistent number. - (esp) cgroups can slice it real fine, which makes for easy over-scheduling, ensure min_gran is what the name says. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.611897312@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-12  perf: Rework perf_event_exit_event()  [Peter Zijlstra]
[ Upstream commit ef54c1a476aef7eef26fe13ea10dc090952c00f8 ] Make perf_event_exit_event() more robust, such that we can use it from other contexts. Specifically the up and coming remove_on_exec. For this to work we need to address a few issues. Remove_on_exec will not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to disable event_function_call() and we thus have to use perf_remove_from_context(). When using perf_remove_from_context(), there's two races to consider. The first is against close(), where we can have concurrent tear-down of the event. The second is against child_list iteration, which should not find a half baked event. To address this, teach perf_remove_from_context() to special case !ctx->is_active and about DETACH_CHILD. [ elver@google.com: fix racing parent/child exit in sync_child_event(). ] Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210408103605.1676875-2-elver@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-12  sched/fair: Ignore percpu threads for imbalance pulls  [Lingutla Chandrasekhar]
[ Upstream commit 9bcb959d05eeb564dfc9cac13a59843a4fb2edf2 ] During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of: DIE [ ] MC [ ][ ] 0 1 2 3 L L B B arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B) Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upgmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-12  kvfree_rcu: Use same set of GFP flags as does single-argument  [Uladzislau Rezki (Sony)]
[ Upstream commit ee6ddf58475cce8a3d3697614679cd8cb4a6f583 ] Running an rcuscale stress-suite can lead to "Out of memory" of a system. This can happen under high memory pressure with a small amount of physical memory. For example, a KVM test configuration with 64 CPUs and 512 megabytes can result in OOM when running rcuscale with below parameters: ../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \ --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \ rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make <snip> [ 12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0 [ 12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 [ 12.056485] Workqueue: events_highpri fill_page_cache_func [ 12.056485] Call Trace: [ 12.056485] dump_stack+0x57/0x6a [ 12.056485] dump_header+0x4c/0x30a [ 12.056485] ? del_timer_sync+0x20/0x30 [ 12.056485] out_of_memory.cold.47+0xa/0x7e [ 12.056485] __alloc_pages_slowpath.constprop.123+0x82f/0xc00 [ 12.056485] __alloc_pages_nodemask+0x289/0x2c0 [ 12.056485] __get_free_pages+0x8/0x30 [ 12.056485] fill_page_cache_func+0x39/0xb0 [ 12.056485] process_one_work+0x1ed/0x3b0 [ 12.056485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] worker_thread+0x28/0x3c0 [ 12.060485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] kthread+0x138/0x160 [ 12.060485] ? kthread_park+0x80/0x80 [ 12.060485] ret_from_fork+0x22/0x30 [ 12.062156] Mem-Info: [ 12.062350] active_anon:0 inactive_anon:0 isolated_anon:0 [ 12.062350] active_file:0 inactive_file:0 isolated_file:0 [ 12.062350] unevictable:0 dirty:0 writeback:0 [ 12.062350] slab_reclaimable:2797 slab_unreclaimable:80920 [ 12.062350] mapped:1 shmem:2 pagetables:8 bounce:0 [ 12.062350] free:10488 free_pcp:1227 free_cma:0 ... [ 12.101610] Out of memory and no killable processes... [ 12.102042] Kernel panic - not syncing: System is deadlocked on memory [ 12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 <snip> Because kvfree_rcu() has a fallback path, memory allocation failure is not the end of the world. Furthermore, the added overhead of aggressive GFP settings must be balanced against the overhead of the fallback path, which is a cache miss for double-argument kvfree_rcu() and a call to synchronize_rcu() for single-argument kvfree_rcu(). The current choice of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call to synchronize_rcu(), so less-tenacious GFP flags would be helpful. Here is the tradeoff that must be balanced: a) Minimize use of the fallback path, b) Avoid pushing the system into OOM, c) Bound allocation latency to that of synchronize_rcu(), and d) Leave the emergency reserves to use cases lacking fallbacks. This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN. This combination leaves the emergency reserves alone and can initiate reclaim, but will not invoke the OOM killer. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
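The flag combination itself is the whole fix. A hedged kernel-style fragment showing how a page-cache refill might pass it (the helper name is made up for illustration; the real code lives in the kvfree_rcu() machinery):

```c
#include <linux/gfp.h>

/* Illustrative helper: allocate a page for the kvfree_rcu() page cache
 * with flags that may start reclaim but will neither retry hard, dip
 * into the emergency reserves, trigger the OOM killer, nor warn on
 * failure. The caller must tolerate a NULL return, since kvfree_rcu()
 * has a synchronize_rcu() fallback path anyway. */
static struct page *example_fill_page(void)
{
	return alloc_page(GFP_KERNEL | __GFP_NORETRY |
			  __GFP_NOMEMALLOC | __GFP_NOWARN);
}
```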
2021-05-12  sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2  [Barry Song]
diameter > 2 [ Upstream commit 585b6d2723dc927ebc4ad884c4e879e4da8bc21f ] As long as NUMA diameter > 2, building sched_domain by sibling's child domain will definitely create a sched_domain with sched_group which will span out of the sched_domain: +------+ +------+ +-------+ +------+ | node | 12 |node | 20 | node | 12 |node | | 0 +---------+1 +--------+ 2 +-------+3 | +------+ +------+ +-------+ +------+ domain0 node0 node1 node2 node3 domain1 node0+1 node0+1 node2+3 node2+3 + domain2 node0+1+2 | group: node0+1 | group:node2+3 <-------------------+ when node2 is added into the domain2 of node0, kernel is using the child domain of node2's domain2, which is domain1(node2+3). Node 3 is outside the span of the domain including node0+1+2. This will make load_balance() run based on screwed avg_load and group_type in the sched_group spanning out of the sched_domain, and it also makes select_task_rq_fair() pick an idle CPU outside the sched_domain. Real servers which suffer from this problem include Kunpeng920 and 8-node Sun Fire X4600-M2, at least. Here we move to use the *child* domain of the *child* domain of node2's domain2 as the new added sched_group. At the same, we re-use the lower level sgc directly. +------+ +------+ +-------+ +------+ | node | 12 |node | 20 | node | 12 |node | | 0 +---------+1 +--------+ 2 +-------+3 | +------+ +------+ +-------+ +------+ domain0 node0 node1 +- node2 node3 | domain1 node0+1 node0+1 | node2+3 node2+3 | domain2 node0+1+2 | group: node0+1 | group:node2 <-------------------+ While the lower level sgc is re-used, this patch only changes the remote sched_groups for those sched_domains playing grandchild trick, therefore, sgc->next_update is still safe since it's only touched by CPUs that have the group span as local group. And sgc->imbalance is also safe because sd_parent remains the same in load_balance and LB only tries other CPUs from the local group. Moreover, since local groups are not touched, they are still getting roughly equal size in a TL. And should_we_balance() only matters with local groups, so the pull probability of those groups are still roughly equal. 
Tested by the below topology: qemu-system-aarch64 -M virt -nographic \ -smp cpus=8 \ -numa node,cpus=0-1,nodeid=0 \ -numa node,cpus=2-3,nodeid=1 \ -numa node,cpus=4-5,nodeid=2 \ -numa node,cpus=6-7,nodeid=3 \ -numa dist,src=0,dst=1,val=12 \ -numa dist,src=0,dst=2,val=20 \ -numa dist,src=0,dst=3,val=22 \ -numa dist,src=1,dst=2,val=22 \ -numa dist,src=2,dst=3,val=12 \ -numa dist,src=1,dst=3,val=24 \ -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image w/o patch, we get lots of "groups don't span domain->span": [ 0.802139] CPU0 attaching sched-domain(s): [ 0.802193] domain-0: span=0-1 level=MC [ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 } [ 0.802693] domain-1: span=0-3 level=NUMA [ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 } [ 0.802811] domain-2: span=0-5 level=NUMA [ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 } [ 0.802881] ERROR: groups don't span domain->span [ 0.803058] domain-3: span=0-7 level=NUMA [ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 } [ 0.804055] CPU1 attaching sched-domain(s): [ 0.804072] domain-0: span=0-1 level=MC [ 0.804096] groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 } [ 0.804152] domain-1: span=0-3 level=NUMA [ 0.804170] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 } [ 0.804219] domain-2: span=0-5 level=NUMA [ 0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 } [ 0.804302] ERROR: groups don't span domain->span [ 0.804520] domain-3: span=0-7 level=NUMA [ 0.804546] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 } [ 0.804677] CPU2 attaching sched-domain(s): [ 0.804687] domain-0: span=2-3 level=MC [ 0.804705] groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 } [ 0.804754] domain-1: span=0-3 level=NUMA [ 0.804772] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 } [ 0.804820] domain-2: span=0-5 level=NUMA [ 0.804836] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 } [ 0.804944] ERROR: groups don't span domain->span [ 0.805108] domain-3: span=0-7 level=NUMA [ 0.805134] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 } [ 0.805223] CPU3 attaching sched-domain(s): [ 0.805232] domain-0: span=2-3 level=MC [ 0.805249] groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 } [ 0.805319] domain-1: span=0-3 level=NUMA [ 0.805336] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 } [ 0.805383] domain-2: span=0-5 level=NUMA [ 0.805399] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 } [ 0.805458] ERROR: groups don't span domain->span [ 0.805605] domain-3: span=0-7 level=NUMA [ 0.805626] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 } [ 0.805712] CPU4 attaching sched-domain(s): [ 0.805721] domain-0: span=4-5 level=MC [ 0.805738] groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 } [ 0.805787] domain-1: span=4-7 level=NUMA [ 0.805803] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 } [ 0.805851] domain-2: span=0-1,4-7 level=NUMA [ 0.805867] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 } [ 0.805915] ERROR: groups don't span domain->span [ 0.806108] domain-3: span=0-7 level=NUMA [ 0.806130] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 } [ 0.806214] CPU5 attaching sched-domain(s): [ 0.806222] domain-0: span=4-5 level=MC [ 0.806240] groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 } [ 0.806841] domain-1: span=4-7 level=NUMA [ 0.806866] groups: 4:{ span=4-5 cap=1908 }, 6:{ 
span=6-7 cap=2029 } [ 0.806934] domain-2: span=0-1,4-7 level=NUMA [ 0.806953] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 } [ 0.807004] ERROR: groups don't span domain->span [ 0.807312] domain-3: span=0-7 level=NUMA [ 0.807386] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 } [ 0.807686] CPU6 attaching sched-domain(s): [ 0.807710] domain-0: span=6-7 level=MC [ 0.807750] groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 } [ 0.807840] domain-1: span=4-7 level=NUMA [ 0.807870] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 } [ 0.807952] domain-2: span=0-1,4-7 level=NUMA [ 0.807985] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 } [ 0.808045] ERROR: groups don't span domain->span [ 0.808257] domain-3: span=0-7 level=NUMA [ 0.808571] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 } [ 0.808848] CPU7 attaching sched-domain(s): [ 0.808860] domain-0: span=6-7 level=MC [ 0.808880] groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 } [ 0.808953] domain-1: span=4-7 level=NUMA [ 0.808974] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 } [ 0.809034] domain-2: span=0-1,4-7 level=NUMA [ 0.809055] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 } [ 0.809128] ERROR: groups don't span domain->span [ 0.810361] domain-3: span=0-7 level=NUMA [ 0.810400] groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 } w/ patch, we don't get "groups don't span domain->span" any more: [ 1.486271] CPU0 attaching sched-domain(s): [ 1.486820] domain-0: span=0-1 level=MC [ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 } [ 1.515717] domain-1: span=0-3 level=NUMA [ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 } [ 1.516989] domain-2: span=0-5 level=NUMA [ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 } [ 1.517369] domain-3: span=0-7 level=NUMA [ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 } [ 1.520027] CPU1 attaching sched-domain(s): [ 1.520097] domain-0: span=0-1 level=MC [ 1.520184] groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 } [ 1.520429] domain-1: span=0-3 level=NUMA [ 1.520487] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 } [ 1.520687] domain-2: span=0-5 level=NUMA [ 1.520744] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 } [ 1.520948] domain-3: span=0-7 level=NUMA [ 1.521038] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 } [ 1.522068] CPU2 attaching sched-domain(s): [ 1.522348] domain-0: span=2-3 level=MC [ 1.522606] groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 } [ 1.522832] domain-1: span=0-3 level=NUMA [ 1.522885] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 } [ 1.523043] domain-2: span=0-5 level=NUMA [ 1.523092] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 } [ 1.523302] domain-3: span=0-7 level=NUMA [ 1.523352] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 } [ 1.523748] CPU3 attaching sched-domain(s): [ 1.523774] domain-0: span=2-3 level=MC [ 1.523825] groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 } [ 1.524009] domain-1: span=0-3 level=NUMA [ 1.524086] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 } [ 1.524281] domain-2: span=0-5 level=NUMA [ 1.524331] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 } [ 1.524534] domain-3: span=0-7 level=NUMA [ 1.524586] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 
} [ 1.524847] CPU4 attaching sched-domain(s): [ 1.524873] domain-0: span=4-5 level=MC [ 1.524954] groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 } [ 1.525105] domain-1: span=4-7 level=NUMA [ 1.525153] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 } [ 1.525368] domain-2: span=0-1,4-7 level=NUMA [ 1.525428] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 } [ 1.532726] domain-3: span=0-7 level=NUMA [ 1.532811] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 } [ 1.534125] CPU5 attaching sched-domain(s): [ 1.534159] domain-0: span=4-5 level=MC [ 1.534303] groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 } [ 1.534490] domain-1: span=4-7 level=NUMA [ 1.534572] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 } [ 1.534734] domain-2: span=0-1,4-7 level=NUMA [ 1.534783] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 } [ 1.536057] domain-3: span=0-7 level=NUMA [ 1.536430] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 } [ 1.536815] CPU6 attaching sched-domain(s): [ 1.536846] domain-0: span=6-7 level=MC [ 1.536934] groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 } [ 1.537144] domain-1: span=4-7 level=NUMA [ 1.537262] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 } [ 1.537553] domain-2: span=0-1,4-7 level=NUMA [ 1.537613] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 } [ 1.537872] domain-3: span=0-7 level=NUMA [ 1.537998] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 } [ 1.538448] CPU7 attaching sched-domain(s): [ 1.538505] domain-0: span=6-7 level=MC [ 1.538586] groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 } [ 1.538746] domain-1: span=4-7 level=NUMA [ 1.538798] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 } [ 1.539048] domain-2: span=0-1,4-7 level=NUMA [ 1.539111] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 } [ 1.539571] domain-3: span=0-7 level=NUMA [ 1.539610] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 } Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Meelis Roos <mroos@linux.ee> Link: https://lkml.kernel.org/r/20210224030944.15232-1-song.bao.hua@hisilicon.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-12  sched/pelt: Fix task util_est update filtering  [Vincent Donnefort]
[ Upstream commit b89997aa88f0b07d8a6414c908af75062103b8c9 ] Being called for each dequeue, util_est reduces the number of its updates by filtering out when the EWMA signal is different from the task util_avg by less than 1%. It is a problem for a sudden util_avg ramp-up. Due to the decay from a previous high util_avg, EWMA might now be close enough to the new util_avg. No update would then happen while it would leave ue.enqueued with an out-of-date value. Taking into consideration the two util_est members, EWMA and enqueued for the filtering, ensures, for both, an up-to-date value. This is for now an issue only for the trace probe that might return the stale value. Functional-wise, it isn't a problem, as the value is always accessed through max(enqueued, ewma). This problem has been observed using LISA's UtilConvergence:test_means on the sd845c board. No regression observed with Hackbench on sd845c and Perf-bench sched pipe on hikey/hikey960. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210225165820.1377125-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
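The change widens the filter so that an update is skipped only when both util_est members are already close to the new sample. A small standalone model of that decision (the 1% margin, the within_margin() check and the field names are illustrative simplifications of the kernel code):

```c
#include <stdio.h>
#include <stdlib.h>

#define UTIL_SCALE   1024
#define UTIL_MARGIN  (UTIL_SCALE / 100)   /* ~1% */

struct util_est_model {
	unsigned int enqueued;
	unsigned int ewma;
};

static int within_margin(int value, int margin)
{
	return abs(value) < margin;
}

/* Return 1 if the update may be skipped, 0 if it must be applied. */
static int util_est_skip_update(const struct util_est_model *ue,
				unsigned int task_util)
{
	int ewma_diff = (int)ue->ewma - (int)task_util;
	int enq_diff  = (int)ue->enqueued - (int)task_util;

	/* The old behaviour checked only ewma_diff; a stale ->enqueued
	 * could then survive a sudden util_avg ramp-up. Check both. */
	return within_margin(ewma_diff, UTIL_MARGIN) &&
	       within_margin(enq_diff, UTIL_MARGIN);
}

int main(void)
{
	struct util_est_model ue = { .enqueued = 200, .ewma = 602 };

	printf("skip? %d\n", util_est_skip_update(&ue, 600)); /* 0: enqueued stale */
	ue.enqueued = 598;
	printf("skip? %d\n", util_est_skip_update(&ue, 600)); /* 1: both close */
	return 0;
}
```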
2021-05-12  genirq/matrix: Prevent allocation counter corruption  [Vitaly Kuznetsov]
[ Upstream commit c93a5e20c3c2dabef8ea360a3d3f18c6f68233ab ] When irq_matrix_free() is called for an unallocated vector, the managed_allocated and total_allocated counters get out of sync with the real state of the matrix. Later, when the last interrupt is freed, these counters will underflow, resulting in UINTMAX because the counters are unsigned. While this is certainly a problem of the calling code, it can be caught in the allocator by checking the allocation bit for the vector to be freed, which simplifies debugging. An example of the problem described above: https://lore.kernel.org/lkml/20210318192819.636943062@linutronix.de/ Add the missing sanity check and emit a warning when it triggers. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210319111823.1105248-1-vkuznets@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
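A minimal model of the added sanity check: refuse to free, and warn about, a vector whose allocation bit is not set, so the unsigned counters can never be decremented past zero (the bitmap size and names below are illustrative, not the irq_matrix code):

```c
#include <stdio.h>

#define NR_VECTORS 64

static unsigned long long alloc_map;        /* one bit per vector */
static unsigned int total_allocated;

static int matrix_alloc(unsigned int bit)
{
	if (bit >= NR_VECTORS || (alloc_map & (1ULL << bit)))
		return -1;
	alloc_map |= 1ULL << bit;
	total_allocated++;
	return 0;
}

static void matrix_free(unsigned int bit)
{
	/* The added check: freeing an unallocated vector is a caller bug;
	 * warn and bail out instead of corrupting the counters. */
	if (bit >= NR_VECTORS || !(alloc_map & (1ULL << bit))) {
		fprintf(stderr, "WARNING: freeing unallocated vector %u\n", bit);
		return;
	}
	alloc_map &= ~(1ULL << bit);
	total_allocated--;
}

int main(void)
{
	matrix_alloc(3);
	matrix_free(3);
	matrix_free(3);   /* second free now warns instead of underflowing */
	printf("total_allocated = %u\n", total_allocated);
	return 0;
}
```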
2021-05-12  posix-timers: Preserve return value in clock_adjtime32()  [Chen Jun]
commit 2d036dfa5f10df9782f5278fc591d79d283c1fad upstream. The return value on success (>= 0) is overwritten by the return value of put_old_timex32(). That works correctly in the fault case, but is wrong for the success case, where put_old_timex32() returns 0. Just check the return value of put_old_timex32() and return -EFAULT if it is not zero. [ tglx: Massage changelog ] Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Cochran <richardcochran@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
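The pattern is simply to keep the (possibly positive) syscall return value and to let the copy-out step decide only between that value and -EFAULT. A tiny standalone model, with stand-in stubs for do_clock_adjtime() and put_old_timex32():

```c
#include <errno.h>
#include <stdio.h>

/* Stand-ins for the real kernel helpers. */
static int do_adjtime_stub(void)      { return 5; }  /* positive clock state */
static int put_old_timex32_stub(void) { return 0; }  /* 0 on successful copy */

static long clock_adjtime32_model(void)
{
	long err = do_adjtime_stub();

	if (err < 0)
		return err;

	/* Only overwrite the return value when the copy-out fails. */
	if (put_old_timex32_stub())
		return -EFAULT;

	return err;               /* preserve the positive success value */
}

int main(void)
{
	printf("%ld\n", clock_adjtime32_model());   /* prints 5, not 0 */
	return 0;
}
```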
2021-05-12  ftrace: Handle commands when closing set_ftrace_filter file  [Steven Rostedt (VMware)]
commit 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 upstream. # echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter will cause switch_mm to stop tracing by the traceoff command. # echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter does nothing. The reason is that the parsing in the write function only processes commands if it finished parsing (there is white space written after the command). That's to handle: write(fd, "switch_mm:", 10); write(fd, "traceoff", 8); cases, where the command is broken over multiple writes. The problem is if the file descriptor is closed, then the write call is not processed, and the command needs to be processed in the release code. The release code can handle matching of functions, but does not handle commands. Cc: stable@vger.kernel.org Fixes: eda1e32855656 ("tracing: handle broken names in ftrace filter") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  perf/core: Fix unconditional security_locked_down() call  [Ondrej Mosnacek]
commit 08ef1af4de5fe7de9c6d69f1e22e51b66e385d9b upstream. Currently, the lockdown state is queried unconditionally, even though its result is used only if the PERF_SAMPLE_REGS_INTR bit is set in attr.sample_type. While that doesn't matter in case of the Lockdown LSM, it causes trouble with the SELinux's lockdown hook implementation. SELinux implements the locked_down hook with a check whether the current task's type has the corresponding "lockdown" class permission ("integrity" or "confidentiality") allowed in the policy. This means that calling the hook when the access control decision would be ignored generates a bogus permission check and audit record. Fix this by checking sample_type first and only calling the hook when its result would be honored. Fixes: b0c8fdc7fdb7 ("lockdown: Lock down perf when in confidentiality mode") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul Moore <paul@paul-moore.com> Link: https://lkml.kernel.org/r/20210224215628.192519-1-omosnace@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
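The shape of the fix is to evaluate the gating bit first and only then call into the hook whose answer would otherwise be thrown away. A hedged standalone model (the PERF_SAMPLE_REGS_INTR flag value follows the perf uapi; the lockdown hook is stubbed):

```c
#include <stdio.h>

#define PERF_SAMPLE_REGS_INTR  (1u << 18)   /* as in the perf sample_type ABI */

static int lockdown_calls;                  /* counts (stubbed) LSM hook calls */

static int security_locked_down_stub(void)
{
	lockdown_calls++;
	return 0;                           /* 0: not locked down */
}

static int perf_copy_attr_model(unsigned long long sample_type)
{
	/* Check the bit first; only then is the lockdown answer honoured,
	 * so only then is the hook (and its audit record) worth the call. */
	if (sample_type & PERF_SAMPLE_REGS_INTR) {
		int ret = security_locked_down_stub();
		if (ret)
			return ret;
	}
	return 0;
}

int main(void)
{
	perf_copy_attr_model(0);
	perf_copy_attr_model(PERF_SAMPLE_REGS_INTR);
	printf("lockdown hook called %d time(s)\n", lockdown_calls);  /* 1 */
	return 0;
}
```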
2021-05-07  swiotlb: respect min_align_mask  [Jianxiong Gao]
commit: 1f221a0d0dbf0e48ef3a9c62871281d6a7819f05 swiotlb: respect min_align_mask Respect the min_align_mask in struct device_dma_parameters in swiotlb. There are two parts to it: 1) for the lower bits of the alignment inside the io tlb slot, just extend the size of the allocation and leave the start of the slot empty 2) for the high bits, ensure we find a slot that matches the high bits of the alignment to avoid wasting too much memory Based on an earlier patch from Jianxiong Gao <jxgao@google.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single  [Jianxiong Gao]
commit: 16fc3cef33a04632ab6b31758abdd77563a20759 swiotlb_tbl_map_single currently never sets a tlb_addr that is not aligned to the tlb bucket size. But we're going to add such a case soon, for which this adjustment would be bogus. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  swiotlb: refactor swiotlb_tbl_map_single  [Jianxiong Gao]
commit: 26a7e094783d482f3e125f09945a5bb1d867b2e6 Split out a bunch of self-contained helpers to make the function easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  swiotlb: clean up swiotlb_tbl_unmap_single  [Jianxiong Gao]
commit: ca10d0f8e530600ec63c603dbace2c30927d70b7 swiotlb: clean up swiotlb_tbl_unmap_single Remove a layer of pointless indentation and replace a hard-to-follow ternary expression with a plain if/else. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  swiotlb: factor out a nr_slots helper  [Jianxiong Gao]
commit: c32a77fd18780a5192dfb6eec69f239faebf28fd Factor out a helper to find the number of slots for a given size. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
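The helper is just a ceiling division by the IO TLB slot size. A standalone sketch with the same arithmetic, shown here with IO_TLB_SHIFT at its conventional value of 11 (2 KiB slots):

```c
#include <stdio.h>

#define IO_TLB_SHIFT 11
#define IO_TLB_SIZE  (1UL << IO_TLB_SHIFT)        /* 2048-byte slots */

/* Number of IO TLB slots needed to cover an allocation of 'val' bytes. */
static unsigned long nr_slots(unsigned long long val)
{
	return (val + IO_TLB_SIZE - 1) >> IO_TLB_SHIFT;
}

int main(void)
{
	printf("%lu\n", nr_slots(1));        /* 1 */
	printf("%lu\n", nr_slots(2048));     /* 1 */
	printf("%lu\n", nr_slots(2049));     /* 2 */
	return 0;
}
```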
2021-05-07  swiotlb: factor out an io_tlb_offset helper  [Jianxiong Gao]
commit: c7fbeca757fe74135d8b6a4c8ddaef76f5775d68 Replace the very generically named OFFSET macro with a little inline helper that hardcodes the alignment to the only value ever passed. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  swiotlb: add a IO_TLB_SIZE define  [Jianxiong Gao]
commit: b5d7ccb7aac3895c2138fe0980a109116ce15eff Add a new IO_TLB_SIZE define instead of open-coding it using IO_TLB_SHIFT all over. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  capabilities: require CAP_SETFCAP to map uid 0  [Serge E. Hallyn]
[ Upstream commit db2e718a47984b9d71ed890eb2ea36ecf150de18 ] cap_setfcap is required to create file capabilities. Since commit 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"), a process running as uid 0 but without cap_setfcap is able to work around this as follows: unshare a new user namespace which maps parent uid 0 into the child namespace. While this task will not have new capabilities against the parent namespace, there is a loophole due to the way namespaced file capabilities are represented as xattrs. File capabilities valid in userns 1 are distinguished from file capabilities valid in userns 2 by the kuid which underlies uid 0. Therefore the restricted root process can unshare a new self-mapping namespace, add a namespaced file capability onto a file, then use that file capability in the parent namespace. To prevent that, do not allow mapping parent uid 0 if the process which opened the uid_map file does not have CAP_SETFCAP, which is the capability for setting file capabilities. As a further wrinkle: a task can unshare its user namespace, then open its uid_map file itself, and map (only) its own uid. In this case we do not have the credential from before unshare, which was potentially more restricted. So, when creating a user namespace, we record whether the creator had CAP_SETFCAP. Then we can use that during map_write(). With this patch: 1. Unprivileged user can still unshare -Ur ubuntu@caps:~$ unshare -Ur root@caps:~# logout 2. Root user can still unshare -Ur ubuntu@caps:~$ sudo bash root@caps:/home/ubuntu# unshare -Ur root@caps:/home/ubuntu# logout 3. Root user without CAP_SETFCAP cannot unshare -Ur: root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap -- root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap unable to set CAP_SETFCAP effective capability: Operation not permitted root@caps:/home/ubuntu# unshare -Ur unshare: write failed /proc/self/uid_map: Operation not permitted Note: an alternative solution would be to allow uid 0 mappings by processes without CAP_SETFCAP, but to prevent such a namespace from writing any file capabilities. This approach can be seen at [1]. Background history: commit 95ebabde382 ("capabilities: Don't allow writing ambiguous v3 file capabilities") tried to fix the issue by preventing v3 fscaps to be written to disk when the root uid would map to the same uid in nested user namespaces. This led to regressions for various workloads. For example, see [2]. Ultimately this is a valid use-case we have to support meaning we had to revert this change in 3b0c2d3eaa83 ("Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")"). Link: https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 [1] Link: https://github.com/containers/buildah/issues/3071 [2] Signed-off-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Andrew G. Morgan <morgan@kernel.org> Tested-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-05-07  bpf: Fix leakage of uninitialized bpf stack under speculation  [Daniel Borkmann]
commit 801c6058d14a82179a7ee17a4b532cac6fad067f upstream. The current implemented mechanisms to mitigate data disclosure under speculation mainly address stack and map value oob access from the speculative domain. However, Piotr discovered that uninitialized BPF stack is not protected yet, and thus old data from the kernel stack, potentially including addresses of kernel structures, could still be extracted from that 512 bytes large window. The BPF stack is special compared to map values since it's not zero initialized for every program invocation, whereas map values /are/ zero initialized upon their initial allocation and thus cannot leak any prior data in either domain. In the non-speculative domain, the verifier ensures that every stack slot read must have a prior stack slot write by the BPF program to avoid such data leaking issue. However, this is not enough: for example, when the pointer arithmetic operation moves the stack pointer from the last valid stack offset to the first valid offset, the sanitation logic allows for any intermediate offsets during speculative execution, which could then be used to extract any restricted stack content via side-channel. Given for unprivileged stack pointer arithmetic the use of unknown but bounded scalars is generally forbidden, we can simply turn the register-based arithmetic operation into an immediate-based arithmetic operation without the need for masking. This also gives the benefit of reducing the needed instructions for the operation. Given after the work in 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask"), the aux->alu_limit already holds the final immediate value for the offset register with the known scalar. Thus, a simple mov of the immediate to AX register with using AX as the source for the original instruction is sufficient and possible now in this case. Reported-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-07  bpf: Fix masking negation logic upon negative dst register  [Daniel Borkmann]
commit b9b34ddbe2076ade359cd5ce7537d5ed019e9807 upstream. The negation logic for the case where the off_reg is sitting in the dst register is not correct given then we cannot just invert the add to a sub or vice versa. As a fix, perform the final bitwise and-op unconditionally into AX from the off_reg, then move the pointer from the src to dst and finally use AX as the source for the original pointer arithmetic operation such that the inversion yields a correct result. The single non-AX mov in between is possible given constant blinding is retaining it as it's not an immediate based operation. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-28  locking/qrwlock: Fix ordering in queued_write_lock_slowpath()  [Ali Saidi]
[ Upstream commit 84a24bf8c52e66b7ac89ada5e3cfbe72d65c1896 ] While this code is executed with the wait_lock held, a reader can acquire the lock without holding wait_lock. The writer side loops checking the value with the atomic_cond_read_acquire(), but only truly acquires the lock when the compare-and-exchange is completed successfully which isn’t ordered. This exposes the window between the acquire and the cmpxchg to an A-B-A problem which allows reads following the lock acquisition to observe values speculatively before the write lock is truly acquired. We've seen a problem in epoll where the reader does a xchg while holding the read lock, but the writer can see a value change out from under it. Writer | Reader -------------------------------------------------------------------------------- ep_scan_ready_list() | |- write_lock_irq() | |- queued_write_lock_slowpath() | |- atomic_cond_read_acquire() | | read_lock_irqsave(&ep->lock, flags); --> (observes value before unlock) | chain_epi_lockless() | | epi->next = xchg(&ep->ovflist, epi); | | read_unlock_irqrestore(&ep->lock, flags); | | | atomic_cmpxchg_relaxed() | |-- READ_ONCE(ep->ovflist); | A core can order the read of the ovflist ahead of the atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire semantics addresses this issue at which point the atomic_cond_read can be switched to use relaxed semantics. Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock") Signed-off-by: Ali Saidi <alisaidi@amazon.com> [peterz: use try_cmpxchg()] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steve Capper <steve.capper@arm.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Tested-by: Steve Capper <steve.capper@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
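The core of the fix is that the compare-and-exchange which actually takes the write lock must have acquire semantics, so that memory accesses made after the lock is "acquired" cannot be satisfied from before it. In C11 terms the general pattern looks like this (a generic sketch of the ordering, not the qrwlock code itself):

```c
#include <stdatomic.h>

static atomic_uint lock_word;        /* 0: unlocked, 1: write-locked (model) */

void write_lock_model(void)
{
	unsigned int expected;

	for (;;) {
		/* Spin cheaply until the lock looks free. */
		while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0)
			;

		/* The CAS that really takes the lock must be an acquire:
		 * with a relaxed CAS, reads in the critical section could
		 * observe values from before the lock was truly owned. */
		expected = 0;
		if (atomic_compare_exchange_strong_explicit(&lock_word,
				&expected, 1,
				memory_order_acquire, memory_order_relaxed))
			return;
	}
}

void write_unlock_model(void)
{
	atomic_store_explicit(&lock_word, 0, memory_order_release);
}

int main(void)
{
	write_lock_model();
	/* ... critical section ... */
	write_unlock_model();
	return 0;
}
```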
2021-04-28  bpf: Tighten speculative pointer arithmetic mask  [Daniel Borkmann]
[ Upstream commit 7fedb63a8307dda0ec3b8969a3b233a1dd7ea8e0 ] This work tightens the offset mask we use for unprivileged pointer arithmetic in order to mitigate a corner case reported by Piotr and Benedict where in the speculative domain it is possible to advance, for example, the map value pointer by up to value_size-1 out-of-bounds in order to leak kernel memory via side-channel to user space. Before this change, the computed ptr_limit for retrieve_ptr_limit() helper represents largest valid distance when moving pointer to the right or left which is then fed as aux->alu_limit to generate masking instructions against the offset register. After the change, the derived aux->alu_limit represents the largest potential value of the offset register which we mask against which is just a narrower subset of the former limit. For minimal complexity, we call sanitize_ptr_alu() from 2 observation points in adjust_ptr_min_max_vals(), that is, before and after the simulated alu operation. In the first step, we retieve the alu_state and alu_limit before the operation as well as we branch-off a verifier path and push it to the verification stack as we did before which checks the dst_reg under truncation, in other words, when the speculative domain would attempt to move the pointer out-of-bounds. In the second step, we retrieve the new alu_limit and calculate the absolute distance between both. Moreover, we commit the alu_state and final alu_limit via update_alu_sanitation_state() to the env's instruction aux data, and bail out from there if there is a mismatch due to coming from different verification paths with different states. Reported-by: Piotr Krysiuk <piotras@gmail.com> Reported-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-28  bpf: Refactor and streamline bounds check into helper  [Daniel Borkmann]
[ Upstream commit 073815b756c51ba9d8384d924c5d1c03ca3d1ae4 ] Move the bounds check in adjust_ptr_min_max_vals() into a small helper named sanitize_check_bounds() in order to simplify the former a bit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-28  bpf: Allow variable-offset stack access  [Andrei Matei]
[ Upstream commit 01f810ace9ed37255f27608a0864abebccf0aab3 ] Before this patch, variable offset access to the stack was dissalowed for regular instructions, but was allowed for "indirect" accesses (i.e. helpers). This patch removes the restriction, allowing reading and writing to the stack through stack pointers with variable offsets. This makes stack-allocated buffers more usable in programs, and brings stack pointers closer to other types of pointers. The motivation is being able to use stack-allocated buffers for data manipulation. When the stack size limit is sufficient, allocating buffers on the stack is simpler than per-cpu arrays, or other alternatives. In unpriviledged programs, variable-offset reads and writes are disallowed (they were already disallowed for the indirect access case) because the speculative execution checking code doesn't support them. Additionally, when writing through a variable-offset stack pointer, if any pointers are in the accessible range, there's possilibities of later leaking pointers because the write cannot be tracked precisely. Writes with variable offset mark the whole range as initialized, even though we don't know which stack slots are actually written. This is in order to not reject future reads to these slots. Note that this doesn't affect writes done through helpers; like before, helpers need the whole stack range to be initialized to begin with. All the stack slots are in range are considered scalars after the write; variable-offset register spills are not tracked. For reads, all the stack slots in the variable range needs to be initialized (but see above about what writes do), otherwise the read is rejected. All register spilled in stack slots that might be read are marked as having been read, however reads through such pointers don't do register filling; the target register will always be either a scalar or a constant zero. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-21bpf: Move sanitize_val_alu out of op switchDaniel Borkmann
commit f528819334881fd622fdadeddb3f7edaed8b7c9b upstream. Add a small sanitize_needed() helper function and move sanitize_val_alu() out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu() as well out of its opcode switch so this helps to streamline both. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-21bpf: Improve verifier error messages for usersDaniel Borkmann
commit a6aaece00a57fa6f22575364b3903dfbccf5345d upstream. Consolidate all error handling and provide more user-friendly error messages from sanitize_ptr_alu() and sanitize_val_alu(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-21bpf: Rework ptr_limit into alu_limit and add common error pathDaniel Borkmann
commit b658bbb844e28f1862867f37e8ca11a8e2aa94a3 upstream. Small refactor with no semantic changes in order to consolidate the max ptr_limit boundary check. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-21bpf: Move off_reg into sanitize_ptr_aluDaniel Borkmann
[ Upstream commit 6f55b2f2a1178856c19bbce2f71449926e731914 ] Small refactor to drag off_reg into sanitize_ptr_alu(), so that we can later use off_reg to generalize some of the checks for all pointer types. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-21bpf: Ensure off_reg has no mixed signed bounds for all typesDaniel Borkmann
[ Upstream commit 24c109bb1537c12c02aeed2d51a347b4d6a9b76e ] The mixed signed bounds check really belongs in retrieve_ptr_limit() instead of outside of it in adjust_ptr_min_max_vals(). The reason is that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer types that we handle in retrieve_ptr_limit(), and, given that errors from the latter propagate back to adjust_ptr_min_max_vals() and lead to rejection of the program, it is the better place for the check to reside so that nothing slips through for future types. The reason why we must reject such an off_reg is that we otherwise would not be able to derive a mask; see details in 9d7eceede769 ("bpf: restrict unknown scalars of mixed signed bounds for unprivileged"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-21bpf: Use correct permission flag for mixed signed bounds arithmeticDaniel Borkmann
[ Upstream commit 9601148392520e2e134936e76788fc2a6371e7be ] We forbid adding unknown scalars with mixed signed bounds due to the spectre v1 masking mitigation. Hence this also needs the bypass_spec_v1 flag instead of allow_ptr_leaks. Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-21bpf: Take module reference for trampoline in moduleJiri Olsa
[ Upstream commit 861de02e5f3f2a104eecc5af1d248cb7bf8c5f75 ] Currently a module can be unloaded even if there's a trampoline registered in it. It's easily reproduced by running in parallel: # while :; do ./test_progs -t module_attach; done # while :; do rmmod bpf_testmod; sleep 0.5; done Take the module reference when the trampoline's ip is within the module code, and release it when the trampoline's ip is unregistered. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210326105900.151466-1-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
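A condensed sketch of the pinning idea (helper name and surrounding locking are simplified here for illustration, not quoted from the patch):

  /* Pin the module whose text contains the trampoline's target ip while the
   * trampoline is registered; the caller drops the reference with
   * module_put() when the trampoline's ip is unregistered.
   */
  static struct module *bpf_trampoline_module_get(unsigned long ip)
  {
          struct module *mod;

          preempt_disable();
          mod = __module_text_address(ip);
          if (mod && !try_module_get(mod))
                  mod = NULL;             /* module is already being unloaded */
          preempt_enable();
          return mod;
  }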
2021-04-21lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" messageTetsuo Handa
[ Upstream commit 3a85969e9d912d5dd85362ee37b5f81266e00e77 ] Since this message is printed when dynamically allocated spinlocks (e.g. kzalloc()) are used without initialization (e.g. spin_lock_init()), suggest to developers that they check whether the initialization functions for their objects were called, before they start wondering what annotation is missing. [ mingo: Minor tweaks to the message. ] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210321064913.4619-1-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-16ftrace: Check if pages were allocated before calling free_pages()Steven Rostedt (VMware)
[ Upstream commit 59300b36f85f254260c81d9dd09195fa49eb0f98 ] It is possible that on error pg->size can be zero when getting its order, which would return a -1 value. It is dangerous to pass in an order of -1 to free_pages(). Check if order is greater than or equal to zero before calling free_pages(). Link: https://lore.kernel.org/lkml/20210330093916.432697c7@gandalf.local.home/ Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
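A sketch of the resulting guard (the order computation and ENTRIES_PER_PAGE are assumed from context, not quoted from the patch):

  /* On the error path pg->size may be zero, so the computed order is -1 and
   * must never reach free_pages().
   */
  if (pg->records) {
          int order = get_count_order(pg->size / ENTRIES_PER_PAGE);

          if (order >= 0)
                  free_pages((unsigned long)pg->records, order);
  }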
2021-04-14lockdep: Address clang -Wformat warning printing for %hdArnd Bergmann
commit 6d48b7912cc72275dc7c59ff961c8bac7ef66a92 upstream. Clang doesn't like format strings that truncate a 32-bit value to something shorter: kernel/locking/lockdep.c:709:4: error: format specifies type 'short' but the argument has type 'int' [-Werror,-Wformat] In this case, the warning is slightly questionable, as it could realize that both class->wait_type_outer and class->wait_type_inner are in fact 8-bit struct members, even though the result of the ?: operator becomes an 'int'. However, there is really no point in printing the number as a 16-bit 'short' rather than either an 8-bit or 32-bit number, so just change it to a normal %d. Fixes: de8f5e4f2dc1 ("lockdep: Introduce wait-type checks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210322115531.3987555-1-arnd@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
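The shape of the change, reconstructed for illustration (the real printk format string in lockdep.c carries more fields than shown here):

  /* Both fields are 8-bit, but the ?: expression promotes to int, so clang
   * warns about %hd; plain %d prints the same value without the warning.
   */
  printk(KERN_CONT "{%d:%d}",
         class->wait_type_outer ?: class->wait_type_inner,
         class->wait_type_inner);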
2021-04-14workqueue: Move the position of debug_work_activate() in __queue_work()Zqiang
[ Upstream commit 0687c66b5f666b5ad433f4e94251590d9bc9d10e ] debug_work_activate() is called on the premise that the work can be inserted, but if the wq is in WQ_DRAINING state, inserting the work may fail. Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability") Signed-off-by: Zqiang <qiang.zhang@windriver.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-14bpf: Refcount task stack in bpf_get_task_stackDave Marchevsky
commit 06ab134ce8ecfa5a69e850f88f81c8a4c3fa91df upstream. On x86 the struct pt_regs * grabbed by task_pt_regs() points to an offset of task->stack. The pt_regs are later dereferenced in __bpf_get_stack (e.g. by user_mode() check). This can cause a fault if the task in question exits while bpf_get_task_stack is executing, as warned by task_stack_page's comment: * When accessing the stack of a non-current task that might exit, use * try_get_task_stack() instead. task_stack_page will return a pointer * that could get freed out from under you. Take the comment's advice and use try_get_task_stack() and put_task_stack() to hold the task->stack refcount, or bail out early if it's already 0. Incrementing stack_refcount will ensure the task's stack sticks around while we're using its data. I noticed this bug while testing a bpf task iter similar to bpf_iter_task_stack in selftests, except mine grabbed the user stack, and I was getting intermittent crashes, which resulted in dumps like: BUG: unable to handle page fault for address: 0000000000003fe0 \#PF: supervisor read access in kernel mode \#PF: error_code(0x0000) - not-present page RIP: 0010:__bpf_get_stack+0xd0/0x230 <snip...> Call Trace: bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec bpf_iter_run_prog+0x24/0x81 __task_seq_show+0x58/0x80 bpf_seq_read+0xf7/0x3d0 vfs_read+0x91/0x140 ksys_read+0x59/0xd0 do_syscall_64+0x48/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
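The pattern the fix adopts looks roughly like the sketch below; the function name, the collect_stack() stand-in and the error code are condensed for illustration and are not the literal diff:

  /* Pin the task's stack before touching its pt_regs, drop the pin when done;
   * if the refcount is already 0 the task is exiting and the stack may be
   * freed out from under us.
   */
  static long get_task_stack_pinned(struct task_struct *task, void *buf, u32 size)
  {
          struct pt_regs *regs;
          long res;

          if (!try_get_task_stack(task))
                  return -EFAULT;                 /* stack already gone */

          regs = task_pt_regs(task);              /* safe while the stack is pinned */
          res = collect_stack(regs, buf, size);   /* stands in for __bpf_get_stack() */

          put_task_stack(task);
          return res;
  }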
2021-04-14bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GETLorenz Bauer
commit 25fc94b2f02d832fa8e29419699dcc20b0b05c6a upstream. Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access permissions based on file_flags, but the returned fd ignores flags. This means that any user can acquire a "read-write" fd for a pinned link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc. Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works because OBJ_GET by default returns a read write mapping and libbpf doesn't expose a way to override this behaviour for programs and links. Fixes: 70ed506c3bbc ("bpf: Introduce pinnable bpf_link abstraction") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
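The enforcement itself is small; a sketch of its shape (variable names assumed, not quoted from the patch):

  /* After converting the BPF_OBJ_GET file_flags to open flags, anything other
   * than plain O_RDWR is refused, so a read-only request can no longer hand
   * back an effectively read-write link fd.
   */
  if (f_flags != O_RDWR)
          return -EINVAL;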
2021-04-14bpf: Enforce that struct_ops programs be GPL-onlyToke Høiland-Jørgensen
commit 12aa8a9467b354ef893ce0fc5719a4de4949a9fb upstream. With the introduction of the struct_ops program type, it became possible to implement kernel functionality in BPF, making it viable to use BPF in place of a regular kernel module for these particular operations. Thus far, the only user of this mechanism is for implementing TCP congestion control algorithms. These are clearly marked as GPL-only when implemented as modules (as seen by the use of EXPORT_SYMBOL_GPL for tcp_register_congestion_control()), so it seems like an oversight that this was not carried over to BPF implementations. Since this is the only user of the struct_ops mechanism, just enforcing GPL-only for the struct_ops program type seems like the simplest way to fix this. Fixes: 0baf26b0fcd7 ("bpf: tcp: Support tcp_congestion_ops in bpf") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210326100314.121853-1-toke@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-14gcov: re-fix clang-11+ supportNick Desaulniers
commit 9562fd132985ea9185388a112e50f2a51557827d upstream. LLVM changed the expected function signature for llvm_gcda_emit_function() in the clang-11 release. Users of clang-11 or newer may have noticed their kernels producing invalid coverage information: $ llvm-cov gcov -a -c -u -f -b <input>.gcda -- gcno=<input>.gcno 1 <func>: checksum mismatch, \ (<lineno chksum A>, <cfg chksum B>) != (<lineno chksum A>, <cfg chksum C>) 2 Invalid .gcda File! ... Fix up the function signatures so calling this function interprets its parameters correctly and computes the correct cfg checksum. In particular, in clang-11, the additional checksum is no longer optional. Link: https://reviews.llvm.org/rG25544ce2df0daa4304c07e64b9c8b0f7df60c11d Link: https://lkml.kernel.org/r/20210408184631.1156669-1-ndesaulniers@google.com Reported-by: Prasad Sodagudi <psodagud@quicinc.com> Tested-by: Prasad Sodagudi <psodagud@quicinc.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-07Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"Jens Axboe
commit d3dc04cd81e0eaf50b2d09ab051a13300e587439 upstream. This reverts commit 15b2219facadec583c24523eed40fa45865f859f. Before IO threads accepted signals, the freezer using fake signals to wake up an IO thread would cause them to loop without any way to clear the pending signal. That is no longer the case, so stop special casing PF_IO_WORKER in the freezer. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-07tracing: Fix stack trace event sizeSteven Rostedt (VMware)
commit 9deb193af69d3fd6dd8e47f292b67c805a787010 upstream. Commit cbc3b92ce037 fixed an issue to modify the macros of the stack trace event so that user space could parse it properly. Originally the stack trace format to user space showed that the call stack was a dynamic array. But it is not actually a dynamic array, in the way that other dynamic event arrays worked, and this broke user space parsing for it. The update was to make the array look like it has 8 entries in it. Helper functions were added to make it parse correctly, as the stack was dynamic, but was determined by the size of the event stored. Although this fixed how user space read the event, it changed the internal structure used for the stack trace event. It changed the array size from [0] to [8] (added 8 entries). This increased the size of the stack trace event by 8 words. The size reserved on the ring buffer was the size of the stack trace event plus the number of stack entries found in the stack trace. That commit caused the amount to be 8 more than what was needed because it did not expect the caller field to have any size. This produced 8 entries of garbage (random data being read) in the stack trace event: <idle>-0 [002] d... 1976396.837549: <stack trace> => trace_event_raw_event_sched_switch => __traceiter_sched_switch => __schedule => schedule_idle => do_idle => cpu_startup_entry => secondary_startup_64_no_verify => 0xc8c5e150ffff93de => 0xffff93de => 0 => 0 => 0xc8c5e17800000000 => 0x1f30affff93de => 0x00000004 => 0x200000000 Instead, subtract the size of the caller field from the size of the event to make sure that only the amount needed to store the stack trace is reserved. Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/ Cc: stable@vger.kernel.org Fixes: cbc3b92ce037 ("tracing: Set kernel_stack's caller size properly") Reported-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
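Illustrative arithmetic for the corrected reservation (field names follow the kernel_stack event layout described above; this is a simplification, not the literal patch):

  /* caller[8] is already part of sizeof(*entry), so it must not be counted a
   * second time on top of the stack words that were actually captured.
   */
  size  = nr_entries * sizeof(unsigned long);      /* captured stack words */
  size += sizeof(*entry) - sizeof(entry->caller);  /* header without caller[8] */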
2021-04-07static_call: Align static_call_is_init() patching conditionPeter Zijlstra
[ Upstream commit 698bacefe993ad2922c9d3b1380591ad489355e9 ] The intent is to avoid writing init code after init (because the text might have been freed). The code is needlessly different between jump_label and static_call and not obviously correct. The existing code relies on the fact that the module loader clears the init layout, such that within_module_init() always fails, while jump_label relies on the module state which is more obvious and matches the kernel logic. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Tested-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lkml.kernel.org/r/20210318113610.636651340@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>