|
Introduce per-VMA locking. The lock implementation relies on per-vma
and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_start_read) requires the vma
(vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
If they match then there must be a vma exclusive lock held somewhere.
- read unlock - (implemented by vma_end_read) is a trivial vma->lock
unlock.
- write lock - (vma_start_write) requires the mmap_lock to be held
exclusively; the current mm counter is then assigned to the vma counter.
This will allow multiple vmas to be locked under a single mmap_lock
write lock (e.g. during vma merging). The vma counter is modified
under exclusive vma lock.
- write unlock - (vma_end_write_all) is a batch release of all vma
locks held. It doesn't pair with a specific vma_start_write! It is
done before the exclusive mmap_lock is released by incrementing the mm
sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read lock, all
vma write locks are released as well (effectively the same as write
unlock).
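A minimal sketch of this protocol, using simplified stand-in structures
(field and helper names here are illustrative, not the exact mm code):

  /* Illustrative only: stand-ins for mm_struct/vm_area_struct. */
  struct mm_sketch  { int mm_lock_seq; };
  struct vma_sketch { int vm_lock_seq; struct mm_sketch *mm; };

  /* read lock: differing counters mean no writer holds this VMA */
  static bool vma_read_trylock(struct vma_sketch *vma)
  {
          if (vma->vm_lock_seq == vma->mm->mm_lock_seq)
                  return false;   /* write-locked: fall back to mmap_lock */
          /* real code takes the per-VMA lock for read and rechecks here */
          return true;
  }

  /* write lock: caller must hold mmap_lock exclusively */
  static void vma_write_lock(struct vma_sketch *vma)
  {
          vma->vm_lock_seq = vma->mm->mm_lock_seq;  /* mark write-locked */
  }

  /* write unlock all: done before the exclusive mmap_lock is released */
  static void vma_write_unlock_all(struct mm_sketch *mm)
  {
          mm->mm_lock_seq++;      /* invalidates every matching vm_lock_seq */
  }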
Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This prepares for page fault handling under VMA lock, looking up VMAs
under the protection of an RCU read lock instead of the usual mmap read
lock.
Link: https://lkml.kernel.org/r/20230227173632.3292573-11-surenb@google.com
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MAX_ORDER is currently defined as the number of orders the page allocator
supports: users can ask the buddy allocator for page orders between 0 and
MAX_ORDER-1. This definition is counter-intuitive and has led to a number
of bugs all over the kernel.
Change the definition of MAX_ORDER to be inclusive: the range of orders
a user can ask the buddy allocator for is now 0..MAX_ORDER.
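For example, a bounds check in a hypothetical caller changes as follows
(illustrative, not part of this patch):

  /* Before: MAX_ORDER was exclusive, valid orders were 0..MAX_ORDER-1. */
  if (order >= MAX_ORDER)
          return -EINVAL;

  /* After: MAX_ORDER is inclusive, valid orders are 0..MAX_ORDER. */
  if (order > MAX_ORDER)
          return -EINVAL;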
[kirill@shutemov.name: fix min() warning]
Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box
[akpm@linux-foundation.org: fix another min_t warning]
[kirill@shutemov.name: fixups per Zi Yan]
Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name
[akpm@linux-foundation.org: fix underlining in docs]
Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MAX_ORDER is not inclusive: the maximum allocation order the buddy
allocator can deliver is MAX_ORDER-1.
Fix MAX_ORDER usage in rb_alloc_aux_page().
Link: https://lkml.kernel.org/r/20230315113133.11326-7-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space. Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
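A hedged sketch of what an RCU-mode lookup permits (names follow the
maple-tree API; the real page-fault path also takes a per-VMA lock before
dropping the RCU read lock, which this sketch elides):

  /* Sketch: walk the maple tree under RCU; in RCU mode the walked
   * nodes stay stable even against a concurrent writer. */
  struct vm_area_struct *find_vma_rcu_sketch(struct mm_struct *mm,
                                             unsigned long addr)
  {
          MA_STATE(mas, &mm->mm_mt, addr, addr);
          struct vm_area_struct *vma;

          rcu_read_lock();
          vma = mas_walk(&mas);
          /* a real caller must pin or lock vma before unlocking RCU */
          rcu_read_unlock();
          return vma;
  }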
[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a tracing instance is removed, the error messages that hold errors
that occurred in the instance need to be freed. The following reports a
memory leak:
# cd /sys/kernel/tracing
# mkdir instances/foo
# echo 'hist:keys=x' > instances/foo/events/sched/sched_switch/trigger
# cat instances/foo/error_log
[ 117.404795] hist:sched:sched_switch: error: Couldn't find field
Command: hist:keys=x
^
# rmdir instances/foo
Then check for memory leaks:
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810d8ec700 (size 192):
comm "bash", pid 869, jiffies 4294950577 (age 215.752s)
hex dump (first 32 bytes):
60 dd 68 61 81 88 ff ff 60 dd 68 61 81 88 ff ff `.ha....`.ha....
a0 30 8c 83 ff ff ff ff 26 00 0a 00 00 00 00 00 .0......&.......
backtrace:
[<00000000dae26536>] kmalloc_trace+0x2a/0xa0
[<00000000b2938940>] tracing_log_err+0x277/0x2e0
[<000000004a0e1b07>] parse_atom+0x966/0xb40
[<0000000023b24337>] parse_expr+0x5f3/0xdb0
[<00000000594ad074>] event_hist_trigger_parse+0x27f8/0x3560
[<00000000293a9645>] trigger_process_regex+0x135/0x1a0
[<000000005c22b4f2>] event_trigger_write+0x87/0xf0
[<000000002cadc509>] vfs_write+0x162/0x670
[<0000000059c3b9be>] ksys_write+0xca/0x170
[<00000000f1cddc00>] do_syscall_64+0x3e/0xc0
[<00000000868ac68c>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff888170c35a00 (size 32):
comm "bash", pid 869, jiffies 4294950577 (age 215.752s)
hex dump (first 32 bytes):
0a 20 20 43 6f 6d 6d 61 6e 64 3a 20 68 69 73 74 . Command: hist
3a 6b 65 79 73 3d 78 0a 00 00 00 00 00 00 00 00 :keys=x.........
backtrace:
[<000000006a747de5>] __kmalloc+0x4d/0x160
[<000000000039df5f>] tracing_log_err+0x29b/0x2e0
[<000000004a0e1b07>] parse_atom+0x966/0xb40
[<0000000023b24337>] parse_expr+0x5f3/0xdb0
[<00000000594ad074>] event_hist_trigger_parse+0x27f8/0x3560
[<00000000293a9645>] trigger_process_regex+0x135/0x1a0
[<000000005c22b4f2>] event_trigger_write+0x87/0xf0
[<000000002cadc509>] vfs_write+0x162/0x670
[<0000000059c3b9be>] ksys_write+0xca/0x170
[<00000000f1cddc00>] do_syscall_64+0x3e/0xc0
[<00000000868ac68c>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
The problem is that the error log needs to be freed when the instance is
removed.
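The shape of the fix is to walk and free the instance's error-log list at
teardown; a sketch (struct and field names are assumptions based on the
kmemleak backtraces above, not the exact patch):

  /* Sketch: free both allocations seen leaking above (the 192-byte
   * tracing_log_err and its separately allocated command string). */
  static void free_instance_err_log(struct trace_array *tr)
  {
          struct tracing_log_err *err, *next;

          list_for_each_entry_safe(err, next, &tr->err_log, list) {
                  list_del(&err->list);
                  kfree(err->cmd);
                  kfree(err);
          }
  }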
Link: https://lore.kernel.org/lkml/76134d9f-a5ba-6a0d-37b3-28310b4a1e91@alu.unizg.hr/
Link: https://lore.kernel.org/linux-trace-kernel/20230404194504.5790b95f@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Fixes: 2f754e771b1a6 ("tracing: Have the error logs show up in the proper instances")
Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Tested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/staging
|
|
The kfree_rcu() and kvfree_rcu() macros' single-argument forms are
deprecated. Therefore switch to the new kfree_rcu_mightsleep() and
kvfree_rcu_mightsleep() variants. The goal is to avoid accidental use
of the single-argument forms, which can introduce functionality bugs in
atomic contexts and latency bugs in non-atomic contexts.
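The conversion itself is mechanical (illustrative):

  /* Before: deprecated single-argument forms. */
  kfree_rcu(ptr);
  kvfree_rcu(ptr);

  /* After: the names make the blocking behavior explicit; these
   * variants may sleep and must not be used in atomic context. */
  kfree_rcu_mightsleep(ptr);
  kvfree_rcu_mightsleep(ptr);

  /* The two-argument form remains the non-sleeping, atomic-safe API
   * (rhf names the rcu_head field inside *ptr). */
  kfree_rcu(ptr, rhf);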
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The kvfree_rcu() macro's single-argument form is deprecated. Therefore
switch to the new kvfree_rcu_mightsleep() variant. The goal is to
avoid accidental use of the single-argument forms, which can introduce
functionality bugs in atomic contexts and latency bugs in non-atomic
contexts.
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can
result in a NULL-pointer dereference:
CPU1                                           CPU2
rcu_preempt_deferred_qs_irqrestore             rcu_print_task_exp_stall
  if (special.b.blocked)                         READ_ONCE(rnp->exp_tasks) != NULL
    raw_spin_lock_rcu_node
    np = rcu_next_node_entry(t, rnp)
    if (&t->rcu_node_entry == rnp->exp_tasks)
      WRITE_ONCE(rnp->exp_tasks, np)
      ....
      raw_spin_unlock_irqrestore_rcu_node
                                                 raw_spin_lock_irqsave_rcu_node
                                                 t = list_entry(rnp->exp_tasks->prev,
                                                                struct task_struct,
                                                                rcu_node_entry)
                                                 (if rnp->exp_tasks is NULL, this
                                                  will dereference a NULL pointer)
The problem is that CPU2 accesses the rcu_node structure's ->exp_tasks
field without holding the rcu_node structure's ->lock, and CPU2 did not
observe CPU1's change to the rcu_node structure's ->exp_tasks field in
time. Therefore, if CPU1 sets the rcu_node structure's ->exp_tasks
pointer to NULL, then CPU2 might dereference that NULL pointer.
This commit therefore holds the rcu_node structure's ->lock while
accessing that structure's ->exp_tasks field.
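The fixed access pattern in rcu_print_task_exp_stall(), as a simplified
sketch:

  /* Sketch: sample ->exp_tasks only while holding the rcu_node ->lock. */
  raw_spin_lock_irqsave_rcu_node(rnp, flags);
  if (!rnp->exp_tasks) {
          raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
          return 0;
  }
  t = list_entry(rnp->exp_tasks->prev, struct task_struct, rcu_node_entry);
  /* ... walk the blocked tasks ... */
  raw_spin_unlock_irqrestore_rcu_node(rnp, flags);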
[ paulmck: Apply Frederic Weisbecker feedback. ]
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
Registering a kprobe on __rcu_irq_enter_check_tick() can cause kernel
stack overflow as shown below. This issue can be reproduced by enabling
CONFIG_NO_HZ_FULL and booting the kernel with argument "nohz_full=",
and then giving the following commands at the shell prompt:
# cd /sys/kernel/tracing/
# echo 'p:mp1 __rcu_irq_enter_check_tick' >> kprobe_events
# echo 1 > events/kprobes/enable
This commit therefore adds __rcu_irq_enter_check_tick() to the kprobes
blacklist using NOKPROBE_SYMBOL().
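That is, the fix amounts to the annotation itself (sketch; body elided):

  noinstr void __rcu_irq_enter_check_tick(void)
  {
          /* ... body unchanged ... */
  }
  NOKPROBE_SYMBOL(__rcu_irq_enter_check_tick);

The reported stack overflow: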
Insufficient stack space to handle exception!
ESR: 0x00000000f2000004 -- BRK (AArch64)
FAR: 0x0000ffffccf3e510
Task stack: [0xffff80000ad30000..0xffff80000ad38000]
IRQ stack: [0xffff800008050000..0xffff800008058000]
Overflow stack: [0xffff089c36f9f310..0xffff089c36fa0310]
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
pstate: 400003c5 (nZcv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __rcu_irq_enter_check_tick+0x0/0x1b8
lr : ct_nmi_enter+0x11c/0x138
sp : ffff80000ad30080
x29: ffff80000ad30080 x28: ffff089c82e20000 x27: 0000000000000000
x26: 0000000000000000 x25: ffff089c02a8d100 x24: 0000000000000000
x23: 00000000400003c5 x22: 0000ffffccf3e510 x21: ffff089c36fae148
x20: ffff80000ad30120 x19: ffffa8da8fcce148 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffffa8da8e44ea6c
x14: ffffa8da8e44e968 x13: ffffa8da8e03136c x12: 1fffe113804d6809
x11: ffff6113804d6809 x10: 0000000000000a60 x9 : dfff800000000000
x8 : ffff089c026b404f x7 : 00009eec7fb297f7 x6 : 0000000000000001
x5 : ffff80000ad30120 x4 : dfff800000000000 x3 : ffffa8da8e3016f4
x2 : 0000000000000003 x1 : 0000000000000000 x0 : 0000000000000000
Kernel panic - not syncing: kernel stack overflow
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xf8/0x108
show_stack+0x20/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
panic+0x214/0x404
add_taint+0x0/0xf8
panic_bad_stack+0x144/0x160
handle_bad_stack+0x38/0x58
__bad_stack+0x78/0x7c
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
[...]
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
el1_interrupt+0x28/0x60
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x64/0x68
__ftrace_set_clr_event_nolock+0x98/0x198
__ftrace_set_clr_event+0x58/0x80
system_enable_write+0x144/0x178
vfs_write+0x174/0x738
ksys_write+0xd0/0x188
__arm64_sys_write+0x4c/0x60
invoke_syscall+0x64/0x180
el0_svc_common.constprop.0+0x84/0x160
do_el0_svc+0x48/0xe8
el0_svc+0x34/0xd0
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x190/0x194
SMP: stopping secondary CPUs
Kernel Offset: 0x28da86000000 from 0xffff800008000000
PHYS_OFFSET: 0xfffff76600000000
CPU features: 0x00000,01a00100,0000421b
Memory Limit: none
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/all/20221119040049.795065-1-zhengyejian1@huawei.com/
Fixes: aaf2bc50df1f ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The call to synchronize_srcu() from rcu_tasks_postscan() can be stalled
by a task getting stuck in do_exit() between that function's calls to
exit_tasks_rcu_start() and exit_tasks_rcu_finish(). To ease diagnosis
of this situation, print a stall warning message every rcu_task_stall_info
period when rcu_tasks_postscan() is stalled.
[ paulmck: Adjust to handle CONFIG_SMP=n. ]
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/rcu/20230111212736.GA1062057@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
According to the commit log of the patch that added it to the kernel,
start_poll_synchronize_rcu_expedited() can be invoked very early, as
in long before rcu_init() has been invoked. But before rcu_init(),
the rcu_data structure's ->mynode field has not yet been initialized.
This means that the start_poll_synchronize_rcu_expedited() function's
attempt to set the CPU's leaf rcu_node structure's ->exp_seq_poll_rq
field will result in a segmentation fault.
This commit therefore causes start_poll_synchronize_rcu_expedited() to
set ->exp_seq_poll_rq only after rcu_init() has initialized all CPUs'
rcu_data structures' ->mynode fields. It also removes the check from
the rcu_init() function so that start_poll_synchronize_rcu_expedited()
is unconditionally invoked. Yes, this might result in an unnecessary
boot-time grace period, but this is down in the noise.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The rcu_accelerate_cbs() function is invoked by rcu_report_qs_rdp()
only if there is a grace period in progress that is still blocked
by at least one CPU on this rcu_node structure. This means that
rcu_accelerate_cbs() should never return the value true, and thus that
this function should never set the needwake variable and in turn never
invoke rcu_gp_kthread_wake().
This commit therefore removes the needwake variable and the invocation
of rcu_gp_kthread_wake() in favor of a WARN_ON_ONCE() on the call to
rcu_accelerate_cbs(). The purpose of this new WARN_ON_ONCE() is to
detect situations where the system's opinion differs from ours.
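The shape of the change in rcu_report_qs_rdp() (sketch):

  /* Before: */
  needwake = rcu_accelerate_cbs(rnp, rdp);
  /* ... */
  if (needwake)
          rcu_gp_kthread_wake();

  /* After: a true return here would mean our understanding is wrong. */
  WARN_ON_ONCE(rcu_accelerate_cbs(rnp, rdp));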
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The lazy_rcu_shrink_count() shrinker function is registered even in
kernels built with CONFIG_RCU_LAZY=n, in which case this function
uselessly consumes cycles learning that no CPU has any lazy callbacks
queued.
This commit therefore registers this shrinker function only in the kernels
built with CONFIG_RCU_LAZY=y, where it might actually do something useful.
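A sketch of the guard (the shrinker name follows the description above;
treat the registration call as illustrative):

  #ifdef CONFIG_RCU_LAZY
          if (register_shrinker(&lazy_rcu_shrinker, "rcu-lazy"))
                  pr_err("Failed to register lazy_rcu shrinker!\n");
  #endif /* #ifdef CONFIG_RCU_LAZY */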
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
RCU expedited grace periods to actually force-enable scheduling-clock
interrupts on holdout CPUs.
Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
For kernels built with CONFIG_NO_HZ_FULL=y, the following scenario can result
in the scheduling-clock interrupt remaining enabled on a holdout CPU after
its quiescent state has been reported:
CPU1                                           CPU2
rcu_report_exp_cpu_mult                        synchronize_rcu_expedited_wait
  acquires rnp->lock                             mask = rnp->expmask;
                                                 for_each_leaf_node_cpu_mask(rnp, cpu, mask)
  rnp->expmask = rnp->expmask & ~mask;             rdp = per_cpu_ptr(&rcu_data, cpu1);
  for_each_leaf_node_cpu_mask(rnp, cpu, mask)
    rdp = per_cpu_ptr(&rcu_data, cpu1);
    if (!rdp->rcu_forced_tick_exp)
      continue;                                    rdp->rcu_forced_tick_exp = true;
                                                   tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);
The problem is that CPU2's sampling of rnp->expmask is obsolete by the
time it invokes tick_dep_set_cpu(), and CPU1 is not guaranteed to see
CPU2's store to ->rcu_forced_tick_exp in time to clear it. And even if
CPU1 does see that store, it might invoke tick_dep_clear_cpu() before
CPU2 got around to executing its tick_dep_set_cpu(), which would still
leave the victim CPU with its scheduler-clock tick running.
Either way, a nohz_full real-time application running on the victim
CPU would have its latency needlessly degraded.
Note that expedited RCU grace periods look at context-tracking
information, and so if the CPU is executing in nohz_full usermode
throughout, that CPU cannot be victimized in this manner.
This commit therefore causes synchronize_rcu_expedited_wait() to hold
the rcu_node structure's ->lock when checking for holdout CPUs, setting
TICK_DEP_BIT_RCU_EXP, and invoking tick_dep_set_cpu(), thus preventing
this race.
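A sketch of the fixed loop in synchronize_rcu_expedited_wait()
(simplified; the recheck of ->expmask under ->lock is the point):

  for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
          rdp = per_cpu_ptr(&rcu_data, cpu);
          if (rdp->rcu_forced_tick_exp)
                  continue;
          raw_spin_lock_irqsave_rcu_node(rnp, flags);
          /* Recheck: this CPU may have just reported its QS. */
          if (rnp->expmask & rdp->grpmask) {
                  rdp->rcu_forced_tick_exp = true;
                  tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
          }
          raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
  }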
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined.
However, cpu_is_hotpluggable() still returns true for those CPUs. This causes
torture tests that do offlining to end up trying to offline this CPU,
causing test failures. Such failures happen on all architectures.
Fix the repeated error messages thrown by this (even if the hotplug errors are
harmless) by asking the opinion of the nohz subsystem on whether the CPU can be
hotplugged.
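A sketch of the resulting drivers/base check (tick_nohz_cpu_hotpluggable()
being the nohz-side helper; treat this as illustrative):

  bool cpu_is_hotpluggable(unsigned int cpu)
  {
          struct device *dev = get_cpu_device(cpu);

          return dev && container_of(dev, struct cpu, dev)->hotpluggable &&
                 tick_nohz_cpu_hotpluggable(cpu);
  }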
[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]
For drivers/base/ portion:
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: rcu <rcu@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
Now that all references to CONFIG_SRCU have been removed, it is time to
remove CONFIG_SRCU itself.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
This commit adds a comment to help explain why the "else" clause of the
in_serving_softirq() "if" statement does not need to enforce a time limit.
The reason is that this "else" clause handles rcuoc kthreads that do not
block handlers for other softirq vectors.
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.
The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.
However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match. And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).
The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:
C srcu

(*
 * bad condition: P0's first scan (SCAN1) saw P1's idx=0 LOCK count inc, though P1 saw flip.
 *
 * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
 * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
 * relations. Combining the ->ppo with ->rf, a cycle is impossible.
 *)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int lock1;
        int unlock1;
        int lock0;
        int unlock0;

        // SCAN1
        unlock1 = READ_ONCE(*UNLOCK1);
        smp_mb(); // A
        lock1 = READ_ONCE(*LOCK1);

        // FLIP
        if (lock1 == unlock1) { // Control dep
                smp_mb(); // E // Remove E and still passes.
                WRITE_ONCE(*IDX, 1);
                smp_mb(); // D

                // SCAN2
                unlock0 = READ_ONCE(*UNLOCK0);
                smp_mb(); // A
                lock0 = READ_ONCE(*LOCK0);
        }
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
        int tmp;
        int idx1;
        int idx2;

        // 1st reader
        idx1 = READ_ONCE(*IDX);
        if (idx1 == 0) { // Control dep
                tmp = READ_ONCE(*LOCK0);
                WRITE_ONCE(*LOCK0, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK0);
                WRITE_ONCE(*UNLOCK0, tmp + 1);
        } else {
                tmp = READ_ONCE(*LOCK1);
                WRITE_ONCE(*LOCK1, tmp + 1);
                smp_mb(); /* B and C */
                tmp = READ_ONCE(*UNLOCK1);
                WRITE_ONCE(*UNLOCK1, tmp + 1);
        }
}

exists (0:lock1=1 /\ 1:idx1=1)
More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.
This commit therefore clarifies the comment on memory barrier E.
Why not also remove that redundant smp_mb()?
Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools. Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath. Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
The state space of the GP sequence number isn't documented and the
definitions of its special values are scattered. This commit therefore
gathers some common knowledge near the grace-period sequence-number
definitions.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
|
|
PSI offers two mechanisms to get information about a specific resource
pressure. One is reading from /proc/pressure/<resource>, which gives
average pressures aggregated every 2s. The other is creating a pollable
fd for a specific resource and cgroup.
Trigger creation requires CAP_SYS_RESOURCE, and gives the ability to
pick a specific time window and threshold, spawning an RT thread to
aggregate the data.
Systemd would like to provide containers the option to monitor pressure
on their own cgroup and sub-cgroups. For example, if systemd launches a
container that itself then launches services, the container should have
the ability to poll() for pressure in individual services. But neither
the container nor the services are privileged.
This patch implements a mechanism to allow unprivileged users to create
pressure triggers. The difference from privileged trigger creation is
that unprivileged ones must have a time window that is a multiple of 2s.
This is so that we can avoid unrestricted spawning of RT threads, and
instead use the same aggregation mechanism done for the averages, which
runs independently of any triggers.
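For reference, a runnable userspace sketch of creating and polling such a
trigger (following the documented PSI trigger ABI; the 2000000us window is
a multiple of 2s, as unprivileged use requires):

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          struct pollfd pfd;
          /* 150ms of "some" memory stall within a 2s window */
          const char trig[] = "some 150000 2000000";

          pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
          if (pfd.fd < 0 || write(pfd.fd, trig, strlen(trig) + 1) < 0) {
                  perror("psi trigger");
                  return 1;
          }
          pfd.events = POLLPRI;
          if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI))
                  printf("memory pressure event\n");
          close(pfd.fd);
          return 0;
  }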
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-5-cerasuolodomenico@gmail.com
|
|
This change moves the update_total flag out of the update_triggers()
function, which is currently called only from psi_poll_work().
In the next patch, update_triggers() will also be called from
psi_avgs_work(), but the total-update information is specific to
psi_poll_work(). Returning the update_total value to the caller lets us
avoid differentiating the implementation of update_triggers() for the
different aggregators.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-4-cerasuolodomenico@gmail.com
|
|
Rename parts of the PSI implementation to make a clear distinction
between the privileged and the unprivileged trigger code to be
implemented in the next patch.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-3-cerasuolodomenico@gmail.com
|
|
Move a few functions up in the file to avoid the forward declarations
otherwise needed in the patch implementing unprivileged PSI triggers.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20230330105418.77061-2-cerasuolodomenico@gmail.com
|
|
There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.
When wake_affine_idle() returns prev_cpu, which does not equal
nr_cpumask_bits, it slips through the target == nr_cpumask_bits check in
wake_affine() and is counted as if target == this_cpu in schedstats.
Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.
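The resulting tail of wake_affine(), as a sketch:

  if (target != this_cpu)
          return prev_cpu;        /* non-affine: don't bump affine stats */

  schedstat_inc(sd->ttwu_move_affine);
  schedstat_inc(p->stats.nr_wakeups_affine);
  return target;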
Fixes: 806486c377e33 ("sched/fair: Do not migrate if the prev_cpu is idle")
Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20220810223313.386614-1-libo.chen@oracle.com
|
|
The same-task check in perf_event_set_output() has some potential issues
for some usages.
For the current perf code, there is a problem when using perf_event_open()
to have multiple events sample into the same mmap'd memory when they are
both attached to the same process.
https://lore.kernel.org/all/92645262-D319-4068-9C44-2409EF44888E@gmail.com/
That is because event->ctx is not ready when perf_event_set_output() is
invoked from perf_event_open().
Besides the above issue, before commit bd2756811766 ("perf: Rewrite
core context handling"), perf record could error out when sampling with
a hardware event and a software event, as below.
$ perf record -e cycles,dummy --per-thread ls
failed to mmap with 22 (Invalid argument)
That's because, prior to that commit, a hardware event and a software
event came from different task contexts.
This has been a long-standing issue since commit c3f00c70276d
("perf: Separate find_get_context() from event initialization").
The task struct is stored in event->hw.target for each per-thread
event. It is a more reliable way to determine whether two events are
attached to the same task.
The event->hw.target was introduced several years ago by commit
50f16a8bf9d7 ("perf: Remove type specific target pointers"). It can
not only be used to fix the issue with the current code, but can also
be backported to fix the issues with older kernels.
Note: The event->hw.target was introduced later than commit
c3f00c70276d. The patch cannot be applied between commits
c3f00c70276d and 50f16a8bf9d7. Anybody who wants to back-port this to
that period may have to find another solution.
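A minimal sketch of the comparison this enables (hypothetical helper; the
real check is open-coded in perf_event_set_output()):

  /* Two per-thread events target the same task iff hw.target matches. */
  static bool same_target(struct perf_event *a, struct perf_event *b)
  {
          return a->hw.target == b->hw.target;
  }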
Fixes: c3f00c70276d ("perf: Separate find_get_context() from event initialization")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
Link: https://lkml.kernel.org/r/20230322202449.512091-1-kan.liang@linux.intel.com
|
|
Thomas reported that offlining CPUs spends a lot of time in
synchronize_rcu() as called from perf_pmu_migrate_context() even though
he's not actually using uncore events.
Turns out, the thing is unconditionally waiting for RCU, even if there
are no actual events to migrate.
Fixes: 0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20230403090858.GT4253@hirez.programming.kicks-ass.net
|
|
The kernel command line ftrace_boot_snapshot by itself is supposed to
trigger a snapshot at the end of boot up of the main top level trace
buffer. A ftrace_boot_snapshot=foo will do the same for an instance called
foo that was created by trace_instance=foo,...
The logic was broken: if ftrace_boot_snapshot was given by itself, it
would trigger a snapshot for all instances that had tracing enabled,
regardless of whether they asked for a snapshot or not.
When a snapshot is requested for a buffer, the buffer's
tr->allocated_snapshot is set to true. Use that to know if a trace buffer
wants a snapshot at boot up or not.
Since the top level buffer is part of the ftrace_trace_arrays list,
there's no reason to treat it differently than the other buffers. Just
iterate the list if ftrace_boot_snapshot was specified.
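A sketch of the corrected boot-time logic (the top-level buffer is just
another entry on the list):

  list_for_each_entry(tr, &ftrace_trace_arrays, list) {
          if (tr->allocated_snapshot)
                  tracing_snapshot_instance(tr);
  }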
Link: https://lkml.kernel.org/r/20230405022341.895334039@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes: 9c1c251d670bc ("tracing: Allow boot instances to have snapshot buffers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
If a trace instance has a failure with its snapshot code, the error
message is to be written to that instance's buffer. But currently, the
message is written to the top level buffer. Worse yet, it may also disable
the top level buffer and not the instance that had the issue.
Link: https://lkml.kernel.org/r/20230405022341.688730321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes: 2824f50332486 ("tracing: Make the snapshot trigger work with instances")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Artem reported suspicious RCU usage [1]. The reason is that the verifier
calls find_kallsyms_symbol_value() with preemption enabled, which triggers
a suspicious-RCU-usage warning in the rcu_dereference_sched() call.
Disable preemption in find_kallsyms_symbol_value() and add a
__find_kallsyms_symbol_value() helper.
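A sketch of the resulting split (bodies elided):

  static unsigned long __find_kallsyms_symbol_value(struct module *mod,
                                                    const char *name)
  {
          /* original body: rcu_dereference_sched() is now legal here */
  }

  unsigned long find_kallsyms_symbol_value(struct module *mod,
                                           const char *name)
  {
          unsigned long ret;

          preempt_disable();
          ret = __find_kallsyms_symbol_value(mod, name);
          preempt_enable();
          return ret;
  }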
Fixes: 31bf1dbccfb0 ("bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules")
Reported-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Link: https://lore.kernel.org/bpf/20230403220254.2191240-1-jolsa@kernel.org
[1] https://lore.kernel.org/bpf/ZBrPMkv8YVRiWwCR@samus.usersys.redhat.com/
|
|
The commit 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.")
broke several tracing bpf programs. Even in clang-compiled kernels there are
many fields that are not marked with __rcu that are safe to read and pass into
helpers, but the verifier doesn't know that they're safe. Aggressively marking
them as PTR_UNTRUSTED was premature.
Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-8-alexei.starovoitov@gmail.com
|
|
Allow bpf programs to access cgrp->kn, mm->exe_file, skb->sk, and req->sk.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-7-alexei.starovoitov@gmail.com
|
|
check_reg_type() unconditionally disallows PTR_TO_BTF_ID | PTR_MAYBE_NULL.
It's problematic for helpers that allow ARG_PTR_TO_BTF_ID_OR_NULL like
bpf_sk_storage_get(). Allow passing PTR_TO_BTF_ID | PTR_MAYBE_NULL into such
helpers. That technically includes bpf_kptr_xchg() helper, but in practice:
bpf_kptr_xchg(..., bpf_cpumask_create());
is still disallowed because bpf_cpumask_create() returns ref counted pointer
with ref_obj_id > 0.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-6-alexei.starovoitov@gmail.com
|
|
The bpf_[sk|inode|task|cgrp]_storage_[get|delete]() and
bpf_get_socket_cookie() helpers perform a run-time check that the
sk|inode|task|cgrp pointer != NULL.
Teach verifier about this fact and allow bpf programs to pass
PTR_TO_BTF_ID | PTR_MAYBE_NULL into such helpers.
It will be used in the subsequent patch that will do
bpf_sk_storage_get(.., skb->sk, ...);
Even when 'skb' pointer is trusted the 'sk' pointer may be NULL.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-5-alexei.starovoitov@gmail.com
|
|
btf_nested_type_is_trusted() tries to find a struct member at the
corresponding offset. It works for flat structures but falls apart in
more complex structs with nested structs.
The offset->member search is already performed by btf_struct_walk() including nested structs.
Reuse this work and pass {field name, field btf id} into btf_nested_type_is_trusted()
instead of offset to make BTF_TYPE_SAFE*() logic more robust.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-4-alexei.starovoitov@gmail.com
|
|
Remove unused arguments from btf_struct_access() callback.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-3-alexei.starovoitov@gmail.com
|
|
Remove duplicated if (atype == BPF_READ) btf_struct_access() from
btf_struct_access() callback and invoke it only for writes. This is
possible to do because currently btf_struct_access() custom callback
always delegates to generic btf_struct_access() helper for BPF_READ
accesses.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230404045029.82870-2-alexei.starovoitov@gmail.com
|
|
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit creates an srcu_usage pointer named "sup" as a shorter
synonym for the "ssp->srcu_sup" that was bloating several lines of code.
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
If a given statically allocated in-module srcu_struct structure was ever
used for updates, srcu_module_going() will invoke cleanup_srcu_struct()
at module-exit time. This will check for the error case of SRCU readers
persisting past module-exit time. On the other hand, if this srcu_struct
structure never went through a grace period, srcu_module_going() only
invokes free_percpu(), which would result in strange failures if SRCU
readers persisted past module-exit time.
This commit therefore adds a srcu_readers_active() check to
srcu_module_going(), splatting if readers have persisted and refraining
from invoking free_percpu() in that case. Better to leak memory than
to suffer silent memory corruption!
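The shape of the added guard in srcu_module_going() (sketch):

  /* Better to leak the per-CPU data than to free it under readers. */
  if (!WARN_ON_ONCE(srcu_readers_active(ssp)))
          free_percpu(ssp->sda);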
[ paulmck: Apply Zhang, Qiang1 feedback on memory leak. ]
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the ->reschedule_jiffies, ->reschedule_count, and
->work fields from the srcu_struct structure to the srcu_usage structure
to reduce the size of the former in order to improve cache locality.
However, this means that the container_of() calls cannot get a pointer
to the srcu_struct because they are no longer in the srcu_struct.
This issue is addressed by adding a ->srcu_ssp field in the srcu_usage
structure that references the corresponding srcu_struct structure.
And given the presence of the sup pointer to the srcu_usage structure,
replace some ssp->srcu_sup-> instances with sup->.
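A sketch of the resulting arrangement (fields abridged; process_srcu()
shown as a representative workqueue handler):

  struct srcu_usage {
          /* ... grace-period and barrier state ... */
          struct work_struct work;        /* moved from srcu_struct */
          struct srcu_struct *srcu_ssp;   /* back-pointer for handlers */
  };

  static void process_srcu(struct work_struct *work)
  {
          struct srcu_usage *sup = container_of(work, struct srcu_usage,
                                                work);
          struct srcu_struct *ssp = sup->srcu_ssp;
          /* ... */
  }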
[ paulmck: Apply feedback from kernel test robot. ]
Link: https://lore.kernel.org/oe-kbuild-all/202303191400.iO5BOqka-lkp@intel.com/
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the ->srcu_barrier_seq, ->srcu_barrier_mutex,
->srcu_barrier_completion, and ->srcu_barrier_cpu_cnt fields from the
srcu_struct structure to the srcu_usage structure to reduce the size of
the former in order to improve cache locality.
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the ->sda_is_static field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries,
and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the
srcu_usage structure to reduce the size of the former in order to improve
cache locality.
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed,
->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields
from the srcu_struct structure to the srcu_usage structure to reduce
the size of the former in order to improve cache locality.
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the ->srcu_gp_mutex field from the srcu_struct structure
to the srcu_usage structure to reduce the size of the former in order
to improve cache locality.
Suggested-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|