Age | Commit message (Collapse) | Author |
|
This commit saves a line of code by initializing the rcu_gp_fqs()
function's first_gp_fqs local variable in its declaration.
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
monitor_todo is not needed as the work struct already tracks
if work is pending. Just use that to know if work is pending
using schedule_delayed_work() helper.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
|
|
When a CPU is slow to provide a quiescent state for a given grace
period, RCU takes steps to encourage that CPU to get with the
quiescent-state program in a more timely fashion. These steps
include these flags in the rcu_data structure:
1. ->rcu_urgent_qs, which causes the scheduling-clock interrupt to
request an otherwise pointless context switch from the scheduler.
2. ->rcu_need_heavy_qs, which causes both cond_resched() and RCU's
context-switch hook to do an immediate momentary quiscent state.
3. ->rcu_need_heavy_qs, which causes the scheduler-clock tick to
be enabled even on nohz_full CPUs with only one runnable task.
These flags are of course cleared once the corresponding CPU has passed
through a quiescent state. Unless that quiescent state is the CPU
going offline, which means that when the CPU comes back online, it will
needlessly consume additional CPU time and incur additional latency,
which constitutes a minor but very real performance bug.
This commit therefore adds the call to rcu_disable_urgency_upon_qs()
that clears these flags to the CPU-hotplug offlining code path.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
|
|
Currently, the rcu_node structure's ->cbovlmask field is set in call_rcu()
when a given CPU is suffering from callback overload. But if that CPU
goes offline, the outgoing CPU's callbacks is migrated to the running
CPU, which is likely to overload the running CPU. However, that CPU's
bit in its leaf rcu_node structure's ->cbovlmask field remains zero.
Initially, this is OK because the outgoing CPU's bit remains set.
However, that bit will be cleared at the next end of a grace period,
at which time it is quite possible that the running CPU will still
be overloaded. If the running CPU invokes call_rcu(), then overload
will be checked for and the bit will be set. Except that there is no
guarantee that the running CPU will invoke call_rcu(), in which case the
next grace period will fail to take the running CPU's overload condition
into account. Plus, because the bit is not set, the end of the grace
period won't check for overload on this CPU.
This commit therefore adds a call to check_cb_ovld_locked() in
rcutree_migrate_callbacks() to set the running CPU's ->cbovlmask bit
appropriately.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
|
|
The force-quiesce-state loop function rcu_gp_fqs_loop() checks for
callback overloading and does an immediate initial scan for idle CPUs
if so. However, subsequent rescans will be carried out at as leisurely a
rate as they always are, as specified by the rcutree.jiffies_till_next_fqs
module parameter. It might be tempting to just continue immediately
rescanning, but this turns the RCU grace-period kthread into a CPU hog.
It might also be tempting to reduce the time between rescans to a single
jiffy, but this can be problematic on larger systems.
This commit therefore divides the normal time between rescans by three,
rounding up. Thus a small system running at HZ=1000 that is suffering
from callback overload will wait only one jiffy instead of the normal
three between rescans.
[ paulmck: Apply Neeraj Upadhyay feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
|
|
Context tracking's state and dynticks counter are going to be merged
in a single field so that both updates can happen atomically and at the
same time. Prepare for that with converting the state into an atomic_t.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
Move the core RCU eqs/dynticks functions to context tracking so that
we can later merge all that code within context tracking.
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
To prepare for migrating the RCU eqs accounting code to context tracking,
split the last-resort deferred nocb resched from rcu_user_enter() and
move it into a separate call from context tracking.
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
The RCU eqs tracking is going to be performed by the context tracking
subsystem. The related nesting counters thus need to be moved to the
context tracking structure.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
The RCU eqs tracking is going to be performed by the context tracking
subsystem. The related nesting counters thus need to be moved to the
context tracking structure.
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
In order to prepare for merging RCU dynticks counter into the context
tracking state, move the rcu_data's dynticks field to the context
tracking structure. It will later be mixed within the context tracking
state itself.
[ paulmck: Move enum ctx_state into global scope. ]
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
Now rcu_irq_enter/exit() is an unnecessary middle call between
ct_irq_enter/exit() and nmi_irq_enter/exit(). Take this opportunity
to remove the former functions and move the comments above them to the
new entrypoints.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
The RCU dynticks counter is going to be merged into the context tracking
subsystem. Start with moving the idle extended quiescent states
entrypoints to context tracking. For now those are dumb redirections to
existing RCU calls.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
|
|
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This commit applies the "noinstr" tag to the rcu_idle_enter() and
rcu_idle_exit() functions, which are invoked from portions of the idle
loop that cannot be instrumented. These tags require reworking the
rcu_eqs_enter() and rcu_eqs_exit() functions that these two functions
invoke in order to cause them to use normal assertions rather than
lockdep. In addition, within rcu_idle_exit(), the raw versions of
local_irq_save() and local_irq_restore() are used, again to avoid issues
with lockdep in uninstrumented code.
This patch is based in part on an earlier patch by Jiri Olsa, discussions
with Peter Zijlstra and Frederic Weisbecker, earlier changes by Thomas
Gleixner, and off-list discussions with Yonghong Song.
Link: https://lore.kernel.org/lkml/20220515203653.4039075-1-jolsa@kernel.org/
Reported-by: Jiri Olsa <jolsa@kernel.org>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Yonghong Song <yhs@fb.com>
|
|
It is currently up to the caller to handle stale return values from
get_state_synchronize_rcu(). If poll_state_synchronize_rcu() returned
true once, a grace period has elapsed, regardless of the fact that counter
wrap might cause some future poll_state_synchronize_rcu() invocation to
return false. For example, the caller might store a separate flag that
indicates whether some previous call to poll_state_synchronize_rcu()
determined that the relevant grace period had already ended.
This approach works, but it requires extra storage and is easy to get
wrong. This commit therefore introduces a get_completed_synchronize_rcu()
that returns a cookie that causes poll_state_synchronize_rcu() to always
return true. This already-completed cookie can be stored in place of the
cookie that previously caused poll_state_synchronize_rcu() to return true.
It can also be used to flag a given structure as not having been exposed
to readers, and thus not requiring a grace period to elapse.
This commit is in preparation for polled expedited grace periods.
Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Currently, poll_state_synchronize_rcu() uses rcu_seq_done() to check
whether the specified grace period has completed. However, rcu_seq_done()
does a simple comparison that reserves have of the sequence-number space
for uncompleted grace periods. This has the unfortunate side-effect
of not handling sequence-number wrap gracefully. Of course, one can
argue that if someone has already waited for half of the full range of
grace periods, they can wait for the other half, but why wait at all in
this case?
This commit therefore creates a rcu_seq_done_exact() that counts as
uncompleted only the two grace periods during which the sequence number
might have been handed out, while still being uncompleted. This way,
if sequence-number wrap happens to hit that range, at most two additional
grace periods need be waited for.
This commit is in preparation for polled expedited grace periods.
Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/
Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
Cc: Brian Foster <bfoster@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
exp.2022.05.11a: Expedited-grace-period latency-reduction updates.
|
|
Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
latency because its workqueues run at SCHED_OTHER, and thus can be
delayed by normal processes. This commit avoids these delays by moving
the expedited GP work items to a real-time-priority kthread_worker.
This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by
default on PREEMPT_RT=y kernels which disable expedited grace periods
after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1.
The results were evaluated on arm64 Android devices (6GB ram) running
5.10 kernel, and capturing trace data in critical user-level code.
The table below shows the resulting order-of-magnitude improvements
in synchronize_rcu_expedited() latency:
------------------------------------------------------------------------
| | workqueues | kthread_worker | Diff |
------------------------------------------------------------------------
| Count | 725 | 688 | |
------------------------------------------------------------------------
| Min Duration (ns) | 326 | 447 | 37.12% |
------------------------------------------------------------------------
| Q1 (ns) | 39,428 | 38,971 | -1.16% |
------------------------------------------------------------------------
| Q2 - Median (ns) | 98,225 | 69,743 | -29.00% |
------------------------------------------------------------------------
| Q3 (ns) | 342,122 | 126,638 | -62.98% |
------------------------------------------------------------------------
| Max Duration (ns) | 372,766,967 | 2,329,671 | -99.38% |
------------------------------------------------------------------------
| Avg Duration (ns) | 2,746,353 | 151,242 | -94.49% |
------------------------------------------------------------------------
| Standard Deviation (ns) | 19,327,765 | 294,408 | |
------------------------------------------------------------------------
The below table show the range of maximums/minimums for
synchronize_rcu_expedited() latency from all experiments:
------------------------------------------------------------------------
| | workqueues | kthread_worker | Diff |
------------------------------------------------------------------------
| Total No. of Experiments | 25 | 23 | |
------------------------------------------------------------------------
| Largest Maximum (ns) | 372,766,967 | 2,329,671 | -99.38% |
------------------------------------------------------------------------
| Smallest Maximum (ns) | 38,819 | 86,954 | 124.00% |
------------------------------------------------------------------------
| Range of Maximums (ns) | 372,728,148 | 2,242,717 | |
------------------------------------------------------------------------
| Largest Minimum (ns) | 88,623 | 27,588 | -68.87% |
------------------------------------------------------------------------
| Smallest Minimum (ns) | 326 | 447 | 37.12% |
------------------------------------------------------------------------
| Range of Minimums (ns) | 88,297 | 27,141 | |
------------------------------------------------------------------------
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Wei Wang <wvw@google.com>
Tested-by: Kyle Lin <kylelin@google.com>
Tested-by: Chunwei Lu <chunweilu@google.com>
Tested-by: Lulu Wang <luluw@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
'rcu-tasks.2022.04.11b', 'srcu.2022.05.03a', 'torture.2022.04.11b', 'torture-tasks.2022.04.20a' and 'torturescript.2022.04.20a' into HEAD
docs.2022.04.20a: Documentation updates.
fixes.2022.04.20a: Miscellaneous fixes.
nocb.2022.04.11b: Callback-offloading updates.
rcu-tasks.2022.04.11b: RCU-tasks updates.
srcu.2022.05.03a: Put SRCU on a memory diet.
torture.2022.04.11b: Torture-test updates.
torture-tasks.2022.04.20a: Avoid torture testing changing RCU configuration.
torturescript.2022.04.20a: Torture-test scripting updates.
|
|
An early check on synchronize_rcu[_expedited]() tries to determine if
the current CPU is in UP mode on an SMP no-preempt kernel, in which case
there is no need to start a grace period since the current assumed
quiescent state is all we need.
However the preemption mode doesn't take into account the boot selected
preemption mode under CONFIG_PREEMPT_DYNAMIC=y, missing a possible
early return if the running flavour is "none" or "voluntary".
Use the shiny new preempt mode accessors to fix this. However,
avoid invoking them during early boot because doing so triggers a
WARN_ON_ONCE().
[ paulmck: Update for mainlined API. ]
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The final "if" statement in rcu_gp_cleanup() has proven to be rather
confusing, straightforward though it might have seemed when initially
written. This commit therefore adds comments to its "then" and "else"
clauses to at least provide a more elevated form of confusion.
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
A report of a 12-jiffy normal RCU CPU stall warning raises interesting
questions about the nature of time on the offending system. This commit
instruments rcu_sched_clock_irq(), which is RCU's hook into the
scheduling-clock interrupt, checking for the jiffies counter going
backwards.
Reported-by: Saravanan D <sarvanand@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Tree RCU supports grace-period delays using the rcutree.gp_cleanup_delay,
rcutree.gp_init_delay, and rcutree.gp_preinit_delay kernel boot
parameters. These delays are strictly for debugging purposes, and have
proven quite effective at exposing bugs involving race with CPU-hotplug
operations. However, these delays can result in false positives when
used in conjunction with callback flooding, for example, those generated
by the rcutorture.fwd_progress kernel boot parameter.
This commit therefore suppresses grace-period delays while callback
flooding is in progress.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The main Tasks RCU quiescent state is voluntary context switch. However,
userspace execution is also a valid quiescent state, and is a valuable one
for userspace applications that spin repeatedly executing light-weight
non-sleeping system calls. Currently, such an application can delay a
Tasks RCU grace period for many tens of seconds.
This commit therefore enlists the aid of the scheduler-clock interrupt to
provide a Tasks RCU quiescent state when it interrupted a task executing
in userspace.
[ paulmck: Apply feedback from kernel test robot. ]
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Neil Spring <ntspring@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The rcu_spawn_gp_kthread() function is called as an early initcall, which
means that SMP initialization hasn't happened yet and only the boot CPU is
online. Therefore, create only the NOCB kthreads related to the boot CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The rcu_spawn_gp_kthread() function is called as an early initcall,
which means that SMP initialization hasn't happened yet and only the
boot CPU is online. Therefore, create only the boost kthread for the
leaf node of the boot CPU.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The rcu_init() function is called way before SMP is initialized and
therefore only the boot CPU should be online at this stage.
Simplify the boot per-cpu initialization accordingly.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Fix idle detection (Neeraj Upadhyay) and missing access marking
detected by KCSAN.
- Reduce coupling between rcu_barrier() and CPU-hotplug operations, so
that rcu_barrier() no longer needs to do cpus_read_lock(). This may
also someday allow system boot to bring CPUs online concurrently.
- Enable more aggressive movement to per-CPU queueing when reacting to
excessive lock contention due to workloads placing heavy update-side
stress on RCU tasks.
- Improvements to RCU priority boosting, including changes from Neeraj
Upadhyay, Zqiang, and Alison Chaiken.
- Various fixes improving test robustness and debug information.
- Add tests for SRCU size transitions, further compress torture.sh
build products, and improve debug output.
- Miscellaneous fixes.
* tag 'rcu.2022.03.13a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits)
rcu: Replace cpumask_weight with cpumask_empty where appropriate
rcu: Remove __read_mostly annotations from rcu_scheduler_active externs
rcu: Uninline multi-use function: finish_rcuwait()
rcu: Mark writes to the rcu_segcblist structure's ->flags field
kasan: Record work creation stack trace with interrupts enabled
rcu: Inline __call_rcu() into call_rcu()
rcu: Add mutex for rcu boost kthread spawning and affinity setting
rcu: Fix description of kvfree_rcu()
MAINTAINERS: Add Frederic and Neeraj to their RCU files
rcutorture: Provide non-power-of-two Tasks RCU scenarios
rcutorture: Test SRCU size transitions
torture: Make torture.sh help message match reality
rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
rcu-tasks: Use order_base_2() instead of ilog2()
rcu: Create and use an rcu_rdp_cpu_online()
rcu: Make rcu_barrier() no longer block CPU-hotplug operations
rcu: Rework rcu_barrier() and callback-migration logic
rcu: Refactor rcu_barrier() empty-list handling
rcu: Kill rnp->ofl_seq and use only rcu_state.ofl_lock for exclusion
torture: Change KVM environment variable to RCUTORTURE
...
|
|
With the removal of CONFIG_RCU_FAST_NO_HZ, the parameters in
rcu_needs_cpu() are not necessary anymore. Simply remove them.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
|
|
'rcu_barrier.2022.02.08a', 'rcu-tasks.2022.02.08a', 'rt.2022.02.01b', 'torture.2022.02.01b' and 'torturescript.2022.02.08a' into HEAD
exp.2022.02.24a: Expedited grace-period updates.
fixes.2022.02.14a: Miscellaneous fixes.
rcu_barrier.2022.02.08a: Make rcu_barrier() no longer exclude CPU hotplug.
rcu-tasks.2022.02.08a: RCU-tasks updates.
rt.2022.02.01b: Real-time-related updates.
torture.2022.02.01b: Torture-test updates.
torturescript.2022.02.08a: Torture-test scripting updates.
|
|
Recording the work creation stack trace for KASAN reports in
call_rcu() is expensive, due to unwinding the stack, but also
due to acquiring depot_lock inside stackdepot (which may be contended).
Because calling kasan_record_aux_stack_noalloc() does not require
interrupts to already be disabled, this may unnecessarily extend
the time with interrupts disabled.
Therefore, move calling kasan_record_aux_stack() before the section
with interrupts disabled.
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Because __call_rcu() is invoked only by call_rcu(), this commit inlines
the former into the latter.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
As we handle parallel CPU bringup, we will need to take care to avoid
spawning multiple boost threads, or race conditions when setting their
affinity. Spotted by Paul McKenney.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently
in RCU code in order to determine whether rdp->cpu is online from an
RCU perspective. This commit therefore creates an rcu_rdp_cpu_online()
function to replace it.
[ paulmck: Apply kernel test robot unused-variable feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit removes the cpus_read_lock() and cpus_read_unlock() calls
from rcu_barrier(), thus allowing CPUs to come and go during the course
of rcu_barrier() execution. Posting of the ->barrier_head callbacks does
synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
are held for short time periods on both sides. Thus, full CPU-hotplug
operations could both start and finish during the execution of a given
rcu_barrier() invocation.
Additional synchronization is provided by a global ->barrier_lock.
Since the ->barrier_lock is only used during rcu_barrier() execution and
during onlining/offlining a CPU, the contention for this lock should
be low. It might be tempting to make use of a per-CPU lock just on
general principles, but straightforward attempts to do this have the
problems shown below.
Initial state: 3 CPUs present, CPU 0 and CPU1 do not have
any callback and CPU2 has callbacks.
1. CPU0 calls rcu_barrier().
2. CPU1 starts offlining for CPU2. CPU1 calls
rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
It does not entrain ->barrier_head for CPU2, as rcu_barrier()
on CPU0 hasn't started the barrier sequence (by calling
rcu_seq_start(&rcu_state.barrier_sequence)) yet.
3. CPU0 starts new barrier sequence. It iterates over
CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
and finds 0 segcblist length. It updates ->barrier_seq_snap
for CPU0 and CPU1 and continues loop iteration to CPU2.
for_each_possible_cpu(cpu) {
raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
WRITE_ONCE(rdp->barrier_seq_snap, gseq);
raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
continue;
}
4. rcutree_migrate_callbacks() completes execution on CPU1.
Segcblist len for CPU2 becomes 0.
5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
for CPU2 and completes the loop iteration after setting
->barrier_seq_snap.
6. As there isn't any ->barrier_head callback entrained; at
this point, rcu_barrier() in CPU0 returns.
7. The callbacks, which migrated from CPU2 to CPU1, execute.
Straightforward per-CPU locking is also subject to the following race
condition noted by Boqun Feng:
1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
rcu_seq_start() and init_completion(), but does not yet initialize
rcu_state.barrier_cpu_count.
2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
which in turn calls rcu_barrier_entrain() holding CPU2's.
rdp->barrier_lock. It then entrains ->barrier_head for CPU2
and atomically increments rcu_state.barrier_cpu_count, which is
unfortunately not yet initialized to the value 2.
3. The just-entrained RCU callback is invoked. It atomically
decrements rcu_state.barrier_cpu_count and sees that it is
now zero. This callback therefore invokes complete().
4. CPU0 continues executing rcu_barrier(), but is not blocked
by its call to wait_for_completion(). This results in rcu_barrier()
returning before all pre-existing callbacks have been invoked,
which is a bug.
Therefore, synchronization is provided by rcu_state.barrier_lock,
which is also held across the initialization sequence, especially the
rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
to the value 2. In addition, this lock is held when entraining the
rcu_barrier() callback, when deciding whether or not a CPU has callbacks
that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
incoming CPUs, and when migrating callbacks from a CPU that is going
offline.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit reworks rcu_barrier() and callback-migration logic to
permit allowing rcu_barrier() to run concurrently with CPU-hotplug
operations. The key trick is for callback migration to check to see if
an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
callback on its behalf.
This commit adds synchronization with RCU's CPU-hotplug notifiers. Taken
together, this will permit a later commit to remove the cpus_read_lock()
and cpus_read_unlock() calls from rcu_barrier().
[ paulmck: Updated per kbuild test robot feedback. ]
[ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit saves a few lines by checking first for an empty callback
list. If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state. Also simplify tracing accordingly
and fold a few lines together.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.
However, rnp->ofl_seq is also fairly much pointless anyway since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
fairly much the whole time that rnp->ofl_seq is set to an odd number
to indicate that an operation is in progress.
So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.
This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false positive complaint. Since we're killing rnp->ofl_seq
of course that 'excuse' has to be changed too, so make it check for
arch_spin_is_locked(rcu_state.ofl_lock).
There's no arch_spin_lock_irqsave() so we have to manually save and
restore local interrupts around the locking.
At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
wait but *exclude* any CPU online/offline activity, which was fairly
much true already by virtue of it holding rcu_state.ofl_lock.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
When the rcutree.use_softirq kernel boot parameter is set to zero, all
RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
If these kthreads are being starved, quiescent states will not be
reported, which in turn means that the grace period will not end, which
can in turn trigger RCU CPU stall warnings. This commit therefore dumps
stack traces of stalled CPUs' rcuc kthreads, which can help identify
what is preventing those kthreads from running.
Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
parameter signals initialization code to boost the priority of rcuc
callbacks to the designated value. With the additional
CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
command-line parameter, the callbacks on the listed cores are
offloaded to new rcuop kthreads that are not pinned to the cores whose
post-grace-period work is performed. While the rcuop kthreads perform
the same function as the rcuc kthreads they offload, the kthread_prio
parameter only boosts the priority of the rcuc kthreads. Fix this
inconsistency by elevating rcuop kthreads to the same priority as the rcuc
kthreads.
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Move the bounds-check of the kthread_prio cmdline parameter to a new
function in order to faciliate a different callsite.
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The per-CPU "rcuc" kthreads are used only by kernels booted with
rcutree.use_softirq=0, but they are nevertheless unconditionally created
by kernels built with CONFIG_RCU_BOOST=y. This results in "rcuc"
kthreads being created that are never actually used. This commit
therefore refrains from creating these kthreads unless the kernel
is actually booted with rcutree.use_softirq=0.
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
'fixes.2021.11.30c', 'nocb.2021.12.09a', 'nolibc.2021.11.30c', 'tasks.2021.12.09a', 'torture.2021.12.07a' and 'torturescript.2021.11.30c' into HEAD
doc.2021.11.30c: Documentation updates.
exp.2021.12.07a: Expedited-grace-period fixes.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.12.09a: No-CB CPU updates.
nolibc.2021.11.30c: Tiny in-kernel library updates.
tasks.2021.12.09a: RCU-tasks updates, including update-side scalability.
torture.2021.12.07a: Torture-test in-kernel module updates.
torturescript.2021.11.30c: Torture-test scripting updates.
|
|
rcu_core() tries to ensure that its self-invocation in case of callbacks
overload only happen in softirq/rcuc mode. Indeed it doesn't make sense
to trigger local RCU core from nocb_cb kthread since it can execute
on a CPU different from the target rdp. Also in case of overload, the
nocb_cb kthread simply iterates a new loop of callbacks processing.
However the "offloaded" check that aims at preventing misplaced
rcu_core() invocations is wrong. First of all that state is volatile
and second: softirq/rcuc can execute while the target rdp is offloaded.
As a result rcu_core() can be invoked on the wrong CPU while in the
process of (de-)offloading.
Fix that with moving the rcu_core() self-invocation to rcu_core() itself,
irrespective of the rdp offloaded state.
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Time limit only makes sense when callbacks are serviced in softirq mode
because:
_ In case we need to get back to the scheduler,
cond_resched_tasks_rcu_qs() is called after each callback.
_ In case some other softirq vector needs the CPU, the call to
local_bh_enable() before cond_resched_tasks_rcu_qs() takes care about
them via a call to do_softirq().
Therefore, make sure the time limit only applies to softirq mode.
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The callbacks processing time limit makes sure we are not exceeding a
given amount of time executing the queue.
However its "continue" clause bypasses the cond_resched() call on
rcuc and NOCB kthreads, delaying it until we reach the limit, which can
be very long...
Make sure the scheduler has a higher priority than the time limit.
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The current condition to limit the number of callbacks executed in a
row checks the offloaded state of the rdp. Not only is it volatile
but it is also misleading: the rcu_core() may well be executing
callbacks concurrently with NOCB kthreads, and the offloaded state
would then be verified on both cases. As a result the limit would
spuriously not apply anymore on softirq while in the middle of
(de-)offloading process.
Fix and clarify the condition with those constraints in mind:
_ If callbacks are processed either by rcuc or NOCB kthread, the call
to cond_resched_tasks_rcu_qs() is enough to take care of the overload.
_ If instead callbacks are processed by softirqs:
* If need_resched(), exit the callbacks processing
* Otherwise if CPU is idle we can continue
* Otherwise exit because a softirq shouldn't interrupt a task for too
long nor deprive other pending softirq vectors of the CPU.
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Instead of hardcoding IRQ save and nocb lock, use the consolidated
API (and fix a comment as per Valentin Schneider's suggestion).
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
It's not entirely obvious why rdp->qlen_last_fqs_check is updated before
processing the queue only on offloaded rdp. There can be different
effect to that, either in favour of triggering the force quiescent state
path or not. For example:
1) If the number of callbacks has decreased since the last
rdp->qlen_last_fqs_check update (because we recently called
rcu_do_batch() and we executed below qhimark callbacks) and the number
of processed callbacks on a subsequent do_batch() arranges for
exceeding qhimark on non-offloaded but not on offloaded setup, then we
may spare a later run to the force quiescent state
slow path on __call_rcu_nocb_wake(), as compared to the non-offloaded
counterpart scenario.
Here is such an offloaded scenario instance:
qhimark = 1000
rdp->last_qlen_last_fqs_check = 3000
rcu_segcblist_n_cbs(rdp) = 2000
rcu_do_batch() {
if (offloaded)
rdp->last_qlen_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
// run 1000 callback
rcu_segcblist_n_cbs(rdp) = 1000
// Not updating rdp->qlen_last_fqs_check
if (count < rdp->qlen_last_fqs_check - qhimark)
rdp->qlen_last_fqs_check = count;
}
call_rcu() * 1001 {
__call_rcu_nocb_wake() {
// not taking the fqs slowpath:
// rcu_segcblist_n_cbs(rdp) == 2001
// rdp->qlen_last_fqs_check == 2000
// qhimark == 1000
if (len > rdp->qlen_last_fqs_check + qhimark)
...
}
In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
would be 1000 and the fqs slowpath would have executed.
2) If the number of callbacks has increased since the last
rdp->qlen_last_fqs_check update (because we recently queued below
qhimark callbacks) and the number of callbacks executed in rcu_do_batch()
doesn't exceed qhimark for either offloaded or non-offloaded setup,
then it's possible that the offloaded scenario later run the force
quiescent state slow path on __call_rcu_nocb_wake() while the
non-offloaded doesn't.
qhimark = 1000
rdp->last_qlen_last_fqs_check = 3000
rcu_segcblist_n_cbs(rdp) = 2000
rcu_do_batch() {
if (offloaded)
rdp->last_qlen_last_fqs_check = rcu_segcblist_n_cbs(rdp) // 2000
// run 100 callbacks
// concurrent queued 100
rcu_segcblist_n_cbs(rdp) = 2000
// Not updating rdp->qlen_last_fqs_check
if (count < rdp->qlen_last_fqs_check - qhimark)
rdp->qlen_last_fqs_check = count;
}
call_rcu() * 1001 {
__call_rcu_nocb_wake() {
// Taking the fqs slowpath:
// rcu_segcblist_n_cbs(rdp) == 3001
// rdp->qlen_last_fqs_check == 2000
// qhimark == 1000
if (len > rdp->qlen_last_fqs_check + qhimark)
...
}
In the case of a non-offloaded scenario, rdp->qlen_last_fqs_check
would be 3000 and the fqs slowpath would have executed.
The reason for updating rdp->qlen_last_fqs_check when invoking callbacks
for offloaded CPUs is that there is usually no point in waking up either
the rcuog or rcuoc kthreads while in this state. After all, both threads
are prohibited from indefinite sleeps.
The exception is when some huge number of callbacks are enqueued while
rcu_do_batch() is in the midst of invoking, in which case interrupting
the rcuog kthread's timed sleep might get more callbacks set up for the
next grace period.
Reported-and-tested-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Original-patch-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|