path: root/kernel/sched/ext.c
Age        Commit message        Author
2025-04-09  sched_ext: Remove scx_ops_enq_* static_keys  (Tejun Heo)
scx_ops_enq_last/exiting/migration_disabled are used to encode the corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough for static_key usage to make any meaningful difference and are made static_keys mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_keys and test the ops flags directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
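For illustration, a minimal sketch of the shape of this change, assuming a global ops structure with a flags field; the helper name and field layout here are assumptions, not the exact identifiers in ext.c:

	/* Before (sketch): behavior gated by a global static_key. */
	if (static_branch_unlikely(&scx_ops_enq_last))
		do_enqueue_last(p, enq_flags);

	/*
	 * After (sketch): test the ops flag directly. These paths aren't hot
	 * enough for a static_key to matter, and global static_keys can't
	 * express per-scheduler flags once multiple schedulers can coexist.
	 */
	if (scx_ops.flags & SCX_OPS_ENQ_LAST)
		do_enqueue_last(p, enq_flags);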
2025-04-09  sched_ext: Indentation updates  (Tejun Heo)
Purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-08  sched_ext: Merge branch 'for-6.15-fixes' into for-6.16  (Tejun Heo)
Pull for-6.15-fixes to receive: e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings") which conflicts with: 1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends") The former removes code updated by the latter. Resolved by removing the updated section. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08  sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation  (Tejun Heo)
SCX_OPS_HAS_CGROUP_WEIGHT was only used to suppress the missing cgroup weight support warnings. Now that the warnings are removed, the flag doesn't do anything. Mark it for deprecation and remove its usage from scx_flatcg. v2: Actually include the scx_flatcg update. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-04-08  sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings  (Tejun Heo)
sched_ext generates warnings when cpu.weight / cpu.idle are set to non-default values if the BPF scheduler doesn't implement weight support. These warnings don't provide much value while adding constant annoyance. A BPF scheduler may not implement any particular behavior and there's nothing particularly special about missing cgroup weight support. Drop the warnings. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08  sched_ext: Use kvzalloc for large exit_dump allocation  (Breno Leitao)
Replace kzalloc with kvzalloc for the exit_dump buffer allocation, which can require large contiguous memory depending on the implementation. This change prevents allocation failures by allowing the system to fall back to vmalloc when contiguous memory allocation fails. Since this buffer is only used for debugging purposes, physical memory contiguity is not required, making vmalloc a suitable alternative. Cc: stable@vger.kernel.org Fixes: 07814a9439a3b0 ("sched_ext: Print debug dump after an error exit") Suggested-by: Rik van Riel <riel@surriel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
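The pattern, sketched below: kvzalloc() attempts a kmalloc-style allocation and transparently falls back to vmalloc() for large sizes, and must be paired with kvfree(). The wrapper function names are illustrative, not the actual ones in ext.c:

	#include <linux/slab.h>		/* kvzalloc(), kvfree() */

	/* Sketch: exit_dump only needs virtually contiguous memory. */
	static char *scx_alloc_exit_dump(size_t dump_len)
	{
		return kvzalloc(dump_len, GFP_KERNEL);	/* was: kzalloc() */
	}

	static void scx_free_exit_dump(char *dump)
	{
		kvfree(dump);				/* must pair with kvzalloc() */
	}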
2025-04-07  sched_ext: idle: Introduce scx_bpf_select_cpu_and()  (Andrea Righi)
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply the built-in idle CPU selection policy to a subset of allowed CPUs.

This new helper is basically an extension of scx_bpf_select_cpu_dfl(). However, when an idle CPU can't be found, it returns a negative value instead of @prev_cpu, aligning its behavior more closely with scx_bpf_pick_idle_cpu().

It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a fully idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in selection logic.

With this helper, BPF schedulers can apply the built-in idle CPU selection policy restricted to any arbitrary subset of CPUs.

Example usage
=============

Possible usage in ops.select_cpu():

s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
	s32 cpu;

	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
	if (cpu >= 0) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	return prev_cpu;
}

Results
=======

Load distribution on a 4 sockets, 4 cores per socket system, simulated using virtme-ng, running a modified version of scx_bpfland that uses scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:

 $ vng --cpu 16,sockets=4,cores=4,threads=1
 ...
 $ stress-ng -c 16
 ...
 $ htop
 ...
   0[         0.0%]   8[||||||||||||||||||||||||100.0%]
   1[         0.0%]   9[||||||||||||||||||||||||100.0%]
   2[         0.0%]  10[||||||||||||||||||||||||100.0%]
   3[         0.0%]  11[||||||||||||||||||||||||100.0%]
   4[         0.0%]  12[||||||||||||||||||||||||100.0%]
   5[         0.0%]  13[||||||||||||||||||||||||100.0%]
   6[         0.0%]  14[||||||||||||||||||||||||100.0%]
   7[         0.0%]  15[||||||||||||||||||||||||100.0%]

With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across all the available CPUs.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07  sched_ext: idle: Explicitly pass allowed cpumask to scx_select_cpu_dfl()  (Andrea Righi)
Modify scx_select_cpu_dfl() to take the allowed cpumask as an explicit argument, instead of implicitly using @p->cpus_ptr. This prepares for future changes where arbitrary cpumasks may be passed to the built-in idle CPU selection policy. This is a pure refactoring with no functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-04sched_ext: Drop "ops" from SCX_OPS_TASK_ITER_BATCHTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from SCX_OPS_TASK_ITER_BATCH. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-and-acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and ↵Tejun Heo
friends The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_{init|exit|enable|disable}[_task]() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_exit(), scx_ops_error() and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_exit(), scx_ops_error() and friends. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_bypass(), scx_ops_breather() and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and ↵Tejun Heo
__scx_ops_enabled The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_helper, scx_ops_enable_mutex and __scx_ops_enabled. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-04sched_ext: Drop "ops" from scx_ops_enable_state and friendsTejun Heo
The tag "ops" is used for two different purposes. First, to indicate that the entity is directly related to the operations such as flags carried in sched_ext_ops. Second, to indicate that the entity applies to something global such as enable or bypass states. The second usage is historical and causes confusion rather than clarifying anything. For example, scx_ops_enable_state enums are named SCX_OPS_* and thus conflict with scx_ops_flags. Let's drop the second usages. Drop "ops" from scx_ops_enable_state and friends. Update scx_show_state.py accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-03-26  sched_ext: initialize built-in idle state before ops.init()  (Andrea Righi)
A BPF scheduler may want to use the built-in idle cpumasks in ops.init() before the scheduler is fully initialized, either directly or through a BPF timer for example.

However, this would result in an error, since the idle state has not been properly initialized yet.

This can be easily verified by modifying scx_simple to call scx_bpf_get_idle_cpumask() in ops.init():

 $ sudo scx_simple

 DEBUG DUMP
 ===========================================================================
 scx_simple[121] triggered exit kind 1024:
   runtime error (built-in idle tracking is disabled)
 ...

Fix this by properly initializing the idle state before ops.init() is called. With this change applied:

 $ sudo scx_simple
 local=2 global=0
 local=19 global=11
 local=23 global=11
 ...

Fixes: d73249f88743d ("sched_ext: idle: Make idle static keys private")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
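A minimal sketch of the kind of ops.init() usage this fix enables, using kfuncs named above and the BPF_STRUCT_OPS_SLEEPABLE convention from the scx common headers; the callback body is illustrative only:

	/* Sketch: query the built-in idle cpumask from ops.init(). */
	s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
	{
		const struct cpumask *idle = scx_bpf_get_idle_cpumask();
		u32 nr_idle = bpf_cpumask_weight(idle);

		bpf_printk("idle CPUs at init: %u", nr_idle);
		scx_bpf_put_idle_cpumask(idle);	/* release the reference */
		return 0;
	}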
2025-03-26  sched_ext: create_dsq: Return -EEXIST on duplicate request  (Jake Hillion)
create_dsq and therefore the scx_bpf_create_dsq kfunc currently silently ignore duplicate entries. As a sched_ext scheduler is creating each DSQ for a different purpose this is surprising behaviour. Replace rhashtable_insert_fast which ignores duplicates with rhashtable_lookup_insert_fast that reports duplicates (though doesn't return their value). The rest of the code is structured correctly and this now returns -EEXIST. Tested by adding an extra scx_bpf_create_dsq to scx_simple. Previously this was ignored, now init fails with a -17 code. Also ran scx_lavd which continued to work well. Signed-off-by: Jake Hillion <jake@hillion.co.uk> Acked-by: Andrea Righi <arighi@nvidia.com> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
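A sketch of the change; the hash-table and field names here are assumptions rather than verbatim from ext.c. rhashtable_lookup_insert_fast() returns -EEXIST when an entry with the same key already exists, which create_dsq() can now propagate:

	/* Sketch: reject a duplicate dsq_id instead of silently ignoring it. */
	ret = rhashtable_lookup_insert_fast(&dsq_hash, &dsq->hash_node,
					    dsq_hash_params);
	if (ret) {
		kfree(dsq);
		return ERR_PTR(ret);	/* -EEXIST on a duplicate request */
	}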
2025-03-24  Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull scheduler updates from Ingo Molnar:

 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding)
   - Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
   - Add unlikely branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun Dong)

  Deadline scheduler: (Juri Lelli)

   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:

   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)

   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)

   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:

   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:

   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:

   - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
2025-03-19  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()  (Ingo Molnar)
The scheduler has this special SCHED_WARN() facility that depends on CONFIG_SCHED_DEBUG. Since CONFIG_SCHED_DEBUG is getting removed, convert SCHED_WARN() to WARN_ON_ONCE().

Note that the warning output isn't 100% equivalent:

  #define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)

Because SCHED_WARN_ON() would output the 'x' condition as well, while WARN_ONCE() will only show a backtrace. Hopefully these are rare enough to not really matter. If it does, we should probably introduce a new WARN_ON() variant that outputs the condition in stringified form, or improve WARN_ON() itself.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317104257.3496611-2-mingo@kernel.org
2025-03-14  sched_ext: idle: Refactor scx_select_cpu_dfl()  (Andrea Righi)
Make scx_select_cpu_dfl() more consistent with the other idle-related APIs by returning a negative value when an idle CPU isn't found. No functional changes, this is purely a refactoring. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-14  sched_ext: idle: Honor idle flags in the built-in idle selection policy  (Andrea Righi)
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(), to enforce strict selection criteria, such as selecting an idle CPU strictly within @prev_cpu's node or choosing only a fully idle SMT core. This functionality will be exposed through a dedicated kfunc in a separate patch. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-13  sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()  (Andrea Righi)
scx_bpf_reenqueue_local() can be invoked from ops.cpu_release() to give tasks that are queued to the local DSQ a chance to migrate to other CPUs, when a CPU is taken by a higher scheduling class. However, there is no point re-enqueuing tasks that can only run on that particular CPU, as they would simply be re-added to the same local DSQ without any benefit. Therefore, skip per-CPU tasks in scx_bpf_reenqueue_local(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
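The gist of the check is a one-liner inside the requeue loop; a hedged sketch with the loop scaffolding omitted:

	/*
	 * Sketch: a task hard-affined to this CPU can't go anywhere else, so
	 * re-enqueuing it would only put it back on the same local DSQ.
	 */
	if (p->nr_cpus_allowed == 1)
		continue;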
2025-03-04  sched_ext: Add trace point to track sched_ext core events  (Changwoo Min)
Add tracing support to track sched_ext core events (/sched_ext/sched_ext_event). This may be useful for debugging sched_ext schedulers that trigger a particular event.

The trace point can be used as other trace points, so it can be used in, for example, `perf trace` and BPF programs, as follows:

======
$> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"'
======

======
struct tp_sched_ext_event {
	struct trace_entry ent;
	u32 __data_loc_name;
	s64 delta;
};

SEC("tracepoint/sched_ext/sched_ext_event")
int rtp_add_event(struct tp_sched_ext_event *ctx)
{
	char event_name[128];
	unsigned short offset = ctx->__data_loc_name & 0xFFFF;

	bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset);
	bpf_printk("name %s delta %lld", event_name, ctx->delta);
	return 0;
}
======

Signed-off-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04  sched_ext: Change the event type from u64 to s64  (Changwoo Min)
The event count could be negative in the future, so change the event type from u64 to s64. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03  sched_ext: Merge branch 'for-6.14-fixes' into for-6.15  (Tejun Heo)
Pull for-6.14-fixes to receive: 9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()") which conflicts with: 337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-03  sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()  (Andrea Righi)
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel crash. To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and trigger an scx error if an invalid CPU is specified. Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
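A sketch of the guard; the exact error text and return behavior are assumptions, the point is simply to range-check prev_cpu before it is used to index per-CPU data:

	/* Sketch: reject a bogus prev_cpu supplied by the BPF scheduler. */
	if (unlikely(prev_cpu < 0 || prev_cpu >= nr_cpu_ids)) {
		scx_ops_error("invalid prev_cpu %d from ops.select_cpu()", prev_cpu);
		return prev_cpu;	/* the scx error aborts the scheduler */
	}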
2025-02-25  sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()  (Tejun Heo)
a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()") added a workaround to handle the cases where pick_task_scx() is called without preceding balance_scx(), which is due to a fair class bug where pick_task_fair() may return NULL after a true return from balance_fair(). The workaround detects when pick_task_scx() is called without preceding balance_scx() and emulates SCX_RQ_BAL_KEEP and triggers kicking to avoid stalling.

Unfortunately, the workaround code was testing whether @prev was on SCX to decide whether to keep the task running. This is incorrect as the task may be on SCX but no longer runnable. This could lead to a non-runnable task being returned from pick_task_scx(), which causes interesting confusions and failures. e.g. A common failure mode is the task ending up with (!on_rq && on_cpu) state which can cause potential wakers to busy loop, which can easily lead to deadlocks.

Fix it by testing whether @prev has SCX_TASK_QUEUED set. This makes @prev_on_scx only used in one place. Open code the usage and improve the comment while at it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Pat Cody <patcody@meta.com>
Fixes: a6250aa251ea ("sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-02-24  sched_ext: idle: Introduce scx_bpf_nr_node_ids()  (Andrea Righi)
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the system. BPF schedulers can use this information together with the new node-aware kfuncs, for example to create per-node DSQs, validate node IDs, etc. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
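As one possible use, a hedged sketch that creates one DSQ per NUMA node from ops.init(), using the node id as the DSQ id purely for illustration (bpf_for() and BPF_STRUCT_OPS_SLEEPABLE come from the usual BPF/scx helper headers):

	/* Sketch: one dispatch queue per NUMA node, created at init time. */
	s32 BPF_STRUCT_OPS_SLEEPABLE(foo_init)
	{
		s32 node, err;

		bpf_for(node, 0, scx_bpf_nr_node_ids()) {
			err = scx_bpf_create_dsq(node, node);	/* dsq_id == node id */
			if (err)
				return err;
		}
		return 0;
	}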
2025-02-16  sched_ext: idle: Per-node idle cpumasks  (Andrea Righi)
Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node.

The split into multiple per-node idle cpumasks can be controlled using the SCX_OPS_BUILTIN_IDLE_PER_NODE flag. By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global host-wide idle cpumask is used, maintaining the previous behavior.

NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will trigger an scx error, since there are no system-wide cpumasks.

= Test =

Hardware:
 - System: DGX B200
   - CPUs: 224 SMT threads (112 physical cores)
   - Processor: INTEL(R) XEON(R) PLATINUM 8570
   - 2 NUMA nodes

Scheduler:
 - scx_simple [1] (so that we can focus on the built-in idle selection policy and not on the scheduling policy itself)

Test:
 - Run a parallel kernel build `make -j $(nproc)` and measure the average elapsed time over 10 runs:

          avg time | stdev
          ---------+------
  before:  52.431s | 2.895
   after:  50.342s | 2.895

= Conclusion =

Splitting the global cpumask into multiple per-NUMA cpumasks helped to achieve a speedup of approximately +4% with this particular architecture and test case.

The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.

On smaller systems, I haven't noticed any measurable regressions or improvements with the same test (parallel kernel build) and scheduler (scx_simple).

Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs I observed an additional +2-2.5% performance improvement with the same test.

[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c

Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
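Conceptually, the single global mask becomes an array indexed by node id; a rough sketch of the data layout, with illustrative names rather than the exact ones in ext_idle.c:

	/* Sketch: per-node idle tracking instead of one global cpumask. */
	struct scx_idle_cpus {
		cpumask_var_t cpu;	/* idle CPUs in this node */
		cpumask_var_t smt;	/* CPUs whose whole SMT core is idle */
	};

	static struct scx_idle_cpus **scx_idle_masks;	/* one entry per NUMA node */

	static struct cpumask *idle_cpumask(int node)
	{
		/* With per-node tracking disabled, everything maps to one entry. */
		return scx_idle_masks[node]->cpu;
	}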
2025-02-16  sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE  (Andrea Righi)
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows BPF schedulers to select between using a global flat idle cpumask or multiple per-node cpumasks. This only introduces the flag and the mechanism to enable/disable this feature without affecting any scheduling behavior. Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-16  sched_ext: idle: Make idle static keys private  (Andrea Righi)
Make all the static keys used by the idle CPU selection policy private to ext_idle.c. This avoids unnecessary exposure in headers and improves code encapsulation. Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-14  sched_ext: Provides a sysfs 'events' to expose core event counters  (Changwoo Min)
Add a sysfs entry at /sys/kernel/sched_ext/root/events to expose core event counters through the file system interface. Each line of the file shows the event name and its counter value. In addition, the format of scx_dump_event() is adjusted as the event name gets longer. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-13  sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP  (Tejun Heo)
A task wakeup can be either processed on the waker's CPU or bounced to the wakee's previous CPU using an IPI (ttwu_queue). Bouncing to the wakee's CPU avoids the waker's CPU locking and accessing the wakee's rq which can be expensive across cache and node boundaries.

When the ttwu_queue path is taken, select_task_rq() and thus ops.select_cpu() may be skipped in some cases (racing against the wakee switching out). As this confused some BPF schedulers, there wasn't a good way for a BPF scheduler to tell whether idle CPU selection had been skipped, ops.enqueue() couldn't insert tasks into foreign local DSQs, and the performance difference on machines with simple topologies was minimal, sched_ext disabled ttwu_queue.

However, this optimization makes a noticeable difference on more complex topologies, and a BPF scheduler now has an easy way to tell whether ops.select_cpu() was skipped since 9b671793c7d9 ("sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED") and can insert tasks into foreign local DSQs since 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches"). Implement SCX_OPS_ALLOW_QUEUED_WAKEUP which allows BPF schedulers to choose to enable the ttwu_queue optimization.

v2: Update the patch description and comment re. ops.select_cpu() being skipped in some cases as opposed to always as per Neel.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Neel Natu <neelnatu@google.com>
Reported-by: Barret Rhoden <brho@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
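Opting in is a matter of setting the flag in the scheduler's struct sched_ext_ops; a sketch with placeholder scheduler name and callbacks:

	/* Sketch: a BPF scheduler opting back into the ttwu_queue optimization. */
	SEC(".struct_ops.link")
	struct sched_ext_ops foo_ops = {
		.select_cpu	= (void *)foo_select_cpu,
		.enqueue	= (void *)foo_enqueue,
		.dispatch	= (void *)foo_dispatch,
		.flags		= SCX_OPS_ALLOW_QUEUED_WAKEUP,
		.name		= "foo",
	};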
2025-02-13  sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx  (Chuyi Zhou)
Now when we use scx_bpf_task_cgroup() in ops.tick() to get the cgroup of the current task, the following error will occur:

  scx_foo[3795244] triggered exit kind 1024:
    runtime error (called on a task not being operated on)

The reason is that we are using SCX_CALL_OP() instead of SCX_CALL_OP_TASK() when calling ops.tick(), which triggers the error during the subsequent scx_kf_allowed_on_arg_tasks() check. SCX_CALL_OP_TASK() was first introduced in commit 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation") to ensure the task's rq lock is held when accessing the task's sched_group.

Since ops.tick() is marked as SCX_KF_TERMINAL and task_tick_scx() is protected by the rq lock, we can use SCX_CALL_OP_TASK() to avoid the above issue. Similarly, the same changes should be made for ops.disable() and ops.exit_task(), as they are also protected by task_rq_lock() and it's safe to access the task's task_group.

Fixes: 36454023f50b ("sched_ext: Track tasks that are subjects of the in-flight SCX operation")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-10  Merge branch 'for-6.14-fixes' into for-6.15  (Tejun Heo)
Pull to receive f3f08c3acfb8 ("sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()") which conflicts with 26176116d931 ("sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot") in for-6.15.
2025-02-10  sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq()  (Tejun Heo)
While fixing migration disabled task handling, 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches") assumed that a migration disabled task's ->cpus_ptr would only have the pinned CPU. While this is eventually true for migration disabled tasks that are switched out, the ->cpus_ptr update is performed by migrate_disable_switch() which is called right before context_switch() in __schedule(). However, the task is enqueued earlier during pick_next_task() via put_prev_task_scx(), so there is a race window where another CPU can see the task on a DSQ.

If the CPU tries to dispatch the migration disabled task while in that window, task_allowed_on_cpu() will succeed and task_can_run_on_remote_rq() will subsequently trigger SCHED_WARN(is_migration_disabled()).

  WARNING: CPU: 8 PID: 1837 at kernel/sched/ext.c:2466 task_can_run_on_remote_rq+0x12e/0x140
  Sched_ext: layered (enabled+all), task: runnable_at=-10ms
  RIP: 0010:task_can_run_on_remote_rq+0x12e/0x140
  ...
  <TASK>
  consume_dispatch_q+0xab/0x220
  scx_bpf_dsq_move_to_local+0x58/0xd0
  bpf_prog_84dd17b0654b6cf0_layered_dispatch+0x290/0x1cfa
  bpf__sched_ext_ops_dispatch+0x4b/0xab
  balance_one+0x1fe/0x3b0
  balance_scx+0x61/0x1d0
  prev_balance+0x46/0xc0
  __pick_next_task+0x73/0x1c0
  __schedule+0x206/0x1730
  schedule+0x3a/0x160
  __do_sys_sched_yield+0xe/0x20
  do_syscall_64+0xbb/0x1e0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fix it by converting the SCHED_WARN() back to a regular failure path. Also, perform the migration disabled test before the task_allowed_on_cpu() test so that BPF schedulers which fail to handle migration disabled tasks can be noticed easily.

While at it, adjust the scx_ops_error() message for the !task_allowed_on_cpu() case for brevity and consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")
Acked-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Jake Hillion <jakehillion@meta.com>
2025-02-10  sched_ext: Take NUMA node into account when allocating per-CPU cpumasks  (Li RongQing)
per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-08  sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED  (Tejun Heo)
Count the number of times a migration disabled task is automatically dispatched to its local DSQ. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08  sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot  (Tejun Heo)
SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE wasn't quite right in two aspects:

- It counted both migration disabled and offline events.
- It didn't count events from the scx_bpf_dsq_move() path.

Fix it by moving the counting into task_can_run_on_remote_rq() which is shared by both paths and can distinguish the different rejection conditions. The argument @trigger_error is renamed to @enforce as it now does more than just triggering an error.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-02-08  Merge branch 'for-6.14-fixes' into for-6.15  (Tejun Heo)
Pull to receive:

- 2fa0fbeb69ed ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")

as planned for-6.15 changes depend on them (e.g. adding an event counter for implicit migration disabled task handling).
2025-02-08  sched_ext: Fix migration disabled handling in targeted dispatches  (Tejun Heo)
A dispatch operation that can target a specific local DSQ - scx_bpf_dsq_move_to_local() or scx_bpf_dsq_move() - checks whether the task can be migrated to the target CPU using task_can_run_on_remote_rq(). If the task can't be migrated to the targeted CPU, it is bounced through a global DSQ.

task_can_run_on_remote_rq() assumes that the task is on a CPU that's different from the targeted CPU but the callers don't uphold the assumption and may call the function when the task is already on the target CPU. When such a task has migration disabled, task_can_run_on_remote_rq() ends up returning %false incorrectly, unnecessarily bouncing the task to a global DSQ.

Fix it by updating the callers to only call task_can_run_on_remote_rq() when the task is on a different CPU than the target CPU. As this is a bit subtle, for clarity and documentation:

- Make task_can_run_on_remote_rq() trigger SCHED_WARN_ON() if the task is on the same CPU as the target CPU.
- The is_migration_disabled() test in task_can_run_on_remote_rq() cannot trigger if the task is on a different CPU than the target CPU as the preceding task_allowed_on_cpu() test should fail beforehand. Convert the test into SCHED_WARN_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Fixes: 0366017e0973 ("sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()")
Cc: stable@vger.kernel.org # v6.12+
2025-02-08  sched_ext: Implement auto local dispatching of migration disabled tasks  (Tejun Heo)
Migration disabled tasks are special and pinned to their previous CPUs. They tripped up some unsuspecting BPF schedulers as their ->nr_cpus_allowed may not agree with the bits set in ->cpus_ptr. Make it easier for BPF schedulers by automatically dispatching them to the pinned local DSQs by default. If a BPF scheduler wants to handle migration disabled tasks explicitly, it can set SCX_OPS_ENQ_MIGRATION_DISABLED. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
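A sketch of the enqueue-side logic this describes, assuming a flags field on a global ops structure and a local-dispatch label; these identifiers are illustrative, not the exact ones in ext.c:

	/*
	 * Sketch: unless the scheduler opted in with
	 * SCX_OPS_ENQ_MIGRATION_DISABLED, keep a migration disabled task on
	 * its pinned CPU's local DSQ instead of handing it to ops.enqueue().
	 */
	if (unlikely(is_migration_disabled(p)) &&
	    !(scx_ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED))
		goto local;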
2025-02-07  sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL  (Changwoo Min)
Add a core event, SCX_EV_ENQ_SLICE_DFL, which represents how many tasks have been enqueued (or pick_task-ed or select_cpu-ed) with a default time slice (SCX_SLICE_DFL). Scheduling a task with SCX_SLICE_DFL unintentionally would be a source of latency spikes because SCX_SLICE_DFL is relatively long (20 msec). Thus, a soaring SCX_EV_ENQ_SLICE_DFL value would be a sign of BPF scheduler bugs causing latency spikes, especially when ops.select_cpu() is provided. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04  sched_ext: Add an event, SCX_EV_BYPASS_DURATION  (Changwoo Min)
Add a core event, SCX_EV_BYPASS_DURATION, which represents the total duration of bypass modes in nanoseconds. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04  sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH  (Changwoo Min)
Add a core event, SCX_EV_BYPASS_DISPATCH, which represents how many tasks have been dispatched in the bypass mode. __scx_add_event() is used since the caller holds an rq lock or p->pi_lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04  sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE  (Changwoo Min)
Add a core event, SCX_EV_BYPASS_ACTIVATE, which represents how many times the bypass mode has been triggered. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-04  sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING  (Changwoo Min)
Add a core event, SCX_EV_ENQ_SKIP_EXITING, which represents how many times a task is enqueued to a local DSQ when exiting if SCX_OPS_ENQ_EXITING is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02  sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST  (Changwoo Min)
Add a core event, SCX_EV_DISPATCH_KEEP_LAST, which represents how many times a task is kept running without going through ops.enqueue() when SCX_OPS_ENQ_LAST is not set. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02  sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE  (Changwoo Min)
Add a core event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which represents how many times a BPF scheduler tries to dispatch to an offlined local DSQ. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02  sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK  (Changwoo Min)
Add a core event, SCX_EV_SELECT_CPU_FALLBACK, which represents how many times ops.select_cpu() returns a CPU that the task can't use. __scx_add_event() is used since the caller holds an rq lock, so the preemption has already been disabled. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-02  sched_ext: Implement event counter infrastructure  (Changwoo Min)
Collect the statistics of specific types of behavior in the sched_ext core, which are not easily visible but still interesting to an scx scheduler. An event type is defined in 'struct scx_event_stats'. When an event occurs, its counter is accumulated using 'scx_add_event()' and '__scx_add_event()' into the per-CPU 'struct scx_event_stats' for efficiency. 'scx_bpf_events()' aggregates all the per-CPU counters and exposes system-wide counters. For convenience and readability of the code, 'scx_agg_event()' and 'scx_dump_event()' are provided. The collected events can be observed after a BPF scheduler is unloaded and before a new BPF scheduler is loaded, since the per-CPU 'struct scx_event_stats' are reset when a new scheduler is loaded. Signed-off-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
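A condensed sketch of the shape of this infrastructure as described above; the field list is abbreviated and the exact macro bodies are assumptions:

	/* Sketch: per-CPU event counters, aggregated on demand. */
	struct scx_event_stats {
		s64	SCX_EV_SELECT_CPU_FALLBACK;
		s64	SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE;
		s64	SCX_EV_ENQ_SLICE_DFL;
		/* ... one counter per event type ... */
	};

	static DEFINE_PER_CPU(struct scx_event_stats, event_stats_cpu);

	/* Callers already hold an rq lock or p->pi_lock, i.e. preemption is off. */
	#define __scx_add_event(name, cnt) do {					\
		__this_cpu_add(event_stats_cpu.name, (s64)(cnt));		\
	} while (0)

	/* Preemption-safe variant for contexts that may be preemptible. */
	#define scx_add_event(name, cnt) do {					\
		this_cpu_add(event_stats_cpu.name, (s64)(cnt));			\
	} while (0)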