summaryrefslogtreecommitdiff
path: root/kernel/sched/core.c
AgeCommit message (Collapse)Author
2026-03-11sched/mmcid: Avoid full tasklist walksThomas Gleixner
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full task list walk, which is obviously expensive on large systems. Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid and walk this list instead. This removes the proven to be flaky counting logic and avoids a full task list walk in the case of vfork()'ed tasks. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
2026-03-11sched/mmcid: Remove pointless preempt guardThomas Gleixner
This is a leftover from the early versions of this function where it could be invoked without mm::mm_cid::lock held. Remove it and add lockdep asserts instead. Fixes: 653fda7ae73d ("sched/mmcid: Switch over to the new mechanism") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.116363613@kernel.org
2026-03-11sched/mmcid: Handle vfork()/CLONE_VM correctlyThomas Gleixner
Matthieu and Jiri reported stalls where a task endlessly loops in mm_get_cid() when scheduling in. It turned out that the logic which handles vfork()'ed tasks is broken. It is invoked when the number of tasks associated to a process is smaller than the number of MMCID users. It then walks the task list to find the vfork()'ed task, but accounts all the already processed tasks as well. If that double processing brings the number of to be handled tasks to 0, the walk stops and the vfork()'ed task's CID is not fixed up. As a consequence a subsequent schedule in fails to acquire a (transitional) CID and the machine stalls. Cure this by removing the accounting condition and make the fixup always walk the full task list if it could not find the exact number of users in the process' thread list. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/b24ffcb3-09d5-4e48-9070-0b69bc654281@kernel.org Reported-by: Matthieu Baerts <matttbe@kernel.org> Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.048657665@kernel.org
2026-03-11sched/mmcid: Prevent CID stalls due to concurrent forksThomas Gleixner
A newly forked task is accounted as MMCID user before the task is visible in the process' thread list and the global task list. This creates the following problem: CPU1 CPU2 fork() sched_mm_cid_fork(tnew1) tnew1->mm.mm_cid_users++; tnew1->mm_cid.cid = getcid() -> preemption fork() sched_mm_cid_fork(tnew2) tnew2->mm.mm_cid_users++; // Reaches the per CPU threshold mm_cid_fixup_tasks_to_cpus() for_each_other(current, p) .... As tnew1 is not visible yet, this fails to fix up the already allocated CID of tnew1. As a consequence a subsequent schedule in might fail to acquire a (transitional) CID and the machine stalls. Move the invocation of sched_mm_cid_fork() after the new task becomes visible in the thread and the task list to prevent this. This also makes it symmetrical vs. exit() where the task is removed as CID user before the task is removed from the thread and task lists. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202525.969061974@kernel.org
2026-02-23sched/core: Fix wakeup_preempt's next_class trackingPeter Zijlstra
Kernel test robot reported that tools/testing/selftests/kvm/hardware_disable_test was failing due to commit 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()") It turns out there were two related problems that could lead to a missed preemption: - when hitting newidle balance from the idle thread, it would elevate rb->next_class from &idle_sched_class to &fair_sched_class, causing later wakeup_preempt() calls to not hit the sched_class_above() case, and not issue resched_curr(). Notably, this modification pattern should only lower the next_class, and never raise it. Create two new helper functions to wrap this. - when doing schedule_idle(), it was possible to miss (re)setting rq->next_class to &idle_sched_class, leading to the very same problem. Cc: Sean Christopherson <seanjc@google.com> Fixes: 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202602122157.4e861298-lkp@intel.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260218163329.GQ1395416@noisy.programming.kicks-ass.net
2026-02-11sched/mmcid: Don't assume CID is CPU owned on mode switchThomas Gleixner
Shinichiro reported a KASAN UAF, which is actually an out of bounds access in the MMCID management code. CPU0 CPU1 T1 runs in userspace T0: fork(T4) -> Switch to per CPU CID mode fixup() set MM_CID_TRANSIT on T1/CPU1 T4 exit() T3 exit() T2 exit() T1 exit() switch to per task mode ---> Out of bounds access. As T1 has not scheduled after T0 set the TRANSIT bit, it exits with the TRANSIT bit set. sched_mm_cid_remove_user() clears the TRANSIT bit in the task and drops the CID, but it does not touch the per CPU storage. That's functionally correct because a CID is only owned by the CPU when the ONCPU bit is set, which is mutually exclusive with the TRANSIT flag. Now sched_mm_cid_exit() assumes that the CID is CPU owned because the prior mode was per CPU. It invokes mm_drop_cid_on_cpu() which clears the not set ONCPU bit and then invokes clear_bit() with an insanely large bit number because TRANSIT is set (bit 29). Prevent that by actually validating that the CID is CPU owned in mm_drop_cid_on_cpu(). Fixes: 007d84287c74 ("sched/mmcid: Drop per CPU CID immediately when switching to per task mode") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/aYsZrixn9b6s_2zL@shinmob Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-10Merge tag 'x86_paravirt_for_v7.0_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 paravirt updates from Borislav Petkov: - A nice cleanup to the paravirt code containing a unification of the paravirt clock interface, taming the include hell by splitting the pv_ops structure and removing of a bunch of obsolete code (Juergen Gross) * tag 'x86_paravirt_for_v7.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/paravirt: Use XOR r32,r32 to clear register in pv_vcpu_is_preempted() x86/paravirt: Remove trailing semicolons from alternative asm templates x86/pvlocks: Move paravirt spinlock functions into own header x86/paravirt: Specify pv_ops array in paravirt macros x86/paravirt: Allow pv-calls outside paravirt.h objtool: Allow multiple pv_ops arrays x86/xen: Drop xen_mmu_ops x86/xen: Drop xen_cpu_ops x86/xen: Drop xen_irq_ops x86/paravirt: Move pv_native_*() prototypes to paravirt.c x86/paravirt: Introduce new paravirt-base.h header x86/paravirt: Move paravirt_sched_clock() related code into tsc.c x86/paravirt: Use common code for paravirt_steal_clock() riscv/paravirt: Use common code for paravirt_steal_clock() loongarch/paravirt: Use common code for paravirt_steal_clock() arm64/paravirt: Use common code for paravirt_steal_clock() arm/paravirt: Use common code for paravirt_steal_clock() sched: Move clock related paravirt code to kernel/sched paravirt: Remove asm/paravirt_api_clock.h x86/paravirt: Move thunk macros to paravirt_types.h ...
2026-02-10Merge tag 'sched-core-2026-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scheduler Kconfig space updates: - Further consolidate configurable preemption modes (Peter Zijlstra) Reduce the number of architectures that are allowed to offer PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of preemption models from four to just two: 'full' and 'lazy' on up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86). None and voluntary are only available as legacy features on platforms that don't implement lazy preemption yet, or which don't even support preemption. The goal is to eventually remove cond_resched() and voluntary preemption altogether. RSEQ based 'scheduler time slice extension' support (Thomas Gleixner and Peter Zijlstra): This allows a thread to request a time slice extension when it enters a critical section to avoid contention on a resource when the thread is scheduled out inside of the critical section. - Add fields and constants for time slice extension - Provide static branch for time slice extensions - Add statistics for time slice extensions - Add prctl() to enable time slice extensions - Implement sys_rseq_slice_yield() - Implement syscall entry work for time slice extensions - Implement time slice extension enforcement timer - Reset slice extension when scheduled - Implement rseq_grant_slice_extension() - entry: Hook up rseq time slice extension - selftests: Implement time slice extension test - Allow registering RSEQ with slice extension - Move slice_ext_nsec to debugfs - Lower default slice extension - selftests/rseq: Add rseq slice histogram script Scheduler performance/scalability improvements: - Update rq->avg_idle when a task is moved to an idle CPU, which improves the scalability of various workloads (Shubhang Kaushik) - Reorder fields in 'struct rq' for better caching (Blake Jones) - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde): - Move checking for nohz cpus after time check - Change likelyhood of nohz.nr_cpus - Remove nohz.nr_cpus and use weight of cpumask instead - Avoid false sharing for sched_clock_irqtime (Wangyang Guo) - Cleanups (Yury Norov): - Drop useless cpumask_empty() in find_energy_efficient_cpu() - Simplify task_numa_find_cpu() - Use cpumask_weight_and() in sched_balance_find_dst_group() DL scheduler updates: - Add a deadline server for sched_ext tasks (by Andrea Righi and Joel Fernandes, with fixes by Peter Zijlstra) RT scheduler updates: - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang) Entry code updates and performance improvements (Jinjie Ruan) This is part of the scheduler tree in this cycle due to inter- dependencies with the RSEQ based time slice extension work: - Remove unused syscall argument from syscall_trace_enter() - Rework syscall_exit_to_user_mode_work() for architecture reuse - Add arch_ptrace_report_syscall_entry/exit() - Inline syscall_exit_work() and syscall_trace_enter() Scheduler core updates (Peter Zijlstra): - Rework sched_class::wakeup_preempt() and rq_modified_*() - Avoid rq->lock bouncing in sched_balance_newidle() - Rename rcu_dereference_check_sched_domain() => rcu_dereference_sched_domain() - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar): - Fold the sched_avg update - Change rcu_dereference_check_sched_domain() to rcu-sched - Switch to rcu_dereference_all() - Remove superfluous rcu_read_lock() - Limit hrtick work - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks - Clean up comments in 'struct cfs_rq' - Separate se->vlag from se->vprot - Rename cfs_rq::avg_load to cfs_rq::sum_weight - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions - Introduce and use the vruntime_cmp() and vruntime_op() wrappers for wrapped-signed aritmetics - Sort out 'blocked_load*' namespace noise Scheduler debugging code updates: - Export hidden tracepoints to modules (Gabriele Monaco) - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user() (Fushuai Wang) - Add assertions to QUEUE_CLASS (Peter Zijlstra) - hrtimer: Fix tracing oddity (Thomas Gleixner) Misc fixes and cleanups: - Re-evaluate scheduling when migrating queued tasks out of throttled cgroups (Zicheng Qu) - Remove task_struct->faults_disabled_mapping (Christoph Hellwig) - Fix math notation errors in avg_vruntime comment (Zhan Xusheng) - sched/cpufreq: Use %pe format for PTR_ERR() printing (zenghongling)" * tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups sched/cpufreq: Use %pe format for PTR_ERR() printing sched/rt: Skip currently executing CPU in rto_next_cpu() sched/clock: Avoid false sharing for sched_clock_irqtime selftests/sched_ext: Add test for DL server total_bw consistency selftests/sched_ext: Add test for sched_ext dl_server sched/debug: Fix dl_server (re)start conditions sched/debug: Add support to change sched_ext server params sched_ext: Add a DL server for sched_ext tasks sched/debug: Stop and start server based on if it was active sched/debug: Fix updating of ppos on server write ops sched/deadline: Clear the defer params entry: Inline syscall_exit_work() and syscall_trace_enter() entry: Add arch_ptrace_report_syscall_entry/exit() entry: Rework syscall_exit_to_user_mode_work() for architecture reuse entry: Remove unused syscall argument from syscall_trace_enter() sched: remove task_struct->faults_disabled_mapping sched: Update rq->avg_idle when a task is moved to an idle CPU selftests/rseq: Add rseq slice histogram script hrtimer: Fix trace oddity ...
2026-02-10Merge tag 'locking-core-2026-02-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Lock debugging: - Implement compiler-driven static analysis locking context checking, using the upcoming Clang 22 compiler's context analysis features (Marco Elver) We removed Sparse context analysis support, because prior to removal even a defconfig kernel produced 1,700+ context tracking Sparse warnings, the overwhelming majority of which are false positives. On an allmodconfig kernel the number of false positive context tracking Sparse warnings grows to over 5,200... On the plus side of the balance actual locking bugs found by Sparse context analysis is also rather ... sparse: I found only 3 such commits in the last 3 years. So the rate of false positives and the maintenance overhead is rather high and there appears to be no active policy in place to achieve a zero-warnings baseline to move the annotations & fixers to developers who introduce new code. Clang context analysis is more complete and more aggressive in trying to find bugs, at least in principle. Plus it has a different model to enabling it: it's enabled subsystem by subsystem, which results in zero warnings on all relevant kernel builds (as far as our testing managed to cover it). Which allowed us to enable it by default, similar to other compiler warnings, with the expectation that there are no warnings going forward. This enforces a zero-warnings baseline on clang-22+ builds (Which are still limited in distribution, admittedly) Hopefully the Clang approach can lead to a more maintainable zero-warnings status quo and policy, with more and more subsystems and drivers enabling the feature. Context tracking can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default disabled), but this will generate a lot of false positives. ( Having said that, Sparse support could still be added back, if anyone is interested - the removal patch is still relatively straightforward to revert at this stage. ) Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng) - Add support for Atomic<i8/i16/bool> and replace most Rust native AtomicBool usages with Atomic<bool> - Clean up LockClassKey and improve its documentation - Add missing Send and Sync trait implementation for SetOnce - Make ARef Unpin as it is supposed to be - Add __rust_helper to a few Rust helpers as a preparation for helper LTO - Inline various lock related functions to avoid additional function calls WW mutexes: - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz) Misc fixes and cleanups: - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd Bergmann) - locking/local_lock: Include more missing headers (Peter Zijlstra) - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap) - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir Duberstein)" * tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits) locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK rcu: Mark lockdep_assert_rcu_helper() __always_inline compiler-context-analysis: Remove __assume_ctx_lock from initializers tomoyo: Use scoped init guard crypto: Use scoped init guard kcov: Use scoped init guard compiler-context-analysis: Introduce scoped init guards cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers seqlock: fix scoped_seqlock_read kernel-doc tools: Update context analysis macros in compiler_types.h rust: sync: Replace `kernel::c_str!` with C-Strings rust: sync: Inline various lock related methods rust: helpers: Move #define __rust_helper out of atomic.c rust: wait: Add __rust_helper to helpers rust: time: Add __rust_helper to helpers rust: task: Add __rust_helper to helpers rust: sync: Add __rust_helper to helpers rust: refcount: Add __rust_helper to helpers rust: rcu: Add __rust_helper to helpers rust: processor: Add __rust_helper to helpers ...
2026-02-04sched/mmcid: Drop per CPU CID immediately when switching to per task modeThomas Gleixner
When a exiting task initiates the switch from per CPU back to per task mode, it has already dropped its CID and marked itself inactive. But a leftover from an earlier iteration of the rework then reassigns the per CPU CID to the exiting task with the transition bit set. That's wrong as the task is already marked CID inactive, which means it is inconsistent state. It's harmless because the CID is marked in transit and therefore dropped back into the pool when the exiting task schedules out either through preemption or the final schedule(). Simply drop the per CPU CID when the exiting task triggered the transition. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192835.032221009@kernel.org
2026-02-04sched/mmcid: Protect transition on weakly ordered systemsThomas Gleixner
Shrikanth reported a hard lockup which he observed once. The stack trace shows the following CID related participants: watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188 NIP: mm_get_cid+0xe8/0x188 LR: mm_get_cid+0x108/0x188 mm_cid_switch_to+0x3c4/0x52c __schedule+0x47c/0x700 schedule_idle+0x3c/0x64 do_idle+0x160/0x1b0 cpu_startup_entry+0x48/0x50 start_secondary+0x284/0x288 start_secondary_prolog+0x10/0x14 watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c NIP: plpar_hcall_norets_notrace+0x18/0x2c LR: queued_spin_lock_slowpath+0xd88/0x15d0 _raw_spin_lock+0x80/0xa0 raw_spin_rq_lock_nested+0x3c/0xf8 mm_cid_fixup_cpus_to_tasks+0xc8/0x28c sched_mm_cid_exit+0x108/0x22c do_exit+0xf4/0x5d0 make_task_dead+0x0/0x178 system_call_exception+0x128/0x390 system_call_vectored_common+0x15c/0x2ec The task on CPU11 is running the CID ownership mode change fixup function and is stuck on a runqueue lock. The task on CPU23 is trying to get a CID from the pool with the same runqueue lock held, but the pool is empty. After decoding a similar issue in the opposite direction switching from per task to per CPU mode the tool which models the possible scenarios failed to come up with a similar loop hole. This showed up only once, was not reproducible and according to tooling not related to a overlooked scheduling scenario permutation. But the fact that it was observed on a PowerPC system gave the right hint: PowerPC is a weakly ordered architecture. The transition mechanism does: WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT); WRITE_ONCE(mm->mm_cid.percpu, new_mode); fixup() WRITE_ONCE(mm->mm_cid.transit, 0); mm_cid_schedin() does: if (!READ_ONCE(mm->mm_cid.percpu)) ... cid |= READ_ONCE(mm->mm_cid.transit); so weakly ordered systems can observe percpu == false and transit == 0 even if the fixup function has not yet completed. As a consequence the task will not drop the CID when scheduling out before the fixup is completed, which means the CID space can be exhausted and the next task scheduling in will loop in mm_get_cid() and the fixup thread can livelock on the held runqueue lock as above. This could obviously be solved by using: smp_store_release(&mm->mm_cid.percpu, true); and smp_load_acquire(&mm->mm_cid.percpu); but that brings a memory barrier back into the scheduler hotpath, which was just designed out by the CID rewrite. That can be completely avoided by combining the per CPU mode and the transit storage into a single mm_cid::mode member and ordering the stores against the fixup functions to prevent the CPU from reordering them. That makes the update of both states atomic and a concurrent read observes always consistent state. The price is an additional AND operation in mm_cid_schedin() to evaluate the per CPU or the per task path, but that's in the noise even on strongly ordered architectures as the actual load can be significantly more expensive and the conditional branch evaluation is there anyway. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/bdfea828-4585-40e8-8835-247c6a8a76b0@linux.ibm.com Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192834.965217106@kernel.org
2026-02-04sched/mmcid: Prevent live lock on task to CPU mode transitionThomas Gleixner
Ihor reported a BPF CI failure which turned out to be a live lock in the MM_CID management. The scenario is: A test program creates the 5th thread, which means the MM_CID users become more than the number of CPUs (four in this example), so it switches to per CPU ownership mode. At this point each live task of the program has a CID associated. Assume thread creation order assignment for simplicity. T0 CID0 runs fork() and creates T4 T1 CID1 T2 CID2 T3 CID3 T4 --- not visible yet T0 sets mm_cid::percpu = true and transfers its own CID to CPU0 where it runs on and then starts the fixup which walks through the threads to transfer the per task CIDs either to the CPU the task is running on or drop it back into the pool if the task is not on a CPU. During that T1 - T3 are free to schedule in and out before the fixup caught up with them. Going through all possible permutations with a python script revealed a few problematic cases. The most trivial one is: T1 schedules in on CPU1 and observes percpu == true, so it transfers its CID to CPU1 T1 is migrated to CPU2 and schedule in observes percpu == true, but CPU2 does not have a CID associated and T1 transferred its own to CPU1 So it has to allocate one with CPU2 runqueue lock held, but the pool is empty, so it keeps looping in mm_get_cid(). Now T0 reaches T1 in the thread walk and tries to lock the corresponding runqueue lock, which is held causing a full live lock. There is a similar scenario in the reverse direction of switching from per CPU to task mode which is way more obvious and got therefore addressed by an intermediate mode. In this mode the CIDs are marked with MM_CID_TRANSIT, which means that they are neither owned by the CPU nor by the task. When a task schedules out with a transit CID it drops the CID back into the pool making it available for others to use temporarily. Once the task which initiated the mode switch finished the fixup it clears the transit mode and the process goes back into per task ownership mode. Unfortunately this insight was not mapped back to the task to CPU mode switch as the above described scenario was not considered in the analysis. Apply the same transit mechanism to the task to CPU mode switch to handle these problematic cases correctly. As with the CPU to task transition this results in a potential temporary contention on the CID bitmap, but that's only for the time it takes to complete the transition. After that it stays in steady mode which does not touch the bitmap at all. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
2026-02-03sched: Re-evaluate scheduling when migrating queued tasks out of throttled ↵Zicheng Qu
cgroups Consider the following sequence on a CPU configured with nohz_full: 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS bandwidth control. The gse (cgroup A) where the task P attached is dequeued and the CPU switches to idle. 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to another cgroup B (not throttled). During sched_move_task(), the task P is observed as queued but not running, and therefore no resched_curr() is triggered. 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an explicit scheduling event, i.e., resched_curr(). 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task P has already been migrated out of cgroup A, so unthrottle_cfs_rq() may observe load_weight == 0 and return early without resched_curr() called. For kernel >= 6.6: The unthrottling path normally triggers `resched_curr()` almost cases even when no runnable tasks remain in the unthrottled cgroup, preventing the idle stall described above. However, if cgroup A is removed before it gets unthrottled, the unthrottling path for cgroup A is never executed. In a result, no `resched_curr()` can be called. 5) At this point, the task P is runnable in cgroup B (not throttled), but the CPU remains in do_idle() with no pending reschedule point. The system stays in this state until an unrelated event (e.g. a new task wakeup or any cases) that can trigger a resched_curr() breaks the nohz_full idle state, and then the task P finally gets scheduled. The root cause is that sched_move_task() may classify the task as only queued, not running, and therefore fails to trigger a resched_curr(), while the later unthrottling path no longer has visibility of the migrated task. Preserve the existing behavior for running tasks by issuing resched_curr(), and explicitly invoke check_preempt_curr() for tasks that were queued at the time of migration. This ensures that runnable tasks are reconsidered for scheduling even when nohz_full suppresses periodic ticks. Fixes: 29f59db3a74b ("sched: group-scheduler core") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
2026-02-03sched_ext: Add a DL server for sched_ext tasksAndrea Righi
sched_ext currently suffers starvation due to RT. The same workload when converted to EXT can get zero runtime if RT is 100% running, causing EXT processes to stall. Fix it by adding a DL server for EXT. A kselftest is also included later to confirm that both DL servers are functioning correctly: # ./runner -t rt_stall ===== START ===== TEST: rt_stall DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks OUTPUT: TAP version 13 1..1 # Runtime of FAIR task (PID 1511) is 0.250000 seconds # Runtime of RT task (PID 1512) is 4.750000 seconds # FAIR task got 5.00% of total runtime ok 1 PASS: FAIR task got more than 4.00% of runtime TAP version 13 1..1 # Runtime of EXT task (PID 1514) is 0.250000 seconds # Runtime of RT task (PID 1515) is 4.750000 seconds # EXT task got 5.00% of total runtime ok 2 PASS: EXT task got more than 4.00% of runtime TAP version 13 1..1 # Runtime of FAIR task (PID 1517) is 0.250000 seconds # Runtime of RT task (PID 1518) is 4.750000 seconds # FAIR task got 5.00% of total runtime ok 3 PASS: FAIR task got more than 4.00% of runtime TAP version 13 1..1 # Runtime of EXT task (PID 1521) is 0.250000 seconds # Runtime of RT task (PID 1522) is 4.750000 seconds # EXT task got 5.00% of total runtime ok 4 PASS: EXT task got more than 4.00% of runtime ok 1 rt_stall # ===== END ===== Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-5-arighi@nvidia.com
2026-02-03Merge branch 'v6.19-rc8'Peter Zijlstra
Update to avoid conflicts with /urgent patches. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2026-01-22sched: Update rq->avg_idle when a task is moved to an idle CPUShubhang Kaushik
Currently, rq->idle_stamp is only used to calculate avg_idle during wakeups. This means other paths that move a task to an idle CPU such as fork/clone, execve, or migrations, do not end the CPU's idle status in the scheduler's eyes, leading to an inaccurate avg_idle. This patch introduces update_rq_avg_idle() to provide a more accurate measurement of CPU idle duration. By invoking this helper in put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU stops being idle, regardless of how the new task arrived. Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline: - Hackbench : +7.2% performance gain at 16 threads. - Schbench: Reduced p99.9 tail latencies at high concurrency. Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com
2026-01-15sched: Fix build for modules using set_tsk_need_resched()Gabriele Monaco
Commit adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") added a tracepoint to the need_resched action that can be triggered also by set_tsk_need_resched. This function was previously accessible from out-of-tree modules but it's no longer available because the __trace_set_need_resched() symbol is not exported (together with the tracepoint itself, which was exported in a separate patch) and building such modules fails. Export __trace_set_need_resched to modules to fix those build issues. Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20260112140413.362202-1-gmonaco@redhat.com
2026-01-15sched: Deadline has dynamic priorityPeter Zijlstra
While FIFO/RR have static priority, DEADLINE is a dynamic priority scheme. Notably it has static priority -1. Do not assume the priority doesn't change for deadline tasks just because the static priority doesn't change. This ensures DL always sees {DE,EN}QUEUE_MOVE where appropriate. Fixes: ff77e4685359 ("sched/rt: Fix PI handling vs. sched_setscheduler()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Audit MOVE vs balance_callbacksPeter Zijlstra
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change priority, which means there could be balance callbacks queued. Therefore audit all MOVE users and make sure they do run balance callbacks before dropping rq-lock. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-15sched: Fold rq-pin swizzle into __balance_callbacks()Peter Zijlstra
Prepare for more users needing the rq-pin swizzle. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2026-01-13sched: Export hidden tracepoints to modulesGabriele Monaco
The tracepoints sched_entry, sched_exit and sched_set_need_resched are not exported to tracefs as trace events, this allows only kernel code to access them. Helper modules like [1] can be used to still have the tracepoints available to ftrace for debugging purposes, but they do rely on the tracepoints being exported. Export the 3 not exported tracepoints. Note that sched_set_state is already exported as the macro is called from modules. [1] - https://github.com/qais-yousef/sched_tp.git Fixes: adcc3bfa8806 ("sched: Adapt sched tracepoints for RV task model") Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://patch.msgid.link/20251205131621.135513-9-gmonaco@redhat.com
2026-01-12sched: Move clock related paravirt code to kernel/schedJuergen Gross
Paravirt clock related functions are available in multiple archs. In order to share the common parts, move the common static keys to kernel/sched/ and remove them from the arch specific files. Make a common paravirt_steal_clock() implementation available in kernel/sched/cputime.c, guarding it with a new config option CONFIG_HAVE_PV_STEAL_CLOCK_GEN, which can be selected by an arch in case it wants to use that common variant. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260105110520.21356-7-jgross@suse.com
2026-01-09sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()Cong Wang
sched_mm_cid_after_execve() is called in bprm_execve()'s cleanup path even when exec_binprm() fails. For the init task's first execve(), this causes a problem: 1. current->mm is NULL (kernel threads don't have an mm) 2. sched_mm_cid_before_execve() exits early because mm is NULL 3. exec_binprm() fails (e.g., ENOENT for missing script interpreter) 4. sched_mm_cid_after_execve() is called with mm still NULL 5. sched_mm_cid_fork() is called unconditionally, triggering WARN_ON This is easily reproduced by booting with an init that is a shell script (#!/bin/sh) where the interpreter doesn't exist in the initramfs. Fix this by checking if t->mm is NULL before calling sched_mm_cid_fork(), matching the behavior of sched_mm_cid_before_execve() which already handles this case via sched_mm_cid_exit()'s early return. Fixes: b0c3d51b54f8 ("sched/mmcid: Provide precomputed maximal value") Signed-off-by: Cong Wang <cwang@multikernel.io> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20251223215113.639686-1-xiyou.wangcong@gmail.com
2026-01-08sched: Further restrict the preemption modesPeter Zijlstra
The introduction of PREEMPT_LAZY was for multiple reasons: - PREEMPT_RT suffered from over-scheduling, hurting performance compared to !PREEMPT_RT. - the introduction of (more) features that rely on preemption; like folio_zero_user() which can do large memset() without preemption checks. (Xen already had a horrible hack to deal with long running hypercalls) - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo cult or in response to poor to replicate workloads. By moving to a model that is fundamentally preemptable these things become managable and avoid needing to introduce more horrible hacks. Since this is a requirement; limit PREEMPT_NONE to architectures that do not support preemption at all. Further limit PREEMPT_VOLUNTARY to those architectures that do not yet have PREEMPT_LAZY support (with the eventual goal to make this the empty set and completely remove voluntary preemption and cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.) This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390, x86) with only two preemption models: full and lazy. While Lazy has been the recommended setting for a while, not all distributions have managed to make the switch yet. Force things along. Keep the patch minimal in case of hard to address regressions that might pop up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20251219101502.GB1132199@noisy.programming.kicks-ass.net
2026-01-05sched: Enable context analysis for core.c and fair.cMarco Elver
This demonstrates a larger conversion to use Clang's context analysis. The benefit is additional static checking of locking rules, along with better documentation. Notably, kernel/sched contains sufficiently complex synchronization patterns, and application to core.c & fair.c demonstrates that the latest Clang version has become powerful enough to start applying this to more complex subsystems (with some modest annotations and changes). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-37-elver@google.com
2025-12-17sched: Fix faulty assertion in sched_change_end()Peter Zijlstra
Commit 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") added an assert to sched_change_end() verifying that a class demotion would result in a reschedule. As it turns out; rt_mutex_setprio() does not force a resched on class demontion. Furthermore, this is only relevant to running tasks. Change the warning into a reschedule and make sure to only do so for running tasks. Fixes: 47efe2ddccb1f ("sched/core: Add assertions to QUEUE_CLASS") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251216141725.GW3707837@noisy.programming.kicks-ass.net
2025-12-17sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()Peter Zijlstra
Change sched_class::wakeup_preempt() to also get called for cross-class wakeups, specifically those where the woken task is of a higher class than the previous highest class. In order to do this, track the current highest class of the runqueue in rq::next_class and have wakeup_preempt() track this upwards for each new wakeup. Additionally have schedule() re-set the value on pick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
2025-12-14sched/core: Add assertions to QUEUE_CLASSPeter Zijlstra
Add some checks to the sched_change pattern to validate assumptions around changing classes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
2025-12-06Merge tag 'sched-urgent-2025-12-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Miscellaneous scheduler fixes/cleanups: - Fix psi_dequeue() for Proxy Execution - Fix hrtick() vs. scheduling context bug - Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out - Fix whitespace noise in headers - Remove a preempt-disable section in rt_mutex_setprio()" * tag 'sched-urgent-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix psi_dequeue() for Proxy Execution sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out sched/rt: Remove a preempt-disable section in rt_mutex_setprio() sched/hrtick: Fix hrtick() vs. scheduling context sched/headers: Remove whitespace noise from kernel/sched/sched.h
2025-12-06sched/rt: Remove a preempt-disable section in rt_mutex_setprio()Sebastian Andrzej Siewior
rt_mutex_setprio() has only one caller: rt_mutex_adjust_prio(). It expects that task_struct::pi_lock and rt_mutex_base::wait_lock are held. Both locks are raw_spinlock_t and are acquired with disabled interrupts. Nevertheless rt_mutex_setprio() disables preemption while invoking __balance_callbacks() and raw_spin_rq_unlock(). Even if one of the balance callbacks unlocks the rq then it must not enable interrupts because rt_mutex_base::wait_lock is still locked. Therefore interrupts should remain disabled and disabling preemption is not needed. Commit 4c9a4bc89a9cc ("sched: Allow balance callbacks for check_class_changed()") adds a preempt-disable section to rt_mutex_setprio() and __sched_setscheduler(). In __sched_setscheduler() the preemption is disabled before rq is unlocked and interrupts enabled but I don't see why it makes a difference in rt_mutex_setprio(). Remove the preempt_disable() section from rt_mutex_setprio(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20251127155529.t_sTatE4@linutronix.de
2025-12-06sched/hrtick: Fix hrtick() vs. scheduling contextPeter Zijlstra
The sched_class::task_tick() method is called on the donor sched_class, and sched_tick() hands it rq->donor as argument, which is consistent. However, while hrtick() uses the donor sched_class, it then passes rq->curr, which is inconsistent. Fix it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20250918080205.442967033@infradead.org
2025-12-03Merge tag 'sched_ext-for-6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Improve recovery from misbehaving BPF schedulers. When a scheduler puts many tasks with varying affinity restrictions on a shared DSQ, CPUs scanning through tasks they cannot run can overwhelm the system, causing lockups. Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example scheduler to demonstrate this scenario. - Add lockless peek operation for DSQs to reduce lock contention for schedulers that need to query queue state during load balancing. - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks. - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in structs for future aux__prog parameter. - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's idle state changes. - Fix migration tasks being incorrectly downgraded from stop_sched_class to rt_sched_class across sched_ext enable/disable. Applied late as the fix is low risk and the bug subtle but needs stable backporting. - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT reliability, and backward compatibility improvements. * tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits) sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks sched_ext: tools: Removing duplicate targets during non-cross compilation sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs sched_ext: Update comments replacing breather with aborting mechanism sched_ext: Implement load balancer for bypass mode sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR sched_ext: Add scx_cpu0 example scheduler sched_ext: Hook up hardlockup detector sched_ext: Make handle_lockup() propagate scx_verror() result sched_ext: Refactor lockup handlers into handle_lockup() sched_ext: Make scx_exit() and scx_vexit() return bool sched_ext: Exit dispatch and move operations immediately when aborting sched_ext: Simplify breather mechanism with scx_aborting flag sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode sched_ext: Refactor do_enqueue_task() local and global DSQ paths sched_ext: Use shorter slice in bypass mode sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races sched_ext: Minor cleanups to scx_task_iter ...
2025-12-03Merge tag 'cgroup-for-6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Defer task cgroup unlink until after the dying task's final context switch so that controllers see the cgroup properly populated until the task is truly gone - cpuset cleanups and simplifications. Enforce that domain isolated CPUs stay in root or isolated partitions and fail if isolated+nohz_full would leave no housekeeping CPU. Fix sched/deadline root domain handling during CPU hot-unplug and race for tasks in attaching cpusets - Misc fixes including memory reclaim protection documentation and selftest KTAP conformance * tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Treat cpusets in attaching as populated sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug cgroup/cpuset: Introduce cpuset_cpus_allowed_locked() docs: cgroup: No special handling of unpopulated memcgs docs: cgroup: Note about sibling relative reclaim protection docs: cgroup: Explain reclaim protection target selftests/cgroup: conform test to KTAP format output cpuset: remove need_rebuild_sched_domains cpuset: remove global remote_children list cpuset: simplify node setting on error cgroup: include missing header for struct irq_work cgroup: Fix sleeping from invalid context warning on PREEMPT_RT cgroup/cpuset: Globally track isolated_cpus update cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition cgroup/cpuset: Move up prstate_housekeeping_conflict() helper cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() cgroup: Defer task cgroup unlink until after the task is done switching out cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() ...
2025-12-02Merge tag 'core-rseq-2025-11-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq updates from Thomas Gleixner: "A large overhaul of the restartable sequences and CID management: The recent enablement of RSEQ in glibc resulted in regressions which are caused by the related overhead. It turned out that the decision to invoke the exit to user work was not really a decision. More or less each context switch caused that. There is a long list of small issues which sums up nicely and results in a 3-4% regression in I/O benchmarks. The other detail which caused issues due to extra work in context switch and task migration is the CID (memory context ID) management. It also requires to use a task work to consolidate the CID space, which is executed in the context of an arbitrary task and results in sporadic uncontrolled exit latencies. The rewrite addresses this by: - Removing deprecated and long unsupported functionality - Moving the related data into dedicated data structures which are optimized for fast path processing. - Caching values so actual decisions can be made - Replacing the current implementation with a optimized inlined variant. - Separating fast and slow path for architectures which use the generic entry code, so that only fault and error handling goes into the TIF_NOTIFY_RESUME handler. - Rewriting the CID management so that it becomes mostly invisible in the context switch path. That moves the work of switching modes into the fork/exit path, which is a reasonable tradeoff. That work is only required when a process creates more threads than the cpuset it is allowed to run on or when enough threads exit after that. An artificial thread pool benchmarks which triggers this did not degrade, it actually improved significantly. The main effect in migration heavy scenarios is that runqueue lock held time and therefore contention goes down significantly" * tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/mmcid: Switch over to the new mechanism sched/mmcid: Implement deferred mode change irqwork: Move data struct to a types header sched/mmcid: Provide CID ownership mode fixup functions sched/mmcid: Provide new scheduler CID mechanism sched/mmcid: Introduce per task/CPU ownership infrastructure sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex sched/mmcid: Provide precomputed maximal value sched/mmcid: Move initialization out of line signal: Move MMCID exit out of sighand lock sched/mmcid: Convert mm CID mask to a bitmap cpumask: Cache num_possible_cpus() sched/mmcid: Use cpumask_weighted_or() cpumask: Introduce cpumask_weighted_or() sched/mmcid: Prevent pointless work in mm_update_cpus_allowed() sched/mmcid: Move scheduler code out of global header sched: Fixup whitespace damage sched/mmcid: Cacheline align MM CID storage sched/mmcid: Use proper data structures sched/mmcid: Revert the complex CID management ...
2025-12-01Merge tag 'sched-core-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scalability and load-balancing improvements: - Enable scheduler feature NEXT_BUDDY (Mel Gorman) - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman) - Skip sched_balance_running cmpxchg when balance is not due (Tim Chen) - Implement generic code for architecture specific sched domain NUMA distances (Tim Chen) - Optimize the NUMA distances of the sched-domains builds of Intel Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim Chen) - Implement proportional newidle balance: a randomized algorithm that runs newidle balancing proportional to its success rate. (Peter Zijlstra) Scheduler infrastructure changes: - Implement the 'sched_change' scoped_guard() pattern for the entire scheduler (Peter Zijlstra) - More broadly utilize the sched_change guard (Peter Zijlstra) - Add support to pick functions to take runqueue-flags (Joel Fernandes) - Provide and use set_need_resched_current() (Peter Zijlstra) Fair scheduling enhancements: - Forfeit vruntime on yield (Fernand Sieber) - Only update stats for allowed CPUs when looking for dst group (Adam Li) CPU-core scheduling enhancements: - Optimize core cookie matching check (Fernand Sieber) Deadline scheduler fixes: - Only set free_cpus for online runqueues (Doug Berger) - Fix dl_server time accounting (Peter Zijlstra) - Fix dl_server stop condition (Peter Zijlstra) Proxy scheduling fixes: - Yield the donor task (Fernand Sieber) Fixes and cleanups: - Fix do_set_cpus_allowed() locking (Peter Zijlstra) - Fix migrate_disable_switch() locking (Peter Zijlstra) - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() (Hao Jia) - Increase sched_tick_remote timeout (Phil Auld) - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth Hegde) - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)" * tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched: Provide and use set_need_resched_current() sched/fair: Proportional newidle balance sched/fair: Small cleanup to update_newidle_cost() sched/fair: Small cleanup to sched_balance_newidle() sched/fair: Revert max_newidle_lb_cost bump sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals sched/fair: Enable scheduler feature NEXT_BUDDY sched: Increase sched_tick_remote timeout sched/fair: Have SD_SERIALIZE affect newidle balancing sched/fair: Skip sched_balance_running cmpxchg when balance is not due sched/deadline: Minor cleanup in select_task_rq_dl() sched/deadline: Use cpumask_weight_and() in dl_bw_cpus sched/deadline: Document dl_server sched/deadline: Fix dl_server stop condition sched/deadline: Fix dl_server time accounting sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() sched/eevdf: Fix min_vruntime vs avg_vruntime sched/core: Add comment explaining force-idle vruntime snapshots sched/core: Optimize core cookie matching check sched/proxy: Yield the donor task ...
2025-11-25sched/mmcid: Switch over to the new mechanismThomas Gleixner
Now that all pieces are in place, change the implementations of sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict ownership scheme and switch context_switch() over to use the new mm_cid_schedin() functionality. The common case is that there is no mode change required, which makes fork() and exit() just update the user count and the constraints. In case that a new user would exceed the CID space limit the fork() context handles the transition to per CPU mode with mm::mm_cid::mutex held. exit() handles the transition back to per task mode when the user count drops below the switch back threshold. fork() might also be forced to handle a deferred switch back to per task mode, when a affinity change increased the number of allowed CPUs enough. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
2025-11-25sched/mmcid: Implement deferred mode changeThomas Gleixner
When affinity changes cause an increase of the number of CPUs allowed for tasks which are related to a MM, that might results in a situation where the ownership mode can go back from per CPU mode to per task mode. As affinity changes happen with runqueue lock held there is no way to do the actual mode change and required fixup right there. Add the infrastructure to defer it to a workqueue. The scheduled work can race with a fork() or exit(). Whatever happens first takes care of it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
2025-11-25sched/mmcid: Provide CID ownership mode fixup functionsThomas Gleixner
CIDs are either owned by tasks or by CPUs. The ownership mode depends on the number of tasks related to a MM and the number of CPUs on which these tasks are theoretically allowed to run on. Theoretically because that number is the superset of CPU affinities of all tasks which only grows and never shrinks. Switching to per CPU mode happens when the user count becomes greater than the maximum number of CIDs, which is calculated by: opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users); max_cids = min(1.25 * opt_cids, nr_cpu_ids); The +25% allowance is useful for tight CPU masks in scenarios where only a few threads are created and destroyed to avoid frequent mode switches. Though this allowance shrinks, the closer opt_cids becomes to nr_cpu_ids, which is the (unfortunate) hard ABI limit. At the point of switching to per CPU mode the new user is not yet visible in the system, so the task which initiated the fork() runs the fixup function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either transfers each tasks owned CID to the CPU the task runs on or drops it into the CID pool if a task is not on a CPU at that point in time. Tasks which schedule in before the task walk reaches them do the handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's guaranteed that no task related to that MM owns a CID anymore. Switching back to task mode happens when the user count goes below the threshold which was recorded on the per CPU mode switch: pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2); This threshold is updated when a affinity change increases the number of allowed CPUs for the MM, which might cause a switch back to per task mode. If the switch back was initiated by a exiting task, then that task runs the fixup function. If it was initiated by a affinity change, then it's run either in the deferred update function in context of a workqueue or by a task which forks a new one or by a task which exits. Whatever happens first. mm_cid_fixup_cpus_to_task() walks through the possible CPUs and either transfers the CPU owned CIDs to a related task which runs on the CPU or drops it into the pool. Tasks which schedule in on a CPU which the walk did not cover yet do the handover themselves. This transition from CPU to per task ownership happens in two phases: 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task CID and denotes that the CID is only temporarily owned by the task. When it schedules out the task drops the CID back into the pool if this bit is set. 2) The initiating context walks the per CPU space and after completion clears mm:mm_cid.transit. After that point the CIDs are strictly task owned again. This two phase transition is required to prevent CID space exhaustion during the transition as a direct transfer of ownership would fail if two tasks are scheduled in on the same CPU before the fixup freed per CPU CIDs. When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID related to that MM is owned by a CPU anymore. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
2025-11-25sched/mmcid: Provide new scheduler CID mechanismThomas Gleixner
The MM CID management has two fundamental requirements: 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace. 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least some LTTng library depending on it. The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory foot print in unused user space pools. The optimal CID space is: min(nr_tasks, nr_cpus_allowed); Where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just do redo it 2 milliseconds later when the next task changes it's affinity. That means that as long as the number of tasks is lower or equal than the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs. The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM this can inflict unwanted exit latencies. Implement the context switch parts of a strict ownership mechanism to address this. This removes most of the work from the task which schedules out. Only during transitioning from per CPU to per task ownership it is required to drop the CID when leaving the CPU to prevent CID space exhaustion. Other than that scheduling out is just a single check and branch. The task which schedules in has to check whether: 1) The ownership mode changed 2) The CID is within the optimal CID space In stable situations this results in zero work. The only short disruption is when ownership mode changes or when the associated CID is not in the optimal CID space. The latter only happens when tasks exit and therefore the optimal CID space shrinks. That mechanism is strictly optimized for the common case where no change happens. The only case where it actually causes a temporary one time spike is on mode changes when and only when a lot of tasks related to a MM schedule exactly at the same time and have eventually to compete on allocating a CID from the bitmap. In the sysbench test case which triggered the spinlock contention in the initial CID code, __schedule() drops significantly in perf top on a 128 Core (256 threads) machine when running sysbench with 255 threads, which fits into the task mode limit of 256 together with the parent thread: Upstream rseq/perf branch +CID rework 0.42% 0.37% 0.32% [k] __schedule Increasing the number of threads to 256, which puts the test process into per CPU mode looks about the same. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
2025-11-25sched/mmcid: Introduce per task/CPU ownership infrastructureThomas Gleixner
The MM CID management has two fundamental requirements: 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace. 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least librseq depending on it. The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory foot print in unused user space pools. The optimal CID space is: min(nr_tasks, nr_cpus_allowed); Where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just do redo it 2 milliseconds later when the next task changes its affinity. That means that as long as the number of tasks is lower or equal than the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs. The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM this can inflict unwanted exit latencies. This can be done differently by implementing a strict CID ownership mechanism. Either the CIDs are owned by the tasks or by the CPUs. The latter provides less locality when tasks are heavily migrating, but there is no justification to optimize for overcommit scenarios and thereby penalizing everyone else. Provide the basic infrastructure to implement this: - Change the UNSET marker to BIT(31) from ~0U - Add the ONCPU marker as BIT(30) - Add the TRANSIT marker as BIT(29) That allows to check for ownership trivially and provides a simple check for UNSET as well. The TRANSIT marker is required to prevent CID space exhaustion when switching from per CPU to per task mode. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
2025-11-25sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutexThomas Gleixner
Prepare for the new CID management scheme which puts the CID ownership transition into the fork() and exit() slow path by serializing sched_mm_cid_fork()/exit() with it, so task list and cpu mask walks can be done in interruptible and preemptible code. The contention on it is not worse than on other concurrency controls in the fork()/exit() machinery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
2025-11-25sched/mmcid: Provide precomputed maximal valueThomas Gleixner
Reading mm::mm_users and mm:::mm_cid::nr_cpus_allowed every time to compute the maximal CID value is just wasteful as that value is only changing on fork(), exit() and eventually when the affinity changes. So it can be easily precomputed at those points and provided in mm::mm_cid for consumption in the hot path. But there is an issue with using mm::mm_users for accounting because that does not necessarily reflect the number of user space tasks as other kernel code can take temporary references on the MM which skew the picture. Solve that by adding a users counter to struct mm_mm_cid, which is modified by fork() and exit() and used for precomputing under mm_mm_cid::lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
2025-11-25sched/mmcid: Move initialization out of lineThomas Gleixner
It's getting bigger soon, so just move it out of line to the rest of the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
2025-11-25signal: Move MMCID exit out of sighand lockThomas Gleixner
There is no need anymore to keep this under sighand lock as the current code and the upcoming replacement are not depending on the exit state of a task anymore. That allows to use a mutex in the exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
2025-11-25sched/mmcid: Convert mm CID mask to a bitmapThomas Gleixner
This is truly a bitmap and just conveniently uses a cpumask because the maximum size of the bitmap is nr_cpu_ids. But that prevents to do searches for a zero bit in a limited range, which is helpful to provide an efficient mechanism to consolidate the CID space when the number of users decreases. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
2025-11-20sched/mmcid: Use cpumask_weighted_or()Thomas Gleixner
Use cpumask_weighted_or() instead of cpumask_or() and cpumask_weight() on the result, which walks the same bitmap twice. Results in 10-20% less cycles, which reduces the runqueue lock hold time. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.511736272@linutronix.de
2025-11-20sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()Thomas Gleixner
mm_update_cpus_allowed() is not required to be invoked for affinity changes due to migrate_disable() and migrate_enable(). migrate_disable() restricts the task temporarily to a CPU on which the task was already allowed to run, so nothing changes. migrate_enable() restores the actual task affinity mask. If that mask changed between migrate_disable() and migrate_enable() then that change was already accounted for. Move the invocation to the proper place to avoid that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.385208276@linutronix.de
2025-11-20sched/mmcid: Move scheduler code out of global headerThomas Gleixner
This is only used in the scheduler core code, so there is no point to have it in a global header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.321259077@linutronix.de
2025-11-20sched: Fixup whitespace damageThomas Gleixner
With whitespace checks enabled in the editor this makes eyes bleed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.258651925@linutronix.de
2025-11-20sched/mmcid: Use proper data structuresThomas Gleixner
Having a lot of CID functionality specific members in struct task_struct and struct mm_struct is not really making the code easier to read. Encapsulate the CID specific parts in data structures and keep them separate from the stuff they are embedded in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.131573768@linutronix.de