summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2 daysMerge tag 'sched_ext-for-7.0-rc6-fixes' of ↵HEADmasterLinus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other in hardirq context form a cycle. Move the wait to a balance callback which can drop the rq lock and process IPIs. - Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where the waker_node used cpu_to_node() while prev_cpu used scx_cpu_node_if_enabled(), leading to undefined behavior when per-node idle tracking is disabled. * tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()
2 daysMerge tag 'wq-for-7.0-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: - Fix false positive stall reports on weakly ordered architectures where the lockless worklist/timestamp check in the watchdog can observe stale values due to memory reordering. Recheck under pool->lock to confirm. * tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Better describe stall check workqueue: Fix false positive stall reports
2 daysMerge tag 'cgroup-for-7.0-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix cgroup rmdir racing with dying tasks. Deferred task cgroup unlink introduced a window where cgroup.procs is empty but the cgroup is still populated, causing rmdir to fail with -EBUSY and selftest failures. Make rmdir wait for dying tasks to fully leave and fix selftests to not depend on synchronous populated updates. - Fix cpuset v1 task migration failure from empty cpusets under strict security policies. When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be migrated to an ancestor without a security_task_setscheduler() check that would block the migration. * tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Skip security check for hotplug induced v1 task migration cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach() cgroup: Fix cgroup_drain_dying() testing the wrong condition selftests/cgroup: Don't require synchronous populated update on task exit cgroup: Wait for dying tasks to leave on rmdir
2 dayscgroup/cpuset: Skip security check for hotplug induced v1 task migrationWaiman Long
When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the cpuset hotplug handler will schedule a work function to migrate tasks in that cpuset with no CPU to its ancestor to enable those tasks to continue running. If a strict security policy is in place, however, the task migration may fail when security_task_setscheduler() call in cpuset_can_attach() returns a -EACCES error. That will mean that those tasks will have no CPU to run on. The system administrators will have to explicitly intervene to either add CPUs to that cpuset or move the tasks elsewhere if they are aware of it. This problem was found by a reported test failure in the LTP's cpuset_hotplug_test.sh. Fix this problem by treating this special case as an exception to skip the setsched security check in cpuset_can_attach() when a v1 cpuset with tasks have no CPU left. With that patch applied, the cpuset_hotplug_test.sh test can be run successfully without failure. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2 dayscgroup/cpuset: Simplify setsched decision check in task iteration loop of ↵Waiman Long
cpuset_can_attach() Centralize the check required to run security_task_setscheduler() in the task iteration loop of cpuset_can_attach() outside of the loop as it has no dependency on the characteristics of the tasks themselves. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
3 dayssched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callbackTejun Heo
SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using smp_cond_load_acquire() until the target CPU's kick_sync advances. Because the irq_work runs in hardirq context, the waiting CPU cannot reschedule and its own kick_sync never advances. If multiple CPUs form a wait cycle, all CPUs deadlock. Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to force the CPU through do_pick_task_scx(), which queues a balance callback to perform the wait. The balance callback drops the rq lock and enables IRQs following the sched_core_balance() pattern, so the CPU can process IPIs while waiting. The local CPU's kick_sync is advanced on entry to do_pick_task_scx() and continuously during the wait, ensuring any CPU that starts waiting for us sees the advancement and cannot form cyclic dependencies. Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Christian Loehle <christian.loehle@arm.com> Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com>
4 daysMerge tag 'timers-urgent-2026-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix an argument order bug in the alarm timer forwarding logic, which may cause missed expirations or incorrect overrun accounting" * tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Fix argument order in alarm_timer_forward()
4 daysMerge tag 'locking-urgent-2026-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex fixes from Ingo Molnar: - Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar futex flags and potential UaF access (Peter Zijlstra) - Fix UaF between futex_key_to_node_opt() and vma_replace_policy() (Hao-Yu Yang) - Clear stale exiting pointer in futex_lock_pi() retry path, which triggered a warning (and potential misbehavior) in stress-testing (Davidlohr Bueso) * tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Clear stale exiting pointer in futex_lock_pi() retry path futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy() futex: Require sys_futex_requeue() to have identical flags
5 daysMerge tag 'trace-v7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix potential deadlock in osnoise and hotplug The interface_lock can be called by a osnoise thread and the CPU shutdown logic of osnoise can wait for this thread to finish. But cpus_read_lock() can also be taken while holding the interface_lock. This produces a circular lock dependency and can cause a deadlock. Swap the ordering of cpus_read_lock() and the interface_lock to have interface_lock taken within the cpus_read_lock() context to prevent this circular dependency. - Fix freeing of event triggers in early boot up If the same trigger is added on the kernel command line, the second one will fail to be applied and the trigger created will be freed. This calls into the deferred logic and creates a kernel thread to do the freeing. But the command line logic is called before kernel threads can be created and this leads to a NULL pointer dereference. Delay freeing event triggers until late init. * tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Drain deferred trigger frees if kthread creation fails tracing: Fix potential deadlock in cpu hotplug with osnoise
6 daysfutex: Clear stale exiting pointer in futex_lock_pi() retry pathDavidlohr Bueso
Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING *exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) *uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
6 daystracing: Drain deferred trigger frees if kthread creation failsWesley Atwell
Boot-time trigger registration can fail before the trigger-data cleanup kthread exists. Deferring those frees until late init is fine, but the post-boot fallback must still drain the deferred list if kthread creation never succeeds. Otherwise, boot-deferred nodes can accumulate on trigger_data_free_list, later frees fall back to synchronously freeing only the current object, and the older queued entries are leaked forever. To trigger this, add the following to the kernel command line: trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon The second traceon trigger will fail and be freed. This triggers a NULL pointer dereference and crashes the kernel. Keep the deferred boot-time behavior, but when kthread creation fails, drain the whole queued list synchronously. Do the same in the late-init drain path so queued entries are not stranded there either. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com Fixes: 61d445af0a7c ("tracing: Add bulk garbage collection of freeing event_trigger_data") Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
6 daysMerge tag 'sysctl-7.00-fixes-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl fix from Joel Granados: "Fix uninitialized variable error when writing to a sysctl bitmap Removed the possibility of returning an unjustified -EINVAL when writing to a sysctl bitmap" * tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: fix uninitialized variable in proc_do_large_bitmap
6 daystracing: Fix potential deadlock in cpu hotplug with osnoiseLuo Haiyang
The following sequence may leads deadlock in cpu hotplug: task1 task2 task3 ----- ----- ----- mutex_lock(&interface_lock) [CPU GOING OFFLINE] cpus_write_lock(); osnoise_cpu_die(); kthread_stop(task3); wait_for_completion(); osnoise_sleep(); mutex_lock(&interface_lock); cpus_read_lock(); [DEAD LOCK] Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
7 daysMerge tag 'pm-7.0-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two cpufreq issues, one in the core and one in the conservative governor, and two issues related to system sleep: - Restore the cpufreq core behavior changed inadvertently during the 6.19 development cycle to call cpufreq_frequency_table_cpuinfo() for cpufreq policies getting re-initialized which ensures that policy->max and policy->cpuinfo_max_freq will be valid going forward (Viresh Kumar) - Adjust the cached requested frequency in the conservative cpufreq governor on policy limits changes to prevent it from becoming stale in some cases (Viresh Kumar) - Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some code paths in which it is legitimately called without invoking pm_restrict_gfp_mask() previously (Youngjun Park) - Update snapshot_write_finalize() to take trailing zero pages into account properly which prevents user space restore from failing subsequently in some cases (Alberto Garcia)" * tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask() PM: hibernate: Drain trailing zero pages on userspace restore cpufreq: conservative: Reset requested_freq on limits change cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
8 daysMerge tag 'dma-mapping-7.0-2026-03-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "A set of fixes for DMA-mapping subsystem, which resolve false- positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)" * tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: add missing `inline` for `dma_free_attrs` mm/hmm: Indicate that HMM requires DMA coherency RDMA/umem: Tell DMA mapping that UMEM requires coherency iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set dma-mapping: Introduce DMA require coherency attribute dma-mapping: Clarify valid conditions for CPU cache line overlap dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output dma-debug: Allow multiple invocations of overlapping entries dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
8 daysfutex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()Hao-Yu Yang
During futex_key_to_node_opt() execution, vma->vm_policy is read under speculative mmap lock and RCU. Concurrently, mbind() may call vma_replace_policy() which frees the old mempolicy immediately via kmem_cache_free(). This creates a race where __futex_key_to_node() dereferences a freed mempolicy pointer, causing a use-after-free read of mpol->mode. [ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349) [ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87 [ 151.415969] Call Trace: [ 151.416732] __asan_load2 (mm/kasan/generic.c:271) [ 151.416777] __futex_key_to_node (kernel/futex/core.c:349) [ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593) Fix by adding rcu to __mpol_put(). Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL") Reported-by: Hao-Yu Yang <naup96721@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hao-Yu Yang <naup96721@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
8 daysfutex: Require sys_futex_requeue() to have identical flagsPeter Zijlstra
Nicholas reported that his LLM found it was possible to create a UaF when sys_futex_requeue() is used with different flags. The initial motivation for allowing different flags was the variable sized futex, but since that hasn't been merged (yet), simply mandate the flags are identical, as is the case for the old style sys_futex() requeue operations. Fixes: 0f4b5f972216 ("futex: Add sys_futex_requeue()") Reported-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
8 dayssysctl: fix uninitialized variable in proc_do_large_bitmapMarc Buerg
proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: 9f977fb7ae9d ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>
8 daysMerge tag 'rcu-fixes.v7.0-20260325a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU fixes from Boqun Feng: "Fix a regression introduced by commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with preemption disabled or scheduler locks held, so call_srcu() must work in all such contexts. Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding scheduler lock acquisition in call_srcu() by deferring to an irq_work (similar to call_rcu_tasks_generic()), for both tree SRCU and tiny SRCU. Also fix a follow-on lockdep splat caused by srcu_node allocation under the newly introduced raw spinlock by deferring the allocation to grace-period worker context" * tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: srcu: Use irq_work to start GP in tiny SRCU rcu: Use an intermediate irq_work to start process_srcu() srcu: Push srcu_node allocation to GP when non-preemptible srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
8 dayscgroup: Fix cgroup_drain_dying() testing the wrong conditionTejun Heo
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets, nr_populated_domain_children and nr_populated_threaded_children, but cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether there are children is cgroup_destroy_locked()'s concern. This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had no direct tasks but had a populated child, cgroup_drain_dying() would enter its wait loop because cgroup_is_populated() was true from nr_populated_domain_children. The task iterator found nothing to wait for, yet the populated state never cleared because it was driven by live tasks in the child cgroup. Fix it by using cgroup_has_tasks() which only tests nr_populated_csets. v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian). v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
8 dayssrcu: Use irq_work to start GP in tiny SRCUJoel Fernandes
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(), which acquires the workqueue pool->lock. This causes a lockdep splat when call_srcu() is called with a scheduler lock held, due to: call_srcu() [holding pi_lock] srcu_gp_start_if_needed() schedule_work() -> pool->lock workqueue_init() / create_worker() [holding pool->lock] wake_up_process() -> try_to_wake_up() -> pi_lock Also add irq_work_sync() to cleanup_srcu_struct() to prevent a use-after-free if a queued irq_work fires after cleanup begins. Tested with rcutorture SRCU-T and no lockdep warnings. [ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work to start process_srcu()" ] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>
8 daysrcu: Use an intermediate irq_work to start process_srcu()Boqun Feng
Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can happen basically everywhere (including where a scheduler lock is held), call_srcu() now needs to avoid acquiring scheduler lock because otherwise it could cause deadlock [1]. Fix this by following what the previous RCU Tasks Trace did: using an irq_work to delay the queuing of the work to start process_srcu(). [boqun: Apply Joel's feedback] [boqun: Apply Andrea's test feedback] Reported-by: Andrea Righi <arighi@nvidia.com> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] Suggested-by: Zqiang <qiang.zhang@linux.dev> Tested-by: Andrea Righi <arighi@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun@kernel.org>
8 dayssrcu: Push srcu_node allocation to GP when non-preemptiblePaul E. McKenney
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot parameters specify initialization-time allocation of the srcu_node tree for statically allocated srcu_struct structures (for example, in DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime), init_srcu_struct_nodes() will attempt to dynamically allocate this tree at the first run-time update-side use of this srcu_struct structure, but while holding a raw spinlock. Because the memory allocator can acquire non-raw spinlocks, this can result in lockdep splats. This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used when the first run-time update-side use of this srcu_struct structure happens before srcu_init() is called. The actual allocation then takes place from workqueue context at the ends of upcoming SRCU grace periods. [boqun: Adjust the sha1 of the Fixes tag] Fixes: 175b45ed343a ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>
8 dayssrcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()Paul E. McKenney
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire to avoid unnecessary real-time latency and the absence of any reason to use raw spinlocks. However, the recent use of SRCU in tracing as the underlying implementation of RCU Tasks Trace means that call_srcu() is invoked from preemption-disabled regions of code, which in turn requires that any locks acquired by call_srcu() or its callees must be raw spinlocks. This commit therefore converts SRCU's spinlocks to raw spinlocks. [boqun: Add Fixes tag] Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
9 daysworkqueue: Better describe stall checkPetr Mladek
Try to be more explicit why the workqueue watchdog does not take pool->lock by default. Spin locks are full memory barriers which delay anything. Obviously, they would primary delay operations on the related worker pools. Explain why it is enough to prevent the false positive by re-checking the timestamp under the pool->lock. Finally, make it clear what would be the alternative solution in __queue_work() which is a hotter path. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
9 daysalarmtimer: Fix argument order in alarm_timer_forward()Zhan Xusheng
alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
9 dayscgroup: Wait for dying tasks to leave on rmdirTejun Heo
a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING tasks from cgroup.procs so that systemd doesn't see tasks that have already been reaped via waitpid(). However, the populated counter (nr_populated_csets) is only decremented when the task later passes through cgroup_task_dead() in finish_task_switch(). This means cgroup.procs can appear empty while the cgroup is still populated, causing rmdir to fail with -EBUSY. Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the cgroup is populated but all remaining tasks have PF_EXITING set (the task iterator returns none due to the existing filter), wait for a kick from cgroup_task_dead() and retry. The wait is brief as tasks are removed from the cgroup's css_set between PF_EXITING assertion in do_exit() and cgroup_task_dead() in finish_task_switch(). v2: cgroup_is_populated() true to false transition happens under css_set_lock not cgroup_mutex, so retest under css_set_lock before sleeping to avoid missed wakeups (Sebastian). Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Michal Koutny <mkoutny@suse.com> Cc: cgroups@vger.kernel.org
11 daysPM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()Youngjun Park
Commit 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking") introduced refcount-based GFP mask management that warns when pm_restore_gfp_mask() is called with saved_gfp_count == 0. Some hibernation paths call pm_restore_gfp_mask() defensively where the GFP mask may or may not be restricted depending on the execution path. For example, the uswsusp interface invokes it in SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release(). Before the stacking change this was a silent no-op; it now triggers a spurious WARNING. Remove the WARN_ON() wrapper from the !saved_gfp_count check while retaining the check itself, so that defensive calls remain harmless without producing false warnings. Fixes: 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking") Signed-off-by: Youngjun Park <youngjun.park@lge.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
11 daysPM: hibernate: Drain trailing zero pages on userspace restoreAlberto Garcia
Commit 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") added an optimization to skip zero-filled pages in the hibernation image. On restore, zero pages are handled internally by snapshot_write_next() in a loop that processes them without returning to the caller. With the userspace restore interface, writing the last non-zero page to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At this point there are no more calls to snapshot_write_next() so any trailing zero pages are not processed, snapshot_image_loaded() fails because handle->cur is smaller than expected, the ioctl returns -EPERM and the image is not restored. The in-kernel restore path is not affected by this because the loop in load_image() in swap.c calls snapshot_write_next() until it returns 0. It is this final call that drains any trailing zero pages. Fixed by calling snapshot_write_next() in snapshot_write_finalize(), giving the kernel the chance to drain any trailing zero pages. Fixes: 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") Signed-off-by: Alberto Garcia <berto@igalia.com> Acked-by: Brian Geffon <bgeffon@google.com> Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
11 daysMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix how linked registers track zero extension of subregisters (Daniel Borkmann) - Fix unsound scalar fork for OR instructions (Daniel Wade) - Fix exception exit lock check for subprogs (Ihor Solodrai) - Fix undefined behavior in interpreter for SDIV/SMOD instructions (Jenny Guanni Qu) - Release module's BTF when module is unloaded (Kumar Kartikeya Dwivedi) - Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar) - Reset register ID for END instructions to prevent incorrect value tracking (Yazhou Tang) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN selftests/bpf: Add tests for bpf_throw lock leak from subprogs bpf: Fix exception exit lock checking for subprogs bpf: Release module BTF IDR before module unload selftests/bpf: Fix pkg-config call on static builds bpf: Fix constant blinding for PROBE_MEM32 stores selftests/bpf: Add test for BPF_END register ID reset bpf: Reset register ID for BPF_END value tracking
11 daysMerge tag 'trace-v7.0-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Revert "tracing: Remove pid in task_rename tracing output" A change was made to remove the pid field from the task_rename event because it was thought that it was always done for the current task and recording the pid would be redundant. This turned out to be incorrect and there are a few corner case where this is not true and caused some regressions in tooling. - Fix the reading from user space for migration The reading of user space uses a seq lock type of logic where it uses a per-cpu temporary buffer and disables migration, then enables preemption, does the copy from user space, disables preemption, enables migration and checks if there was any schedule switches while preemption was enabled. If there was a context switch, then it is considered that the per-cpu buffer could be corrupted and it tries again. There's a protection check that tests if it takes a hundred tries, it issues a warning and exits out to prevent a live lock. This was triggered because the task was selected by the load balancer to be migrated to another CPU, every time preemption is enabled the migration task would schedule in try to migrate the task but can't because migration is disabled and let it run again. This caused the scheduler to schedule out the task every time it enabled preemption and made the loop never exit (until the 100 iteration test triggered). Fix this by enabling and disabling preemption and keeping migration enabled if the reading from user space needs to be done again. This will let the migration thread migrate the task and the copy from user space will likely pass on the next iteration. - Fix trace_marker copy option freeing The "copy_trace_marker" option allows a tracing instance to get a copy of a write to the trace_marker file of the top level instance. This is managed by a link list protected by RCU. When an instance is removed, a check is made if the option is set, and if so synchronized_rcu() is called. The problem is that an iteration is made to reset all the flags to what they were when the instance was created (to perform clean ups) was done before the check of the copy_trace_marker option and that option was cleared, so the synchronize_rcu() was never called. Move the clearing of all the flags after the check of copy_trace_marker to do synchronize_rcu() so that the option is still set if it was before and the synchronization is performed. - Fix entries setting when validating the persistent ring buffer When validating the persistent ring buffer on boot up, the number of events per sub-buffer is added to the sub-buffer meta page. The validator was updating cpu_buffer->head_page (the first sub-buffer of the per-cpu buffer) and not the "head_page" variable that was iterating the sub-buffers. This was causing the first sub-buffer to be assigned the entries for each sub-buffer and not the sub-buffer that was supposed to be updated. - Use "hash" value to update the direct callers When updating the ftrace direct callers, it assigned a temporary callback to all the callback functions of the ftrace ops and not just the functions represented by the passed in hash. This causes an unnecessary slow down of the functions of the ftrace_ops that is not being modified. Only update the functions that are going to be modified to call the ftrace loop function so that the update can be made on those functions. * tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod ring-buffer: Fix to update per-subbuf entries of persistent ring buffer tracing: Fix trace_marker copy link list updates tracing: Fix failure to read user space from system call trace events tracing: Revert "tracing: Remove pid in task_rename tracing output"
11 daysMerge tag 'perf-urgent-2026-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: - Fix a PMU driver crash on AMD EPYC systems, caused by a race condition in x86_pmu_enable() - Fix a possible counter-initialization bug in x86_pmu_enable() - Fix a counter inheritance bug in inherit_event() and __perf_event_read() - Fix an Intel PMU driver branch constraints handling bug found by UBSAN - Fix the Intel PMU driver's new Off-Module Response (OMR) support code for Diamond Rapids / Nova lake, to fix a snoop information parsing bug * tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix OMR snoop information parsing issues perf/x86/intel: Add missing branch counters constraint apply perf: Make sure to use pmu_ctx->pmu for groups x86/perf: Make sure to program the counter value for stopped events on migration perf/x86: Move event pointer setup earlier in x86_pmu_enable()
12 daysworkqueue: Fix false positive stall reportsSong Liu
On weakly ordered architectures (e.g., arm64), the lockless check in wq_watchdog_timer_fn() can observe a reordering between the worklist insertion and the last_progress_ts update. Specifically, the watchdog can see a non-empty worklist (from a list_add) while reading a stale last_progress_ts value, causing a false positive stall report. This was confirmed by reading pool->last_progress_ts again after holding pool->lock in wq_watchdog_timer_fn(): workqueue watchdog: pool 7 false positive detected! lockless_ts=4784580465 locked_ts=4785033728 diff=453263ms worklist_empty=0 To avoid slowing down the hot path (queue_work, etc.), recheck last_progress_ts with pool->lock held. This will eliminate the false positive with minimal overhead. Remove two extra empty lines in wq_watchdog_timer_fn() as we are on it. Fixes: 82607adcf9cd ("workqueue: implement lockup detector") Cc: stable@vger.kernel.org # v4.5+ Assisted-by: claude-code:claude-opus-4-6 Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
12 dayssched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()Cheng-Yang Chou
In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed with cpu_to_node(), while node (for prev_cpu) was computed with scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled, idle_cpumask(waker_node) is called with a real node ID even though per-node idle tracking is disabled, resulting in undefined behavior. Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring both variables are computed consistently. Fixes: 48849271e6611 ("sched_ext: idle: Per-node idle cpumasks") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
12 daysftrace: Use hash argument for tmp_ops in update_ftrace_direct_modJiri Olsa
The modify logic registers temporary ftrace_ops object (tmp_ops) to trigger the slow path for all direct callers to be able to safely modify attached addresses. At the moment we use ops->func_hash for tmp_ops filter, which represents all the systems attachments. It's faster to use just the passed hash filter, which contains only the modified sites and is always a subset of the ops->func_hash. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Menglong Dong <menglong8.dong@gmail.com> Cc: Song Liu <song@kernel.org> Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org Fixes: e93672f770d7 ("ftrace: Add update_ftrace_direct_mod function") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
12 daysring-buffer: Fix to update per-subbuf entries of persistent ring bufferMasami Hiramatsu (Google)
Since the validation loop in rb_meta_validate_events() updates the same cpu_buffer->head_page->entries, the other subbuf entries are not updated. Fix to use head_page to update the entries field, since it is the cursor in this loop. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Link: https://patch.msgid.link/177391153882.193994.17158784065013676533.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
12 daystracing: Fix trace_marker copy link list updatesSteven Rostedt
When the "copy_trace_marker" option is enabled for an instance, anything written into /sys/kernel/tracing/trace_marker is also copied into that instances buffer. When the option is set, that instance's trace_array descriptor is added to the marker_copies link list. This list is protected by RCU, as all iterations uses an RCU protected list traversal. When the instance is deleted, all the flags that were enabled are cleared. This also clears the copy_trace_marker flag and removes the trace_array descriptor from the list. The issue is after the flags are called, a direct call to update_marker_trace() is performed to clear the flag. This function returns true if the state of the flag changed and false otherwise. If it returns true here, synchronize_rcu() is called to make sure all readers see that its removed from the list. But since the flag was already cleared, the state does not change and the synchronization is never called, leaving a possible UAF bug. Move the clearing of all flags below the updating of the copy_trace_marker option which then makes sure the synchronization is performed. Also use the flag for checking the state in update_marker_trace() instead of looking at if the list is empty. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home Fixes: 7b382efd5e8a ("tracing: Allow the top level trace_marker to write into another instances") Reported-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
12 daystracing: Fix failure to read user space from system call trace eventsSteven Rostedt
The system call trace events call trace_user_fault_read() to read the user space part of some system calls. This is done by grabbing a per-cpu buffer, disabling migration, enabling preemption, calling copy_from_user(), disabling preemption, enabling migration and checking if the task was preempted while preemption was enabled. If it was, the buffer is considered corrupted and it tries again. There's a safety mechanism that will fail out of this loop if it fails 100 times (with a warning). That warning message was triggered in some pi_futex stress tests. Enabling the sched_switch trace event and traceoff_on_warning, showed the problem: pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0 migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95 What happened was the task 1375 was flagged to be migrated. When preemption was enabled, the migration thread woke up to migrate that task, but failed because migration for that task was disabled. This caused the loop to fail to exit because the task scheduled out while trying to read user space. Every time the task enabled preemption the migration thread would schedule in, try to migrate the task, fail and let the task continue. But because the loop would only enable preemption with migration disabled, it would always fail because each time it enabled preemption to read user space, the migration thread would try to migrate it. To solve this, when the loop fails to read user space without being scheduled out, enabled and disable preemption with migration enabled. This will allow the migration task to successfully migrate the task and the next loop should succeed to read user space without being scheduled out. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home Fixes: 64cf7d058a005 ("tracing: Have trace_marker use per-cpu data to read user space") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
12 daysbpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagationDaniel Borkmann
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is checked on known_reg (the register narrowed by a conditional branch) instead of reg (the linked target register created by an alu32 operation). Example case with reg: 1. r6 = bpf_get_prandom_u32() 2. r7 = r6 (linked, same id) 3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU) 4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF]) 5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64() 6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4] Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64() is never called on alu32-derived linked registers. This causes the verifier to track incorrect 64-bit bounds, while the CPU correctly zero-extends the 32-bit result. The code checking known_reg->id was correct however (see scalars_alu32_wrap selftest case), but the real fix needs to handle both directions - zext propagation should be done when either register has BPF_ADD_CONST32, since the linked relationship involves a 32-bit operation regardless of which side has the flag. Example case with known_reg (exercised also by scalars_alu32_wrap): 1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32) 2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't) Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32. Moreover, sync_linked_regs() also has a soundness issue when two linked registers used different ALU widths: one with BPF_ADD_CONST32 and the other with BPF_ADD_CONST64. The delta relationship between linked registers assumes the same arithmetic width though. When one register went through alu32 (CPU zero-extends the 32-bit result) and the other went through alu64 (no zero-extension), the propagation produces incorrect bounds. Example: r6 = bpf_get_prandom_u32() // fully unknown if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX] r7 = r6 w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32 r8 = r6 r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64 if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF] At the branch on r7, sync_linked_regs() runs with known_reg=r7 (BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path computes: r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000 Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the CPU does NOT zero-extend it. The actual CPU value of r8 is 0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates r8's 64-bit bounds, which is a soundness violation. Fix sync_linked_regs() by skipping propagation when the two registers have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64). Lastly, fix regsafe() used for path pruning: the existing checks used "& BPF_ADD_CONST" to test for offset linkage, which treated BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent. Fixes: 7a433e519364 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking") Reported-by: Jenny Guanni Qu <qguanni@gmail.com> Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_ORDaniel Wade
maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the source operand is a constant. When dst has signed range [-1, 0], it forks the verifier state: the pushed path gets dst = 0, the current path gets dst = -1. For BPF_AND this is correct: 0 & K == 0. For BPF_OR this is wrong: 0 | K == K, not 0. The pushed path therefore tracks dst as 0 when the runtime value is K, producing an exploitable verifier/runtime divergence that allows out-of-bounds map access. Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to push_stack(), so the pushed path re-executes the ALU instruction with dst = 0 and naturally computes the correct result for any opcode. Fixes: bffacdb80b93 ("bpf: Recognize special arithmetic shift in the verifier") Signed-off-by: Daniel Wade <danjwade95@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Fix undefined behavior in interpreter sdiv/smod for INT_MINJenny Guanni Qu
The BPF interpreter's signed 32-bit division and modulo handlers use the kernel abs() macro on s32 operands. The abs() macro documentation (include/linux/math.h) explicitly states the result is undefined when the input is the type minimum. When DST contains S32_MIN (0x80000000), abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged on arm64/x86. This value is then sign-extended to u64 as 0xFFFFFFFF80000000, causing do_div() to compute the wrong result. The verifier's abstract interpretation (scalar32_min_max_sdiv) computes the mathematically correct result for range tracking, creating a verifier/interpreter mismatch that can be exploited for out-of-bounds map value access. Introduce abs_s32() which handles S32_MIN correctly by casting to u32 before negating, avoiding signed overflow entirely. Replace all 8 abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers. s32 is the only affected case -- the s64 division/modulo handlers do not use abs(). Fixes: ec0e2da95f72 ("bpf: Support new signed div/mod instructions.") Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Fix exception exit lock checking for subprogsIhor Solodrai
process_bpf_exit_full() passes check_lock = !curframe to check_resource_leak(), which is false in cases when bpf_throw() is called from a static subprog. This makes check_resource_leak() to skip validation of active_rcu_locks, active_preempt_locks, and active_irq_id on exception exits from subprogs. At runtime bpf_throw() unwinds the stack via ORC without releasing any user-acquired locks, which may cause various issues as the result. Fix by setting check_lock = true for exception exits regardless of curframe, since exceptions bypass all intermediate frame cleanup. Update the error message prefix to "bpf_throw" for exception exits to distinguish them from normal BPF_EXIT. Fix reject_subprog_with_rcu_read_lock test which was previously passing for the wrong reason. Test program returned directly from the subprog call without closing the RCU section, so the error was triggered by the unclosed RCU lock on normal exit, not by bpf_throw. Update __msg annotations for affected tests to match the new "bpf_throw" error prefix. The spin_lock case is not affected because they are already checked [1] at the call site in do_check_insn() before bpf_throw can run. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098 Assisted-by: Claude:claude-opus-4-6 Fixes: f18b03fabaa9 ("bpf: Implement BPF exceptions") Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
14 daysdma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is setLeon Romanovsky
DMA_ATTR_REQUIRE_COHERENT indicates that SWIOTLB must not be used. Ensure the SWIOTLB path is declined whenever the DMA direct path is selected. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-5-1dde90a7f08b@nvidia.com
14 daysdma-mapping: Introduce DMA require coherency attributeLeon Romanovsky
The mapping buffers which carry this attribute require DMA coherent system. This means that they can't take SWIOTLB path, can perform CPU cache overlap and doesn't perform cache flushing. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-4-1dde90a7f08b@nvidia.com
14 daysdma-mapping: Clarify valid conditions for CPU cache line overlapLeon Romanovsky
Rename the DMA_ATTR_CPU_CACHE_CLEAN attribute to better reflect that it is debugging aid to inform DMA core code that CPU cache line overlaps are allowed, and refine the documentation describing its use. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-3-1dde90a7f08b@nvidia.com
14 daysdma-debug: Allow multiple invocations of overlapping entriesLeon Romanovsky
Repeated DMA mappings with DMA_ATTR_CPU_CACHE_CLEAN trigger the following splat. This prevents using the attribute in cases where a DMA region is shared and reused more than seven times. ------------[ cut here ]------------ DMA-API: exceeded 7 overlapping mappings of cacheline 0x000000000438c440 WARNING: kernel/dma/debug.c:467 at add_dma_entry+0x219/0x280, CPU#4: ibv_rc_pingpong/1644 Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_fwctl zram zsmalloc mlx5_ib fuse rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_core ib_core CPU: 4 UID: 2733 PID: 1644 Comm: ibv_rc_pingpong Not tainted 6.19.0+ #129 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:add_dma_entry+0x221/0x280 Code: c0 0f 84 f2 fe ff ff 83 e8 01 89 05 6d 99 11 01 e9 e4 fe ff ff 0f 8e 1f ff ff ff 48 8d 3d 07 ef 2d 01 be 07 00 00 00 48 89 e2 <67> 48 0f b9 3a e9 06 ff ff ff 48 c7 c7 98 05 2b 82 c6 05 72 92 28 RSP: 0018:ff1100010e657970 EFLAGS: 00010002 RAX: 0000000000000007 RBX: ff1100010234eb00 RCX: 0000000000000000 RDX: ff1100010e657970 RSI: 0000000000000007 RDI: ffffffff82678660 RBP: 000000000438c440 R08: 0000000000000228 R09: 0000000000000000 R10: 00000000000001be R11: 000000000000089d R12: 0000000000000800 R13: 00000000ffffffef R14: 0000000000000202 R15: ff1100010234eb00 FS: 00007fb15f3f6740(0000) GS:ff110008dcc19000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb15f32d3a0 CR3: 0000000116f59001 CR4: 0000000000373eb0 Call Trace: <TASK> debug_dma_map_sg+0x1b4/0x390 __dma_map_sg_attrs+0x6d/0x1a0 dma_map_sgtable+0x19/0x30 ib_umem_get+0x284/0x3b0 [ib_uverbs] mlx5_ib_reg_user_mr+0x68/0x2a0 [mlx5_ib] ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0x130 [ib_uverbs] ib_uverbs_cmd_verbs+0xa0b/0xae0 [ib_uverbs] ? ib_uverbs_handler_UVERBS_METHOD_QUERY_PORT_SPEED+0xe0/0xe0 [ib_uverbs] ? mmap_region+0x7a/0xb0 ? do_mmap+0x3b8/0x5c0 ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs] __x64_sys_ioctl+0x14f/0x8b0 ? ksys_mmap_pgoff+0xc5/0x190 do_syscall_64+0x8c/0xbf0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fb15f5e4eed Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:00007ffe09a5c540 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffe09a5c5d0 RCX: 00007fb15f5e4eed RDX: 00007ffe09a5c5f0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffe09a5c590 R08: 0000000000000028 R09: 00007ffe09a5c794 R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffe09a5c794 R13: 000000000000000c R14: 0000000025a49170 R15: 000000000000000c </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 61868dc55a11 ("dma-mapping: add DMA_ATTR_CPU_CACHE_CLEAN") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-1-1dde90a7f08b@nvidia.com
2026-03-19Merge tag 'pm-7.0-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix an idle loop issue exposed by recent changes and a race condition related to device removal in the runtime PM core code: - Consolidate the handling of two special cases in the idle loop that occur when only one CPU idle state is present (Rafael Wysocki) - Fix a race condition related to device removal in the runtime PM core code that may cause a stale device object pointer to be dereferenced (Bart Van Assche)" * tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: runtime: Fix a race condition related to device removal sched: idle: Consolidate the handling of two special cases
2026-03-18bpf: Release module BTF IDR before module unloadKumar Kartikeya Dwivedi
Gregory reported in [0] that the global_map_resize test when run in repeatedly ends up failing during program load. This stems from the fact that BTF reference has not dropped to zero after the previous run's module is unloaded, and the older module's BTF is still discoverable and visible. Later, in libbpf, load_module_btfs() will find the ID for this stale BTF, open its fd, and then it will be used during program load where later steps taking module reference using btf_try_get_module() fail since the underlying module for the BTF is gone. Logically, once a module is unloaded, it's associated BTF artifacts should become hidden. The BTF object inside the kernel may still remain alive as long its reference counts are alive, but it should no longer be discoverable. To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case for the module unload to free the BTF associated IDR entry, and disable its discovery once module unload returns to user space. If a race happens during unload, the outcome is non-deterministic anyway. However, user space should be able to rely on the guarantee that once it has synchronously established a successful module unload, no more stale artifacts associated with this module can be obtained subsequently. Note that we must be careful to not invoke btf_free_id() in btf_put() when btf_is_module() is true now. There could be a window where the module unload drops a non-terminal reference, frees the IDR, but the same ID gets reused and the second unconditional btf_free_id() ends up releasing an unrelated entry. To avoid a special case for btf_is_module() case, set btf->id to zero to make btf_free_id() idempotent, such that we can unconditionally invoke it from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is an invalid IDR, the idr_remove() should be a noop. Note that we can be sure that by the time we reach final btf_put() for btf_is_module() case, the btf_free_id() is already done, since the module itself holds the BTF reference, and it will call this function for the BTF before dropping its own reference. [0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com Fixes: 36e68442d1af ("bpf: Load and verify kernel module BTFs") Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Reported-by: Gregory Bell <grbell@redhat.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-16sched: idle: Consolidate the handling of two special casesRafael J. Wysocki
There are two special cases in the idle loop that are handled inconsistently even though they are analogous. The first one is when a cpuidle driver is absent and the default CPU idle time power management implemented by the architecture code is used. In that case, the scheduler tick is stopped every time before invoking default_idle_call(). The second one is when a cpuidle driver is present, but there is only one idle state in its table. In that case, the scheduler tick is never stopped at all. Since each of these approaches has its drawbacks, reconcile them with the help of one simple heuristic. Namely, stop the tick if the CPU has been woken up by it in the previous iteration of the idle loop, or let it tick otherwise. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Fixes: ed98c3491998 ("sched: idle: Do not stop the tick before cpuidle_idle_call()") [ rjw: Added Fixes tag, changelog edits ] Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-03-16Merge tag 'mm-hotfixes-stable-2026-03-16-12-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "6 hotfixes. 4 are cc:stable. 3 are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: update email address for Ignat Korchagin mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP mm/rmap: fix incorrect pte restoration for lazyfree folios mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd() build_bug.h: correct function parameters names in kernel-doc crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying