summaryrefslogtreecommitdiff
path: root/drivers/nvme
AgeCommit message (Collapse)Author
2026-03-13Merge tag 'block-7.0-20260312' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Fix nvme-pci IRQ race and slab-out-of-bounds access - Fix recursive workqueue locking for target async events - Various cleanups - Fix a potential NULL pointer dereference in ublk on size setting - ublk automatic partition scanning fix - Two s390 dasd fixes * tag 'block-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: nvme: Annotate struct nvme_dhchap_key with __counted_by nvme-core: do not pass empty queue_limits to blk_mq_alloc_queue() nvme-pci: Fix race bug in nvme_poll_irqdisable() nvmet: move async event work off nvmet-wq nvme-pci: Fix slab-out-of-bounds in nvme_dbbuf_set s390/dasd: Copy detected format information to secondary device s390/dasd: Move quiesce state with pprc swap ublk: don't clear GD_SUPPRESS_PART_SCAN for unprivileged daemons ublk: fix NULL pointer dereference in ublk_ctrl_set_size()
2026-03-10nvme-core: do not pass empty queue_limits to blk_mq_alloc_queue()Maurizio Lombardi
In nvme_alloc_admin_tag_set(), an empty queue_limits struct is currently allocated on the stack and passed by reference to blk_mq_alloc_queue(). This is redundant because blk_mq_alloc_queue() already handles a NULL limits pointer by internally substituting it with a default empty queue_limits struct. Remove the unnecessary local variable and pass a NULL value. Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-10nvme-pci: Fix race bug in nvme_poll_irqdisable()Sungwoo Kim
In the following scenario, pdev can be disabled between (1) and (3) by (2). This sets pdev->msix_enabled = 0. Then, pci_irq_vector() will return MSI-X IRQ(>15) for (1) whereas return INTx IRQ(<=15) for (2). This causes IRQ warning because it tries to enable INTx IRQ that has never been disabled before. To fix this, save IRQ number into a local variable and ensure disable_irq() and enable_irq() operate on the same IRQ number. Even if pci_free_irq_vectors() frees the IRQ concurrently, disable_irq() and enable_irq() on a stale IRQ number is still valid and safe, and the depth accounting reamins balanced. task 1: nvme_poll_irqdisable() disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector)) ...(1) enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector)) ...(3) task 2: nvme_reset_work() nvme_dev_disable() pdev->msix_enable = 0; ...(2) crash log: ------------[ cut here ]------------ Unbalanced enable for IRQ 10 WARNING: kernel/irq/manage.c:753 at __enable_irq+0x102/0x190 kernel/irq/manage.c:753, CPU#1: kworker/1:0H/26 Modules linked in: CPU: 1 UID: 0 PID: 26 Comm: kworker/1:0H Not tainted 6.19.0-dirty #9 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: kblockd blk_mq_timeout_work RIP: 0010:__enable_irq+0x107/0x190 kernel/irq/manage.c:753 Code: ff df 48 89 fa 48 c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 79 48 8d 3d 2e 7a 3f 05 41 8b 74 24 2c <67> 48 0f b9 3a e8 ef b9 21 00 5b 41 5c 5d e9 46 54 66 03 e8 e1 b9 RSP: 0018:ffffc900001bf550 EFLAGS: 00010046 RAX: 0000000000000007 RBX: 0000000000000000 RCX: ffffffffb20c0e90 RDX: 0000000000000000 RSI: 000000000000000a RDI: ffffffffb74b88f0 RBP: ffffc900001bf560 R08: ffff88800197cf00 R09: 0000000000000001 R10: 0000000000000003 R11: 0000000000000003 R12: ffff8880012a6000 R13: 1ffff92000037eae R14: 000000000000000a R15: 0000000000000293 FS: 0000000000000000(0000) GS:ffff8880b49f7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555da4a25fa8 CR3: 00000000208e8000 CR4: 00000000000006f0 Call Trace: <TASK> enable_irq+0x121/0x1e0 kernel/irq/manage.c:797 nvme_poll_irqdisable+0x162/0x1c0 drivers/nvme/host/pci.c:1494 nvme_timeout+0x965/0x14b0 drivers/nvme/host/pci.c:1744 blk_mq_rq_timed_out block/blk-mq.c:1653 [inline] blk_mq_handle_expired+0x227/0x2d0 block/blk-mq.c:1721 bt_iter+0x2fc/0x3a0 block/blk-mq-tag.c:292 __sbitmap_for_each_set include/linux/sbitmap.h:269 [inline] sbitmap_for_each_set include/linux/sbitmap.h:290 [inline] bt_for_each block/blk-mq-tag.c:324 [inline] blk_mq_queue_tag_busy_iter+0x969/0x1e80 block/blk-mq-tag.c:536 blk_mq_timeout_work+0x627/0x870 block/blk-mq.c:1763 process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x65c/0xe60 kernel/workqueue.c:3421 kthread+0x41a/0x930 kernel/kthread.c:463 ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> irq event stamp: 74478 hardirqs last enabled at (74477): [<ffffffffb5720a9c>] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:159 [inline] hardirqs last enabled at (74477): [<ffffffffb5720a9c>] _raw_spin_unlock_irq+0x2c/0x60 kernel/locking/spinlock.c:202 hardirqs last disabled at (74478): [<ffffffffb57207b5>] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline] hardirqs last disabled at (74478): [<ffffffffb57207b5>] _raw_spin_lock_irqsave+0x85/0xa0 kernel/locking/spinlock.c:162 softirqs last enabled at (74304): [<ffffffffb1e9466c>] __do_softirq kernel/softirq.c:656 [inline] softirqs last enabled at (74304): [<ffffffffb1e9466c>] invoke_softirq kernel/softirq.c:496 [inline] softirqs last enabled at (74304): [<ffffffffb1e9466c>] __irq_exit_rcu+0xdc/0x120 kernel/softirq.c:723 softirqs last disabled at (74287): [<ffffffffb1e9466c>] __do_softirq kernel/softirq.c:656 [inline] softirqs last disabled at (74287): [<ffffffffb1e9466c>] invoke_softirq kernel/softirq.c:496 [inline] softirqs last disabled at (74287): [<ffffffffb1e9466c>] __irq_exit_rcu+0xdc/0x120 kernel/softirq.c:723 ---[ end trace 0000000000000000 ]--- Fixes: fa059b856a59 (nvme-pci: Simplify nvme_poll_irqdisable) Acked-by: Chao Shi <cshi008@fiu.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Acked-by: Dave Tian <daveti@purdue.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-10nvmet: move async event work off nvmet-wqChaitanya Kulkarni
For target nvmet_ctrl_free() flushes ctrl->async_event_work. If nvmet_ctrl_free() runs on nvmet-wq, the flush re-enters workqueue completion for the same worker:- A. Async event work queued on nvmet-wq (prior to disconnect): nvmet_execute_async_event() queue_work(nvmet_wq, &ctrl->async_event_work) nvmet_add_async_event() queue_work(nvmet_wq, &ctrl->async_event_work) B. Full pre-work chain (RDMA CM path): nvmet_rdma_cm_handler() nvmet_rdma_queue_disconnect() __nvmet_rdma_queue_disconnect() queue_work(nvmet_wq, &queue->release_work) process_one_work() lock((wq_completion)nvmet-wq) <--------- 1st nvmet_rdma_release_queue_work() C. Recursive path (same worker): nvmet_rdma_release_queue_work() nvmet_rdma_free_queue() nvmet_sq_destroy() nvmet_ctrl_put() nvmet_ctrl_free() flush_work(&ctrl->async_event_work) __flush_work() touch_wq_lockdep_map() lock((wq_completion)nvmet-wq) <--------- 2nd Lockdep splat: ============================================ WARNING: possible recursive locking detected 6.19.0-rc3nvme+ #14 Tainted: G N -------------------------------------------- kworker/u192:42/44933 is trying to acquire lock: ffff888118a00948 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90 but task is already holding lock: ffff888118a00948 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x53e/0x660 3 locks held by kworker/u192:42/44933: #0: ffff888118a00948 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x53e/0x660 #1: ffffc9000e6cbe28 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x660 #2: ffffffff82d4db60 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530 Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma] Call Trace: __flush_work+0x268/0x530 nvmet_ctrl_free+0x140/0x310 [nvmet] nvmet_cq_put+0x74/0x90 [nvmet] nvmet_rdma_free_queue+0x23/0xe0 [nvmet_rdma] nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma] process_one_work+0x206/0x660 worker_thread+0x184/0x320 kthread+0x10c/0x240 ret_from_fork+0x319/0x390 Move async event work to a dedicated nvmet-aen-wq to avoid reentrant flush on nvmet-wq. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-10nvme-pci: Fix slab-out-of-bounds in nvme_dbbuf_setSungwoo Kim
dev->online_queues is a count incremented in nvme_init_queue. Thus, valid indices are 0 through dev->online_queues − 1. This patch fixes the loop condition to ensure the index stays within the valid range. Index 0 is excluded because it is the admin queue. KASAN splat: ================================================================== BUG: KASAN: slab-out-of-bounds in nvme_dbbuf_free drivers/nvme/host/pci.c:377 [inline] BUG: KASAN: slab-out-of-bounds in nvme_dbbuf_set+0x39c/0x400 drivers/nvme/host/pci.c:404 Read of size 2 at addr ffff88800592a574 by task kworker/u8:5/74 CPU: 0 UID: 0 PID: 74 Comm: kworker/u8:5 Not tainted 6.19.0-dirty #10 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: nvme-reset-wq nvme_reset_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xea/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xce/0x5d0 mm/kasan/report.c:482 kasan_report+0xdc/0x110 mm/kasan/report.c:595 __asan_report_load2_noabort+0x18/0x20 mm/kasan/report_generic.c:379 nvme_dbbuf_free drivers/nvme/host/pci.c:377 [inline] nvme_dbbuf_set+0x39c/0x400 drivers/nvme/host/pci.c:404 nvme_reset_work+0x36b/0x8c0 drivers/nvme/host/pci.c:3252 process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x65c/0xe60 kernel/workqueue.c:3421 kthread+0x41a/0x930 kernel/kthread.c:463 ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> Allocated by task 34 on cpu 1 at 4.241550s: kasan_save_stack+0x2c/0x60 mm/kasan/common.c:57 kasan_save_track+0x1c/0x70 mm/kasan/common.c:78 kasan_save_alloc_info+0x3c/0x50 mm/kasan/generic.c:570 poison_kmalloc_redzone mm/kasan/common.c:398 [inline] __kasan_kmalloc+0xb5/0xc0 mm/kasan/common.c:415 kasan_kmalloc include/linux/kasan.h:263 [inline] __do_kmalloc_node mm/slub.c:5657 [inline] __kmalloc_node_noprof+0x2bf/0x8d0 mm/slub.c:5663 kmalloc_array_node_noprof include/linux/slab.h:1075 [inline] nvme_pci_alloc_dev drivers/nvme/host/pci.c:3479 [inline] nvme_probe+0x2f1/0x1820 drivers/nvme/host/pci.c:3534 local_pci_probe+0xef/0x1c0 drivers/pci/pci-driver.c:324 pci_call_probe drivers/pci/pci-driver.c:392 [inline] __pci_device_probe drivers/pci/pci-driver.c:417 [inline] pci_device_probe+0x743/0x920 drivers/pci/pci-driver.c:451 call_driver_probe drivers/base/dd.c:583 [inline] really_probe+0x29b/0xb70 drivers/base/dd.c:661 __driver_probe_device+0x3b0/0x4a0 drivers/base/dd.c:803 driver_probe_device+0x56/0x1f0 drivers/base/dd.c:833 __driver_attach_async_helper+0x155/0x340 drivers/base/dd.c:1159 async_run_entry_fn+0xa6/0x4b0 kernel/async.c:129 process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x65c/0xe60 kernel/workqueue.c:3421 kthread+0x41a/0x930 kernel/kthread.c:463 ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 The buggy address belongs to the object at ffff88800592a000 which belongs to the cache kmalloc-2k of size 2048 The buggy address is located 244 bytes to the right of allocated 1152-byte region [ffff88800592a000, ffff88800592a480) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5928 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 anon flags: 0xfffffc0000040(head|node=0|zone=1|lastcpupid=0x1fffff) page_type: f5(slab) raw: 000fffffc0000040 ffff888001042000 0000000000000000 dead000000000001 raw: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000 head: 000fffffc0000040 ffff888001042000 0000000000000000 dead000000000001 head: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000 head: 000fffffc0000003 ffffea0000164a01 00000000ffffffff 00000000ffffffff head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88800592a400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88800592a480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88800592a500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88800592a580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88800592a600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: 0f0d2c876c96 (nvme: free sq/cq dbbuf pointers when dbbuf set fails) Acked-by: Chao Shi <cshi008@fiu.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Acked-by: Dave Tian <daveti@purdue.edu> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-06Merge tag 'block-7.0-20260305' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Improve quirk visibility and configurability (Maurizio) - Fix runtime user modification to queue setup (Keith) - Fix multipath leak on try_module_get failure (Keith) - Ignore ambiguous spec definitions for better atomics support (John) - Fix admin queue leak on controller reset (Ming) - Fix large allocation in persistent reservation read keys (Sungwoo Kim) - Fix fcloop callback handling (Justin) - Securely free DHCHAP secrets (Daniel) - Various cleanups and typo fixes (John, Wilfred) - Avoid a circular lock dependency issue in the sysfs nr_requests or scheduler store handling - Fix a circular lock dependency with the pcpu mutex and the queue freeze lock - Cleanup for bio_copy_kern(), using __bio_add_page() rather than the bio_add_page(), as adding a page here cannot fail. The exiting code had broken cleanup for the error condition, so make it clear that the error condition cannot happen - Fix for a __this_cpu_read() in preemptible context splat * tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: use trylock to avoid lockdep circular dependency in sysfs nvme: fix memory allocation in nvme_pr_read_keys() block: use __bio_add_page in bio_copy_kern block: break pcpu_alloc_mutex dependency on freeze_lock blktrace: fix __this_cpu_read/write in preemptible context nvme-multipath: fix leak on try_module_get failure nvmet-fcloop: Check remoteport port_state before calling done callback nvme-pci: do not try to add queue maps at runtime nvme-pci: cap queue creation to used queues nvme-pci: ensure we're polling a polled queue nvme: fix memory leak in quirks_param_set() nvme: correct comment about nvme_ns_remove() nvme: stop setting namespace gendisk device driver data nvme: add support for dynamic quirk configuration via module parameter nvme: fix admin queue leak on controller reset nvme-fabrics: use kfree_sensitive() for DHCHAP secrets nvme: stop using AWUPF nvme: expose active quirks in sysfs nvme/host: fixup some typos
2026-03-04Merge tag 'nvme-7.0-2026-03-04' of git://git.infradead.org/nvme into block-7.0Jens Axboe
Pull NVMe fixes from Keith: "- Improve quirk visibility and configurability (Maurizio) - Fix runtime user modification to queue setup (Keith) - Fix multipath leak on try_module_get failure (Keith) - Ignore ambiguous spec definitions for better atomics support (John) - Fix admin queue leak on controller reset (Ming) - Fix large allocation in persistent reservation read keys (Sungwoo Kim) - Fix fcloop callback handling (Justin) - Securely free DHCHAP secrets (Daniel) - Various cleanups and typo fixes (John, Wilfred)" * tag 'nvme-7.0-2026-03-04' of git://git.infradead.org/nvme: nvme: fix memory allocation in nvme_pr_read_keys() nvme-multipath: fix leak on try_module_get failure nvmet-fcloop: Check remoteport port_state before calling done callback nvme-pci: do not try to add queue maps at runtime nvme-pci: cap queue creation to used queues nvme-pci: ensure we're polling a polled queue nvme: fix memory leak in quirks_param_set() nvme: correct comment about nvme_ns_remove() nvme: stop setting namespace gendisk device driver data nvme: add support for dynamic quirk configuration via module parameter nvme: fix admin queue leak on controller reset nvme-fabrics: use kfree_sensitive() for DHCHAP secrets nvme: stop using AWUPF nvme: expose active quirks in sysfs nvme/host: fixup some typos
2026-03-04nvme: fix memory allocation in nvme_pr_read_keys()Sungwoo Kim
nvme_pr_read_keys() takes num_keys from userspace and uses it to calculate the allocation size for rse via struct_size(). The upper limit is PR_KEYS_MAX (64K). A malicious or buggy userspace can pass a large num_keys value that results in a 4MB allocation attempt at most, causing a warning in the page allocator when the order exceeds MAX_PAGE_ORDER. To fix this, use kvzalloc() instead of kzalloc(). This bug has the same reasoning and fix with the patch below: https://lore.kernel.org/linux-block/20251212013510.3576091-1-kartikey406@gmail.com/ Warning log: WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x5aa/0x2300 mm/page_alloc.c:5216, CPU#1: syz-executor117/272 Modules linked in: CPU: 1 UID: 0 PID: 272 Comm: syz-executor117 Not tainted 6.19.0 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:__alloc_frozen_pages_noprof+0x5aa/0x2300 mm/page_alloc.c:5216 Code: ff 83 bd a8 fe ff ff 0a 0f 86 69 fb ff ff 0f b6 1d f9 f9 c4 04 80 fb 01 0f 87 3b 76 30 ff 83 e3 01 75 09 c6 05 e4 f9 c4 04 01 <0f> 0b 48 c7 85 70 fe ff ff 00 00 00 00 e9 8f fd ff ff 31 c0 e9 0d RSP: 0018:ffffc90000fcf450 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1ffff920001f9ea0 RDX: 0000000000000000 RSI: 000000000000000b RDI: 0000000000040dc0 RBP: ffffc90000fcf648 R08: ffff88800b6c3380 R09: 0000000000000001 R10: ffffc90000fcf840 R11: ffff88807ffad280 R12: 0000000000000000 R13: 0000000000040dc0 R14: 0000000000000001 R15: ffffc90000fcf620 FS: 0000555565db33c0(0000) GS:ffff8880be26c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002000000c CR3: 0000000003b72000 CR4: 00000000000006f0 Call Trace: <TASK> alloc_pages_mpol+0x236/0x4d0 mm/mempolicy.c:2486 alloc_frozen_pages_noprof+0x149/0x180 mm/mempolicy.c:2557 ___kmalloc_large_node+0x10c/0x140 mm/slub.c:5598 __kmalloc_large_node_noprof+0x25/0xc0 mm/slub.c:5629 __do_kmalloc_node mm/slub.c:5645 [inline] __kmalloc_noprof+0x483/0x6f0 mm/slub.c:5669 kmalloc_noprof include/linux/slab.h:961 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] nvme_pr_read_keys+0x8f/0x4c0 drivers/nvme/host/pr.c:245 blkdev_pr_read_keys block/ioctl.c:456 [inline] blkdev_common_ioctl+0x1b71/0x29b0 block/ioctl.c:730 blkdev_ioctl+0x299/0x700 block/ioctl.c:786 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl fs/ioctl.c:583 [inline] __x64_sys_ioctl+0x1bf/0x220 fs/ioctl.c:583 x64_sys_call+0x1280/0x21b0 mnt/fuzznvme_1/fuzznvme/linux-build/v6.19/./arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x71/0x330 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fb893d3108d Code: 28 c3 e8 46 1e 00 00 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffff61f2f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffff61f3138 RCX: 00007fb893d3108d RDX: 0000000020000040 RSI: 00000000c01070ce RDI: 0000000000000003 RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffff61f3138 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007ffff61f3128 R14: 00007fb893dae530 R15: 0000000000000001 </TASK> Fixes: 5fd96a4e15de (nvme: Add pr_ops read_keys support) Acked-by: Chao Shi <cshi008@fiu.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Acked-by: Dave Tian <daveti@purdue.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-27nvme-multipath: fix leak on try_module_get failureKeith Busch
We need to fall back to the synchronous removal if we can't get a reference on the module needed for the deferred removal. Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node") Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-26nvmet-fcloop: Check remoteport port_state before calling done callbackJustin Tee
In nvme_fc_handle_ls_rqst_work, the lsrsp->done callback is only set when remoteport->port_state is FC_OBJSTATE_ONLINE. Otherwise, the nvme_fc_xmt_ls_rsp's LLDD call to lport->ops->xmt_ls_rsp is expected to fail and the nvme-fc transport layer itself will directly call nvme_fc_xmt_ls_rsp_free instead of relying on LLDD's done callback to free the lsrsp resources. Update the fcloop_t2h_xmt_ls_rsp routine to check remoteport->port_state. If online, then lsrsp->done callback will free the lsrsp. Else, return -ENODEV to signal the nvme-fc transport to handle freeing lsrsp. Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Closes: https://lore.kernel.org/linux-nvme/21255200-a271-4fa0-b099-97755c8acd4c@work/ Fixes: 10c165af35d2 ("nvmet-fcloop: call done callback even when remote port is gone") Signed-off-by: Justin Tee <justintee8345@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj conversion from Kees Cook: "This does the tree-wide conversion to kmalloc_obj() and friends using coccinelle, with a subsequent small manual cleanup of whitespace alignment that coccinelle does not handle. This uncovered a clang bug in __builtin_counted_by_ref(), so the conversion is preceded by disabling that for current versions of clang. The imminent clang 22.1 release has the fix. I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv, s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc" * tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kmalloc_obj: Clean up after treewide replacements treewide: Replace kmalloc with kmalloc_obj for non-scalar types compiler_types: Disable __builtin_counted_by_ref for Clang
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-19io_uring: Add size check for sqe->cmdGovindarajulu Varadarajan
For SQE128, sqe->cmd provides 80 bytes for uring_cmd. Add macro to check if size of user struct does not exceed 80 bytes at compile time. User doesn't have to track this manually during development. Replace io_uring_sqe_cmd() inline func with macro and add io_uring_sqe128_cmd() which checks struct size for 16 bytes cmd and 80 bytes cmd respectively. Signed-off-by: Govindarajulu Varadarajan <govind.varadar@gmail.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-17Merge tag 'block-7.0-20260216' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-13nvme-pci: do not try to add queue maps at runtimeKeith Busch
The block layer allocates the set's maps once. We can't add special purpose queues at runtime if they weren't allocated at initialization time. Tested-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-13nvme-pci: cap queue creation to used queuesKeith Busch
If the user reduces the special queue count at runtime and resets the controller, we need to reduce the number of queues and interrupts requested accordingly rather than start with the pre-allocated queue count. Tested-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-13nvme-pci: ensure we're polling a polled queueKeith Busch
A user can change the polled queue count at run time. There's a brief window during a reset where a hipri task may try to poll that queue before the block layer has updated the queue maps, which would race with the now interrupt driven queue and may cause double completions. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-12Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2026-02-12nvmet: ignore discard return valueChaitanya Kulkarni
__blkdev_issue_discard() always returns 0, making the error checking in nvmet_bdev_discard_range() dead code. Kill the function nvmet_bdev_discard_range() and call __blkdev_issue_discard() directly from nvmet_bdev_execute_discard(), since no error handling is needed anymore for __blkdev_issue_discard() call. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-11nvme: fix memory leak in quirks_param_set()Maurizio Lombardi
When loading the nvme module, if the 'quirks' parameter is specified via both the kernel command line (e.g., nvme.quirks=...) and the modprobe command line (e.g., modprobe nvme quirks=...), the quirks_param_set() callback is invoked twice. Currently, in the double-invocation scenario, the second call overwrites the nvme_pci_quirk_list pointer, causing the memory allocated in the first call to leak. Fix this by freeing the existing list before assigning the new one. Fixes: b4247c8317c5 ("nvme: add support for dynamic quirk configuration via module parameter") Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-09Merge tag 'for-7.0/block-20260206' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Support for batch request processing for ublk, improving the efficiency of the kernel/ublk server communication. This can yield nice 7-12% performance improvements - Support for integrity data for ublk - Various other ublk improvements and additions, including a ton of selftests additions and updated - Move the handling of blk-crypto software fallback from below the block layer to above it. This reduces the complexity of dealing with bio splitting - Series fixing a number of potential deadlocks in blk-mq related to the queue usage counter and writeback throttling and rq-qos debugfs handling - Add an async_depth queue attribute, to resolve a performance regression that's been around for a qhilw related to the scheduler depth handling - Only use task_work for IOPOLL completions on NVMe, if it is necessary to do so. An earlier fix for an issue resulted in all these completions being punted to task_work, to guarantee that completions were only run for a given io_uring ring when it was local to that ring. With the new changes, we can detect if it's necessary to use task_work or not, and avoid it if possible. - rnbd fixes: - Fix refcount underflow in device unmap path - Handle PREFLUSH and NOUNMAP flags properly in protocol - Fix server-side bi_size for special IOs - Zero response buffer before use - Fix trace format for flags - Add .release to rnbd_dev_ktype - MD pull requests via Yu Kuai - Fix raid5_run() to return error when log_init() fails - Fix IO hang with degraded array with llbitmap - Fix percpu_ref not resurrected on suspend timeout in llbitmap - Fix GPF in write_page caused by resize race - Fix NULL pointer dereference in process_metadata_update - Fix hang when stopping arrays with metadata through dm-raid - Fix any_working flag handling in raid10_sync_request - Refactor sync/recovery code path, improve error handling for badblocks, and remove unused recovery_disabled field - Consolidate mddev boolean fields into mddev_flags - Use mempool to allocate stripe_request_ctx and make sure max_sectors is not less than io_opt in raid5 - Fix return value of mddev_trylock - Fix memory leak in raid1_run() - Add Li Nan as mdraid reviewer - Move phys_vec definitions to the kernel types, mostly in preparation for some VFIO and RDMA changes - Improve the speed for secure erase for some devices - Various little rust updates - Various other minor fixes, improvements, and cleanups * tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) blk-mq: ABI/sysfs-block: fix docs build warnings selftests: ublk: organize test directories by test ID block: decouple secure erase size limit from discard size limit block: remove redundant kill_bdev() call in set_blocksize() blk-mq: add documentation for new queue attribute async_dpeth block, bfq: convert to use request_queue->async_depth mq-deadline: covert to use request_queue->async_depth kyber: covert to use request_queue->async_depth blk-mq: add a new queue sysfs attribute async_depth blk-mq: factor out a helper blk_mq_limit_depth() blk-mq-sched: unify elevators checking for async requests block: convert nr_requests to unsigned int block: don't use strcpy to copy blockdev name blk-mq-debugfs: warn about possible deadlock blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos() blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static blk-rq-qos: fix possible debugfs_mutex deadlock blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter ...
2026-02-06nvme: correct comment about nvme_ns_remove()John Garry
The comment in nvme_mpath_remove_disk() references nvme_remove_ns(), which should be nvme_ns_remove(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-06nvme: stop setting namespace gendisk device driver dataJohn Garry
Since commit 1f4137e882c6 ("nvme: move passthrough logging attribute to head"), we stopped using the namespace to hold the passthrough logging enabled attribute. There is now nowhere now which looks up the gendisk dev driver data, so stop setting it. Incidentally, it would have been better to set this before adding the disk. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-05Merge tag 'block-6.19-20260205' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Revert of a change for loop, which caused regressions for some users (Actually revert of two commits, where one is just an existing fix for the offending commit) - NVMe pull via Keith: - Fix NULL pointer access setting up dma mappings - Fix invalid memory access from malformed TCP PDU * tag 'block-6.19-20260205' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: loop: revert exclusive opener loop status change nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovec nvme-pci: handle changing device dma map requirements
2026-02-05nvme: add support for dynamic quirk configuration via module parameterMaurizio Lombardi
Introduce support for enabling or disabling specific NVMe quirks at module load time through the `quirks` module parameter. This mechanism allows users to apply known quirks dynamically based on the device's PCI vendor and device IDs, without requiring to add hardcoded entries in the driver and recompiling the kernel. While the generic PCI new_id sysfs interface exists for dynamic configuration, it is insufficient for scenarios where the system fails to boot (for example, this has been reported to happen because of the bogus_nid quirk). The new_id attribute is writable only after the system has booted and sysfs is mounted. The `quirks` parameter accepts a list of quirk specifications separated by a '-' character in the following format: <VID>:<DID>:<quirk_names>[-<VID>:<DID>:<quirk_names>-..] Each quirk is represented by its name and can be prefixed with `^` to indicate that the quirk should be disabled; quirk names are separated by a ',' character. Example: enable BOGUS_NID and BROKEN_MSI, disable DEALLOCATE_ZEROES: $ modprobe nvme quirks=7170:2210:bogus_nid,broken_msi,^deallocate_zeroes Tested-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-05nvmet-tcp: add bounds checks in nvmet_tcp_build_pdu_iovecYunJe Shin
nvmet_tcp_build_pdu_iovec() could walk past cmd->req.sg when a PDU length or offset exceeds sg_cnt and then use bogus sg->length/offset values, leading to _copy_to_iter() GPF/KASAN. Guard sg_idx, remaining entries, and sg->length/offset before building the bvec. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: YunJe Shin <ioerts@kookmin.ac.kr> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Joonkyo Jung <joonkyoj@yonsei.ac.kr> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-05nvme-pci: handle changing device dma map requirementsKeith Busch
The initial state of dma_needs_unmap may be false, but change to true while mapping the data iterator. Enabling swiotlb is one such case that can change the result. The nvme driver needs to save the mapped dma vectors to be unmapped later, so allocate as needed during iteration rather than assume it was always allocated at the beginning. This fixes a NULL dereference from accessing an uninitialized dma_vecs when the device dma unmapping requirements change mid-iteration. Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping") Link: https://lore.kernel.org/linux-nvme/20260202125738.1194899-1-pradeep.pragallapati@oss.qualcomm.com/ Reported-by: Pradeep P V K <pradeep.pragallapati@oss.qualcomm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-02nvme: fix admin queue leak on controller resetMing Lei
When nvme_alloc_admin_tag_set() is called during a controller reset, a previous admin queue may still exist. Release it properly before allocating a new one to avoid orphaning the old queue. This fixes a regression introduced by commit 03b3bcd319b3 ("nvme: fix admin request_queue lifetime"). Cc: Keith Busch <kbusch@kernel.org> Fixes: 03b3bcd319b3 ("nvme: fix admin request_queue lifetime"). Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAHj4cs9wv3SdPo+N01Fw2SHBYDs9tj2M_e1-GdQOkRy=DsBB1w@mail.gmail.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-02-02nvme-fabrics: use kfree_sensitive() for DHCHAP secretsDaniel Hodges
The DHCHAP secrets (dhchap_secret and dhchap_ctrl_secret) contain authentication key material for NVMe-oF. Use kfree_sensitive() instead of kfree() in nvmf_free_options() to ensure secrets are zeroed before the memory is freed, preventing recovery from freed pages. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Hodges <hodgesd@meta.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-30Merge tag 'block-6.19-20260130' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Fix for an accounting leak in bcache that's been there forever, and a related dead code removal - Revert of a fix for rnbd that went into this series, but depends on other changes that are staged for 7.0 - NVMe pull request via Keith: - TCP target completion race condition fix (Ming) - DMA descriptor cleanup fix (Roger) * tag 'block-6.19-20260130' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: fix I/O accounting leak in detached_dev_do_request bcache: remove dead code in detached_dev_do_request nvme-pci: DMA unmap the correct regions in nvme_free_sgls Revert "rnbd-clt: fix refcount underflow in device unmap path" nvmet: fix race in nvmet_bio_done() leading to NULL pointer dereference
2026-01-30block: introduce bdev_rot()Damien Le Moal
Introduce the helper function bdev_rot() to test if a block device is a rotational one. The existing function bdev_nonrot() which tests for the opposite condition is redefined using this new helper. This avoids the double negation (operator and name) that appears when testing if a block device is a rotational device, thus making the code a little easier to read. Call sites of bdev_nonrot() in the block layer are updated to use this new helper. Remaining users in other subsystems are left unchanged for now. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-28nvme: stop using AWUPFJohn Garry
As described at [0], much of the atomic write parts of the specification are lacking. For now, there is nothing which we can do in software about the lack of a dedicated NVMe write atomic command. As for reading the atomic write limits, it is felt that the per-namespace values are mostly properly specified and it is assumed that they are properly implemented. The specification of NAWUPF is quite clear. However the specification of NABSPF is less clear. The lack of clarity in NABSPF comes from deciding whether NABSPF applies when NSABP is 0 - it is assumed that NSABPF does not apply when NSABP is 0. As for the per-controller AWUPF, how this value applies to shared namespaces is missing in the specification. Furthermore, the value is in terms of logical blocks, which is an NS entity. Since AWUPF is so poorly defined, stop using it already together. Hopefully this will force vendors to implement NAWUPF support always. Note that AWUPF not only effects atomic write support, but also the physical block size reported for the device. To help users know this restriction, log an info message per NS. [0] https://lore.kernel.org/linux-nvme/20250707141834.GA30198@lst.de/ Tested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-28nvme-pci: DMA unmap the correct regions in nvme_free_sglsRoger Pau Monne
The call to nvme_free_sgls() in nvme_unmap_data() has the sg_list and sge parameters swapped. This wasn't noticed by the compiler because both share the same type. On a Xen PV hardware domain, and possibly any other architectures that takes that path, this leads to corruption of the NVMe contents. Fixes: f0887e2a52d4 ("nvme-pci: create common sgl unmapping helper") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-21nvmet: fix race in nvmet_bio_done() leading to NULL pointer dereferenceMing Lei
There is a race condition in nvmet_bio_done() that can cause a NULL pointer dereference in blk_cgroup_bio_start(): 1. nvmet_bio_done() is called when a bio completes 2. nvmet_req_complete() is called, which invokes req->ops->queue_response(req) 3. The queue_response callback can re-queue and re-submit the same request 4. The re-submission reuses the same inline_bio from nvmet_req 5. Meanwhile, nvmet_req_bio_put() (called after nvmet_req_complete) invokes bio_uninit() for inline_bio, which sets bio->bi_blkg to NULL 6. The re-submitted bio enters submit_bio_noacct_nocheck() 7. blk_cgroup_bio_start() dereferences bio->bi_blkg, causing a crash: BUG: kernel NULL pointer dereference, address: 0000000000000028 #PF: supervisor read access in kernel mode RIP: 0010:blk_cgroup_bio_start+0x10/0xd0 Call Trace: submit_bio_noacct_nocheck+0x44/0x250 nvmet_bdev_execute_rw+0x254/0x370 [nvmet] process_one_work+0x193/0x3c0 worker_thread+0x281/0x3a0 Fix this by reordering nvmet_bio_done() to call nvmet_req_bio_put() BEFORE nvmet_req_complete(). This ensures the bio is cleaned up before the request can be re-submitted, preventing the race condition. Fixes: 190f4c2c863a ("nvmet: fix memory leak of bio integrity") Cc: Dmitry Bogdanov <d.bogdanov@yadro.com> Cc: stable@vger.kernel.org Cc: Guangwu Zhang <guazhang@redhat.com> Link: http://www.mail-archive.com/debian-kernel@lists.debian.org/msg146238.html Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-20kernel.h: drop hex.h and update all hex.h usersRandy Dunlap
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20nvme/io_uring: optimize IOPOLL completions for local ring contextMing Lei
When multiple io_uring rings poll on the same NVMe queue, one ring can find completions belonging to another ring. The current code always uses task_work to handle this, but this adds overhead for the common single-ring case. This patch passes the polling io_ring_ctx through io_comp_batch's new poll_ctx field. In io_do_iopoll(), the polling ring's context is stored in iob.poll_ctx before calling the iopoll callbacks. In nvme_uring_cmd_end_io(), we now compare iob->poll_ctx with the request's owning io_ring_ctx (via io_uring_cmd_ctx_handle()). If they match (local context), we complete inline with io_uring_cmd_done32(). If they differ (remote context) or iob is NULL (non-iopoll path), we use task_work as before. This optimization eliminates task_work scheduling overhead for the common case where a ring polls and finds its own completions. ~10% IOPS improvement is observed in the following benchmark: fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -O0 -P1 -u1 -n1 /dev/ng0n1 Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-20block: pass io_comp_batch to rq_end_io_fn callbackMing Lei
Add a third parameter 'const struct io_comp_batch *' to the rq_end_io_fn callback signature. This allows end_io handlers to access the completion batch context when requests are completed via blk_mq_end_request_batch(). The io_comp_batch is passed from blk_mq_end_request_batch(), while NULL is passed from __blk_mq_end_request() and blk_mq_put_rq_ref() which don't have batch context. This infrastructure change enables drivers to detect whether they're being called from a batched completion path (like iopoll) and access additional context stored in the io_comp_batch. Update all rq_end_io_fn implementations: - block/blk-mq.c: blk_end_sync_rq - block/blk-flush.c: flush_end_io, mq_flush_data_end_io - drivers/nvme/host/ioctl.c: nvme_uring_cmd_end_io - drivers/nvme/host/core.c: nvme_keep_alive_end_io - drivers/nvme/host/pci.c: abort_endio, nvme_del_queue_end, nvme_del_cq_end - drivers/nvme/target/passthru.c: nvmet_passthru_req_done - drivers/scsi/scsi_error.c: eh_lock_door_done - drivers/scsi/sg.c: sg_rq_end_io - drivers/scsi/st.c: st_scsi_execute_end - drivers/target/target_core_pscsi.c: pscsi_req_done - drivers/md/dm-rq.c: end_clone_request Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-18Merge branch 'for-7.0/blk-pvec' into for-7.0/blockJens Axboe
* for-7.0/blk-pvec: types: move phys_vec definition to common header nvme-pci: Use size_t for length fields to handle larger sizes
2026-01-16Merge tag 'block-6.19-20260116' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Device quirk to disable faulty temperature (Ilikara) - TCP target null pointer fix from bad host protocol usage (Shivam) - Add apple,t8103-nvme-ans2 as a compatible apple controller (Janne) - FC tagset leak fix (Chaitanya) - TCP socket deadlock fix (Hannes) - Target name buffer overrun fix (Shin'ichiro) - Fix for an underflow for rnbd during device unmap - Zero the non-PI part of the auto integrity buffer - Fix for a configfs memory leak in the null block driver * tag 'block-6.19-20260116' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: rnbd-clt: fix refcount underflow in device unmap path nvme: fix PCIe subsystem reset controller state transition nvmet: do not copy beyond sybsysnqn string length nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready() null_blk: fix kmemleak by releasing references to fault configfs items block: zero non-PI portion of auto integrity buffer nvme-fc: release admin tagset if init fails nvme-apple: add "apple,t8103-nvme-ans2" as compatible nvme-tcp: fix NULL pointer dereferences in nvmet_tcp_build_pdu_iovec nvme-pci: disable secondary temp for Wodposit WPBSNM8
2026-01-14nvme: fix PCIe subsystem reset controller state transitionNilay Shroff
The commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING state”) disallows controller state transitions directly from RESETTING to LIVE. However, the NVMe PCIe subsystem reset path relies on this transition to recover the controller on PowerPC (PPC) systems. On PPC systems, issuing a subsystem reset causes a temporary loss of communication with the NVMe adapter. A subsequent PCIe MMIO read then triggers EEH recovery, which restores the PCIe link and brings the controller back online. For EEH recovery to proceed correctly, the controller must transition back to the LIVE state. Due to the changes introduced by commit d2fe192348f9 (“nvme: only allow entering LIVE from CONNECTING state”), the controller can no longer transition directly from RESETTING to LIVE. As a result, EEH recovery exits prematurely, leaving the controller stuck in the RESETTING state. Fix this by explicitly transitioning the controller state from RESETTING to CONNECTING and then to LIVE. This satisfies the updated state transition rules and allows the controller to be successfully recovered on PPC systems following a PCIe subsystem reset. Cc: stable@vger.kernel.org Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state") Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13nvme: expose active quirks in sysfsMaurizio Lombardi
Currently, there is no straightforward way for a user to inspect which quirks are active for a given device from userspace. Add a new "quirks" sysfs attribute to the nvme controller device. Reading this file will display a human-readable list of all active quirks, with each quirk name on a new line. If no quirks are active, it will display "none". Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13nvmet: do not copy beyond sybsysnqn string lengthShin'ichiro Kawasaki
Commit edd17206e363 ("nvmet: remove redundant subsysnqn field from ctrl") replaced ctrl->subsysnqn with ctrl->subsys->subsysnqn. This change works as expected because both point to strings with the same data. However, their memory allocation lengths differ. ctrl->subsysnqn had the fixed size defined as NVMF_NQN_FILED_LEN, while ctrl->subsys->subsysnqn has variable length determined by kstrndup(). Due to this difference, KASAN slab-out-of-bounds occurs at memcpy() in nvmet_passthru_override_id_ctrl() after the commit. The failure can be recreated by running the blktests test case nvme/033. To prevent such failures, replace memcpy() with strscpy(), which copies only the string length and avoids overruns. Fixes: edd17206e363 ("nvmet: remove redundant subsysnqn field from ctrl") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13nvme/host: fixup some typosWilfred Mallawa
Fix up some minor typos in the nvme host driver and a comment style to conform to the standard kernel style. Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()Hannes Reinecke
When the socket is closed while in TCP_LISTEN a callback is run to flush all outstanding packets, which in turns calls nvmet_tcp_listen_data_ready() with the sk_callback_lock held. So we need to check if we are in TCP_LISTEN before attempting to get the sk_callback_lock() to avoid a deadlock. Link: https://lore.kernel.org/linux-nvme/CAHj4cs-zu7eVB78yUpFjVe2UqMWFkLk8p+DaS3qj+uiGCXBAoA@mail.gmail.com/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-01-13block, nvme: remove unused dma_iova_state function parameterNitesh Shetty
DMA IOVA state is not used inside blk_rq_dma_map_iter_next, get rid of the argument. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-09nvme-fc: release admin tagset if init failsChaitanya Kulkarni
nvme_fabrics creates an NVMe/FC controller in following path: nvmf_dev_write() -> nvmf_create_ctrl() -> nvme_fc_create_ctrl() -> nvme_fc_init_ctrl() nvme_fc_init_ctrl() allocates the admin blk-mq resources right after nvme_add_ctrl() succeeds. If any of the subsequent steps fail (changing the controller state, scheduling connect work, etc.), we jump to the fail_ctrl path, which tears down the controller references but never frees the admin queue/tag set. The leaked blk-mq allocations match the kmemleak report seen during blktests nvme/fc. Check ctrl->ctrl.admin_tagset in the fail_ctrl path and call nvme_remove_admin_tag_set() when it is set so that all admin queue allocations are reclaimed whenever controller setup aborts. Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>