path: root/block/blk.h
Age  Commit message  Author
2026-02-16  blk-mq: use NOIO context to prevent deadlock during debugfs creation  (Yu Kuai)
Creating debugfs entries can trigger fs reclaim, which can enter back into the block layer request_queue. This can cause deadlock if the queue is frozen. Previously, a WARN_ON_ONCE check was used in debugfs_create_files() to detect this condition, but it was racy since the queue can be frozen from another context at any time. Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave() and blk_debugfs_unlock_nomemrestore() variants for callers that don't need NOIO protection (e.g., debugfs removal or read-only operations). Replace all raw debugfs_mutex lock/unlock pairs with these helpers, using the _nomemsave/_nomemrestore variants where appropriate. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/ Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
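A minimal sketch of what such a helper pair could look like, based on the commit description; the exact placement of the saved NOIO flags and the mutex field layout are assumptions, not the actual diff:

    /* Sketch only: combine the debugfs mutex with a NOIO scope so that
     * fs reclaim triggered while holding the lock cannot recurse into
     * block I/O against a (possibly frozen) request_queue. */
    static unsigned int blk_debugfs_lock(struct request_queue *q)
    {
            unsigned int memflags;

            mutex_lock(&q->debugfs_mutex);
            memflags = memalloc_noio_save();
            return memflags;
    }

    static void blk_debugfs_unlock(struct request_queue *q, unsigned int memflags)
    {
            memalloc_noio_restore(memflags);
            mutex_unlock(&q->debugfs_mutex);
    }

The _nomemsave/_nomemrestore variants mentioned above would simply take and release the mutex without the memalloc_noio_save()/restore() pair.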
2026-02-09  Merge tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull bounce buffer dio for stable pages from Jens Axboe:
 "This adds support for bounce buffering of dio for stable pages. This was all done by Christoph. In his words:

  This series tries to address the problem that pages under I/O can be modified during direct I/O, even when the device or file system requires stable pages during I/O to calculate checksums, parity or data operations. It does so by adding block layer helpers to bounce buffer an iov_iter into a bio, then wires that up in iomap and ultimately XFS.

  The reason that the file system even needs to know about it is because reads need a user context to copy the data back, and the infrastructure to defer ioends to a workqueue currently sits in XFS. I'm going to look into moving that into ioend and enabling it for other file systems.

  Additionally btrfs already has its own infrastructure for this, and actually an urgent need to bounce buffer, so this should be useful there and could be wired up easily. In fact the idea comes from patches by Qu that did this in btrfs.

  This fixes all but one xfstests failure on T10 PI capable devices (generic/095 seems to have issues with a mix of mmap and splice still, I'm looking into that separately), and makes qemu VMs running Windows, or Linux with swap enabled, work fine on an XFS file on a device using PI.

  Performance numbers on my (not exactly state of the art) NVMe PI test setup:

  Sequential reads using io_uring, QD=16. Bandwidth and CPU usage (usr/sys):

  | size | zero copy                | bounce                   |
  +------+--------------------------+--------------------------+
  | 4k   | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) |
  | 64K  | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) |
  | 1M   | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) |
  +------+--------------------------+--------------------------+

  Sequential writes using io_uring, QD=16. Bandwidth and CPU usage (usr/sys):

  | size | zero copy                | bounce                   |
  +------+--------------------------+--------------------------+
  | 4k   |  882MiB/s (11.83/33.88%) |  750MiB/s (10.53/34.08%) |
  | 64K  | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) |
  | 1M   | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) |
  +------+--------------------------+--------------------------+

  Note that the 64k read numbers look really odd to me for the baseline zero copy case, but are reproducible over many repeated runs. The bounce read numbers should further improve when moving the PI validation to the file system and removing the double context switch, which I have patches for that will be sent out soon"

* tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  xfs: use bounce buffering direct I/O when the device requires stable pages
  iomap: add a flag to bounce buffer direct I/O
  iomap: support ioends for direct reads
  iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED
  iomap: free the bio before completing the dio
  iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct
  iomap: split out the per-bio logic from iomap_dio_bio_iter
  iomap: simplify iomap_dio_bio_iter
  iomap: fix submission side handling of completion side errors
  block: add helpers to bounce buffer an iov_iter into bios
  block: remove bio_release_page
  iov_iter: extract a iov_iter_extract_bvecs helper from bio code
  block: open code bio_add_page and fix handling of mismatching P2P ranges
  block: refactor get_contig_folio_len
  block: add a BIO_MAX_SIZE constant and use it
2026-02-04  block: decouple secure erase size limit from discard size limit  (Luke Wang)
Secure erase should use max_secure_erase_sectors instead of being limited by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from REQ_OP_DISCARD to allow each operation to use its own size limit. Signed-off-by: Luke Wang <ziniu.wang_1@nxp.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
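Conceptually, the limit selection becomes per-operation instead of funneling secure erase through the discard limit. A hedged sketch of that selection (the helper name and shape are illustrative, not the actual diff):

    /* Illustrative only: pick the size limit based on the operation,
     * so REQ_OP_SECURE_ERASE is no longer capped by the discard limit. */
    static unsigned int bio_op_max_sectors(const struct queue_limits *lim,
                                           struct bio *bio)
    {
            switch (bio_op(bio)) {
            case REQ_OP_DISCARD:
                    return lim->max_discard_sectors;
            case REQ_OP_SECURE_ERASE:
                    return lim->max_secure_erase_sectors;
            default:
                    return lim->max_sectors;
            }
    }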
2026-01-28  block: remove bio_release_page  (Christoph Hellwig)
Merge bio_release_page into the only remaining caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-10  block: account for bi_bvec_done in bio_may_need_split()  (Ming Lei)
When checking if a bio fits in a single segment, bio_may_need_split() compares bi_size against the current bvec's bv_len. However, for partially consumed bvecs (bi_bvec_done > 0), such as in cloned or split bios, the remaining bytes in the current bvec are actually (bv_len - bi_bvec_done), not bv_len. This could cause bio_may_need_split() to incorrectly return false, leading to nr_phys_segments being set to 1 when the bio actually spans multiple segments. This triggers the WARN_ON in __blk_rq_map_sg() when the actual mapped segments exceed the expected count. Fix by subtracting bi_bvec_done from bv_len in the comparison. Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/9687cf2b-1f32-44e1-b58d-2492dc6e7185@linux.ibm.com/ Reported-and-bisected-by: Christoph Hellwig <hch@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Tested-by: Christoph Hellwig <hch@infradead.org> Fixes: ee623c892aa5 ("block: use bvec iterator helper for bio_may_need_split()") Cc: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
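A sketch of the corrected check, assuming the iterator-based helper shape introduced by the commit this fixes (the chunk_sectors handling and NULL guard are simplifications):

    /* Sketch: a bio fits in one segment only if the bytes left to
     * iterate (bi_size) do not exceed what remains of the current bvec;
     * for a partially consumed bvec, bi_bvec_done bytes are used up. */
    static bool bio_may_need_split(struct bio *bio,
                                   const struct queue_limits *lim)
    {
            const struct bio_vec *bv;

            if (lim->chunk_sectors)
                    return true;
            if (!bio->bi_io_vec)
                    return false;   /* no payload to split */

            bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
            return bio->bi_iter.bi_size >
                   bv->bv_len - bio->bi_iter.bi_bvec_done;
    }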
2026-01-07  block: use bvec iterator helper for bio_may_need_split()  (Ming Lei)
bio_may_need_split() uses bi_vcnt to determine if a bio has a single segment, but bi_vcnt is unreliable for cloned bios. Cloned bios share the parent's bi_io_vec array but iterate over a subset via bi_iter, so bi_vcnt may not reflect the actual segment count being iterated. Replace the bi_vcnt check with bvec iterator access via __bvec_iter_bvec(), comparing bi_iter.bi_size against the current bvec's length. This correctly handles both cloned and non-cloned bios. Move bi_io_vec into the first cache line adjacent to bi_iter. This is a sensible layout since bi_io_vec and bi_iter are commonly accessed together throughout the block layer - every bvec iteration requires both fields. This displaces bi_end_io to the second cache line, which is acceptable since bi_end_io and bi_private are always fetched together in bio_endio() anyway. The struct layout change requires bio_reset() to preserve and restore bi_io_vec across the memset, since it now falls within BIO_RESET_BYTES. Nitesh verified that this patch doesn't regress NVMe 512-byte IO perf [1]. Link: https://lore.kernel.org/linux-block/20251220081607.tvnrltcngl3cc2fh@green245.gost/ [1] Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13  block: unify elevator tags and type xarrays into struct elv_change_ctx  (Nilay Shroff)
Currently, the nr_hw_queues update path manages two disjoint xarrays — one for elevator tags and another for elevator type — both used during elevator switching. Maintaining these two parallel structures for the same purpose adds unnecessary complexity and potential for mismatched state. This patch unifies both xarrays into a single structure, struct elv_change_ctx, which holds all per-queue elevator change context. A single xarray, named elv_tbl, now maps each queue (q->id) in a tagset to its corresponding elv_change_ctx entry, encapsulating the elevator tags, type and name references. This unification simplifies the code, improves maintainability, and clarifies ownership of per-queue elevator state. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05  block: handle zone management operations completions  (Damien Le Moal)
The functions blk_zone_wplug_handle_reset_or_finish() and blk_zone_wplug_handle_reset_all() both modify the zone write pointer offset of zone write plugs that are the target of a reset, reset all or finish zone management operation. However, these functions do this modification before the BIO is executed. So if the zone operation fails, the modified zone write pointer offsets become invalid. Avoid this by modifying the zone write pointer offset of a zone write plug that is the target of a zone management operation when the operation completes. To do so, modify blk_zone_bio_endio() to call the new function blk_zone_mgmt_bio_endio() which in turn calls the functions blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or blk_zone_finish_bio_endio() depending on the operation of the completed BIO, to modify a zone write plug write pointer offset accordingly. These functions are called only if the BIO execution was successful. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
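The completion-time dispatch described above might look roughly like this; a sketch using the handler names from the commit message (their exact signatures and the success check are assumptions):

    /* Sketch: only adjust zone write plug write pointer offsets once the
     * zone management BIO has completed successfully. */
    static void blk_zone_mgmt_bio_endio(struct bio *bio)
    {
            if (bio->bi_status != BLK_STS_OK)
                    return;         /* failed ops must not touch wp offsets */

            switch (bio_op(bio)) {
            case REQ_OP_ZONE_RESET_ALL:
                    blk_zone_reset_all_bio_endio(bio);
                    break;
            case REQ_OP_ZONE_RESET:
                    blk_zone_reset_bio_endio(bio);
                    break;
            case REQ_OP_ZONE_FINISH:
                    blk_zone_finish_bio_endio(bio);
                    break;
            default:
                    break;
            }
    }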
2025-10-22  block: rename min_segment_size  (Keith Busch)
Despite its name, the block layer is fine with segments smaller than the "min_segment_size" limit. The value is an optimization limit indicating the largest segment that can be used without considering boundary limits. Smaller segments can take a fast path, so give it a name that reflects that: max_fast_segment_size. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10  block: fix ordering of recursive split IO  (Yu Kuai)
Currently, a split bio will be chained to the original bio, and the original bio will be resubmitted to the tail of current->bio_list, waiting for the split bio to be issued. However, if the split bio gets split again, the IO order will be messed up. This problem, on the one hand, causes performance degradation, especially for mdraid with large IO sizes; on the other hand, it causes write errors for zoned block devices [1].

For example, in raid456 IO will first be split by max_sectors from md_submit_bio(), and then later be split again by chunksize for internal handling. Assume max_sectors is 1M and chunksize is 512k:

1) issue a 2M IO:
   bio issuing: 0+2M
   current->bio_list: NULL
2) md_submit_bio() splits by max_sectors:
   bio issuing: 0+1M
   current->bio_list: 1M+1M
3) chunk_aligned_read() splits by chunksize:
   bio issuing: 0+512k
   current->bio_list: 1M+1M -> 512k+512k
4) after the first bio is issued, __submit_bio_noacct() will continue issuing the next bio:
   bio issuing: 1M+1M
   current->bio_list: 512k+512k
   bio issued: 0+512k
5) chunk_aligned_read() splits by chunksize:
   bio issuing: 1M+512k
   current->bio_list: 512k+512k -> 1536k+512k
   bio issued: 0+512k
6) no split afterwards; finally, the issue order is:
   0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes large IO reads on raid456 to end up as small discontinuous IO in the underlying disks. Fix this problem by placing split bios at the head of current->bio_list.

Test script: test on an 8-disk raid5 with 64k chunksize:

  dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:

Before this patch:

1) iostat results:

   Device    r/s       rMB/s    rrqm/s   %rrqm  r_await  rareq-sz  aqu-sz  %util
   md0       52430.00  3276.87  0.00     0.00   0.62     64.00     32.60   80.10
   sd*       4487.00   409.00   2054.00  31.40  0.82     93.34     3.68    71.20

2) blktrace G stage:

   8,0  0  486445  11.357392936  843  G  R 14071424 + 128 [dd]
   8,0  0  486451  11.357466360  843  G  R 14071168 + 128 [dd]
   8,0  0  486454  11.357515868  843  G  R 14071296 + 128 [dd]
   8,0  0  486468  11.357968099  843  G  R 14072192 + 128 [dd]
   8,0  0  486474  11.358031320  843  G  R 14071936 + 128 [dd]
   8,0  0  486480  11.358096298  843  G  R 14071552 + 128 [dd]
   8,0  0  486490  11.358303858  843  G  R 14071808 + 128 [dd]

3) io seek for sdx (io seek is the result from blktrace D stage, a statistic of ABS((offset of next IO) - (offset + len of previous IO))):

   Read|Write seek
   cnt 55175, zero cnt 25079

   >=(KB) .. <(KB)  : count  ratio  |distribution                            |
        0 .. 1      : 25079  45.5%  |########################################|
        1 .. 2      :     0   0.0%  |                                        |
        2 .. 4      :     0   0.0%  |                                        |
        4 .. 8      :     0   0.0%  |                                        |
        8 .. 16     :     0   0.0%  |                                        |
       16 .. 32     :     0   0.0%  |                                        |
       32 .. 64     : 12540  22.7%  |#####################                   |
       64 .. 128    :  2508   4.5%  |#####                                   |
      128 .. 256    :     0   0.0%  |                                        |
      256 .. 512    : 10032  18.2%  |#################                       |
      512 .. 1024   :  5016   9.1%  |#########                               |

After this patch:

1) iostat results:

   Device    r/s       rMB/s    rrqm/s   %rrqm  r_await  rareq-sz  aqu-sz  %util
   md0       87965.00  5271.88  0.00     0.00   0.16     61.37     14.03   90.60
   sd*       6020.00   658.44   5117.00  45.95  0.44     112.00    2.68    86.50

2) blktrace G stage:

   8,0  0  206296  5.354894072  664  G  R 7156992 + 128 [dd]
   8,0  0  206305  5.355018179  664  G  R 7157248 + 128 [dd]
   8,0  0  206316  5.355204438  664  G  R 7157504 + 128 [dd]
   8,0  0  206319  5.355241048  664  G  R 7157760 + 128 [dd]
   8,0  0  206333  5.355500923  664  G  R 7158016 + 128 [dd]
   8,0  0  206344  5.355837806  664  G  R 7158272 + 128 [dd]
   8,0  0  206353  5.355960395  664  G  R 7158528 + 128 [dd]
   8,0  0  206357  5.356020772  664  G  R 7158784 + 128 [dd]

3) io seek for sdx:

   Read|Write seek
   cnt 28644, zero cnt 21483

   >=(KB) .. <(KB)  : count  ratio  |distribution                            |
        0 .. 1      : 21483  75.0%  |########################################|
        1 .. 2      :     0   0.0%  |                                        |
        2 .. 4      :     0   0.0%  |                                        |
        4 .. 8      :     0   0.0%  |                                        |
        8 .. 16     :     0   0.0%  |                                        |
       16 .. 32     :     0   0.0%  |                                        |
       32 .. 64     :  7161  25.0%  |##############                          |

BTW, this looks like a long-term problem from day one, and large sequential IO reads are a pretty common case, like video playback. And even with this patch, IO in this test case is merged to at most 128k due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that limit can get even better performance. However, we'll figure out how to do this properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls") Reported-by: Tie Ren <tieren@fnnas.com> Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
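The core of the fix is tiny: resubmit the remainder at the head instead of the tail of the per-task bio list, so a recursively split bio is issued in offset order. A hedged sketch (the list index and the wrapper function are simplifications of the real submission path):

    /* Sketch: queue the remainder of a split bio at the head of
     * current->bio_list so it is picked up before later siblings,
     * preserving submission order under recursive splits. */
    static void submit_split_remainder(struct bio *remainder)
    {
            bio_list_add_head(&current->bio_list[0], remainder);
    }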
2025-09-10  block: skip unnecessary checks for split bio  (Yu Kuai)
Lots of checks are already done while submitting this bio the first time, and there is no need to check them again when the bio is resubmitted after a split. Hence open-code should_fail_bio() and blk_throtl_bio(), the checks that are still necessary, in submit_bio_split_bioset(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10  block: cleanup bio_issue  (Yu Kuai)
Now that bio->bi_issue is only used by blk-iolatency to get bio issue time, replace bio_issue with u64 time directly and remove bio_issue to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08  blk-mq: Defer freeing flush queue to SRCU callback  (Ming Lei)
The freeing of the flush queue/request in blk_mq_exit_hctx() can race with tag iterators that may still be accessing it. To prevent a potential use-after-free, the deallocation should be deferred until after a grace period. This way, we can replace the big tags->lock in the tag iterator code path with SRCU for solving the issue.

This patch introduces an SRCU-based deferred freeing mechanism for the flush queue. The changes include:

- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`, to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu` instance to ensure synchronization with tag iterators.

Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
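A sketch of the deferred-free pattern the bullet list describes (the rcu_head placement and the path to the tagset's SRCU instance are assumptions):

    /* Sketch: defer freeing the flush queue until an SRCU grace period
     * elapses, so concurrent tag iterators cannot see it disappear. */
    static void blk_free_flush_queue_callback(struct rcu_head *head)
    {
            struct blk_flush_queue *fq =
                    container_of(head, struct blk_flush_queue, rcu_head);

            blk_free_flush_queue(fq);
    }

    /* in blk_mq_exit_hctx(), instead of blk_free_flush_queue(fq): */
    call_srcu(&hctx->queue->tag_set->tags_srcu, &fq->rcu_head,
              blk_free_flush_queue_callback);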
2025-08-11  block: fix kobject double initialization in add_disk  (Zheng Qixing)
Device-mapper can call add_disk() multiple times for the same gendisk due to its two-phase creation process (dm create + dm load). This leads to kobject double initialization errors when the underlying iSCSI devices become temporarily unavailable and then reappear. However, if the first add_disk() call fails and is retried, the queue_kobj gets initialized twice, causing:

  kobject: kobject (ffff88810c27bb90): tried to init an initialized object, something is seriously wrong.
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5b/0x80
   kobject_init.cold+0x43/0x51
   blk_register_queue+0x46/0x280
   add_disk_fwnode+0xb5/0x280
   dm_setup_md_queue+0x194/0x1c0
   table_load+0x297/0x2d0
   ctl_ioctl+0x2a2/0x480
   dm_ctl_ioctl+0xe/0x20
   __x64_sys_ioctl+0xc7/0x110
   do_syscall_64+0x72/0x390
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by separating kobject initialization from sysfs registration:

- Initialize queue_kobj early during gendisk allocation
- add_disk() only adds the already-initialized kobject to sysfs
- del_gendisk() removes it from sysfs but doesn't destroy the kobject
- Final cleanup happens when the disk is released

Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs") Reported-by: Li Lingfeng <lilingfeng3@huawei.com> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/ Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250808053609.3237836-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
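The split lifecycle means kobject_init() runs exactly once at allocation time, while registration only adds to and removes from sysfs and can therefore be retried safely. A minimal sketch of the pattern using the standard kobject API (not the exact diff):

    /* Sketch: one-time init at gendisk allocation ... */
    kobject_init(&q->queue_kobj, &blk_queue_ktype);

    /* ... while blk_register_queue() only adds to sysfs, and can be
     * safely retried if a previous add_disk() attempt failed: */
    ret = kobject_add(&q->queue_kobj, &disk_to_dev(disk)->kobj, "queue");

    /* del_gendisk(): remove from sysfs but keep the kobject alive */
    kobject_del(&q->queue_kobj);

    /* final disk release: drop the last reference, freeing via ktype */
    kobject_put(&q->queue_kobj);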
2025-07-30  block: fix potential deadlock while running nr_hw_queue update  (Nilay Shroff)
Move scheduler tags (sched_tags) allocation and deallocation outside both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues. This change breaks the dependency chain from the percpu allocator lock to the elevator lock, helping to prevent potential deadlocks, as observed in the reported lockdep splat[1]. This commit introduces batch allocation and deallocation helpers for sched_tags, which are now used from within __blk_mq_update_nr_hw_queues routine while iterating through the tagset. With this change, all sched_tags memory management is handled entirely outside the ->elevator_lock and the ->freeze_lock context, thereby eliminating the lock dependency that could otherwise manifest during nr_hw_queues updates. [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reported-by: Stefan Haberland <sth@linux.ibm.com> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250730074614.2537382-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-25  block: restore two stage elevator switch while running nr_hw_queue update  (Nilay Shroff)
The kmemleak reports memory leaks related to elevator resources that were originally allocated in the ->init_hctx() method. The following leak trace is observed after running blktests block/040 (two further objects, 0xffff8881b82f6000 and 0xffff8881b82f5800, are reported with identical hex dumps and backtraces):

  unreferenced object 0xffff8881b82f7400 (size 512):
    comm "check", pid 68454, jiffies 4310588881
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc 5bac8b34):
      __kvmalloc_node_noprof+0x55d/0x7a0
      sbitmap_init_node+0x15a/0x6a0
      kyber_init_hctx+0x316/0xb90
      blk_mq_init_sched+0x419/0x580
      elevator_switch+0x18b/0x630
      elv_update_nr_hw_queues+0x219/0x2c0
      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
      blk_mq_update_nr_hw_queues+0x3a/0x60
      0xffffffffc09ceb80
      0xffffffffc09d7e0b
      configfs_write_iter+0x2b1/0x470
      vfs_write+0x527/0xe70
      ksys_write+0xff/0x200
      do_syscall_64+0x98/0x3c0
      entry_SYSCALL_64_after_hwframe+0x76/0x7e

The issue arises while running a nr_hw_queues update. Specifically, we first reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and then later invoke elevator_switch() (assuming q->elevator is not NULL). The elevator switch code first exits the old elevator (elevator_exit) and then switches to the new elevator. elevator_exit loops through each hctx and invokes the elevator's per-hctx exit method ->exit_hctx(), which releases resources allocated during ->init_hctx().

This memleak manifests when we reduce the number of h/w queues - for example, when the initial update sets the number of queues to X, and a later update reduces it to Y, where Y < X. In this case, we'd lose access to the old hctxs by the time we get to the elevator exit code, because __blk_mq_realloc_hw_ctxs() would have already released them. As we no longer have any reference to the old hctxs, we have no way to free the scheduler resources (which are allocated in ->init_hctx()), and kmemleak complains about it.

This issue was caused by commit 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues"). That change unified the two-stage elevator teardown and reattachment into a single call that occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.

This patch restores the previous two-stage elevator switch logic during nr_hw_queues updates. First, the elevator is switched to 'none', which ensures all scheduler resources are properly freed. Then, the hardware contexts (hctxs) are reallocated, and the software-to-hardware queue mappings are updated. Finally, the original elevator is reattached. This sequence prevents loss of references to old hctxs and avoids the scheduler resource leaks reported by kmemleak.

Reported-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues") Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250724102540.1366308-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15  block: split blk_zone_update_request_bio into two functions  (Johannes Thumshirn)
blk_zone_update_request_bio() does two things. First it checks if the request to be completed was written via ZONE APPEND and if yes it then updates the sector to the one that the data was written to. This is small enough to be an inline function. But upcoming changes adding a tracepoint don't work if the function is inlined. Split the function into two, the first is blk_req_bio_is_zone_append() checking if the sector needs to be updated. This can still be an inline function. The second is blk_zone_append_update_request_bio() doing the sector update. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250715115324.53308-3-johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
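The resulting pair might look like this; a sketch based on the description above (the exact bodies, in particular the sector update, are assumptions):

    /* Sketch: cheap inline check, usable as a tracepoint condition. */
    static inline bool blk_req_bio_is_zone_append(struct request *rq,
                                                  struct bio *bio)
    {
            return req_op(rq) == REQ_OP_ZONE_APPEND ||
                   bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
    }

    /* Sketch: out-of-line sector update, where a tracepoint can hook in. */
    void blk_zone_append_update_request_bio(struct request *rq,
                                            struct bio *bio)
    {
            /* propagate the sector the device actually wrote to */
            bio->bi_iter.bi_sector = rq->__sector;
    }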
2025-06-30  block: Increase BLK_DEF_MAX_SECTORS_CAP  (Damien Le Moal)
Back in 2015, commit d2be537c3ba3 ("block: bump BLK_DEF_MAX_SECTORS to 2560") increased the default maximum size of a block device I/O to 2560 sectors (1280 KiB) to "accommodate a 10-data-disk stripe write with chunk size 128k". This choice is rather arbitrary, and since then, improvements to the block layer have let software RAID drivers correctly advertise their stripe width through chunk_sectors, and abuses of BLK_DEF_MAX_SECTORS_CAP by drivers (to set the HW limit rather than the default user-controlled maximum I/O size) have been fixed. Since many block devices can benefit from a larger value of BLK_DEF_MAX_SECTORS_CAP, and in particular HDDs, increase this value to 4MiB, or 8192 sectors. And given that BLK_DEF_MAX_SECTORS_CAP is only used in the block layer and should not be used by drivers directly, move this macro definition to the block layer internal header file block/blk.h. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250618060045.37593-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16  blk-mq: move the DMA mapping code to a separate file  (Christoph Hellwig)
While working on the new DMA API I kept getting annoyed how it was placed right in the middle of the bio splitting code in blk-merge.c. Split it out into a separate file. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250513071433.836797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13  block: remove the same_page output argument to bvec_try_merge_page  (Christoph Hellwig)
bvec_try_merge_page currently returns whether the added page fragment is within the same page as the last page in the current last bio_vec. This information is used by __bio_iov_iter_get_pages so that we always have a single folio pin per page, even when the page is split over multiple __bio_iov_iter_get_pages calls. Threading this through the entire low-level add-page-to-bio logic is annoying and inefficient, and leads to less code sharing than otherwise possible. Instead add code to __bio_iov_iter_get_pages that checks if the bio_vecs did not change, and thus a merge into the last segment must have happened, and if there is an offset into the page for the currently added fragment, because if yes we must have already had a previous fragment of the same page in the last bio_vec. While this is still a bit ugly, it keeps the logic in the one place that needs it and allows for more code sharing. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-10  block: export API to get the number of bdev inflight IO  (Yu Kuai)
- rename part_in_{flight,flight_rw} to bdev_count_{inflight,inflight_rw}
- export bdev_count_inflight, to fix a problem in mdraid where foreground IO can be starved by background sync IO in later patches

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-06  block: only update request sector if needed  (Johannes Thumshirn)
In the case of a ZONE APPEND write, regardless of native ZONE APPEND or the emulation layer in the zone write plugging code, the sector the data got written to by the device needs to be updated in the bio. At the moment, this is done for every native ZONE APPEND write and every request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this superfluously updates the sector for regular writes to a zoned block device. Check if a bio is a native ZONE APPEND write or if the bio is flagged as 'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging code handles the ZONE APPEND and translates it into a regular write and back. Only if one of these two criteria is met, update the sector in the bio upon completion. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  block: unifying elevator change  (Ming Lei)
Elevator change is one well-defined behavior:

- tear down the current elevator if it exists
- set up the new elevator

It is supposed to cover any case of changing the elevator via a single internal API, typically the following cases:

- setup of the default elevator in add_disk()
- switch to none in del_disk()
- reset of the elevator in blk_mq_update_nr_hw_queues()
- elevator switch via the sysfs `store` elevator attribute

This patch uses elevator_change() to cover all the above cases:

- every elevator switch is serialized with the others: add_disk/del_disk/store elevator is serialized already, and blk_mq_update_nr_hw_queues() uses srcu for syncing with the other three cases
- for both add_disk()/del_disk(), queue freeze works in atomic mode or the queue has already been frozen, so the freeze in elevator_change() won't add extra delay
- a `struct elev_change_ctx` instance holds all info for changing the elevator

Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-17-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  block: simplify elevator reattachment for updating nr_hw_queues  (Ming Lei)
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data depends on it, so the elevator has to be reattached; call elevator_switch() to force attachment. Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to reattach the elevator, since an elevator switch isn't likely while blk_mq_update_nr_hw_queues() is running. This removes the current switch-to-none and switch-back code. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06  block: look up the elevator type in elevator_switch  (Christoph Hellwig)
That makes the function nicely self-contained and can be used to avoid code duplication. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-11-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05  block: remove bounce buffering support  (Christoph Hellwig)
The block layer bounce buffering support is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24  block: don't autoload drivers on stat  (Christoph Hellwig)
blkdev_get_no_open can trigger the legacy autoload of block drivers. A simple stat of a block device has not historically done that, so disable this behavior again. Fixes: 9abcfbd235f5 ("block: Add atomic write support for statx") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24  block: move blkdev_{get,put}_no_open prototypes out of blkdev.h  (Christoph Hellwig)
These are only to be used by block internal code. Remove the comment as we grew more users due to reworking block device node opening. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-26  Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block updates from Jens Axboe:

 - Fixes for integrity handling

 - NVMe pull request via Keith:
     - Secure concatenation for TCP transport (Hannes)
     - Multipath sysfs visibility (Nilay)
     - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
     - Correct use of 64-bit BARs for pci-epf target (Niklas)
     - Socket fix for selinux when used in containers (Peijie)

 - MD pull request via Yu:
     - fix recovery can preempt resync (Li Nan)
     - fix md-bitmap IO limit (Su Yue)
     - fix raid10 discard with REQ_NOWAIT (Xiao Ni)
     - fix raid1 memory leak (Zheng Qixing)
     - fix mddev uaf (Yu Kuai)
     - fix raid1,raid10 IO flags (Yu Kuai)
     - some refactor and cleanup (Yu Kuai)

 - Series cleaning up and fixing bugs in the bad block handling code

 - Improve support for write failure simulation in null_blk

 - Various lock ordering fixes

 - Fixes for locking for debugfs attributes

 - Various ublk related fixes and improvements

 - Cleanups for blk-rq-qos wait handling

 - blk-throttle fixes

 - Fixes for loop dio and sync handling

 - Fixes and cleanups for the auto-PI code

 - Block side support for hardware encryption keys in blk-crypto

 - Various cleanups and fixes

* tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits)
  nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
  nvme-tcp: fix selinux denied when calling sock_sendmsg
  nvmet: pci-epf: Always configure BAR0 as 64-bit
  nvmet: Remove duplicate uuid_copy
  nvme: zns: Simplify nvme_zone_parse_entry()
  nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
  nvmet-fc: Remove unused functions
  nvme-pci: remove stale comment
  nvme-fc: Utilise min3() to simplify queue count calculation
  nvme-multipath: Add visibility for queue-depth io-policy
  nvme-multipath: Add visibility for numa io-policy
  nvme-multipath: Add visibility for round-robin io-policy
  nvmet: add tls_concat and tls_key debugfs entries
  nvmet-tcp: support secure channel concatenation
  nvmet: Add 'sq' argument to alloc_ctrl_args
  nvme-fabrics: reset admin connection for secure concatenation
  nvme-tcp: request secure channel concatenation
  nvme-keyring: add nvme_tls_psk_refresh()
  nvme: add nvme_auth_derive_tls_psk()
  nvme: add nvme_auth_generate_digest()
  ...
2025-03-03  block: split struct bio_integrity_payload  (Christoph Hellwig)
Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures. Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload. Always use a dedicated mempool for automatic integrity metadata instead of depending on bio_set that is submitter controlled and thus often doesn't have the mempool initialized and stop using mempools for the submitter buffers as they aren't in the NOIO I/O submission path where we need to guarantee forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25  block: make segment size limit workable for > 4K PAGE_SIZE  (Ming Lei)
Using PAGE_SIZE as the minimum expected DMA segment size, in consideration of devices which have a max DMA segment size of < 64k when used on 64k PAGE_SIZE systems, leads to such devices not being able to probe, for example eMMC and the Exynos UFS controller [0] [1]. You can end up with a probe failure as follows:

  WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0

Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new minimum segment size when the max segment size is < PAGE_SIZE, for 16k and 64k base page size systems.

If anyone needs to backport this patch, it depends on the following commits:

  commit 6aeb4f836480 ("block: remove bio_add_pc_page")
  commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
  commit b7175e24d6ac ("block: add a dma mapping iterator")

Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0] Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1] Cc: Yi Zhang <yi.zhang@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Cc: Keith Busch <kbusch@kernel.org> Tested-by: Paul Bunyan <pbunyan@redhat.com> Reviewed-by: Daniel Gomez <da.gomez@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
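The fix boils down to one clamp during limits validation; a hedged sketch of the computation named in the commit text (the surrounding code and the exact variable it feeds are assumptions):

    /* Sketch: when a device's max segment size is below PAGE_SIZE
     * (possible on 16k/64k page systems), derive the minimum expected
     * segment size from the device limits instead of using PAGE_SIZE. */
    unsigned int min_seg_size;

    if (lim->max_segment_size < PAGE_SIZE)
            min_seg_size = min(lim->max_segment_size,
                               (unsigned int)(lim->seg_boundary_mask + 1));
    else
            min_seg_size = PAGE_SIZE;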
2025-01-15  block: limit disk max sectors to (LLONG_MAX >> 9)  (Ming Lei)
Kernel `loff_t` is defined as `long long int`, so we can't support disks whose size is > LLONG_MAX. There are many virtual block drivers, and hardware may report a bad capacity too, so limit max sectors to (LLONG_MAX >> 9) to avoid potential trouble. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250115092648.1104452-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
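In effect the validation is a one-line cap; a sketch of the idea (the helper name is hypothetical, not from the patch):

    /* Sketch: a disk size in bytes must fit in loff_t (long long), so
     * the sector count must not exceed LLONG_MAX >> 9. */
    static inline bool disk_capacity_valid(sector_t sectors)
    {
            return sectors <= (LLONG_MAX >> SECTOR_SHIFT);
    }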
2025-01-04  block: remove bio_add_pc_page  (Christoph Hellwig)
Lift bio_split_rw_at into blk_rq_append_bio so that it validates the hardware limits. With this, all passthrough callers can simply use bio_add_page to build the bio and delay checking for exceeding the limits to this point, instead of doing it for each page. While this looks like adding a new expensive loop over all bio_vecs, blk_rq_append_bio is already doing that just to count the number of segments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20250103073417.459715-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  block: track queue dying state automatically for modeling queue freeze lockdep  (Ming Lei)
Now we only verify the outermost freeze & unfreeze in the current context in the case that !q->mq_freeze_depth, so it is reliable to save the queue dying state when we want to lock the freeze queue, since the state is a per-task variable now. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241127135133.3952153-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  block: track disk DEAD state automatically for modeling queue freeze lockdep  (Ming Lei)
Now we only verify the outermost freeze & unfreeze in the current context in the case that !q->mq_freeze_depth, so it is reliable to save the disk DEAD state when we want to lock the freeze queue, since the state is a per-task variable now. Doing it this way kills lots of false positives seen when the queue is frozen before the disk is added [1]. [1] https://lore.kernel.org/linux-block/6741f6b2.050a0220.1cc393.0017.GAE@google.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241127135133.3952153-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-11  block: lift bio_is_zone_append to bio.h  (Christoph Hellwig)
Make bio_is_zone_append globally available, because file systems need to use it to check for a zone append bio in their end_io handlers to deal with the block layer emulation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
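The helper is small enough to live in bio.h; a sketch of its likely shape, covering both native zone append and the block layer's emulation (the CONFIG guard is an assumption):

    /* Sketch: true for native zone append and for writes that the block
     * layer's zone write plugging translated from a zone append. */
    static inline bool bio_is_zone_append(struct bio *bio)
    {
            if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED))
                    return false;
            return bio_op(bio) == REQ_OP_ZONE_APPEND ||
                   bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
    }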
2024-11-07  block: always verify unfreeze lock on the owner task  (Ming Lei)
commit f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") tries to apply lockdep for verifying freeze & unfreeze. However, the verification is only done for the outermost freeze and unfreeze. This is actually not correct, because q->mq_freeze_depth may still drop to zero on another task instead of the freeze owner task. Fix this issue by always verifying the last unfreeze lock on the owner task context, and make sure both the outermost freeze & unfreeze are verified in the current task. Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07  block: remove blk_freeze_queue()  (Ming Lei)
No one uses blk_freeze_queue(), so remove it and the obsolete comment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26  block: model freeze & enter queue as lock for supporting lockdep  (Ming Lei)
Recently we got several deadlock reports [1][2][3] caused by blk_mq_freeze_queue() and blk_enter_queue(). It turns out the two are just like acquiring a read/write lock, so model them as a read/write lock for supporting lockdep:

1) model q->q_usage_counter as two locks (io and queue lock)
   - queue lock covers sync with blk_enter_queue()
   - io lock covers sync with bio_enter_queue()

2) make the lockdep class/key per-queue:
   - different subsystems have very different lock use patterns, and a shared lock class causes false positives easily
   - freeze_queue degrades to no lock in case the disk state becomes DEAD, because bio_enter_queue() won't be blocked any more
   - freeze_queue degrades to no lock in case the request queue becomes dying, because blk_enter_queue() won't be blocked any more

3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
   - it is an exclusive lock, so the dependency with blk_enter_queue() is covered
   - it is a trylock, because blk_mq_freeze_queue() is allowed to run concurrently

4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
   - nested blk_enter_queue() calls are allowed
   - the dependency with blk_mq_freeze_queue() is covered
   - blk_queue_exit() is often called from other contexts (such as irq), and it can't be annotated as lock_release(), so simply do it in blk_enter_queue(); this way still covers as many cases as possible

With lockdep support, such reports can be raised as soon as possible and needn't wait until the real deadlock is triggered. For example, the lockdep report can be triggered in report [3] with this patch applied.

[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
    https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition
    https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
[3] queue_freeze & queue_enter deadlock in scsi
    https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u

Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
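A sketch of the annotation pattern implied by points 3) and 4), using the standard lockdep rwsem annotation macros (the lockdep map name q_lockdep_map and the exact call sites are assumptions):

    /* Sketch: freeze acts like an exclusive trylock on the queue ... */
    rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_);   /* blk_mq_freeze_queue()   */
    rwsem_release(&q->q_lockdep_map, _RET_IP_);         /* blk_mq_unfreeze_queue() */

    /* ... while entering the queue acts like taking it for reading.
     * blk_queue_exit() may run in irq context, so the release is
     * annotated in the enter path itself, as described above: */
    rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);  /* blk_enter_queue() */
    rwsem_release(&q->q_lockdep_map, _RET_IP_);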
2024-10-22  block: add support for defining read-only partitions  (Christian Marangi)
Add support for defining read-only partitions and complete support for it in the cmdline partition parser, where the additional "ro" after a partition was scanned but never actually applied. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241002221306.4403-2-ansuelsmth@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22  block: kill blk_do_io_stat() helper  (Jens Axboe)
It's now just checking whether or not RQF_IO_STAT is set, so let's get rid of it and just open-code the specific flag that is being checked. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22  block: move iostat check into blk_account_io_start()  (Jens Axboe)
Rather than have blk_do_io_stat() check for both RQF_IO_STAT and whether the request is a passthrough requests every time, move both of those checks into blk_account_io_start(). Then blk_do_io_stat() can be reduced to just checking for RQF_IO_STAT. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
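After the move, the hot-path check reduces to a single flag test; a sketch consistent with the description above:

    /* Sketch: the passthrough and enablement checks now happen once in
     * blk_account_io_start(); the per-call check is just the flag. */
    static inline bool blk_do_io_stat(struct request *rq)
    {
            return rq->rq_flags & RQF_IO_STAT;
    }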
2024-09-11  block: implement async io_uring discard cmd  (Pavel Begunkov)
io_uring allows implementing custom file specific asynchronous operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD requests or just io_uring commands. Use it to add support for async discards. Normally, it first tries to queue up bios in a non-blocking context, and if that fails, we'd retry from a blocking context by returning -EAGAIN to the core io_uring. We always get the result from bios asynchronously by setting a custom bi_end_io callback, at which point we drag the request into the task context to either reissue or complete it and post a completion to the user. Unlike ioctl(BLKDISCARD), which has stronger guarantees against races, we only do a best-effort attempt to invalidate the page cache, and it can race with any writes and reads and leave the page cache stale. It's the same kind of race we allow for direct writes. Also, apart from cases where discarding is not allowed at all, e.g. discards are not supported or the file/device is read only, the user should assume that the sector range on disk is not valid anymore, even when an error was returned to the user. Suggested-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11  block: Added folio-ized version of bio_add_hw_page()  (Kundan Kumar)
Add a new bio_add_hw_folio() function as a wrapper around bio_add_hw_page(). This is a prep patch. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240911064935.5630-2-kundan.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29  block: don't use bio_split_rw on misc operations  (Christoph Hellwig)
bio_split_rw is designed to split read and write bios with a payload. Currently it is called by __bio_split_to_limits for all operations not explicitly listed, which works because bio_may_need_split explicitly checks for bi_vcnt == 1 and thus skips the bypass if there is no payload, and the bio_for_each_bvec loop will never execute its body if bi_size is 0. But all this is hard to understand, fragile, and wastes pointless cycles. Switch __bio_split_to_limits to only call bio_split_rw for READ and WRITE commands, and don't attempt any kind of split for operations that do not require splitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29  block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limits  (Christoph Hellwig)
Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in __bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND bios do not adhere to the soft max_limits value but instead use their own capped version of max_hw_sectors, leading to incorrect splits that later blow up in bio_split. We still need the bio_split_rw logic to count nr_segs for blk-mq code, so add a new wrapper that passes in the right limit, and turns any bio that would need a split into an error as an additional debugging aid. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29  block: rework bio splitting  (Christoph Hellwig)
The current setup with bio_may_exceed_limit and __bio_split_to_limits is a bit of a mess. Change it so that __bio_split_to_limits does all the work and is just a variant of bio_split_to_limits that returns nr_segs. This is done by inlining it and instead have the various bio_split_* helpers directly submit the potentially split bios. To support btrfs, the rw version has a lower level helper split out that just returns the offset to split. This turns out to nicely clean up the btrfs flow as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-22  Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block integrity mapping updates from Jens Axboe:
 "A set of cleanups and fixes for the block integrity support. Sent separately from the main block changes from last week, as they depended on later fixes in the 6.10-rc cycle"

* tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux:
  block: don't free the integrity payload in bio_integrity_unmap_free_user
  block: don't free submitter owned integrity payload on I/O completion
  block: call bio_integrity_unmap_free_user from blk_rq_unmap_user
  block: don't call bio_uninit from bio_endio
  block: also return bio_integrity_payload * from stubs
  block: split integrity support out of bio.h
2024-07-08  block: add a bvec_phys helper  (Christoph Hellwig)
Get callers out of poking into bvec internals a bit more. Not a huge win right now, but with the proposed new DMA mapping API we might end up with a lot more of this otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240706075228.2350978-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
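The helper is essentially one line; a sketch of its likely definition:

    /* Sketch: physical address of the data described by a bio_vec. */
    static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
    {
            return page_to_phys(bvec->bv_page) + bvec->bv_offset;
    }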
2024-07-03  block: don't free submitter owned integrity payload on I/O completion  (Christoph Hellwig)
Currently __bio_integrity_endio frees the integrity payload unless it is explicitly marked as user-mapped. This means in-kernel callers that allocate their own integrity payload never get to see it on I/O completion. The current two users don't need it, as they just pre-mapped PI tuples received over the network, but this limits uses of integrity data a lot. Change bio_integrity_endio to call __bio_integrity_endio for block layer generated integrity data only, and leave freeing of submitter-allocated integrity data to bio_uninit, which also gets called from the final bio_put. This requires that unmapping user-mapped or copied integrity data is now always done by the caller, and the special BIP_INTEGRITY_USER flag can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240702151047.1746127-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>