summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-05-19fs: remove the special !CONFIG_BLOCK def_blk_fopsChristoph Hellwig
def_blk_fops always returns -ENODEV, which dosn't match the return value of a non-existing block device with CONFIG_BLOCK, which is -ENXIO. Just remove the extra implementation and fall back to the default no_open_fops that always returns -ENXIO. Fixes: 9361401eb761 ("[PATCH] BLOCK: Make it possible to disable the block layer [try #6]") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230508144405.41792-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: BFQ: Add several invariant checksBart Van Assche
If anything goes wrong with the counters that track the number of requests, I/O locks up. Make such scenarios easier to debug by adding invariant checks for the request counters. Additionally, check that BFQ queues are empty before these are freed. Cc: Jan Kara <jack@suse.cz> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230516223853.1385255-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Fix handling of at-head zoned writesBart Van Assche
Before dispatching a zoned write from the FIFO list, check whether there are any zoned writes in the RB-tree with a lower LBA for the same zone. This patch ensures that zoned writes happen in order even if at_head is set for some writes for a zone and not for others. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-12-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Handle requeued requests correctlyBart Van Assche
Start dispatching from the start of a zone instead of from the starting position of the most recently dispatched request. If a zoned write is requeued with an LBA that is lower than already inserted zoned writes, make sure that it is submitted first. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-11-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Track the dispatch positionBart Van Assche
Track the position (sector_t) of the most recently dispatched request instead of tracking a pointer to the next request to dispatch. This patch is the basis for patch "Handle requeued requests correctly". Without this patch it would be significantly more complicated to make sure that zoned writes are dispatched in LBA order per zone. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Reduce lock contentionBart Van Assche
blk_mq_free_requests() calls dd_finish_request() indirectly. Prevent nested locking of dd->lock and dd->zone_lock by moving the code for freeing requests. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Simplify deadline_skip_seq_writes()Bart Van Assche
Make the deadline_skip_seq_writes() code shorter without changing its functionality. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Clean up deadline_check_fifo()Bart Van Assche
Change the return type of deadline_check_fifo() from 'int' into 'bool'. Use time_is_before_eq_jiffies() instead of time_after_eq(). No functionality has been changed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: Introduce blk_rq_is_seq_zoned_write()Bart Van Assche
Introduce the function blk_rq_is_seq_zoned_write(). This function will be used in later patches to preserve the order of zoned writes that require write serialization. This patch includes an optimization: instead of using rq->q->disk->part0->bd_queue to check whether or not the queue is associated with a zoned block device, use rq->q->disk->queue. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: Introduce op_needs_zoned_write_locking()Bart Van Assche
Introduce a helper function for checking whether write serialization is required if the operation will be sent to a zoned device. A second caller for op_needs_zoned_write_locking() will be introduced in the next patch in this series. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: Fix the type of the second bdev_op_is_zoned_write() argumentBart Van Assche
Change the type of the second argument of bdev_op_is_zoned_write() from blk_opf_t into enum req_op because this function expects an operation without flags as second argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Fixes: 8cafdb5ab94c ("block: adapt blk_mq_plug() to not plug for writes that require a zone lock") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: Simplify blk_req_needs_zone_write_lock()Bart Van Assche
Remove the blk_rq_is_passthrough() check because it is redundant: blk_req_needs_zone_write_lock() also calls bdev_op_is_zoned_write() and the latter function returns false for pass-through requests. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230517174230.897144-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: mq-deadline: Add a word in a source code commentBart Van Assche
Add the missing word "and". Cc: Damien Le Moal <dlemoal@kernel.org> Suggested-by: Damien Le Moal <dlemoal@kernel.org> Fixes: 945ffb60c11d ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18blk-mq: make sure elevator callbacks aren't called for passthrough requestChristoph Hellwig
In case of q->elevator, passthrough request can still be marked as RQF_ELV, so some elevator callbacks will be called for them. Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for non-flush, non-passthrough requests on such a queue. Roughly based on two different patches from Ming Lei <ming.lei@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18blk-mq: remove RQF_ELVPRIVChristoph Hellwig
RQF_ELVPRIV is set for all non-flush requests that have RQF_ELV set. Expand this condition in the two users of the flag and remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230518053101.760632-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18blk-mq: don't queue plugged passthrough requests into schedulerMing Lei
Passthrough requests should never be queued to the I/O scheduler, as scheduling these opaque requests doesn't make sense, and I/O schedulers might require req->bio to be always valid. We never let passthrough requests insert into the scheduler before commit 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging"), restore this behavior even for passthrough requests issued under a plug. [hch: use blk_mq_insert_requests for passthrough requests, fix up the commit message and comments] Reported-by: Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YosaYaUTEMU3uaf+y=8MqSrhL7sYsJn8EwbaM=76p_4Qg@mail.gmail.com/ Investigated-by: Yu Kuai <yukuai1@huaweicloud.com> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230518053101.760632-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: Decode all flag names in the debugfs outputBart Van Assche
See also: * Commit 4d337cebcb1c ("blk-mq: avoid to touch q->elevator without any protection"). * Commit 414dd48e882c ("blk-mq: add tagset quiesce interface"). Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230518222708.1190867-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-16brd: use XArray instead of radix-tree to index backing pagesPankaj Raghav
XArray was introduced to hold large array of pointers with a simple API. XArray API also provides array semantics which simplifies the way we store and access the backing pages, and the code becomes significantly easier to understand. No performance difference was noticed between the two implementation using fio with direct=1 [1]. [1] Performance in KIOPS: | radix-tree | XArray | Diff | | | write | 315 | 313 | -0.6% randwrite | 286 | 290 | +1.3% read | 330 | 335 | +1.5% randread | 309 | 312 | +0.9% Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230511121544.111648-1-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-14Linux 6.4-rc2v6.4-rc2Linus Torvalds
2023-05-14Merge tag 'cxl-fixes-6.4-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull compute express link fixes from Dan Williams: - Fix a compilation issue with DEFINE_STATIC_SRCU() in the unit tests - Fix leaking kernel memory to a root-only sysfs attribute * tag 'cxl-fixes-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl: Add missing return to cdat read error path tools/testing/cxl: Use DEFINE_STATIC_SRCU()
2023-05-14Merge tag 'parisc-for-6.4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc architecture fixes from Helge Deller: - Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag - Include reboot.h to avoid gcc-12 compiler warning * tag 'parisc-for-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag parisc: kexec: include reboot.h
2023-05-14Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds
Pull ARM fixes from Russell King: - fix unwinder for uleb128 case - fix kernel-doc warnings for HP Jornada 7xx - fix unbalanced stack on vfp success path * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9297/1: vfp: avoid unbalanced stack on 'success' return path ARM: 9296/1: HP Jornada 7XX: fix kernel-doc warnings ARM: 9295/1: unwind:fix unwind abort for uleb128 case
2023-05-14Merge tag 'locking_urgent_for_v6.4_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Make sure __down_read_common() is always inlined so that the callers' names land in traceevents output and thus the blocked function can be identified * tag 'locking_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers
2023-05-14Merge tag 'perf_urgent_for_v6.4_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure the PEBS buffer is flushed before reprogramming the hardware so that the correct record sizes are used - Update the sample size for AMD BRS events - Fix a confusion with using the same on-stack struct with different events in the event processing path * tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG perf/x86: Fix missing sample size update on AMD BRS perf/core: Fix perf_sample_data not properly initialized for different swevents in perf_tp_event()
2023-05-14Merge tag 'sched_urgent_for_v6.4_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Fix a couple of kernel-doc warnings * tag 'sched_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: fix cid_lock kernel-doc warnings
2023-05-14Merge tag 'x86_urgent_for_v6.4_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Add the required PCI IDs so that the generic SMN accesses provided by amd_nb.c work for drivers which switch to them. Add a PCI device ID to k10temp's table so that latter is loaded on such systems too * tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hwmon: (k10temp) Add PCI ID for family 19, model 78h x86/amd_nb: Add PCI ID for family 19h model 78h
2023-05-14Merge tag 'timers_urgent_for_v6.4_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Prevent CPU state corruption when an active clockevent broadcast device is replaced while the system is already in oneshot mode * tag 'timers_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/broadcast: Make broadcast device replacement work correctly
2023-05-13Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some ext4 bug fixes (mostly to address Syzbot reports)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: bail out of ext4_xattr_ibody_get() fails for any reason ext4: add bounds checking in get_max_inline_xattr_value_size() ext4: add indication of ro vs r/w mounts in the mount message ext4: fix deadlock when converting an inline directory in nojournal mode ext4: improve error recovery code paths in __ext4_remount() ext4: improve error handling from ext4_dirhash() ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled ext4: check iomap type only if ext4_iomap_begin() does not fail ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum ext4: fix data races when using cached status extents ext4: avoid deadlock in fs reclaim with page writeback ext4: fix invalid free tracking in ext4_xattr_move_to_block() ext4: remove a BUG_ON in ext4_mb_release_group_pa() ext4: allow ext4_get_group_info() to fail ext4: fix lockdep warning when enabling MMP ext4: fix WARNING in mb_find_extent
2023-05-13Merge tag 'fbdev-for-6.4-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes from Helge Deller: - use after free fix in imsttfb (Zheng Wang) - fix error handling in arcfb (Zongjie Li) - lots of whitespace cleanups (Thomas Zimmermann) - add 1920x1080 modedb entry (me) * tag 'fbdev-for-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbdev: stifb: Fix info entry in sti_struct on error path fbdev: modedb: Add 1920x1080 at 60 Hz video mode fbdev: imsttfb: Fix use after free bug in imsttfb_probe fbdev: vfb: Remove trailing whitespaces fbdev: valkyriefb: Remove trailing whitespaces fbdev: stifb: Remove trailing whitespaces fbdev: sa1100fb: Remove trailing whitespaces fbdev: platinumfb: Remove trailing whitespaces fbdev: p9100: Remove trailing whitespaces fbdev: maxinefb: Remove trailing whitespaces fbdev: macfb: Remove trailing whitespaces fbdev: hpfb: Remove trailing whitespaces fbdev: hgafb: Remove trailing whitespaces fbdev: g364fb: Remove trailing whitespaces fbdev: controlfb: Remove trailing whitespaces fbdev: cg14: Remove trailing whitespaces fbdev: atmel_lcdfb: Remove trailing whitespaces fbdev: 68328fb: Remove trailing whitespaces fbdev: arcfb: Fix error handling in arcfb_probe()
2023-05-13Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fix from James Bottomley: "A single small fix for the UFS driver to fix a power management failure" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ufs: core: Fix I/O hang that occurs when BKOPS fails in W-LUN suspend
2023-05-14parisc: Fix encoding of swp_entry due to added SWP_EXCLUSIVE flagHelge Deller
Fix the __swp_offset() and __swp_entry() macros due to commit 6d239fc78c0b ("parisc/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE") which introduced the SWP_EXCLUSIVE flag by reusing the _PAGE_ACCESSED flag. Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Helge Deller <deller@gmx.de> Fixes: 6d239fc78c0b ("parisc/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE") Cc: <stable@vger.kernel.org> # v6.3+
2023-05-13ext4: bail out of ext4_xattr_ibody_get() fails for any reasonTheodore Ts'o
In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any reason, it's best if we just fail as opposed to stumbling on, especially if the failure is EFSCORRUPTED. Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: add bounds checking in get_max_inline_xattr_value_size()Theodore Ts'o
Normally the extended attributes in the inode body would have been checked when the inode is first opened, but if someone is writing to the block device while the file system is mounted, it's possible for the inode table to get corrupted. Add bounds checking to avoid reading beyond the end of allocated memory if this happens. Reported-by: syzbot+1966db24521e5f6e23f7@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=1966db24521e5f6e23f7 Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: add indication of ro vs r/w mounts in the mount messageTheodore Ts'o
Whether the file system is mounted read-only or read/write is more important than the quota mode, which we are already printing. Add the ro vs r/w indication since this can be helpful in debugging problems from the console log. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: fix deadlock when converting an inline directory in nojournal modeTheodore Ts'o
In no journal mode, ext4_finish_convert_inline_dir() can self-deadlock by calling ext4_handle_dirty_dirblock() when it already has taken the directory lock. There is a similar self-deadlock in ext4_incvert_inline_data_nolock() for data files which we'll fix at the same time. A simple reproducer demonstrating the problem: mke2fs -Fq -t ext2 -O inline_data -b 4k /dev/vdc 64 mount -t ext4 -o dirsync /dev/vdc /vdc cd /vdc mkdir file0 cd file0 touch file0 touch file1 attr -s BurnSpaceInEA -V abcde . touch supercalifragilisticexpialidocious Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230507021608.1290720-1-tytso@mit.edu Reported-by: syzbot+91dccab7c64e2850a4e5@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=ba84cc80a9491d65416bc7877e1650c87530fe8a Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: improve error recovery code paths in __ext4_remount()Theodore Ts'o
If there are failures while changing the mount options in __ext4_remount(), we need to restore the old mount options. This commit fixes two problem. The first is there is a chance that we will free the old quota file names before a potential failure leading to a use-after-free. The second problem addressed in this commit is if there is a failed read/write to read-only transition, if the quota has already been suspended, we need to renable quota handling. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230506142419.984260-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: improve error handling from ext4_dirhash()Theodore Ts'o
The ext4_dirhash() will *almost* never fail, especially when the hash tree feature was first introduced. However, with the addition of support of encrypted, casefolded file names, that function can most certainly fail today. So make sure the callers of ext4_dirhash() properly check for failures, and reflect the errors back up to their callers. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu Reported-by: syzbot+394aa8a792cb99dbc837@syzkaller.appspotmail.com Reported-by: syzbot+344aaa8697ebd232bfc8@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=db56459ea4ac4a676ae4b4678f633e55da005a9b Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabledTheodore Ts'o
When a file system currently mounted read/only is remounted read/write, if we clear the SB_RDONLY flag too early, before the quota is initialized, and there is another process/thread constantly attempting to create a directory, it's possible to trigger the WARN_ON_ONCE(dquot_initialize_needed(inode)); in ext4_xattr_block_set(), with the following stack trace: WARNING: CPU: 0 PID: 5338 at fs/ext4/xattr.c:2141 ext4_xattr_block_set+0x2ef2/0x3680 RIP: 0010:ext4_xattr_block_set+0x2ef2/0x3680 fs/ext4/xattr.c:2141 Call Trace: ext4_xattr_set_handle+0xcd4/0x15c0 fs/ext4/xattr.c:2458 ext4_initxattrs+0xa3/0x110 fs/ext4/xattr_security.c:44 security_inode_init_security+0x2df/0x3f0 security/security.c:1147 __ext4_new_inode+0x347e/0x43d0 fs/ext4/ialloc.c:1324 ext4_mkdir+0x425/0xce0 fs/ext4/namei.c:2992 vfs_mkdir+0x29d/0x450 fs/namei.c:4038 do_mkdirat+0x264/0x520 fs/namei.c:4061 __do_sys_mkdirat fs/namei.c:4076 [inline] __se_sys_mkdirat fs/namei.c:4074 [inline] __x64_sys_mkdirat+0x89/0xa0 fs/namei.c:4074 Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu Reported-by: syzbot+6385d7d3065524c5ca6d@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=6513f6cb5cd6b5fc9f37e3bb70d273b94be9c34c Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: check iomap type only if ext4_iomap_begin() does not failBaokun Li
When ext4_iomap_overwrite_begin() calls ext4_iomap_begin() map blocks may fail for some reason (e.g. memory allocation failure, bare disk write), and later because "iomap->type ! = IOMAP_MAPPED" triggers WARN_ON(). When ext4 iomap_begin() returns an error, it is normal that the type of iomap->type may not match the expectation. Therefore, we only determine if iomap->type is as expected when ext4_iomap_begin() is executed successfully. Cc: stable@kernel.org Reported-by: syzbot+08106c4b7d60702dbc14@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/00000000000015760b05f9b4eee9@google.com Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230505132429.714648-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csumTudor Ambarus
When modifying the block device while it is mounted by the filesystem, syzbot reported the following: BUG: KASAN: slab-out-of-bounds in crc16+0x206/0x280 lib/crc16.c:58 Read of size 1 at addr ffff888075f5c0a8 by task syz-executor.2/15586 CPU: 1 PID: 15586 Comm: syz-executor.2 Not tainted 6.2.0-rc5-syzkaller-00205-gc96618275234 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:306 print_report+0x107/0x1f0 mm/kasan/report.c:417 kasan_report+0xcd/0x100 mm/kasan/report.c:517 crc16+0x206/0x280 lib/crc16.c:58 ext4_group_desc_csum+0x81b/0xb20 fs/ext4/super.c:3187 ext4_group_desc_csum_set+0x195/0x230 fs/ext4/super.c:3210 ext4_mb_clear_bb fs/ext4/mballoc.c:6027 [inline] ext4_free_blocks+0x191a/0x2810 fs/ext4/mballoc.c:6173 ext4_remove_blocks fs/ext4/extents.c:2527 [inline] ext4_ext_rm_leaf fs/ext4/extents.c:2710 [inline] ext4_ext_remove_space+0x24ef/0x46a0 fs/ext4/extents.c:2958 ext4_ext_truncate+0x177/0x220 fs/ext4/extents.c:4416 ext4_truncate+0xa6a/0xea0 fs/ext4/inode.c:4342 ext4_setattr+0x10c8/0x1930 fs/ext4/inode.c:5622 notify_change+0xe50/0x1100 fs/attr.c:482 do_truncate+0x200/0x2f0 fs/open.c:65 handle_truncate fs/namei.c:3216 [inline] do_open fs/namei.c:3561 [inline] path_openat+0x272b/0x2dd0 fs/namei.c:3714 do_filp_open+0x264/0x4f0 fs/namei.c:3741 do_sys_openat2+0x124/0x4e0 fs/open.c:1310 do_sys_open fs/open.c:1326 [inline] __do_sys_creat fs/open.c:1402 [inline] __se_sys_creat fs/open.c:1396 [inline] __x64_sys_creat+0x11f/0x160 fs/open.c:1396 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f72f8a8c0c9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f72f97e3168 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 00007f72f8bac050 RCX: 00007f72f8a8c0c9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000280 RBP: 00007f72f8ae7ae9 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd165348bf R14: 00007f72f97e3300 R15: 0000000000022000 Replace le16_to_cpu(sbi->s_es->s_desc_size) with sbi->s_desc_size It reduces ext4's compiled text size, and makes the code more efficient (we remove an extra indirect reference and a potential byte swap on big endian systems), and there is no downside. It also avoids the potential KASAN / syzkaller failure, as a bonus. Reported-by: syzbot+fc51227e7100c9294894@syzkaller.appspotmail.com Reported-by: syzbot+8785e41224a3afd04321@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=70d28d11ab14bd7938f3e088365252aa923cff42 Link: https://syzkaller.appspot.com/bug?id=b85721b38583ecc6b5e72ff524c67302abbc30f3 Link: https://lore.kernel.org/all/000000000000ece18705f3b20934@google.com/ Fixes: 717d50e4971b ("Ext4: Uninitialized Block Groups") Cc: stable@vger.kernel.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Link: https://lore.kernel.org/r/20230504121525.3275886-1-tudor.ambarus@linaro.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: fix data races when using cached status extentsJan Kara
When using cached extent stored in extent status tree in tree->cache_es another process holding ei->i_es_lock for reading can be racing with us setting new value of tree->cache_es. If the compiler would decide to refetch tree->cache_es at an unfortunate moment, it could result in a bogus in_range() check. Fix the possible race by using READ_ONCE() when using tree->cache_es only under ei->i_es_lock for reading. Cc: stable@kernel.org Reported-by: syzbot+4a03518df1e31b537066@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000d3b33905fa0fd4a6@google.com Suggested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230504125524.10802-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: avoid deadlock in fs reclaim with page writebackJan Kara
Ext4 has a filesystem wide lock protecting ext4_writepages() calls to avoid races with switching of journalled data flag or inode format. This lock can however cause a deadlock like: CPU0 CPU1 ext4_writepages() percpu_down_read(sbi->s_writepages_rwsem); ext4_change_inode_journal_flag() percpu_down_write(sbi->s_writepages_rwsem); - blocks, all readers block from now on ext4_do_writepages() ext4_init_io_end() kmem_cache_zalloc(io_end_cachep, GFP_KERNEL) fs_reclaim frees dentry... dentry_unlink_inode() iput() - last ref => iput_final() - inode dirty => write_inode_now()... ext4_writepages() tries to acquire sbi->s_writepages_rwsem and blocks forever Make sure we cannot recurse into filesystem reclaim from writeback code to avoid the deadlock. Reported-by: syzbot+6898da502aef574c5f8a@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/0000000000004c66b405fa108e27@google.com Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230504124723.20205-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: fix invalid free tracking in ext4_xattr_move_to_block()Theodore Ts'o
In ext4_xattr_move_to_block(), the value of the extended attribute which we need to move to an external block may be allocated by kvmalloc() if the value is stored in an external inode. So at the end of the function the code tried to check if this was the case by testing entry->e_value_inum. However, at this point, the pointer to the xattr entry is no longer valid, because it was removed from the original location where it had been stored. So we could end up calling kvfree() on a pointer which was not allocated by kvmalloc(); or we could also potentially leak memory by not freeing the buffer when it should be freed. Fix this by storing whether it should be freed in a separate variable. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430160426.581366-1-tytso@mit.edu Link: https://syzkaller.appspot.com/bug?id=5c2aee8256e30b55ccf57312c16d88417adbd5e1 Link: https://syzkaller.appspot.com/bug?id=41a6b5d4917c0412eb3b3c3c604965bed7d7420b Reported-by: syzbot+64b645917ce07d89bde5@syzkaller.appspotmail.com Reported-by: syzbot+0d042627c4f2ad332195@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: remove a BUG_ON in ext4_mb_release_group_pa()Theodore Ts'o
If a malicious fuzzer overwrites the ext4 superblock while it is mounted such that the s_first_data_block is set to a very large number, the calculation of the block group can underflow, and trigger a BUG_ON check. Change this to be an ext4_warning so that we don't crash the kernel. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-3-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13ext4: allow ext4_get_group_info() to failTheodore Ts'o
Previously, ext4_get_group_info() would treat an invalid group number as BUG(), since in theory it should never happen. However, if a malicious attaker (or fuzzer) modifies the superblock via the block device while it is the file system is mounted, it is possible for s_first_data_block to get set to a very large number. In that case, when calculating the block group of some block number (such as the starting block of a preallocation region), could result in an underflow and very large block group number. Then the BUG_ON check in ext4_get_group_info() would fire, resutling in a denial of service attack that can be triggered by root or someone with write access to the block device. For a quality of implementation perspective, it's best that even if the system administrator does something that they shouldn't, that it will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info() will call ext4_error and return NULL. We also add fallback code in all of the callers of ext4_get_group_info() that it might NULL. Also, since ext4_get_group_info() was already borderline to be an inline function, un-inline it. The results in a next reduction of the compiled text size of ext4 by roughly 2k. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2023-05-13Merge tag 'block-6.4-2023-05-13' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: "Just a few minor fixes for drivers, and a deletion of a file that is woefully out-of-date these days" * tag 'block-6.4-2023-05-13' of git://git.kernel.dk/linux: Documentation/block: drop the request.rst file ublk: fix command op code check block/rnbd: replace REQ_OP_FLUSH with REQ_OP_WRITE nbd: Fix debugfs_create_dir error checking
2023-05-13cxl: Add missing return to cdat read error pathDave Jiang
Add a return to the error path when cxl_cdat_read_table() fails. Current code continues with the table pointer points to freed memory. Fixes: 7a877c923995 ("cxl/pci: Simplify CDAT retrieval error path") Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://lore.kernel.org/r/168382793506.3510737.4792518576623749076.stgit@djiang5-mobl3 Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-05-13tools/testing/cxl: Use DEFINE_STATIC_SRCU()Dan Williams
Starting with commit: 95433f726301 ("srcu: Begin offloading srcu_struct fields to srcu_update") ...it is no longer possible to do: static DEFINE_SRCU(x) Switch to DEFINE_STATIC_SRCU(x) to fix: tools/testing/cxl/test/mock.c:22:1: error: duplicate ‘static’ 22 | static DEFINE_SRCU(cxl_mock_srcu); | ^~~~~~ Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/168392709546.1135523.10424917245934547117.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-05-12x86/retbleed: Fix return thunk alignmentBorislav Petkov (AMD)
SYM_FUNC_START_LOCAL_NOALIGN() adds an endbr leading to this layout (leaving only the last 2 bytes of the address): 3bff <zen_untrain_ret>: 3bff: f3 0f 1e fa endbr64 3c03: f6 test $0xcc,%bl 3c04 <__x86_return_thunk>: 3c04: c3 ret 3c05: cc int3 3c06: 0f ae e8 lfence However, "the RET at __x86_return_thunk must be on a 64 byte boundary, for alignment within the BTB." Use SYM_START instead. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-12Merge tag 'for-6.4-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs fixes from David Sterba: - fix incorrect number of bitmap entries for space cache if loading is interrupted by some error - fix backref walking, this breaks a mode of LOGICAL_INO_V2 ioctl that is used in deduplication tools - zoned mode fixes: - properly finish zone reserved for relocation - correctly calculate super block zone end on ZNS - properly initialize new extent buffer for redirty - make mount option clear_cache work with block-group-tree, to rebuild free-space-tree instead of temporarily disabling it that would lead to a forced read-only mount - fix alignment check for offset when printing extent item * tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: make clear_cache mount option to rebuild FST without disabling it btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add btrfs: zoned: fix full zone super block reading on ZNS btrfs: zoned: zone finish data relocation BG with last IO btrfs: fix backref walking not returning all inode refs btrfs: fix space cache inconsistency after error loading it from disk btrfs: print-tree: parent bytenr must be aligned to sector size