path: root/fs/btrfs/inode.c
Age  Commit message  Author
2025-05-15  btrfs: rename set_extent_bit() to include a btrfs prefix  (Filipe Manana)
This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So rename it to btrfs_set_extent_bit(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
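A hedged illustration of the rename only (the call site and arguments below are invented for this example, not taken from the patch):

    /* before: a generic name that does not identify the subsystem */
    set_extent_bit(&inode->io_tree, start, end, EXTENT_DELALLOC, &cached_state);

    /* after: the btrfs_ prefix makes the origin of the exported symbol clear */
    btrfs_set_extent_bit(&inode->io_tree, start, end, EXTENT_DELALLOC, &cached_state);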
2025-05-15  btrfs: rename the functions to clear bits for an extent range  (Filipe Manana)
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename __lock_extent() and __try_lock_extent()  (Filipe Manana)
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Their double underscore prefix is also discouraged. So remove their double underscore prefix, add a 'btrfs_' prefix to their name to make it clear they are from btrfs and a '_bits' suffix to avoid collision with btrfs_lock_extent() and btrfs_try_lock_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: add btrfs prefix to main lock, try lock and unlock extent functions  (Filipe Manana)
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_encoded_read_inline()  (David Sterba)
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
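The conversion pattern itself looks roughly like the following sketch (the lookup call is a placeholder; only the BTRFS_PATH_AUTO_FREE usage is the point):

    /* Before: every exit path has to remember btrfs_free_path(). */
    struct btrfs_path *path;
    int ret;

    path = btrfs_alloc_path();
    if (!path)
            return -ENOMEM;
    ret = lookup_something(path);   /* placeholder for the real work */
    btrfs_free_path(path);
    return ret;

    /* After: the macro attaches a __free(btrfs_free_path) cleanup to the
     * declaration, so goto-to-free exits become plain returns. */
    BTRFS_PATH_AUTO_FREE(path);

    path = btrfs_alloc_path();
    if (!path)
            return -ENOMEM;
    return lookup_something(path);  /* path is freed automatically */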
2025-05-15  btrfs: use BTRFS_PATH_AUTO_FREE in can_nocow_extent()  (David Sterba)
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_set_inode_index_count()  (David Sterba)
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use BTRFS_PATH_AUTO_FREE in may_destroy_subvol()  (David Sterba)
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: do more trivial BTRFS_PATH_AUTO_FREE conversions  (David Sterba)
The most trivial pattern for the auto freeing: the variable is declared with the macro and the final btrfs_free_path() is removed. There are almost no goto -> return conversions and there's no other function cleanup. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use clear_extent_bits() instead of clear_extent_bit() where possible  (Filipe Manana)
Several places are using clear_extent_bit() and passing a NULL value for the 'cached' argument, which is pointless as they can instead use clear_extent_bits(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
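A hedged before/after sketch of the cleanup (tree, range and bits are placeholders):

    /* before: explicitly passing a NULL cached state */
    clear_extent_bit(&inode->io_tree, start, end, EXTENT_DEFRAG, NULL);

    /* after: the _bits() helper is exactly this call without the argument */
    clear_extent_bits(&inode->io_tree, start, end, EXTENT_DEFRAG);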
2025-05-15  btrfs: allow folios to be released while ordered extent is finishing  (Filipe Manana)
When the release_folio callback (from struct address_space_operations) is invoked we don't allow the folio to be released if its range is currently locked in the inode's io_tree, as it may indicate the folio may be needed by the task that locked the range.

However if the range is locked because an ordered extent is finishing, then we can safely allow the folio to be released because ordered extent completion doesn't need to use the folio at all.

When we are under memory pressure, the kernel starts writeback of dirty pages (folios) with the goal of releasing the pages from the page cache after writeback completes, however this often is not possible on btrfs because:

* Once the writeback completes we queue the ordered extent completion;

* Once the ordered extent completion starts, we lock the range in the inode's io_tree (at btrfs_finish_one_ordered());

* If the release_folio callback is called while the folio's range is locked in the inode's io_tree, we don't allow the folio to be released, so the kernel has to try to release memory elsewhere, which may result in triggering more writeback or releasing other pages from the page cache which may be more useful to have around for applications.

In contrast, when the release_folio callback is invoked after writeback finishes and before ordered extent completion starts or locks the range, we allow the folio to be released, as well as when the release_folio callback is invoked after ordered extent completion unlocks the range.

Improve on this by detecting if the range is locked for ordered extent completion and if it is, allow the folio to be released. This detection is achieved by adding a new extent flag in the io_tree that is set when the range is locked during ordered extent completion. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: remove leftover EXTENT_UPTODATE clear from an inode's io_tree  (Filipe Manana)
After commit 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's io_tree anymore, but we still have some code attempting to clear that bit from an inode's io_tree. Remove that code as it doesn't do anything anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded extents io_tree (fs_info->excluded_extents), which is used to track the locations of super blocks, so that their ranges are never marked as free, making them unavailable for extent allocation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: move block perfect compression out of experimental features  (Qu Wenruo)
Commit 1d2fbb7f1f9e ("btrfs: allow compression even if the range is not page aligned") introduced the block perfect compression for block size < page size cases. Before that commit, if the fs block size is smaller than page size (aka subpage cases), compressed write is only enabled if the dirty range is fully page aligned. This block perfect compression support was introduced in v6.13, and has been tested for two kernel releases. I believe it's time to move it out of experimental features so that we can get more tests in the real world. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-14  Merge tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)
Pull btrfs fixes from David Sterba:

- fix potential endless loop when discarding a block group when disabling discard
- reinstate message when setting a large value of mount option 'commit'
- fix a folio leak when async extent submission fails

* tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add back warning for mount option commit values exceeding 300
  btrfs: fix folio leak in submit_one_async_extent()
  btrfs: fix discard worker infinite loop after disabling discard
2025-05-12  btrfs: fix folio leak in submit_one_async_extent()  (Boris Burkov)
If btrfs_reserve_extent() fails while submitting an async_extent for a compressed write, then we fail to call free_async_extent_pages() on the async_extent and leak its folios. A likely cause for such a failure would be btrfs_reserve_extent() failing to find a large enough contiguous free extent for the compressed extent. I was able to reproduce this by:

1. mount with compress-force=zstd:3
2. fallocating most of a filesystem to a big file
3. fragmenting the remaining free space
4. trying to copy in a file which zstd would generate large compressed extents for (vmlinux worked well for this)

Step 4. hits the memory leak and can be repeated ad nauseam to eventually exhaust the system memory.

Fix this by detecting the case where we fallback to uncompressed submission for a compressed async_extent and ensuring that we call free_async_extent_pages(). Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Co-developed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
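The shape of the fix is roughly as follows (a sketch assuming the fallback happens right after the failed reservation; names other than free_async_extent_pages() are paraphrased):

    ret = btrfs_reserve_extent(...);
    if (ret) {
            /*
             * We are falling back to uncompressed submission for this
             * async_extent, so its compressed folios will never reach the
             * compressed write path; free them here instead of leaking them.
             */
            free_async_extent_pages(async_extent);
            /* ...then submit the range uncompressed as before... */
    }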
2025-04-30  Merge tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)
Pull btrfs fixes from David Sterba:

- fix potential inode leak in iget() after memory allocation failure
- in subpage mode, fix extent buffer bitmap iteration when writing out dirty sectors
- fix range calculation when falling back to COW for a NOCOW file

* tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: adjust subpage bit start based on sectorsize
  btrfs: fix the inode leak in btrfs_iget()
  btrfs: fix COW handling in run_delalloc_nocow()
2025-04-23  btrfs: fix the inode leak in btrfs_iget()  (Penglei Jiang)
[BUG]
There is a bug report that a syzbot reproducer can lead to the following busy inode at unmount time:

  BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50
  VFS: Busy inodes after unmount of loop1 (btrfs)
  ------------[ cut here ]------------
  kernel BUG at fs/super.c:650!
  Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
  CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full)
  Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650
  Call Trace:
   <TASK>
   kill_anon_super+0x3a/0x60 fs/super.c:1237
   btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099
   deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
   deactivate_super fs/super.c:506 [inline]
   deactivate_super+0xe2/0x100 fs/super.c:502
   cleanup_mnt+0x21f/0x440 fs/namespace.c:1435
   task_work_run+0x14d/0x240 kernel/task_work.c:227
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
   exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
   syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218
   do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

[CAUSE]
When btrfs_alloc_path() failed, btrfs_iget() directly returned without releasing the inode already allocated by btrfs_iget_locked(). This results in the above busy inode and triggers the kernel BUG.

[FIX]
Fix it by calling iget_failed() if btrfs_alloc_path() failed. If we hit an error inside btrfs_read_locked_inode(), it will properly call iget_failed(), so there is nothing to worry about in that path. Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a break of the normal error handling scheme, let's fix the obvious bug and backport first, then rework the error handling later.

Reported-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gmail.com/ Fixes: 7c855e16ab72 ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()") CC: stable@vger.kernel.org # 6.13+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
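In code, the fix amounts to roughly this (a sketch, not the literal diff; the local variable naming is paraphrased):

    path = btrfs_alloc_path();
    if (!path) {
            /*
             * Drop the partially set up inode returned by btrfs_iget_locked();
             * otherwise it stays hashed and unmount later trips over a busy
             * inode.
             */
            iget_failed(&inode->vfs_inode);
            return ERR_PTR(-ENOMEM);
    }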
2025-04-23  btrfs: fix COW handling in run_delalloc_nocow()  (Dave Chen)
In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset, it indicates a gap between the current processing region and the next file extent. The original code would directly jump to the "must_cow" label, which increments the slot and forces a fallback to COW. This behavior might skip an extent item and result in an overestimated COW fallback range. This patch modifies the logic so that when a gap is detected:

- If no COW range is already being recorded (cow_start is unset), cow_start is set to cur_offset.
- cur_offset is then advanced to the beginning of the next extent.
- Instead of jumping to "must_cow", control flows directly to "next_slot" so that the same extent item can be reexamined properly.

The change ensures that we accurately account for the extent gap and avoid accidentally extending the range that needs to fall back to COW. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-27  Merge tag 'for-6.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)
Pull btrfs updates from David Sterba:

"User visible changes:

- fall back to buffered write if direct io is done on a file that requires checksums
  - this avoids a problem with checksum mismatch errors, observed e.g. on virtual images when writes to pages under writeback cause the checksum mismatch reports
  - this may lead to some performance degradation but currently the recommended setup for VM images is to use the NOCOW file attribute that also disables checksums

- fast/realtime zstd levels -15 to -1
  - supported by mount options (compress=zstd:-5) and defrag ioctl
  - improved speed, reduced compression ratio, check the commit for sample measurements

- defrag ioctl extended to accept negative compression levels

- subpage mode
  - remove warning when subpage mode is used, the feature is now reasonably complete and tested
  - in debug mode allow to create 2K b-tree nodes to allow testing subpage on x86_64 with 4K pages too

Performance improvements:

- in send, better file path caching improves runtime (on sample load by -30%)
- on s390x with hardware zlib support prepare the input buffer in a better way to get the best results from the acceleration
- minor speed improvement in encoded read, avoid memory allocation in synchronous mode

Core:

- enable stable writes on inodes, replacing manually waiting for writeback and allowing to skip that on inodes without checksums
- add last checks and warnings for out-of-band dirty writes to pages, requiring a fixup ("fixup worker"), this should not be necessary since 5.8 where get_user_page() and pin_user_pages*() prevent this
  - long history behind that, we'll be happy to remove the whole infrastructure in the near future
- more folio API conversions and preparations for large folio support
- subpage cleanups and refactoring, split handling of data and metadata to allow future support for large folios
  - readpage works as block-by-block, no change for normal mode, this is preparation for future subpage updates
- block group refcount fixes and hardening
- delayed iput fixes
- in zoned mode, fix zone activation on filesystem with missing devices

Cleanups:

- inode parameter cleanups
- path auto-freeing updates
- code flow simplifications in send
- redundant parameter cleanups"

* tag 'for-6.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits)
  btrfs: zoned: fix zone finishing with missing devices
  btrfs: zoned: fix zone activation with missing devices
  btrfs: remove end_no_trans label from btrfs_log_inode_parent()
  btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent()
  btrfs: remove redundant else statement from btrfs_log_inode_parent()
  btrfs: use memcmp_extent_buffer() at replay_one_extent()
  btrfs: update outdated comment for overwrite_item()
  btrfs: use variables to store extent buffer and slot at overwrite_item()
  btrfs: avoid unnecessary memory allocation and copy at overwrite_item()
  btrfs: don't clobber ret in btrfs_validate_super()
  btrfs: prepare btrfs_page_mkwrite() for large folios
  btrfs: prepare extent_io.c for future large folio support
  btrfs: prepare btrfs_launcher_folio() for large folios support
  btrfs: replace PAGE_SIZE with folio_size for subpage.[ch]
  btrfs: add a size parameter to btrfs_alloc_subpage()
  btrfs: subpage: make btrfs_is_subpage() check against a folio
  btrfs: add extra warning if delayed iput is added when it's not allowed
  btrfs: avoid redundant path slot assignment in btrfs_search_forward()
  btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()
  btrfs: simplify the return value handling in search_ioctl()
  ...
2025-03-24  Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)
Pull vfs async dir updates from Christian Brauner:
"This contains cleanups that fell out of the work from async directory handling:

- Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places

- Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely

- Drop an unnecessary call to fh_update() from nfsd_create_locked()

- Change i_op->mkdir() to return a struct dentry

  Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand.

- Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-18  btrfs: prepare btrfs_launcher_folio() for large folios support  (Qu Wenruo)
That function is only calling btrfs_qgroup_free_data(), which doesn't care about the size of the folio. Just replace the fixed PAGE_SIZE with folio_size(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: subpage: make btrfs_is_subpage() check against a folio  (Qu Wenruo)
To support large data folios, we can no longer assume every filemap folio is page sized. So the btrfs_is_subpage() check must be done against a folio. Thankfully for metadata folios we have full control and ensure a large folio will not be larger than the nodesize, so btrfs_meta_is_subpage() doesn't need this change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
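A possible shape of the new check, assuming it simply compares the block size with the folio size (this is an assumption for illustration, not the verified implementation):

    static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                                        struct folio *folio)
    {
            /* a folio is "subpage" only if it is larger than one fs block */
            return fs_info->sectorsize < folio_size(folio);
    }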
2025-03-18  btrfs: add extra warning if delayed iput is added when it's not allowed  (Qu Wenruo)
Since I have triggered the ASSERT() on the delayed iput too many times, now is the time to add some extra debug warnings for delayed iputs. All delayed iputs should be queued after all ordered extents finish their IO and all involved workqueues are flushed. Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there should be no more delayed iputs added. So introduce a new flag, BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above-mentioned point. All btrfs_add_delayed_iput() calls will then check that flag and give a WARN_ON_ONCE(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
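Based on the description above, the new guard boils down to something like this in btrfs_add_delayed_iput() (sketch of the check, exact placement paraphrased):

    /*
     * After close_ctree() has flushed ordered extents and run the final
     * btrfs_run_delayed_iputs(), nobody will ever process new delayed
     * iputs, so queueing one at this point is a bug worth warning about.
     */
    WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state));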
2025-03-18  btrfs: make btrfs_iget_path() return a btrfs inode instead  (Filipe Manana)
It's an internal function and btrfs_iget() is now returning a btrfs inode, so change btrfs_iget_path() to also return a btrfs inode instead of a VFS inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: make btrfs_iget() return a btrfs inode instead  (Filipe Manana)
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so change the return type to struct btrfs_inode instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
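Callers change along these lines (illustrative sketch; the surrounding error handling is representative only):

    /* before: a VFS inode comes back and is immediately converted */
    struct inode *vfs_inode = btrfs_iget(ino, root);
    if (IS_ERR(vfs_inode))
            return PTR_ERR(vfs_inode);
    struct btrfs_inode *inode = BTRFS_I(vfs_inode);

    /* after: the btrfs inode is returned directly, no BTRFS_I() needed */
    struct btrfs_inode *inode = btrfs_iget(ino, root);
    if (IS_ERR(inode))
            return PTR_ERR(inode);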
2025-03-18  btrfs: defrag: extend ioctl to accept compression levels  (Daniel Vacek)
The zstd and zlib compression types support setting compression levels. Enhance the defrag interface to specify the levels as well. For zstd the negative (realtime) levels are also accepted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: reject out-of-band dirty folios during writeback  (Qu Wenruo)
[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band folio means the folio is marked dirty without notifying the filesystem. This can lead to various problems, not limited to:

- No folio::private to track per block status
- No proper space reserved for such a dirty folio

[HISTORY IN BTRFS]
This used to be a problem related to get_user_page(), but with the introduction of pin_user_pages*(), we should no longer hit such a case. In btrfs, we have a long history of catching such out-of-band dirty folios by:

- Marking the folio ordered during delayed allocation
- Checking the folio ordered flag during writeback

If the folio has no ordered flag, it means it didn't go through delayed allocation, thus it's definitely an out-of-band one. If we get one, we go through COW fixup, which will re-dirty the folio with proper handling in another workqueue.

[PROBLEMS OF COW-FIXUP]
Such a workaround is a blockage for us to migrate to iomap (it requires extra flags to trace if a folio is dirtied by the fs or not) and I'd argue it's not data checksum safe, since if a folio can be marked dirty without informing the fs, the content can also change at any time. But with the introduction of pin_user_pages*() during the v5.8 merge window, such an out-of-band dirty folio should be treated as a bug. Ext4 has treated such cases by warning and erroring out even before pin_user_pages*(). Furthermore, there are already proofs that such folio ordered flag tracking can be screwed up by incorrect error handling; check the commit messages of the following commits:

  06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed")
  c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed")

[FIXES]
Unlike btrfs, ext4 and xfs (iomap) never bother handling such out-of-band dirty folios:

- Ext4 just warns and errors out
- Iomap always follows the folio/block dirty flags

And there is nothing really COW specific here; xfs also supports COW. Here we take one step towards ext4 by warning and erroring out. But since the cow fixup thing has been there from the beginning, we keep the old behavior for non-experimental builds, and only do the new warning for experimental builds before we're 100% sure and remove cow fixup. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: unify inode variable naming  (David Sterba)
Rename binode to inode in local variables or parameters so it's more unified with the rest of the code. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()  (David Sterba)
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's an internal interface. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: properly limit inline data extent according to block size  (Qu Wenruo)
Btrfs utilizes inline data extents for the following cases:

- Regular small files
- Symlinks

And "btrfs check" detects any file extents that are too large as an error. It's not a problem for 4K block size, but for the incoming smaller block sizes (2K), it can cause problems due to bad limits:

- Non-compressed inline data extents
  We do not allow a non-compressed inline data extent to be as large as the block size.

- Symlinks
  Currently the only real limit on symlinks is 4K, which can be larger than the 2K block size.

These will result in btrfs-check reporting too large file extents. Fix it by adding proper size checks for the above cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
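The added limits amount to checks of roughly this shape (the exact functions and return values are paraphrased; only the sectorsize-based bounds come from the commit message):

    /* inline-extent eligibility: uncompressed inline data must be
     * strictly smaller than one fs block */
    if (!compressed && size >= fs_info->sectorsize)
            return false;

    /* symlink creation: the target is always stored inline, so it must
     * also fit within one fs block */
    if (name_len >= fs_info->sectorsize)
            return -ENAMETOOLONG;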
2025-03-18  btrfs: allow inline data extents creation if block size < page size  (Qu Wenruo)
Previously inline data extent creation was disabled if the block size (previously called sector size) is smaller than the page size, for the following reasons:

- Possible mixed inline and regular data extents
  However this can equally happen when the block size matches the page size, and we do not treat mixed inline and regular extents as an error there. The chance of causing mixed inline and regular data extents is not even increased; it has the same requirement (a compressed inline data extent covering the whole first block, followed by regular extents).

- Inability to handle async/inline delalloc ranges for block size < page size cases
  This is already fixed since commit 1d2fbb7f1f9e ("btrfs: allow compression even if the range is not page aligned"). This was the major technical obstacle, but it's not anymore.

With that removed, we can enable inline data extent creation no matter the block size or the page size, allowing btrfs to have the same capacity for all block sizes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: fix the qgroup data free range for inline data extents  (Qu Wenruo)
Inside the function __cow_file_range_inline(), since the inlined data no longer takes any data space, we need to free up the reserved space. However the code is still using the old page size == sector size assumption, and will not handle the subpage case well. Thankfully it is not going to cause any problems because we have two extra safety nets:

- Inline data extent creation is disabled for sector size < page size cases for now
  But it won't stay that way for long.

- btrfs_qgroup_free_data() will only clear ranges which have already been reserved
  So even if we pass a range larger than what we need, it should still be fine, especially since there is only reserved space for a single block at file offset 0 of an inline data extent.

But just for the sake of consistency, fix the call site to use sectorsize instead of page size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
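The call-site change is essentially one constant (trailing arguments are elided; treat this as a sketch rather than the exact signature):

    /* before: assumed block size == page size */
    btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, ...);

    /* after: an inline extent only ever has one block reserved at offset 0 */
    btrfs_qgroup_free_data(inode, NULL, 0, fs_info->sectorsize, ...);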
2025-03-18  btrfs: prevent inline data extents read from touching blocks beyond its range  (Qu Wenruo)
Currently reading an inline data extent will zero out the remaining range in the page. This is not yet causing problems even for block size < page size (subpage) cases because:

1) An inline data extent always starts at file offset 0
   Meaning at page read, we always read the inline extent first, before any other blocks in the page. Then later blocks are properly read out and re-fill the zeroed out ranges.

2) Currently btrfs will read out the whole page if a buffered write is not page aligned
   So a page is either fully uptodate at buffered write time (covers the whole page), or we will read out the whole page first. Meaning there is nothing to lose for such an inline extent read.

But it's still not ideal:

- We're zeroing out the page twice
  Once done by read_inline_extent()/uncompress_inline(), once done by btrfs_do_readpage() for ranges beyond i_size.

- We're touching blocks that don't belong to the inline extent
  In the incoming patches, we can have a partial uptodate folio, of which some dirty blocks can exist while the page is not fully uptodate. The page size is 16K and block size is 4K:

    0         4K        8K        12K       16K
    |         |         |/////////|         |

  And range [8K, 12K) is dirtied by a buffered write, the remaining blocks are not uptodate. If range [0, 4K) contains an inline data extent, and we try to read the whole page, the current behavior will overwrite range [8K, 12K) with zero and cause data loss.

So to make the behavior more consistent and in preparation for future changes, limit the inline data extent read to only zero out the range inside the first block, not the whole page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_inode_inherit_props()  (David Sterba)
Pass a struct btrfs_inode to btrfs_inode_inherit_props() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_load_inode_props()  (David Sterba)
Pass a struct btrfs_inode to btrfs_load_inode_props() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_fill_inode()  (David Sterba)
Pass a struct btrfs_inode to btrfs_fill_inode() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_inode_type()  (David Sterba)
Pass a struct btrfs_inode to btrfs_inode_type() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to new_simple_dir()  (David Sterba)
Pass a struct btrfs_inode to new_simple_dir() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_iget_locked()  (David Sterba)
Pass a struct btrfs_inode to btrfs_iget_locked() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to btrfs_read_locked_inode()  (David Sterba)
Pass a struct btrfs_inode to btrfs_read_locked_inode() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to extent_range_clear_dirty_for_io()  (David Sterba)
Pass a struct btrfs_inode to extent_range_clear_dirty_for_io() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: pass struct btrfs_inode to can_nocow_extent()  (David Sterba)
Pass a struct btrfs_inode to can_nocow_extent() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: unify ordering of btrfs_key initializations  (David Sterba)
The btrfs_key is defined as objectid/type/offset and the keys are also printed like that. For better readability, update all key initializations to match this order. Signed-off-by: David Sterba <dsterba@suse.com>
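For example, an initialization block is simply reordered to match the objectid/type/offset layout (values are illustrative):

    struct btrfs_key key;

    /* same order as the key is defined, compared and printed */
    key.objectid = btrfs_ino(inode);
    key.type = BTRFS_EXTENT_DATA_KEY;
    key.offset = start;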
2025-03-18  btrfs: zstd: enable negative compression levels mount option  (Daniel Vacek)
Allow using the fast modes (negative compression levels) of zstd as a mount option. As per the results, the compression ratio is (expectedly) lower:

  for level in {-15..-1} 1 2 3; \
  do printf "level %3d\n" $level; \
     mount -o compress=zstd:$level /dev/sdb /mnt/test/; \
     grep sdb /proc/mounts; \
     cp -r /usr/bin /mnt/test/; sync; compsize /mnt/test/bin; \
     cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \
     cp enwik9 /mnt/test/; sync; compsize /mnt/test/enwik9; \
     cp linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \
     rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \
     umount /mnt/test/; \
  done |& tee results | \
  awk '/^level/{print}/^TOTAL/{print$3"\t"$2" |"}' | paste - - - - -

               266M bin |   45M doc | 953M wiki | 1.4G source
  =====================+===========+===========+=============+
  level -15   180M 67% |  30M 68%  | 694M 72%  |  598M 40%  |
  level -14   180M 67% |  30M 67%  | 683M 71%  |  581M 39%  |
  level -13   177M 66% |  29M 66%  | 671M 70%  |  566M 38%  |
  level -12   174M 65% |  29M 65%  | 658M 69%  |  548M 37%  |
  level -11   174M 65% |  28M 64%  | 645M 67%  |  530M 35%  |
  level -10   171M 64% |  28M 62%  | 631M 66%  |  512M 34%  |
  level  -9   165M 62% |  27M 61%  | 615M 64%  |  493M 33%  |
  level  -8   161M 60% |  27M 59%  | 598M 62%  |  475M 32%  |
  level  -7   155M 58% |  26M 58%  | 582M 61%  |  457M 30%  |
  level  -6   151M 56% |  25M 56%  | 565M 59%  |  437M 29%  |
  level  -5   145M 54% |  24M 55%  | 545M 57%  |  417M 28%  |
  level  -4   139M 52% |  23M 52%  | 520M 54%  |  391M 26%  |
  level  -3   135M 50% |  22M 50%  | 495M 51%  |  369M 24%  |
  level  -2   127M 47% |  22M 48%  | 470M 49%  |  349M 23%  |
  level  -1   120M 45% |  21M 47%  | 452M 47%  |  332M 22%  |
  level   1   110M 41% |  17M 39%  | 362M 38%  |  290M 19%  |
  level   2   106M 40% |  17M 38%  | 349M 36%  |  288M 19%  |
  level   3   104M 39% |  16M 37%  | 340M 35%  |  276M 18%  |

The samples represent some data sets that can be commonly found and show approximate compressibility. The fast levels trade off speed for ratio and are best suitable for highly compressible data. As can be seen above, comparing the results to the current default zstd level 3, the negative levels are roughly 2x worse at -15 and the ratio increases almost linearly with each level. Signed-off-by: Daniel Vacek <neelx@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: move ordered extent cleanup to where they are allocated  (Qu Wenruo)
The ordered extent cleanup is hard to grasp because it doesn't follow the common cleanup-asap pattern. E.g. run_delalloc_nocow() and cow_file_range() allocate one or more ordered extents, but if any error is hit, the cleanup is done later inside btrfs_run_delalloc_range(). To change the existing delayed cleanup:

- Update the comment on error handling of run_delalloc_nocow()
  There are in fact 3 different cases, not 2, if we are doing ordered extent cleanup inside run_delalloc_nocow():

  1) @cow_start and @cow_end not set
     No fallback to COW at all. Before @cur_offset we need to clean up the OE and page dirty flags. After @cur_offset just clear all involved page and extent flags.

  2) @cow_start set but @cow_end not set
     This means we failed before even calling fallback_to_cow(). It's just a variant of case 1), where it's @cow_start splitting the two parts (and we should just ignore @cur_offset since it's advanced without any new ordered extent).

  3) @cow_start and @cow_end both set
     This means fallback_to_cow() failed, meaning [start, cow_start) needs the regular OE and dirty folio cleanup, range [cow_start, cow_end) is skipped as cow_file_range() has done the cleanup there, and eventually the [cow_end, end) range is cleaned up.

- Only reset @cow_start after fallback_to_cow() succeeded
  As above, cases 2) and 3) both rely on @cow_start to determine the cleanup range.

- Move btrfs_cleanup_ordered_extents() into run_delalloc_nocow(), cow_file_range() and nocow_one_range()
  For cow_file_range() it's pretty straightforward and easy. For run_delalloc_nocow() refer to the above 3 different error cases. For nocow_one_range(), if we hit an error we need to clean up the ordered extents by ourselves; it then falls back to case 1), and since @cur_offset is not yet advanced, the existing cleanup will co-operate with nocow_one_range() well.

- Remove the btrfs_cleanup_ordered_extents() inside submit_uncompressed_range()
  As a failed cow_file_range() will do all the proper cleanup now.

Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: factor out nocow ordered extent and extent map generation into a helper  (Qu Wenruo)
Currently we're doing all the ordered extent and extent map generation inside a while() loop of run_delalloc_nocow(). This makes it pretty hard to read and to do proper error handling in. So move that part of the code into a helper, nocow_one_range(). This should not change anything, but there is a tiny timing change where btrfs_dec_nocow_writers() is only called after the nocow_one_range() helper exits. This timing change is small and makes error handling easier, thus it should be fine. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: expose per-inode stable writes flag  (Qu Wenruo)
The address space flag AS_STABLE_WRITES determines whether FGP_STABLE will wait for the folio to finish its writeback. For btrfs, due to the default data checksum behavior, if we modify a folio while it's still under writeback, it will cause a data checksum mismatch. Thus for quite some call sites we manually call folio_wait_writeback() to prevent such problems from happening. Currently there is only one call site inside btrfs really utilizing FGP_STABLE, and in that case we also manually call folio_wait_writeback() to do the waiting. But it's better to properly expose the stable writes flag on a per-inode basis, to allow call sites to fully benefit from the FGP_STABLE flag. E.g. for inodes with NODATASUM, dirtying a page can begin without waiting for writeback. This involves:

- Update the mapping's stable writes flag when setting/clearing the NODATASUM inode flag using the ioctl
  This only works for empty files, so it should be fine.

- Update the mapping's stable writes flag when reading an inode from disk

- Remove the explicit folio_wait_writeback() for the FGP_BEGINWRITE call site

Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
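Conceptually the per-inode propagation is a small check run from the places listed above (a sketch assuming the generic mapping_set/clear_stable_writes() helpers are used; the exact btrfs helper is not shown here):

    /*
     * Inodes with data checksums need stable pages: a folio must not be
     * modified while under writeback or the already computed checksum
     * would no longer match the data. NODATASUM inodes can skip the wait.
     */
    if (inode->flags & BTRFS_INODE_NODATASUM)
            mapping_clear_stable_writes(inode->vfs_inode.i_mapping);
    else
            mapping_set_stable_writes(inode->vfs_inode.i_mapping);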
2025-03-17  btrfs: update some folio related comments  (Matthew Wilcox (Oracle))
Remove references to the page lock and page->mapping. Also btrfs folios can never be swizzled into swap (mentioned in extent_write_cache_pages()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-17  btrfs: keep private struct on stack for sync reads in btrfs_encoded_read_regular_fill_pages()  (Daniel Vacek)
Only allocate the btrfs_encoded_read_private structure for asynchronous (io_uring) mode. There's no need to allocate an object from slab in the synchronous mode. In such a case the stack can be happily used as it used to be before 68d3b27e05c7 ("btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()"), which was a preparation for the async mode. While at it, fix the comment to reflect the atomic => refcount change in d29662695ed7 ("btrfs: fix use-after-free waiting for encoded read endios"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
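The resulting allocation logic is roughly the following (sketch; the io_uring detection and field names are paraphrased):

    struct btrfs_encoded_read_private sync_priv, *priv;

    if (uring_ctx) {
            /* async completion outlives this function: allocate from slab */
            priv = kmalloc(sizeof(*priv), GFP_NOFS);
            if (!priv)
                    return -ENOMEM;
    } else {
            /* synchronous read: we wait right here, so the stack object
             * is sufficient and avoids a slab allocation */
            priv = &sync_priv;
    }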
2025-03-07  Merge tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds)
Pull btrfs fixes from David Sterba:

- fix leaked extent map after error when reading chunks
- replace use of deprecated strncpy
- in zoned mode, fix incorrect range when unlocking an extent range, which caused a hang

* tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix a leaked chunk map issue in read_one_chunk()
  btrfs: replace deprecated strncpy() with strscpy()
  btrfs: zoned: fix extent range end unlock in cow_file_range()