summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
10 daysxfs: remove unused event xlog_iclog_want_syncSteven Rostedt
The trace event xlog_iclog_want_sync was added but never used. As trace events can take up around 5K of memory in text and meta data regardless if they are used or not, remove this unused event. Fixes: 956f6daa84bf ("xfs: add iclog state trace events") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
10 daysxfs: remove unused trace event xfs_attr_remove_iter_returnSteven Rostedt
When the function xfs_attri_remove_iter was removed, it did not remove the trace event that it called. As a trace event can take up to 5K of memory for text and meta data regardless of if it is used or not, remove this unused trace event. Fixes: 59782a236b62 ("xfs: remove xfs_attri_remove_iter") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
10 dayserofs: support to readahead dirent blocks in erofs_readdir()Chao Yu
This patch supports to readahead more blocks in erofs_readdir(), it can enhance readdir performance in large direcotry. readdir test in a large directory which contains 12000 sub-files. files_per_second Before: 926385.54 After: 2380435.562 Meanwhile, let's introduces a new sysfs entry to control readahead bytes to provide more flexible policy for readahead of readdir(). - location: /sys/fs/erofs/<disk>/dir_ra_bytes - default value: 16384 - disable readahead: set the value to 0 Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250721021352.2495371-1-chao@kernel.org [ Gao Xiang: minor styling adjustment. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: implement metadata compressionBo Liu (OpenAnolis)
Thanks to the meta buffer infrastructure, metadata-compressed inodes are just read from the metabox inode instead of the blockdevice (or backing file) inode. The same is true for shared extended attributes. When metadata compression is enabled, inode numbers are divided from on-disk NIDs because of non-LTS 32-bit application compatibility. Co-developed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250722003229.2121752-1-hsiangkao@linux.alibaba.com
10 dayserofs: add on-disk definition for metadata compressionGao Xiang
Filesystem metadata has a high degree of redundancy, so it should compress well in the general case. Although metadata compression can increase overall I/O latency, many users care more about minimized image sizes than extreme runtime performance. Let's implement metadata compression in response to user requests [1]. Actually, it's quite simple to implement metadata compression: since EROFS already supports per-inode compression, we can simply treat a special inode (called `the metabox inode`) as a container for compressed inode metadata. Since EROFS supports multiple algorithms, users can even specify LZ4 for metadata and LZMA for data. To better support incremental builds, the MSB of NIDs indicates where the inode metadata is located: if bit 63 is set, the inode itself should be read from `the metabox inode`. Optionally, shared xattrs can also be kept in `the metabox inode` if COMPAT_SHARED_EA_IN_METABOX is set. [1] https://issues.redhat.com/browse/RHEL-75783 Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250717070804.1446345-2-hsiangkao@linux.alibaba.com
10 dayserofs: fix build error with CONFIG_EROFS_FS_ZIP_ACCEL=yBo Liu (OpenAnolis)
fix build err: ld.lld: error: undefined symbol: crypto_req_done referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_decompress) in archive vmlinux.a referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_decompress) in archive vmlinux.a ld.lld: error: undefined symbol: crypto_acomp_decompress referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_decompress) in archive vmlinux.a ld.lld: error: undefined symbol: crypto_alloc_acomp referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_enable_engine) in archive vmlinux.a Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507161032.QholMPtn-lkp@intel.com/ Fixes: b4a29efc5146 ("erofs: support DEFLATE decompression by using Intel QAT") Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Link: https://lore.kernel.org/r/20250718033039.3609-1-liubo03@inspur.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: remove ENOATTR definitionGao Xiang
ENOATTR is not defined in Linux; use ENODATA instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250717042317.1218597-1-hsiangkao@linux.alibaba.com
10 dayserofs: refine erofs_iomap_begin()Gao Xiang
- Avoid calling erofs_map_dev() for unmapped extents; - Assign `iomap->addr` for inline extents too (since they have physical location). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250716092254.3826715-1-hsiangkao@linux.alibaba.com
10 dayserofs: unify meta buffers in z_erofs_fill_inode()Gao Xiang
There is no need to keep additional local metabufs since we already have one in `struct erofs_map_blocks`. This was actually a leftover when applying meta buffers to zmap operations, see commit 09c543798c3c ("erofs: use meta buffers for zmap operations"). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250716064152.3537457-1-hsiangkao@linux.alibaba.com
10 dayserofs: remove need_kmap in erofs_read_metabuf()Gao Xiang
- need_kmap is always true except for a ztailpacking case; thus, just open-code that one; - The upcoming metadata compression will add a new boolean, so simplify this first. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250714090907.4095645-1-hsiangkao@linux.alibaba.com
10 dayserofs: do sanity check on m->type in z_erofs_load_compact_lcluster()Chao Yu
All below functions will do sanity check on m->type, let's move sanity check to z_erofs_load_compact_lcluster() for cleanup. - z_erofs_map_blocks_fo - z_erofs_get_extent_compressedlen - z_erofs_get_extent_decompressedlen - z_erofs_extent_lookback Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250708110928.3110375-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
10 dayserofs: get rid of {get,put}_page() for ztailpacking dataGao Xiang
The compressed data for the ztailpacking feature is fetched from the metadata inode (e.g., bd_inode), which is folio-based. Therefore, the folio interface should be used instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250626085459.339830-1-hsiangkao@linux.alibaba.com
11 daysfix the regression in ufs options parsingAl Viro
A really dumb braino on rebasing and a dumber fuckup with managing #for-next Fixes: b70cb459890b ("ufs: convert ufs to the new mount API") Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
11 dayseventpoll: fix sphinx documentation build warningJann Horn
Sphinx complains that ep_get_upwards_depth_proc() has a kerneldoc-style comment without documenting its parameters. This is an internal function that was not meant to show up in kernel documentation, so fix the warning by changing the comment to a non-kerneldoc one. Fixes: 22bacca48a17 ("epoll: prevent creating circular epoll structures") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/r/20250717173655.10ecdce6@canb.auug.org.au Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507171958.aMcW08Cn-lkp@intel.com/ Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/20250721-epoll-sphinx-fix-v1-1-b695c92bf009@google.com Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
11 daysafs: Set vllist to NULL if addr parsing failsEdward Adam Davis
syzbot reported a bug in in afs_put_vlserverlist. kAFS: bad VL server IP address BUG: unable to handle page fault for address: fffffffffffffffa ... Oops: Oops: 0002 [#1] SMP KASAN PTI ... RIP: 0010:refcount_dec_and_test include/linux/refcount.h:450 [inline] RIP: 0010:afs_put_vlserverlist+0x3a/0x220 fs/afs/vl_list.c:67 ... Call Trace: <TASK> afs_alloc_cell fs/afs/cell.c:218 [inline] afs_lookup_cell+0x12a5/0x1680 fs/afs/cell.c:264 afs_cell_init+0x17a/0x380 fs/afs/cell.c:386 afs_proc_rootcell_write+0x21f/0x290 fs/afs/proc.c:247 proc_simple_write+0x114/0x1b0 fs/proc/generic.c:825 pde_write fs/proc/inode.c:330 [inline] proc_reg_write+0x23d/0x330 fs/proc/inode.c:342 vfs_write+0x25c/0x1180 fs/read_write.c:682 ksys_write+0x12a/0x240 fs/read_write.c:736 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x260 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Because afs_parse_text_addrs() parses incorrectly, its return value -EINVAL is assigned to vllist, which results in -EINVAL being used as the vllist address when afs_put_vlserverlist() is executed. Set the vllist value to NULL when a parsing error occurs to avoid this issue. Fixes: e2c2cb8ef07a ("afs: Simplify cell record handling") Reported-by: syzbot+5c042fbab0b292c98fc6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5c042fbab0b292c98fc6 Tested-by: syzbot+5c042fbab0b292c98fc6@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/4119365.1753108011@warthog.procyon.org.uk cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
11 daysafs: Fix check for NULL terminatorLeo Stone
Add a missing check for reaching the end of the string while attempting to split a command. Fixes: f94f70d39cc2 ("afs: Provide a way to configure address priorities") Reported-by: syzbot+7741f872f3c53385a2e2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7741f872f3c53385a2e2 Signed-off-by: Leo Stone <leocstone@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/4119428.1753108152@warthog.procyon.org.uk Acked-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
12 daysfs/orangefs: Allow 2 more characters in do_c_string()Dan Carpenter
The do_k_string() and do_c_string() functions do essentially the same thing which is they add a string and a comma onto the end of an existing string. At the end, the caller will overwrite the last comma with a newline. Later, in orangefs_kernel_debug_init(), we add a newline to the string. The change to do_k_string() is just cosmetic. I moved the "- 1" to the other side of the comparison and made it "+ 1". This has no effect on runtime, I just wanted the functions to match each other and the rest of the file. However in do_c_string(), I removed the "- 2" which allows us to print two extra characters. I noticed this issue while reviewing the code and I doubt affects anything in real life. My guess is that this was double counting the comma and the newline. The "+ 1" accounts for the newline, and the caller will delete the final comma which ensures there is enough space for the newline. Removing the "- 2" lets us print 2 more characters, but mainly it makes the code more consistent and understandable for reviewers. Fixes: 44f4641073f1 ("orangefs: clean up debugfs globals") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
12 daysstackleak: Rename STACKLEAK to KSTACK_ERASEKees Cook
In preparation for adding Clang sanitizer coverage stack depth tracking that can support stack depth callbacks: - Add the new top-level CONFIG_KSTACK_ERASE option which will be implemented either with the stackleak GCC plugin, or with the Clang stack depth callback support. - Rename CONFIG_GCC_PLUGIN_STACKLEAK as needed to CONFIG_KSTACK_ERASE, but keep it for anything specific to the GCC plugin itself. - Rename all exposed "STACKLEAK" names and files to "KSTACK_ERASE" (named for what it does rather than what it protects against), but leave as many of the internals alone as possible to avoid even more churn. While here, also split "prev_lowest_stack" into CONFIG_KSTACK_ERASE_METRICS, since that's the only place it is referenced from. Suggested-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250717232519.2984886-1-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
12 daysbtrfs: send: use fallocate for hole punching with send stream v2Filipe Manana
Currently holes are sent as writes full of zeroes, which results in unnecessarily using disk space at the receiving end and increasing the stream size. In some cases we avoid sending writes of zeroes, like during a full send operation where we just skip writes for holes. But for some cases we fill previous holes with writes of zeroes too, like in this scenario: 1) We have a file with a hole in the range [2M, 3M), we snapshot the subvolume and do a full send. The range [2M, 3M) stays as a hole at the receiver since we skip sending write commands full of zeroes; 2) We punch a hole for the range [3M, 4M) in our file, so that now it has a 2M hole in the range [2M, 4M), and snapshot the subvolume. Now if we do an incremental send, we will send write commands full of zeroes for the range [2M, 4M), removing the hole for [2M, 3M) at the receiver. We could improve cases such as this last one by doing additional comparisons of file extent items (or their absence) between the parent and send snapshots, but that's a lot of code to add plus additional CPU and IO costs. Since the send stream v2 already has a fallocate command and btrfs-progs implements a callback to execute fallocate since the send stream v2 support was added to it, update the kernel to use fallocate for punching holes for V2+ streams. Test coverage is provided by btrfs/284 which is a version of btrfs/007 that exercises send stream v2 instead of v1, using fsstress with random operations and fssum to verify file contents. Link: https://github.com/kdave/btrfs-progs/issues/1001 CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: unfold transaction aborts when writing dirty block groupsFilipe Manana
We have a single transaction abort call that can be due to an error from one of two calls to update_block_group_item(). Unfold the transaction abort calls so that if they happen we know which update_block_group_item() call failed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: use saner variable type and name to indicate extrefs at add_inode_ref()Filipe Manana
We are using a variable named 'log_ref_ver' of type int to indicate if we are processing an extref item or not, using a value of 1 if so, otherwise 0. This is an odd name and type, so rename it to 'is_extref_item' and change its type to bool. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: don't skip remaining extrefs if dir not found during log replayFilipe Manana
During log replay, at add_inode_ref(), if we have an extref item that contains multiple extrefs and one of them points to a directory that does not exist in the subvolume tree, we are supposed to ignore it and process the remaining extrefs encoded in the extref item, since each extref can point to a different parent inode. However when that happens we just return from the function and ignore the remaining extrefs. The problem has been around since extrefs were introduced, in commit f186373fef00 ("btrfs: extended inode refs"), but it's hard to hit in practice because getting extref items encoding multiple extref requires getting a hash collision when computing the offset of the extref's key. The offset if computed like this: key.offset = btrfs_extref_hash(dir_ino, name->name, name->len); and btrfs_extref_hash() is just a wrapper around crc32c(). Fix this by moving to next iteration of the loop when we don't find the parent directory that an extref points to. Fixes: f186373fef00 ("btrfs: extended inode refs") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: don't ignore inode missing when replaying log treeFilipe Manana
During log replay, at add_inode_ref(), we return -ENOENT if our current inode isn't found on the subvolume tree or if a parent directory isn't found. The error comes from btrfs_iget_logging() <- btrfs_iget() <- btrfs_read_locked_inode(). The single caller of add_inode_ref(), replay_one_buffer(), ignores an -ENOENT error because it expects that error to mean only that a parent directory wasn't found and that is ok. Before commit 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") we were converting any error when getting a parent directory to -ENOENT and any error when getting the current inode to -EIO, so our caller would fail log replay in case we can't find the current inode. After that commit however in case the current inode is not found we return -ENOENT to the caller and therefore it ignores the critical fact that the current inode was not found in the subvolume tree. Fix this by converting -ENOENT to 0 when we don't find a parent directory, returning -ENOENT when we don't find the current inode and making the caller, replay_one_buffer(), not ignore -ENOENT anymore. Fixes: 5f61b961599a ("btrfs: fix inode lookup error handling during log replay") CC: stable@vger.kernel.org # 6.16 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: enable large data folios for data reloc inodeQu Wenruo
For data reloc inodes, they are a special type of inodes that are not exposed to user space, and are only utilized during data block groups relocation. They do not go under regular read-write operations, but have their file extents manually created to have the same layout of a block group, then its content is read from the original block group, and written back to the new location which is in a new block group. Previously all the handling was done in page units, and commit c2832898126f ("btrfs: make relocate_one_page() handle subpage case") changed the handling to subpage blocks. On the other hand, data reloc inodes are a perfect match for large data folios, as each relocation cluster represents one or more data extents that are contiguous in their logical addresses. This patch enables large folios for data reloc inodes by: - Remove the special handling of data reloc inodes when setting folio order - Change relocate_one_folio() to return the file offset of the next folio Originally it's designed to handle fixed page sized blocks, but with large folios, we can handle a large folio, thus we have to return the end of the current folio. - Remove the warning on folio_order() - Use folio_size() to replace fixed PAGE_SIZE usage - Use file_offset as iterator inside relocate_file_extent_cluster Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: output more info when btrfs_subpage_assert() failedQu Wenruo
The function btrfs_subpage_assert() is a very commonly utilized assert to make sure the range passed in is correct inside the folio. And when some code is not properly subpage/large folio compatible btrfs_subpage_assert() will be the first to be triggered. E.g. when I incorrectly enabled large folios for data reloc inodes, it immediately triggered btrfs_subpage_assert(). In that case, outputting all the involved members will be very helpful, this includes: - start - len - folio position inside the mapping - folio size Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: reloc: unconditionally invalidate the page cache for each clusterQu Wenruo
Commit 9d9ea1e68a05 ("btrfs: subpage: fix relocation potentially overwriting last page data") fixed a bug when relocating data block groups for subpage cases. However for the incoming large folios for data reloc inode, we can hit the same situation where block size is the same as page size, but the folio we got is still larger than a block. In that case, the old subpage specific check is no longer reliable. Here we have to enhance the handling by: - Unconditionally invalidate the page cache for the current cluster We set the @flush to true so that any dirty folios are properly written back first. And this time instead of dropping the whole page cache, just drop the range covered by the current cluster. This will bring some minor performance drop, as for a large folio, the heading half will be read twice (read by previous cluster, then invalidated, then read again by the current cluster). However that is required to support large folios, and this gets rid of the kinda tricky manual uptodate flag clearing for each block. - Remove the special handling of writing back the whole page cache filemap_invalidate_inode() handles the write back already, and since we're invalidating all pages in the range, we no longer need to manually clear the uptodate flags for involved blocks. Thus there is no need to manually write back the whole page cache. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: defrag: add flag to force no-compressionDavid Sterba
Currently the defrag ioctl cannot rewrite the extents without compression. Add a new flag for that, as setting compression to 0 (or "no compression") means to do no changes to compression so take what is the current default, like mount options or properties. The defrag setting overrides mount or properties. The compression BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and does not need to have a fixed value. Mount with zstd:9, copy test file from /usr/bin/ (about 260KB): $ mount -o compress=zstd:9 /dev/vda /mnt $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 13312.. 13439: 128: encoded 1: 128.. 255: 13364.. 13491: 128: 13440: encoded 2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof testfile: 3 extents found $ compsize testfile Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 42% 124K 292K 292K zstd 42% 124K 292K 292K Defrag to uncompressed: $ btrfs fi defrag --nocomp testfile $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 291: 291840.. 292131: 292: last,eof testfile: 1 extent found $ compsize testfile Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 100% 292K 292K 292K none 100% 292K 292K 292K Compress again with LZO: $ btrfs fi defrag -clzo testfile $ filefrag -vsb testfile filefrag: -b needs a blocksize option, assuming 1024-byte blocks. Filesystem type is: 9123683e File size of testfile is 297704 (292 blocks of 1024 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 127: 13312.. 13439: 128: encoded 1: 128.. 255: 13392.. 13519: 128: 13440: encoded 2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof testfile: 3 extents found $ compsize testfile Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments. Type Perc Disk Usage Uncompressed Referenced TOTAL 64% 188K 292K 292K lzo 64% 188K 292K 292K Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: fix ssd_spread overallocationBoris Burkov
If the ssd_spread mount option is enabled, then we run the so called clustered allocator for data block groups. In practice, this results in creating a btrfs_free_cluster which caches a block_group and borrows its free extents for allocation. Since the introduction of allocation size classes in 6.1, there has been a bug in the interaction between that feature and ssd_spread. find_free_extent() has a number of nested loops. The loop going over the allocation stages, stored in ffe_ctl->loop and managed by find_free_extent_update_loop(), the loop over the raid levels, and the loop over all the block_groups in a space_info. The size class feature relies on the block_group loop to ensure it gets a chance to see a block_group of a given size class. However, the clustered allocator uses the cached cluster block_group and breaks that loop. Each call to do_allocation() will really just go back to the same cached block_group. Normally, this is OK, as the allocation either succeeds and we don't want to loop any more or it fails, and we clear the cluster and return its space to the block_group. But with size classes, the allocation can succeed, then later fail, outside of do_allocation() due to size class mismatch. That latter failure is not properly handled due to the highly complex multi loop logic. The result is a painful loop where we continue to allocate the same num_bytes from the cluster in a tight loop until it fails and releases the cluster and lets us try a new block_group. But by then, we have skipped great swaths of the available block_groups and are likely to fail to allocate, looping the outer loop. In pathological cases like the reproducer below, the cached block_group is often the very last one, in which case we don't perform this tight bg loop but instead rip through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which is now the last one, and we enter the tight inner loop until an allocation failure. Then allocation succeeds on the final block_group and if the next allocation is a size mismatch, the exact same thing happens again. Triggering this is as easy as mounting with -o ssd_spread and then running: mount -o ssd_spread $dev $mnt dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null sync if you do the two writes + sync in a loop, you can force btrfs to spin an excessive amount on semi-successful clustered allocations, before ultimately failing and advancing to the stage where we force a chunk allocation. This results in 2G of data allocated per iteration, despite only using ~20M of data. By using a small size classed extent, the inner loop takes longer and we can spin for longer. The simplest, shortest term fix to unbreak this is to make the clustered allocator size_class aware in the dumbest way, where it fails on size class mismatch. This may hinder the operation of the clustered allocator, but better hindered than completely broken and terribly overallocating. Further re-design improvements are also in the works. Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator") CC: stable@vger.kernel.org # 6.1+ Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: zoned: requeue to unused block group list if zone finish failedNaohiro Aota
btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need to try it again later. So, put the block group to the retry list properly. Failing to do so will keep the removable block group intact until remount and can causes unnecessary ENOSPC. Fixes: 74e91b12b115 ("btrfs: zoned: zone finish unused block group") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: zoned: do not remove unwritten non-data block groupNaohiro Aota
There are some reports of "unable to find chunk map for logical 2147483648 length 16384" error message appears in dmesg. This means some IOs are occurring after a block group is removed. When a metadata tree node is cleaned on a zoned setup, we keep that node still dirty and write it out not to create a write hole. However, this can make a block group's used bytes == 0 while there is a dirty region left. Such an unused block group is moved into the unused_bg list and processed for removal. When the removal succeeds, the block group is removed from the transaction->dirty_bgs list, so the unused dirty nodes in the block group are not sent at the transaction commit time. It will be written at some later time e.g, sync or umount, and causes "unable to find chunk map" errors. This can happen relatively easy on SMR whose zone size is 256MB. However, calling do_zone_finish() on such block group returns -EAGAIN and keep that block group intact, which is why the issue is hidden until now. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: remove btrfs_clear_extent_bits()Filipe Manana
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a NULL for its last argument (a cached extent state record), plus there is not counter part - we have a btrfs_set_extent_bit() but we do not have a btrfs_set_extent_bits() (plural version). So just remove it and make all callers use btrfs_clear_extent_bit() directly. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: use cached state when falling back from NOCoW write to CoW writeFilipe Manana
We have a cached extent state record from the previous extent locking so we can use when setting the EXTENT_NORESERVE in the range, allowing the operation to be faster if the extent io tree is relatively large. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: set EXTENT_NORESERVE before range unlock in btrfs_truncate_block()Filipe Manana
Set the EXTENT_NORESERVE bit in the io tree before unlocking the range so that we can use the cached state and speedup the operation, since the unlock operation releases the cached state. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: don't print relocation messages from auto reclaimJohannes Thumshirn
When BTRFS is doing automatic block-group reclaim, it is spamming the kernel log messages a lot. Add a 'verbose' parameter to btrfs_relocate_chunk() and btrfs_relocate_block_group() to control the verbosity of these log message. This way the old behaviour of printing log messages on a user-space initiated balance operation can be kept while excessive log spamming due to auto reclaim is mitigated. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: remove redundant auto reclaim log messageJohannes Thumshirn
Remove the log message before reclaiming a chunk in btrfs_reclaim_bgs_work(). Especially with automatic block-group reclaiming these messages spam the kernel log. Note there is also a tracepoint for the same condition to ease debugging. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: make btrfs_check_nocow_lock() check more than one extentFilipe Manana
Currently btrfs_check_nocow_lock() stops at the first extent it finds and that extent may be smaller than the target range we want to NOCOW into. But we can have multiple consecutive extents which we can NOCOW into, so by stopping at the first one we find we just make the caller do more work by splitting the write into multiple ones, or in the case of mmap writes with large folios we fail with -ENOSPC in case the folio's range is covered by more than one extent (the fallback to NOCOW for mmap writes in case there's no available data space to reserve/allocate was recently added by the patch "btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents"). Improve on this by checking for multiple consecutive extents. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: assert we can NOCOW the range in btrfs_truncate_block()Filipe Manana
We call btrfs_check_nocow_lock() to see if we can NOCOW a block sized range but we don't check later if we can NOCOW the whole range. It's unexpected to be able to NOCOW a range smaller than blocksize, so add an assertion to check the NOCOW range matches the blocksize. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: update function comment for btrfs_check_nocow_lock()Filipe Manana
The documentation for the @nowait parameter is missing, so add it. The @nowait parameter was added in commit 80f9d24130e4 ("btrfs: make btrfs_check_nocow_lock nowait compatible"), which forgot to update the function comment. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: use btrfs_inode local variable at btrfs_page_mkwrite()Filipe Manana
Most of the time we want to use the btrfs_inode, so change the local inode variable to be a btrfs_inode instead of a VFS inode, reducing verbosity by eliminating a lot of BTRFS_I() calls. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: use variable for io_tree when clearing range in btrfs_page_mkwrite()Filipe Manana
We have the inode's io_tree already stored in a local variable, so use it instead of grabbing it again in the call to btrfs_clear_extent_bit(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: fix -ENOSPC mmap write failure on NOCOW files/extentsFilipe Manana
If we attempt a mmap write into a NOCOW file or a prealloc extent when there is no more available data space (or unallocated space to allocate a new data block group) and we can do a NOCOW write (there are no reflinks for the target extent or snapshots), we always fail due to -ENOSPC, unlike for the regular buffered write and direct IO paths where we check that we can do a NOCOW write in case we can't reserve data space. Simple reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi umount $DEV &> /dev/null mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV mount $DEV $MNT touch $MNT/foobar # Make it a NOCOW file. chattr +C $MNT/foobar # Add initial data to file. xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar # Fill all the remaining data space and unallocated space with data. dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null # Overwrite the file with a mmap write. Should succeed. xfs_io -c "mmap -w 0 1M" \ -c "mwrite -S 0xcd 0 1M" \ -c "munmap" \ $MNT/foobar # Unmount, mount again and verify the new data was persisted. umount $MNT mount $DEV $MNT od -A d -t x1 $MNT/foobar umount $MNT Running this: $ ./test.sh (...) wrote 1048576/1048576 bytes at offset 0 1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec) ./test.sh: line 24: 234865 Bus error xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 1048576 Fix this by not failing in case we can't allocate data space and we can NOCOW into the target extent - reserving only metadata space in this case. After this change the test passes: $ ./test.sh (...) wrote 1048576/1048576 bytes at offset 0 1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec) 0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 1048576 A test case for fstests will be added soon. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: use clear_and_wake_up_bit() where open codedDavid Sterba
There are two cases open coding the clear and wake up pattern, we can use the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: rename variable for folio offsetDavid Sterba
There used to be 'oip' short for offset in page, which got changed during conversion to folios, the name is a bit confusing so rename it. Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: factor out split memcpy with two sourcesDavid Sterba
The case of a reading the bytes from 2 folios needs two memcpy()s, the compiler does not emit calls but two inline loops. Factoring out the code makes some improvement (stack, code) and in the future will provide an optimized implementation as well. (The analogical version with two destinations is not done as it increases stack usage but can be done if needed.) The address of the second folio is reordered before the first memcpy, which leads to an optimization reusing the vmemmap_base and page_offset_base (implementing folio_address()). Stack usage reduction: btrfs_get_32 -8 (32 -> 24) btrfs_get_64 -8 (32 -> 24) Code size reduction: text data bss dec hex filename 1454279 115665 16088 1586032 183370 pre/btrfs.ko 1454229 115665 16088 1585982 18333e post/btrfs.ko DELTA: -50 As this is the last patch in this series, here's the overall diff starting and including commit "btrfs: accessors: simplify folio bounds checks": Stack: btrfs_set_16 -72 (88 -> 16) btrfs_get_32 -56 (80 -> 24) btrfs_set_8 -72 (88 -> 16) btrfs_set_64 -64 (88 -> 24) btrfs_get_8 -72 (80 -> 8) btrfs_get_16 -64 (80 -> 16) btrfs_set_32 -64 (88 -> 24) btrfs_get_64 -56 (80 -> 24) NEW (48): report_setget_bounds 48 LOST/NEW DELTA: +48 PRE/POST DELTA: -472 Code: text data bss dec hex filename 1456601 115665 16088 1588354 183c82 pre/btrfs.ko 1454229 115665 16088 1585982 18333e post/btrfs.ko DELTA: -2372 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: set target address at initializationDavid Sterba
The target address for the read/write can be simplified as it's the same expression for the first folio. This improves the generated code as the folio address does not have to be cached on stack. Stack usage reduction: btrfs_set_32 -8 (32 -> 24) btrfs_set_64 -8 (32 -> 24) btrfs_get_16 -8 (24 -> 16) Code size reduction: text data bss dec hex filename 1454459 115665 16088 1586212 183424 pre/btrfs.ko 1454279 115665 16088 1586032 183370 post/btrfs.ko DELTA: -180 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: compile-time fast path for u16David Sterba
Reading/writing 2 bytes (u16) may need 2 folios to be written to, each time it's just one byte so using memcpy for that is an overkill. Add a branch for the split case so that memcpy is now used for u32 and u64. Another side effect is that the u16 types now don't need additional stack space, everything fits to registers. Stack usage is reduced: btrfs_get_16 -8 (32 -> 24) btrfs_set_16 -16 (32 -> 16) Code size reduction: text data bss dec hex filename 1454691 115665 16088 1586444 18350c pre/btrfs.ko 1454459 115665 16088 1586212 183424 post/btrfs.ko DELTA: -232 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: compile-time fast path for u8David Sterba
Reading/writing 1 byte (u8) is a special case compared to the others as it's always contained in the folio we find, so the split memcpy will never be needed. Turn it to a compile-time check that the memcpy part can be optimized out. The stack usage is reduced: btrfs_set_8 -16 (32 -> 16) btrfs_get_8 -16 (24 -> 8) Code size reduction: text data bss dec hex filename 1454951 115665 16088 1586704 183610 pre/btrfs.ko 1454691 115665 16088 1586444 18350c post/btrfs.ko DELTA: -260 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: inline eb bounds check and factor out the error reportDavid Sterba
There's a check in each set/get helper if the requested range is within extent buffer bounds, and if it's not then report it. This was in an ASSERT statement so with CONFIG_BTRFS_ASSERT this crashes right away, on other configs this is only reported but reading out of the bounds is done anyway. There are currently no known reports of this particular condition failing. There are some drawbacks though. The behaviour dependence on the assertions being compiled in or not and a less visible effect of inlining report_setget_bounds() into each helper. As the bounds check is expected to succeed almost always it's ok to inline it but make the report a function and move it out of the helper completely (__cold puts it to a different section). This also skips reading/writing the requested range in case it fails. This improves stack usage significantly: btrfs_get_16 -48 (80 -> 32) btrfs_get_32 -48 (80 -> 32) btrfs_get_64 -48 (80 -> 32) btrfs_get_8 -48 (72 -> 24) btrfs_set_16 -56 (88 -> 32) btrfs_set_32 -56 (88 -> 32) btrfs_set_64 -56 (88 -> 32) btrfs_set_8 -48 (80 -> 32) NEW (48): report_setget_bounds 48 LOST/NEW DELTA: +48 PRE/POST DELTA: -360 Same as .ko size: text data bss dec hex filename 1456079 115665 16088 1587832 183a78 pre/btrfs.ko 1454951 115665 16088 1586704 183610 post/btrfs.ko DELTA: -1128 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: use type sizeof constants directlyDavid Sterba
Now unit_size is used only once, so use it directly in 'part' calculation. Don't cache sizeof(type) in a variable. While this is a compile-time constant, forcing the type 'int' generates worse code as it leads to additional conversion from 32 to 64 bit type on x86_64. The sizeof() is used only a few times and it does not make the code that harder to read, so use it directly and let the compiler utilize the immediate constants in the context it needs. The .ko code size slightly increases (+50) but further patches will reduce that again. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
12 daysbtrfs: accessors: simplify folio bounds checksDavid Sterba
As we can a have non-contiguous range in the eb->folios, any item can be straddling two folios and we need to check if it can be read in one go or in two parts. For that there's a check which is not implemented in the simplest way: offset in folio + size <= folio size With a simple expression transformation: oil + size <= unit_size size <= unit_size - oil sizeof() <= part this can be simplified and reusing existing run-time or compile-time constants. Add likely() annotation for this expression as this is the fast path and compiler sometimes reorders that after the followup block with the memcpy (observed in practice with other simplifications). Overall effect on stack consumption: btrfs_get_8 -8 (80 -> 72) btrfs_set_8 -8 (88 -> 80) And .ko size (due to optimizations making use of the direct constants): text data bss dec hex filename 1456601 115665 16088 1588354 183c82 pre/btrfs.ko 1456093 115665 16088 1587846 183a86 post/btrfs.ko DELTA: -508 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>