path: root/fs/btrfs/file.c
Age | Commit message | Author
2025-07-10btrfs: fix wrong start offset for delalloc space release during mmap writeFilipe Manana
[ Upstream commit 17a85f520469a1838379de8ad24f63e778f7c277 ]

If we're doing a mmap write against a folio that has i_size somewhere in the middle and we have multiple sectors in the folio, we may have to release excess space previously reserved, for the range going from the rounded up (to sector size) i_size to the folio's end offset.

We are calculating the right amount to release and passing it to btrfs_delalloc_release_space(), but we are passing the wrong start offset of that range - we're passing the folio's start offset instead of the end offset, plus 1, of the range for which we keep the reservation. This may result in releasing more space than we should and eventually trigger an underflow of the data space_info's bytes_may_use counter.

So fix this by passing the start offset as 'end + 1' instead of 'page_start' to btrfs_delalloc_release_space().

Fixes: d0b7da88f640 ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
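Illustratively, the fixed call looks roughly like this (a sketch reconstructed from the commit description, not the verbatim diff; local names such as 'reserved_space' and 'fsize' are assumptions):

  /* Inside btrfs_page_mkwrite(), when i_size lands inside the folio. */
  reserved_space = round_up(size - page_start, fs_info->sectorsize);
  if (reserved_space < fsize) {
          end = page_start + reserved_space - 1;
          /*
           * Release the excess range [end + 1, page_start + fsize - 1].
           * The bug was passing page_start as the start offset here.
           */
          btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved,
                                       end + 1, fsize - reserved_space,
                                       true);
  }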
2025-07-10btrfs: prepare btrfs_page_mkwrite() for large foliosQu Wenruo
[ Upstream commit 49990d8fa27d75f8ecf4ad013b13de3c4b1ff433 ]

This removes the assumption that the folio is always page sized (although the ASSERT() on the folio order is still kept as-is): just replace PAGE_SIZE with folio_size().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 17a85f520469 ("btrfs: fix wrong start offset for delalloc space release during mmap write") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()Qu Wenruo
[ Upstream commit bc2dbc4983afedd198490cca043798f57c93e9bf ]

[BUG]
When running btrfs/004 with 4K fs block size and 64K page size, sometimes the fsstress workload can take 100% CPU for a while, but not long enough to trigger a 120s hang warning.

[CAUSE]
When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is always in the call trace.

In one example of this problem, the function btrfs_punch_hole_lock_range() got the following parameters:

  lock_start = 4096, lockend = 20469

Then we calculate @page_lockstart by rounding up lock_start to the page boundary, which is 64K (the page size is 64K). For @page_lockend, we round down the value towards the page boundary, which results in 0. Then, since we need to pass an inclusive end to filemap_range_has_page(), we subtract 1 from the rounded down value, resulting in (u64)-1.

In the above case, the range is inside the same page, and we do not even need to call filemap_range_has_page(), let alone call it with (u64)-1 as the end. This behavior causes btrfs_punch_hole_lock_range() to busy loop waiting for an irrelevant range to have its pages dropped.

[FIX]
Calculate @page_lockend by just rounding down @lockend, without decreasing the value by one, so @page_lockend will no longer underflow. Then exit early if @page_lockend is no larger than @page_lockstart, as that means either the range is inside the same page or the two pages are already adjacent. Finally, only decrease @page_lockend when calling filemap_range_has_page().

Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
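The underflow is plain unsigned arithmetic; a minimal standalone C demo of the numbers above (assuming 64K pages, not kernel code):

  #include <stdio.h>
  #include <stdint.h>

  #define PAGE_SZ       65536ULL
  #define ROUND_UP(x)   (((x) + PAGE_SZ - 1) & ~(PAGE_SZ - 1))
  #define ROUND_DOWN(x) ((x) & ~(PAGE_SZ - 1))

  int main(void)
  {
          uint64_t lockstart = 4096, lockend = 20469;
          uint64_t page_lockstart = ROUND_UP(lockstart);  /* 65536 */
          uint64_t page_lockend = ROUND_DOWN(lockend);    /* 0 */

          /* Old code: the inclusive end underflows to (u64)-1. */
          printf("old inclusive end: %llu\n",
                 (unsigned long long)(page_lockend - 1));

          /* Fixed code: both offsets are in the same page, exit early. */
          if (page_lockend <= page_lockstart) {
                  puts("range within one page, skip filemap_range_has_page()");
                  return 0;
          }
          /* Otherwise pass page_lockend - 1 as the inclusive end. */
          return 0;
  }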
2025-03-13btrfs: fix data overwriting bug during buffered write when block size < page sizeQu Wenruo
[ Upstream commit efa11fd269c139e29b71ec21bc9c9c0063fde40d ]

[BUG]
When running generic/418 with a btrfs whose block size < page size (subpage cases), it always fails. And the following minimal reproducer is more than enough to trigger it reliably:

  workload()
  {
          mkfs.btrfs -s 4k -f $dev > /dev/null
          dmesg -C
          mount $dev $mnt
          $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
          ret=$?
          umount $mnt
          stop_trace
          if [ $ret -ne 0 ]; then
                  fail
          fi
  }

  for (( i = 0; i < 1024; i++)); do
          echo "=== $i/$runtime ==="
          workload
  done

[CAUSE]
With extra trace printk added to the following functions:

- btrfs_buffered_write()
  * Which folio is touched
  * The file offset (start) where the buffered write is at
  * How many bytes are copied
  * The content of the write (the first 2 bytes)

- submit_one_sector()
  * Which folio is touched
  * The position inside the folio
  * The content of the page cache (the first 2 bytes)

- pagecache_isize_extended()
  * The parameters of the function itself
  * The parameters of the folio_zero_range()

the traces are enough to show the problem:

  22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
  22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
  22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
  22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
  22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
  22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
  22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101

The tool dio-invalidate-cache starts 3 threads, each doing a buffered write of 0x01 at offset 0, 4096 and 8192, then a fsync, then a direct read, comparing the read buffer with the write buffer.

Note that all 3 btrfs_buffered_write() calls write the correct 0x01 into the page cache. But at submit_one_sector(), at file offset 4096, the content has been zeroed out by pagecache_isize_extended().

The race happens like this, with thread A writing into range [4K, 8K) and thread B writing into range [8K, 12K):

  Thread A                            | Thread B
  ------------------------------------+------------------------------------
  btrfs_buffered_write()              | btrfs_buffered_write()
  |- old_isize = 4K;                  | |- old_isize = 4096;
  |- btrfs_inode_lock()               | |
  |- write into folio range [4K, 8K)  | |
  |- pagecache_isize_extended()       | |
  |  extend isize from 4096 to 8192   | |
  |  no folio_zero_range() called     | |
  |- btrfs_inode_unlock()             | |
                                      | |- btrfs_inode_lock()
                                      | |- write into folio range [8K, 12K)
                                      | |- pagecache_isize_extended()
                                      | |  calling folio_zero_range(4K, 8K)
                                      | |  This is caused by the old_isize
                                      | |  being grabbed too early, without
                                      | |  any inode lock.
                                      | |- btrfs_inode_unlock()

The @old_isize is grabbed without the inode lock, causing a race between the two buffered write threads and making pagecache_isize_extended() zero a range which still contains cached data.

And this only affects subpage btrfs, because for the regular blocksize == page size case, the function pagecache_isize_extended() will do nothing if the block size >= page size.

[FIX]
Grab the old i_size while holding the inode lock. This means each buffered write thread will have a stable view of the old inode size, thus avoiding the above race.
CC: stable@vger.kernel.org # 5.15+ Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
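A sketch of the ordering the fix establishes (reconstructed from the description above, not the verbatim diff):

  /*
   * btrfs_buffered_write(), after the fix: sample i_size only once the
   * inode lock is held, so each writer sees a stable old size and
   * pagecache_isize_extended() cannot zero a range another writer just
   * filled with cached data.
   */
  btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
  old_isize = i_size_read(inode);   /* previously read before the lock */
  /* ... copy data into the folio, update i_size ... */
  pagecache_isize_extended(inode, old_isize, iocb->ki_pos);
  btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);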
2025-02-21btrfs: fix hole expansion when writing at an offset beyond EOFFilipe Manana
commit da2dccd7451de62b175fb8f0808d644959e964c7 upstream.

At btrfs_write_check(), if our file's i_size is not sector size aligned and we have a write that starts at an offset larger than the i_size and that falls within the same page as the i_size, then we end up not zeroing the file range [i_size, write_offset).

The code is this:

  start_pos = round_down(pos, fs_info->sectorsize);
  oldsize = i_size_read(inode);
  if (start_pos > oldsize) {
          /* Expand hole size to cover write data, preventing empty gap */
          loff_t end_pos = round_up(pos + count, fs_info->sectorsize);

          ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
          if (ret)
                  return ret;
  }

So if our file's i_size is 90269 bytes and a write at offset 90365 bytes comes in, we get 'start_pos' set to 90112 bytes, which is less than the i_size, and therefore we don't zero out the range [90269, 90365) by calling btrfs_cont_expand().

This is an old bug introduced in commit 9036c10208e1 ("Btrfs: update hole handling v2"), from 2008, and the buggy code got moved around over the years.

Fix this by discarding 'start_pos' and comparing against the write offset ('pos') without any alignment.

This bug was recently exposed by test case generic/363, which tests this scenario by polluting ranges beyond EOF with an mmap write and then verifying that after the file size is increased we get zeroes for the range which is supposed to be a hole and not what we wrote with the previous mmap write. We're only seeing this exposed now because generic/363 used to run only on xfs until last Sunday's fstests update. The test was failing like this:

  $ ./check generic/363
  FSTYP         -- btrfs
  PLATFORM      -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
  MKFS_OPTIONS  -- /dev/sdc
  MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

  generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
      --- tests/generic/363.out  2025-02-05 15:31:14.013646509 +0000
      +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad  2025-02-05 17:25:33.112630781 +0000
      @@ -1 +1,46 @@
       QA output created by 363
      +READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
      +OFFSET GOOD BAD RANGE
      +0x1609d 0x0000 0x3104 0x0
      +operation# (mod 256) for the bad data may be 4
      +0x1609e 0x0000 0x0472 0x1
      +operation# (mod 256) for the bad data may be 4
      ...
      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff)
  Ran: generic/363
  Failures: generic/363
  Failed 1 of 1 tests

Fixes: 9036c10208e1 ("Btrfs: update hole handling v2") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
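Per the description above, the fixed check drops the alignment (a sketch; the actual patch may differ in detail):

  oldsize = i_size_read(inode);
  if (pos > oldsize) {          /* was: start_pos > oldsize */
          /* Expand hole size to cover write data, preventing empty gap */
          loff_t end_pos = round_up(pos + count, fs_info->sectorsize);

          /* Zeroes [oldsize, end_pos) even within the i_size's page. */
          ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
          if (ret)
                  return ret;
  }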
2025-02-17btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()Hao-ran Zheng
[ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ]

A data race occurs when the function `insert_ordered_extent_file_extent()` and the function `btrfs_inode_safe_disk_i_size_write()` are executed concurrently. The function `insert_ordered_extent_file_extent()` reads inode->disk_i_size without holding a lock, so it races with `btrfs_inode_safe_disk_i_size_write()` writing inode->disk_i_size, thus affecting the value of `modify_tree`.

The specific call stack that appears during testing is as follows:

  ============DATA_RACE============
  btrfs_drop_extents+0x89a/0xa060 [btrfs]
  insert_reserved_file_extent+0xb54/0x2960 [btrfs]
  insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
  btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
  btrfs_finish_ordered_io+0x37/0x60 [btrfs]
  finish_ordered_fn+0x3e/0x50 [btrfs]
  btrfs_work_helper+0x9c9/0x27a0 [btrfs]
  process_scheduled_works+0x716/0xf10
  worker_thread+0xb6a/0x1190
  kthread+0x292/0x330
  ret_from_fork+0x4d/0x80
  ret_from_fork_asm+0x1a/0x30
  ============OTHER_INFO============
  btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
  btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
  btrfs_finish_ordered_io+0x37/0x60 [btrfs]
  finish_ordered_fn+0x3e/0x50 [btrfs]
  btrfs_work_helper+0x9c9/0x27a0 [btrfs]
  process_scheduled_works+0x716/0xf10
  worker_thread+0xb6a/0x1190
  kthread+0x292/0x330
  ret_from_fork+0x4d/0x80
  ret_from_fork_asm+0x1a/0x30
  =================================

The main purpose of the check of the inode's disk_i_size is to avoid taking write locks on a btree path when we have a write at or beyond EOF, since in these cases we don't expect to find extent items in the root to drop. However, if we end up taking write locks due to a data race on disk_i_size, everything is still correct; we only add extra lock contention on the tree in case there's concurrency from other tasks. If the race causes us to not take write locks when we actually need them, then everything is functionally correct as well, since if we find out we have extent items to drop and we took read locks (modify_tree set to 0), we release the path and retry again with write locks.

Since this data race does not affect the correctness of the function, it is a harmless data race; use data_race() to check inode->disk_i_size.

Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
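The annotation amounts to wrapping the lockless read in data_race() (a sketch of the check in btrfs_drop_extents(); treat the exact expression as an assumption):

  /*
   * Harmless race: a stale disk_i_size only changes the locking
   * strategy, never correctness, so tell KCSAN the lockless read is
   * intentional.
   */
  if (data_race(args->start >= inode->disk_i_size) && !args->replace_extent)
          modify_tree = 0;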
2024-09-17btrfs: fix race setting file private on concurrent lseek using same fdFilipe Manana
When doing concurrent lseek(2) system calls against the same file descriptor, using multiple threads belonging to the same process, we have a short time window where a race happens and can result in a memory leak.

The race happens like this:

1) A program opens a file descriptor for a file and then spawns two threads (with the pthreads library for example), let's call them task A and task B;

2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at file.c:find_desired_extent() while holding a read lock on the inode;

3) At the start of find_desired_extent(), it extracts the file's private_data pointer into a local variable named 'private', which has a value of NULL;

4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode in shared mode and enters file.c:find_desired_extent(), where it also extracts file->private_data into its local variable 'private', which has a NULL value;

5) Because it saw a NULL file private, task A allocates a private structure and assigns it to the file structure;

6) Task B also saw a NULL file private, so it also allocates its own file private and then assigns it to the same file structure, since both tasks are using the same file descriptor. At this point we leak the private structure allocated by task A.

Besides the memory leak, there's also the detail that both tasks end up using the same cached state record in the private structure (struct btrfs_file_private::llseek_cached_state), which can result in a use-after-free problem since one task can free it while the other is still using it (only one task took a reference count on it). Also, sharing the cached state is not a good idea since it could result in incorrect results in the future - right now it should not be a problem because it ends up being used only in extent-io-tree.c:count_range_bits() where we do range validation before using the cached state.

Fix this by protecting the private assignment and check of a file while holding the inode's spinlock, and keep track of the task that allocated the private, so that it's used only by that task in order to prevent use-after-free issues with the cached state record as well as potentially using it incorrectly in the future.

Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
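A sketch of the locked check-and-assign described above (the owner_task field name follows the commit's description of tracking the allocating task; treat the details as assumptions, not the verbatim diff):

  private = kzalloc(sizeof(*private), GFP_KERNEL);
  if (!private)
          return -ENOMEM;
  private->owner_task = current;        /* only this task may use it */

  spin_lock(&inode->i_lock);
  if (!file->private_data) {
          file->private_data = private; /* we won the race */
          private = NULL;
  }
  spin_unlock(&inode->i_lock);

  /*
   * Lost the race: free our allocation and do not touch the winner's
   * private, whose owner_task is another task.
   */
  kfree(private);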
2024-09-10btrfs: add and use helper to verify the calling task has locked the inodeFilipe Manana
We have a few places that check if we have the inode locked by doing:

  ASSERT(inode_is_locked(vfs_inode));

This actually proved to be useful several times, as if assertions are enabled (and by default they are in many distros) it immediately triggers a crash which is impossible for users to miss. However, that doesn't check if the lock is held by the calling task, so the check passes if some other task locked the inode.

Using one of the lockdep functions to check the lock is held, like lockdep_assert_held() for example, does check that the calling task holds the lock, and if that's not the case it produces a warning and stack trace in dmesg. However, despite the misleading "assert" in the name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just a warning and splat in dmesg, which can easily go unnoticed by users who have lockdep enabled.

So add a helper that does the ASSERT() and calls lockdep_assert_held() immediately after, and use it everywhere we check the inode is locked. Like this, if the lock is held by some other task we get the warning in dmesg, which is caught by fstests, very helpful during development, and may also be occasionally noticed by users with lockdep enabled.

Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
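The helper is essentially the two checks combined (a sketch; the helper name is inferred from the description):

  static inline void btrfs_assert_inode_locked(struct btrfs_inode *inode)
  {
          /* Crash immediately if the lock is not held at all. */
          ASSERT(inode_is_locked(&inode->vfs_inode));
          /* Warn via lockdep if the *calling task* does not hold it. */
          lockdep_assert_held(&inode->vfs_inode.i_rwsem);
  }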
2024-09-10btrfs: utilize folio more in btrfs_page_mkwrite()Josef Bacik
We already have a folio that we're using in btrfs_page_mkwrite, update the rest of the function to use folio everywhere else. This will make it easier on Willy when he drops page->index. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-03btrfs: fix race between direct IO write and fsync when using same fdFilipe Manana
If we have 2 threads that are using the same file descriptor and one of them is doing direct IO writes while the other is doing fsync, we have a race where we can end up either:

1) Attempting a fsync without holding the inode's lock, triggering an assertion failure when assertions are enabled;

2) Doing an invalid memory access from the fsync task, because the file private points to memory allocated on the stack by the direct IO task and it may be used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes, while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the file's private to an on stack allocated private with the member 'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private structure associated to the file which has 'fsync_skip_inode_lock' set to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to NULL since it had no prior private and our private was stack allocated. Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the assertion that checks the inode's VFS lock is held fails, since task B never locked it and task A has already unlocked it.

The stack trace produced is the following:

  assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ordered-data.c:983!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
  Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
  RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
  Code: 50 d6 86 c0 e8 (...)
  RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
  RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
  RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
  R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
  R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
  FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
  Call Trace:
   <TASK>
   ? __die_body.cold+0x14/0x24
   ? die+0x2e/0x50
   ? do_trap+0xca/0x110
   ? do_error_trap+0x6a/0x90
   ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
   ? exc_invalid_op+0x50/0x70
   ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
   ? asm_exc_invalid_op+0x1a/0x20
   ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
   ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
   btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
   ? __seccomp_filter+0x31d/0x4f0
   __x64_sys_fdatasync+0x4f/0x90
   do_syscall_64+0x82/0x160
   ? do_futex+0xcb/0x190
   ? __x64_sys_futex+0x10e/0x1d0
   ? switch_fpu_return+0x4f/0xd0
   ? syscall_exit_to_user_mode+0x72/0x220
   ? do_syscall_64+0x8e/0x160
   ? syscall_exit_to_user_mode+0x72/0x220
   ? do_syscall_64+0x8e/0x160
   ? syscall_exit_to_user_mode+0x72/0x220
   ? do_syscall_64+0x8e/0x160
   ? syscall_exit_to_user_mode+0x72/0x220
   ? do_syscall_64+0x8e/0x160
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses it after task A has finished: since the private was allocated on the stack of task A, it results in an invalid memory access with a hard to predict result.

This issue, triggering the assertion, was observed with QEMU workloads by two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync that it should skip locking the inode, and instead pass this information through a special value stored in current->journal_info. This is safe because in the relevant section of the direct IO write path we are not holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

  $ cat repro.c
  /* Get the O_DIRECT definition. */
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <errno.h>
  #include <string.h>
  #include <pthread.h>

  static int fd;

  static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
  {
          while (count > 0) {
                  ssize_t ret;

                  ret = pwrite(fd, buf, count, offset);
                  if (ret < 0) {
                          if (errno == EINTR)
                                  continue;
                          return ret;
                  }
                  count -= ret;
                  buf += ret;
          }
          return 0;
  }

  static void *fsync_loop(void *arg)
  {
          while (1) {
                  int ret;

                  ret = fsync(fd);
                  if (ret != 0) {
                          perror("Fsync failed");
                          exit(6);
                  }
          }
  }

  int main(int argc, char *argv[])
  {
          long pagesize;
          void *write_buf;
          pthread_t fsyncer;
          int ret;

          if (argc != 2) {
                  fprintf(stderr, "Use: %s <file path>\n", argv[0]);
                  return 1;
          }

          fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
          if (fd == -1) {
                  perror("Failed to open/create file");
                  return 1;
          }

          pagesize = sysconf(_SC_PAGE_SIZE);
          if (pagesize == -1) {
                  perror("Failed to get page size");
                  return 2;
          }

          ret = posix_memalign(&write_buf, pagesize, pagesize);
          if (ret) {
                  perror("Failed to allocate buffer");
                  return 3;
          }

          ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
          if (ret != 0) {
                  fprintf(stderr, "Failed to create writer thread: %d\n", ret);
                  return 4;
          }

          while (1) {
                  ret = do_write(fd, write_buf, pagesize, 0);
                  if (ret != 0) {
                          perror("Write failed");
                          exit(5);
                  }
          }

          return 0;
  }

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt/sdi
  $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187 Reported-by: Andreas Jahn <jahn-andi@web.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199 Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/ Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-02btrfs: fix double inode unlock for direct IO sync writesFilipe Manana
If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip inode logging or we get an error starting a transaction or an error when flushing delalloc, we end up unlocking the inode when we shouldn't under the 'out_release_extents' label, and then unlock it again at btrfs_direct_write(). Fix that by checking if we have to skip inode unlocking under that label. Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/ Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29btrfs: fix corruption after buffer fault in during direct IO append writeFilipe Manana
During an append (O_APPEND write flag) direct IO write, if the input buffer was not previously faulted in, we can corrupt the file in a way that the final size is unexpected and it includes an unexpected hole.

The problem happens like this:

1) We have an empty file, with size 0, for example;

2) We do an O_APPEND direct IO with a length of 4096 bytes and the input buffer is not currently faulted in;

3) We enter btrfs_direct_write(), lock the inode and call generic_write_checks(), which calls generic_write_checks_count(), and that function sets the iocb position to 0 with the following code:

	if (iocb->ki_flags & IOCB_APPEND)
		iocb->ki_pos = i_size_read(inode);

4) We call btrfs_dio_write() and enter into iomap, which will end up calling btrfs_dio_iomap_begin() and that calls btrfs_get_blocks_direct_write(), where we update the i_size of the inode to 4096 bytes;

5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access the page of the write input buffer (at iomap_dio_bio_iter(), with a call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets returned to btrfs at btrfs_direct_write() via btrfs_dio_write();

6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode, fault in the write buffer and then jump to the label 'relock';

7) We lock the inode again, do all the necessary checks again and call again generic_write_checks(), which calls generic_write_checks_count() again, and there we set the iocb's position to 4K, which is the current i_size of the inode, with the code pointed out above:

	if (iocb->ki_flags & IOCB_APPEND)
		iocb->ki_pos = i_size_read(inode);

8) Then we go again to btrfs_dio_write() and enter iomap and the write succeeds, but it wrote to the file range [4K, 8K), leaving a hole in the [0, 4K) range and an i_size of 8K, which goes against the expectations of having the data written to the range [0, 4K) and getting an i_size of 4K.

Fix this by not unlocking the inode before faulting in the input buffer, in case we get -EFAULT or an incomplete write, and by not jumping to the 'relock' label after faulting in the buffer - instead jump to a location immediately before calling iomap, skipping all the write checks and relocking. This solves this problem and it's fine even in case the input buffer is memory mapped to the same file range: since only holding the range locked in the inode's io tree can cause a deadlock, it's safe to keep the inode lock (VFS lock), as was fixed and described in commit 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes").

A sample reproducer provided by a reporter is the following:

  $ cat test.c
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
          if (argc < 2) {
                  fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
                  return 1;
          }

          int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                        O_APPEND, 0644);
          if (fd < 0) {
                  perror("creating test file");
                  return 1;
          }

          char *buf = mmap(NULL, 4096, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          ssize_t ret = write(fd, buf, 4096);
          if (ret < 0) {
                  perror("pwritev2");
                  return 1;
          }

          struct stat stbuf;
          ret = fstat(fd, &stbuf);
          if (ret < 0) {
                  perror("stat");
                  return 1;
          }

          printf("size: %llu\n", (unsigned long long)stbuf.st_size);
          return stbuf.st_size == 4096 ? 0 : 1;
  }

A test case for fstests will be sent soon.
Reported-by: Hanna Czenczek <hreitz@redhat.com> Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/ Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb") CC: stable@vger.kernel.org # 6.1+ Tested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: move the direct IO code into its own fileFilipe Manana
The direct IO code is over a thousand lines and it's currently spread between file.c and inode.c, which makes it not easy to locate some parts of it sometimes. Also inode.c is about 11 thousand lines and file.c about 4 thousand lines, both too big.

So move all the direct IO code into a dedicated file, so that it's easy to locate all its code and reduce the sizes of inode.c and file.c. This is a pure move of code without any other changes, except exporting a couple of functions from inode.c (get_extent_allocation_hint() and create_io_em()) because they are used in inode.c and the new direct-io.c file, and a couple of functions from file.c (btrfs_buffered_write() and btrfs_write_check()) because they are used both in file.c and in the new direct-io.c file.

Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: cleanup duplicated parameters related to can_nocow_file_extent_argsQu Wenruo
The following functions and structures can be simplified using the btrfs_file_extent structure:

- can_nocow_extent()
  No need to return ram_bytes/orig_block_len through the parameter list, the @file_extent parameter contains all the needed info.

- can_nocow_file_extent_args
  The following members are no longer needed:

  * disk_bytenr
    This one is confusing as it's not really the btrfs_file_extent_item::disk_bytenr, but where the IO would be, thus it's file_extent::disk_bytenr + file_extent::offset now.
  * num_bytes
    Now file_extent::num_bytes.
  * extent_offset
    Now file_extent::offset.
  * disk_num_bytes
    Now file_extent::disk_num_bytes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::block_start memberQu Wenruo
The member extent_map::block_start can be calculated from extent_map::disk_bytenr + extent_map::offset for regular extents, and otherwise it is just extent_map::disk_bytenr. This is already validated by validate_extent_map(), so now we can remove the member.

However, there is a special case in btrfs_create_dio_extent() where, for NOCOW/PREALLOC ordered extents, we cannot directly use the resulting btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them yet. So for that call site, we pass file_extent->disk_bytenr + file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for offset.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::block_len memberQu Wenruo
The extent_map::block_len is either extent_map::len (non-compressed extent) or extent_map::disk_num_bytes (compressed extent). Since we already have sanity checks to do the cross-checks between the new and old members, we can drop the old extent_map::block_len now. For most call sites, they can manually select extent_map::len or extent_map::disk_num_bytes, since most if not all of them have checked if the extent is compressed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove extent_map::orig_start memberQu Wenruo
Since we have extent_map::offset, the old extent_map::orig_start is just extent_map::start - extent_map::offset for non-hole/inline extents. And since the new extent_map::offset is already verified by validate_extent_map() while the old orig_start is not, let's just remove the old member from all call sites. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: introduce new members for extent_mapQu Wenruo
Introduce two new members for extent_map:

- disk_bytenr
- offset

Both match the members with the same name inside btrfs_file_extent_items.

For now this patch only touches those members when:

- Reading btrfs_file_extent_items from disk
- Inserting new holes
- Merging two extent maps

With the new disk_bytenr and disk_num_bytes, merging becomes a little more complex, as we have 3 different cases:

* Both extent maps are referring to the same data extent:

  |<----- data extent A ----->|
     |<- em 1 ->|<- em 2 ->|

* Both extent maps are referring to different data extents:

  |<-- data extent A -->|<-- data extent B -->|
             |<- em 1 ->|<- em 2 ->|

* One of the extent maps is referring to a merged and larger data extent that covers both extent maps. This is not really a valid case other than in some selftests, so that test case will be removed.

A new helper merge_ondisk_extents() is introduced to handle the above valid cases.

To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites:

- For NOCOW writes, the btrfs_file_extent is exposed from can_nocow_file_extent().
- For other writes, the members can be easily calculated, as most of them have 0 offset and utilize the whole on-disk data extent. The exception is encoded write, but thankfully that interface provides the offset directly, along with all other needed info.

For now, both the old members (block_start/block_len/orig_start) co-exist with the new members (disk_bytenr/offset), while all the critical code still uses the old members only. The cleanup will happen later after all the old and new members are properly validated.

There is some re-ordering for the assignment of the extent_map members; we now follow this new ordering:

- start and len (or file_pos and num_bytes for other structures)
- disk_bytenr and disk_num_bytes
- offset and ram_bytes
- compression

So expect some seemingly unrelated line movement.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: export the expected file extent through can_nocow_extent()Qu Wenruo
Currently function can_nocow_extent() only returns members needed for extent_map. However since we will soon change the extent_map structure to be more like btrfs_file_extent_item, we want to expose the expected file extent caused by the NOCOW write for future usage. This introduces a new structure, btrfs_file_extent, to be a more memory access friendly representation of btrfs_file_extent_item. And use that structure to expose the expected file extent caused by the NOCOW write. For now there is no user of the new structure yet. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: rename extent_map::orig_block_len to disk_num_bytesQu Wenruo
This would make it very obvious that the member just matches btrfs_file_extent_item::disk_num_bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use a btrfs_inode local variable at btrfs_sync_file()Filipe Manana
Instead of using a VFS inode local pointer and then doing many BTRFS_I() calls inside btrfs_sync_file(), use a btrfs_inode pointer instead. This makes everything a bit easier to read and less confusing, allowing to make some statements shorter. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()Filipe Manana
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later help remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()Filipe Manana
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later help remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx)Filipe Manana
Instead of using an inode pointer, use a btrfs_inode pointer in the log context structure, as this is generally what we need and it allows some internal APIs to take a btrfs_inode instead, making them more consistent with most of the code base. This will later help remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-05Merge tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds
Pull btrfs fix from David Sterba:
 "A fix for fast fsync that needs to handle errors during writes after some COW failure so it does not lead to an inconsistent state"

* tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ensure fast fsync waits for ordered extents after a write failure
2024-05-28btrfs: ensure fast fsync waits for ordered extents after a write failureFilipe Manana
If a write path in COW mode fails, either before submitting a bio for the new extents or because an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents.

This is because dropping the extent maps happens when completing ordered extents, at btrfs_finish_one_ordered(), and the completion of an ordered extent is executed in a work queue. This can result in a fast fsync starting to log file extent items based on existing extent maps before the ordered extents complete, therefore resulting in a log that has file extent items that point to unwritten extents, resulting in a corrupt file if a crash happens after and the log tree is replayed the next time the fs is mounted.

This can happen for both direct IO writes and buffered writes.

For example, consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extent maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete, which is an intentional behaviour in order to reduce latency.

For the buffered write case, here's one example:

1) A fast fsync begins, and it starts by flushing delalloc and waiting for the writeback to complete by calling filemap_fdatawait_range();

2) Flushing the delalloc created a new extent map X;

3) During the writeback some IO error happened, and at the end io callback (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its completion;

4) After queuing the ordered extent completion, the end io callback clears the writeback flag from all pages (or folios), and from that moment the fast fsync can proceed;

5) The fast fsync proceeds, sees extent map X and logs a file extent item based on extent map X, resulting in a log that points to an unwritten data extent, because the ordered extent completion hasn't run yet; it happens only after the logging.

To fix this, make btrfs_finish_ordered_extent() set the inode flag BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write, so that a fast fsync will wait for ordered extent completion.

Note that this issue of using extent maps that point to unwritten locations cannot happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps.

Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
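The fix reduces to flagging the inode on error (a sketch from the description above; the helper name is an assumption):

  /*
   * In btrfs_finish_ordered_extent(): a failed COW write means extent
   * maps may still point at unwritten locations, so force the next
   * fsync down the slow path that waits for ordered extent completion.
   */
  if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
          btrfs_set_inode_full_sync(inode);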
2024-05-14Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds
Pull btrfs updates from David Sterba:
 "This update brings a few minor performance improvements, otherwise there's a lot of refactoring, cleanups and other sort of not user visible changes.

  Performance improvements:

   - inline b-tree locking functions, improvement in metadata-heavy changes

   - relax locking on a range that's being reflinked, allows read operations to run in parallel

   - speed up NOCOW write checks (throughput +9% on a sample test)

   - extent locking ranges have been reduced in several places, namely around delayed ref processing

  Core:

   - more page to folio conversions:
      - relocation
      - send
      - compression
      - inline extent handling
      - super block write and wait

   - extent_map structure optimizations:
      - reduced structure size
      - code simplifications
      - add shrinker for allocated objects, the numbers can go high and could exhaust memory on smaller systems (reported) as they may not get an opportunity to be freed fast enough

   - extent locking optimizations:
      - reduce locking ranges where it does not seem to be necessary and are safe due to other means of synchronization
      - potential improvements due to lower contention, allocation/freeing and state management operations of extent state tracking structures

   - delayed ref cleanups and simplifications

   - updated trace points

   - improved error handling, warnings and assertions

   - cleanups and refactoring, unification of error handling paths"

* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
  btrfs: qgroup: fix initialization of auto inherit array
  btrfs: count super block write errors in device instead of tracking folio error state
  btrfs: use the folio iterator in btrfs_end_super_write()
  btrfs: convert super block writes to folio in write_dev_supers()
  btrfs: convert super block writes to folio in wait_dev_supers()
  bio: Export bio_add_folio_nofail to modules
  btrfs: remove duplicate included header from fs.h
  btrfs: add a cached state to extent_clear_unlock_delalloc
  btrfs: push extent lock down in submit_one_async_extent
  btrfs: push lock_extent down in cow_file_range()
  btrfs: move can_cow_file_range_inline() outside of the extent lock
  btrfs: push lock_extent into cow_file_range_inline
  btrfs: push extent lock into cow_file_range
  btrfs: push extent lock into run_delalloc_cow
  btrfs: remove unlock_extent from run_delalloc_compressed
  btrfs: push extent lock down in run_delalloc_nocow
  btrfs: adjust while loop condition in run_delalloc_nocow
  btrfs: push extent lock into run_delalloc_nocow
  btrfs: push the extent lock into btrfs_run_delalloc_range
  btrfs: lock extent when doing inline extent in compression
  ...
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes the code easier to read.

The changes were made with the following Coccinelle semantic patch:

  // <smpl>
  @@
  expression E,E1;
  @@
  (
   E->root_key.objectid = E1
  |
  - E->root_key.objectid
  + btrfs_root_id(E)
  )
  // </smpl>

Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename ->len to ->num_bytes in btrfs_refJosef Bacik
We consistently use ->num_bytes everywhere through the delayed ref code, except in btrfs_ref. Rename btrfs_ref's ->len to ->num_bytes to match all the other code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: move ref_root into btrfs_refJosef Bacik
We have this in both btrfs_tree_ref and btrfs_data_ref, which is just wasting space and making the code more complicated. Move this into btrfs_ref proper and update all the call sites to do the assignment in btrfs_ref. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: do not use a function to initialize btrfs_refJosef Bacik
btrfs_ref currently has ->owning_root, and ->ref_root is shared between the tree ref and data ref, so in order to move that into btrfs_ref proper I would need to add another root parameter to the initialization function. This function has too many arguments, and adding another root will make it easy to make mistakes about which root goes where. Drop the generic ref init function and statically initialize the btrfs_ref in every usage. This makes the code easier to read because we can see what elements we're assigning, and will make the upcoming change moving the ref_root into the btrfs_ref more clear and less error prone than adding a new element to the initialization function. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: move btrfs_page_mkwrite() from inode.c into file.cFilipe Manana
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we define that structure in file.c. Currently the function is in inode.c and has to be exported to be used in file.c, which makes no sense because it's not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into file.c. While at it do a few minor style changes: 1) Capitalize the first word of every comment and end each sentence with punctuation; 2) Avoid splitting some statements into two lines when everything fits in 85 characters or less. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename err to ret in btrfs_direct_write()Anand Jain
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename err to ret in prepare_pages()Anand Jain
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename err to ret in btrfs_dirty_pages()Anand Jain
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-04btrfs: add helper to get fs_info from struct inode pointerDavid Sterba
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
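A sketch of such a macro, using _Generic for the type check so no full struct definitions are needed (treat the exact definition as an assumption based on the description above):

  #define inode_to_fs_info(_inode)				\
          (BTRFS_I(_Generic((_inode),				\
                   struct inode *: (_inode)))->root->fs_info)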
2024-03-04btrfs: preallocate temporary extent buffer for inode logging when neededFilipe Manana
When logging an inode and we need to copy items from subvolume leaves to the log tree, we clone each subvolume leaf and then use that clone to copy items to the log tree. This is required to avoid possible deadlocks, as stated in commit 796787c978ef ("btrfs: do not modify log tree while holding a leaf from fs tree locked").

The cloning requires allocating an extent buffer (struct extent_buffer) and then allocating pages (folios) to attach to the extent buffer. This may be slow in case we are under memory pressure, and since we are doing the cloning while holding a read lock on a subvolume leaf, it means we can be blocking other operations on that leaf for significant periods of time, which can increase latency on operations like creating other files, renaming files, etc. Similarly, because we're under a log transaction, we may also cause extra delay on other tasks doing an fsync, because syncing the log requires waiting for tasks that joined a log transaction to exit the transaction.

So to improve this, for any inode logging operation that needs to copy items from a subvolume leaf ("full sync" or "copy everything" bit set in the inode), preallocate a dummy extent buffer before locking any extent buffer from the subvolume tree, and even before joining a log transaction; add it to the log context and then use it when we need to copy items from a subvolume leaf to the log tree. This avoids making other operations get extra latency when waiting to lock a subvolume leaf that is used during inode logging and we are under heavy memory pressure.

The following test script with bonnie++ was used to test this:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdh
  MNT=/mnt/sdh
  MOUNT_OPTIONS="-o ssd"

  MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'`
  NR_DIRECTORIES=20
  NR_FILES=20480
  DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576))
  DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES))
  NR_FILES=$((NR_FILES / 1024))

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  bonnie++ -u root -d $MNT \
      -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
      -r 0 -s $DATASET_SIZE -b

  umount $MNT

The results of this test on an 8G VM running a non-debug kernel (Debian's default kernel config) were the following.

Before this change:

  Version 2.00a        ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  debian0       7501M  376k  99  1.4g  96  117m  14 1510k  99  2.5g  95 +++++ +++
  Latency             35068us   24976us    2944ms   30725us   71770us   26152us
  Version 2.00a        ------Sequential Create------ --------Random Create--------
  debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
        files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  20:384100:384100/20 20480  32 20480  58 20480  48 20480  39 20480  56 20480  61
  Latency               411ms   11914us    119ms     617ms   10296us    110ms

After this change:

  Version 2.00a        ------Sequential Output------ --Sequential Input- --Random-
                       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  debian0       7501M  375k  99  1.4g  97  117m  14 1546k  99  2.3g  98 +++++ +++
  Latency             35975us   20945us    2144ms   10297us    2217us    6004us
  Version 2.00a        ------Sequential Create------ --------Random Create--------
  debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
        files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  20:384100:384100/20 20480  35 20480  58 20480  48 20480  40 20480  57 20480  59
  Latency               320ms   11237us   77779us    518ms    6470us   86389us

Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused included headersDavid Sterba
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: replace i_blocksize by fs_info::sectorsizeDavid Sterba
The block size calculated by i_blocksize from the inode is the same as the one we have in fs_info, initialized in inode_init_always(). Unify that to use the fs_info value everywhere. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove the pg_offset parameter from btrfs_get_extent()Qu Wenruo
The parameter @pg_offset of btrfs_get_extent() is only utilized for inlined extent, and we already have an ASSERT() and tree-checker, to make sure we can only get inline extent at file offset 0. Any invalid inline extent with non-zero file offset would be rejected by tree-checker in the first place. Thus the @pg_offset parameter is not really necessary, just remove it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate subpage code to folio interfacesQu Wenruo
Although subpage itself conflicts with higher order folios, since
subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we
will never need a higher order folio, there is a hidden pitfall:

- btrfs_page_*() helpers

Those helpers are an abstraction to handle both subpage and
non-subpage cases, which means we're going to pass page pointers to
those helpers. And since those helpers are shared between data and
metadata paths, it's unavoidable to let them handle folios, including
higher order folios. Meanwhile for the true subpage case, we should
only have single page backed folios anyway, thus add a new ASSERT()
for btrfs_subpage_assert() to ensure that. Also since those helpers
are shared between both data and metadata, add some extra ASSERT()s
for the data path to make sure we only get single page backed folios
for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
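A minimal sketch of the kind of assertion described (the check, not
the exact helper body):

  /* True subpage means the backing folio must be a single page. */
  static void assert_folio_is_single_page(const struct folio *folio)
  {
          ASSERT(folio_order(folio) == 0);
  }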
2023-12-15btrfs: use the flags of an extent map to identify the compression typeFilipe Manana
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64
bits on a 64 bits platform, 32 bits otherwise) for flags. We are only
using 6 different flags, so an unsigned long is excessive and we can
use flags to identify the compression type instead of using a
dedicated 32 bits field.

We can easily have tens or hundreds of thousands (or more) of extent
maps on busy and large filesystems, especially with compression
enabled or with many or large files with tons of small extents. So
it's convenient to have the extent_map structure as small as possible
in order to use less memory.

So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes,
using now only two cache lines, and increases the number of extent
maps we can have per 4K page from 30 to 32.

By using a u32 for the flags instead of an unsigned long, we no longer
use test_bit(), set_bit() and clear_bit(). That level of atomicity is
not needed, as most flags are never cleared once set (before adding an
extent map to the tree), and the ones that can be cleared or set after
an extent map is added to the tree are always modified while holding
the write lock on the extent map tree, while the reader holds a lock
on the tree or tests for a flag that never changes once the extent map
is in the tree (such as compression flags).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
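A hedged sketch of encoding the compression type in the u32 flags
field (the bit names here are illustrative assumptions):

  enum {
          EXTENT_FLAG_COMPRESS_ZLIB = (1U << 0),
          EXTENT_FLAG_COMPRESS_LZO  = (1U << 1),
          EXTENT_FLAG_COMPRESS_ZSTD = (1U << 2),
  };

  /* One bit per algorithm replaces the old 32-bit compress_type. */
  static inline bool extent_map_is_compressed(u32 flags)
  {
          return (flags & (EXTENT_FLAG_COMPRESS_ZLIB |
                           EXTENT_FLAG_COMPRESS_LZO |
                           EXTENT_FLAG_COMPRESS_ZSTD)) != 0;
  }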
2023-12-15btrfs: migrate to use folio private instead of page privateQu Wenruo
As a cleanup and preparation for future folio migration, this patch
replaces all page->private usage with the folio version. This
includes:

- PagePrivate() -> folio_test_private()
- page->private -> folio_get_private()
- attach_page_private() -> folio_attach_private()
- detach_page_private() -> folio_detach_private()

Since we're here, also remove the forced cast on page->private; since
it's (void *) already, we don't really need the cast. For now even if
we missed some call sites, it won't cause any problem yet, as we're
only using order 0 folios (single page), thus all those folio/page
flags should be synced. But for the future conversion to utilize
higher order folios, the page <-> folio flag sync is no longer
guaranteed, thus we have to migrate to using folio flags.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
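A hedged before/after sketch of one such conversion (variable names
are illustrative):

  struct btrfs_subpage *subpage;

  /* Before: page-based private data, with the forced cast. */
  if (PagePrivate(page))
          subpage = (struct btrfs_subpage *)page->private;

  /* After: the folio equivalents; no cast needed since the private
   * pointer is already (void *). */
  if (folio_test_private(folio))
          subpage = folio_get_private(folio);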
2023-12-14Merge tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds
Pull btrfs fixes from David Sterba:
 "Some fixes to quota accounting code, mostly around error handling
  and correctness:

  - free reserves on various error paths, after IO errors or
    transaction abort

  - don't clear reserved range at the folio release time, it'll be
    properly cleared after final write

  - fix integer overflow due to int used when passing around size of
    freed reservations

  - fix a regression in squota accounting that missed some cases with
    delayed refs"

* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ensure releasing squota reserve on head refs
  btrfs: don't clear qgroup reserved bit in release_folio
  btrfs: free qgroup pertrans reserve on transaction abort
  btrfs: fix qgroup_free_reserved_data int overflow
  btrfs: free qgroup reserve when ORDERED_IOERR is set
2023-12-06btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov
The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.

Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes. Therefore, the full fix is to return
the "released" or "freed" amount via a u64 argument and to return 0 or
a negative error code via the return value.

Most of the call sites simply ignore the return value, though some of
them handle the error and count the returned bytes. Change all of them
accordingly.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
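The shape of the fix, sketched (the exact prototype is assumed):

  /*
   * Report the freed byte count through a u64 out-parameter and keep
   * the int return value for 0 or a negative errno, so byte counts
   * are never accumulated in an int.
   */
  static int qgroup_free_reserved_data(struct btrfs_inode *inode,
                                       struct extent_changeset *reserved,
                                       u64 start, u64 len, u64 *freed)
  {
          u64 total = 0;

          /* ... walk the changeset, accumulating into 'total' ... */

          *freed = total;
          return 0;
  }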
2023-10-30Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds
Pull btrfs updates from David Sterba:
 "New features:

  - raid-stripe-tree

    New tree for logical file extent mapping where the physical
    mapping may not match on multiple devices. This is now used in
    zoned mode to implement RAID0/RAID1* profiles, but can be used in
    non-zoned mode as well. The support for RAID56 is in development
    and will eventually fix the problems with the current
    implementation. This is a backward incompatible feature and has to
    be enabled at mkfs time.

  - simple quota accounting (squota)

    A simplified mode of qgroup that accounts all space on the initial
    extent owners (a subvolume), the snapshots are then cheap to
    create and delete. The deletion of snapshots in fully accounting
    qgroups is a known CPU/IO performance bottleneck.

    The squota is not suitable for the general use case but works well
    for containers where the original subvolume exists for the whole
    time. This is a backward incompatible feature as it needs
    extending some structures, but can be enabled on an existing
    filesystem.

  - temporary filesystem fsid (temp_fsid)

    The fsid identifies a filesystem and is hard coded in the
    structures, which disallows mounting the same fsid found on
    different devices.

    For a single device filesystem this is not strictly necessary, a
    new temporary fsid can be generated on mount e.g. after a device
    is cloned. This will be used by Steam Deck for root partition A/B
    testing, or can be used for VM root images.

  Other user visible changes:

  - filesystems with partially finished metadata_uuid conversion
    cannot be mounted anymore and the uuid fixup has to be done by
    btrfs-progs (btrfstune)

  Performance improvements:

  - reduce reservations for checksum deletions (with enabled free
    space tree by factor of 4), on a sample workload on file with many
    extents the deletion time decreased by 12%

  - make extent state merges more efficient during insertions, reduce
    rb-tree iterations (run time of critical functions reduced by 5%)

  Core changes:

  - the integrity check functionality has been removed, this was a
    debugging feature and removal does not affect other integrity
    checks like checksums or tree-checker

  - space reservation changes:

    - more efficient delayed ref reservations, this avoids building up
      too much work or overusing or exhausting the global block
      reserve in some situations

    - move delayed refs reservation to the transaction start time,
      this prevents some ENOSPC corner cases related to exhaustion of
      global reserve

    - improvements in reducing excessive reservations for block group
      items

    - adjust overcommit logic in near full situations, account for one
      more chunk to eventually allocate metadata chunk, this is mostly
      relevant for small filesystems (<10GiB)

  - single device filesystems are scanned but not registered (except
    seed devices), this allows temp_fsid to work

  - qgroup iterations do not need GFP_ATOMIC allocations anymore

  - cleanups, refactoring, reduced data structure size, function
    parameter simplifications, error handling fixes"

* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
  btrfs: open code timespec64 in struct btrfs_inode
  btrfs: remove redundant log root tree index assignment during log sync
  btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
  btrfs: sysfs: show temp_fsid feature
  btrfs: disable the device add feature for temp-fsid
  btrfs: disable the seed feature for temp-fsid
  btrfs: update comment for temp-fsid, fsid, and metadata_uuid
  btrfs: remove pointless empty log context list check when syncing log
  btrfs: update comment for struct btrfs_inode::lock
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: add and use helpers for reading and writing fs_info->generation
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: support cloned-device mount capability
  btrfs: add helper function find_fsid_by_disk
  btrfs: stop reserving excessive space for block group item insertions
  btrfs: stop reserving excessive space for block group item updates
  btrfs: reorder btrfs_inode to fill gaps
  btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
  ...
2023-10-18btrfs: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
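A hedged before/after sketch using the new accessors (the "before"
form is the old direct field access; the helpers named here exist in
the VFS as of this series):

  /* Before: direct timestamp field access. */
  inode->i_mtime = inode->i_ctime = current_time(inode);

  /* After: go through the accessor helpers. */
  struct timespec64 now = inode_set_ctime_current(inode);
  inode_set_mtime_to_ts(inode, now);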
2023-10-12btrfs: remove pointless barrier from btrfs_sync_file()Filipe Manana
The memory barrier (smp_mb()) at btrfs_sync_file() is completely
redundant now that fs_info->last_trans_committed is read using
READ_ONCE(), with the helper btrfs_get_last_trans_committed(), and
written using WRITE_ONCE() with the helper
btrfs_set_last_trans_committed().

This barrier was introduced in 2011, by commit a4abeea41adf ("Btrfs:
kill trans_mutex"), but even back then it was not correct, since the
writer side (in btrfs_commit_transaction()) did not issue a pairing
memory barrier after updating fs_info->last_trans_committed. So remove
this barrier.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing last_trans_committedFilipe Manana
Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.

In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode. To avoid
data race warnings from tools like KCSAN and other issues such as load
and store tearing (amongst others, see [1]), create helpers to access
the last_trans_committed field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
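The helpers are straightforward; a sketch matching the description
(the names come from the commit message, the bodies are assumed):

  static inline u64
  btrfs_get_last_trans_committed(const struct btrfs_fs_info *fs_info)
  {
          return READ_ONCE(fs_info->last_trans_committed);
  }

  static inline void
  btrfs_set_last_trans_committed(struct btrfs_fs_info *fs_info, u64 gen)
  {
          WRITE_ONCE(fs_info->last_trans_committed, gen);
  }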