summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2023-06-19NFS: Make all of /sys/fs/nfs network-namespace uniqueBenjamin Coddington
Expand the NFS network-namespaced sysfs from /sys/fs/nfs/net down one level into /sys/fs/nfs by moving the "net" kobject onto struct nfs_netns_client and setting it up during network namespace init. This prepares the way for superblock kobjects within /sys/fs/nfs that will only be visible to matching network namespaces. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFS: Open-code the nfs_kset kset_create_and_add()Benjamin Coddington
In preparation to make objects below /sys/fs/nfs namespace aware, we need to define our own kobj_type for the nfs kset so that we can add the .child_ns_type member in a following patch. No functional change here, only the unrolling of kset_create_and_add(). Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFS: rename nfs_client_kobj to nfs_net_kobjBenjamin Coddington
Match the variable names to the sysfs structure. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFS: rename nfs_client_kset to nfs_ksetBenjamin Coddington
Be brief and match the subsystem name. There's no need to distinguish this kset variable from the server. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19btrfs: fix race between quota disable and relocationFilipe Manana
If we disable quotas while we have a relocation of a metadata block group that has extents belonging to the quota root, we can cause the relocation to fail with -ENOENT. This is because relocation builds backref nodes for extents of the quota root and later needs to walk the backrefs and access the quota root - however if in between a task disables quotas, it results in deleting the quota root from the root tree (with btrfs_del_root(), called from btrfs_quota_disable(). This can be sporadically triggered by test case btrfs/255 from fstests: $ ./check btrfs/255 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg) - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad) --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000 +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100 @@ -1,2 +1,4 @@ QA output created by 255 +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory +There may be more info in syslog - try dmesg | tail Silence is golden ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff) Ran: btrfs/255 Failures: btrfs/255 Failed 1 of 1 tests To fix this make the quota disable operation take the cleaner mutex, as relocation of a block group also takes this mutex. This is also what we do when deleting a subvolume/snapshot, we take the cleaner mutex in the cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root() at btrfs_drop_snapshot() while under the protection of the cleaner mutex. Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_rootsFilipe Manana
Add a comment to struct btrfs_fs_info::dirty_cowonly_roots to mention that struct btrfs_fs_info::trans_lock is the lock that protects that list. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix race when deleting free space root from the dirty cow roots listFilipe Manana
When deleting the free space tree we are deleting the free space root from the list fs_info->dirty_cowonly_roots without taking the lock that protects it, which is struct btrfs_fs_info::trans_lock. This unsynchronized list manipulation may cause chaos if there's another concurrent manipulation of this list, such as when adding a root to it with ctree.c:add_root_to_dirty_list(). This can result in all sorts of weird failures caused by a race, such as the following crash: [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs] [337571.279928] Code: 85 38 06 00 (...) [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206 [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000 [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070 [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600 [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48 [337571.281723] FS: 00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000 [337571.281950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0 [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [337571.282874] Call Trace: [337571.283101] <TASK> [337571.283327] ? __die_body+0x1b/0x60 [337571.283570] ? die_addr+0x39/0x60 [337571.283796] ? exc_general_protection+0x22e/0x430 [337571.284022] ? asm_exc_general_protection+0x22/0x30 [337571.284251] ? commit_cowonly_roots+0x11f/0x250 [btrfs] [337571.284531] btrfs_commit_transaction+0x42e/0xf90 [btrfs] [337571.284803] ? _raw_spin_unlock+0x15/0x30 [337571.285031] ? release_extent_buffer+0x103/0x130 [btrfs] [337571.285305] reset_balance_state+0x152/0x1b0 [btrfs] [337571.285578] btrfs_balance+0xa50/0x11e0 [btrfs] [337571.285864] ? __kmem_cache_alloc_node+0x14a/0x410 [337571.286086] btrfs_ioctl+0x249a/0x3320 [btrfs] [337571.286358] ? mod_objcg_state+0xd2/0x360 [337571.286577] ? refill_obj_stock+0xb0/0x160 [337571.286798] ? seq_release+0x25/0x30 [337571.287016] ? __rseq_handle_notify_resume+0x3ba/0x4b0 [337571.287235] ? percpu_counter_add_batch+0x2e/0xa0 [337571.287455] ? __x64_sys_ioctl+0x88/0xc0 [337571.287675] __x64_sys_ioctl+0x88/0xc0 [337571.287901] do_syscall_64+0x38/0x90 [337571.288126] entry_SYSCALL_64_after_hwframe+0x72/0xdc [337571.288352] RIP: 0033:0x7f478aaffe9b So fix this by locking struct btrfs_fs_info::trans_lock before deleting the free space root from that list. Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix race when deleting quota root from the dirty cow roots listFilipe Manana
When disabling quotas we are deleting the quota root from the list fs_info->dirty_cowonly_roots without taking the lock that protects it, which is struct btrfs_fs_info::trans_lock. This unsynchronized list manipulation may cause chaos if there's another concurrent manipulation of this list, such as when adding a root to it with ctree.c:add_root_to_dirty_list(). This can result in all sorts of weird failures caused by a race, such as the following crash: [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs] [337571.279928] Code: 85 38 06 00 (...) [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206 [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000 [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070 [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600 [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48 [337571.281723] FS: 00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000 [337571.281950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0 [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [337571.282874] Call Trace: [337571.283101] <TASK> [337571.283327] ? __die_body+0x1b/0x60 [337571.283570] ? die_addr+0x39/0x60 [337571.283796] ? exc_general_protection+0x22e/0x430 [337571.284022] ? asm_exc_general_protection+0x22/0x30 [337571.284251] ? commit_cowonly_roots+0x11f/0x250 [btrfs] [337571.284531] btrfs_commit_transaction+0x42e/0xf90 [btrfs] [337571.284803] ? _raw_spin_unlock+0x15/0x30 [337571.285031] ? release_extent_buffer+0x103/0x130 [btrfs] [337571.285305] reset_balance_state+0x152/0x1b0 [btrfs] [337571.285578] btrfs_balance+0xa50/0x11e0 [btrfs] [337571.285864] ? __kmem_cache_alloc_node+0x14a/0x410 [337571.286086] btrfs_ioctl+0x249a/0x3320 [btrfs] [337571.286358] ? mod_objcg_state+0xd2/0x360 [337571.286577] ? refill_obj_stock+0xb0/0x160 [337571.286798] ? seq_release+0x25/0x30 [337571.287016] ? __rseq_handle_notify_resume+0x3ba/0x4b0 [337571.287235] ? percpu_counter_add_batch+0x2e/0xa0 [337571.287455] ? __x64_sys_ioctl+0x88/0xc0 [337571.287675] __x64_sys_ioctl+0x88/0xc0 [337571.287901] do_syscall_64+0x38/0x90 [337571.288126] entry_SYSCALL_64_after_hwframe+0x72/0xdc [337571.288352] RIP: 0033:0x7f478aaffe9b So fix this by locking struct btrfs_fs_info::trans_lock before deleting the quota root from that list. Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: tracepoints: also show actual number of the outstanding extentsNaohiro Aota
The btrfs_inode_mod_outstanding_extents trace event only shows the modified number to the number of outstanding extents. It would be helpful if we can see the resulting extent number as well. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19NFS: Add an "xprtsec=" NFS mount optionChuck Lever
After some discussion, we decided that controlling transport layer security policy should be separate from the setting for the user authentication flavor. To accomplish this, add a new NFS mount option to select a transport layer security policy for RPC operations associated with the mount point. xprtsec=none - Transport layer security is forced off. xprtsec=tls - Establish an encryption-only TLS session. If the initial handshake fails, the mount fails. If TLS is not available on a reconnect, drop the connection and try again. xprtsec=mtls - Both sides authenticate and an encrypted session is created. If the initial handshake fails, the mount fails. If TLS is not available on a reconnect, drop the connection and try again. To support client peer authentication (mtls), the handshake daemon will have configurable default authentication material (certificate or pre-shared key). In the future, mount options can be added that can provide this material on a per-mount basis. Updates to mount.nfs (to support xprtsec=auto) and nfs(5) will be sent under separate cover. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFS: Have struct nfs_client carry a TLS policy fieldChuck Lever
The new field is used to match struct nfs_clients that have the same TLS policy setting. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19ovl: enable fsnotify events on underlying real filesAmir Goldstein
Overlayfs creates the real underlying files with fake f_path, whose f_inode is on the underlying fs and f_path on overlayfs. Those real files were open with FMODE_NONOTIFY, because fsnotify code was not prapared to handle fsnotify hooks on files with fake path correctly and fanotify would report unexpected event->fd with fake overlayfs path, when the underlying fs was being watched. Teach fsnotify to handle events on the real files, and do not set real files to FMODE_NONOTIFY to allow operations on real file (e.g. open, access, modify, close) to generate async and permission events. Because fsnotify does not have notifications on address space operations, we do not need to worry about ->vm_file not reporting events to a watched overlayfs when users are accessing a mapped overlayfs file. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-6-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use backing_file container for internal files with "fake" f_pathAmir Goldstein
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode. Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path. backing_file_real_path() can be used to access the stored real path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19NFS: Improvements for fs_context-related tracepointsChuck Lever
Add some missing observability to the fs_context tracepoints added by commit 33ce83ef0bb0 ("NFS: Replace fs_context-related dprintk() call sites with tracepoints"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19fs: move kmem_cache_zalloc() into alloc_empty_file*() helpersAmir Goldstein
Use a common helper init_file() instead of __alloc_file() for alloc_empty_file*() helpers and improrve the documentation. This is needed for a follow up patch that allocates a backing_file container. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-4-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use a helper for opening kernel internal filesAmir Goldstein
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile without accounting for nr_files. cachefiles uses open_with_fake_path() for the same reason without the need for a fake path. Fork open_with_fake_path() to kernel_file_open() which only does the noaccount part and use it in cachefiles. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19NFSv4.2: SETXATTR should update ctimeAnna Schumaker
Otherwise, `stat` will report a stale value to users. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFSv4.2: Clean up xattr size macrosAnna Schumaker
Fold them into the other NFS v4.2 operations in the right spots and adjust spacing to keep the same style. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19fs: rename {vfs,kernel}_tmpfile_open()Amir Goldstein
Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile without accounting for nr_files. Rename this helper to kernel_tmpfile_open() to better reflect this helper is used for kernel internal users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19NFSv4.2: Clean up nfs4_xdr_dec_*xattr() functionsAnna Schumaker
I add commends above each function to match the style of the other nfs4_xdr_dec_*() functions. I also remove the unnecessary #ifdef CONFIG_NFS_V4_2 that was added around this code, since we are already in a v4.2-only file. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFSv4.2: Clean up: Move nfs4_xdr_enc_*xattr() functionsAnna Schumaker
They should be in the nfs4_xdr_enc_*() section, and not at the bottom of the file. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFSv4.2: Clean up: move decode_*xattr() functionsAnna Schumaker
Move them out of the encode_*() section and into the decode_*() section where it belongs. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19NFSv4.2: Clean up: Move the encode_copy_commit() functionAnna Schumaker
Move the function to be with the other encode_*() functions, instead of in the middle of the nfs4_xdr_enc_*() section. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19btrfs: update i_version in update_dev_timeJeff Layton
When updating the ctime, we also want to update i_version. This is just something I noticed by inspection. There is probably no way to test this today unless you can somehow get to this inode via nfsd. Still, I think it's the right thing to do for consistency's sake. David Sterba's comment: I don't see anything wrong with setting the iversion bit, however I also don't see where this would be useful. Agreed with the consistency, otherwise the time is updated when device super block is wiped or a device initialized, both are big events so missing that due to lack of iversion update seems unlikely. I'll add it to the queue, thanks. Signed-off-by: Jeff Layton <jlayton@kernel.org> [ add comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: make btrfs_compressed_bioset staticBen Dooks
The 'btrfs_compressed_bioset' struct isn't exported outside of the fs/btrfs/compression.c file, so make it static to fix the following sparse warning: fs/btrfs/compression.c:40:16: warning: symbol 'btrfs_compressed_bioset' was not declared. Should it be static? Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19afs: Fix waiting for writeback then skipping folioVishal Moola (Oracle)
Commit acc8d8588cb7 converted afs_writepages_region() to write back a folio batch. The function waits for writeback to a folio, but then proceeds to the rest of the batch without trying to write that folio again. This patch fixes has it attempt to write the folio again. [DH: Also remove an 'else' that adding a goto makes redundant] Fixes: acc8d8588cb7 ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()") Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/20230607204120.89416-2-vishal.moola@gmail.com/
2023-06-19afs: Fix dangling folio ref counts in writebackVishal Moola (Oracle)
Commit acc8d8588cb7 converted afs_writepages_region() to write back a folio batch. If writeback needs rescheduling, the function exits without dropping the references to the folios in fbatch. This patch fixes that. [DH: Moved the added line before the _leave()] Fixes: acc8d8588cb7 ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()") Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/20230607204120.89416-1-vishal.moola@gmail.com/
2023-06-19btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profileMatt Corallo
Callers of `btrfs_reduce_alloc_profile` expect it to return exactly one allocation profile flag, and failing to do so may ultimately result in a WARN_ON and remount-ro when allocating new blocks, like the below transaction abort on 6.1. `btrfs_reduce_alloc_profile` has two ways of determining the profile, first it checks if a conversion balance is currently running and uses the profile we're converting to. If no balance is currently running, it returns the max-redundancy profile which at least one block in the selected block group has. This works by simply checking each known allocation profile bit in redundancy order. However, `btrfs_reduce_alloc_profile` has not been updated as new flags have been added - first with the `DUP` profile and later with the RAID1C34 profiles. Because of the way it checks, if we have blocks with different profiles and at least one is known, that profile will be selected. However, if none are known we may return a flag set with multiple allocation profiles set. This is currently only possible when a balance from one of the three unhandled profiles to another of the unhandled profiles is canceled after allocating at least one block using the new profile. In that case, a transaction abort like the below will occur and the filesystem will need to be mounted with -o skip_balance to get it mounted rw again (but the balance cannot be resumed without a similar abort). [770.648] ------------[ cut here ]------------ [770.648] BTRFS: Transaction aborted (error -22) [770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs] [770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G W 6.1.0-0.deb11.7-powerpc64le #1 Debian 6.1.20-2~bpo11+1a~test [770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV [770.648] NIP: c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0 [770.648] REGS: c000200089afe9a0 TRAP: 0700 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test) [770.648] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28848282 XER: 20040000 [770.648] CFAR: c000000000135110 IRQMASK: 0 GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026 GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027 GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8 GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000 GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001 GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800 GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001 [770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs] [770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] [770.648] Call Trace: [770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable) [770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs] [770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs] [770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs] [770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs] [770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs] [770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs] [770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs] [770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs] [770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs] [770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs] [770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs] [770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs] [770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs] [770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs] [770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310 [770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250 [770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258 [770.648] --- interrupt: c00 at 0x7fff94126800 [770.648] NIP: 00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000 [770.648] REGS: c000200089affe80 TRAP: 0c00 Tainted: G W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test) [770.648] MSR: 900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE> CR: 24002848 XER: 00000000 [770.648] IRQMASK: 0 GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003 GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000 GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000 GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73 GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000 GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001 GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000 [770.648] NIP [00007fff94126800] 0x7fff94126800 [770.648] LR [0000000107e0b594] 0x107e0b594 [770.648] --- interrupt: c00 [770.648] Instruction dump: [770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000 [770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070 [770.648] ---[ end trace 0000000000000000 ]--- [770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown [770.648] BTRFS info (device dm-2: state EA): forced readonly [770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown [770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22 [770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown [770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown Fixes: 47e6f7423b91 ("btrfs: add support for 3-copy replication (raid1c3)") Fixes: 8d6fac0087e5 ("btrfs: add support for 4-copy replication (raid1c4)") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Matt Corallo <blnxfsl@bluematt.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workersQu Wenruo
Since the scrub rework introduced by commit 2af2aaf98205 ("btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface") and later commits, scrub only needs one single workqueue, fs_info::scrub_worker. That scrub_wr_completion_workers is initially to handle the delay work after write bios finished. But the new scrub code goes submit-and-wait for write bios, thus all the work are done inside the scrub_worker. The last user of fs_info::scrub_wr_completion_workers is removed in commit 16f93993498b ("btrfs: scrub: remove the old writeback infrastructure"), so we can safely remove the workqueue. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: scrub: remove scrub_ctx::csum_list memberQu Wenruo
Since the rework of scrub introduced by commit 2af2aaf98205 ("btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface") and later commits, scrub no longer keeps its data checksum inside sctx. Instead we have scrub_stripe::csums for the checksum of the stripe. Thus we can remove the unused scrub_ctx::csum_list member. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON after failure to migrate space during truncationFilipe Manana
During truncation we reserve 2 metadata units when starting a transaction (reserved space goes to fs_info->trans_block_rsv) and then attempt to migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the local block reserve. If we ever fail we trigger a BUG_ON(), which should never happen, because we reserved 2 units. However if we happen to fail for some reason, we don't need to be so dire and hit a BUG_ON(), we can just error out the truncation and, since this is highly unexpected, surround the error check with WARN_ON(), to get an informative stack trace and tag the branh as 'unlikely'. Also make the 'min_size' variable const, since it's not supposed to ever change and any accidental change could possibly make the space migration not so unlikely to fail. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON on failure to get dir index for new snapshotFilipe Manana
During the transaction commit path, at create_pending_snapshot(), there is no need to BUG_ON() in case we fail to get a dir index for the snapshot in the parent directory. This should fail very rarely because the parent inode should be loaded in memory already, with the respective delayed inode created and the parent inode's index_cnt field already initialized. However if it fails, it may be -ENOMEM like the comment at create_pending_snapshot() says or any error returned by btrfs_search_slot() through btrfs_set_inode_index_count(), which can be pretty much anything such as -EIO or -EUCLEAN for example. So the comment is not correct when it says it can only be -ENOMEM. However doing a BUG_ON() here is overkill, since we can instead abort the transaction and return the error. Note that any error returned by create_pending_snapshot() will eventually result in a transaction abort at cleanup_transaction(), called from btrfs_commit_transaction(), but we can explicitly abort the transaction at this point instead so that we get a stack trace to tell us that the call to btrfs_set_inode_index() failed. So just abort the transaction and return in case btrfs_set_inode_index() returned an error at create_pending_snapshot(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: send: do not BUG_ON() on unexpected symlink data extentFilipe Manana
There's really no need to BUG_ON() if we find a symlink with an extent that is not inline or is compressed. We can just make send fail with an error (-EUCLEAN) and log an informative error message, so just do that instead of BUG_ON(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() when dropping inode items from log rootFilipe Manana
When dropping inode items from a log tree at drop_inode_items(), we this BUG_ON() on the result of btrfs_search_slot() because we don't expect an exact match since having a key with an offset of (u64)-1 is unexpected. That is generally true, but for dir index keys for example, we can get a key with such an offset value, even though it's very unlikely and it would take ages to increase the sequence counter for a dir index up to (u64)-1. We can deal with an exact match, we just have to delete the key at that slot, so there is really no need to BUG_ON(), error out or trigger any warning. So remove the BUG_ON(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: replace BUG_ON() at split_item() with proper error handlingFilipe Manana
There's no need to BUG_ON() at split_item() if the leaf does not have enough free space for the new item, we can just return -ENOSPC since the caller can deal with errors from split_item(). Also, as this is a very unlikely condition to happen, because the caller has invoked setup_leaf_for_split() before calling split_item(), surround the condition with a WARN_ON() which makes it easier to notice this unexpected condition and tags the if branch with 'unlikely' as well. I've actually once hit this BUG_ON() with some incorrect code changes I had, which was very inconvenient as it required rebooting the VM. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()Filipe Manana
At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. This implies btrfs_del_ptr() return an int instead of void, and making all callers check for returned errors. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()Filipe Manana
At insert_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. This implies making insert_ptr() return an int instead of void, and making all callers check for returned errors. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()Filipe Manana
At insert_new_root(), instead of doing a BUG_ON() in case we fail to record the tree mod log operation, just return the error to the callers after releasing the allocated tree block. At this point we haven't made any changes to the b+tree, so we haven't left it in an inconsistent state and therefore have no need to abort the transaction. All we need to do is to unlock and free the extent buffer we just allocated with the purpose of making it the new root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()Filipe Manana
At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the caller. There's really no need for the BUG_ON() as we can release all resources in this context, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: abort transaction at update_ref_for_cow() when ref count is zeroFilipe Manana
At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find that the extent buffer has an unexpected ref count of zero, however we can simply use btrfs_abort_transaction(), which achieves the same purposes: to turn the fs to error state, abort the current transaction and turn the fs to RO mode as well. Besides that, btrfs_abort_transaction() also prints a stack trace which makes it more useful. Also, as this is a very unexpected situation, indicating a serious corruption/inconsistency, tag the if branch as 'unlikely', set the error code to -EUCLEAN instead of -EROFS, and log an explicit message. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: abort transaction at balance_level() when left child is missingFilipe Manana
At balance_level() we are calling btrfs_handle_fs_error() when the middle child only has 1 item and the left child is missing, however we can simply use btrfs_abort_transaction(), which achieves the same purposes: to turn the fs to error state, abort the current transaction and turn the fs to RO mode. Besides that, btrfs_abort_transaction() also prints a stack trace which makes it more useful. Also, as this is a highly unexpected case and it's about a b+tree inconsistency, change the error code from -EROFS to -EUCLEAN, tag the if branch as 'unlikely' and log an explicit error message. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: avoid unnecessarily setting the fs to RO and error state at ↵Filipe Manana
balance_level() At balance_level(), when trying to promote a child node to a root node, if we fail to read the child we call btrfs_handle_fs_error(), which turns the fs to RO mode and sets it to error state as well, causing any ongoing transaction to abort. This however is not necessary because at that point we have not made any change yet at balance_level(), so any error reading the child node does not leaves us in any inconsistent state. Therefore we can just return the error to the caller. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: rename enospc label to out at balance_level()Filipe Manana
At balance_level() we have this 'enospc' label where we jump to in case we get an error at several places. However that error is certainly not -ENOSPC in call cases, it can be -EIO or -ENOMEM when reading a child extent buffer for example, or -ENOMEM when trying to record tree mod log operations. So to make this less confusing, rename the label to 'out'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() on tree mod log failure at balance_level()Filipe Manana
At balance_level(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in this context, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block()Filipe Manana
At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to record a tree mod log root insertion operation, do a transaction abort instead. There's really no need for the BUG_ON(), we can properly release all resources in this context and turn the filesystem to RO mode and in an error state instead. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: avoid tree mod log ENOMEM failures when we don't need to logFilipe Manana
When logging tree mod log operations we start by checking, in a lockless manner, if we need to log - if we don't, we just return and do nothing, otherwise we will allocate one or more tree mod log operations and then check again if we need to log. This second check will take the tree mod log lock in write mode if we need to log, otherwise it will do nothing and we just free the allocated memory and return success. We can improve on this by not returning an error in case the memory allocations fail, unless the second check tells us that we actually need to log. That is, if we fail to allocate memory and the second check tells use that we don't need to log, we can just return success and avoid returning -ENOMEM to the caller. Currently tree mod log failures are dealt with either a BUG_ON() or a transaction abort, as tree mod log operations are logged in code paths that modify a b+tree. So just avoid failing with -ENOMEM if we fail to allocate a tree mod log operation unless we actually need to log the operations, that is, if tree_mod_dont_log() returns true. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix extent buffer leak after tree mod log failure at split_node()Filipe Manana
At split_node(), if we fail to log the tree mod log copy operation, we return without unlocking the split extent buffer we just allocated and without decrementing the reference we own on it. Fix this by unlocking it and decrementing the ref count before returning. Fixes: 5de865eebb83 ("Btrfs: fix tree mod logging") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: add missing error handling when logging operation while COWing extent ↵Filipe Manana
buffer When COWing an extent buffer that is not the root node, we need to log in the tree mod log that we replaced a pointer in the parent node, otherwise a tree mod log user doing a search on the b+tree can return incorrect results (that miss something). We are doing the call to btrfs_tree_mod_log_insert_key() but we totally ignore its return value. So fix this by adding the missing error handling, resulting in a transaction abort and freeing the COWed extent buffer. Fixes: f230475e62f7 ("Btrfs: put all block modifications into the tree mod log") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO methodChristoph Hellwig
Since commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Do that for btrfs so that noop_direct_IO can eventually be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: update documentation for a block group's bg_list memberFilipe Manana
Currently we are only documenting two uses of the bg_list member of a block group, but there two more: 1) To track deleted block groups for discard purposes, introduced in commit e33e17ee1098 ("btrfs: add missing discards when unpinning extents with -o discard"); 2) To track block groups for automatic reclaim, introduced more recently by commit 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") So document those two other use cases. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>