summaryrefslogtreecommitdiff
path: root/fs/bcachefs
AgeCommit message (Collapse)Author
11 daysbcachefs: Don't trace should_be_locked unless changingKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Ensure that snapshot creation propagates has_case_insensitiveKent Overstreet
We normally can't create a new directory with the case-insensitive option already set - except when we're creating a snapshot. And if casefolding is enabled filesystem wide, we should still set it even though not strictly required, for consistency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Print devices we're mounting on multi device filesystemsKent Overstreet
Previously, we only ever logged the filesystem UUID. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Don't trust sb->nr_devices in members_to_text()Kent Overstreet
We have to be able to print superblock sections even if they fail to validate (for debugging), so we have to calculate the number of entries from the field size. Reported-by: syzbot+5138f00559ffb3cb3610@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix version checks in validate_bset()Kent Overstreet
It seems btree node scan picked up a partially overwritten btree node, and corrected the "bset version older than sb version_min" error - resulting in an invalid superblock with a bad version_min field. Don't run this check at all when we're in btree node scan, and when we do run it, do something saner if the bset version is totally crazy. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: ioctl: avoid stack overflow warningArnd Bergmann
Multiple ioctl handlers individually use a lot of stack space, and clang chooses to inline them into the bch2_fs_ioctl() function, blowing through the warning limit: fs/bcachefs/chardev.c:655:6: error: stack frame size (1032) exceeds limit (1024) in 'bch2_fs_ioctl' [-Werror,-Wframe-larger-than] 655 | long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg) By marking the largest two of them as noinline_for_stack, no indidual code path ends up using this much, which avoids the warning and reduces the possible total stack usage in the ioctl handler. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Don't pass trans to fsck_err() in gc_accounting_doneKent Overstreet
fsck_err() can return a transaction restart if passed a transaction object - this has always been true when it has to drop locks to prompt for user input, but we're seeing this more now that we're logging the error being corrected in the journal. gc_accounting_done() doesn't call fsck_err() from an actual commit loop, and it doesn't need to be holding btree locks when it calls fsck_err(), so the easy fix here for the unhandled transaction restart is to just not pass it the transaction object. We'll miss out on the fancy new logging, but that's ok. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix leak in bch2_fs_recovery() error pathKent Overstreet
Fix a small leak of the superblock 'clean' section. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix rcu_pending for PREEMPT_RTKent Overstreet
PREEMPT_RT redefines how standard spinlocks work, so local_irq_save() + spin_lock() is no longer equivalent to spin_lock_irqsave(). Fortunately, we don't strictly need to do it that way. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix downgrade_table_extra()Kent Overstreet
Fix a UAF: we were calling darray_make_room() and retaining a pointer to the old buffer. And fix an UBSAN warning: struct bch_sb_field_downgrade_entry uses __counted_by, so set dst->nr_errors before assigning to the array entry. Reported-by: syzbot+14c52d86ddbd89bea13e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Don't put rhashtable on stackKent Overstreet
Object debugging generally needs special provisions for putting said objects on the stack, which rhashtable does not have. Reported-by: syzbot+bcc38a9556d0324c2ec2@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Make sure opts.read_only gets propagated back to VFSKent Overstreet
If we think we're read-only but the VFS doesn't, fun will ensue. And now that we know we have to be able to do this safely, just make nochanges imply ro. Reported-by: syzbot+a7d6ceaba099cc21dee4@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Fix possible console lock involved deadlockAlan Huang
Link: https://lore.kernel.org/all/6822ab02.050a0220.f2294.00cb.GAE@google.com/T/ Reported-by: syzbot+2c3ef91c9523c3d1a25c@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: mark more errors autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Don't persistently run scan_for_btree_nodesKent Overstreet
bch2_btree_lost_data() gets called on btree node read error, but the error might be transient. btree_node_scan is expensive, and there's no need to run it persistently (marking it in the superblock as required to run) - check_topology will run it if required, via bch2_get_scanned_nodes(). Running it non-persistently is fine, to avoid check_topology having to rewind recovery to run it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Read error message now prints if self healingKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Only run 'increase_depth' for keys from btree node csanKent Overstreet
bch2_btree_increase_depth() was originally for disaster recovery, to get some data back from the journal when a btree root was bad. We don't need it for that purpose anymore; on bad btree root we'll launch btree node scan and reconstruct all the interior nodes. If there's a key in the journal for a depth that doesn't exists, and it's not from check_topology/btree node scan, we should just ignore it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Mark need_discard_freespace_key_bad autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Update /dev/disk/by-uuid on device addKent Overstreet
Invalidate pagecache after we write the new superblock and send a uevent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Add more flags to btree nodes for rewrite reasonKent Overstreet
It seems excessive forced btree node rewrites can cause interior btree updates to become wedged during recovery, before we're using the write buffer for backpointer updates. Add more flags so we can determine where these are coming from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Add range being updated to btree_update_to_text()Kent Overstreet
We had a deadlock during recovery where interior btree updates became wedged and all open_buckets were consumed; start adding more introspection. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Log fsck errors in the journalKent Overstreet
Log the specific error being corrected in the journal when we're repairing, this helps greatly with 'bcachefs list_journal' analysis. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Add missing restart handling to check_topology()Kent Overstreet
The next patch will add logging of the specific error being corrected in repair paths to the journal; this means __bch2_fsck_err() can return transaction restarts in places that previously weren't expecting them. check_topology() is old code that doesn't use btree iterators for btree node locking - it'll have to be rewritten in the future to work online. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04Merge tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull more bcachefs updates from Kent Overstreet: "More bcachefs updates: - More stack usage improvements (~600 bytes) - Define CLASS()es for some commonly used types, and convert most rcu_read_lock() uses to the new lock guards - New introspection: - Superblock error counters are now available in sysfs: previously, they were only visible with 'show-super', which doesn't provide a live view - New tracepoint, error_throw(), which is called any time we return an error and start to unwind - Repair - check_fix_ptrs() can now repair btree node roots - We can now repair when we've somehow ended up with the journal using a superblock bucket - Revert some leftovers from the aborted directory i_size feature, and add repair code: some userspace programs (e.g. sshfs) were getting confused It seems in 6.15 there's a bug where i_nlink on the vfs inode has been getting incorrectly set to 0, with some unfortunate results; list_journal analysis showed bch2_inode_rm() being called (by bch2_evict_inode()) when it clearly should not have been. - bch2_inode_rm() now runs "should we be deleting this inode?" checks that were previously only run when deleting unlinked inodes in recovery - check_subvol() was treating a dangling subvol (pointing to a missing root inode) like a dangling dirent, and deleting it. This was the really unfortunate one: check_subvol() will now recreate the root inode if necessary This took longer to debug than it should have, and we lost several filesystems unnecessarily, because users have been ignoring the release notes and blindly running 'fsck -y'. Debugging required reconstructing what happened through analyzing the journal, when ideally someone would have noticed 'hey, fsck is asking me if I want to repair this: it usually doesn't, maybe I should run this in dry run mode and check what's going on?' As a reminder, fsck errors are being marked as autofix once we've verified, in real world usage, that they're working correctly; blindly running 'fsck -y' on an experimental filesystem is playing with fire Up to this incident we've had an excellent track record of not losing data, so let's try to learn from this one This is a community effort, I wouldn't be able to get this done without the help of all the people QAing and providing excellent bug reports and feedback based on real world usage. But please don't ignore advice and expect me to pick up the pieces If an error isn't marked as autofix, and it /is/ happening in the wild, that's also something I need to know about so we can check it out and add it to the autofix list if repair looks good. I haven't been getting those reports, and I should be; since we don't have any sort of telemetry yet I am absolutely dependent on user reports Now I'll be spending the weekend working on new repair code to see if I can get a filesystem back for a user who didn't have backups" * tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs: (69 commits) bcachefs: add cond_resched() to handle_overwrites() bcachefs: Make journal read log message a bit quieter bcachefs: Fix subvol to missing root repair bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm() bcachefs: delete dead code from may_delete_deleted_inode() bcachefs: Add flags to subvolume_to_text() bcachefs: Fix oops in btree_node_seq_matches() bcachefs: Fix dirent_casefold_mismatch repair bcachefs: Fix bch2_fsck_rename_dirent() for casefold bcachefs: Redo bch2_dirent_init_name() bcachefs: Fix -Wc23-extensions in bch2_check_dirents() bcachefs: Run check_dirents second time if required bcachefs: Run snapshot deletion out of system_long_wq bcachefs: Make check_key_has_snapshot safer bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT bcachefs: bch2_require_recovery_pass() bcachefs: bch_err_throw() bcachefs: Repair code for directory i_size bcachefs: Kill un-reverted directory i_size code bcachefs: Delete redundant fsck_err() ...
2025-06-04bcachefs: add cond_resched() to handle_overwrites()Kent Overstreet
Fix soft lockup warnings in btree nodes can. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Make journal read log message a bit quieterKent Overstreet
Users seem to be assuming that the 'dropped unflushed entries' message at the end of journal read indicates some sort of problem, when it does not - we expect there to be entries in the journal that weren't commited, it's purely informational so that we can correlate journal sequence numbers elsewhere when debugging. Shorten the log message a bit to hopefully make this clearer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Fix subvol to missing root repairKent Overstreet
We had a bug where the root inode of a subvolume was erronously deleted: bch2_evict_inode() called bch2_inode_rm(), meaning the VFS inode's i_nlink was somehow set to 0 when it shouldn't have - the inode in the btree indicated it clearly was not unlinked. This has been addressed with additional safety checks in bch2_inode_rm() - pulling in the safety checks we already were doing when deleting unlinked inodes in recovery - but the really disastrous bug was in check_subvols(), which on finding a dangling subvol (subvol with a missing root inode) would delete the subvolume. I assume this bug dates from early check_directory_structure() code, which originally handled subvolumes and normal paths - the idea being that still live contents of the subvolume would get reattached somewhere. But that's incorrect, and disastrously so; deleting a subvolume triggers deleting the snapshot ID it points to, deleting the entire contents. The correct way to repair is to recreate the root inode if it's missing; then any contents will get reattached under that subvolume's lost+found. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()Kent Overstreet
We had a bug where bch2_evict_inode() incorrectly called bch2_inode_rm() - the journal clearly showed the inode was not unlinked. We've got checks that we use in recovery when cleaning up deleted inodes, lift them to bch2_inode_rm() as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: delete dead code from may_delete_deleted_inode()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Add flags to subvolume_to_text()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Fix oops in btree_node_seq_matches()Kent Overstreet
btree_update_nodes_written() needs to wait on in-flight writes to old nodes before marking them as freed. But it has no reason to pin those old nodes in memory, so some trickyness ensues. The update we're completing deleted references to those nodes from the btree, so we know if they've been evicted they can't be pulled back in. We just have to check if the nodes we have pointers to are still those old nodes, and haven't been reused. To do that we check the node's "sequence number" (actually a random 64 bit cookie), but that lives in the node's data buffer. 'struct btree' can't be freed until filesystem shutdown (as they're quite small), but the data buffers can be freed or swapped around. Commit 1f88c3567495, which was fixing a kmsan warning, assumed that we could safely do this locklessly with just a READ_ONCE() - if we've got a non-null ptr it would be safe to read from. But that's not true if the data buffer is a vmalloc allocation, so we need to restore the locking that commit deleted (or alternatively RCU free those data buffers, but there's no other reason for that). Fixes: 1f88c3567495 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Fix dirent_casefold_mismatch repairKent Overstreet
Instead of simply recreating a mis-casefolded dirent, use the str_hash repair code, which will rename it if necessary - the dirent might have been created again with the correct casefolding. Factor out out bch2_str_hash_repair key() from __bch2_str_hash_check_key() for the new path to use, and export bch2_dirent_create_key() as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Fix bch2_fsck_rename_dirent() for casefoldKent Overstreet
bch2_fsck_renamed_dirent was creating bch_dirent keys open-coded - but we need to use the appropriate helper, if the directory is casefolded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Redo bch2_dirent_init_name()Kent Overstreet
Redo (and simplify somewhat) how casefolded and non casefolded dirents are initialized, and export this to be used by fsck_rename_dirent(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-04bcachefs: Fix -Wc23-extensions in bch2_check_dirents()Nathan Chancellor
Clang warns (or errors with CONFIG_WERROR=y): fs/bcachefs/fsck.c:2325:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions] 2325 | int ret = bch2_trans_run(c, | ^ On clang-17 and older, this is an unconditional error: fs/bcachefs/fsck.c:2325:2: error: expected expression 2325 | int ret = bch2_trans_run(c, | ^ Move the declaration of ret to the top of the function to resolve both ways this issue manifests. Fixes: c72def523799 ("bcachefs: Run check_dirents second time if required") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Run check_dirents second time if requiredKent Overstreet
If we move a key backwards, we'll need a second pass to run the rest of the fsck checks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Run snapshot deletion out of system_long_wqKent Overstreet
We don't want this running out of the same workqueue, and blocking, writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Make check_key_has_snapshot saferKent Overstreet
Snapshot deletion v2 added sentinal values for deleted snapshots, so "key for deleted snapshot" - i.e. snapshot deletion missed something - is safe to repair automatically. But if we find a key for a missing snapshot we have no idea what happened, and we shouldn't delete it unless we're very sure that everything else is consistent. So hook it up to the new bch2_require_recovery_pass(), we'll now only delete if snapshots and subvolumes have recenlty been checked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: BCH_RECOVERY_PASS_NO_RATELIMITKent Overstreet
Add a superblock flag to temporarily disable ratelimiting for a recovery pass. This will be used to make check_key_has_snapshot safer: we don't want to delete a key for a missing snapshot unless we know that the snapshots and subvolumes btrees are consistent, i.e. check_snapshots and check_subvols have run recently. Changing those btrees - creating/deleting a subvolume or snapshot - will set the "disable ratelimit" flag, i.e. ensuring that those passes run if check_key_has_snapshot discovers an error. We're only disabling ratelimiting in the snapshot/subvol delete paths, we're not so concerned about the create paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: bch2_require_recovery_pass()Kent Overstreet
Add a helper for requiring that a recovery pass has already run: either run it directly, if we're still in recovery, or if we're not in recovery check if it has run recently and schedule it if it hasn't. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: bch_err_throw()Kent Overstreet
Add a tracepoint for any time we return an error and unwind. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Repair code for directory i_sizeKent Overstreet
We had a bug due due to an incomplete revert of the patch implementing directory i_size (summing up the size of the dirents), leading to completely screwy i_size values that underflow. Most userspace programs don't seem to care (e.g. du ignores it), but it turns out this broke sshfs, so needs to be repaired. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Kill un-reverted directory i_size codeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Delete redundant fsck_err()Kent Overstreet
'inode_has_wrong_backpointer'; we have more specific errors for every case afterwards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: Convert BUG() to errorKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01bcachefs: Add better logging to fsck_rename_dirent()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01bcachefs: Replace rcu_read_lock() with guardsKent Overstreet
The new guard(), scoped_guard() allow for more natural code. Some of the uses with creative flow control have been left. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01bcachefs: CLASS(btree_trans)Kent Overstreet
Allow btree_trans to be used with CLASS(). Automatic cleanup, instead of manually calling bch2_trans_put(). We don't use DEFINE_CLASS because using a static inline for the constructor breaks bch2_trans_get()'s use of __func__, so we have to open code it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
2025-05-31bcachefs: CLASS(darray)Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>