summaryrefslogtreecommitdiff
path: root/fs/bcachefs/btree_journal_iter.c
AgeCommit message (Collapse)Author
2025-06-24bcachefs: fix bch2_journal_keys_peek_prev_min() underflowKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: opts.journal_rewindKent Overstreet
Add a mount option for rewinding the journal, bringing the entire filesystem to where it was at a previous point in time. This is for extreme disaster recovery scenarios - it's not intended as an undelete operation. The option takes a journal sequence number; the desired sequence number can be determined with 'bcachefs list_journal' Caveats: - The 'journal_transaction_names' option must have been enabled (it's on by default). The option controls emitting of extra debug info in the journal, so we can see what individual transactions were doing; It also enables journalling of keys being overwritten, which is what we rely on here. - A full fsck run will be automatically triggered since alloc info will be inconsistent. Only leaf node updates to non-alloc btrees are rewound, since rewinding interior btree updates isn't possible or desirable. - We can't do anything about data that was deleted and overwritten. Lots of metadata updates after the point in time we're rewinding to shouldn't cause a problem, since we segragate data and metadata allocations (this is in order to make repair by btree node scan practical on larger filesystems; there's a small 64-bit per device bitmap in the superblock of device ranges with btree nodes, and we try to keep this small). However, having discards enabled will cause problems, since buckets are discarded as soon as they become empty (this is why we don't implement fstrim: we don't need it). Hopefully, this feature will be a one-off thing that's never used again: this was implemented for recovering from the "vfs i_nlink 0 -> subvol deletion" bug, and that bug was unusually disastrous and additional safeguards have since been implemented. But if it does turn out that we need this more in the future, I'll have to implement an option so that empty buckets aren't discarded immediately - lagging by perhaps 1% of device capacity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix bch2_journal_keys_peek_prev_min()Kent Overstreet
this code is rarely invoked, so - we had a few bugs left from basing it off of bch2_journal_keys_peek_max()... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02bcachefs: bch_err_throw()Kent Overstreet
Add a tracepoint for any time we return an error and unwind. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01bcachefs: Replace rcu_read_lock() with guardsKent Overstreet
The new guard(), scoped_guard() allow for more natural code. Some of the uses with creative flow control have been left. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28bcachefs: Use bch2_kvmalloc() for journal keys arrayKent Overstreet
We can hit this limit fairly easy when we have to reconstuct large amounts of alloc info on large filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06bcachefs: Use sort_nonatomic() instead of sort()Kent Overstreet
Fixes "task out to lunch" warnings during recovery on large machines with lots of dirty data in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26bcachefs: cond_resched() in journal_key_sort_cmp()Kent Overstreet
Fixes "task out to lunch" warnings during recovery on large machines with lots of dirty data in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: fix bch2_journal_key_insert_take() seqKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: fix O(n^2) issue with whiteouts in journal keysKent Overstreet
The journal_keys array can't be substantially modified after we go RW, because lookups need to be able to check it locklessly - thus we're limited on what we can do when a key in the journal has been overwritten. This is a problem when there's many overwrites to skip over for peek() operations. To fix this, add tracking of ranges of overwrites: we create a range entry when there's more than one contiguous whiteout. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: btree_and_journal_iter: don't iterate over too many whiteouts when ↵Kent Overstreet
prefetching To help ameloriate issues with peek operations having to skip over deletions in the journal - just bail out if all we're doing is prefetching btree nodes. Since btree node prefetching runs every time we iterate to a new node, and has to sequentially scan ahead, this avoids another O(n^2). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: journal keys: sort keys for interior nodes firstKent Overstreet
There's an unavoidable issue with btree lookups when we're overlaying journal keys and the journal has many deletions for keys present in the btree - peek operations will have to iterate over all those deletions to find the next live key to return. This is mainly a problem for lookups in interior nodes, if we have to traverse to a leaf. Looking up an insert position in a leaf (for journal replay) doesn't have to find the next live key, but walking down the btree does. So to ameloriate this, change journal key sort ordering so that we replay keys from roots and interior nodes first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: kill bch2_journal_entries_free()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Implement bch2_btree_iter_prev_min()Kent Overstreet
A user contributed a filessytem dump, where the dump was actually corrupted (due to being taken while the filesystem was online), but which exposed an interesting bug in fsck - reconstruct_inode(). When itearting in BTREE_ITER_filter_snapshots mode, it's required to give an end position for the iteration and it can't span inode numbers; continuing into the next inode might mean we start seeing keys from a different snapshot tree, that the is_ancestor() checks always filter, thus we're never able to return a key and stop iterating. Backwards iteration never implemented the end position because nothing else needed it - except for reconstuct_inode(). Additionally, backwards iteration is now able to overlay keys from the journal, which will be useful if we ever decide to start doing journal replay in the background. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Rename btree_iter_peek_upto() -> btree_iter_peek_max()Kent Overstreet
We'll be introducing btree_iter_peek_prev_min(), so rename for consistency. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Avoid bch2_btree_id_str()Kent Overstreet
Prefer bch2_btree_id_to_text() - it prints out the integer ID when unknown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09bcachefs: Add a cond_resched() to __journal_keys_sort()Kent Overstreet
Without this, we'd potentially sort multiple times without a cond_resched(), leading to hung task warnings on larger systems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14bcachefs: Accumulate accounting keys in journal replayKent Overstreet
Until accounting keys hit the btree, they are deltas, not new versions of the existing key; this means we have to teach journal replay to accumulate them. Additionally, the journal doesn't track precisely which entries have been flushed to the btree; it only tracks a range of entries that may possibly still need to be flushed. That means we need to compare accounting keys against the version in the btree and only flush updates that are newer. There's another wrinkle with the write buffer: if the write buffer starts flushing accounting keys before journal replay has finished flushing accounting keys, journal replay will see the version number from the new updates and updates from the journal will be lost. To avoid this, journal replay has to flush accounting keys first, and we'll be adding a flag so that write buffer flush knows to hold accounting keys until then. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08bcachefs: bch2_journal_keys_dump()Kent Overstreet
debug helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-10bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()Kent Overstreet
We weren't respecting trans->journal_replay_not_finished - we shouldn't be searching the journal keys unless we have a ref on them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-07bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()Kent Overstreet
Multiple bug fixes for journal iters: - When the journal keys gap buffer is resized, we have to adjust the iterators for moving the gap to the end - We don't want to rewind iterators to point to the key we just inserted if it's not for the correct btree/level Also, add some new assertions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03bcachefs: bch2_shoot_down_journal_keys()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31bcachefs: Be careful about btree node splits during journal replayKent Overstreet
Don't pick a pivot that's going to be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31bcachefs: btree_and_journal_iter now respects trans->journal_replay_not_finishedKent Overstreet
btree_and_journal_iter is now safe to use at runtime, not just during recovery before journal keys have been freed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13bcachefs: split out ignore_blacklisted, ignore_not_dirtyKent Overstreet
prep work for replaying the journal backwards Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13bcachefs: improve move_gap()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13bcachefs: journal_keys now uses darray helpersKent Overstreet
nice bit of code cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13bcachefs: Rename journal_keys.d -> journal_keys.dataKent Overstreet
This will let us use some darray helpers in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13bcachefs: jset_entry for loops declare loop iterKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13bcachefs: kill kvpmalloc()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: btree node prefetching in check_topologyKent Overstreet
btree_and_journal_iter is old code that we want to get rid of, but we're not ready to yet. lack of btree node prefetching is, it turns out, a real performance issue for fsck on spinning rust, so - add it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10bcachefs: btree_and_journal_iter.transKent Overstreet
we now always have a btree_trans when using a btree_and_journal_iter; prep work for adding prefetching to btree_and_journal_iter Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05bcachefs: __journal_keys_sort() refactoringKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01bcachefs: convert bch_fs_flags to x-macroKent Overstreet
Now we can print out filesystem flags in sysfs, useful for debugging various "what's my filesystem doing" issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01bcachefs: Kill btree_iter->journal_posKent Overstreet
For BTREE_ITER_WITH_JOURNAL, we memoize lookups in the journal keys, to avoid the binary search overhead. Previously we stashed the pos of the last key returned from the journal, in order to force the lookup to be redone when rewinding. Now bch2_journal_keys_peek_upto() handles rewinding itself when necessary - so we can slim down btree_iter. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-24bcachefs: Proper refcounting for journal_keysKent Overstreet
The btree iterator code overlays keys from the journal until journal replay is finished; since we're now starting copygc/rebalance etc. before replay is finished, this is multithreaded access and thus needs refcounting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22bcachefs: btree_journal_iter.cKent Overstreet
Split out a new file from recovery.c for managing the list of keys we read from the journal: before journal replay finishes the btree iterator code needs to be able to iterate over and return keys from the journal as well, so there's a fair bit of code here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>