Age | Commit message | Author
2025-04-03 | bcachefs: Add error handling for zlib_deflateInit2() | Wentao Liang
In attempt_compress(), the return value of zlib_deflateInit2() needs to be checked. A proper implementation can be found in pstore_compress(). Add an error check and return 0 immediately if the initialization fails. Fixes: 986e9842fb68 ("bcachefs: Compression levels") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
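The pattern described in this commit, checking the result of an initialization call and reporting "nothing compressed" on failure, can be sketched in plain C. Everything below is a hypothetical stand-in: `toy_stream`, `toy_deflate_init()` and `toy_attempt_compress()` are not the zlib or bcachefs functions. Only the convention "return 0 if initialization fails" is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compressor state; not the zlib z_stream. */
struct toy_stream {
	int level;
	int ready;
};

/* Hypothetical init: returns 0 on success, a negative value on failure. */
static int toy_deflate_init(struct toy_stream *s, int level)
{
	if (level < 0 || level > 9)
		return -1;
	s->level = level;
	s->ready = 1;
	return 0;
}

/*
 * Returns the number of bytes written to dst, or 0 if nothing was
 * compressed. A failed init is reported as 0 instead of being ignored.
 */
static size_t toy_attempt_compress(void *dst, size_t dst_len,
				   const void *src, size_t src_len, int level)
{
	struct toy_stream s = { 0 };

	if (toy_deflate_init(&s, level) != 0)
		return 0;	/* the kind of check the commit adds */

	if (src_len > dst_len)
		return 0;

	/* Placeholder "compression": a plain copy, to keep the sketch short. */
	memcpy(dst, src, src_len);
	return src_len;
}
```

A caller that treats 0 as "store the data uncompressed" needs no extra error path for the failed-init case.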
2025-04-02 | bcachefs: add missing selection of XARRAY_MULTI | Eric Biggers
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert a large folio in the page cache. Fix this by making bcachefs select XARRAY_MULTI. Fixes: be212d86b19c ("bcachefs: bs > ps support") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 | bcachefs: bch_dev_usage_full | Kent Overstreet
All the fastpaths that need device usage don't need the sector totals or fragmentation, just bucket counts. Split bch_dev_usage up into two different versions: the normal one with just bucket counts, and bch_dev_usage_full with the sector totals and fragmentation as well. This is also a stack usage improvement, since we have a bch_dev_usage on the stack in the allocation path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
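A minimal sketch of the split described here, assuming a fixed number of data types. The struct layouts, field names and `TOY_DATA_TYPES` are hypothetical and are not the bcachefs definitions. The sketch only illustrates why a counts-only struct is cheaper to keep on the stack than one that also carries sector totals and fragmentation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DATA_TYPES 8	/* hypothetical number of data types */

/* Slim version: bucket counts only, cheap to keep on the stack. */
struct toy_dev_usage {
	uint64_t buckets[TOY_DATA_TYPES];
};

/* Full version: adds sector totals and fragmentation per data type. */
struct toy_dev_usage_full {
	struct {
		uint64_t buckets;
		uint64_t sectors;
		uint64_t fragmented;
	} d[TOY_DATA_TYPES];
};

/* Fast-path style helper: needs only the bucket counts. */
static uint64_t toy_buckets_used(const struct toy_dev_usage *u)
{
	uint64_t total = 0;

	for (int i = 0; i < TOY_DATA_TYPES; i++)
		total += u->buckets[i];
	return total;
}

/* Derive the slim view from the full one when both are needed. */
static struct toy_dev_usage toy_usage_from_full(const struct toy_dev_usage_full *f)
{
	struct toy_dev_usage u;

	for (int i = 0; i < TOY_DATA_TYPES; i++)
		u.buckets[i] = f->d[i].buckets;
	return u;
}
```

With these layouts the slim struct is a third of the size of the full one, which is the kind of saving that matters when the struct lives in a deep call chain.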
2025-04-02 | bcachefs: Kill btree_iter.trans | Kent Overstreet
This was planned to be done ages ago, now finally completed; there are places where we have quite a few btree_trans objects on the stack, so this reduces stack usage somewhat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 | bcachefs: do_trace_key_cache_fill() | Kent Overstreet
Reducing stack frame usage; this moves the printbuf out of the main stack frame. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02 | bcachefs: Split up bch_dev.io_ref | Kent Overstreet
We now have separate per device io_refs for read and write access. This fixes a device removal bug where the discard workers were still running while we're removing alloc info for that device. It's also a bit of hardening; we no longer allow writes to devices that are read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-01 | bcachefs: fix ref leak in btree_node_read_all_replicas | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 | bcachefs: Fix null ptr deref in bch2_write_endio() | Kent Overstreet
This was previously hard to hit since it requires racing with device removal, but splitting up io_ref uncovered it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 | bcachefs: Fix field spanning write warning | Kent Overstreet
Struct with embedded VLA...
memcpy: detected field-spanning write (size 8) of single field "&gc->r.e" at fs/bcachefs/ec.c:465 (size 3)
WARNING: CPU: 1 PID: 936 at fs/bcachefs/ec.c:465 bch2_trigger_stripe+0x706/0x730
Modules linked in:
CPU: 1 UID: 0 PID: 936 Comm: mount.bcachefs Not tainted 6.14.0-rc6-ktest-00236-gefb0b5c62dbc #55
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:bch2_trigger_stripe+0x706/0x730
Code: b4 00 01 b9 03 00 00 00 48 89 fb 48 c7 c7 33 54 da 81 48 89 d6 49 89 d6 48 c7 c2 c3 36 db 81 e8 60 54 c5 ff 48 89 df 4c 89 f2 <0f> 0b e9 5c fd ff ff e8 fe 5e 4e 00 bf 10 00 00 00 48 c7 c6 ff ff
RSP: 0018:ffff88817081f680 EFLAGS: 00010246
RAX: f8fe7dd1c56b5600 RBX: ffff888101265368 RCX: 0000000000000027
RDX: 0000000000000008 RSI: 00000000fffbffff RDI: ffff888101265368
RBP: 0000000000000000 R08: 000000000003ffff R09: ffff88817f1fe000
R10: 00000000000bfffd R11: 0000000000000004 R12: ffff8881012652c0
R13: 0000000000000000 R14: 0000000000000008 R15: ffff88817081f6c9
FS: 00007fc428bc7c80(0000) GS:ffff888179280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd3ee4a038 CR3: 000000010a9bc000 CR4: 0000000000750eb0
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0xce/0x1b0
? bch2_trigger_stripe+0x706/0x730
? report_bug+0x11b/0x1a0
? bch2_trigger_stripe+0x706/0x730
? handle_bug+0x5e/0x90
? exc_invalid_op+0x1a/0x50
? asm_exc_invalid_op+0x1a/0x20
? bch2_trigger_stripe+0x706/0x730
bch2_gc_mark_key+0x2cf/0x430
bch2_check_allocations+0x1a64/0x1ed0
? vsnprintf+0x1ad/0x420
? bch2_check_allocations+0x191f/0x1ed0
bch2_run_recovery_passes+0x13b/0x2b0
bch2_fs_recovery+0x9b7/0x1290
? __bch2_print+0xb2/0xf0
? bch2_printbuf_exit+0x1e/0x30
? print_mount_opts+0x153/0x180
bch2_fs_start+0x274/0x3b0
bch2_fs_get_tree+0x516/0x6e0
vfs_get_tree+0x21/0xa0
do_new_mount+0x153/0x350
__x64_sys_mount+0x16c/0x1f0
do_syscall_64+0x6c/0x140
? arch_exit_to_user_mode_prepare+0x9/0x40
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-31 | bcachefs: Fix striping behaviour | Kent Overstreet
For striping across devices, we maintain "clocks", and we advance them by the inverse of "how much free space this device has left", so that we round robin biased in favor of devices with more free space. This code was originally trying to do EWMA-ish stuff when originally written, ~10 years ago, and was never properly cleaned up when it was realized that an EWMA is not the right approach here. That left a bug, when we rescale to keep all the clocks in the correct range and prevent overflow. It was assumed that we'd always be allocating from the device with the smallest clock hand, but that's actually not correct: with the target options, allocations will be first tried from a subset of devices, and then the entire filesystem if that fails. Thus, the rescale from the first allocation - allocating from a subset of devices - can pick the wrong rescale value and cause the rest of the clocks to go to 0, losing information. This results in incorrect striping behaviour when the desired number of replicas doesn't fit on the foreground target. Link: https://www.reddit.com/r/bcachefs/comments/1jn3t26/replica_allocation_not_evenly_distributed_among/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
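The clock scheme described here can be illustrated with a small sketch. This is not the bcachefs allocator: `toy_stripe`, `toy_pick_dev()` and `toy_advance()` are hypothetical, and the commit message does not spell out the exact fix. The sketch shows one way to rescale that stays correct when the allocation came from a subset of devices, namely subtracting the minimum clock over all devices instead of assuming the chosen device holds the minimum.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_DEVS 4

struct toy_stripe {
	uint64_t clock[TOY_NR_DEVS];
};

/* Pick the candidate (bit set in mask) with the smallest clock; -1 if none. */
static int toy_pick_dev(const struct toy_stripe *s, unsigned mask)
{
	int best = -1;

	for (int i = 0; i < TOY_NR_DEVS; i++) {
		if (!(mask & (1U << i)))
			continue;
		if (best < 0 || s->clock[i] < s->clock[best])
			best = i;
	}
	return best;
}

/* Advance the chosen device by the inverse of its free space, then rescale. */
static void toy_advance(struct toy_stripe *s, int dev, uint64_t free_space)
{
	uint64_t min;

	s->clock[dev] += (UINT64_C(1) << 32) / (free_space ? free_space : 1);

	/*
	 * Rescale by the minimum over ALL devices. The chosen device may
	 * have come from a subset and need not hold the smallest clock, so
	 * subtracting its value could clamp other clocks to 0.
	 */
	min = s->clock[0];
	for (int i = 1; i < TOY_NR_DEVS; i++)
		if (s->clock[i] < min)
			min = s->clock[i];
	for (int i = 0; i < TOY_NR_DEVS; i++)
		s->clock[i] -= min;
}
```

Because every clock is reduced by the same amount and none is clamped, the relative order of the devices is preserved across the rescale.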
2025-03-30 | bcachefs: fix bch2_write_point_to_text() units | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 | bcachefs: Log original key being moved in data updates | Kent Overstreet
There's something going on with the data move path; log the original key being moved for debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 | bcachefs: BCH_JSET_ENTRY_log_bkey | Kent Overstreet
Add a journal entry type for logging - but logging a bkey, not a string; to be used for data move path debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 | bcachefs: Reorder error messages that include journal debug | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 | bcachefs: Don't use designated initializers for disk_accounting_pos | Kent Overstreet
Not all compilers fully initialize these - they're not guaranteed to because of the union shenanigans. Fixes: https://github.com/koverstreet/bcachefs/issues/844 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
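A sketch of the alternative to designated initializers: zero the whole object with memset() and then assign fields. The `toy_acc_pos` type and its fields are hypothetical and are not the real disk_accounting_pos layout. The point is that memset() sets every byte, including the union bytes that the selected member does not cover and any padding, which an initializer is not guaranteed to do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum toy_acc_type { TOY_ACC_DEV = 1, TOY_ACC_COMPRESSION = 2 };

/* Hypothetical accounting position: a type tag plus a union of per-type fields. */
struct toy_acc_pos {
	uint8_t type;
	union {
		struct { uint8_t dev; uint8_t data_type; } dev_data;
		struct { uint8_t compression; } comp;
		uint8_t raw[15];
	};
};

static void toy_acc_pos_init_dev(struct toy_acc_pos *p, uint8_t dev, uint8_t data_type)
{
	/* Zero every byte first, including unused union bytes and padding. */
	memset(p, 0, sizeof(*p));
	p->type = TOY_ACC_DEV;
	p->dev_data.dev = dev;
	p->dev_data.data_type = data_type;
}

/* Byte-wise comparison is only meaningful if all bytes were initialized. */
static int toy_acc_pos_eq(const struct toy_acc_pos *a, const struct toy_acc_pos *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Two positions built this way compare equal byte for byte regardless of what was in the memory beforehand, which is what matters when the struct is later used as a key.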
2025-03-30 | bcachefs: Silence errors after emergency shutdown | Kent Overstreet
We don't care about errors from asynchronous ops that were caused by an emergency shutdown; silence them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 | bcachefs: fix units in rebalance_status | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30 | bcachefs: bch2_ioctl_subvolume_destroy() fixes | Kent Overstreet
bch2_evict_subvolume_inodes() was getting stuck - due to incorrectly pruning the dcache. Also, fix missing permissions checks. Reported-by: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 | bcachefs: Clear fs_path_parent on subvolume unlink | Kent Overstreet
This fixes recursive subvolume removal. Subvolume deletion is asynchronous; fs_path_parent, and thus the entry in the subvolume_children btree, need to be cleared when the subvolume is unlinked from the fs hierarchy - else we'll spuriously think a subvolume has children and deletion will fail. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 | bcachefs: Change btree_insert_node() assertion to error | Kent Overstreet
Debug for https://github.com/koverstreet/bcachefs/issues/843 Print useful debug info and go emergency read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-29 | bcachefs: Better printing of inconsistency errors | Kent Overstreet
Build up and emit the error message for an inconsistency error all at once, instead of spread over multiple printk calls, so they're not jumbled in the dmesg log. Also, add better indenting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
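The "build the whole message, then emit it once" approach can be sketched with a fixed-size buffer. `toy_buf` and its helpers are hypothetical and much simpler than the kernel's printbuf. The sketch only shows accumulating indented lines and handing the result to the output in a single call, so that lines from other sources cannot be interleaved with it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size message buffer with a current indent level. */
struct toy_buf {
	char		data[512];
	size_t		len;
	unsigned	indent;
};

/* Append formatted text; silently truncates when the buffer is full. */
static void toy_buf_printf(struct toy_buf *b, const char *fmt, ...)
{
	va_list args;
	int n;

	if (b->len >= sizeof(b->data) - 1)
		return;

	va_start(args, fmt);
	n = vsnprintf(b->data + b->len, sizeof(b->data) - b->len, fmt, args);
	va_end(args);

	if (n < 0)
		return;
	b->len += (size_t) n;
	if (b->len > sizeof(b->data) - 1)
		b->len = sizeof(b->data) - 1;	/* output was truncated */
}

/* Append one line at the current indent. */
static void toy_buf_line(struct toy_buf *b, const char *text)
{
	toy_buf_printf(b, "%*s%s\n", (int) b->indent, "", text);
}

/* Emit the accumulated message with a single output call. */
static void toy_buf_emit(const struct toy_buf *b, FILE *out)
{
	fputs(b->data, out);
}
```

A caller would zero-initialize a `toy_buf`, add a headline, raise `indent` for the detail lines, and call `toy_buf_emit()` once at the end.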
2025-03-29 | bcachefs: bch2_count_fsck_err() | Kent Overstreet
Factor out a helper from __bch2_fsck_err(), for counting the error in the superblock and deciding whether to print or ratelimit - will be used to replace some log_fsck_err() calls, where we want to lift out printing the error message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Better helpers for inconsistency errors | Kent Overstreet
An inconsistency error often happens as part of an event with multiple error messages, and we want to build up one single error message with proper indenting to produce more readable log messages that don't get garbled. Add new helpers that emit messages to a printbuf instead of printing them directly, next patch will convert to use them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Consistent indentation of multiline fsck errors | Kent Overstreet
Add the new helper printbuf_indent_add_nextline(), and use it in __bch2_fsck_err() to centralize setting the indentation of multiline fsck errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Add an "ignore unknown" option to bch2_parse_mount_opts() | Kent Overstreet
To be used by the mount helper in userspace, where we still have options to be parsed by other layers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: bch2_time_stats_init_no_pcpu() | Kent Overstreet
Add a mode to disable automatic switching to percpu mode, useful when a time_stats will only be used by one thread and we don't want to have to flush the percpu buffers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Fix bch2_fs_get_tree() error path | Florian Albrechtskirchinger
When a filesystem is mounted read-only, subsequent attempts to mount it as read-write fail with EBUSY. Previously, the error path in bch2_fs_get_tree() would unconditionally call __bch2_fs_stop(), improperly freeing resources for a filesystem that was still actively mounted. This change modifies the error path to only call __bch2_fs_stop() if the superblock has no valid root dentry, ensuring resources are not cleaned up prematurely when the filesystem is in use. Signed-off-by: Florian Albrechtskirchinger <falbrechtskirchinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: fix logging in journal_entry_err_msg() | Kent Overstreet
We want to log errors all at once, not spread across multiple printks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: add missing newline in bch2_trans_updates_to_text() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: print_string_as_lines: fix extra newline | Kent Overstreet
Don't print a newline on empty string; this was causing us to also print an extra newline when we got to the end of the string. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
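A minimal sketch of line-by-line printing that emits nothing for an empty string and no extra line after a trailing newline. `toy_print_as_lines()` and `toy_emit_stdout()` are hypothetical and are not the kernel helper. Here the per-line output call is a callback so the behaviour can be checked by counting calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Emit each line of str with its own call to emit(); returns the number
 * of calls made. An empty string (or the empty remainder after a
 * trailing newline) produces no call.
 */
static unsigned toy_print_as_lines(const char *str,
				   void (*emit)(const char *line, size_t len))
{
	unsigned calls = 0;

	while (*str) {
		const char *nl = strchr(str, '\n');
		size_t len = nl ? (size_t) (nl - str) : strlen(str);

		emit(str, len);
		calls++;

		str += len;
		if (*str == '\n')
			str++;	/* skip the separator itself */
	}
	return calls;
}

/* Example emitter: one output call per line. */
static void toy_emit_stdout(const char *line, size_t len)
{
	printf("%.*s\n", (int) len, line);
}
```

Blank lines in the middle of the input are still emitted; only the empty tail is suppressed.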
2025-03-28 | bcachefs: Fix WARN() in bch2_bkey_pick_read_device() | Kent Overstreet
syzbot discovered that this one is possible: we have pointers, but none of them are to valid devices. Reported-by: syzbot+336a6e6a2dbb7d4dba9a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Don't return 0 size holes from bch2_seek_hole() | Kent Overstreet
The hole we find in the btree might be fully dirty in the page cache. If so, keep searching. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Fix bch2_seek_hole() locking | Kent Overstreet
We can't call bch2_seek_pagecache_hole(), and block on page locks, with btree locks held. This is easily fixed because we're at the end of the transaction - we can just unlock, we don't need a drop_locks_do(). Reported-by: https://github.com/nagalun Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Recovery no longer holds state_lock | Kent Overstreet
state_lock guards against devices coming or leaving, changing state, or the filesystem changing between ro <-> rw. But it's not necessary for running recovery passes, and holding it blocks asynchronous events that would cause us to go RO or kick out devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-28 | bcachefs: Fix permissions on version modparam | Kent Overstreet
There's no reason for this not to be world readable - it provides the currently supported on disk format version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 | bcachefs: cond_resched() in journal_key_sort_cmp() | Kent Overstreet
Fixes "task out to lunch" warnings during recovery on large machines with lots of dirty data in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 | bcachefs: Fix 'hung task' messages in btree node scan | Kent Overstreet
btree node scan has to wait on kthread workers that scan each device, potentially for a while. We would like this to be interruptible, but we may need a different mechanism than signals for that - we've had bugs in the past where mounts were failing due to checking for signals, with no explanation of where they came from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 | bcachefs: Fix btree iter flags in data move (2) | Kent Overstreet
Data move -> move_get_io_opts -> bch2_get_update_rebalance_opts requires a not_extents iterator; this fixes the path where we're walking the extents btree and chase a reflink pointer into the reflink btree. bch2_lookup_indirect_extent() requires working with an extents iterator (due to peek_slot() semantics), so we implement bch2_lookup_indirect_extent_for_move(). This is simplified because there's no need to report indirect_extent_missing_errors here, that can be deferred until fsck or when a user reads that data. Reported-by: Maël Kerbiriou <mael.kerbiriou@free.fr> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 | bcachefs: Don't unnecessarily decrypt data when moving | Kent Overstreet
There's various checks for "are we going to compress this" - but we're not going to compress if we know it's incompressible. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 | bcachefs: Document disk accounting keys and counters | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 | bcachefs: Validate number of counters for accounting keys | Kent Overstreet
We weren't checking that accounting keys have the expected number of counters. Originally we probably wanted to be flexible on this, but it doesn't look like that will be required - accounting is extended by adding new counter types, not more counters to an existing type. This means we can drop a BUG_ON() that popped once in automated testing, and the new validation will make that bug easier to track down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
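The validation described here can be sketched as a table lookup: each known accounting type has a fixed expected number of counters, and a key with any other count is rejected. The type names and counts below are hypothetical, not the bcachefs values, and accepting unknown types is an assumption made for this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accounting types; the real list and counts differ. */
enum toy_counter_type {
	TOY_CT_NR_INODES,
	TOY_CT_DEV_DATA,
	TOY_CT_COMPRESSION,
	TOY_CT_NR,
};

/* Expected number of counters per type (illustrative values only). */
static const unsigned toy_expected_counters[TOY_CT_NR] = {
	[TOY_CT_NR_INODES]	= 1,
	[TOY_CT_DEV_DATA]	= 3,
	[TOY_CT_COMPRESSION]	= 3,
};

/* Returns 0 if the key is acceptable, -1 if the counter count is wrong. */
static int toy_validate_counters(unsigned type, unsigned nr_counters)
{
	/* Assumption: types this code does not know about are not rejected. */
	if (type >= TOY_CT_NR)
		return 0;

	return nr_counters == toy_expected_counters[type] ? 0 : -1;
}
```

Rejecting a bad count at validation time turns what would otherwise be an assertion failure deep in the accounting code into an ordinary, reportable error.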
2025-03-25 | bcachefs: Use print_string_as_lines() for journal stuck messages | Kent Overstreet
They were being truncated; printk has a 1k limit per call. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 | bcachefs: Fix duplicate checksum error messages in write path | Kent Overstreet
Also, improve the message in prep_encoded_data() - it now prints good/bad checksums, and checksum type. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25 | bcachefs: Fix silent short reads in data read retry path | Kent Overstreet
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size to "the size we can read from the current extent" with a swap, and restores it to "the size for the total read" after the read_extent call with another swap. But we neglected to do the restore before the "if (ret) goto err;" - which is a problem if we're retrying those errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
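The swap-and-restore pattern described here can be sketched in isolation. `toy_iter`, `toy_read_extent()` and `toy_read_one()` are hypothetical stand-ins, not the bcachefs read path. The sketch shows the ordering that matters: the second swap happens before the error check, so a caller that retries after a failure still sees the size of the total read.

```c
#include <assert.h>

#define toy_swap(a, b) do { unsigned _t = (a); (a) = (b); (b) = _t; } while (0)

/* Hypothetical iterator: size is the number of bytes left in the whole read. */
struct toy_iter {
	unsigned size;
};

/* Hypothetical extent read: succeeds or fails without touching the iterator. */
static int toy_read_extent(const struct toy_iter *iter, int fail)
{
	(void) iter;
	return fail ? -1 : 0;
}

static int toy_read_one(struct toy_iter *iter, unsigned extent_len, int fail)
{
	unsigned bytes = extent_len < iter->size ? extent_len : iter->size;
	int ret;

	toy_swap(iter->size, bytes);	/* size: this extent, bytes: total */
	ret = toy_read_extent(iter, fail);
	toy_swap(iter->size, bytes);	/* restore BEFORE looking at ret */

	if (ret)
		return ret;		/* size is the total again; safe to retry */

	iter->size -= bytes;		/* advance past what was read */
	return 0;
}
```

If the restore were placed after the error check, a failed attempt would leave `size` at the extent length and a retry would silently read less than was requested.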
2025-03-25 | bcachefs: Fix nonce inconsistency in bch2_write_prep_encoded_data() | Kent Overstreet
If we're moving an extent that was partially overwritten, bch2_write_rechecksum() will trim it to the currently live range. If we then also want to compress it, it'll be decrypted - but the nonce has been advanced for the overwritten start of the extent that we dropped, and we were using the nonce we calculated before rechecksum(). Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Fixes: 127d90d2823e ("bcachefs: bch2_write_prep_encoded_data() now returns errcode") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 | bcachefs: Kill unnecessary bch2_dev_usage_read() | Kent Overstreet
bch2_dev_usage_read() is fairly expensive, we should optimize this more. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 | bcachefs: btree node write errors now print btree node | Kent Overstreet
It turned out a user was wondering why we were going read-only after a write error, and he didn't realize he didn't have replication enabled - this will make that more obvious, and we should be printing it anyways. Link: https://www.reddit.com/r/bcachefs/comments/1jf9akl/large_data_transfers_switched_bcachefs_to_readonly/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 | bcachefs: Fix race in print_chain() | Kent Overstreet
00636 Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b0
00636 Mem abort info:
00636 ESR = 0x0000000096000005
00636 EC = 0x25: DABT (current EL), IL = 32 bits
00636 SET = 0, FnV = 0
00636 EA = 0, S1PTW = 0
00636 FSC = 0x05: level 1 translation fault
00636 Data abort info:
00636 ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
00636 CM = 0, WnR = 0, TnD = 0, TagAccess = 0
00636 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00636 user pgtable: 4k pages, 39-bit VAs, pgdp=0000000101b10000
00636 [00000000000000b0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00636 Internal error: Oops: 0000000096000005 [#1] SMP
00636 Modules linked in:
00636 CPU: 12 UID: 0 PID: 79369 Comm: cat Not tainted 6.14.0-rc6-ktest-g3783b8973ab7 #17757
00636 Hardware name: linux,dummy-virt (DT)
00636 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00636 pc : print_chain+0xb8/0x170
00636 lr : print_chain+0xa0/0x170
00636 sp : ffffff80d9c1bbb0
00636 x29: ffffff80d9c1bbb0 x28: 0000000000000002 x27: ffffff80c1be8250
00636 x26: ffffff80dd9b0000 x25: 0000000000000020 x24: 000000000000002d
00636 x23: 000000000000003c x22: ffffffc080a54518 x21: ffffff80da6e00d0
00636 x20: ffffff80da6e0170 x19: ffffff80c1a1d240 x18: 00000000ffffffff
00636 x17: 3535303937202d3c x16: 203139202d3c2035 x15: 00000000ffffffff
00636 x14: 0000000000000000 x13: ffffff80d71b63f1 x12: 0000000000000006
00636 x11: ffffffc080beb1c0 x10: 0000000000000020 x9 : 00000000000134cc
00636 x8 : 0000000000000020 x7 : 0000000000000004 x6 : 0000000000000020
00636 x5 : ffffff80d71b63f7 x4 : ffffffc080a5451b x3 : 0000000000000000
00636 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
00636 Call trace:
00636 print_chain+0xb8/0x170 (P)
00636 bch2_check_for_deadlock+0x444/0x5a0
00636 bch2_btree_deadlock_read+0xb4/0x1c8
00636 full_proxy_read+0x74/0xd8
00636 vfs_read+0x90/0x300
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 | bcachefs: btree_trans_restart_foreign_task() | Kent Overstreet
In debug mode, we save the call stack on transaction restart - but there's no locking, so we can't touch it if we're issuing the restart from another thread. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24 | bcachefs: bch2_disk_accounting_mod2() | Kent Overstreet
We're hitting some issues with uninitialized struct padding, flagged by kmsan. They appear to be false positives, otherwise bch2_accounting_validate() would have flagged them as "junk at end". But for now, we'll need to initialize disk_accounting_pos with memset(). This adds a new helper, bch2_disk_accounting_mod2(), that initializes a disk_accounting_pos and does the accounting mod all at once - so overall things actually get slightly more ergonomic. BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning about them and they're a bit special. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>