Age | Commit message | Author |
|
The new guard(), scoped_guard() allow for more natural code.
Some of the uses with creative flow control have been left unconverted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
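A minimal userspace sketch of the scope-based cleanup idea behind guard(),
assuming the GCC/Clang __attribute__((cleanup)) extension. The toy_lock type,
the toy_guard() macro and counter_add() are hypothetical names for
illustration; they are not the kernel's macros.

```c
#include <assert.h>

/* Toy lock: just tracks whether it is held, so the sketch needs no threads. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Cleanup handler: runs automatically when the guard variable leaves scope. */
static void toy_guard_release(struct toy_lock **l)
{
	toy_lock_release(*l);
}

/* Hypothetical analogue of guard(): acquire now, release at end of scope. */
#define toy_guard(l)							\
	struct toy_lock *guard_var					\
		__attribute__((cleanup(toy_guard_release), unused)) =	\
		(toy_lock_acquire(l), (l))

static struct toy_lock lock;
static int counter;

/* Adds delta to counter under the lock; negative deltas are rejected. */
static int counter_add(int delta)
{
	toy_guard(&lock);

	if (delta < 0)
		return -1;	/* lock released here, no goto/unlock label */

	counter += delta;
	return counter;		/* ... and released here as well */
}
```

The early return is the point of the sketch: every exit path drops the lock
without an explicit unlock call.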
|
|
There can be a lot of redundancy in accounting updates within a single
btree transaction.
Split out accounting updates so that they can be deduped, in the next
commit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
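As a rough illustration of what deduplicating accounting updates could look
like, the sketch below merges updates that share a key by summing their
deltas. The acc_update struct and acc_updates_dedup() are hypothetical
stand-ins, not the bcachefs data structures, and the quadratic scan is chosen
only for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending update: a key plus a signed delta. */
struct acc_update {
	uint64_t key;
	int64_t	 delta;
};

/*
 * Merges updates with equal keys in place, summing their deltas.
 * Returns the number of entries left at the front of the array.
 */
static size_t acc_updates_dedup(struct acc_update *u, size_t nr)
{
	size_t out = 0;

	for (size_t i = 0; i < nr; i++) {
		size_t j;

		/* Look for an already-kept entry with the same key. */
		for (j = 0; j < out; j++)
			if (u[j].key == u[i].key)
				break;

		if (j < out)
			u[j].delta += u[i].delta;
		else
			u[out++] = u[i];
	}

	return out;
}
```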
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This pops up when building in userspace.
The structs aren't actually variable length, but no way to tell the
compiler that...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_gc_accounting_done() already holds mark_lock for write when it
invokes bch2_accounting_mem_mod_locked(); bch2_accounting_mem_insert()
then tries to lock mark_lock again.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We weren't checking that accounting keys have the expected number of
accounters. Originally we probably wanted to be flexible on this, but it
doesn't look like that will be required - accounting is extended by
adding new counter types, not more counters to an existing type.
This means we can drop a BUG_ON() that popped once in automated testing,
and the new validation will make that bug easier to track down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
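A minimal sketch of this kind of check, assuming each counter type has a fixed
number of counters. The type names, the counts in the table and
acc_key_validate() are hypothetical and only illustrate the shape of the
validation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counter types. */
enum acc_type {
	ACC_TYPE_nr_inodes,
	ACC_TYPE_replicas,
	ACC_TYPE_compression,
	ACC_TYPE_NR,
};

/* Hypothetical table: how many counters a key of each type must carry. */
static const unsigned acc_type_nr_counters[ACC_TYPE_NR] = {
	[ACC_TYPE_nr_inodes]	= 1,
	[ACC_TYPE_replicas]	= 1,
	[ACC_TYPE_compression]	= 3,
};

/* Returns 0 if the key carries exactly the counters its type expects. */
static int acc_key_validate(unsigned type, unsigned nr_counters)
{
	if (type >= ACC_TYPE_NR)
		return 0;	/* unknown (newer) type: nothing to check against */

	return nr_counters == acc_type_nr_counters[type] ? 0 : -EINVAL;
}
```

Rejecting a mismatched key at validation time is what lets later code index
the counters without its own bounds assertion.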
|
|
We're hitting some issues with uninitialized struct padding, flagged by
kmsan.
They appear to be false positives, otherwise bch2_accounting_validate()
would have flagged them as "junk at end". But for now, we'll need to
initialize disk_accounting_pos with memset().
This adds a new helper, bch2_disk_accounting_mod2(), that initializes a
disk_accounting_pos and does the accounting mod all at once - so overall
things actually get slightly more ergonomic.
BCH_DISK_ACCOUNTING_replicas keys are left for now; KMSAN isn't warning
about them and they're a bit special.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
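The memset() approach can be sketched in userspace as follows. The acc_pos
struct and acc_pos_init() are hypothetical and much simpler than
disk_accounting_pos; the sketch also assumes, as is typical for GCC and Clang,
that storing to a member leaves the padding bytes untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical position struct; padding usually follows 'type'. */
struct acc_pos {
	uint8_t	 type;
	uint64_t inum;
};

/*
 * Zeroes the whole object first, so padding bytes are initialized too,
 * then fills in the fields.
 */
static void acc_pos_init(struct acc_pos *p, uint8_t type, uint64_t inum)
{
	memset(p, 0, sizeof(*p));
	p->type = type;
	p->inum = inum;
}
```

With every byte initialized, two positions built from the same inputs can be
compared or hashed bytewise, which is the property a sanitizer is checking.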
|
|
acc->k.data should be used with the lock held:
00221 ========= TEST generic/187
00221 run fstests generic/187 at 2025-02-09 21:08:10
00221 spectre-v4 mitigation disabled by command-line option
00222 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00222 bcachefs (vdc): initializing new filesystem
00222 bcachefs (vdc): going read-write
00222 bcachefs (vdc): marking superblocks
00222 bcachefs (vdc): initializing freespace
00222 bcachefs (vdc): done initializing freespace
00222 bcachefs (vdc): reading snapshots table
00222 bcachefs (vdc): reading snapshots done
00222 bcachefs (vdc): done starting filesystem
00222 bcachefs (vdc): shutting down
00222 bcachefs (vdc): going read-only
00222 bcachefs (vdc): finished waiting for writes to stop
00223 bcachefs (vdc): flushing journal and stopping allocators, journal seq 6
00223 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 8
00223 bcachefs (vdc): clean shutdown complete, journal seq 9
00223 bcachefs (vdc): marking filesystem clean
00223 bcachefs (vdc): shutdown complete
00223 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00223 bcachefs (vdc): initializing new filesystem
00223 bcachefs (vdc): going read-write
00223 bcachefs (vdc): marking superblocks
00223 bcachefs (vdc): initializing freespace
00223 bcachefs (vdc): done initializing freespace
00223 bcachefs (vdc): reading snapshots table
00223 bcachefs (vdc): reading snapshots done
00223 bcachefs (vdc): done starting filesystem
00244 hrtimer: interrupt took 123350440 ns
00264 bcachefs (vdc): shutting down
00264 bcachefs (vdc): going read-only
00264 bcachefs (vdc): finished waiting for writes to stop
00264 bcachefs (vdc): flushing journal and stopping allocators, journal seq 97
00265 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 101
00265 bcachefs (vdc): clean shutdown complete, journal seq 102
00265 bcachefs (vdc): marking filesystem clean
00265 bcachefs (vdc): shutdown complete
00265 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro
00265 bcachefs (vdc): recovering from clean shutdown, journal seq 102
00265 bcachefs (vdc): accounting_read...
00265 ==================================================================
00265 done
00265 BUG: KASAN: slab-use-after-free in bch2_fs_to_text+0x12b4/0x1728
00265 bcachefs (vdc): alloc_read... done
00265 bcachefs (vdc): stripes_read... done
00265 Read of size 4 at addr ffffff80c57eac00 by task cat/7531
00265 bcachefs (vdc): snapshots_read... done
00265
00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103
00265 Hardware name: linux,dummy-virt (DT)
00265 Call trace:
00265 show_stack+0x1c/0x30 (C)
00265 dump_stack_lvl+0x6c/0x80
00265 print_report+0xf8/0x5d8
00265 kasan_report+0x90/0xd0
00265 __asan_report_load4_noabort+0x1c/0x28
00265 bch2_fs_to_text+0x12b4/0x1728
00265 bch2_fs_show+0x94/0x188
00265 sysfs_kf_seq_show+0x1a4/0x348
00265 kernfs_seq_show+0x12c/0x198
00265 seq_read_iter+0x27c/0xfd0
00265 kernfs_fop_read_iter+0x390/0x4f8
00265 vfs_read+0x480/0x7f0
00265 ksys_read+0xe0/0x1e8
00265 __arm64_sys_read+0x70/0xa8
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265
00265 Allocated by task 7510:
00265 kasan_save_stack+0x28/0x50
00265 kasan_save_track+0x1c/0x38
00265 kasan_save_alloc_info+0x3c/0x50
00265 __kasan_kmalloc+0xac/0xb0
00265 __kmalloc_node_noprof+0x168/0x348
00265 __kvmalloc_node_noprof+0x20/0x140
00265 __bch2_darray_resize_noprof+0x90/0x1b0
00265 __bch2_accounting_mem_insert+0x76c/0xb08
00265 bch2_accounting_mem_insert+0x224/0x3b8
00265 bch2_accounting_mem_mod_locked+0x480/0xc58
00265 bch2_accounting_read+0xa94/0x3eb8
00265 bch2_run_recovery_pass+0x80/0x178
00265 bch2_run_recovery_passes+0x340/0x698
00265 bch2_fs_recovery+0x1c98/0x2bd8
00265 bch2_fs_start+0x240/0x490
00265 bch2_fs_get_tree+0xe1c/0x1458
00265 vfs_get_tree+0x7c/0x250
00265 path_mount+0xe24/0x1648
00265 __arm64_sys_mount+0x240/0x438
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265
00265 Freed by task 7510:
00265 kasan_save_stack+0x28/0x50
00265 kasan_save_track+0x1c/0x38
00265 kasan_save_free_info+0x48/0x88
00265 __kasan_slab_free+0x48/0x60
00265 kfree+0x188/0x408
00265 kvfree+0x3c/0x50
00265 __bch2_darray_resize_noprof+0xe0/0x1b0
00265 __bch2_accounting_mem_insert+0x76c/0xb08
00265 bch2_accounting_mem_insert+0x224/0x3b8
00265 bch2_accounting_mem_mod_locked+0x480/0xc58
00265 bch2_accounting_read+0xa94/0x3eb8
00265 bch2_run_recovery_pass+0x80/0x178
00265 bch2_run_recovery_passes+0x340/0x698
00265 bch2_fs_recovery+0x1c98/0x2bd8
00265 bch2_fs_start+0x240/0x490
00265 bch2_fs_get_tree+0xe1c/0x1458
00265 vfs_get_tree+0x7c/0x250
00265 path_mount+0xe24/0x1648
00265 bcachefs (vdc): going read-write
00265 __arm64_sys_mount+0x240/0x438
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265
00265 The buggy address belongs to the object at ffffff80c57eac00
00265 which belongs to the cache kmalloc-128 of size 128
00265 The buggy address is located 0 bytes inside of
00265 freed 128-byte region [ffffff80c57eac00, ffffff80c57eac80)
00265
00265 The buggy address belongs to the physical page:
00265 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1057ea
00265 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
00265 flags: 0x8000000000000040(head|zone=2)
00265 page_type: f5(slab)
00265 raw: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122
00265 raw: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400
00265 head: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122
00265 head: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400
00265 head: 8000000000000001 fffffffec315fa81 ffffffffffffffff 0000000000000000
00265 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000
00265 page dumped because: kasan: bad access detected
00265
00265 Memory state around the buggy address:
00265 ffffff80c57eab00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00265 ffffff80c57eab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00265 >ffffff80c57eac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
00265 ^
00265 ffffff80c57eac80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00265 ffffff80c57ead00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
00265 ==================================================================
00265 Kernel panic - not syncing: kasan.fault=panic set ...
00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103
00265 Hardware name: linux,dummy-virt (DT)
00265 Call trace:
00265 show_stack+0x1c/0x30 (C)
00265 dump_stack_lvl+0x30/0x80
00265 dump_stack+0x18/0x20
00265 panic+0x4d4/0x518
00265 start_report.constprop.0+0x0/0x90
00265 kasan_report+0xa0/0xd0
00265 __asan_report_load4_noabort+0x1c/0x28
00265 bch2_fs_to_text+0x12b4/0x1728
00265 bch2_fs_show+0x94/0x188
00265 sysfs_kf_seq_show+0x1a4/0x348
00265 kernfs_seq_show+0x12c/0x198
00265 seq_read_iter+0x27c/0xfd0
00265 kernfs_fop_read_iter+0x390/0x4f8
00265 vfs_read+0x480/0x7f0
00265 ksys_read+0xe0/0x1e8
00265 __arm64_sys_read+0x70/0xa8
00265 invoke_syscall.constprop.0+0x74/0x1e8
00265 do_el0_svc+0xc8/0x1c8
00265 el0_svc+0x20/0x60
00265 el0t_64_sync_handler+0x104/0x130
00265 el0t_64_sync+0x154/0x158
00265 SMP: stopping secondary CPUs
00265 Kernel Offset: disabled
00265 CPU features: 0x000,00000070,00000010,8240500b
00265 Memory Limit: none
00265 ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]---
00270 ========= FAILED TIMEOUT generic.187 in 1200s
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can't check if we're racing with fsck ending until mark_lock is held.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix sort order for disk accounting keys, in order to fix a regression on
mount times.
The typetag is now the most significant byte of the key, meaning disk
accounting keys of the same type now sort together.
This lets us skip over disk accounting keys that aren't mirrored in
memory when reading accounting at startup, instead of having them
interleaved with other counter types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
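To illustrate why putting the type tag in the most significant byte groups
keys by type, here is a simplified sketch using a single 64-bit sort key. The
real keys are wider; acc_sort_key() and its layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Packs a type tag into the top byte of a 64-bit key (hypothetical layout). */
static uint64_t acc_sort_key(uint8_t type, uint64_t id)
{
	return ((uint64_t) type << 56) | (id & ((UINT64_C(1) << 56) - 1));
}

/* Extracts the type tag back out of the top byte. */
static uint8_t acc_sort_key_type(uint64_t k)
{
	return (uint8_t) (k >> 56);
}

/* Plain unsigned comparison: the top byte dominates the ordering. */
static int cmp_u64(const void *l, const void *r)
{
	uint64_t a = *(const uint64_t *) l;
	uint64_t b = *(const uint64_t *) r;

	return (a > b) - (a < b);
}
```

After sorting with cmp_u64(), all keys of one type form a contiguous run, so a
reader can skip an entire type by seeking to the first key of the next one.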
|
|
Since we added per-inode counters there's now far too many counters to
show in one shot - if we want this in the future, it'll have to be in
debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new parameter to bkey validate functions, and use it to improve
invalid bkey error messages: we can now print the btree and depth it
came from, or if it came from the journal, or is a btree root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Accounting keys that reference invalid devices are corrected by fsck;
they shouldn't cause an emergency shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Also, fix a minor bug in the revert path, where we weren't checking the
journal entry type correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Give bversions a more distinct name, to aid in grepping.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Minor refactoring - replace multiple bool arguments with an enum; prep
work for fixing a bug in accounting read.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
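A small sketch of the bool-to-enum refactoring pattern, using hypothetical
names (acc_mode, counters, acc_mod()) rather than the actual bcachefs
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* One enum instead of several bool flags: call sites become self-describing. */
enum acc_mode {
	ACC_MODE_normal,
	ACC_MODE_gc,
	ACC_MODE_read,
};

/* Hypothetical pair of counter sets: live counters and gc's private copy. */
struct counters {
	int64_t live;
	int64_t gc;
};

/* Applies a delta to whichever counter set the mode selects. */
static void acc_mod(struct counters *c, int64_t delta, enum acc_mode mode)
{
	switch (mode) {
	case ACC_MODE_gc:
		c->gc += delta;
		break;
	case ACC_MODE_normal:
	case ACC_MODE_read:
		c->live += delta;
		break;
	}
}
```

Compared with a call such as acc_mod(c, 5, true, false), a call like
acc_mod(c, 5, ACC_MODE_gc) cannot silently swap two flags.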
|
|
This adds another disk accounting counter to track usage per inode
number (any snapshot ID).
This will be used for a couple things:
- It'll give us a way to tell the user how much space a given file is
consuming in all snapshots; i.e. how much extra space it's consuming
due to snapshot versioning.
- It counts number of extents and total size of extents (both in btree
keyspace sectors and actual disk usage), meaning it gives us average
extent size: that is, it'll let us cheaply find fragmented files that
should be defragmented.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
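The average-extent-size arithmetic can be sketched as below. The inode_usage
struct, its field names and the fragmentation threshold are assumptions made
for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode counters, summed over all snapshots. */
struct inode_usage {
	uint64_t nr_extents;
	uint64_t sectors;	/* size in btree keyspace sectors */
	uint64_t disk_sectors;	/* actual disk usage */
};

/* Average extent size in sectors; 0 for a file with no extents. */
static uint64_t avg_extent_sectors(const struct inode_usage *u)
{
	return u->nr_extents ? u->sectors / u->nr_extents : 0;
}

/* A file with several extents and a small average is a defrag candidate. */
static bool looks_fragmented(const struct inode_usage *u, uint64_t min_avg_sectors)
{
	return u->nr_extents > 1 && avg_extent_sectors(u) < min_avg_sectors;
}
```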
|
|
The next patch will be adding a disk accounting counter type which is
not kept in the in-memory eytzinger tree.
As prep, fold __bch2_accounting_mem_mod() into
bch2_accounting_mem_mod_locked() so that we can check for that counter
type and bail out without calling bpos_to_disk_accounting_pos() twice.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bkey_fsck_err() was added as an interface that looks like fsck_err(),
but previously all it did was ensure that the appropriate error counter
was incremented in the superblock.
This is a cleanup and bugfix patch that converts it to a wrapper around
fsck_err(). This is needed to fix an issue with the upgrade path to
disk_accounting_v3, where the "silent fix" error list now includes
bkey_fsck errors; fsck_err() handles this in a unified way, and since we
need to change printing of bkey fsck errors from the caller to the inner
bkey_fsck_err() calls, this ends up being a pretty big change.
Also, rename .invalid() methods to .validate(), for clarity, while we're
changing the function signature anyways (to drop the printbuf argument).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new helper to free zeroed out accounting entries, and use it in
bch2_replicas_gc2(); bch2_replicas_gc2() was killing superblock replicas
entries if their corresponding accounting counters were nonzero, but
that's incorrect - the superblock replicas entry needs to exist if the
accounting entry exists, not if it's nonzero, because we check and
create the replicas entry when creating the new accounting entry - we
don't know when it's becoming nonzero.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replica entries for the percpu allocator's max
practical size.
Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_accounting_mem_insert() drops and retakes mark_lock; thus, we need
to check if the entry in question has already been inserted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
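A single-threaded sketch of the recheck-after-relock pattern. The table
struct, its while_unlocked hook (which stands in for a concurrent thread) and
all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_MAX 8

struct table {
	int	locked;
	size_t	nr;
	int	keys[TABLE_MAX];
	/* Test hook: runs while the lock is dropped, simulating another thread. */
	void	(*while_unlocked)(struct table *);
};

/* Returns the index of key, or -1 if it is not present. */
static int table_find(const struct table *t, int key)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->keys[i] == key)
			return (int) i;
	return -1;
}

/* Caller holds the lock; it is dropped and retaken internally. */
static void table_insert(struct table *t, int key)
{
	assert(t->locked);
	t->locked = 0;			/* drop the lock */
	if (t->while_unlocked)
		t->while_unlocked(t);
	t->locked = 1;			/* retake it */

	/* The table may have changed while unlocked: look the key up again. */
	if (table_find(t, key) < 0 && t->nr < TABLE_MAX)
		t->keys[t->nr++] = key;
}

/* Simulated racing thread that inserts the same key first. */
static void racing_insert_42(struct table *t)
{
	t->keys[t->nr++] = 42;
}
```

Without the second lookup, the racing insert would leave two entries for the
same key.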
|
|
Add a new ioctl that can return the new accounting counter types; it
takes as input a bitmask of accounting types to return.
This will be used for returning e.g. compression accounting and
rebalance_work accounting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
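Filtering by a bitmask of types might look roughly like this. The acc_entry
struct and acc_query() are hypothetical; the real ioctl's argument layout is
not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accounting entry: a type number (< 32) and one value. */
struct acc_entry {
	unsigned type;
	int64_t	 value;
};

/*
 * Copies entries whose type bit is set in type_mask into out,
 * up to out_max entries. Returns the number copied.
 */
static size_t acc_query(const struct acc_entry *in, size_t nr, uint32_t type_mask,
			struct acc_entry *out, size_t out_max)
{
	size_t n = 0;

	for (size_t i = 0; i < nr && n < out_max; i++)
		if (in[i].type < 32 &&
		    (type_mask & (UINT32_C(1) << in[i].type)))
			out[n++] = in[i];

	return n;
}
```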
|
|
Helper to show raw accounting in sysfs, mainly for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Verify that the in-memory accounting matches the on-disk accounting
after a clean shutdown.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all triggers in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.
This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a device
counter update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
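A userspace sketch of reading a counter as a plain percpu sum. The fixed CPU
count, the dev_counter struct and both helpers are hypothetical; real percpu
counters are allocated and accessed through kernel APIs not modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4

/* One slot per CPU: writers touch only their own slot, avoiding contention. */
struct dev_counter {
	int64_t percpu[SKETCH_NR_CPUS];
};

/* Applies a delta to the slot belonging to the given CPU. */
static void dev_counter_add(struct dev_counter *c, unsigned cpu, int64_t delta)
{
	assert(cpu < SKETCH_NR_CPUS);
	c->percpu[cpu] += delta;
}

/* Reading is a sum over all slots; no tree lookup is involved. */
static int64_t dev_counter_read(const struct dev_counter *c)
{
	int64_t sum = 0;

	for (unsigned cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += c->percpu[cpu];
	return sum;
}
```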
|
|
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percpu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
since they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
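For readers unfamiliar with eytzinger trees, the sketch below shows the
textbook 1-indexed eytzinger layout of a sorted array and a lookup in it. It
is not the kernel's eytzinger helpers; eytz_build() and eytz_find() are
hypothetical names, and plain integers stand in for accounting keys.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fills eytz[1..n] from sorted[0..n-1] by an in-order walk of the implicit
 * binary tree where node k has children 2k and 2k+1. eytz[0] is unused.
 * Call with i = 0, k = 1; returns the next unread index into sorted[].
 */
static size_t eytz_build(const uint64_t *sorted, uint64_t *eytz,
			 size_t n, size_t i, size_t k)
{
	if (k <= n) {
		i = eytz_build(sorted, eytz, n, i, 2 * k);
		eytz[k] = sorted[i++];
		i = eytz_build(sorted, eytz, n, i, 2 * k + 1);
	}
	return i;
}

/* Returns the index (>= 1) of x in eytz[1..n], or 0 if x is absent. */
static size_t eytz_find(const uint64_t *eytz, size_t n, uint64_t x)
{
	size_t k = 1;

	while (k <= n) {
		if (eytz[k] == x)
			return k;
		/* Go left (2k) if x is smaller, right (2k+1) if larger. */
		k = 2 * k + (eytz[k] < x);
	}
	return 0;
}
```

The layout keeps the top levels of the search tree adjacent in memory, which
tends to be friendlier to caches than binary search over a sorted array.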
|
|
New key type for the disk space accounting rewrite.
- Holds a variable sized array of u64s (may be more than one for
accounting e.g. compressed and uncompressed size, or buckets and
sectors for a given data type)
- Updates are deltas, not new versions of the key: this means updates
to accounting can happen via the btree write buffer, which we'll be
teaching to accumulate deltas.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
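A minimal sketch of delta-style updates on a small array of counters. The
acc_delta struct, its fixed maximum of three counters and acc_accumulate()
are hypothetical simplifications of the key type described above.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_MAX_COUNTERS 3

/* Hypothetical accounting value: nr counters, each a signed 64-bit quantity. */
struct acc_delta {
	unsigned nr;
	int64_t	 d[ACC_MAX_COUNTERS];
};

/*
 * Adds src to dst counter by counter. Because updates are deltas, they
 * commute: several can be accumulated in any order before being applied.
 * Returns 0 on success, -1 if the counter counts are invalid or differ.
 */
static int acc_accumulate(struct acc_delta *dst, const struct acc_delta *src)
{
	if (dst->nr != src->nr || dst->nr > ACC_MAX_COUNTERS)
		return -1;

	for (unsigned i = 0; i < dst->nr; i++)
		dst->d[i] += src->d[i];
	return 0;
}
```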
|