Age | Commit message (Collapse) | Author |
|
[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:
# mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
/dev/test/scratch2
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs device remove 3 /mnt/btrfs
Then we have the following kernel NULL pointer dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
btrfs_ioctl+0x18bb/0x3190 [btrfs]
? lock_is_held_type+0xa5/0x120
? find_held_lock.constprop.0+0x2b/0x80
? do_user_addr_fault+0x201/0x6a0
? lock_release+0xd2/0x2d0
? __x64_sys_ioctl+0x83/0xb0
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
[CAUSE]
Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().
But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.
In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.
Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.
[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.
Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: stable@vger.kernel.org # 5.4+
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.
Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.
The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Function btrfs_check_raid_min_devices() returns error code from the enum
btrfs_err_code and it starts from 1. So there is no need to check if ret
is > 0. So drop this check and also drop the local variable ret.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.
Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.
The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.
The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.
Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.
Example output:
# btrfs fi us -T .
Overall:
Device size: 10.00GiB
Device allocated: 1.01GiB
Device unallocated: 8.99GiB
Device missing: 0.00B
Used: 200.61MiB
Free (estimated): 9.79GiB (min: 9.79GiB)
Free (statfs, df): 9.79GiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 3.25MiB (used: 0.00B)
Multiple profiles: no
Data Metadata System
Id Path RAID0 single single Unallocated
-- ---------- --------- --------- -------- -----------
1 /dev/sda10 1.00GiB 8.00MiB 1.00MiB 8.99GiB
-- ---------- --------- --------- -------- -----------
Total 1.00GiB 8.00MiB 1.00MiB 8.99GiB
Used 200.25MiB 352.00KiB 16.00KiB
# btrfs dev us .
/dev/sda10, ID: 1
Device size: 10.00GiB
Device slack: 0.00B
Data,RAID0/1: 1.00GiB
Metadata,single: 8.00MiB
System,single: 1.00MiB
Unallocated: 8.99GiB
Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
RAID56 is not only unsafe due to its write-hole problem, but also has
tons of hardcoded PAGE_SIZE.
Disable it for subpage support for now.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Comparators just read the data and thus get const parameters. This
should be also preserved by the local variables, update all comparators
passed to sort or bsearch.
Cleanups:
- unnecessary casts are dropped
- btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
and 'inline' is dropped as the function address is taken
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are two helpers doing the same calculations based on nparity and
ncopies. calc_data_stripes can be simplified into one expression, so far
we don't have profile with both copies and parity, so there's no
effective change. calc_stripe_length should reuse the helper and not
repeat the same calculation.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The device allocation is split to two functions, but one just calls the
other and they're very far in the file. Merge them together.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The helper does a simple translation from block group flags to index to
the btrfs_raid_array table. There's no apparent reason to inline the
function, the translation happens usually once per function and is not
called in a loop.
Making it a proper function saves quite some binary code (x86_64,
release config):
text data bss dec hex filename
1164011 19253 14912 1198176 124860 pre/btrfs.ko
1161559 19253 14912 1195724 123ecc post/btrfs.ko
DELTA: -2451
Also add the const attribute as there are no side effects, this could
help compiler to optimize a few things without the function body.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After calling btrfs_search_slot is a common practice to check if the
slot found isn't bigger than number of slots in the current leaf, and if
so, search for the same key in the next leaf by calling btrfs_next_leaf,
which calls btrfs_next_old_leaf to do the job.
Calling btrfs_next_item in the same situation would end up in the same
code flow, since
* btrfs_next_item
* btrfs_next_old_item
* if slot >= nritems(curr_leaf)
btrfs_next_old_leaf
Change btrfs_verify_dev_extents and calculate_emulated_zone_size
functions to use btrfs_next_leaf in the same situation.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
One of the final things that must be done to add a new chunk is
inserting its device extent items in the device tree. They describe
the portion of allocated device physical space during phase 1 of
chunk allocation. This is currently done in btrfs_finish_chunk_alloc
whose name isn't very informative. What's more, this function is only
used in block-group.c but is defined as public. There isn't anything
special about it that would warrant it being defined in volumes.c.
Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to
block-group.c, make the former static and rename both functions to
insert_dev_extents and insert_dev_extent respectively.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix -Warray-bounds warning, to help external patchset to make it
default treewide
- fix writeable device accounting (syzbot report)
- fix fsync and log replay after a rename and inode eviction
- fix potentially lost error code when submitting multiple bios for
compressed range
* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: calculate number of eb pages properly in csum_tree_block
btrfs: fix rw device counting in __btrfs_free_extra_devids
btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
btrfs: mark compressed range uptodate only if all bio succeed
|
|
When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.
This error was caught by Syzbot which reported a warning in
close_fs_devices:
WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
Modules linked in:
CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
FS: 00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
btrfs_fill_super fs/btrfs/super.c:1382 [inline]
btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
fc_mount fs/namespace.c:993 [inline]
vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
legacy_get_tree+0xea/0x180 fs/fs_context.c:592
vfs_get_tree+0x86/0x270 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0x196f/0x2be0 fs/namespace.c:3235
do_mount fs/namespace.c:3248 [inline]
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
Because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:
btrfs_mount_root():
btrfs_scan_one_device():
device_list_add(); <---------------- device added
btrfs_open_devices():
open_fs_devices():
btrfs_open_one_device(); <-------- writable device opened,
rw device count ++
btrfs_fill_super():
open_ctree():
btrfs_free_extra_devids():
__btrfs_free_extra_devids(); <--- writable device removed,
rw device count not decremented
fail_tree_roots:
btrfs_close_devices():
close_fs_devices(); <------- rw device count off by 1
As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.
In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.
Additionally, following commit 82372bc816d7 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.
Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
CC: stable@vger.kernel.org # 5.10+
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs zoned mode fixes from David Sterba:
- fix deadlock when allocating system chunk
- fix wrong mutex unlock on an error path
- fix extent map splitting for append operation
- update and fix message reporting unusable chunk space
- don't block when background zone reclaim runs with balance in
parallel
* tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
btrfs: don't block if we can't acquire the reclaim lock
btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
btrfs: fix deadlock with concurrent chunk allocations involving system chunks
btrfs: zoned: print unusable percentage when reclaiming block groups
btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
|
|
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull core block updates from Jens Axboe:
- disk events cleanup (Christoph)
- gendisk and request queue allocation simplifications (Christoph)
- bdev_disk_changed cleanups (Christoph)
- IO priority improvements (Bart)
- Chained bio completion trace fix (Edward)
- blk-wbt fixes (Jan)
- blk-wbt enable/disable fix (Zhang)
- Scheduler dispatch improvements (Jan, Ming)
- Shared tagset scheduler improvements (John)
- BFQ updates (Paolo, Luca, Pietro)
- BFQ lock inversion fix (Jan)
- Documentation improvements (Kir)
- CLONE_IO block cgroup fix (Tejun)
- Remove of ancient and deprecated block dump feature (zhangyi)
- Discard merge fix (Ming)
- Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
Yang)
* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
block: fix discard request merge
block/mq-deadline: Remove a WARN_ON_ONCE() call
blk-mq: update hctx->dispatch_busy in case of real scheduler
blk: Fix lock inversion between ioc lock and bfqd lock
bfq: Remove merged request already in bfq_requests_merged()
block: pass a gendisk to bdev_disk_changed
block: move bdev_disk_changed
block: add the events* attributes to disk_attrs
block: move the disk events code to a separate file
block: fix trace completion for chained bio
block/partitions/msdos: Fix typo inidicator -> indicator
block, bfq: reset waker pointer with shared queues
block, bfq: check waker only for queues with no in-flight I/O
block, bfq: avoid delayed merge of async queues
block, bfq: boost throughput by extending queue-merging times
block, bfq: consider also creation time in delayed stable merge
block, bfq: fix delayed stable merge check
block, bfq: let also stably merged queues enjoy weight raising
blk-wbt: make sure throttle is enabled properly
blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
...
|
|
Relocation and send do not play well together because while send is
running a block group can be relocated, a transaction committed and
the respective disk extents get re-allocated and written to or discarded
while send is about to do something with the extents.
This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures
and crashes due to concurrent relocation"), which prevented balance and
send from running in parallel but it did not address one remaining case
where chunk relocation can happen: shrinking a device (and device deletion
which shrinks a device's size to 0 before deleting the device).
We also have now one more case where relocation is triggered: on zoned
filesystems partially used block groups get relocated by a background
thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically
reclaim zones").
So make sure that instead of preventing balance from running when there
are ongoing send operations, we prevent relocation from happening.
This uses the infrastructure recently added by a patch that has the
subject: "btrfs: add cancellable chunk relocation support".
Also it adds a spinlock used exclusively for the exclusivity between
send and relocation, as before fs_info->balance_mutex was used, which
would make an attempt to run send to block waiting for balance to
finish, which can take a lot of time on large filesystems.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Fix typos that have snuck in since the last round. Found by codespell.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
just remove it.
It got removed in 420343131970 ("btrfs: let callers of
btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map
utilized it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit b2598edf8b36 ("btrfs: remove unused argument seed from
btrfs_find_device") removed the argument seed from btrfs_find_device
but forgot the comment, so remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Replace the per-block device bd_mutex with a per-gendisk open_mutex,
thus simplifying locking wherever we deal with partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- fix fiemap to print extents that could get misreported due to
internal extent splitting and logical merging for fiemap output
- fix RCU stalls during delayed iputs
- fix removed dentries still existing after log is synced"
* tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removed dentries still existing after log is synced
btrfs: return whole extents in fiemap
btrfs: avoid RCU stalls while running delayed iputs
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error
|
|
Commit 7000babddac6 ("btrfs: assign proper values to a bool variable in
dev_extent_hole_check_zoned") assigned false to the hole_start parameter
of dev_extent_hole_check_zoned().
The hole_start parameter is not boolean and returns the start location of
the found hole.
Fixes: 7000babddac6 ("btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull CFI on arm64 support from Kees Cook:
"This builds on last cycle's LTO work, and allows the arm64 kernels to
be built with Clang's Control Flow Integrity feature. This feature has
happily lived in Android kernels for almost 3 years[1], so I'm excited
to have it ready for upstream.
The wide diffstat is mainly due to the treewide fixing of mismatched
list_sort prototypes. Other things in core kernel are to address
various CFI corner cases. The largest code portion is the CFI runtime
implementation itself (which will be shared by all architectures
implementing support for CFI). The arm64 pieces are Acked by arm64
maintainers rather than coming through the arm64 tree since carrying
this tree over there was going to be awkward.
CFI support for x86 is still under development, but is pretty close.
There are a handful of corner cases on x86 that need some improvements
to Clang and objtool, but otherwise works well.
Summary:
- Clean up list_sort prototypes (Sami Tolvanen)
- Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"
* tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arm64: allow CONFIG_CFI_CLANG to be selected
KVM: arm64: Disable CFI for nVHE
arm64: ftrace: use function_nocfi for ftrace_call
arm64: add __nocfi to __apply_alternatives
arm64: add __nocfi to functions that jump to a physical address
arm64: use function_nocfi with __pa_symbol
arm64: implement function_nocfi
psci: use function_nocfi for cpu_resume
lkdtm: use function_nocfi
treewide: Change list_sort to use const pointers
bpf: disable CFI in dispatcher functions
kallsyms: strip ThinLTO hashes from static functions
kthread: use WARN_ON_FUNCTION_MISMATCH
workqueue: use WARN_ON_FUNCTION_MISMATCH
module: ensure __cfi_check alignment
mm: add generic function_nocfi macro
cfi: add __cficanonical
add support for Clang CFI
|
|
When a file gets deleted on a zoned file system, the space freed is not
returned back into the block group's free space, but is migrated to
zone_unusable.
As this zone_unusable space is behind the current write pointer it is not
possible to use it for new allocations. In the current implementation a
zone is reset once all of the block group's space is accounted as zone
unusable.
This behaviour can lead to premature ENOSPC errors on a busy file system.
Instead of only reclaiming the zone once it is completely unusable,
kick off a reclaim job once the amount of unusable bytes exceeds a user
configurable threshold between 51% and 100%. It can be set per mounted
filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
by default.
Similar to reclaiming unused block groups, these dirty block groups are
added to a to_reclaim list and then on a transaction commit, the reclaim
process is triggered but after we deleted unused block groups, which will
free space for the relocation process.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As a preparation for extending the block group deletion use case, rename
the unused_bgs_mutex to reclaim_bgs_lock.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When relocating a block group the freed up space is not discarded in one
big block, but each extent is discarded on its own with -odisard=sync.
For a zoned filesystem we need to discard the whole block group at once,
so btrfs_discard_extent() will translate the discard into a
REQ_OP_ZONE_RESET operation, which then resets the device's zone.
Failure to reset the zone is not fatal error.
Discussion about the approach and regarding transaction blocking:
https://lore.kernel.org/linux-btrfs/CAL3q7H4SjS_d5rBepfTMhU8Th3bJzdmyYd0g4Z60yUgC_rC_ZA@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs uses internally mapped u64 address space for all its metadata.
Due to the page cache limit on 32bit systems, btrfs can't access
metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K
page size while 256T for 64K page size.
Users can have a filesystem which doesn't have metadata beyond the
boundary at mount time, but later balance can cause it to create
metadata beyond the boundary.
And modification to MM layer is unrealistic just for such minor use
case. We can't do more than to prevent mounting such filesystem or warn
early when the numbers are still within the limits.
To address such problem, this patch will introduce the following checks:
- Mount time rejection
This will reject any fs which has metadata chunk at or beyond the
boundary.
- Mount time early warning
If there is any metadata chunk beyond 5/8th of the boundary, we do an
early warning and hope the end user will see it.
- Runtime extent buffer rejection
If we're going to allocate an extent buffer at or beyond the boundary,
reject such request with EOVERFLOW.
This is definitely going to cause problems like transaction abort, but
we have no better ways.
- Runtime extent buffer early warning
If an extent buffer beyond 5/8th of the max file size is allocated, do
an early warning.
Above error/warning message will only be printed once for each fs to
reduce dmesg flood.
If the mount is rejected, the filesystem will be mountable only on a
64bit host.
Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/
Reported-by: Erik Jensen <erikjensen@rkjnsn.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
gcc complains that the ctl->max_chunk_size member might be used
uninitialized when none of the three conditions for initializing it in
init_alloc_chunk_ctl_policy_zoned() are true:
In function ‘init_alloc_chunk_ctl_policy_zoned’,
inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
| ^~~
fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
5316 | struct alloc_chunk_ctl ctl;
| ^~~
If we ever get into this condition, something is seriously
wrong, as validity is checked in the callers
btrfs_alloc_chunk
init_alloc_chunk_ctl
init_alloc_chunk_ctl_policy_zoned
so the same logic as in init_alloc_chunk_ctl_policy_regular()
and a few other places should be applied. This avoids both further
data corruption, and the compile-time warning.
Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Fix the following coccicheck warnings:
./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function
'dev_extent_hole_check_zoned' with return type bool.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.
Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes for issues that have some user visibility and are simple enough
for this time of development cycle:
- a few fixes for rescue= mount option, adding more checks for
missing trees
- fix sleeping in atomic context on qgroup deletion
- fix subvolume deletion on mount
- fix build with M= syntax
- fix checksum mismatch error message for direct io"
* tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix check_data_csum() error message for direct I/O
btrfs: fix sleep while in non-sleep context during qgroup removal
btrfs: fix subvolume/snapshot deletion not triggered on mount
btrfs: fix build when using M=fs/btrfs
btrfs: do not initialize dev replace for bad dev root
btrfs: initialize device::fs_info always
btrfs: do not initialize dev stats if we have no dev_root
btrfs: zoned: remove outdated WARN_ON in direct IO
|
|
Neal reported a panic trying to use -o rescue=all
BUG: kernel NULL pointer dereference, address: 0000000000000030
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1
RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0
RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10
RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000
RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000
R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00
R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10
FS: 00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0
Call Trace:
? btrfs_init_dev_stats+0x1f/0xf0
btrfs_init_dev_stats+0x62/0xf0
open_ctree+0x1019/0x15ff
btrfs_mount_root.cold+0x13/0xfa
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x131/0x3d0
? legacy_get_tree+0x27/0x40
? btrfs_show_options+0x640/0x640
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
path_mount+0x441/0xa80
__x64_sys_mount+0xf4/0x130
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f698c04e52e
This happens because we unconditionally attempt to initialize device
stats on mount, but we may not have been able to read the device root.
Fix this by skipping initializing the device stats if we do not have a
device root.
Reported-by: Neal Gompa <ngompa13@gmail.com>
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:
- Two series of BFQ improvements (Paolo, Jan, Jia)
- Block iov_iter improvements (Pavel)
- bsg error path fix (Pan)
- blk-mq scheduler improvements (Jan)
- -EBUSY discard fix (Jan)
- bvec allocation improvements (Ming, Christoph)
- bio allocation and init improvements (Christoph)
- Store bdev pointer in bio instead of gendisk + partno (Christoph)
- Block trace point cleanups (Christoph)
- hard read-only vs read-only split (Christoph)
- Block based swap cleanups (Christoph)
- Zoned write granularity support (Damien)
- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"
* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...
|
|
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is 3/4 patch to implement device-replace on zoned filesystems.
This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.
The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit 042528f8d840 ("Btrfs: fix
block group remaining RO forever after error during device replace").
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is 2/4 patch to implement device replace for zoned filesystems.
In zoned mode, a block group must be either copied (from the source
device to the target device) or cloned (to both devices).
Implement the cloning part. If a block group targeted by an IO is marked
to copy, we should not clone the IO to the destination device, because
the block group is eventually copied by the replace process.
This commit also handles cloning of device reset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.
Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.
Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().
And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.
Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add a check in verify_one_dev_extent() to ensure that a device extent on
a zoned block device is aligned to the respective zone boundary.
If it isn't, mark the filesystem as unclean.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.
To implement the allocator, we need to extend the following functions for
a zoned filesystem.
- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size
init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.
dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.
dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.
With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.
Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is a preparation patch to implement zone emulation on a regular
device.
To emulate a zoned filesystem on a regular (non-zoned) device, we need to
decide an emulated zone size. Instead of making it a compile-time static
value, we'll make it configurable at mkfs time. Since we have one zone ==
one device extent restriction, we can determine the emulated zone size
from the size of a device extent. We can extend btrfs_get_dev_zone_info()
to show a regular device filled with conventional zones once the zone size
is decided.
The current call site of btrfs_get_dev_zone_info() during the mount process
is earlier than loading the file system trees so that we don't know the
size of a device extent at this point. Thus we can't slice a regular device
to conventional zones.
This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone
info for all the devices. And, it places this function in open_ctree()
after loading the trees.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.
That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.
Change that to passing the extent mapping to the btrfs_get_io_geometry()
function as an argument.
This could improve performance in some cases. For very large
filesystems, i.e. several thousands of allocated chunks, not only this
avoids searching two times the rbtree, saving time, it may also help
reducing contention on the lock that protects the tree - thinking of
writeback starting for multiple inodes, other tasks allocating or
removing chunks, and anything else that requires access to the rbtree.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Michal Rostecki <mrostecki@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Filipe's analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of having three 'if' to handle non-NULL return value consolidate
this in one 'if (ret)'. That way the code is more obvious:
- Always drop delete_unused_bgs_mutex if ret is not NULL
- If ret is negative -> goto done
- If it's 1 -> reset ret to 0, release the path and finish the loop.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for a late rc:
- fix lockdep complaint on 32bit arches and also remove an unsafe
memory use due to device vs filesystem lifetime
- two fixes for free space tree:
* race during log replay and cache rebuild, now more likely to
happen due to changes in this dev cycle
* possible free space tree corruption with online conversion
during initial tree population"
* tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix log replay failure due to race with space cache rebuild
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
btrfs: fix possible free space tree corruption with online conversion
|
|
Use bio_kmalloc instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This effectively reverts commit d5c8238849e7 ("btrfs: convert
data_seqcount to seqcount_mutex_t").
While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):
[66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G O 5.11.0-rc4-custom+ #5
[66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
[66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
[66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
[66.441485] Call Trace:
[66.441510] btrfs_relocate_chunk+0xb1/0x100 [btrfs]
[66.441529] ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
[66.441562] btrfs_balance+0x8ed/0x13b0 [btrfs]
[66.441586] ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441619] ? __this_cpu_preempt_check+0xf/0x11
[66.441643] btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441664] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441683] btrfs_ioctl+0x414/0x2ae0 [btrfs]
[66.441700] ? __lock_acquire+0x35f/0x2650
[66.441717] ? lockdep_hardirqs_on+0x87/0x120
[66.441720] ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
[66.441724] ? call_rcu+0x2d3/0x530
[66.441731] ? __might_fault+0x41/0x90
[66.441736] ? kvm_sched_clock_read+0x15/0x50
[66.441740] ? sched_clock+0x8/0x10
[66.441745] ? sched_clock_cpu+0x13/0x180
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441768] __ia32_sys_ioctl+0x165/0x8a0
[66.441773] ? __this_cpu_preempt_check+0xf/0x11
[66.441785] ? __might_fault+0x89/0x90
[66.441791] __do_fast_syscall_32+0x54/0x80
[66.441796] do_fast_syscall_32+0x32/0x70
[66.441801] do_SYSENTER_32+0x15/0x20
[66.441805] entry_SYSENTER_32+0x9f/0xf2
[66.441808] EIP: 0xab7b5549
[66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
[66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
[66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[66.441833] irq event stamp: 42579
[66.441835] hardirqs last enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
[66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
[66.441840] softirqs last enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
[66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
========================================================================
btrfs_remove_chunk+0x58b/0x7b0:
__seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
(inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
(inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
========================================================================
The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.
After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init(). In this scenario, its second
parameter info->chunk_mutex is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.
The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.
Commit d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one line fixes for various bugs, stable material.
- fix send when emitting clone operation from the same file and root
- fix double free on error when cleaning backrefs
- lockdep fix during relocation
- handle potential error during reloc when starting transaction
- skip running delayed refs during commit (leftover from code removal
in this dev cycle)"
* tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't clear ret in btrfs_start_dirty_block_groups
btrfs: fix lockdep splat in btrfs_recover_relocation
btrfs: do not double free backref nodes on error
btrfs: don't get an EINTR during drop_snapshot for reloc
btrfs: send: fix invalid clone operations when cloning from the same file and root
btrfs: no need to run delayed refs after commit_fs_roots during commit
|
|
While testing the error paths of relocation I hit the following lockdep
splat:
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc6+ #217 Not tainted
------------------------------------------------------
mount/779 is trying to acquire lock:
ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340
but task is already holding lock:
ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-root-00){++++}-{3:3}:
down_read_nested+0x43/0x130
__btrfs_tree_read_lock+0x27/0x100
btrfs_read_lock_root_node+0x31/0x40
btrfs_search_slot+0x462/0x8f0
btrfs_update_root+0x55/0x2b0
btrfs_drop_snapshot+0x398/0x750
clean_dirty_subvols+0xdf/0x120
btrfs_recover_relocation+0x534/0x5a0
btrfs_start_pre_rw_mount+0xcb/0x170
open_ctree+0x151f/0x1726
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x433/0xc10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (sb_internal#2){.+.+}-{0:0}:
start_transaction+0x444/0x700
insert_balance_item.isra.0+0x37/0x320
btrfs_balance+0x354/0xf40
btrfs_ioctl_balance+0x2cf/0x380
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
__lock_acquire+0x1120/0x1e10
lock_acquire+0x116/0x370
__mutex_lock+0x7e/0x7b0
btrfs_recover_balance+0x2f0/0x340
open_ctree+0x1095/0x1726
btrfs_mount_root.cold+0x12/0xea
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
path_mount+0x433/0xc10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-root-00);
lock(sb_internal#2);
lock(btrfs-root-00);
lock(&fs_info->balance_mutex);
*** DEADLOCK ***
2 locks held by mount/779:
#0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
#1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
stack backtrace:
CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb0
check_noncircular+0xcf/0xf0
? trace_call_bpf+0x139/0x260
__lock_acquire+0x1120/0x1e10
lock_acquire+0x116/0x370
? btrfs_recover_balance+0x2f0/0x340
__mutex_lock+0x7e/0x7b0
? btrfs_recover_balance+0x2f0/0x340
? btrfs_recover_balance+0x2f0/0x340
? rcu_read_lock_sched_held+0x3f/0x80
? kmem_cache_alloc_trace+0x2c4/0x2f0
? btrfs_get_64+0x5e/0x100
btrfs_recover_balance+0x2f0/0x340
open_ctree+0x1095/0x1726
btrfs_mount_root.cold+0x12/0xea
? rcu_read_lock_sched_held+0x3f/0x80
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x10d/0x380
? __kmalloc_track_caller+0x2f2/0x320
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
? capable+0x3a/0x60
path_mount+0x433/0xc10
__x64_sys_mount+0xe3/0x120
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is straightforward to fix, simply release the path before we setup
the balance_ctl.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that arrived before the end of the year:
- a bunch of fixes related to transaction handle lifetime wrt various
operations (umount, remount, qgroup scan, orphan cleanup)
- async discard scheduling fixes
- fix item size calculation when item keys collide for extend refs
(hardlinks)
- fix qgroup flushing from running transaction
- fix send, wrong file path when there is an inode with a pending
rmdir
- fix deadlock when cloning inline extent and low on free metadata
space"
* tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: run delayed iputs when remounting RO to avoid leaking them
btrfs: add assertion for empty list of transactions at late stage of umount
btrfs: fix race between RO remount and the cleaner task
btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
btrfs: merge critical sections of discard lock in workfn
btrfs: fix racy access to discard_ctl data
btrfs: fix async discard stall
btrfs: tests: initialize test inodes location
btrfs: send: fix wrong file path when there is an inode with a pending rmdir
btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
btrfs: correctly calculate item size used when item key collision happens
btrfs: fix deadlock when cloning inline extent and low on free metadata space
|