Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dentry d_flags updates from Al Viro:
"The current exclusion rules for dentry->d_flags stores are rather
unpleasant. The basic rules are simple:
- stores to dentry->d_flags are OK under dentry->d_lock
- stores to dentry->d_flags are OK in the dentry constructor, before
becomes potentially visible to other threads
Unfortunately, there's a couple of exceptions to that, and that's
where the headache comes from.
The main PITA comes from d_set_d_op(); that primitive sets ->d_op of
dentry and adjusts the flags that correspond to presence of individual
methods. It's very easy to misuse; existing uses _are_ safe, but proof
of correctness is brittle.
Use in __d_alloc() is safe (we are within a constructor), but we might
as well precalculate the initial value of 'd_flags' when we set the
default ->d_op for given superblock and set 'd_flags' directly instead
of messing with that helper.
The reasons why other uses are safe are bloody convoluted; I'm not
going to reproduce it here. See [1] for gory details, if you care. The
critical part is using d_set_d_op() only just prior to
d_splice_alias(), which makes a combination of d_splice_alias() with
setting ->d_op, etc a natural replacement primitive.
Better yet, if we go that way, it's easy to take setting ->d_op and
modifying 'd_flags' under ->d_lock, which eliminates the headache as
far as 'd_flags' exclusion rules are concerned. Other exceptions are
minor and easy to deal with.
What this series does:
- d_set_d_op() is no longer available; instead a new primitive
(d_splice_alias_ops()) is provided, equivalent to combination of
d_set_d_op() and d_splice_alias().
- new field of struct super_block - 's_d_flags'. This sets the
default value of 'd_flags' to be used when allocating dentries on
this filesystem.
- new primitive for setting 's_d_op': set_default_d_op(). This
replaces stores to 's_d_op' at mount time.
All in-tree filesystems converted; out-of-tree ones will get caught
by the compiler ('s_d_op' is renamed, so stores to it will be
caught). 's_d_flags' is set by the same primitive to match the
's_d_op'.
- a lot of filesystems had sb->s_d_op->d_delete equal to
always_delete_dentry; that is equivalent to setting
DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well
set that bit in 's_d_flags' and drop 'd_delete()' from
dentry_operations.
In quite a few cases that results in empty dentry_operations, which
means that we can get rid of those.
- kill simple_dentry_operations - not needed anymore
- massage d_alloc_parallel() to get rid of the other exception wrt
'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we
allocate the new dentry; no need to delay that until we commit to
using the sucker.
As the result, 'd_flags' stores are all either under ->d_lock or done
before the dentry becomes visible in any shared data structures"
Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1]
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
configfs: use DCACHE_DONTCACHE
debugfs: use DCACHE_DONTCACHE
efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry()
9p: don't bother with always_delete_dentry
ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE
kill simple_dentry_operations
devpts, sunrpc, hostfs: don't bother with ->d_op
shmem: no dentry retention past the refcount reaching zero
d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier
make d_set_d_op() static
simple_lookup(): just set DCACHE_DONTCACHE
tracefs: Add d_delete to remove negative dentries
set_default_d_op(): calculate the matching value for ->d_flags
correct the set of flags forbidden at d_set_d_op() time
split d_flags calculation out of d_set_d_op()
new helper: set_default_d_op()
fuse: no need for special dentry_operations for root dentry
switch procfs from d_set_d_op() to d_splice_alias_ops()
new helper: d_splice_alias_ops()
procfs: kill ->proc_dops
...
|
|
[FLAG EXCLUSION]
Commit ead622674df5 ("btrfs: Do not restrict writes to btrfs devices")
removes the BLK_OPEN_RESTRICT_WRITES flag when opening the devices
during mount. This was an exception at the time as it depended on other
patches.
[REASON TO EXCLUDE THAT FLAG]
Btrfs needs to call btrfs_scan_one_device() to determine the fsid, no
matter if we're mounting a new fs or an existing one.
But if a fs is already mounted and the BLK_OPEN_RESTRICT_WRITES is
honored, meaning no other write open is allowed for the block device.
Then we want to mount a subvolume of the mounted fs to another mount
point, we will call btrfs_scan_one_device() again, but it will fail due
to the BLK_OPEN_RESTRICT_WRITES flag (no more write open allowed),
causing only one mount point for the fs.
Thus at that time, we had to exclude the BLK_OPEN_RESTRICT_WRITES to
allow multiple mount points for one fs.
[WHY IT'S SAFE NOW]
The root problem is, we do not need to nor should use BLK_OPEN_WRITE for
btrfs_scan_one_device().
That function is only to read out the super block, no write at all, and
BLK_OPEN_WRITE is only going to cause problems for such usage.
The root problem has been fixed by patch "btrfs: always open the device
read-only in btrfs_scan_one_device", so btrfs_scan_one_device() will
always work no matter if the device is opened with
BLK_OPEN_RESTRICT_WRITES.
[ENHANCEMENT]
Just remove the btrfs_open_mode(), as the only call site can be replaced
with regular sb_open_mode().
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The file system type is not a very useful holder as it doesn't allow us
to go back to the actual file system instance. Pass the super_block
instead which is useful when passed back to the file system driver.
This matches what is done for all other block device based file systems,
and allows us to remove btrfs_fs_info::bdev_holder completely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we always call btrfs_open_devices() before creating the
super block.
It's fine for now because:
- No blk_holder_ops is provided
- btrfs_fs_type is used as a holder
This means no matter who wins the device opening race, the holder will be
the same thus not affecting the later sget_fc() race.
And since no blk_holder_ops is provided, no bdev operation is depending on
the holder.
But this will no longer be true if we want to implement a proper
blk_holder_ops using fs_holder_ops.
This means we will need a proper super block as the bdev holder.
To prepare for such change:
- Add btrfs_fs_devices::holding member
This will prevent btrfs_free_stale_devices() and btrfs_close_device()
from deleting the fs_devices when there is another process trying to
mount the fs.
Along with the new member, here come the two helpers,
btrfs_fs_devices_inc_holding() and btrfs_fs_devices_dec_holding().
This will allow us to hold fs_devices without opening it.
This is needed because we cannot hold uuid_mutex while calling
sget_fc(), this will reverse the lock sequence with s_umount, causing
a lockdep warning.
- Delay btrfs_open_devices() until a super block is returned
This means we have to hold the initial fs_devices first, then unlock
uuid_mutex, call sget_fc(), then re-lock uuid_mutex, and decrease the
holding number.
For new super block case, we continue to btrfs_open_devices() with
uuid_mutex hold.
For existing super block case, we can unlock uuid_mutex and continue.
Although this means a more complex error handling path, as if we
didn't call btrfs_open_devices() (either got an existing sb, or
sget_fc() failed), we cannot let btrfs_put_fs_info() cleanup the
fs_devices, as it can be freed at any time after we decrease the hold
on fs_devices and unlock uuid_mutex.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Although btrfs is not yet implementing blk_holder_ops, there is a
requirement for proper blk_holder_ops:
- blkdev_put() must not be called under sb->s_umount
The blkdev_put()/bdev_fput() must not be called under sb->s_umount to
avoid lock order reversal with disk->open_mutex.
This is for the proper blk_holder_ops callbacks.
Currently we're fine because we call regular fput() which defers the
blk holder reclaiming.
To prepare for the future of blk_holder_ops, move the
btrfs_close_devices() calls into btrfs_free_fs_info().
That will be called from kill_sb() callbacks, which is also called for
error handing during mount failures, or there is already an existing
super block.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When calling sget_fc(), there are 3 different situations:
a) Critical error
No super block created.
b) A new super block is created
The fc->s_fs_info is transferred to the super block, and fc->s_fs_info
is reset to NULL.
In this case sb->s_root should still be NULL, and needs to be properly
initialized later by btrfs_fill_super().
c) An existing super block is returned
The fc->s_fs_info is untouched, and anything related to that fs_info
should be properly cleaned up.
This is not obvious even with the extra comments at sget_fc().
Enhance the situation by:
- Add comments for case b) and c)
Especially for case c), the fs_info and fs_devices cleanup happens at
different timing, thus needs extra explanation.
- Move the comments closer to case b) and case c)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[EXISTING PROBLEM]
Currently btrfs mount is split into two parts:
- btrfs_get_tree_subvol()
Which sets up the very basic fs_info, and eventually calls
mount_subvol() to mount the target subvolume.
- btrfs_get_tree_super()
This is the part doing super block allocation and if there is no
existing super block, do the real open_ctree() to open the fs.
However currently we're doing this in a complex re-entering way:
vfs_get_tree()
|- btrfs_get_tree()
|- btrfs_get_tree_subvol()
|- vfs_get_tree()
| |- btrfs_get_tree()
| |- btrfs_get_tree_super()
|- mount_subvol()
This is definitely not that easy to grasp.
[ENHANCEMENT]
The function vfs_get_tree() is only doing the following work:
- Call get_tree() call back
- Call super_wake()
- Call security_sb_set_mnt_opts()
In our case, super_wake() can be skipped, as after
btrfs_get_tree_subvol() finishes, vfs_get_tree() will call super_wake()
on the super block we got anyway.
The same applies to security_sb_set_mnt_opts(), as long as we do not
free the security from our original fc in btrfs_get_tree_subvol(), the
first vfs_get_tree() call will handle the security correctly.
So here we only need to:
- Replace vfs_get_tree() call with btrfs_get_tree_super()
- Keep the existing fc->security for vfs_get_tree() to handle the
security
This will remove the re-entering behavior and make thing much easier to
follow.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_scan_one_device() opens the block device only to read the super
block. Instead of passing a blk_mode_t argument to sometimes open
it for writing, just hard code BLK_OPEN_READ as it will never write
to the device or hand the block_device out to someone else.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[CURRENT BEHAVIOR]
Currently inside btrfs_get_tree_subvol(), we call fc_mount() to grab a
tree, then re-lock s_umount inside btrfs_reconfigure_for_mount() to
avoid race with remount.
However fc_mount() itself is just doing two things:
1. Call vfs_get_tree()
2. Release s_umount then call vfs_create_mount()
[ENHANCEMENT]
Instead of calling fc_mount(), we can open-code it with vfs_get_tree()
first.
This provides a benefit that, since we have the full control of
s_umount, we do not need to re-lock that rw_sempahore when calling
btrfs_reconfigure_for_mount(), meaning less race between RO/RW remount.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Rework the subject and commit message, refactor the error handling ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs happily but incorrectly accepts the `-o compress=zlib+foo` and similar
options with any random suffix.
Fix that by explicitly checking the end of the strings.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are many options making the parsing a bit lengthy. Factor the
compress options out into a helper function. The next patch is going to
harden this function.
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
... to be used instead of manually assigning to ->s_d_op.
All in-tree filesystem converted (and field itself is renamed,
so any out-of-tree ones in need of conversion will be caught
by compiler).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Standalone "nologreplay" mount option has been marked deprecated since
commit 74ef00185eb8 ("btrfs: introduce "rescue=" mount option"), which
dates back to v5.9 (2020).
Furthermore there is no other filesystem with the same named mount
option, so this one is btrfs specific and we will not hit the same
problem when removing "norecovery" mount option.
So let's remove the standalone "nologreplay" mount option.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have two functions to read a super block from a block device:
- btrfs_read_dev_one_super()
Exported from disk-io.c
- btrfs_read_disk_super()
Local to volumes.c
And they have some minor differences:
- btrfs_read_dev_one_super() uses @copy_num
Meanwhile btrfs_read_disk_super() relies on the physical and expected
bytenr passed from the caller.
The parameter list of btrfs_read_dev_one_super() is more user
friendly.
- btrfs_read_disk_super() makes sure the label is NUL terminated
We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:
- Remove btrfs_read_dev_one_super()
- Export btrfs_read_disk_super()
The name pairs with btrfs_release_disk_super() perfectly.
- Change the parameter list of btrfs_read_disk_super() to mimic
btrfs_read_dev_one_super()
All existing callers are calculating the physical address and expect
bytenr before calling btrfs_read_disk_super() already.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename the remaning exported functions that don't have a 'btrfs_' prefix.
By convention exported functions should have such prefix to make it clear
they are btrfs specific and to avoid collisions with functions from
elsewhere in the kernel.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The Btrfs documentation states that if the commit value is greater than
300 a warning should be issued. The warning was accidentally lost in the
new mount API update.
Fixes: 6941823cc878 ("btrfs: remove old mount API code")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently, displaying the btrfs subvol mount option doesn't escape ','.
This makes parsing /proc/self/mounts and /proc/self/mountinfo
ambiguous for subvolume names that contain commas. The text after the
comma could be mistaken for another option (think "subvol=foo,ro", where
ro is actually part of the subvolumes name).
Replace the manual escape characters list with a call to
seq_show_option(). Thanks to Calvin Walton for suggesting this approach.
Fixes: c8d3fe028f64 ("Btrfs: show subvol= and subvolid= in /proc/mounts")
CC: stable@vger.kernel.org # 5.4+
Suggested-by: Calvin Walton <calvin.walton@kepstin.ca>
Signed-off-by: Johannes Kimmel <kernel@bareminimum.eu>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Allow using the fast modes (negative compression levels) of zstd as a
mount option.
As per the results, the compression ratio is (expectedly) lower:
for level in {-15..-1} 1 2 3; \
do printf "level %3d\n" $level; \
mount -o compress=zstd:$level /dev/sdb /mnt/test/; \
grep sdb /proc/mounts; \
cp -r /usr/bin /mnt/test/; sync; compsize /mnt/test/bin; \
cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \
cp enwik9 /mnt/test/; sync; compsize /mnt/test/enwik9; \
cp linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \
rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \
umount /mnt/test/; \
done |& tee results | \
awk '/^level/{print}/^TOTAL/{print$3"\t"$2" |"}' | paste - - - - -
266M bin | 45M doc | 953M wiki | 1.4G source
=============================+===============+===============+===============+
level -15 180M 67% | 30M 68% | 694M 72% | 598M 40% |
level -14 180M 67% | 30M 67% | 683M 71% | 581M 39% |
level -13 177M 66% | 29M 66% | 671M 70% | 566M 38% |
level -12 174M 65% | 29M 65% | 658M 69% | 548M 37% |
level -11 174M 65% | 28M 64% | 645M 67% | 530M 35% |
level -10 171M 64% | 28M 62% | 631M 66% | 512M 34% |
level -9 165M 62% | 27M 61% | 615M 64% | 493M 33% |
level -8 161M 60% | 27M 59% | 598M 62% | 475M 32% |
level -7 155M 58% | 26M 58% | 582M 61% | 457M 30% |
level -6 151M 56% | 25M 56% | 565M 59% | 437M 29% |
level -5 145M 54% | 24M 55% | 545M 57% | 417M 28% |
level -4 139M 52% | 23M 52% | 520M 54% | 391M 26% |
level -3 135M 50% | 22M 50% | 495M 51% | 369M 24% |
level -2 127M 47% | 22M 48% | 470M 49% | 349M 23% |
level -1 120M 45% | 21M 47% | 452M 47% | 332M 22% |
level 1 110M 41% | 17M 39% | 362M 38% | 290M 19% |
level 2 106M 40% | 17M 38% | 349M 36% | 288M 19% |
level 3 104M 39% | 16M 37% | 340M 35% | 276M 18% |
The samples represent some data sets that can be commonly found and show
approximate compressibility. The fast levels trade off speed for ratio
and are best suitable for highly compressible data.
As can be seen above, comparing the results to the current default zstd
level 3, the negative levels are roughly 2x worse at -15 and the
ratio increases almost linearly with each level.
Signed-off-by: Daniel Vacek <neelx@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify pre-content notification support from Jan Kara:
"This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
generated before a file contents is accessed.
The event is synchronous so if there is listener for this event, the
kernel waits for reply. On success the execution continues as usual,
on failure we propagate the error to userspace. This allows userspace
to fill in file content on demand from slow storage. The context in
which the events are generated has been picked so that we don't hold
any locks and thus there's no risk of a deadlock for the userspace
handler.
The new pre-content event is available only for users with global
CAP_SYS_ADMIN capability (similarly to other parts of fanotify
functionality) and it is an administrator responsibility to make sure
the userspace event handler doesn't do stupid stuff that can DoS the
system.
Based on your feedback from the last submission, fsnotify code has
been improved and now file->f_mode encodes whether pre-content event
needs to be generated for the file so the fast path when nobody wants
pre-content event for the file just grows the additional file->f_mode
check. As a bonus this also removes the checks whether the old
FS_ACCESS event needs to be generated from the fast path. Also the
place where the event is generated during page fault has been moved so
now filemap_fault() generates the event if and only if there is no
uptodate folio in the page cache.
Also we have dropped FS_PRE_MODIFY event as current real-world users
of the pre-content functionality don't really use it so let's start
with the minimal useful feature set"
* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
fanotify: Fix crash in fanotify_init(2)
fs: don't block write during exec on pre-content watched files
fs: enable pre-content events on supported file systems
ext4: add pre-content fsnotify hook for DAX faults
btrfs: disable defrag on pre-content watched files
xfs: add pre-content fsnotify hook for DAX faults
fsnotify: generate pre-content permission event on page fault
mm: don't allow huge faults for files with pre content watches
fanotify: disable readahead if we have pre-content watches
fanotify: allow to set errno in FAN_DENY permission response
fanotify: report file range info with pre-content events
fanotify: introduce FAN_PRE_ACCESS permission event
fsnotify: generate pre-content permission event on truncate
fsnotify: pass optional file access range in pre-content event
fsnotify: introduce pre-content permission events
fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
fanotify: rename a misnamed constant
fanotify: don't skip extra event info if no info_mode is set
fsnotify: check if file is actually being watched for pre-content events on open
fsnotify: opt-in for permission events at file open time
...
|
|
Print the read read policy if set as module parameter (with
CONFIG_BTRFS_EXPERIMENTAL).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For testing purposes allow to configure the read policy via module
parameter from the beginning. Available only with CONFIG_BTRFS_EXPERIMENTAL
Examples:
- Set the RAID1 balancing method to round-robin with a custom
min_contig_read of 4k:
$ modprobe btrfs read_policy=round-robin:4096
- Set the round-robin balancing method with the default
min_contiguous_read:
$ modprobe btrfs read_policy=round-robin
- Set the "devid" balancing method, defaulting to the latest device:
$ modprobe btrfs read_policy=devid
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit c9c49e8f157e ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from
CONFIG_BTRFS_DEBUG") introduces a way to enable or disable experimental
features, print its status during module load, like:
Btrfs loaded, experimental=on, debug=on, assert=on, zoned=yes, fsverity=yes
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is a recent ML report that mounting a large fs backed by hardware
RAID56 controller (with one device missing) took too much time, and
systemd seems to kill the mount attempt.
In that case, the only error message is:
BTRFS error (device sdj): open_ctree failed
There is no reason on why the failure happened, making it very hard to
understand the reason.
At least output the error number (in the particular case it should be
-EINTR) to provide some clue.
Link: https://lore.kernel.org/linux-btrfs/9b9c4d2810abcca2f9f76e32220ed9a90febb235.camel@scientia.org/
Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that all the code has been added for pre-content events, and the
various file systems that need the page fault hooks for fsnotify have
been updated, add SB_I_ALLOW_HSM to the supported file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes. Apart from the one liners and updated bio splitting
error handling there's a fix for subvolume mount with different flags.
This was known and fixed for some time but I've delayed it to give it
more testing.
- fix unbalanced locking when swapfile activation fails when the
subvolume gets deleted in the meantime
- add btrfs error handling after bio_split() calls that got error
handling recently
- during unmount, flush delalloc workers at the right time before the
cleaner thread is shut down
- fix regression in buffered write folio conversion, explicitly wait
for writeback as FGP_STABLE flag is currently a no-op on btrfs
- handle race in subvolume mount with different flags, the conversion
to the new mount API did not handle the case where multiple
subvolumes get mounted in parallel, which is a distro use case"
* tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
btrfs: handle bio_split() errors
btrfs: properly wait for writeback before buffered write
btrfs: fix missing snapshot drew unlock when root is dead during swap activation
btrfs: fix mount failure due to remount races
|
|
[BUG]
The following reproducer can cause btrfs mount to fail:
dev="/dev/test/scratch1"
mnt1="/mnt/test"
mnt2="/mnt/scratch"
mkfs.btrfs -f $dev
mount $dev $mnt1
btrfs subvolume create $mnt1/subvol1
btrfs subvolume create $mnt1/subvol2
umount $mnt1
mount $dev $mnt1 -o subvol=subvol1
while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done &
bg=$!
while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done
kill $bg
wait
umount -R $mnt1
umount -R $mnt2
The script will fail with the following error:
mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test.
dmesg(1) may have more information after failed mount system call.
umount: /mnt/test: target is busy.
umount: /mnt/scratch/: not mounted
And there is no kernel error message.
[CAUSE]
During the btrfs mount, to support mounting different subvolumes with
different RO/RW flags, we need to detect that and retry if needed:
Retry with matching RO flags if the initial mount fail with -EBUSY.
The problem is, during that retry we do not hold any super block lock
(s_umount), this means there can be a remount process changing the RO
flags of the original fs super block.
If so, we can have an EBUSY error during retry. And this time we treat
any failure as an error, without any retry and cause the above EBUSY
mount failure.
[FIX]
The current retry behavior is racy because we do not have a super block
thus no way to hold s_umount to prevent the race with remount.
Solve the root problem by allowing fc->sb_flags to mismatch from the
sb->s_flags at btrfs_get_tree_super().
Then at the re-entry point btrfs_get_tree_subvol(), manually check the
fc->s_flags against sb->s_flags, if it's a RO->RW mismatch, then
reconfigure with s_umount lock hold.
Reported-by: Enno Gotthold <egotthold@suse.com>
Reported-by: Fabian Vogt <fvogt@suse.com>
[ Special thanks for the reproducer and early analysis pointing to btrfs. ]
Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Changes outside of btrfs: add io_uring command flag to track a dying
task (the rest will go via the block git tree).
User visible changes:
- wire encoded read (ioctl) to io_uring commands, this can be used on
itself, in the future this will allow 'send' to be asynchronous. As
a consequence, the encoded read ioctl can also work in non-blocking
mode
- new ioctl to wait for cleaned subvolumes, no need to use the
generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
subvol sync"
- recognize different paths/symlinks for the same devices and don't
report them during rescanning, this can be observed with LVM or DM
- seeding device use case change, the sprout device (the one
capturing new writes) will not clear the read-only status of the
super block; this prevents accumulating space from deleted
snapshots
Performance improvements:
- reduce lock contention when traversing extent buffers
- reduce extent tree lock contention when searching for inline
backref
- switch from rb-trees to xarray for delayed ref tracking,
improvements due to better cache locality, branching factors and
more compact data structures
- enable extent map shrinker again (prevent memory exhaustion under
some types of IO load), reworked to run in a single worker thread
(there used to be problems causing long stalls under memory
pressure)
Core changes:
- raid-stripe-tree feature updates:
- make device replace and scrub work
- implement partial deletion of stripe extents
- new selftests
- split the config option BTRFS_DEBUG and add EXPERIMENTAL for
features that are experimental or with known problems so we don't
misuse debugging config for that
- subpage mode updates (sector < page):
- update compression implementations
- update writepage, writeback
- continued folio API conversions:
- buffered writes
- make buffered write copy one page at a time, preparatory work for
future integration with large folios, may cause performance drop
- proper locking of root item regarding starting send
- error handling improvements
- code cleanups and refactoring:
- dead code removal
- unused parameter reduction
- lockdep assertions"
* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
btrfs: send: check for read-only send root under critical section
btrfs: send: check for dead send root under critical section
btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
btrfs: fix a typo in btrfs_use_zone_append
btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
btrfs: remove hole from struct btrfs_delayed_node
btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
btrfs: add new ioctl to wait for cleaned subvolumes
btrfs: simplify range tracking in cow_file_range()
btrfs: remove conditional path allocation in btrfs_read_locked_inode()
btrfs: push cleanup into btrfs_read_locked_inode()
io_uring/cmd: let cmds to know about dying task
btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs multigrain timestamps from Christian Brauner:
"This is another try at implementing multigrain timestamps. This time
with significant help from the timekeeping maintainers to reduce the
performance impact.
Thomas provided a base branch that contains the required timekeeping
interfaces for the VFS. It serves as the base for the multi-grain
timestamp work:
- Multigrain timestamps allow the kernel to use fine-grained
timestamps when an inode's attributes is being actively observed
via ->getattr(). With this support, it's possible for a file to get
a fine-grained timestamp, and another modified after it to get a
coarse-grained stamp that is earlier than the fine-grained time. If
this happens then the files can appear to have been modified in
reverse order, which breaks VFS ordering guarantees.
To prevent this, a floor value is maintained for multigrain
timestamps. Whenever a fine-grained timestamp is handed out, record
it, and when later coarse-grained stamps are handed out, ensure
they are not earlier than that value. If the coarse-grained
timestamp is earlier than the fine-grained floor, return the floor
value instead.
The timekeeper changes add a static singleton atomic64_t into
timekeeper.c that is used to keep track of the latest fine-grained
time ever handed out. This is tracked as a monotonic ktime_t value
to ensure that it isn't affected by clock jumps. Because it is
updated at different times than the rest of the timekeeper object,
the floor value is managed independently of the timekeeper via a
cmpxchg() operation, and sits on its own cacheline.
Two new public timekeeper interfaces are added:
(1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
later of the coarse-grained clock and the floor time
(2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
and tries to swap it into the floor. A timespec64 is filled
with the result.
- The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around
1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting
via NFSv3, which relies on timestamps to validate caches. A lot of
changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide when to invalidate the cache. Even with
NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with
timestamp granularity. Other applications have similar issues with
timestamps (e.g backup applications).
If we were to always use fine-grained timestamps, that would
improve the situation, but that becomes rather expensive, as the
underlying filesystem would have to log a lot more metadata
updates.
This adds a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in
inode->i_ctime_nsec as a flag that indicates whether the current
timestamps have been queried via stat() or the like. When it's set,
we allow the kernel to use a fine-grained timestamp iff it's
necessary to make the ctime show a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible
for a file being changed to get a fine-grained timestamp. A file
that is altered just a bit later can then get a coarse-grained one
that appears older than the earlier fine-grained time. This
violates timestamp ordering guarantees.
This is where the earlier mentioned timkeeping interfaces help. A
global monotonic atomic64_t value is kept that acts as a timestamp
floor. When we go to stamp a file, we first get the latter of the
current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it
with that value.
If it has been queried, then first see whether the current coarse
time is later than the existing ctime. If it is, then we accept
that value. If it isn't, then we get a fine-grained time and try to
swap that into the global floor. Whether that succeeds or fails, we
take the resulting floor time, convert it to realtime and try to
swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails,
since either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same
floor value as multigrain filesystems)"
* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reduce pointer chasing in is_mgtime() test
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
timekeeping: Add percpu counter for tracking floor swap events
timekeeping: Add interfaces for handling timestamps with a floor value
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
|
|
Smatch complains about calling PTR_ERR() against a NULL pointer:
fs/btrfs/super.c:2272 btrfs_control_ioctl() warn: passing zero to 'PTR_ERR'
Fix this by calling PTR_ERR() against the device pointer only if it
contains an error.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the extent map shrinker can only be run by a single task and runs
asynchronously as a work queue job, enable it as it can no longer cause
stalls on tasks allocating memory and entering the extent map shrinker
through the fs shrinker (implemented by btrfs_free_cached_objects()).
This is crucial to prevent exhaustion of memory due to unbounded extent
map creation, primarily with direct IO but also for buffered IO on files
with holes. This problem, for the direct IO case, was first reported in
the Link tag below. That report was added to a Link tag of the first patch
that introduced the extent map shrinker, commit 956a17d9d050 ("btrfs: add
a shrinker for extent maps"), however the Link tag disappeared somehow
from the committed patch (but was included in the submitted patch to the
mailing list), so adding it below for future reference.
Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages and for some heavy workloads with memory
pressure it can cause some delays and stalls that make a machine
unresponsive for some periods. This happens because:
1) We can have several kswapd tasks on machines with multiple NUMA zones,
and running the extent map shrinker concurrently can cause high
contention on some spin locks, namely the spin locks that protect
the radix tree that tracks roots, the per root xarray that tracks
open inodes and the list of delayed iputs. This not only delays the
shrinker but also causes high CPU consumption and makes the task
running the shrinker monopolize a core, resulting in the symptoms
of an unresponsive system. This was noted in previous commits such as
commit ae1e766f623f ("btrfs: only run the extent map shrinker from
kswapd tasks");
2) The extent map shrinker's iteration over inodes can often be slow, even
after changing the data structure that tracks open inodes for a root
from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
The transition to the xarray while it made things a bit faster, it's
still somewhat slow - for example in a test scenario with 10000 inodes
that have no extent maps loaded, the extent map shrinker took between
5ms to 8ms, using a release, non-debug kernel. Iterating over the
extent maps of an inode can also be slow if have an inode with many
thousands of extent maps, since we use a red black tree to track and
search extent maps. So having the extent map shrinker run synchronously
adds extra delay for other things a kswapd task does.
So make the extent map shrinker run asynchronously as a job for the
system unbounded workqueue, just like what we do for data and metadata
space reclaim jobs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The only caller passes NULL, we can drop the parameter. This is since
the new mount option parser done in 3bb17a25bcb09a ("btrfs: add get_tree
callback for new mount API").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the new mount option parser in commit ad21f15b0f79 ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently CONFIG_BTRFS_EXPERIMENTAL is not only for the extra debugging
output, but also for experimental features.
This is not ideal to distinguish planned but not yet stable features
from those purely designed for debugging.
This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL:
- Extent map shrinker
This seems to be the first one to exit experimental.
- Extent tree v2
This seems to be the last one to graduate from experimental.
- Raid stripe tree
- Csum offload mode
- Send protocol v3
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With util-linux 2.40.2, the 'mount' utility is already utilizing the new
mount API. e.g:
# strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/
...
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
But this leads to a new problem, that per-subvolume RO/RW mount no
longer works, if the initial mount is RO:
# mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test
# mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch
# mount | grep mnt
/dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1)
/dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2)
# touch /mnt/scratch/foobar
touch: cannot touch '/mnt/scratch/foobar': Read-only file system
This is a common use cases on distros.
[CAUSE]
We have a workaround for remount to handle the RO->RW change, but if the
mount is using the new mount API, we do not do that, and rely on the
mount tool NOT to set the ro flag.
But that's not how the mount tool is doing for the new API:
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
This means we will set the super block RO at the first mount.
Later RW mount will not try to reconfigure the fs to RW because the
mount tool is already using the new API.
This totally breaks the per-subvolume RO/RW mount behavior.
[FIX]
Do not skip the reconfiguration even if using the new API. The old
comments are just expecting any mount tool to properly skip the RO flag
set even if we specify "ro", which is not the reality.
Update the comments regarding the backward compatibility on the kernel
level so it works with old and new mount utilities.
CC: stable@vger.kernel.org # 6.8+
Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Syzbot reports the following crash:
BTRFS info (device loop0 state MCS): disabling free space tree
BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline]
RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041
Call Trace:
<TASK>
btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530
btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312
btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012
btrfs_remount_rw fs/btrfs/super.c:1309 [inline]
btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534
btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline]
btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115
vfs_get_tree+0x90/0x2b0 fs/super.c:1800
do_new_mount+0x2be/0xb40 fs/namespace.c:3472
do_mount fs/namespace.c:3812 [inline]
__do_sys_mount fs/namespace.c:4020 [inline]
__se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[CAUSE]
To support mounting different subvolume with different RO/RW flags for
the new mount APIs, btrfs introduced two workaround to support this feature:
- Skip mount option/feature checks if we are mounting a different
subvolume
- Reconfigure the fs to RW if the initial mount is RO
Combining these two, we can have the following sequence:
- Mount the fs ro,rescue=all,clear_cache,space_cache=v1
rescue=all will mark the fs as hard read-only, so no v2 cache clearing
will happen.
- Mount a subvolume rw of the same fs.
We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY
because our new fc is RW, different from the original fs.
Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag
first so that we can grab the existing fs_info.
Then we reconfigure the fs to RW.
- During reconfiguration, option/features check is skipped
This means we will restart the v2 cache clearing, and convert back to
v1 cache.
This will trigger fs writes, and since the original fs has "rescue=all"
option, it skips the csum tree read.
And eventually causing NULL pointer dereference in super block
writeback.
[FIX]
For reconfiguration caused by different subvolume RO/RW flags, ensure we
always run btrfs_check_options() to ensure we have proper hard RO
requirements met.
In fact the function btrfs_check_options() doesn't really do many
complex checks, but hard RO requirement and some feature dependency
checks, thus there is no special reason not to do the check for mount
reconfiguration.
Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/
Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After the migration to use fs context for processing mount options we had
a slight change in the semantics for remounting a filesystem that was
mounted with compress-force. Before we could clear compress-force by
passing only "-o compress[=algo]" during a remount, but after that change
that does not work anymore, force-compress is still present and one needs
to pass "-o compress-force=no,compress[=algo]" to the mount command.
Example, when running on a kernel 6.8+:
$ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
$ mount -o remount,compress=zlib:5 /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
On a 6.7 kernel (or older):
$ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
$ mount -o remount,compress=zlib:5 /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
So update btrfs_parse_param() to clear "compress-force" when "compress" is
given, providing the same semantics as kernel 6.7 and older.
Reported-by: Roman Mamedov <rm@romanrm.net>
Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/
CC: stable@vger.kernel.org # 6.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.
Note that this also drops the IS_I_VERSION check and unconditionally
bumps the change attribute, since SB_I_VERSION is always set on btrfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-11-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Although there are several patches improving the extent map shrinker,
there are still reports of too frequent shrinker behavior, taking too
much CPU for the kswapd process.
So let's only enable extent shrinker for now, until we got more
comprehensive understanding and a better solution.
Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the extent map shrinker can be run by any task when attempting
to allocate memory and there's enough memory pressure to trigger it.
To avoid too much latency we stop iterating over extent maps and removing
them once the task needs to reschedule. This logic was introduced in commit
b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed").
While that solved high latency problems for some use cases, it's still
not enough because with a too high number of tasks entering the extent map
shrinker code, either due to memory allocations or because they are a
kswapd task, we end up having a very high level of contention on some
spin locks, namely:
1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
roots to iterate over their inodes;
2) The spin lock of the xarray used to track open inodes for a root
(struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
be a red black tree and the spin lock was root->inode_lock;
3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
delayed iputs (calls btrfs_add_delayed_iput()).
Instead of allowing the extent map shrinker to be run by any task, make
it run only by kswapd tasks. This still solves the problem of running
into OOM situations due to an unbounded extent map creation, which is
simple to trigger by direct IO writes, as described in the changelog
of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and
by a similar case when doing buffered IO on files with a very large
number of holes (keeping the file open and creating many holes, whose
extent maps are only released when the file is closed).
Reported-by: kzd <kzd@56709.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
Reported-by: Octavia Togami <octavia.togami@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Tested-by: kzd <kzd@56709.net>
Tested-by: Octavia Togami <octavia.togami@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We've been wanting to get rid of this for a while, add a message to
indicate that this feature is going away and when so we can finally have
a date when we're going to remove it. The output looks like this
BTRFS warning (device nvme0n1): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the BTRFS_MOUNT_* flags are already beyond 32 bits, this is
going to cause compilation errors for some 32 bit systems, as their
unsigned long is only 32 bits long, thus flag
BTRFS_MOUNT_IGNORESUPERFLAGS overflows and can lead to errors.
Fix the problem by:
- Migrate all existing BTRFS_MOUNT_* flags to unsigned long long
- Migrate all mount option related variables to unsigned long long
* btrfs_fs_info::mount_opt
* btrfs_fs_context::mount_opt
* mount_opt parameter of btrfs_check_options()
* old_opts parameter of btrfs_remount_begin()
* old_opts parameter of btrfs_remount_cleanup()
* mount_opt parameter of btrfs_check_mountopts_zoned()
* mount_opt and opt parameters of check_ro_option()
Fixes: 32e6216512b4 ("btrfs: introduce new "rescue=ignoresuperflags" mount option")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This new mount option allows the kernel to skip the super flags check,
it's mostly to allow the kernel to do a rescue mount of an interrupted
checksum conversion.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Introduce "rescue=ignoremetacsums" to ignore metadata csums, all the
other metadata sanity checks are still kept as is.
This new mount option is mostly to allow the kernel to mount an
interrupted checksum conversion (at the metadata csum overwrite stage).
And since the main part of metadata sanity checks is inside
tree-checker, we shouldn't lose much safety, and the new mount option is
rescue mount option it requires full read-only mount.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The following three Opt_* enums haven't been utilized since the port to
new mount API:
- Opt_ignorebadroots
- Opt_ignoredatacsums
- Opt_rescue_all
All those enums are from the old day where we have dedicated mount
options, nowadays they have been moved to "rescue=" mount option
groups, and no more global tokens for them.
So we can safely remove them now.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The direct IO code is over a thousand lines and it's currently spread
between file.c and inode.c, which makes it not easy to locate some parts
of it sometimes. Also inode.c is about 11 thousand lines and file.c about
4 thousand lines, both too big. So move all the direct IO code into a
dedicated file, so that it's easy to locate all its code and reduce the
sizes of inode.c and file.c.
This is a pure move of code without any other changes except export a
a couple functions from inode.c (get_extent_allocation_hint() and
create_io_em()) because they are used in inode.c and the new direct-io.c
file, and a couple functions from file.c (btrfs_buffered_write() and
btrfs_write_check()) because they are used both in file.c and in the new
direct-io.c file.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|