Age | Commit message | Author |
|
Pull NFS client bugfixes from Trond Myklebust:
- NFS: Fix a couple of missed handlers for the ENETDOWN and ENETUNREACH
transport errors
- NFS: Handle Oopsable failure of nfs_get_lock_context in the unlock
path
- NFSv4: Fix a race in nfs_local_open_fh()
- NFSv4/pNFS: Fix a couple of layout segment leaks in layoutreturn
- NFSv4/pNFS: Avoid sharing pNFS DS connections between net namespaces
since IP addresses are not guaranteed to refer to the same nodes
- NFS: Don't flush file data while holding multiple directory locks in
nfs_rename()
* tag 'nfs-for-6.15-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Avoid flushing data while holding directory locks in nfs_rename()
NFS/pnfs: Fix the error path in pnfs_layoutreturn_retry_later_locked()
NFSv4/pnfs: Reset the layout state after a layoutreturn
NFS/localio: Fix a race in nfs_local_open_fh()
nfs: nfs3acl: drop useless assignment in nfs3_get_acl()
nfs: direct: drop useless initializer in nfs_direct_write_completion()
nfs: move the nfs4_data_server_cache into struct nfs_net
nfs: don't share pNFS DS connections between net namespaces
nfs: handle failure of nfs_get_lock_context in unlock path
pNFS/flexfiles: Record the RPC errors in the I/O tracepoints
NFSv4/pnfs: Layoutreturn on close must handle fatal networking errors
NFSv4: Handle fatal ENETDOWN and ENETUNREACH errors
|
|
The Linux client assumes that all filehandles are non-volatile for
renames within the same directory (otherwise sillyrename cannot work).
However, the existence of the Linux 'subtree_check' export option has
meant that nfs_rename() has always assumed it needs to flush writes
before attempting to rename.
Since NFSv4 does allow the client to query whether or not the server
exhibits this behaviour, and since knfsd does actually set the
appropriate flag when 'subtree_check' is enabled on an export, it
should be OK to optimise away the write flushing behaviour in the cases
where it is clearly not needed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
If there isn't a valid layout, or the layout stateid has changed, the
cleanup after a layout return should clear out the old data.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
If there are still layout segments in the layout's plh_return_lsegs list
after a layout return, we should be resetting the state to ensure they
eventually get returned as well.
Fixes: 68f744797edd ("pNFS: Do not free layout segments that are marked for return")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
check_igot_inode() prints a variable string, which causes a harmless
warning with 'make W=1':
fs/ext4/inode.c:4763:45: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security]
4763 | ext4_error_inode(inode, function, line, 0, err_str);
Use a trivial "%s" format string instead.
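For illustration, the fix amounts to exactly this one-line change:
  -	ext4_error_inode(inode, function, line, 0, err_str);
  +	ext4_error_inode(inode, function, line, 0, "%s", err_str);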
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250423164354.2780635-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Standalone "nologreplay" mount option has been marked deprecated since
commit 74ef00185eb8 ("btrfs: introduce "rescue=" mount option"), which
dates back to v5.9 (2020).
Furthermore, there is no other filesystem with a mount option of the
same name, so this one is btrfs specific and we will not hit the same
problem as when removing the "norecovery" mount option.
So let's remove the standalone "nologreplay" mount option.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull xfs fixes from Carlos Maiolino:
"This includes a bug fix for a possible data corruption vector on the
zoned allocator garbage collector"
* tag 'xfs-fixes-6.15-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix comment on xfs_trans_ail_update_bulk()
xfs: Fix a comment on xfs_ail_delete
xfs: Fail remount with noattr2 on a v5 with v4 enabled
xfs: fix zoned GC data corruption due to wrong bv_offset
xfs: free up mp->m_free[0].count in error case
|
|
They look rather messy right now.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-3-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-2-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We're going to extend the coredump code in follow-up patches.
Clean it up so we can do this more easily.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-1-664f3caf2516@kernel.org
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make the logic of handling the InitializeFileRecordSegment operation
similar to that in Windows.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Resctrl is a filesystem interface to hardware that provides cache
allocation policy and bandwidth control for groups of tasks or CPUs.
To support more than one architecture, resctrl needs to live in /fs/.
Move the code that is concerned with the filesystem interface to
/fs/resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-25-james.morse@arm.com
|
|
Add Makefile and Kconfig for fs/resctrl. Add ARCH_HAS_CPU_RESCTRL
for the common parts of the resctrl interface and make X86_CPU_RESCTRL
select this.
Adding an include of asm/resctrl.h to linux/resctrl.h allows the
/fs/resctrl files to switch over to using this header instead.
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20250515165855.31452-16-james.morse@arm.com
|
|
Currently, when EROFS is built with per-CPU workers, the workers are
started and CPU hotplug hooks are registered during module initialization.
This leads to unnecessary worker start/stop cycles during CPU hotplug
events, particularly on Android devices that frequently suspend and resume.
This change defers the initialization of per-CPU workers and the
registration of CPU hotplug hooks until the first EROFS mount. This
ensures that these resources are only allocated and managed when EROFS is
actually in use.
The teardown of per-CPU workers and unregistration of CPU hotplug hooks
still occurs during z_erofs_exit_subsystem(), but only if they were
initialized.
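A rough sketch of the deferred, once-only initialization described
above (the lock, flag, and helper names here are illustrative, not the
actual symbols):
  static DEFINE_MUTEX(z_erofs_pcpu_lock);	/* illustrative names */
  static bool z_erofs_pcpu_ready;

  /* Called on the first EROFS mount instead of at module init. */
  static int z_erofs_init_pcpu_workers_once(void)
  {
  	int err = 0;

  	mutex_lock(&z_erofs_pcpu_lock);
  	if (!z_erofs_pcpu_ready) {
  		/* start workers and register CPU hotplug hooks */
  		err = z_erofs_start_pcpu_workers();
  		if (!err)
  			z_erofs_pcpu_ready = true;
  	}
  	mutex_unlock(&z_erofs_pcpu_lock);
  	return err;
  }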
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20250506225743.308517-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
- trace_erofs_readpages => trace_erofs_readahead;
- Remove a redundant statement `nrpages = readahead_count(rac);`;
- Move the tracepoint to the beginning of z_erofs_readahead().
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250514120820.2739288-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Pull bcachefs fixes from Kent Overstreet:
"The main user reported ones are:
- Fix a btree iterator locking inconsistency that's been causing us
to go emergency read-only in evacuate: "Fix broken btree_path lock
invariants in next_node()"
- Minor btree node cache reclaim tweak that should help with OOMs:
don't set btree nodes as accessed on fill
- Fix a bch2_bkey_clear_rebalance() issue that was causing rebalance
to do needless work"
* tag 'bcachefs-2025-05-15' of git://evilpiepirate.org/bcachefs:
bcachefs: fix wrong arg to fsck_err()
bcachefs: Fix missing commit in backpointer to missing target
bcachefs: Fix accidental O(n^2) in fiemap
bcachefs: Fix set_should_be_locked() call in peek_slot()
bcachefs: Fix self deadlock
bcachefs: Don't set btree nodes as accessed on fill
bcachefs: Fix livelock in journal_entry_open()
bcachefs: Fix broken btree_path lock invariants in next_node()
bcachefs: Don't strip rebalance_opts from indirect extents
|
|
We'd like to increase the maximum r/wsize that NFSD can support,
but without introducing possible regressions. So let's add a
default setting of 1MB. A subsequent patch will raise the
maximum value but leave the default alone.
No behavior change is expected.
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The 8192-byte maximum is a protocol-defined limit, and we already
have a symbolic constant defined whose name matches the name of
the limit defined in the protocol. Replace the duplicate.
No change in behavior is expected.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Clean up: The documenting comment for NFSD_BUFSIZE is quite stale.
NFSD_BUFSIZE is used only for NFSv4 Reply these days; never for
NFSv2 or v3, and never for RPC Calls. Even so, the byte count
estimate does not include the size of the NFSv4 COMPOUND Reply
HEADER or the RPC auth flavor.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If we can get rid of all uses of rq_vec, then it can be removed.
Replace one use of rqstp::rq_vec with rqstp::rq_bvec.
The feeling of layering violation grows stronger now that
<linux/sunrpc/xdr.h> is included in fs/nfsd/vfs.c.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
All three call sites do the same thing.
I'm struggling with this a bit, however. struct xdr_buf is an XDR
layer object and unmarshaling a WRITE payload is clearly a task
intended to be done by the proc and xdr functions, not by VFS. This
feels vaguely like a layering violation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If we can get rid of all uses of rq_vec, then it can be removed.
Replace one use of rqstp::rq_vec with rqstp::rq_bvec.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc7).
Conflicts:
tools/testing/selftests/drivers/net/hw/ncdevmem.c
97c4e094a4b2 ("tests/ncdevmem: Fix double-free of queue array")
2f1a805f32ba ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au
Adjacent changes:
net/core/devmem.c
net/core/devmem.h
0afc44d8cdf6 ("net: devmem: fix kernel panic when netlink socket close after module unload")
bd61848900bf ("net: devmem: Implement TX path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a comment at the beginning of extents_status.c to clarify the rules
for loading, mapping, modifying, and removing extents and blocks.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add ext4_check_map_extents_env() to the places that load extents, map
blocks, remove blocks, and modify extents, excluding the
I/O writeback context. This function will verify whether the locking
mechanisms in place are adequate.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We have two variables to track return values, ret and ret2, with types
vm_fault_t (an unsigned int type) and int, which makes it a bit confusing
and harder to keep track of. So use a single variable, of type int, and under
the 'out' label return vmf_error(ret) in case ret contains an error,
otherwise return VM_FAULT_NOPAGE. This is equivalent to what we had before
and it's simpler.
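The exit path then becomes roughly the following (a sketch; unlocking
and other cleanup elided):
  out:
  	/* ret < 0 means some step above failed */
  	if (ret < 0)
  		return vmf_error(ret);
  	return VM_FAULT_NOPAGE;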
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If the call to btrfs_set_extent_delalloc() fails we are always returning
VM_FAULT_SIGBUS, which is odd since the error means "bad access" and the
most likely cause for btrfs_set_extent_delalloc() to fail is -ENOMEM, which should
be translated to VM_FAULT_OOM.
Instead of returning VM_FAULT_SIGBUS return vmf_error(ret2), which gives
us a more appropriate return value, and we use that everywhere else too.
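For reference, vmf_error() in include/linux/mm.h is essentially the
following mapping, which is why -ENOMEM becomes VM_FAULT_OOM rather
than VM_FAULT_SIGBUS:
  static inline vm_fault_t vmf_error(int err)
  {
  	if (err == -ENOMEM)
  		return VM_FAULT_OOM;
  	else if (err == -EHWPOISON)
  		return VM_FAULT_HWPOISON;
  	return VM_FAULT_SIGBUS;
  }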
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have this entangled error checks early at btrfs_page_mkwrite():
1) Try to reserve delalloc space by calling btrfs_delalloc_reserve_space()
and storing the return value in the ret2 variable;
2) If the reservation succeeds, call file_update_time() and store the
return value in ret2 and also set the local variable 'reserved' to
true (1);
3) Then do an error check on ret2 to see if any of the previous calls
failed and if so, jump either to the 'out' label or to the
'out_noreserve' label, depending on whether 'reserved' is true or
not.
This is unnecessarily complex. Instead change this to a simpler and
more straightforward approach:
1) Call btrfs_delalloc_reserve_space(), if that returns an error jump to
the 'out_noreserve' label;
2) Then call file_update_time() and if that returns an error jump to the
'out' label.
Like this there's less nested if statements, no need to use a local
variable to track if space was reserved and if statements are used only
to check errors.
Also move the call to extent_changeset_free() out of the 'out_noreserve'
label and under the 'out' label since the changeset is allocated only if
the call to reserve delalloc space succeeded.
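A sketch of the simplified entry sequence (variable names assumed from
the surrounding function's context):
  ret = btrfs_delalloc_reserve_space(BTRFS_I(inode), &data_reserved,
  				     page_start, reserved_space);
  if (ret < 0)
  	goto out_noreserve;

  ret = file_update_time(vmf->vma->vm_file);
  if (ret < 0)
  	goto out;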
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the last call to btrfs_delalloc_release_space() where the value of the
variable 'ret' is never zero, we pass the expression 'ret != 0' as the
value for the argument 'qgroup_free', which always evaluates to true.
Make this less confusing and more clear by explicitly passing true
instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we're doing a mmap write against a folio that has i_size somewhere in
the middle and we have multiple sectors in the folio, we may have to
release excess space previously reserved, for the range going from the
rounded up (to sector size) i_size to the folio's end offset. We are
calculating the right amount to release and passing it to
btrfs_delalloc_release_space(), but we are passing the wrong start offset
of that range - we're passing the folio's start offset instead of the
end offset, plus 1, of the range for which we keep the reservation. This
may result in releasing more space than we should and eventually trigger
an underflow of the data space_info's bytes_may_use counter.
So fix this by passing the start offset as 'end + 1' instead of
'page_start' to btrfs_delalloc_release_space().
Fixes: d0b7da88f640 ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When running delayed references we are reading the number of ready delayed
ref heads without taking any lock which can make KCSAN report a race since
we can have concurrent tasks updating that number, such as for example
when freeing a tree block which will end up decrementing that counter or
when adding a new delayed ref while COWing a tree block which will
increment that counter.
This is a harmless race since running one more or one less delayed ref
head doesn't result in any problem; in the critical section of a
transaction commit we always run any remaining delayed refs and at that
point no one can create more.
So fix this harmless race by annotating the read with data_race().
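The annotation itself is a one-liner; data_race() tells KCSAN that the
lockless read is intentional and benign (the counter name below is
assumed for illustration):
  /* Racy read, but one head more or less makes no difference here. */
  num_heads = data_race(delayed_refs->num_heads_ready);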
Reported-by: cen zhang <zzzccc427@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFRLqsUCLMz0hY-GaPj1Z=fhkgRHjxVXHZ8kz0PvkFN0b=8L2Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When writing super blocks, at write_dev_supers(), we log an error message
when we get some error, but we don't show which error we got even
though we have that information. So enhance the error messages with the
error codes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to have special logic to return -EAGAIN in case the call
to __filemap_get_folio() fails, because when FGP_NOWAIT is passed to
__filemap_get_folio() it returns ERR_PTR(-EAGAIN) if it needs to do
something that would imply blocking.
The reason we have this logic is from the days before we migrated to the
folio interface, when we called pagecache_get_page() which would return
NULL instead of an error pointer.
So remove this special casing and always return the error that the call
to __filemap_get_folio() returned.
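The call site then reduces to plain error propagation, along these
lines (a sketch; the exact flags come from the caller):
  folio = __filemap_get_folio(mapping, index,
  			      FGP_LOCK | FGP_NOWAIT, GFP_NOFS);
  if (IS_ERR(folio))
  	return PTR_ERR(folio);	/* already -EAGAIN if it would block */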
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a few places that always assume a -ENOMEM error happened in case a
call to __filemap_get_folio() returns an error, which is just too much of
an assumption; even if it were the case at some point in time, it's
not future-proof and there's nothing in the documentation that guarantees
that only ERR_PTR(-ENOMEM) can be returned with the flags we are passing
to it.
So use the exact error returned by __filemap_get_folio() instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the if statement that checks the return value from
btrfs_check_data_free_space(), there's no point in checking if 'ret' is not
zero in the else branch, since the main if branch checked that it's zero,
so in the else branch it necessarily has a non-zero value.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If when truncating a block we fail to reserve data space and then we
proceed anyway because we can do a NOCOW write, if we later get an error
when trying to get the folio from the inode's mapping, we end up releasing
data space that we haven't reserved, screwing up the bytes_may_use counter
from the data space_info, eventually resulting in an underflow when all
other reservations done by other tasks are released, if any, or right away
if there are no other reservations at the moment.
This is because when we get an error when trying to grab the block's folio
we call btrfs_delalloc_release_space(), which releases metadata (which we
have reserved) and data (which we haven't reserved).
Fix this by calling btrfs_delalloc_release_space() only if we did reserve
data space, that is, if we aren't falling back to NOCOW, meaning the local
variable @only_release_metadata has a false value, otherwise release only
metadata by calling btrfs_delalloc_release_metadata().
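The error path then becomes, roughly (names taken from the commit
message; argument lists may differ in detail):
  if (only_release_metadata)
  	btrfs_delalloc_release_metadata(inode, blocksize, true);
  else
  	btrfs_delalloc_release_space(inode, data_reserved,
  				     block_start, blocksize, true);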
Fixes: 6d4572a9d71d ("btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Expand what the options do and whether they are OK to enable.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The list is out of date: the extent shrinker got fixed in 6.13. Add new
entries: the COW fixup warning in 6.15 and the round-robin policies in 6.14.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are debugging prints for each emitted send command and other
related actions. This does not seem right as the number of commands can
be high and dumping that to the system log will likely hit some rate
limiting. This should be done by trace points that are more lightweight
and can keep up with high frequency.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of having an error check and return on each branch of the if
statement, move the error check to happen after that if branch, reducing
source code and object code sizes.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1840174 163742 16136 2020052 1ed2d4 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1840138 163742 16136 2020016 1ed2b0 fs/btrfs/btrfs.ko
While moving the comments, also update them to be clearer
about how qgroup reserved space is released and the intricacies of how
it's managed for COW writes.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When allocating an ordered extent we call igrab() to get a reference on
the inode and attach it to the ordered extent. For an ordered extent we
must always have an inode reference, since during its life cycle we
need to access the inode for several things, for example:
* Inserting the ordered extent right after allocating it, when calling
insert_ordered_extent() - we need to lock the inode's ordered_tree_lock;
* In the bio submission path we need to add checksums to the ordered
extent and we end up at btrfs_add_ordered_sum(), where again we need
to grab the inode from the ordered extent to lock the inode's
ordered_tree_lock;
* When finishing an ordered extent, at btrfs_finish_ordered_extent(), we
need again to access its inode in order to lock the inode's
ordered_tree_lock;
* Etc etc etc.
Everywhere we deal with an ordered extent we always expect its inode to
be not NULL, the only exception being btrfs_put_ordered_extent() where
we check if it's NULL before calling btrfs_add_delayed_iput(), even though
we have already assumed it's not NULL when calling the tracepoint
trace_btrfs_ordered_extent_put() since the tracepoint dereferences the
inode to extract its number and root without ever checking it's NULL.
The igrab() call can return NULL if the inode is about to be freed or is
being freed (its state has I_FREEING or I_WILL_FREE set), and that's why
there's such check at btrfs_put_ordered_extent(). The igrab() and NULL
check were introduced in commit 5fd02043553b ("Btrfs: finish ordered
extents in their own thread") but even back then we always needed and
assumed igrab() returned a non-NULL pointer, since for example when
removing an ordered extent, at btrfs_remove_ordered_extent(), we assumed
the inode pointer was not NULL in order to access the inode's ordered
extent tree.
In fact whenever we allocate an ordered extent we are holding an inode
reference and the inode is not being freed or going to be freed (which
happens in the final iput), and since we depend on the inode for the
life cycle of the ordered extent, just make ordered extent allocation
fail in case igrab() returns NULL and trigger a warning, to make it
clear it's not expected. This allows us to remove the confusing NULL inode
check at btrfs_put_ordered_extent().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we fail to allocate an ordered extent for a COW write we end up leaking
a qgroup data reservation since we called btrfs_qgroup_release_data() but
we didn't call btrfs_qgroup_free_refroot() (which would happen when
running the respective data delayed ref created by ordered extent
completion or when finishing the ordered extent in case an error happened).
So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
ordered extent for a COW write.
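In rough form, the missing release on that failure path looks like this
(the exact arguments, in particular the reserved byte count, are
assumptions for illustration):
  /* Ordered extent allocation failed for a COW write. */
  btrfs_qgroup_free_refroot(inode->root->fs_info,
  			    btrfs_root_id(inode->root),
  			    qgroup_reserved, BTRFS_QGROUP_RSV_DATA);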
Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The scrub_sector_verification structure records the info needed for
block verification (either a data checksum pointer, or the expected
tree block generation).
But there is also a boolean to tell if this block is metadata or not.
As the data checksum pointer and expected tree block generation are
already a union, we need a dedicated bit to tell if this block is
metadata or not.
However, such a layout means we're wasting 63 bits on x86_64, which is
a huge memory waste.
Thanks to the recent bitmap aggregation, we can easily move this
single-bit-per-block member to a new sub-bitmap.
And since we already have six 16-bit bitmaps, adding another bitmap
won't even increase any memory usage on x86_64, as we need two 64-bit
words anyway.
This will reduce the following memory usages:
- sizeof(struct scrub_sector_verification)
From 16 bytes to 8 bytes on x86_64.
- scrub_stripe::sectors
From 16 * 16 to 16 * 8 bytes.
- Per-device scrub_ctx memory usage
From 128 * (16 * 16) to 128 * (16 * 8), which saves 16KiB memory.
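To illustrate the layout change (simplified, with assumed member
names): the standalone boolean is what padded the structure to 16
bytes, and it is replaced by one bit in a per-stripe sub-bitmap:
  /* Before: 8-byte union + bool, padded to 16 bytes on x86_64. */
  struct scrub_sector_verification {
  	bool is_metadata;
  	union {
  		u8 *csum;	/* data: checksum pointer */
  		u64 generation;	/* metadata: expected transid */
  	};
  };

  /* After: only the union remains (8 bytes); the metadata flag
   * moves into a 16-bit sub-bitmap of the stripe-wide bitmap. */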
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
For the following fsx -e 1 run, btrfs still fails the run on 64K
page size with 4K fs block size:
READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
OFFSET GOOD BAD RANGE
0x26b3a 0x0000 0x15b4 0x0
operation# (mod 256) for the bad data may be 21
[...]
LOG DUMP (28 total operations):
1( 1 mod 256): SKIPPED (no operation)
2( 2 mod 256): SKIPPED (no operation)
3( 3 mod 256): SKIPPED (no operation)
4( 4 mod 256): SKIPPED (no operation)
5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE
6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes)
7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR
8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes)
9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes)
10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR
11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77
12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes)
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff
17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes)
18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW
19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff
20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF
21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING
22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes)
23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161
24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5
25( 25 mod 256): SKIPPED (no operation)
26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes)
27( 27 mod 256): SKIPPED (no operation)
28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR***
[CAUSE]
The involved operations are:
fallocating to largest ever: 0x40000
21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes)
21 falloc from 0x351c5 to 0x40000 (0xae3b bytes)
28 read 0x26b3a thru 0x36633 (0xfafa bytes)
At operation #21 a pollute_eof is done, by memory mapped write into
range [0x24000, 0x2ffff).
At this stage, the inode size is 0x24000, which is block aligned.
Then fallocate happens, and since it's expanding the inode, it will call
btrfs_truncate_block() to truncate any unaligned range.
But since the inode size is already block aligned,
btrfs_truncate_block() does nothing and exits.
However, remember that the folio at 0x20000 already has some polluted
range; although it will not be written back to disk, it still affects
the page cache, resulting in the later operation #28 reading out the
polluted value.
[FIX]
Instead of exiting early from btrfs_truncate_block() if the range is
already block aligned, do extra folio zeroing if the fs block size is
smaller than the page size and we're truncating beyond EOF.
This is to address exactly the above case where a memory mapped write
can still leave some garbage beyond EOF.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:
#fsx -d -e 1 -N 4 $mnt/junk -S 36386
READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
OFFSET GOOD BAD RANGE
0xe9ba 0x0000 0x03ac 0x0
operation# (mod 256) for the bad data may be 3
...
LOG DUMP (4 total operations):
1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW
2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW
3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes)
4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR***
[CAUSE]
Only 2 operations are really involved in this case:
3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes)
3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes)
4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes)
At operation 3, fsx pollutes beyond EOF; that is done by mmap()ing the
file and writing into the mmap()ed range beyond EOF.
Such a write will fill the range beyond EOF, but it will never reach
disk, as ranges beyond EOF will not be marked dirty nor uptodate.
Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond
EOF (0x5448).
During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block.
But that function only really zeroes out the block at [0x5000, 0x5fff];
it doesn't bother with any range other than that block, since those
ranges will not be marked dirty nor written back.
So the range [0x6000, 0xffff] is still polluted, and later mapread()
will return the poisoned value.
[FIX]
Enhance btrfs_truncate_block() by:
- Pass a @start/@end pair to indicate the full truncation range
This is to handle the following truncation case:
Page size is 64K, fs block size is 4K, truncate range is
[6K, 60K]
  0                         32K                       64K
  |    |///////////////////////////////////////|     |
       6K                                     60K
The range is not aligned for its head block, so we need to call
btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.
But with that information we only know to zero the range [6K, 8K);
if we zeroed out the whole range [6K, 64K), the last block would also
be zeroed, causing data loss.
So here we need the full range we're truncating, so that we can avoid
over-truncation.
- Rename @from to @offset
As now the parameter is only utilized to locate a block, it's not
really carrying the old @from meaning well.
- Remove @front parameter
With the full truncate range passed in, we can determine if the
@offset is at the head or tail block.
- Skip truncation if @offset is in neither the head nor the tail block
The call site in hole punching unconditionally calls
btrfs_truncate_block() without even checking whether the range is
aligned or not.
If the @offset is neither in the head nor in the tail block, we can
safely ignore it.
- Skip truncation if the range inside the target block is already aligned
- Make btrfs_truncate_block() zero all blocks beyond EOF
Since we have the original range, we know exactly if we're doing
truncation beyond EOF (the @end will be (u64)-1).
If we're doing truncation beyond EOF, then enlarge the truncation
range to the folio end, to address the possibly polluted ranges.
Otherwise still keep the zero range inside the block; as we can have
large data folios soon, always truncating every block inside the same
folio can be costly for large folios.
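Put together, the reworked helper described by the list above would
have a signature roughly like:
  /* @offset locates the block to truncate; [@start, @end] is the
   * full truncation range, with @end == (u64)-1 meaning the
   * truncation goes beyond EOF. */
  int btrfs_truncate_block(struct btrfs_inode *inode, u64 offset,
  			   u64 start, u64 end);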
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The (correct) commit e41c81d0d30e ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.
Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and
one from folio_attach_private(). We would expect such pages to get dropped
by drop_caches. However, drop_caches calls into mapping_evict_folio() via
mapping_try_invalidate() which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.
For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.
When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.
Therefore, attempt to fix this by adding a folio_put() to the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust
folio_detach_private() to properly handle being the last reference to the
folio and not do a use-after-free after folio_detach_private().
extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are finished
being allocated and have called folio_attach_private().
Finally, removing the generic folio_put() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate folio_put() in allocation failure paths in
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().
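A sketch of the resulting ownership transfer (simplified to the folio
API, error handling elided), mirroring the three references described
above:
  folio = filemap_alloc_folio(GFP_NOFS, 0);	      /* refcount == 1 */
  filemap_add_folio(mapping, folio, index, GFP_NOFS); /* +1, page cache */
  folio_attach_private(folio, eb);		      /* +1, private */
  folio_put(folio);    /* donate the allocation ref; refcount now 2 */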
Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We now have a verbose variant of ASSERT() so that we can print the value
of the block group's discard_index. So use it for better problem analysis
in case the assertion is triggered.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we have several small bitmaps inside scrub_stripe:
- extent_sector_bitmap
- error_bitmap
- io_error_bitmap
- csum_error_bitmap
- meta_error_bitmap
- meta_gen_error_bitmap
All those bitmaps are at most 16 bits long, but unsigned long is
either 32 or 64 (more common) bits.
This means we're wasting 1/2 or 3/4 of the space for each bitmap.
And since we can have 128 scrub_stripe structures for each device, such
wasted space adds up quickly.
Instead of using a single unsigned long for each bitmap, aggregate them
into a larger bitmap, just like what we're doing for subpage support.
This reduces 24 bytes from each scrub_stripe structure on x86_64
systems.
This will need a lot of macros converting direct bitmap/bit operations into
our scrub_stripe specific helpers, but all those helpers are very small
and can be inlined.
So overall the overhead shouldn't be that huge, and we save quite some
memory space.
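An illustrative shape for the aggregation (all names here are invented
for the example): each logical bitmap becomes a fixed 16-bit slice of
one larger per-stripe bitmap, accessed through small inline helpers:
  enum scrub_bitmap_id {
  	SCRUB_BITMAP_EXTENT_SECTOR,
  	SCRUB_BITMAP_ERROR,
  	SCRUB_BITMAP_IO_ERROR,
  	/* ... one slot per logical bitmap ... */
  	SCRUB_NR_BITMAPS,
  };

  static inline bool scrub_test_bit(const struct scrub_stripe *stripe,
  				    enum scrub_bitmap_id id,
  				    unsigned int nr)
  {
  	return test_bit(id * stripe->nr_sectors + nr, stripe->bitmaps);
  }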
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|