Age | Commit message (Collapse) | Author |
|
commit f8f210dc84709804c9f952297f2bfafa6ea6b4bd upstream.
When updating the global block reserve, we account for the 6 items needed
by an unlink operation and the 6 delayed references for each one of those
items. However the calculation for the delayed references is not correct
in case we have the free space tree enabled, as in that case we need to
touch the free space tree as well and therefore need twice the number of
bytes. So use the btrfs_calc_delayed_ref_bytes() helper to calculate the
number of bytes need for the delayed references at
btrfs_update_global_block_rsv().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Diogo: this patch has been cherry-picked from the original commit;
conflicts included lack of a define (picked from commit 5630e2bcfe223)
and lack of btrfs_calc_delayed_ref_bytes (picked from commit 0e55a54502b97)
- changed const struct -> struct for compatibility.]
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f1e1765aad7de7a8b8102044fc6a44684bc36180 ]
If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely,
either, so we shouldn't allow those geometries to mount, either.
If the log is too small, we risk having transaction reservations
overruning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts of the log is too small for V5 filesystems, but we must allow
v4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.
The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This mean that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry is violated, and then fail to
check the hard fail geometries that should fail the mount.
Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.
However, for the minimum log size checks, we can only do that once
we've setup up the log and calculated all the iclog sizes and
roundoffs. Hence this needs to remain in the log mount code after
the log has been initialised. It is also the only case where we
should allow a v4 filesystem to continue running, so leave that
handling in place, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8e698ee72c4ecbbf18264568eb310875839fd601 ]
Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:
Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda isize=512 agcount=8, agsize=16352 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
data = bsize=4096 blocks=130816, imaxpct=25
= sunit=32 swidth=128 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=8192, version=2
= sectsz=512 sunit=32 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 blks
Discarding blocks...Done.
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
- 16:30:50: zeroing log - 16320 of 16320 blocks done
- scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798
The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG. Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record. If the last
record created has a blockcount of zero, then it sets numrecs = 0.
That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG. For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record. That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.
The numrecs handling itself is overly complicated because a different
function sets numrecs == 1. Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.
Fixes: f327a00745ff ("xfs: account for log space when formatting new AGs")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 537c013b140d373d1ffe6290b841dc00e67effaa ]
During review of the patcheset that provided reloading of the incore
iunlink list, Dave made a few suggestions, and I updated the copy in my
dev tree. Unfortunately, I then got distracted by ... who even knows
what ... and forgot to backport those changes from my dev tree to my
release candidate branch. I then sent multiple pull requests with stale
patches, and that's what was merged into -rc3.
So.
This patch re-adds the use of an unlocked iunlink list check to
determine if we want to allocate the resources to recreate the incore
list. Since lost iunlinked inodes are supposed to be rare, this change
helps us avoid paying the transaction and AGF locking costs every time
we open any inode.
This also re-adds the shutdowns on failure, and re-applies the
restructuring of the inner loop in xfs_inode_reload_unlinked_bucket, and
re-adds a requested comment about the quotachecking code.
Retain the original RVB tag from Dave since there's no code change from
the last submission.
Fixes: 68b957f64fca1 ("xfs: load uncached unlinked inodes into memory on demand")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 49813a21ed57895b73ec4ed3b99d4beec931496f ]
Teach quotacheck to reload the unlinked inode lists when walking the
inode table. This requires extra state handling, since it's possible
that a reloaded inode will get inactivated before quotacheck tries to
scan it; in this case, we need to ensure that the reloaded inode does
not have dquots attached when it is freed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 83771c50e42b92de6740a63e152c96c052d37736 ]
The previous patch to reload unrecovered unlinked inodes when adding a
newly created inode to the unlinked list is missing a key piece of
functionality. It doesn't handle the case that someone calls xfs_iget
on an inode that is not the last item in the incore list. For example,
if at mount time the ondisk iunlink bucket looks like this:
AGI -> 7 -> 22 -> 3 -> NULL
None of these three inodes are cached in memory. Now let's say that
someone tries to open inode 3 by handle. We need to walk the list to
make sure that inodes 7 and 22 get loaded cold, and that the
i_prev_unlinked of inode 3 gets set to 22.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f12b96683d6976a3a07fdf3323277c79dbe8f6ab ]
Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI. This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list. Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) requires both ILOCK_EXCL and the
AGI buffer lock.
The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.
The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate. An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3c90c01e49342b166e5c90ec2c85b220be15a20e ]
The agend should be "start + length - 1", then, blockcount should be
"end + 1 - start". Correct 2 calculation mistakes.
Also, rename "agend" to "range_agend" because it's not the end of the AG
per se; it's the end of the dead region within an AG's agblock space.
Fixes: 5cf32f63b0f4 ("xfs: fix the calculation for "end" and "length"")
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 348a1983cf4cf5099fc398438a968443af4c9f65 ]
Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:
XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
FS: 0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
__xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
process_one_work (kernel/workqueue.c:3231)
worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
kthread (kernel/kthread.c:389)
ret_from_fork (arch/x86/kernel/process.c:147)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
</TASK>
And occurs when the the inode precommit handlers is attempt to look
up the inode cluster buffer to attach the inode for writeback.
The trail of logic that I can reconstruct is as follows.
1. the inode is clean when inodegc runs, so it is not
attached to a cluster buffer when precommit runs.
2. #1 implies the inode cluster buffer may be clean and not
pinned by dirty inodes when inodegc runs.
3. #2 implies that the inode cluster buffer can be reclaimed
by memory pressure at any time.
4. The assert failure implies that the cluster buffer was
attached to the transaction, but not marked done. It had
been accessed earlier in the transaction, but not marked
done.
5. #4 implies the cluster buffer has been invalidated (i.e.
marked stale).
6. #5 implies that the inode cluster buffer was instantiated
uninitialised in the transaction in xfs_ifree_cluster(),
which only instantiates the buffers to invalidate them
and never marks them as done.
Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.
I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:
1. A buffer cache lookup hit on a XBF_STALE buffer will
clear the XBF_DONE flag. Hence all future users of the
buffer know they have to re-initialise the contents
before use and mark it done themselves.
2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
means the buffer remains locked until the journal commit
completes and the buffer is unpinned. Hence once marked
XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
context that can access the freed buffer is the currently
running transaction.
3. #2 implies that future buffer lookups in the currently
running transaction will hit the transaction match code
and not the buffer cache. Hence XBF_STALE and
XFS_BLI_STALE will not be cleared unless the transaction
initialises and logs the buffer with valid contents
again. At which point, the buffer will be marked marked
XBF_DONE again, so having XBF_DONE already set on the
stale buffer is a moot point.
4. #2 also implies that any concurrent access to that
cluster buffer will block waiting on the buffer lock
until the inode cluster has been fully freed and is no
longer an active inode cluster buffer.
5. #4 + #1 means that any future user of the disk range of
that buffer will always see the range of disk blocks
covered by the cluster buffer as not done, and hence must
initialise the contents themselves.
6. Setting XBF_DONE in xfs_ifree_cluster() then means the
unlinked inode precommit code will see a XBF_DONE buffer
from the transaction match as it expects. It can then
attach the stale but newly dirtied inode to the stale
but newly dirtied cluster buffer without unexpected
failures. The stale buffer will then sail through the
journal and do the right thing with the attached stale
inode during unpin.
Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....
Fixes: 82842fee6e59 ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1bba82fe1afac69c85c1f5ea137c8e73de3c8032 ]
In commit 8ee81ed581ff, Ye Bin complained about an ASSERT in the bmapx
code that trips if we encounter a delalloc extent after flushing the
pagecache to disk. The ioctl code does not hold MMAPLOCK so it's
entirely possible that a racing write page fault can create a delalloc
extent after the file has been flushed. The proposed solution was to
replace the assertion with an early return that avoids filling out the
bmap recordset with a delalloc entry if the caller didn't ask for it.
At the time, I recall thinking that the forward logic sounded ok, but
felt hesitant because I suspected that changing this code would cause
something /else/ to burst loose due to some other subtlety.
syzbot of course found that subtlety. If all the extent mappings found
after the flush are delalloc mappings, we'll reach the end of the data
fork without ever incrementing bmv->bmv_entries. This is new, since
before we'd have emitted the delalloc mappings even though the caller
didn't ask for them. Once we reach the end, we'll try to set
BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
corrupt something else in memory. Yay.
I really dislike all these stupid patches that fiddle around with debug
code and break things that otherwise worked well enough. Nobody was
complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
that nobody cared about" to "bad behavior that must be addressed
immediately".
Maybe I'll just ignore anything from Huawei from now on for my own sake.
Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
Fixes: 8ee81ed581ff ("xfs: fix BUG_ON in xfs_getbmap()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 68b957f64fca1930164bfc6d6d379acdccd547d7 ]
shrikanth hegde reports that filesystems fail shortly after mount with
the following failure:
WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
ip = radix_tree_lookup(&pag->pag_ici_root, agino);
if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }
>From diagnostic data collected by the bug reporters, it would appear
that we cleanly mounted a filesystem that contained unlinked inodes.
Unlinked inodes are only processed as a final step of log recovery,
which means that clean mounts do not process the unlinked list at all.
Prior to the introduction of the incore unlinked lists, this wasn't a
problem because the unlink code would (very expensively) traverse the
entire ondisk metadata iunlink chain to keep things up to date.
However, the incore unlinked list code complains when it realizes that
it is out of sync with the ondisk metadata and shuts down the fs, which
is bad.
Ritesh proposed to solve this problem by unconditionally parsing the
unlinked lists at mount time, but this imposes a mount time cost for
every filesystem to catch something that should be very infrequent.
Instead, let's target the places where we can encounter a next_unlinked
pointer that refers to an inode that is not in cache, and load it into
cache.
Note: This patch does not address the problem of iget loading an inode
from the middle of the iunlink list and needing to set i_prev_unlinked
correctly.
Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5cf32f63b0f4c520460c1a5dd915dc4f09085f29 ]
The value of "end" should be "start + length - 1".
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4b827b3f305d1fcf837265f1e12acc22ee84327c ]
It just creates unnecessary bot noise these days.
Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c3b880acadc95d6e019eae5d669e072afda24f1b ]
I found a corruption during growfs:
XFS (loop0): Internal error agbno >= mp->m_sb.sb_agblocks at line 3661 of
file fs/xfs/libxfs/xfs_alloc.c. Caller __xfs_free_extent+0x28e/0x3c0
CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
Call Trace:
<TASK>
dump_stack_lvl+0x50/0x70
xfs_corruption_error+0x134/0x150
__xfs_free_extent+0x2c1/0x3c0
xfs_ag_extend_space+0x291/0x3e0
xfs_growfs_data+0xd72/0xe90
xfs_file_ioctl+0x5f9/0x14a0
__x64_sys_ioctl+0x13e/0x1c0
do_syscall_64+0x39/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
XFS (loop0): Corruption detected. Unmount and run xfs_repair
XFS (loop0): Internal error xfs_trans_cancel at line 1097 of file
fs/xfs/xfs_trans.c. Caller xfs_growfs_data+0x691/0xe90
CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
Call Trace:
<TASK>
dump_stack_lvl+0x50/0x70
xfs_error_report+0x93/0xc0
xfs_trans_cancel+0x2c0/0x350
xfs_growfs_data+0x691/0xe90
xfs_file_ioctl+0x5f9/0x14a0
__x64_sys_ioctl+0x13e/0x1c0
do_syscall_64+0x39/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f2d86706577
The bug can be reproduced with the following sequence:
# truncate -s 1073741824 xfs_test.img
# mkfs.xfs -f -b size=1024 -d agcount=4 xfs_test.img
# truncate -s 2305843009213693952 xfs_test.img
# mount -o loop xfs_test.img /mnt/test
# xfs_growfs -D 1125899907891200 /mnt/test
The root cause is that during growfs, user space passed in a large value
of newblcoks to xfs_growfs_data_private(), due to current sb_agblocks is
too small, new AG count will exceed UINT_MAX. Because of AG number type
is unsigned int and it would overflow, that caused nagcount much smaller
than the actual value. During AG extent space, delta blocks in
xfs_resizefs_init_new_ags() will much larger than the actual value due to
incorrect nagcount, even exceed UINT_MAX. This will cause corruption and
be detected in __xfs_free_extent. Fix it by growing the filesystem to up
to the maximally allowed AGs and not return EINVAL when new AG count
overflow.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d4d12c02bf5f768f1b423c7ae2909c5afdfe0d5f ]
Unlinked list recovery requires errors removing the inode the from
the unlinked list get fed back to the main recovery loop. Now that
we offload the unlinking to the inodegc work, we don't get errors
being fed back when we trip over a corruption that prevents the
inode from being removed from the unlinked list.
This means we never clear the corrupt unlinked list bucket,
resulting in runtime operations eventually tripping over it and
shutting down.
Fix this by collecting inodegc worker errors and feed them
back to the flush caller. This is largely best effort - the only
context that really cares is log recovery, and it only flushes a
single inode at a time so we don't need complex synchronised
handling. Essentially the inodegc workers will capture the first
error that occurs and the next flush will gather them and clear
them. The flush itself will only report the first gathered error.
In the cases where callers can return errors, propagate the
collected inodegc flush error up the error handling chain.
In the case of inode unlinked list recovery, there are several
superfluous calls to flush queued unlinked inodes -
xlog_recover_iunlink_bucket() guarantees that it has flushed the
inodegc and collected errors before it returns. Hence nothing in the
calling path needs to run a flush, even when an error is returned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 82842fee6e5979ca7e2bf4d839ef890c22ffb7aa ]
Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.
For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.
Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.
This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.
To fix this we need move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.
However, we do now have a mechanism that allows is to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.
This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.
Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit cb042117488dbf0b3b38b05771639890fada9a52 ]
To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
inode cluster buffer operations to the ->iop_precommit() method.
However, this means that deferred operations can require precommits
to be run on the final transaction that the deferred ops pass back
to xfs_trans_commit() context. This will be exposed by attribute
handling, in that the last changes to the inode in the attr set
state machine "disappear" because the precommit operation is not run.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 89a4bf0dc3857569a77061d3d5ea2ac85f7e13c6 ]
When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
the buffer after we've dropped the buffer log item reference count.
This opens a window where we can have two racing unpins for the
buffer item (e.g. shutdown checkpoint context callback processing
racing with journal IO iclog completion processing) and both attempt
to access the buffer after dropping the BLI reference count. If we
are unlucky, the "BLI freed" context wins the race and frees the
buffer before the "BLI still active" case checks the buffer pin
count.
This results in a use after free that can only be triggered
in active filesystem shutdown situations.
To fix this, we need to ensure that buffer existence extends beyond
the BLI reference count checks and until the unpin processing is
complete. This implies that a buffer pin operation must also take a
buffer reference to ensure that the buffer cannot be freed until the
buffer unpin processing is complete.
Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8ee81ed581ff35882b006a5205100db0b57bf070 ]
There's issue as follows:
XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422
RIP: 0010:assfail+0x96/0xa0
RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000
RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002
RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0
R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c
FS: 00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
xfs_getbmap+0x1a5b/0x1e40
xfs_ioc_getbmap+0x1fd/0x5b0
xfs_file_ioctl+0x2cb/0x1d50
__x64_sys_ioctl+0x197/0x210
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Above issue may happen as follows:
ThreadA ThreadB
do_shared_fault
__do_fault
xfs_filemap_fault
__xfs_filemap_fault
filemap_fault
xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag
xfs_getbmap
xfs_ilock(ip, XFS_IOLOCK_SHARED);
filemap_write_and_wait
do_page_mkwrite
xfs_filemap_page_mkwrite
__xfs_filemap_fault
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
iomap_page_mkwrite
...
xfs_buffered_write_iomap_begin
xfs_bmapi_reserve_delalloc -> Allocate delay extent
xfs_ilock_data_map_shared(ip)
xfs_getbmap_report_one
ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0)
-> trigger BUG_ON
As xfs_filemap_page_mkwrite() only hold XFS_MMAPLOCK_SHARED lock, there's
small window mkwrite can produce delay extent after file write in xfs_getbmap().
To solve above issue, just skip delalloc extents.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0c7273e494dd5121e20e160cb2f047a593ee14a8 ]
The background inode inactivation can attached dquots to inodes, but
this can race with a foreground quotacheck failure that leads to
disabling quotas and freeing the mp->m_quotainfo structure. The
background inode inactivation then tries to allocate a quota, tries
to dereference mp->m_quotainfo, and crashes like so:
XFS (loop1): Quotacheck: Unsuccessful (Error -5): Disabling quotas.
xfs filesystem being mounted at /root/syzkaller.qCVHXV/0/file0 supports timestamps until 2038 (0x7fffffff)
BUG: kernel NULL pointer dereference, address: 00000000000002a8
....
CPU: 0 PID: 161 Comm: kworker/0:4 Not tainted 6.2.0-c9c3395d5e3d #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: xfs-inodegc/loop1 xfs_inodegc_worker
RIP: 0010:xfs_dquot_alloc+0x95/0x1e0
....
Call Trace:
<TASK>
xfs_qm_dqread+0x46/0x440
xfs_qm_dqget_inode+0x154/0x500
xfs_qm_dqattach_one+0x142/0x3c0
xfs_qm_dqattach_locked+0x14a/0x170
xfs_qm_dqattach+0x52/0x80
xfs_inactive+0x186/0x340
xfs_inodegc_worker+0xd3/0x430
process_one_work+0x3b1/0x960
worker_thread+0x52/0x660
kthread+0x161/0x1a0
ret_from_fork+0x29/0x50
</TASK>
....
Prevent this race by flushing all the queued background inode
inactivations pending before purging all the cached dquots when
quotacheck fails.
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 60b730a40c43fbcc034970d3e77eb0f25b8cc1cf ]
If the end position of a GETFSMAP query overlaps an allocated space and
we're using the free space info to generate fsmap info, the akeys
information gets fed into the fsmap formatter with bad results.
Zero-init the space.
Reported-by: syzbot+090ae72d552e6bd93cfe@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d5753847b216db0e553e8065aa825cfe497ad143 ]
When we enter xfs_bmbt_alloc_block() without having first allocated
a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
are doing something like unwritten extent conversion, the transaction
block reservation is used as the minleft value.
This works for operations like unwritten extent conversion, but it
assumes that the block reservation is only for a BMBT split. THis is
not always true, and sometimes results in larger than necessary
minleft values being set. We only actually need enough space for a
btree split, something we already handle correctly in
xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
calculate the number of blocks a BMBT split on this inode is going to
require, not use the transaction block reservation that contains the
maximum number of blocks this transaction may consume in it...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f08f984c63e9980614ae3a0a574b31eaaef284b2 ]
When an XFS filesystem has free inodes in chunks already allocated
on disk, it will still allocate new inode chunks if the target AG
has no free inodes in it. Normally, this is a good idea as it
preserves locality of all the inodes in a given directory.
However, at ENOSPC this can lead to using the last few remaining
free filesystem blocks to allocate a new chunk when there are many,
many free inodes that could be allocated without consuming free
space. This results in speeding up the consumption of the last few
blocks and inode create operations then returning ENOSPC when there
free inodes available because we don't have enough block left in the
filesystem for directory creation reservations to proceed.
Hence when we are near ENOSPC, we should be attempting to preserve
the remaining blocks for directory block allocation rather than
using them for unnecessary inode chunk creation.
This particular behaviour is exposed by xfs/294, when it drives to
ENOSPC on empty file creation whilst there are still thousands of
free inodes available for allocation in other AGs in the filesystem.
Hence, when we are within 1% of ENOSPC, change the inode allocation
behaviour to prefer to use existing free inodes over allocating new
inode chunks, even though it results is poorer locality of the data
set. It is more important for the allocations to be space efficient
near ENOSPC than to have optimal locality for performance, so lets
modify the inode AG selection code to reflect that fact.
This allows generic/294 to not only pass with this allocator rework
patchset, but to increase the number of post-ENOSPC empty inode
allocations to from ~600 to ~9080 before we hit ENOSPC on the
directory create transaction reservation.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1dd0510f6d4b85616a36aabb9be38389467122d9 ]
I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.
Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.
The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt. e.g. the failed attempt skipped several AGs
before failing, so we have locks an AG higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.
The deadlock exists even if allocation succeeds - we can do a
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.
xfs_alloc_vextent() can also check there if the requested
allocation falls within the allow range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to to non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.
This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c85007e2e3942da1f9361e4b5a9388ea3a8dcc5b ]
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.
A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:
1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.
2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.
3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run
4. The split work from the unwritten extent conversion must run
first.
5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.
6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.
At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....
This is a highly improbably series of events, but there it is.
There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.
This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.
The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 601a27ea09a317d0fe2895df7d875381fb393041 ]
In xfs_extent_busy_update_extent() case 6 and 7, whenever bno is modified on
extent busy, the relavent length has to be modified accordingly.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4da112513c01d7d0acf1025b8764349d46e177d6 ]
We are doing a test about deleting a large number of files
when memory is low. A deadlock problem was found.
[ 1240.279183] -> #1 (fs_reclaim){+.+.}-{0:0}:
[ 1240.280450] lock_acquire+0x197/0x460
[ 1240.281548] fs_reclaim_acquire.part.0+0x20/0x30
[ 1240.282625] kmem_cache_alloc+0x2b/0x940
[ 1240.283816] xfs_trans_alloc+0x8a/0x8b0
[ 1240.284757] xfs_inactive_ifree+0xe4/0x4e0
[ 1240.285935] xfs_inactive+0x4e9/0x8a0
[ 1240.286836] xfs_inodegc_worker+0x160/0x5e0
[ 1240.287969] process_one_work+0xa19/0x16b0
[ 1240.289030] worker_thread+0x9e/0x1050
[ 1240.290131] kthread+0x34f/0x460
[ 1240.290999] ret_from_fork+0x22/0x30
[ 1240.291905]
[ 1240.291905] -> #0 ((work_completion)(&gc->work)){+.+.}-{0:0}:
[ 1240.293569] check_prev_add+0x160/0x2490
[ 1240.294473] __lock_acquire+0x2c4d/0x5160
[ 1240.295544] lock_acquire+0x197/0x460
[ 1240.296403] __flush_work+0x6bc/0xa20
[ 1240.297522] xfs_inode_mark_reclaimable+0x6f0/0xdc0
[ 1240.298649] destroy_inode+0xc6/0x1b0
[ 1240.299677] dispose_list+0xe1/0x1d0
[ 1240.300567] prune_icache_sb+0xec/0x150
[ 1240.301794] super_cache_scan+0x2c9/0x480
[ 1240.302776] do_shrink_slab+0x3f0/0xaa0
[ 1240.303671] shrink_slab+0x170/0x660
[ 1240.304601] shrink_node+0x7f7/0x1df0
[ 1240.305515] balance_pgdat+0x766/0xf50
[ 1240.306657] kswapd+0x5bd/0xd20
[ 1240.307551] kthread+0x34f/0x460
[ 1240.308346] ret_from_fork+0x22/0x30
[ 1240.309247]
[ 1240.309247] other info that might help us debug this:
[ 1240.309247]
[ 1240.310944] Possible unsafe locking scenario:
[ 1240.310944]
[ 1240.312379] CPU0 CPU1
[ 1240.313363] ---- ----
[ 1240.314433] lock(fs_reclaim);
[ 1240.315107] lock((work_completion)(&gc->work));
[ 1240.316828] lock(fs_reclaim);
[ 1240.318088] lock((work_completion)(&gc->work));
[ 1240.319203]
[ 1240.319203] *** DEADLOCK ***
...
[ 2438.431081] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
[ 2438.432089] Call Trace:
[ 2438.432562] __schedule+0xa94/0x1d20
[ 2438.435787] schedule+0xbf/0x270
[ 2438.436397] schedule_timeout+0x6f8/0x8b0
[ 2438.445126] wait_for_completion+0x163/0x260
[ 2438.448610] __flush_work+0x4c4/0xa40
[ 2438.455011] xfs_inode_mark_reclaimable+0x6ef/0xda0
[ 2438.456695] destroy_inode+0xc6/0x1b0
[ 2438.457375] dispose_list+0xe1/0x1d0
[ 2438.458834] prune_icache_sb+0xe8/0x150
[ 2438.461181] super_cache_scan+0x2b3/0x470
[ 2438.461950] do_shrink_slab+0x3cf/0xa50
[ 2438.462687] shrink_slab+0x17d/0x660
[ 2438.466392] shrink_node+0x87e/0x1d40
[ 2438.467894] do_try_to_free_pages+0x364/0x1300
[ 2438.471188] try_to_free_pages+0x26c/0x5b0
[ 2438.473567] __alloc_pages_slowpath.constprop.136+0x7aa/0x2100
[ 2438.482577] __alloc_pages+0x5db/0x710
[ 2438.485231] alloc_pages+0x100/0x200
[ 2438.485923] allocate_slab+0x2c0/0x380
[ 2438.486623] ___slab_alloc+0x41f/0x690
[ 2438.490254] __slab_alloc+0x54/0x70
[ 2438.491692] kmem_cache_alloc+0x23e/0x270
[ 2438.492437] xfs_trans_alloc+0x88/0x880
[ 2438.493168] xfs_inactive_ifree+0xe2/0x4e0
[ 2438.496419] xfs_inactive+0x4eb/0x8b0
[ 2438.497123] xfs_inodegc_worker+0x16b/0x5e0
[ 2438.497918] process_one_work+0xbf7/0x1a20
[ 2438.500316] worker_thread+0x8c/0x1060
[ 2438.504938] ret_from_fork+0x22/0x30
When the memory is insufficient, xfs_inonodegc_worker will trigger memory
reclamation when memory is allocated, then flush_work() may be called to
wait for the work to complete. This causes a deadlock.
So use memalloc_nofs_save() to avoid triggering memory reclamation in
xfs_inodegc_worker.
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 52f31ed228212ba572c44e15e818a3a5c74122c0 ]
Resulting in a UAF if the shrinker races with some other dquot
freeing mechanism that sets XFS_DQFLAG_FREEING before the dquot is
removed from the LRU. This can occur if a dquot purge races with
drop_caches.
Reported-by: syzbot+912776840162c13db1a3@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit af77c4fc1871847b528d58b7fdafb4aa1f6a9262 ]
xattr in ocfs2 maybe 'non-indexed', which saved with additional space
requested. It's better to check if the memory is out of bound before
memcmp, although this possibility mainly comes from crafted poisonous
images.
Link: https://lkml.kernel.org/r/20240520024024.1976129-2-joseph.qi@linux.alibaba.com
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9e3041fecdc8f78a5900c3aa51d3d756e73264d6 ]
Add a paranoia check to make sure it doesn't stray beyond valid memory
region containing ocfs2 xattr entries when scanning for a match. It will
prevent out-of-bound access in case of crafted images.
Link: https://lkml.kernel.org/r/20240520024024.1976129-1-joseph.qi@linux.alibaba.com
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: af77c4fc1871 ("ocfs2: strict bound check before memcmp in ocfs2_xattr_find_entry()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7ccc1465465d78e6411b7bd730d06e7435802b5c ]
Call cifs_reconnect() to wake up processes waiting on negotiate
protocol to handle the case where server abruptly shut down and had no
chance to properly close the socket.
Simple reproducer:
ssh 192.168.2.100 pkill -STOP smbd
mount.cifs //192.168.2.100/test /mnt -o ... [never returns]
Cc: Rickard Andersson <rickaran@axis.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit ddb17dc880eeaac37b5a6e984de07b882de7d78d upstream.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f92214e4c312f6ea9d78650cc6291d200f17abb6 ]
If the call to nfs_delegation_grab_inode() fails, we will not have
dropped any locks that require us to rescan the list.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d72b7963115bea971a28eaa2cb76722c023f9fdf ]
Make sure that we clear the layout segments in cases where we see a
fatal error, and also in the case where the layout is invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2186a116538a715b20e15f84fdd3545e5fe0a39b ]
In most error cases, error code is not returned in smb2_open(),
__process_request() will not print error message.
Fix this by returning the correct value at the end of smb2_open().
Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3bc2ac2f8f0b78a13140fc72022771efe0c9b778 ]
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.
According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:
"Upon successful completion, unlink() shall mark for update the last data
modification and last file status change timestamps of the parent
directory. Also, if the file's link count is not 0, the last file status
change timestamp of the file shall be marked for update."
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f6bd41280a44dcc2e0a25ed72617d25f586974a7 ]
Sangsoo reported that a DAC denial error occurred when accessing
files through the ksmbd thread. This patch override fsids for
smb2_query_info().
Reported-by: Sangsoo Lee <constant.lee@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a018c1b636e79b60149b41151ded7c2606d8606e ]
Sangsoo reported that a DAC denial error occurred when accessing
files through the ksmbd thread. This patch override fsids for share
path check.
Reported-by: Sangsoo Lee <constant.lee@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 5cadfbd5a11e5495cac217534c5f788168b1afd7 upstream.
Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of
FUSE_NOTIFY_INVAL_ENTRY is effective.
This is needed for backports of this feature, otherwise the server could
just check the protocol version.
Fixes: 4f8d37020e1f ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY")
Cc: <stable@vger.kernel.org> # v6.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:
1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;
2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.
The race happens like this:
1) A user space program opens a file descriptor with O_DIRECT;
2) The program spawns 2 threads using libpthread for example;
3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.
4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;
5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;
6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;
7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;
8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.
The stack trace produced is the following:
assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.
This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.
Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.
The following C program triggers the issue:
$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>
static int fd;
static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;
ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}
static void *fsync_loop(void *arg)
{
while (1) {
int ret;
ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}
int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;
if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}
ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}
ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}
while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}
return 0;
}
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo
Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.
Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3002240d16494d798add0575e8ba1f284258ab34 ]
The memory of struct fuse_file is allocated but not freed
when get_create_ext return error.
Fixes: 3e2b6fdbdc9a ("fuse: send security context of inode on file")
Cc: stable@vger.kernel.org # v5.17
Signed-off-by: yangyun <yangyun50@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 15d937d7ca8c55d2b0ce9116e20c780fdd0b67cc ]
Will need to add supplementary groups to create messages, so add the
general concept of a request extension. A request extension is appended to
the end of the main request. It has a header indicating the size and type
of the extension.
The create security context (fuse_secctx_*) is similar to the generic
request extension, so include that as well in a backward compatible manner.
Add the total extension length to the request header. The offset of the
extension block within the request can be calculated by:
inh->len - inh->total_extlen * 8
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Stable-dep-of: 3002240d1649 ("fuse: fix memory leak in fuse_create_open")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 153524053bbb0d27bb2e0be36d1b46862e9ce74c ]
In general, as of now, in FUSE, direct writes on the same file are
serialized over inode lock i.e we hold inode lock for the full duration of
the write request. I could not find in fuse code and git history a comment
which clearly explains why this exclusive lock is taken for direct writes.
Following might be the reasons for acquiring an exclusive lock but not be
limited to
1) Our guess is some USER space fuse implementations might be relying on
this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.
This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk to introduce any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take on file size extension a) an exclusive lock and b) a
shared lock.
A possible example of an issue with i_size extending writes are write error
cases. Some writes might succeed and others might fail for file system
internal reasons - for example ENOSPACE. With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.
With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel. FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default. Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.
On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes. We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway. I.e. this lock is
just for local consistency.
Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Stable-dep-of: 3002240d1649 ("fuse: fix memory leak in fuse_create_open")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4f8d37020e1fd0bf6ee9381ba918135ef3712efd ]
Add a flag to entry expiration that lets the filesystem expire a dentry
without kicking it out from the cache immediately.
This makes a difference for overmounted dentries, where plain invalidation
would detach all submounts before dropping the dentry from the cache. If
only expiry is set on the dentry, then any overmounts are left alone and
until ->d_revalidate() is called.
Note: ->d_revalidate() is not called for the case of following a submount,
so invalidation will only be triggered for the non-overmounted case. The
dentry could also be mounted in a different mount instance, in which case
any submounts will still be detached.
Suggested-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Stable-dep-of: 3002240d1649 ("fuse: fix memory leak in fuse_create_open")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a017ad1313fc91bdf235097fd0a02f673fc7bb11 ]
We're seeing reports of soft lockups when iterating through the loops,
so let's add rescheduling points.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 810ee43d9cd245d138a2733d87a24858a23f577d ]
Syzkiller reports a "KMSAN: uninit-value in pick_link" bug.
This is caused by an uninitialised page, which is ultimately caused
by a corrupted symbolic link size read from disk.
The reason why the corrupted symlink size causes an uninitialised
page is due to the following sequence of events:
1. squashfs_read_inode() is called to read the symbolic
link from disk. This assigns the corrupted value
3875536935 to inode->i_size.
2. Later squashfs_symlink_read_folio() is called, which assigns
this corrupted value to the length variable, which being a
signed int, overflows producing a negative number.
3. The following loop that fills in the page contents checks that
the copied bytes is less than length, which being negative means
the loop is skipped, producing an uninitialised page.
This patch adds a sanity check which checks that the symbolic
link size is not larger than expected.
--
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk
Reported-by: Lizhi Xu <lizhi.xu@windriver.com>
Reported-by: syzbot+24ac24ff58dc5b0d26b9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/
V2: fix spelling mistake.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b8e947e9f64cac9df85a07672b658df5b2bcff07 ]
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5
m68k-gcc8/m68k-allmodconfig
mips-gcc8/mips-allmodconfig
powerpc-gcc5/powerpc-all{mod,yes}config
powerpc-gcc5/ppc64_defconfig
Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b56329a782314fde5b61058e2a25097af7ccb675 ]
Instead of a BUG_ON() just return an error, log an error message and
abort the transaction in case we find an extent buffer belonging to the
relocation tree that doesn't have the full backref flag set. This is
unexpected and should never happen (save for bugs or a potential bad
memory).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b8ccef048354074a548f108e51d0557d6adfd3a3 ]
In reada we BUG_ON(refs == 0), which could be unkind since we aren't
holding a lock on the extent leaf and thus could get a transient
incorrect answer. In walk_down_proc we also BUG_ON(refs == 0), which
could happen if we have extent tree corruption. Change that to return
-EUCLEAN. In do_walk_down() we catch this case and handle it correctly,
however we return -EIO, which -EUCLEAN is a more appropriate error code.
Finally in walk_up_proc we have the same BUG_ON(refs == 0), so convert
that to proper error handling. Also adjust the error message so we can
actually do something with the information.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1f9d44c0a12730a24f8bb75c5e1102207413cc9b ]
We have a couple of areas where we check to make sure the tree block is
locked before looking up or messing with references. This is old code
so it has this as BUG_ON(). Convert this to ASSERT() for developers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|