Age | Commit message (Collapse) | Author |
|
While invalidating, the clearing of the bits in discard_buffer()
is done in one fully ordered CAS operation. In the past this was
done via individual clear_bit(), until e7470ee89f0 (fs: buffer:
do not use unnecessary atomic operations when discarding buffers).
This implies that there were never strong ordering requirements
outside of being serialized by the buffer lock.
As such relax the ordering for archs that can benefit. Further,
the implied ordering in buffer_unlock() makes current cmpxchg
implied barrier redundant due to release semantics. And while in
theory the unlock could be part of the bulk clearing, it is
best to leave it explicit, but without the double barriers.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lore.kernel.org/20250515173925.147823-5-dave@stgolabs.net
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Get rid of those unnecessary return statements.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lore.kernel.org/20250515173925.147823-4-dave@stgolabs.net
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
__getblk_slow() already implies failing a first lookup
as the fastpath, so try to create the buffers immediately
and avoid the redundant lookup. This saves 5-10% of the
total cost/latency of the slowpath.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lore.kernel.org/20250515173925.147823-3-dave@stgolabs.net
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Just as with the fast path, call the lookup variant depending
on the gfp flags.
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lore.kernel.org/20250515173925.147823-2-dave@stgolabs.net
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We added the documentation in ext4_map_blocks() for usage of
EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF flag. But It's better to add
a WARN_ON_ONCE in case if anyone tries using this flag with CREATE to
avoid a random issue later. Since depth can change with CREATE and it
needs to be re-calculated before using it in there.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/ee6e82a224c50b432df9ce1ce3333c50182d8473.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now that we have EXT4_EX_QUERY_FILTER mask, let's use that to simplify
the filtering of flags for passing to ext4_ext_map_blocks() in
ext4_map_query_blocks() function. This allows us to kill the query_flags
local variable which is not needed anymore.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/4ae735e83e6f43341e53e2d289e59156a8360134.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Rename EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER to better describe its
purpose as a filter mask used specifically in ext4_map_query_blocks().
Add a comment explaining that this macro is used to filter flags needed
when querying the on-disk extent tree.
We will later use EXT4_EX_QUERY_FILTER mask to add another
EXT4_GET_BLOCKS_QUERY needed to lookup in on-disk extent tree.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/51f05d0ba286372eb8693af95bd4b10194b53141.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This simplifies the check for last in leaf in ext4_map_query_blocks()
and fixes this cocci warning.
cocci warnings: (new ones prefixed by >>)
>> fs/ext4/inode.c:573:49-51: WARNING !A || A && B is equivalent to !A || B
Fixes: 5bb12b1837c0 ("ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505191524.auftmOwK-lkp@intel.com/
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/5fd5c806218c83f603c578c95997cf7f6da29d74.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This fixes the atomic write patch series after it was rebased on top of
extent status cache cleanup series i.e.
'commit 402e38e6b71f57 ("ext4: prevent stale extent cache entries caused by
concurrent I/O writeback")'
After the above series, EXT4_GET_BLOCKS_IO_CONVERT_EXT flag which has
EXT4_GET_BLOCKS_IO_SUBMIT flag set, requires that the io submit context
of any kind should pass EXT4_EX_NOCACHE to avoid caching unncecessary
extents in the extent status cache.
This patch fixes that by adding the EXT4_EX_NOCACHE flag in
ext4_convert_unwritten_extents_atomic() for unwritten to written
conversion calls to ext4_map_blocks().
Fixes: b86629c2b299 ("ext4: Add multi-fsblock atomic write support with bigalloc")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/ea0ad9378ff6d31d73f4e53f87548e3a20817689.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs fix from Mike Marshall:
"Fix for orangefs page writeout counting"
* tag 'for-linus-6.15-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: adjust counting code to recover from 665575cf
|
|
A late commit to 6.14-rc7! broke orangefs. 665575cf seems like a
good change, but maybe should have been introduced during the merge
window. This patch adjusts the counting code associated with
writing out pages so that orangefs works in a 665575cf world.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
fstest generic/388 occasionally reproduces a crash that looks as
follows:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
Call Trace:
<TASK>
ext4_block_zero_page_range+0x30c/0x380 [ext4]
ext4_truncate+0x436/0x440 [ext4]
ext4_process_orphan+0x5d/0x110 [ext4]
ext4_orphan_cleanup+0x124/0x4f0 [ext4]
ext4_fill_super+0x262d/0x3110 [ext4]
get_tree_bdev_flags+0x132/0x1d0
vfs_get_tree+0x26/0xd0
vfs_cmd_create+0x59/0xe0
__do_sys_fsconfig+0x4ed/0x6b0
do_syscall_64+0x82/0x170
...
This occurs when processing a symlink inode from the orphan list. The
partial block zeroing code in the truncate path calls
ext4_dirty_journalled_data() -> folio_mark_dirty(). The latter calls
mapping->a_ops->dirty_folio(), but symlink inodes are not assigned an
a_ops vector in ext4, hence the crash.
To avoid this problem, update the ext4_dirty_journalled_data() helper to
only mark the folio dirty on regular files (for which a_ops is
assigned). This also matches the journaling logic in the ext4_symlink()
creation path, where ext4_handle_dirty_metadata() is called directly.
Fixes: d84c9ebdac1e ("ext4: Mark pages with journalled data dirty")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20250516173800.175577-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
|
|
Last couple of patches added the needed support for multi-fsblock atomic
writes using bigalloc. This patch ensures that filesystem advertizes the
needed atomic write unit min and max values for enabling multi-fsblock
atomic write support with bigalloc.
Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/5e45d7ed24499024b9079436ba6698dae5298e29.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
EXT4 supports bigalloc feature which allows the FS to work in size of
clusters (group of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create FS using -
mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
adjust ext4's atomic write unit max value to cluster size. This can then support
atomic write of size anywhere between [blocksize, clustersize]. This
patch adds the required changes to enable multi-fsblock atomic write
support using bigalloc in the next patch.
In this patch for block allocation:
we first query the underlying region of the requested range by calling
ext4_map_blocks() call. Here are the various cases which we then handle
depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
then we don't call ext4_map_blocks() to allocate anything. We don't need to
even start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
extent into 2 unwritten extent of required size.
5. If the underlying region has any type of mixed mapping, then we call
ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
within the requested range. This then provide a single mapped extent type
mapping for the requested range.
Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.
Note having a single contiguous underlying region of type mapped,
unwrittn or hole is not a problem. But the reason to avoid writing on
top of mixed mapping region is because, atomic writes requires all or
nothing should get written for the userspace pwritev2 request. So if at
any point in time during the write if a crash or a sudden poweroff
occurs, the region undergoing atomic write should read either complete
old data or complete new data. But it should never have a mix of both
old and new data.
So, we first convert any mixed mapping region to a single contiguous
mapped extent before any data gets written to it. This is because
normally FS will only convert unwritten extents to written at the end of
the write in ->end_io() call. And if we allow the writes over a mixed
mapping and if a sudden power off happens in between, we will end up
reading mix of new data (over mapped extents) and old data (over
unwritten extents), because unwritten to written conversion never went
through.
So to avoid this and to avoid writes getting torned due to mixed
mapping, we first allocate a single contiguous block mapping and then
do the write.
Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
There can be a case where there are contiguous extents on the adjacent
leaf nodes of on-disk extent trees. So when someone tries to write to
this contiguous range, ext4_map_blocks() call will split by returning
1 extent at a time if this is not already cached in extent_status tree
cache (where if these extents when cached can get merged since they are
contiguous).
This is fine for a normal write however in case of atomic writes, it
can't afford to break the write into two. Now this is also something
that will only happen in the slow write case where we call
ext4_map_blocks() for each of these extents spread across different leaf
nodes. However, there is no guarantee that these extent status cache
cannot be reclaimed before the last call to ext4_map_blocks() in
ext4_map_blocks_atomic_write_slow().
Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS.
This flag checks if the requested range can be fully found in extent
status cache and return. If not, it looks up in on-disk extent
tree via ext4_map_query_blocks(). If the found extent is the last entry
in the leaf node, then it goes and queries the next lblk to see if there
is an adjacent contiguous extent in the adjacent leaf node of the
on-disk extent tree.
Even though there can be a case where there are multiple adjacent extent
entries spread across multiple leaf nodes. But we only read an adjacent
leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes.
The reason for this is that we are mostly only going to support atomic
writes with upto 64KB or maybe max upto 1MB of atomic write support.
Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Let's make ext4_meta_trans_blocks() non-static for use in later
functions during ->end_io conversion for atomic writes.
We will need this function to estimate journal credits for a special
case. Instead of adding another wrapper around it, let's make this
non-static.
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/23ce80d4286f792831ce99d13558182ee228fedb.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
EXT4 only supports doing atomic write on inodes which uses extents, so
add a check in ext4_inode_can_atomic_write() which gets called during
open.
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/86bb502c979398a736ab371d8f35f6866a477f6c.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before
calling ext4_iomap_begin(). Document this above ext4_map_blocks() call
as it is easy to miss it when focusing on write paths alone.
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/fd50ba05440042dff77d555e463a620a79f8d0e9.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since jbd2_superblock_csum() no longer uses its journal_t argument,
remove it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-5-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since jbd2_chksum() no longer uses its journal_t argument, remove it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-4-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since ext4_superblock_csum() no longer uses its sb argument, remove it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since ext4_chksum() no longer uses its sbi argument, remove it.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports
large folios for regular files. Enable this feature by default. However,
since we cannot change the folio order limitation of mappings on active
inodes, setting the journal=data mode via ioctl on an active inode will
not take immediate effect in non-delalloc mode.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
move_extent_per_page() currently assumes that each folio is the size of
PAGE_SIZE and only copies data for one page. ext4_move_extents() should
call move_extent_per_page() for each page. To support larger folios,
simply modify the calculations for the block start and end offsets
within the folio based on the provided range of 'data_offset_in_page'
and 'block_len_in_page'. This function will continue to handle PAGE_SIZE
of data at a time and will not convert this function to manage an entire
folio. Additionally, we use the source folio to copy data, so it doesn't
matter if the source and dest folios are different in size.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In mpage_map_and_submit_buffers(), the 'lblk' is now aligned to
PAGE_SIZE. Convert it to be aligned to folio size. Additionally, modify
the wbc->nr_to_write update to reduce the number of pages in a single
folio, ensuring that the entire writeback path can support large folios.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The journal credits calculation in ext4_ext_index_trans_blocks() is
currently inadequate. It only multiplies the depth of the extents tree
and doesn't account for the blocks that may be required for adding the
leaf extents themselves.
After enabling large folios, we can easily run out of handle credits,
triggering a warning in jbd2_journal_dirty_metadata() on filesystems
with a 1KB block size. This occurs because we may need more extents when
iterating through each large folio in
ext4_do_writepages()->mpage_map_and_submit_extent(). Therefore, we
should modify ext4_ext_index_trans_blocks() to include a count of the
leaf extents in the worst case as well.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
jbd2_journal_blocks_per_page() returns the number of blocks in a single
page. Rename it to jbd2_journal_blocks_per_folio() and make it returns
the number of blocks in the largest folio, preparing for the calculation
of journal credits blocks when allocating blocks within a large folio in
the writeback path.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The partial block zero range helper __ext4_block_zero_page_range()
currently only supports folios of PAGE_SIZE in size. The calculations
for the start block and the offset within a folio for the given range
are incorrect. Modify the implementation to use offset_in_folio()
instead of directly masking PAGE_SIZE - 1, which will be able to support
for large folios.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The current buffered write path in ext4 can only allocate and handle
folios of PAGE_SIZE size. To support larger folios, modify
ext4_da_write_begin() and ext4_write_begin() to allocate higher-order
folios, and trim the write length if it exceeds the folio size.
Additionally, in ext4_da_do_write_end(), use offset_in_folio() instead
of PAGE_SIZE.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_mpage_readpages() currently assumes that each folio is the size of
PAGE_SIZE. Modify it to atomically calculate the number of blocks per
folio and iterate through the blocks in each folio, which would allow
for support of larger folios.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The inode i_size cannot be larger than maxbytes, check it while loading
inode from the disk.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
There are several locations that get the correct maxbytes value based on
the inode's block type. It would be beneficial to extract a common
helper function to make the code more clear.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
For the extents based inodes, the maxbytes should be sb->s_maxbytes
instead of sbi->s_bitmap_maxbytes. Additionally, for the calculation of
max_end, the -sb->s_blocksize operation is necessary only for
indirect-block based inodes. Correct the maxbytes and max_end value to
correct the behavior of punch hole.
Fixes: 2da376228a24 ("ext4: limit length to bitmap_maxbytes - blocksize in punch_hole")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Punching a hole with a start offset that exceeds max_end is not
permitted and will result in a negative length in the
truncate_inode_partial_folio() function while truncating the page cache,
potentially leading to undesirable consequences.
A simple reproducer:
truncate -s 9895604649994 /mnt/foo
xfs_io -c "pwrite 8796093022208 4096" /mnt/foo
xfs_io -c "fpunch 8796093022213 25769803777" /mnt/foo
kernel BUG at include/linux/highmem.h:275!
Oops: invalid opcode: 0000 [#1] SMP PTI
CPU: 3 UID: 0 PID: 710 Comm: xfs_io Not tainted 6.15.0-rc3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:zero_user_segments.constprop.0+0xd7/0x110
RSP: 0018:ffffc90001cf3b38 EFLAGS: 00010287
RAX: 0000000000000005 RBX: ffffea0001485e40 RCX: 0000000000001000
RDX: 000000000040b000 RSI: 0000000000000005 RDI: 000000000040b000
RBP: 000000000040affb R08: ffff888000000000 R09: ffffea0000000000
R10: 0000000000000003 R11: 00000000fffc7fc5 R12: 0000000000000005
R13: 000000000040affb R14: ffffea0001485e40 R15: ffff888031cd3000
FS: 00007f4f63d0b780(0000) GS:ffff8880d337d000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000001ae0b038 CR3: 00000000536aa000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
truncate_inode_partial_folio+0x3dd/0x620
truncate_inode_pages_range+0x226/0x720
? bdev_getblk+0x52/0x3e0
? ext4_get_group_desc+0x78/0x150
? crc32c_arch+0xfd/0x180
? __ext4_get_inode_loc+0x18c/0x840
? ext4_inode_csum+0x117/0x160
? jbd2_journal_dirty_metadata+0x61/0x390
? __ext4_handle_dirty_metadata+0xa0/0x2b0
? kmem_cache_free+0x90/0x5a0
? jbd2_journal_stop+0x1d5/0x550
? __ext4_journal_stop+0x49/0x100
truncate_pagecache_range+0x50/0x80
ext4_truncate_page_cache_block_range+0x57/0x3a0
ext4_punch_hole+0x1fe/0x670
ext4_fallocate+0x792/0x17d0
? __count_memcg_events+0x175/0x2a0
vfs_fallocate+0x121/0x560
ksys_fallocate+0x51/0xc0
__x64_sys_fallocate+0x24/0x40
x64_sys_call+0x18d2/0x4170
do_syscall_64+0xa7/0x220
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix this by filtering out cases where the punching start offset exceeds
max_end.
Fixes: 982bf37da09d ("ext4: refactor ext4_punch_hole()")
Reported-by: Liebes Wang <wanghaichi0403@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/ac3a58f6-e686-488b-a9ee-fc041024e43d@huawei.com/
Tested-by: Liebes Wang <wanghaichi0403@gmail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Since handle->h_transaction may be a NULL pointer, so we should change it
to call is_handle_aborted(handle) first before dereferencing it.
And the following data-race was reported in my fuzzer:
==================================================================
BUG: KCSAN: data-race in jbd2_journal_dirty_metadata / jbd2_journal_dirty_metadata
write to 0xffff888011024104 of 4 bytes by task 10881 on cpu 1:
jbd2_journal_dirty_metadata+0x2a5/0x770 fs/jbd2/transaction.c:1556
__ext4_handle_dirty_metadata+0xe7/0x4b0 fs/ext4/ext4_jbd2.c:358
ext4_do_update_inode fs/ext4/inode.c:5220 [inline]
ext4_mark_iloc_dirty+0x32c/0xd50 fs/ext4/inode.c:5869
__ext4_mark_inode_dirty+0xe1/0x450 fs/ext4/inode.c:6074
ext4_dirty_inode+0x98/0xc0 fs/ext4/inode.c:6103
....
read to 0xffff888011024104 of 4 bytes by task 10880 on cpu 0:
jbd2_journal_dirty_metadata+0xf2/0x770 fs/jbd2/transaction.c:1512
__ext4_handle_dirty_metadata+0xe7/0x4b0 fs/ext4/ext4_jbd2.c:358
ext4_do_update_inode fs/ext4/inode.c:5220 [inline]
ext4_mark_iloc_dirty+0x32c/0xd50 fs/ext4/inode.c:5869
__ext4_mark_inode_dirty+0xe1/0x450 fs/ext4/inode.c:6074
ext4_dirty_inode+0x98/0xc0 fs/ext4/inode.c:6103
....
value changed: 0x00000000 -> 0x00000001
==================================================================
This issue is caused by missing data-race annotation for jh->b_modified.
Therefore, the missing annotation needs to be added.
Reported-by: syzbot+de24c3fe3c4091051710@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=de24c3fe3c4091051710
Fixes: 6e06ae88edae ("jbd2: speedup jbd2_journal_dirty_metadata()")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250514130855.99010-1-aha310510@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Use writeback_iter directly instead of write_cache_pages for a nicer
code structure and less indirect calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250505091604.3449879-1-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Luis and David are reporting that after running generic/750 test for 90+
hours on 2k ext4 filesystem, they are able to trigger a warning in
jbd2_journal_dirty_metadata() complaining that there are not enough
credits in the running transaction started in ext4_do_writepages().
Indeed the code in ext4_do_writepages() is racy and the extent tree can
change between the time we compute credits necessary for extent tree
computation and the time we actually modify the extent tree. Thus it may
happen that the number of credits actually needed is higher. Modify
ext4_ext_index_trans_blocks() to count with the worst case of maximum
tree depth. This can reduce the possible number of writers that can
operate in the system in parallel (because the credit estimates now won't
fit in one transaction) but for reasonably sized journals this shouldn't
really be an issue. So just go with a safe and simple fix.
Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
Reported-by: Davidlohr Bueso <dave@stgolabs.net>
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: kdevops@lists.linux.dev
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250429175535.23125-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
If there is no stream data in file, v_len is zero.
So, If position(*pos) is zero, stream write will fail
due to stream write position validation check.
This patch reorganize stream write position validation.
Fixes: 0ca6df4f40cf ("ksmbd: prevent out-of-bounds stream writes by validating *pos")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Multiple pointers in struct cifs_search_info (ntwrk_buf_start,
srch_entries_start, and last_entry) point to the same allocated buffer.
However, when freeing this buffer, only ntwrk_buf_start was set to NULL,
while the other pointers remained pointing to freed memory.
This is defensive programming to prevent potential issues with stale
pointers. While the active UAF vulnerability is fixed by the previous
patch, this change ensures consistent pointer state and more robust error
handling.
Signed-off-by: Wang Zhaolong <wangzhaolong1@huawei.com>
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
An unprivileged user is allowed to create an fanotify group and add
inode marks, but not filesystem, mntns and mount marks.
Add limited support for setting up filesystem, mntns and mount marks by
an unprivileged user under the following conditions:
1. User has CAP_SYS_ADMIN in the user ns where the group was created
2.a. User has CAP_SYS_ADMIN in the user ns where the sb was created
OR (in case setting up a mntns mark)
2.b. User has CAP_SYS_ADMIN in the user ns associated with the mntns
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250516192803.838659-3-amir73il@gmail.com
|
|
FAN_UNLIMITED_QUEUE and FAN_UNLIMITED_MARK flags are already checked
as part of the CAP_SYS_ADMIN check for any FANOTIFY_ADMIN_INIT_FLAGS.
Remove the individual CAP_SYS_ADMIN checks for these flags.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250516192803.838659-2-amir73il@gmail.com
|
|
The nfs inodes for referral anchors that have not yet been followed have
their filehandles zeroed out.
Attempting to call getxattr() on one of these will cause the nfs client
to send a GETATTR to the nfs server with the preceding PUTFH sans
filehandle. The server will reply NFS4ERR_NOFILEHANDLE, leading to -EIO
being returned to the application.
For example:
$ strace -e trace=getxattr getfattr -n system.nfs4_acl /mnt/t/ref
getxattr("/mnt/t/ref", "system.nfs4_acl", NULL, 0) = -1 EIO (Input/output error)
/mnt/t/ref: system.nfs4_acl: Input/output error
+++ exited with 1 +++
Have the xattr handlers return -ENODATA instead.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
These are long-held references to the netns, so make sure the refcount
tracker is aware of them.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Don't dirty the whole folio - fixes write amplification with
applications doing mmaped writes.
https://www.reddit.com/r/bcachefs/comments/1klzcg1/incredible_amounts_of_write_amplification_when/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Remove all the code related to changing compression on the fly because
it's not safe and not maintainable.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
This wasn't checking indirect extents.
Fixes: https://github.com/koverstreet/bcachefs/issues/887
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
There is a race condition in the readdir concurrency process, which may
access the rsp buffer after it has been released, triggering the
following KASAN warning.
==================================================================
BUG: KASAN: slab-use-after-free in cifs_fill_dirent+0xb03/0xb60 [cifs]
Read of size 4 at addr ffff8880099b819c by task a.out/342975
CPU: 2 UID: 0 PID: 342975 Comm: a.out Not tainted 6.15.0-rc6+ #240 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_report+0xce/0x640
kasan_report+0xb8/0xf0
cifs_fill_dirent+0xb03/0xb60 [cifs]
cifs_readdir+0x12cb/0x3190 [cifs]
iterate_dir+0x1a1/0x520
__x64_sys_getdents+0x134/0x220
do_syscall_64+0x4b/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f996f64b9f9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 0d f7 c3 0c 00 f7 d8 64 89 8
RSP: 002b:00007f996f53de78 EFLAGS: 00000207 ORIG_RAX: 000000000000004e
RAX: ffffffffffffffda RBX: 00007f996f53ecdc RCX: 00007f996f64b9f9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007f996f53dea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000207 R12: ffffffffffffff88
R13: 0000000000000000 R14: 00007ffc8cd9a500 R15: 00007f996f51e000
</TASK>
Allocated by task 408:
kasan_save_stack+0x20/0x40
kasan_save_track+0x14/0x30
__kasan_slab_alloc+0x6e/0x70
kmem_cache_alloc_noprof+0x117/0x3d0
mempool_alloc_noprof+0xf2/0x2c0
cifs_buf_get+0x36/0x80 [cifs]
allocate_buffers+0x1d2/0x330 [cifs]
cifs_demultiplex_thread+0x22b/0x2690 [cifs]
kthread+0x394/0x720
ret_from_fork+0x34/0x70
ret_from_fork_asm+0x1a/0x30
Freed by task 342979:
kasan_save_stack+0x20/0x40
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x37/0x50
kmem_cache_free+0x2b8/0x500
cifs_buf_release+0x3c/0x70 [cifs]
cifs_readdir+0x1c97/0x3190 [cifs]
iterate_dir+0x1a1/0x520
__x64_sys_getdents64+0x134/0x220
do_syscall_64+0x4b/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The buggy address belongs to the object at ffff8880099b8000
which belongs to the cache cifs_request of size 16588
The buggy address is located 412 bytes inside of
freed 16588-byte region [ffff8880099b8000, ffff8880099bc0cc)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x99b8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0x80000000000040(head|node=0|zone=1)
page_type: f5(slab)
raw: 0080000000000040 ffff888001e03400 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000010001 00000000f5000000 0000000000000000
head: 0080000000000040 ffff888001e03400 0000000000000000 dead000000000001
head: 0000000000000000 0000000000010001 00000000f5000000 0000000000000000
head: 0080000000000003 ffffea0000266e01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880099b8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880099b8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880099b8180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880099b8200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880099b8280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
POC is available in the link [1].
The problem triggering process is as follows:
Process 1 Process 2
-----------------------------------------------------------------
cifs_readdir
/* file->private_data == NULL */
initiate_cifs_search
cifsFile = kzalloc(sizeof(struct cifsFileInfo), GFP_KERNEL);
smb2_query_dir_first ->query_dir_first()
SMB2_query_directory
SMB2_query_directory_init
cifs_send_recv
smb2_parse_query_directory
srch_inf->ntwrk_buf_start = (char *)rsp;
srch_inf->srch_entries_start = (char *)rsp + ...
srch_inf->last_entry = (char *)rsp + ...
srch_inf->smallBuf = true;
find_cifs_entry
/* if (cfile->srch_inf.ntwrk_buf_start) */
cifs_small_buf_release(cfile->srch_inf // free
cifs_readdir ->iterate_shared()
/* file->private_data != NULL */
find_cifs_entry
/* in while (...) loop */
smb2_query_dir_next ->query_dir_next()
SMB2_query_directory
SMB2_query_directory_init
cifs_send_recv
compound_send_recv
smb_send_rqst
__smb_send_rqst
rc = -ERESTARTSYS;
/* if (fatal_signal_pending()) */
goto out;
return rc
/* if (cfile->srch_inf.last_entry) */
cifs_save_resume_key()
cifs_fill_dirent // UAF
/* if (rc) */
return -ENOENT;
Fix this by ensuring the return code is checked before using pointers
from the srch_inf.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220131 [1]
Fixes: a364bc0b37f1 ("[CIFS] fix saving of resume key before CIFSFindNext")
Cc: stable@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Wang Zhaolong <wangzhaolong1@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
btree_key_cache_fill() will allocate and traverse another path (for the
underlying btree), so we can't hold pointers to paths across a call - we
have to pass indices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Commit 925baeddc5b0 ("Btrfs: Start btree concurrency work.") added the
comment for the field keep_locks. This got moved later but without the
comment, so move it to the right place and fix the comment style.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull smb client fixes from Steve French:
- Fix memory leak in mkdir error path
- Fix max rsize miscalculation after channel reconnect
* tag '6.15-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix zero rsize error messages
smb: client: fix memory leak during error handling for POSIX mkdir
|