summaryrefslogtreecommitdiff
path: root/fs/ext4/extents.c
AgeCommit message (Collapse)Author
37 hoursMerge tag 'ext4_for_linus_6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major ext4 changes for 6.17: - Better scalability for ext4 block allocation - Fix insufficient credits when writing back large folios Miscellaneous bug fixes, especially when handling exteded attriutes, inline data, and fast commit" * tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits) ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr ext4: implement linear-like traversal across order xarrays ext4: refactor choose group to scan group ext4: convert free groups order lists to xarrays ext4: factor out ext4_mb_scan_group() ext4: factor out ext4_mb_might_prefetch() ext4: factor out __ext4_mb_scan_group() ext4: fix largest free orders lists corruption on mb_optimize_scan switch ext4: fix zombie groups in average fragment size lists ext4: merge freed extent with existing extents before insertion ext4: convert sbi->s_mb_free_pending to atomic_t ext4: fix typo in CR_GOAL_LEN_SLOW comment ext4: get rid of some obsolete EXT4_MB_HINT flags ext4: utilize multiple global goals to reduce contention ext4: remove unnecessary s_md_lock on update s_mb_last_group ext4: remove unnecessary s_mb_last_start ext4: separate stream goal hits from s_bal_goals for better tracking ext4: add ext4_try_lock_group() to skip busy groups ext4: initialize superblock fields in the kballoc-test.c kunit tests ext4: refactor the inline directory conversion and new directory codepaths ...
2025-07-13ext4: replace ext4_writepage_trans_blocks()Zhang Yi
After ext4 supports large folios, the semantics of reserving credits in pages is no longer applicable. In most scenarios, reserving credits in extents is sufficient. Therefore, introduce ext4_chunk_trans_extent() to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the only remaining location where we are still processing extents in pages. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-06-23ext4: add FALLOC_FL_WRITE_ZEROES supportZhang Yi
Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enable the unmap write zeroes operation. This first allocates blocks as unwritten, then issues a zero command outside of the running journal handle, and finally converts them to a written state. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-10-yi.zhang@huaweicloud.com Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-20ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF insteadRitesh Harjani (IBM)
We added the documentation in ext4_map_blocks() for usage of EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF flag. But It's better to add a WARN_ON_ONCE in case if anyone tries using this flag with CREATE to avoid a random issue later. Since depth can change with CREATE and it needs to be re-calculated before using it in there. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/ee6e82a224c50b432df9ce1ce3333c50182d8473.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20ext4: Unwritten to written conversion requires EXT4_EX_NOCACHERitesh Harjani (IBM)
This fixes the atomic write patch series after it was rebased on top of extent status cache cleanup series i.e. 'commit 402e38e6b71f57 ("ext4: prevent stale extent cache entries caused by concurrent I/O writeback")' After the above series, EXT4_GET_BLOCKS_IO_CONVERT_EXT flag which has EXT4_GET_BLOCKS_IO_SUBMIT flag set, requires that the io submit context of any kind should pass EXT4_EX_NOCACHE to avoid caching unncecessary extents in the extent status cache. This patch fixes that by adding the EXT4_EX_NOCACHE flag in ext4_convert_unwritten_extents_atomic() for unwritten to written conversion calls to ext4_map_blocks(). Fixes: b86629c2b299 ("ext4: Add multi-fsblock atomic write support with bigalloc") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/ea0ad9378ff6d31d73f4e53f87548e3a20817689.1747677758.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20ext4: Add multi-fsblock atomic write support with bigallocRitesh Harjani (IBM)
EXT4 supports bigalloc feature which allows the FS to work in size of clusters (group of blocks) rather than individual blocks. This patch adds atomic write support for bigalloc so that systems with bs = ps can also create FS using - mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to adjust ext4's atomic write unit max value to cluster size. This can then support atomic write of size anywhere between [blocksize, clustersize]. This patch adds the required changes to enable multi-fsblock atomic write support using bigalloc in the next patch. In this patch for block allocation: we first query the underlying region of the requested range by calling ext4_map_blocks() call. Here are the various cases which we then handle depending upon the underlying mapping type: 1. If the underlying region for the entire requested range is a mapped extent, then we don't call ext4_map_blocks() to allocate anything. We don't need to even start the jbd2 txn in this case. 2. For an append write case, we create a mapped extent. 3. If the underlying region is entirely a hole, then we create an unwritten extent for the requested range. 4. If the underlying region is a large unwritten extent, then we split the extent into 2 unwritten extent of required size. 5. If the underlying region has any type of mixed mapping, then we call ext4_map_blocks() in a loop to zero out the unwritten and the hole regions within the requested range. This then provide a single mapped extent type mapping for the requested range. Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping of the requested range is not entirely a hole, an unwritten extent, or a fully mapped extent. That is, if the underlying region contains a mix of hole(s), unwritten extent(s), and mapped extent(s), we use this loop to ensure that all the short mappings are zeroed out. This guarantees that the entire requested range becomes a single, uniformly mapped extent. It is ok to do so because we know this is being done on a bigalloc enabled filesystem where the block bitmap represents the entire cluster unit. Note having a single contiguous underlying region of type mapped, unwrittn or hole is not a problem. But the reason to avoid writing on top of mixed mapping region is because, atomic writes requires all or nothing should get written for the userspace pwritev2 request. So if at any point in time during the write if a crash or a sudden poweroff occurs, the region undergoing atomic write should read either complete old data or complete new data. But it should never have a mix of both old and new data. So, we first convert any mixed mapping region to a single contiguous mapped extent before any data gets written to it. This is because normally FS will only convert unwritten extents to written at the end of the write in ->end_io() call. And if we allow the writes over a mixed mapping and if a sudden power off happens in between, we will end up reading mix of new data (over mapped extents) and old data (over unwritten extents), because unwritten to written conversion never went through. So to avoid this and to avoid writes getting torned due to mixed mapping, we first allocate a single contiguous block mapping and then do the write. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKSRitesh Harjani (IBM)
There can be a case where there are contiguous extents on the adjacent leaf nodes of on-disk extent trees. So when someone tries to write to this contiguous range, ext4_map_blocks() call will split by returning 1 extent at a time if this is not already cached in extent_status tree cache (where if these extents when cached can get merged since they are contiguous). This is fine for a normal write however in case of atomic writes, it can't afford to break the write into two. Now this is also something that will only happen in the slow write case where we call ext4_map_blocks() for each of these extents spread across different leaf nodes. However, there is no guarantee that these extent status cache cannot be reclaimed before the last call to ext4_map_blocks() in ext4_map_blocks_atomic_write_slow(). Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS. This flag checks if the requested range can be fully found in extent status cache and return. If not, it looks up in on-disk extent tree via ext4_map_query_blocks(). If the found extent is the last entry in the leaf node, then it goes and queries the next lblk to see if there is an adjacent contiguous extent in the adjacent leaf node of the on-disk extent tree. Even though there can be a case where there are multiple adjacent extent entries spread across multiple leaf nodes. But we only read an adjacent leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes. The reason for this is that we are mostly only going to support atomic writes with upto 64KB or maybe max upto 1MB of atomic write support. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20ext4: remove sbi argument from ext4_chksum()Eric Biggers
Since ext4_chksum() no longer uses its sbi argument, remove it. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20ext4: correct the journal credits calculations of allocating blocksZhang Yi
The journal credits calculation in ext4_ext_index_trans_blocks() is currently inadequate. It only multiplies the depth of the extents tree and doesn't account for the blocks that may be required for adding the leaf extents themselves. After enabling large folios, we can easily run out of handle credits, triggering a warning in jbd2_journal_dirty_metadata() on filesystems with a 1KB block size. This occurs because we may need more extents when iterating through each large folio in ext4_do_writepages()->mpage_map_and_submit_extent(). Therefore, we should modify ext4_ext_index_trans_blocks() to include a count of the leaf extents in the worst case as well. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250512063319.3539411-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20ext4: factor out ext4_get_maxbytes()Zhang Yi
There are several locations that get the correct maxbytes value based on the inode's block type. It would be beneficial to extract a common helper function to make the code more clear. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250506012009.3896990-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-05-20ext4: fix calculation of credits for extent tree modificationJan Kara
Luis and David are reporting that after running generic/750 test for 90+ hours on 2k ext4 filesystem, they are able to trigger a warning in jbd2_journal_dirty_metadata() complaining that there are not enough credits in the running transaction started in ext4_do_writepages(). Indeed the code in ext4_do_writepages() is racy and the extent tree can change between the time we compute credits necessary for extent tree computation and the time we actually modify the extent tree. Thus it may happen that the number of credits actually needed is higher. Modify ext4_ext_index_trans_blocks() to count with the worst case of maximum tree depth. This can reduce the possible number of writers that can operate in the system in parallel (because the credit estimates now won't fit in one transaction) but for reasonably sized journals this shouldn't really be an issue. So just go with a safe and simple fix. Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld Reported-by: Davidlohr Bueso <dave@stgolabs.net> Reported-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops@lists.linux.dev Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250429175535.23125-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-05-15ext4: check env when mapping and modifying extentsZhang Yi
Add ext4_check_map_extents_env() to the places where loading extents, mapping blocks, removing blocks, and modifying extents, excluding the I/O writeback context. This function will verify whether the locking mechanisms in place are adequate. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: prevent stale extent cache entries caused by concurrent get es_cacheZhang Yi
The EXT4_IOC_GET_ES_CACHE and EXT4_IOC_PRECACHE_EXTENTS currently invokes ext4_ext_precache() to preload the extent cache without holding the inode's i_rwsem. This can result in stale extent cache entries when competing with operations such as ext4_collapse_range() which calls ext4_ext_remove_space() or ext4_ext_shift_extents(). The problem arises when ext4_ext_remove_space() temporarily releases i_data_sem due to insufficient journal credits. During this interval, a concurrent EXT4_IOC_GET_ES_CACHE or EXT4_IOC_PRECACHE_EXTENTS may cache extent entries that are about to be deleted. As a result, these cached entries become stale and inconsistent with the actual extents. Loading the extents cache without holding the inode's i_rwsem or the mapping's invalidate_lock is not permitted besides during the writeback. Fix this by holding the i_rwsem during EXT4_IOC_GET_ES_CACHE and EXT4_IOC_PRECACHE_EXTENTS. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: prevent stale extent cache entries caused by concurrent fiemapZhang Yi
The ext4_fiemap() currently invokes ext4_ext_precache() and iomap_fiemap() to preload the extent cache and query mapping information without holding the inode's i_rwsem. This can result in stale extent cache entries when competing with operations such as ext4_collapse_range() which calls ext4_ext_remove_space() or ext4_ext_shift_extents(). The problem arises when ext4_ext_remove_space() temporarily releases i_data_sem due to insufficient journal credits. During this interval, a concurrent ext4_fiemap() may cache extent entries that are about to be deleted. As a result, these cached entries become stale and inconsistent with the actual extents. Loading the extents cache without holding the inode's i_rwsem or the mapping's invalidate_lock is not permitted besides during the writeback. Fix this by holding the i_rwsem in ext4_fiemap(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: prevent stale extent cache entries caused by concurrent I/O writebackZhang Yi
Currently, in the I/O writeback path, ext4_map_blocks() may attempt to cache additional unrelated extents in the extent status tree without holding the inode's i_rwsem and the mapping's invalidate_lock. This can lead to stale extent status entries remaining in certain scenarios, potentially causing data corruption. For example, when performing a collapse range in ext4_collapse_range(), it clears the extent cache and dirty pages before removing blocks and shifting extents. It also holds the i_data_sem during these two operations. However, both ext4_ext_remove_space() and ext4_ext_shift_extents() may briefly release the i_data_sem if journal credits are insufficient (ext4_datasem_ensure_credits()). If another writeback process writes dirty pages from other regions during this interval, it may cache extents that are about to be modified. Unless ext4_collapse_range() explicitly clears the extent cache again, these cached entries can become stale and inconsistent with the actual extents. 0 a n b c m | | | | | | [www][wwwwww][wwwwwwww]...[wwwww][wwww]... | | N M Assume that block a is dirty. The collapse range operation is removing data from n to m and drops i_data_sem immediately after removing the extent from b to c. At the same time, a concurrent writeback begins to write back block a; it will reloads the extent from [n, b) into the extent status tree since it does not hold the i_rwsem or the invalidate_lock. After the collapse range operation, it left the stale extent [n, b), which points logical block n to N, but the actual physical block of n should be M. Similarly, both ext4_insert_range() and ext4_truncate() have the same problem. ext4_punch_hole() survived since it re-add a hole extent entry after removing space since commit 9f1118223aa0 ("ext4: add a hole extent entry in cache after punch"). In most cases, during dirty page writeback, the block mapping information is likely to be found in the extent cache, making it less necessary to search for physical extents. Consequently, loading unrelated extent caches during writeback appears to be ineffective. Therefore, fix this by adds EXT4_EX_NOCACHE in the writeback path to prevent caching of unrelated extents, eliminating this potential source of corruption. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: ext4: unify EXT4_EX_NOCACHE|NOFAIL flags in ext4_ext_remove_space()Zhang Yi
When removing space, we should use EXT4_EX_NOCACHE because we don't need to cache extents, and we should also use EXT4_EX_NOFAIL to prevent metadata inconsistencies that may arise from memory allocation failures. While ext4_ext_remove_space() already uses these two flags in most places, they are missing in ext4_ext_search_right() and read_extent_tree_block() calls. Unify the flags to ensure consistent behavior throughout the extent removal process. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-21ext4: correct the error handle in ext4_fallocate()Zhang Yi
The error out label of file_modified() should be out_inode_lock in ext4_fallocate(). Fixes: 2890e5e0f49e ("ext4: move out common parts into ext4_fallocate()") Reported-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20250319023557.2785018-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-17ext4: remove redundant function ext4_has_metadata_csumEric Biggers
Since commit f2b4fa19647e ("ext4: switch to using the crc32c library"), ext4_has_metadata_csum() is just an alias for ext4_has_feature_metadata_csum(). ext4_has_feature_metadata_csum() is generated by EXT4_FEATURE_RO_COMPAT_FUNCS and uses the regular naming convention for checking a single ext4 feature. Therefore, remove ext4_has_metadata_csum() and update all its callers to use ext4_has_feature_metadata_csum() directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250207031335.42637-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: move out common parts into ext4_fallocate()Zhang Yi
Currently, all zeroing ranges, punch holes, collapse ranges, and insert ranges first wait for all existing direct I/O workers to complete, and then they acquire the mapping's invalidate lock before performing the actual work. These common components are nearly identical, so we can simplify the code by factoring them out into the ext4_fallocate(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: move out inode_lock into ext4_fallocate()Zhang Yi
Currently, all five sub-functions of ext4_fallocate() acquire the inode's i_rwsem at the beginning and release it before exiting. This process can be simplified by factoring out the management of i_rwsem into the ext4_fallocate() function. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: factor out ext4_do_fallocate()Zhang Yi
Now the real job of normal fallocate are open coded in ext4_fallocate(), factor out a new helper ext4_do_fallocate() to do the real job, like others functions (e.g. ext4_zero_range()) in ext4_fallocate() do, this can make the code more clear, no functional changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: refactor ext4_insert_range()Zhang Yi
Simplify ext4_insert_range() and align its code style with that of ext4_collapse_range(). Refactor it by: a) renaming variables, b) removing redundant input parameter checks and moving the remaining checks under i_rwsem in preparation for future refactoring, and c) renaming the three stale error tags. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: refactor ext4_collapse_range()Zhang Yi
Simplify ext4_collapse_range() and align its code style with that of ext4_zero_range() and ext4_punch_hole(). Refactor it by: a) renaming variables, b) removing redundant input parameter checks and moving the remaining checks under i_rwsem in preparation for future refactoring, and c) renaming the three stale error tags. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: refactor ext4_zero_range()Zhang Yi
The current implementation of ext4_zero_range() contains complex position calculations and stale error tags. To improve the code's clarity and maintainability, it is essential to clean up the code and improve its readability, this can be achieved by: a) simplifying and renaming variables, making the style the same as ext4_punch_hole(); b) eliminating unnecessary position calculations, writing back all data in data=journal mode, and drop page cache from the original offset to the end, rather than using aligned blocks; c) renaming the stale out_mutex tags. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: don't explicit update times in ext4_fallocate()Zhang Yi
After commit 'ad5cd4f4ee4d ("ext4: fix fallocate to use file_modified to update permissions consistently"), we can update mtime and ctime appropriately through file_modified() when doing zero range, collapse rage, insert range and punch hole, hence there is no need to explicit update times in those paths, just drop them. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-02-10ext4: remove writable userspace mappings before truncating page cacheZhang Yi
When zeroing a range of folios on the filesystem which block size is less than the page size, the file's mapped blocks within one page will be marked as unwritten, we should remove writable userspace mappings to ensure that ext4_page_mkwrite() can be called during subsequent write access to these partial folios. Otherwise, data written by subsequent mmap writes may not be saved to disk. $mkfs.ext4 -b 1024 /dev/vdb $mount /dev/vdb /mnt $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \ -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \ -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo $od -Ax -t x1z /mnt/foo 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 * 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 * 001000 $umount /mnt && mount /dev/vdb /mnt $od -Ax -t x1z /mnt/foo 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 * 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 001000 Fix this by introducing ext4_truncate_page_cache_block_range() to remove writable userspace mappings when truncating a partial folio range. Additionally, move the journal data mode-specific handlers and truncate_pagecache_range() into this function, allowing it to serve as a common helper that correctly manages the page cache in preparation for block range manipulations. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20241220011637.1157197-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: partial zero eof block on unaligned inode size extensionBrian Foster
Using mapped writes, it's technically possible to expose stale post-eof data on a truncate up operation. Consider the following example: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "pread -v 2k 16" <file> ... 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX ... This shows that the post-eof data written via mwrite lands within EOF after a truncate up. While this is deliberate of the test case, behavior is somewhat unpredictable because writeback does post-eof zeroing, and writeback can occur at any time in the background. For example, an fsync inserted between the mwrite and truncate causes the subsequent read to instead return zeroes. This basically means that there is a race window in this situation between any subsequent extending operation and writeback that dictates whether post-eof data is exposed to the file or zeroed. To prevent this problem, perform partial block zeroing as part of the various inode size extending operations that are susceptible to it. For truncate extension, zero around the original eof similar to how truncate down does partial zeroing of the new eof. For extension via writes and fallocate related operations, zero the newly exposed range of the file to cover any partial zeroing that must occur at the original and new eof blocks. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: fix race in buffer_head read fault injectionLong Li
When I enabled ext4 debug for fault injection testing, I encountered the following warning: EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress: Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051 WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0 The root cause of the issue lies in the improper implementation of ext4's buffer_head read fault injection. The actual completion of buffer_head read and the buffer_head fault injection are not atomic, which can lead to the uptodate flag being cleared on normally used buffer_heads in race conditions. [CPU0] [CPU1] [CPU2] ext4_read_inode_bitmap ext4_read_bh() <bh read complete> ext4_read_inode_bitmap if (buffer_uptodate(bh)) return bh jbd2_journal_commit_transaction __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer __jbd2_journal_temp_unlink_buffer ext4_simulate_fail_bh() clear_buffer_uptodate mark_buffer_dirty <report warning> WARN_ON_ONCE(!buffer_uptodate(bh)) The best approach would be to perform fault injection in the IO completion callback function, rather than after IO completion. However, the IO completion callback function cannot get the fault injection code in sb. Fix it by passing the result of fault injection into the bh read function, we simulate faults within the bh read function itself. This requires adding an extra parameter to the bh read functions that need fault injection. Fixes: 46f870d690fe ("ext4: simulate various I/O and checksum errors when reading metadata") Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-11-12ext4: don't pass full mapping flags to ext4_es_insert_extent()Zhang Yi
When converting a delalloc extent in ext4_es_insert_extent(), since we only want to pass the info of whether the quota has already been claimed if the allocation is a direct allocation from ext4_map_create_blocks(), there is no need to pass full mapping flags, so changes to just pass whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: save unnecessary indentation in ext4_ext_create_new_leaf()Baokun Li
Save an indentation level in ext4_ext_create_new_leaf() by removing unnecessary 'else'. Besides, the variable 'ee_block' is declared to avoid line breaks. No functional changes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240822023545.1994557-26-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: make some fast commit functions reuse extents pathBaokun Li
The ext4_find_extent() can update the extent path so that it does not have to allocate and free the path repeatedly, thus reducing the consumption of memory allocation and freeing in the following functions: ext4_ext_clear_bb ext4_ext_replay_set_iblocks ext4_fc_replay_add_range ext4_fc_set_bitmaps_and_counters No functional changes. Note that ext4_find_extent() does not support error pointers, so in this case set path to NULL first. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-25-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: refactor ext4_swap_extents() to reuse extents pathBaokun Li
The ext4_find_extent() can update the extent path so it doesn't have to allocate and free path repeatedly, thus reducing the consumption of memory allocation and freeing in ext4_swap_extents(). Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-24-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in convert_initialized_extent()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in convert_initialized_extent(), the following is done here: * Free the extents path when an error is encountered. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-23-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_ext_handle_unwritten_extents()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_handle_unwritten_extents(), the following is done here: * Free the extents path when an error is encountered. * The 'allocated' is changed from passing a value to passing an address. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-22-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_ext_convert_to_initialized()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_convert_to_initialized(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. * The 'allocated' is changed from passing a value to passing an address. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-21-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_convert_unwritten_extents_endio()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_convert_unwritten_extents_endio(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-20-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_split_convert_extents()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_split_convert_extents(), the following is done here: * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-19-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_split_extent()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_split_extent(), the following is done here: * The 'allocated' is changed from passing a value to passing an address. * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-18-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_force_split_extent_at()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_force_split_extent_at(), the following is done here: * Free the extents path when an error is encountered. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-17-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_split_extent_at()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_split_extent_at(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. * Teach ext4_ext_show_leaf() to skip error pointer. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-16-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_ext_insert_extent()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_insert_extent(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. * Free path when npath is used, free npath when it is not used. * The got_allocated_blocks label in ext4_ext_map_blocks() does not update err now, so err is updated to 0 if the err returned by ext4_ext_search_right() is greater than 0 and is about to enter got_allocated_blocks. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-15-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_ext_create_new_leaf()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_create_new_leaf(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-14-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in get_ext_path()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. After getting rid of ppath in get_ext_path(), its caller may pass an error pointer to ext4_free_ext_path(), so it needs to teach ext4_free_ext_path() and ext4_ext_drop_refs() to skip the error pointer. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-13-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: get rid of ppath in ext4_find_extent()Baokun Li
The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. Getting rid of ppath in ext4_find_extent() requires its caller to update ppath. These ppaths will also be dropped later. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-12-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: propagate errors from ext4_find_extent() in ext4_insert_range()Baokun Li
Even though ext4_find_extent() returns an error, ext4_insert_range() still returns 0. This may confuse the user as to why fallocate returns success, but the contents of the file are not as expected. So propagate the error returned by ext4_find_extent() to avoid inconsistencies. Fixes: 331573febb6a ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-11-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: add new ext4_ext_path_brelse() helperBaokun Li
Add ext4_ext_path_brelse() helper function to reduce duplicate code and ensure that path->p_bh is set to NULL after it is released. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-10-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: fix double brelse() the buffer of the extents pathBaokun Li
In ext4_ext_try_to_merge_up(), set path[1].p_bh to NULL after it has been released, otherwise it may be released twice. An example of what triggers this is as follows: split2 map split1 |--------|-------|--------| ext4_ext_map_blocks ext4_ext_handle_unwritten_extents ext4_split_convert_extents // path->p_depth == 0 ext4_split_extent // 1. do split1 ext4_split_extent_at |ext4_ext_insert_extent | ext4_ext_create_new_leaf | ext4_ext_grow_indepth | le16_add_cpu(&neh->eh_depth, 1) | ext4_find_extent | // return -ENOMEM |// get error and try zeroout |path = ext4_find_extent | path->p_depth = 1 |ext4_ext_try_to_merge | ext4_ext_try_to_merge_up | path->p_depth = 0 | brelse(path[1].p_bh) ---> not set to NULL here |// zeroout success // 2. update path ext4_find_extent // 3. do split2 ext4_split_extent_at ext4_ext_insert_extent ext4_ext_create_new_leaf ext4_ext_grow_indepth le16_add_cpu(&neh->eh_depth, 1) ext4_find_extent path[0].p_bh = NULL; path->p_depth = 1 read_extent_tree_block ---> return err // path[1].p_bh is still the old value ext4_free_ext_path ext4_ext_drop_refs // path->p_depth == 1 brelse(path[1].p_bh) ---> brelse a buffer twice Finally got the following WARRNING when removing the buffer from lru: ============================================ VFS: brelse: Trying to free free buffer WARNING: CPU: 2 PID: 72 at fs/buffer.c:1241 __brelse+0x58/0x90 CPU: 2 PID: 72 Comm: kworker/u19:1 Not tainted 6.9.0-dirty #716 RIP: 0010:__brelse+0x58/0x90 Call Trace: <TASK> __find_get_block+0x6e7/0x810 bdev_getblk+0x2b/0x480 __ext4_get_inode_loc+0x48a/0x1240 ext4_get_inode_loc+0xb2/0x150 ext4_reserve_inode_write+0xb7/0x230 __ext4_mark_inode_dirty+0x144/0x6a0 ext4_ext_insert_extent+0x9c8/0x3230 ext4_ext_map_blocks+0xf45/0x2dc0 ext4_map_blocks+0x724/0x1700 ext4_do_writepages+0x12d6/0x2a70 [...] ============================================ Fixes: ecb94f5fdf4b ("ext4: collapse a single extent tree block into the inode if possible") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-9-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: drop ppath from ext4_ext_replay_update_ex() to avoid double-freeBaokun Li
When calling ext4_force_split_extent_at() in ext4_ext_replay_update_ex(), the 'ppath' is updated but it is the 'path' that is freed, thus potentially triggering a double-free in the following process: ext4_ext_replay_update_ex ppath = path ext4_force_split_extent_at(&ppath) ext4_split_extent_at ext4_ext_insert_extent ext4_ext_create_new_leaf ext4_ext_grow_indepth ext4_find_extent if (depth > path[0].p_maxdepth) kfree(path) ---> path First freed *orig_path = path = NULL ---> null ppath kfree(path) ---> path double-free !!! So drop the unnecessary ppath and use path directly to avoid this problem. And use ext4_find_extent() directly to update path, avoiding unnecessary memory allocation and freeing. Also, propagate the error returned by ext4_find_extent() instead of using strange error codes. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-8-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: aovid use-after-free in ext4_ext_insert_extent()Baokun Li
As Ojaswin mentioned in Link, in ext4_ext_insert_extent(), if the path is reallocated in ext4_ext_create_new_leaf(), we'll use the stale path and cause UAF. Below is a sample trace with dummy values: ext4_ext_insert_extent path = *ppath = 2000 ext4_ext_create_new_leaf(ppath) ext4_find_extent(ppath) path = *ppath = 2000 if (depth > path[0].p_maxdepth) kfree(path = 2000); *ppath = path = NULL; path = kcalloc() = 3000 *ppath = 3000; return path; /* here path is still 2000, UAF! */ eh = path[depth].p_hdr ================================================================== BUG: KASAN: slab-use-after-free in ext4_ext_insert_extent+0x26d4/0x3330 Read of size 8 at addr ffff8881027bf7d0 by task kworker/u36:1/179 CPU: 3 UID: 0 PID: 179 Comm: kworker/u6:1 Not tainted 6.11.0-rc2-dirty #866 Call Trace: <TASK> ext4_ext_insert_extent+0x26d4/0x3330 ext4_ext_map_blocks+0xe22/0x2d40 ext4_map_blocks+0x71e/0x1700 ext4_do_writepages+0x1290/0x2800 [...] Allocated by task 179: ext4_find_extent+0x81c/0x1f70 ext4_ext_map_blocks+0x146/0x2d40 ext4_map_blocks+0x71e/0x1700 ext4_do_writepages+0x1290/0x2800 ext4_writepages+0x26d/0x4e0 do_writepages+0x175/0x700 [...] Freed by task 179: kfree+0xcb/0x240 ext4_find_extent+0x7c0/0x1f70 ext4_ext_insert_extent+0xa26/0x3330 ext4_ext_map_blocks+0xe22/0x2d40 ext4_map_blocks+0x71e/0x1700 ext4_do_writepages+0x1290/0x2800 ext4_writepages+0x26d/0x4e0 do_writepages+0x175/0x700 [...] ================================================================== So use *ppath to update the path to avoid the above problem. Reported-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Closes: https://lore.kernel.org/r/ZqyL6rmtwl6N4MWR@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com Fixes: 10809df84a4d ("ext4: teach ext4_ext_find_extent() to realloc path if necessary") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240822023545.1994557-7-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-09-03ext4: update orig_path in ext4_find_extent()Baokun Li
In ext4_find_extent(), if the path is not big enough, we free it and set *orig_path to NULL. But after reallocating and successfully initializing the path, we don't update *orig_path, in which case the caller gets a valid path but a NULL ppath, and this may cause a NULL pointer dereference or a path memory leak. For example: ext4_split_extent path = *ppath = 2000 ext4_find_extent if (depth > path[0].p_maxdepth) kfree(path = 2000); *orig_path = path = NULL; path = kcalloc() = 3000 ext4_split_extent_at(*ppath = NULL) path = *ppath; ex = path[depth].p_ext; // NULL pointer dereference! ================================================================== BUG: kernel NULL pointer dereference, address: 0000000000000010 CPU: 6 UID: 0 PID: 576 Comm: fsstress Not tainted 6.11.0-rc2-dirty #847 RIP: 0010:ext4_split_extent_at+0x6d/0x560 Call Trace: <TASK> ext4_split_extent.isra.0+0xcb/0x1b0 ext4_ext_convert_to_initialized+0x168/0x6c0 ext4_ext_handle_unwritten_extents+0x325/0x4d0 ext4_ext_map_blocks+0x520/0xdb0 ext4_map_blocks+0x2b0/0x690 ext4_iomap_begin+0x20e/0x2c0 [...] ================================================================== Therefore, *orig_path is updated when the extent lookup succeeds, so that the caller can safely use path or *ppath. Fixes: 10809df84a4d ("ext4: teach ext4_ext_find_extent() to realloc path if necessary") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240822023545.1994557-6-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>