summaryrefslogtreecommitdiff
path: root/fs/ext4/inode.c
AgeCommit message (Collapse)Author
2022-03-16fs: Convert __set_page_dirty_buffers to block_dirty_folioMatthew Wilcox (Oracle)
Convert all callers; mostly this is just changing the aops to point at it, but a few implementations need a little more work. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folioMatthew Wilcox (Oracle)
These filesystems use __set_page_dirty_nobuffers() either directly or with a very thin wrapper; convert them en masse. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15ext4: Convert invalidatepage to invalidate_folioMatthew Wilcox (Oracle)
Extensive changes, but fairly mechanical. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15fs: Remove noop_invalidatepage()Matthew Wilcox (Oracle)
We used to have to use noop_invalidatepage() to prevent block_invalidatepage() from being called, but that behaviour is now gone. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15fs: Turn block_invalidatepage into block_invalidate_folioMatthew Wilcox (Oracle)
Remove special-casing of a NULL invalidatepage, since there is no more block_invalidatepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15ext4: Use folio_invalidate()Matthew Wilcox (Oracle)
Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-12ext4: warn when dirtying page w/o buffers in data=journal modeJan Kara
Recently I've got a report of BUG_ON trigerring during transaction commit in ext4_journalled_writepage_callback() because we've spotted a dirty page without buffers. Add WARN_ON_ONCE to ext4_journalled_set_page_dirty() to catch the problematic condition earlier where we have better chance of understanding which code path is creating dirty data without preparing the page properly. Also update the comment with current information when we are at it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220310101832.5645-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-03ext4: don't BUG if someone dirty pages without asking ext4 firstTheodore Ts'o
[un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. A related race was noted by Jan Kara in 2018[1]; however, more recently instead of it being a very hard-to-hit race, it could be reliably triggered by process_vm_writev(2) which was discovered by Syzbot[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that if some other kernel subsystem dirty pages without properly notifying the file system using page_mkwrite(), ext4 will BUG, while other file systems will not BUG (although data will still be lost). So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com Reported-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
2022-02-08ext4: support direct I/O with fscrypt using blk-cryptoEric Biggers
Encrypted files traditionally haven't supported DIO, due to the need to encrypt/decrypt the data. However, when the encryption is implemented using inline encryption (blk-crypto) instead of the traditional filesystem-layer encryption, it is straightforward to support DIO. Therefore, make ext4 support DIO on files that are using inline encryption. Since ext4 uses iomap for DIO, and fscrypt support was already added to iomap DIO, this just requires two small changes: - Let DIO proceed when supported, by checking fscrypt_dio_supported() instead of assuming that encrypted files never support DIO. - In ext4_iomap_begin(), use fscrypt_limit_io_blocks() to limit the length of the mapping in the rare case where a DUN discontiguity occurs in the middle of an extent. The iomap DIO implementation requires this, since it assumes that it can submit a bio covering (up to) the whole mapping, without checking fscrypt constraints itself. Co-developed-by: Satya Tangirala <satyat@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Link: https://lore.kernel.org/r/20220128233940.79464-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2022-02-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Various bug fixes for ext4 fast commit and inline data handling. Also fix regression introduced as part of moving to the new mount API" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: fs/ext4: fix comments mentioning i_mutex ext4: fix incorrect type issue during replay_del_range jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}() ext4: fix potential NULL pointer dereference in ext4_fill_super() jbd2: refactor wait logic for transaction updates into a common function jbd2: cleanup unused functions declarations from jbd2.h ext4: fix error handling in ext4_fc_record_modified_inode() ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin() ext4: fix error handling in ext4_restore_inline_data() ext4: fast commit may miss file actions ext4: fast commit may not fallback for ineligible commit ext4: modify the logic of ext4_mb_new_blocks_simple ext4: prevent used blocks from being allocated during fast commit replay
2022-02-03fs/ext4: fix comments mentioning i_mutexhongnanli
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix comments still mentioning i_mutex. Signed-off-by: hongnanli <hongnan.li@linux.alibaba.com> Link: https://lore.kernel.org/r/20220121070611.21618-1-hongnan.li@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-02-03ext4: fast commit may not fallback for ineligible commitXin Yin
For the follow scenario: 1. jbd start commit transaction n 2. task A get new handle for transaction n+1 3. task A do some ineligible actions and mark FC_INELIGIBLE 4. jbd complete transaction n and clean FC_INELIGIBLE 5. task A call fsync In this case fast commit will not fallback to full commit and transaction n+1 also not handled by jbd. Make ext4_fc_mark_ineligible() also record transaction tid for latest ineligible case, when call ext4_fc_cleanup() check current transaction tid, if small than latest ineligible tid do not clear the EXT4_MF_FC_INELIGIBLE. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Suggested-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Link: https://lore.kernel.org/r/20220117093655.35160-2-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-01-12Merge tag 'libnvdimm-for-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax and libnvdimm updates from Dan Williams: "The bulk of this is a rework of the dax_operations API after discovering the obstacles it posed to the work-in-progress DAX+reflink support for XFS and other copy-on-write filesystem mechanics. Primarily the need to plumb a block_device through the API to handle partition offsets was a sticking point and Christoph untangled that dependency in addition to other cleanups to make landing the DAX+reflink support easier. The DAX_PMEM_COMPAT option has been around for 4 years and not only are distributions shipping userspace that understand the current configuration API, but some are not even bothering to turn this option on anymore, so it seems a good time to remove it per the deprecation schedule. Recall that this was added after the device-dax subsystem moved from /sys/class/dax to /sys/bus/dax for its sysfs organization. All recent functionality depends on /sys/bus/dax. Some other miscellaneous cleanups and reflink prep patches are included as well. Summary: - Simplify the dax_operations API: - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining and applying a partition offset to all its DAX iomap operations. - Remove wrappers and device-mapper stacked callbacks for ->copy_from_iter() and ->copy_to_iter() in favor of moving block_device relative offset responsibility to the dax_direct_access() caller. - Remove the need for an @bdev in filesystem-DAX infrastructure - Remove unused uio helpers copy_from_iter_flushcache() and copy_mc_to_iter() as only the non-check_copy_size() versions are used for DAX. - Prepare XFS for the pending (next merge window) DAX+reflink support - Remove deprecated DEV_DAX_PMEM_COMPAT support - Cleanup a straggling misuse of the GUID api" * tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits) iomap: Fix error handling in iomap_zero_iter() ACPI: NFIT: Import GUID before use dax: remove the copy_from_iter and copy_to_iter methods dax: remove the DAXDEV_F_SYNC flag dax: simplify dax_synchronous and set_dax_synchronous uio: remove copy_from_iter_flushcache() and copy_mc_to_iter() iomap: turn the byte variable in iomap_zero_iter into a ssize_t memremap: remove support for external pgmap refcounts fsdax: don't require CONFIG_BLOCK iomap: build the block based code conditionally dax: fix up some of the block device related ifdefs fsdax: shift partition offset handling into the file systems dax: return the partition offset from fs_dax_get_by_bdev iomap: add a IOMAP_DAX flag xfs: pass the mapping flags to xfs_bmbt_to_iomap xfs: use xfs_direct_write_iomap_ops for DAX zeroing xfs: move dax device handling into xfs_{alloc,free}_buftarg ext4: cleanup the dax handling in ext4_fill_super ext2: cleanup the dax handling in ext2_fill_super fsdax: decouple zeroing from the iomap buffered I/O code ...
2022-01-10ext4: remove useless resetting io_end_size in mpage_process_page()Nghia Le
The command "make clang-analyzer" detects dead stores in mpage_process_page() function. Do not reset io_end_size to 0 in the current paths, as the function exits on those paths without further using io_end_size. Signed-off-by: Nghia Le <nghialm78@gmail.com> Link: https://lore.kernel.org/r/20211025221803.3326-1-nghialm78@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: fix an use-after-free issue about data=journal writeback modeZhang Yi
Our syzkaller report an use-after-free issue that accessing the freed buffer_head on the writeback page in __ext4_journalled_writepage(). The problem is that if there was a truncate racing with the data=journalled writeback procedure, the writeback length could become zero and bget_one() refuse to get buffer_head's refcount, then the truncate procedure release buffer once we drop page lock, finally, the last ext4_walk_page_buffers() trigger the use-after-free problem. sync truncate ext4_sync_file() file_write_and_wait_range() ext4_setattr(0) inode->i_size = 0 ext4_writepage() len = 0 __ext4_journalled_writepage() page_bufs = page_buffers(page) ext4_walk_page_buffers(bget_one) <- does not get refcount do_invalidatepage() free_buffer_head() ext4_walk_page_buffers(page_bufs) <- trigger use-after-free After commit bdf96838aea6 ("ext4: fix race between truncate and __ext4_journalled_writepage()"), we have already handled the racing case, so the bget_one() and bput_one() are not needed. So this patch simply remove these hunk, and recheck the i_size to make it safe. Fixes: bdf96838aea6 ("ext4: fix race between truncate and __ext4_journalled_writepage()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211225090937.712867-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: initialize err_blk before calling __ext4_get_inode_locHarshad Shirwadkar
It is not guaranteed that __ext4_get_inode_loc will definitely set err_blk pointer when it returns EIO. To avoid using uninitialized variables, let's first set err_blk to 0. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211201163421.2631661-1-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-01-10ext4: fast commit may miss tracking unwritten range during ftruncateXin Yin
If use FALLOC_FL_KEEP_SIZE to alloc unwritten range at bottom, the inode->i_size will not include the unwritten range. When call ftruncate with fast commit enabled, it will miss to track the unwritten range. Change to trace the full range during ftruncate. Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223032337.5198-3-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-01-10ext4: fix fast commit may miss tracking range for FALLOC_FL_ZERO_RANGEXin Yin
when call falloc with FALLOC_FL_ZERO_RANGE, to set an range to unwritten, which has been already initialized. If the range is align to blocksize, fast commit will not track range for this change. Also track range for unwritten range in ext4_map_blocks(). Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211221022839.374606-1-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2021-12-23ext4: use ext4_journal_start/stop for fast commit transactionsHarshad Shirwadkar
This patch drops all calls to ext4_fc_start_update() and ext4_fc_stop_update(). To ensure that there are no ongoing journal updates during fast commit, we also make jbd2_fc_begin_commit() lock journal for updates. This way we don't have to maintain two different transaction start stop APIs for fast commit and full commit. This patch doesn't remove the functions altogether since in future we want to have inode level locking for fast commits. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-12-04fsdax: shift partition offset handling into the file systemsChristoph Hellwig
Remove the last user of ->bdev in dax.c by requiring the file system to pass in an address that already includes the DAX offset. As part of the only set ->bdev or ->daxdev when actually required in the ->iomap_begin methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-27-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04iomap: add a IOMAP_DAX flagChristoph Hellwig
Add a flag so that the file system can easily detect DAX operations based just on the iomap operation requested instead of looking at inode state using IS_DAX. This will be needed to apply the to be added partition offset only for operations that actually use DAX, but not things like fiemap that are based on the block device. In the long run it should also allow turning the bdev, dax_dev and inline_data into a union. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-25-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04fsdax: decouple zeroing from the iomap buffered I/O codeChristoph Hellwig
Unshare the DAX and iomap buffered I/O page zeroing code. This code previously did a IS_DAX check deep inside the iomap code, which in fact was the only DAX check in the code. Instead move these checks into the callers. Most callers already have DAX special casing anyway and XFS will need it for reflink support as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-19-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-11-04ext4: prevent getting empty inode bufferZhang Yi
In ext4_get_inode_loc(), we may skip IO and get an zero && uptodate inode buffer when the inode monopolize an inode block for performance reason. For most cases, ext4_mark_iloc_dirty() will fill the inode buffer to make it fine, but we could miss this call if something bad happened. Finally, __ext4_get_inode_loc_noinmem() may probably get an empty inode buffer and trigger ext4 error. For example, if we remove a nonexistent xattr on inode A, ext4_xattr_set_handle() will return ENODATA before invoking ext4_mark_iloc_dirty(), it will left an uptodate but zero buffer. We will get checksum error message in ext4_iget() when getting inode again. EXT4-fs error (device sda): ext4_lookup:1784: inode #131074: comm cat: iget: checksum invalid Even worse, if we allocate another inode B at the same inode block, it will corrupt the inode A on disk when write back inode B. So this patch initialize the inode buffer by filling the in-mem inode contents if we skip read I/O, ensure that the buffer is really uptodate. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210901020955.1657340-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-11-04ext4: move ext4_fill_raw_inode() related functionsZhang Yi
In preparation for calling ext4_fill_raw_inode() in __ext4_get_inode_loc(), move three related functions before __ext4_get_inode_loc(), no logical change. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210901020955.1657340-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-11-04ext4: factor out ext4_fill_raw_inode()Zhang Yi
Factor out ext4_fill_raw_inode() from ext4_do_update_inode(), which is use to fill the in-mem inode contents into the inode table buffer, in preparation for initializing the exclusive inode buffer without reading the block in __ext4_get_inode_loc(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210901020955.1657340-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-11-04Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"Eric Whitney
This reverts commit 948ca5f30e1df0c11eb5b0f410b9ceb97fa77ad9. Two crash reports from users running variations on 5.15-rc4 kernels suggest that it is premature to enforce the state assertion in the original commit. Both crashes were triggered by BUG calls in that code, indicating that under some rare circumstance the buffer head state did not match a delayed allocated block at the time the block was written out. No reproducer is available. Resolving this problem will require more time than remains in the current release cycle, so reverting the original patch for the time being is necessary to avoid any instability it may cause. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20211012171901.5352-1-enwlinux@gmail.com Fixes: 948ca5f30e1d ("ext4: enforce buffer head state assertion in ext4_da_map_blocks") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2021-10-03Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of ext4 bugs in fast_commit, inline data, and delayed allocation. Also fix error handling code paths in ext4_dx_readdir() and ext4_fill_super(). Finally, avoid a grabbing a journal head in the delayed allocation write in the common cases where we are overwriting a pre-existing block or appending to an inode" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: recheck buffer uptodate bit under buffer lock ext4: fix potential infinite loop in ext4_dx_readdir() ext4: flush s_error_work before journal destroy in ext4_fill_super ext4: fix loff_t overflow in ext4_max_bitmap_size() ext4: fix reserved space counter leakage ext4: limit the number of blocks in one ADD_RANGE TLV ext4: enforce buffer head state assertion in ext4_da_map_blocks ext4: remove extent cache entries when truncating inline data ext4: drop unnecessary journal handle in delalloc write ext4: factor out write end code of inline file ext4: correct the error path of ext4_write_inline_data_end() ext4: check and update i_disksize properly ext4: add error checking to ext4_ext_replay_set_iblocks()
2021-10-01ext4: recheck buffer uptodate bit under buffer lockZhang Yi
Commit 8e33fadf945a ("ext4: remove an unnecessary if statement in __ext4_get_inode_loc()") forget to recheck buffer's uptodate bit again under buffer lock, which may overwrite the buffer if someone else have already brought it uptodate and changed it. Fixes: 8e33fadf945a ("ext4: remove an unnecessary if statement in __ext4_get_inode_loc()") Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210910080316.70421-1-yi.zhang@huawei.com
2021-10-01ext4: fix reserved space counter leakageJeffle Xu
When ext4_insert_delayed block receives and recovers from an error from ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the space it has reserved for that block insertion as it should. One effect of this bug is that s_dirtyclusters_counter is not decremented and remains incorrectly elevated until the file system has been unmounted. This can result in premature ENOSPC returns and apparent loss of free space. Another effect of this bug is that /sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even after syncfs has been executed on the filesystem. Besides, add check for s_dirtyclusters_counter when inode is going to be evicted and freed. s_dirtyclusters_counter can still keep non-zero until inode is written back in .evict_inode(), and thus the check is delayed to .destroy_inode(). Fixes: 51865fda28e5 ("ext4: let ext4 maintain extent status tree") Cc: stable@kernel.org Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210823061358.84473-1-jefflexu@linux.alibaba.com
2021-09-09ext4: enforce buffer head state assertion in ext4_da_map_blocksEric Whitney
Remove the code that re-initializes a buffer head with an invalid block number and BH_New and BH_Delay bits when a matching delayed and unwritten block has been found in the extent status cache. Replace it with assertions that verify the buffer head already has this state correctly set. The current code masked an inline data truncation bug that left stale entries in the extent status cache. With this change, generic/130 can be used to reproduce and detect that bug. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210819144927.25163-3-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-09-09Merge branch 'delalloc-buffer-write' into devTheodore Ts'o
Fix a bug in how we update i_disksize, and the error path in inline_data_end. Finally, drop an unnecessary creation of a journal handle which was only needed for inline data, which can give us a large performance gain in delayed allocation writes. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-09-04ext4: drop unnecessary journal handle in delalloc writeZhang Yi
After we factor out the inline data write procedure from ext4_da_write_end(), we don't need to start journal handle for the cases of both buffer overwrite and append-write. If we need to update i_disksize, mark_inode_dirty() do start handle and update inode buffer. So we could just remove all the journal handle codes in the delalloc write procedure. After this patch, we could get a lot of performance improvement. Below is the Unixbench comparison data test on my machine with 'Intel Xeon Gold 5120' CPU and nvme SSD backend. Test cmd: ./Run -c 56 -i 3 fstime fsbuffer fsdisk Before this patch: System Benchmarks Partial Index BASELINE RESULT INDEX File Copy 1024 bufsize 2000 maxblocks 3960.0 422965.0 1068.1 File Copy 256 bufsize 500 maxblocks 1655.0 105077.0 634.9 File Copy 4096 bufsize 8000 maxblocks 5800.0 1429092.0 2464.0 ====== System Benchmarks Index Score (Partial Only) 1186.6 After this patch: System Benchmarks Partial Index BASELINE RESULT INDEX File Copy 1024 bufsize 2000 maxblocks 3960.0 732716.0 1850.3 File Copy 256 bufsize 500 maxblocks 1655.0 184940.0 1117.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 2427152.0 4184.7 ====== System Benchmarks Index Score (Partial Only) 2053.0 Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-5-yi.zhang@huawei.com
2021-09-04ext4: factor out write end code of inline fileZhang Yi
Now that the inline_data file write end procedure are falled into the common write end functions, it is not clear. Factor them out and do some cleanup. This patch also drop ext4_da_write_inline_data_end() and switch to use ext4_write_inline_data_end() instead because we also need to do the same error processing if we failed to write data into inline entry. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
2021-09-04ext4: correct the error path of ext4_write_inline_data_end()Zhang Yi
Current error path of ext4_write_inline_data_end() is not correct. Firstly, it should pass out the error value if ext4_get_inode_loc() return fail, or else it could trigger infinite loop if we inject error here. And then it's better to add inode to orphan list if it return fail in ext4_journal_stop(), otherwise we could not restore inline xattr entry after power failure. Finally, we need to reset the 'ret' value if ext4_write_inline_data_end() return success in ext4_write_end() and ext4_journalled_write_end(), otherwise we could not get the error return value of ext4_journal_stop(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com
2021-09-04ext4: check and update i_disksize properlyZhang Yi
After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <= isize"), i_disksize could always be updated to i_size in ext4_setattr(), and we could sure that i_disksize <= i_size since holding inode lock and if i_disksize < i_size there are delalloc writes pending in the range upto i_size. If the end of the current write is <= i_size, there's no need to touch i_disksize since writeback will push i_disksize upto i_size eventually. So we can switch to check i_size instead of i_disksize in ext4_da_write_end() when write to the end of the file. we also could remove ext4_mark_inode_dirty() together because we defer inode dirtying to generic_write_end() or ext4_da_write_inline_data_end(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com
2021-09-02Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "In addition to some ext4 bug fixes and cleanups, this cycle we add the orphan_file feature, which eliminates bottlenecks when doing a large number of parallel truncates and file deletions, and move the discard operation out of the jbd2 commit thread when using the discard mount option, to better support devices with slow discard operations" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: make the updating inode data procedure atomic ext4: remove an unnecessary if statement in __ext4_get_inode_loc() ext4: move inode eio simulation behind io completeion ext4: Improve scalability of ext4 orphan file handling ext4: Orphan file documentation ext4: Speedup ext4 orphan inode handling ext4: Move orphan inode handling into a separate file ext4: Support for checksumming from journal triggers ext4: fix race writing to an inline_data file while its xattrs are changing jbd2: add sparse annotations for add_transaction_credits() ext4: fix sparse warnings ext4: Make sure quota files are not grabbed accidentally ext4: fix e2fsprogs checksum failure for mounted filesystem ext4: if zeroout fails fall back to splitting the extent node ext4: reduce arguments of ext4_fc_add_dentry_tlv ext4: flush background discard kwork when retry allocation ext4: get discard out of jbd2 commit kthread contex ext4: remove the repeated comment of ext4_trim_all_free ext4: add new helper interface ext4_try_to_trim_range() ext4: remove the 'group' parameter of ext4_trim_extent ...
2021-08-30ext4: make the updating inode data procedure atomicZhang Yi
Now that ext4_do_update_inode() return error before filling the whole inode data if we fail to set inode blocks in ext4_inode_blocks_set(). This error should never happen in theory since sb->s_maxbytes should not have allowed this, we have already init sb->s_maxbytes according to this feature in ext4_fill_super(). So even through that could only happen due to the filesystem corruption, we'd better to return after we finish updating the inode because it may left an uninitialized buffer and we could read this buffer later in "errors=continue" mode. This patch make the updating inode data procedure atomic, call EXT4_ERROR_INODE() after we dropping i_raw_lock after something bad happened, make sure that the inode is integrated, and also drop a BUG_ON and do some small cleanups. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210826130412.3921207-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30ext4: remove an unnecessary if statement in __ext4_get_inode_loc()Zhang Yi
The "if (!buffer_uptodate(bh))" hunk covered almost the whole code after getting buffer in __ext4_get_inode_loc() which seems unnecessary, remove it and switch to check ext4_buffer_uptodate(), it simplify code and make it more readable. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210826130412.3921207-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30ext4: move inode eio simulation behind io completeionZhang Yi
No EIO simulation is required if the buffer is uptodate, so move the simulation behind read bio completeion just like inode/block bitmap simulation does. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210826130412.3921207-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30ext4: Speedup ext4 orphan inode handlingJan Kara
Ext4 orphan inode handling is a bottleneck for workloads which heavily truncate / unlink small files since it contends on the global s_orphan_mutex lock (and generally it's difficult to improve scalability of the ondisk linked list of orphaned inodes). This patch implements new way of handling orphan inodes. Instead of linking orphaned inode into a linked list, we store it's inode number in a new special file which we call "orphan file". Only if there's no more space in the orphan file (too many inodes are currently orphaned) we fall back to using old style linked list. Currently we protect operations in the orphan file with a spinlock for simplicity but even in this setting we can substantially reduce the length of the critical section and thus speedup some workloads. In the next patch we improve this by making orphan handling lockless. Note that the change is backwards compatible when the filesystem is clean - the existence of the orphan file is a compat feature, we set another ro-compat feature indicating orphan file needs scanning for orphaned inodes when mounting filesystem read-write. This ro-compat feature gets cleared on unmount / remount read-only. Some performance data from 80 CPU Xeon Server with 512 GB of RAM, filesystem located on SSD, average of 5 runs: stress-orphan (microbenchmark truncating files byte-by-byte from N processes in parallel) Threads Time Time Vanilla Patched 1 1.057200 0.945600 2 1.680400 1.331800 4 2.547000 1.995000 8 7.049400 6.424200 16 14.827800 14.937600 32 40.948200 33.038200 64 87.787400 60.823600 128 206.504000 122.941400 So we can see significant wins all over the board. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30ext4: Support for checksumming from journal triggersJan Kara
JBD2 layer support triggers which are called when journaling layer moves buffer to a certain state. We can use the frozen trigger, which gets called when buffer data is frozen and about to be written out to the journal, to compute block checksums for some buffer types (similarly as does ocfs2). This avoids unnecessary repeated recomputation of the checksum (at the cost of larger window where memory corruption won't be caught by checksumming) and is even necessary when there are unsynchronized updaters of the checksummed data. So add superblock and journal trigger type arguments to ext4_journal_get_write_access() and ext4_journal_get_create_access() so that frozen triggers can be set accordingly. Also add inode argument to ext4_walk_page_buffers() and all the callbacks used with that function for the same purpose. This patch is mostly only a change of prototype of the above mentioned functions and a few small helpers. Real checksumming will come later. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-08-30ext4: Make sure quota files are not grabbed accidentallyJan Kara
If ext4 filesystem is corrupted so that quota files are linked from directory hirerarchy, bad things can happen. E.g. quota files can get corrupted or deleted. Make sure we are not grabbing quota file inodes when we expect normal inodes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210812133122.26360-1-jack@suse.cz
2021-07-13ext4: Convert to use mapping->invalidate_lockJan Kara
Convert ext4 to use mapping->invalidate_lock instead of its private EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: <linux-ext4@vger.kernel.org> CC: Ted Tso <tytso@mit.edu> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
2021-06-30Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "In addition to bug fixes and cleanups, there are two new features for ext4 in 5.14: - Allow applications to poll on changes to /sys/fs/ext4/*/errors_count - Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be checkpointed, truncated and discarded or zero'ed" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits) jbd2: export jbd2_journal_[un]register_shrinker() ext4: notify sysfs on errors_count value change fs: remove bdev_try_to_free_page callback ext4: remove bdev_try_to_free_page() callback jbd2: simplify journal_clean_one_cp_list() jbd2,ext4: add a shrinker to release checkpointed buffers jbd2: remove redundant buffer io error checks jbd2: don't abort the journal when freeing buffers jbd2: ensure abort the journal if detect IO error when writing original buffer back jbd2: remove the out label in __jbd2_journal_remove_checkpoint() ext4: no need to verify new add extent block jbd2: clean up misleading comments for jbd2_fc_release_bufs ext4: add check to prevent attempting to resize an fs with sparse_super2 ext4: consolidate checks for resize of bigalloc into ext4_resize_begin ext4: remove duplicate definition of ext4_xattr_ibody_inline_set() ext4: fsmap: fix the block/inode bitmap comment ext4: fix comment for s_hash_unsigned ext4: use local variable ei instead of EXT4_I() macro ext4: fix avefreec in find_group_orlov ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit ...
2021-06-29fs: remove noop_set_page_dirty()Matthew Wilcox (Oracle)
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit on the page, which will be used to avoid calling set_page_dirty() in the future. It will have no effect on actually writing the page back, as the pages are not on any LRU lists. [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules] Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-22ext4: use local variable ei instead of EXT4_I() macrochenyichong
Signed-off-by: chenyichong <chenyichong@uniontech.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20210526052930.11278-1-chenyichong@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-22ext4: add discard/zeroout flags to journal flushLeah Rumancik
Add a flags argument to jbd2_journal_flush to enable discarding or zero-filling the journal blocks while flushing the journal. Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Link: https://lore.kernel.org/r/20210518151327.130198-1-leah.rumancik@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-06-17ext4: fix overflow in ext4_iomap_alloc()Jan Kara
A code in iomap alloc may overflow block number when converting it to byte offset. Luckily this is mostly harmless as we will just use more expensive method of writing using unwritten extents even though we are writing beyond i_size. Cc: stable@kernel.org Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210412102333.2676-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09ext4: fix various seppling typosBhaskar Chowdhury
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Link: https://lore.kernel.org/r/cover.1616840203.git.unixbhaskar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09ext4: delete redundant uptodate check for bufferYang Guo
The buffer uptodate state has been checked in function set_buffer_uptodate, there is no need use buffer_uptodate before calling set_buffer_uptodate and delete it. Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Yang Guo <guoyang2@huawei.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/1617260610-29770-1-git-send-email-zhangshaokun@hisilicon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>