summaryrefslogtreecommitdiff
path: root/fs/erofs/data.c
AgeCommit message (Collapse)Author
5 daysMerge tag 'vfs-6.17-rc1.mmap_prepare' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mmap_prepare updates from Christian Brauner: "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). This is preferred to the existing f_op->mmap() hook as it does require a VMA to be established yet, thus allowing the mmap logic to invoke this hook far, far earlier, prior to inserting a VMA into the virtual address space, or performing any other heavy handed operations. This allows for much simpler unwinding on error, and for there to be a single attempt at merging a VMA rather than having to possibly reattempt a merge based on potentially altered VMA state. Far more importantly, it prevents inappropriate manipulation of incompletely initialised VMA state, which is something that has been the cause of bugs and complexity in the past. The intent is to gradually deprecate f_op->mmap, and in that vein this series coverts the majority of file systems to using f_op->mmap_prepare. Prerequisite steps are taken - firstly ensuring all checks for mmap capabilities use the file_has_valid_mmap_hooks() helper rather than directly checking for f_op->mmap (which is now not a valid check) and secondly updating daxdev_mapping_supported() to not require a VMA parameter to allow ext4 and xfs to be converted. Commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") handles the nasty edge-case of nested file systems like overlayfs, which introduces a compatibility shim to allow f_op->mmap_prepare() to be invoked from an f_op->mmap() callback. This allows for nested filesystems to continue to function correctly with all file systems regardless of which callback is used. Once we finally convert all file systems, this shim can be removed. As a result, ecryptfs, fuse, and overlayfs remain unaltered so they can nest all other file systems. We additionally do not update resctl - as this requires an update to remap_pfn_range() (or an alternative to it) which we defer to a later series, equally we do not update cramfs which needs a mixed mapping insertion with the same issue, nor do we update procfs, hugetlbfs, syfs or kernfs all of which require VMAs for internal state and hooks. We shall return to all of these later" * tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: update porting, vfs documentation to describe mmap_prepare() fs: replace mmap hook with .mmap_prepare for simple mappings fs: convert most other generic_file_*mmap() users to .mmap_prepare() fs: convert simple use of generic_file_*_mmap() to .mmap_prepare() mm/filemap: introduce generic_file_*_mmap_prepare() helpers fs/xfs: transition from deprecated .mmap hook to .mmap_prepare fs/ext4: transition from deprecated .mmap hook to .mmap_prepare fs/dax: make it possible to check dev dax support without a VMA fs: consistently use can_mmap_file() helper mm/nommu: use file_has_valid_mmap_hooks() helper mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
9 dayserofs: implement metadata compressionBo Liu (OpenAnolis)
Thanks to the meta buffer infrastructure, metadata-compressed inodes are just read from the metabox inode instead of the blockdevice (or backing file) inode. The same is true for shared extended attributes. When metadata compression is enabled, inode numbers are divided from on-disk NIDs because of non-LTS 32-bit application compatibility. Co-developed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250722003229.2121752-1-hsiangkao@linux.alibaba.com
9 dayserofs: refine erofs_iomap_begin()Gao Xiang
- Avoid calling erofs_map_dev() for unmapped extents; - Assign `iomap->addr` for inline extents too (since they have physical location). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250716092254.3826715-1-hsiangkao@linux.alibaba.com
9 dayserofs: remove need_kmap in erofs_read_metabuf()Gao Xiang
- need_kmap is always true except for a ztailpacking case; thus, just open-code that one; - The upcoming metadata compression will add a new boolean, so simplify this first. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250714090907.4095645-1-hsiangkao@linux.alibaba.com
2025-07-10erofs: address D-cache aliasingGao Xiang
Flush the D-cache before unlocking folios for compressed inodes, as they are dirtied during decompression. Avoid calling flush_dcache_folio() on every CPU write, since it's more like playing whack-a-mole without real benefit. It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio() is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear for new page cache folios. However, certain ARM boards are affected, as reported. Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Closes: https://lore.kernel.org/r/c1e51e16-6cc6-49d0-a63e-4e9ff6c4dd53@pengutronix.de Closes: https://lore.kernel.org/r/38d43fae-1182-4155-9c5b-ffc7382d9917@siemens.com Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Tested-by: Stefan Kerkmann <s.kerkmann@pengutronix.de> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250709034614.2780117-2-hsiangkao@linux.alibaba.com
2025-07-10erofs: fix to add missing tracepoint in erofs_read_folio()Chao Yu
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap") converts to use iomap interface, it removed trace_erofs_readpage() tracepoint in the meantime, let's add it back. Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250708111942.3120926-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10erofs: fix to add missing tracepoint in erofs_readahead()Chao Yu
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap") converts to use iomap interface, it removed trace_erofs_readahead() tracepoint in the meantime, let's add it back. Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250707084832.2725677-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-06-19fs: convert most other generic_file_*mmap() users to .mmap_prepare()Lorenzo Stoakes
Update nearly all generic_file_mmap() and generic_file_readonly_mmap() callers to use generic_file_mmap_prepare() and generic_file_readonly_mmap_prepare() respectively. We update blkdev, 9p, afs, erofs, ext2, nfs, ntfs3, smb, ubifs and vboxsf file systems this way. Remaining users we cannot yet update are ecryptfs, fuse and cramfs. The former two are nested file systems that must support any underlying file ssytem, and cramfs inserts a mixed mapping which currently requires a VMA. Once all file systems have been converted to mmap_prepare(), we can then update nested file systems. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/08db85970d89b17a995d2cffae96fb4cc462377f.1750099179.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-22erofs: add 'fsoffset' mount option to specify filesystem offsetSheng Yong
When attempting to use an archive file, such as APEX on android, as a file-backed mount source, it fails because EROFS image within the archive file does not start at offset 0. As a result, a loop or a dm device is still needed to attach the image file at an appropriate offset first. Similarly, if an EROFS image within a block device does not start at offset 0, it cannot be mounted directly either. To address this issue, this patch adds a new mount option `fsoffset=x' to accept a start offset for the primary device. The offset should be aligned to the block size. EROFS will add this offset before performing read requests. Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250517090544.2687651-1-shengyong1@xiaomi.com [ Gao Xiang: minor update on documentation and the error message. ] Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-03-17erofs: implement 48-bit block addressing for unencoded inodesGao Xiang
It adapts the on-disk changes from the previous commit. It also supports EROFS_NULL_ADDR (all 1's) for EROFS_INODE_FLAT_PLAIN inodes to indicate 0-filled inodes, as it's common for composefs use cases. As a result, EROFS_INODE_CHUNK_BASED is no longer needed. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-5-hsiangkao@linux.alibaba.com
2025-03-17erofs: add 48-bit block addressing on-disk supportGao Xiang
The current 32-bit block addressing limits EROFS to a 16TiB maximum volume size with 4KiB blocks. However, several new use cases now require larger capacity support: - Massive datasets for model training in order to boost random sampling performance for each epoch; - Object storage clients using EROFS direct passthrough. This extends core on-disk structures to support 48-bit block addressing, such as inodes, device slots, and inode chunks. Additionally: - Expand superblock root NID to 8-byte `rootnid_8b` to enable full out-of-place update incremental builds; - Introduce `epoch` field in the superblock as well as add `mtime` field to 32-byte compact inodes for basic timestamp support. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-4-hsiangkao@linux.alibaba.com
2025-03-17erofs: get rid of erofs_map_blocks_flatmode()Gao Xiang
It's simple enough to be folded into erofs_map_blocks(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-2-hsiangkao@linux.alibaba.com
2025-03-17erofs: get rid of erofs_kmap_typeBo Liu
Since EROFS_KMAP_ATOMIC is no longer valid, get rid of erofs_kmap_type too. Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250217093141.2659-1-liubo03@inspur.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-12-16erofs: reference `struct erofs_device_info` for erofs_map_devGao Xiang
Record `m_sb` and `m_dif` to replace `m_fscache`, `m_daxdev`, `m_fp` and `m_dax_part_off` in order to simplify the codebase. Note that `m_bdev` is still left since it can be assigned from `sb->s_bdev` directly. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212235401.2857246-1-hsiangkao@linux.alibaba.com
2024-12-16erofs: use `struct erofs_device_info` for the primary deviceGao Xiang
Instead of just listing each one directly in `struct erofs_sb_info` except that we still use `sb->s_bdev` for the primary block device. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241216125310.930933-2-hsiangkao@linux.alibaba.com
2024-11-18erofs: clarify direct I/O supportGao Xiang
Currently, only filesystems backed by block devices support direct I/O. Also remove the unnecessary strict checks that can be supported with iomap. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115074625.2520728-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: get rid of `buf->kmap_type`Gao Xiang
After commit 927e5010ff5b ("erofs: use kmap_local_page() only for erofs_bread()"), `buf->kmap_type` actually has no use at all. Let's get rid of `buf->kmap_type` now. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114095813.839866-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix file-backed mounts over FUSEGao Xiang
syzbot reported a null-ptr-deref in fuse_read_args_fill: fuse_read_folio+0xb0/0x100 fs/fuse/file.c:905 filemap_read_folio+0xc6/0x2a0 mm/filemap.c:2367 do_read_cache_folio+0x263/0x5c0 mm/filemap.c:3825 read_mapping_folio include/linux/pagemap.h:1011 [inline] erofs_bread+0x34d/0x7e0 fs/erofs/data.c:41 erofs_read_superblock fs/erofs/super.c:281 [inline] erofs_fc_fill_super+0x2b9/0x2500 fs/erofs/super.c:625 Unlike most filesystems, some network filesystems and FUSE need unavoidable valid `file` pointers for their read I/Os [1]. Anyway, those use cases need to be supported too. [1] https://docs.kernel.org/filesystems/vfs.html Reported-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/6727bbdf.050a0220.3c8d68.0a7e.GAE@google.com Fixes: fb176750266a ("erofs: add file-backed mount support") Tested-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114234905.1873723-1-hsiangkao@linux.alibaba.com
2024-11-12erofs: add SEEK_{DATA,HOLE} supportGao Xiang
Many userspace programs (including erofs-utils itself) use SEEK_DATA / SEEK_HOLE to parse hole extents in addition to FIEMAP. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241011065128.2097377-1-hsiangkao@linux.alibaba.com
2024-09-10erofs: simplify erofs_map_blocks_flatmode()Hongzhen Luo
Get rid of redundant variables (nblocks, offset) and a dead branch (!tailendpacking). Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240905030339.1474396-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-10erofs: support unencoded inodes for fileioGao Xiang
Since EROFS only needs to handle read requests in simple contexts, Just directly use vfs_iocb_iter_read() for data I/Os. Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com
2024-09-10erofs: add file-backed mount supportGao Xiang
It actually has been around for years: For containers and other sandbox use cases, there will be thousands (and even more) of authenticated (sub)images running on the same host, unlike OS images. Of course, all scenarios can use the same EROFS on-disk format, but bdev-backed mounts just work well for OS images since golden data is dumped into real block devices. However, it's somewhat hard for container runtimes to manage and isolate so many unnecessary virtual block devices safely and efficiently [1]: they just look like a burden to orchestrators and file-backed mounts are preferred indeed. There were already enough attempts such as Incremental FS, the original ComposeFS and PuzzleFS acting in the same way for immutable fses. As for current EROFS users, ComposeFS, containerd and Android APEXs will be directly benefited from it. On the other hand, previous experimental feature "erofs over fscache" was once also intended to provide a similar solution (inspired by Incremental FS discussion [2]), but the following facts show file-backed mounts will be a better approach: - Fscache infrastructure has recently been moved into new Netfslib which is an unexpected dependency to EROFS really, although it originally claims "it could be used for caching other things such as ISO9660 filesystems too." [3] - It takes an unexpectedly long time to upstream Fscache/Cachefiles enhancements. For example, the failover feature took more than one year, and the deamonless feature is still far behind now; - Ongoing HSM "fanotify pre-content hooks" [4] together with this will perfectly supersede "erofs over fscache" in a simpler way since developers (mainly containerd folks) could leverage their existing caching mechanism entirely in userspace instead of strictly following the predefined in-kernel caching tree hierarchy. After "fanotify pre-content hooks" lands upstream to provide the same functionality, "erofs over fscache" will be removed then (as an EROFS internal improvement and EROFS will not have to bother with on-demand fetching and/or caching improvements anymore.) [1] https://github.com/containers/storage/pull/2039 [2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com [3] https://docs.kernel.org/filesystems/caching/fscache.html [4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com Closes: https://github.com/containers/composefs/issues/144 Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240830032840.3783206-1-hsiangkao@linux.alibaba.com
2024-07-26erofs: support multi-page folios for erofs_bread()Gao Xiang
If the requested page is part of the previous multi-page folio, there is no need to call read_mapping_folio() again. Also, get rid of the remaining one of page->index [1] in our codebase. [1] https://lore.kernel.org/r/Zp8fgUSIBGQ1TN0D@casper.infradead.org Cc: Matthew Wilcox <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240723073024.875290-1-hsiangkao@linux.alibaba.com
2024-05-24Merge tag 'erofs-for-6.10-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull more erofs updates from Gao Xiang: "The main ones are metadata API conversion to byte offsets by Al Viro. Another patch gets rid of unnecessary memory allocation out of DEFLATE decompressor. The remaining one is a trivial cleanup. - Convert metadata APIs to byte offsets - Avoid allocating DEFLATE streams unnecessarily - Some erofs_show_options() cleanup" * tag 'erofs-for-6.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: avoid allocating DEFLATE streams before mounting z_erofs_pcluster_begin(): don't bother with rounding position down erofs: don't round offset down for erofs_read_metabuf() erofs: don't align offset for erofs_read_metabuf() (simple cases) erofs: mechanically convert erofs_read_metabuf() to offsets erofs: clean up erofs_show_options()
2024-05-18erofs: don't align offset for erofs_read_metabuf() (simple cases)Al Viro
Most of the callers of erofs_read_metabuf() have the following form: block = erofs_blknr(sb, offset); off = erofs_blkoff(sb, offset); p = erofs_read_metabuf(...., erofs_pos(sb, block), ...); if (IS_ERR(p)) return PTR_ERR(p); q = p + off; // no further uses of p, block or off. The value passed to erofs_read_metabuf() is offset rounded down to block size, i.e. offset - off. Passing offset as-is would increase the return value by off in case of success and keep the return value unchanged in in case of error. In other words, the same could be achieved by q = erofs_read_metabuf(...., offset, ...); if (IS_ERR(q)) return PTR_ERR(q); This commit convert these simple cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240425195915.GD1031757@ZenIV Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-05-18erofs: mechanically convert erofs_read_metabuf() to offsetsAl Viro
just lift the call of erofs_pos() into the callers; it will collapse in most of them, but that's better done caller-by-caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240425195846.GC1031757@ZenIV Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-05-03use ->bd_mapping instead of ->bd_inode->i_mappingAl Viro
Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-25erofs_buf: store address_space instead of inodeAl Viro
... seeing that ->i_mapping is the only thing we want from the inode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-07erofs: switch erofs_bread() to passing offset instead of block numberAl Viro
Callers are happier that way, especially since we no longer need to play with splitting offset into block number and offset within block, passing the former to erofs_bread(), then adding the latter... erofs_bread() always reads entire pages, anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-07erofs: apply proper VMA alignment for memory mapped files on THPGao Xiang
There are mainly two reasons that thp_get_unmapped_area() should be used for EROFS as other filesystems: - It's needed to enable PMD mappings as a FSDAX filesystem, see commit 74d2fad1334d ("thp, dax: add thp_get_unmapped_area for pmd mappings"); - It's useful together with large folios and CONFIG_READ_ONLY_THP_FOR_FS which enable THPs for mmapped files (e.g. shared libraries) even without FSDAX. See commit 1854bc6e2420 ("mm/readahead: Align file mappings for non-DAX"). Fixes: 06252e9ce05b ("erofs: dax support for non-tailpacking regular file") Fixes: ce529cc25b18 ("erofs: enable large folios for iomap mode") Fixes: e6687b89225e ("erofs: enable large folios for fscache mode") Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240306053138.2240206-1-hsiangkao@linux.alibaba.com
2024-02-25erofs: port device access to fileChristian Brauner
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-20-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-17erofs: fix NULL dereference of dif->bdev_handle in fscache modeJingbo Xu
Avoid NULL dereference of dif->bdev_handle, as dif->bdev_handle is NULL in fscache mode. BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:erofs_map_dev+0xbd/0x1c0 Call Trace: <TASK> erofs_fscache_data_read_slice+0xa7/0x340 erofs_fscache_data_read+0x11/0x30 erofs_fscache_readahead+0xd9/0x100 read_pages+0x47/0x1f0 page_cache_ra_order+0x1e5/0x270 filemap_get_pages+0xf2/0x5f0 filemap_read+0xb8/0x2e0 vfs_read+0x18d/0x2b0 ksys_read+0x53/0xd0 do_syscall_64+0x42/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 Reported-by: Yiqun Leng <yqleng@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7245 Fixes: 49845720080d ("erofs: Convert to use bdev_open_by_path()") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20231114070704.23398-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-11-02Merge tag 'erofs-for-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "Nothing exciting lands for this cycle, since we're still busying in developing support for sub-page blocks and large-folios of compressed data for new scenarios on Android. In this cycle, MicroLZMA format is marked as stable, and there are minor cleanups around documentation and codebase. In addition, it also fixes incorrect lockref usage in erofs_insert_workgroup(). Summary: - Fix inode metadata space layout documentation - Avoid warning for MicroLZMA format anymore - Fix erofs_insert_workgroup() lockref usage - Some cleanups" * tag 'erofs-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix erofs_insert_workgroup() lockref usage erofs: tidy up redundant includes erofs: get rid of ROOT_NID() erofs: simplify compression configuration parser erofs: don't warn MicroLZMA format anymore erofs: fix inode metadata space layout description in documentation
2023-10-31erofs: tidy up redundant includesFerry Meng
- Remove unused includes like <linux/parser.h> and <linux/prefetch.h>; - Move common includes into "internal.h". Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20231026021627.23284-2-mengferry@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-10-28erofs: Convert to use bdev_open_by_path()Jan Kara
Convert erofs to use bdev_open_by_path() and pass the handle around. CC: Gao Xiang <xiang@kernel.org> CC: Chao Yu <chao@kernel.org> CC: linux-erofs@lists.ozlabs.org Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-21-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-24mm: remove enum page_entry_sizeMatthew Wilcox (Oracle)
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-24splice: Use filemap_splice_read() instead of generic_file_splice_read()David Howells
Replace pointers to generic_file_splice_read() with calls to filemap_splice_read(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-17erofs: keep meta inode into erofs_bufGao Xiang
So that erofs_read_metadata() can read metadata from other inodes (e.g. packed inode) as well. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-04-17erofs: support flattened block device for multi-blob imagesJia Zhu
In order to support mounting multi-blobs container image as a single block device, add flattened block device feature for EROFS. In this mode, all meta/data contents will be mapped into one block space. User could compose a block device(by nbd/ublk/virtio-blk/ vhost-user-blk) from multiple sources and mount the block device by EROFS directly. It can reduce the number of block devices used, and it's also benefits in both VM file passthrough and distributed storage scenarios. You can test this using the method mentioned by: https://github.com/dragonflyoss/image-service/pull/1139 1. Compose a (nbd)block device from multi-blobs. 2. Mount EROFS on mntdir/. 3. Compare the md5sum between source dir and mntdir/. Later, we could also use it to refer original tar blobs. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Tested-by: Jiang Liu <gerry@linux.alibaba.com> Link: https://lore.kernel.org/r/20230302071751.48425-1-zhujia.zj@bytedance.com [ Gao Xiang: refine commit message and use erofs_pos(). ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-04-17erofs: avoid hardcoded blocksize for subpage block supportJingbo Xu
As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-03-09erofs: use wrapper i_blocksize() in erofs_file_read_iter()Yue Hu
linux/fs.h has a wrapper for this operation. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230306075527.1338-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-15erofs: remove unused EROFS_GET_BLOCKS_RAW flagJingbo Xu
For erofs_map_blocks() and erofs_map_blocks_flatmode(), the flags argument is always EROFS_GET_BLOCKS_RAW. Thus remove the unused flags parameter for these two functions. Besides EROFS_GET_BLOCKS_RAW is originally introduced for reading compressed (raw) data for compressed files. However it's never used actually and let's remove it now. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230209024825.17335-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-02-15erofs: simplify iloc()Gao Xiang
Actually we could pass in inodes directly to clean up all callers. Also rename iloc() as erofs_iloc(). Link: https://lore.kernel.org/r/20230114150823.432069-1-xiang@kernel.org Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-02-09mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-07erofs: use kmap_local_page() only for erofs_bread()Gao Xiang
Convert all mapped erofs_bread() users to use kmap_local_page() instead of kmap() or kmap_atomic(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-and-tested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221018105313.4940-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-12-07erofs: enable large folios for iomap modeJingbo Xu
Enable large folios for iomap mode. Then the readahead routine will pass down large folios containing multiple pages. Let's enable this for non-compressed format for now, until the compression part supports large folios later. When large folios supported, the iomap routine will allocate iomap_page for each large folio and thus we need iomap_release_folio() and iomap_invalidate_folio() to free iomap_page when these folios get reclaimed or invalidated. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221130060455.44532-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-07-22erofs: get rid of erofs_prepare_dio() helperGao Xiang
Fold in erofs_prepare_dio() in order to simplify the code. Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220720082229.12172-1-hsiangkao@linux.alibaba.com
2022-05-24Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...