summaryrefslogtreecommitdiff
path: root/fs/erofs
AgeCommit message (Collapse)Author
3 daysMerge tag 'vfs-6.17-rc1.mmap_prepare' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mmap_prepare updates from Christian Brauner: "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). This is preferred to the existing f_op->mmap() hook as it does require a VMA to be established yet, thus allowing the mmap logic to invoke this hook far, far earlier, prior to inserting a VMA into the virtual address space, or performing any other heavy handed operations. This allows for much simpler unwinding on error, and for there to be a single attempt at merging a VMA rather than having to possibly reattempt a merge based on potentially altered VMA state. Far more importantly, it prevents inappropriate manipulation of incompletely initialised VMA state, which is something that has been the cause of bugs and complexity in the past. The intent is to gradually deprecate f_op->mmap, and in that vein this series coverts the majority of file systems to using f_op->mmap_prepare. Prerequisite steps are taken - firstly ensuring all checks for mmap capabilities use the file_has_valid_mmap_hooks() helper rather than directly checking for f_op->mmap (which is now not a valid check) and secondly updating daxdev_mapping_supported() to not require a VMA parameter to allow ext4 and xfs to be converted. Commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") handles the nasty edge-case of nested file systems like overlayfs, which introduces a compatibility shim to allow f_op->mmap_prepare() to be invoked from an f_op->mmap() callback. This allows for nested filesystems to continue to function correctly with all file systems regardless of which callback is used. Once we finally convert all file systems, this shim can be removed. As a result, ecryptfs, fuse, and overlayfs remain unaltered so they can nest all other file systems. We additionally do not update resctl - as this requires an update to remap_pfn_range() (or an alternative to it) which we defer to a later series, equally we do not update cramfs which needs a mixed mapping insertion with the same issue, nor do we update procfs, hugetlbfs, syfs or kernfs all of which require VMAs for internal state and hooks. We shall return to all of these later" * tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: update porting, vfs documentation to describe mmap_prepare() fs: replace mmap hook with .mmap_prepare for simple mappings fs: convert most other generic_file_*mmap() users to .mmap_prepare() fs: convert simple use of generic_file_*_mmap() to .mmap_prepare() mm/filemap: introduce generic_file_*_mmap_prepare() helpers fs/xfs: transition from deprecated .mmap hook to .mmap_prepare fs/ext4: transition from deprecated .mmap hook to .mmap_prepare fs/dax: make it possible to check dev dax support without a VMA fs: consistently use can_mmap_file() helper mm/nommu: use file_has_valid_mmap_hooks() helper mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
7 dayserofs: support to readahead dirent blocks in erofs_readdir()Chao Yu
This patch supports to readahead more blocks in erofs_readdir(), it can enhance readdir performance in large direcotry. readdir test in a large directory which contains 12000 sub-files. files_per_second Before: 926385.54 After: 2380435.562 Meanwhile, let's introduces a new sysfs entry to control readahead bytes to provide more flexible policy for readahead of readdir(). - location: /sys/fs/erofs/<disk>/dir_ra_bytes - default value: 16384 - disable readahead: set the value to 0 Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250721021352.2495371-1-chao@kernel.org [ Gao Xiang: minor styling adjustment. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
7 dayserofs: implement metadata compressionBo Liu (OpenAnolis)
Thanks to the meta buffer infrastructure, metadata-compressed inodes are just read from the metabox inode instead of the blockdevice (or backing file) inode. The same is true for shared extended attributes. When metadata compression is enabled, inode numbers are divided from on-disk NIDs because of non-LTS 32-bit application compatibility. Co-developed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250722003229.2121752-1-hsiangkao@linux.alibaba.com
7 dayserofs: add on-disk definition for metadata compressionGao Xiang
Filesystem metadata has a high degree of redundancy, so it should compress well in the general case. Although metadata compression can increase overall I/O latency, many users care more about minimized image sizes than extreme runtime performance. Let's implement metadata compression in response to user requests [1]. Actually, it's quite simple to implement metadata compression: since EROFS already supports per-inode compression, we can simply treat a special inode (called `the metabox inode`) as a container for compressed inode metadata. Since EROFS supports multiple algorithms, users can even specify LZ4 for metadata and LZMA for data. To better support incremental builds, the MSB of NIDs indicates where the inode metadata is located: if bit 63 is set, the inode itself should be read from `the metabox inode`. Optionally, shared xattrs can also be kept in `the metabox inode` if COMPAT_SHARED_EA_IN_METABOX is set. [1] https://issues.redhat.com/browse/RHEL-75783 Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250717070804.1446345-2-hsiangkao@linux.alibaba.com
7 dayserofs: fix build error with CONFIG_EROFS_FS_ZIP_ACCEL=yBo Liu (OpenAnolis)
fix build err: ld.lld: error: undefined symbol: crypto_req_done referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_decompress) in archive vmlinux.a referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_decompress) in archive vmlinux.a ld.lld: error: undefined symbol: crypto_acomp_decompress referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_decompress) in archive vmlinux.a ld.lld: error: undefined symbol: crypto_alloc_acomp referenced by decompressor_crypto.c fs/erofs/decompressor_crypto.o:(z_erofs_crypto_enable_engine) in archive vmlinux.a Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507161032.QholMPtn-lkp@intel.com/ Fixes: b4a29efc5146 ("erofs: support DEFLATE decompression by using Intel QAT") Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Link: https://lore.kernel.org/r/20250718033039.3609-1-liubo03@inspur.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
7 dayserofs: remove ENOATTR definitionGao Xiang
ENOATTR is not defined in Linux; use ENODATA instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250717042317.1218597-1-hsiangkao@linux.alibaba.com
7 dayserofs: refine erofs_iomap_begin()Gao Xiang
- Avoid calling erofs_map_dev() for unmapped extents; - Assign `iomap->addr` for inline extents too (since they have physical location). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250716092254.3826715-1-hsiangkao@linux.alibaba.com
7 dayserofs: unify meta buffers in z_erofs_fill_inode()Gao Xiang
There is no need to keep additional local metabufs since we already have one in `struct erofs_map_blocks`. This was actually a leftover when applying meta buffers to zmap operations, see commit 09c543798c3c ("erofs: use meta buffers for zmap operations"). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250716064152.3537457-1-hsiangkao@linux.alibaba.com
7 dayserofs: remove need_kmap in erofs_read_metabuf()Gao Xiang
- need_kmap is always true except for a ztailpacking case; thus, just open-code that one; - The upcoming metadata compression will add a new boolean, so simplify this first. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250714090907.4095645-1-hsiangkao@linux.alibaba.com
7 dayserofs: do sanity check on m->type in z_erofs_load_compact_lcluster()Chao Yu
All below functions will do sanity check on m->type, let's move sanity check to z_erofs_load_compact_lcluster() for cleanup. - z_erofs_map_blocks_fo - z_erofs_get_extent_compressedlen - z_erofs_get_extent_decompressedlen - z_erofs_extent_lookback Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250708110928.3110375-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
7 dayserofs: get rid of {get,put}_page() for ztailpacking dataGao Xiang
The compressed data for the ztailpacking feature is fetched from the metadata inode (e.g., bd_inode), which is folio-based. Therefore, the folio interface should be used instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250626085459.339830-1-hsiangkao@linux.alibaba.com
2025-07-12erofs: fix large fragment handlingGao Xiang
Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE but the fragment is not the whole inode, it currently returns -EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set. It is not intended by design but should be rare, as it can only be reproduced by mkfs with `-Eall-fragments` in a specific case. Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT. Reported-by: Axel Fontaine <axel@axelfontaine.com> Closes: https://github.com/erofs/erofs-utils/issues/23 Fixes: 7c3ca1838a78 ("erofs: restrict pcluster size limitations") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com
2025-07-10erofs: allow readdir() to be interruptedChao Yu
In a quick slow device, readdir() may loop for long time in large directory, let's give a chance to allow it to be interrupted by userspace. Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250710073619.4083422-1-chao@kernel.org [ Gao Xiang: move cond_resched() to the end of the while loop. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10erofs: address D-cache aliasingGao Xiang
Flush the D-cache before unlocking folios for compressed inodes, as they are dirtied during decompression. Avoid calling flush_dcache_folio() on every CPU write, since it's more like playing whack-a-mole without real benefit. It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio() is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear for new page cache folios. However, certain ARM boards are affected, as reported. Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Closes: https://lore.kernel.org/r/c1e51e16-6cc6-49d0-a63e-4e9ff6c4dd53@pengutronix.de Closes: https://lore.kernel.org/r/38d43fae-1182-4155-9c5b-ffc7382d9917@siemens.com Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Tested-by: Stefan Kerkmann <s.kerkmann@pengutronix.de> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250709034614.2780117-2-hsiangkao@linux.alibaba.com
2025-07-10erofs: use memcpy_to_folio() to replace copy_to_iter()Gao Xiang
Using copy_to_iter() here is overkill and even messy. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250709034614.2780117-1-hsiangkao@linux.alibaba.com
2025-07-10erofs: fix to add missing tracepoint in erofs_read_folio()Chao Yu
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap") converts to use iomap interface, it removed trace_erofs_readpage() tracepoint in the meantime, let's add it back. Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250708111942.3120926-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10erofs: fix to add missing tracepoint in erofs_readahead()Chao Yu
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap") converts to use iomap interface, it removed trace_erofs_readahead() tracepoint in the meantime, let's add it back. Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250707084832.2725677-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-06-20erofs: remove a superfluous check for encoded extentsGao Xiang
It is possible when an inode is split into segments for multi-threaded compression, and the tail extent of a segment could also be small. Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250620153108.1368029-1-hsiangkao@linux.alibaba.com
2025-06-19fs: convert most other generic_file_*mmap() users to .mmap_prepare()Lorenzo Stoakes
Update nearly all generic_file_mmap() and generic_file_readonly_mmap() callers to use generic_file_mmap_prepare() and generic_file_readonly_mmap_prepare() respectively. We update blkdev, 9p, afs, erofs, ext2, nfs, ntfs3, smb, ubifs and vboxsf file systems this way. Remaining users we cannot yet update are ecryptfs, fuse and cramfs. The former two are nested file systems that must support any underlying file ssytem, and cramfs inserts a mixed mapping which currently requires a VMA. Once all file systems have been converted to mmap_prepare(), we can then update nested file systems. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/08db85970d89b17a995d2cffae96fb4cc462377f.1750099179.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-19erofs: refuse crafted out-of-file-range encoded extentsGao Xiang
Crafted encoded extents could record out-of-range `lstart`, which should not happen in normal cases. It caused an iomap_iter_done() complaint [1] reported by syzbot. [1] https://lore.kernel.org/r/684cb499.a00a0220.c6bd7.0010.GAE@google.com Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Reported-and-tested-by: syzbot+d8f000c609f05f52d9b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d8f000c609f05f52d9b5 Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250619032839.2642193-1-hsiangkao@linux.alibaba.com
2025-06-18erofs: impersonate the opener's credentials when accessing backing fileTatsuyuki Ishi
Previously, file operations on a file-backed mount used the current process' credentials to access the backing FD. Attempting to do so on Android lead to SELinux denials, as ACL rules on the backing file (e.g. /system/apex/foo.apex) is restricted to a small set of process. Arguably, this error is redundant and leaking implementation details, as access to files on a mount is already ACL'ed by path. Instead, override to use the opener's cred when accessing the backing file. This makes the behavior similar to a loop-backed mount, which uses kworker cred when accessing the backing file and does not cause SELinux denials. Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250612-b4-erofs-impersonate-v1-1-8ea7d6f65171@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-06-02Merge tag 'vfs-6.16-rc1.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs updates from Christian Brauner: - The main API document has been extensively updated/rewritten - Fix an oops in write-retry due to mis-resetting the I/O iterator - Fix the recording of transferred bytes for short DIO reads - Fix a request's work item to not require a reference, thereby avoiding the need to get rid of it in BH/IRQ context - Fix waiting and waking to be consistent about the waitqueue used - Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE, NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR, NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED - Reorder structs to eliminate holes - Remove netfs_io_request::ractl - Only provide proc_link field if CONFIG_PROC_FS=y - Remove folio_queue::marks3 - Fix undifferentiation of DIO reads from unbuffered reads * tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix undifferentiation of DIO reads from unbuffered reads netfs: Fix wait/wake to be consistent about the waitqueue used netfs: Fix the request's work item to not require a ref netfs: Fix setting of transferred bytes with short DIO reads netfs: Fix oops in write-retry from mis-resetting the subreq iterator fs/netfs: remove unused flag NETFS_RREQ_BLOCKED fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS folio_queue: remove unused field `marks3` fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y fs/netfs: remove `netfs_io_request.ractl` fs/netfs: reorder struct fields to eliminate holes fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH fs/netfs: remove unused source NETFS_INVALID_WRITE fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
2025-05-25erofs: support DEFLATE decompression by using Intel QATBo Liu
This patch introduces the use of the Intel QAT to offload EROFS data decompression, aiming to improve the decompression performance. A 285MiB dataset is used with the following command to create EROFS images with different cluster sizes: $ mkfs.erofs -zdeflate,level=9 -C{4096,16384,65536,131072,262144} Fio is used to test the following read patterns: $ fio -filename=testfile -bs=4k -rw=read -name=job1 $ fio -filename=testfile -bs=4k -rw=randread -name=job1 $ fio -filename=testfile -bs=4k -rw=randread --io_size=14m -name=job1 Here are some performance numbers for reference: Processors: Intel(R) Xeon(R) 6766E (144 cores) Memory: 512 GiB |-----------------------------------------------------------------------------| | | Cluster size | sequential read | randread | small randread(5%) | |-----------|--------------|-----------------|-----------|--------------------| | Intel QAT | 4096 | 538 MiB/s | 112 MiB/s | 20.76 MiB/s | | Intel QAT | 16384 | 699 MiB/s | 158 MiB/s | 21.02 MiB/s | | Intel QAT | 65536 | 917 MiB/s | 278 MiB/s | 20.90 MiB/s | | Intel QAT | 131072 | 1056 MiB/s | 351 MiB/s | 23.36 MiB/s | | Intel QAT | 262144 | 1145 MiB/s | 431 MiB/s | 26.66 MiB/s | | deflate | 4096 | 499 MiB/s | 108 MiB/s | 21.50 MiB/s | | deflate | 16384 | 422 MiB/s | 125 MiB/s | 18.94 MiB/s | | deflate | 65536 | 452 MiB/s | 159 MiB/s | 13.02 MiB/s | | deflate | 131072 | 452 MiB/s | 177 MiB/s | 11.44 MiB/s | | deflate | 262144 | 466 MiB/s | 194 MiB/s | 10.60 MiB/s | Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250522094931.28956-1-liubo03@inspur.com [ Gao Xiang: refine the commit message. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-23erofs: clean up erofs_{init,exit}_sysfs()Gao Xiang
Get rid of useless `goto`s. No logic changes. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250522084953.412096-1-hsiangkao@linux.alibaba.com
2025-05-22erofs: add 'fsoffset' mount option to specify filesystem offsetSheng Yong
When attempting to use an archive file, such as APEX on android, as a file-backed mount source, it fails because EROFS image within the archive file does not start at offset 0. As a result, a loop or a dm device is still needed to attach the image file at an appropriate offset first. Similarly, if an EROFS image within a block device does not start at offset 0, it cannot be mounted directly either. To address this issue, this patch adds a new mount option `fsoffset=x' to accept a start offset for the primary device. The offset should be aligned to the block size. EROFS will add this offset before performing read requests. Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250517090544.2687651-1-shengyong1@xiaomi.com [ Gao Xiang: minor update on documentation and the error message. ] Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-21netfs: Fix the request's work item to not require a refDavid Howells
When the netfs_io_request struct's work item is queued, it must be supplied with a ref to the work item struct to prevent it being deallocated whilst on the queue or whilst it is being processed. This is tricky to manage as we have to get a ref before we try and queue it and then we may find it's already queued and is thus already holding a ref - in which case we have to try and get rid of the ref again. The problem comes if we're in BH or IRQ context and need to drop the ref: if netfs_put_request() reduces the count to 0, we have to do the cleanup - but the cleanup may need to wait. Fix this by adding a new work item to the request, ->cleanup_work, and dispatching that when the refcount hits zero. That can then synchronously cancel any outstanding work on the main work item before doing the cleanup. Adding a new work item also deals with another problem upstream where it's sometimes changing the work func in the put function and requeuing it - which has occasionally in the past caused the cleanup to happen incorrectly. As a bonus, this allows us to get rid of the 'was_async' parameter from a bunch of functions. This indicated whether the put function might not be permitted to sleep. Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250519090707.2848510-4-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <stfrench@microsoft.com> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-16erofs: lazily initialize per-CPU workers and CPU hotplug hooksSandeep Dhavale
Currently, when EROFS is built with per-CPU workers, the workers are started and CPU hotplug hooks are registered during module initialization. This leads to unnecessary worker start/stop cycles during CPU hotplug events, particularly on Android devices that frequently suspend and resume. This change defers the initialization of per-CPU workers and the registration of CPU hotplug hooks until the first EROFS mount. This ensures that these resources are only allocated and managed when EROFS is actually in use. The tear down of per-CPU workers and unregistration of CPU hotplug hooks still occurs during z_erofs_exit_subsystem(), but only if they were initialized. Signed-off-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250506225743.308517-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-16erofs: refine readahead tracepointGao Xiang
- trace_erofs_readpages => trace_erofs_readahead; - Rename a redundant statement `nrpages = readahead_count(rac);`; - Move the tracepoint to the beginning of z_erofs_readahead(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250514120820.2739288-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-15erofs: avoid using multiple devices with different typeSheng Yong
For multiple devices, both primary and extra devices should be the same type. `erofs_init_device` has already guaranteed that if the primary is a file-backed device, extra devices should also be regular files. However, if the primary is a block device while the extra device is a file-backed device, `erofs_init_device` will get an ENOTBLK, which is not treated as an error in `erofs_fc_get_tree`, and that leads to an UAF: erofs_fc_get_tree get_tree_bdev_flags(erofs_fc_fill_super) erofs_read_superblock erofs_init_device // sbi->dif0 is not inited yet, // return -ENOTBLK deactivate_locked_super free(sbi) if (err is -ENOTBLK) sbi->dif0.file = filp_open() // sbi UAF So if -ENOTBLK is hitted in `erofs_init_device`, it means the primary device must be a block device, and the extra device is not a block device. The error can be converted to -EINVAL. Fixes: fb176750266a ("erofs: add file-backed mount support") Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250515014837.3315886-1-shengyong1@xiaomi.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-15erofs: fix file handle encoding for 64-bit NIDsHongbo Li
EROFS uses NID to indicate the on-disk inode offset, which can exceed 32 bits. However, the default encode_fh uses the ino32, thus it doesn't work if the image is larger than 128GiB. Let's introduce our own helpers to encode file handles. It's easy to reproduce: 1. prepare an erofs image with nid bigger than U32_MAX 2. mount -t erofs foo.img /mnt/erofs 3. set exportfs with configuration: /mnt/erofs *(rw,sync, no_root_squash) 4. mount -t nfs $IP:/mnt/erofs /mnt/nfs 5. md5sum /mnt/nfs/foo # foo is the file which nid bigger than U32_MAX. # you will get ESTALE error. In the case of overlayfs, the underlying filesystem's file handle is encoded in ovl_fb.fid, which is similar to NFS's case. If the NID of file is larger than U32_MAX, the overlay will get -ESTALE error when calls exportfs_decode_fh. Fixes: 3e917cc305c6 ("erofs: make filesystem exportable") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250507094015.14007-1-lihongbo22@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-05-07erofs: ensure the extra temporary copy is valid for shortened bvecsGao Xiang
When compressed data deduplication is enabled, multiple logical extents may reference the same compressed physical cluster. The previous commit 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") already avoids using shortened bvecs. However, in such cases, the extra temporary buffers also need to be preserved for later use in z_erofs_fill_other_copies() to to prevent data corruption. IOWs, extra temporary buffers have to be retained not only due to varying start relative offsets (`pageofs_out`, as indicated by `pcl->multibases`) but also because of shortened bvecs. android.hardware.graphics.composer@2.1.so : 270696 bytes 0: 0.. 204185 | 204185 : 628019200.. 628084736 | 65536 -> 1: 204185.. 225536 | 21351 : 544063488.. 544129024 | 65536 2: 225536.. 270696 | 45160 : 0.. 0 | 0 com.android.vndk.v28.apex : 93814897 bytes ... 364: 53869896..54095257 | 225361 : 543997952.. 544063488 | 65536 -> 365: 54095257..54309344 | 214087 : 544063488.. 544129024 | 65536 366: 54309344..54514557 | 205213 : 544129024.. 544194560 | 65536 ... Both 204185 and 54095257 have the same start relative offset of 3481, but the logical page 55 of `android.hardware.graphics.composer@2.1.so` ranges from 225280 to 229632, forming a shortened bvec [225280, 225536) that cannot be used for decompressing the range from 54095257 to 54309344 of `com.android.vndk.v28.apex`. Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy` on demand for simplicity. Again, this issue can only lead to data corruption if `-Ededupe` is on. Fixes: 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com
2025-04-30erofs: remove unused enum typeHongbo Li
Opt_err is not used in EROFS, we can remove it. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Link: https://lore.kernel.org/r/20250429075056.689570-1-lihongbo22@huawei.com Signed-off-by: Gao Xiang <xiang@kernel.org>
2025-04-29fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()Max Kellermann
If bio_add_folio() fails (because it is full), erofs_fileio_scan_folio() needs to submit the I/O request via erofs_fileio_rq_submit() and allocate a new I/O request with an empty `struct bio`. Then it retries the bio_add_folio() call. However, at this point, erofs_onlinefolio_split() has already been called which increments `folio->private`; the retry will call erofs_onlinefolio_split() again, but there will never be a matching erofs_onlinefolio_end() call. This leaves the folio locked forever and all waiters will be stuck in folio_wait_bit_common(). This bug has been added by commit ce63cb62d794 ("erofs: support unencoded inodes for fileio"), but was practically unreachable because there was room for 256 folios in the `struct bio` - until commit 9f74ae8c9ac9 ("erofs: shorten bvecs[] for file-backed mounts") which reduced the array capacity to 16 folios. It was now trivial to trigger the bug by manually invoking readahead from userspace, e.g.: posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED); This should be fixed by invoking erofs_onlinefolio_split() only after bio_add_folio() has succeeded. This is safe: asynchronous completions invoking erofs_onlinefolio_end() will not unlock the folio because erofs_fileio_scan_folio() is still holding a reference to be released by erofs_onlinefolio_end() at the end. Fixes: ce63cb62d794 ("erofs: support unencoded inodes for fileio") Fixes: 9f74ae8c9ac9 ("erofs: shorten bvecs[] for file-backed mounts") Cc: stable@vger.kernel.org Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Tested-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250428230933.3422273-1-max.kellermann@ionos.com Signed-off-by: Gao Xiang <xiang@kernel.org>
2025-04-13Merge tag 'erofs-for-6.15-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: - Properly handle errors when file-backed I/O fails - Fix compilation issues on ARM platform (arm-linux-gnueabi) - Fix parsing of encoded extents - Minor cleanup * tag 'erofs-for-6.15-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: remove duplicate code erofs: fix encoded extents handling erofs: add __packed annotation to union(__le16..) erofs: set error to bio if file-backed IO fails
2025-04-10erofs: remove duplicate codeBo Liu
Remove duplicate code in function z_erofs_register_pcluster() Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250410042048.3044-2-liubo03@inspur.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-04-09erofs: fix encoded extents handlingGao Xiang
- The MSB 32 bits of `z_fragmentoff` are available only in extent records of size >= 8B. - Use round_down() to calculate `lstart` as well as increase `pos` correspondingly for extent records of size == 8B. Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250408114448.4040220-2-hsiangkao@linux.alibaba.com
2025-04-09erofs: add __packed annotation to union(__le16..)Gao Xiang
I'm unsure why they aren't 2 bytes in size only in arm-linux-gnueabi. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202504051202.DS7QIknJ-lkp@intel.com Fixes: 61ba89b57905 ("erofs: add 48-bit block addressing on-disk support") Fixes: efb2aef569b3 ("erofs: add encoded extent on-disk definition") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250408114448.4040220-1-hsiangkao@linux.alibaba.com
2025-04-09erofs: set error to bio if file-backed IO failsSheng Yong
If a file-backed IO fails before submitting the bio to the lower filesystem, an error is returned, but the bio->bi_status is not marked as an error. However, the error information should be passed to the end_io handler. Otherwise, the IO request will be treated as successful. Fixes: 283213718f5d ("erofs: support compressed inodes for fileio") Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250408122351.2104507-1-shengyong1@xiaomi.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-04-04lib/crc: remove CONFIG_LIBCRC32CEric Biggers
Now that LIBCRC32C does nothing besides select CRC32, make every option that selects LIBCRC32C instead select CRC32 directly. Then remove LIBCRC32C. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-03-17erofs: enable 48-bit layout supportGao Xiang
Both 48-bit block addressing and encoded extents are implemented, let's enable them formally. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095625.2623817-1-hsiangkao@linux.alibaba.com
2025-03-17erofs: support unaligned encoded dataGao Xiang
We're almost there. It's straight-forward to adapt the current decompression subsystem to support unaligned encoded (compressed) data. Note that unaligned data is not encouraged because of worse I/O and caching efficiency unless the corresponding compressor doesn't support fixed-sized output compression natively like Zstd. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-10-hsiangkao@linux.alibaba.com
2025-03-17erofs: implement encoded extent metadataGao Xiang
Implement the extent metadata parsing described in the previous commit. For 16-byte and 32-byte extent records, currently it is just a trivial binary search without considering the last access footprint, but it can be optimized for better sequential performance later. Tail fragments are supported, but ztailpacking feature is not for simplicity. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-9-hsiangkao@linux.alibaba.com
2025-03-17erofs: add encoded extent on-disk definitionGao Xiang
Previously, EROFS provided both (non-)compact compressed indexes to keep necessary hints for each logical block, enabling O(1) random indexing. This approach was originally designed for small compression units (e.g., 4KiB), where compressed data is strictly block-aligned via fixed-sized output compression. However, EROFS now supports big pclusters up to 1MiB and many users use large configurations to minimize image sizes. For such configurations, the total number of extents decreases significantly (e.g., only 1,024 extents for a 1GiB file using 1MiB pclusters), then runtime metadata overhead becomes negligible compared to data I/O and decoding costs. Additionally, some popular compression algorithm (mainly Zstd) still lacks native fixed-sized output compression support (although it's planned by their authors). Instead of just waiting for compressor improvements, let's adopt byte-oriented extents, allowing these compressors to retain their current methods. For example, it speeds up Zstd compression a lot: Processor: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz * 96 Dataset: enwik9 Build time Size Type Command Line 3m52.339s 266653696 FO -C524288 -zzstd,22 3m48.549s 266174464 FO -E48bit -C524288 -zzstd,22 0m12.821s 272134144 FI -E48bit -C1048576 --max-extent-bytes=1048576 -zzstd,22 0m14.528s 248987648 FO -C1048576 -zlzma,9 0m14.605s 248504320 FO -E48bit -C1048576 -zlzma,9 Encoded extents are structured as an array of `struct z_erofs_extent`, sorted by logical address in ascending order: __le32 plen // encoded length, algorithm id and flags __le32 pstart_lo // physical offset LSB __le32 pstart_hi // physical offset MSB __le32 lstart_lo // logical offset __le32 lstart_hi // logical offset MSB .. Note that prefixed reduced records can be used to minimize metadata for specific cases (e.g. lstart less than 32 bits, then 32 to 16 bytes). If the logical lengths of all encoded extents are the same, 4-byte (plen) and 8-byte (plen, pstart_lo) records can be used. Or, 16-byte (plen .. lstart_lo) and 32-byte full records have to be used instead. If 16-byte and 32-byte records are used, the total number of extents is kept in `struct z_erofs_map_header`, and binary search can be applied on them. Note that `eytzinger order` is not considerd because data sequential access is important. If 4-byte records are used, 8-byte start physical offset is between `struct z_erofs_map_header` and the `plen` array. In addition, 64-bit physical offsets can be applied with new encoded extent format to match full 48-bit block addressing. Remove redundant comments around `struct z_erofs_lcluster_index` too. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-8-hsiangkao@linux.alibaba.com
2025-03-17erofs: initialize decompression earlyGao Xiang
- Rename erofs_init_managed_cache() to z_erofs_init_super(); - Move the initialization of managed_pslots into z_erofs_init_super() too; - Move z_erofs_init_super() and packed inode preparation upwards, before the root inode initialization. Therefore, the root directory can also be compressible. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250317054840.3483000-1-hsiangkao@linux.alibaba.com
2025-03-17erofs: support dot-omitted directoriesGao Xiang
There's no need to record "." dirents in the directory data (while they could be used for sanity checks, they aren't very useful.) Omitting "." dirents also improves directory data deduplication. Use a per-inode (instead of per-sb) flag to indicate if the "." dirent is omitted or not, ensuring compatibility with incremental builds. It also reuses EROFS_I_NLINK_1_BIT, as it has very limited use cases for directories with `nlink = 1`. Emit the "." entry as the last virtual dirent in the directory because it is _much_ less frequently used than the ".." dirent. It also keeps `f_pos` meaningful, as it strictly follows the directory data when it's less than i_size. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-6-hsiangkao@linux.alibaba.com
2025-03-17erofs: implement 48-bit block addressing for unencoded inodesGao Xiang
It adapts the on-disk changes from the previous commit. It also supports EROFS_NULL_ADDR (all 1's) for EROFS_INODE_FLAT_PLAIN inodes to indicate 0-filled inodes, as it's common for composefs use cases. As a result, EROFS_INODE_CHUNK_BASED is no longer needed. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-5-hsiangkao@linux.alibaba.com
2025-03-17erofs: add 48-bit block addressing on-disk supportGao Xiang
The current 32-bit block addressing limits EROFS to a 16TiB maximum volume size with 4KiB blocks. However, several new use cases now require larger capacity support: - Massive datasets for model training in order to boost random sampling performance for each epoch; - Object storage clients using EROFS direct passthrough. This extends core on-disk structures to support 48-bit block addressing, such as inodes, device slots, and inode chunks. Additionally: - Expand superblock root NID to 8-byte `rootnid_8b` to enable full out-of-place update incremental builds; - Introduce `epoch` field in the superblock as well as add `mtime` field to 32-byte compact inodes for basic timestamp support. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-4-hsiangkao@linux.alibaba.com
2025-03-17erofs: simplify erofs_{read,fill}_inode()Gao Xiang
- Switch to on-stack `copied` since it's just 64 bytes; - Get rid of `nblks` and derive `i_blocks` directly; - Use `inode_set_mtime()` instead of `inode_set_ctime()` to follow the ondisk naming; - Rearrange the code. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-3-hsiangkao@linux.alibaba.com
2025-03-17erofs: get rid of erofs_map_blocks_flatmode()Gao Xiang
It's simple enough to be folded into erofs_map_blocks(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-2-hsiangkao@linux.alibaba.com
2025-03-17erofs: move {in,out}pages into struct z_erofs_decompress_reqGao Xiang
It seems that all compressors need those two values, so just move them into the common structure. `struct z_erofs_lz4_decompress_ctx` can be dropped too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250305124007.1810731-1-hsiangkao@linux.alibaba.com