summaryrefslogtreecommitdiff
path: root/Documentation/filesystems/porting.rst
AgeCommit message (Collapse)Author
3 daysMerge tag 'vfs-6.17-rc1.mmap_prepare' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mmap_prepare updates from Christian Brauner: "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). This is preferred to the existing f_op->mmap() hook as it does require a VMA to be established yet, thus allowing the mmap logic to invoke this hook far, far earlier, prior to inserting a VMA into the virtual address space, or performing any other heavy handed operations. This allows for much simpler unwinding on error, and for there to be a single attempt at merging a VMA rather than having to possibly reattempt a merge based on potentially altered VMA state. Far more importantly, it prevents inappropriate manipulation of incompletely initialised VMA state, which is something that has been the cause of bugs and complexity in the past. The intent is to gradually deprecate f_op->mmap, and in that vein this series coverts the majority of file systems to using f_op->mmap_prepare. Prerequisite steps are taken - firstly ensuring all checks for mmap capabilities use the file_has_valid_mmap_hooks() helper rather than directly checking for f_op->mmap (which is now not a valid check) and secondly updating daxdev_mapping_supported() to not require a VMA parameter to allow ext4 and xfs to be converted. Commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") handles the nasty edge-case of nested file systems like overlayfs, which introduces a compatibility shim to allow f_op->mmap_prepare() to be invoked from an f_op->mmap() callback. This allows for nested filesystems to continue to function correctly with all file systems regardless of which callback is used. Once we finally convert all file systems, this shim can be removed. As a result, ecryptfs, fuse, and overlayfs remain unaltered so they can nest all other file systems. We additionally do not update resctl - as this requires an update to remap_pfn_range() (or an alternative to it) which we defer to a later series, equally we do not update cramfs which needs a mixed mapping insertion with the same issue, nor do we update procfs, hugetlbfs, syfs or kernfs all of which require VMAs for internal state and hooks. We shall return to all of these later" * tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: update porting, vfs documentation to describe mmap_prepare() fs: replace mmap hook with .mmap_prepare for simple mappings fs: convert most other generic_file_*mmap() users to .mmap_prepare() fs: convert simple use of generic_file_*_mmap() to .mmap_prepare() mm/filemap: introduce generic_file_*_mmap_prepare() helpers fs/xfs: transition from deprecated .mmap hook to .mmap_prepare fs/ext4: transition from deprecated .mmap hook to .mmap_prepare fs/dax: make it possible to check dev dax support without a VMA fs: consistently use can_mmap_file() helper mm/nommu: use file_has_valid_mmap_hooks() helper mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
3 daysMerge tag 'vfs-6.17-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull async directory updates from Christian Brauner: "This contains preparatory changes for the asynchronous directory locking scheme. While the locking scheme is still very much controversial and we're still far away from landing any actual changes in that area the preparatory work that we've been upstreaming for a while now has been very useful. This is another set of minor changes and cleanups" * tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs: use lookup_one_unlocked() coda: use iterate_dir() in coda_readdir() VFS: Minor fixes for porting.rst VFS: merge lookup_one_qstr_excl_raw() back into lookup_one_qstr_excl()
3 daysMerge tag 'pull-dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dentry d_flags updates from Al Viro: "The current exclusion rules for dentry->d_flags stores are rather unpleasant. The basic rules are simple: - stores to dentry->d_flags are OK under dentry->d_lock - stores to dentry->d_flags are OK in the dentry constructor, before becomes potentially visible to other threads Unfortunately, there's a couple of exceptions to that, and that's where the headache comes from. The main PITA comes from d_set_d_op(); that primitive sets ->d_op of dentry and adjusts the flags that correspond to presence of individual methods. It's very easy to misuse; existing uses _are_ safe, but proof of correctness is brittle. Use in __d_alloc() is safe (we are within a constructor), but we might as well precalculate the initial value of 'd_flags' when we set the default ->d_op for given superblock and set 'd_flags' directly instead of messing with that helper. The reasons why other uses are safe are bloody convoluted; I'm not going to reproduce it here. See [1] for gory details, if you care. The critical part is using d_set_d_op() only just prior to d_splice_alias(), which makes a combination of d_splice_alias() with setting ->d_op, etc a natural replacement primitive. Better yet, if we go that way, it's easy to take setting ->d_op and modifying 'd_flags' under ->d_lock, which eliminates the headache as far as 'd_flags' exclusion rules are concerned. Other exceptions are minor and easy to deal with. What this series does: - d_set_d_op() is no longer available; instead a new primitive (d_splice_alias_ops()) is provided, equivalent to combination of d_set_d_op() and d_splice_alias(). - new field of struct super_block - 's_d_flags'. This sets the default value of 'd_flags' to be used when allocating dentries on this filesystem. - new primitive for setting 's_d_op': set_default_d_op(). This replaces stores to 's_d_op' at mount time. All in-tree filesystems converted; out-of-tree ones will get caught by the compiler ('s_d_op' is renamed, so stores to it will be caught). 's_d_flags' is set by the same primitive to match the 's_d_op'. - a lot of filesystems had sb->s_d_op->d_delete equal to always_delete_dentry; that is equivalent to setting DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well set that bit in 's_d_flags' and drop 'd_delete()' from dentry_operations. In quite a few cases that results in empty dentry_operations, which means that we can get rid of those. - kill simple_dentry_operations - not needed anymore - massage d_alloc_parallel() to get rid of the other exception wrt 'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry; no need to delay that until we commit to using the sucker. As the result, 'd_flags' stores are all either under ->d_lock or done before the dentry becomes visible in any shared data structures" Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1] * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits) configfs: use DCACHE_DONTCACHE debugfs: use DCACHE_DONTCACHE efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry() 9p: don't bother with always_delete_dentry ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE kill simple_dentry_operations devpts, sunrpc, hostfs: don't bother with ->d_op shmem: no dentry retention past the refcount reaching zero d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier make d_set_d_op() static simple_lookup(): just set DCACHE_DONTCACHE tracefs: Add d_delete to remove negative dentries set_default_d_op(): calculate the matching value for ->d_flags correct the set of flags forbidden at d_set_d_op() time split d_flags calculation out of d_set_d_op() new helper: set_default_d_op() fuse: no need for special dentry_operations for root dentry switch procfs from d_set_d_op() to d_splice_alias_ops() new helper: d_splice_alias_ops() procfs: kill ->proc_dops ...
8 daysdoc: update porting, vfs documentation to describe mmap_prepare()Lorenzo Stoakes
Now that we have established .mmap_prepare() as the preferred means by which filesystems establish state upon memory mapping of a file, update the VFS and porting documentation to reflect this. As part of this change, additionally update the VFS documentation to contain the current state of the file_operations struct. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/20250723123036.35472-1-lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23replace collect_mounts()/drop_collected_mounts() with a safer variantAl Viro
collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving filesystem not mounted anywhere and yet still busy). A saner approach is to give caller an array of struct path that would pin every mount in a subtree, without cloning any mounts. * collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone * collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount,root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning references to the roots of subtrees in the caller's namespace. Array is terminated by {NULL, NULL} struct path. If it fits into preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL). * drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and array is kfree'd if it's not equal to 'preallocated'. * instead of iterate_mounts(), users should just iterate over array of struct path - nothing exotic is needed for that. Existing users (all in audit_tree.c) are converted. [folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>] Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry") Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11make d_set_d_op() staticAl Viro
Convert the last user (d_alloc_pseudo()) and be done with that. Any out-of-tree filesystem using it should switch to d_splice_alias_ops() or, better yet, check whether it really needs to have ->d_op vary among its dentries. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11VFS: Minor fixes for porting.rstNeilBrown
This paragraph was relevant for an earlier version of the code which passed the qstr as a struct instead of a point. The version that landed passed the pointer in all cases so this para is now pointless. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/20250608230952.20539-3-neil@brown.name Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-10new helper: set_default_d_op()Al Viro
... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-30Merge tag 'pull-automount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull automount updates from Al Viro: "Automount wart removal A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attched mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount. Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances. vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives" * tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill vfs_submount() saner calling conventions for ->d_automount()
2025-05-05saner calling conventions for ->d_automount()Al Viro
Currently the calling conventions for ->d_automount() instances have an odd wart - returned new mount to be attached is expected to have refcount 2. That kludge is intended to make sure that mark_mounts_for_expiry() called before we get around to attaching that new mount to the tree won't decide to take it out. finish_automount() drops the extra reference after it's done with attaching mount to the tree - or drops the reference twice in case of error. ->d_automount() instances have rather counterintuitive boilerplate in them. There's a much simpler approach: have mark_mounts_for_expiry() skip the mounts that are yet to be mounted. And to hell with grabbing/dropping those extra references. Makes for simpler correctness analysis, at that... Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Acked-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-04-08Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFSNeilBrown
try_lookup_noperm() and d_hash_and_lookup() are nearly identical. The former does some validation of the name where the latter doesn't. Outside of the VFS that validation is likely valuable, and having only one exported function for this task is certainly a good idea. So make d_hash_and_lookup() local to VFS files and change all other callers to try_lookup_noperm(). Note that the arguments are swapped. Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250319031545.2999807-6-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08VFS: rename lookup_one_len family to lookup_noperm and remove permission checkNeilBrown
The lookup_one_len family of functions is (now) only used internally by a filesystem on itself either - in a context where permission checking is irrelevant such as by a virtual filesystem populating itself, or xfs accessing its ORPHANAGE or dquota accessing the quota file; or - in a context where a permission check (MAY_EXEC on the parent) has just been performed such as a network filesystem finding in "silly-rename" file in the same directory. This is also the context after the _parentat() functions where currently lookup_one_qstr_excl() is used. So the permission check is pointless. The name "one_len" is unhelpful in understanding the purpose of these functions and should be changed. Most of the callers pass the len as "strlen()" so using a qstr and QSTR() can simplify the code. This patch renames these functions (include lookup_positive_unlocked() which is part of the family despite the name) to have a name based on "lookup_noperm". They are changed to receive a 'struct qstr' instead of separate name and len. In a few cases the use of QSTR() results in a new call to strlen(). try_lookup_noperm() takes a pointer to a qstr instead of the whole qstr. This is consistent with d_hash_and_lookup() (which is nearly identical) and useful for lookup_noperm_unlocked(). The new lookup_noperm_common() doesn't take a qstr yet. That will be tidied up in a subsequent patch. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-5-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07VFS: improve interface for lookup_one functionsNeilBrown
The family of functions: lookup_one() lookup_one_unlocked() lookup_one_positive_unlocked() appear designed to be used by external clients of the filesystem rather than by filesystems acting on themselves as the lookup_one_len family are used. They are used by: btrfs/ioctl - which is a user-space interface rather than an internal activity exportfs - i.e. from nfsd or the open_by_handle_at interface overlayfs - at access the underlying filesystems smb/server - for file service They should be used by nfsd (more than just the exportfs path) and cachefs but aren't. It would help if the documentation didn't claim they should "not be called by generic code". Also the path component name is passed as "name" and "len" which are (confusingly?) separate by the "base". In some cases the len in simply "strlen" and so passing a qstr using QSTR() would make the calling clearer. Other callers do pass separate name and len which are stored in a struct. Sometimes these are already stored in a qstr, other times it easily could be. So this patch changes these three functions to receive a 'struct qstr *', and improves the documentation. QSTR_LEN() is added to make it easy to pass a QSTR containing a known len. [brauner@kernel.org: take a struct qstr pointer] Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-24Merge tag 'vfs-6.15-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async dir updates from Christian Brauner: "This contains cleanups that fell out of the work from async directory handling: - Change kern_path_locked() and user_path_locked_at() to never return a negative dentry. This simplifies the usability of these helpers in various places - Drop d_exact_alias() from the remaining place in NFS where it is still used. This also allows us to drop the d_exact_alias() helper completely - Drop an unnecessary call to fh_update() from nfsd_create_locked() - Change i_op->mkdir() to return a struct dentry Change vfs_mkdir() to return a dentry provided by the filesystems which is hashed and positive. This allows us to reduce the number of cases where the resulting dentry is not positive to very few cases. The code in these places becomes simpler and easier to understand. - Repack DENTRY_* and LOOKUP_* flags" * tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: fix inline emphasis warning VFS: Change vfs_mkdir() to return the dentry. nfs: change mkdir inode_operation to return alternate dentry if needed. fuse: return correct dentry for ->mkdir ceph: return the correct dentry on mkdir hostfs: store inode in dentry after mkdir if possible. Change inode_operations.mkdir to return struct dentry * nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked() nfs/vfs: discard d_exact_alias() VFS: add common error checks to lookup_one_qstr_excl() VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry VFS: repack LOOKUP_ bit flags. VFS: repack DENTRY_ flags.
2025-03-08vfs: Remove invalidate_inodes()Jan Kara
The function can be replaced by evict_inodes. The only difference is that evict_inodes() skips the inodes with positive refcount without touching ->i_lock, but they are equivalent as evict_inodes() repeats the refcount check after having grabbed ->i_lock. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05doc: fix inline emphasis warningChristian Brauner
Fix a warning spotted by linux-next build (htmldocs): Documentation/filesystems/porting.rst:1186: WARNING: Inline emphasis start-string without end-string. [docutils] Introduced by commit 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *") Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27Change inode_operations.mkdir to return struct dentry *NeilBrown
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-19VFS: add common error checks to lookup_one_qstr_excl()NeilBrown
Callers of lookup_one_qstr_excl() often check if the result is negative or positive. These changes can easily be moved into lookup_one_qstr_excl() by checking the lookup flags: LOOKUP_CREATE means it is NOT an error if the name doesn't exist. LOOKUP_EXCL means it IS an error if the name DOES exist. This patch adds these checks, then removes error checks from callers, and ensures that appropriate flags are passed. This subtly changes the meaning of LOOKUP_EXCL. Previously it could only accompany LOOKUP_CREATE. Now it can accompany LOOKUP_RENAME_TARGET as well. A couple of small changes are needed to accommodate this. The NFS change is functionally a no-op but ensures nfs_is_exclusive_create() does exactly what the name says. Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250217003020.3170652-3-neilb@suse.de Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-19VFS: change kern_path_locked() and user_path_locked_at() to never return ↵NeilBrown
negative dentry No callers of kern_path_locked() or user_path_locked_at() want a negative dentry. So change them to return -ENOENT instead. This simplifies callers. This results in a subtle change to bcachefs in that an ioctl will now return -ENOENT in preference to -EXDEV. I believe this restores the behaviour to what it was prior to Commit bbe6a7c899e7 ("bch2_ioctl_subvolume_destroy(): fix locking") Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250217003020.3170652-2-neilb@suse.de Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-30Merge tag 'pull-revalidate' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_revalidate updates from Al Viro: "Provide stable parent and name to ->d_revalidate() instances Most of the filesystem methods where we care about dentry name and parent have their stability guaranteed by the callers; ->d_revalidate() is the major exception. It's easy enough for callers to supply stable values for expected name and expected parent of the dentry being validated. That kills quite a bit of boilerplate in ->d_revalidate() instances, along with a bunch of races where they used to access ->d_name without sufficient precautions" * tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: 9p: fix ->rename_sem exclusion orangefs_d_revalidate(): use stable parent inode and name passed by caller ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller nfs: fix ->d_revalidate() UAF on ->d_name accesses nfs{,4}_lookup_validate(): use stable parent inode passed by caller gfs2_drevalidate(): use stable parent inode and name passed by caller fuse_dentry_revalidate(): use stable parent inode and name passed by caller vfat_revalidate{,_ci}(): use stable parent inode passed by caller exfat_d_revalidate(): use stable parent inode passed by caller fscrypt_d_revalidate(): use stable parent inode passed by caller ceph_d_revalidate(): propagate stable name down into request encoding ceph_d_revalidate(): use stable parent inode passed by caller afs_d_revalidate(): use stable name and parent inode passed by caller Pass parent directory inode and expected name to ->d_revalidate() generic_ci_d_compare(): use shortname_storage ext4 fast_commit: make use of name_snapshot primitives dissolve external_name.u into separate members make take_dentry_name_snapshot() lockless dcache: back inline names with a struct-wrapped array of unsigned long make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-27Pass parent directory inode and expected name to ->d_revalidate()Al Viro
->d_revalidate() often needs to access dentry parent and name; that has to be done carefully, since the locking environment varies from caller to caller. We are not guaranteed that dentry in question will not be moved right under us - not unless the filesystem is such that nothing on it ever gets renamed. It can be dealt with, but that results in boilerplate code that isn't even needed - the callers normally have just found the dentry via dcache lookup and want to verify that it's in the right place; they already have the values of ->d_parent and ->d_name stable. There is a couple of exceptions (overlayfs and, to less extent, ecryptfs), but for the majority of calls that song and dance is not needed at all. It's easier to make ecryptfs and overlayfs find and pass those values if there's a ->d_revalidate() instance to be called, rather than doing that in the instances. This commit only changes the calling conventions; making use of supplied values is left to followups. NOTE: some instances need more than just the parent - things like CIFS may need to build an entire path from filesystem root, so they need more precautions than the usual boilerplate. This series doesn't do anything to that need - these filesystems have to keep their locking mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem a-la v9fs). One thing to keep in mind when using name is that name->name will normally point into the pathname being resolved; the filename in question occupies name->len bytes starting at name->name, and there is NUL somewhere after it, but it the next byte might very well be '/' rather than '\0'. Do not ignore name->len. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-12-13Documentation: Fix simple typo on filesystems/porting.rstCarlos Maiolino
Just spotted this while reading the doc. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20241213151743.23435-1-cem@kernel.org
2024-10-21reiserfs: The last commitJan Kara
Deprecation period of reiserfs ends with the end of this year so it is time to remove it from the kernel. Acked-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
2024-05-21Merge tag 'pull-set_blocksize' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
2024-05-02make set_blocksize() fail unless block device is opened exclusiveAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02Docs: typos/spellingRemington Brasga
Fix spelling and grammar in Docs descriptions Signed-off-by: Remington Brasga <rbrasga@uci.edu> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20240429225527.2329-1-rbrasga@uci.edu
2024-01-11Merge tag 'pull-dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: "Change of locking rules for __dentry_kill(), regularized refcounting rules in that area, assorted cleanups and removal of weird corner cases (e.g. now ->d_iput() on child is always called before the parent might hit __dentry_kill(), etc)" * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) dcache: remove unnecessary NULL check in dget_dlock() kill DCACHE_MAY_FREE __d_unalias() doesn't use inode argument d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant get rid of DCACHE_GENOCIDE d_genocide(): move the extern into fs/internal.h simple_fill_super(): don't bother with d_genocide() on failure nsfs: use d_make_root() d_alloc_pseudo(): move setting ->d_op there from the (sole) caller kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller retain_dentry(): introduce a trimmed-down lockless variant __dentry_kill(): new locking scheme d_prune_aliases(): use a shrink list switch select_collect{,2}() to use of to_shrink_list() to_shrink_list(): call only if refcount is 0 fold dentry_kill() into dput() don't try to cut corners in shrink_lock_dentry() fold the call of retain_dentry() into fast_dput() Call retain_dentry() with refcount 0 dentry_kill(): don't bother with retain_dentry() on slow path ...
2024-01-11Merge tag 'pull-rename' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull rename updates from Al Viro: "Fix directory locking scheme on rename This was broken in 6.5; we really can't lock two unrelated directories without holding ->s_vfs_rename_mutex first and in case of same-parent rename of a subdirectory 6.5 ends up doing just that" * tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: rename(): avoid a deadlock in the case of parents having no common ancestor kill lock_two_inodes() rename(): fix the locking of subdirectories f2fs: Avoid reading renamed directory if parent does not change ext4: don't access the source subdirectory content on same-directory rename ext2: Avoid reading renamed directory if parent does not change udf_rename(): only access the child content on cross-directory rename ocfs2: Avoid touching renamed directory if parent does not change reiserfs: Avoid touching renamed directory if parent does not change
2023-11-25rename(): avoid a deadlock in the case of parents having no common ancestorAl Viro
... and fix the directory locking documentation and proof of correctness. Holding ->s_vfs_rename_mutex *almost* prevents ->d_parent changes; the case where we really don't want it is splicing the root of disconnected tree to somewhere. In other words, ->s_vfs_rename_mutex is sufficient to stabilize "X is an ancestor of Y" only if X and Y are already in the same tree. Otherwise it can go from false to true, and one can construct a deadlock on that. Make lock_two_directories() report an error in such case and update the callers of lock_rename()/lock_rename_child() to handle such errors. And yes, such conditions are not impossible to create ;-/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25rename(): fix the locking of subdirectoriesAl Viro
We should never lock two subdirectories without having taken ->s_vfs_rename_mutex; inode pointer order or not, the "order" proposed in 28eceeda130f "fs: Lock moved directories" is not transitive, with the usual consequences. The rationale for locking renamed subdirectory in all cases was the possibility of race between rename modifying .. in a subdirectory to reflect the new parent and another thread modifying the same subdirectory. For a lot of filesystems that's not a problem, but for some it can lead to trouble (e.g. the case when short directory contents is kept in the inode, but creating a file in it might push it across the size limit and copy its contents into separate data block(s)). However, we need that only in case when the parent does change - otherwise ->rename() doesn't need to do anything with .. entry in the first place. Some instances are lazy and do a tautological update anyway, but it's really not hard to avoid. Amended locking rules for rename(): find the parent(s) of source and target if source and target have the same parent lock the common parent else lock ->s_vfs_rename_mutex lock both parents, in ancestor-first order; if neither is an ancestor of another, lock the parent of source first. find the source and target. if source and target have the same parent if operation is an overwriting rename of a subdirectory lock the target subdirectory else if source is a subdirectory lock the source if target is a subdirectory lock the target lock non-directories involved, in inode pointer order if both source and target are such. That way we are guaranteed that parents are locked (for obvious reasons), that any renamed non-directory is locked (nfsd relies upon that), that any victim is locked (emptiness check needs that, among other things) and subdirectory that changes parent is locked (needed to protect the update of .. entries). We are also guaranteed that any operation locking more than one directory either takes ->s_vfs_rename_mutex or locks a parent followed by its child. Cc: stable@vger.kernel.org Fixes: 28eceeda130f "fs: Lock moved directories" Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25__dentry_kill(): new locking schemeAl Viro
Currently we enter __dentry_kill() with parent (along with the victim dentry and victim's inode) held locked. Then we mark dentry refcount as dead call ->d_prune() remove dentry from hash remove it from the parent's list of children unlock the parent, don't need it from that point on detach dentry from inode, unlock dentry and drop the inode (via ->d_iput()) call ->d_release() regain the lock on dentry check if it's on a shrink list (in which case freeing its empty husk has to be left to shrink_dentry_list()) or not (in which case we can free it ourselves). In the former case, mark it as an empty husk, so that shrink_dentry_list() would know it can free the sucker. drop the lock on dentry ... and usually the caller proceeds to drop a reference on the parent, possibly retaking the lock on it. That is painful for a bunch of reasons, starting with the need to take locks out of order, but not limited to that - the parent of positive dentry can change if we drop its ->d_lock, so getting these locks has to be done with care. Moreover, as soon as dentry is out of the parent's list of children, shrink_dcache_for_umount() won't see it anymore, making it appear as if the parent is inexplicably busy. We do work around that by having shrink_dentry_list() decrement the parent's refcount first and put it on shrink list to be evicted once we are done with __dentry_kill() of child, but that may in some cases lead to ->d_iput() on child called after the parent got killed. That doesn't happen in cases where in-tree ->d_iput() instances might want to look at the parent, but that's brittle as hell. Solution: do removal from the parent's list of children in the very end of __dentry_kill(). As the result, the callers do not need to lock the parent and by the time we really need the parent locked, dentry is negative and is guaranteed not to be moved around. It does mean that ->d_prune() will be called with parent not locked. It also means that we might see dentries in process of being torn down while going through the parent's list of children; those dentries will be unhashed, negative and with refcount marked dead. In practice, that's enough for in-tree code that looks through the list of children to do the right thing as-is. Out-of-tree code might need to be adjusted. Calling conventions: __dentry_kill(dentry) is called with dentry->d_lock held, along with ->i_lock of its inode (if any). It either returns the parent (locked, with refcount decremented to 0) or NULL (if there'd been no parent or if refcount decrement for parent hadn't reached 0). lock_for_kill() is adjusted for new requirements - it doesn't touch the parent's ->d_lock at all. Callers adjusted. Note that for dput() we don't need to bother with fast_dput() for the parent - we just need to check retain_dentry() for it, since its ->d_lock is still held since the moment when __dentry_kill() had taken it to remove the victim from the list of children. The kludge with early decrement of parent's refcount in shrink_dentry_list() is no longer needed - shrink_dcache_for_umount() sees the half-killed dentries in the list of children for as long as they are pinning the parent. They are easily recognized and accounted for by select_collect(), so we know we are not done yet. As the result, we always have the expected ordering for ->d_iput()/->d_release() vs. __dentry_kill() of the parent, no exceptions. Moreover, the current rules for shrink lists (one must make sure that shrink_dcache_for_umount() won't happen while any dentries from the superblock in question are on any shrink lists) are gone - shrink_dcache_for_umount() will do the right thing in all cases, taking such dentries out. Their empty husks (memory occupied by struct dentry itself + its external name, if any) will remain on the shrink lists, but they are no obstacles to filesystem shutdown. And such husks will get freed as soon as shrink_dentry_list() of the list they are on gets to them. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25Call retain_dentry() with refcount 0Al Viro
Instead of bumping it from 0 to 1, calling retain_dentry(), then decrementing it back to 0 (with ->d_lock held all the way through), just leave refcount at 0 through all of that. It will have a visible effect for ->d_delete() - now it can be called with refcount 0 instead of 1 and it can no longer play silly buggers with dropping/regaining ->d_lock. Not that any in-tree instances tried to (it's pretty hard to get right). Any out-of-tree ones will have to adjust (assuming they need any changes). Note that we do not need to extend rcu-critical area here - we have verified that refcount is non-negative after having grabbed ->d_lock, so nobody will be able to free dentry until they get into __dentry_kill(), which won't happen until they manage to grab ->d_lock. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-25dentry: switch the lists of children to hlistAl Viro
Saves a pointer per struct dentry and actually makes the things less clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use of hlist_for_... iterators. A couple of new helpers - d_first_child() and d_next_sibling(), to make the expressions less awful. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-18porting: document block device freeze and thaw changesChristian Brauner
Link: https://lore.kernel.org/r/20230927-vfs-super-freeze-v1-7-ecc36d9ab4d9@kernel.org Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-9-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-07Merge tag 'vfs-6.7.fsid' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fanotify fsid updates from Christian Brauner: "This work is part of the plan to enable fanotify to serve as a drop-in replacement for inotify. While inotify is availabe on all filesystems, fanotify currently isn't. In order to support fanotify on all filesystems two things are needed: (1) all filesystems need to support AT_HANDLE_FID (2) all filesystems need to report a non-zero f_fsid This contains (1) and allows filesystems to encode non-decodable file handlers for fanotify without implementing any exportfs operations by encoding a file id of type FILEID_INO64_GEN from i_ino and i_generation. Filesystems that want to opt out of encoding non-decodable file ids for fanotify that don't support NFS export can do so by providing an empty export_operations struct. This also partially addresses (2) by generating f_fsid for simple filesystems as well as freevxfs. Remaining filesystems will be dealt with by separate patches. Finally, this contains the patch from the current exportfs maintainers which moves exportfs under vfs with Chuck, Jeff, and Amir as maintainers and vfs.git as tree" * tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: MAINTAINERS: create an entry for exportfs fs: fix build error with CONFIG_EXPORTFS=m or not defined freevxfs: derive f_fsid from bdev->bd_dev fs: report f_fsid from s_dev for "simple" filesystems exportfs: support encoding non-decodeable file handles by default exportfs: define FILEID_INO64_GEN* file handle types exportfs: make ->encode_fh() a mandatory method for NFS export exportfs: add helpers to check if filesystem can encode/decode file handles
2023-10-28exportfs: make ->encode_fh() a mandatory method for NFS exportAmir Goldstein
Rename the default helper for encoding FILEID_INO32_GEN* file handles to generic_encode_ino32_fh() and convert the filesystems that used the default implementation to use the generic helper explicitly. After this change, exportfs_encode_inode_fh() no longer has a default implementation to encode FILEID_INO32_GEN* file handles. This is a step towards allowing filesystems to encode non-decodeable file handles for fanotify without having to implement any export_operations. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231023180801.2953446-3-amir73il@gmail.com Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28porting: update locking requirementsChristian Brauner
Now that s_umount is never taken under open_mutex update the documentation to say so. Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20porting: document superblock as block device holderChristian Brauner
We've changed the holder of the block device which has consequences. Document this clearly and in detail so filesystem and vfs developers have a proper digital paper trail. Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20porting: document new block device opening orderChristian Brauner
We've changed the order of opening block devices and superblock handling. Let's document this so filesystem and vfs developers have a proper digital paper trail. Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-30Merge tag 'docs-6.6' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "Documentation work keeps chugging along; this includes: - Work from Carlos Bilbao to integrate rustdoc output into the generated HTML documentation. This took some work to figure out how to do it without slowing the docs build and without creating people who don't have Rust installed, but Carlos got there - Move the loongarch and mips architecture documentation under Documentation/arch/ - Some more maintainer documentation from Jakub ... plus the usual assortment of updates, translations, and fixes" * tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits) Docu: genericirq.rst: fix irq-example input: docs: pxrc: remove reference to phoenix-sim Documentation: serial-console: Fix literal block marker docs/mm: remove references to hmm_mirror ops and clean typos docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add() Documentation: Fix typos Documentation/ABI: Fix typos scripts: kernel-doc: fix macro handling in enums scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN] Documentation: riscv: Update boot image header since EFI stub is supported Documentation: riscv: Add early boot document Documentation: arm: Add bootargs to the table of added DT parameters docs: kernel-parameters: Refer to the correct bitmap function doc: update params of memhp_default_state= docs: Add book to process/kernel-docs.rst docs: sparse: fix invalid link addresses docs: vfs: clean up after the iterate() removal docs: Add a section on surveys to the researcher guidelines docs: move mips under arch docs: move loongarch under arch ...
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-24mm: allow ->huge_fault() to be called without the mmap_lock heldMatthew Wilcox (Oracle)
Remove the checks for the VMA lock being held, allowing the page fault path to call into the filesystem instead of retrying with the mmap_lock held. This will improve scalability for DAX page faults. Also update the documentation to match (and fix some other changes that have happened recently). Link: https://lkml.kernel.org/r/20230818202335.2739663-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18Documentation: Fix typosBjorn Helgaas
Fix typos in Documentation. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20230814212822.193684-4-helgaas@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-08-06vfs: get rid of old '->iterate' directory operationLinus Torvalds
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2022-10-20fs: rename current get acl methodChristian Brauner
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-10Merge tag 'pull-tmpfile' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs tmpfile updates from Al Viro: "Miklos' ->tmpfile() signature change; pass an unopened struct file to it, let it open the damn thing. Allows to add tmpfile support to FUSE" * tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse: implement ->tmpfile() vfs: open inside ->tmpfile() vfs: move open right after ->tmpfile() vfs: make vfs_tmpfile() static ovl: use vfs_tmpfile_open() helper cachefiles: use vfs_tmpfile_open() helper cachefiles: only pass inode to *mark_inode_inuse() helpers cachefiles: tmpfile error handling cleanup hugetlbfs: cleanup mknod and tmpfile vfs: add vfs_tmpfile_open() helper
2022-09-24vfs: open inside ->tmpfile()Miklos Szeredi
This is in preparation for adding tmpfile support to fuse, which requires that the tmpfile creation and opening are done as a single operation. Replace the 'struct dentry *' argument of i_op->tmpfile with 'struct file *'. Call finish_open_simple() as the last thing in ->tmpfile() instances (may be omitted in the error case). Change d_tmpfile() argument to 'struct file *' as well to make callers more readable. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-08-17Change calling conventions for filldir_tAl Viro
filldir_t instances (directory iterators callbacks) used to return 0 for "OK, keep going" or -E... for "stop". Note that it's *NOT* how the error values are reported - the rules for those are callback-dependent and ->iterate{,_shared}() instances only care about zero vs. non-zero (look at emit_dir() and friends). So let's just return bool ("should we keep going?") - it's less confusing that way. The choice between "true means keep going" and "true means stop" is bikesheddable; we have two groups of callbacks - do something for everything in directory, until we run into problem and find an entry in directory and do something to it. The former tended to use 0/-E... conventions - -E<something> on failure. The latter tended to use 0/1, 1 being "stop, we are done". The callers treated anything non-zero as "stop", ignoring which non-zero value did they get. "true means stop" would be more natural for the second group; "true means keep going" - for the first one. I tried both variants and the things like if allocation failed something = -ENOMEM; return true; just looked unnatural and asking for trouble. [folded suggestion from Matthew Wilcox <willy@infradead.org>] Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-16fs: remove no_llseekJason A. Donenfeld
Now that all callers of ->llseek are going through vfs_llseek(), we don't gain anything by keeping no_llseek around. Nothing actually calls it and setting ->llseek to no_lseek is completely equivalent to leaving it NULL. Longer term (== by the end of merge window) we want to remove all such intializations. To simplify the merge window this commit does *not* touch initializers - it only defines no_llseek as NULL (and simplifies the tests on file opening). At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-08namei: Merge page_symlink() and __page_symlink()Matthew Wilcox (Oracle)
There are no callers of __page_symlink() left, so we can remove that entry point. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org>