Age | Commit message (Collapse) | Author |
|
The Btrfs documentation states that if the commit value is greater than
300 a warning should be issued. The warning was accidentally lost in the
new mount API update.
Fixes: 6941823cc878 ("btrfs: remove old mount API code")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If btrfs_reserve_extent() fails while submitting an async_extent for a
compressed write, then we fail to call free_async_extent_pages() on the
async_extent and leak its folios. A likely cause for such a failure
would be btrfs_reserve_extent() failing to find a large enough
contiguous free extent for the compressed extent.
I was able to reproduce this by:
1. mount with compress-force=zstd:3
2. fallocating most of a filesystem to a big file
3. fragmenting the remaining free space
4. trying to copy in a file which zstd would generate large compressed
extents for (vmlinux worked well for this)
Step 4. hits the memory leak and can be repeated ad nauseam to
eventually exhaust the system memory.
Fix this by detecting the case where we fallback to uncompressed
submission for a compressed async_extent and ensuring that we call
free_async_extent_pages().
Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).
This happens like this:
1) Task A, the discard worker, is at peek_discard_list() and
find_next_block_group() returns block group X;
2) Block group X is in the unused block groups discard list (its discard
index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
it become an unused block group and was added to that list, but then
later it got an extent allocated from it, so its ->used counter is not
zero anymore;
3) The current transaction is aborted by task B and we end up at
__btrfs_handle_fs_error() in the transaction abort path, where we call
btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
(setting SB_RDONLY in the super block's s_flags field);
4) Task A calls __add_to_discard_list() with the goal of moving the block
group from the unused block groups discard list into another discard
list, but at __add_to_discard_list() we end up doing nothing because
btrfs_run_discard_work() returns false, since the super block has
SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
anymore in fs_info->flags. So block group X remains in the unused block
groups discard list;
5) Task A then does a goto into the 'again' label, calls
find_next_block_group() again we gets block group X again. Then it
repeats the previous steps over and over since there are not other
block groups in the discard lists and block group X is never moved
out of the unused block groups discard list since
btrfs_run_discard_work() keeps returning false and therefore
__add_to_discard_list() doesn't move block group X out of that discard
list.
When this happens we can get a soft lockup report like this:
[71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
[71.957] Modules linked in: xfs af_packet rfkill (...)
[71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.957] Tainted: [W]=WARN
[71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[71.957] Code: c1 01 48 83 (...)
[71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
[71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
[71.957] PKRU: 55555554
[71.957] Call Trace:
[71.957] <TASK>
[71.957] process_one_work+0x17e/0x330
[71.957] worker_thread+0x2ce/0x3f0
[71.957] ? __pfx_worker_thread+0x10/0x10
[71.957] kthread+0xef/0x220
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork+0x34/0x50
[71.957] ? __pfx_kthread+0x10/0x10
[71.957] ret_from_fork_asm+0x1a/0x30
[71.957] </TASK>
[71.957] Kernel panic - not syncing: softlockup: hung tasks
[71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
[71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP
[71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[71.992] Call Trace:
[71.993] <IRQ>
[71.994] dump_stack_lvl+0x5a/0x80
[71.994] panic+0x10b/0x2da
[71.995] watchdog_timer_fn.cold+0x9a/0xa1
[71.996] ? __pfx_watchdog_timer_fn+0x10/0x10
[71.997] __hrtimer_run_queues+0x132/0x2a0
[71.997] hrtimer_interrupt+0xff/0x230
[71.998] __sysvec_apic_timer_interrupt+0x55/0x100
[71.999] sysvec_apic_timer_interrupt+0x6c/0x90
[72.000] </IRQ>
[72.000] <TASK>
[72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
[72.002] Code: c1 01 48 83 (...)
[72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
[72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
[72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
[72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
[72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
[72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
[72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
[72.011] process_one_work+0x17e/0x330
[72.012] worker_thread+0x2ce/0x3f0
[72.013] ? __pfx_worker_thread+0x10/0x10
[72.014] kthread+0xef/0x220
[72.014] ? __pfx_kthread+0x10/0x10
[72.015] ret_from_fork+0x34/0x50
[72.015] ? __pfx_kthread+0x10/0x10
[72.016] ret_from_fork_asm+0x1a/0x30
[72.017] </TASK>
[72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[72.019] Rebooting in 90 seconds..
So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().
Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
"Fix a bug in UDF inode eviction leading to spewing pointless
error messages"
* tag 'udf_for_v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Make sure i_lenExtents is uptodate on inode eviction
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Ensure that simple_xattr_list() always includes security.* xattrs
- Fix eventpoll busy loop optimization when combined with timeouts
- Disable swapon() for devices with block sizes greater than page sizes
- Don't call errseq_set() twice during mark_buffer_write_io_error().
Just use mapping_set_error() which takes care to not deference
unconditionally
* tag 'vfs-6.15-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Remove redundant errseq_set call in mark_buffer_write_io_error.
swapfile: disable swapon for bs > ps devices
fs/eventpoll: fix endless busy loop after timeout has expired
fs/xattr.c: fix simple_xattr_list to always include security.* xattrs
|
|
We need the driver core fix in here as well for testing
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When mounting a user-space filesystem on multiple clients, after
concurrent ->setattr() calls from different node, stale inode
attributes may be cached in some node.
This is caused by fuse_setattr() racing with
fuse_reverse_inval_inode().
When filesystem server receives setattr request, the client node
with valid iattr cached will be required to update the fuse_inode's
attr_version and invalidate the cache by fuse_reverse_inval_inode(),
and at the next call to ->getattr() they will be fetched from user
space.
The race scenario is:
1. client-1 sends setattr (iattr-1) request to server
2. client-1 receives the reply from server
3. before client-1 updates iattr-1 to the cached attributes by
fuse_change_attributes_common(), server receives another setattr
(iattr-2) request from client-2
4. server requests client-1 to update the inode attr_version and
invalidate the cached iattr, and iattr-1 becomes staled
5. client-2 receives the reply from server, and caches iattr-2
6. continue with step 2, client-1 invokes
fuse_change_attributes_common(), and caches iattr-1
The issue has been observed from concurrent of chmod, chown, or
truncate, which all invoke ->setattr() call.
The solution is to use fuse_inode's attr_version to check whether
the attributes have been modified during the setattr request's
lifetime. If so, mark the attributes as invalid in the function
fuse_change_attributes_common().
Signed-off-by: Guang Yuan Wu <gwu@ddn.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Now that the ChaCha state matrix is strongly-typed, add a helper
function chacha_zeroize_state() which zeroizes it. Then convert all
applicable callers to use it instead of direct memzero_explicit. No
functional changes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The ChaCha state matrix is 16 32-bit words. Currently it is represented
in the code as a raw u32 array, or even just a pointer to u32. This
weak typing is error-prone. Instead, introduce struct chacha_state:
struct chacha_state {
u32 x[16];
};
Convert all ChaCha and HChaCha functions to use struct chacha_state.
No functional changes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
In preparation for writing logs, in nilfs_btree_propagate(), which makes
parent and ancestor node blocks dirty starting from a modified data block
or b-tree node block, if the starting block does not belong to the b-tree,
i.e. is isolated, nilfs_btree_do_lookup() called within the function
fails with -ENOENT.
In this case, even though -ENOENT is an internal code, it is propagated to
the log writer via nilfs_bmap_propagate() and may be erroneously returned
to system calls such as fsync().
Fix this issue by changing the error code to -EINVAL in this case, and
having the bmap layer detect metadata corruption and convert the error
code appropriately.
Link: https://lkml.kernel.org/r/20250428173808.6452-3-konishi.ryusuke@gmail.com
Fixes: 1f5abe7e7dbc ("nilfs2: replace BUG_ON and BUG calls triggerable from ioctl")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "nilfs2: improve sanity checks in dirty state propagation".
This fixes one missed check for block mapping anomalies and one improper
return of an error code during a preparation step for log writing, thereby
improving checking for filesystem corruption on writeback.
This patch (of 2):
In nilfs_direct_propagate(), the printer get from nilfs_direct_get_ptr()
need to be checked to ensure it is not an invalid pointer.
If the pointer value obtained by nilfs_direct_get_ptr() is
NILFS_BMAP_INVALID_PTR, means that the metadata (in this case, i_bmap in
the nilfs_inode_info struct) that should point to the data block at the
buffer head of the argument is corrupted and the data block is orphaned,
meaning that the file system has lost consistency.
Add a value check and return -EINVAL when it is an invalid pointer.
Link: https://lkml.kernel.org/r/20250428173808.6452-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20250428173808.6452-2-konishi.ryusuke@gmail.com
Fixes: 36a580eb489f ("nilfs2: direct block mapping")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Deduplicate the same functionality implemented in several places by
moving the cmp_int() helper macro into linux/sort.h.
The macro performs a three-way comparison of the arguments mostly useful
in different sorting strategies and algorithms.
Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Coly Li <colyli@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
unregister_sysctl_table() checks for NULL pointers internally. Remove
unneeded NULL check here.
Link: https://lkml.kernel.org/r/20250422073051.1334310-1-nichen@iscas.ac.cn
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If ocfs2_finish_quota_recovery() exits due to an error before passing all
rc_list elements to ocfs2_recover_local_quota_file() then it can lead to a
memory leak as rc_list may still contain elements that have to be freed.
Release all memory allocated by ocfs2_add_recovery_chunk() using
ocfs2_free_quota_recovery() instead of kfree().
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Link: https://lkml.kernel.org/r/20250402065628.706359-2-m.masimov@mt-integration.ru
Fixes: 2205363dce74 ("ocfs2: Implement quota recovery")
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Don't negate 'ret' and simplify the return statement.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 8fa7292fee5c ("treewide: Switch/rename to timer_delete[_sync]()")
switched del_timer_sync to timer_delete_sync, but did not modify the
comment for o2net_idle_timer(). Now fix it.
Link: https://lkml.kernel.org/r/BDDB1E4E2876C36C+20250411102610.165946-1-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Syzkaller reports an "UBSAN: shift-out-of-bounds in squashfs_bio_read" bug.
Syzkaller forks multiple processes which after mounting the Squashfs
filesystem, issues an ioctl("/dev/loop0", LOOP_SET_BLOCK_SIZE, 0x8000).
Now if this ioctl occurs at the same time another process is in the
process of mounting a Squashfs filesystem on /dev/loop0, the failure
occurs. When this happens the following code in squashfs_fill_super()
fails.
----
msblk->devblksize = sb_min_blocksize(sb, SQUASHFS_DEVBLK_SIZE);
msblk->devblksize_log2 = ffz(~msblk->devblksize);
----
sb_min_blocksize() returns 0, which means msblk->devblksize is set to 0.
As a result, ffz(~msblk->devblksize) returns 64, and msblk->devblksize_log2
is set to 64.
This subsequently causes the
UBSAN: shift-out-of-bounds in fs/squashfs/block.c:195:36
shift exponent 64 is too large for 64-bit type 'u64' (aka
'unsigned long long')
This commit adds a check for a 0 return by sb_min_blocksize().
Link: https://lkml.kernel.org/r/20250409024747.876480-1-phillip@squashfs.org.uk
Fixes: 0aa666190509 ("Squashfs: super block operations")
Reported-by: syzbot+65761fc25a137b9c8c6e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67f0dd7a.050a0220.0a13.0230.GAE@google.com/
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
proc_mem_open() can return an errno, NULL, or mm_struct*. If it fails to
acquire mm, it returns NULL, but the caller does not check for the case
when the return value is NULL.
The following conditions lead to failure in acquiring mm:
- The task is a kernel thread (PF_KTHREAD)
- The task is exiting (PF_EXITING)
Changes:
- Add documentation comments for the return value of proc_mem_open().
- Add checks in the caller to return -ESRCH when proc_mem_open()
returns NULL.
Link: https://lkml.kernel.org/r/20250404063357.78891-1-superman.xpt@gmail.com
Reported-by: syzbot+f9238a0a31f9b5603fef@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f52642060d4e3750@google.com
Signed-off-by: Penglei Jiang <superman.xpt@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Adrian Ratiu <adrian.ratiu@collabora.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Felix Moessbauer <felix.moessbauer@siemens.com>
Cc: Jeff layton <jlayton@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kpageflags_read() and kpagecgroup_read() are quite similar to
kpagecount_read(). Refactor common code into a helper function to reduce
code duplication.
Link: https://lkml.kernel.org/r/20250318063226.223284-1-liuyerd@163.com
Signed-off-by: Liu Ye <liuye@kylinos.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Svetly Todorov <svetly.todorov@memverge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard
regions", v2.
Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose
information about guard regions. This allows userspace tools, such as
CRIU, to detect and handle guard regions.
Currently, CRIU utilizes PAGEMAP_SCAN as a more efficient alternative to
parsing /proc/pid/pagemap. Without this change, guard regions are
incorrectly reported as swap-anon regions, leading CRIU to attempt dumping
them and subsequently failing.
The series includes updates to the documentation and selftests to reflect
the new functionality.
This patch (of 3):
Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose
information about guard regions. This allows userspace tools, such as
CRIU, to detect and handle guard regions.
Link: https://lkml.kernel.org/r/20250324065328.107678-1-avagin@google.com
Link: https://lkml.kernel.org/r/20250324065328.107678-2-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Removes five conversions from folio to page. Also removes both callers of
mk_pmd() that aren't part of mk_huge_pmd(), getting us a step closer to
removing the confusion between mk_pmd(), mk_huge_pmd() and pmd_mkhuge().
Link: https://lkml.kernel.org/r/20250402181709.2386022-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Richard Weinberger <richard@nod.at>
Cc: <x86@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observability here is now covered by static tracepoints.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
There isn't a common helper for getattrs, so add these into the
protocol-specific helpers.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observe the start of NFS READDIR operations.
The NFS READDIR's count argument can be interesting when tuning a
client's readdir behavior.
However, the count argument is not passed to nfsd_readdir(). To
properly capture the count argument, this tracepoint must appear in
each proc function before the nfsd_readdir() call.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observe the start of RENAME operations for all NFS versions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observe the start of UNLINK, REMOVE, and RMDIR operations for all
NFS versions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observe the start of NFS LINK operations.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observe the start of SYMLINK operations for all NFS versions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Observe the start of file and directory creation for all NFS
versions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Replace the dprintk in nfsd_lookup_dentry() with a trace point.
nfsd_lookup_dentry() is called frequently enough that enabling this
dprintk call site would result in log floods and performance issues.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Turn Sargun's internal kprobe based implementation of this into a normal
static tracepoint. Also, remove the dprintk's that got added recently
with the fix for zero-length ACLs.
Cc: Sargun Dillon <sargun@sargun.me>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Introduce tracing helpers that can be used before the procedure
status code is known. These macros are similar to the
SVC_RQST_ENDPOINT helpers, but they can be modified to include
NFS-specific fields if that is needed later.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Record and emit presentation addresses using tracing helpers
designed for the task.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
RFC 7862 states that if an NFS server implements a CLONE operation,
it MUST also implement FATTR4_CLONE_BLKSIZE. NFSD implements CLONE,
but does not implement FATTR4_CLONE_BLKSIZE.
Note that in Section 12.2, RFC 7862 claims that
FATTR4_CLONE_BLKSIZE is RECOMMENDED, not REQUIRED. Likely this is
because a minor version is not permitted to add a REQUIRED
attribute. Confusing.
We assume this attribute reports a block size as a count of bytes,
as RFC 7862 does not specify a unit.
Reported-by: Roland Mainz <roland.mainz@nrubsig.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org # v6.7+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This user of SHA-256 does not support any other algorithm, so the
crypto_shash abstraction provides no value. Just use the SHA-256
library API instead, which is much simpler and easier to use.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
In nfs4_state_start_net(), laundromat_work may access nfsd_ssc through
nfs4_laundromat -> nfsd4_ssc_expire_umount. If nfsd_ssc isn't initialized,
this can cause NULL pointer dereference.
Normally the delayed start of laundromat_work allows sufficient time for
nfsd_ssc initialization to complete. However, when the kernel waits too
long for userspace responses (e.g. in nfs4_state_start_net ->
nfsd4_end_grace -> nfsd4_record_grace_done -> nfsd4_cld_grace_done ->
cld_pipe_upcall -> __cld_pipe_upcall -> wait_for_completion path), the
delayed work may start before nfsd_ssc initialization finishes.
Fix this by moving nfsd_ssc initialization before starting laundromat_work.
Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Very useful for gauging how long the vfs_fsync_range() takes.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If the request being processed is not a v4 compound request, then
examining the cstate can have undefined results.
This patch adds a check that the rpc procedure being executed
(rq_procinfo) is the NFSPROC4_COMPOUND procedure.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When an export policy with xprtsec policy is set with "tls"
and/or "mtls", but an NFS client is doing a v3 xprtsec=tls
mount, then NLM locking calls fail with an error because
there is currently no support for NLM with TLS.
Until such support is added, allow NLM calls under TLS-secured
policy.
Fixes: 4cc9b9f2bf4d ("nfsd: refine and rename NFSD_MAY_LOCK")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
It can be removed since svc_fill_write_vector already has the
same WARN_ON_ONCE.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
NFSD currently has two separate code paths for handling read
requests. One uses page splicing; the other is a traditional read
based on an iov iterator.
Because most Linux file systems support splice read, the latter
does not get nearly the same test experience as splice reads.
To force the use of vectored reads for testing and benchmarking,
introduce the ability to disable splice reads for all NFS READ
operations.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Create a small sandbox under /sys/kernel/debug for experimental NFS
server feature settings. There is no API/ABI compatibility guarantee
for these settings.
The only documentation for such settings, if any documentation exists,
is in the kernel source code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
As of now nfsd calls create_proc_exports_entry() at start of init_nfsd
and cleanup by remove_proc_entry() at last of exit_nfsd.
Which causes kernel OOPs if there is race between below 2 operations:
(i) exportfs -r
(ii) mount -t nfsd none /proc/fs/nfsd
for 5.4 kernel ARM64:
CPU 1:
el1_irq+0xbc/0x180
arch_counter_get_cntvct+0x14/0x18
running_clock+0xc/0x18
preempt_count_add+0x88/0x110
prep_new_page+0xb0/0x220
get_page_from_freelist+0x2d8/0x1778
__alloc_pages_nodemask+0x15c/0xef0
__vmalloc_node_range+0x28c/0x478
__vmalloc_node_flags_caller+0x8c/0xb0
kvmalloc_node+0x88/0xe0
nfsd_init_net+0x6c/0x108 [nfsd]
ops_init+0x44/0x170
register_pernet_operations+0x114/0x270
register_pernet_subsys+0x34/0x50
init_nfsd+0xa8/0x718 [nfsd]
do_one_initcall+0x54/0x2e0
CPU 2 :
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
PC is at : exports_net_open+0x50/0x68 [nfsd]
Call trace:
exports_net_open+0x50/0x68 [nfsd]
exports_proc_open+0x2c/0x38 [nfsd]
proc_reg_open+0xb8/0x198
do_dentry_open+0x1c4/0x418
vfs_open+0x38/0x48
path_openat+0x28c/0xf18
do_filp_open+0x70/0xe8
do_sys_open+0x154/0x248
Sometimes it crashes at exports_net_open() and sometimes cache_seq_next_rcu().
and same is happening on latest 6.14 kernel as well:
[ 0.000000] Linux version 6.14.0-rc5-next-20250304-dirty
...
[ 285.455918] Unable to handle kernel paging request at virtual address 00001f4800001f48
...
[ 285.464902] pc : cache_seq_next_rcu+0x78/0xa4
...
[ 285.469695] Call trace:
[ 285.470083] cache_seq_next_rcu+0x78/0xa4 (P)
[ 285.470488] seq_read+0xe0/0x11c
[ 285.470675] proc_reg_read+0x9c/0xf0
[ 285.470874] vfs_read+0xc4/0x2fc
[ 285.471057] ksys_read+0x6c/0xf4
[ 285.471231] __arm64_sys_read+0x1c/0x28
[ 285.471428] invoke_syscall+0x44/0x100
[ 285.471633] el0_svc_common.constprop.0+0x40/0xe0
[ 285.471870] do_el0_svc_compat+0x1c/0x34
[ 285.472073] el0_svc_compat+0x2c/0x80
[ 285.472265] el0t_32_sync_handler+0x90/0x140
[ 285.472473] el0t_32_sync+0x19c/0x1a0
[ 285.472887] Code: f9400885 93407c23 937d7c27 11000421 (f86378a3)
[ 285.473422] ---[ end trace 0000000000000000 ]---
It reproduced simply with below script:
while [ 1 ]
do
/exportfs -r
done &
while [ 1 ]
do
insmod /nfsd.ko
mount -t nfsd none /proc/fs/nfsd
umount /proc/fs/nfsd
rmmod nfsd
done &
So exporting interfaces to user space shall be done at last and
cleanup at first place.
With change there is no Kernel OOPs.
Co-developed-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
With rpc_status netlink support, unregister of register_filesystem()
was missed in case of genl_register_family() fails.
Correcting it by making new label.
Fixes: bd9d6a3efa97 ("NFSD: add rpc_status netlink support")
Cc: stable@vger.kernel.org
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|