Age | Commit message (Collapse) | Author |
|
commit 694ce5e143d67267ad26b04463e790a597500b00 upstream.
Create a block group dedicated for data relocation on mount of a zoned
filesystem.
If there is already more than one empty DATA block group on mount, this
one is picked for the data relocation block group, instead of a newly
created one.
This is done to ensure, there is always space for performing garbage
collection and the filesystem is not hitting ENOSPC under heavy overwrite
workloads.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2a5898c4aac67494c2f0f7fe38373c95c371c930 upstream.
If we failed walking a log tree during replay, we have a missing
transaction abort to prevent committing a transaction where we didn't
fully replay all the changes from a log tree and therefore can leave the
respective subvolume tree in some inconsistent state. So add the missing
transaction abort.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 55f7c65b2f69c7e4cb7aa7c1654a228ccf734fd8 upstream.
When deciding if a zoned filesystem is reaching the threshold to reclaim
data block groups, look at the size of the filesystem not to potentially
total available size of all drives in the filesystem.
Especially if a filesystem was created with mkfs' -b option, constraining
it to only a portion of the block device, the numbers won't match and
potentially garbage collection is kicking in too late.
Fixes: 3687fcb0752a ("btrfs: zoned: make auto-reclaim less aggressive")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 82e6381e23f1ea7a14f418215068aaa2ca046c84 upstream.
Various changes in the "ext4: better scalability for ext4 block
allocation" patch series have resulted in kunit test failures, most
notably in the test_new_blocks_simple and the test_mb_mark_used tests.
The root cause of these failures is that various in-memory ext4 data
structures were not getting initialized, and while previous versions
of the functions exercised by the unit tests didn't use these
structure members, this was arguably a test bug.
Since one of the patches in the block allocation scalability patches
is a fix which is has a cc:stable tag, this commit also has a
cc:stable tag.
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com
Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com
Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7d345aa1fac4c2ec9584fbd6f389f2c2368671d5 upstream.
The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.
For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.
At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.
To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1c320d8e92925bb7615f83a7b6e3f402a5c2ca63 upstream.
Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated(i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.
This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.
Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9d5eff7821f6d70f7d1b4d8a60680fba4de868a7 upstream.
We now do a weighted selection of server interfaces when allocating
new channels. The weights are decided based on the speed advertised.
The fulfilled weight for an interface is a counter that is used to
track the interface selection. It should be reset back to zero once
all interfaces fulfilling their weight.
In cifs_chan_update_iface, this reset logic was missing. As a result
when the server interface list changes, the client may not be able
to find a new candidate for other channels after all interfaces have
been fulfilled.
Fixes: a6d8fb54a515 ("cifs: distribute channels across interfaces based on speed")
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b63335fb3d32579c5ff0b7038b9cc23688fff528 ]
collect_sample() is used to gather samples of the data in a Write op for
analysis to try and determine if the compression algorithm is likely to
achieve anything more quickly than actually running the compression
algorithm.
However, collect_sample() assumes that the data it is going to be sampling
is stored in an ITER_XARRAY-type iterator (which it now should never be)
and doesn't actually check that it is before accessing the underlying
xarray directly.
Fix this by replacing the code with a loop that just uses the standard
iterator functions to sample every other 2KiB block, skipping the
intervening ones. It's not quite the same as the previous algorithm as it
doesn't necessarily align to the pages within an ordinary write from the
pagecache.
Note that the btrfs code from which this was derived samples the inode's
pagecache directly rather than the iterator - but that doesn't necessarily
work for network filesystems if O_DIRECT is in operation.
Fixes: 94ae8c3fee94 ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9768797c219326699778fba9cd3b607b2f1e7950 ]
The error occurs on the third attempt to encode extents. When function
ext_tree_prepare_commit() reallocates a larger buffer to retry encoding
extents, the "layoutupdate_pages" page array is initialized only after the
retry loop. But ext_tree_free_commitdata() is called on every iteration
and tries to put pages in the array, thus dereferencing uninitialized
pointers.
An additional problem is that there is no limit on the maximum possible
buffer_size. When there are too many extents, the client may create a
layoutcommit that is larger than the maximum possible RPC size accepted
by the server.
During testing, we observed two typical scenarios. First, one memory page
for extents is enough when we work with small files, append data to the
end of the file, or preallocate extents before writing. But when we fill
a new large file without preallocating, the number of extents can be huge,
and counting the number of written extents in ext_tree_encode_commit()
does not help much. Since this number increases even more between
unlocking and locking of ext_tree, the reallocated buffer may not be
large enough again and again.
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630183537.196479-2-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d897d81671bc4615c80f4f3bd5e6b218f59df50c ]
When there are too many block extents for a layoutcommit, they may not
all fit into the maximum-sized RPC. This patch allows the generic pnfs
code to properly handle -ENOSPC returned by the block/scsi layout driver
and trigger additional layoutcommits if necessary.
Co-developed-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Konstantin Evtushenko <koevtushenko@yandex.com>
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630183537.196479-5-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7db6e66663681abda54f81d5916db3a3b8b1a13d ]
At the end of the isect translation, disc_addr represents the physical
disk offset. Thus, end calculated from disk_addr is also a physical disk
offset. Therefore, range checking should be done using map->disk_offset,
not map->start.
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250702133226.212537-1-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 81438498a285759f31e843ac4800f82a5ce6521f ]
Because of integer division, we need to carefully calculate the
disk offset. Consider the example below for a stripe of 6 volumes,
a chunk size of 4096, and an offset of 70000.
chunk = div_u64(offset, dev->chunk_size) = 70000 / 4096 = 17
offset = chunk * dev->chunk_size = 17 * 4096 = 69632
disk_offset_wrong = div_u64(offset, dev->nr_children) = 69632 / 6 = 11605
disk_chunk = div_u64(chunk, dev->nr_children) = 17 / 6 = 2
disk_offset = disk_chunk * dev->chunk_size = 2 * 4096 = 8192
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250701122341.199112-1-sergeybashirov@gmail.com
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
_smbd_get_connection
[ Upstream commit 550a194c5998e4e77affc6235e80d3766dc2d27e ]
It is already called long before we may hit this cleanup code path.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1773f63d108b1b9b9d053d8c95f8300c556f93b8 ]
F2FS-fs (dm-55): access invalid blkaddr:972878540
Call trace:
dump_backtrace+0xec/0x128
show_stack+0x18/0x28
dump_stack_lvl+0x40/0x88
dump_stack+0x18/0x24
__f2fs_is_valid_blkaddr+0x360/0x3b4
f2fs_is_valid_blkaddr+0x10/0x20
f2fs_get_node_info+0x21c/0x60c
__write_node_page+0x15c/0x734
f2fs_sync_node_pages+0x4f8/0x700
f2fs_write_checkpoint+0x4a8/0x99c
__checkpoint_and_complete_reqs+0x7c/0x20c
issue_checkpoint_thread+0x4c/0xd8
kthread+0x11c/0x1b0
ret_from_fork+0x10/0x20
If nat.blkaddr is corrupted, during checkpoint, f2fs_sync_node_pages()
will loop to flush node page w/ corrupted nat.blkaddr.
Although, it tags SBI_NEED_FSCK, checkpoint can not persist it due
to deadloop.
Let's call f2fs_handle_error(, ERROR_INCONSISTENT_NAT) to record such
error into superblock, it expects fsck can detect the error and repair
inconsistent nat.blkaddr after device reboot.
Note that, let's add sanity check in f2fs_get_node_info() to detect
in-memory nat.blkaddr inconsistency, but only if CONFIG_F2FS_CHECK_FS
is enabled.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e23ab8028de0d92df5921a570f5212c0370db3b5 ]
Let's return errors caught by the generic checks. This fixes generic/494 where
it expects to see EBUSY by setattr_prepare instead of EINVAL by f2fs for active
swapfile.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 99f9a97dce39ad413c39b92c90393bbd6778f3fd ]
An infinite loop may occur if the following conditions occur due to
file system corruption.
(1) Condition for exfat_count_dir_entries() to loop infinitely.
- The cluster chain includes a loop.
- There is no UNUSED entry in the cluster chain.
(2) Condition for exfat_create_upcase_table() to loop infinitely.
- The cluster chain of the root directory includes a loop.
- There are no UNUSED entry and up-case table entry in the cluster
chain of the root directory.
(3) Condition for exfat_load_bitmap() to loop infinitely.
- The cluster chain of the root directory includes a loop.
- There are no UNUSED entry and bitmap entry in the cluster chain
of the root directory.
(4) Condition for exfat_find_dir_entry() to loop infinitely.
- The cluster chain includes a loop.
- The unused directory entries were exhausted by some operation.
(5) Condition for exfat_check_dir_empty() to loop infinitely.
- The cluster chain includes a loop.
- The unused directory entries were exhausted by some operation.
- All files and sub-directories under the directory are deleted.
This commit adds checks to break the above infinite loop.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c214006856ff52a8ff17ed8da52d50601d54f9ce ]
When computing the tree index in dbAllocAG, we never check if we are
out of bounds realative to the size of the stree.
This could happen in a scenario where the filesystem metadata are
corrupted.
Reported-by: syzbot+cffd18309153948f3c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cffd18309153948f3c3e
Tested-by: syzbot+cffd18309153948f3c3e@syzkaller.appspotmail.com
Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2d04df8116426b6c7b9f8b9b371250f666a2a2fb ]
The reproducer builds a corrupted file on disk with a negative i_size value.
Add a check when opening this file to avoid subsequent operation failures.
Reported-by: syzbot+630f6d40b3ccabc8e96e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=630f6d40b3ccabc8e96e
Tested-by: syzbot+630f6d40b3ccabc8e96e@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2d91b3765cd05016335cd5df5e5c6a29708ec058 ]
The fileset value of the inode copy from the disk by the reproducer is
AGGR_RESERVED_I. When executing evict, its hard link number is 0, so its
inode pages are not truncated. This causes the bugon to be triggered when
executing clear_inode() because nrpages is greater than 0.
Reported-by: syzbot+6e516bb515d93230bc7b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6e516bb515d93230bc7b
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b12f423d598fd874df9ecfb2436789d582fda8e6 ]
In environments with a page size of 64KB, the maximum size of a folio
can reach up to 128MB. Consequently, during the write-back of folios,
the 'rsv_blocks' will be overestimated to 1,577, which can make
pressure on the journal space where the journal is small. This can
easily exceed the limit of a single transaction. Besides, an excessively
large folio is meaningless and will instead increase the overhead of
traversing the bhs within the folio. Therefore, limit the maximum order
of a folio to 2048 filesystem blocks.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Joseph Qi <jiangqi903@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cdfa1304657d6f23be8fd2bb0516380a3c89034e ]
sprintf() is discouraged for use with bounded destination buffers
as it does not prevent buffer overflows when the formatted output
exceeds the destination buffer size. snprintf() is a safer
alternative as it limits the number of bytes written and ensures
NUL-termination.
Replace sprintf() with snprintf() for copying the debug string
into a temporary buffer, using ORANGEFS_MAX_DEBUG_STRING_LEN as
the maximum size to ensure safe formatting and prevent memory
corruption in edge cases.
EDIT: After this patch sat on linux-next for a few days, Dan
Carpenter saw it and suggested that I use scnprintf instead of
snprintf. I made the change and retested.
Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 099b847ccc6c1ad2f805d13cfbcc83f5b6d4bc42 ]
A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data()
when an inode had the INLINE_DATA_FL flag set but was missing the
system.data extended attribute.
Since this can happen due to a maiciouly fuzzed file system, we
shouldn't BUG, but rather, report it as a corrupted file system.
Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii
ext4_create_inline_data() and ext4_inline_data_truncate().
Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 33cfdd726381828b9907a61c038a9f48b6690a31 ]
Some servers might enforce the SPN to be set in the target info
blob (AV pairs) when sending NTLMSSP_AUTH message. In Windows Server,
this could be enforced with SmbServerNameHardeningLevel set to 2.
Fix this by always appending SPN (cifs/<hostname>) to the existing
list of target infos when setting up NTLMv2 response blob.
Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b460249b9a1dab7a9f58483e5349d045ad6d585c ]
To query root path (without msearch wildcard) it is needed to
send pattern '\' instead of '' (empty string).
This allows to use CIFSFindFirst() to query information about root path
which is being used in followup changes.
This change fixes the stat() syscall called on the root path on the mount.
It is because stat() syscall uses the cifs_query_path_info() function and
it can fallback to the CIFSFindFirst() usage with msearch=false.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d9b13cdad80dc11d74408cf201939a946e9303a6 ]
If a lookup in tracefs is done on a file that does not exist, it leaves a
dentry hanging around until memory pressure removes it. But eventfs
dentries should hang around as when their ref count goes to zero, it
requires more work to recreate it. For the rest of the tracefs dentries,
they hang around as their dentry is used as a descriptor for the tracing
system. But if a file lookup happens for a file in tracefs that does not
exist, it should be deleted.
Add a .d_delete callback that checks if dentry->fsdata is set or not. Only
eventfs dentries set fsdata so if it has content it should not be deleted
and should hang around in the cache.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a099b09a3342a0b28ea330e405501b5b4d0424b4 ]
Previously, ext2_fiemap would unconditionally apply "len = min_t(u64, len,
i_size_read(inode));", When inode->i_size was 0 (for an empty file), this
would reduce the requested len to 0. Passing len = 0 to iomap_fiemap could
then result in an -EINVAL error, even for valid queries on empty files.
Link: https://github.com/linux-test-project/ltp/issues/1246
Signed-off-by: Wei Gao <wegao@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250613152402.3432135-1-wegao@suse.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1a1ad73aa1a66787f05f7f10f686b74bab77be72 ]
Similar to commit 1ed95281c0c7 ("anon_inode: raise SB_I_NODEV and SB_I_NOEXEC"):
it shouldn't be possible to execute pidfds via
execveat(fd_anon_inode, "", NULL, NULL, AT_EMPTY_PATH)
so raise SB_I_NOEXEC so that no one gets any creative ideas.
Also raise SB_I_NODEV as we don't expect or support any devices on pidfs.
Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-1-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b1e9d89408f402858c00103f9831b25ffa0994d3 ]
After applying this patch, could correctly create symlink:
ln -s "relative/path/to/file" symlink
Signed-off-by: Rong Zhang <ulin0208@gmail.com>
[almaz.alexandrovich@paragon-software.com: added cpu_to_le32 macro to
rs->Flags assignment]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e841ecb139339602bc1853f5f09daa5d1ea920a2 ]
The length of the file name should be smaller than the directory entry size.
Reported-by: syzbot+598057afa0f49e62bd23@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=598057afa0f49e62bd23
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2a8061ee5e41034eb14170ec4517b5583dbeff9f ]
We want a class that nests outside of I_MUTEX_NORMAL (for the sake of
callbacks that might want to lock the victim) and inside I_MUTEX_PARENT
(so that a variant of that could be used with parent of the victim
held locked by the caller).
In reality, simple_recursive_removal()
* never holds two locks at once
* holds the lock on parent of dentry passed to callback
* is used only on the trees with fixed topology, so the depths
are not changing.
So the locking order is actually fine.
AFAICS, the best solution is to assign I_MUTEX_CHILD to the locks
grabbed by that thing.
Reported-by: syzbot+169de184e9defe7fe709@syzkaller.appspotmail.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d3ed6d6981f4756f145766753c872482bc3b28d3 ]
The generic/001 test of xfstests suite fails and corrupts
the HFS volume:
sudo ./check generic/001
FSTYP -- hfs
PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0-rc2+ #3 SMP PREEMPT_DYNAMIC Fri Apr 25 17:13:00 PDT 2>
MKFS_OPTIONS -- /dev/loop51
MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch
generic/001 32s ... _check_generic_filesystem: filesystem on /dev/loop50 is inconsistent
(see /home/slavad/XFSTESTS-2/xfstests-dev/results//generic/001.full for details)
Ran: generic/001
Failures: generic/001
Failed 1 of 1 tests
fsck.hfs -d -n ./test-image.bin
** ./test-image.bin (NO WRITE)
Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K.
Executing fsck_hfs (version 540.1-Linux).
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
Unused node is not erased (node = 2)
Unused node is not erased (node = 4)
<skipped>
Unused node is not erased (node = 253)
Unused node is not erased (node = 254)
Unused node is not erased (node = 255)
Unused node is not erased (node = 256)
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
Verify Status: VIStat = 0x0000, ABTStat = 0x0000 EBTStat = 0x0000
CBTStat = 0x0004 CatStat = 0x00000000
** The volume untitled was found corrupt and needs to be repaired.
volume type is HFS
primary MDB is at block 2 0x02
alternate MDB is at block 20971518 0x13ffffe
primary VHB is at block 0 0x00
alternate VHB is at block 0 0x00
sector size = 512 0x200
VolumeObject flags = 0x19
total sectors for volume = 20971520 0x1400000
total sectors for embedded volume = 0 0x00
This patch adds logic of clearing the deleted b-tree node.
sudo ./check generic/001
FSTYP -- hfs
PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0-rc2+ #3 SMP PREEMPT_DYNAMIC Fri Apr 25 17:13:00 PDT 2025
MKFS_OPTIONS -- /dev/loop51
MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch
generic/001 9s ... 32s
Ran: generic/001
Passed all 1 tests
fsck.hfs -d -n ./test-image.bin
** ./test-image.bin (NO WRITE)
Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K.
Executing fsck_hfs (version 540.1-Linux).
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
** The volume untitled appears to be OK.
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250430001211.1912533-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1a11201668e8635602577dcf06f2e96c591d8819 ]
Verify that number of partition maps isn't insanely high which can lead
to large allocation in udf_sb_alloc_partition_maps(). All partition maps
have to fit in the LVD which is in a single block.
Reported-by: syzbot+478f2c1a6f0f447a46bb@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5c8f12cf1e64e0e8e6cb80b0c935389973e8be8d ]
Clears up the warning added in 7ee3647243e5 ("migrate: Remove call to
->writepage") that occurs in various xfstests, causing "something found
in dmesg" failures.
[ 341.136573] gfs2_meta_aops does not implement migrate_folio
[ 341.136953] WARNING: CPU: 1 PID: 36 at mm/migrate.c:944 move_to_new_folio+0x2f8/0x300
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 557c024ca7250bb65ae60f16c02074106c2f197b ]
A fuzzer test introduced corruption that ends up with a depth of 0 in
dir_e_read(), causing an undefined shift by 32 at:
index = hash >> (32 - dip->i_depth);
As calculated in an open-coded way in dir_make_exhash(), the minimum
depth for an exhash directory is ilog2(sdp->sd_hash_ptrs) and 0 is
invalid as sdp->sd_hash_ptrs is fixed as sdp->bsize / 16 at mount time.
So we can avoid the undefined behaviour by checking for depth values
lower than the minimum in gfs2_dinode_in(). Values greater than the
maximum are already being checked for there.
Also switch the calculation in dir_make_exhash() to use ilog2() to
clarify how the depth is calculated.
Tested with the syzkaller repro.c and xfstests '-g quick'.
Reported-by: syzbot+4708579bb230a0582a57@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d5fc1400a34b4ea5e8f2ce296ea12bf8c8421694 ]
If smb2_create_link() is called with ReplaceIfExists set and the name
does exist then a deadlock will happen.
ksmbd_vfs_kern_path_locked() will return with success and the parent
directory will be locked. ksmbd_vfs_remove_file() will then remove the
file. ksmbd_vfs_link() will then be called while the parent is still
locked. It will try to lock the same parent and will deadlock.
This patch moves the ksmbd_vfs_kern_path_unlock() call to *before*
ksmbd_vfs_link() and then simplifies the code, removing the file_present
flag variable.
Signed-off-by: NeilBrown <neil@brown.name>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6599716de2d68ab7b3c0429221deca306dbda878 ]
If we attempt a mmap write into a NOCOW file or a prealloc extent when
there is no more available data space (or unallocated space to allocate a
new data block group) and we can do a NOCOW write (there are no reflinks
for the target extent or snapshots), we always fail due to -ENOSPC, unlike
for the regular buffered write and direct IO paths where we check that we
can do a NOCOW write in case we can't reserve data space.
Simple reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV
mount $DEV $MNT
touch $MNT/foobar
# Make it a NOCOW file.
chattr +C $MNT/foobar
# Add initial data to file.
xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar
# Fill all the remaining data space and unallocated space with data.
dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null
# Overwrite the file with a mmap write. Should succeed.
xfs_io -c "mmap -w 0 1M" \
-c "mwrite -S 0xcd 0 1M" \
-c "munmap" \
$MNT/foobar
# Unmount, mount again and verify the new data was persisted.
umount $MNT
mount $DEV $MNT
od -A d -t x1 $MNT/foobar
umount $MNT
Running this:
$ ./test.sh
(...)
wrote 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec)
./test.sh: line 24: 234865 Bus error xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
1048576
Fix this by not failing in case we can't allocate data space and we can
NOCOW into the target extent - reserving only metadata space in this case.
After this change the test passes:
$ ./test.sh
(...)
wrote 1048576/1048576 bytes at offset 0
1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec)
0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
1048576
A test case for fstests will be added soon.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c7c6363ca186747ebc2df10c8a1a51e66e0e32d9 ]
When the volume header contains erroneous values that do not reflect
the actual state of the filesystem, hfsplus_fill_super() assumes that
the attributes file is not yet created, which later results in hitting
BUG_ON() when hfsplus_create_attributes_file() is called. Replace this
BUG_ON() with -EIO error with a message to suggest running fsck tool.
Reported-by: syzbot <syzbot+1107451c16b9eb9d29e6@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=1107451c16b9eb9d29e6
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/7b587d24-c8a1-4413-9b9a-00a33fbd849f@I-love.SAKURA.ne.jp
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 94458781aee6045bd3d0ad4b80b02886b9e2219b ]
The hfsplus_readdir() method is capable to crash by calling
hfsplus_uni2asc():
[ 667.121659][ T9805] ==================================================================
[ 667.122651][ T9805] BUG: KASAN: slab-out-of-bounds in hfsplus_uni2asc+0x902/0xa10
[ 667.123627][ T9805] Read of size 2 at addr ffff88802592f40c by task repro/9805
[ 667.124578][ T9805]
[ 667.124876][ T9805] CPU: 3 UID: 0 PID: 9805 Comm: repro Not tainted 6.16.0-rc3 #1 PREEMPT(full)
[ 667.124886][ T9805] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 667.124890][ T9805] Call Trace:
[ 667.124893][ T9805] <TASK>
[ 667.124896][ T9805] dump_stack_lvl+0x10e/0x1f0
[ 667.124911][ T9805] print_report+0xd0/0x660
[ 667.124920][ T9805] ? __virt_addr_valid+0x81/0x610
[ 667.124928][ T9805] ? __phys_addr+0xe8/0x180
[ 667.124934][ T9805] ? hfsplus_uni2asc+0x902/0xa10
[ 667.124942][ T9805] kasan_report+0xc6/0x100
[ 667.124950][ T9805] ? hfsplus_uni2asc+0x902/0xa10
[ 667.124959][ T9805] hfsplus_uni2asc+0x902/0xa10
[ 667.124966][ T9805] ? hfsplus_bnode_read+0x14b/0x360
[ 667.124974][ T9805] hfsplus_readdir+0x845/0xfc0
[ 667.124984][ T9805] ? __pfx_hfsplus_readdir+0x10/0x10
[ 667.124994][ T9805] ? stack_trace_save+0x8e/0xc0
[ 667.125008][ T9805] ? iterate_dir+0x18b/0xb20
[ 667.125015][ T9805] ? trace_lock_acquire+0x85/0xd0
[ 667.125022][ T9805] ? lock_acquire+0x30/0x80
[ 667.125029][ T9805] ? iterate_dir+0x18b/0xb20
[ 667.125037][ T9805] ? down_read_killable+0x1ed/0x4c0
[ 667.125044][ T9805] ? putname+0x154/0x1a0
[ 667.125051][ T9805] ? __pfx_down_read_killable+0x10/0x10
[ 667.125058][ T9805] ? apparmor_file_permission+0x239/0x3e0
[ 667.125069][ T9805] iterate_dir+0x296/0xb20
[ 667.125076][ T9805] __x64_sys_getdents64+0x13c/0x2c0
[ 667.125084][ T9805] ? __pfx___x64_sys_getdents64+0x10/0x10
[ 667.125091][ T9805] ? __x64_sys_openat+0x141/0x200
[ 667.125126][ T9805] ? __pfx_filldir64+0x10/0x10
[ 667.125134][ T9805] ? do_user_addr_fault+0x7fe/0x12f0
[ 667.125143][ T9805] do_syscall_64+0xc9/0x480
[ 667.125151][ T9805] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 667.125158][ T9805] RIP: 0033:0x7fa8753b2fc9
[ 667.125164][ T9805] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[ 667.125172][ T9805] RSP: 002b:00007ffe96f8e0f8 EFLAGS: 00000217 ORIG_RAX: 00000000000000d9
[ 667.125181][ T9805] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa8753b2fc9
[ 667.125185][ T9805] RDX: 0000000000000400 RSI: 00002000000063c0 RDI: 0000000000000004
[ 667.125190][ T9805] RBP: 00007ffe96f8e110 R08: 00007ffe96f8e110 R09: 00007ffe96f8e110
[ 667.125195][ T9805] R10: 0000000000000000 R11: 0000000000000217 R12: 0000556b1e3b4260
[ 667.125199][ T9805] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 667.125207][ T9805] </TASK>
[ 667.125210][ T9805]
[ 667.145632][ T9805] Allocated by task 9805:
[ 667.145991][ T9805] kasan_save_stack+0x20/0x40
[ 667.146352][ T9805] kasan_save_track+0x14/0x30
[ 667.146717][ T9805] __kasan_kmalloc+0xaa/0xb0
[ 667.147065][ T9805] __kmalloc_noprof+0x205/0x550
[ 667.147448][ T9805] hfsplus_find_init+0x95/0x1f0
[ 667.147813][ T9805] hfsplus_readdir+0x220/0xfc0
[ 667.148174][ T9805] iterate_dir+0x296/0xb20
[ 667.148549][ T9805] __x64_sys_getdents64+0x13c/0x2c0
[ 667.148937][ T9805] do_syscall_64+0xc9/0x480
[ 667.149291][ T9805] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 667.149809][ T9805]
[ 667.150030][ T9805] The buggy address belongs to the object at ffff88802592f000
[ 667.150030][ T9805] which belongs to the cache kmalloc-2k of size 2048
[ 667.151282][ T9805] The buggy address is located 0 bytes to the right of
[ 667.151282][ T9805] allocated 1036-byte region [ffff88802592f000, ffff88802592f40c)
[ 667.152580][ T9805]
[ 667.152798][ T9805] The buggy address belongs to the physical page:
[ 667.153373][ T9805] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x25928
[ 667.154157][ T9805] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 667.154916][ T9805] anon flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
[ 667.155631][ T9805] page_type: f5(slab)
[ 667.155997][ T9805] raw: 00fff00000000040 ffff88801b442f00 0000000000000000 dead000000000001
[ 667.156770][ T9805] raw: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000
[ 667.157536][ T9805] head: 00fff00000000040 ffff88801b442f00 0000000000000000 dead000000000001
[ 667.158317][ T9805] head: 0000000000000000 0000000080080008 00000000f5000000 0000000000000000
[ 667.159088][ T9805] head: 00fff00000000003 ffffea0000964a01 00000000ffffffff 00000000ffffffff
[ 667.159865][ T9805] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[ 667.160643][ T9805] page dumped because: kasan: bad access detected
[ 667.161216][ T9805] page_owner tracks the page as allocated
[ 667.161732][ T9805] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN9
[ 667.163566][ T9805] post_alloc_hook+0x1c0/0x230
[ 667.164003][ T9805] get_page_from_freelist+0xdeb/0x3b30
[ 667.164503][ T9805] __alloc_frozen_pages_noprof+0x25c/0x2460
[ 667.165040][ T9805] alloc_pages_mpol+0x1fb/0x550
[ 667.165489][ T9805] new_slab+0x23b/0x340
[ 667.165872][ T9805] ___slab_alloc+0xd81/0x1960
[ 667.166313][ T9805] __slab_alloc.isra.0+0x56/0xb0
[ 667.166767][ T9805] __kmalloc_cache_noprof+0x255/0x3e0
[ 667.167255][ T9805] psi_cgroup_alloc+0x52/0x2d0
[ 667.167693][ T9805] cgroup_mkdir+0x694/0x1210
[ 667.168118][ T9805] kernfs_iop_mkdir+0x111/0x190
[ 667.168568][ T9805] vfs_mkdir+0x59b/0x8d0
[ 667.168956][ T9805] do_mkdirat+0x2ed/0x3d0
[ 667.169353][ T9805] __x64_sys_mkdir+0xef/0x140
[ 667.169784][ T9805] do_syscall_64+0xc9/0x480
[ 667.170195][ T9805] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 667.170730][ T9805] page last free pid 1257 tgid 1257 stack trace:
[ 667.171304][ T9805] __free_frozen_pages+0x80c/0x1250
[ 667.171770][ T9805] vfree.part.0+0x12b/0xab0
[ 667.172182][ T9805] delayed_vfree_work+0x93/0xd0
[ 667.172612][ T9805] process_one_work+0x9b5/0x1b80
[ 667.173067][ T9805] worker_thread+0x630/0xe60
[ 667.173486][ T9805] kthread+0x3a8/0x770
[ 667.173857][ T9805] ret_from_fork+0x517/0x6e0
[ 667.174278][ T9805] ret_from_fork_asm+0x1a/0x30
[ 667.174703][ T9805]
[ 667.174917][ T9805] Memory state around the buggy address:
[ 667.175411][ T9805] ffff88802592f300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 667.176114][ T9805] ffff88802592f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 667.176830][ T9805] >ffff88802592f400: 00 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 667.177547][ T9805] ^
[ 667.177933][ T9805] ffff88802592f480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 667.178640][ T9805] ffff88802592f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 667.179350][ T9805] ==================================================================
The hfsplus_uni2asc() method operates by struct hfsplus_unistr:
struct hfsplus_unistr {
__be16 length;
hfsplus_unichr unicode[HFSPLUS_MAX_STRLEN];
} __packed;
where HFSPLUS_MAX_STRLEN is 255 bytes. The issue happens if length
of the structure instance has value bigger than 255 (for example,
65283). In such case, pointer on unicode buffer is going beyond of
the allocated memory.
The patch fixes the issue by checking the length value of
hfsplus_unistr instance and using 255 value in the case if length
value is bigger than HFSPLUS_MAX_STRLEN. Potential reason of such
situation could be a corruption of Catalog File b-tree's node.
Reported-by: Wenzhi Wang <wenzhi.wang@uwaterloo.ca>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Yangtao Li <frank.li@vivo.com>
Link: https://lore.kernel.org/r/20250710230830.110500-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c80aa2aaaa5e69d5219c6af8ef7e754114bd08d2 ]
The hfsplus_bnode_read() method can trigger the issue:
[ 174.852007][ T9784] ==================================================================
[ 174.852709][ T9784] BUG: KASAN: slab-out-of-bounds in hfsplus_bnode_read+0x2f4/0x360
[ 174.853412][ T9784] Read of size 8 at addr ffff88810b5fc6c0 by task repro/9784
[ 174.854059][ T9784]
[ 174.854272][ T9784] CPU: 1 UID: 0 PID: 9784 Comm: repro Not tainted 6.16.0-rc3 #7 PREEMPT(full)
[ 174.854281][ T9784] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 174.854286][ T9784] Call Trace:
[ 174.854289][ T9784] <TASK>
[ 174.854292][ T9784] dump_stack_lvl+0x10e/0x1f0
[ 174.854305][ T9784] print_report+0xd0/0x660
[ 174.854315][ T9784] ? __virt_addr_valid+0x81/0x610
[ 174.854323][ T9784] ? __phys_addr+0xe8/0x180
[ 174.854330][ T9784] ? hfsplus_bnode_read+0x2f4/0x360
[ 174.854337][ T9784] kasan_report+0xc6/0x100
[ 174.854346][ T9784] ? hfsplus_bnode_read+0x2f4/0x360
[ 174.854354][ T9784] hfsplus_bnode_read+0x2f4/0x360
[ 174.854362][ T9784] hfsplus_bnode_dump+0x2ec/0x380
[ 174.854370][ T9784] ? __pfx_hfsplus_bnode_dump+0x10/0x10
[ 174.854377][ T9784] ? hfsplus_bnode_write_u16+0x83/0xb0
[ 174.854385][ T9784] ? srcu_gp_start+0xd0/0x310
[ 174.854393][ T9784] ? __mark_inode_dirty+0x29e/0xe40
[ 174.854402][ T9784] hfsplus_brec_remove+0x3d2/0x4e0
[ 174.854411][ T9784] __hfsplus_delete_attr+0x290/0x3a0
[ 174.854419][ T9784] ? __pfx_hfs_find_1st_rec_by_cnid+0x10/0x10
[ 174.854427][ T9784] ? __pfx___hfsplus_delete_attr+0x10/0x10
[ 174.854436][ T9784] ? __asan_memset+0x23/0x50
[ 174.854450][ T9784] hfsplus_delete_all_attrs+0x262/0x320
[ 174.854459][ T9784] ? __pfx_hfsplus_delete_all_attrs+0x10/0x10
[ 174.854469][ T9784] ? rcu_is_watching+0x12/0xc0
[ 174.854476][ T9784] ? __mark_inode_dirty+0x29e/0xe40
[ 174.854483][ T9784] hfsplus_delete_cat+0x845/0xde0
[ 174.854493][ T9784] ? __pfx_hfsplus_delete_cat+0x10/0x10
[ 174.854507][ T9784] hfsplus_unlink+0x1ca/0x7c0
[ 174.854516][ T9784] ? __pfx_hfsplus_unlink+0x10/0x10
[ 174.854525][ T9784] ? down_write+0x148/0x200
[ 174.854532][ T9784] ? __pfx_down_write+0x10/0x10
[ 174.854540][ T9784] vfs_unlink+0x2fe/0x9b0
[ 174.854549][ T9784] do_unlinkat+0x490/0x670
[ 174.854557][ T9784] ? __pfx_do_unlinkat+0x10/0x10
[ 174.854565][ T9784] ? __might_fault+0xbc/0x130
[ 174.854576][ T9784] ? getname_flags.part.0+0x1c5/0x550
[ 174.854584][ T9784] __x64_sys_unlink+0xc5/0x110
[ 174.854592][ T9784] do_syscall_64+0xc9/0x480
[ 174.854600][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 174.854608][ T9784] RIP: 0033:0x7f6fdf4c3167
[ 174.854614][ T9784] Code: f0 ff ff 73 01 c3 48 8b 0d 26 0d 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 08
[ 174.854622][ T9784] RSP: 002b:00007ffcb948bca8 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[ 174.854630][ T9784] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6fdf4c3167
[ 174.854636][ T9784] RDX: 00007ffcb948bcc0 RSI: 00007ffcb948bcc0 RDI: 00007ffcb948bd50
[ 174.854641][ T9784] RBP: 00007ffcb948cd90 R08: 0000000000000001 R09: 00007ffcb948bb40
[ 174.854645][ T9784] R10: 00007f6fdf564fc0 R11: 0000000000000206 R12: 0000561e1bc9c2d0
[ 174.854650][ T9784] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 174.854658][ T9784] </TASK>
[ 174.854661][ T9784]
[ 174.879281][ T9784] Allocated by task 9784:
[ 174.879664][ T9784] kasan_save_stack+0x20/0x40
[ 174.880082][ T9784] kasan_save_track+0x14/0x30
[ 174.880500][ T9784] __kasan_kmalloc+0xaa/0xb0
[ 174.880908][ T9784] __kmalloc_noprof+0x205/0x550
[ 174.881337][ T9784] __hfs_bnode_create+0x107/0x890
[ 174.881779][ T9784] hfsplus_bnode_find+0x2d0/0xd10
[ 174.882222][ T9784] hfsplus_brec_find+0x2b0/0x520
[ 174.882659][ T9784] hfsplus_delete_all_attrs+0x23b/0x320
[ 174.883144][ T9784] hfsplus_delete_cat+0x845/0xde0
[ 174.883595][ T9784] hfsplus_rmdir+0x106/0x1b0
[ 174.884004][ T9784] vfs_rmdir+0x206/0x690
[ 174.884379][ T9784] do_rmdir+0x2b7/0x390
[ 174.884751][ T9784] __x64_sys_rmdir+0xc5/0x110
[ 174.885167][ T9784] do_syscall_64+0xc9/0x480
[ 174.885568][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 174.886083][ T9784]
[ 174.886293][ T9784] The buggy address belongs to the object at ffff88810b5fc600
[ 174.886293][ T9784] which belongs to the cache kmalloc-192 of size 192
[ 174.887507][ T9784] The buggy address is located 40 bytes to the right of
[ 174.887507][ T9784] allocated 152-byte region [ffff88810b5fc600, ffff88810b5fc698)
[ 174.888766][ T9784]
[ 174.888976][ T9784] The buggy address belongs to the physical page:
[ 174.889533][ T9784] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10b5fc
[ 174.890295][ T9784] flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
[ 174.890927][ T9784] page_type: f5(slab)
[ 174.891284][ T9784] raw: 057ff00000000000 ffff88801b4423c0 ffffea000426dc80 dead000000000002
[ 174.892032][ T9784] raw: 0000000000000000 0000000080100010 00000000f5000000 0000000000000000
[ 174.892774][ T9784] page dumped because: kasan: bad access detected
[ 174.893327][ T9784] page_owner tracks the page as allocated
[ 174.893825][ T9784] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c00(GFP_NOIO|__GFP_NOWARN|__GFP_NO1
[ 174.895373][ T9784] post_alloc_hook+0x1c0/0x230
[ 174.895801][ T9784] get_page_from_freelist+0xdeb/0x3b30
[ 174.896284][ T9784] __alloc_frozen_pages_noprof+0x25c/0x2460
[ 174.896810][ T9784] alloc_pages_mpol+0x1fb/0x550
[ 174.897242][ T9784] new_slab+0x23b/0x340
[ 174.897614][ T9784] ___slab_alloc+0xd81/0x1960
[ 174.898028][ T9784] __slab_alloc.isra.0+0x56/0xb0
[ 174.898468][ T9784] __kmalloc_noprof+0x2b0/0x550
[ 174.898896][ T9784] usb_alloc_urb+0x73/0xa0
[ 174.899289][ T9784] usb_control_msg+0x1cb/0x4a0
[ 174.899718][ T9784] usb_get_string+0xab/0x1a0
[ 174.900133][ T9784] usb_string_sub+0x107/0x3c0
[ 174.900549][ T9784] usb_string+0x307/0x670
[ 174.900933][ T9784] usb_cache_string+0x80/0x150
[ 174.901355][ T9784] usb_new_device+0x1d0/0x19d0
[ 174.901786][ T9784] register_root_hub+0x299/0x730
[ 174.902231][ T9784] page last free pid 10 tgid 10 stack trace:
[ 174.902757][ T9784] __free_frozen_pages+0x80c/0x1250
[ 174.903217][ T9784] vfree.part.0+0x12b/0xab0
[ 174.903645][ T9784] delayed_vfree_work+0x93/0xd0
[ 174.904073][ T9784] process_one_work+0x9b5/0x1b80
[ 174.904519][ T9784] worker_thread+0x630/0xe60
[ 174.904927][ T9784] kthread+0x3a8/0x770
[ 174.905291][ T9784] ret_from_fork+0x517/0x6e0
[ 174.905709][ T9784] ret_from_fork_asm+0x1a/0x30
[ 174.906128][ T9784]
[ 174.906338][ T9784] Memory state around the buggy address:
[ 174.906828][ T9784] ffff88810b5fc580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 174.907528][ T9784] ffff88810b5fc600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 174.908222][ T9784] >ffff88810b5fc680: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 174.908917][ T9784] ^
[ 174.909481][ T9784] ffff88810b5fc700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 174.910432][ T9784] ffff88810b5fc780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 174.911401][ T9784] ==================================================================
The reason of the issue that code doesn't check the correctness
of the requested offset and length. As a result, incorrect value
of offset or/and length could result in access out of allocated
memory.
This patch introduces is_bnode_offset_valid() method that checks
the requested offset value. Also, it introduces
check_and_correct_requested_length() method that checks and
correct the requested length (if it is necessary). These methods
are used in hfsplus_bnode_read(), hfsplus_bnode_write(),
hfsplus_bnode_clear(), hfsplus_bnode_copy(), and hfsplus_bnode_move()
with the goal to prevent the access out of allocated memory
and triggering the crash.
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Reported-by: Shuoran Bai <baishuoran@hrbeu.edu.cn>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250703214804.244077-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a431930c9bac518bf99d6b1da526a7f37ddee8d8 ]
This patch introduces is_bnode_offset_valid() method that checks
the requested offset value. Also, it introduces
check_and_correct_requested_length() method that checks and
correct the requested length (if it is necessary). These methods
are used in hfs_bnode_read(), hfs_bnode_write(), hfs_bnode_clear(),
hfs_bnode_copy(), and hfs_bnode_move() with the goal to prevent
the access out of allocated memory and triggering the crash.
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250703214912.244138-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 736a0516a16268995f4898eded49bfef077af709 ]
The hfs_find_init() method can trigger the crash
if tree pointer is NULL:
[ 45.746290][ T9787] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000008: 0000 [#1] SMP KAI
[ 45.747287][ T9787] KASAN: null-ptr-deref in range [0x0000000000000040-0x0000000000000047]
[ 45.748716][ T9787] CPU: 2 UID: 0 PID: 9787 Comm: repro Not tainted 6.16.0-rc3 #10 PREEMPT(full)
[ 45.750250][ T9787] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 45.751983][ T9787] RIP: 0010:hfs_find_init+0x86/0x230
[ 45.752834][ T9787] Code: c1 ea 03 80 3c 02 00 0f 85 9a 01 00 00 4c 8d 6b 40 48 c7 45 18 00 00 00 00 48 b8 00 00 00 00 00 fc
[ 45.755574][ T9787] RSP: 0018:ffffc90015157668 EFLAGS: 00010202
[ 45.756432][ T9787] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff819a4d09
[ 45.757457][ T9787] RDX: 0000000000000008 RSI: ffffffff819acd3a RDI: ffffc900151576e8
[ 45.758282][ T9787] RBP: ffffc900151576d0 R08: 0000000000000005 R09: 0000000000000000
[ 45.758943][ T9787] R10: 0000000080000000 R11: 0000000000000001 R12: 0000000000000004
[ 45.759619][ T9787] R13: 0000000000000040 R14: ffff88802c50814a R15: 0000000000000000
[ 45.760293][ T9787] FS: 00007ffb72734540(0000) GS:ffff8880cec64000(0000) knlGS:0000000000000000
[ 45.761050][ T9787] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.761606][ T9787] CR2: 00007f9bd8225000 CR3: 000000010979a000 CR4: 00000000000006f0
[ 45.762286][ T9787] Call Trace:
[ 45.762570][ T9787] <TASK>
[ 45.762824][ T9787] hfs_ext_read_extent+0x190/0x9d0
[ 45.763269][ T9787] ? submit_bio_noacct_nocheck+0x2dd/0xce0
[ 45.763766][ T9787] ? __pfx_hfs_ext_read_extent+0x10/0x10
[ 45.764250][ T9787] hfs_get_block+0x55f/0x830
[ 45.764646][ T9787] block_read_full_folio+0x36d/0x850
[ 45.765105][ T9787] ? __pfx_hfs_get_block+0x10/0x10
[ 45.765541][ T9787] ? const_folio_flags+0x5b/0x100
[ 45.765972][ T9787] ? __pfx_hfs_read_folio+0x10/0x10
[ 45.766415][ T9787] filemap_read_folio+0xbe/0x290
[ 45.766840][ T9787] ? __pfx_filemap_read_folio+0x10/0x10
[ 45.767325][ T9787] ? __filemap_get_folio+0x32b/0xbf0
[ 45.767780][ T9787] do_read_cache_folio+0x263/0x5c0
[ 45.768223][ T9787] ? __pfx_hfs_read_folio+0x10/0x10
[ 45.768666][ T9787] read_cache_page+0x5b/0x160
[ 45.769070][ T9787] hfs_btree_open+0x491/0x1740
[ 45.769481][ T9787] hfs_mdb_get+0x15e2/0x1fb0
[ 45.769877][ T9787] ? __pfx_hfs_mdb_get+0x10/0x10
[ 45.770316][ T9787] ? find_held_lock+0x2b/0x80
[ 45.770731][ T9787] ? lockdep_init_map_type+0x5c/0x280
[ 45.771200][ T9787] ? lockdep_init_map_type+0x5c/0x280
[ 45.771674][ T9787] hfs_fill_super+0x38e/0x720
[ 45.772092][ T9787] ? __pfx_hfs_fill_super+0x10/0x10
[ 45.772549][ T9787] ? snprintf+0xbe/0x100
[ 45.772931][ T9787] ? __pfx_snprintf+0x10/0x10
[ 45.773350][ T9787] ? do_raw_spin_lock+0x129/0x2b0
[ 45.773796][ T9787] ? find_held_lock+0x2b/0x80
[ 45.774215][ T9787] ? set_blocksize+0x40a/0x510
[ 45.774636][ T9787] ? sb_set_blocksize+0x176/0x1d0
[ 45.775087][ T9787] ? setup_bdev_super+0x369/0x730
[ 45.775533][ T9787] get_tree_bdev_flags+0x384/0x620
[ 45.775985][ T9787] ? __pfx_hfs_fill_super+0x10/0x10
[ 45.776453][ T9787] ? __pfx_get_tree_bdev_flags+0x10/0x10
[ 45.776950][ T9787] ? bpf_lsm_capable+0x9/0x10
[ 45.777365][ T9787] ? security_capable+0x80/0x260
[ 45.777803][ T9787] vfs_get_tree+0x8e/0x340
[ 45.778203][ T9787] path_mount+0x13de/0x2010
[ 45.778604][ T9787] ? kmem_cache_free+0x2b0/0x4c0
[ 45.779052][ T9787] ? __pfx_path_mount+0x10/0x10
[ 45.779480][ T9787] ? getname_flags.part.0+0x1c5/0x550
[ 45.779954][ T9787] ? putname+0x154/0x1a0
[ 45.780335][ T9787] __x64_sys_mount+0x27b/0x300
[ 45.780758][ T9787] ? __pfx___x64_sys_mount+0x10/0x10
[ 45.781232][ T9787] do_syscall_64+0xc9/0x480
[ 45.781631][ T9787] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 45.782149][ T9787] RIP: 0033:0x7ffb7265b6ca
[ 45.782539][ T9787] Code: 48 8b 0d c9 17 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48
[ 45.784212][ T9787] RSP: 002b:00007ffc0c10cfb8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 45.784935][ T9787] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffb7265b6ca
[ 45.785626][ T9787] RDX: 0000200000000240 RSI: 0000200000000280 RDI: 00007ffc0c10d100
[ 45.786316][ T9787] RBP: 00007ffc0c10d190 R08: 00007ffc0c10d000 R09: 0000000000000000
[ 45.787011][ T9787] R10: 0000000000000048 R11: 0000000000000206 R12: 0000560246733250
[ 45.787697][ T9787] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 45.788393][ T9787] </TASK>
[ 45.788665][ T9787] Modules linked in:
[ 45.789058][ T9787] ---[ end trace 0000000000000000 ]---
[ 45.789554][ T9787] RIP: 0010:hfs_find_init+0x86/0x230
[ 45.790028][ T9787] Code: c1 ea 03 80 3c 02 00 0f 85 9a 01 00 00 4c 8d 6b 40 48 c7 45 18 00 00 00 00 48 b8 00 00 00 00 00 fc
[ 45.792364][ T9787] RSP: 0018:ffffc90015157668 EFLAGS: 00010202
[ 45.793155][ T9787] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff819a4d09
[ 45.794123][ T9787] RDX: 0000000000000008 RSI: ffffffff819acd3a RDI: ffffc900151576e8
[ 45.795105][ T9787] RBP: ffffc900151576d0 R08: 0000000000000005 R09: 0000000000000000
[ 45.796135][ T9787] R10: 0000000080000000 R11: 0000000000000001 R12: 0000000000000004
[ 45.797114][ T9787] R13: 0000000000000040 R14: ffff88802c50814a R15: 0000000000000000
[ 45.798024][ T9787] FS: 00007ffb72734540(0000) GS:ffff8880cec64000(0000) knlGS:0000000000000000
[ 45.799019][ T9787] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.799822][ T9787] CR2: 00007f9bd8225000 CR3: 000000010979a000 CR4: 00000000000006f0
[ 45.800747][ T9787] Kernel panic - not syncing: Fatal exception
The hfs_fill_super() calls hfs_mdb_get() method that tries
to construct Extents Tree and Catalog Tree:
HFS_SB(sb)->ext_tree = hfs_btree_open(sb, HFS_EXT_CNID, hfs_ext_keycmp);
if (!HFS_SB(sb)->ext_tree) {
pr_err("unable to open extent tree\n");
goto out;
}
HFS_SB(sb)->cat_tree = hfs_btree_open(sb, HFS_CAT_CNID, hfs_cat_keycmp);
if (!HFS_SB(sb)->cat_tree) {
pr_err("unable to open catalog tree\n");
goto out;
}
However, hfs_btree_open() calls read_mapping_page() that
calls hfs_get_block(). And this method calls hfs_ext_read_extent():
static int hfs_ext_read_extent(struct inode *inode, u16 block)
{
struct hfs_find_data fd;
int res;
if (block >= HFS_I(inode)->cached_start &&
block < HFS_I(inode)->cached_start + HFS_I(inode)->cached_blocks)
return 0;
res = hfs_find_init(HFS_SB(inode->i_sb)->ext_tree, &fd);
if (!res) {
res = __hfs_ext_cache_extent(&fd, inode, block);
hfs_find_exit(&fd);
}
return res;
}
The problem here that hfs_find_init() is trying to use
HFS_SB(inode->i_sb)->ext_tree that is not initialized yet.
It will be initailized when hfs_btree_open() finishes
the execution.
The patch adds checking of tree pointer in hfs_find_init()
and it reworks the logic of hfs_btree_open() by reading
the b-tree's header directly from the volume. The read_mapping_page()
is exchanged on filemap_grab_folio() that grab the folio from
mapping. Then, sb_bread() extracts the b-tree's header
content and copy it into the folio.
Reported-by: Wenzhi Wang <wenzhi.wang@uwaterloo.ca>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
cc: Yangtao Li <frank.li@vivo.com>
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20250710213657.108285-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0b96d9bed324a1c1b7d02bfb9596351ef178428d ]
Fix incorrect shift order when combining the 48-bit block count.
Fixes: 2e1473d5195f ("erofs: implement 48-bit block addressing for unencoded inodes")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250807082019.3093539-1-hsiangkao@linux.alibaba.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 45d19b4b6c2d422771c29b83462d84afcbb33f01 ]
smaps_hugetlb_range() handles the pte without holdling ptl, and may be
concurrenct with migration, leaing to BUG_ON in pfn_swap_entry_to_page().
The race is as follows.
smaps_hugetlb_range migrate_pages
huge_ptep_get
remove_migration_ptes
folio_unlock
pfn_swap_entry_folio
BUG_ON
To fix it, hold ptl lock in smaps_hugetlb_range().
Link: https://lkml.kernel.org/r/20250724090958.455887-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20250724090958.455887-2-tujinjiang@huawei.com
Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brahmajit Das <brahmajit.xyz@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8c48e1c7520321cc87ff651e96093e2f412785fb upstream.
We already called ib_drain_qp() before and that makes sure
send_done() was called with IB_WC_WR_FLUSH_ERR, but
didn't called atomic_dec_and_test(&sc->send_io.pending.count)
So we may never reach the info->send_pending == 0 condition.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: 5349ae5e05fa ("smb: client: let send_done() cleanup before calling smbd_disconnect_rdma_connection()")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
smbd_disconnect_rdma_connection()
commit 5349ae5e05fa37409fd48a1eb483b199c32c889b upstream.
We should call ib_dma_unmap_single() and mempool_free() before calling
smbd_disconnect_rdma_connection().
And smbd_disconnect_rdma_connection() needs to be the last function to
call as all other state might already be gone after it returns.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Fixes: f198186aa9bb ("CIFS: SMBD: Establish SMB Direct connection")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 04a2c4b4511d186b0fce685da21085a5d4acd370 upstream.
When sysctl_nr_open is set to a very high value (for example, 1073741816
as set by systemd), processes attempting to use file descriptors near
the limit can trigger massive memory allocation attempts that exceed
INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the
requested size exceeds INT_MAX and emit a warning when the allocation is
not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:
# echo 1073741816 > /proc/sys/fs/nr_open
2. Run a program that uses a high file descriptor:
#include <unistd.h>
#include <sys/resource.h>
int main() {
struct rlimit rlim = {1073741824, 1073741824};
setrlimit(RLIMIT_NOFILE, &rlim);
dup2(2, 1073741880); // Triggers the warning
return 0;
}
3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
maximum possible value. The rationale was that systems with memory
control groups (memcg) no longer need separate file descriptor limits
since memory is properly accounted. However, this change overlooked
that:
1. The kernel's allocation functions still enforce INT_MAX as a maximum
size regardless of memcg accounting
2. Programs and tests that legitimately test file descriptor limits can
inadvertently trigger massive allocations
3. The resulting allocations (>8GB) are impractical and will always fail
systemd's algorithm starts with INT_MAX and keeps halving the value
until the kernel accepts it. On most systems, this results in nr_open
being set to 1073741816 (0x3ffffff8), which is just under 1GB of file
descriptors.
While processes rarely use file descriptors near this limit in normal
operation, certain selftests (like
tools/testing/selftests/core/unshare_test.c) and programs that test file
descriptor limits can trigger this issue.
Fix this by adding a check in alloc_fdtable() to ensure the requested
allocation size does not exceed INT_MAX. This causes the operation to
fail with -EMFILE instead of triggering a kernel warning and avoids the
impractical >8GB memory allocation request.
Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open")
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b41c1d8d07906786c60893980d52688f31d114a6 upstream.
Make fscrypt no longer use Crypto API drivers for non-inline crypto
engines, even when the Crypto API prioritizes them over CPU-based code
(which unfortunately it often does). These drivers tend to be really
problematic, especially for fscrypt's workload. This commit has no
effect on inline crypto engines, which are different and do work well.
Specifically, exclude drivers that have CRYPTO_ALG_KERN_DRIVER_ONLY or
CRYPTO_ALG_ALLOCATES_MEMORY set. (Later, CRYPTO_ALG_ASYNC should be
excluded too. That's omitted for now to keep this commit backportable,
since until recently some CPU-based code had CRYPTO_ALG_ASYNC set.)
There are two major issues with these drivers: bugs and performance.
First, these drivers tend to be buggy. They're fundamentally much more
error-prone and harder to test than the CPU-based code. They often
don't get tested before kernel releases, and even if they do, the crypto
self-tests don't properly test these drivers. Released drivers have
en/decrypted or hashed data incorrectly. These bugs cause issues for
fscrypt users who often didn't even want to use these drivers, e.g.:
- https://github.com/google/fscryptctl/issues/32
- https://github.com/google/fscryptctl/issues/9
- https://lore.kernel.org/r/PH0PR02MB731916ECDB6C613665863B6CFFAA2@PH0PR02MB7319.namprd02.prod.outlook.com
These drivers have also similarly caused issues for dm-crypt users,
including data corruption and deadlocks. Since Linux v5.10, dm-crypt
has disabled most of them by excluding CRYPTO_ALG_ALLOCATES_MEMORY.
Second, these drivers tend to be *much* slower than the CPU-based code.
This may seem counterintuitive, but benchmarks clearly show it. There's
a *lot* of overhead associated with going to a hardware driver, off the
CPU, and back again. To prove this, I gathered as many systems with
this type of crypto engine as I could, and I measured synchronous
encryption of 4096-byte messages (which matches fscrypt's workload):
Intel Emerald Rapids server:
AES-256-XTS:
xts-aes-vaes-avx512 16171 MB/s [CPU-based, Vector AES]
qat_aes_xts 289 MB/s [Offload, Intel QuickAssist]
Qualcomm SM8650 HDK:
AES-256-XTS:
xts-aes-ce 4301 MB/s [CPU-based, ARMv8 Crypto Extensions]
xts-aes-qce 73 MB/s [Offload, Qualcomm Crypto Engine]
i.MX 8M Nano LPDDR4 EVK:
AES-256-XTS:
xts-aes-ce 647 MB/s [CPU-based, ARMv8 Crypto Extensions]
xts(ecb-aes-caam) 20 MB/s [Offload, CAAM]
AES-128-CBC-ESSIV:
essiv(cbc-aes-caam,sha256-lib) 23 MB/s [Offload, CAAM]
STM32MP157F-DK2:
AES-256-XTS:
xts-aes-neonbs 13.2 MB/s [CPU-based, ARM NEON]
xts(stm32-ecb-aes) 3.1 MB/s [Offload, STM32 crypto engine]
AES-128-CBC-ESSIV:
essiv(cbc-aes-neonbs,sha256-lib)
14.7 MB/s [CPU-based, ARM NEON]
essiv(stm32-cbc-aes,sha256-lib)
3.2 MB/s [Offload, STM32 crypto engine]
Adiantum:
adiantum(xchacha12-arm,aes-arm,nhpoly1305-neon)
52.8 MB/s [CPU-based, ARM scalar + NEON]
So, there was no case in which the crypto engine was even *close* to
being faster. On the first three, which have AES instructions in the
CPU, the CPU was 30 to 55 times faster (!). Even on STM32MP157F-DK2
which has a Cortex-A7 CPU that doesn't have AES instructions, AES was
over 4 times faster on the CPU. And Adiantum encryption, which is what
actually should be used on CPUs like that, was over 17 times faster.
Other justifications that have been given for these non-inline crypto
engines (almost always coming from the hardware vendors, not actual
users) don't seem very plausible either:
- The crypto engine throughput could be improved by processing
multiple requests concurrently. Currently irrelevant to fscrypt,
since it doesn't do that. This would also be complex, and unhelpful
in many cases. 2 of the 4 engines I tested even had only one queue.
- Some of the engines, e.g. STM32, support hardware keys. Also
currently irrelevant to fscrypt, since it doesn't support these.
Interestingly, the STM32 driver itself doesn't support this either.
- Free up CPU for other tasks and/or reduce energy usage. Not very
plausible considering the "short" message length, driver overhead,
and scheduling overhead. There's just very little time for the CPU
to do something else like run another task or enter low-power state,
before the message finishes and it's time to process the next one.
- Some of these engines resist power analysis and electromagnetic
attacks, while the CPU-based crypto generally does not. In theory,
this sounds great. In practice, if this benefit requires the use of
an off-CPU offload that massively regresses performance and has a
low-quality, buggy driver, the price for this hardening (which is
not relevant to most fscrypt users, and tends to be incomplete) is
just too high. Inline crypto engines are much more promising here,
as are on-CPU solutions like RISC-V High Assurance Cryptography.
Fixes: b30ab0e03407 ("ext4 crypto: add ext4 encryption facilities")
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b01f21cacde9f2878492cf318fee61bf4ccad323 upstream.
Capabilities cannot be inherited when we cross into a new filesystem.
They need to be reset to the minimal defaults, and then probed for
again.
Fixes: 54ceac451598 ("NFS: Share NFS superblocks per-protocol per-server per-FSID")
Cc: stable@vger.kernel.org
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9c65001c57164033ad08b654c8b5ae35512ddf4a upstream.
When the client sends an OPEN with claim type CLAIM_DELEG_CUR_FH or
CLAIM_DELEGATION_CUR, the delegation stateid and the file handle
must belong to the same file, otherwise return NFS4ERR_INVAL.
Note that RFC8881, section 8.2.4, mandates the server to return
NFS4ERR_BAD_STATEID if the selected table entry does not match the
current filehandle. However returning NFS4ERR_BAD_STATEID in the
OPEN causes the client to retry the operation and therefor get the
client into a loop. To avoid this situation we return NFS4ERR_INVAL
instead.
Reported-by: Petro Pavlov <petro.pavlov@vastdata.com>
Fixes: c44c5eeb2c02 ("[PATCH] nfsd4: add open state code for CLAIM_DELEGATE_CUR")
Cc: stable@vger.kernel.org
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 908e4ead7f757504d8b345452730636e298cbf68 upstream.
Lei Lu recently reported that nfsd4_setclientid_confirm() did not check
the return value from get_client_locked(). a SETCLIENTID_CONFIRM could
race with a confirmed client expiring and fail to get a reference. That
could later lead to a UAF.
Fix this by getting a reference early in the case where there is an
extant confirmed client. If that fails then treat it as if there were no
confirmed client found at all.
In the case where the unconfirmed client is expiring, just fail and
return the result from get_client_locked().
Reported-by: lei lu <llfamsec@gmail.com>
Closes: https://lore.kernel.org/linux-nfs/CAEBF3_b=UvqzNKdnfD_52L05Mqrqui9vZ2eFamgAbV0WG+FNWQ@mail.gmail.com/
Fixes: d20c11d86d8f ("nfsd: Protect session creation and client confirm using client_lock")
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|