summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-08-29smb: client: ignore unhandled reparse tagsPaulo Alcantara
[ Upstream commit ec686804117a0421cf31d54427768aaf93aa0069 ] Just ignore reparse points that the client can't parse rather than bailing out and not opening the file or directory. Reported-by: Marc <1marc1@gmail.com> Closes: https://lore.kernel.org/r/CAMHwNVv-B+Q6wa0FEXrAuzdchzcJRsPKDDRrNaYZJd6X-+iJzw@mail.gmail.com Fixes: 539aad7f14da ("smb: client: introduce ->parse_reparse_point()") Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: send: allow cloning non-aligned extent if it ends at i_sizeFilipe Manana
[ Upstream commit 46a6e10a1ab16cc71d4a3cab73e79aabadd6b8ea ] If we a find that an extent is shared but its end offset is not sector size aligned, then we don't clone it and issue write operations instead. This is because the reflink (remap_file_range) operation does not allow to clone unaligned ranges, except if the end offset of the range matches the i_size of the source and destination files (and the start offset is sector size aligned). While this is not incorrect because send can only guarantee that a file has the same data in the source and destination snapshots, it's not optimal and generates confusion and surprising behaviour for users. For example, running this test: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Use a file size not aligned to any possible sector size. file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes dd if=/dev/random of=$MNT/foo bs=$file_size count=1 cp --reflink=always $MNT/foo $MNT/bar btrfs subvolume snapshot -r $MNT/ $MNT/snap rm -f /tmp/send-test btrfs send -f /tmp/send-test $MNT/snap umount $MNT mkfs.btrfs -f $DEV mount $DEV $MNT btrfs receive -vv -f /tmp/send-test $MNT xfs_io -r -c "fiemap -v" $MNT/snap/bar umount $MNT Gives the following result: (...) mkfile o258-7-0 rename o258-7-0 -> bar write bar - offset=0 length=49152 write bar - offset=49152 length=49152 write bar - offset=98304 length=49152 write bar - offset=147456 length=49152 write bar - offset=196608 length=49152 write bar - offset=245760 length=49152 write bar - offset=294912 length=49152 write bar - offset=344064 length=49152 write bar - offset=393216 length=49152 write bar - offset=442368 length=49152 write bar - offset=491520 length=49152 write bar - offset=540672 length=49152 write bar - offset=589824 length=49152 write bar - offset=638976 length=49152 write bar - offset=688128 length=49152 write bar - offset=737280 length=49152 write bar - offset=786432 length=49152 write bar - offset=835584 length=49152 write bar - offset=884736 length=49152 write bar - offset=933888 length=49152 write bar - offset=983040 length=49152 write bar - offset=1032192 length=16389 chown bar - uid=0, gid=0 chmod bar - mode=0644 utimes bar utimes BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2055]: 26624..28679 2056 0x1 There's no clone operation to clone extents from the file foo into file bar and fiemap confirms there's no shared flag (0x2000). So update send_write_or_clone() so that it proceeds with cloning if the source and destination ranges end at the i_size of the respective files. After this changes the result of the test is: (...) mkfile o258-7-0 rename o258-7-0 -> bar clone bar - source=foo source offset=0 offset=0 length=1048581 chown bar - uid=0, gid=0 chmod bar - mode=0644 utimes bar utimes BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2055]: 26624..28679 2056 0x2001 A test case for fstests will also follow up soon. Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: replace sb::s_blocksize by fs_info::sectorsizeDavid Sterba
[ Upstream commit 4e00422ee62663e31e611d7de4d2c4aa3f8555f2 ] The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize. Replace all uses, in most cases it's fewer pointer indirections. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 46a6e10a1ab1 ("btrfs: send: allow cloning non-aligned extent if it ends at i_size") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29jfs: define xtree root and page independentlyDave Kleikamp
commit a779ed754e52d582b8c0e17959df063108bd0656 upstream. In order to make array bounds checking sane, provide a separate definition of the in-inode xtree root and the external xtree page. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Tested-by: Manas Ghandat <ghandatmanas@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29ext4: set the type of max_zeroout to unsigned int to avoid overflowBaokun Li
[ Upstream commit 261341a932d9244cbcd372a3659428c8723e5a49 ] The max_zeroout is of type int and the s_extent_max_zeroout_kb is of type uint, and the s_extent_max_zeroout_kb can be freely modified via the sysfs interface. When the block size is 1024, max_zeroout may overflow, so declare it as unsigned int to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29NFS: avoid infinite loop in pnfs_update_layout.NeilBrown
[ Upstream commit 2fdbc20036acda9e5694db74a032d3c605323005 ] If pnfsd_update_layout() is called on a file for which recovery has failed it will enter a tight infinite loop. NFS_LAYOUT_INVALID_STID will be set, nfs4_select_rw_stateid() will return -EIO, and nfs4_schedule_stateid_recovery() will do nothing, so nfs4_client_recover_expired_lease() will not wait. So the code will loop indefinitely. Break the loop by testing the validity of the open stateid at the top of the loop. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29f2fs: fix to do sanity check in update_sit_entryZhiguo Niu
[ Upstream commit 36959d18c3cf09b3c12157c6950e18652067de77 ] If GET_SEGNO return NULL_SEGNO for some unecpected case, update_sit_entry will access invalid memory address, cause system crash. It is better to do sanity check about GET_SEGNO just like update_segment_mtime & locate_dirty_segment. Also remove some redundant judgment code. Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: delete pointless BUG_ON check on quota root in ↵David Sterba
btrfs_qgroup_account_extent() [ Upstream commit f40a3ea94881f668084f68f6b9931486b1606db0 ] The BUG_ON is deep in the qgroup code where we can expect that it exists. A NULL pointer would cause a crash. It was added long ago in 550d7a2ed5db35 ("btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents()."). It maybe made sense back then as the quota enable/disable state machine was not that robust as it is nowadays, so we can just delete it. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: change BUG_ON to assertion in tree_move_down()David Sterba
[ Upstream commit 56f335e043ae73c32dbb70ba95488845dc0f1e6e ] There's only one caller of tree_move_down() that does not pass level 0 so the assertion is better suited here. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: send: handle unexpected inode in header process_recorded_refs()David Sterba
[ Upstream commit 5d2288711ccc483feca73151c46ee835bda17839 ] Change BUG_ON to proper error handling when an unexpected inode number is encountered. As the comment says this should never happen. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: send: handle unexpected data in header buffer in begin_cmd()David Sterba
[ Upstream commit e80e3f732cf53c64b0d811e1581470d67f6c3228 ] Change BUG_ON to a proper error handling in the unlikely case of seeing data when the command is started. This is supposed to be reset when the command is finished (send_cmd, send_encoded_extent). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: handle invalid root reference found in may_destroy_subvol()David Sterba
[ Upstream commit 6fbc6f4ac1f4907da4fc674251527e7dc79ffbf6 ] The may_destroy_subvol() looks up a root by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break the allowed range of a root id. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: push errors up from add_async_extent()David Sterba
[ Upstream commit dbe6cda68f0e1be269e6509c8bf3d8d89089c1c4 ] The memory allocation error in add_async_extent() is not handled properly, return an error and push the BUG_ON to the caller. Handling it there is not trivial so at least make it visible. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: tests: allocate dummy fs_info and root in test_find_delalloc()David Sterba
[ Upstream commit b2136cc288fce2f24a92f3d656531b2d50ebec5a ] Allocate fs_info and root to have a valid fs_info pointer in case it's dereferenced by a helper outside of tests, like find_lock_delalloc_range(). Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: change BUG_ON to assertion when checking for delayed_node rootDavid Sterba
[ Upstream commit be73f4448b607e6b7ce41cd8ef2214fdf6e7986f ] The pointer to root is initialized in btrfs_init_delayed_node(), no need to check for it again. Change the BUG_ON to assertion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: defrag: change BUG_ON to assertion in btrfs_defrag_leaves()David Sterba
[ Upstream commit 51d4be540054be32d7ce28b63ea9b84ac6ff1db2 ] The BUG_ON verifies a condition that should be guaranteed by the correct use of the path search (with keep_locks and lowest_level set), an assertion is the suitable check. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: delayed-inode: drop pointless BUG_ON in __btrfs_remove_delayed_item()David Sterba
[ Upstream commit 778e618b8bfedcc39354373c1b072c5fe044fa7b ] There's a BUG_ON checking for a valid pointer of fs_info::delayed_root but it is valid since init_mount_fs_info() and has the same lifetime as fs_info. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29f2fs: stop checkpoint when get a out-of-bounds segmentZhiguo Niu
[ Upstream commit f9e28904e6442019043a8e94ec6747a064d06003 ] There is low probability that an out-of-bounds segment will be got on a small-capacity device. In order to prevent subsequent write requests allocating block address from this invalid segment, which may cause unexpected issue, stop checkpoint should be performed. Also introduce a new stop cp reason: STOP_CP_REASON_NO_SEGMENT. Note, f2fs_stop_checkpoint(, false) is complex and it may sleep, so we should move it outside segmap_lock spinlock coverage in get_new_segment(). Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29virtiofs: forbid newlines in tagsStefan Hajnoczi
[ Upstream commit 40488cc16f7ea0d193a4e248f0d809c25cc377db ] Newlines in virtiofs tags are awkward for users and potential vectors for string injection attacks. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29fs: binfmt_elf_efpic: don't use missing interpreter's propertiesMax Filippov
[ Upstream commit 15fd1dc3dadb4268207fa6797e753541aca09a2a ] Static FDPIC executable may get an executable stack even when it has non-executable GNU_STACK segment. This happens when STACK segment has rw permissions, but does not specify stack size. In that case FDPIC loader uses permissions of the interpreter's stack, and for static executables with no interpreter it results in choosing the arch-default permissions for the stack. Fix that by using the interpreter's properties only when the interpreter is actually used. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Link: https://lore.kernel.org/r/20240118150637.660461-1-jcmvbkbc@gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29kernfs: fix false-positive WARN(nr_mmapped) in kernfs_drain_open_filesNeel Natu
[ Upstream commit 05d8f255867e3196565bb31a911a437697fab094 ] Prior to this change 'on->nr_mmapped' tracked the total number of mmaps across all of its associated open files via kernfs_fop_mmap(). Thus if the file descriptor associated with a kernfs_open_file was mmapped 10 times then we would have: 'of->mmapped = true' and 'of_on(of)->nr_mmapped = 10'. The problem is that closing or draining a 'of->mmapped' file would only decrement one from the 'of_on(of)->nr_mmapped' counter. For e.g. we have this from kernfs_unlink_open_file(): if (of->mmapped) on->nr_mmapped--; The WARN_ON_ONCE(on->nr_mmapped) in kernfs_drain_open_files() is easy to reproduce by: 1. opening a (mmap-able) kernfs file. 2. mmap-ing that file more than once (mapping just once masks the issue). 3. trigger a drain of that kernfs file. Modulo out-of-tree patches I was able to trigger this reliably by identifying pci device nodes in sysfs that have resource regions that are mmap-able and that don't have any driver attached to them (steps 1 and 2). For step 3 we can "echo 1 > remove" to trigger a kernfs_drain. Signed-off-by: Neel Natu <neelnatu@google.com> Link: https://lore.kernel.org/r/20240127234636.609265-1-neelnatu@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29quota: Remove BUG_ON from dqget()Jan Kara
[ Upstream commit 249f374eb9b6b969c64212dd860cc1439674c4a8 ] dqget() checks whether dquot->dq_sb is set when returning it using BUG_ON. Firstly this doesn't work as an invalidation check for quite some time (we release dquot with dq_sb set these days), secondly using BUG_ON is quite harsh. Use WARN_ON_ONCE and check whether dquot is still hashed instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29fuse: fix UAF in rcu pathwalksAl Viro
[ Upstream commit 053fc4f755ad43cf35210677bcba798ccdc48d0c ] ->permission(), ->get_link() and ->inode_get_acl() might dereference ->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns as well) when called from rcu pathwalk. Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info and dropping ->user_ns rcu-delayed too. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29afs: fix __afs_break_callback() / afs_drop_open_mmap() raceAl Viro
[ Upstream commit 275655d3207b9e65d1561bf21c06a622d9ec1d43 ] In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero do queue_work(&vnode->cb_work). In afs_drop_open_mmap() we decrement ->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero. The trouble is, there's nothing to prevent __afs_break_callback() from seeing ->cb_nr_mmap before the decrement and do queue_work() after both the decrement and flush_work(). If that happens, we might be in trouble - vnode might get freed before the queued work runs. __afs_break_callback() is always done under ->cb_lock, so let's make sure that ->cb_nr_mmap can change from non-zero to zero while holding ->cb_lock (the spinlock component of it - it's a seqlock and we don't need to mess with the counter). Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: zlib: fix and simplify the inline extent decompressionQu Wenruo
[ Upstream commit 2c25716dcc25a0420c4ad49d6e6bf61e60a21434 ] [BUG] If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this: item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160 generation 8 transid 8 size 4096 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24 index 2 namelen 14 name: source_inlined item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69 generation 8 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 1 (zlib) Which has an inline compressed extent at file offset 0, and its decompressed size is 4K, allowing us to reflink that 4K range to another location (which will not be compressed). If we do such reflink on a subpage system, it would fail like this: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest XFS_IOC_CLONE_RANGE: Input/output error [CAUSE] In zlib_decompress(), we didn't treat @start_byte as just a page offset, but also use it as an indicator on whether we should switch our output buffer. In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffer, since the whole input/output buffer should never exceed one sector. Note: The above assumption is only not true if we're going to support multi-page sectorsize. Thus the current code using @start_byte as a condition to switch input/output buffer or finish the decompression is completely incorrect. [FIX] The fix involves several modifications: - Rename @start_byte to @dest_pgoff to properly express its meaning - Add an extra ASSERT() inside btrfs_decompress() to make sure the input/output size never exceeds one sector. - Use Z_FINISH flag to make sure the decompression happens in one go - Remove the loop needed to switch input/output buffers - Use correct destination offset inside the destination page - Consider early end as an error After the fix, even on 64K page sized aarch64, above reflink now works as expected: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest linked 4096/4096 bytes at offset 61440 And resulted a correct file layout: item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160 generation 10 transid 10 size 65536 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14 index 3 namelen 4 name: dest item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 10 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53 generation 10 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29ext4: do not trim the group with corrupted block bitmapBaokun Li
[ Upstream commit 172202152a125955367393956acf5f4ffd092e0d ] Otherwise operating on an incorrupted block bitmap can lead to all sorts of unknown problems. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29gfs2: Refcounting fix in gfs2_thaw_superAndreas Gruenbacher
[ Upstream commit 4e58543e7da4859c4ba61d15493e3522b6ad71fd ] It turns out that the .freeze_super and .thaw_super operations require the filesystem to manage the superblock refcount itself. We are using the freeze_super() and thaw_super() helpers to mostly take care of that for us, but this means that the superblock may no longer be around by when thaw_super() returns, and gfs2_thaw_super() will then access freed memory. Take an extra superblock reference in gfs2_thaw_super() to fix that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29gfs2: setattr_chown: Add missing initializationAndreas Gruenbacher
[ Upstream commit 2d8d7990619878a848b1d916c2f936d3012ee17d ] Add a missing initialization of variable ap in setattr_chown(). Without, chown() may be able to bypass quotas. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29binfmt_misc: cleanup on filesystem umountChristian Brauner
[ Upstream commit 1c5976ef0f7ad76319df748ccb99a4c7ba2ba464 ] Currently, registering a new binary type pins the binfmt_misc filesystem. Specifically, this means that as long as there is at least one binary type registered the binfmt_misc filesystem survives all umounts, i.e. the superblock is not destroyed. Meaning that a umount followed by another mount will end up with the same superblock and the same binary type handlers. This is a behavior we tend to discourage for any new filesystems (apart from a few special filesystems such as e.g. configfs or debugfs). A umount operation without the filesystem being pinned - by e.g. someone holding a file descriptor to an open file - should usually result in the destruction of the superblock and all associated resources. This makes introspection easier and leads to clearly defined, simple and clean semantics. An administrator can rely on the fact that a umount will guarantee a clean slate making it possible to reinitialize a filesystem. Right now all binary types would need to be explicitly deleted before that can happen. This allows us to remove the heavy-handed calls to simple_pin_fs() and simple_release_fs() when creating and deleting binary types. This in turn allows us to replace the current brittle pinning mechanism abusing dget() which has caused a range of bugs judging from prior fixes in [2] and [3]. The additional dget() in load_misc_binary() pins the dentry but only does so for the sake to prevent ->evict_inode() from freeing the node when a user removes the binary type and kill_node() is run. Which would mean ->interpreter and ->interp_file would be freed causing a UAF. This isn't really nicely documented nor is it very clean because it relies on simple_pin_fs() pinning the filesystem as long as at least one binary type exists. Otherwise it would cause load_misc_binary() to hold on to a dentry belonging to a superblock that has been shutdown. Replace that implicit pinning with a clean and simple per-node refcount and get rid of the ugly dget() pinning. A similar mechanism exists for e.g. binderfs (cf. [4]). All the cleanup work can now be done in ->evict_inode(). In a follow-up patch we will make it possible to use binfmt_misc in sandboxes. We will use the cleaner semantics where a umount for the filesystem will cause the superblock and all resources to be deallocated. In preparation for this apply the same semantics to the initial binfmt_misc mount. Note, that this is a user-visible change and as such a uapi change but one that we can reasonably risk. We've discussed this in earlier versions of this patchset (cf. [1]). The main user and provider of binfmt_misc is systemd. Systemd provides binfmt_misc via autofs since it is configurable as a kernel module and is used by a few exotic packages and users. As such a binfmt_misc mount is triggered when /proc/sys/fs/binfmt_misc is accessed and is only provided on demand. Other autofs on demand filesystems include EFI ESP which systemd umounts if the mountpoint stays idle for a certain amount of time. This doesn't apply to the binfmt_misc autofs mount which isn't touched once it is mounted meaning this change can't accidently wipe binary type handlers without someone having explicitly unmounted binfmt_misc. After speaking to systemd folks they don't expect this change to affect them. In line with our general policy, if we see a regression for systemd or other users with this change we will switch back to the old behavior for the initial binfmt_misc mount and have binary types pin the filesystem again. But while we touch this code let's take the chance and let's improve on the status quo. [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu [2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()" [3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()") [4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II") Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1) Cc: Sargun Dhillon <sargun@sargun.me> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jann Horn <jannh@google.com> Cc: Henning Schild <henning.schild@siemens.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Laurent Vivier <laurent@vivier.eu> Cc: linux-fsdevel@vger.kernel.org Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29btrfs: tree-checker: add dev extent item checksQu Wenruo
commit 008e2512dc5696ab2dc5bf264e98a9fe9ceb830e upstream. [REPORT] There is a corruption report that btrfs refused to mount a fs that has overlapping dev extents: BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272 BTRFS error (device sdc): failed to verify dev extents against chunks: -117 BTRFS error (device sdc): open_ctree failed [CAUSE] The direct cause is very obvious, there is a bad dev extent item with incorrect length. With btrfs check reporting two overlapping extents, the second one shows some clue on the cause: ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272 ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144 ERROR: errors found in extent allocation tree or chunk allocation The second one looks like a bitflip happened during new chunk allocation: hex(2257707008000) = 0x20da9d30000 hex(2257707270144) = 0x20da9d70000 diff = 0x00000040000 So it looks like a bitflip happened during new dev extent allocation, resulting the second overlap. Currently we only do the dev-extent verification at mount time, but if the corruption is caused by memory bitflip, we really want to catch it before writing the corruption to the storage. Furthermore the dev extent items has the following key definition: (<device id> DEV_EXTENT <physical offset>) Thus we can not just rely on the generic key order check to make sure there is no overlapping. [ENHANCEMENT] Introduce dedicated dev extent checks, including: - Fixed member checks * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3) * chunk_objectid should always be BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256) - Alignment checks * chunk_offset should be aligned to sectorsize * length should be aligned to sectorsize * key.offset should be aligned to sectorsize - Overlap checks If the previous key is also a dev-extent item, with the same device id, make sure we do not overlap with the previous dev extent. Reported: Stefan N <stefannnau@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29btrfs: zoned: properly take lock to read/update block group's zoned variablesNaohiro Aota
commit e30729d4bd4001881be4d1ad4332a5d4985398f8 upstream. __btrfs_add_free_space_zoned() references and modifies bg's alloc_offset, ro, and zone_unusable, but without taking the lock. It is mostly safe because they monotonically increase (at least for now) and this function is mostly called by a transaction commit, which is serialized by itself. Still, taking the lock is a safer and correct option and I'm going to add a change to reset zone_unusable while a block group is still alive. So, add locking around the operations. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir typeQu Wenruo
commit 31723c9542dba1681cc3720571fdf12ffe0eddd9 upstream. [REPORT] There is a bug report that kernel is rejecting a mismatching inode mode and its dir item: [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with dir: inode mode=040700 btrfs type=2 dir type=0 [CAUSE] It looks like the inode mode is correct, while the dir item type 0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all. This may be caused by a memory bit flip. [ENHANCEMENT] Although tree-checker is not able to do any cross-leaf verification, for this particular case we can at least reject any dir type with BTRFS_FT_UNKNOWN. So here we enhance the dir type check from [0, BTRFS_FT_MAX), to (0, BTRFS_FT_MAX). Although the existing corruption can not be fixed just by such enhanced checking, it should prevent the same 0x2->0x0 bitflip for dir type to reach disk in the future. Reported-by: Kota <nospam@kota.moe> Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29fs/netfs/fscache_cookie: add missing "n_accesses" checkMax Kellermann
commit f71aa06398aabc2e3eaac25acdf3d62e0094ba70 upstream. This fixes a NULL pointer dereference bug due to a data race which looks like this: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018 Workqueue: events_unbound netfs_rreq_write_to_cache_work RIP: 0010:cachefiles_prepare_write+0x30/0xa0 Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10 RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286 RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000 RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438 RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001 R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68 R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00 FS: 0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0 Call Trace: <TASK> ? __die+0x1f/0x70 ? page_fault_oops+0x15d/0x440 ? search_module_extables+0xe/0x40 ? fixup_exception+0x22/0x2f0 ? exc_page_fault+0x5f/0x100 ? asm_exc_page_fault+0x22/0x30 ? cachefiles_prepare_write+0x30/0xa0 netfs_rreq_write_to_cache_work+0x135/0x2e0 process_one_work+0x137/0x2c0 worker_thread+0x2e9/0x400 ? __pfx_worker_thread+0x10/0x10 kthread+0xcc/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Modules linked in: CR2: 0000000000000008 ---[ end trace 0000000000000000 ]--- This happened because fscache_cookie_state_machine() was slow and was still running while another process invoked fscache_unuse_cookie(); this led to a fscache_cookie_lru_do_one() call, setting the FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by fscache_cookie_state_machine(), withdrawing the cookie via cachefiles_withdraw_cookie(), clearing cookie->cache_priv. At the same time, yet another process invoked cachefiles_prepare_write(), which found a NULL pointer in this code line: struct cachefiles_object *object = cachefiles_cres_object(cres); The next line crashes, obviously: struct cachefiles_cache *cache = object->volume->cache; During cachefiles_prepare_write(), the "n_accesses" counter is non-zero (via fscache_begin_operation()). The cookie must not be withdrawn until it drops to zero. The counter is checked by fscache_cookie_state_machine() before switching to FSCACHE_COOKIE_STATE_RELINQUISHING and FSCACHE_COOKIE_STATE_WITHDRAWING (in "case FSCACHE_COOKIE_STATE_FAILED"), but not for FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case FSCACHE_COOKIE_STATE_ACTIVE"). This patch adds the missing check. With a non-zero access counter, the function returns and the next fscache_end_cookie_access() call will queue another fscache_cookie_state_machine() call to handle the still-pending FSCACHE_COOKIE_DO_LRU_DISCARD. Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHAREAl Viro
commit 9a2fa1472083580b6c66bdaf291f591e1170123a upstream. copy_fd_bitmaps(new, old, count) is expected to copy the first count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest with zeroes. What it does is copying enough words (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. That works fine, *if* all bits past the cutoff point are clear. Otherwise we are risking garbage from the last word we'd copied. For most of the callers that is true - expand_fdtable() has count equal to old->max_fds, so there's no open descriptors past count, let alone fully occupied words in ->open_fds[], which is what bits in ->full_fds_bits[] correspond to. The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), which is the smallest multiple of BITS_PER_LONG that covers all opened descriptors below max_fds. In the common case (copying on fork()) max_fds is ~0U, so all opened descriptors will be below it and we are fine, by the same reasons why the call in expand_fdtable() is safe. Unfortunately, there is a case where max_fds is less than that and where we might, indeed, end up with junk in ->full_fds_bits[] - close_range(from, to, CLOSE_RANGE_UNSHARE) with * descriptor table being currently shared * 'to' being above the current capacity of descriptor table * 'from' being just under some chunk of opened descriptors. In that case we end up with observably wrong behaviour - e.g. spawn a child with CLONE_FILES, get all descriptors in range 0..127 open, then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up with descriptor #128, despite #64 being observably not open. The minimally invasive fix would be to deal with that in dup_fd(). If this proves to add measurable overhead, we can go that way, but let's try to fix copy_fd_bitmaps() first. * new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size). * make copy_fd_bitmaps() take the bitmap size in words, rather than bits; it's 'count' argument is always a multiple of BITS_PER_LONG, so we are not losing any information, and that way we can use the same helper for all three bitmaps - compiler will see that count is a multiple of BITS_PER_LONG for the large ones, so it'll generate plain memcpy()+memset(). Reproducer added to tools/testing/selftests/core/close_range_test.c Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29btrfs: rename bitmap_set_bits() -> btrfs_bitmap_set_bits()Alexander Lobakin
commit 4ca532d64648d4776d15512caed3efea05ca7195 upstream. bitmap_set_bits() does not start with the FS' prefix and may collide with a new generic helper one day. It operates with the FS-specific types, so there's no change those two could do the same thing. Just add the prefix to exclude such possible conflict. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29fs/ntfs3: add prefix to bitmap_size() and use BITS_TO_U64()Alexander Lobakin
commit 3f5ef5109f6a054ce58b3bec7214ed76c9cc269f upstream. bitmap_size() is a pretty generic name and one may want to use it for a generic bitmap API function. At the same time, its logic is NTFS-specific, as it aligns to the sizeof(u64), not the sizeof(long) (although it uses ideologically right ALIGN() instead of division). Add the prefix 'ntfs3_' used for that FS (not just 'ntfs_' to not mix it with the legacy module) and use generic BITS_TO_U64() while at it. Suggested-by: Yury Norov <yury.norov@gmail.com> # BITS_TO_U64() Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29vfs: Don't evict inode under the inode lru traversing contextZhihao Cheng
commit 2a0629834cd82f05d424bbc193374f9a43d1f87d upstream. The inode reclaiming process(See function prune_icache_sb) collects all reclaimable inodes and mark them with I_FREEING flag at first, at that time, other processes will be stuck if they try getting these inodes (See function find_inode_fast), then the reclaiming process destroy the inodes by function dispose_list(). Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may do inode lookup in the inode evicting callback function, if the inode lookup is operated under the inode lru traversing context, deadlock problems may happen. Case 1: In function ext4_evict_inode(), the ea inode lookup could happen if ea_inode feature is enabled, the lookup process will be stuck under the evicting context like this: 1. File A has inode i_reg and an ea inode i_ea 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea 3. Then, following three processes running like this: PA PB echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // i_reg is added into lru, lru->i_ea->i_reg prune_icache_sb list_lru_walk_one inode_lru_isolate i_ea->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(i_reg) spin_unlock(&i_reg->i_lock) spin_unlock(lru_lock) rm file A i_reg->nlink = 0 iput(i_reg) // i_reg->nlink is 0, do evict ext4_evict_inode ext4_xattr_delete_inode ext4_xattr_inode_dec_ref_all ext4_xattr_inode_iget ext4_iget(i_ea->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(i_ea) ----→ AA deadlock dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&i_ea->i_state) Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which could race with inode reclaiming process(The reclaiming process could try locking BASEHD's wbuf->io_mutex in inode evicting function), then an ABBA deadlock problem would happen as following: 1. File A has inode ia and a xattr(with inode ixa), regular file B has inode ib and a xattr. 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa 3. Then, following three processes running like this: PA PB PC echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // ib and ia are added into lru, lru->ixa->ib->ia prune_icache_sb list_lru_walk_one inode_lru_isolate ixa->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(ib) spin_unlock(&ib->i_lock) spin_unlock(lru_lock) rm file B ib->nlink = 0 rm file A iput(ia) ubifs_evict_inode(ia) ubifs_jnl_delete_inode(ia) ubifs_jnl_write_inode(ia) make_reservation(BASEHD) // Lock wbuf->io_mutex ubifs_iget(ixa->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(ixa) | iput(ib) // ib->nlink is 0, do evict | ubifs_evict_inode | ubifs_jnl_delete_inode(ib) ↓ ubifs_jnl_write_inode ABBA deadlock ←-----make_reservation(BASEHD) dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&ixa->i_state) Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to pin the inode in memory while inode_lru_isolate() reclaims its pages instead of using ordinary inode reference. This way inode deletion cannot be triggered from inode_lru_isolate() thus avoiding the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup. Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022 Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Fixes: 7959cf3a7506 ("ubifs: journal: Handle xattrs like files") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29fuse: Initialize beyond-EOF page contents before setting uptodateJann Horn
commit 3c0da3d163eb32f1f91891efaade027fa9b245b9 upstream. fuse_notify_store(), unlike fuse_do_readpage(), does not enable page zeroing (because it can be used to change partial page contents). So fuse_notify_store() must be more careful to fully initialize page contents (including parts of the page that are beyond end-of-file) before marking the page uptodate. The current code can leave beyond-EOF page contents uninitialized, which makes these uninitialized page contents visible to userspace via mmap(). This is an information leak, but only affects systems which do not enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the corresponding kernel command line parameter). Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574 Cc: stable@kernel.org Fixes: a1d75f258230 ("fuse: add store request") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-19Revert "jfs: fix shift-out-of-bounds in dbJoin"Dave Kleikamp
commit e42e29cc442395d62f1a8963ec2dfb700ba6a5d7 upstream. This reverts commit cca974daeb6c43ea971f8ceff5a7080d7d49ee30. The added sanity check is incorrect. BUDMIN is not the wrong value and is too small. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-19binfmt_flat: Fix corruption when not offsetting data startKees Cook
[ Upstream commit 3eb3cd5992f7a0c37edc8d05b4c38c98758d8671 ] Commit 04d82a6d0881 ("binfmt_flat: allow not offsetting data start") introduced a RISC-V specific variant of the FLAT format which does not allocate any space for the (obsolete) array of shared library pointers. However, it did not disable the code which initializes the array, resulting in the corruption of sizeof(long) bytes before the DATA segment, generally the end of the TEXT segment. Introduce MAX_SHARED_LIBS_UPDATE which depends on the state of CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET to guard the initialization of the shared library pointer region so that it will only be initialized if space is reserved for it. Fixes: 04d82a6d0881 ("binfmt_flat: allow not offsetting data start") Co-developed-by: Stefan O'Rear <sorear@fastmail.com> Signed-off-by: Stefan O'Rear <sorear@fastmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Greg Ungerer <gerg@linux-m68k.org> Link: https://lore.kernel.org/r/20240807195119.it.782-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19fs/ntfs3: Do copy_to_user out of run_lockKonstantin Komarov
[ Upstream commit d57431c6f511bf020e474026d9f3123d7bfbea8c ] In order not to call copy_to_user (from fiemap_fill_next_extent) we allocate memory in the kernel, fill it and copy it to user memory after up_read(run_lock). Reported-by: syzbot+36bb70085ef6edc2ebb9@syzkaller.appspotmail.com Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19jfs: Fix shift-out-of-bounds in dbDiscardAGPei Li
[ Upstream commit 7063b80268e2593e58bee8a8d709c2f3ff93e2f2 ] When searching for the next smaller log2 block, BLKSTOL2() returned 0, causing shift exponent -1 to be negative. This patch fixes the issue by exiting the loop directly when negative shift is found. Reported-by: syzbot+61be3359d2ee3467e7e4@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=61be3359d2ee3467e7e4 Signed-off-by: Pei Li <peili.dev@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19jfs: fix null ptr deref in dtInsertEntryEdward Adam Davis
[ Upstream commit ce6dede912f064a855acf6f04a04cbb2c25b8c8c ] [syzbot reported] general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 5061 Comm: syz-executor404 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:dtInsertEntry+0xd0c/0x1780 fs/jfs/jfs_dtree.c:3713 ... [Analyze] In dtInsertEntry(), when the pointer h has the same value as p, after writing name in UniStrncpy_to_le(), p->header.flag will be cleared. This will cause the previously true judgment "p->header.flag & BT-LEAF" to change to no after writing the name operation, this leads to entering an incorrect branch and accessing the uninitialized object ih when judging this condition for the second time. [Fix] After got the page, check freelist first, if freelist == 0 then exit dtInsert() and return -EINVAL. Reported-by: syzbot+bba84aef3a26fb93deb9@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19f2fs: fix to cover read extent cache access with lockChao Yu
[ Upstream commit d7409b05a64f212735f0d33f5f1602051a886eab ] syzbot reports a f2fs bug as below: BUG: KASAN: slab-use-after-free in sanity_check_extent_cache+0x370/0x410 fs/f2fs/extent_cache.c:46 Read of size 4 at addr ffff8880739ab220 by task syz-executor200/5097 CPU: 0 PID: 5097 Comm: syz-executor200 Not tainted 6.9.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 sanity_check_extent_cache+0x370/0x410 fs/f2fs/extent_cache.c:46 do_read_inode fs/f2fs/inode.c:509 [inline] f2fs_iget+0x33e1/0x46e0 fs/f2fs/inode.c:560 f2fs_nfs_get_inode+0x74/0x100 fs/f2fs/super.c:3237 generic_fh_to_dentry+0x9f/0xf0 fs/libfs.c:1413 exportfs_decode_fh_raw+0x152/0x5f0 fs/exportfs/expfs.c:444 exportfs_decode_fh+0x3c/0x80 fs/exportfs/expfs.c:584 do_handle_to_path fs/fhandle.c:155 [inline] handle_to_path fs/fhandle.c:210 [inline] do_handle_open+0x495/0x650 fs/fhandle.c:226 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f We missed to cover sanity_check_extent_cache() w/ extent cache lock, so, below race case may happen, result in use after free issue. - f2fs_iget - do_read_inode - f2fs_init_read_extent_tree : add largest extent entry in to cache - shrink - f2fs_shrink_read_extent_tree - __shrink_extent_tree - __detach_extent_node : drop largest extent entry - sanity_check_extent_cache : access et->largest w/o lock let's refactor sanity_check_extent_cache() to avoid extent cache access and call it before f2fs_init_read_extent_tree() to fix this issue. Reported-by: syzbot+74ebe2104433e9dc610d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/00000000000009beea061740a531@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19f2fs: fix to do sanity check on F2FS_INLINE_DATA flag in inode during GCChao Yu
[ Upstream commit fc01008c92f40015aeeced94750855a7111b6929 ] syzbot reports a f2fs bug as below: ------------[ cut here ]------------ kernel BUG at fs/f2fs/inline.c:258! CPU: 1 PID: 34 Comm: kworker/u8:2 Not tainted 6.9.0-rc6-syzkaller-00012-g9e4bc4bcae01 #0 RIP: 0010:f2fs_write_inline_data+0x781/0x790 fs/f2fs/inline.c:258 Call Trace: f2fs_write_single_data_page+0xb65/0x1d60 fs/f2fs/data.c:2834 f2fs_write_cache_pages fs/f2fs/data.c:3133 [inline] __f2fs_write_data_pages fs/f2fs/data.c:3288 [inline] f2fs_write_data_pages+0x1efe/0x3a90 fs/f2fs/data.c:3315 do_writepages+0x35b/0x870 mm/page-writeback.c:2612 __writeback_single_inode+0x165/0x10b0 fs/fs-writeback.c:1650 writeback_sb_inodes+0x905/0x1260 fs/fs-writeback.c:1941 wb_writeback+0x457/0xce0 fs/fs-writeback.c:2117 wb_do_writeback fs/fs-writeback.c:2264 [inline] wb_workfn+0x410/0x1090 fs/fs-writeback.c:2304 process_one_work kernel/workqueue.c:3254 [inline] process_scheduled_works+0xa12/0x17c0 kernel/workqueue.c:3335 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416 kthread+0x2f2/0x390 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 The root cause is: inline_data inode can be fuzzed, so that there may be valid blkaddr in its direct node, once f2fs triggers background GC to migrate the block, it will hit f2fs_bug_on() during dirty page writeback. Let's add sanity check on F2FS_INLINE_DATA flag in inode during GC, so that, it can forbid migrating inline_data inode's data block for fixing. Reported-by: syzbot+848062ba19c8782ca5c8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000d103ce06174d7ec3@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19ext4: sanity check for NULL pointer after ext4_force_shutdownWojciech Gładysz
[ Upstream commit 83f4414b8f84249d538905825b088ff3ae555652 ] Test case: 2 threads write short inline data to a file. In ext4_page_mkwrite the resulting inline data is converted. Handling ext4_grp_locked_error with description "block bitmap and bg descriptor inconsistent: X vs Y free clusters" calls ext4_force_shutdown. The conversion clears EXT4_STATE_MAY_INLINE_DATA but fails for ext4_destroy_inline_data_nolock and ext4_mark_iloc_dirty due to ext4_forced_shutdown. The restoration of inline data fails for the same reason not setting EXT4_STATE_MAY_INLINE_DATA. Without the flag set a regular process path in ext4_da_write_end follows trying to dereference page folio private pointer that has not been set. The fix calls early return with -EIO error shall the pointer to private be NULL. Sample crash report: Unable to handle kernel paging request at virtual address dfff800000000004 KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027] Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [dfff800000000004] address between user and kernel address ranges Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 20274 Comm: syz-executor185 Not tainted 6.9.0-rc7-syzkaller-gfda5695d692c #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 lr : __block_commit_write+0x3c/0x2b0 fs/buffer.c:2160 sp : ffff8000a1957600 x29: ffff8000a1957610 x28: dfff800000000000 x27: ffff0000e30e34b0 x26: 0000000000000000 x25: dfff800000000000 x24: dfff800000000000 x23: fffffdffc397c9e0 x22: 0000000000000020 x21: 0000000000000020 x20: 0000000000000040 x19: fffffdffc397c9c0 x18: 1fffe000367bd196 x17: ffff80008eead000 x16: ffff80008ae89e3c x15: 00000000200000c0 x14: 1fffe0001cbe4e04 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000ff0100 x9 : 0000000000000000 x8 : 0000000000000004 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffdffc397c9c0 x4 : 0000000000000020 x3 : 0000000000000020 x2 : 0000000000000040 x1 : 0000000000000020 x0 : fffffdffc397c9c0 Call trace: __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 block_write_end+0xb4/0x104 fs/buffer.c:2253 ext4_da_do_write_end fs/ext4/inode.c:2955 [inline] ext4_da_write_end+0x2c4/0xa40 fs/ext4/inode.c:3028 generic_perform_write+0x394/0x588 mm/filemap.c:3985 ext4_buffered_write_iter+0x2c0/0x4ec fs/ext4/file.c:299 ext4_file_write_iter+0x188/0x1780 call_write_iter include/linux/fs.h:2110 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0x968/0xc3c fs/read_write.c:590 ksys_write+0x15c/0x26c fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __arm64_sys_write+0x7c/0x90 fs/read_write.c:652 __invoke_syscall arch/arm64/kernel/syscall.c:34 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:48 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:133 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:152 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Code: 97f85911 f94002da 91008356 d343fec8 (38796908) ---[ end trace 0000000000000000 ]--- ---------------- Code disassembly (best guess): 0: 97f85911 bl 0xffffffffffe16444 4: f94002da ldr x26, [x22] 8: 91008356 add x22, x26, #0x20 c: d343fec8 lsr x8, x22, #3 * 10: 38796908 ldrb w8, [x8, x25] <-- trapping instruction Reported-by: syzbot+18df508cf00a0598d9a6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18df508cf00a0598d9a6 Link: https://lore.kernel.org/all/000000000000f19a1406109eb5c5@google.com/T/ Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com> Link: https://patch.msgid.link/20240703070112.10235-1-wojciech.gladysz@infogain.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19ext4: convert ext4_da_do_write_end() to take a folioMatthew Wilcox (Oracle)
[ Upstream commit 4d5cdd757d0c74924b629559fccb68d8803ce995 ] There's nothing page-specific happening in ext4_da_do_write_end(); it's merely used for its refcount & lock, both of which are folio properties. Saves four calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231214053035.1018876-1-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 83f4414b8f84 ("ext4: sanity check for NULL pointer after ext4_force_shutdown") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19ext4: do not create EA inode under buffer lockJan Kara
[ Upstream commit 0a46ef234756dca04623b7591e8ebb3440622f0b ] ext4_xattr_set_entry() creates new EA inodes while holding buffer lock on the external xattr block. This is problematic as it nests all the allocation locking (which acquires locks on other buffers) under the buffer lock. This can even deadlock when the filesystem is corrupted and e.g. quota file is setup to contain xattr block as data block. Move the allocation of EA inode out of ext4_xattr_set_entry() into the callers. Reported-by: syzbot+a43d4f48b8397d0e41a9@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19ext4: fold quota accounting into ext4_xattr_inode_lookup_create()Jan Kara
[ Upstream commit 8208c41c43ad5e9b63dce6c45a73e326109ca658 ] When allocating EA inode, quota accounting is done just before ext4_xattr_inode_lookup_create(). Logically these two operations belong together so just fold quota accounting into ext4_xattr_inode_lookup_create(). We also make ext4_xattr_inode_lookup_create() return the looked up / created inode to convert the function to a more standard calling convention. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240209112107.10585-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 0a46ef234756 ("ext4: do not create EA inode under buffer lock") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-19fs: Annotate struct file_handle with __counted_by() and use struct_size()Gustavo A. R. Silva
[ Upstream commit 68d6f4f3fbd9b1baae53e7cf33fb3362b5a21494 ] Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). While there, use struct_size() helper, instead of the open-coded version. [brauner@kernel.org: contains a fix by Edward for an OOB access] Reported-by: syzbot+4139435cb1b34cf759c2@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Link: https://lore.kernel.org/r/tencent_A7845DD769577306D813742365E976E3A205@qq.com Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/ZgImCXTdGDTeBvSS@neat Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>