summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-01-24Merge tag 'execve-v6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fixes from Kees Cook: - Fix error handling in begin_new_exec() (Bernd Edlinger) - MAINTAINERS: specifically mention ELF (Alexey Dobriyan) - Various cleanups related to earlier open() (Askar Safin, Kees Cook) * tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Distinguish in_execve from in_exec exec: Fix error handling in begin_new_exec() exec: Add do_close_execat() helper exec: remove useless comment ELF, MAINTAINERS: specifically mention ELF
2024-01-24uselib: remove use of __FMODE_EXECLinus Torvalds
Jann Horn points out that uselib() really shouldn't trigger the new FMODE_EXEC logic introduced by commit 4759ff71f23e ("exec: __FMODE_EXEC instead of in_execve for LSMs"). In fact, it shouldn't even have ever triggered the old pre-existing logic for __FMODE_EXEC (like the NFS code that makes executables not need read permissions). Unlike a real execve(), that can work even with files that are purely executable by the user (not readable), uselib() has that MAY_READ requirement becasue it's really just a convenience wrapper around mmap() for legacy shared libraries. The whole FMODE_EXEC bit was originally introduced by commit b500531e6f5f ("[PATCH] Introduce FMODE_EXEC file flag"), primarily to give ETXTBUSY error returns for distributed filesystems. It has since grown a few other warts (like that NFS thing), but there really isn't any reason to use it for uselib(), and now that we are trying to use it to replace the horrid 'tsk->in_execve' flag, it's actively wrong. Of course, as Jann Horn also points out, nobody should be enabling CONFIG_USELIB in the first place in this day and age, but that's a different discussion entirely. Reported-by: Jann Horn <jannh@google.com> Fixes: 4759ff71f23e ("exec: __FMODE_EXEC instead of in_execve for LSMs") Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24samples/cgroup: add .gitignore file for generated samplesLinus Torvalds
Make 'git status' quietly happy again after a full allmodconfig build. Fixes: 60433a9d038d ("samples: introduce new samples subdir for cgroup") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24exec: Distinguish in_execve from in_execKees Cook
Just to help distinguish the fs->in_exec flag from the current->in_execve flag, add comments in check_unsafe_exec() and copy_fs() for more context. Also note that in_execve is only used by TOMOYO now. Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-24exec: Check __FMODE_EXEC instead of in_execve for LSMsKees Cook
After commit 978ffcbf00d8 ("execve: open the executable file before doing anything else"), current->in_execve was no longer in sync with the open(). This broke AppArmor and TOMOYO which depend on this flag to distinguish "open" operations from being "exec" operations. Instead of moving around in_execve, switch to using __FMODE_EXEC, which is where the "is this an exec?" intent is stored. Note that TOMOYO still uses in_execve around cred handling. Reported-by: Kevin Locke <kevin@kevinlocke.name> Closes: https://lore.kernel.org/all/ZbE4qn9_h14OqADK@kevinlocke.name Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 978ffcbf00d8 ("execve: open the executable file before doing anything else") Cc: Josh Triplett <josh@joshtriplett.org> Cc: John Johansen <john.johansen@canonical.com> Cc: Paul Moore <paul@paul-moore.com> Cc: James Morris <jmorris@namei.org> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-mm@kvack.org> Cc: <apparmor@lists.ubuntu.com> Cc: <linux-security-module@vger.kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24Merge tag 'fbdev-for-6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes and cleanups from Helge Deller: "A crash fix in stifb which was missed to be included in the drm-misc tree, two checks to prevent wrong userspace input in sisfb and savagefb and two trivial printk cleanups: - stifb: Fix crash in stifb_blank() - savage/sis: Error out if pixclock equals zero - minor trivial cleanups" * tag 'fbdev-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbdev: stifb: Fix crash in stifb_blank() fbcon: Fix incorrect printed function name in fbcon_prepare_logo() fbdev: sis: Error out if pixclock equals zero fbdev: savage: Error out if pixclock equals zero fbdev: vt8500lcdfb: Remove unnecessary print function dev_err()
2024-01-23Merge tag 'trace-v6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and eventfs fixes from Steven Rostedt: - Fix histogram tracing_map insertion. The tracing_map_insert copies the value into the elt variable and then assigns the elt to the entry value. But it is possible that the entry value becomes visible on other CPUs before the elt is fully initialized. This is fixed by adding a wmb() between the initialization of the elt variable and assigning it. - Have eventfs directory have unique inode numbers. Having them be all the same proved to be a failure as the 'find' application will think that the directories are causing loops, as it checks for directory loops via their inodes. Have the evenfs dir entries get their inodes assigned when they are referenced and then save them in the eventfs_inode structure. * tag 'trace-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Save directory inodes in the eventfs_inode structure tracing: Ensure visibility when inserting an element into tracing_map
2024-01-23eventfs: Save directory inodes in the eventfs_inode structureSteven Rostedt (Google)
The eventfs inodes and directories are allocated when referenced. But this leaves the issue of keeping consistent inode numbers and the number is only saved in the inode structure itself. When the inode is no longer referenced, it can be freed. When the file that the inode was representing is referenced again, the inode is once again created, but the inode number needs to be the same as it was before. Just making the inode numbers the same for all files is fine, but that does not work with directories. The find command will check for loops via the inode number and having the same inode number for directories triggers: # find /sys/kernel/tracing find: File system loop detected; '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as '/sys/kernel/debug/tracing/events/initcall'. [..] Linus pointed out that the eventfs_inode structure ends with a single 32bit int, and on 64 bit machines, there's likely a 4 byte hole due to alignment. We can use this hole to store the inode number for the eventfs_inode. All directories in eventfs are represented by an eventfs_inode and that data structure can hold its inode number. That last int was also purposely placed at the end of the structure to prevent holes from within. Now that there's a 4 byte number to hold the inode, both the inode number and the last integer can be moved up in the structure for better cache locality, where the llist and rcu fields can be moved to the end as they are only used when the eventfs_inode is being deleted. Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories all be the same") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Kees Cook <keescook@chromium.org>
2024-01-23fbdev: stifb: Fix crash in stifb_blank()Helge Deller
Avoid a kernel crash in stifb by providing the correct pointer to the fb_info struct. Prior to commit e2e0b838a184 ("video/sticore: Remove info field from STI struct") the fb_info struct was at the beginning of the fb struct. Fixes: e2e0b838a184 ("video/sticore: Remove info field from STI struct") Signed-off-by: Helge Deller <deller@gmx.de> Cc: Thomas Zimmermann <tzimmermann@suse.de>
2024-01-22drm/ttm: fix ttm pool initialization for no-dma-device driversFedor Pchelkin
The QXL driver doesn't use any device for DMA mappings or allocations so dev_to_node() will panic inside ttm_device_init() on NUMA systems: general protection fault, probably for non-canonical address 0xdffffc000000007a: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x00000000000003d0-0x00000000000003d7] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.7.0+ #9 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 RIP: 0010:ttm_device_init+0x10e/0x340 Call Trace: qxl_ttm_init+0xaa/0x310 qxl_device_init+0x1071/0x2000 qxl_pci_probe+0x167/0x3f0 local_pci_probe+0xe1/0x1b0 pci_device_probe+0x29d/0x790 really_probe+0x251/0x910 __driver_probe_device+0x1ea/0x390 driver_probe_device+0x4e/0x2e0 __driver_attach+0x1e3/0x600 bus_for_each_dev+0x12d/0x1c0 bus_add_driver+0x25a/0x590 driver_register+0x15c/0x4b0 qxl_pci_driver_init+0x67/0x80 do_one_initcall+0xf5/0x5d0 kernel_init_freeable+0x637/0xb10 kernel_init+0x1c/0x2e0 ret_from_fork+0x48/0x80 ret_from_fork_asm+0x1b/0x30 RIP: 0010:ttm_device_init+0x10e/0x340 Fall back to NUMA_NO_NODE if there is no device for DMA. Found by Linux Verification Center (linuxtesting.org). Fixes: b0a7ce53d494 ("drm/ttm: Schedule delayed_delete worker closer") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Christian König <christian.koenig@amd.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-22Revert "btrfs: zstd: fix and simplify the inline extent decompression"Linus Torvalds
This reverts commit 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b. It causes my machine to not even boot, and Klara Modin reports that the cause is that small zstd-compressed files return garbage when read. Reported-by: Klara Modin <klarasmodin@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/ Reported-and-bisected-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Sterba <dsterba@suse.com> Cc: Qu Wenruo <wqu@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-22tracing: Ensure visibility when inserting an element into tracing_mapPetr Pavlu
Running the following two commands in parallel on a multi-processor AArch64 machine can sporadically produce an unexpected warning about duplicate histogram entries: $ while true; do echo hist:key=id.syscall:val=hitcount > \ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist sleep 0.001 done $ stress-ng --sysbadaddr $(nproc) The warning looks as follows: [ 2911.172474] ------------[ cut here ]------------ [ 2911.173111] Duplicates detected: 1 [ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408 [ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E) [ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1 [ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01 [ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018 [ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408 [ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408 [ 2911.185310] sp : ffff8000a1513900 [ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001 [ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008 [ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180 [ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff [ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8 [ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731 [ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c [ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8 [ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000 [ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480 [ 2911.194259] Call trace: [ 2911.194626] tracing_map_sort_entries+0x3e0/0x408 [ 2911.195220] hist_show+0x124/0x800 [ 2911.195692] seq_read_iter+0x1d4/0x4e8 [ 2911.196193] seq_read+0xe8/0x138 [ 2911.196638] vfs_read+0xc8/0x300 [ 2911.197078] ksys_read+0x70/0x108 [ 2911.197534] __arm64_sys_read+0x24/0x38 [ 2911.198046] invoke_syscall+0x78/0x108 [ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8 [ 2911.199157] do_el0_svc+0x28/0x40 [ 2911.199613] el0_svc+0x40/0x178 [ 2911.200048] el0t_64_sync_handler+0x13c/0x158 [ 2911.200621] el0t_64_sync+0x1a8/0x1b0 [ 2911.201115] ---[ end trace 0000000000000000 ]--- The problem appears to be caused by CPU reordering of writes issued from __tracing_map_insert(). The check for the presence of an element with a given key in this function is: val = READ_ONCE(entry->val); if (val && keys_match(key, val->key, map->key_size)) ... The write of a new entry is: elt = get_free_elt(map); memcpy(elt->key, key, map->key_size); entry->val = elt; The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;" stores may become visible in the reversed order on another CPU. This second CPU might then incorrectly determine that a new key doesn't match an already present val->key and subsequently insert a new element, resulting in a duplicate. Fix the problem by adding a write barrier between "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for good measure, also use WRITE_ONCE(entry->val, elt) for publishing the element. The sequence pairs with the mentioned "READ_ONCE(entry->val);" and the "val->key" check which has an address dependency. The barrier is placed on a path executed when adding an element for a new key. Subsequent updates targeting the same key remain unaffected. From the user's perspective, the issue was introduced by commit c193707dde77 ("tracing: Remove code which merges duplicates"), which followed commit cbf4100efb8f ("tracing: Add support to detect and avoid duplicates"). The previous code operated differently; it inherently expected potential races which result in duplicates but merged them later when they occurred. Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com Fixes: c193707dde77 ("tracing: Remove code which merges duplicates") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-22fbcon: Fix incorrect printed function name in fbcon_prepare_logo()Geert Uytterhoeven
If the boot logo does not fit, a message is printed, including a wrong function name prefix. Instead of correcting the function name (or using __func__), just use "fbcon", like is done in several other messages. While at it, modernize the call by switching to pr_info(). Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Helge Deller <deller@gmx.de>
2024-01-22Merge tag 'for-6.8-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned mode fixes: - fix slowdown when writing large file sequentially by looking up block groups with enough space faster - locking fixes when activating a zone - new mount API fixes: - preserve mount options for a ro/rw mount of the same subvolume - scrub fixes: - fix use-after-free in case the chunk length is not aligned to 64K, this does not happen normally but has been reported on images converted from ext4 - similar alignment check was missing with raid-stripe-tree - subvolume deletion fixes: - prevent calling ioctl on already deleted subvolume - properly track flag tracking a deleted subvolume - in subpage mode, fix decompression of an inline extent (zlib, lzo, zstd) - fix crash when starting writeback on a folio, after integration with recent MM changes this needs to be started conditionally - reject unknown flags in defrag ioctl - error handling, API fixes, minor warning fixes * tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: limit RST scrub to chunk boundary btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned btrfs: don't unconditionally call folio_start_writeback in subpage btrfs: use the original mount's mount options for the legacy reconfigure btrfs: don't warn if discard range is not aligned to sector btrfs: tree-checker: fix inline ref size in error messages btrfs: zstd: fix and simplify the inline extent decompression btrfs: lzo: fix and simplify the inline extent decompression btrfs: zlib: fix and simplify the inline extent decompression btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted btrfs: don't abort filesystem when attempting to snapshot deleted subvolume btrfs: zoned: fix lock ordering in btrfs_zone_activate() btrfs: fix unbalanced unlock of mapping_tree_lock btrfs: ref-verify: free ref cache before clearing mount opt btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send() btrfs: zoned: optimize hint byte for zoned allocator btrfs: zoned: factor out prepare_allocation_zoned()
2024-01-22exec: Fix error handling in begin_new_exec()Bernd Edlinger
If get_unused_fd_flags() fails, the error handling is incomplete because bprm->cred is already set to NULL, and therefore free_bprm will not unlock the cred_guard_mutex. Note there are two error conditions which end up here, one before and one after bprm->cred is cleared. Fixes: b8a61c9e7b4a ("exec: Generic execfd support") Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lore.kernel.org/r/AS8P193MB128517ADB5EFF29E04389EDAE4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22exec: Add do_close_execat() helperKees Cook
Consolidate the calls to allow_write_access()/fput() into a single place, since we repeat this code pattern. Add comments around the callers for the details on it. Link: https://lore.kernel.org/r/202209161637.9EDAF6B18@keescook Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22exec: remove useless commentAskar Safin
Function name is wrong and the comment tells us nothing Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/r/20240109030801.31827-1-safinaskar@zohomail.com Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22ELF, MAINTAINERS: specifically mention ELFAlexey Dobriyan
People complain when I miss people in Cc. [ kees: Also add the ELF uapi doc link ] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/2cb0891e-d7c0-4939-bb5f-282812de6078@p183 Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22Merge tag 'Wstringop-overflow-for-6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull stringop-overflow warning update from Gustavo A. R. Silva: "Enable -Wstringop-overflow globally. I waited for the release of -rc1 to run a final build-test on top of it before sending this pull request. Fortunatelly, after building 358 kernels overnight (basically all supported archs with a wide variety of configs), no more warnings have surfaced! :) Thus, we are in a good position to enable this compiler option for all versions of GCC that support it, with the exception of GCC-11, which appears to have some issues with this option [1]" Link: https://lore.kernel.org/lkml/b3c99290-40bc-426f-b3d2-1aa903f95c4e@embeddedor.com/ [1] * tag 'Wstringop-overflow-for-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: init: Kconfig: Disable -Wstringop-overflow for GCC-11 Makefile: Enable -Wstringop-overflow globally
2024-01-22Merge tag 'xsa448-6.8-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen netback fix from Juergen Gross: "Transmit requests in Xen's virtual network protocol can consist of multiple parts. While not really useful, except for the initial part any of them may be of zero length, i.e. carry no data at all. Besides a certain initial portion of the to be transferred data, these parts are directly translated into what Linux calls SKB fragments. Such converted request parts can, when for a particular SKB they are all of length zero, lead to a de-reference of NULL in core networking code" * tag 'xsa448-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen-netback: don't produce zero-size SKB frags
2024-01-21init: Kconfig: Disable -Wstringop-overflow for GCC-11Gustavo A. R. Silva
-Wstringop-overflow is buggy in GCC-11. Therefore, we should disable this option specifically for that compiler version. To achieve this, we introduce a new configuration option: GCC11_NO_STRINGOP_OVERFLOW. The compiler option related to string operation overflow is now managed under configuration CC_STRINGOP_OVERFLOW. This option is enabled by default for all other versions of GCC that support it. Link: https://lore.kernel.org/lkml/b3c99290-40bc-426f-b3d2-1aa903f95c4e@embeddedor.com/ Link: https://lore.kernel.org/lkml/20231128091351.2bfb38dd@canb.auug.org.au/ Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/linux-hardening/ZWj1+jkweEDWbmAR@work/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2024-01-21Makefile: Enable -Wstringop-overflow globallyGustavo A. R. Silva
It seems that we have finished addressing all the remaining issues regarding -Wstringop-overflow. So, we are now in good shape to enable this compiler option globally. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2024-01-21Linux 6.8-rc1v6.8-rc1Linus Torvalds
2024-01-21Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull more bcachefs updates from Kent Overstreet: "Some fixes, Some refactoring, some minor features: - Assorted prep work for disk space accounting rewrite - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this makes our trigger context more explicit - A few fixes to avoid excessive transaction restarts on multithreaded workloads: fstests (in addition to ktest tests) are now checking slowpath counters, and that's shaking out a few bugs - Assorted tracepoint improvements - Starting to break up bcachefs_format.h and move on disk types so they're with the code they belong to; this will make room to start documenting the on disk format better. - A few minor fixes" * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits) bcachefs: Improve inode_to_text() bcachefs: logged_ops_format.h bcachefs: reflink_format.h bcachefs; extents_format.h bcachefs: ec_format.h bcachefs: subvolume_format.h bcachefs: snapshot_format.h bcachefs: alloc_background_format.h bcachefs: xattr_format.h bcachefs: dirent_format.h bcachefs: inode_format.h bcachefs; quota_format.h bcachefs: sb-counters_format.h bcachefs: counters.c -> sb-counters.c bcachefs: comment bch_subvolume bcachefs: bch_snapshot::btime bcachefs: add missing __GFP_NOWARN bcachefs: opts->compression can now also be applied in the background bcachefs: Prep work for variable size btree node buffers bcachefs: grab s_umount only if snapshotting ...
2024-01-21Merge tag 'timers-core-2024-01-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Updates for time and clocksources: - A fix for the idle and iowait time accounting vs CPU hotplug. The time is reset on CPU hotplug which makes the accumulated systemwide time jump backwards. - Assorted fixes and improvements for clocksource/event drivers" * tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug clocksource/drivers/ep93xx: Fix error handling during probe clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings clocksource/timer-riscv: Add riscv_clock_shutdown callback dt-bindings: timer: Add StarFive JH8100 clint dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
2024-01-21Merge tag 'powerpc-6.8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Aneesh Kumar: - Increase default stack size to 32KB for Book3S Thanks to Michael Ellerman. * tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Increase default stack size to 32KB
2024-01-21bcachefs: Improve inode_to_text()Kent Overstreet
Add line breaks - inode_to_text() is now much easier to read. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: logged_ops_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: reflink_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs; extents_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: ec_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: subvolume_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: snapshot_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: alloc_background_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: xattr_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: dirent_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: inode_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs; quota_format.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: sb-counters_format.hKent Overstreet
bcachefs_format.h has gotten too big; let's do some organizing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: counters.c -> sb-counters.cKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: comment bch_subvolumeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: bch_snapshot::btimeKent Overstreet
Add a field to bch_snapshot for creation time; this will be important when we start exposing the snapshot tree to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: add missing __GFP_NOWARNKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: opts->compression can now also be applied in the backgroundKent Overstreet
The "apply this compression method in the background" paths now use the compression option if background_compression is not set; this means that setting or changing the compression option will cause existing data to be compressed accordingly in the background. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: Prep work for variable size btree node buffersKent Overstreet
bcachefs btree nodes are big - typically 256k - and btree roots are pinned in memory. As we're now up to 18 btrees, we now have significant memory overhead in mostly empty btree roots. And in the future we're going to start enforcing that certain btree node boundaries exist, to solve lock contention issues - analagous to XFS's AGIs. Thus, we need to start allocating smaller btree node buffers when we can. This patch changes code that refers to the filesystem constant c->opts.btree_node_size to refer to the btree node buffer size - btree_buf_bytes() - where appropriate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: grab s_umount only if snapshottingSu Yue
When I was testing mongodb over bcachefs with compression, there is a lockdep warning when snapshotting mongodb data volume. $ cat test.sh prog=bcachefs $prog subvolume create /mnt/data $prog subvolume create /mnt/data/snapshots while true;do $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s) sleep 1s done $ cat /etc/mongodb.conf systemLog: destination: file logAppend: true path: /mnt/data/mongod.log storage: dbPath: /mnt/data/ lockdep reports: [ 3437.452330] ====================================================== [ 3437.452750] WARNING: possible circular locking dependency detected [ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G E [ 3437.453562] ------------------------------------------------------ [ 3437.453981] bcachefs/35533 is trying to acquire lock: [ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190 [ 3437.454875] but task is already holding lock: [ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs] [ 3437.456009] which lock already depends on the new lock. [ 3437.456553] the existing dependency chain (in reverse order) is: [ 3437.457054] -> #3 (&type->s_umount_key#48){.+.+}-{3:3}: [ 3437.457507] down_read+0x3e/0x170 [ 3437.457772] bch2_fs_file_ioctl+0x232/0xc90 [bcachefs] [ 3437.458206] __x64_sys_ioctl+0x93/0xd0 [ 3437.458498] do_syscall_64+0x42/0xf0 [ 3437.458779] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.459155] -> #2 (&c->snapshot_create_lock){++++}-{3:3}: [ 3437.459615] down_read+0x3e/0x170 [ 3437.459878] bch2_truncate+0x82/0x110 [bcachefs] [ 3437.460276] bchfs_truncate+0x254/0x3c0 [bcachefs] [ 3437.460686] notify_change+0x1f1/0x4a0 [ 3437.461283] do_truncate+0x7f/0xd0 [ 3437.461555] path_openat+0xa57/0xce0 [ 3437.461836] do_filp_open+0xb4/0x160 [ 3437.462116] do_sys_openat2+0x91/0xc0 [ 3437.462402] __x64_sys_openat+0x53/0xa0 [ 3437.462701] do_syscall_64+0x42/0xf0 [ 3437.462982] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.463359] -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}: [ 3437.463843] down_write+0x3b/0xc0 [ 3437.464223] bch2_write_iter+0x5b/0xcc0 [bcachefs] [ 3437.464493] vfs_write+0x21b/0x4c0 [ 3437.464653] ksys_write+0x69/0xf0 [ 3437.464839] do_syscall_64+0x42/0xf0 [ 3437.465009] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.465231] -> #0 (sb_writers#10){.+.+}-{0:0}: [ 3437.465471] __lock_acquire+0x1455/0x21b0 [ 3437.465656] lock_acquire+0xc6/0x2b0 [ 3437.465822] mnt_want_write+0x46/0x1a0 [ 3437.465996] filename_create+0x62/0x190 [ 3437.466175] user_path_create+0x2d/0x50 [ 3437.466352] bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs] [ 3437.466617] __x64_sys_ioctl+0x93/0xd0 [ 3437.466791] do_syscall_64+0x42/0xf0 [ 3437.466957] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.467180] other info that might help us debug this: [ 3437.469670] 2 locks held by bcachefs/35533: other info that might help us debug this: [ 3437.467507] Chain exists of: sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48 [ 3437.467979] Possible unsafe locking scenario: [ 3437.468223] CPU0 CPU1 [ 3437.468405] ---- ---- [ 3437.468585] rlock(&type->s_umount_key#48); [ 3437.468758] lock(&c->snapshot_create_lock); [ 3437.469030] lock(&type->s_umount_key#48); [ 3437.469291] rlock(sb_writers#10); [ 3437.469434] *** DEADLOCK *** [ 3437.469670] 2 locks held by bcachefs/35533: [ 3437.469838] #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs] [ 3437.470294] #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs] [ 3437.470744] stack backtrace: [ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G E 6.7.0-rc7-custom+ #85 [ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 3437.471694] Call Trace: [ 3437.471795] <TASK> [ 3437.471884] dump_stack_lvl+0x57/0x90 [ 3437.472035] check_noncircular+0x132/0x150 [ 3437.472202] __lock_acquire+0x1455/0x21b0 [ 3437.472369] lock_acquire+0xc6/0x2b0 [ 3437.472518] ? filename_create+0x62/0x190 [ 3437.472683] ? lock_is_held_type+0x97/0x110 [ 3437.472856] mnt_want_write+0x46/0x1a0 [ 3437.473025] ? filename_create+0x62/0x190 [ 3437.473204] filename_create+0x62/0x190 [ 3437.473380] user_path_create+0x2d/0x50 [ 3437.473555] bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs] [ 3437.473819] ? lock_acquire+0xc6/0x2b0 [ 3437.474002] ? __fget_files+0x2a/0x190 [ 3437.474195] ? __fget_files+0xbc/0x190 [ 3437.474380] ? lock_release+0xc5/0x270 [ 3437.474567] ? __x64_sys_ioctl+0x93/0xd0 [ 3437.474764] ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs] [ 3437.475090] __x64_sys_ioctl+0x93/0xd0 [ 3437.475277] do_syscall_64+0x42/0xf0 [ 3437.475454] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.475691] RIP: 0033:0x7f2743c313af ====================================================== In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally and unlock it at the end of the function. There is a comment "why do we need this lock?" about the lock coming from commit 42d237320e98 ("bcachefs: Snapshot creation, deletion") The reason is that __bch2_ioctl_subvolume_create() calls sync_inodes_sb() which enforce locked s_umount to writeback all dirty nodes before doing snapshot works. Fix it by read locking s_umount for snapshotting only and unlocking s_umount after sync_inodes_sb(). Signed-off-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exitSu Yue
bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut. It should be freed by kvfree not kfree. Or umount will triger: [ 406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008 [ 406.830676 ] #PF: supervisor read access in kernel mode [ 406.831643 ] #PF: error_code(0x0000) - not-present page [ 406.832487 ] PGD 0 P4D 0 [ 406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI [ 406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G OE 6.7.0-rc7-custom+ #90 [ 406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 406.835796 ] RIP: 0010:kfree+0x62/0x140 [ 406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6 [ 406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286 [ 406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4 [ 406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000 [ 406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001 [ 406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80 [ 406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000 [ 406.840451 ] FS: 00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000 [ 406.840851 ] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0 [ 406.841464 ] Call Trace: [ 406.841583 ] <TASK> [ 406.841682 ] ? __die+0x1f/0x70 [ 406.841828 ] ? page_fault_oops+0x159/0x470 [ 406.842014 ] ? fixup_exception+0x22/0x310 [ 406.842198 ] ? exc_page_fault+0x1ed/0x200 [ 406.842382 ] ? asm_exc_page_fault+0x22/0x30 [ 406.842574 ] ? bch2_fs_release+0x54/0x280 [bcachefs] [ 406.842842 ] ? kfree+0x62/0x140 [ 406.842988 ] ? kfree+0x104/0x140 [ 406.843138 ] bch2_fs_release+0x54/0x280 [bcachefs] [ 406.843390 ] kobject_put+0xb7/0x170 [ 406.843552 ] deactivate_locked_super+0x2f/0xa0 [ 406.843756 ] cleanup_mnt+0xba/0x150 [ 406.843917 ] task_work_run+0x59/0xa0 [ 406.844083 ] exit_to_user_mode_prepare+0x197/0x1a0 [ 406.844302 ] syscall_exit_to_user_mode+0x16/0x40 [ 406.844510 ] do_syscall_64+0x4e/0xf0 [ 406.844675 ] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 406.844907 ] RIP: 0033:0x7f0a2664e4fb Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: bios must be 512 byte alginedKent Overstreet
Fixes: 023f9ac9f70f bcachefs: Delete dio read alignment check Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: remove redundant variable tmpColin Ian King
The variable tmp is being assigned a value but it isn't being read afterwards. The assignment is redundant and so tmp can be removed. Cleans up clang scan build warning: warning: Although the value stored to 'ret' is used in the enclosing expression, the value is never actually read from 'ret' [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-21bcachefs: Improve trace_trans_restart_relockKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>