summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-05-10Merge tag 'for-linus-6.15a-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - A fix for the xenbus driver allowing to use a PVH Dom0 with Xenstore running in another domain - A fix for the xenbus driver addressing a rare race condition resulting in NULL dereferences and other problems - A fix for the xen-swiotlb driver fixing a problem seen on Arm platforms * tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xenbus: Use kref to track req lifetime xenbus: Allow PVH dom0 a non-local xenstore xen: swiotlb: Use swiotlb bouncing if kmalloc allocation demands it
2025-05-10Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull mount fixes from Al Viro: "A couple of races around legalize_mnt vs umount (both fairly old and hard to hit) plus two bugs in move_mount(2) - both around 'move detached subtree in place' logics" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix IS_MNT_PROPAGATING uses do_move_mount(): don't leak MNTNS_PROPAGATING on failures do_umount(): add missing barrier before refcount checks in sync case __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock
2025-05-10Merge tag 'kvm-x86-fixes-6.15-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fixes for 6.15-rcN - Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid causing problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to sanitize the VMCB as its state is undefined after SHUTDOWN, emulating INIT is the least awful choice). - Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM KVM doesn't goof a sanity check in the future. - Free obsolete roots when (re)loading the MMU to fix a bug where pre-faulting memory can get stuck due to always encountering a stale root. - When dumping GHCB state, use KVM's snapshot instead of the raw GHCB page to print state, so that KVM doesn't print stale/wrong information. - When changing memory attributes (e.g. shared <=> private), add potential hugepage ranges to the mmu_invalidate_range_{start,end} set so that KVM doesn't create a shared/private hugepage when the the corresponding attributes will become mixed (the attributes are commited *after* KVM finishes the invalidation). - Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM has at least one active VM. Effectively BP_SPEC_REDUCE when KVM is loaded led to very measurable performance regressions for non-KVM workloads.
2025-05-10Merge tag 'kvmarm-fixes-6.15-3' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.15, round #3 - Avoid use of uninitialized memcache pointer in user_mem_abort() - Always set HCR_EL2.xMO bits when running in VHE, allowing interrupts to be taken while TGE=0 and fixing an ugly bug on AmpereOne that occurs when taking an interrupt while clearing the xMO bits (AC03_CPU_36) - Prevent VMMs from hiding support for AArch64 at any EL virtualized by KVM - Save/restore the host value for HCRX_EL2 instead of restoring an incorrect fixed value - Make host_stage2_set_owner_locked() check that the entire requested range is memory rather than just the first page
2025-05-10Merge tag 'kvm-riscv-fixes-6.15-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini
into HEAD KVM/riscv fixes for 6.15, take #1 - Add missing reset of smstateen CSRs
2025-05-10Merge tag 'i2c-host-fixes-6.15-rc6' of ↵Wolfram Sang
git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current i2c-host-fixes for v6.15-rc6 - omap: use correct function to read from device tree - MAINTAINERS: remove Seth from ISMT maintainership
2025-05-10Merge tag 'imx-fixes-6.15-2' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into arm/fixes i.MX fixes for 6.15, 2nd round: - One more i.MX8MP nominal drive mode DT fix from Ahmad Fatoum to use 800MHz NoC OPP - A imx8mp-var-som DT change from Himanshu Bhavani to fix SD card timeout caused by LDO5 * tag 'imx-fixes-6.15-2' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: arm64: dts: imx8mp-var-som: Fix LDO5 shutdown causing SD card timeout arm64: dts: imx8mp: use 800MHz NoC OPP for nominal drive mode Link: https://lore.kernel.org/r/aB6h/woeyG1bSo12@dragon Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09Merge tag 'batadv-net-pullrequest-20250509' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - fix duplicate MAC address check, by Matthias Schiffer * tag 'batadv-net-pullrequest-20250509' of git://git.open-mesh.org/linux-merge: batman-adv: fix duplicate MAC address check ==================== Link: https://patch.msgid.link/20250509090240.107796-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09Merge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - Fix dentry leak which can cause umount crash - Add warning for parse contexts error on compounded operation * tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Avoid race in open_cached_dir with lease breaks smb3 client: warn when parse contexts returns error on compounded operation
2025-05-10tracing: add missing trace_probe_log_clear for eprobesPaul Cacheux
Make sure trace_probe_log_clear is called in the tracing eprobe code path, matching the trace_probe_log_init call. Link: https://lore.kernel.org/all/20250504-fix-trace-probe-log-race-v3-1-9e99fec7eddc@gmail.com/ Signed-off-by: Paul Cacheux <paulcacheux@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-10tracing: fprobe: Fix RCU warning message in list traversalBreno Leitao
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following warning: WARNING: suspicious RCU usage kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!! other info that might help us debug this: #1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0 Call Trace: fprobe_module_callback notifier_call_chain blocking_notifier_call_chain This warning occurs because fprobe_remove_node_in_module() traverses an RCU list using RCU primitives without holding an RCU read lock. However, the function is only called from fprobe_module_callback(), which holds the fprobe_mutex lock that provides sufficient protection for safely traversing the list. Fix the warning by specifying the locking design to the CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex provides the required protection. Link: https://lore.kernel.org/all/20250410-fprobe-v1-1-068ef5f41436@debian.org/ Fixes: a3dc2983ca7b90 ("tracing: fprobe: Cleanup fprobe hash when module unloading") Signed-off-by: Breno Leitao <leitao@debian.org> Tested-by: Antonio Quartulli <antonio@mandelbit.com> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-05-09net: mctp: Ensure keys maintain only one ref to corresponding devAndrew Jeffery
mctp_flow_prepare_output() is called in mctp_route_output(), which places outbound packets onto a given interface. The packet may represent a message fragment, in which case we provoke an unbalanced reference count to the underlying device. This causes trouble if we ever attempt to remove the interface: [ 48.702195] usb 1-1: USB disconnect, device number 2 [ 58.883056] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2 [ 69.022548] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2 [ 79.172568] unregister_netdevice: waiting for mctpusb0 to become free. Usage count = 2 ... Predicate the invocation of mctp_dev_set_key() in mctp_flow_prepare_output() on not already having associated the device with the key. It's not yet realistic to uphold the property that the key maintains only one device reference earlier in the transmission sequence as the route (and therefore the device) may not be known at the time the key is associated with the socket. Fixes: 67737c457281 ("mctp: Pass flow data & flow release events to drivers") Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au> Link: https://patch.msgid.link/20250508-mctp-dev-refcount-v1-1-d4f965c67bb5@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09fix IS_MNT_PROPAGATING usesAl Viro
propagate_mnt() does not attach anything to mounts created during propagate_mnt() itself. What's more, anything on ->mnt_slave_list of such new mount must also be new, so we don't need to even look there. When move_mount() had been introduced, we've got an additional class of mounts to skip - if we are moving from anon namespace, we do not want to propagate to mounts we are moving (i.e. all mounts in that anon namespace). Unfortunately, the part about "everything on their ->mnt_slave_list will also be ignorable" is not true - if we have propagation graph A -> B -> C and do OPEN_TREE_CLONE open_tree() of B, we get A -> [B <-> B'] -> C as propagation graph, where B' is a clone of B in our detached tree. Making B private will result in A -> B' -> C C still gets propagation from A, as it would after making B private if we hadn't done that open_tree(), but now the propagation goes through B'. Trying to move_mount() our detached tree on subdirectory in A should have * moved B' on that subdirectory in A * skipped the corresponding subdirectory in B' itself * copied B' on the corresponding subdirectory in C. As it is, the logics in propagation_next() and friends ends up skipping propagation into C, since it doesn't consider anything downstream of B'. IOW, walking the propagation graph should only skip the ->mnt_slave_list of new mounts; the only places where the check for "in that one anon namespace" are applicable are propagate_one() (where we should treat that as the same kind of thing as "mountpoint we are looking at is not visible in the mount we are looking at") and propagation_would_overmount(). The latter is better dealt with in the caller (can_move_mount_beneath()); on the first call of propagation_would_overmount() the test is always false, on the second it is always true in "move from anon namespace" case and always false in "move within our namespace" one, so it's easier to just use check_mnt() before bothering with the second call and be done with that. Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09do_move_mount(): don't leak MNTNS_PROPAGATING on failuresAl Viro
as it is, a failed move_mount(2) from anon namespace breaks all further propagation into that namespace, including normal mounts in non-anon namespaces that would otherwise propagate there. Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09do_umount(): add missing barrier before refcount checks in sync caseAl Viro
do_umount() analogue of the race fixed in 119e1ef80ecf "fix __legitimize_mnt()/mntput() race". Here we want to make sure that if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will notice their refcount increment. Harder to hit than mntput_no_expire() one, fortunately, and consequences are milder (sync umount acting like umount -l on a rare race with RCU pathwalk hitting at just the wrong time instead of use-after-free galore mntput_no_expire() counterpart used to be hit). Still a bug... Fixes: 48a066e72d97 ("RCU'd vfsmounts") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09tests/ncdevmem: Fix double-free of queue arrayCosmin Ratiu
netdev_bind_rx takes ownership of the queue array passed as parameter and frees it, so a queue array buffer cannot be reused across multiple netdev_bind_rx calls. This commit fixes that by always passing in a newly created queue array to all netdev_bind_rx calls in ncdevmem. Fixes: 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508084434.1933069-1-cratiu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09net: mctp: Don't access ifa_index when missingMatt Johnston
In mctp_dump_addrinfo, ifa_index can be used to filter interfaces, but only when the struct ifaddrmsg is provided. Otherwise it will be comparing to uninitialised memory - reproducible in the syzkaller case from dhcpd, or busybox "ip addr show". The kernel MCTP implementation has always filtered by ifa_index, so existing userspace programs expecting to dump MCTP addresses must already be passing a valid ifa_index value (either 0 or a real index). BUG: KMSAN: uninit-value in mctp_dump_addrinfo+0x208/0xac0 net/mctp/device.c:128 mctp_dump_addrinfo+0x208/0xac0 net/mctp/device.c:128 rtnl_dump_all+0x3ec/0x5b0 net/core/rtnetlink.c:4380 rtnl_dumpit+0xd5/0x2f0 net/core/rtnetlink.c:6824 netlink_dump+0x97b/0x1690 net/netlink/af_netlink.c:2309 Fixes: 583be982d934 ("mctp: Add device handling and netlink interface") Reported-by: syzbot+e76d52dadc089b9d197f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68135815.050a0220.3a872c.000e.GAE@google.com/ Reported-by: syzbot+1065a199625a388fce60@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/681357d6.050a0220.14dd7d.000d.GAE@google.com/ Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Link: https://patch.msgid.link/20250508-mctp-addr-dump-v2-1-c8a53fd2dd66@codeconstruct.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09tools/net/ynl: ethtool: fix crash when Hardware Clock info is missingHangbin Liu
Fix a crash in the ethtool YNL implementation when Hardware Clock information is not present in the response. This ensures graceful handling of devices or drivers that do not provide this optional field. e.g. Traceback (most recent call last): File "/net/tools/net/ynl/pyynl/./ethtool.py", line 438, in <module> main() ~~~~^^ File "/net/tools/net/ynl/pyynl/./ethtool.py", line 341, in main print(f'PTP Hardware Clock: {tsinfo["phc-index"]}') ~~~~~~^^^^^^^^^^^^^ KeyError: 'phc-index' Fixes: f3d07b02b2b8 ("tools: ynl: ethtool testing tool") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508035414.82974-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-09__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lockAl Viro
... or we risk stealing final mntput from sync umount - raising mnt_count after umount(2) has verified that victim is not busy, but before it has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see that it's safe to quietly undo mnt_count increment and leaves dropping the reference to caller, where it'll be a full-blown mntput(). Check under mount_lock is needed; leaving the current one done before taking that makes no sense - it's nowhere near common enough to bother with. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09Merge tag 'rust-fixes-6.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull rust fixes from Miguel Ojeda: - Make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88.0 - Clean Rust (and Clippy) lints for the upcoming Rust 1.87.0 and 1.88.0 releases - Clean objtool warning for the upcoming Rust 1.87.0 release by adding one more noreturn function * tag 'rust-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: x86/Kconfig: make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88 rust: clean Rust 1.88.0's `clippy::uninlined_format_args` lint rust: clean Rust 1.88.0's warning about `clippy::disallowed_macros` configuration rust: clean Rust 1.88.0's `unnecessary_transmutes` lint rust: allow Rust 1.87.0's `clippy::ptr_eq` lint objtool/rust: add one more `noreturn` Rust function for Rust 1.87.0
2025-05-09selftest/x86/bugs: Add selftests for ITSPawan Gupta
Below are the tests added for Indirect Target Selection (ITS): - its_sysfs.py - Check if sysfs reflects the correct mitigation status for the mitigation selected via the kernel cmdline. - its_permutations.py - tests mitigation selection with cmdline permutations with other bugs like spectre_v2 and retbleed. - its_indirect_alignment.py - verifies that for addresses in .retpoline_sites section that belong to lower half of cacheline are patched to ITS-safe thunk. Typical output looks like below: Site 49: function symbol: __x64_sys_restart_syscall+0x1f <0xffffffffbb1509af> # vmlinux: 0xffffffff813509af: jmp 0xffffffff81f5a8e0 # kcore: 0xffffffffbb1509af: jmpq *%rax # ITS thunk NOT expected for site 49 # PASSED: Found *%rax # Site 50: function symbol: __resched_curr+0xb0 <0xffffffffbb181910> # vmlinux: 0xffffffff81381910: jmp 0xffffffff81f5a8e0 # kcore: 0xffffffffbb181910: jmp 0xffffffffc02000fc # ITS thunk expected for site 50 # PASSED: Found 0xffffffffc02000fc -> jmpq *%rax <scattered-thunk?> - its_ret_alignment.py - verifies that for addresses in .return_sites section that belong to lower half of cacheline are patched to its_return_thunk. Typical output looks like below: Site 97: function symbol: collect_event+0x48 <0xffffffffbb007f18> # vmlinux: 0xffffffff81207f18: jmp 0xffffffff81f5b500 # kcore: 0xffffffffbb007f18: jmp 0xffffffffbbd5b560 # PASSED: Found jmp 0xffffffffbbd5b560 <its_return_thunk> # Site 98: function symbol: collect_event+0xa4 <0xffffffffbb007f74> # vmlinux: 0xffffffff81207f74: jmp 0xffffffff81f5b500 # kcore: 0xffffffffbb007f74: retq # PASSED: Found retq Some of these tests have dependency on tools like virtme-ng[1] and drgn[2]. When the dependencies are not met, the test will be skipped. [1] https://github.com/arighi/virtme-ng [2] https://github.com/osandov/drgn Co-developed-by: Tao Zhang <tao1.zhang@linux.intel.com> Signed-off-by: Tao Zhang <tao1.zhang@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2025-05-09x86/its: FineIBT-paranoid vs ITSPeter Zijlstra
FineIBT-paranoid was using the retpoline bytes for the paranoid check, disabling retpolines, because all parts that have IBT also have eIBRS and thus don't need no stinking retpolines. Except... ITS needs the retpolines for indirect calls must not be in the first half of a cacheline :-/ So what was the paranoid call sequence: <fineibt_paranoid_start>: 0: 41 ba 78 56 34 12 mov $0x12345678, %r10d 6: 45 3b 53 f7 cmp -0x9(%r11), %r10d a: 4d 8d 5b <f0> lea -0x10(%r11), %r11 e: 75 fd jne d <fineibt_paranoid_start+0xd> 10: 41 ff d3 call *%r11 13: 90 nop Now becomes: <fineibt_paranoid_start>: 0: 41 ba 78 56 34 12 mov $0x12345678, %r10d 6: 45 3b 53 f7 cmp -0x9(%r11), %r10d a: 4d 8d 5b f0 lea -0x10(%r11), %r11 e: 2e e8 XX XX XX XX cs call __x86_indirect_paranoid_thunk_r11 Where the paranoid_thunk looks like: 1d: <ea> (bad) __x86_indirect_paranoid_thunk_r11: 1e: 75 fd jne 1d __x86_indirect_its_thunk_r11: 20: 41 ff eb jmp *%r11 23: cc int3 [ dhansen: remove initialization to false ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Use dynamic thunks for indirect branchesPeter Zijlstra
ITS mitigation moves the unsafe indirect branches to a safe thunk. This could degrade the prediction accuracy as the source address of indirect branches becomes same for different execution paths. To improve the predictions, and hence the performance, assign a separate thunk for each indirect callsite. This is also a defense-in-depth measure to avoid indirect branches aliasing with each other. As an example, 5000 dynamic thunks would utilize around 16 bits of the address space, thereby gaining entropy. For a BTB that uses 32 bits for indexing, dynamic thunks could provide better prediction accuracy over fixed thunks. Have ITS thunks be variable sized and use EXECMEM_MODULE_TEXT such that they are both more flexible (got to extend them later) and live in 2M TLBs, just like kernel code, avoiding undue TLB pressure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/ibt: Keep IBT disabled during alternative patchingPawan Gupta
cfi_rewrite_callers() updates the fineIBT hash matching at the caller side, but except for paranoid-mode it relies on apply_retpoline() and friends for any ENDBR relocation. This could temporarily cause an indirect branch to land on a poisoned ENDBR. For instance, with para-virtualization enabled, a simple wrmsrl() could have an indirect branch pointing to native_write_msr() who's ENDBR has been relocated due to fineIBT: <wrmsrl>: push %rbp mov %rsp,%rbp mov %esi,%eax mov %rsi,%rdx shr $0x20,%rdx mov %edi,%edi mov %rax,%rsi call *0x21e65d0(%rip) # <pv_ops+0xb8> ^^^^^^^^^^^^^^^^^^^^^^^ Such an indirect call during the alternative patching could #CP if the caller is not *yet* adjusted for the new target ENDBR. To prevent a false #CP, keep CET-IBT disabled until all callers are patched. Patching during the module load does not need to be guarded by IBT-disable because the module code is not executed until the patching is complete. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09mm/execmem: Unify early execmem_cache behaviourPeter Zijlstra
Early kernel memory is RWX, only at the end of early boot (before SMP) do we mark things ROX. Have execmem_cache mirror this behaviour for early users. This avoids having to remember what code is execmem and what is not -- we can poke everything with impunity ;-) Also performance for not having to do endless text_poke_mm switches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09arm64: dts: amazon: Fix simple-bus node name schema warningsRob Herring (Arm)
Fix a couple of node name warnings from the schema checks: arch/arm64/boot/dts/amazon/alpine-v2-evp.dt.yaml: io-fabric: $nodename:0: 'io-fabric' does not match '^(bus|soc|axi|ahb|apb)(@[0-9a-f]+)?$' arch/arm64/boot/dts/amazon/alpine-v3-evp.dt.yaml: io-fabric: $nodename:0: 'io-fabric' does not match '^(bus|soc|axi|ahb|apb)(@[0-9a-f]+)?$' Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Link: https://lore.kernel.org/r/20250409210255.1541298-1-robh@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09MAINTAINERS: delete email for Shiraz HashimWolfram Sang
The email address bounced. I couldn't find a newer one in recent git history (last activity 9 years ago), so delete this email entry. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20250331190731.5094-2-wsa+renesas@sang-engineering.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09x86/its: Align RETs in BHB clear sequence to avoid thunkingPawan Gupta
The software mitigation for BHI is to execute BHB clear sequence at syscall entry, and possibly after a cBPF program. ITS mitigation thunks RETs in the lower half of the cacheline. This causes the RETs in the BHB clear sequence to be thunked as well, adding unnecessary branches to the BHB clear sequence. Since the sequence is in hot path, align the RET instructions in the sequence to avoid thunking. This is how disassembly clear_bhb_loop() looks like after this change: 0x44 <+4>: mov $0x5,%ecx 0x49 <+9>: call 0xffffffff81001d9b <clear_bhb_loop+91> 0x4e <+14>: jmp 0xffffffff81001de5 <clear_bhb_loop+165> 0x53 <+19>: int3 ... 0x9b <+91>: call 0xffffffff81001dce <clear_bhb_loop+142> 0xa0 <+96>: ret 0xa1 <+97>: int3 ... 0xce <+142>: mov $0x5,%eax 0xd3 <+147>: jmp 0xffffffff81001dd6 <clear_bhb_loop+150> 0xd5 <+149>: nop 0xd6 <+150>: sub $0x1,%eax 0xd9 <+153>: jne 0xffffffff81001dd3 <clear_bhb_loop+147> 0xdb <+155>: sub $0x1,%ecx 0xde <+158>: jne 0xffffffff81001d9b <clear_bhb_loop+91> 0xe0 <+160>: ret 0xe1 <+161>: int3 0xe2 <+162>: int3 0xe3 <+163>: int3 0xe4 <+164>: int3 0xe5 <+165>: lfence 0xe8 <+168>: pop %rbp 0xe9 <+169>: ret Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add support for RSB stuffing mitigationPawan Gupta
When retpoline mitigation is enabled for spectre-v2, enabling call-depth-tracking and RSB stuffing also mitigates ITS. Add cmdline option indirect_target_selection=stuff to allow enabling RSB stuffing mitigation. When retpoline mitigation is not enabled, =stuff option is ignored, and default mitigation for ITS is deployed. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add "vmexit" option to skip mitigation on some CPUsPawan Gupta
Ice Lake generation CPUs are not affected by guest/host isolation part of ITS. If a user is only concerned about KVM guests, they can now choose a new cmdline option "vmexit" that will not deploy the ITS mitigation when CPU is not affected by guest/host isolation. This saves the performance overhead of ITS mitigation on Ice Lake gen CPUs. When "vmexit" option selected, if the CPU is affected by ITS guest/host isolation, the default ITS mitigation is deployed. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Enable Indirect Target Selection mitigationPawan Gupta
Indirect Target Selection (ITS) is a bug in some pre-ADL Intel CPUs with eIBRS. It affects prediction of indirect branch and RETs in the lower half of cacheline. Due to ITS such branches may get wrongly predicted to a target of (direct or indirect) branch that is located in the upper half of the cacheline. Scope of impact =============== Guest/host isolation -------------------- When eIBRS is used for guest/host isolation, the indirect branches in the VMM may still be predicted with targets corresponding to branches in the guest. Intra-mode ---------- cBPF or other native gadgets can be used for intra-mode training and disclosure using ITS. User/kernel isolation --------------------- When eIBRS is enabled user/kernel isolation is not impacted. Indirect Branch Prediction Barrier (IBPB) ----------------------------------------- After an IBPB, indirect branches may be predicted with targets corresponding to direct branches which were executed prior to IBPB. This is mitigated by a microcode update. Add cmdline parameter indirect_target_selection=off|on|force to control the mitigation to relocate the affected branches to an ITS-safe thunk i.e. located in the upper half of cacheline. Also add the sysfs reporting. When retpoline mitigation is deployed, ITS safe-thunks are not needed, because retpoline sequence is already ITS-safe. Similarly, when call depth tracking (CDT) mitigation is deployed (retbleed=stuff), ITS safe return thunk is not used, as CDT prevents RSB-underflow. To not overcomplicate things, ITS mitigation is not supported with spectre-v2 lfence;jmp mitigation. Moreover, it is less practical to deploy lfence;jmp mitigation on ITS affected parts anyways. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add support for ITS-safe return thunkPawan Gupta
RETs in the lower half of cacheline may be affected by ITS bug, specifically when the RSB-underflows. Use ITS-safe return thunk for such RETs. RETs that are not patched: - RET in retpoline sequence does not need to be patched, because the sequence itself fills an RSB before RET. - RET in Call Depth Tracking (CDT) thunks __x86_indirect_{call|jump}_thunk and call_depth_return_thunk are not patched because CDT by design prevents RSB-underflow. - RETs in .init section are not reachable after init. - RETs that are explicitly marked safe with ANNOTATE_UNRET_SAFE. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Add support for ITS-safe indirect thunkPawan Gupta
Due to ITS, indirect branches in the lower half of a cacheline may be vulnerable to branch target injection attack. Introduce ITS-safe thunks to patch indirect branches in the lower half of cacheline with the thunk. Also thunk any eBPF generated indirect branches in emit_indirect_jump(). Below category of indirect branches are not mitigated: - Indirect branches in the .init section are not mitigated because they are discarded after boot. - Indirect branches that are explicitly marked retpoline-safe. Note that retpoline also mitigates the indirect branches against ITS. This is because the retpoline sequence fills an RSB entry before RET, and it does not suffer from RSB-underflow part of the ITS. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09x86/its: Enumerate Indirect Target Selection (ITS) bugPawan Gupta
ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches in the first half of a cache line get predicted to a target of a branch located in the second half of the cache line. Set X86_BUG_ITS on affected CPUs. Mitigation to follow in later commits. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09Documentation: x86/bugs/its: Add ITS documentationPawan Gupta
Add the admin-guide for Indirect Target Selection (ITS). Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-09Merge tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds
Pull drm fixes from Dave Airlie: "Weekly drm fixes, bit bigger than last week, but overall amdgpu/xe with some ivpu bits and a random few fixes, and dropping the ttm_backup struct which wrapped struct file and was recently frowned at. drm: - Fix overflow when generating wedged event ttm: - Fix documentation - Remove struct ttm_backup panel: - simple: Fix timings for AUO G101EVN010 amdgpu: - DC FP fixes - Freesync fix - DMUB AUX fixes - VCN fix - Hibernation fixes - HDP fixes xe: - Prevent PF queue overflow - Hold all forcewake during mocs test - Remove GSC flush on reset path - Fix forcewake put on error path - Fix runtime warning when building without svm i915: - Fix oops on resume after disconnecting DP MST sinks during suspend - Fix SPLC num_waiters refcounting ivpu: - Increase timeouts - Fix deadlock in cmdq ioctl - Unlock mutices in correct order v3d: - Avoid memory leak in job handling" * tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel: (32 commits) drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation drm/xe: Add config control for svm flush work drm/xe: Release force wake first then runtime power drm/xe/gsc: do not flush the GSC worker from the reset path drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs drm/xe: Add page queue multiplier drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush drm/amdgpu: fix pm notifier handling Revert "drm/amd: Stop evicting resources on APUs in suspend" drm/amdgpu/vcn: using separate VCN1_AON_SOC offset drm/amd/display: Fix wrong handling for AUX_DEFER case drm/amd/display: Copy AUX read reply data whenever length > 0 drm/amd/display: Remove incorrect checking in dmub aux handler drm/amd/display: Fix the checking condition in dmub aux handling drm/amd/display: Shift DMUB AUX reply command if necessary drm/amd/display: Call FP Protect Before Mode Programming/Mode Support ...
2025-05-10Merge tag 'drm-intel-fixes-2025-05-09' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes drm/i915 fixes for v6.15-rc6: - Fix oops on resume after disconnecting DP MST sinks during suspend - Fix SPLC num_waiters refcounting Signed-off-by: Dave Airlie <airlied@redhat.com> From: Jani Nikula <jani.nikula@intel.com> Link: https://lore.kernel.org/r/87tt5umeaw.fsf@intel.com
2025-05-10Merge tag 'drm-xe-fixes-2025-05-09' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Prevent PF queue overflow - Hold all forcewake during mocs test - Remove GSC flush on reset path - Fix forcewake put on error path - Fix runtime warning when building without svm Signed-off-by: Dave Airlie <airlied@redhat.com> From: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/jffqa56f2zp4i5ztz677cdspgxhnw7qfop3dd3l2epykfpfvza@q2nw6wapsphz
2025-05-09drm/meson: Use 1000ULL when operating with mode->clockI Hsin Cheng
Coverity scan reported the usage of "mode->clock * 1000" may lead to integer overflow. Use "1000ULL" instead of "1000" when utilizing it to avoid potential integer overflow issue. Link: https://scan5.scan.coverity.com/#/project-view/10074/10063?selectedIssue=1646759 Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Fixes: 1017560164b6 ("drm/meson: use unsigned long long / Hz for frequency types") Link: https://lore.kernel.org/r/20250505184338.678540-1-richard120310@gmail.com Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
2025-05-09Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Catalin Marinas: "Move the arm64_use_ng_mappings variable from the .bss to the .data section as it is accessed very early during boot with the MMU off and before the .bss has been initialised. This could lead to incorrect idmap page table" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation
2025-05-09Merge tag 'riscv-for-linus-6.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - The compressed half-word misaligned access instructions (c.lhu, c.lh, and c.sh) from the Zcb extension are now properly emulated - A series of fixes to properly emulate permissions while handling userspace misaligned accesses - A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the envcfg CSR on systems that don't support that CSR, and to report those failures up to userspace - The .rela.dyn section is no longer stripped from vmlinux, as it is necessary to relocate the kernel under some conditions (including kexec) * tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm scripts: Do not strip .rela.dyn section riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL riscv: misaligned: use get_user() instead of __get_user() riscv: misaligned: enable IRQs while handling misaligned accesses riscv: misaligned: factorize trap handling riscv: misaligned: Add handling for ZCB instructions
2025-05-09cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasksWaiman Long
Commit ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") enabled us to pull CPUs dedicated to child partitions from tasks in top_cpuset by ignoring per cpu kthreads. However, there can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY flag set to indicate that we shouldn't mess with their CPU affinity. For other kthreads, their affinity will be changed to skip CPUs dedicated to child partitions whether it is an isolating or a scheduling one. As all the per cpu kthreads have PF_NO_SETAFFINITY set, the PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads. Fix this issue by dropping the kthread_is_per_cpu() check and checking the PF_NO_SETAFFINITY flag instead. Fixes: ec5fbdfb99d1 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-09Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - Fix for a regression in this series for loop and read/write iterator handling - zone append block update tweak - remove a broken IO priority test - NVMe pull request via Christoph: - unblock ctrl state transition for firmware update (Daniel Wagner) * tag 'block-6.15-20250509' of git://git.kernel.dk/linux: block: remove test of incorrect io priority level nvme: unblock ctrl state transition for firmware update block: only update request sector if needed loop: Add sanity check for read/write_iter
2025-05-09Merge tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for linked timeouts arming and firing wrt prep and issue of the request being managed by the linked timeout - Fix for a CQE ordering issue between requests with multishot and using the same buffer group. This is a dumbed down version for this release and for stable, it'll get improved for v6.16 - Tweak the SQPOLL submit batch size. A previous commit made SQPOLL manage its own task_work and chose a tiny batch size, bump it from 8 to 32 to fix a performance regression due to that * tag 'io_uring-6.15-20250509' of git://git.kernel.dk/linux: io_uring/sqpoll: Increase task_work submission batch size io_uring: ensure deferred completions are flushed for multishot io_uring: always arm linked timeouts prior to issue
2025-05-09Merge tag 'modules-6.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull modules fix from Petr Pavlu: "A single fix to prevent use of an uninitialized completion pointer when releasing a module_kobject in specific situations. This addresses a latent bug exposed by commit f95bbfe18512 ("drivers: base: handle module_kobject creation")" * tag 'modules-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: module: ensure that kobject_put() is safe for module type kobjects
2025-05-09Merge tag 'asahi-soc-fixes-6.15' of https://github.com/AsahiLinux/linux into ↵Arnd Bergmann
arm/fixes Apple SoC fixes for 6.15 This tag contains two small commits since rc1: - Add a .mailmap entry requested by Asahi Lina to better filter her emails - Mark the power domains for the touchbar support introduced with 6.15 as always on since the driver cannot initialize the touchbar from scratch after the domains are powered off (e.g. during suspend). * tag 'asahi-soc-fixes-6.15' of https://github.com/AsahiLinux/linux: arm64: dts: apple: touchbar: Mark ps_dispdfr_be as always-on mailmap: Update email for Asahi Lina Link: https://lore.kernel.org/r/20250423145047.3098-1-sven@svenpeter.dev Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09Merge tag 'riscv-sophgo-dt-fixes-for-v6.15-rc1' of ↵Arnd Bergmann
https://github.com/sophgo/linux into arm/fixes RISC-V Sophgo Devicetree fixes for v6.15-rc1 Just one minor fix to correct DMA data-width configuration for CV18xx. Signed-off-by: Chen Wang <unicorn_wang@outlook.com> * tag 'riscv-sophgo-dt-fixes-for-v6.15-rc1' of https://github.com/sophgo/linux: riscv: dts: sophgo: fix DMA data-width configuration for CV18xx Link: https://lore.kernel.org/r/MA0P287MB2262454C19B8899BC1694D04FE832@MA0P287MB2262.INDP287.PROD.OUTLOOK.COM Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09Merge tag 'amlogic-fixes-for-v6.15' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux into arm/fixes Amlogic Fixes for v6.15: - fix reference to unknown/untested PWM clock on ARM/ARM64 boards - fix missing clkc_audio node on dreambox ARM64 DT * tag 'amlogic-fixes-for-v6.15' of https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux: arm64: dts: amlogic: dreambox: fix missing clkc_audio node arm64: dts: amlogic: g12: fix reference to unknown/untested PWM clock arm64: dts: amlogic: gx: fix reference to unknown/untested PWM clock ARM: dts: amlogic: meson8b: fix reference to unknown/untested PWM clock ARM: dts: amlogic: meson8: fix reference to unknown/untested PWM clock Link: https://lore.kernel.org/r/e9c520a1-b986-49e1-b9b1-67511c187716@linaro.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09Merge tag 'v6.15-rockchip-dtsfixes1' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes Removal of operating-points above what the rk3588j soc is rated for, and a number of smaller fixes: Turing RK1 fan can spin down again, fixed pins, pinmuxing and clocks and some devicetree-correctnes improvements. * tag 'v6.15-rockchip-dtsfixes1' of https://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: arm64: dts: rockchip: fix Sige5 RTC interrupt pin arm64: dts: rockchip: Assign RT5616 MCLK rate on rk3588-friendlyelec-cm3588 arm64: dts: rockchip: Align wifi node name with bindings in CB2 arm64: dts: rockchip: Fix mmc-pwrseq clock name on rock-pi-4 arm64: dts: rockchip: Use "regulator-fixed" for btreg on px30-engicam for vcc3v3-btreg arm64: dts: rockchip: Add pinmuxing for eMMC on QNAP TS433 arm64: dts: rockchip: Remove overdrive-mode OPPs from RK3588J SoC dtsi arm64: dts: rockchip: Allow Turing RK1 cooling fan to spin down Link: https://lore.kernel.org/r/2923598.88bMQJbFj6@diego Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-09x86/mm: Eliminate window where TLB flushes may be inadvertently skippedDave Hansen
tl;dr: There is a window in the mm switching code where the new CR3 is set and the CPU should be getting TLB flushes for the new mm. But should_flush_tlb() has a bug and suppresses the flush. Fix it by widening the window where should_flush_tlb() sends an IPI. Long Version: === History === There were a few things leading up to this. First, updating mm_cpumask() was observed to be too expensive, so it was made lazier. But being lazy caused too many unnecessary IPIs to CPUs due to the now-lazy mm_cpumask(). So code was added to cull mm_cpumask() periodically[2]. But that culling was a bit too aggressive and skipped sending TLB flushes to CPUs that need them. So here we are again. === Problem === The too-aggressive code in should_flush_tlb() strikes in this window: // Turn on IPIs for this CPU/mm combination, but only // if should_flush_tlb() agrees: cpumask_set_cpu(cpu, mm_cpumask(next)); next_tlb_gen = atomic64_read(&next->context.tlb_gen); choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); load_new_mm_cr3(need_flush); // ^ After 'need_flush' is set to false, IPIs *MUST* // be sent to this CPU and not be ignored. this_cpu_write(cpu_tlbstate.loaded_mm, next); // ^ Not until this point does should_flush_tlb() // become true! should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3() and writing to 'loaded_mm', which is a window where they should not be suppressed. Whoops. === Solution === Thankfully, the fuzzy "just about to write CR3" window is already marked with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in should_flush_tlb() is sufficient to ensure that the CPU is targeted with an IPI. This will cause more TLB flush IPIs. But the window is relatively small and I do not expect this to cause any kind of measurable performance impact. Update the comment where LOADED_MM_SWITCHING is written since it grew yet another user. Peter Z also raised a concern that should_flush_tlb() might not observe 'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off() writes them. Add a barrier to ensure that they are observed in the order they are written. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ [1] Fixes: 6db2526c1d69 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2] Reported-by: Stephen Dolan <sdolan@janestreet.com> Cc: stable@vger.kernel.org Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>