summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-17perf/ring_buffer: Allow the EPOLLRDNORM flag for pollTao Chen
The poll man page says POLLRDNORM is equivalent to POLLIN. For poll(), it seems that if user sets pollfd with POLLRDNORM in userspace, perf_poll will not return until timeout even if perf_output_wakeup called, whereas POLLIN returns. Fixes: 76369139ceb9 ("perf: Split up buffer handling from core code") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314030036.2543180-1-chen.dylane@linux.dev
2025-03-17perf/core: Use POLLHUP for pinned events in errorNamhyung Kim
Pinned performance events can enter an error state when they fail to be scheduled in the context due to a failed constraint or some other conflict or condition. In error state these events won't generate any samples anymore and are silently ignored until they are recovered by PERF_EVENT_IOC_ENABLE, or the condition can also change so that they can be scheduled in. Tooling should be allowed to know about the state change, but currently there's no mechanism to notify tooling when events enter an error state. One way to do this is to issue a POLLHUP event to poll(2) to handle this. Reading events in an error state would return 0 (EOF) and it matches to the behavior of POLLHUP according to the man page. Tooling should remove the fd of the event from pollfd after getting POLLHUP, otherwise it'll be returned repeatedly. [ mingo: Clarified the changelog ] Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317061745.1777584-1-namhyung@kernel.org
2025-03-16perf/core: Use sysfs_emit() instead of scnprintf()XieLudan
Follow the advice in Documentation/filesystems/sysfs.rst: "- show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space." No change in functionality intended. [ mingo: Updated the changelog ] Signed-off-by: XieLudan <xie.ludan@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250315141738452lXIH39UJAXlCmcATCzcBv@zte.com.cn
2025-03-10perf/core: Remove optional 'size' arguments from strscpy() callsThorsten Blum
The 'size' parameter is optional and strscpy() automatically determines the length of the destination buffer using sizeof() if the argument is omitted. This makes the explicit sizeof() calls unnecessary. Furthermore, KSYM_NAME_LEN is equal to sizeof(name) and can also be removed. Remove them to shorten and simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250310192336.442994-1-thorsten.blum@linux.dev
2025-03-06perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functionsLi RongQing
bts_ctx might not be allocated, for example if the CPU has X86_FEATURE_PTI, but intel_bts_disable/enable_local() and intel_bts_interrupt() are called unconditionally from intel_pmu_handle_irq() and crash on bts_ctx. So check if bts_ctx is allocated when calling BTS functions. Fixes: 3acfcefa795c ("perf/x86/intel/bts: Allocate bts_ctx only if necessary") Reported-by: Jiri Olsa <olsajiri@gmail.com> Tested-by: Jiri Olsa <jolsa@kernel.org> Suggested-by: Adrian Hunter <adrian.hunter@intel.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250306051102.2642-1-lirongqing@baidu.com
2025-03-06uprobes/x86: Harden uretprobe syscall trampoline checkJiri Olsa
Jann reported a possible issue when trampoline_check_ip returns address near the bottom of the address space that is allowed to call into the syscall if uretprobes are not set up: https://lore.kernel.org/bpf/202502081235.5A6F352985@keescook/T/#m9d416df341b8fbc11737dacbcd29f0054413cbbf Though the mmap minimum address restrictions will typically prevent creating mappings there, let's make sure uretprobe syscall checks for that. Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250212220433.3624297-1-jolsa@kernel.org
2025-03-06watchdog/hardlockup/perf: Warn if watchdog_ev is leakedLi Huafei
When creating a new perf_event for the hardlockup watchdog, it should not happen that the old perf_event is not released. Introduce a WARN_ONCE() that should never trigger. [ mingo: Changed the type of the warning to WARN_ONCE(). ] Signed-off-by: Li Huafei <lihuafei1@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241021193004.308303-2-lihuafei1@huawei.com
2025-03-06watchdog/hardlockup/perf: Fix perf_event memory leakLi Huafei
During stress-testing, we found a kmemleak report for perf_event: unreferenced object 0xff110001410a33e0 (size 1328): comm "kworker/4:11", pid 288, jiffies 4294916004 hex dump (first 32 bytes): b8 be c2 3b 02 00 11 ff 22 01 00 00 00 00 ad de ...;...."....... f0 33 0a 41 01 00 11 ff f0 33 0a 41 01 00 11 ff .3.A.....3.A.... backtrace (crc 24eb7b3a): [<00000000e211b653>] kmem_cache_alloc_node_noprof+0x269/0x2e0 [<000000009d0985fa>] perf_event_alloc+0x5f/0xcf0 [<00000000084ad4a2>] perf_event_create_kernel_counter+0x38/0x1b0 [<00000000fde96401>] hardlockup_detector_event_create+0x50/0xe0 [<0000000051183158>] watchdog_hardlockup_enable+0x17/0x70 [<00000000ac89727f>] softlockup_start_fn+0x15/0x40 ... Our stress test includes CPU online and offline cycles, and updating the watchdog configuration. After reading the code, I found that there may be a race between cleaning up perf_event after updating watchdog and disabling event when the CPU goes offline: CPU0 CPU1 CPU2 (update watchdog) (hotplug offline CPU1) ... _cpu_down(CPU1) cpus_read_lock() // waiting for cpu lock softlockup_start_all smp_call_on_cpu(CPU1) softlockup_start_fn ... watchdog_hardlockup_enable(CPU1) perf create E1 watchdog_ev[CPU1] = E1 cpus_read_unlock() cpus_write_lock() cpuhp_kick_ap_work(CPU1) cpuhp_thread_fun ... watchdog_hardlockup_disable(CPU1) watchdog_ev[CPU1] = NULL dead_event[CPU1] = E1 __lockup_detector_cleanup for each dead_events_mask release each dead_event /* * CPU1 has not been added to * dead_events_mask, then E1 * will not be released */ CPU1 -> dead_events_mask cpumask_clear(&dead_events_mask) // dead_events_mask is cleared, E1 is leaked In this case, the leaked perf_event E1 matches the perf_event leak reported by kmemleak. Due to the low probability of problem recurrence (only reported once), I added some hack delays in the code: static void __lockup_detector_reconfigure(void) { ... watchdog_hardlockup_start(); cpus_read_unlock(); + mdelay(100); /* * Must be called outside the cpus locked section to prevent * recursive locking in the perf code. ... } void watchdog_hardlockup_disable(unsigned int cpu) { ... perf_event_disable(event); this_cpu_write(watchdog_ev, NULL); this_cpu_write(dead_event, event); + mdelay(100); cpumask_set_cpu(smp_processor_id(), &dead_events_mask); atomic_dec(&watchdog_cpus); ... } void hardlockup_detector_perf_cleanup(void) { ... perf_event_release_kernel(event); per_cpu(dead_event, cpu) = NULL; } + mdelay(100); cpumask_clear(&dead_events_mask); } Then, simultaneously performing CPU on/off and switching watchdog, it is almost certain to reproduce this leak. The problem here is that releasing perf_event is not within the CPU hotplug read-write lock. Commit: 941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock") introduced deferred release to solve the deadlock caused by calling get_online_cpus() when releasing perf_event. Later, commit: efe951d3de91 ("perf/x86: Fix perf,x86,cpuhp deadlock") removed the get_online_cpus() call on the perf_event release path to solve another deadlock problem. Therefore, it is now possible to move the release of perf_event back into the CPU hotplug read-write lock, and release the event immediately after disabling it. Fixes: 941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241021193004.308303-1-lihuafei1@huawei.com
2025-03-05perf/x86: Annotate struct bts_buffer::buf with __counted_by()Thorsten Blum
Add the __counted_by() compiler attribute to the flexible array member buf to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250305123134.215577-2-thorsten.blum@linux.dev
2025-03-05perf/core: Clean up perf_try_init_event()Peter Zijlstra
Make sure that perf_try_init_event() doesn't leave event->pmu nor event->destroy set on failure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250205102449.110145835@infradead.org
2025-03-04perf/core: Fix perf_mmap() failure pathPeter Zijlstra
When f_ops->mmap() returns failure, m_ops->close() is *not* called. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.248358497@infradead.org
2025-03-04perf/core: Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimesPeter Zijlstra
In prepration for being able to unregister a PMU with existing events, it becomes important to detach struct perf_cpu_pmu_context lifetimes from that of struct pmu. Notably struct perf_cpu_pmu_context embeds a struct perf_event_pmu_context that can stay referenced until the last event goes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.760214287@infradead.org
2025-03-04perf/core: Lift event->mmap_mutex in perf_mmap()Peter Zijlstra
This puts 'all' of perf_mmap() under single event->mmap_mutex. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.582252957@infradead.org
2025-03-04perf/core: Remove retry loop from perf_mmap()Peter Zijlstra
AFAICT there is no actual benefit from the mutex drop on re-try. The 'worst' case scenario is that we instantly re-gain the mutex without perf_mmap_close() getting it. So might as well make that the normal case. Reflow the code to make the ring buffer detach case naturally flow into the no ring buffer case. [ mingo: Forward ported it ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.463607258@infradead.org
2025-03-04perf/core: Further simplify perf_mmap()Peter Zijlstra
Perform CSE and such. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.354909594@infradead.org
2025-03-04perf/core: Simplify the perf_mmap() control flowPeter Zijlstra
Identity-transform: if (c) { X1; } else { Y; goto l; } X2; l: into the simpler: if (c) { X1; X2; } else { Y; } [ mingo: Forward ported it ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135519.095904637@infradead.org
2025-03-04perf/bpf: Robustify perf_event_free_bpf_prog()Peter Zijlstra
Ensure perf_event_free_bpf_prog() is safe to call a second time; notably without making any references to event->pmu when there is no prog left. Note: perf_event_detach_bpf_prog() might leave a stale event->prog Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241104135518.978956692@infradead.org
2025-03-04perf/core: Introduce perf_free_addr_filters()Peter Zijlstra
Replace _free_event()'s use of perf_addr_filters_splice()s use with an explicit perf_free_addr_filters() with the explicit propery that it is able to be called a second time without ill effect. Most notable, referencing event->pmu must be avoided when there are no filters left (from eg a previous call). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.868460518@infradead.org
2025-03-04perf/core: Add this_cpc() helperPeter Zijlstra
As a preparation for adding yet another indirection. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.650051565@infradead.org
2025-03-04perf/core: Merge struct pmu::pmu_disable_count into struct ↵Peter Zijlstra
perf_cpu_pmu_context::pmu_disable_count Because it makes no sense to have two per-cpu allocations per pmu. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.518730578@infradead.org
2025-03-04perf/core: Simplify perf_event_alloc()Peter Zijlstra
Using the previous simplifications, transition perf_event_alloc() to the cleanup way of things -- reducing error path magic. [ mingo: Ported it to recent kernels. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.410755241@infradead.org
2025-03-04perf/core: Simplify perf_init_event()Peter Zijlstra
Use the <linux/cleanup.h> guard() and scoped_guard() infrastructure to simplify the control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20241104135518.302444446@infradead.org
2025-03-04perf/core: Simplify perf_pmu_register()Peter Zijlstra
Using the previously introduced perf_pmu_free() and a new IDR helper, simplify the perf_pmu_register error paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.198937277@infradead.org
2025-03-04perf/core: Simplify the perf_pmu_register() error pathPeter Zijlstra
The error path of perf_pmu_register() is of course very similar to a subset of perf_pmu_unregister(). Extract this common part in perf_pmu_free() and simplify things. [ mingo: Forward ported it ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135518.090915501@infradead.org
2025-03-04perf/core: Simplify the perf_event_alloc() error pathPeter Zijlstra
The error cleanup sequence in perf_event_alloc() is a subset of the existing _free_event() function (it must of course be). Split this out into __free_event() and simplify the error path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20241104135517.967889521@infradead.org
2025-03-04perf/hw_breakpoint: Return EOPNOTSUPP for unsupported breakpoint typeSaket Kumar Bhaskar
Currently, __reserve_bp_slot() returns -ENOSPC for unsupported breakpoint types on the architecture. For example, powerpc does not support hardware instruction breakpoints. This causes the perf_skip BPF selftest to fail, as neither ENOENT nor EOPNOTSUPP is returned by perf_event_open for unsupported breakpoint types. As a result, the test that should be skipped for this arch is not correctly identified. To resolve this, hw_breakpoint_event_init() should exit early by checking for unsupported breakpoint types using hw_breakpoint_slots_cached() and return the appropriate error (-EOPNOTSUPP). Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://lore.kernel.org/r/20250303092451.1862862-1-skb99@linux.ibm.com
2025-03-01Merge branch 'perf/urgent' into perf/core, to pick up dependent patches and ↵Ingo Molnar
fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-01perf/core: Fix perf_pmu_register() vs. perf_init_event()Peter Zijlstra
There is a fairly obvious race between perf_init_event() doing idr_find() and perf_pmu_register() doing idr_alloc() with an incompletely initialized PMU pointer. Avoid by doing idr_alloc() on a NULL pointer to register the id, and swizzling the real struct pmu pointer at the end using idr_replace(). Also making sure to not set struct pmu members after publishing the struct pmu, duh. [ introduce idr_cmpxchg() in order to better handle the idr_replace() error case -- if it were to return an unexpected pointer, it will already have replaced the value and there is no going back. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241104135517.858805880@infradead.org
2025-03-01perf/core: Fix pmus_lock vs. pmus_srcu orderingPeter Zijlstra
Commit a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order") placed pmus_lock inside pmus_srcu, this makes perf_pmu_unregister() trip lockdep. Move the locking about such that only pmu_idr and pmus (list) are modified while holding pmus_lock. This avoids doing synchronize_srcu() while holding pmus_lock and all is well again. Fixes: a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241104135517.679556858@infradead.org
2025-03-01Merge tag 'ata-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux Pull ata fixes from Niklas Cassel: - Fix a regression where the enablement of the PHYs would be skipped for device trees without any port child nodes (me) - Revert ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives, as it stops systems from entering lower package states. LPM works on newer firmware versions. We will need a more refined quirk that only targets the older firmware versions (me) * tag 'ata-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux: Revert "ata: libata-core: Add ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives" ata: ahci: Make ahci_ignore_port() handle empty mask_port_map
2025-03-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout. - Bring back the VMID allocation to the vcpu_load phase, ensuring that we only setup VTTBR_EL2 once on VHE. This cures an ugly race that would lead to running with an unallocated VMID. RISC-V: - Fix hart status check in SBI HSM extension - Fix hart suspend_type usage in SBI HSM extension - Fix error returned by SBI IPI and TIME extensions for unsupported function IDs - Fix suspend_type usage in SBI SUSP extension - Remove unnecessary vcpu kick after injecting interrupt via IMSIC guest file x86: - Fix an nVMX bug where KVM fails to detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI). - To avoid freeing the PIC while vCPUs are still around, which would cause a NULL pointer access with the previous patch, destroy vCPUs before any VM-level destruction. - Handle failures to create vhost_tasks" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: retry nx_huge_page_recovery_thread creation vhost: return task creation error instead of NULL KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pending KVM: x86: Free vCPUs before freeing VM state riscv: KVM: Remove unnecessary vcpu kick KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2 KVM: arm64: Fix tcr_el2 initialisation in hVHE mode riscv: KVM: Fix SBI sleep_type use riscv: KVM: Fix SBI TIME error generation riscv: KVM: Fix SBI IPI error generation riscv: KVM: Fix hart suspend_type use riscv: KVM: Fix hart suspend status check
2025-03-01Revert "ata: libata-core: Add ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives"Niklas Cassel
This reverts commit cc77e2ce187d26cc66af3577bf896d7410eb25ab. It was reported that adding ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives breaks entering lower package states for certain systems. It turns out that Samsung SSD 870 QVO actually has working LPM when using a recent SSD firmware version. The author of commit cc77e2ce187d ("ata: libata-core: Add ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives") reported himself that only older SSD firmware versions have broken LPM: https://lore.kernel.org/stable/93c10d38-718c-459d-84a5-4d87680b4da7@debian.org/ Unfortunately, he did not specify which older firmware version he was using which had broken LPM. Let's revert this quirk, which has FW version field specified as NULL (which means that it applies for all Samsung SSD 870 QVO firmware versions) for now. Once the author reports which older firmware version(s) that are broken, we can create a more fine grained quirk, which populates the FW version field accordingly. Fixes: cc77e2ce187d ("ata: libata-core: Add ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives") Reported-by: Dieter Mummenschanz <dmummenschanz@web.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219747 Link: https://lore.kernel.org/r/20250228122603.91814-2-cassel@kernel.org Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-03-01kvm: retry nx_huge_page_recovery_thread creationKeith Busch
A VMM may send a non-fatal signal to its threads, including vCPU tasks, at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU task receives the signal while its trying to spawn the huge page recovery vhost task, then KVM_RUN will fail due to copy_process() returning -ERESTARTNOINTR. Rework call_once() to mark the call complete if and only if the called function succeeds, and plumb the function's true error code back to the call_once() invoker. This provides userspace with the correct, non-fatal error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge page task. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [implemented the kvm user side] Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-3-kbusch@meta.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01vhost: return task creation error instead of NULLKeith Busch
Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-28Merge tag 'thermal-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control fixes from Rafael Wysocki: "These fix the processing of DT thermal properties and the Power Allocator thermal governor: - Fix parsing cooling-maps in DT for trip points with more than one cooling device (Rafael Wysocki) - Fix granted_power computation in the Power Allocator thermal governor and make it update total_weight on configuration changes after the thermal zone has been registered (Yu-Che Cheng)" * tag 'thermal-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: gov_power_allocator: Update total_weight on bind and cdev updates thermal/of: Fix cdev lookup in thermal_of_should_bind() thermal: gov_power_allocator: Fix incorrect calculation in divvy_up_power()
2025-02-28Merge tag 'pm-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Fix the handling of processors that stop the TSC in deeper C-states in the intel_idle driver (Thomas Gleixner)" * tag 'pm-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: intel_idle: Handle older CPUs, which stop the TSC in deeper C states, correctly
2025-02-28Merge tag 'x86-urgent-2025-02-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix conflicts between devicetree and ACPI SMP discovery & setup - Fix a warm-boot lockup on AMD SC1100 SoC systems - Fix a W=1 build warning related to x86 IRQ trace event setup - Fix a kernel-doc warning * tag 'x86-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Fix kernel-doc warning x86/irq: Define trace events conditionally x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systems x86/of: Don't use DTB for SMP setup if ACPI is enabled
2025-02-28Merge tag 'sched-urgent-2025-02-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Prevent cond_resched() based preemption when interrupts are disabled, on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels" * tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Prevent rescheduling when interrupts are disabled
2025-02-28Merge tag 'perf-urgent-2025-02-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fixes from Ingo Molnar: "Miscellaneous perf events fixes and a minor HW enablement change: - Fix missing RCU protection in perf_iterate_ctx() - Fix pmu_ctx_list ordering bug - Reject the zero page in uprobes - Fix a family of bugs related to low frequency sampling - Add Intel Arrow Lake U CPUs to the generic Arrow Lake RAPL support table - Fix a lockdep-assert false positive in uretprobes" * tag 'perf-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: uprobes: Remove too strict lockdep_assert() condition in hprobe_expire() perf/x86/rapl: Add support for Intel Arrow Lake U perf/x86/intel: Use better start period for frequency mode perf/core: Fix low freq setting via IOC_PERIOD perf/x86: Fix low freqency setting issue uprobes: Reject the shared zeropage in uprobe_write_opcode() perf/core: Order the PMU list to fix warning about unordered pmu_ctx_list perf/core: Add RCU read lock protection to perf_iterate_ctx()
2025-02-28Merge tag 'objtool-urgent-2025-02-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Ingo Molnar: "Fix an objtool false positive, and objtool related build warnings that happens on PIE-enabled architectures such as LoongArch" * tag 'objtool-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Add bch2_trans_unlocked_or_in_restart_error() to bcachefs noreturns objtool: Fix C jump table annotations for Clang vmlinux.lds: Ensure that const vars with relocations are mapped R/O
2025-02-28Merge tag 'locking-urgent-2025-02-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix an rcuref_put() slowpath race" * tag 'locking-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcuref: Plug slowpath race in rcuref_put()
2025-02-28Merge tag 'trace-v6.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix crash from bad histogram entry An error path in the histogram creation could leave an entry in a link list that gets freed. Then when a new entry is added it can cause a u-a-f bug. This is fixed by restructuring the code so that the histogram is consistent on failure and everything is cleaned up appropriately. - Fix fprobe self test The fprobe self test relies on no function being attached by ftrace. BPF programs can attach to functions via ftrace and systemd now does so. This causes those functions to appear in the enabled_functions list which holds all functions attached by ftrace. The selftest also uses that file to see if functions are being connected correctly. It counts the functions in the file, but if there's already functions in the file, it fails. Instead, add the number of functions in the file at the start of the test to all the calculations during the test. - Fix potential division by zero of the function profiler stddev The calculated divisor that calculates the standard deviation of the function times can overflow. If the overflow happens to land on zero, that can cause a division by zero. Check for zero from the calculation before doing the division. TODO: Catch when it ever overflows and report it accordingly. For now, just prevent the system from crashing. * tag 'trace-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Avoid potential division by zero in function_stat_show() selftests/ftrace: Let fprobe test consider already enabled functions tracing: Fix bad hist from corrupting named_triggers list
2025-02-28Merge tag 'iommu-fixes-v6.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fixes from Joerg Roedel: - Intel VT-d fixes: - Fix suspicious RCU usage splat - Fix passthrough for devices under PCIe-PCI bridge - AMD-Vi fix: - Fix to preserve bits when updating device table entries * tag 'iommu-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: iommu/vt-d: Fix suspicious RCU usage iommu/vt-d: Remove device comparison in context_setup_pass_through_cb iommu/amd: Preserve default DTE fields when updating Host Page Table Root
2025-02-28intel_idle: Handle older CPUs, which stop the TSC in deeper C states, correctlyThomas Gleixner
The Intel idle driver is preferred over the ACPI processor idle driver, but fails to implement the work around for Core2 generation CPUs, where the TSC stops in C2 and deeper C-states. This causes stalls and boot delays, when the clocksource watchdog does not catch the unstable TSC before the CPU goes deep idle for the first time. The ACPI driver marks the TSC unstable when it detects that the CPU supports C2 or deeper and the CPU does not have a non-stop TSC. Add the equivivalent work around to the Intel idle driver to cure that. Fixes: 18734958e9bf ("intel_idle: Use ACPI _CST for processor models without C-state tables") Reported-by: Fab Stz <fabstz-it@yahoo.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Fab Stz <fabstz-it@yahoo.fr> Cc: All applicable <stable@vger.kernel.org> Closes: https://lore.kernel.org/all/10cf96aa-1276-4bd4-8966-c890377030c3@yahoo.fr Link: https://patch.msgid.link/87bjupfy7f.ffs@tglx Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-28Merge tag 'block-6.14-20250228' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - Fix plugging for native zone writes - Fix segment limit settings for != 4K page size archs - Fix for slab names overflowing * tag 'block-6.14-20250228' of git://git.kernel.dk/linux: block: fix 'kmem_cache of name 'bio-108' already exists' block: Remove zone write plugs when handling native zone append writes block: make segment size limit workable for > 4K PAGE_SIZE
2025-02-28Merge tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Just a single fix headed for stable, ensuring that msg_control is properly saved in compat mode as well" * tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux: io_uring/net: save msg_control for compat
2025-02-28Merge tag 'efi-fixes-for-v6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "Another couple of EFI fixes for v6.14. Only James's patch stands out, as it implements a workaround for odd behavior in fwupd in user space, which creates EFI variables by touching a file in efivarfs, clearing the immutable bit (which gets set automatically for $reasons) and then opening it again for writing, none of which is really necessary. The fwupd author and LVFS maintainer is already rolling out a fix for this on the fwupd side, and suggested that the workaround in this PR could be backed out again during the next cycle. (There is a semantic mismatch in efivarfs where some essential variable attributes are stored in the first 4 bytes of the file, and so zero length files cannot exist, as they cannot be written back to the underlying variable store. So now, they are dropped once the last reference is released.) Summary: - Fix CPER error record parsing bugs - Fix a couple of efivarfs issues that were introduced in the merge window - Fix an issue in the early remapping code of the MOKvar table" * tag 'efi-fixes-for-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi/mokvar-table: Avoid repeated map/unmap of the same page efi: Don't map the entire mokvar table to determine its size efivarfs: allow creation of zero length files efivarfs: Defer PM notifier registration until .fill_super efi/cper: Fix cper_arm_ctx_info alignment efi/cper: Fix cper_ia_proc_ctx alignment
2025-02-28block: fix 'kmem_cache of name 'bio-108' already exists'Ming Lei
Device mapper bioset often has big bio_slab size, which can be more than 1000, then 8byte can't hold the slab name any more, cause the kmem_cache allocation warning of 'kmem_cache of name 'bio-108' already exists'. Fix the warning by extending bio_slab->name to 12 bytes, but fix output of /proc/slabinfo Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28iommu/vt-d: Fix suspicious RCU usageLu Baolu
Commit <d74169ceb0d2> ("iommu/vt-d: Allocate DMAR fault interrupts locally") moved the call to enable_drhd_fault_handling() to a code path that does not hold any lock while traversing the drhd list. Fix it by ensuring the dmar_global_lock lock is held when traversing the drhd list. Without this fix, the following warning is triggered: ============================= WARNING: suspicious RCU usage 6.14.0-rc3 #55 Not tainted ----------------------------- drivers/iommu/intel/dmar.c:2046 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 2 locks held by cpuhp/1/23: #0: ffffffff84a67c50 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0 #1: ffffffff84a6a380 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0 stack backtrace: CPU: 1 UID: 0 PID: 23 Comm: cpuhp/1 Not tainted 6.14.0-rc3 #55 Call Trace: <TASK> dump_stack_lvl+0xb7/0xd0 lockdep_rcu_suspicious+0x159/0x1f0 ? __pfx_enable_drhd_fault_handling+0x10/0x10 enable_drhd_fault_handling+0x151/0x180 cpuhp_invoke_callback+0x1df/0x990 cpuhp_thread_fun+0x1ea/0x2c0 smpboot_thread_fn+0x1f5/0x2e0 ? __pfx_smpboot_thread_fn+0x10/0x10 kthread+0x12a/0x2d0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x4a/0x60 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Holding the lock in enable_drhd_fault_handling() triggers a lockdep splat about a possible deadlock between dmar_global_lock and cpu_hotplug_lock. This is avoided by not holding dmar_global_lock when calling iommu_device_register(), which initiates the device probe process. Fixes: d74169ceb0d2 ("iommu/vt-d: Allocate DMAR fault interrupts locally") Reported-and-tested-by: Ido Schimmel <idosch@nvidia.com> Closes: https://lore.kernel.org/linux-iommu/Zx9OwdLIc_VoQ0-a@shredder.mtl.com/ Tested-by: Breno Leitao <leitao@debian.org> Cc: stable@vger.kernel.org Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250218022422.2315082-1-baolu.lu@linux.intel.com Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-02-28iommu/vt-d: Remove device comparison in context_setup_pass_through_cbJerry Snitselaar
Remove the device comparison check in context_setup_pass_through_cb. pci_for_each_dma_alias already makes a decision on whether the callback function should be called for a device. With the check in place it will fail to create context entries for aliases as it walks up to the root bus. Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain") Closes: https://lore.kernel.org/linux-iommu/82499eb6-00b7-4f83-879a-e97b4144f576@linux.intel.com/ Cc: stable@vger.kernel.org Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com> Link: https://lore.kernel.org/r/20250224180316.140123-1-jsnitsel@redhat.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>