Age | Commit message (Collapse) | Author |
|
The poll man page says POLLRDNORM is equivalent to POLLIN. For poll(),
it seems that if user sets pollfd with POLLRDNORM in userspace, perf_poll
will not return until timeout even if perf_output_wakeup called,
whereas POLLIN returns.
Fixes: 76369139ceb9 ("perf: Split up buffer handling from core code")
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250314030036.2543180-1-chen.dylane@linux.dev
|
|
Pinned performance events can enter an error state when they fail to be
scheduled in the context due to a failed constraint or some other conflict
or condition.
In error state these events won't generate any samples anymore and are
silently ignored until they are recovered by PERF_EVENT_IOC_ENABLE,
or the condition can also change so that they can be scheduled in.
Tooling should be allowed to know about the state change, but
currently there's no mechanism to notify tooling when events enter
an error state.
One way to do this is to issue a POLLHUP event to poll(2) to handle this.
Reading events in an error state would return 0 (EOF) and it matches to
the behavior of POLLHUP according to the man page.
Tooling should remove the fd of the event from pollfd after getting
POLLHUP, otherwise it'll be returned repeatedly.
[ mingo: Clarified the changelog ]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250317061745.1777584-1-namhyung@kernel.org
|
|
Follow the advice in Documentation/filesystems/sysfs.rst:
"- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
the value to be returned to user space."
No change in functionality intended.
[ mingo: Updated the changelog ]
Signed-off-by: XieLudan <xie.ludan@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250315141738452lXIH39UJAXlCmcATCzcBv@zte.com.cn
|
|
The 'size' parameter is optional and strscpy() automatically determines
the length of the destination buffer using sizeof() if the argument is
omitted. This makes the explicit sizeof() calls unnecessary.
Furthermore, KSYM_NAME_LEN is equal to sizeof(name) and can also be
removed. Remove them to shorten and simplify the code.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250310192336.442994-1-thorsten.blum@linux.dev
|
|
bts_ctx might not be allocated, for example if the CPU has X86_FEATURE_PTI,
but intel_bts_disable/enable_local() and intel_bts_interrupt() are called
unconditionally from intel_pmu_handle_irq() and crash on bts_ctx.
So check if bts_ctx is allocated when calling BTS functions.
Fixes: 3acfcefa795c ("perf/x86/intel/bts: Allocate bts_ctx only if necessary")
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250306051102.2642-1-lirongqing@baidu.com
|
|
Jann reported a possible issue when trampoline_check_ip returns
address near the bottom of the address space that is allowed to
call into the syscall if uretprobes are not set up:
https://lore.kernel.org/bpf/202502081235.5A6F352985@keescook/T/#m9d416df341b8fbc11737dacbcd29f0054413cbbf
Though the mmap minimum address restrictions will typically prevent
creating mappings there, let's make sure uretprobe syscall checks
for that.
Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250212220433.3624297-1-jolsa@kernel.org
|
|
When creating a new perf_event for the hardlockup watchdog, it should not
happen that the old perf_event is not released.
Introduce a WARN_ONCE() that should never trigger.
[ mingo: Changed the type of the warning to WARN_ONCE(). ]
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241021193004.308303-2-lihuafei1@huawei.com
|
|
During stress-testing, we found a kmemleak report for perf_event:
unreferenced object 0xff110001410a33e0 (size 1328):
comm "kworker/4:11", pid 288, jiffies 4294916004
hex dump (first 32 bytes):
b8 be c2 3b 02 00 11 ff 22 01 00 00 00 00 ad de ...;....".......
f0 33 0a 41 01 00 11 ff f0 33 0a 41 01 00 11 ff .3.A.....3.A....
backtrace (crc 24eb7b3a):
[<00000000e211b653>] kmem_cache_alloc_node_noprof+0x269/0x2e0
[<000000009d0985fa>] perf_event_alloc+0x5f/0xcf0
[<00000000084ad4a2>] perf_event_create_kernel_counter+0x38/0x1b0
[<00000000fde96401>] hardlockup_detector_event_create+0x50/0xe0
[<0000000051183158>] watchdog_hardlockup_enable+0x17/0x70
[<00000000ac89727f>] softlockup_start_fn+0x15/0x40
...
Our stress test includes CPU online and offline cycles, and updating the
watchdog configuration.
After reading the code, I found that there may be a race between cleaning up
perf_event after updating watchdog and disabling event when the CPU goes offline:
CPU0 CPU1 CPU2
(update watchdog) (hotplug offline CPU1)
... _cpu_down(CPU1)
cpus_read_lock() // waiting for cpu lock
softlockup_start_all
smp_call_on_cpu(CPU1)
softlockup_start_fn
...
watchdog_hardlockup_enable(CPU1)
perf create E1
watchdog_ev[CPU1] = E1
cpus_read_unlock()
cpus_write_lock()
cpuhp_kick_ap_work(CPU1)
cpuhp_thread_fun
...
watchdog_hardlockup_disable(CPU1)
watchdog_ev[CPU1] = NULL
dead_event[CPU1] = E1
__lockup_detector_cleanup
for each dead_events_mask
release each dead_event
/*
* CPU1 has not been added to
* dead_events_mask, then E1
* will not be released
*/
CPU1 -> dead_events_mask
cpumask_clear(&dead_events_mask)
// dead_events_mask is cleared, E1 is leaked
In this case, the leaked perf_event E1 matches the perf_event leak
reported by kmemleak. Due to the low probability of problem recurrence
(only reported once), I added some hack delays in the code:
static void __lockup_detector_reconfigure(void)
{
...
watchdog_hardlockup_start();
cpus_read_unlock();
+ mdelay(100);
/*
* Must be called outside the cpus locked section to prevent
* recursive locking in the perf code.
...
}
void watchdog_hardlockup_disable(unsigned int cpu)
{
...
perf_event_disable(event);
this_cpu_write(watchdog_ev, NULL);
this_cpu_write(dead_event, event);
+ mdelay(100);
cpumask_set_cpu(smp_processor_id(), &dead_events_mask);
atomic_dec(&watchdog_cpus);
...
}
void hardlockup_detector_perf_cleanup(void)
{
...
perf_event_release_kernel(event);
per_cpu(dead_event, cpu) = NULL;
}
+ mdelay(100);
cpumask_clear(&dead_events_mask);
}
Then, simultaneously performing CPU on/off and switching watchdog, it is
almost certain to reproduce this leak.
The problem here is that releasing perf_event is not within the CPU
hotplug read-write lock. Commit:
941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock")
introduced deferred release to solve the deadlock caused by calling
get_online_cpus() when releasing perf_event. Later, commit:
efe951d3de91 ("perf/x86: Fix perf,x86,cpuhp deadlock")
removed the get_online_cpus() call on the perf_event release path to solve
another deadlock problem.
Therefore, it is now possible to move the release of perf_event back
into the CPU hotplug read-write lock, and release the event immediately
after disabling it.
Fixes: 941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241021193004.308303-1-lihuafei1@huawei.com
|
|
Add the __counted_by() compiler attribute to the flexible array member
buf to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250305123134.215577-2-thorsten.blum@linux.dev
|
|
Make sure that perf_try_init_event() doesn't leave event->pmu nor
event->destroy set on failure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250205102449.110145835@infradead.org
|
|
When f_ops->mmap() returns failure, m_ops->close() is *not* called.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.248358497@infradead.org
|
|
In prepration for being able to unregister a PMU with existing events,
it becomes important to detach struct perf_cpu_pmu_context lifetimes
from that of struct pmu.
Notably struct perf_cpu_pmu_context embeds a struct perf_event_pmu_context
that can stay referenced until the last event goes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.760214287@infradead.org
|
|
This puts 'all' of perf_mmap() under single event->mmap_mutex.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.582252957@infradead.org
|
|
AFAICT there is no actual benefit from the mutex drop on re-try. The
'worst' case scenario is that we instantly re-gain the mutex without
perf_mmap_close() getting it. So might as well make that the normal
case.
Reflow the code to make the ring buffer detach case naturally flow
into the no ring buffer case.
[ mingo: Forward ported it ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.463607258@infradead.org
|
|
Perform CSE and such.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.354909594@infradead.org
|
|
Identity-transform:
if (c) {
X1;
} else {
Y;
goto l;
}
X2;
l:
into the simpler:
if (c) {
X1;
X2;
} else {
Y;
}
[ mingo: Forward ported it ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135519.095904637@infradead.org
|
|
Ensure perf_event_free_bpf_prog() is safe to call a second time;
notably without making any references to event->pmu when there is no
prog left.
Note: perf_event_detach_bpf_prog() might leave a stale event->prog
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241104135518.978956692@infradead.org
|
|
Replace _free_event()'s use of perf_addr_filters_splice()s use with an
explicit perf_free_addr_filters() with the explicit propery that it is
able to be called a second time without ill effect.
Most notable, referencing event->pmu must be avoided when there are no
filters left (from eg a previous call).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.868460518@infradead.org
|
|
As a preparation for adding yet another indirection.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.650051565@infradead.org
|
|
perf_cpu_pmu_context::pmu_disable_count
Because it makes no sense to have two per-cpu allocations per pmu.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.518730578@infradead.org
|
|
Using the previous simplifications, transition perf_event_alloc() to
the cleanup way of things -- reducing error path magic.
[ mingo: Ported it to recent kernels. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.410755241@infradead.org
|
|
Use the <linux/cleanup.h> guard() and scoped_guard() infrastructure
to simplify the control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241104135518.302444446@infradead.org
|
|
Using the previously introduced perf_pmu_free() and a new IDR helper,
simplify the perf_pmu_register error paths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.198937277@infradead.org
|
|
The error path of perf_pmu_register() is of course very similar to a
subset of perf_pmu_unregister(). Extract this common part in
perf_pmu_free() and simplify things.
[ mingo: Forward ported it ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135518.090915501@infradead.org
|
|
The error cleanup sequence in perf_event_alloc() is a subset of the
existing _free_event() function (it must of course be).
Split this out into __free_event() and simplify the error path.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20241104135517.967889521@infradead.org
|
|
Currently, __reserve_bp_slot() returns -ENOSPC for unsupported
breakpoint types on the architecture. For example, powerpc
does not support hardware instruction breakpoints. This causes
the perf_skip BPF selftest to fail, as neither ENOENT nor
EOPNOTSUPP is returned by perf_event_open for unsupported
breakpoint types. As a result, the test that should be skipped
for this arch is not correctly identified.
To resolve this, hw_breakpoint_event_init() should exit early by
checking for unsupported breakpoint types using
hw_breakpoint_slots_cached() and return the appropriate error
(-EOPNOTSUPP).
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: https://lore.kernel.org/r/20250303092451.1862862-1-skb99@linux.ibm.com
|
|
fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There is a fairly obvious race between perf_init_event() doing
idr_find() and perf_pmu_register() doing idr_alloc() with an
incompletely initialized PMU pointer.
Avoid by doing idr_alloc() on a NULL pointer to register the id, and
swizzling the real struct pmu pointer at the end using idr_replace().
Also making sure to not set struct pmu members after publishing
the struct pmu, duh.
[ introduce idr_cmpxchg() in order to better handle the idr_replace()
error case -- if it were to return an unexpected pointer, it will
already have replaced the value and there is no going back. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241104135517.858805880@infradead.org
|
|
Commit a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order")
placed pmus_lock inside pmus_srcu, this makes perf_pmu_unregister()
trip lockdep.
Move the locking about such that only pmu_idr and pmus (list) are
modified while holding pmus_lock. This avoids doing synchronize_srcu()
while holding pmus_lock and all is well again.
Fixes: a63fbed776c7 ("perf/tracing/cpuhotplug: Fix locking order")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241104135517.679556858@infradead.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Niklas Cassel:
- Fix a regression where the enablement of the PHYs would be skipped
for device trees without any port child nodes (me)
- Revert ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives, as it stops
systems from entering lower package states. LPM works on newer
firmware versions. We will need a more refined quirk that only
targets the older firmware versions (me)
* tag 'ata-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
Revert "ata: libata-core: Add ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives"
ata: ahci: Make ahci_ignore_port() handle empty mask_port_map
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not
mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout.
- Bring back the VMID allocation to the vcpu_load phase, ensuring
that we only setup VTTBR_EL2 once on VHE. This cures an ugly race
that would lead to running with an unallocated VMID.
RISC-V:
- Fix hart status check in SBI HSM extension
- Fix hart suspend_type usage in SBI HSM extension
- Fix error returned by SBI IPI and TIME extensions for unsupported
function IDs
- Fix suspend_type usage in SBI SUSP extension
- Remove unnecessary vcpu kick after injecting interrupt via IMSIC
guest file
x86:
- Fix an nVMX bug where KVM fails to detect that, after nested
VM-Exit, L1 has a pending IRQ (or NMI).
- To avoid freeing the PIC while vCPUs are still around, which would
cause a NULL pointer access with the previous patch, destroy vCPUs
before any VM-level destruction.
- Handle failures to create vhost_tasks"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: retry nx_huge_page_recovery_thread creation
vhost: return task creation error instead of NULL
KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pending
KVM: x86: Free vCPUs before freeing VM state
riscv: KVM: Remove unnecessary vcpu kick
KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
KVM: arm64: Fix tcr_el2 initialisation in hVHE mode
riscv: KVM: Fix SBI sleep_type use
riscv: KVM: Fix SBI TIME error generation
riscv: KVM: Fix SBI IPI error generation
riscv: KVM: Fix hart suspend_type use
riscv: KVM: Fix hart suspend status check
|
|
This reverts commit cc77e2ce187d26cc66af3577bf896d7410eb25ab.
It was reported that adding ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives
breaks entering lower package states for certain systems.
It turns out that Samsung SSD 870 QVO actually has working LPM when using
a recent SSD firmware version.
The author of commit cc77e2ce187d ("ata: libata-core: Add ATA_QUIRK_NOLPM
for Samsung SSD 870 QVO drives") reported himself that only older SSD
firmware versions have broken LPM:
https://lore.kernel.org/stable/93c10d38-718c-459d-84a5-4d87680b4da7@debian.org/
Unfortunately, he did not specify which older firmware version he was using
which had broken LPM.
Let's revert this quirk, which has FW version field specified as NULL
(which means that it applies for all Samsung SSD 870 QVO firmware versions)
for now. Once the author reports which older firmware version(s) that are
broken, we can create a more fine grained quirk, which populates the FW
version field accordingly.
Fixes: cc77e2ce187d ("ata: libata-core: Add ATA_QUIRK_NOLPM for Samsung SSD 870 QVO drives")
Reported-by: Dieter Mummenschanz <dmummenschanz@web.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219747
Link: https://lore.kernel.org/r/20250228122603.91814-2-cassel@kernel.org
Signed-off-by: Niklas Cassel <cassel@kernel.org>
|
|
A VMM may send a non-fatal signal to its threads, including vCPU tasks,
at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU
task receives the signal while its trying to spawn the huge page recovery
vhost task, then KVM_RUN will fail due to copy_process() returning
-ERESTARTNOINTR.
Rework call_once() to mark the call complete if and only if the called
function succeeds, and plumb the function's true error code back to the
call_once() invoker. This provides userspace with the correct, non-fatal
error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows
subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge
page task.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[implemented the kvm user side]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-3-kbusch@meta.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Lets callers distinguish why the vhost task creation failed. No one
currently cares why it failed, so no real runtime change from this
patch, but that will not be the case for long.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-2-kbusch@meta.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"These fix the processing of DT thermal properties and the Power
Allocator thermal governor:
- Fix parsing cooling-maps in DT for trip points with more than one
cooling device (Rafael Wysocki)
- Fix granted_power computation in the Power Allocator thermal
governor and make it update total_weight on configuration changes
after the thermal zone has been registered (Yu-Che Cheng)"
* tag 'thermal-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: gov_power_allocator: Update total_weight on bind and cdev updates
thermal/of: Fix cdev lookup in thermal_of_should_bind()
thermal: gov_power_allocator: Fix incorrect calculation in divvy_up_power()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix the handling of processors that stop the TSC in deeper C-states in
the intel_idle driver (Thomas Gleixner)"
* tag 'pm-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_idle: Handle older CPUs, which stop the TSC in deeper C states, correctly
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix conflicts between devicetree and ACPI SMP discovery & setup
- Fix a warm-boot lockup on AMD SC1100 SoC systems
- Fix a W=1 build warning related to x86 IRQ trace event setup
- Fix a kernel-doc warning
* tag 'x86-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Fix kernel-doc warning
x86/irq: Define trace events conditionally
x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systems
x86/of: Don't use DTB for SMP setup if ACPI is enabled
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Prevent cond_resched() based preemption when interrupts are disabled,
on PREEMPT_NONE and PREEMPT_VOLUNTARY kernels"
* tag 'sched-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Prevent rescheduling when interrupts are disabled
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
"Miscellaneous perf events fixes and a minor HW enablement change:
- Fix missing RCU protection in perf_iterate_ctx()
- Fix pmu_ctx_list ordering bug
- Reject the zero page in uprobes
- Fix a family of bugs related to low frequency sampling
- Add Intel Arrow Lake U CPUs to the generic Arrow Lake RAPL support
table
- Fix a lockdep-assert false positive in uretprobes"
* tag 'perf-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
uprobes: Remove too strict lockdep_assert() condition in hprobe_expire()
perf/x86/rapl: Add support for Intel Arrow Lake U
perf/x86/intel: Use better start period for frequency mode
perf/core: Fix low freq setting via IOC_PERIOD
perf/x86: Fix low freqency setting issue
uprobes: Reject the shared zeropage in uprobe_write_opcode()
perf/core: Order the PMU list to fix warning about unordered pmu_ctx_list
perf/core: Add RCU read lock protection to perf_iterate_ctx()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Fix an objtool false positive, and objtool related build warnings that
happens on PIE-enabled architectures such as LoongArch"
* tag 'objtool-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add bch2_trans_unlocked_or_in_restart_error() to bcachefs noreturns
objtool: Fix C jump table annotations for Clang
vmlinux.lds: Ensure that const vars with relocations are mapped R/O
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix an rcuref_put() slowpath race"
* tag 'locking-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcuref: Plug slowpath race in rcuref_put()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix crash from bad histogram entry
An error path in the histogram creation could leave an entry in a
link list that gets freed. Then when a new entry is added it can
cause a u-a-f bug. This is fixed by restructuring the code so that
the histogram is consistent on failure and everything is cleaned up
appropriately.
- Fix fprobe self test
The fprobe self test relies on no function being attached by ftrace.
BPF programs can attach to functions via ftrace and systemd now does
so. This causes those functions to appear in the enabled_functions
list which holds all functions attached by ftrace. The selftest also
uses that file to see if functions are being connected correctly. It
counts the functions in the file, but if there's already functions in
the file, it fails. Instead, add the number of functions in the file
at the start of the test to all the calculations during the test.
- Fix potential division by zero of the function profiler stddev
The calculated divisor that calculates the standard deviation of the
function times can overflow. If the overflow happens to land on zero,
that can cause a division by zero. Check for zero from the
calculation before doing the division.
TODO: Catch when it ever overflows and report it accordingly. For
now, just prevent the system from crashing.
* tag 'trace-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Avoid potential division by zero in function_stat_show()
selftests/ftrace: Let fprobe test consider already enabled functions
tracing: Fix bad hist from corrupting named_triggers list
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Intel VT-d fixes:
- Fix suspicious RCU usage splat
- Fix passthrough for devices under PCIe-PCI bridge
- AMD-Vi fix:
- Fix to preserve bits when updating device table entries
* tag 'iommu-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/vt-d: Fix suspicious RCU usage
iommu/vt-d: Remove device comparison in context_setup_pass_through_cb
iommu/amd: Preserve default DTE fields when updating Host Page Table Root
|
|
The Intel idle driver is preferred over the ACPI processor idle driver,
but fails to implement the work around for Core2 generation CPUs, where
the TSC stops in C2 and deeper C-states. This causes stalls and boot
delays, when the clocksource watchdog does not catch the unstable TSC
before the CPU goes deep idle for the first time.
The ACPI driver marks the TSC unstable when it detects that the CPU
supports C2 or deeper and the CPU does not have a non-stop TSC.
Add the equivivalent work around to the Intel idle driver to cure that.
Fixes: 18734958e9bf ("intel_idle: Use ACPI _CST for processor models without C-state tables")
Reported-by: Fab Stz <fabstz-it@yahoo.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Fab Stz <fabstz-it@yahoo.fr>
Cc: All applicable <stable@vger.kernel.org>
Closes: https://lore.kernel.org/all/10cf96aa-1276-4bd4-8966-c890377030c3@yahoo.fr
Link: https://patch.msgid.link/87bjupfy7f.ffs@tglx
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Pull block fixes from Jens Axboe:
- Fix plugging for native zone writes
- Fix segment limit settings for != 4K page size archs
- Fix for slab names overflowing
* tag 'block-6.14-20250228' of git://git.kernel.dk/linux:
block: fix 'kmem_cache of name 'bio-108' already exists'
block: Remove zone write plugs when handling native zone append writes
block: make segment size limit workable for > 4K PAGE_SIZE
|
|
Pull io_uring fix from Jens Axboe:
"Just a single fix headed for stable, ensuring that msg_control is
properly saved in compat mode as well"
* tag 'io_uring-6.14-20250228' of git://git.kernel.dk/linux:
io_uring/net: save msg_control for compat
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"Another couple of EFI fixes for v6.14.
Only James's patch stands out, as it implements a workaround for odd
behavior in fwupd in user space, which creates EFI variables by
touching a file in efivarfs, clearing the immutable bit (which gets
set automatically for $reasons) and then opening it again for writing,
none of which is really necessary.
The fwupd author and LVFS maintainer is already rolling out a fix for
this on the fwupd side, and suggested that the workaround in this PR
could be backed out again during the next cycle.
(There is a semantic mismatch in efivarfs where some essential
variable attributes are stored in the first 4 bytes of the file, and
so zero length files cannot exist, as they cannot be written back to
the underlying variable store. So now, they are dropped once the last
reference is released.)
Summary:
- Fix CPER error record parsing bugs
- Fix a couple of efivarfs issues that were introduced in the merge
window
- Fix an issue in the early remapping code of the MOKvar table"
* tag 'efi-fixes-for-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/mokvar-table: Avoid repeated map/unmap of the same page
efi: Don't map the entire mokvar table to determine its size
efivarfs: allow creation of zero length files
efivarfs: Defer PM notifier registration until .fill_super
efi/cper: Fix cper_arm_ctx_info alignment
efi/cper: Fix cper_ia_proc_ctx alignment
|
|
Device mapper bioset often has big bio_slab size, which can be more than
1000, then 8byte can't hold the slab name any more, cause the kmem_cache
allocation warning of 'kmem_cache of name 'bio-108' already exists'.
Fix the warning by extending bio_slab->name to 12 bytes, but fix output
of /proc/slabinfo
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit <d74169ceb0d2> ("iommu/vt-d: Allocate DMAR fault interrupts
locally") moved the call to enable_drhd_fault_handling() to a code
path that does not hold any lock while traversing the drhd list. Fix
it by ensuring the dmar_global_lock lock is held when traversing the
drhd list.
Without this fix, the following warning is triggered:
=============================
WARNING: suspicious RCU usage
6.14.0-rc3 #55 Not tainted
-----------------------------
drivers/iommu/intel/dmar.c:2046 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
2 locks held by cpuhp/1/23:
#0: ffffffff84a67c50 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0
#1: ffffffff84a6a380 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0x87/0x2c0
stack backtrace:
CPU: 1 UID: 0 PID: 23 Comm: cpuhp/1 Not tainted 6.14.0-rc3 #55
Call Trace:
<TASK>
dump_stack_lvl+0xb7/0xd0
lockdep_rcu_suspicious+0x159/0x1f0
? __pfx_enable_drhd_fault_handling+0x10/0x10
enable_drhd_fault_handling+0x151/0x180
cpuhp_invoke_callback+0x1df/0x990
cpuhp_thread_fun+0x1ea/0x2c0
smpboot_thread_fn+0x1f5/0x2e0
? __pfx_smpboot_thread_fn+0x10/0x10
kthread+0x12a/0x2d0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x4a/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Holding the lock in enable_drhd_fault_handling() triggers a lockdep splat
about a possible deadlock between dmar_global_lock and cpu_hotplug_lock.
This is avoided by not holding dmar_global_lock when calling
iommu_device_register(), which initiates the device probe process.
Fixes: d74169ceb0d2 ("iommu/vt-d: Allocate DMAR fault interrupts locally")
Reported-and-tested-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://lore.kernel.org/linux-iommu/Zx9OwdLIc_VoQ0-a@shredder.mtl.com/
Tested-by: Breno Leitao <leitao@debian.org>
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250218022422.2315082-1-baolu.lu@linux.intel.com
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Remove the device comparison check in context_setup_pass_through_cb.
pci_for_each_dma_alias already makes a decision on whether the
callback function should be called for a device. With the check
in place it will fail to create context entries for aliases as
it walks up to the root bus.
Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
Closes: https://lore.kernel.org/linux-iommu/82499eb6-00b7-4f83-879a-e97b4144f576@linux.intel.com/
Cc: stable@vger.kernel.org
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/20250224180316.140123-1-jsnitsel@redhat.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|