summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
18 hoursMerge tag 'trace-rv-v6.17-rc5' of ↵HEADmasterLinus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier fixes from Steven Rostedt: - Fix build in some RISC-V flavours Some system calls only are available for the 64bit RISC-V machines. #ifdef out the cases of clock_nanosleep and futex in the sleep monitor if they are not supported by the architecture. - Fix wrong cast, obsolete after refactoring Use container_of() to get to the rv_monitor structure from the enable_monitors_next() 'p' pointer. The assignment worked only because the list field used happened to be the first field of the structure. - Remove redundant include files Some include files were listed twice. Remove the extra ones and sort the includes. - Fix missing unlock on failure There was an error path that exited the rv_register_monitor() function without releasing a lock. Change that to goto the lock release. - Add Gabriele Monaco to be Runtime Verifier maintainer Gabriele is doing most of the work on RV as well as collecting patches. Add him to the maintainers file for Runtime Verification. * tag 'trace-rv-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Add Gabriele Monaco as maintainer for Runtime Verification rv: Fix missing mutex unlock in rv_register_monitor() include/linux/rv.h: remove redundant include file rv: Fix wrong type cast in enabled_monitors_next() rv: Support systems with time64-only syscalls
21 hoursrv: Add Gabriele Monaco as maintainer for Runtime VerificationSteven Rostedt
Gabriele will start taking over managing the changes to the Runtime Verification. Make him officially one of the maintainers. Cc: Gabriele Monaco <gmonaco@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250911115744.66ccade3@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
23 hoursMerge tag 'net-6.17-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless. No known regressions at this point. Current release - fix to a fix: - eth: Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" - wifi: iwlwifi: pcie: fix byte count table for 7000/8000 devices - net: clear sk->sk_ino in sk_set_socket(sk, NULL), fix CRIU Previous releases - regressions: - bonding: set random address only when slaves already exist - rxrpc: fix untrusted unsigned subtract - eth: - ice: fix Rx page leak on multi-buffer frames - mlx5: don't return mlx5_link_info table when speed is unknown Previous releases - always broken: - tls: make sure to abort the stream if headers are bogus - tcp: fix null-deref when using TCP-AO with TCP_REPAIR - dpll: fix skipping last entry in clock quality level reporting - eth: qed: don't collect too many protection override GRC elements, fix memory corruption" * tag 'net-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits) octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp() cnic: Fix use-after-free bugs in cnic_delete_task devlink rate: Remove unnecessary 'static' from a couple places MAINTAINERS: update sundance entry net: liquidio: fix overflow in octeon_init_instr_queue() net: clear sk->sk_ino in sk_set_socket(sk, NULL) Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" selftests: tls: test skb copy under mem pressure and OOB tls: make sure to abort the stream if headers are bogus selftest: packetdrill: Add tcp_fastopen_server_reset-after-disconnect.pkt. tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect(). octeon_ep: fix VF MAC address lifecycle handling selftests: bonding: add vlan over bond testing bonding: don't set oif to bond dev when getting NS target destination net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer net/mlx5e: Add a miss level for ipsec crypto offload net/mlx5e: Harden uplink netdev access against device unbind MAINTAINERS: make the DPLL entry cover drivers doc/netlink: Fix typos in operation attributes igc: don't fail igc_probe() on LED setup error ...
24 hoursMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "These are mostly Oliver's Arm changes: lock ordering fixes for the vGIC, and reverts for a buggy attempt to avoid RCU stalls on large VMs. Arm: - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1 - Reverted a couple of fixes for RCU stalls when destroying a stage-2 page table. There appears to be some nasty refcounting / UAF issues lurking in those patches and the band-aid we tried to apply didn't hold. s390: - mm fixes, including userfaultfd bug fix x86: - Sync the vTPR from the local APIC to the VMCB even when AVIC is active. This fixes a bug where host updates to the vTPR, e.g. via KVM_SET_LAPIC or emulation of a guest access, are lost and result in interrupt delivery issues in the guest" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active Revert "KVM: arm64: Split kvm_pgtable_stage2_destroy()" Revert "KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables" KVM: arm64: vgic: fix incorrect spinlock API usage KVM: arm64: Remove stage 2 read fault check KVM: arm64: Fix parameter ordering for VBAR_EL1 assignment KVM: arm64: nv: Fix incorrect VNCR invalidation range calculation KVM: arm64: vgic-v3: Indicate vgic_put_irq() may take LPI xarray lock KVM: arm64: vgic-v3: Don't require IRQs be disabled for LPI xarray lock KVM: arm64: vgic-v3: Erase LPIs from xarray outside of raw spinlocks KVM: arm64: Spin off release helper from vgic_put_irq() KVM: arm64: vgic-v3: Use bare refcount for VGIC LPIs KVM: arm64: vgic: Drop stale comment on IRQ active state KVM: arm64: VHE: Save and restore host MDCR_EL2 value correctly KVM: arm64: Initialize PMSCR_EL1 when in VHE KVM: arm64: nv: fix VNCR TLB ASID match logic for non-Global entries KVM: s390: Fix FOLL_*/FAULT_FLAG_* confusion KVM: s390: Fix incorrect usage of mmu_notifier_register() KVM: s390: Fix access to unavailable adapter indicator pages during postcopy KVM: arm64: Mark freed S2 MMUs as invalid
24 hoursMerge tag 'platform-drivers-x86-v6.17-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Ilpo Järvinen: "Fixes and new HW support: - amd/pmc: Add MECHREVO Yilong15Pro to spurious_8042 list - amd/pmf: Support new ACPI ID AMDI0108 - asus-wmi: Re-add extra keys to ignore_key_wlan quirk - oxpec: Add support for AOKZOE A1X and OneXPlayer X1Pro EVA-02" * tag 'platform-drivers-x86-v6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: platform/x86: asus-wmi: Re-add extra keys to ignore_key_wlan quirk platform/x86/amd/pmf: Support new ACPI ID AMDI0108 platform/x86: oxpec: Add support for AOKZOE A1X platform/x86: oxpec: Add support for OneXPlayer X1Pro EVA-02 platform/x86/amd/pmc: Add MECHREVO Yilong15Pro to spurious_8042 list
24 hoursMerge tag 'uml-for-6.17-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML fixes from Johannes Berg: "A few fixes for UML, which I'd meant to send earlier but then forgot. All of them are pretty long-standing issues that are either not really happening (the UAF), in rarely used code (the FD buffer issue), or an issue only for some host configurations (the executable stack): - mark stack not executable to work on more modern systems with selinux - fix use-after-free in a virtio error path - fix stack buffer overflow in external unix socket FD receive function" * tag 'uml-for-6.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: Fix FD copy size in os_rcv_fd_msg() um: virtio_uml: Fix use-after-free after put_device in probe um: Don't mark stack executable
26 hoursocteontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp()Duoming Zhou
The original code relies on cancel_delayed_work() in otx2_ptp_destroy(), which does not ensure that the delayed work item synctstamp_work has fully completed if it was already running. This leads to use-after-free scenarios where otx2_ptp is deallocated by otx2_ptp_destroy(), while synctstamp_work remains active and attempts to dereference otx2_ptp in otx2_sync_tstamp(). Furthermore, the synctstamp_work is cyclic, the likelihood of triggering the bug is nonnegligible. A typical race condition is illustrated below: CPU 0 (cleanup) | CPU 1 (delayed work callback) otx2_remove() | otx2_ptp_destroy() | otx2_sync_tstamp() cancel_delayed_work() | kfree(ptp) | | ptp = container_of(...); //UAF | ptp-> //UAF This is confirmed by a KASAN report: BUG: KASAN: slab-use-after-free in __run_timer_base.part.0+0x7d7/0x8c0 Write of size 8 at addr ffff88800aa09a18 by task bash/136 ... Call Trace: <IRQ> dump_stack_lvl+0x55/0x70 print_report+0xcf/0x610 ? __run_timer_base.part.0+0x7d7/0x8c0 kasan_report+0xb8/0xf0 ? __run_timer_base.part.0+0x7d7/0x8c0 __run_timer_base.part.0+0x7d7/0x8c0 ? __pfx___run_timer_base.part.0+0x10/0x10 ? __pfx_read_tsc+0x10/0x10 ? ktime_get+0x60/0x140 ? lapic_next_event+0x11/0x20 ? clockevents_program_event+0x1d4/0x2a0 run_timer_softirq+0xd1/0x190 handle_softirqs+0x16a/0x550 irq_exit_rcu+0xaf/0xe0 sysvec_apic_timer_interrupt+0x70/0x80 </IRQ> ... Allocated by task 1: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x7f/0x90 otx2_ptp_init+0xb1/0x860 otx2_probe+0x4eb/0xc30 local_pci_probe+0xdc/0x190 pci_device_probe+0x2fe/0x470 really_probe+0x1ca/0x5c0 __driver_probe_device+0x248/0x310 driver_probe_device+0x44/0x120 __driver_attach+0xd2/0x310 bus_for_each_dev+0xed/0x170 bus_add_driver+0x208/0x500 driver_register+0x132/0x460 do_one_initcall+0x89/0x300 kernel_init_freeable+0x40d/0x720 kernel_init+0x1a/0x150 ret_from_fork+0x10c/0x1a0 ret_from_fork_asm+0x1a/0x30 Freed by task 136: kasan_save_stack+0x24/0x50 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3a/0x60 __kasan_slab_free+0x3f/0x50 kfree+0x137/0x370 otx2_ptp_destroy+0x38/0x80 otx2_remove+0x10d/0x4c0 pci_device_remove+0xa6/0x1d0 device_release_driver_internal+0xf8/0x210 pci_stop_bus_device+0x105/0x150 pci_stop_and_remove_bus_device_locked+0x15/0x30 remove_store+0xcc/0xe0 kernfs_fop_write_iter+0x2c3/0x440 vfs_write+0x871/0xd70 ksys_write+0xee/0x1c0 do_syscall_64+0xac/0x280 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure that the delayed work item is properly canceled before the otx2_ptp is deallocated. This bug was initially identified through static analysis. To reproduce and test it, I simulated the OcteonTX2 PCI device in QEMU and introduced artificial delays within the otx2_sync_tstamp() function to increase the likelihood of triggering the bug. Fixes: 2958d17a8984 ("octeontx2-pf: Add support for ptp 1-step mode on CN10K silicon") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
26 hourscnic: Fix use-after-free bugs in cnic_delete_taskDuoming Zhou
The original code uses cancel_delayed_work() in cnic_cm_stop_bnx2x_hw(), which does not guarantee that the delayed work item 'delete_task' has fully completed if it was already running. Additionally, the delayed work item is cyclic, the flush_workqueue() in cnic_cm_stop_bnx2x_hw() only blocks and waits for work items that were already queued to the workqueue prior to its invocation. Any work items submitted after flush_workqueue() is called are not included in the set of tasks that the flush operation awaits. This means that after the cyclic work items have finished executing, a delayed work item may still exist in the workqueue. This leads to use-after-free scenarios where the cnic_dev is deallocated by cnic_free_dev(), while delete_task remains active and attempt to dereference cnic_dev in cnic_delete_task(). A typical race condition is illustrated below: CPU 0 (cleanup) | CPU 1 (delayed work callback) cnic_netdev_event() | cnic_stop_hw() | cnic_delete_task() cnic_cm_stop_bnx2x_hw() | ... cancel_delayed_work() | /* the queue_delayed_work() flush_workqueue() | executes after flush_workqueue()*/ | queue_delayed_work() cnic_free_dev(dev)//free | cnic_delete_task() //new instance | dev = cp->dev; //use Replace cancel_delayed_work() with cancel_delayed_work_sync() to ensure that the cyclic delayed work item is properly canceled and that any ongoing execution of the work item completes before the cnic_dev is deallocated. Furthermore, since cancel_delayed_work_sync() uses __flush_work(work, true) to synchronously wait for any currently executing instance of the work item to finish, the flush_workqueue() becomes redundant and should be removed. This bug was identified through static analysis. To reproduce the issue and validate the fix, I simulated the cnic PCI device in QEMU and introduced intentional delays — such as inserting calls to ssleep() within the cnic_delete_task() function — to increase the likelihood of triggering the bug. Fixes: fdf24086f475 ("cnic: Defer iscsi connection cleanup") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
26 hoursdevlink rate: Remove unnecessary 'static' from a couple placesCosmin Ratiu
devlink_rate_node_get_by_name() and devlink_rate_nodes_destroy() have a couple of unnecessary static variables for iterating over devlink rates. This could lead to races/corruption/unhappiness if two concurrent operations execute the same function. Remove 'static' from both. It's amazing this was missed for 4+ years. While at it, I confirmed there are no more examples of this mistake in net/ with 1, 2 or 3 levels of indentation. Fixes: a8ecb93ef03d ("devlink: Introduce rate nodes") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
26 hoursMAINTAINERS: update sundance entryDenis Kirjanov
Signed-off-by: Denis Kirjanov <kirjanov@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
26 hoursnet: liquidio: fix overflow in octeon_init_instr_queue()Alexey Nepomnyashih
The expression `(conf->instr_type == 64) << iq_no` can overflow because `iq_no` may be as high as 64 (`CN23XX_MAX_RINGS_PER_PF`). Casting the operand to `u64` ensures correct 64-bit arithmetic. Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
26 hoursnet: clear sk->sk_ino in sk_set_socket(sk, NULL)Eric Dumazet
Andrei Vagin reported that blamed commit broke CRIU. Indeed, while we want to keep sk_uid unchanged when a socket is cloned, we want to clear sk->sk_ino. Otherwise, sock_diag might report multiple sockets sharing the same inode number. Move the clearing part from sock_orphan() to sk_set_socket(sk, NULL), called both from sock_orphan() and sk_clone_lock(). Fixes: 5d6b58c932ec ("net: lockless sock_i_ino()") Closes: https://lore.kernel.org/netdev/aMhX-VnXkYDpKd9V@google.com/ Closes: https://github.com/checkpoint-restore/criu/issues/2744 Reported-by: Andrei Vagin <avagin@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@google.com> Link: https://patch.msgid.link/20250917135337.1736101-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
26 hoursRevert "net/mlx5e: Update and set Xon/Xoff upon port speed set"Tariq Toukan
This reverts commit d24341740fe48add8a227a753e68b6eedf4b385a. It causes errors when trying to configure QoS, as well as loss of L2 connectivity (on multi-host devices). Reported-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/20250910170011.70528106@kernel.org Fixes: d24341740fe4 ("net/mlx5e: Update and set Xon/Xoff upon port speed set") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
30 hoursselftests: tls: test skb copy under mem pressure and OOBJakub Kicinski
Add a test which triggers mem pressure via OOB writes. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250917002814.1743558-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
30 hourstls: make sure to abort the stream if headers are bogusJakub Kicinski
Normally we wait for the socket to buffer up the whole record before we service it. If the socket has a tiny buffer, however, we read out the data sooner, to prevent connection stalls. Make sure that we abort the connection when we find out late that the record is actually invalid. Retrying the parsing is fine in itself but since we copy some more data each time before we parse we can overflow the allocated skb space. Constructing a scenario in which we're under pressure without enough data in the socket to parse the length upfront is quite hard. syzbot figured out a way to do this by serving us the header in small OOB sends, and then filling in the recvbuf with a large normal send. Make sure that tls_rx_msg_size() aborts strp, if we reach an invalid record there's really no way to recover. Reported-by: Lee Jones <lee@kernel.org> Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250917002814.1743558-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
36 hoursMerge tag 'mm-hotfixes-stable-2025-09-17-21-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "15 hotfixes. 11 are cc:stable and the remainder address post-6.16 issues or aren't considered necessary for -stable kernels. 13 of these fixes are for MM. The usual shower of singletons, plus - fixes from Hugh to address various misbehaviors in get_user_pages() - patches from SeongJae to address a quite severe issue in DAMON - another series also from SeongJae which completes some fixes for a DAMON startup issue" * tag 'mm-hotfixes-stable-2025-09-17-21-10' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: zram: fix slot write race condition nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/* samples/damon/mtier: avoid starting DAMON before initialization samples/damon/prcl: avoid starting DAMON before initialization samples/damon/wsse: avoid starting DAMON before initialization MAINTAINERS: add Lance Yang as a THP reviewer MAINTAINERS: add Jann Horn as rmap reviewer mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control mm/damon/core: introduce damon_call_control->dealloc_on_cancel mm: folio_may_be_lru_cached() unless folio_test_large() mm: revert "mm: vmscan.c: fix OOM on swap stress test" mm: revert "mm/gup: clear the LRU flag of a page before adding to LRU batch" mm/gup: local lru_add_drain() to avoid lru_add_drain_all() mm/gup: check ref_count instead of lru before migration
39 hoursMerge tag '6.17-rc6-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - Two fixes for remaining_data_length and offset checks in receive path - Don't go over max SGEs which caused smbdirect send to fail (and trigger disconnect) * tag '6.17-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: smbdirect: verify remaining_data_length respects max_fragmented_recv_size ksmbd: smbdirect: validate data_offset and data_length field of smb_direct_data_transfer smb: server: let smb_direct_writev() respect SMB_DIRECT_MAX_SEND_SGES
41 hoursMerge tag 'probes-fixes-v6.17-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe fix from Masami Hiramatsu: - kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal(), by handling NULL return of kmemdup() correctly * tag 'probes-fixes-v6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()
42 hoursMerge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-09-16 (ice, i40e, ixgbe, igc) For ice: Jake resolves leaking pages with multi-buffer frames when a 0-sized descriptor is encountered. For i40e: Maciej removes a redundant, and incorrect, memory barrier. For ixgbe: Jedrzej adjusts lifespan of ACI lock to ensure uses are while it is valid. For igc: Kohei Enju does not fail probe on LED setup failure which resolves a kernel panic in the cleanup path, if we were to fail. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: igc: don't fail igc_probe() on LED setup error ixgbe: destroy aci.lock later within ixgbe_remove path ixgbe: initialize aci.lock before it's used i40e: remove redundant memory barrier when cleaning Tx descs ice: fix Rx page leak on multi-buffer frames ==================== Link: https://patch.msgid.link/20250916212801.2818440-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
42 hoursMerge tag 'wireless-2025-09-17' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just two fixes: - fix crash in rfkill due to uninitialized type_name - fix aggregation in iwlwifi 7000/8000 devices * tag 'wireless-2025-09-17' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer wifi: iwlwifi: pcie: fix byte count table for some devices ==================== Link: https://patch.msgid.link/20250917105159.161583-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
42 hoursMerge branch 'tcp-clear-tcp_sk-sk-fastopen_rsk-in-tcp_disconnect'Jakub Kicinski
Kuniyuki Iwashima says: ==================== tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect(). syzbot reported a warning in tcp_retransmit_timer() for TCP Fast Open socket. Patch 1 fixes the issue and Patch 2 adds a test for the scenario. ==================== Link: https://patch.msgid.link/20250915175800.118793-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
42 hoursselftest: packetdrill: Add tcp_fastopen_server_reset-after-disconnect.pkt.Kuniyuki Iwashima
The test reproduces the scenario explained in the previous patch. Without the patch, the test triggers the warning and cannot see the last retransmitted packet. # ./ksft_runner.sh tcp_fastopen_server_reset-after-disconnect.pkt TAP version 13 1..2 [ 29.229250] ------------[ cut here ]------------ [ 29.231414] WARNING: CPU: 26 PID: 0 at net/ipv4/tcp_timer.c:542 tcp_retransmit_timer+0x32/0x9f0 ... tcp_fastopen_server_reset-after-disconnect.pkt:26: error handling packet: Timed out waiting for packet not ok 1 ipv4 tcp_fastopen_server_reset-after-disconnect.pkt:26: error handling packet: Timed out waiting for packet not ok 2 ipv6 # Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250915175800.118793-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
42 hourstcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect().Kuniyuki Iwashima
syzbot reported the splat below where a socket had tcp_sk(sk)->fastopen_rsk in the TCP_ESTABLISHED state. [0] syzbot reused the server-side TCP Fast Open socket as a new client before the TFO socket completes 3WHS: 1. accept() 2. connect(AF_UNSPEC) 3. connect() to another destination As of accept(), sk->sk_state is TCP_SYN_RECV, and tcp_disconnect() changes it to TCP_CLOSE and makes connect() possible, which restarts timers. Since tcp_disconnect() forgot to clear tcp_sk(sk)->fastopen_rsk, the retransmit timer triggered the warning and the intended packet was not retransmitted. Let's call reqsk_fastopen_remove() in tcp_disconnect(). [0]: WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_timer.c:542 tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Modules linked in: CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Not tainted 6.17.0-rc5-g201825fb4278 #62 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:tcp_retransmit_timer (net/ipv4/tcp_timer.c:542 (discriminator 7)) Code: 41 55 41 54 55 53 48 8b af b8 08 00 00 48 89 fb 48 85 ed 0f 84 55 01 00 00 0f b6 47 12 3c 03 74 0c 0f b6 47 12 3c 04 74 04 90 <0f> 0b 90 48 8b 85 c0 00 00 00 48 89 ef 48 8b 40 30 e8 6a 4f 06 3e RSP: 0018:ffffc900002f8d40 EFLAGS: 00010293 RAX: 0000000000000002 RBX: ffff888106911400 RCX: 0000000000000017 RDX: 0000000002517619 RSI: ffffffff83764080 RDI: ffff888106911400 RBP: ffff888106d5c000 R08: 0000000000000001 R09: ffffc900002f8de8 R10: 00000000000000c2 R11: ffffc900002f8ff8 R12: ffff888106911540 R13: ffff888106911480 R14: ffff888106911840 R15: ffffc900002f8de0 FS: 0000000000000000(0000) GS:ffff88907b768000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8044d69d90 CR3: 0000000002c30003 CR4: 0000000000370ef0 Call Trace: <IRQ> tcp_write_timer (net/ipv4/tcp_timer.c:738) call_timer_fn (kernel/time/timer.c:1747) __run_timers (kernel/time/timer.c:1799 kernel/time/timer.c:2372) timer_expire_remote (kernel/time/timer.c:2385 kernel/time/timer.c:2376 kernel/time/timer.c:2135) tmigr_handle_remote_up (kernel/time/timer_migration.c:944 kernel/time/timer_migration.c:1035) __walk_groups.isra.0 (kernel/time/timer_migration.c:533 (discriminator 1)) tmigr_handle_remote (kernel/time/timer_migration.c:1096) handle_softirqs (./arch/x86/include/asm/jump_label.h:36 ./include/trace/events/irq.h:142 kernel/softirq.c:580) irq_exit_rcu (kernel/softirq.c:614 kernel/softirq.c:453 kernel/softirq.c:680 kernel/softirq.c:696) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1050 (discriminator 35) arch/x86/kernel/apic/apic.c:1050 (discriminator 35)) </IRQ> Fixes: 8336886f786f ("tcp: TCP Fast Open Server - support TFO listeners") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250915175800.118793-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
42 hoursocteon_ep: fix VF MAC address lifecycle handlingSathesh B Edara
Currently, VF MAC address info is not updated when the MAC address is configured from VF, and it is not cleared when the VF is removed. This leads to stale or missing MAC information in the PF, which may cause incorrect state tracking or inconsistencies when VFs are hot-plugged or reassigned. Fix this by: - storing the VF MAC address in the PF when it is set from VF - clearing the stored VF MAC address when the VF is removed This ensures that the PF always has correct VF MAC state. Fixes: cde29af9e68e ("octeon_ep: add PF-VF mailbox communication") Signed-off-by: Sathesh B Edara <sedara@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250916133207.21737-1-sedara@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
42 hourstracing: kprobe-event: Fix null-ptr-deref in trace_kprobe_create_internal()Wang Liang
A crash was observed with the following output: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none) RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911 Call Trace: <TASK> trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785 vfs_write+0x2b6/0xd00 fs/read_write.c:684 ksys_write+0x129/0x240 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94 </TASK> Function kmemdup() may return NULL in trace_kprobe_create_internal(), add check for it's return value. Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/ Fixes: 33b4e38baa03 ("tracing: kprobe-event: Allocate string buffers from heap") Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
43 hoursselftests: bonding: add vlan over bond testingHangbin Liu
Add a vlan over bond testing to make sure arp/ns target works. Also change all the configs to mudules. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250916080127.430626-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
43 hoursbonding: don't set oif to bond dev when getting NS target destinationHangbin Liu
Unlike IPv4, IPv6 routing strictly requires the source address to be valid on the outgoing interface. If the NS target is set to a remote VLAN interface, and the source address is also configured on a VLAN over a bond interface, setting the oif to the bond device will fail to retrieve the correct destination route. Fix this by not setting the oif to the bond device when retrieving the NS target destination. This allows the correct destination device (the VLAN interface) to be determined, so that bond_verify_device_path can return the proper VLAN tags for sending NS messages. Reported-by: David Wilder <wilder@us.ibm.com> Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/ Suggested-by: Jay Vosburgh <jv@jvosburgh.net> Tested-by: David Wilder <wilder@us.ibm.com> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250916080127.430626-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
44 hoursMerge tag 'sched_ext-for-6.17-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED - Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()" which was causing issues with per-CPU task scheduling and reenqueuing behavior * tag 'sched_ext-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && EXT_GROUP_SCHED Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
44 hoursMerge tag 'cgroup-for-6.17-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "This contains two cgroup changes. Both are pretty low risk. - Fix deadlock in cgroup destruction when repeatedly mounting/unmounting perf_event and net_prio controllers. The issue occurs because cgroup_destroy_wq has max_active=1, causing root destruction to wait for CSS offline operations that are queued behind it. The fix splits cgroup_destroy_wq into three separate workqueues to eliminate the blocking. - Set of->priv to NULL upon file release to make potential bugs to manifest as NULL pointer dereferences rather than use-after-free errors" * tag 'cgroup-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/psi: Set of->priv to NULL upon file release cgroup: split cgroup_destroy_wq into 3 workqueues
47 hoursMerge tag 'kvm-s390-master-6.17-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - KVM mm fixes - Postcopy fix
47 hoursMerge tag 'kvm-x86-fixes-6.17-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 fix for 6.17-rcN Sync the vTPR from the local APIC to the VMCB even when AVIC is active, to fix a bug where host updates to the vTPR, e.g. via KVM_SET_LAPIC or emulation of a guest access, effectively get lost and result in interrupt delivery issues in the guest.
47 hoursMerge tag 'kvmarm-fixes-6.17-2' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #3 - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1
2 daysMerge tag 'for-6.17/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - fix integer overflow in dm-stripe - limit tag size in dm-integrity to 255 bytes - fix 'alignment inconsistency' warning in dm-raid * tag 'for-6.17/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-raid: don't set io_min and io_opt for raid1 dm-integrity: limit MAX_TAG_SIZE to 255 dm-stripe: fix a possible integer overflow
2 daysMerge tag 'for-6.17-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in zoned mode, turn assertion to proper code when reserving space in relocation block group - fix search key of extended ref (hardlink) when replaying log - fix initialization of file extent tree on filesystems without no-holes feature - add harmless data race annotation to block group comparator * tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: annotate block group access with data_race() when sorting for reclaim btrfs: initialize inode::file_extent_tree after i_mode has been set btrfs: zoned: fix incorrect ASSERT in btrfs_zoned_reserve_data_reloc_bg() btrfs: fix invalid extref key setup when replaying dentry
2 daysdm-raid: don't set io_min and io_opt for raid1Mikulas Patocka
These commands modprobe brd rd_size=1048576 vgcreate vg /dev/ram* lvcreate -m4 -L10 -n lv vg trigger the following warnings: device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency device-mapper: table: 252:10: adding target device (start sect 0 len 24576) caused an alignment inconsistency The warnings are caused by the fact that io_min is 512 and physical block size is 4096. If there's chunk-less raid, such as raid1, io_min shouldn't be set to zero because it would be raised to 512 and it would trigger the warning. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: stable@vger.kernel.org
2 daysnet: rfkill: gpio: Fix crash due to dereferencering uninitialized pointerHans de Goede
Since commit 7d5e9737efda ("net: rfkill: gpio: get the name and type from device property") rfkill_find_type() gets called with the possibly uninitialized "const char *type_name;" local variable. On x86 systems when rfkill-gpio binds to a "BCM4752" or "LNV4752" acpi_device, the rfkill->type is set based on the ACPI acpi_device_id: rfkill->type = (unsigned)id->driver_data; and there is no "type" property so device_property_read_string() will fail and leave type_name uninitialized, leading to a potential crash. rfkill_find_type() does accept a NULL pointer, fix the potential crash by initializing type_name to NULL. Note likely sofar this has not been caught because: 1. Not many x86 machines actually have a "BCM4752"/"LNV4752" acpi_device 2. The stack happened to contain NULL where type_name is stored Fixes: 7d5e9737efda ("net: rfkill: gpio: get the name and type from device property") Cc: stable@vger.kernel.org Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Hans de Goede <hansg@kernel.org> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://patch.msgid.link/20250913113515.21698-1-hansg@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2 daysMerge tag 'iwlwifi-fixes-2025-09-15' of ↵Johannes Berg
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next Miri Korenblit says: ==================== iwlwifi fix ==================== The fix is for byte count tables in 7000/8000 family devices. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2 dayssched_ext, sched/core: Fix build failure when !FAIR_GROUP_SCHED && ↵Tejun Heo
EXT_GROUP_SCHED While collecting SCX related fields in struct task_group into struct scx_task_group, 6e6558a6bc41 ("sched_ext, sched/core: Factor out struct scx_task_group") forgot update tg->scx_weight usage in tg_weight(), which leads to build failure when CONFIG_FAIR_GROUP_SCHED is disabled but CONFIG_EXT_GROUP_SCHED is enabled. Fix it. Fixes: 6e6558a6bc41 ("sched_ext, sched/core: Factor out struct scx_task_group") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509170230.MwZsJSWa-lkp@intel.com/ Tested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
3 daysMerge branch 'mlx5e-misc-fixes-2025-09-15'Jakub Kicinski
Tariq Toukan says: ==================== mlx5e misc fixes 2025-09-15 This patchset provides misc bug fixes from the team to the mlx5 Eth driver. ==================== Link: https://patch.msgid.link/1757939074-617281-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysnet/mlx5e: Add a miss level for ipsec crypto offloadLama Kayal
The cited commit adds a miss table for switchdev mode. But it uses the same level as policy table. Will hit the following error when running command: # ip xfrm state add src 192.168.1.22 dst 192.168.1.21 proto \ esp spi 1001 reqid 10001 aead 'rfc4106(gcm(aes))' \ 0x3a189a7f9374955d3817886c8587f1da3df387ff 128 \ mode tunnel offload dev enp8s0f0 dir in Error: mlx5_core: Device failed to offload this state. The dmesg error is: mlx5_core 0000:03:00.0: ipsec_miss_create:578:(pid 311797): fail to create IPsec miss_rule err=-22 Fix it by adding a new miss level to avoid the error. Fixes: 7d9e292ecd67 ("net/mlx5e: Move IPSec policy check after decryption") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Lama Kayal <lkayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1757939074-617281-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysnet/mlx5e: Harden uplink netdev access against device unbindJianbo Liu
The function mlx5_uplink_netdev_get() gets the uplink netdevice pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can be removed and its pointer cleared when unbound from the mlx5_core.eth driver. This results in a NULL pointer, causing a kernel panic. BUG: unable to handle page fault for address: 0000000000001300 at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core] Call Trace: <TASK> mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core] esw_offloads_enable+0x593/0x910 [mlx5_core] mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core] devlink_nl_eswitch_set_doit+0x60/0xd0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x183/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x255/0x380 netlink_sendmsg+0x1f3/0x420 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 do_syscall_64+0x53/0x1d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Ensure the pointer is valid before use by checking it for NULL. If it is valid, immediately call netdev_hold() to take a reference, and preventing the netdevice from being freed while it is in use. Fixes: 7a9fb35e8c3a ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1757939074-617281-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysMAINTAINERS: make the DPLL entry cover driversJakub Kicinski
DPLL maintainers should probably be CCed on driver patches, too. Remove the *, which makes the pattern only match files directly under drivers/dpll but not its sub-directories. Acked-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20250915234255.1306612-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysdoc/netlink: Fix typos in operation attributesRemy D. Farley
I'm trying to generate Rust bindings for netlink using the yaml spec. It looks like there's a typo in conntrack spec: attribute set conntrack-attrs defines attributes "counters-{orig,reply}" (plural), while get operation references "counter-{orig,reply}" (singular). The latter should be fixed, as it denotes multiple counters (packet and byte). The corresonding C define is CTA_COUNTERS_ORIG. Also, dump request references "nfgen-family" attribute, which neither exists in conntrack-attrs attrset nor ctattr_type enum. There's member of nfgenmsg struct with the same name, which is where family value is actually taken from. > static int ctnetlink_dump_exp_ct(struct net *net, struct sock *ctnl, > struct sk_buff *skb, > const struct nlmsghdr *nlh, > const struct nlattr * const cda[], > struct netlink_ext_ack *extack) > { > int err; > struct nfgenmsg *nfmsg = nlmsg_data(nlh); > u_int8_t u3 = nfmsg->nfgen_family; ^^^^^^^^^^^^ Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com> Fixes: 23fc9311a526 ("netlink: specs: add conntrack dump and stats dump support") Link: https://patch.msgid.link/20250913140515.1132886-1-one-d-wide@protonmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 daysMerge tag 'perf-tools-fixes-for-v6.17-2025-09-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools fixes from Namhyung Kim: "A small set of fixes for crashes in different commands and conditions" * tag 'perf-tools-fixes-for-v6.17-2025-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: perf maps: Ensure kmap is set up for all inserts perf lock: Provide a host_env for session new perf subcmd: avoid crash in exclude_cmds when excludes is empty
3 daysigc: don't fail igc_probe() on LED setup errorKohei Enju
When igc_led_setup() fails, igc_probe() fails and triggers kernel panic in free_netdev() since unregister_netdev() is not called. [1] This behavior can be tested using fault-injection framework, especially the failslab feature. [2] Since LED support is not mandatory, treat LED setup failures as non-fatal and continue probe with a warning message, consequently avoiding the kernel panic. [1] kernel BUG at net/core/dev.c:12047! Oops: invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 937 Comm: repro-igc-led-e Not tainted 6.17.0-rc4-enjuk-tnguy-00865-gc4940196ab02 #64 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:free_netdev+0x278/0x2b0 [...] Call Trace: <TASK> igc_probe+0x370/0x910 local_pci_probe+0x3a/0x80 pci_device_probe+0xd1/0x200 [...] [2] #!/bin/bash -ex FAILSLAB_PATH=/sys/kernel/debug/failslab/ DEVICE=0000:00:05.0 START_ADDR=$(grep " igc_led_setup" /proc/kallsyms \ | awk '{printf("0x%s", $1)}') END_ADDR=$(printf "0x%x" $((START_ADDR + 0x100))) echo $START_ADDR > $FAILSLAB_PATH/require-start echo $END_ADDR > $FAILSLAB_PATH/require-end echo 1 > $FAILSLAB_PATH/times echo 100 > $FAILSLAB_PATH/probability echo N > $FAILSLAB_PATH/ignore-gfp-wait echo $DEVICE > /sys/bus/pci/drivers/igc/bind Fixes: ea578703b03d ("igc: Add support for LEDs on i225/i226") Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 daysixgbe: destroy aci.lock later within ixgbe_remove pathJedrzej Jagielski
There's another issue with aci.lock and previous patch uncovers it. aci.lock is being destroyed during removing ixgbe while some of the ixgbe closing routines are still ongoing. These routines use Admin Command Interface which require taking aci.lock which has been already destroyed what leads to call trace. [ +0.000004] DEBUG_LOCKS_WARN_ON(lock->magic != lock) [ +0.000007] WARNING: CPU: 12 PID: 10277 at kernel/locking/mutex.c:155 mutex_lock+0x5f/0x70 [ +0.000002] Call Trace: [ +0.000003] <TASK> [ +0.000006] ixgbe_aci_send_cmd+0xc8/0x220 [ixgbe] [ +0.000049] ? try_to_wake_up+0x29d/0x5d0 [ +0.000009] ixgbe_disable_rx_e610+0xc4/0x110 [ixgbe] [ +0.000032] ixgbe_disable_rx+0x3d/0x200 [ixgbe] [ +0.000027] ixgbe_down+0x102/0x3b0 [ixgbe] [ +0.000031] ixgbe_close_suspend+0x28/0x90 [ixgbe] [ +0.000028] ixgbe_close+0xfb/0x100 [ixgbe] [ +0.000025] __dev_close_many+0xae/0x220 [ +0.000005] dev_close_many+0xc2/0x1a0 [ +0.000004] ? kernfs_should_drain_open_files+0x2a/0x40 [ +0.000005] unregister_netdevice_many_notify+0x204/0xb00 [ +0.000006] ? __kernfs_remove.part.0+0x109/0x210 [ +0.000006] ? kobj_kset_leave+0x4b/0x70 [ +0.000008] unregister_netdevice_queue+0xf6/0x130 [ +0.000006] unregister_netdev+0x1c/0x40 [ +0.000005] ixgbe_remove+0x216/0x290 [ixgbe] [ +0.000021] pci_device_remove+0x42/0xb0 [ +0.000007] device_release_driver_internal+0x19c/0x200 [ +0.000008] driver_detach+0x48/0x90 [ +0.000003] bus_remove_driver+0x6d/0xf0 [ +0.000006] pci_unregister_driver+0x2e/0xb0 [ +0.000005] ixgbe_exit_module+0x1c/0xc80 [ixgbe] Same as for the previous commit, the issue has been highlighted by the commit 337369f8ce9e ("locking/mutex: Add MUTEX_WARN_ON() into fast path"). Move destroying aci.lock to the end of ixgbe_remove(), as this simply fixes the issue. Fixes: 4600cdf9f5ac ("ixgbe: Enable link management in E610 device") Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 daysixgbe: initialize aci.lock before it's usedJedrzej Jagielski
Currently aci.lock is initialized too late. A bunch of ACI callbacks using the lock are called prior it's initialized. Commit 337369f8ce9e ("locking/mutex: Add MUTEX_WARN_ON() into fast path") highlights that issue what results in call trace. [ 4.092899] DEBUG_LOCKS_WARN_ON(lock->magic != lock) [ 4.092910] WARNING: CPU: 0 PID: 578 at kernel/locking/mutex.c:154 mutex_lock+0x6d/0x80 [ 4.098757] Call Trace: [ 4.098847] <TASK> [ 4.098922] ixgbe_aci_send_cmd+0x8c/0x1e0 [ixgbe] [ 4.099108] ? hrtimer_try_to_cancel+0x18/0x110 [ 4.099277] ixgbe_aci_get_fw_ver+0x52/0xa0 [ixgbe] [ 4.099460] ixgbe_check_fw_error+0x1fc/0x2f0 [ixgbe] [ 4.099650] ? usleep_range_state+0x69/0xd0 [ 4.099811] ? usleep_range_state+0x8c/0xd0 [ 4.099964] ixgbe_probe+0x3b0/0x12d0 [ixgbe] [ 4.100132] local_pci_probe+0x43/0xa0 [ 4.100267] work_for_cpu_fn+0x13/0x20 [ 4.101647] </TASK> Move aci.lock mutex initialization to ixgbe_sw_init() before any ACI command is sent. Along with that move also related SWFW semaphore in order to reduce size of ixgbe_probe() and that way all locks are initialized in ixgbe_sw_init(). Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Fixes: 4600cdf9f5ac ("ixgbe: Enable link management in E610 device") Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 daysi40e: remove redundant memory barrier when cleaning Tx descsMaciej Fijalkowski
i40e has a feature which writes to memory location last descriptor successfully sent. Memory barrier in i40e_clean_tx_irq() was used to avoid forward-reading descriptor fields in case DD bit was not set. Having mentioned feature in place implies that such situation will not happen as we know in advance how many descriptors HW has dealt with. Besides, this barrier placement was wrong. Idea is to have this protection *after* reading DD bit from HW descriptor, not before. Digging through git history showed me that indeed barrier was before DD bit check, anyways the commit introducing i40e_get_head() should have wiped it out altogether. Also, there was one commit doing s/read_barrier_depends/smp_rmb when get head feature was already in place, but it was only theoretical based on ixgbe experiences, which is different in these terms as that driver has to read DD bit from HW descriptor. Fixes: 1943d8ba9507 ("i40e/i40evf: enable hardware feature head write back") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 daysice: fix Rx page leak on multi-buffer framesJacob Keller
The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each buffer in the current frame. This function was introduced as part of handling multi-buffer XDP support in the ice driver. It works by iterating over the buffers from first_desc up to 1 plus the total number of fragments in the frame, cached from before the XDP program was executed. If the hardware posts a descriptor with a size of 0, the logic used in ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't call ice_put_rx_buf(). Because we don't call ice_put_rx_buf(), we don't attempt to re-use the page or free it. This leaves a stale page in the ring, as we don't increment next_to_alloc. The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented properly, and that it always points to a buffer with a NULL page. Since this function doesn't check, it will happily recycle a page over the top of the next_to_alloc buffer, losing track of the old page. Note that this leak only occurs for multi-buffer frames. The ice_put_rx_mbuf() function always handles at least one buffer, so a single-buffer frame will always get handled correctly. It is not clear precisely why the hardware hands us descriptors with a size of 0 sometimes, but it happens somewhat regularly with "jumbo frames" used by 9K MTU. To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on all buffers between first_desc and next_to_clean. Borrow the logic of a similar function in i40e used for this same purpose. Use the same logic also in ice_get_pgcnts(). Instead of iterating over just the number of fragments, use a loop which iterates until the current index reaches to the next_to_clean element just past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does call ice_put_rx_buf() on the last buffer of the frame indicating the end of packet. For non-linear (multi-buffer) frames, we need to take care when adjusting the pagecnt_bias. An XDP program might release fragments from the tail of the frame, in which case that fragment page is already released. Only update the pagecnt_bias for the first descriptor and fragments still remaining post-XDP program. Take care to only access the shared info for fragmented buffers, as this avoids a significant cache miss. The xdp_xmit value only needs to be updated if an XDP program is run, and only once per packet. Drop the xdp_xmit pointer argument from ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function directly. This avoids needing to pass the argument and avoids an extra bit-wise OR for each buffer in the frame. Move the increment of the ntc local variable to ensure its updated *before* all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic requires the index of the element just after the current frame. Now that we use an index pointer in the ring to identify the packet, we no longer need to track or cache the number of fragments in the rx_ring. Cc: Christoph Petrausch <christoph.petrausch@deepl.com> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@mail.gmail.com/ Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame") Tested-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Priya Singh <priyax.singh@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 daysRevert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"Andrea Righi
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a CPU is taken by a higher scheduling class to give tasks queued to the CPU's local DSQ a chance to be migrated somewhere else, instead of waiting indefinitely for that CPU to become available again. In doing so, we decided to skip migration-disabled tasks, under the assumption that they cannot be migrated anyway. However, when a higher scheduling class preempts a CPU, the running task is always inserted at the head of the local DSQ as a migration-disabled task. This means it is always skipped by scx_bpf_reenqueue_local(), and ends up being confined to the same CPU even if that CPU is heavily contended by other higher scheduling class tasks. As an example, let's consider the following scenario: $ schedtool -a 0,1, -e yes > /dev/null $ sudo schedtool -F -p 99 -a 0, -e \ stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000 The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT task initially runs on CPU0, it will remain there because it always sees CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1% utilization while CPU1 stays idle: 0[||||||||||||||||||||||100.0%] 8[ 0.0%] 1[ 0.0%] 9[ 0.0%] 2[ 0.0%] 10[ 0.0%] 3[ 0.0%] 11[ 0.0%] 4[ 0.0%] 12[ 0.0%] 5[ 0.0%] 13[ 0.0%] 6[ 0.0%] 14[ 0.0%] 7[ 0.0%] 15[ 0.0%] PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command 1067 root RT 0 R 0 99.0 0.2 0:31.16 stress-ng-cpu [run] 975 arighi 20 0 R 0 1.0 0.0 0:26.32 yes By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in this case) via ops.enqueue(), leading to better CPU utilization: 0[||||||||||||||||||||||100.0%] 8[ 0.0%] 1[||||||||||||||||||||||100.0%] 9[ 0.0%] 2[ 0.0%] 10[ 0.0%] 3[ 0.0%] 11[ 0.0%] 4[ 0.0%] 12[ 0.0%] 5[ 0.0%] 13[ 0.0%] 6[ 0.0%] 14[ 0.0%] 7[ 0.0%] 15[ 0.0%] PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command 577 root RT 0 R 0 100.0 0.2 0:23.17 stress-ng-cpu [run] 555 arighi 20 0 R 1 100.0 0.0 0:28.67 yes It's debatable whether per-CPU tasks should be re-enqueued as well, but doing so is probably safer: the scheduler can recognize re-enqueued tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and either put them back at the head of the local DSQ or let another task attempt to take the CPU. This also prevents giving per-CPU tasks an implicit priority boost, which would otherwise make them more likely to reclaim CPUs preempted by higher scheduling classes. Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>