summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-08-21arm64: Add capability denoting FEAT_RASv1p1Marc Zyngier
Detecting FEAT_RASv1p1 is rather complicated, as there are two ways for the architecture to advertise the same thing (always a delight...). Add a capability that will advertise this in a synthetic way to the rest of the kernel. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/20250817202158.395078-2-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-08-21KVM: arm64: Reschedule as needed when destroying the stage-2 page-tablesRaghavendra Rao Ananta
When a large VM, specifically one that holds a significant number of PTEs, gets abruptly destroyed, the following warning is seen during the page-table walk: sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE Tainted: [O]=OOT_MODULE Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x3c/0xb8 dump_stack+0x18/0x30 resched_latency_warn+0x7c/0x88 sched_tick+0x1c4/0x268 update_process_times+0xa8/0xd8 tick_nohz_handler+0xc8/0x168 __hrtimer_run_queues+0x11c/0x338 hrtimer_interrupt+0x104/0x308 arch_timer_handler_phys+0x40/0x58 handle_percpu_devid_irq+0x8c/0x1b0 generic_handle_domain_irq+0x48/0x78 gic_handle_irq+0x1b8/0x408 call_on_irq_stack+0x24/0x30 do_interrupt_handler+0x54/0x78 el1_interrupt+0x44/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x84/0x88 stage2_free_walker+0x30/0xa0 (P) __kvm_pgtable_walk+0x11c/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 __kvm_pgtable_walk+0x180/0x258 kvm_pgtable_walk+0xc4/0x140 kvm_pgtable_stage2_destroy+0x5c/0xf0 kvm_free_stage2_pgd+0x6c/0xe8 kvm_uninit_stage2_mmu+0x24/0x48 kvm_arch_flush_shadow_all+0x80/0xa0 kvm_mmu_notifier_release+0x38/0x78 __mmu_notifier_release+0x15c/0x250 exit_mmap+0x68/0x400 __mmput+0x38/0x1c8 mmput+0x30/0x68 exit_mm+0xd4/0x198 do_exit+0x1a4/0xb00 do_group_exit+0x8c/0x120 get_signal+0x6d4/0x778 do_signal+0x90/0x718 do_notify_resume+0x70/0x170 el0_svc+0x74/0xd8 el0t_64_sync_handler+0x60/0xc8 el0t_64_sync+0x1b0/0x1b8 The warning is seen majorly on the host kernels that are configured not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this, instead of walking the entire page-table in one go, split it into smaller ranges, by checking for cond_resched() between each range. Since the path is executed during VM destruction, after the page-table structure is unlinked from the KVM MMU, relying on cond_resched_rwlock_write() isn't necessary. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250820162242.2624752-3-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-08-21KVM: arm64: Split kvm_pgtable_stage2_destroy()Raghavendra Rao Ananta
Split kvm_pgtable_stage2_destroy() into two: - kvm_pgtable_stage2_destroy_range(), that performs the page-table walk and free the entries over a range of addresses. - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD. This refactoring enables subsequent patches to free large page-tables in chunks, calling cond_resched() between each chunk, to yield the CPU as necessary. Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot take advantage of this (such as nVMHE), will continue to function as is. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Suggested-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250820162242.2624752-2-rananta@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-08-22btrfs: avoid load/store tearing races when checking if an inode was loggedFilipe Manana
At inode_logged() we do a couple lockless checks for ->logged_trans, and these are generally safe except the second one in case we get a load or store tearing due to a concurrent call updating ->logged_trans (either at btrfs_log_inode() or later at inode_logged()). In the first case it's safe to compare to the current transaction ID since once ->logged_trans is set the current transaction, we never set it to a lower value. In the second case, where we check if it's greater than zero, we are prone to load/store tearing races, since we can have a concurrent task updating to the current transaction ID with store tearing for example, instead of updating with a single 64 bits write, to update with two 32 bits writes or four 16 bits writes. In that case the reading side at inode_logged() could see a positive value that does not match the current transaction and then return a false negative. Fix this by doing the second check while holding the inode's spinlock, add some comments about it too. Also add the data_race() annotation to the first check to avoid any reports from KCSAN (or similar tools) and comment about it. Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: fix race between setting last_dir_index_offset and inode loggingFilipe Manana
At inode_logged() if we find that the inode was not logged before we update its ->last_dir_index_offset to (u64)-1 with the goal that the next directory log operation will see the (u64)-1 and then figure out it must check what was the index of the last logged dir index key and update ->last_dir_index_offset to that key's offset (this is done in update_last_dir_index_offset()). This however has a possibility for a time window where a race can happen and lead to directory logging skipping dir index keys that should be logged. The race happens like this: 1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks that the inode item was logged before, but before it sets the inode's ->last_dir_index_offset to (u64)-1... 2) Task B is at btrfs_log_inode() which calls inode_logged() early, and that has set ->last_dir_index_offset to (u64)-1; 3) Task B then enters log_directory_changes() which calls update_last_dir_index_offset(). There it sees ->last_dir_index_offset is (u64)-1 and that the inode was logged before (ctx->logged_before is true), and so it searches for the last logged dir index key in the log tree and it finds that it has an offset (index) value of N, so it sets ->last_dir_index_offset to N, so that we can skip index keys that are less than or equal to N (later at process_dir_items_leaf()); 4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update that task B just did; 5) Task B will now skip every index key when it enters process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1. Fix this by making inode_logged() not touch ->last_dir_index_offset and initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and then having update_last_dir_index_offset() treat a value of 0 as meaning we must check the log tree and update with the index of the last logged index key. This is fine since the minimum possible value for ->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1). This also simplifies the management of ->last_dir_index_offset and now all accesses to it are done under the inode's log_mutex. Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: fix race between logging inode and checking if it was logged beforeFilipe Manana
There's a race between checking if an inode was logged before and logging an inode that can cause us to mark an inode as not logged just after it was logged by a concurrent task: 1) We have inode X which was not logged before neither in the current transaction not in past transaction since the inode was loaded into memory, so it's ->logged_trans value is 0; 2) We are at transaction N; 3) Task A calls inode_logged() against inode X, sees that ->logged_trans is 0 and there is a log tree and so it proceeds to search in the log tree for an inode item for inode X. It doesn't see any, but before it sets ->logged_trans to N - 1... 3) Task B calls btrfs_log_inode() against inode X, logs the inode and sets ->logged_trans to N; 4) Task A now sets ->logged_trans to N - 1; 5) At this point anyone calling inode_logged() gets 0 (inode not logged) since ->logged_trans is greater than 0 and less than N, but our inode was really logged. As a consequence operations like rename, unlink and link that happen afterwards in the current transaction end up not updating the log when they should. Fix this by ensuring inode_logged() only updates ->logged_trans in case the inode item is not found in the log tree if after tacking the inode's lock (spinlock struct btrfs_inode::lock) the ->logged_trans value is still zero, since the inode lock is what protects setting ->logged_trans at btrfs_log_inode(). Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: simplify error handling logic for btrfs_link()Filipe Manana
Instead of incrementing the inode's link count and refcount early before adding the link, updating the inode and deleting orphan item, do it after all those steps succeeded right before calling d_instantiate(). This makes the error handling logic simpler by avoiding the need for the 'drop_inode' variable to signal if we need to undo the link count increment and the inode refcount increase under the 'fail' label. This also reduces the level of indentation by one, making the code easier to read. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: fix inode leak on failure to add link to inodeFilipe Manana
If we fail to update the inode or delete the orphan item we leak the inode since we update its refcount with the ihold() call to account for the d_instantiate() call which never happens in case we fail those steps. Fix this by setting 'drop_inode' to true in case we fail those steps. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-22btrfs: abort transaction on failure to add link to inodeFilipe Manana
If we fail to update the inode or delete the orphan item, we must abort the transaction to prevent persisting an inconsistent state. For example if we fail to update the inode item, we have the inconsistency of having a persisted inode item with a link count of N but we have N + 1 inode ref items and N + 1 directory entries pointing to our inode in case the transaction gets committed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-08-21Merge tag 'cgroup-for-6.17-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix NULL de-ref in css_rstat_exit() which could happen after allocation failure - Fix a cpuset partition handling bug and a couple other misc issues - Doc spelling fix * tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup: fixed spelling mistakes in documentation cgroup: avoid null de-ref in css_rstat_exit() cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write() cgroup/cpuset: Fix a partition error with CPU hotplug cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
2025-08-21Merge tag 'spi-fix-v6.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "A small collection of fixes that came in during the past week, a few driver specifics plus one fix for the spi-mem core where we weren't taking account of the frequency capabilities of the system when determining if it can support an operation" * tag 'spi-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: st: fix PM macros to use CONFIG_PM instead of CONFIG_PM_SLEEP spi: spi-qpic-snand: fix calculating of ECC OOB regions' properties spi: spi-fsl-lpspi: Clamp too high speed_hz spi: spi-mem: add spi_mem_adjust_op_freq() in spi_mem_supports_op() spi: spi-mem: Add missing kdoc argument spi: spi-qpic-snand: use correct CW_PER_PAGE value for OOB write
2025-08-21Merge tag 'regulator-fix-v6.17-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A couple of fairly minor device specific fixes that came in over the past week or so, plus the addition of an actual maintainer for the IR38060" * tag 'regulator-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: tps65219: regulator: tps65219: Fix error codes in probe() regulator: pca9450: Use devm_register_sys_off_handler regulator: dt-bindings: infineon,ir38060: Add Guenter as maintainer from IBM
2025-08-21Merge tag 'acpi-6.17-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These fix three new issues in the ACPI APEI error injection code and an ACPI platform firmware runtime update interface issue: - Make ACPI APEI error injection check the version of the request when mapping the EINJ parameter structure in the BIOS reserved memory to prevent injecting errors based on an uninitialized field (Tony Luck) - Fix potential NULL dereference in __einj_error_inject() that may occur when memory allocation fails (Charles Han) - Remove the __exit annotation from einj_remove(), so it can be called on errors during faux device probe (Uwe Kleine-König) - Use a security-version-number check instead of a runtime version check during ACPI platform firmware runtime driver updates to prevent those updates from failing due to false-positive driver version check failures (Chen Yu)" * tag 'acpi-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: pfr_update: Fix the driver update version check ACPI: APEI: EINJ: Fix resource leak by remove callback in .exit.text ACPI: APEI: EINJ: fix potential NULL dereference in __einj_error_inject() ACPI: APEI: EINJ: Check if user asked for EINJV2 injection
2025-08-21Merge tag 'pm-6.17-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a cpuidle menu governor issue and two issues in the cpupower utility: - Prevent the menu cpuidle governor from selecting idle states with exit latency exceeding the current PM QoS limit after stopping the scheduler tick (Rafael Wysocki) - Make the set subcommand's -t option in the cpupower utility work as documented and allow it to control the CPU boost feature of cpufreq beyond x86 (Shinji Nomoto)" * tag 'pm-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpuidle: governors: menu: Avoid selecting states with too much latency cpupower: Allow control of boost feature on non-x86 based systems with boost support. cpupower: Fix a bug where the -t option of the set subcommand was not working.
2025-08-21Merge tag 'sched_ext-for-6.17-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a subtle bug during SCX enabling where a dead task skips init but doesn't skip sched class switch leading to invalid task state transition warning - Cosmetic fix in selftests * tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Remove duplicate sched.h header sched/ext: Fix invalid task state transitions on class switch
2025-08-21io_uring: clear ->async_data as part of normal initJens Axboe
Opcode handlers like POLL_ADD will use ->async_data as the pointer for double poll handling, which is a bit different than the usual case where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more proactive in handling ->async_data, and clear it to NULL as part of regular init. Init is touching that cacheline anyway, so might as well clear it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21io_uring/futex: ensure io_futex_wait() cleans up properly on failureJens Axboe
The io_futex_data is allocated upfront and assigned to the io_kiocb async_data field, but the request isn't marked with REQ_F_ASYNC_DATA at that point. Those two should always go together, as the flag tells io_uring whether the field is valid or not. Additionally, on failure cleanup, the futex handler frees the data but does not clear ->async_data. Clear the data and the flag in the error path as well. Thanks to Trend Micro Zero Day Initiative and particularly ReDress for reporting this. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-21dt-bindings: vendor-prefixes: add eswinPritesh Patel
Add new vendor string to dt bindings. This new vendor string is used by - ESWIN EIC770X SoC - HiFive Premier P550 board which uses EIC7700 SoC. Link: https://www.eswin.com/en/ Signed-off-by: Pritesh Patel <pritesh.patel@einfochips.com> Reviewed-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Pinkesh Vaghela <pinkesh.vaghela@einfochips.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20250616112316.3833343-4-pinkesh.vaghela@einfochips.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2025-08-21drm/xe: Fix vm_bind_ioctl double free bugChristoph Manszewski
If the argument check during an array bind fails, the bind_ops are freed twice as seen below. Fix this by setting bind_ops to NULL after freeing. ================================================================== BUG: KASAN: double-free in xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] Free of addr ffff88813bb9b800 by task xe_vm/14198 CPU: 5 UID: 0 PID: 14198 Comm: xe_vm Not tainted 6.16.0-xe-eudebug-cmanszew+ #520 PREEMPT(full) Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR5 RVP, BIOS ADLPFWI1.R00.2411.A02.2110081023 10/08/2021 Call Trace: <TASK> dump_stack_lvl+0x82/0xd0 print_report+0xcb/0x610 ? __virt_addr_valid+0x19a/0x300 ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] kasan_report_invalid_free+0xc8/0xf0 ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] check_slab_allocation+0x102/0x130 kfree+0x10d/0x440 ? should_fail_ex+0x57/0x2f0 ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] ? __lock_acquire+0xab9/0x27f0 ? lock_acquire+0x165/0x300 ? drm_dev_enter+0x53/0xe0 [drm] ? find_held_lock+0x2b/0x80 ? drm_dev_exit+0x30/0x50 [drm] ? drm_ioctl_kernel+0x128/0x1c0 [drm] drm_ioctl_kernel+0x128/0x1c0 [drm] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] ? find_held_lock+0x2b/0x80 ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm] ? should_fail_ex+0x57/0x2f0 ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] drm_ioctl+0x352/0x620 [drm] ? __pfx_drm_ioctl+0x10/0x10 [drm] ? __pfx_rpm_resume+0x10/0x10 ? do_raw_spin_lock+0x11a/0x1b0 ? find_held_lock+0x2b/0x80 ? __pm_runtime_resume+0x61/0xc0 ? rcu_is_watching+0x20/0x50 ? trace_irq_enable.constprop.0+0xac/0xe0 xe_drm_ioctl+0x91/0xc0 [xe] __x64_sys_ioctl+0xb2/0x100 ? rcu_is_watching+0x20/0x50 do_syscall_64+0x68/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa9acb24ded Fixes: b43e864af0d4 ("drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Christoph Manszewski <christoph.manszewski@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250813101231.196632-2-christoph.manszewski@intel.com (cherry picked from commit a01b704527c28a2fd43a17a85f8996b75ec8492a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-21drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_createPiotr Piórkowski
Currently, ASID assignment for user VMs and page-table BO accounting for client memory tracking are performed in xe_vm_create_ioctl. To consolidate VM object initialization, move this logic to xe_vm_create. v2: - removed unnecessary duplicate BO tracking code - using the local variable xef to verify whether the VM is being created by userspace Fixes: 658a1c8e0a66 ("drm/xe: Assign ioctl xe file handler to vm in xe_vm_create") Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250811104358.2064150-3-piotr.piorkowski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> (cherry picked from commit 30e0c3f43a414616e0b6ca76cf7f7b2cd387e1d4) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> [Rodrigo: Added fixes tag]
2025-08-21Merge branches 'acpi-apei' and 'acpi-pfrut'Rafael J. Wysocki
Merge ACPI APEI fixes and an ACPI platform firmware runtime update fix for 6.17-rc3. * acpi-apei: ACPI: APEI: EINJ: Fix resource leak by remove callback in .exit.text ACPI: APEI: EINJ: fix potential NULL dereference in __einj_error_inject() ACPI: APEI: EINJ: Check if user asked for EINJV2 injection * acpi-pfrut: ACPI: pfr_update: Fix the driver update version check
2025-08-21Merge branch 'pm-cpuidle'Rafael J. Wysocki
Merge a menu governor fix for 6.17-rc3 * pm-cpuidle: cpuidle: governors: menu: Avoid selecting states with too much latency
2025-08-21Merge tag 'net-6.17-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Bluetooth. Current release - fix to a fix: - usb: asix_devices: fix PHY address mask in MDIO bus initialization Current release - regressions: - Bluetooth: fixes for the split between BIS_LINK and PA_LINK - Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flag", breaks compatibility with some existing device tree blobs - dsa: b53: fix reserved register access in b53_fdb_dump() Current release - new code bugs: - sched: dualpi2: run probability update timer in BH to avoid deadlock - eth: libwx: fix the size in RSS hash key population - pse-pd: pd692x0: improve power budget error paths and handling Previous releases - regressions: - tls: fix handling of zero-length records on the rx_list - hsr: reject HSR frame if skb can't hold tag - bonding: fix negotiation flapping in 802.3ad passive mode Previous releases - always broken: - gso: forbid IPv6 TSO with extensions on devices with only IPV6_CSUM - sched: make cake_enqueue return NET_XMIT_CN when past buffer_limit, avoid packet drops with low buffer_limit, remove unnecessary WARN() - sched: fix backlog accounting after modifying config of a qdisc in the middle of the hierarchy - mptcp: improve handling of skb extension allocation failures - eth: mlx5: - fixes for the "HW Steering" flow management method - fixes for QoS and device buffer management" * tag 'net-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) netfilter: nf_reject: don't leak dst refcount for loopback packets net/mlx5e: Preserve shared buffer capacity during headroom updates net/mlx5e: Query FW for buffer ownership net/mlx5: Restore missing scheduling node cleanup on vport enable failure net/mlx5: Fix QoS reference leak in vport enable error path net/mlx5: Destroy vport QoS element when no configuration remains net/mlx5e: Preserve tc-bw during parent changes net/mlx5: Remove default QoS group and attach vports directly to root TSAR net/mlx5: Base ECVF devlink port attrs from 0 net: pse-pd: pd692x0: Skip power budget configuration when undefined net: pse-pd: pd692x0: Fix power budget leak in manager setup error path Octeontx2-af: Skip overlap check for SPI field selftests: tls: add tests for zero-length records tls: fix handling of zero-length records on the rx_list net: airoha: ppe: Do not invalid PPE entries in case of SW hash collision selftests: bonding: add test for passive LACP mode bonding: send LACPDUs periodically in passive mode after receiving partner's LACPDU bonding: update LACP activity flag after setting lacp_active Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flag" ipv6: sr: Fix MAC comparison to be constant-time ...
2025-08-21of: reserved_mem: Add missing IORESOURCE_MEM flag on resourcesRob Herring (Arm)
Commit f4fcfdda2fd8 ('of: reserved_mem: Add functions to parse "memory-region"') failed to set IORESOURCE_MEM flag on the resources. The result is functions such as devm_ioremap_resource_wc() will fail. Add the missing flag. Fixes: f4fcfdda2fd8 ('of: reserved_mem: Add functions to parse "memory-region"') Reported-by: Iuliana Prodan <iuliana.prodan@nxp.com> Reported-by: Daniel Baluta <daniel.baluta@gmail.com> Tested-by: Iuliana Prodan <iuliana.prodan@nxp.com> Reviewed-by: Iuliana Prodan <iuliana.prodan@nxp.com> Reviewed-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20250820192805.565568-1-robh@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2025-08-21netfilter: nf_reject: don't leak dst refcount for loopback packetsFlorian Westphal
recent patches to add a WARN() when replacing skb dst entry found an old bug: WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline] WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline] WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234 [..] Call Trace: nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325 nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline] .. This is because blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking hook just check if the skb already has a route attached to it. Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21idpf: remove obsolete stashing codeJoshua Hay
With the new Tx buffer management scheme, there is no need for all of the stashing mechanisms, the hash table, the reserve buffer stack, etc. Remove all of that. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: stop Tx if there are insufficient buffer resourcesJoshua Hay
The Tx refillq logic will cause packets to be silently dropped if there are not enough buffer resources available to send a packet in flow scheduling mode. Instead, determine how many buffers are needed along with number of descriptors. Make sure there are enough of both resources to send the packet, and stop the queue if not. Fixes: 7292af042bcf ("idpf: fix a race in txq wakeup") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: replace flow scheduling buffer ring with buffer poolJoshua Hay
Replace the TxQ buffer ring with one large pool/array of buffers (only for flow scheduling). This eliminates the tag generation and makes it impossible for a tag to be associated with more than one packet. The completion tag passed to HW through the descriptor is the index into the array. That same completion tag is posted back to the driver in the completion descriptor, and used to index into the array to quickly retrieve the buffer during cleaning. In this way, the tags are treated as a fix sized resource. If all tags are in use, no more packets can be sent on that particular queue (until some are freed up). The tag pool size is 64K since the completion tag width is 16 bits. For each packet, the driver pulls a free tag from the refillq to get the next free buffer index. When cleaning is complete, the tag is posted back to the refillq. A multi-frag packet spans multiple buffers in the driver, therefore it uses multiple buffer indexes/tags from the pool. Each frag pulls from the refillq to get the next free buffer index. These are tracked in a next_buf field that replaces the completion tag field in the buffer struct. This chains the buffers together so that the packet can be cleaned from the starting completion tag taken from the completion descriptor, then from the next_buf field for each subsequent buffer. In case of a dma_mapping_error occurs or the refillq runs out of free buf_ids, the packet will execute the rollback error path. This unmaps any buffers previously mapped for the packet. Since several free buf_ids could have already been pulled from the refillq, we need to restore its original state as well. Otherwise, the buf_ids/tags will be leaked and not used again until the queue is reallocated. Descriptor completions only advance the descriptor ring index to "clean" the descriptors. The packet completions only clean the buffers associated with the given packet completion tag and do not update the descriptor ring index. When operating in queue based scheduling mode, the array still acts as a ring and will only have TxQ descriptor count entries. The tx_bufs are still associated 1:1 with the descriptor ring entries and we can use the conventional indexing mechanisms. Fixes: c2d548cad150 ("idpf: add TX splitq napi poll support") Signed-off-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: simplify and fix splitq Tx packet rollback error pathJoshua Hay
Move (and rename) the existing rollback logic to singleq.c since that will be the only consumer. Create a simplified splitq specific rollback function to loop through and unmap tx_bufs based on the completion tag. This is critical before replacing the Tx buffer ring with the buffer pool since the previous rollback indexing will not work to unmap the chained buffers from the pool. Cache the next_to_use index before any portion of the packet is put on the descriptor ring. In case of an error, the rollback will bump tail to the correct next_to_use value. Because the splitq path now supports different types of context descriptors (and potentially multiple in the future), this will take care of rolling back any and all context descriptors encoded on the ring for the erroneous packet. The previous rollback logic was broken for PTP packets since it would not account for the PTP context descriptor. Fixes: 1a49cf814fe1 ("idpf: add Tx timestamp flows") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: improve when to set RE bit logicJoshua Hay
Track the gap between next_to_use and the last RE index. Set RE again if the gap is large enough to ensure RE bit is set frequently. This is critical before removing the stashing mechanisms because the opportunistic descriptor ring cleaning from the out-of-order completions will go away. Previously the descriptors would be "cleaned" by both the descriptor (RE) completion and the out-of-order completions. Without the latter, we must ensure the RE bit is set more frequently. Otherwise, it's theoretically possible for the descriptor ring next_to_clean to never advance. The previous implementation was dependent on the start of a packet falling on a 64th index in the descriptor ring, which is not guaranteed with large packets. Signed-off-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: add support for Tx refillqs in flow scheduling modeJoshua Hay
In certain production environments, it is possible for completion tags to collide, meaning N packets with the same completion tag are in flight at the same time. In this environment, any given Tx queue is effectively used to send both slower traffic and higher throughput traffic simultaneously. This is the result of a customer's specific configuration in the device pipeline, the details of which Intel cannot provide. This configuration results in a small number of out-of-order completions, i.e., a small number of packets in flight. The existing guardrails in the driver only protect against a large number of packets in flight. The slower flow completions are delayed which causes the out-of-order completions. The fast flow will continue sending traffic and generating tags. Because tags are generated on the fly, the fast flow eventually uses the same tag for a packet that is still in flight from the slower flow. The driver has no idea which packet it should clean when it processes the completion with that tag, but it will look for the packet on the buffer ring before the hash table. If the slower flow packet completion is processed first, it will end up cleaning the fast flow packet on the ring prematurely. This leaves the descriptor ring in a bad state resulting in a crash or Tx timeout. In summary, generating a tag when a packet is sent can lead to the same tag being associated with multiple packets. This can lead to resource leaks, crashes, and/or Tx timeouts. Before we can replace the tag generation, we need a new mechanism for the send path to know what tag to use next. The driver will allocate and initialize a refillq for each TxQ with all of the possible free tag values. During send, the driver grabs the next free tag from the refillq from next_to_clean. While cleaning the packet, the clean routine posts the tag back to the refillq's next_to_use to indicate that it is now free to use. This mechanism works exactly the same way as the existing Rx refill queues, which post the cleaned buffer IDs back to the buffer queue to be reposted to HW. Since we're using the refillqs for both Rx and Tx now, genericize some of the existing refillq support. Note: the refillqs will not be used yet. This is only demonstrating how they will be used to pass free tags back to the send path. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21s390/hypfs: Enable limited access during lockdownPeter Oberparleiter
When kernel lockdown is active, debugfs_locked_down() blocks access to hypfs files that register ioctl callbacks, even if the ioctl interface is not required for a function. This unnecessarily breaks userspace tools that only rely on read operations. Resolve this by registering a minimal set of file operations during lockdown, avoiding ioctl registration and preserving access for affected tooling. Note that this change restores hypfs functionality when lockdown is active from early boot (e.g. via lockdown=integrity kernel parameter), but does not apply to scenarios where lockdown is enabled dynamically while Linux is running. Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-08-21s390/hypfs: Avoid unnecessary ioctl registration in debugfsPeter Oberparleiter
Currently, hypfs registers ioctl callbacks for all debugfs files, despite only one file requiring them. This leads to unintended exposure of unused interfaces to user space and can trigger side effects such as restricted access when kernel lockdown is enabled. Restrict ioctl registration to only those files that implement ioctl functionality to avoid interface clutter and unnecessary access restrictions. Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-08-21ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validationTakashi Iwai
The entry of the validators table for UAC3 feature unit is defined with a wrong sub-type UAC_FEATURE (= 0x06) while it should have been UAC3_FEATURE (= 0x07). This patch corrects the entry value. Fixes: 57f8770620e9 ("ALSA: usb-audio: More validations of descriptor units") Link: https://patch.msgid.link/20250821150835.8894-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2025-08-21Merge branch 'mlx5-misx-fixes-2025-08-20'Jakub Kicinski
Mark Bloch says: ==================== mlx5 misx fixes 2025-08-20 This patchset provides misc bug fixes from the team to the mlx5 core and Eth drivers. v1: https://lore.kernel.org/1755095476-414026-1-git-send-email-tariqt@nvidia.com ==================== Link: https://patch.msgid.link/20250820133209.389065-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5e: Preserve shared buffer capacity during headroom updatesArmen Ratner
When port buffer headroom changes, port_update_shared_buffer() recalculates the shared buffer size and splits it in a 3:1 ratio (lossy:lossless) - Currently, the calculation is: lossless = shared / 4; lossy = (shared / 4) * 3; Meaning, the calculation dropped the remainder of shared % 4 due to integer division, unintentionally reducing the total shared buffer by up to three cells on each update. Over time, this could shrink the buffer below usable size. Fix it by changing the calculation to: lossless = shared / 4; lossy = shared - lossless; This retains all buffer cells while still approximating the intended 3:1 split, preventing capacity loss over time. While at it, perform headroom calculations in units of cells rather than in bytes for more accurate calculations avoiding extra divisions. Fixes: a440030d8946 ("net/mlx5e: Update shared buffer along with device buffer changes") Signed-off-by: Armen Ratner <armeng@nvidia.com> Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Alexei Lazar <alazar@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250820133209.389065-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5e: Query FW for buffer ownershipAlexei Lazar
The SW currently saves local buffer ownership when setting the buffer. This means that the SW assumes it has ownership of the buffer after the command is set. If setting the buffer fails and we remain in FW ownership, the local buffer ownership state incorrectly remains as SW-owned. This leads to incorrect behavior in subsequent PFC commands, causing failures. Instead of saving local buffer ownership in SW, query the FW for buffer ownership when setting the buffer. This ensures that the buffer ownership state is accurately reflected, avoiding the issues caused by incorrect ownership states. Fixes: ecdf2dadee8e ("net/mlx5e: Receive buffer support for DCBX") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Restore missing scheduling node cleanup on vport enable failureCarolina Jubran
Restore the __esw_qos_free_node() call removed by the offending commit. Fixes: 97733d1e00a0 ("net/mlx5: Add traffic class scheduling support for vport QoS") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Fix QoS reference leak in vport enable error pathCarolina Jubran
Add missing esw_qos_put() call when __esw_qos_alloc_node() fails in mlx5_esw_qos_vport_enable(). Fixes: be034baba83e ("net/mlx5: Make vport QoS enablement more flexible for future extensions") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Destroy vport QoS element when no configuration remainsCarolina Jubran
If a VF has been configured and the user later clears all QoS settings, the vport element remains in the firmware QoS tree. This leads to inconsistent behavior compared to VFs that were never configured, since the FW assumes that unconfigured VFs are outside the QoS hierarchy. As a result, the bandwidth share across VFs may differ, even though none of them appear to have any configuration. Align the driver behavior with the FW expectation by destroying the vport QoS element when all configurations are removed. Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250820133209.389065-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5e: Preserve tc-bw during parent changesCarolina Jubran
When changing parent of a node/leaf with tc-bw configured, the code saves and restores tc-bw values. However, it was reading the converted hardware bw_share values (where 0 becomes 1) instead of the original user values, causing incorrect tc-bw calculations after parent change. Store original tc-bw values in the node structure and use them directly for save/restore operations. Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Remove default QoS group and attach vports directly to root TSARCarolina Jubran
Currently, the driver creates a default group (`node0`) and attaches all vports to it unless the user explicitly sets a parent group. As a result, when a user configures tx_share on a group and tx_share on a VF, the expectation is for the group and the VF to share bandwidth relatively. However, since the VF is not connected to the same parent (but to the default node), the proportional share logic is not applied correctly. To fix this, remove the default group (`node0`) and instead connect vports directly to the root TSAR when no parent is specified. This ensures that vports and groups share the same root scheduler and their tx_share values are compared directly under the same hierarchy. Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Base ECVF devlink port attrs from 0Daniel Jurgens
Adjust the vport number by the base ECVF vport number so the port attributes start at 0. Previously the port attributes would start 1 after the maximum number of host VFs. Fixes: dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net: pse-pd: pd692x0: Skip power budget configuration when undefinedKory Maincent
If the power supply's power budget is not defined in the device tree, the current code still requests power and configures the PSE manager with a 0W power limit, which is undesirable behavior. Skip power budget configuration entirely when the budget is zero, avoiding unnecessary power requests and preventing invalid 0W limits from being set on the PSE manager. Fixes: 359754013e6a ("net: pse-pd: pd692x0: Add support for PSE PI priority feature") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250820133321.841054-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net: pse-pd: pd692x0: Fix power budget leak in manager setup error pathKory Maincent
Fix a resource leak where manager power budgets were freed on both success and error paths during manager setup. Power budgets should only be freed on error paths after regulator registration or during driver removal. Refactor cleanup logic by extracting OF node cleanup and power budget freeing into separate helper functions for better maintainability. Fixes: 359754013e6a ("net: pse-pd: pd692x0: Add support for PSE PI priority feature") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250820132708.837255-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21Octeontx2-af: Skip overlap check for SPI fieldHariprasad Kelam
Octeontx2/CN10K silicon supports generating a 256-bit key per packet. The specific fields to be extracted from a packet for key generation are configurable via a Key Extraction (MKEX) Profile. The AF driver scans the configured extraction profile to ensure that fields from upper layers do not overwrite fields from lower layers in the key. Example Packet Field Layout: LA: DMAC + SMAC LB: VLAN LC: IPv4/IPv6 LD: TCP/UDP Valid MKEX Profile Configuration: LA -> DMAC -> key_offset[0-5] LC -> SIP -> key_offset[20-23] LD -> SPORT -> key_offset[30-31] Invalid MKEX profile configuration: LA -> DMAC -> key_offset[0-5] LC -> SIP -> key_offset[20-23] LD -> SPORT -> key_offset[2-3] // Overlaps with DMAC field In another scenario, if the MKEX profile is configured to extract the SPI field from both AH and ESP headers at the same key offset, the driver rejecting this configuration. In a regular traffic, ipsec packet will be having either AH(LD) or ESP (LE). This patch relaxes the check for the same. Fixes: 12aa0a3b93f3 ("octeontx2-af: Harden rule validation.") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250820063919.1463518-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21selftests: tls: add tests for zero-length recordsJakub Kicinski
Test various combinations of zero-length records. Unfortunately, kernel cannot be coerced into producing those, so hardcode the ciphertext messages in the test. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250820021952.143068-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21tls: fix handling of zero-length records on the rx_listJakub Kicinski
Each recvmsg() call must process either - only contiguous DATA records (any number of them) - one non-DATA record If the next record has different type than what has already been processed we break out of the main processing loop. If the record has already been decrypted (which may be the case for TLS 1.3 where we don't know type until decryption) we queue the pending record to the rx_list. Next recvmsg() will pick it up from there. Queuing the skb to rx_list after zero-copy decrypt is not possible, since in that case we decrypted directly to the user space buffer, and we don't have an skb to queue (darg.skb points to the ciphertext skb for access to metadata like length). Only data records are allowed zero-copy, and we break the processing loop after each non-data record. So we should never zero-copy and then find out that the record type has changed. The corner case we missed is when the initial record comes from rx_list, and it's zero length. Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg> Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg> Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250820021952.143068-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21Merge tag 'loongarch-fixes-6.17-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai Chen: "Fix a lot of build warnings for LTO-enabled objtool check, increase COMMAND_LINE_SIZE up to 4096, rename a missing GCC_PLUGIN_STACKLEAK to KSTACK_ERASE, and fix some bugs about arch timer, module loading, LBT and KVM" * tag 'loongarch-fixes-6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: KVM: Add address alignment check in pch_pic register access LoongArch: KVM: Use kvm_get_vcpu_by_id() instead of kvm_get_vcpu() LoongArch: KVM: Fix stack protector issue in send_ipi_data() LoongArch: KVM: Make function kvm_own_lbt() robust LoongArch: Rename GCC_PLUGIN_STACKLEAK to KSTACK_ERASE LoongArch: Save LBT before FPU in setup_sigcontext() LoongArch: Optimize module load time by optimizing PLT/GOT counting LoongArch: Add cpuhotplug hooks to fix high cpu usage of vCPU threads LoongArch: Increase COMMAND_LINE_SIZE up to 4096 LoongArch: Pass annotate-tablejump option if LTO is enabled objtool/LoongArch: Get table size correctly if LTO is enabled
2025-08-21cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for ↵Gautham R. Shenoy
performance governor In the "active" mode of the amd-pstate driver with performance governor, the CPPC.min_perf is expected to be the nominal_perf. However after commit a9b9b4c2a4cd ("cpufreq/amd-pstate: Drop min and max cached frequencies"), this is not the case when the governor is switched from performance to powersave and back to performance, and the CPPC.min_perf will be equal to the scaling_min_freq that was set for the powersave governor. This is because prior to commit a9b9b4c2a4cd ("cpufreq/amd-pstate: Drop min and max cached frequencies"), amd_pstate_epp_update_limit() would unconditionally call amd_pstate_update_min_max_limit() and the latter function would enforce the CPPC.min_perf constraint when the governor is performance. However, after the aforementioned commit, amd_pstate_update_min_max_limit() is called by amd_pstate_epp_update_limit() only when either the scaling_{min/max}_freq is different from the cached value of cpudata->{min/max}_limit_freq, which wouldn't have changed on a governor transition from powersave to performance, thus missing out on enforcing the CPPC.min_perf constraint for the performance governor. Fix this by invoking amd_pstate_epp_udpate_limit() not only when the {min/max} limits have changed from the cached values, but also when the policy itself has changed. Fixes: a9b9b4c2a4cd ("cpufreq/amd-pstate: Drop min and max cached frequencies") Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20250821042638.356-1-gautham.shenoy@amd.com Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>