summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-06-05selftests: hsr: Extend the hsr_ping.sh test to use fixed MAC addressesLukasz Majewski
Fixed MAC addresses help with debugging as last four bytes identify the network namespace. Signed-off-by: Lukasz Majewski <lukma@denx.de> Link: https://lore.kernel.org/r/20240603093322.3150030-1-lukma@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05selftests: hsr: Extend the hsr_redbox.sh test to use fixed MAC addressesLukasz Majewski
Fixed MAC addresses help with debugging as last four bytes identify the network namespace. Moreover, it allows to mimic the real life setup with for example bridge having the same MAC address on each port. Signed-off-by: Lukasz Majewski <lukma@denx.de> Link: https://lore.kernel.org/r/20240603093322.3150030-2-lukma@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-06-05 We've added 8 non-merge commits during the last 6 day(s) which contain a total of 9 files changed, 34 insertions(+), 35 deletions(-). The main changes are: 1) Fix a potential use-after-free in bpf_link_free when the link uses dealloc_deferred to free the link object but later still tests for presence of link->ops->dealloc, from Cong Wang. 2) Fix BPF test infra to set the run context for rawtp test_run callback where syzbot reported a crash, from Jiri Olsa. 3) Fix bpf_session_cookie BTF_ID in the special_kfunc_set list to exclude it for the case of !CONFIG_FPROBE, also from Jiri Olsa. 4) Fix a Coverity static analysis report to not close() a link_fd of -1 in the multi-uprobe feature detector, from Andrii Nakryiko. 5) Revert support for redirect to any xsk socket bound to the same umem as it can result in corrupted ring state which can lead to a crash when flushing rings. A different approach will be pursued for bpf-next to address it safely, from Magnus Karlsson. 6) Fix inet_csk_accept prototype in test_sk_storage_tracing.c which caused BPF CI failure after the last tree fast forwarding, from Andrii Nakryiko. 7) Fix a coccicheck warning in BPF devmap that iterator variable cannot be NULL, from Thorsten Blum. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: Revert "xsk: Document ability to redirect to any socket bound to the same umem" Revert "xsk: Support redirect to any socket bound to the same umem" bpf: Set run context for rawtp test_run callback bpf: Fix a potential use-after-free in bpf_link_free() bpf, devmap: Remove unnecessary if check in for loop libbpf: don't close(-1) in multi-uprobe feature detector bpf: Fix bpf_session_cookie BTF_ID in special_kfunc_set list selftests/bpf: fix inet_csk_accept prototype in test_sk_storage_tracing.c ==================== Link: https://lore.kernel.org/r/20240605091525.22628-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05ptp: Fix error message on failed pin verificationKarol Kolacinski
On failed verification of PTP clock pin, error message prints channel number instead of pin index after "pin", which is incorrect. Fix error message by adding channel number to the message and printing pin number instead of channel number. Fixes: 6092315dfdec ("ptp: introduce programmable pins.") Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://lore.kernel.org/r/20240604120555.16643-1-karol.kolacinski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05Merge branch 'vmxnet3-upgrade-to-version-9'Jakub Kicinski
Ronak Doshi says: ==================== vmxnet3: upgrade to version 9 vmxnet3 emulation has recently added timestamping feature which allows the hypervisor (ESXi) to calculate latency from guest virtual NIC driver to all the way up to the physical NIC. This patch series extends vmxnet3 driver to leverage these new feature. Compatibility is maintained using existing vmxnet3 versioning mechanism as follows: - new features added to vmxnet3 emulation are associated with new vmxnet3 version viz. vmxnet3 version 9. - emulation advertises all the versions it supports to the driver. - during initialization, vmxnet3 driver picks the highest version number supported by both the emulation and the driver and configures emulation to run at that version. In particular, following changes are introduced: Patch 1: This patch introduces utility macros for vmxnet3 version 9 comparison and updates Copyright information. Patch 2: This patch adds support to timestamp the packets so as to allow latency measurement in the ESXi. Patch 3: This patch adds support to disable certain offloads on the device based on the request specified by the user in the VM configuration. Patch 4: With all vmxnet3 version 9 changes incorporated in the vmxnet3 driver, with this patch, the driver can configure emulation to run at vmxnet3 version 9. ==================== Link: https://lore.kernel.org/r/20240531193050.4132-1-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05vmxnet3: update to version 9Ronak Doshi
With all vmxnet3 version 9 changes incorporated in the vmxnet3 driver, the driver can configure emulation to run at vmxnet3 version 9, provided the emulation advertises support for version 9. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-5-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05vmxnet3: add command to allow disabling of offloadsRonak Doshi
This patch adds a new command to disable certain offloads. This allows user to specify, using VM configuration, if certain offloads need to be disabled. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-4-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05vmxnet3: add latency measurement support in vmxnet3Ronak Doshi
This patch enhances vmxnet3 to support latency measurement. This support will help to track the latency in packet processing between guest virtual nic driver and host. For this purpose, we introduce a new timestamp ring in vmxnet3 which will be per Tx/Rx queue. This ring will be used to carry timestamp of the packets which will be used to calculate the latency. User can enable latency measurement using realtime knob in vnic setting in VCenter. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-3-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05vmxnet3: prepare for version 9 changesRonak Doshi
vmxnet3 is currently at version 7 and this patch initiates the preparation to accommodate changes for up to version 9. Introduced utility macros for vmxnet3 version 9 comparison and update Copyright information. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-2-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/sched: taprio: always validate TCA_TAPRIO_ATTR_PRIOMAPEric Dumazet
If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided, taprio_parse_mqprio_opt() must validate it, or userspace can inject arbitrary data to the kernel, the second time taprio_change() is called. First call (with valid attributes) sets dev->num_tc to a non zero value. Second call (with arbitrary mqprio attributes) returns early from taprio_parse_mqprio_opt() and bad things can happen. Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20240604181511.769870-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05ionic: advertise 52-bit addressing limitation for MSI-XDavid Christensen
Current ionic devices only support 52 internal physical address lines. This is sufficient for x86_64 systems which have similar limitations but does not apply to all other architectures, notably IBM POWER (ppc64). To ensure that MSI/MSI-X vectors are not set outside the physical address limits of the NIC, set the no_64bit_msi value of the pci_dev structure during device probe. Signed-off-by: David Christensen <drc@linux.ibm.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240603212747.1079134-1-drc@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05Merge tag 'thermal-6.10-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control fixes from Rafael Wysocki: "Fix issues related to the handling of invalid trip points in the thermal core and in the thermal debug code that have been overlooked by some recent thermal control core changes" * tag 'thermal-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: trip: Trigger trip down notifications when trips involved in mitigation become invalid thermal: core: Introduce thermal_trip_crossed() thermal/debugfs: Allow tze_seq_show() to print statistics for invalid trips thermal/debugfs: Print initial trip temperature and hysteresis in tze_seq_show()
2024-06-05Merge tag 'acpi-6.10-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These fix the ACPI EC and AC drivers, the ACPI APEI error injection driver and build issues related to the dev_is_pnp() macro referring to pnp_bus_type that is not exported to modules. Specifics: - Fix error handling during EC operation region accesses in the ACPI EC driver (Armin Wolf) - Fix a memory leak in the APEI error injection driver introduced during its converion to a platform driver (Dan Williams) - Fix build failures related to the dev_is_pnp() macro by redefining it as a proper function and exporting it to modules as appropriate and unexport pnp_bus_type which need not be exported any more (Andy Shevchenko) - Update the ACPI AC driver to use power_supply_changed() to let the power supply core handle configuration changes properly (Thomas Weißschuh)" * tag 'acpi-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: AC: Properly notify powermanagement core about changes PNP: Hide pnp_bus_type from the non-PNP code PNP: Make dev_is_pnp() to be a function and export it for modules ACPI: EC: Avoid returning AE_OK on errors in address space handler ACPI: EC: Abort address space access upon error ACPI: APEI: EINJ: Fix einj_dev release leak
2024-06-05Merge tag 'pm-6.10-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix the intel_pstate and amd-pstate cpufreq drivers and the cpupower utility. Specifics: - Fix a recently introduced unchecked HWP MSR access in the intel_pstate driver (Srinivas Pandruvada) - Add missing conversion from MHz to KHz to amd_pstate_set_boost() to address sysfs inteface inconsistency and fix P-state frequency reporting on AMD Family 1Ah CPUs in the cpupower utility (Dhananjay Ugwekar) - Get rid of an excess global header file used by the amd-pstate cpufreq driver (Arnd Bergmann)" * tag 'pm-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Fix unchecked HWP MSR access cpufreq: amd-pstate: Fix the inconsistency in max frequency units cpufreq: amd-pstate: remove global header file tools/power/cpupower: Fix Pstate frequency reporting on AMD Family 1Ah CPUs
2024-06-05bnxt_en: fix atomic counter for ptp packetsVadim Fedorenko
atomic_dec_if_positive returns new value regardless if it is updated or not. The commit in fixes changed the behavior of the condition to one that differs from original code. Restore original condition to properly maintain atomic counter. Fixes: 165f87691a89 ("bnxt_en: add timestamping statistics support") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604091939.785535-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5: Fix tainted pointer delete is case of flow rules creation failAleksandr Mishin
In case of flow rule creation fail in mlx5_lag_create_port_sel_table(), instead of previously created rules, the tainted pointer is deleted deveral times. Fix this bug by using correct flow rules pointers. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 352899f384d4 ("net/mlx5: Lag, use buckets in hash mode") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240604100552.25201-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05Merge tag 'ath-next-20240605' of ↵Kalle Valo
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath ath.git patches for v6.11 ath12k * remove unsupported tx monitor handling * channel 2 in 6 GHz band support * Spatial Multiplexing Power Save (SMPS) in 6 GHz band support * multiple BSSID (MBSSID) and Enhanced Multi-BSSID Advertisements (EMA) support * dynamic VLAN support * add panic handler for resetting the firmware state ath10k * add qcom,no-msa-ready-indicator Device Tree property * LED support for various chipsets
2024-06-05Merge tag 'for-6.10-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fix for fast fsync that needs to handle errors during writes after some COW failure so it does not lead to an inconsistent state" * tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ensure fast fsync waits for ordered extents after a write failure
2024-06-05Merge tag 'bcachefs-2024-06-05' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "Just a few small fixes" * tag 'bcachefs-2024-06-05' of https://evilpiepirate.org/git/bcachefs: bcachefs: Fix trans->locked assert bcachefs: Rereplicate now moves data off of durability=0 devices bcachefs: Fix GFP_KERNEL allocation in break_cycle()
2024-06-05Merge tag 'rtw-next-2024-06-04' of https://github.com/pkshih/rtwKalle Valo
rtw-next patches for v6.11 Some fixes and refactors of rtlwifi, rtw88 and rtw89. Only one major change listed below: rtlwifi: - add new chip support of RTL8192DU
2024-06-05Merge tag 'i2c-for-6.10-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "This should have been my second pull request during the merge window but one dependency in the drm subsystem fell through the cracks and was only applied for rc2. Now we can finally remove I2C_CLASS_SPD" * tag 'i2c-for-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: Remove I2C_CLASS_SPD i2c: synquacer: Remove a clk reference from struct synquacer_i2c
2024-06-05Merge tag 'tpmdd-next-6.10-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm fixes from Jarkko Sakkinen: "The bug fix for tpm_tis_core_init() is not that critical but still makes sense to get into release for the sake of better quality. I included the Intel CPU model define change mainly to help Tony just a bit, as for this subsystem it cannot realistically speaking cause any possible harm" * tag 'tpmdd-next-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: Switch to new Intel CPU model defines tpm_tis: Do *not* flush uninitialized work
2024-06-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "This is dominated by a couple large series for ARM and x86 respectively, but apart from that things are calm. ARM: - Large set of FP/SVE fixes for pKVM, addressing the fallout from the per-CPU data rework and making sure that the host is not involved in the FP/SVE switching any more - Allow FEAT_BTI to be enabled with NV now that FEAT_PAUTH is completely supported - Fix for the respective priorities of Failed PAC, Illegal Execution state and Instruction Abort exceptions - Fix the handling of AArch32 instruction traps failing their condition code, which was broken by the introduction of ESR_EL2.ISS2 - Allow vcpus running in AArch32 state to be restored in System mode - Fix AArch32 GPR restore that would lose the 64 bit state under some conditions RISC-V: - No need to use mask when hart-index-bits is 0 - Fix incorrect reg_subtype labels in kvm_riscv_vcpu_set_reg_isa_ext() x86: - Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). - Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it *does* have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. - Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (32 commits) KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr KVM: arm64: Ensure that SME controls are disabled in protected mode KVM: arm64: Refactor CPACR trap bit setting/clearing to use ELx format KVM: arm64: Consolidate initializing the host data's fpsimd_state/sve in pKVM KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM KVM: arm64: Specialize handling of host fpsimd state on trap KVM: arm64: Abstract set/clear of CPTR_EL2 bits behind helper KVM: arm64: Fix prototype for __sve_save_state/__sve_restore_state KVM: arm64: Reintroduce __sve_save_state KVM: x86: Drop support for hand tuning APIC timer advancement from userspace KVM: SEV-ES: Delegate LBR virtualization to the processor KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absent KVM: SEV-ES: Prevent MSR access post VMSA encryption RISC-V: KVM: Fix incorrect reg_subtype labels in kvm_riscv_vcpu_set_reg_isa_ext function RISC-V: KVM: No need to use mask when hart-index-bit is 0 KVM: arm64: nv: Expose BTI and CSV_frac to a guest hypervisor KVM: arm64: nv: Fix relative priorities of exceptions generated by ERETAx KVM: arm64: AArch32: Fix spurious trapping of conditional instructions KVM: arm64: Allow AArch32 PSTATE.M to be restored as System mode ...
2024-06-05Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq fixes for 6.10-rc3: - Fix a recently introduced unchecked HWP MSR access in the intel_pstate driver (Srinivas Pandruvada). - Add missing conversion from MHz to KHz to amd_pstate_set_boost() to address sysfs inteface inconsistency (Dhananjay Ugwekar). - Get rid of an excess global header file used by the amd-pstate cpufreq driver (Arnd Bergmann). * pm-cpufreq: cpufreq: intel_pstate: Fix unchecked HWP MSR access cpufreq: amd-pstate: Fix the inconsistency in max frequency units cpufreq: amd-pstate: remove global header file
2024-06-05Merge branches 'acpi-ec', 'acpi-apei' and 'pnp'Rafael J. Wysocki
Merge ACPI EC driver fixes, an ACPI APEI fix and PNP fixes for 6.10-rc3: - Fix error handling during EC operation region accesses in the ACPI EC driver (Armin Wolf). - Fix a memory leak in the APEI error injection driver introduced during its converion to a platform driver (Dan Williams). - Fix build failures related to the dev_is_pnp() macro by redefining it as a proper function and exporting it to modules as appropriate and unexport pnp_bus_type which need not be exported any more (Andy Shevchenko). * acpi-ec: ACPI: EC: Avoid returning AE_OK on errors in address space handler ACPI: EC: Abort address space access upon error * acpi-apei: ACPI: APEI: EINJ: Fix einj_dev release leak * pnp: PNP: Hide pnp_bus_type from the non-PNP code PNP: Make dev_is_pnp() to be a function and export it for modules
2024-06-05libbpf: Remove callback-based type/string BTF field visitor helpersAndrii Nakryiko
Now that all libbpf/bpftool code switched to btf_field_iter, remove btf_type_visit_type_ids() and btf_type_visit_str_offs() callback-based helpers as not needed anymore. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-6-andrii@kernel.org
2024-06-05bpftool: Use BTF field iterator in btfgenAndrii Nakryiko
Switch bpftool's code which is using libbpf-internal btf_type_visit_type_ids() helper to new btf_field_iter functionality. This makes bpftool code simpler, but also unblocks removing libbpf's btf_type_visit_type_ids() helper completely. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Quentin Monnet <qmo@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-5-andrii@kernel.org
2024-06-05libbpf: Make use of BTF field iterator in BTF handling codeAndrii Nakryiko
Use new BTF field iterator logic to replace all the callback-based visitor calls. There is still a .BTF.ext callback-based visitor APIs that should be converted, which will happens as a follow up. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-4-andrii@kernel.org
2024-06-05libbpf: Make use of BTF field iterator in BPF linker codeAndrii Nakryiko
Switch all BPF linker code dealing with iterating BTF type ID and string offset fields to new btf_field_iter facilities. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-3-andrii@kernel.org
2024-06-05libbpf: Add BTF field iteratorAndrii Nakryiko
Implement iterator-based type ID and string offset BTF field iterator. This is used extensively in BTF-handling code and BPF linker code for various sanity checks, rewriting IDs/offsets, etc. Currently this is implemented as visitor pattern calling custom callbacks, which makes the logic (especially in simple cases) unnecessarily obscure and harder to follow. Having equivalent functionality using iterator pattern makes for simpler to understand and maintain code. As we add more code for BTF processing logic in libbpf, it's best to switch to iterator pattern before adding more callback-based code. The idea for iterator-based implementation is to record offsets of necessary fields within fixed btf_type parts (which should be iterated just once), and, for kinds that have multiple members (based on vlen field), record where in each member necessary fields are located. Generic iteration code then just keeps track of last offset that was returned and handles N members correctly. Return type is just u32 pointer, where NULL is returned when all relevant fields were already iterated. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-2-andrii@kernel.org
2024-06-05sched/core: Drop spinlocks on contention iff kernel is preemptibleSean Christopherson
Use preempt_model_preemptible() to detect a preemptible kernel when deciding whether or not to reschedule in order to drop a contended spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels built with PREEMPT_DYNAMIC=y will yield contended locks even if the live preemption model is "none" or "voluntary". In short, make kernels with dynamically selected models behave the same as kernels with statically selected models. Somewhat counter-intuitively, NOT yielding a lock can provide better latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a rwlock, is often contended between an invalidation event (takes mmu_lock for write) and a vCPU servicing a guest page fault (takes mmu_lock for read). For _some_ setups, letting the invalidation task complete even if there is mmu_lock contention provides lower latency for *all* tasks, i.e. the invalidation completes sooner *and* the vCPU services the guest page fault sooner. But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior can vary depending on the host VMM, the guest workload, the number of vCPUs, the number of pCPUs in the host, why there is lock contention, etc. In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the opposite and removing contention yielding entirely) needs to come with a big pile of data proving that changing the status quo is a net positive. Opportunistically document this side effect of preempt=full, as yielding contended spinlocks can have significant, user-visible impact. Fixes: c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Link: https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
2024-06-05sched/core: Move preempt_model_*() helpers from sched.h to preempt.hSean Christopherson
Move the declarations and inlined implementations of the preempt_model_*() helpers to preempt.h so that they can be referenced in spinlock.h without creating a potential circular dependency between spinlock.h and sched.h. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com> Link: https://lkml.kernel.org/r/20240528003521.979836-2-ankur.a.arora@oracle.com
2024-06-05bcachefs: Fix trans->locked assertKent Overstreet
in bch2_move_data_btree, we might start with the trans unlocked from a previous loop iteration - we need a trans_begin() before iter_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05bcachefs: Rereplicate now moves data off of durability=0 devicesKent Overstreet
This fixes an issue where setting a device to durability=0 after it's been used makes it impossible to remove. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05bcachefs: Fix GFP_KERNEL allocation in break_cycle()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-05sched/balance: Skip unnecessary updates to idle load balancer's flagsTim Chen
We observed that the overhead on trigger_load_balance(), now renamed sched_balance_trigger(), has risen with a system's core counts. For an OLTP workload running 6.8 kernel on a 2 socket x86 systems having 96 cores/socket, we saw that 0.7% cpu cycles are spent in trigger_load_balance(). On older systems with fewer cores/socket, this function's overhead was less than 0.1%. The cause of this overhead was that there are multiple cpus calling kick_ilb(flags), updating the balancing work needed to a common idle load balancer cpu. The ilb_cpu's flags field got updated unconditionally with atomic_fetch_or(). The atomic read and writes to ilb_cpu's flags causes much cache bouncing and cpu cycles overhead. This is seen in the annotated profile below. kick_ilb(): if (ilb_cpu < 0) test %r14d,%r14d ↑ js 6c flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu)); mov $0x2d600,%rdi movslq %r14d,%r8 mov %rdi,%rdx add -0x7dd0c3e0(,%r8,8),%rdx arch_atomic_read(): 0.01 mov 0x64(%rdx),%esi 35.58 add $0x64,%rdx arch_atomic_fetch_or(): static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v) { int val = arch_atomic_read(v); do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i)); 0.03 157: mov %r12d,%ecx arch_atomic_try_cmpxchg(): return arch_try_cmpxchg(&v->counter, old, new); 0.00 mov %esi,%eax arch_atomic_fetch_or(): do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i)); or %esi,%ecx arch_atomic_try_cmpxchg(): return arch_try_cmpxchg(&v->counter, old, new); 0.01 lock cmpxchg %ecx,(%rdx) 42.96 ↓ jne 2d2 kick_ilb(): With instrumentation, we found that 81% of the updates do not result in any change in the ilb_cpu's flags. That is, multiple cpus are asking the ilb_cpu to do the same things over and over again, before the ilb_cpu has a chance to run NOHZ load balance. Skip updates to ilb_cpu's flags if no new work needs to be done. Such updates do not change ilb_cpu's NOHZ flags. This requires an extra atomic read but it is less expensive than frequent unnecessary atomic updates that generate cache bounces. We saw that on the OLTP workload, cpu cycles from trigger_load_balance() (or sched_balance_trigger()) got reduced from 0.7% to 0.2%. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com
2024-06-05idle: Remove stale RCU commentChristian Loehle
The call of rcu_idle_enter() from within cpuidle_idle_call() was removed in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle deeper into the idle path") which makes the comment out of place. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/5b936388-47df-4050-9229-6617a6c2bba5@arm.com
2024-06-05Merge branch 'mlx5-fixes'David S. Miller
Tariq Toukan says: ==================== mlx5 core fixes 20240603 This small patchset provides two bug fixes from the team to the mlx5 core driver. Series generated against: commit 33700a0c9b56 ("net/tcp: Don't consider TCP_CLOSE in TCP_AO_ESTABLISHED") ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/mlx5: Always stop health timer during driver removalShay Drory
Currently, if teardown_hca fails to execute during driver removal, mlx5 does not stop the health timer. Afterwards, mlx5 continue with driver teardown. This may lead to a UAF bug, which results in page fault Oops[1], since the health timer invokes after resources were freed. Hence, stop the health monitor even if teardown_hca fails. [1] mlx5_core 0000:18:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0) mlx5_core 0000:18:00.0: E-Switch: cleanup mlx5_core 0000:18:00.0: wait_func:1155:(pid 1967079): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource mlx5_core 0000:18:00.0: mlx5_function_close:1288:(pid 1967079): tear_down_hca failed, skip cleanup BUG: unable to handle page fault for address: ffffa26487064230 PGD 100c00067 P4D 100c00067 PUD 100e5a067 PMD 105ed7067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------- --- 6.7.0-68.fc38.x86_64 #1 Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0013.121520200651 12/15/2020 RIP: 0010:ioread32be+0x34/0x60 RSP: 0018:ffffa26480003e58 EFLAGS: 00010292 RAX: ffffa26487064200 RBX: ffff9042d08161a0 RCX: ffff904c108222c0 RDX: 000000010bbf1b80 RSI: ffffffffc055ddb0 RDI: ffffa26487064230 RBP: ffff9042d08161a0 R08: 0000000000000022 R09: ffff904c108222e8 R10: 0000000000000004 R11: 0000000000000441 R12: ffffffffc055ddb0 R13: ffffa26487064200 R14: ffffa26480003f00 R15: ffff904c108222c0 FS: 0000000000000000(0000) GS:ffff904c10800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffa26487064230 CR3: 00000002c4420006 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? __die+0x23/0x70 ? page_fault_oops+0x171/0x4e0 ? exc_page_fault+0x175/0x180 ? asm_exc_page_fault+0x26/0x30 ? __pfx_poll_health+0x10/0x10 [mlx5_core] ? __pfx_poll_health+0x10/0x10 [mlx5_core] ? ioread32be+0x34/0x60 mlx5_health_check_fatal_sensors+0x20/0x100 [mlx5_core] ? __pfx_poll_health+0x10/0x10 [mlx5_core] poll_health+0x42/0x230 [mlx5_core] ? __next_timer_interrupt+0xbc/0x110 ? __pfx_poll_health+0x10/0x10 [mlx5_core] call_timer_fn+0x21/0x130 ? __pfx_poll_health+0x10/0x10 [mlx5_core] __run_timers+0x222/0x2c0 run_timer_softirq+0x1d/0x40 __do_softirq+0xc9/0x2c8 __irq_exit_rcu+0xa6/0xc0 sysvec_apic_timer_interrupt+0x72/0x90 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:cpuidle_enter_state+0xcc/0x440 ? cpuidle_enter_state+0xbd/0x440 cpuidle_enter+0x2d/0x40 do_idle+0x20d/0x270 cpu_startup_entry+0x2a/0x30 rest_init+0xd0/0xd0 arch_call_rest_init+0xe/0x30 start_kernel+0x709/0xa90 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0x96/0xa0 secondary_startup_64_no_verify+0x18f/0x19b ---[ end trace 0000000000000000 ]--- Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/mlx5: Stop waiting for PCI if pci channel is offlineMoshe Shemesh
In case pci channel becomes offline the driver should not wait for PCI reads during health dump and recovery flow. The driver has timeout for each of these loops trying to read PCI, so it would fail anyway. However, in case of recovery waiting till timeout may cause the pci error_detected() callback fail to meet pci_dpc_recovered() wait timeout. Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net: ethernet: mtk_eth_soc: handle dma buffer size soc specificFrank Wunderlich
The mainline MTK ethernet driver suffers long time from rarly but annoying tx queue timeouts. We think that this is caused by fixed dma sizes hardcoded for all SoCs. We suspect this problem arises from a low level of free TX DMADs, the TX Ring alomost full. The transmit timeout is caused by the Tx queue not waking up. The Tx queue stops when the free counter is less than ring->thres, and it will wake up once the free counter is greater than ring->thres. If the CPU is too late to wake up the Tx queues, it may cause a transmit timeout. Therefore, we increased the TX and RX DMADs to improve this error situation. Use the dma-size implementation from SDK in a per SoC manner. In difference to SDK we have no RSS feature yet, so all RX/TX sizes should be raised from 512 to 2048 byte except fqdma on mt7988 to avoid the tx timeout issue. Fixes: 656e705243fd ("net-next: mediatek: add support for MT7623 ethernet") Suggested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05Merge branch 'tcp-rto-min-us'David S. Miller
Kevin Yang says: ==================== tcp: add sysctl_tcp_rto_min_us Adding a sysctl knob to allow user to specify a default rto_min at socket init time. After this patch series, the rto_min will has multiple sources: route option has the highest precedence, followed by the TCP_BPF_RTO_MIN socket option, followed by this new tcp_rto_min_us sysctl. v3: fix typo, simplify min/max_t to min/max v2: fit line width to 80 column. v2: https://lore.kernel.org/netdev/20240530153436.2202800-1-yyd@google.com/ v1: https://lore.kernel.org/netdev/20240528171320.1332292-1-yyd@google.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: add sysctl_tcp_rto_min_usKevin Yang
Adding a sysctl knob to allow user to specify a default rto_min at socket init time, other than using the hard coded 200ms default rto_min. Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN socket option, followed by the tcp_rto_min_us sysctl. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: derive delack_max with tcp_rto_min helperKevin Yang
Rto_min now has multiple sources, ordered by preprecedence high to low: ip route option rto_min, icsk->icsk_rto_min. When derive delack_max from rto_min, we should not only use ip route option, but should use tcp_rto_min helper to get the correct rto_min. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05sched/headers: Move struct pre-declarations to the beginning of the headerIngo Molnar
There's a random number of structure pre-declaration lines in kernel/sched/sched.h, some of which are unnecessary duplicates. Move them to the head & order them a bit for readability. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-05sched/core: Clean up kernel/sched/sched.h a bitIngo Molnar
- Fix whitespace noise - Fix col80 linebreak damage where possible - Apply CodingStyle consistently - Use consistent #else and #endif comments - Use consistent vertical alignment - Use 'extern' consistently Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2024-06-05rtnetlink: make the "split" NLM_DONE handling genericJakub Kicinski
Jaroslav reports Dell's OMSA Systems Management Data Engine expects NLM_DONE in a separate recvmsg(), both for rtnl_dump_ifinfo() and inet_dump_ifaddr(). We already added a similar fix previously in commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again") Instead of modifying all the dump handlers, and making them look different than modern for_each_netdev_dump()-based dump handlers - put the workaround in rtnetlink code. This will also help us move the custom rtnl-locking from af_netlink in the future (in net-next). Note that this change is not touching rtnl_dump_all(). rtnl_dump_all() is different kettle of fish and a potential problem. We now mix families in a single recvmsg(), but NLM_DONE is not coalesced. Tested: ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \ --dump getaddr --json '{"ifa-family": 2}' ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \ --dump getroute --json '{"rtm-family": 2}' ./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_link.yaml \ --dump getlink Fixes: 3e41af90767d ("rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo()") Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()") Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05Merge branch 'tcp-mptcp-close-wait'David S. Miller
Jason Xing says: ==================== tcp/mptcp: count CLOSE-WAIT for CurrEstab Taking CLOSE-WAIT sockets into CurrEstab counters is in accordance with RFC 1213, as suggested by Eric and Neal. v5 Link: https://lore.kernel.org/all/20240531091753.75930-1-kerneljasonxing@gmail.com/ 1. add more detailed comment (Matthieu) v4 Link: https://lore.kernel.org/all/20240530131308.59737-1-kerneljasonxing@gmail.com/ 1. correct the Fixes: tag in patch [2/2]. (Eric) Previous discussion Link: https://lore.kernel.org/all/20240529033104.33882-1-kerneljasonxing@gmail.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05mptcp: count CLOSE-WAIT sockets for MPTCP_MIB_CURRESTABJason Xing
Like previous patch does in TCP, we need to adhere to RFC 1213: "tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT." So let's consider CLOSE-WAIT sockets. The logic of counting When we increment the counter? a) Only if we change the state to ESTABLISHED. When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK. Fixes: d9cd27b8cd19 ("mptcp: add CurrEstab MIB counter support") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTABJason Xing
According to RFC 1213, we should also take CLOSE-WAIT sockets into consideration: "tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT." After this, CurrEstab counter will display the total number of ESTABLISHED and CLOSE-WAIT sockets. The logic of counting When we increment the counter? a) if we change the state to ESTABLISHED. b) if we change the state from SYN-RECEIVED to CLOSE-WAIT. When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK. Please note: there are two chances that old state of socket can be changed to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED. So we have to take care of the former case. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>