summaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)Author
18 hoursMerge tag 'net-next-6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Wrap datapath globals into net_aligned_data, to avoid false sharing - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container) - Add SO_INQ and SCM_INQ support to AF_UNIX - Add SIOCINQ support to AF_VSOCK - Add TCP_MAXSEG sockopt to MPTCP - Add IPv6 force_forwarding sysctl to enable forwarding per interface - Make TCP validation of whether packet fully fits in the receive window and the rcv_buf more strict. With increased use of HW aggregation a single "packet" can be multiple 100s of kB - Add MSG_MORE flag to optimize large TCP transmissions via sockmap, improves latency up to 33% for sockmap users - Convert TCP send queue handling from tasklet to BH workque - Improve BPF iteration over TCP sockets to see each socket exactly once - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code - Support enabling kernel threads for NAPI processing on per-NAPI instance basis rather than a whole device. Fully stop the kernel NAPI thread when threaded NAPI gets disabled. Previously thread would stick around until ifdown due to tricky synchronization - Allow multicast routing to take effect on locally-generated packets - Add output interface argument for End.X in segment routing - MCTP: add support for gateway routing, improve bind() handling - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink - Add a new neighbor flag ("extern_valid"), which cedes refresh responsibilities to userspace. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed - Support NUD_PERMANENT for proxy neighbor entries - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM - Add sequence numbers to netconsole messages. Unregister netconsole's console when all net targets are removed. Code refactoring. Add a number of selftests - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol should be used for an inbound SA lookup - Support inspecting ref_tracker state via DebugFS - Don't force bonding advertisement frames tx to ~333 ms boundaries. Add broadcast_neighbor option to send ARP/ND on all bonded links - Allow providing upcall pid for the 'execute' command in openvswitch - Remove DCCP support from Netfilter's conntrack - Disallow multiple packet duplications in the queuing layer - Prevent use of deprecated iptables code on PREEMPT_RT Driver API: - Support RSS and hashing configuration over ethtool Netlink - Add dedicated ethtool callbacks for getting and setting hashing fields - Add support for power budget evaluation strategy in PSE / Power-over-Ethernet. Generate Netlink events for overcurrent etc - Support DPLL phase offset monitoring across all device inputs. Support providing clock reference and SYNC over separate DPLL inputs - Support traffic classes in devlink rate API for bandwidth management - Remove rtnl_lock dependency from UDP tunnel port configuration Device drivers: - Add a new Broadcom driver for 800G Ethernet (bnge) - Add a standalone driver for Microchip ZL3073x DPLL - Remove IBM's NETIUCV device driver - Ethernet high-speed NICs: - Broadcom (bnxt): - support zero-copy Tx of DMABUF memory - take page size into account for page pool recycling rings - Intel (100G, ice, idpf): - idpf: XDP and AF_XDP support preparations - idpf: add flow steering - add link_down_events statistic - clean up the TSPLL code - preparations for live VM migration - nVidia/Mellanox: - support zero-copy Rx/Tx interfaces (DMABUF and io_uring) - optimize context memory usage for matchers - expose serial numbers in devlink info - support PCIe congestion metrics - Meta (fbnic): - add 25G, 50G, and 100G link modes to phylink - support dumping FW logs - Marvell/Cavium: - support for CN20K generation of the Octeon chips - Amazon: - add HW clock (without timestamping, just hypervisor time access) - Ethernet virtual: - VirtIO net: - support segmentation of UDP-tunnel-encapsulated packets - Google (gve): - support packet timestamping and clock synchronization - Microsoft vNIC: - add handler for device-originated servicing events - allow dynamic MSI-X vector allocation - support Tx bandwidth clamping - Ethernet NICs consumer, and embedded: - AMD: - amd-xgbe: hardware timestamping and PTP clock support - Broadcom integrated MACs (bcmgenet, bcmasp): - use napi_complete_done() return value to support NAPI polling - add support for re-starting auto-negotiation - Broadcom switches (b53): - support BCM5325 switches - add bcm63xx EPHY power control - Synopsys (stmmac): - lots of code refactoring and cleanups - TI: - icssg-prueth: read firmware-names from device tree - icssg: PRP offload support - Microchip: - lan78xx: convert to PHYLINK for improved PHY and MAC management - ksz: add KSZ8463 switch support - Intel: - support similar queue priority scheme in multi-queue and time-sensitive networking (taprio) - support packet pre-emption in both - RealTek (r8169): - enable EEE at 5Gbps on RTL8126 - Airoha: - add PPPoE offload support - MDIO bus controller for Airoha AN7583 - Ethernet PHYs: - support for the IPQ5018 internal GE PHY - micrel KSZ9477 switch-integrated PHYs: - add MDI/MDI-X control support - add RX error counters - add cable test support - add Signal Quality Indicator (SQI) reporting - dp83tg720: improve reset handling and reduce link recovery time - support bcm54811 (and its MII-Lite interface type) - air_en8811h: support resume/suspend - support PHY counters for QCA807x and QCA808x - support WoL for QCA807x - CAN drivers: - rcar_canfd: support for Transceiver Delay Compensation - kvaser: report FW versions via devlink dev info - WiFi: - extended regulatory info support (6 GHz) - add statistics and beacon monitor for Multi-Link Operation (MLO) - support S1G aggregation, improve S1G support - add Radio Measurement action fields - support per-radio RTS threshold - some work around how FIPS affects wifi, which was wrong (RC4 is used by TKIP, not only WEP) - improvements for unsolicited probe response handling - WiFi drivers: - RealTek (rtw88): - IBSS mode for SDIO devices - RealTek (rtw89): - BT coexistence for MLO/WiFi7 - concurrent station + P2P support - support for USB devices RTL8851BU/RTL8852BU - Intel (iwlwifi): - use embedded PNVM in (to be released) FW images to fix compatibility issues - many cleanups (unused FW APIs, PCIe code, WoWLAN) - some FIPS interoperability - MediaTek (mt76): - firmware recovery improvements - more MLO work - Qualcomm/Atheros (ath12k): - fix scan on multi-radio devices - more EHT/Wi-Fi 7 features - encapsulation/decapsulation offload - Broadcom (brcm80211): - support SDIO 43751 device - Bluetooth: - hci_event: add support for handling LE BIG Sync Lost event - ISO: add socket option to report packet seqnum via CMSG - ISO: support SCM_TIMESTAMPING for ISO TS - Bluetooth drivers: - intel_pcie: support Function Level Reset - nxpuart: add support for 4M baudrate - nxpuart: implement powerup sequence, reset, FW dump, and FW loading" * tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits) dpll: zl3073x: Fix build failure selftests: bpf: fix legacy netfilter options ipv6: annotate data-races around rt->fib6_nsiblings ipv6: fix possible infinite loop in fib6_info_uses_dev() ipv6: prevent infinite loop in rt6_nlmsg_size() ipv6: add a retry logic in net6_rt_notify() vrf: Drop existing dst reference in vrf_ip6_input_dst net/sched: taprio: align entry index attr validation with mqprio net: fsl_pq_mdio: use dev_err_probe selftests: rtnetlink.sh: remove esp4_offload after test vsock: remove unnecessary null check in vsock_getname() igb: xsk: solve negative overflow of nb_pkts in zerocopy mode stmmac: xsk: fix negative overflow of budget in zerocopy mode dt-bindings: ieee802154: Convert at86rf230.txt yaml format net: dsa: microchip: Disable PTP function of KSZ8463 net: dsa: microchip: Setup fiber ports for KSZ8463 net: dsa: microchip: Write switch MAC address differently for KSZ8463 net: dsa: microchip: Use different registers for KSZ8463 net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver dt-bindings: net: dsa: microchip: Add KSZ8463 switch support ...
38 hoursMerge tag 'driver-core-6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Danilo Krummrich: "debugfs: - Remove unneeded debugfs_file_{get,put}() instances - Remove last remnants of debugfs_real_fops() - Allow storing non-const void * in struct debugfs_inode_info::aux sysfs: - Switch back to attribute_group::bin_attrs (treewide) - Switch back to bin_attribute::read()/write() (treewide) - Constify internal references to 'struct bin_attribute' Support cache-ids for device-tree systems: - Add arch hook arch_compact_of_hwid() - Use arch_compact_of_hwid() to compact MPIDR values on arm64 Rust: - Device: - Introduce CoreInternal device context (for bus internal methods) - Provide generic drvdata accessors for bus devices - Provide Driver::unbind() callbacks - Use the infrastructure above for auxiliary, PCI and platform - Implement Device::as_bound() - Rename Device::as_ref() to Device::from_raw() (treewide) - Implement fwnode and device property abstractions - Implement example usage in the Rust platform sample driver - Devres: - Remove the inner reference count (Arc) and use pin-init instead - Replace Devres::new_foreign_owned() with devres::register() - Require T to be Send in Devres<T> - Initialize the data kept inside a Devres last - Provide an accessor for the Devres associated Device - Device ID: - Add support for ACPI device IDs and driver match tables - Split up generic device ID infrastructure - Use generic device ID infrastructure in net::phy - DMA: - Implement the dma::Device trait - Add DMA mask accessors to dma::Device - Implement dma::Device for PCI and platform devices - Use DMA masks from the DMA sample module - I/O: - Implement abstraction for resource regions (struct resource) - Implement resource-based ioremap() abstractions - Provide platform device accessors for I/O (remap) requests - Misc: - Support fallible PinInit types in Revocable - Implement Wrapper<T> for Opaque<T> - Merge pin-init blanket dependencies (for Devres) Misc: - Fix OF node leak in auxiliary_device_create() - Use util macros in device property iterators - Improve kobject sample code - Add device_link_test() for testing device link flags - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits - Hint to prefer container_of_const() over container_of()" * tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits) rust: io: fix broken intra-doc links to `platform::Device` rust: io: fix broken intra-doc link to missing `flags` module rust: io: mem: enable IoRequest doc-tests rust: platform: add resource accessors rust: io: mem: add a generic iomem abstraction rust: io: add resource abstraction rust: samples: dma: set DMA mask rust: platform: implement the `dma::Device` trait rust: pci: implement the `dma::Device` trait rust: dma: add DMA addressing capabilities rust: dma: implement `dma::Device` trait rust: net::phy Change module_phy_driver macro to use module_device_table macro rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id rust: device_id: split out index support into a separate trait device: rust: rename Device::as_ref() to Device::from_raw() arm64: cacheinfo: Provide helper to compress MPIDR value into u32 cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id cacheinfo: Set cache 'id' based on DT data container_of: Document container_of() is not to be used in new code driver core: auxiliary bus: fix OF node leak ...
12 daysnet: s/dev_get_flags/netif_get_flags/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-15bnxt: move bnxt_hsi.h to include/linux/bnxt/hsi.hAndy Gospodarek
This moves bnxt_hsi.h contents to a common location so it can be properly referenced by bnxt_en, bnxt_re, and bnge. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20250714170202.39688-1-gospo@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-14 * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: IFC updates for disabled host PF net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc RDMA/mlx5: Fix UMR modifying of mkey page size net/mlx5: Expose HCA capability bits for mkey max page size ==================== Link: https://patch.msgid.link/1752481357-34780-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-13RDMA/mlx5: Fix UMR modifying of mkey page sizeEdward Srouji
When changing the page size on an mkey, the driver needs to set the appropriate bits in the mkey mask to indicate which fields are being modified. The 6th bit of a page size in mlx5 driver is considered an extension, and this bit has a dedicated capability and mask bits. Previously, the driver was not setting this mask in the mkey mask when performing page size changes, regardless of its hardware support, potentially leading to an incorrect page size updates. This fixes the issue by setting the relevant bit in the mkey mask when performing page size changes on an mkey and the 6th bit of this field is supported by the hardware. Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits") Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc6). No conflicts. Adjacent changes: Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml 0a12c435a1d6 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible") b3603c0466a8 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09hfi1: get rid of redundant debugfs_file_{get,put}()Al Viro
All files in question are created via debugfs_create_file(), so exclusion with removals is provided by debugfs wrappers; as the matter of fact, hfi1-private wrappers had been redundant at least since 2017... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702211508.GB3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-08Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-08 The following pull-request contains common mlx5 updates for your *net-next* tree. v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain ==================== Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-03Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Driver fixes plus core sd.c fix are all small and obvious. The larger change to hosts.c is less obvious, but required to avoid data corruption caused by bio splitting" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ufs: core: Fix spelling of a sysfs attribute name scsi: core: Enforce unlimited max_segment_size when virt_boundary_mask is set scsi: RDMA/srp: Don't set a max_segment_size when virt_boundary_mask is set scsi: sd: Fix VPD page 0xb7 length check scsi: qla4xxx: Fix missing DMA mapping error in qla4xxx_alloc_pdu() scsi: qla2xxx: Fix DMA mapping test in qla24xx_get_port_database()
2025-07-02net/mlx5: Check device memory pointer before usageStav Aviram
Add a NULL check before accessing device memory to prevent a crash if dev->dm allocation in mlx5_init_once() fails. Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core") Signed-off-by: Stav Aviram <saviram@nvidia.com> Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix vport loopback for MPV devicePatrisious Haddad
Always enable vport loopback for both MPV devices on driver start. Previously in some cases related to MPV RoCE, packets weren't correctly executing loopback check at vport in FW, since it was disabled. Due to complexity of identifying such cases for MPV always enable vport loopback for both GVMIs when binding the slave to the master port. Fixes: 0042f9e458a5 ("RDMA/mlx5: Enable vport loopback when user context or QP mandate") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/d4298f5ebb2197459e9e7221c51ecd6a34699847.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix CC counters query for MPVPatrisious Haddad
In case, CC counters are querying for the second port use the correct core device for the query instead of always using the master core device. Fixes: aac4492ef23a ("IB/mlx5: Update counter implementation for dual port RoCE") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9cace74dcf106116118bebfa9146d40d4166c6b0.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix HW counters query for non-representor devicesPatrisious Haddad
To get the device HW counters, a non-representor switchdev device should use the mlx5_ib_query_q_counters() function and query all of the available counters. While a representor device in switchdev mode should use the mlx5_ib_query_q_counters_vport() function and query only the Q_Counters without the PPCNT counters and congestion control counters, since they aren't relevant for a representor device. Currently a non-representor switchdev device skips querying the PPCNT counters and congestion control counters, leaving them unupdated. Fix that by properly querying those counters for non-representor devices. Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/56bf8af4ca8c58e3fb9f7e47b1dca2009eeeed81.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25IB/core: Annotate umem_mutex acquisition under fs_reclaim for lockdepOr Har-Toov
Following the fix in the previous commit ("IB/mlx5: Fix potential deadlock in MR deregistration"), teach lockdep explicitly about the locking order between fs_reclaim and umem_mutex. The previous commit resolved a potential deadlock scenario where kzalloc(GFP_KERNEL) was called while holding umem_mutex, which could lead to reclaim and eventually invoke the MMU notifier (mlx5_ib_invalidate_range()), causing a recursive acquisition of umem_mutex. To prevent such issues from reoccurring unnoticed in future code changes, add a lockdep annotation in ib_init_umem_odp() that simulates taking umem_mutex inside a reclaim context. This makes lockdep aware of this locking dependency and ensures that future violations—such as calling kzalloc() or any memory allocator that may enter reclaim while holding umem_mutex—will immediately raise a lockdep warning. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9d31b9d8fe1db648a9f47cec3df6b8463319dee5.1750061698.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25IB/mlx5: Fix potential deadlock in MR deregistrationOr Har-Toov
The issue arises when kzalloc() is invoked while holding umem_mutex or any other lock acquired under umem_mutex. This is problematic because kzalloc() can trigger fs_reclaim_aqcuire(), which may, in turn, invoke mmu_notifier_invalidate_range_start(). This function can lead to mlx5_ib_invalidate_range(), which attempts to acquire umem_mutex again, resulting in a deadlock. The problematic flow: CPU0 | CPU1 ---------------------------------------|------------------------------------------------ mlx5_ib_dereg_mr() | → revoke_mr() | → mutex_lock(&umem_odp->umem_mutex) | | mlx5_mkey_cache_init() | → mutex_lock(&dev->cache.rb_lock) | → mlx5r_cache_create_ent_locked() | → kzalloc(GFP_KERNEL) | → fs_reclaim() | → mmu_notifier_invalidate_range_start() | → mlx5_ib_invalidate_range() | → mutex_lock(&umem_odp->umem_mutex) → cache_ent_find_and_store() | → mutex_lock(&dev->cache.rb_lock) | Additionally, when kzalloc() is called from within cache_ent_find_and_store(), we encounter the same deadlock due to re-acquisition of umem_mutex. Solve by releasing umem_mutex in dereg_mr() after umr_revoke_mr() and before acquiring rb_lock. This ensures that we don't hold umem_mutex while performing memory allocations that could trigger the reclaim path. This change prevents the deadlock by ensuring proper lock ordering and avoiding holding locks during memory allocation operations that could trigger the reclaim path. The following lockdep warning demonstrates the deadlock: python3/20557 is trying to acquire lock: ffff888387542128 (&umem_odp->umem_mutex){+.+.}-{4:4}, at: mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib] but task is already holding lock: ffffffff82f6b840 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: unmap_vmas+0x7b/0x1a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}: fs_reclaim_acquire+0x60/0xd0 mem_cgroup_css_alloc+0x6f/0x9b0 cgroup_init_subsys+0xa4/0x240 cgroup_init+0x1c8/0x510 start_kernel+0x747/0x760 x86_64_start_reservations+0x25/0x30 x86_64_start_kernel+0x73/0x80 common_startup_64+0x129/0x138 -> #2 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x91/0xd0 __kmalloc_cache_noprof+0x4d/0x4c0 mlx5r_cache_create_ent_locked+0x75/0x620 [mlx5_ib] mlx5_mkey_cache_init+0x186/0x360 [mlx5_ib] mlx5_ib_stage_post_ib_reg_umr_init+0x3c/0x60 [mlx5_ib] __mlx5_ib_add+0x4b/0x190 [mlx5_ib] mlx5r_probe+0xd9/0x320 [mlx5_ib] auxiliary_bus_probe+0x42/0x70 really_probe+0xdb/0x360 __driver_probe_device+0x8f/0x130 driver_probe_device+0x1f/0xb0 __driver_attach+0xd4/0x1f0 bus_for_each_dev+0x79/0xd0 bus_add_driver+0xf0/0x200 driver_register+0x6e/0xc0 __auxiliary_driver_register+0x6a/0xc0 do_one_initcall+0x5e/0x390 do_init_module+0x88/0x240 init_module_from_file+0x85/0xc0 idempotent_init_module+0x104/0x300 __x64_sys_finit_module+0x68/0xc0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #1 (&dev->cache.rb_lock){+.+.}-{4:4}: __mutex_lock+0x98/0xf10 __mlx5_ib_dereg_mr+0x6f2/0x890 [mlx5_ib] mlx5_ib_dereg_mr+0x21/0x110 [mlx5_ib] ib_dereg_mr_user+0x85/0x1f0 [ib_core] uverbs_free_mr+0x19/0x30 [ib_uverbs] destroy_hw_idr_uobject+0x21/0x80 [ib_uverbs] uverbs_destroy_uobject+0x60/0x3d0 [ib_uverbs] uobj_destroy+0x57/0xa0 [ib_uverbs] ib_uverbs_cmd_verbs+0x4d5/0x1210 [ib_uverbs] ib_uverbs_ioctl+0x129/0x230 [ib_uverbs] __x64_sys_ioctl+0x596/0xaa0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (&umem_odp->umem_mutex){+.+.}-{4:4}: __lock_acquire+0x1826/0x2f00 lock_acquire+0xd3/0x2e0 __mutex_lock+0x98/0xf10 mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib] __mmu_notifier_invalidate_range_start+0x18e/0x1f0 unmap_vmas+0x182/0x1a0 exit_mmap+0xf3/0x4a0 mmput+0x3a/0x100 do_exit+0x2b9/0xa90 do_group_exit+0x32/0xa0 get_signal+0xc32/0xcb0 arch_do_signal_or_restart+0x29/0x1d0 syscall_exit_to_user_mode+0x105/0x1d0 do_syscall_64+0x79/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Chain exists of: &dev->cache.rb_lock --> mmu_notifier_invalidate_range_start --> &umem_odp->umem_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&umem_odp->umem_mutex); lock(mmu_notifier_invalidate_range_start); lock(&umem_odp->umem_mutex); lock(&dev->cache.rb_lock); *** DEADLOCK *** Fixes: abb604a1a9c8 ("RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/3c8f225a8a9fade647d19b014df1172544643e4a.1750061612.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-24scsi: RDMA/srp: Don't set a max_segment_size when virt_boundary_mask is setChristoph Hellwig
virt_boundary_mask implies an unlimited max_segment_size. Setting both can lead to data corruption because __blk_rq_map_sg() can split requests so that the virt_boundary_mask is not respected if max_segment_size is not UINT_MAX. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250624125233.219635-2-hch@lst.de Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Acked-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-06-17RDMA/mlx5: Initialize obj_event->obj_sub_list before xa_insertMark Zhang
The obj_event may be loaded immediately after inserted, then if the list_head is not initialized then we may get a poisonous pointer. This fixes the crash below: mlx5_core 0000:03:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.4: firmware version: 32.38.3056 mlx5_core 0000:03:00.0 en3f0pf0sf2002: renamed from eth0 mlx5_core.sf mlx5_core.sf.4: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps IPv6: ADDRCONF(NETDEV_CHANGE): en3f0pf0sf2002: link becomes ready Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000007760fb000 [0000000000000060] pgd=000000076f6d7003, p4d=000000076f6d7003, pud=0000000777841003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] SMP Modules linked in: ipmb_host(OE) act_mirred(E) cls_flower(E) sch_ingress(E) mptcp_diag(E) udp_diag(E) raw_diag(E) unix_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) bonding(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) isofs(E) cdrom(E) mst_pciconf(OE) ib_umad(OE) mlx5_ib(OE) ipmb_dev_int(OE) mlx5_core(OE) kpatch_15237886(OEK) mlxdevm(OE) auxiliary(OE) ib_uverbs(OE) ib_core(OE) psample(E) mlxfw(OE) tls(E) sunrpc(E) vfat(E) fat(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) sbsa_gwdt(E) virtio_console(E) ext4(E) mbcache(E) jbd2(E) xfs(E) libcrc32c(E) mmc_block(E) virtio_net(E) net_failover(E) failover(E) sha2_ce(E) sha256_arm64(E) nvme(OE) nvme_core(OE) gpio_mlxbf3(OE) mlx_compat(OE) mlxbf_pmc(OE) i2c_mlxbf(OE) sdhci_of_dwcmshc(OE) pinctrl_mlxbf3(OE) mlxbf_pka(OE) gpio_generic(E) i2c_core(E) mmc_core(E) mlxbf_gige(OE) vitesse(E) pwr_mlxbf(OE) mlxbf_tmfifo(OE) micrel(E) mlxbf_bootctl(OE) virtio_ring(E) virtio(E) ipmi_devintf(E) ipmi_msghandler(E) [last unloaded: mst_pci] CPU: 11 PID: 20913 Comm: rte-worker-11 Kdump: loaded Tainted: G OE K 5.10.134-13.1.an8.aarch64 #1 Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.2.2.12968 Oct 26 2023 pstate: a0400089 (NzCv daIf +PAN -UAO -TCO BTYPE=--) pc : dispatch_event_fd+0x68/0x300 [mlx5_ib] lr : devx_event_notifier+0xcc/0x228 [mlx5_ib] sp : ffff80001005bcf0 x29: ffff80001005bcf0 x28: 0000000000000001 x27: ffff244e0740a1d8 x26: ffff244e0740a1d0 x25: ffffda56beff5ae0 x24: ffffda56bf911618 x23: ffff244e0596a480 x22: ffff244e0596a480 x21: ffff244d8312ad90 x20: ffff244e0596a480 x19: fffffffffffffff0 x18: 0000000000000000 x17: 0000000000000000 x16: ffffda56be66d620 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000040 x10: ffffda56bfcafb50 x9 : ffffda5655c25f2c x8 : 0000000000000010 x7 : 0000000000000000 x6 : ffff24545a2e24b8 x5 : 0000000000000003 x4 : ffff80001005bd28 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff244e0596a480 x0 : ffff244d8312ad90 Call trace: dispatch_event_fd+0x68/0x300 [mlx5_ib] devx_event_notifier+0xcc/0x228 [mlx5_ib] atomic_notifier_call_chain+0x58/0x80 mlx5_eq_async_int+0x148/0x2b0 [mlx5_core] atomic_notifier_call_chain+0x58/0x80 irq_int_handler+0x20/0x30 [mlx5_core] __handle_irq_event_percpu+0x60/0x220 handle_irq_event_percpu+0x3c/0x90 handle_irq_event+0x58/0x158 handle_fasteoi_irq+0xfc/0x188 generic_handle_irq+0x34/0x48 ... Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Link: https://patch.msgid.link/r/3ce7f20e0d1a03dc7de6e57494ec4b8eaf1f05c2.1750147949.git.leon@kernel.org Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17RDMA/core: Rate limit GID cache warning messagesMaor Gottlieb
The GID cache warning messages can flood the kernel log when there are multiple failed attempts to add GIDs. This can happen when creating many virtual interfaces without having enough space for their GIDs in the GID table. Change pr_warn to pr_warn_ratelimited to prevent log flooding while still maintaining visibility of the issue. Link: https://patch.msgid.link/r/fd45ed4a1078e743f498b234c3ae816610ba1b18.1750062357.git.leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17RDMA/mlx5: Fix unsafe xarray access in implicit ODP handlingOr Har-Toov
__xa_store() and __xa_erase() were used without holding the proper lock, which led to a lockdep warning due to unsafe RCU usage. This patch replaces them with xa_store() and xa_erase(), which perform the necessary locking internally. ============================= WARNING: suspicious RCPU usage 6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Not tainted ----------------------------- ./include/linux/xarray.h:1211 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/u136:0/219: at: process_one_work+0xbe4/0x15f0 process_one_work+0x75c/0x15f0 pagefault_mr+0x9a5/0x1390 [mlx5_ib] stack backtrace: CPU: 14 UID: 0 PID: 219 Comm: kworker/u136:0 Not tainted 6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib] Call Trace: dump_stack_lvl+0xa8/0xc0 lockdep_rcu_suspicious+0x1e6/0x260 xas_create+0xb8a/0xee0 xas_store+0x73/0x14c0 __xa_store+0x13c/0x220 ? xa_store_range+0x390/0x390 ? spin_bug+0x1d0/0x1d0 pagefault_mr+0xcb5/0x1390 [mlx5_ib] ? _raw_spin_unlock+0x1f/0x30 mlx5_ib_eqe_pf_action+0x3be/0x2620 [mlx5_ib] ? lockdep_hardirqs_on_prepare+0x400/0x400 ? mlx5_ib_invalidate_range+0xcb0/0xcb0 [mlx5_ib] process_one_work+0x7db/0x15f0 ? pwq_dec_nr_in_flight+0xda0/0xda0 ? assign_work+0x168/0x240 worker_thread+0x57d/0xcd0 ? rescuer_thread+0xc40/0xc40 kthread+0x3b3/0x800 ? kthread_is_per_cpu+0xb0/0xb0 ? lock_downgrade+0x680/0x680 ? do_raw_spin_lock+0x12d/0x270 ? spin_bug+0x1d0/0x1d0 ? finish_task_switch.isra.0+0x284/0x9e0 ? lockdep_hardirqs_on_prepare+0x284/0x400 ? kthread_is_per_cpu+0xb0/0xb0 ret_from_fork+0x2d/0x70 ? kthread_is_per_cpu+0xb0/0xb0 ret_from_fork_asm+0x11/0x20 Fixes: d3d930411ce3 ("RDMA/mlx5: Fix implicit ODP use after free") Link: https://patch.msgid.link/r/a85ddd16f45c8cb2bc0a188c2b0fcedfce975eb8.1750061791.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17sysfs: treewide: switch back to attribute_group::bin_attrsThomas Weißschuh
The normal bin_attrs field can now handle const pointers. This makes the _new variant unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-12RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanupArnd Bergmann
This function has an array of eight mlx5_async_cmd structures, which often fits on the stack, but depending on the configuration can end up blowing the stack frame warning limit: drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than] Change this to a dynamic allocation instead. While a kmalloc() can theoretically fail, a GFP_KERNEL allocation under a page will block until memory has been freed up, so in the worst case, this only adds extra time in an already constrained environment. Fixes: 7c891a4dbcc1 ("RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250610092846.2642535-1-arnd@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Usual collection of driver fixes: - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw - Further ODP functionality in rxe - Remote access MRs in mana, along with more page sizes - Improve CM scalability with a rwlock around the agent - More trace points for hns - ODP hmm conversion to the new two step dma API - Support the ethernet HW device in mana as well as the RNIC - Cleanups: - Use secs_to_jiffies() when appropriate - Use ERR_CAST() instead of naked casts - Don't use %pK in printk - Unusued functions removed - Allocation type matching" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits) RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work RDMA/bnxt_re: Support extended stats for Thor2 VF RDMA/hns: Fix endian issue in trace events RDMA/mlx5: Avoid flexible array warning IB/cm: Remove dead code and adjust naming RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices RDMA/rxe: Break endless pagefault loop for RO pages RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc RDMA/bnxt_re: Fix missing error handling for tx_queue RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output RDMA/mlx5: Add support for 200Gbps per lane speeds RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction net: mana: Add support for auxiliary device servicing events RDMA/mana_ib: unify mana_ib functions to support any gdma device RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic net: mana: Probe rdma device in mana driver RDMA/siw: replace redundant ternary operator with just rv RDMA/umem: Separate implicit ODP initialization from explicit ODP RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage ...
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-26RDMA/cma: Fix hang when cma_netevent_callback fails to queue_workJack Morgenstein
The cited commit fixed a crash when cma_netevent_callback was called for a cma_id while work on that id from a previous call had not yet started. The work item was re-initialized in the second call, which corrupted the work item currently in the work queue. However, it left a problem when queue_work fails (because the item is still pending in the work queue from a previous call). In this case, cma_id_put (which is called in the work handler) is therefore not called. This results in a userspace process hang (zombie process). Fix this by calling cma_id_put() if queue_work fails. Fixes: 45f5dcdd0497 ("RDMA/cma: Fix workqueue crash in cma_netevent_work_handler") Link: https://patch.msgid.link/r/4f3640b501e48d0166f312a64fdadf72b059bd04.1747827103.git.leon@kernel.org Signed-off-by: Jack Morgenstein <jackm@nvidia.com> Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26Merge tag 'v6.15' into rdma.git for-nextJason Gunthorpe
Following patches need the RDMA rc branch since we are past the RC cycle now. Merge conflicts resolved based on Linux-next: - For RXE odp changes keep for-next version and fixup new places that need to call is_odp_mr() https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au - irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info change from for-next https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-25RDMA/bnxt_re: Support extended stats for Thor2 VFAjit Khaparde
The driver currently checks if the user is querying VF RoCE statistics. It will not send the query_roce_stats_ext HWRM command if it is for a VF. But Thor2 VF can support extended statistics. Allow query of extended stats for Thor2 VFs. Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Link: https://patch.msgid.link/20250523075952.1267827-1-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25RDMA/hns: Fix endian issue in trace eventsJunxian Huang
Avoid using __le32 directly in trace events to fix sparse complains: drivers/infiniband/hw/hns/./hns_roce_trace.h:48:1: sparse: sparse: cast to restricted __le32 drivers/infiniband/hw/hns/./hns_roce_trace.h:48:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:48:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:90:1: sparse: sparse: cast to restricted __le32 drivers/infiniband/hw/hns/./hns_roce_trace.h:90:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:90:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:173:1: sparse: sparse: cast to restricted __le32 drivers/infiniband/hw/hns/./hns_roce_trace.h:173:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:173:1: sparse: sparse: restricted __le32 degrades to integer Fixes: 6c98c8670806 ("RDMA/hns: Add trace for WQE dumping") Fixes: 1e63e2f96613 ("RDMA/hns: Add trace for AEQE dumping") Fixes: 6bd18dabf1c9 ("RDMA/hns: Add trace for CMDQ dumping") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505170327.TNOpreil-lkp@intel.com/ Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250523023433.2171003-1-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25RDMA/mlx5: Avoid flexible array warningLeon Romanovsky
The following warning is reported by sparse tool: drivers/infiniband/hw/mlx5/fs.c:1664:26: warning: array of flexible structures Avoid it by simply splitting array into two separate structs. Link: https://patch.msgid.link/7b891b96a9fc053d01284c184d25ae98d35db2d4.1747827041.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-25IB/cm: Remove dead code and adjust namingVlad Dumitrescu
Drop ib_send_cm_mra parameters which are always constant. Remove branch which is never taken. Adjust name to ib_prepare_cm_mra, which better reflects its functionality - no MRA is actually sent. Adjust name of related tracepoints. Push setting of the constant service timeout to cm.c and drop IB_CM_MRA_FLAG_DELAY. Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Link: https://patch.msgid.link/cdd2a237acf2b495c19ce02e4b1c42c41c6751c2.1747827207.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devicesDaisuke Matsuda
Drivers such as rxe, which use virtual DMA, must not call into the DMA mapping core since they lack physical DMA capabilities. Otherwise, a NULL pointer dereference is observed as shown below. This patch ensures the RDMA core handles virtual and physical DMA paths appropriately. This fixes the following kernel oops: BUG: kernel NULL pointer dereference, address: 00000000000002fc #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1028eb067 P4D 1028eb067 PUD 105da0067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 3 UID: 1000 PID: 1854 Comm: python3 Tainted: G W 6.15.0-rc1+ #11 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: Trigkey Key N/Key N, BIOS KEYN101 09/02/2024 RIP: 0010:hmm_dma_map_alloc+0x25/0x100 Code: 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 d6 49 c1 e6 0c 41 55 41 54 53 49 39 ce 0f 82 c6 00 00 00 49 89 fc <f6> 87 fc 02 00 00 20 0f 84 af 00 00 00 49 89 f5 48 89 d3 49 89 cf RSP: 0018:ffffd3d3420eb830 EFLAGS: 00010246 RAX: 0000000000001000 RBX: ffff8b727c7f7400 RCX: 0000000000001000 RDX: 0000000000000001 RSI: ffff8b727c7f74b0 RDI: 0000000000000000 RBP: ffffd3d3420eb858 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007262a622a000 R14: 0000000000001000 R15: ffff8b727c7f74b0 FS: 00007262a62a1080(0000) GS:ffff8b762ac3e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002fc CR3: 000000010a1f0004 CR4: 0000000000f72ef0 PKRU: 55555554 Call Trace: <TASK> ib_init_umem_odp+0xb6/0x110 [ib_uverbs] ib_umem_odp_get+0xf0/0x150 [ib_uverbs] rxe_odp_mr_init_user+0x71/0x170 [rdma_rxe] rxe_reg_user_mr+0x217/0x2e0 [rdma_rxe] ib_uverbs_reg_mr+0x19e/0x2e0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xd9/0x150 [ib_uverbs] ib_uverbs_cmd_verbs+0xd19/0xee0 [ib_uverbs] ? mmap_region+0x63/0xd0 ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs] ib_uverbs_ioctl+0xba/0x130 [ib_uverbs] __x64_sys_ioctl+0xa4/0xe0 x64_sys_call+0x1178/0x2660 do_syscall_64+0x7e/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? do_syscall_64+0x8a/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? do_user_addr_fault+0x1d2/0x8d0 ? irqentry_exit_to_user_mode+0x43/0x250 ? irqentry_exit+0x43/0x50 ? exc_page_fault+0x93/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7262a6124ded Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:00007fffd08c3960 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fffd08c39f0 RCX: 00007262a6124ded RDX: 00007fffd08c3a10 RSI: 00000000c0181b01 RDI: 0000000000000007 RBP: 00007fffd08c39b0 R08: 0000000014107820 R09: 00007fffd08c3b44 R10: 000000000000000c R11: 0000000000000246 R12: 00007fffd08c3b44 R13: 000000000000000c R14: 00007fffd08c3b58 R15: 0000000014107960 </TASK> Fixes: 1efe8c0670d6 ("RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage") Closes: https://lore.kernel.org/all/3e8f343f-7d66-4f7a-9f08-3910623e322f@gmail.com/ Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250524144328.4361-1-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22IB/IPoIB: Allow using netdevs that require the instance lockCosmin Ratiu
After the last patch removing vlan_rwsem, it is an incremental step to allow ipoib to work with netdevs that require the instance lock. In several places, netdev_lock() is changed to netdev_lock_ops_to_full() which takes care of not acquiring the lock again when the netdev is already locked. In ipoib_ib_tx_timeout_work() and __ipoib_ib_dev_flush() for HEAVY flushes, the netdev lock is acquired/released. This is needed because these functions end up calling .ndo_stop()/.ndo_open() on subinterfaces, and the device may expect the netdev instance lock to be held. ipoib_set_mode() now explicitly acquires ops lock while manipulating the features, mtu and tx queues. Finally, ipoib_napi_enable()/ipoib_napi_disable() now use the *_locked variants of the napi_enable()/napi_disable() calls and optionally acquire the netdev lock themselves depending on the dev they operate on. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1747829342-1018757-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22IB/IPoIB: Replace vlan_rwsem with the netdev instance lockCosmin Ratiu
vlan_rwsem was added more than a decade ago to work around a deadlock involving the original mutex being acquired twice, once from the wq. Subsequent changes then tweaked it to partially protect access to ipoib_dev_priv->child_intfs together with the RTNL. Flushing the wq synchronously was also since then refactored to happen separately. This semaphore unfortunately prevents updating ipoib to work with devices that require the netdev lock, because of lock ordering issues between RTNL, vlan_rwsem and the netdev instance locks of parent and child devices. To uncomplicate things, this commit replaces vlan_rwsem with the netdev instance lock of the parent device. Both parent child_intfs list and the children's list membership in it require holding the parent netdev instance lock. All call paths were carefully reviewed and no-longer-needed ASSERT_RTNL calls were dropped. Some non-trivial changes: - ipoib_match_gid_pkey_addr() now only acquires the instance lock and iterates through child_intfs for the first level of recursion (the parent), as it's not possible to have multiple levels of nested subinterfaces. - ipoib_open() and ipoib_stop() schedule tasks on the global workqueue to open/stop child interfaces to avoid potentially acquiring nested netdev instance locks. To avoid the device going away between the task scheduling and execution, netdev_hold/netdev_put are used. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1747829342-1018757-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22IB/IPoIB: Enqueue separate work_structs for each flushed interfaceCosmin Ratiu
Previously, flushing a netdevice involved first flushing all child devices from the flush task itself. That requires holding the lock that protects the list for the entire duration of the flush. This poses a problem when converting from vlan_rwsem to the netdev instance lock (next patch), because holding the parent lock while trying to acquire a child lock makes lockdep unhappy, rightfully. Fix this by splitting a big flush task into individual flush tasks (all are already created in their respective ipoib_dev_priv structs) and defining a helper function to enqueue all of them while holding the list lock. In ipoib_set_mac, the function is not used and the task is enqueued directly, because in the subsequent patches locking is changed and this function may be called with the netdev instance lock held. This is effectively a noop, the wq is single-threaded and ordered and will execute the same flush operations in the same order as before. Furthermore, there should be no new races because ipoib_parent_unregister_pre() calls flush_workqueue() after stopping new work generation to wait for pending work to complete. flush_workqueue() waits for all currently enqueued work to finish before returning. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1747829342-1018757-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22RDMA/rxe: Break endless pagefault loop for RO pagesLeon Romanovsky
RO pages has "perm" equal to 0, that caused to the situation where such pages were marked as needed to have fault and caused to infinite loop. Fixes: eedd5b1276e7 ("RDMA/umem: Store ODP access mask information in PFN") Reported-by: Daisuke Matsuda <dskmtsd@gmail.com> Closes: https://lore.kernel.org/all/3016329a-4edd-4550-862f-b298a1b79a39@gmail.com/ Link: https://patch.msgid.link/096fab178d48ed86942ee22eafe9be98e29092aa.1747913377.git.leonro@nvidia.com Tested-by: Daisuke Matsuda <dskmtsd@gmail.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-21RDMA/siw: use skb_crc32c() instead of __skb_checksum()Eric Biggers
Instead of calling __skb_checksum() with a skb_checksum_ops struct that does CRC32C, just call the new function skb_crc32c(). This is faster and simpler. Acked-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-5-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21RDMA/bnxt_re: Fix return code of bnxt_re_configure_ccKalesh AP
Driver currently supports modifying GEN0_EXT0 CC parameters through debugfs hook. Fixed to return -EOPNOTSUPP instead of -EINVAL in bnxt_re_configure_cc() when the user tries to modify any other CC parameters. Fixes: 656dff55da19 ("RDMA/bnxt_re: Congestion control settings using debugfs hook") Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250520035910.1061918-4-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-21RDMA/bnxt_re: Fix missing error handling for tx_queueGautam R A
bnxt_re_fill_gen0_ext0() did not return an error when attempting to modify CMDQ_MODIFY_ROCE_CC_MODIFY_MASK_TX_QUEUE, leading to silent failures. Fixed this by returning -EOPNOTSUPP for tx_queue modifications and ensuring proper error propagation in bnxt_re_configure_cc(). Fixes: 656dff55da19 ("RDMA/bnxt_re: Congestion control settings using debugfs hook") Signed-off-by: Gautam R A <gautam-r.a@broadcom.com> Link: https://patch.msgid.link/20250520035910.1061918-3-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-21RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs outputGautam R A
The inactivity_cp parameter in debugfs was not being read or written correctly, resulting in "Invalid argument" errors. Fixed this by ensuring proper mapping of inactivity_cp in both the map_cc_config_offset_gen0_ext0 and bnxt_re_fill_gen0_ext0() functions. Fixes: 656dff55da19 ("RDMA/bnxt_re: Congestion control settings using debugfs hook") Signed-off-by: Gautam R A <gautam-r.a@broadcom.com> Link: https://patch.msgid.link/20250520035910.1061918-2-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-18RDMA/mlx5: Add support for 200Gbps per lane speedsPatrisious Haddad
Add support for 200Gbps per lane speeds speed when querying PTYS and report it back correctly when needed. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://patch.msgid.link/b842d2f523e9b82e221378c444ebd5860d612959.1747134197.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-18RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stageYishai Hadas
The MLX5_IB_STAGE_UAR stage in the RDMA driver is redundant and should be removed. Responsibility for initializing the device's UAR pointer (mdev->priv.uar) lies with mlx5_core, which already sets it during the mlx5_load() process. At present, the RDMA UAR stage overwrites this pointer, which was correctly initialized by mlx5_core, creating the risk of inconsistency. Ownership and management of the UAR pointer should remain exclusively within mlx5_core. In the current upstream code, we luckily receive the same pointer, since mlx5_get_uars_page() still finds available BF registers for that UAR, allowing it to be shared. However, future changes in mlx5_core may expose this flaw. For instance, if mlx5_alloc_bfreg() is invoked twice before the RDMA UAR stage runs, the RDMA driver may overwrite the UAR allocated by mlx5_core. This could lead to real bugs. For example, if mlx5_ib is unloaded (rmmod), it might free the UAR, leaving mlx5_core with a dangling reference to an invalid UAR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Fan Li <fanl@nvidia.com> Link: https://patch.msgid.link/feaa84ec6f20468b4935c439923e9266122a93d0.1747134130.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-13Merge branch 'for-next' of ↵Leon Romanovsky
https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux into wip/leon-for-next Tony Nguyen says: ==================== Prepare for Intel IPU E2000 (GEN3) This is the first part in introducing RDMA support for idpf. ---------------------------------------------------------------- Tatyana Nikolova says: To align with review comments, the patch series introducing RDMA RoCEv2 support for the Intel Infrastructure Processing Unit (IPU) E2000 line of products is going to be submitted in three parts: 1. Modify ice to use specific and common IIDC definitions and pass a core device info to irdma. 2. Add RDMA support to idpf and modify idpf to use specific and common IIDC definitions and pass a core device info to irdma. 3. Add RDMA RoCEv2 support for the E2000 products, referred to as GEN3 to irdma. This first part is a 5 patch series based on the original "iidc/ice/irdma: Update IDC to support multiple consumers" patch to allow for multiple CORE PCI drivers, using the auxbus. Patches: 1) Move header file to new name for clarity and replace ice specific DSCP define with a kernel equivalent one in irdma 2) Unify naming convention 3) Separate header file into common and driver specific info 4) Replace ice specific DSCP define with a kernel equivalent one in ice 5) Implement core device info struct and update drivers to use it ---------------------------------------------------------------- v1: https://lore.kernel.org/20250505212037.2092288-1-anthony.l.nguyen@intel.com IWL reviews: [v5] https://lore.kernel.org/20250416021549.606-1-tatyana.e.nikolova@intel.com [v4] https://lore.kernel.org/20250225050428.2166-1-tatyana.e.nikolova@intel.com [v3] https://lore.kernel.org/20250207194931.1569-1-tatyana.e.nikolova@intel.com [v2] https://lore.kernel.org/20240824031924.421-1-tatyana.e.nikolova@intel.com [v1] https://lore.kernel.org/20240724233917.704-1-tatyana.e.nikolova@intel.com * 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: iidc/ice/irdma: Update IDC to support multiple consumers ice: Replace ice specific DSCP mapping num with a kernel define iidc/ice/irdma: Break iidc.h into two headers iidc/ice/irdma: Rename to iidc_* convention iidc/ice/irdma: Rename IDC header file ==================== Link: https://patch.msgid.link/20250509200712.2911060-1-anthony.l.nguyen@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-13RDMA/iwcm: Fix use-after-free of work objects after cm_id destructionShin'ichiro Kawasaki
The commit 59c68ac31e15 ("iw_cm: free cm_id resources on the last deref") simplified cm_id resource management by freeing cm_id once all references to the cm_id were removed. The references are removed either upon completion of iw_cm event handlers or when the application destroys the cm_id. This commit introduced the use-after-free condition where cm_id_private object could still be in use by event handler works during the destruction of cm_id. The commit aee2424246f9 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs") addressed this use-after- free by flushing all pending works at the cm_id destruction. However, still another use-after-free possibility remained. It happens with the work objects allocated for each cm_id_priv within alloc_work_entries() during cm_id creation, and subsequently freed in dealloc_work_entries() once all references to the cm_id are removed. If the cm_id's last reference is decremented in the event handler work, the work object for the work itself gets removed, and causes the use- after-free BUG below: BUG: KASAN: slab-use-after-free in __pwq_activate_work+0x1ff/0x250 Read of size 8 at addr ffff88811f9cf800 by task kworker/u16:1/147091 CPU: 2 UID: 0 PID: 147091 Comm: kworker/u16:1 Not tainted 6.15.0-rc2+ #27 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Workqueue: 0x0 (iw_cm_wq) Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_report+0x174/0x554 ? __virt_addr_valid+0x208/0x430 ? __pwq_activate_work+0x1ff/0x250 kasan_report+0xae/0x170 ? __pwq_activate_work+0x1ff/0x250 __pwq_activate_work+0x1ff/0x250 pwq_dec_nr_in_flight+0x8c5/0xfb0 process_one_work+0xc11/0x1460 ? __pfx_process_one_work+0x10/0x10 ? assign_work+0x16c/0x240 worker_thread+0x5ef/0xfd0 ? __pfx_worker_thread+0x10/0x10 kthread+0x3b0/0x770 ? __pfx_kthread+0x10/0x10 ? rcu_is_watching+0x11/0xb0 ? _raw_spin_unlock_irq+0x24/0x50 ? rcu_is_watching+0x11/0xb0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 147416: kasan_save_stack+0x2c/0x50 kasan_save_track+0x10/0x30 __kasan_kmalloc+0xa6/0xb0 alloc_work_entries+0xa9/0x260 [iw_cm] iw_cm_connect+0x23/0x4a0 [iw_cm] rdma_connect_locked+0xbfd/0x1920 [rdma_cm] nvme_rdma_cm_handler+0x8e5/0x1b60 [nvme_rdma] cma_cm_event_handler+0xae/0x320 [rdma_cm] cma_work_handler+0x106/0x1b0 [rdma_cm] process_one_work+0x84f/0x1460 worker_thread+0x5ef/0xfd0 kthread+0x3b0/0x770 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 Freed by task 147091: kasan_save_stack+0x2c/0x50 kasan_save_track+0x10/0x30 kasan_save_free_info+0x37/0x60 __kasan_slab_free+0x4b/0x70 kfree+0x13a/0x4b0 dealloc_work_entries+0x125/0x1f0 [iw_cm] iwcm_deref_id+0x6f/0xa0 [iw_cm] cm_work_handler+0x136/0x1ba0 [iw_cm] process_one_work+0x84f/0x1460 worker_thread+0x5ef/0xfd0 kthread+0x3b0/0x770 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 Last potentially related work creation: kasan_save_stack+0x2c/0x50 kasan_record_aux_stack+0xa3/0xb0 __queue_work+0x2ff/0x1390 queue_work_on+0x67/0xc0 cm_event_handler+0x46a/0x820 [iw_cm] siw_cm_upcall+0x330/0x650 [siw] siw_cm_work_handler+0x6b9/0x2b20 [siw] process_one_work+0x84f/0x1460 worker_thread+0x5ef/0xfd0 kthread+0x3b0/0x770 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 This BUG is reproducible by repeating the blktests test case nvme/061 for the rdma transport and the siw driver. To avoid the use-after-free of cm_id_private work objects, ensure that the last reference to the cm_id is decremented not in the event handler works, but in the cm_id destruction context. For that purpose, move iwcm_deref_id() call from destroy_cm_id() to the callers of destroy_cm_id(). In iw_destroy_cm_id(), call iwcm_deref_id() after flushing the pending works. During the fix work, I noticed that iw_destroy_cm_id() is called from cm_work_handler() and process_event() context. However, the comment of iw_destroy_cm_id() notes that the function "cannot be called by the event thread". Drop the false comment. Closes: https://lore.kernel.org/linux-rdma/r5676e754sv35aq7cdsqrlnvyhiq5zktteaurl7vmfih35efko@z6lay7uypy3c/ Fixes: 59c68ac31e15 ("iw_cm: free cm_id resources on the last deref") Cc: stable@vger.kernel.org Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://patch.msgid.link/20250510101036.1756439-1-shinichiro.kawasaki@wdc.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12Merge branch 'for-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Tony Nguyen says: ==================== Prepare for Intel IPU E2000 (GEN3) This is the first part in introducing RDMA support for idpf. ---------------------------------------------------------------- Tatyana Nikolova says: To align with review comments, the patch series introducing RDMA RoCEv2 support for the Intel Infrastructure Processing Unit (IPU) E2000 line of products is going to be submitted in three parts: 1. Modify ice to use specific and common IIDC definitions and pass a core device info to irdma. 2. Add RDMA support to idpf and modify idpf to use specific and common IIDC definitions and pass a core device info to irdma. 3. Add RDMA RoCEv2 support for the E2000 products, referred to as GEN3 to irdma. This first part is a 5 patch series based on the original "iidc/ice/irdma: Update IDC to support multiple consumers" patch to allow for multiple CORE PCI drivers, using the auxbus. Patches: 1) Move header file to new name for clarity and replace ice specific DSCP define with a kernel equivalent one in irdma 2) Unify naming convention 3) Separate header file into common and driver specific info 4) Replace ice specific DSCP define with a kernel equivalent one in ice 5) Implement core device info struct and update drivers to use it ---------------------------------------------------------------- v1: https://lore.kernel.org/20250505212037.2092288-1-anthony.l.nguyen@intel.com IWL reviews: [v5] https://lore.kernel.org/20250416021549.606-1-tatyana.e.nikolova@intel.com [v4] https://lore.kernel.org/20250225050428.2166-1-tatyana.e.nikolova@intel.com [v3] https://lore.kernel.org/20250207194931.1569-1-tatyana.e.nikolova@intel.com [v2] https://lore.kernel.org/20240824031924.421-1-tatyana.e.nikolova@intel.com [v1] https://lore.kernel.org/20240724233917.704-1-tatyana.e.nikolova@intel.com * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: iidc/ice/irdma: Update IDC to support multiple consumers ice: Replace ice specific DSCP mapping num with a kernel define iidc/ice/irdma: Break iidc.h into two headers iidc/ice/irdma: Rename to iidc_* convention iidc/ice/irdma: Rename IDC header file ==================== Link: https://patch.msgid.link/20250509200712.2911060-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12RDMA/mana_ib: unify mana_ib functions to support any gdma deviceKonstantin Taranov
Use the installed gdma_device instead of hard-coded device in requests to the HW. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1746633545-17653-4-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nicKonstantin Taranov
Allow mana_ib to be created over ethernet gdma device and over rnic gdma device. The HW has two devices with different capabilities and different use-cases. Initialize required resources depending on the used gdma device. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1746633545-17653-3-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12RDMA/siw: replace redundant ternary operator with just rvColin Ian King
The use of the ternary operator on rv is redundant, rv is either the initialized value of 0 or a negative error return code, so it can never be greater than zero, and hence the zero assignment in ternary operator is redundant. Just return rv instead. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://patch.msgid.link/20250507131834.253823-1-colin.i.king@gmail.com Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>