summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
13 daysnet: s/dev_get_mac_address/netif_get_mac_address/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_get_mac_address is used only by tun/tap, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet: s/dev_get_port_parent_id/netif_get_port_parent_id/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet: selftests: add PHY-loopback test for bad TCP checksumsOleksij Rempel
Detect NICs and drivers that either drop frames with a corrupted TCP checksum or, worse, pass them up as valid. The test flips one bit in the checksum, transmits the packet in internal loopback, and fails when the driver reports CHECKSUM_UNNECESSARY. Discussed at: https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/ Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250717083524.1645069-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOCJesper Dangaard Brouer
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets dropped due to memory pressure. In production environments, we've observed memory exhaustion reported by memory layer stack traces, but these drops were not properly tracked in the SKB drop reason infrastructure. While most network code paths now properly report pfmemalloc drops, some protocol-specific socket implementations still use sk_filter() without drop reason tracking: - Bluetooth L2CAP sockets - CAIF sockets - IUCV sockets - Netlink sockets - SCTP sockets - Unix domain sockets These remaining cases represent less common paths and could be converted in a follow-up patch if needed. The current implementation provides significantly improved observability into memory pressure events in the network stack, especially for key protocols like TCP and UDP, helping to diagnose problems in production environments. Reported-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet: stream: add description for sk_stream_write_space()Suchit Karunakaran
Add a proper description for the sk_stream_write_space() function as previously marked by a FIXME comment. No functional changes. Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Link: https://patch.msgid.link/20250716153404.7385-1-suchitkarunakaran@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
14 daysReapply "wifi: mac80211: Update skb's control block key in ↵Remi Pommarel
ieee80211_tx_dequeue()" This reverts commit 0937cb5f345c ("Revert "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()""). This commit broke TX with 802.11 encapsulation HW offloading, now that this is fixed, reapply it. Fixes: bb42f2d13ffc ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue") Signed-off-by: Remi Pommarel <repk@triplefau.lt> Link: https://patch.msgid.link/66b8fc39fb0194fa06c9ca7eeb6ffe0118dcb3ec.1752765971.git.repk@triplefau.lt Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: Check 802.11 encaps offloading in ieee80211_tx_h_select_key()Remi Pommarel
With 802.11 encapsulation offloading, ieee80211_tx_h_select_key() is called on 802.3 frames. In that case do not try to use skb data as valid 802.11 headers. Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/linux-wireless/20250410215527.3001-1-spasswolf@web.de Fixes: bb42f2d13ffc ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue") Signed-off-by: Remi Pommarel <repk@triplefau.lt> Link: https://patch.msgid.link/1af4b5b903a5fca5ebe67333d5854f93b2be5abe.1752765971.git.repk@triplefau.lt Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: Don't call fq_flow_idx() for management framesAlexander Wetzel
skb_get_hash() can only be used when the skb is linked to a netdev device. Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de> Fixes: 73bc9e0af594 ("mac80211: don't apply flow control on management frames") Link: https://patch.msgid.link/20250717162547.94582-3-Alexander@wetzel-home.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: Do not schedule stopped TXQsAlexander Wetzel
Ignore TXQs with the flag IEEE80211_TXQ_STOP when scheduling a queue. The flag is only set after all fragments have been dequeued and won't allow dequeueing other frames as long as the flag is set. For drivers using ieee80211_txq_schedule_start() this prevents an loop trying to push the queued frames while IEEE80211_TXQ_STOP is set: After setting IEEE80211_TXQ_STOP the driver will call ieee80211_return_txq(). Which calls __ieee80211_schedule_txq(), detects that there sill are frames in the queue and immediately restarts the stopped TXQ. Which can't dequeue any frame and thus starts over the loop. Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de> Fixes: ba8c3d6f16a1 ("mac80211: add an intermediate software queue implementation") Link: https://patch.msgid.link/20250717162547.94582-2-Alexander@wetzel-home.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: cfg80211: Add missing lock in cfg80211_check_and_end_cac()Alexander Wetzel
Callers of wdev_chandef() must hold the wiphy mutex. But the worker cfg80211_propagate_cac_done_wk() never takes the lock. Which triggers the warning below with the mesh_peer_connected_dfs test from hostapd and not (yet) released mac80211 code changes: WARNING: CPU: 0 PID: 495 at net/wireless/chan.c:1552 wdev_chandef+0x60/0x165 Modules linked in: CPU: 0 UID: 0 PID: 495 Comm: kworker/u4:2 Not tainted 6.14.0-rc5-wt-g03960e6f9d47 #33 13c287eeabfe1efea01c0bcc863723ab082e17cf Workqueue: cfg80211 cfg80211_propagate_cac_done_wk Stack: 00000000 00000001 ffffff00 6093267c 00000000 6002ec30 6d577c50 60037608 00000000 67e8d108 6063717b 00000000 Call Trace: [<6002ec30>] ? _printk+0x0/0x98 [<6003c2b3>] show_stack+0x10e/0x11a [<6002ec30>] ? _printk+0x0/0x98 [<60037608>] dump_stack_lvl+0x71/0xb8 [<6063717b>] ? wdev_chandef+0x60/0x165 [<6003766d>] dump_stack+0x1e/0x20 [<6005d1b7>] __warn+0x101/0x20f [<6005d3a8>] warn_slowpath_fmt+0xe3/0x15d [<600b0c5c>] ? mark_lock.part.0+0x0/0x4ec [<60751191>] ? __this_cpu_preempt_check+0x0/0x16 [<600b11a2>] ? mark_held_locks+0x5a/0x6e [<6005d2c5>] ? warn_slowpath_fmt+0x0/0x15d [<60052e53>] ? unblock_signals+0x3a/0xe7 [<60052f2d>] ? um_set_signals+0x2d/0x43 [<60751191>] ? __this_cpu_preempt_check+0x0/0x16 [<607508b2>] ? lock_is_held_type+0x207/0x21f [<6063717b>] wdev_chandef+0x60/0x165 [<605f89b4>] regulatory_propagate_dfs_state+0x247/0x43f [<60052f00>] ? um_set_signals+0x0/0x43 [<605e6bfd>] cfg80211_propagate_cac_done_wk+0x3a/0x4a [<6007e460>] process_scheduled_works+0x3bc/0x60e [<6007d0ec>] ? move_linked_works+0x4d/0x81 [<6007d120>] ? assign_work+0x0/0xaa [<6007f81f>] worker_thread+0x220/0x2dc [<600786ef>] ? set_pf_worker+0x0/0x57 [<60087c96>] ? to_kthread+0x0/0x43 [<6008ab3c>] kthread+0x2d3/0x2e2 [<6007f5ff>] ? worker_thread+0x0/0x2dc [<6006c05b>] ? calculate_sigpending+0x0/0x56 [<6003b37d>] new_thread_handler+0x4a/0x64 irq event stamp: 614611 hardirqs last enabled at (614621): [<00000000600bc96b>] __up_console_sem+0x82/0xaf hardirqs last disabled at (614630): [<00000000600bc92c>] __up_console_sem+0x43/0xaf softirqs last enabled at (614268): [<00000000606c55c6>] __ieee80211_wake_queue+0x933/0x985 softirqs last disabled at (614266): [<00000000606c52d6>] __ieee80211_wake_queue+0x643/0x985 Fixes: 26ec17a1dc5e ("cfg80211: Fix radar event during another phy CAC") Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de> Link: https://patch.msgid.link/20250717162547.94582-1-Alexander@wetzel-home.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: support returning the S1G short beacon skbLachlan Hodges
When short beaconing is enabled, check the value of the sb_count to determine whether we are to send a long beacon or short beacon. sb_count represents the number of short beacons until the next long beacon, where if its value is 0 we are to send a long beacon. The value is then reset to the long beacon period, which represents the number of beacon intervals between each long beacon. The decrement process follows the same cadence as the decrement of the DTIM count value. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250717074205.312577-5-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: support initialising current S1G short beacon indexLachlan Hodges
Introduce the sb_count variable which tracks the number of beacon intervals until the next long beacon. To initialise this value, we find the current short beacon index into this period which represents the number of short beacons left to send before the next long beacon. We use the same TSF value used to initialise the DTIM count to ensure the short beacon count and DTIM count are in sync as its common for the long beacon period and DTIM period to be equivalent. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250717074205.312577-4-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: support initialising an S1G short beaconing BSSLachlan Hodges
Introduce the ability to parse the short beacon data and long beacon period. The long beacon period represents the number of beacon intervals between each long beacon transmission. Additionally, as a BSS cannot change its configuration such that short beaconing is dynamically disabled/enabled without tearing down the interface - we ensure we have an existing short beacon before performing the update. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250717074205.312577-3-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: cfg80211: support configuring an S1G short beaconing BSSLachlan Hodges
S1G short beacons are an optional frame type used in an S1G BSS that contain a limited set of elements. While they are optional, they are a fundamental part of S1G that enables significant power saving. Expose 2 additional netlink attributes, NL80211_ATTR_S1G_LONG_BEACON_PERIOD which denotes the number of beacon intervals between each long beacon and NL80211_ATTR_S1G_SHORT_BEACON which is a nested attribute containing the short beacon tail and head. We split them as the long beacon period cannot be updated, and is only used when initialisng the interface, whereas the short beacon data can be used to both initialise and update the templates. This follows how things such as the beacon interval and DTIM period currently operate. During the initialisation path, we ensure we have the long beacon period if the short beacon data is being passed down, whereas the update path will simply update the template if its sent down. The short beacon data is validated using the same routines for regular beacons as they support correctly parsing the short beacon format while ensuring the frame is well-formed. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250717074205.312577-2-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: reject TDLS operations when station is not associatedMoon Hee Lee
syzbot triggered a WARN in ieee80211_tdls_oper() by sending NL80211_TDLS_ENABLE_LINK immediately after NL80211_CMD_CONNECT, before association completed and without prior TDLS setup. This left internal state like sdata->u.mgd.tdls_peer uninitialized, leading to a WARN_ON() in code paths that assumed it was valid. Reject the operation early if not in station mode or not associated. Reported-by: syzbot+f73f203f8c9b19037380@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f73f203f8c9b19037380 Fixes: 81dd2b882241 ("mac80211: move TDLS data to mgd private part") Tested-by: syzbot+f73f203f8c9b19037380@syzkaller.appspotmail.com Signed-off-by: Moon Hee Lee <moonhee.lee.ca@gmail.com> Link: https://patch.msgid.link/20250715230904.661092-2-moonhee.lee.ca@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: extend connection monitoring for MLOMaharaja Kennadyrajan
Currently, reset connection monitor (ieee80211_sta_reset_conn_monitor()) timer is handled only for non-AP non-MLD STA and do not support non-AP MLD STA. The current implementation checks for the CSA active and update the monitor timer with the timeout value of deflink and reset the timer based on the deflink's timeout value else schedule the connection loss work when the deflink is timed out and it won't work for the non-AP MLD STA. Handle the reset connection monitor timer for non-AP MLD STA by updating the monitor timer with the timeout value which is determined based on the link that will expire last among all the links in MLO. If at least one link has not timed out, the timer is updated accordingly with the latest timeout value else schedule the connection loss work when all links have timed out. Remove the MLO-related WARN_ON() checks in the beacon and connection monitoring logic code paths as they support MLO now. Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com> Link: https://patch.msgid.link/20250718060837.59371-5-maharaja.kennadyrajan@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: extend beacon monitoring for MLOMaharaja Kennadyrajan
Currently, reset beacon monitor (ieee80211_sta_reset_beacon_monitor()) timer is handled only for non-AP non-MLD STA and do not support non-AP MLD STA. When the beacon loss occurs in non-AP MLD STA with the current implementation, it is treated as a single link and the timer will reset based on the timeout of the deflink, without checking all the links. Check the CSA flags for all the links in the MLO and decide whether to schedule the work queue for beacon loss. If any of the links has CSA active, then beacon loss work is not scheduled. Also, call the functions ieee80211_sta_reset_beacon_monitor() and ieee80211_sta_reset_conn_monitor() from ieee80211_csa_switch_work() only when all the links are CSA active. Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com> Link: https://patch.msgid.link/20250718060837.59371-4-maharaja.kennadyrajan@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: Add link iteration macro for link data with rcu_dereferenceMaharaja Kennadyrajan
Currently, the existing macro for_each_link_data() uses sdata_dereference() which requires the wiphy lock. This lock cannot be used in atomic or RCU read-side contexts, such as in the RX path. Introduce a new macro, for_each_link_data_rcu(), that iterates over link of sdata using rcu_dereference(), making it safe to use in RCU contexts. This allows callers to access link data without requiring the wiphy lock. The macro takes into account the vif.valid_links bitmap and ensures only valid links are accessed safely. Callers are responsible for ensuring that rcu_read_lock() is held when using this macro. Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com> Link: https://patch.msgid.link/20250718060837.59371-3-maharaja.kennadyrajan@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: mac80211: fix macro scoping in for_each_link_dataAditya Kumar Singh
The for_each_link_data() macro currently declares a local variable __sdata directly, which could lead to compiler warnings or errors when reused in the same function or within switch-case blocks due to variable redefinition or invalid scoping. To address this, restructure the macro to use an outer for-loop that runs only once, allowing safe declaration of __sdata without polluting the outer scope. This ensures compatibility with static analyzers. No functional changes; this is purely a cleanup to improve macro hygiene. Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Signed-off-by: Maharaja Kennadyrajan <maharaja.kennadyrajan@oss.qualcomm.com> Link: https://patch.msgid.link/20250718060837.59371-2-maharaja.kennadyrajan@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 dayswifi: cfg80211/mac80211: remove wrong scan request n_channelsJohannes Berg
This (partially) reverts commits - 838c7b8f1f27 ("wifi: nl80211: Avoid address calculations via out of bounds array indexing") - f1d3334d604c ("wifi: cfg80211: sme: init n_channels before channels[] access") - 82bbe02b2500 ("wifi: mac80211: Set n_channels after allocating struct cfg80211_scan_request") These commits all set the structure to be in an inconsistent state, setting n_channels to some value before them actually being filled in. That's fine for what the code does now, but with the removal of __counted_by() in 444020f4bf06 ("wifi: cfg80211: remove scan request n_channels counted_by") it's no longer needed and it does leave a bit of a landmine there since breaking out of some code to send the scan or something would leave it wrong. Remove the now superfluous n_channels settings. Link: https://patch.msgid.link/20250718103237.59510b2384c5.Ied5ba9c5c49efc008f4491c8ca7a45858a83f064@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
14 daysMerge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-07-17 We've added 13 non-merge commits during the last 20 day(s) which contain a total of 4 files changed, 712 insertions(+), 84 deletions(-). The main changes are: 1) Avoid skipping or repeating a sk when using a TCP bpf_iter, from Jordan Rife. 2) Clarify the driver requirement on using the XDP metadata, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: doc: xdp: Clarify driver implementation for XDP Rx metadata selftests/bpf: Add tests for bucket resume logic in established sockets selftests/bpf: Create iter_tcp_destroy test program selftests/bpf: Create established sockets in socket iterator tests selftests/bpf: Make ehash buckets configurable in socket iterator tests selftests/bpf: Allow for iteration over multiple states selftests/bpf: Allow for iteration over multiple ports selftests/bpf: Add tests for bucket resume logic in listening sockets bpf: tcp: Avoid socket skips and repeats during iteration bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items bpf: tcp: Get rid of st_bucket_done bpf: tcp: Make sure iter->batch always contains a full bucket snapshot bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch ==================== Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Update pneigh_entry in pneigh_create().Kuniyuki Iwashima
neigh_add() updates pneigh_entry() found or created by pneigh_create(). This update is serialised by RTNL, but we will remove it. Let's move the update part to pneigh_create() and make it return errno instead of a pointer of pneigh_entry. Now, the pneigh code is RTNL free. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.Kuniyuki Iwashima
tbl->phash_buckets[] is only modified in the slow path by pneigh_create() and pneigh_delete() under the table lock. Both of them are called under RTNL, so no extra lock is needed, but we will remove RTNL from the paths. pneigh_create() looks up a pneigh_entry, and this part can be lockless, but it would complicate the logic like 1. lookup 2. allocate pengih_entry for GFP_KERNEL 3. lookup again but under lock 4. if found, return it after freeing the allocated memory 5. else, return the new one Instead, let's add a per-table mutex and run lookup and allocation under it. Note that updating pneigh_entry part in neigh_add() is still protected by RTNL and will be moved to pneigh_create() in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_lookup().Kuniyuki Iwashima
Now, all callers of pneigh_lookup() are under RCU, and the read lock there is no longer needed. Let's drop the lock, inline __pneigh_lookup_1() to pneigh_lookup(), and call it from pneigh_create(). The next patch will remove tbl->lock from pneigh_create(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Remove __pneigh_lookup().Kuniyuki Iwashima
__pneigh_lookup() is the lockless version of pneigh_lookup(), but its only caller pndisc_is_router() holds the table lock and reads pneigh_netry.flags. This is because accessing pneigh_entry after pneigh_lookup() was illegal unless the caller holds RTNL or the table lock. Now, pneigh_entry is guaranteed to be alive during the RCU critical section. Let's call pneigh_lookup() and use READ_ONCE() for n->flags in pndisc_is_router() and remove __pneigh_lookup(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Use rcu_dereference() in pneigh_get_{first,next}().Kuniyuki Iwashima
Now pneigh_entry is guaranteed to be alive during the RCU critical section even without holding tbl->lock. Let's use rcu_dereference() in pneigh_get_{first,next}(). Note that neigh_seq_start() still holds tbl->lock for the normal neighbour entry. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_dump_table().Kuniyuki Iwashima
Now pneigh_entry is guaranteed to be alive during the RCU critical section even without holding tbl->lock. Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference() to iterate tbl->phash_buckets[] in pneigh_dump_table() Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Convert RTM_GETNEIGH to RCU.Kuniyuki Iwashima
Only __dev_get_by_index() is the RTNL dependant in neigh_get(). Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Annotate access to struct pneigh_entry.{flags,protocol}.Kuniyuki Iwashima
We will convert pneigh readers to RCU, and its flags and protocol will be read locklessly. Let's annotate the access to the two fields. Note that all access to pn->permanent is under RTNL (neigh_add() and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE() are not needed. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Free pneigh_entry after RCU grace period.Kuniyuki Iwashima
We will convert RTM_GETNEIGH to RCU. neigh_get() looks up pneigh_entry by pneigh_lookup() and passes it to pneigh_fill_info(). Then, we must ensure that the entry is alive till pneigh_fill_info() completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not guarantee that. Also, we will convert all readers of tbl->phash_buckets[] to RCU. Let's use call_rcu() to free pneigh_entry and update phash_buckets[] and ->next by rcu_assign_pointer(). pneigh_ifdown_and_unlock() uses list_head to avoid overwriting ->next and moving RCU iterators to another list. pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it is not delayed to call_rcu(), where we cannot sleep. This is fine because the mcast code works with RCU and ipv6_dev_mc_dec() frees mcast objects after RCU grace period. While at it, we change the return type of pneigh_ifdown_and_unlock() to void. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Annotate neigh_table.phash_buckets and pneigh_entry.next with __rcu.Kuniyuki Iwashima
The next patch will free pneigh_entry with call_rcu(). Then, we need to annotate neigh_table.phash_buckets[] and pneigh_entry.next with __rcu. To make the next patch cleaner, let's annotate the fields in advance. Currently, all accesses to the fields are under the neigh table lock, so rcu_dereference_protected() is used with 1 for now, but most of them (except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be replaced with rcu_dereference() and rcu_dereference_check(). Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a local list, which is illegal because the RCU iterator could be moved to another list. This part will be fixed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Split pneigh_lookup().Kuniyuki Iwashima
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which is confusing. When called with the last argument, creat, 0, pneigh_lookup() literally looks up a proxy neighbour entry. This is the case of the reader path as the fast path and RTM_GETNEIGH. pneigh_lookup(), however, creates a pneigh_entry when called with creat 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL. Let's split pneigh_lookup() into two functions. We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock) in the new pneigh_lookup() will be dropped. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Move neigh_find_table() to neigh_get().Kuniyuki Iwashima
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[]. neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does not protect it at all; neigh_table_clear() can be called without RTNL and only waits for RCU readers by synchronize_rcu(). Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be unloaded, and DECNET was removed. To fetch neigh_tables[] by rcu_dereference() later, let's move neigh_find_table() from neigh_valid_get_req() to neigh_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Allocate skb in neigh_get().Kuniyuki Iwashima
We will remove RTNL for neigh_get() and run it under RCU instead. neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL. Let's move the allocation before __dev_get_by_index() in neigh_get(). Now, neigh_get_reply() and pneigh_get_reply() are inlined and rtnl_unicast() is factorised. We will convert pneigh_lookup() to __pneigh_lookup() later. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Move two validations from neigh_get() to neigh_valid_get_req().Kuniyuki Iwashima
We will remove RTNL for neigh_get() and run it under RCU instead. neigh_get() returns -EINVAL in the following cases: * NDA_DST is not specified * Both ndm->ndm_ifindex and NTF_PROXY are not specified These validations do not require RCU. Let's move them to neigh_valid_get_req(). While at it, the extack string for the first case is replaced with NL_SET_ERR_ATTR_MISS(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Make neigh_valid_get_req() return ndmsg.Kuniyuki Iwashima
neigh_get() passes 4 local variable pointers to neigh_valid_get_req(). If it returns a pointer of struct ndmsg, we do not need to pass two of them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ethtool: rss: support setting flow hashing fieldsJakub Kicinski
Add support for ETHTOOL_SRXFH (setting hashing fields) in RSS_SET. The tricky part is dealing with symmetric hashing. In netlink user can change the hashing fields and symmetric hash in one request, in IOCTL the two used to be set via different uAPI requests. Since fields and hash function config are still separate driver callbacks - changes to the two are not atomic. Keep things simple and validate the settings against both pre- and post- change ones. Meaning that we will reject the config request if user tries to correct the flow fields and set input_xfrm in one request, or disables input_xfrm and makes flow fields non-symmetric. We can adjust it later if there's a real need. Starting simple feels right, and potentially partially applying the settings isn't nice, either. Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250716000331.1378807-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ethtool: rss: support setting input-xfrm via NetlinkJakub Kicinski
Support configuring symmetric hashing via Netlink. We have the flow field config prepared as part of SET handling, so scan it for conflicts instead of querying the driver again. Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250716000331.1378807-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ethtool: rss: support setting hkey via NetlinkJakub Kicinski
Support setting RSS hashing key via ethtool Netlink. Use the Netlink policy to make sure user doesn't pass an empty key, "resetting" the key is not a thing. Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250716000331.1378807-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ethtool: rss: support setting hfunc via NetlinkJakub Kicinski
Support setting RSS hash function / algo via ethtool Netlink. Like IOCTL we don't validate that the function is within the range known to the kernel. The drivers do a pretty good job validating the inputs, and the IDs are technically "dynamically queried" rather than part of uAPI. Only change should be that in Netlink we don't support user explicitly passing ETH_RSS_HASH_NO_CHANGE (0), if no change is requested the attribute should be absent. The ETH_RSS_HASH_NO_CHANGE is retained in driver-facing API for consistency (not that I see a strong reason for it). Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250716000331.1378807-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ethtool: rss: initial RSS_SET (indirection table handling)Jakub Kicinski
Add initial support for RSS_SET, for now only operations on the indirection table are supported. Unlike the ioctl don't check if at least one parameter is being changed. This is how other ethtool-nl ops behave, so pick the ethtool-nl consistency vs copying ioctl behavior. There are two special cases here: 1) resetting the table to defaults; 2) support for tables of different size. For (1) I use an empty Netlink attribute (array of size 0). (2) may require some background. AFAICT a lot of modern devices allow allocating RSS tables of different sizes. mlx5 can upsize its tables, bnxt has some "table size calculation", and Intel folks asked about RSS table sizing in context of resource allocation in the past. The ethtool IOCTL API has a concept of table size, but right now the user is expected to provide a table exactly the size the device requests. Some drivers may change the table size at runtime (in response to queue count changes) but the user is not in control of this. What's not great is that all RSS contexts share the same table size. For example a device with 128 queues enabled, 16 RSS contexts 8 queues in each will likely have 256 entry tables for each of the 16 contexts, while 32 would be more than enough given each context only has 8 queues. To address this the Netlink API should avoid enforcing table size at the uAPI level, and should allow the user to express the min table size they expect. To fully solve (2) we will need more driver plumbing but at the uAPI level this patch allows the user to specify a table size smaller than what the device advertises. The device table size must be a multiple of the user requested table size. We then replicate the user-provided table to fill the full device size table. This addresses the "allow the user to express the min table size" objective, while not enforcing any fixed size. From Netlink perspective .get_rxfh_indir_size() is now de facto the "max" table size supported by the device. We may choose to support table replication in ethtool, too, when we actually plumb this thru the device APIs. Initially I was considering moving full pattern generation to the kernel (which queues to use, at which frequency and what min sequence length). I don't think this complexity would buy us much and most if not all devices have pow-2 table sizes, which simplifies the replication a lot. Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250716000331.1378807-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge tag 'for-net-2025-07-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_sync: fix connectable extended advertising when using static random address - hci_core: fix typos in macros - hci_core: add missing braces when using macro parameters - hci_core: replace 'quirks' integer by 'quirk_flags' bitmap - SMP: If an unallowed command is received consider it a failure - SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout - L2CAP: Fix null-ptr-deref in l2cap_sock_resume_cb() - L2CAP: Fix attempting to adjust outgoing MTU - btintel: Check if controller is ISO capable on btintel_classify_pkt_type - btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID * tag 'for-net-2025-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU Bluetooth: btusb: QCA: Fix downloading wrong NVM for WCN6855 GF variant without board ID Bluetooth: hci_dev: replace 'quirks' integer by 'quirk_flags' bitmap Bluetooth: hci_core: add missing braces when using macro parameters Bluetooth: hci_core: fix typos in macros Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout Bluetooth: SMP: If an unallowed command is received consider it a failure Bluetooth: btintel: Check if controller is ISO capable on btintel_classify_pkt_type Bluetooth: hci_sync: fix connectable extended advertising when using static random address Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb() ==================== Link: https://patch.msgid.link/20250717142849.537425-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17rxrpc: Fix to use conn aborts for conn-wide failuresDavid Howells
Fix rxrpc to use connection-level aborts for things that affect the whole connection, such as the service ID not matching a local service. Fixes: 57af281e5389 ("rxrpc: Tidy up abort generation infrastructure") Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250717074350.3767366-6-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17rxrpc: Fix transmission of an abort in response to an abortDavid Howells
Under some circumstances, such as when a server socket is closing, ABORT packets will be generated in response to incoming packets. Unfortunately, this also may include generating aborts in response to incoming aborts - which may cause a cycle. It appears this may be made possible by giving the client a multicast address. Fix this such that rxrpc_reject_packet() will refuse to generate aborts in response to aborts. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250717074350.3767366-5-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17rxrpc: Fix notification vs call-release vs recvmsgDavid Howells
When a call is released, rxrpc takes the spinlock and removes it from ->recvmsg_q in an effort to prevent racing recvmsg() invocations from seeing the same call. Now, rxrpc_recvmsg() only takes the spinlock when actually removing a call from the queue; it doesn't, however, take it in the lead up to that when it checks to see if the queue is empty. It *does* hold the socket lock, which prevents a recvmsg/recvmsg race - but this doesn't prevent sendmsg from ending the call because sendmsg() drops the socket lock and relies on the call->user_mutex. Fix this by firstly removing the bit in rxrpc_release_call() that dequeues the released call and, instead, rely on recvmsg() to simply discard released calls (done in a preceding fix). Secondly, rxrpc_notify_socket() is abandoned if the call is already marked as released rather than trying to be clever by setting both pointers in call->recvmsg_link to NULL to trick list_empty(). This isn't perfect and can still race, resulting in a released call on the queue, but recvmsg() will now clean that up. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250717074350.3767366-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17rxrpc: Fix recv-recv race of completed callDavid Howells
If a call receives an event (such as incoming data), the call gets placed on the socket's queue and a thread in recvmsg can be awakened to go and process it. Once the thread has picked up the call off of the queue, further events will cause it to be requeued, and once the socket lock is dropped (recvmsg uses call->user_mutex to allow the socket to be used in parallel), a second thread can come in and its recvmsg can pop the call off the socket queue again. In such a case, the first thread will be receiving stuff from the call and the second thread will be blocked on call->user_mutex. The first thread can, at this point, process both the event that it picked call for and the event that the second thread picked the call for and may see the call terminate - in which case the call will be "released", decoupling the call from the user call ID assigned to it (RXRPC_USER_CALL_ID in the control message). The first thread will return okay, but then the second thread will wake up holding the user_mutex and, if it sees that the call has been released by the first thread, it will BUG thusly: kernel BUG at net/rxrpc/recvmsg.c:474! Fix this by just dequeuing the call and ignoring it if it is seen to be already released. We can't tell userspace about it anyway as the user call ID has become stale. Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Marc Dionne <marc.dionne@auristor.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250717074350.3767366-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17rxrpc: Fix irq-disabled in local_bh_enable()David Howells
The rxrpc_assess_MTU_size() function calls down into the IP layer to find out the MTU size for a route. When accepting an incoming call, this is called from rxrpc_new_incoming_call() which holds interrupts disabled across the code that calls down to it. Unfortunately, the IP layer uses local_bh_enable() which, config dependent, throws a warning if IRQs are enabled: WARNING: CPU: 1 PID: 5544 at kernel/softirq.c:387 __local_bh_enable_ip+0x43/0xd0 ... RIP: 0010:__local_bh_enable_ip+0x43/0xd0 ... Call Trace: <TASK> rt_cache_route+0x7e/0xa0 rt_set_nexthop.isra.0+0x3b3/0x3f0 __mkroute_output+0x43a/0x460 ip_route_output_key_hash+0xf7/0x140 ip_route_output_flow+0x1b/0x90 rxrpc_assess_MTU_size.isra.0+0x2a0/0x590 rxrpc_new_incoming_peer+0x46/0x120 rxrpc_alloc_incoming_call+0x1b1/0x400 rxrpc_new_incoming_call+0x1da/0x5e0 rxrpc_input_packet+0x827/0x900 rxrpc_io_thread+0x403/0xb60 kthread+0x2f7/0x310 ret_from_fork+0x2a/0x230 ret_from_fork_asm+0x1a/0x30 ... hardirqs last enabled at (23): _raw_spin_unlock_irq+0x24/0x50 hardirqs last disabled at (24): _raw_read_lock_irq+0x17/0x70 softirqs last enabled at (0): copy_process+0xc61/0x2730 softirqs last disabled at (25): rt_add_uncached_list+0x3c/0x90 Fix this by moving the call to rxrpc_assess_MTU_size() out of rxrpc_init_peer() and further up the stack where it can be done without interrupts disabled. It shouldn't be a problem for rxrpc_new_incoming_call() to do it after the locks are dropped as pmtud is going to be performed by the I/O thread - and we're in the I/O thread at this point. Fixes: a2ea9a907260 ("rxrpc: Use irq-disabling spinlocks between app and I/O thread") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> cc: LePremierHomme <kwqcheii@proton.me> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250717074350.3767366-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtreeWilliam Liu
htb_lookup_leaf has a BUG_ON that can trigger with the following: tc qdisc del dev lo root tc qdisc add dev lo root handle 1: htb default 1 tc class add dev lo parent 1: classid 1:1 htb rate 64bit tc qdisc add dev lo parent 1:1 handle 2: netem tc qdisc add dev lo parent 2:1 handle 3: blackhole ping -I lo -c1 -W0.001 127.0.0.1 The root cause is the following: 1. htb_dequeue calls htb_dequeue_tree which calls the dequeue handler on the selected leaf qdisc 2. netem_dequeue calls enqueue on the child qdisc 3. blackhole_enqueue drops the packet and returns a value that is not just NET_XMIT_SUCCESS 4. Because of this, netem_dequeue calls qdisc_tree_reduce_backlog, and since qlen is now 0, it calls htb_qlen_notify -> htb_deactivate -> htb_deactiviate_prios -> htb_remove_class_from_row -> htb_safe_rb_erase 5. As this is the only class in the selected hprio rbtree, __rb_change_child in __rb_erase_augmented sets the rb_root pointer to NULL 6. Because blackhole_dequeue returns NULL, netem_dequeue returns NULL, which causes htb_dequeue_tree to call htb_lookup_leaf with the same hprio rbtree, and fail the BUG_ON The function graph for this scenario is shown here: 0) | htb_enqueue() { 0) + 13.635 us | netem_enqueue(); 0) 4.719 us | htb_activate_prios(); 0) # 2249.199 us | } 0) | htb_dequeue() { 0) 2.355 us | htb_lookup_leaf(); 0) | netem_dequeue() { 0) + 11.061 us | blackhole_enqueue(); 0) | qdisc_tree_reduce_backlog() { 0) | qdisc_lookup_rcu() { 0) 1.873 us | qdisc_match_from_root(); 0) 6.292 us | } 0) 1.894 us | htb_search(); 0) | htb_qlen_notify() { 0) 2.655 us | htb_deactivate_prios(); 0) 6.933 us | } 0) + 25.227 us | } 0) 1.983 us | blackhole_dequeue(); 0) + 86.553 us | } 0) # 2932.761 us | qdisc_warn_nonwc(); 0) | htb_lookup_leaf() { 0) | BUG_ON(); ------------------------------------------ The full original bug report can be seen here [1]. We can fix this just by returning NULL instead of the BUG_ON, as htb_dequeue_tree returns NULL when htb_lookup_leaf returns NULL. [1] https://lore.kernel.org/netdev/pF5XOOIim0IuEfhI-SOxTgRvNoDwuux7UHKnE_Y5-zVd4wmGvNk2ceHjKb8ORnzw0cGwfmVu42g9dL7XyJLf1NEzaztboTWcm0Ogxuojoeo=@willsroot.io/ Fixes: 512bb43eb542 ("pkt_sched: sch_htb: Optimize WARN_ONs in htb_dequeue_tree() etc.") Signed-off-by: William Liu <will@willsroot.io> Signed-off-by: Savino Dicanosa <savy@syst3mfailure.io> Link: https://patch.msgid.link/20250717022816.221364-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>