summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-02-16mptcp: drop port parameter of mptcp_pm_add_addr_signalGeliang Tang
Drop the port parameter of mptcp_pm_add_addr_signal() and reflect it to avoid passing too many parameters. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: drop unneeded type casts for hmacGeliang Tang
Drop the unneeded type casts to 'unsigned long long' for printing out the hmac values in add_addr_hmac_valid() and subflow_thmac_valid(). Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: drop unused sk in mptcp_get_optionsGeliang Tang
The parameter 'sk' became useless since the code using it was dropped from mptcp_get_options() in the commit 8d548ea1dd15 ("mptcp: do not set unconditionally csum_reqd on incoming opt"). Let's drop it. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: add SNDTIMEO setsockopt supportGeliang Tang
Add setsockopt support for SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW to fix this error reported by the mptcp bpf selftest: (network_helpers.c:64: errno: Operation not supported) Failed to set SO_SNDTIMEO test_mptcp:FAIL:115 All error logs: (network_helpers.c:64: errno: Operation not supported) Failed to set SO_SNDTIMEO test_mptcp:FAIL:115 Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16net: sched: limit TC_ACT_REPEAT loopsEric Dumazet
We have been living dangerously, at the mercy of malicious users, abusing TC_ACT_REPEAT, as shown by this syzpot report [1]. Add an arbitrary limit (32) to the number of times an action can return TC_ACT_REPEAT. v2: switch the limit to 32 instead of 10. Use net_warn_ratelimited() instead of pr_err_once(). [1] (C repro available on demand) rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 1-...!: (10500 ticks this GP) idle=021/1/0x4000000000000000 softirq=5592/5592 fqs=0 (t=10502 jiffies g=5305 q=190) rcu: rcu_preempt kthread timer wakeup didn't happen for 10502 jiffies! g5305 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 rcu: Possible timer handling issue on cpu=0 timer-softirq=3527 rcu: rcu_preempt kthread starved for 10505 jiffies! g5305 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:I stack:29344 pid: 14 ppid: 2 flags:0x00004000 Call Trace: <TASK> context_switch kernel/sched/core.c:4986 [inline] __schedule+0xab2/0x4db0 kernel/sched/core.c:6295 schedule+0xd2/0x260 kernel/sched/core.c:6368 schedule_timeout+0x14a/0x2a0 kernel/time/timer.c:1881 rcu_gp_fqs_loop+0x186/0x810 kernel/rcu/tree.c:1963 rcu_gp_kthread+0x1de/0x320 kernel/rcu/tree.c:2136 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> rcu: Stack dump where RCU GP kthread last ran: Sending NMI from CPU 1 to CPUs 0: NMI backtrace for cpu 0 CPU: 0 PID: 3646 Comm: syz-executor358 Not tainted 5.17.0-rc3-syzkaller-00149-gbf8e59fd315f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:rep_nop arch/x86/include/asm/vdso/processor.h:13 [inline] RIP: 0010:cpu_relax arch/x86/include/asm/vdso/processor.h:18 [inline] RIP: 0010:pv_wait_head_or_lock kernel/locking/qspinlock_paravirt.h:437 [inline] RIP: 0010:__pv_queued_spin_lock_slowpath+0x3b8/0xb40 kernel/locking/qspinlock.c:508 Code: 48 89 eb c6 45 01 01 41 bc 00 80 00 00 48 c1 e9 03 83 e3 07 41 be 01 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8d 2c 01 eb 0c <f3> 90 41 83 ec 01 0f 84 72 04 00 00 41 0f b6 45 00 38 d8 7f 08 84 RSP: 0018:ffffc9000283f1b0 EFLAGS: 00000206 RAX: 0000000000000003 RBX: 0000000000000000 RCX: 1ffff1100fc0071e RDX: 0000000000000001 RSI: 0000000000000201 RDI: 0000000000000000 RBP: ffff88807e0038f0 R08: 0000000000000001 R09: ffffffff8ffbf9ff R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000004c1e R13: ffffed100fc0071e R14: 0000000000000001 R15: ffff8880b9c3aa80 FS: 00005555562bf300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffdbfef12b8 CR3: 00000000723c2000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> pv_queued_spin_lock_slowpath arch/x86/include/asm/paravirt.h:591 [inline] queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:51 [inline] queued_spin_lock include/asm-generic/qspinlock.h:85 [inline] do_raw_spin_lock+0x200/0x2b0 kernel/locking/spinlock_debug.c:115 spin_lock_bh include/linux/spinlock.h:354 [inline] sch_tree_lock include/net/sch_generic.h:610 [inline] sch_tree_lock include/net/sch_generic.h:605 [inline] prio_tune+0x3b9/0xb50 net/sched/sch_prio.c:211 prio_init+0x5c/0x80 net/sched/sch_prio.c:244 qdisc_create.constprop.0+0x44a/0x10f0 net/sched/sch_api.c:1253 tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5594 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f7ee98aae99 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffdbfef12d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007ffdbfef1300 RCX: 00007f7ee98aae99 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003 RBP: 0000000000000000 R08: 000000000000000d R09: 000000000000000d R10: 000000000000000d R11: 0000000000000246 R12: 00007ffdbfef12f0 R13: 00000000000f4240 R14: 000000000004ca47 R15: 00007ffdbfef12e4 </TASK> INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 2.293 msecs NMI backtrace for cpu 1 CPU: 1 PID: 3260 Comm: kworker/1:3 Not tainted 5.17.0-rc3-syzkaller-00149-gbf8e59fd315f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: mld mld_ifc_work Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:111 nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline] rcu_dump_cpu_stacks+0x25e/0x3f0 kernel/rcu/tree_stall.h:343 print_cpu_stall kernel/rcu/tree_stall.h:604 [inline] check_cpu_stall kernel/rcu/tree_stall.h:688 [inline] rcu_pending kernel/rcu/tree.c:3919 [inline] rcu_sched_clock_irq.cold+0x5c/0x759 kernel/rcu/tree.c:2617 update_process_times+0x16d/0x200 kernel/time/timer.c:1785 tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226 tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1428 __run_hrtimer kernel/time/hrtimer.c:1685 [inline] __hrtimer_run_queues+0x1c0/0xe50 kernel/time/hrtimer.c:1749 hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline] __sysvec_apic_timer_interrupt+0x146/0x530 arch/x86/kernel/apic/apic.c:1103 sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1097 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 RIP: 0010:__sanitizer_cov_trace_const_cmp4+0xc/0x70 kernel/kcov.c:286 Code: 00 00 00 48 89 7c 30 e8 48 89 4c 30 f0 4c 89 54 d8 20 48 89 10 5b c3 0f 1f 80 00 00 00 00 41 89 f8 bf 03 00 00 00 4c 8b 14 24 <89> f1 65 48 8b 34 25 00 70 02 00 e8 14 f9 ff ff 84 c0 74 4b 48 8b RSP: 0018:ffffc90002c5eea8 EFLAGS: 00000246 RAX: 0000000000000007 RBX: ffff88801c625800 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: ffff8880137d3100 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff874fcd88 R11: 0000000000000000 R12: ffff88801d692dc0 R13: ffff8880137d3104 R14: 0000000000000000 R15: ffff88801d692de8 tcf_police_act+0x358/0x11d0 net/sched/act_police.c:256 tcf_action_exec net/sched/act_api.c:1049 [inline] tcf_action_exec+0x1a6/0x530 net/sched/act_api.c:1026 tcf_exts_exec include/net/pkt_cls.h:326 [inline] route4_classify+0xef0/0x1400 net/sched/cls_route.c:179 __tcf_classify net/sched/cls_api.c:1549 [inline] tcf_classify+0x3e8/0x9d0 net/sched/cls_api.c:1615 prio_classify net/sched/sch_prio.c:42 [inline] prio_enqueue+0x3a7/0x790 net/sched/sch_prio.c:75 dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3668 __dev_xmit_skb net/core/dev.c:3756 [inline] __dev_queue_xmit+0x1f61/0x3660 net/core/dev.c:4081 neigh_hh_output include/net/neighbour.h:533 [inline] neigh_output include/net/neighbour.h:547 [inline] ip_finish_output2+0x14dc/0x2170 net/ipv4/ip_output.c:228 __ip_finish_output net/ipv4/ip_output.c:306 [inline] __ip_finish_output+0x396/0x650 net/ipv4/ip_output.c:288 ip_finish_output+0x32/0x200 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip_output+0x196/0x310 net/ipv4/ip_output.c:430 dst_output include/net/dst.h:451 [inline] ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:126 iptunnel_xmit+0x628/0xa50 net/ipv4/ip_tunnel_core.c:82 geneve_xmit_skb drivers/net/geneve.c:966 [inline] geneve_xmit+0x10c8/0x3530 drivers/net/geneve.c:1077 __netdev_start_xmit include/linux/netdevice.h:4683 [inline] netdev_start_xmit include/linux/netdevice.h:4697 [inline] xmit_one net/core/dev.c:3473 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3489 __dev_queue_xmit+0x2985/0x3660 net/core/dev.c:4116 neigh_hh_output include/net/neighbour.h:533 [inline] neigh_output include/net/neighbour.h:547 [inline] ip6_finish_output2+0xf7a/0x14f0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] mld_sendpack+0x9a3/0xe40 net/ipv6/mcast.c:1826 mld_send_cr net/ipv6/mcast.c:2127 [inline] mld_ifc_work+0x71c/0xdc0 net/ipv6/mcast.c:2659 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> ---------------- Code disassembly (best guess): 0: 48 89 eb mov %rbp,%rbx 3: c6 45 01 01 movb $0x1,0x1(%rbp) 7: 41 bc 00 80 00 00 mov $0x8000,%r12d d: 48 c1 e9 03 shr $0x3,%rcx 11: 83 e3 07 and $0x7,%ebx 14: 41 be 01 00 00 00 mov $0x1,%r14d 1a: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 21: fc ff df 24: 4c 8d 2c 01 lea (%rcx,%rax,1),%r13 28: eb 0c jmp 0x36 * 2a: f3 90 pause <-- trapping instruction 2c: 41 83 ec 01 sub $0x1,%r12d 30: 0f 84 72 04 00 00 je 0x4a8 36: 41 0f b6 45 00 movzbl 0x0(%r13),%eax 3b: 38 d8 cmp %bl,%al 3d: 7f 08 jg 0x47 3f: 84 .byte 0x84 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220215235305.3272331-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16tipc: fix wrong notification node addressesJon Maloy
The previous bug fix had an unfortunate side effect that broke distribution of binding table entries between nodes. The updated tipc_sock_addr struct is also used further down in the same function, and there the old value is still the correct one. Fixes: 032062f363b4 ("tipc: fix wrong publisher node address in link publications") Signed-off-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20220216020009.3404578-1-jmaloy@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16ipv6: per-netns exclusive flowlabel checksWillem de Bruijn
Ipv6 flowlabels historically require a reservation before use. Optionally in exclusive mode (e.g., user-private). Commit 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist") introduced a fastpath that avoids this check when no exclusive leases exist in the system, and thus any flowlabel use will be granted. That allows skipping the control operation to reserve a flowlabel entirely. Though with a warning if the fast path fails: This is an optimization. Robust applications still have to revert to requesting leases if the fast path fails due to an exclusive lease. Still, this is subtle. Better isolate network namespaces from each other. Flowlabels are per-netns. Also record per-netns whether exclusive leases are in use. Then behavior does not change based on activity in other netns. Changes v2 - wrap in IS_ENABLED(CONFIG_IPV6) to avoid breakage if disabled Fixes: 59c820b2317f ("ipv6: elide flowlabel check if no exclusive leases exist") Link: https://lore.kernel.org/netdev/MWHPR2201MB1072BCCCFCE779E4094837ACD0329@MWHPR2201MB1072.namprd22.prod.outlook.com/ Reported-by: Congyu Liu <liu3101@purdue.edu> Signed-off-by: Willem de Bruijn <willemb@google.com> Tested-by: Congyu Liu <liu3101@purdue.edu> Link: https://lore.kernel.org/r/20220215160037.1976072-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16net: dsa: tag_8021q: only call skb_push/skb_pull around __skb_vlan_popVladimir Oltean
__skb_vlan_pop() needs skb->data to point at the mac_header, while skb_vlan_tag_present() and skb_vlan_tag_get() don't, because they don't look at skb->data at all. So we can avoid uselessly moving around skb->data for the case where the VLAN tag was offloaded by the DSA master. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220215204722.2134816-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16net: bridge: multicast: notify switchdev driver whenever MC processing gets ↵Oleksandr Mazur
disabled Whenever bridge driver hits the max capacity of MDBs, it disables the MC processing (by setting corresponding bridge option), but never notifies switchdev about such change (the notifiers are called only upon explicit setting of this option, through the registered netlink interface). This could lead to situation when Software MDB processing gets disabled, but this event never gets offloaded to the underlying Hardware. Fix this by adding a notify message in such case. Fixes: 147c1e9b902c ("switchdev: bridge: Offload multicast disabled") Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20220215165303.31908-1-oleksandr.mazur@plvision.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16net/smc: return ETIMEDOUT when smc_connect_clc() timeoutD. Wythe
When smc_connect_clc() times out, it will return -EAGAIN(tcp_recvmsg retuns -EAGAIN while timeout), then this value will passed to the application, which is quite confusing to the applications, makes inconsistency with TCP. From the manual of connect, ETIMEDOUT is more suitable, and this patch try convert EAGAIN to ETIMEDOUT in that case. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/1644913490-21594-1-git-send-email-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mac80211: fix forwarded mesh frames AC & queue selectionNicolas Escande
There are two problems with the current code that have been highlighted with the AQL feature that is now enbaled by default. First problem is in ieee80211_rx_h_mesh_fwding(), ieee80211_select_queue_80211() is used on received packets to choose the sending AC queue of the forwarding packet although this function should only be called on TX packet (it uses ieee80211_tx_info). This ends with forwarded mesh packets been sent on unrelated random AC queue. To fix that, AC queue can directly be infered from skb->priority which has been extracted from QOS info (see ieee80211_parse_qos()). Second problem is the value of queue_mapping set on forwarded mesh frames via skb_set_queue_mapping() is not the AC of the packet but a hardware queue index. This may or may not work depending on AC to HW queue mapping which is driver specific. Both of these issues lead to improper AC selection while forwarding mesh packets but more importantly due to improper airtime accounting (which is done on a per STA, per AC basis) caused traffic stall with the introduction of AQL. Fixes: cf44012810cc ("mac80211: fix unnecessary frame drops in mesh fwding") Fixes: d3c1597b8d1b ("mac80211: fix forwarded mesh frame queue mapping") Co-developed-by: Remi Pommarel <repk@triplefau.lt> Signed-off-by: Remi Pommarel <repk@triplefau.lt> Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Link: https://lore.kernel.org/r/20220214173214.368862-1-nico.escande@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-02-16mac80211: refuse aggregations sessions before authorizedJohannes Berg
If an MFP station isn't authorized, the receiver will (or at least should) drop the action frame since it's a robust management frame, but if we're not authorized we haven't installed keys yet. Refuse attempts to start a session as they'd just time out. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20220203201528.ff4d5679dce9.I34bb1f2bc341e161af2d6faf74f91b332ba11285@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-02-16mac80211: fix EAPoL rekey fail in 802.3 rx pathDeren Wu
mac80211 set capability NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211 to upper layer by default. That means we should pass EAPoL packets through nl80211 path only, and should not send the EAPoL skb to netdevice diretly. At the meanwhile, wpa_supplicant would not register sock to listen EAPoL skb on the netdevice. However, there is no control_port_protocol handler in mac80211 for 802.3 RX packets, mac80211 driver would pass up the EAPoL rekey frame to netdevice and wpa_supplicant would be never interactive with this kind of packets, if SUPPORTS_RX_DECAP_OFFLOAD is enabled. This causes STA always rekey fail if EAPoL frame go through 802.3 path. To avoid this problem, align the same process as 802.11 type to handle this frame before put it into network stack. This also addresses a potential security issue in 802.3 RX mode that was previously fixed in commit a8c4d76a8dd4 ("mac80211: do not accept/forward invalid EAPOL frames"). Cc: stable@vger.kernel.org # 5.12+ Fixes: 80a915ec4427 ("mac80211: add rx decapsulation offload support") Signed-off-by: Deren Wu <deren.wu@mediatek.com> Link: https://lore.kernel.org/r/6889c9fced5859ebb088564035f84fd0fa792a49.1644680751.git.deren.wu@mediatek.com [fix typos, update comment and add note about security issue] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-02-16net: dsa: offload bridge port VLANs on foreign interfacesVladimir Oltean
DSA now explicitly handles VLANs installed with the 'self' flag on the bridge as host VLANs, instead of just replicating every bridge port VLAN also on the CPU port and never deleting it, which is what it did before. However, this leaves a corner case uncovered, as explained by Tobias Waldekranz: https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260 Forwarding towards a bridge port VLAN installed on a bridge port foreign to DSA (separate NIC, Wi-Fi AP) used to work by virtue of the fact that DSA itself needed to have at least one port in that VLAN (therefore, it also had the CPU port in said VLAN). However, now that the CPU port may not be member of all VLANs that user ports are members of, we need to ensure this isn't the case if software forwarding to a foreign interface is required. The solution is to treat bridge port VLANs on standalone interfaces in the exact same way as host VLANs. From DSA's perspective, there is no difference between local termination and software forwarding; packets in that VLAN must reach the CPU in both cases. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: dsa: add explicit support for host bridge VLANsVladimir Oltean
Currently, DSA programs VLANs on shared (DSA and CPU) ports each time it does so on user ports. This is good for basic functionality but has several limitations: - the VLAN group which must reach the CPU may be radically different from the VLAN group that must be autonomously forwarded by the switch. In other words, the admin may want to isolate noisy stations and avoid traffic from them going to the control processor of the switch, where it would just waste useless cycles. The bridge already supports independent control of VLAN groups on bridge ports and on the bridge itself, and when VLAN-aware, it will drop packets in software anyway if their VID isn't added as a 'self' entry towards the bridge device. - Replaying host FDB entries may depend, for some drivers like mv88e6xxx, on replaying the host VLANs as well. The 2 VLAN groups are approximately the same in most regular cases, but there are corner cases when timing matters, and DSA's approximation of replicating VLANs on shared ports simply does not work. - If a user makes the bridge (implicitly the CPU port) join a VLAN by accident, there is no way for the CPU port to isolate itself from that noisy VLAN except by rebooting the system. This is because for each VLAN added on a user port, DSA will add it on shared ports too, but for each VLAN deletion on a user port, it will remain installed on shared ports, since DSA has no good indication of whether the VLAN is still in use or not. Now that the bridge driver emits well-balanced SWITCHDEV_OBJ_ID_PORT_VLAN addition and removal events, DSA has a simple and straightforward task of separating the bridge port VLANs (these have an orig_dev which is a DSA slave interface, or a LAG interface) from the host VLANs (these have an orig_dev which is a bridge interface), and to keep a simple reference count of each VID on each shared port. Forwarding VLANs must be installed on the bridge ports and on all DSA ports interconnecting them. We don't have a good view of the exact topology, so we simply install forwarding VLANs on all DSA ports, which is what has been done until now. Host VLANs must be installed primarily on the dedicated CPU port of each bridge port. More subtly, they must also be installed on upstream-facing and downstream-facing DSA ports that are connecting the bridge ports and the CPU. This ensures that the mv88e6xxx's problem (VID of host FDB entry may be absent from VTU) is still addressed even if that switch is in a cross-chip setup, and it has no local CPU port. Therefore: - user ports contain only bridge port (forwarding) VLANs, and no refcounting is necessary - DSA ports contain both forwarding and host VLANs. Refcounting is necessary among these 2 types. - CPU ports contain only host VLANs. Refcounting is also necessary. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: switchdev: introduce switchdev_handle_port_obj_{add,del} for foreign ↵Vladimir Oltean
interfaces The switchdev_handle_port_obj_add() helper is good for replicating a port object on the lower interfaces of @dev, if that object was emitted on a bridge, or on a bridge port that is a LAG. However, drivers that use this helper limit themselves to a box from which they can no longer intercept port objects notified on neighbor ports ("foreign interfaces"). One such driver is DSA, where software bridging with foreign interfaces such as standalone NICs or Wi-Fi APs is an important use case. There, a VLAN installed on a neighbor bridge port roughly corresponds to a forwarding VLAN installed on the DSA switch's CPU port. To support this use case while also making use of the benefits of the switchdev_handle_* replication helper for port objects, introduce a new variant of these functions that crawls through the neighbor ports of @dev, in search of potentially compatible switchdev ports that are interested in the event. The strategy is identical to switchdev_handle_fdb_event_to_device(): if @dev wasn't a switchdev interface, then go one step upper, and recursively call this function on the bridge that this port belongs to. At the next recursion step, __switchdev_handle_port_obj_add() will iterate through the bridge's lower interfaces. Among those, some will be switchdev interfaces, and one will be the original @dev that we came from. To prevent infinite recursion, we must suppress reentry into the original @dev, and just call the @add_cb for the switchdev_interfaces. It looks like this: br0 / | \ / | \ / | \ swp0 swp1 eth0 1. __switchdev_handle_port_obj_add(eth0) -> check_cb(eth0) returns false -> eth0 has no lower interfaces -> eth0's bridge is br0 -> switchdev_lower_dev_find(br0, check_cb, foreign_dev_check_cb)) finds br0 2. __switchdev_handle_port_obj_add(br0) -> check_cb(br0) returns false -> netdev_for_each_lower_dev -> check_cb(swp0) returns true, so we don't skip this interface 3. __switchdev_handle_port_obj_add(swp0) -> check_cb(swp0) returns true, so we call add_cb(swp0) (back to netdev_for_each_lower_dev from 2) -> check_cb(swp1) returns true, so we don't skip this interface 4. __switchdev_handle_port_obj_add(swp1) -> check_cb(swp1) returns true, so we call add_cb(swp1) (back to netdev_for_each_lower_dev from 2) -> check_cb(eth0) returns false, so we skip this interface to avoid infinite recursion Note: eth0 could have been a LAG, and we don't want to suppress the recursion through its lowers if those exist, so when check_cb() returns false, we still call switchdev_lower_dev_find() to estimate whether there's anything worth a recursion beneath that LAG. Using check_cb() and foreign_dev_check_cb(), switchdev_lower_dev_find() not only figures out whether the lowers of the LAG are switchdev, but also whether they actively offload the LAG or not (whether the LAG is "foreign" to the switchdev interface or not). The port_obj_info->orig_dev is preserved across recursive calls, so switchdev drivers still know on which device was this notification originally emitted. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: switchdev: rename switchdev_lower_dev_find to switchdev_lower_dev_find_rcuVladimir Oltean
switchdev_lower_dev_find() assumes RCU read-side critical section calling context, since it uses netdev_walk_all_lower_dev_rcu(). Rename it appropriately, in preparation of adding a similar iterator that assumes writer-side rtnl_mutex protection. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: switchdev: replay all VLAN groupsVladimir Oltean
The major user of replayed switchdev objects is DSA, and so far it hasn't needed information about anything other than bridge port VLANs, so this is all that br_switchdev_vlan_replay() knows to handle. DSA has managed to get by through replicating every VLAN addition on a user port such that the same VLAN is also added on all DSA and CPU ports, but there is a corner case where this does not work. The mv88e6xxx DSA driver currently prints this error message as soon as the first port of a switch joins a bridge: mv88e6085 0x0000000008b96000:00: port 0 failed to add a6:ef:77:c8:5f:3d vid 1 to fdb: -95 where a6:ef:77:c8:5f:3d vid 1 is a local FDB entry corresponding to the bridge MAC address in the default_pvid. The -EOPNOTSUPP is returned by mv88e6xxx_port_db_load_purge() because it tries to map VID 1 to a FID (the ATU is indexed by FID not VID), but fails to do so. This is because ->port_fdb_add() is called before ->port_vlan_add() for VID 1. The abridged timeline of the calls is: br_add_if -> netdev_master_upper_dev_link -> dsa_port_bridge_join -> switchdev_bridge_port_offload -> br_switchdev_vlan_replay (*) -> br_switchdev_fdb_replay -> mv88e6xxx_port_fdb_add -> nbp_vlan_init -> nbp_vlan_add -> mv88e6xxx_port_vlan_add and the issue is that at the time of (*), the bridge port isn't in VID 1 (nbp_vlan_init hasn't been called), therefore br_switchdev_vlan_replay() won't have anything to replay, therefore VID 1 won't be in the VTU by the time mv88e6xxx_port_fdb_add() is called. This happens only when the first port of a switch joins. For further ports, the initial mv88e6xxx_port_vlan_add() is sufficient for VID 1 to be loaded in the VTU (which is switch-wide, not per port). The problem is somewhat unique to mv88e6xxx by chance, because most other drivers offload an FDB entry by VID, so FDBs and VLANs can be added asynchronously with respect to each other, but addressing the issue at the bridge layer makes sense, since what mv88e6xxx requires isn't absurd. To fix this problem, we need to recognize that it isn't the VLAN group of the port that we're interested in, but the VLAN group of the bridge itself (so it isn't a timing issue, but rather insufficient information being passed from switchdev to drivers). As mentioned, currently nbp_switchdev_sync_objs() only calls br_switchdev_vlan_replay() for VLANs corresponding to the port, but the VLANs corresponding to the bridge itself, for local termination, also need to be replayed. In this case, VID 1 is not (yet) present in the port's VLAN group but is present in the bridge's VLAN group. So to fix this bug, DSA is now obligated to explicitly handle VLANs pointing towards the bridge in order to "close this race" (which isn't really a race). As Tobias Waldekranz notices, this also implies that it must explicitly handle port VLANs on foreign interfaces, something that worked implicitly before: https://patchwork.kernel.org/project/netdevbpf/patch/20220209213044.2353153-6-vladimir.oltean@nxp.com/#24735260 So in the end, br_switchdev_vlan_replay() must replay all VLANs from all VLAN groups: all the ports, and the bridge itself. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: make nbp_switchdev_unsync_objs() follow reverse order of sync()Vladimir Oltean
There may be switchdev drivers that can add/remove a FDB or MDB entry only as long as the VLAN it's in has been notified and offloaded first. The nbp_switchdev_sync_objs() method satisfies this requirement on addition, but nbp_switchdev_unsync_objs() first deletes VLANs, then deletes MDBs and FDBs. Reverse the order of the function calls to cater to this requirement. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: switchdev: differentiate new VLANs from changed onesVladimir Oltean
br_switchdev_port_vlan_add() currently emits a SWITCHDEV_PORT_OBJ_ADD event with a SWITCHDEV_OBJ_ID_PORT_VLAN for 2 distinct cases: - a struct net_bridge_vlan got created - an existing struct net_bridge_vlan was modified This makes it impossible for switchdev drivers to properly balance PORT_OBJ_ADD with PORT_OBJ_DEL events, so if we want to allow that to happen, we must provide a way for drivers to distinguish between a VLAN with changed flags and a new one. Annotate struct switchdev_obj_port_vlan with a "bool changed" that distinguishes the 2 cases above. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: vlan: notify switchdev only when something changedVladimir Oltean
Currently, when a VLAN entry is added multiple times in a row to a bridge port, nbp_vlan_add() calls br_switchdev_port_vlan_add() each time, even if the VLAN already exists and nothing about it has changed: bridge vlan add dev lan12 vid 100 master static Similarly, when a VLAN is added multiple times in a row to a bridge, br_vlan_add_existing() doesn't filter at all the calls to br_switchdev_port_vlan_add(): bridge vlan add dev br0 vid 100 self This behavior makes driver-level accounting of VLANs impossible, since it is enough for a single deletion event to remove a VLAN, but the addition event can be emitted an unlimited number of times. The cause for this can be identified as follows: we rely on __vlan_add_flags() to retroactively tell us whether it has changed anything about the VLAN flags or VLAN group pvid. So we'd first have to call __vlan_add_flags() before calling br_switchdev_port_vlan_add(), in order to have access to the "bool *changed" information. But we don't want to change the event ordering, because we'd have to revert the struct net_bridge_vlan changes we've made if switchdev returns an error. So to solve this, we need another function that tells us whether any change is going to occur in the VLAN or VLAN group, _prior_ to calling __vlan_add_flags(). Split __vlan_add_flags() into a precommit and a commit stage, and rename it to __vlan_flags_update(). The precommit stage, __vlan_flags_would_change(), will determine whether there is any reason to notify switchdev due to a change of flags (note: the BRENTRY flag transition from false to true is treated separately: as a new switchdev entry, because we skipped notifying the master VLAN when it wasn't a brentry yet, and therefore not as a change of flags). With this lookahead/precommit function in place, we can avoid notifying switchdev if nothing changed for the VLAN and VLAN group. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: vlan: make __vlan_add_flags react only to PVID and UNTAGGEDVladimir Oltean
Currently there is a very subtle aspect to the behavior of __vlan_add_flags(): it changes the struct net_bridge_vlan flags and pvid, yet it returns true ("changed") even if none of those changed, just a transition of br_vlan_is_brentry(v) took place from false to true. This can be seen in br_vlan_add_existing(), however we do not actually rely on this subtle behavior, since the "if" condition that checks that the vlan wasn't a brentry before had a useless (until now) assignment: *changed = true; Make things more obvious by actually making __vlan_add_flags() do what's written on the box, and be more specific about what is actually written on the box. This is needed because further transformations will be done to __vlan_add_flags(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: vlan: don't notify to switchdev master VLANs without BRENTRY flagVladimir Oltean
When a VLAN is added to a bridge port and it doesn't exist on the bridge device yet, it gets created for the multicast context, but it is 'hidden', since it doesn't have the BRENTRY flag yet: ip link add br0 type bridge && ip link set swp0 master br0 bridge vlan add dev swp0 vid 100 # the master VLAN 100 gets created bridge vlan add dev br0 vid 100 self # that VLAN becomes brentry just now All switchdev drivers ignore switchdev notifiers for VLAN entries which have the BRENTRY unset, and for good reason: these are merely private data structures used by the bridge driver. So we might just as well not notify those at all. Cleanup in the switchdev drivers that check for the BRENTRY flag is now possible, and will be handled separately, since those checks just became dead code. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16net: bridge: vlan: check early for lack of BRENTRY flag in br_vlan_add_existingVladimir Oltean
When a VLAN is added to a bridge port, a master VLAN gets created on the bridge for context, but it doesn't have the BRENTRY flag. Then, when the same VLAN is added to the bridge itself, that enters through the br_vlan_add_existing() code path and gains the BRENTRY flag, thus it becomes "existing". It seems natural to check for this condition early, because the current code flow is to notify switchdev of the addition of a VLAN that isn't a brentry, just to delete it immediately afterwards. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-15mctp: fix use after freeTom Rix
Clang static analysis reports this problem route.c:425:4: warning: Use of memory after it is freed trace_mctp_key_acquire(key); ^~~~~~~~~~~~~~~~~~~~~~~~~~~ When mctp_key_add() fails, key is freed but then is later used in trace_mctp_key_acquire(). Add an else statement to use the key only when mctp_key_add() is successful. Fixes: 4f9e1ba6de45 ("mctp: Add tracepoints for tag/key handling") Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-15net: bridge: vlan: check for errors from __vlan_del in __vlan_flushVladimir Oltean
If the following call path returns an error from switchdev: nbp_vlan_flush -> __vlan_del -> __vlan_vid_del -> br_switchdev_port_vlan_del -> __vlan_group_free -> WARN_ON(!list_empty(&vg->vlan_list)); then the deletion of the net_bridge_vlan is silently halted, which will trigger the WARN_ON from __vlan_group_free(). The WARN_ON is rather unhelpful, because nothing about the source of the error is printed. Add a print to catch errors from __vlan_del. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-15ipv4: add description about martian sourceZhang Yunkai
When multiple containers are running in the environment and multiple macvlan network port are configured in each container, a lot of martian source prints will appear after martian_log is enabled. they are almost the same, and printed by net_warn_ratelimited. Each arp message will trigger this print on each network port. Such as: IPv4: martian source 173.254.95.16 from 173.254.100.109, on dev eth0 ll header: 00000000: ff ff ff ff ff ff 40 00 ad fe 64 6d 08 06 ......@...dm.. IPv4: martian source 173.254.95.16 from 173.254.100.109, on dev eth1 ll header: 00000000: ff ff ff ff ff ff 40 00 ad fe 64 6d 08 06 ......@...dm.. There is no description of this kind of source in the RFC1812. Signed-off-by: Zhang Yunkai <zhang.yunkai@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14tipc: fix wrong publisher node address in link publicationsJon Maloy
When a link comes up we add its presence to the name table to make it possible for users to subscribe for link up/down events. However, after a previous call signature change the binding is wrongly published with the peer node as publishing node, instead of the own node as it should be. This has the effect that the command 'tipc name table show' will list the link binding (service type 2) with node scope and a peer node as originator, something that obviously is impossible. We correct this bug here. Fixes: 50a3499ab853 ("tipc: simplify signature of tipc_namtbl_publish()") Signed-off-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20220214013852.2803940-1-jmaloy@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-14ipv6: blackhole_netdev needs snmp6 countersIdo Schimmel
Whenever rt6_uncached_list_flush_dev() swaps rt->rt6_idev to the blackhole device, parts of IPv6 stack might still need to increment one SNMP counter. Root cause, patch from Ido, changelog from Eric :) This bug suggests that we need to audit rt->rt6_idev usages and make sure they are properly using RCU protection. Fixes: e5f80fcf869a ("ipv6: give an IPv6 dev to blackhole_netdev") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net: fix documentation for kernel_getsocknameAlex Maydanik
Fixes return value documentation of kernel_getsockname() and kernel_getpeername() functions. The previous documentation wrongly specified that the return value is 0 in case of success, however sock->ops->getname returns the length of the address in bytes in case of success. Signed-off-by: Alex Maydanik <alexander.maydanik@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net: dev: Make rps_lock() disable interrupts.Sebastian Andrzej Siewior
Disabling interrupts and in the RPS case locking input_pkt_queue is split into local_irq_disable() and optional spin_lock(). This breaks on PREEMPT_RT because the spinlock_t typed lock can not be acquired with disabled interrupts. The sections in which the lock is acquired is usually short in a sense that it is not causing long und unbounded latiencies. One exception is the skb_flow_limit() invocation which may invoke a BPF program (and may require sleeping locks). By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels. Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens as part of local_bh_disable() on the local CPU. ____napi_schedule() is only invoked if sd is from the local CPU. Replace it with __napi_schedule_irqoff() which already disables interrupts on PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the function to napi_schedule_rps as suggested by Jakub. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net: dev: Makes sure netif_rx() can be invoked in any context.Sebastian Andrzej Siewior
Dave suggested a while ago (eleven years by now) "Let's make netif_rx() work in all contexts and get rid of netif_rx_ni()". Eric agreed and pointed out that modern devices should use netif_receive_skb() to avoid the overhead. In the meantime someone added another variant, netif_rx_any_context(), which behaves as suggested. netif_rx() must be invoked with disabled bottom halves to ensure that pending softirqs, which were raised within the function, are handled. netif_rx_ni() can be invoked only from process context (bottom halves must be enabled) because the function handles pending softirqs without checking if bottom halves were disabled or not. netif_rx_any_context() invokes on the former functions by checking in_interrupts(). netif_rx() could be taught to handle both cases (disabled and enabled bottom halves) by simply disabling bottom halves while invoking netif_rx_internal(). The local_bh_enable() invocation will then invoke pending softirqs only if the BH-disable counter drops to zero. Eric is concerned about the overhead of BH-disable+enable especially in regard to the loopback driver. As critical as this driver is, it will receive a shortcut to avoid the additional overhead which is not needed. Add a local_bh_disable() section in netif_rx() to ensure softirqs are handled if needed. Provide __netif_rx() which does not disable BH and has a lockdep assert to ensure that interrupts are disabled. Use this shortcut in the loopback driver and in drivers/net/*.c. Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they can be removed once they are no more users left. Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().Sebastian Andrzej Siewior
The preempt_disable() () section was introduced in commit cece1945bffcf ("net: disable preemption before call smp_processor_id()") and adds it in case this function is invoked from preemtible context and because get_cpu() later on as been added. The get_cpu() usage was added in commit b0e28f1effd1d ("net: netif_rx() must disable preemption") because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption causing a warning in smp_processor_id(). The function netif_rx() should only be invoked from an interrupt context which implies disabled preemption. The commit e30b38c298b55 ("ip: Fix ip_dev_loopback_xmit()") was addressing this and replaced netif_rx() with in netif_rx_ni() in ip_dev_loopback_xmit(). Based on the discussion on the list, the former patch (b0e28f1effd1d) should not have been applied only the latter (e30b38c298b55). Remove get_cpu() and preempt_disable() since the function is supposed to be invoked from context with stable per-CPU pointers. Bottom halves have to be disabled at this point because the function may raise softirqs which need to be processed. Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net_sched: add __rcu annotation to netdev->qdiscEric Dumazet
syzbot found a data-race [1] which lead me to add __rcu annotations to netdev->qdisc, and proper accessors to get LOCKDEP support. [1] BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1: attach_default_qdiscs net/sched/sch_generic.c:1167 [inline] dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221 __dev_open+0x2e9/0x3a0 net/core/dev.c:1416 __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139 rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150 __rtnl_newlink net/core/rtnetlink.c:3489 [inline] rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529 rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0: qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323 __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050 tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211 rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline] netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmsg+0x195/0x230 net/socket.c:2496 __do_sys_sendmsg net/socket.c:2505 [inline] __se_sys_sendmsg net/socket.c:2503 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 470502de5bdb ("net: sched: unlock rules update API") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@mellanox.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net: dsa: mv88e6xxx: flush switchdev FDB workqueue before removing VLANVladimir Oltean
mv88e6xxx is special among DSA drivers in that it requires the VTU to contain the VID of the FDB entry it modifies in mv88e6xxx_port_db_load_purge(), otherwise it will return -EOPNOTSUPP. Sometimes due to races this is not always satisfied even if external code does everything right (first deletes the FDB entries, then the VLAN), because DSA commits to hardware FDB entries asynchronously since commit c9eb3e0f8701 ("net: dsa: Add support for learning FDB through notification"). Therefore, the mv88e6xxx driver must close this race condition by itself, by asking DSA to flush the switchdev workqueue of any FDB deletions in progress, prior to exiting a VLAN. Fixes: c9eb3e0f8701 ("net: dsa: Add support for learning FDB through notification") Reported-by: Rafael Richter <rafael.richter@gin.de> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14ipv6: mcast: use rcu-safe version of ipv6_get_lladdr()Ignat Korchagin
Some time ago 8965779d2c0e ("ipv6,mcast: always hold idev->lock before mca_lock") switched ipv6_get_lladdr() to __ipv6_get_lladdr(), which is rcu-unsafe version. That was OK, because idev->lock was held for these codepaths. In 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") these external locks were removed, so we probably need to restore the original rcu-safe call. Otherwise, we occasionally get a machine crashed/stalled with the following in dmesg: [ 3405.966610][T230589] general protection fault, probably for non-canonical address 0xdead00000000008c: 0000 [#1] SMP NOPTI [ 3405.982083][T230589] CPU: 44 PID: 230589 Comm: kworker/44:3 Tainted: G O 5.15.19-cloudflare-2022.2.1 #1 [ 3405.998061][T230589] Hardware name: SUPA-COOL-SERV [ 3406.009552][T230589] Workqueue: mld mld_ifc_work [ 3406.017224][T230589] RIP: 0010:__ipv6_get_lladdr+0x34/0x60 [ 3406.025780][T230589] Code: 57 10 48 83 c7 08 48 89 e5 48 39 d7 74 3e 48 8d 82 38 ff ff ff eb 13 48 8b 90 d0 00 00 00 48 8d 82 38 ff ff ff 48 39 d7 74 22 <66> 83 78 32 20 77 1b 75 e4 89 ca 23 50 2c 75 dd 48 8b 50 08 48 8b [ 3406.055748][T230589] RSP: 0018:ffff94e4b3fc3d10 EFLAGS: 00010202 [ 3406.065617][T230589] RAX: dead00000000005a RBX: ffff94e4b3fc3d30 RCX: 0000000000000040 [ 3406.077477][T230589] RDX: dead000000000122 RSI: ffff94e4b3fc3d30 RDI: ffff8c3a31431008 [ 3406.089389][T230589] RBP: ffff94e4b3fc3d10 R08: 0000000000000000 R09: 0000000000000000 [ 3406.101445][T230589] R10: ffff8c3a31430000 R11: 000000000000000b R12: ffff8c2c37887100 [ 3406.113553][T230589] R13: ffff8c3a39537000 R14: 00000000000005dc R15: ffff8c3a31431000 [ 3406.125730][T230589] FS: 0000000000000000(0000) GS:ffff8c3b9fc80000(0000) knlGS:0000000000000000 [ 3406.138992][T230589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3406.149895][T230589] CR2: 00007f0dfea1db60 CR3: 000000387b5f2000 CR4: 0000000000350ee0 [ 3406.162421][T230589] Call Trace: [ 3406.170235][T230589] <TASK> [ 3406.177736][T230589] mld_newpack+0xfe/0x1a0 [ 3406.186686][T230589] add_grhead+0x87/0xa0 [ 3406.195498][T230589] add_grec+0x485/0x4e0 [ 3406.204310][T230589] ? newidle_balance+0x126/0x3f0 [ 3406.214024][T230589] mld_ifc_work+0x15d/0x450 [ 3406.223279][T230589] process_one_work+0x1e6/0x380 [ 3406.232982][T230589] worker_thread+0x50/0x3a0 [ 3406.242371][T230589] ? rescuer_thread+0x360/0x360 [ 3406.252175][T230589] kthread+0x127/0x150 [ 3406.261197][T230589] ? set_kthread_struct+0x40/0x40 [ 3406.271287][T230589] ret_from_fork+0x22/0x30 [ 3406.280812][T230589] </TASK> [ 3406.288937][T230589] Modules linked in: ... [last unloaded: kheaders] [ 3406.476714][T230589] ---[ end trace 3525a7655f2f3b9e ]--- Fixes: 88e2ca308094 ("mld: convert ifmcaddr6 to RCU") Reported-by: David Pinilla Caparros <dpini@cloudflare.com> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14ipv6: Add reasons for skb drops to __udp6_lib_rcvDavid Ahern
Add reasons to __udp6_lib_rcv for skb drops. The only twist is that the NO_SOCKET takes precedence over the CSUM or other counters for that path (motivation behind this patch - csum counter was misleading). Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net/smc: Add comment for smc_tx_pendingTony Lu
The previous patch introduces a lock-free version of smc_tx_work() to solve unnecessary lock contention, which is expected to be held lock. So this adds comment to remind people to keep an eye out for locks. Suggested-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14Generate netlink notification when default IPv6 route preference changesKalash Nainwal
Generate RTM_NEWROUTE netlink notification when the route preference changes on an existing kernel generated default route in response to RA messages. Currently netlink notifications are generated only when this route is added or deleted but not when the route preference changes, which can cause userspace routing application state to go out of sync with kernel. Signed-off-by: Kalash Nainwal <kalash@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-14net/sched: act_police: more accurate MTU policingDavide Caratti
in current Linux, MTU policing does not take into account that packets at the TC ingress have the L2 header pulled. Thus, the same TC police action (with the same value of tcfp_mtu) behaves differently for ingress/egress. In addition, the full GSO size is compared to tcfp_mtu: as a consequence, the policer drops GSO packets even when individual segments have the L2 + L3 + L4 + payload length below the configured valued of tcfp_mtu. Improve the accuracy of MTU policing as follows: - account for mac_len for non-GSO packets at TC ingress. - compare MTU threshold with the segmented size for GSO packets. Also, add a kselftest that verifies the correct behavior. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-13netfilter: xt_socket: missing ifdef CONFIG_IP6_NF_IPTABLES dependencyPablo Neira Ayuso
nf_defrag_ipv6_disable() requires CONFIG_IP6_NF_IPTABLES. Fixes: 75063c9294fb ("netfilter: xt_socket: fix a typo in socket_mt_destroy()") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Eric Dumazet<edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-13tipc: fix a bit overflow in tipc_crypto_key_rcv()Hangyu Hua
msg_data_sz return a 32bit value, but size is 16bit. This may lead to a bit overflow. Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-11Merge tag 'wireless-next-2022-02-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next wireless-next patches for v5.18 First set of patches for v5.18, with both wireless and stack patches. rtw89 now has AP mode support and wcn36xx has survey support. But otherwise pretty normal. Major changes: ath11k * add LDPC FEC type in 802.11 radiotap header * enable RX PPDU stats in monitor co-exist mode wcn36xx * implement survey reporting brcmfmac * add CYW43570 PCIE device rtw88 * rtw8821c: enable RFE 6 devices rtw89 * AP mode support mt76 * mt7916 support * background radar detection support
2022-02-11Merge tag 'wireless-2022-02-11' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless wireless fixes for v5.17 Second set of fixes for v5.17. This is the first pull request with both driver and stack patches. Most important here are a regression fix for brcmfmac USB devices and an iwlwifi fix for use after free when the firmware was missing. We have new maintainers for ath9k and wcn36xx as well as ath6kl is now orphaned. Also smaller fixes to iwlwifi and stack.
2022-02-11bpf: Do not try bpf_msg_push_data with len 0Felix Maurer
If bpf_msg_push_data() is called with len 0 (as it happens during selftests/bpf/test_sockmap), we do not need to do anything and can return early. Calling bpf_msg_push_data() with len 0 previously lead to a wrong ENOMEM error: we later called get_order(copy + len); if len was 0, copy + len was also often 0 and get_order() returned some undefined value (at the moment 52). alloc_pages() caught that and failed, but then bpf_msg_push_data() returned ENOMEM. This was wrong because we are most probably not out of memory and actually do not need any additional memory. Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data") Signed-off-by: Felix Maurer <fmaurer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/df69012695c7094ccb1943ca02b4920db3537466.1644421921.git.fmaurer@redhat.com
2022-02-11Merge ra.kernel.org:/pub/scm/linux/kernel/git/netfilter/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Add selftest for nft_synproxy, from Florian Westphal. 2) xt_socket destroy path incorrectly disables IPv4 defrag for IPv6 traffic (typo), from Eric Dumazet. 3) Fix exit value selftest nft_concat_range.sh, from Hangbin Liu. 4) nft_synproxy disables the IPv4 hooks if the IPv6 hooks fail to be registered. 5) disable rp_filter on router in selftest nft_fib.sh, also from Hangbin Liu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-11ipv4: add (struct uncached_list)->quarantine listEric Dumazet
This is an optimization to keep the per-cpu lists as short as possible: Whenever rt_flush_dev() changes one rtable dst.dev matching the disappearing device, it can can transfer the object to a quarantine list, waiting for a final rt_del_uncached_list(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-11ipv6: add (struct uncached_list)->quarantine listEric Dumazet
This is an optimization to keep the per-cpu lists as short as possible: Whenever rt6_uncached_list_flush_dev() changes one rt6_info matching the disappearing device, it can can transfer the object to a quarantine list, waiting for a final rt6_uncached_list_del(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-11ipv6: give an IPv6 dev to blackhole_netdevEric Dumazet
IPv6 addrconf notifiers wants the loopback device to be the last device being dismantled at netns deletion. This caused many limitations and work arounds. Back in linux-5.3, Mahesh added a per host blackhole_netdev that can be used whenever we need to make sure objects no longer refer to a disappearing device. If we attach to blackhole_netdev an ip6_ptr (allocate an idev), then we can use this special device (which is never freed) in place of the loopback_dev (which can be freed). This will permit improvements in netdev_run_todo() and other parts of the stack where had steps to make sure loopback_dev was the last device to disappear. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-11ipv6: get rid of net->ipv6.rt6_stats->fib_rt_uncacheEric Dumazet
This counter has never been visible, there is little point trying to maintain it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>