summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-05-16net/sched: act_pedit: sanitize shift argument before usagePaolo Abeni
syzbot was able to trigger an Out-of-Bound on the pedit action: UBSAN: shift-out-of-bounds in net/sched/act_pedit.c:238:43 shift exponent 1400735974 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 3606 Comm: syz-executor151 Not tainted 5.18.0-rc5-syzkaller-00165-g810c2f0a3f86 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x50 lib/ubsan.c:151 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x187 lib/ubsan.c:322 tcf_pedit_init.cold+0x1a/0x1f net/sched/act_pedit.c:238 tcf_action_init_1+0x414/0x690 net/sched/act_api.c:1367 tcf_action_init+0x530/0x8d0 net/sched/act_api.c:1432 tcf_action_add+0xf9/0x480 net/sched/act_api.c:1956 tc_ctl_action+0x346/0x470 net/sched/act_api.c:2015 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5993 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x6e2/0x800 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fe36e9e1b59 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffef796fe88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe36e9e1b59 RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003 RBP: 00007fe36e9a5d00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe36e9a5d90 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The 'shift' field is not validated, and any value above 31 will trigger out-of-bounds. The issue predates the git history, but syzbot was able to trigger it only after the commit mentioned in the fixes tag, and this change only applies on top of such commit. Address the issue bounding the 'shift' value to the maximum allowed by the relevant operator. Reported-and-tested-by: syzbot+8ed8fc4c57e9dcf23ca6@syzkaller.appspotmail.com Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16octeontx2-pf: Remove unnecessary synchronize_irq() before free_irq()Minghao Chi
Calling synchronize_irq() right before free_irq() is quite useless. On one hand the IRQ can easily fire again before free_irq() is entered, on the other hand free_irq() itself calls synchronize_irq() internally (in a race condition free way), before any state associated with the IRQ is freed. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: wwan: t7xx: Fix return type of t7xx_dl_add_timedout()YueHaibing
t7xx_dl_add_timedout() now return int 'ret', but the return type is bool. Change the return type to int for furthor errcode upstream. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ALSA: usb-audio: Restore Rane SL-1 quirkTakashi Iwai
At cleaning up and moving the device rename from the quirk table to its own table, we removed the entry for Rane SL-1 as we thought it's only for renaming. It turned out, however, that the quirk is required for matching with the device that declares itself as no standard audio but only as vendor-specific. Restore the quirk entry for Rane SL-1 to fix the regression. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215887 Fixes: 5436f59bc5bc ("ALSA: usb-audio: Move device rename and profile quirks to an internal table") Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220516103112.12950-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-05-16octeon_ep: delete unnecessary NULL checkZiyang Xuan
vfree(NULL) is safe. NULL check before vfree() is not needed. Delete them to simplify the code. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16octeon_ep: add missing destroy_workqueue in octep_init_moduleZheng Bin
octep_init_module misses destroy_workqueue in error path, this patch fixes that. Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge branch 'net-skb-defer-freeing-polish'David S. Miller
Eric Dumazet says: ==================== net: polish skb defer freeing While testing this recently added feature on a variety of platforms/configurations, I found the following issues: 1) A race leading to concurrent calls to smp_call_function_single_async() 2) Missed opportunity to use napi_consume_skb() 3) Need to limit the max length of the per-cpu lists. 4) Process the per-cpu list more frequently, for the (unusual) case where net_rx_action() has mutiple napi_poll() to process per round. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: call skb_defer_free_flush() before each napi_poll()Eric Dumazet
skb_defer_free_flush() can consume cpu cycles, it seems better to call it in the inner loop: - Potentially frees page/skb that will be reallocated while hot. - Account for the cpu cycles in the @time_limit determination. - Keep softnet_data.defer_count small to reduce chances for skb_attempt_defer_free() to send an IPI. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: add skb_defer_max sysctlEric Dumazet
commit 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") added another per-cpu cache of skbs. It was expected to be small, and an IPI was forced whenever the list reached 128 skbs. We might need to be able to control more precisely queue capacity and added latency. An IPI is generated whenever queue reaches half capacity. Default value of the new limit is 64. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: use napi_consume_skb() in skb_defer_free_flush()Eric Dumazet
skb_defer_free_flush() runs from softirq context, we have the opportunity to refill the napi_alloc_cache, and/or use kmem_cache_free_bulk() when this cache is full. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: fix possible race in skb_attempt_defer_free()Eric Dumazet
A cpu can observe sd->defer_count reaching 128, and call smp_call_function_single_async() Problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered. This is a common issue with smp_call_function_single_async(). Callers must ensure correct synchronization and serialization. I triggered this issue while experimenting smaller threshold. Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. Test of CSD_FLAG_LOCK presence is racy anyway. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: tulip: convert to devresRolf Eike Beer
Works fine on my HP C3600: [ 274.452394] tulip0: no phy info, aborting mtable build [ 274.499041] tulip0: MII transceiver #1 config 1000 status 782d advertising 01e1 [ 274.750691] net eth0: Digital DS21142/43 Tulip rev 65 at MMIO 0xf4008000, 00:30:6e:08:7d:21, IRQ 17 [ 283.104520] net eth0: Setting full-duplex based on MII#1 link partner capability of c1e1 Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge ath-next from git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.gitKalle Valo
ath.git patches for v5.19. Major changes: ath11k * enable keepalive during WoWLAN suspend * implement remain-on-channel support
2022-05-16Merge tag 'mt76-for-kvalo-2022-05-12' of https://github.com/nbd168/wirelessKalle Valo
mt76 patches for 5.19 - tx locking improvements - wireless ethernet dispatch support for flow offload - non-standard VHT MCS10-11 support - fixes - runtime PM improvements - mt7921 AP mode support - mt7921 ipv6 NS offload support
2022-05-16net: hinic: add missing destroy_workqueue in hinic_pf_to_mgmt_initZheng Bin
hinic_pf_to_mgmt_init misses destroy_workqueue in error path, this patch fixes that. Fixes: 6dbb89014dc3 ("hinic: fix sending mailbox timeout in aeq event work") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge branch 'skb-drop-reason-boundary'David S. Miller
Menglong Dong says: ==================== net: skb: check the boundrary of skb drop reason In the commit 1330b6ef3313 ("skb: make drop reason booleanable"), SKB_NOT_DROPPED_YET is added to the enum skb_drop_reason, which makes the invalid drop reason SKB_NOT_DROPPED_YET can leak to the kfree_skb tracepoint. Once this happen (it happened, as 4th patch says), it can cause NULL pointer in drop monitor and result in kernel panic. Therefore, check the boundrary of drop reason in both kfree_skb_reason (2th patch) and drop monitor (1th patch) to prevent such case happens again. Meanwhile, fix the invalid drop reason passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv(). Changes since v2: 1/4 - don't reset the reason and print the debug warning only (Jakub Kicinski) 4/4 - remove new lines between tags Changes since v1: - consider tcp_v6_rcv() in the 4th patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()Menglong Dong
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the return value of tcp_inbound_md5_hash(). And it can panic the kernel with NULL pointer in net_dm_packet_report_size() if the reason is 0, as drop_reasons[0] is NULL. Fixes: 1330b6ef3313 ("skb: make drop reason booleanable") Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: skb: change the definition SKB_DR_SET()Menglong Dong
The SKB_DR_OR() is used to set the drop reason to a value when it is not set or specified yet. SKB_NOT_DROPPED_YET should also be considered as not set. Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: skb: check the boundrary of drop reason in kfree_skb_reason()Menglong Dong
Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after we make it the return value of the functions with return type of enum skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason(). So we check the range of drop reason in kfree_skb_reason() with DEBUG_NET_WARN_ON_ONCE(). Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: dm: check the boundary of skb drop reasonsMenglong Dong
The 'reason' will be set to 'SKB_DROP_REASON_NOT_SPECIFIED' if it not small that SKB_DROP_REASON_MAX in net_dm_packet_trace_kfree_skb_hit(), but it can't avoid it to be 0, which is invalid and can cause NULL pointer in drop_reasons. Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'. Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net/smc: align the connect behaviour with TCPGuangguan Wang
Connect with O_NONBLOCK will not be completed immediately and returns -EINPROGRESS. It is possible to use selector/poll for completion by selecting the socket for writing. After select indicates writability, a second connect function call will return 0 to indicate connected successfully as TCP does, but smc returns -EISCONN. Use socket state for smc to indicate connect state, which can help smc aligning the connect behaviour with TCP. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge branch 'sk_bound_dev_if-annotations'David S. Miller
Eric Dumazet says: ==================== net: add annotations for sk->sk_bound_dev_if While writes on sk->sk_bound_dev_if are protected by socket lock, we have many lockless reads all over the places. This is based on syzbot report found in the first patch changelog. v2: inline ipv6 function only defined if IS_ENABLED(CONFIG_IPV6) (kernel bots) Change the INET6_MATCH() to inet6_match(), this is no longer a macro. Change INET_MATCH() to inet_match() (Olivier Hartkopp & Jakub Kicinski) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16inet: rename INET_MATCH()Eric Dumazet
This is no longer a macro, but an inlined function. INET_MATCH() -> inet_match() Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()Eric Dumazet
INET6_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET6_MATCH() an inline function to ease our changes. v2: inline function only defined if IS_ENABLED(CONFIG_IPV6) Change the name to inet6_match(), this is no longer a macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16l2tp: use add READ_ONCE() to fetch sk->sk_bound_dev_ifEric Dumazet
Use READ_ONCE() in paths not holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()Eric Dumazet
sk->sk_bound_dev_if can change under us, use READ_ONCE() annotation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16inet: add READ_ONCE(sk->sk_bound_dev_if) in inet_csk_bind_conflict()Eric Dumazet
inet_csk_bind_conflict() can access sk->sk_bound_dev_if for unlocked sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16dccp: use READ_ONCE() to read sk->sk_bound_dev_ifEric Dumazet
When reading listener sk->sk_bound_dev_if locklessly, we must use READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_ifEric Dumazet
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val), because other cpus/threads might locklessly read this field. sock_getbindtodevice(), sock_getsockopt() need READ_ONCE() because they run without socket lock held. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16tcp: sk->sk_bound_dev_if once in inet_request_bound_dev_if()Eric Dumazet
inet_request_bound_dev_if() reads sk->sk_bound_dev_if twice while listener socket is not locked. Another cpu could change this field under us. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16sctp: read sk->sk_bound_dev_if once in sctp_rcv()Eric Dumazet
sctp_rcv() reads sk->sk_bound_dev_if twice while the socket is not locked. Another cpu could change this field under us. Fixes: 0fd9a65a76e8 ("[SCTP] Support SO_BINDTODEVICE socket option on incoming packets.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: annotate races around sk->sk_bound_dev_ifEric Dumazet
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while this field can be changed by another thread. Adds minimal annotations to avoid KCSAN splats for UDP. Following patches will add more annotations to potential lockless readers. BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0: __ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576 __sys_connect_file net/socket.c:1900 [inline] __sys_connect+0x197/0x1b0 net/socket.c:1917 __do_sys_connect net/socket.c:1927 [inline] __se_sys_connect net/socket.c:1924 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1924 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1: udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffff9b Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 I chose to not add Fixes: tag because race has minor consequences and stable teams busy enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge branch 'big-tcp'David S. Miller
Eric Dumazet says: ==================== tcp: BIG TCP implementation This series implements BIG TCP as presented in netdev 0x15: https://netdevconf.info/0x15/session.html?BIG-TCP Jonathan Corbet made a nice summary: https://lwn.net/Articles/884104/ Standard TSO/GRO packet limit is 64KB With BIG TCP, we allow bigger TSO/GRO packet sizes for IPv6 traffic. Note that this feature is by default not enabled, because it might break some eBPF programs assuming TCP header immediately follows IPv6 header. While tcpdump recognizes the HBH/Jumbo header, standard pcap filters are unable to skip over IPv6 extension headers. Reducing number of packets traversing networking stack usually improves performance, as shown on this experiment using a 100Gbit NIC, and 4K MTU. 'Standard' performance with current (74KB) limits. for i in {1..10}; do ./netperf -t TCP_RR -H iroa23 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done 77 138 183 8542.19 79 143 178 8215.28 70 117 164 9543.39 80 144 176 8183.71 78 126 155 9108.47 80 146 184 8115.19 71 113 165 9510.96 74 113 164 9518.74 79 137 178 8575.04 73 111 171 9561.73 Now enable BIG TCP on both hosts. ip link set dev eth0 gro_max_size 185000 gso_max_size 185000 for i in {1..10}; do ./netperf -t TCP_RR -H iroa23 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done 57 83 117 13871.38 64 118 155 11432.94 65 116 148 11507.62 60 105 136 12645.15 60 103 135 12760.34 60 102 134 12832.64 62 109 132 10877.68 58 82 115 14052.93 57 83 124 14212.58 57 82 119 14196.01 We see an increase of transactions per second, and lower latencies as well. v7: adopt unsafe_memcpy() in mlx5 to avoid FORTIFY warnings. v6: fix a compilation error for CONFIG_IPV6=n in "net: allow gso_max_size to exceed 65536", reported by kernel bots. v5: Replaced two patches (that were adding new attributes) with patches from Alexander Duyck. Idea is to reuse existing gso_max_size/gro_max_size v4: Rebased on top of Jakub series (Merge branch 'tso-gso-limit-split') max_tso_size is now family independent. v3: Fixed a typo in RFC number (Alexander) Added Reviewed-by: tags from Tariq on mlx4/mlx5 parts. v2: Removed the MAX_SKB_FRAGS change, this belongs to a different series. Addressed feedback, for Alexander and nvidia folks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16mlx5: support BIG TCP packetsEric Dumazet
mlx5 supports LSOv2. IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header with JUMBO TLV for big packets. We need to ignore/skip this HBH header when populating TX descriptor. Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only. v7: adopt unsafe_memcpy() and MLX5_UNSAFE_MEMCPY_DISCLAIMER v2: clear hopbyhop in mlx5e_tx_get_gso_ihs() v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16mlx4: support BIG TCP packetsEric Dumazet
mlx4 supports LSOv2 just fine. IPv6 stack inserts a temporary Hop-by-Hop header with JUMBO TLV for big packets. We need to ignore the HBH header when populating TX descriptor. Tested: Before: (not enabling bigger TSO/GRO packets) ip link set dev eth0 gso_max_size 65536 gro_max_size 65536 netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000 MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 262144 540000 70000 70000 10.00 6591.45 0.86 1.34 62.490 97.446 262144 540000 After: (enabling bigger TSO/GRO packets) ip link set dev eth0 gso_max_size 185000 gro_max_size 185000 netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000 MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 262144 540000 70000 70000 10.00 8383.95 0.95 1.01 54.432 57.584 262144 540000 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16veth: enable BIG TCP packetsEric Dumazet
Set the TSO driver limit to GSO_MAX_SIZE (512 KB). This allows the admin/user to set a GSO limit up to this value. ip link set dev veth10 gso_max_size 200000 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: loopback: enable BIG TCP packetsEric Dumazet
Set the driver limit to GSO_MAX_SIZE (512 KB). This allows the admin/user to set a GSO limit up to this value. Tested: ip link set dev lo gso_max_size 200000 netperf -H ::1 -t TCP_RR -l 100 -- -r 80000,80000 & tcpdump shows : 18:28:42.962116 IP6 ::1 > ::1: HBH 40051 > 63780: Flags [P.], seq 3626480001:3626560001, ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000 18:28:42.962138 IP6 ::1.63780 > ::1.40051: Flags [.], ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 0 18:28:42.962152 IP6 ::1 > ::1: HBH 63780 > 40051: Flags [P.], seq 3626560001:3626640001, ack 3626560001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000 18:28:42.962157 IP6 ::1.40051 > ::1.63780: Flags [.], ack 3626640001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 0 18:28:42.962180 IP6 ::1 > ::1: HBH 40051 > 63780: Flags [P.], seq 3626560001:3626640001, ack 3626640001, win 17743, options [nop,nop,TS val 3771179265 ecr 3771179265], length 80000 18:28:42.962214 IP6 ::1.63780 > ::1.40051: Flags [.], ack 3626640001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179265], length 0 18:28:42.962228 IP6 ::1 > ::1: HBH 63780 > 40051: Flags [P.], seq 3626640001:3626720001, ack 3626640001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179265], length 80000 18:28:42.962233 IP6 ::1.40051 > ::1.63780: Flags [.], ack 3626720001, win 17743, options [nop,nop,TS val 3771179266 ecr 3771179266], length 0 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6: Add hop-by-hop header to jumbograms in ip6_outputCoco Li
Instead of simply forcing a 0 payload_len in IPv6 header, implement RFC 2675 and insert a custom extension header. Note that only TCP stack is currently potentially generating jumbograms, and that this extension header is purely local, it wont be sent on a physical link. This is needed so that packet capture (tcpdump and friends) can properly dissect these large packets. Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gro_max_size to exceed 65536Alexander Duyck
Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6/gro: insert temporary HBH/jumbo headerEric Dumazet
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build BIG TCP ipv6 packets (bigger than 64K). This patch changes ipv6_gro_complete() to insert a HBH/jumbo header so that resulting packet can go through IPv6/TCP stacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6/gso: remove temporary HBH/jumbo headerEric Dumazet
ipv6 tcp and gro stacks will soon be able to build big TCP packets, with an added temporary Hop By Hop header. If GSO is involved for these large packets, we need to remove the temporary HBH header before segmentation happens. v2: perform HBH removal from ipv6_gso_segment() instead of skb_segment() (Alexander feedback) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6: add struct hop_jumbo_hdr definitionEric Dumazet
Following patches will need to add and remove local IPv6 jumbogram options to enable BIG TCP. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16tcp_cubic: make hystart_ack_delay() aware of BIG TCPEric Dumazet
hystart_ack_delay() had the assumption that a TSO packet would not be bigger than GSO_MAX_SIZE. This will no longer be true. We should use sk->sk_gso_max_size instead. This reduces chances of spurious Hystart ACK train detections. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: limit GSO_MAX_SIZE to 524280 bytesEric Dumazet
Make sure we will not overflow shinfo->gso_segs Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs is a 16bit field. TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h, it seems cleaner to not bring tcp details into include/linux/netdevice.h Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gso_max_size to exceed 65536Alexander Duyck
The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: add IFLA_TSO_{MAX_SIZE|SEGS} attributesEric Dumazet
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS are used to report to user-space the device TSO limits. ip -d link sh dev eth1 ... tso_max_size 65536 tso_max_segs 65535 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge branch 'Renesas-RSZ-V2M-support'David S. Miller
Phil Edworthy says: ==================== Add Renesas RZ/V2M Ethernet support The RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though some small parts are the same as R-Car Gen2. Other differences are: * It has separate data (DI), error (Line 1) and management (Line 2) irqs rather than one irq for all three. * Instead of using the High-speed peripheral bus clock for gPTP, it has a separate gPTP reference clock. v4: * Add clk_disable_unprepare() for gptp ref clk v3: * Really renamed irq_en_dis_regs to irq_en_dis this time * Modified ravb_ptp_extts() to use irq_en_dis * Added Reviewed-by tags v2: * Just net patches in this series * Instead of reusing ch22 and ch24 interrupt names, use the proper names * Renamed irq_en_dis_regs to irq_en_dis * Squashed use of GIC reg versus GIE/GID and got rid of separate gptp_ptm_gic feature. * Move err_mgmt_irqs code under multi_irqs * Minor editing of the commit msgs ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Add support for RZ/V2MPhil Edworthy
RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though some small parts are the same as R-Car Gen2. Other differences to R-Car Gen3 and Gen2 are: * It has separate data (DI), error (Line 1) and management (Line 2) irqs rather than one irq for all three. * Instead of using the High-speed peripheral bus clock for gPTP, it has a separate gPTP reference clock. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Use separate clock for gPTPPhil Edworthy
RZ/V2M has a separate gPTP reference clock that is used when the AVB-DMAC Mode Register (CCC) gPTP Clock Select (CSEL) bits are set to "01: High-speed peripheral bus clock". Therefore, add a feature that allows this clock to be used for gPTP. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Support separate Line0 (Desc), Line1 (Err) and Line2 (Mgmt) irqsPhil Edworthy
R-Car has a combined interrupt line, ch22 = Line0_DiA | Line1_A | Line2_A. RZ/V2M has separate interrupt lines for each of these, so add a feature that allows the driver to get these interrupts and call the common handler. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>