Age | Commit message | Author
2021-03-12 | selftests: fib_nexthops: Declutter test output | Ido Schimmel
Before:

 # ./fib_nexthops.sh -t ipv4_torture
 IPv4 runtime torture
 --------------------
 TEST: IPv4 torture test [ OK ]
 ./fib_nexthops.sh: line 213: 19376 Killed ipv4_del_add_loop1
 ./fib_nexthops.sh: line 213: 19377 Killed ipv4_grp_replace_loop
 ./fib_nexthops.sh: line 213: 19378 Killed ip netns exec me ping -f 172.16.101.1 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 19380 Killed ip netns exec me ping -f 172.16.101.2 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 19381 Killed ip netns exec me mausezahn veth1 -B 172.16.101.2 -A 172.16.1.1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1
 Tests passed: 1
 Tests failed: 0

 # ./fib_nexthops.sh -t ipv6_torture
 IPv6 runtime torture
 --------------------
 TEST: IPv6 torture test [ OK ]
 ./fib_nexthops.sh: line 213: 24453 Killed ipv6_del_add_loop1
 ./fib_nexthops.sh: line 213: 24454 Killed ipv6_grp_replace_loop
 ./fib_nexthops.sh: line 213: 24456 Killed ip netns exec me ping -f 2001:db8:101::1 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 24457 Killed ip netns exec me ping -f 2001:db8:101::2 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 24458 Killed ip netns exec me mausezahn -6 veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1
 Tests passed: 1
 Tests failed: 0

After:

 # ./fib_nexthops.sh -t ipv4_torture
 IPv4 runtime torture
 --------------------
 TEST: IPv4 torture test [ OK ]
 Tests passed: 1
 Tests failed: 0

 # ./fib_nexthops.sh -t ipv6_torture
 IPv6 runtime torture
 --------------------
 TEST: IPv6 torture test [ OK ]
 Tests passed: 1
 Tests failed: 0

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | netdevsim: Allow reporting activity on nexthop buckets | Ido Schimmel
A key component of the resilient hashing algorithm is the hash buckets' activity. If a bucket is active, it will not be populated with a new nexthop in order not to break existing flows. Therefore, in order to easily and thoroughly test the algorithm, we need to be in full control over the reported activity.

Add a debugfs interface that allows user space to have netdevsim report a nexthop bucket within a resilient nexthop group as active. For example:

 # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

Will mark bucket 23 in nexthop group 10 as active.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
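The rule described above (an active bucket is not repopulated) can be illustrated with a small user-space sketch. This is not netdevsim or kernel code: `struct bucket`, its fields, and `migrate_idle_buckets()` are hypothetical names invented for illustration, and the real resilient hashing logic involves more state than a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one hash bucket in a resilient nexthop group. */
struct bucket {
	int nh_id;	/* nexthop currently occupying the bucket */
	bool active;	/* reported as active, i.e. recently carried traffic */
};

/*
 * Try to move every bucket to new_nh_id. Buckets reported as active are
 * skipped so that existing flows keep hashing to the same nexthop.
 * Returns the number of buckets that were repopulated.
 */
size_t migrate_idle_buckets(struct bucket *buckets, size_t n, int new_nh_id)
{
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		if (buckets[i].active)
			continue;	/* do not break existing flows */
		buckets[i].nh_id = new_nh_id;
		moved++;
	}
	return moved;
}
```

In this model, marking a bucket as active before calling `migrate_idle_buckets()` plays the role that the debugfs write plays for netdevsim: it lets a test decide exactly which buckets must be left untouched.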
2021-03-12 | netdevsim: Add support for resilient nexthop groups | Ido Schimmel
Allow resilient nexthop groups to be programmed and account their occupancy according to their number of buckets. The nexthop group itself as well as its buckets are marked with hardware flags (i.e., 'RTNH_F_TRAP').

Replacement of a single nexthop bucket can fail using the following debugfs knob:

 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
 N
 # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
 Y

Replacement of a resilient nexthop group can fail using the following debugfs knob:

 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
 N
 # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
 Y

This enables testing of various error paths.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | netdevsim: Create a helper for setting nexthop hardware flags | Ido Schimmel
Instead of calling nexthop_set_hw_flags(), call a helper. It will be used to also set nexthop bucket flags in a subsequent patch. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | netdevsim: fib: Introduce a lock to guard nexthop hashtable | Petr Machata
Currently netdevsim relies on RTNL to maintain exclusivity in accessing the nexthop hash table. However, bucket notification may be called without RTNL having been held. Instead, introduce a custom lock to guard the table. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | Merge branch 'ptp-warnings' | David S. Miller
Lee Jones says: ==================== Rid W=1 warnings from PTP This set is part of a larger effort attempting to clean-up W=1 kernel builds, which are currently overwhelmingly riddled with niggly little warnings. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | ptp: ptp_p: Demote non-conformant kernel-doc headers and supply a param description | Lee Jones
Fixes the following W=1 kernel build warning(s):

 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'control' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'event' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'addend' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'accum' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'test' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_compare' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rsystime_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rsystime_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'systime_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'systime_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'trgt_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'trgt_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'asms_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'asms_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'amms_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'amms_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ch_control' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ch_event' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'tx_snap_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'tx_snap_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rx_snap_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rx_snap_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'src_uuid_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'src_uuid_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_status' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_snap_lo' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_snap_hi' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_sel' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_st' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'reserve1' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'stl_max_set_en' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'stl_max_set' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'reserve2' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'srst' not described in 'pch_ts_regs'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'regs' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'ptp_clock' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'caps' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'exts0_enabled' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'exts1_enabled' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'mem_base' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'mem_size' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'irq' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'pdev' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'register_lock' not described in 'pch_dev'
 drivers/ptp/ptp_pch.c:128: warning: Function parameter or member 'station' not described in 'pch_params'
 drivers/ptp/ptp_pch.c:291: warning: Function parameter or member 'pdev' not described in 'pch_set_station_address'

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: LAPIS SEMICONDUCTOR <tshimizu818@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | ptp: ptp_clockmatrix: Demote non-kernel-doc header to standard comment | Lee Jones
Fixes the following W=1 kernel build warning(s): drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds Cc: Richard Cochran <richardcochran@gmail.com> Cc: IDT-support-1588@lm.renesas.com Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | ptp_pch: Move 'pch_*()' prototypes to shared header | Lee Jones
Fixes the following W=1 kernel build warning(s): drivers/ptp/ptp_pch.c:193:6: warning: no previous prototype for ‘pch_ch_control_write’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:201:5: warning: no previous prototype for ‘pch_ch_event_read’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:212:6: warning: no previous prototype for ‘pch_ch_event_write’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:220:5: warning: no previous prototype for ‘pch_src_uuid_lo_read’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:231:5: warning: no previous prototype for ‘pch_src_uuid_hi_read’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:242:5: warning: no previous prototype for ‘pch_rx_snap_read’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:259:5: warning: no previous prototype for ‘pch_tx_snap_read’ [-Wmissing-prototypes] drivers/ptp/ptp_pch.c:300:5: warning: no previous prototype for ‘pch_set_station_address’ [-Wmissing-prototypes] Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT) Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Flavio Suligoi <f.suligoi@asem.it> Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | ptp_pch: Remove unused function 'pch_ch_control_read()' | Lee Jones
Fixes the following W=1 kernel build warning(s): drivers/ptp/ptp_pch.c:182:5: warning: no previous prototype for ‘pch_ch_control_read’ [-Wmissing-prototypes] Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT) Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Flavio Suligoi <f.suligoi@asem.it> Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net: dsa: bcm_sf2: setup BCM4908 internal crossbar | Rafał Miłecki
On some SoCs (e.g. BCM4908, BCM631[345]8) SF2 has an integrated crossbar. It allows connecting its selected external ports to internal ports. It's used by vendors to handle custom Ethernet setups.

BCM4908 has the following 3x2 crossbar. On Asus GT-AC5300 rgmii is used for connecting an external BCM53134S switch. GPHY4 is usually used for the WAN port. More fancy devices use SerDes for 2.5 Gbps Ethernet.

              ┌──────────┐
SerDes ─── 0 ─┤          │
              │   3x2    ├─ 0 ─── switch port 7
 GPHY4 ─── 1 ─┤          │
              │ crossbar ├─ 1 ─── runner (accelerator)
 rgmii ─── 2 ─┤          │
              └──────────┘

Use setup data based on DT info to configure BCM4908's switch port 7. Right now only GPHY and rgmii variants are supported. Handling SerDes can be implemented later.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net: dsa: bcm_sf2: store PHY interface/mode in port structure | Rafał Miłecki
It's needed later for proper switch / crossbar setup. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net: ipv4: route.c: Fix indentation of multi line comment. | Shubhankar Kuranagatti
All comment lines inside the comment block have been aligned. Every line of comment starts with a * (uniformity in code). Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net: broadcom: bcm4908_enet: support TX interrupt | Rafał Miłecki
It appears that each DMA channel has its own interrupt and both rings can be configured (the same way) to handle interrupts.

1. Make ring interrupts code generic (make it operate on given ring)
2. Move napi to ring (so each has its own)
3. Make IRQ handler generic (match ring against received IRQ number)
4. Add (optional) support for TX interrupt

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
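Step 3 above (matching a ring against the received IRQ number) can be sketched in plain C as follows. The structures and the `ring_for_irq()` helper are hypothetical and only model the idea; they are not taken from the bcm4908_enet driver. The sketch assumes a non-positive `irq` value means that the optional TX interrupt was not provided.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-ring state: each DMA channel owns its IRQ number. */
struct ring {
	int irq;		/* <= 0 means "no interrupt for this ring" */
	const char *name;
};

/* Hypothetical device state with one RX and one TX ring. */
struct enet {
	struct ring rx_ring;
	struct ring tx_ring;	/* TX interrupt is optional */
};

/*
 * Generic lookup used by a shared handler: return the ring that owns
 * the given IRQ number, or NULL if no ring matches.
 */
struct ring *ring_for_irq(struct enet *enet, int irq)
{
	if (enet->rx_ring.irq > 0 && irq == enet->rx_ring.irq)
		return &enet->rx_ring;
	if (enet->tx_ring.irq > 0 && irq == enet->tx_ring.irq)
		return &enet->tx_ring;
	return NULL;
}
```

A single handler registered for both interrupts could then call `ring_for_irq()` first and operate only on the ring it returns, which is what makes the rest of the interrupt code ring-agnostic.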
2021-03-12 | dt-bindings: net: bcm4908-enet: add optional TX interrupt | Rafał Miłecki
I discovered that hardware actually supports two interrupts, one per DMA channel (RX and TX). Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | Merge branch 'macb-fixed-link-fixes' | David S. Miller
Robert Hancock says: ==================== macb SGMII fixed-link fixes Some fixes to the macb driver for use in SGMII mode with a fixed-link (such as for chip-to-chip connectivity). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net: macb: Disable PCS auto-negotiation for SGMII fixed-link mode | Robert Hancock
When using a fixed-link configuration in SGMII mode, it's not really sensible to have auto-negotiation enabled since the link settings are fixed by definition. In other configurations, such as an SGMII connection to a PHY, it should generally be enabled. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net: macb: poll for fixed link state in SGMII mode | Robert Hancock
When using a fixed-link configuration with GEM in SGMII mode, such as for a chip-to-chip interconnect, the link state was always showing as established regardless of the actual connectivity state. We can monitor the pcs_link_state bit in the Network Status register to determine whether the PCS link state is actually up. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | Merge tag 'mlx5-updates-2021-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux | David S. Miller
Saeed Mahameed says:

====================
mlx5-updates-2021-03-12

1) TC support for ICMP parameters
2) TC connection tracking with mirroring
3) A round of trivial fixups and cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-12 | net/mlx5e: Allow to match on ICMP parameters | Maor Dickman
Support matching on ICMPv4/6 type and code parameters using misc3 section of match parameters. Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: CT: Add support for mirroring | Paul Blakey
Add support for mirroring before the CT action by splitting the pre ct rule. Mirror outputs are done first on the tc chain,prio table rule (the fwd rule), which will then forward to a per port fwd table. On this fwd table, we insert the original pre ct rule that forwards to ct/ct nat table. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: Display the command index in command mailbox dump | Alaa Hleihel
Multiple commands can be printed at the same time which can lead to wrong order of their lines in dmesg output. As a result, it's hard to match data dumps to the correct command or which command was fully dumped at some point. Fix this by displaying the corresponding command index, and also indicate when a command was fully dumped. Signed-off-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5e: allocate 'indirection_rqt' buffer dynamically | Arnd Bergmann
Increasing the size of the indirection_rqt array from 128 to 256 bytes pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function over the warning limit when building with clang and CONFIG_KASAN: drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=] Using dynamic allocation here is safe because the caller does the same, and it reduces the stack usage of the function to just a few bytes. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
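The general pattern behind this change, replacing a large on-stack array with a heap buffer that is freed before returning, can be sketched in user-space C. The function below is a hypothetical illustration, not the mlx5e code: `calloc()`/`free()` stand in for the kernel allocators, and `INDIR_TABLE_SIZE` and the round-robin fill are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define INDIR_TABLE_SIZE 256	/* hypothetical number of table entries */

/*
 * Build a round-robin indirection table in a heap buffer instead of a
 * large local array, so the stack frame stays small.
 * Returns the number of entries that point at channel 0,
 * or -1 on bad input or allocation failure.
 */
int count_channel0_entries(uint32_t num_channels)
{
	uint32_t *table;
	int count = 0;

	if (num_channels == 0)
		return -1;

	table = calloc(INDIR_TABLE_SIZE, sizeof(*table));
	if (!table)
		return -1;

	for (size_t i = 0; i < INDIR_TABLE_SIZE; i++)
		table[i] = (uint32_t)(i % num_channels);

	for (size_t i = 0; i < INDIR_TABLE_SIZE; i++)
		if (table[i] == 0)
			count++;

	free(table);	/* every exit path after allocation must free */
	return count;
}
```

The cost of this approach is an extra failure path (allocation can fail), which is acceptable when, as the commit message notes, the caller already allocates dynamically.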
2021-03-12 | net/mlx5e: Dump ICOSQ WQE descriptor on CQE with error events | Tariq Toukan
Dump the ICOSQ's WQE descriptor when a completion with error is received. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath | Maxim Mikityanskiy
Commit e20f0dbf204f ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e. In the same time frame, commit 5af75c747e2a ("net/mlx5e: Enhanced TX MPWQE for SKBs") added one more usage of prefetchw. When these two changes were merged, this new occurrence of prefetchw wasn't replaced with net_prefetchw. This commit fixes this last occurrence of prefetchw in mlx5e_tx_mpwqe_session_start, making the same change that was done in mlx5e_xdp_mpwqe_session_start. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5e: Remove redundant newline in NL_SET_ERR_MSG_MOD | Roi Dayan
Fix the following coccicheck warnings: drivers/net/ethernet/mellanox/mlx5/core/devlink.c:145:29-66: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD drivers/net/ethernet/mellanox/mlx5/core/devlink.c:140:29-77: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: Read congestion counters from all ports when lag is active | Mark Zhang
Read congestion counters from all ports in any lag mode rather than only in RoCE lag mode (e.g., VF lag). Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: remove unneeded semicolon | Jiapeng Chong
Fix the following coccicheck warnings: ./drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c:495:2-3: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: use kvfree() for memory allocated with kvzalloc() | Junlin Yang
It is allocated with kvzalloc(), the corresponding release function should not be kfree(), use kvfree() instead. Generated by: scripts/coccinelle/api/kfree_mismatch.cocci Signed-off-by: Junlin Yang <yangjunlin@yulong.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: DR, Add missing vhca_id consume from STEv1 | Yevgeny Kliteynik
The field source_eswitch_owner_vhca_id was not consumed in the same way as in STEv0. Added the missing set. Fixes: 10b694186410 ("net/mlx5: DR, Add HW STEv1 match logic") Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: DR, Remove unneeded rx_decap_l3 function for STEv1 | Yevgeny Kliteynik
Remove the dr_ste_v1_set_rx_decap_l3 function that was replaced by another function - fixing a rebase error. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | net/mlx5: DR, Fixed typo in STE v0 | Yevgeny Kliteynik
"reforamt" -> "reformat" Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-12 | docs: networking: phy: Improve placement of parenthesis | Jonathan Neuschäfer
"either" is outside the parentheses, so the matching "or" should be too. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | Merge branch 'tcp-delayed-completions' | David S. Miller
Eric Dumazet says:

====================
tcp: better deal with delayed TX completions

Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters).

While the problems have been there forever, the second patch might introduce some regressions, so I prefer not to backport them to stable releases before things settle.

Many thanks to FB team for their help and tests.

A few packetdrill tests need to be changed to reflect the improvements brought by this series.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | tcp: remove obsolete check in __tcp_retransmit_skb() | Eric Dumazet
TSQ provides a nice way to avoid bufferbloat on individual sockets, including retransmit packets. We can get rid of the old heuristic:

    /* Do not sent more than we queued. 1/4 is reserved for possible
     * copying overhead: fragmentation, tunneling, mangling etc.
     */
    if (refcount_read(&sk->sk_wmem_alloc) >
        min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
              sk->sk_sndbuf))
            return -EAGAIN;

This heuristic was giving false positives according to Jakub, whenever TX completions are delayed above RTT. (Ack packets are processed by the TCP stack before clones are orphaned/freed.)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
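To make the arithmetic of the removed heuristic concrete, here is a user-space restatement of the condition quoted above. It is only a sketch: the function name is hypothetical, plain `uint32_t` parameters replace the socket fields `sk_wmem_alloc`, `sk_wmem_queued` and `sk_sndbuf`, and it returns a boolean instead of `-EAGAIN`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the old check would have refused the retransmit:
 * memory still charged to the socket exceeds the queued bytes plus a
 * 25% allowance, with that limit capped at the send buffer size.
 */
bool old_check_blocks_retransmit(uint32_t wmem_alloc, uint32_t wmem_queued,
				 uint32_t sndbuf)
{
	uint32_t limit = wmem_queued + (wmem_queued >> 2);

	if (limit > sndbuf)
		limit = sndbuf;
	return wmem_alloc > limit;
}
```

With 4000 bytes queued and a large send buffer the limit is 5000 bytes, so clones that are still charged to the socket because their TX completions are late can push `wmem_alloc` past the limit even though nothing is wrong, which matches the false positives described in the commit message.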
2021-03-11 | tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack() | Eric Dumazet
Jakub reported that data included in a Fastopen SYN that had to be retransmitted would have to wait for an RTO if TX completions are slow, even with the prior fix.

This is because tcp_rcv_fastopen_synack() does not use standard rtx logic, meaning the TSQ handler exits early in tcp_tsq_write() because tp->lost_out == tp->retrans_out.

Let's make tcp_rcv_fastopen_synack() use standard rtx logic, by using tcp_mark_skb_lost() on the skb that needs to be sent again.

Note this raised a warning in tcp_fastretrans_alert() during my tests, since we consider that the data not being acknowledged by the receiver does not mean the packet was lost on the network.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | tcp: plug skb_still_in_host_queue() to TSQ | Eric Dumazet
Jakub and Neil reported an increase of RTO timers whenever TX completions are delayed a bit more (by increasing NIC TX coalescing parameters).

The main issue is that the TCP stack has logic preventing a packet from being retransmitted if the prior clone has not yet been orphaned or freed. This logic came with commit 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues").

Thankfully, in the case skb_still_in_host_queue() detects the initial clone is still in flight, it can use TSQ logic that will eventually retry later, at the moment the clone is freed or orphaned.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neil Spring <ntspring@fb.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | isdn: remove extra spaces in the header file | Tong Zhang
fix some coding style issues in the isdn header Signed-off-by: Tong Zhang <ztong0001@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | tipc: clean up warnings detected by sparse | Hoang Huu Le
This patch fixes the following warning from sparse:

 net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types)
 net/tipc/monitor.c:263:35: expected unsigned int
 net/tipc/monitor.c:263:35: got restricted __be32 [usertype]
 [...]
 net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit
 net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock
 net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit
 net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock
 net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock
 [...]
 net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces)
 net/tipc/crypto.c:1201:9: expected struct tipc_aead [noderef] __rcu *__tmp
 net/tipc/crypto.c:1201:9: got struct tipc_aead *
 [...]

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | tipc: convert dest node's address to network order | Hoang Le
(struct tipc_link_info)->dest is in network order (__be32), so we must convert the value to network order before assigning. The problem detected by sparse:

 net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
 net/tipc/netlink_compat.c:699:24: expected restricted __be32 [usertype] dest
 net/tipc/netlink_compat.c:699:24: got int

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
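The kind of fix described here, converting a host-order value before storing it in a field defined as network order, looks roughly like this in user-space C. The struct and function are hypothetical stand-ins for illustration; only `htonl()` and `ntohl()` are real (POSIX) calls, and the kernel uses its own byte-order helpers and the `__be32` annotation that sparse checks.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a structure with a big-endian address field. */
struct link_info {
	uint32_t dest;	/* stored in network byte order */
};

/* Store a host-order node address into the network-order field. */
void link_info_set_dest(struct link_info *info, uint32_t node_addr_host)
{
	info->dest = htonl(node_addr_host);
}
```

On a big-endian machine `htonl()` is a no-op, which is why this class of bug tends to go unnoticed at run time and is usually found by static type checking instead.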
2021-03-11 | Merge branch 'mlxsw-Implement-sampling-using-mirroring' | David S. Miller
Ido Schimmel says:

====================
mlxsw: Implement sampling using mirroring

So far, sampling was implemented using a dedicated sampling mechanism that is available on all Spectrum ASICs. Spectrum-2 and later ASICs support sampling by mirroring packets to the CPU port with probability. This method has a couple of advantages compared to the legacy method:

* Extra metadata per-packet: Egress port, egress traffic class, traffic class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow as opposed to only ingress

This series should not result in any user-visible changes and its aim is to convert Spectrum-2 and later ASICs to perform sampling by mirroring to the CPU port with probability. Future submissions will expose the additional metadata and enable sampling using more triggers (e.g., egress).

Series overview:

Patches #1-#3 extend the SPAN (mirroring) module to accept new parameters required for sampling. See individual commit messages for detailed explanation.

Patches #4-#5 split sampling support between Spectrum-1 and later ASICs while still using the legacy method for all ASIC generations.

Patch #6 converts Spectrum-2 and later ASICs to perform sampling by mirroring to the CPU port with probability.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | mlxsw: spectrum_matchall: Implement sampling using mirroring | Ido Schimmel
Spectrum-2 and later ASICs support sampling of packets by mirroring to the CPU with probability. There are several advantages compared to the legacy dedicated sampling mechanism:

* Extra metadata per-packet: Egress port, egress traffic class, traffic class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow

Convert Spectrum-2 and later ASICs to perform sampling by mirroring to the CPU with probability. Subsequent patches will add support for egress / per-flow sampling and expose the extra metadata.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | mlxsw: spectrum_trap: Split sampling traps between ASICs | Ido Schimmel
Sampling of ingress packets is supported using a dedicated sampling mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs support more sophisticated sampling by mirroring packets to the CPU. As a preparation for more advanced sampling configurations, split the trap configuration used for sampled packets between Spectrum-1 and later ASICs. This is needed since packets that are mirrored to the CPU are trapped via a different trap identifier compared to packets that are sampled using the dedicated sampling mechanism. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 | mlxsw: spectrum_matchall: Split sampling support between ASICs | Ido Schimmel
Sampling of ingress packets is supported using a dedicated sampling mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs support more sophisticated sampling by mirroring packets to the CPU. As a preparation for more advanced sampling configurations, split the sampling operations between Spectrum-1 and later ASICs. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 mlxsw: spectrum_span: Add SPAN probability rate support Ido Schimmel
Currently, every packet that matches a mirroring trigger (e.g., received packets, buffer dropped packets) is mirrored. Spectrum-2 and later ASICs support mirroring with probability, where every 1 in N matched packets is mirrored. Extend the API that creates the binding between the trigger and the SPAN agent with a probability rate parameter, which is an attribute of the trigger. Set it to '1' to maintain existing behavior. Subsequent patches will use it to perform more sophisticated sampling, by mirroring packets to the CPU with probability. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
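The "1 in N" behavior described above can be illustrated with a small user-space sketch. It is not the driver or hardware implementation: the function name `should_mirror` and the use of a pseudo-random draw are assumptions made purely for illustration, and the commit does not say how the ASIC picks the mirrored packet.

```python
import random


def should_mirror(rate: int, rng: random.Random) -> bool:
    """Decide whether one matched packet is mirrored.

    Hypothetical helper: on average 1 in `rate` matched packets is
    mirrored. A rate of 1 mirrors every packet, which corresponds to
    the existing behavior the commit preserves.
    """
    if rate < 1:
        raise ValueError("probability rate must be at least 1")
    # randrange(rate) is uniform over 0..rate-1, so it is 0 with
    # probability 1/rate; randrange(1) is always 0.
    return rng.randrange(rate) == 0
```

With a rate of 1 the function always returns True, so callers that do not care about sampling keep mirroring every matched packet.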
2021-03-11 mlxsw: reg: Extend mirroring registers with probability rate field Ido Schimmel
The MPAR and MPAGR registers are used to configure the binding between the mirroring trigger (e.g., received packet) and the SPAN agent. Add probability rate field, which will allow us to support sampling by mirroring to the CPU. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 mlxsw: spectrum_span: Add SPAN session identifier support Ido Schimmel
When packets are mirrored to the CPU, the trap identifier with which the packets are trapped is determined according to the session identifier of the SPAN agent performing the mirroring. Packets that are trapped for the same logical reason (e.g., buffer drops) should use the same session identifier. Currently, a single session is implicitly supported (identifier 0) and is used for packets that are mirrored to the CPU due to buffer drops (e.g., early drop). Subsequent patches are going to mirror packets to the CPU due to sampling, which will require a different session identifier. Prepare for that by making the session identifier an attribute of the SPAN agent. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 Merge tag 'mlx5-updates-2021-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux David S. Miller
Saeed Mahameed says:
====================
This series provides some cleanups to mlx5 driver. For more information please see tag log below. Please pull and let me know if there is any problem.
mlx5-updates-2021-03-11
Cleanups for mlx5 driver
1) Fix build warnings from Arnd and Vlad
2) Leon improves locking for driver load/unload flows
3) From Roi, Lockdep false dependency warning
4) Other trivial cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 Merge branch 'nexthop-Resilient-next-hop-groups' David S. Miller
Petr Machata says:

====================
nexthop: Resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group. Mpath groups implement the hash-threshold algorithm, described in RFC 2992[1].

To select a next hop, hash-threshold algorithm first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of hash space from one next hop to another. RFC 2992 illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1, and 4 in favor of 5. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to.

If a multipath group is used for load-balancing between multiple servers, this hash space reassignment causes an issue that packets from a single flow suddenly end up arriving at a server that does not expect them, which may lead to TCP reset.

If a multipath group is used for load-balancing among available paths to the same server, the issue is that different latencies and reordering along the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problem. Resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation on the SKB hash to choose a hash table bucket, then reads the next hop that this bucket contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous. With a hash table, mapping between the hash table buckets and the individual next hops is arbitrary. Therefore when a next hop is deleted the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                              v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows are ideally kept being forwarded to the same endpoints through the same paths as before the next-hop group change.

This patch set adds the implementation of resilient next-hop groups.

In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that has fewer buckets than it wants (it is "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled to a future time.

Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and for whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle ones.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
        buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets, each bucket will be considered idle when no traffic hits it for at least 60 seconds, and if the table remains out of balance for 300 seconds, it will be forcefully brought into balance.

If not present in netlink message, the idle timer defaults to 120 seconds, and there is no unbalanced timer, meaning the group may remain unbalanced indefinitely. The value of 120 is the default in Cumulus implementation of resilient next-hop groups. To a degree the default is arbitrary, the only value that certainly does not make sense is 0. Therefore going with an existing deployed implementation is reasonable.

Unbalanced time, i.e. how long since the last time that all nexthops had as many buckets as they should according to their weights, is reported when the group is dumped:

 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

When replacing next hops or changing weights, if one does not specify some parameters, their value is left as it was:

 # ip nexthop replace id 10 group 1,2/2 type resilient
 # ip nexthop show id 10
 id 10 group 1,2/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

It is also possible to do a dump of individual buckets (and now you know why there were only 8 of them in the example above):

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that were migrated after the nexthop replace command to satisfy the new demand that nexthop 1 be given 6 buckets instead of 4.

The patchset proceeds as follows:

- Patches #1 and #2 are small refactoring patches.

- Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is meant to be set for all nexthop groups that in general have several nexthops from which they choose, and avoids a more expensive dispatch based on reading several flags, one for each nexthop group type.

- Patch #4 contains defines of new UAPI attributes and the new next-hop group type. At this point, the nexthop code is made to bounce the new type. As the resilient hashing code is gradually added in the following patch sets, it will remain dead. The last patch will make it accessible.

  This patch also adds a suite of new messages related to next hop buckets. This approach was taken instead of overloading the information on the existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.

  First, a next-hop group can contain a large number of next-hop buckets (4k is not unheard of). This imposes limits on the amount of information that can be encoded for each next-hop bucket given a netlink message is limited to 64k bytes.

  Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this point, in the future it can be extended to provide user space with control over next-hop buckets configuration.

- Patch #5 contains the meat of the resilient next-hop group support.

- Patches #6 and #7 implement support for notifications towards the drivers.

- Patch #8 adds an interface for the drivers to report resilient hash table bucket activity. Drivers will be able to report through this interface whether traffic is hitting a given bucket.

- Patch #9 adds an interface for the drivers to report whether a given hash table bucket is offloaded or trapping traffic.

- In patches #10, #11, #12 and #13, UAPI is implemented. This includes all the code necessary for creation of resilient groups, bucket dumping and getting, and bucket migration notifications.

- In patch #14 the next-hop groups are finally made available.

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (already pushed)
3) Implementation of resilient next-hop groups (this patchset)
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and [3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
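The bucket selection and the "upkeep" pass described in the cover letter above can be sketched in user space as follows. This is a simplified illustration, not the kernel code: the names `wanted_buckets`, `select_next_hop` and `upkeep`, the rounding used to split buckets by weight, and the rule of handing a freed bucket to the lowest-numbered underweight next hop are all assumptions. The forced migration driven by the unbalanced timer is left out.

```python
from typing import Dict, List, Set


def wanted_buckets(weights: Dict[int, int], num_buckets: int) -> Dict[int, int]:
    """Split num_buckets among next hops in proportion to their weights."""
    total = sum(weights.values())
    wants: Dict[int, int] = {}
    cumulative = 0
    assigned = 0
    for nhid in sorted(weights):
        cumulative += weights[nhid]
        # Rounding the cumulative share (an assumed scheme) guarantees
        # that the per-hop counts add up to num_buckets exactly.
        upper = (num_buckets * cumulative + total // 2) // total
        wants[nhid] = upper - assigned
        assigned = upper
    return wants


def select_next_hop(table: List[int], skb_hash: int) -> int:
    """Pick a bucket by modulo on the hash and return the next hop in it."""
    return table[skb_hash % len(table)]


def upkeep(table: List[int], wants: Dict[int, int], idle: Set[int]) -> bool:
    """Migrate buckets toward `wants`; return True once the table is balanced.

    Buckets whose next hop left the group always migrate. Buckets of an
    overweight next hop migrate only if their index is in `idle`.
    """
    counts = {nhid: 0 for nhid in wants}
    for nhid in table:
        if nhid in counts:
            counts[nhid] += 1
    for index, nhid in enumerate(table):
        removed = nhid not in wants
        overweight = not removed and counts[nhid] > wants[nhid]
        if not removed and not (overweight and index in idle):
            continue  # bucket is busy or its next hop is not overweight
        underweight = [n for n in sorted(wants) if counts[n] < wants[n]]
        if not underweight:
            break
        if not removed:
            counts[nhid] -= 1
        counts[underweight[0]] += 1
        table[index] = underweight[0]
    return all(counts[n] == wants[n] for n in wants)
```

Starting from the 20-bucket table in the second diagram and removing next hop 3 (equal weights for 1, 2, 4 and 5), one `upkeep` call rewrites only buckets 8 to 11 and leaves every other bucket, and therefore every flow hashed to it, untouched. A caller that gets False back would schedule another pass later, as the cover letter describes.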
2021-03-11 nexthop: Enable resilient next-hop groups Petr Machata
Now that all the code is in place, stop rejecting requests to create resilient next-hop groups. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>